Demystifying the Black Box of Computational Text Analysis Workflows: From Static Textual Archives to Visualizations and Reports of U.S. Congressional Activity

Computational text analysis workflows are long and complex. Too few scholars know how to critically evaluate the many decisions a researcher makes in preparing, processing, and analyzing data; fewer still know how to carry their own research through such workflows. This Matrix Theme Team will make the whole process transparent and understandable by designing and documenting a complete workflow.

Starting with basic digital scans of the Congressional Record, the team will develop programming scripts and pedagogical materials to model the steps of textual data acquisition, cleaning, chunking, databasing, analysis, and visualization that characterize the research process from beginning to end. In addition to creating materials helpful to anyone who wants to understand or implement a text analysis workflow, they aim to produce a research-ready database enabling a wave of scholarship into the behavior of the U.S. Congress.
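To make the stages named above concrete, here is a minimal sketch in Python of how cleaning, chunking, databasing, and a simple analysis might fit together. It is illustrative only and is not the team's actual code: the sample text, the function names, the table layout, and the assumption that speeches begin with a "Mr. NAME." marker are all hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical sample standing in for OCR text from one scanned page.
RAW_PAGE = """CONGRESSIONAL RECORD - SENATE
Mr. SMITH. Mr. President, I rise to sup-
port the bill.
Mr. JONES. I object to the bill."""


def clean(text):
    """Drop the running header, rejoin hyphenated line breaks, collapse whitespace."""
    text = re.sub(r"^CONGRESSIONAL RECORD.*$", "", text, flags=re.MULTILINE)
    text = re.sub(r"-\n", "", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk(text):
    """Split into (speaker, speech) pairs at 'Mr. NAME.' markers (assumed convention)."""
    parts = re.split(r"(?:Mr|Mrs|Ms)\. ([A-Z]{2,})\. ", text)
    # With one capture group, re.split returns
    # [preamble, speaker1, speech1, speaker2, speech2, ...].
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def store(speeches, conn):
    """Load the chunks into a small relational table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS speeches (speaker TEXT, body TEXT)")
    conn.executemany("INSERT INTO speeches VALUES (?, ?)", speeches)
    conn.commit()


def words_per_speaker(conn):
    """A toy analysis step: total words spoken by each speaker."""
    counts = Counter()
    for speaker, body in conn.execute("SELECT speaker, body FROM speeches"):
        counts[speaker] += len(re.findall(r"[A-Za-z']+", body))
    return dict(counts)


conn = sqlite3.connect(":memory:")
speeches = chunk(clean(RAW_PAGE))
store(speeches, conn)
totals = words_per_speaker(conn)
print(totals)
```

Each function corresponds to one decision point in the workflow. Real scans would need far more careful header removal and speaker detection than these regular expressions provide, which is exactly the kind of choice the documentation is meant to expose.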

Their first goal, for the preprocessing stage of the project, is to create a clean, complete, and fully structured dataset from the Congressional Record that can be made available as a package for researchers. Their second goal is to develop a much fuller understanding of the promises and pitfalls of computational methods, and to teach one another both within the working group and in workshops. They also expect that individual working group members will use both the dataset and the computational methods in their own research projects. They intend to create a free and open dataset that is superior in scope to what is available from the Government Publishing Office and superior in quality to what is sold by ProQuest.