Analyzing and Visualizing Text with Constellate and ProQuest TDM Studio

Week 3 of the Data Jam focused on Visualizing Texts and Networks. Text analysis (also referred to as text mining, data mining, or TDM) is the use of computational methods to search texts, find patterns, and discover relationships in order to gain insights for a research question. In this workshop, we tested two platforms for sourcing text data for visualization and analysis: Constellate, a text analytics platform from JSTOR Labs; and ProQuest TDM Studio, an end-to-end text and data mining solution from ProQuest. Watch the recording and read through the workshop notes to see how the two platforms compare, how to perform searches on each, and how text analysis works more generally. Here are the key takeaways we covered:

1. Introducing Text Analysis

To understand the two platforms for sourcing text data, we first reviewed the text analysis process and the vocabulary used in this method. Text analysis is often used to gain a better understanding of large volumes of content, whether to discover new resources via non-traditional search methods or to identify possible topics and research questions. We reviewed the process for getting started in text analysis and the key terms used across platforms. We also discussed some text analysis projects, like the Fan Engagement Meter, to show that text analysis includes a number of different methods and tools for exploring research questions.

A slide entitled "Text Analysis Research Process." The slide lists the following, from top to bottom: Formulate a Research Question, Build a Corpus, Clean Data, Analyze Text, Gain Insights.
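
The middle steps of that process (Build a Corpus, Clean Data, Analyze Text) can be sketched in a few lines of Python. This is only an illustration: the three-document corpus, the tiny stopword list, and the function names `clean` and `term_counts` are assumptions made for the example, not anything provided by Constellate or TDM Studio.

```python
import re
from collections import Counter

# Build a corpus: three made-up documents standing in for a platform export.
corpus = [
    "Amtrak expands rail service to Chicago.",
    "The government debates Amtrak funding.",
    "Chicago railroad history and the government.",
]

# A tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "to", "and", "a", "of"}


def clean(text):
    """Clean data: lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_counts(docs):
    """Analyze text: count each cleaned token across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(clean(doc))
    return counts


counts = term_counts(corpus)
print(counts.most_common(3))
```

The most frequent terms are a starting point for the final step, Gain Insights: they suggest which words might be worth tracking over time or comparing across sources.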

You can find the Penn Libraries guide to Text and Data Mining at guides.library.upenn.edu/penntdm.

2. Constellate (constellate.org)

Constellate is a new text and data analytics service from JSTOR and Portico. Users can build datasets for analysis from a variety of sources, perform basic text analysis and visualizations, and join a growing community of practitioners who share text analytics materials. The platform offers these capabilities, along with access to content from a variety of databases, in an open environment with teaching materials that can be used, modified, and shared.

Constellate provides content from JSTOR (journal articles, book chapters, research reports, pamphlets), Portico (journal articles, book chapters, full books), Chronicling America (historical newspapers, 1789-1963), Doc South (documents, books), South Asia Open Archives (journal articles, reports, newspapers, periodicals, pamphlets, and surveys), and Reveal Digital (alternative press, newspapers, magazines, journals).

Figure: A term frequency chart generated by the Constellate platform, using a dataset focused on the keyword "amtrak". The chart shows unigram frequency across the dataset: the term "amtrak" appears in 100% of documents over time, while terms like "chicago", "railroad", and "government" fluctuate.
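
To make the idea behind such a chart concrete, here is a minimal sketch of one way to compute the share of documents per year that contain a given term. It is not Constellate's own implementation; the sample `docs` data and the function name `document_share` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (year, tokens) pairs; real data would come from a dataset export.
docs = [
    (1971, ["amtrak", "railroad", "government"]),
    (1971, ["amtrak", "chicago"]),
    (1980, ["amtrak", "government", "government"]),
]


def document_share(docs, term):
    """Return {year: fraction of that year's documents containing term}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, tokens in docs:
        totals[year] += 1
        if term in tokens:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


print(document_share(docs, "amtrak"))      # {1971: 1.0, 1980: 1.0}
print(document_share(docs, "government"))  # {1971: 0.5, 1980: 1.0}
```

In this toy data the search keyword appears in every document, as "amtrak" does in the chart, while "government" varies from year to year.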

For more on how to use Constellate, check out the How-To Guides.

3. ProQuest TDM Studio (tdmstudio.proquest.com)

ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from news, scholarly, and other publications provided to the University of Pennsylvania Libraries via current ProQuest subscriptions. Currently, TDM Studio offers access to 176 ProQuest databases, or 51,711 publications (magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers like The Wall Street Journal, The New York Times, and The Washington Post).

Figure: A topic list of keywords based on a dataset related to the keyword "Amtrak", showing three of the five topics produced for the dataset and the frequency of topics over time.

For more on how to use ProQuest TDM Studio, check out these videos, resources, and guides.

4. Comparisons

Both platforms are under active development and offer similar but distinct opportunities for researchers to engage in the text analysis process. The following table offers a quick comparison of the two platforms for sourcing, visualizing, and analyzing textual datasets.

	CONSTELLATE	TDM STUDIO
PROCESSING DATA	Jupyter Notebooks	Jupyter Notebooks
EXPORTING DATA	A JSON-L dataset containing the n-grams, full text, and metadata	Rolling, 7-day export limit of 15 MB
BUILT-IN TOOLS FOR VISUALIZING DATA	Number of Documents Over Time; Key Phrases; Term Frequency; Document Categories Over Time; Category Treemap	Geographic Analysis; Topic Modeling
DATASETS	JSTOR; Portico; Chronicling America; Reveal Digital; Doc South; South Asia Open Archives	176 databases; 51,711 publications, including current newspapers
DATASET SIZE	50,000 items per dataset	Up to 2 million documents per dataset (10 datasets max) for Workbench Dashboard; up to 10,000 documents per dataset (5 datasets max) for Visualization Dashboard
ACCESS	Access provided through University of Pennsylvania	Contact RDDS Team for information
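
Because both platforms process data in Jupyter Notebooks, and Constellate exports a JSON-L dataset, a short Python sketch may help show what working with such an export looks like. JSON-L (JSON Lines) stores one JSON record per line. The field names used here (`id`, `title`, `unigramCount`) are assumptions for illustration and may not match the actual export schema; the two sample records are made up.

```python
import io
import json

# Two made-up records in JSON Lines form; the field names are assumptions.
sample = io.StringIO(
    '{"id": "doc-1", "title": "Rail notes", "unigramCount": {"amtrak": 3, "rail": 1}}\n'
    '{"id": "doc-2", "title": "City report", "unigramCount": {"chicago": 2, "amtrak": 1}}\n'
)


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(sample))

# Sum the (hypothetical) per-document unigram counts for one term.
total_amtrak = sum(r["unigramCount"].get("amtrak", 0) for r in records)
print(len(records), total_amtrak)
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `open(path, encoding="utf-8")`, to `read_jsonl`. Reading line by line keeps memory use low for large datasets.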