Text Analysis Project

Comparing approaches to analyzing news articles on Michigan data centers

The Project Overview

In our final collaborative project, the class compared two approaches to text analysis: one powered by AI and one that is not.

The Two Approaches

ProQuest TDM Visualization Service

An AI-assisted topic modeling service that provides a user-friendly interface for text analysis. The service handles preprocessing and topic extraction automatically and in minutes.

Pros:

  • Easy to use, no coding required
  • Quick results
  • Handles preprocessing automatically

Cons:

  • Less control over, and less insight into, how the data was processed
  • Some topics contained seemingly random or unrelated words, and some words appeared in more than one topic

Python LDA Workflow

A custom Jupyter notebook workflow where students wrote Python code to implement Latent Dirichlet Allocation from scratch.

Pros:

  • Full control over preprocessing, including the ability to remove duplicate words
  • Transparent methodology
  • Customizable parameters

Cons:

  • Requires programming knowledge
  • More time investment
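
To give a sense of what a from-scratch implementation involves, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the class notebook: the function name lda_gibbs, the default parameter values, and the four toy documents are all assumptions made for illustration, and a real workflow would more likely rely on an established library.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total tokens per topic
    assignments = []                                      # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much the document uses it and
                # how much the topic uses this word (both smoothed).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Toy documents, invented for illustration only.
docs = [
    "data center power grid electricity demand".split(),
    "grid electricity demand utility power rates".split(),
    "township zoning board vote residents hearing".split(),
    "residents hearing zoning township board proposal".split(),
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(4) if c > 0]
    print(f"Topic {k}: {top_words}")
```

Changing num_topics, iterations, alpha, beta, or seed can change which words end up grouped together, which is the kind of control the code-based approach exposes.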

Topic Modeling Explained

What is Topic Modeling?

Topic modeling is an unsupervised machine learning technique that automatically identifies themes or "topics" within a collection of documents. It works by finding groups of words that frequently appear together, on the assumption that these groups represent coherent topics.

The dataset we analyzed contained nearly 500 news articles on data centers in Michigan, published over a period of nearly fifteen months.
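
The idea of "words that frequently appear together" can be made concrete by counting how often pairs of words occur in the same document. The sketch below does only that; it is a simplified illustration of the intuition, not a topic model, and the four tiny documents are invented rather than drawn from the project dataset.

```python
from collections import Counter
from itertools import combinations

# Invented, already-tokenized documents for illustration.
docs = [
    ["power", "grid", "demand"],
    ["power", "grid", "rates"],
    ["zoning", "township", "vote"],
    ["zoning", "township", "hearing"],
]


def cooccurrence(docs):
    """Count, for each pair of words, how many documents contain both."""
    pairs = Counter()
    for doc in docs:
        # Sort the unique words so each pair has one canonical order.
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


pairs = cooccurrence(docs)
for (a, b), count in pairs.most_common(2):
    print(a, b, count)
```

Pairs such as "grid" and "power" co-occur repeatedly while "grid" and "zoning" never do; a topic model builds on this kind of signal to group words into topics.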

Key Findings

Why Transparency Matters

By comparing these two approaches, we discovered how preprocessing decisions and parameter choices shape topic modeling outcomes. The same data can reveal different "topics" depending on how it's processed and analyzed.

What We Learned

  1. Preprocessing Matters: How text is cleaned, tokenized, and filtered significantly affects results
  2. Parameter Sensitivity: The number of topics, iteration count, and other parameters influence outcomes
  3. Reproducibility: Transparent code-based approaches enable others to verify and build upon work
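
As a small illustration of the first point, the sketch below tokenizes one sentence with and without stopword removal and compares the most frequent words. The sentence, the tiny STOPWORDS set, and the tokenize helper are hypothetical and were chosen only to make the effect visible.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "of", "to", "in", "and", "for", "is"}


def tokenize(text, remove_stopwords):
    """Lowercase the text, keep alphabetic tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


# Invented sentence for illustration.
text = (
    "The data center is the largest user of power in the township, "
    "and the grid is strained."
)

raw = Counter(tokenize(text, remove_stopwords=False))
clean = Counter(tokenize(text, remove_stopwords=True))

print(raw.most_common(3))
print(clean.most_common(3))
```

Without filtering, function words such as "the" dominate the counts; with filtering, only content words remain. Any model fitted on these counts inherits that difference.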

Implications for Research

This project demonstrates that AI tools are not neutral; they embed assumptions and choices that shape what we learn. For researchers, this means:
