Counting Words: A Beginner’s Guide to Text Mining with Python
Hello and welcome! Today, we’re diving into the exciting world of text mining. It sounds super professional, doesn’t it? But at its core, it’s all about taking a text file and analyzing it to uncover patterns, like counting words or even figuring out its emotional tone. So, let’s get started!
Text mining is an absolute goldmine for the humanities. Imagine you want to track how the sentiment of newspaper articles changed over time. Were the 1950s really more optimistic than the 1970s? You could analyze a dozen articles from each decade, count the positive versus negative words, and find out! Or perhaps you want to explore a question of authorship. Did Shakespeare really write that contested play? You could analyze the statistical properties of plays you know are his, then compare them to the mystery play. You can even use these techniques to track characters in a novel, see how many lines they have, and when they appear or disappear. The applications are endless.
Getting Your Text Ready
First things first, you’ll need some text to analyze. You can copy and paste a chunk of text from anywhere you like.
A crucial heads-up: make sure you’re using a plain text file (a .txt file), not a Word document (.doc or .docx). Word files contain a lot of hidden formatting code that our Python program can’t read. The easiest way to create one is to right-click on your desktop, select “New,” and then choose “Text Document.” Paste your text into that file and save it.
Step 1: Uploading and Reading the File
Alright, let’s get our hands dirty with some code. If you’re using Google Colab, you first need to upload your text file to Google’s servers. We’ll use the files module from the google.colab library for this. The code below will prompt you to upload a file. Once uploaded, we’ll open it in read mode ('r') and use file.read() to load its entire contents into a single string variable called full_text.
import string
from google.colab import files
# Prompt the user to upload a file
uploaded = files.upload()
# Get the filename of the uploaded file
# (This assumes the user uploads only one file)
file_name = list(uploaded.keys())[0]
print(f"Analyzing file: {file_name}")
# Open and read the entire file into one string
# (encoding='utf-8' avoids surprises with accented characters and curly quotes)
with open(file_name, 'r', encoding='utf-8') as file:
    full_text = file.read()
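A quick note: if you’re running Python on your own computer rather than in Colab, you can skip the upload step and open the file by its path directly. Here’s a minimal sketch, where my_text.txt stands in for your own file’s name:
# Running locally instead of Colab: open the file by its path.
# 'my_text.txt' is a placeholder; substitute your own file's name.
with open('my_text.txt', 'r', encoding='utf-8') as file:
    full_text = file.read()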
Step 2: Cleaning and Normalizing the Text
Before we can count words, we need to clean up our text. This is a standard and very important step in any text analysis project.
First, we’ll convert everything to lowercase. Computers are sticklers for detail; to a computer, “The” and “the” are two completely different words. To get an accurate count, we need to level the playing field. The .lower() string method makes this incredibly easy.
# Convert the entire text to lowercase
lower_text = full_text.lower()
Next, we have to deal with punctuation. If we don’t, a word like “play,” will be counted as a different word from “play” because of the comma. We need to strip all that out. The string module we imported earlier is a huge help here. It contains a handy string called string.punctuation that lists all common punctuation characters.
# This is what string.punctuation looks like:
# '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
We can loop through every character in string.punctuation and use the .replace() method to remove it from our text. Note that we replace it with an empty string (''), effectively deleting it.
# Loop through each punctuation mark and remove it from the text
for punc in string.punctuation:
    lower_text = lower_text.replace(punc, '')
A quick thought: should we replace punctuation with nothing, or with a space? That’s a great question! In most properly written text, there’s already a space after punctuation, so replacing it with nothing is fine. But if you have text where words are joined by a comma without a space (e.g., “word,word”), our current method would incorrectly merge them into “wordword”. For most cases, our simple replacement works, but it’s something to be mindful of with messy data!
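If you’re working with messy text like that, here’s a small variation, sketched under the assumption that stray extra spaces are harmless: replace each punctuation mark with a space instead of deleting it. The .split() we use in the next step treats any run of whitespace as a single separator, so the extra spaces disappear on their own.
# Variation: replace punctuation with a space instead of deleting it,
# so "word,word" becomes "word word" rather than "wordword".
# The .split() in the next step collapses the extra whitespace.
for punc in string.punctuation:
    lower_text = lower_text.replace(punc, ' ')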
Step 3: Splitting the Text into a List of Words
With our text all clean and shiny, it’s time to break it down. Right now, we have one giant string. We need a list of individual words. That’s where the wonderfully simple .split() method comes in. By default, it splits the string by any whitespace (spaces, newlines, tabs), giving us exactly what we need.
# Split the single string into a list of words
all_words = lower_text.split()
If you were to print(all_words) now, you’d see a Python list where each element is one of the words from your file.
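If the file is long, printing the whole list can flood your screen; a quick way to sanity-check the result is to peek at just the first few entries:
# Peek at the first ten words as a quick sanity check
print(all_words[:10])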
Step 4: Counting the Word Frequencies
Now for the main event: counting the words! The perfect tool for this job is a dictionary, where each key will be a word and its value will be the number of times that word has appeared. We’ll start by creating an empty dictionary.
Then, we’ll loop through our all_words list. For each word, we need to check if we’ve seen it before. If we have, we’ll increment its count. If it’s a new word, we’ll add it to our dictionary with a count of 1.
The dictionary’s .get() method is the secret sauce here. The line current_count = word_counts.get(word, 0) safely checks the dictionary. If word is already a key, it returns its current count. If it’s not in the dictionary, instead of causing an error, it returns the default value we provided: 0. This makes our code clean and simple. We then add one to this count and update the dictionary.
# Create an empty dictionary to store word counts
word_counts = {}
# Loop through every word in our list
for word in all_words:
    # Get the current count for this word, defaulting to 0 if it's not found
    current_count = word_counts.get(word, 0)
    # Update the dictionary with the new count (+1)
    word_counts[word] = current_count + 1
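As an aside, Python’s standard library can do this counting in a single step with collections.Counter. Writing the loop yourself is the best way to understand what’s happening, but it’s worth knowing the shortcut exists. A Counter behaves like a dictionary, so the display code in the next step works on it unchanged:
from collections import Counter

# Counter builds the same word -> count mapping in one call
word_counts = Counter(all_words)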
Step 5: Displaying the Results
We’ve done the hard work, so let’s see the fruits of our labor! We can now loop through the .items() of our word_counts dictionary to get each word and its final count, and then print them out.
print("\n--- Word Frequency Results ---")
for word, count in word_counts.items():
    print(f"{word}: {count}")
When you run this, you’ll get a full list of every unique word in your document and how many times it appeared. You might notice that the list is unsorted, but that’s a topic for another day! You may also see some oddities, like leftover curly quotes that weren’t in the standard string.punctuation. This is a great example of how text cleaning is often an iterative process of finding and handling new edge cases.
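One common fix, shown here as a sketch: keep a second string of the extra characters you’ve spotted and strip them in the same cleaning loop. The exact characters worth listing depend on your source text.
# Curly quotes aren't in string.punctuation, so list them separately
# and strip them in the same pass. (Extend this string as you discover
# other stray characters in your own text.)
extra_punctuation = '“”‘’'
for punc in string.punctuation + extra_punctuation:
    lower_text = lower_text.replace(punc, '')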
And there you have it! You’ve successfully opened a file, cleaned it, and performed a basic but powerful text mining analysis. This is just the first step on a fascinating journey. I hope this was useful, and see you next time!

