David Patrick McKenzie

Digital Public Historian

Short tutorial: Cleaning up a Google Book for text mining

This post has now been cross-posted to the class blog. Any future updates, as of 12/17/2012, will appear there.

Lesson Goals & Reasons

Why?

Google Books has increasingly become one of the main repositories of textual sources for historians and other humanities scholars. As many have discussed, it has its issues. Nonetheless, it remains a valuable place to gather texts for those interested in text mining–provided we clean the text first.

Why is this important? You want your text mining to give you accurate results. The example that I am using, a volume of the Diplomatic Correspondence of the Republic of Texas, exemplifies some of these issues. When I ran the raw text file through Voyant, here is what happened:

“Digitized” and “Google” appeared as two of the most common words! Presumably Texas diplomats in the 1830s and 1840s were not writing about digitization or about Google. Additionally, the words “United” and “States” popped up frequently, because the top of each page reads “Correspondence with the United States” for a large part of the book. I want to know whether the words “United States” do indeed show up frequently otherwise, along with other important ideas, such as words having to do with annexation. Text mining could help contribute to an understanding of Texas’s diplomatic policies when it was a republic, but only if we run those analyses on an accurate copy of the correspondence.

Goals

This lesson will use a sample Google Books document with some of the issues that tend to accompany texts from that repository. You will clean that document, removing some of the specific Google Books formatting and then going through basic steps to prepare it for text mining. Some of these principles will be applicable to many different books.

You will need:

  • A Google Books document, preferably in text format. I downloaded this one from the Internet Archive. Otherwise, you can convert a Google Books PDF to text.
  • A text editor to create Python scripts.
  • The ability to execute Python. I personally like to use Komodo Edit because it includes the ability to execute a Python script right in your window. Programming Historian 2 has a method for setting that up. You can also, of course, run your scripts from the command line, as in the example after this list.
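For instance, assuming Python is installed and available on your path, you can run the finished script (saved under the filename used later in this tutorial) from a terminal:

python tx-dip-corr.py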

What Will You Do?

In this tutorial, we will create a Python script to:

  • Download the text file of a Google Book from Archive.org.
  • Strip out page numbers and titles on the tops of pages.
  • Remove the introductory portions of the file & strip out the HTML.
  • Strip out “Digitized by Google”.
  • Save the file to your drive.

Begin the script and get the document

To start, we’ll begin our Python script in Komodo Edit. Open Komodo Edit and save a new file as “tx-dip-corr.py” (or whatever name suits the text you will be working with).

First we want to give our script access to a couple of different Python libraries–one that deals with getting documents from the Internet, and one that gives us access to regular expressions (which will be explained later). To access those, we use the “import” command:

import urllib2
import re

Next, we want to open a file from the web. To do so, let’s create a variable, called “url,” and give it the web address of the item we want to open. So type:

url = ""

Then we want to get our URL, which will go into the quotation marks.

Many of the out-of-copyright works that Google has scanned have been uploaded to the Internet Archive and are available in multiple formats. In this case, we want to use the text version. For this tutorial, I am using a volume of the Diplomatic Correspondence of the Republic of Texas from the Internet Archive.

Right-click on the link that says “Full Text” and copy the URL. This is the file we will be using. Paste the URL between the quotation marks. So, you will now have:

url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"

That only gave us the address of the file, though. Next we want actually to open it. To do that, we pass the URL to urllib2:

response = urllib2.urlopen(url)

We will read the response line by line in the next section, so there is no need to call response.read() here.

If you want to use a file you have already downloaded, this lesson from Programming Historian 2 shows how to open a file already on your computer.
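If you go that route, a minimal sketch looks like the line below. It assumes you have already saved the plain-text file in the same folder as your script (the filename matches the Internet Archive download, but substitute your own path as needed), and it replaces the urlopen call above:

response = open('diplomaticcorre33statgoog_djvu.txt', 'r')

The rest of the script can stay the same, because the for loop in the next section reads the response line by line whether it comes from the web or from a local file.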

Strip out page numbers and titles on the tops of pages

The title of the book or section, plus the page number, was captured at the top of each page in the scan, not to mention digitization information, and all of it comes through in the plain text file.

This could throw off our text mining results, so let’s get rid of it! We want to strip out the page numbers and titles at the tops of pages. Unfortunately, the scan of this book was not the best (to put it mildly), so the text at the top of each page rendered differently. It should say, on one side, “[page number] American Historical Association” (because the book was published by the American Historical Association). On the other side, it should say, “Correspondence with [country]. [page number].” In practice the OCR rarely renders the headers that cleanly; the only consistency we get is that the page number appears at the beginning or end of the line.

Luckily, we can use regular expressions to find these lines and get rid of them.

We begin by setting a variable, which we’ll call “txt”, to hold the text. For now, we set it to an empty string:

txt = ""

Next we have the program go through the downloaded file, line by line, looking for the headers we want to remove. To do that, we set up a for loop; its variable, which we’ll call “line”, holds the current line of the response on each pass:

for line in response:

Then we set up a search, using regular expressions, to find the one consistency that we identified:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)

We are creating a variable, “matchObj”, that holds the result of searching the current line with a regular expression (“re.search”). My classmate Laura explains more about regular expressions in her tutorial, but here is what this pattern does. The first group, “^[\s]*[0-9]+\s.*”, matches a line that begins (indicated by ^) with optional whitespace (“[\s]*”, where “\s” means any whitespace character and “*” means zero or more of the preceding item), followed by one or more digits (“[0-9]+”, where “+” means one or more), a whitespace character, and then anything else (“.*”). The second group, “.*\s[0-9]+[\s]*$”, matches a line that ends (indicated by $) with one or more digits, optionally followed by trailing whitespace. The “|” between the two groups tells the program to match either one. In other words, we catch any line that starts or ends with a page number. Here is a complete listing of regular expressions and what they do.
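To get a feel for what the pattern catches, here is a quick check against two sample lines. Both lines are made up for illustration; they are not quoted from the book:

import re

pattern = "(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)"
header_line = "312 AMERICAN HISTORICAL ASSOCIATION.\n"    #a made-up page header
text_line = "The Government of Texas desires peace.\n"    #a made-up line of correspondence

print(re.search(pattern, header_line) is not None)    #True: the line starts with a page number
print(re.search(pattern, text_line) is not None)      #False: no leading or trailing page number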

All this has done, though, is search the current line. We still need to tell the program what to do with the result. For that, we use an “if…else” statement. First, the if:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass

Here, we are telling the program that if the search finds a match, it should do nothing with that line (“pass”). Because the line is never added to the text we are keeping, that is the equivalent of deleting it from the file.

Next, we want to make sure that every other line is coming through. So we tell it:

for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line

This says that if the line does not match the search, it is added to the text that we will eventually save to our computer. Now those page headers are all gone!

Remove the introductory portions of the file & strip out the HTML

Diplomatic Correspondence of the Republic of Texas is a compilation of primary source documents. For our purposes, we are interested only in the primary sources themselves, so we do not want the volume’s introductory material, not to mention Google’s information about it. This particular file also has information from Archive.org at the top.

Open the file in your web browser. Scroll to where the primary sources begin–in this case, it’s a line that says “CORRESPONDENCE HITHERTO UNPUBLISHED.” Note this, as it will be important.

Write the function

Now we are going to write a function to grab the parts of the file that we want and strip out any HTML in it, leaving us with text. We begin by naming our function:

def stripTags(pageContents):

First, in our function, we want to define where to start grabbing the file. Remember the line we noted as the place where the primary sources begin? Now we use it:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

This tells the program to go through the document, find the first occurrence of the phrase we previously identified, and set its position as the starting location. Then, using Python’s slice notation, it keeps everything from that point forward as the text we want. In other words, it drops the introductory text at the beginning.
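If the find method or slice notation is new to you, here is a tiny, hypothetical example (the sample string is made up for illustration):

sample = "front matter CORRESPONDENCE body text"
start = sample.find("CORRESPONDENCE")
print(sample[start:])    #prints "CORRESPONDENCE body text"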

Next, we want to strip out the HTML. All HTML tags in the file are found between “<” and “>”, so we want to remove those symbols and anything between them. We do this by looping through the text one character at a time.

To do this, we will use a variable called “inside”, set to 1 while we are inside a tag and 0 while we are not. We also create an empty string for the variable “cleantext”, which will hold our cleaned-up text. Here is the code:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''

After this, we loop through the text character by character, skipping the angle brackets plus anything between them and keeping everything else. We do this with an if…elif…else statement:

def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif(inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char  
    return cleantext

After all of that, we now have our cleaned text. Sort of.

Execute the function

We now need to execute our function on the text we’ve retrieved from the Internet.

Let’s call the variable that holds the result “ctext”, for “cleaned text”. We assign it the output of running our function on the text we built up earlier in the variable txt:

ctext = stripTags(txt)

Next, we’ll clean some other results of the digitization out of the text.

Strip out “Digitized by Google”

As you scroll through the file, you’ll notice there are parts that repeat. In Google Books, each page image gets “Digitized by Google” placed on it. When whoever created the plain text file from Diplomatic Correspondence of the Republic of Texas did the OCR, that portion was caught.

Luckily, it’s rather easy to strip out the “Digitized by Google” lines using Python’s replace function. We set a variable to hold the text without that portion; let’s call it “stripGoogle”. We will build up this line in steps:

stripGoogle =

Next, we point it at the source of the text, in this case the variable we just created to hold the result of the stripTags function. So we tell stripGoogle to start from that variable:

stripGoogle = ctext

Unfortunately, those words exist on separate lines in the file, so we have to replace them separately, chaining two calls to replace onto the end of “ctext”. Each call replaces the text it finds with nothing, indicated by the quotation marks with nothing between them:

stripGoogle = ctext.replace("Digitized by","").replace("Google","")

Save the file to your drive

Now that the file is cleaned, we’re prepared to commit it to our hard drive. From there, we can take the file into tools like Voyant, or use it for text mining.

First, we want to create the file, which we’ll call “tx-dip-corr.txt”. So we set a variable, which we’ll call “f”, that opens a new file of that name for writing:

f = open('tx-dip-corr.txt','w')

Next, we want to write the final result of our manipulations–which we contained in a variable called “stripGoogle” (this will change as I figure out the regular expressions for the last part!)–into the file. We do this with the write function:

f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)

Finally, we close that file at the end of our script:

f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()
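As an aside, an equivalent and slightly more idiomatic pattern, shown here only as a sketch of an alternative, is Python’s with statement, which closes the file automatically once the block ends:

with open('tx-dip-corr.txt', 'w') as f:
    f.write(stripGoogle)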

Wrapping it up

Unfortunately, this tutorial doesn’t deliver you a perfectly clean copy of the Diplomatic Correspondence of the Republic of Texas. A lot of the OCR is pretty bad, so some manual cleanup will still be needed. Some of the OCR errors are consistent, though, so a simple find/replace in a text editor might help, or even a simple Python script like the one sketched below.
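Here is a minimal sketch of what such a follow-up script might look like. The particular substitutions are hypothetical examples, not errors drawn from this book; replace them with the consistent OCR mistakes you actually find in your own copy:

#ocr-cleanup.py: find/replace cleanup of consistent OCR errors (hypothetical examples)
corrections = {
    "Kepublic": "Republic",        #hypothetical OCR confusion
    "Gtovernment": "Government",   #hypothetical OCR confusion
}

#read in the file we saved earlier
f = open('tx-dip-corr.txt', 'r')
text = f.read()
f.close()

#apply each correction throughout the text
for wrong, right in corrections.items():
    text = text.replace(wrong, right)

#write the corrected text back out
f = open('tx-dip-corr.txt', 'w')
f.write(text)
f.close()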

In the end, the cleaned version still shows “United” and “States” as two of the most common words, so there is more one could do with this text. Nonetheless, we have removed some of the issues.

This tutorial has shown how to strip out some of the more common issues with the text of a Google Book. Once you execute the script, you will have the file saved on your computer. Enjoy the text mining! For your reference, here is the entire script, including comments about what does what:

#tx-dip-corr.py
#import libraries
import urllib2
import re

#open file from the web
url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"
response = urllib2.urlopen(url)  # or open a local copy instead: response = open('C:\\temp\\diplomaticcorre33statgoog_djvu.txt', 'r')

# build up txt without page numbers
txt = ""
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line

#function to strip the introductory portion and HTML tags
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]

    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif(inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char  
    return cleantext

#now execute the function
ctext = stripTags(txt)

#strip out "Digitized by Google" through the whole file
stripGoogle = ctext.replace("Digitized by","").replace("Google","")

#create the file and write results to it
f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()

I am eternally grateful to my friend Kelvin Pan for helping me resolve some issues that arose as I tried to figure this out, and to my professor, Fred Gibbs, for his helpful comments.

5 Comments

  1. David McKenzie

    November 23, 2012 at 8:54 pm

    Here is a variation of what I tried. I also tried the re.sub function. No luck in either instance. Not sure if I’m getting the regex wrong or the Python, or both:

    leftPageNo = re.search(“[0-9]{2}+.*+$”, stripGoogle)
    rightPageNo = re.search(“^+.*+[0-9]{2}$”, leftPageNo)

    noPageNo = stripGoogle.replace(leftPageNo, “”).replace(rightPageNo, “”)

  2. David McKenzie

    November 24, 2012 at 1:27 pm

    Thanks to my friend Kelvin Pan on Facebook, I got an improved regex:

    #find page numbers and headers
    leftPageNo = re.search(“^[ \t]*[0-9]”, str)
    rightPageNo = re.search(“[0-9][ \t]*$”, str)

    #strip out page numbers and headers
    noPageNo = stripGoogle.replace(leftPageNo, “”).replace(rightPageNo, “”)

    But I still get an error:
    TypeError: expected string or buffer

    from the leftPageNo line. Any suggestions?

  3. very nice, david. a few suggestions:
    it may help the casual reader (and convert casual readers to serious ones) to see an image of the google book you are using, maybe with the offending textual problems highlighted so that readers will have a very clear visual idea of what kind of cleaning you’ll explain how to do.

    in the second paragraph, you mention the “‘topic’ of the United States”, but remember that a topic in this context is a cluster of words that in human language characterizes what a group of texts are about. looking for united states is just looking for a bi-gram, not a topic. similarly, are you looking for words related to annexation (how would that topic appear?) or just the word itself? in fact, i would just remove references like in the “goals” section that you are preparing the text for topic modeling–just say text mining. that’s more accurate.

    really, the user doesn’t need komodo edit; i realize you’re following PH here, but any text editor will do. i like to impose as few “needs” as possible.

    your acknowledgement of kelvin’s help should go at the end of the tutorial.

    perhaps in the beginning of the page number section is a good place for a small image that shows the page headings.

    please don’t use all caps for variable names, like TXT. all caps variable names are generally reserved for pre-defined constants that the code needs.

    could you add a few links to the pages or sites or tutorials that helped you get started with regular expressions?

    your code explanation for the first regex isn’t quite right. the “lines” after the comma does not tell it to parse each line–that’s what the for loop declaration does. the “lines” refers to the current line of the response that is being processed in the loop. you might use the variable line instead of lines, since that’s what the loop operates on (remember that “for X in Y” is short for “for each X in set Y”)

    in the section on saving the file, you should delete the first sentence. a) it is not hard to believe that we haven't saved the file; b) it's not being operated on in the ether, but in the computer's RAM, which is probably implicitly understood by anyone going through a python tutorial.

    your regex explanation doesn’t explain the pipe (|) or the \s or the * or the + or the groups of parentheses. either explain everything or nothing, unless there is something unusually clever about the regex, but this one is pretty straightforward.

    you might consider also showing how to load a text file from a file in addition to using a url, since many people will already have a pdf or text file to work with (or a set of them) maybe you should explain how to automatically read in a directory of files.

    but a very nice start to your tutorial, which i think will be very helpful for people getting started with text mining!

  4. in fact, you might leave more of the reg ex explanation to laura’s tutorial and focus here less on the general concept, but more on the specifics of what you’re stripping out–which is sort of the direction that the tutorial leans anyway. stripping out some of the fluff prose will make it a bit snappier and more to the point.

  5. You use Internet Archive so you can get a PDF? Are all google books available on Internet Archive?
