This post has now been cross-posted to the class blog. Any future updates, as of 12/17/2012, will appear there.
Lesson Goals & Reasons
Google Books has increasingly become one of the main repositories of textual sources for historians and other humanities scholars. As many have discussed, it has its issues. Nonetheless, it remains a valuable place to gather texts for those interested in text mining–provided we clean the text first.
Why is this important? You want your text mining to give you accurate results. The example that I am using, a volume of the Diplomatic Correspondence of the Republic of Texas, exemplifies some of these issues. When I ran the text file through Voyant, here’s what happened:
As you can see, “Digitized” and “Google” appeared as two of the most common words! Presumably Texas diplomats in the 1830s and 1840s would not have been writing about digitization or about Google. Additionally, the words United and States popped up frequently–because the top of each page included “Correspondence with the United States” for a large part of the book. I want to know if the words “United States” do indeed show up frequently otherwise, along with other important ideas, like words having to do with annexation. Using text mining techniques could help contribute to an understanding of Texas’s diplomatic policies when it was a republic–but only if we can run those features on an accurate copy of the correspondence.
This lesson will use a sample Google Books document with some of the issues that tend to accompany texts from that repository. You will clean that document–removing some of the specific Google Books formatting, then going through basic steps to prepare it for text mining. Some of these principles will be applicable for many different books.
You will need:
- A Google Books document, preferably in text format. I downloaded this one from the Internet Archive. Otherwise, you can convert a Google Books PDF to text.
- A text editor to create Python scripts.
- The ability to execute Python. I personally like to use Komodo Edit because it includes the ability to execute a Python script right in your window. Programming Historian 2 has a method for setting that up. You can also, of course, test your scripts using the command line.
What Will You Do?
In this tutorial, we will create a Python script to:
- Download the text file of a Google Book from Archive.org.
- Strip out page numbers and titles on the tops of pages.
- Remove the introductory portions of the file & strip out the HTML.
- Strip out “Digitized by Google”.
- Save the file to your drive.
Begin the script and get the document
To start, we’ll begin our Python script in Komodo Edit. Open Komodo Edit and save a new file as “tx-dip-corr.py” (or whatever name fits the text you will be using).
First we want to give our script access to a couple of different Python libraries–one that deals with getting documents from the Internet, and one that gives us access to regular expressions (which will be explained later). To access those, we use the “import” command:
import urllib2
import re
Next, we want to open a file from the web. To do so, let’s create a variable, called “url,” and give it the web address of the item we want to open. So type:
url = ""
Then we want to get our URL, which will go into the quotation marks.
Many of the out-of-copyright works that Google has scanned have been uploaded to the Internet Archive and are available in multiple formats. In this case, we want to use the text version. For this tutorial, I am using a volume of the Diplomatic Correspondence of the Republic of Texas from the Internet Archive.
Right-click on the link that says “Full Text” and copy the URL. This is the file we will be using. Paste the URL between the quotation marks. So, you will now have:
url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"
That only gave us the URL for opening the file, though. Next we want to actually open it:

response = urllib2.urlopen(url)

You could read the whole file into a single string with “txt = response.read()”, but a response can only be read once, and in a moment we are going to go through it line by line–so we leave it unread for now.
If you want to use a file you have already downloaded, this lesson from Programming Historian 2 shows how to open a file already on your computer.
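If you do use a locally saved copy, it behaves just like the web response: a file object can be iterated line by line. Here is a minimal, self-contained sketch–the filename and sample text are invented for illustration, and the sketch writes its own tiny sample file so you can run it anywhere:

```python
# Minimal sketch: a locally saved copy works like the web response object.
# The filename and sample text below are stand-ins for the real file.
sample = "62 American Historical Association\nTo the Secretary of State of Texas:\n"
with open('local-copy.txt', 'w') as out:
    out.write(sample)

response = open('local-copy.txt', 'r')  # behaves like urlopen's result:
lines = [line for line in response]     # it can be iterated line by line
response.close()
```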
Strip out page numbers and titles on the tops of pages
The title of the book or section, plus the page number, was captured in the scan–not to mention digitization information:
Here is what it looks like in the text:
This could throw off our text mining results, so let’s get rid of it! We want to strip out the page numbers and titles on the tops of pages. Unfortunately, the scan of this book was not the best (to put it mildly), and so the text on the top of each page rendered differently. It should say, on one side, “[page number] American Historical Association” (because the book was published by the American Historical Association). On the other side, it should say, “Correspondence with [country]. [page number].” As you can see, it doesn’t do that; the only consistency we get is that the page number is at the beginning or ending of the line.
Luckily, we can use regular expressions to find these lines and get rid of them.
We begin with setting a variable–let’s call it “txt”–for holding the text. We set it empty–for now:
txt = ""
Next we have the program go through the file looking for what we want it to find. To do that, we set up a variable that we’ll call “line”:
for line in response:
Then we set up a search, using regular expressions, to find the one consistency that we identified:
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
We are creating a variable, “matchObj,” that searches the text line by line using regular expressions (“re.search”). My classmate Laura explains more about regular expressions in her tutorial. Some explanation for what is happening here: the pattern matches any line that either begins (indicated by “^”) with a number or ends (indicated by “$”) with one. “[0-9]” matches a single digit, and the “+” after it means “one or more of the preceding item,” so “[0-9]+” matches a whole page number rather than just one digit. “\s” matches any whitespace character (spaces, tabs, and the like), “.*” matches any run of other characters (so the rest of the header line is swept up too), and the “|” tells the program to accept either alternative. Here is a complete listing of regular expressions and what they do.
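To see the pattern in action, here is a small, self-contained check against some sample lines. The sample strings are invented for illustration; the real headers in the file vary with the OCR:

```python
import re

# The header-matching pattern from the script above.
pattern = r"(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)"

# Invented sample lines:
header_start = "62 American Historical Association"  # page number at the start
header_end = "Correspondence with Mexico. 145"       # page number at the end
body = "To the Secretary of State of Texas:"         # an ordinary line of text

is_header_start = re.search(pattern, header_start) is not None  # True
is_header_end = re.search(pattern, header_end) is not None      # True
is_body = re.search(pattern, body) is not None                  # False: kept
```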
But, all that this has done is search through the text. We then need to tell it what to do. For that, we use an “if…else” statement. First the if:
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
Here, we are telling the program that if the search yields a match, then that line is not to be saved; that is the equivalent of deleting it from the file.
Next, we want to make sure that every other line is coming through. So we tell it:
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line
This says that if a line does not come up in the search, it is added to the text to be saved to our computer. Now, those page headers are all gone!
Remove the introductory portions of the file & strip out the HTML
Diplomatic Correspondence of the Republic of Texas is a compilation of primary source documents. As such, for our purposes we are not interested in parts that are not the primary source documents; thus, we do not want the introductory material to the volume, not to mention Google’s information about it. This particular file also has information from Archive.org on the top.
Open the file in your web browser. Scroll to where the primary sources begin–in this case, it’s a line that says “CORRESPONDENCE HITHERTO UNPUBLISHED.” Note this, as it will be important.
Write the function
Now we are going to write a function to grab the parts of the file that we want and strip out any HTML in it, leaving us with plain text. We begin by naming our function and telling it where to start grabbing the file. Remember the line of text we noted where the primary sources begin? Now we see it again:
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]
This tells the program to go through that document and find the line we previously identified, and set that as the starting location. Then, it takes everything from that point forward as the text that we want. In other words, it doesn’t send over the beginning, introductory text.
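The find-and-slice pattern is easy to see on a toy string (the text here is invented for illustration). One caveat worth knowing: if the marker is not found, find() returns -1, so it pays to check the file for the exact wording first:

```python
# find() returns the index where the marker begins; slicing from that index
# keeps everything from the marker onward and discards the front matter.
pageContents = ("Title page and introduction... "
                "CORRESPONDENCE HITHERTO UNPUBLISHED. Sam Houston to Anson Jones.")
startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
trimmed = pageContents[startLoc:]
# trimmed now begins with the marker; the front matter is gone
```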
Next, we want to strip out the HTML. We do this with an “if…else” statement. All HTML in the file, of course, is found between “<>.” So we want to take anything between those symbols and remove it.
To do this, we will use the integers 1 and 0 to track whether we are inside or outside the “<>.” So, we define the variable “inside” (starting outside, at 0) and set up an empty string for the variable “cleantext” (which will hold our cleaned-up text). Here is the code:
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]
    inside = 0
    cleantext = ''
After this, we want to search for the “<>” and remove them, plus anything inside. We do this with an if…else statement:
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]
    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif (inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char
    return cleantext
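If you are comfortable with regular expressions, a more compact alternative does the same tag removal with re.sub. This is not part of the original script, and it assumes the tags are simple and never nested:

```python
import re

# Remove anything between "<" and ">" in one pass. "[^>]*" means "any run of
# characters that are not '>'", so each tag is matched individually.
html = "Sam Houston<br/> to <b>Anson Jones</b>"
cleantext = re.sub(r"<[^>]*>", "", html)
# cleantext == "Sam Houston to Anson Jones"
```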
After all of that, we now have our cleaned text. Sort of.
Execute the function
We now need to execute our function on the text we’ve retrieved from the Internet.
Let’s call the variable for the result “ctext,” for “cleaned text.” We assign it the output of running the function on the text we downloaded (held in “txt”):
ctext = stripTags(txt)
Next, we’ll clean some other results of the digitization out of the text.
Strip out “Digitized by Google”
As you scroll through the file, you’ll notice there are parts that repeat. In Google Books, each page image gets “Digitized by Google” placed on it. When whoever created the plain text file of Diplomatic Correspondence of the Republic of Texas ran the OCR, that phrase was captured along with the text.
Luckily, it’s rather easy to strip out the “Digitized by Google” lines using Python’s replace function. We set a variable to house the text without that portion–let’s call it “stripGoogle”:
Next, we get the source of the text–in this case, the variable we just created to execute the stripTags function. So we tell stripGoogle to take that variable:
stripGoogle = ctext
Unfortunately, “Digitized by” and “Google” ended up on separate lines in the OCR, so we have to replace them separately. We chain the replace calls onto the end of “ctext”–we are replacing the text found in “ctext” with nothing, indicated by a pair of quotation marks with nothing between them:
stripGoogle = ctext.replace("Digitized by","").replace("Google","")
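Here is the chained replace() at work on an invented fragment. Each replace() returns a new string, which is why the second call can be tacked directly onto the first:

```python
# Sample text invented for illustration; in the real file, "Digitized by"
# and "Google" appear on their own lines between pages.
sample = "Digitized by\nGoogle\nTo the Hon. Secretary of State:"
stripped = sample.replace("Digitized by", "").replace("Google", "")
# The letter text survives; only the two boilerplate strings are removed.
```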
Save the file to your drive
Now that the file is cleaned, we’re prepared to commit it to our hard drive. From there, we can take the file into tools like Voyant, or use it for text mining.
First, we want to create the file, which we’ll call “tx-dip-corr.txt.” So we set a variable–we’ll call it “f”–to create a blank file of that title:
f = open('tx-dip-corr.txt','w')
Next, we want to write the final result of our manipulations–which we contained in a variable called “stripGoogle” (this will change as I figure out the regular expressions for the last part!)–into the file. We do this with the write function:
f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
Finally, we close that file at the end of our script. Note the parentheses on close(): writing “f.close” without them refers to the method but never actually calls it, so the file would not be properly closed:

f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()
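As an aside, Python’s “with” statement achieves the same thing and closes the file automatically, even if an error occurs partway through. A minimal sketch, using a stand-in string in place of the real cleaned text:

```python
stripGoogle = "cleaned text goes here"  # stand-in for the real cleaned text

# The with statement opens the file, runs the indented block, and then
# closes the file for us; no separate close() call is needed.
with open('tx-dip-corr.txt', 'w') as f:
    f.write(stripGoogle)
```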
Wrapping it up
Unfortunately, this tutorial doesn’t deliver you a perfectly clean copy of the Diplomatic Correspondence of the Republic of Texas. A lot of the OCR is pretty bad, so some manual cleanup will still be needed. Some of the OCR errors are consistent, so a simple find/replace in a text editor might help–or even a simple Python script.
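For recurring OCR errors, a short script along these lines can apply a whole list of corrections at once. The error/correction pairs below are invented examples; build your own list from the errors you actually find in the file:

```python
# Hypothetical OCR fixes: map each recurring error to its correction.
ocr_fixes = {
    "Unitod States": "United States",
    "Seoretary": "Secretary",
}

text = "The Seoretary of State of the Unitod States"
for wrong, right in ocr_fixes.items():
    text = text.replace(wrong, right)
# text == "The Secretary of State of the United States"
```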
In the end, the cleaned version still shows “United” and “States” as two of the most common words. So, one could do even more with this. Nonetheless, we have removed some of the issues:
This tutorial has shown how to strip out some of the more common issues with the text of a Google Book. Once you execute the script, you will have the file saved on your computer. Enjoy the text mining! For your reference, here is the entire script, including comments about what does what:
#tx-dip-corr.py

#import libraries
import urllib2
import re

#open file from the web
url = "http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt"
response = urllib2.urlopen(url)
# or, for a local copy: open('C:\\temp\\diplomaticcorre33statgoog_djvu.txt', 'r')

#build up txt without page numbers
txt = ""
for line in response:
    matchObj = re.search("(^[\s]*[0-9]+\s.*)|(.*\s[0-9]+[\s]*$)", line)
    if matchObj is not None:
        pass
    else:
        txt += line

#function to strip the introductory portion and HTML tags
def stripTags(pageContents):
    startLoc = pageContents.find("CORRESPONDENCE HITHERTO UNPUBLISHED.")
    pageContents = pageContents[startLoc:]
    inside = 0
    cleantext = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif (inside == 1 and char == '>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            cleantext += char
    return cleantext

#now execute the function
ctext = stripTags(txt)

#strip out "Digitized by Google" through the whole file
stripGoogle = ctext.replace("Digitized by","").replace("Google","")

#create the file and write results to it
f = open('tx-dip-corr.txt','w')
f.write(stripGoogle)
f.close()
I am eternally grateful to my friend Kelvin Pan for helping me resolve some issues that arose as I tried to figure this out, and to my professor, Fred Gibbs, for his helpful comments.