David Patrick McKenzie

Digital Public Historian

Data Mining & Distant Reading: Valuable Tools, but Merely Tools

This week’s readings (scroll to Week 10) concerned using digital technology to “read” texts in different ways.

I use the term “read” in quotation marks to draw attention to it, as this is not what many of us colloquially call reading–that is, what you are doing now, going over my post with your eyes. That term nonetheless applies–it describes what, for example, Google is doing with this post, going through it with algorithms to fish certain information out of it.

For me, the readings harkened back to those from week 3, particularly Susan Hockey’s “History of Humanities Computing.” In my post for that week, I mentioned my surprise, given my own experience, at how long a history humanities computing has. Through most of that history, computers were used for the production of knowledge rather than its dissemination, beginning with Father Busa’s use of punchcards to index the works of Thomas Aquinas. This week’s readings focused on new, and not-so-new, ways of using digital technology in humanities research, particularly with texts.

Digital technology has aided knowledge production in the humanities by helping us with the problem of quantity. Besides the basic function of searching through mountains of material to pull out what we need, the technology enables us to find patterns in the material itself and to measure them.

As the readings all make clear, however, these tools are merely tools–means to an end, not ends in themselves. Nor should they be ends in themselves. To show that, I’ll use an example from my own work here.

For my American Revolution seminar at GW in 2006, I wrote a paper comparing ideology in the American Revolution and the contemporaneous Tupac Amaru Rebellion in Peru. Drawing on the historiography in other works, I stated that interest in the Tupac Amaru Rebellion had picked up in the 1960s and 1970s. I revised that paper for my Ph.D. program writing sample in late 2010–just after the debut of Google’s N-gram Viewer.

So just for fun, I used the N-gram Viewer to find instances of the term “Tupac Amaru” in the English and Spanish corpuses since 1780. The results largely bore out what the historiography said: at least in English, a rise in mentions of that combination of terms in the 1960s and 1970s. Interestingly, though, the Spanish corpus shows a rise–indeed, a peak–in the 1950s.

As Dan Cohen correctly points out, using this tool is merely a start. Indeed, it leads to a host of other questions. For example, why do the English and Spanish corpuses have their peaks at different times? As Franco Moretti does with 18th- and 19th-century English novels, we need to look at the social contexts of those times to understand those peaks. In the case of Tupac Amaru, the rise of the term, in the English corpus at least, coincides–not coincidentally–with the rise of anticolonial movements and subaltern history. That’s what the historiographies in recent works said, at least. Why an earlier rise of the term’s frequency in the Spanish corpus? That is a question for further research.

To tease out other issues, we need to look more closely at the works cited. For example, the English corpus shows a rise of that combination of terms in the 1990s–not surprisingly, corresponding with the rise in popularity of the rapper Tupac Amaru Shakur, and, I’m guessing to a lesser extent, the Tupac Amaru Revolutionary Movement’s 1996–97 seizure of the Japanese ambassador’s residence in Lima. Only by reading deeper–i.e., reading in the traditional, commonly understood sense of the term–would one be able to learn whether that 1990s rise had to do with increased scholarship about the 1780-83 rebellion or the prominence of an individual and a group named for that rebellion’s leader.
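For the curious, here is a rough sketch of the kind of counting that sits behind a chart like this. It is not Google's actual method, only a minimal illustration under my own assumptions: the function name `phrase_frequency_by_year`, the `exclude` option, and the three-sentence `toy_corpus` are all made up for the example. It measures what share of each year's word pairs match the phrase, and it can crudely drop any document that mentions an excluded word such as "Shakur."

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def phrase_frequency_by_year(docs, phrase, exclude=()):
    """Return {year: share of that year's n-grams equal to `phrase`}.

    docs    -- iterable of (year, text) pairs (a hypothetical toy corpus)
    phrase  -- the term to chart, e.g. "Tupac Amaru"
    exclude -- words whose presence drops a whole document from the count
    """
    target = tokenize(phrase)
    n = len(target)
    excluded = {word.lower() for word in exclude}
    hits = defaultdict(int)
    totals = defaultdict(int)

    for year, text in docs:
        tokens = tokenize(text)
        if excluded & set(tokens):
            continue  # skip documents that mention an excluded word
        # Every run of n consecutive words in the document.
        ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
        totals[year] += len(ngrams)
        hits[year] += sum(1 for gram in ngrams if gram == target)

    # Years with no countable n-grams are left out of the result.
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Invented sentences, standing in for a real corpus.
toy_corpus = [
    (1972, "The Tupac Amaru rebellion shook colonial Peru"),
    (1996, "Tupac Amaru Shakur released a new album"),
    (1996, "Scholars revisit the Tupac Amaru rebellion of 1780"),
]

print(phrase_frequency_by_year(toy_corpus, "Tupac Amaru"))
print(phrase_frequency_by_year(toy_corpus, "Tupac Amaru", exclude=["Shakur"]))
```

Even with the filter, the numbers only say how often the words appear. They say nothing about what the remaining works argue, and a filter this blunt would also throw out a history book that mentions the rapper in passing.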

Thus, my takeaway from this week’s readings: caveats similar to those that apply to the N-gram Viewer apply to other data mining and distant reading tools. The tools help us formulate questions, help us answer those and other questions, and help us make sense of a mass of information. And they are super-cool. But they do not provide answers in themselves. For that, we still need to rely on the oldest tool in the humanities arsenal: the human brain.


  1. David,

    I’m intrigued by your example of putting the Ngrams to work. I think you framed it best by saying “just for fun” and that you were able to take the information gleaned from the graph and make it relevant to what you already knew about the movements and their social context throughout time. Do you think the information would have been as relevant or could be placed into a scholarly work without your previous understanding of the social and temporal context? What if you had started with the Ngram? I can see what you are saying about Moretti advising to seek out “why” the spikes in frequency are happening in vastly different time periods. Maybe it does suggest a way that the information about your researched events has a broader effect beyond just the immediate and, in your case, could potentially answer the question of why the original event is still relevant today (something that a microstudy may not incorporate, but should, into the greater narrative of historical significance). Thank you for including your experience and how it helped your analysis (or confirmed it). This helps explain the potential utility of the tool for me!

  2. Thanks for sharing your experience with Ngrams. Your point about the rapper really reinforces my feeling that Ngram would be more useful if it had more search parameters; you could then do a search on “Tupac Amaru” -Shakur and thereby eliminate the rapper. It would also be nice to have one graph to compare the Spanish and English results. I searched Impressment 1800-1850 and the UK and US results peak at different times (predictably).

    Drawing on what Sheri said – can you think of a situation in which going to ngrams first would make sense or might help focus research?

  3. Thanks both of you for your comments. As for when I might go to N-grams first… For my dissertation topic, I may do a search of certain terms, like “gringo,” in the Spanish corpus, say 1776-1846. As Megan said, though, it’d be nice if N-grams had more advanced search ability. For this term, it would be nice to be able to limit my search to just works produced in Mexico. Then I could see how much that pejorative was present in the lead-up to the U.S.-Mexican War. Its presence in the Spanish-language corpus yields much less interesting information than would a search of just the Mexican corpus. Even then, if it only included books, that wouldn’t tell me as much, as most of the flame-fanning took place in pamphlets and newspapers. Even if those were included, of course, the graph would only tell me so much, as I would need to go into the works to see the exact context of that term. So, again, it would present a picture, but by no means the entire picture.

    With all those caveats, though, I could see where this tool might be useful as a start in charting prevalence of a certain derogatory term.

    As a side note: At the very least, an N-gram search of that term puts to rest the idea I heard in Peace Corps–that that term originated in the U.S.-Mexican War. It is present much, much earlier!

  4. Then again, for my purposes, a better place to look is this–a collection of political manifestos from Mexico, 1821-1876: http://arts.st-andrews.ac.uk/pronunciamientos/

  5. “For my American Revolution seminar at GW in 2006 . . .”

    Man, that was a long time ago.

    The N-gram viewer seems like a really useful tool. I just have to figure out something it could help me with. Gotta rack my brain . . .
