Data Mining & Distant Reading: Valuable Tools, but Merely Tools
This week’s readings (Week 10) concerned using digital technology to “read” texts in different ways.
I put “read” in quotation marks because this is not what most of us colloquially call reading, that is, what you are doing now, going over my post with your eyes. The term nonetheless applies: it describes what, for example, Google does with this post, going through it with algorithms to fish certain information out of it.
For me, the readings harkened back to those from Week 3, particularly Susan Hockey’s “History of Humanities Computing.” In my post for that week, I mentioned my surprise, given my own experience, at how long a history humanities computing has. Through most of that history, computers were used for the production of knowledge rather than its dissemination, beginning with Father Busa’s use of punch cards to index the works of Thomas Aquinas. This week’s readings focused on new, and not-so-new, ways of using digital technology in humanities research, particularly with texts.
Digital technology has aided knowledge production in the humanities by helping us with the problem of quantity. Beyond the basic function of searching through mountains of material to pull out what we need, the technology lets us find patterns in the material itself and measure them.
As the readings all make clear, however, these tools are merely tools: means to an end, not ends in themselves, nor should they be. To show that, I’ll use an example from my own work.
For my American Revolution seminar at GW in 2006, I wrote a paper comparing ideology in the American Revolution and the contemporaneous Tupac Amaru Rebellion in Peru. Drawing on the historiography in other works, I stated that interest in the Tupac Amaru Rebellion had picked up in the 1960s and 1970s. I revised that paper for my Ph.D. program writing sample in late 2010, just after the debut of Google’s Ngram Viewer.
So just for fun, I used the Ngram Viewer to chart how often the term “Tupac Amaru” appears in the English and Spanish corpora since 1780. The results largely bore out what the historiography said: at least in English, mentions of that combination of terms rise in the 1960s and 1970s. Interestingly, though, the Spanish corpus shows a rise, indeed a peak, in the 1950s.
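For the curious, here is a rough sketch of the kind of counting that sits behind a chart like that: how often a phrase turns up per decade, relative to all the phrases of the same length in that decade. This is not Google’s implementation. The function name and the three toy “documents” are made up purely for illustration, and a real corpus would need far more careful tokenizing.

```python
import re
from collections import defaultdict


def phrase_frequency_by_decade(docs, phrase):
    """Return {decade: share of n-grams in that decade equal to `phrase`}.

    `docs` is an iterable of (year, text) pairs. Hypothetical helper for
    illustration only.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = defaultdict(int)    # occurrences of the phrase per decade
    totals = defaultdict(int)  # all n-grams of the same length per decade
    for year, text in docs:
        decade = (year // 10) * 10
        tokens = re.findall(r"\w+", text.lower())
        # Slide a window of n tokens across the text.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        totals[decade] += len(ngrams)
        hits[decade] += sum(1 for gram in ngrams if gram == target)
    return {d: hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Invented toy data, not drawn from any real corpus.
toy_docs = [
    (1955, "La rebelion de Tupac Amaru"),
    (1968, "Tupac Amaru and the Andean rebellion"),
    (1969, "A history of colonial Peru"),
]

freqs = phrase_frequency_by_decade(toy_docs, "Tupac Amaru")
for decade, share in freqs.items():
    print(decade, round(share, 3))
```

Note what the sketch cannot do: it counts the phrase, but it has no idea which Tupac Amaru a given text is talking about.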
As Dan Cohen correctly points out, using this tool is merely a start; it leads to a host of other questions. For example, why do the English and Spanish corpora peak at different times? As Franco Moretti does with 18th- and 19th-century English novels, we need to look at the social contexts of those times to understand those peaks. In the case of Tupac Amaru, the rise of the term, in the English corpus at least, coincides (not coincidentally) with the rise of anticolonial movements and subaltern history. That, at least, is what the historiographical discussions in recent works say. Why did the term’s frequency rise earlier in the Spanish corpus? That is a question for further research.
To tease out other issues, we need to look more closely at the works counted. For example, the English corpus shows a rise of that combination of terms in the 1990s, which not surprisingly corresponds with the rise in popularity of the rapper Tupac Amaru Shakur and, I’m guessing to a lesser extent, with the Tupac Amaru Revolutionary Movement’s 1996-97 seizure of the Japanese ambassador’s residence in Lima. Only by reading deeper (i.e., reading in the traditional, commonly understood sense of the term) would one be able to learn whether that 1990s rise had to do with increased scholarship about the 1780-83 rebellion or with the prominence of an individual and a group named for that rebellion’s leader.
Thus, my takeaway from this week’s readings: caveats similar to those that apply to the Ngram Viewer apply to other data mining and distant reading tools. The tools help us formulate questions, help us answer those and other questions, and help us make sense of a mass of information. And they are super-cool. But they do not provide answers in themselves. For that, we still need to rely on the oldest tool in the humanities arsenal: the human brain.