David Patrick McKenzie

Digital Public Historian

Building Two Databases for My Dissertation…?

Wow, it’s been a long time since I’ve blogged outside of a class. As much as I admire the people who blog through their dissertations and have meant to do so, well… Good intentions and all that. I hope to blog my progress in the future, though, both as a way for me to explore and refine ideas, and, perhaps more importantly, to share the process for those who are currently undertaking or might undertake this type of work in the future.

For this first post, I’m delving into an area with which I’m struggling at the moment: Structuring and using data.


Data in my Dissertation

When the History Department at George Mason University accepted my prospectus in December 2016, my topic was the experiences of U.S. and Mexican travelers and migrants between each other’s countries from the start of Mexico’s Wars of Independence until the two nations went to war in 1846. I planned the work to be mostly qualitative, using a series of case studies of individual experiences to illuminate broader trends.

Although I’ve long had an interest in digital methods, I didn’t want to do digital work for the sake of doing so. I saw some uses for digital methods:

But I didn’t have concrete plans.

That said, one of the pieces with which I’ve long struggled in pondering my topic is how to integrate quantitative work with qualitative analysis, to explore what a dataset could yield that individual case studies could not.

As I began to research, I found a data source that could help, thanks to Harold Dana Sims’s The Expulsion of Mexico’s Spaniards, 1821-1836. In 1820, the United States began to require inbound vessels from foreign ports to submit passenger manifests. These manifests are available on microfilm at the U.S. National Archives nearby. Then, even better, I found that these manifests were also available via Ancestry.com and FamilySearch. For someone working full-time Monday through Friday, this was golden, especially following the tragic (but, given the federal budget situation, sadly understandable and unsurprising) decision of the National Archives to end Saturday hours in summer 2017.

A black-and-white ship manifest of the Schooner Sally Ann, which sailed to New Orleans from Rio Grande, Mexico, in October 1826. The manifest contains the names of five passengers.

Passenger manifest of the Sally Ann.

Each manifest contains:

Ship information:

  • Ship Name
  • Port of Departure
  • Port of Arrival
  • Date of Arrival

Passenger information:

  • First and last names
  • Age
  • Sex
  • Occupation
  • Country to Which They Belong
  • Country in Which They Intend to Become Inhabitants

Why Ship Manifests?

What could a dataset based on these ship manifests add to this dissertation? For one thing, they could yield numbers of ships and passengers going between the United States and Mexico between 1820 and 1846.

Tracking those raw numbers could help me identify ebbs and flows in travel and trade, which I could then investigate to determine why. I could also see a more detailed picture of how traffic between individual ports ebbed and flowed over time, and, again, investigate why. I also quickly began to recognize some distinctive names in the records, yielding clues as to who might have business interests in which places and could make for a prime case study.
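As a rough illustration of what "tracking raw numbers" could look like once voyages are in structured form, here is a minimal Python sketch. The record layout and every row in it are invented placeholders, not transcribed manifest data, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical voyage records: (departure port, arrival port, arrival date).
# These rows are invented placeholders, not real manifest entries.
voyages = [
    ("Port A", "New Orleans", date(1822, 3, 14)),
    ("Port A", "New Orleans", date(1822, 9, 2)),
    ("Port B", "New Orleans", date(1823, 5, 20)),
    ("Port A", "New York", date(1823, 11, 8)),
]


def voyages_per_year(records):
    """Count arrivals per calendar year."""
    return Counter(arrival.year for _, _, arrival in records)


def voyages_per_route_year(records):
    """Count arrivals per (departure, arrival, year) combination."""
    return Counter((dep, arr, arrival.year) for dep, arr, arrival in records)


by_year = voyages_per_year(voyages)
by_route = voyages_per_route_year(voyages)

for year in sorted(by_year):
    print(year, by_year[year])
```

Counts like these only flag the ebbs and flows; explaining them would still take the qualitative work described above.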

For example, a merchant or hatter (depending on which manifest) named John Baptiste Passement showed up frequently in voyages between New Orleans and various Mexican ports, most frequently Campeche, in the early 1820s. Conducting a further search of his name on Ancestry.com yielded a will listing creditors in Mexican cities.

I could also, on a more advanced level, even find social networks among travelers. Who traveled together multiple times? How were these people connected?
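The "who traveled together multiple times" question can be sketched as a pair count over shared voyages. This is only one possible approach, assuming each person has already been resolved to a single identifier; the voyage and person IDs below are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of voyage ID -> passenger IDs (invented placeholders).
passengers_by_voyage = {
    "voyage-1": ["person-a", "person-b", "person-c"],
    "voyage-2": ["person-a", "person-b"],
    "voyage-3": ["person-c", "person-d"],
}


def co_travel_counts(by_voyage):
    """Count how many voyages each pair of people shared."""
    pairs = Counter()
    for people in by_voyage.values():
        # sorted() gives each pair a stable order; set() drops duplicate IDs.
        for pair in combinations(sorted(set(people)), 2):
            pairs[pair] += 1
    return pairs


def repeat_companions(by_voyage, minimum=2):
    """Return the pairs who traveled together at least `minimum` times."""
    counts = co_travel_counts(by_voyage)
    return {pair: n for pair, n in counts.items() if n >= minimum}


repeats = repeat_companions(passengers_by_voyage)
print(repeats)
```

The resulting pairs could serve as edges in a network graph, with the count as the edge weight.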

I could also see if demographic profiles changed over time.

What Format?

Since there seemed to be many possibilities for what I could do with this data, I created an Excel spreadsheet as a first step. Pretty quickly, the spreadsheet began to grow unwieldy. For one thing, I was entering a lot of repetitive information: I put each person on a new line, which meant re-entering the same details about the ship voyage on every row. I suspected I needed something more sophisticated. But after setting up a custom MySQL database in my Clio 3 class in 2012, I wasn’t ready to do that again. If the data is a mid-sized nail, an Excel spreadsheet is too small a hammer, but a custom MySQL database is a sledgehammer. I needed something in between.

A screenshot of a spreadsheet of data about ships coming into U.S. ports from Mexico in the early 1820s, showing a large number of repeated cells.

What my spreadsheet began to look like. Note the repeated cells.
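The repeated cells are the classic sign that one flat table is holding two kinds of things: voyages and passengers. Below is a minimal sketch of splitting such rows apart in Python. The column names and values are assumptions made for illustration, not the actual spreadsheet layout.

```python
# Hypothetical flat rows, one per passenger, with voyage details repeated.
# Column names and values are illustrative placeholders.
flat_rows = [
    {"ship": "Ship X", "departure": "Port A", "arrival": "New Orleans",
     "arrival_date": "1822-03-14", "name": "Passenger One", "age": 30},
    {"ship": "Ship X", "departure": "Port A", "arrival": "New Orleans",
     "arrival_date": "1822-03-14", "name": "Passenger Two", "age": 25},
    {"ship": "Ship Y", "departure": "Port B", "arrival": "New Orleans",
     "arrival_date": "1823-05-20", "name": "Passenger Three", "age": 41},
]

VOYAGE_FIELDS = ("ship", "departure", "arrival", "arrival_date")


def normalize(rows):
    """Split flat rows into a voyage table and a passenger table."""
    voyage_ids = {}  # voyage field values -> assigned voyage ID
    passengers = []
    for row in rows:
        key = tuple(row[field] for field in VOYAGE_FIELDS)
        # Reuse the ID if this voyage was seen before, else assign the next one.
        voyage_id = voyage_ids.setdefault(key, len(voyage_ids) + 1)
        passengers.append(
            {"voyage_id": voyage_id, "name": row["name"], "age": row["age"]}
        )
    voyage_table = [
        dict(zip(VOYAGE_FIELDS, key), voyage_id=vid)
        for key, vid in voyage_ids.items()
    ]
    return voyage_table, passengers


voyage_table, passenger_table = normalize(flat_rows)
print(len(voyage_table), "voyages;", len(passenger_table), "passengers")
```

Each voyage's details are now stored once, and each passenger row points to its voyage by ID rather than repeating them.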

As I was wrestling with this question, I attended THATCamp DC 2017. On the spur of the moment, I proposed a session on dealing with historical data, hoping that my experiences could help others and that I could get some advice on what to do with mine.

Thankfully, someone there (I don’t remember who, as the person I suspect doesn’t think it was him) suggested Heurist. Heurist, as I learned, is a database platform created at the University of Sydney specifically for humanities research. It seemed that this would do the trick for me.

Indeed, it has.

Setting Up My Heurist Database

Amazingly, within 24 hours of my signing up for Heurist, I received an email from the project’s lead, Dr. Ian Johnson. I told him about what I was trying to do and shared my spreadsheet. He and I then exchanged emails and held a Skype call in which I sat on my balcony in Arlington, Virginia, he was in Paris, and we were pinging a server in Sydney. After some work to figure out how to structure the data, he came up with a scheme based on how the ship manifests are structured and what I might do with the data in the dissertation:

Chart showing structure of David McKenzie's Heurist database. The database includes five tables: Trip, Person, Voyage (of Ship), Ship, and Place.

The structure of my Heurist database.
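For readers who think in code, the five record types can be approximated with Python dataclasses. This is a sketch only, not how Heurist stores anything internally. The field names, and the choice to put age and occupation on the Trip rather than the Person, are assumptions based on the manifest columns listed earlier.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Place:
    name: str


@dataclass(frozen=True)
class Ship:
    name: str


@dataclass(frozen=True)
class Voyage:
    """One sailing of a Ship between two Places."""
    ship: Ship
    departure: Place
    arrival: Place
    arrival_date: date


@dataclass(frozen=True)
class Person:
    first_name: str
    last_name: str
    sex: Optional[str] = None


@dataclass(frozen=True)
class Trip:
    """One Person's passage on one Voyage (assumed field placement)."""
    person: Person
    voyage: Voyage
    age: Optional[int] = None
    occupation: Optional[str] = None
    country_belonging: Optional[str] = None
    country_intended: Optional[str] = None


# Invented placeholder records, not transcribed from any manifest.
voyage = Voyage(Ship("Ship X"), Place("Port A"), Place("New Orleans"),
                date(1822, 3, 14))
trip_a = Trip(Person("Passenger", "One"), voyage, age=30, occupation="merchant")
trip_b = Trip(Person("Passenger", "Two"), voyage, age=25)
print(trip_a.voyage.ship.name, trip_a.voyage.arrival.name)
```

Because both trips reference the same Voyage object, the ship and port details are recorded once no matter how many passengers were aboard.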

After I used OpenRefine to clean up my spreadsheet, he imported the data for me as a way to test out the importing features.

I cannot say “thank you” enough to him for all that he did.

And Now the Fun Part…

Since getting the database set up in the late spring, I’ve been going in spurts inputting data. Sadly, although I inquired on Twitter whether Ancestry or FamilySearch have these manifests available for bulk download, it seems I’ll be inputting manually (which, given the state of the OCR of names in particular, might not be a bad thing). I input two types of voyages:

  • For those inbound from Mexican ports, I input data on the voyage itself, as well as on all passengers.
  • For those inbound from other ports but with Mexicans on board, I input data about the voyage and then only input data about the Mexicans on board.

I realize that this method will not allow me to answer the question of what percentage of inbound voyages to a particular port came from Mexico, but I’ve decided that the time required for that additional data entry would not be worth the questions it could answer.

Screenshot of the Heurist database, showing David McKenzie's "Add Trip (of Person)" interface.

The interface of a Trip record.

My workflow starts with selecting “Add Trip (of person).”

The Trip is the basic unit of my database—each Person takes a Trip on a Voyage of a Ship.

If I’m starting on a new manifest, I create a new Voyage record (often involving creating a new Ship record, as well).

I then check the person’s name to see if the name already exists in the database. Often, this is a guessing game as to whether a person is the same or not. Are the birthdates listed similar? Does the person who might be the same have a record of traveling between the same ports? Is the name unique enough to lessen the possibility that the records are referring to different people?
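Part of that guessing game could be turned into a first-pass filter that only suggests candidates for a human to review. The sketch below is an assumption-heavy illustration: it compares names with Python's standard `difflib.SequenceMatcher` and compares birth years implied by the age given on each manifest. The thresholds, field names, and sample records are all invented.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0-1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def implied_birth_year(arrival_year, age):
    """Estimate a birth year from the age recorded on a manifest."""
    return arrival_year - age


def possible_match(new, existing, name_threshold=0.85, year_tolerance=3):
    """Flag two records as possibly the same person (thresholds are guesses)."""
    if name_similarity(new["name"], existing["name"]) < name_threshold:
        return False
    gap = abs(
        implied_birth_year(new["year"], new["age"])
        - implied_birth_year(existing["year"], existing["age"])
    )
    return gap <= year_tolerance


# Invented placeholder records, not real passengers.
existing = {"name": "Passenger One", "year": 1822, "age": 30}
same_person = {"name": "passenger one", "year": 1824, "age": 33}
too_old = {"name": "Passenger One", "year": 1824, "age": 60}

print(possible_match(same_person, existing))
print(possible_match(too_old, existing))
```

A filter like this cannot replace the judgment calls about shared routes or unusual names; at most it narrows the list of existing records worth checking.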

After making that judgment, I then proceed, if it’s a new person, to copy over the information that I can glean from the manifest about that person. Finally, after the Person record is created, I input the rest of the data about that person on the trip.

I found a large number of Voyages—roughly 400—between Mexican ports and New Orleans just from 1820 to 1826. I started to question whether creating a comprehensive dataset would be worth the effort.

When I discussed this database project with faculty members and classmates at George Mason University’s Early Americas Workshop, opinion in the room was divided on that question.

I’m still debating, although I’m leaning toward plowing ahead in creating a comprehensive dataset to really be able to see and show change over time.

I’d love advice on this question!

Right now, with working full-time, I’ve taken the advice of others and given up on doing brain-intensive work on weeknights; instead, I’ll either read secondary literature or input data (often with a basketball game on in the background), while reserving primary source research and, eventually, writing for the weekends. It’s still taking quite a bit of time, though!

To take a break from the intensive data entry of ships coming into New Orleans, I’ve begun inputting ships coming into New York. So far, I’ve found many fewer (which, admittedly, creates the opposite problem—sorting through those manifests can be mind-numbing, although it does get me thinking about the differences in traffic).

What’s Next for the Manifests?

In the time since I started this database, my research focus has narrowed. My advisor, Joan C. Bristol, and I agreed that the original topic was too broad and ambitious. We agreed that I would instead focus on traffic going one way: U.S. migration into the interior of Mexico. This is because I found I was developing an argument about U.S. commercial expansion into the interior of Mexico being related to but distinct from the migrations into border regions like Texas that eventually resulted in U.S. territorial expansion. Preliminarily, I suggest that this secondary migration laid the groundwork for the formation of an informal, as opposed to territorial, U.S. empire in Latin America (more on this in another post).

This has led me to question the value of ship manifest data for that topic. I still think being able to quantify shipping and movement will be valuable, as will being able to pinpoint comings and goings of U.S. and Mexican nationals. I could still see connections between U.S. and Mexican ports, and find more people who were U.S. nationals but resident in Mexico.

What are your thoughts on how this data could be valuable?

And Possibly Another Database…

Meanwhile, examining how U.S. migrants to the interior of Mexico laid the groundwork for a future informal, commercial empire brought me back to the database that I already began constructing over five years ago: Tracking U.S.-Americans who filed claims against the Mexican government.

When I set up the database, I mainly looked at what the extensive files of the 1839 (15 boxes) and 1849 (30 boxes) claims commissions could tell us about the claims themselves: who the claimants were, summaries of the cases, and amounts of claims. But I’ve since realized, thanks to rethinking the topic, that I might have been asking the wrong questions, and thus extracting the wrong data.

While many of these files have to do with incidents involving merchants who simply traveled to Mexico but did not stay, there are also many about U.S.-Americans who settled in Mexico’s interior. Many of these files contain information such as when they settled in Mexico, where, their occupations, and demographics.

As part of the narrowed focus, I’m realizing that this data could prove valuable in painting a portrait of the U.S.-American diaspora in Mexico’s interior during this era, and how that group of people changed over time. What patterns exist? Where did the U.S.-Americans who settled in Mexico come from? Where did they settle? When? With whom did they interact? This could allow for a good number of visualizations that can paint a broader picture, beyond qualitative exploration of individual experiences.

Looking for Advice

I would love advice on how to build and use this dataset.

Should I use the records of claimants as my main source, keeping an intentionally limited but self-selected data set?

Or should I cast a wider net, knowing that it would be nearly impossible to create a comprehensive set?

And furthermore, should I scrap the claims database that I already started to create, or simply change some of the categories?

Should I create a new Heurist database and import the previous custom database, or, knowing that some of the same people are likely to be passengers on vessels (indeed, I’ve found my central case study, John Baldwin, on at least two ship manifests), add them to the ship manifest database?

Lots of questions for going forward…

1 Comment

  1. I don’t have any advice other than to keep going and keep exploring and expect there to be more developments and shifts in your plans as you keep moving forward. This research is exciting and important. It also sounds like the groundwork for future projects, too. I will suggest (and this is something I had to make myself do) that you eventually just decide what you are going to do, and then one day decide to just stop and declare the dissertation finished.
