Don't Tell Your Computer Where You've Been
How a digital humanities researcher protects his subject's privacy and his own
This is part two of my conversation with Scott B. Weingart, the director of the Navari Family Center for Digital Scholarship at the University of Notre Dame. You can go here to read part one of our discussion of Kyle Chayka’s essay “The Uncanniness of Algorithmic Style.”
Leah: Let’s start off with another digital humanities ethics question! In ordinary life, most of us rely on security through obscurity (and even “rely” is too strong a way to put it!). We assume no one is going to put in the legwork to tape our conversations, track our movements, etc.
But once you bring computing to bear, it’s much easier to surface documents, comments, etc. that might have required hours of graduate student research and travel to different archives. This has come up in questions about drones—do advances in technology change our reasonable assumptions about privacy, or should technology use be restricted to match our prior expectations of privacy?
With the power to rifle through any digitized corpus, how do you think about the expectations of the people who made those texts?
Scott: As a historian, most people I care about died over three hundred years ago, so I never felt much guilt rifling through their mail. That said, historians’ perspectives on privacy are changing, and these days you’ll find many who try to be sensitive to their subjects’ cultural norms around privacy.
Facebook’s Mark Zuckerberg and Google’s Eric Schmidt both famously claimed privacy is less important in today’s world. With the privacy-eroding Patriot Act following September 11th, that stance isn’t a surprise. And when given a choice between privacy and convenience, those who claim to prioritize privacy still often choose convenience. Researchers call this the privacy paradox.
This feels like a step in the wrong direction. When people opt for convenience over privacy, it’s often either because they’re not fully informed producers of information, or because it’s difficult not to make the more convenient choice. With Facebook, choosing not to have an account can have negative social repercussions, and the interface is addictive by design.
When I say people aren’t fully informed, I mean that data collectors and aggregators intentionally hide their practices behind dense shrinkwrap agreements and nondisclosure clauses to avoid creeping people out. It’s easy not to think about giving up personal data when you don’t know how it’s being used and by whom.
But even when there are explicit expectations of privacy, data analysis can easily circumvent those. When 70,000 OKCupid accounts were scraped and released by researchers, I found that I could easily identify the real names of roughly 10,000 of those users, alongside their sexual preferences, kinks, and whatever else they’d opted to include. That’s because gender, month/year of birth, and ZIP code are sufficient to uniquely identify 5% of U.S. citizens, and other context clues available in most OKCupid profiles could help with the rest.
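The kind of check involved is simple to sketch: count how many records share each combination of quasi-identifiers, and any record whose combination appears exactly once is uniquely identifiable. This is a minimal illustration, not the actual OKCupid analysis; the field names and toy records are invented.

```python
from collections import Counter

# Hypothetical toy records; the three fields mirror the quasi-identifiers
# discussed above (gender, month/year of birth, ZIP code).
records = [
    {"gender": "F", "birth": "1985-03", "zip": "46556"},
    {"gender": "F", "birth": "1985-03", "zip": "46556"},  # shares a combo, so not unique
    {"gender": "M", "birth": "1990-11", "zip": "46556"},
    {"gender": "F", "birth": "1972-07", "zip": "90210"},
]

def uniquely_identifiable(rows, keys=("gender", "birth", "zip")):
    """Count rows whose quasi-identifier combination appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    return sum(1 for r in rows if combos[tuple(r[k] for k in keys)] == 1)

print(uniquely_identifiable(records))  # → 2
```

On a real dataset of millions of rows, the same one-pass counting reveals what fraction of people stand alone in the data—which is exactly why "anonymized" releases with these fields attached are so fragile.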
In fact, even when you opt out of social media entirely, a you-shaped hole appears in the data. Some years ago, the military reached out to me for help designing a visualization that could warn them when commercial planes were hijacked by spotlighting unusual flight paths. They gave me flight data for every plane except Air Force One, since I don’t have security clearance. But, at least at the time, flights couldn’t take off or land near Air Force One. Can you guess what the blank spot that moved around the map was? When studying social data, it’s not unusual to see someone’s silhouette in their absence.
Being a data-literate netizen (wow, what a dated term) has changed how I think about the ethics of primary sources. As a historian-detective, I bring to bear every possible tool to uncover the past; on the other hand, I’m increasingly uncomfortable inferring what was never meant to be inferred.
For the distant past, maybe it’s not a big deal—I’m still uncertain—but when studying contemporary society I now shy away from non-consensual inferences. As an example, I once computationally guessed people’s genders in order to study diversity and bias. After a trans friend pointed out the numerous issues associated with that practice, I now instead rely on agreed-to gender disclosures. I’m still learning.
People’s expectations of privacy are complicated, change over time and context, and rarely anticipate future uses. Until research ethics frameworks catch up with our changing data practices and the murky waters of semi-informed consent, we in this business need to take extra care.
I don’t know the solution, but part of it has to be teaching the future consumers, producers, and collectors of data about the issues at hand. I did that with my students at Carnegie Mellon, and I’ll continue those efforts now that I’m at Notre Dame.
Leah: Do you try to make your own life legible for researchers to come? I tend to think about this in my own writing in terms of reducing linkrot, or at least leaving enough context behind for people to reconstruct what broken links are pointing to.
But I don’t try to make my personal life legible, and I certainly hope I never attract the attention of a biographer! Mind you, I don’t go as far as Vernor Vinge’s fictional “Friends of Privacy,” who try to create enough of a swirl of rumor that any truth is plausibly deniable.
I know you’ve sometimes operated with a locked Twitter. Do you take any actions in your personal or professional life to make your life easier to parse for future digital humanities researchers?
Scott: As the informed netizen I described earlier, I am paranoid about personal privacy. My internet traffic often travels through layers of Tor nodes or VPNs, and I’ve got more disconnected email addresses and accounts than is sensible. It’s a problem.
And yet my research is on the Republic of Letters, a half-millennium-old community that historians can study only because its members preserved their correspondence. Without those old letters, we’d be in the dark.
You can see the dilemma. Most people aren’t donating their emails to a favorite local archive upon their death (a notable exception being Susan Sontag, whose old files you can access by visiting UCLA and checking out a special laptop). And in the few situations when we do have good digital records, they’re at high risk of disappearing within a few short years.
I do at least try to make it easier for future scholars to access my research. There are increasingly good standards around digital preservation, including saving things in simple formats like plaintext and comma-separated values files and storing them in university repositories with a preservation commitment. For websites I want to keep alive, I send them to the Internet Archive.
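The “simple formats” advice is easy to put into practice: plaintext and CSV need no special software, so future scholars can round-trip the data with whatever tools they have. A minimal sketch, with invented records standing in for real research data:

```python
import csv
import io

# Hypothetical research records, kept in a format any future tool can read.
letters = [
    {"sender": "Erasmus", "recipient": "Thomas More", "year": "1517"},
    {"sender": "Leibniz", "recipient": "Clarke", "year": "1715"},
]

# Write plain comma-separated values (here to an in-memory buffer;
# in practice this would be a file headed for a university repository).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sender", "recipient", "year"])
writer.writeheader()
writer.writerows(letters)

# Round-trip check: the CSV reads back without any proprietary software.
restored = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(restored == letters)  # → True
```

The point isn’t the code itself but the property it demonstrates: decades from now, nothing more exotic than a text editor is required to recover the data.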
It’s good and equalizing that tomorrow’s historians will know less about a few famous individuals and more about random Twitter users collected by the Library of Congress. But our collective memory of the early 2000s will undoubtedly be weird and different from most of the rest of recorded history.
I wish I had better answers for both of this week’s questions. I guess I increasingly opt for privacy, both my own and my historical subjects’, and trust future historians will figure it out just like we have. A collective cultural memory is critically important, but our society has other priorities too, like ensuring oppressive governments can’t identify and persecute practitioners of a forbidden faith. Despite my cynical demeanor I’m an optimist at heart, so I believe we’ll reach a healthy balance, but it’s all a bit too new to figure out exactly how.