The Vanity of Systems: Data Management for Humanists

Last week at the c19 conference in Berkeley, I had the pleasure of sitting on the “Digits, Data, and Dilemmas” panel with some very distinguished folks from the Digital Humanities world. My remarks, “The Vanity of Systems: Data Management for Humanists,” are below:

Unless you’re a librarian or scientist, you probably neither know nor care that last fall the National Science Foundation mandated that all grant applications must include a 2-page plan for the retention and sharing of research data. This has caused a significant kerfuffle in the scientific research communities tasked with making and implementing these plans, and in the library community tasked with helping academic researchers follow through on what they promise the NSF.

This might seem irrelevant to the present conversation, if it weren’t for the fact that the National Endowment for the Humanities Office of Digital Humanities now has a similar requirement. Applicants must prepare a plan to retain and share their data, whatever that may be.

In the humanities, we tend to be resistant to the language of data, and we’re not particularly good at sharing the stuff behind our work product. So as a heuristic exercise this morning, I want to ask us to speculate about what counts as data in a humanities context and how we might meaningfully share that data, and to offer some ways that the lowly bibliography can help us.

Data – however we might define it – is idiosyncratic. The constructed nature of a data set is subject to the complexities of individual disciplines, the whims of individual researchers, and margins of error that range from end user glitches to the vagaries of technology. It may seem like it would be easy to talk data in the sciences and social sciences.

There are standards, taxonomies, established ontologies, and even discipline specific repositories, for some fields. But given that the basic units of data can be anything from specimens, to medical images, to sensor array readings, to manual readings and measurements, even in the supposedly ordered world of the sciences, simply defining data can be daunting.

The National Science Foundation defines data as “everything needed to have reproducible science,” a notion which ostensibly works well for the sciences, but doesn’t seem to translate well to our conversation.

Lacking the requirement for reproducibility, what can meaningfully be considered humanities data, and how is it best managed? And, given that sharing data is now a priority of the NEH ODH, what can meaningfully be shared with others?

Text has been the primary focus of most DH projects, and understandably so. Bodies of text are relatively easy to mine and mark with simple tools, and we generally understand what we extract when we mine text. Beyond simple search functions, we’re also able to do rapid calculations of word count, frequency, and other statistics. Those results still require analysis in the context of the original work to be meaningful, to move from the arena of raw data into the arena of information, and they require further interpretation and argument in order to produce knowledge.
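To make the point about raw counts concrete, here is a minimal Python sketch of that kind of rapid calculation. The function name `word_frequencies`, the deliberately naive tokenizer, and the sample sentence are all illustrative assumptions rather than any particular DH tool.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lowercase word tokens found in `text`."""
    # Naive tokenizer (an assumption): runs of ASCII letters and apostrophes.
    # Real corpora need more care with OCR noise and period orthography.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented sample sentence, used only for illustration.
sample = "The data is raw. The reading makes the data mean something."
freqs = word_frequencies(sample)

print("total words:", sum(freqs.values()))
print("occurrences of 'data':", freqs["data"])
```

The numbers this prints are exactly the raw data described above: they say nothing until someone reads them back against the text.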

However, a method as simple and apparently straightforward as text markup can be idiosyncratic, and textual variables like OCR errors, orthographic eccentricities, and user error can create havoc with even the simplest of systems.

Further, what an individual researcher chooses to mark can be meaningful only to that individual. No text is standardized, even within individual genres or poetic forms.

And macro tools like Google’s Ngram Viewer, which can do distant reading across a vast corpus, are no substitute for contextual and close reading. Don’t take my word for it; just ask the development team.

So text as a data set can be fraught. But we do have a standardized information set that’s common to nearly every text that we produce, and that is the basic unit on which nearly all humanities research is based: the bibliography.

In what’s left of my time, I want to try to make a case for the lowly bibliography as a potential information set that can open up new modes of knowledge production, and even change the way we view our disciplines.

Bibliography works as an information set because it’s a standardized way of encoding specific units of data. Those data points, which include proper names, titles, publishers, cities, and dates for books, can be marked up with further descriptive detail – what information scientists call metadata – to identify each item and make it accessible within a database or to various types of search engine – think, for example, about MARC data in your library catalogue. But each of these data points can also provide the basis for a variety of other explorations, that can themselves be interpreted to produce knowledge, and that don’t require one to self-identify as a digital humanist in order to exploit them.
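One way to picture a bibliography entry as a set of discrete data points is a small record type. The sketch below is only an illustration: the `BibEntry` class, the `column` helper, and every name, city, and date in it are invented placeholders, not real citations and not a cataloguing standard like MARC.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BibEntry:
    """One bibliography entry broken into its data points (hypothetical schema)."""
    author: str
    title: str
    city: str
    publisher: str
    year: int


# Invented placeholder records, not real citations.
entries = [
    BibEntry("Author, A.", "First Example Title", "City One", "Printer X", 1801),
    BibEntry("Author, B.", "Second Example Title", "City Two", "Printer Y", 1805),
    BibEntry("Author, A.", "Third Example Title", "City One", "Printer X", 1812),
]


def column(entries, field):
    """Pull a single data point (for example 'city') out of every entry."""
    return [asdict(entry)[field] for entry in entries]


print(column(entries, "city"))
print(column(entries, "publisher"))
```

Once entries are held this way, each field can be pulled out and explored on its own, which is all the examples that follow really depend on.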

The easiest example to visualize quickly is cities. It’s nothing, really, to take the first 5 entries of primary sources in a given bibliography and slap them into Google Maps. What this produces is a simplistic GIS visualization of a book’s production, in this case illustrated editions of the novel Charlotte Temple, which helps underscore the regionalization of print in a given area at a specific point in time.

In a more sophisticated system, you could pair that GIS data with dates and visualize distribution over time. It’s necessary to interpret that information to create an argument about what it means, but the basics to construct that sort of narrative are already present in the bibliography.
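Short of a full GIS system, the pairing of place and date can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the `by_decade` function is hypothetical, and the (city, year) pairs are invented placeholders rather than real imprint data.

```python
from collections import defaultdict

# Invented (city, year) pairs standing in for imprint data from a bibliography.
imprints = [
    ("City One", 1801),
    ("City Two", 1805),
    ("City One", 1812),
    ("City One", 1808),
    ("City Two", 1823),
]


def by_decade(imprints):
    """Map each decade to a {city: number of imprints} dictionary."""
    table = defaultdict(lambda: defaultdict(int))
    for city, year in imprints:
        decade = (year // 10) * 10  # e.g. 1808 falls in the 1800 decade
        table[decade][city] += 1
    # Convert to plain dicts, ordered by decade, for readable output.
    return {decade: dict(table[decade]) for decade in sorted(table)}


for decade, cities in by_decade(imprints).items():
    print(decade, cities)
```

A table like this is still only information; what a shift from one city to another means is a matter for interpretation and argument.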

Examining the list of printers and publishers in a given bibliography can set the parameters for research that can lead to a greater understanding of relationships between printers, or identify patterns of rivalry and the pirate trade. Of course this requires additional primary source research, but the seed from which that process can grow is already present in the bibliography. And if this feels a little like restoring to bibliography its original place as the disciplinary name for History of the Book, then so much the better.

Where rethinking the bibliography can potentially change how we think about our discipline is in secondary sources. Using the proper names in the bibliography (in this case authors and editors, which is to say us), one can construct a visualization of relationships that facilitates the development of intellectual or disciplinary genealogies; a little bit of extrapolation and we can see the potential to identify institutions that are particularly influential. Syncing bibliographies with MARC or LOC records can also help us understand relationships between disciplines, and better identify which disciplines are truly interdisciplinary, and which continue to be self-replicating, despite the changing ecosystem of the academy.
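The raw material for such a visualization of relationships is simply a count of which names appear together. The Python sketch below shows one possible way to build it; the `collaboration_edges` function and the scholar names are invented for illustration, and a real project would hand these counts to a network or graphing tool.

```python
from collections import Counter
from itertools import combinations

# Invented placeholder names: each inner list is the set of authors and
# editors credited on one secondary source.
works = [
    ["Scholar A", "Scholar B"],
    ["Scholar B", "Scholar C"],
    ["Scholar A", "Scholar B", "Scholar C"],
    ["Scholar D"],
]


def collaboration_edges(works):
    """Count how often each pair of names appears on the same work."""
    edges = Counter()
    for names in works:
        # Sorting makes (A, B) and (B, A) count as the same relationship.
        for pair in combinations(sorted(set(names)), 2):
            edges[pair] += 1
    return edges


edges = collaboration_edges(works)
for (first, second), count in sorted(edges.items()):
    print(f"{first} -- {second}: {count}")
```

Each pair is a line in the eventual network diagram, and the count is its weight; single-author works contribute no lines at all, which is itself worth noticing.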

From analysis of Electronic Theses and Dissertations, to examinations of the outputs of the faculty of a given department, looking closely at this information can help us better understand what a department is actually good at. It can help us identify trends in interdisciplinarity, and help us understand the networks and genealogies that enrich our disciplines. And finally, it has, I think, the potential to help us challenge the centrality of citation indices in the tenure and promotion process to better identify the strengths and influence of a given work to the academy.

In the changing world of the 21st century university, rethinking bibliography is just one step toward helping others recognize the importance of humanist scholarship, and toward helping our peers understand the merit and value of the new things that technology is helping us do.

Creative Commons License
The Vanity of Systems: Data Management for Humanists by Spencer D C Keralis is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.


