Using algorithms to analyze the literary canon

A new book describes efforts to use algorithms to discover what is in and out of the literary canon:

There’s no single term that captures the range of new, large-scale work currently underway in the literary academy, and that’s probably as it should be. More than a decade ago, the Stanford scholar of world literature Franco Moretti dubbed his quantitative approach to capturing the features and trends of global literary production “distant reading,” a practice that paid particular attention to counting books themselves and owed much to bibliographic and book historical methods. In earlier decades, so-called “humanities computing” joined practitioners of stylometry and authorship attribution, who attempted to quantify the low-level differences between individual texts and writers. More recently, the catchall term “digital humanities” has been used to describe everything from online publishing and new media theory to statistical genre discrimination. In each of these cases, however, the shared recognition — like the impulse behind the earlier turn to cultural theory, albeit with a distinctly quantitative emphasis — has been that there are big gains to be had from looking at literature first as an interlinked, expressive system rather than as something that individual books do well, badly, or typically. At the same time, the gains themselves have as yet been thin on the ground, as much suggestions of future progress as transformative results in their own right. Skeptics could be forgiven for wondering how long the data-driven revolution can remain just around the corner.

Into this uncertain scene comes an important new volume by Matthew Jockers, offering yet another headword (“macroanalysis,” by analogy to macroeconomics) and a range of quantitative studies of 19th-century fiction. Jockers is one of the senior figures in the field, a scholar who has been developing novel ways of digesting large bodies of text for nearly two decades. Despite Jockers’s stature, Macroanalysis is his first book, one that aims to summarize and unify much of his previous research. As such, it covers a lot of ground with varying degrees of technical sophistication. There are chapters devoted to methods as simple as counting the annual number of books published by Irish-American authors and as complex as computational network analysis of literary influence. Aware of this range, Jockers is at pains to draw his material together under the dual headings of literary history and critical method, which is to say that the book aims both to advance a specific argument about the contours of 19th-century literature and to provide a brief in favor of the computational methods that it uses to support such an argument. For some readers, the second half of that pairing — a detailed look into what can be done today with new techniques — will be enough. For others, the book’s success will likely depend on how far they’re persuaded that the literary argument is an important one that can’t be had in the absence of computation…

More practically interesting and ambitious are Jockers’s studies of themes and influence in a larger set of novels from the same period (3,346 of them, to be exact, or about five to 10 percent of those published during the 19th century). These are the only chapters of the book that focus on what we usually understand by the intellectual content of the texts in question, seeking to identify and trace the literary use of meaningful clusters of subject-oriented terms across the corpus. The computational method involved is one known as topic modeling, a statistical approach to identifying such clusters (the topics) in the absence of outside input or training data. What’s exciting about topic modeling is that it can be run quickly over huge swaths of text about which we initially know very little. So instead of developing a hunch about the thematic importance of urban poverty or domestic space or Native Americans in 19th-century fiction and then looking for words that might be associated with those themes — that is, instead of searching Google Books more or less at random on the basis of limited and biased close reading — topic models tell us what groups of words tend to co-occur in statistically improbable ways. These computationally derived word lists are for the most part surprisingly coherent and highly interpretable. Specifically in Jockers’s case, they’re both predictable enough to inspire confidence in the method (there are topics “about” poverty, domesticity, Native Americans, Ireland, seafaring, servants, farming, etc.) and unexpected enough to be worth examining in detail…

The notoriously difficult problem of literary influence finally unites many of the methods in Macroanalysis. The book’s last substantive chapter presents an approach to finding the most central texts among the 3,346 included in the study. To assess the relative influence of any book, Jockers first combines the frequency measures of the roughly 100 most common words used previously for stylistic analysis with the more than 450 topic frequencies used to assess thematic interest. This process generates a broad measure of each book’s position in a very high-dimensional space, allowing him to calculate the “distance” between every pair of books in the corpus. Pairs that are separated by smaller distances are more similar to each other, assuming we’re okay with a definition of similarity that says two books are alike when they use high-frequency words at the same rates and when they consist of equivalent proportions of topic-modeled terms. The most influential books are then the ones — roughly speaking and skipping some mathematical details — that show the shortest average distance to the other texts in the collection. It’s a nifty approach that produces a fascinatingly opaque result: Tristram Shandy, Laurence Sterne’s famously odd 18th-century bildungsroman, is judged to be the most influential member of the collection, followed by George Gissing’s unremarkable The Whirlpool (1897) and Benjamin Disraeli’s decidedly minor romance Venetia (1837). If you can make sense of this result, you’re ahead of Jockers himself, who more or less throws up his hands and ends both the chapter and the analytical portion of the book a paragraph later. It might help if we knew what else of Gissing’s or Disraeli’s was included in the corpus, but that information is provided in neither Macroanalysis nor its online addenda.
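Roughly speaking, the centrality calculation described above can be sketched as follows. The feature vectors here are random stand-ins (real ones would combine ~100 word frequencies with 450-plus topic proportions per book); the distance and averaging steps mirror the description, not Jockers's exact mathematics.

```python
import numpy as np

# Hypothetical feature matrix: one row per book, columns combining
# common-word frequencies and topic proportions. Values are invented.
rng = np.random.default_rng(0)
features = rng.random((5, 8))      # 5 books, 8 features (toy scale)

# Pairwise Euclidean distances between every pair of books:
# smaller distance = more similar under this definition of similarity.
diff = features[:, None, :] - features[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# "Influence" proxy: the book with the shortest average distance
# to all the others sits nearest the centre of the corpus.
avg_dist = distances.sum(axis=1) / (len(features) - 1)
most_central = int(avg_dist.argmin())
print(f"most central book index: {most_central}")
```

Note that this measures centrality in feature space, not influence in any causal sense, which may be part of why the Tristram Shandy result resists interpretation.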

Sounds interesting. I wonder if there isn’t a great opportunity here for mixed-methods analysis: Jockers’s analysis provides the big picture, but you also need more intimate and deep knowledge of smaller groups of texts, or of individual texts, to interpret what the results mean. So if the data suggests three books are the most influential, you would have to know those books and their contexts to make sense of what the data says. Additionally, you still want theories and hypotheses to guide the analysis rather than simply looking for patterns.

This reminds me of the work sociologist Wendy Griswold has done in analyzing whether American novels shared common traits (she argues copyright law was quite influential) or how a reading culture might emerge in a developing nation. Her approach is somewhere between the interpretation of texts and the algorithms described above, relying on more traditional methods in sociology like analyzing samples and conducting interviews.

The sociology of literature and looking for data and insights in the margins of books

As a big reader, I was interested to see this review of research built on data about readers left behind in books:

Price’s work perches at the leading edge of a growing body of investigations into the history of reading. The field draws from many others, including book history and bibliography, literary criticism and social history, and communication studies. It looks backward to the pre-Gutenberg era, back to the clay tablets and scrolls of ancient civilizations, and forward to current debates about how technology is changing the way we read. Although much of the relevant research has centered on Anglo-American culture of the last three or four centuries, the field has expanded its purview, as scholars uncover the hidden reading histories of cultures many used to dismiss as mostly oral.

It’s a tricky business. A bibliographer works with hard physical evidence—a manuscript, a printed book, a copy of the Times of London. A scholar seeking to pin down the readers of the past often has to read between the lines. Marginalia can be a gold mine of information about a book’s owners and readers, but it’s rare. “Most of the time, most readers historically didn’t, and still don’t, write in their books,” Price explains.

But even a book’s apparent lack of use can be read as evidence. “The John F. Kennedy Library here in Boston owns a copy of Ulysses whose pages—other than a few at the very beginning and very end—are completely uncut,” she says. “This tells us something about the owner of the copy—who happens to be Ernest Hemingway.”…

Since Reading the Romance, the ethnography of reading has taken off among scholars. Radway points to Forgotten Readers, Elizabeth McHenry’s study of African-American literary societies, Ellen Gruber Garvey’s Writing With Scissors, about scrapbooking, and David Henkin’s City Reading, about signage in the urban environment, as strong examples. “People have become very creative about trying to figure out how groups of readers interact with the text as it’s embodied in various forms,” she says.

I have wondered in recent years why more sociologists don’t take up the subject of reading. It seems crucial for understanding the development of modern societies as information moved from a highly regulated environment to a diffuse distribution through books, newspapers, and other printed materials.

I’ve enjoyed the work of sociologist Wendy Griswold who studies reading. I’ve used a few of her pieces in class. Here are some of her fascinating works in the “sociology of literature” that I recommend:

1. Bearing Witness, published in 2000. Griswold examines the reading culture in Nigeria and why the novel, a common genre in Western societies, isn’t prevalent there. The short version of the story: it takes a lot of work for a society to reach the point where novels can be easily produced and read.

2. “American Character and the American Novel: An Expansion of Reflection Theory in the Sociology of Literature.” American Journal of Sociology 86(4), 1981. Griswold compares American and European novels of the late 1800s and early 1900s and finds that the differences in their content are due more to copyright law than to “national characters.”

3. With Terry McDonnell and Nathan Wright. “Reading and the Reading Class in the Twenty-First Century.” Annual Review of Sociology 31, 2005. Here is the abstract:

Sociological research on reading, which formerly focused on literacy, now conceptualizes reading as a social practice. This review examines the current state of knowledge on (a) who reads, i.e., the demographic characteristics of readers; (b) how they read, i.e., reading as a form of social practice; (c) how reading relates to electronic media, especially television and the Internet; and (d) the future of reading. We conclude that a reading class is emerging, restricted in size but disproportionate in influence, and that the Internet is facilitating this development.

Some fascinating stuff about the social forces influencing reading in today’s world.

4. With Nathan Wright. “Wired and Well Read.” In Society Online: The Internet in Context, 2004. If I remember correctly, Griswold and Wright argue that the Internet doesn’t compete with reading; rather, it enhances it, as those who read before the Internet now use the Internet to read more.