Using algorithms to analyze the literary canon

A new book describes efforts to use algorithms to discover what is in and out of the literary canon:

There’s no single term that captures the range of new, large-scale work currently underway in the literary academy, and that’s probably as it should be. More than a decade ago, the Stanford scholar of world literature Franco Moretti dubbed his quantitative approach to capturing the features and trends of global literary production “distant reading,” a practice that paid particular attention to counting books themselves and owed much to bibliographic and book historical methods. In earlier decades, so-called “humanities computing” joined practitioners of stylometry and authorship attribution, who attempted to quantify the low-level differences between individual texts and writers. More recently, the catchall term “digital humanities” has been used to describe everything from online publishing and new media theory to statistical genre discrimination. In each of these cases, however, the shared recognition — like the impulse behind the earlier turn to cultural theory, albeit with a distinctly quantitative emphasis — has been that there are big gains to be had from looking at literature first as an interlinked, expressive system rather than as something that individual books do well, badly, or typically. At the same time, the gains themselves have as yet been thin on the ground, as much suggestions of future progress as transformative results in their own right. Skeptics could be forgiven for wondering how long the data-driven revolution can remain just around the corner.

Into this uncertain scene comes an important new volume by Matthew Jockers, offering yet another headword (“macroanalysis,” by analogy to macroeconomics) and a range of quantitative studies of 19th-century fiction. Jockers is one of the senior figures in the field, a scholar who has been developing novel ways of digesting large bodies of text for nearly two decades. Despite Jockers’s stature, Macroanalysis is his first book, one that aims to summarize and unify much of his previous research. As such, it covers a lot of ground with varying degrees of technical sophistication. There are chapters devoted to methods as simple as counting the annual number of books published by Irish-American authors and as complex as computational network analysis of literary influence. Aware of this range, Jockers is at pains to draw his material together under the dual headings of literary history and critical method, which is to say that the book aims both to advance a specific argument about the contours of 19th-century literature and to provide a brief in favor of the computational methods that it uses to support such an argument. For some readers, the second half of that pairing — a detailed look into what can be done today with new techniques — will be enough. For others, the book’s success will likely depend on how far they’re persuaded that the literary argument is an important one that can’t be had in the absence of computation…

More practically interesting and ambitious are Jockers’s studies of themes and influence in a larger set of novels from the same period (3,346 of them, to be exact, or about five to 10 percent of those published during the 19th century). These are the only chapters of the book that focus on what we usually understand by the intellectual content of the texts in question, seeking to identify and trace the literary use of meaningful clusters of subject-oriented terms across the corpus. The computational method involved is one known as topic modeling, a statistical approach to identifying such clusters (the topics) in the absence of outside input or training data. What’s exciting about topic modeling is that it can be run quickly over huge swaths of text about which we initially know very little. So instead of developing a hunch about the thematic importance of urban poverty or domestic space or Native Americans in 19th-century fiction and then looking for words that might be associated with those themes — that is, instead of searching Google Books more or less at random on the basis of limited and biased close reading — topic models tell us what groups of words tend to co-occur in statistically improbable ways. These computationally derived word lists are for the most part surprisingly coherent and highly interpretable. Specifically in Jockers’s case, they’re both predictable enough to inspire confidence in the method (there are topics “about” poverty, domesticity, Native Americans, Ireland, sea faring, servants, farming, etc.) and unexpected enough to be worth examining in detail…

The notoriously difficult problem of literary influence finally unites many of the methods in Macroanalysis. The book’s last substantive chapter presents an approach to finding the most central texts among the 3,346 included in the study. To assess the relative influence of any book, Jockers first combines the frequency measures of the roughly 100 most common words used previously for stylistic analysis with the more than 450 topic frequencies used to assess thematic interest. This process generates a broad measure of each book’s position in a very high-dimensional space, allowing him to calculate the “distance” between every pair of books in the corpus. Pairs that are separated by smaller distances are more similar to each other, assuming we’re okay with a definition of similarity that says two books are alike when they use high-frequency words at the same rates and when they consist of equivalent proportions of topic-modeled terms. The most influential books are then the ones — roughly speaking and skipping some mathematical details — that show the shortest average distance to the other texts in the collection. It’s a nifty approach that produces a fascinatingly opaque result: Tristram Shandy, Laurence Sterne’s famously odd 18th-century bildungsroman, is judged to be the most influential member of the collection, followed by George Gissing’s unremarkable The Whirlpool (1897) and Benjamin Disraeli’s decidedly minor romance Venetia (1837). If you can make sense of this result, you’re ahead of Jockers himself, who more or less throws up his hands and ends both the chapter and the analytical portion of the book a paragraph later. It might help if we knew what else of Gissing’s or Disraeli’s was included in the corpus, but that information is provided in neither Macroanalysis nor its online addenda.
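
For those curious what topic modeling looks like in practice, here is a minimal sketch using scikit-learn's LDA implementation on a toy corpus. It is only an illustration of the technique, not Jockers's actual pipeline, which ran over thousands of novels and hundreds of topics; none of the texts or settings below are his.

```python
# Minimal topic-modeling sketch (LDA via scikit-learn): an illustration of the
# technique, not Jockers's actual pipeline. The toy "corpus" stands in for the
# thousands of novels he analyzed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the ship sailed across the stormy sea toward the harbor",
    "the servants prepared the parlor while the lady waited",
    "the farm fields were plowed before the autumn harvest",
    "the sailors feared the sea and the captain watched the waves",
]

# Count word occurrences, dropping common English stop words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

# Fit a small LDA model; the real study used several hundred topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, columns: topic proportions

# Show the top words in each discovered topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}: {', '.join(top)}")
```

Run over 3,346 novels with several hundred topics instead of two, the same basic procedure yields the kinds of interpretable word clusters the review describes.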

The approach sounds interesting. I wonder if there isn't a great opportunity here for mixed-methods analysis: Jockers's analysis provides the big picture, but you also need more intimate and deep knowledge of smaller groups of texts, or of individual texts, to interpret what the results mean. So, if the data suggests three books are the most influential, you would have to know those books and their context to make sense of what the data says. Additionally, you would still want to use theories and hypotheses to guide the analysis rather than simply looking for patterns.
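
The influence measure in the book's last chapter is also easy to sketch in outline: stack each book's common-word rates and topic proportions into one long feature vector, compute the distance between every pair of books, and treat the book with the smallest average distance to all the others as the most "central." The numbers below are random placeholders, not Jockers's measurements; the point is just the shape of the calculation.

```python
# Sketch of centrality by average distance: each book is a vector of roughly
# 100 common-word rates plus several hundred topic proportions, and the most
# "influential" book is (roughly) the one closest on average to all the others.
# All values here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_books, n_word_features, n_topic_features = 10, 100, 450

word_rates = rng.random((n_books, n_word_features))                   # rates of common words
topic_props = rng.dirichlet(np.ones(n_topic_features), size=n_books)  # per-book topic proportions
features = np.hstack([word_rates, topic_props])                       # one high-dimensional vector per book

# Euclidean distance between every pair of books.
diffs = features[:, None, :] - features[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Average distance to the other books; smaller means more central.
avg_dist = distances.sum(axis=1) / (n_books - 1)
print("most central book index:", int(avg_dist.argmin()))
```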

This reminds me of the work sociologist Wendy Griswold has done in analyzing whether American novels shared common traits (she argues copyright law was quite influential) and in tracing how a reading culture might emerge in a developing nation. Her approach falls somewhere between the close interpretation of texts and the algorithms described above, relying on more traditional sociological methods like analyzing samples of texts and conducting interviews.

Using algorithms to judge cultural works

Imagine the money that could be made or the status acquired if algorithms could correctly predict the merit of cultural works:

The budget for the film was $180m and, Meaney says, “it was breathtaking that it was under serious consideration”. There were dinosaurs and tigers. It existed in a fantasy prehistory—with a fantasy language. “Preposterous things were happening, without rhyme or reason.” Meaney, who will not reveal the film’s title because he “can’t afford to piss these people off”, told the studio that his program concurred with his own view: it was a stinker.

The difference is the program puts a value on it. Technically a neural network, with a structure modelled on that of our brain, it gradually learns from experience and then applies what it has learnt to new situations. Using this analysis, and comparing it with data on 12 years of American box-office takings, it predicted that the film in question would make $30m. With changes, Meaney reckoned they could increase the take—but not to $180m. On the day the studio rejected the film, another one took it up. They made some changes, but not enough—and it earned $100m. “Next time we saw our studio,” Meaney says, “they brought in the board to greet us. The chairman said, ‘This is Nick—he’s just saved us $80m.’”…

But providing a service that adapts to individual humans is not the same as becoming like a human, let alone producing art like humans. This is why the rise of algorithms is not necessarily relentless. Their strength is that they can take in that information in ways we cannot quickly understand. But the fact that we cannot understand it is also a weakness. It is worth noting that trading algorithms in America now account for 10% fewer trades than they did in 2009.

Those who are most sanguine are those who use them every day. Nick Meaney is used to answering questions about whether computers can—or should—judge art. His answer is: that’s not what they’re doing. “This isn’t about good, or bad. It is about numbers. These data represent the law of absolute numbers, the cinema-going audience. We have a process which tries to quantify them, and provide information to a client who tries to make educated decisions.”…

Equally, his is not a formula for the perfect film. “If you take a rich woman and a poor man and crash them into an iceberg, will that film always make money?” No, he says. No algorithm has the ability to write a script; it can judge one—but only in monetary terms. What Epagogix does is a considerably more sophisticated version, but still a version, of noting, say, that a film that contains nudity will gain a restricted rating, and thereby have a more limited market.
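
Stripped of the mystique, the task described in the excerpt is a regression problem: score a script on a set of features and learn from past box-office data how those scores map onto revenue. Here is a toy sketch with invented features and numbers; Epagogix's actual model and inputs are proprietary and certainly far richer than this.

```python
# Toy sketch of the prediction task described above: learn a mapping from
# script/production features to box-office revenue using historical films.
# The features, numbers, and tiny network are invented for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# Hypothetical past films: [budget ($m), star power 0-10, familiar setting 0-10, coherent plot 0-10]
X = np.array([
    [200, 9, 8, 7],
    [ 30, 3, 6, 8],
    [150, 7, 2, 3],
    [ 80, 5, 9, 9],
    [120, 8, 7, 6],
    [ 40, 2, 3, 4],
])
y = np.array([450, 90, 110, 220, 300, 35])  # box-office takings ($m), also invented

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0),
)
model.fit(X, y)

# Score a new script: big budget, but an unfamiliar setting and a muddled plot.
new_film = np.array([[180, 6, 1, 2]])
print("predicted take ($m):", round(float(model.predict(new_film)[0]), 1))
```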

The larger article suggests algorithms can do better at predicting some human behaviors, such as purchasing consumer items, but not as well in other areas, like critical evaluations of cultural works. There are two ways this might go in the future. On one hand, some will argue this is just a matter of collecting the right data, or enough data; perhaps we simply aren’t looking at the right things to correctly judge cultural products. On the other hand, some will argue that the value of an object may be too difficult for an algorithm to ever figure out. And even if a formula starts hinting at good or bad art, humans can change their minds and opinions – see all the various cultural, art, and music movements just in the last few hundred years.

There is a lot of money that could be made here. This might be the bigger issue with cultural works in the future: whether or not algorithms can evaluate them, does it matter if they are all commodified?

Using algorithms for better realignment in the NHL?

The NHL recently announced realignment plans. However, a group of West Point mathematicians developed an algorithm they argue provides a better realignment:

Well, a team of mathematicians at West Point set out to find an algorithm that could solve some of these problems. In their article posted on the arXiv titled Realignment in the NHL, MLB, the NFL, and the NBA, they explore how to easily construct different team divisions. For example, with the relatively recent move of Atlanta’s hockey team to Winnipeg, the current team alignment is pretty weird (below left), and the NHL has proposed a new 4-division configuration (below right):

Here’s how it works. First, they use a rough approximation for distance traveled by each team (which is correlated with actual travel distances), and then examine all the different ways to divide the cities in a league into geographic halves. You then can subdivide those portions until you get the division sizes you want. However, only certain types of divisions will work, such as not wanting to make teams travel too laterally, due to time zone differences…

Anyway, using this method, here are two ways of dividing the NHL into six different divisions that are found to be optimal:
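
As I read the quoted description, the core of the method is recursive bisection: enumerate the balanced ways to split a league's cities in two, keep the split with the least internal travel according to some rough distance proxy, and repeat within each half until the groups reach division size. Here is a small sketch of that idea using straight-line distance on latitude/longitude as the proxy and only a handful of cities; the paper's actual procedure and constraints (such as limiting east-west travel) are more careful, and exhaustive search like this is only practical for small examples.

```python
# Rough sketch of recursive geographic bisection for league realignment.
# Straight-line distance on (lat, lon) is a crude stand-in for travel distance,
# the city list is a small example, and brute-force enumeration of splits is
# only feasible at this scale; the West Point paper's method is more refined.
from itertools import combinations
from math import dist

cities = {
    "Boston": (42.4, -71.1), "Montreal": (45.5, -73.6), "Toronto": (43.7, -79.4),
    "Detroit": (42.3, -83.0), "Chicago": (41.9, -87.6), "Winnipeg": (49.9, -97.1),
    "Denver": (39.7, -105.0), "Vancouver": (49.3, -123.1),
}

def within_group_travel(group):
    """Sum of pairwise straight-line distances inside a group (a travel proxy)."""
    return sum(dist(cities[a], cities[b]) for a, b in combinations(group, 2))

def bisect(group):
    """Try every balanced split of a group into two halves; keep the cheapest."""
    members = sorted(group)
    half = len(members) // 2
    best = min(
        (frozenset(c) for c in combinations(members, half)),
        key=lambda left: within_group_travel(left) + within_group_travel(set(members) - left),
    )
    return set(best), set(members) - best

def divide(group, division_size):
    """Recursively bisect until each group is down to the target division size."""
    if len(group) <= division_size:
        return [sorted(group)]
    left, right = bisect(group)
    return divide(left, division_size) + divide(right, division_size)

for division in divide(set(cities), division_size=2):
    print(division)
```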

My first thought when looking at the algorithm’s realignment plans is that they are based less on time zones and more on regions like the Southwest, Northwest, Central, Southeast, North, and Northeast.

But here is where I think the demands of the NHL don’t quite line up with the algorithm’s goal of minimizing travel. The grouping of sports teams is often dependent on historic patterns, rivalries, and when teams entered the league. For example, the NHL realignment plans generated a lot of discussion in Chicago because they meant that the long rivalry between the Chicago Blackhawks and the Detroit Red Wings would end. In other words, there is cultural baggage to realignment that can’t be solved with statistics alone. Data loses out to narratives.

Another way an algorithm could redraw the boundaries: spread out the winning teams across the league. Which teams are really good tends to be cyclical, but occasionally leagues end up with multiple good teams in a single division or an imbalance of power between conferences. Why not spread teams out by record, which would give good teams a better chance to meet in the finals and give other teams in those stacked divisions or conferences a chance to make the playoffs?
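
One simple way to implement that: rank teams by record and deal them into divisions in serpentine ("snake") order, so each division ends up with a mix of strong and weak teams. A toy sketch with made-up records:

```python
# Toy sketch of balancing divisions by record: sort by winning percentage and
# deal teams out in snake order so no division hoards the best teams.
# The records are invented; real realignment would also weigh geography and rivalries.
teams = {
    "A": 0.70, "B": 0.66, "C": 0.61, "D": 0.58, "E": 0.55, "F": 0.52,
    "G": 0.49, "H": 0.45, "I": 0.41, "J": 0.38, "K": 0.33, "L": 0.29,
}

n_divisions = 3
ranked = sorted(teams, key=teams.get, reverse=True)
divisions = [[] for _ in range(n_divisions)]

for i, team in enumerate(ranked):
    rnd, pos = divmod(i, n_divisions)
    # Reverse direction on odd rounds: 0,1,2 then 2,1,0 then 0,1,2, ...
    idx = pos if rnd % 2 == 0 else n_divisions - 1 - pos
    divisions[idx].append(team)

for d, members in enumerate(divisions, start=1):
    avg = sum(teams[t] for t in members) / len(members)
    print(f"division {d}: {members} (average winning percentage {avg:.3f})")
```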