Google offers tool to analyze texts going back to the 1500s

Among its other projects, Google recently opened a new online tool that lets users search for words in texts going back to the 1500s:

With little fanfare, Google has made a mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a new landscape of possibilities for research and education in the humanities.

The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian…

“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard…

“We wanted to show what becomes possible when you apply very high-turbo data analysis to questions in the humanities,” said Mr. Lieberman Aiden, whose expertise is in applied mathematics and genomics. He called the method “culturomics.”

The article mentions some interesting-sounding projects that use this database. And it sounds like the dataset can be downloaded and analyzed by users on their own computers.

But thinking about the methodology of all this, I have some questions.

1. Do we know how well these digitized texts represent the full population of texts? This is a sampling issue – could there be some sort of bias in what kind of texts ended up in this database?

2. Studying word frequency by itself is tricky. Counting words and when they appear is one measurement; assessing the importance placed on each word is a different task. Do the three little “culturomics” graphs on the left side of the online story really tell us much?

3. It sounds like this would be best for looking at how language (grammar, word choices, structure, etc.) has changed over time.
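
To make the second question concrete, here is a minimal sketch of what a year-by-year word count involves. The corpus is a made-up toy list, and the function name is my own; none of it comes from Google's tool or dataset. It shows why raw counts need to be divided by the total number of words per year: more books in a given year means more occurrences of every word, whether or not the word became more important.

```python
from collections import Counter

# Toy corpus of (publication year, text) pairs. Entirely made up for
# illustration; this is not data from the Google database.
TOY_CORPUS = [
    (1900, "the war ended and the peace began"),
    (1900, "peace at last"),
    (1950, "the war the war the war"),
]


def relative_frequency(corpus, word):
    """Return {year: share of that year's tokens that equal `word`}."""
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all tokens per year
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    # Normalize so years with more text do not dominate the trend line.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


print(relative_frequency(TOY_CORPUS, "war"))
```

Even with the normalization, the result is still only a frequency. It says nothing about how the word was used or how much weight readers gave it, which is the gap the second question points at.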

Language style matching in relationships

New research suggests that people, particularly those who are happy in relationships, tend to match their language to that of those around them or to authors they have just read. Here are some of the findings:

Pennebaker and his colleagues tracked language use by 2,000 college students responding to class assignments written in different language styles. The results confirmed that language style matching extends to the written word. When an essay question was written in a dry, confusing tone, students responded with dry, confusing answers. If the question took a flighty, casual tone, students responded with “Valley girl”-like answers peppered with “like” and “sorta.”

Next, the researchers used historical figures to find out if language style matching could reveal schisms or closeness in a relationship.

They began with Sigmund Freud and Carl Jung, psychologists who corresponded almost weekly for seven years. Using style-matching statistics, the researchers were able to chart the two men’s tempestuous relationship from their early days of joint admiration to their final days of mutual contempt by counting the ways they used pronouns, prepositions and other words, such as “the,” “you,” “a” and “as,” that have little meaning outside the context of the sentence. Such words can be indicators of a person’s style of writing (and speaking)…

Married Victorian Poets Elizabeth Barrett and Robert Browning, along with 20th century poet couple Sylvia Plath and Ted Hughes, also revealed more in their poetry than they perhaps realized.
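
The article does not say how the style-matching statistics were computed, so what follows is only a rough sketch of the general idea: count how often each writer uses small function words, then score how close the two rates are. The word lists, category names, function names, and the scoring formula are all my own assumptions for illustration, not the researchers' method.

```python
import re

# Hypothetical, heavily abbreviated function-word categories.
# A real analysis would use much longer lists.
FUNCTION_WORDS = {
    "articles": {"a", "an", "the"},
    "pronouns": {"i", "you", "he", "she", "we", "they", "it"},
    "prepositions": {"in", "on", "of", "to", "as", "with"},
}


def category_rates(text):
    """Share of a text's words that fall into each function-word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid dividing by zero on empty text
    return {
        cat: sum(tok in words for tok in tokens) / n
        for cat, words in FUNCTION_WORDS.items()
    }


def style_match(text_a, text_b):
    """Average per-category similarity: 1.0 means identical usage rates.

    Assumed formula: 1 - |a - b| / (a + b + small constant), per category.
    """
    rates_a, rates_b = category_rates(text_a), category_rates(text_b)
    scores = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 1e-4)
        for c in FUNCTION_WORDS
    ]
    return sum(scores) / len(scores)


letter = "I wrote to you as a friend, and you answered with the same warmth."
print(style_match(letter, letter))                # identical texts
print(style_match(letter, "Dogs bark loudly."))   # no function words at all
```

Running something like this over letters in date order would give one score per exchange, which is presumably the kind of series the researchers charted for Freud and Jung.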

I’ll have to watch for this. How much do others typically pick up on it when they are around people who are matching each other’s language? Is this how we know people “are good together”?