Among the other projects Google has been working on, the company recently opened a new online tool that allows users to search for particular words in texts going back to the 1500s:
With little fanfare, Google has made a mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a new landscape of possibilities for research and education in the humanities.
The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian…
“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard…
“We wanted to show what becomes possible when you apply very high-turbo data analysis to questions in the humanities,” said Mr. Lieberman Aiden, whose expertise is in applied mathematics and genomics. He called the method “culturomics.”
The article mentions some interesting-sounding projects that use this database. And it sounds like the dataset can be downloaded and analyzed by users on their own computers.
But thinking about the methodology of all this, I have some questions.
1. Do we know how well these digitized texts represent the full population of texts? This is a sampling issue – could there be some sort of bias in what kind of texts ended up in this database?
2. Studying word frequency by itself is tricky. Counting how often words appear and when is one measurement; assessing the importance placed on each word is a separate task. Do the three little “culturomics” graphs on the left side of the online story really tell us much?
3. It sounds like this would be best for looking at how language (grammar, word choices, structure, etc.) has changed over time.
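To make the second question a bit more concrete, here is a minimal Python sketch of one reason raw counts can mislead: the number of books, and so the number of words, is not the same in every year. The function name, the per-million scaling, and the toy numbers are my own assumptions for illustration. They are not taken from the Google dataset or its file format.

```python
def relative_frequency(word_counts, total_counts, per=1_000_000):
    """Return occurrences of a word per `per` words, keyed by year.

    word_counts:  {year: times the word appeared that year}
    total_counts: {year: total words recorded that year}
    Years with no recorded total are skipped rather than guessed.
    """
    rates = {}
    for year, count in word_counts.items():
        total = total_counts.get(year, 0)
        if total > 0:
            rates[year] = count * per / total
    return rates


# Hypothetical toy numbers, invented for this example only.
word_counts = {1900: 50, 2000: 500}
total_counts = {1900: 1_000_000, 2000: 20_000_000}

rates = relative_frequency(word_counts, total_counts)
for year in sorted(rates):
    print(year, "raw:", word_counts[year], "per million:", rates[year])
```

In this made-up case the raw count rises tenfold while the rate per million words falls by half. Even the normalized rate only describes how often a word shows up in the sampled books, which still leaves open the question of how much importance was placed on it.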