How the Library of Congress will archive and make available all tweets

The Library of Congress announced a few years ago that it would archive all tweets. Here is how it plans to store the data and make it available:

Osterberg says the costs associated with the project, in terms of developing the infrastructure to house the tweets, is in the low tens of thousands of dollars. The tweets were offered as a free gift from Twitter, and are being transferred to the Library through a separate company, Gnip, at no cost. Each day tweets are automatically pulled in from Gnip, organized chronologically and scanned to ensure they’re not corrupted. Then the data are stored on two separate tapes which are housed in different parts of the Library for security reasons.

The Library has mostly figured out how to make the archive organized, but usability remains a challenge. A simple query of just the 2006-2010 tweets currently takes about 24 hours. Increasing search speeds to a reasonable level would require purchasing hundreds of servers, which the Library says is financially unfeasible right now. There’s no timetable for when the tweets might become accessible to researchers…

While you can’t yet make a trip to Washington D.C. and have casual perusal of all the world’s tweets, the technology to do exactly that is readily available—for a cost. Gnip, the organization feeding the tweets to the Library, is a social media data company that has exclusive access to the Twitter “firehose,” the never-ending, comprehensive stream of all of our tweets. Companies such as IBM pay for Gnip’s services, which also include access to posts from other social networks like Facebook and Tumblr. The company also works with academics and public policy experts, the type of people likely to make use of a free, government-sponsored Twitter archive when it comes to fruition…

All the researchers agree that Twitter is a powerful tool for sociological study. Soon, if the Library of Congress can make its database fully functional, it’ll also be an easily accessible one. And one day, long after we’ve all sent our final snarky tweet, our messages will live on.

And what will people of the future think when they read all these tweets?

While this could be a really interesting data source (notwithstanding all of the sample selection issues), I find it odd that there is no timetable for when it might be more easily searchable. What is the point of collecting all of this information if it can’t be put to use?

Using Twitter as a data source; examining emotions and more

In April, the Library of Congress announced plans to archive all public tweets since the start of Twitter in March 2006. So what might researchers do with this data?

A recent study provides an example. Scholars from Northeastern and Harvard examined the emotions of Americans through their tweets. By coding certain words as having positive or negative emotional value, the researchers were able to score the mood of tweets and map it by place and time. According to New Scientist:

[T]hese “tweets” suggest that the west coast is happier than the east coast, and across the country happiness peaks each Sunday morning, with a trough on Thursday evenings.

The mood map is cool.
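To make the word-coding idea concrete, here is a minimal sketch in Python of how such scoring could work. It is not the researchers' actual method: the word lists, the scoring formula, and the function names `score_tweet` and `mean_score_by_group` are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical word lists, for illustration only.
# These are NOT the lexicon used in the study.
POSITIVE = {"happy", "love", "great", "good"}
NEGATIVE = {"sad", "hate", "awful", "bad"}


def score_tweet(text):
    """Return (positive hits - negative hits) / word count, or 0.0 if no words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def mean_score_by_group(tweets):
    """Average tweet scores per group key (e.g. a region or a day of the week).

    `tweets` is an iterable of (group, text) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, text in tweets:
        totals[group] += score_tweet(text)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample data, grouped by a hypothetical region label.
sample = [
    ("west", "I love this great weather"),
    ("east", "I hate this awful commute"),
    ("east", "good coffee though"),
]
print(mean_score_by_group(sample))
```

Grouping by region would give something like the map, while grouping by day and hour would give the weekly peaks and troughs. A simple word count like this misses sarcasm and negation ("not happy"), which is one more reason to treat the results cautiously.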

While the findings about when people are happy may not be too surprising, the research does raise the question of how valuable tweets are as a data source. Since Twitter users likely skew younger, and perhaps also wealthier and more educated, than the general population, the data are not representative. But tweets could provide some insights into reactions to certain events or show the beginning and end of certain trends.

So what else will researchers study using tweets?