The Library of Congress announced a few years ago that it would archive all tweets. Here is how it plans to store the data and make it available:
Osterberg says the costs associated with the project, in terms of developing the infrastructure to house the tweets, is in the low tens of thousands of dollars. The tweets were offered as a free gift from Twitter, and are being transferred to the Library through a separate company, Gnip, at no cost. Each day tweets are automatically pulled in from Gnip, organized chronologically and scanned to ensure they’re not corrupted. Then the data are stored on two separate tapes which are housed in different parts of the Library for security reasons.
The Library has mostly figured out how to make the archive organized, but usability remains a challenge. A simple query of just the 2006-2010 tweets currently takes about 24 hours. Increasing search speeds to a reasonable level would require purchasing hundreds of servers, which the Library says is financially unfeasible right now. There’s no timetable for when the tweets might become accessible to researchers…
While you can’t yet make a trip to Washington D.C. and have casual perusal of all the world’s tweets, the technology to do exactly that is readily available—for a cost. Gnip, the organization feeding the tweets to the Library, is a social media data company that has exclusive access to the Twitter “firehose,” the never-ending, comprehensive stream of all of our tweets. Companies such as IBM pay for Gnip’s services, which also include access to posts from other social networks like Facebook and Tumblr. The company also works with academics and public policy experts, the type of people likely to make use of a free, government-sponsored Twitter archive when it comes to fruition…
All the researchers agree that Twitter is a powerful tool for sociological study. Soon, if the Library of Congress can make its database fully functional, it’ll also be an easily accessible one. And one day, long after we’ve all sent our final snarky tweet, our messages will live on.
And what will people of the future think when they read all these tweets?
While this could be a really interesting data source (notwithstanding all of the sample selection issues), I find it odd that there is no timetable for when it might become more easily searchable. What is the point of collecting all of this information if it can’t be put to use?