Activist charged for downloading millions of JSTOR articles

Many academics use databases like JSTOR to find articles from academic journals. However, one user allegedly violated the terms of service by downloading millions of articles and now faces federal charges:

Swartz, the 25-year-old executive director of Demand Progress, has a history of downloading massive data sets, both to use in research and to release public domain documents from behind paywalls. He surrendered in July 2011, remains free on bond and faces dozens of years in prison and a $1 million fine if convicted.

Like last year’s original grand jury indictment on four felony counts, (.pdf) the superseding indictment (.pdf) unveiled Thursday accuses Swartz of evading MIT’s attempts to kick his laptop off the network while downloading millions of documents from JSTOR, a not-for-profit company that provides searchable, digitized copies of academic journals that are normally inaccessible to the public…

“JSTOR authorizes users to download a limited number of journal articles at a time,” according to the latest indictment. “Before being given access to JSTOR’s digital archive, each user must agree and acknowledge that they cannot download or export content from JSTOR’s computer servers with automated programs such as web robots, spiders, and scrapers. JSTOR also uses computerized measures to prevent users from downloading an unauthorized number of articles using automated techniques.”

MIT authorizes guests to use the service, which was the case with Swartz, who at the time was a fellow at Harvard’s Safra Center for Ethics.

There seems to be a disconnect here: services like JSTOR want to maintain some control over the academic content they provide even as they exist to help researchers find printed scholarly articles. Services like JSTOR can make big money by collating journal articles and requiring libraries to pay for access. Thus, someone like Swartz could download a lot of the articles and then avoid paying for or using JSTOR down the road (though academic users primarily pay through institutions that pass the costs along to them). But what is “a limited number of journal articles at a time”? Using an automated program is clearly out according to the terms of service, but what if a team of undergraduates banded together, downloaded a similar number of articles by hand, and pooled their downloads?

If we are indeed headed toward a world of “big data,” which presumably would include the thousands of scholarly articles published each year, we are likely in for some interesting battles in a number of areas over who gets to control, download, and access this data.

Another thought: does going to open access academic journals eliminate this issue?

Scientists call for more rules and regulations about data

There are a lot of academics and researchers collecting data on a variety of topics. Some scientists argue that we need more regulations about data so that researchers can work with and access data collected by others:

In 10 new articles, also published in Science, researchers in fields as diverse as paleontology and neuroscience say the lack of data libraries, insufficient support from federal research agencies, and the lack of academic credit for sharing data sets have created a situation in which money is wasted and information that could reveal better cancer treatments or the causes of climate change goes by the wayside…

A big problem is the many forms of data and the difficulty of comparing them. In neuroscience, for instance, researchers collect data on scales of time that range from nanoseconds, if they are looking at rates of neuron firing, to years, if they are looking at developmental changes. There are also differences in the kind of data that come from optical microscopes and those that come from electron microscopes, and data on a cellular scale and data from a whole organism…

He added that he was limited by how data are published. “When I see a figure in a paper, it’s just the tip of the iceberg to me. I want to see it in a different form in order to do a different kind of analysis.” But the data are not available in a public, searchable format.

Shared data libraries sound like they could be useful. Based on experience, however, even when data are made available, it still takes a good amount of time to download them, read the documentation, and reshape them so that one can start to replicate findings from journal articles.