Activist charged with downloading millions of JSTOR articles

Many academics use databases like JSTOR to find articles from academic journals. However, one user violated the terms of service by downloading millions of articles and is now being charged by the federal government:

Swartz, the 25-year-old executive director of Demand Progress, has a history of downloading massive data sets, both to use in research and to release public domain documents from behind paywalls. He surrendered in July 2011, remains free on bond and faces dozens of years in prison and a $1 million fine if convicted.

Like last year’s original grand jury indictment on four felony counts, (.pdf) the superseding indictment (.pdf) unveiled Thursday accuses Swartz of evading MIT’s attempts to kick his laptop off the network while downloading millions of documents from JSTOR, a not-for-profit company that provides searchable, digitized copies of academic journals that are normally inaccessible to the public…

“JSTOR authorizes users to download a limited number of journal articles at a time,” according to the latest indictment. “Before being given access to JSTOR’s digital archive, each user must agree and acknowledge that they cannot download or export content from JSTOR’s computer servers with automated programs such as web robots, spiders, and scrapers. JSTOR also uses computerized measures to prevent users from downloading an unauthorized number of articles using automated techniques.”

MIT authorizes guests to use the service, which was the case with Swartz, who at the time was a fellow at Harvard’s Safra Center for Ethics.

There is something of a disconnect here: services like JSTOR want to maintain control over the academic content they provide even as they exist to help researchers find scholarly articles. JSTOR can make significant money by collating journal articles and requiring libraries to pay for access. Someone like Swartz could thus download many of the articles and avoid paying for or using JSTOR down the road (though academic users primarily pay through institutions that pass the costs along to them). But what counts as “a limited number of journal articles at a time”? Using an automated program is clearly out according to the terms of service, but what if a team of undergraduates banded together, downloaded a similar number of articles, and pooled their downloads?

If we are indeed headed toward a world of “big data,” which presumably would include the thousands of scholarly articles published each year, we are likely in for some interesting battles in a number of areas over who gets to control, download, and access this data.

Another thought: would a shift to open-access academic journals eliminate this issue?

Improving the word cloud: NYT adds rates of word usage and comparisons between groups

I’m generally not a big fan of word clouds but one of my students recently pointed out to me an example from the New York Times that makes some improvements: looking at the rates of word usage at both the Republican and Democratic National Conventions. (Click through to see the interactive graphic.) Here is how I think this improves on a typical word cloud:

1. It doesn’t display raw word frequency but rather the rate of word usage. Thus, we get an idea of how often the words were used in comparison to all the words that were said. Frequencies by themselves don’t tell you much but rates help put them into context. (A note: I would like the graphic to include the total word count for each convention so we have a quick idea of how many words were spoken.) A rough sketch of this kind of calculation appears after this list.

2. The display also makes a comparison between the two political parties so we can see the relative word usage across two groups. This could run into the same problem as frequencies – just because one group uses the term more doesn’t necessarily mean they think it is more important – but we can start getting some clues into the differences in how Republicans and Democrats made a case for their party.
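
To make the distinction between frequencies and rates concrete, here is a minimal sketch in Python; the tiny word lists are invented and the per-10,000-words scaling is only an assumption about how a graphic like the Times’ might normalize things:

```python
from collections import Counter

def word_rates(words):
    """Return each word's share of all words spoken (a rate, not a raw count)."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Invented, tiny "transcripts" standing in for the two conventions.
gop_words = ["jobs", "economy", "jobs", "freedom", "taxes", "jobs", "business"]
dem_words = ["jobs", "education", "health", "jobs", "middle", "class", "women"]

gop_rates = word_rates(gop_words)
dem_rates = word_rates(dem_words)

# Compare how often each party used a word per 10,000 words spoken, which
# keeps the two groups comparable even if one convention ran much longer.
for word in ["jobs", "economy", "education"]:
    print(word,
          round(gop_rates.get(word, 0) * 10_000),
          round(dem_rates.get(word, 0) * 10_000))
```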

Overall, this is an improvement over the typical word cloud (make your own at wordle.net) and helps us start analyzing the tens of thousands of words spoken at the conventions. Of course, we would need a more complete analysis, probably including multiple coders, to really get at what was conveyed through the words (and that doesn’t even get at the visuals, body language, and presentation).

Facebook runs 2010 voting experiment with over 61 million users

Experiments don’t just take place in laboratories; they also happen on Facebook.

On November 2nd, 2010, more than 61 million adults visited Facebook’s website, and every single one of them unwittingly took part in a massive experiment. It was a randomised controlled trial, of the sort used to conclusively test the worth of new medicines. But rather than drugs or vaccines, this trial looked at the effectiveness of political messages, and the influence of our friends, in swaying our actions. And unlike most medical trials, this one had a sample size in the millions.

It was the day of the US congressional elections. The vast majority of the users aged 18 and over (98 percent of them) saw a “social message” at the top of their News Feed, encouraging them to vote. It gave them a link to local polling places, and a clickable button that said “I voted”. They could see how many people had clicked the button on a counter, and which of their friends had done so through a set of randomly selected profile pictures.

But the remaining 2 percent saw something different, thanks to a team of scientists, led by James Fowler from the University of California, San Diego. Half of them saw the same box, wording, button and counter, but without the pictures of their friends—this was the “informational message” group. The other half saw nothing—they were the “no message” group.

By comparing the three groups, Fowler’s team showed that the messages mobilised people to express their desire to vote by clicking the button, and the social ones even spurred some to vote. These effects rippled through the network, affecting not just friends, but friends of friends. By linking the accounts to actual voting records, Fowler estimated that tens of thousands of votes eventually cast during the election were generated by this single Facebook message.

The effects appear to be small but could still be influential when multiplied through large social networks.

I suspect we’ll continue to see more and more of this in the future. Platforms like Facebook or Google or Amazon have access to millions of users and can run experiments that don’t change a user’s experience of the website much.
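
As a rough sketch of the three-arm design described above (the arm proportions follow the article, but the turnout probabilities and user count below are invented for illustration and are not the study’s estimates):

```python
import random

random.seed(42)

# Three-arm assignment as described: 98% social message, 1% informational,
# 1% no message at all.
def assign_arm():
    r = random.random()
    if r < 0.98:
        return "social"
    if r < 0.99:
        return "informational"
    return "none"

# Invented turnout probabilities; the real study measured effects by matching
# accounts to public voting records, not by simulating them.
turnout_prob = {"social": 0.400, "informational": 0.397, "none": 0.396}

counts = {arm: 0 for arm in turnout_prob}
votes = {arm: 0 for arm in turnout_prob}
for _ in range(1_000_000):          # a stand-in for the 61 million users
    arm = assign_arm()
    counts[arm] += 1
    votes[arm] += random.random() < turnout_prob[arm]

for arm in turnout_prob:
    print(arm, counts[arm], round(votes[arm] / counts[arm], 4))
# Even a few tenths of a percentage point, applied to tens of millions of
# users, adds up to tens of thousands of additional votes.
```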

Argument: we could have skewed survey results because we ignore prisoners

Several sociologists suggest American survey results may be off because they tend to ignore prisoners:

“We’re missing 1% of the population,” said Becky Pettit, a University of Washington sociologist and author of the book “Invisible Men.” “People might say, ‘That’s not a big deal.’” But it is for some groups, she writes — particularly young black men. And for young black men, especially those without a high-school diploma, official statistics paint a rosier picture than reality on factors such as employment and voter turnout.

“Because many surveys skip institutionalized populations, and because we incarcerate lots of people, especially young black men with low levels of education, certain statistics can look rosier than if we included” prisoners in surveys, said Jason Schnittker, a sociologist at the University of Pennsylvania. “Whether you regard the impact as ‘massive’ depends on your perspective. The problem of incarceration tends to get swept under the rug in lots of different ways, rendering the issue invisible.”

Further commentary in the article suggests sociologists and others, like the Census Bureau, are split on whether they think including prisoners in surveys is necessary.

Based on this discussion, I wonder if there is another issue: will getting slightly better survey results by picking up 1% of the population significantly affect findings and policy decisions? If not, some would conclude it is not worth the effort. But Pettit argues some statistics could change a lot:

Among the generally accepted ideas about African-American young-male progress over the last three decades that Becky Pettit, a University of Washington sociologist, questions in her book “Invisible Men”: that the high-school dropout rate has dropped precipitously; that employment rates for young high-school dropouts have stopped falling; and that the voter-turnout rate has gone up.

For example, without adjusting for prisoners, the high-school completion gap between white and black men has fallen by more than 50% since 1980, says Prof. Pettit. After adjusting, she says, the gap has barely closed and has been constant since the late 1980s. “Given the data available, I’m very confident that if we include inmates” in more surveys, “the trends are quite different than we would otherwise have known,” she says…

For instance, commonly accepted numbers show that the turnout rate among black male high-school dropouts age 20 to 34 surged between 1980 and 2008, to the point where about one in three were voting in presidential races. Prof. Pettit says her research indicates that instead the rate was flat, at around one in five, even after the surge in interest in voting among many young black Americans with Barack Obama in the 2008 race.
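
To see why including inmates can move statistics for heavily incarcerated groups while barely changing overall numbers, here is a toy calculation; the figures are invented and are not Pettit’s:

```python
# Toy numbers showing how folding inmates back in changes a rate.
# Suppose a survey of the non-institutionalized population finds a 75%
# high-school completion rate for a group, while the skipped incarcerated
# members of that group have a much lower rate.
surveyed_pop, surveyed_rate = 950_000, 0.75
incarcerated_pop, incarcerated_rate = 50_000, 0.40

unadjusted = surveyed_rate
adjusted = (surveyed_pop * surveyed_rate + incarcerated_pop * incarcerated_rate) \
           / (surveyed_pop + incarcerated_pop)

print(f"unadjusted: {unadjusted:.3f}  adjusted: {adjusted:.3f}")
# With 5% of the group incarcerated, the adjusted rate is roughly 0.73 rather
# than 0.75; for the full population (about 1% incarcerated) the same
# arithmetic barely moves the number, which is why the gap is easy to miss.
```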

It will be interesting to see how this plays out.

Lack of good data on grad students who go into nonacademic jobs

I was just asked about this recently so I was interested to see this story in the Chronicle of Higher Education about efforts to get better data about graduate students who go on to nonacademic careers:

The Council of Graduate Schools published a wider-scoped study this year. “Pathways Through Graduate School and Into Careers” focuses on the transition from graduate school to job. Its findings, based on consultation with students, deans, and employers, are now resonating in an academic culture that remains fixated on the tenure-track outcome.

The council’s study found that professors don’t talk enough to their graduate students about possible jobs outside of academe, even though such nonfaculty positions are “of interest to students.” That lack of guidance is particularly egregious in light of where graduate students actually end up: About half of new Ph.D.’s get their first jobs outside of academe, “in business, government, or nonprofit jobs,” the council’s report said.

The CGS study included a survey but the results have not been published. Incredibly, there has been no significant survey of graduate-student career outcomes since Nerad and Cerny’s [a 1999 study]—and they limited their sample to Ph.D.’s who had received their degrees nearly 30 years ago now.

So it’s big news that the Scholarly Communication Institute is conducting a new survey of former graduate students who have (or are building) careers outside the professoriate—a career category now commonly called alternative academic, or “alt-ac.” (You can tell how embedded an idea has become when it gets a handle as brief as that.)

You would think there would be more data on this topic but since graduate schools themselves may not have a great interest in this information, it takes some other group or interested party to pull it all together.

I know graduate school faculty tend to take a beating in reports like these because they don’t talk enough about nonacademic options. They should know something about the topic, and perhaps in the future they can point their students to this new survey and database, but how much could they really know about the nonacademic world? They face plenty of pressure just to keep up in their own settings, let alone learn about areas their schools and departments wouldn’t really reward them for exploring. Perhaps there is some way to introduce incentives that would reward faculty for also talking about life outside academia. I wonder how many departments in certain subjects would feel like failures if half their graduates ended up in nonacademic jobs…that is not conducive to wanting to share more information with students.

Disconnect between how much Americans say they give to church and charity versus what they actually give

Research drawing on recent data about charitable and religious giving suggests an interesting disconnect: some people say they give more than they actually do.

A quarter of respondents in a new national study said they tithed 10 percent of their income to charity. But when their donations were checked against income figures, only 3 percent of the group gave more than 5 percent to charity…

But other figures from the Science of Generosity Survey and the 2010 General Social Survey indicate how little large numbers of people actually give to charity.

The generosity survey found just 57 percent of respondents gave more than $25 in the past year to charity; the General Social Survey found 77 percent donated more than $25, Price and Smith reported in their presentation on “Religion and Monetary Donations: We All Give Less Than We Think.”

In one indication of the gap between perception and reality, 10 percent of the respondents to the generosity survey reported tithing 10 percent of their income to charity although their records showed they gave $200 or less.

Two thoughts, more about methodological issues than the subject at hand:

1. What people say on surveys or in interviews doesn’t always match what they actually do. There are a variety of reasons for this, not all malicious or intentional. But, this leads me to thought #2…

2. I like the way some of these studies make use of multiple sources of data to find the disconnect between what people say and what they do. When looking at an important area of social life, like altruism, having multiple sources of data goes a long way. Measuring attitudes is often important in and of itself, but we also need data on practices and behaviors.
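
As a rough illustration of what linking self-reports to giving records could look like (the records below are made up and are not from either survey):

```python
# Invented records pairing what respondents said about tithing with what
# their giving records show, to measure the say-do gap directly.
respondents = [
    {"id": 1, "income": 50_000, "claims_tithe": True,  "recorded_gifts": 180},
    {"id": 2, "income": 80_000, "claims_tithe": True,  "recorded_gifts": 8_500},
    {"id": 3, "income": 40_000, "claims_tithe": False, "recorded_gifts": 25},
]

claimers = [r for r in respondents if r["claims_tithe"]]
overstaters = [r for r in claimers
               if r["recorded_gifts"] / r["income"] < 0.10]

print(f"{len(overstaters)} of {len(claimers)} self-described tithers "
      f"gave less than 10% of income according to their records")
```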


Mapping Walmart’s rise

I’ve seen this before but it is still a cool set of maps: watch Walmart expand across the United States.

Several things I like about this:

1. The flowing data is a nice touch as you can see changes over time. Maps can sometimes appear to be static, but merging them into a time-series presentation makes the data more dynamic.

2. This is a reminder that Walmart began as a Southern regional retailer that then expanded greatly. Though it is hard to remember a time when Walmarts were not located pretty much everywhere, its rise was not inevitable and it was relatively recent.

3. I remember the first Walmart that opened in our area in the Chicago suburbs. It did seem like an oddity as it had such a range of products and low prices. For example, I bought many of my first CDs there as it was significantly cheaper than the local music stores like Tower Records or Sam Goody. Of course, that initial store looks paltry compared to the more recent editions that feature even more products including a full grocery section.

4. I wonder if this couldn’t be enhanced with some other layers of data. Perhaps color shadings for each state that would show Walmart’s share of the retail market. Or the sales figures for each state. Or the number of Walmart employees per state. For critics, perhaps the number of local businesses that were forced out of business by Walmart (though this would probably be difficult to quantify).

Sociologist: “one-year change in test results doesn’t make a trend”

A sociologist provides some insights into how firms “norm” test scores from year to year and what this means for how to interpret the results:

The most challenging part of this process, though, is trying to place this year’s test results on the same scale as last year’s results, so that a score of 650 on this year’s test represents the same level of performance as a score of 650 on last year’s test. It’s this process of equating the tests from one year to the next which allows us to judge whether scores this year went up, declined or stayed the same. But it’s not straightforward, because the test questions change from one year to the next, and even the format and content coverage of the test may change.

Different test companies even have different computer programs and statistical techniques to estimate a student’s score and, hence, the overall picture of how a student, school or state is performing. (Teachers too, but that’s a subject for another day.)

All of these variables – different test questions from year to year; variations in test length, difficulty and content coverage; and different statistical procedures to calculate the scores – introduce some uncertainty about what the “true” results are…

In testing, every year is like changing labs, in somewhat unpredictable ways, even if a state hires the same testing contractor from one year to the next. For this reason, I urge readers to not react too strongly to changes from last year to this year, or to consider them a referendum on whether a particular set of education policies – or worse, a particular initiative – is working.

One-year changes have many uncertainties built into them; if there’s a real positive trend, it will persist over a period of several years. Schooling is a long-term process, the collective and sustained work of students, teachers and administrators; and there are few “silver bullets” that can be counted on to elevate scores over the period of a single school year.

Overall, this piece gives us some important things to remember: one data point is hard to put into context, and while you can draw a line between two data points, it takes more data points to get a real indication of what is happening over time. Just having statistics isn’t enough; we also need to consider the reliability and validity of the data. Politicians and administrators seem to like test scores because they offer concrete numbers that can help them point out progress or suggest that changes need to be made. Yet just because these are numbers doesn’t mean there isn’t a process behind them or that we can skip understanding exactly what the numbers involve.
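
A toy simulation makes the point about one-year changes; the score level and noise range below are assumptions, not figures from any actual state test:

```python
import random

random.seed(0)

# A school whose "true" performance is flat at 650, observed each year with
# +/- 10 points of measurement and equating noise.
true_score, noise = 650, 10
observed = [true_score + random.uniform(-noise, noise) for _ in range(6)]

print([round(s) for s in observed])
print("one-year change:", round(observed[1] - observed[0], 1))
print("six-year change:", round(observed[-1] - observed[0], 1))
# A single-year jump or dip can look like progress (or decline) even though
# nothing real changed; a genuine trend has to persist across several years
# to stand out from this kind of noise.
```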

Using Twitter to predict when you will get sick with 90% accuracy

A new study uses tweets in New York City to predict when a user will get sick – and does so with 90% accuracy.

Using 4.4 million tweets with GPS location from over 630,000 users in New York City, Sadilek and his team were able to predict when an individual would get sick with the flu and tweet about it up to eight days in advance of their first symptoms. Researchers found they could predict said results with 90 percent accuracy.

Similar to Google’s Flu trends, which uses “flu” search trends to pinpoint where and how outbreaks are spreading, Sadilek’s system uses an algorithm to differentiate between alternative definitions of the word ‘sick.’ For example, “My stomach is in revolt. Knew I shouldn’t have licked that door knob. Think I’m sick,” is different from “I’m so sick of ESPN’s constant coverage of Tim Tebow.”

Of course, Sadilek’s system isn’t an exhaustive crystal ball. Not everyone tweets about their symptoms and not everyone is on Twitter. But considering New York City has more Twitter users than any other city in the world, the Big Apple is as good a place as any for this study.
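
The disambiguation step can be illustrated with a deliberately crude keyword sketch; the actual study trained a machine-learning classifier on labeled tweets, so treat this only as a toy version of the problem:

```python
# Toy rules for separating "sick" meaning ill from figurative uses of "sick".
ILLNESS_CUES = {"flu", "fever", "cough", "stomach", "symptoms", "throat"}
FIGURATIVE_CUES = {"sick of", "so sick of", "makes me sick"}

def looks_like_illness(tweet: str) -> bool:
    text = tweet.lower()
    if any(phrase in text for phrase in FIGURATIVE_CUES):
        return False
    return "sick" in text and any(cue in text for cue in ILLNESS_CUES)

print(looks_like_illness("Knew I shouldn't have licked that door knob. "
                         "Stomach is in revolt. Think I'm sick"))      # True
print(looks_like_illness("I'm so sick of ESPN's constant coverage"))   # False
```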

While one could look at this and marvel at the power of Twitter, I think the real story here is about two things: (1) the power of big data and (2) the power of social networks that Twitter harnesses. If you have people volunteering information about their lives, access to the data, and information about who users are connected to, you can do things that would have been very difficult even ten years ago.

It is interesting that this study was conducted in New York City where there is a high percentage of Twitter users. How good are predictions in cities with lower usage rates? Are we headed toward a world where public health requires people to report on their health so that outbreaks can be contained or quelled?

Questions about a study of the top Chicago commuter suburbs

The Chaddick Institute for Metropolitan Development at DePaul just released a new study that identifies the “top [20] transit suburbs of metropolitan Chicago.” Here are the top 10, starting with the top-ranked suburb: LaGrange, Wilmette, Arlington Heights, Glenview, Elmhurst, Wheaton, Downers Grove, Naperville, Des Plaines, and Mount Prospect. Here are the criteria used to identify these suburbs:

The DePaul University team considered 45 measurable factors to rank the best transit suburbs based on their:

1. Station buildings and platforms;

2. Station grounds and parking;

3. Walkable downtown amenities adjacent to the station; and

4. Degree of community connectivity to public transportation, as measured by the use of commuter rail services.

A few things strike me as interesting:

1. These tend to be wealthier suburbs but not the wealthiest. On one hand, this seems strange as living in a nicer place doesn’t necessarily translate into nicer mass transit facilities (particularly if more people can afford to drive). On the other hand, having a thriving, walkable downtown nearby is probably linked to having the money to make that happen.

2. There are several other important factors that influence which suburbs made the list:

Communities in the northern and northwestern parts of the region tended to outperform those in the southern parts, with much of the differences due to their published Walk Scores. Similarly, communities on the outer periphery of the region tend to have lower scores due to the tendency for the density of development to decline as one moves farther from downtown Chicago. As a result, both Walk Scores and connectivity to transit tended to be lower in far-out suburbs than closer-in ones.

It might be more interesting here to pick out suburbs that buck these trends and have truly put a premium on attractive transportation options. For example, can a suburb 35 miles out of Chicago put together mass transit facilities that truly draw new residents, or does the distance simply matter too much?

3. I’m not sure why they didn’t include “city suburbs.” Here is the explanation from the full report (p.11 of the PDF):

All suburbs with stations on metropolitan Chicago’s commuter-rail system, whether they are located in Illinois or Indiana, are considered for analysis except those classified as city suburbs, such as Evanston, Forest Park, and Oak Park, which have CTA rapid transit service to their downtown districts. Gary, Hammond, and Whiting, Indiana, also are generally considered cities or city suburbs rather than conventional suburbs, because all of these communities have distinct urban qualities. To assure meaningful and fair comparisons, these communities were not included in the study.

Hammond is not a “conventional suburb”? CTA service isn’t a plus over Metra commuter rail service?

4. The included suburbs had to meet three criteria (p.11 of the PDF):

1) commuter-rail service available seven days a week, with at least 14 inbound departures on weekdays, including some express trains;
2) at least 150 people who walk or bike to the train daily; and
3) a Walk Score of at least 65 on a 100-point scale at its primary downtown station (putting it near the middle of the category, described as “somewhat walkable”).

These are fairly strict criteria, so not that many Chicago suburbs qualified for the study (p.11 of the PDF):

Twenty-five communities, all on the Metra system, met these three criteria (Figure 2). All were adjacent to downtown districts that support a transit-oriented lifestyle and tend to have a transit culture that many find appealing. Numerous communities, such as Buffalo Grove, Lockport, and Orland Park, were not eligible because they do not currently meet the first criteria, relating to train frequency. Some smaller suburbs, such as Flossmoor, Kenilworth and Glencoe, while heavily oriented toward transit, lack diversified downtown amenities and the services of larger stations, and therefore did not have published Walk Scores above the minimum threshold of 65.

I can imagine what might happen: all the suburbs in the top 20 are going to proclaim that they are a top-20 commuter suburb! But it was only out of 25…

5. There are some other intriguing methodological bits here. Stations earned points for having coffee available or displaying railroad heritage. Parking lot lighting was measured this way (p.24 of the PDF):

The illumination of the parking lot was evaluated using a standard light meter. Readings were collected during the late-evening hours between June 23 and July 5, 2012 at three locations in the main parking lots:
1) locations directly under light poles (which tend to be the best illuminated parts of the lots);
2) locations midway between the light poles (which tend to be among the most poorly illuminated parts of the lot); and
3) tangential locations, 20 and 25 feet perpendicular to the alignment of light poles and directly adjacent to the poles (in some cases, these areas having lighting provided from lamps on adjacent streets).

At least three readings were collected for category 1 and at least two readings were collected for categories two and three.

There is no widely accepted standard on parking lot lighting that balances aesthetics and security. Research suggests, however, that lighting of 35 or more lumens is preferable, but at a minimum, 10 lumens is necessary for proper pedestrian activity and safety. Scores for parking lot illumination were based on a relative scale, as noted below. In effect, the scale grades on a “curve”, resulting in a relatively equal distribution of high and low scores for each category. In several instances, Category 3 readings were not possible due to the configuration of the parking lot. In these instances, final scores were determined by averaging the Category 1 and 2 scores.
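
The report describes the averaging and the “curve” but not an exact formula, so here is a hypothetical reconstruction of the scoring step; the lumen readings, the fallback rule, and the 1-to-10 curve are all assumptions rather than the study’s actual numbers:

```python
from statistics import mean

def lot_illumination(cat1, cat2, cat3=None):
    """Average the light-meter readings within each category, then across
    categories; Category 3 readings were not always possible, so fall back
    to Categories 1 and 2 when they are missing."""
    category_means = [mean(cat1), mean(cat2)]
    if cat3:
        category_means.append(mean(cat3))
    return mean(category_means)

# Made-up lumen readings for three hypothetical station parking lots.
readings = {
    "Station A": lot_illumination([42, 40, 38], [12, 10], [20, 18]),
    "Station B": lot_illumination([30, 28, 33], [8, 7]),            # no Category 3
    "Station C": lot_illumination([55, 50, 52], [22, 20], [30, 28]),
}

# "Grading on a curve": rank the lots and spread scores evenly from 1 to 10,
# so each lot's score reflects its standing relative to the others measured.
ranked = sorted(readings, key=readings.get)
scores = {name: round(1 + 9 * i / (len(ranked) - 1)) for i, name in enumerate(ranked)}
print(scores)
```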

I don’t see any evidence that commuters themselves were asked about the amenities, though there was some direct observation. Why not also get information directly from those who consistently use the facilities?

Overall, I’m not sure how useful this study really is. I can see how it might be utilized by some interested parties, including real estate agents and planners, but I don’t know that it captures enough of the full commuting experience available to Chicago-area suburbanites.