Sharing data among scientists vs. “Big Data”

In the push to make data available so that other researchers can verify results, scientists have run up against one kind of data that is not made publicly available: “big data” from big Internet firms.

The issue came to a boil last month at a scientific conference in Lyon, France, when three scientists from Google and the University of Cambridge declined to release data they had compiled for a paper on the popularity of YouTube videos in different countries.

The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience…

At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”

The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.”

Who will win the battle between proprietary data and science? The article makes it sound like scientists are all on one side, particularly because of an interest in fighting issues like scientific fraud. At the same time, either scientific journals are not “enforcing” their guidelines or the individual scientists publishing in these journals are not following them.

The other side of this debate is not presented in this story: what do these big Internet firms, like Google, Yahoo, and Facebook, think about sharing this data? This is not a small issue: these firms are spending a good amount of money on analyzing this data and are probably hoping to use it for their own business and research purposes. For example, Microsoft recently set up a lab with several well-known researchers in New York City. Would the social scientists who work in such labs want to insist that the data be open? Should these companies have to open up their proprietary data to satisfy the requirements of the larger scientific community?

I suspect this will be an ongoing issue as social scientists look to analyze more innovative data that big companies have collected and that are more difficult for researchers to collect on their own. Will researchers be willing to forgo sharing this kind of data with the wider scientific community if they can get their hands on unique data?

The NFL says the “All-22” camera angle is proprietary information

The NFL is a TV ratings powerhouse and makes billions each year on selling television rights. However, fans don’t see the same action that the league and teams watch because the league claims its “All-22” view is proprietary information:

If you ask the league to see the footage that was taken from on high to show the entire field and what all 22 players did on every play, the response will be emphatic. “NO ONE gets that,” NFL spokesman Brian McCarthy wrote in an email. This footage, added fellow league spokesman Greg Aiello, “is regarded at this point as proprietary NFL coaching information.”

For decades, NFL TV broadcasts have relied most heavily on one view: the shot from a sideline camera that follows the progress of the ball. Anyone who wants to analyze the game, however, prefers to see the pulled-back camera angle known as the “All 22.”

While this shot makes the players look like stick figures, it allows students of the game to see things that are invisible to TV watchers: like what routes the receivers ran, how the defense aligned itself and who made blocks past the line of scrimmage.

By distributing this footage only to NFL teams, and rationing it out carefully to its TV partners and on its web site, the NFL has created a paradox. The most-watched sport in the U.S. is also arguably the least understood. “I don’t think you can get a full understanding without watching the entirety of the game,” says former head coach Bill Parcells. The zoomed-in footage on TV broadcasts, he says, only shows a “fragment” of what happens on the field.

Why does the NFL do this? Here are a few plausible scenarios:

1. It can do it so it will. The NFL won’t be bullied into doing something it doesn’t want to do. As long as the money keeps pouring in for TV rights, there is little pressure the public can put on the league for this footage.

1a. If enough fans and commentators picked up on this, could they force the NFL’s hand? It seems unlikely.

2. The NFL makes billions on TV rights and perhaps wants to package this video in a certain way. A later part of the story suggests the NFL has quietly floated the idea of selling access to this footage.

3. The league is worried about legitimate football competitors. There are not currently any viable rival leagues, but one could emerge again.

4. The league thinks this is the core data of the NFL, what actually happens on all plays, and will go to great lengths to protect its “intellectual property.” I find this a little hard to believe: aren’t there plenty of people who could understand and scheme what happens on a football field even if the primary camera angle doesn’t show it? Are teams really that worried about what the public might see or that other teams are missing things in the video?

The most dangerous American neighborhoods

WalletPop has its second annual list of the most dangerous neighborhoods in America:

For the second year in a row, using exclusive data developed by Dr. Andrew Schiller’s team, and based on FBI data from all 17,000 local law enforcement agencies, WalletPop reveals the top 25 most dangerous neighborhoods with the highest predicted rates of violent crime in America.

This year, Chicago took the not-so coveted top spot from Cincinnati for the most dangerous neighborhood, while Atlanta has the highest number of neighborhoods making the list (four).

You may ask, why neighborhoods and not cities? Schiller explains that even the cities with the highest crime rates can have relatively safe neighborhoods, and thus it is less useful to generalize about an entire city.

The reason for looking at neighborhoods rather than cities is a good one: most American cities are quite large, so city-level data is not very useful. To see the data for the Chicago neighborhood that tops the list, check out this page. At least one write-up of the list seems to have made an interpretation error with the data:

According to the info, anyone walking down Lake Street between Damen and Western has a 1 in 4 chance of being a victim of a crime. Those who choose to live there face the same odds with the chances of being robbed.

As far as I can tell, the neighborhood crime rates apply to people living there for a full year, not people just walking in the neighborhood.

The website that this crime data was developed for seems to have some interesting proprietary data. When you enter a zip code, you can purchase a full report, though they leak out a few interesting tidbits. According to the website, zip code 60187 (Wheaton, IL) is “More sophisticated than 97% of U.S. neighborhoods. More walkable than 65%.”