In the push to make data available so that other researchers can verify published results, scientists have run into one kind of data that is not being made public: “big data” from big Internet firms.
The issue came to a boil last month at a scientific conference in Lyon, France, when three scientists from Google and the University of Cambridge declined to release data they had compiled for a paper on the popularity of YouTube videos in different countries.
The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience…
At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”
The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.”
Who will win the battle between proprietary data and science? The article makes it sound as if scientists are all on one side, particularly because of their interest in fighting problems like scientific fraud. At the same time, either scientific journals aren’t “enforcing” their guidelines or the individual scientists publishing in these journals aren’t following them.
The other side of this debate is not presented in this story: what do these big Internet firms, like Google, Yahoo, and Facebook, think about sharing this data? This is not a small issue: these firms are spending a good amount of money on analyzing this data and are probably hoping to use it for their own business and research purposes. For example, Microsoft recently set up a lab with several well-known researchers in New York City. Would the social scientists who work in such labs want to insist that the data be open? Should these companies have to open up their proprietary data to satisfy the requirements of the larger scientific community?
I suspect this will be an ongoing issue as social scientists look to analyze more innovative data that big companies have collected and that is more difficult for researchers to collect on their own. Will researchers be willing to forgo sharing this kind of data with the wider scientific community if they can get their hands on unique data?