The possibilities of linking together sets of data

I saw multiple interesting presentations at ASA this year that linked together several datasets to develop robust analysis and interesting findings. These data sources included government data, data collected by the researchers, and other available data. Doing this unlocks a lot of possibilities for answering research questions.


But how might this happen more regularly? Or, put differently, how might more researchers use multiple datasets in a single project? Here are some quick thoughts on what could help make this possible:

-More access to data. Some data is publicly available. Other data is restricted for a variety of reasons. Having more big datasets accessible opens up possibilities. Even knowing where to request data takes effort, as does completing whatever applications and securing whatever resources are needed to access it.

-Having the know-how to put datasets together. It takes work to become familiar with a single dataset. Merging datasets requires additional work. I do not know whether more instruction in doing this would be useful or whether it matters which individual datasets are involved.

-Asking research questions gets more interesting and complicated with more variables and layers at play. Constructing sets of questions that build on the strengths of the combined data is a skill.

-Including more – but concise and understandable – explanations of how the data was merged in publications can help demystify the process.
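To make the merging step concrete, here is a minimal sketch of joining two datasets on a shared identifier. Everything in it is an assumption for illustration: the `census` and `survey` records, the `tract_id` key, and the `left_merge` helper are hypothetical and do not come from any of the presentations. In practice many researchers would use a statistics package or a library such as pandas for this; the sketch uses only the Python standard library.

```python
# Hypothetical example data: neither dataset comes from the post.
census = [
    {"tract_id": "001", "median_income": 52000},
    {"tract_id": "002", "median_income": 61000},
    {"tract_id": "003", "median_income": 47000},
]
survey = [
    {"tract_id": "001", "respondents": 40},
    {"tract_id": "003", "respondents": 25},
    {"tract_id": "004", "respondents": 12},
]


def left_merge(left, right, key):
    """Left-join two lists of dicts on `key` and report unmatched keys."""
    # Index the right-hand dataset by key, refusing duplicate keys so that
    # the join cannot silently pick one of several candidate rows.
    right_by_key = {}
    for row in right:
        if row[key] in right_by_key:
            raise ValueError(f"duplicate key in right dataset: {row[key]}")
        right_by_key[row[key]] = row

    merged = []
    left_only = []  # keys present in `left` with no match in `right`
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            left_only.append(row[key])
        # Keep every left row; add the right-hand fields when a match exists.
        merged.append({**row, **(match or {})})

    left_keys = {row[key] for row in left}
    # Keys present in `right` that never appear in `left`.
    right_only = sorted(k for k in right_by_key if k not in left_keys)
    return merged, left_only, right_only


merged, left_only, right_only = left_merge(census, survey, "tract_id")
print(len(merged), left_only, right_only)
```

Reporting the unmatched keys alongside the merged rows is one way to produce the kind of short, understandable account of a merge that a publication could include: how many cases were linked and how many were left out on each side.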

And with all of this data innovation, it is interesting to consider how projects that link multiple datasets complement projects that draw on only one source of data.

“The most closely studied troublemakers in history”

See this story for how a large study of Boston’s youths begun in 1939 sheds light on the recent arrest of mobster James “Whitey” Bulger:

It all began in 1939, when husband-and-wife researchers Sheldon and Eleanor Glueck assembled a team of investigators to go door to door through a number of poor Boston neighborhoods and collect data on boys who had grown up there. Their goal was to understand what causes some boys and not others to get involved with crime, a question which, as it happened, would be dramatically brought to life in the story of Whitey Bulger and his overachieving brother in the state Senate, William.

The Gluecks picked a sample of 1,000 boys, half of whom had stayed out of trouble while the other half had racked up records and gotten themselves locked up at one of two local reform schools, Lyman and Shirley. The boys were interviewed repeatedly – once when they were around 14, then again when they were 25 and 32 – as were their teachers, parents, and neighbors. Their world – Whitey’s world – was carefully documented, and their lives were charted as they grew from adolescents into adults…

The original researchers didn’t publish all of their data, and several decades later two criminologists dug into the data and interviewed some of the original participants. Here is what they found:

Their study earned Laub and Sampson accolades in their field for their insights into the nature of crime. But it also points to a few truths specifically about Boston, and the way the city shaped the Glueck boys while they grew into the Glueck men. It mattered a lot where these boys came from, Laub and Sampson concluded: The city had influenced them like no other city could have. Specifically, according to Sampson, it had made them cynical about authority.

All the poor neighborhoods in Boston were isolated to some degree in the 1940s: As Sampson and Laub discovered, kids who grew up in ethnic enclaves like Southie or the North End during that time did not identify with the city as a whole. Their lives were just too separate from everyone else’s, their daily routines too local. Plus, they knew the people who ran the show on Beacon Hill thought of their neighborhoods as slums, and they resented it.

This is an interesting piece because such large studies can offer a wealth of data and insights. It makes me wonder if other large datasets would benefit from teams of researchers later combing through the data to explore different areas and follow up with participants.

This is the sort of information that would help provide a broader context for Bulger’s case, but I suspect the media will mainly stick to his mob background.

Scientists call for more rules and regulations about data

There are a lot of academics and researchers collecting data on a variety of topics. Some scientists argue that we need more regulations about data so that researchers can work with and access data collected by others:

In 10 new articles, also published in Science, researchers in fields as diverse as paleontology and neuroscience say the lack of data libraries, insufficient support from federal research agencies, and the lack of academic credit for sharing data sets have created a situation in which money is wasted and information that could reveal better cancer treatments or the causes of climate change goes by the wayside…

A big problem is the many forms of data and the difficulty of comparing them. In neuroscience, for instance, researchers collect data on scales of time that range from nanoseconds, if they are looking at rates of neuron firing, to years, if they are looking at developmental changes. There are also differences in the kind of data that come from optical microscopes and those that come from electron microscopes, and data on a cellular scale and data from a whole organism…

He added that he was limited by how data are published. “When I see a figure in a paper, it’s just the tip of the iceberg to me. I want to see it in a different form in order to do a different kind of analysis.” But the data are not available in a public, searchable format.

Shared data libraries sound like they could be useful. In my experience, however, even when data is made available, it still takes a good amount of time to download it, read the documentation, and reshape the data so that one can start to replicate findings from journal articles.