On the soc undergrad resume: data collection and analysis

Graduating sociology majors have worked on their resumes and tried to sum up their training for prospective employers. Following up on yesterday’s post on the importance of data in sociology, in my opinion these graduates should list data collection and analysis among their resume skills. Here are a few reasons why:

  1. From the beginning of their sociology training, we work to help students observe and interpret patterns in the social world. While there is no single class that does this all at once, the path from beginning to end is full of opportunities both to see how sociologists do this and to try their own hand at developing sociological arguments. Final papers in any class (as well as other assignments) offer opportunities to practice data analysis and interpretation.
  2. Sociology majors do tend to take classes explicitly devoted to Statistics and/or Research Methods. For example, while many people think they can put a survey together, it is in these classes that they learn important basics: what sample do you want? How do you ask good questions? How do you report survey data? At the least, these classes help undergraduates know what questions to ask about data collection and analysis; at their best, they give students chances to practice these skills.
  3. Organizations – from non-profits to businesses to governments – want people with data collection and analysis skills. Now that it is easier than ever to work with data (though we should not underestimate the value of collecting good data in the first place), how can a prospective employee help the organization understand and communicate what is in the data? In a world awash with data, what do we do with it all?
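The survey basics mentioned in point 2 can be made concrete with a quick sketch of sampling error; the sample size and proportion below are hypothetical, chosen only for illustration.

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion p based on n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical survey: 52% of 400 respondents agree with a statement.
moe = margin_of_error(0.52, 400)
print(f"52% +/- {moe:.1%}")  # roughly +/- 4.9 percentage points
```

Even this simple formula shows why sample size matters: quadrupling the sample only halves the margin of error.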

Undergraduates may be leery of claiming these skills since they do not view themselves as experts and don’t have years of work experience in data analysis. Yet these abilities are at the heart of sociology, and they are skills that are in demand.

Using statistics to find lost airplanes

Here is a quick look at how Bayesian statistics helped find Air France 447 in the Atlantic Ocean:

Stone and co are statisticians who were brought in to reëxamine the evidence after four intensive searches had failed to find the aircraft. What’s interesting about this story is that their analysis pointed to a location not far from the last known position, in an area that had almost certainly been searched soon after the disaster. The wreckage was found almost exactly where they predicted at a depth of 14,000 feet after only one week’s additional search…

This is what statisticians call the posterior distribution. To calculate it, Stone and co had to take into account the failure of four different searches after the plane went down. The first was the failure to find debris or bodies for six days after the plane went missing in June 2009; then there was the failure of acoustic searches in July 2009 to detect the pings from underwater locator beacons on the flight data recorder and cockpit voice recorder; next, another search in August 2009 failed to find anything using side-scanning sonar; and finally, there was another unsuccessful search using side-scanning sonar in April and May 2010…

That’s an important point. A different analysis might have excluded this location on the basis that it had already been covered. But Stone and co chose to include the possibility that the acoustic beacons may have failed, a crucial decision that led directly to the discovery of the wreckage. Indeed, it seems likely that the beacons did fail and that this was the main reason why the search took so long.

The key point, of course, is that Bayesian inference by itself can’t solve these problems. Instead, statisticians themselves play a crucial role in evaluating the evidence, deciding what it means and then incorporating it in an appropriate way into the Bayesian model.

It is not just about knowing where to look – it is also about knowing how to look. Finding a needle in a haystack is a difficult business whether it is looking for small social trends in mounds of big data or finding a crashed plane in the middle of the ocean.

This is also a good reminder that a single search in such circumstances may not be enough. When working with data, failures are not necessarily bad as long as they help move toward a solution.
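The kind of Bayesian updating described in the excerpt can be sketched in a few lines. The grid cells, prior, and detection probabilities below are made up for illustration, not drawn from the actual Air France 447 search: a failed search lowers, but does not eliminate, the probability for the searched cells.

```python
def update_on_failed_search(prior, searched, detection_prob):
    """Posterior over cells after a search of some cells finds nothing.

    prior: dict mapping cell -> probability the wreck is there
    searched: set of cells covered by the search
    detection_prob: chance the search finds the wreck if it is in a searched cell
    """
    unnorm = {c: p * ((1 - detection_prob) if c in searched else 1.0)
              for c, p in prior.items()}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

# Toy example: three cells, with cell A near the last known position.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
# A is searched with 90% assumed detection -- but if the locator beacons
# failed, the effective detection rate was much lower, so A stays plausible.
post_working = update_on_failed_search(prior, {"A"}, 0.9)
post_failed_beacon = update_on_failed_search(prior, {"A"}, 0.2)
print(post_working["A"], post_failed_beacon["A"])
```

The choice of detection probability is exactly the kind of judgment call the article credits the statisticians with: allowing for beacon failure keeps the area near the last known position worth a second look.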

Argument: scientists need help in handling big data

Collecting, analyzing, and interpreting big data may just be a job that requires more scientists:

For projects like NEON, interpreting the data is a complicated business. Early on, the team realized that its data, while mid-size compared with the largest physics and biology projects, would be big in complexity. “NEON’s contribution to big data is not in its volume,” said Steve Berukoff, the project’s assistant director for data products. “It’s in the heterogeneity and spatial and temporal distribution of data.”

Unlike the roughly 20 critical measurements in climate science or the vast but relatively structured data in particle physics, NEON will have more than 500 quantities to keep track of, from temperature, soil and water measurements to insect, bird, mammal and microbial samples to remote sensing and aerial imaging. Much of the data is highly unstructured and difficult to parse — for example, taxonomic names and behavioral observations, which are sometimes subject to debate and revision.

And, as daunting as the looming data crush appears from a technical perspective, some of the greatest challenges are wholly nontechnical. Many researchers say the big science projects and analytical tools of the future can succeed only with the right mix of science, statistics, computer science, pure mathematics and deft leadership. In the big data age of distributed computing — in which enormously complex tasks are divided across a network of computers — the question remains: How should distributed science be conducted across a network of researchers?

Two quick thoughts:

1. There is a lot of potential here for crossing disciplinary boundaries to tackle big data projects. This isn’t just about parceling out individual pieces of the project; bringing multiple perspectives together could lead to an improved final outcome.

2. I wonder if sociologists aren’t particularly well-suited for this kind of big data work. Given our emphasis on theory and methods, we stress both the big picture and how to effectively collect, analyze, and interpret data. Sociology students could step into such projects and provide needed insights.

A route sociology majors can take: data analyst

I try to remind my students in Statistics and Social Research that many industries need people who can collect and analyze data. I was reminded of this when I saw an obituary for a sociologist who had gone on to become a well-known medical data analyst:

A professor in the Department of Health Services at the UCLA Fielding School of Public Health, [E. Richard] Brown founded the UCLA Center for Health Policy Research in 1994.

One of the center’s major activities has been the development of the California Health Interview Survey, the premier source of information about individual and household health status in California. It has served as a model for health surveys for other states.

Brown was the founder and principal investigator for the survey, which produced its first data from interviews with more than 55,000 California households in 2001. Information from the survey, which has been conducted every two years, has been used by policymakers, community advocates, researchers and others.

And working with important data can then lead to public policy options:

“The single thing that makes Rick stand out in this field is that he had an extraordinary capacity to use evidence about the public’s health and strategize and advocate to turn that evidence into the best policy and action,” said Dr. Linda Rosenstock, dean of the UCLA Fielding School of Public Health.

In 1990, Brown was co-author of California’s first single-payer healthcare legislation. He also co-wrote several other healthcare reform bills over the last two decades…

He also was a full-time senior consultant to President Clinton’s Task Force on National Health Care Reform and served as a senior health policy advisor for the Barack Obama for President Campaign — as well as serving as an advisor to U.S. Sens. Bob Kerrey, Paul Wellstone and Al Franken.

We need more people to collect useful data and then interpret what they mean. These days the problem often is not a lack of information; rather, we need to know how to separate the good data from the bad and then provide a useful interpretation. While some students may prefer to skip over the methodological sections of articles or books, understanding how to collect and analyze data can go a long way. Additionally, learning about these methods can help one move toward a sociological view of the social world, where personal anecdotes matter less than broad trends and the relationships among social factors (variables).

Quick Review: American Grace

I recently wrote about a small section of American Grace but I have had a chance to complete the full book. Here are my thoughts about this broad-ranging book about religion in America:

1. On one hand, I like the broad overview. There is a lot of data and analysis here about American religion. If someone had to pick up one book about the topic, this wouldn’t be a bad one to choose. I also liked some of the historical insights, including the idea that what we see now in American religion is the fallout of the upheavals of the 1960s and the two counterreactions that followed.

2. On the other hand, I’m not sure this book provides much new information. There is a lot of research contained in this book, but much of it is already out there. The authors try to produce new insights from their own survey, but I think this is an issue in itself: after reading the full book, it was somewhat unclear why the authors undertook two waves of the Faith Matters Survey. The questions led to some new insights (like feelings toward the construction of a large religious building nearby), but much of it seemed duplicated, and the short period between the waves didn’t help.

3. There is a lot of talk about data analysis and interpretation in this book. While it is aimed at a more general audience, the authors are careful in their explanations. For example, they explain, over and over again, what exactly a correlation means: it indicates a relationship between variables, but causation is unclear. Elsewhere, the authors explain exactly why they asked the questions they did and discuss the quality of the data. Some of these little descriptions would be useful in basic statistics or research classes. On the whole, they do a nice job of explaining how they interpret the data, though I wonder how this might play with a general public that might just want the takeaway points. Perhaps this is why one reviewer thought this text was so readable!

4. Perhaps as a counterpoint to the discussions of data, the book includes a number of vignettes regarding religious congregations. These could be quite lengthy and I’m not sure that they added much to the book. They don’t pack the same punch as the representative characters of a book like Habits of the Heart and sometimes seem like filler.

5. The book ends with the conclusion that Americans can be both religiously diverse and devoted because of the many relationships between people of different faiths and denominations. On the whole, the authors suggest most people are in the middle regarding religion, not too confident in the idea that their religion is the only way but unwilling to say that having no religion is the way to go. I would like to have read more about how this plays out within religious congregations: how do religious leaders then talk doctrine, or has everyone simply shifted to a more accommodating approach? Additionally, why doesn’t this lead down the path of secularization? From a societal perspective, religious pluralism may be desirable, but is it also desirable for smaller groups?
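The correlation caveat in point 3 can be illustrated with a toy simulation (the data here are synthetic, not from the Faith Matters Survey): a third variable can drive two others, producing a strong correlation with no causal link between them.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
# A confounder z drives both x and y; x and y never influence each other.
z = [random.gauss(0, 1) for _ in range(1000)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]
print(round(pearson(x, y), 2))  # strongly correlated, yet neither causes the other
```

This is the pattern the authors have to rule out, question by question, before reading anything causal into their survey results.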

On the whole, this book is a good place to start if one is looking for an overview of American religion. But, if one is looking for more detailed research and discussion regarding a particular topic, one would be better served going to those conducting research within these specific areas.

Using a sociological approach in “e-discovery technologies”

Legal cases can generate a tremendous number of documents that each side needs to examine. With new searching technology, legal teams can now go through a lot more data for a lot less money. In one example, “Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.” But within this discussion, the writer suggests that these searches can be done in two ways:

E-discovery technologies generally fall into two broad categories that can be described as “linguistic” and “sociological.”

The most basic linguistic approach uses specific search words to find and sort relevant documents. More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”

The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls…

The Cataphora software can also recognize the sentiment in an e-mail message — whether a person is positive or negative, or what the company calls “loud talking” — unusual emphasis that might give hints that a document is about a stressful situation. The software can also detect subtle changes in the style of an e-mail communication.

A shift in an author’s e-mail style, from breezy to unusually formal, can raise a red flag about illegal activity.

So this second technique gets branded as “sociological” because it is looking for patterns of behavior and interaction. If you wondered how the programmers set up their code to do this kind of analysis, it sounds like some academics have been working on the problem for almost a decade:

[A computer scientist] bought a copy of the database [of Enron emails] for $10,000 and made it freely available to academic and corporate researchers. Since then, it has become the foundation of a wealth of new science — and its value has endured, since privacy constraints usually keep large collections of e-mail out of reach. “It’s made a massive difference in the research community,” Dr. McCallum said.

The Enron Corpus has led to a better understanding of how language is used and how social networks function, and it has improved efforts to uncover social groups based on e-mail communication.

Were any sociologists involved in this project to provide input on what the programs should look for in human interactions?

This sort of analysis software could be very handy for sociological research when one has hundreds of documents or sources to look through. Of course, the algorithms might have to be changed for specific projects or settings, but I wonder if this sort of software might be widely available in a few years. Would this analysis be better than going through documents one by one in coding software like ATLAS.ti or NVivo?
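As a rough sketch of the “who talks to whom” analysis described above (the messages and function here are hypothetical, not Cataphora’s actual software), one could tally sender-recipient pairs across a corpus of e-mails:

```python
from collections import Counter

def interaction_counts(messages):
    """Tally sender->recipient pairs from (sender, recipients) messages."""
    pairs = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            pairs[(sender, recipient)] += 1
    return pairs

# Hypothetical messages standing in for a corpus like the Enron e-mails.
messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
]
for (sender, recipient), n in interaction_counts(messages).most_common():
    print(sender, "->", recipient, n)
```

The resulting pair counts are the raw material for the network visualizations the article describes: who communicates with whom, and how often.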

Interpreting the FBI’s 2009 hate crime report

Hate crime legislation is a topic that seems to rile people up. The Atlantic provides five sources that try to summarize and make sense of the latest annual data released by the FBI:

Agence France-Presse reports that “out of 6,604 hate crimes committed in the United States in 2009, some 4,000 were racially motivated and nearly 1,600 were driven by hatred for a particular religion … Blacks made up around three-quarters of victims of the racially motivated hate crimes and Jews made up the same percentage of victims of anti-religious hate crimes.” The report also notes that “anti-Muslim crimes were a distant second to crimes against Jews, making up just eight percent of the hate crimes driven by religious intolerance.” Finally, the report notes a drop in hate crimes overall: “Some 8,300 people fell victim to hate crimes in 2009, down from 9,700 the previous year.”

This is a reminder that there is a lot of data out there, particularly generated by government agencies, but we need qualified and skilled people to interpret its meaning.

You can find the data on hate crimes at the FBI website of uniform crime reports. Here is the FBI’s summary of the incidents, 6,604 in all.
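The shares in the quoted report can be recovered from its own counts (the counts below are the approximate figures as reported, not exact FBI tallies):

```python
# Approximate counts from the quoted 2009 FBI hate crime report.
total_incidents = 6604
racial = 4000
religious = 1600
victims_2009, victims_2008 = 8300, 9700

print(f"racially motivated: {racial / total_incidents:.0%}")        # about 61%
print(f"religiously motivated: {religious / total_incidents:.0%}")  # about 24%
print(f"drop in victims: {(victims_2008 - victims_2009) / victims_2008:.0%}")
```

Even this quick arithmetic is a form of interpretation: the headline percentages depend on which base (incidents vs. victims, totals vs. subcategories) one divides by.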