When a pie chart works for analyzing the lyrics of a song, Hey Jude edition

Earlier this week, a data visualization expert presented a pie chart for the lyrics of The Beatles’ hit “Hey Jude”:

[Pie chart: lyrics of "Hey Jude"]

Pie charts are very effective when you want to show readers that one or two categories make up a large share of what you are examining. In contrast, too many categories, or no clearly dominant category, can render a pie chart less useful. In this case, the word "na" makes up 40% of the lyrics of "Hey Jude." The words in the song's title – "hey" and "Jude" – comprise 14% of the song, and "all other words" – the song has three verses (the fourth one repeats the first verse) and two bridges – account for the remaining 46%.
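The underlying tally is simple to reproduce. Here is a minimal sketch that computes the three slices such a chart would display; the lyric snippet is a made-up stand-in, not the actual song text:

```python
from collections import Counter

# Toy stand-in for the lyric sheet; the real song has far more words
words = "hey jude na na na na na hey jude na na la".lower().split()

counts = Counter(words)
total = len(words)

# Group the words into the three slices the chart uses
slices = {
    "na": counts["na"],
    "hey / Jude": counts["hey"] + counts["jude"],
    "all other words": total - counts["na"] - counts["hey"] - counts["jude"],
}
shares = {label: round(100 * n / total, 1) for label, n in slices.items()}
print(shares)
```

Feeding `shares` to any plotting library's pie function reproduces the chart; the interesting work is deciding how to group the words into slices in the first place.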

This should lead to questions about what made this song such a hit. Does singing "na" over and over again lead to a number one hit and a song played countless times on the radio? The lyrics Paul McCartney wrote out in the studio sold for over $900,000, though there are no written "na"s on that piece of paper. Of course, the song was written and performed by the Beatles, a musical and sociological phenomenon if there ever was one, and the song is hopeful, as Paul aimed to reassure John Lennon's son Julian. Could the song stand on its own as a three-minute single (these first minutes contain few "na"s)? The words are still hopeful, and the way the Beatles stack instruments and harmonies from a relatively quiet first verse through the second bridge is interesting. Yet the "na"s at the end make the song unique, not just for the number of them (roughly four minutes before fading out) but for the spirit in which they are offered (big sound plus Paul improvising over the top).

Thus, the pie graph above does a good job. It points out the lyrical peculiarities of this hit song and hints at deeper questions about the Beatles, music, and what makes songs and cultural products popular.

Maps, distortions, and realities

Maps do not just reflect reality; a new online exhibit at the Boston Public Library looks at how they help shape reality:


Photo by NastyaSensei on Pexels.com

The original topic was to do an exhibition of a classic category of maps called persuasive cartography, which tends to refer to propaganda maps, ads, political campaign maps, maps that obviously you can tell have an agenda. We have those materials in our collections of about a quarter million flat maps, atlases, globes and other cartographic materials. But we decided in recognition of what’s going on now to expand into a bigger theme about how maps produce truth, and how trust in maps and other visual data is produced in media and civil society. So rather than thinking just about maps which are obviously treacherous, distorting, and deceptive, we wanted to think about how every map goes about presenting the world and how they can all reflect biases and absences or incorrect classifications of data. We also wanted to think about this as a way to promote data literacy, which is a critical attitude towards media and data visualizations, to bring together this long history of how maps produce our sense of reality…

We commissioned a special set of maps where we compiled geographic data about the state of Massachusetts across a few different categories, like demographics, infrastructure, and the environment. We gave the data to a handful of cartographers and asked them to make a pair of maps that show different conclusions that disagree with each other. One person made two maps from environmental data on toxic waste sites: one map argues that cities are most impacted by pollution, and the other argues that rural towns are more impacted. So this project was really meant to say, we’d like to think that numbers speak for themselves, but whenever we’re using data there’s a crucial role for the interpreter, and the way people make those maps can really reflect the assumptions they’ve brought into the assignment…

In one section of the show called “How the Lines Get Bent,” we talk about some of the most common cartographic techniques that deserve our scrutiny: whether the data is or isn’t normalized to population size, for example, will produce really different outcomes. We also look at how data is produced by people in the world by looking at how census classifications change over time, not because people themselves change but because of racist attitudes about demographic categorizations that were encoded into census data tables. So you have to ask: What assumptions can data itself hold on to? Throughout the show we look at historic examples as well as more modern pieces to give people questions about how to look at a map, whether it’s simple media criticism, like: Who made this and when? Do they show sources? What are their methods, and what kinds of rhetorical framing like titles and captions do they use? We also hit on geographic analysis, like data normalization and the modifiable area unit problem…
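The normalization point is easy to demonstrate: the same data can support opposite-sounding claims depending on whether you map raw counts or per-capita rates. A minimal sketch with invented numbers for two hypothetical towns:

```python
# Hypothetical data: toxic waste sites and population for two towns
towns = {
    "Big City":   {"sites": 50, "population": 500_000},
    "Small Town": {"sites": 5,  "population": 10_000},
}

# Map 1: raw counts -- Big City looks far worse (50 sites vs. 5)
raw = {name: t["sites"] for name, t in towns.items()}

# Map 2: normalized to population -- Small Town looks far worse
# (5.0 sites per 10,000 residents vs. 1.0)
per_10k = {name: 10_000 * t["sites"] / t["population"] for name, t in towns.items()}

print(raw)
print(per_10k)
```

Neither map is "false"; each choice of denominator bends the lines in a different direction, which is exactly why the normalization decision deserves scrutiny.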

So rather than think about maps as simply being true or false, we want to think about them as trustworthy or untrustworthy and to think about the social and political context in which they circulate. A lot of our evidence of parts of the world we’ve never seen is based on maps: For example, most of us accept that New Zealand is off the Australian coast because we see maps and assume they’re trustworthy. So how do societies and institutions produce that trust, what can be trusted and what happens when that trust frays? The conclusion shouldn’t be that we can’t trust anything but that we have to read things in an informed, skeptical manner and decide where to place our trust.

Another reminder that data does not interpret itself. Ordering reality – which we could argue that maps do regarding spatial information – is not a neutral process. People look at the evidence, draw conclusions, and then make arguments with the data. This extends across all kinds of evidence or data, ranging from statistical evidence to personal experiences to qualitative data to maps.

Educating the readers of maps (and other evidence) is important: as sociologist Joel Best argues regarding statistics, people should not be naive (completely trusting) or cynical (completely rejecting) but rather should be critical (questioning, skeptical). But there is another side to this: how many cartographers and others who produce maps are aware of the possibilities of biased or skewed representations? If they know this, how do they then combat it? There would be a range of cartographers to consider, from people who make road atlases and world maps to those working in media who make maps for the public regarding current events. What guides their processes, and how often do they interrogate their own presentation? Similarly, are people more trusting of maps than they might be of statistics or qualitative data or people’s stories (or personal maps)?

Finally, the interview hints at the growing use of maps with additional data. I feel like I read about John Snow’s famous 1854 map of cholera cases in London everywhere, but the practice has really picked up in recent decades. As we know more about spatial patterns and have the tools (like GIS) to overlay data, maps with data are everywhere. But finding and communicating the patterns is not necessarily easy, nor is the full story of the analysis and presentation usually given. Instead, we might just see a map. As someone who has published an article using maps as key evidence, I know that collecting the data, putting it into a map, and presenting the data required multiple decisions.

On the soc undergrad resume: data collection and analysis

Graduating sociology majors have worked on their resumes and tried to sum up their training for prospective employers. Following up on yesterday’s post on the importance of data in sociology, these graduates should, in my opinion, include data collection and analysis in their collection of resume skills. Here are a few reasons why:

  1. From the beginning of their sociology training, we work to help them observe and interpret patterns in the social world. While there is no single class that does this all at once, the path from beginning to end is full of opportunities both to see how sociologists do this and to try their own hand at developing sociological arguments. Final papers in any class (as well as other assignments) offer opportunities to practice data analysis and interpretation.
  2. Sociology majors do tend to have classes explicitly devoted to Statistics and/or Research Methods. For example, while many people think they can put a survey together, it is in these classes that they learn important basics: What sample do you want? How do you ask good questions? How do you report survey data? At the least, these classes help undergraduates know what questions to ask about data collection and analysis, and at their best they give them chances to practice these skills.
  3. Organizations – from non-profits to businesses to governments – want people with data collection and analysis skills. Now that it is easier than ever to work with data (though we should not underestimate the value of collecting good data in the first place), how can a prospective employee help the organization understand and communicate what is in the data? In a world awash with data, what do we do with it all?

Undergraduates may be leery of claiming these skills as they do not view themselves as experts and do not have years of work experience in data analysis. Yet these abilities are at the heart of sociology, and they are skills that are in demand.

Using statistics to find lost airplanes

Here is a quick look at how Bayesian statistics helped find Air France 447 in the Atlantic Ocean:

Stone and co are statisticians who were brought in to reëxamine the evidence after four intensive searches had failed to find the aircraft. What’s interesting about this story is that their analysis pointed to a location not far from the last known position, in an area that had almost certainly been searched soon after the disaster. The wreckage was found almost exactly where they predicted at a depth of 14,000 feet after only one week’s additional search…

This is what statisticians call the posterior distribution. To calculate it, Stone and co had to take into account the failure of four different searches after the plane went down. The first was the failure to find debris or bodies for six days after the plane went missing in June 2009; then there was the failure of acoustic searches in July 2009 to detect the pings from underwater locator beacons on the flight data recorder and cockpit voice recorder; next, another search in August 2009 failed to find anything using side-scanning sonar; and finally, there was another unsuccessful search using side-scanning sonar in April and May 2010…

That’s an important point. A different analysis might have excluded this location on the basis that it had already been covered. But Stone and co chose to include the possibility that the acoustic beacons may have failed, a crucial decision that led directly to the discovery of the wreckage. Indeed, it seems likely that the beacons did fail and that this was the main reason why the search took so long.

The key point, of course, is that Bayesian inference by itself can’t solve these problems. Instead, statisticians themselves play a crucial role in evaluating the evidence, deciding what it means and then incorporating it in an appropriate way into the Bayesian model.
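The quoted reasoning can be sketched as a Bayesian update over search zones. The key modeling choice is exactly the one Stone and co made: a searched zone is not ruled out, because detection (hearing the beacons, for instance) can fail. The zones and probabilities below are illustrative, not taken from the actual analysis:

```python
# Prior probability the wreck is in each zone (illustrative numbers)
prior = {"A": 0.6, "B": 0.3, "C": 0.1}

# Probability a search of that zone WOULD detect the wreck if it were there.
# Zone A's search relied on acoustic beacons; allowing for beacon failure
# pushes its detection probability well below 1.
p_detect = {"A": 0.7, "B": 0.9, "C": 0.9}

# Zones A and B were searched and nothing was found
searched = {"A", "B"}

# Bayes: P(zone | no detection) is proportional to P(no detection | zone) * P(zone)
likelihood = {z: (1 - p_detect[z]) if z in searched else 1.0 for z in prior}
unnorm = {z: likelihood[z] * prior[z] for z in prior}
total = sum(unnorm.values())
posterior = {z: unnorm[z] / total for z in prior}
print(posterior)
```

Note that zone A remains the most probable location even though it was searched, which mirrors the story: modeling the possibility of beacon failure keeps an already-searched area in play.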

It is not just about knowing where to look – it is also about knowing how to look. Finding a needle in a haystack is a difficult business whether it is looking for small social trends in mounds of big data or finding a crashed plane in the middle of the ocean.

This could also be a good reminder that only having one search in such circumstances may not be enough. When working with data, failures are not necessarily bad as long as they can help move to a solution.

Argument: scientists need help in handling big data

Collecting, analyzing, and interpreting big data may be a job that requires more than scientists alone:

For projects like NEON, interpreting the data is a complicated business. Early on, the team realized that its data, while mid-size compared with the largest physics and biology projects, would be big in complexity. “NEON’s contribution to big data is not in its volume,” said Steve Berukoff, the project’s assistant director for data products. “It’s in the heterogeneity and spatial and temporal distribution of data.”

Unlike the roughly 20 critical measurements in climate science or the vast but relatively structured data in particle physics, NEON will have more than 500 quantities to keep track of, from temperature, soil and water measurements to insect, bird, mammal and microbial samples to remote sensing and aerial imaging. Much of the data is highly unstructured and difficult to parse — for example, taxonomic names and behavioral observations, which are sometimes subject to debate and revision.

And, as daunting as the looming data crush appears from a technical perspective, some of the greatest challenges are wholly nontechnical. Many researchers say the big science projects and analytical tools of the future can succeed only with the right mix of science, statistics, computer science, pure mathematics and deft leadership. In the big data age of distributed computing — in which enormously complex tasks are divided across a network of computers — the question remains: How should distributed science be conducted across a network of researchers?

Two quick thoughts:

1. There is a lot of potential here for crossing disciplinary boundaries to tackle big data projects. This isn’t just about parceling out individual pieces of the project; bringing multiple perspectives together could lead to an improved final outcome.

2. I wonder if sociologists aren’t particularly well-suited for this kind of big data work. Given our emphasis on theory and methods, we stress both the big picture and how to effectively collect, analyze, and interpret data. Sociology students could step into such projects and provide needed insights.

Route sociology majors can go: data analyst

I try to remind my students in Statistics and Social Research that there is a need in a lot of industries for people who can collect and analyze data. I was reminded of this when I saw an obituary about a sociologist who had gone on to become a well-known medical data analyst:

A professor in the Department of Health Services at the UCLA Fielding School of Public Health, [E. Richard] Brown founded the UCLA Center for Health Policy Research in 1994.

One of the center’s major activities has been the development of the California Health Interview Survey, the premier source of information about individual and household health status in California. It has served as a model for health surveys for other states.

Brown was the founder and principal investigator for the survey, which produced its first data from interviews with more than 55,000 California households in 2001. Information from the survey, which has been conducted every two years, has been used by policymakers, community advocates, researchers and others.

And working with important data can then lead to public policy options:

“The single thing that makes Rick stand out in this field is that he had an extraordinary capacity to use evidence about the public’s health and strategize and advocate to turn that evidence into the best policy and action,” said Dr. Linda Rosenstock, dean of the UCLA Fielding School of Public Health.

In 1990, Brown was co-author of California’s first single-payer healthcare legislation. He also co-wrote several other healthcare reform bills over the last two decades…

He also was a full-time senior consultant to President Clinton’s Task Force on National Health Care Reform and served as a senior health policy advisor for the Barack Obama for President Campaign — as well as serving as an advisor to U.S. Sens. Bob Kerrey, Paul Wellstone and Al Franken.

We need more people to collect useful data and then interpret what they mean. These days, the problem often is not a lack of information; rather, we need to know how to separate the good data from the bad and then be able to provide a useful interpretation. While some students may prefer to skip over the methodological sections of articles or books, understanding how to collect and analyze data can go a long way. Additionally, learning about these methods and data analysis can help one move toward a sociological view of the social world where personal anecdotes don’t matter as much as broad trends and looking at how social factors (variables) are related to each other.

Quick Review: American Grace

I recently wrote about a small section of American Grace but I have had a chance to complete the full book. Here are my thoughts about this broad-ranging book about religion in America:

1. On one hand, I like the broad overview. There is a lot of data and analysis here about American religion. If someone had to pick up one book about the topic, this wouldn’t be a bad one to choose. I also liked some of the historical insights, including the idea that what we see now in American religion is a fallout of action in the 1960s and two counteractions that followed.

2. On the other hand, I’m not sure this book provides much new information. There is a lot of research contained in this book but much of it is already out there. The authors try to produce new insights from their own survey, but this is an issue in itself: after reading the full book, it was somewhat unclear why the authors undertook two waves of the Faith Matters Survey. The questions led to some new insights (like feelings toward the construction of a large religious building nearby), but much of it seemed duplicated, and the short period between the waves didn’t help.

3. There is a lot of talk about data analysis and interpretation in this book. While it is aimed at a more general audience, the authors are careful in their explanations. For example, they explain, over and over again, what exactly a correlation means: it indicates a relationship between variables, but causation is unclear. Elsewhere, the authors explain exactly why they asked the questions they did and discuss the quality of this data. Some of these little descriptions would be useful in basic statistics or research classes. On the whole, they do a nice job of explaining how they interpret the data, though I wonder how this might play with a general public that might just want the takeaway points. Perhaps this is why one reviewer thought this text was so readable!

4. Perhaps as a counterpoint to the discussions of data, the book includes a number of vignettes regarding religious congregations. These could be quite lengthy and I’m not sure that they added much to the book. They don’t pack the same punch as the representative characters of a book like Habits of the Heart and sometimes seem like filler.

5. The book ends with the conclusion that Americans can be both religiously diverse and devoted because of the many relationships between people of different faiths and denominations. On the whole, the authors suggest most people are in the middle regarding religion, not too confident in the idea that their religion is the only way but unwilling to say that having no religion is the way to go. I would like to have read more about how this plays out within religious congregations: how do religious leaders then talk doctrine, or has everyone simply shifted to a more accommodating approach? Additionally, why doesn’t this lead down the path of secularization? From a societal perspective, religious pluralism may be desirable, but is it also desirable for smaller groups?

On the whole, this book is a good place to start if one is looking for an overview of American religion. But, if one is looking for more detailed research and discussion regarding a particular topic, one would be better served going to those conducting research within these specific areas.

Using a sociological approach in “e-discovery technologies”

Legal cases can generate a tremendous number of documents that each side needs to examine. With new searching technology, legal teams can now go through a lot more data for a lot less money. In one example, “Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.” Within this discussion, the writer suggests that these searches can be done in two ways:

E-discovery technologies generally fall into two broad categories that can be described as “linguistic” and “sociological.”

The most basic linguistic approach uses specific search words to find and sort relevant documents. More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”

The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls…

The Cataphora software can also recognize the sentiment in an e-mail message — whether a person is positive or negative, or what the company calls “loud talking” — unusual emphasis that might give hints that a document is about a stressful situation. The software can also detect subtle changes in the style of an e-mail communication.

A shift in an author’s e-mail style, from breezy to unusually formal, can raise a red flag about illegal activity.
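The "linguistic" approach described above amounts to query expansion: a search term is widened through a dictionary of related words and phrases before matching documents. A toy sketch, with a made-up synonym table and invented documents:

```python
# Toy synonym table standing in for the "large web of word and phrase
# definitions" the article describes
synonyms = {
    "dog": {"dog", "man's best friend", "walk"},
}

documents = [
    "took man's best friend out this morning",
    "quarterly revenue projections attached",
    "going for a walk at lunch",
]

def search(term, docs):
    """Return documents matching the term or any expansion of it."""
    expansions = synonyms.get(term, {term})
    return [d for d in docs if any(e in d for e in expansions)]

hits = search("dog", documents)
print(hits)
```

A query for "dog" matches documents that never contain the word, which is the whole point: relevance is defined by the expansion dictionary, not by literal string matching.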

So this second technique gets branded as “sociological” because it looks for patterns of behavior and interaction. If you wondered how the programmers set up their code to do this kind of analysis, it sounds like some academics have been working on the problem for almost a decade:

[A computer scientist] bought a copy of the database [of Enron emails] for $10,000 and made it freely available to academic and corporate researchers. Since then, it has become the foundation of a wealth of new science — and its value has endured, since privacy constraints usually keep large collections of e-mail out of reach. “It’s made a massive difference in the research community,” Dr. McCallum said.

The Enron Corpus has led to a better understanding of how language is used and how social networks function, and it has improved efforts to uncover social groups based on e-mail communication.
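Uncovering social groups from email traffic starts with a "who talks to whom" graph. A minimal standard-library sketch that tallies sender-recipient edges from message metadata (the messages here are invented, not from the Enron Corpus):

```python
from collections import Counter

# Invented (sender, recipient) pairs standing in for email headers
messages = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("alice", "bob"),
    ("carol", "dave"),
]

# Undirected edge weights: how often each pair corresponds
edges = Counter(frozenset(pair) for pair in messages)

# Heavily weighted edges hint at social ties; clusters of such edges
# hint at groups (real analyses apply community-detection algorithms
# on top of a graph like this)
strongest = edges.most_common(1)[0]
print(edges)
print(strongest)
```

Everything interesting in the research literature builds on this kind of edge list: weighting by frequency, adding direction and timestamps, then looking for densely connected clusters.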

Were any sociologists involved in this project to provide input on what the programs should be looking for in human interactions?

This sort of analysis software could be very handy for sociological research when one has hundreds of documents or sources to look through. Of course, the algorithms might have to be changed for specific projects or settings, but I wonder if this sort of software might be widely available in a few years. Would this analysis be better than going through documents one by one in coding software like Atlas.ti or NVivo?

Interpreting the FBI’s 2009 hate crime report

Hate crime legislation is a topic that seems to rile people up. The Atlantic provides five sources that try to summarize and make sense of the latest annual data released by the FBI:

Agence France-Presse reports that “out of 6,604 hate crimes committed in the United States in 2009, some 4,000 were racially motivated and nearly 1,600 were driven by hatred for a particular religion … Blacks made up around three-quarters of victims of the racially motivated hate crimes and Jews made up the same percentage of victims of anti-religious hate crimes.” The report also notes that “anti-Muslim crimes were a distant second to crimes against Jews, making up just eight percent of the hate crimes driven by religious intolerance.” Finally, the report notes a drop in hate crimes overall: “Some 8,300 people fell victim to hate crimes in 2009, down from 9,700 the previous year.”

This is a reminder that there is a lot of data out there, particularly generated by government agencies, but we need qualified and skilled people to interpret its meaning.

You can find the data on hate crimes at the FBI’s Uniform Crime Reports website. Here is the FBI’s summary of the incidents, 6,604 in all.