How one woman helped make preventable injuries an American public health issue

The epidemiologist Susan P. Baker devoted her career to making preventable injuries a public health issue. Here is part of the story:

She embarked on an independent research project — a comparison of drivers who were not responsible for their fatal crashes with drivers who were — and in 1968 she sent Haddon a letter seeking federal financing for her study. He came through with $10,000 and continued to finance her research after he became president of the Insurance Institute for Highway Safety a year later…

Among Baker’s most important legacies is the widespread use of the infant car seat. By examining data from car crashes, she demonstrated that the passengers most likely to die were those younger than 6 months. They were killed at double the rate of 1-year-olds and triple the rate for ages 6 to 12. Why? Because babies rested in their mothers’ arms or laps, often in the front passenger seat, and because their still-fragile bodies were more susceptible to fatal injury than those of older children. Baker published her study in the journal Pediatrics in 1979, making headlines in newspapers across the country…

Around that time, Baker was one of the main authors of a report calling for the creation of a federal injury-prevention agency. Today the National Center for Injury Prevention and Control coordinates with state programs and underwrites research projects aimed at preventing injury, ranging from the intentional (rape, homicide, suicide) to the unintentional (falls, residential fires, drownings)…

Of course, Baker knows that we can’t make the world completely injury-proof. But her decades of research show how fairly simple preventive measures — fences around swimming pools, bike helmets, childproof caps on medicine containers — can save thousands of lives.

I couldn’t help thinking while reading this story that it demonstrates the interplay between science, culture, and government. The first paragraph of the article argues that in the 1960s few people worried about preventable injuries, but this has clearly changed since then. Aiding this process were new scientific findings about injuries as well as presentable statistics that captured people’s attention. This reminds me of sociologist Joel Best’s explanation in Damned Lies and Statistics that the use of statistics emerged in the mid-1800s because reformers wanted to attach numbers and science to social problems they cared about. But for these numbers to matter and the science to be taken seriously, you need a culture as well as institutions that see science as a viable way of knowing about the world. Similarly, the numbers themselves are not enough to immediately lead to change; social problems such as automobile deaths go through a process by which the public becomes aware, a critical mass starts pressing the issue, and leaders respond by changing regulations. Is it a coincidence that these concerns about public health began to emerge in the 1960s at the same time as American ascendancy in the scientific realm, the growth of the welfare state, the continued development of the mass media and mass consumption, and an era of more movements calling for human rights and governmental protections? Probably not.

h/t Instapundit

World population in 1804 = Facebook users today

Here is an interesting, if somewhat misguided, comparison of how many people are now Facebook users:

One billion people. That’s how many active monthly users Facebook has accrued in the eight years of its existence, the company announced today.

It took the population of modern humans about 200,000 years to reach that number, a milestone that was hit, demographers believe, just over two centuries ago in 1804 (bearing in mind that population tabs, then and now, are not exactly precise). Since then, human population has just exploded, enabled and protected by advances in medicine, agriculture, and hygiene. In the past year, it is estimated that the human headcount hit 7 billion.

I think I know what this comparison is trying to do: show the remarkable speed at which Facebook has attracted users. I agree. It has been remarkable.

At the same time, this is comparing apples to oranges. Yes, they are both large numbers of people. But one number is tied to human development, birth rates, life expectancy, technological improvement, and so on. This number reminds us of the broader scope of human history, which is far longer and in which progress is relatively slow. Having seven billion people on earth requires a lot of resources, space, and creative energy to tackle everyday and long-term problems. On the other side, you have Facebook, an Internet site that has attracted lots of users. While some of these users may be mega-users, people who are constantly online updating their status, tagging photos, and reading other people’s walls, it is still just an online program, a relatively small part of human existence.

Perhaps there would be better ways to make a comparison to Facebook’s user total:

1. Looking at adoption rates compared to other technologies. In other words, is Facebook’s growth something completely new, a sign of the digital world, or does its adoption rate compare more to other technologies? Comparisons can be made here.

2. Looking at what one billion people in the world do on a daily basis, or at how many other objects have such broad appeal. For example, this website suggests there are 5.6 billion cell phone users in the world. (Meaning: Facebook has many more users to attract.)

Genius and creativity = “a probabilistic function of quantity”

I was recently reading a Malcolm Gladwell article about the invention of the computer mouse and came across this statistical definition of genius and creativity:

The psychologist Dean Simonton argues that this fecundity is often at the heart of what distinguishes the truly gifted. The difference between Bach and his forgotten peers isn’t necessarily that he had a better ratio of hits to misses. The difference is that the mediocre might have a dozen ideas, while Bach, in his lifetime, created more than a thousand full-fledged musical compositions. A genius is a genius, Simonton maintains, because he can put together such a staggering number of insights, ideas, theories, random observations, and unexpected connections that he almost inevitably ends up with something great. “Quality,” Simonton writes, is “a probabilistic function of quantity.”

Simonton’s point is that there is nothing neat and efficient about creativity. “The more successes there are,” he says, “the more failures there are as well”—meaning that the person who had far more ideas than the rest of us will have far more bad ideas than the rest of us, too.

To put this in graph terms: as time passes, a creative person accumulates an increasing number of ideas, a line with positive slope. Beneath this overall line of ideas is another upward-sloping line tracking the unsuccessful ideas, and below that, rising steadily but perhaps more slowly, is the line of successful ideas. In other words, the more ideas someone has overall, the more failures but also the more quality ideas.
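To make the probabilistic point concrete, here is a minimal simulation sketch in Python; the 5% hit rate and the idea counts are invented, but with any fixed hit rate the prolific producer ends up with both more failures and more successes than the modest one.

```python
import random

random.seed(42)

HIT_RATE = 0.05  # assume every idea has the same 5% chance of being a "hit"

def career(total_ideas, hit_rate=HIT_RATE):
    """Simulate a creative career as a series of independent attempts."""
    hits = sum(1 for _ in range(total_ideas) if random.random() < hit_rate)
    return hits, total_ideas - hits

for label, n_ideas in [("a dozen ideas", 12), ("Bach-like output", 1000)]:
    hits, misses = career(n_ideas)
    print(f"{label}: {n_ideas} ideas -> {hits} hits, {misses} misses")
```

Quality as "a probabilistic function of quantity" falls right out: the only reliable way to pile up hits is to pile up attempts, misses and all.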

The rest of the article is about creating the right structural environment to take advantage of ideas. Most groups and organizations won’t recognize all the best ideas but innovative organizations find ways to encourage and push the good ideas to the top. Indeed, the clincher at the end of the article is that Steve Jobs, supposedly one of the best innovators America has had in recent decades, missed some opportunities as well.

Sociologist: “one-year change in test results doesn’t make a trend”

A sociologist provides some insights into how firms “norm” test scores from year to year and what this means about how to interpret the results:

The most challenging part of this process, though, is trying to place this year’s test results on the same scale as last year’s results, so that a score of 650 on this year’s test represents the same level of performance as a score of 650 on last year’s test. It’s this process of equating the tests from one year to the next which allows us to judge whether scores this year went up, declined or stayed the same. But it’s not straightforward, because the test questions change from one year to the next, and even the format and content coverage of the test may change.

Different test companies even have different computer programs and statistical techniques to estimate a student’s score and, hence, the overall picture of how a student, school or state is performing. (Teachers too, but that’s a subject for another day.)

All of these variables – different test questions from year to year; variations in test length, difficulty and content coverage; and different statistical procedures to calculate the scores – introduce some uncertainty about what the “true” results are…

In testing, every year is like changing labs, in somewhat unpredictable ways, even if a state hires the same testing contractor from one year to the next. For this reason, I urge readers to not react too strongly to changes from last year to this year, or to consider them a referendum on whether a particular set of education policies – or worse, a particular initiative – is working.

One-year changes have many uncertainties built into them; if there’s a real positive trend, it will persist over a period of several years. Schooling is a long-term process, the collective and sustained work of students, teachers and administrators; and there are few “silver bullets” that can be counted on to elevate scores over the period of a single school year.

Overall, this piece gives us some important things to remember: one data point is hard to put into context, and a line drawn between just two data points may not reflect a real trend. Having more data points gives you a better indication of what is happening over time. However, just having statistics isn’t enough; we also need to consider the reliability and validity of the data. Politicians and administrators seem to like test scores because they offer concrete numbers which can help them point out progress or suggest that changes need to be made. Yet just because these are numbers doesn’t mean there isn’t a process behind them, and we still need to understand exactly what the numbers involve.
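As a rough illustration of why a one-year change is shaky evidence, here is a small Python sketch with invented numbers: the school’s “true” performance never changes, yet year-to-year measurement and equating noise still produces apparent gains and losses.

```python
import random

random.seed(0)

TRUE_SCORE = 650  # hypothetical stable "true" performance
NOISE_SD = 10     # hypothetical equating/measurement noise from year to year

observed = [random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(6)]

for year, (prev, curr) in enumerate(zip(observed, observed[1:]), start=2):
    print(f"Year {year}: score {curr:.0f} ({curr - prev:+.0f} vs. prior year)")
```

Every single-year change here looks like progress or decline even though nothing real has changed; only a pattern that persists over several years would be informative.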

Positive results for teaching statistics by computer

A recent study shows that students taking a hybrid statistics course built around online software from Carnegie Mellon do just as well as students who take a traditional course taught in a classroom:

The study, called “Interactive Learning Online at Public Universities,” involved students taking introductory statistics courses at six (unnamed) public universities. A total of 605 students were randomly assigned to take the course in a “hybrid” format: they met in person with their instructors for one hour a week; otherwise, they worked through lessons and exercises using an artificially intelligent learning platform developed by learning scientists at Carnegie Mellon University’s Open Learning Initiative.

Researchers compared these students against their peers in the traditional-format courses, for which students met with a live instructor for three hours per week, using several measuring sticks: whether they passed the course, their performance on a standardized test (the Comprehensive Assessment of Statistics), and the final exam for the course, which was the same for both sections of the course at each of the universities…

The robotic software did have disadvantages, the researchers found. For one, students found it duller than listening to a live instructor. Some felt as though they had learned less, even if they scored just as well on tests. Engaging students, such as professors might by sprinkling their lectures with personal anecdotes and entertaining asides, remains one area where humans have the upper hand.

But on straight teaching the machines were judged to be as effective, and more efficient, than their personality-having counterparts.

As someone who regularly teaches both Statistics and Social Research (a research methods course), I find these findings intriguing. I understand the urge to curb costs while still providing a good education. However, I have three questions that perhaps go beyond these findings:

1. Are there any benefits for students from being in a classroom for three hours a week beyond learning outcomes? Is there a social dimension to the classroom setting that could enhance learning? For example, it is common for professors to have students work in groups or with each other, sometimes with the idea that being able to teach or effectively help another student will increase a student’s learning. Also, I wonder about learning becoming strictly an individualistic activity. Sure, there are ways to do this online (discussion boards, using Skype, etc.) but does this replicate the kind of discussions faculty and students can have in a classroom?

2. Are there any professors in the United States who might secretly welcome not having to teach statistics?

3. Is there a point in a discipline, like statistics, where the difficulty of the subject matter makes it more helpful to have a live instructor? This study looked at introductory stats courses but would the findings be the same if the courses covered more advanced topics that require more “intuition” and “art” than pure steps or facts?

h/t Instapundit

A route sociology majors can take: data analyst

I try to remind my students in Statistics and Social Research that there is a need in a lot of industries for people who can collect and analyze data. I was reminded of this when I saw an obituary about a sociologist who had gone on to become a well-known medical data analyst:

A professor in the Department of Health Services at the UCLA Fielding School of Public Health, [E. Richard] Brown founded the UCLA Center for Health Policy Research in 1994.

One of the center’s major activities has been the development of the California Health Interview Survey, the premier source of information about individual and household health status in California. It has served as a model for health surveys for other states.

Brown was the founder and principal investigator for the survey, which produced its first data from interviews with more than 55,000 California households in 2001. Information from the survey, which has been conducted every two years, has been used by policymakers, community advocates, researchers and others.

And working with important data can then lead to public policy options:

“The single thing that makes Rick stand out in this field is that he had an extraordinary capacity to use evidence about the public’s health and strategize and advocate to turn that evidence into the best policy and action,” said Dr. Linda Rosenstock, dean of the UCLA Fielding School of Public Health.

In 1990, Brown was co-author of California’s first single-payer healthcare legislation. He also co-wrote several other healthcare reform bills over the last two decades…

He also was a full-time senior consultant to President Clinton’s Task Force on National Health Care Reform and served as a senior health policy advisor for the Barack Obama for President Campaign — as well as serving as an advisor to U.S. Sens. Bob Kerrey, Paul Wellstone and Al Franken.

We need more people to collect useful data and then interpret what they mean. These days, the problem often is not a lack of information; rather, we need to know how to separate the good data from the bad and then be able to provide a useful interpretation. While some students may prefer to skip over the methodological sections of articles or books, understanding how to collect and analyze data can go a long way. Additionally, learning about these methods and data analysis can help one move toward a sociological view of the social world where personal anecdotes don’t matter as much as broad trends and looking at how social factors (variables) are related to each other.

Data guru Hans Rosling named to Time’s 100 most influential people

Hans Rosling’s talks are fascinating as he makes data and charts exciting and explanatory in his own enthusiastic manner. Named as one of the 100 most influential people by Time, Rosling is profiled by sociologist and MD Nicholas Christakis:

Hans Rosling trained in statistics and medicine and spent years on the front lines of public health in Africa. Yet his greatest impact has come from his stunning renderings of the numbers that characterize the human condition.

His 2006 TED talk, in which he animated statistics to tell the story of socio-economic development, has been viewed over 3.8 million times and translated into dozens of languages. His subsequent talks have moved millions of people worldwide to see themselves and our planet in new ways by showing how our actions affect our health and wealth and one another across space and time.

When you meet Rosling, 63, you are struck by his energy and clarity. He has the quiet assurance of a sword swallower (which he is) but also of a man who is in the vanguard of a critically important activity: advancing the public understanding of science.

What does Rosling make of his statistical analysis of worldwide trends? “I am not an optimist,” he says. “I’m a very serious possibilist. It’s a new category where we take emotion apart and we just work analytically with the world.” We can all, Rosling thinks, become healthy and wealthy. What a promising thought, so eloquently rendered with data.

Here are some of Rosling’s presentations that are well worth watching:

200 Countries, 200 Years, 4 minutes – The Joy of Stats

TED Talk: No More Boring Data

TED Talk: The Good News of the Decade?

Here is what The Economist thinks are Rosling’s greatest hits.

I’ve used several of Rosling’s talks in class to illustrate what is possible with data and charts. Rosling gets at an important issue: data should tell a story and be interactive and available to people so they too can dig into it and understand the world better. By simply taking a chart and adding some extra information (like a country’s population size displayed as a larger circle, or the ability to quickly show a country’s quartile income distribution) and the dimension of time, you can start to visualize patterns and possible explanations of how the world works.
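As a toy illustration of the kind of chart Rosling popularized, here is a minimal matplotlib sketch with invented numbers: each bubble is a country, bubble size encodes population, and stepping through years frame by frame would add the time dimension.

```python
import matplotlib.pyplot as plt

# Invented Gapminder-style data for a single year: income, life expectancy, population
income = [1_000, 5_000, 12_000, 35_000, 55_000]   # GDP per capita (hypothetical)
life_expectancy = [55, 63, 70, 78, 82]            # years (hypothetical)
population_millions = [30, 90, 200, 60, 10]

plt.scatter(income, life_expectancy,
            s=[p * 3 for p in population_millions],  # bubble area encodes population
            alpha=0.5)
plt.xscale("log")
plt.xlabel("GDP per capita (log scale)")
plt.ylabel("Life expectancy (years)")
plt.title("Rosling-style bubble chart (one frame; animate over years for time)")
plt.show()
```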

(A side note: alas, I don’t think any sociologists were named as one of the 100 most influential people.)

Controversy in using sampling for the decennial Census

In a story about the resignation of sociologist Robert Groves as director of the United States Census Bureau, there is an overview of some of the controversy over Groves’ nomination. The issue: the political implications of using statistical sampling.

Dr. Robert M. Groves announced on Tuesday that he was resigning his position as director of the U.S. Census Bureau in order to take over as provost of Georgetown University. “I’m an academic at heart,” Groves told The Washington Post. He will leave the Bureau in August. Unlike some government officials who recently have had to resign under a cloud, such as Regina Dugan of DARPA and Martha Johnson of the General Services Administration, Groves received universal praise for the job he did directing the 2010 Census, a herculean task he completed on time and almost $2 billion under budget.

At the time of Groves’ nomination, Rep. Darrell Issa (R-California), chairman of the House Committee on Oversight and Government Reform, said that he found it “an incredibly troubling selection that contradicts the administration’s assurances that the census process would not be used to advance an ulterior political agenda.” However, by the time Groves announced that he was leaving, Issa had changed his tune and issued a statement that “His tenure is proof that appointing good people makes a big difference.”

When President Barack Obama nominated Groves on April 2, 2009, he was viewed as a generally uncontroversial professor of sociology. However, his nomination turned out to be contentious anyway because his support for using statistical sampling, a statistical method commonly used to correct for errors and biases in the census, raised the ire of Republican critics, who believed that sampling would benefit minorities and the poor, who generally vote Democratic…

A specialist in survey methodology and statistics, Groves was no stranger to the Census Bureau, whose decennial census is one of the world’s largest and most sophisticated statistical exercises. Groves served there early in his career as a visiting statistician in 1982, and later as associate director of Statistical Design, Standards, and Methodology from 1990 to 1992. It was during the latter period that Groves became embroiled in the controversy over the proposed use of statistical sampling to correct known biases and deficiencies in the Census head count. Groves and others at the Census Bureau proposed using sampling techniques to correct an admitted 1.2% undercount in the 1990 Census, which failed to include millions of homeless, minority and poor persons mainly living in big cities, which lost millions of dollars in federal funds when Republican Commerce Secretary Robert Mosbacher vetoed the sampling proposal.

Considering Groves’ track record in sociology, I’m not surprised that he is now regarded as having done a good job in this position.

Perhaps this is a silly question in today’s world but does everything have to become politicized? Is the ultimate goal to get the most accurate count of American residents or do both parties simply assume that the other side wants to use the occasion for political gain? If you want to limit funding to cities based on population, why not go after this funding rather than try to skew the count?

Of course, this is not the first time that the decennial Census has been politicized…

Another note: a sociologist apparently saved the government nearly $2 billion! That alone should draw some attention.

Statistics learning opportunity: “Hunger Games Survival Analysis”

Fun with statistics: a survival analysis of The Hunger Games (quick reviews of the books and movie). According to the final analysis, the only significant factor is the rating of each participant:

My interpretation of this is that the Gamemakers know what they’re doing when they assign the ratings. They’ve been doing this for years, so they give scores that are so accurate that they’re actually better predictors of survival time than whether a tribute is a volunteer, a Career, male or female, or forms an alliance. Pretty impressive.

An alternate and more cynical interpretation is that the Gamemakers are concerned about their own reputations and thus engineer the games so as to confirm their ratings, occasionally killing off players who do better or worse than expected based on the ratings, all so that the Gamemakers can look like they knew what they were doing all along. Unfortunately, the political system of Panem ranks so slow on Freedom House’s annual scores that we simply can’t tell what’s going on behind the scenes at all. To cut through their lies we simply need more data.

If you read the whole piece, you also learn something about survival analysis and event history analysis. Bonus: the data and Stata code are also available for download!
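For readers curious what such a model looks like in code, here is a minimal sketch using Python’s lifelines package rather than the Stata code linked above; the data frame and variable names (survival_days, died, rating) are entirely hypothetical.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical tribute-level data; the real data and Stata code are linked above.
tributes = pd.DataFrame({
    "survival_days": [1, 2, 2, 5, 9, 14, 17, 18],  # time until death (or end of the games)
    "died":          [1, 1, 1, 1, 1, 1, 0, 0],     # 1 = observed death, 0 = censored (survived)
    "rating":        [5, 3, 6, 4, 8, 7, 11, 9],    # Gamemakers' pre-game rating
})

# Cox proportional hazards model: does a higher rating predict a lower hazard of dying?
cph = CoxPHFitter()
cph.fit(tributes, duration_col="survival_days", event_col="died")
cph.print_summary()
```

The key ideas of survival analysis are right there in the toy data: the outcome is a duration rather than a simple yes/no, and tributes still alive at the end are “censored” rather than thrown away.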

In the event history class I took during grad school, we didn’t look at any data that was remotely close to popular culture.

Also, why not include the data from the second and third books? Granted, the games change a bit in the sequels to ratchet up the tension but that would provide more data to work with…

Five main methods of detecting patterns in data mining

Here is a summary of five of the main methods utilized to uncover patterns when data mining:

Anomaly detection: in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.

Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.

Cluster detection: one type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This is one of the black-box type of algorithms that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.

Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent systemic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and classify incoming messages according to these rules with a high degree of accuracy.

Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
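To make the cluster-detection idea above a bit more concrete, here is a minimal sketch using scikit-learn’s k-means on invented purchase counts; the three hobbyist groups and the feature choices are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Invented purchase counts per customer: [gardening items, fishing items, model-airplane items]
gardeners = rng.poisson(lam=[8, 1, 1], size=(30, 3))
fishermen = rng.poisson(lam=[1, 8, 1], size=(30, 3))
modelers  = rng.poisson(lam=[1, 1, 8], size=(30, 3))
purchases = np.vstack([gardeners, fishermen, modelers])

# Let the data decide which customers group together (here we ask for three clusters)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(purchases)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # average purchase profile of each cluster
```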

Several of these seem similar to methods commonly used by sociologists:

1. Anomaly detection seems like looking for outliers. On one hand, outliers can throw off basic measures of central tendency or dispersion. On the other hand, outliers can help prompt researchers to reassess their models and/or theories to account for the unusual cases.

2. Cluster detection and/or classification appear similar to factor analysis. This involves a statistical analysis of a set of variables to see which ones “hang together.” This can be helpful for finding categories and reducing the number of variables in an analysis to a smaller number of important concepts.

3. Regression is used all the time both for modeling and predictions.
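And to echo point 3, here is a minimal sketch of regression used for prediction, loosely modeled on the Facebook engagement example from the excerpt; the feature names, coefficients, and data are all invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Hypothetical user-level features: photos tagged, friend requests, comments made
X = rng.poisson(lam=[20, 5, 40], size=(200, 3)).astype(float)
# Invented "future engagement" outcome driven by those features plus noise
y = 0.5 * X[:, 0] + 1.2 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 2, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# Predict engagement for a new (hypothetical) user
new_user = np.array([[25, 3, 50]], dtype=float)
print(model.predict(new_user))
```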

This all reminds me of what I heard in graduate school about the difference between data mining and statistical research: data mining amounted to atheoretical analysis. In other words, you might find relationships between variables (or apparent relationships between variables – there could always be a spurious association, or suppressor or distorter effects at work) but you wouldn’t have compelling explanations for these relationships. While you might be able to develop some explanations, this is a different process than hypothesis testing, where you set out to look and test for relationships and patterns.