Fun with statistics: people flock to stores that sold winning lottery tickets in the past

Ahead of the recent large Powerball jackpot, stores that sold winning tickets in the past experienced an increase in business:

When word got out that a southeast Pennsylvania 7-Eleven sold a $1 million Powerball ticket on Saturday, customers hoping to experience some luck of their own flocked to the store…

At a Casey’s General Store in Bondurant, Iowa, everyone knows it’s the place where a $202.1 million Powerball jackpot ticket was sold to a local woman in September. Asked what types of questions the store gets when the jackpots get huge, assistant manager Debra Fetters said: “Does lightning strike twice here?”…

“When you get those stores where they’ve actually seen someone win, they’re very enthusiastic about it. They know about the game, they have regular customers. A lot of it really does come down to great retailers that support the lottery, understand that there are winners on both sides.”

Linda Hamlin, also of the New Mexico Lottery, noted the story of “Millionaire Mary” Torres of Albuquerque. After she sold a $1 million winning Powerball ticket to an Albuquerque man in May 2011, she became known as a good luck charm. Her customers followed her to another store a few miles away.

And the article ends with this quote:

“Humans tend to be superstitious about things,” said Strutt of the Multi-State Lottery Association. “We all have our ways to ensure our best luck. But every ticket has the exact same chance of winning.”

What would happen if this argument, that their odds of winning do not increase, were presented to the purchasers who go back to the place of past winners? Would they say the numbers aren’t right or say it doesn’t matter? Perhaps this is a sort of Pascal’s Wager for Powerball: shopping at this particular location doesn’t increase my odds of winning, but it can’t hurt!

This could be chalked up to superstition, but it is also the result of humans looking for patterns where there aren’t any. Two things make the place where the winning ticket was bought stand out: (1) there are few big winners and (2) the big prizes are noteworthy. Put these two together and all of a sudden people start seeing trends even though there is little data to work with. But then you have news coverage from a few years ago about a woman in Texas who won the lottery four times; four data points make a much better pattern than a one-time winner!
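The intuition that every ticket has the same chance no matter where it is bought can be checked with a quick simulation. This is a sketch with made-up parameters (1,000 stores, one winning ticket per draw, winners distributed uniformly), not a model of any actual lottery:

```python
import random

random.seed(42)

N_STORES = 1_000   # hypothetical number of stores selling tickets
N_DRAWS = 100_000  # simulated lottery draws

# Each draw, the winning ticket is sold at a uniformly random store.
# Count how often the store that sold the previous winner wins again.
repeat_wins = 0
prev_winner = random.randrange(N_STORES)
for _ in range(N_DRAWS):
    winner = random.randrange(N_STORES)
    if winner == prev_winner:
        repeat_wins += 1
    prev_winner = winner

# The "lucky" store that just sold a winner wins the next draw about
# 1 time in 1,000 -- exactly its baseline chance, no better.
print(repeat_wins / N_DRAWS)
```

With enough draws, the repeat rate settles at the baseline 1-in-1,000, which is the point of the pollster-style argument: past wins carry no information about future ones.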

“The Nate Silver of immigration reform”

Want a statistical model that tells you which Congressman to lobby on immigration reform? Look no further than a political scientist at UC San Diego:

In the mold of Silver, who is famous for his election predictions, Wong bridges the gap between equations and shoe-leather politics, said David Damore, a political science professor at the University of Nevada, Las Vegas and a senior analyst for Latino Decisions, a political opinion research group.

Activists already have an idea of which lawmakers to target, but Wong gives them an extra edge. He can generate a custom analysis for, say, who might be receptive to an argument based on religious faith. With the House likely to consider separate measures rather than a comprehensive bill, Wong covers every permutation.

“In the House, everybody’s in their own unique geopolitical context,” Damore said. “What he’s doing is very, very useful.”

The equations Wong uses are familiar to many political scientists. So are his raw materials: each lawmaker’s past votes and the ethnic composition of his or her district. But no one else appears to be applying those tools to immigration in quite the way Wong does.

So is there something extra in the models that others don’t have, or is Wong especially good at interpreting the results? The article suggests there are some common factors all political scientists would consider, but it also hints at more hidden factors like religiosity or district-specific happenings.

A fear I have for Nate Silver as well: what happens when the models are wrong? Those who work with statistics know they are just predictions and that statistical models always have error, but this isn’t necessarily how the public sees things.

“The average Australian is a suburban Frankenstein”?

One columnist is not pleased with the idea of the average Australian in the suburbs:

Earlier this month the Bureau of Statistics, apparently hoping to deter Wayne Swan from cutting its allocation in the May budget, made a grab for publicity with a report on the characteristics of “the average Australian”. In the process it broke its own rules.

The ABS applied mathematical magic to data from the 2011 census and sent the media off in search of a blonde brown-eyed 37 year old woman with two photogenic children aged nine and six, two cars and a mortgage of $1800 a month on her three bedroom home. Edna Everage’s granddaughter was born here (like her parents), describes herself as Christian, weighs 71.1 kg, and works as a sales assistant…

Start packing your bags. The ABS decision to build a suburban Frankenstein for the sake of a publicity boost risks returning us to the point in recent history when certain people were labelled “unAustralian” if their language or behavior did not match the world view of Alan Jones, John Laws, Neil Mitchell or Andrew Bolt.

The ABS has played into the hands of those titans of talkback who like to keep the message simple. They’re not interested in this qualifier the ABS included at the end of the report to salve its conscience: “While many people will share a number of characteristics in common with this ‘average’ Australian, out of nearly 22 million people counted in Australia on Census night, no single person met all these criteria. While the description of the average Australian may sound quite typical, the fact that no-one meets all these criteria shows that the notion of the ‘average’ masks considerable (and growing) diversity in Australia.”

The columnist may indeed be correct that the best way to do this would have been to use medians rather than averages. But the bigger issue here seems to be the idea that there is a “suburban mold” Australians need to fit into. Not everyone likes this image, as the suburbs are often associated with homogeneous populations, consumption and behaviors meant to keep up with the Joneses, and middle-class conservatism. Regardless of what the statistics say or whether a majority of Australians (or Americans) live in the suburbs, these suburban critiques will likely continue.

Argument: Big Data reduces humans to something less than human

One commentator suggests Big Data can’t quite capture what makes humans human:

I have been browsing in the literature on “sentiment analysis,” a branch of digital analytics that—in the words of a scientific paper—“seeks to identify the viewpoint(s) underlying a text span.” This is accomplished by mechanically identifying the words in a proposition that originate in “subjectivity,” and thereby obtaining an accurate understanding of the feelings and the preferences that animate the utterance. This finding can then be tabulated and integrated with similar findings, with millions of them, so that a vast repository of information about inwardness can be created: the Big Data of the Heart. The purpose of this accumulated information is to detect patterns that will enable prediction: a world with uncertainty steadily decreasing to zero, as if that is a dream and not a nightmare. I found a scientific paper that even provided a mathematical model for grief, which it bizarrely defined as “dissatisfaction.” It called its discovery the Good Grief Algorithm.

The mathematization of subjectivity will founder upon the resplendent fact that we are ambiguous beings. We frequently have mixed feelings, and are divided against ourselves. We use different words to communicate similar thoughts, but those words are not synonyms. Though we dream of exactitude and transparency, our meanings are often approximate and obscure. What algorithm will capture “the feel of not to feel it / when there is none to heal it,” or “half in love with easeful Death”? How will the sentiment analysis of those words advance the comprehension of bleak emotions? (In my safari into sentiment analysis I found some recognition of the problem of ambiguity, but it was treated as merely a technical obstacle.) We are also self-interpreting beings—that is, we deceive ourselves and each other. We even lie. It is true that we make choices, and translate our feelings into actions; but a choice is often a coarse and inadequate translation of a feeling, and a full picture of our inner states cannot always be inferred from it. I have never voted wholeheartedly in a general election.

For the purpose of the outcome of an election, of course, it does not matter that I vote complicatedly. All that matters is that I vote. The same is true of what I buy. A business does not want my heart; it wants my money. Its interest in my heart is owed to its interest in my money. (For business, dissatisfaction is grief.) It will come as no surprise that the most common application of the datafication of subjectivity is to commerce, in which I include politics. Again and again in the scholarly papers on sentiment analysis the examples given are restaurant reviews and movie reviews. This is fine: the study of the consumer is one of capitalism’s oldest techniques. But it is not fine that the consumer is mistaken for the entirety of the person. Mayer-Schönberger and Cukier exult that “datafication is a mental outlook that may penetrate all areas of life.” This is the revolution: the Rotten Tomatoes view of life. “Datafication represents an essential enrichment in human comprehension.” It is this inflated claim that gives offense. It would be more proper to say that datafication represents an essential enrichment in human marketing. But marketing is hardly the supreme or most consequential human activity. Subjectivity is not most fully achieved in shopping. Or is it, in our wired consumerist satyricon?

“With the help of big data,” Mayer-Schönberger and Cukier continue, “we will no longer regard our world as a string of happenings that we explain as natural and social phenomena, but as a universe comprised essentially of information.” An improvement! Can anyone seriously accept that information is the essence of the world? Of our world, perhaps; but we are making this world, and acquiescing in its making. The religion of information is another superstition, another distorting totalism, another counterfeit deliverance. In some ways the technology is transforming us into brilliant fools. In the riot of words and numbers in which we live so smartly and so articulately, in the comprehensively quantified existence in which we presume to believe that eventually we will know everything, in the expanding universe of prediction in which hope and longing will come to seem obsolete and merely ignorant, we are renouncing some of the primary human experiences. We are certainly renouncing the inexpressible. The other day I was listening to Mahler in my library. When I caught sight of the computer on the table, it looked small.

I think there are a couple of possible arguments about the limitations of big data, and Wieseltier is making a particular one. He does not appear to be saying that big data can’t predict or model human complexity. Fans of big data would probably say the biggest issue is that we simply don’t have enough data yet and that we are developing better and better models; in other words, our abilities and data will eventually catch up to the problem of complexity. But I think Wieseltier is arguing something else: he, along with many others, does not want humans to be reduced to information. Even with the best models, it is one thing to see people as complex individuals and another to say they are simply more pieces of information. The latter takes away people’s dignity. Reducing people to data means we stop seeing them as people who can change their minds, be creative, and confound predictions.
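The ambiguity problem Wieseltier raises is easy to make concrete. Here is a minimal, hypothetical lexicon-based scorer of the word-counting kind the quoted papers describe; it is not any particular paper’s method, and the word lists are invented for illustration:

```python
# A minimal, hypothetical lexicon-based sentiment scorer; the word
# lists are invented for illustration, not drawn from any real system.
POSITIVE = {"love", "easeful", "heal", "hope"}
NEGATIVE = {"grief", "bleak", "death", "dissatisfaction"}

def sentiment_score(text: str) -> int:
    """Positive words minus negative words -- no context, no irony."""
    words = text.lower().replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Keats's bleak "half in love with easeful Death" comes out at +1:
# counting words reads a line about longing for death as mildly positive.
print(sentiment_score("half in love with easeful Death"))
```

Counting “love” and “easeful” against “death” scores a line about longing for death as mildly positive, which is exactly the kind of misreading Wieseltier has in mind.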

It will be interesting to see how this plays out in the coming years. I think this is the same fear many people have about statistics. Particularly in our modern world, where we see ourselves as sovereign individuals, describing statistical trends to people strikes them as reducing their agency and negating their experiences. Of course, this is not what statistics is about, and more training in statistics could help change this perception. But how we talk about data and its uses might go a long way toward how big data is viewed in the future.

Pollster provides concise defense of polls

The chief pollster for Fox News defends polls succinctly here. The conclusion:

Likewise, we don’t need to contact every American — more than 230 million adults — to find out what the public is thinking. Suffice it to say that with proper sampling and random selection of respondents so that every person has an equal chance of being contacted, a poll of 800-1,000 people provides an incredibly accurate representation of the country as a whole. It’s a pretty amazing process if you think about it.

Still, many people seem to have a love-hate relationship with polls. Even if they enjoy reading the polls, some people can turn into skeptics if they personally don’t feel the same as the majority. Maybe they don’t even know anyone who feels the same as the majority.  Yet assuming everyone shares your views and those of your friends and neighbors would be like the cook skimming a taste from just the top of the pot without stirring the soup first.

Basic, but a staple of many a statistics and research methods course. Unfortunately, more people need this kind of education in a world where statistics are becoming more and more common.
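The pollster’s claim that 800-1,000 respondents can represent 230 million adults rests on the standard error of a sample proportion: the margin of error depends on the sample size, not the population size. A quick sketch:

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """95% margin of error for a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) for the sample sizes the pollster cites:
for n in (800, 1000):
    print(n, round(margin_of_error(n), 3))
# 800 respondents give about +/-3.5 points; 1,000 give about +/-3.1.
```

Note that the population size (230 million) appears nowhere in the formula; this is why the same 1,000-person sample works for a city or a country, provided the sampling is truly random.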

What is more important: the absolute number of crimes or the crime rate?

Chicago has received a lot of unwanted attention because of its absolute number of murders in recent years. But a new study finds that having more gun laws leads to lower gun death rates. Which is better: the absolute number or the rate?

In the dozen or so states with the most gun control-related laws, far fewer people were shot to death or killed themselves with guns than in the states with the fewest laws, the study found. Overall, states with the most laws had a 42 percent lower gun death rate than states with the least number of laws.

The results are based on an analysis of 2007-2010 gun-related homicides and suicides from the federal Centers for Disease Control and Prevention. The researchers also used data on gun control measures in all 50 states compiled by the Brady Center to Prevent Gun Violence, a well-known gun control advocacy group. They compared states by dividing them into four equal-sized groups according to the number of gun laws.

The results were published online Wednesday in the medical journal JAMA Internal Medicine.

More than 30,000 people nationwide die from guns every year, and there’s evidence that gun-related violent crime rates have increased since 2008, a journal editorial noted.

Even this first quoted paragraph conflates two different measures: the absolute number of gun deaths versus the gun death rate. What does the public care most about? Rates make more sense from a comparative point of view because they adjust for differences in population. Of course Chicago has more murders and crimes than cities with smaller populations; after all, it is the third most populous city in the United States. Researchers are probably more inclined to use rates, but absolute numbers tend to lead to more scintillating stories. The media can focus on milestone numbers (400, 500, 600 murders) as well as consistently report on percentage differences as the months go by. Rates are not complicated to understand, but they are not as simple as absolute numbers.

I can’t help but think that a little more statistical literacy could be beneficial here. If the public and the media heard about and knew how to interpret rates, perhaps the conversation would be different.

Will Nate Silver ruin his brand with NCAA predictions?

Statistical guru Nate Silver, known for his 2012 election predictions, has been branching out into other areas recently on the New York Times site. Check out his 2013 NCAA predictions. Or look at his 2013 Oscar predictions.

While Silver has a background in sports statistics, I wonder if these forays into new areas with the imprimatur of the New York Times will eventually backfire. In many ways, these new areas have less data than presidential elections and thus, Silver has to step further out on a limb. For example, look at these predictions for the 2013 NCAA bracket:

The top pick for 2013, Louisville, has only a 22.7% chance of winning. If Silver goes with Louisville, and he does, then by his own figures he will be wrong 77.3% of the time. These are not good odds.

I’m not sure Silver can really win much by predicting the NCAA champion or the Oscars because the odds of making a wrong prediction are higher. What happens if he is wrong a number of times in a row? Will people still listen to him in the same way? What happens when the 2016 presidential election comes along? Of course, Silver could continue to develop better models and make more accurate picks but even this takes attention away from his political predictions.
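Taking Silver’s published 22.7% figure at face value, a line of arithmetic shows how unforgiving this kind of prediction is: even a perfectly calibrated forecaster who always picks the favorite should expect to miss most years.

```python
p_win = 0.227  # Silver's stated chance for his 2013 top pick, Louisville

# If the favorite really has about a 23% chance each year, a forecaster
# who always picks the favorite is still wrong most of the time.
for years in (1, 3, 5):
    p_all_wrong = (1 - p_win) ** years
    print(years, round(p_all_wrong, 3))
# Missing one year: 77.3%. Missing five straight years: about 27.6% --
# a well-calibrated model can look bad for a long stretch by pure chance.
```

So a run of wrong champions would say little about the quality of the model, but as noted above, the public may not read it that way.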

Getting the data to model society like we model the natural world

A recent session at the American Association for the Advancement of Science included a discussion of how to model the social world:

Dirk Helbing was speaking at a session entitled “Predictability: from physical to data sciences”. This was an opportunity for participating scientists to share ways in which they have applied statistical methodologies they usually use in the physical sciences to issues which are more ‘societal’ in nature. Examples stretched from use of Twitter data to accurately predict where a person is at any moment of each day, to use of social network data in identifying the tipping point at which opinions held by a minority of committed individuals influence the majority view (essentially looking at how new social movements develop) through to reducing travel time across an entire road system by analysing mobile phone and GIS (Geographical Information Systems) data…

With their eye on the big picture, Dr Helbing and multidisciplinary colleagues are collaborating on FuturICT, a 10-year, 1 billion EUR programme which, starting in 2013, is set to explore social and economic life on earth to create a huge computer simulation intended to simulate the interactions of all aspects of social and physical processes on the planet. This open resource will be available to us all and particularly targeted at policy and decision makers. The simulation will make clear the conditions and mechanisms underpinning systemic instabilities in areas as diverse as finance, security, health, the environment and crime. It is hoped that knowing why and being able to see how global crises and social breakdown happen, will mean that we will be able to prevent or mitigate them.

Modelling so many complex matters will take time but in the future, we should be able to use tools to predict collective social phenomena as confidently as we predict physical phenomena such as the weather now.

This will require a tremendous amount of data. It may also require asking individual members of society for far more data than has happened yet. To this point, individuals have been willing to volunteer information in places like Facebook and Twitter, but we will need much more consistent information than that to truly develop the kinds of models suggested here. Additionally, once that minute-to-minute information is collected, it needs to be put in a central dataset or location so all the possible connections can be seen. Who is going to keep and police this information? People might be convinced to participate if they could see the payoff. What exactly will a social model be able to do: limit or stop crime or wars? Help reduce discrimination? Thus, getting the data from people might be as much of a problem as knowing what to do with it once it is obtained.

Using analytics and statistics in sports and society: a ways to go

Truehoop has been doing a fine job covering the 2013 MIT Sloan Sports Analytics Conference. One post from last Saturday highlighted five quotes “On how far people have delved into the potential of analytics”:

“We are nowhere yet.”
— Morey

“There is a human element in sports that is not quantifiable. These players bleed for you, give you everything they have, and there’s a bond there.”
— Bill Polian, ESPN NFL analyst

“When visualizing data, it’s not about how much can I put in but how much can I take out.”
— Joe Ward, The New York Times sports graphics editor

“If you are not becoming a digital CMO (Chief Marketing Officer), you are becoming extinct.”
— Tim McDermott, Philadelphia Eagles CMO

“Even if God came down and said this model is correct … there is still randomness, and you can be wrong.”
— Phil Birnbaum, By The Numbers editor

In other words, there is a lot of potential in these statistics and models but we have a long way to go in deploying them correctly. I think this is a good reminder when thinking about big data as well: simply having the numbers and recognizing they might mean something is a long way from making sense of the numbers and improving lives because of our new knowledge.

Looking at the data behind the claim that more black men are in jail than college

A scholar looks at his own usage of a statistic and where it came from:

About six years ago I wrote, “In 2000, the Justice Policy Institute (JPI) found evidence that more black men are in prison than in college,” in my first “Breaking Barriers” (pdf) report. At the time, I did not question the veracity of this statement. The statement fit well among other stats that I used to establish the need for more solution-focused research on black male achievement…

Today there are approximately 600,000 more black men in college than in jail, and the best research evidence suggests that the line was never true to begin with. In this two-part entry in Show Me the Numbers, the Journal of Negro Education’s monthly series for The Root, I examine the dubious origins, widespread use and harmful effects of what is arguably the most frequently quoted statistic about black men in the United States…

In September 2012, in response to the Congressional Black Caucus Foundation’s screening of the film Hoodwinked, directed by Janks Morton, JPI issued a press release titled, “JPI Stands by Data in 2002 on Education and Incarceration.” However, if one examines the IPEDS data from 2001 to 2011, it is clear that many colleges and universities were not reporting data to IPEDS 10 years ago.

In 2011, 4,503 colleges and universities across the United States reported having at least one black male student. In 2001, only 2,734 colleges and universities reported having at least one black male student, with more than 1,000 not reporting any data at all. When perusing the IPEDS list of colleges with significant black male populations today but none reported in 2001, I noticed several historically black colleges and universities, including Bowie State University, and my own alma mater, Temple University. Ironically, I was enrolled at Temple as a doctoral candidate in 2001.

When I first saw this, I thought it might be an example of what sociologist Joel Best calls a “mutant statistic.” This is a statistic that may originally be based in fact but at some point undergoes a transformation and keeps getting repeated until it seems unchallengeable.

There might be some mutant statistic at work here, but it also appears to be an issue of methodology. As Toldson points out, it looks like this was a missing data problem: the 2001 survey did not include data from over 1,000 colleges. When more colleges were counted in 2011, the findings changed. If it is a methodological issue, then it should have been caught at the beginning.
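A toy calculation, with made-up numbers loosely echoing the IPEDS counts above, shows how nonreporting alone can manufacture a gap:

```python
# Made-up numbers loosely echoing the article's IPEDS counts: suppose
# 4,500 colleges each enroll about 130 black male students on average,
# but only 2,700 of them report data in a given year.
total_colleges = 4_500
reporting_colleges = 2_700
avg_per_college = 130

observed = reporting_colleges * avg_per_college
actual = total_colleges * avg_per_college
print(observed, actual)
# The observed count (351,000) misses 40% of the actual total (585,000).
# Comparing an undercounted enrollment figure against a fully counted
# jail figure yields a gap that is an artifact of the data collection.
```

If the college count is missing 40% of institutions while the incarceration count is complete, the two numbers are not comparable, regardless of which is larger.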

As Best notes, it can take some time for bad statistics to be reversed. It will be interesting to see how long this particular “fact” continues to be repeated.