Looking at the data behind the claim that more black men are in jail than in college

A scholar looks at his own usage of a statistic and where it came from:

About six years ago I wrote, “In 2000, the Justice Policy Institute (JPI) found evidence that more black men are in prison than in college,” in my first “Breaking Barriers” (pdf) report. At the time, I did not question the veracity of this statement. The statement fit well among other stats that I used to establish the need for more solution-focused research on black male achievement…

Today there are approximately 600,000 more black men in college than in jail, and the best research evidence suggests that the line was never true to begin with. In this two-part entry in Show Me the Numbers, the Journal of Negro Education’s monthly series for The Root, I examine the dubious origins, widespread use and harmful effects of what is arguably the most frequently quoted statistic about black men in the United States…

In September 2012, in response to the Congressional Black Caucus Foundation’s screening of the film Hoodwinked, directed by Janks Morton, JPI issued a press release titled, “JPI Stands by Data in 2002 on Education and Incarceration.” However, if one examines the IPEDS data from 2001 to 2011, it is clear that many colleges and universities were not reporting JPI’s data 10 years ago.

In 2011, 4,503 colleges and universities across the United States reported having at least one black male student. In 2001, only 2,734 colleges and universities reported having at least one black male student, with more than 1,000 not reporting any data at all. When perusing the IPEDS list of colleges with significant black male populations today but none reported in 2001, I noticed several historically black colleges and universities, including Bowie State University, and my own alma mater, Temple University. Ironically, I was enrolled at Temple as a doctoral candidate in 2001.

When I first saw this, I thought it might be an example of what sociologist Joel Best calls a “mutant statistic”: a statistic that may originally be based in fact but at some point undergoes a transformation and keeps getting repeated until it seems unchallengeable.

There might be some statistical mutation going on here, but it also appears to be an issue of methodology. As Toldson points out, this looks like a missing-data problem: the 2001 survey did not include data from over 1,000 colleges. When more colleges were counted in 2011, the findings changed. If it is a methodological issue, it should have been caught at the beginning.

As Best notes, it can take some time for bad statistics to be reversed. It will be interesting to see how long this particular “fact” continues to be repeated.

Proposing marriage through a statistical model

Statistical models are supposed to help predict life, right? So why not use a statistical model to help make a marriage proposal?

I think this is clever. And the one-page paper that goes with it is not bad either.

h/t Instapundit

Argument: statistics can help us understand and enjoy baseball

An editor and writer for Baseball Prospectus argues that we need science and statistics to understand baseball:

Fight it if you like, but baseball has become too complicated to solve without science. Every rotation of every pitch is measured now. Every inch that a baseball travels is measured now. Teams that used to get mocked for using spreadsheets now rely on databases packed with precise location and movement of every player on every play — and those teams are the norm, not the film-inspiring exceptions. This is exciting and it’s terrifying…

I’m not a mathematician and I’m not a scientist. I’m a guy who tries to understand baseball with common sense. In this era, that means embracing advanced metrics that I don’t really understand. That should make me a little uncomfortable, and it does. WAR is a crisscrossed mess of routes leading toward something that, basically, I have to take on faith…

Yet baseball’s front offices, the people in charge of $100 million payrolls and all your hope for the 2013 season, side overwhelmingly with data. For team executives, the basic framework of WAR — measuring players’ total performance against a consistent baseline — is commonplace, used by nearly every front office, according to insiders. The writers who helped guide the creation of WAR over the decades — including Bill James, Sean Smith and Keith Woolner — work for teams now. As James told me, the war over WAR has ceased where it matters. “There’s a practical necessity for measurements like that in a front office that make it irrelevant whether you like them or you don’t.”

Whether you do is up to you and ultimately matters only to you. In the larger perspective, the debate is over, and data won. So fight it if you’d like. But at a certain point, the question in any debate against science is: What are you really fighting and why?

As someone who likes data, I would say statistics is just another tool that can help us understand baseball better. It doesn’t have to be an either/or argument, baseball with advanced statistics versus baseball without them. Baseball with advanced statistics is more complete and gets at some of the underlying mechanics of the game rather than relying on visual cues or the culturally accepted statistics.

While this story is specifically about baseball, I think it also mirrors larger conversations in American society about the use of statistics. Why interrupt people’s common sense understandings of the world with abstract data? Aren’t these new statistics difficult to understand and can’t they also be manipulated? Some of this is true: looking at data can involve seeing things in new ways, and there are disagreements about how to define concepts as well as how to collect and interpret data. But, in the end, these statistics can help us better understand the world.

Sears hopes Moneyball addition to its board can help revive the company

Here is an odd mixing of the data, sports, and business worlds: Sears recently named Paul DePodesta to its board.

Paul DePodesta, one of the heroes of Michael Lewis’ “Moneyball: The Art of Winning an Unfair Game,” a great 2003 baseball book (and later a movie) about the 2002 A’s that’s more about business and epistemology than baseball, has been named to the board of Hoffman Estates-based Sears Holdings Corp.

To be sure, he’s an unconventional choice for the parent of Sears and Kmart. But Chairman Edward Lampert is thinking outside the box score, welcoming the New York Mets’ vice president of player development and amateur scouting into his clubhouse…

“What Paul DePodesta … did to bring analytics into the world of baseball is absolutely parallel to what needs to happen — and is happening — in retail,” said Greg Girard, program director of merchandising strategies and retail analytics for Framingham, Mass.-based IDC Retail Insights.

“It’s a big cultural change, but that’s something a board member can effect,” Girard said. “And he’s got street cred to take it down to the line of business guys who need to change, who need to bring analytics and analysis into retail decisions.”…

“Analytics has been something folks in retail have talked about for quite some time, but they’re redoubling their efforts now,” Girard said. “Drowning in data and not knowing what data’s relevant, which data to retain and for how long, is the No. 1 challenge retailers are having as they move into what we call Big Data.”

Fascinating. People like DePodesta are credited with starting a revolution in sports by developing new statistics and then using that information to outwit the market. For example, DePodesta and a host of others before him (possibly with Bill James at the beginning) found that certain traits like on-base percentage were undervalued, and teams, like the small-market Oakland Athletics, could build decent teams without overpaying for the biggest free agents. Of course, once other teams caught on to this idea, on-base percentage was no longer undervalued. The Boston Red Sox, one of the biggest-spending baseball teams, picked up this idea, paid handsomely for such skills, and went on to win two World Series championships. So teams now have to look at other undervalued areas. One recent avenue that Major League Baseball shut down was spending more on overseas talent and draft picks to build up a farm system quickly. These ideas are now spreading to other sports: some NBA teams are making use of such data, and new precise data will soon be collected on soccer players while they are on the pitch.

The same thought process could apply to business. If so, the process might look like this: find new ways to measure retail activity or home in on less understood data that is already out there. Then maximize a response to these lesser-known concepts and maneuver around competitors. When they start to catch on, keep innovating and stay a step or two ahead. Sears, which has struggled in recent years, could use a lot of this moving forward. Even if DePodesta is able to identify trends others have not, he would still have to convince the board and company to change course.

It will be interesting to see how DePodesta comes out of this. If Sears continues to lose ground, how much of that will rub off on him? If there is a turnaround, how much credit will he get?

NCAA Scholarly Colloquium: ideology versus “In God we trust; everyone else should bring data”

The Chronicle of Higher Education examines how much criticism of the NCAA will be allowed at its upcoming annual Scholarly Colloquium and includes a fascinating quote about how data should be used:

The colloquium was the brainchild of Myles Brand, a former NCAA president and philosopher who saw a need for more serious research on college sports. He and others believed that such an event could foster more open dialogue between the scholars who study sport issues and the people who work in the game.

Mr. Brand emphasized that the colloquium should be data-based and should avoid ideology. “Myles always used to joke: ‘In God we trust; everyone else should bring data,'” said Mr. Renfro, a former top adviser to Mr. Brand.

But as Mr. Renfro watched presentations at last year’s colloquium, which focused on changes the NCAA has made in its academic policies in recent years, he did not see a variety of perspectives.

“I was hearing virtually one voice being sung by a number of people … and it was relatively critical of the NCAA’s academic-reform effort,” he said. “I don’t care whether it was critical or not, but I care about whether there are different perspectives presented.”

This is a classic argument: data versus ideology, facts versus opinions. This short bit about Myles Brand makes it sound like Brand thought bringing more data to the table when discussing the NCAA would be a good thing. Data might blunt opinions and arguments and push people with an agenda to back up their claims. It could lead to more constructive conversations. But data is not completely divorced from ideology. Researchers choose which topics to study. Data has to be collected properly. Interpreting data is still an important skill; people can use data incorrectly. And it sounds like an issue here is that people might use data to continue criticizing the NCAA – and this does not make the NCAA happy.

Generally, I’m in favor of bringing more data to the table when discussing issues. However, having data doesn’t necessarily solve problems. As I tell my statistics classes, I don’t want them to be people who blindly believe all data or statistics because it is data and I also don’t want them to be people who dismiss all data or statistics because they can be misused and twisted. It sounds like some of this still needs to be sorted out with the NCAA Scholarly Colloquium.

Reading between the lines of an ABC News story on the bad odds of winning the $500 million Powerball lottery

Check out this ABC News video about the odds of winning the $500 million Powerball lottery.

Several things are striking about the content of the video beyond the bad odds of winning: a 1-in-175-million chance.

1. A journalist admits he doesn’t know much about math or statistics. It is not uncommon for reporters to go to experts like statisticians at times like these (appealing to the expert boosts the credentials of the story), but it is more unusual for journalists to admit they are doing so because they don’t know the information. I’ve argued before that we need more journalists who understand statistics and science.

2. The reporter mentions some interesting odds that are more favorable than winning the Powerball. One of these is the idea that you are more likely to be possessed by the devil today than win the lottery. Who exactly keeps track of these figures and how accurate are they?

3. The story includes some talk about being more likely to win in particular states than others. Really? This sounds more like statistical noise or something related to the population of the states with multiple Powerball winners (like Illinois and New Jersey).

4. Interesting closing: the math expert himself hasn’t bought a lottery ticket before. So the moral of the story is that people shouldn’t buy any tickets?
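For reference, the headline figure can be reproduced from the game’s rules. This is a quick sketch assuming the 2012-era Powerball format (5 white balls drawn from 59, plus 1 red Powerball from 35):

```python
from math import comb

# Assumed 2012-era Powerball format: the jackpot requires matching
# all 5 white balls (chosen from 59) and the 1 red ball (from 35).
jackpot_odds = comb(59, 5) * 35

print(f"1 in {jackpot_odds:,}")  # 1 in 175,223,510
```

That works out to roughly the “1 in 175 million” cited in the story.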

How statistics “change(s) the way you see the world”

This article suggests looking at some well-known statistics problems will “change the way you see the world.” Enjoy the Monty Hall problem, the birthday paradox, gambler’s ruin, Abraham Wald’s memo, and Simpson’s paradox.
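As a taste of why these puzzles are compelling, the first one, the Monty Hall problem, is easy to check with a quick simulation (my own sketch, not from the article):

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Estimate the win rate for the stay/switch strategies by simulation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's initial choice
        # Host opens a door that is neither the contestant's pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

print(monty_hall(switch=True), monty_hall(switch=False))
```

Switching wins roughly two-thirds of the time; staying wins roughly one-third.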

Here is what is missing from this article: explaining how statistics is helpful beyond these five particular cases. How would statistics help in a different situation? What is the day-to-day usefulness of statistics? I would suggest several things:

1. Statistics helps move us away from individualistic or anecdotal views of reality and toward a broader view.

2. Statistics can encourage us to ask questions about “reality” and look for data and hidden patterns.

3. Knowing about statistics can help people decipher the numbers that they see every day in news stories and advertisements. What do survey or poll results mean? What numbers can I trust?

Tackling these sorts of issues would be much better for the public than looking at five fun and odd applications of statistics. Of course, these three points may not be as interesting as five statistical brain teasers, but the five cases should be used to point us to the larger issues at hand.

Three changes that come with “The Rise of Poll Quants”

Nate Silver isn’t the only one making election predictions based on poll data; there are now a number of “poll quants” who are using similar techniques.

So what exactly do these guys do? Basically, they take polls, aggregate the results, and make predictions. They each do it somewhat differently. Silver factors in state polls and national polls, along with other indicators, like monthly job numbers. Wang focuses on state polls exclusively. Linzer’s model looks at historical factors several months before the election but, as voting draws nearer, weights polls more heavily.

At the heart of all their models, though, are the state polls. That makes sense because, thanks to the Electoral College system, it’s the state outcomes that matter. It’s possible to win the national vote and still end up as the head of a cable-television channel rather than the leader of the free world. But also, as Wang explains, it’s easier for pollsters to find representative samples in a particular state. Figuring out which way Arizona or even Florida might go isn’t as tough as sizing up a country as big and diverse as the United States. “The race is so close that, at a national level, it’s easy to make a small error and be a little off,” Wang says. “So it’s easier to call states. They give us a sharper, more accurate picture.”

But the forecasters don’t just look at one state poll. While most news organizations trot out the latest, freshest poll and discuss it in isolation, these guys plug it into their models. One poll might be an outlier; a whole bunch of polls are likely to get closer to the truth. Or so the idea goes. Wang uses all the state polls, but gives more weight to those that survey likely voters, as opposed to those who are just registered to vote. Silver has his own special sauce that he doesn’t entirely divulge.
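The basic aggregation idea can be sketched as a weighted average of poll margins. The polls, the sample sizes, and the extra weight for likely-voter screens below are all purely illustrative; the actual models (Silver’s “special sauce” included) are far more elaborate:

```python
def aggregate_polls(polls):
    """Weighted average of poll margins (candidate lead in points).
    Weight by sample size, with an illustrative extra bump for
    likely-voter samples over registered-voter samples."""
    num = den = 0.0
    for margin, sample_size, likely_voters in polls:
        weight = sample_size * (1.5 if likely_voters else 1.0)
        num += weight * margin
        den += weight
    return num / den

# Hypothetical state polls: (margin, sample size, likely-voter screen?)
state_polls = [(+2.0, 800, True), (-1.0, 600, False), (+3.0, 1200, True)]
print(aggregate_polls(state_polls))  # → 2.0
```

The point is simply that one outlier poll (here, the -1.0 margin) gets pulled toward the consensus instead of driving the story.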

Both Wang and Linzer find it annoying that individual polls are hyped to make it seem as if the race is closer than it is, or to create the illusion that Romney and Obama are trading the lead from day to day. They’re not. According to the state polls, when taken together, the race has been fairly stable for weeks, and Obama has remained well ahead and, going into Election Day, is a strong favorite. “The best information comes from combining all the polls together,” says Linzer, who projects that Obama will get 326 electoral votes, well over the 270 required to win. “I want to give readers the right information, even if it’s more boring.”

While it may not seem likely, poll aggregation is a threat to the supremacy of the punditocracy. In the past week, you could sense that some high-profile media types were being made slightly uncomfortable by the bespectacled quants, with their confusing mathematical models and zippy computer programs. The New York Times columnist David Brooks said pollsters who offered projections were citizens of “sillyland.”

Three things strike me from reading these “poll quants” leading up to the election:

1. This is what is possible when data is widely available: these pundits use different methods for their models but it wouldn’t be possible without accessible data, consistent and regular polling (at the state and national level), and relatively easy to use statistical programs. In other words, could this scenario have taken place even 20 years ago?

2. It will be fascinating to watch how the media deals with these predictive models. Can they incorporate these predictions into their typical entertainment presentation? Will we have a new kind of pundit in the next few years? The article still noted the need for these quantitative pundits to have personality and style so that their results are not too dry for the larger public. Could we end up in a world where CNN has the exclusive rights to Silver’s model, Fox News has rights to another model, and so on?

3. All of this conversation about statistics, predictions, and modeling has the potential to really show where the American public and elites stand in terms of statistical knowledge. Can people understand the basics of these models? Do they simply blindly trust the models because they are “scientific proof” or do they automatically reject them because all numbers can be manipulated? Do some pundits know just enough to be dangerous and ask endless questions about the assumptions of different models? There is a lot of potential here to push quantitative literacy as a key part of living in the 21st century world. And it is only going to get more statistical as more organizations collect more data and new research and prediction opportunities arise.

Correlation and not causation: Redskins games predict results of presidential election

Big events like presidential elections tend to bring out some crazy data patterns. Here is my nomination for the oddest one of this election season: how the Washington Redskins do in their final game before the election predicts the presidential election.

Since 1940 — when the Redskins moved to D.C. — the team’s outcome in its final game before the presidential election has predicted which party would win the White House each time but once.

When the Redskins win their game before the election, the incumbent party wins the presidential vote. If the Redskins lose, the non-incumbent wins.

The only exception was in 2004, when Washington fell to Green Bay, but George W. Bush still went on to win the election over John Kerry.

This is simply a quirk of the data: how the Redskins do should have little to no effect on voting in other states. This is exactly what correlation without causation is about; there may be a clear pattern, but it doesn’t necessarily mean the two related facts cause each other. There may be some spurious association here, some variable that predicts both outcomes, but even that is hard to imagine. Yet the Redskins Rule has garnered a lot of attention in recent days. Why? A few possible reasons:

1. It connects two American obsessions: presidential elections and the NFL. A sidelight: both may involve a lot of betting.

2. So much reporting has been done on the 2012 elections that this adds a more whimsical and mysterious element.

3. Humans like to find patterns, even if these patterns don’t make much sense.
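On the pattern-finding point, a rough back-of-the-envelope calculation shows why a streak like this is less mysterious than it looks. Assuming 18 elections from 1940 through 2008 with 17 called correctly (my reading of the quoted record, not a figure from the article):

```python
from math import comb

# Assumed record: 18 elections (1940-2008), 17 called correctly.
n, correct = 18, 17

# Probability that a meaningless coin-flip "rule" matches the
# election outcome at least 17 times out of 18 by pure chance:
p = sum(comb(n, k) for k in range(correct, n + 1)) / 2 ** n

print(p)  # ≈ 7.2e-05
```

Tiny on its own, but search through tens of thousands of candidate statistics (sports results, economic series, hemline lengths) and a hit this good becomes almost expected somewhere; it just won’t mean anything.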

What’s next, an American octopus who can predict presidential elections?