Statistical models are supposed to help predict life, right? So why not use a statistical model to help make a marriage proposal?
I think this is clever. And the one-page paper that goes with it is not bad either.
h/t Instapundit
Fight it if you like, but baseball has become too complicated to solve without science. Every rotation of every pitch is measured now. Every inch that a baseball travels is measured now. Teams that used to get mocked for using spreadsheets now rely on databases packed with precise location and movement of every player on every play — and those teams are the norm, not the film-inspiring exceptions. This is exciting and it’s terrifying…
I’m not a mathematician and I’m not a scientist. I’m a guy who tries to understand baseball with common sense. In this era, that means embracing advanced metrics that I don’t really understand. That should make me a little uncomfortable, and it does. WAR is a crisscrossed mess of routes leading toward something that, basically, I have to take on faith…
Yet baseball’s front offices, the people in charge of $100 million payrolls and all your hope for the 2013 season, side overwhelmingly with data. For team executives, the basic framework of WAR — measuring players’ total performance against a consistent baseline — is commonplace, used by nearly every front office, according to insiders. The writers who helped guide the creation of WAR over the decades — including Bill James, Sean Smith and Keith Woolner — work for teams now. As James told me, the war over WAR has ceased where it matters. “There’s a practical necessity for measurements like that in a front office that make it irrelevant whether you like them or you don’t.”
Whether you do is up to you and ultimately matters only to you. In the larger perspective, the debate is over, and data won. So fight it if you’d like. But at a certain point, the question in any debate against science is: What are you really fighting and why?
As someone who likes data, I would say statistics is just another tool that can help us understand baseball better. It doesn’t have to be an either/or argument, baseball with advanced statistics versus baseball without them. Baseball with advanced statistics is more complete: it gets at some of the underlying mechanics of the game rather than relying on visual cues or culturally accepted statistics.
While this story is specifically about baseball, I think it also mirrors larger conversations in American society about the use of statistics. Why interrupt people’s common-sense understandings of the world with abstract data? Aren’t these new statistics difficult to understand, and can’t they also be manipulated? Some of this is true: looking at data can involve seeing things in new ways, and there are disagreements about how to define concepts as well as how to collect and interpret data. But, in the end, these statistics can help us better understand the world.
The Atlantic has a front-page lead-in to a magazine article I recently blogged about. While the article is good, a new superhero is born: the p<.05 hero.
Statistics could use a good superhero. This one inspires at least 95% confidence.
Here is an odd mixing of the data, sports, and business worlds: Sears recently named Paul DePodesta to its board.
Paul DePodesta, one of the heroes of Michael Lewis’ “Moneyball: The Art of Winning an Unfair Game,” a great 2003 baseball book (and later a movie) about the 2002 A’s that’s more about business and epistemology than baseball, has been named to the board of Hoffman Estates-based Sears Holdings Corp.
To be sure, he’s an unconventional choice for the parent of Sears and Kmart. But Chairman Edward Lampert is thinking outside the box score, welcoming the New York Mets’ vice president of player development and amateur scouting into his clubhouse…
“What Paul DePodesta … did to bring analytics into the world of baseball is absolutely parallel to what needs to happen — and is happening — in retail,” said Greg Girard, program director of merchandising strategies and retail analytics for Framingham, Mass.-based IDC Retail Insights.
“It’s a big cultural change, but that’s something a board member can effect,” Girard said. “And he’s got street cred to take it down to the line of business guys who need to change, who need to bring analytics and analysis into retail decisions.”…
“Analytics has been something folks in retail have talked about for quite some time, but they’re redoubling their efforts now,” Girard said. “Drowning in data and not knowing what data’s relevant, which data to retain and for how long, is the No. 1 challenge retailers are having as they move into what we call Big Data.”
Fascinating. People like DePodesta are credited with starting a revolution in sports by developing new statistics and then using that information to outwit the market. For example, DePodesta and a host of others before him (with Bill James arguably at the beginning) found that certain traits, like on-base percentage, were undervalued, so teams like the small-market Oakland Athletics could build decent rosters without overpaying for the biggest free agents. Of course, once other teams caught on to this idea, on-base percentage was no longer undervalued. The Boston Red Sox, one of the biggest-spending teams in baseball, picked up the idea, paid handsomely for such skills, and went on to win two World Series championships. So teams now have to look for other undervalued areas. One recent avenue, spending more on overseas talent and draft picks to build up a farm system quickly, has been shut down by Major League Baseball. These ideas are now spreading to other sports: some NBA teams are making use of such data, and precise new data will soon be collected from soccer players while they are on the pitch.
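To make the mechanics concrete, here is a minimal sketch in Python of the kind of comparison behind the “undervalued traits” argument: compute on-base percentage from standard counting stats and ask what that on-base skill costs in salary. The players and numbers below are entirely hypothetical.

    # On-base percentage from standard counting stats:
    # OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
    def obp(hits, walks, hbp, at_bats, sac_flies):
        return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

    # Two hypothetical players: B costs more but gets on base less often.
    players = {
        #      H    BB  HBP   AB  SF  salary ($M)
        "A": (150,  90,   5, 520,  4, 4.0),
        "B": (160,  35,   2, 560,  6, 9.0),
    }

    for name, (h, bb, hbp, ab, sf, salary) in players.items():
        rate = obp(h, bb, hbp, ab, sf)
        print(f"{name}: OBP {rate:.3f}, salary ${salary}M")

A front office running this comparison across the whole league is, in effect, shopping for the cheapest on-base skill available, which is the core of the Moneyball story.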
The same thought process could apply to business. If so, the process might look like this: find new ways to measure retail activity or home in on less understood data that is already out there, then maximize the response to these lesser-known measures and maneuver around competitors. When they start to catch on, keep innovating and stay a step or two ahead. Sears could use a lot of this moving forward, as it has struggled in recent years. Even if DePodesta is able to identify trends others have not, he would still have to convince a board and company to change course.
It will be interesting to see how DePodesta comes out of this. If Sears continues to lose ground, how much of that will rub off on him? If there is a turnaround, how much credit will he get?
The Chronicle of Higher Education examines how much criticism of the NCAA will be allowed at its upcoming annual Scholarly Colloquium and includes a fascinating quote about how data should be used:
The colloquium was the brainchild of Myles Brand, a former NCAA president and philosopher who saw a need for more serious research on college sports. He and others believed that such an event could foster more open dialogue between the scholars who study sport issues and the people who work in the game.
Mr. Brand emphasized that the colloquium should be data-based and should avoid ideology. “Myles always used to joke: ‘In God we trust; everyone else should bring data,'” said Mr. Renfro, a former top adviser to Mr. Brand.
But as Mr. Renfro watched presentations at last year’s colloquium, which focused on changes the NCAA has made in its academic policies in recent years, he did not see a variety of perspectives.
“I was hearing virtually one voice being sung by a number of people … and it was relatively critical of the NCAA’s academic-reform effort,” he said. “I don’t care whether it was critical or not, but I care about whether there are different perspectives presented.”
This is a classic argument: data versus ideology, facts versus opinions. This short bit about Myles Brand makes it sound like Brand thought bringing more data to the table when discussing the NCAA would be a good thing. Data might blunt opinions, push people with an agenda to back up their arguments, and lead to more constructive conversations. But data is not completely divorced from ideology. Researchers choose which topics to study. Data has to be collected carefully. Interpreting data is still an important skill; people can use data incorrectly. And it sounds like the issue here is that people might be able to use data to continue to criticize the NCAA – and this does not make the NCAA happy.
Generally, I’m in favor of bringing more data to the table when discussing issues. However, having data doesn’t necessarily solve problems. As I tell my statistics classes, I don’t want them to be people who blindly believe all data or statistics because it is data, and I also don’t want them to be people who dismiss all data or statistics because they can be misused and twisted. It sounds like some of this still needs to be sorted out with the NCAA Scholarly Colloquium.
Check out this ABC News video about the odds of winning the $500 million Powerball lottery.
Several things are striking about the content of the video beyond the bad odds of winning (a 1 in 175 million chance):
1. A journalist admits he doesn’t know much about math or statistics. It is not uncommon for reporters to go to experts like statisticians at times like these (appealing to an expert boosts the credibility of the story), but it is more unusual for journalists to admit they are doing so because they don’t know the information themselves. I’ve argued before that we need more journalists who understand statistics and science.
2. The reporter mentions some interesting odds that are more favorable than winning the Powerball. One of these is the idea that you are more likely to be possessed by the devil today than win the lottery. Who exactly keeps track of these figures and how accurate are they?
3. The story includes some talk about being more likely to win in particular states than others. Really? This sounds more like statistical noise, or something related to the population sizes of the states with multiple Powerball winners (like Illinois and New Jersey).
4. An interesting closing: the math expert himself has never bought a lottery ticket. So is the moral of the story that people shouldn’t buy any tickets?
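For anyone curious where that 1 in 175 million figure comes from, it falls out of basic combinatorics. A quick check in Python, assuming the 2012 Powerball format of five white balls drawn from 59 plus one red ball drawn from 35:

    from math import comb

    # Jackpot odds under the (assumed) 2012 Powerball format:
    # match 5 white balls chosen from 59 and 1 red ball chosen from 35.
    white_combinations = comb(59, 5)   # 5,006,386 ways to pick the white balls
    total = white_combinations * 35    # times 35 choices for the Powerball
    print(f"1 in {total:,}")           # 1 in 175,223,510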
This article suggests looking at some well-known statistics problems will “change the way you see the world.” Enjoy the Monty Hall problem, the birthday paradox, gambler’s ruin, Abraham Wald’s memo, and Simpson’s paradox.
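For readers who would rather verify than take the answers on faith, the Monty Hall problem is easy to brute-force. A short simulation sketch in Python (the 100,000-trial count is arbitrary):

    import random

    # Monty Hall by brute force: after the host opens a goat door,
    # switching wins exactly when the first pick was wrong.
    trials = 100_000
    switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)     # door hiding the car
        pick = random.randrange(3)    # contestant's initial choice
        switch_wins += (pick != car)  # switching wins iff the pick was wrong

    print(f"win rate if you switch: {switch_wins / trials:.3f}")      # ~0.667
    print(f"win rate if you stay:   {1 - switch_wins / trials:.3f}")  # ~0.333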
Here is what is missing from this article: explaining how statistics is helpful beyond these five particular cases. How would statistics help in a different situation? What is the day-to-day usefulness of statistics? I would suggest several things:
1. Statistics helps move us away from individualistic or anecdotal views of reality and toward a broader view.
2. Statistics can encourage us to ask questions about “reality” and look for data and hidden patterns.
3. Knowing about statistics can help people decipher the numbers they see every day in news stories and advertisements. What do survey or poll results mean? What numbers can I trust?
Tackling these sorts of issues would do much more for the public than looking at five fun and odd applications of statistics. Of course, these three points may not be as interesting as five statistical brain teasers, but those five cases should be used to point us to the larger issues at hand.
Nate Silver isn’t the only one making election predictions based on poll data; there are now a number of “poll quants” who are using similar techniques.
So what exactly do these guys do? Basically, they take polls, aggregate the results, and make predictions. They each do it somewhat differently. Silver factors in state polls and national polls, along with other indicators, like monthly job numbers. Wang focuses on state polls exclusively. Linzer’s model looks at historical factors several months before the election but, as voting draws nearer, weights polls more heavily.
At the heart of all their models, though, are the state polls. That makes sense because, thanks to the Electoral College system, it’s the state outcomes that matter. It’s possible to win the national vote and still end up as the head of a cable-television channel rather than the leader of the free world. But also, as Wang explains, it’s easier for pollsters to find representative samples in a particular state. Figuring out which way Arizona or even Florida might go isn’t as tough as sizing up a country as big and diverse as the United States. “The race is so close that, at a national level, it’s easy to make a small error and be a little off,” Wang says. “So it’s easier to call states. They give us a sharper, more accurate picture.”
But the forecasters don’t just look at one state poll. While most news organizations trot out the latest, freshest poll and discuss it in isolation, these guys plug it into their models. One poll might be an outlier; a whole bunch of polls are likely to get closer to the truth. Or so the idea goes. Wang uses all the state polls, but gives more weight to those that survey likely voters, as opposed to those who are just registered to vote. Silver has his own special sauce that he doesn’t entirely divulge.
Both Wang and Linzer find it annoying that individual polls are hyped to make it seem as if the race is closer than it is, or to create the illusion that Romney and Obama are trading the lead from day to day. They’re not. According to the state polls, when taken together, the race has been fairly stable for weeks, and Obama has remained well ahead and, going into Election Day, is a strong favorite. “The best information comes from combining all the polls together,” says Linzer, who projects that Obama will get 326 electoral votes, well over the 270 required to win. “I want to give readers the right information, even if it’s more boring.”
While it may not seem likely, poll aggregation is a threat to the supremacy of the punditocracy. In the past week, you could sense that some high-profile media types were being made slightly uncomfortable by the bespectacled quants, with their confusing mathematical models and zippy computer programs. The New York Times columnist David Brooks said pollsters who offered projections were citizens of “sillyland.”
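A toy version of the aggregation step described in that excerpt might look something like this sketch in Python. The polls below are invented, and while the likely-voter weighting echoes Wang’s approach, the specific weights are my own illustration:

    # Invented polls for one state: (candidate's share in %, sample size, type)
    polls = [
        (49.0, 800, "LV"),   # likely voters
        (51.5, 600, "LV"),
        (47.5, 1000, "RV"),  # registered voters, weighted less
    ]

    def aggregate(polls, rv_discount=0.6):
        weighted_sum = total_weight = 0.0
        for share, n, kind in polls:
            weight = n * (1.0 if kind == "LV" else rv_discount)
            weighted_sum += weight * share
            total_weight += weight
        return weighted_sum / total_weight

    print(f"aggregated share: {aggregate(polls):.1f}%")  # 49.3%

The point is less the particular weights than the averaging itself: a single outlier poll gets diluted by the rest of the data instead of driving the headline.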
Three things strike me from reading about these “poll quants” in the lead-up to the election:
1. This is what is possible when data is widely available: these pundits use different methods for their models, but none of it would work without accessible data, consistent and regular polling (at the state and national levels), and relatively easy-to-use statistical programs. In other words, could this scenario have taken place even 20 years ago?
2. It will be fascinating to watch how the media deals with these predictive models. Can they incorporate these predictions into their typical entertainment presentation? Will we have a new kind of pundit in the next few years? The article noted that these quantitative pundits still need personality and style so that their results are not too dry for the larger public. Could we end up in a world where CNN has exclusive rights to Silver’s model, Fox News has rights to another model, and so on?
3. All of this conversation about statistics, predictions, and modeling has the potential to show where the American public and elites stand in terms of statistical knowledge. Can people understand the basics of these models? Do they blindly trust the models because they are “scientific proof,” or do they automatically reject them because all numbers can be manipulated? Do some pundits know just enough to be dangerous and ask endless questions about the assumptions of different models? There is a lot of potential here to push quantitative literacy as a key part of living in the 21st-century world. And life is only going to get more statistical as more organizations collect more data and new research and prediction opportunities arise.
Big events like presidential elections tend to bring out some crazy data patterns. Here is my nomination for the oddest one of this election season: how the Washington Redskins do in their final game before the election predicts the presidential election.
Since 1940 — when the Redskins moved to D.C. — the team’s outcome in its final game before the presidential election has predicted which party would win the White House each time but once.
When the Redskins win their game before the election, the incumbent party wins the presidential vote. If the Redskins lose, the non-incumbent wins.
The only exception was in 2004, when Washington fell to Green Bay, but George W. Bush still went on to win the election over John Kerry.
This is simply a quirk of the data: how the Redskins do should have little to no effect on voting in other states. This is exactly what correlation without causation is about; there may be a clear pattern, but it doesn’t necessarily mean the two related facts cause each other. There may be some spurious association here, some variable that predicts both outcomes, but even that is hard to imagine. Yet the Redskins Rule has garnered a lot of attention in recent days. Why? A few possible reasons:
1. It connects two American obsessions: presidential elections and the NFL. A sidelight: both may involve a lot of betting.
2. So much reporting has been done on the 2012 elections that this adds a more whimsical and mysterious element.
3. Humans like to find patterns, even if these patterns don’t make much sense.
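To put a rough number on that third point: even a meaningless coin-flip “rule” has some chance of matching 17 of 18 elections, and with thousands of candidate rules floating around (team records, hemlines, octopi), such a streak is almost guaranteed to turn up somewhere. A quick binomial check in Python, assuming 18 elections from 1940 through 2008:

    from math import comb

    # Chance that one pure coin-flip rule matches at least 17 of 18 elections:
    n = 18
    prob_one_rule = sum(comb(n, k) for k in (17, 18)) * 0.5**n
    print(f"one rule: {prob_one_rule:.6f}")   # about 0.00007

    # If, say, 10,000 candidate rules are being checked, the chance that at
    # least one of them posts such a streak is no longer small:
    rules = 10_000
    prob_some_rule = 1 - (1 - prob_one_rule) ** rules
    print(f"any of {rules:,} rules: {prob_some_rule:.2f}")   # about 0.52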
What’s next, an American octopus who can predict presidential elections?
Political polling has come under a lot of recent fire but a sociologist defends these predictions and reminds us that we rely on many such predictions:
We rely on statistical models for many decisions every single day, including, crucially: weather, medicine, and pretty much any complex system in which there’s an element of uncertainty to the outcome. In fact, these are the same methods by which scientists could tell Hurricane Sandy was about to hit the United States many days in advance…
This isn’t wizardry, this is the sound science of complex systems. Uncertainty is an integral part of it. But that uncertainty shouldn’t suggest that we don’t know anything, that we’re completely in the dark, that everything’s a toss-up.
Polls tell you the likely outcome with some uncertainty and some sources of (both known and unknown) error. Statistical models take a bunch of factors and run lots of simulations of elections by varying those outcomes according to what we know (such as other polls, structural factors like the economy, what we know about turnout, demographics, etc.) and what we can reasonably infer about the range of uncertainty (given historical precedents and our logical models). These models then produce probability distributions…
Refusing to run statistical models simply because they produce probability distributions rather than absolute certainty is irresponsible. For many important issues (climate change!), statistical models are all we have and all we can have. We still need to take them seriously and act on them (well, if you care about life on Earth as we know it, blah, blah, blah).
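In code, the simulation loop the sociologist describes can be sketched in a few lines of Python. Everything below (the states, the poll margins, the size of the polling error) is invented for illustration:

    import random

    # Toy Monte Carlo election model: for each state, draw an outcome from
    # the poll margin plus normally distributed polling error, tally the
    # electoral votes, and read a win probability off the results.
    states = {  # state: (electoral votes, poll margin for candidate A in points)
        "A": (29, +2.0),
        "B": (18, -1.5),
        "C": (10, +0.5),
    }
    BASE = 260  # electoral votes assumed safe for candidate A elsewhere

    def simulate(trials=10_000, poll_error=3.0):
        wins = 0
        for _ in range(trials):
            ev = BASE
            for votes, margin in states.values():
                if random.gauss(margin, poll_error) > 0:
                    ev += votes
            wins += (ev >= 270)
        return wins / trials

    print(f"win probability: {simulate():.2f}")

Run many times over, the state tallies form exactly the kind of probability distribution the quote describes, rather than a single point prediction.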
A key point here: statistical models have uncertainty (we are making inferences about larger populations or systems from samples that we can collect) but that doesn’t necessarily mean they are flawed.
A second key point: because of what I stated above, we should expect that some statistical predictions will be wrong. But this is how science works: you tweak models, take in more information, perhaps change your data collection, perhaps use different methods of analysis, and hope to get better. While it may not be exciting, confirming what we don’t know also helps us move toward an answer.
I’ve become more convinced in recent years that one of the reasons polls are not used effectively in reporting is that many in the media don’t know exactly how they work. Journalists need to be trained in how to read, interpret, and report on data. This could also be a time issue: how much time do those in the media have to pore over the details of research findings, or do they simply have to scan for new findings? Scientists can pump out study after study, but disseminating this information to the public requires a media that understands how scientific research and the scientific process work. This includes understanding how models are consistently refined, collecting the right data to answer the questions we want to answer, and looking at the accumulated scientific research rather than just grabbing the latest attention-getting finding.
An alternative to this idea about media statistical illiteracy is presented in the article: perhaps the media knows how polls work but likes a political horse race. This may also be true, but there is a lot of reporting on statistics and data outside of political elections that also needs work.