Scientists have difficulty explaining p-values

Scientists regularly use p-values to evaluate their findings but apparently have difficulty explaining exactly what they mean:

To be clear, everyone I spoke with at METRICS could tell me the technical definition of a p-value — the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct — but almost no one could translate that into something easy to understand.

It’s not their fault, said Steven Goodman, co-director of METRICS. Even after spending his “entire career” thinking about p-values, he said he could tell me the definition, “but I cannot tell you what it means, and almost nobody can.” Scientists regularly get it wrong, and so do most textbooks, he said. When Goodman speaks to large audiences of scientists, he often presents correct and incorrect definitions of the p-value, and they “very confidently” raise their hand for the wrong answer. “Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does,” Goodman said.

We want to know if results are right, but a p-value doesn’t measure that. It can’t tell you the magnitude of an effect, the strength of the evidence or the probability that the finding was the result of chance.

So what information can you glean from a p-value? The most straightforward explanation I found came from Stuart Buck, vice president of research integrity at the Laura and John Arnold Foundation. Imagine, he said, that you have a coin that you suspect is weighted toward heads. (Your null hypothesis is then that the coin is fair.) You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin was fair. That’s it — nothing more. And that’s about as simple as I can make it, which means I’ve probably oversimplified it and will soon receive exasperated messages from statisticians telling me so.
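
Buck’s coin example can be made concrete in a few lines of code. The numbers below (100 flips, 60 of them heads) are hypothetical, but the calculation is exactly the tail probability he describes: the chance of getting at least that many heads if the coin really were fair.

```python
# A back-of-the-envelope version of the coin example (hypothetical numbers:
# 100 flips, 60 heads). The p-value is the probability of getting at least
# this many heads if the coin really were fair.
from scipy.stats import binom

n_flips, n_heads = 100, 60
p_value = binom.sf(n_heads - 1, n_flips, 0.5)   # P(X >= 60 | fair coin)
print(f"p-value: {p_value:.4f}")                # roughly 0.028
```

Note what the number does and does not say: it is the probability of data this extreme under the fair-coin assumption, not the probability that the coin is fair.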

Complicated but necessary? This can lead to fun situations when teaching statistics: students need to know enough to do the statistical work and evaluate findings (we at least need to know what to do with a calculated p-value, even if we don’t quite understand what it means), but explaining the complexity of some of these techniques wouldn’t necessarily help the learning process. In fact, the more you learn about statistics, the more you tend to find that the various methods and techniques have limitations even as they help us better understand phenomena.

Confronting the problems with p-values

Nature provides an overview of concerns about how much scientists rely on the p-value, “which is neither as reliable nor as objective as most scientists assume”:

One result is an abundance of confusion about what the P value means [4]. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is…

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation [5], a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result [6, 7]. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails…
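
Those rule-of-thumb numbers can be reproduced with a short calculation. The sketch below assumes the widely cited −e·p·ln(p) bound on the Bayes factor and 50:50 prior odds of a real effect; whether that is exactly the calculation the article’s reference uses is an assumption on my part, but it recovers the 11% and 29% figures quoted above.

```python
import math

def min_false_alarm_prob(p_value, prior_prob_real=0.5):
    """Lower bound on the chance a 'significant' result is a false alarm,
    using the -e * p * ln(p) bound on the Bayes factor (valid for p < 1/e)
    and a prior probability that a real effect exists (default 50:50)."""
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds_null = (1 - prior_prob_real) / prior_prob_real
    posterior_odds_null = bayes_factor_bound * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

for p in (0.05, 0.01):
    print(f"p = {p}: false-alarm probability of at least {min_false_alarm_prob(p):.0%}")
# p = 0.05: false-alarm probability of at least 29%
# p = 0.01: false-alarm probability of at least 11%
```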

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed [8] that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking, ‘How much of an effect is there?’, not ‘Is there an effect?’”
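
To see how a large sample turns a small difference into a tiny p-value, here is an illustrative two-proportion test using the divorce rates quoted above. The 50/50 split of the roughly 19,000 respondents is a hypothetical simplification (the study’s actual group sizes differed), so the exact p-value is only illustrative; the point is that the effect stays under two percentage points no matter how small p gets.

```python
import math
from scipy.stats import norm

# Hypothetical 50/50 split of the ~19,000 respondents; the study's actual
# group sizes differed, so this p-value is only illustrative.
n_online, n_offline = 9500, 9500
rate_online, rate_offline = 0.0596, 0.0767      # divorce rates quoted above

pooled = (rate_online * n_online + rate_offline * n_offline) / (n_online + n_offline)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_online + 1 / n_offline))
z = (rate_offline - rate_online) / se
p_value = 2 * norm.sf(abs(z))                   # two-sided two-proportion z-test

print(f"difference: {rate_offline - rate_online:.2%} points, p = {p_value:.1g}")
# A gap of under two percentage points, yet the huge sample makes p vanishingly small.
```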

Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”

As the article then goes on to note, alternatives haven’t quite caught on. It seems the most basic defense is one that statisticians should adopt anyhow: always recognizing the chance that their statistics could be wrong. It also highlights the need for replicating studies with different datasets to confirm results.

At a relatively basic level, if p-values are so problematic, how does this change the basic statistics courses so many undergraduates take?

Wired’s five tips for “p-hacking” your way to a positive study result

As part of its “Cheat Code to Life,” Wired includes five tips for researchers to obtain positive results in their studies:

Many a budding scientist has found themself one awesome result from tenure and unable to achieve that all-important statistical significance. Don’t let such setbacks deter you from a life of discovery. In a recent paper, Joseph Simmons, Leif Nelson, and Uri Simonsohn describe “p-hacking”—common tricks that researchers use to fish for positive results. Just promise us you’ll be more responsible when you’re a full professor. —MATTHEW HUTSON

Create Options. Let’s say you want to prove that listening to dubstep boosts IQ (aka the Skrillex effect). The key is to avoid predefining what exactly the study measures—then bury the failed attempts. So use two different IQ tests; if only one shows a pattern, toss the other.

Expand the Pool. Test 20 dubstep subjects and 20 control subjects. If the findings reach significance, publish. If not, run 10 more subjects in each group and give the stats another whirl. Those extra data points might randomly support the hypothesis.

Get Inessential. Measure an extraneous variable like gender. If there’s no pattern in the group at large, look for one in just men or women.

Run Three Groups. Have some people listen for zero hours, some for one, and some for 10. Now test for differences between groups A and B, B and C, and A and C. If all compar­isons show significance, great. If only one does, then forget about the existence of the p-value poopers.

Wait for the NSF Grant. Use all four of these fudges and, even if your theory is flat wrong, you’re more likely than not to confirm it—with the necessary 95 percent confidence.
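
The inflation Wired describes is easy to demonstrate by simulation. The sketch below applies just two of the fudges above (two outcome measures, plus peeking at the data before adding more subjects) to studies where the null hypothesis is true by construction; even this simplified version pushes the false-positive rate well above the nominal 5%. The group sizes, correlation, and number of simulated studies are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def one_null_study():
    """One simulated study where dubstep has NO effect on IQ, analyzed with
    two of the fudges above: two outcome measures, and adding 10 subjects
    per group if the first look at the data is not significant."""
    cov = [[225, 112], [112, 225]]                     # two correlated "IQ tests", sd = 15
    dubstep = rng.multivariate_normal([100, 100], cov, 30)
    control = rng.multivariate_normal([100, 100], cov, 30)
    for n in (20, 30):                                 # peek at n = 20, then expand to 30
        for test in (0, 1):                            # report whichever IQ test "works"
            if ttest_ind(dubstep[:n, test], control[:n, test]).pvalue < 0.05:
                return True
    return False

rate = sum(one_null_study() for _ in range(2000)) / 2000
print(f"false-positive rate with just two fudges: {rate:.1%}")  # well above the nominal 5%
```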

This might be summed up as “things that are done but would never be explicitly taught in a research methods course.” Several quick thoughts:

1. This is a reminder of how important the 95% significance level is in the world of science. My students often ask why the cut-point is 95% – why do we accept a 5% false-positive rate and not 10% (which people sometimes “get away with” in some studies) or 1% (wouldn’t we be more sure of our results?).

2. Even if significance is important and scientists hack their way to more positive results, they can still retain some humility about their findings. Clearing the 95% threshold does not make a result certain; as the pieces above suggest, the chance that a significant finding is a false alarm is often well above 5%. Problems arise when findings are countered or disproven, but we should expect this to happen occasionally. Additionally, results can be statistically significant but have little substantive significance. Altogether, having a significant finding is not the end of the process for the scientist: it still needs to be interpreted and then tested again.

3. This is also tied to the pressure of needing to find positive results. In other words, publishing an academic study is more likely if you reject the null hypothesis. At the same time, failing to reject the null hypothesis is still useful knowledge, and such studies should also be published. Think of the example of Edison’s quest to find the proper material for a lightbulb filament. The story is often told in such a way as to suggest that he went through a lot of work to finally find the right answer. But this is often how science works: you go through a lot of ideas and data before the right answer emerges.

The statistical calculations used for counting votes

Some might be surprised to hear that “Counting lots of ballots [in elections] with absolute precision is impossible.” Wired takes a brief look at how reported vote totals are checked:

Most laws leave the determination of the recount threshold to the discretion of registrars. But not California—at least not since earlier this year, when the state assembly passed a bill piloting a new method to make sure the vote isn’t rocking a little too hard. The formula comes from UC Berkeley statistician Philip Stark; he uses the error rate from audited precincts to calculate a key statistical number called the P-value. Election auditors already calculate the number of errors in any given precinct; the P-value helps them determine whether that error rate means the results are wrong. A low P-value means everything is copacetic: The purported winner is probably the one who indeed got the most votes. If you get a high value? Maybe hold off on those balloon drops.

A p-value is a key measure in much statistical analysis – it tells us how likely it is that we would see results at least as extreme as the ones observed if chance alone were at work, and researchers use it to judge whether they can be fairly confident (conventionally at the 95% level) that an estimate from a sample reflects something real in the whole population rather than a fluke of the sample.
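
Stark’s actual risk-limiting audit procedures are more sophisticated than anything shown here, but a toy version conveys the logic: treat “the reported winner actually lost” as the null hypothesis and ask how surprising the audited sample of ballots would be if that were true. The sample size and counts below are made up for illustration.

```python
# Not Stark's actual method -- a toy illustration only. Null hypothesis: the
# reported winner did not really win (true vote share <= 50%). We check how
# surprising the audited sample would be if the race were actually a tie.
from scipy.stats import binom

audited_ballots = 400          # hypothetical sample of ballots pulled for the audit
for_reported_winner = 230      # hypothetical count for the reported winner in that sample

p_value = binom.sf(for_reported_winner - 1, audited_ballots, 0.5)
print(f"p-value: {p_value:.4f}")
# A small p-value is strong evidence against "the reported winner actually lost",
# so the outcome stands; a large p-value means keep auditing (or hold the balloons).
```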

So what is the acceptable p-value for elections in California?

I would be curious to know whether people might seize upon this information for two reasons: (1) it suggests the political system is not exact and therefore possibly corrupt, and (2) they distrust statistics altogether.