Fighting math-phobia in America

The president of Barnard College offers three suggestions for making math more enticing and relevant for Americans:

First, we can work to bring math to those who might shy away from it. Requiring that all students take courses that push them to think empirically with data, regardless of major, is one such approach. At Barnard — a college long known for its writers and dancers — empirical reasoning requirements are built into our core curriculum. And, for those who struggle to meet the demands of data-heavy classes, we provide access (via help rooms) to tutors who focus on diminishing a student’s belief that they “just aren’t good at math.”

Second, employers should encourage applications from and be open to having students with diverse educational interests in their STEM-related internships. Don’t only seek out the computer science majors. This means potentially taking a student who doesn’t come with all the computational chops in hand but does have a good attitude and a willingness to learn. More often than not, such opportunities will surprise both intern and employer. When bright students are given opportunities to tackle problems head on and learn how to work with and manipulate data to address them, even those anxious about math tend to find meaning in what they are doing and succeed. STEM internships also allow students to connect with senior leaders who might have had to overcome a similar experience of questioning their mathematical or computational skills…

Finally, we need to reject the social acceptability of being bad at math. Think about it: You don’t hear highly intelligent people proclaiming that they can’t read, but you do hear many of these same individuals talking about “not being a math person.” When we echo negative sentiments like that to ourselves and each other, we perpetuate a myth that increases overall levels of math phobia. When students reject math, they pigeonhole themselves into certain jobs and career paths, foregoing others only because they can’t imagine doing more computational work. Many people think math ability is an immutable trait, but evidence clearly shows this is a subject in which we can all learn and succeed.

Fighting innumeracy – an inability to use or understand numbers – is a worthwhile goal. I like the efforts suggested above, though I worry a bit that they are tied too heavily to jobs and national competitiveness. These goals can veer toward efficiency and utilitarianism rather than more tangible results like a better understanding of, and interaction with, society and self. Fighting stigma is going to be hard if it relies on invoking more pressure – the US is falling behind! your future career is on the line! – rather than on showing how numbers can help people.

This is why I would be in favor of more statistics training for students at all levels. The math required to do statistics can be tailored to different levels, statistical tests, and subjects. The basic knowledge can be helpful in all sorts of situations citizens regularly encounter: interpreting reports on surveys and polls, calculating odds and risks (including in finances and sports), and understanding research results. The math does not have to be complicated, and instruction can focus on where statistics come from and how they can be used.
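As one concrete illustration (my own example, not from the post above): interpreting the margin of error on a poll, assuming a simple random sample, takes only a line of arithmetic.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# A poll of 1,000 respondents reporting 52% support for a candidate:
print(round(margin_of_error(0.52, 1000) * 100, 1))  # about 3.1 percentage points
```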

I wonder how much of this might also be connected to the complicated relationship Americans have with expertise and advanced degrees. Think of the typical Hollywood scene of a genius at work: do they look crazy or unusual? Think about presidential candidates: do Americans want people with experience and knowledge or someone they can identify with and have dinner with? Math, seeming unknowable to people of average intelligence, may be associated with those smart eccentrics who are necessary for helping society progress but are not necessarily the people you would want to be or hang out with.

The retraction of a study provides a reminder of the importance of levels of measurement

Early in Statistics courses, students learn about different ways that variables can be measured. This is often broken down into three categories: nominal variables (unordered, unranked), ordinal variables (ranked but with varied category widths), and interval-ratio variables (ranked, with consistent spacing between categories). Decisions about how to measure variables can have significant influence on what can be done with the data later. For example, here is a study that received a lot of attention when published but in which the researchers miscoded a nominal variable:

In 2015, a paper by Jean Decety and co-authors reported that children who were brought up religiously were less generous. The paper received a great deal of attention, and was covered by over 80 media outlets including The Economist, the Boston Globe, the Los Angeles Times, and Scientific American. As it turned out, however, the paper by Decety was wrong. Another scholar, Azim Shariff, a leading expert on religion and pro-social behavior, was surprised by the results, as his own research and meta-analysis (combining evidence across studies from many authors) indicated that religious participation, in most settings, increased generosity. Shariff requested the data to try to understand more clearly what might explain the discrepancy.

To Decety’s credit, he released the data. And upon re-analysis, Shariff discovered that the results were due to a coding error. The data had been collected across numerous countries, e.g. United States, Canada, Turkey, etc. and the country information had been coded as “1, 2, 3…” Although Decety’s paper had reported that they had controlled for country, they had accidentally not controlled for each country, but just treated it as a single continuous variable so that, for example “Canada” (coded as 2) was twice the “United States” (coded as 1). Regardless of what one might think about the relative merits and rankings of countries, this is obviously not the right way to analyze data. When it was correctly analyzed, using separate indicators for each country, Decety’s “findings” disappeared. Shariff’s re-analysis and correction was published in the same journal, Current Biology, in 2016. The media, however, did not follow along. While it covered extensively the initial incorrect results, only four media outlets picked up the correction.

In fact, Decety’s paper has continued to be cited in media articles on religion. Just last month two such articles appeared (one on Buzzworthy and one on TruthTheory) citing Decety’s paper that religious children were less generous. The paper’s influence seems to continue even after it has been shown to be wrong.

Last month, however, the journal, Current Biology, at last formally retracted the paper. If one looks for the paper on the journal’s website, it gives notice of the retraction by the authors. Correction mechanisms in science can sometimes work slowly, but they did, in the end, seem to be effective here. More work still needs to be done as to how this might translate into corrections in media reporting as well: The two articles above were both published after the formal retraction of the paper.

To reiterate, the researchers treated country – a nominal variable in this case, since the countries were not ranked or ordered in any particular way – as if it were a continuous measure, which threw off the overall results. When country was used correctly – from the description above, it sounds like a set of dummy variables coded 1 and 0, one for each country – the findings that received all the attention disappeared.
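To make the difference concrete, here is a minimal sketch in Python (with made-up numbers, not the study’s data) contrasting the erroneous numeric coding with separate indicators for each country:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data in the spirit of the study: a generosity score, a country code,
# and a religiosity indicator.
df = pd.DataFrame({
    "generosity": [3, 5, 4, 6, 2, 7, 5, 4],
    "country":    [1, 1, 2, 2, 3, 3, 4, 4],   # 1 = US, 2 = Canada, 3 = Turkey, ...
    "religious":  [1, 0, 1, 0, 1, 0, 1, 0],
})

# The mistake: treating the nominal country code as if it were interval-ratio,
# so that "Canada" (coded 2) is modeled as twice the "United States" (coded 1).
wrong = smf.ols("generosity ~ religious + country", data=df).fit()

# The fix: C() expands country into separate indicator (dummy) variables,
# one per country, with no implied ordering.
right = smf.ols("generosity ~ religious + C(country)", data=df).fit()
```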

The other issue at play here is whether corrections and retractions of academic studies receive the same treatment as the original findings. It is hard to notify readers that a previously published study had flaws and that the results have changed.

All that to say, paying attention to level of measurement earlier in the process helps avoid problems down the road.

Recommendations to help with SCOTUS’ innumeracy

In the wake of recent comments about “sociological gobbledygook” and measures of gerrymandering, here are some suggestions for how the Supreme Court can better use statistical evidence:

McGhee, who helped develop the efficiency gap measure, wondered if the court should hire a trusted staff of social scientists to help the justices parse empirical arguments. Levinson, the Texas professor, felt that the problem was a lack of rigorous empirical training at most elite law schools, so the long-term solution would be a change in curriculum. Enos and his coauthors proposed “that courts alter their norms and standards regarding the consideration of statistical evidence”; judges are free to ignore statistical evidence, so perhaps nothing will change unless they take this category of evidence more seriously.

But maybe this allergy to statistical evidence is really a smoke screen — a convenient way to make a decision based on ideology while couching it in terms of practicality.

“I don’t put much stock in the claim that the Supreme Court is afraid of adjudicating partisan gerrymanders because it’s afraid of math,” Daniel Hemel, who teaches law at the University of Chicago, told me. “[Roberts] is very smart and so are the judges who would be adjudicating partisan gerrymandering claims — I’m sure he and they could wrap their minds around the math. The ‘gobbledygook’ argument seems to be masking whatever his real objection might be.”

If there is indeed innumeracy present, the justices would not be alone in this. Many Americans do not receive an education in statistics, let alone have enough training to make sense of the statistics regularly used in academic studies.

At the same time, we might go further than the argument made above: should judges make decisions based on statistics (roughly facts) more than ideology or arguments (roughly interpretation)? Again, many Americans struggle with this: there can be broad empirical patterns or even correlations but some would insist that their own personal experiences do not match these. Should judicial decisions be guided by principles and existing case law or by current statistical realities? The courts are not the only social spheres that struggle with this.
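For readers curious about the efficiency gap measure McGhee helped develop, here is a minimal sketch of its standard wasted-votes definition; the district totals below are invented for illustration, and the code is mine rather than anything from the original articles.

```python
def efficiency_gap(districts):
    """districts: list of (votes_A, votes_B) tuples, one per district.

    Wasted votes: every vote cast for a district's loser, plus the winner's
    votes beyond the bare majority needed to win that district. A larger
    absolute value indicates a larger partisan advantage; the sign shows
    which party benefits.
    """
    wasted_a = wasted_b = total = 0
    for a, b in districts:
        total += a + b
        needed = (a + b) / 2  # bare majority threshold (the usual simplification)
        if a > b:
            wasted_a += a - needed
            wasted_b += b
        else:
            wasted_a += a
            wasted_b += b - needed
    return (wasted_a - wasted_b) / total

# A toy five-district state:
print(efficiency_gap([(55, 45), (55, 45), (55, 45), (20, 80), (20, 80)]))
```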

Using a GRIM method to find unlikely published results

Discovering which published studies may be incorrect or fraudulent takes some work, and here is a newer tool for the job: GRIM.

GRIM is the acronym for Granularity-Related Inconsistency of Means, a mathematical method that determines whether an average reported in a scientific paper is consistent with the reported sample size and number of items. Here’s a less-technical answer: GRIM is a B.S. detector. The method is based on the simple insight that only certain averages are possible given certain sets of numbers. So if a researcher reports an average that isn’t possible, given the relevant data, then that researcher either (a) made a mistake or (b) is making things up.

GRIM is the brainchild of Nick Brown and James Heathers, who published a paper last year in Social Psychological and Personality Science explaining the method. Using GRIM, they examined 260 psychology papers that appeared in well-regarded journals and found that, of the ones that provided enough necessary data to check, half contained at least one mathematical inconsistency. One in five had multiple inconsistencies. The majority of those, Brown points out, are “honest errors or slightly sloppy reporting.”…

After spotting the Wansink post, Anaya took the numbers in the papers and — to coin a verb — GRIMMED them. The program found that the four papers based on the Italian buffet data were shot through with impossible math. If GRIM was an actual machine, rather than a humble piece of code, its alarms would have been blaring. “This lights up like a Christmas tree,” Brown said after highlighting on his computer screen the errors Anaya had identified…

Anaya, along with Brown and Tim van der Zee, a graduate student at Leiden University, also in the Netherlands, wrote a paper pointing out the 150 or so GRIM inconsistencies in those four Italian-restaurant papers that Wansink co-authored. They found discrepancies between the papers, even though they’re obviously drawn from the same dataset, and discrepancies within the individual papers. It didn’t look good. They drafted the paper using Twitter direct messages and titled it, memorably, “Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab.”
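The arithmetic behind GRIM is simple enough to sketch. Here is a hedged Python illustration assuming a single integer-valued item (say, responses on a 1-7 scale) and a mean reported to two decimal places; the function and the example numbers are mine, not Brown and Heathers’ code.

```python
def grim_consistent(reported_mean, n, decimals=2):
    """GRIM check: with n integer-valued responses, the true mean must equal
    some whole-number total divided by n. Test whether any such fraction
    rounds to the reported mean at the reported precision."""
    target = round(reported_mean, decimals)
    nearest_total = round(reported_mean * n)
    # Only integer totals near reported_mean * n could possibly round to the target.
    return any(round(k / n, decimals) == target
               for k in (nearest_total - 1, nearest_total, nearest_total + 1))

print(grim_consistent(5.19, 28))  # False: no whole-number total over 28 people yields 5.19
print(grim_consistent(5.18, 28))  # True: a total of 145 gives 145 / 28 = 5.178..., which rounds to 5.18
```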

I wonder how long it will be before journals employ such methods on submitted manuscripts. Imagine Turnitin for academic studies. And then what would happen to authors if problems were found?

It also sounds as if a program like this could make it easy to run mass analyses of published studies to help answer questions such as how many findings are fraudulent.

Perhaps it is too easy to ask whether GRIM itself has been vetted by outside parties…

The most important annual statistical moment in America: the start of March Madness

When do statistics matter the most for the average American? The week of the opening weekend of March Madness – the period from the revealing of the 68-team field to the final games of the Round of 32 – may just be that point. All the numbers are hard to resist: win-loss records; various other metrics of team performance (strength of schedule, RPI, systems attached to particular analysts, advanced basketball statistics, etc.); comparisons of seed numbers and their historic performance; seeing who the rest of America has picked (see the percentages for the millions of brackets at ESPN); and betting lines and pools.

Considering the suggestions that Americans are fairly innumerate, perhaps this would be a good period for public statistics education. How does one sift through all these numbers, thinking about how they are measured and making decisions based on the figures? Sadly, I usually teach Statistics in the fall so I can’t put any of my own ideas into practice…

When software – like Excel – hampers scientific research

Statistical software can be very helpful but it does not automatically guarantee correct analyses:

A team of Australian researchers analyzed nearly 3,600 genetics papers published in a number of leading scientific journals — like Nature, Science and PLoS One. As is common practice in the field, these papers all came with supplementary files containing lists of genes used in the research.

The Australian researchers found that roughly 1 in 5 of these papers included errors in their gene lists that were due to Excel automatically converting gene names to things like calendar dates or random numbers…

Genetics isn’t the only field where a life’s work can potentially be undermined by a spreadsheet error. Harvard economists Carmen Reinhart and Kenneth Rogoff famously made an Excel goof — omitting a few rows of data from a calculation — that caused them to drastically overstate the negative GDP impact of high debt burdens. Researchers in other fields occasionally have to issue retractions after finding Excel errors as well…

For the time being, the only fix for the issue is for researchers and journal editors to remain vigilant when working with their data files. Even better, they could abandon Excel completely in favor of programs and languages that were built for statistical research, like R and Python.

Excel has particular autoformatting issues, but all statistical programs have their own ways of handling data. Spreadsheets of data – often formatted with cases in the rows and variables in the columns – don’t automatically read in correctly.
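One defensive habit, sketched here in Python with pandas (the file and column names are hypothetical), is to read everything in as text and then convert the genuinely numeric columns explicitly:

```python
import pandas as pd

# Read every column as text so nothing is silently reinterpreted
# (the Excel version of this problem turns the gene name SEPT2 into a date).
genes = pd.read_csv("gene_list.csv", dtype=str, keep_default_na=False)

# Convert only the columns that really are numeric, and fail loudly if they aren't.
genes["expression"] = pd.to_numeric(genes["expression"], errors="raise")
```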

Additionally, user error can lead to issues with any sort of statistical software. Different programs may have different quirks, but researchers can do all sorts of weird things, from recoding incorrectly to misreading missing data to misinterpreting results. Data doesn’t analyze itself, and statistical software is just a tool that needs to be used correctly.

A number of researchers have in recent years called for open data once a paper is published, and this could help those in an academic field spot mistakes. Of course, the best solution is to double-check (at least) data before review and publication. Yet, when you are buried in a quantitative project with dozens of steps of data work and analysis, it can be hard to (1) keep track of everything and (2) closely watch for errors. Perhaps we need independent data review even before publication.

Scientists have difficulty explaining p-values

Scientists regularly use p-values to evaluate their findings but apparently have difficulty explaining exactly what they mean:

To be clear, everyone I spoke with at METRICS could tell me the technical definition of a p-value — the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct — but almost no one could translate that into something easy to understand.

It’s not their fault, said Steven Goodman, co-director of METRICS. Even after spending his “entire career” thinking about p-values, he said he could tell me the definition, “but I cannot tell you what it means, and almost nobody can.” Scientists regularly get it wrong, and so do most textbooks, he said. When Goodman speaks to large audiences of scientists, he often presents correct and incorrect definitions of the p-value, and they “very confidently” raise their hand for the wrong answer. “Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does,” Goodman said.

We want to know if results are right, but a p-value doesn’t measure that. It can’t tell you the magnitude of an effect, the strength of the evidence or the probability that the finding was the result of chance.

So what information can you glean from a p-value? The most straightforward explanation I found came from Stuart Buck, vice president of research integrity at the Laura and John Arnold Foundation. Imagine, he said, that you have a coin that you suspect is weighted toward heads. (Your null hypothesis is then that the coin is fair.) You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin was fair. That’s it — nothing more. And that’s about as simple as I can make it, which means I’ve probably oversimplified it and will soon receive exasperated messages from statisticians telling me so.
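Buck’s coin example can be turned into a few lines of Python. The count of 60 heads is my own invented number, since the quote does not give one.

```python
from scipy.stats import binom

n, heads = 100, 60  # 100 flips of a coin we suspect favors heads

# Under the null hypothesis of a fair coin, the p-value is the probability
# of seeing at least this many heads just by chance.
p_value = binom.sf(heads - 1, n, 0.5)  # P(X >= 60 | fair coin)
print(round(p_value, 4))               # roughly 0.028
```

The p-value says nothing about how likely the coin is to be fair; it only describes how surprising the data would be if the coin were fair.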

Complicated but necessary? This can lead to fun situations when teaching statistics: students need to know enough to do the statistical work and evaluate findings (we at least need to know what to do with a calculated p-value, even if we don’t quite understand what it means), but explaining the full complexity of some of these techniques wouldn’t necessarily help the learning process. In fact, the more you learn about statistics, the more you tend to find that the various methods and techniques have limitations even as they help us better understand phenomena.