Strong spurious correlations enhanced in appearance with mismatched dual axes

I stumbled across a fascinating website titled Spurious Correlations that looks at relationships between odd pairs of variables. Here are two examples:

According to the site, both of these pairs have correlations higher than 0.94. In other words, very strong.

One issue: using dual axes can throw things off. The bottom chart above appears to show a negative relationship – but this is only because the two axes are scaled differently. The top chart makes it look like the lines track each other closely – but the axes are way off from each other, with the left side ranging from 29 to 34 and the right side ranging from 300 to 900. Overall, the charts reinforce the impression of strong correlations between the paired variables, but using dual axes can be misleading.
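To see how much the axis choices matter, here is a minimal matplotlib sketch (the data values are made up purely for illustration) that plots two trending series once on mismatched dual axes and once on a single shared axis:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data standing in for two unrelated but steadily trending variables
years = np.arange(2000, 2010)
series_a = np.array([29.5, 30.1, 30.8, 31.2, 31.9, 32.3, 32.8, 33.1, 33.6, 34.0])  # scale ~29-34
series_b = np.array([320, 410, 450, 520, 580, 640, 700, 760, 830, 890])            # scale ~300-900

# Any two steadily trending series will correlate strongly, related or not
print("correlation:", round(np.corrcoef(series_a, series_b)[0, 1], 3))

fig, (ax_dual, ax_shared) = plt.subplots(1, 2, figsize=(10, 4))

# Mismatched dual axes: each series gets its own scale, so the lines appear locked together
ax_dual.plot(years, series_a, color="tab:blue")
ax_dual.twinx().plot(years, series_b, color="tab:red")
ax_dual.set_title("Dual axes (29-34 vs. 300-900)")

# One shared axis: the same data, but the visual impression changes completely
ax_shared.plot(years, series_a, color="tab:blue")
ax_shared.plot(years, series_b, color="tab:red")
ax_shared.set_title("Single shared axis")

plt.tight_layout()
plt.show()
```

Two series that both trend steadily in the same direction will correlate strongly whether or not they have anything to do with each other, and the dual-axis version hides just how different their scales are.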

Chicago crime stats: beware the “official” data in recent years

Chicago magazine has a fascinating look at some interesting choices made about how to classify homicides in Chicago – with the goal of reducing the murder count.

The case of Tiara Groves is not an isolated one. For this story, Chicago conducted a 12-month examination of the Chicago Police Department’s crime statistics going back several years, poring through public and internal police records and interviewing crime victims, criminologists, and police sources of various ranks. We identified 10 people, including Groves, who were beaten, burned, suffocated, or shot to death in 2013 and whose cases were reclassified as death investigations, downgraded to more minor crimes, or even closed as noncriminal incidents—all for illogical or, at best, unclear reasons…

Many officers of different ranks and from different parts of the city recounted instances in which they were asked or pressured by their superiors to reclassify their incident reports or in which their reports were changed by some invisible hand. One detective refers to the “magic ink”: the power to make a case disappear. Says another: “The rank and file don’t agree with what’s going on. The powers that be are making the changes.”

Granted, a few dozen crimes constitute a tiny percentage of the more than 300,000 reported in Chicago last year. But sources describe a practice that has become widespread at the same time that top police brass have become fixated on demonstrating improvement in Chicago’s woeful crime statistics.

And has there ever been improvement. Aside from homicides, which soared in 2012, the drop in crime since Police Superintendent Garry McCarthy arrived in May 2011 is unprecedented—and, some of his detractors say, unbelievable. Crime hasn’t just fallen, it has freefallen: across the city and across all major categories.

Two quick thoughts:

1. “Official” statistics are often taken for granted and it is assumed that they measure what they say they measure. This is not necessarily the case. All statistics have to be operationalized, taken from a more conceptual form into something that can be measured. Murder seems fairly clear-cut but as the article notes, there is room for different people to classify things differently.

2. Fiddling with the statistics is not right but, at the same time, we should consider the circumstances within which this takes place. Why exactly does the murder count – the number itself – matter so much? Are we more concerned about the numbers or the people and communities involved? How happy should we be that the number of murders was once over 500 and now is closer to 400? Numerous parties mentioned in this article want to see progress: aldermen, the mayor, the police chief, the media, the general public. Is progress simply reducing the crime rate or rebuilding neighborhoods? In other words, we might consider whether the absence of major crimes is the best end goal here.

A call for better statistics to better distinguish between competitive gamers

Here is a call for more statistics in gaming, which would help in understanding the techniques of competitive gamers and the differences between them:

Some people even believe that competitive gaming can get more out of stats than any conventional sport can. After all, what kind of competition is more quantifiable than one that’s run not on a field or on a wooden floor but on a computer? What kind of sport should be able to be more defined by stats than eSports?

“The dream is the end of bullshit,” says David Joerg, owner of the StarCraft statistics website GGTracker. “eSports is the one place where everything the player has done is recorded by the computer. It’s possible—and only possible in eSports—where we can have serious competition and know everything that’s going on in the game. It’s the only place where you can have an end to the bullshit that surrounds every other sport. You could have bullshit-free analysis. You’d have better conversations, better players, and better games. There’s a lot of details needed to get there, but the dream is possible.”…

“There are some stats in every video game that are directly visible to the player, like kill/death,” GGTracker’s Joerg said. “Everyone will use it because it’s right in front of their face, and then people will say that stat doesn’t tell the whole story. So then a brave soul will try to invent a stat that’s a better representation of a player’s value, but that leads to a huge uphill battle trying to get people to use it correctly and recognize its importance.”…

You could make the argument that a sport isn’t a sport until it has numbers backing it up. Until someone can point to a series of statistics that clearly designate a player’s superiority, there will always be doubters. If that’s true, then it’s true for eSports as much as it was for baseball, football and any other sport when it was young. For gaming, those metrics remain hidden in the computers running StarCraft, League of Legends, Call of Duty and any other game being played in high-stakes tournaments. Slowly, though, we’re starting to discover how competitive gaming truly works. We’re starting to find the numbers that tell the story. That’s exciting.

This is a two-part problem:

1. Developing good statistics, based on important actions within a game, that have predictive ability.

2. Getting the community of gamers to agree that these statistics are relevant and can be helpful to the community.

Both are complex problems in their own right and this will likely take some time. Gaming’s most basic statistic – who won – is relatively easy to determine but the numbers behind that winning and losing are less clear.
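Just to make the first problem concrete, here is a toy sketch; the match records and the “composite” stat are entirely made up, and a real attempt would need thousands of games and a proper test of predictive power:

```python
# Hypothetical match records; the fields and the candidate stat are invented for illustration
matches = [
    {"kills": 12, "deaths": 4, "objectives": 3, "won": True},
    {"kills": 9,  "deaths": 8, "objectives": 0, "won": False},
    {"kills": 6,  "deaths": 5, "objectives": 4, "won": True},
    {"kills": 11, "deaths": 6, "objectives": 1, "won": False},
]

def kill_death_ratio(m):
    """The obvious stat that is 'right in front of their face.'"""
    return m["kills"] / max(m["deaths"], 1)

def composite_value(m):
    """A made-up 'player value' stat that also credits objective play."""
    return m["kills"] + 2 * m["objectives"] - m["deaths"]

# Does either candidate stat separate wins from losses?
for name, stat in [("kill/death", kill_death_ratio), ("composite", composite_value)]:
    wins = [stat(m) for m in matches if m["won"]]
    losses = [stat(m) for m in matches if not m["won"]]
    print(f"{name:10s} mean in wins: {sum(wins)/len(wins):5.2f}   "
          f"mean in losses: {sum(losses)/len(losses):5.2f}")
```

Computing such numbers is the easy half; the second problem, getting a community to treat any particular stat as meaningful, is not something code can solve.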

The difficulty in wording survey questions about American education

Emily Richmond points out some of the difficulties in creating and interpreting surveys regarding public opinion on American education:

As for the PDK/Gallup poll, no one recognizes the importance of a question’s wording better than Bill Bushaw, executive director of PDK. He provided me with an interesting example from the September 2009 issue of Phi Delta Kappan magazine, explaining how the organization tested a question about teacher tenure:

“Americans’ opinions about teacher tenure have much to do with how the question is asked. In the 2009 poll, we asked half of respondents if they approved or disapproved of teacher tenure, equating it to receiving a “lifetime contract.” That group of Americans overwhelmingly disapproved of teacher tenure 73% to 26%. The other half of the sample received a similar question that equated tenure to providing a formal legal review before a teacher could be terminated. In this case, the response was reversed, 66% approving of teacher tenure, 34% disapproving.”

So what’s the message here? It’s one I’ve argued before: That polls, taken in context, can provide valuable information. At the same time, journalists have to be careful when comparing prior years’ results to make sure that methodological changes haven’t influenced the findings; you can see how that played out in last year’s MetLife teacher poll. And it’s a good idea to use caution when comparing findings among different polls, even when the questions, at least on the surface, seem similar.

Surveys don’t write themselves, nor is the interpretation of the results necessarily straightforward. Change the wording or the order of the questions and the results can change. I like the link to the list of “20 Questions A Journalist Should Ask About Poll Results” put out by the National Council on Public Polls. Our public life would be improved if journalists, pundits, and the average citizen paid attention to these questions.

Rare events may happen multiple times due to the law of truly large numbers plus the law of combinations

Rare events don’t happen all the time but they may still happen multiple times if there are lots of chances for their occurrence:

The Improbability Principle tells us that we should not be surprised by coincidences. In fact, we should expect coincidences to happen. One of the key strands of the principle is the law of truly large numbers. This law says that given enough opportunities, we should expect a specified event to happen, no matter how unlikely it may be at each opportunity. Sometimes, though, when there are really many opportunities, it can look as if there are only relatively few. This misperception leads us to grossly underestimate the probability of an event: we think something is incredibly unlikely, when it’s actually very likely, perhaps almost certain…

For another example of how a seemingly improbable event is actually quite probable, let’s look at lotteries. On September 6, 2009, the Bulgarian lottery randomly selected as the winning numbers 4, 15, 23, 24, 35, 42. There is nothing surprising about these numbers. The digits that make up the numbers are all low values—1, 2, 3, 4 or 5—but that is not so unusual. Also, there is a consecutive pair of values, 23 and 24, although this happens far more often than is generally appreciated (if you ask people to randomly choose six numbers from 1 to 49, for example, they choose consecutive pairs less often than pure chance would).

What was surprising was what happened four days later: on September 10, the Bulgarian lottery randomly selected as the winning numbers 4, 15, 23, 24, 35, 42—exactly the same numbers as the previous week. The event caused something of a media storm at the time. “This is happening for the first time in the 52-year history of the lottery. We are absolutely stunned to see such a freak coincidence, but it did happen,” a spokeswoman was quoted as saying in a September 18 Reuters article. Bulgaria’s then sports minister Svilen Neikov ordered an investigation. Could a massive fraud have been perpetrated? Had the previous numbers somehow been copied?

In fact, this rather stunning coincidence was simply another example of the Improbability Principle, in the form of the law of truly large numbers amplified by the law of combinations. First, many lotteries are conducted around the world. Second, they occur time after time, year in and year out. This rapidly adds up to a large number of opportunities for lottery numbers to repeat. And third, the law of combinations comes into effect: each time a lottery result is drawn, it could contain the same numbers as produced in any of the previous draws. In general, as with the birthday situation, if you run a lottery n times, there are n × (n − 1)/2 pairs of lottery draws that could have a matching string of numbers.

Rare events happening multiple times within a short span also tend to provoke another issue in human reasoning: we develop causal explanations for the multiple occurrences. These occurrences can still be random, but we want a clear reason why they happened. Having truly random outcomes doesn’t mean outcomes can’t be repeated, just that there is no pattern to their occurrence.
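To put rough numbers on the excerpt’s n × (n − 1)/2 point, here is a short sketch for a 6-of-49 style lottery; the counts of lotteries and draws are invented but plausible:

```python
from math import comb

# Probability that two independent 6-of-49 draws produce identical numbers
p_match = 1 / comb(49, 6)            # about 1 in 13,983,816

# Hypothetical scale, purely for illustration
draws_per_lottery = 2 * 52 * 30      # two draws a week for 30 years
n_lotteries = 100                    # lotteries running around the world

pairs_per_lottery = draws_per_lottery * (draws_per_lottery - 1) // 2   # n(n-1)/2
expected_repeats = n_lotteries * pairs_per_lottery * p_match

print(f"pairs of draws per lottery: {pairs_per_lottery:,}")
print(f"expected repeated draws across all lotteries: {expected_repeats:.1f}")
```

Even with these modest assumptions, dozens of repeated draws are expected somewhere in the world over a few decades – which is exactly the point of the law of truly large numbers.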

Confronting the problems with p-values

Nature provides an overview of concerns about how much scientists rely on the p-value, “which is neither as reliable nor as objective as most scientists assume”:

One result is an abundance of confusion about what the P value means. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is…

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails…

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking, ‘How much of an effect is there?’, not ‘Is there an effect?’”

Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”

As the article then goes on to note, alternatives haven’t quite caught on. It seems the most basic defense is one that statisticians should adopt anyhow: always recognizing the chance that their statistics could be wrong. It also highlights the need for replicating studies with different datasets to confirm results.
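One way to build in that recognition is to convert a p-value into a rough false-alarm probability. The article doesn’t spell out the calculation behind its rule of thumb, but a standard bound (the Sellke–Bayarri–Berger minimum Bayes factor, combined here with an assumed 50:50 prior) reproduces the quoted 11% and 29% figures:

```python
import math

def min_false_alarm_probability(p, prior_null=0.5):
    """
    Sellke-Bayarri-Berger style lower bound: the smallest plausible probability
    that a 'significant' result at this p-value is a false alarm, given an
    assumed prior probability that the null hypothesis is true. Valid for p < 1/e.
    """
    bayes_factor_bound = -math.e * p * math.log(p)   # minimum Bayes factor for the null
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = bayes_factor_bound * prior_odds
    return posterior_odds / (1 + posterior_odds)

# With 50:50 prior odds this matches the article's rule of thumb:
# p = 0.05 -> about 29%, p = 0.01 -> about 11%
for p in (0.05, 0.01):
    print(f"p = {p}: false-alarm probability >= {min_false_alarm_probability(p):.0%}")
```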

At a relatively basic level, if p-values are so problematic, how does this change the basic statistics courses so many undergraduates take?
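One possible answer: more simulation. The p-hacking dynamic described above can be demonstrated in a few lines by generating pure noise, testing several arbitrary “conditions,” and keeping only the best p-value (the simulation setup here is my own, not the article’s):

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_p(xs):
    """Two-sided p-value for whether the sample mean differs from zero (normal approximation)."""
    z = mean(xs) / (stdev(xs) / sqrt(len(xs)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
N_STUDIES, N_CONDITIONS, N_PER_CONDITION = 1_000, 5, 30

hacked_hits = 0
for _ in range(N_STUDIES):
    # Pure noise: there is no real effect in any condition
    p_values = [one_sample_p([random.gauss(0, 1) for _ in range(N_PER_CONDITION)])
                for _ in range(N_CONDITIONS)]
    if min(p_values) < 0.05:          # report whichever condition "worked"
        hacked_hits += 1

print(f"False-positive rate when keeping the best of {N_CONDITIONS} conditions: "
      f"{hacked_hits / N_STUDIES:.0%}")   # far above the nominal 5%
```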

Denver Broncos scoring at 3.13 standard deviations above the NFL average

Bill Barnwell puts the scoring of the 2013 Denver Broncos in statistical perspective:

That brings us to z-score (or standard score), the measure that analyzes a figure’s distance from the rest of the data set using the mean and standard deviation from the set. By comparing each team’s points scored to the league average (and calculating the standard deviation) for the points scored of each team from that given season, we can get a measure of how much better or worse it was than the average team from that season. Fortuitously, that measure also allows us to compare teams across different years and eras. It’s not perfect, since it can’t account for things like strength of schedule or whether a team let up late in games or not, but it’s a much better measure than raw points scored.

As it turns out, even after we make these adjustments, the 2013 Denver Broncos have still scored points at a higher rate through four games than anybody else since the merger. The Broncos are scoring points on a per-game basis at a rate of 3.13 standard deviations over the mean, which is unmatched over that 43-year run. No team has ever scored more frequently, relative to its peers, than the Broncos have done relative to the rest of the league in 2013.

Because these are standardized figures, it’s possible to translate each team’s scoring rate into 2013 figures and see how close it is to Denver. In this case, after we account for the different populations, a bunch of teams move closer to Denver’s throne. Chief among them is the 1991 Super Bowl–winning team from Washington, which scored 146 points through four games in a league whose teams averaged a mere 72 points through their first four tilts. Washington’s figure placed it 2.85 standard deviations above the mean and translates to 170.9 points scored in 2013, just 8.1 points behind the Broncos. Other famous teams follow: the 2000 Rams, 1992 Bills, 1996 Packers, 1981 Chargers, 2005 Giants …

And you thought standard deviations were good only for statisticians. If you know your normal distribution, you know that 3.13 standard deviations above the mean is far out in the right tail. I can only imagine how SportsCenter anchors might try to present this information…

Actually, this is quite useful for two reasons: (1) it allows us to look at the Broncos compared to the rest of the league without having to rely on the actual points scored; (2) it allows us to standardize points scored over the years so you can compare figures over a 43-year stretch. Both advantages are part of the wave of new statistical analysis taking over sports: don’t just look at the absolute value of statistics but compare them to other teams or players, and also provide statistics that allow for comparisons across time periods.
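A minimal sketch of the calculation: the 1991 Washington figures come from the excerpt above, while the standard deviations and the 2013 league numbers are assumptions chosen only so the output lands near the article’s quoted values.

```python
def z_score(points, league_mean, league_sd):
    """Standard score: how many SDs a team's scoring is above the league average."""
    return (points - league_mean) / league_sd

def translate(z, target_mean, target_sd):
    """Convert a z-score back into points in another season's scoring environment."""
    return target_mean + z * target_sd

# 1991 Washington: 146 points through four games vs. a league average of 72 (from the article).
# The standard deviations and the 2013 league figures below are assumed values picked so the
# output roughly matches the article's numbers (z ~ 2.85, ~170.9 points in 2013 terms).
washington_z = z_score(146, league_mean=72, league_sd=26)
washington_in_2013 = translate(washington_z, target_mean=88, target_sd=29)

print(f"1991 Washington z-score: {washington_z:.2f}")
print(f"Translated into 2013 scoring: {washington_in_2013:.1f} points through four games")
```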

Irresponsible to take FBI crime statistics and name a “murder capital”

News stories like this one seem to suggest that the FBI just designated Chicago the murder capital of the United States.

Move over New York, the Second City is now the murder capital of America.

According to new crime statistics released this week by the Federal Bureau of Investigation, Chicago had more homicides in 2012 than any other city in the country. There were 500 murders in Chicago last year, the FBI said, surpassing New York City, which had 419.

In 2011, there were 515 homicides in the Big Apple, compared with the 431 in Chicago.

But as the Washington Post noted, residents of Chicago and New York were much less likely to be victims of a homicide than some Michigan residents. In Flint, for example, there were 63 killings — a staggering number when you consider Flint’s population is 101,632 — “meaning 1 in every 1,613 city residents were homicide victims.” In Detroit, where 386 killings occurred in 2012, 1 in 1,832 were homicide victims.

Check out the FBI press release announcing the 2012 figures: there is no mention of a “murder capital.” In fact, the press release seems to caution against the sort of sensationalistic interpretations that are implied by “murder capital”:

Each year when Crime in the United States is published, some entities use the figures to compile rankings of cities and counties. These rough rankings provide no insight into the numerous variables that mold crime in a particular town, city, county, state, tribal area, or region. Consequently, they lead to simplistic and/or incomplete analyses that often create misleading perceptions adversely affecting communities and their residents. Valid assessments are possible only with careful study and analysis of the range of unique conditions affecting each local law enforcement jurisdiction. The data user is, therefore, cautioned against comparing statistical data of individual reporting units from cities, metropolitan areas, states, or colleges or universities solely on the basis of their population coverage or student enrollment.

To their credit, a number of these news stories include figures like those in the quoted section above: the murder rate is probably more important than the actual number of murders since populations can vary quite a bit. But, that still doesn’t stop media sources from leading with the “murder capital” idea.
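The arithmetic behind that point is simple. The homicide counts below come from the figures quoted above; the Chicago and New York populations are approximate 2012 estimates I am supplying, and Detroit’s is backed out of the quoted “1 in 1,832” figure.

```python
# Homicide counts from the FBI's 2012 figures as reported above;
# populations are approximate 2012 estimates supplied for illustration.
cities = {
    "Flint":    {"homicides": 63,  "population": 101_632},
    "Detroit":  {"homicides": 386, "population": 707_000},
    "Chicago":  {"homicides": 500, "population": 2_715_000},
    "New York": {"homicides": 419, "population": 8_337_000},
}

for city, d in cities.items():
    rate_per_100k = d["homicides"] / d["population"] * 100_000
    one_in = round(d["population"] / d["homicides"])
    print(f"{city:9s} {rate_per_100k:5.1f} per 100,000  (about 1 in {one_in:,} residents)")
```

By count, Chicago leads; by rate, it is nowhere near the top, which is exactly why the FBI cautions against these rankings.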

My conclusion: this is an example of an irresponsible approach to crime statistics. Even if murders were down everywhere, the media could still designate a “murder capital” referring to whatever city had the most murders.

Long tail: 17% of the seven-foot-tall men between ages 20 and 40 in the US play in the NBA

As part of dissecting whether Shaq can really fit in a Buick LaCrosse (I’ve asked this myself when watching the commercial), Car & Driver drops in this little statistic about men in the United States who are seven feet tall:

The population of seven-footers is infinitesimal. In 2011, Sports Illustrated estimated that there are fewer than 70 men between the ages of 20 and 40 in the United States who stand seven feet or taller. A shocking 17 percent of them play in the NBA.

In the distribution of heights in the United States, being at least seven feet tall is quite unusual and sits at the far right side of a fairly normal distribution. But being that tall increases the odds of playing in the NBA by quite a lot. As a Forbes post suggests, “Being 7 Feet Tall [may be] the Fastest Way To Get Rich in America”:

Drawing on Centers for Disease Control data, Sports Illustrated’s Pablo Torre estimated that no more than 70 American men are between the ages of 20 and 40 and at least 7 feet tall. “While the probability of, say, an American between 6’6″ and 6’8″ being an NBA player today stands at a mere 0.07%, it’s a staggering 17% for someone 7 feet or taller,” Torre writes.

(While that claim might seem like a tall tale, more than 42 U.S.-born players listed at 7 feet did debut in NBA games between 1993 and 2013. Even accounting for the typical 1-inch inflation in players’ listed heights would still mean that 15 “true” 7-footers made it to the NBA, out of Torre’s hypothetical pool of about 70 men.)…

And given the market need for players who can protect the rim, there are extra rewards for this extra height. The league’s median player last season was 6 feet 7 inches tall, and paid about $2.5 million for his service. But consider the rarified air of the 7-footer-and-up club. The average salary of those 35 NBA players: $6.1 million.

(How much does one more inch matter? The 39 players listed at 6 feet 11 inches were paid an average of $4.9 million, or about 20% less than the 7 footers.)

Standing as an outlier at the far end of the distribution seems to pay off in this case.
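As a rough check on how thin that tail is, here is a sketch using a simple normal model of adult male heights. The mean, standard deviation, and population count are assumed ballpark values (not from the articles), and a pure normal model likely understates the extreme upper tail, so treat the output as order-of-magnitude only:

```python
from statistics import NormalDist

# Assumed parameters for illustration, not taken from the articles
MEAN_HEIGHT_IN = 69.5        # approximate mean adult male height, inches
SD_HEIGHT_IN = 3.0           # approximate standard deviation, inches
MEN_AGE_20_40 = 42_000_000   # rough count of US men aged 20-40

heights = NormalDist(MEAN_HEIGHT_IN, SD_HEIGHT_IN)

p_seven_feet = 1 - heights.cdf(84)               # P(height >= 84 inches, i.e., 7 feet)
expected_seven_footers = p_seven_feet * MEN_AGE_20_40

print(f"P(7 feet or taller) ≈ {p_seven_feet:.2e}")
print(f"Expected 7-footers among US men aged 20-40 ≈ {expected_seven_footers:.0f}")
```

The model puts the count in the tens, the same order of magnitude as Torre’s estimate of fewer than 70, which is the sense in which seven-footers sit at the extreme edge of the distribution.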

Jobs available for those who can analyze big data

Now that there is plenty of big data available, companies are looking for employees to analyze the data:

By 2018, the United States might face a shortfall of about 35 percent in the number of people with advanced training in statistics and other disciplines who can help companies realize the potential of digital information generated from their own operations as well as from suppliers and customers, according to McKinsey & Co…

Workers in big data are hard to come by in the short term. A recent survey by CareerBuilder, an affiliate of Tribune Co., which also owns the Chicago Tribune, found that “jobs tied to managing and interpreting big data” were among the “hot areas for hiring” in the second half of 2013…

Dhingra pointed out that the McKinsey report, in addition to citing a shortage of 140,000 to 190,000 qualified data scientists in coming years, also said there will be a need for 1.5 million executives and support staff who understand data.

Mu Sigma’s entry-level trainee professionals go through “an intense recruitment program” that includes aptitude tests to determine who has a “quantitative bent of mind”; group discussion, to spot individuals who can present and back their views and listen to feedback; and a “synthesis” test in which a candidate is shown a video and then asked to identify the key message. If they make it through those rounds, they undergo several personal interviews, a process that includes “props and interesting puzzles and case studies.”

Once a decision scientist trainee is recruited, they go through Mu Sigma University, where they learn such skills as the basics of consulting, the “art of problem solving” and the “art of insight generation.” They also take advanced statistics and are taught about machine learning, natural language processing and visualization, along with behavioral sciences and such big data technologies as Hadoop, Mahout and Cassandra.

The numbers don’t just interpret themselves. It is amazing how much data is available these days but people are still needed to figure out what it all means. Being able to do the conceptual and software work that goes into analyzing data can go a long way…