You know you are thinking like a sociologist or a statistician when…

Last week, I received a phone call from the news editor of the campus newspaper regarding a story: this year’s freshman class is 52% male, a change from recent years where it tended to be 51/49, 52/48 female. (For those who don’t know: Wheaton College tries to have an even gender ratio.) I was asked, “how would this affect the freshman class?”

My first thought was to check how much of a percentage change this was from previous years. Having a freshman class that is over 50% male might be a symbolic change, but a 3% difference between last year and this year is less important than a 5 or 7 or 10% difference from the previous year. Thinking about the possible story in this way takes the shock value out of the percentages and puts them in a more proper context.

Second, in absolute numbers, how many more males does this mean are in the freshman class? Since it is likely a small percentage change, this is perhaps a shift of 10-20 people, not a huge number among roughly 600 freshmen. Even if the next three freshman classes had these same percentage distributions, this is only a shift of roughly 40-80 males throughout the entire college of about 2,400 undergraduates.
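Here is a minimal back-of-the-envelope sketch of that second step. It assumes a class of about 600 and a swing of two or three percentage points; the exact swing depends on the actual enrollments in previous years, so treat the numbers as illustrative.

```python
# Rough check: how many students does a percentage-point swing represent?
# The class size of ~600 comes from the post; the 2-3 point swing is an
# approximation of the reported change.
class_size = 600
students_per_point = class_size / 100   # about 6 students per percentage point

for points in (2, 3):
    print(f"{points}-point swing: ~{points * students_per_point:.0f} more males in one class")

# Repeated over four classes of similar size (~2,400 students total):
print(f"Across four classes: roughly {4 * 2 * students_per_point:.0f} to "
      f"{4 * 3 * students_per_point:.0f} more males")
```

Each percentage point is only about six students, which is why the shift looks far less dramatic in absolute numbers than in the headline percentages.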

While this might make a good example of thinking statistically for my Statistics class, there could be broader implications about who I now am as a person…

On reporting on statistics

Felix Salmon has a great post about the journalistic use of statistics, and it’s well worth the read.  Here’s his summary, complete with thoughtful reminders:

Before you start quoting statistics, then, it’s always worth (a) knowing where exactly they come from; (b) verifying them independently if you were fed them by some pressure group; and (c) making sure that they say what you say that they say. Otherwise, you just end up looking credulous and silly.

Predicting and preventing burglaries through statistical models in Indio, California

In January 2011, I wrote about how Santa Clara, California was going to use statistical models to predict where crime would take place and then deploy police accordingly. Another California community, Indio, is going down a similar route to reduce burglaries:

The Indio Police Department with the help of a college professor and a wealth of data and analysis is working on just that — predicting where certain burglaries will occur. The goal is to stop them from happening through effective deployment or preventative measures…

The police department began the Smart Policing Initiative a year ago with $220,617 in federal funding from the U.S. Department of Justice…

Robert Nash Parker, a professor of sociology at the University of California, Riverside and an expert on crime, is working with Indio.

On Friday, he shared his methodology for tracking truancy and burglary rates.

He used data from the police department, school district, U.S. Census Bureau and probation departments to create a model that can be used to predict such daytime burglaries.

Nash said that based on the data, truancy seems to lead to burglary hot spots.
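The article does not describe the model itself. As a minimal sketch of the general idea, here is what an area-level relationship between truancy and daytime burglaries might look like; the data and variable names below are invented purely for illustration, and the real model presumably uses the census and probation variables plus a more sophisticated spatial design.

```python
import numpy as np

# Hypothetical neighborhood-level data: truancy rates (%) and daytime
# burglary counts per quarter. All numbers are made up for illustration.
truancy_rate = np.array([2.1, 4.5, 3.0, 6.8, 1.2, 5.5])
burglaries = np.array([8, 19, 12, 27, 5, 22])

# Fit a simple linear model: burglaries ≈ slope * truancy_rate + intercept
slope, intercept = np.polyfit(truancy_rate, burglaries, 1)
print(f"Estimated slope: {slope:.2f} additional burglaries per point of truancy")

# The kind of output that could guide deployment: predicted burglary load
# for a neighborhood with a 7% truancy rate.
print(f"Predicted burglaries at 7% truancy: {slope * 7 + intercept:.1f}")
```

The flagged “hot spots” would come from fitted relationships like this one, which is also why the issues below about changing patterns and data requirements matter.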

A few issues come to mind:

1. Could criminals simply change up their patterns once they know about this program?

2. Do approaches like this simply treat the symptoms rather than the larger issues, in this case, truancy? It is a good thing to prevent crimes or arrest people quickly, but what about working to limit the potential for crime in the first place?

3. I wonder how much data is required for this to work and how responsive it is to changes in the data.

4. Since this is being funded by a federal agency, can we expect larger roll-outs in the future? Think of this approach versus that of a big city like Chicago where there has been a greater emphasis on the use of cameras.

The problem with using averages as illustrated by the average salaries of NBA players

In negotiations between NBA owners and players, the topic of the “average player salary” has come up. This discussion illustrates some of the issues involved with using averages and medians:

Here is the “average player salary” for each of the major U.S. professional team sports, based on a variety of sources using the most recent data available:

NBA: $5.15 million (2010-11)

MLB: $3.34 million (2010)

NHL: $2.4 million (2010-11)

NFL: $1.9 million (2010)

From the public’s view, these numbers are high in all four sports. But players and agents argue that these averages obscure important distinctions including the value of certain positions over others (the quarterback in the NFL versus the punter) and the size of the roster (fewer NBA players, more NFL players).

One common solution to problems with averages is to instead use a median. Here is how this might change the discussion in the NBA:

“It’s the median salary that’s more important,” NBA agent Bill Duffy said. “Look at the Miami Heat as an analogy here: You’ve got three guys making $17 million and probably six guys making $1.2 [million]. So that’s a little misguided, that average salary.”…

It is not unlike, Duffy said, news stories that cite the “average” U.S. household income as opposed to the median. The latter figure, according to the most recent U.S. census, was $50,233. If you were to average in the dollar amounts pulled down by Wall Street bankers, Ivy League lawyers, certain public-union employees and yes, professional athletes, that number would jump considerably.

Curiously, neither the NBA nor the NBPA seems to make much use of a median player salary.

“We use [average] because it’s the most commonly used measure and best reflects the amount of compensation that the NBA provides to players across the league,” an NBA spokesman said this week. “In addition, it’s the measure that both we and the union agreed upon in the CBA.”

In the NFL, the median salary is approximately $770,000 — about 40 percent of the average.

In the NBA, using USA Today salary figures for the 2009-10 season, the estimated median salary was about $2.33 million. That’s still about 46 times what the median U.S. household earns, but it is less than half what the max-salary-bloated “average” is.

What happens in these sports is this: a small number of star athletes make huge amounts of money, pulling the average for all athletes up. If you use the median instead, where half of the players make more and half make less, it becomes clear that most athletes in each sport make considerably less than the average suggests. Particularly in the NFL, which has bigger rosters, the gap between the average and the median shows that many players make very little.
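The agent’s Miami Heat example above makes for a quick illustration. Treating his nine quoted salaries as if they were the whole roster is a simplification, but it shows how a few large contracts pull the mean away from the median:

```python
from statistics import mean, median

# Salaries from the agent's example (in millions of dollars):
# three players at $17M and six at $1.2M, treated here as the full roster.
salaries = [17, 17, 17, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2]

print(f"Mean salary:   ${mean(salaries):.2f}M")    # pulled up by the three stars
print(f"Median salary: ${median(salaries):.2f}M")  # what the middle player makes
```

The mean comes out around $6.5 million even though two-thirds of this hypothetical roster makes $1.2 million, which is exactly the distinction the agent is pointing to.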

It is interesting that the NBA spokesman said the two sides had agreed in their Collective Bargaining Agreement to use the average salary figure. Was this really a point of contention in negotiations, or did no one really think about the consequences? What was the thinking behind this for the players? If the union was focused on helping all of its members, perhaps it would emphasize the median, suggesting that the union is strongest when all of its members are well taken care of. This lower figure might also look more palatable to the public, though it is unclear whether public perceptions have any influence on such negotiations. However, if the union was more interested in making sure that individual athletes could receive the biggest possible payouts for their athletic exploits, then perhaps the average is better.

Two takeaway points:

1. Averages and medians are both measures of central tendency, but they are open to different interpretations. People need to be clear about which measure they are using and which interpretation that measure supports.

2. It will be interesting to see if the new CBA is based on average or median salaries.

Sort this out: poll of 39 economists suggests “30% chance of recession”

Polling economists about whether the country is headed for a recession does not seem to be the best way to make predictions:

The 39 economists polled Aug. 3-11 put the chance of another downturn at 30% — twice as high as three months ago, according to their median estimates. That means another shock to the fragile economy — such as more stock market declines or a worsening of the European debt crisis — could push the nation over the edge.

Yet even if the USA avoids a recession, as economists still expect, they see economic growth muddling along at about 2.5% the next year, down from 3.1% in April’s survey. The economy must grow well above 3% to significantly cut unemployment…

The gloomier forecast is a stunning reversal. Just weeks ago, economists were calling for a strong rebound in the second half of the year, based on falling gasoline prices giving consumers more to spend on other things and car sales taking off as auto supply disruptions after Japan’s earthquake faded. In fact, July retail sales showed their best gain in four months.

But that was before European debt woes spread, the government cut its growth estimates for the first half of 2011 to less than 1%, and Standard & Poor’s lowered the USA’s credit rating after the showdown over the debt ceiling.

Here is what I find strange about this:

1. The headline meant to grab our attention focuses on the 30% statistic. Is this a good or bad figure? It is less than 50% (meaning a recession is still less likely than not), but it is also double the prediction from three months ago. Based on a 3 in 10 chance of a recession, how would the country and individuals change their actions?

2. This comes from a poll of 39 economists. One, this isn’t that many. Two, how do we know that these economists know what they are talking about? How successful have their predictions been in the past? I see the advantages of “crowd-sourcing,” consulting a number of estimates to get an aggregate figure (see the sketch after this list), but the sample could be larger and we don’t know whether these economists will be right. (Even if they are not right, perhaps it gives us some indication of what “leading economists” think, and this could matter as well.)

3. How much of this is based on real data versus perceptions of the economy? The article suggests this is a “stunning reversal” of earlier predictions and then cites some data that seems to be worse. These figures don’t determine everything. I wonder what it would take for economists to predict a recession – which numbers would have to be worse and how bad would they have to get?

4. Will anyone ever come back and look at whether these economists got it right?
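On the aggregation point in item 2: the 30% figure is reported as the panel’s median estimate. Here is a minimal sketch of how such an aggregate behaves, with made-up individual forecasts, since the survey’s raw responses are not given in the article:

```python
from statistics import mean, median

# Hypothetical recession-probability estimates from a small panel of
# forecasters. These numbers are invented for illustration only.
estimates = [0.15, 0.20, 0.25, 0.30, 0.30, 0.30, 0.35, 0.40, 0.45, 0.60]

print(f"Median estimate: {median(estimates):.0%}")
print(f"Mean estimate:   {mean(estimates):.0%}")

# The median barely moves even if one forecaster becomes far more
# pessimistic, one reason it is the usual summary for small, noisy panels.
estimates[-1] = 0.95
print(f"Median after one extreme revision: {median(estimates):.0%}")
```

With only a few dozen respondents, though, the aggregate still tells us more about what this particular panel thinks than about the actual probability of a recession.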

In the end, I’m not sure this really tells us anything. I suspect it is these sorts of statistics and headlines that push people to throw up their hands altogether about statistics.

Would having a math PhD really help you win the lottery?

A journalist suggests that one woman who won four multi-million dollar lottery payouts was able to do so because she had a mathematics PhD:

First, [Joan Ginther] won $5.4 million, then a decade later, she won $2 million, then two years later $3 million and finally, in the spring of 2008, she hit a $10 million jackpot.

The odds of this have been calculated at one in eighteen septillion, and luck like this could only come along once every quadrillion years.

Harper’s reporter Nathanial Rich recently wrote an article about Ms Ginther, which questioned the validity of the ‘luck’ to which she attributes her multiple lottery wins.

First, he points out, Ms Ginther is a former math professor with a PhD from Stanford University specialising in statistics.

A professor at the Institute for the Study of Gambling & Commercial Gaming at the University of Nevada, Reno, told Mr Rich: ‘When something this unlikely happens in a casino, you arrest ‘em first and ask questions later.’…

Three of her wins, all in two-year intervals, were by scratch-off tickets bought at the same mini mart in the town of Bishop.

Mr Rich proceeds to detail the myriad ways in which Ms Ginther could have gamed the system – including the fact that she may have figured out the algorithm that determines where a winner is placed in each run of scratch-off tickets.

He believes that after Ms Ginther figured out the algorithm, it wouldn’t be too difficult to then determine where the tickets would be shipped, as the shipping schedule is apparently fixed, and there were a few sources she could have found it out from.

At first glance, the story does seem unlikely: four wins and three from scratch-off tickets from the same retail location. But here are three reasons to doubt the claim that this woman beat the system:

1. If lottery algorithms could be figured out by the public, wouldn’t other people have figured this out as well? A math PhD sounds suspicious, but other smart people could crack the system if it could be cracked at all. Additionally, couldn’t this woman have won more than four times if she had it all figured out?

2. Just because someone won the lottery four times does not mean that something underhanded happened. Events like winning the lottery or being struck by lightning may be “random,” but that does not mean people cannot win multiple times, especially when millions of people play heavily for years. Aren’t there plenty of other multiple lottery winners? (See the rough sketch after this list.)

3. The quote from the professor is interesting: be suspicious first and then figure out what is happening. This is the view from the business end. If someone is gambling and consistently winning your money, you might respond. For example, this book about card-counting MIT students is fascinating (much better than the movie based on the book) not only for how the students figured out how to count cards but also because of the response of the casinos. (My favorite part – and I think I am remembering this correctly: the students leave Las Vegas because they are raising suspicions with their winnings. But they eventually find that their names and photos have been sent to casinos around the country. It gets to the point where they are escorted out of a casino just moments after entering.) But it sounds like the Texas Lottery Commission doesn’t think anything is wrong. Shouldn’t they be the ones who care the most?
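On point 2, the relevant probability is not “what are the odds that this particular woman wins four times?” but “what are the odds that somebody, among all the heavy scratch-off players in the country, racks up several wins?” Here is a rough sketch of that distinction; every parameter below is an assumption chosen only for illustration, not an actual Texas Lottery figure.

```python
from math import comb

# P(a given heavy player wins a big prize at least k times), then
# P(at least one such player exists in a large pool). All parameters are
# illustrative assumptions, not real lottery odds.
p_win = 1e-6              # assumed chance of a multi-million prize per ticket
tickets = 50_000          # tickets bought by one extremely heavy player
n_players = 1_000_000     # assumed number of heavy players nationwide
k = 4                     # number of big wins

# Binomial tail for one player (terms beyond k + 20 are negligible).
p_single = sum(
    comb(tickets, i) * p_win**i * (1 - p_win)**(tickets - i)
    for i in range(k, k + 20)
)
p_anyone = 1 - (1 - p_single) ** n_players

print(f"P(a given heavy player wins {k}+ times): {p_single:.1e}")
print(f"P(at least one of {n_players:,} heavy players does): {p_anyone:.1%}")
```

Under these particular assumptions the population-level chance is not astronomically small at all; more realistic parameters could push it in either direction, which is why the one-in-eighteen-septillion figure for a single person does not settle the question by itself.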

If you read the original story, Ginther’s buying habits do sound strange. But I still think the reporter needs to find more evidence before Ginther can be accused with any certainty.

Example of problems with statistics: “nearly 1,500 millionaires” (out of more than 235,000) “paid no federal taxes”

Statistics can be used well and they can be used not so well. Here is an example where the headline statistic suggests something different from the rest of the story:

Of an already small pool of millionaires and billionaires, 1,470 didn’t pay any federal income taxes in 2009, according to the Internal Revenue Service.

Just over 0.1% of taxpayers — or 8,274 out of 140 million total — made more than $10 million in 2009, according to the agency. More than 235,000 taxpayers earned $1 million or more, according to a recent report from the agency.

But of the high earners who avoided paying income taxes, many did so due to heavy charity donations or foreign investments.

About 46% of all American households won’t pay federal income tax in 2011, many due to low income, tax credits for child care and exemptions, according to the nonpartisan Tax Policy Center.

The headline makes it sound like there are a lot of millionaires who are avoiding paying taxes. The actual percentage hinted at in the story suggests something else: less than 0.63% of all millionaires (1,470/235,000, or less than 1 in 100) paid no taxes. In the midst of a political debate about whether to raise taxes for the wealthy in America, each side could grab on to factual yet different figures: the 1,500 figure sounds high, as if the country is missing out on a lot of money, while the 0.63% figure suggests almost all millionaires pay some taxes. It wouldn’t take much to include both figures, the raw count and the percentage, in the story.
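A quick calculation shows how the two framings come out of the same reported numbers:

```python
# Same reported figures, two framings.
no_tax_millionaires = 1_470
total_millionaires = 235_000

share = no_tax_millionaires / total_millionaires
print(f"Count framing: {no_tax_millionaires:,} millionaires paid no federal income tax")
print(f"Rate framing:  {share:.2%} of millionaires did, i.e., {1 - share:.1%} paid at least some")
```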

Examples like this help contribute to the reaction some people have when they see statistics in the media: how can I trust any of them if they will just use the figures that suit them? All statistics become suspect and it is then hard to get a handle on what is going on in the world.

Evangelicals and their propensity to think that everyone is against them

Sociologist Bradley Wright draws attention to an issue among evangelicals: a common belief that fellow Americans do not like them:

Similarly, somewhere along the line we evangelical Christians have gotten it into our heads that our neighbors, peers, and most Americans don’t like us, and that they like us less every year. I’ve heard this idea stated in sermons and everyday conversation; I’ve read it in books and articles.

There’s a problem, though. It doesn’t appear to be true. Social scientists have repeatedly surveyed views of various religions and movements, and Americans consistently hold evangelical Christians in reasonably high regard. Furthermore, social science research indicates that it’s almost certain that our erroneous belief that others dislike us is actually harming our faith.

The statistics Wright presents suggest evangelicals are somewhere in the middle of favorability among different religious groups. For example, a 2008 Gallup survey suggests Methodists, Jews, Baptists, and Catholics are viewed more favorably than evangelicals, while Fundamentalists, Mormons, Muslims, Atheists, and Scientologists are viewed less favorably.

Wright goes on to argue (as he also does in this book) that the perceptions evangelicals have might be harmful:

If American evangelicals do have an image problem, it’s not our neighbors’ image of us; it’s our image of them. The 2007 Pew Forum study found that American Christians hold more negative views of “atheists” than non-Christians do of evangelical Christians. (The most recent Pew survey found similar attitudes; see the chart above.) Now, I am not a theologian, but this seems to be a problem. We Christians are called to love people, and as I understand it, this includes loving people who believe differently than we do. I’m not sure how we can love atheists if we don’t like them.

Ultimately, evangelical Christians might do well not to spend too much time worrying about what others think of us. Christians in general, and evangelical Christians in particular (depending on how you ask the question), are well-regarded in this country. If nothing else, there’s little we can do to change other people’s opinions anyway. Telling ourselves over and over that others don’t like us is not only inaccurate, it also potentially hinders the very faith that we seek to advance.

This is an ongoing issue with several aspects:

1. There is a disconnect between the numbers and the perceptions. Wright appears to be making a prolonged effort to bring these statistics to the masses. Will this data make a difference in the long run? How many evangelicals will ever hear about these statistics?

2. There may be positive or functional aspects to continually holding the idea that others don’t like you. Subgroups can use this idea to enhance solidarity and prompt action among adherents. Of course, these alarmist tendencies might not be helpful in the long run. (See a better explanation of this perspective from Christian Smith here.)

In the end, this is useful data but there is more that could be done to explain how these perceptions are helpful or not and what could or should be done to move in a different direction. Providing people with the right data and good interpretations is a good start but then people will want to know what to do next.

More difficulty with housing vacancy data

I’ve written about this before but here is some more evidence that one should be careful in looking at housing vacancy data:

In early 2009 the Richmond, Virginia press wrote numerous articles after quarterly HVS data on metro area rental vacancy rates “showed” that the rental vacancy rate in the Richmond, Virginia metro area in the fourth quarter of 2008 was 23.7%, the highest in the country. This shocked local real estate folks, including folks who tracked rental vacancy rates in apartment buildings in the area. The Central Virginia Apartment Association, e.g., found that the rental vacancy rate based on a survey of 52 multi-family properties in the Richmond, VA metro area was around 8% — above a more “normal” 5%, but nowhere close to 23.7%. And while the HVS attempts to measure the overall rental vacancy rate (and not just MF apartments for rent), the data seemed “whacky.”

When I talked to Census folks back then, they said that their quarterly metro area vacancy rates were extremely volatile and had extremely high standard errors, and that folks should focus on annual data.

However, “annual average” data from the HVS showed MASSIVELY different rental vacancy rates in Richmond, Virginia than did the American Community Survey, which also produces estimates of the vacancy rate in the overall rental market…

There are several other MSAs where the HVS rental vacancy rates just look plain “silly.” Some Census analysts agree that the HVS MSA data aren’t reliable, and even that several state data aren’t reliable, but, well, er, the national data are probably “ok” – which they are not.

If you want to read more on the issue, there are a number of links at the bottom of the story.

If the estimates are so far off from sources generally regarded as reliable, like the American Community Survey or the decennial Census, it looks like a new system is needed for calculating the quarterly vacancy rates.
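The Census explanation about volatility and high standard errors is easy to illustrate: a vacancy rate estimated from a small quarterly sample comes with a wide margin of error. Here is a rough sketch using a simple binomial approximation; the HVS design is more complicated than this, and the sample sizes below are assumptions purely for illustration.

```python
import math

# Margin of error on an estimated vacancy rate for a few assumed sample
# sizes, using a simple binomial approximation (the actual HVS design is
# more complex than this).
observed_rate = 0.237   # the reported 23.7% Richmond figure

for n in (50, 150, 500):   # assumed rental units sampled in one metro quarter
    se = math.sqrt(observed_rate * (1 - observed_rate) / n)
    low, high = observed_rate - 1.96 * se, observed_rate + 1.96 * se
    print(f"n = {n:>3}: 95% interval roughly {low:.1%} to {high:.1%}")
```

Intervals that wide are consistent with the advice to focus on annual data, though even they would not account for a gap as large as 23.7% versus 8%, which is the blogger’s complaint.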

I wonder how much these figures could hurt a particular community. Take the case of Richmond: if data suggests the vacancy rate is the highest in the country even though it is not, is this simply bad publicity or would it actually affect decisions made by residents, businesses, and local governments?

A stable statistic since 1941: “Americans prefer boys to girls”

Amidst news that families in Asian countries are selecting boys over girls before they are born, Gallup reports that Americans also prefer boys:

Gallup has asked Americans about their preferences for a boy or a girl — using slightly different question wordings over the years — 10 times since 1941. In each instance, the results tilt toward a preference for a boy rather than a girl. The average male child-preference gap across these 10 surveys is 11 percentage points, making this year’s results (a 12-point boy-preference gap) just about average. Gallup found the largest gap in 1947 and 2000 (15 points) and the smallest in a 1990 survey (4 points).

The attitudes of American men drive the overall preference for a boy; in the current poll, conducted June 9-12, men favor a boy over a girl by a 49% to 22% margin. American women do not have a proportionate preference for girls. Instead, women show essentially no preference either way: 31% say they would prefer a boy and 33% would prefer a girl…

The degree to which Americans deliberately attempt to select the gender of their children is unclear. It is significant that 18- to 29-year-old Americans are the most likely of any age group to express a preference for a boy because most babies are born to younger adults. The impact of the differences between men and women in preferences for the sex of their babies is also potentially important. The data from the U.S. suggest that if it were up to mothers to decide the gender of their children, there would be no tilt toward boys. Potential fathers have a clear preference for boys if given a choice, but the precise amount of input males may have into a deliberate gender-selection process is unknown.

This seems to be one of those statistics that has remained remarkably constant since 1941 even though the relationships between, and perceptions of, the genders have changed. Is this statistic a sign of a lack of progress in the area of gender?

Gallup suggests several traits lead to a higher preference for boys: being male, being younger, having a lower level of education (though income doesn’t matter), and being a Republican. So why exactly do these traits lead to these preferences? Outside of being younger, one could suggest these traits add up to a “traditionalist” understanding of families in which boys are more prized.