Did all American adults shop on Thanksgiving weekend?

The Weekly Standard takes a look at some figures on Thanksgiving weekend shopping as reported by the National Retail Federation:

“A record 247 million shoppers visited stores and websites in the post-Thanksgiving Black Friday weekend this year, up 9% from 226 million last year, according to a survey by the National Retail Federation released Sunday,” the CNN report reads. The headline reads: “247 million shoppers visited stores and websites Black Friday weekend.”

This would seem to mean, according to these statistics, that basically all Americans over the age of 14 went shopping this past weekend…

That means, if you subtract those who are too young to shop, 0-14 year olds, from the total U.S. population, there are 247,518,325 people left in this country, which is almost exactly the number of people CNN reports went shopping this past weekend…

CNN’s numbers, however, include those who visited “websites.” The numbers are so loose they could even include news websites or the same person visiting multiple shopping websites.

Even if there is some double-counting in this data (and tracking across websites is difficult to do), these figures suggest a large majority of Americans went shopping after Thanksgiving. I’ve written before about the difficulty in getting 90% of Americans to agree about something but perhaps we could add the value of Black Friday shopping to the list. These figures also may add to the idea that shopping is the favorite sport of Americans.
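A quick back-of-the-envelope check makes the point (a rough sketch; only the two figures quoted above are used):

```python
# Sanity check on the Weekly Standard's arithmetic, using only the figures above.
reported_shoppers = 247_000_000        # NRF figure reported by CNN
americans_15_and_older = 247_518_325   # total U.S. population minus 0-14 year olds

implied_share = reported_shoppers / americans_15_and_older
print(f"Implied share of Americans 15 and older who shopped: {implied_share:.1%}")
# Roughly 99.8%, which is why the figure almost certainly counts visits
# (and repeat visitors) rather than unique shoppers.
```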

How statistics “change(s) the way you see the world”

This article suggests looking at some well-known statistics problems will “change the way you see the world.” Enjoy the Monty Hall problem, the birthday paradox, gambler’s ruin, Abraham Wald’s memo, and Simpson’s paradox.
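For anyone who wants to poke at one of these puzzles directly, here is a minimal simulation of the Monty Hall problem (my sketch, not code from the article): over many trials, switching doors wins about two-thirds of the time.

```python
import random

def monty_hall(switch: bool, trials: int = 100_000) -> float:
    """Simulate the Monty Hall game and return the contestant's win rate."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)        # door hiding the car
        choice = random.randrange(3)     # contestant's initial pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != choice and d != car)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print("stay:  ", monty_hall(switch=False))   # ~0.33
print("switch:", monty_hall(switch=True))    # ~0.67
```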

Here is what is missing from this article: explaining how statistics is helpful beyond these five particular cases. How would statistics help in a different situation? What is the day-to-day usefulness of statistics? I would suggest several things:

1. Statistics helps move us away from individualistic or anecdotal views of reality and toward a broader view.

2. Statistics can encourage us to ask questions about “reality” and look for data and hidden patterns.

3. Knowing about statistics can help people decipher the numbers that they see every day in news stories and advertisements. What do survey or poll results mean? What numbers can I trust?

Tackling these sorts of issues would be much better for the public than looking at five fun and odd applications of statistics. Of course, these three points may not be as interesting as five statistical brain teasers but these five cases should be used to point us to the larger issues at hand.

Report calls for more study of how “kids navigate social networks”

A new report suggests we don’t know much about how kids use social networks and thus, we need more research:

A recent report from the Joan Ganz Cooney Center, Kids Online: A new research agenda for understanding social networking forums, has identified that we don’t actually know enough about how pre-teens use online social networking. The researchers, Dr. Sarah Grime and Dr. Deborah Fields, have done a good job in helping us recognize that younger children are engaged in a range of different ways with online social networks, but that our knowledge and understanding of what that means and how it impacts on their lives is pretty much underdone. GeekDads, of course, will have thoughts about how and why our children are playing and engaging with technology and networks in the ways they do, but this doesn’t give the people who make the rules and set the policy agendas the big picture that they need.

Essentially, Kids Online is a research report that calls for more research into children’s use of social networks. But the report does demonstrate very clearly why this is required. And at the rate that technology is changing and advancing, we need to work cleverly if we are to have the type of data and analysis that we need as parents to guide our decision making around technology and our children. We are all out there trying our best to facilitate healthy, dynamic, educational and exciting experiences for our children when it comes to tech, but there are not enough people exploring what that looks like. As the report says:

“Research on Internet use in the home has consistently demonstrated that family dynamics play a crucial role in children’s and parents’ activities and experiences online. We need further research on the role of parental limits, rules, and restrictions on children’s social networking as well as how families, siblings, peers, and schools influence children’s online social networking.”

I would go further: we need more research on how people of all ages navigate social networks. This doesn’t mean just looking at what activities users participate in online, how often they update information, or how many or what kinds of friends they have. These pieces of information give an outline of social network site usage. However, we need more comprehensive views of how exactly social interaction online works, develops, and interacts in feedback loops with the offline and online worlds.

Let me give an example. Suppose an eleven-year-old joins Facebook. What happens then? Sure, they gain friends and develop a profile but how does this change and develop over the first days, weeks, and months? How does the eleven-year-old describe the process of social interaction? How do their friends, online and offline, describe this interaction? Where do they learn how to act and not act on Facebook? Do the social networks online overlap completely with offline networks and if so or if not, how does this affect the offline network? How does the eleven-year-old start seeing all social interaction differently? Does it change their interaction patterns for years to come or can they somewhat compartmentalize the Facebook experience?

This sort of research would take a lot of time and would be difficult to do with large groups. To do it well, a researcher would have two options: an ethnographic approach or to gain access to the keys to someone’s Facebook account to be able to observe everything that happens. Of course, Facebook itself could provide this information…

Businessweek: “Death of the McMansion has been greatly exaggerated”

Even in a down housing market, the size of the average new house in the United States has not dropped much. In other words, the McMansion may not be dead yet.

Who says Americans have fallen out of love with McMansions? It’s true that the housing bust shaved a few square feet off the average size of new homes in the U.S. But new single-family homes built last year were still 49 percent bigger than those built in 1973, according to Census Bureau data.  And it’s worth remembering that family sizes have shrunk over that period.

The peak size for new homes was an average of 2,521 square feet in 2007. By 2010 it was down to 2,392. That statistic fed into a slew of stories about the “new frugality.” A survey of builders conducted in December 2010 by the National Association of Home Builders predicted that the shrinkage would continue, with the average getting down to 2,152 by 2015.

But then a funny thing happened. In 2011, according to the Census Bureau, the average ticked up a bit, to 2,480 square feet.

That’s partly because mortgages were so hard to get that only the well-to-do, who buy bigger houses, were able to buy new homes in 2011, according to Stephen Melman, the director of economic services for the National Association of Home Builders. But it could also be that the “new frugality” story was somewhat oversold.

A couple of thoughts:

1. This is why it helps to wait and have two kinds of data before making definitive pronouncements: longer-term data as well as a variety of housing measures. Year-to-year figures tell us something but we should be interested in larger trends. Additionally, if houses are about the same size but there are a lot fewer being built, this tells us something as well. Sometimes, trends are hard to see while we are in them.

2. Even if the size of new houses hasn’t dropped much, it could be that these new large homes look less like McMansions. The common definition of McMansion includes several factors: a large house (perhaps in a teardown setting) that is architecturally deficient and also tied to other concepts like sprawl and overconsumption. What if more of these new large houses are green? What if they are designed by architects and built to last?

Mapping secessionist petitions by county as well as looking at gender

A sociologist and a graduate seminar took data from petitions for secession from the United States as listed on whitehouse.gov and mapped the patterns. Here is the map and some of the results:

While petitions are focused on particular states, signers can be from anywhere. In order to show where support for these secession petitions was the strongest, a graduate seminar on collecting and analyzing data from the web in the UNC Sociology Department downloaded the names and cities of each of the petition signers from the White House website, geocoded each of the locations, and plotted the results.

In total, we collected data on 862,914 signatures. Of these, we identified 304,787 unique combinations of names, places and dates, suggesting that a large number of people were signing more than one petition. Approximately 90%, or 275,731, of these individuals provided valid city locations that we could locate within a US county.

The above graphic shows the distribution of these petition signers across the US. Colors are based on the proportion of people in each county who signed, and the total number of signers is displayed when you click or hover over a county.

We also looked at the distribution of petition signers by gender. While petition signers did not list their gender, we attempted to match first names with Social Security data on the relative frequency of names by sex. Of the 302,502 respondents with gendered names, 63% had male names and 38% had female names. This 26 point gender gap is twice the size of the gender gap for voters in the 2012 Presidential election. For signatures in the last 24 hours, the gender gap has risen to 34 points.

So it looks like the petition signers are more likely to be men from red states and more rural counties. On one hand, this is not too surprising. On the other hand, it is an interesting example of combining publicly available data and looking for patterns.
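For readers curious how the gender matching might have been done, here is a minimal sketch of the general technique described in the quote: comparing first names against name-by-sex frequency counts of the sort the Social Security Administration publishes. The counts and the threshold below are made-up placeholders, not the SSA file or the UNC team's actual code.

```python
# Hypothetical illustration of inferring likely gender from first names.
# The counts below are stand-in numbers, not real SSA data.
name_counts = {
    # name: (count recorded as male, count recorded as female)
    "james":  (4_900_000, 20_000),
    "mary":   (15_000, 4_100_000),
    "jordan": (350_000, 130_000),
}

def likely_gender(first_name, threshold=0.9):
    """Return 'M' or 'F' if one sex accounts for at least `threshold`
    of recorded uses of the name; return None if unknown or ambiguous."""
    counts = name_counts.get(first_name.lower())
    if counts is None:
        return None
    male, female = counts
    total = male + female
    if male / total >= threshold:
        return "M"
    if female / total >= threshold:
        return "F"
    return None

signers = ["Mary", "James", "Jordan", "Pat"]
print([likely_gender(n) for n in signers])   # ['F', 'M', None, None]
```

Tallying the "M" and "F" results over all matched signers gives the kind of 63%/38% split reported above.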

Republicans (and Democrats) need to pay attention to data rather than just spinning a story

Conor Friedersdorf suggests conservatives clearly had their own misinformed echo chambers ahead of this week’s elections:

Before rank-and-file conservatives ask, “What went wrong?”, they should ask themselves a question every bit as important: “Why were we the last to realize that things were going wrong for us?”

Barack Obama just trounced a Republican opponent for the second time. But unlike four years ago, when most conservatives saw it coming, Tuesday’s result was, for them, an unpleasant surprise. So many on the right had predicted a Mitt Romney victory, or even a blowout — Dick Morris, George Will, and Michael Barone all predicted the GOP would break 300 electoral votes. Joe Scarborough scoffed at the notion that the election was anything other than a toss-up. Peggy Noonan insisted that those predicting an Obama victory were ignoring the world around them. Even Karl Rove, supposed political genius, missed the bulls-eye. These voices drove the coverage on Fox News, talk radio, the Drudge Report, and conservative blogs.

Those audiences were misinformed.

Outside the conservative media, the narrative was completely different. Its driving force was Nate Silver, whose performance forecasting Election ’08 gave him credibility as he daily explained why his model showed that President Obama enjoyed a very good chance of being reelected. Other experts echoed his findings. Readers of The New York Times, The Atlantic, and other “mainstream media” sites besides knew the expert predictions, which have been largely borne out. The conclusions of experts are not sacrosanct. But Silver’s expertise was always a better bet than relying on ideological hacks like Morris or the anecdotal impressions of Noonan.

But I think Friedersdorf misses the most important point here in the rest of his piece: it isn’t just about Republicans veering off into ideological territory where many Americans did not want to follow, or wasting time on inconsequential issues that did not affect many voters. The misinformation was the result of ignoring or downplaying the data that showed President Obama had a lead in the months leading up to the election. The data predictions from “The Poll Quants” were not wrong, no matter how many conservative pundits wanted to suggest otherwise.

This could lead to bigger questions about what political parties and candidates should do if the data is not in their favor in the days and weeks leading up to an election. Change course and bring up new ideas and positions? This could lead to questions about political expediency and flip-flopping. Double down on core issues? This might ignore the key things voters care about or reinforce negative impressions. Ignore the data and try to spin the story? It didn’t work this time. Push even harder in the get-out-the-vote ground game? This sounds like the most reasonable option…

Correlation and not causation: Redskins games predict results of presidential election

Big events like presidential elections tend to bring out some crazy data patterns. Here is my nomination for the oddest one of this election season: how the Washington Redskins do in their final game before the election predicts the presidential election.

Since 1940 — when the Redskins moved to D.C. — the team’s outcome in its final game before the presidential election has predicted which party would win the White House each time but once.

When the Redskins win their game before the election, the incumbent party wins the presidential vote. If the Redskins lose, the non-incumbent wins.

The only exception was in 2004, when Washington fell to Green Bay, but George W. Bush still went on to win the election over John Kerry.

This is simply a quirk of data: how the Redskins do should have little to no effect on voting in other states. This is exactly what correlation without causation is about; there may be a clear pattern but it doesn’t necessarily mean the two related facts cause each other. There may be some spurious association here, some variable that predicts both outcomes, but even that is hard to imagine. Yet, the Redskins Rule has garnered a lot of attention in recent days. Why? A few possible reasons:

1. It connects two American obsessions: presidential elections and the NFL. A sidelight: both may involve a lot of betting.

2. So much reporting has been done on the 2012 elections that this adds a more whimsical and mysterious element.

3. Humans like to find patterns, even if these patterns don’t make much sense.

What’s next, an American octopus who can predict presidential elections?
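To put a rough number on point 3 above (my illustration, not from the article): any one unrelated yes/no indicator is very unlikely to match 17 of 18 elections by chance, but sift through enough indicators and a few will match anyway. The 10,000 figure below is purely hypothetical.

```python
from math import comb

n_elections = 18    # presidential elections from 1940 through 2008
min_matches = 17    # the Redskins Rule's record: right every time but once

# Chance that a single unrelated yes/no indicator matches at least 17 of 18
# outcomes purely by luck, treating each election as a fair coin flip.
p_single = sum(comb(n_elections, k)
               for k in range(min_matches, n_elections + 1)) / 2 ** n_elections
print(f"one indicator: {p_single:.5%}")                    # about 0.007%

# But scan many unrelated indicators and some are likely to match by chance.
candidates = 10_000   # hypothetical number of indicators examined
p_any = 1 - (1 - p_single) ** candidates
print(f"any of {candidates:,} indicators: {p_any:.0%}")    # about 50%
```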

Sociologist defends statistical predictions for elections and other important information

Political polling has come under a lot of recent fire but a sociologist defends these predictions and reminds us that we rely on many such predictions:

We rely on statistical models for many decisions every single day, including, crucially: weather, medicine, and pretty much any complex system in which there’s an element of uncertainty to the outcome. In fact, these are the same methods by which scientists could tell Hurricane Sandy was about to hit the United States many days in advance…

This isn’t wizardry, this is the sound science of complex systems. Uncertainty is an integral part of it. But that uncertainty shouldn’t suggest that we don’t know anything, that we’re completely in the dark, that everything’s a toss-up.

Polls tell you the likely outcome with some uncertainty and some sources of (both known and unknown) error. Statistical models take a bunch of factors and run lots of simulations of elections by varying those outcomes according to what we know (such as other polls, structural factors like the economy, what we know about turnout, demographics, etc.) and what we can reasonably infer about the range of uncertainty (given historical precedents and our logical models). These models then produce probability distributions…

Refusing to run statistical models simply because they produce probability distributions rather than absolute certainty is irresponsible. For many important issues (climate change!), statistical models are all we have and all we can have. We still need to take them seriously and act on them (well, if you care about life on Earth as we know it, blah, blah, blah).
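As a rough illustration of the kind of simulation the quote describes, here is a minimal sketch with entirely made-up polling numbers; it is not any forecaster's actual model, just the general idea of varying poll results within their uncertainty and counting how often each outcome occurs.

```python
import random

# Hypothetical two-state race: electoral votes, polled margin for the
# incumbent (percentage points), and an assumed standard deviation of error.
states = {
    "A": (20, +2.0, 3.0),
    "B": (18, -1.0, 3.0),
}
votes_to_win = 20   # majority of the 38 electoral votes in this toy example

def simulate(trials=100_000):
    wins = 0
    for _ in range(trials):
        ev = 0
        for electoral_votes, margin, error_sd in states.values():
            # Draw a plausible "true" margin given the poll and its uncertainty.
            if random.gauss(margin, error_sd) > 0:
                ev += electoral_votes
        wins += (ev >= votes_to_win)
    return wins / trials

print(f"Simulated incumbent win probability: {simulate():.1%}")
```

The output is a probability distribution over outcomes, not a guarantee, which is exactly the point about uncertainty.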

A key point here: statistical models have uncertainty (we are making inferences about larger populations or systems from samples that we can collect) but that doesn’t necessarily mean they are flawed.

A second key point: because of what I stated above, we should expect that some statistical predictions will be wrong. But this is how science works: you tweak models, take in more information, perhaps change your data collection, perhaps use different methods of analysis, and hope to get better. While it may not be exciting, confirming what we don’t know does help us get to an outcome.

I’ve become more convinced in recent years that one of the reasons polls are not used effectively in reporting is that many in the media don’t know exactly how they work. Journalists need to be trained in how to read, interpret, and report on data. This could also be a time issue; how much time do those in the media have to pore over the details of research findings or do they simply have to scan for new findings? Scientists can pump out study after study but part of the dissemination of this information to the public requires a media that understands how scientific research and the scientific process work. This includes understanding how models are consistently refined, collecting the right data to answer the questions we want to answer, and looking at the accumulated scientific research rather than just grabbing the latest attention-getting finding.

An alternative to this idea about media statistical illiteracy is presented in the article: perhaps the media knows how polls work but likes a political horse race. This may also be true but there is a lot of reporting on statistics and data outside of political elections that also needs work.

A company offers to replicate research study findings

A company formed in 2011 is offering a new way to validate the findings of research studies:

A year-old Palo Alto, California, company, Science Exchange, announced on Tuesday its “Reproducibility Initiative,” aimed at improving the trustworthiness of published papers. Scientists who want to validate their findings will be able to apply to the initiative, which will choose a lab to redo the study and determine whether the results match.

The project sprang from the growing realization that the scientific literature – from social psychology to basic cancer biology – is riddled with false findings and erroneous conclusions, raising questions about whether such studies can be trusted. Not only are erroneous studies a waste of money, often taxpayers’, but they also can cause companies to misspend time and resources as they try to invent drugs based on false discoveries.

This addresses a larger concern about how many research studies found their results by chance alone:

Typically, scientists must show that results have only a 5 percent chance of having occurred randomly. By that measure, one in 20 studies will make a claim about reality that actually occurred by chance alone, said John Ioannidis of Stanford University, who has long criticized the profusion of false results.

With some 1.5 million scientific studies published each year, by chance alone some 75,000 are probably wrong.
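The one-in-twenty arithmetic can be seen directly in a quick simulation (a sketch of the general point, not of any particular field's studies): run enough experiments where there is truly no effect and about 5 percent will still clear the usual significance bar.

```python
import random
import statistics

def run_null_study(n_per_group=50):
    """Compare two samples drawn from the *same* distribution and report
    whether the difference in means looks 'significant' (|t| > ~1.96)."""
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    se = ((statistics.variance(a) + statistics.variance(b)) / n_per_group) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(t) > 1.96

studies = 2_000
false_positives = sum(run_null_study() for _ in range(studies))
print(f"'Significant' findings among true-null studies: {false_positives / studies:.1%}")
# Roughly 5%, even though none of the simulated studies has a real effect,
# which is the logic behind the 75,000-studies-per-year estimate above.
```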

I’m intrigued by the idea of having an independent company assess research results. This could work in conjunction with other methods of verifying research results:

1. The original researchers could run multiple studies. This works better with smaller studies but it could be difficult when the N is larger and more resources are needed.

2. Researchers could also make their data available as they publish their paper. This would allow other researchers to take a look and see if things were done correctly and if the results could be replicated.

3. The larger scientific community should endeavor to replicate studies. This is the way science is supposed to work: if someone finds something new, other researchers should adopt a similar protocol and test it with similar and new populations. Unfortunately, replicating studies is not seen as being very glamorous and it tends not to receive the same kind of press attention.

The primary focus of this article seems to be on medical research. Perhaps this is because it can affect the lives of many and involves big money. But it would be interesting to apply this to more social science studies as well.

Selecting a 4-digit PIN code is hardly random

There are 10,000 possible PIN codes that could be made with four digits (0-9), but the PINs we select are hardly random:

What he found, he says, was a “staggering lack of imagination” when it comes to selecting passwords. Nearly 11% of the 3.4 million four-digit passwords he analyzed are 1234. The second most popular PIN is 1111 (6% of passwords), followed by 0000 (2%). (Last year SplashData compiled a list of the most common numerical and word-based passwords and found that “password” and “123456” topped the list.)

Berry says that a whopping 26.83% of all passwords could be guessed by attempting just 20 combinations of four-digit numbers (see first table). “It’s amazing how predictable people are,” he says…

Many of the commonly used passwords are, of course, dates: birthdays, anniversaries, the year you were born, etc. Indeed, using a year starting with 19__ helps people remember their code, but it also increases its predictability, Berry says. His analysis shows that every single 19__ combination can be found in the top 20% of the dataset…

Somewhat intriguing was #22 on the most common password list: 2580. It seems random, but if you look at a telephone keypad (or ATM keypad) you’ll see those numbers are straight down the middle — yet another sign we’re uncreative and lazy password makers…

The least-used PIN is 8068, Berry found, with just 25 occurrences in the 3.4 million set, which equates to 0.000744%. (See the second table for the least popular passwords.) Why this set of numbers? Berry guesses, “It’s not a repeating pattern, it’s not a birthday, it’s not the year Columbus discovered America, it’s not 1776.” At a certain point, these numbers at the bottom of the list are all kind of “the lowest of the low, they’re all noise,” he says.

This is a great example of two things:

1. There are often patterns among supposedly “random” numbers.

2. Humans don’t particularly like to use “random” numbers but instead prefer numbers that are meaningful to them (which corresponds with them being able to remember their codes).
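The kind of tallying behind Berry's figures is easy to reproduce on any list of PINs. Here is a minimal sketch with a tiny made-up sample rather than his 3.4 million-code dataset:

```python
from collections import Counter

# Tiny made-up sample standing in for a large leaked-PIN dataset.
pins = ["1234", "1111", "1234", "0000", "1987", "2580",
        "1234", "8068", "1111", "1984"]

counts = Counter(pins)
total = len(pins)

# Most common PINs, plus the share of all PINs covered by the top guesses,
# analogous to the "20 combinations cover 26.83%" figure quoted above.
top = counts.most_common(3)
coverage = sum(c for _, c in top) / total
print("most common:", top)
print(f"share covered by top {len(top)} guesses: {coverage:.0%}")
```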