SurveyMonkey made good 2014 election predictions based on experimental web polls

Here is an overview of some experimental work at SurveyMonkey on political polling ahead of the 2014 elections:

For this project, SurveyMonkey took a somewhat different approach. They did not draw participants from a pre-recruited panel. Instead, they solicited respondents from the millions of people who complete the “do it yourself” surveys that SurveyMonkey’s customers run every day for companies, schools and community organizations. At the very end of these customer surveys, they asked respondents if they could answer additional questions to “help us predict the 2014 elections.” That process yielded over 130,000 completed interviews across the 45 states with contested races for Senate or governor.

SurveyMonkey tabulated the results for all adult respondents in each state after weighting to match Census estimates for gender, age, education and race for adults — a relatively simple approach analogous to the way most pollsters weight random sample telephone polls. SurveyMonkey provided HuffPollster with results for each contest tabulated among all respondents as well as among subgroups of self-identified registered voters and among “likely voters” — those who said they had either already voted or were absolutely certain or very likely to vote (full results are published here).
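To make that weighting step concrete, here is a minimal sketch of cell-based post-stratification in Python. All of the respondent data, category codings and Census shares below are hypothetical stand-ins; SurveyMonkey has not published its exact procedure, and pollsters often rake across several variables rather than weight on a single cross-classification.

```python
import pandas as pd

# Hypothetical respondent-level data; the real SurveyMonkey data and
# Census targets are not public, so these names and numbers are illustrative.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "age":    ["18-34", "35-64", "18-34", "65+", "65+", "35-64"],
    "vote":   ["D", "R", "D", "R", "D", "R"],
})

# Assumed Census population shares for each gender x age cell.
census_share = {
    ("F", "18-34"): 0.15, ("F", "35-64"): 0.25, ("F", "65+"): 0.11,
    ("M", "18-34"): 0.14, ("M", "35-64"): 0.24, ("M", "65+"): 0.11,
}

# Post-stratification: each respondent's weight is the population share
# of their demographic cell divided by that cell's share of the sample.
sample_share = df.groupby(["gender", "age"]).size() / len(df)
df["weight"] = [
    census_share[(g, a)] / sample_share[(g, a)]
    for g, a in zip(df["gender"], df["age"])
]

# Weighted candidate shares, analogous to the tabulations described above.
print(df.groupby("vote")["weight"].sum() / df["weight"].sum())
```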

“We sliced the data by these traditional cuts so we could easily compare them with other surveys,” explains Jon Cohen, SurveyMonkey’s vice president of survey research, “but there’s growing evidence that we shouldn’t necessarily use voters’ own assessments of whether or not they’ll vote.” In future elections, Cohen adds, they plan “to dig in and build more sophisticated models that leverage the particular attributes of the data we collect.” (In a blog post published separately on Thursday, Cohen adds more detail about how the surveys were conducted).

The results are relatively straightforward. The full SurveyMonkey samples did very well in forecasting winners, showing the ultimate victor ahead in all 36 Senate races and missing in just three contests for Governor (Connecticut, Florida and Maryland)…

The more impressive finding is the way the SurveyMonkey samples outperformed the estimates produced by HuffPost Pollster’s poll tracking model. Our models, which are essentially averages of public polls, were based on all available surveys and calibrated to correspond to results from the non-partisan polls that had performed well in previous elections. SurveyMonkey’s full samples in each state showed virtually no bias, on average. By comparison, the Pollster models overstated the Democrats’ margins against Republican candidates by an average of 4 percentage points. And while SurveyMonkey’s margins were off in individual contests, the spread of those errors was slightly smaller than the spread of those for the Pollster averages (as indicated by the total error, the average of the absolute values of the errors on the Democrat vs. Republican margins).
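For reference, the two accuracy measures used here are straightforward to compute: the average signed error (bias) and the total error (the mean absolute error on the Democrat-vs.-Republican margin). A sketch with invented margins, not the actual state results:

```python
# Hypothetical poll margins (Democrat minus Republican, in points)
# versus actual outcomes; the real state-by-state figures are in the
# linked HuffPollster tables.
predicted = [ 3.0, -5.0,  1.5, -2.0]   # poll estimates
actual    = [ 1.0, -4.0, -1.0, -3.5]   # election outcomes

errors = [p - a for p, a in zip(predicted, actual)]

# Average signed error: positive values overstate the Democratic margin.
bias = sum(errors) / len(errors)

# Total error: mean of the absolute errors, the spread measure described above.
total_error = sum(abs(e) for e in errors) / len(errors)

print(f"bias = {bias:+.2f} points, total error = {total_error:.2f} points")
```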

The general concerns with web surveys involve obtaining a representative sample, either because it is difficult to identify respondents who fit the appropriate demographics or because the survey is open to everyone. Yet SurveyMonkey was able to produce good predictions for this past election cycle. Was it because (a) their samples were large enough that the data better approximated the general population (they were able to reach a large number of people who use their services), or (b) their weighting was particularly good?
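One way to see why the answer matters: sample size alone cannot fix selection bias, while good weighting can. A toy simulation (all numbers invented) in which a web panel over-represents one group no matter how large it grows:

```python
import random

random.seed(42)

# Toy population: 55% of group A supports D, 40% of group B does;
# each group is half the population (all numbers hypothetical).
def draw(group):
    return "D" if random.random() < (0.55 if group == "A" else 0.40) else "R"

# A web panel that over-represents group A (70% of respondents),
# regardless of how large the sample gets. True D share is 47.5%.
for n in (1_000, 100_000):
    sample = ["A" if random.random() < 0.7 else "B" for _ in range(n)]
    votes = [(g, draw(g)) for g in sample]

    raw = sum(v == "D" for _, v in votes) / n

    # Weight each respondent by population share / sample share of their group.
    share_a = sum(g == "A" for g in sample) / n
    weights = {"A": 0.5 / share_a, "B": 0.5 / (1 - share_a)}
    weighted = sum(weights[g] for g, v in votes if v == "D") / n

    print(f"n={n:>7}: raw D share {raw:.3f}, weighted {weighted:.3f}")
```

The raw share stays biased toward group A’s preference even at n = 100,000, while the weighted estimate lands near the true population value at both sample sizes.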

The real test will come when a major organization, particularly a media outlet, relies solely on web polls ahead of a major election. Given these positive results, perhaps we will see this in 2016. Still, I imagine there may be some kinks to work out of the system, or some organizations may only be willing to take that step if they pair the web data with more traditional forms of polling.

Reminder to journalists: a blog post garnering 110 comments doesn’t say much about social trends

In reading a book this weekend (a review to come later this week), I ran across a common tactic used by journalists: looking at the popularity of websites as evidence of a growing social trend. This particular book quoted a blog post and then said “The post got 110 comments.”

The problem is that this figure doesn’t really tell us much about anything.

1. These days, 110 comments on an Internet story is nothing. Controversial articles on major news websites regularly garner hundreds, if not thousands, of comments.

2. We don’t know who exactly was commenting on the story. Were these people who already agreed with what the author was writing? Was it friends and family?

In the end, citing these comments runs into the same problems that face poorly done web surveys: we don’t know whether the commenters are representative of Americans as a whole. That doesn’t mean blogs and comments can’t be cited at all, but we need to be very careful about what these sites tell us, what we can learn from the comments, and who exactly they represent. A random sample of blog posts might help, as would a longer-term study of responses to news articles and blog posts. But simply saying that something is an important issue because a bunch of people were moved enough to comment online may not mean much of anything.

Thankfully, the author didn’t use this number of blog comments as their only source of evidence; it was part of a larger story with more conclusive data. Still, it might simply be better to quote a blog post like this as an example of what is out on the Internet rather than trying to dress it up with “hard” numbers.

Sociology grad student: “the Internet is a sociologist’s playground”

A sociology graduate student makes an interesting claim: “the Internet is a sociologist’s playground”:

The Internet is a sociologist’s playground, says Scott Golder, a graduate student in sociology at Cornell University. Although sociologists have wanted to study entire societies in fine-grained detail for nearly a century, they have had to rely primarily upon large-scale surveys (which are costly and logistically challenging) or interviews and observations (which provide rich detail, but for small numbers of subjects). Golder hopes that data from the social Web will provide opportunities to observe the detailed activities of millions of people, and he is working to bring that vision to fruition.  The same techniques that make the Web run—providing targeted advertisements and filtering spam—can also provide insights into social life. For example, he has used Twitter archives to examine how people’s moods vary over time, as well as how network structure predicts friendship choices. Golder came to sociology by way of computer science, studying language use in online communities and using the Web as a tool for collecting linguistic data. After completing a B.A. at Harvard and an M.S. at the MIT Media Lab, he spent several years in an industrial research lab before beginning his Ph.D. in sociology at Cornell.

I would think that having a background in computer science would be a big plus for a sociologist today. Lots of people want to study social networking sites like Facebook and work with the data available online. But I wonder if there still aren’t a few issues to overcome before we can really tap this information:

1. Do companies that have a lot of this data, places like Google and Facebook, want to open it up to researchers or would they prefer to keep the data in-house in order to make money?

2. How will Internet users respond to the interest researchers have in studying their online behavior if they are often not thrilled about being tracked by companies?

3. Has the sampling issue been resolved? In other words, one of the problems with web surveys or working with certain websites is that these users are not representative of the total US population. So while Internet activity has increased among the population as a whole, isn’t Internet usage, particularly among those who use it most frequently, still skewed in certain directions?

4. Just how much does online activity reveal about offline activity? Do the two worlds overlap so much that this is not an issue or are there important things that you can’t uncover through online activity?

I would think some of these issues could be resolved, and the sociologists who can really tap this growing realm will have a valuable head start.