Seeing the real America at the ER Saturday at 10 PM and Walmart Sunday at 8 PM

I visited both of these locations in recent weeks and was intrigued to see the mix of people at each. I’ll make a quick case for why these locations could provide as good a cross-section of America as any other setting:

  1. Limited options. For the emergency room on a Saturday night, few other medical options are available; anyone with a medical issue will end up here. As for Walmart on Sunday evening, there are limited brick-and-mortar shopping options and the work week is about to start.
  2. People need medical care and grocery/home items. Both locations have people trying to meet basic human needs. Even as online shopping allows people to avoid other shoppers and online medical consultations become available, there are inevitably moments when a quick trip to a store or a medical professional is necessary. It is hard to imagine either of these facilities disappearing completely (even if the number of retailers is severely reduced).
  3. Connected to #1 and #2 above, people of different races, ethnicities, and social classes are present at both locations. In many other settings, whether due to residential patterns, the location of jobs, ill will toward others, or access to resources, not all groups are represented. Sociologist Elijah Anderson wrote a book about such rare urban locations.

While these may not be the best locations in which to conduct research, they could offer insights into typical American life.

NYT: Yes, don’t trust online polls

Although the purpose here may truly be to discredit Donald Trump, here is another argument in the New York Times against online polls:

“Those do a good job of engaging audiences online, and they do a good job of letting you know how other people who have come to the webpage feel about whatever issue,” said Mollyann Brodie, the executive director for public opinion and survey research at the Kaiser Family Foundation. “But they’re not necessarily good at telling you, in general, what people think, because we don’t know who’s come to that website and who’s taken it.”

Professional pollsters use scientific statistical methods to make sure that their small random samples are demographically appropriate to indicate how larger groups of people think. Online polls do nothing of the sort, and are not random, allowing anyone who finds the poll to vote. They are thus open to manipulation from those who would want to stuff the ballot box. Users on Reddit and 4chan directed masses of people to vote for Mr. Trump in the instant-analysis surveys, according to The Daily Dot. Similar efforts were observed on Twitter and other sites.

Even when there is no intentional manipulation, the results are largely a reflection of who is likely to come to a particular site and who would be motivated enough to participate. Intuitively, it’s no surprise that readers of sites like Breitbart News and the Drudge Report would see Mr. Trump as the winner, just as Mrs. Clinton would be more likely to find support on liberal sites…

“In our business, the key is generalizability,” he said, referring to the ability of a sample group to apply to a wider population. “That’s the core of what we do. Typically, it takes a lot of time, and a lot of effort, and a lot of money to do it.”

One helpful solution would be for media outlets to refuse to run online polls at all. Journalists often remind the public that such polls don’t mean anything, yet they consistently offer them on their websites and evening news broadcasts. The polls may serve some marketing purpose – perhaps participants feel more engaged, or outlets get some indication of how many people are going beyond passively taking in the news – but why confuse people?
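
To make the self-selection problem concrete, here is a minimal simulation sketch in Python. All of the numbers are hypothetical: the population splits 50/50, but supporters of one side are far more likely to visit the site and click the poll.

```python
import random

random.seed(42)

# Hypothetical population: 50% support candidate A, 50% candidate B.
population = ["A"] * 500_000 + ["B"] * 500_000

# Scientific poll: a simple random sample of 1,000 adults.
random_sample = random.sample(population, 1000)
print("Random sample, % for A:", 100 * random_sample.count("A") / 1000)

# Online poll: respondents self-select. Suppose A's supporters are
# four times as likely to visit the site and click the poll button.
online_votes = [p for p in population
                if random.random() < (0.004 if p == "A" else 0.001)]
print("Online poll, % for A:",
      round(100 * online_votes.count("A") / len(online_votes), 1))
# The random sample lands near the true 50%; the self-selected poll
# lands near 80%, despite collecting more "votes."
```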

Reminder to journalists: a blog post garnering 110 comments doesn’t say much about social trends

In reading a book this weekend (a review to come later this week), I ran across a common tactic used by journalists: looking at the popularity of websites as evidence of a growing social trend. This particular book quoted a blog post and then said “The post got 110 comments.”

The problem is that this figure doesn’t really tell us much about anything.

1. These days, 110 comments on an Internet story is nothing. Controversial articles on major news websites regularly garner hundreds, if not thousands, of comments.

2. We don’t know who exactly was commenting on the story. Were these people who already agreed with what the author was writing? Was it friends and family?

In the end, citing these comments runs into the same problems that face poorly done web surveys: we don’t know whether they are representative of Americans as a whole. That doesn’t mean blogs and comments can’t be cited at all, but we need to be very careful about what these sites tell us, what we can know from the comments, and who exactly they represent. A random sample of blog posts might help, as would a more long-term study of responses to news articles and blog posts (see the sketch below). But simply declaring that something is an important issue because a bunch of people were moved enough to comment online may not mean much of anything.
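
As a sketch of what a random sample of blog posts could look like (the URLs here are invented placeholders), the selection step itself is trivial; the hard part in practice is assembling the sampling frame:

```python
import random

# Hypothetical sampling frame: post URLs gathered from, say, a blog
# directory or archive crawl -- building this frame is the hard part.
sampling_frame = [f"https://example-blog.com/post/{i}" for i in range(10_000)]

random.seed(1)
sample = random.sample(sampling_frame, 100)  # simple random sample of posts

# Each sampled post could then be coded for topic, tone, and comment
# count, yielding a defensible estimate of how common a theme is --
# unlike citing one post that happened to draw 110 comments.
print(sample[:3])
```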

Thankfully, the author didn’t use this number of blog comments as their only source of evidence; it was part of a larger story with more conclusive data. However, it might simply be better to quote a blog post like this as an example of what is out on the Internet rather than try to follow it with some “hard” numbers.

Internet commenters can’t handle science because they argue by anecdote, think studies apply to 100% of cases

Popular Science announced this week that they will no longer allow comments on their stories because “comments can be bad for science”:

But even a fractious minority wields enough power to skew a reader’s perception of a story, recent research suggests. In one study led by University of Wisconsin-Madison professor Dominique Brossard, 1,183 Americans read a fake blog post on nanotechnology and revealed in survey questions how they felt about the subject (are they wary of the benefits or supportive?). Then, through a randomly assigned condition, they read either epithet- and insult-laden comments (“If you don’t see the benefits of using nanotechnology in these kinds of products, you’re an idiot”) or civil comments. The results, as Brossard and coauthor Dietram A. Scheufele wrote in a New York Times op-ed:

Uncivil comments not only polarized readers, but they often changed a participant’s interpretation of the news story itself.
In the civil group, those who initially did or did not support the technology — whom we identified with preliminary survey questions — continued to feel the same way after reading the comments. Those exposed to rude comments, however, ended up with a much more polarized understanding of the risks connected with the technology.
Simply including an ad hominem attack in a reader comment was enough to make study participants think the downside of the reported technology was greater than they’d previously thought.

Another, similarly designed study found that just firmly worded (but not uncivil) disagreements between commenters impacted readers’ perception of science…

A politically motivated, decades-long war on expertise has eroded the popular consensus on a wide variety of scientifically validated topics. Everything, from evolution to the origins of climate change, is mistakenly up for grabs again. Scientific certainty is just another thing for two people to “debate” on television. And because comments sections tend to be a grotesque reflection of the media culture surrounding them, the cynical work of undermining bedrock scientific doctrine is now being done beneath our own stories, within a website devoted to championing science.

In addition to rude comments and ad hominem attacks changing perceptions of scientific findings, here are two common misunderstandings of how science works that often show up in online comments (they are also common misconceptions offline):

1. Internet conversations are ripe for argument by anecdote. This happens all the time: a study is described and then the comments fill up with people saying that the study doesn’t apply to them or to someone they know. A single counterexample usually says very little; scientific studies are designed to be as generalizable as they can be. Think of jokes made about global warming: one blizzard or one cold season doesn’t invalidate a general upward trend in temperatures (see the first sketch after this list).

2. Argument by anecdote is related to a misconception about scientific studies: the findings do not apply to 100% of cases. Scientific findings are probabilistic, meaning there is some room for error (this does not mean science doesn’t tell us anything – it means the real world is hard to measure and analyze – and scientists try to limit error as much as possible). Thus, scientists tend to talk in terms of relationships being more or less likely. This nuance tends to get lost in news stories that suggest airtight causal relationships (see the second sketch below).
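
To illustrate point #1, here is a minimal simulation with made-up numbers: even with a steady upward trend, individual cold years keep occurring, so one cold season is not a counterexample.

```python
import random

random.seed(0)

# Hypothetical series: a 0.02-degree upward trend per year plus noise.
temps = [14.0 + 0.02 * year + random.gauss(0, 0.3) for year in range(100)]

cold_years = sum(1 for prev, curr in zip(temps, temps[1:]) if curr < prev)
print(f"Years colder than the year before: {cold_years} of 99")
print(f"Change over the century: {temps[-1] - temps[0]:+.2f} degrees")
# Many individual "cold years" coexist with a clear upward trend, so
# pointing to one of them does not refute the trend.
```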
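
And for point #2, a sketch with hypothetical probabilities: a risk factor can triple the likelihood of an outcome while still leaving most exposed cases unaffected.

```python
import random

random.seed(0)

# Hypothetical risks: 5% baseline, 15% among those exposed to a factor.
exposed = [random.random() < 0.15 for _ in range(10_000)]
unexposed = [random.random() < 0.05 for _ in range(10_000)]

print(f"Outcome rate, exposed:   {sum(exposed) / len(exposed):.1%}")
print(f"Outcome rate, unexposed: {sum(unexposed) / len(unexposed):.1%}")
# The relationship is real and strong (roughly triple the risk), yet
# about 85% of exposed cases never show the outcome -- anecdotes about
# them do not refute the finding.
```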

In other words, in order to have online conversations about science, you have to have readers who know the basics of scientific studies. I’m not sure my two points above are necessarily taught before college but I know I cover these ideas in both Statistics and Research Methods courses.

The rise of “data science” as illustrated by examining the McDonald’s menu

Christopher Mims takes a look at “data science” and one of its practitioners:

Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)

Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.

For the rest of us, Chen provides a concrete and accessible example: McDonald’s.

By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.

This sounds like an updated version of factor analysis: breaking a whole down into its larger, influential pieces.
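
For a rough sense of what such clustering does, here is a minimal sketch using scikit-learn’s Dirichlet-process flavor of a Bayesian Gaussian mixture. The menu items and nutrition numbers are invented placeholders, not actual McDonald’s data, and this is not necessarily how Chen implemented his model.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical menu items with (calories, fat g, sugar g) per serving.
items = ["Hot Mustard", "Spicy Buffalo", "Creamy Ranch",
         "Iced Coffee w/ Syrup", "Balsamic Vinaigrette", "Hamburger"]
X = np.array([
    [60, 2.5, 6],    # Hot Mustard
    [70, 7.0, 1],    # Spicy Buffalo
    [110, 12.0, 1],  # Creamy Ranch
    [120, 4.5, 18],  # Iced Coffee w/ Syrup
    [35, 1.5, 3],    # Balsamic Vinaigrette
    [250, 9.0, 6],   # Hamburger
])

# A Dirichlet-process prior lets the model decide how many of the
# candidate clusters to actually use -- the "how many groups?" problem.
model = BayesianGaussianMixture(
    n_components=3,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

for item, label in zip(items, model.predict(X)):
    print(f"{item}: cluster {label}")
```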

Here is how Chen describes the field:

I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:

* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)

* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.

* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.
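
As a hedged sketch of the tail end of such a pipeline (the file name and fields are hypothetical stand-ins for the output of an upstream MapReduce job):

```python
import csv
from collections import Counter

# Hypothetical input: (term, count) rows aggregated upstream by a
# MapReduce job and dumped as CSV.
with open("tweet_term_counts.csv", newline="") as f:
    counts = Counter({row["term"]: int(row["count"]) for row in csv.DictReader(f)})

# Light munging: keep the 20 most common terms.
top_terms = counts.most_common(20)
scale = max(1, top_terms[0][1] // 40)  # longest bar ~40 characters

# A crude text "chart" standing in for a proper dashboard or web app.
for term, n in top_terms:
    print(f"{term:<20} {'#' * (n // scale)}")
```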

I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting, but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing, but it seems like there is a ways to go here. I would also be interested in hearing Chen’s thoughts on how representative the data he typically works with is. In other words, how confident are he and others that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is? Can we ask and answer questions about all Americans or world residents from the data becoming available through new sources?

h/t Instapundit

Do politicians understand how polls work?

A recent CBS News/New York Times poll showed 80% of Americans do not think their family is financially better off than it was four years ago:

Just 20 percent of Americans feel their family’s financial situation is better today than it was four years ago. Another 37 percent say it is worse, and 43 percent say it is about the same.

When asked about these specific results, Harry Reid had this to say about polls in general:

“I’m not much of a pollster guy. As everyone knows, there isn’t a poll in America that had me having any chance of being re-elected, but I got re-elected,” he told TheDC.

“I think this poll is so meaningless. It is trying to give the American people an idea of what 300 million people feel by testing several hundred people. I think the poll is flawed in so many different ways including a way that questions were asked. I don’t believe in polls generally and specifically not in this one.”

The cynical take on this is that Reid and politicians in general like polls when the results support their positions and dismiss them when they do not. If this is true, then you might expect politicians to cite polls when they are good but to ignore them, or even try to discredit them, when they are bad.

But I would like to ask a more fundamental question: are politicians any better than average Americans at understanding polls? Reid seems to suggest that this poll has two major problems: it doesn’t ask enough people to really speak for all Americans (a sampling issue) and the questions are poorly worded, which leads to biased answers (a question-wording issue). Is Reid right? From the information at the bottom of the CBS story about the poll, it seems pretty standard:

This poll was conducted by telephone from March 7-11, 2012 among 1009 adults nationwide.

878 interviews were conducted with registered voters, including 301 with voters who said they plan to vote in a Republican primary. Phone numbers were dialed from samples of both standard land-line and cell phones. The error due to sampling for results based on the entire sample could be plus or minus three percentage points. The margin of error for the sample of registered voters could be plus or minus three points and six points for the sample of Republican primary voters. The error for subgroups may be higher. This poll release conforms to the Standards of Disclosure of the National Council on Public Polls.

Yes, the number of respondents seems low for talking about all Americans, but this is how all major polls work: you select a representative sample based on standard demographic factors (gender, race, age, etc.) and then you estimate how close the survey results are likely to be to what you would find if you asked all American adults these questions. This is why all polls have a margin of error: if you ask fewer people, you are less confident in the generalizability of the results (which is why there is a larger six-point margin for the smaller subgroup of Republican primary voters), and if you ask more people, you can be more confident (though the payoff usually diminishes beyond roughly 1,200-1,500 respondents, so at some point it is not worth asking more). The quick calculation below shows where those figures come from.
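
Here is the arithmetic behind those margins, using the standard 95% formula for a proportion (worst case p = 0.5); the sample sizes come straight from the CBS write-up above:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion from a simple random sample."""
    return z * sqrt(p * (1 - p) / n)

for label, n in [("All adults", 1009),
                 ("Registered voters", 878),
                 ("Republican primary voters", 301)]:
    print(f"{label} (n={n}): +/- {100 * margin_of_error(n):.1f} points")
# n=1009 gives about +/- 3.1 points and n=301 about +/- 5.6 points,
# matching the roughly three- and six-point figures CBS reports.
```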

I don’t think Reid sounds very good in this soundbite: he attacks the scientific basis of polls with common objections. While polls may not “feel right” and may contradict anecdotal or personal evidence, they can be done well, and with a good sample of around 1,000 people you can be confident that the results are generalizable to the American people. If Reid does understand how polls work, he could raise other issues. For example, he could insist that this is a one-time poll and that you would want to measure this again and again to see how it changes (perhaps this is an unusual trough?), or that you would want other polling organizations to ask the same question and triangulate the results across surveys (like what Real Clear Politics does by averaging polls). Or he could suggest that this question doesn’t matter much because asking about four years ago is a rather arbitrary point: philosophically, does life always have to get better over time?

The educational level of immigrants in America

A new report suggests that there are more immigrants with college degrees than immigrants without high school diplomas:

“There’s more high-skilled (immigrants) than people believe,” said Audrey Singer, senior fellow with the Metropolitan Policy Program at the Brookings Institution and co-author of the report, which contends that the economic contribution of immigrants has been overshadowed by the rancorous debate over illegal immigration.

Singer and Matthew Hall, a sociologist at the University of Illinois-Chicago, analyzed census data for the nation’s 100 largest metropolitan areas and found that 30 percent of working-age immigrants had at least a bachelor’s degree, compared with 28 percent who lack a high school diploma.

The article suggests that the report is intended to influence the national immigration debate, presumably by suggesting that many immigrants are an asset to the country.

But it would be helpful to compare these figures for immigrants to the statistics for American adults overall to know whether they are impressive or not. Here are the 2010 educational attainment figures for Americans 18 and older of all races: 27.28% have a bachelor’s degree or higher while 13.71% have less than a high school diploma. The figures for immigrants look more polarized than those for the general population: a slightly higher percentage (about 2-3 points more) holds a college degree, while a much higher percentage (roughly double) lacks a high school diploma. (The figures for Americans 25 and older change a little: 29.93% have a college degree or greater while 12.86% have less than a high school diploma.)
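
A quick calculation makes the comparison explicit (the immigrant figures are from the report as quoted above; the national figures are the 2010 attainment numbers just cited):

```python
# (BA or higher %, less than high school %) for each group, as cited above.
groups = {
    "Immigrants (working age, top 100 metros)": (30.0, 28.0),
    "All Americans, 18 and older":              (27.28, 13.71),
    "All Americans, 25 and older":              (29.93, 12.86),
}

for group, (ba_plus, less_than_hs) in groups.items():
    print(f"{group}: {ba_plus:.1f}% BA+, {less_than_hs:.1f}% < HS")

print("College-degree gap vs. 18+:", round(30.0 - 27.28, 2), "points")
print("Less-than-HS ratio vs. 18+:", round(28.0 / 13.71, 2), "x")
# About 2-3 points more with degrees, but roughly double the share
# without a high school diploma -- the polarization noted above.
```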

The value, then, of the figures about immigrants is probably in the realm of public perceptions, particularly the statistic on immigrants with college degrees, which matches up well with the figures for Americans 18 and older and 25 and older.

(The article doesn’t address this and I don’t know if the report does either: does it matter that the figures for immigrants are drawn from the 100 largest metropolitan areas? Would the figures be different if we looked at all immigrants?)