The rise of “data science” as illustrated by examining the McDonald’s menu

Christopher Mims takes a look at “data science” and one of its practitioners:

Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)

Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.

For the rest of us, Chen provides a concrete and accessible example: McDonald’s.

By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.

This sounds like an updated version of factor analysis: break a whole into its larger, more influential pieces.
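Chen’s post builds an infinite mixture model with a Dirichlet process prior. As a rough, hypothetical stand-in for that approach, scikit-learn’s `BayesianGaussianMixture` with a Dirichlet-process weight prior also lets the data decide how many clusters to use. The menu items and nutrition numbers below are invented for illustration, not Chen’s actual data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical nutrition profiles (fat g, sugar g, sodium mg) --
# invented numbers, not real McDonald's data.
items = {
    "Hot Mustard Sauce":            (2, 6, 250),
    "Spicy Buffalo Sauce":          (1, 1, 800),
    "Creamy Ranch Sauce":           (15, 1, 270),
    "Iced Coffee w/ SF Vanilla":    (4, 1, 40),
    "Low Fat Balsamic Vinaigrette": (3, 4, 730),
    "Big Mac":                      (29, 9, 1010),
    "McDouble":                     (17, 7, 850),
}

X = StandardScaler().fit_transform(np.array(list(items.values()), dtype=float))

# A truncated Dirichlet-process mixture: n_components is only an upper
# bound; the model shrinks unused components toward zero weight.
dpgmm = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
)
labels = dpgmm.fit_predict(X)

for name, label in zip(items, labels):
    print(f"{name}: cluster {label}")
```

The point of the nonparametric prior is exactly the question raised above: you do not fix the number of groups in advance; the model infers how many clusters the data supports.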

Here is how Chen describes the field:

I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:

* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)

* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.
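The glue between those tools is mundane but important. Here is a hypothetical sketch of the Python leg of such a pipeline, assuming the upstream job wrote tab-separated (user, count) pairs and that the extra fields live in a SQLite table; all names and data are invented:

```python
import csv
import io
import sqlite3

# Pretend output of an upstream MapReduce job: tab-separated
# (user_id, tweet_count) pairs. Hypothetical data for illustration.
mapreduce_output = "u1\t42\nu2\t7\nu3\t19\n"

# Pretend database holding extra per-user fields.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id TEXT, country TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [("u1", "US"), ("u2", "JP"), ("u3", "US")])

# Glue step: parse the job output, then enrich each row from the database.
totals = {}
for user_id, count in csv.reader(io.StringIO(mapreduce_output), delimiter="\t"):
    (country,) = db.execute(
        "SELECT country FROM users WHERE user_id = ?", (user_id,)).fetchone()
    totals[country] = totals.get(country, 0) + int(count)

print(totals)  # per-country counts, ready for further analysis or modeling
```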

* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.

I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting, but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing, but it seems like there is a ways to go here. I would also be interested in hearing Chen’s thoughts on how representative the data is that he typically works with. In other words, how confident are he and others that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is? Can we ask and answer questions about all Americans or world residents from the data that is becoming available through new data sources?

h/t Instapundit

Judging the validity of academic expertise in court

It is common in the world of academia for academics to judge the credibility of other scholars. But what happens when academics step into the courtroom and a judge assesses whether they are experts or not? Consider the case of a Canadian sociologist who was going to testify as a gangs expert:

Mark Totten, an Ottawa sociologist, has “virtually no expertise with gangs in the Greater Toronto Area,” Ontario Superior Court Justice Robert Clark said in a 27-page ruling which had been under a publication ban until the jury in a gang-related case began deliberations Wednesday.

Yet, this is the same sociologist who, in 2009, was praised by the Ontario Court of Appeal for having “extensive and impressive credentials” in the field of street gang culture…

Totten himself admitted in an interview that he “didn’t handle it very well” after wilting under cross-examination in the voir dire, a preliminary examination to determine the competency of a witness…

The last time Totten’s expertise was questioned was in 2007, when Justice Todd Archibald disallowed his “expert” witness testimony on the meaning of a teardrop tattoo on the cheek of an accused killer, Warren Abbey…

On his website, Totten’s list of degrees includes a PhD in sociology (1996) from Carleton University. He is also the author of a book about to be released, Nasty, Brutish and Short: The Lives of Gang Members in Canada.

According to his 31-page resumé, most of Totten’s work with gangs has been in the Ottawa area and western Canada, and he says he has counselled hundreds of gang members.

Several things seem to be happening here:

1. In the most recent incident, Totten admitted he didn’t do a good job testifying. So perhaps he isn’t convincing and/or gets flustered.

2. Perhaps Totten’s knowledge is not specific enough for particular cases. While he has researched gangs, he may not know the particulars of gang activity in Toronto (or some other locations).

3. With the possibility of #1 and #2, why would either the prosecution or defense call on Totten for his expert testimony?

4. How does a judge decide whether a testifying expert has enough expertise? I’m sure there are guidelines for this, but doesn’t this require the judge to assess the research ability of the expert? For example, the recent case involved questions about the methodology Totten used:

In a ruling released March 5, Clark flagged as a “problem” Totten’s data relating to the sample size of gang members he purportedly interviewed, calling it “inaccurate and misleading in several ways.”

Clark had listened for a day and a half as Misener challenged Totten about his research and methodology, including that used in the Abbey trial, and about his lack of knowledge about Toronto street gangs.

This is a very common academic argument: attack the methodology of another researcher and suggest they can’t reach the conclusions they do because the data is bad. Knowing this, academics have to be able to respond to such attacks, which is why articles and books typically contain a defense of the methodology used for the study. In this case, the argument seems to be that Totten can’t really speak about Toronto gangs because there are important differences between these gangs and the ones Totten has studied. At what point is the judge convinced that Totten is not an expert for this case?

Even if the methodology is good, perhaps #1 and #2 are most important here – if the expert can’t speak well to the specific case and defend their methodology, it doesn’t matter if the expert really is an expert. Part of being an expert requires that the expert can effectively communicate their argument and the methodology behind it.

(My goal in this post is not to defend Totten or suggest his testimony should not be allowed. Rather, I was intrigued by the fact that these arguments about methodology and validity took place in court. While sociologists and researchers in other disciplines might know how the publishing system works for their own field, I assume the rules and standards in court differ even as there are some similarities between the two realms.)

We need a more complex analysis of how taxes affect income inequality

One current blogosphere discussion about whether taxes could help reduce income inequality would benefit from more complex analyses. Here is the discussion thus far according to TaxProf Blog:

There have been a number of reports published recently that purport to show a link between rising inequality and changes in tax policy — especially tax cuts for the so-called rich. The latest installment comes from Berkeley professor Emmanuel Saez, Striking it Richer: The Evolution of Top Incomes in the United States.

Saez and others who write on this issue seem so intent on proving a link between tax policy and inequality that they overlook the major demographic changes that are occurring in America that can contribute to — or at least give the appearance of — rising inequality; a few of these being differences in education, the rise of dual-earner couples, the aging of our workforce, and increased entrepreneurship.

Today, we will look at the link between education and income. Recent census data comparing the educational attainment of householders and income shows about as clearly as you can that America’s income gap is really an education gap and not the result of tax cuts for the rich.

The chart below shows that as people’s incomes rise, so too does the likelihood that they have a college degree or higher. By contrast, those with the lowest incomes are most likely to have a high school education or less. Just 8% of those at the lowest income level have a college degree while 78% of those earning $250,000 or more have a college degree or advanced degree. At the other end of the income scale, 69% of low-income people have a high school degree or less, while just 9% of those earning over $250,000 have just a high school degree.

This analysis starts in the right direction: looking at a direct relationship between two variables such as tax rates and income inequality is difficult to do in isolation of other factors. While some factors may be more influential than others, there are a number of reasons for income inequality. In other words, graphs with two variables are not enough. Pulling out one independent variable at a time doesn’t give us the full picture.

But then the supposedly better way turns out to be that we were just looking at the wrong variable’s influence on income and should have been looking at education instead! So after being told that the situation was more complex, we get another two-variable graph showing that as education goes up, so does income, and therefore perhaps it really isn’t about taxes at all.

What we need here is some more complex statistical analysis, preferably regression analysis, where we can see how a variety of factors influence income inequality at the same time. Some of this might be a little harder to model since you would want to account for changing tax rates, but arguing over two-variable graphs isn’t going to get us very far. Indeed, I wonder if this is more common now in debates: both sides like simpler analyses because they allow each to make the point they want without considering the full complexity of the matter. In other words, easier-to-make graphs line up more with ideological commitments than with an interest in truly sorting out which factors are more influential in affecting income inequality.
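As a sketch of what “a variety of factors at the same time” means in practice, here is a toy ordinary least squares regression on synthetic data. Every variable name and effect size below is invented purely for illustration; the point is only that a multivariate model estimates each factor’s influence while holding the others constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic household data -- all variables and effects are invented.
education = rng.normal(14, 3, n)      # years of schooling
dual_earner = rng.integers(0, 2, n)   # 1 if two earners in household
age = rng.normal(45, 12, n)
tax_rate = rng.normal(35, 5, n)       # marginal rate faced

# Simulated income (in $1,000s) generated from known coefficients.
income = (2.0 * education + 15.0 * dual_earner + 0.3 * age
          - 0.5 * tax_rate + rng.normal(0, 10, n))

# OLS: estimate all coefficients jointly, rather than arguing over
# one two-variable graph at a time.
X = np.column_stack([np.ones(n), education, dual_earner, age, tax_rate])
coefs, *_ = np.linalg.lstsq(X, income, rcond=None)

names = ["intercept", "education", "dual_earner", "age", "tax_rate"]
for name, b in zip(names, coefs):
    print(f"{name}: {b:.2f}")
```

With all four predictors in the model, the education coefficient is estimated net of taxes, age, and household structure, which is exactly what the dueling two-variable graphs cannot show.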

Why don’t we collect data to see whether we have become more rude or uncivil rather than rely on anecdotes?

NPR ran a story the other day about how American culture is becoming more casual and less polite. This is not an uncommon story: every so often, different news organizations will run something similar, often focusing on the decreasing use of manners like saying “please” and “you’re welcome.” Here is the main problem I have with these articles: what kind of data could we look at to evaluate this argument? These stories tend to rely on experts who provide anecdotal evidence or their own interpretation. In this piece, these are the three experts: “a psychiatrist and blogger,” “a sophomore at the College of Charleston — in the South Carolina city that is often cited as one of the most courteous in the country,” and “etiquette maven Cindy Post Senning, a director of the Emily Post Institute in Burlington, Vt.”

There is one data point cited in this story:

Research backs up Smith’s anecdotal observations. In 2011, some 76 percent of people surveyed by Rasmussen Reports said Americans are becoming more rude and less civil.

Interestingly, this statistic is about perceptions. Perceptions may be more important than reality in many social situations. But I could imagine another scenario about these perceptions: older generations tend to think that younger generations (often their children and grandchildren?) are less mannered and don’t care as much about social etiquette. As this story suggests, perhaps the manners are simply changing – instead of saying “you’re welcome,” younger people give the dreaded “sure.”

There has to be some way to measure this. It would be nice to do this online or in social media but the problem is that face-to-face rules don’t apply there. Perhaps someone has recorded interactions at McDonald’s or Walmart registers? In whatever setting a researcher chooses, you would want to observe a broad range of people to look for patterns by age, occupation, gender, race, education level (though some of this would have to come through survey or interview data with the people being observed).

In my call for data, I am not disagreeing with the idea that traditional manners and civility have decreased. I just want to see data that suggests this rather than anecdotes and observations from a few people.

Questionable web survey of the day: smart USA finds Americans prefer “right-sizing”

I ran across a recent survey that initially looked promising as the findings suggested Americans prefer “right-sizing”:

While the last decade is often seen as a period of gluttonous consumption, McMansions, and Super-Size meals, the old adage that less is more seems to be ringing true in today’s post-recession era. The survey found that three out of four Americans prefer to receive a present in a small package over a large one. Those who thought bigger was better tended to be young, a preference that shrinks as people get older and wiser. (34% of Americans age 18-34 preferred bigger presents compared to 22% of those age 45-54 and 17% of those age 55+).

Overall, on the subject of preferring less over more:

  • 97% of Americans believe that at least some of the items in their household are junk (i.e., they could easily get rid of them)
  • Nearly one out of 10 (9%) Americans believe they can part with a full half of their stuff
  • 9% of Americans believe that 51-100% of the items in their household are junk, indicating that the supposed American obsession with size and quantity is overstated

I’m not sure the statistics here strongly show “the supposed American obsession with size and quantity is overstated” but this still seems interesting. Lots of people would argue Americans have too much stuff and particularly the admissions about having some or a lot of junk back this up. But if you read more closely, two issues pop up:

1. The survey was sponsored by smart USA and Harris Interactive. Not familiar with smart USA? Here is a hint:

“The fact that a majority of Americans are deeply concerned with right-sizing their lifestyles and making intelligent choices shows why smart has so much curb appeal today,” says smart USA General Manager Tracey Matura. “People are rethinking whether bigger is actually better and focusing instead on value. They’re looking at how they can cut down the clutter in their lives, whether in their choice of vehicle, home or other purchases, so they have fewer, better things rather than simply more, more, more. And smart is proof that good things do come in small packages.”

So the survey shows that there should be plenty of Americans who want to buy a smart Fortwo! While early sales of the car lagged, Mercedes Benz trumpeted moving 9,341 smart cars worldwide in April 2011. Is this really just a marketing survey?

2. There is another issue with the survey, which happened through the web:

This survey was conducted online within the United States by Harris Interactive on behalf of Smart from December 6-8, 2011 among 2,246 adults ages 18 and older. This online survey is not based on a probability sample and therefore no estimate of theoretical sampling error can be calculated. For complete survey methodology, including weighting variables, please contact terry.wei@mbusa.com.

Perhaps I’m missing something, but the admission that this is not based on a probability sample is bad news. This usually means that the survey is not terribly representative of the American population at large. Of course, the survey’s results could be weighted to try to make up for this, but weighting may not be able to truly adjust for a bad sample.
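To see both what weighting does and why it may not rescue a non-probability sample, here is a toy post-stratification example. All of the shares and response rates are hypothetical:

```python
# Toy post-stratification: reweight an online sample so its age mix
# matches the population. All numbers are invented for illustration.

# Share of each age group in the (hypothetical) sample vs. population.
sample_share = {"18-34": 0.55, "35-54": 0.30, "55+": 0.15}
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Each respondent's weight is population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

# Suppose respondents answered "yes" at these (hypothetical) rates.
yes_rate = {"18-34": 0.34, "35-54": 0.22, "55+": 0.17}

unweighted = sum(sample_share[g] * yes_rate[g] for g in sample_share)
weighted = sum(sample_share[g] * weights[g] * yes_rate[g] for g in sample_share)

print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
# Weighting fixes the demographic mix, but not the deeper problem:
# people who opt into an online panel may differ from non-respondents
# in ways no demographic weight captures.
```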

In conclusion, I’m not sure this survey really tells us much of anything. I assume the findings are useful to smart USA, but the results about larger American consumer patterns should be used with much caution.

Living alone means having no “social checks and balances”?

As more Americans live alone, these solo dwellers may have different behaviors at home:

In a sense, living alone represents the self let loose. In the absence of what Mr. Klinenberg calls “surveilling eyes,” the solo dweller is free to indulge his or her odder habits — what is sometimes referred to as Secret Single Behavior. Feel like standing naked in your kitchen at 2 a.m., eating peanut butter from the jar? Who’s to know?

Amy Kennedy, 28, a schoolteacher who has a two-bedroom apartment in High Point, N.C., all to herself, calls it living without “social checks and balances.”…

Among her domestic oddities: running in place during TV commercials; speaking conversational French to herself while making breakfast (she listens to a language CD); singing Journey songs in the shower; and removing only the clothes she needs from her dryer, thus turning it into a makeshift dresser…

What emerges over time, for those who live alone, is an at-home self that is markedly different — in ways big and small — from the self they present to the world. We all have private selves, of course, but people who live alone spend a good deal more time exploring them.

This sounds like Goffman’s dramaturgical approach: those living alone can be truly back-stage with themselves perhaps in a way they never could with a spouse or family. Would all of us exhibit this kind of quirky behavior if we didn’t have others around at home? Without others around to enforce the social norms of behavior, perhaps we become our only standard.

This makes me think about an area of life we don’t examine enough: what do people do when they are alone? Do they generally follow social conventions or are all people quirky? Do they feel comfortable when alone? Are there limits to how much we can talk to each other about being alone or how much we can ask about what others do when they are alone? How do alone behaviors and feelings about being alone differ across cultures? Do people in the Western world today spend more or less time alone than in the past? Do we feel a need to have more alone time (“me-time”?) or do we simply express this more? How do others tend to respond when we express loneliness or express that we like to be alone?

One thing I noted when reading this article: what about being alone yet, through different mediums, not really being alone? I’m thinking of situations where someone is alone but watching TV, listening to the radio, or interacting with people online. (Might reading also fall into this?) Of course, this kind of interaction is different than face-to-face interaction, but is it truly living alone? I tend to be a person who likes to listen to talk radio – am I alone when doing this? Additionally, do these mediated interactions limit the quirky side of living alone?

It might be difficult methodologically to get at alone time. I assume the best way to do this would be to have cameras observing people while alone. Of course, it would take some time for people to forget the cameras are there but it would happen eventually. Other methods would not be as good: having a person do the observations would alter the setting too much; time diaries are unreliable; and surveys or interviews after the fact could be helpful but would end up being interpreted accounts.

Sociology grad student: “the Internet is a sociologist’s playground”

A sociology graduate student makes an interesting claim: “the Internet is a sociologist’s playground“:

The Internet is a sociologist’s playground, says Scott Golder, a graduate student in sociology at Cornell University. Although sociologists have wanted to study entire societies in fine-grained detail for nearly a century, they have had to rely primarily upon large-scale surveys (which are costly and logistically challenging) or interviews and observations (which provide rich detail, but for small numbers of subjects). Golder hopes that data from the social Web will provide opportunities to observe the detailed activities of millions of people, and he is working to bring that vision to fruition.  The same techniques that make the Web run—providing targeted advertisements and filtering spam—can also provide insights into social life. For example, he has used Twitter archives to examine how people’s moods vary over time, as well as how network structure predicts friendship choices. Golder came to sociology by way of computer science, studying language use in online communities and using the Web as a tool for collecting linguistic data. After completing a B.A. at Harvard and an M.S. at the MIT Media Lab, he spent several years in an industrial research lab before beginning his Ph.D. in sociology at Cornell.

I would think that having a background in computer science would be a big plus for a sociologist today. Lots of people want to study social networking sites like Facebook and work with the data available online. But I wonder if there still aren’t a few issues to overcome before we can really tap this information:

1. Do companies that have a lot of this data, places like Google and Facebook, want to open it up to researchers or would they prefer to keep the data in-house in order to make money?

2. How will Internet users respond to the interest researchers have in studying their online behavior if they are often not thrilled about being tracked by companies?

3. Has the sampling issue been resolved? In other words, one of the problems with web surveys or working with certain websites is that these users are not representative of the total US population. So while internet activity has increased among the population as a whole, isn’t internet usage, particularly among those who use it most frequently, still skewed in certain directions?

4. Just how much does online activity reveal about offline activity? Do the two worlds overlap so much that this is not an issue or are there important things that you can’t uncover through online activity?

I would think some of these issues could be resolved and the sociologists who can really tap this growing realm will have a valuable head start.

When a sociological survey about Hong Kong angers Chinese authorities

Politics can interfere with research studies and findings. For an example, here is a case of a sociological survey done in Hong Kong that has gotten the attention of Chinese authorities:

In December, a Hong Kong sociologist by the name of Robert Chung found himself at the center of a political storm. A study commissioned by Chung, director of opinion research at a leading university in the territory, discovered that the number of people who identify themselves primarily as citizens of Hong Kong was higher than it’s been for the past 10 years. The survey showed that the number of those who viewed themselves as Chinese had fallen to 16.6 percent. That’s a 12-year low and less than half of what it was three years ago.

Since then the territory’s communist press has launched a vicious attack on the pollster. “Political fraudster” and “a slave of dirty political money” are just two of the Cultural Revolution style epithets trotted out against Professor Chung. Hao Tiechuan, a Beijing official stationed in Hong Kong, called in local reporters to denounce Professor Chung’s work as “unscientific” and “illogical.”

Beijing, always wary of Hong Kong’s loyalty because of its colonial heritage, ratchets up the rhetoric even higher during “election” season. In March, 1200 mostly pro-Beijing loyalists will choose the next chief executive, and in September, Hong Kong citizens will go to the polls to choose 35 of 70 seats in the partially-democratic legislature. Last fall, pro-Beijing candidates won local district-level polls overwhelmingly, although an investigation has been opened into possible vote-rigging. Beijing’s attacks on Professor Chung – as well as on a so-called “Gang of Four” of prominent democracy advocates – may be calculated to keep the minions who choose the chief executive in line and dampen turnout by the solid majority of Hong Kong voters who favor progress toward full democracy.

Does this make complaints about academic freedom in the United States seem rather tame?

The attacks by the communist press are intriguing. First, “political fraudster” implies that the work is unscientific. Second, the charge of being “a slave of dirty political money” suggests that the work is politically motivated and skewed. In both critiques, the attack is against the scientific credibility of the sociologist. The argument is that Chung has done poor research and the results shouldn’t be trusted. Furthermore, it suggests that Chung himself is not capable of conducting good research.

These are serious charges for a sociologist. It is one thing to disagree with findings or about their interpretation, or to suggest that the researcher should have used another method. It is another thing to claim that the researcher intentionally found certain results or can’t do good research. Yes, methodological errors are made occasionally (and sometimes fraudulently), but this cuts to the heart of sociology and the claim that we are searching for replicable and valid results. I hope Chung is able to show his proper use of sociological methods and is supported by others.

Participant observation or “sociological stalking”?

A psychotherapist tells a story about observing, interacting with, and being blessed by a woman in Mexico and calls what she does “sociological stalking.”

A couple of thoughts:

1. Stalking clearly has negative connotations, so why use this term? If you talk to people about using Facebook, “stalking” is crossing the line from being a simple observer, which you are supposed to be on Facebook by reading the news feed and interacting with information others post, to being an aggressive observer who looks at too much. And since this story has a happy ending, can’t we replace the term “stalking”?

2. In sociological terms, this is more like participant observation than nefarious stalking. On one hand, you want to observe to understand better why people do what they do. On the other hand, you end up interacting with those you observe, sharing in what they do with the hope that the participation provides new insights. Put together, you get both the insider and outsider perspective.

3. Here is the summary made about the benefits of observing:

Sometimes, when in doubt, just observe. It is a fine remedy for assumptions, bias, judgments, and the angst that can accompany living. It is also a fine remedy for spiritual bank accounts.

In other words, observation can help take the focus off yourself, see the world in new ways, and involve you in the lives of others. I wonder if taking the time needed to truly observe and also the skills required to figure out what is really going on are lost arts.