Christopher Mims takes a look at “data science” and one of its practitioners:
Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)
Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.
For the rest of us, Chen provides a concrete and accessible example: McDonald’s.
By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.
This sounds like an updated version of factor analysis: break a whole into its larger, more influential pieces.
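The question raised in the quoted passage, "how many groups should there be?", is the part that a Dirichlet process handles without fixing the number of clusters in advance. One standard way to picture it is the Chinese Restaurant Process, in which each new item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. What follows is a minimal sketch of that sampling scheme only, not Chen's code: the function name, `alpha`, and `seed` are hypothetical, and the sketch draws from the prior over groupings without looking at any item features, which a full mixture model would also weigh.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sample cluster assignments for n_items from a Chinese Restaurant Process.

    alpha is the concentration parameter (assumed > 0): larger values make
    it more likely that an item starts a new cluster.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index chosen for each item, in arrival order
    cluster_sizes = []  # cluster_sizes[k] = number of items currently in cluster k

    for _ in range(n_items):
        # Existing cluster k is weighted by its size; the extra final slot,
        # weighted by alpha, stands for "open a new cluster".
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # new cluster with one member
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    # Compare how many clusters appear for a small and a large alpha.
    for alpha in (0.5, 5.0):
        _, sizes = chinese_restaurant_process(200, alpha, seed=1)
        print(f"alpha={alpha}: {len(sizes)} clusters, "
              f"sizes={sorted(sizes, reverse=True)}")
```

The number of clusters is an outcome of the process rather than an input, which is the property that makes this family of models attractive when nobody knows ahead of time how many groups a menu, or any other dataset, should fall into.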
Here is how Chen describes the field:
I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:
* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)
* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.
* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.
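For readers unfamiliar with the MapReduce pattern Chen mentions, the sketch below imitates its three stages (map, shuffle, reduce) in a single process. It is an illustration under stated assumptions rather than his workflow: he describes writing such jobs in Scala on Hadoop, whereas this uses plain Python, and the function names, the hashtag-counting job, and the sample tweets are all made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; the mapper yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical job: count how often each hashtag appears.
def hashtag_mapper(tweet):
    for token in tweet.split():
        if token.startswith("#"):
            yield token.lower(), 1


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    # Made-up sample records, standing in for a much larger dataset.
    tweets = ["Loving #Data science", "#data and #viz", "no tags here"]
    print(run_job(tweets, hashtag_mapper, sum_reducer))
```

In a real cluster the map and reduce steps run in parallel across many machines and the shuffle moves data between them; the resulting counts are the kind of output that might then be passed to Python or R for further analysis, as the quoted workflow describes.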
I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting, but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing, but it seems like there is a long way to go here. I would also be interested in hearing Chen’s thoughts on how representative the data he typically works with is. In other words, how confident are he and others that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is? Can we ask and answer questions about all Americans or world residents from the data that is becoming available through new data sources?