“Nothing that is off-limits to political data mining”

Your consumer data is of value to political campaigns and parties eager to reach individual voters:

But as presidential campaigns push into a new frontier of voter targeting, scouring social media accounts, online browsing habits and retail purchasing records of millions of Americans, they have brought a privacy imposition unprecedented in politics. By some estimates, political candidates are collecting more personal information on Americans than even the most aggressive retailers. Questions are emerging about how much risk the new order of digital campaigning is creating for unwitting voters as the vast troves of data accumulated by political operations become increasingly attractive to hackers…

“There is a tremendous amount of data out there and the question is what types of controls are in place and how secure is it,” said Craig Spiezle, executive director of the nonprofit Online Trust Alliance. The group’s recent audit of campaign websites for privacy, security and consumer protection gave three-quarters of the candidates failing grades…One firm, Aristotle, boasts how it helped a senior senator win reelection in 2014 using “over 500 demographic and consumer points, which created a unique voter profile of each constituent.” Company officials declined an interview request.

When investigators in Congress and the FTC looked into the universe of what data brokers make available to their clients – be they political, corporate or nonprofit – some of the findings were unsettling. One company was selling lists of rape victims; another was offering up the home addresses of police officers.

Several things are worth noting here. First, it sounds like the majority of this data is not collected by political actors; rather, they aggregate it to help predict voter behavior. In other words, this data collection is happening whether political actors use the information or not, which makes this a bigger issue than just politics. Second, should American residents be more concerned that this information is available to political campaigns or to corporations? The story suggests political campaigns aren’t well prepared to protect all this data, but how do corporations stack up? Again, the larger issue is who is gathering all of this data in the first place, from where, and how it is being protected.

Another area worth thinking more about is how effective all this data actually is in elections. This story doesn’t say, and numerous other stories I’ve read on the subject tend not to say either: just how big are the differences in voting behavior among these microgroups or people identified by particular consumer behaviors? Is this the only way to win campaigns today (see media reports on political campaigns successfully using this data here and here)? Is this knowledge worth 1%, 5%, or 10% in the final outcome? Perhaps this is hard to get at because this is a relatively new phenomenon and because data companies as well as campaigns want to guard their proprietary methods. Yet it is hard to know how big of a deal this is to either consumers or political actors. Is this data mining manipulating elections?

Data mining for red flags indicating corruption

Two sociologists have developed a method for finding red flags of corruption in public databases:

Researchers at the University of Cambridge have developed a series of algorithms that mine public procurement data for “red flags” — signs of the abuse of public finances. Scientists interviewed experts on public corruption to identify the kinds of anomalies that might indicate something fishy.

Their research allowed them to home in on a series of red flags, like an unusually short tender period. If a request for proposal is issued by the government on a Friday and a contract is awarded on Monday — red flag…

Some of the other red flags identified by the researchers include tender modifications that result in larger contracts, few bidders in a typically competitive industry, and inaccessible or unusually complex tender documents…

“Imagine a mobile app containing local CRI data, and a street that’s in bad need of repair. You can find out when public funds were allocated, who to, how the contract was awarded, how the company ranks for corruption,” explained Fazekas. “Then you can take a photo of the damaged street and add it to the database, tagging contracts and companies.”
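The checks described above amount to simple rules applied over procurement records. Here is a minimal sketch in Python — the field names and thresholds are my own invention for illustration, not the researchers’ actual schema:

```python
from datetime import date

def red_flags(record):
    """Return a list of red-flag labels for one procurement record."""
    flags = []
    tender_days = (record["award_date"] - record["rfp_date"]).days
    if tender_days <= 3:  # e.g. Friday RFP, Monday award
        flags.append("unusually short tender period")
    if record["final_value"] > 1.2 * record["initial_value"]:
        flags.append("modification enlarged the contract")
    if record["num_bidders"] <= 1 and record["sector_typically_competitive"]:
        flags.append("few bidders in a competitive sector")
    return flags

example = {
    "rfp_date": date(2015, 6, 5),    # a Friday
    "award_date": date(2015, 6, 8),  # the following Monday
    "initial_value": 100_000,
    "final_value": 150_000,
    "num_bidders": 1,
    "sector_typically_competitive": True,
}
print(red_flags(example))  # all three flags fire
```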

This is a good use of data mining because it doesn’t require developing theoretical explanations after the fact: the red flags came from expert knowledge up front. Why not make use of such public information?

At the same time, simply finding these red flags may not be enough. I could imagine websites that track all of these findings and dog officials and candidates. Yet, are these red flags proof of corruption or just indicative that more digging needs to be done? There could be situations where officials would justify these anomalies. It could still take persistent effort and media attention to push from just noting these anomalies to suggesting a response is required.

Hillary Clinton’s biggest urban Facebook fan base is Baghdad?

Melding political, social media, and urban analysis, a look at Hillary Clinton’s Facebook fans has an interesting geographic dimension:

Hillary Clinton’s Facebook pages have an unexpected fan base. At least 7 percent of Clinton’s Facebook fans list their hometown as Baghdad, way more than any other city in the world, including in the United States.

Vocativ’s exclusive analysis of Clinton’s Facebook fan statistics yielded a number of surprises. Despite her reputation as an urban Democrat favored by liberal elites, Iraqis and southerners are more likely to be a Facebook fan of Hillary than people living on America’s coasts. And the Democratic candidate for president has one of her largest followings in the great red-state of Texas.

While Chicago and New York City, both with 4 percent of fans, round out the top three cities for Hillary’s Facebook base, Texas’ four major centers—Houston (3 percent), Dallas (3 percent), Austin (2 percent) and San Antonio (2 percent)—contain more of her Facebook supporters. Los Angeles with 3 percent of her fans, and Philadelphia and Atlanta, each with 2 percent, round out the Top 10 cities for Facebook fans of Hillary.

On a per capita basis, in which Vocativ compared a town’s population to percentage of Hillary’s likes, people living in cities and towns in Texas, Kentucky, Ohio, Arkansas, North Carolina and Wisconsin were more likely to be her fans on Facebook than any other American residents.
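The per-capita comparison boils down to simple division: a city’s share of the total fan base relative to its population. A quick sketch with entirely invented numbers:

```python
# Fans contributed by a city per resident; all numbers below are made up
# to show how a small city can "beat" a big one on a per-capita basis.
def fans_per_capita(fan_share, total_fans, city_population):
    return (fan_share * total_fans) / city_population

small = fans_per_capita(0.01, 1_000_000, 50_000)   # 1% of fans, small town
big = fans_per_capita(0.03, 1_000_000, 4_000_000)  # 3% of fans, big city
print(small > big)  # -> True: the small town is more of a fan base per capita
```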

This hints at the broader knowledge we might gain from social media and raises the question of how this information could be put to good use. I imagine it could be used for political ends. Is this a curiosity? Is this something the Clinton campaign would want to change? Would this influence the behavior of other voters? The article itself is fairly agnostic about what this means.

This sounds like data mining and here is how the company behind this – Vocativ – describes its mission:

Vocativ is a media and technology venture that explores the deep web to discover original stories, hidden perspectives, emerging trends, and unheard voices from around the world. Our audience is the young, diverse, social generation that wants to share what’s interesting and what’s valuable. We reach them with a visual language, wherever they are naturally gathering…

Our proprietary technology, Verne, allows us to search and monitor the deep web to spot breaking news quickly, and discover stories that otherwise might not be told. Often we know what we’re looking for, such as witnesses near the front lines of a conflict or data related to an emerging political movement. We also uncover unexpected information, like pro-gun publications giving away assault rifles to fans of their Facebook pages.

Is this the Freakonomicization of journalism?

Analyzing Netflix’s thousands of movie genres

Alexis Madrigal decided to look into the movie genres of Netflix – and found lots of interesting data:

As the hours ticked by, the Netflix grammar—how it pieced together the words to form comprehensible genres—began to become apparent as well.

If a movie was both romantic and Oscar-winning, Oscar-winning always went to the left: Oscar-winning Romantic Dramas. Time periods always went at the end of the genre: Oscar-winning Romantic Dramas from the 1950s.

In fact, there was a hierarchy for each category of descriptor. Generally speaking, a genre would be formed out of a subset of these components:

Region + Adjectives + Noun Genre + Based On… + Set In… + From the… + About… + For Age X to Y

Yellin said that the genres were limited by three main factors: 1) they only want to display 50 characters for various UI reasons, which eliminates most long genres; 2) there had to be a “critical mass” of content that fit the description of the genre, at least in Netflix’s extended DVD catalog; and 3) they only wanted genres that made syntactic sense.
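Given the component ordering and the 50-character cap, genre names could plausibly be assembled mechanically. A toy sketch — the component ordering and length limit come from the article, but everything else is my guess at how such assembly might work:

```python
# Ordered component slots, following the template quoted above.
ORDER = ["region", "adjective", "noun_genre", "based_on", "set_in",
         "from_the", "about", "for_ages"]

def make_genre(components, max_len=50):
    """Join whichever components are present, in the fixed order;
    drop genres that blow past the UI's character budget."""
    name = " ".join(components[k] for k in ORDER if k in components)
    return name if len(name) <= max_len else None

g = make_genre({"adjective": "Oscar-winning Romantic",
                "noun_genre": "Dramas",
                "from_the": "from the 1950s"})
print(g)  # -> "Oscar-winning Romantic Dramas from the 1950s"
```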

And the conclusion is that there are so many genres that they don’t necessarily make sense to humans. This strikes me as a uniquely modern problem: we know how to find patterns via algorithm and then we have to decide whether we want to know why the patterns exist. We might call this the Freakonomics problem: we can collect reams of data, data mine it, and then have to develop explanations. This, of course, is the reverse of the typical scientific process that starts with theories and then goes about testing them. The Netflix “reverse engineering” can be quite useful but wouldn’t it be nice to know why Perry Mason and a few other less celebrated actors show up so often?

At the least, I bet Hollywood would like access to such explanations. This also reminds me of the Music Genome Project that underlies Pandora. Unlock the genres and there is money to be made.

Five main methods of detecting patterns in data mining

Here is a summary of five of the main methods utilized to uncover patterns when data mining:

Anomaly detection: In a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.
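A minimal illustration of the idea: flag values that sit far from the bulk of the data. This is a pure-Python sketch with invented deduction figures, not any agency’s actual method:

```python
import statistics

# Flag values more than `threshold` standard deviations from the mean.
def anomalies(values, threshold=2.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Six ordinary-looking returns and one that differs sharply.
deductions = [5_000, 5_200, 4_800, 5_100, 4_900, 5_050, 95_000]
print(anomalies(deductions))  # -> [95000]
```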

Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
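At its simplest, association learning starts with counting which items appear together in the same basket. A toy sketch of that counting step (Amazon’s and Netflix’s actual systems are, as the summary notes, far more complex):

```python
from itertools import combinations
from collections import Counter

# Invented purchase baskets echoing the cocktail-shaker example above.
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "martini glasses"},
    {"recipe book", "martini glasses"},
    {"garden trowel", "seeds"},
]

# Count every pair of items bought together; frequent pairs would drive
# "customers who bought X also bought Y" suggestions.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("cocktail shaker", "martini glasses")])  # -> 2
```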

Cluster detection: One type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This is one of the black-box type of algorithms that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.
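The classic algorithm for letting the data pick its own groups is k-means. A bare-bones sketch on invented purchasing data, where two hobbyist groups separate cleanly:

```python
# Toy k-means: group customers by (garden-supply dollars, fishing-supply
# dollars). A deterministic start is fine for this simple example.
def kmeans(points, k, iters=10):
    centers = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k),
                          key=lambda i: (x - centers[i][0]) ** 2 +
                                        (y - centers[i][1]) ** 2)
            groups[nearest].append((x, y))
        centers = [(sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

customers = [(90, 5), (85, 10), (95, 8),   # look like gardeners
             (5, 80), (10, 90), (8, 85)]   # look like fishermen
groups = kmeans(customers, k=2)
print([len(g) for g in groups])  # -> [3, 3]
```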

Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent systemic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and classify incoming messages according to these rules with a high degree of accuracy.
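A toy version of the spam-filter idea: learn word frequencies from pre-labeled examples, then score new messages against each class. Real filters are far more sophisticated, but the mechanism is the same:

```python
from collections import Counter

# Learn word counts per class from labeled (label, text) pairs.
def train(messages):
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in messages:
        counts[label].update(text.lower().split())
    return counts

# Score a message under each class (with crude add-one smoothing)
# and return the more likely label.
def classify(counts, text):
    def score(label):
        total = sum(counts[label].values())
        s = 1.0
        for word in text.lower().split():
            s *= (counts[label][word] + 1) / (total + 1)
        return s
    return max(("spam", "ham"), key=score)

training = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting moved to tuesday"),
    ("ham", "lunch on tuesday works"),
]
model = train(training)
print(classify(model, "claim your free prize"))   # -> spam
print(classify(model, "tuesday meeting agenda"))  # -> ham
```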

Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
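The simplest version of such a predictive model is ordinary least squares with a single predictor, fit with the closed-form formulas. The engagement numbers below are invented for illustration, not anything Facebook actually does:

```python
# Fit y = slope * x + intercept by ordinary least squares.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

photos_tagged = [0, 2, 4, 6, 8]
minutes_on_site = [10, 14, 18, 22, 26]  # perfectly linear, to keep it simple
slope, intercept = fit(photos_tagged, minutes_on_site)
print(slope, intercept)        # -> 2.0 10.0
print(slope * 10 + intercept)  # predicted minutes at 10 tags -> 30.0
```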

Several of these seem similar to methods commonly used by sociologists:

1. Anomaly detection seems like looking for outliers. On one hand, outliers can throw off basic measures of central tendency or dispersion. On the other hand, outliers can help prompt researchers to reassess their models and/or theories to account for the unusual cases.

2. Cluster detection and/or classification appear similar to factor analysis. This involves a statistical analysis of a set of variables to see which ones “hang together.” This can be helpful for finding categories and reducing the number of variables in an analysis to a smaller set of important concepts.

3. Regression is used all the time both for modeling and predictions.

This all reminds me of what I heard in graduate school about the difference between data mining and statistical research: data mining amounted to atheoretical analysis. In other words, you might find relationships between variables (or apparent relationships – these could always be spurious associations, or there could be suppressor or distorter effects) but you wouldn’t have compelling explanations for those relationships. While you might be able to develop some explanations, this is a different process than hypothesis testing, where you set out to look and test for relationships and patterns.

The rise of “data science” as illustrated by examining the McDonald’s menu

Christopher Mims takes a look at “data science” and one of its practitioners:

Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)

Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.

For the rest of us, Chen provides a concrete and accessible example: McDonald’s.

By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.

This sounds like an updated version of factor analysis: break a whole into its larger and influential pieces.

Here is how Chen describes the field:

I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:

* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)

* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.

* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.
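The map/reduce pattern Chen mentions can be illustrated in a few lines of plain Python — a stand-in for the Scala/Hadoop jobs he describes, not his actual workflow:

```python
from collections import defaultdict

# Minimal map/reduce: map each record to (key, value) pairs, shuffle
# values by key, then reduce each group to a single result.
def map_reduce(records, mapper, reducer):
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(values) for key, values in shuffled.items()}

# Word counts across a handful of invented tweets.
tweets = ["data science is fun", "science of data", "fun with data"]
word_counts = map_reduce(
    tweets,
    mapper=lambda tweet: [(word, 1) for word in tweet.split()],
    reducer=sum,
)
print(word_counts["data"])  # -> 3
```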

I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing but it seems like there is a ways to go here. I would also be interested in hearing Chen’s thoughts on how representative the data is that he typically works with. In other words, how confident he and others are that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is. Can we ask and answer questions about all Americans or world residents from the data that is becoming available through new data sources?

h/t Instapundit

Obama campaign data mining information for fundraising, voters

Politico reports on how the Obama campaign is using data mining in its quest to win reelection:

Obama for America has already invested millions of dollars in sophisticated Internet messaging, marketing and fundraising efforts that rely on personal data sometimes offered up voluntarily — like posts on a Facebook page— but sometimes not.

And according to a campaign official and former Obama staffer, the campaign’s Chicago-based headquarters has built a centralized digital database of information about millions of potential Obama voters.

It all means Obama is finding it easier than ever to merge offline data, such as voter files and information purchased from data brokers, with online information to target people with messages that may appeal to their personal tastes. Privacy advocates say it’s just the sort of digital snooping that his new privacy project is supposed to discourage…

There’s an added twist for Obama: He’s making these moves at the same moment his administration is pushing the virtues of online privacy, last month proposing a consumer bill of rights to protect it.

This has been brewing for some time: back in July 2011, Ben Smith reported that the Obama campaign was advertising for “Predictive Modeling/Data Mining Scientists and Analysts.”

I really want to ask: what took so long? This is a gold mine for candidates.

I’ll be curious to see how far these hypocrisy charges go. If companies are going to make money off the Internet, don’t they have to have some of these abilities to put information together? Which group do people trust less to have their information: corporations or political parties?

After case of fraud, researchers discuss other means of “misusing research data”

The news that a prominent Dutch social psychologist published fraudulent work has pushed other researchers to talk about other forms of “misusing research data”:

Even before the Stapel case broke, a flurry of articles had begun appearing this fall that pointed to supposed systemic flaws in the way psychologists handle data. But one methodological expert, Eric-Jan Wagenmakers, of the University of Amsterdam, added a sociological twist to the statistical debate: Psychology, he argued in a recent blog post and an interview, has become addicted to surprising, counterintuitive findings that catch the news media’s eye, and that trend is warping the field…

In September, in comments quoted by the statistician Andrew Gelman on his blog, Mr. Wagenmakers wrote: “The field of social psychology has become very competitive, and high-impact publications are only possible for results that are really surprising. Unfortunately, most surprising hypotheses are wrong. That is, unless you test them against data you’ve created yourself.”…

To show just how easy it is to get a nonsensical but “statistically significant” result, three scholars, in an article in November’s Psychological Science titled “False-Positive Psychology,” first showed that listening to a children’s song made test subjects feel older. Nothing too controversial there.

Then they “demonstrated” that listening to the Beatles’ “When I’m 64” made the test subjects literally younger, relative to when they listened to a control song. Crucially, the study followed all the rules for reporting on an experimental study. What the researchers omitted, as they went on to explain in the rest of the paper, was just how many variables they poked and prodded before sheer chance threw up a headline-making result—a clearly false headline-making result.
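It is easy to simulate the trap the authors describe: test enough unrelated variables against an outcome and chance alone tends to produce “significant” correlations. A rough sketch, using the conventional cutoff of |r| > ~0.36 (roughly p < .05 for a sample of 30):

```python
import random

# Correlate `n_vars` pure-noise predictors with a pure-noise outcome and
# count how many clear the "significance" cutoff by chance alone.
def chance_hits(n_vars=20, n=30, threshold=0.36, seed=1):
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n)]

    def corr(xs, ys):
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    hits = 0
    for _ in range(n_vars):
        predictor = [rng.gauss(0, 1) for _ in range(n)]  # pure noise
        if abs(corr(predictor, outcome)) > threshold:
            hits += 1
    return hits

print(chance_hits())  # often at least one spurious "finding"
```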

If the pressure is great to publish (and it certainly is), then there have to be some countermeasures to limit unethical research practices. Here are a few ideas:

1. Giving more people access to the data. In this way, people could check up on other people’s published findings. But if the fraudulent studies are already published, perhaps this is too late.

2. Having more people provide oversight over the project along the way. This doesn’t necessarily have to be a bureaucratic board, but having only one researcher look at the data and do the analysis (as in the Stapel case) creates more opportunity for an individual to twist the data. This could be an argument for collaborative research.

3. Could there be more space within disciplines and journals to discuss the research project? While papers tend to have very formal hypotheses, there is a lot of messy work that goes into these but very little room to discuss how the researchers arrived at them.

4. Decrease the value of media attention. I don’t know how to deal with this one. What researcher doesn’t want to have more people read their research?

5. Have a better educated media so that they don’t report so many inconsequential and shocking studies. We need more people like Malcolm Gladwell who look at a broad swath of research and summarize it rather than dozens of reports grabbing onto small studies. This is the classic issue with nutrition reporting: eggs are great! A new study says they are terrible! A third says they are great for pregnant women and no one else! We rarely get overviews of this research or real questions about its value. We just get: “a study proved this oddity today…”

6. Resist data mining. Atheoretical correlations don’t help much. Let theories guide statistical models.

7. Have more space to publish negative findings. This would help researchers feel less pressure to come up with positive results.

Claim: “Facebook knows when you’ll break up”

There is an interesting chart going around that is based on Facebook data and claims to show when people are more prone to break up. Here is a quick description of the chart:

British journalist and graphic designer David McCandless, who specializes in showcasing data in visual ways, compiled the chart. He showed off the graphic at a TED conference last July in Oxford, England.

In the talk, McCandless said he and a colleague scraped 10,000 Facebook status updates for the phrases “breakup” and “broken up.”

They found two big spikes on the calendar for breakups. The first was after Valentine’s Day — that holiday has a way of defining relationships, for better or worse — and in the weeks leading up to spring break. Maybe spring fever makes people restless, or maybe college students just don’t want to be tied down when they’re partying in Cancun.
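The scrape-and-count method behind the chart is straightforward to sketch. Here is a toy version on made-up statuses (with the obvious caveat that phrase matching misses plenty):

```python
from collections import Counter
from datetime import date

PHRASES = ("breakup", "broken up")

# Tally status updates mentioning a break-up, bucketed by (year, month).
def breakups_by_month(statuses):
    tally = Counter()
    for day, text in statuses:
        if any(phrase in text.lower() for phrase in PHRASES):
            tally[(day.year, day.month)] += 1
    return tally

statuses = [
    (date(2010, 2, 15), "Well, that breakup came right after Valentine's"),
    (date(2010, 2, 20), "we have broken up"),
    (date(2010, 3, 5),  "spring break plans!"),  # no match
    (date(2010, 3, 8),  "officially broken up now"),
]
print(breakups_by_month(statuses))
```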

These are potentially interesting findings, and the chart is a clever way to present the data. But when you consider how the data was collected, perhaps it isn’t so great. A few thoughts on the subject:

1. The best way to figure this out would be to convince Facebook to let you have the data for relationship status changes.

2. Searching for the words “breakup” and “broken up” might catch some, or perhaps even many, ended relationships, but not all. Does everyone include these words when talking about ending a relationship?

3. Are 10,000 status updates a representative sample of all Facebook statuses?

4. Is there a lag time involved in reporting these changes? Monday, for example, is the most popular day for announcing break-ups, not necessarily for break-ups occurring on that day. Do people immediately run to Facebook to tell the world that they have ended a relationship?

5. Does everyone initially “register” and then “unregister” a relationship on Facebook anyway?

The more I think about it, it is a big claim to make that “Facebook knows when you are going to break up” based on this data mining exercise.

Race as a lesser factor in forming friendships on Facebook

A new study in the American Journal of Sociology finds that a shared racial identity was less important than several other factors when making friends on Facebook:

“Sociologists have long maintained that race is the strongest predictor of whether two Americans will socialize,” said Andreas Wimmer, the study’s lead author and a sociologist at UCLA…

In fact, the strongest attraction turned out to be plain, old-fashioned social pressure. For the average student, the tendency to reciprocate a friendly overture proved to be seven times stronger than the attraction of a shared racial background, the researchers found…

Other mechanisms that proved stronger than same-race preference included having attended an elite prep school (twice as strong), hailing from a state with a particularly distinctive identity such as Illinois or Hawaii (up to two-and-a-half times stronger) and sharing an ethnic background (up to three times stronger).

Even such routine facts of college life as sharing a major or a dorm often proved at least as strong as, if not stronger than, race in drawing together potential friends, the researchers found.

Interesting findings – perhaps Facebook is a new world or younger generations don’t pay as much attention to race.

Additionally, it is interesting to read about the methodology of the study, which took place at a school where 97% of students had Facebook profiles; the sociologists measured friendships in terms of photo tagging (and not who was actually listed as a “friend”).

A couple of questions I have: is behavior on Facebook and choosing friends reflective of actual social patterns in the real world? Is there a selection issue going on here – not all students or people of this age use Facebook, so are college students who use Facebook already more likely to form cross-racial friendships?