Five main methods of detecting patterns in data mining

Here is a summary of five of the main methods used to uncover patterns in data mining:

Anomaly detection: In a large data set, it is possible to get a picture of what the data tends to look like in a typical case. Statistics can then be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this pattern for review and audit.
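To make this concrete, here is a minimal sketch of the idea (with made-up refund amounts, not real IRS data): flag any case that sits more than a couple of standard deviations from the mean.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical refund amounts: one return looks nothing like the rest.
refunds = [5200, 4800, 5100, 4900, 5000, 5050, 4950, 98000]
print(find_anomalies(refunds))  # only the 98000 return is flagged
```

Real systems model many variables at once, but the logic is the same: build a picture of "typical," then flag what deviates from it.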

Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
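A bare-bones version of association learning just counts which items appear together across purchase baskets. The baskets below are invented to echo the cocktail example; real recommenders like Amazon's work at vastly larger scale with more sophisticated rule mining:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Count how often each pair of items appears together across baskets,
    keeping only pairs seen at least `min_support` times."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase baskets.
baskets = [
    {"shaker", "recipe book", "martini glasses"},
    {"shaker", "recipe book", "martini glasses"},
    {"shaker", "recipe book"},
    {"gardening gloves", "trowel"},
]
print(frequent_pairs(baskets))
```

Pairs that clear the support threshold ("shaker" with "recipe book," etc.) become candidates for recommendations; the lone gardening basket never does.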

Cluster detection: One particularly useful type of pattern recognition is identifying distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories that they believe capture the relevant distinctions between apparent groups, risking the omission of important categories. With data mining, it is possible to let the data itself determine the groups. This is one of the black-box types of algorithm whose output can be hard to interpret. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the subgroups within a dataset that differ significantly from each other.
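A standard way to let the data determine the groups is k-means clustering. The sketch below is a toy version with invented two-dimensional "shopper" points; real implementations handle far more dimensions and data:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """A tiny k-means: repeatedly assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    random.seed(seed)
    centers = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                      + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers, clusters

# Invented shoppers: dollars spent on seeds vs. dollars spent on fishing lures.
shoppers = [(9, 1), (8, 2), (10, 1), (1, 9), (2, 8), (1, 10)]
centers, clusters = kmeans(shoppers, k=2)
print(clusters)  # the gardeners and the fishermen separate into two groups
```

Nobody told the algorithm "gardeners" and "fishermen" exist; the two groups fall out of the data itself.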

Classification: If an existing structure is already known, data mining can be used to classify new cases into pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent, systematic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and to classify incoming messages according to these rules with a high degree of accuracy.
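A toy version of a spam filter can be built from word counts alone. The training messages below are invented, and real filters use far larger corpora and more careful probability models, but the shape of the technique is the same:

```python
from collections import Counter

def train(messages):
    """Count word frequencies per label from pre-classified examples."""
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score a new message by which label's word usage it resembles more."""
    scores = {}
    for label, words in counts.items():
        total = sum(words.values())
        score = 1.0
        for w in text.lower().split():
            # Add-one smoothing so unseen words don't zero out a label.
            score *= (words[w] + 1) / (total + 1)
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "lunch meeting moved to noon"),
    ("ham", "see you at the meeting"),
]
model = train(examples)
print(classify(model, "claim your free prize"))
```

The filter never sees a rule like "prize means spam"; it learns that pattern from the labeled examples and applies it to new messages.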

Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
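The simplest version of such a predictive model is a one-variable least-squares regression. The numbers below are invented for illustration (Facebook's actual models would involve many more variables and far more data):

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y ≈ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Invented numbers: photos tagged last month vs. minutes on the site.
photos  = [0, 2, 4, 6, 8]
minutes = [10, 25, 38, 55, 70]
a, b = fit_line(photos, minutes)
prediction = a + b * 5  # predicted engagement for a user with 5 tagged photos
print(a, b, prediction)
```

Honing the model over time amounts to comparing `prediction` against observed behavior and adjusting the variables and weights accordingly.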

Several of these seem similar to methods commonly used by sociologists:

1. Anomaly detection seems like looking for outliers. On one hand, outliers can throw off basic measures of central tendency or dispersion. On the other hand, outliers can help prompt researchers to reassess their models and/or theories to account for the unusual cases.

2. Cluster detection and/or classification appear similar to factor analysis. This involves a statistical analysis of a set of variables to see which ones “hang together.” This can be helpful for finding categories and for reducing the number of variables in an analysis to a smaller number of important concepts.

3. Regression is used all the time both for modeling and predictions.
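On point 1, a quick illustration of how a single outlier distorts the mean while barely moving the median (hypothetical incomes, in thousands):

```python
import statistics

incomes = [42, 45, 47, 50, 52]      # hypothetical incomes, in thousands
with_outlier = incomes + [2000]     # add one extreme case

print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# The mean jumps from about 47 to over 370; the median barely moves.
```

This is exactly why an unexamined outlier can throw off a measure of central tendency, and why it deserves a second look from the researcher rather than silent inclusion.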

This all reminds me of what I heard in graduate school about the difference between data mining and statistical research: data mining amounted to atheoretical analysis. In other words, you might find relationships between variables (or apparent relationships, since there could always be a spurious association, or suppressor or distorter effects), but you wouldn’t have compelling explanations for those relationships. While you might be able to develop some explanations after the fact, this is a different process from hypothesis testing, where you set out in advance to look and test for relationships and patterns.

We need a more complex analysis of how taxes affect income inequality

One current blogosphere discussion about whether taxes could help reduce income inequality would benefit from more complex analyses. Here is the discussion thus far according to TaxProf Blog:

There have been a number of reports published recently that purport to show a link between rising inequality and changes in tax policy — especially tax cuts for the so-called rich. The latest installment comes from Berkeley professor Emmanuel Saez, Striking it Richer: The Evolution of Top Incomes in the United States.

Saez and others who write on this issue seem so intent on proving a link between tax policy and inequality that they overlook the major demographic changes that are occurring in America that can contribute to — or at least give the appearance of — rising inequality; a few of these being, differences in education, the rise of dual-earner couples, the aging of our workforce, and increased entrepreneurship.

Today, we will look at the link between education and income. Recent census data comparing the educational attainment of householders and income shows about as clearly as you can that America’s income gap is really an education gap and not the result of tax cuts for the rich.

The chart below shows that as people’s income rise, so too does the likelihood that they have a college degree or higher. By contrast, those with the lowest incomes are most likely to have a high school education or less. Just 8% of those at the lowest income level have a college degree while 78% of those earning $250,000 or more have a college degree or advanced degree. At the other end of the income scale, 69% of low-income people have a high school degree or less, while just 9% of those earning over $250,000 have just a high school degree.

This analysis starts in the right direction: looking at a direct relationship between two variables, such as tax rates and income inequality, is difficult to do in isolation from other factors. While some factors may be more influential than others, there are a number of reasons for income inequality. In other words, graphs with two variables are not enough. Pulling out one independent variable at a time doesn’t give us the full picture.

But then the supposedly better approach turns out to be that we were just looking at the wrong variable’s influence on income and should have been looking at education instead! So after being told that the situation is more complex, we get another two-variable graph showing that as education goes up, so does income, so perhaps it really isn’t about taxes at all.

What we need here is some more complex statistical analysis, preferably a regression analysis in which we can see how a variety of factors influence income inequality at the same time. Some of this might be a little harder to model, since you would want to account for changing tax rates, but arguing over two-variable graphs isn’t going to get us very far. Indeed, I wonder if this is now more common in debates: both sides like simpler analyses because each can make the point it wants without considering the full complexity of the matter. In other words, it is easier to make graphs that line up with ideological commitments than to truly sort out which factors are most influential in affecting income inequality.
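For what such an analysis looks like in miniature, here is a multiple regression sketch that estimates the effect of two factors at once. The education, tax-rate, and income numbers are entirely invented to illustrate the mechanics, not to describe the actual economy:

```python
def ols(X, y):
    """Multiple regression via the normal equations (X'X)b = X'y,
    solved with plain Gaussian elimination. An intercept is added."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                       # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coefs = [0.0] * k                          # back substitution
    for i in reversed(range(k)):
        coefs[i] = (b[i] - sum(A[i][j] * coefs[j]
                               for j in range(i + 1, k))) / A[i][i]
    return coefs

# Invented rows: (years of education, top tax rate) -> income in thousands.
X = [(12, 70), (16, 70), (12, 35), (16, 35), (20, 35), (20, 70)]
y = [26, 46, 33, 53, 73, 66]
print([round(c, 2) for c in ols(X, y)])  # intercept, education, tax effects
```

The point is not these particular coefficients but the structure: each coefficient estimates one factor's influence while holding the others constant, which is what a pair of dueling two-variable graphs cannot do.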

An emerging portrait of emerging adults in the news, part 1

In recent weeks, a number of studies have been reported that discuss the beliefs and behaviors of the younger generation, those who are now between high school and age 30 (an age group that could also be labeled “emerging adults”). In a three-part series, I want to highlight three of these studies because they not only suggest what this group is doing but also hint at the consequences.

Almost a week ago, a story ran along the wires about a new study linking “hyper-texting” and excessive usage of social networking sites with risky behaviors:

Teens who text 120 times a day or more — and there seems to be a lot of them — are more likely to have had sex or used alcohol and drugs than kids who don’t send as many messages, according to provocative new research.

The study’s authors aren’t suggesting that “hyper-texting” leads to sex, drinking or drugs, but say it’s startling to see an apparent link between excessive messaging and that kind of risky behavior.

The study concludes that a significant number of teens are very susceptible to peer pressure and also have permissive or absent parents, said Dr. Scott Frank, the study’s lead author.

The study was done at 20 public high schools in the Cleveland area last year, and is based on confidential paper surveys of more than 4,200 students.

It found that about one in five students were hyper-texters and about one in nine are hyper-networkers — those who spend three or more hours a day on Facebook and other social networking websites.

About one in 25 fall into both categories.

Hyper-texting and hyper-networking were more common among girls, minorities, kids whose parents have less education and students from a single-mother household, the study found.

Several interesting things to note in this study:

1. It did not look at what exactly is being said/communicated in these texts or in social networking use. This study examines the volume of use – and there are plenty of high school students who are heavily involved with these technologies.

2. One of the best parts of this story is that the second paragraph is careful to suggest that finding an association between these behaviors does not mean that one causes the other. In other words, there is not a direct link between excessive texting and drug use; based on this dataset, these variables are simply related. (This is a great example of “correlation without causation.”)

3. What this study calls for is regression analysis, where we can control for other possible factors. That would give us the ability to compare two students with the same family background and the same educational performance and isolate whether texting was really the factor that led to the risky behaviors. If I had to guess, factors like family life and performance in school are more important in predicting these risky behaviors. Then, excessive texting or SNS use is an intervening variable. Why this study did not do this sort of analysis is unclear – perhaps the authors already have a paper in the works.
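To illustrate why controlling for other factors matters, here is a sketch with invented student records, deliberately constructed so that an overall texting/risk association disappears once parental monitoring is held constant:

```python
def risky_rate(rows, texter):
    """Share of students with risky behavior, among (non-)hyper-texters."""
    sub = [r for r in rows if r[1] == texter]
    return sum(1 for r in sub if r[2]) / len(sub)

# Invented records: (parental monitoring, hyper-texter?, risky behavior?)
rows  = [("low",  True,  True)] * 32 + [("low",  True,  False)] * 8
rows += [("low",  False, True)] *  8 + [("low",  False, False)] * 2
rows += [("high", True,  True)] *  2 + [("high", True,  False)] * 8
rows += [("high", False, True)] *  8 + [("high", False, False)] * 32

# Overall, hyper-texters look much riskier...
print(risky_rate(rows, True), risky_rate(rows, False))
# ...but within each level of parental monitoring, the gap vanishes.
for level in ("low", "high"):
    stratum = [r for r in rows if r[0] == level]
    print(level, risky_rate(stratum, True), risky_rate(stratum, False))
```

In this fabricated example, texting is only a marker for low monitoring, not a cause of risk; that is exactly the kind of pattern a regression with controls (or simple stratification, as here) can expose and a two-variable comparison cannot.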

Overall, we need more research on these associated variables. While it is interesting in itself that large numbers of emerging adults text a lot and use SNS a lot, we ultimately want to know the consequences. Parts two and three of this series will look at a few studies that suggest some possible consequences.