The implications of discontinuing the American Community Survey

This didn’t exactly make the front page this week, but a vote in the House of Representatives on the American Community Survey could have a big impact on how we understand the United States. Nate Berg explains:

So the Republican-led House of Representatives this week voted 232-190 to eliminate the American Community Survey, the annual survey of about 3 million randomly chosen U.S. households that’s like the Census only much more detailed. It collects demographic details such as what sort of fuel a household uses for heating, the cost of rent or mortgage payments, and what time residents leave home to go to work.

In a post on the U.S. Census Bureau’s website, Director Robert Groves says the bill “devastates the nation’s statistical information about the status of the economy and the larger society. Modern societies need current, detailed social and economic statistics. The U.S. is losing them.”

While the elimination of the ACS would take a slight nibble out of the roughly $3.8 trillion in government expenditures proposed in the 2013 federal budget, its negative impacts could be much greater – affecting the government’s ability to fund a wide variety of services and programs, from education to housing to transportation.

The issue is that the information collected in the ACS is used heavily by the federal government to figure out where it will spend a huge chunk of its money. In a 2010 report for the Brookings Institution, Andrew Reamer found that in the 2008 fiscal year, 184 federal domestic assistance programs used ACS-related datasets to help determine the distribution of more than $416 billion in federal funding. The bulk of that funding, more than 80 percent, went directly to fund Medicaid, highway infrastructure programs and affordable housing assistance. Reamer, now a research professor at George Washington University’s Institute of Public Policy, also found that the federal government uses the ACS to distribute about $100 billion annually to states and communities for economic development, employment, education and training, commerce and other purposes. He says that should the ACS be eliminated, it would be very difficult to figure out how to distribute this money where it’s needed…

And it’s not just government money that would be wasted. Reamer says many businesses are increasingly reliant on the market data available within the ACS, and that without it they would have much less success picking locations where their businesses would have market demand. It would affect businesses throughout the country, “from mom-and-pops to Walmart.”

Some history might also be helpful here. The United States has carried out a decennial census since 1790, but the American Community Survey began in the mid-1990s. There has been talk in recent years of replacing the expensive and complicated decennial census with a beefed-up American Community Survey. There would be several advantages: it wouldn’t cost as much, plus the government (and the country) would have more consistent information rather than waiting ten years between counts. In other words, our country is rapidly changing and we need consistent information that can tell us what is happening.

In my mind, as a researcher who consistently uses Census data, dropping the ACS would be a big loss. The government funding is important but even more important to me would be losing the more up-to-date information the ACS provides. Without this survey, we would likely have to rely on private data which is often restrictive and/or expensive. For example, I’ve used ACS data to track some housing issues but without this, I’m not sure where I could get similar data.

This is part of a larger issue of conservatives wanting to limit the reach of the Census Bureau. The argument often is that the Census is too intrusive, therefore invading the privacy of citizens (see this 2011 story about an insistent ACS worker), and the Constitution only provides for a decennial census. I wonder if these arguments are red herrings: there is a long history of battling over Census counts and timing depending on which political party might benefit. For example, see Republican claims that inappropriate sampling techniques were used to correct undercounts for big cities, claims that the Census “imputes” races to people (so mark your race as American!), or efforts by New York City to ask for a recount in order to boost their 2010 population figures, which are tied to funding. In other words, the Census can turn into a political football even though its data is very important and it uses social science research techniques.

We need better data on loneliness and its effects

In response to the recent Atlantic cover story “Is Facebook Making Us Lonely?” by Stephen Marche, sociologist Eric Klinenberg suggests the data is much less clear than the cover story suggests.

This debate suggests two things:

1. We need better data on loneliness and how it affects people. There are multiple ways that this could be done but perhaps we need a methodological breakthrough. I’ve been thinking lately that we need better ways to know what people do when they are alone. Now, we rely on after-the-fact questions rather than observational data. If we ask the same questions over time (such as the famous one about how many confidants respondents have), we can track changes over time but this also requires interpretation. How much loneliness is acceptable and “normal” before there are adverse effects? Do the importance and effects of loneliness change over the life course? Is loneliness mitigated by other social forces?

2. Without this more conclusive data, I think we end up having a proxy battle over two warring American schools of thought: communitarianism versus individualism. This dates back to the early days of the American experiment. Who is more virtuous, the cosmopolitan city dweller or the self-reliant farmer or frontiersman? Should we all live in urban areas or preserve small town life? Should the government help people get an equal shot at success or help defend people from each other? Should religion be expressed in the public sphere or should it be compartmentalized? Several well-known social science works in recent decades have tackled these divides including the 1985 classic Habits of the Heart. Both Klinenberg and Marche seem to bring these ideological approaches to their arguments and then look for the data that supports their points. For example, Klinenberg admits that loneliness will be felt by those who live alone but this is desirable because living alone allows for other good things to happen.

Data guru Hans Rosling named to Time’s 100 most influential people

Hans Rosling’s talks are fascinating as he makes data and charts exciting and explanatory in his own enthusiastic manner. Named as one of the 100 most influential people by Time, Rosling is profiled by sociologist and MD Nicholas Christakis:

Hans Rosling trained in statistics and medicine and spent years on the front lines of public health in Africa. Yet his greatest impact has come from his stunning renderings of the numbers that characterize the human condition.

His 2006 TED talk, in which he animated statistics to tell the story of socio-economic development, has been viewed over 3.8 million times and translated into dozens of languages. His subsequent talks have moved millions of people worldwide to see themselves and our planet in new ways by showing how our actions affect our health and wealth and one another across space and time.

When you meet Rosling, 63, you are struck by his energy and clarity. He has the quiet assurance of a sword swallower (which he is) but also of a man who is in the vanguard of a critically important activity: advancing the public understanding of science.

What does Rosling make of his statistical analysis of worldwide trends? “I am not an optimist,” he says. “I’m a very serious possibilist. It’s a new category where we take emotion apart and we just work analytically with the world.” We can all, Rosling thinks, become healthy and wealthy. What a promising thought, so eloquently rendered with data.

Here are some of Rosling’s presentations that are well worth watching:

200 Countries, 200 Years, 4 minutes – The Joy of Stats

TED Talk: No More Boring Data

TED Talk: The Good News of the Decade?

Here is what The Economist thinks are Rosling’s greatest hits.

I’ve used several of Rosling’s talks in class to illustrate what is possible with data and charts. Rosling gets at an important issue: data should tell a story and be interactive and available to people so they too can dig into it and understand the world better. By simply taking a chart and adding some extra information (like population size of a country displayed as a larger circle or being able to quickly show the quartile income distributions for a country) and the dimension of time, you can start to visualize patterns and possible explanations of how the world works.

(A side note: alas, I don’t think any sociologists were named as one of the 100 most influential people.)

History – facts = sociology?

Lamenting how history is taught in today’s schools, one writer argues that history without facts is just sociology:

My son’s teacher confirmed that this is broadly true. The teaching of history in British schools is increasingly influenced by US methods of presenting the past thematically rather than chronologically. Thus pupils might study crime and punishment, or kingship, and dip in and out of different centuries. Consequently, dates lose their value. So 1605, which for me means the Gunpowder Plot, for my son simply means that he is five minutes late for games.

I didn’t argue with his teacher, and in any case there is more than one way to skin a cat, as Torquemada (1420-1498) knew. Besides, a slant on history that was good enough for two of our greatest historians, WC Sellar and RJ Yeatman, ought to be good enough for me. The subtitle of their enduringly delightful 1930 book, 1066 And All That, was A Memorable History of England comprising all the parts you can remember, including 103 Good Things, 5 Bad Kings, and 2 Genuine Dates.

Maybe it wasn’t crusty American academics but Sellar and Yeatman, having a laugh, who really popularised the notion that history can be taught largely without dates. “The first date in English history is 55BC,” they wrote, referring to the arrival of Julius Caesar and his legions on the pebbly shores of Kent. “For the other date, see Chapter 11, William the Conqueror.” They didn’t specify the year in which the King of Spain “sent the Great Spanish Armadillo to ravish the shores of England”.

Whatever, I can see the logic of going down the thematic rather than the chronological route. And I made sympathetic noises when my son’s teacher explained that “it’s helpful for those pupils who struggle to take in lots of facts”. But even if we leave out dates, aren’t facts what history is all about? The rest, as they say, is sociology.

This is not an unusual complaint: each new generation seems to know less history, and perhaps even more troubling, they don’t seem to care.

A couple of other thoughts:

1. Why can’t you have both dates and thematic approaches? Knowing dates doesn’t necessarily mean that a student knows what to do with the information or that they grasp the broad sweep of historical change.

2. I think the argument in the final sentence is that sociology is devoid of facts. While sociologists may indeed care about certain topics (such as race, class, and gender) that others don’t care as much about, we also care about facts. For example, many sociology undergraduate programs have students take statistics and research methods courses. We don’t want students or sociologists simply interpreting data and information without having their findings be reliable (replicable) and valid (measuring what we say we are measuring). There is a lot of debate within the field about how we can best know about the world and determine what is causing or influencing what. This is not easy work since most social situations are quite complex and there are a lot of variables at play.

3. Why can’t history and sociology coexist? As an overgeneralization, history tends to tell us what happened and sociology helps us think through why these things happened. Why can’t sociology help inform us about history, particularly about how certain historical narratives develop and then become part of our collective memory?

Five main methods of detecting patterns in data mining

Here is a summary of five of the main methods utilized to uncover patterns when data mining:

Anomaly detection: In a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.
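The IRS illustration above boils down to flagging cases that sit far from the bulk of the data. Here is a minimal sketch in Python using z-scores; the numbers and the threshold of 2 are made up for illustration, not a standard the quoted summary specifies:

```python
import statistics

def find_anomalies(values, z_threshold=2.0):
    """Flag values whose z-score (distance from the mean in standard
    deviations) exceeds the threshold. The threshold is an illustrative
    choice; real anomaly detection tunes it to the data."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Hypothetical "claimed deductions" where most returns cluster together
deductions = [98, 102, 97, 101, 100, 99, 103, 500]
print(find_anomalies(deductions))  # prints [500]
```

A real system would model many variables at once rather than one column, but the logic is the same: characterize the typical case, then surface what deviates from it.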

Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
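The cocktail-shaker example can be made concrete with a toy market-basket analysis. This is a sketch of the simplest "support and confidence" version of association learning, not Amazon's or Netflix's actual system, and the baskets are invented:

```python
from itertools import combinations
from collections import Counter

def association_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find 'X -> Y' rules among item pairs.
    support    = fraction of all baskets containing both items
    confidence = fraction of baskets with X that also contain Y"""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in [(a, b), (b, a)]:
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

baskets = [
    {"shaker", "recipe book", "martini glasses"},
    {"shaker", "martini glasses"},
    {"recipe book"},
    {"shaker", "recipe book", "martini glasses"},
]
for x, y, s, c in association_rules(baskets):
    print(f"{x} -> {y}  support={s:.2f} confidence={c:.2f}")
```

On these invented baskets, the rule "martini glasses -> shaker" comes out with perfect confidence: every basket with glasses also had a shaker. Production recommenders add many refinements, but this support/confidence core is where the field started.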

Cluster detection: One type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This is one of the black-box type of algorithms that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.
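A bare-bones version of letting "the data itself determine the groups" is k-means clustering. This sketch uses made-up two-dimensional shopper profiles (spending on gardening vs. fishing gear); real cluster detection works in many more dimensions, but the mechanics are the same:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Bare-bones k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster emptied out.
        centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical (gardening spend, fishing spend) profiles
shoppers = [(9, 1), (8, 2), (9, 2), (1, 9), (2, 8), (1, 8)]
groups = kmeans(shoppers, k=2)
```

Nobody told the algorithm "gardeners" and "fishermen" exist; the two groups fall out of the distances between points. The catch the quoted summary hints at is real: you still have to choose k, and interpreting what each discovered cluster *means* is left to the analyst.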

Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent systemic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and classify incoming messages according to these rules with a high degree of accuracy.
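The spam-filter example maps directly onto a naive Bayes classifier. Here is a toy version with an invented four-message training set; a real filter learns from millions of labeled emails and adds many refinements:

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs. Returns per-label word
    counts and label counts learned from the pre-classified examples."""
    counts = {"spam": Counter(), "ham": Counter()}
    labels = Counter()
    for text, label in messages:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def classify(text, counts, labels):
    """Score each label by log-prior plus log-likelihood of each word,
    with add-one smoothing so unseen words don't zero out a label."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(labels[label] / sum(labels.values()))
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("see you at the meeting", "ham"),
]
counts, labels = train(training)
print(classify("claim your free money", counts, labels))  # prints "spam"
```

This is exactly the pattern the quoted summary describes: learn persistent word-usage differences from pre-classified examples, then apply those learned differences to new cases.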

Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
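The Facebook scenario is hypothetical, but the mechanics of fitting a predictive model can be shown with ordinary least squares on one made-up predictor; a real engagement model would use many variables and far more sophisticated estimation:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Invented data: photos tagged per month (x) vs. minutes on site per day (y)
photos = [0, 2, 4, 6, 8]
minutes = [5, 9, 13, 17, 21]
a, b = fit_line(photos, minutes)
print(f"predicted minutes/day at 10 tagged photos: {a + b * 10:.1f}")
```

The honing step the summary mentions is the key part in practice: you compare the model's predictions against observed behavior, then re-weight or swap variables until the gap shrinks.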

Several of these seem similar to methods commonly used by sociologists:

1. Anomaly detection seems like looking for outliers. On one hand, outliers can throw off basic measures of central tendency or dispersion. On the other hand, outliers can help prompt researchers to reassess their models and/or theories to account for the unusual cases.

2. Cluster detection and/or classification appear similar to factor analysis. This involves a statistical analysis of a set of variables to see which ones “hang together.” This can be helpful for finding categories and reducing the number of variables in an analysis to a lesser number of important concepts.

3. Regression is used all the time both for modeling and predictions.

This all reminds me of what I heard in graduate school about the difference between data mining and statistical research: data mining amounted to atheoretical analysis. In other words, you might find relationships between variables (or apparent relationships between variables – could always be a spurious association or there could be suppressor or distorter effects) but you wouldn’t have compelling explanations for these relationships. While you might be able to develop some explanations, this is a different process than hypothesis testing where you set out to look and test for relationships and patterns.

The rise of “data science” as illustrated by examining the McDonald’s menu

Christopher Mims takes a look at “data science” and one of its practitioners:

Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for data scientists, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)

Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.

For the rest of us, Chen provides a concrete and accessible example: McDonald’s.

By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.

This sounds like an updated version of factor analysis: break a whole into its larger and influential pieces.

Here is how Chen describes the field:

I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:

* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)

* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.

* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.

I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing but it seems like there is a ways to go here. I would also be interested in hearing Chen’s thoughts on how representative the data is that he typically works with. In other words, how confident he and others are that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is. Can we ask and answer questions about all Americans or world residents from the data that is becoming available through new data sources?

h/t Instapundit

Why don’t we collect data to see whether we have become more rude or uncivil rather than rely on anecdotes?

NPR ran a story the other day about how American culture is becoming more casual and less polite. This is not an uncommon story: every so often, different news organizations will run something similar, often focusing on the decreasing use of manners like saying “please” and “you’re welcome.” Here is the main problem I have with these articles: what kind of data could we look at to evaluate this argument? These stories tend to rely on experts who provide anecdotal evidence or their own interpretation. In this piece, these are the three experts: “a psychiatrist and blogger,” “a sophomore at the College of Charleston — in the South Carolina city that is often cited as one of the most courteous in the country,” and “etiquette maven Cindy Post Senning, a director of the Emily Post Institute in Burlington, Vt.”

There is one data point cited in this story:

Research backs up Smith’s anecdotal observations. In 2011, some 76 percent of people surveyed by Rasmussen Reports said Americans are becoming more rude and less civil.

Interestingly, this statistic is about perceptions. Perceptions may be more important than reality in many social situations. But I could imagine another scenario about these perceptions: older generations tend to think that younger generations (often their children and grandchildren?) are less mannered and don’t care as much about social etiquette. As this story suggests, perhaps the manners are simply changing – instead of saying “you’re welcome,” younger people give the dreaded “sure.”

There has to be some way to measure this. It would be nice to do this online or in social media but the problem is that face-to-face rules don’t apply there. Perhaps someone has recorded interactions at McDonald’s or Walmart registers? In whatever setting a researcher chooses, you would want to observe a broad range of people to look for patterns by age, occupation, gender, race, and education level (though some of this would have to come through survey or interview data with the people being observed).

In my call for data, I am not disagreeing with the idea that traditional manners and civility have decreased. I just want to see data that suggests this rather than anecdotes and observations from a few people.

When only bad people live in McMansions

I doubt I will see the movie Wanderlust but this quick description of the film caught my eye:

Paul Rudd and Jennifer Aniston star in “Wanderlust,” the raucous new comedy from director David Wain and producer Judd Apatow about a harried couple who leave the pressures of the big city and join a freewheeling community where the only rule is to be yourself. When overextended, overstressed Manhattanites, George (Rudd) and Linda (Aniston), pack up their lives and head south to move in with George’s McMansion-living jerk of a brother, Rick (Ken Marino), they stumble upon Elysium, an idyllic community populated by colorful characters including the commune’s alpha male, Seth (Justin Theroux), the sexually adventurous Eva (Malin Akerman), and the troupe’s drop-out founder, Carvin (Alan Alda).

This reinforces an idea I have seen hinted at in many other places: the people who live in McMansions are jerks or bad people. McMansion owners don’t care about the environment, love to consume, have little taste, and don’t want to interact with people unlike them. The converse would look like this: smart or nice or enlightened people would not live in these homes. This is a great example of drawing moral boundaries by attaching character traits to certain home choices. This could be tied to the idea that living in a large home is viewed as morally wrong by some.

I would love to get my hands on sociological data to examine this claim. Of course, this would require first determining whether someone lives in a McMansion and this itself would require work. But then you could examine some different factors: do McMansion owners interact with their neighbors more? Are they involved with more civic organizations? Do they give more money to charity? Do they help people in need more often? Do they have a stronger prosocial orientation? If there were not significant differences, how might people respond…

Journalists need a better measure for when something has “taken over the web”

I’ve noticed that there are a growing number of online news stories about what is popular online. While many websites need to feed on this buzz, journalists need some better measures of how popular things are on the Internet. Take, for instance, this story posted on Yahoo:

This video from the California State University, Northridge campus has ignited controversy across the Internet this morning. In the video, reportedly taken during finals week, a female student loses her temper with her fellow students, accusing them of being disruptive.

Exactly how much “controversy across the Internet” has erupted? Phrases like this are not unusual; we’re commonly told that a particular story or video or meme has spread across the Internet so we need to know about it. But we have little idea about how popular anything really is.

I’ve noted before my dislike for journalists using the size of Facebook groups as a measure of popularity. So what can be used? We need numbers that can be at least put in a context and compared to other numbers. For example, the number of YouTube views can be compared to the views for other videos. Page views and hits (which have their own problems) at least provide some information. Journalists could do a quick search of Google News to get some idea of how many news sources have picked up on a story. We can know how many times something has been retweeted on Twitter.

None of these numbers are perfect. By themselves, they are meaningless. But broad and vague assertions that we need to read about something simply because lots of people on the Internet have seen it are silly. Give us some idea of how popular something really is, where it started, and who has responded to it so far. Show us some trend and put it in some context.