Google Street View, machine learning, and social patterns

I have wondered why more researchers do not make use of Google Street View. Here is a new study that connects vehicles in neighborhoods with voting patterns and demographics:

Abstract: The United States spends more than $250 million each year on the American Community Survey (ACS), a labor-intensive door-to-door study that measures statistics relating to race, gender, education, occupation, unemployment, and other demographic factors. Although a comprehensive source of data, the lag between demographic changes and their appearance in the ACS can exceed several years. As digital imagery becomes ubiquitous and machine vision techniques improve, automated data analysis may become an increasingly practical supplement to the ACS. Here, we present a method that estimates socioeconomic characteristics of regions spanning 200 US cities by using 50 million images of street scenes gathered with Google Street View cars. Using deep learning-based computer vision techniques, we determined the make, model, and year of all motor vehicles encountered in particular neighborhoods. Data from this census of motor vehicles, which enumerated 22 million automobiles in total (8% of all automobiles in the United States), were used to accurately estimate income, race, education, and voting patterns at the zip code and precinct level. (The average US precinct contains 1,000 people.) The resulting associations are surprisingly simple and powerful. For instance, if the number of sedans encountered during a drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next presidential election (88% chance); otherwise, it is likely to vote Republican (82%). Our results suggest that automated systems for monitoring demographics may effectively complement labor-intensive approaches, with the potential to measure demographics with fine spatial resolution, in close to real time.

And a little more explanation from a news source:

The researchers created an algorithm to identify the brand, model and year of every car sold in the US since 1990.

The types of cars also provided information about the race, income and education levels of a neighborhood, the study said.

Volkswagens and Aston Martins were associated with white neighborhoods while Chryslers, Buicks and Oldsmobiles tended to appear in African-American neighborhoods, the study found.

This study seems to do two things that get at different areas of research:

  1. Linking lifestyle choices to voting behavior as well as other social traits. Researchers and marketers have done this for decades. For example, see this earlier post about media consumption and voting behavior. This hints at the work of Bourdieu, who suggested that class status is defined by cultural tastes and lifestyles in addition to access to resources and power.
  2. Connecting different publicly available big data sets to find patterns. Google Street View is available to all and election outcomes are also accessible. All it takes is a method to put these two things together. Here, it was a machine learning algorithm that could identify different kinds of vehicles. It would take humans a long time to connect these pieces of data, but algorithms, once they can correctly identify vehicles, can do this very quickly.
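
To make the abstract's sedan-versus-pickup rule concrete, here is a minimal Python sketch. It is not the authors' code: the `CityVehicleCounts` class, the `predict_party` function, and the city counts are all hypothetical, and the sketch assumes the hard part (identifying vehicles in Street View images) has already produced per-city totals.

```python
from dataclasses import dataclass


@dataclass
class CityVehicleCounts:
    """Hypothetical per-city totals produced by some vehicle classifier."""
    name: str
    sedans: int
    pickups: int


def predict_party(counts: CityVehicleCounts) -> str:
    """Apply the rule described in the abstract: more sedans than
    pickup trucks suggests Democrat; otherwise Republican."""
    return "Democrat" if counts.sedans > counts.pickups else "Republican"


# Made-up counts for illustration only; these are not data from the study.
cities = [
    CityVehicleCounts("City A", sedans=1200, pickups=800),
    CityVehicleCounts("City B", sedans=500, pickups=900),
]

predictions = {city.name: predict_party(city) for city in cities}
print(predictions)
```

The rule itself is a single comparison. According to the abstract, it is right about 88% of the time when it predicts Democrat and 82% of the time when it predicts Republican, so it describes a strong association rather than a certainty.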

Of course, this still leaves us with questions about what to do with it all. The authors seem interested in facilitating more efficient national data-gathering efforts. The American Community Survey and the Decennial Census are both costly efforts. Could machine learning help reduce the effort needed while providing accurate results? At the same time, the causal mechanisms behind these findings are less clear: do people buy pick-up trucks because they are Republican? How does this choice of a vehicle fit with a larger constellation of behaviors and beliefs? If someone wanted to change voting patterns, could encouraging the purchase of more pick-up trucks or sedans actually do so (or are these merely correlations)?