Collecting big data the slow way

One of the interesting side effects of the era of big data is finding out how much information is not actually automatically collected (or is at least not available to the general public or researchers without paying money). A quick example from the work of sociologist Matthew Desmond:

The new data, assembled from about 83 million court records going back to 2000, suggest that the most pervasive problems aren’t necessarily in the most expensive regions. Evictions are accumulating across Michigan and Indiana. And several factors build on one another in Richmond: It’s in the Southeast, where the poverty rates are high and the minimum wage is low; it’s in Virginia, which lacks some tenant rights available in other states; and it’s a city where many poor African-Americans live in low-quality housing with limited means of escaping it.

According to the Eviction Lab, here is how they collected the data:

First, we requested a bulk report of cases directly from courts. These reports included all recorded information related to eviction-related cases. Second, we conducted automated record collection from online portals, via web scraping and text parsing protocols. Third, we partnered with companies that carry out manual collection of records, going directly into the courts and extracting the relevant case information by hand.

In other words, it took a lot of work to put together such a database: various courts, websites, and companies each held different pieces of information, but a researcher had to access all of that data and put it together.

Without a researcher or a company or government body explicitly starting to record or collect certain information, a big dataset on that particular topic will not happen. Someone or some institution, typically with resources at its disposal, needs to set a process into motion. And simply having the data is not enough; it needs to be cleaned up so it all works with the other pieces. Again, from the Eviction Lab:

To create the best estimates, all data we obtained underwent a rigorous cleaning protocol. This included formatting the data so that each observation represented a household; cleaning and standardizing the names and addresses; and dropping duplicate cases. The details of this process can be found in the Methodology Report (PDF).
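
To make the cleaning step a bit more concrete, here is a minimal sketch of what standardizing names and addresses and dropping duplicate cases might look like. This is not the Eviction Lab's actual protocol (see their Methodology Report for that); the field names `name`, `address`, and `filed`, the helper functions, and the sample records are all hypothetical and made up for illustration.

```python
import re

def standardize(text):
    """Uppercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.upper())
    return " ".join(text.split())

def dedupe_cases(records):
    """Keep the first record for each (name, address, filing date) key.

    Assumption: two rows with the same standardized name, address, and
    filing date describe the same case. A real protocol would need
    more careful matching rules than this.
    """
    seen = set()
    cleaned = []
    for rec in records:
        key = (standardize(rec["name"]), standardize(rec["address"]), rec["filed"])
        if key in seen:
            continue  # duplicate case, skip it
        seen.add(key)
        cleaned.append({"name": key[0], "address": key[1], "filed": key[2]})
    return cleaned

# Hypothetical sample records, invented for this example
records = [
    {"name": "Jane  Doe", "address": "12 Main St.", "filed": "2016-03-01"},
    {"name": "JANE DOE", "address": "12 main st", "filed": "2016-03-01"},
    {"name": "John Roe", "address": "4 Oak Ave", "filed": "2016-03-02"},
]
cleaned = dedupe_cases(records)
```

Even in this toy version, the first two rows only collapse into one case after the names and addresses are standardized, which hints at why cleaning tens of millions of records from different courts takes so much effort.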

All of this can lead to a fascinating dataset of about 83 million records on an important topic.

We are probably still a ways off from a scenario where this information would automatically become part of a dataset. This data had a definite start and required much work. There are many other areas of social life that require similar efforts before researchers and the public have big data to examine and learn from.

Quick Review: Evicted

I recently read Matthew Desmond’s much-discussed work Evicted: Poverty and Profit in the American City. Here are my thoughts on the ethnographic work.

  1. The book is certainly readable as he tells the stories of a number of tenants and landlords in the Milwaukee area. The plight of the tenants is striking and the landlords are also an interesting group (particularly Sherrena, who wanted to tell her story). Of course, such readability may not impress some sociologists who prefer more scientific prose (and who complain about the work of Venkatesh or Goffman), but this should reach a broader public. The narratives have some summary data and causal explanations sprinkled in, but the emphasis is on the stories.
  2. One of the more impressive features of this work is the quantitative data that it also draws on. This information is buried in the footnotes, but Desmond developed several quantitative datasets that helped (1) suggest his stories are not unusual and (2) provide the broader patterns for an issue that is not studied much in sociology.
  3. The biggest takeaway for me: the number of evictions that take place on a regular basis.
  4. The subject area – evictions – certainly needs more attention. I’ve read my share of work on affordable housing in the last decade but rarely did I see this issue mentioned. As Desmond notes, big cities have a sizable population of people who consistently have to move around due to evictions. Even if there were more housing units – and big cities are often short tens of thousands of affordable units – evictions make it difficult to establish roots and settle kids into schools. The final chapter – where Desmond discusses the broader issue and possible solutions – leads off nicely with this idea of a good physical home as the centerpiece of a thriving society.
  5. That said, how common is this issue in suburban areas? As poverty moves to the suburbs, along with increasing numbers of minorities, I would expect that evictions are not limited just to larger cities.
  6. One area that gets less attention in this ethnography, and that may prove worthwhile to explore further, is the legal apparatus. Desmond follows one of the eviction squads and provides some insights into the court process, but it would be interesting to hear more from judges (who from the book seem to work against the tenants – though they may just be following the law) as well as local officials (how do public officials respond to these situations?).
  7. A second area is thinking about the intersections of race and class. Desmond hints at the influence of race: comparing the experiences of blacks on the North Side of Milwaukee versus whites on the South Side, comments from black and white tenants about the possibilities for living in the other’s neighborhoods, briefly discussing the race of landlords. However, there is a lot more here to unpack, especially given Desmond’s other work on race. Take the two main landlords in the book: one is white, the other black. The first has a more stand-offish approach (working through intermediaries) while the second is more directly involved with tenants. Both are in it for the money and seem to be doing well. How much does their race matter?

An enjoyable read and a work I could imagine using with undergraduates, who often have little to no experience with housing issues. I look forward to reading Desmond’s journal articles that also build on this ethnographic and quantitative data.