Collecting big data the slow way

One of the interesting side effects of the era of big data is finding out how much information is not actually automatically collected (or is at least not available to the general public or researchers without paying money). A quick example from the work of sociologist Matthew Desmond:

The new data, assembled from about 83 million court records going back to 2000, suggest that the most pervasive problems aren’t necessarily in the most expensive regions. Evictions are accumulating across Michigan and Indiana. And several factors build on one another in Richmond: It’s in the Southeast, where the poverty rates are high and the minimum wage is low; it’s in Virginia, which lacks some tenant rights available in other states; and it’s a city where many poor African-Americans live in low-quality housing with limited means of escaping it.

According to the Eviction Lab, here is how they collected the data:

First, we requested a bulk report of cases directly from courts. These reports included all recorded information related to eviction-related cases. Second, we conducted automated record collection from online portals, via web scraping and text parsing protocols. Third, we partnered with companies that carry out manual collection of records, going directly into the courts and extracting the relevant case information by hand.

In other words, it took a lot of work to put together such a database: various courts, websites, and companies held different pieces of information, and it took a researcher to access all of that data and put it together.

Without a researcher or a company or government body explicitly starting to record or collect certain information, a big dataset on that particular topic will not happen. Someone or some institution, typically with resources at its disposal, needs to set a process into motion. And simply having the data is not enough; it needs to be cleaned up so it all works with the other pieces. Again, from the Eviction Lab:

To create the best estimates, all data we obtained underwent a rigorous cleaning protocol. This included formatting the data so that each observation represented a household; cleaning and standardizing the names and addresses; and dropping duplicate cases. The details of this process can be found in the Methodology Report (PDF).
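The Methodology Report has the full details, but the steps they list – one row per household, standardized names and addresses, duplicate cases dropped – are easy to picture in code. Here is a minimal sketch in pandas; the column names and standardization rules are placeholders of my own, not the Eviction Lab’s actual protocol.

```python
# A rough sketch of the kind of cleaning the Eviction Lab describes. Column
# names ("court", "case_number", "defendant_name", "address", "filing_date")
# and the specific rules are hypothetical, not their actual protocol.
import pandas as pd

def clean_eviction_records(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Standardize names and addresses: trim whitespace, unify case, and
    # collapse repeated spaces so near-identical strings can be matched.
    for col in ["defendant_name", "address"]:
        df[col] = (
            df[col]
            .astype(str)
            .str.strip()
            .str.upper()
            .str.replace(r"\s+", " ", regex=True)
        )

    # Drop exact duplicate records, e.g. the same filing pulled from both a
    # bulk court report and a scraped online portal.
    df = df.drop_duplicates()

    # One observation per household: collapse multiple defendants named in
    # the same case at the same address into a single row.
    df = df.groupby(["court", "case_number", "address"], as_index=False).agg(
        defendant_names=("defendant_name", lambda s: "; ".join(sorted(set(s)))),
        filing_date=("filing_date", "first"),
    )
    return df
```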

This all can lead to a fascinating dataset of over 83 million records on an important topic.

We are probably still a ways off from a scenario where this information would automatically become part of a dataset. This data had a definite start and required much work. There are many other areas of social life that require similar efforts before researchers and the public have big data to examine and learn from.

When the candidate with the big data advantage didn’t win the presidency

Much was made of the effective use of big data by Barack Obama’s campaigns. That analytic advantage didn’t help the Clinton campaign:

Clinton can be paranoid and self-destructively self-protective, but she’s also capable of assessing her own deficiencies as a politician in a bracingly clear-eyed way. And the conclusion that she drew from her 2008 defeat was essentially an indictment of her own management style: Eight years earlier, she had personally presided over a talented, sloppy, squabbling, sprawling menagerie of pals, longtime advisers and hangers-on who somehow managed to bungle the building of a basic political infrastructure to oppose Obama’s efficient, data-driven operation.

To do so, Mook hired a buddy who had helped Terry McAuliffe squeak out a win in the 2014 Virginia governor’s race: Elan Kriegel, a little-known data specialist who would, in many ways, exert more influence over the candidate than any of the all-star team of veteran consultants. Kriegel’s campaign-within-a-campaign conducted dozens of targeted surveys—to test messaging and track voter sentiment day-by-day, especially in battleground states—and fed them into a computer algorithm, which ran hundreds of thousands of simulations that were used to steer ad spending, the candidate’s travel schedule, even the celebrities Clinton would invite to rallies.

The data operation, five staffers told me, was the source of Mook’s power within the campaign, and a source of perpetual tension: Many of Clinton’s top consultants groused that Mook and Kriegel withheld data from them, balking at the long lead time—a three-day delay—between tracking reports. A few of them even thought Mook was cherry-picking rosy polling to make the infamously edgy Clinton feel more confident…

In numerous interviews conducted throughout the campaign, Clinton staffers attested to Mook’s upbeat attitude and mastery of detail. But, in the end, Brooklyn simply failed to predict the tidal wave that swamped Clinton—a pro-Trump uprising in rural and exurban white America that wasn’t reflected in the polls—and his candidate failed to generate enough enthusiasm to compensate with big turnouts in Detroit, Milwaukee and the Philadelphia suburbs.
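To give a concrete, if entirely made-up, sense of what “hundreds of thousands of simulations” can look like, here is a toy Monte Carlo sketch of an electoral-vote forecast. The states, electoral votes, poll numbers, and error model are placeholders, not the Clinton campaign’s actual data or algorithm; real operations layer far richer survey data and correlation structure on top of the same basic idea.

```python
# Toy Monte Carlo election forecast: repeatedly simulate the battleground
# states and count how often the candidate clears 270 electoral votes.
# All inputs below are invented for illustration.
import random

# Hypothetical battlegrounds: electoral votes and polled win probability.
battlegrounds = {
    "StateA": (20, 0.55),
    "StateB": (16, 0.52),
    "StateC": (29, 0.49),
}
SAFE_VOTES = 230           # electoral votes assumed already locked in (made up)
VOTES_TO_WIN = 270
N_SIMULATIONS = 100_000    # "hundreds of thousands of simulations"

wins = 0
for _ in range(N_SIMULATIONS):
    total = SAFE_VOTES
    national_shift = random.gauss(0, 0.03)   # shared polling error across states
    for votes, p in battlegrounds.values():
        if random.random() < p + national_shift:
            total += votes
    if total >= VOTES_TO_WIN:
        wins += 1

print(f"Estimated win probability: {wins / N_SIMULATIONS:.1%}")
```

The output of something like this, refreshed as new tracking surveys come in, is what reportedly steered ad spending and the candidate’s travel schedule.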

It would be fascinating to hear more. The pollsters didn’t get it right – but neither did the Clinton campaign internally?

The real question is what this will do to future campaigns. Was Donald Trump’s lack of campaign infrastructure and reliance on celebrity and media coverage (also highlighted nicely in the article above) something that others can or will replicate? Or, will the close margins in this recent presidential election highlight even more the need for finely-tuned data and microtargeting? I’m guessing the influence of big data in campaigns will only continue, but data will only get you so far if (1) it isn’t great data in the first place and (2) people don’t know how to use it well.

Let Amazon’s big data tractor trailer drive to you

Americans like big trucks and hard drive space, so why not put the two together?

Amazon announced the new service, confusingly named Snowmobile, at its Re:Invent conference in Las Vegas this week. It’s designed to shuttle as many as 100 petabytes–around 100,000 terabytes–per truck. That’s enough storage to hold five copies of the Internet Archive (a comprehensive backup of the web both present and past), which contains “only” about 18.5 petabytes of unique data...

Using multiple semis to shuttle data around might seem like overkill. But for such massive amounts of data, hitting the open road is still the most efficient way to go. Even with a one gigabit per-second connection such as Google Fiber, uploading 100 petabytes over the internet would take more than 28 years. At an average speed of 65 mph, on the other hand, you could drive a Snowmobile from San Francisco to New York City in about 45 hours—about 4,970 gigabits per second. That doesn’t count the time it takes to actually transfer the data onto Snowmobile–which Amazon estimates will take less than 10 days–or from the Snowmobile onto Amazon’s servers. But all told, that still makes the truck much, much faster. And because Amazon has data centers throughout the country, your data probably won’t need to travel cross-country anyway.
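The arithmetic in that paragraph is easy to check. The sketch below assumes decimal petabytes (10^15 bytes); using binary petabytes (2^50 bytes) instead pushes the fiber estimate to roughly 28.5 years, which is closer to the figure quoted.

```python
# Back-of-the-envelope check of the transfer-time comparison in the quote.
DATA_BITS = 100e15 * 8       # 100 petabytes (decimal) expressed in bits
FIBER_BPS = 1e9              # a 1 gigabit-per-second connection

seconds_over_fiber = DATA_BITS / FIBER_BPS
years_over_fiber = seconds_over_fiber / (365.25 * 24 * 3600)
print(f"Upload over 1 Gbps: {years_over_fiber:.1f} years")        # ~25 years

drive_seconds = 45 * 3600    # San Francisco to New York at ~65 mph
truck_bps = DATA_BITS / drive_seconds
print(f"Effective truck bandwidth: {truck_bps / 1e9:,.0f} Gbps")  # ~4,900 Gbps
```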

One could make a strong case that semis make America go. And all the money that the government has put into highways and roads certainly helps.

Guidelines for using big data to improve colleges

A group of researchers and other interested parties recently made suggestions about how big data can be used for good within higher ed:

To Stevens and others, this massive data is full of promise – but also peril. The researchers talk excitedly about big data helping higher education discover its Holy Grail: learning that is so deeply personalized that it both keeps struggling students from dropping out and pushes star performers to excel…

The guidelines center on four core ideas. The first calls on all players in higher education, including students and vendors, to recognize that data collection is a joint venture with clearly defined goals and limits. The second states that students be told how their data are collected and analyzed, and be allowed to appeal what they see as misinformation. The third emphasizes that schools have an obligation to use data-driven insights to improve their teaching. And the fourth establishes that education is about opening up opportunities for students, not closing them.

While numbers one and two deal with handling the data, numbers three and four discuss the purposes: will the data actually help students in the long run? Such data could serve a lot of interested parties: faculty, administrators, alumni, donors, governments, accreditation groups, and others. I suspect faculty would be worried that administrators would try to squeeze more efficiencies out of the college, donors might want to see what exactly is going on at the college, the government could set new regulatory guidelines, etc.

Yet, big data doesn’t necessarily provide quick answers for these purposes even as it might provide insights into broader patterns. Take improving teaching: there is a lot of disagreement over how to do that. Or, opening opportunities for students: which opportunities? Who chooses which options students should have?

One takeaway: big data offers much potential to see new patterns and give decision makers better tools. However, it does not guarantee better or worse outcomes; it can be used well or misused like any set of data. I like the idea of getting out ahead of the data to set some common guidelines, but I imagine it will take some time to work out best practices.

Claim that McMansions have proportionally lost resale value

A recent study by Trulia suggests McMansions don’t hold their value:

The premium that buyers can expect to pay for a McMansion in Fort Lauderdale, Fla., declined by 84 percent from 2012 to 2016, according to data compiled by Trulia. In Las Vegas, the premium dropped by 46 percent and in Phoenix, by 42 percent.

Real estate agents don’t usually tag their listings #McMansion, so to compile the data, Trulia created a proxy, measuring the price appreciation of homes built from 2001 to 2007 that have 3,000 to 5,000 square feet. While there’s no single size designation, and plenty of McMansions were built outside that time window, those specifications capture homes built at the height of the trend.

McMansions cost more to build than your average starter ranch home does, and they will sell for more. But the return on investment has dropped like a stone. The additional cash that buyers should be willing to part with to get a McMansion fell in 85 of the 100 largest U.S. metropolitan areas. For example, four years ago a typical McMansion in Fort Lauderdale was valued at $477,000, a 274 percent premium over all other homes in the area. This year, those McMansions are worth about $611,000, or 190 percent more than the rest of the homes on the market.

The few areas in which McMansions are gaining value faster than more tasteful housing stock are located primarily in the Midwest and the eastern New York suburbs that make up Long Island. The McMansion premium in Long Island has increased by 10 percent over the last four years.

Read the Trulia report here.

Interesting claim. After the housing bubble burst, some commentators suggested that Americans should go back to not viewing homes as goods with significant returns on investment. Instead, homes should be viewed as appreciating, but relatively slowly. This article would seem to suggest that return on investment is a key factor in buying a home. How often does this factor into the decisions of buyers versus other concerns (such as having more space or locating in the right neighborhoods)? And just how much of a premium should homeowners expect – 190% more than the rest of the market is not enough?

This analysis also appears to illustrate both the advantages and pitfalls of big data. On one hand, sites like Trulia and Zillow can look at the purchase and sale of homes all across the country. Patterns can be found and certain causal factors – such as the housing market – can be examined. Yet, they are still limited by the parameters of their data collection, which, in this case, severely restricts their definition of McMansions to homes of a certain size built during a particular time period. As others might attest, big homes aren’t necessarily McMansions unless they have bad architecture or are teardowns. This sort of analysis would be very difficult to do without big data, but it is not self-evident that such analyses are always worthwhile.
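For what it’s worth, the proxy and the premium figure described above can be expressed in a few lines. The column names and the exact premium formula below (median proxy-McMansion value relative to the median of everything else) are my reading of the article, not Trulia’s published methodology.

```python
# A sketch of the Trulia-style proxy: "McMansion" = built 2001-2007 and
# 3,000-5,000 square feet. Column names and the formula are assumptions.
import pandas as pd

def mcmansion_premium(listings: pd.DataFrame) -> float:
    """Return the McMansion premium as a percentage over other homes."""
    is_mcmansion = (
        listings["year_built"].between(2001, 2007)
        & listings["square_feet"].between(3000, 5000)
    )
    mcmansion_value = listings.loc[is_mcmansion, "value"].median()
    other_value = listings.loc[~is_mcmansion, "value"].median()
    return (mcmansion_value / other_value - 1) * 100

# Toy example: a $477,000 proxy-McMansion against a ~$127,500 typical home
# implies roughly the 274 percent premium cited for Fort Lauderdale in 2012.
toy = pd.DataFrame({
    "year_built": [2004, 1978, 1990],
    "square_feet": [4200, 1400, 1700],
    "value": [477_000, 120_000, 135_000],
})
print(f"{mcmansion_premium(toy):.0f}% premium")
```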

Using a supercomputer and big data to find stories of black women

A sociologist is utilizing unique methods to uncover more historical knowledge about black women:

Mendenhall, who is also a professor of African American studies and urban and regional planning, is heading up the interdisciplinary team of researchers and computer scientists working on the big data project, which aims to better understand black women’s experience over time. The challenge in a project like this is that documents that record the history of black women, particularly in the slave era, aren’t necessarily going to be straightforward explanations of women’s feelings, resistance, or movement. Instead, Mendenhall and her team are looking for keywords that point to organizations or connections between groups that can indicate larger movements and experiences.

Using a supercomputer in Pittsburgh, they’ve culled 20,000 documents that discuss black women’s experience from a 100,000 document corpus (collection of written texts). “What we’re now trying to do is retrain a model based on those 20,000 documents, and then do a search on a larger corpus of 800,000, and see if there are more of those documents that have more information about black women,” Mendenhall added…

Using topic modeling and data visualization, they have started to identify clues that could lead to further research. For example, according to Phys.Org, finding documents that include the words “vote” and “women” could indicate black women’s participation in the suffrage movement. They’ve also preliminarily found some new texts that weren’t previously tagged as by or about black women.

Next up Mendenhall is interested in collecting and analyzing data about current movements, such as Black Lives Matter.

It sounds like this involves putting together the best algorithm to do pattern recognition that would take humans far too long to do by hand. This can only be done with some good programming as well as a significant collection of texts. Three questions come quickly to mind:

  1. How would one report findings from this data in typical outlets for sociological or historical research?
  2. How easy would it be to apply this to other areas of inquiry?
  3. Is this data mining or are there hypotheses that can be tested?
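As a rough illustration of the keyword-plus-model approach described above, here is a sketch that flags documents where candidate keywords co-occur and then fits a topic model with scikit-learn. The documents, keywords, and model settings are placeholders, not the team’s actual pipeline.

```python
# Illustrative only: keyword flagging plus topic modeling over a text corpus,
# in the spirit of the project described above. Everything below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "local women met to discuss the vote and the amendment",
    "women organized a church aid society and a reading circle",
    "market prices for cotton and tobacco in the county",
    # ... hundreds of thousands of digitized documents in practice
]

# Flag documents where candidate keywords co-occur (e.g. "vote" and "women"),
# which may point to suffrage-era activity worth a closer human reading.
keywords = {"vote", "women"}
flagged = [d for d in documents if keywords <= set(d.lower().split())]
print(f"{len(flagged)} of {len(documents)} documents flagged for review")

# Fit a topic model to surface clusters of co-occurring terms that a
# researcher can then investigate by hand.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:6]]
    print(f"Topic {i}: {', '.join(top_terms)}")
```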

There are lots of possibilities like this with big data but it remains to be seen how useful it might be for research.

The first wave of big data – in the early 1800s

Big data may appear to be a recent phenomenon, but the big data of the 1800s allowed for new questions and discoveries:

Fortunately for Quetelet, his decision to study social behavior came during a propitious moment in history. Europe was awash in the first wave of “big data” in history. As nations started developing large-scale bureaucracies and militaries in the early 19th century, they began tabulating and publishing huge amounts of data about their citizenry, such as the number of births and deaths each month, the number of criminals incarcerated each year, and the number of incidences of disease in each city. This was the inception of modern data collection, but nobody knew how to usefully interpret this hodgepodge of numbers. Most scientists of the time believed that human data was far too messy to analyze—until Quetelet decided to apply the mathematics of astronomy…

In the early 1840s, Quetelet analyzed a data set published in an Edinburgh medical journal that listed the chest circumference, in inches, of 5,738 Scottish soldiers. This was one of the most important, if uncelebrated, studies of human beings in the annals of science. Quetelet added together each of the measurements, then divided the sum by the total number of soldiers. The result came out to just over 39 ¾ inches—the average chest circumference of a Scottish soldier. This number represented one of the very first times a scientist had calculated the average of any human feature. But it was not Quetelet’s arithmetic that was history-making—it was his answer to a rather simple-seeming question: What, precisely, did this average actually mean?

Scholars and thinkers in every field hailed Quetelet as a genius for uncovering the hidden laws governing society. Florence Nightingale adopted his ideas in nursing, declaring that the Average Man embodied “God’s Will.” Karl Marx drew on Quetelet’s ideas to develop his theory of Communism, announcing that the Average Man proved the existence of historical determinism. The physicist James Maxwell was inspired by Quetelet’s mathematics to formulate the classical theory of gas mechanics. The physician John Snow used Quetelet’s ideas to fight cholera in London, marking the start of the field of public health. Wilhelm Wundt, the father of experimental psychology, read Quetelet and proclaimed, “It can be stated without exaggeration that more psychology can be learned from statistical averages than from all philosophers, except Aristotle.”
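In modern notation, Quetelet’s arithmetic was simply the sample mean of the n = 5,738 chest measurements:

```latex
\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i
        \;=\; \frac{x_1 + x_2 + \cdots + x_{5738}}{5738}
        \;\approx\; 39\tfrac{3}{4} \text{ inches}
```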

Is it a surprise, then, that sociology emerged in the same time period, with greater access to data on societies in Europe and around the globe? Many of us are so used to having data and information at our fingertips that it is hard to appreciate what a revolution this must have been – large-scale data within stable nation-states opened up all sorts of possibilities.