Linking nicer cars to a suburb on the rise

From the Australian suburbs: one insider suggests that nicer cars appearing in driveways signal good prospects for the suburb.

The gentrification of the driveway happens before the gentrification of a suburb, says the boss of a data analytics company.

Upmarket vehicles beginning to appear in the carports and garages of houses is often a forerunner of a suburb on the rise, as renovators move in...

When more models such as a BMW X5 or an Audi SUV begin appearing in the driveway of houses and apartments in particular suburban streets, it is a reliable predictor of a suburb undergoing gentrification and becoming much more popular with renovators. Extra investment in community infrastructure often followed, and there was a broad flow on to higher property prices…

He said households who were taking out a loan for $500,000 to buy a rundown home in an up-and-coming area were often also purchasing a $30,000 to $40,000 car to fit the aspirational lifestyle.

The article chalks this up as a big data insight: bringing together multiple pieces of information helped reveal the relationship. I can see how this information might help investors, but it is less clear how it would help residents or local governments.
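I do not know exactly which datasets the company combined, but the basic linkage is easy to imagine: match vehicle models observed (or registered) in each suburb to later price growth and see whether upmarket models lead the rise. A minimal sketch, with entirely hypothetical files, column names, and "luxury" list:

```python
# A hedged sketch of the kind of data linkage hinted at above; everything here
# (file names, columns, the model list) is made up for illustration.
import pandas as pd

luxury_models = {"BMW X5", "AUDI Q5", "AUDI Q7"}  # illustrative list only

vehicles = pd.read_csv("vehicle_registrations.csv")  # columns: suburb, year, model
prices = pd.read_csv("median_prices.csv")            # columns: suburb, year, median_price

# Share of upmarket models in each suburb-year.
vehicles["is_luxury"] = vehicles["model"].str.upper().isin(luxury_models)
luxury_share = (
    vehicles.groupby(["suburb", "year"], as_index=False)["is_luxury"]
    .mean()
    .rename(columns={"is_luxury": "luxury_share"})
)

# Year-over-year median price growth per suburb.
prices = prices.sort_values(["suburb", "year"])
prices["price_growth"] = prices.groupby("suburb")["median_price"].pct_change()

# Line up this year's luxury share with next year's price growth.
merged = prices.merge(luxury_share, on=["suburb", "year"]).sort_values(["suburb", "year"])
merged["next_year_growth"] = merged.groupby("suburb")["price_growth"].shift(-1)
print(merged[["luxury_share", "next_year_growth"]].corr())
```

Even a crude correlation like this would only be suggestive; the harder analytic work is ruling out all the other things that change when renovators move in.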

More broadly, this gets at something my dad always said: look at the cars in driveways, on the street, or in parking spots, and you get a sense of the people who live there. In societies that prize cars, such as the United States and Australia (and particularly their suburbs), a vehicle becomes an important social marker. The one-to-one relationship might not always hold, as some people buy more expensive cars than their housing would indicate and vice versa (recall the stories of millionaires driving old reliable cars). Yet, on the whole, people of different social classes drive different vehicles in varying states of repair. Hence, various brands aim at different segments of the market. Famously, General Motors did this early in the 20th century with five different car lines meant to appeal to different kinds of buyers.

UPDATE: I probably did not contribute to this upward trend with my long-term ownership of a Toyota Echo. But it looked good for its age.

 

Collecting big data the slow way

One of the interesting side effects of the era of big data is finding out how much information is not actually automatically collected (or is at least not available to the general public or researchers without paying money). A quick example from the work of sociologist Matthew Desmond:

The new data, assembled from about 83 million court records going back to 2000, suggest that the most pervasive problems aren’t necessarily in the most expensive regions. Evictions are accumulating across Michigan and Indiana. And several factors build on one another in Richmond: It’s in the Southeast, where the poverty rates are high and the minimum wage is low; it’s in Virginia, which lacks some tenant rights available in other states; and it’s a city where many poor African-Americans live in low-quality housing with limited means of escaping it.

According to the Eviction Lab, here is how they collected the data:

First, we requested a bulk report of cases directly from courts. These reports included all recorded information related to eviction-related cases. Second, we conducted automated record collection from online portals, via web scraping and text parsing protocols. Third, we partnered with companies that carry out manual collection of records, going directly into the courts and extracting the relevant case information by hand.
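The second step is the one most amenable to automation. Here is a minimal sketch of what scraping a single court portal might look like; the URL and the HTML structure are invented, and every real portal needs its own parsing logic, which is part of why this is so much work:

```python
# A hedged sketch of automated record collection from an online portal, not the
# Eviction Lab's actual pipeline. The portal URL and page layout are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

PORTAL_URL = "https://example-county-court.gov/eviction-cases"  # placeholder

def scrape_cases(page):
    """Fetch one page of case listings and parse the fields of interest."""
    resp = requests.get(PORTAL_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    cases = []
    for row in soup.select("table.cases tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 4:
            cases.append({
                "case_number": cells[0],
                "filing_date": cells[1],
                "defendant": cells[2],
                "address": cells[3],
            })
    return cases

with open("eviction_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["case_number", "filing_date", "defendant", "address"])
    writer.writeheader()
    for page in range(1, 6):  # first few pages only, for illustration
        writer.writerows(scrape_cases(page))
```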

In other words, it took a lot of work to put together such a database: various courts, websites, and companies held different pieces of information, but it took a researcher to access all of that data and bring it together.

Without a researcher, company, or government body explicitly deciding to record or collect certain information, a big dataset on that particular topic will not happen. Someone or some institution, typically with resources at its disposal, needs to set the process in motion. And simply having the data is not enough; it needs to be cleaned up so all the pieces work together. Again, from the Eviction Lab:

To create the best estimates, all data we obtained underwent a rigorous cleaning protocol. This included formatting the data so that each observation represented a household; cleaning and standardizing the names and addresses; and dropping duplicate cases. The details of this process can be found in the Methodology Report (PDF).
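A toy version of that cleaning step, picking up the CSV from the scraping sketch above; the column names are hypothetical and the real protocol, documented in their Methodology Report, is far more involved:

```python
# A minimal sketch of the cleaning steps described above: standardize names and
# addresses, drop duplicates, and collapse to roughly one row per household.
import pandas as pd

cases = pd.read_csv("eviction_cases.csv")

# Standardize names and addresses so the same household matches across records.
cases["defendant"] = cases["defendant"].str.upper().str.strip()
cases["address"] = (
    cases["address"]
    .str.upper()
    .str.replace(r"\s+", " ", regex=True)
    .str.replace(r"\bSTREET\b", "ST", regex=True)
    .str.replace(r"\bAVENUE\b", "AVE", regex=True)
)

# Drop exact duplicates, then approximate one observation per household
# (here: one defendant at one address per filing date).
cases = cases.drop_duplicates()
households = cases.drop_duplicates(subset=["defendant", "address", "filing_date"])

print(len(cases), "cases ->", len(households), "household-level observations")
```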

This all can lead to a fascinating dataset of over 83 million records on an important topic.

We are probably still a ways off from a scenario where this information would automatically become part of a dataset. This data had a definite start and required much work. There are many other areas of social life that require similar efforts before researchers and the public have big data to examine and learn from.

When the candidate with the big data advantage didn’t win the presidency

Much was made of the effective use of big data by Barack Obama’s campaigns. That analytic advantage didn’t help the Clinton campaign:

Clinton can be paranoid and self-destructively self-protective, but she’s also capable of assessing her own deficiencies as a politician in a bracingly clear-eyed way. And the conclusion that she drew from her 2008 defeat was essentially an indictment of her own management style: Eight years earlier, she had personally presided over a talented, sloppy, squabbling, sprawling menagerie of pals, longtime advisers and hangers-on who somehow managed to bungle the building of a basic political infrastructure to oppose Obama’s efficient, data-driven operation.

To do so, Mook hired a buddy who had helped Terry McAuliffe squeak out a win in the 2013 Virginia governor’s race: Elan Kriegel, a little-known data specialist who would, in many ways, exert more influence over the candidate than any of the all-star team of veteran consultants. Kriegel’s campaign-within-a-campaign conducted dozens of targeted surveys—to test messaging and track voter sentiment day-by-day, especially in battleground states—and fed them into a computer algorithm, which ran hundreds of thousands of simulations that were used to steer ad spending, the candidate’s travel schedule, even the celebrities Clinton would invite to rallies.

The data operation, five staffers told me, was the source of Mook’s power within the campaign, and a source of perpetual tension: Many of Clinton’s top consultants groused that Mook and Kriegel withheld data from them, balking at the long lead time—a three-day delay—between tracking reports. A few of them even thought Mook was cherry-picking rosy polling to make the infamously edgy Clinton feel more confident…

In numerous interviews conducted throughout the campaign, Clinton staffers attested to Mook’s upbeat attitude and mastery of detail. But, in the end, Brooklyn simply failed to predict the tidal wave that swamped Clinton—a pro-Trump uprising in rural and exurban white America that wasn’t reflected in the polls—and his candidate failed to generate enough enthusiasm to compensate with big turnouts in Detroit, Milwaukee and the Philadelphia suburbs.

It would be fascinating to hear more. The pollsters didn’t get it right – but neither did the Clinton campaign internally?

The real question is what this will do to future campaigns. Was Donald Trump’s lack of campaign infrastructure and reliance on celebrity and media coverage (also highlighted nicely in the article above) something that others can or will replicate? Or do the close margins in this presidential election highlight even more the need for finely-tuned data and microtargeting? I’m guessing the influence of big data in campaigns will only continue, but data will only get you so far if (1) it isn’t great data in the first place and (2) people don’t know how to use it well.
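For what it is worth, the simulation piece described in the excerpt is conceptually simple even if the campaign's version was far more elaborate. A toy sketch of turning state-level win probabilities into an overall win probability; the states, probabilities, and vote totals are made up, and real models would not treat states as independent the way this does:

```python
# A hedged, toy Monte Carlo election simulation: not the Clinton campaign's model,
# just an illustration of the general idea of running many simulated elections.
import random

battlegrounds = {  # state: (electoral votes, modeled win probability) - illustrative only
    "PA": (20, 0.60),
    "MI": (16, 0.62),
    "WI": (10, 0.58),
    "FL": (29, 0.48),
    "NC": (15, 0.45),
}
SAFE_VOTES = 232   # hypothetical electoral votes assumed already locked in

def simulate_once():
    votes = SAFE_VOTES
    for ev, p in battlegrounds.values():
        if random.random() < p:
            votes += ev
    return votes

runs = 100_000
wins = sum(simulate_once() >= 270 for _ in range(runs))
print(f"Simulated win probability: {wins / runs:.1%}")
```

The independence assumption is exactly the kind of thing that can make a model look rosier than reality when polling errors are correlated across similar states.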

Let Amazon’s big data tractor trailer drive to you

Americans like big trucks and hard drive space, so why not put the two together?

Amazon announced the new service, confusingly named Snowmobile, at its Re:Invent conference in Las Vegas this week. It’s designed to shuttle as many as 100 petabytes–around 100,000 terabytes–per truck. That’s enough storage to hold five copies of the Internet Archive (a comprehensive backup of the web both present and past), which contains “only” about 18.5 petabytes of unique data...

Using multiple semis to shuttle data around might seem like overkill. But for such massive amounts of data, hitting the open road is still the most efficient way to go. Even with a one gigabit per-second connection such as Google Fiber, uploading 100 petabytes over the internet would take more than 28 years. At an average speed of 65 mph, on the other hand, you could drive a Snowmobile from San Francisco to New York City in about 45 hours—about 4,970 gigabits per second. That doesn’t count the time it takes to actually transfer the data onto Snowmobile–which Amazon estimates will take less than 10 days–or from the Snowmobile onto Amazon’s servers. But all told, that still makes the truck much, much faster. And because Amazon has data centers throughout the country, your data probably won’t need to travel cross-country anyway.
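The back-of-the-envelope math is worth spelling out. This is my arithmetic, not Amazon's, and the exact figures depend on whether a petabyte is counted in powers of ten or powers of two; the binary definition is likely what pushes the article's fiber estimate past 28 years:

```python
# Rough check of the quoted transfer-time figures, using decimal petabytes.
capacity_bits = 100 * 10**15 * 8   # 100 petabytes in bits
fiber_bps = 10**9                  # a 1 gigabit-per-second connection

seconds_per_year = 365 * 24 * 3600
years_over_fiber = capacity_bits / fiber_bps / seconds_per_year
print(f"Upload over 1 Gbps: ~{years_over_fiber:.1f} years")       # ~25 years

miles_sf_to_nyc = 2_900            # rough driving distance
drive_hours = miles_sf_to_nyc / 65 # ~45 hours at 65 mph
effective_gbps = capacity_bits / (drive_hours * 3600) / 10**9
print(f"Effective truck bandwidth: ~{effective_gbps:,.0f} Gbps")  # roughly 5,000 Gbps
```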

One could make a strong case that semis make America go. And all the money that the government has put into highways and roads certainly helps.

Guidelines for using big data to improve colleges

A group of researchers and other interested parties recently made suggestions about how big data can be used for good within higher education:

To Stevens and others, this massive data is full of promise –­­ but also peril. The researchers talk excitedly about big data helping higher education discover its Holy Grail: learning that is so deeply personalized that it both keeps struggling students from dropping out and pushes star performers to excel…

The guidelines center on four core ideas. The first calls on all players in higher education, including students and vendors, to recognize that data collection is a joint venture with clearly defined goals and limits. The second states that students be told how their data are collected and analyzed, and be allowed to appeal what they see as misinformation. The third emphasizes that schools have an obligation to use data-driven insights to improve their teaching. And the fourth establishes that education is about opening up opportunities for students, not closing them.

While numbers one and two deal with handling the data, numbers three and four discuss the purposes: will the data actually help students in the long run? Such data could serve a lot of interested parties: faculty, administrators, alumni, donors, governments, accreditation groups, and others. I suspect faculty would be worried that administrators would try to squeeze more efficiencies out of the college, donors might want to see what exactly is going on at college, the government could set new regulatory guidelines, etc.

Yet, big data doesn’t necessarily provide quick answers to these purposes even as it might provide insights into broader patterns. Take improving teaching: there is a lot of disagreement over this topic. Or, opening opportunities for students: which ones? Who chooses which options students should have?

One takeaway: big data offers much potential to see new patterns and give decision makers better tools. However, it does not guarantee better or worse outcomes; it can be used well or misused like any other kind of data. I like the idea of getting out ahead of the data to set some common guidelines, but I imagine it will take some time to work out best practices.

Claim that McMansions have proportionally lost resale value

A recent study by Trulia suggests McMansions don’t hold their value:

The premium that buyers can expect to pay for a McMansion in Fort Lauderdale, Fla., declined by 84 percent from 2012 to 2016, according to data compiled by Trulia. In Las Vegas, the premium dropped by 46 percent and in Phoenix, by 42 percent.

Real estate agents don’t usually tag their listings #McMansion, so to compile the data, Trulia created a proxy, measuring the price appreciation of homes built between 2001 and 2007 that have 3,000 to 5,000 square feet. While there’s no single size designation, and plenty of McMansions were built outside that time window, those specifications capture homes built at the height of the trend.

McMansions cost more to build than your average starter ranch home does, and they will sell for more. But the return on investment has dropped like a stone. The additional cash that buyers should be willing to part with to get a McMansion fell in 85 of the 100 largest U.S. metropolitan areas. For example, four years ago a typical McMansion in Fort Lauderdale was valued at $477,000, a 274 percent premium over all other homes in the area. This year, those McMansions are worth about $611,000, or 190 percent more than the rest of the homes on the market.

The few areas in which McMansions are gaining value faster than more tasteful housing stock are located primarily in the Midwest and the eastern New York suburbs that make up Long Island. The McMansion premium in Long Island has increased by 10 percent over the last four years.

Read the Trulia report here.

Interesting claim. After the housing bubble burst, some commentators suggested that Americans should go back to not viewing homes as investments with significant returns. Instead, homes should be viewed as appreciating, but relatively slowly. This article would seem to suggest that return on investment is a key factor in buying a home. How often does this factor into buyers’ decisions compared to other concerns (such as having more space or locating in the right neighborhood)? And just how much of a premium should homeowners expect – is 190% more than the rest of the market not enough?
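For what it is worth, here is how I read the quoted Fort Lauderdale numbers; the 84 percent "decline" looks more like a drop of 84 percentage points, from a 274 percent premium to a 190 percent premium:

```python
# Reading the quoted figures back out; all numbers come from the excerpt above,
# the interpretation is mine.
mcmansion_2012, premium_2012 = 477_000, 2.74   # 274% premium over other homes
mcmansion_2016, premium_2016 = 611_000, 1.90   # 190% premium over other homes

other_2012 = mcmansion_2012 / (1 + premium_2012)   # implied price of other homes, 2012
other_2016 = mcmansion_2016 / (1 + premium_2016)   # implied price of other homes, 2016

print(f"Implied non-McMansion price, 2012: ${other_2012:,.0f}")   # ~$127,500
print(f"Implied non-McMansion price, 2016: ${other_2016:,.0f}")   # ~$210,700
print(f"Premium change: {(premium_2012 - premium_2016) * 100:.0f} percentage points")
```

By that reading, the surrounding homes appreciated considerably faster than the McMansions did, which is a different story than McMansions losing value outright.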

This analysis also appears to illustrate both the advantages and pitfalls of big data. On one hand, sites like Trulia and Zillow can look at home purchases and sales all across the country. Patterns can be found and certain causal factors – such as the local housing market – can be examined. Yet, they are still limited by the parameters of their data collection, which, in this case, restrict the definition of McMansions to homes of a certain size built in a particular time period. As others might attest, big homes aren’t necessarily McMansions unless they also have bad architecture or are teardowns. This sort of analysis would be very difficult to do without big data, but it is not self-evident that such analyses are always worthwhile.
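To be concrete about how blunt that proxy is, the entire definition as I understand it from the excerpt fits in a couple of lines; there is nothing about architecture in it (column names hypothetical):

```python
# The Trulia-style proxy, sketched: a size band plus a build-year window.
import pandas as pd

homes = pd.read_csv("home_sales.csv")   # columns include sqft, year_built, sale_price

is_proxy_mcmansion = (
    homes["sqft"].between(3_000, 5_000)
    & homes["year_built"].between(2001, 2007)
)
print(f"{is_proxy_mcmansion.mean():.1%} of sales fall under the proxy definition")
```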

Using a supercomputer and big data to find stories of black women

A sociologist is utilizing unique methods to uncover more historical knowledge about black women:

Mendenhall, who is also a professor of African American studies and urban and regional planning, is heading up the interdisciplinary team of researchers and computer scientists working on the big data project, which aims to better understand black women’s experience over time. The challenge in a project like this is that documents that record the history of black women, particularly in the slave era, aren’t necessarily going to be straightforward explanations of women’s feelings, resistance, or movement. Instead, Mendenhall and her team are looking for keywords that point to organizations or connections between groups that can indicate larger movements and experiences.

Using a supercomputer in Pittsburgh, they’ve culled 20,000 documents that discuss black women’s experience from a 100,000 document corpus (collection of written texts). “What we’re now trying to do is retrain a model based on those 20,000 documents, and then do a search on a larger corpus of 800,000, and see if there are more of those documents that have more information about black women,” Mendenhall added…

Using topic modeling and data visualization, they have started to identify clues that could lead to further research. For example, according to Phys.Org, finding documents that include the words “vote” and “women” could indicate black women’s participation in the suffrage movement. They’ve also preliminarily found some new texts that weren’t previously tagged as by or about black women.

Next up Mendenhall is interested in collecting and analyzing data about current movements, such as Black Lives Matter.

It sounds like this involves putting together a good algorithm to do pattern recognition that would take humans far too long to do by hand. This can only be done with good programming as well as a significant collection of texts (I sketch what such a workflow might look like after the questions below). Three questions come quickly to mind:

  1. How would one report findings from this data in typical outlets for sociological or historical research?
  2. How easy would it be to apply this to other areas of inquiry?
  3. Is this data mining or are there hypotheses that can be tested?
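As for what the classify-then-search step might look like in practice, here is a rough sketch with hypothetical file names and a generic scikit-learn model standing in for whatever the team actually runs on the supercomputer:

```python
# A hedged sketch: train a text classifier on documents already identified as relevant,
# then score a much larger corpus to surface candidates for human review. This is an
# illustration of the general workflow, not the project's actual code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = pd.read_csv("labeled_documents.csv")   # columns: text, relevant (0/1)
corpus = pd.read_csv("larger_corpus.csv")        # column: text (e.g., 800,000 documents)

vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
X_train = vectorizer.fit_transform(labeled["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, labeled["relevant"])

# Score the larger corpus and keep the most likely matches for closer reading.
corpus["score"] = clf.predict_proba(vectorizer.transform(corpus["text"]))[:, 1]
candidates = corpus.sort_values("score", ascending=False).head(1_000)
candidates.to_csv("candidate_documents.csv", index=False)
```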

There are lots of possibilities like this with big data, but it remains to be seen how useful they will prove for research.