Nations vying for big data hegemony

Big data is out there – but who will control it or oversee it?


The rise of Big Data—the vast digital output of daily life, including data Google and Facebook collect from their users and convert into advertising dollars—is now a matter of national security, according to some policymakers. The fear is that China is vacuuming up data about the U.S. and its citizens not just to steal secrets from U.S. companies or to influence citizens but also to build the foundation of technological hegemony in the not-too-distant future. Data—lots of it, the more the better—has, along with the rise of artificial intelligence, taken on strategic importance…

Broad fears of technological hegemony may be overblown, some policy experts say. And harsh measures against China could alienate allies and trigger a rash of similarly harsh measures by countries abroad toward U.S. tech firms.

In any case, the U.S. is in an exceedingly weak position to lead a moral crusade for the sanctity of data. The concept of harvesting clicks, text, internet addresses and other data from unsuspecting citizens and exploiting them for commercial and national-security ends was invented in the halls of the National Security Agency, the CIA and the tech startups of Silicon Valley. Facebook (now Meta), Google, Amazon, Microsoft and Apple currently lead a vast industry based on trading and compiling user data. Taking measures to protect the data of American citizens from the ravages of Silicon Valley would go a long way to protecting them from China, too. Any measures directed solely against China would likely be ineffective because vast troves of consumer data would still be available for purchase on secondary data markets…

Whatever the case, some suggest the world is already moving inexorably towards a bipolar digital world—a move that will only accelerate as the burgeoning race for AI dominance between China and America picks up steam.

So data becomes just another arena in which powerful nations fight? Does data, with all of its potential and pitfalls, simply become an instrument of national power?

There could be other options here, though it might be hard to know whether any of them would be preferable to states controlling big data:

  1. In the hands of users. Move data toward consumers and individuals rather than leaving it in the hands of, or accessible to, nations and corporations.
  2. In the hands of corporations. They often generate and collect much of this data and operate across nations and contexts.
  3. In the hands of some other neutral actors. Such actors may not exist yet, or may lack much power, but could they emerge in the future?

This bears watching: it could go well or badly, and either outcome would have wide consequences.

Facebook releases big data to researchers outside the company

Researchers can now access a large dataset of Facebook sharing activity:

Social Science One is an effort to get the Holy Grail of data sets into the hands of private researchers. That Holy Grail is Facebook data. Yep, that same unthinkably massive trove that brought us Cambridge Analytica.

In the Foo Camp session, Stanford Law School’s Nate Persily, cohead of Social Science One, said that after 20 months of negotiations, Facebook was finally releasing the data to researchers. (The researchers had thought all of that would be settled in two months.) A Facebook data scientist who worked on the team dedicated to this project beamed in confirmation. Indeed, the official announcement came a few days later…

This is a new chapter in the somewhat tortured history of Facebook data research. The company hires top data scientists, sociologists, and statisticians, but their primary job is not to conduct academic research, it’s to use research to improve Facebook’s products and promote growth. These internal researchers sometimes do publish their findings, but after a disastrous 2014 Facebook study that involved showing users negative posts to see if their mood was affected, the company became super cautious about what it shared publicly. So this week’s data drop really is a big step in transparency, especially since there’s some likelihood that the researchers may discover uncomfortable truths about the way Facebook spreads lies and misinformation.

See the codebook here and the request for proposals to use the data here. According to the RFP, the data involves shared URLs and who interacted with those links:

Through Social Science One, researchers can apply for access to a unique Facebook dataset to study questions related to the effect of social media on democracy. The dataset contains approximately an exabyte (a quintillion bytes, or a billion gigabytes) of raw data from the platform, a total of more than 10 trillion numbers that summarize information about 38 million URLs shared more than 100 times publicly on Facebook (between 1/1/2017 and 7/31/2019).  It also includes characteristics of the URLs (such as whether they were fact-checked or flagged by users as hate speech) and the aggregated data concerning the types of people who viewed, shared, liked, reacted to, shared without viewing, and otherwise interacted with these links. This dataset enables social scientists to study some of the most important questions of our time about the effects of social media on democracy and elections with information to which they have never before had access.
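The codebook defines the actual schema, but as a rough mental model, each row is an aggregated summary of one widely shared URL rather than a record of individual user activity. Here is a minimal sketch of what such a record might look like; the field names are my own invention for illustration, not the dataset’s real columns.

```python
from dataclasses import dataclass

# Hypothetical illustration only: these field names are invented for this sketch
# and do not come from the Social Science One codebook.
@dataclass
class UrlRecord:
    url_id: str                      # identifier for one widely shared URL
    share_count: int                 # public shares (>= 100 for inclusion)
    view_count: int                  # aggregated views of posts containing the URL
    share_without_click_count: int   # shares where the link was not opened first
    fact_checked: bool               # whether the URL was fact-checked
    flagged_as_hate_speech: bool     # whether users flagged it
    first_shared: str                # date within the 1/1/2017-7/31/2019 window

record = UrlRecord("url_000001", 1042, 250_000, 310, True, False, "2018-03-14")
print(f"Share-without-viewing rate: {record.share_without_click_count / record.share_count:.2%}")
```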

Now to see what social scientists can do with the data. The emphasis appears to be on democracy, political posts, and misinformation but given what is shared on Facebook, I imagine there are connections to numerous other topics.

Linking nicer cars to a suburb on the rise

From the Australian suburbs: one insider suggests that the appearance of nicer cars in driveways signals good prospects for the suburban community.

The gentrification of the driveway happens before the gentrification of a suburb, says the boss of a data analytics company.

Upmarket vehicles beginning to appear in the carports and garages of houses is often a forerunner of a suburb on the rise, as renovators move in...

When more models such as a BMW X5 or an Audi SUV begin appearing in the driveway of houses and apartments in particular suburban streets, it is a reliable predictor of a suburb undergoing gentrification and becoming much more popular with renovators. Extra investment in community infrastructure often followed, and there was a broad flow on to higher property prices…

He said households who were taking out a loan for $500,000 to buy a rundown home in an up-and-coming area were often also purchasing a $30,000 to $40,000 car to fit the aspirational lifestyle.

The article chalks this up as a big data insight: bringing together multiple pieces of information helped reveal the relationship. I can see how this information might help investors, but it is less clear how it would help residents or local governments.

More broadly, this gets at something my dad always said: look at the cars in driveways, on the street, or in parking spots and it gives you a sense of the people who live there. In societies that prize cars, such as in the United States and Australia and particularly their suburbs, a vehicle becomes an important social marker. The one-to-one relationship might not always work as some people buy more expensive cars than their housing might indicate and vice versa (recall the stories of millionaires driving old reliable cars). Yet, on the whole, people of different social classes drive different vehicles in varying states of repair. Hence, various brands aim at different segments of the market. Famously, General Motors did this early in the 20th century with five different car lines to appeal to different kinds of buyers.

UPDATE: I probably did not contribute to this upward trend with long-term ownership of a Toyota Echo. But, it looked good for its age.

 

Collecting big data the slow way

One of the interesting side effects of the era of big data is finding out how much information is not actually automatically collected (or is at least not available to the general public or researchers without paying money). A quick example from the work of sociologist Matthew Desmond:

The new data, assembled from about 83 million court records going back to 2000, suggest that the most pervasive problems aren’t necessarily in the most expensive regions. Evictions are accumulating across Michigan and Indiana. And several factors build on one another in Richmond: It’s in the Southeast, where the poverty rates are high and the minimum wage is low; it’s in Virginia, which lacks some tenant rights available in other states; and it’s a city where many poor African-Americans live in low-quality housing with limited means of escaping it.

According to the Eviction Lab, here is how they collected the data:

First, we requested a bulk report of cases directly from courts. These reports included all recorded information related to eviction-related cases. Second, we conducted automated record collection from online portals, via web scraping and text parsing protocols. Third, we partnered with companies that carry out manual collection of records, going directly into the courts and extracting the relevant case information by hand.

In other words, it took a lot of work to put together such a database: various courts, websites, and companies held different pieces of information, and it took researchers to access all of that data and stitch it together.
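The second collection method described in the quote above, automated collection from online portals, might look roughly like the sketch below. The portal URL, page markup, and field names here are entirely hypothetical; the Eviction Lab’s actual scrapers and parsers are certainly more involved.

```python
# Minimal sketch of automated record collection via web scraping, assuming a
# hypothetical court portal that lists eviction cases in an HTML table.
# The URL, markup structure, and column order are invented for illustration.
import csv
import requests
from bs4 import BeautifulSoup

PORTAL_URL = "https://example-court-portal.test/evictions?page={page}"  # hypothetical

def scrape_page(page: int) -> list[dict]:
    """Fetch one results page and parse each case row into a dict."""
    html = requests.get(PORTAL_URL.format(page=page), timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    cases = []
    for row in soup.select("table.case-list tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 4:
            cases.append({
                "case_number": cells[0],
                "filing_date": cells[1],
                "defendant_address": cells[2],
                "outcome": cells[3],
            })
    return cases

if __name__ == "__main__":
    all_cases = [case for page in range(1, 6) for case in scrape_page(page)]
    with open("raw_cases.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_number", "filing_date",
                                               "defendant_address", "outcome"])
        writer.writeheader()
        writer.writerows(all_cases)
```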

Without a researcher or a company or government body explicitly starting to record or collect certain information, a big dataset on that particular topic will not happen. Someone or some institution, typically with resources at its disposal, needs to set a process into motion. And simply having the data is not enough; it needs to be cleaned up so it all works with the other pieces. Again, from the Eviction Lab:

To create the best estimates, all data we obtained underwent a rigorous cleaning protocol. This included formatting the data so that each observation represented a household; cleaning and standardizing the names and addresses; and dropping duplicate cases. The details of this process can be found in the Methodology Report (PDF).
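Here is a minimal sketch of that kind of cleaning pass, assuming pandas and invented column names; the real protocol is spelled out in the Methodology Report.

```python
# Sketch of a cleaning pass like the one described above: standardize names and
# addresses, drop duplicates, and collapse records so each row represents a
# household. Column names are assumptions, not the Eviction Lab's schema.
import pandas as pd

raw = pd.read_csv("raw_cases.csv")

# Standardize free-text fields so the same household matches across sources.
raw["defendant_name"] = (raw["defendant_name"].str.upper()
                                              .str.strip()
                                              .str.replace(r"\s+", " ", regex=True))
raw["defendant_address"] = (raw["defendant_address"].str.upper()
                                                    .str.replace(r"\bSTREET\b", "ST", regex=True)
                                                    .str.replace(r"\bAVENUE\b", "AVE", regex=True))

# Drop exact duplicate filings, then keep one row per household per filing date.
cleaned = (raw.drop_duplicates()
              .groupby(["defendant_name", "defendant_address", "filing_date"],
                       as_index=False)
              .first())

cleaned.to_csv("cases_clean.csv", index=False)
```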

All of this work can yield a fascinating dataset of roughly 83 million records on an important topic.

We are probably still a ways off from a scenario where this information would automatically become part of a dataset. This data had a definite start and required much work. There are many other areas of social life that require similar efforts before researchers and the public have big data to examine and learn from.

When the candidate with the big data advantage didn’t win the presidency

Much was made of the effective use of big data by Barack Obama’s campaigns. That analytic advantage didn’t help the Clinton campaign:

Clinton can be paranoid and self-destructively self-protective, but she’s also capable of assessing her own deficiencies as a politician in a bracingly clear-eyed way. And the conclusion that she drew from her 2008 defeat was essentially an indictment of her own management style: Eight years earlier, she had personally presided over a talented, sloppy, squabbling, sprawling menagerie of pals, longtime advisers and hangers-on who somehow managed to bungle the building of a basic political infrastructure to oppose Obama’s efficient, data-driven operation.

To do so, Mook hired a buddy who had helped Terry McAuliffe squeak out a win in the 2014 Virginia governor’s race: Elan Kriegel, a little-known data specialist who would, in many ways, exert more influence over the candidate than any of the all-star team of veteran consultants. Kriegel’s campaign-within-a-campaign conducted dozens of targeted surveys—to test messaging and track voter sentiment day-by-day, especially in battleground states—and fed them into a computer algorithm, which ran hundreds of thousands of simulations that were used to steer ad spending, the candidate’s travel schedule, even the celebrities Clinton would invite to rallies.

The data operation, five staffers told me, was the source of Mook’s power within the campaign, and a source of perpetual tension: Many of Clinton’s top consultants groused that Mook and Kriegel withheld data from them, balking at the long lead time—a three-day delay—between tracking reports. A few of them even thought Mook was cherry-picking rosy polling to make the infamously edgy Clinton feel more confident…

In numerous interviews conducted throughout the campaign, Clinton staffers attested to Mook’s upbeat attitude and mastery of detail. But, in the end, Brooklyn simply failed to predict the tidal wave that swamped Clinton—a pro-Trump uprising in rural and exurban white America that wasn’t reflected in the polls—and his candidate failed to generate enough enthusiasm to compensate with big turnouts in Detroit, Milwaukee and the Philadelphia suburbs.

It would be fascinating to hear more. The pollsters didn’t get it right – but neither did the Clinton campaign internally?

The real question is what this will do to future campaigns. Was Donald Trump’s lack of campaign infrastructure and reliance on celebrity and media coverage (also highlighted nicely in the article above) something that others can or will replicate? Or, would the close margins in this recent presidential election highlight even more the need for finely-tuned data and microtargeting? I’m guessing the influence of big data in campaigns will only continue, but data will only get you so far if (1) it isn’t great data in the first place and (2) people don’t know how to use it well.
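The article does not describe Kriegel’s model itself. Purely to illustrate the general approach of feeding tracking polls into repeated simulations, here is a toy Monte Carlo sketch; the states, margins, and error assumptions are placeholders, not anything from the campaign.

```python
# Toy Monte Carlo sketch of the general idea: simulate many elections from noisy
# polling averages and use the resulting win probabilities to prioritize resources.
# All states, margins, and the error size below are placeholder data.
import random

battlegrounds = {          # state: (polling margin in points, electoral votes)
    "StateA": (1.5, 20),
    "StateB": (-0.5, 16),
    "StateC": (3.0, 10),
}
POLL_ERROR_SD = 3.0        # assumed standard deviation of polling error
N_SIMULATIONS = 100_000

wins_by_state = {state: 0 for state in battlegrounds}
for _ in range(N_SIMULATIONS):
    for state, (margin, _votes) in battlegrounds.items():
        simulated_margin = margin + random.gauss(0, POLL_ERROR_SD)
        if simulated_margin > 0:
            wins_by_state[state] += 1

for state, wins in wins_by_state.items():
    print(f"{state}: win probability {wins / N_SIMULATIONS:.2f}")
```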

Let Amazon’s big data tractor trailer drive to you

Americans like big trucks and hard drive space, so why not put the two together?

Amazon announced the new service, confusingly named Snowmobile, at its Re:Invent conference in Las Vegas this week. It’s designed to shuttle as many as 100 petabytes–around 100,000 terabytes–per truck. That’s enough storage to hold five copies of the Internet Archive (a comprehensive backup of the web both present and past), which contains “only” about 18.5 petabytes of unique data...

Using multiple semis to shuttle data around might seem like overkill. But for such massive amounts of data, hitting the open road is still the most efficient way to go. Even with a one gigabit per-second connection such as Google Fiber, uploading 100 petabytes over the internet would take more than 28 years. At an average speed of 65 mph, on the other hand, you could drive a Snowmobile from San Francisco to New York City in about 45 hours—about 4,970 gigabits per second. That doesn’t count the time it takes to actually transfer the data onto Snowmobile–which Amazon estimates will take less than 10 days–or from the Snowmobile onto Amazon’s servers. But all told, that still makes the truck much, much faster. And because Amazon has data centers throughout the country, your data probably won’t need to travel cross-country anyway.
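The back-of-the-envelope arithmetic is easy to reproduce. The sketch below uses decimal petabytes, which gives roughly 25 years for the upload and just under 5,000 gigabits per second for the truck; the article’s “more than 28 years” figure appears to assume binary (power-of-two) petabytes.

```python
# Back-of-the-envelope comparison of shipping 100 PB by truck vs. uploading it.
# Uses decimal units (1 PB = 10**15 bytes); assuming binary petabytes instead
# raises the upload time by about 12%, closer to the article's figure.
DATA_BYTES = 100 * 10**15          # 100 petabytes
DATA_BITS = DATA_BYTES * 8

UPLOAD_BPS = 10**9                 # 1 gigabit per second (e.g., Google Fiber)
upload_seconds = DATA_BITS / UPLOAD_BPS
print(f"Upload at 1 Gbps: {upload_seconds / (3600 * 24 * 365):.1f} years")

DRIVE_HOURS = 45                   # SF to NYC at ~65 mph, per the article
effective_bps = DATA_BITS / (DRIVE_HOURS * 3600)
print(f"Truck, drive time only: {effective_bps / 10**9:,.0f} gigabits per second")

LOAD_DAYS = 10                     # Amazon's estimate for loading the Snowmobile
total_seconds = (DRIVE_HOURS + LOAD_DAYS * 24) * 3600
print(f"Including ~10 days of loading: {DATA_BITS / total_seconds / 10**9:,.0f} Gbps effective")
```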

One could make a strong case that semis make America go. And all the money that the government has put into highways and roads certainly helps.

Guidelines for using big data to improve colleges

A group of researchers and other interested parties recently made suggestions about how big data can be used for good within higher education:

To Stevens and others, this massive data is full of promise –­­ but also peril. The researchers talk excitedly about big data helping higher education discover its Holy Grail: learning that is so deeply personalized that it both keeps struggling students from dropping out and pushes star performers to excel…

The guidelines center on four core ideas. The first calls on all players in higher education, including students and vendors, to recognize that data collection is a joint venture with clearly defined goals and limits. The second states that students be told how their data are collected and analyzed, and be allowed to appeal what they see as misinformation. The third emphasizes that schools have an obligation to use data-driven insights to improve their teaching. And the fourth establishes that education is about opening up opportunities for students, not closing them.

While numbers one and two deal with handling the data, numbers three and four address its purposes: will the data actually help students in the long run? Such data could serve a lot of interested parties: faculty, administrators, alumni, donors, governments, accreditation groups, and others. I suspect faculty would worry that administrators would try to squeeze more efficiencies out of the college, donors might want to see what exactly is going on at the college, the government could set new regulatory guidelines, and so on.

Yet, big data doesn’t necessarily provide quick answers to these purposes even as it might provide insights into broader patterns. Take improving teaching: there is a lot of disagreement over this topic. Or, opening opportunities for students: which ones? Who chooses which options students should have?

One takeaway: big data offers much potential to reveal new patterns and give decision makers better tools. However, it does not guarantee better or worse outcomes; it can be used well or misused like any other kind of data. I like the idea of getting out ahead of the data to set some common guidelines, but I imagine it will take some time to work out best practices.

Claim that McMansions have proportionally lost resale value

A recent study by Trulia suggests McMansions don’t hold their value:

The premium that buyers can expect to pay for a McMansion in Fort Lauderdale, Fla., declined by 84 percent from 2012 to 2016, according to data compiled by Trulia. In Las Vegas, the premium dropped by 46 percent and in Phoenix, by 42 percent.

Real estate agents don’t usually tag their listings #McMansion, so to compile the data, Trulia created a proxy, measuring the price appreciation of homes built between 2001 and 2007 that have 3,000 to 5,000 square feet. While there’s no single size designation, and plenty of McMansions were built outside that time window, those specifications capture homes built at the height of the trend.

McMansions cost more to build than your average starter ranch home does, and they will sell for more. But the return on investment has dropped like a stone. The additional cash that buyers should be willing to part with to get a McMansion fell in 85 of the 100 largest U.S. metropolitan areas. For example, four years ago a typical McMansion in Fort Lauderdale was valued at $477,000, a 274 percent premium over all other homes in the area. This year, those McMansions are worth about $611,000, or 190 percent more than the rest of the homes on the market.

The few areas in which McMansions are gaining value faster than more tasteful housing stock are located primarily in the Midwest and the eastern New York suburbs that make up Long Island. The McMansion premium in Long Island has increased by 10 percent over the last four years.

Read the Trulia report here.
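Trulia’s proxy can be reconstructed approximately as follows. The size and vintage filters come from the article; the column names and the exact premium formula are my assumptions for this sketch.

```python
# Approximate reconstruction of Trulia's McMansion proxy and premium measure.
# The size/vintage filter comes from the article; the column names and the
# premium formula itself are assumptions for this sketch.
import pandas as pd

def mcmansion_premium(listings: pd.DataFrame) -> float:
    """Return the % price premium of proxy McMansions over all other homes."""
    is_proxy = (listings["year_built"].between(2001, 2007)
                & listings["sqft"].between(3000, 5000))
    mcmansion_value = listings.loc[is_proxy, "price"].median()
    other_value = listings.loc[~is_proxy, "price"].median()
    return (mcmansion_value / other_value - 1) * 100

# Toy example with invented listings
listings = pd.DataFrame({
    "year_built": [2004, 2006, 1978, 1990, 2015],
    "sqft":       [4200, 3600, 1400, 1800, 2200],
    "price":      [611_000, 590_000, 205_000, 215_000, 240_000],
})
print(f"McMansion premium: {mcmansion_premium(listings):.0f}%")
```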

Interesting claim. After the housing bubble burst, some commentators suggested that Americans should go back to not viewing homes as goods with significant returns on investment; instead, homes should be viewed as appreciating, but relatively slowly. This article would seem to suggest that return on investment is a key factor in buying a home. How often does this factor into the decisions of buyers versus other concerns (such as having more space or locating in the right neighborhoods)? And just how much of a premium should homeowners expect – is 190% more than the rest of the market not enough?

This analysis also appears to illustrate both the advantages and pitfalls of big data. On one hand, sites like Trulia and Zillow can look at purchases and sales all across the country. Patterns can be found and certain causal factors – such as the housing market – can be examined. Yet, they are still limited by the parameters of their data collection, which, in this case, severely restricts their definition of McMansions to homes of a certain size built over a particular time period. As others might attest, big homes aren’t necessarily McMansions unless they have bad architecture or are teardowns. This sort of analysis would be very difficult to do without big data, but it is not self-evident that such analyses are always worthwhile.

Using a supercomputer and big data to find stories of black women

A sociologist is utilizing unique methods to uncover more historical knowledge about black women:

Mendenhall, who is also a professor of African American studies and urban and regional planning, is heading up the interdisciplinary team of researchers and computer scientists working on the big data project, which aims to better understand black women’s experience over time. The challenge in a project like this is that documents that record the history of black women, particularly in the slave era, aren’t necessarily going to be straightforward explanations of women’s feelings, resistance, or movement. Instead, Mendenhall and her team are looking for keywords that point to organizations or connections between groups that can indicate larger movements and experiences.

Using a supercomputer in Pittsburgh, they’ve culled 20,000 documents that discuss black women’s experience from a 100,000 document corpus (collection of written texts). “What we’re now trying to do is retrain a model based on those 20,000 documents, and then do a search on a larger corpus of 800,000, and see if there are more of those documents that have more information about black women,” Mendenhall added…

Using topic modeling and data visualization, they have started to identify clues that could lead to further research. For example, according to Phys.Org, finding documents that include the words “vote” and “women” could indicate black women’s participation in the suffrage movement. They’ve also preliminarily found some new texts that weren’t previously tagged as by or about black women.

Next up Mendenhall is interested in collecting and analyzing data about current movements, such as Black Lives Matter.

It sounds like this involves putting together the best possible algorithm for pattern recognition at a scale that would take humans far too long to work through by hand. This can only be done with some good programming as well as a significant collection of texts (a rough sketch of the retrain-and-search loop follows the questions below). Three questions come quickly to mind:

  1. How would one report findings from this data in typical outlets for sociological or historical research?
  2. How easy would it be to apply this to other areas of inquiry?
  3. Is this data mining, or are there hypotheses that can be tested?

There are lots of possibilities like this with big data, but it remains to be seen how useful such approaches will be for research.
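As promised above, here is a rough sketch of the retrain-and-search loop: fit a classifier on the hand-culled documents, then score a much larger corpus to flag likely matches for human review. The example texts, labels, and threshold are placeholders, and the team’s actual models may differ considerably.

```python
# Minimal sketch of a retrain-and-search loop: fit a text classifier on a small
# labeled corpus, then score a much larger corpus to surface likely matches.
# The texts, labels, and threshold are placeholders, not the project's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# labeled_docs: texts known to discuss (1) or not discuss (0) Black women's experience
labeled_docs = ["... document text ...", "... another document ..."]
labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(labeled_docs, labels)

# Score the larger, unlabeled corpus and keep the most likely matches for review.
larger_corpus = ["... unlabeled document ...", "... yet another ..."]
scores = model.predict_proba(larger_corpus)[:, 1]
candidates = [doc for doc, score in zip(larger_corpus, scores) if score > 0.8]
print(f"{len(candidates)} candidate documents flagged for human review")
```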

The first wave of big data – in the early 1800s

Big data may appear to be a recent phenomenon, but the big data of the 1800s allowed for new questions and discoveries:

Fortunately for Quetelet, his decision to study social behavior came during a propitious moment in history. Europe was awash in the first wave of “big data” in history. As nations started developing large-scale bureaucracies and militaries in the early 19th century, they began tabulating and publishing huge amounts of data about their citizenry, such as the number of births and deaths each month, the number of criminals incarcerated each year, and the number of incidences of disease in each city. This was the inception of modern data collection, but nobody knew how to usefully interpret this hodgepodge of numbers. Most scientists of the time believed that human data was far too messy to analyze—until Quetelet decided to apply the mathematics of astronomy…

In the early 1840s, Quetelet analyzed a data set published in an Edinburgh medical journal that listed the chest circumference, in inches, of 5,738 Scottish soldiers. This was one of the most important, if uncelebrated, studies of human beings in the annals of science. Quetelet added together each of the measurements, then divided the sum by the total number of soldiers. The result came out to just over 39 ¾ inches—the average chest circumference of a Scottish soldier. This number represented one of the very first times a scientist had calculated the average of any human feature. But it was not Quetelet’s arithmetic that was history-making—it was his answer to a rather simple-seeming question: What, precisely, did this average actually mean?

Scholars and thinkers in every field hailed Quetelet as a genius for uncovering the hidden laws governing society. Florence Nightingale adopted his ideas in nursing, declaring that the Average Man embodied “God’s Will.” Karl Marx drew on Quetelet’s ideas to develop his theory of Communism, announcing that the Average Man proved the existence of historical determinism. The physicist James Maxwell was inspired by Quetelet’s mathematics to formulate the classical theory of gas mechanics. The physician John Snow used Quetelet’s ideas to fight cholera in London, marking the start of the field of public health. Wilhelm Wundt, the father of experimental psychology, read Quetelet and proclaimed, “It can be stated without exaggeration that more psychology can be learned from statistical averages than from all philosophers, except Aristotle.”

Is it a surprise, then, that sociology emerged in the same period, with greater access to data on societies in Europe and around the globe? Many of us are so used to having data and information at our fingertips that it is easy to forget what a revolution this must have been: large-scale data within stable nation-states opened up all sorts of possibilities.