Using a supercomputer and big data to find stories of black women

A sociologist is using new computational methods to uncover more historical knowledge about black women:

Mendenhall, who is also a professor of African American studies and urban and regional planning, is heading up the interdisciplinary team of researchers and computer scientists working on the big data project, which aims to better understand black women’s experience over time. The challenge in a project like this is that documents that record the history of black women, particularly in the slave era, aren’t necessarily going to be straightforward explanations of women’s feelings, resistance, or movement. Instead, Mendenhall and her team are looking for keywords that point to organizations or connections between groups that can indicate larger movements and experiences.

Using a supercomputer in Pittsburgh, they’ve culled 20,000 documents that discuss black women’s experience from a 100,000 document corpus (collection of written texts). “What we’re now trying to do is retrain a model based on those 20,000 documents, and then do a search on a larger corpus of 800,000, and see if there are more of those documents that have more information about black women,” Mendenhall added…
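
As a rough illustration of the "retrain a model, then search a larger corpus" step described above, here is a minimal sketch in Python. The pipeline (TF-IDF plus logistic regression), the placeholder texts, and the probability cutoff are all assumptions for illustration, not the team's actual model:

```python
# Sketch: fit a relevance classifier on already-labeled documents, then score
# a larger corpus to surface new candidates for human review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder stand-ins for the ~20,000 labeled documents (1 = relevant, 0 = not).
labeled_docs = [
    "meeting of the colored women's club to organize relief work",  # placeholder text
    "notice of farm equipment for sale in the county",              # placeholder text
]
labels = [1, 0]

vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X_train = vectorizer.fit_transform(labeled_docs)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, labels)

# Placeholder stand-ins for the larger ~800,000-document corpus to be searched.
larger_corpus = [
    "letter describing a church meeting organized by local women",  # placeholder text
    "railroad timetable and freight rates",                         # placeholder text
]
X_new = vectorizer.transform(larger_corpus)
scores = model.predict_proba(X_new)[:, 1]  # estimated probability each doc is relevant

# Keep the most likely candidates for human review.
candidates = [doc for doc, s in zip(larger_corpus, scores) if s > 0.8]
```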

Using topic modeling and data visualization, they have started to identify clues that could lead to further research. For example, according to Phys.Org, finding documents that include the words “vote” and “women” could indicate black women’s participation in the suffrage movement. They’ve also preliminarily found some new texts that weren’t previously tagged as by or about black women.
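
The keyword co-occurrence idea mentioned above ("vote" and "women" appearing in the same document) can be sketched very simply; the corpus and keyword list here are placeholders, not the project's data:

```python
# Sketch: flag documents in which all of a set of keywords appear together.
def has_all_keywords(text: str, keywords=("vote", "women")) -> bool:
    tokens = set(text.lower().split())
    return all(k in tokens for k in keywords)

corpus = {
    "doc_001": "the women of the ward met to discuss the vote",  # placeholder
    "doc_002": "market prices for cotton and corn this season",  # placeholder
}

suffrage_candidates = [doc_id for doc_id, text in corpus.items() if has_all_keywords(text)]
print(suffrage_candidates)  # -> ['doc_001']
```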

Next up, Mendenhall is interested in collecting and analyzing data about current movements, such as Black Lives Matter.

It sounds like this involves putting together an algorithm to do pattern recognition at a scale that would take humans far too long to work through by hand. This can only be done with good programming as well as a sizable collection of texts. Three questions come quickly to mind:

  1. How would one report findings from this data in typical outlets for sociological or historical research?
  2. How easy would it be to apply this to other areas of inquiry?
  3. Is this data mining or are there hypotheses that can be tested?

There are lots of possibilities like this with big data, but it remains to be seen how useful it might be for research.

The first wave of big data – in the early 1800s

Big data may appear to be a recent phenomenon, but the big data of the 1800s allowed for new questions and discoveries:

Fortunately for Quetelet, his decision to study social behavior came during a propitious moment in history. Europe was awash in the first wave of “big data” in history. As nations started developing large-scale bureaucracies and militaries in the early 19th century, they began tabulating and publishing huge amounts of data about their citizenry, such as the number of births and deaths each month, the number of criminals incarcerated each year, and the number of incidences of disease in each city. This was the inception of modern data collection, but nobody knew how to usefully interpret this hodgepodge of numbers. Most scientists of the time believed that human data was far too messy to analyze—until Quetelet decided to apply the mathematics of astronomy…

In the early 1840s, Quetelet analyzed a data set published in an Edinburgh medical journal that listed the chest circumference, in inches, of 5,738 Scottish soldiers. This was one of the most important, if uncelebrated, studies of human beings in the annals of science. Quetelet added together each of the measurements, then divided the sum by the total number of soldiers. The result came out to just over 39 ¾ inches—the average chest circumference of a Scottish soldier. This number represented one of the very first times a scientist had calculated the average of any human feature. But it was not Quetelet’s arithmetic that was history-making—it was his answer to a rather simple-seeming question: What, precisely, did this average actually mean?

Scholars and thinkers in every field hailed Quetelet as a genius for uncovering the hidden laws governing society. Florence Nightingale adopted his ideas in nursing, declaring that the Average Man embodied “God’s Will.” Karl Marx drew on Quetelet’s ideas to develop his theory of Communism, announcing that the Average Man proved the existence of historical determinism. The physicist James Maxwell was inspired by Quetelet’s mathematics to formulate the classical theory of gas mechanics. The physician John Snow used Quetelet’s ideas to fight cholera in London, marking the start of the field of public health. Wilhelm Wundt, the father of experimental psychology, read Quetelet and proclaimed, “It can be stated without exaggeration that more psychology can be learned from statistical averages than from all philosophers, except Aristotle.”
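
In modern notation, the calculation Quetelet performed on the Scottish soldiers' data is simply the arithmetic mean of the N = 5,738 chest measurements reported in the excerpt above:

\[
\bar{x} \;=\; \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad N = 5{,}738, \qquad \bar{x} \approx 39.75 \text{ inches}
\]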

Is it a surprise, then, that sociology emerges in the same period, with greater access to data on societies in Europe and around the globe? Many of us are so used to having data and information at our fingertips that it is easy to forget what a revolution this must have been: large-scale data collected within stable nation-states opened up all sorts of possibilities.

Cruz campaign using psychological data to reach potential voters

Campaigns not working with big data are behind: Ted Cruz’s campaign is working with unusual psychological data as it tries to secure the Republican nomination.

To build its data-gathering operation widely, the Cruz campaign hired Cambridge Analytica, a Massachusetts company reportedly owned in part by hedge fund executive Robert Mercer, who has given $11 million to a super PAC supporting Cruz. Cambridge, the U.S. affiliate of London-based behavioral research company SCL Group, has been paid more than $750,000 by the Cruz campaign, according to Federal Election Commission records.

To develop its psychographic models, Cambridge surveyed more than 150,000 households across the country and scored individuals using five basic traits: openness, conscientiousness, extraversion, agreeableness and neuroticism. A top Cambridge official didn’t respond to a request for comment, but Cruz campaign officials said the company developed its correlations in part by using data from Facebook that included subscribers’ likes. That data helped make the Cambridge data particularly powerful, campaign officials said…

The Cruz campaign modified the Cambridge template, renaming some psychological categories and adding subcategories to the list, such as “stoic traditionalist” and “true believer.” The campaign then did its own field surveys in battleground states to develop a more precise predictive model based on issues preferences.

The Cruz algorithm was then applied to what the campaign calls an “enhanced voter file,” which can contain as many as 50,000 data points gathered from voting records, popular websites and consumer information such as magazine subscriptions, car ownership and preferences for food and clothing.
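
Pulling these pieces together, here is a minimal sketch of what assigning a voter to a psychographic category could look like: a Big Five trait vector compared against category "prototypes." The numbers, the prototypes, and the nearest-prototype rule are purely illustrative assumptions, not Cambridge Analytica's or the campaign's actual method; only the trait names and the two category labels come from the reporting above.

```python
# Sketch: assign a voter (Big Five scores on a 0-1 scale) to the closest
# hypothetical category prototype by squared distance.
TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]

# Hypothetical prototypes, one score per trait in the order above.
PROTOTYPES = {
    "stoic traditionalist": [0.2, 0.9, 0.3, 0.5, 0.3],
    "true believer":        [0.4, 0.6, 0.7, 0.6, 0.5],
}

def closest_category(voter_scores):
    """Return the category whose prototype is nearest to the voter's scores."""
    def dist(proto):
        return sum((v - p) ** 2 for v, p in zip(voter_scores, proto))
    return min(PROTOTYPES, key=lambda name: dist(PROTOTYPES[name]))

print(closest_category([0.3, 0.8, 0.2, 0.5, 0.4]))  # -> 'stoic traditionalist'
```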

Building a big data operation behind a major political candidate seems pretty par for the course these days. The success of the Obama campaigns was often attributed to tech whizzes behind the scenes. Since this is fairly normal now, perhaps we need to move on to other questions: What do voters think about such microtargeting, and how do they experience it? Does it contribute to political fragmentation? What is the role of the mass media amid more targeted approaches? How valid are the predictions of voter behavior (since they are based on particular social science data and theories)? How does all of this significantly change political campaigns?

How far are we from just getting rid of the candidates altogether and putting together AI apps/machines/data programs that garner support…


Subjective decisions can affect home appraisals

The final appraisal price for a home can be influenced by numerous subjective factors:

A massive, first-of-its-kind study of 1.3 million individual appraisal reports from 2012 through this year conducted by real estate analytics firm CoreLogic offers a suggestion: You should look at what are called adjustments to appraisals that involve relatively subjective estimations — the appraiser’s opinions on the overall quality level of your house, its condition, location and view — rather than more objectively determinable items such as living space square footage, lot size, number of baths and bedrooms, etc…

Adjustments are made in 99.8 percent of all appraisals, according to the CoreLogic study. The most frequent adjustments involve objective features of houses: Living area, rooms, car storage, porch and deck were all adjusted in more than 50 percent of the study’s 1.3 million appraisals, according to CoreLogic. (As a rule, the adjustments on objective features were not large in dollar terms. For example, room adjustments were made in nearly three-quarters of all appraisals but averaged only $2,246 and did not affect the final appraised value dramatically.)

Adjustments involving more-subjective matters — the overall quality or condition of the house — were less common, but they typically triggered much bigger dollar changes. The average adjustment based on quality was nearly $15,000, which is more than enough to complicate a home sale. Some subjective adjustments on the view or location of high-cost homes ran into the hundreds of thousands or even millions of dollars…

Research released last week by Platinum Data Solutions, which reviewed 300,000 appraisals made between July and September, found that fully 39 percent of “quality” or “condition” ratings conflicted with previous ratings on the same property. That inevitably invites controversy.

In other words, appraisals are an inexact science. What makes this particularly frustrating is that the stakes can be big: sellers and buyers are dealing with one of the biggest financial investments of their lives.

Two more thoughts about these findings:

  1. In order to cut down on the variation in findings, would it be better to regularly have multiple appraisers for the same property or some sort of blinded review?
  2. Here is an example of how big data can help reveal patterns across numerous properties and appraisers. But it would be particularly interesting (and perhaps some money could be made) if research identified individual appraisers who consistently came in high or low; a rough sketch of that idea follows below.
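
To make point 2 concrete, here is a minimal sketch of flagging appraisers whose valuations run consistently high or low relative to some benchmark. The columns, the made-up numbers, the choice of sale price as the benchmark, and the 5 percent cutoff are all illustrative assumptions, not anything from the CoreLogic study:

```python
# Sketch: compare appraised values to a benchmark and flag appraisers whose
# average ratio is consistently above or below it.
import pandas as pd

appraisals = pd.DataFrame({
    "appraiser_id":    ["A1", "A1", "A2", "A2", "A3", "A3"],
    "appraised_value": [310_000, 205_000, 298_000, 190_000, 330_000, 220_000],
    "sale_price":      [300_000, 200_000, 300_000, 200_000, 300_000, 200_000],
})

appraisals["ratio"] = appraisals["appraised_value"] / appraisals["sale_price"]
by_appraiser = appraisals.groupby("appraiser_id")["ratio"].agg(["mean", "count"])

# Flag appraisers who average more than 5% above or below the benchmark.
flagged = by_appraiser[(by_appraiser["mean"] > 1.05) | (by_appraiser["mean"] < 0.95)]
print(flagged)
```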

The perils of analyzing big real estate data

Two leaders of Zillow recently wrote Zillow Talk: The New Rules of Real Estate, which is a sort of Freakonomics look at all the real estate data they have. While it is an interesting book, it also illustrates the difficulties of analyzing big data:

1. The key to the book is all the data Zillow has harnessed to track real estate prices and make predictions on current and future prices. They don’t say much about their models. This could be for two good reasons: this is aimed at a mass market and the models are their trade secrets. Yet, I wanted to hear more about all the fascinating data – at least in an appendix?

2. Problems of aggregation: the data is usually analyzed at a metro area or national level. There are hints at smaller markets – a chapter on NYC, for example, and another looking at some unusual markets like Las Vegas – but there are no separate chapters on cheaper/starter homes or luxury homes. An unanswered question: is real estate more similar within markets or across them? Put another way, are the features of the Chicago market distinctively patterned, or are cheaper homes in the Chicago region more like similar homes in Atlanta or Los Angeles than like the more expensive homes in their own market?

3. Most provocative argument: in Chapter 24, the authors suggest that pushing homeownership for lower-income Americans is a bad idea as it can often trap them in properties that don’t appreciate. This was a big problem in the 2000s: Presidents Clinton and Bush pushed homeownership, but after housing values dropped in the late 2000s, poorer neighborhoods were hit hard, leaving many homeowners in default or seriously underwater. Unfortunately, unless demand picks up in these neighborhoods (and gentrification is pretty rare), these homes are not good investments.

4. The individual chapters often discuss small effects that may be significant but don’t have large substantive effects. For example, there is a section on male vs. female real estate agents. The effects for each gender are small: at most, a few percentage points difference in selling price as well as slight variations in speed of sale. (Women are better in both categories: higher prices, faster sales.)

5. The authors are pretty good at repeatedly pointing out that correlation does not mean causation. Yet they don’t catch all of these moments, and at other times they present patterns with axes that distort the comparison. For example, here is a chart from page 202:

[Chart from Zillow Talk, p. 202]

These two things may be correlated (as one goes up, so does the other, and vice versa), but why set the axes so that half-percentage-point increments on one side are compared to five-percentage-point increments on the other? A quick illustration of the distortion follows.
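
Here is a small demonstration with made-up numbers (not the book's data): the same two series look tightly coupled when the secondary y-axis is compressed, and much less so when both series share one scale.

```python
# Sketch: plot two series on mismatched twin axes vs. a shared axis.
import matplotlib.pyplot as plt

years = list(range(2000, 2011))
series_a = [20 + 0.4 * i for i in range(11)]   # moves ~4 points over the decade
series_b = [2 + 0.05 * i for i in range(11)]   # moves ~0.5 points over the decade

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: mismatched axes make the small series appear to track the large one.
ax1.plot(years, series_a)
ax1_twin = ax1.twinx()
ax1_twin.plot(years, series_b, color="tab:orange")
ax1.set_title("Mismatched axes")

# Right: a shared axis shows how different the movements really are.
ax2.plot(years, series_a, label="series A")
ax2.plot(years, series_b, label="series B")
ax2.set_title("Shared axis")
ax2.legend()

plt.tight_layout()
plt.show()
```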

6. Continuing #4, I suppose a buyer and seller would want to use all the tricks they can, but the tips here mean that those in the real estate market are supposed to string together all of these small effects to maximize what they get. On the final page, they write: “These are small actions that add up to a big difference.” Maybe. With margins of error on the effects, some buyers and sellers aren’t going to get the effects outlined here: some will benefit more but some will benefit less.

7. The moral of the whole story? Use data to your advantage even as it is not a guarantee:

In the new realm of real estate, everyone faces a rather stark choice. The operative question now is: Do you wield the power of data to your advantage? Or do you ignore the data, to your peril?

The same is true of the housing market writ large. Certainly, many macro-level dynamics are out of any one person’s control. And yet, we’re better equipped than ever before to choose wisely in the present – to make the kinds of measured judgments that can prevent another coast-to-coast bubble and calamitous burst. (p.252)

In the end, this book is aimed at the mass market where a buyer or seller could hope to string together a number of these small advantages. Yet, there are no guarantees and the effects are often small. Having more data may be good for markets and may make participants feel more knowledgeable (or perhaps more overwhelmed) but not everyone can take advantage of this information.

Using Chicago as a new big data laboratory

University of Chicago sociologist Robert Park once said that the city was a laboratory. A new venture seeks to use Chicago as just that:

On the heels of the University of Chicago’s $1 million Innovation Challenge for urban policy solutions, today’s announcement that UI Labs (“universities and industries”) will open CityWorks, a private R&D partnership that will be based on Goose Island, sets up the city to be a center for urban studies, technology and innovations. Founding partners Microsoft, Accenture, ComEd and Siemens will operate a bit like angel investors, according to Jason Harris, a spokesman for UI Labs. This project will seek to “level up Chicago as a center for the built environment.” The city’s mix of university and industry partners, government leadership and legacy of architecture and design innovation place it in a perfect position for this kind of incubator, according to Harris.

CityWorks wants to seed 6-8 ideas this year, focused on energy, physical infrastructure, transportation and water and sanitation, Harris says (funding amounts aren’t being released). “Our vision is that we have projects that can use the city as a testbed and try out ideas not being tested in other cities,” he says.

CityWorks will award grants to university and private researchers, with a focus on digital planning and the Internet of Things. Chicago is vying to be an important center for this potentially lucrative field. With the recent introduction of the Array of Things, a cutting-edge system of sensors that researchers and computer scientists are hoping will prove the value of real-time, open-source city data, and the recent opening of Uptake, a Brad Keywell-backed startup looking to bring custom data analytics solutions to businesses, the city is well-positioned to become a leader in the field.

I’ll be interested to see what comes out of this. It sounds like the goal is to use big data collected at the city scale to find solutions to urban business issues. I do wonder if this is primarily about making profits or more about addressing urban social problems.

Some might be surprised to see such a project going forward in Chicago. After all, isn’t it a Rust Belt city struggling with big financial problems and violence? At the same time, this project highlights Chicago as a center of innovation (which requires a particular social context), a place where businesses want to locate, and home to a good amount of human capital (in both research interests and educated workers).

Using Google Street View to collect large-scale neighborhood data

One sociologist has plans to use a new Google Street View app to study neighborhoods:

Michael Bader, a professor of sociology at the American University, revealed the app developed is called Computer Assisted Neighborhood Visual Assessment System (CANVAS). The app rated 150 dissimilar features of neighborhoods in some main metropolitan cities in the U.S. The researchers claim the latest app reduces the cost and time in research.

With the help of Google Street View, the new app connects images and creates panoramic views of the required rural areas as well as cities. Bader explains that without the Google app researchers would have to cover many square miles for data collection, which is a painstaking job…

The app has already received funding of around $250,000 and is also supposed to be the first app that examines the scope and reliability of Google Street View when it comes to rating neighborhoods in the U.S.

Bader reveals he is currently using CANVAS for research on the Washington D.C. area. He revealed the population of people aged 65 and over in the region will be 15.3 percent by 2030. Bader hopes to understand why elderly people leave their community and what stops them from spending the remainder of their lives in the region. Bader’s research wants to understand the challenges elders face in Washington D.C.

As an urban sociologist, I think this has a lot of potential: Google Street View contains an incredible amount of visual data and opens up new possibilities for visual sociology. While the tradition in urban sociology might involve long-term study of a single neighborhood (or perhaps a lot of walking within a single city), this offers a way to easily compare street scenes within and across cities.
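
CANVAS is Bader's own tool, but as a rough sketch of the underlying idea (not his actual system), one could pull street-level images for a list of locations programmatically, assuming the Google Street View Static API and an API key; the addresses and parameters here are placeholders:

```python
# Sketch: download Street View images for a set of locations so they can be
# visually rated later.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
BASE_URL = "https://maps.googleapis.com/maps/api/streetview"

addresses = [
    "1600 Pennsylvania Ave NW, Washington, DC",  # example locations
    "233 S Wacker Dr, Chicago, IL",
]

for i, address in enumerate(addresses):
    params = {"size": "640x400", "location": address, "key": API_KEY}
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    with open(f"streetview_{i}.jpg", "wb") as f:
        f.write(resp.content)  # save the image for later visual assessment
```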

Deciphering the words in home listings; “quaint” = 1,299 square feet, “cute” = 1,128 square feet

An analysis of Zillow data looks at the text accompanying real estate listings:

Longer is almost always better, though above a certain length, you didn’t get any added value — you don’t need to write a novel. Over 250 words, it doesn’t seem to matter. Our takeaway was that if you’ve got it, flaunt it. Descriptive words are very helpful. “Stainless steel,” “granite,” “view” and “landscaped” were found in listings that got a higher sales price than comparable homes.

And there are words you should stay away from, especially “nice.” We think that in the American dialect, you say “nice” if you don’t have anything more to say. And then there are the words that immediately tell a buyer that the house is small: When we analyzed the data, we found that homes described as “charming” averaged 1,487 square feet, “quaint” averaged 1,299 square feet, and “cute” averaged 1,128. All of them were smaller than the average house in our sampling.

The impact of words seems to vary by price tier. For example, “spotless” in a lower-priced house seemed to pay off in a 2 percent bonus in the final price, but it didn’t seem to affect more pricey homes. “Captivating” paid off by 6.5 percent in top-tier homes, but didn’t seem to matter in lower-priced ones.

There are certainly codes in real estate listings that are necessary due to the limited space for words. But, as the article notes, some of the words are more precise than others. If someone says they have stainless steel appliances, the potential buyer has a really good idea of what is there. Other words are much more ambiguous: just how “new” are big-ticket items like roofs or flooring or furnaces? The big data of real estate listings allows us to see the patterns tied to these words. Just remember the order for size: cute is small, quaint is slightly larger, and charming slightly bigger still.
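
A minimal sketch of this kind of keyword tabulation looks something like the following; the listings data frame and keyword list are made up for illustration, not Zillow's data:

```python
# Sketch: average home size for listings whose descriptions contain a keyword.
import pandas as pd

listings = pd.DataFrame({
    "description": [
        "cute bungalow close to the park",
        "quaint cottage with updated kitchen",
        "charming colonial with landscaped yard",
        "spacious home with granite counters and view",
    ],
    "sqft": [1_100, 1_300, 1_500, 2_400],
})

for word in ["cute", "quaint", "charming"]:
    mask = listings["description"].str.contains(word, case=False)
    print(word, round(listings.loc[mask, "sqft"].mean()))
```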

If I’m Zillow, is it time to sell this info to select real estate professionals?

Mapping every road in the United States

The United States has a lot of roads and you can see them all on these state and national maps:

Roads, it turns out, are fantastic indicators of geographies, as evidenced by Fathom’s All Streets series of posters. A few years ago the Boston design studio released All Streets, a detailed look at all the streets in the United States. The team has since produced a set of All Streets for individual states and countries.

Using data from the U.S. Census Bureau’s TIGER/Line data files (Open Maps for other countries), the designers are able to paint a clear picture of where our infrastructure bumps into nature-made dead ends. In states like North Dakota and Iowa you see flat expanses of grids. Nebraska has a dense set of roads in the east near its more populated cities that dissipates as you head west towards the rural Sandhills prairie. A dark spot near the southern tip of Nevada punctuates the otherwise desert-heavy state, conversely, the Adirondack mountain range provides an expanse of white in a dark stretch of New York roads.

I find the smaller maps, or smaller-scale views, more interesting because they show some real differences. Looking at the national level doesn’t reveal all that much because we are now used to such images built from infrastructure and big data, whether cell phone coverage, interstates, lights seen from space, or population distributions.
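
Fathom built its posters with its own tools, but as a rough sketch of the same idea, one could draw a single state's roads from a downloaded TIGER/Line shapefile with geopandas; the file name below (Illinois primary and secondary roads) is a placeholder for whatever roads file you download:

```python
# Sketch: plot a state's roads from a Census TIGER/Line shapefile.
import geopandas as gpd
import matplotlib.pyplot as plt

roads = gpd.read_file("tl_2023_17_prisecroads.shp")  # placeholder path to a downloaded roads shapefile

ax = roads.plot(linewidth=0.2, color="black", figsize=(8, 10))
ax.set_axis_off()  # drop the axes so only the road network shows
plt.savefig("illinois_roads.png", dpi=300, bbox_inches="tight")
```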

I could see hanging one of these – perhaps the Illinois version?

I like the band name “Big Data”

Band names can often reflect societal trends, so I’m not surprised that a group selected the name Big Data. I like the name: it sounds current and menacing. I’ve only heard their songs a few times on the radio, so I’ll have to reserve judgment on what they have actually created.

It might be interesting to think of what sociological terms and ideas could easily translate into good band names. One term that sometimes intrigues my intro students – interactional vandalism – could work. Conspicuous consumption? Cultural lag? Differential association? The culture industry? Impression management? The iron cage? Social mobility? The hidden curriculum?