Claim: we see more information today so we see more “improbable” events

Are more rare events happening in the world or are we just more aware of what is going on?

In other words, the more data you have, the greater the likelihood you’ll see wildly improbable phenomena. And that’s particularly relevant in this era of unlimited information. “Because of the Internet, we have access to billions of events around the world,” says Len Stefanski, who teaches statistics at North Carolina State University. “So yeah, it feels like the world’s going crazy. But if you think about it logically, there are so many possibilities for something unusual to happen. We’re just seeing more of them.” Science says that uncovering and accessing more data will help us make sense of the world. But it’s also true that more data exposes how random the world really is.
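To make the point concrete, here is a minimal back-of-the-envelope sketch in Python. The 1-in-10,000 probability and the trial counts are invented for illustration, not drawn from the article.

```python
# Back-of-the-envelope illustration: a 1-in-10,000 event is individually rare,
# but observe enough independent trials and seeing at least one becomes nearly certain.
p = 1 / 10_000  # assumed probability of the rare event on any single trial

for n in (1, 1_000, 100_000, 1_000_000):  # number of trials we get to observe
    at_least_one = 1 - (1 - p) ** n       # P(at least one occurrence in n trials)
    print(f"n = {n:>9,}: P(at least one rare event) = {at_least_one:.4f}")
```

With one trial the rare event is a curiosity; with a million trials (roughly, a world's worth of news feeds) it is a near certainty.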

Here is an alternative explanation for why all these rare events seem to be happening: we are bumping up against our limited ability to predict the full complexity of the world.

All of this, though, ignores a more fundamental and unsettling possibility: that the models were simply wrong. That the Falcons were never 99.6 percent favorites to win. That Trump’s odds never fell as low as the polling suggested. That the mathematicians and statisticians missed something in painting their numerical portrait of the universe, and that our ability to make predictions was thus inherently flawed. It’s this feeling—that our mental models have somehow failed us—that haunted so many of us during the Super Bowl. It’s a feeling that the Trump administration exploits every time it makes the argument that the mainstream media, in failing to predict Trump’s victory, betrayed a deep misunderstanding about the country and the world and therefore can’t be trusted.

And it may not be easy to reconcile these two explanations:

So: Which is it? Does the Super Bowl, and the election before it, represent an improbable but ultimately-not-confidence-shattering freak event? Or does it indicate that our models are broken, that—when it comes down to it—our understanding of the world is deeply incomplete or mistaken? We can’t know. It’s the nature of probability that it can never be disproven, unless you can replicate the exact same football game or hold the same election thousands of times simultaneously. (You can’t.) That’s not to say that models aren’t valuable, or that you should ignore them entirely; that would suggest that data is meaningless, that there’s no possibility of accurately representing the world through math, and we know that’s not true. And perhaps at some point, the world will revert to the mean, and behave in a more predictable fashion. But you have to ask yourself: What are the odds?
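The replication point can also be made with a toy simulation. This is not any actual forecasting model; the win probabilities and replay counts are invented to show why a single upset cannot distinguish a well-calibrated forecast from a badly wrong one.

```python
import random

# Toy simulation: how often does the "favorite" lose under two different claimed
# win probabilities? Only across thousands of replays do the models separate.
random.seed(0)

def count_upsets(win_probability: float, n_replays: int = 10_000) -> int:
    """Count simulated replays in which the favorite loses."""
    return sum(random.random() > win_probability for _ in range(n_replays))

for claimed in (0.996, 0.75):
    print(f"claimed win probability {claimed}: favorite loses "
          f"{count_upsets(claimed)} times in 10,000 replays")
```

A single observed loss is consistent with either line of output, which is exactly the sense in which a probabilistic forecast cannot be falsified by one game or one election.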

I know there is a lot of celebration of having so much information available today, but it isn't necessarily easy to adjust to the change. Taking it all in requires effort on its own; the harder work is in interpreting it and knowing what to do with it all.

Perhaps a class in statistics – in addition to existing efforts involving digital or media literacy – could help many people better understand all of this.

Richard Florida: we lack systematic data to compare cities

As he considers Jane Jacobs’ impact, Richard Florida suggests we need more data about cities:

MCP: Some of the research around the built environment is pretty skimpy and not very scientific, in a lot of cases.

RF: Right. And it's done by architects who are terrific, but are basically looking at it from the building level. We need a whole research agenda. A century or so ago Johns Hopkins University invented the teaching hospital and modern medicine. They said medicine could be advanced by underpinning the way doctors treat people and develop clinical methodologies with a solid scientific research base. Think of it as a system that runs from laboratory to bedside. We don't have that for cities and urbanism. But at the same time we know that the city is the key economic and social unit of our time. Billions of people across the world are pouring into cities and we are spending trillions upon trillions of dollars building new cities and rebuilding, expanding and upgrading existing ones. We're doing it with little in the way of systematic research. We lack even the most basic data we need to compare and assess cities around the world. There's no comparable grand challenge that we have so terribly underfunded as cities and urbanism. We need to develop everything from the underlying science to better understand cities and their evolution, to the systematic data to assess them, to the educational and clinical protocols for building better, more prosperous and inclusive cities. Right now, mayors are out there winging it. Economic developers are out there winging it. There's no clinical training program. There are some, actually, but they're scattered about and they're not having much impact. It's going to take a big commitment. But we need to build the equivalent of the medical research infrastructure, with the equivalent of "teaching hospitals" for our cities. When you think of it, cities are our greatest laboratories for advancing our understanding of the intersection of natural, physical, social and human environments—they're our most complex organisms. This is going to be my next big research project: I'm calling it the Urban Genome Project. It's what I hope to devote the rest of my career to.

The cities-as-laboratories language echoes that of the Chicago School. But much of the sociological literature suggests a basic tension in this area: how much are cities alike, and how much are they different? Are there common processes across most or all cities that we can highlight and work with, or do their unique contexts limit how much generalizing can be done? Hence, we have a range of studies, from analyses of large sets of cities or of processes said to operate across all cities (as Florida argues in The Rise of the Creative Class) to studies of particular neighborhoods and cities aimed at uncovering their idiosyncratic patterns.

Of course, we could just look at cities like a physicist might and argue there are power laws underlying cities…
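For what that physicist-style view looks like, here is a rough sketch of the rank-size (Zipf) relationship often claimed for city populations. The figures below are round illustrative placeholders, not official counts, and the exponent of 1 is the textbook simplification.

```python
# Rank-size (Zipf) rule of thumb: the k-th largest city has population of roughly P1 / k.
# Populations here are round illustrative numbers, not official census figures.
populations = sorted([8_400_000, 3_900_000, 2_700_000, 2_300_000, 1_600_000],
                     reverse=True)

largest = populations[0]
for rank, population in enumerate(populations, start=1):
    zipf_prediction = largest / rank  # Zipf prediction with exponent 1
    print(f"rank {rank}: actual {population:>10,}  predicted {zipf_prediction:>12,.0f}")
```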

Social science assumes “human living is not random”

Noted sociologist of religion Grace Davie gives a brief description of her work:

My work, like that of all social scientists, rests on the assumption that human living is not random. Why is it, for example, that Christian churches in the West are disproportionately attended by women? That requires an explanation.

This is a good starting point for describing the social sciences. There are patterns to human social life, and we can't rely on anecdotes or personal impressions to tell us whether those patterns exist or how to understand them. We want to apply a scientific perspective to these patterns and explain why they, and not others, exist. From there, we might delve deeper into levels of analysis, theoretical assumptions, and techniques of data collection and analysis – three areas where the various social science disciplines differ.

Researchers fact-checking their own ethnographic data

Toward the end of a long profile of sociologist Matthew Desmond is an interesting section regarding ethnographic methods:

Desmond has done an especially good job spelling out precisely how he went about his research and verified his findings, says Klinenberg. At the start of Evicted, an author’s note states that most of the events in the book took place between May 2008 and December 2009. Except where it says otherwise in the notes, Desmond writes, all events that happened between those dates were observed firsthand. Every quotation was “captured by a digital recorder or copied from official documents,” he adds. He also hired a fact-checker who corroborated the book by combing public records, conducting some 30 interviews, and asking him to produce field notes that verified a randomly selected 10 percent of its pages.

Desmond has been equally fastidious about taking himself out of the text. Unlike many ethnographic studies, including Goffman’s, his avoids the first person. He wants readers to react directly to the people in Evicted. “Ethnography often provokes very strong feelings,” he says. “So I wanted the book to do that. But not about me.”

Ethnographers should be more skeptical about their data, Desmond believes. In his fieldwork, for example, he saw women getting evicted at higher rates than men. But when he crunched the data, analyzing hundreds of thousands of court records, it turned out that was only the case in predominantly black and Latino neighborhoods. Women in white neighborhoods were not evicted at higher rates than men. The field had told him a half-truth.

Still, beyond acknowledging that the reception of Goffman’s book shaped his fact-checking, he will say nothing about the controversy. Even an old journalism trick — letting a silence linger, in the hope that an interviewee will fill it — fails to wring a quote from him. “This is such a good technique,” he says after a few seconds, “where you just kind of let the person talk.” Then he sips his Diet Coke, waiting for the next question.
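Returning to the fact-checking described above: purely as an illustration (not the fact-checker's actual workflow, and with a hypothetical page count), drawing a random 10 percent of pages for field-note verification takes only a few lines.

```python
import random

# Illustrative audit sample: pick a random 10 percent of a book's pages and ask
# the author to produce field notes backing each one. The page count is made up.
total_pages = 350
sample_size = max(1, total_pages // 10)
pages_to_verify = sorted(random.sample(range(1, total_pages + 1), sample_size))
print(pages_to_verify)
```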

This gets at some basic questions about what ethnography is. Should it be participant observation with a reflexive and involved researcher? Letting the research subjects speak for themselves with minimal interpretation? Should it involve fact-checking and verifying data? Each of these approaches has its merits, and sociologists pursue different ones. Contrasting the last two, for example, how people describe their own circumstances and understanding could be very important even if what they report is not strictly true. On the other hand, more and more ethnographies include reflexive commentary from the researcher on how their presence and personal characteristics influenced data collection and interpretation.

It sounds to me like Desmond is doing mixed-methods work: starting with ethnographic data he directly observes and then using secondary analysis (in the example above, of official court records) to better understand both the micro level he observed and the broader patterns. This means more work for each study but also more comprehensive data.
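As a hypothetical sketch of that kind of cross-check (the column names and numbers below are invented, not Desmond's data), comparing eviction rates by gender within neighborhood types might look like this:

```python
import pandas as pd

# Invented data illustrating the check: eviction rates by gender, split by
# neighborhood type, to see whether a pattern observed in the field holds up.
records = pd.DataFrame({
    "neighborhood": ["majority_black", "majority_black",
                     "majority_white", "majority_white"],
    "gender":       ["female", "male", "female", "male"],
    "evictions":    [1_200, 800, 300, 310],
    "renters":      [10_000, 9_500, 8_000, 8_200],
})

records["eviction_rate"] = records["evictions"] / records["renters"]
print(records.pivot(index="neighborhood", columns="gender", values="eviction_rate"))
```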

Census 2020 to go digital and online

The Census Bureau is developing plans to go digital in 2020:

The bureau’s goal is that 55% of the U.S. population will respond online using computers, mobile phones or other devices. It will mark the first time (apart from a small share of households in 2000) that any Americans will file their own census responses online. This shift toward online response is one of a number of technological innovations planned for the 2020 census, according to the agency’s recently released operational plan. The plan reflects the results of testing so far, but it could be changed based on future research, congressional reaction or other developments…

The Census Bureau innovations are driven by the same forces afflicting all organizations that do survey research. People are increasingly reluctant to answer surveys, and the cost of collecting their data is rising. From 1970 to 2010, the bureau’s cost to count each household quintupled, to $98 per household in 2010 dollars, according to the GAO. The Census Bureau estimates that its innovations would save $5.2 billion compared with repeating the 2010 census design, so the 2020 census would cost a total of $12.5 billion, close to 2010’s $12.3 billion price tag (both in projected 2020 dollars)…

The only households receiving paper forms under the bureau’s plan would be those in neighborhoods with low internet usage and large older-adult populations, as well as those that do not respond online.

To maximize online participation, the Census Bureau is promoting the idea that answering the census is quick and easy. The 2010 census was advertised as “10 questions, 10 minutes.” In 2020, bureau officials will encourage Americans to respond anytime and anywhere – for example, on a mobile device while watching TV or waiting for a bus. Respondents wouldn’t even need their unique security codes at hand, just their addresses and personal data. The bureau would then match most addresses to valid security codes while the respondent is online and match the rest later, though it has left the door open to restrict use of this option or require follow-up contact with a census taker if concerns of fraud arise.

Perhaps the marketing slogan could be: “Do the Census online to save your own taxpayer dollars!”

It will be interesting to see how this plays out. I'm sure there will be plenty of tests to make sure that (1) the people responding are matched correctly to their addresses (and that fraud can't be committed); (2) the data collected is as accurate as what door-to-door visits and mailed forms produce; and (3) the technological infrastructure is there to handle all the traffic. Even after going digital, the costs will be high, and I'm guessing more people will ask why all the expense is necessary. Internet response rates to surveys are notoriously low, so it may take a lot of marketing and reminders to get a significant percentage of respondents online.
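On the first point, here is a toy sketch of what matching a submitted address to a pre-assigned security code might involve. This is my own illustration with invented addresses and codes, not the Census Bureau's actual system.

```python
# Toy illustration (not the Census Bureau's system): match a submitted address to a
# pre-assigned code when possible, otherwise queue the response for later matching,
# which is where fraud checks or a census-taker follow-up could come in.
ASSIGNED_CODES = {                      # hypothetical master address file
    "123 Main St, Springfield": "A1B2C3",
    "45 Oak Ave, Centerville":  "D4E5F6",
}
needs_followup = []

def record_response(address: str, answers: dict) -> str:
    code = ASSIGNED_CODES.get(address)
    if code is None:
        needs_followup.append((address, answers))
        return "queued for later matching or follow-up"
    return f"matched immediately to code {code}"

print(record_response("123 Main St, Springfield", {"household_size": 3}))
print(record_response("9 Elm Rd, Lakeview", {"household_size": 2}))
```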

But, if the Census Bureau can pull this off, it could represent a significant change for the Census as well as other survey organizations.

(The full 192-page PDF of the plan is here.)

Using smartphones to collect important economic data

Virginia Postrel describes a new app used in a number of countries to gather economic data on the ground:

Founded in 2012, the San Francisco-based startup Premise began by looking for a way to supplement official price indices with a quick-turnaround measure of inflation and relative currency values. It needed “a scalable, cost-effective way to collect a lot of price data,” chief executive David Soloff said in an interview. The answer was an Android app and more than 30,000 smart-phone-wielding contractors in 32 countries.

The contractors, who are paid by the usable photo and average about $100 a month, take pictures aimed at answering specific economic questions: How do the prices in government-run stores compare to those in private shops? Which brands of cigarette packages in which locations carry the required tax stamp? How many houses are hooked into power lines? What’s happening to food prices? Whatever the question, the data needed to answer it must be something a camera can capture…

The result is a collection of price indices updated much more frequently and with less time lag — although also fewer indicative items — than monthly government statistics. For Bloomberg terminal subscribers, Premise tracks food and beverage prices in the U.S., China, India, Brazil and Argentina, using indices mirroring government statistics. It gets new information daily; Bloomberg publishes new data twice a week. Premise tracks a similar index in Nigeria for Standard Chartered bank, which has made the aggregate data public. (Premise clients can drill down to see differences across products, types of retailers, or regions.) While more volatile than official statistics, the figures generally anticipate them, serving as an early-warning system for economic trends…

Premise has government clients, and it carefully positions its work as a complement to official statistics, as well as to the academic Billion Prices Project, which scrapes massive amounts of price data from online sources but can’t say what cooking oil sells for in a corner shop. Make no mistake, however: Its methods also provide valuable competition to the official data. The point, after all, is to find out what’s actually happening, not what government reports will say in a few weeks.

This is an innovative way to get data more quickly. It would be interesting to see how reliable this data is, and it remains to be seen how markets, governments, and others will use more up-to-date information.
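For a sense of the underlying arithmetic, here is a simplified sketch of turning crowdsourced price observations into an index. This is a generic fixed-basket calculation with invented prices, not Premise's actual methodology.

```python
from collections import defaultdict

# Invented prices: a base-period basket and a batch of new photo-derived observations.
base_prices = {"rice_1kg": 1.00, "cooking_oil_1l": 2.50, "bread_loaf": 0.80}
new_observations = [("rice_1kg", 1.10), ("rice_1kg", 1.05),
                    ("cooking_oil_1l", 2.70),
                    ("bread_loaf", 0.85), ("bread_loaf", 0.82)]

# Average each item's observed price in the current period.
totals, counts = defaultdict(float), defaultdict(int)
for item, price in new_observations:
    totals[item] += price
    counts[item] += 1
current_prices = {item: totals[item] / counts[item] for item in totals}

# Fixed-basket index: cost of the basket now versus the base period (base = 100).
index = 100 * sum(current_prices[item] for item in base_prices) / sum(base_prices.values())
print(f"food price index (base period = 100): {index:.1f}")
```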

More broadly, smartphones could be used to collect all sorts of data. See previous posts on using the microphone and the use of additional apps such as Twitter and Waze.

“The most misleading charts of 2015, fixed”

Here are fixed versions of some of the misleading charts put forward by politicians, advocacy groups, and the media in 2015.

I’m not sure exactly how they picked “the most misleading charts” (is there bias in this selection?) but it is interesting that several involve a misleading y-axis. I’m not sure that I would count the last example as a misleading chart since it involves a definition issue before getting to the chart.

And what is the purpose of the original, poorly done graphics? Changing the presentation of the data provides evidence for a particular viewpoint. Change the graphic depiction of the data and another story could be told. Unfortunately, it is actions like these that tend to cast doubt on the use of data for making public arguments – the data is simply too easy to manipulate so why rely on data at all? Of course, that assumes people look closely at the chart and the data source and know what questions to ask…
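To see how much a single axis choice changes the story, here is a small example with made-up numbers that plots the same two values with a truncated y-axis and with a zero baseline.

```python
import matplotlib.pyplot as plt

# Same made-up data, two axis choices: a truncated y-axis makes a roughly 3 percent
# difference look dramatic; a zero baseline keeps it in proportion.
labels, values = ["Before", "After"], [6.0, 6.2]

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

ax_truncated.bar(labels, values)
ax_truncated.set_ylim(5.9, 6.3)
ax_truncated.set_title("Truncated y-axis (misleading)")

ax_full.bar(labels, values)
ax_full.set_ylim(0, 7)
ax_full.set_title("Zero baseline (in proportion)")

plt.tight_layout()
plt.show()
```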

Quick Review: The Third Coast: When Chicago Built the American Dream

Thomas Dyja has a provocative argument in The Third Coast: while New York and LA are widely viewed as America’s cultural centers, Chicago of the mid-1900s contributed more than people think to American culture. My quick review of the book:

  1. The fact that the book is built on impressionistic vignettes is both its greatest strength and its greatest weakness. Dyja tells a number of interesting stories about cultural figures in Chicago, from author Nelson Algren to Bauhaus member László Moholy-Nagy to University of Chicago president Robert Hutchins to puppeteer and TV show creator Burr Tillstrom to magazine creator Hugh Hefner. The characters he profiles have highs and lows, but they are all marked by a sort of middle-America creativity based on hard work, connecting with audiences, and not being flashy.
  2. Yet stringing together a set of characters doesn't help him make his larger argument that Chicago was influential. We get pieces of evidence – an important contribution to television here, the importance of Chess Records, a clear contribution to architecture there – but no comparative element. Where Dyja gives little attention, he implicitly suggests Chicago didn't contribute much – art is one such area, with no vibrant modern art scene (though the museum that TripAdvisor ratings call the world's #1 gets little space). Just how much did these activities in Chicago change the broader American culture? What was going on in New York and LA at the same time? The evidence is anecdotal and difficult to judge.
  3. A few of the more interesting pieces of the book: Dyja suggests Chicago contributed more to the Civil Rights Movement than many people remember (particularly through the Emmett Till case); Chicago music, especially that of Muddy Waters and Howlin' Wolf, was particularly influential elsewhere; and Mayor Richard J. Daley was supportive of the arts but only in a functional sense, while the arts scene slowly died away into the early 1960s as creative types went elsewhere.

Ultimately, it is hard to know whether these contributions from Chicago really mattered or not. The one that gets the most attention – architecture through former members of the Bauhaus and then the International Style – probably was a major contribution to both American and global cities. But even there, the focus of the book is on the people, not necessarily on their buildings, on how ordinary Chicagoans experienced those structures, or on how the changes fit within the larger social, political, and economic scene in Chicago.

Debate over data on the mental fragility of college students

A recent study suggests there is a need for more data to claim that today’s college students are more fragile:

The point, overall, is that given the dizzying array of possible factors at work here, it’s much too pat a story to say that kids are getting more “fragile” as a result of some cultural bugaboo. “I think it’s not only an oversimplification, I think it’s unfair to the kids, many of whom are very hardworking and tremendously diligent, and working in systems that are often very competitive,” said Schwartz. “Many of the kids are doing extraordinarily well, and I think it’s unfair to portray this whole group of people as being somehow weakhearted or weak-minded in some sense, when there’s no evidence to really support it.”

It hasn’t gone unnoticed among those who study college mental health that there’s an interesting divide at work here: College counselors are so convinced kids’ mental health is getting worse that it’s become dogma in some quarters, and yet it’s been tricky to find any solid, rigorous evidence of this. Some researchers have tried to dig into counseling-center data in an attempt to explain this discrepancy. One recent effort, published in the October issue of the Journal of College Student Psychopathology, comes from Allan J. Schwartz, a psychiatry professor at the University of Rochester who has devoted a chunk of his career to studying college suicide. Schwartz examined data from “4,755 clients spanning a 15-year period from 1992-2007” at one university, poring over the records to determine whether students who came in contact with that school’s counseling services had, over that period, exhibited increasing levels of distress in the form of suicidality, anxiety and phobic disorders, overall signs of serious mental illness, and other measures. (The same caveat I mentioned above applies here — such a study can only tell us about rates of pathology among kids who go to counseling centers. But it can at least help determine whether counselors are right that among the kids they see every day, things are getting worse.)

Schwartz found no evidence to support the pessimistic view. With the exception of suicidality, where he noted a “significant decline” over the years, every other measure he looked at held stable over the study’s 15-year span. In his paper, Schwartz rightly notes that there are limitations to what we can extrapolate from a study of a single campus. But he goes on to explain that four other, similar studies, published between 1996 and 2007, also sought to track changes in pathology over time in single-university settings, and they too found no empirical evidence that things have been getting worse. This doesn’t definitively prove that kids who seek counseling aren’t getting sicker, of course. But statistically, Schwartz argues, it’s unlikely that five studies looking at different schools would all come up with null findings if, in fact, there was a widespread increase in student pathology overall.
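Schwartz's point can be made concrete with a back-of-the-envelope calculation. The 80 percent power figure below is my assumption for illustration, not a number from his paper.

```python
# If each of five independent studies had an assumed 80 percent chance of detecting
# a real, widespread increase in pathology, the chance that all five would still
# come up empty is tiny.
assumed_power = 0.8
p_all_five_null = (1 - assumed_power) ** 5
print(f"P(all five studies miss a real effect) = {p_all_five_null:.5f}")  # about 0.00032
```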

I don't know this area of research well, but it sounds like there is room for disagreement and/or a need for more definitive data about what is going on among college students.

A broader observation: claims about cultural zeitgeists are not always backed by data. On one hand, perhaps the change is happening so quickly or so far under the radar (it takes time for scientists and others to measure things) that the data simply isn't available yet. On the other hand, claims about trends are often based on anecdotes and particular points of view that break down pretty quickly when compared to the data that is available.

The FBI doesn’t collect every piece of data about crime

The FBI released the 2014 Uniform Crime Report on Monday, but it doesn't include every piece of information we might wish to have:

As I noted in May, much statistical information about the U.S. criminal-justice system simply isn't collected. The number of people kept in solitary confinement in the U.S., for example, is unknown. (A recent estimate suggested that it might be between 80,000 and 100,000 people.) Basic data on prison conditions is rarely gathered; even federal statistics about prison rape are generally unreliable. Statistics from prosecutors' offices on plea bargains, sentencing rates, or racial disparities, for example, are virtually nonexistent.

Without reliable data on crime and justice, anecdotal evidence dominates the conversation. There may be no better example than the so-called “Ferguson effect,” first proposed by the Manhattan Institute’s Heather MacDonald in May. She suggested a rise in urban violence in recent months could be attributed to the Black Lives Matter movement and police-reform advocates…

Gathering even this basic data on homicides—the least malleable crime statistic—in major U.S. cities was an uphill task. Bialik called police departments individually and combed local media reports to find the raw numbers because no reliable, centralized data was available. The UCR is released on a one-year delay, so official numbers on crime in 2015 won’t be available until most of 2016 is over.

These delays, gaps, and weaknesses seem exclusive to federal criminal-justice statistics. The U.S. Department of Labor produces monthly unemployment reports with relative ease. NASA has battalions of satellites devoted to tracking climate change and global temperature variations. The U.S. Department of Transportation even monitors how often airlines are on time. But if you want to know how many people were murdered in American cities last month, good luck.

There could be several issues at play, including:

  1. A lack of measurement ability. Perhaps we have some major disagreements about how to count certain things.
  2. Local law enforcement jurisdictions want some flexibility in working with the data.
  3. A lack of political will to get all this information.

My guess is that the most important issue is #3. If we wanted this data, we could get it. Yet it may take concerted efforts by individuals or groups to make these gaps enough of a recognized social problem that we commit to collecting good data. In other words, the government and/or the public needs a compelling enough reason to insist on uniformity in measurement and consistency in reporting.

How about this reason: consistent and timely reporting of such data would help cut down on anecdotes and keep the American public accurately up to date, allowing people to make more informed political and civic choices. Right now, many Americans don't quite know what is happening with crime rates because their primary sources are anecdotes or mass media reports (which can be quite sensationalistic).