Measuring religious affiliation at the county level and the variation within counties

I was looking at the methodology for the “Where Should You Live?” interactive feature in the New York Times from November 2021 and noticed this section on religion and place:

Why isn’t there a checkbox for ____?

There are many metrics that we wanted to include but for which we couldn’t find data.

Religion was at the top of that list. The Public Religion Research Institute sent us breakdowns of religious affiliation by county. But some counties contain dozens of places. Cook County, for instance, includes Chicago and is home to a large number of Black Protestants. The county also includes Chicago’s northern suburbs, where very few Black people live. Assigning the same statistics to every place within Cook County would have been misleading.

(We did use county- or metropolitan-level statistics for a handful of metrics — but only when we thought values were unlikely to vary significantly within those areas.)

This explanation makes some sense given the data available. Counties can have significant variation within them, particularly when they are large and/or contain many different municipalities. The example of Cook County illustrates the possible variation within one county: not only does the county contain Chicago, it also includes scores of suburbs with a variety of histories and demographics.

On the other hand, it is a shame not to be able to include any measure of religion. People do not necessarily gather with similar religious adherents in their own community; people regularly travel for religious worship and community. There are Black Protestant congregations in Cook County outside of Chicago, even if they are not evenly distributed across the county. Because this religion data is at the county level, perhaps it could be weighted less heavily in the selection of places to live and still be included as a potential factor.
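
To make the suggestion concrete, here is a minimal sketch of how a county-level metric could be down-weighted in an overall place score. The metric names and weights are my own assumptions for illustration, not the Times' actual method:

```python
# Hypothetical sketch: down-weighting a county-level metric in a place score.
# All metric names and weight values are invented for illustration.

def place_score(metrics, weights):
    """Weighted average of normalized metric scores (each in 0..1)."""
    total_weight = sum(weights[m] for m in metrics)
    return sum(metrics[m] * weights[m] for m in metrics) / total_weight

# Place-level metrics get full weight; the county-level religion metric
# gets half weight because it may not describe this specific place well.
weights = {"climate": 1.0, "cost_of_living": 1.0, "religion_county": 0.5}
metrics = {"climate": 0.8, "cost_of_living": 0.6, "religion_county": 0.9}

print(round(place_score(metrics, weights), 3))  # 0.74
```

The county-level number still influences the ranking, but a mismatch between a county and a specific place within it does less damage to the result.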

This also speaks to a need for more systematic data on religious affiliation on a smaller scale than counties. This requires a tremendous amount of work and data but it would be a useful research tool.

Searching for well-presented weather forecasts; example of Weather Underground

Finding good, well-presented weather forecasts can be difficult. Which provider supplies accurate information? Should you use a widget, an app, or a website? And who lays the information out best on the screen? I have settled on Weather Underground because they present data in this format when you look at the ten-day forecast:

This is a lot of information in one graphic. Here is what I think it does well:

  1. The top still provides the basic information many people might seek: conditions and high and low temperatures. A quick scan of the top reveals all of this information.
  2. The amount of information available each day is helpfully shown in four sets of graphs below the header information. I do not just get a high and low temperature; I can see this over an hourly chart (no need to click on hourly information). I do not just get a notice about precipitation; I can see when rain or snow will fall. I do not just get a summary of wind speed; I can see if that wind speed is consistent, when it is rising or falling, and the direction.
  3. Connected to #2, it is easy to see patterns across days. Will that rain continue into the next day? Is the temperature spike or drop going to last? The longitudinal predictions are easy to see and I can see more details than just the summary info at the top.
  4. Also connected to #2, I can see how these four different paths of data line up with each other at the same days and times.

In sum, I think Weather Underground does a great job of showing a lot of information in an easy-to-decipher format. This may be too much information for many people, especially if you want quick information for now or the next few hours. But, if you want to think about the next few days and upcoming patterns, this one graphic offers a lot.

Trying to use statistics in a post-evidence political world

Ahead of the presidential debate last night, my Statistics class came up with a short list of guidelines for making sense of the statistics that were sure to be deployed in the discussion. Here is my memory of those strategies:

  1. What is the source of the data?
  2. How was the statistic obtained (sample, questions asked, etc.)?
  3. Is the number unreasonable or too good/too bad to be true?
  4. How is the statistic utilized in an argument or what are the implications of the statistic?

These are good general tips for approaching any statistic utilized in the public realm. Asking good questions about data helps us move beyond accepting all numbers because they are numbers or rejecting all numbers because they can be manipulated. Some statistics are better than others and some are deployed more effectively than others.

But, after watching the debate, I wonder if these strategies make much sense in our particular political situation. Numbers were indeed used by both candidates. This suggests they still have some value. But, it would be easy for a viewer to leave thinking that statistics are not trustworthy. If every number can be debated – methods, actual figures, implications – depending on political view or if every number can be answered with another number that may or may not be related, what numbers can be trusted? President Trump throws out unverified numbers, challenges other numbers, and looks for numbers that boost him.

When Stephen Colbert coined the term “truthiness” in 2005, he hinted at this attitude toward statistics:

Truthiness is tearing apart our country, and I don’t mean the argument over who came up with the word …

It used to be, everyone was entitled to their own opinion, but not their own facts. But that’s not the case anymore. Facts matter not at all. Perception is everything. It’s certainty. People love the President [George W. Bush] because he’s certain of his choices as a leader, even if the facts that back him up don’t seem to exist. It’s the fact that he’s certain that is very appealing to a certain section of the country. I really feel a dichotomy in the American populace. What is important? What you want to be true, or what is true? …

Truthiness is ‘What I say is right, and [nothing] anyone else says could possibly be true.’ It’s not only that I feel it to be true, but that I feel it to be true. There’s not only an emotional quality, but there’s a selfish quality.

Combine numbers with ideology and what statistics mean can change dramatically.

This does not necessarily mean a debate based solely on numbers would lead to clearer answers. I recall debate exchanges in previous years where candidates argued they each had studies to back up their side. In that instance, what is a viewer (who has probably not read any of the studies) to decide? Or, if science is politicized, where do numbers fit? Or, there might be instances where a good portion of the electorate thinks statistics-based arguments are less appropriate than other lines of reasoning. And the issue may not be that people or candidates are innumerate; indeed, they may know numbers all too well and seek to exploit how they are used.

Census cutting short time going door to door

The Census Bureau will not be able to go door to door as long as planned and this could affect the quality of the data at the end:

Attempts by the bureau’s workers to conduct in-person interviews for the census will end on Sept. 30 — not Oct. 31, the end date it indicated in April would be necessary to count every person living in the U.S. given major setbacks from the coronavirus pandemic. Three Census Bureau employees, who were informed of the plans during separate internal meetings Thursday, confirmed the new end date with NPR. All of the employees spoke on the condition of anonymity out of fear of losing their jobs…

Former Census Bureau Director John Thompson warns that with less time, the bureau would likely have to reduce the number of attempts door knockers would make to try to gather information in person. The agency may also have to rely more heavily on statistical methods to impute the data about people living in households they can’t reach.

“The end result would be [overrepresentation] for the White non-Hispanic population and greater undercounts for all other populations including the traditionally hard-to-count,” Thompson wrote in written testimony for a Wednesday hearing on the census before the House Oversight and Reform Committee…

Moving up the end date from Oct. 31 for door knocking is likely to throw the census, already upended by months of delays, deeper into turmoil as hundreds of thousands of the bureau’s door knockers try to figure out how to conduct in-person interviews as many states grapple with growing coronavirus outbreaks in the middle of hurricane season.

Beyond the political football that the Census can be, Census data is important for researchers, residents, and political leaders. Not being able to go through the full data process and having to impute more data means that more of the final counts will need to be estimated. Since the decennial Census tries to get data from every household in the United States, it has some of the most comprehensive data available. Lower counts, less time, COVID-19, political wrangling: one hopes all of this does not undermine useful and important data.
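
For a sense of what imputation means here, consider a minimal sketch: filling in a missing household size with the average size of responding households nearby. This illustrates the general idea only, not the Census Bureau's actual procedure, which is far more sophisticated:

```python
# Minimal sketch of count imputation for unreachable households:
# fill a missing household size with the average of responding
# households. An illustration only, not the Census Bureau's method.

def impute_counts(household_sizes):
    """household_sizes: list of ints (responses) or None (no response)."""
    known = [s for s in household_sizes if s is not None]
    avg = sum(known) / len(known)
    return [s if s is not None else round(avg) for s in household_sizes]

print(impute_counts([2, 4, None, 3, None]))  # [2, 4, 3, 3, 3]
```

The obvious risk is the one Thompson names: if the unreached households differ systematically from the reached ones, the imputed values inherit the bias of whoever did respond.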

Poor Census response rate in neighborhoods with fleeing New Yorkers

Here is another consequence of city residents leaving for other places during COVID-19: absent New York residents are not filling out Census 2020 forms at a good rate.

Only 46 percent of Upper East Side households have filled out their census forms, according to a June 25 report circulated by the Department of City Planning’s chief demographer, Joseph J. Salvo — well below the neighborhood’s final response rate in 2010, and short of the current citywide rate of almost 53 percent…

Even if New Yorkers have asked the Postal Service to forward mail to their second homes, census forms are addressed to the household, not the individual, which — unless New Yorkers pay for premium forwarding — prevents the post office from including them with the forwarded mail…

Officials hope that many of the coronavirus evacuees will return by the end of October, the new extended deadline for final responses to the census. But with Manhattan parents now enrolling children in schools outside the city, it is not clear that the evacuees will return to New York City in time…

The pandemic has prompted census outreach workers to adjust their tactics, especially in trying to reach undocumented immigrants and residents in illegal housing, who may be fearful of sharing information with the government. In the heavily immigrant neighborhoods of North Corona and East Elmhurst, outreach workers have approached New Yorkers while they wait in lines at food distribution sites, for example.

A lot of effort goes into conducting the decennial census, and the data collected is helpful to many. Trying to boost survey response rates in a world awash in data collection is a difficult task even without a global pandemic. But, I imagine this might lead to some interesting lessons about data collection. Researchers need some flexibility in all cases, as circumstances can change and plans may go awry. This could be a helpful story about how a large organization adapted in a difficult situation and maybe even made future data collection more robust.

While the article mentions the potential consequences for New York City, there is another consequence of the movement of people: would these wealthy New Yorkers boost the Census numbers elsewhere, provided that they fill out forms indicating residence in other locations? Granted, they would still have to fill out a Census form, but others might do that for them (if they are living with others), or they might fill one out once they are more settled in.

Using and interpreting alternative data sources to examine COVID-19 impact

In a world full of data, businesses, investors, and others have access to newer sources of information that can provide insights into responses to COVID-19:

For instance, Angus says that monitoring China’s internet throughout the pandemic showed how industrial plants in the worst-affected regions—which operate servers and computers—shut down during the outbreak. In the last few weeks, as the emergency abated, things have started crawling back to normalcy, even if we are still far from pre-Covid-19 levels, and the evidence might be polluted by plants being restarted just to hit government-imposed power consumption targets. “China is not normal yet,” Angus says. The country’s internet latency suggests that “recovery is happening in China, but there are still a lot of people who must be facing at-home-life for their activities.”…

Combining data from vessel transponders with satellite images, he has periodically checked how many oil tankers are in anchorage in China, unable to deliver their cargo—an intimation both of how well China’s ports are functioning amid the pandemic, and of how well industrial production is keeping up.

Madani also relies on TomTom’s road traffic data for various Chinese and Italian cities to understand how they are affected by quarantines and movement restrictions. “What we’ve seen over the past two weeks is a big revival in congestion,” he says. “There’s more traffic going on now in China, in the big cities, apart from Wuhan.”…

Pollution data is another valuable source of information. Over the past weeks, people on Twitter have been sharing satellite images of various countries, showing that pollution levels are dropping across the industrialised world as a result of coronavirus-induced lockdowns. But where working-from-home twitteratis see a poetic silver lining, Madani sees cold facts about oil consumption.

Three quick thoughts:

1. Even with all of this data, interpreting it is still an important task. People can look at similar data and come to different conclusions. Or, they might have access to one set of data but not another and draw different conclusions for that reason. This becomes critical when people today want data-driven responses or want to back up their position with data. Simply having data is not enough.

2. There is publicly available data – with lots of charts and graphs going around in the United States about cases – and then there is data that requires subscriptions, connections, insider information. Who has access to what data still matters.

3. We have more data than ever before and yet this does not necessarily translate into less anxiety or more preparation regarding certain occurrences. Indeed, more information might make things worse for some.

In sum, we can know more about the world than ever before, but we are still working out how to utilize and comprehend amounts of information that would have been unthinkable decades ago.

Finding data by finding and/or guessing URLs

A California high school student is posting new data from 2020 presidential polls before news organizations because he found patterns in their URLs:

How does Rawal do it? He correctly figures out the URL — the uniform resource locator, or full web address — that a graphic depicting the poll’s results appears at before their official release.

“URL manipulation is what I do,” he said, “and I’ve been able to get really good at it because, with websites like CNN and Fox, all the file names follow a pattern.”

He added, “I’m not going to go into more detail on that.”

He said he had just spoken with The Register’s news director, who expressed interest in his helping the newspaper “keep it under tighter wraps.” He is considering it.

This makes sense on both ends: media organizations need a way to organize their files and sites, and someone who looks at the URLs over time could figure out the pattern. Now to see how media organizations respond so as not to let their stories out before they report them.
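
The general technique can be sketched in a few lines: if published files follow a date-based naming pattern, candidate URLs for future releases can be generated and then probed (for example, with HTTP HEAD requests). The base URL and naming pattern below are invented for illustration; Rawal declined to share the real ones:

```python
# Hypothetical sketch of URL pattern guessing. The base URL and the
# filename pattern are invented; real news-site patterns would differ.

from datetime import date, timedelta

def candidate_urls(base, poll_name, start, days):
    """Generate plausible URLs for upcoming dates from a known pattern."""
    urls = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        urls.append(f"{base}/{poll_name}-{d:%Y%m%d}.png")
    return urls

urls = candidate_urls("https://example.com/graphics", "national-poll",
                      date(2020, 2, 1), 3)
print(urls[0])  # https://example.com/graphics/national-poll-20200201.png
```

Each candidate could then be checked for existence; a URL that resolves before the story is published is exactly the leak Rawal describes.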

I imagine there is a broader application for this. Do many organizations have websites or data available that are not linked to, or whose links are not easily found? I could imagine how such hidden/unlinked data could be used for nefarious or less ethical purposes (imagine scooping news releases about soon-to-be-released economic figures in order to buy or sell stocks) as well as for data collection.

Implicating suburban sprawl in the spread of ticks and pathogens

As new tick-borne illnesses spread, sprawl is part of the problem:

But as climate change, suburban sprawl, and increased international travel are putting more ticks and the pathogens they carry in the paths of humans, what’s becoming more urgently apparent is how the US’s tick monitoring systems are not keeping pace.

“It’s really a patchwork in terms of the effort that different areas are putting into surveillance,” says Becky Eisen, a tick biologist with CDC’s Division of Vector-Borne diseases. The federal public health agency maintains national maps of the ranges of different tick species, but they’re extrapolated from scattered data collected in large part by academic researchers. Only a few states, mostly in the Northeast, have dedicated tick surveillance and control programs. That leaves large parts of the country in a data blackout.

To help address that problem the CDC is funding an effort to identify the most urgent gaps in surveillance. It has also begun publishing guidance documents for public health departments on how to collect ticks and test them for diseases, to encourage more consistent data collection across different states and counties.

In an ideal world, says Eisen, every county in the US would send a few well-protected people out into fields and forests every spring and summer, setting traps or dragging a white flannel sheet between them to collect all the ticks making their homes in the grasses and underbrush. Their precise numbers, locations, and species would be recorded so that later on when they get ground up and tested, that DNA would paint a national picture of risk for exposure to every tick-borne pathogen in America. But she recognizes that would be incredibly labor-intensive, and with only so many public funding dollars to go around each year, there are always competing priorities. “But from a research perspective, that’s the kind of repeatable, consistent data we’d really want,” says Eisen. “That would be the dream.”

While there is little direct discussion of sprawl, I wonder if there are two problems at play.

First, sprawl puts more people in interaction with more natural settings. As metropolitan areas expand, more residents end up in higher densities in areas that previously had experienced limited human residence. More people at the wildland urban interface could potentially lead to more problems in both directions: humans can pick up diseases while nature can be negatively impacted by more people.

Second, increasing sprawl means more data needs to be collected because more people are at possible risk. Metropolitan areas (metropolitan statistical areas, according to the Census Bureau) typically expand county by county as outer counties grow in population and develop more ties to the rest of the region. Since many metropolitan regions expand in circles, adding counties at the edges could significantly increase the number of counties that need monitoring. And, as the article notes in closing, finding money to do all of that data collection and analysis is difficult.
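
A back-of-the-envelope model shows why outward expansion adds counties quickly. Treating counties as cells on a hexagonal grid (purely an assumption for illustration), each new ring of a roughly circular region contains six more cells than the previous one, so the total grows faster than the radius:

```python
# Toy model: counties as hexagonal cells in concentric rings around a
# central county. Ring r contains 6*r cells, so each outward step of
# metropolitan expansion adds more counties to monitor than the last.

def counties_in_region(rings):
    """Total hex cells within `rings` rings of a center cell."""
    return 1 + sum(6 * r for r in range(1, rings + 1))

print([counties_in_region(r) for r in range(4)])  # [1, 7, 19, 37]
```

Going from two rings to three nearly doubles the counties needing surveillance, which is the budget problem the article describes.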

Making a horror film about tick-borne illnesses would take some work to make interesting, but these sorts of hidden issues, even while affecting relatively few suburbanites at this point, could cause a lot of anxiety.

Collecting big data the slow way

One of the interesting side effects of the era of big data is finding out how much information is not actually automatically collected (or is at least not available to the general public or researchers without paying money). A quick example from the work of sociologist Matthew Desmond:

The new data, assembled from about 83 million court records going back to 2000, suggest that the most pervasive problems aren’t necessarily in the most expensive regions. Evictions are accumulating across Michigan and Indiana. And several factors build on one another in Richmond: It’s in the Southeast, where the poverty rates are high and the minimum wage is low; it’s in Virginia, which lacks some tenant rights available in other states; and it’s a city where many poor African-Americans live in low-quality housing with limited means of escaping it.

According to the Eviction Lab, here is how they collected the data:

First, we requested a bulk report of cases directly from courts. These reports included all recorded information related to eviction-related cases. Second, we conducted automated record collection from online portals, via web scraping and text parsing protocols. Third, we partnered with companies that carry out manual collection of records, going directly into the courts and extracting the relevant case information by hand.

In other words, it took a lot of work to put together such a database: various courts, websites, and companies had different pieces of information, and it took a researcher to access all of that data and put it together.

Without a researcher or a company or government body explicitly starting to record or collect certain information, a big dataset on that particular topic will not happen. Someone or some institution, typically with resources at its disposal, needs to set a process into motion. And simply having the data is not enough; it needs to be cleaned up so it all works with the other pieces. Again, from the Eviction Lab:

To create the best estimates, all data we obtained underwent a rigorous cleaning protocol. This included formatting the data so that each observation represented a household; cleaning and standardizing the names and addresses; and dropping duplicate cases. The details of this process can be found in the Methodology Report (PDF).
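
A rough sketch of those cleaning steps in code: standardize names and addresses, then drop duplicate cases. The field names and standardization rules below are invented for illustration; the Eviction Lab's Methodology Report describes the real protocol:

```python
# Sketch of record cleaning and deduplication in the spirit of the
# Eviction Lab's protocol. Field names and rules are invented.

def standardize(s):
    """Uppercase, strip periods, and collapse whitespace."""
    return " ".join(s.upper().replace(".", "").split())

def clean_cases(cases):
    """Standardize name/address fields and drop duplicate cases."""
    seen = set()
    cleaned = []
    for case in cases:
        key = (standardize(case["name"]), standardize(case["address"]))
        if key not in seen:          # drop duplicate cases
            seen.add(key)
            cleaned.append({"name": key[0], "address": key[1]})
    return cleaned

raw = [
    {"name": "jane doe", "address": "12 Main St."},
    {"name": "Jane  Doe", "address": "12 main st"},   # same case, messier
]
print(len(clean_cases(raw)))  # 1
```

Multiply this by 83 million records from hundreds of differently formatted court systems and the scale of the effort becomes clear.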

This all can lead to a fascinating dataset of over 83 million records on an important topic.

We are probably still a ways off from a scenario where this information would automatically become part of a dataset. This data had a definite start and required much work. There are many other areas of social life that require similar efforts before researchers and the public have big data to examine and learn from.

New standard and platform for city maps

Maps are important for many users these days and a new open data standard and platform aims to bring all the street data together:

Using giant GIS databases, cities from Boston to San Diego maintain master street maps to guide their transportation and safety decisions. But there’s no standard format for that data. Where are the intersections? How long are the curbs? Where’s the median? It varies from city to city, and map to map.

That’s a problem as more private transportation services flood the roads. If a city needs to communicate street closures or parking regulations to Uber drivers, or Google Maps users, or new dockless bikesharing services—which all use proprietary digital maps of their own—any confusion could mean the difference between smooth traffic and carpocalypse.

And, perhaps more importantly, it goes the other way too: Cities struggle to obtain and translate the trip data they get from private companies (if they can get their hands on it, which isn’t always the case) when their map formats don’t match up.

A team of street design and transportation data experts believes it has a solution. On Thursday, the National Association of City Transportation Officials and the nonprofit Open Transport Partnership launched a new open data standard and digital platform for mapping and sharing city streets. It might sound wonky, but the implications are big: SharedStreets brings public agencies, private companies, and civic hackers onto the same page, with the collective goal of creating safer, more efficient, and democratic transportation networks.
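
What a shared standard buys is that every party describes the same piece of street with the same record. A hypothetical segment record might look like the sketch below; the field names are my invention, and the actual SharedStreets referencing system is considerably more sophisticated:

```python
# Hypothetical street-segment record illustrating the value of a shared
# standard: city, Uber, Google, and bikeshare all reference the same
# segment ID. Field names are invented, not the SharedStreets spec.

from dataclasses import dataclass, asdict

@dataclass
class StreetSegment:
    segment_id: str          # stable ID shared by city and companies
    from_intersection: str
    to_intersection: str
    curb_length_m: float
    closed: bool = False     # e.g., a city-communicated street closure

seg = StreetSegment("seg-001", "int-12", "int-13", curb_length_m=85.5)
print(asdict(seg)["segment_id"])  # seg-001
```

With a common ID, a closure flagged by the city and trip counts reported by a company attach to the same object instead of being translated between incompatible proprietary maps.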

It will be interesting to see whether this step forward simply makes what is currently happening easier to manage or whether it will be a catalyst for new opportunities. In a number of domains, having access to data is necessary before creative ideas and new collaborations can emerge.

This also highlights how more of our infrastructure is entering a digital realm. I assume there are at least a few people who are worried about this. For example, what happens if the computers go down or all the data is lost? Does the digital distance from physical realities – streets are tangible things, not just manipulable objects on a screen – remove us from authentic street life? Data like this may be no substitute for a Jane Jacobs-esque immersion in vibrant blocks.