Can you name “America’s 50 Healthiest Counties for Kids” when you only account for 38% of US counties?

US News & World Report recently released a list of “America’s 50 Healthiest Counties for Kids.” However, there is a problem with the rankings: more than half of American counties aren’t included in the data.

About 1,200 of the nation’s 3,143 counties (a total that takes in county equivalents such as Louisiana’s parishes) were evaluated for the rankings. Many states don’t collect county-level information on residents’ health, whereas populous states, such as California, Florida and New York, tend to gather and report more data. In some counties, the population is so small that the numbers are unreliable, or the few events fall below state or federal reporting thresholds. And because states don’t collect county-level information on childhood smoking and obesity, the rankings incorporated percentages for adults. Catlin says this is justified because more adult smokers mean more children are exposed to secondhand smoke, a demonstrated health risk. Studies have also shown a moderately strong correlation between adult and childhood obesity, she says.

The experts who study community health yearn for more and better data. “We don’t have county-level data on kids with diabetes, controlled or uncontrolled, or on childhood obesity rates,” says Ali Mokdad of the Institute for Health Metrics and Evaluation at the University of Washington. “Almost every kid in this country goes to school. We could measure height and weight, but nobody’s connecting the dots.”

This won’t stop counties high on the list from touting their position. See this Daily Herald article about DuPage County coming in at #20. But there should be some kind of disclaimer on this list if a majority of US counties aren’t even considered. Or perhaps such a list shouldn’t be put together at all.

Methodological issues with the “average” American wedding costing $27,000

Recent news reports suggest the average American wedding costs $27,000. But, there may be some important methodological issues with this figure: selection bias and using an average rather than a median.

The first problem with the figure is what statisticians call selection bias. One of the most extensive surveys, and perhaps the most widely cited, is the “Real Weddings Study” conducted each year by TheKnot.com and WeddingChannel.com. (It’s the sole source for the Reuters and CNN Money stories, among others.) They survey some 20,000 brides per annum, an impressive figure. But all of them are drawn from the sites’ own online membership, surely a more gung-ho group than the brides who don’t sign up for wedding websites, let alone those who lack regular Internet access. Similarly, Brides magazine’s “American Wedding Study” draws solely from that glossy Condé Nast publication’s subscribers and website visitors. So before they do a single calculation, the big wedding studies have excluded the poorest and the most low-key couples from their samples. This isn’t intentional, but it skews the results nonetheless.

But an even bigger problem with the average wedding cost is right there in the phrase itself: the word “average.” You calculate an average, also known as a mean, by adding up all the figures in your sample and dividing by the number of respondents. So if you have 99 couples who spend $10,000 apiece, and just one ultra-wealthy couple splashes $1 million on a lavish Big Sur affair, your average wedding cost is almost $20,000—even though virtually everyone spent far less than that. What you want, if you’re trying to get an idea of what the typical couple spends, is not the average but the median. That’s the amount spent by the couple that’s right smack in the middle of all couples in terms of its spending. In the example above, the median is $10,000—a much better yardstick for any normal couple trying to figure out what they might need to spend.
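To make the mean/median gap concrete, here is a minimal sketch in Python using the hypothetical figures from the example above (not actual survey data):

```python
from statistics import mean, median

# Hypothetical sample: 99 couples spend $10,000 apiece and one
# ultra-wealthy couple spends $1 million.
costs = [10_000] * 99 + [1_000_000]

print(mean(costs))    # 19900 -- pulled up by the single outlier
print(median(costs))  # 10000.0 -- what the typical couple spent
```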

Apologies to those for whom this is basic knowledge, but the distinction apparently eludes not only the media but some of the people responsible for the surveys. I asked Rebecca Dolgin, editor in chief of TheKnot.com, via email why the Real Weddings Study publishes the average cost but never the median. She began by making a valid point, which is that the study is not intended to give couples a barometer for how much they should spend but rather to give the industry a sense of how much couples are spending. More on that in a moment. But then she added, “If the average cost in a given area is, let’s say, $35,000, that’s just it—an average. Half of couples spend less than the average and half spend more.” No, no, no. Half of couples spend less than the median and half spend more.

When I pressed TheKnot.com on why they don’t just publish both figures, they told me they didn’t want to confuse people. To their credit, they did disclose the figure to me when I asked, but this number gets very little attention. Are you ready? In 2012, when the average wedding cost was $27,427, the median was $18,086. In 2011, when the average was $27,021, the median was $16,886. In Manhattan, where the widely reported average is $76,687, the median is $55,104. And in Alaska, where the average is $15,504, the median is a mere $8,440. In all cases, the proportion of couples who spent the “average” or more was actually a minority. And remember, we’re still talking only about the subset of couples who sign up for wedding websites and respond to their online surveys. The actual median is probably even lower.

These are common issues with figures reported in the media. Indeed, these are two questions the average reader should ask when seeing a statistic like the average cost of a wedding:

1. How was the data collected? If this journalist is correct about these wedding cost studies, then the data are likely quite skewed. What we would want is a more representative sample of weddings rather than subscribers or readers volunteering how much their weddings cost.

2. What statistic is reported? Confusing the mean and median is a big problem and pops up with issues as varied as the average vs. median college debt, the average vs. median credit card debt, and the average vs. median square footage of new homes. This journalist is correct to point out that the media should know better and shouldn’t get the two confused. However, reporting a higher average from skewed data tends to make the number more sensationalistic. It also wouldn’t hurt to have more media consumers know the difference and adjust accordingly.

It sounds like the median wedding cost would likely be significantly lower than the $27,000 bandied about in the media if some basic methodological questions were asked.

Is this meaningful data: Chicago the “slowest-growing major city” between 2011 and 2012?

New figures from the Census show that Chicago doesn’t fare well compared to other cities in recent population growth:

Chicago gained nearly 10,000 people from July 2011 to July 2012, but was the slowest-growing major city in the country according to U.S. Census Bureau estimates released Thursday.

It was the second year in a row that population grew here, but the increase so far shows no signs of making up for the loss of 200,000 people over the previous decade…

Among cities with more than one million people, sun-belt metropolises like Dallas, San Antonio, Phoenix, Houston and San Diego all posted gains of more than 1.3 percent, while Chicago grew by little more than one-third of 1 percent.

With a total estimated population of 2,714,856, Chicago held on to its spot as the third largest city. But the two largest cities padded their leads, with New York City adding 67,000 in 2012 and No. 2 Los Angeles gaining 34,000 people.

While I’m sure some will use these figures to judge Chicago’s politics and development efforts, I’m not sure these figures mean anything. Here’s why:

1. The data only cover one year. This is just one time point. The story provides a bit of wider context by referencing the 2000-2010 population figures, but it would also be helpful to know the year-to-year figures for the last several years. In other words, what is the recent trend in Chicago? Is the gain of nearly 10,000 people much different from 2011, 2010, or 2009?

2. These are population estimates, meaning each figure comes with a margin of error. That error might cover a decent amount of the reported population growth in all of these cities; a rough sketch of this kind of check follows below.
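To illustrate, here is a hedged sketch of the comparison one could run. The margins of error here are hypothetical; the Census Bureau publishes actual MOEs alongside its estimates:

```python
def growth_is_distinguishable(pop_start, pop_end, moe_start, moe_end):
    """Return True if the estimated change exceeds the combined
    margin of error of the two estimates (a conservative check)."""
    change = pop_end - pop_start
    # MOEs of roughly independent estimates combine in quadrature.
    combined_moe = (moe_start**2 + moe_end**2) ** 0.5
    return abs(change) > combined_moe

# Chicago's reported gain of ~10,000 people, with hypothetical
# margins of error of 6,000 on each year's estimate:
print(growth_is_distinguishable(2_705_000, 2_714_856, 6_000, 6_000))
# -> True, but with MOEs of 7,000 the gain would be indistinguishable.
```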

In the end, we need more data over time to know whether there are long-term trends going on in these major cities.

Two other interesting notes from the Census data:

1. The population growth in the Sunbelt continues:

Eight of the 15 fastest-growing large U.S. cities and towns for the year ending July 1, 2012 were in Texas, according to population estimates released today by the U.S. Census Bureau. The Lone Star State also stood out in terms of the size of population growth, with five of the 10 cities and towns that added the most people over the year…

No state other than Texas had more than one city on the list of the 15 fastest-growing large cities and towns. However, all but one were in the South or West.

This fits with what Joel Kotkin has been saying for a while.

2. Many Americans continue to live in communities with fewer than 50,000 people:

Of the 19,516 incorporated places in the United States, only 3.7 percent (726) had populations of 50,000 or more in 2012.

However, many of these smaller communities are suburbs near big cities. It’s too bad there aren’t figures here about what percentage of Americans live in those 726 communities of 50,000 or more.

Combining sociology and journalism

The efforts of a hyper-local journalism website in Alhambra, California illustrate an intriguing combination: journalism plus sociology.

This fixation on community interaction is part of the site’s DNA. As city newspapers inexorably decline, a smattering of new “hyperlocal” news outlets have sprung up, from Aol’s Patch network to bootstrap start-ups. But the Source has an unusual ingredient: more than a decade of research by University of Southern California communications expert Sandra Ball-Rokeach and her team…

Ball-Rokeach studies what she calls “communication ecologies”—the web of ways in which different communities get and spread information, from Facebook to the grocery-store bulletin board, from the local tabloid to chatting with neighbors. She’s found that these networks can differ dramatically from community to community, ethnic group to ethnic group…

Understanding those differences is crucial for anyone, be they advertisers or political parties, trying to reach specific communities. Ball-Rokeach believes it’s also important for civic engagement. Strong cities with plugged-in citizens tend to have dense “neighborhood storytelling networks”—crisscrossing lines of media outlets, community groups, and other institutions that hold a running conversation about what it means to live there…

Instead of simply sketching out the usual beats—city council, business, sports—they sent out a team of USC researchers who interviewed and held focus groups with residents in all three local languages. Their exploration showed that residents wanted to know more about education, local businesses, dining and entertainment deals, crime, and traffic and parking. “Many of them just said, ‘We don’t know what’s happening in Alhambra,’” says Ball-Rokeach…

Still, even if the Alhambra Source goes the same way, there’s an intriguing idea in this relationship between newspaper and university. What could embattled major dailies from The Boston Globe to the Los Angeles Times learn about their readers by teaming with sociology grad students? Tailoring a news outlet to reflect its community might not always produce the most in-depth journalism—but it might at least help the news business survive.

It sounds like what sociology and social science bring to the table in this combination is the ability to collect and analyze data. However, it still sounds like this social science research is more about marketing or targeting an audience than anything else. In an era of difficulty for newspapers and other news sources, this is not to be underestimated. But it still puts the social science in more of a marketing role: what do we need to address in order to attract readers? At the same time, I could envision a stronger combination of the two disciplines, where the journalism is much more informed and shaped by research and data rather than anecdotes and single cases, and where sociologists gain another outlet to share their findings and explanations about the social world.

Defining what makes for a luxury home

Here is how one data firm defines what it means to be a luxury housing unit:

Although upscale housing is selling better in some cities than in others, a monthly analysis by the Altos Research data firm for the Institute for Luxury Home Marketing says that overall, that segment of the market is gaining momentum and prices are rising…

Q: “Luxury home” is probably one of the most abused phrases in real estate-ese. How do you define it?

A: A price range that’s considered the high end of the market in one place might be something that’s average in another. So, “luxury” is local: Our organization generally defines it as the top 10 percent of an area’s sales in the past 12 months. But for the purposes of the research that we do with Altos for our monthly Luxury Market Report, we’ve taken the ZIP codes within each of 31 markets that have the highest median prices, and for about five years we’ve tracked the sales of homes in those (areas) that are $500,000 and above.

There are two techniques proposed here:

1. The highest 10 percent of a local housing market. Thus, the prices are all relative and the data is based on the highest end in each place. So, there could be some major differences in luxury prices across zip codes or metropolitan regions.

2. Breaking it down first by geography to the wealthiest places (so this is based on geographic clustering) and then setting a clear cut point at $500,000. In these wealthiest zip codes, wouldn’t most of the units be over $500,000? Why the 31 wealthiest markets and not 20 or 40?

Each of these approaches has strengths and weaknesses, but I imagine the data here could change quite a bit based on which operationalization is utilized.
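For what it's worth, the two definitions are straightforward to compare on any list of sale prices. Here is a minimal sketch with simulated data; the distribution parameters are invented, and the ZIP-code filtering in the second definition is skipped:

```python
import numpy as np

# Simulated sale prices for one market over the past 12 months.
rng = np.random.default_rng(0)
sale_prices = rng.lognormal(mean=12.3, sigma=0.5, size=5_000)

# Definition 1: the top 10 percent of the local market's sales.
top_decile_cutoff = np.percentile(sale_prices, 90)
luxury_relative = sale_prices[sale_prices >= top_decile_cutoff]

# Definition 2: a fixed $500,000 cut point (Altos applies it only
# within the highest-median ZIP codes, which this sketch skips).
luxury_fixed = sale_prices[sale_prices >= 500_000]

print(f"Top-10% cutoff: ${top_decile_cutoff:,.0f}")
print(f"Luxury units: {luxury_relative.size} relative vs {luxury_fixed.size} fixed")
```

On this simulated market the relative definition labels 500 units as luxury while the fixed cutoff captures roughly half as many, which is exactly why the choice of operationalization matters.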

Interestingly, the firm found that luxury sales rebounded quicker than the rest of the market:

The interesting thing about this recovery is that the luxury segment, that group of affluent households, was able to recover fairly quickly. They shifted their assets around, and a lot of them were able to see opportunities in the down market. By 2010, there were almost as many high-end households as before the downturn, not just in the United States, but internationally, as well. This group focused on residential real estate as a pretty desirable asset — for them, a second or third home turned out to be a portfolio play.

This shouldn’t be too surprising – when an economic crisis hits, the wealthier members of society have more of a cushion. While the upper end is doing all right, others have argued that the bottom end of the market, those looking for starter homes, is having a tougher time.

Sociologists = people who look at “boring data compiled during endless research”

If this is how a good portion of the public views what sociologists do, sociologists may be in trouble:

Anthony Campolo is a sociologist by trade, used to looking at boring data compiled during endless research.

Data collection and analysis may not be glamorous, but a statement like this suggests sociologists may have some PR issues. Data collection and analysis are often time consuming and even tedious. But, there are reasons for working so hard to get data and do research: so sociologists can make substantiated claims about how the social world works. Without rigorous methods, sociologists would just be settling for interpretation, opinion, or anecdotal evidence. For example, we might be left with stories like that of a homeless man in Austin, Texas who was “testing” which religious groups contributed more money to him. Of course, his one case tells us little to nothing.

Perhaps this opening sentence should look something like this: time spent collecting and analyzing data will pay off in stronger arguments.


Argument: humans like causation because they like to feel in control

Here is an interesting piece that summarizes some research and concludes that humans like to feel in control and therefore like the idea of causality:

This predisposition for causation seems to be innate. In the 1940s, psychologist Albert Michotte theorized that “we see causality, just as directly as we see color,” as if it is omnipresent. To make his case, he devised presentations in which paper shapes moved around and came into contact with each other. When subjects—who could only see the shapes moving against a solid-colored background—were asked to describe what they saw, they concocted quite imaginative causal stories…

Nassim Taleb noted how ridiculous this is in his book The Black Swan. In the hours after former Iraqi dictator Saddam Hussein was captured on December 13, 2003, Bloomberg News blared the headline, “U.S. TREASURIES RISE; HUSSEIN CAPTURE MAY NOT CURB TERRORISM.” Thirty minutes later, bond prices retreated and Bloomberg altered their headline: “U.S. TREASURIES FALL; HUSSEIN CAPTURE BOOSTS ALLURE OF RISKY ASSETS.” A more correct headline might have been: “U.S. TREASURIES FLUCTUATE AS THEY ALWAYS DO; HUSSEIN CAPTURE HAS NOTHING TO DO WITH THEM WHATSOEVER,” but that isn’t what editors want to post, nor what people want to read.

This trend doesn’t merely manifest itself for stocks or large events. Take scientific studies, for example. Many of the most sweeping findings, ones normally reported in large media outlets, originate from associative studies that merely correlate two variables—television watching and death, for example. Yet headlines—whose functions are partly to summarize and primarily to attract attention—are often written as “X causes Y” or “Does X cause Y?” (I have certainly been guilty of writing headlines in the latter style). In turn, the general public usually treats these findings as cause-effect, despite the fact that there may be no proven causal link between the variables. The article itself might even mention the study’s correlative, not causative, nature, and this still won’t change how it is perceived. Co-workers across the world will still congregate around coffee machines the next day, chatting about how watching The Kardashians is killing you, albeit very slowly.

Humanity’s need for concrete causation likely stems from our unceasing desire to maintain some iota of control over our lives. That we are simply victims of luck and randomness may be exhilarating to a madcap few, but it is altogether discomforting to most. By seeking straightforward explanations at every turn, we preserve the notion that we can always affect our condition in some meaningful way. Unfortunately, that idea is a facade. Some things don’t have clear answers. Some things are just random. Some things simply can’t be controlled.

I like the reference to Taleb here. His books make just this argument: people want to see patterns where they don’t exist and thus are completely unprepared for changes in the stock market, governments, or the natural world. The trick is to know when you can rely on patterns and when you can’t – and Taleb even offers general investment strategies in his most recent book Antifragile that try to minimize losses and maximize potential gains.

I wonder if this isn’t lurking behind the discussion of big data: there are scientists and others who seem to suggest that all we need to understand the world is more data and better pattern recognition tools. If only we could get enough, we could figure things out. But, what if the world turns out to be too complex? What if we can’t know everything about the social or natural world? Does this then change our perceptions of human ingenuity and progress?
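A hedged illustration of the correlation-versus-causation point from the excerpt: in the simulation below, a lurking third variable drives both television watching and mortality risk, producing a strong correlation even though TV has no causal effect at all. All the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Invented confounder (say, poor overall health) that drives both
# hours of television watched and mortality risk; TV itself does nothing.
poor_health = rng.normal(size=n)
tv_hours = 2 + 0.8 * poor_health + rng.normal(scale=0.5, size=n)
mortality_risk = 0.1 + 0.05 * poor_health + rng.normal(scale=0.05, size=n)

# The raw correlation is strong despite zero causal link from TV to risk.
print(np.corrcoef(tv_hours, mortality_risk)[0, 1])  # roughly 0.6
```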

h/t Instapundit

Spreadsheet errors, austerity, ideology, and social science

The graduate student who found some spreadsheet errors in an influential pro-austerity paper discusses what happened. Here is part of the conversation about the process of finding the error:

Q. You say, don’t you, that their use of data was faulty?

A. Yes. The terms we used about their data—”selective” and “unconventional”—are appropriate ones. The reasons for the choices they made needed to be given, and there was nowhere where they were.

Q. And how about their claim that your findings support their thesis that growth slows as debt rises?

A. That is not our interpretation of our paper, at all. If you read their paper, it’s interesting how they handle causality. They waffle between strong and weak claims. The weak claim is that it’s just a negative association. If that’s all they claim, then it’s not really relevant for policy. But they also make a strong claim, more in public than in the paper, that there’s causality going from high debt to drops in growth. They haven’t been obvious about that…

Q. Paul Krugman wrote in The New York Times that your work confirms what many economists have long intuitively thought. Was that your intuition?

A. Yes. I just thought it was counterintuitive when I first saw their claim. It wasn’t plausible.

Q. This is more than a spreadsheet error, then?

A. Yes. The Excel error wasn’t the biggest error. It just got everyone talking about this. It was an emperor-has-no-clothes moment.

This would make for a good case study in a social science methodology class: how much of this is about actual data errors versus different interpretations? You have people clearly staking out space on either side of a policy debate, and it is a bit unclear how much this colors their interpretation of “facts”/data. I suspect some time will help sort this out – if the spreadsheet was indeed wrong, shouldn’t this lead to a correction or a retraction?
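As an aside, the spirit of the Excel slip is easy to reproduce: a formula whose range silently stops short of the last rows. A toy sketch with invented numbers, not the actual study data:

```python
# Invented growth rates for ten high-debt countries; the last five
# rows happen to hold the higher values.
growth_rates = [0.5, -1.2, 0.8, 1.1, -0.4, 3.6, 4.1, 2.9, 3.3, 3.8]

correct_avg = sum(growth_rates) / len(growth_rates)
# An Excel-style range error: the formula covers only the first five rows.
broken_avg = sum(growth_rates[:5]) / len(growth_rates[:5])

print(correct_avg)  # ~1.85
print(broken_avg)   # ~0.16
```

Dropping rows does not just nudge the average here; it changes the substantive story the number appears to tell.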

I do like the fact that the original authors were willing to share their data – this is something that could happen more often in the social sciences and give people the ability to look at the data for themselves.

Census Bureau moving to more online data collection to save money

The US Census Bureau is collecting more information online in order to cut costs:

The Census Bureau already has started offering an Internet option to the 250,000 households it selects every month at random for the American Community Survey. Since becoming available in January, more than half the responses have come in on a secure site that requires codes and PIN numbers.

The bureau expects to use the Internet — plus smart phones and other technologies yet to be invented — for the next decennial census, in 2020.

The increasing reliance on technology is designed to save money. The 2010 Census cost $96 per household, including the American Community Survey that has replaced the old long form. That cost has more than doubled in two decades, up from $70 in 2000 and $39 as recently as 1990…

The Census Bureau spent two years running preliminary experiments in how people responded to American Community Survey questions on the computer screen. Five rounds of testing involved tracking eye movements as people scanned a Web page looking for which answer they wanted to check.

The households selected for the survey still get their first contact the old-fashioned way, with a mailed letter telling them the questionnaire is on its way. Then they receive a letter telling them how to respond over the Internet. If they don’t use that option, they get a 28-page paper form a few weeks later.

It is too bad this may be motivated primarily by money; I would hope it would be motivated more by wanting to collect better data and boost response rates. I’m glad they seem to have done a good amount of testing, but the article fails to address one of the biggest issues with web surveys: can this technique be used widely across different groups in the US population, or does it work best with certain groups (usually younger people with regular Internet access)? All of this is related to how much money can be saved: what percentage of mailed forms or household visits can be eliminated with new techniques? A rough sketch of that trade-off follows below. I would also be interested in hearing more about using smartphones; the Internet may be horribly outdated even today for a certain segment of the population. Imagine a Census 2020 app – used via Google Glass.
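As a back-of-the-envelope way to think about the savings, here is a sketch of expected per-household cost under different response-mode mixes. The per-mode costs are pure assumptions for illustration; the only figure from the article is the roughly $96 per household total in 2010:

```python
# Assumed per-household costs by response mode (illustrative only).
COSTS = {"internet": 10, "mail": 40, "in_person_followup": 150}

def expected_cost(p_internet, p_mail):
    """Expected per-household cost given the share responding by each
    mode; the remainder triggers an in-person follow-up visit."""
    p_followup = 1 - p_internet - p_mail
    return (p_internet * COSTS["internet"]
            + p_mail * COSTS["mail"]
            + p_followup * COSTS["in_person_followup"])

# Half responding online (as with the ACS so far) versus none:
print(expected_cost(0.5, 0.3))   # ~47
print(expected_cost(0.0, 0.55))  # ~89.5
```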


Using algorithms for better realignment in the NHL?

The NHL recently announced realignment plans. However, a group of West Point mathematicians developed an algorithm they argue provides a better realignment:

Well, a team of mathematicians at West Point set out to find an algorithm that could solve some of these problems. In their article posted on the arXiv titled Realignment in the NHL, MLB, the NFL, and the NBA, they explore how to easily construct different team divisions. For example, with the relatively recent move of Atlanta’s hockey team to Winnipeg, the current team alignment is pretty weird (below left), and the NHL has proposed a new 4-division configuration (below right):

Here’s how it works. First, they use a rough approximation for distance traveled by each team (which is correlated with actual travel distances), and then examine all the different ways to divide the cities in a league into geographic halves. You then can subdivide those portions until you get the division sizes you want. However, only certain types of divisions will work, such as not wanting to make teams travel too laterally, due to time zone differences…

Anyway, using this method, here are two ways of dividing the NHL into six different divisions that are found to be optimal:
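Below is a minimal sketch of the recursive halving idea described in the excerpt, not the authors' actual algorithm: split the league along its wider geographic axis, then recurse until divisions reach the target size. The toy coordinates are approximate city locations:

```python
def split(cities, target_size):
    """cities: list of (name, latitude, longitude) tuples."""
    if len(cities) <= target_size:
        return [cities]
    lats = [lat for _, lat, _ in cities]
    lons = [lon for _, _, lon in cities]
    # Split on longitude when the league is wider east-west than
    # north-south, which also limits lateral (time-zone) travel.
    axis = 2 if max(lons) - min(lons) > max(lats) - min(lats) else 1
    ordered = sorted(cities, key=lambda c: c[axis])
    mid = len(ordered) // 2
    return split(ordered[:mid], target_size) + split(ordered[mid:], target_size)

# Toy eight-team league split into divisions of two:
league = [("Chicago", 41.9, -87.6), ("Detroit", 42.3, -83.0),
          ("New York", 40.7, -74.0), ("Montreal", 45.5, -73.6),
          ("Winnipeg", 49.9, -97.1), ("Denver", 39.7, -105.0),
          ("Los Angeles", 34.1, -118.2), ("San Jose", 37.3, -121.9)]
for division in split(league, 2):
    print([name for name, _, _ in division])
```

Even this crude version tends to keep time zones intact, which is roughly the constraint the West Point team imposes.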

My first thought when looking at the algorithm’s realignment plans is that they are based less on time zones and more on regions like the Southwest, Northwest, Central, Southeast, North, and Northeast.

But here is where I think the demands of the NHL don’t quite line up with the algorithm’s goal of minimizing travel. The grouping of sports teams is often dependent on historic patterns, rivalries, and when teams entered the league. For example, the NHL realignment plans generated a lot of discussion in Chicago because they meant that the long rivalry between the Chicago Blackhawks and the Detroit Red Wings would end. In other words, there is cultural baggage to realignment that can’t be solved with statistics alone. Data loses out to narratives.

Another way an algorithm could redraw the boundaries: spread the winning teams across the league. Which teams are really good tends to be cyclical, but occasionally leagues end up with multiple good teams in a single division or an imbalance of power between conferences. Why not spread out teams by record, which would give strong teams a better chance to meet in the finals and give other teams in those stacked divisions or conferences a chance to make the playoffs?