Exactly how many American homes are vacant?

Two bloggers have a disagreement about how many vacant homes there are in the United States. Check out the debate and the comments below.

The moral of the story: one still needs to interpret statistics and know exactly what they are measuring. The difference between 11% and 2% is quite a lot: the first figure suggests 1 out of 10 housing units is vacant while the second suggests it is 1 out of 50. If you look at Table 1 of this Census Bureau release regarding housing figures from Quarter 4, it looks like the vacancy rate is 2.7%. But there may be confusion based on Table 3, which suggests the vacancy rate for all year-round housing units is roughly 11%. Later in the release, page 11 of the document gives the formula for the vacancy calculation and an explanation: “The homeowner vacancy rate is the proportion of the homeowner inventory that is vacant for sale.”
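
To make the distinction concrete, here is a quick sketch (with made-up, illustrative counts rather than the actual Census figures) of how the two definitions use different numerators and denominators and so produce very different rates:

```python
# Why 2.7% and roughly 11% can both be called "the vacancy rate":
# the figures use different numerators and denominators.
# The counts below are illustrative, not the actual Census numbers.
vacant_for_sale   = 2_000_000
owner_occupied    = 73_000_000
all_vacant_units  = 14_000_000     # vacant for any reason
all_housing_units = 130_000_000

# Homeowner vacancy rate (the definition quoted above): the share of the
# homeowner inventory (owner-occupied plus vacant-for-sale units) that is
# vacant and for sale.
homeowner_vacancy = vacant_for_sale / (owner_occupied + vacant_for_sale)

# Gross vacancy rate: the share of all housing units that are vacant.
gross_vacancy = all_vacant_units / all_housing_units

print(f"homeowner vacancy rate: {homeowner_vacancy:.1%}")  # ~2.7%
print(f"gross vacancy rate:     {gross_vacancy:.1%}")      # ~10.8%
```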

There are some other figures of note in this document. Table 4 shows that the homeownership rate is at 66.5%, down from a peak of 69.2% in the fourth quarter of 2004. (It is interesting to note that this rate peaked a couple of years before the housing market is popularly thought to have gone downhill. What happened between Q4 2004 and the start of the larger economic crisis?) Table 7 has homeownership rates by race: the white rate has dropped 1.1% since 1Q 2007 while Blacks and Latinos have seen bigger drops (3.2% and 3.3%, respectively).

An example of statistics in action: measuring faculty performance by the grades students receive in subsequent courses

Assessment, whether it is for student or faculty outcomes, is a great area in which to find examples of statistics. This example comes from a discussion of assessing faculty by looking at how students do in subsequent courses:

[A]lmost no colleges systematically analyze students’ performance across course sequences.

That may be a lost opportunity. If colleges looked carefully at students’ performance in (for example) Calculus II courses, some scholars say, they could harvest vital information about the Calculus I sections where the students were originally trained. Which Calculus I instructors are strongest? Which kinds of homework and classroom design are most effective? Are some professors inflating grades?

Analyzing subsequent-course preparedness “is going to give you a much, much more-reliable signal of quality than traditional course-evaluation forms,” says Bruce A. Weinberg, an associate professor of economics at Ohio State University who recently scrutinized more than 14,000 students’ performance across course sequences in his department.

Other scholars, however, contend that it is not so easy to play this game. In practice, they say, course-sequence data are almost impossible to analyze. Dozens of confounding variables can cloud the picture. If the best-prepared students in a Spanish II course come from the Spanish I section that met at 8 a.m., is that because that section had the best instructor, or is it because the kind of student who is willing to wake up at dawn is also the kind of student who is likely to be academically strong?

It sounds like obtaining the relevant grade data for this sort of analysis would not be difficult. The hard part is making sure the analysis includes all of the potentially relevant factors, the “confounding variables,” that could influence student performance.
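
To picture what controlling for confounding variables might look like, here is a minimal sketch using synthetic data and hypothetical variable names (not the actual data or models from the studies mentioned above): regress students' grades in the second course on which instructor they had in the first course, while also including student characteristics and a flag for the early-morning section.

```python
# Illustrative sketch with synthetic data and hypothetical variable names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "calc1_instructor": rng.choice(["A", "B", "C"], size=n),
    "sat_math": rng.normal(600, 60, size=n),
    "early_section": rng.integers(0, 2, size=n),   # 8 a.m. section flag
})
# Fake outcome: prior preparation (SAT) and self-selection into the early
# section both matter, alongside any true instructor differences.
df["calc2_grade"] = (
    2.0 + 0.003 * df["sat_math"] + 0.2 * df["early_section"]
    + (df["calc1_instructor"] == "B") * 0.3 + rng.normal(0, 0.5, size=n)
)

# Instructor comparisons after adjusting for the observable confounders.
model = smf.ols(
    "calc2_grade ~ C(calc1_instructor) + sat_math + early_section", data=df
).fit()
print(model.summary().tables[1])
```

Of course, a regression like this only handles the confounders you can measure; the unmeasured ones are exactly what makes the critics skeptical.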

One way to reduce these issues is to limit student choice regarding sections and instructors. Interestingly, this article cites studies done at the Air Force Academy, where students don’t have many options in the Calculus I-II sequence. In summary, this setting means “the Air Force Academy [is] a beautifully sterile environment for studying course sequences.”

Some interesting findings from both the Air Force Academy and Duke: students who took introductory or earlier classes that they considered more difficult or stringent did better in subsequent courses.

Finding the right model to predict crime in Santa Cruz

Science fiction stories are usually the setting when people talk about predicting crimes. But it appears that the police department in Santa Cruz is working with an academic in order to forecast where crimes will take place:

Santa Cruz police could be the first department in Northern California that will deploy officers based on forecasting.

Santa Clara University assistant math professor Dr. George Mohler said the same algorithms used to predict aftershocks from earthquakes work to predict crime. “We started with theories from sociological and criminological fields of research that says offenders are more likely to return to a place where they’ve been successful in the past,” Mohler said.

To test his theory, Mohler plugged in several years worth of old burglary data from Los Angeles. When a burglary is reported, Mohler’s model tells police where and when a so-called “after crime” is likely to occur.

The Santa Cruz Police Department has turned over 10 years of crime data to Mohler to run in the model.
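
For the curious, here is a minimal sketch of the self-exciting, aftershock-style idea described above: a background rate plus temporary bumps in risk after each past event. This is only a toy illustration with made-up parameters, not Mohler’s actual model, which also accounts for location and is fit to the data.

```python
import math

# Toy self-exciting ("aftershock"-style) intensity for a single location.
# mu is the background burglary rate per day; each past burglary adds a
# temporary bump that decays exponentially with time (decay rate omega).
def intensity(t, past_event_times, mu=0.2, alpha=0.5, omega=0.3):
    """Expected burglary rate at time t (days) given past event times."""
    return mu + sum(alpha * omega * math.exp(-omega * (t - s))
                    for s in past_event_times if s < t)

# Burglaries reported on days 1, 3, and 10: risk is elevated shortly after
# day 10 and decays back toward the background rate later on.
print(intensity(10.5, [1, 3, 10]))
print(intensity(30.0, [1, 3, 10]))
```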

I wonder if we will be able to read about the outcome of this trial, regardless of whether the outcome is good or bad. If the outcome is bad, perhaps the police department or the academic would not want to publicize the results.

On one hand, this simply seems to be a problem of getting enough data to make accurate enough predictions. On the other hand, there will always be some error in the predictions. For example, how could a model predict something like what happened in Arizona this past weekend? Of course, one could build some random noise into the model – but these random guesses could easily be wrong.

And knowing where a crime is likely to happen doesn’t necessarily mean that the crime could be prevented.

US land use statistics from the 2011 Statistical Abstract of the United States

I have always enjoyed reading or looking through almanacs and statistical abstracts: there is so much interesting information, from crop production to sports results to country profiles and more. The New York Times piqued my interest with a small sampling of statistics from the 2011 Statistical Abstract of the United States.

One reported statistic struck me: “The proportion of developed land reached a record high: 5.6 percent of all land in the continental U.S.” At first glance, I am not surprised: many of the car trips I take to visit family involve hours of driving past open fields and forests. Even with all the talk we hear of sprawl, there still appears to be plenty of land that could be developed.

But the Statistical Abstract allows us to dig deeper: how exactly is American land used? According to 2003 figures (#363, Excel table), 71.1% of American land is rural, with 19% and 20.9% of all land devoted to crops and “rangeland,” respectively. While developed land may have reached a record high (5.6%), Federal land covers almost four times as much area (20.7%).

Another factor here would have to be how much of the total land could actually be developed. How much of that rural land is inaccessible or would require a large amount of work and money to improve?

So whenever there is a discussion of developable land and sprawl, it seems like it would be useful to keep these statistics in mind. How much non-developed land do we want to have as a country and should it be spread throughout the country? How much open land is needed around cities or in metropolitan regions? And what should this open land be: forest preserve, state park, national park, open fields, farmland, or something else?

A humorous yet relevant comment from the scientific past: “Oh, well, nobody is perfect.”

When I look at sociology journal articles from the past, a few things strike me: the lack of high-powered statistics and a simplicity in explanation and research design. In the current world of publishing demands and the push to always produce high-quality, ground-breaking work, these earlier articles look like they were from a more innocent era.

I was reminded of this by a recent Wired post. In this case, a geology journal had published an article in the early 1960s and another scholar had responded in print to this article by pointing out a mistake on the part of the original authors. This is not uncommon. What does look particularly uncommon is the response by the original authors: “Oh, well, nobody is perfect.”

In a perfect world, isn’t this how science is supposed to work: just admit your mistakes, don’t repeat them, and move on? But I can’t imagine that many current scholars could give such a reply, perhaps in fear that their career or reputation would be in jeopardy. And in the world of scientific journals, is this sort of back and forth (with candidness) even possible much of the time?

I also infer a sense of humility on the part of the original authors. Instead of going on for pages about how their mistake was defensible or trying to pass the blame, a quick one-liner admits the mistake, defuses the situation, and everyone can move on.

Trying to understand China’s economy with a lack of statistics

Megan McArdle writes about the issue of a lack of comprehensive data to understand what is happening with China’s economy:

But central planners badly need good, comprehensive data.  Once you limit the autonomy of local nodes to make decisions, you need some sort of massive data set to overcome information loss as decisions move up the hierarchy.

Libertarians often use this to argue against any sort of central planning, but that’s not the point of this post.  All modern economies engage in some level of planning, whether it is monetary policy or infrastructure construction.  It was in response to the problems of managing production during World War I that economists first conspired to create US economic statistics.

The Chinese government is extremely enthusiastic about managing their economy, and they put a lot of thought into it.  But the lack of good statistics on economic performance makes an already near-impossible challenge even more daunting.

It is remarkable how much data is out there these days in the United States. And even with all that data, it is often still not clear what should be done – government officials, investors, journalists, and citizens need to know how to interpret the data and figure out how to respond.

What would it take to get comprehensive data in China?

h/t Instapundit

The statistical calculations used for counting votes

Some might be surprised to hear that “Counting lots of ballots [in elections] with absolute precision is impossible.” Wired takes a brief look at how the vote totals are calculated:

Most laws leave the determination of the recount threshold to the discretion of registrars. But not California—at least not since earlier this year, when the state assembly passed a bill piloting a new method to make sure the vote isn’t rocking a little too hard. The formula comes from UC Berkeley statistician Philip Stark; he uses the error rate from audited precincts to calculate a key statistical number called the P-value. Election auditors already calculate the number of errors in any given precinct; the P-value helps them determine whether that error rate means the results are wrong. A low P-value means everything is copacetic: The purported winner is probably the one who indeed got the most votes. If you get a high value? Maybe hold off on those balloon drops.

A p-value is a key measure in much statistical analysis: it gives the probability of obtaining results at least as extreme as those observed if chance alone were responsible. A small p-value lets us be fairly sure (conventionally at the 95% level or higher) that the statistical estimate reflects something about the whole population rather than chance.
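
As a toy illustration (not Stark’s actual auditing procedure), one could frame the audit question as a simple binomial calculation: if the error rate were really high enough to overturn the reported outcome, how likely is it that auditors would find as few errors as they actually did? The numbers below are hypothetical.

```python
from math import comb

def audit_p_value(errors_found, ballots_audited, error_rate_needed):
    """Probability of seeing this few errors or fewer in the audited
    ballots if the true error rate were at least error_rate_needed,
    i.e. the rate hypothetically required to change the winner."""
    p = error_rate_needed
    return sum(comb(ballots_audited, k) * p**k * (1 - p)**(ballots_audited - k)
               for k in range(errors_found + 1))

# Hypothetical audit: 2 errors found in 400 audited ballots, and suppose an
# error rate of at least 2% would be needed to overturn the result.
print(round(audit_p_value(2, 400, 0.02), 4))
# A small value means errors are very unlikely to be that common, which
# supports the reported winner -- matching the "low P-value is copacetic"
# framing in the Wired excerpt.
```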

So what is the acceptable p-value for elections in California?

I would be curious to know whether people might seize upon this information for two reasons: (1) it shows the political system is not exact and therefore possibly corrupt and (2) they distrust statistics altogether.

Comparing stories and statistics

A mathematician thinks about the differences between stories and statistics and the people who prefer one side over another:

Despite the naturalness of these notions, however, there is a tension between stories and statistics, and one under-appreciated contrast between them is simply the mindset with which we approach them. In listening to stories we tend to suspend disbelief in order to be entertained, whereas in evaluating statistics we generally have an opposite inclination to suspend belief in order not to be beguiled. A drily named distinction from formal statistics is relevant: we’re said to commit a Type I error when we observe something that is not really there and a Type II error when we fail to observe something that is there. There is no way to always avoid both types, and we have different error thresholds in different endeavors, but the type of error people feel more comfortable may be telling. It gives some indication of their intellectual personality type, on which side of the two cultures (or maybe two coutures) divide they’re most comfortable.

I’ll close with perhaps the most fundamental tension between stories and statistics. The focus of stories is on individual people rather than averages, on motives rather than movements, on point of view rather than the view from nowhere, context rather than raw data. Moreover, stories are open-ended and metaphorical rather than determinate and literal.
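
To make the Type I/Type II distinction concrete, here is a quick simulation sketch (purely illustrative, not from the article): in the first batch of trials the null hypothesis is true, so every rejection is a Type I error; in the second batch a real effect exists, so every failure to reject is a Type II error.

```python
import random
import statistics

random.seed(1)

def reject_null(sample, null_mean=0.0, z_crit=1.96):
    """Simple two-sided z-style test at roughly the 5% level."""
    n = len(sample)
    z = (statistics.mean(sample) - null_mean) / (statistics.stdev(sample) / n**0.5)
    return abs(z) > z_crit

trials = 2000
# Type I: the null is true (mean really is 0), but the test "sees" an effect.
type1 = sum(reject_null([random.gauss(0.0, 1) for _ in range(30)])
            for _ in range(trials)) / trials
# Type II: a real effect exists (mean = 0.3), but the test misses it.
type2 = sum(not reject_null([random.gauss(0.3, 1) for _ in range(30)])
            for _ in range(trials)) / trials
print(f"Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```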

This is a good discussion and one that I think about often while teaching statistics or research methods. Stories are often easy for students to latch onto, particularly if told from an interesting point of view. In the end, these stories (particularly the “classics”) can illuminate the human condition or interesting concerns but don’t have the same ability to offer concrete overviews of the typical or common experience. Statistics offer a different lens for viewing the world, one where individual experiences are muted in favor of data about larger groups. Both can miss important features of the reality around us but offer different angles for tackling similar concerns.

Both have their place and I would suggest both are necessary.

The presence of error in statistics as illustrated by basketball predictions

TrueHoop has an interesting paragraph from this afternoon illustrating how there is always error in even complicated statistical models:

A Laker fan wrings his hands over the fact that advanced stats prefer the Heat and LeBron James to the Lakers and Kobe Bryant. It’s pitched as an intuition vs. machine debate, but I don’t see the stats movement that way at all. Instead, I think everyone agrees the only contest that matters takes place in June. In the meantime, the question is, in clumsily predicting what will happen then (and stats or no, all such predictions are clumsy) do you want to use all of the best available information, or not? That’s the debate about stats in the NBA, if there still is one.

By suggesting that predictions are clumsy, Abbott is highlighting an important fact about statistics and statistical analysis: there is always some room for error. Even with the best statistical models, there is always a chance that a different outcome could result. There are anomalies that pop up, such as a player who has an unexpected breakout year or a young star who suffers an unfortunate injury early in the season. Or perhaps an issue like “chemistry,” something that I imagine is difficult to model, plays a role. The better the model, meaning the better the input data and the better the statistical techniques, the more accurate the predictions.
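
As a rough illustration of that room for error, consider a toy simulation: even if a model were right that the favorite wins any single game 65% of the time, the favorite still loses a seven-game series a meaningful share of the time. (The 65% figure is made up for the example.)

```python
import random

random.seed(2011)

def favorite_wins_series(p=0.65, wins_needed=4):
    """Simulate a best-of-seven series where the favorite wins each game
    independently with probability p."""
    wins = losses = 0
    while wins < wins_needed and losses < wins_needed:
        if random.random() < p:
            wins += 1
        else:
            losses += 1
    return wins == wins_needed

trials = 10_000
upsets = sum(not favorite_wins_series() for _ in range(trials)) / trials
# Roughly one series in five goes to the underdog under these assumptions.
print(f"Favorite loses the series in about {upsets:.0%} of simulations")
```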

But in the short term, there are plenty of analysts (and fans) who want some way to think about the outcome of the 2010-2011 NBA season. Some predictions are simply made on intuition and basketball knowledge. Other predictions are made based on some statistical model. But all of these predictions will serve as talking points during the NBA season to help provide some overarching framework to understand the game by game results. Ultimately, as Gregg Easterbrook has pointed out in his TMQ column during the NFL off-season, many of the predictions are wrong – though the makers of the predictions are not often punished for poor results.

Comparing greatness of players past and present an enjoyable part of sports fandom

As the NBA season approaches, discussion this week has centered on the relative status of several players: Kobe Bryant, Kevin Durant, LeBron James, and Michael Jordan. While the first three players in this list were involved in a question about who is the best current player and potential MVP, Jordan also has been inserted in the discussion due to his starring role in NBA2K11 and comments he made about the number of points he could score if he played today when more fouls are called.

Several quick thoughts come to mind:

1. The new era of statistics in sports offers more opportunities to make comparisons of players across different eras, particularly if you can control for certain features of the game in each time period (like the average pace in basketball); see the sketch after this list.

2. I wonder how much current players think about issues like these. Fans seem to like these discussions. It allows the average guy sitting on the couch to say, “my guy, whoever that may be, can match up or beat your guy.”

3. Jordan, like some other old players, still likes to be part of these discussions.

4. All of these discussions are magnified by the non-stop media attention for sports these days. I can hear it on local sports talk radio stations, which all sound like the CNN of the radio airwaves: stories are repeated all day long with slightly different interpretations.
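
Here is the sketch promised in point 1: a tiny, illustrative pace adjustment (with made-up numbers, not real box scores) that compares scoring per 100 team possessions rather than raw points per game.

```python
# Scoring per game depends heavily on pace, so one common era adjustment
# is to compare points per 100 team possessions instead.
def points_per_100_possessions(points_per_game, team_possessions_per_game):
    return 100 * points_per_game / team_possessions_per_game

# A 30-point scorer in a ~100-possession era and a 27-point scorer in a
# ~90-possession era look much more alike once pace is taken into account.
print(round(points_per_100_possessions(30, 100), 1))  # 30.0
print(round(points_per_100_possessions(27, 90), 1))   # 30.0
```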