From outlier to outlier in unemployment data

With the responses to COVID-19, unemployment is expected to approach or hit a record high for the period in which data has been recorded:

April’s employment report, to be released Friday, will almost certainly show that the coronavirus pandemic inflicted the largest one-month blow to the U.S. labor market on record.

Economists surveyed by The Wall Street Journal forecast the new report will show that unemployment rose to 16.1% in April and that employers shed 22 million nonfarm payroll jobs—the equivalent of eliminating every job created in the past decade.

The losses in jobs would produce the highest unemployment rate since records began in 1948, eclipsing the 10.8% rate touched in late 1982 at the end of the double-dip recession early in President Reagan’s first term. The monthly number of jobs lost would be the biggest in records going back to 1939—far steeper than the 1.96 million jobs eliminated in September 1945, at the end of World War II.

But, also noteworthy is what these rapid changes follow:

Combined with the rise in unemployment and the loss of jobs in March, the new figures will underscore the labor market’s sharp reversal since February, when joblessness was at a half-century low of 3.5% and the country notched a record 113 straight months of job creation.

In other words, the United States has experienced both a record low and, very likely, a record high in unemployment within the span of three months. A few thoughts connected to this:

1. Either outlier is noteworthy; having them occur so close to each other is more unusual.

2. Their close occurrence makes it more difficult to ascertain what is “normal” unemployment for this period of history. The fallout of COVID-19 is unusual. But the 3.5% unemployment can also be considered unusual compared to historical data.

3. Given these two outliers, it might be relatively easy to dismiss either as aberrations. Yet, while people are living through the situations and the fallout, they cannot simply be dismissed. If unemployment now is around 16%, this requires attention even if historically this is a very unusual period.

4. With these two outliers, predicting the future regarding unemployment (and other social measures) is very difficult. Will the economy quickly restart in the United States and around the world? Will COVID-19 be largely under control within a few months or will there be new outbreaks for a longer period of time (and will governments and people react in the same ways)?

In sum, dealing with extreme data – outliers – is a difficult task for everyone.

Home value algorithms show consumers data with outliers, mortgage companies take the outliers out

A homeowner can look online to get an estimate of the value of their home but that number may not match what a lender computes:

Different AVMs are designed to deliver different types of valuations. And therein lies confusion.

Consumers don’t realize that there’s an AVM for nearly any purpose, which explains why different algorithms serve up different results, said Ann Regan, an executive product manager with real estate analytic firm CoreLogic. “The scores presented to consumers are not the same version that is being used by lenders to make decisions,” she said. “The consumer-facing AVMs are designed for consumer marketing purposes.”

For instance, more accurate models used by lenders do not include outliers — properties that sold for extremely high or low prices and that consequently would skew the averages and the comparable sales for a particular house, like yours. But models used by consumer websites, such as brokers’ sites and national listing sites, scoop in as much “sold” data as possible when concocting a valuation, because then they can claim to include all available data. That’s true, said Regan, but it’s more accurate to weed out misleading data.

AVMs used by lenders send along “confidence scores” that indicate how firm the estimate is. That is a factor typically not included alongside consumer AVMs, she added.

This is an interesting trade-off. The assumption is that consumers want to see all of the data accounted for, which makes the estimate seem more trustworthy: more data = more accuracy. On the other hand, those who work with data know that measures of central tendency and variability can be thrown off by unusual cases, often known as outliers. If one home sold for an unusually high or low price – and there are many reasons why this could happen – the summary figures for the comparable homes can be skewed. When significant outliers are present, more data does not equal more accuracy.
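
To make the point concrete, here is a minimal sketch (with invented sale prices, not any AVM’s actual data) of how a single unusual sale pulls the mean of the comparable sales well away from the median:

```python
# Hypothetical comparable sales for a neighborhood (in dollars).
comps = [310_000, 295_000, 325_000, 305_000, 315_000, 300_000]

# One distressed sale (an outlier) enters the "all available data" pool.
comps_with_outlier = comps + [95_000]

def mean(values):
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(f"Mean without outlier:   ${mean(comps):,.0f}")
print(f"Mean with outlier:      ${mean(comps_with_outlier):,.0f}")
print(f"Median without outlier: ${median(comps):,.0f}")
print(f"Median with outlier:    ${median(comps_with_outlier):,.0f}")
```

The mean drops by roughly $30,000 once the unusual sale is included, while the median barely moves; trimming or down-weighting such sales, as lender-oriented models reportedly do, is essentially a bet that the median-like behavior is closer to the truth.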

Since this knowledge is out there (at least printed in a major newspaper), does this mean consumers will be informed of these algorithm features when they look at websites like Zillow? I imagine it could be tricky to explain clearly how removing some of the housing comparison data actually improves the estimate, but if the long-term goal is better numeracy for the public, this could be a good addition to such websites.

When a few people generate most of the complaints about a public nuisance

The newest runway at O’Hare Airport has generated more noise complaints than ever. However, a good portion of the complaints come from a small number of people.

She now ranks among the area’s most prolific complainers and is one of 11 people responsible for 44 percent of the noise complaints leveled in August, according to the city’s Department of Aviation.

The city, which operates the airport, pokes at her serial reporting in its monthly report by isolating the number of complaints from a single address in various towns. It’s a move meant to downplay the significant surge in noise complaints since the airport’s fourth east-west runway opened last fall, but it only seems to energize Morong…

Chicago tallied 138,106 complaints during the first eight months of the year, according to the Department of Aviation. That figure surpassed the total number of noise complaints from 2007 to 2013.

The city, however, literally puts an asterisk next to this year’s numbers in monthly reports and notes that a few addresses are responsible for thousands of complaints. The August report, for example, states that 11 addresses were responsible for more than 13,000 complaints during that 31-day period…

But even excluding the serial reporters, the city still logged about 16,000 complaints in August, about eight times the number it received in August 2013.

There are two trends going on here:

1. The overall number of complaints is still up, even without the more serial complainers. This could mean several things: there are more people now affected by noise, a wider range of people are complaining, and/or this system of filing complaints online has caught on.

2. A lot of the complaints are generated by outliers, including the woman at the center of the story, who peaked at 600 complaints in a single day. It is interesting that the City of Chicago has taken to pointing this out, probably in an attempt to downplay the overall surge in complaints (a rough sketch of this arithmetic follows the list).
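
The per-address counts below are invented for illustration; they are not the Department of Aviation’s actual figures, but they show how a handful of addresses can account for a large share of a month’s total while the remainder still dwarfs earlier years:

```python
# Hypothetical per-address complaint counts for one month:
# a few "serial reporter" addresses, plus everyone else combined.
serial_addresses = [4200, 2100, 1800, 1500, 1200, 900, 600, 400, 300, 250, 200]
occasional_total = 16_000  # all other addresses combined

serial_total = sum(serial_addresses)
month_total = serial_total + occasional_total
share = serial_total / month_total

print(f"Complaints from 11 addresses: {serial_total:,} ({share:.0%} of {month_total:,})")
print(f"Complaints excluding them:    {occasional_total:,}")
```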

This is not an easy issue to solve. The runway issues and O’Hare’s path to being the world’s busiest airport again mean that there is more flight traffic and more noise. This is not desirable for some residents who feel like they are not heard. Yet, it is probably good for the whole region as Chicago tries to build on its transportation advantages. What might the residents accept as “being heard”? Changing whole traffic patterns or efforts at limiting the sound? Balancing local and regional interests is often very difficult but I don’t see how this is going to get much better for the residents.

Statistical anomalies show problems with Chicago’s red light cameras

There has been a lot of fallout from the Chicago Tribune‘s report on problems with Chicago’s red light cameras. And the smoking gun was the improbable spikes in tickets handed out on single days or in short stretches:

From April 29 to June 19, 2011, one of the two cameras at Wague’s West Pullman intersection tagged drivers for 1,717 red light violations. That was more violations in 52 days than the camera captured in the previous year and a half…

On the Near West Side, the corner of North Ashland Avenue and West Madison Street generated 949 tickets in a 17-day period beginning June 23, 2013. That is a rate of about 56 tickets per day. In the previous two years, that camera on Ashland averaged 1.3 tickets per day…

City officials insisted the city has not changed its enforcement practices. They also said they have no records indicating camera malfunctions or adjustments that would have affected the volume of tickets.

The lack of records is significant, because Redflex was required to document any time the operation of a camera was disrupted for more than a day, as well as work “that will affect incident volume” — in other words, adjustments or repairs that could increase or decrease the number of violations.

In other words, graphs showing the number of tickets over time show big spikes. The article includes one such graph from the intersection of Halsted and 119th Street.

As the article notes, there are a number of these big outliers in the data, outliers that would be difficult to miss if anyone were examining the data as they were supposed to. Given the regularities in traffic, you would expect fairly similar patterns over time, but graphs like this suggest something else at work. Outside of someone directly testifying to underhanded activities, it is difficult to imagine more damaging evidence than graphs like these.
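
For illustration, one very simple way to flag the kind of spike described above is to compare each day’s ticket count to the camera’s own historical baseline. The counts below are simulated, not the Tribune’s data, and the three-standard-deviation rule is just one plausible threshold:

```python
import statistics

# Simulated daily ticket counts for one camera: a long stretch near its
# historical average (~1.3 tickets/day) followed by a sudden spike.
daily_tickets = [1, 2, 1, 0, 2, 1, 1, 3, 1, 2, 1, 0, 1, 2, 1, 55, 58, 52, 60, 56]

baseline = daily_tickets[:15]           # the "normal" period
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Flag any day more than 3 standard deviations above the baseline mean.
for day, count in enumerate(daily_tickets, start=1):
    if count > mean + 3 * stdev:
        print(f"Day {day}: {count} tickets (baseline {mean:.1f} +/- {stdev:.1f}) -- possible anomaly")
```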

Guinness World Records for housing

Here is a roundup of some of the 2014 Guinness World Records in housing:

Knapp, who died in 1988, lived in the same house in Montgomery Township, Pa., for 110 years. And for that feat, she earns the title as the person who has lived the longest time ever in one residence, according to the 2014 edition of the “Guinness World Records.”…

While we’re at it, a nod to the world’s tallest real estate agents: Laurie and Wayne Hallquist are 6’6″ and 6’10”, respectively. She’s a full-time agent with Prudential California Realty in Stockton, Calif., while he’s a part-timer with the company…

The skinniest house on record is in Warsaw. It is three feet two inches wide at its narrowest point and just about five feet at its widest. It contains a floor area of 151 square feet, and instead of stairs, occupants climb a ladder to reach the bedrooms above…

The tallest resident-only building is in Dubai. Princess Tower is 1,356-feet high, with the highest occupied floor at 1,171 feet. But the title of tallest residential apartments belongs to Burj Khalifa, also in Dubai, which combines a hotel, offices and apartments. There, the highest residential floor—the 108th—is at 1,263 feet.

Houses, their furnishings, and apparently, their agents, come in all shapes and sizes. However, when I think about these records, it strikes me that most housing in the United States is relatively similar. I don’t mean that the housing is all identical – that is a common criticism of suburban housing and I don’t think it is particularly fair – but that most housing falls within a standard deviation or two of the norm. Give or take a few rooms, a few decades, and some furnishings and decorations, most housing is “normal.” The housing cited in Guinness tends to be made up of unusual and extreme outliers.

Long tail: 17% of the seven-foot-tall men between ages 20 and 40 in the US play in the NBA

As part of dissecting whether Shaq can really fit in a Buick Lacrosse (I’ve asked this myself when watching the commercial), Car & Driver drops in this little statistic about men in the United States who are seven feet tall:

The population of seven-footers is infinitesimal. In 2011, Sports Illustrated estimated that there are fewer than 70 men between the ages of 20 and 40 in the United States who stand seven feet or taller. A shocking 17 percent of them play in the NBA.

In the distribution of heights in the United States, being at least seven feet tall is quite unusual, placing a man in the far right tail of a roughly normal distribution. But being that tall increases the odds of playing in the NBA by quite a lot. As a Forbes post suggests, “Being 7 Feet Tall [may be] the Fastest Way To Get Rich in America”:

Drawing on Centers for Disease Control data, Sports Illustrated‘s Pablo Torre estimated that no more than 70 American men are between the ages of 20 and 40 and at least 7 feet tall. “While the probability of, say, an American between 6’6″ and 6’8″ being an NBA player today stands at a mere 0.07%, it’s a staggering 17% for someone 7 feet or taller,” Torre writes.

(While that claim might seem like a tall tale, more than 42 U.S.-born players listed at 7 feet did debut in NBA games between 1993 and 2013. Even accounting for the typical 1-inch inflation in players’ listed heights would still mean that 15 “true” 7-footers made it to the NBA, out of Torre’s hypothetical pool of about 70 men.)…

And given the market need for players who can protect the rim, there are extra rewards for this extra height. The league’s median player last season was 6 feet 7 inches tall, and paid about $2.5 million for his service. But consider the rarified air of the 7-footer-and-up club. The average salary of those 35 NBA players: $6.1 million.

(How much does one more inch matter? The 39 players listed at 6 feet 11 inches were paid an average of $4.9 million, or about 20% less than the 7 footers.)

Standing as an outlier at the far end of the distribution seems to pay off in this case.
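
As a quick back-of-the-envelope check, here is the arithmetic implied by the figures quoted above (the numbers come from the quoted passages; the calculation itself is mine):

```python
# Figures from the quoted Sports Illustrated / Forbes passages.
us_seven_footers_20_to_40 = 70      # estimated population
share_in_nba = 0.17                 # "a staggering 17%"

implied_nba_seven_footers = us_seven_footers_20_to_40 * share_in_nba
print(f"Implied 7-footers in the NBA: about {implied_nba_seven_footers:.0f}")

# Salary comparison from the Forbes passage.
avg_salary_7ft = 6.1e6
avg_salary_6ft11 = 4.9e6
print(f"Players listed at 6'11\" earn about {1 - avg_salary_6ft11 / avg_salary_7ft:.0%} less than 7-footers")
```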

Bad logic: stories of successful college dropouts obscure advantages of going to college

The president of the University of Chicago writes that holding up successful college dropouts as models takes away attention from the advantages of a college degree:

Names like Jobs, Gates, Dell, and others lend star power to the myth of the wildly successful college dropout. One recent New York Times homage to the phenomenon compared dropping out to “lighting out for the territories to strike gold,” with one young executive describing it as “almost a badge of honor” among startup entrepreneurs. Like any myth, this story has a kernel of truth: There are exceptional individuals whose hard work, determination, and intelligence make up for the lack of a college degree. If they could do it, one might think, why can’t everybody?

Such a question ignores the outlier status of these exceptional drop-out entrepreneurs and innovators.

Those who are able to achieve such success often rely on a set of skills already developed before they get to college. They know how to educate themselves, get a bank loan, and manage their time and their money. They may benefit from a network of family, friends and acquaintances who open doors and provide a safety net.

But what happens to young people without access to these important resources? For them, skipping college to pursue business success is like investing their savings in lottery tickets in the hopes they will be a multimillion-dollar winner, or failing to pursue an education because they expect to be an NBA superstar. The reality is that the next college dropout will not be LeBron James, James Cameron, or Mark Zuckerberg. He will likely belong to the millions of college drop-outs you don’t hear the press singing about. These are the 34 million Americans over 25 with some college credits but no diploma. Nearly as large as the state of California, this group is 71 percent more likely to be unemployed and four times more likely to default on student loans. Far from being millionaires, they earn 32 percent less than college graduates, on average.

I’ve seen this logic used in arguments about not having to spend lots of money on college or from those who see college as liberal indoctrination. As Zimmer argues, using outliers to build a theory is just not a good idea. These famous cases are held up partly because they are so rare, not because this is necessarily a good path to pursue. This is similar to the logic behind holding up rags-to-riches stories; while it is true that social mobility, upward and downward, occurs in the United States, a phenomenal change in position over one lifetime is quite rare.

I’ve used this very example with my Introduction to Sociology class when talking about why people go to college. I ask them if they are aware of wealthy college dropouts like Bill Gates and Steve Jobs. They say yes. I then ask: if they dropped out of college, would their parents accept these stories as a good rationale? They answer no. I then tell them a little of the Bill Gates story as relayed by Malcolm Gladwell in Outliers. Gates attended a pretty good high school that, through a student’s parent who worked for a computer company, was able to purchase a used mainframe computer. Gates then had an opportunity that was rare at the time for a high school student: he could spend hours with the mainframe and learn about it. He was able to build on this background and later founded Microsoft with Paul Allen. Gladwell uses this as an example of the Matthew effect, in which those who come from more advantaged backgrounds (or who happened to be the oldest players in their youth hockey cohort) tend to get more opportunities later in life.

Five main methods of detecting patterns in data mining

Here is a summary of five of the main methods utilized to uncover patterns when data mining:

Anomaly detection: in a large data set it is possible to get a picture of what the data tends to look like in a typical case. Statistics can be used to determine if something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this for review and audit.
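
A minimal sketch of the idea, using Tukey’s interquartile-range rule on made-up deduction-to-income ratios (this is an illustration, not the IRS’s actual method):

```python
import statistics

# Hypothetical deduction-to-income ratios for a batch of tax returns.
ratios = [0.11, 0.09, 0.12, 0.10, 0.08, 0.13, 0.10, 0.11, 0.45, 0.09, 0.12]

# Tukey's rule: anything beyond 1.5 * IQR outside the middle 50% is unusual.
q1, q2, q3 = statistics.quantiles(ratios, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

for i, r in enumerate(ratios):
    if r < low or r > high:
        print(f"Return {i}: ratio {r:.2f} is outside [{low:.2f}, {high:.2f}] -- flag for review")
```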

Association learning: This is the type of data mining that drives the Amazon recommendation system. For instance, this might reveal that customers who bought a cocktail shaker and a cocktail recipe book also often buy martini glasses. These types of findings are often used for targeting coupons/deals or advertising. Similarly, this form of data mining (albeit a quite complex version) is behind Netflix movie recommendations.
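
A toy version of association learning simply counts how often items appear in the same basket; the purchase data below is invented, and real recommendation systems are far more sophisticated:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"cocktail shaker", "martini glasses"},
    {"cocktail shaker", "recipe book", "martini glasses"},
    {"recipe book"},
    {"cocktail shaker", "recipe book"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(basket), 2))

# Confidence of a simple pairwise rule:
# P(martini glasses | cocktail shaker) = count(both) / count(shaker).
pair = frozenset({"cocktail shaker", "martini glasses"})
confidence = pair_counts[pair] / item_counts["cocktail shaker"]
print(f"Customers who bought a shaker also bought martini glasses {confidence:.0%} of the time")
```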

Cluster detection: one type of pattern recognition that is particularly useful is recognizing distinct clusters or sub-categories within the data. Without data mining, an analyst would have to look at the data and decide on a set of categories which they believe captures the relevant distinctions between apparent groups in the data. This would risk missing important categories. With data mining it is possible to let the data itself determine the groups. This is one of the black-box type of algorithms that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.
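
As a small illustration with made-up spending data, scikit-learn’s k-means (one common clustering algorithm, not necessarily what any particular retailer uses) can separate the hobbyist groups without being told the categories in advance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annual spending per customer: [garden supplies, fishing gear, model kits]
X = np.array([
    [500, 10, 0], [450, 0, 20], [520, 5, 10],      # likely gardeners
    [20, 600, 0], [0, 550, 15], [10, 620, 5],      # likely anglers
    [5, 0, 480], [15, 20, 500], [0, 10, 450],      # likely modelers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", kmeans.labels_)
print("Cluster centers (average spend per category):")
print(kmeans.cluster_centers_.round(0))
```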

Classification: If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. Learning from a large set of pre-classified examples, algorithms can detect persistent systemic differences between items in each group and apply these rules to new classification problems. Spam filters are a great example of this – large sets of emails that have been identified as spam have enabled filters to notice differences in word usage between legitimate and spam messages, and classify incoming messages according to these rules with a high degree of accuracy.
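
A bare-bones sketch of the spam-filter example, using a naive Bayes classifier on a tiny invented training set (real filters learn from vastly more data and use more features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny labeled training set (real spam filters learn from millions of examples).
messages = [
    "win a free prize now", "claim your free money", "limited offer act now",
    "meeting moved to 3pm", "lunch tomorrow?", "draft of the report attached",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

classifier = MultinomialNB().fit(X, labels)

new_messages = ["free prize waiting for you", "report attached for the meeting"]
predictions = classifier.predict(vectorizer.transform(new_messages))
print(list(zip(new_messages, predictions)))
```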

Regression: Data mining can be used to construct predictive models based on many variables. Facebook, for example, might be interested in predicting future engagement for a user based on past behavior. Factors like the amount of personal information shared, number of photos tagged, friend requests initiated or accepted, comments, likes etc. could all be included in such a model. Over time, this model could be honed to include or weight things differently as Facebook compares how the predictions differ from observed behavior. Ultimately these findings could be used to guide design in order to encourage more of the behaviors that seem to lead to increased engagement over time.
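
A minimal sketch of this kind of predictive model, with invented user features and engagement numbers standing in for whatever Facebook actually measures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-user features: [photos tagged, friend requests, comments made]
X = np.array([
    [2, 1, 5], [10, 4, 30], [0, 0, 1], [7, 3, 22],
    [15, 6, 40], [1, 1, 3], [8, 2, 25], [12, 5, 35],
])
# Hypothetical outcome: minutes on the site the following week.
y = np.array([30, 180, 5, 140, 260, 20, 150, 220])

model = LinearRegression().fit(X, y)
print("Coefficients per feature:", model.coef_.round(1))
print("Predicted engagement for a new user [5, 2, 10]:",
      model.predict(np.array([[5, 2, 10]])).round(0)[0])
```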

Several of these seem similar to methods commonly used by sociologists:

1. Anomaly detection seems like looking for outliers. On one hand, outliers can throw off basic measures of central tendency or dispersion. On the other hand, outliers can help prompt researchers to reassess their models and/or theories to account for the unusual cases.

2. Cluster detection and/or classification appear similar to factor analysis, which involves a statistical analysis of a set of variables to see which ones “hang together.” This can be helpful for finding categories and for reducing the variables in an analysis to a smaller number of important concepts (see the sketch after this list).

3. Regression is used all the time both for modeling and predictions.
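
For the factor analysis comparison in item 2, here is a small sketch using scikit-learn’s FactorAnalysis on simulated survey responses; the items and latent traits are invented purely to show which variables “hang together”:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200

# Simulate two latent traits driving five observed survey items.
sociability = rng.normal(size=n)
studiousness = rng.normal(size=n)
noise = lambda: rng.normal(scale=0.3, size=n)

X = np.column_stack([
    sociability + noise(),    # "enjoys parties"
    sociability + noise(),    # "talks to strangers"
    studiousness + noise(),   # "hours studying"
    studiousness + noise(),   # "library visits"
    studiousness + noise(),   # "reads for fun"
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
# Each row is an observed item; large loadings show which items share a factor.
print(fa.components_.T.round(2))
```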

This all reminds me of what I heard in graduate school about the difference between data mining and statistical research: data mining amounted to atheoretical analysis. In other words, you might find relationships between variables (or apparent relationships between variables – it could always be a spurious association, or there could be suppressor or distorter effects) but you wouldn’t have compelling explanations for those relationships. While you might be able to develop some explanations after the fact, this is a different process from hypothesis testing, where you set out to look and test for expected relationships and patterns.