Analyzing Netflix’s thousands of movie genres

Alexis Madrigal decided to look into the movie genres of Netflix – and found lots of interesting data:

As the hours ticked by, the Netflix grammar—how it pieced together the words to form comprehensible genres—began to become apparent as well.

If a movie was both romantic and Oscar-winning, Oscar-winning always went to the left: Oscar-winning Romantic Dramas. Time periods always went at the end of the genre: Oscar-winning Romantic Dramas from the 1950s.

In fact, there was a hierarchy for each category of descriptor. Generally speaking, a genre would be formed out of a subset of these components:

Region + Adjectives + Noun Genre + Based On… + Set In… + From the… + About… + For Age X to Y

Yellin said that the genres were limited by three main factors: 1) they only want to display 50 characters for various UI reasons, which eliminates most long genres; 2) there had to be a “critical mass” of content that fit the description of the genre, at least in Netflix’s extended DVD catalog; and 3) they only wanted genres that made syntactic sense.
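To make the grammar concrete, here is a minimal Python sketch of how such a genre name could be assembled. The slot order comes from the formula quoted above and the 50-character cap from Yellin's first constraint; the slot names, the `build_genre` function, and the example values are my own illustrative assumptions, not anything Netflix has published. The "critical mass" and "syntactic sense" checks are not modeled.

```python
# Slot order follows the formula quoted above; the slot names themselves
# are hypothetical labels chosen for this sketch.
SLOT_ORDER = [
    "region",
    "adjectives",
    "noun_genre",
    "based_on",
    "set_in",
    "from_the",
    "about",
    "for_age",
]

MAX_CHARS = 50  # display limit mentioned by Yellin


def build_genre(components):
    """Join whichever slots are present, always in the fixed order.

    Returns the genre name, or None if it exceeds MAX_CHARS.
    """
    unknown = set(components) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    # Skip slots that are missing or empty; keep the rest in grammar order.
    parts = [components[slot] for slot in SLOT_ORDER if components.get(slot)]
    name = " ".join(parts)
    return name if len(name) <= MAX_CHARS else None


# The order in which the slots are supplied does not matter.
short = build_genre({
    "from_the": "from the 1950s",
    "noun_genre": "Dramas",
    "adjectives": "Oscar-winning Romantic",
})

# A made-up combination that runs past the 50-character cap.
too_long = build_genre({
    "region": "British",
    "adjectives": "Critically-acclaimed Emotional",
    "noun_genre": "Dramas",
    "based_on": "based on Books",
})
```

Here `short` comes out as "Oscar-winning Romantic Dramas from the 1950s", matching the example in the excerpt, while `too_long` is rejected because the joined name is longer than 50 characters.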

And the conclusion is that there are so many genres that they don’t necessarily make sense to humans. This strikes me as a uniquely modern problem: we know how to find patterns via algorithm and then we have to decide whether we want to know why the patterns exist. We might call this the Freakonomics problem: we can collect reams of data, mine it, and only then have to develop explanations. This, of course, is the reverse of the typical scientific process that starts with theories and then goes about testing them. The Netflix “reverse engineering” can be quite useful, but wouldn’t it be nice to know why Raymond Burr, the star of Perry Mason, and a few other less-celebrated actors show up so often?

At the least, I bet Hollywood would like access to such explanations. This also reminds me of the Music Genome Project that underlies Pandora. Unlock the genres and there is money to be made.

Freakonomics.com readers vote to eliminate sociology

Responding to the question “Which social science should die?”, the readers of Freakonomics.com voted out sociology:

As you can see from the chart below, nearly 50 percent believed that college/university presidents should eliminate sociology. Nearly 30 percent thought poli sci should be shuttered. [Editor’s note: it is perhaps not surprising that Freakonomics readers wouldn’t vote to eliminate economics.]

The rationales varied. Many felt that sociology had become too insular and out of touch. Some argued that political science had become a sub-field of economics, and a good old-fashioned “M&A” could occur. Others said “market” discipline should be enforced: that is, save the departments that bring in the most cash to the university.  And many of you argued that the tradition of the disciplines was being ignored — e.g., sociology used to promote reform, but is no longer organized around such pragmatic tasks—and so it makes sense to close them for good.

One possible explanation: economists and sociologists don’t always get along.

I would be interested to see a larger poll of academics about this. Could this be related at all to the relative size of the departments?

The morality of termination rights

Raustiala and Sprigman over at the New York Times Freakonomics blog take on the morality of copyright termination rights, “an obscure provision of U.S. copyright law…[that] allows songwriters and musicians to…take back from the record labels many thousands of songs they licensed 35 years ago”:

In general, if you decide to sell or perpetually license a piece of property, you can’t later take it back, no matter how much you might want to. So if I sell my house and two years later the city decides to build a lovely public park in my neighborhood, the value of my former house may rise substantially. But no one contends that I can take the house back, or that I’m due a bonus payment from the lucky buyer. A deal is a deal.

So why the exception for copyright owners?

I have to start somewhere, so it might as well be here:  it’s disingenuous to invoke a home-sy (literally) analogy, show that it fails, and use that failure to “prove” your point.  Raustiala and Sprigman note that “in general,” residential homes are sold outright.  So what?  Equally “in general,” commercial property leases for retail outlets (e.g., stores in shopping center developments) explicitly vary rent payments based on sales (i.e., higher store sales this month/year = higher rent).  Both systems are unobjectionable, assuming one simple fact:  the parties know what kind of deal they are making at the time they make it.

Thus, Raustiala and Sprigman’s analysis falls apart right off the bat.  Termination rights are not a recent phenomenon that nobody knew anything about until a year ago.  Unlike, say, Congress’ decision to re-copyright works that had already fallen into the public domain, termination rights have clearly been a part of U.S. copyright law since 1976.  They may have been “an obscure provision” to the general public reading the Freakonomics blog, but they certainly weren’t obscure to artists and labels.  Raustiala and Sprigman’s characterization is like calling the infield fly rule “obscure”–and then implying that a bunch of MLB players should be out because they didn’t know it existed or how it worked.

They go on:

Think for a moment about the economic effect of the termination provision on the behavior of parties to copyright transactions. Because buyers can expect, on average, to make lower profits when the law contains the termination provision, they will offer less in the initial transaction. Thus, sellers will be more willing to accept less, because they know that if a work later proves valuable, they can terminate and demand some additional payment. So the most likely effect of the termination provision is to force deal prices down across the board….Put differently, the termination provision is a regressive tax.  And in that light, the “fairness” justification for the termination provision is less than overwhelming.

Even assuming this is true, the record labels’ supposed “offer [of] less in the initial transaction” has already happened–35 years ago.  Changing the rules at this point to favor the labels over artists would also seem to invoke its own set of fairness issues.  To put it mildly.

Quick Review: Scorecasting

I have written about Scorecasting several times (see here and here) so I figured I had better read it. Here are my thoughts on what I read about “the hidden influences” in sports:

1. This book truly aims for the Freakonomics crowd: there is a blurb from Freakonomics author Steven D. Levitt both at the top of the cover and on the back. Those University of Chicago professors stick together…

2. I know that I have heard a number of these arguments before, particularly ones about why football teams should not punt, the unfairness of coin flips at the beginning of overtime in the NFL, and the phenomenon of the “hot hand.” Perhaps this indicates that I read too much sports news or that the sports world in recent years really has taken a liking to new kinds of statistics and statistical analysis.

3. A number of the explanations included psychology, just like Spousonomics. Is this because psychological terms and studies are better known (compared to disciplines like sociology) or because psychology truly does provide a lot of helpful information about sports situations? A lot of sports can be broken down into individual performances and efforts – see all of the recent psychoanalyzing of LeBron James – but they are also team games that require cooperation. Could we get more analysis of units or collectives?

4. There were particular chapters and insights that I found fascinating – here are a few:

4a. The overvaluing of hitters who reach round numbers, such as 20 home runs in a season or a .300 batting average, compared to hitters with 19 home runs and/or a .299 average. I don’t know if teams could really save a lot of money by exploiting this, but there is a fixation on certain figures.

4b. The trade value chart used around the NFL Draft and pioneered by the Dallas Cowboys needs to be revised.

4c. Two things about home field advantage. First, it is fairly consistent within sports across time and across countries. Second, officiating makes up a decent amount of this advantage. I like the evidence of how baseball umpires suddenly started advantaging the road team on close calls when they knew that technology was being used to evaluate their calls.

4d. The chapter on the Cubs curse shows again that the idea is irrational.

5. In reading through this, I was reminded again of the wealth of statistics available in baseball. Other sports have to try to catch up to quantify as much as baseball can. But there is clearly a revolution underway, with more professional teams taking these numbers seriously, including the new NBA champions. Could we get an analysis of whether teams that pay more attention to advanced statistics and analysis actually have better records? “Moneyball” was a big idea for a while as well, but it doesn’t seem to get as much attention now that Billy Beane isn’t competing as well out in Oakland.

5a. I’m sure someone has translated an undergraduate statistics course into an all-sports data format. How appealing would students find this, and would it improve student learning outcomes?

Overall, I enjoyed this book: this should be of little surprise since it involves sports and statistics, two things that interest me. While some of the arguments may be familiar to sports fans, it does provide some more fodder for future sports conversations.