In the midst of a story involving fake data generated for DailyKos by the polling firm Research 2000, TechDirt summarizes exactly how it was discovered that Research 2000 was faking the data. Several statisticians approached Kos after seeing some irregularities in the cross-tab (table) data. The summary and the original analysis on DailyKos are fascinating: even truly random data follows certain predictable patterns. One takeaway: faking random data is a lot harder than it looks. Another takeaway (for me at least): statistics can be both useful and enjoyable.
The three issues as summarized on DailyKos:
Issue one: astronomically low odds that the male and female figures would so consistently be both even or both odd.
In one respect, however, the numbers for M and F do not differ: if one is even, so is the other, and likewise for odd. Given that the M and F results usually differ, knowing that say 43% of M were favorable (Fav) to Obama gives essentially no clue as to whether say 59% or say 60% of F would be. Thus knowing whether M Fav is even or odd tells us essentially nothing about whether F Fav would be even or odd.
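To get a feel for how quickly those odds shrink, here is a minimal sketch in Python. It assumes, as the quoted passage argues, that the parity of the M figure and the parity of the F figure are independent, so each pair matches with probability 1/2. The function name and the figure of 100 pairs are my own illustrative choices, not numbers from the DailyKos analysis.

```python
from math import comb


def prob_at_least_matches(n_pairs: int, n_matches: int) -> float:
    """Chance of seeing at least n_matches parity agreements in n_pairs,
    assuming each M/F pair independently agrees with probability 1/2."""
    # Count the outcomes with n_matches or more agreements (binomial tail),
    # then divide by the total number of equally likely outcomes.
    favorable = sum(comb(n_pairs, k) for k in range(n_matches, n_pairs + 1))
    return favorable / 2 ** n_pairs


# Hypothetical example: 100 M/F pairs, every single one matching in parity.
print(prob_at_least_matches(100, 100))
```

Under that independence assumption, 100 matches out of 100 comes out to roughly 8 chances in 10^31, and the odds only get worse as more pairs are added.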
Issue two: the margin between favorability and unfavorability ratings did not display enough variance. If the polls were truly working with random samples, there would be a broader range of values.
What little variation there was in the difference of those cross-tab margins seemed to happen slowly over many weeks, not like the week-to-week random jitter expected for real statistics.
Issue three: the week-to-week changes in favorability ratings were too random. In most tracking polls like this, the most common week-to-week result is no change. The Research 2000 results showed too many changes from week to week, often small ones of a percentage point either way.
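Why should "no change" be the most common result? Here is a rough sketch in Python of one way to see it, under assumptions that are mine rather than the analysts': public opinion stays fixed at 50%, each week is an independent random sample of 2,400 people (a hypothetical sample size), and results are reported as whole percentages. The code computes the exact distribution of the reported figure and then of the difference between two consecutive weeks.

```python
from collections import defaultdict
from math import exp, lgamma, log


def rounded_percent_pmf(n: int, p: float) -> dict[int, float]:
    """Distribution of the reported whole-number percentage for a poll of
    n respondents when the true share is p (binomial sampling)."""
    pmf = defaultdict(float)
    for k in range(n + 1):
        # Binomial probability computed in log space to avoid underflow.
        log_prob = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1 - p))
        pmf[round(100 * k / n)] += exp(log_prob)
    return dict(pmf)


def change_pmf(n: int, p: float) -> dict[int, float]:
    """Distribution of (this week - last week) for two independent polls."""
    weekly = rounded_percent_pmf(n, p)
    change = defaultdict(float)
    for last, p_last in weekly.items():
        for this, p_this in weekly.items():
            change[this - last] += p_last * p_this
    return dict(change)


# Hypothetical settings: 2,400 respondents, true favorability fixed at 50%.
changes = change_pmf(n=2400, p=0.5)
most_common = max(changes, key=changes.get)
print("most likely change:", most_common)
for delta in range(-3, 4):
    print(delta, round(changes[delta], 3))
```

In this simplified setup, a change of zero is the single most likely outcome, with the probabilities falling off symmetrically on either side. Real opinion does drift, so this is only an illustration of the baseline the analysts were comparing against, not a reconstruction of their test.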
Taken individually, each of these patterns would be quite unlikely to arise in genuinely sampled data. For all three to show up in the same data is less likely still.
Besides raising issues about the integrity of data collection (and it becomes clearer why many people harbor a distrust toward polls and statistics), this is a great example of statistical detective work. Too often, many of us see numbers and quickly trust them (or distrust them). In reality, it takes just a little work to dig deeper into figures to discover what exactly is being measured and how it is being measured. The “what” and “how” matter tremendously, as they can radically alter the interpretation of the data. Citizens and journalists need some of these abilities to decipher all the numbers we encounter on a daily basis.