Making money online by tracking consumers

The Wall Street Journal starts a series on what companies are doing to track consumers and make money online. Among the findings:

The study found that the nation’s 50 top websites on average installed 64 pieces of tracking technology onto the computers of visitors, usually with no warning. A dozen sites each installed more than a hundred. The nonprofit Wikipedia installed none.

Tracking technology is getting smarter and more intrusive. Monitoring used to be limited mainly to “cookie” files that record websites people visit. But the Journal found new tools that scan in real time what people are doing on a Web page, then instantly assess location, income, shopping interests and even medical conditions. Some tools surreptitiously re-spawn themselves even after users try to delete them.

These profiles of individuals, constantly refreshed, are bought and sold on stock-market-like exchanges that have sprung up in the past 18 months.

If you are using the Internet, expect that people are “watching” you and trying to figure out how to make money off of you.

Discovering fake randomness

In the midst of a story about fake data that the polling firm Research 2000 generated for DailyKos, TechDirt summarizes exactly how the fabrication was discovered. Several statisticians approached Kos after noticing irregularities in the cross-tab (table) data. The summary and the original analysis on DailyKos are fascinating: even truly random data has to follow certain statistical patterns. One takeaway: faking random data is a lot harder than it looks. Another takeaway (for me at least): statistics can be both useful and enjoyable.

The three issues as summarized on DailyKos:

Issue one: astronomically low odds that the male and female figures would both be even or both be odd.

In one respect, however, the numbers for M and F do not differ: if one is even, so is the other, and likewise for odd. Given that the M and F results usually differ, knowing that say 43% of M were favorable (Fav) to Obama gives essentially no clue as to whether say 59% or say 60% of F would be. Thus knowing whether M Fav is even or odd tells us essentially nothing about whether F Fav would be even or odd.
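To make the parity point concrete, here is a minimal sketch (my own illustration, not the DailyKos analysis; the helper name, percentage ranges, and poll counts are invented). With independently sampled male and female figures, the two parities should match only about half the time, so matching in every one of hundreds of polls would be astronomically unlikely.

```python
import random

# Rough illustration with assumed ranges (not real poll data): draw independent
# male and female favorability percentages and count how often both are even
# or both are odd.
def parity_match_rate(n_polls=10_000):
    matches = 0
    for _ in range(n_polls):
        m_fav = random.randint(35, 65)  # hypothetical male favorable %
        f_fav = random.randint(35, 65)  # hypothetical female favorable %
        if m_fav % 2 == f_fav % 2:      # both even or both odd
            matches += 1
    return matches / n_polls

print(parity_match_rate())  # roughly 0.5 for independent figures

# If the parities matched in, say, 100 independent polls in a row, the chance
# of that happening under independence would be about (1/2)**100.
print(0.5 ** 100)
```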

Issue two: the margin between favorability and unfavorability ratings did not display enough variance. If the polls were truly working with random samples, there would be a broader range of values.

What little variation there was in the difference of those cross-tab margins seemed to happen slowly over many weeks, not like the week-to-week random jitter expected for real statistics.
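A quick simulation shows how much jitter sampling alone should produce. This is a sketch under assumed numbers (the sample size and true favorability rates are invented, not taken from Research 2000's polls): even when the underlying opinion never changes, the favorable-minus-unfavorable margin bounces around by a few points from week to week purely from sampling error.

```python
import random
import statistics

# Sketch with assumed numbers: poll a population whose true favorability never
# changes and record the favorable-minus-unfavorable margin each week.
def simulate_margins(n_weeks=52, n_respondents=400, p_fav=0.55, p_unfav=0.35):
    margins = []
    for _ in range(n_weeks):
        fav = unfav = 0
        for _ in range(n_respondents):
            r = random.random()
            if r < p_fav:
                fav += 1
            elif r < p_fav + p_unfav:
                unfav += 1
        margins.append(100 * (fav - unfav) / n_respondents)
    return margins

margins = simulate_margins()
# Several points of week-to-week spread from sampling noise alone.
print(round(statistics.stdev(margins), 1))
```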

Issue three: the changes in favorability ratings from week to week were too random. In polls like this that track the same question week to week, the most common result is no change. The Research 2000 results had too many week-to-week changes, often small ones of a percentage point either way.
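Here too, a small simulation illustrates the expected pattern. The sketch below uses assumed numbers (sample size, favorability rate, and the helper name are mine, not Research 2000 figures): it polls an unchanged population every week, rounds to whole percentages the way published polls do, and tallies the week-to-week changes. No change comes out as the single most common outcome, with moves of a point either way close behind.

```python
import random
from collections import Counter

# Sketch with assumed numbers: repeatedly poll the same unchanged population,
# round each week's favorable percentage to a whole number, and count the
# week-to-week changes.
def weekly_changes(n_weeks=2_000, n_respondents=2_400, p_fav=0.55):
    results = [
        round(100 * sum(random.random() < p_fav for _ in range(n_respondents)) / n_respondents)
        for _ in range(n_weeks)
    ]
    return Counter(this - last for last, this in zip(results, results[1:]))

for delta, count in sorted(weekly_changes().items()):
    print(f"{delta:+d}: {count}")  # a change of 0 is the most frequent outcome
```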

For each issue individually, the odds that it would arise with truly random data are quite low. Put all three together in the same data set and the odds are lower still.
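As a back-of-the-envelope illustration (the numbers here are placeholders, not probabilities computed in the DailyKos analysis), roughly independent anomalies compound multiplicatively:

```python
# Placeholder probabilities, not figures from the actual analysis.
p_parity, p_low_variance, p_changes = 1e-10, 1e-3, 1e-3

# If the three anomalies are roughly independent, the chance of seeing all of
# them in the same data set is the product of the individual odds.
print(p_parity * p_low_variance * p_changes)  # 1e-16
```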

Beyond the questions it raises about the integrity of data collection (and it becomes clearer why many people distrust polls and statistics), this is a great example of statistical detective work. Too often, many of us see numbers and quickly trust them (or distrust them). In reality, it takes just a little work to dig into figures and discover what exactly is being measured and how. The “what” and the “how” matter tremendously, because they can radically alter the interpretation of the data. Citizens and journalists alike need some of these skills to decipher all the numbers we encounter every day.