Finding data by finding and/or guessing URLs

A California high school student is posting new data from 2020 presidential polls before news organizations because he found patterns in their URLs:

How does Rawal do it? He correctly figures out the URL — the uniform resource locator, or full web address — that a graphic depicting the poll’s results appears at before their official release.

“URL manipulation is what I do,” he said, “and I’ve been able to get really good at it because, with websites like CNN and Fox, all the file names follow a pattern.”

He added, “I’m not going to go into more detail on that.”

He said he had just spoken with The Register’s news director, who expressed interest in his helping the newspaper “keep it under tighter wraps.” He is considering it.

This makes sense on both ends: media organizations need a way to organize their files and sites, and someone who looks at the URLs over time could figure out the pattern. It remains to be seen how media organizations will respond so that their stories do not get out before they report them.
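To make the idea of a file-name pattern concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the example.com addresses, the date-based naming scheme, and the helper names are invented, and the article does not describe Rawal's actual method. The code only builds a string and makes no web requests.

```python
import re
from datetime import date

# Hypothetical URLs observed over time. The host and the naming scheme are
# invented for illustration; they are not taken from any real news site.
observed = [
    "https://example.com/graphics/2019/06/poll-2019-06-01.png",
    "https://example.com/graphics/2019/06/poll-2019-06-08.png",
    "https://example.com/graphics/2019/06/poll-2019-06-15.png",
]

# Assumed pattern: the file name ends in poll-YYYY-MM-DD.png
DATE_RE = re.compile(r"poll-(\d{4})-(\d{2})-(\d{2})\.png$")


def extract_date(url):
    """Pull the date out of a URL that follows the assumed pattern."""
    match = DATE_RE.search(url)
    if match is None:
        raise ValueError(f"URL does not follow the assumed pattern: {url}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def build_url(day):
    """Rebuild a URL for a given date using the same assumed pattern."""
    return (
        f"https://example.com/graphics/{day:%Y}/{day:%m}/"
        f"poll-{day:%Y-%m-%d}.png"
    )


def predict_next(urls):
    """Guess the next URL, assuming the files appear at one regular interval."""
    dates = sorted(extract_date(u) for u in urls)
    gaps = {later - earlier for earlier, later in zip(dates, dates[1:])}
    if len(gaps) != 1:
        raise ValueError("the observed dates do not share a single interval")
    return build_url(dates[-1] + gaps.pop())


print(predict_next(observed))
```

The sketch also suggests one possible response for publishers: a file name that includes a random token rather than only a date or counter cannot be predicted this way.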

I imagine there is a broader application for this. Do many organizations have web pages or data that are online but not linked to, or for which a link is not easily found? I could imagine how such hidden or unlinked data could be used for data collection as well as for nefarious or less ethical purposes (imagine scooping news releases about soon-to-be-released economic figures in order to buy or sell stocks).

What is a “digital sociology firm”?

This news story reports the sale of a “digital sociology firm” named mPathDiscovery:

Richard Neal, CIO of mPathDiscovery, described TBX as a group of investors from different industries that came together in April. The transaction will provide mPathDiscovery with access to TBX’s capital, experience and business connections.

Neal said mPathDiscovery has two employees — himself and President David Goode — and uses an array of contract employees. The company will remain in Kansas City and soon will begin looking for its first office space.

One result of the transaction has been the purchase of the “” web domain. Neal said the name had been owned by a cybersquatter who offered to sell it for a profit.

Neal said digital sociology helps companies see who is saying what, when and where about them online. The process can help companies see how marketing messages are being received by the public and analyze attitudes about competitors.

Two things strike me:

  1. So this is beyond web analytics where companies try to figure out who is visiting their site. (That industry is crowded and there are a number of ways to measure engagement with websites.) This goes to the next level and examines how companies/pages are perceived. I imagine there are plenty of people already doing this – I’ve heard plenty of commercials for sites that want to protect the reputation of individuals – so what sets this company apart? This leads to the second point…
  2. What exactly makes this “digital sociology”? As a sociologist, I’m not sure what exactly this is getting at. Online society? Studying online interactions with companies? Is the use of the term sociology meant to imply a more rigorous kind of analysis? In the end, is the term sociology attractive to companies that want these services?

The ongoing mystery of counting website visitors

The headline says it all: “It’s 2015 – You’d Think We’d Have Figured Out How to Measure Web Traffic By Now.”

ComScore was one of the first businesses to take the approach Nielsen uses for TV and apply it to the Web. Nielsen comes up with TV ratings by tracking the viewing habits of its panel — those Nielsen families — and taking them as stand-ins for the population at large. Sometimes they track people with boxes that report what people watch; sometimes they mail them TV-watching diaries to fill out. ComScore gets people to install the comScore tracker onto their computers and then does the same thing.

Nielsen gets by with a panel of about 50,000 people as stand-ins for the entire American TV market. ComScore uses a panel of about 225,000 people to create their monthly Media Metrix numbers, Chasin said — the numbers have to be much higher because Internet usage is so much more particular to each user. The results are just estimates, but at least comScore knows basic demographic data about the people on its panel, and, crucial in the cookie economy, knows that they are actually people.

As Chasin noted, though, the game has changed. Mobile users are more difficult to wrangle into statistically significant panels for a basic technical reason: Mobile apps don’t continue running at full capacity in the background when not in use, so comScore can’t collect the constant usage data that it relies on for its PC panel. So when more and more users started going mobile, comScore decided to mix things up…

Each measurement company comes up with different numbers each month, because they all have different proprietary models, and the data gets more tenuous when they start to break it out into age brackets or household income or spending habits, almost all of which is user-reported. (And I can’t be the only person who intentionally lies, extravagantly, on every online survey that I come across.)…

And that’s assuming that real people are even visiting your site in the first place. A study published this year by a Web security company found that bots make up 56 percent of all traffic for larger websites, and up to 80 percent of all traffic for the mom-and-pop blogs out there. More than half of those bots are “good” bots, like the crawlers that Google uses to generate its search rankings, and are discounted from traffic number reports. But the rest are “bad” bots, many of which are designed to register as human users — that same report found that 22 percent of Web traffic was made up of these “impersonator” bots.

This is an interesting data problem to solve with multiple interested parties from measurement firms, website owners, people who create search engines, and perhaps, most important of all, advertisers who want to quantify exactly which advertisements are seen and by whom. And the goalposts keep moving: new technologies like mobile devices change how visits are tracked and measured.
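To show why panel numbers are "just estimates," here is a minimal sketch of scaling a panel up to a population. The panel size of 225,000 comes from the article quoted above; the number of panelists who visited the site, the population size, and the function name are hypothetical. The sketch also assumes the panel behaves like a simple random sample, which real measurement panels, with their proprietary weighting, almost certainly do not.

```python
import math


def estimate_audience(panel_size, panel_visitors, population, z=1.96):
    """Scale a panel's visit rate up to a population.

    Returns (estimate, low, high), where low and high bound a rough 95%
    interval based on the normal approximation for a sample proportion.
    Assumes the panel is a simple random sample of the population.
    """
    share = panel_visitors / panel_size
    std_error = math.sqrt(share * (1 - share) / panel_size)
    low = max(0.0, share - z * std_error)
    high = min(1.0, share + z * std_error)
    return share * population, low * population, high * population


# 225,000 is the panel size reported in the article; the other two numbers
# are invented for illustration.
estimate, low, high = estimate_audience(
    panel_size=225_000,
    panel_visitors=4_500,
    population=250_000_000,
)
print(f"estimated visitors: {estimate:,.0f} (roughly {low:,.0f} to {high:,.0f})")
```

The interval widens quickly once the panel is split into age brackets or income groups, since each subgroup has far fewer panelists. That may be one reason the numbers get more tenuous when firms break them out.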

How long until we get an official number from a reputable organization? Could some of these measurement groups and techniques merge? Consolidation to cut costs seems to be popular in the business world these days. In the end, it might not be good measurement that wins out but rather which companies can throw their weight around most effectively to eliminate their competition.

Facebook ran a mood-altering experiment. What are the ethics for doing research with online subjects?

In 2012, Facebook ran a one-week experiment by changing news feeds and looking at how people’s moods changed. The major complaint about this seems to be the lack of consent and/or deception:

The backlash, in this case, seems tied directly to the sense that Facebook manipulated people—used them as guinea pigs—without their knowledge, and in a setting where that kind of manipulation feels intimate. There’s also a contextual question. People may understand by now that their News Feed appears differently based on what they click—this is how targeted advertising works—but the idea that Facebook is altering what you see to find out if it can make you feel happy or sad seems in some ways cruel.

This raises important questions about how online research intersects with traditional scientific ethics. In sociology, we tend to sum up our ethics in two rules: don’t harm people and participants have to volunteer or give consent to be part of studies. The burden falls on the researcher to ensure that the subject is protected. How explicit should this be online? Participants on Facebook were likely not seriously harmed, though it could be quite interesting if someone could directly link their news feed from that week to negative offline consequences. And how well do the terms of service line up with conducting online research? Given the public relations issues, it would behoove companies to be more explicit about this in their terms of service or somewhere else, though they might argue that informing people immediately when things are happening online can influence results. This particular issue will be one to watch as the sheer numbers of people online alone will drive more and more online research.

Let’s be honest about the way this Internet stuff works. There is a trade-off involved: users get access to all sorts of information, other people, products, and the latest viral videos and celebrity news that everyone has to know. In exchange, users give up something, whether that is their personal information, tracking of their online behaviors, or exposure to advertisements intended to part them from their money. Maybe it doesn’t have to be this way, set up with such bargaining, and where exactly the line is drawn is a major discussion point at this time. In the meantime, you should assume websites and companies and advertisers are trying to get as much from you as possible and plan accordingly. Facebook is not a pleasant entity that just wants to make your life better by connecting you to people; they have their own aims which may or may not line up with your own. Google, Facebook, Amazon, etc. are mega corporations whether they want to be known as such or not.

21st century methodology problem: 4 ways to measure online readership

While websites can collect lots of information about readers, how exactly this should all be measured is still unclear. Here are four options:

Uniques: Unique visitors is a good metric, because it measures monthly readers, not just meaningless clicks. It’s bad because it measures people rather than meaningful engagement. For example, Facebook viral hits now account for a large share of traffic at many sites. There are one-and-done nibblers on the Web and there are loyal readers. Monthly unique visitors can’t tell you the difference.

Page Views: They’re good because they measure clicks, which is an indication of engagement that unique visitors doesn’t capture (e.g.: a blog with loyal readers will have a higher ratio of page views-to-visitors, since the same people keep coming back). They’re bad for the same reason that they can be corrupted. A 25-page slideshow of the best cities for college graduates will have up to 25X more views than a one-page article with all the same information. The PV metric says the slideshow is 25X more valuable if ads are reloaded on each page of the slideshow. But that’s ludicrous.

Time Spent/Attention Minutes: Page views and uniques tell you an important but incomplete fact: The article page loaded. It doesn’t tell you what happens after the page loads. Did the reader click away? Did he stay for 20 minutes? Did he open the browser tab and never read the story? These would be nice things to know. And measures like attention minutes can begin to tell us. But, as Salmon points out, they still don’t paint a complete picture. Watching a 5 minute video and deciding it was stupid seems less valuable than watching a one minute video that you share with friends and praise. Page views matter, and time spent matters, but reaction matters, too. This suggests two more metrics …

Shares and Mentions: “Shares” (on Facebook, Twitter, LinkedIn, or Google+) ostensibly tell you something that neither PVs, nor uniques, nor attention minutes can tell you: They tell you that visitors aren’t just visiting. They’re taking action. But what sort of action? A bad column will get passed around on Twitter for a round of mockery. An embarrassing article can go viral on Facebook. Shares and mentions can communicate the magnitude of an article’s attention, but they can’t always tell you the direction of the share vector: Did people share it because they loved it, or because they loved hating it?

Here are some potential options for sorting this all out:

1. Developing a scale or index that combines all of these factors. It could be as easy as having each of these four count for 25%, or the components could be weighted differently.

2. Heavyweights in the industry – whether particular companies or advertisers or analytical leaders – make a decision about which of these is most important. For example, comments after this story note the problems with Nielsen television ratings over the decades but Nielsen had a stranglehold on this area.

3. Researchers outside the industry could “objectively” develop a measure. This may be unlikely as outside actors have less financial incentive but perhaps someone sees an opportunity here.
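The first option, a combined index, can be sketched in a few lines of Python. This is only one of many ways such an index could be built: the article data, the metric names, and the choice to rescale each metric to a 0 to 1 range before weighting are my own assumptions, not an industry standard.

```python
METRICS = ("uniques", "page_views", "attention_minutes", "shares")


def normalize(values):
    """Rescale a list of numbers to the 0-1 range (min-max scaling)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def readership_index(articles, weights=None):
    """Combine the four metrics into one score per article.

    With no weights given, each metric counts for 25%.
    """
    if weights is None:
        weights = {metric: 0.25 for metric in METRICS}
    if abs(sum(weights[m] for m in METRICS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    names = list(articles)
    scaled = {m: normalize([articles[n][m] for n in names]) for m in METRICS}
    return {
        name: sum(weights[m] * scaled[m][i] for m in METRICS)
        for i, name in enumerate(names)
    }


# Hypothetical monthly numbers for three kinds of articles.
articles = {
    "slideshow": {
        "uniques": 1000, "page_views": 25000,
        "attention_minutes": 1500, "shares": 10,
    },
    "one_page_article": {
        "uniques": 1000, "page_views": 1000,
        "attention_minutes": 3000, "shares": 200,
    },
    "viral_hit": {
        "uniques": 5000, "page_views": 5500,
        "attention_minutes": 2000, "shares": 900,
    },
}

for name, score in readership_index(articles).items():
    print(f"{name}: {score:.2f}")
```

With equal weights the 25-page slideshow no longer wins just by reloading pages, but the result depends entirely on the weights chosen, which is exactly what the industry would have to agree on.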

In the meantime, there is plenty of information on online readership to look at, websites and companies can claim various things with different metrics, and websites and advertisers will continue to have a strong financial interest in all of this.

Help needed in measuring online newspaper readership

The newspaper industry is in trouble and it doesn’t help that there is not an agreed-upon way to measure online readership:

It’s no longer uncommon for someone to own three or four devices that can access news content at home, work or almost anywhere. This array causes headaches for newspaper publishers and editors and sows confusion for advertisers who want to know how many readers a newspaper has. How should they be counted? Where should advertisers put their dollars? How many readers does an online advertisement reach? What’s an ad worth anymore?

Perhaps as vexing is who is counting readers and who counts them best. Unlike the methods Arbitron and Nielsen use to develop radio and TV ratings, the science of counting online and digital news consumers has existed only for a short time. At least nine companies have crowded into the business of measuring digital audiences over the past 15 years. Each company employs its own methodology to collect data. And because digital technology seems to leap forward almost every day, measurement techniques that were acceptable yesterday may not be adequate tomorrow.

With the money at stake in advertising and prestige, you would think there would be more agreement here. Without agreed-upon standards, newspapers can claim very different numbers and there is no way to really sort it out.

Why can’t newspapers themselves pick a provider or two they like, perhaps one that is more generous in its counting, and run with it as an industry?

Dana Chinn, a lecturer at the University of Southern California’s Annenberg School for Communication and Journalism, said newspapers haven’t kept up with other industries that do business online.

“There is a stark contrast between the news industry and e-commerce, in that e-commerce is saying analytics is do or die for us because we are a digital business,” Chinn said. “News organizations don’t say that, because if they did they would use the right metrics. All the news organizations I know are usually using the wrong metrics to make the decisions that are needed to survive.”

This is a reminder that money-making today is very closely tied to measurement, particularly when you are selling online information.

A lot of web traffic comes through the “dark social,” not through social network sites

Alexis Madrigal argues that while social network sites like Facebook get a lot of attention, a lot of web traffic is influenced by social processes that are much more difficult to see and measure:

Here’s a pocket history of the web, according to many people. In the early days, the web was just pages of information linked to each other. Then along came web crawlers that helped you find what you wanted among all that information. Some time around 2003 or maybe 2004, the social web really kicked into gear, and thereafter the web’s users began to connect with each other more and more often. Hence Web 2.0, Wikipedia, MySpace, Facebook, Twitter, etc. I’m not strawmanning here. This is the dominant history of the web as seen, for example, in this Wikipedia entry on the ‘Social Web.’…

There are circumstances, however, when there is no referrer data. You show up at our doorstep and we have no idea how you got here. The main situations in which this happens are email programs, instant messages, some mobile applications*, and whenever someone is moving from a secure site to a non-secure site.

This means that this vast trove of social traffic is essentially invisible to most analytics programs. I call it DARK SOCIAL. It shows up variously in programs as “direct” or “typed/bookmarked” traffic, which implies to many site owners that you actually have a bookmark or typed the address into your browser. But that’s not actually what’s happening a lot of the time. Most of the time, someone Gchatted someone a link, or it came in on a big email distribution list, or your dad sent it to you…

Just look at that graph. On the one hand, you have all the social networks that you know. They’re about 43.5 percent of our social traffic. On the other, you have this previously unmeasured darknet that’s delivering 56.5 percent of people to individual stories. This is not a niche phenomenon! It’s more than 2.5x Facebook’s impact on the site…

If what I’m saying is true, then the tradeoffs we make on social networks is not the one that we’re told we’re making. We’re not giving our personal data in exchange for the ability to share links with friends. Massive numbers of people — a larger set than exists on any social network — already do that outside the social networks. Rather, we’re exchanging our personal data in exchange for the ability to publish and archive a record of our sharing. That may be a transaction you want to make, but it might not be the one you’ve been told you made.

Two thoughts about this:

1. Here is how I might interpret this argument from a sociological point of view: Internet traffic is heavily dependent on social connections. Whether this is done on sites like Facebook, which are more publicly social, or through email, which is restricted from public view but is still quite social, the interactions people have influence where they go on the web. In this sense, the Internet is an important social domain that may have some of its own norms and rules as well as its own advantages and disadvantages but it is built around human connections.

2. This sounds like a fantastic business and/or research opportunity; what is going on in this “dark social” realm? Could there be ways at getting at these activities that would help us better understand and analyze the importance of social connections and interactions and could this information be monetized as well?
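As a starting point for measuring this, here is a minimal sketch of how a site might sort visits by referrer. The host lists, the category names, and especially the rule of thumb that a visit with no referrer to a page other than the homepage counts as “dark social” are assumptions for illustration. The excerpt above does not spell out how the traffic was actually classified.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative host lists; a real analytics tool would use much longer ones.
SOCIAL_HOSTS = ("facebook.com", "twitter.com", "reddit.com", "linkedin.com")
SEARCH_HOSTS = ("google.com", "bing.com")


def host_matches(host, known_hosts):
    """True if host is one of the known hosts or a subdomain of one."""
    return any(host == k or host.endswith("." + k) for k in known_hosts)


def classify_visit(referrer, landing_path):
    """Put one visit into a rough traffic category.

    Assumption: with no referrer, a homepage visit was plausibly typed or
    bookmarked, while a visit to a specific story more plausibly came from
    a link shared by email or chat.
    """
    if not referrer:
        return "direct" if landing_path in ("", "/") else "dark_social"
    host = (urlparse(referrer).hostname or "").lower()
    if host_matches(host, SOCIAL_HOSTS):
        return "social"
    if host_matches(host, SEARCH_HOSTS):
        return "search"
    return "other_referral"


# Hypothetical visits as (referrer, landing path) pairs.
visits = [
    ("https://www.facebook.com/", "/2012/10/some-story/"),
    ("", "/2012/10/some-story/"),
    (None, "/2012/10/another-story/"),
    ("", "/"),
    ("https://www.google.com/search?q=news", "/2012/10/some-story/"),
]

print(Counter(classify_visit(ref, path) for ref, path in visits))
```

Even this simple rule cannot tell a link sent by a friend from one copied out of a private document, which is part of why this traffic is so hard to study.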

Why sociologists should make their own apps

A sociologist who has made her own medical sociology app argues that her colleagues should be making their own apps:

My decision to make an app stemmed from two major reasons. First, I have long been interested in the ways people interact with computer technologies, and have published some research on this in the past.

More recently my interest has turned to health-related apps available for smartphones and tablet computers. I had been researching the various apps available for such purposes and had noted that many apps have been developed for teaching purposes for medical students.

Second, we have mobile digital devices at home that are very popular with my two school-aged daughters. I had noticed the huge number of educational apps that are available for children’s use, from infancy to high-school level. Some Australian high schools, including my older daughter’s school, have acknowledged young people’s high take-up of mobile digital devices and are beginning to advocate that students bring their devices to school and use them for educational purposes during the school day.

The relevance for tertiary-level education appeared obvious. I wondered whether many universities, academic publishers or academics themselves had begun to develop apps. Yet, having searched both the Android and the Apple App Stores using the search term of my discipline, ‘sociology’, I discovered only a handful of apps related to this subject for tertiary students. Nor were there many for other social sciences. There seemed to be a wide-open gap in the market…

My app is very simple. It is text-based only and has no illustrations or graphics, but there is provision for these to be included if the developer so chooses. Apps developed using this particular wizard are only available for use on Android devices, but having looked at similar app makers for Apple devices I was put off by their more technical nature and the greater expense involved.

In just a couple of hours my app was ready. I had typed in over 25 medical sociology key concepts (for example, social class, discourse, identity, illness narratives, poststructuralism), plus a list of books for further reading, chosen a nice-looking background and paid US$79.00 for the app to appear without ads and to guarantee that it would be submitted to the Android App Store.

Three issues I could see with this:

1. How much demand is there really for such apps? I can’t imagine too many people look for sociology or social science apps. Of course, such apps are relatively easy to make, so it isn’t as if tons of time has to be invested in them (though there could be a relationship between the time put into an app and how engaging it is).

2. The assumption here is that people want to use these apps for educational purposes. Would this work? Can apps effectively be used for education?

3. How much better is making an app than putting together a website?

I’m glad to see more sociologists venturing into new technologies but it is worthwhile to consider the payoffs and how these apps are really going to be used.

Media looks for ways to better measure fragmented audience

As media platforms proliferate, media companies are looking for better ways to measure their audience:

“We have Omniture data, comScore, Nielsen, some of our internal metrics that we look at — they don’t match,” Wert said.

Hampering the effort are audiences splintering into ever smaller shards as they use an array of outlets and platforms — including websites, mobile devices, print and broadcast…

The tinier the pieces the more precious each becomes. It’s more important than ever for traditional media looking to cover the costs of producing content to deliver to marketers as much information as possible about who’s watching, reading and listening.

Arguably, technology has made the measurement systems better than ever. But the result is counterintuitive: Consumers are followed more closely but the numbers don’t always add up, and it’s not clear how to put a value on those numbers…

Nielsen’s Patrick Dineen, senior vice president of local television audience measurement, said it’s “wildly inappropriate” to try to track audiences through one medium. Kevin Gallagher, executive vice president and local director at Starcom, said his firm has replaced talk of traditional media planning with something that tracks targeted consumers’ daily interaction with media.

Getting the right numbers means media companies will be able to more accurately gauge advertising, particularly target audiences, and then make more money. Solving these issues and appropriately valuing these media interactions will be a huge issue moving forward and whoever can do it first or do it best could have an advantage.

Sociologist Duncan Watts helped come up with the idea for the Huffington Post

Here is an interesting sidelight to sociologist Duncan Watts’s career: he helped create the Huffington Post.

The origins of the now famous Huffington Post began at a lunch in 2003 between AOL’s Kenneth Lerer and author and sociologist Duncan Watts. The two met to discuss Watts’ book, and left with the beginnings of the Huff Post. The Columbia Journalism Review recently gave its own take on Watts’ book, Six Degrees, that inspired Lerer from the get-go and on the history of The Huffington Post as we now know it. According to CJR, before AOL’s purchase of HuffPost in 2011, the company was not known for revenues or breaking news stories. However, the website had managed to master social media integration and search-engine optimization.

Here are more details from the story in the Columbia Journalism Review cited above:

He brought the book with him and Watts would recall that the copy was dog-eared, the flatteringly telltale sign of a purposeful read. Lerer had a plan and he wanted Watts to help him. He had set himself an ambitious target. He wanted to take on the National Rifle Association.

He told Watts: “I know the answer to this is somewhere in these pages.”…

Ken Lerer listened, and he was not deterred. Networks did, in fact, occur—vast networks through which previously disconnected people suddenly found themselves joined together, perhaps to share an idea, a song, a sentiment, a cause. Why not then try to create a network that could challenge the vast and powerful and sustaining network of the NRA?

“I know the answers,” Watts told him. “I am confident they are not there.” Then, having deflated Lerer, Watts threw him a lifeline: “Maybe my friend Jonah can help you.”

An interesting read: in order to fight the NRA and counter the Drudge Report, people wanted to make the Huffington Post both viral and sticky.

However, from his Twitter account, here is Watts’s Apr 18 take on the CJR piece:

Six degrees of aggregation: A fascinating (in my biased opinion) take on the origins of the Huffington Post.