The problem of archiving the Internet may be just the first problem; how do we make causal arguments from its contents?

Archiving the Internet so that it can be understood and studied by later researchers and scholars may be a big problem:

In a new paper, “Stewardship in the ‘Age of Algorithms,’” Clifford Lynch, the director of the Coalition for Networked Information, argues that the paradigm for preserving digital artifacts is not up to the challenge of preserving what happens on social networks.

Over the last 40 years, archivists have begun to gather more digital objects—web pages, PDFs, databases, kinds of software. There is more data about more people than ever before, however, the cultural institutions dedicated to preserving the memory of what it was to be alive in our time, including our hours on the internet, may actually be capturing less usable information than in previous eras…

Nick Seaver of Tufts University, a researcher in the emerging field of “algorithm studies,” wrote a broader summary of the issues with trying to figure out what is happening on the internet. He ticks off the problems of trying to pin down—or in our case, archive—how these web services work. One, they’re always testing out new versions. So there isn’t one Google or one Bing, but “10 million different permutations of Bing.” Two, as a result of that testing and their own internal decision-making, “You can’t log into the same Facebook twice.” It’s constantly changing in big and small ways. Three, the number of inputs and complex interactions between them simply makes these large-scale systems very difficult to understand, even if we have access to outputs and some knowledge of inputs.

In order to study something, you have to measure and document it well. This is an essential first step for many research projects.

But, I wonder: even if it can all be documented well, what exactly would it tell us about behaviors and aspirations? Like any “text,” it may be difficult to make causal arguments based on the artifacts of our Internet or social media. They are controlled by a relatively small number of people. Social media is dominated by a relatively small number of users. Many people in society interact with both but how exactly are their lives changed? The history of the Internet and social media and the forces behind it is one thing; it could be fascinating to see how the birth of the World Wide Web in the early 1990s or AOL or Facebook or Google are all viewed several decades into the future. But, it will be much harder to clearly show how all these forces affected the average person. Did it change personalities? Did day-to-day life change in substantial ways? Did political opinions change? Did it disrupt or enhance relationships? What if Twitter dominates the media and the lives of 10% of the American population but has little impact on most lives?

There is a lot here to sort out and a lot of opportunities for good research. At the same time, there are a lot of chances for people to make vague claims and arguments based on correlations and broad patterns that cannot be explicitly linked.

Three possible responses to the finding that human behavior is complicated

A review of a new book includes a paragraph (the second one excerpted below) that serves as a good reminder for those interested in human behavior:

What happens in brains and bodies at the moment humans engage in violence with other humans? That is the subject of Stanford University neurobiologist and primatologist Robert M. Sapolsky’s Behave: The Biology of Humans at Our Best and Worst. The book is Sapolsky’s magnum opus, not just in length, scope (nearly every aspect of the human condition is considered), and depth (thousands of references document decades of research by Sapolsky and many others) but also in importance as the acclaimed scientist integrates numerous disciplines to explain both our inner demons and our better angels. It is a magnificent culmination of integrative thinking, on par with similar authoritative works, such as Jared Diamond’s Guns, Germs, and Steel and Steven Pinker’s The Better Angels of Our Nature. Its length and detail are daunting, but Sapolsky’s engaging style—honed through decades of writing editorials, review essays, and columns for The Wall Street Journal, as well as popular science books (Why Zebras Don’t Get Ulcers, A Primate’s Memoir)—carries the reader effortlessly from one subject to the next. The work is a monumental contribution to the scientific understanding of human behavior that belongs on every bookshelf and many a course syllabus.

Sapolsky begins with a particular behavioral act, and then works backward to explain it chapter by chapter: one second before, seconds to minutes before, hours to days before, days to months before, and so on back through adolescence, the crib, the womb, and ultimately centuries and millennia in the past, all the way to our evolutionary ancestors and the origin of our moral emotions. He gets deep into the weeds of all the mitigating factors at work at every level of analysis, which is multilayered, not just chronologically but categorically. Or more to the point, uncategorically, for one of Sapolsky’s key insights to understanding human action is that the moment you proffer X as a cause—neurons, neurotransmitters, hormones, brain-specific transcription factors, epigenetic effects, gene transposition during neurogenesis, dopamine D4 receptor gene variants, the prenatal environment, the postnatal environment, teachers, mentors, peers, socioeconomic status, society, culture—it triggers a cascade of links to all such intervening variables. None acts in isolation. Nearly every trait or behavior he considers results in a definitive conclusion, “It’s complicated.”

To adapt sociologist Joel Best’s approach to statistics in Damned Lies and Statistics, I suggest there are three broad approaches to understanding human behavior:

1. The naive. This approach believes human behavior is simple and explainable. We just need the right key to unlock behavior (whether this is a religious text or a single scientific cause or a strongly held personal preference).

2. The cynical. Human behavior is so complicated that we can never understand it. Why bother trying?

3. The critical. As Best suggests, this is an informed approach that knows how to ask the right questions. To the naive, it might ask whether there are other factors to consider. To the cynical, it might say that just because it is really complicated doesn’t mean that we can’t find patterns. Causation is often difficult to determine in the natural and social sciences but this does not mean that we cannot find bundles of factors or processes that occur. The key here is recognizing when people are making reasonable arguments about explaining human behavior: when do their claims go too far or when are they missing something?

Why is football “the sport that most closely aligns itself with religion”?

NFL player Arian Foster is out as a non-religious player:

Arian Foster, 28, has spent his entire public football career — in college at Tennessee, in the NFL with the Texans — in the Bible Belt. Playing in the sport that most closely aligns itself with religion, in which God and country are both industry and packaging, in which the pregame flyover blends with the postgame prayer, Foster does not believe in God.

“Everybody always says the same thing: You have to have faith,” he says. “That’s my whole thing: Faith isn’t enough for me. For people who are struggling with that, they’re nervous about telling their families or afraid of the backlash … man, don’t be afraid to be you. I was, for years.”

He has tossed out sly hints in the past, just enough to give himself wink-and-a-nod deniability, but he recently decided to become a public face of the nonreligious. Moved by the testimonials of celebrity atheists like comedian Bill Maher and magicians Penn and Teller, Foster has joined a national campaign by the nonprofit group Openly Secular, which plans to use his story to increase awareness and acceptance of nonbelievers, especially in sports. The organization initially approached ESPN about Foster’s willingness to share his story, but ESPN subsequently dealt directly with Foster, and Openly Secular had no involvement…

Religion may be football’s sole concession to humility, perhaps the only gesture that suggests the game itself is not its own denomination. Nowhere is the looming proximity of Christianity more pronounced than in the SEC, where, in the time of Tim Tebow, a man named Chad Gibbs was inspired to write a book — God and Football — telling of his travels to every SEC school to decipher how like-minded Christians navigate the cliff walk between rooting for Florida and maintaining their devotion to Christ. These religious currents aren’t confined to football, of course: Big league baseball teams routinely hold “faith and family” days; players appear at postgame celebrations to give their testimonials, and Christian rock bands perform well into the night. In football, though, public displays of faith can be viewed as a necessary accessory for such a dangerous and violent sport.

I’m more interested in why football might identify more with religion than other sports. (And I’m a bit skeptical of whether this is true.) Is it:

1. The physical nature of the game? Perhaps it reminds the athletes more of their own mortality. Plus, careers are short due to the physical demands. Perhaps playing football reinforces religiosity.

2. The connection between football and certain areas of the country? This article cites the Bible Belt and SEC schools. So this connection between football and religion could really be a relationship between football and the South? This could be an example of a spurious correlation.

3. The people who play football are more religious and/or come from more religious families? In this explanation, the religiosity comes before football rather than because of football (different causal order).

4. Football players have been more publicly vocal about their faith compared to athletes in other sports?

5. A historical connection between churches and/or religious schools and football?

Could be some interesting stuff to look into…
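To make the spurious correlation possibility in explanation 2 concrete, here is a minimal simulation sketch. All of the probabilities are made up for illustration and are not estimates of anything real. In this toy world, region influences both playing football and being religious, and the two traits never influence each other. The overall correlation still comes out positive, while the correlation within each region is close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, seed=0):
    """Return rows of (in_south, plays_football, religious).

    The probabilities are hypothetical. Football and religiosity each
    depend only on region, never on one another.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        in_south = rng.random() < 0.5
        plays_football = rng.random() < (0.6 if in_south else 0.2)
        religious = rng.random() < (0.7 if in_south else 0.3)
        rows.append((in_south, int(plays_football), int(religious)))
    return rows


def football_religion_correlation(rows):
    return pearson([r[1] for r in rows], [r[2] for r in rows])


rows = simulate()
overall = football_religion_correlation(rows)
within_south = football_religion_correlation([r for r in rows if r[0]])
within_elsewhere = football_religion_correlation([r for r in rows if not r[0]])

print(f"overall: {overall:.3f}")
print(f"within South: {within_south:.3f}")
print(f"within elsewhere: {within_elsewhere:.3f}")
```

Telling this apart from explanations 1 or 3 would take real data on region, religiosity, and when each began, which the simulation does not supply.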

Argument: humans like causation because they like to feel in control

Here is an interesting piece that summarizes some research and concludes that humans like to feel in control and therefore like the idea of causality:

This predisposition for causation seems to be innate. In the 1940s, psychologist Albert Michotte theorized that “we see causality, just as directly as we see color,” as if it is omnipresent. To make his case, he devised presentations in which paper shapes moved around and came into contact with each other. When subjects—who could only see the shapes moving against a solid-colored background—were asked to describe what they saw, they concocted quite imaginative causal stories…

Nassim Taleb noted how ridiculous this is in his book The Black Swan. In the hours after former Iraqi dictator Saddam Hussein was captured on December 13, 2003, Bloomberg News blared the headline, “U.S. TREASURIES RISE; HUSSEIN CAPTURE MAY NOT CURB TERRORISM.” Thirty minutes later, bond prices retreated and Bloomberg altered their headline: “U.S. TREASURIES FALL; HUSSEIN CAPTURE BOOSTS ALLURE OF RISKY ASSETS.” A more correct headline might have been: “U.S. TREASURIES FLUCTUATE AS THEY ALWAYS DO; HUSSEIN CAPTURE HAS NOTHING TO DO WITH THEM WHATSOEVER,” but that isn’t what editors want to post, nor what people want to read.

This trend doesn’t merely manifest itself for stocks or large events. Take scientific studies, for example. Many of the most sweeping findings, ones normally reported in large media outlets, originate from associative studies that merely correlate two variables—television watching and death, for example. Yet headlines—whose functions are partly to summarize and primarily to attract attention—are often written as “X causes Y” or “Does X cause Y?” (I have certainly been guilty of writing headlines in the latter style). In turn, the general public usually treats these findings as cause-effect, despite the fact that there may be no proven causal link between the variables. The article itself might even mention the study’s correlative, not causative, nature, and this still won’t change how it is perceived. Co-workers across the world will still congregate around coffee machines the next day, chatting about how watching The Kardashians is killing you, albeit very slowly.

Humanity’s need for concrete causation likely stems from our unceasing desire to maintain some iota of control over our lives. That we are simply victims of luck and randomness may be exhilarating to a madcap few, but it is altogether discomforting to most. By seeking straightforward explanations at every turn, we preserve the notion that we can always affect our condition in some meaningful way. Unfortunately, that idea is a facade. Some things don’t have clear answers. Some things are just random. Some things simply can’t be controlled.

I like the reference to Taleb here. His books make just this argument: people want to see patterns when they don’t exist and thus are completely unprepared for changes in the stock market, governments, or the natural world. The trick is to know when you can rely on patterns and when you can’t – and Taleb even has general investment strategies in his most recent book Antifragile that try to minimize loss and try to maximize potential gains.

I wonder if this isn’t lurking behind the discussion of big data: there are scientists and others who seem to suggest that all we need to understand the world is more data and better pattern recognition tools. If only we could get enough, we could figure things out. But, what if the world turns out to be too complex? What if we can’t know everything about the social or natural world? Does this then change our perceptions of human ingenuity and progress?

h/t Instapundit

Canadian PM says we shouldn’t “commit sociology” and try to explain terrorism

When asked about a recently uncovered train terrorism plot, the Canadian Prime Minister said we should not “commit sociology”:

Prime Minister Stephen Harper said this is not the time to “commit sociology” when asked about the arrests of two men this week who are accused of conspiring to carry out a terrorist attack on a Via train.

Harper was asked during a news conference with Trinidad and Tobago’s prime minister about concerns with the timing of the arrests. He was also asked about when it’s appropriate to talk about the root causes of involvement with terrorism.

The Conservatives had taken Liberal Leader Justin Trudeau to task when he suggested last week it was important to look at the root causes of the Boston Marathon bombings after offering condolences and support to the victims. They said he was trying to rationalize the bombings or make excuses when the Liberal leader said the bombings happened because someone felt excluded from society.

“I think, though, this is not a time to commit sociology, if I can use an expression,” Harper said. “These things are serious threats, global terrorist attacks, people who have agendas of violence that are deep and abiding threats to all the values our society stands for.

“I don’t think we want to convey any view to the Canadian public other than our utter condemnation of this kind of violence, contemplation of this violence and our utter determination through our laws and our activities to do everything we can to prevent it and counter it,” Harper said.

This echoes some conversations in recent years:

- George Will warned against committing sociology after the shooting in Aurora, Colorado.

- After the riots in London, some said we should not try to explain why some people would riot (which is a relatively rare event in Western society).

Is this a new conservative talking point?

Just because we want to try to understand why some people commit terrorist acts (and most others do not) does not mean the explanations excuse or condone the actions. It also does not necessarily imply that society is entirely at fault. But, we do know that social forces can affect people even as individuals have some agency. In the end, thinking about causes of terrorism (and rioting) can help us develop ways to stop it in the future.

The rise of “data science” as illustrated by examining the McDonald’s menu

Christopher Mims takes a look at “data science” and one of its practitioners:

Before he was mining terabytes of tweets for insights that could be turned into interactive visualizations, [Edwin] Chen honed his skills studying linguistics and pure mathematics at MIT. That’s typically atypical for a data scientist, who have backgrounds in mathematically rigorous disciplines, whatever they are. (At Twitter, for example, all data scientists must have at least a Master’s in a related field.)

Here’s one of the wackier examples of the versatility of data science, from Chen’s own blog. In a post with the rousing title Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Chen delves into the problem of clustering. That is, how do you take a mass of data and sort it into groups of related items? It’s a tough problem — how many groups should there be? what are the criteria for sorting them? — and the details of how he tackles it are beyond those who don’t have a background in this kind of analysis.

For the rest of us, Chen provides a concrete and accessible example: McDonald’s

By dumping the entire menu of McDonald’s into his mathemagical sorting box, Chen discovers, for example, that not all McDonald’s sauces are created equal. Hot Mustard and Spicy Buffalo do not fall into the same cluster as Creamy Ranch, which has more in common with McDonald’s Iced Coffee with Sugar Free Vanilla Syrup than it does with Newman’s Own Low Fat Balsamic Vinaigrette.

This sounds like an updated version of factor analysis: break a whole into its larger and influential pieces.
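The quoted question of "how many groups should there be?" is what the Dirichlet process in the title of Chen's post is meant to address. One standard way to picture it is the Chinese restaurant process, sketched below. This is not Chen's code and it uses no McDonald's data. It only samples groupings from the prior, with a hypothetical concentration parameter `alpha`, to show that the number of groups is not fixed in advance: each new item joins an existing group with probability proportional to that group's size, or starts a new group with probability proportional to `alpha`.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sample a random grouping of n_items from a Chinese restaurant process.

    Returns (assignments, sizes): the group index for each item and the
    number of items in each group. alpha > 0 is a hypothetical
    concentration parameter; larger values tend to produce more groups.
    """
    rng = random.Random(seed)
    assignments = []
    sizes = []
    for i in range(n_items):
        # i items are already grouped, so the group sizes sum to i.
        # A draw in [0, i) picks an existing group in proportion to its
        # size; a draw in [i, i + alpha) opens a new group.
        draw = rng.random() * (i + alpha)
        chosen = len(sizes)  # default: open a new group
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if draw < cumulative:
                chosen = k
                break
        if chosen == len(sizes):
            sizes.append(1)
        else:
            sizes[chosen] += 1
        assignments.append(chosen)
    return assignments, sizes


assignments, sizes = chinese_restaurant_process(n_items=100, alpha=1.0, seed=1)
print("number of groups:", len(sizes))
print("group sizes:", sizes)
```

A full clustering model would combine this prior with a measure of how well each menu item fits each group. That part is omitted here.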

Here is how Chen describes the field:

I agree — but it depends on your definition of data science (which many people disagree on!). For me, data science is a mix of three things: quantitative analysis (for the rigor necessary to understand your data), programming (so that you can process your data and act on your insights), and storytelling (to help others understand what the data means). So useful skills for a data scientist to have could include:

* Statistics, machine learning (on the quantitative analysis side). For example, it’s impossible to extract meaning from your data if you don’t know how to distinguish your signals from noise. (I’ll stress, though, that I believe any kind of strong quantitative ability is fine — my own background was originally in pure math and linguistics, and many of the other folks here come from fields like physics and chemistry. You can always pick up the specific tools you’ll need.)

* General programming ability, plus knowledge of specific areas like MapReduce/Hadoop and databases. For example, a common pattern for me is that I’ll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.

* Web programming, data visualization (on the storytelling side). For example, I find it extremely useful to be able to throw up a quick web app or dashboard that allows other people (myself included!) to interact with data — when communicating with both technical and non-technical folks, a good data visualization is often a lot more helpful and insightful than an abstract number.

I would be interested in hearing whether data science is primarily after descriptive data (like Twitter mood maps) or explanatory data. The McDonald’s example is interesting but what kind of research question does it answer? Chen mentions some more explanatory research questions he is pursuing but it seems like there is a ways to go here. I would also be interested in hearing Chen’s thoughts on how representative the data is that he typically works with. In other words, how confident are he and others that the results are generalizable beyond the population of technology users or whatever the specific sampling frame is? Can we ask and answer questions about all Americans or world residents from the data that is becoming available through new data sources?

h/t Instapundit

Looking for a new area of study? Try Twitterology

If it is in the New York Times, Twitterology must be a viable area of academic study:

Twitter is many things to many people, but lately it has been a gold mine for scholars in fields like linguistics, sociology and psychology who are looking for real-time language data to analyze.

Twitter’s appeal to researchers is its immediacy — and its immensity. Instead of relying on questionnaires and other laborious and time-consuming methods of data collection, social scientists can simply take advantage of Twitter’s stream to eavesdrop on a virtually limitless array of language in action…

One criticism of “sentiment analysis,” as such research is known, is that it takes a naïve view of emotional states, assuming that personal moods can simply be divined from word selection. This might seem particularly perilous on a medium like Twitter, where sarcasm and other playful uses of language often subvert the surface meaning…

Still, the Twitterologists will continue to have a tough row to hoe in justifying their research to those who think that Twitter is a trivial form of communication. No less a figure than Noam Chomsky has taken Twitter to task recently for its “superficiality.”

For more sociological thoughts about Chomsky’s comments, see this post from a few days ago.

Here is my quick take on Twitterology: it has some potential for gathering quick, on-the-ground information. But there are two big issues that this article doesn’t address:

1. Are Twitter users representative of the whole population? Probably not. Twitter feeds might be good for studying very specific groups and movements.

2. How can one make causal arguments with Twitter data? If we had more information about Twitter users from profiles, this might be doable but Twitter is less about Facebook-style profiles. We then need studies that collect the information about Twitter users as well as their Twitter activity. If we want to ask questions like whether Twitter was instrumental or even helped cause the Arab Spring movements, we need more data.
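A small worked example shows why both issues matter. Every number below is invented for illustration: a hypothetical sample of users that skews young, and hypothetical population shares by age group. If opinions differ by age, the raw sample average is misleading. Reweighting each group to its population share (post-stratification) corrects this, but only when we actually know which group each user belongs to, which is the profile information Twitter data often lacks.

```python
# Hypothetical population shares by age group (illustrative only).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Hypothetical sample: age group -> (users sampled, users expressing opinion X).
sample = {"18-29": (600, 420), "30-49": (300, 150), "50+": (100, 30)}


def raw_estimate(sample):
    """Share expressing opinion X, treating the sample as representative."""
    total = sum(n for n, _ in sample.values())
    expressing = sum(x for _, x in sample.values())
    return expressing / total


def poststratified_estimate(sample, population_share):
    """Weight each group's rate by its share of the population."""
    return sum(
        population_share[group] * (x / n) for group, (n, x) in sample.items()
    )


raw = raw_estimate(sample)
weighted = poststratified_estimate(sample, population_share)
print(f"raw sample estimate: {raw:.2f}")
print(f"post-stratified estimate: {weighted:.2f}")
```

Even the weighted figure only adjusts for the one characteristic we measured. It does not account for other ways users may differ from non-users, and it says nothing about causation.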

Twitterology may be trendy at the moment but I think it has a ways to go before we can use it to tackle typical questions that sociologists ask.