Big data, it has been said, is making science obsolete. No longer do we need theories of genetics or linguistics or sociology, Wired editor Chris Anderson wrote in a manifesto four years ago: “With enough data, the numbers speak for themselves.”…
There are echoes here of a centuries-old debate, unleashed in the 1600s by protoscientist Sir Francis Bacon, over whether deduction from first principles or induction from observed reality is the best way to get at truth. In the 1930s, philosopher Karl Popper proposed a synthesis, in which the only scientific approach was to formulate hypotheses (using deduction, induction, or both) that were falsifiable. That is, they generated predictions that — if they failed to pan out — disproved the hypothesis.
Actual scientific practice is more complicated than that. But the element of hypothesis/prediction remains important, not just to science but to the pursuit of knowledge in general. We humans are quite capable of coming up with stories to explain just about anything after the fact. It’s only by trying to come up with our stories beforehand, then testing them, that we can reliably learn the lessons of our experiences — and our data. In the big-data era, those hypotheses can often be bare-bones and fleeting, but they’re still always there, whether we acknowledge them or not.
“The numbers have no way of speaking for themselves,” political forecaster Nate Silver writes, in response to Chris Anderson, near the beginning of his wonderful new doorstopper of a book, The Signal and the Noise: Why So Many Predictions Fail — But Some Don’t. “We speak for them.”
These days, finding and examining data is much easier than before, but we still have to interpret what the numbers mean. Observing a relationship between variables doesn’t necessarily tell us something valuable. We also want to know why the variables are related, and this is where hypotheses come in. Careful hypothesis testing lets us rule out spurious associations, identify other variables that may be producing the observed relationship, and isolate the influence of one variable on another while controlling for other factors (the essence of regression). More complex models let us see how a variety of variables affect each other at the same time.
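To make the point about controlling for other factors concrete, here is a minimal sketch in Python using simulated data. Everything in it is an assumption made for illustration: the variables `x`, `y`, and `z` are invented, and `z` is constructed to drive both `x` and `y`, while `x` has no effect on `y` at all. The raw correlation between `x` and `y` still looks substantial, and it largely disappears once `z` is held constant.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(xs, ys):
    """The part of ys that a straight-line fit on xs does not explain."""
    slope, intercept = fit_line(xs, ys)
    return [b - (intercept + slope * a) for a, b in zip(xs, ys)]


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


# Simulated data (an assumption for illustration, not real observations):
# z is a hidden common cause; x and y each depend on z plus independent noise.
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

# Raw association between x and y, ignoring z.
raw = correlation(x, y)

# Association between x and y after removing what z explains in each
# (the partial correlation of x and y given z).
controlled = correlation(residuals(z, x), residuals(z, y))

print(f"correlation of x and y:               {raw:.2f}")
print(f"correlation after controlling for z:  {controlled:.2f}")
```

The numbers alone would suggest that `x` and `y` are linked. Only the hypothesis that a third variable might be behind both, followed by a test of that hypothesis, shows the association to be spurious.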
At the opposite end of the scientific process from hypotheses, using findings to create and implement policies also requires thinking. Once we have established that relationships likely exist, it takes even more work to respond to them in useful and effective ways.