Apr 1, 2015

Data science still woefully short on science | Matt Asay

Data science is all about asking the right questions. The problem, however, is that we introduce bias into data science simply by asking the question and establishing which data we will collect.
As enterprises get serious about big data, they need to get equally serious about using data to challenge corporate assumptions and not merely confirm them. It's a big step from the mythology masquerading as data science today, but it's the only way to get to the truth our data may be able to tell us.

Inescapable bias

Gartner analyst Svetlana Sicular puts the point bluntly: "That's a hint to the glorified business analysts in California who say in a mystifying voice, 'Data will tell you.' Yes, it will -- if you know what to ask."
Asking good questions is the heart of data science and, indeed, all science. But the problem is that the very questions we ask bias the answers--or rather, the data behind those answers.
Noted statistician Nate Silver calls out this bias, arguing:
"[Big Data] is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson...wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method....
"[T]hese views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning....[W]e may construe them in self-serving ways that are detached from their objective reality."
Echoing this theme, Gartner analyst Mark Beyer insists that "ALL data is biased toward the creator." But is this a bad thing? He continues:
"[E]ach new data point reflects the intent of the business process designer. This means it is not possible to actually assemble new analytics from existing data. Even more concerning, it is not possible to alter analysis with new data--you can only reinforce or refute the expected analysis....Human logic is inherently embedded in all business processes and all data capture is biased toward the expected outcome of those processes."
In other words, big data can't replace human intuition: it reflects it.

A return to science

But Beyer and Silver aren't arguing that we should throw up our hands in despair. Silver, for example, suggests that bias is primarily a problem when we attempt to ignore it: "Data-driven predictions can succeed--and they can fail. It is when we deny our role in the process that the odds of failure rise."
Beyer goes a bit further, offering a mechanism for combating bias. It's called science:
"[W]hat business strategists need is 'real' data science. Real data science is the practice of building out competing interpretations of data, many multi-layered analytic theorems that intentionally challenge the inferences used by the others."
It is this systematic challenging that creates real science. Decades ago, Karl Popper offered a clear way to distinguish real science from pseudo-science. He called it "falsifiability": a statement is scientific only if some possible observation could show it to be false. Popper argued that while no number of confirming observations can prove a universal statement true, a single counterexample can show it to be false.
Popper thus stated that real science concerns itself with trying to prove statements ("All swans are white") false, rather than true. One black swan refutes the claim, however many white swans came before it. Those statements that hold up to rigorous attempts at falsification are presumed to be provisionally true, which is the best science can hope to deliver.
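For readers who think in code, the swan test can be sketched in a few lines of Python. This is only an illustration, not something Popper or Beyer wrote: the function names and the list of swan observations are invented for the example. It shows the asymmetry at work, where any number of white swans leaves the claim merely "provisionally true" and one black swan falsifies it.

```python
from typing import Iterable, Optional


def find_counterexample(observations: Iterable[str],
                        claimed_color: str = "white") -> Optional[str]:
    """Return the first observed color that contradicts the claim, or None."""
    for color in observations:
        if color != claimed_color:
            return color
    return None


def status(observations: Iterable[str]) -> str:
    """Report where the claim 'All swans are white' stands (hypothetical helper)."""
    counterexample = find_counterexample(observations)
    if counterexample is None:
        # No refutation found yet: the claim survives, but it is not proven.
        return "provisionally true"
    return f"falsified by a {counterexample} swan"


# Made-up observations, for illustration only.
observed_swans = ["white", "white", "white"]
print(status(observed_swans))

observed_swans.append("black")
print(status(observed_swans))
```

Note that the code never returns "proven": the strongest verdict available is that the claim has not been refuted yet.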
Back to Beyer. His suggestion that we set up "multi-layered analytic theorems that intentionally challenge the inferences" used to determine which data to gather feels like a solid start. It is the challenge--the effort to falsify our assumptions--that makes it real data science and not merely data fiction.
Sicular notes that pseudo data science has mostly "mystified, glorified and made mathematics sexy," but real data science promises to do much more. It's hard work, but that's the point.
