
focus

Dialogue | Mar/May 2014

The real business of data science
Will data science become necessary in all businesses, like marketing or accounting departments? Neal Patel investigates
Illustration: Andy Potts


There was a time when you could go to cocktail parties and use the words “big data” with careless abandon. No one laughed uncomfortably. There were no unpleasant pauses. No one gave you those derisive looks we normally reserve for the guy who turns up at the show wearing the headlining band’s T-shirt. But today, it seems like those of us who work at the intersection of computer and social science are eager to distance ourselves from terms like “big data”, “analytics” and even “data science”. In fact, every time I hire a new lab member, we have an awkward conversation in which they delicately ask permission to switch their job title from “data scientist” to something else.

What everyone appears to be thinking, but doesn’t want to say out loud, is that big data is massively overhyped and is increasingly embarrassing to legitimate scientists. First, in practice, most of what passes for “data science” is not all that scientific, and is done without any empirical rigour. Second, “big data” is inherently limited in what it can do – there are forms of consumer knowledge that remain beyond its reach. Nevertheless, big data continues to be hyped because the intimidating scale and technology make it easy to sell “truthy” tidbits to neophytes who don’t know any better. Indeed, the prevalence of “analytics” and the big data shovel routinely displaces other, less popular research methods more appropriately tuned to uncovering the insights businesses are actually looking for.



In practice, “big data” lacks rigour
Practically speaking, most companies turn to big data to better understand and market products or services to their customers. Unfortunately, they are likely to encounter a range of pseudo-scientific gimmicks masquerading as research before they arrive at anything resembling the truth.

The first excursion is typically with “analytics”. Corporate leaders notice how successfully some systems predict consumer preferences, and assume the same can be done with consumer insights. They want apps which mine Twitter posts; they want to know what’s “trending” on Facebook; they believe a trending Tweet puts them “inside” the mind of their customers.

But the problem with these sorts of analytics is that they don’t get inside the mind of consumers; they measure an outcome behaviour (in this case, Tweeting) driven by what’s going on inside a given customer’s head. The problem with reporting observed outcomes is that they are rarely directional. It’s like trying to understand a cause-and-effect relationship without knowing the cause. This is precisely where we get things like alchemy and bloodletting. In the Middle Ages, barbers observed that people were more calm and relaxed when they lost blood. While relaxation is an outcome of exsanguination, it is because the body is weakened by blood loss and, for all intents and purposes, is preparing to die! Therefore, taking guidance from a Tweet without understanding why that person tweeted is “soothsaying” in the truest sense: it invites assumptions which cannot be verified.

Seldom do these methods successfully link informational or behavioural exchange (tweets, social contacts, etc) with revenue, profit, loss, growth or any corresponding strategic guidance. Meanwhile, “analytics” relies on a host of pseudo-metrics such as “item trend velocity” and “Tweet volume”, which create a false sense of precision.

Several months ago, I met with a prominent firm pitching a self-described “integrated strategy, technology, and marketing” solution. Out of a team of six, no one – not a single soul – could describe either the monetary or strategic value of a “trending” item on Twitter; or what action should be taken when content is sufficiently “Liked” on Facebook; or whether increased “trend velocity” is good or bad.

As a matter of fact, the team claimed to have devised a six-million-dollar “real-time rapid-response framework” driven by “social media analytics”. After a few tough questions, the “rapid-response framework” appeared more like a person watching TV while Tweeting. Then Tweeting during the commercials. Then maybe monitoring Bottlenose and producing qualitative reports. Setting aside the obvious scientific problem of influencing the outcome variable (a term most commonly used in correlation and regression designs, in which cause-and-effect relationships cannot be demonstrated), this fails to provide a detailed understanding of how consumers think or make sense of the world. This is not consumer insight. This is snake oil.

But many companies have a taste for snake oil. Indeed, because computational methods driven by “big data” work convincingly well when applied to search and ads, practically anything that can be called “data science” enjoys similar credibility – whether scientific or not. Unfortunately, because “data science” merely refers to the practice of extracting generalizable knowledge from data, it can mean anything. The choice between deeper analytic engagement with customers and the previous example is a choice between treating “data science” as “science” or discarding empirical rigour altogether.

Credit companies, for example, employ “data scientists” to learn about their customers. Visa famously disavows being able to predict whether a person is involved in divorce based on their credit records. Is this science? On the one hand, the correlation, in and of itself, might be sufficient for Visa’s hypothetical purposes – after all, they should be primarily interested in their own customers. On the other hand, the results are not intended for application without limit. However, whenever a data scientist “discovers” a thought-provoking correlation – between, say, marital status and credit records – the subsequent public discourse assumes it applies to everyone. This, in turn, distorts the scientific merit of the “discovery” by stretching its claims beyond the supporting evidence. Generalization to the public on the basis of one statistically valid association is like releasing a drug to the public on the basis of a single clinical trial or, for that matter, simply discovering that it works for the very first time.

Indeed, if we treated what passes for “data science” as rigorously as “real science”, we might decide that simply finding an association is only the beginning, not the end, of a discovery. This goes beyond the mere distinction between correlation and causation, although it certainly applies. Rather, there is an underlying social process or cultural framework accounting for the correlation which requires further investigation, in the same way that the demonstrated effect of a drug requires a detailed explanation of its function within the human body.




This explanation is not something readily understood or captured by the methods that “discovered” the association. Instead, cultural and phenomenological methods must decode the underlying framework of shared assumptions, taboos and structures of meaning that explain the existence of the observed correlation. Just because something correlates doesn’t mean we can decide which caused which or, indeed, if either caused the other at all.

Big data has limits
One of the reasons the abuse of science is so prevalent in big data is that it allows less scrupulous researchers to avoid having to acknowledge the limits of what big data can do. At the recent Consumer Electronics Show, Yahoo! CEO Marissa Mayer proclaimed: “The future of search is contextual knowledge”. By “contextual”, Mayer means using cues from an individual’s previous online activity to guess what they might mean when they mistype search terms, or the next album they’ll want to purchase. Big data originated in search, where it is a proven solution to problems like contextual search. Imagine how a search engine works. Billions of bits of data from every website must be “read” in some way and then organized so they can be referred to later. This task alone requires a database so massive and complex that it exceeds the computational ability of any individual computer, necessitating a “cloud” of computers acting in concert.
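To make that “read and organize” step concrete, here is a minimal sketch of a toy inverted index in Python. It is purely illustrative: the two-page “web”, the URLs and the single-machine design are invented for the example, and a real engine shards a structure like this across the “cloud” of machines described above.

```python
from collections import defaultdict

def build_index(pages):
    """Map each term to the set of pages that contain it (a toy inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return pages containing every query term (simple AND semantics)."""
    terms = query.lower().split()
    results = set(index.get(terms[0], set())) if terms else set()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical two-page "web", for illustration only
pages = {
    "example.com/a": "big data has limits",
    "example.com/b": "the real business of data science",
}
index = build_index(pages)
print(search(index, "data limits"))  # {'example.com/a'}
```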

Compare this to a conventional survey dataset. It may contain tens of thousands of observations, but a cloud-enabled survey could solicit a random sample every day, for months. Before long, the observation count quickly grows from thousands to millions. Big data delivers impossibly comprehensive datasets – in some cases, population-level data to rival the US Census.

At a very basic level, contextual search determines what else consumers want, based on the content they already click on. Imagine that each click is merely one in a cascading set of pairwise choices: choice (a) versus (b), choice (c) versus (b), (a) over (c), and so on. These choices eventually form a hierarchy. After thousands of iterations, it becomes possible to calculate the probability of selecting choice (a) versus any other option in the set. A massive computational infrastructure powers this algorithm, arriving at a top set of preferences for every individual who visits a site.
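As a rough illustration of that pairwise logic – not any particular company’s algorithm – the sketch below takes hypothetical click records, estimates how often each option wins whenever it is offered, and ranks the options accordingly.

```python
from collections import Counter

# Hypothetical impressions: (option shown, alternative shown, option clicked)
observations = [
    ("a", "b", "a"), ("a", "b", "a"), ("a", "c", "c"),
    ("b", "c", "b"), ("a", "b", "b"), ("a", "c", "a"),
]

wins = Counter()
offers = Counter()
for left, right, clicked in observations:
    wins[clicked] += 1
    offers[left] += 1
    offers[right] += 1

# Empirical probability that an option is chosen whenever it is offered
preference = {option: wins[option] / offers[option] for option in offers}
ranking = sorted(preference, key=preference.get, reverse=True)
print(preference)  # {'a': 0.6, 'b': 0.5, 'c': 0.33...}
print(ranking)     # ['a', 'b', 'c']
```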

This is an exceptionally simple version of the algorithms internet companies use, but the advantages are straightforward. Whereas a randomly placed ad is essentially a wager, an ad targeted at click-through behaviour is a data-driven decision. Indeed, the same system makes it possible to compare revenue from targeted ads with revenue from other advertising streams. Finally, costly marketing research can be reduced to simple A/B testing, a practice known as “optimization”.
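A minimal sketch of the kind of A/B comparison described above, using a standard two-proportion z-test from the Python standard library; the click counts and the 5% threshold are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: does variant B's click-through rate differ from A's?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, p_value

# Hypothetical campaign: randomly placed ads (A) vs click-targeted ads (B)
ctr_a, ctr_b, p_value = ab_test(clicks_a=120, views_a=10_000, clicks_b=168, views_b=10_000)
print(f"CTR A={ctr_a:.2%}, CTR B={ctr_b:.2%}, p={p_value:.4f}")
if p_value < 0.05:
    print("Targeted placement outperforms random placement at the 5% level")
```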

The system works well until we begin to consider “context” more rigorously. Even a perfectly accurate system – and I’ve seen contextual search platforms ranging from eerily accurate to comically non sequitur – fails to answer the simplest, most fundamental question: why?

“Contextual knowledge” is important because it provides a framework for understanding subjective mental order – the choices, biases, predilections and assumptions that organize our comprehension of reality. A hypothetically perfect contextual search solution generates, at best, the outcome of this mental order, an articulated preference for one choice over another. But there is tremendous business value in understanding the mental order driving those outcomes: understanding why people make the choices that they do.

A company which understands how its consumers think and make sense of their world speaks to them in a shared language. It is the difference between the short margins of selling option (a) over option (b), and a deep, intersubjective connection between a product and an individual – one in which consumers believe the brand is an authentic expression of themselves.

In other words, despite “big” data, some things remain unmeasurable, especially because most big data is completely observational. It can neither confirm causation the way manmade experiments do, nor explain the mental framework which drives decision-making. This is the domain of qualitative and theoretical methods, which social scientists have been using to mine mental life since 1906. In The Philosophy of Money, founding sociologist Georg Simmel distinguishes between “objective” knowledge, which can be measured and quantified; and abstract, “subjective” knowledge concerned with “those questions… that we have so far been unable either to answer or dismiss” (Simmel, 1978).

By “subjective”, Simmel refers to both inner mental experience and fundamental philosophical questions about the origins of things, neither of which readily lends itself to a quantifiable solution. Yet both are integral to understanding social phenomena. “Even the empirical in its perfected state,” Simmel (1978) argues, “might no more replace philosophy as an interpretation… than would the perfection of mechanical reproduction of phenomena make the visual arts superfluous.”



Thus, extending the example of credit card records, discovering a correlation to marital status should compel an investigation of the underlying values, choices and cultural rules embedded in those spending habits – the truest form of contextual knowledge, the reasons why. Yet these deeper scientific questions are routinely overlooked in favour of the surface correlation. It is no wonder the label “data science” is passé among those who analyze big data for a living – for true computational social scientists, observing the frenzy around pithy “data science” correlations in public discourse is like an electrical engineer watching an electromagnetic field detector in the hands of a “ghost” hunter.

Big data done the right way
There are, of course, examples of data science which reach the highest standards of scientific rigour. Sandy Pentland’s (MIT Media Lab) recent investigation of what makes teams “click”, for instance, deployed 2,500 sociometric badges collecting vast amounts of sensor and proximity data among sales and support teams. Pentland and his team discovered that they could accurately predict each team’s success based on their communication pattern data, without even meeting the teams’ members.

Were this the typical shiny object served up by “data science”, the discussion would have ended there. However, Pentland and his team conducted a deep investigation into the underlying framework for social interaction which explained this result – identifying three key communication dynamics which influence performance, and experimentally demonstrating improvement among teams that map their communication dynamics to the ideal pattern. As a result, Pentland not only made a significant scientific discovery, but generated clear business recommendations about what teams should do to succeed. According to Pentland, stellar teams hold frequent informal meetings, or “asides”, and ensure everyone speaks and listens in equal measure. In determining why, the MIT researchers transformed a curious insight into concrete guidance for businesses interested in higher team productivity.

What’s next
The question preoccupying individuals in my profession is whether high or low scientific standards will prevail in the world of “big data” – or will “data science” and “analytics” become the next “focus group”, or something equally discredited?

Indeed, as the set of practices referred to as “big data” and “data science” evolves into a professional institution, it will have to choose between the high standards of Sandy Pentland’s work and the pseudo-scientific slapstick of the next big “integrated technology, strategy and marketing” solution.

Today data scientists are born “by accident”, when a social science PhD picks up programming skills, or a computer scientist develops an intellectual fascination with human behaviour. The next generation will train in “data science” degree programmes; indeed, a number of universities have already started data science Masters programmes. Today data scientists define their own positions within organizations. Tomorrow, there will be official data science departments. The level of scientific rigour these programmes instil in future students, and what prospective employers expect of them, will depend on the prevailing professional standards of our day.

Unfortunately, industrial competition does not always select for the fittest institutional structures. As Neil Fligstein’s study of the savings and loan industry suggests, firms operating in “emerging” industries can behave mimetically – that is, they simply copy what other firms do, because it seems to work. And this is the greatest danger posed by the fact that “big data” works too well: the promise of immediate reward is more attractive than the more difficult engagement with context, with why – although the latter offers the greater long-term strategic advantage.

For those of us who work with big data every day, the challenge is ultimately ours. In the long run, will data science become a necessary competency of all businesses (like marketing, IT or accounting)? A skill everyone applies in their job? Or will it become the province of outside consultants who operate on demand? Ultimately, each of us must decide whether to embrace computational methods as true scientists, to pursue an engagement with why, or to gorge ourselves on the golden goose of pseudo “data science”.

● Neal Patel is technical program lead, Advanced Technology and Projects (ATAP) at Google Inc

Further reading

How Visa Predicts Divorce, The Daily Beast, Nicholas Ciarelli (2010)

Yahoo Just Acquired a New Search Product That Could Hurt Google, Business Insider, Jim Edwards (2014)

Eric Schmidt: The Future of Magazines Is on Tablets, Mashable, Lauren Indvik (Oct 23, 2013)

The New Science of Building Great Teams, Harvard Business Review, Alex “Sandy” Pentland (April 2012)

The Philosophy of Money, Georg Simmel, David Frisby (ed.) (1978), London: Routledge
