What is 'Big Data,' anyway? Authors of a new book try to explain

Rob Walker

August 27, 2013 at 3:51 PM

“Big data” has become a really big buzz-phrase — tossed around in conversations about everything from business to surveillance; cited as a tool to improve driving, hiring, understanding dogs, and everything else; and, inevitably, dismissed as a bunch of hype.

But what exactly is big data, anyway? Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier, offers an answer. Their book is a wide-ranging assessment of “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value.” And while they acknowledge that the term itself has become amorphous, they frame their subject pretty clearly: “Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”

That (not to mention the book’s subtitle) might sound a little hype-y, but Big Data is fairly even-handed: Early chapters explore the hope and potential around the way massive information sets are being created and mined, but later ones are clear about risks, pitfalls, and dangers. Mayer-Schönberger is Professor of Internet Governance and Regulation at the Oxford Internet Institute / Oxford University; Cukier is “data editor” for The Economist. Their book raised a few questions for me — so I asked the authors. Here’s what they said.

I'd like to start toward the end: One of your later chapters examines "the dark side of big data," and among other things you note concerns about privacy and the possibility of using "big-data predictions" to in effect penalize people for behavior they seem likely to engage in, but haven't. You even mention the NSA at one point. So I wonder what you've made of the debate about more recent surveillance revelations related to the agency: There's a lot of focus on the collection of the data, for instance, but should we be talking about how it's analyzed?

Kenneth Cukier: The question draws an excellent distinction — one that's sadly missing from the debate. The disclosures have been mostly about the collection and not the use of the data. And when intelligence agencies explain how they work with the data, the method seems oddly old-school: targeted surveillance, not too different from the days of alligator-clips atop copper wires. Of course we're probably not told the whole story and they're actually running massive statistical regressions across all the data to hunt for patterns that they didn't know to look for in advance. That's what Facebook and LinkedIn data-scientists would do with it. But we haven't yet seen evidence that this is what the NSA is doing.

That said, the collection alone is troubling because it is happening with insufficient oversight. And the goal of intelligence is to prevent bad things from happening — it's about prediction. As we lay out in the book, this may be troubling when people are penalized for what they only have a propensity to do, not for what they've done. So we have to be very careful using this ability, as it improves to the degree that it becomes more established.

You make a compelling case about the limitations of sampling (as opposed to more comprehensive big data approaches) and how we've come to accept it perhaps more than we should. But among the examples you mention is voter intent. It's not like there's a comprehensive database of who everyone intends to vote for, is there? How does big data actually provide an alternative here? Isn't there a distinction between what we want to measure and what we can measure?

Cukier: Actually, there is a database of every voter and their intentions. Both major parties contract with different data providers that are loosely affiliated with the parties, to tap databases of all Americans. The first variable is if the person is registered to vote and if he or she actually cast a ballot in the most recent election. The Democrats in 2012 had an internal database of every voter in America and asked three questions of it: Do you support Obama; are you likely to vote; and if you are undecided, are you persuadable? By ranking people based on that last measure, the Dems could know where to best spend their advertising budget for maximum impact.

Big data was critical: sampling works well for basic questions like what candidate a person supports. But it's less useful when you want to drill down into the granular — like what candidate Asian-American women with college degrees support. To do that, you may need to give up your sample and go for it all.

Yet the broader point is correct: there is a difference between what we want to measure and what we can measure. And we need to be on guard that we don't confuse the two. For example, in the Vietnam War, the Pentagon used the metric of the body count as a way to measure progress, when that data wasn't really meaningful to what they wanted to depict. Sadly, I fret this fallibility is something that we'll just have to learn to live with, as we have in so many other domains.

Many of your examples involve scrutinizing data that already exists (including instances where it's mined for reasons that have nothing to do with why it was gathered), but I was very interested to learn about "datafication" that involves setting out to collect new information in new ways: For instance, UPS "datifying" its vehicle fleet by gathering mechanical information that predicts and minimizes breakdowns. This almost seems like a distinct category to me. Do you think of it as a fundamentally different form of big data?

Viktor Mayer-Schönberger: It is tempting to be dazzled by the many new types of data that are being collected — from engine sensors in UPS vehicles, to heart rates in premature babies, to human posture. But that is how datafication works in practice: at first we think it is impossible to render something in data form, then somebody comes up with a nifty and cost-efficient idea to do so, and we are amazed by the applications that this will enable, and then we come to accept it as the new normal. A few years ago, this happened with geo-location data, and before it was with web browsing data (gleaned through cookies). It is a sign of the continuing progress of datafication.

You're right that datafication is fundamentally different than big data. For example, the 19th century American navigator Commodore Maury, who invented tidal maps, datafied the logbooks of past sea voyages by extracting information about the wind and waves at a given location. But we can get the most out of big data today because so many new elements of our lives are being rendered into a data form, which was extremely hard to do in the past.

You emphasize that making the most of big data means we have to "shed some of [our] obsession for causality in exchange for simple correlations: not knowing why but only what." This means breaking from the tradition of coming up with a hypothesis and testing it: It doesn't matter whether we can explain a correlation that big data reveals, we should just act on it. That's a big shift! I'm curious whether, when you're out talking about the book, you get a lot of resistance to that idea, because it seems crucial to what you call the "big data mindset."

Mayer-Schönberger: Yes, we do encounter resistance on this point, but intriguingly, it's rarely from the real experts in their field. They often know how tentative their causal conclusions are, or how much they are actually based on correlations rather than truly comprehending the exact causality of things. Also, we often get mischaracterized as either suggesting that theories don't matter or causality is not important. We don't argue either. In fact, theories will continue to matter very much, but the concrete hypothesis derived from a theory less so.

Take Google Flu Trends. The theory that what people search for could correlate with human health in a given location was crucial for Google Flu Trends to happen. But none of Google's engineers could ever have guessed the exact hypothesis to test — that is, the exact search terms that best predict the spread of the flu. After all, the company handles around 3 billion searches every day. So big data analysis did that for them.

Causal connections are really valuable where and if one can find them. But looking for them at great cost and coming up empty is less useful, we suggest, than looking for correlations — not least because such correlations can help identify what potential connections between two phenomena should be investigated for a possible causal link. In that very sense, big data analysis actually helps causal investigations as well.

Finally, I was struck by how many examples in the book involved businesses that have amassed incredible data sets and learned to use them to boost sales or improve marketing. You have the story of how Wal Mart mined its past data and figured out that people preparing for a hurricane by purchasing flashlights and the like also tended to buy Pop-Tarts — so it put Pop-Tarts at the front of the store during hurricane season, and sales increased. Is there any concern about how much big data is in effect owned by business, and deployed largely in the service of the profit motive? I think one thing that makes people nervous about the big data idea is that it's so often opaque. But do the benefits outweigh those concerns? Should we stop worrying and just be thankful for the conveniently placed Pop-Tarts?

Mayer-Schönberger: There is a value in having conveniently placed Pop-Tarts, and it isn't just that Wal Mart is making more money. It is also that shoppers more quickly find what they are likely looking for. Sometimes big data gets badly mischaracterized as just a tool to create more targeted advertising online. But UPS uses big data to save millions of gallons of fuel — and thus improve both its bottom line and the environment. Google aiding public health agencies in predicting the spread of the flu, or Decide.com helping consumers save a bundle, has nothing to do with targeted advertising, and both create positive effects beyond a single company's quarterly profit. We need to cast our gaze wider when we want to understand big data's upside (and incidentally, also its "dark sides").

My thanks to Mayer-Schönberger and Cukier for taking the time to answer these questions. Their book is: Big Data: A Revolution That Will Transform How We Live, Work, and Think.
