Data science demands elastic infrastructure


As companies struggle to make sense of their increasingly big data, they're laboring to navigate the morass of technologies necessary to become successful. However, many will remain stymied, because they keep trying to fit a necessarily fluid process of asking questions of one's data into outmoded, rigid data infrastructure.

Or as Amazon Web Services (AWS) data science chief Matt Wood tells it, they need the cloud.

While the cloud isn't a panacea, its elasticity may well prove to be the essential ingredient to big data success.

How much cloud do I need?

The problem with trying to run big data projects within a data center revolves around rigidity. As Matt Wood told me in a recent interview, this problem "is not so much about absolute scale of data but rather relative scale of data."

In other words, as a company's data volume takes a step function up or down, enterprise infrastructure can't keep up. In his words, "Customers will tool for the scale they're currently experiencing," which is great... until it's not.

In a separate conversation, he elaborates:

"Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on. You need an environment that is flexible and allows you to quickly respond to changing big data requirements. Your resource mix is continually evolving--if you buy infrastructure, it's almost immediately irrelevant to your business because it's frozen in time. It's solving a problem you may not have or care about any more."

Success in big data depends upon iteration, upon experimentation as you try to figure out the right questions to ask and the best way to answer them. This is hard when dealing with a calcified infrastructure.

A eulogy for the data center?

Of course, it's not quite so simple as "all cloud, all the time."

Data, it would seem, has to obey fundamental laws of gravity, as Basho CTO Dave McCrory told TechRepublic in an interview:

"Big data workloads will live in large data centers where they are most advantaged. Why will they live in specific places? Because data attracts data.

"If I already have a large quantity of data in a specific cloud, I'm going to be inclined to store additional quantities of large data in the same place. As I do this and add workloads that interact with this data, more data will be created."

Over time, enterprises will look to the public cloud for all the reasons Wood describes, but legacy data is unlikely to make the migration. Most of the time, there's simply no reason to try to house old data in new infrastructure.

But some companies will find that they're more comfortable with existing data centers and will eschew the cloud. I'm not talking about hidebound enterprise curmudgeons who shout "Phooey!" every time AWS is mentioned, either. No, sometimes the most data center-centric of companies will be innovators like Etsy.

As Etsy CTO Kellan Elliott-McCrea informed TechRepublic, once Etsy had "gained confidence" in its ability to manage its Hadoop clusters (and other technology), it brought them in-house, netting a 10X increase in utilization and "very real cost savings."

Nor is Etsy alone. Other new-school web companies like Twitter have opted to run their own data centers, finding that this gives them greater control over their data.

You're no Twitter

As highly as you may estimate your abilities, the reality is that you're probably not an Etsy, Twitter, or Google. As painful as it is to say it, most of us are average. By definition.

This was Microsoft's great genius: rather than cater to the Übermensch of IT, Microsoft lowered the bar to becoming productive as a system administrator, developer, etc. In the process, Microsoft banked billions in profits by helping make a good sysadmin better or a decent developer good.

Regardless, all enterprises need to establish infrastructure that helps them to iterate. Some, like Etsy, may have figured out how to do this in their data centers--but for most of us, most of the time, Wood's advice rings true: "You need an environment that is flexible and allows you to quickly respond to changing big data requirements."

In other words, odds are that you're going to need the cloud.
