Coolan leverages crowdsourcing to predict server failures

Figure A: An illustration of how Coolan presents component failure-rate information.

 Image: Coolan

Enterprise data-center managers are squeezing every penny out of their operations. To that end, operators are even building custom equipment and creating proprietary operating platforms. As expected, the effort has reaped tangible benefits: power usage effectiveness (PUE) ratings and operating costs at enterprise data centers continue to drop.

That's great. However, lest we forget, there are far more small to mid-sized data centers than monster enterprise versions. Managers of those data centers will tell you there's zero money to develop custom servers; besides, colocation data centers most likely have customer equipment to contend with. As for specialized control and management systems, that's not even a consideration.

This chasm has not gone unnoticed. A group with extensive experience in data-center infrastructure and controls decided to bridge it. Data-center gurus Amir Michael and Jonathan Heiliger (both former Facebook employees) and Amir's brother Yoni Michael of Practice Fusion started collaborating in 2013 with the thought of developing solutions for smaller commercial data centers.

"We bring to this effort our experience from building next-generation hardware at Facebook and Google," mentions Amir Michael on the team's website. "And from sharing those designs via the Open Compute Project, a community-based initiative that seeks to build more efficient and sustainable infrastructure technology."

A year ago, things got serious with the trio receiving seed money from Social + Capital Partnership, North Bridge Venture Partners, and Keshif Ventures.

Coolan

The team recently introduced their first product: Coolan. Amir Michael adds, "Coolan's analytics solution provides visibility into the performance of your data-center environment, and delivers insights that help reduce downtime and lower the cost of infrastructure."

Coolan focuses on server-failure prediction, which is critical in today's market: to stay competitive, commercial data centers are guaranteeing 100% availability and paying stiff penalties when a client's uptime is anything less. "We've worked with small startups and web-scale companies," explains Amir Michael. "And universally, we've heard from DevOps engineers, data-center managers, and CIOs that they want more insight into how their hardware is performing."

Knowing that, the Michael brothers and Heiliger decided that Coolan should be able to answer the following questions:

  • Why is a server failing?

  • How often does a server fail, and can the next failure be prevented?

  • How can a server's configuration be optimized?

  • What is the current hardware inventory?

  • Was the right equipment purchased?

How Coolan works

The team developed a web dashboard, an API, and an innocuous piece of software that gets added to the customer's servers. The server software collects data on performance. "It digs deep and analyzes thousands of operational variables, from how many hours a component has been in operation to how many bit errors have been generated by memory, and our platform evaluates the server for stability and efficiency," mentions Amir Michael.

If that doesn't sound unique, remember that unlike brand-name servers with their proprietary server-optimization programs, Coolan is brand agnostic. Besides, Coolan does something that no one else does: It leverages crowdsourcing to gather a larger pool of data to increase prediction accuracy. Figure A above illustrates how Coolan presents component failure-rate information.
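
To illustrate why a larger pool of data can sharpen failure estimates, here is a minimal Python sketch that pools failure counts across customers. It is not Coolan's algorithm, which the company has not published; the customer names, component models, and counts are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-customer reports, invented for this example:
# (component model, device-years observed, failures observed)
reports = {
    "customer_a": [("disk-model-x", 40.0, 2), ("dimm-model-y", 80.0, 1)],
    "customer_b": [("disk-model-x", 4000.0, 120), ("dimm-model-y", 9000.0, 45)],
}


def pooled_failure_rates(reports):
    """Return failures per device-year for each model, pooled across customers."""
    exposure = defaultdict(float)  # total device-years per model
    failures = defaultdict(int)    # total failures per model
    for samples in reports.values():
        for model, device_years, failed in samples:
            exposure[model] += device_years
            failures[model] += failed
    # Skip models with no observed exposure to avoid dividing by zero.
    return {m: failures[m] / exposure[m] for m in exposure if exposure[m] > 0}


rates = pooled_failure_rates(reports)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.4f} failures per device-year")
```

With these invented numbers, the small fleet alone has only a handful of failures to go on, while the pooled estimate draws on thousands of device-years.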

"Coolan's algorithms analyze the collective configuration, failure, and event data from our customer base," notes Amir Michael. "So we can empower each customer -- whether they have ten servers or hundreds of thousands -- with insights gleaned from a much larger data set."

Coolan reports and recommendations

After analyzing data from a client, Coolan melds the client information with crowdsourced data, ultimately creating benchmarking reports and real-time recommendations tailored for the client. The Michael brothers and Heiliger feel this proactive information will allow clients to shift from "wait and see" mode to anticipating and negating unexpected hardware issues. The Coolan blog mentions, "With the typical annual failure rate of a server at seven percent, even an incremental improvement can translate to a cost savings from thousands to millions of dollars, depending on a customer's server fleet."
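
The savings claim in that quote can be checked with simple arithmetic. The Python sketch below uses the seven percent annual failure rate from the quote; the fleet size, the improved rate, and the cost per failure are assumptions chosen only to show the calculation.

```python
def expected_annual_failures(fleet_size, annual_failure_rate):
    """Expected number of server failures per year for a fleet."""
    return fleet_size * annual_failure_rate


def annual_savings(fleet_size, baseline_afr, improved_afr, cost_per_failure):
    """Cost avoided per year when the failure rate drops from baseline to improved."""
    avoided = (expected_annual_failures(fleet_size, baseline_afr)
               - expected_annual_failures(fleet_size, improved_afr))
    return avoided * cost_per_failure


# Assumed inputs: 10,000 servers, a one-point improvement (7% to 6%),
# and $2,000 per failure. Only the 7% baseline comes from the article.
savings = annual_savings(10_000, 0.07, 0.06, 2_000.0)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Under those assumptions, about 100 failures are avoided per year, or roughly $200,000. A fleet of ten servers would see almost nothing, and a fleet of hundreds of thousands would see millions, which matches the range in the quote.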

Managers appreciate advanced notification

The colleagues I asked who run data centers all agree that something like Coolan would be useful, provided the costs align with their budgets and they can get upper management to see the benefit.

If the success of Amir Michael and Heiliger at Facebook is any indication, Coolan should keep the antacid tablets in the bottle for many a data-center manager.

Coolan is in private beta. If you are interested, visit the website and join the pilot program.
