Three reasons you need to run Spark in the cloud

This article, Three reasons you need to run Spark in the cloud, originally appeared on TechRepublic.com.


The open-source project Apache Spark is perhaps the best-known project to emerge from UC Berkeley's AMPLab. Working at the intersection of three massive trends--powerful machine learning, cloud computing, and crowdsourcing--the AMPLab is integrating algorithms, machines, and people to make sense of big data.

Originally written to extend the capabilities of another AMPLab project, Apache Mesos, Spark took off. In 2013, its co-authors founded Databricks, a startup funded by Andreessen Horowitz, to deliver Spark through a hosted cloud platform that makes it easy for data professionals to put Spark to work.

Spark is hugely appealing as an alternative to Hadoop's MapReduce for munging big data. It combines speed, an easy-to-use programming model, and a unified design that enables users to combine interactive queries, streaming analytics, machine learning, and graph computation within a single system.


Put that power in the cloud with a simple, elegant user experience, and you have a killer platform for anyone doing data exploration and building end-to-end data pipelines. Use a visual analytics application built from scratch for big data, like Zoomdata, and you have a killer value proposition for doing super fast business intelligence (BI) visual analytics.

I spoke to Arsalan Tavakoli, VP of Customer Engagement at Databricks, about how Spark-plus-analytics can be a powerful combination.

TechRepublic: Why Spark on the cloud? I can download and run Spark on premise, so why do I need to rent it from Databricks?

Tavakoli: Obviously, Spark is available as open source. Anyone can download and use it from a wide variety of vendors. But when we looked at customers whose big data projects were failing, they had three typical explanations for why.

First, infrastructure management is hard. With on-premise, you are looking at a six- to nine-month ramp to get big data infrastructure into production--sometimes more. Even if you are running it on Amazon Web Services (AWS), you have to write EC2 scripts and get DevOps people involved. It's brittle.

Remember, infrastructure is hard. And companies turn to Spark in large part because of its rapid innovation cycles. They want to get the benefits of a technology improving all the time with hundreds of people contributing. Well, that means it is also technology that moves fast. How long does it take your team to get the latest version deployed and running?

Second, once you get your Spark cluster up and running, what do you do with it? Data scientists tend to work with their favorite languages, like R and Python. Now, they have to figure out how to import their data and get a job up and running. The toolchain necessary to work with standalone Spark can be hard to use for these users. And how do you run your analytics and collaborate with your colleagues?

It's not trivial.

Third, after you have tested out your queries and models, you want to move into production--what does that process look like? In most companies, that means turning your model over to engineering, and that team goes back and re-implements what you think you want on all new infrastructure.

A cloud platform like Databricks removes these three obstacles to Spark adoption and success for your big data initiative by providing an integrated and hosted solution. We give you fully managed and tuned Spark clusters backed by the experts who created Spark. Our platform provides you with an interactive workspace to explore, visualize, collaborate, and publish. When you are ready for production, launch a job with one single click. We automatically create the infrastructure.

Additionally, we provide a rich set of APIs for programmatic access to the platform, which also enables seamless integration of third-party applications.

TechRepublic: Tell me why customers will want to do BI visualizations in the cloud. Are there particular reasons why this delivery is best suited for BI visualization?

Tavakoli: People want to use data to get insights into their business, and data engineers and data scientists are focused on delivering these insights. But unless you are an engineering-oriented company like Pinterest, Netflix, or Facebook, they're just a small part of any organization. There is a much larger user base of business analysts and end users.

Take, for example, the person in marketing who wants to slice and dice data at a high level but doesn't have technical skills. They just want to get their dashboards, or whatever, in a much more constrained decision space.

Smart companies know that they want to help their workers self-enable. That is where the role of BI visualization comes in. That's when the questions you have or want to ask are not clearly understood yet. If they were, you'd likely have a domain-specific application.

TechRepublic: So, that's why you partnered with Zoomdata? What benefits do Databricks Cloud users get with this partnership that they would not get otherwise?

Tavakoli: We have a lot of customer use case overlap with Zoomdata. Many of these organizations are the classic early adopters who rely heavily on data engineers and data scientists. All of these organizations also have a major BI warehouse component.

But the next question these companies are asking themselves is: How can I make this simpler for more users? I have all this data that I'm processing with Spark, how can I make it available to users who are not developers?

For this, a BI visualization application is perfect, and Zoomdata proved a great fit for our cloud.

TechRepublic: What are some common use cases you see around this Databricks/Zoomdata joint offering?

Tavakoli: A common one is the AdTech vertical.

AdTech companies typically have the following flow: they build up their internal database by pulling data from a wide variety of sources, which is then run through an in-depth ETL pipeline and converted into processed form.

Then, each of their customers provides data from the CRM and marketing automation systems that needs to be joined with this internal database to answer questions about the effectiveness of their campaigns. This process is handled by the data engineers and data scientists who test out in-depth theories.

On the other hand, data analysts and product managers want to ask higher-level questions, such as what feature in a product is most effective, or they want to know how a mobile ad performed. This is a class of users much more comfortable going through a BI interface like Zoomdata.
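The AdTech flow described above (processed internal data joined with customer-supplied CRM data to measure campaign effectiveness) can be roughly sketched as follows. This is a minimal illustration in plain Python, not Spark or Databricks code; the record layout, the field names (`campaign_id`, `impressions`, `clicks`, `conversions`), and the `campaign_effectiveness` helper are all hypothetical and chosen only to make the join concrete.

```python
from collections import defaultdict

# Hypothetical processed records, standing in for the output of the ETL step.
internal_events = [
    {"campaign_id": "c1", "impressions": 1000, "clicks": 50},
    {"campaign_id": "c1", "impressions": 500, "clicks": 10},
    {"campaign_id": "c2", "impressions": 2000, "clicks": 20},
]

# Hypothetical customer-supplied CRM records.
crm_records = [
    {"campaign_id": "c1", "customer": "Acme", "conversions": 6},
    {"campaign_id": "c2", "customer": "Acme", "conversions": 1},
]


def campaign_effectiveness(events, crm):
    """Join CRM records to aggregated internal events on campaign_id."""
    # Aggregate the internal events per campaign.
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in events:
        bucket = totals[event["campaign_id"]]
        bucket["impressions"] += event["impressions"]
        bucket["clicks"] += event["clicks"]

    # Inner join: keep only campaigns present on both sides with usable counts.
    report = {}
    for record in crm:
        bucket = totals.get(record["campaign_id"])
        if bucket is None or bucket["impressions"] == 0 or bucket["clicks"] == 0:
            continue
        report[record["campaign_id"]] = {
            "customer": record["customer"],
            "click_through_rate": bucket["clicks"] / bucket["impressions"],
            "conversion_rate": record["conversions"] / bucket["clicks"],
        }
    return report


report = campaign_effectiveness(internal_events, crm_records)
for campaign_id, metrics in sorted(report.items()):
    print(campaign_id, metrics)
```

In a real deployment the same aggregate-then-join logic would run on a Spark cluster over far larger data. The resulting per-campaign table is the kind of output a BI tool could then expose to analysts and product managers.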

Another use case is Internet of Things (IoT). Companies like Automatic Labs take all the data from all the devices in cars. Data scientists look at deeper questions about underlying trends that correlate to the car, cost, and driving patterns.

Non-experts, like account managers, may just want to look at disparate data to correlate to insurance premiums. These people don't want to deal with spinning up a Spark cluster and writing Python or SQL code.

Would your organization consider Spark through a hosted cloud platform? Why or why not? Share your thoughts in the discussion thread below.
