How Apache Spark Is Transforming Big Data Processing, Development

Apache Spark has been called a game changer and perhaps the most significant open source project of the next decade, and it’s been taking the big data world by storm since it was open sourced in 2010.

Apache Spark is an open source data processing engine built for speed, ease of use and sophisticated analytics. Spark is designed to perform both batch processing and new workloads like streaming, interactive queries, and machine learning.

“Spark is undoubtedly a force to be reckoned with in the big data ecosystem,” said Beth Smith, general manager of the Analytics Platform for IBM Analytics. IBM has invested heavily in Spark.

Meanwhile, in a talk at the Spark Summit East 2015, Matthew Glickman, a managing director at Goldman Sachs, said he realized Spark was something special when he attended last year’s Strata + Hadoop World conference in New York.

He said he went back to Goldman and “posted on our social media that I’d seen the future and it was Apache Spark. What did I see that was so game-changing? It was sort of to the same extent [as] when you first held an iPhone or when you first see a Tesla. It was completely game-changing.”

Matei Zaharia, co-founder and CTO of Databricks and the creator of Spark, told eWEEK that Spark started out in 2009 as a research project at the University of California, Berkeley, where he was working with early users of MapReduce and Hadoop, including Facebook and Yahoo.

He said he found some common problems among those users, chief among them being that they all wanted to run more complex algorithms that couldn’t be done with just one MapReduce step.

“MapReduce is a simple way to scan through data and aggregate information in parallel and not every algorithm can be done with it,” Zaharia said. “So we wanted to create a more general programming model for people to write cluster applications that would be fast and efficient at these more complex types of algorithms.”
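To make Zaharia's point concrete, here is a minimal sketch in plain Python. It does not use the Spark or Hadoop APIs; `map_reduce` and `kmeans_1d` are hypothetical helpers written for illustration only. A word count fits in a single map-and-reduce pass, while an iterative algorithm such as k-means clustering has to repeat that pass, feeding each result into the next round.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Run one map-shuffle-reduce pass over an in-memory list (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group emitted values by key
    return {key: reduce(reducer, values) for key, values in groups.items()}


# A simple aggregation such as word count needs exactly one pass.
lines = ["spark is fast", "spark is general"]
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda a, b: a + b,
)


def kmeans_1d(points, centers, iterations):
    """1-D k-means: every iteration is a separate map-reduce pass."""
    for _ in range(iterations):
        def assign(point, current=centers):
            # Emit (index of nearest center, (point, 1)) so sums and counts can be reduced.
            nearest = min(range(len(current)), key=lambda i: abs(point - current[i]))
            return [(nearest, (point, 1))]

        totals = map_reduce(points, assign, lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # New center = mean of assigned points; keep the old center if none were assigned.
        centers = [
            totals[i][0] / totals[i][1] if i in totals else centers[i]
            for i in range(len(centers))
        ]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(word_counts)                         # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
print(kmeans_1d(points, [0.0, 5.0], 1))    # [1.5, 9.0]  (one pass: not yet converged)
print(kmeans_1d(points, [0.0, 5.0], 3))    # [2.0, 11.0] (converged after repeated passes)
```

In a system where every pass is a separate job that writes its output to storage and reads it back, the loop in `kmeans_1d` becomes expensive, which is the kind of workload a more general engine is meant to handle.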

Zaharia noted that the researchers he worked with found MapReduce not only slow for what they wanted to do but also “clumsy” for writing applications. So he set out to deliver something better.

It turned out he delivered something much better.

“What made it [Spark] game changing is it had cross-platform capability,” Glickman said. “It combined relational, functional, iterative APIs without going through all the boilerplate or all the conversions back and forth to SQL or not. It was storage agnostic, which I think was the key insight Hadoop had been missing, because people were thinking about how to put compute on HDFS [Hadoop Distributed File System].”

Glickman also saw other advantages of Spark, including that it provides compute elasticity as well as the ability to scale storage and the number of application users.

“The power of Spark is in the API abstractions,” said Glickman. “Spark is becoming the lingua franca of big data analytics. We should all embrace this.”

Spark vs. Hadoop

Zaharia, who earned an Association for Computing Machinery (ACM) Doctoral Dissertation Award for his design of the Apache Spark engine, explained that Spark and Hadoop are not competitors, as Hadoop does things that Spark doesn’t.

Spark can replace the Hadoop MapReduce computation framework, and many Hadoop vendors and users are replacing MapReduce with Spark.

However, there is also the broader Hadoop ecosystem, which includes the HDFS storage system and NoSQL key-value stores like HBase. Spark doesn’t do storage; it works only with existing storage systems.

“You run it alongside that storage stack,” said Zaharia. “So it basically can replace the computing part of MapReduce, but it can run with all the other parts. You can run it alongside MapReduce as well if you’re transitioning. It’s not so much competing with Hadoop as a whole as much as it competes with the engine part.”

But how is a big data technology so transforming that it’s like the first time you picked up an iPhone?

Zaharia notes that there are a couple of things that stand out.

“One of the things is it improved on what was out there in two dimensions at the same time,” he said. “So it was both a lot faster—like 10 to 100 times faster—and a lot quicker to program with and easier to use. So you could write 10 times less code. It’s very uncommon that you have something that’s better in both dimensions,” he said.

Another reason for the iPhone analogy is that in the Hadoop world, there are many separate tools for each type of processing, which can be a hindrance.

“Like if you wanted to do graph processing there’s one tool, if you want to do streams there’s another one,” Zaharia said. “You have to learn and use each one. It’s similar to how before smartphones you had to have many specialized devices. Some people still have a device for recording, and people had devices like a camera and a music player and so on. Yet you could replace these with a single device that had the right sensors and the right software to do all these things,” Zaharia explained.

Spark for the enterprise

Interestingly enough, Zaharia and his UC Berkeley doctoral adviser, Ion Stoica, saw a business opportunity in Spark and founded Databricks to continue to enhance Apache Spark and to simplify big data processing.

In early August, Databricks introduced version 2.0 of its eponymous platform, which adds new enterprise features to securely manage data access for large teams while streamlining Spark application development and deployment. Databricks 2.0 provides new capabilities for Spark, including access control, R language support, the ability to run multiple Spark versions in production environments and notebook versioning.

Access control improves security and manageability for large teams with diverse roles and responsibilities. R language support enables a new category of users to take advantage of Apache Spark, as users can explore large-scale data volumes with R, including one-click visualizations and instant deployment of R code into production.

“The integration of R enables even more non-expert programmers to use Spark and run stuff on clusters,” Zaharia said.

Access control solves a significant problem for Spark users. “Managing access to credentials and other sensitive information for every user on my team has been a big challenge,” said Benny Blum, vice president of product and data science at Sellpoints, in a statement. “The ability to quickly and easily do so with the Databricks Access Control feature will enable my team to maintain the highest security standard.”

Enter Big Blue

Meanwhile, in June, IBM announced a series of moves to invest in and further commit to Spark as a centerpiece of its big data platform.

“IBM is building Spark into the core of our analytics and commerce platforms,” Joel Horwitz, director of the IBM Analytics Platform, told eWEEK.

“Additionally, we’ll offer Spark as a Service on IBM Bluemix, host Spark applications and offer free Spark online courses to educate a million people worldwide. IBM will also offer enterprise level support and consulting to our clients. Spark enhancements will extend well beyond IBM Analytics into all parts of the business,” he said. Bluemix is IBM’s cloud platform as a service for developing and running large-scale applications.

IBM also said it will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide.

Moreover, the company opened a Spark Technology Center in San Francisco for the data science and developer community.

Other large enterprises are putting Spark to work. Independence Blue Cross (IBC), a large health insurer in the Philadelphia area, serving more than 2 million people in the region and 7 million nationwide, is using Spark to develop new services.

“Apache Spark is quickly maturing into a power tool for development of machine-learning analytic applications,” said Darwin Leung, the company’s director of Informatics. “It allows our IBC researchers and academic partners to work together more seamlessly, which means we can get new claims and benefits apps up and out to customers much faster.”

Findability Sciences, a consulting and contextual data technology company, is using IBM Analytics and Spark to help clients implement big data processing applications.

“Apache Spark with IBM BigInsights has given us tremendous capacity for our implementations for small and medium businesses, where MapReduce was not efficient,” said Anand Mahurkar, CEO of Findability Sciences, in a statement.

“With Spark, the performance has improved multifold. We’re now able to process streaming data from IoT [Internet of Things] devices and offer analytics for data in motion for things like traffic, commuters and parking.”

As Spark has origins in supporting machine learning research, Databricks has been intent on enhancing its machine learning capabilities. To that end, the company is working closely with IBM, which over the summer open-sourced its IBM SystemML machine learning technology.

The companies plan to introduce new domain-specific algorithms to the Spark ecosystem and add new machine learning primitives to the Apache Spark project. IBM and Databricks also will collaborate to integrate IBM’s SystemML with the Spark platform.

Darryl K. Taft