How Apache Spark Is Transforming Big Data Processing, Development

Spark can replace the Hadoop MapReduce computation framework, and many Hadoop vendors and users are making that switch.

However, the Hadoop ecosystem as a whole also includes the HDFS storage system and NoSQL key-value stores such as HBase. Spark doesn’t handle storage; it works with existing storage systems.

“You run it alongside that storage stack,” said Zaharia. “So it basically can replace the computing part of MapReduce, but it can run with all the other parts. You can run it alongside MapReduce as well if you’re transitioning. It’s not so much competing with Hadoop as a whole as much as it competes with the engine part.”

But how can a big data technology be so transformative that using it feels like the first time you picked up an iPhone?

Zaharia notes that there are a couple of things that stand out.

“One of the things is it improved on what was out there in two dimensions at the same time,” he said. “So it was both a lot faster, like 10 to 100 times faster, and a lot quicker to program with and easier to use. So you could write 10 times less code. It’s very uncommon that you have something that’s better in both dimensions.”

Another reason for the iPhone analogy is that in the Hadoop world there is a separate tool for each type of processing, which can be a hindrance.

“Like if you wanted to do graph processing there’s one tool, if you want to do streams there’s another one,” Zaharia said. “You have to learn and use each one. It’s similar to how, before smartphones, you had to have many specialized devices, like a camera and a music player and a voice recorder and so on. Yet you could replace these with a single device that had the right sensors and the right software to do all these things.”

Spark for the enterprise

Interestingly enough, Zaharia and his UC Berkeley doctoral adviser, Ion Stoica, saw a business opportunity in Spark and founded Databricks to continue to enhance Apache Spark and to simplify big data processing.

In early August, Databricks introduced version 2.0 of its eponymous platform, which adds new enterprise features to securely manage data access for large teams while streamlining Spark application development and deployment. Databricks 2.0 provides new capabilities for Spark, including access control, R language support, the ability to run multiple Spark versions in production environments, and notebook versioning.

Access control improves security and manageability for large teams with diverse roles and responsibilities. R language support enables a new category of users to take advantage of Apache Spark, as users can explore large-scale data volumes with R, including one-click visualizations and instant deployment of R code into production.

“The integration of R enables even more non-expert programmers to use Spark and run stuff on clusters,” Zaharia said.