How Apache Spark Is Transforming Big Data Processing, Development

Apache Spark speeds up big data processing by a factor of 10 to 100 and simplifies app development to such a degree that developers call it a "game changer."

Spark Transformation

Apache Spark has been called a game changer and perhaps the most significant open source project of the next decade, and it's been taking the big data world by storm since it was open sourced in 2010.

Apache Spark is an open source data processing engine built for speed, ease of use, and sophisticated analytics. Spark is designed to perform both batch processing and new workloads like streaming, interactive queries, and machine learning.

“Spark is undoubtedly a force to be reckoned with in the big data ecosystem,” said Beth Smith, general manager of the Analytics Platform for IBM Analytics. IBM has invested heavily in Spark.

Meanwhile, in a talk at the Spark Summit East 2015, Matthew Glickman, a managing director at Goldman Sachs, said he realized Spark was something special when he attended last year’s Strata + Hadoop World conference in New York.

He said he went back to Goldman and “posted on our social media that I’d seen the future and it was Apache Spark. What did I see that was so game-changing? It was sort of to the same extent [as] when you first held an iPhone or when you first see a Tesla. It was completely game-changing.”

Matei Zaharia, co-founder and CTO of Databricks and the creator of Spark, told eWEEK that Spark started out in 2009 as a research project at the University of California, Berkeley, where he was working with early users of MapReduce and Hadoop, including Facebook and Yahoo.

He said he found some common problems among those users, chief among them being that they all wanted to run more complex algorithms that couldn’t be done with just one MapReduce step.

“MapReduce is a simple way to scan through data and aggregate information in parallel and not every algorithm can be done with it,” Zaharia said. “So we wanted to create a more general programming model for people to write cluster applications that would be fast and efficient at these more complex types of algorithms.”
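
To make Zaharia's point concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop code, and the helper names (map_reduce, nearest_center_mapper) and sample data are hypothetical, invented only for illustration. It shows a word count that fits in one map-and-aggregate pass, and a simple clustering loop that has to repeat that pass several times over the same data.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """One MapReduce-style pass: map records to (key, value) pairs,
    group the values by key, then reduce each group to one result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


# A simple aggregation (word count) needs only a single pass.
lines = ["spark is fast", "hadoop is stable", "spark is general"]
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda a, b: a + b,
)


def nearest_center_mapper(centers):
    """Build a mapper that tags each point with its nearest center's index."""
    def mapper(point):
        nearest = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
        return [(nearest, (point, 1))]  # (sum, count) pair for averaging
    return mapper


# An iterative algorithm (1-D k-means, assumed here as the example)
# repeats the same pass until the cluster centers stop moving.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = [0.0, 5.0]
passes = 0
while True:
    sums = map_reduce(
        points,
        mapper=nearest_center_mapper(centers),
        reducer=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    )
    new_centers = [
        sums[i][0] / sums[i][1] if i in sums else centers[i]
        for i in range(len(centers))
    ]
    passes += 1
    if new_centers == centers:
        break
    centers = new_centers

print(word_counts)
print(centers, "after", passes, "passes")
```

In a MapReduce setting, each trip through that loop would be a separate job, which is the kind of multi-step workload Zaharia describes his early users struggling with.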

Zaharia noted that the researchers he worked with found MapReduce not only slow for what they wanted to do, but also “clumsy” for writing applications. So he set out to deliver something better.

It turned out he delivered something much better.

“What made it [Spark] game changing is it had cross-platform capability,” Glickman said. “It combined relational, functional, iterative APIs without going through all the boilerplate or all the conversions back and forth to SQL or not. It was storage agnostic, which I think was the key insight Hadoop had been missing, because people were thinking about how to put compute on HDFS [Hadoop Distributed File System].”

Glickman also saw other advantages of Spark, including that it provides compute elasticity as well as the ability to scale storage and the number of application users.

“The power of Spark is in the API abstractions,” said Glickman. “Spark is becoming the lingua franca of big data analytics. We should all embrace this.”

Spark vs. Hadoop

Zaharia, who earned an Association for Computing Machinery (ACM) Doctoral Dissertation Award for his design of the Apache Spark engine, explained that Spark and Hadoop are not competitors, as Hadoop does things that Spark doesn't.