Apache Spark Continues to Gain Enterprise Traction

A new Databricks survey shows that Apache Spark is seeing strong adoption in the enterprise, to the point of eclipsing Apache Hadoop.

Apache Spark, the hot open-source data processing engine, is outgrowing Apache Hadoop in terms of user adoption, according to a recent survey.

Databricks, the company founded by the creators of Apache Spark, released the findings of a survey of more than 1,400 respondents from the Spark community to identify how organizations are using the data analytics and processing engine.

The 2015 Spark User Survey results showed that the number of standalone deployments of Spark eclipses those on YARN as more users run Spark independent of Hadoop.

Indeed, the most common Spark deployments reported by the community are 48 percent standalone, 40 percent on YARN within Hadoop and 11 percent on Apache Mesos. The number of Spark users who do not use any Hadoop components more than doubled in 2015 compared with 2014, the survey said. Moreover, the survey found that 51 percent of respondents run Spark on a public cloud.

With more than 600 contributors in the last 12 months, up from 315 in the prior 12 months, Spark is the most active open source project in big data, according to Databricks. Additionally, more than 200 organizations contribute code to Spark, making it one of the largest communities of engaged developers to date, the company said.

Spark has been referred to as a game changer and perhaps the most significant open source project of the next decade. It is an open source data processing engine built for speed, ease of use, and sophisticated analytics. Spark is designed to perform both batch processing and new workloads like streaming, interactive queries, and machine learning. Users say it speeds up big data processing by a factor of 10 to 100 and simplifies app development.

Spark is being used for an increasingly diverse set of applications, particularly by data scientists for machine learning, streaming and graph analysis use cases. In 2015, there are 56 percent more Spark Streaming users than in 2014. The production use of advanced analytics, like MLlib for machine learning and GraphX for graph processing, increased from 11 percent in 2014 to 15 percent in 2015. And 75 percent of Spark users are using two or more Spark components, with 51 percent using three or more.

“The continued growth of Spark has been highly encouraging, as companies are going into production to obtain real business value, and they are doing so in a wide range of environments beyond Hadoop clusters,” said Matei Zaharia, creator of Apache Spark and CTO of Databricks, in a statement. “Databricks and our partners are 100 percent committed to the long-term growth of Spark and we’ll continue to make improvements based on this survey data and our ongoing community feedback, to make the most complete big data analytics toolkit accessible to all businesses.”