Results of a new survey indicate that the Apache Spark big data processing engine is gaining traction with a growing number of developers.
Typesafe, the company behind Play Framework, Akka, and Scala, has released the findings of a survey of more than 2,100 enterprise developers, data scientists, executives and system architects, analyzing adoption patterns of Apache Spark. The survey showed that 13 percent of developers said they are using Spark in production, while 31 percent said they are currently evaluating Spark.
Apache Spark is an open-source engine for large-scale data processing and analytics. It has been in development for a number of years at UC Berkeley’s AMPLab, and is now being driven by Databricks, a Berkeley spin-out founded by Ion Stoica and Matei Zaharia. Zaharia is CTO of Databricks and the creator of Apache Spark. Databricks worked with Typesafe on the survey.
“This survey of over 2100 developers alone highlights that over 500 enterprises are using or planning to use Spark in production in 2015, in environments ranging from Hadoop clusters to public and private clouds, with data sources including key-value stores, databases, streaming data and file systems,” Zaharia said. “Their use cases range from batch workloads to SQL queries, stream processing and machine learning, highlighting Spark’s unique capability as a simple, unified platform for data processing.”
Typesafe said Spark awareness and adoption are seeing hockey-stick-like growth. Google Trends confirms this finding, and the survey shows that 71 percent of respondents have at least evaluation or research experience with Spark, while up to 35 percent are using it or plan to adopt it soon. Of the survey respondents running big data applications in production, 82 percent indicated that they are eager to replace MapReduce with Spark as the core processing engine.
“Coming directly from developers, this survey reiterated the rapid adoption of Spark for large-scale data processing,” Zaharia said in a statement. “I’m especially excited by the breadth of use cases seen, which range from batch jobs to streaming and machine learning. It’s this type of direct feedback and dialogue with our community that enables us to continue to improve the usability, performance and built-in libraries of Spark.”
For example, faster data processing and event streaming are the focus for enterprises. By far, the most desirable features are Spark’s improved processing power over MapReduce—more than 78 percent of respondents mention this—and the ability to process event streams (66 percent), which MapReduce cannot do.
Moreover, the survey showed that the perceived barriers are not major blockers to adoption. When asked, respondents cited a lack of in-house experience and the perceived immaturity of some Spark components and of their integrations with other middleware and management tools. They also cited the need for better commercial support options and for more comprehensive documentation and advanced examples. Some respondents said their organizations do not currently need “big” data solutions.
“The need to process big data faster has largely fueled the intense developer interest in Spark,” said Dr. Dean Wampler, Big Data Architect at Typesafe, in a statement. “Hadoop’s historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of ‘mid-size’ datasets and use cases that call for real-time processing.”
“Compared to the MapReduce API, the Spark API is amazingly intuitive, providing concise, expressive operations that are often needed for analytics,” Wampler added. “So, in addition to addressing a wider class of problems, Spark is improving the productivity of developers who use it.”
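To give a rough sense of the concise, chained style Wampler describes, here is a minimal word-count sketch in plain Python. It does not use Spark: `LocalDataset` and `word_count` are hypothetical names invented for this illustration, and the class is only an in-memory stand-in whose method names (`flatMap`, `map`, `reduceByKey`, `collect`) mirror operations found in Spark's RDD API. Real Spark code would run the same kind of pipeline across a cluster.

```python
from collections import defaultdict
from functools import reduce


class LocalDataset:
    """Hypothetical in-memory stand-in for illustration only; this is NOT Spark."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, fn):
        # Apply fn to each item and flatten the resulting sequences into one dataset.
        return LocalDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        # Apply fn to each item, one output per input.
        return LocalDataset(fn(item) for item in self.items)

    def reduceByKey(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return LocalDataset((key, reduce(fn, values)) for key, values in groups.items())

    def collect(self):
        # Return the dataset contents as a plain list.
        return list(self.items)


def word_count(lines):
    # The whole job is one short chain of transformations.
    return dict(
        LocalDataset(lines)
        .flatMap(lambda line: line.lower().split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
        .collect()
    )


counts = word_count(["spark replaces mapreduce", "spark streams events"])
```

The point of the sketch is the shape of the pipeline: splitting, pairing and summing are each a single chained call, rather than separate mapper and reducer classes.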
Apache Spark is reaching a level of maturity that moves it beyond pure experimentation—with imminent availability of a stable 1.0 release and inclusion (current or planned) in all major Hadoop distributions. There’s good reason for all of the interest. Spark accelerates analytics on Hadoop, working as a full suite of complementary tools including a fully-featured machine learning library (MLlib), a graph processing engine (GraphX) and stream processing. Spark can access data in a variety of sources including HDFS, Cassandra and HBase.
Developers across all industries have been turning to Typesafe to build Reactive applications, of which big data is a core component. Because Spark is built with Scala, it was a logical choice for Typesafe to add full lifecycle support for Apache Spark to the Typesafe Together Project Success Subscription program, with the aim of accelerating developer adoption and success in building Reactive big data applications.
According to the survey, the top three languages used with Spark are Scala (88 percent of respondents), Java (44 percent) and Python (22 percent). Also, 82 percent of respondents using Spark said they chose Spark to replace MapReduce.
“When we started Spark, we had two goals—we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated Big Data framework I know of, that begat things like FlumeJava and Crunch),” Zaharia said in the study. “On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala’s static typing also made it much easier to control performance compared to, say, Jython or Groovy.”
The Typesafe study acknowledges that Spark is less mature than older technologies, like MapReduce, so developers also need good documentation, example applications, and guidance on runtime performance tuning, management and monitoring. Spark is also driving interest in Scala, the language in which Spark is written, but developers and data scientists can also use Java, Python, and soon, R, the study said.
“This survey further validates Databricks’ partnership and shared vision with Typesafe to bring a comprehensive suite of application development tools for developers that enable enterprises to operate with more agility and speed,” said Kavitha Mariappan, vice president of marketing at Databricks, in a statement. “We look forward to collectively utilizing this feedback to make the Spark developer experience not only richer but also as seamless as possible.”