This week at its MongoDB World conference in New York, MongoDB announced the MongoDB Connector for Apache Spark, which enables developers and data scientists to get real-time analytics from fast-moving data.
The product is now available for users to run analytics and glean insights from live, operational and streaming data. MongoDB worked closely with Databricks, the company founded by the team that created Apache Spark, and the connector has received Databricks Certified Application status for Spark. The certification means Databricks has verified that the connector provides integration and API compatibility between Spark processes and MongoDB.
The new Spark connector follows the pattern of MongoDB’s existing connector for Hadoop.
Kelly Stirman, vice president of strategy and product marketing at MongoDB, told eWEEK there is a lot of interest in using Spark with MongoDB. Typically, he said, organizations run MongoDB in their operational systems and move data into Hadoop through extract, transform and load (ETL) or some other process. That pattern led MongoDB to create its Hadoop connector; now it has one for Spark.
“People are saying with the kind of machine learning and analytics that they’re doing on data, they want to move some of that to run on the operational data as it’s being created,” Stirman said. “And that’s the demand of using Spark with MongoDB.”
MongoDB initially took its connector for Hadoop and enhanced it to be compatible with Spark. “We learned a lot and decided there’s enough interest there to make an engineering investment to make a dedicated connector for Spark,” Stirman said.
“Spark jobs can be executed directly against operational data managed by MongoDB, without the time and expense of ETL processes,” Eliot Horowitz, co-founder and CTO of MongoDB, said in a statement. “MongoDB can efficiently index and serve analytics results back into live, operational processes, making them smarter, more contextual and responsive to events as they happen.”
Moreover, the MongoDB Connector for Apache Spark is written in Scala, Apache Spark’s native language, so it offers a familiar development experience for Spark users. In addition, the connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming and SQL APIs, further benefiting from automatic schema inference, Stirman said.
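To give a sense of what that development experience looks like, here is a minimal sketch in Scala of loading a MongoDB collection into a Spark DataFrame through the connector. The database, collection and connection URI are placeholders chosen for illustration, not details from the announcement:

```scala
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

object MongoSparkDemo {
  def main(args: Array[String]): Unit = {
    // "test.customers" is a hypothetical database and collection.
    val spark = SparkSession.builder()
      .appName("mongo-spark-demo")
      .config("spark.mongodb.input.uri", "mongodb://localhost/test.customers")
      .getOrCreate()

    // Load the collection as a DataFrame; the connector samples
    // documents to infer a schema automatically.
    val df = MongoSpark.load(spark)
    df.printSchema()

    // Once loaded, the data is available to Spark SQL and the
    // machine learning, graph and streaming APIs.
    df.createOrReplaceTempView("customers")
    spark.sql("SELECT * FROM customers LIMIT 10").show()

    spark.stop()
  }
}
```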
“Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications,” Reynold Xin, co-founder and chief architect of Databricks, said in a statement. “The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today.”
The connector also takes advantage of MongoDB’s aggregation pipeline, Stirman said, and it enables users to co-locate Resilient Distributed Datasets (RDDs) with the source MongoDB node to minimize data movement across the cluster and reduce latency.
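As a rough illustration of the aggregation pipeline support, the sketch below filters a collection inside MongoDB before the results ever reach Spark. It assumes the connector’s MongoRDD.withPipeline method and reuses the placeholder URI from the previous example; the status field is likewise invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark
import org.bson.Document

val spark = SparkSession.builder()
  .appName("mongo-pipeline-demo")
  .config("spark.mongodb.input.uri", "mongodb://localhost/test.customers")
  .getOrCreate()

// Load the collection as an RDD of BSON documents.
val rdd = MongoSpark.load(spark.sparkContext)

// Push a $match stage down into MongoDB's aggregation pipeline so
// only matching documents are shipped to the Spark workers.
val active = rdd.withPipeline(Seq(Document.parse("{ $match: { status: \"active\" } }")))

println(s"Active customers: ${active.count()}")
```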
Jeff Smith, data engineering team lead at x.ai, which produces an artificial intelligence-powered personal assistant for scheduling meetings, said x.ai uses both MongoDB and Apache Spark to process and analyze the huge amounts of data required to power an AI application.
“With the new native MongoDB Connector for Apache Spark, we have an even better way of connecting up these two key pieces of our infrastructure,” Smith said in a statement. “We believe the new connector will help us move faster and build reliable machine learning systems that can operate at massive scale.”
At its annual user conference, MongoDB also introduced Atlas, the company’s new database-as-a-service offering.