Databricks, the company founded by the creators of Apache Spark, announced plans to partner with Intel to optimize Spark’s real-time analytic capabilities for the Intel architecture.
Indeed, Intel is investing in a broad initiative involving Apache Spark and is collaborating with both Databricks and the Algorithms, Machines and People Lab (AMPLab) at the University of California, Berkeley, to advance the technology. Databricks’ founders launched the company out of UC Berkeley’s AMPLab after creating Apache Spark.
Spark is a popular open-source big data processing engine. It accelerates analytics on Hadoop and provides a suite of complementary tools, including a full-featured machine learning library (MLlib), a graph processing engine (GraphX) and stream processing. Spark can access data from a variety of sources, including HDFS, Cassandra and HBase.
Intel said it has engaged with Databricks to advance analytics capabilities for Spark on Intel architecture platforms and to accelerate the development of Spark projects such as GraphX, MLlib and Spark Streaming. Intel is also collaborating with AMPLab to accelerate the development of data analytics technologies in real-world solutions through work on SparkR, which provides distributed statistical computing in R on top of Spark. Intel and AMPLab are also working together on using Tachyon, an in-memory file system in the Berkeley Data Analytics Stack (BDAS), to address challenges in Internet-scale machine learning.
“Open source is undoubtedly the future of technological innovation and big data tools and processing are at the forefront of that wave,” Ion Stoica, CEO at Databricks, said in a statement. “Our collaboration with Intel will bring the unified Spark ecosystem to businesses of all sizes with new levels of analytic capabilities, real-time benefits and simplicity.”
Enterprises are increasingly developing applications to extract real-time insights from large data sets. Real-time analytics on Intel architecture is a vital piece of the big data puzzle, enabling prompt, actionable insights from that data. As an open-source framework that enables stream processing as well as fast queries on large data sets stored on a Hadoop cluster, Apache Spark supports new modes of analytics on big data platforms based on the Apache Hadoop ecosystem.
“As more and more connected devices, including sensors, are introduced to the market, big data sets are growing exponentially every year, making processing and analyzing this data a more complex task,” Michael Greene, vice president of Intel Software and Services Group and general manager of System Technologies and Optimization at the chip maker, said in a statement. “To find new trends and strong patterns from large, complex data sets, a strong analytics foundation is needed. Our work with Databricks to advance these analytics capabilities on Intel architecture by utilizing the rich capabilities of Spark will help our customers dive deeper into their data and derive real-time insights and benefits in the cloud.”
Apache Spark can run in Hadoop clusters through YARN or in its own stand-alone mode, and is designed to handle both batch processing and newer workloads such as streaming, interactive queries and machine learning. Spark is built for scalability, stability and performance, with the ability to process data sets ranging from gigabytes to terabytes to petabytes.
In a blog post about Intel’s part in the collaboration, Greene said, “Data is the currency of the future, and the economy is booming.”
Greene noted that Intel is making moves in the big data world because everything is moving at a new speed, with transactions going on all the time and data being generated in unprecedented amounts.
Moreover, big data lies at the heart of the ongoing battle to win and maintain customers, he said.
“Customers expect technology providers to fit their exact needs and provide flawless user experiences,” Greene said. “If they don’t, those customers will find another product or service. This creates a fiercely competitive landscape for tech businesses, as they vie for market share. The Internet of things is creating a world more connected than we ever imagined [and] is expanding the mind-blowing amount of information our society produces every second. The term ‘big data’ is, itself, an acknowledgement of the wealth of insights to be discovered in the way people interact in the digital world—if you have the infrastructure to store, manage, process and analyze it.”
Greene noted that the new collaborations with Databricks and AMPLab complement Intel’s ongoing engagement with Cloudera and the Apache Hadoop community to build the foundation of big data with Hadoop as an enterprise data hub.
“We believe Spark’s efficient in-memory computation within Hadoop enterprise data hub, combined with the performance of Intel architecture, enables advanced analytics with faster real-time decisions,” he said.
“With Databricks, we plan to demonstrate the efficiency of running Spark-based analytics on Intel architecture-based platforms to data center owners using benchmarks and technology education,” Greene added. “In addition, we are accelerating analytics using in-memory techniques and enhancing data security using hardware and software mechanisms.”