IBM Launches Apache Spark-Based Data Science Experience

Aimed at making the cloud more data-friendly, IBM's new Data Science Experience is a native Apache Spark platform for data scientists and developers.


IBM has launched the Data Science Experience, which the company is calling the first enterprise application for Apache Spark.

In an interview with eWEEK, Ritika Gunnar, vice president of Offering Management for IBM Analytics, described the new IBM Data Science Experience as a cloud-based development environment for real-time, high-performance analytics that gives data scientists and developers the ability to access and ingest vast amounts of data and deliver new business insights.

IBM made the announcement at Spark Summit 2016 in San Francisco. At last year’s event, IBM initiated a $300 million investment in making Spark the analytics operating system for the company’s big data efforts, and the new launch builds on that investment. Gunnar would not put a price tag on this year’s launch, but said it was one “drumbeat” among many more to come.

The new Data Science Experience will run on IBM’s Bluemix cloud development platform and will simplify the work of embedding data and machine learning into cloud applications, Gunnar said.

“We are enabling the data science community to build machine learning applications very efficiently by leveraging Spark,” Gunnar told eWEEK. “We made a big commitment last year into development resources and into investment in core open-source Apache Spark because we believed it was transforming how analytics were being run across businesses.”

IBM not only put development investment into the open componentry itself, but over the past year, the company built well over 30 internal applications on Apache Spark, she said.

IBM interviewed a large number of data science professionals and concluded that data science as practiced today is an “individual sport,” Gunnar said. With the Data Science Experience, IBM is attempting to make it a team sport, she noted. This is particularly important as businesses try to use more forms of data across more parts of organizations to make faster, better business decisions.

There is a shortage of data science skills in the market today. Many people aspire to be data science professionals, and as a result, there is a huge need for them to be able to understand the science better using practical, real-world examples. IBM’s new project gives data science professionals the ability to learn, create and collaborate.

One of the first things the Data Science Experience provides is content to help data scientists understand what they need to know. It also provides a platform for creating algorithms, solutions and insights built on open tooling. The open nature of the project shows through here: no matter what language users learned on, IBM will help them get started creating insights from whatever data they have.

“And to make it a team sport, we enable you to share what you’ve learned or what you’ve built with the rest of the data science community through what we’re calling an exchange,” Gunnar said. “The Data Science Experience enables organizations to collaborate across what they build. Sharing and collaborating is a big part of what we do.”

The Data Science Experience brings together content, data, models and open-source resources from IBM and others, including H2O, RStudio and Jupyter Notebooks on Apache Spark. Moreover, any model built in the Data Science Experience can be extended to real-world applications through IBM MobileFirst, the Web, IBM’s Internet of Things (IoT) technology or IBM’s cognitive solutions.

In related news, IBM announced that it joined the R Consortium to advance the R programming language for data science applications. With that move, IBM is extending the agility of Spark to more than 2 million members of the R community through new contributions to SparkR, SparkSQL and Apache SparkML.