IBM has launched a new application for Apache Spark called the Data Science Experience, which the company is referring to as the first enterprise application for Apache Spark.
In an interview with eWEEK, Ritika Gunnar, vice president of Offering Management for IBM Analytics, described the new IBM Data Science Experience as a cloud-based development environment for real-time, high-performance analytics that gives data scientists and developers the ability to access and ingest vast amounts of data and deliver new business insights.
IBM made the announcement at Spark Summit 2016 in San Francisco. Last year at this event, IBM initiated a $300 million investment in making Spark the analytics operating system for the company’s big data efforts, and this move builds on that investment. Gunnar would not put a price tag on this year’s launch, but said it was just one “drumbeat” among many more to come.
The new Data Science Experience will run on IBM’s Bluemix cloud development platform and will simplify the work of embedding data and machine learning into cloud applications, Gunnar said.
“We are enabling the data science community to build machine learning applications very efficiently by leveraging Spark,” Gunnar told eWEEK. “We made a big commitment last year into development resources and into investment in core open-source Apache Spark because we believed it was transforming how analytics were being run across businesses.”
IBM not only put development investment into the open componentry itself, but over the past year, the company built well over 30 internal applications on Apache Spark, she said.
IBM interviewed a large number of data science professionals and concluded that data science as practiced today is an “individual sport,” Gunnar said. With the Data Science Experience, IBM is attempting to make it a team sport, she noted. This is particularly important as businesses try to use more forms of data across more parts of organizations to make faster, better business decisions.
There is a shortage of data science skills in the market today. Many people aspire to be data science professionals, and as a result, there is a huge need for them to be able to understand the science better using practical, real-world examples. IBM’s new project gives data science professionals the ability to learn, create and collaborate.
One of the first things the Data Science Experience provides is content to help data scientists understand what they need to know. It also provides a platform to help them create algorithms, solutions and insights built on open tooling. The open nature of the project shows through here: no matter what language users learned on, IBM will help them get started creating insights from whatever data they have.
“And to make it a team sport, we enable you to share what you’ve learned or what you’ve built with the rest of the data science community through what we’re calling an exchange,” Gunnar said. “The Data Science Experience enables organizations to collaborate across what they build. Sharing and collaborating is a big part of what we do.”
The Data Science Experience brings together content, data, models, and open-source resources from IBM and others, including H2O, RStudio and Jupyter Notebooks on Apache Spark. Moreover, any model built in the Data Science Experience can be extended to real-world applications through IBM MobileFirst, the Web, IBM’s Internet of things (IoT) technology or IBM’s cognitive solutions.
In related news, IBM announced that it joined the R Consortium to advance the R programming language for data science applications. With that move, IBM is extending the agility of Spark to more than 2 million members of the R community through new contributions to SparkR, SparkSQL and Apache SparkML.
IBM Launches Apache Spark-Based Data Science Experience
“With Apache Spark, we see an opportunity to drive significant innovation into the community to benefit data engineers, data scientists and application developers,” Bob Picciano, senior vice president of IBM Analytics, said in a statement. “Our IBM Analytics platform is designed for blending those new technologies and solutions into existing architectures. It’s ready-made to take advantage of whatever innovations lie ahead as more and more data scientists around the globe create solutions based on Spark.”
Indeed, one such upcoming innovation from IBM, or another “drumbeat,” will come later this year when IBM delivers a new platform that lets enterprises consume the insights generated with the Data Science Experience. It is one of a series of announcements IBM plans to make as it builds out a broader vision to help clients realize more value from data, Gunnar said.
“The Data Science Experience qualifies as the next step in IBM’s long-term investments in Apache Spark and should pay substantial dividends for critical Spark constituencies,” said Charles King, principal analyst at Pund-IT.
In a press release, IBM said it is forging partnerships with data science organizations such as Galvanize, Lightbend and RStudio. In addition, IBM has built Spark into the core of its platforms, including Watson and IBM Commerce, as well as its analytics, systems and cloud solutions and more than 30 offerings, such as IBM BigInsights for Apache Hadoop, IBM Analytics on Apache Spark, Spark with Power Systems, Watson Analytics, SPSS Modeler and IBM Stream Computing. In 2015, IBM also open-sourced its SystemML machine learning technology to advance Spark’s machine learning capabilities.
Users such as USA Cycling are employing IBM’s Spark technology. USA Cycling Women’s Team Pursuit is using IBM Spark, Watson IoT, as well as IBM mobile and cloud solutions to gain insights for training strategies and racing tactics. The team can now get advanced analysis of rider data, calculate dynamic race positioning and determine the grouping of riders over the race track.
Meanwhile, IBM continues to grow its analytics ecosystem and has contributed to related projects, including Apache Toree, EclairJS, Apache Quarks, Apache Mesos and Apache Tachyon (now called Alluxio). It has also made major contributions to the Apache Spark sub-projects SparkSQL, SparkR, MLlib and PySpark, with more than 3,000 total contributions in the last year, IBM officials said.
“Just as IBM played a critical role in the development of computer science, we can see many similarities today,” Picciano said in a statement. “Computer science went mainstream with the introduction of the PC. With data science, the major roadblock is having access to large data sets and having the ability to work with so much data. With today’s announcement, clients can have both.”
Indeed, this IBM move is about making data science available for the masses, Gunnar said. “This is about enabling a foundation to the masses that then bridges to a cognitive system,” she noted. “We fully anticipate through this offering being able to grow the number of data science professionals that are out in the market.”
IBM is trying to transform how companies use data across their organizations. The company wants organizations to take data from IT and put it at the fingertips of developers, data science professionals and line-of-business professionals, so they can make decisions that transform the business in ways they had not considered before.
“That means that you have to be more agile and collaboration is key,” Gunnar said. “This notion of team sport is pivotal to that—through things like Watson Analytics and the Data Science Experience.”