SystemML, which came out of IBM Research and is now used in IBM’s BigInsights data analytics platform, is a machine learning algorithm translator. With SystemML, developers can build a machine learning model one time and keep reusing it to analyze and make predictions on data in a nearly infinite number of industry-specific scenarios, said Rob Thomas, vice president of development for IBM Analytics.
“This is significant because for the first time we’re bringing to this vast community a way to automate the process of analytics,” Thomas told eWEEK. “Analytics is very manual in organizations today. And with SystemML it becomes something that can be automated and run at scale.”
In June, IBM announced a major commitment to Apache Spark and that it was open-sourcing its SystemML machine learning technology to the Spark open source ecosystem.
“In the next several years, all businesses will rely almost exclusively on applications that learn,” Thomas said in a statement. “For developers that are not expert in machine learning, the availability of SystemML as open source technology will help scale learning and widespread development of applications that truly sense, learn, reason and interact with people in new ways. IBM developed SystemML to provide the ability to scale data analysis from a small laptop to large clusters without the need to rewrite the entire codebase. This allows for domain or industry-specific machine learning, providing developers what they need from a base code to customize applications for their enterprise’s need.”
Data scientists today face time-consuming and difficult challenges when porting their algorithms to production environments. Apache SystemML addresses these challenges by dynamically compiling and optimizing machine learning algorithms in the environments familiar to the data scientist, and automatically porting those algorithms to production environments. By contributing SystemML to the open source community, IBM is helping data scientists iterate faster with the changing needs of the business, and helping data engineers by removing the need to rewrite code for varying environments. As a result, more app developers will be able to embed deep intelligence in everything from mobile applications to large mainframe processes.
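To make the workflow concrete, the sketch below shows the kind of small, high-level script a data scientist might prototype on a laptop. SystemML's premise is that an algorithm written once at this level of abstraction can be recompiled and re-optimized by the engine for a cluster without a rewrite; the plain-Python code here is purely illustrative of that prototype stage (SystemML's own scripts are written in its R-like declarative language, not Python), and the function name is a hypothetical example, not SystemML API.

```python
# Illustrative prototype: ordinary least squares for y = a*x + b,
# solved directly from the normal equations. A data scientist writes
# logic like this once; a SystemML-style optimizer decides how to
# execute the underlying matrix operations at whatever scale the
# data requires.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on y = 2x + 1 recover the true parameters.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # -> 2.0 1.0
```

The same least-squares logic, expressed over the engine's matrix primitives instead of Python loops, is what a compiler like SystemML's can then distribute across a Spark cluster.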
“The best analogy for this would be to think of it as like a universal translator for languages,” Thomas said. “If you were to go to one country and speak your native language, and no matter what you said it was translated on the fly, and therefore understood by the locals you were talking to, that’s essentially what SystemML does for machine learning algorithms.”
The Apache SystemML project has achieved a number of early milestones, including more than 320 patches covering APIs, data ingestion, optimizations, language and runtime operators, additional algorithms, testing and documentation. There also have been more than 90 contributions to the Apache Spark project, and more than 25 engineers at the IBM Spark Technology Center in San Francisco are working to make machine learning accessible to the fastest-growing community of data science professionals and contributing to various other components of Apache Spark. Additionally, more than 15 contributors from a number of organizations have enhanced the capabilities of the core SystemML engine.
IBM’s SystemML Moves Forward as Apache Incubator Project
“SystemML not only scales for big data analytics with high performance optimizer technology, but also empowers users to write customized machine learning algorithms using simple domain specific language without learning complicated distributed programming. It is a great extensible complement framework of Spark MLlib. I’m looking forward to seeing this become part of Apache Spark ecosystem,” said D.B. Tsai, an Apache Spark and Apache SystemML committer, in a statement.
IBM’s Thomas said data scientists are the primary target audience for SystemML.
“For instance, SAS is a big player in data science and analytics, but you have to bring everything into SAS and you have to know SAS,” he said. “With SystemML, if you know Spark and you know Java or Scala, suddenly you can be proficient in building algorithms at scale using SystemML. This is very different because Spark doesn’t limit you to just a few points of data; it can leverage data in Hadoop, or data in a mainframe or data in a warehouse.”
So it empowers a new set of data scientists in organizations who don't have to know a specific tool; they just need to know a common programming language or framework such as Python, Java or Spark. Thus it is targeted at data scientists who want to build more effective algorithms to automate the analytics process.
“It’s actually pretty rare to go from open-sourced in June, to available on GitHub in August, to being an incubator project in November,” Thomas said. “That’s incredibly fast. I think that’s just a demonstration that the community sees the value here of how this will do something different.”
SystemML changes the process of analytics from a manual, build-once mentality to an ongoing way to run machine learning at scale, he said.
“So I believe this will change the face of machine learning,” Thomas said. “I believe that’s why it’s gotten the momentum that it has, because it gives you scale. You can write once and as new data comes in you can continue to update. It really simplifies the process.”
Meanwhile, there is a good deal of machine learning activity in the open-source software world, said Dave Schubmehl, research director for Cognitive Systems and Content Analytics at IDC. Google recently released its second-generation machine learning library, TensorFlow, as open source, and Microsoft released its Distributed Machine Learning Toolkit (DMTK) under an MIT license, he said.