LinkedIn Open-Sources Spark-Based Machine Learning Library

LinkedIn continues its strategy of developing hot technology and open-sourcing it, this time a machine learning library for Spark called Photon ML.

LinkedIn announced that it has open-sourced its machine learning library for Apache Spark, known as Photon ML.

Apache Spark is an open-source cluster computing framework used for processing and analyzing big data. The engine was built for speed, ease of use and sophisticated analytics, and it is designed to handle both batch processing and newer workloads such as streaming, interactive queries and machine learning.

In a blog post about the open-sourcing of Photon ML, Paul Ogilvie, engineering manager of the LinkedIn Machine Learning Algorithms team, noted that machine learning is a key component of LinkedIn's relevance-driven products. The company uses machine learning to train the ranking algorithms for its feed, advertising, recommender systems (such as People You May Know), email optimization, search engines and more.

"These algorithms play an important role in determining user experience for content-rich websites, so it's critical that we provide our engineers with easy-to-use machine learning tools that create high-quality models that are fast and scale to large datasets," Ogilvie said. "By combining the ability of Spark to quickly process massive datasets with powerful model training and diagnostic utilities, Photon ML allows research engineers to make more informed decisions about the algorithms they choose for the types of recommendation systems listed above."

Photon ML supports large-scale linear, logistic and Poisson regression. Poisson regression is a form of regression analysis used to model count data and contingency tables. Photon ML can also optionally generate model diagnostics, creating charts and tables that can be helpful in diagnosing the model and its fit to an optimization problem, Ogilvie said. It also includes an experimental implementation of generalized additive mixed effect (GAME) models, which is where LinkedIn hopes Photon ML will have a broader impact on how the industry builds and applies machine learning technology.
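For readers unfamiliar with Poisson regression, the idea can be shown in a few lines of plain Python. This is a minimal sketch, not Photon ML's implementation or API: the function names, the toy data and the choice of batch gradient descent are all assumptions made for illustration. The model predicts a count as exp(w . x) and adjusts the weights to reduce the Poisson negative log-likelihood.

```python
import math


def fit_poisson(rows, counts, lr=0.05, steps=2000):
    """Fit Poisson regression weights by batch gradient descent.

    Model: E[y | x] = exp(w . x). The gradient of the mean negative
    log-likelihood with respect to w is mean((exp(w . x) - y) * x).
    Illustrative only; not how Photon ML optimizes its models.
    """
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(rows, counts):
            mu = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(d):
                grad[j] += (mu - y) * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def predict_count(w, x):
    """Expected count for feature vector x under the fitted weights."""
    return math.exp(sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical toy data: the first feature is a constant 1.0 (intercept),
# the second is a binary flag. Counts are made up for the example.
rows = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
counts = [1, 3, 5, 7]
weights = fit_poisson(rows, counts)
```

With a single binary feature plus an intercept, the fitted model should recover the average count of each group, which makes the toy case easy to check by hand.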

"Currently, the GAME implementation in Photon ML supports generalized linear mixed effect models (GLMix), a subset of the algorithms we intend to one day support in GAME," Ogilvie said. A GLMix model consists of a fixed effect component and multiple random effects, he added.

LinkedIn uses GLMix models to improve job recommendations by using a random effect for members and a random effect for jobs. "To be more precise, the random effect for members includes features from job descriptions, such as extracted skills or job titles," Ogilvie said. "Modeling the random effect in this way allows us to better learn which jobs a highly-active member is interested in, with coefficients for job features specific to that member."
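The structure Ogilvie describes can be sketched as a scoring function: one shared set of fixed effect coefficients, plus per-member coefficients applied to job features and per-job coefficients applied to member features. The Python below is an illustration under stated assumptions, not Photon ML code. The feature names, coefficient values and the logistic link are hypothetical choices for the example.

```python
import math


def dot(coeffs, features):
    """Sparse dot product over feature-name -> value dictionaries."""
    return sum(coeffs.get(name, 0.0) * value for name, value in features.items())


def glmix_score(global_coeffs, member_coeffs, job_coeffs,
                member_id, job_id,
                pair_features, job_features, member_features):
    """Score one (member, job) pair with a GLMix-style model.

    logit = fixed effect on member-job pair features
          + per-member random effect applied to the job's features
          + per-job random effect applied to the member's features
    A member or job with no learned random effect contributes 0.
    """
    logit = dot(global_coeffs, pair_features)
    logit += dot(member_coeffs.get(member_id, {}), job_features)
    logit += dot(job_coeffs.get(job_id, {}), member_features)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients, as if already learned.
global_coeffs = {"title_match": 1.0}
member_coeffs = {"m1": {"skill:spark": 0.8}}   # m1 favors Spark jobs
job_coeffs = {"j1": {"seniority": -0.3}}

pair_features = {"title_match": 1.0}
job_features = {"skill:spark": 1.0}
member_features = {"seniority": 1.0}

# An active member with a personal coefficient for the job's skill ...
score_active = glmix_score(global_coeffs, member_coeffs, job_coeffs,
                           "m1", "j1", pair_features, job_features, member_features)
# ... versus a new member with no per-member coefficients yet.
score_new = glmix_score(global_coeffs, member_coeffs, job_coeffs,
                        "m2", "j1", pair_features, job_features, member_features)
```

In this sketch the highly active member scores higher for the same job because the member-specific coefficient on the job's skill feature adds to the shared fixed effect, while the new member falls back to the global model plus the job's own random effect.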

Meanwhile, GAME models enable research engineers to train their algorithms using a more accurate picture of the underlying dataset that better reflects the experience of individual members, Ogilvie said. He noted that LinkedIn hopes that increased use of these techniques in the future will lead to better algorithms for recommendation systems in general.

"Our own initial A/B tests have showed that GLMix models trained using Photon ML improved job recommendations by 15 to 30 percent in job applications, and improved email article recommendations by 10 to 20 percent," Ogilvie said. "While these tests are still in their early stages, these results indicate that Photon can significantly improve recommendations for members."

Last month, LinkedIn open-sourced its Kafka Monitor, a framework for monitoring and testing Kafka deployments.

LinkedIn originally developed what is now known as Apache Kafka, a standard messaging system for large-scale, streaming data. LinkedIn open-sourced Kafka in 2011.

Although Kafka is a standard message broker, it can pose problems for Kafka operators and site reliability engineers (SREs). Reported metrics can be unreliable or inaccurate, which can be time-consuming for an SRE to investigate. Kafka is also prone to occasional bugs that don't manifest until it has been deployed in a real cluster for days or even weeks.

That's why LinkedIn built the Kafka Monitor, a framework for monitoring and testing Kafka deployments in real clusters. It reports critical health metrics and runs validation tests to capture bugs or regressions before they make their way into a deployed cluster.

In a blog post, Dong Lin, a LinkedIn software engineer, said, "Kafka Monitor makes it easy to develop and execute long-running Kafka-specific system tests in real clusters and to monitor existing Kafka deployments' SLAs provided by users."

Moreover, he said Kafka Monitor is potentially useful to other companies to validate their own client libraries and Kafka clusters.

"Indeed, Microsoft has an open-source project on GitHub that also monitors availability and end-to-end latency for Kafka clusters," Lin said.

Similarly, Netflix has described in a blog post a monitoring service that sends continuous heartbeat messages and measures the latency of those messages, he noted.
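The heartbeat approach can be illustrated without a real cluster. The Python sketch below is not Kafka Monitor, nor the Microsoft or Netflix tooling: it uses a hypothetical in-memory class as a stand-in for a Kafka topic and shows only the general idea of stamping each message on send, then deriving availability and end-to-end latency from what arrives.

```python
import time
from collections import deque


class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic.

    Messages produced at the offsets listed in drop_offsets are silently
    lost, which lets the example simulate an unhealthy cluster.
    """

    def __init__(self, drop_offsets=()):
        self._queue = deque()
        self._drop = set(drop_offsets)
        self._next_offset = 0

    def produce(self, payload):
        offset = self._next_offset
        self._next_offset += 1
        if offset not in self._drop:
            self._queue.append(payload)

    def consume_all(self):
        while self._queue:
            yield self._queue.popleft()


def run_heartbeat_check(topic, num_messages, clock=time.monotonic):
    """Send timestamped heartbeats, then report availability and latency."""
    for seq in range(num_messages):
        topic.produce({"seq": seq, "sent_at": clock()})

    # Latency is measured from send time to the moment of consumption.
    latencies = [clock() - msg["sent_at"] for msg in topic.consume_all()]
    received = len(latencies)
    return {
        "availability": received / num_messages,
        "lost": num_messages - received,
        "max_latency_s": max(latencies) if latencies else None,
    }


# Simulate a topic that loses the message at offset 3.
report = run_heartbeat_check(InMemoryTopic(drop_offsets={3}), num_messages=10)
```

A production monitor would run the producer and consumer continuously against a live cluster and export these numbers as metrics. The sketch keeps everything in one process so the bookkeeping is easy to follow.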

"Kafka Monitor differentiates itself by focusing on extensibility, modularity and support for custom client libraries and scenarios," said Lin.