LinkedIn has open-sourced its Hadoop and Spark performance tuning tool, known as Dr. Elephant, to help Hadoop and Spark users add easy, self-service tuning to their production environments.
In a blog post about the tool, Akshay Rai, a software engineer at LinkedIn, described Dr. Elephant as a simple tool for the users of Hadoop and Spark to understand, analyze and improve the performance of their workflows.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It enables distributed processing of large data sets using simple programming models and is designed to scale from a single server to thousands of machines, each contributing local computation and storage. Apache Spark is also an open-source cluster computing framework: a fast engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
LinkedIn runs about 100,000 Hadoop and Spark jobs every day. As the amount of data the company generates continues to grow, it will run more and more analytics on Hadoop and Spark, so it needed an automated tuning tool.
Because Hadoop distributes the storage and processing of large data sets across many interacting components, it is particularly important to make sure every component performs optimally, Rai said.
"While we can always optimize the underlying hardware resources, network infrastructure, OS and other components of the stack, only users have control over optimizing the jobs that run on the cluster," he said.
Dr. Elephant gathers all the metrics on Hadoop jobs, runs analysis on them and presents them in a simple way for easy consumption, Rai said. The goal of the tool is to improve developer productivity and increase cluster efficiency by making it easier to tune the jobs. It analyzes the Hadoop and Spark jobs using a set of configurable, rule-based heuristics that provide insights on how a job performed, and uses the results to make suggestions on how to tune the job to make it perform more efficiently, he said.
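Dr. Elephant's real heuristics are written in Java against the tool's internal APIs, but the general shape of a configurable, rule-based heuristic can be sketched as follows. All names, counters and thresholds here are hypothetical, chosen only to illustrate the idea:

```python
# Illustrative sketch of a rule-based tuning heuristic (hypothetical names
# and thresholds, not Dr. Elephant's actual Java API). Each rule inspects
# a job's counters and reports a severity plus a tuning suggestion.

def mapper_spill_heuristic(metrics, spill_threshold=1.5):
    """Flag jobs whose mappers spill too many records to disk.

    metrics: dict of job counters (spilled vs. map output records).
    spill_threshold: configurable ratio above which we warn (assumed value).
    """
    ratio = metrics["spilled_records"] / max(metrics["map_output_records"], 1)
    if ratio > spill_threshold:
        return {"severity": "severe",
                "suggestion": "Increase mapreduce.task.io.sort.mb "
                              "to reduce disk spills"}
    return {"severity": "none", "suggestion": None}

# A job that spills twice as many records as its mappers emit is flagged.
report = mapper_spill_heuristic(
    {"spilled_records": 2_000_000, "map_output_records": 1_000_000})
```

Because the threshold is a parameter rather than a hard-coded constant, operators can tune each rule to their own cluster, which is what makes the heuristics "configurable" in the sense Rai describes.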
"At LinkedIn, we made it compulsory for the developers to use Dr. Elephant as part of their development cycle," Rai said. "It is mandatory to get a green signal from Dr. Elephant for a flow to run in production. For any user issues, we first ask for Dr. Elephant's report. This encourages them to write their jobs optimally and try to make all their jobs appear green in Dr. Elephant. Dr. Elephant has been a part of LinkedIn's culture for more than a year and has been helping everyone."
There are typically several obstacles to optimizing Hadoop jobs, Rai explained. One basic challenge is that users are unaware of how their jobs are actually executed or how many resources they consume, he said. Another is that the information required to scrutinize a job is scattered across many systems.
"We have the resource manager which provides high level information of the job, the application master logs, hundreds of counters for each task, different counters for different types of tasks, task level logs, etc.," Rai said.
Dr. Elephant works by obtaining a list of all recently succeeded and failed applications from the YARN resource manager, Rai said. The tool then collects each application's metadata (the job counters, configurations and task data) from the Job History server. Once it has all the metadata, Dr. Elephant runs a set of heuristics on it and generates a diagnostic report on how the individual heuristics and the job as a whole performed. The results are then tagged with one of five severity levels to indicate potential performance problems, Rai said.
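The fetch-analyze-tag loop Rai describes can be sketched roughly as follows. This is Python pseudocode with hypothetical helper names and severity labels; the real tool is a Java service that polls YARN and the Job History server:

```python
# Rough sketch of Dr. Elephant's analysis loop (hypothetical names; the
# real tool is a Java service polling YARN and the Job History server).

# Five severity levels, ordered from benign to worst (labels assumed).
SEVERITIES = ["none", "low", "moderate", "severe", "critical"]

def analyze(applications, heuristics):
    """Run every heuristic on each finished application's metadata and
    tag the job with the worst severity any heuristic reported."""
    reports = []
    for app in applications:
        results = [h(app) for h in heuristics]
        worst = max(results, key=lambda r: SEVERITIES.index(r["severity"]))
        reports.append({"app_id": app["id"],
                        "results": results,
                        "overall_severity": worst["severity"]})
    return reports

# Example: one toy heuristic flagging jobs that use too many reducers.
def too_many_reducers(app):
    sev = "moderate" if app["num_reducers"] > 500 else "none"
    return {"heuristic": "reducer_count", "severity": sev}

reports = analyze([{"id": "application_001", "num_reducers": 800}],
                  [too_many_reducers])
```

Rolling each job up to the worst severity any rule reported is what lets a dashboard show a single red/green signal per flow, while the per-heuristic results below it explain exactly which rule fired and why.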
LinkedIn developed Dr. Elephant in mid-2014. Until a few years ago, LinkedIn's Hadoop team analyzed workflows for employees, gave them tuning advice and approved jobs to run in production. But as the number of users grew, it became difficult to keep this up and the team decided to automate the process, Rai said. Thus, Dr. Elephant was born.
"Dr. Elephant is very popular at LinkedIn," Rai said. "People love it for its simplicity. Like a family doctor, it is always reachable and solves around 80 percent of the problems through simple diagnosis. It is designed to be self-explanatory and focused on helping Hadoop users understand and optimize their flows by providing job-level suggestions rather than cluster-level statistics. Like how a real doctor diagnoses a problem, Dr. Elephant also analyzes the problem through simple flow charts. You can add as many heuristics or rules into Dr. Elephant as you'd like."
LinkedIn uses Dr. Elephant for many different use cases, including monitoring how a flow is performing on the cluster, understanding why a flow is running slowly, determining how and what can be tuned to improve a flow, comparing a flow against previous executions, and troubleshooting, Rai said.
Meanwhile, in addition to adding and improving heuristics and extending the tool to newer job types, LinkedIn plans to update Dr. Elephant with job-specific tuning suggestions based on real-time metrics, visualizations of a job's cluster resource usage and trends, better Spark integration, and support for more schedulers.