LinkedIn, which has more than 364 million users, relies on an auto-remediation tool called Nurse to ensure that the site stays up and running.
It takes a large team of engineers and constant monitoring to keep LinkedIn going. It also requires gathering data and recording actions taken, so that long-term problems can be fixed and recurring issues tracked. However, an engineer inundated with a constant flow of low-level remediation tasks might not have the time to notice the deeper patterns.
Given all this, and LinkedIn’s constant growth, the company faced a key decision: increase the number of people monitoring the site or build automation to help them do their jobs. It chose the latter, and in April 2014 LinkedIn rolled out its auto-remediation platform. The company named it Nurse because nurses help you get better and also do the bulk of operational tasks in hospitals, said Brian Cory Sherwin, a LinkedIn site reliability engineer.
“The number one priority of our operations engineers is to keep the site up,” Sherwin said.
The company decided to automate responses to alerts, allowing its operations engineers to focus less on the quantity of alerts and more on the quality of alerts. A year in, Nurse is having a meaningful impact on the company, Sherwin said.
LinkedIn has a robust system to collect alerts from hundreds of thousands of sensors, he said. These sensors are attached to alerts that determine an error state. The particularly important alerts are forwarded to LinkedIn’s operations engineers for additional research, resolution, and action.
“However, as time went by the number of alerts began to outpace our operations engineer’s time to triage,” Sherwin said in a post on the LinkedIn Engineering blog. “To address this, we have several alert goals in mind. In the short term, we want to automate the simpler alerts before our engineers are engaged to investigate. By reducing the number of alerts we monitor we should be able to better focus our engineers onto higher value activities like improving monitoring and noticing patterns. Our longer term goal is to create critical alerts that demonstrate symptoms the LinkedIn member is experiencing. The resolution of the causes of these symptoms would be automated. Only when automation has failed should a human be tasked to investigate.”
Sherwin said the design of Nurse is simple: it acts as a broker across many systems. LinkedIn’s monitoring system posts requests for remediation workflows to the company’s remediation broker. The company has implemented integrations with its code deployment system, ticketing system, and remote execution system, and can add virtually any other system for which LinkedIn can write an integration, he said.
“We allow our site reliability engineers and operations engineers to combine any number of workflow actions into the systems we provide integrations for,” Sherwin said. “These actions are the steps our engineers would perform to resolve the alert.”
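To make the broker idea concrete, here is a minimal Python sketch of how an alert might be mapped to an ordered list of workflow actions, with a human engaged only when automation fails. It is an illustration under stated assumptions, not LinkedIn's implementation: `RemediationBroker`, `restart_service`, `verify_health`, and the alert and host names are all hypothetical, and the actions are stand-ins for calls into real deployment, ticketing, or remote execution systems.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: an action takes a host name and returns True on success.
# None of these names come from LinkedIn's actual Nurse code.
Action = Callable[[str], bool]


@dataclass
class RemediationBroker:
    """Maps alert names to ordered workflows and escalates on failure."""
    workflows: Dict[str, List[Action]] = field(default_factory=dict)
    escalated: List[str] = field(default_factory=list)

    def register(self, alert_name: str, actions: List[Action]) -> None:
        # Combine any number of actions into one workflow for this alert.
        self.workflows[alert_name] = actions

    def handle(self, alert_name: str, host: str) -> bool:
        actions = self.workflows.get(alert_name)
        if actions is None:
            # No automation exists for this alert: hand it to a human.
            self.escalated.append(f"{alert_name}@{host}: no workflow")
            return False
        for action in actions:
            if not action(host):
                # Automation failed partway through: stop and escalate.
                self.escalated.append(
                    f"{alert_name}@{host}: {action.__name__} failed"
                )
                return False
        return True


def restart_service(host: str) -> bool:
    # Stand-in for a call to a remote execution system.
    return True


def verify_health(host: str) -> bool:
    # Stand-in for a post-remediation health check; one host is made
    # to fail here purely to demonstrate escalation.
    return host != "bad-host"


broker = RemediationBroker()
broker.register("service_down", [restart_service, verify_health])
```

In this sketch, `broker.handle("service_down", "app01")` runs both steps and returns `True`, while the same alert on `"bad-host"` fails the health check and adds an entry to `broker.escalated` for an engineer to pick up.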
Indeed, during LinkedIn’s beta testing of Nurse, the company saw its usefulness first-hand, he noted. “A significant power disruption occurred and took many servers offline. One team had converted most of their monitoring to Nurse and was able to restore their entire stack within minutes of power restoration. The other impacted teams had to identify the servers through monitoring and issue the restoration commands manually.”
Nurse enables LinkedIn engineers to focus on better incident tracking, as it automates lower-level operational tasks that used to take up engineers’ time.
Sherwin said long-term goals for Nurse include reducing monitoring fatigue, reducing recovery times on outages, and enabling operations engineers to focus on symptom-based alerts.
Nurse also supports assisted troubleshooting and is helping to transform the careers of LinkedIn’s operations engineers, Sherwin said. “The auto-remediation system saves time and gives our team the opportunity to build new skills and explore new roles. For some, they can transition into site reliability engineers; for others, they’ll create new opportunities around in-depth site health and troubleshooting.”