
    LinkedIn’s Nurse Provides 24/7 System Health

    By Darryl K. Taft - July 24, 2015

      LinkedIn, which has more than 364 million users, relies on an auto-remediation tool called Nurse to ensure that the site stays up and running.

      It takes a large team of engineers and constant monitoring to keep LinkedIn going. It also requires gathering data and recording the actions taken, so that teams can fix long-term problems and track recurring issues. However, an engineer inundated with a constant flow of low-level remediation tasks might not have time to notice the deeper patterns.

      Given all this, and LinkedIn’s constant growth, the company faced a key decision: increase the number of people monitoring the site or build automation to help them do their jobs. It chose the latter, and in April 2014 LinkedIn rolled out its auto-remediation platform. The company named it Nurse because nurses help you get better and also do the bulk of operational tasks in hospitals, said Brian Cory Sherwin, a LinkedIn site reliability engineer.

      “The number one priority of our operations engineers is to keep the site up,” Sherwin said.

      The company decided to automate responses to alerts, allowing its operations engineers to focus less on the quantity of alerts and more on the quality of alerts. A year in, Nurse is having a meaningful impact on the company, Sherwin said.

      LinkedIn has a robust system to collect alerts from hundreds of thousands of sensors, he said. These sensors are attached to alerts that determine an error state. The particularly important alerts are forwarded to LinkedIn’s operations engineers for additional research, resolution, and action.

      “However, as time went by the number of alerts began to outpace our operations engineer’s time to triage,” Sherwin said in a post on the LinkedIn Engineering blog. “To address this, we have several alert goals in mind. In the short term, we want to automate the simpler alerts before our engineers are engaged to investigate. By reducing the number of alerts we monitor we should be able to better focus our engineers onto higher value activities like improving monitoring and noticing patterns. Our longer term goal is to create critical alerts that demonstrate symptoms the LinkedIn member is experiencing. The resolution of the causes of these symptoms would be automated. Only when automation has failed should a human be tasked to investigate.”

      Sherwin said the design of Nurse is simple in that it acts as a broker across many systems. LinkedIn’s monitoring system posts requests for remediation workflows to the company’s remediation broker. The company implemented integrations with its code deployment system, ticketing system, remote execution system, and virtually any other system for which LinkedIn can write integration, he said.

      “We allow our site reliability engineers and operations engineers to combine any number of workflow actions into the systems we provide integrations for,” Sherwin said. “These actions are the steps our engineers would perform to resolve the alert.”
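      The broker design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's implementation: the `RemediationBroker` class, the `restart_service` and `open_ticket` actions, and the alert fields are all hypothetical names invented for the example. It shows a monitoring system handing an alert to a broker, the broker running an ordered list of workflow actions, and a human being engaged only when no workflow exists or a step fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A workflow action receives the alert payload and returns True on success.
Action = Callable[[dict], bool]


@dataclass
class RemediationBroker:
    """Hypothetical broker mapping alert names to ordered workflow actions."""
    workflows: Dict[str, List[Action]] = field(default_factory=dict)
    escalations: List[dict] = field(default_factory=list)

    def register(self, alert_name: str, actions: List[Action]) -> None:
        # Engineers combine any number of actions into one workflow.
        self.workflows[alert_name] = actions

    def handle(self, alert: dict) -> str:
        actions = self.workflows.get(alert["name"])
        if actions is None:
            # No automation exists for this alert: hand it to a human.
            self.escalations.append(alert)
            return "escalated"
        for action in actions:
            try:
                ok = action(alert)
            except Exception:
                ok = False
            if not ok:
                # Automation failed, so a human is tasked to investigate.
                self.escalations.append(alert)
                return "escalated"
        return "remediated"


# Stand-ins for integrations (remote execution, ticketing); assumptions only.
def restart_service(alert: dict) -> bool:
    return alert.get("restartable", False)


def open_ticket(alert: dict) -> bool:
    return True


broker = RemediationBroker()
broker.register("service_down", [restart_service, open_ticket])

result_ok = broker.handle({"name": "service_down", "host": "app01", "restartable": True})
result_fail = broker.handle({"name": "service_down", "host": "app02"})
result_unknown = broker.handle({"name": "disk_full", "host": "app03"})
print(result_ok, result_fail, result_unknown)
```

      Keeping each integration behind a plain callable means a new system can be added by writing one function, which mirrors the article's point that Nurse works with any system for which an integration can be written.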

      Indeed, during LinkedIn’s beta testing of Nurse, the company saw its usefulness first-hand, he noted. “A significant power disruption occurred and took many servers offline. One team had converted most of their monitoring to Nurse and was able to restore their entire stack within minutes of power restoration. The other impacted teams had to identify the servers through monitoring and issue the restoration commands manually.”

      Nurse enables LinkedIn engineers to focus on better incident tracking, as it automates lower-level operational tasks that used to take up engineers’ time.

      Sherwin said long-term goals for Nurse include reducing monitoring fatigue, reducing recovery times on outages, and enabling operations engineers to focus on symptom-based alerts.

      Nurse also supports assisted troubleshooting and is helping to transform the careers of LinkedIn’s operations engineers, Sherwin said. “The auto-remediation system saves time and gives our team the opportunity to build new skills and explore new roles. For some, they can transition into site reliability engineers; for others, they’ll create new opportunities around in-depth site health and troubleshooting.”

      Darryl K. Taft
      Darryl K. Taft covers the development tools and developer-related issues beat from his office in Baltimore. He has more than 10 years of experience in the business and is always looking for the next scoop. Taft is a member of the Association for Computing Machinery (ACM) and was named 'one of the most active middleware reporters in the world' by The Middleware Co. He also has his own card in the 'Who's Who in Enterprise Java' deck.