
    LinkedIn’s Nurse Provides 24/7 System Health

    By
    Darryl K. Taft
    -
    July 24, 2015

      LinkedIn, which has more than 364 million users, relies on an auto-remediation tool called Nurse to ensure that the site stays up and running.

      It takes a large team of engineers and constant monitoring to keep LinkedIn going. That work requires gathering data and recording actions in order to fix long-term problems and track recurring issues. However, an engineer inundated with a constant flow of low-level remediation tasks might not have time to notice the deeper patterns.

      Given this, and LinkedIn’s constant growth, the company faced a key decision: increase the number of people monitoring the site or build automation to help them do their jobs. It chose the latter. In April 2014 LinkedIn rolled out its auto-remediation platform and named it Nurse, because nurses help you get better and also do the bulk of operational tasks in hospitals, said Brian Cory Sherwin, a LinkedIn site reliability engineer.

      “The number one priority of our operations engineers is to keep the site up,” Sherwin said.

      The company decided to automate responses to alerts, allowing its operations engineers to focus less on the quantity of alerts and more on the quality of alerts. A year in, Nurse is having a meaningful impact on the company, Sherwin said.

      LinkedIn has a robust system to collect alerts from hundreds of thousands of sensors, he said. These sensors are attached to alerts that determine an error state. The particularly important alerts are forwarded to LinkedIn’s operations engineers for additional research, resolution, and action.

      “However, as time went by the number of alerts began to outpace our operations engineer’s time to triage,” Sherwin said in a post on the LinkedIn Engineering blog. “To address this, we have several alert goals in mind. In the short term, we want to automate the simpler alerts before our engineers are engaged to investigate. By reducing the number of alerts we monitor we should be able to better focus our engineers onto higher value activities like improving monitoring and noticing patterns. Our longer term goal is to create critical alerts that demonstrate symptoms the LinkedIn member is experiencing. The resolution of the causes of these symptoms would be automated. Only when automation has failed should a human be tasked to investigate.”

      Sherwin said the design of Nurse is simple in that it acts as a broker across many systems. LinkedIn’s monitoring system posts requests for remediation workflows to the company’s remediation broker. The company implemented integrations with its code deployment system, ticketing system, remote execution system, and virtually any other system for which LinkedIn can write an integration, he said.

      “We allow our site reliability engineers and operations engineers to combine any number of workflow actions into the systems we provide integrations for,” Sherwin said. “These actions are the steps our engineers would perform to resolve the alert.”
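
      The broker pattern Sherwin describes can be illustrated with a minimal sketch. This is not LinkedIn’s code: the class name RemediationBroker, the alert names, the host names, and the two example actions are all hypothetical, chosen only to show how a monitoring system might post an alert to a broker that runs an ordered list of workflow actions and hands off to a human only when automation fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# One remediation step; takes a host name and returns True on success.
Action = Callable[[str], bool]


@dataclass
class RemediationBroker:
    """Hypothetical broker mapping alert names to ordered workflow actions."""
    workflows: Dict[str, List[Action]] = field(default_factory=dict)
    escalated: List[str] = field(default_factory=list)

    def register(self, alert_name: str, actions: List[Action]) -> None:
        """Attach a workflow (an ordered list of actions) to an alert."""
        self.workflows[alert_name] = actions

    def handle(self, alert_name: str, host: str) -> str:
        """Run the workflow for an alert; escalate if it is missing or fails."""
        actions = self.workflows.get(alert_name)
        if not actions:
            return self._escalate(alert_name, host, "no workflow registered")
        for action in actions:
            if not action(host):
                return self._escalate(alert_name, host, f"{action.__name__} failed")
        return "remediated"

    def _escalate(self, alert_name: str, host: str, reason: str) -> str:
        # Stand-in for opening a ticket or paging an engineer.
        self.escalated.append(f"{alert_name} on {host}: {reason}")
        return "escalated"


# Hypothetical actions standing in for real integrations
# (remote execution, deployment, ticketing).
def restart_service(host: str) -> bool:
    return True


def verify_health(host: str) -> bool:
    return host != "bad-host"


broker = RemediationBroker()
broker.register("service_down", [restart_service, verify_health])

result_ok = broker.handle("service_down", "app-01")
result_bad = broker.handle("service_down", "bad-host")
result_unknown = broker.handle("disk_full", "app-01")
```

      In this sketch the first request is remediated automatically, while the second and third are recorded for human follow-up, which mirrors the stated goal that a person is engaged only after automation has failed.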

      Indeed, during LinkedIn’s beta testing of Nurse, the company saw its usefulness first-hand, he noted. “A significant power disruption occurred and took many servers offline. One team had converted most of their monitoring to Nurse and was able to restore their entire stack within minutes of power restoration. The other impacted teams had to identify the servers through monitoring and issue the restoration commands manually.”

      Nurse enables LinkedIn engineers to focus on better incident tracking, as it automates lower-level operational tasks that used to take up engineers’ time.

      Sherwin said long-term goals for Nurse include reducing monitoring fatigue, reducing recovery times on outages, and enabling operations engineers to focus on symptom-based alerts.

      Nurse also supports assisted troubleshooting and is helping to transform the careers of LinkedIn’s operations engineers, Sherwin said. “The auto-remediation system saves time and gives our team the opportunity to build new skills and explore new roles. For some, they can transition into site reliability engineers; for others, they’ll create new opportunities around in-depth site health and troubleshooting.”

      Darryl K. Taft
      Darryl K. Taft covers the development tools and developer-related issues beat from his office in Baltimore. He has more than 10 years of experience in the business and is always looking for the next scoop. Taft is a member of the Association for Computing Machinery (ACM) and was named 'one of the most active middleware reporters in the world' by The Middleware Co. He also has his own card in the 'Who's Who in Enterprise Java' deck.

      eWeek has the latest technology news and analysis, buying guides, and product reviews for IT professionals and technology buyers. The site’s focus is on innovative solutions and covering in-depth technical content. eWeek stays on the cutting edge of technology news and IT trends through interviews and expert analysis. Gain insight from top innovators and thought leaders in the fields of IT, business, enterprise software, startups, and more.

      Property of TechnologyAdvice.
      © 2021 TechnologyAdvice. All Rights Reserved
