How to Keep Systems Running During a Crisis

eWEEK BEST-PRACTICE DATA POINTS: In times of crisis, there's a heightened expectation around keeping services up and running. Here are some best practices on how to plan for such situations.


One aspect of the COVID-19 crisis that’s affecting many industries is the immense strain on IT teams from increased demand and the problems that result (e.g., downtime and slowdowns that affect customers). Headlines report remote work apps failing, government communication channels being overwhelmed and entertainment networks buffering.

Recent data from PagerDuty shows that industries under the most stress are seeing as much as an 11x increase in digital incidents. This increased demand illustrates the unexpected and unprecedented pressure our tech systems are currently under. It is a real concern, given that one in three IT employees leaves their job because of unplanned work like this.

In times of crisis, there’s a heightened expectation around keeping services up and running. In this eWEEK Data Points article, Jonathan Rende, senior vice president at PagerDuty, offers his perspective on how developer, IT and customer service teams alike can keep systems running, even when incident rates increase during times of crisis.

Data Point No. 1: Set Up Comms and Divide Roles

As a crisis management leader, your initial instinct is probably to set up one or more dedicated email addresses and phone numbers that employees, customers or stakeholders can use to contact your team about various crisis situations. But configuring an email address or telephone number is the easy part. Which individuals or teams should be the first points of contact for each type of issue? How will they be reached? Who is next in line for escalation if that person is unreachable? How will they collaborate when they’re on the go or working from home?

Nothing is more important in a crisis situation than getting the right information to the right people at the right time. When customers come calling or internal teams are asking how to deal with essential services or where to go, there should be a dedicated phone line that identifies the right (now remote) individuals who are on call. That call process should follow the same escalation policies as automated processes do. Remember, your teams want both digital and live access to a human in times of need. You have to know who the right person is at that time to solve the problem. 

You might be tempted to immediately lay your hands on a tool and get to work, but taking a moment to think through the answers to these questions will save you precious time in a crisis. Once you’ve done so, route emails and urgent phone calls to your crisis management team and set up on-call rotations so that individual team members don’t burn out.
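As a rough illustration of the questions above (who is first in line, who is next if that person is unreachable, and how on-call duty rotates), here is a minimal Python sketch. The issue types, responder names, the `reachable` flag and the weekly rotation are all hypothetical assumptions made for the example; they do not describe any particular paging product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Responder:
    name: str
    reachable: bool  # hypothetical flag; a real system would learn this from page acknowledgements


# Hypothetical escalation chains, ordered from first point of contact to last resort.
ESCALATION_POLICIES: Dict[str, List[Responder]] = {
    "service_outage": [
        Responder("primary-on-call", reachable=False),
        Responder("secondary-on-call", reachable=True),
        Responder("it-manager", reachable=True),
    ],
    "customer_hotline": [
        Responder("support-lead", reachable=True),
        Responder("crisis-manager", reachable=True),
    ],
}


def first_reachable(issue_type: str) -> Optional[Responder]:
    """Walk the chain for this issue type and return the first reachable responder."""
    for responder in ESCALATION_POLICIES.get(issue_type, []):
        if responder.reachable:
            return responder
    return None  # nobody reachable, or no policy defined for this issue type


def on_call_for_week(rotation: List[str], week_number: int) -> str:
    """Rotate on-call duty weekly so no single person carries the load."""
    return rotation[week_number % len(rotation)]
```

The same lookup can sit behind both the phone line and the automated alerts, so that a caller and a monitoring system are routed through one escalation chain.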

Data Point No. 2: Prioritize Incoming Issues

Issues arising from crises such as the COVID-19 pandemic can be numerous and can vary in both severity and business impact. We’ve seen a dramatic increase in COVID-19-related incidents among our customers in the last month. You need to prioritize issues according to their impact on employees, customers and the business overall. For example, an employee who becomes infected with COVID-19 is a serious situation, but it would become much more serious if, say, they were recently in contact with customers.

Issues affecting the supply chain in the retail industry would need to be addressed immediately, given the devastating impact they would have on sales, while an issue with a rewards app would be deprioritized. The email and telephone hotlines you set up earlier may be flooded with a variety of issues at differing levels of severity, and your crisis management team needs a way to react to the ones of most concern without letting less important issues consume valuable time.
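One way to picture this kind of triage is a priority queue that always hands the team the highest-impact open issue first. The sketch below uses Python's standard `heapq` module; the four impact levels and the sample issues are assumptions chosen for illustration, not a classification taken from PagerDuty.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical impact levels; a lower rank is handled first.
IMPACT_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(order=True)
class Issue:
    rank: int
    sequence: int  # arrival order, used to break ties within the same rank
    summary: str = field(compare=False)


class TriageQueue:
    def __init__(self) -> None:
        self._heap: List[Issue] = []
        self._counter = 0

    def report(self, summary: str, impact: str) -> None:
        """Record an incoming issue with its assessed impact level."""
        heapq.heappush(self._heap, Issue(IMPACT_RANK[impact], self._counter, summary))
        self._counter += 1

    def next_issue(self) -> Optional[str]:
        """Return the most urgent open issue, or None when the queue is empty."""
        return heapq.heappop(self._heap).summary if self._heap else None
```

With this arrangement, a supply chain outage reported after a rewards app glitch still comes out of the queue first, and issues of equal impact are handled in the order they arrived.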

Data Point No. 3: Proactive Disaster Recovery

Every savvy IT organization already has disaster recovery (DR) set up. If the primary system fails, DR flips everything over to a replicated site in a data center in a different location. However, in times of crisis, being proactive with your DR strategy will help your teams. Establish reliable WiFi where remote on-call workers are located, so they can address system issues without being on site. Ensure that they have monitors, other necessary equipment and secure access to the systems they previously used on premises, so they can address situations virtually. Also, to avoid overload, set the systems to switch to DR before utilization reaches 100% of capacity, so that applications don’t break and demand even more attention from IT.
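The idea of switching to DR before capacity is exhausted can be expressed as a simple threshold check. In the sketch below, the 85% threshold and the function names are assumptions made for illustration; the right trigger point depends on how quickly load grows and how long a failover takes in your environment.

```python
def utilization(current_load: float, capacity: float) -> float:
    """Fraction of capacity currently in use (0.0 to 1.0 and beyond)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return current_load / capacity


def should_fail_over(current_load: float, capacity: float, threshold: float = 0.85) -> bool:
    """Return True once utilization reaches the threshold, well before 100%.

    The 0.85 default is a hypothetical value, not a recommendation.
    """
    return utilization(current_load, capacity) >= threshold
```

For example, `should_fail_over(900, 1000)` returns `True` because the system is at 90% of capacity, while `should_fail_over(500, 1000)` returns `False`.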

Data Point No. 4: Practice, Practice, Practice 

Just as firefighters have arenas where they run drills again and again before going into a live situation, developer and IT teams need one place that holds all the materials they use to familiarize themselves with their services. This single place should bring together all of the important information, related content and context needed in the case of an outage or problem. For example, many teams have runbooks, but they are scattered in various places: a Google doc here, a spreadsheet there and so on, adding confusion and complexity when time is of the essence.

Teams should ask themselves: What are all the services we are responsible for? Who is on call for those services right now? What are the relationships between these and other systems in the organization? What's the history of availability for this service, and when did it last go down?

The key to managing a crisis is understanding the set of steps that need to be taken in various scenarios and making those processes available all in one place. Get ready and practice.
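To make the "all in one place" idea concrete, here is a small sketch of a service catalog that can answer the questions above: which services exist, who is on call, what depends on what and when a service last went down. The field names and the example entries are hypothetical; a real catalog would usually live in a dedicated tool rather than in code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Service:
    name: str
    on_call: str  # who is responsible right now
    runbook: str  # where the response steps live (hypothetical location)
    depends_on: List[str] = field(default_factory=list)
    outages: List[date] = field(default_factory=list)

    def last_outage(self) -> Optional[date]:
        """Most recent outage date, or None if none has been recorded."""
        return max(self.outages) if self.outages else None


def dependents_of(catalog: Dict[str, Service], name: str) -> List[str]:
    """Services that would be affected if the named service went down."""
    return sorted(s.name for s in catalog.values() if name in s.depends_on)


# Example entries, invented for illustration.
catalog = {
    "database": Service("database", "ana", "runbooks/database", outages=[date(2020, 3, 2), date(2020, 1, 15)]),
    "checkout": Service("checkout", "ben", "runbooks/checkout", depends_on=["database"]),
    "rewards": Service("rewards", "chi", "runbooks/rewards", depends_on=["database", "checkout"]),
}
```

Here `dependents_of(catalog, "database")` returns `['checkout', 'rewards']`, which tells responders which teams to notify when the database has a problem.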

Data Point No. 5: Communicating With the Business and Other Stakeholders

The last thing an IT team needs while focused on solving a critical issue is an executive looking over their shoulders and asking for updates. Before something breaks, define standard ways that business leaders and other partners will be updated about situations that arise. Establishing communication channels such as standardized email templates and dashboards will increase the level of confidence and reduce the panic that stakeholders have when they don’t “feel informed.”

Furthermore, using technology to set expectations about the format and frequency of communications keeps your crisis management team from being distracted by constant inbound requests for updates by email or telephone, so they can focus their limited time on what’s actually important: solving the issue.
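A standardized update can be as simple as a fixed template that the team fills in at an agreed interval. The sketch below uses Python's standard `string.Template`; the field names and wording are assumptions for illustration, not a format prescribed in this article.

```python
from string import Template

# Hypothetical template; agree on the fields with stakeholders before a crisis.
STATUS_UPDATE = Template(
    "[$severity] $service: $status\n"
    "Impact: $impact\n"
    "Next update in $next_update_minutes minutes."
)


def render_update(**fields: object) -> str:
    """Fill in the template; raises KeyError if a required field is missing."""
    return STATUS_UPDATE.substitute(**fields)
```

Because `substitute` fails when a field is left out, every update that goes to executives carries the same information in the same order, including when the next one is due.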

If you have a suggestion for an eWEEK Data Points article, email [email protected].