Amazon, Windows Azure, GoDaddy: How to Avoid Similar Cloud Outages

 
 
By Darryl K. Taft  |  Posted 2013-03-04
 
 
 

Monitor Everything

You can probably name your core gear off the top of your head: maybe not every lower-profile device, but certainly the critical ones. In GoDaddy's case, a performance cascade turned a minor problem into a major outage. Use your network monitor's discovery engine to cover as much of the network as possible with little manual configuration. Schedule discovery to run automatically so that new devices are detected and assessed for how critical they are.


Detailed and Effective Alerting With Escalation

Out-of-the-box monitoring is generally configured to send an email for every alert. It's great to get a heads-up that a printer is down, but with a little investment, you can create specific alerting for critical infrastructure elements and make sure their alerts rise above the noise. Dedicated alerting also lets you configure aggressive escalation notifications so that those problems are addressed quickly, before critical services are affected.
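As a rough illustration of escalation, the sketch below maps an alert's severity and the time it has gone unacknowledged to the people who should have been notified by then. The policy table, contact names and time limits are all hypothetical; a real monitoring product would express this in its own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step fires
    notify: str         # hypothetical contact or channel name


# Hypothetical policies: critical gear escalates quickly, a printer does not.
POLICIES = {
    "critical": [
        EscalationStep(0, "on-call engineer"),
        EscalationStep(15, "team lead"),
        EscalationStep(30, "IT director"),
    ],
    "low": [EscalationStep(0, "helpdesk email queue")],
}


def contacts_to_notify(severity, minutes_unacknowledged):
    """Return everyone who should have been notified by now."""
    # Unknown severities fall back to the low-priority policy.
    steps = POLICIES.get(severity, POLICIES["low"])
    return [s.notify for s in steps if minutes_unacknowledged >= s.after_minutes]
```

A critical alert left unacknowledged for 20 minutes would have reached both the on-call engineer and the team lead, while a low-priority alert never moves beyond the helpdesk queue.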


Check the Charts Every Morning

Even with the most sophisticated alerting and reporting approach, the mind of an experienced network engineer is still the best network management tool ever invented, especially if the data is consolidated in one view. Regular observation of the historical performance charts for device memory, CPU and interface utilization allows the network support team to learn the normal bounds of even complex operating models. Performance charting also allows administrators to establish alerting thresholds, tuned to ensure proactive resolution before users are affected.
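One simple way to turn historical charts into an alerting threshold is to take the mean of past utilization samples and add a few standard deviations. This is only a sketch of that idea: the function name, the choice of three standard deviations and the 100 percent ceiling are assumptions, and an engineer who knows the device should still review the result.

```python
import statistics


def suggest_threshold(samples, sigmas=3.0, ceiling=100.0):
    """Suggest an alert threshold from historical utilization samples (percent)."""
    if not samples:
        raise ValueError("need at least one historical sample")
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)  # population standard deviation
    # Never suggest a threshold above the physical maximum.
    return min(mean + sigmas * spread, ceiling)
```

For example, `suggest_threshold([40, 42, 38, 41, 39])` yields roughly 44.2, so a CPU that normally sits near 40 percent would alert well before it saturates.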


Create Targeted Map, NOC Views

There are unlimited uses for the detailed data collected by monitoring your critical network devices, but there's no substitute for a bright-red alert on a big screen. Create maps that show specific components of your critical network devices, with their overall status and related top-level metrics. For example, mount a 60-inch LED display on the wall showing a geographic map of your core network devices with their up/down status, along with the primary network links between them and their utilization metrics. Your network operations center (NOC) team will appreciate being the first to identify developing issues before users are affected.


Publish Rich Reports

Publishing reports helps because some managers think the network is just like the phone system: it's in the wall, you can't see it, and it even uses the same sort of connector. They don't think about capacity planning until it's a problem. By publishing utilization reports regularly, you bring attention to the users driving the depletion of your mission-critical (and often expensive and complex to upgrade) network hardware. Visibility into these issues makes expansion a regular topic of conversation outside your group, and makes purchasing requests more likely to be approved.


Limit Outages Caused by Human Error

Some of the worst outages in our careers have come from human error, and it's especially common with networking issues. Enter enough arcane command-line interface (CLI) commands at all hours of the day, and sooner or later you'll cause an accidental disaster. Having multiple engineers log in to local network gear complicates the process, and misconfiguration issues can be difficult to troubleshoot. Make nightly backups of your device configurations, ideally with a system that facilitates change detection.
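Change detection on nightly backups can be as simple as diffing tonight's configuration text against last night's. The sketch below uses Python's standard `difflib` module. The function name and the sample configuration lines are hypothetical, and fetching the configurations from the devices is left out.

```python
import difflib


def config_changes(previous, current):
    """Return the lines added (+) or removed (-) between two config backups."""
    diff = list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="last-night",
        tofile="tonight",
        lineterm="",
    ))
    # The first two diff lines are the file headers; hunk markers start with "@@".
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical backups of the same switch on consecutive nights.
last_night = "hostname core-sw1\ninterface Gi0/1\n description uplink\n"
tonight = "hostname core-sw1\ninterface Gi0/1\n shutdown\n"

for change in config_changes(last_night, tonight):
    print(change)
```

An empty result means the device configuration did not change overnight. Anything else is worth a look before the next outage explains it for you.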


Create an Internal Communications Plan

You don't need process flow charts for every possible issue permutation, but a concise spreadsheet of reasonably likely issues can help the team get you back online quickly. Identify risk areas, team member responsibilities and initial troubleshooting steps. Include a reference to your team's emergency contact info. It's far better to debug an outage quickly over the VPN at 2 a.m. than to explain it to your manager when you arrive at the office at 8 a.m.


Create an External Communications Plan

Work with your marketing communications or public relations team to create a communications plan in case your network outage affects customers and partners, as happened with AWS, Azure and GoDaddy. The plan should include guidelines for when your department is expected to inform them of an outage, as well as the kind of information you'll need to provide. Your team might be expected to provide a brief summary of the issue, the time the outage occurred and an estimate of when it is expected to be resolved. Be sure to discuss expectations for how much detail can and should be given.


Prevent Issues From Growing Bigger

Sometimes outages happen outside your control, even after implementing the previous tips. To prevent issues from growing bigger than they need to be, you need the right alert management system so the right team is notified at the right time. For example, if a file is incorrectly deleted, production IT should be notified that an unexpected change was made. A proper alert management system also engages the triage or production IT team responsible for the affected system or zone as soon as performance starts to degrade beyond acceptable limits, not after customers start complaining.
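The routing idea can be sketched in a few lines: when a measurement for a zone crosses its acceptable limit, look up the team that owns that zone and notify it. The zone names, team names and latency metric below are hypothetical placeholders, not taken from any particular alert management product.

```python
# Hypothetical mapping from system zone to the team that triages it.
ZONE_OWNERS = {
    "storage": "storage-oncall",
    "dns": "network-oncall",
}
DEFAULT_OWNER = "production-it"  # fallback when no specific owner is listed


def route_alert(zone, latency_ms, limit_ms):
    """Return the team to notify, or None while performance is acceptable."""
    if latency_ms <= limit_ms:
        return None
    return ZONE_OWNERS.get(zone, DEFAULT_OWNER)
```

With these placeholder values, `route_alert("dns", 250, 200)` returns `"network-oncall"`, while a zone with no listed owner falls back to production IT.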

