Amazon, Windows Azure, GoDaddy: How to Avoid Similar Cloud Outages
You can probably name your core gear off the top of your head: maybe not every lower-profile device, but certainly the critical ones. In GoDaddy's case, a performance cascade turned a minor problem into a major outage. Use the discovery engine of your network monitor to get broad coverage with little manual configuration. Configure scheduled discovery to automatically detect new devices and assess how critical they are.
Detailed and Effective Alerting With Escalation
Out-of-the-box alerts are generally configured to send an email for any alert. It's great to get a heads-up that a printer is down, but with a little investment, you can create specific alerting for critical infrastructure elements and make sure their alerts rise above the noise. That investment also lets you configure aggressive escalation notifications so critical alerts are addressed quickly, before critical services are affected.
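As a rough illustration of escalation, the sketch below decides who should have been notified about an alert based on its severity and how long it has gone unacknowledged. The tier timings, recipient names and the `Alert` structure are all hypothetical; a real network monitor would have its own way of configuring this.

```python
from dataclasses import dataclass

# Hypothetical escalation tiers: (minutes unacknowledged, who to notify).
ESCALATION_TIERS = {
    "critical": [(0, "on-call engineer"), (15, "team lead"), (30, "IT director")],
    "low": [(0, "helpdesk queue")],
}


@dataclass
class Alert:
    device: str
    severity: str  # "critical" or "low" in this sketch
    minutes_unacknowledged: int


def recipients_for(alert: Alert) -> list[str]:
    """Return everyone who should have been notified by now."""
    # Unknown severities fall back to the low-priority tier.
    tiers = ESCALATION_TIERS.get(alert.severity, ESCALATION_TIERS["low"])
    return [who for after, who in tiers if alert.minutes_unacknowledged >= after]
```

With these assumed tiers, a core switch alert ignored for 20 minutes reaches the on-call engineer and the team lead, while a printer alert never goes beyond the helpdesk queue no matter how long it sits.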
Check the Charts Every Morning
Even with the most sophisticated alerting and reporting approach, the human mind of an experienced network engineer is still the best network management tool ever invented, especially if the data is consolidated in one view. Regular observation of the historical performance charts of device memory, CPU and interface utilization allows the network support team to learn the normal bounds of even complex operating models. Performance charting also lets administrators establish alerting thresholds, tuned so that problems are resolved proactively, before users are affected.
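One simple way to turn historical charts into a threshold is to take the baseline average and add a few standard deviations. The sketch below does that for utilization percentages; the function name, the three-sigma default and the 100 percent ceiling are assumptions for illustration, not a recommendation from any particular monitoring product.

```python
from statistics import mean, stdev


def alert_threshold(samples: list[float], sigmas: float = 3.0,
                    ceiling: float = 100.0) -> float:
    """Baseline mean plus `sigmas` sample standard deviations, capped at `ceiling`."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variation")
    return min(ceiling, mean(samples) + sigmas * stdev(samples))


# Example: a week of morning CPU readings (percent) for one device.
cpu_history = [40.0, 42.0, 38.0, 41.0, 39.0]
threshold = alert_threshold(cpu_history)
```

A statistical baseline like this is only a starting point. The engineer who looks at the charts every morning is still the one who knows whether a spike is a backup window or a real problem.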
Create Targeted Map, NOC Views
There are unlimited uses for the detailed data collected by monitoring your critical network devices, but there's no substitute for a bright-red alert on a big screen. Create maps that show specific components of your critical network devices, along with their overall status and related top-level metrics. For example, mount a 60-inch LED display on the wall showing a geographic map of your core network devices and their up/down status, plus the primary network links between them and their utilization metrics. Your network operations center (NOC) team will always appreciate being the first to identify developing issues before users are affected.
Publish Rich Reports
Publishing reports helps because some managers think the network is just like the phone system: it's in the wall, you can't see it and it even uses the same sort of connector. They don't think about capacity planning until it's a problem. By publishing utilization reports regularly, you bring attention to the users driving the depletion of your mission-critical (and often expensive- and complex-to-upgrade) network hardware. Visibility into these issues makes expansion a regular topic of conversation outside your group, and makes purchasing requests more likely to be approved.
Limit Outages Caused by Human Error
Some of the worst outages in our careers have been caused by human error, and it's especially common with networking issues. Enter arcane command-line interface (CLI) commands hundreds of times at all hours of the day, and sooner or later you'll have an accidental disaster. Having multiple engineers log in to local network gear complicates matters, and misconfiguration issues can be difficult to troubleshoot. Make nightly backups of your device configurations, ideally with a system that facilitates change detection.
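Change detection between two nightly backups can be as simple as a text diff. The sketch below uses Python's standard `difflib` module to list lines that were added or removed; the function name and the sample configuration lines are made up for illustration, and a dedicated configuration management tool would do considerably more.

```python
import difflib


def config_changes(previous: str, current: str) -> list[str]:
    """Return lines removed ("- ") or added ("+ ") between two config backups."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only changes.
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical backups from two consecutive nights.
last_night = "hostname core-sw-1\nntp server 10.0.0.1\n"
tonight = "hostname core-sw-1\nntp server 10.0.0.2\n"
changes = config_changes(last_night, tonight)
```

An empty result means the configuration is unchanged. Anything else is worth a look, particularly if nobody on the team remembers making the change.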
Create an Internal Communications Plan
You don't need process flow charts for every possible issue permutation, but a concise spreadsheet of reasonably likely issues can help the team get you back online quickly. Identify risk areas, team member responsibilities and initial troubleshooting steps. Include a reference to your team's emergency contact info. It's way better to quickly debug an outage at 2 a.m. over the VPN than to explain it to your manager when you arrive at the office at 8 a.m.
Create an External Communications Plan
Work with your marketing communications or public relations team to create a communications plan in case your network outage affects customers and partners, as happened with AWS, Azure and GoDaddy. The plan should include guidelines for when your department is expected to inform them of an outage, as well as the kind of information you'll need to provide. Your team might be expected to supply a brief summary of the issue, the time the outage occurred and an estimate of when it is expected to be resolved. Be sure to discuss expectations for how much detail can and should be given.
Prevent Issues From Growing Bigger
Sometimes outages happen for reasons outside your control, even after you implement the previous tips. To prevent issues from growing bigger than they need to be, you need the right alert management system so the right team is notified at the right time. For example, if a file is incorrectly deleted, production IT is notified that an unexpected change was made. A proper alert management system also quickly engages the triage/production IT team responsible for the system or zone having issues as soon as performance starts to degrade beyond acceptable limits, not after customers start complaining.
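The routing idea can be sketched in a few lines: compare a measurement against the acceptable limit for its zone and, if the limit is exceeded, return the team that owns that zone. The zone names, team names and latency limits below are hypothetical placeholders.

```python
from typing import Optional

# Hypothetical mapping of zones to the teams that own them.
TEAM_BY_ZONE = {"storage": "storage-ops", "dns": "network-ops"}

# Hypothetical acceptable latency per zone, in milliseconds.
MAX_LATENCY_MS = {"storage": 20.0, "dns": 50.0}


def team_to_notify(zone: str, latency_ms: float) -> Optional[str]:
    """Return the owning team if latency exceeds the zone's limit, else None."""
    limit = MAX_LATENCY_MS.get(zone)
    if limit is None or latency_ms <= limit:
        # No limit defined for this zone, or performance is still acceptable.
        return None
    # Fall back to a general production IT team if the zone has no named owner.
    return TEAM_BY_ZONE.get(zone, "production-it")
```

The point of the design is that the notification fires on degradation itself, so the owning team hears about slow DNS from the monitor rather than from customers.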