Calamity descended from the skies around Washington on June 29 in the form of a derecho, a type of weather system so rare most people have never even heard of it. This unusual complex of extremely severe weather had never been known to cross a range of mountains such as the Alleghenies. But this time it happened, and disaster planning went out the window.
Amazon’s huge data center near Dulles International Airport, fully redundant in itself and served by redundant backup power, redundant power grids and redundant network access, went down under the combined onslaught of massive power outages, massive Internet outages, phone line outages and cell system outages. Not only did everything go down, but nobody could call for backup. And, of course, even if the staff had known that this event was happening, they couldn’t have traveled there anyway. Most of the roads were blocked.
While we often preach the gospel of preparedness, there are disasters for which no one could prepare. When weather this violent appears out of nowhere, with no warning and no forecasts, there is only so much that anyone or any institution can do. The fact that Amazon was able to get back online and have all of its affected customers fully restored by the next morning was remarkable.
But Amazon was one of the few that managed this. Smaller organizations with fewer resources were simply taken out by this calamitous blow. Many of those companies remained down as this was written on July 2, and some will never recover.
Of course, some of those smaller organizations didn’t have disaster plans and were simply left hanging. Some did have plans, but they weren’t tested, and when push came to shove, they didn’t work. And some plans were in place, tested and should have been enough, but just as with Amazon, the planners couldn’t plan for everything.
In my own company, which houses the test lab that produces those eWEEK reviews you see from time to time, I thought I’d planned for anything short of the Mayan Apocalypse or a slightly more probable world-ending asteroid strike. I’d even tested the lab using the backup generators, communicated using the backup WiFi hotspot and made plans for the air conditioning to be out.
But in the case of the lab, configuration changes had crept in since the last time I calculated the electrical loads, and I’d never tested the latest configuration. Worse, I’d assumed that the T-Mobile cell near the lab would keep running for at least a few days after losing power, since it had always done so in the past.
Recovery Is Hard When the Whole World Is Shut Down
So the derecho came in the dark of night. The first hint was the flicker of lightning off to the northwest. Then a storm more violent than anything I’d ever seen before slammed the area. This was worse than the hurricanes I’d experienced, including one off the west coast of Africa that was my previous high point when it came to weather-related anxiety. In 45 minutes, it was gone and so was the power, the Internet service, the phone service and the previously reliable cell tower.
But I got the generators started and began bringing up the lab infrastructure. One by one, the switches and servers came alive, the whir of the fans and the flickering of the lights reassuring me that all was well. Then I started up the HP server that handles the Domain Name System (DNS), the Dynamic Host Configuration Protocol (DHCP) and directory services. The low-voltage alarms started going off one by one. I didn’t have enough capacity to run the lab, despite my previous tests.
So I shut down the servers and the other computers, and finished bringing up the infrastructure. I had capacity for that and everything ran, but I was approaching the total capacity of the generators, and that’s never a good thing. But that’s when I found that it didn’t matter. My lab might be operational, but it couldn’t communicate with the outside world because nothing else was operational. Being able to run when the rest of the world isn’t really doesn’t help much, especially when you realize that you’re going to have to buy another generator and set up load sharing.
Actually, I’ll have to buy two more generators for full N+1 capability. In the meantime, I’ll also have to remember to test the entire system more frequently, especially after I add more servers, new switches or network management equipment. I hadn’t gotten around to that, and it cost me.
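The load arithmetic that slipped out of date is easy to redo every time the configuration changes. What follows is only a minimal sketch in Python: the equipment names, the wattages, the 3,000-watt generator rating and the 80 percent headroom rule are all hypothetical placeholders, not measurements from the lab. It adds up the loads, works out how many generators are needed to carry them with some margin, and then adds one spare for N+1.

```python
import math


def generators_needed(load_watts, generator_watts, headroom=0.8, spares=1):
    """Return (total_load, n_required, n_with_spares).

    load_watts      -- dict of equipment name -> draw in watts (hypothetical values)
    generator_watts -- rated output of one generator (assumed identical units)
    headroom        -- fraction of rated output we are willing to use (assumption)
    spares          -- extra generators kept for redundancy (1 gives N+1)
    """
    if generator_watts <= 0 or not 0 < headroom <= 1:
        raise ValueError("generator_watts must be positive and headroom in (0, 1]")
    total = sum(load_watts.values())
    usable = generator_watts * headroom      # usable output per generator
    n_required = math.ceil(total / usable)   # generators needed just to carry the load
    return total, n_required, n_required + spares


# Hypothetical inventory; replace with measured draws after every change.
lab_loads = {
    "switches": 600,
    "dns_dhcp_server": 750,
    "test_servers": 2400,
    "workstations": 900,
}

total, n, n_plus_1 = generators_needed(lab_loads, generator_watts=3000)
print(f"Total load: {total} W, generators needed: {n}, with N+1: {n_plus_1}")
```

With these made-up figures, a single generator falls short, two carry the load and a third provides the spare. Rerunning the calculation whenever a server or switch is added is the step that matters; it supplements a live test on generator power rather than replacing one.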
But in this case, all of the planning wouldn’t have made any difference. As I looked out at the hazy heat that had brought all of this about, the one thing that kept popping into my head (right after the desire for a nice cold beer) was the words of Scottish poet Robert Burns:
The best-laid schemes o’ mice an’ men
Gang aft agley,
An’ lea’e us nought but grief an’ pain,
For promis’d joy!
Thanks for the reminder, old Rabbie. Tonight, I’ll have a wee dram of Scotland’s best in your memory and to remind me that we can’t plan for everything.