It seems that the cloud’s greatest enemy is lightning.
Or at least, that’s the conclusion I’ve reached after considering the latest outages affecting Amazon’s and Microsoft’s cloud services in Europe. It’s been widely reported that a lightning strike outside Dublin took out part of the power grid, which then led to the outages, although the Electricity Supply Board – Ireland’s primary provider of electricity – now denies that the lightning was directly responsible, pointing instead to equipment failure as the cause.
Here’s what appears to have happened: on Sunday evening (6:15pm Dublin time), a power failure affected about 100 customers who draw juice from the Citywest substation, in an area southwest of Dublin city center that is home to several business parks and data centers. Power is reported to have been completely off for about an hour, with a partial disruption that lasted for another four hours. The ESB has confirmed that Amazon was one of the customers affected by this outage. Microsoft has a data center in the area, and it was also knocked offline, according to reports.
Meanwhile, a second Amazon data center in south Dublin experienced a voltage dip that lasted for a fraction of a second. It’s unclear what connection exists between this event and the Citywest outage.
The utility at first fingered nearby lightning strikes for the problem, but has switched the onus to a failed 110kV transformer; an explosion and fire are said by ESB to have been limited to the bushings and insulators of the transformer. That’s certainly specific, but it causes one to wonder what caused the explosion in the first place; if I were trying to explain an exploding transformer, I’d be hard pressed to think of anything better than lightning to blame.
Both Amazon and Microsoft have said that the fluctuations in utility power interfered with their ability to automatically start backup generators. This I can actually believe; it’s certainly possible that intermittent outages would have confused the backup systems. It’s difficult to take that kind of situation into account when designing these systems, which have to provide power for the data center without allowing that power to leak back out to the grid and further disrupt the utility’s efforts to restore power. The difficulty is compounded by a common design assumption: that utility power will be completely off, rather than wobbly as it was on Sunday night.
But what this incident should be telling cloud providers is that it’s not enough to have backup power at hand; the backup has to perform as well in cases where utility service is in that grey area between completely out and normal operation as it does when the utility is completely dark.
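To make that grey area concrete, here is a minimal sketch in Python of transfer logic that treats a wobbly feed the same way it treats a dead one. Everything in it is my own assumption for illustration: the `TransferController` name, the 230-volt nominal figure, the 10 percent tolerance and the sample counts are hypothetical, and real transfer switches do this in dedicated hardware and firmware, not in a script. I have no knowledge of how Amazon or Microsoft actually implement theirs.

```python
from collections import deque


class TransferController:
    """Hypothetical transfer logic: unstable utility power counts as an outage."""

    def __init__(self, nominal=230.0, tolerance=0.10,
                 window=10, max_bad=3, stable_needed=10):
        # Acceptable voltage band around the assumed nominal value.
        self.low = nominal * (1 - tolerance)
        self.high = nominal * (1 + tolerance)
        # Sliding window of recent samples; True marks an out-of-band reading.
        self.recent = deque(maxlen=window)
        self.max_bad = max_bad
        self.stable_needed = stable_needed
        self.good_streak = 0
        self.on_generator = False

    def update(self, volts):
        """Record one voltage sample and return True while on generator power."""
        bad = not (self.low <= volts <= self.high)
        self.recent.append(bad)
        self.good_streak = 0 if bad else self.good_streak + 1

        if not self.on_generator and sum(self.recent) >= self.max_bad:
            # Enough bad samples in the window, even if non-consecutive:
            # a flickering feed triggers the generator just like a dead one.
            self.on_generator = True
        elif self.on_generator and self.good_streak >= self.stable_needed:
            # Return to utility power only after a sustained clean run.
            self.on_generator = False
            self.recent.clear()
        return self.on_generator


controller = TransferController()
for volts in [230, 0, 230, 0, 230, 0]:
    on_generator = controller.update(volts)
print(on_generator)  # True: three bad samples landed inside the window
```

The point of the sliding window is that the bad readings don’t have to be consecutive; a feed that flickers on and off still trips the switch. The requirement for a long clean streak before switching back is there to keep the data center from bouncing between sources while the utility is still sorting itself out.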
One of my lasting memories of the 1989 Loma Prieta earthquake was sitting on the balcony of my upstairs neighbors and watching Pacific Gas and Electric try to restore power to San Francisco. SF General Hospital was the only island of artificial light in the city until about 10pm, if I’m remembering correctly; my neighbors and I watched for the next two hours as one part of the city flickered into light, followed by another. From time to time, blocks that had just been relit would lose power for a few minutes, and then come back on the grid; we were cheering as the lights drew closer to the Corona Heights neighborhood where I lived at the time.
The lesson I took away from 1989 was that electrical distribution is as much of an art as it is a science, if only because so much depends on the cooperation of Mother Nature. Because cloud customers demand 24x7x365 availability, Amazon, Microsoft and anyone else who wants to be a player in the cloud have to do a better job of preparing their data centers for this kind of event. It’s not enough to have backup generators and a three-ring binder with a recovery plan; for the cloud to succeed in the long run, what cloud providers have to offer is a continent-spanning mirroring and failover scheme that allows lightning to strike in Ireland (or North Carolina) without affecting operations in Rome (or Los Angeles).