This month has brought forth a horde of rather obvious one-year-later stories, looking back at the Northeast's massive electrical blackout of August 2003. Many of those stories share an interesting blind spot: They look at traditional issues of power-grid technology and management, while apparently forgetting the initial (and rather hysterical) suspicions that Blaster worm infections might have triggered or worsened that power loss, which affected 50 million people.
Perhaps familiarity has bred contempt, and perhaps that's the most disturbing lesson not being learned from the mass media's backward glance at last summer. Blacked-out cities across several states are still a novelty, thank goodness, but tens of millions of Blaster and Sasser and other malware infections during the past 12 months seem to have left us numb. We'd better retain our capacity for outrage, though, if we aspire to build computing grids that handle bits more reliably than the power grid handles watts.
Let's try to learn the right things from the blackout. It didn't take anything so malicious as a deliberate malware attack to turn out all those lights. Instead, as aptly said in an Aug. 10 analysis by Matthew Wald in The New York Times, "The disturbance spread so quickly that day largely because hundreds of components acted exactly as they had been programmed to do." When I looked further into follow-up studies of the blackout, it appeared to me that engineers and operators had made many separate conservative decisions that added up to unanticipated results.
Wald quoted Douglas Voda, a vice president of ABB, an international group of companies that includes several power technology providers: "There's a conflict between protection of an asset and protection of a system." IT builders will do well to consider that distinction as their services-based systems grow more complex, and as they perhaps require new operating definitions of system integrity, protecting vital business processes as vigorously as they protect IT components.
Analysts now believe the overall failure of the power system last August was the result of individually correct actions by Zone 3 relays: backup components, designed to detect and respond to disturbance in distant parts of the grid, typically with fairly long delay times of half a second to 3 seconds.
I say "fairly long" only by comparison with Zone 1 (essentially instantaneous) and Zone 2 (one-third- to one-half-second responses) components. Even Zone 3 responses are still (by necessity) too quick to allow for fine-tuning by human operators. Envisioning large-scale system behavior has to be an upfront task.
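To make the point about individually correct actions concrete, here is a toy sketch in Python. It is not a model of the real grid or of any actual relay: the function name, the equal-sharing rule for shed load, and all of the numbers are assumptions chosen only for illustration. Each line follows one sensible local rule (trip when load exceeds capacity), and the only question is what those rules add up to.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return the order in which lines trip after one initial failure.

    Hypothetical toy model: every line obeys a single local rule,
    "trip if my load exceeds my capacity". Load shed by tripped lines
    is shared equally among the survivors (an assumption, not physics).
    """
    loads = list(loads)
    alive = [True] * len(loads)
    tripped = []
    pending = [first_failure]
    while pending:
        # Trip every pending line and collect the load it sheds.
        shed = 0.0
        for i in pending:
            alive[i] = False
            tripped.append(i)
            shed += loads[i]
            loads[i] = 0.0
        survivors = [i for i, ok in enumerate(alive) if ok]
        if not survivors:
            break  # nothing left to carry the load: total blackout
        share = shed / len(survivors)
        for i in survivors:
            loads[i] += share
        # Each survivor now applies its own protective rule.
        pending = [i for i in survivors if loads[i] > capacities[i]]
    return tripped


# Heavily loaded system: one failure takes down everything.
heavy = simulate_cascade([60, 70, 80, 90], [100, 100, 100, 100], first_failure=3)

# Lightly loaded system: the same failure is absorbed.
light = simulate_cascade([30, 30, 30, 30], [100, 100, 100, 100], first_failure=3)
```

In the heavy case every line protects itself correctly and the system still goes dark; in the light case the identical rule is harmless. Nothing in the per-line logic distinguishes the two outcomes, which is why the system-level behavior has to be examined before the event rather than during it.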
Power system operators have long been aware of the potential for widespread chaos arising from the aggregate behavior of individually maintained safety systems. In 1997, a report from the Western Systems Coordinating Council examined the massive power failures affecting the western United States during the summer of 1996. "It is generally assumed," the report said, "that the time delay of the Zone 3 relay is set long enough to properly coordinate with all downstream protective devices that are within the reach setting of the Zone 3 relay. ... It may be difficult to set the Zone 3 timer long enough to avoid miscoordination."
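The assumption quoted in that report can be written down as a simple check, sketched here in Python. This is a deliberately simplified reading, not the council's method: the function name, the device names, the clearing times and the 0.2-second safety margin are all hypothetical. The idea is only that a backup timer is coordinated when it waits longer than every downstream device needs to act, plus some margin.

```python
def find_miscoordinated(zone3_delay_s, downstream_clearing_s, margin_s=0.2):
    """Return names of downstream devices the Zone 3 timer may beat.

    zone3_delay_s: backup relay time delay, in seconds.
    downstream_clearing_s: dict of device name -> time the device
        needs to clear a fault, in seconds (hypothetical values).
    margin_s: assumed safety margin; 0.2 s is illustrative only.
    """
    return sorted(
        name
        for name, clearing in downstream_clearing_s.items()
        if clearing + margin_s >= zone3_delay_s
    )


# Hypothetical downstream devices within reach of the backup relay.
devices = {"feeder_a": 0.5, "feeder_b": 0.9, "bus_tie": 1.4}

at_one_second = find_miscoordinated(1.0, devices)
at_three_seconds = find_miscoordinated(3.0, devices)
```

With a one-second delay, two of the three hypothetical devices may not finish before the backup relay acts; at three seconds, all of them do. The report's warning is that the delay cannot always be stretched far enough, so a check like this can fail with no easy fix.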
IT operators must look across disciplinary boundaries to see what lessons they can learn from large-scale power disruptions. Proponents of "grid computing" aspire to imitate the power grid's ability to transfer power from where it's available to where it's needed, reliably and with reaction times that are measured in 60ths of a second.
Grid computing proponents must strive, though, to imagine complex disruptions, as well as simple and obvious failure modes. And the challenge facing compute grids is greater because compute cycle times are shorter, while the compute capacity being shared is far more complex than the capacity measures of power plants. Watts are watts, but compute cycles come in many more flavors with many more dependencies on complex states of data and processes.
For every new Web service standard that's advanced (for example, the WS-Addressing specification that made a major step forward this month), I want to know more than how well it works. I want to know how well it fails.
Technology Editor Peter Coffee can be reached at firstname.lastname@example.org.