No, I did not get an advance copy of the Interim Report of the U.S.-Canada Power System Outage Task Force, released last week. Any resemblance between its conclusions and my column of two weeks ago is pure coincidence, and I’m more worried than gratified by this prompt confirmation of the problem that I described.
The task force report found three groups of causes for the August 14th blackout. Assuming that the pruning of tree limbs near wires doesn’t need to be on your Web services agenda, the first and third groups still have high relevance for application developers as they move into broadly distributed systems with high-availability requirements. The report labels Cause 1 as “Inadequate Situational Awareness,” and Cause 3 as “Failure of the interconnected grid’s reliability organizations to provide effective diagnostic support.” The crucial event under the Cause 1 heading was the failure of an alarm system: “…[A]larm and logging software failed sometime shortly after 14:14 EDT… operators were working under a significant handicap without these tools. However, they were in further jeopardy because they did not know that they were operating without alarms…”
As I said above, it’s pure coincidence that two weeks earlier, I had written that “the scene is set for…catastrophe when something appears to be working, and isn’t.”
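One common guard against the "operating without alarms and not knowing it" condition is a heartbeat: the alarm processor records each time it completes real work, and the operator console complains when that record goes stale. The sketch below is only an illustration of that idea; the class name `HeartbeatMonitor`, its `max_silence_s` parameter and the injected `clock` are assumptions of mine, not anything described in the task force report.

```python
import time


class HeartbeatMonitor:
    """Tracks how long it has been since a component last reported useful work.

    Hypothetical helper for illustration; not taken from the report.
    """

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s  # longest silence tolerated, in seconds
        self.clock = clock                  # injectable so the monitor can be tested
        self.last_beat = clock()

    def beat(self):
        # Called by the alarm processor each time it finishes a processing cycle.
        self.last_beat = self.clock()

    def is_stale(self):
        # True when the component has been silent longer than allowed.
        return self.clock() - self.last_beat > self.max_silence_s
```

The point of the design is that silence itself becomes a signal: a console that polls `is_stale()` can tell operators that the alarm system has gone quiet, rather than leaving them to infer it from the absence of alarms.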
But wait, there’s more. “At 14:41 EDT, the primary server hosting the [Energy Management System] alarm processing application failed, due either to the stalling of the alarm application, queuing to the remote terminals, or some combination of the two. Following preprogrammed instructions, the alarm system application and all other EMS software running on the first server automatically transferred (failed-over) onto the backup server. However, because the alarm application moved intact onto the backup while still stalled and ineffective, the backup server failed 13 minutes later, at 14:54 EDT. Accordingly, all of the EMS applications on these two servers stopped running.”
In short: the fail-over mechanism made provision only for a failure of the application platform; it did not correctly deal with failure at the level of the application itself, but instead merely duplicated that failure on the backup server.
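The distinction between a platform failure and an application failure can be shown in a toy model. In this sketch, a fail-over routine carries state across only when the application itself reports healthy, and otherwise starts clean on the backup. The names `AppInstance`, `stalled` and `fail_over` are hypothetical, and a single boolean is a deliberate oversimplification of how a real alarm application would detect its own stall.

```python
from dataclasses import dataclass, field


@dataclass
class AppInstance:
    """Toy stand-in for an application plus its in-memory state (hypothetical)."""
    queue: list = field(default_factory=list)
    stalled: bool = False

    def healthy(self):
        return not self.stalled


def fail_over(primary: AppInstance) -> AppInstance:
    """Return the instance that should run on the backup host."""
    if primary.healthy():
        # Platform fault only: the application state is trustworthy, so carry it over.
        return AppInstance(queue=list(primary.queue), stalled=False)
    # Application fault: start clean rather than duplicating the stalled state.
    return AppInstance()
```

Starting clean loses whatever was queued, which is a real cost; the sketch simply assumes that a working alarm system with a gap is preferable to a second frozen one.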
Even with both servers down, the system did not suffer hard failure; it did, however, slow down to the point that screen updates took almost a minute instead of the usual 1 to 3 seconds. Ordinary movements from one top-level screen to a lower-level detail screen, and back again, took minutes to perform. The report does not state this conclusion, but I will: interactive speed is part of correct application performance, and anything that slows application response needs to be treated as a form of failure—not just an annoyance.
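Treating slowness as a form of failure means giving response time a threshold that monitoring can act on. Here is a minimal sketch, assuming a budget of 3 seconds (the top of the "usual" range quoted above) and an arbitrary hard limit of 30 seconds that I have chosen purely for illustration; the function name `classify_response` is likewise hypothetical.

```python
def classify_response(elapsed_s, budget_s=3.0, hard_limit_s=30.0):
    """Classify one screen update by how long it took, in seconds.

    budget_s reflects the 1-to-3-second norm cited in the text;
    hard_limit_s is an assumed value, not a figure from the report.
    """
    if elapsed_s <= budget_s:
        return "ok"
    if elapsed_s <= hard_limit_s:
        return "degraded"
    # Slow enough that the screen is effectively unusable: report it as a failure.
    return "failed"
```

Under these assumed thresholds, a screen update that takes almost a minute is reported as "failed", which puts it in front of the same people and procedures that handle an outage.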
The servers’ failures triggered pager alerts to IT staff, who rebooted them. “At 15:08 EDT, IT staffers completed a warm reboot (restart) of the primary server. Startup diagnostics monitored during that reboot verified that the computer and all expected processes were running; accordingly, IT staff believed that they had successfully restarted the node and all the processes it was hosting. However, although the server and its applications were again running, the alarm system remained frozen and non-functional, even on the restarted computer. The IT staff did not confirm that the alarm system was again working properly with the control room operators.”
As with the initial fail-over, the problem here was an ineffective definition of what it means to be “up and running.” The process was running, but the application represented by that process was not doing what it should—and it was not part of the problem resolution procedure to confirm that it was.
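A stricter definition of "up and running" can be written down as an end-to-end probe: push a synthetic event through the application and confirm that the expected result comes out the other side. The sketch below uses two stand-in classes to contrast a frozen system with a working one; every name in it (`FrozenAlarmSystem`, `WorkingAlarmSystem`, `is_up`, the `SYNTHETIC-TEST` event) is an assumption made for illustration and does not describe the actual EMS software.

```python
class FrozenAlarmSystem:
    """Stand-in for a process that is alive but does no work (hypothetical)."""

    def __init__(self):
        self.raised = set()

    def process_running(self):
        return True  # the operating system still sees a live process

    def submit(self, event):
        pass  # stalled: the event is silently dropped

    def was_raised(self, event):
        return event in self.raised


class WorkingAlarmSystem(FrozenAlarmSystem):
    def submit(self, event):
        self.raised.add(event)  # the event actually becomes an alarm


def is_up(system, test_event="SYNTHETIC-TEST"):
    """Report 'up' only if a test event makes it all the way through."""
    if not system.process_running():
        return False
    system.submit(test_event)
    return system.was_raised(test_event)
```

Both stand-ins pass a process-level check, but only `WorkingAlarmSystem` passes `is_up`. A restart procedure that ends with a probe of this kind would not declare success while the alarm function is still frozen.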
I’ve previously mentioned application assurance tools like those from TeaLeaf Technology Inc. At eWEEK Labs, we’ve also seen significant improvement of late in application security analysis products like Sanctum Inc.’s AppScan. The tools are there.
What’s also needed, though, as the East Coast blackout report clearly shows, is a culture of responsibility for making sure that the system is meeting the enterprise need, and not just “working.”