The Knowledge Trek

Opinion: Recovery from failure ought to be viewed not as a crisis management process but rather as the simplest case of a system upgrade.

With at least some of his ashes soon to be placed in Earth orbit, the late James Doohan can't even be imagined to be spinning in his grave. Even so, I hate to imagine the reaction to recent events that one might have heard from the character made famous by that veteran actor. In real life, he stormed Juno Beach on D-Day, but he'll be best remembered as starship troubleshooter Montgomery Scott from three decades of the "Star Trek" saga.

Miracle-worker "Scotty," chief engineer of the Enterprise, would surely have something to say at the sight of NASA engineers agonizing over whether it was safe to trim the toenails of the space shuttle orbiter Discovery. I imagine him asking in exasperation, "If you're afraid to fix the bloody thing, what business do you have flying it?"

I ask that same question of anyone entrusting vital tasks to an enterprise (pardon the expression) IT system.

Actually, I ask five questions. How do you know it's working? How can it fail to work correctly? What can happen when it fails? How do you limit the damage while you're fixing it? And how do you fix it without breaking something else in the process? System operators ought to be able to answer those questions at a moment's notice, preferably by consulting well-indexed, regularly updated plans rather than relying on expert knowledge and their ability to improvise.

How do you know a system is working? It's tempting to say that if it's moving packets or serving up Web pages, it's working, but that's like looking at the fuel flow through the space shuttle's engines without paying attention to whether the shuttle is going in the right direction. The kind of assurance that you really need comes from products such as RealiTea from TeaLeaf Technology or Vantage from Compuware. Both of these analytic tools measure customer session response and completion rates, as well as the business impact of process problems, rather than lower-level measures of hardware function.

How can it fail to work? That's a question with many more answers than it used to have, as any nontrivial application now involves loosely coupled processes owned and managed by multiple parties. End-to-end testing with something like Segue Software's SilkPerformer, Mercury Interactive's Business Process Testing or Embarcadero Technologies' Extreme Test is the necessary response.

What are the effects of failure, and what does that imply about the acceptable procedures for recovery? The mantra here is "graceful degradation." Falling off the edge of the world, even for a few minutes, is not an option—who knows how many people will brand you as an unreliable partner and never give you a second look? Put some effort into building facilities to avoid data loss, offer basic information even if two-way services are interrupted and capture contact information to get back to a customer who finds you temporarily unable to complete a transaction.
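For readers who think better in code, here is a minimal sketch of what graceful degradation might look like in Python. Everything in it is an assumption made for illustration: the Storefront class, the BackendUnavailable exception and the in-memory lists are hypothetical stand-ins, and a real system would keep deferred orders and contact details in durable storage rather than in memory.

```python
from dataclasses import dataclass, field


class BackendUnavailable(Exception):
    """Raised by the (hypothetical) order backend when it cannot be reached."""


@dataclass
class Storefront:
    # All names here are illustrative assumptions, not a real product's API.
    backend_up: bool = True
    # Read-only information that can still be served during an outage.
    catalog_cache: dict = field(default_factory=lambda: {"widget": "Widget, $5"})
    # Stand-ins for durable storage: orders to replay, customers to call back.
    pending: list = field(default_factory=list)
    callbacks: list = field(default_factory=list)

    def _submit(self, order):
        # Simulates the two-way service that may be interrupted.
        if not self.backend_up:
            raise BackendUnavailable("order service unreachable")
        return {"status": "completed", "order": order}

    def describe(self, item):
        # Basic information stays available even when ordering does not.
        return self.catalog_cache.get(item, "No information available")

    def place_order(self, order, contact):
        try:
            return self._submit(order)
        except BackendUnavailable:
            # Avoid data loss: keep the order so it can be replayed later.
            self.pending.append(order)
            # Capture contact information so someone can follow up.
            self.callbacks.append(contact)
            return {
                "status": "deferred",
                "message": "We could not complete your order just now. "
                           "We have saved it and will contact you.",
            }
```

With backend_up set to False, a call such as Storefront(backend_up=False).place_order({"item": "widget"}, "pat@example.com") returns a "deferred" status instead of an error page, and the order and the contact address are both retained for follow-up.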

Recovery from failure ought to be viewed not as a crisis management process but rather as the simplest case of a system upgrade: that is, one that doesn't actually add new capability. Why does it help to look at it that way? Because it gets people out of the blame-avoidance game and turns recovery into an engineering problem instead of a political problem.

To perform a system upgrade without disrupting operations, you prepare by accurately identifying system dependencies and formally specifying and documenting interfaces; you maintain current information on licenses and their management and other administrative needs. Merely bringing a system back online should then be at least as easy as adding or changing functions of that system.

With all due respect to the legend of Scotty, I also remember the legendary tale of three Chinese brothers, all of them physicians. The most famous of the three cured illnesses and became personal physician to the emperor, but he honored and deferred to his brothers, although they were much less well known, because they had greater skill in preventing disease.

Miracle workers like Scotty make for better TV drama, but in the real world, I'd rather have the kind of professional who makes disaster handling look routine.


Technology Editor Peter Coffee can be reached at
