The Knowledge Trek

By Peter Coffee  |  Posted 2005-08-08

Opinion: Recovery from failure ought to be viewed not as a crisis management process but rather as the simplest case of a system upgrade.

With at least some of his ashes soon to be placed in Earth orbit, the late James Doohan can't even be imagined to be spinning in his grave. Even so, I hate to imagine the reaction to recent events that one might have heard from the character made famous by that veteran actor. In real life, he stormed Juno Beach on D-Day, but he'll be best remembered as starship troubleshooter Montgomery Scott from three decades of the "Star Trek" saga.

Miracle-worker "Scotty," chief engineer of the Enterprise, would surely have something to say at the sight of NASA engineers agonizing over whether it was safe to trim the toenails of the space shuttle orbiter Discovery. I imagine him asking in exasperation, "If you're afraid to fix the bloody thing, what business do you have flying it?"

I ask that same question of anyone entrusting vital tasks to an enterprise (pardon the expression) IT system.

Actually, I ask five questions. How do you know it's working? How can it fail to work correctly? What can happen when it fails? How do you limit the damage while you're fixing it? And how do you fix it without breaking something else in the process? System operators ought to be able to answer those questions on a moment's notice, preferably by consulting well-indexed, regularly updated plans rather than relying on expert knowledge and their ability to improvise.

How do you know a system is working? It's tempting to say that if it's moving packets or serving up Web pages, it's working, but that's like looking at the fuel flow through the space shuttle's engines without paying attention to whether the shuttle is going in the right direction. The kind of assurance that you really need comes from products such as RealiTea from TeaLeaf Technology or Vantage from Compuware. Both of these analytic tools measure customer session response and completion rates, as well as the business impact of process problems, rather than lower-level measures of hardware function.

How can it fail to work? That's a question with many more answers than it used to have, as any nontrivial application now involves loosely coupled processes owned and managed by multiple parties. End-to-end testing with something like Segue Software's SilkPerformer, Mercury Interactive's Business Process Testing or Embarcadero Technologies' Extreme Test is the necessary response.

What are the effects of failure, and what does that imply about the acceptable procedures for recovery? The mantra here is "graceful degradation." Falling off the edge of the world, even for a few minutes, is not an option—who knows how many people will brand you as an unreliable partner and never give you a second look? Put some effort into building facilities to avoid data loss, offer basic information even if two-way services are interrupted and capture contact information to get back to a customer who finds you temporarily unable to complete a transaction.
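As a rough illustration of that advice, here is a minimal Python sketch of a service front end that degrades gracefully. Everything in it is hypothetical: the `OrderDesk` class, the `backend_up` flag standing in for a real health check, and the in-memory `pending` list standing in for durable storage. It shows only the shape of the idea: when the transaction service is down, keep the request and the customer's contact details instead of dropping them, then replay the work after recovery.

```python
from dataclasses import dataclass, field


class BackendUnavailable(Exception):
    """Raised by the hypothetical order backend when it cannot be reached."""


@dataclass
class OrderDesk:
    # Stand-in for a real health check on the transaction service.
    backend_up: bool = True
    # Requests captured during an outage (a real system would persist these).
    pending: list = field(default_factory=list)

    def _submit(self, order):
        # Stand-in for the real two-way transaction service.
        if not self.backend_up:
            raise BackendUnavailable("order service unreachable")
        return {"status": "completed", "order": order}

    def place_order(self, order, contact):
        try:
            return self._submit(order)
        except BackendUnavailable:
            # Degrade gracefully: keep the request and how to reach the customer.
            self.pending.append({"order": order, "contact": contact})
            return {
                "status": "deferred",
                "message": "We could not complete this now; we will contact you.",
            }

    def replay_pending(self):
        # Once service is restored, work through the captured requests in order.
        results = []
        while self.pending and self.backend_up:
            item = self.pending.pop(0)
            results.append(self._submit(item["order"]))
        return results
```

The design choice worth noting is that the failure path returns a useful answer to the customer rather than an error page, and nothing the customer entered is thrown away.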

Recovery from failure ought to be viewed not as a crisis management process but rather as the simplest case of a system upgrade: one that doesn't actually add new capability. Why does it help to look at it that way? Because it gets people out of the blame-avoidance game and turns recovery into an engineering problem instead of a political problem.

To perform a system upgrade without disrupting operations, you prepare by accurately identifying system dependencies and formally specifying and documenting interfaces; you maintain current information on licenses and their management and other administrative needs. Merely bringing a system back online should then be at least as easy as adding or changing functions of that system.
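One way to make "accurately identifying system dependencies" concrete is to record them as data and derive a safe bring-up order from that record. The sketch below does this with Python's standard `graphlib` module (Python 3.9 or later). The component names and the `DEPENDENCIES` map are invented for illustration; a real inventory would come from your own documentation.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each component lists what must be running first.
DEPENDENCIES = {
    "database": set(),
    "license_server": set(),
    "app_server": {"database", "license_server"},
    "web_frontend": {"app_server"},
}


def startup_order(dependencies):
    """Return components in an order that starts every dependency first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err
```

Calling `startup_order(DEPENDENCIES)` yields a list in which `database` and `license_server` appear before `app_server`, and `app_server` before `web_frontend`. The same list, reversed, is a reasonable shutdown order, and a `ValueError` here is an early warning that the documented dependencies contradict each other.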

With all due respect to the legend of Scotty, I also remember the legendary tale of three Chinese brothers—all of them physicians. The most famous of the three cured illnesses and became personal physician to the emperor—but he honored and deferred to his brothers, although they were much less well known, because they had greater skills of preventing disease.

Miracle workers like Scotty make for better TV drama, but in the real world, I'd rather have the kind of professional who makes disaster handling look routine.

Peter Coffee is Director of Platform Research at, where he serves as a liaison with the developer community to define the opportunity and clarify developers' technical requirements on the company's evolving Apex Platform. Peter previously spent 18 years with eWEEK (formerly PC Week), the national news magazine of enterprise technology practice, where he reviewed software development tools and methods and wrote regular columns on emerging technologies and professional community issues. Before he began writing full-time in 1989, Peter spent eleven years in technical and management positions at Exxon and The Aerospace Corporation, including management of the latter company's first desktop computing planning team and applied research in applications of artificial intelligence techniques. He holds an engineering degree from MIT and an MBA from Pepperdine University, and he has held teaching appointments in computer science, business analytics and information systems management at Pepperdine, UCLA, and Chapman College.
