Backup Systems Have Bad Days, Too

Peter Coffee: Operator training and fault simulation are necessary parts of the picture.

MIT students were startled to see dining-hall charges appearing, late last month, on their records for dates and times when they knew they had not used those facilities. It turned out that several cash registers, having lost their network connections, had quietly been accumulating charges for several months and had posted the charges when their connections were restored. "There will be educational sessions held for vendors so they will be able to recognize when cash registers are offline," reported campus newspaper The Tech.

My second thought, on reading this, was to wonder if any student had inquired as to why an expected charge had not shown up on a bill, but that's not really fair: having just spent four days on the campus, I'm newly reminded that a student's life doesn't lend itself to remembering when, where or even if a given meal was eaten on any given day. (In those four days, I lost three pounds.)

My first thought, though, and the reason I share this item with you in the first place, was that fault tolerance can be taken to (pardon the expression) a fault. When a fault-tolerant system is working correctly, how do you know if it's currently tolerating faults? How do you know how much reserve actually remains before system functions start to fail?

If a primary system has failed, and the backup system is doing its job, it's reasonable to expect the system to provide a notification that a properly trained operator can recognize, although it's far from certain, as evidenced by those cash registers, that operators will have that training or that they'll remember what to do when they see that alert. But what if the backup system fails first? When, for example, did you last check the air pressure in the spare tire in the trunk of your car?

In more of an IT domain, when was the last time you simulated a failure of a primary data storage system and measured the time required to get back to full operation? Or even tried restoring a disk from a backup tape to make sure that the entire chain, end to end, is actually capturing data in a form that it can use?
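A restore drill of that kind is easy to script. The sketch below is a hypothetical stand-in, not any vendor's tooling: Python's shutil.copytree plays the role of whatever restore command your backup system actually provides. The pattern is what matters, that you time the restore and then verify every restored file's contents, rather than trusting that the job "completed."

```python
import filecmp
import shutil
import tempfile
import time
from pathlib import Path

def restore_drill(backup_dir: Path, restore_dir: Path) -> float:
    """Time a full restore, then verify the restored tree byte for byte.

    shutil.copytree stands in here for a real restore command.
    """
    start = time.monotonic()
    shutil.copytree(backup_dir, restore_dir)
    elapsed = time.monotonic() - start
    for original in backup_dir.rglob("*"):
        if original.is_file():
            restored = restore_dir / original.relative_to(backup_dir)
            # shallow=False forces a full content comparison, not just a stat() check
            assert filecmp.cmp(original, restored, shallow=False), original
    return elapsed

# Run the drill against throwaway data
work = Path(tempfile.mkdtemp())
backup = work / "backup"
backup.mkdir()
(backup / "ledger.txt").write_text("meal charges, September\n")
seconds = restore_drill(backup, work / "restored")
print(f"restore completed and verified in {seconds:.3f} seconds")
```

Run on a schedule, a drill like this answers both questions at once: how long recovery takes, and whether the chain is capturing data in a usable form at all.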

When I'm working with critical data, I've found that the system often does too good a job of hiding failures. When I make a backup copy of a recordable CD, for example, it's pointless to check the files on the second copy if I don't first search out and flush the cache files that were made by various data-browsing applications when I checked the first copy. Without that manual step, I'm just looking at the cache. A cache is a good thing, but sometimes it can do its job too well.
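One way around that false assurance, sketched here as a minimal example rather than a prescription, is to verify a copy by recomputing checksums of both files straight from disk, so that no application's cached view of the data stands between you and the bytes. (An operating-system-level cache can still intervene; remounting the medium is the surer cure.)

```python
import hashlib
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """Hash a file's contents, read directly from the file itself."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(original: Path, copy: Path) -> bool:
    """True only when both files hold identical bytes."""
    return checksum(original) == checksum(copy)

# Quick demonstration with throwaway files
d = Path(tempfile.mkdtemp())
(d / "a.iso").write_bytes(b"session data")
(d / "b.iso").write_bytes(b"session data")
print(verify_copy(d / "a.iso", d / "b.iso"))  # prints True
```

The design point is that verification recomputes everything from the source files on each run; nothing is remembered from the earlier check of the first copy, so there is no stale answer to be fooled by.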

Do you know where false assurance can arise in the systems you use?

Tell me about your fault tolerance.