Backup Systems Have Bad Days, Too
Peter Coffee: Operator training and fault simulation are necessary parts of the picture.
MIT students were startled late last month to see dining-hall charges appearing on their records for dates and times when they knew they had not used those facilities. It turned out that several cash registers, having lost their network connections, had quietly been accumulating charges for several months and had posted the charges when their connections were restored. "There will be educational sessions held for vendors so they will be able to recognize when cash registers are offline," reported campus newspaper The Tech.
My second thought, on reading this, was to wonder if any student had inquired as to why an expected charge had not shown up on a bill, but that's not really fair: having just spent four days on the campus, I'm newly reminded that a student's life doesn't lend itself to remembering when, where or even if a given meal was eaten on any given day. (In those four days, I lost three pounds.)
My first thought, though, and the reason I share this item with you in the first place, was that fault tolerance can be taken to (pardon the expression) a fault. When a fault-tolerant system is working correctly, how do you know if it's currently tolerating faults? How do you know how much reserve remains before system functions will actually start to fail?
If a primary system has failed, and the backup system is doing its job, it's reasonable to expect the system to provide a notification that a properly trained operator can recognize--although it's far from certain, as evidenced by those cash registers, that operators will have that training or that they'll remember what to do when they see that alert. But what if the backup system fails first? When, for example, did you last check the air pressure in the spare tire in the trunk of your car?
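For readers who think better in code, here is a minimal sketch of what asking those questions might look like for a store-and-forward cash register. Everything in it is an assumption made for illustration: the class name StoreAndForwardRegister, the capacity limit, the status() report and the outage_drill() check are hypothetical, and nothing here describes how the MIT registers actually work.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StoreAndForwardRegister:
    """Hypothetical register that queues charges while its link is down."""
    capacity: int                       # assumed limit on queued charges
    online: bool = True
    pending: deque = field(default_factory=deque)
    posted: list = field(default_factory=list)

    def charge(self, amount_cents: int) -> None:
        if self.online:
            self.posted.append(amount_cents)
        elif len(self.pending) < self.capacity:
            self.pending.append(amount_cents)   # the fault is being tolerated
        else:
            raise OverflowError("offline buffer full; charge refused")

    def reconnect(self) -> int:
        """Restore the link, post everything queued, return the count posted."""
        self.online = True
        flushed = len(self.pending)
        self.posted.extend(self.pending)
        self.pending.clear()
        return flushed

    def status(self) -> dict:
        """Report whether a fault is being tolerated and how much reserve is left."""
        return {
            "tolerating_fault": not self.online,
            "pending": len(self.pending),
            "reserve_remaining": self.capacity - len(self.pending),
        }


def outage_drill(register: StoreAndForwardRegister) -> bool:
    """Simulate an outage on a copy, so the backup path is exercised
    without touching live data (the spare-tire pressure check)."""
    trial = copy.deepcopy(register)
    trial.online = False
    before = trial.status()["reserve_remaining"]
    if before == 0:
        return False                    # no reserve left to fall back on
    trial.charge(1)
    queued = trial.status()["reserve_remaining"] == before - 1
    return queued and trial.reconnect() >= 1
```

The point of status() is that the degraded mode is visible to anyone who asks, rather than discovered months later on a bill; the point of outage_drill() is that the backup gets tested before the day it is needed. A real system would also need to push that status in front of an operator, which is where the training comes in.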