It's bad when something refuses to do its job, but the scene is set for greater catastrophe when something appears to be working and isn't.
This fact of life is beyond the understanding, it seems, of all too many software developers and IT system builders, who fail to consider the ways that the world may fall short of their expectations. Or perhaps they just don't bother to detect those differences, or to warn users of the resulting risks of non-performance.
This blind spot becomes a more serious problem as enterprise systems become more distributed—not just geographically, but more importantly in ownership and control of IT assets that function as paid services. I suspect that experienced developers assume, subconsciously, that users are able to watch the blinking lights to confirm that something is actually happening; the systems that we propose to build tomorrow, and that we attempt to build today, require more explicit self-assessment and verification.
Data backup, or rather non-backup, is the most vicious example of what I've sometimes called “success-oriented design”: that is, the assumption that things will work, and that this need not be confirmed at the time that an operation takes place. I remember a conference in 1988 (the second annual PC Tech Journal confab, in San Francisco, for the benefit of my fellow dinosaurs) when I first heard a user's saga of faithfully making regular backups…only to discover, the first time it really mattered, that the resulting tapes could not be successfully restored.
A driver update, he speculated, had resulted in the system continuing to go through all the motions, but to no useful purpose. Neither the backup application, nor the procedures that his department had devised for its use, included any verification that valid and effective backups were actually being produced.
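The missing verification step need not be elaborate. What follows is a minimal sketch, not a description of any particular backup product: it assumes a directory archived with Python's standard `tarfile` module, and the names `backup_and_verify` and `sha256_of` are my own inventions for illustration. The idea is simply to do a trial restore into a scratch directory right after the backup is made, and to compare checksums against the originals.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source_dir: Path, archive_path: Path) -> list[str]:
    """Archive source_dir, then restore it to a scratch directory and compare.

    Returns the relative paths of files that are missing or differ after the
    trial restore; an empty list means every file came back intact.
    Assumes archive_path lies outside source_dir.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")

    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        # The trial restore: go through the real motions, not just the backup.
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        for original in source_dir.rglob("*"):
            if not original.is_file():
                continue
            rel = original.relative_to(source_dir)
            restored = Path(scratch) / rel
            if not restored.is_file() or sha256_of(restored) != sha256_of(original):
                problems.append(str(rel))
    return problems
```

A non-empty result is the warning that the user in that story never got. A trial restore on the same machine still proves nothing about restoring elsewhere, so this is a floor, not a ceiling.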
Just to get some sense of the measure of this continuing problem, I Googled the search term “backup” along with the exact phrase “unable to restore”: I got 10,700 hits. Some of them may seem like obsolete, individual-user issues like “Windows 98 unable to restore a file from multiple diskettes.” Others, though, have a more alarmingly enterprise-level presence, like this one: “Any backup job containing EFS encrypted files that is being restored to a FAT/FAT32 volume or a previous version of NTFS (i.e. NTFS in NT 3.1 – NT 4.0) will generate an error…It is only possible to restore encrypted backup sets to NTFS 5.0 volumes.”
I mention this particular example because it brings up an important point. If a backup is needed because the primary hardware is down, it's important that the backup be usable on secondary hardware, which may not be running the very latest version of a software platform. Only in the laboratory do we have the luxury of saying that a proper experiment changes only one thing at a time: real-world survival tests will typically hand us a fistful of simultaneous misfortunes, and it's important that our survival tools be able to handle a certain degree of stone-age regression.
I'm at a loss to explain why any application doesn't bother to audit its own contracts, so to speak, by ensuring that its requirements are met before it wastes time and destroys valuable work. For example, Apple's iMovie appears to work just fine when a project is created on an external FireWire drive, but I wasted quite a bit of time the other day when iMovie failed to save the results of an hour of careful editing. Apparently, it's known to at least some users that iMovie requires the Mac's own file system for full function, while my external drives are FAT32-partitioned for maximum flexibility in moving data among my various machines: that requirement is rather deeply buried, though, in Apple's support forum, and it may surprise my fellow multi-platform videographers. And Apple, I regret to observe, is on thin ice at the moment when it comes to the subject of protecting users' data.
My larger point, which I hope will be taken to heart by other application developers, is that iMovie should have detected and notified me of the problem—instead of merely pretending to save my project when I gave that command. If the shoe fits, wear it—and start walking in a better direction.
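To make "detect and notify" concrete, here is a rough sketch of a save routine that audits its requirements first and refuses to report success until it has read the data back. It is not an account of what iMovie does or of why it failed for me; `verified_save` and `SaveFailed` are hypothetical names, and the three checks shown (directory exists, is writable, has enough free space) stand in for whatever an application's real requirements happen to be.

```python
import hashlib
import os
import shutil
from pathlib import Path


class SaveFailed(Exception):
    """Raised when a save cannot be confirmed, so the caller can warn the user."""


def verified_save(path: Path, data: bytes) -> None:
    """Write data to path; return normally only if it can be read back intact."""
    folder = path.parent
    # Audit the "contract" first: check requirements before doing any work.
    if not folder.is_dir():
        raise SaveFailed(f"{folder} is not an existing directory")
    if not os.access(folder, os.W_OK):
        raise SaveFailed(f"{folder} is not writable")
    if shutil.disk_usage(folder).free < len(data):
        raise SaveFailed(f"not enough free space in {folder}")

    tmp = path.with_name(path.name + ".tmp")
    try:
        with tmp.open("wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the bytes to the device
        # Read back what was written and compare digests.
        if hashlib.sha256(tmp.read_bytes()).digest() != hashlib.sha256(data).digest():
            raise SaveFailed(f"read-back of {tmp} does not match the data written")
        os.replace(tmp, path)  # swap the verified copy into place
    except OSError as exc:
        raise SaveFailed(f"could not save {path}: {exc}") from exc
    finally:
        tmp.unlink(missing_ok=True)  # no-op once the file has been moved
```

The read-back may be served from the operating system's cache, so it is weaker than proof that the bits reached the disk. The point is the shape: every path through the function ends either in a confirmed save or in an exception that the user interface can turn into a warning.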
It's up to the developer, though, to decide whether the correct approach is fault prevention or fault tolerance. For example, many developers take it for granted that TCP is the protocol of choice on the Internet, presumably because it guarantees that data is delivered and arrives in order, but there's also something to be said for the oft-dismissed UDP alternative that's sometimes called the “message in a bottle” protocol. With its minimal overhead, broadcast capabilities and well-defined data boundaries, UDP gives developers some useful advantages, as long as they also accept the responsibility for making sure that what's supposed to happen actually does happen.
And that's a responsibility that should always be taken to heart.
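For the UDP case, accepting that responsibility might look something like the sketch below: the sender treats a datagram as delivered only when the receiver says so, and retries otherwise. The one-byte sequence number, the `b"ACK"` reply format, and the names `send_with_ack` and `serve_one` are all assumptions made up for this illustration, not part of UDP or of any standard protocol built on it.

```python
import socket


def send_with_ack(payload: bytes, addr: tuple[str, int],
                  retries: int = 3, timeout: float = 0.5) -> bool:
    """Send one datagram and wait for an application-level acknowledgment.

    Returns True once the receiver replies with b"ACK" plus the sequence byte,
    False if every attempt times out. UDP itself promises neither outcome.
    """
    seq = b"\x01"  # hypothetical one-byte sequence number for this sketch
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(seq + payload, addr)
            try:
                reply, _sender = sock.recvfrom(64)
            except socket.timeout:
                continue  # lost datagram or lost ACK: try again
            if reply == b"ACK" + seq:
                return True
    return False


def serve_one(sock: socket.socket) -> bytes:
    """Receive one datagram on a bound socket, acknowledge it, return the payload."""
    packet, sender = sock.recvfrom(2048)
    sock.sendto(b"ACK" + packet[:1], sender)
    return packet[1:]
```

A `False` return is the whole point: the caller learns that the bottle may never have washed ashore, instead of assuming that it did. A real design would also need the receiver to discard duplicates, since a lost acknowledgment causes the same message to be sent twice.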
What other responsibilities should systems take more seriously?