Dealing with Complexity
I've seen this kind of thing happen many times. And that's when we programmers get a phone call at 3 in the morning because the operations team couldn't get the software up and running again. And then we have to either connect remotely or drag our butts into the office in the middle of the night, load up on caffeine, and track down the problem. And then we find exactly what the problem is and how to fix it. In our example in particular, it turns out the programmer would have been better off using a set of classes built into the .NET Framework that allow for read and write locks on files. These classes are easy to use and take only a couple of lines of code (a sketch appears below). Had the programmer used these, the problem wouldn't have occurred.

Oh, really? Well, there's a good litmus test for determining whether the code is right: Does it crash? Good software doesn't crash. Good software doesn't cause phone calls in the middle of the night where panicked people have to try to figure out why the software crashed. I've expressed this litmus test to others before, but was met with severe resistance from other programmers. People don't like criticism. But the fact is, perfect software doesn't crash. The reality is that with today's massive systems it's nearly impossible to get every single bug out. But it's certainly within reason to get as many bugs out as possible, minimizing crashes rather than hiding behind the excuse that "bugs are inevitable and we should live with them." And writing code for a Web server that crashes when two users connect to it simultaneously is unacceptable.

Handled correctly, this is preventable: a manager can teach his or her team not to allow such bugs in the first place and can oversee the process to catch the ones that appear anyway. How can this be done? First, the team (and the QA folks) must do their job in testing. It's easy to run through a test and see that the program works fine when only one user is accessing the software; it's also easy for you, as the manager, to see that it's working wonderfully and to feel good about it. But it's not so easy to run a real stress test where hundreds or even thousands of threads are running simultaneously, all trying to access and manipulate the data. That's when you'll discover the real problems, the kind that can bring a system to its knees. Running these kinds of tests requires a QA team of testers who know their tools and know how to simulate such conditions. Further, it's important that the coders be aware of the issues so that by the time their code gets to the QA team, it's already set up to handle high-load situations.

That brings me to the second point: The developers must be trained in how to write code that handles such situations correctly so the system doesn't crash. I said that some bugs will creep in, and as much as I don't want to live with that situation, I suppose I accept it as fact. (And your programmers, by the way, should have a similar attitude, rather than just shrugging and saying bugs are normal. Bugs are unacceptable, and we must stop as many as possible, but occasionally we have to accept that a couple might slip through.) Thus, at a minimum, the programmers must be aware of what can go wrong and must know how to write code that handles those situations correctly. And that means writing code that is "thread-safe" and scalable (meaning it can run not only on a single-user basis but also efficiently when hundreds or thousands of people are using it simultaneously, even when divided up onto multiple servers).
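The text doesn't name the file-locking classes, but one plausible reading (my assumption, not the author's stated solution) is the FileStream constructor's sharing flags, which really do take a read or write lock in a couple of lines. A minimal sketch, with a made-up file name:

using System.IO;

class FileLockSketch
{
    static void Main()
    {
        // Exclusive write lock: FileShare.None means no other thread or
        // process can open the file until this stream is disposed.
        // ("data.txt" is an illustrative name, not from the text.)
        using (var writer = new FileStream("data.txt",
            FileMode.OpenOrCreate, FileAccess.Write, FileShare.None))
        {
            // Writes here can't collide with concurrent readers; a second
            // open attempt throws IOException instead of corrupting data.
        }

        // Shared read lock: any number of readers may open the file at once,
        // but FileShare.Read keeps writers out until all readers finish.
        using (var reader = new FileStream("data.txt",
            FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            // Reads here see a stable file; no writer can slip in mid-read.
        }
    }
}

Opened this way, a second conflicting request fails cleanly with an IOException rather than silently trashing the file and crashing the server later.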
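As for stress testing, even a crude harness makes the point. This sketch (my own illustration; none of these names come from the text) throws a thousand parallel tasks at a shared counter. The unsynchronized version loses updates on nearly every run, exactly the kind of bug a single-user walkthrough never shows, while the locked version always comes out exact:

using System;
using System.Threading.Tasks;

class StressSketch
{
    static int _hits;                         // shared state, deliberately unprotected
    static int _safeHits;                     // same counter, guarded by a lock
    static readonly object _sync = new object();

    static void Main()
    {
        // Simulate 1,000 simultaneous users, each touching the data 100 times.
        Parallel.For(0, 1000, user =>
        {
            for (int i = 0; i < 100; i++)
            {
                _hits++;                      // racy read-modify-write; updates get lost
                lock (_sync) { _safeHits++; } // the thread-safe version
            }
        });

        Console.WriteLine($"Unsynchronized: {_hits:N0} (expected 100,000)");
        Console.WriteLine($"With lock:      {_safeHits:N0}");
    }
}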
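And "scalable" deserves one more sketch. A single coarse lock keeps code correct but serializes every request; for read-heavy shared data, .NET's ReaderWriterLockSlim lets any number of readers proceed in parallel while writers still get exclusive access. The cache class here is hypothetical; the text prescribes no particular design.

using System.Collections.Generic;
using System.Threading;

// A hypothetical per-session cache of the sort a Web server might keep.
class SessionCache
{
    private readonly Dictionary<string, string> _data = new Dictionary<string, string>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    // Any number of threads can read at the same time; the hot path
    // is not serialized the way a plain lock would serialize it.
    public string Get(string key)
    {
        _lock.EnterReadLock();
        try { return _data.TryGetValue(key, out var value) ? value : null; }
        finally { _lock.ExitReadLock(); }
    }

    // Writers wait for in-flight readers to drain, then run exclusively.
    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();
        try { _data[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}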
As a programmer, I remember seeing such mess-ups in code and complaining about them to others in the company. One tech writer friend of mine laughed and said, "Oh, you guys each have your own way of doing things, and neither is better than the other."