This Single Point of Failure Was a Disaster Waiting to Happen
Once again, it seems Microsoft's misfortunes are to serve as an object lesson to the rest of the online world. In the wake of errors that left most online users unable to locate Microsoft's main Web sites in late January, many businesses are scrambling to find defenses and response plans to ensure their Web sites are less vulnerable. While I applaud any effort to shore up defenses, few people have taken a particularly close look at what actually happened, and as a result, many are learning the wrong lessons.
It is important to realize that, contrary to popular belief, Microsoft's Web sites did not crash or go offline Jan. 23-26. Given the information currently available, it appears that all of the company's major Web servers were working just fine, and were fully accessible to the public. The problem was that almost no one knew how to find them.
Microsoft maintained a total of four redundant Domain Name System (DNS) servers, which converted domain names like microsoft.com into the numerical IP addresses recognized by the Net's traffic-routing system. Unfortunately, the company placed all four servers in the same subnet within its corporate network, forcing all traffic to or from any of these servers to pass through a single router.
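To make the subnet problem concrete, here is a minimal sketch of how one might check whether a set of name servers all sit in the same subnet. The function names, the /24 prefix length used as a stand-in for "same subnet," and the sample addresses (taken from the reserved documentation ranges, not Microsoft's actual servers) are all assumptions for illustration.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(addresses, prefix_len=24):
    """Group IPv4 addresses by the subnet they fall in.

    prefix_len is an assumption: a /24 is used here as a rough proxy
    for "behind the same router"; real topologies may differ.
    """
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its network back
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(network)].append(addr)
    return dict(groups)


def shares_single_subnet(addresses, prefix_len=24):
    """Return True if every address lands in one and the same subnet."""
    return len(group_by_subnet(addresses, prefix_len)) == 1


# Hypothetical name-server addresses (documentation ranges only).
clustered = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
spread = ["192.0.2.10", "198.51.100.10", "203.0.113.10", "203.0.113.11"]

print(shares_single_subnet(clustered))  # all four share 192.0.2.0/24
print(shares_single_subnet(spread))     # spread across three subnets
```

A check like this only looks at addresses; servers in different subnets can still share an upstream router, so it is a first warning sign rather than proof of real redundancy.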
On Jan. 23, a Microsoft technician made an error that temporarily disabled the critical router. The DNS servers, though functioning properly, were cut off from the Internet. Microsoft repaired the faulty router within hours, but the underlying network architecture problem remained and, as a result of the outage, quickly became public knowledge. Online vandals seized upon this information, and used it to attack the router by flooding it with spurious traffic, once again cutting off the DNS servers.
> Lesson 1: Piecemeal Approaches to Security Don't Work.
Networks are composed of many different elements, most of which rely upon each other to function properly. A skilled attacker will hunt down the most exposed element and use it to bring down the entire structure.
The vandals in this case didn't bother with Microsoft's Web sites or even their DNS servers, all of which are heavily firewalled and load-balanced to ward off attacks. Disabling the machine that controlled the traffic flow between the DNS servers and the Internet was enough to cause a kind of chain reaction.
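The hunt for the most exposed element can be sketched in a few lines: model the network as a graph, remove one device at a time, and see whether the outside world can still reach any DNS server. Everything below is a simplified illustration; the node names and both topologies are hypothetical and are not a description of Microsoft's actual network.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) link pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start, removed=None):
    """Breadth-first search from start, treating `removed` as failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph, source, targets):
    """List devices whose loss alone cuts source off from every target."""
    spofs = []
    for node in graph:
        if node == source or node in targets:
            continue
        still_up = reachable(graph, source, removed=node)
        if not any(target in still_up for target in targets):
            spofs.append(node)
    return spofs


dns_servers = {"dns1", "dns2", "dns3", "dns4"}

# Hypothetical layout: redundant edge links, but one router in front of all DNS.
flat = build_graph([
    ("internet", "edge_a"), ("internet", "edge_b"),
    ("edge_a", "dns_router"), ("edge_b", "dns_router"),
    ("dns_router", "dns1"), ("dns_router", "dns2"),
    ("dns_router", "dns3"), ("dns_router", "dns4"),
])

# Hypothetical alternative: the DNS servers are split behind two routers.
split = build_graph([
    ("internet", "edge_a"), ("internet", "edge_b"),
    ("edge_a", "router_1"), ("edge_b", "router_1"),
    ("edge_a", "router_2"), ("edge_b", "router_2"),
    ("router_1", "dns1"), ("router_1", "dns2"),
    ("router_2", "dns3"), ("router_2", "dns4"),
])

print(single_points_of_failure(flat, "internet", dns_servers))
print(single_points_of_failure(split, "internet", dns_servers))
```

In the first layout the lone router shows up as the weak link even though there are four DNS servers and two edge connections; in the second, no single device takes every server offline.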
> Lesson 2: Good Security is About Good Planning, Not Fancy Technology.
As much as I love to bash Microsoft, none of their products or technology was to blame here; as far as I am able to determine, all of their software worked entirely as intended. The real problem was the decision to structure their network such that so much relied on the proper functioning of a single router. The poor planning that created this single point of failure was a disaster waiting to happen, and one that no amount of fancy technology could have prevented.