Lessons from Facebook's Outage

News Analysis: The Sept. 23 outage at Facebook was caused by an automated system meant to correct invalid configuration entries. The flaw in that approach surfaced when the system loaded an invalid entry from what was assumed to be a store of valid ones. Although some users may have felt that Facebook let them down, they got what they paid for; systems management is still very much an art rather than a science, and this outage proved once again that software can cope only with the inputs its designers expect.

I wasn't affected by the Facebook outage of Sept. 23, and I feel sorry for anyone who believes they were let down by the company. That's not because the company owed these people access to their data. It's because anyone whose happiness is based on the availability of Facebook, Twitter or any other Website either works for that company or needs to get a life. Although these companies' business models are predicated on 24/7 availability, that's just an unreasonable expectation, even today.

I'll admit that I'm guilty of abusing the Web for the purpose of feeding my almost infinite information jones. When I roll out of bed on any given morning, I head to the computer to see what's happened in the world since the previous night, and I find myself visiting a range of sites; the normal routine includes the sites of traditional publishers such as The New York Times as well as less familiar sites such as the Saginaw News. Throw in some Web-only sites that specialize in IT news and a handful of webcomics, and only then am I ready to face the day.

I expect these sites to be available when I want them throughout the day, and as a rule, their track record is impressive. But I know better than to expect perfection; anyone seeking that is going to be disappointed in this lifetime, and probably the next as well. Even the best automated system is designed by human beings, and humans make mistakes. Sometimes a mistake cuts you off from your family photos for a few hours, and sometimes it causes the East Coast's power grid to collapse.

Maybe I've spent too much time in IT, and my expectations are therefore much lower than those of the millions who think that Websites "just happen." But I know how difficult it can be to provide scalability and availability. When one of my favorite sites is having a bad day, I may curse for a bit, but I usually stop when I think about what the poor slobs responding to the outage are going through. In pre-Internet days, I had a routine for responding to outages in my server room that included an important step: find my boss and tell him to keep his boss off my back until things were fixed.

The fault with Facebook's site was another example of why automated systems need to be tested under all conceivable error conditions before they're put into production. The company had deployed a configuration verification system designed to check system caches for incorrect values and replace them with supposedly "good" values from a persistent store. The problem in this case was that the values in the persistent store were themselves incorrect.
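To see why that defeats the whole scheme, consider a minimal sketch (hypothetical code, not Facebook's; the function and variable names are my own invention): a verifier that "repairs" a cache entry by re-fetching it from the persistent store gains nothing when the store itself holds the bad value, because every subsequent read finds the same invalid entry and triggers another round trip.

```python
# Hypothetical illustration of the failure mode, not Facebook's code.
# The "repair" step re-fetches from the persistent store, but the
# store's own copy is bad, so the cache is never actually fixed.

def fetch_from_store(key):
    # Assume the persistent store was updated with a bad value.
    return "BAD_VALUE"

def is_valid(value):
    return value != "BAD_VALUE"

cache = {"config_key": "BAD_VALUE"}
store_queries = 0

def get_config(key):
    global store_queries
    value = cache[key]
    if not is_valid(value):
        # The "fix": replace the cached entry with the store's copy.
        store_queries += 1
        value = fetch_from_store(key)
        cache[key] = value   # still invalid, so the next read repeats
    return value             # the whole round trip

for _ in range(5):           # five reads produce five store queries
    get_config("config_key")

print(store_queries)         # -> 5: every single read hits the store
```

Scale that loop up to every client of a site Facebook's size, and the persistent store takes a query for essentially every cache read, which is exactly the storm described next.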

This led to a feedback loop that crippled a database cluster by throwing hundreds of thousands of queries at it every second. According to a blog post from Facebook Engineering Director Robert Johnson, the only way to fix the problem was to cut off all requests to the database cluster, which in effect shut down Facebook's site. After disabling the automated configuration checker, the company was able to allow users back onto the site, its engineers having learned a lesson in foreseeing the unforeseeable.
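One safeguard the incident suggests can be sketched in a few lines (again hypothetical, under my own assumptions, not a description of what Facebook actually deployed): if the value fetched from the persistent store fails the very validation that triggered the fetch, the checker should fail fast and alert a human instead of retrying forever.

```python
# Hedged sketch of a guard against the feedback loop: never retry a
# repair whose replacement value fails the same validity check.

def get_config_safe(key, cache, fetch_from_store, is_valid):
    value = cache.get(key)
    if is_valid(value):
        return value
    replacement = fetch_from_store(key)
    if not is_valid(replacement):
        # The store's copy is also bad; re-fetching cannot help, so
        # fail fast rather than feed a query storm against the store.
        raise RuntimeError(f"no valid value for {key!r}; not retrying")
    cache[key] = replacement
    return replacement
```

The design choice is simply to make "the source of truth is also wrong" a first-class error condition rather than something the automation loops on, which is one way to foresee the unforeseeable.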

The Facebook outage lasted a couple of hours, and no user data appears to have been lost; this was by no means a disaster, just a much-needed dose of humility for Facebook's engineering team.