The outage that darkened Facebook for two and a half hours Sept. 23 was caused by a software flaw in its database clusters, the company confirmed.
Facebook went down around 1:30 p.m. EDT Thursday and did not come back up until 4 p.m. EDT. The company called it the “worst outage we’ve had in over four years.”
Some of Facebook’s 500 million-plus users took to Twitter during the outage, wondering what they would do without access to the photos, links, videos and other content they had shared on the massive social network.
“The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition,” said Robert Johnson, director of software engineering at Facebook, in a blog post.
“An automated system for verifying configuration values ended up causing much more damage than it fixed.”
One fault cascaded into many, forcing Facebook to halt traffic to the failing database cluster. The company then slowly allowed users back onto the Website.
The company turned off the automated system that verifies configuration values and is looking to redesign it along the lines of other systems at the company.
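The article does not spell out the mechanism, but a generic way an automated "fixer" can amplify a fault is by having every client re-query a shared store whenever it sees a value it considers invalid. The toy simulation below is only a sketch under that assumption. The names `fetch_from_store` and `simulate`, the client count and the backoff policy are all hypothetical and are not taken from Facebook's systems.

```python
def fetch_from_store():
    """Hypothetical persistent store whose copy of the value is itself invalid."""
    return "bad"


def is_valid(value):
    return value == "good"


def simulate(clients, ticks, back_off):
    """Count store queries issued while the stored value stays invalid.

    back_off=False: each client retries on every tick (the cascading case).
    back_off=True:  each client doubles its wait after every failed fix.
    """
    next_try = [0] * clients  # earliest tick at which each client may query
    delay = [1] * clients     # current wait, in ticks, for each client
    queries = 0
    for tick in range(ticks):
        for c in range(clients):
            if tick < next_try[c]:
                continue              # still waiting out a backoff
            queries += 1              # client asks the store for a "fix"
            if is_valid(fetch_from_store()):
                continue              # never happens in this sketch
            if back_off:
                delay[c] *= 2
                next_try[c] = tick + delay[c]
    return queries


if __name__ == "__main__":
    print("naive retries:", simulate(100, 16, back_off=False))
    print("with backoff: ", simulate(100, 16, back_off=True))
```

With 100 clients over 16 ticks, the naive version issues 1,600 queries against the store while the backoff version issues 400. That illustrates why a verifier with no brake on retries can do more damage than the fault it is trying to repair.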
“We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously,” Johnson concluded.
Facebook outages are as rare as Twitter’s fail whale once was common. Twitter suffered prolonged outages regularly until the company made significant infrastructure improvements this year.
Google Apps such as Gmail go down a couple of times a year, which is particularly bad for business users who pay $50 per user, per year for the cloud collaboration software.
Facebook may find it has a shorter leash with consumers, who are spending an increasing share of their online time there.
ComScore said U.S. Web users in August spent 41.1 million minutes on Facebook, compared with 39.8 million minutes on all of Google’s Websites.