Google's Gmail application went down for 100 minutes, leaving millions of users of the messaging and collaboration application around the world in the dark. The culprit was human error compounded by insufficient request router capacity to handle Gmail traffic. Google is improving its request routers to prevent this from happening again, but the outage has likely left a sour taste in the mouths of Gmail users, some of whom pay $50 per user, per year for Gmail and other Google Apps.
Google's Gmail application was knocked out
for the majority of users for 100 minutes on
Sept. 1 due to human error, the company said last night after the Gmail
engineering team fixed the issue.
Google took a small fraction of Gmail's servers offline to perform
routine upgrades. Google does this regularly, sending traffic to other
locations when one is offline. That's when things got hairy, as Ben Treynor,
vice president of engineering and site reliability czar for Google, explained:
"We had slightly underestimated the load which some recent changes
(ironically, some designed to improve service availability) placed on the
request routers, servers which direct web queries to the appropriate Gmail
server for response. At about 12:30 pm
Pacific a few of the request routers became overloaded and in effect told the
rest of the system 'stop sending us traffic, we're too slow!' This
transferred the load onto the remaining request routers, causing a few more of
them to also become overloaded, and within minutes nearly all of the request
routers were overloaded. As a result, people couldn't access Gmail via the Web
interface because their requests couldn't be routed to a Gmail server."
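
The cascade Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic; the function name `simulate_cascade`, the even split of load, and the capacity numbers are all assumptions made for illustration. It only shows how a small capacity shortfall on a few routers can snowball once each overloaded router hands its share to the rest.

```python
def simulate_cascade(capacities, total_load):
    """Return the count of healthy routers after each round of failures.

    Toy model (an assumption, not Google's design): total_load is split
    evenly across healthy routers, and any router whose share exceeds its
    capacity stops accepting traffic, pushing its share onto the others.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical fleet: eight routers rated for 100 units, two rated for 90.
fleet = [100] * 8 + [90] * 2
print(simulate_cascade(fleet, 880))  # load fits: nothing fails
print(simulate_cascade(fleet, 920))  # two weak routers fail, then the rest
```

In this model a load of 920 units is only slightly more than the two weaker routers can take, yet once they drop out the remaining eight each receive 115 units and fail as well.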
Treynor said internal monitors alerted the Gmail engineering team to the
failures within seconds. The team brought several additional request routers
online to make up for the shortfall in capacity and distributed the traffic
across them. Gmail came back online around 2:30.
To ensure this lack of capacity doesn't happen again, Google boosted request
router capacity well beyond peak demand, giving Gmail extra headroom when the
application needs it. The shortfall is ironic, considering that Google
reportedly powers the world's most popular search engine with more than 1
million servers.
Treynor also said Google is improving the failure isolation in the routers,
so a problem in one data center won't affect servers in another facility.
Moreover, he said that Google is taking steps to make sure that when the
request routers are overloaded simultaneously, they all should just get slower
instead of refusing to accept traffic and shifting their load to another data
center.
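
The difference between the two overload behaviors can be sketched as follows. This is a hypothetical illustration, not Google's implementation: the function `handle_overload`, the policy names "shed" and "slow", and the assumption that latency grows in proportion to the overload factor are all invented here to make the trade-off concrete.

```python
def handle_overload(load, capacity, base_latency_ms, policy):
    """Return (accepted_load, latency_ms) for one router.

    policy "shed": once overloaded, refuse all traffic, so the load moves
                   to other routers (the behavior that fed the cascade).
    policy "slow": keep accepting all traffic, with latency assumed to
                   grow in proportion to the overload factor.
    """
    if load <= capacity:
        return load, base_latency_ms
    if policy == "shed":
        return 0, None  # nothing served here; the load lands elsewhere
    if policy == "slow":
        return load, base_latency_ms * (load / capacity)
    raise ValueError(f"unknown policy: {policy}")


# A router rated for 100 units receiving 150, with a 50 ms baseline.
print(handle_overload(150, 100, 50, "shed"))
print(handle_overload(150, 100, 50, "slow"))
```

Under the "slow" policy users wait longer but still reach their mail, and no extra load is pushed onto routers that may already be near their own limits.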
It's also worth noting that when Gmail did go down, Google urged users to
access it via the IMAP and POP mail
protocols; mail processing continued to work normally because these requests
don't use the same routers at Google.
"We know how many people rely on Gmail for personal and professional
communications, and we take it very seriously when there's a problem with the
service," Treynor added. "Thus, right up front, I'd like to apologize
to all of you; today's outage was a Big Deal, and we're treating it as such."
So are the users who rely on Gmail for their businesses. Donald told Google
Watch: "I use G-Mail to run my CPA practice. This is a
serious (huge) problem."
Sergei added: "This is a huge problem and an outrage. I demand
immediate Gmail access. What is with those people?"
Indeed, more than 1.75 million businesses use Google Apps, and some of them
pay Google $50 per user, per year for the Google Apps collaboration suite,
which boasts Gmail as its backbone. Users have little patience for a service
that conks out on them, particularly when they are paying for the extra
reliability and security. Read more about this on TechMeme.
The latest issue follows a big outage in February, when Gmail went down for
two and a half hours due to
"unexpected side effects of some new code." But
these last two issues were nothing compared with the August 2008 outage
that took Gmail down for nearly