Google has apologized to Gmail users who were affected by email delivery delays on Sept. 23, explaining in a blog post that the slowdowns were caused by a rare two-pronged failure in the company's network architecture.
"On Sept. 23rd, many Gmail users received an unwelcome surprise: some of their messages were arriving slowly, and some of their attachments were unavailable," wrote Sabrina Farmer, the senior site reliability engineering manager for Gmail, in a Sept. 24 post on the Google Gmail Blog. "We'd like to start by apologizing—we realize that our users rely on Gmail to be always available and always fast, and for several hours we didn't deliver. We have analyzed what happened, and we'll tell you about it below. In addition, we're taking several steps to prevent a recurrence."
What caused the problems, she wrote, was "a dual network failure" that occurred when two separate, redundant network paths both stopped working at the same time. The events were "unrelated," wrote Farmer, "but in combination they reduced Gmail's capacity to deliver messages to users, and beginning at [8:54 a.m. ET] messages started piling up."
An automated monitoring system quickly alerted the Gmail engineering team, which began investigating the incident, she wrote. Repairs got under way, and much of the accumulated message backlog was cleared and delivered by 4 p.m. ET, with the rest of the delayed mail delivered shortly before 7 p.m. ET, according to Farmer's post. Users could monitor the service delays on Google's application performance status page.
"The impact on users' Gmail experience varied widely," she wrote. "Most messages were unaffected—71 percent of messages had no delay, and of the remaining 29 percent, the average delivery delay was just 2.6 seconds. However, about 1.5 percent of messages were delayed more than two hours."
Users posted frequently about the problems throughout the day on Facebook, Twitter and other social media platforms.
With the latest service problem now fixed, Farmer wrote that the company is moving quickly to implement changes aimed at preventing a similar problem in the future. "What's next? Our top priority is ensuring that Gmail users get the experience they expect: fast, highly-available email, anytime they want it. We're taking steps to ensure that there is sufficient network capacity, including backup capacity for Gmail, even in the event of a rare dual network failure. We also plan to make changes to make Gmail message delivery more resilient to a network capacity shortfall in the unlikely event that one occurs in the future."
In addition, Google is "updating our internal practices so that we can more quickly and effectively respond to network issues," she wrote. "We'll be working on all of these improvements and more over the next few weeks—even including this event, Gmail remains well above 99.9 percent available, and we intend to keep it that way!"
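For a sense of scale, a 99.9 percent availability figure can be translated into a downtime allowance with simple arithmetic. The short Python sketch below is illustrative only: the function name and the 30-day and 365-day periods are assumptions made for this example, and it does not reflect how Google actually measures Gmail availability. A partial slowdown such as the Sept. 23 delays, in which most messages arrived on time, would not necessarily count as full downtime under any such measure.

```python
# Illustrative arithmetic only. The period lengths and the function name are
# assumptions for this sketch, not Google's internal availability accounting.
HOURS_PER_PERIOD = {"30-day month": 30 * 24, "365-day year": 365 * 24}


def downtime_budget_hours(availability_pct: float, period_hours: float) -> float:
    """Return the hours of unavailability permitted at a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


for label, hours in HOURS_PER_PERIOD.items():
    budget = downtime_budget_hours(99.9, hours)
    print(f"{label}: {budget:.2f} hours of allowable downtime at 99.9 percent")
```

Under those assumptions, 99.9 percent works out to roughly 43 minutes of allowable downtime in a 30-day month, or just under nine hours in a year.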
Asked about this week's Gmail problems, two IT analysts told eWEEK that such glitches are essentially unavoidable for the kinds of free services the company provides.
"They have redundancy, but they are still a publicly controlled company," said Dan Maycock of Slalom Consulting. "They can't do triple redundancy. It would not be cost effective."