Google’s Compute Engine cloud infrastructure hosting service suffered a nearly two-hour disruption between late Feb. 18 and early Feb. 19, impacting customers on a global scale.
The disruption started at 10:59 p.m. Pacific Standard Time on Feb. 18 and was resolved shortly before 1 a.m. PST on Feb. 19, a Google incident report noted.
The company blamed the outage on a glitch in an internal software system used to manage virtual machine egress traffic on the Google Compute Engine. According to Google, the software stopped issuing updated routing information, bringing outbound traffic to a halt.
Google said that for about 40 minutes the majority of Google’s Compute Engine instances “experienced traffic loss for outbound network connectivity.” The total length of detectable traffic loss was 2 hours and 40 minutes.
The disruption began at around 10:40 p.m. PST on Feb. 18 when Google Compute Engine instances began to experience a 10 percent loss in traffic flow. The loss increased linearly to peak at 70 percent around 11:55 p.m. and stayed there for about 40 minutes. By then, Google engineers had implemented remediation measures and managed to reduce and then eliminate the traffic loss.
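As a rough illustration of that ramp, the two figures Google reported (10 percent at about 10:40 p.m. and 70 percent at about 11:55 p.m.) imply a climb of roughly 0.8 percentage points per minute over 75 minutes. The sketch below assumes the increase was exactly linear between those two points; the function name and constants are hypothetical and do not come from Google's incident report.

```python
# Figures taken from the article; the exact straight-line ramp is an assumption.
RAMP_MINUTES = 75.0        # about 10:40 p.m. to about 11:55 p.m. PST
LOSS_AT_START = 10.0       # percent traffic loss when the ramp began
LOSS_AT_PEAK = 70.0        # percent traffic loss at the peak


def estimated_loss_percent(minutes_after_start):
    """Estimate traffic loss (percent) at a point during the ramp.

    Hypothetical helper: linear interpolation between the two reported
    endpoints. Only valid inside the ramp window, so anything outside
    it is rejected rather than guessed.
    """
    if not 0 <= minutes_after_start <= RAMP_MINUTES:
        raise ValueError("only defined for the 75-minute ramp")
    rise = LOSS_AT_PEAK - LOSS_AT_START
    return LOSS_AT_START + rise * minutes_after_start / RAMP_MINUTES
```

Calling `estimated_loss_percent(37.5)` gives the midpoint of the ramp, which under this assumption falls halfway between the two reported loss figures.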
“The issue manifested as a loss of external connectivity to the instances, and an inability of the instances to send traffic outside their private network,” Google said. “The instances themselves continued to run, and became available again as their external traffic loss cleared.”
Web application performance management firm Dynatrace described the outage as global in nature.
“At the time of the outage we saw a significant degradation of Web performance across virtually all business sectors, an indication that organizations using Google’s cloud infrastructure encountered difficulties beyond their control to fix,” the company’s digital performance expert David Jones said in a statement.
Cloud service outages are not uncommon, and Google is certainly not the only one to suffer such glitches. Dynatrace said it has tracked multiple similar outages in the past year, including incidents at Microsoft Azure and Amazon Web Services.
For instance, in November, customers of Microsoft’s Azure cloud service experienced an 11-hour outage after a scheduled performance update to Azure Storage went horribly awry. Just a few days later, Amazon’s CloudFront content delivery network suffered a two-hour outage.
Even though such incidents aren't uncommon, the outage is bound to be an embarrassment for Google, which prides itself on having one of the most reliable cloud hosting services in the business. CloudHarmony, a cloud performance analysis and comparison company, ranked Amazon’s Elastic Compute and Google’s Cloud Platform as the two most reliable cloud services in 2014 from a performance standpoint.
According to CloudHarmony, Amazon’s hosted service experienced 20 outages totaling 2.41 hours of downtime in 2014, for an uptime of 99.9974 percent. Google’s storage service experienced just 14 minutes of downtime in 2014, for an uptime of 99.9996 percent.
Google’s Compute Engine platform suffered 72 outages resulting in 4.42 hours of downtime for all of 2014, compared to 39.77 hours of downtime for Azure during the same period.
The Google service disruption comes as the company ramps up efforts to go after bigger corporate customers for its cloud offerings. Over the past several months, Google has rolled out several new services and programs aimed at getting more large companies to sign up for its cloud services.
The Internet giant has also hired several senior executives from companies like Oracle, Microsoft and Amazon to shepherd its efforts in the cloud computing market.