Amazon.com reported that problems
that shut down some of its Web services—namely its Elastic Compute Cloud,
Relational Database Service and Elastic Beanstalk—still hadn't been completely
resolved by late afternoon April 22, more than 40 hours after they went
offline.
The most critical outage
started at 1:41 a.m. PDT April 21 at an AWS (Amazon Web Services) data center
in Northern Virginia and caused disruptions in its EC2 (Elastic Compute Cloud) hosting
service, knocking thousands of Websites—including such popular ones as
Foursquare, Reddit, Quora and Hootsuite—off the Internet.
These Websites and numerous
smaller sites were still experiencing outages on some systems by late afternoon April 22.
Businesses that depend on the AWS hosting service lost money during that window
of time—income that cannot be regained.
The AWS Elastic Beanstalk,
which software developers use for deploying and managing applications in the
AWS cloud, was running but was experiencing performance problems, Amazon said.
Elastic Beanstalk automatically handles the deployment details of capacity
provisioning, load balancing, auto-scaling and application health monitoring.
Amazon reported April 22 on
its status Website that it had made
progress in fixing the outage.
"We continue to see
progress in recovering volumes, and have heard many additional customers
confirm that they're recovering. Our current estimate is that the majority of
volumes will be recovered over the next 5 to 6 hours," Amazon said Friday
morning. "As we mentioned in our last post, a smaller number of volumes
will require a more time-consuming process to recover, and we anticipate that
those will take longer to recover."
By late afternoon Pacific Time
on Friday, Amazon had issued a total of 19 updates on its status page since the
outages began. EC2, Relational Database Service and Elastic Beanstalk were
still having problems.
SLA Not Violated, However
Lydia Leong of Gartner
Research wrote in an advisory that Amazon EC2 didn't actually violate its
service-level agreement when the outage occurred.
"Amazon’s SLA for EC2
is 99.95 percent for multi-AZ deployments," Leong wrote. "That means
that you should expect that you can have about 4.5 hours of total region
downtime each year without Amazon violating its SLA.
"Note, by the way, that
this outage does not actually violate their SLA. Their SLA defines
unavailability as a lack of external connectivity to EC2 instances, coupled
with the inability to provision working instances. In this case, EC2 was just
fine by that definition. It was Elastic Block Store [EBS] and Relational
Database Service [RDS] which weren't, and neither of those services have
SLAs."
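Leong's downtime figure follows from simple arithmetic on the availability percentage. A minimal sketch of that calculation, assuming a 365-day year and treating the 99.95 percent target as an annual figure, is shown here; the function name and constant are illustrative and do not come from Amazon's SLA.

```python
# Illustrative only: converts an availability target into permitted downtime.
# Assumes a 365-day year (8,760 hours) and ignores leap years.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime an availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction is whatever remains after the availability target.
    return (1 - availability_percent / 100) * period_hours


if __name__ == "__main__":
    hours = allowed_downtime_hours(99.95)
    print(f"99.95% availability allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions the result is roughly 4.4 hours a year, in line with Leong's "about 4.5 hours." The 40-plus hours of disruption would far exceed that budget only if the outage met the SLA's definition of unavailability, which Leong says it did not.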
An Amazon spokeswoman didn't
respond to an eWEEK query by end of business April 22.
Mixed Reaction from Customers
Reaction to the outage from
cloud customers was mixed.
"Proponents of cloud
computing aren't going to like the fact that Amazon had issues that resulted in
outages among its customers' sites, but the fact is that most insurers have
their own outages when they host applications internally, in some cases with
more frequency and severity than we're seeing here with Amazon," said Craig
Weber, a senior vice president of the Insurance Group at Celent, a Boston-based
financial research and consulting firm.
"This outage should
focus the discussion on the relative reliability of various approaches and the
tradeoffs between them. Of course, there are also lessons about being aware of
the capabilities of your business partners.
"Engaging with an SAAS [software
as a service] vendor requires understanding things like their architecture,
their disaster-recovery capability and similar issues, because worst-case
scenarios always seem to emerge eventually."
Morphlabs, which was one of the first AWS
solution providers when it launched Morph Appspace in 2007, now has more than
4,000 users. Founder and CEO Winston Damarillo told eWEEK that organizations
need to focus on two things to better understand their cloud solutions:
diversity and control.
"A multi-vendor
approach to the cloud means that an organization is not relying on one company
or solution to keep its cloud in working order," Damarillo wrote.
"With the hybrid cloud
model, companies are able to extend existing infrastructure resources without
isolating themselves. When the time comes that they have maxed out their
hardware compute capabilities behind a firewall, they can easily make use of
the public cloud, as well."