If your company runs Slack, Bitbucket, SendGrid or any number of other apps that you normally take for granted but suddenly cannot access today, the reason has come to light.
The world’s largest and busiest cloud infrastructure provider, Amazon Web Services, was hit by a widespread service interruption Feb. 28 at its northern Virginia data center that took down much of the company’s S3 storage and a long list of services with it for several hours.
“We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region,” AWS said on its status page.
[Editor’s update]: AWS issued a notice at 2:08 p.m. Pacific time that the issues had been resolved and that “we are fully recovered for operations for adding new objects in S3 … and that Amazon S3 service is operating normally.”
Services affected included Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage, Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), Vermont Public Radio, VSCO and Zendesk, among others.
Repercussions Went Far and Wide
Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji and Time Inc. were working slowly in the afternoon, the company reported.
Apple said on its system status page that it had issues with its App Stores, Apple Music, FaceTime, iCloud services, iTunes, Photos and other services, but it’s not clear whether they’re attributable to today’s S3 difficulties.
Ariel Maislos, CEO at Stratoscale, told eWEEK that AWS customers without a replicated backup are stuck waiting it out along with the service itself.
“Right now they need to wait it out, which is frustrating,” Maislos said. “In the future they’d need to replicate the data to multiple regions and multiple cloud providers and it greatly impacts costs and operating complexity.
“Operating a complex large scale service is always difficult. When dealing with outages like this, you want to be careful not to cause an even worse cascade problem. That’s why the outages are never short.”
What Can an AWS Customer Do When Outage Hits?
AWS outages are rare, but what are businesses supposed to do when one hits?
“Always be backing up and always design for multi-site availability,” Forrester Research storage analyst Dave Bartoletti said in response to eWEEK’s question. “S3 itself is highly reliable, and this issue was limited to a particular region. I see no evidence that data was lost; it was inaccessible for a period of time today.
“Everyone affected should re-evaluate how current their backups are, where they are stored, and how to switch over to alternative locations automatically when an S3 issue is detected in the future.”
How, in Bartoletti’s estimation, did AWS perform in response to this crisis?
“I think AWS responded very well. The company acknowledged the problem quickly, took a bit too long to update the status page (showing other services affected), but quickly reported when they had identified the problem (always good to know) and resolved most of the problems within 5 hours,” Bartoletti said. “That’s a significant issue, but the response was effective.
“I expect to see a description of what went wrong and guidance for how users can insulate themselves from this in the future–that type of quick response will go a long way to assuage lingering fears. Overall, this is not a trend; S3 reliability has been incredible for any reasonable reporting period.”
Two AWS Customers Weigh In
Two AWS customers, Workspot and Nerdio, which offer desktop as a service (DaaS) and IT as a service (ITaaS), respectively, were impacted by the AWS outage, but their reactions are very different.
“[The answer is] stone-cold simple: SaaS vendors need to assume that cloud infrastructure will be down a few times a year. Cloud-native services should add Multi-Availability Zone and Multi-Region replication for seamless failover,” Puneet Chawla, Chief Technology Officer and Co-founder of Workspot, told eWEEK.
“At Workspot, we’ve spent the last four years making sure our service is region and cloud agnostic so that we can fail over to another location when disaster strikes.”
“The lesson for us all from this major AWS outage may very well be to build redundancy for mission critical apps across multiple clouds,” said James Sivis, VP of ITaaS Nerdio.
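For readers wondering what the failover advice above looks like in practice, here is a minimal Python sketch of the general pattern: try the primary copy of an object first and fall back to a replica when the primary errors out. It is an illustration under stated assumptions, not how Workspot, Nerdio or AWS implement it. The function name read_with_failover, the AllReplicasFailed exception, the region labels and the in-memory stand-ins for storage are all hypothetical; a real deployment would replace the stand-in functions with calls to its own storage clients and would need the data replicated ahead of time.

```python
from typing import Callable, List, Tuple


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the requested object."""


def read_with_failover(
    key: str,
    replicas: List[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (name, fetch) pair in order; return the first success.

    Each fetch function is a hypothetical stand-in for a storage client
    bound to one region or one cloud provider.
    """
    errors = []
    for name, fetch in replicas:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure moves on to the next replica
            errors.append(f"{name}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


# Illustrative stand-ins: the primary region is erroring out,
# while the secondary region holds a previously replicated copy.
def primary(key: str) -> bytes:
    raise ConnectionError("increased error rates")


secondary_store = {"report.csv": b"a,b\n1,2\n"}


def secondary(key: str) -> bytes:
    return secondary_store[key]


served_by, data = read_with_failover(
    "report.csv",
    [("us-east-1", primary), ("us-west-2", secondary)],
)
print(served_by)
```

The ordering of the list sets the preference, so the same function covers a second region, a second cloud provider or both. As Maislos noted, the cost is in keeping those extra copies current, which this sketch does not address.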