If it seems like major services have been crashing a lot lately and for extended periods, you’re not imagining things. A cluster of crashes has plagued users of a wide variety of high-profile cloud services in the past month.
If this is a glimpse of what our future holds, then let’s examine the occurrences, learn from them and take action accordingly.
The biggest of these was an epic Amazon Web Services (AWS) outage.
Amazon Web Services
On Sept. 20, an Amazon Web Services failure caused big chunks of the Internet to either go offline or get really slow for five long hours.
An AWS database service called Amazon DynamoDB, which is used for the kind of super low-latency applications that run on AWS, started having problems. And those problems rippled out to major applications used by some of the Internet’s best-known sites.
It took Amazon two hours to figure out what was causing the problem and then three additional hours to fix it.
Affected sites included Netflix, Medium, Buffer, Reddit, Pocket, Product Hunt, SocialFlow, GroupMe, Viber and others. The effects of the problem ranged from total outages to radical drops in performance.
Alexa, Amazon’s answer to Siri and the virtual assistant built into Amazon’s Echo product, also was hit.
Another major outage struck Microsoft’s Skype service on Monday, Sept. 21. While the paid Skype for Business app and newish Web version of Skype continued operating, regular Skype (which connects via an app and is used by a majority of Skype users) suffered a colossal outage lasting an incredible 15 hours.
Some users couldn’t log in or see if contacts were online. Many others simply couldn’t make Skype calls.
This affected me personally. In addition to working as a columnist, I anchor a daily Internet show called Tech News Today. It’s an interview show, and our guests connect via Skype. In fact, the TWiT network has dozens of shows, and most of them use Skype in one capacity or another. Needless to say, it was a rough day because of Skype’s outage.
And, of course, we weren’t alone. Skype has about 300 million users.
Even Twitter went down for an hour earlier in September. News outlets, TV shows and the general public increasingly rely on Twitter as a source of breaking and emergency news. It’s a big deal when it goes down, according to the many people complaining about it on Facebook, but note that most users accessing Twitter via a third-party client were still able to use the service.
A major power transformer failure Aug. 22 at a substation that provides power to a Fujitsu data center in Silicon Valley took a variety of Software as a Service applications and public cloud services offline. The outage caused cascading failures that led to service disruptions for up to five days for some customers.
These failures were caused by a flaw in Fujitsu’s system for handing off to the backup power supply.
Nest, which is owned by Google, reported on Sept. 7 the unavailability of the cloud services that support both its Nest Thermostat and Dropcam camera products. Users were unable to use the cloud services for the thermostats or see the video for their Dropcams during a three-hour outage. It was the second major outage in a single week.
The outages for Dropcam were a serious concern for some, as the cameras are often used for security or as baby monitors. They also raised the question of whether using cloud-based cameras for security is really a good idea.
Cloud Service Crashes Are Always a Concern
Because all these services are so significant, and because these outages happened within the span of a month, it’s worth taking a moment to consider this strange place we’ve come to.
Cloud computing has become so commonplace that we find ourselves using services that we can forget are cloud services. When people enjoy the proper, automatically set temperature in their homes with a Nest thermostat, for example, they tend not to think: “I’m doing cloud computing.” The recent outage serves as a reminder.
I think there are three takeaways from all this.
1. There’s nothing magical about the cloud. “The cloud” is just somebody else’s computers located somewhere else. All the problems that exist in one’s own data center can exist within the cloud services.
2. The cloud comes with a certain degree of helplessness. One benefit of the cloud is detachment. A large number of people are sweating bullets every day to keep cloud services up and running as best they can, but when things go sideways, there’s usually nothing that subscribers can do about fixing it. This brings me to my most important point.
3. It’s important to always have a cloud Plan B. That could be an alternative cloud service. It could be localized access of one’s data in a non-cloud service. Or, as in the case of the Skype and Twitter outages, the alternative could be using a different facet of the service.
For Skype, using the Web version was an instant way to continue using the service. For Twitter, a third-party client would have been a good alternative to not using Twitter at all during the outage.
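For readers who build on cloud services rather than just use them, the Plan B idea can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the function names (`first_available`, `primary_cloud`, `local_copy`) are hypothetical, the outage is simulated, and a real application would add timeouts, retries and logging.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_available(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each provider in order and return (name, result) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # any failure means "move on to Plan B"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins: a primary cloud service that is down,
# and a local, non-cloud copy of the same data.
def primary_cloud() -> str:
    raise ConnectionError("service unavailable")  # simulated outage


def local_copy() -> str:
    return "data from local copy"


name, data = first_available([
    ("primary cloud", primary_cloud),
    ("local copy", local_copy),
])
print(f"served by {name}: {data}")
```

The ordering of the list is the plan: the preferred service goes first, and each later entry is the fallback for the one before it.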
In the meantime, let’s cross our fingers and hope that the past month isn’t an indication of the reliability of the cloud in general. It’s probably just a coincidence that all these big outages happened within a month of each other.
Either way, it’s always best to make sure you’re ready with a Plan B for each cloud service you’re using.