Skype, AWS Outages Rekindle Cloud Reliability Concerns

Mishaps at data centers operated by Microsoft and Amazon are reminders that despite its sheer scale, the cloud is not immune to show-stopping bugs.


Skype users experienced an extended outage Sept. 21, highlighting the cloud's growing influence on the lives of consumers and business technology users alike.

On Monday, after receiving complaints from several users, Microsoft's official Twitter account confirmed that its cloud-backed Skype calling, messaging and video conferencing platform was experiencing technical issues. The company has been tight-lipped about the extent and duration of the problem and the regions affected. However, users in the United States, the United Kingdom, Singapore and even New Zealand reported difficulties logging in, suggesting the issue was widespread across Microsoft's globe-spanning network of Azure cloud data centers.

In a Sept. 21 post to the Skype Heartbeat, a blog that reports on the health of the voice over IP (VoIP) service, Microsoft revealed that its enterprise customers were unaffected by the glitch.

"We have identified the network issue which prevented users from logging in and using Skype today. We're in the process of reconnecting our users, and focused on restoring full service. The issue did not affect Skype for Business users," wrote the company.

While consumers bore the brunt of this latest outage, businesses that rely on cloud computing providers can face crippling downtime of their own.

Early Sunday morning, Amazon's Northern Virginia data center (US-East-1) suffered a barrage of disruptions, culminating in a five-hour outage of major Amazon Web Services (AWS) offerings, including Elastic Compute Cloud (EC2), AWS Lambda and the company's NoSQL database service, DynamoDB. Affected customers included Airbnb, Netflix, Reddit and Tinder.

Mounting login failures and increasing error rates swept across Amazon's cloud infrastructure, underscoring the importance of distributing workloads across availability zones. In DynamoDB's case, the company reported "increased error rates for all API calls in DynamoDB in US-East-1" in an advisory on the AWS Service Health Dashboard. "The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information," Amazon added in a later update, before restoring the service by 9:12 a.m. PDT on Sept. 20.

The incident was an eerie reminder of the last major Amazon outage during the summer of 2013.

That incident, which also hit the US-East-1 data center on a Sunday (Aug. 25), knocked a number of virtual machine instances offline. The culprit was a physical networking device that caused problems for the company's Elastic Block Store service, which provides persistent storage for virtual machines hosted on AWS. Airbnb, Instagram, Flipboard and Vine suffered degraded services at the time.

Google, too, has been afflicted by bugs that call cloud application availability into question.

On Feb. 18 and 19, Google Compute Engine suffered an outage that severely restricted access to the service. Traffic loss started at 10 percent in the early minutes and grew to nearly 70 percent at the peak of the nearly three-hour interruption. "The issue manifested as a loss of external connectivity to the instances, and an inability of the instances to send traffic outside their private network. The instances themselves continued to run, and became available again as their external traffic loss cleared," Google explained in a status update at the time.

Pedro Hernandez

Pedro Hernandez is a contributor to eWEEK and the IT Business Edge Network, the network for technology professionals. Previously, he served as a managing editor for the network of...