Skype engineers appear to be making some progress recovering from the outage that has plagued their services for the better part of the last 48 hours, but the company is not out of the woods yet by any means. Personally, I have found several times today that my Skype client successfully logged in to the network for a brief time, but the connection was dropped each time within minutes.
At this point, little information has been made available about the root cause of the problem. Via the Skype Heartbeat Web page, Skype has provided only a bare minimum of information about the nature of the outage, beyond dispelling rumors that it was caused by a) scheduled maintenance to the billing system that was completed on Wednesday, or b) any form of attack. Skype officially denies that either of those is the cause and instead points to an unspecified algorithm problem in the Skype networking software that has affected the client sign-on process.
Here’s Skype’s official comment on the matter:
“On the early morning of August 16th (Luxembourg), Skype experienced a software error in its product that prevented log-in attempts from our users. The problem occurred because of a deficiency in an algorithm within Skype networking software. This controls the interaction between the user’s own Skype client and the Skype directory. Throughout the day, this issue spread and this made it very difficult for a sizable proportion of our users to log in to Skype.”
What becomes clear is that, despite the remarkable resilience Skype has shown over the last few years, the sign-on and authentication process—the most centralized component of the network—is a susceptible bottleneck that can bring down the entire network.
The first step in a Skype session is for the client to authenticate to Skype’s servers. This central authentication is necessary to a) inform the client software where some supernodes and relays are, in order to facilitate call routing and setup, and b) characterize the services to which the user has subscribed, like SkypeOut, SkypeIn, or voicemail.
With Skype, clients communicate in a peer-to-peer fashion if the various endpoint networks allow traffic to pass in both directions on the appropriate ports. Clients protected by a network firewall or NAT can still access the service, however, through Skype supernodes and relay agents that act as call proxies: a third-party relay sits between the callers, so that both endpoints only need to open an outbound network tunnel to a stable resource.
Supernodes also act as agents that assist in call routing and setup. The supernodes and relays are distributed around the Internet on high-powered, but otherwise ordinary, Skype-enabled client machines that have the right levels of network access. Because supernodes and relays can be anywhere on the network, and can appear and disappear dynamically, these components of the Skype network are quite robust and reliable.
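To make the dependency concrete, here is a minimal sketch of the flow described above: a central sign-on that hands the client a list of relays, followed by a choice between a direct and a relayed call path. Skype's actual protocol is proprietary and undocumented, so every name here (Endpoint, sign_on, choose_path, LoginUnavailable) is hypothetical and the logic is only an illustration of the general idea, not Skype's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    accepts_inbound: bool  # False when behind a restrictive firewall or NAT


class LoginUnavailable(Exception):
    """Raised when the (hypothetical) central sign-on service is down."""


def sign_on(login_service_up: bool, known_relays: List[str]) -> List[str]:
    # Stand-in for central authentication: on success the client learns
    # where some relays are; on failure nothing else can proceed.
    if not login_service_up:
        raise LoginUnavailable("central sign-on is down")
    return list(known_relays)


def choose_path(caller: Endpoint, callee: Endpoint,
                relays: List[str]) -> Optional[str]:
    # Direct peer-to-peer only when traffic can pass in both directions.
    if caller.accepts_inbound and callee.accepts_inbound:
        return "direct"
    # Otherwise both sides open outbound tunnels to a third-party relay.
    if relays:
        return f"relay via {relays[0]}"
    return None  # no usable path
```

In this sketch the relay list can change freely without breaking calls, but a failure in sign_on stops every client before path selection is ever reached, which is the weakness the outage exposed.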
As we have learned over the last few days, all this resilience can be for naught when the upfront authentication services are unavailable. But what I have found most disappointing about this failure is that there appears to be no separation of services when it comes to business-class versus individual accounts using the Skype service, as both types have been affected equally during this outage. Companies that have bought into the Skype service as a way to foster long distance or international calls on the cheap within the organization have learned the hard way that a business-class account buys no additional protections from this kind of outage.
While Skype does not have a great track record of catering to the needs of business users, moves made over the last year to improve account management, centrally control Skype services across a company’s network, and lock out the use of certain features have shown that Skype is willing to make improvements tailored to that particular audience. But the next step in Skype’s evolution towards business adoption needs to be a separation of the login and accounting services used by business accounts from those serving standard users, in a more distributed, robust and stable configuration.
Certainly, such changes would lead to higher costs for business customers. The question would then be whether the additional stability would be worth the additional cost, or whether another alternative would become more attractive.