The widespread outage that crippled Skype on Dec. 22 appears to have been aggravated by a bug in the Windows version of the VOIP client, said Lars Rabbe, the company’s CIO, on Dec. 29.
A “confluence of events,” which included overloaded servers, a bug in the Windows client and a shortage of supernodes, caused Skype to be “unavailable” to users, Rabbe wrote on the Skype blog.
Supernodes help manage multiple connections on Skype’s peer-to-peer network, said Rabbe. Each end user’s computer running Skype’s client software is a node and connects to other computers to create a p2p network. Supernodes act like directories, supporting multiple computers running the Skype software and establishing connections between them, he said. Supernodes typically create local clusters of several hundred nodes, according to Rabbe. Any computer running the client on a public network can be tapped to act as a supernode.
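The directory role Rabbe describes can be pictured with a small sketch. This is only an illustration under stated assumptions, not Skype's actual protocol: the `Supernode` class, its `register` and `lookup` methods, and the address strings are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Supernode:
    """Hypothetical supernode: a directory for the nodes in its local cluster."""
    name: str
    # Maps a user name to that user's network address (illustrative format).
    directory: Dict[str, str] = field(default_factory=dict)

    def register(self, user: str, address: str) -> None:
        # A node joins this supernode's cluster.
        self.directory[user] = address

    def lookup(self, user: str) -> Optional[str]:
        # Return the address needed to connect to a user, or None if unknown.
        return self.directory.get(user)


supernode = Supernode("sn-1")
supernode.register("alice", "198.51.100.7:4000")
supernode.register("bob", "203.0.113.9:4000")

# Alice's client asks the supernode how to reach Bob.
bob_address = supernode.lookup("bob")
unknown = supernode.lookup("carol")
```

If a supernode like this one goes away, every node in its cluster loses its directory until it finds another, which is why the loss of many supernodes at once mattered.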
The Skype outage began when a cluster of servers that handle offline instant messaging became overloaded with requests, wrote Rabbe. Users running Windows client version 5.0.0.152 received delayed responses from the overloaded servers, and the client could not process the information properly, causing it to crash. Enough clients crashed to affect the p2p network Skype relies on to connect users, Rabbe wrote.
About half of all Skype users globally were running that version of the Skype client for Windows, and about 40 percent of those clients crashed, said Rabbe. About 25 percent to 30 percent of the public supernodes that connected users ran the buggy version of the client and failed, he said.
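Taken together, Rabbe's two figures imply that roughly one in five Skype clients worldwide crashed. A short sketch of that arithmetic follows; the function and variable names are hypothetical, and the inputs are the approximate figures quoted above rather than exact counts.

```python
def overall_crash_share(share_on_buggy_version: float, crash_rate: float) -> float:
    """Fraction of all clients that crashed, given the share running the
    affected version and the crash rate among those clients."""
    return share_on_buggy_version * crash_rate


# "About half" of users ran the affected version; "about 40 percent" of those crashed.
crashed = overall_crash_share(0.5, 0.4)
print(f"Roughly {crashed:.0%} of all clients crashed")
```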
Users running the latest version, 5.0.0.156, or the older 4.0 versions, as well as Skype for Mac, Skype for iPhone, Skype for TV, Skype Connect and Skype Manager for enterprises, had no problems processing data from the overloaded servers and did not crash, he said.
Even though Skype’s team disabled the overloaded servers, “a significant number” of supernodes had already failed, forcing a “disproportionate load” on the remaining supernodes to keep the p2p network up. “Even when restarted, it takes some time to become available” as a supernode again, Rabbe said.
Timing was also important. The clients crashed just before peak hours and all the clients trying to reconnect resulted in traffic “100 times what would normally be expected at that time of day,” according to Rabbe.
Rabbe did not address what caused the offline instant messaging servers to overload in the first place. He said Skype will be reviewing processes for providing automatic updates to users so that everyone will always be on the latest client software.
The outage lasted approximately 24 hours, from midday Dec. 22 to midday Dec. 23. While not a complete shutdown, the outage affected the vast majority of the millions of users who use Skype each day. Early on Dec. 23, Skype said about 5 million of its users were back online, which was 30 percent of normal activity. The company has said it will offer refunds to affected users.
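The recovery figure also gives a rough sense of normal traffic: if about 5 million users represented 30 percent of normal activity, the usual level would be somewhere near 17 million. The sketch below works that out; the variable names are hypothetical and both inputs are the rounded figures Skype gave, so the result is only an estimate.

```python
# Figures reported by Skype early on Dec. 23 (both approximate).
users_back_online = 5_000_000
share_of_normal = 0.30

# Implied normal activity level for that time of day.
implied_normal_users = users_back_online / share_of_normal
print(f"Implied normal activity: about {implied_normal_users / 1e6:.1f} million users")
```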
Skype engineers were able to restore the network by introducing thousands of new, dedicated supernodes to the network and diverting resources normally used for other services, wrote Rabbe. The supernodes were stabilized by Dec. 24 and Skype restored all services, including Group Video Calling, “in time for Christmas,” he wrote.