Skype Postmortem: Overloaded Servers and Desktop Bugs Brought Us Down
Skype today published a lengthy postmortem explaining why its service went down for the better part of two days last week.
CIO Lars Rabbe says in a blog post that a set of support servers responsible for Skype instant messaging became overloaded and, as a result, sent delayed responses. A bug in the latest Windows version of the Skype desktop software meant that clients failed to process these delayed responses properly and crashed. About half of the world’s Skype users who were signed on at the time the problem began were using that version of the software, and of those, about 40 percent crashed. Among them were users whose machines were serving as supernodes. Rabbe says as many as 30 percent of the Skype network’s supernodes were among the crashed machines.
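As a rough check on those figures, the share of all signed-on users whose clients crashed is the product of the two fractions Rabbe gives. The sketch below treats "about half" and "about 40 percent" as exact values, which is an assumption, and the function name is made up for illustration.

```python
def crashed_share(affected_version_share: float, crash_rate: float) -> float:
    """Fraction of all signed-on users whose clients crashed.

    Both inputs are approximations taken from the postmortem
    ("about half" and "about 40 percent"), assumed exact here.
    """
    return affected_version_share * crash_rate


share = crashed_share(0.5, 0.4)
print(f"Roughly {share:.0%} of signed-on users crashed")
```

Under those assumptions, about one in five users who were signed on when the problem began saw their client crash.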
Losing those supernodes increased the load on other still-functioning supernodes, which was compounded by all the crashed Windows users trying to restart their software and get back on the network. He says traffic to these supernodes surged to 100 times normal volume for that time of day.
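To put the two effects in proportion, here is a back-of-the-envelope sketch of how much extra load the lost supernodes alone would shift onto the survivors. It assumes load was spread evenly across supernodes and redistributed evenly after the crash, which the postmortem does not state; the function name is hypothetical.

```python
def surviving_load_factor(lost_fraction: float) -> float:
    """Load multiplier on each surviving node after a fraction is lost.

    Assumes (hypothetically) that total load is unchanged and is
    shared evenly among the nodes that remain.
    """
    if not 0.0 <= lost_fraction < 1.0:
        raise ValueError("lost_fraction must be in [0, 1)")
    return 1.0 / (1.0 - lost_fraction)


# Rabbe's upper figure: as many as 30 percent of supernodes crashed.
factor = surviving_load_factor(0.30)
print(f"Load per surviving supernode: {factor:.2f}x normal")
```

Under that even-spread assumption, losing 30 percent of supernodes raises the load on the rest by less than half, so the bulk of the reported 100-fold surge would have to come from crashed clients restarting and reconnecting.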
What he doesn’t go into in any great detail is why the instant messaging servers became overloaded in the first place. Was this another bug, this time in the server software? The explanation leaves that unclear.
Rabbe says Skype is trying to learn from the incident and has instituted new procedures to try to prevent this sort of thing from happening again. But this can’t help but hurt its reputation as it looks for ways to diversify its base beyond the millions of free users it has and make some actual money.
The whole reason Skype is supposed to work as well as it usually does is the strength and resilience of its peer-to-peer network, and the fact that the network gets stronger as more people sign on to it. That two bugs, in a strange confluence of events, could bring the entire network down raises a lot of fundamental questions about Skype.
Rabbe says an investment program to increase capacity to support paid consumers and enterprise customers is underway and will continue into 2011. I’m betting Skype will speed it up.