Microsoft Explains Last Week’s Azure Outage: Whoops!
Microsoft has figured out why its Windows Azure cloud-computing platform went down in Europe last week.
In a blog post summarizing Microsoft’s root-cause analysis of the incident, Mike Neil, general manager for Windows Azure, writes that someone forgot to adjust a safety-valve setting in the networking gear.
Just before everything happened, the company had just added a bunch of new capacity. The safety-valve mechanisms on the networking equipment, which usually guard against the possibility of a cascading failure brought on by unusual spikes in the number of connections, hadn’t had their limit setting adjusted upward. So when the new capacity was brought online, the safety valves hit their limits, and did what they were set to do: Generate network-management messages to administrators. The resulting surge in traffic brought on by those messages triggered other bugs, and pushed the CPU usage of some of the machines in the cluster to 100 percent.
Service for Microsoft’s Windows Azure Europe region went down for more than two hours on July 26. It was the second notable disruption for the service this year. The first stemmed from difficulty from the “Leap Day” on Feb. 29. It also came close on the heels of the Amazon Web Services outage on June 30 , caused by a lightning strike at a Northern Virginia Data Center. That outage disrupted service for numerous companies including Netflix, Instagram and Pinterest.