Arik Hesseldahl


Microsoft Explains Last Week’s Azure Outage: Whoops!

Microsoft has figured out why its Windows Azure cloud-computing platform went down in Europe last week.

In a blog post summarizing Microsoft’s root-cause analysis of the incident, Mike Neil, general manager for Windows Azure, writes that someone forgot to adjust a safety-valve setting in the networking gear.

Just before the outage, the company had added a bunch of new capacity. The safety-valve mechanisms on the networking equipment, which usually guard against the possibility of a cascading failure brought on by unusual spikes in the number of connections, hadn’t had their limit setting adjusted upward. So when the new capacity was brought online, the safety valves hit their limits and did what they were set to do: generate network-management messages to administrators. The surge in traffic from those messages triggered other bugs and pushed the CPU usage of some of the machines in the cluster to 100 percent.
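To make the mechanism concrete, here is a minimal sketch of a connection limit that was tuned for the old cluster size. It is only an illustration under stated assumptions, not Microsoft's actual configuration: the function name, the limits and the connection counts are all hypothetical.

```python
def alerts_generated(connections_per_device, limit):
    """Count devices whose connection count exceeds the safety-valve limit.

    Each such device would send a management message to administrators.
    """
    return sum(1 for count in connections_per_device if count > limit)


# Hypothetical numbers, chosen only for illustration.
OLD_LIMIT = 1000                 # limit tuned for the cluster before expansion
RAISED_LIMIT = 1500              # limit adjusted upward for the new capacity

before = [800, 850, 900]         # connections per device before expansion
after = [1200, 1250, 1300]       # connections per device with new capacity online

alerts_before = alerts_generated(before, OLD_LIMIT)
alerts_after = alerts_generated(after, OLD_LIMIT)
alerts_after_raised = alerts_generated(after, RAISED_LIMIT)

print(alerts_before, alerts_after, alerts_after_raised)
```

With the stale limit, every device crosses the threshold as soon as the extra capacity arrives, which is the burst of management messages the post describes. Raising the limit along with the capacity keeps the same traffic under the threshold.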

Service for Microsoft’s Windows Azure Europe region went down for more than two hours on July 26. It was the second notable disruption for the service this year. The first stemmed from difficulty with the “Leap Day” on Feb. 29. The July outage also came close on the heels of the Amazon Web Services outage on June 30, caused by a lightning strike at a Northern Virginia data center. That outage disrupted service for numerous companies, including Netflix, Instagram and Pinterest.

