Amazon Details Last Week's Cloud Failure, and Apologizes
Amazon has released a detailed account of its terrible, horrible, no good, very bad week, during which portions of Amazon Web Services crashed in the U.S. and brought the operations of numerous other companies down with them. It’s a rather lengthy read, so I thought I’d pull out some highlights.
It all started at 12:47 am PT on April 21 in Amazon’s Elastic Block Store (EBS) operation, which is essentially the storage used by Amazon’s EC2 cloud compute service, so EBS and EC2 go hand in hand. During normal scaling activities, a network change was underway. It was performed incorrectly, not by a machine but by a human. As Amazon puts it:
The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.
EBS works on a peer-to-peer design that keeps data in sync across several nodes, using two networks: one fast, used for normal operation, and one slower, used as a backup when the primary fails. Each node uses the network to create additional copies of its data as needed, and when one node stops talking to another mid-stream, the first assumes the second has failed and looks for another node to replicate to. This normally happens so fast that humans aren’t even involved.
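The failover behavior described above can be sketched in a few lines of Python. This is a minimal illustration using hypothetical names (`Node`, `find_new_peer`, and so on), not Amazon's actual code: a node that loses its replica mid-stream immediately searches the cluster for a new peer with enough free space, and raises an error if none exists.

```python
# Illustrative sketch of EBS-style re-mirroring, as described in the
# post-mortem. All names and structures here are hypothetical.

class ReplicaLostError(Exception):
    """Raised when no node in the cluster has room for a new replica."""
    pass

class Node:
    def __init__(self, name, free_space):
        self.name = name
        self.free_space = free_space
        self.reachable = True   # goes False when the node stops responding
        self.peer = None        # the node currently mirroring our data

    def replicate(self, data_size, cluster):
        """Write data and mirror it to a peer, finding a new one if needed."""
        if self.peer is None or not self.peer.reachable:
            # Peer stopped talking mid-stream: assume it failed and
            # search for another node to replicate to. Normally this
            # resolves so fast that humans are never involved.
            self.peer = find_new_peer(cluster, data_size)
        self.peer.free_space -= data_size

def find_new_peer(cluster, data_size):
    """Pick any reachable node with enough free space; fail fast if none."""
    for node in cluster:
        if node.reachable and node.free_space >= data_size:
            return node
    raise ReplicaLostError("no capacity available for a new replica")
```

The key design point is the pessimistic assumption: a silent peer is treated as dead, which is safe when failures are isolated but, as the next paragraph shows, becomes dangerous when a whole group of nodes goes quiet at once.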
When the network change went wrong, one group of EBS nodes lost contact with their replicas. By the time their connection was restored, so many nodes had gone down that when they began replicating again, the available space ran out. That left many nodes stuck in a loop, searching over and over for free space on other nodes when there was none. New requests to create EBS volumes piled up, overwhelming everything else. At 2:40 am, Amazon disabled customers’ ability to create new volumes. Once new requests stopped piling up, it seemed Amazon had turned a corner, but those hopes were short-lived.
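A toy simulation makes the "stuck" loop concrete. The numbers below are illustrative, not from Amazon's report: when the total space that stuck volumes need exceeds the cluster's free capacity, most searches fail, and every failed volume retries each round, so the search traffic keeps growing even though no progress is possible.

```python
# Toy model of the re-mirroring storm: many volumes lose their replicas
# at once and collectively need more space than the cluster has free.
# Purely illustrative; not based on Amazon's actual implementation.

def simulate_remirror_storm(free_space, volumes_needing_replica, max_rounds=5):
    """Each round, every stuck volume retries its search for capacity.

    Returns the volumes still stuck after max_rounds and the total number
    of search attempts, which is the load the storm put on the cluster.
    """
    stuck = list(volumes_needing_replica)
    search_attempts = 0
    for _ in range(max_rounds):
        still_stuck = []
        for size in stuck:
            search_attempts += 1          # every retry hammers the cluster
            if size <= free_space:
                free_space -= size        # found room: re-mirror succeeds
            else:
                still_stuck.append(size)  # no room: volume stays stuck
        stuck = still_stuck
        if not stuck:
            break                         # everyone found a replica
    return stuck, search_attempts

# e.g. 10 units of free space, but volumes totaling 30 units all need a
# new replica at once: most stay stuck and retry every round, which is
# why Amazon's first move was to stop new volume-creation requests.
```

This also shows why disabling new volume creation helped: it capped the list of volumes competing for the same exhausted pool of free space.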
The EBS volumes kept looking for new nodes to replicate to, putting continued strain on the system. By 11:30 am, technicians had figured out a way to quiet things down without affecting other communication between nodes. Once this was done, 13 percent of EBS volumes were still “stuck.” By noon, attention shifted to finding new capacity for the stuck volumes to replicate to. That was not easy: it meant undertaking the time-consuming process of physically moving servers and installing them in the degraded EBS cluster. Naturally, it took far longer than expected, and the process didn’t start in earnest until 2 am the following morning. By 12:30 pm on April 22, only 2.2 percent of affected volumes were still “stuck.”
With the new capacity installed, Amazon started work on letting each node communicate normally again. This had to be done gradually, and the work of dialing it up just right went on well into the early morning hours of April 23. By 6:15 pm that day, operations were almost back to normal–except for the 2.2 percent of volumes that had remained “stuck.” It turned out they would have to be recovered manually. Their data was backed up to Amazon S3, the company’s general storage service. By noon the next day, all but 1.04 percent of that data had been recovered.
A more intensive recovery process was tried, and in the end 0.07 percent of the data involved in the crash could not be recovered, Amazon says.
The company says it is auditing its process for carrying out network changes, which is where the problem started, and that it will “increase automation” to prevent a similar mistake from happening again. From that statement I gather that a human-caused mistake was then exacerbated by the way the cloud system was designed to work. Customers who used the affected services, whether or not their service was interrupted, are getting a 10-day credit. The list of other changes Amazon is promising is long and detailed, ranging from keeping more spare capacity on hand for use in a recovery, to making it easier for customers to use more than one availability zone (those who had done so prior to the outage fared better than those who hadn’t), to improvements to its status dashboard.
Finally, Amazon apologized:
Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.
What’s not clear is the effect, not only on Amazon and its reputation, but also on the planning of customers who rely on its cloud services or are thinking about using them. A failure as widespread and as widely publicized as this may be forgiven, but it won’t be forgotten. Mike Rowan, CTO of RatePoint, a reputation management service for small businesses, indicated on Twitter that he was happy with the billing credit. But time will tell whether Amazon loses customers over this.