Arik Hesseldahl

Amazon's Cloud Crash Is Over, But the Talking About It Isn't

The big crash of Amazon’s cloud, which brought down hundreds of Internet companies that rely on it, is over. Now everyone who was affected in one way or another is comparing notes on how they coped, or didn’t. And for cloud providers not named Amazon, there’s an obvious business opportunity.

Last week I talked with David Young, co-founder and CEO of Joyent, an Amazon rival with 30,000 customers around the world. His criticism of Amazon in this instance is harsh. The way the Amazon cloud is built, he said, virtually guaranteed that an outage like this would happen. On one hand, he gives Amazon high praise for “evangelizing the cloud computing model”; on the other, he disparages it as “the Atari of the cloud.”

“Amazon does not represent the cloud,” Young told me. “These guys are booksellers who got in the cloud business. That’s like Nordstrom’s getting into the cash register business. They may be the market leader, but there’s a bunch of us who are building things in such a way that we don’t see these downtimes.”

It’s hard to miss Amazon’s outsize influence. Though Amazon Web Services accounts for just a fraction of the company’s overall revenue, it is the market leader, controlling about 60 percent of the market. Its rivals include Rackspace, IBM, Joyent and Terremark, which was recently acquired by Verizon.

Joyent, which is privately held and backed by Greycroft Partners and Intel Capital, would never have suffered such an outage because of the way its cloud is built, Young claimed. Amazon, on the other hand, was never designed for what he calls persistent computing: services that customers need to be available all the time. The problem, he said, started in Amazon’s Elastic Block Store (EBS), which is vulnerable to being overwhelmed by demand. He likened it to a run on a bank, where depositors panic and rush to withdraw the cash from their accounts.

Essentially, he said, Amazon became a victim of its own popularity: the EBS infrastructure could not keep up with the demands the network placed on it, and the result was a cascading failure. Joyent has blogged about its theory in greater technical detail.

Meanwhile, some Amazon customers managed to keep their services up despite the crash, and they were comparing notes today. Don MacAskill, CEO of the photo-sharing site SmugMug, blogged about how designing for failure kept the service live during the crash. Others took to Twitter: Mathias Meyer of Basho Technologies noted that running their service in more than one Amazon region helped them avert failure.

There were others. Donnie Flood, VP of engineering at Bizo, an advertising service aimed at business executives, said the company used all of Amazon’s regions except the most recently launched one, in Japan. It combined that with a global Domain Name System (DNS) setup that directed traffic to the nearest Amazon region. When Amazon’s U.S. East region went down, all U.S. traffic was routed to Bizo’s instances in Amazon’s western U.S. data center. “We were able to stay up fully the entire time,” Flood said.

Oren Michels, CEO of Mashery, a cloud-based manager of software APIs, said his company had backup cloud infrastructure from Internap in place, which took over when Amazon failed. Neither Michels nor Flood said he was likely to move the services currently hosted with Amazon to another provider.

Amazon hasn’t made any public statements about what happened beyond those posted on its status dashboard, and it hasn’t responded to any of my requests for comment. The company reports earnings tomorrow, so expect some questions on the conference call about Amazon’s terrible, horrible day that stretched into nearly a week.
