Amazon and the Terrible, Horrible, No Good, Very Bad Day
If you had any doubt about how large a footprint Amazon Web Services has on the modern Web, it became readily apparent today as dozens of companies suffered outages they blamed on failing infrastructure belonging to Amazon.
Companies as large and widely known as Foursquare and as small and unknown as CampgroundManager.com all turned to Twitter to advise their customers that service would be down for a while, apologizing and asking for patience. Because of this, Amazon will have to work extra hard to win back the unquestioning confidence it has so long enjoyed. Meanwhile, competitors like Microsoft Azure, IBM and others will do their best to capitalize on this and lure customers away from Amazon.
Amazon wasn’t helped as the day went on and the list of affected customers grew longer and included ever more prominent names: The New York Times lost service on its Projects subdomain at projects.nytimes.com, a section where the Times publishes special projects like this one on the Census (link goes to Google Cache for now).
Another victim was ProPublica. Three days after winning its second Pulitzer Prize in as many years, the section of ProPublica’s site where it hosts its data-heavy news applications, such as this one which displays federal stimulus funding by county, was out of commission. (Again, the link goes to a Google Cache.)
Everyblock, a hyper-local news site that became part of MSNBC in 2009, is still down as of this writing. Foreign Policy, the Washington Post-owned journal, saw its Web site fail too, but as noted by Jim Romenesko today, it quickly switched to publishing its content on Facebook, and made light of the situation on Twitter.
One victim that surprised me was Heroku, the cloud-based Web development concern that Salesforce.com acquired last year. Heroku kept its users apprised of the situation throughout the day without mentioning Amazon by name. Interestingly, Salesforce's infrastructure showed no sign of trouble all day.
To its credit, Amazon did its best to communicate about the situation all day, but the incident couldn’t help but give its Web services division–which is relatively small as a percentage of revenue but obviously punches above its weight in terms of influence–a black eye. Late in the day it had isolated the trouble to a single “availability zone,” or group of machines running together in its Northern Virginia data center, and was trying to shift services away from the affected zone.
As of 4:20 PM PT, its latest messages indicated it was getting closer to resolving the issue, though many services were still reporting on Twitter that the outage was keeping them offline.
At 1:48 PM PT, the status dashboard for EC2, its compute cloud service, said:
A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
Another message concerning its Elastic Beanstalk service came at 2:16 PM PT:
We have observed several successful launches of new and updated environments over the last hour. A single Availability Zone in US-EAST-1 is still experiencing problems. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
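The advice in these notices amounts to a simple placement rule: honor an explicit Availability Zone only when a customer asks for one, and otherwise steer untargeted launches away from the impaired zone. A minimal sketch of that logic, using hypothetical zone names (Amazon's real scheduler is of course far more involved):

```python
# Sketch of the placement rule described in the status notices.
# Zone names are hypothetical stand-ins for real Availability Zones.
IMPAIRED = {"us-east-1a"}  # the single troubled zone
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

def choose_zone(requested=None):
    """Pick an Availability Zone for a launch request.

    A targeted request is honored as-is (and may fail, as many
    customers found today); an untargeted request is routed to
    the first healthy zone.
    """
    if requested is not None:
        return requested
    healthy = [z for z in ZONES if z not in IMPAIRED]
    if not healthy:
        raise RuntimeError("no healthy Availability Zone available")
    return healthy[0]
```

Under this rule, `choose_zone()` skips the impaired zone, while `choose_zone("us-east-1a")` still lands in it, which is why Amazon kept urging customers not to target a specific zone.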
The outage also affected Amazon’s CloudFormation and CloudWatch.
2:40 PM PT We are continuing to see delays and failures creating and deleting stacks containing EC2, EBS and RDS resources in a single Availability Zone in the US-EAST-1 region. We are working towards a resolution. Please see the Amazon Elastic Compute Cloud (N. Virginia) and Amazon Relational Database Service (N. Virginia) status for more details.
This message went out to Amazon Elastic MapReduce customers at 3:12 PM PT:
Customers can now start job flows with CC1 instances in the US-EAST-1 region by not targeting a specific Availability Zone.
Not that a big and nasty outage isn’t serious business. It certainly is, and I feel for the people at Amazon and all their customers. But having sat through more on-the-job system outages in my career than I care to count, I know that at the end of the day you have to laugh a bit at the head-slapping frustration of it all, and play a little loud music. In that spirit of sympathy and understanding I offer Freddie King’s “Going Down.” Sorry, Amazon. Here’s hoping tomorrow’s a better day.
(Image and headline obviously inspired by the Judith Viorst book I so loved as a kid.)