Amazon's Cloud Crashed Overnight, And Brought Several Other Companies Down Too

The Amazon Web Services status dashboard is reporting an ongoing failure of its EC2 service on its servers based in Northern Virginia. Foursquare, Quora, and Reddit are reported to have been affected. I’ve got a call in to Amazon asking what happened and will update this post as more information becomes available.

A failure in the cloud is of course one of the fundamental problems that its critics always point to. Yes, you can save money and time and effort by farming your IT services and infrastructure out to someone else. But when those services crash unexpectedly, you–and scores of others that rely on the same infrastructure–are left to wonder what’s going on and when it’s going to be fixed.

As of now, ~~it seems like Amazon is getting the situation under control~~, it seems to be getting worse, as other parts of the Amazon service that are tied to EC2 are reporting various failures via Amazon’s status dashboard. Failures are showing up in Elastic Beanstalk, the relational database, and CloudWatch, among others.

Amazon’s status messages are below.

1:41 AM PT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.

2:18 AM PT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.

2:49 AM PT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.

3:20 AM PT Delayed EC2 instance launches and EBS API error rates are recovering. We’re continuing to work towards full resolution.

4:09 AM PT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT

5:02 AM PT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.

Update: Here are more companies that are affected by the outage, according to status updates on Twitter.

Hootsuite, the cloud-based Twitter client, is down because of the outage too. Also down is the Hootsuite URL shortener ow.ly.

SCVNGR is reporting that it is down too as a result of the outage.

Discovr, an iPad music app, reported that it went down, but shortly reported its service was restored.

Wildfire, a social media app, reports that it is down.

Livefyre is down.

Here’s an interesting one. CampgroundManager.com, apparently a software-as-a-service application used to manage campgrounds, says it is down.

A service called Totango, which appears to do something with managing customer relations and subscriptions, had some issues, but moved some things around, and got things mostly working again.

ESchedule, a Canada-based employee scheduling service, reports its service is down.

ZeHosting, a Web host, says it is experiencing slowdowns.

Recorded Future, which bills itself as a “temporal analytics engine,” is reporting an outage.

PercentMobile, a mobile analytics firm, says its service is down.

The Cydia Store, which hosts applications available for jailbroken iPhones, reports it is down.

Here’s the latest update from Amazon:

6:09 AM PT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.

Here’s another update from Amazon:

6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.

Another Amazon update:

7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.

Updating again at 9:10 AM PDT

Amazon continues to post regular updates on its multi-faceted cloud services outage this morning. The latest update message came in about 15 minutes ago, and is reprinted below.

I’ve heard about three other sites that are affected by the outage. Radarsync, a cloud-based service that updates drivers for Microsoft Windows users, tells me via Twitter that its service is down. I’ve also seen that Thrillist is having some trouble sending emails. Venmo, an iPhone-based payment service, is also down.

This is an update on Amazon’s Relational Database service.

8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.

And this is one from the EC2 team.

8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

Update at 10:17 AM PDT: Here are more companies affected by this outage, as offered by contributors to the comments below: ELog.com, a sort of all-purpose notepad in the cloud, is down (link goes to a Google cache).

About.me, which AOL acquired last year, is down, and is currently displaying a message saying “We are currently experiencing an outage.”

ECairn, a social media marketing app, says it’s down.

Travelmuse, a vacation-planning site, is down, though I can’t find an official update on Twitter confirming that it’s connected to Amazon’s troubles.

Web host and design firm Drupal Gardens has a blog entry on its partial outage.

PeekYou, a search company that specializes in information about people, tells me it has experienced some trouble, but has shifted its hosting to compensate.

Gamechanger.io, a service that tracks live baseball scoring stats, is down.

I’m beginning to understand what it feels like to be a radio announcer on a snow day, reciting one school closure after another! The only thing is, there are no kids cheering.

Update at 10:39 AM: Yet more communication from Amazon, which says it is making progress.

10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.

And another concerning the relational database.

10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.

AllThingsD by Writer