Arik Hesseldahl

Amazon's Cloud Crashed Overnight, And Brought Several Other Companies Down Too

The Amazon Web Services status dashboard is reporting an ongoing failure of its EC2 service on its servers based in Northern Virginia. Foursquare, Quora, and Reddit are reported to have been affected. I’ve got a call in to Amazon asking what happened and will update this post as more information becomes available.

A failure in the cloud is of course one of the fundamental problems that its critics always point to. Yes, you can save money and time and effort by farming your IT services and infrastructure out to someone else. But when those services crash unexpectedly, you–and scores of others that rely on the same infrastructure–are left to wonder what’s going on and when it’s going to be fixed.

As of now, the situation does not seem to be getting under control; it seems to be getting worse, as other parts of the Amazon service that are tied to EC2 are reporting various failures via Amazon’s status dashboard. Failures are showing up in Elastic Beanstalk, the relational database service, and CloudWatch, among others.

Amazon’s status messages are below.

1:41 AM PT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.

2:18 AM PT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.

2:49 AM PT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.

3:20 AM PT Delayed EC2 instance launches and EBS API error rates are recovering. We’re continuing to work towards full resolution.

4:09 AM PT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT

5:02 AM PT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.

Update: Here are more companies that are affected by the outage, according to status updates on Twitter.
Hootsuite, the cloud-based Twitter client, is down because of the outage too. Also down is the Hootsuite URL shortener ow.ly.

SCVNGR is reporting that it is down too as a result of the outage.

Discovr, an iPad music app, reported that it went down, but shortly reported its service was restored.

Wildfire, a social media app, reports that it is down.

Livefyre is down.

Here’s an interesting one. CampgroundManager.com, apparently a software-as-a-service application used to manage campgrounds, says it is down.

A service called Totango, which appears to do something with managing customer relations and subscriptions, had some issues, but moved some things around, and got things mostly working again.

ESchedule, a Canada-based employee scheduling service, reports its service is down.

ZeHosting, a Web host, says it is experiencing slowdowns.

Recorded Future, which bills itself as a “temporal analytics engine,” is reporting an outage.

PercentMobile, a mobile analytics firm, says its service is down.

The Cydia Store, which hosts applications available for jailbroken iPhones, reports it is down.

Here’s the latest update from Amazon:

6:09 AM PT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.

Here’s another update from Amazon:

6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.


Another Amazon update:

7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.

Updating again at 9:10 AM PDT

Amazon continues to post regular updates on its multi-faceted cloud services outage this morning. The latest update message came in about 15 minutes ago, and is reprinted below.

I’ve heard about three other sites that are affected by the outage. Radarsync, a cloud-based service that updates drivers for Microsoft Windows users, tells me via Twitter that its service is down. I’ve also seen that Thrillist is having some trouble sending emails. Venmo, an iPhone-based payment service, is also down.

This is an update on Amazon’s Relational Database service.

8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.

And this is one from the EC2 team.

8:54 AM PDT We’d like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

Update at 10:17 AM PDT: Here are more companies affected by this outage, as offered by contributors to the comments below: ELog.com, a sort of all-purpose notepad in the cloud, is down (link goes to a Google cache).

About.me, which AOL acquired late last year, is down and is currently displaying a message saying “We are currently experiencing an outage.”

ECairn, a social media marketing app, says it’s down.

Travelmuse, a vacation-planning site, is down, though I can’t find an official update on Twitter confirming that it’s connected to Amazon’s troubles.

Web host and design firm Drupal Gardens has a blog entry on its partial outage.

PeekYou, a search company that specializes in information about people, tells me it has experienced some trouble, but has shifted its hosting to compensate.

Gamechanger.io, a service that tracks live baseball scoring stats, is down.

I’m beginning to understand what it feels like to be a radio announcer on a snow day reciting one school closure after another! The only thing is, there are no kids cheering.

Update at 10:39 AM: Yet more communication from Amazon, which says it is making progress.

10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.

And another concerning the relational database.

10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.


Comments

  • Anonymous

    Amazon’s last report says: 6:09 AM PDT EBS API errors …
    They are not being open enough about this. It is more than API errors. For elog.com, I can’t launch instances in any way (presumably because they rely on EBS).
    Also, I wasted about an hour of time this morning instead seeing if Amazon had a problem (I’ll know better for next time). Probably a lot of other folks wasted their time this morning too.

  • http://www.facebook.com/jiwanish Shahzeb Jiwani

    I wonder how they will reimburse these customers for this breakdown…Also this definitely has to sway people using this for regular business operations from using AWS. Unfortunate this had to happen with such a great system.

  • http://www.facebook.com/profile.php?id=647865578 Laurie Head Atkinson

    Ugh. That’s rough. I feel for them.

  • Anonymous

    Wow thats pretty annoying. Havent been able to get on Reddit yet today! Uggh. Stupid cloud stuff.

    http://www.total-privacy.int.tc

  • http://twitter.com/Ramkarthik Ramkarthik

    Guess About.Me is also down

  • http://twitter.com/ZacharyRD Zachary Reiss-Davis

    A lot of other startups, for example http://ecairn.com/ appear to have parts of their site down as well – it’s not just the large companies, and if this follows traditional patterns, the startups will be slower to come back up. (eCairn’s homepage is up, but not their application)

  • http://twitter.com/papillonc Carmen Magar

    Travelmuse.com is also down

  • Anonymous

    http://TribePro.com is down also because of this outage

  • http://hette.ma Dennis Hettema

    I first noticed it on our own product strawberryj.am but about.me seems affected by it too.

  • http://profiles.google.com/awadamson Alan Adamson

    gamechanger.io, a baseball scoring and streaming service, is down

  • http://twitter.com/ceelee Conor Lee

    Evite is also down.

  • Anonymous

    FooPets is down–thought it was yet another of their Evil Tricks! lol

  • Anonymous

    Don’t forget, seems like almost all of heroku is down too: http://status.heroku.com/ so, people who’ve built off heroku, which itself has built off of amazon, are also offline.

  • http://profiles.google.com/graynor Gil Raynor

    It has affected MXtoolbox and Zimbra also

  • http://profiles.google.com/graynor Gil Raynor

    It has also affected MXtoolbox and Zimbra email services. I have been without my business email all day.

  • Anonymous

    Moral of the story: business email should not be entrusted to the cloud.

  • http://twitter.com/connectedhq Connected HQ

    Connected (http://connectedhq.com) is down as well, would love to hear the updates from others as far as their success in reviving their EBS volumes.

  • Anonymous

    Our Facebook search engine profileengine.com is down, as are all our Facebook apps.

  • http://www.facebook.com/people/Nora-Rich/550790647 Nora Rich

    So, any idea what it was? Equipment failure, Human Error, Outside attack, something else?

  • http://www.facebook.com/rayloke1 Ray Loke

    CLOUDS CRASHED ;A failure in the cloud is of course one of the fundamental problems that its critics always point to. Yes, you can save money and time and effort by farming your IT services and infrastructure out to someone else. But when those services crash unexpectedly, you–and scores of others that rely on the same infrastructure–are left to wonder what’s going on and when it’s going to be fixed.

  • http://www.facebook.com/matthewgrichmond Matthew Richmond

    In my corporate experience, there’s significantly more downtime when hosting email svcs locally than in the cloud.

  • http://pulse.yahoo.com/_T2OFBA75QDE346MTGABFWGODUY Rosemary Hoon

    I hope they get it fixed before the 25th. Or else some people are going to have messed up faces.

  • http://pulse.yahoo.com/_2KO3REZVYECY6HU3LKDSNWHZCA Girish

    There would not have been such a wait for these businesses or taking such a long time to bring back the site once down if they hosted on Microsoft’s Windows Azure. The site would have been back up in less than a minute due to built in redundancy and high level of SLAs which Microsoft commits to.

  • http://pulse.yahoo.com/_2KO3REZVYECY6HU3LKDSNWHZCA Girish

    No defense or offense. Time to explore Microsoft’s Windows Azure cloud offering with committed SLAs and no-nonsense data center operations worldwide.

  • Jacob Stewart

    Arrrrrg. Our online quoting service @ http://www.thetravelingphotobooth.com relies on EC2 and is still down!

  • shadow16nh

    Both AintItCool.com and BoxOfficeMojo.com are down.

  • http://twitter.com/jackomo Jaki Levy

    This is absolutely BS – so many people are left with hands up in the air with no clue what’s next – I mean in the grand scheme of things, things could be worse – we don’t have people dying (it’s not a natural disaster), but in the world of business and uptime, this is a total catastrophe – many of my clients who rely on amazon have lost many thousand$ due to this outage. It’s time for a switch and backup plan

  • http://www.AllSanDiegoComputerRepair.com AllSanDiegoComputerRepair

    yepyep
    :(

  • http://pulse.yahoo.com/_YGL2PNQMOXIU54TYMQZKSTO6RY Robert

    Sony Online Entertainment is down as well. There portal and marketplace for all there MMO’s

  • http://www.facebook.com/people/Matt-Eggers/100002158704920 Matt Eggers

    Well i work on a make money from home website called cash crate it is completely down here is the address check it out http://www.cashcrate.com/2665885

  • http://www.facebook.com/people/Matt-Eggers/100002158704920 Matt Eggers

    i wonder could this be the problem with internet explorer its hanging taking alot of time to go to next page switched to mozzila im fine now

  • http://www.facebook.com/people/Matt-Eggers/100002158704920 Matt Eggers

    http://www.cashcrate.com/2665885
    that is my site down company that i work for

  • http://www.facebook.com/people/Matt-Eggers/100002158704920 Matt Eggers

    ya well i just spent lots of money on demagraphics for a company on wed for a site that is down

  • http://butyoureagirl.com adriarichards

    Excellent timeline here! Ah, I remember the days of Gmail being down but this is truly epic…

  • http://twitter.com/RozTheDove Rosalind Mills

    We’ll all go back to dead animals :(

  • Mansoor Ahmed

    Bab amazon bad!!

  • http://profiles.google.com/singingbadgerclan shadow manypaths

    foopets has graduated from a 503 to a “standby” message from the site to a blanket 500…any new info/ideas on what is going on, and the prognosis?

  • http://twitter.com/juliarocks893 Animalsrgreatfriens

    Still down but now saying Status: 500 Internal Server Error Content-Type: text/html

  • http://www.facebook.com/people/Arianna-Rose/1205944468 Arianna Rose

    I thought it was another one of their tricks also…

    user member arianna4961

  • http://newenterprise.allthingsd.com Arik Hesseldahl

    I’m curious. I’m not familiar with FooPets, so don’t get the context of “evil tricks.” Anyone care to clue me in?

  • http://newenterprise.allthingsd.com Arik Hesseldahl

    Thanks. I was late in noticing this one and added it in the second post last night.

  • http://newenterprise.allthingsd.com Arik Hesseldahl

    Zimbra is a surprise.

  • http://newenterprise.allthingsd.com Arik Hesseldahl

    What happens on the 25th?

  • http://newenterprise.allthingsd.com Arik Hesseldahl

    These too are surprises. I added a line about them in this morning’s post. (We’re now into day 2 at this point.)
