Amazon Says It's "All Hands On Deck" As Cloud Troubles Enter Day Two
It has now been about 29 hours since the first signs of trouble emerged on Amazon Web Services, trouble that at its worst brought down a good slice of the Web along with it. Yesterday was a bad day for cloud computing.
While most of the larger sites that were initially affected–like Foursquare and Quora–have come back, lots of other sites are still having trouble. The New York Times special project site is still down, as is the news application page on ProPublica. Everyblock remains offline this morning. Heroku said it was partially back, but still working to restore shared databases. Two sites popular with the Hollywood crowd, Harry Knowles’ Ain’t It Cool News and Box Office Mojo, were both down this morning.
One site that caught my eye on Twitter this morning was Blue Sombrero, which offers a service for clubs and organizations to register their members online for events. It took to Facebook this morning to vent its frustration and apologize to its customers. There are still, apparently, lots of Blue Sombreros out there. A site called EC2disabled.com purports to be a complete list of sites affected by the outage.
Amazon continues to send intermittent status updates via its dashboard, but it’s difficult to get much of an idea as to when it expects to be fully back to normal. Minutes ago, though, it said it is starting to see “meaningful progress” in getting things under control. These are the latest messages from its EC2 status feed:
6:18 AM PT We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we’ll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.
Earlier it said:
2:41 AM PT We continue to make progress in restoring volumes but don’t yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.
And before that:
10:58 PM PT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
One of the companies listed on the EC2 Disabled site is Sony Online Entertainment. It’s probably just a coincidence, and its Web site is up this morning. However, it’s notable because of another ongoing outage, also now in its second day, of Sony’s Playstation gaming network. Sony’s latest official blog entry says the network may be down for a day or two.
The buzz about this outage, though Sony hasn’t addressed it, is that it’s the result of another attack by Anonymous, the loose coalition of hackers who are outraged at Sony’s recently settled legal fight with George Hotz. Known by the nom-de-keyboard GeoHot, Hotz had figured out a way to jailbreak the Playstation 3 so that it could run games not approved by Sony. Anonymous has denied involvement. The outage started Wednesday night and coincided with the release of three big games.