On October 22nd, 2012, Amazon experienced an outage in its US East Region, causing the Socialize API and website to be unavailable for several hours. Many other popular web properties, including Reddit and Airbnb, were also affected.

We asked Amazon to provide us with a detailed accounting of what happened, along with a post-mortem, so we could better evaluate the problem and determine how to keep future problems like this from affecting Socialize.

Below is the response we received directly from Amazon:

Summary of Post-Mortem:

  • The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers.
  • Memory pressure on many of the EBS servers reached a point where they began losing the ability to process customer requests, and the number of stuck volumes increased quickly.
  • We have deployed monitoring that will alarm if we see this specific memory leak again in any of our production EBS servers, and next week, we will begin deploying a fix for the memory leak issue.
  • The primary event only affected EBS volumes in a single Availability Zone, so those customers running with adequate capacity in other Availability Zones in the US East Region were able to tolerate the event with limited impact on their applications. However, we’ve heard from customers that they struggled to use the APIs for several hours. We now understand that our API throttling during the event disproportionately impacted some customers and affected their ability to use the APIs.

While we always have a base level of throttling in place, the team enabled a more aggressive throttling policy during this event to try to ensure that the system remained stable during the period where customers and the system were trying to recover. Unfortunately, the throttling policy that was put in place was too aggressive.
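Amazon's note about throttling is also a reminder that API clients can protect themselves. The sketch below is our own illustration, not part of Amazon's response: it retries throttled AWS API calls with exponential backoff and jitter, using today's boto3 SDK for illustration. The throttling error codes listed ("Throttling", "ThrottlingException", "RequestLimitExceeded") are the ones we would expect to see; treat the specifics as assumptions rather than a definitive recipe.

    # Sketch: retry AWS API calls that fail because of throttling.
    # Assumes boto3 is installed and AWS credentials are configured.
    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    THROTTLE_CODES = {"Throttling", "ThrottlingException", "RequestLimitExceeded"}

    def call_with_backoff(fn, *args, max_attempts=5, base_delay=0.5, **kwargs):
        """Call fn, retrying with exponential backoff plus jitter on throttling errors."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except ClientError as err:
                code = err.response.get("Error", {}).get("Code", "")
                if code not in THROTTLE_CODES or attempt == max_attempts:
                    raise  # not a throttling error, or retries exhausted
                # Wait 0.5s, 1s, 2s, ... plus random jitter before the next attempt.
                time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.25))

    if __name__ == "__main__":
        ec2 = boto3.client("ec2", region_name="us-east-1")
        reservations = call_with_backoff(ec2.describe_instances)
        print(len(reservations.get("Reservations", [])), "reservations described")

Newer AWS SDKs also ship built-in retry behavior (for example, botocore's "standard" and "adaptive" retry modes), which may be preferable to hand-rolled backoff.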

  • Single-AZ database instances are exposed to disruptions in an Availability Zone. In this case, a Single-AZ database instance would have been affected if one of the EBS volumes it was relying on got stuck. During this event, a significant number of the Single-AZ databases in the affected zone became stuck as the EBS volumes used by them were affected by the primary EBS event described above (see the sketch after this list for one way to spot such instances).
  • In the case of these Single-AZ databases, recovery depended on waiting for the underlying EBS volumes to have their performance restored.
  • ELB load balancer instances use EBS for storing configuration and monitoring information. When the EBS volumes on these load balancer instances hung, some of the ELB load balancers became degraded, and the ELB service began executing recovery workflows to either restore or replace the affected load balancer instances.
  • During this event, a number of Single-AZ load balancers in the affected Availability Zone became impaired when some or all of their underlying load balancer instances became inaccessible due to the primary EBS issue. These affected load balancers recovered as soon as the ELB system was able to provision additional EBS volumes in the affected Availability Zone or, in some cases, when the EBS volumes on which particular load balancers relied were restored.
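The recurring theme in Amazon's answer is that single-AZ dependencies are the weak point. As a rough illustration from our side (again, not part of Amazon's response), the sketch below uses today's boto3 SDK to list RDS database instances that are not running Multi-AZ; the response fields it reads (DBInstances, MultiAZ, DBInstanceIdentifier, AvailabilityZone) match the describe_db_instances API as we understand it.

    # Sketch: flag RDS database instances that run in only one Availability Zone.
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    def single_az_databases(region="us-east-1"):
        """Return (identifier, availability zone) for each non-Multi-AZ RDS instance."""
        rds = boto3.client("rds", region_name=region)
        flagged = []
        for page in rds.get_paginator("describe_db_instances").paginate():
            for db in page["DBInstances"]:
                if not db.get("MultiAZ", False):
                    flagged.append((db["DBInstanceIdentifier"], db.get("AvailabilityZone")))
        return flagged

    if __name__ == "__main__":
        for name, az in single_az_databases():
            print(f"{name} runs only in {az}; consider enabling Multi-AZ")

An instance flagged this way can be converted in place by calling modify_db_instance with MultiAZ=True (a change worth scheduling and testing outside of an incident), and a similar audit can be run against load balancers to confirm they span more than one Availability Zone.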
