Amazon’s Outage: Architecting for Failure in the Cloud

Stanton Jones

The uproar surrounding the partial outage of Amazon’s EC2 cloud services platform got some new life last Friday, when the company released a detailed post mortem of the incident. The summary includes a surprising level of detail on the root cause of the outage, information on a service credit for affected customers, and, finally, an apology from Amazon.

To recap: on April 21st, Amazon made a configuration change during a network upgrade that caused a cascading series of events that resulted in what it calls a “re-mirroring storm.” As a consequence, Amazon’s storage service was essentially “stuck” and unable to locate new storage space for either new or existing customers. This led to a significant period of degraded functionality and downtime for major Web 2.0 sites such as Reddit, Foursquare and HootSuite, as well as a plethora of bad PR for Amazon, a cloud computing pioneer.

Interestingly, though, some other sites were only moderately affected, or not affected at all, most notably Netflix. Why did some sites sink, while others sailed through the storm? As with any complex system, there is no one answer. However, organizations that “architected for failure” tended to fare better than those that did not. Netflix, which recently released its lessons learned from the outage, built its platform around the assumption that services and/or zones within EC2 could be unavailable for extended periods of time.

Clearly, not every EC2 customer has the technical chops to build Netflix-like applications, and Amazon needs to make it easier to increase redundancy by taking advantage of multiple availability zones. However, this is a public cloud platform, and customers that did not take full advantage of Amazon’s redundant architecture, or that did not create their own replicated solutions, ended up paying the price.
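To make “architecting for failure” a little more concrete, here is a minimal Python sketch of the basic idea: assume any one zone may be unavailable, and fall back to the next. It is an illustration only, not Netflix’s or Amazon’s actual implementation. The zone names, the `ZoneUnavailable` exception, and the `demo_call` function are all hypothetical stand-ins for a real service client.

```python
from typing import Callable, Sequence


class ZoneUnavailable(Exception):
    """Raised by a (hypothetical) zone client when its zone cannot serve a request."""


def call_with_zone_failover(zones: Sequence[str], call_zone: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    failures = []
    for zone in zones:
        try:
            return call_zone(zone)
        except ZoneUnavailable as exc:
            # Record the failure and move on to the next zone.
            failures.append(f"{zone}: {exc}")
    # Every zone failed, so surface one combined error to the caller.
    raise ZoneUnavailable("all zones failed: " + "; ".join(failures))


def demo_call(zone: str) -> str:
    """Stand-in for a real service call; pretends that 'zone-a' is down."""
    if zone == "zone-a":
        raise ZoneUnavailable("storage stuck")
    return f"served from {zone}"


if __name__ == "__main__":
    print(call_with_zone_failover(["zone-a", "zone-b"], demo_call))
```

A real deployment would also need the data itself replicated across zones; retrying a request elsewhere only helps if the other zone holds a usable copy.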

While architecting for failure in the cloud may appear to be a purely IT responsibility, it’s not. TPI views cloud as one component of your service delivery strategy, and regardless of whether the components are delivered in-house or outsourced, effective planning across all of them is vital.

Business continuity planning (BCP) and disaster recovery (DR) are two critical parts of this planning process. Unlike a traditional outsourcing agreement, in which the supplier takes on a significant level of responsibility for delivering the BCP and DR plans, the public cloud requires that customers retain a significant amount, if not all, of this responsibility, as well as the associated risk. In return they get a highly scalable, cost-effective and fast-to-provision computing platform.

Bottom line: Some organizations are finding out the hard way what happens when they don’t integrate cloud with their overall service delivery strategy. By including business continuity planning and disaster recovery in the cloud architecture design process, enterprises can significantly reduce the risk of business disruption.

About Stanton Jones

Stanton Jones helps ISG clients rationalize and capitalize on emerging technology services within the context of the global outsourcing market. Stanton uses his unique background in both IT and outsourcing advisory services to bring a new and unique perspective to ISG clients. Prior to his analyst role, Stanton led corporate technology strategy and global IT operations as TPI’s Chief Information Officer. Stanton played a key role in leading the transition of TPI into a publicly-traded unit of ISG. Twitter: @stantonmjones Email: Stanton.Jones@isg-one.com