The Cheap Cloud versus The Reliable Cloud

5 Lessons Learned from the June 29, 2012 AWS Outage

Discussing a difficult situation is never fun, and I have been wrestling with how to start this post. It’s about revealing unpleasant cloud truths. And not necessarily the truths you might be expecting to hear. I am not here to preach, but my message to you is important. For the past five years I have been working on a project that uses the cloud to its fullest potential, celebrating the victories and learning from the defeats.

I’m speaking to my fellow Amazon cloud citizens. My co-tenants, if you will, in the “Big House of Amazon.” We’re all living together in this man-created universe with its own version of “Newtonian Laws” and “Adam Smith” economics. 99.99% of the time all is well… until out of the blue it’s not, and chaos upends polite cloud society.

If you lost data or sustained painful hours of application downtime during Amazon’s June 29 US-East outage, then you can only wag your finger in blame while looking in the mirror.

I know, I know, the cloud is supposed to be cheap AND reliable. We’ve been telling ourselves that since 2007. But this latest outage is an important wake-up call: we’re living in a false cloud reality.

Lesson 1: Follow the Cloud Rules

Up front, you were told the “rules of the cloud”:

  • Expect failure on every transaction (see the retry sketch after this list)
  • Backup or replicate your data to other intra-cloud locations
  • Buy an “insurance policy” for worst case scenarios
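
“Expect failure on every transaction” is a design rule, not a slogan: every call to the cloud’s APIs should assume it may time out or error, and retry accordingly. Below is a minimal Python sketch of that idea using exponential backoff with jitter; the function name, retry limits, and the commented usage line are illustrative assumptions, not part of any AWS SDK.

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run a cloud API call, retrying on failure with exponential backoff
    plus jitter. Re-raises the last error if every attempt fails --
    'expect failure on every transaction.'"""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # in real code, narrow this to your SDK's transient error types
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay)
            time.sleep(delay)

# Hypothetical usage:
# result = with_retries(lambda: s3_client.get_object(Bucket="my-bucket", Key="report.csv"))
```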

These rules fly in the face of the popular notion that the cloud is “cheaper” than do-it-yourself hosting.

There is a silver lining to this dark cloud event. Everyone in the cloud will learn and improve so we don’t have to repeat this episode ever again.

You see, living by the “cloud rules” requires more investment of money and time to design a resilient application that can survive a major cloud failure. There’s an onus on Amazon too. They created the rules, and they also have to hold up their end of the bargain. When you buy their “insurance policy,” they have to be there when you redeem it.

The rush to the cloud was made easy because we could charge our “pay as you go” infrastructure to a credit card, and we duped ourselves into believing that unlimited ten-cent-per-hour compute guarantees a fantastic up-time SLA. It doesn’t. June 29th reminded us all that the cloud is as fallible as any other hosting environment. And a cloud failure has an exponentially larger negative impact because so many services are deployed on a single shared infrastructure.

Lesson 2: Every Project Needs a Risk Management Assessment

The cloud’s beauty and attraction is that a small team can serve a large audience. Perhaps what’s been missing is expert supervision and serious conversations about risk management. There are proven “frameworks” (ITIL, NIST) that can help cloud application companies create their own risk assessment calculators. If you’re serious about your application’s SLA to your customers, you need to invest time and money.
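
As a toy illustration of what such a calculator can look like, here is a short Python sketch that ranks threats by a simple likelihood-times-impact score. The threats, scores, and the “mitigate at 12 or more” threshold are invented for illustration; they are not the ITIL or NIST methodology itself.

```python
# Toy risk register: score = likelihood (1-5) x impact (1-5).
# Every threat name and number below is an illustrative placeholder.
RISKS = [
    ("Single-AZ power loss",             3, 4),
    ("Region-wide control plane outage", 2, 5),
    ("Accidental data deletion",         2, 5),
    ("DNS misconfiguration",             3, 3),
]

for threat, likelihood, impact in sorted(RISKS, key=lambda r: r[1] * r[2], reverse=True):
    score = likelihood * impact
    action = "mitigate now" if score >= 12 else "accept and monitor"
    print(f"{threat:35s} score={score:2d} -> {action}")
```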

Every business model has an implied up-time guarantee. With free apps, the prevailing thought is “hey, you get what you pay for!” when there is unexpected downtime. For paid applications, the up-time SLA is proportional to criticality to business continuity. For example, a business email service has higher up-time expectations than a CRM service. But the great thing about cloud applications is that the architects can design systems to pass the underlying up-time SLA cost differences to the end customer. If customer “Acme” needs 99% up-time, there is a cost for that. If customer “Initech” needs 99.99% up-time, there is an increased cost to support that too. This is the cloud’s strength: to granularly match requirements to cost. Amazon gives us great building blocks with different price points that embody this thinking.
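
The arithmetic behind those tiers is worth spelling out, because each extra “nine” shrinks the downtime budget by roughly a factor of ten. A quick Python sketch of the downtime each up-time level allows (the levels are the ones named above; what each level costs to deliver has to come from your own model):

```python
# Downtime budget implied by an up-time SLA, per 30-day month and per year.
HOURS_PER_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24

for sla in (0.99, 0.999, 0.9999):
    monthly_minutes = (1 - sla) * HOURS_PER_MONTH * 60
    yearly_hours = (1 - sla) * HOURS_PER_YEAR
    print(f"{sla:.2%} up-time allows {monthly_minutes:6.1f} min/month "
          f"({yearly_hours:5.2f} h/year) of downtime")
```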

You can buy just enough infrastructure for the “cheap” cloud or you can buy the appropriate infrastructure to ensure a reliable cloud implementation.

This is a serious and involved conversation about insurance and risk management. With the cloud, it’s ultimately the end-customer’s choice for what they need, and cloud apps simply act as the middle man. This is new thinking and wasn’t economically possible in the pre-cloud hosting days.

Historically, the cloud’s thrifty mindset kept us from understanding the need for insurance. We have deluded ourselves into thinking we can have cheap infrastructure along with reliability. As a result, we have a perception problem.

My own thinking about acceptable risk levels, recovery point objectives (RPO), and recovery time objectives (RTO) has evolved over the past nine months. Sonian is completing a FISMA Moderate assessment, which requires exacting policies that encompass over 300 controls. It’s a colonoscopy for cloud start-ups, with the goal of early detection warding off potentially serious problems.

Increasingly, cloud design and architecture patterns need to reference back to a risk management executive statement. All the tools are available to achieve the most reliable SLA, but they come at varying costs.

Again, we’re back to that classic axiom: you get what you pay for. If we want a better, battle-tested cloud, we have to be willing to pay the appropriate price. But too often our penny-wise, pound-foolish behaviors work against the most prudent course of action.

Lesson 3: The Cloud is Fallible

What did we learn from June 29?

Review the technical post-mortem for your own conclusions. I found the following highlights:

The first wave: a severe electrical storm passing through northern Virginia caused a power outage in one availability zone (AZ) in US-East. Backup generators could not handle the load, UPS batteries depleted quickly, compute and storage services terminated due to power loss, and some RDS fail-overs did not complete due to a software bug. Elastic Load Balancer (ELB) did not route requests to the remaining AZs. At this point we had dead compute nodes and EBS storage volumes in unknown states (as opposed to a clean shutdown).

The second wave was the power-failure recovery. Within 10 minutes power was restored, while customers were already starting to launch services in other availability zones. This crush of activity was hobbled by a design flaw that funneled all requests through a single queue. Complicating matters, a bug within Elastic Load Balancer (ELB) swamped the queues with excessive requests, significantly exacerbating the single-queue design flaw. AWS personnel stepped in to manually route requests. Customers in other US-East AZs not affected by the power outage started to have problems with the APIs and control panels because the common control plane stayed in a read-only state longer than designed. (Existing running instances can operate in read-only mode, but changes to start or stop services are frozen until read-write access is restored.)

The tally:

  • 1 electrical storm directly affected
  • 1 AZ in 1 region (US-East), and
  • 4 AZs were indirectly affected, because of
  • 2 bugs and
  • 2 design “flaws,” along with
  • several deficient “risk mitigation” policies and procedures

Prior to this series of events we were told a power failure in one AZ would be contained, yet the resulting domino effect crippled services across the whole US-East region. This is the key lesson from the event: while the cloud is fallible and we can never expect perfect up-time, there are application design patterns that show a path to superior resiliency. But planning is required to architect, build, test, and pay for that extra resiliency. If you are not prepared to do all of the above, then don’t gripe when the cloud fails you.

Amazon learned valuable lessons in the line of fire. The affected customers’ pain will not be in vain. In addition to the backup generator problems, Amazon found two bugs and two design flaws. The bugs are the most upsetting, since they broke the “insurance policy” that AWS customers had purchased. These customers, using RDS in multi-AZ mode, did the right thing and paid for multi-AZ insurance. But when they needed to cash in their policy, the service failed them, through no fault of their own except being in the wrong place at the wrong time.

What Happens Next?

Slow down and determine your project’s responsibility to your users’ up-time expectations. Also separate application up-time SLAs from data resiliency SLAs. It’s probably not OK to lose any data, but it may be OK if your users can’t access their data for 10 minutes a (week, month, year).
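
One way to keep those two SLAs separate is to write them down as distinct numbers per customer tier: an RTO (how long users may be locked out) for application up-time, and an RPO (how much recently written data may be lost) for data resiliency. A hypothetical Python sketch follows; the tier names and targets are invented for illustration.

```python
# Hypothetical tiers: RTO = tolerable downtime, RPO = tolerable data loss.
TIERS = {
    "free":     {"rto_minutes": 240, "rpo_minutes": 60},
    "standard": {"rto_minutes": 60,  "rpo_minutes": 5},
    "premium":  {"rto_minutes": 10,  "rpo_minutes": 0},  # zero RPO implies synchronous replication
}

def meets_sla(tier, downtime_minutes, data_loss_minutes):
    """True if an incident stayed within the tier's RTO and RPO targets."""
    target = TIERS[tier]
    return (downtime_minutes <= target["rto_minutes"]
            and data_loss_minutes <= target["rpo_minutes"])

print(meets_sla("standard", downtime_minutes=12, data_loss_minutes=0))  # True
```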

At the end of the self-analysis, the resulting enhanced architecture and the increased expense to implement it (purchasing reserved capacity, for example) may make the cloud unaffordable for your application’s needs. If so, make plans to move to a non-cloud hosting facility. But before any knee-jerk reaction, make a fair apples-to-apples comparison. No hosting facility, whether cloud or dedicated, will ever be perfect. Application software can bridge the gap.

Lesson 4: Buy “Cloud Insurance” in Addition to Insuring Yourself 

What Can Cloud Citizens Do Better?

Risk management guidelines need to be defined by company leadership and implemented by the product management group to ensure the goals are being met at all levels. Risk management should not be an afterthought driven solely by engineering. It’s a challenge to get non-technical management team members to pay attention to the nuances of technical risk management; the technical jargon needs to be explained in easy-to-understand business terms.

Use software and good design to build your own resiliency in the cloud, and augment it with “cloud insurance.” This means purchasing reserved capacity in other regions. This means using multi-AZ protection for RDS.
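
For RDS, the multi-AZ “insurance policy” is a single flag at instance-creation time. A minimal sketch using today’s boto3 Python SDK; the identifiers, sizes, and credentials below are placeholders, not a recommended configuration.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a database with a synchronous standby in a second availability
# zone, so an AZ-level failure can fail over instead of losing data.
rds.create_db_instance(
    DBInstanceIdentifier="example-db",      # placeholder name
    DBInstanceClass="db.m5.large",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",  # use a secrets store in practice
    AllocatedStorage=100,                   # GiB
    MultiAZ=True,                           # provision the standby replica in another AZ
)
```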

What Can AWS Do Better?

AWS has to provide a 100% available control plane. With it we can maneuver in the cloud as needed to ensure application up-time. Without the control plane we’re flying blind “in the cloud.”

AWS should also offer a range of White Glove Concierge services (a different goal than the current Premium Support): specialists available to hand-hold customers through a crisis, manually expediting infrastructure marshaling, and so on. A cloud customer should be able to buy the level of protection needed during crisis situations.

The June 29th event caused a “bank run” mentality. Customers started grabbing cloud compute and storage in case they needed it, and this activity added more stress to the situation. Perhaps temporary “martial law,” with metered consumption, could ensure control in a time of chaos. In these moments it’s customer versus customer, and whoever grabs instances first wins.

Lesson 5: “Cheap” and “Reliable” Can Coexist

“The Cloud” is a whole new way of thinking about application hosting infrastructure. In a world where copious quantities of inexpensive compute and storage seem luxurious, care must be taken to use software to turn “cheap” infrastructure into a “reliable” end-to-end system. Prior to the cloud, the IT industry focused on making fixed hardware hosting more reliable. The design patterns perfected in that era are almost the exact opposite of what the cloud needs. To be successful in the cloud we need to forget the past and embrace our collective current and future needs.

Conclusions

It’s more than ironic that a “real cloud” took out the first popular IT cloud. This event was a black swan scenario. Every day the cloud celebrates incremental wins, only to see those wins wiped away by a single (in hindsight avoidable) catastrophe.

For a few, this event was their own tipping point to leave the cloud. For most, I hope, this is a rallying moment to double-down and make the cloud work reliably AND economically.

The choice is simple. We can have our “cheap cloud” and hope for very few black swan events. Or we can take up the challenge to play the cloud to its fullest possible potential. That means spending money (more of it, and more wisely), slowing down, and being creative as hell to master our cloud destiny.