Blog Series: Cloud Masters

New Cloud Rules: Replace Instead of Fix

Here’s an all too common scenario from the “cloud chronicles.” A virtual machine that has been operating just fine for days, and has 50 other identical twins with the same configuration, starts to exhibit problems. Slow virtual disk performance. Network brown-outs. Disconnecting and reconnecting within its functional cluster. Monitoring systems alert on degrading performance, and the knee-jerk response is to jump on the box (nee VM) and start to troubleshoot the issue. The problem is, spending any time troubleshooting an anomaly in the “cloud” is the wrong reaction. In the cloud, the first response, when a node starts to exhibit erratic behavior, should be to replace, not fix.

Replacing, instead of fixing, goes against the ingrained habits of over two decades of entrenched IT best practices. In the pre-cloud world, when real hardware was the base, we had to “fix IT” because replacing was too expensive and not practical. There was not an endless pile of spares lying about for a “replace IT” mindset.

But in the cloud, with, in theory, nearly infinite CPU, the remediation to an errant node should be to immediately replace, and move on.

Why Is This?

Because there are too many causes beyond our control at the OS level in a cloud environment. Think of the cloud like living in a high-rise building. Each unit in the building, just like each cloud customer, can have whatever interior they want, but there are also massive shared resources in the building. So while our interior may be a candidate for the next architectural digest cover, our neighbor could “kill our chill” with a too-loud stereo boom box. The cloud suffers from the noisy neighbor problem just like our theoretical high-rise. But in the cloud, we can choose to move and jump back into the random lottery for a new unit. We can’t change the building, but we can change the location within the building.

Of course, you need the right cloud-centric architecture to be able to simply “replace IT” instead of “fix IT.” Having cloud-dexterity is critical to operating a successful cloud deployment.

The cloud requires us to “un-learn” the best practices of the past and embrace a new way of thinking about “break fix.” While replacing instead of fixing may seem wasteful, it’s really not. The time spent troubleshooting the random problem will not yield significant insights, and could be better spent focusing on more value-add projects. Usually after extensive diagnosis, the only recourse is to replace the node, since the original problem was an outlier.
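
To make the “replace, not fix” rule concrete, here is a minimal sketch of a remediation policy. It is only an illustration: the NodeHealth record, the failed_checks metric, and the threshold of three are all hypothetical names and values, not part of any real monitoring tool. A real deployment would hand the resulting list to whatever automation launches a fresh node and terminates the old one.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    node_id: str
    failed_checks: int  # consecutive failed health checks (hypothetical metric)


def plan_remediation(nodes, max_failed_checks=3):
    """Return the IDs of nodes to replace rather than troubleshoot.

    Any node at or above the failure threshold is simply marked for
    replacement; no per-node diagnosis is attempted.
    """
    return [n.node_id for n in nodes if n.failed_checks >= max_failed_checks]


# Hypothetical fleet: only node-02 is misbehaving.
fleet = [
    NodeHealth("node-01", 0),
    NodeHealth("node-02", 5),
    NodeHealth("node-03", 1),
]
to_replace = plan_remediation(fleet)
print(to_replace)
```

The point of the design is that the policy never asks why a node is failing, only whether it is failing often enough to be swapped out.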


 

Cloud Success Requires Cost-aware Engineering

This is a true story from the “Cloud Cost Czar Chronicles.”

Our S3 “penny per one thousand” API costs started to rise rapidly in the second half of the cloud infrastructure billing period. We have seen this behavior before, and knew this could be attributed to increased usage, a new defect, or a design flaw that rears its head at a scaling tipping point. My job as “cost czar” is to raise the alarm and work with the team to figure out what is going wrong. At the observed rate of increase, the excess charges would push the monthly bill beyond the budget. One thing we have learned in the cloud is that costs can rise quickly but take a while to go down, since the deceleration effect can be out of proportion to the acceleration if trying to manage expense in a single billing period.

When we started using Amazon Web Services S3 (a PaaS object store) back in 2007, we were acutely aware of the three pricing vectors in effect: storage consumed, price for API calls to store and list data, and price for API calls to read and delete data. We’ve been using S3 heavily for five years and we tried to model the “all-in” costs as accurately as possible. But “guesstimating” costs beyond the raw storage was a stretch. PaaS services have an intrinsic “social engineering” element. If you color outside the lines the financial penalty can be significant. But if you master the pricing game, the rewards are equally significant. So five years ago we thought as long as we point in the right general direction, “we’ll figure it out as we go along.” Some assumptions proved a positive surprise. Raw storage costs went down. Some surprises were not so pleasant: continually wrangling the API usage fees, especially the transactions that cost a penny per thousand, proved to be a constant challenge. But I still like my options with S3 compared to buying storage from a hardware vendor and having to incur the administrative overhead. With S3 we can lower our costs by smarter engineering. With storage hardware, the only way to lower costs is to wrangle a better deal from an EMC salesperson. As one of the original “cloud pioneers,” Sonian is not alone in this effort, and it’s been a real eye-opener for software designers to have to think about how their code consumes cloud resources (and expense) at scale. Because whether a penny per thousand or penny per ten thousand, when processing hundreds of millions of transactions a month, any miscalculation suddenly brings a dark cloud raining over your project.
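
The arithmetic behind “a penny per thousand” at scale can be sketched in a few lines. The monthly request volume below is a hypothetical figure chosen only to fall in the “hundreds of millions” range mentioned above; the two price tiers are the ones named in the text.

```python
def monthly_request_cost(requests, price_per_batch, batch_size):
    """Cost in dollars of `requests` API calls priced per batch of calls."""
    return requests / batch_size * price_per_batch


# Hypothetical volume: 300 million API calls in one billing period.
requests_per_month = 300_000_000

# Tier priced at one cent per 1,000 requests.
cost_per_thousand = monthly_request_cost(requests_per_month, 0.01, 1_000)

# Tier priced at one cent per 10,000 requests.
cost_per_ten_thousand = monthly_request_cost(requests_per_month, 0.01, 10_000)

print(f"penny per thousand:     ${cost_per_thousand:,.2f}")
print(f"penny per ten thousand: ${cost_per_ten_thousand:,.2f}")
```

At that assumed volume the two tiers differ by a factor of ten, roughly $3,000 versus $300 a month, which is why a design that issues more calls than planned shows up so quickly on the bill.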


Cost Transparency in the Cloud

Today Amazon Web Services lowered S3 “standard” pricing for storage volumes less than 450 terabytes. Standard service (STD) is the very reliable “eleven-nines” SLA. This is the original “gold standard” for cloud-based object store. S3 is a great example of Platform as a Service (PaaS) storage. This price decrease is interesting. In the past, instead of lowering the price of the standard service, Amazon created a new class of storage, Reduced Redundancy (RRS), with a different SLA and a different price. RRS offers “four nines” of durability: a lower price in exchange for less durability than the “eleven nines” of the standard service.

RRS was the recognition that cloud customers didn’t need “one size fits all” storage, but instead would benefit from different types of building blocks with lower price points and varying service qualities. But in order to realize the lower price for RRS, AWS customers needed to write code or change behaviors. So today, AWS gave us a gift. The same reliable service we used yesterday is five to ten percent less expensive today. With no work. We didn’t have to write one line of code.

This very public price reduction got me thinking about cloud cost transparency: in the cloud, all customers in the distribution chain know all the underlying costs. Our customers know how much we are paying for storage. So do our competitors. What this means is that the cloud, unlike the old co-located world, propels a new era of transparency and a healthy set of “checks and balances.”

Price transparency forces each value add layer in the cloud to amplify the innovation.

Amazon clearly found a better way to manage data and passed some or all of the savings to the customer. In turn, Sonian, as a good cloud customer, will amplify with a positive change to our flagship archiving service.

A 2007 Multi-Cloud Fantasy Becomes a 2012 Reality

Five years ago I wrote a business plan that described an archiving SaaS project built on cloud computing. In 2007 that was an uphill battle to convince prospective investors “the cloud was the future.” And at that time there was really only one cloud, from the e-commerce giant Amazon. Amazon Web Services really started the modern cloud movement. No existing IT provider (IBM, HP, Microsoft, Dell, etc.) would have had the gusto to upset their current business model with a “disruptively priced” cloud option. For the past four years those IT giants fought the cloud momentum until they had a credible cloud themselves. But for a lean start-up getting funded five years ago, it wasn’t a stretch to assume other clouds would appear to take on Amazon.

The graphic above was my crude way to visualize how a cloud-powered digital archive, anticipating someday living on multiple clouds, could in essence become a “cloud of clouds.” A lot of positive breakthroughs would need to occur to be able to successfully operate a single reference architecture software stack across more than one cloud. There was no terminology to describe this desire. We weren’t using terms like “Big Data” or “DevOps” nor many of the acronyms that today are common lingo in our modern cloud-enabled world. The business plan depicted a system designed to manage lots of data, and being an enterprise document archive, the data itself was large in size and numerous in quantity. We probably started one of the world’s first cloud-big-data projects.

In the beginning the multi-cloud goal was a fantasy dream, a placeholder for a future that seemed possible, but the actual crawl, walk, run steps not precisely defined because we didn’t yet know “what we didn’t know.”

So why in 2008 were we thinking about “multi-cloud?” The answer is we wanted to avoid single vendor lock-in and maintain a modicum of control over our infrastructure costs. The notion of an evolving multi-cloud strategy meant the ability to seek lowest cost of goods from multiple cloud vendors. In the pre-cloud IT world, when services were built on actual hardware, pricing flexibility was derived by negotiating better deals with hardware vendors. The customers didn’t know or care that their SaaS app might be powered by an HP server one day or a Dell 1U box the next. Those decisions were up to the discretion of the SaaS provider to get the best infrastructure value by shopping vendors. But in a single cloud, when there is only one choice, there’s no ability to negotiate between multiple vendors, unless you have multi-cloud dexterity.

Multi-cloud capable means the necessary infrastructure and abstraction layer is available to run a single common reference architecture on different clouds at the same time, with one master operator console. Multi-cloud is almost like, but not exactly, the concept of running a common program across IBM, DEC, Control Data mainframes. The clouds today somewhat resemble massive time-sharing mainframes of the previous decades.

Our early start five years ago, and all the hard lessons learned since, allow us to easily assume a commanding position in multi-cloud deployments. Engineering teams just now starting their “cloud journeys” will learn from us pioneers, but there is an old saying: “until you’ve walked a mile in my shoes, don’t claim to know anything otherwise.”


The Problem with PaaS Pricing: Total Cost Uncertainty at Scale

 

Highlights of this post:

  • PaaS costs are difficult to predict at scale
  • IaaS costs are going down due to improved operational proficiency
  • Admin cost differences between IaaS and PaaS are negligible
  • PaaS should be less expensive to get better market traction

 

Here’s a handy decoder ring for all the acronyms in this post:

  • IaaS = Infrastructure as a Service. On-demand compute and storage typically available as an API call.
  • PaaS = Platform as a Service. On-demand turn-key web service that abstracts scaling, and reliability available as an API call.
  • AWS = Amazon Web Services
  • EC2 = AWS’s Elastic Compute web service
  • DIY = Do it Yourself
  • DevOps = Development and Operations… a new category for cloud systems management
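  • GB-month = A pricing mechanism for cloud storage. One GB-month is one gigabyte stored for a full month; the charge for a billing period is the amount stored, multiplied by the fraction of the month it was stored, multiplied by the unit cost per GB-month.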
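
As a rough illustration of the GB-month entry above, the calculation can be sketched as follows. The figures are hypothetical: a 744-hour (31-day) month and a made-up price of ten cents per GB-month, neither of which is taken from an actual price list.

```python
def gb_months(gb_stored, hours_stored, hours_in_month):
    """Convert an amount stored for part of a month into GB-months."""
    return gb_stored * hours_stored / hours_in_month


def storage_charge(gb_stored, hours_stored, hours_in_month, price_per_gb_month):
    """Charge in dollars for the billing period."""
    return gb_months(gb_stored, hours_stored, hours_in_month) * price_per_gb_month


# Hypothetical example: 100 GB kept for half of a 31-day (744-hour) month.
usage = gb_months(gb_stored=100, hours_stored=372, hours_in_month=744)

# Hypothetical unit price of $0.10 per GB-month.
charge = storage_charge(100, 372, 744, price_per_gb_month=0.10)

print(f"{usage} GB-months, charge ${charge:.2f}")
```

Storing 100 GB for half the month is billed the same as storing 50 GB for the whole month, which is the property that makes the unit useful for data that comes and goes.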

In the past I have written about the pros and cons facing cloud architects when choosing between an IaaS or PaaS solution for critical application infrastructure. Take a moment and read this post, Balancing Infrastructure as a Service (IaaS) versus Platform as a Service (PaaS), which focuses on the trade-offs between IaaS flexibility and PaaS’s vendor lock-in. There I briefly mention PaaS pricing challenges, so I wanted to expand on that topic with a point of view on how current PaaS pricing schemes hinder adoption.

I’ll put my main theme right out here: Most PaaS solutions have a fundamental problem estimating operating costs at production scale.

There’s an implied “grand bargain” for cloud customers who expect an economic advantage for choosing a cloud PaaS service over a comparable cloud IaaS equivalent. From an anecdotal perspective that seems true. When using PaaS you expect lower people and development costs. PaaS is supposed to provide a price advantage because extensive operational efficiencies are supposed to lower costs. This is because massive physical and human expenses are spread across many, many customers. It’s a textbook example of “economies of scale.”

But widespread PaaS adoption is being hindered because cloud architects can’t wrap their minds around reliable cost estimation. Cost calculators, without real-world, at-scale metrics, give a false sense of economic security.


The Evolution of Purchasing Cloud Compute

In the beginning, as it were, just five years ago, purchasing cloud compute was simple because the rules were easy to understand and there were no choices. There was one cloud compute instance type that cost ten cents an hour. And a credit card was the only way to pay your monthly cloud compute bill.

Today there are a myriad of compute instances to choose from and multiple ways to pay for your cloud CPU time.

“Time…”

It’s the key word in the previous statement. The paradigm shift toward cloud computing away from the old dedicated co-lo world is bringing back the concept of purchasing compute “time.” Seasoned IT folks know there is nothing new to the concept of “buying” computer time. In the mainframe era (would it be hurtful to call it the IT Jurassic age?), computer time-sharing was the norm, and developers had to be mindful of how much computer time their programs consumed because mainframes were very expensive: in today’s dollars, the equivalent of hundreds of dollars per hour.

Steve Wozniak, in his autobiography iWoz, tells a funny-turned-serious anecdote from his University of Colorado at Boulder freshman year. He couldn’t return for his sophomore year because a program he executed on the University’s timeshare mainframe excessively consumed shared compute resources. The fees charged to his computer science department were astronomical for the era: more than $10,000.

Today’s cloud has the same gotchas: Watch out for excessive consumption. Once an hour has been consumed, there is no “return policy” to get your money back.

The cloud is pulling us back to that “time-sharing” mindset. We’re not buying 1U servers anymore. Instead, we’re buying virtual compute processing time based on how many hours in a month a CPU runs, regardless of how much work was performed.

In the cloud, time is the unit of consumption and the month is the billing period.
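
A small sketch shows what billing by time looks like in practice. It uses the ten-cents-an-hour rate mentioned earlier in this post; the fleet of three instances and their running times are hypothetical, and rounding every partial hour up to a full billed hour is an assumption made for the illustration.

```python
import math


def monthly_compute_bill(hours_running, hourly_rate_cents):
    """Return (billed_hours, cost_in_cents) for a fleet over one month.

    `hours_running` holds the wall-clock hours each instance ran.
    Assumption for this sketch: any partial hour is billed as a full hour,
    and idle time costs exactly the same as busy time.
    """
    billed_hours = sum(math.ceil(h) for h in hours_running)
    return billed_hours, billed_hours * hourly_rate_cents


# Hypothetical fleet: one instance up all month (31 days = 744 hours),
# one batch worker that ran 10.2 hours, one test box that ran 30 minutes.
billed, cents = monthly_compute_bill([744, 10.2, 0.5], hourly_rate_cents=10)

print(f"{billed} instance-hours, ${cents / 100:.2f}")
```

Amounts are kept in whole cents so the total is exact; notice that the half-hour test box still costs a full hour under the rounding assumption.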


AWS EC2 Fleet Upgrade Tests our “Cloud Abilities”

This is an essay that was published to the Sonian cloud compute blog. Cross posting here for this audience.

In the past I have written about the secret to successful cloud deployments and how to architect for the cloud. Being successful requires a “designed-for-the-cloud” architecture, best operational practices and DevOps on steroids.

A couple weeks ago Amazon notified a majority of their customers about an upcoming event that we early-to-the-cloud pioneers hadn’t seen before: a forced reboot of the host operating system, on a massive scale. For Sonian, 72% of our currently running EC2 instances will need to be restarted before Amazon’s deadline. There is no reprieve. There is no deferment. Welcome to Infrastructure as a Service!

Our AWS business development contact gave us an early heads-up, and Twitter lit up when the first email notices started to arrive for the US-West region. Something big was afoot, and there were a lot of groans from the EC2 user community. First let me state flat out that Amazon did a pretty good job getting the word out and provided several methods to know which EC2 instances would need to be restarted. An email was sent with the list, the EC2 Management Console displays the information, and the EC2 API tool ‘ec2-describe-instance-status’ reports it as well. Fortunately Joe Kinsella (@joekinsella) enhanced our Cloud Control Viewer and provided a report showing the exact instances and their reboot schedule.

Of the various reboot types, the most invasive is the one that moves the virtual host to new hardware. That will force a change in IP address, and ephemeral storage is lost. This activity will certainly shake out any bugs in automated deployments, hard-coded settings, and sloppy shortcuts.

We had to scramble in order to assess the impact. All we learned from the email notice was that a portion of our EC2 instances would need to be restarted. Actually there were two types of restarts. An operating system reboot, which would preserve the non-persistent ephemeral storage, and a more invasive full instance restart (meaning the hardware under the hypervisor would power-cycle) which would not preserve the ephemeral storage.

One of the major mistakes cloud customers can make is to get complacent and treat the cloud like traditional co-located hosting. The cloud has different operating characteristics, what one could call the “cloud laws of physics,” and this forced restart is a good example of this principle in action. It’s also a wake-up call to not get lazy. A large scale forced restart is like an earthquake drill. Practice makes perfect, and if this were an actual unscheduled emergency, then we would be scrambling.

Despite the headache, this event has some positive spins. First it’s encouraging there is an “EC2 fleet upgrade.” This means newer underlying hardware. Perhaps faster NIC cards in the hosts. But for the companies like Sonian that started in the cloud circa 2007, some of our original instances that have been running for more than a year needed a “freshening.” This event reminds us there is a “hardware” center to every amorphous cloud. Amazon just does a great job to allow us to not have to think about that too often, except for times like these. A stale part of the cloud gets a refresh.

The second “benefit” is the forced fire drill. I know, there’s never a good time for the fire drill. But this type of event has similar qualities to an unexpected outage. There is some luxury to pre-planning, but the shake-out will be the same. Something will be discovered in your architecture or deployment practices that will get improved by this reboot activity. Clusters may be too hard-coded. Config settings may be too restrictive. Reboot scripts may not work as you think.

Sonian survives unscathed due to our maniacal focus on 100% automated deployments, 100% commitment to “infrastructure as code,” and an investment in cloud control tools that allowed us to triage the situation and develop an action plan relatively quickly. We also employ the best darn DevOps team the cloud has seen.

Cloud Innovation Acceleration Effect: Now Releasing 100 Stories

Cross-posting here a two-part essay I wrote for the Sonian blog on how Sonian is benefiting from, and contributing to (by amplification), the innovation cadence in cloud computing.

I’ve been working in enterprise software since the late 1980s, and what I am witnessing as a participant in “the cloud” is that the pace of cloud technology innovation over the past five years blows away the previous two decades.

There is a real, noticeable trend here. We didn’t see this in SaaS powered by co-location hosting. What we are seeing with the cloud, and the ISVs that adopted the cloud five years ago, is truly amazing. Sonian is entering a release cadence updating production systems with substantial new features every month.

Cloud Innovation – Part 1

  • Innovation history of Amazon Web Services 2005-2007
  • How Sonian amplifies cloud innovation
  • Sonian as an example of the “perfect” cloud ISV

Cloud Innovation – Part 2

  • Innovation history of Amazon Web Services 2008-2011
  • Comments about Gov Cloud

 

 

 

A Brief History: Cloud CPU Costs Over the Past 5 Years

When I started Sonian in 2007, one of the driving forces for beginning what would become my third start-up journey was the allure of all-you-can-consume “ten-cent-per-hour” cloud computing. Amazon Web Services was the new IT game changer in town, and the on-demand compute platform they launched in August 2006 literally brought cloud computing to the masses overnight. These past five years I have been studying “cloud costs” in different ways, and this weekend I looked back at the compute pricing history and uncovered some interesting trends.

Before I continue with this post, here’s a brief history of my experience with previous “clouds,” which illustrates why in 2006 I was ready to take a big leap into the AWS cloud as an early adopter.

In 2004 I was involved with another SaaS information archiving project, and I worked with a team at SUN Microsystems to create a reference architecture for our archive software stack to live on the “SUN Utility Compute Grid.” At the time we were hosting the archiving software on dedicated co-located hardware racks, and planning a large capital expenditure to increase capacity. In the spirit of “there has to be a better way!” we entertained the idea of moving our software to SUN’s “cloud.” (In 2004 the term cloud computing was pretty alien… the common term for this type of shared virtual computing was “utility computing”). SUN offered the promise of true utility computing, but at the end of six months of effort, we could not make the underlying cost structures work. SUN was charging one dollar per CPU hour and one dollar per gigabyte per month for storage. We ended up adding more capacity to our existing co-located hardware plant because our “all-in” internal unit costs were less than what SUN was willing to sell their compute grid for.

Now back to the purpose of this post… a historical analysis of the cost of cloud CPU from 2006 through 2011.

Beginning in August 2006, Amazon Web Services’ new Elastic Compute Cloud (EC2) service introduced the concept of the EC2 Compute Unit (ECU)… a standardized way to define a unit of cloud computing, the associated characteristics of that unit (processor speed and memory), and a revolutionary hourly cost model requiring no up-front expense. Amazon achieved what SUN, IBM and others had been talking about for years, but could never bring to market. In 2006, for ten cents per hour, 1 EC2 Compute Unit could be rented with no up-front costs. In 2006, a single ECU was defined as equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor with 1.7 GB of RAM. This 1 ECU reference is still in effect today.

The Secret Life of a Cloud Cost Control Czar

It’s a little after 7 in the morning and I tap the space bar key to wake my MacBook Air from its slumber. I click a tab in Chrome, hit refresh, and with a slight pang of “what will I see,” look at the balance for our October cloud infrastructure bill. You may know this feeling… think about a recent time opening the credit card bill and dreading an unwanted surprise.

“I don’t think I overspent this month, but… there was that steak dinner in New York City …”

Monitoring the rate of spend to make sure we’re not going to break the budget is one task in the routine as the “cloud cost czar.” It’s a daily task to track the trend lines and sound the alarm if expenses start to creep off plan.
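
The daily check described above can be approximated with a simple run-rate projection. This is a sketch only: it assumes spend continues at the month-to-date daily average (real bills are rarely that linear), and the dollar figures and budget are hypothetical.

```python
def project_month_end(spend_to_date, day_of_month, days_in_month):
    """Linear projection: assume the average daily spend so far continues."""
    return spend_to_date / day_of_month * days_in_month


def should_raise_alarm(spend_to_date, day_of_month, days_in_month, budget):
    """True when the projected month-end bill exceeds the budget."""
    return project_month_end(spend_to_date, day_of_month, days_in_month) > budget


# Hypothetical figures: $6,000 spent by October 10 against a $15,000 budget.
projected = project_month_end(6_000, day_of_month=10, days_in_month=31)
alarm = should_raise_alarm(6_000, 10, 31, budget=15_000)

print(f"projected: ${projected:,.2f}, alarm: {alarm}")
```

A straight-line projection is the crudest possible trend line, but it is enough to flag a month that is heading off plan while there is still time to react.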

The “czar” is the human gas pedal – modulating the enormous pulsing “cloud software engine” as we process half a terabyte of data a day.

But I am not a solo act in this cloud pageant. Making the cloud work from an economic perspective is a total team effort. It all starts in the engineering group with cloud-appropriate core architecture designs. It continues with quality testing, and then with the team that manages daily operations. Everyone plays a role and has responsibility for our prime directive: process the most data at the least cost, without sacrificing customer satisfaction.

“Obsessively” managing cost is one of the three design requirements, alongside reliability and performance. “Gaming the cloud,” our internal slang for all we do to maximize efficiency, is a multi-disciplined effort the engineering and service delivery teams rally around. But there has to be at least one person who focuses on the trends, the 30,000 foot view down to sea-level: The Cost Czar.
