The Evolution of Purchasing Cloud Compute

In the beginning, as it were, just five years ago, purchasing cloud compute was simple because the rules were easy to understand and there were no choices. There was one cloud compute instance type that cost ten cents an hour. And a credit card was the only way to pay your monthly cloud compute bill.

Today there are a myriad of compute instances to choose from and multiple ways to pay for your cloud CPU time.

“Time…”

It’s the key word in the previous statement. The paradigm shift toward cloud computing, away from the old dedicated co-lo world, is bringing back the concept of purchasing compute “time.” Seasoned IT folks know there is nothing new about the concept of “buying” computer time. In the mainframe era (would it be hurtful to call it the IT Jurassic age?), computer time-sharing was the norm, and developers had to be mindful of how much computer time their programs consumed because mainframes were very expensive: in today’s dollars, the equivalent of hundreds of dollars per hour.

Steve Wozniak, in his autobiography iWoz, tells a funny-turned-serious anecdote from his freshman year at the University of Colorado at Boulder. He couldn’t return for his sophomore year because a program he executed on the university’s timeshare mainframe consumed an excessive amount of shared compute resources. The fees charged to his computer science department were astronomical for the era: more than $10,000.

Today’s cloud has the same gotchas: Watch out for excessive consumption. Once an hour has been consumed, there is no “return policy” to get your money back.

The cloud is pulling us back to that “time-sharing” mindset. We’re not buying 1U servers anymore. Instead, we’re buying virtual compute processing time based on how many hours in a month a CPU runs, regardless of how much work was performed.

In the cloud, time is the unit of consumption and the month is the billing period.

Over the past five years we have witnessed three stages of “cloud computing purchasing evolution”: Crawl, Walk and Run. Each stage builds upon the innovations of the previous one, and as the cutting-edge pace continues, in a few years we’ll look back and identify more stages beyond the first three described below.

Evolution Stage 1 – Crawl

Highlights:

  • One CPU type
  • Simple pay as you go pricing
  • Credit card billing

In 2007 the cloud pioneers were just starting to get their “cloud bearings.” What those pioneers would quickly learn is that operating a complicated distributed system on cloud infrastructure would be very different from the co-location architecture patterns and operating methods of the past. The “cloud” in 2007 was basically Amazon Web Services, which is credited with creating the “modern cloud computing ecosystem.” The notion of ten-cent-per-hour “unlimited” compute was revolutionary. Despite the novelty of unlimited low-price compute, in the early days there was a “one size fits all” approach to available compute types. Many original AWS cloud-based architectures were based on the fact that there was only one compute instance type to choose from. Having only one CPU type was challenging, but the allure of this new way of running apps was too hard to ignore. That original CPU type, compared to today’s CPU lineup, was pretty tame: a 1.7 GHz processor with 1.7 GB of RAM and 160 GB of local ephemeral storage. No persistent block storage was offered. The only durable storage was Amazon’s Simple Storage Service (S3).

Today that original cloud compute instance type is called a small, and it still serves a valuable purpose in cloud architectures. Within a year of the original launch of the “small” instance type, Amazon expanded its offering with more compute types. These came with more memory and more cores. Cloud architects “rightsized” their designs and moved toward a better match between software needs and compute type. The one-size-fits-all era was short-lived.

This “crawl” phase also featured simple billing and a pure “pay as you go” price model. Pay as you go, in the form of an operating expense, was heralded as a “fresh” idea compared to the traditional capital expenditure to kit out a data center. Start-ups were primed to jump on the cloud bandwagon because of “paygo” and because they had no legacy architectures to stymie forward progress. But “paygo” would soon encounter growing pains, which stimulated change in the “walk” phase.

Evolution Stage 2 – Walk

Highlights:

  • Consolidated billing & monthly invoices
  • Reserved Instances

Not to rest on the laurels of early success, cloud vendors stepped up the innovation pace and the “walk” phase is characterized as “more payment options / more CPU choices.”

The original credit-card-only payment method started to fall apart for heavy cloud users because monthly bills could be in the tens of thousands of dollars. Maintaining a dedicated credit card for the cloud was the first response to rising bills, but it couldn’t be the long-term solution. Monthly invoicing was made available to large consumers, and CFOs rejoiced. Score one for the “old school” ways of business: Net-30 day invoicing.

Many cloud customers created multiple AWS accounts for various business and technical reasons. This surprised Amazon, which originally thought customers would use just one Amazon account for everything. Managing cloud expense across multiple billing accounts was cumbersome for both the customer and Amazon. So to address this issue, and not fight it, AWS offered “consolidated” invoicing. One umbrella account is designated as the “payer” account, and all other accounts roll up into the payer account as child accounts. One monthly invoice totals all charges into a “CFO friendly” package: easy to understand, and easy for finance departments to pay. This was also a recognition at AWS that selling IT to large enterprises required a different mindset than selling commodities at Amazon.com. Large enterprises didn’t want to put their cloud expense on a credit card. Enterprises want financial controls and traditional accounting concepts like purchase orders and invoices to feel comfortable. This issue is probably one of the few areas where the cloud’s desire to be “frictionless” bumps against the headwinds of reality.

The walk phase also introduced a new way to purchase compute time: the Reserved Instance (RI). RI’s came about because of the realization that a steady-state compute workload in the cloud cost more than a comparable dedicated co-lo configuration. The cloud’s price advantage is most often realized when the workload is dynamic rather than steady. The concept behind RI’s is that a per-hour discount is available when you make an up-front annual financial commitment. RI’s were a way to lower overall compute expense for instances that run full-time. With RI’s, a continuously running CPU will cost less if you know for sure how many hours in a year that node will be used. With that confidence, cloud customers started to feel comfortable committing to annual terms and lowering overall costs.

RI’s were introduced at just the right time on the cloud evolution timeline. Using RI’s effectively meant customers needed good science and metrics in order to extrapolate a year of compute needs in advance. In the early cloud days, system architectures were iterating too quickly to make a firm commitment to one CPU type for a year. With stability settling in, annual commitments didn’t feel like such a stretch compared to the first year.

RI’s helped solve a number of growing problems with cloud compute, but also introduced their own management overhead. RI’s have their own pricing rules, and in order to maximize savings, the rules have to be followed very closely. This means paying attention to CPU type and location, because buying a Reserved Instance means committing to a specific CPU type in a specific geographic location. If over time type and location deviated from the original purchase, the savings would be lost.
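The type-and-location rule can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual billing logic: the instance type names, the location labels and the count_covered helper are all hypothetical.

```python
from collections import Counter


def count_covered(reservations, running):
    """Return (covered, uncovered) counts of running instances.

    Both arguments are lists of (cpu_type, location) tuples. A running
    instance is covered only when an unused reservation matches BOTH
    its CPU type and its location.
    """
    available = Counter(reservations)
    covered = 0
    for instance in running:
        if available[instance] > 0:
            available[instance] -= 1  # each reservation covers one instance
            covered += 1
    return covered, len(running) - covered


# Hypothetical names, for illustration only.
reservations = [("small", "us-east"), ("small", "us-east"), ("large", "us-east")]
running = [
    ("small", "us-east"),   # matches a reservation
    ("small", "us-west"),   # right type, wrong location
    ("large", "us-east"),   # matches a reservation
    ("xlarge", "us-east"),  # right location, wrong type
]

covered, uncovered = count_covered(reservations, running)
print(f"covered: {covered}, paying the regular hourly rate: {uncovered}")
```

In this made-up fleet, one "small" reservation goes unused while two instances pay full price, which is exactly the drift the pricing rules punish.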

Before RI’s, cloud architects could measure cost with a simple calculation: number of hours running per month multiplied by cost per hour. That’s a single-vector price model. RI’s introduce the concept of a multi-vector price calculation: a one-time upfront cost AND an hourly rate. The upfront payment is based on a one-year or three-year commitment, and the hourly rate is based on the commitment as well. In rough numbers, if a CPU runs twenty-four by seven, all year long, then a one-year RI will save 38% compared to the simple hourly rate method. A 49% savings is achieved with a three-year commitment. To analyze costs on a monthly basis, take the one-year upfront payment, divide it by 12, and then add the RI hourly rate multiplied by the number of hours in the month (720 hours for a 30-day month and 744 hours for a 31-day month).
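Here is a rough Python sketch of the two calculations. The ten-cent hourly rate comes from the original instance price mentioned earlier; the upfront fee and the RI hourly rate are made-up numbers chosen only to illustrate the arithmetic, not actual vendor prices.

```python
def on_demand_monthly(hourly_rate, hours):
    """Single-vector model: hours run multiplied by cost per hour."""
    return hourly_rate * hours


def reserved_monthly(upfront, term_months, ri_hourly_rate, hours):
    """Two-vector model: amortized upfront fee plus discounted hourly charges."""
    return upfront / term_months + ri_hourly_rate * hours


HOURS_30_DAY_MONTH = 720

# Hypothetical prices, for illustration only.
on_demand = on_demand_monthly(hourly_rate=0.10, hours=HOURS_30_DAY_MONTH)
reserved = reserved_monthly(
    upfront=240.00,        # assumed one-year upfront payment
    term_months=12,
    ri_hourly_rate=0.035,  # assumed discounted RI hourly rate
    hours=HOURS_30_DAY_MONTH,
)
savings = (on_demand - reserved) / on_demand

print(f"on-demand: ${on_demand:.2f}")
print(f"reserved:  ${reserved:.2f}")
print(f"savings:   {savings:.0%}")
```

With these assumed prices the reserved option comes out roughly 37% cheaper for a node that runs every hour of a 30-day month. Lower the hours argument and the savings shrink, because the amortized upfront fee is owed either way.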

RI’s brought the concept of physical location to the forefront of our thinking. Buying an RI means making a commitment to a specific location. The RI pricing is only good for running a CPU in that location. Prior to RI’s there wasn’t much attention paid to “location in the cloud.” The whole notion of cloud computing abstracts location from how the cloud is consumed. But with RI’s, the location matters and needs to be considered carefully before making the purchase.

While RI’s are important for monetary savings, they also have a role in disaster recovery. Within the cloud, all virtual resources have a physical infrastructure tied to a specific geographic location. Since it’s a shared environment, when an underlying problem affects a large number of customers, cloud resources can quickly become scarce. An analogy to a “bank run” is relevant. A bank doesn’t have a physical dollar in its vault for every depositor’s funds. Likewise, the cloud doesn’t have enough unused physical infrastructure should every customer try to demand the same compute instance at once. But a Reserved Instance guarantees a “reservation” for a compute instance, so that if cloud resources become scarce system-wide, the RI is the insurance policy that guarantees access.

The fast-paced nature of “Internet-speed” development works against widespread RI adoption. System architects are leery of making long-term commitments to a specific CPU type, fearful that their next six-month sprint will have different compute profile needs compared to current requirements. Cloud vendors needed to be more creative and offer solutions to help remove the uncertainty factor. This need spurred the changes in the “run” phase.

Evolution Stage 3 – Run

Highlights:

  • Spot Pricing
  • 3 Reserved Instance Flavors

The “run” stage is the natural next evolutionary step from walking. Here we find two new purchasing options: Spot Pricing and Multi-tiered Reserved Instances.

Spot Pricing is the concept of bidding on compute time and realizing deep discounts compared to the retail hourly rate. Multi-tiered Reserved Instances fine-tune the RI model and create new segments that offer more savings than the first iteration of Reserved Instances.

First let’s dive into what Spot Pricing is and why it’s the most exciting thing to happen in cloud computing since that original “ten cent per hour” instance type. Spot Pricing is a real-time auction approach to purchasing compute. Customers place a bid for compute time, and the auction system awards the compute node to the highest bidder. It’s a real-time continuous auction, with price history available for easy analysis. With Spot Pricing it’s possible to buy compute time for a fraction of the retail price. But there is one huge caveat: at any time the auction system can yank the compute node from you if there is a higher bidder. Taking advantage of Spot Pricing requires a system architecture designed for this operating characteristic. But the good news is that designing for Spot Pricing means designing for resistance to failure. Losing a node to a higher bidder or losing a node to a failure has the same end result: the system needs to continue regardless of why the node went away.
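To get a feel for the trade-off, here is a toy Python simulation of a fixed bid walked across an hourly price history. It is only a sketch under simplifying assumptions that are mine, not the vendor's published rules: prices change once per hour, the node runs in any hour where the bid is at least the going price, each hour run is billed at the going price rather than the bid, and the node is lost the moment the price rises above the bid. The price history is invented.

```python
def simulate_spot(bid, hourly_prices):
    """Walk an hourly price history with a fixed bid.

    Returns (hours_run, cost, interruptions). Assumes the node runs in
    every hour where bid >= price and is billed that hour's price.
    """
    hours_run = 0
    cost = 0.0
    interruptions = 0
    running = False
    for price in hourly_prices:
        if bid >= price:
            running = True
            hours_run += 1
            cost += price
        else:
            if running:
                interruptions += 1  # node yanked: someone outbid us
            running = False
    return hours_run, cost, interruptions


# Invented price history (dollars per hour), for illustration only.
history = [0.03, 0.03, 0.04, 0.12, 0.05, 0.03]

hours_run, cost, interruptions = simulate_spot(bid=0.05, hourly_prices=history)
print(f"ran {hours_run} of {len(history)} hours, "
      f"cost ${cost:.2f}, interrupted {interruptions} time(s)")
```

In this run a five-cent bid keeps the node for five of the six hours and loses it once, during the price spike. A batch job that checkpoints its work can simply resume when the price drops back.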

Spot Pricing is ideal for batch jobs where completion is not “time sensitive.” In other words, a user of a web application should never feel added latency because a request was processed by a Spot node. Spot Pricing is not good for databases or any application infrastructure component that needs to run steady-state.

Also in the evolutionary “run” stage are more choices for utilizing Reserved Instances. Previously RI’s were a means to achieve discounts based on a one- to three-year commitment. RI’s now have multi-tiered purchase options. This means multiple price scenarios based on how heavily an RI will be used within the one- to three-year commitment. This new multi-tiered approach offers more savings for instances that run all the time, and also offers savings for instances that run for as little as 50% of the billing period. RI’s are now classified as “light, medium and heavy.” The original RI’s are now considered medium, meaning savings are realized as long as the instance runs for at least 75% of the billing period. Heavy RI’s are designed for instances that run for greater than 90% of the billing period. Light RI’s are for instances that run for 50% of the month.
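The utilization thresholds above lend themselves to a small helper. The following Python sketch uses the 50%, 75% and 90% figures from the text; the function name is hypothetical, and treating anything under 50% as plain pay-as-you-go is my own assumption, since a real decision would compare actual prices.

```python
def suggest_ri_tier(utilization):
    """Map expected utilization (0.0 to 1.0 of the billing period) to a tier.

    Thresholds follow the rough figures in the article; the "on-demand"
    fallback below 50% is an assumption made for this sketch.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    if utilization > 0.90:
        return "heavy"
    if utilization >= 0.75:
        return "medium"
    if utilization >= 0.50:
        return "light"
    return "on-demand"


HOURS_31_DAY_MONTH = 744

# Example: a node expected to run 600 hours in a 31-day month.
utilization = 600 / HOURS_31_DAY_MONTH
print(f"{utilization:.0%} utilization -> {suggest_ri_tier(utilization)}")
```

A node running 600 of 744 hours sits at about 81% utilization, which lands in the medium tier under these thresholds.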

Heavy RI’s are interesting because they add a third price vector to cost analysis. The original two vectors, up-front payment and hourly rate, are now complemented with a third dimension: static price. With Heavy RI’s the hourly charge does not vary with usage; instead it’s a fixed amount billed at the beginning of the month. Prior to “Heavy RI’s,” all cloud expenses were accrued throughout the month. The month begins with a zero dollar bill and increments with each passing hour based on how many resources were consumed. With Heavy RI’s the month begins with a charge equal to the Heavy RI instance count, multiplied by the Heavy RI hourly rate, times the number of hours in the month. With this new Heavy RI type, cloud price calculators needed to be re-tooled to take this new payment dimension into consideration. Without modification, the estimated monthly bills would not be accurate because Heavy RI’s are not consumed in a granular manner. We see the cloud purchasing options starting to morph back to co-location patterns. It’s not a negative reflection on the cloud, but more an acknowledgement that the best attributes of cloud can be optimized to real-world needs, especially in the area of billing and finance.
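A price calculator that accounts for Heavy RI’s needs to start the month with a non-zero balance. A minimal Python sketch of that opening charge follows; the function names, the instance count and the hourly rate are all hypothetical, and the hour count ignores any daylight-saving adjustments.

```python
import calendar


def hours_in_month(year, month):
    """Number of hours in a calendar month (days in month times 24)."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def heavy_ri_opening_balance(instance_count, heavy_hourly_rate, year, month):
    """Fixed charge billed at the start of the month for Heavy RI's.

    It is owed whether or not the instances actually run.
    """
    return instance_count * heavy_hourly_rate * hours_in_month(year, month)


# Hypothetical fleet: ten Heavy RI's at an assumed two cents per hour.
balance = heavy_ri_opening_balance(
    instance_count=10, heavy_hourly_rate=0.02, year=2012, month=1
)
print(f"January opens at ${balance:.2f} before a single hour is consumed")
```

With these assumed numbers a 31-day month (744 hours) opens at about $148.80, and the usual hour-by-hour accrual for everything else is added on top of that.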

Summary:

Early cloud customers have been rewarded with a steady stream of incremental pricing innovations. The secret to purchasing cloud compute is to be able to evolve system architectures quickly to take advantage of new pricing models. An overall strategy is to use the same approach one uses when designing a balanced retirement portfolio: don’t put “all your eggs in one basket.” Use a combination of standard CPU instance types, Reserved Instances (three flavors to choose from) and Spot Pricing whenever possible.