Gaming the Cloud: You Need “Cost Aware” Applications

This is the second post in my “Gaming the Cloud” series. You can read the first post here.

There are two primary reasons to adopt cloud computing for SaaS applications: to save money and to be more reliable (and the great aspect of the cloud is that both can be achieved with the same engineering effort). There is no reason to use cloud computing unless you have a “cost aware” application. If you use the cloud to power traditional enterprise software, you won’t save money and will probably be less reliable too. See my previous post in the “Gaming the Cloud” series about having the right use case. Go read it now and come back. Do you have the right use case? If so, let’s continue. After validating your use case, the next step in your cloud journey is to design your application to be elastic and at the same time “cost aware.”

So what exactly is a cost aware application and why should you care?

In the old world of SaaS, using traditional co-located data centers and co-mingled hardware, it was nearly impossible to figure out at a granular level how much each software component cost to run. With the cloud this all changes in a very positive way. Because compute and storage are consumed in small units, and each of these units has a cost (for example, compute at ten cents an hour or storage at fifteen cents a gigabyte), we can now measure costs at the atomic level, and that makes it a requirement to think about software designs that focus on operational efficiency. When I started Sonian in 2007, with a mandate to be purely cloud focused, we had access to one CPU type. Very quickly, our cloud provider, Amazon, offered more CPU variety, and our reference architecture matured in real time as we were able to optimize the software to match virtual compute units that had more memory, more cores, or both.
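
To make “measuring costs at the atomic level” concrete, here is a minimal sketch that turns metered usage into a per-transaction cost. The unit prices are the example figures above; the component name, the usage numbers, and the assumption that storage is billed per gigabyte per month are all hypothetical.

```python
from dataclasses import dataclass

# Unit prices taken from the example figures in the text.
COMPUTE_CENTS_PER_HOUR = 10   # "ten cents an hour"
STORAGE_CENTS_PER_GB = 15     # "fifteen cents a gigabyte" (assumed to be per month)


@dataclass
class ComponentUsage:
    """One month of metered usage for a single software component (hypothetical)."""
    name: str
    compute_hours: float
    storage_gb: float
    transactions: int


def monthly_cost_cents(usage: ComponentUsage) -> float:
    """Total monthly cost of the component, in cents."""
    return (usage.compute_hours * COMPUTE_CENTS_PER_HOUR
            + usage.storage_gb * STORAGE_CENTS_PER_GB)


def cost_per_transaction_cents(usage: ComponentUsage) -> float:
    """Average cost of one transaction, in cents."""
    if usage.transactions <= 0:
        raise ValueError("need at least one transaction")
    return monthly_cost_cents(usage) / usage.transactions


# Hypothetical component: one node running all month, 200 GB stored.
indexer = ComponentUsage("indexer", compute_hours=720, storage_gb=200,
                         transactions=1_000_000)
print(monthly_cost_cents(indexer) / 100)     # dollars per month
print(cost_per_transaction_cents(indexer))   # cents per transaction
```

Once every component reports a number like this, you can see which parts of the system are worth optimizing first.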

A cost aware application is software with an inherent design to “game the cloud” and be ultra efficient on every transaction. This in essence means granular workload management, the ability to right-size the CPU profile for the task, and taking advantage of several long-term cost management features offered by the cloud infrastructure providers.

Granular Workload Management

This is the concept of breaking down data handling into very small tasks that are all wired together in the style of an Enterprise Service Bus. The ratio of tasks to workers can be maintained with a queueing system, and in theory that ratio can be tuned back and forth between cost and performance. Want to go faster? Dial up more CPU where it matters. Want to lower costs? Batch the transactions and utilize CPU when prices are lower. With the cloud powering SaaS applications, it’s often the customer’s usage experience that will dictate which processes run at constant high priority and which can be scheduled to run later.
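
The cost versus performance dial can be sketched as a simple sizing rule: given the queue depth and how quickly the backlog should drain, work out how many workers to run. This is only an illustration. The function name, its parameters, and the throughput figures are assumptions, not a description of any particular queueing product.

```python
import math


def workers_needed(queue_depth: int,
                   tasks_per_worker_hour: float,
                   target_hours: float,
                   min_workers: int = 0,
                   max_workers: int = 100) -> int:
    """Workers required to drain the queue within target_hours.

    A short target_hours favors performance (more workers now);
    a long one favors cost (fewer workers, work is batched).
    """
    if tasks_per_worker_hour <= 0 or target_hours <= 0:
        raise ValueError("throughput and target must be positive")
    needed = math.ceil(queue_depth / (tasks_per_worker_hour * target_hours))
    # Clamp so a traffic spike cannot launch an unbounded fleet.
    return max(min_workers, min(max_workers, needed))


# Same backlog, two settings of the dial (hypothetical numbers).
print(workers_needed(12_000, tasks_per_worker_hour=500, target_hours=1))  # fast
print(workers_needed(12_000, tasks_per_worker_hour=500, target_hours=8))  # cheap
```

With these made-up numbers the same 12,000-task backlog needs 24 workers to clear in an hour but only 3 to clear overnight.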

An example of an enterprise app that is NOT designed with this thinking is Microsoft Exchange. You can’t run Exchange, as is, in the cloud and expect to gain any economic advantage. Exchange, with its design heritage dating back to pre-cloud 1993 days, never had the mandate to be “cost aware.” Within an enterprise, Exchange is installed on dedicated CPU (either hardware or virtual) and does not have the ability to shift workloads or manage how CPU is consumed. The same amount of CPU is available at 9am when everyone wants to read their email, and at 3pm when everyone is in a meeting. If Exchange were redesigned with “cost aware” thinking, then the software would be able to utilize more CPU at 9am when it matters, and then give that CPU “back to the pool” so other enterprise apps (like SharePoint) could use it at 3pm when no one is reading email. The cloud, more and more, starts to look like a huge mainframe computer.

Utilize “Stateless” Processing Where Possible

Implementing a design pattern that has a “stateless” theme means the ability to take advantage of variable CPU pricing and also be more resilient to failure. This is a double-win scenario. Not every component in a modern SaaS system can be stateless, but with careful planning many data management tasks can be implemented with a stateless design. Stateless, from a virtual CPU perspective, means software that does not store data on the local node so that if the node were to fail at any time, there is no data loss and no impact on system integrity. In the Sonian architecture, our SAFE (Sonian Archive File Engine) technology has many stateless operations. This means we can use CPU more efficiently, and at the same time adapt to CPU or cloud failure scenarios by shifting workload processing to backup infrastructure. Stateless processing is most common in the back-end data processing functions. It’s harder, but not impossible, to implement in the web application layer that customers see.
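
Here is a minimal sketch of what “stateless” can look like in a back-end worker. It is not how SAFE is implemented. The dictionaries and the deque are stand-ins for durable external storage and a durable queue, and every name is hypothetical. The two ideas it shows are that a task is only removed from the queue after its result is stored off the node, and that processing the same task twice is harmless.

```python
from collections import deque


def process_task(task_id, source_store, result_store):
    """Do the work for one task using only external stores, never local disk."""
    if task_id in result_store:           # already finished by another node
        return result_store[task_id]
    payload = source_store[task_id]
    result = payload.upper()              # stand-in for the real processing
    result_store[task_id] = result        # result is durable before we acknowledge
    return result


def run_worker(queue, source_store, result_store, fail_after=None):
    """Drain the queue; optionally simulate losing the node after N tasks."""
    done = 0
    while queue:
        if fail_after is not None and done >= fail_after:
            return done                   # node vanishes; remaining tasks stay queued
        task_id = queue[0]                # peek, do not remove yet
        process_task(task_id, source_store, result_store)
        queue.popleft()                   # acknowledge only after the result is stored
        done += 1
    return done


source = {"a": "x", "b": "y", "c": "z"}
results = {}
work = deque(["a", "b", "c"])

run_worker(work, source, results, fail_after=1)   # first node dies after one task
run_worker(work, source, results)                 # replacement node finishes the rest
print(results)
```

Because the first node held nothing locally, the replacement node picks up exactly where the queue left off and no work is lost.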

Profile Software to Choose Best CPU Type

Given that the cloud offers dozens of CPU and storage combinations, it’s now possible for a software architect to truly optimize the compute footprint to the available CPU types. Amazon provides a free CloudWatch service that reports on CPU utilization. It’s a tool to help understand how the software performs in a lab setting so the optimal CPU type (cores and memory) can be used. When the software performs at its optimal operating characteristics, the system is running at peak cost efficiency.
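
Once lab profiling has produced a throughput figure for each candidate CPU type, the choice comes down to cost per unit of work rather than cost per hour. The sketch below assumes you have already measured tasks per hour yourself. The type names, prices, and throughput numbers are invented for illustration and are not real Amazon instance types or rates.

```python
from dataclasses import dataclass


@dataclass
class CpuProfile:
    """Result of profiling one software component on one CPU type (hypothetical)."""
    name: str
    hourly_usd: float
    tasks_per_hour: float   # measured in the lab, not taken from a spec sheet


def cost_per_1000_tasks(profile: CpuProfile) -> float:
    return profile.hourly_usd / profile.tasks_per_hour * 1000


def best_cpu_type(profiles):
    """Pick the CPU type with the lowest cost per unit of work."""
    return min(profiles, key=cost_per_1000_tasks)


candidates = [
    CpuProfile("small", hourly_usd=0.10, tasks_per_hour=1000),
    CpuProfile("large", hourly_usd=0.40, tasks_per_hour=5000),
    CpuProfile("xlarge", hourly_usd=0.80, tasks_per_hour=7000),
]

best = best_cpu_type(candidates)
print(best.name, round(cost_per_1000_tasks(best), 4))
```

In this made-up data the cheapest hourly rate is not the cheapest way to get the work done, and neither is the biggest machine. That is the kind of result profiling is meant to surface.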

Long Term Cost Management

Amazon Web Services, as well as some of the other clouds, offers several ways to purchase compute. The original and standard consumption method is “pay as you go.” Amazon offers more than a dozen different CPU configurations that can be called upon to play a role in your compute platform (you still need to do the science to choose the optimal CPU profile for your software component).

In addition to “pay as you go” Amazon offers Reserved Instances (RI). RI’s are useful when you know with a high degree of certainty the CPU needs for the next one to three years. With RI’s you make an upfront pre-paid CPU purchase for a specific CPU type, and then get a dramatically lower hourly rate. The net effect of the pre-pay and reduced hourly rate, when a CPU is utilized 100% of the billing period, is an average 30-45% cost reduction. RI’s are a great way to reduce CPU costs over the long term, but you need to pay attention to the “rules” in order to maximize the savings.
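
The RI trade-off is easy to check with a little arithmetic before you commit. The sketch below compares a one-year reservation against pay as you go and finds the break-even point. All of the prices are hypothetical placeholders, so substitute the current rates for your CPU type.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(hourly: float, hours: float) -> float:
    return hourly * hours


def reserved_cost(upfront: float, hourly: float, hours: float) -> float:
    return upfront + hourly * hours


def break_even_hours(upfront: float, od_hourly: float, ri_hourly: float) -> float:
    """Hours of use after which the reservation becomes the cheaper option."""
    if od_hourly <= ri_hourly:
        raise ValueError("reserved hourly rate must be below the on-demand rate")
    return upfront / (od_hourly - ri_hourly)


def savings_fraction(upfront, od_hourly, ri_hourly, hours) -> float:
    """Fraction saved versus pay as you go; negative means the RI costs more."""
    od = on_demand_cost(od_hourly, hours)
    return (od - reserved_cost(upfront, ri_hourly, hours)) / od


# Hypothetical one-year RI: $300 upfront, $0.03/hour instead of $0.10/hour.
UPFRONT, OD_RATE, RI_RATE = 300.0, 0.10, 0.03

full_year = savings_fraction(UPFRONT, OD_RATE, RI_RATE, HOURS_PER_YEAR)
threshold = break_even_hours(UPFRONT, OD_RATE, RI_RATE)
print(f"savings at 100% utilization: {full_year:.1%}")
print(f"break-even utilization: {threshold / HOURS_PER_YEAR:.1%}")
```

With these placeholder prices a fully utilized RI saves roughly a third, but the same RI loses money if the CPU runs less than about half the year. That is the main “rule” to watch.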

CPU Spot Pricing (Spots) is another payment and utilization method available to lower costs. It’s an auction where you can bid for CPU time and then process work at much lower costs, with no upfront payment. As opposed to RI’s, where the CPU is guaranteed to be there for you, Spots are not guaranteed to be there if another cloud customer bids higher for your compute hour. Spots and stateless design are a perfect match. When a Spot disappears because of a higher bid, it’s equivalent to losing the compute node to failure. But if you design for stateless processing, then Spots are a perfect way to lower costs. At Sonian we see up to 50% lower per-hour CPU costs by using Spot nodes. Not every process runs on Spot CPU now, but over time we’ll be shifting more work to Spots to help control compute expense.
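
One way to reason about whether a workload belongs on Spots is to account for the work that gets redone when a node disappears. The sketch below is a simplified model, not Sonian’s accounting. It assumes some fraction of paid Spot hours is wasted on work lost to interruptions, and every number in it is hypothetical.

```python
def effective_spot_hourly(spot_hourly: float, rework_fraction: float) -> float:
    """Cost per useful hour when a fraction of paid hours is lost and redone."""
    if not 0 <= rework_fraction < 1:
        raise ValueError("rework_fraction must be in [0, 1)")
    return spot_hourly / (1 - rework_fraction)


def spot_is_cheaper(spot_hourly, rework_fraction, on_demand_hourly) -> bool:
    return effective_spot_hourly(spot_hourly, rework_fraction) < on_demand_hourly


# Hypothetical prices: Spot at half the on-demand rate.
ON_DEMAND, SPOT = 0.10, 0.05

# Small, stateless tasks: little work is lost when a node is reclaimed.
print(spot_is_cheaper(SPOT, rework_fraction=0.10, on_demand_hourly=ON_DEMAND))

# Long-running, stateful work: most of an interrupted job must be redone.
print(spot_is_cheaper(SPOT, rework_fraction=0.60, on_demand_hourly=ON_DEMAND))
```

The smaller and more stateless the tasks, the closer the effective price stays to the raw Spot price, which is why the two ideas pair so well.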

In summary, the cloud (compared to dedicated hosting) offers more flexibility in expense management, but needs new software designed to be cost aware. It’s up to the architects to utilize all the available tools (RI’s, Spots, and stateless design), but with fresh thinking it’s possible to “game the cloud” to achieve lower costs and greater reliability.