The Secret Life of a Cloud Cost Control Czar

It’s a little after 7 in the morning and I tap the space bar to wake my MacBook Air from its slumber. I click a tab in Chrome, hit refresh, and with a slight pang of “what will I see,” look at the balance of our October cloud infrastructure bill. You may know this feeling: think about the last time you opened a credit card bill and dreaded an unwanted surprise.

“I don’t think I overspent this month, but… there was that steak dinner in New York City …”

Monitoring the rate of spend to make sure we’re not going to break the budget is one routine task for the “cloud cost czar.” It’s a daily job to track the trend lines and sound the alarm if expenses start to creep off plan.
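
The daily check itself is simple arithmetic: project month-end spend from the month-to-date burn rate and compare it to the budget. Here is a minimal sketch; the dollar figures and date are hypothetical, not our actual numbers.

```python
import calendar
from datetime import date

def projected_month_end_spend(month_to_date: float, today: date) -> float:
    """Naive linear projection: assume the current daily burn rate holds."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return (month_to_date / today.day) * days_in_month

# Hypothetical numbers: $41,000 spent by October 12 against a $95,000 budget.
budget = 95_000
projection = projected_month_end_spend(41_000, date(2011, 10, 12))
if projection > budget:
    print(f"Sound the alarm: projected ${projection:,.0f} vs. ${budget:,} budget")
else:
    print(f"On plan: projected ${projection:,.0f} vs. ${budget:,} budget")
```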

The “czar” is the human gas pedal – modulating the enormous pulsing “cloud software engine” as we process half a terabyte of data a day.

But I am not a solo act in this cloud pageant. Making the cloud work from an economic perspective is a total team effort. It starts in the engineering group with cloud-appropriate core architecture designs, continues through quality testing, and then passes to the team that manages daily operations. Everyone plays a role and shares responsibility for our prime directive: process the most data at the least cost, without sacrificing customer satisfaction.

“Obsessively” managing cost is one of our three design requirements, alongside reliability and performance. “Gaming the cloud,” our internal slang for everything we do to maximize efficiency, is a multi-disciplinary effort the engineering and service delivery teams rally around. But there has to be at least one person who focuses on the trends, from the 30,000-foot view down to sea level: the cost czar.

As the czar, I have to push back on initiatives that might spike costs, right up to the point where pushing back would harm customer satisfaction. In the cloud, what we’re spending in the moment has a direct relationship to the customer usage experience. Want to dial down the spend? Slow the system down by turning off virtual compute units. But slow the system down too much and customers may notice. (There is an aviation term, LRC, or Long Range Cruise speed: a formula that balances fuel usage against cruise time. The concept is similar in the cloud, where we want to stretch our “fuel” to process the most work.) So it’s a delicate balance. But understanding this balance is the key to harnessing all the positive attributes of cloud computing, so that customers get the best ROI for their subscription dollars.

Here are some statistics on how we use the cloud, to give you a sense of our scale relative to others. An average month looks like this:

  • 530 compute nodes (EC2)
  • 3723 compute units
  • A petabyte of storage (S3 and EBS)
  • 612 block storage volumes

Compared to a traditional co-location hosting configuration, our cloud footprint would fill up 100 or more physical 42U racks of compute and storage equipment.

The Psychology of Buying in the Cloud

The cloud offers a way to purchase compute and storage that start-ups and enterprises haven’t seen before. Prior to the cloud, purchasing infrastructure was a multi-step process of vendor quotes, internal approvals, purchase orders, shipping receipts and net-30 invoices. The cloud is the complete opposite. But for all the purchasing friction the cloud removes, it needs a control layer of checks and balances to prevent runaway costs. It’s easy to start off in the cloud with pay-as-you-go prices charged to a company credit card, but the frictionless pay-as-you-go model quickly starts to work against our collective best interests. The solution is to insert cost controls that bring back just the right level of governance we used to have in the pre-cloud world.

Setting a monthly budget goal is the new way to manage cloud infrastructure expense, and a “cloud budget” is different from an annual hardware and co-location budget. The cloud budget goal should be based on a combination of the previous month (consecutive months rarely differ radically if you are running a SaaS business), engineering and DevOps input, and the sales group’s growth expectations. In a perfect world, especially for a business where there is a direct correlation between COGS and the revenue unit metric, the budget will be driven purely from the sales pipeline using a proven formula. Our experience shows this is an achievable goal, but it takes time to prove out the formulas and arrive at a predictable pattern.
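
In sketch form, that budget formula looks something like the following; the cost-per-revenue-unit and growth figures are hypothetical stand-ins, not our actual numbers.

```python
def next_month_budget(previous_month_cost: float,
                      expected_new_revenue_units: int,
                      cost_per_revenue_unit: float,
                      planned_engineering_delta: float = 0.0) -> float:
    """Budget = last month's run rate, plus pipeline growth priced at a proven
    COGS-per-unit figure, plus any one-off engineering/DevOps changes."""
    return (previous_month_cost
            + expected_new_revenue_units * cost_per_revenue_unit
            + planned_engineering_delta)

# Hypothetical inputs: $90,000 last month, 400 new units from the sales
# pipeline at ~$11 of infrastructure cost each, plus a $2,000 project cluster.
print(next_month_budget(90_000, 400, 11.0, 2_000))  # 96400.0
```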

Everything we do in the cloud is built on the concept of an hourly “rental.”** There is no permanent ownership in the cloud (and this is a good thing if you make the right choices). We think about compute units and storage costs per month, but get charged by the hour consumed. For people accustomed to traditional system architectures this is still an alien concept. If we start a new CPU in the last few days of the month, that additional expense won’t move the cost needle much. Conversely, if a CPU that has been running all month is stopped in the last week, there will not be a dramatic decrease.

Budgeting by the month for units that are billed by the hour is a challenge.

**(n.b. EC2 reserved instances augment this thinking and will be discussed in a future post.)
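
To make the month-end arithmetic concrete, here is a minimal sketch of the hourly-rental math; the fifty-cent rate is hypothetical, not one of our actual instance prices.

```python
HOURS_IN_MONTH = 744  # a 31-day month; use 720 for a 30-day month

def monthly_cost(hourly_rate: float, hours_run: float) -> float:
    """Cloud billing in a nutshell: hourly rate times hours consumed."""
    return hourly_rate * hours_run

rate = 0.50  # hypothetical hourly rate for a single node

print(monthly_cost(rate, HOURS_IN_MONTH))           # ran all month: 372.0
print(monthly_cost(rate, 3 * 24))                   # started 3 days before month end: 36.0
print(monthly_cost(rate, HOURS_IN_MONTH - 7 * 24))  # stopped a week early: 288.0
```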

Purchasing Storage in the Cloud

Cloud storage comes in three primary flavors: ephemeral, durable block storage and durable object storage. These are three great building blocks for any SaaS application. Our particular use case requires block and object as the primary storage systems. Ephemeral storage is used for R&D and not much else, since we need a guarantee that no data is lost even if a compute node terminates unexpectedly (there is no guarantee ephemeral storage survives a CPU crash). Ephemeral is “free” (included in the hourly compute fee), while block and object storage cost ten cents and fifteen cents per gigabyte per month respectively.

Cloud object storage, in our case Amazon S3, is a true “pay as you go” pricing model. At our monthly volume we pay on average twelve cents per gigabyte per month, but that amount is pro-rated depending on how many hours into the month the data is stored (technically, when the API PUT was executed). The pricing concept is called “TimedStorage-ByteHrs” and is calculated as the number of bytes stored multiplied by the number of hours they are stored during the month. This is a foreign procurement concept for IT folks who are used to purchasing SANs and storage systems under the old hardware model.
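
Here is a minimal sketch of the TimedStorage-ByteHrs idea, using our twelve-cent average and treating a gigabyte as 2^30 bytes; the 500 GB figure is just an example.

```python
GB = 1024 ** 3  # a "gigabyte" of stored data treated as 2**30 bytes

def s3_monthly_charge(bytes_stored: int, hours_stored: int,
                      hours_in_month: int = 720,
                      rate_per_gb_month: float = 0.12) -> float:
    """TimedStorage-ByteHrs: byte-hours converted to GB-months, then priced."""
    byte_hours = bytes_stored * hours_stored
    gb_months = byte_hours / GB / hours_in_month
    return gb_months * rate_per_gb_month

# 500 GB PUT halfway through a 30-day month is billed as 250 GB-months,
# so roughly $30 at a twelve-cent average rather than the full-month $60.
print(round(s3_monthly_charge(500 * GB, 360), 2))  # 30.0
```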

Cloud block storage (Amazon’s Elastic Block Store) has a different pricing model. For this storage type we pay for how much storage is provisioned, not how much is actually consumed. Provisioning is very flexible (volumes can be created on the fly from as small as 1 gigabyte to as large as 1 terabyte), but matching the provisioned amount to the amount actually needed is a continuous challenge. If we provision too far ahead, that’s wasted money. Provision too little, and increasing storage on the fly is a DevOps headache. Assessing real-time storage needs and striving for utilization as near to 100% as possible is a work in progress, but each successive software release moves us closer to this goal. The key is to identify the smallest compute-per-storage-node building block so that capacity can be increased very granularly.
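
A sketch of the provisioned-versus-consumed check: in practice the provisioned sizes come from the EC2 API and the consumed sizes from reports shipped off each node, but the numbers below are hypothetical.

```python
EBS_RATE = 0.10  # ten cents per provisioned gigabyte per month

# Hypothetical fleet: volume id -> (provisioned GB, actually consumed GB)
volumes = {"vol-1a2b": (1000, 310), "vol-3c4d": (200, 185), "vol-5e6f": (500, 60)}

for vol_id, (provisioned_gb, used_gb) in volumes.items():
    utilization = used_gb / provisioned_gb
    idle_cost = (provisioned_gb - used_gb) * EBS_RATE
    print(f"{vol_id}: {utilization:.0%} utilized, "
          f"~${idle_cost:,.0f}/month paying for empty space")
```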

Purchasing Compute in the Cloud

As of this writing there are at least a dozen different cloud CPU configurations to choose from. Some have more compute and less memory. Others are weighted toward more memory and fewer compute units. A complicated, distributed architecture needs many compute profiles in order to match the CPU type to the software task. Figuring out the compute-to-task profile is an art in itself, and it keeps improving as cloud vendors offer more CPU choices and software is profiled to match the CPU type.

In the cloud, compute is purchased by the hour. Hourly rates range from about a penny for a micro single-compute-unit instance up to two dollars for a quadruple extra large instance rated at 33 compute units. Run a CPU for an hour and pay the hourly rate for that CPU type. Run a CPU for a month and pay the hourly rate times 720 or 744 hours (a 30- or 31-day month).
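
As a back-of-the-envelope illustration, a monthly compute estimate is just each profile’s hourly rate times node count times hours in the month, summed across the fleet; the mix and rates below are hypothetical.

```python
HOURS = 744  # 31-day month

# Hypothetical compute mix: profile name -> (hourly rate, node count)
fleet = {
    "micro": (0.02, 40),
    "standard-large": (0.34, 120),
    "quad-xl": (2.00, 6),
}

monthly = sum(rate * count * HOURS for rate, count in fleet.values())
print(f"Estimated monthly compute bill: ${monthly:,.0f}")  # ~$39,878
```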

The Czar’s Tools

Our cloud cost control toolset has evolved from a simple tracking spreadsheet to a suite of tools that provide current and historical analysis. As our cloud usage has increased over time, the complexities of running a large system have grown exponentially. Traditional commercial or open source tools were not designed for how the cloud works, so we had to create our own tooling.

Cloud Control Viewer: Sonian’s VP of Engineering Joe Kinsella created our central tool, which we call “Cloud Control Viewer.” It’s a Rails application that aggregates metrics from across the system and gives us x-ray vision into how our software is performing in the cloud, including costs. The tool is constantly enhanced and is a key component enabling daily analysis. The Cloud Control Viewer uses Amazon APIs, Chef APIs and our own APIs as source data to build graphs, reports and value-add trend analysis.

Cloud Infrastructure Management Control Panel: We also use the AWS Management Console to augment data from the Cloud Control Viewer. The Sonian viewer has one perspective on activity, and the AWS Management Console Activity Reports provide an additional information source required for daily management.

Shared Spreadsheets: There is a lot of “ad hoc” analysis that happens in shared spreadsheets. The data piped into the spreadsheet comes from the Cloud Control Viewer as well as the AWS Console Activity Reports. Over time the “ad hoc” reports will become permanent fixtures in the Cloud Control Viewer. Shared spreadsheets also allow a small tactical group to focus on a specific cost cutting measure with actionable detail data, role assignment, and project tracking.

Sources of Cost Creep

Using our various tools we have found the following areas where costs creep up in the cloud. Most of these findings relate to storage. The old saying “data is sticky” certainly holds true when trying to identify obsolete storage resources. Memories fade over time, and the longer data volumes exist, the harder it is to know whether to keep or discard them.

  • Unattached storage volumes. Bugs in automation software or simple forgetfulness can leave underutilized storage volumes orphaned, and that wastes money. If the problem gets out of control, decommissioning the volumes can take some time, since “institutional memory” has to be recreated to make sure nothing of value is on a volume before it is deleted. (See the sketch after this list for the kind of sweep that catches these.)
  • Excessive EBS snapshot policy. Snapshotting EBS volumes is the proper way to maintain a backup. EBS by itself has 99.99% durability, and snapshotting to S3 layers S3’s “eleven nines” of durability on top. But too aggressive a snapshot policy wastes money, while one that is not aggressive enough leaves you vulnerable to data loss.
  • Temporary project work. Oftentimes clusters of compute and storage are provisioned for a specific project. Starting the project is easy (yay, cloud), but getting a project to completion and turning off the infrastructure seems to take a long time (boo, cloud). Every additional hour a temporary cluster runs works against the budget goal.
  • Research and development. Similar issues to temporary clusters. The benefit of developing in the cloud is that development and test can run against an exact copy of the production system configuration. But when delays occur, R&D clusters tend to hang around longer than anticipated. The solution is to budget the duration of non-production systems accurately and err on the side of caution. Don’t be too optimistic about turning off test environments, since development and quality assurance teams will want these clusters to live as long as possible.
  • Automation errors. Almost every compute and storage provisioning task is automated with Chef. But we have found examples (in older versions of Chef) where commands to terminate compute or delete storage returned success but silently failed. When this happens, orphaned volumes or zombie compute nodes eat the budget. The Cloud Control Viewer is our “trust but verify” system to safeguard against this problem.
  • Over-provisioned compute profiles. There is a tendency to over-provision the amount of compute and storage required for a specific role in the system. Measuring actual utilization is the answer to finding the optimal configuration, and that measurement can only happen after running in the “real world.” But retrofitting a running system to the optimal configuration is time consuming and needs to be planned with great care. The answer is more “science” up front (to the detriment of a product manager looking for a quick release), so that the initial configuration is as close as possible to what the production system actually requires.
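
As an example of the “trust but verify” sweeps described above, here is a minimal sketch that flags unattached block storage volumes. Our actual tooling is the Rails-based Cloud Control Viewer; the boto3 client and region below are stand-ins for illustration.

```python
import boto3  # assumption: any AWS SDK works; our real tool is the Rails-based viewer

ec2 = boto3.client("ec2", region_name="us-east-1")

# A volume with status "available" is attached to nothing: it is pure spend
# with no workload behind it, whether it got there via a Chef bug or forgetfulness.
response = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)
orphans = response["Volumes"]

wasted = sum(v["Size"] for v in orphans) * 0.10  # ten cents per provisioned GB-month
print(f"{len(orphans)} unattached volumes, roughly ${wasted:,.0f}/month of waste")
for v in orphans:
    print(v["VolumeId"], v["Size"], "GB, created", v["CreateTime"])
```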

Conclusions

The success of cloud computing hinges on managing infrastructure expense in alignment with revenue and other enterprise value metrics. The future looks bright for the day when a cloud cost czar’s duties are automated, and software stacks are tuned for the cloud’s unique operating characteristics.