Cloud Success Requires Cost-aware Engineering

This is a true story from the “Cloud Cost Czar Chronicles.”

Our S3 “penny per one thousand” API costs started to rise rapidly in the second half of the cloud infrastructure billing period. We had seen this behavior before, and knew it could be attributed to increased usage, a new defect, or a design flaw that rears its head at a scaling tipping point. My job as “cost czar” is to raise the alarm and work with the team to figure out what is going wrong. At the observed rate of increase, the excess charges would push the monthly bill beyond the budget. One thing we have learned in the cloud is that costs can rise quickly but take a while to come down. Within a single billing period, charges already incurred cannot be undone, so cutting the rate late in the period recovers far less than the spike added.

When we started using Amazon Web Services S3 (a PaaS object store) back in 2007, we were acutely aware of the three pricing vectors in effect: storage consumed, the price for API calls to store and list data, and the price for API calls to read and delete data. We’ve been using S3 heavily for five years, and we tried to model the “all-in” costs as accurately as possible. But “guesstimating” costs beyond the raw storage was a stretch. PaaS services have an intrinsic “social engineering” element. If you color outside the lines, the financial penalty can be significant. But if you master the pricing game, the rewards are equally significant.

So five years ago we thought that as long as we pointed in the right general direction, “we’ll figure it out as we go along.” Some assumptions proved a positive surprise: raw storage costs went down. Some surprises were not so pleasant: wrangling the API usage fees, especially for the transactions that cost a penny per thousand, proved to be a constant challenge. But I still like my options with S3 compared to buying storage from a hardware vendor and having to incur the administrative overhead. With S3 we can lower our costs through smarter engineering. With storage hardware, the only way to lower costs is to wrangle a better deal from an EMC salesperson.

As one of the original “cloud pioneers,” Sonian is not alone in this effort, and it’s been a real eye-opener for software designers to have to think about how their code consumes cloud resources (and expense) at scale. Whether the price is a penny per thousand or a penny per ten thousand, when you process hundreds of millions of transactions a month, any miscalculation suddenly brings a dark cloud raining over your project.
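To make the scale effect concrete, here is a rough sketch of the arithmetic. The two price points come from the text above; the monthly request volume is a made-up, illustrative number, and the function name `request_cost` is hypothetical.

```python
from decimal import Decimal


def request_cost(num_requests: int, price_per_batch: Decimal, batch_size: int) -> Decimal:
    """Dollar cost of num_requests billed at price_per_batch per batch_size requests."""
    return Decimal(num_requests) * price_per_batch / Decimal(batch_size)


PENNY = Decimal("0.01")

# Illustrative volume only ("hundreds of millions of transactions a month").
monthly_requests = 300_000_000

# The same volume at the two price points mentioned in the text.
per_thousand = request_cost(monthly_requests, PENNY, 1_000)
per_ten_thousand = request_cost(monthly_requests, PENNY, 10_000)

print("penny per thousand:     $", per_thousand)
print("penny per ten thousand: $", per_ten_thousand)
```

The point of the sketch is the tenfold gap: moving a workload from the cheaper request class to the more expensive one, or simply doubling the call count through a design flaw, changes the bill in proportion.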

At the beginning of our project we knew we had to design to be S3 friendly. (This is not only an S3 issue; all PaaS offerings have this challenge.) The “golden rule” for S3 is don’t store too many small files or too few large files. Otherwise you break the S3 price-engineering rules. To set us on the right path, we developed a data aggregation layer to get around the too-many-too-small-files issue. This was a good thing. We also developed a novel way to stream live data directly onto S3 without breaking the social-pricing convention against too many transactions.
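The general idea behind an aggregation layer can be sketched as follows. This is not our actual implementation, just a minimal illustration under stated assumptions: the class name `Aggregator`, the `upload` callable, and the size threshold are all hypothetical, and a real layer would also need an index so that individual records can be located inside each aggregated object.

```python
from typing import Callable, List


class Aggregator:
    """Buffers small records and uploads them together as one larger object."""

    def __init__(self, upload: Callable[[bytes], None], max_bytes: int) -> None:
        self._upload = upload        # hypothetical: performs one billable store request
        self._max_bytes = max_bytes  # flush once the buffer reaches this size
        self._parts: List[bytes] = []
        self._size = 0

    def add(self, record: bytes) -> None:
        self._parts.append(record)
        self._size += len(record)
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._parts:
            return
        self._upload(b"".join(self._parts))
        self._parts = []
        self._size = 0


# Stand-in for the object store: each list entry represents one store request.
uploads: List[bytes] = []
agg = Aggregator(uploads.append, max_bytes=1024)

for _ in range(1000):
    agg.add(b"x" * 100)  # 1,000 small records of 100 bytes each
agg.flush()              # push whatever is left in the buffer

print("records written:", 1000)
print("store requests: ", len(uploads))
```

With these illustrative numbers, the thousand small records go out in 91 store requests instead of 1,000. The threshold is the tuning knob: larger batches mean fewer billable requests, at the price of more data held in memory before it is durable.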

So when we started seeing rapidly increasing S3 API calls (in the form of List Requests, which cost a penny per thousand) it was time to put on the Sherlock Holmes cloak and find out what was going on.

Our normal monthly S3 list requests usage graph looks like this:

[Graph: S3 list requests over a normal month]

But this is what we saw at the middle of the month:

[Graph: S3 list requests at mid-month, during the spike]

The thing is, in the cloud there is “no do-over.” Once we make the API request, we incur the fee. It’s not like the “old days,” when hardware could be returned if it was not the right fit.

Here is the timeline of events:

  • At the middle of the month, a standard production release was deployed to most systems.
  • Around the same time the support department ramped up proactive index maintenance for one of our larger systems.
  • And since business is booming, every day we are taking in more data than the previous day. Data flows into the archive continuously, twenty-four hours a day.

So we started with an initial hypothesis: either a bug was introduced with the deployment, or the proactive index maintenance was doing something unusual.

Preview: What we learned after Root Cause Analysis was that the real problem wasn’t any one of these, but rather a combination of the three plus a design flaw that only becomes a problem at scale.

We have a number of “watch-dog” reports that give us daily cloud costs. We also measure costs by the hour, sampling a few times a day. Since cloud costs are incurred incrementally throughout the billing period, a few samples over two days are required to detect an anomaly. The reporting showed S3 costs were rising faster than expected. The S3 Bucket List Request API call was the specific S3 price component that was rising fast. All other S3 cost elements were normal. Historically, S3 List Requests have been a source of problems. The report showed that the month-to-date S3 list request fee had already reached the total dollar amount for the whole previous month. Left untriaged, this would amount to thousands of dollars in excess charges. And the money to date was already spent.
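The kind of check such a report makes can be sketched as a simple linear projection. This is only an illustration, not our actual reporting code: the function names, the sample values, and the budget figure are all made up, and a production report would likely smooth over more than two samples.

```python
from typing import List, Tuple

# Each sample is (day_of_month, month_to_date_cost_in_dollars).
Sample = Tuple[float, float]


def projected_month_end(samples: List[Sample], days_in_month: int) -> float:
    """Extrapolate month-end cost from the rate between the two latest samples."""
    (d0, c0), (d1, c1) = samples[-2], samples[-1]
    daily_rate = (c1 - c0) / (d1 - d0)
    return c1 + daily_rate * (days_in_month - d1)


def is_anomalous(samples: List[Sample], days_in_month: int, budget: float) -> bool:
    """Flag the component if the projected month-end cost exceeds the budget."""
    return projected_month_end(samples, days_in_month) > budget


# Illustrative numbers only.
budget = 3500.0
normal = [(10, 1000.0), (12, 1200.0)]  # about $100 per day
spike = [(14, 1400.0), (16, 2400.0)]   # about $500 per day

print("normal month flagged:", is_anomalous(normal, 30, budget))
print("spike flagged:       ", is_anomalous(spike, 30, budget))
```

Because month-to-date cost only ever accumulates, the useful signal is the slope between samples rather than the absolute figure, which is why a couple of samples over two days are needed before an anomaly shows up.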

Within a week a solution was deployed and S3 costs returned to normal.

Side-bar: 3 Essentials for Cost-Aware Engineering

  • A good metrics sub-system to capture details on every cloud API transaction. (Don’t layer it on top as an afterthought.)
  • QA that can test for cost impact against baseline goals, and at scale.
  • A framework to alert on production deviations.
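As a rough illustration of how the first and third essentials fit together, the sketch below counts API calls per operation, estimates their cost, and reports operations that exceed a baseline. Everything here is hypothetical: the `ApiMeter` class, the operation names, the baseline counts, and the tolerance are assumptions for the example, and the prices are the “penny per thousand” rate mentioned earlier in this post.

```python
from collections import Counter
from typing import Dict


class ApiMeter:
    """Counts cloud API calls per operation and estimates what they cost."""

    def __init__(self, price_per_thousand: Dict[str, float]) -> None:
        self._prices = price_per_thousand
        self.calls: Counter = Counter()

    def record(self, operation: str, count: int = 1) -> None:
        """Call this wherever the application issues a billable request."""
        self.calls[operation] += count

    def estimated_cost(self) -> Dict[str, float]:
        """Estimated dollars per operation; unknown operations are priced at zero."""
        return {op: n / 1000 * self._prices.get(op, 0.0) for op, n in self.calls.items()}

    def deviations(self, baseline: Dict[str, int], tolerance: float) -> Dict[str, int]:
        """Operations whose count exceeds the baseline by more than `tolerance` (a fraction)."""
        return {
            op: n
            for op, n in self.calls.items()
            if n > baseline.get(op, 0) * (1 + tolerance)
        }


# Illustrative prices and volumes only.
meter = ApiMeter({"LIST": 0.01, "PUT": 0.01})
meter.record("LIST", 250_000)
meter.record("PUT", 40_000)

baseline = {"LIST": 100_000, "PUT": 50_000}
alerts = meter.deviations(baseline, tolerance=0.25)

print("estimated cost:", meter.estimated_cost())
print("over baseline: ", alerts)
```

Recording at the point where the request is made, rather than reconstructing usage from the bill afterwards, is what lets the same counters serve QA baselines and production alerts.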

I love that our engineering, devops and support teams are so invested in thinking about the costs their code incurs in a cloud environment. Never before have I seen an engineering group think about the cost of their software running. Prior to the cloud there was very little attention paid to the question of “how much will this software transaction cost?” The pre-cloud world didn’t penalize us for not caring, because we were focused on optimizing physical server utilization, and that task didn’t require the same level of scrutiny in terms of “code efficiency.” Sure, we had to think about efficiency, but in the cloud efficiency is measured at the per-transaction level. And for our use case, we make billions of transactions a month.

My gut tells me that AWS doesn’t take much profit on S3 API list requests. The fees are more designed to contour how S3 is used, since excessive list requests put a burden on the shared system. So one way to keep developers on their toes is to charge a penalty for excess.

We continually get smarter about how to master the cloud. Our engineering group is “cost-aware” and we’re in charge of our own destiny, making the cloud work to our best advantage. Internally we call this effort “Playing the Cloud” and we’re creating some beautiful music.
