This is an essay that was published to the Sonian cloud compute blog. Cross posting here for this audience.
In the past I have written about the secret to successful cloud deployments and how to architect for the cloud. Being successful requires a “designed-for-the-cloud” architecture, best operational practices and DevOps on steroids.
A couple of weeks ago Amazon notified a majority of its customers about an upcoming event that we early-to-the-cloud pioneers hadn’t seen before: a forced reboot of the host operating system, on a massive scale. For Sonian, 72% of our currently running EC2 instances will need to be restarted before Amazon’s deadline. There is no reprieve. There is no deferment. Welcome to Infrastructure as a Service!
We had to scramble to assess the impact. All we learned from the email notice was that a portion of our EC2 instances would need to be restarted. There were actually two types of restart: an operating system reboot, which would preserve the non-persistent ephemeral storage, and a more invasive full instance restart (meaning the hardware under the hypervisor would power-cycle), which would not preserve the ephemeral storage.
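To make that triage concrete, here is a minimal sketch of the kind of assessment we are describing. It is illustrative only: the `Instance` record, the `RestartKind` names, and the `triage` helper are hypothetical, not Amazon's API and not Sonian's actual tooling, and it assumes you have already pulled the scheduled restart for each instance from your provider into a plain list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RestartKind(Enum):
    """The two restart types from the notice (names are illustrative)."""
    OS_REBOOT = "os_reboot"        # ephemeral storage is preserved
    FULL_RESTART = "full_restart"  # ephemeral storage is NOT preserved


@dataclass(frozen=True)
class Instance:
    """Hypothetical inventory record for one running instance."""
    instance_id: str
    role: str
    restart: Optional[RestartKind]  # None means no restart is scheduled
    uses_ephemeral: bool            # True if data lives on ephemeral disks


@dataclass(frozen=True)
class TriageReport:
    affected: List[Instance]   # every instance with a scheduled restart
    at_risk: List[Instance]    # full restarts that would wipe ephemeral data
    percent_affected: float    # share of the fleet that must restart


def triage(fleet: List[Instance]) -> TriageReport:
    """Split the fleet into affected instances and those that risk data loss."""
    affected = [i for i in fleet if i.restart is not None]
    at_risk = [
        i for i in affected
        if i.restart is RestartKind.FULL_RESTART and i.uses_ephemeral
    ]
    percent = 100.0 * len(affected) / len(fleet) if fleet else 0.0
    return TriageReport(affected, at_risk, percent)


if __name__ == "__main__":
    # Made-up fleet, purely for demonstration.
    fleet = [
        Instance("i-0001", "index", RestartKind.FULL_RESTART, True),
        Instance("i-0002", "web", RestartKind.OS_REBOOT, True),
        Instance("i-0003", "queue", RestartKind.FULL_RESTART, False),
        Instance("i-0004", "web", None, False),
    ]
    report = triage(fleet)
    print(f"{report.percent_affected:.0f}% of the fleet must restart")
    for inst in report.at_risk:
        print(f"copy ephemeral data off {inst.instance_id} ({inst.role}) first")
```

The instances in `at_risk` are the ones that need attention before the deadline, because anything on their ephemeral disks has to be copied elsewhere or be rebuildable. Everything else in `affected` only needs its restart scheduled.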
One of the major mistakes cloud customers can make is to get complacent and treat the cloud like traditional co-located hosting. The cloud has different operating characteristics, what one could call the “cloud laws of physics,” and this forced restart is a good example of that principle in action. It’s also a wake-up call not to get lazy. A large-scale forced restart is like an earthquake drill. Practice makes perfect, and if this were an actual unscheduled emergency, we would be scrambling.
Despite the headache, this event has some positive spins. First, it’s encouraging that there is an “EC2 fleet upgrade.” This means newer underlying hardware, perhaps with faster NICs in the hosts. For companies like Sonian that started in the cloud circa 2007, some of our original instances had been running for more than a year and needed a “freshening.” This event reminds us there is a “hardware” center to every amorphous cloud. Amazon does a great job of letting us not think about that too often, except at times like these. A stale part of the cloud gets a refresh.
The second “benefit” is the forced fire drill. I know, there’s never a good time for a fire drill. But this type of event has similar qualities to an unexpected outage. There is some luxury in being able to pre-plan, but the shake-out will be the same. Something will be discovered in your architecture or deployment practices that this reboot activity will push you to improve. Clusters may be too hard-coded. Config settings may be too restrictive. Reboot scripts may not work as you think.
Sonian survives unscathed due to our maniacal focus on 100% automated deployments, 100% commitment to “infrastructure as code,” and an investment in cloud control tools that allowed us to triage the situation and develop an action plan relatively quickly. We also employ the best darn DevOps team the cloud has seen.