Here’s an all-too-common scenario from the “cloud chronicles.” A virtual machine that has been operating just fine for days, and has 50 other identical twins with the same configuration, starts to exhibit problems. Slow virtual disk performance. Network brown-outs. Disconnecting and reconnecting within its functional cluster. Monitoring systems alert on degrading performance, and the knee-jerk response is to jump on the box (née VM) and start to troubleshoot the issue. The problem is, spending any time troubleshooting an anomaly in the “cloud” is the wrong reaction. In the cloud, the first response, when a node starts to exhibit erratic behavior, should be to replace, not fix.
Replacing, instead of fixing, goes against the ingrained habits of over two decades of entrenched IT best practices. In the pre-cloud world, when real hardware was the base, we had to “fix IT” because replacing was too expensive and not practical. There was not an endless pile of spares lying about for a “replace IT” mindset.
But in the cloud, with, in theory, nearly infinite CPU, the remediation for an errant node should be to replace it immediately and move on.
Why Is This?
Because there are too many causes beyond our control at the OS level in a cloud environment. Think of the cloud like living in a high-rise building. Each unit in the building, just like each cloud customer, can have whatever interior they want, but there are also massive shared resources in the building. So while our interior may be a candidate for the next Architectural Digest cover, our neighbor could “kill our chill” with a too-loud stereo boom box. The cloud suffers from the noisy neighbor problem just like our theoretical high-rise. But in the cloud, we can choose to move and jump back into the random lottery for a new unit. We can’t change the building, but we can change our location within the building.
Of course, you need the right cloud-centric architecture to be able to simply “replace IT” instead of “fix IT.” Having cloud-dexterity is critical to operating a successful cloud deployment.
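As a rough illustration of what “replace IT” can look like in practice, here is a minimal Python sketch of a replace-on-failure loop. Everything in it is an assumption made for illustration: the `ReplacePolicy` class, the `reconcile` function, the threshold of consecutive failed health checks, and the `check`, `launch`, and `terminate` callbacks, which stand in for whatever monitoring and provisioning calls your cloud provider actually offers. It is not any particular vendor’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReplacePolicy:
    """Hypothetical policy: replace a node after N consecutive failed checks."""
    max_failures: int = 3
    failures: Dict[str, int] = field(default_factory=dict)

    def observe(self, node_id: str, healthy: bool) -> bool:
        """Record one health check; return True if the node should be replaced."""
        if healthy:
            # A healthy check resets the streak for this node.
            self.failures.pop(node_id, None)
            return False
        self.failures[node_id] = self.failures.get(node_id, 0) + 1
        return self.failures[node_id] >= self.max_failures


def reconcile(
    nodes: List[str],
    check: Callable[[str], bool],
    launch: Callable[[], str],
    terminate: Callable[[str], None],
    policy: ReplacePolicy,
) -> List[str]:
    """Run one pass over the cluster and return the updated node list.

    `check`, `launch`, and `terminate` are placeholders for real provider
    calls; no troubleshooting is attempted on an unhealthy node.
    """
    updated = []
    for node_id in nodes:
        if policy.observe(node_id, check(node_id)):
            replacement = launch()      # bring up the new node first
            terminate(node_id)          # then retire the misbehaving one
            policy.failures.pop(node_id, None)
            updated.append(replacement)
        else:
            updated.append(node_id)
    return updated
```

Requiring a few consecutive failures before replacing is a design choice in this sketch, meant to avoid churning nodes on a single blip; the right threshold depends on how quickly a replacement can join the cluster.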
The cloud requires us to “un-learn” the best practices of the past and embrace a new way of thinking about “break fix.” While replacing instead of fixing may seem wasteful, it’s really not. The time spent troubleshooting the random problem will not yield significant insights, and could be better spent focusing on more value-add projects. Usually after extensive diagnosis, the only recourse is to replace the node, since the original problem was an outlier.


