The core of modern IT and data center operation is automation. It's the key to the five-minute virtual machine, elastic scaling, DevOps, continuous deployment and more. And it's the spark that ignites devastation when things go wrong.
Any time you have to perform a manual operation, the task takes longer, is more prone to errors and costs more. Do you need to improve efficiency? Automation is the way to go. Improve your server-to-admin ratio? Automate. Drive IT agility? More automation.
Data center components -- servers, storage arrays, routers, switches -- are really just a collection of resources surrounded by automation. A server comprises motherboards, controllers, interface cards and other hardware, each piece automated to perform its unique task. Servers are reliable because their internal automation is uniform: every server is built on the same thoroughly tested and hardened set of well-known operations.
The interfaces at the boundaries of these components are well documented and generally understood. We standardized storage protocols, core network devices, hypervisors and operating systems. As long as the operations between these components are simple and repeatable, life is good. Right?
Too often, a system fails because an operator incorrectly configures a single component using these well-documented application programming interfaces and consoles. The massive AWS outage in April 2011 stemmed from a single incorrectly executed network configuration change.
A single misfire of an established procedure from a provider known for operations at scale took out the cloud for days. Why? Automation: The force that allows AWS to dominate the cloud is also the Achilles heel of complex systems.
The complex adaptive systems school of thought explains why highly automated systems inevitably suffer large-scale failures. Think of the flash crash of 2010. Automation at multiple levels -- quoting, orders and execution -- together dropped the U.S. stock market by nearly 8% in a matter of minutes. One unexpectedly large trade, on a day when sentiment and indicators were already fragile, combined with automated actions to instigate political and economic chaos. This is analogous to a network administrator who, during a routine maintenance operation, routes all traffic onto the wrong secondary network and causes an outage.
No matter how much you test and how carefully you think through every possible scenario, it's easy to miss the totally random condition or event that brings the environment to its knees.
In an enterprise data center where heterogeneity, proprietary tools and shiny new objects that add complexity are the norm, creating and integrating layers of automation causes similar, albeit less impactful, failures. Put a little OpenStack here; add orchestration and governance; let each business unit implement its own form of continuous integration and deployment; and throw in some Salt, Chef, cron jobs, Microsoft System Center, HP Server Automation, a summer of endless scripts and more. The ingredients for an unsavory automation stew will simmer in the data center.
IT sprawl of this sort is hard to avoid. Each technology cycle brings in a host of new tools and skills requirements, but the old stuff is still humming away. Since few businesses can start from scratch and invent tailored automation tools, your feature selection is subject to the whims of the market.
Each tool and approach requires specialized expertise, with its own learning curve and nuances to master. With heterogeneous IT sprawl, your team needs lots of different skill sets to manage one environment. Shrinking IT budgets and little time to integrate new products make maintaining a critical mass of skills a difficult proposition.
Since you can't eliminate automation sprawl and complexity, minimize the impact of the unexpected catalyst that will eventually set off a series of cascading failures, costing your company millions of dollars and possibly costing you your job.
Controlling the impact of IT automation sprawl
Instead of investing untold sums in the data center to prevent failures, invest in modeling and controlling the impact of those failures. Automation sprawl happens. It's inevitable. And it can crush you. Rather than fighting automation sprawl, accept, isolate, create the unexpected and adapt.
Accept failure: The first step of redemption is acceptance. It's going to fail; count on it.
Isolate everything: No instance of an automation environment should hook up to the internal methods of a different subsystem. Completely isolate each automated subsystem, loosely coupling systems and putting all control through interfaces. Critical systems must be redundant.
Create the unexpected: Chaos is your friend. Rather than waiting for random events, randomize your testing model. It helps you learn about edge conditions that you wouldn't otherwise encounter. One technique is to shut data center elements down randomly -- a more elegant version of kicking the plug from the wall. Netflix released its Chaos Monkey code so other IT shops could drive failure through their systems, ultimately making them stronger.
Adapt: When you encounter a failure, learn and engineer around that condition, and move on. Complex adaptive systems in this cloud context are not about controlling for failure, but adapting to failure to make systems more resilient and operable in the future.
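To make the isolate and create-the-unexpected steps concrete, here is a minimal Python sketch of randomized failure injection against a toy model of redundant subsystems. It is an illustration only, not Chaos Monkey's code or API. Every name in it (Replica, Subsystem, inject_failure, run_experiment) is hypothetical, and the "shutdown" is simulated by flipping a flag rather than touching real infrastructure.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One instance of a subsystem (hypothetical model, not a real API)."""
    name: str
    healthy: bool = True


@dataclass
class Subsystem:
    """A subsystem that callers reach only through its public methods."""
    name: str
    replicas: list = field(default_factory=list)

    def is_available(self) -> bool:
        # The subsystem keeps serving while at least one replica is healthy.
        return any(r.healthy for r in self.replicas)


def inject_failure(subsystems, rng):
    """Shut down one randomly chosen healthy replica.

    Returns (subsystem, replica), or None if nothing is left to shut down.
    """
    candidates = [(s, r) for s in subsystems for r in s.replicas if r.healthy]
    if not candidates:
        return None
    subsystem, replica = rng.choice(candidates)
    replica.healthy = False  # simulated shutdown
    return subsystem, replica


def run_experiment(subsystems, rounds, seed=None):
    """Inject up to `rounds` random failures; return names of subsystems that went dark."""
    rng = random.Random(seed)
    outages = []
    for _ in range(rounds):
        hit = inject_failure(subsystems, rng)
        if hit is None:
            break
        subsystem, _replica = hit
        if not subsystem.is_available() and subsystem.name not in outages:
            outages.append(subsystem.name)
    return outages


if __name__ == "__main__":
    fleet = [
        Subsystem("billing", [Replica("billing-1"), Replica("billing-2")]),
        Subsystem("reporting", [Replica("reporting-1")]),  # no redundancy
    ]
    print("Subsystems lost:", run_experiment(fleet, rounds=1))
```

Run repeatedly, a single random failure never takes down the redundant "billing" subsystem but sometimes takes down "reporting", which is the kind of edge condition randomized testing is meant to surface before a real event does.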
John Treadway is SVP at Cloud Technology Partners and can be found online at @JohnTreadway.