When talking about self-healing software and infrastructure, it's important to know what self-healing isn't.
Failover and high levels of equipment redundancy prevent downtime, but they do not equal self-healing infrastructure. Redundancy is a preventative measure to ensure uptime, from power systems to servers, but hardware generally cannot replace its own failed components -- at least not yet. Nor is self-healing the stuff of science fiction, requiring artificial intelligence or supercomputer-grade processing. While self-healing software and infrastructure may seem out of reach, many of today's IT monitoring tools already incorporate this ability.
How IT stacks heal
Self-healing software relies on the ability to take corrective or preventive action based on a status check. In a truly self-healing stack, the corrective action is an autonomic function done without human interaction.
Self-healing capability is to virtual resources, software and application components roughly what failover and redundancy are to hardware. The key difference is the lack of human intervention. While hardware outages are noticeable and relatively straightforward to correct, a self-healing software stack requires fundamental changes to the application architecture. Self-healing simply wasn't feasible for most IT shops until virtualization and microservices application architectures emerged. Once the application stack supports granular management, it becomes possible to add autonomic functions that correct issues on small and large scales. Corrective action can be anything from preventative steps that avoid an incident to reactive corrections made after an incident occurs.
To achieve a self-healing software stack, start with monitoring. Self-healing actions are traditionally based on something that occurs to trigger a response. Reactive action -- steps taken to heal a stack after an outage occurs -- is much easier to implement because event-based triggers are well understood in IT tooling. Preventative action is quite a bit harder: the monitoring tool or suite must seek out indicators that a problem might be coming and initiate action before the outage occurs -- if it was going to occur at all. For example, when a monitoring tool identifies excessive memory usage, the cause might be a memory leak or simply a busy period for the application in question. The less expensive and less complex reactive method gives IT operations teams the benefit of exact 20/20 hindsight, rather than the sometimes unclear crystal ball of predictive analytics.
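The reactive, event-based trigger described above can be sketched in a few lines. This is a minimal illustration, not any specific monitoring product's API; the threshold value and the restart handler are hypothetical placeholders.

```python
# Minimal sketch of a reactive, event-based trigger: a monitor evaluates
# a metric sample against a threshold and fires a corrective action only
# after the condition has actually occurred.

MEMORY_THRESHOLD_PCT = 90  # hypothetical alert threshold


def restart_service(name):
    """Placeholder corrective action; a real system would call the init
    system, orchestrator or cloud API here."""
    return f"restarted {name}"


def check_and_heal(service, memory_used_pct):
    """Reactive trigger: act only once the metric crosses the threshold."""
    if memory_used_pct >= MEMORY_THRESHOLD_PCT:
        return restart_service(service)
    return "no action"
```

In a real deployment, the metric sample would come from the monitoring suite's polling loop or alert webhook rather than being passed in directly.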
Self-healing software stacks take corrective action to fix the underlying cause of an outage. That action might be as simple as an application restart or as complex as rebuilding or creating a new instance or virtual server. These systems need the flexibility to restart something as small as a service or to execute a task as large as deploying a new server. The self-healing system should implement progressively escalating corrective steps: for a reactive self-healing process, the system tries one action, and if that does not restore operations, moves to a larger action, and so on, until the application comes back online. The operations team must decide how long to wait between each step before moving on to the next.
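The escalation ladder just described might be sketched as follows. The step names and wait periods are illustrative assumptions, and the step runner and health check are injected so the logic stays independent of any particular orchestrator.

```python
import time

# Sketch of progressively escalating corrective steps. Each step runs in
# order; after the configured wait, the health check runs again, and if
# the stack is still down, the next (larger) action is tried.
ESCALATION_STEPS = [
    ("restart service", 30),      # smallest action, wait 30s, then re-check
    ("restart application", 60),
    ("rebuild instance", 120),
    ("deploy new server", 300),   # largest action
]


def heal(run_step, is_healthy, steps=ESCALATION_STEPS, sleep=time.sleep):
    """Escalate through corrective steps until the stack reports healthy.

    Returns the step that restored service, or None if every step failed
    (at which point a human should be paged).
    """
    for action, wait_seconds in steps:
        run_step(action)
        sleep(wait_seconds)  # the operations team chooses these waits
        if is_healthy():
            return action
    return None
```

Because the waits are data rather than hard-coded delays, the operations team can tune how long each step gets before escalation continues.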
Invest in visionary self-healing
Preventative self-healing infrastructure and software have a place in your data center, with a bit more work and data. For a monitoring system to effectively predict events, it needs a large amount of historical reference data. The obvious choice would be to accumulate performance data over time, but accurate predictions rely on a static environment. Every system change, upgrade or even patch can shift the performance baseline against which predictions are made. The other challenge is to set up systems that mine data for patterns while accounting for how often the environment changes -- difficult at best in IT.
None of this precludes predictive self-healing approaches. Consider focusing the predictive triggers on a much shorter subset of data. It won't be as effective as a larger set of historical data, but it still offers an advantage over the reactive approach alone.
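A short-window predictive trigger of this kind can be sketched with a rolling buffer of recent samples. The window size and warning threshold below are illustrative assumptions, not values from any particular tool.

```python
from collections import deque
from statistics import mean

# Sketch of a predictive trigger driven by a short window of recent
# samples rather than a deep historical baseline. When the recent average
# climbs above a warning level, preventative action fires before a hard
# limit is reached.


class ShortWindowPredictor:
    def __init__(self, window=5, warn_pct=75):
        self.samples = deque(maxlen=window)  # keeps only recent history
        self.warn_pct = warn_pct             # hypothetical warning level

    def observe(self, memory_used_pct):
        """Record a sample; return True when preventative action is advised."""
        self.samples.append(memory_used_pct)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) >= self.warn_pct
```

Because old samples roll off the deque automatically, a system change or patch only skews predictions for one window's worth of data instead of poisoning a long historical baseline.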
Any software stack you support that incorporates self-healing at any level should be tested -- one prime example of real-world testing is Netflix's Chaos Monkey application. Chaos Monkey's purpose is to randomly terminate instances in the user's microservices architecture in production, and the tool set is available on GitHub. While this might seem more like getting hit by a virus than deploying something helpful, it proves an IT operations team's ability to provide continuous service even through unpredictable events. While no one wants to do it, the only true way to know whether your self-healing software and infrastructure deployment works is to test it in production. Otherwise, it is simply a theory -- perhaps a wish -- of self-healing ability.
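The idea behind this style of testing can be illustrated with a toy experiment: kill a random instance from a pool and then verify the service still has capacity. This is a simplified sketch in the spirit of Chaos Monkey, not the Netflix tool itself; the function names and minimum-capacity check are hypothetical.

```python
import random

# Toy chaos experiment: randomly terminate one instance out of a pool,
# then check whether the service retains enough healthy capacity. A real
# self-healing stack would also be expected to replace the lost instance.


def chaos_kill(instances, rng=random):
    """Terminate one randomly chosen instance; return (survivors, victim)."""
    victim = rng.choice(sorted(instances))
    survivors = [i for i in instances if i != victim]
    return survivors, victim


def service_still_up(instances, min_healthy=1):
    """The self-healing claim only holds if capacity survives the kill."""
    return len(instances) >= min_healthy
```

Running such an experiment repeatedly against production, as the article suggests, turns "we believe it self-heals" into observed evidence.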