Five approaches to IT resiliency that don't need new hardware

With new application architectures, snapshots, containerization and other technologies, IT resiliency goes up while the server count stays the same.

IT resiliency isn't just a concern for top-tier enterprise-class applications -- organizations developing and deploying software for web-based storefronts, mobile users and a myriad of other tasks are increasingly concerned with how resilient the workload is.

Traditional hardware-based clusters are still a powerful choice to ensure IT resiliency for critical applications in the data center, but these five options bolster application resilience without major hardware investments.

1. Fault-tolerant software platforms

The idea behind clusters is to load balance application traffic across multiple duplicate servers. If a server fails, the other servers take up the load and the workload's operations continue unaffected. One alternative to the traditional server cluster is server fault tolerance or high availability. These models typically duplicate and synchronize VMs across multiple physical servers.
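
The load-balancing idea can be illustrated with a minimal Python sketch. It is not how any particular clustering product works; the `RoundRobinBalancer` class and the server names are hypothetical, chosen only to show traffic rotating across duplicate servers and skipping one that has failed.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer: rotates requests across servers, skipping failed ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_failed(self, server):
        # Remove a server from rotation without disturbing the others.
        self.healthy.discard(server)

    def mark_recovered(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # Walk the ring at most once; a healthy server is guaranteed to appear.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


# Illustrative server names only.
balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
first_pass = [balancer.next_server() for _ in range(3)]
balancer.mark_failed("server-b")
after_failure = [balancer.next_server() for _ in range(4)]
```

After `server-b` is marked as failed, the remaining two servers absorb its share of requests, which is the behavior the cluster model relies on.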

In the fault-tolerant mode, duplicate VMs share the load in a hot/hot configuration, like a traditional hardware-based cluster. If one VM fails, the other continues without disruption, though some traffic may drop because load balancing is not as comprehensive as hardware clustering techniques. Thus, the application is tolerant of faults, but not immune to them.

In the high availability mode, the duplicate VM is kept idle and synchronized with the working VM in a hot/warm configuration. If the working VM fails, the standby VM becomes active and takes on the traffic load. There may be a brief traffic disruption during the switchover.
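
A hot/warm switchover can be sketched roughly as follows. This is an illustration under stated assumptions, not the mechanism of any specific product: the `FailoverPair` class, the heartbeat timeout and the VM names are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class FailoverPair:
    """Hypothetical hot/warm pair: the standby is promoted when heartbeats stop."""

    def __init__(self, active, standby, heartbeat_timeout=3.0):
        self.active = active
        self.standby = standby
        self.heartbeat_timeout = heartbeat_timeout  # seconds (assumed value)
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now):
        # Called whenever the active VM reports that it is alive.
        self.last_heartbeat = now

    def check(self, now):
        """Promote the standby if the active VM missed its heartbeat window."""
        missed = now - self.last_heartbeat > self.heartbeat_timeout
        if missed and self.standby is not None:
            # The failed VM leaves the pair; the standby takes the traffic.
            self.active, self.standby = self.standby, None
            self.last_heartbeat = now
        return self.active


pair = FailoverPair("vm-primary", "vm-standby", heartbeat_timeout=3.0)
pair.record_heartbeat(10.0)
before = pair.check(12.0)  # 2.0 s since the last heartbeat: within the window
after = pair.check(14.5)   # 4.5 s since the last heartbeat: standby is promoted
```

The interval between the last heartbeat and the promotion corresponds to the brief disruption described above; a shorter timeout narrows it at the cost of more false failovers.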

Fault-tolerant and high availability deployments use software tools, such as Stratus everRun Enterprise and Vision Solutions Double-Take, capable of creating, synchronizing and failing over to redundant workload instances. The workload's importance to the business dictates whether high availability or the more rigorous fault-tolerance configuration is the right resiliency clustering choice.

2. Redundant cloud architectures

Some enterprise applications are developed and deployed in public clouds, such as Amazon Web Services (AWS), Google Cloud Platform and Microsoft Azure. Public clouds allow rapid and scalable VM and storage provisioning. Now, they also offer IT resiliency features for software developers, operations staff and cloud administrators.

AWS, for example, provides clustering with Auto Scaling services, which allow administrators to create groups of Elastic Compute Cloud (EC2) instances. The number of instances in a group can be increased or decreased manually, or automatically as workload traffic changes. AWS' Elastic Load Balancing service distributes traffic across those instances.
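
To make the automatic scaling idea concrete, here is a small Python sketch of a proportional scaling rule. It is an assumption for illustration only, not AWS' actual algorithm and not an AWS API call; the function name `desired_instances`, the 50% CPU target and the group limits are all hypothetical.

```python
import math


def desired_instances(current, avg_cpu_percent, target_cpu_percent=50.0,
                      minimum=1, maximum=10):
    """Hypothetical rule: size the group so average CPU moves toward the target."""
    if current < 1:
        raise ValueError("current instance count must be at least 1")
    # Scale the group in proportion to how far the metric is from the target.
    raw = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    # Never go below the group minimum or above the group maximum.
    return max(minimum, min(maximum, raw))


scale_out = desired_instances(current=4, avg_cpu_percent=80)  # traffic spike
scale_in = desired_instances(current=4, avg_cpu_percent=20)   # quiet period
capped = desired_instances(current=8, avg_cpu_percent=90)     # hits the maximum
```

The minimum and maximum bounds matter for resilience as much as for cost: the floor keeps spare capacity available if an instance fails, and the ceiling limits spending during a traffic surge.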

Organizations need no upfront capital hardware or software platform investment to create workload resilience in a public cloud deployment. The public cloud provider handles all of the hardware and management and the business pays for the compute resources that are actually used -- this amount will vary as EC2 instances and other associated cloud services scale up or down over the course of a billing cycle. To increase IT resiliency against regional disruption, consider cloud providers with international installations, across numerous geopolitical regions.

3. Take a snapshot

Almost every enterprise workload needs some level of operational protection. Not all of them require the real-time protection of clusters, fault tolerance and high availability platforms. Secondary applications or applications in test and development can tolerate some amount of downtime and data loss, and those applications may receive adequate levels of resilience with ordinary VM snapshots.

VMs are basically complete OS, driver, application and data instances running within a server's memory space. A snapshot essentially captures the state of the VM at a point in time, or the changes since the last snapshot, and saves that content to disk; in VMware vSphere, for example, disk changes are written to a *-delta.vmdk file and memory state to a *.vmsn file. If the VM fails, administrators restore the snapshot to restart the VM in a matter of minutes. This usually recovers the application to the point of the last snapshot. Any data written since that snapshot is lost and the restore itself takes time, so weigh the application's recovery point objective and recovery time objective before choosing snapshot-based resiliency. If the application can tolerate the potential downtime, this option minimizes hardware commitment by using only one server for the VM.
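
The recovery point and recovery time trade-off can be checked with simple arithmetic. The Python sketch below assumes the worst case, where the VM fails just before the next snapshot, so the potential data loss equals the snapshot interval. The `SnapshotPlan` type, the function name and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnapshotPlan:
    interval_minutes: float  # time between snapshots (assumed schedule)
    restore_minutes: float   # time to restore and restart the VM (assumed)


def meets_objectives(plan, rpo_minutes, rto_minutes):
    """Return True if the plan satisfies both recovery objectives."""
    # Worst case: failure occurs just before the next snapshot is taken.
    worst_case_data_loss = plan.interval_minutes
    return (worst_case_data_loss <= rpo_minutes
            and plan.restore_minutes <= rto_minutes)


hourly = SnapshotPlan(interval_minutes=60, restore_minutes=10)
fits_test_env = meets_objectives(hourly, rpo_minutes=240, rto_minutes=30)
fits_storefront = meets_objectives(hourly, rpo_minutes=5, rto_minutes=30)
```

With these illustrative numbers, hourly snapshots suit a test environment that can lose four hours of work, but not an application that can lose no more than five minutes of data.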

Major virtualization platforms such as VMware vSphere include powerful snapshot tools that can capture, organize, consolidate, manage and restore VM snapshots.

Resiliency mixologists

There are many potential strategies to bolster modern IT resiliency, and those strategies are rarely exclusive. As software developers and operations professionals collaborate more closely to achieve faster application delivery, it is possible to deploy multiple schemes to match resilience goals with the specific application. One size doesn't fit all.

4. Resilience in application designs

IT resiliency isn't just a deployment or operations issue. Resiliency is also a vital development consideration, and a workload's resilience can be profoundly influenced by the integrity of that application's design and implementation. In simplest terms, application resiliency means responding as gracefully as possible to problems or errors in the application's components, rather than producing nonsensical responses or crashing.

Applications with specific hardware dependencies can pose serious failover or restoration problems. Similar issues arise when workloads depend on specific OSes, drivers, database structures and other software components. Complex software with poor security or inadequate vulnerability testing leaves open many possible attack vectors, which also compromises the application's resiliency.

Proper design techniques and comprehensive testing can't prevent every problem, but do help ensure that versions released to production continue service or fail gracefully when they encounter bugs and other errors. Integrating log and data collection capabilities into the application will help record error conditions and pinpoint performance problems.
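
As a rough illustration of failing gracefully while recording the error, the Python sketch below wraps a component call, logs any failure and returns a fallback value instead of crashing. The storefront scenario, the function names and the fallback list are hypothetical; only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger("storefront")  # hypothetical application name


def get_recommendations(fetch, user_id, fallback=()):
    """Call a component; on failure, log the error and degrade to a fallback."""
    try:
        return list(fetch(user_id))
    except Exception:
        # logger.exception records the message together with the stack trace.
        logger.exception("recommendation lookup failed for user %s", user_id)
        return list(fallback)


def broken_fetch(user_id):
    # Stand-in for a dependency that is currently unavailable.
    raise TimeoutError("recommendation service unavailable")


result = get_recommendations(broken_fetch, 42, fallback=["bestsellers"])
```

The page still renders with a generic list, and the log entry gives operations staff the detail needed to find the failing component.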

5. Containers and microservices in application design

Workload resiliency is increasingly affected by scalability. If the traffic demands outstrip the available compute resources of a workload instance, the workload's performance suffers, or it crashes entirely. Virtual machine clusters and load balancing are well-established means to scale an application; modern application design can capitalize on microservices architecture, deployed in virtualized containers. Instead of monolithic workload designs deployed as VMs, functional components communicate through application programming interfaces to enable the application's functions.

The advantage of container-based microservices is that containers share a common OS, allowing faster scaling with much less compute overhead. Each containerized component can scale independently, allowing for clustered and load-balanced containers for each functional area rather than full instances of the entire application. Functional components are updated and upgraded more quickly than monolithic apps, requiring less regression testing and posing less risk of unintended consequences.
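
A back-of-the-envelope Python sketch shows why independent scaling can use less compute. It assumes a simplified model in which every copy of a monolith contains all components, so the busiest component dictates the number of full copies. The service names, request rates and per-instance capacities are hypothetical figures.

```python
import math


def replicas_needed(load_rps, capacity_rps):
    """Instances required to serve a load, with at least one always running."""
    return max(1, math.ceil(load_rps / capacity_rps))


# Hypothetical requests per second and per-instance capacity for each component.
load = {"catalog": 900, "checkout": 120, "search": 300}
capacity = {"catalog": 200, "checkout": 100, "search": 150}

# Microservices: each component scales to its own demand.
per_service = {name: replicas_needed(load[name], capacity[name]) for name in load}
container_instances = sum(per_service.values())

# Monolith (simplified): the busiest component sets the number of full copies,
# and every copy carries all components.
monolith_copies = max(per_service.values())
monolith_component_instances = monolith_copies * len(load)
```

Under these assumptions, the containerized design runs nine component instances where the monolith runs fifteen, because lightly loaded components such as checkout are not duplicated along with the busy catalog.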

It's a more complex deployment scheme, but the reward for that additional work is felt in IT resiliency as applications scale further than traditional physical or virtual machines while using less total compute hardware.
