chaos engineering

Chaos engineering is the process of testing a distributed computing system to ensure that the system can withstand unexpected disruptions in function. It is so named because it relies on concepts from chaos theory, which focuses on random and unpredictable behavior. The goal of chaos engineering is to continuously conduct controlled experiments that introduce random and unpredictable behavior in order to discover weaknesses in a system. 

In computing, a distributed system is any group of computers that are linked over a network and share resources. Distributed systems can break when unexpected conditions or situations occur, such as an unintended change introduced by an intentional update. Large distributed systems have complex and unpredictable dependencies between components, which can make it difficult to troubleshoot an error. This is where chaos engineering comes into play. Chaos engineering identifies "what if" scenarios that aim to trigger failures, so that the system owners can evaluate the performance and integrity of the software.

For example, imagine a distributed software system that is designed to handle a certain number of transactions per second. Chaos engineering would seek to discover how the software responds when that limit is approached, reached or exceeded, and whether performance suffers or the system crashes. The scenario can be simulated in a chaos engineering setup to see how the system behaves when it experiences a lack of resources or a point of failure. If the system fails under testing, developers can make design changes that accommodate the scenario, or add ways to avoid it entirely. Once design changes are made, the same test is repeated to verify the desired results.
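
The transaction-limit scenario can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names simulate_load and probe_limit, the 1,000 transactions-per-second capacity and the rule that excess transactions are simply dropped are all hypothetical, chosen for illustration rather than taken from any real system or tool.

```python
from dataclasses import dataclass


@dataclass
class LoadResult:
    offered_tps: int   # transactions per second sent to the service
    served_tps: int    # transactions per second the service completed
    dropped_tps: int   # transactions per second lost for lack of capacity

    @property
    def degraded(self) -> bool:
        # The run counts as degraded as soon as any transaction is dropped.
        return self.dropped_tps > 0


def simulate_load(offered_tps: int, capacity_tps: int) -> LoadResult:
    """Assumed toy model: serve up to capacity_tps and drop the rest."""
    served = min(offered_tps, capacity_tps)
    return LoadResult(offered_tps, served, offered_tps - served)


def probe_limit(capacity_tps: int) -> list:
    """Offer load approaching (80%), reaching (100%) and exceeding (120%) the limit."""
    levels = [capacity_tps * pct // 100 for pct in (80, 100, 120)]
    return [simulate_load(tps, capacity_tps) for tps in levels]


if __name__ == "__main__":
    for result in probe_limit(capacity_tps=1000):
        status = "degraded" if result.degraded else "ok"
        print(result.offered_tps, result.served_tps, result.dropped_tps, status)
```

In a real experiment the simulated function would be replaced by measurements from the system under test, and the same probe would be rerun after each design change to confirm that the degraded case has gone away.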

How chaos engineering works

Like stress testing, chaos engineering aims to improve a system's or network's design by discovering and correcting its weaknesses. The process is typically divided into several steps and starts with the establishment of a baseline. The testers must first identify how the system should operate under optimal conditions and specify what constitutes a normal working state. Next, they must consider one or more potential weaknesses and formulate a hypothesis about the effects of those weaknesses. For example, software testers might want to know what will happen if a large traffic spike occurs. They would then conduct experiments to gauge the consequences of a large spike. The experiments might reveal an error in a critical process or an unexpected cause-and-effect relationship. For example, the traffic spike simulation might show an unexpected performance degradation in storage.
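
The baseline, hypothesis and experiment steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Experiment class, the run_experiment function and the toy_storage_latency_ms model (latency stays flat up to 500 requests per second, then grows) are hypothetical and stand in for real measurements.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    measure: Callable[[int], int]  # hypothetical probe: request rate -> latency in ms
    normal_rate: int               # requests per second in the normal working state
    spike_rate: int                # requests per second during the injected spike
    tolerance: float               # allowed latency growth relative to the baseline


def run_experiment(exp: Experiment) -> dict:
    baseline_ms = exp.measure(exp.normal_rate)   # step 1: establish the baseline
    limit_ms = baseline_ms * exp.tolerance       # step 2: hypothesis, latency stays under this
    observed_ms = exp.measure(exp.spike_rate)    # step 3: run the experiment
    return {
        "name": exp.name,
        "baseline_ms": baseline_ms,
        "observed_ms": observed_ms,
        "hypothesis_holds": observed_ms <= limit_ms,
    }


def toy_storage_latency_ms(rate: int) -> int:
    """Assumed toy model: 20 ms up to 500 req/s, then 1 ms more per extra 10 req/s."""
    return 20 + max(0, rate - 500) // 10


if __name__ == "__main__":
    spike = Experiment("traffic spike", toy_storage_latency_ms,
                       normal_rate=200, spike_rate=1000, tolerance=1.5)
    print(run_experiment(spike))
```

With these assumed numbers the hypothesis is rejected, which mirrors the storage degradation described above: the spike exposes a weakness that the baseline alone would not show.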

Chaos engineering can be performed on a program that has not yet launched, and much can still be learned; however, testing under real-world conditions yields the most accurate results. For this reason, chaos engineering is often performed on production systems, especially when it is too cumbersome or expensive to duplicate a large, distributed system environment just for testing purposes. Naturally, this also means that chaos engineering can be highly disruptive. Success with the chaos engineering paradigm demands close communication and coordination between IT staff and developers and across business units. Experiments are rarely run during peak times, to avoid giving customers a negative experience.

Figure: Chaos Monkey terminates service instances. Chaos Monkey is a tool that enables chaos engineering by creating problems on systems; the figure shows it terminating instances of a service.

Chaos engineering tools

Chaos engineering is a relatively new approach to software testing and software quality assurance (QA). Netflix was a notable pioneer of chaos engineering and among the first to formalize how to use it in production systems. Netflix designed and open sourced automation platforms for chaos tests, including Chaos Monkey, Chaos Gorilla and similarly named tools, collectively dubbed the Simian Army. For example, Chaos Monkey randomly disables production instances to force a failure, but is designed to do so without affecting customers. Chaos Gorilla does the same job on a larger geographical scale. The Netflix Simian Army continues to grow with more chaos-inducing programs made to test the company's services.
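
The idea of randomly disabling an instance while customers remain unaffected can be shown with a small in-memory simulation in Python. This sketch does not use or reproduce Chaos Monkey's actual interface; ServicePool, terminate_random_instance and the min_healthy threshold are hypothetical names and assumptions introduced only for illustration.

```python
import random
from typing import Optional


class ServicePool:
    """In-memory stand-in for a group of redundant service instances."""

    def __init__(self, instance_ids, min_healthy: int) -> None:
        self.running = set(instance_ids)
        # Assumption: customers are served as long as this many instances run.
        self.min_healthy = min_healthy

    def terminate(self, instance_id: str) -> None:
        self.running.discard(instance_id)

    def serves_customers(self) -> bool:
        return len(self.running) >= self.min_healthy


def terminate_random_instance(pool: ServicePool, rng: random.Random) -> Optional[str]:
    """Pick one running instance at random and terminate it."""
    if not pool.running:
        return None
    # Sort before choosing so that a seeded generator gives repeatable picks.
    victim = rng.choice(sorted(pool.running))
    pool.terminate(victim)
    return victim


if __name__ == "__main__":
    pool = ServicePool(["web-1", "web-2", "web-3"], min_healthy=2)
    victim = terminate_random_instance(pool, random.Random(7))
    print("terminated", victim, "still serving:", pool.serves_customers())
```

If the pool stops serving customers after a single termination, the experiment has found a weakness: the service lacks the redundancy it was assumed to have.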

LinkedIn uses another open source failure-inducing program called Simoorg. Monkey-Ops, an open source tool implemented in Go, is built to test and terminate random components and deployment configurations. Gremlin is another chaos engineering program.

This was last updated in February 2018
