High-consequence failure domains, such as an airplane in flight, are a great metaphor for IT operational risk in production, says safety expert Sidney Dekker.
IT teams should look for failure in successes, because success tends to predict big system failures better than the small errors along the way do.
"If you look for totally bug-free, failure-free code, that's not very predictive of the stability or resilience of a complex system," says Dekker, a Griffith University professor and author of multiple publications on safety. Examine those organizations that meet service-level agreements despite resource constraints, conflicting pressures and diverse goals; they address IT operational risk differently than lower-performing teams.
Dekker invokes the airplane example to talk about trust and the path to a resilient application. Ops flies the plane, but devs built the engine. Developers are accountable to the people who support the application -- its pilots.
Trust between dev and ops
There's no way to completely eradicate IT operational risk in complex environments, but safe communication and forward-looking accountability reduce it.
Forward-looking accountability means that people speak out even when nothing is broken, Dekker says. They discuss what got off track, flag components they're uncertain about and question the way a process is handled. Ops staff should be able to ask, "How can we be sure the delivered code is safe to deploy?" The more diverse the voices in the room and the work experience they bring to the discussion, the more likely it is that issues will be uncovered before production.
When an outage does occur, don't fire the culprit; change the conditions that created the scenario. Developers focus on their small piece of the puzzle, and in their limited scope, everything looks hunky-dory, Dekker says. But that piece of code interacts with a larger app system, which is, in turn, part of a complex, interdependent production IT environment. Production IT support deals with immense amounts of dark debt, meaning problems that only come to light when something fails. There's no way to fully understand every piece of the environment and how the pieces interact, so teams should always assume that unknown IT operational risks exist and take measures to preempt and mitigate them.
Even when everything works smoothly, expect problems from IT operational risks. Don't assume past success ensures future outcomes. That prediction is invalid in a complex system. You can predict probabilities, not certain outcomes.
Sidney Dekker spoke with SearchITOperations at DevOps Enterprise Summit 2017. The 2018 conference will take place October 22 to 24 in Las Vegas.