NEW YORK -- IT operations by nature responds to the unexpected. Chaos engineering deliberately forces the unexpected to occur, and spotlights issues with the potential to bring systems down.
Complex workloads with intra- and inter-application dependencies increase the likelihood of cascading failures and fragility, said experts at Velocity 2017 here this week. That threat to resilience and availability means IT ops should take aggressive and even daring approaches to problem diagnostics and prevention.
Even greenfield apps cannot completely avoid unplanned issues caused by surprises in the deployment environment. As applications evolve and grow, components might rely on their own functionality to operate, a phenomenon known as a strange loop. Strange loops, interdependent apps and distributed systems all work until they don't, said Richard Cook of the Ohio State University SNAFUcatchers.
These hidden dangers to stable IT operations are dark debt, said Cook, a research scientist. Technical debt is a known drag on product innovation and advancement. Dark debt is unforeseen; it takes its name from dark matter, which is invisible to observers until they experience an anomaly.
"Technical debt is what you fix in the future -- it's measurable and visible. Dark debt is only expressed through failure," said Eric Liu, principal architect and DevOps team lead at ADP, who attended the SNAFUcatchers' presentation at Velocity.
ADP, founded in 1949, combats technical debt on a diverse IT estate that ranges from mainframes to Amazon Web Services deployments, and it improves workflows as tech debt goes down. The company uses chaos engineering to discover and address dark debt -- the problems that aren't planned against. "When you don't know what's going to break, that's a real fire drill," Liu said.
Chaos engineering was popularized by Netflix with its Simian Army, led by Chaos Monkey, which tests the resiliency of IT deployments. When you can't watch a hot new series on Netflix after work, you're mad -- but when you can't get your paycheck, mad doesn't begin to describe it. So ADP started its chaos engineering in the performance environment, a sandboxed version of production.
Google's DiRT (disaster recovery testing) is another chaos engineering example; it takes systems and staff out of action to test recovery. LinkedIn, which also presented at Velocity, applies a version of chaos engineering to capacity management with Redliner, which pushes servers to their limits in production.
Chaos engineering is just one front where ADP fights dark debt. The company recently started to open source its code from the ADP Innovation Lab. "When you put stuff out there, people attack it for you," Liu said, and the result is a better designed, resilient system.
Another technique to shine light into the darkness is peer mentorship. Some experienced senior developers don't share their work; they are "dark matter developers," a term coined by Microsoft's Scott Hanselman, Liu said. ADP exposes these experienced pros to IT interns or new hires, so they are forced to explain what they're doing and why, and the newcomers absorb this knowledge and challenge assumptions or address overlooked problems. "They make the dark matter developers accountable," Liu said.
A saying from the medical field applies in IT too: Good results come from experience, and experience comes from bad results, Cook said. Just make sure those bad results are small, and that experienced staff guide new staff through their initial attempts -- "legitimate peripheral participation" in mentorship parlance. Pose hypothetical failures to operations professionals as an exercise where you can emphasize fundamental troubleshooting approaches and how best to use tools and data. People gradually gain responsibilities with experienced staff to guide them.
Diagnosis applies logic to chaos
The ability to diagnose problems is inherent in IT operations, where the concentration is on what happens rather than what should happen, said Terran Melconian, a data scientist with operations engineering focus at Air Network Simulation and Analysis. Hone diagnosis with a set process to properly identify dark debt issues.
Above all else, question your beliefs. As the SNAFUcatchers described, everyone who supports IT workloads has a mental model of how systems and supporting tools work. But there's no guarantee that the logs are accurate, that monitoring data reports on the right metric, or that a piece of code was deployed on the server where it belongs. Do not cling to these beliefs, Melconian said -- if you're having failures, they are wrong somewhere.
To identify the source of failure as efficiently as possible, don't waste time swapping one component after another until one solves the problem. Instead, apply a binary approach. Identify two or more possible causes for the symptom you're experiencing, and track down the metrics data that will confirm or disprove each hypothetical cause. Then narrow in on the culprit until you're reasonably sure you know what to do to fix this weakness and improve resiliency. Consider what other symptoms you'd expect to see if you're correct about the problem's cause, and test for them.
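One way to picture the binary approach is as a bisection over the path a request takes. The Python sketch below is an illustration only, not something presented at Velocity: the stage names and the healthy_through probe are hypothetical, and it assumes that once a fault appears, every later stage also looks wrong.

```python
from typing import Callable, Sequence


def first_failing_stage(stages: Sequence[str],
                        healthy_through: Callable[[int], bool]) -> str:
    """Return the first stage whose output looks bad, halving the range each probe.

    healthy_through(i) is a hypothetical probe: True if the metric or data
    observed at the output of stages[i] still looks correct.
    Assumes the symptom is visible at the last stage and persists downstream.
    """
    lo, hi = 0, len(stages) - 1   # the culprit lies somewhere in stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if healthy_through(mid):
            lo = mid + 1          # everything up to mid is fine, so look later
        else:
            hi = mid              # the fault is at mid or earlier
    return stages[lo]


# Hypothetical request path, plus a fake probe that records each measurement.
stages = ["load balancer", "web tier", "auth service", "queue",
          "worker", "cache", "database", "report export"]
probes = []


def probe(i: int) -> bool:
    probes.append(stages[i])
    return i < 4                  # pretend the "worker" stage (index 4) is broken


culprit = first_failing_stage(stages, probe)
print(culprit, "found after", len(probes), "probes")
```

With eight stages, the sketch needs three measurements to isolate the faulty one, where swapping components one at a time could take up to eight.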
Melconian offered several tips to achieve the most efficient diagnostics. Take the fastest measurements first; document your efforts; pursue mutually exclusive causes; and focus on changing the failure, not getting to the root of the cause. Avoid workarounds that will obscure the issue rather than fix it -- you'll simply create more dark debt. And while a recent change is an obvious suspect in a problem, don't give too much weight to hunches.
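Three of those tips (fastest measurements first, mutually exclusive causes, documented effort) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the Check structure, the time estimates and the example hypotheses are invented, and real checks would query logs or monitoring rather than return canned values.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    hypothesis: str                # one candidate cause, exclusive of the others
    est_seconds: float             # rough guess at how long the measurement takes
    confirms: Callable[[], bool]   # True if the measurement supports the hypothesis


def diagnose(checks: List[Check]) -> Tuple[Optional[str], List[str]]:
    """Run the cheapest checks first and keep a written trail of every step."""
    trail: List[str] = []
    for check in sorted(checks, key=lambda c: c.est_seconds):
        confirmed = check.confirms()
        trail.append(f"{check.hypothesis}: {'confirmed' if confirmed else 'ruled out'}")
        if confirmed:
            # Causes are assumed mutually exclusive, so one confirmation ends the search.
            return check.hypothesis, trail
    return None, trail


# Hypothetical hypotheses with canned results standing in for real measurements.
checks = [
    Check("disk full on app host", 5, lambda: False),
    Check("bad config in last deploy", 60, lambda: True),
    Check("DNS resolution failing", 2, lambda: False),
    Check("memory leak in worker", 600, lambda: False),
]

cause, trail = diagnose(checks)
print(cause)
for entry in trail:
    print(" ", entry)
```

In this run the two quick checks are ruled out before the slower one is confirmed, and the ten-minute memory investigation never has to start. The trail list is the documentation of what was tried.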
The post-mortem's function
Once the problem is fixed, a post-mortem debriefing is one way to prevent it in the future. But the role of a post-mortem isn't to find out what went wrong -- it's to point out where the vulnerabilities lie, Cook said.
People in post-mortem meetings commonly say they didn't know this system worked that way, he said. That's an opportunity to recalibrate the mental model in ops of how the IT deployment works. With complex deployments, it's impossible to know everything about the whole system, so any hiccups or unexpected actions are teaching moments for the support team.
Chaos engineering can identify dark debt at one moment in time, which leads to changes that fix problems. But systems aren't static, so every change reorients the locus of the dark debt, Cook said. Organizations such as Google and Netflix repeatedly run chaos engineering exercises with new stressors to improve and maintain resiliency.
Safeguards -- gates that prevent human error or limit capabilities to make changes -- might work in some instances, but even the safeguard can suffer from poor design or an incomplete or misguided understanding of how the system functions, Cook said. Tools don't show you the consequences of your actions in an intelligent way -- Chef will run the command it's given whether it's a good idea or not, he said.
What doesn't work? Firing someone doesn't make progress on dark debt; it just gets rid of the person who knows what went wrong, Cook said. Place value on information rather than on blame, to instill a culture of reporting and investigation and overcome the natural human tendency to cover up an issue and avoid censure. Liu and others see dedicated, long-term, collaborative and meaningful training as the way to effect change.
Attention to problem diagnostics, Melconian said, is the groundwork for IT automation. After all, automation is a great way to go wrong faster, if you don't plan and account for all eventualities. Once the deployment is accurately modeled, automation is IT ops' friend.
Meredith Courtemanche is a senior site editor in TechTarget's Data Center and Virtualization group, with sites including SearchITOperations, SearchWindowsServer and SearchExchange. Find her work @DataCenterTT or email her at email@example.com.