IT incident response ditches root cause analysis process

IT pros can improve monitoring and pare down alerts to get only actionable information, but then what? IT incident response is its own black art, and it's changing fast.

NEW YORK -- IT incident response must change to keep up with DevOps. The root cause analysis process embraced by many enterprise sysadmins should be the first thing to go.

In the world of monolithic legacy applications, root cause analysis -- identifying the specific line of code, switch port or hard drive that set off a domino effect to cause an outage -- is the first step during an IT incident response. But as apps evolve into microservices distributed over a complex network infrastructure, IT pros at the Velocity Conference 2017 here this week said this approach to problem-solving no longer works.

"Focus on mitigation and not root cause" as a crisis begins, said Kristopher Beevers, founder and CEO of hosted DNS provider NS1, in a Velocity presentation. "Identify and troubleshoot the service impact and worry about doing a diagnosis later."

Down with root cause, up with service impact

A key sentiment heard here this week was that IT pros must develop a better sense of the services and functions that are priorities for the business and end users.

If the list of top articles at the top of the Financial Times homepage is broken, for example, that is what matters to Sarah Wells, principal engineer for the financial industry newspaper, based in London.

"We've turned off alerts for service-level errors, status codes and response times," Wells said. "We start [our analysis] at the top of the stack."
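A check that starts at the top of the stack asks whether the thing users see is working, not whether every service behind it is healthy. The snippet below is a minimal sketch of that idea, not the Financial Times' implementation: the function name `top_articles_ok`, the `title` and `url` fields and the threshold of three entries are all assumptions made for illustration.

```python
def top_articles_ok(articles, minimum=3):
    """Return True if the homepage list has enough usable entries.

    `articles` is a list of dicts as a hypothetical homepage API might
    return them; the "title" and "url" keys are assumed for this sketch.
    """
    # An entry counts only if a reader could actually see and follow it.
    valid = [a for a in articles if a.get("title") and a.get("url")]
    return len(valid) >= minimum


# Example: two usable entries and one broken entry fall below the threshold,
# so this is the condition that would raise an alert.
sample = [
    {"title": "Markets open higher", "url": "/content/1"},
    {"title": "Central bank holds rates", "url": "/content/2"},
    {"title": "", "url": "/content/3"},
]
alert_needed = not top_articles_ok(sample)
```

An alert driven by a check like this fires only when readers are affected, which is the kind of signal Wells describes keeping after turning off lower-level alerts.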

At scale, other members of the IT team become customers of microservices offered via APIs, said Mark McBride, founder of Turbine Labs, a decision support analytics software maker in San Francisco. McBride is a former developer and services engineer for Twitter, Nest Labs and Google.

Splitting responsibility for the infrastructure in this way reduces IT pros' anxiety during incident response, McBride said. Individual team members don't worry about everything behind each API that their service communicates with, but they can easily observe the behavior of the API their service calls.

This also gives IT teams a common point of observation and control over systems that span languages and runtimes, and they can apply remediations globally during IT incident response, he said. Moreover, it reduces the importance of a root cause analysis process to find the particular element of the infrastructure that malfunctions behind an API.

The post-postmortem era and the case for incident review

Root cause analysis often discourages IT from finding long-term solutions post-incident and improving troubleshooting processes, said some DevOps engineers here.

"It's common for a human to be blamed as the root cause of a problem" during this process, said Baron Schwartz, co-founder and CEO of VividCortex, a database monitoring SaaS provider. "Then companies end up firing people instead of improving things, and 'cover your [butt]' becomes the priority instead of solving problems."

For many companies, a more effective IT incident review approach remains a work in progress, but they've already begun to move away from the root cause analysis process.

"You often don't arrive at a single thing, and it's more important to understand the context, contributing factors, and how to mitigate and remediate the issue," said Peter Nealon, a solutions architect at Runkeeper, a mobile running app owned by Japanese athletic equipment retailer ASICS.

Some IT pros consider the term postmortem passé as a name for the final retrospective phase, and they favor the term incident review.

"We're learning how to improve versus figuring out what died and why," Nealon said. "We aren't aiming for root cause analysis, but to reduce our mean time to remediation."
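Mean time to remediation is the average interval between detecting an incident and restoring service. The sketch below shows one way to compute it. The incident timestamps are invented for illustration, and the function name `mean_time_to_remediation` is an assumption, not something Runkeeper described.

```python
from datetime import datetime, timedelta


def mean_time_to_remediation(incidents):
    """Average (remediated_at - detected_at) over a list of incidents.

    `incidents` is a list of (detected_at, remediated_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual durations, starting from a zero timedelta.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2017, 6, 1, 9, 0), datetime(2017, 6, 1, 9, 30)),
    (datetime(2017, 6, 8, 14, 0), datetime(2017, 6, 8, 15, 30)),
]
mttr = mean_time_to_remediation(incidents)  # timedelta of 1 hour
```

Tracking this number across incident reviews shows whether mitigation is getting faster, whether or not any single root cause was ever identified.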

Nealon's team will soon institute an IT incident response process that uses an incident commander to direct troubleshooting, he said. Ideally, developers on the application team will handle troubleshooting tasks, and Runkeeper's site reliability engineers will serve as subject-matter experts on the underlying platform.

"Our assumption will be that it's most likely an app issue and not an infrastructure or platform issue," Nealon said.

Beth Pariseau is senior news writer for TechTarget's Data Center and Virtualization Media Group. Follow @PariseauTT on Twitter.
