Enterprise shops look to AIOps for IT root cause analysis

IT ops pros look to AIOps tools to help find the smoking gun in failing IT stacks, but some call automated root cause analysis more dream than reality.

Enterprises have automated IT root cause analysis in their sights as AIOps hype reaches a fever pitch, but some IT vendors are reluctant to jump on the AI bandwagon.

Automated IT root cause analysis is central to many AIOps tools, a category of operations tools that incorporate AI to improve their ability to monitor and manage IT deployments. New Relic's error profiles feature, for example, is meant to narrow down the cause of glitches and speed up IT incident response. Enterprise IT ops pros imagine a day when that response is automated through application development tools and IT service ticket systems. AIOps tool vendors even tout a proactive, rather than reactive, IT monitoring approach, which identifies the root cause of potential problems and stops them before they reach the troubleshooting stage.

But despite customers' demand for such features, some IT monitoring vendors won't adopt AIOps.

LogicMonitor, for example, touts its monitoring tools' ability to pinpoint the causes of errors in the IT stack, but its founder and chief evangelist, Steve Francis, balks at the suggestion that machine learning can automate IT incident response.

"People tend to define machine learning as anything they don't understand," he said. "Anything we do understand is just statistics. Everyone's saying we need to be talking about this, but I don't see the value of it yet."

LogicMonitor customers beg to differ.

"They need to think a little beyond red light/green light and [rather about] how to build innovation around root cause analysis," said Miten Marvania, founder and COO at Agio, a managed IT and cybersecurity firm in Norman, Okla.

Marvania wants LogicMonitor's tool to understand the real effect of events in the IT stack. For example, if a load balancer has five servers attached and one fails, he wants LogicMonitor to know that four out of those five servers must fail before it's a critical event.

"Right now, it's a binary approach where, if a server is down, it's critical," Marvania said. "They need to assess the real impact of that."

AIOps fans flames of IT root cause analysis debate

The most advanced DevOps shops no longer want IT root cause analysis to be the focus of IT incident response anyway, as infrastructures become ephemeral and applications distributed. But in more traditional enterprise IT shops, automatic root cause analysis of ongoing problems is still the holy grail of AIOps.

"If LogicMonitor had a way to understand the least common denominator of problems, then it could directly tell us: 'It's this network switch that's acting up' or 'These database errors seem to be at the end of the chain of dependencies,'" said Andy Domeier, director of technology operations at SPS Commerce, a communications network for supply chain and logistics businesses based in Minneapolis. "It gives an engineer a lot more context about how to approach that problem."

LogicMonitor's Francis is not convinced, however, that the approaches Marvania and Domeier suggest are viable. Users can configure LogicMonitor with thresholds that indicate how many load-balanced servers can fail without a critical alert being triggered, for example, but he doubts the discovery of such configurations could be automated out of the box.
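As a rough illustration of the kind of human-supplied threshold described here, the sketch below maps the number of failed servers in a load-balanced pool to an alert severity. It is not LogicMonitor's configuration format or API; `PoolStatus`, `pool_severity` and the `critical_failures` parameter are hypothetical names chosen for this example, and the value of `critical_failures` is assumed to come from someone who knows the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStatus:
    total: int   # servers attached to the load balancer
    failed: int  # servers currently failing health checks


def pool_severity(status: PoolStatus, critical_failures: int) -> str:
    """Map a pool's failed-server count to an alert severity.

    critical_failures is a human-chosen threshold (an assumption of this
    sketch): the number of failed servers at which the pool is critical.
    """
    if not 0 <= status.failed <= status.total:
        raise ValueError("failed must be between 0 and total")
    if status.failed >= critical_failures:
        return "critical"
    if status.failed > 0:
        return "warning"
    return "ok"


# Five servers behind the load balancer; an operator decides that
# four failures is the point at which the event becomes critical.
print(pool_severity(PoolStatus(total=5, failed=1), critical_failures=4))  # warning
print(pool_severity(PoolStatus(total=5, failed=4), critical_failures=4))  # critical
```

The logic itself is trivial; the hard part, as the surrounding discussion suggests, is knowing what number to pass in for a given application.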

"That's not stuff we can know absent human knowledge about their application," Francis said. "That's human knowledge and configuration."

As for Domeier's common denominator idea, Francis said LogicMonitor plans to help customers narrow down likely root cause culprits by correlating alerts, but he is skeptical such correlations can be reliably precise.
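One simple form of alert correlation, sketched here under stated assumptions, is to take a known dependency map and keep only the alerting components that do not depend, directly or indirectly, on another alerting component. The function name `likely_root_causes` and the shape of the `depends_on` map are hypothetical, not any vendor's API, and the result is a shortlist of candidates rather than a verdict.

```python
def likely_root_causes(
    depends_on: dict[str, set[str]], alerting: set[str]
) -> set[str]:
    """Return alerting components with no alerting dependency beneath them.

    depends_on maps a component to the components it relies on. The map is
    assumed to be supplied and accurate, which is a strong assumption in
    environments that change constantly.
    """
    candidates = set()
    for component in alerting:
        seen: set[str] = set()
        stack = list(depends_on.get(component, ()))
        has_alerting_dependency = False
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alerting and dep != component:
                has_alerting_dependency = True
                break
            stack.extend(depends_on.get(dep, ()))
        if not has_alerting_dependency:
            candidates.add(component)
    return candidates


# web relies on app, app on db, db on a network switch.
topology = {"web": {"app"}, "app": {"db"}, "db": {"switch"}}
print(likely_root_causes(topology, {"web", "app", "db"}))  # {'db'}
```

If the switch is failing but is not monitored, this sketch still points at the database, which is one reason such correlation can shorten a search without settling it.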

"I'm not sure that's a legitimate thing to say anyone will have in the short term," Francis said. "We can shorten the time it takes you to look for your root cause. But I don't think we'll ever be able to say, 'This is it.'"

LogicMonitor could be bluffing about the workability of AIOps for IT root cause analysis as a response to heavy marketing messages from its competitors. But those who've seen AI deployed at scale in IT operations have said that Francis has legitimate concerns.

"The blessing and the curse of these things is that they often demo incredibly well," said Ben Sigelman, senior staff software engineer for Google from 2003 to 2012 and co-founder of infrastructure monitoring startup LightStep. "In a controlled environment where you know what the inputs are, you can show things that are almost magical -- which means someone's going to buy it and then you have the issue of making it work in production."

LightStep specializes in monitoring cloud-native microservices infrastructures, but Sigelman doesn't plan to use the AI buzzword to market LightStep either.

"I would rather we talk about the benefits or features of products outright and, below the fold, it can say it's because of AI, statistics processing or machine learning," Sigelman said. "Just saying, 'This works because it's AI' is really glib."

Could crowdsourced AIOps boost root cause analysis?

Some industry watchers wonder if proactive IT monitoring would be more realistic with a broader set of data collected from multiple enterprise customers of the particular tool. Such proactive data analysis is already in use in manufacturing and refinery facilities, where equipment vendors can analyze streams of data from their machines to proactively identify potential failures.

"We'll see groups of like-minded companies allowing customers to subscribe to aggregated feeds of cleaned-up data," said Brad Shimmin, an analyst at Current Analysis. "AI processing against not just your data but everyone's can help make more accurate predictions."

Francis said this might enable proactive monitoring on broad terms, such as detecting whether a cloud service provider's data center or an internet service provider's network connection is down in a particular region.
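A minimal sketch of that broad-terms case, assuming a vendor could pool reachability reports from many customers: flag a region when enough distinct customers report on it and most of them see failures. The report format, the function name `suspected_regional_outages` and both thresholds are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Iterable


def suspected_regional_outages(
    reports: Iterable[tuple[str, str, bool]],
    min_reporters: int = 5,
    failure_ratio: float = 0.8,
) -> set[str]:
    """Flag regions where most reporting customers cannot reach the provider.

    Each report is (customer_id, region, reachable). Both thresholds are
    illustrative defaults, not recommended values.
    """
    reporters: dict[str, set[str]] = defaultdict(set)
    failing: dict[str, set[str]] = defaultdict(set)
    for customer, region, reachable in reports:
        reporters[region].add(customer)
        if not reachable:
            failing[region].add(customer)

    flagged = set()
    for region, customers in reporters.items():
        if len(customers) < min_reporters:
            continue  # too few independent observers to say anything
        if len(failing[region]) / len(customers) >= failure_ratio:
            flagged.add(region)
    return flagged


reports = [(f"cust{i}", "us-east", False) for i in range(5)]
reports += [(f"cust{i}", "eu-west", i != 0) for i in range(5)]
print(suspected_regional_outages(reports))  # {'us-east'}
```

Requiring a minimum number of distinct reporters keeps a single customer's local network fault from being read as a provider outage.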

"That is a solvable case of root cause analysis because you have enough data from enough data points, and it's a relatively simple problem," he said.

The word "relatively" is operative there, Francis added.

"Having been a network guy, I know you can have trace routes that work perfectly well for one device that's going over the same network and another one that totally fails" because of EtherChannel routing behind the scenes, he said. "So even that is not going to be a perfect use case."

That's to say nothing of the obvious potential security and compliance snags of aggregate data for automated IT root cause analysis. New Relic, for example, has held off on such a service because of customer worries about sharing IT monitoring data with other companies.

Beth Pariseau is senior news writer for TechTarget's Data Center and Virtualization Media Group. Follow her at @PariseauTT on Twitter.

How do you think root cause analysis in IT can be improved?
The elephant in the room in any automated root-cause discussion is not just the intrinsic quality of the raw time-series data itself, which is a big enough challenge all on its own.

The reality is that in anything other than the most simplistic - and static - environments, there is a less than 100% understanding of the *relationships* amongst and between the various elements/objects under management. You don't KNOW which transaction (that might be slow or failing) is running in which container/VM, let alone which host... not to mention which data store, which LUN... and over which network elements. And they are all changing, all the time.

(This is, btw, why traditional ITIL CMDB-driven/focused efforts are doomed to fail in this regard. While they were deterministic in identifying "all the players and the relationships," they became cumbersome obstacles to dev/ops and agile delivery to market.)

What is required is an automated, real-time, deterministic understanding - continuously over time - of how things relate, in the context of the problem you are trying to proactively manage, so that analytics can actually produce a result that can be confidently automated against.

Otherwise, you run the risk of simply "screwing up at light speed" with fully automated responses driven by machine intelligence. All the statistics in the world are useless if you automatically make a major configuration decision based on probability instead of dead-on certainty.

There is plenty of hype around AI-enabled products of late, but the study of reliability engineering and RCA goes back to the 1950s during the early days of the Bell System. There is a large body of research on using Weibull analysis and Markov models to predict system failures so operators can preemptively take corrective actions.

Machine learning has come a long way in recent years, and I believe that if we are able to use a combination of deep learning, neural networks, correlation and behavioral analysis, many, if not all, of today's ITOps issues can be pinpointed by AI. IT systems have grown to such a size that human analysis is becoming impossible, and many times the solution was just "reboot and cross your fingers"... There are tools today that can aid IT admins by analyzing raw infrastructure data to give them the right information, instead of trying to make IT admins become data scientists, which is a totally different realm altogether. I may be speaking with a bit of bias, but I've seen our company's #SIOS iQ machine learning IT analytics learn and pinpoint issues in customers' environments that would not have been easily found by IT admins through alerts and logs.

Maybe it is just the way we envision AI and its application that creates the snags. It is challenging for even the most experienced ops teams to discover root causes in live production systems. There isn't much wiggle room for experimental discovery. What if teams could gain this understanding with at-scale production systems before they were live? Make configuration changes, change activity, even whack components and truly understand the complex relationships they were about to be in charge of managing? In this scenario, the AI application is in creating and coordinating the realistic activity around all of the system interfaces and edges, in addition to assisting with analysis of the performance metrics and alarms. There are some high-value software and even embedded system developments in applications like smart home automation, in-vehicle systems and OTT addressable advertising that have discovered this magic and have been using it to deliver flawlessly and fast. There are still a few humans in the loop, but screwing up at light speed is done in the spirit of investigation because no one is being harmed.