Automated infrastructure management took a step forward with the emergence of AIOps monitoring tools that use machine learning to proactively identify infrastructure problems.
IT monitoring tools released in the last two months by New Relic, BMC and Splunk incorporate AI features, mainly machine learning algorithms, to correlate events in the IT infrastructure with problems in application and business environments. Enterprise IT ops pros have begun to use these tools to address problems before they arise.
New Relic's machine learning features, codenamed Seymour at its beta launch in 2016, helped the mobile application development team at Scripps Networks Interactive in Knoxville, Tenn., identify misconfigured Ruby application dependencies and head off potential app performance issues.
"Just doing those simple updates allowed them to fix some errors they hadn't realized were there," said Mark Kelly, director of cloud and infrastructure architecture at Scripps, which owns web and TV channels, such as Food Network and HGTV, that are viewed by an average of 50 million to 70 million people per day.
Seymour is now generally available in New Relic's Radar and Error Profiles features, which add a layer of analytics over the performance data collected by New Relic's application performance management tools to help users hone their reactions. Radar uses algorithms similar to those in e-commerce product recommendation engines to tailor dashboards to individual users' automated infrastructure management needs. The Error Profiles feature narrows down the possible causes of IT infrastructure errors, so an engineer can scan a prioritized list of the most unusual behaviors to identify a problem's root cause.
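One simple way to produce a "prioritized list of the most unusual behaviors" is to compare how often each attribute value appears in failing transactions versus healthy ones. The sketch below illustrates that general idea only; it is not New Relic's algorithm, and the function name, attribute names and sample data are hypothetical.

```python
from collections import Counter


def rank_unusual_attributes(error_events, normal_events):
    """Rank (attribute, value) pairs by how over-represented they are in errors.

    Illustrative heuristic only: score = share of error events carrying the
    pair minus share of normal events carrying the same pair.
    """
    def shares(events):
        counts = Counter((key, value) for event in events for key, value in event.items())
        total = len(events) or 1  # avoid dividing by zero on an empty sample
        return {pair: n / total for pair, n in counts.items()}

    error_share = shares(error_events)
    normal_share = shares(normal_events)
    scored = [
        (pair, share - normal_share.get(pair, 0.0))
        for pair, share in error_share.items()
    ]
    # Most over-represented pairs first, so an engineer reads from the top.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: an outdated dependency shows up in every error.
errors = [
    {"host": "web-2", "gem_version": "old"},
    {"host": "web-2", "gem_version": "old"},
    {"host": "web-1", "gem_version": "old"},
]
normal = [
    {"host": "web-1", "gem_version": "new"},
    {"host": "web-2", "gem_version": "new"},
    {"host": "web-1", "gem_version": "new"},
    {"host": "web-2", "gem_version": "old"},
]

ranking = rank_unusual_attributes(errors, normal)
for (attribute, value), score in ranking:
    print(f"{attribute}={value}: {score:+.2f}")
```

In this made-up sample, the outdated dependency ranks first because it appears in every error but in only a quarter of healthy transactions.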
"Before Radar, [troubleshooting] required some manual digging -- now it's automatically identifying problem areas we might want to look for," Kelly said. "It takes some of that searching for the needle in the haystack out of the equation for us."
Data correlation stems IT ops ticket tsunami
IT monitoring tools from BMC and Splunk also expanded their AIOps features this month. BMC's TrueSight 11 IT monitoring and management platform will use new algorithms within the TrueSight Intelligence SaaS product to categorize service tickets so IT ops pros can more quickly resolve incidents, as well as assess the financial impact of bugs in application code. Event stream analytics in TrueSight Intelligence can predict IT service deterioration, and a separately licensed TrueSight Cloud Cost Control product forecasts infrastructure costs to optimize workload placement in hybrid clouds.
Park Place Technologies, an after-warranty server management company in Cleveland, Ohio, and a BMC partner, plans to fold TrueSight Intelligence analytics into a product that forewarns customers of equipment outages.
"We have ways to filter email alerts sent by equipment based on subject lines, but TrueSight does it faster, and can pull out strings of data from the emails as well," said Chris Adams, president and COO of Park Place. "We want to be able to call the customer and say, 'Three disk drives are going to fail, and here's why.'"
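The two steps Adams describes, filtering equipment emails by subject line and then pulling specific strings out of the message body, can be approximated with regular expressions. This is a rough sketch under assumed conditions, not Park Place's or BMC's implementation; the message format, field names and patterns are all hypothetical.

```python
import re

# Hypothetical patterns: real equipment emails vary widely by vendor.
SUBJECT_PATTERN = re.compile(r"\b(alert|warning|failure)\b", re.IGNORECASE)
DRIVE_PATTERN = re.compile(
    r"drive\s+(?P<slot>\d+)\s+status:\s*(?P<status>\w+)", re.IGNORECASE
)


def find_unhealthy_drives(emails):
    """Return (device, slot, status) for drives reported as anything but OK."""
    findings = []
    for email in emails:
        # Step 1: filter on the subject line.
        if not SUBJECT_PATTERN.search(email["subject"]):
            continue
        # Step 2: pull structured strings out of the body.
        for match in DRIVE_PATTERN.finditer(email["body"]):
            if match.group("status").lower() != "ok":
                findings.append(
                    (email["device"], int(match.group("slot")), match.group("status"))
                )
    return findings


# Made-up sample messages for illustration.
emails = [
    {
        "device": "srv-01",
        "subject": "ALERT: SMART threshold exceeded",
        "body": "Drive 3 status: degraded\nDrive 4 status: OK",
    },
    {
        "device": "srv-02",
        "subject": "Weekly summary",
        "body": "Drive 1 status: degraded",
    },
]

findings = find_unhealthy_drives(emails)
print(findings)
```

A subject-only filter would miss the degraded drive in the second message, which is one reason body-level analysis of the kind Adams mentions is attractive.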
Version 3.0 of Splunk's IT Service Intelligence (ITSI) tool also correlates event data to pass along critical alerts to IT admins so they can more easily process Splunk log and monitoring data. ITSI 3.0 root cause analysis features predict the outcome of infrastructure changes, more quickly identify problem servers, and integrate with ITSM tools such as ServiceNow and PagerDuty -- which offer their own machine learning features to further prune the flow of IT alerts.
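Correlating raw events so that only critical alerts reach admins can be pictured as grouping events per service into episodes and forwarding an episode only if it escalates. The following is a minimal sketch of that general pattern, not Splunk's algorithm; the event fields, severity scale and five-minute window are assumptions made for illustration.

```python
from itertools import groupby

# Assumed severity scale, lowest to highest.
SEVERITY = {"info": 0, "warning": 1, "critical": 2}


def correlate(events, window_seconds=300):
    """Group events per service into episodes and return the critical ones.

    A new episode starts when the gap since the service's previous event
    exceeds window_seconds. Each episode keeps its highest severity seen.
    """
    episodes = []
    ordered = sorted(events, key=lambda e: (e["service"], e["ts"]))
    for service, group in groupby(ordered, key=lambda e: e["service"]):
        current = None
        for event in group:
            if current is None or event["ts"] - current["last_ts"] > window_seconds:
                current = {
                    "service": service,
                    "first_ts": event["ts"],
                    "last_ts": event["ts"],
                    "severity": event["severity"],
                    "count": 0,
                }
                episodes.append(current)
            current["last_ts"] = event["ts"]
            current["count"] += 1
            if SEVERITY[event["severity"]] > SEVERITY[current["severity"]]:
                current["severity"] = event["severity"]
    # Only escalated episodes are passed along as alerts.
    return [ep for ep in episodes if ep["severity"] == "critical"]


# Made-up event stream: five raw events collapse into one alert.
events = [
    {"service": "checkout", "ts": 0, "severity": "warning"},
    {"service": "search", "ts": 30, "severity": "info"},
    {"service": "checkout", "ts": 60, "severity": "critical"},
    {"service": "checkout", "ts": 120, "severity": "critical"},
    {"service": "checkout", "ts": 1000, "severity": "warning"},
]

alerts = correlate(events)
print(alerts)
```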
Automated infrastructure management takes shape with AIOps
Eventually, IT pros hope that AIOps monitoring tools will go beyond dashboards and take automated infrastructure management action, both by making proactive changes that head off infrastructure problems and by opening application pull requests that address code problems through the DevOps pipeline.
"The Radar platform has that potential, especially if it can start integrating into our pipeline and help change events before they happen," Kelly said. "I want it to help me do some of those automated tasks, detect my stacks going bad in advance, and give me some of that proactive feedback before I have a problem."
Such products are already on the way. Cisco previewed a feature at its AppDynamics Summit recently that displays a forecast of events along a timeline, and highlights transactions that will be critically impacted by problems as they develop. The still-unnamed tool presents theories about the causes of future problems along with recommended actions for remediation. In the product demo, the user interface presented an "execute" button for recommended remediation, along with a button to choose "other actions."
Cisco plans to eventually integrate AppDynamics with technology from recently acquired Perspica, which will perform machine learning analysis on streams of infrastructure data at wire speed.
For now, AppDynamics customers said they're interested in ways such AIOps features can improve business outcomes. But the tools must still prove themselves valuable beyond what humans can forecast based on their own experience.
"It's not going to replace a good analyst at this point -- that's what the analyst does for us, says how a change is going to affect the business," said Krishna Dammavalam, SRE for Idexx Labs, a veterinary testing and lab equipment company in Westbrook, Maine. "If machine learning's predictions are better than the analyst's, that's where the product will have value, but if the analyst is still better, there will still be room to grow."