Artificial intelligence for IT operations (AIOps) is an umbrella term for the use of big data analytics, machine learning (ML) and other artificial intelligence (AI) technologies to automate the identification and resolution of common information technology (IT) issues. The systems, services and applications in a large enterprise produce immense volumes of log and performance data. AIOps uses this data to monitor assets and gain visibility into dependencies within and outside of IT systems.
An AIOps platform should bring three capabilities to the enterprise:
- Automate routine practices.
Routine practices include user requests as well as non-critical IT system alerts. For example, AIOps can enable a help desk system to process and fulfill a user request to provision a resource automatically. AIOps platforms can also evaluate an alert and determine that it does not require action because the relevant metrics and supporting data available are within normal parameters.
- Recognize serious issues faster and with greater accuracy than humans.
IT professionals might address a known malware event on a noncritical system, but ignore an unusual download or process starting on a critical server because they are not watching for this threat. AIOps addresses this scenario differently. It prioritizes the event on the critical system as a possible attack or infection because the behavior is out of the norm, and it deprioritizes the known malware event, which it can handle by running an antimalware function.
- Streamline the interactions between data center groups and teams.
AIOps provides each functional IT group with relevant data and perspectives. Without AI-enabled operations, teams must share, parse and process information by meeting or manually sending around data. AIOps should learn what analysis and monitoring data to show each group or team from the large pool of resource metrics.
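The first two capabilities above can be sketched as a small triage rule. This is a minimal illustration, not how any particular AIOps product works: the `Alert` fields, the `triage` function and the four outcome labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    host: str
    critical_host: bool     # is the affected system business-critical?
    known_signature: bool   # does the event match a known, already-handled pattern?
    metric_value: float     # the observed metric that raised the alert
    normal_low: float       # lower bound of the normal range
    normal_high: float      # upper bound of the normal range


def triage(alert: Alert) -> str:
    """Return 'auto_remediate', 'ignore', 'escalate' or 'review'."""
    # Known events (for example, recognized malware) get a routine automated fix.
    if alert.known_signature:
        return "auto_remediate"
    # Metrics inside normal parameters need no action.
    if alert.normal_low <= alert.metric_value <= alert.normal_high:
        return "ignore"
    # Unusual behavior on a critical system is treated as a possible attack.
    if alert.critical_host:
        return "escalate"
    # Unusual behavior on a noncritical system is queued for a human to look at.
    return "review"


unusual_on_critical = Alert("db01", True, False, 950.0, 0.0, 100.0)
known_on_noncritical = Alert("kiosk7", False, True, 5.0, 0.0, 100.0)
print(triage(unusual_on_critical))    # escalate
print(triage(known_on_noncritical))   # auto_remediate
```

A real platform would learn the normal range and the notion of "known" from data rather than take them as fixed fields; they are hard-coded here only to keep the rule readable.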
AIOps is generally used in companies that use DevOps or cloud computing and in large, complex enterprises. AIOps aids teams that use a DevOps model by giving development teams additional insight into their IT environment, which in turn gives operations teams more visibility into changes in production. AIOps can also reduce many of the risks involved in hybrid cloud platforms by aiding operators across their IT infrastructure. More broadly, the ability to automate processes, recognize problems earlier and smooth communication between teams can help most large companies with extensive or complicated IT environments.
AIOps uses a conglomeration of various AI strategies, including data output, aggregation, analytics, algorithms, automation and orchestration, machine learning and visualization. Most of these technologies are reasonably well-defined and mature.
AIOps data comes from log files, metrics and monitoring tools, helpdesk ticketing systems and other sources. Big data technologies aggregate and organize all of the systems' output into a useful form. Analytics techniques can interpret the raw information to create new data and metadata. Analytics reduces noise, which is unneeded or spurious data, and also spots trends and patterns that enable the tool to identify and isolate problems, predict capacity demand and handle other events.
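One simple form of noise reduction is collapsing repeated events into a single record with a count. The sketch below assumes events arrive as `(timestamp_seconds, source, message)` tuples and that a repeat within five minutes of the first occurrence is noise; the function name `reduce_noise`, the tuple layout and the window are illustrative assumptions, not part of any specific tool.

```python
def reduce_noise(events, window_seconds=300):
    """Collapse repeated (source, message) events into one record with a count.

    events: iterable of (timestamp_seconds, source, message) tuples.
    A repeat is merged if it occurs within window_seconds of the kept record.
    Returns a list of (timestamp, source, message, count) tuples.
    """
    kept = []          # records that survive, as mutable lists
    last_index = {}    # (source, message) -> index of its latest record in kept
    for ts, source, message in sorted(events):
        key = (source, message)
        idx = last_index.get(key)
        if idx is not None and ts - kept[idx][0] <= window_seconds:
            kept[idx][3] += 1                  # duplicate inside the window
        else:
            last_index[key] = len(kept)        # first occurrence or window expired
            kept.append([ts, source, message, 1])
    return [tuple(record) for record in kept]


raw = [
    (0, "web1", "disk full"),
    (60, "web1", "disk full"),
    (120, "db1", "timeout"),
    (1000, "web1", "disk full"),
]
for record in reduce_noise(raw):
    print(record)
```

Here the two "disk full" events one minute apart become one record with a count of 2, while the occurrence at 1,000 seconds starts a new record because the window has expired.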
Analytics also requires algorithms to codify the organization's IT expertise, business policies and goals. Algorithms allow an AIOps platform to deliver the most desirable actions or outcomes. They are how IT personnel prioritize security-related events and teach application performance decisions to the platform. The algorithms form the foundation for machine learning, wherein the platform establishes a baseline of normal behaviors and activities, and can then evolve or create new algorithms as data from the environment changes over time.
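The idea of a learned baseline can be illustrated with a basic statistical check. The sketch below flags a value as anomalous when it lies more than a chosen number of standard deviations from the mean of the values seen so far. This z-score test is a stand-in for the far richer models a real platform might use, and the class name `Baseline`, the threshold of 3.0 and the minimum sample count are assumptions made for the example.

```python
from statistics import mean, stdev


class Baseline:
    """Running baseline of a metric; names and defaults are illustrative."""

    def __init__(self, threshold=3.0, min_samples=5):
        self.threshold = threshold      # allowed deviation, in standard deviations
        self.min_samples = min_samples  # observations needed before judging
        self.samples = []

    def observe(self, value):
        """Return True if value is anomalous against the baseline so far."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        # Only normal values update the baseline, so one spike does not
        # immediately redefine what "normal" means.
        if not anomalous:
            self.samples.append(value)
        return anomalous


cpu = Baseline()
for reading in [50, 52, 49, 51, 50, 48, 51]:
    cpu.observe(reading)          # builds the baseline; none are flagged
print(cpu.observe(95))            # True: far outside the learned range
print(cpu.observe(50))            # False: back to normal
```

Excluding flagged values from the baseline is a design choice: it keeps the baseline stable, at the cost of adapting slowly if the environment genuinely shifts to a new normal.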
Automation is the key underlying technology that lets AIOps tools take action. Automated functions are triggered by the outcomes of analytics and machine learning. For example, when a tool's predictive analytics and ML determine that an application needs more storage, the tool initiates an automated process to add storage in increments consistent with algorithmic rules.
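The storage example can be sketched as a small planning function. It assumes the forecast growth has already been produced by the analytics layer and that the rules are expressed as a fixed increment size, a target utilization and a cap on increments per run; `plan_storage_expansion` and all of its parameters are hypothetical names for illustration.

```python
def plan_storage_expansion(used_gb, capacity_gb, forecast_growth_gb,
                           increment_gb=50, target_utilization=0.8,
                           max_increments=4):
    """Return how many GB to add, as a multiple of increment_gb.

    Storage is added one increment at a time until projected usage fits
    within target_utilization of capacity, or the increment cap is reached.
    """
    projected = used_gb + forecast_growth_gb
    added = 0
    increments = 0
    while (projected > target_utilization * (capacity_gb + added)
           and increments < max_increments):
        added += increment_gb
        increments += 1
    return added


# 700 GB used of 1,000 GB, with 150 GB of growth forecast.
print(plan_storage_expansion(700, 1000, 150))   # 100
```

The cap on increments is one way to keep an automated action bounded: if the forecast is wildly wrong, the tool adds a limited amount and leaves the rest for a human to review.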
Finally, visualization tools deliver human-readable dashboards, reports, graphics and other output so users can follow changes and events in the environment. With these visualizations, humans can take action on information that requires decision-making capabilities beyond those of the AIOps software.
AIOps benefits and drawbacks
When properly implemented and trained, an AIOps platform reduces the time and attention IT staff spend on mundane, routine, everyday alerts. IT staff teach AIOps platforms, which then evolve with the help of algorithms and machine learning, recycling knowledge gained over time to further improve the software's behavior and effectiveness. AIOps tools also perform continuous monitoring without a need for sleep. Humans in the IT department can then focus on serious, complex issues and on initiatives that increase business performance and stability.
AIOps software can observe causal relationships over multiple systems, services and resources, clustering and correlating disparate data sources. Those analytics and machine learning capabilities enable software to perform powerful root cause analysis, which accelerates the troubleshooting and remediation of difficult and unusual issues.
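A toy version of this correlation uses a service dependency map: when several services alert at once, the likeliest root causes are the alerting services that do not depend, directly or indirectly, on any other alerting service. The sketch below assumes such a map is available and acyclic; the function names and the sample topology are invented for the example, and real root cause analysis draws on many more signals.

```python
def transitive_dependencies(depends_on, service):
    """Return every service that `service` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    seen.discard(service)   # guard against a self-reference in the map
    return seen


def root_cause_candidates(depends_on, alerting):
    """Return alerting services with no alerting service beneath them."""
    candidates = set()
    for service in alerting:
        upstream = transitive_dependencies(depends_on, service)
        if not (upstream & alerting):
            candidates.add(service)
    return candidates


# Hypothetical topology: web -> api -> (db, cache), db -> storage.
topology = {"web": ["api"], "api": ["db", "cache"], "db": ["storage"]}
active_alerts = {"web", "api", "storage"}
print(root_cause_candidates(topology, active_alerts))   # {'storage'}
```

The alerts on `web` and `api` are treated as symptoms because both ultimately depend on `storage`, which is also alerting.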
AIOps can improve collaboration and workflow activities between IT groups and between IT and other business units. With tailored reports and dashboards, teams can understand their tasks and requirements quickly, and interface with others without learning everything the other team needs to know.
Although the underlying technologies for AIOps are relatively mature, it is still an early field in terms of combining the technologies for practical use. AIOps is only as good as the data it receives and the algorithms that it is taught. The amount of time and effort needed to implement, maintain and manage an AIOps platform can be substantial. The diversity of available data sources as well as proper data storage, protection and retention are all important factors in AIOps results.
AIOps demands trust in tooling, which can be a gating factor for some businesses. For an AIOps tool to act autonomously, it must follow changes within its target environment accurately, gather and secure data, form correct conclusions based on the available algorithms and machine learning, prioritize actions properly and take the appropriate automated actions to match business priorities and objectives.
Implementing AIOps and AIOps vendors
To demonstrate value and mitigate risk from AIOps deployment, introduce the technology in small, carefully orchestrated phases. Decide on the appropriate hosting model for the tool, such as on-site or as a service. IT staff must understand the system and then train it to suit the organization's needs, and to do so they must have ample data from the systems under its watch.
AIOps is an emerging area, but there is a growing stable of product offerings for businesses to review and evaluate, including but not limited to:
- Splunk's IT Service Intelligence (ITSI) tool.
- BMC's TrueSight platform.
- Cisco's Crosswork Situation Manager, the AIOps part of the Cisco Crosswork Network Automation product family.
- Moogsoft AIOps.
- DRYiCE AIOps from HCL Technologies Ltd.
AIOps features and functionality also appear in existing product suites. Examples include:
- New Relic Applied Intelligence (NRAI), which integrates AI-based Radar and Error Profiles features into the New Relic Digital Intelligence Platform.
- Datapipe's Trebuchet application deployment platform, which relies on AI to improve DevOps processes.