IT incident management is an area of IT service management (ITSM) wherein the IT team returns a service to normal as quickly as possible after a disruption, in a way that aims to create as little negative impact on the business as possible.
An incident is an unexpected event that disrupts the normal operation of an IT service. A problem is an underlying issue that could lead to an incident. Problem management consists of the measures taken to prevent the occurrence of an incident.
IT incident management helps keep an organization prepared for unexpected hardware, software and security failings, and it reduces the duration and severity of disruption from these events. It can follow an established ITSM framework, such as IT infrastructure library (ITIL) or COBIT, or be based on a combination of guidelines and best practices established over time.
IT incident management process
In practice, IT incident management often relies upon temporary workarounds to ensure services are up and running while the staff investigates the incident, identifies its root cause, and develops and rolls out a permanent fix. Specific workflows and processes in IT incident management differ depending on the way each IT organization works and the issue they are addressing.
Most IT incident management workflows begin with users and IT staff pre-emptively addressing potential incidents, such as a network slowdown. IT staff contains the incident to prevent potential issues in other areas of the IT deployment. Then, they find a temporary workaround or implement a fix and recovery of the system and release that system back into the production environment. IT staff then reviews and logs the incident for future reference.
Documentation enables IT staff to find previously unseen and recurring incident trends and address them. If a temporary workaround is in place, once the disruption to end users is mitigated, the staff can develop a long-term fix for the issue.
A focus on IT incident management processes and established best practices will minimize the duration of an incident and shorten recovery time, and it can prevent future issues.
A common framework to understand IT incident management is through analyzing the ITIL process. ITIL, trademarked by Axelos, is a widely used ITSM framework. ITIL incident management uses a workflow for efficient resolution: incident identification, logging, categorization, prioritization, response, diagnosis, escalation, resolution and recovery, and then closure.
Types of incidents
Incidents are generally categorized by low, medium and high priorities. Low-priority incidents do not interrupt end users, who typically can complete work despite the issue. Medium-priority incidents are issues that affect end users, but the disruption is either slight or brief. High-priority incidents, however, are issues that will affect large amounts of end users and prevent the proper functioning of a system.
Incidents are classed as hardware, software or security, although a performance issue can often result from any combination of these areas. Software incidents typically include service availability problems or application bugs. Hardware incidents typically include downed or limited resources, network issues or other system outages. Security incidents encompass attempted and active threats intended to compromise or breach data. Unauthorized access to personally identifiable records is a security issue, for example.
Roles in incident management
IT incident management is normally separated into three levels of support, generally grouped together in the help or service desk. Most organizations use a support system, such as a ticketing system, for categorization and prioritization of incidents. IT staff responds to each incident according to its prioritization level.
Level-one support typically provides basic-level support or assistance, such as password resets or computer troubleshooting. Level-one support involves incident identification, logging, prioritization and categorization, deciding to escalate to level-two support and incident resolution when appropriate. Level-two support goes through a similar process for more complex issues that need more training, skill or security access to complete. Major incidents are given to level-three support. This category includes incidents that disrupt a business's operation, marked as a high priority, and require an immediate response. Such an example would be an issue with a network that requires an expert or a skilled team to solve.
Level-one support involves technical staff that is trained to solve common incidents and fulfill basic service requests. Level-two support includes IT staff with specific knowledge of the system in question. Level-three support team members are generally specialists in the subject matter of the incident. For example, a level-three support team could include the chief architect and engineers who work on the product or service's daily operation and maintenance.
An incident manager enforces the proper incident response and management processes across the IT support and service delivery team or teams. This person can be involved in the organization's choice of ITSM framework. They should work to improve how incidents are prevented and handled at the company over time, through risk mitigation and ongoing process improvements. The incident manager is likely to act as a communication bridge between end users and technical specialists during disruptions, such as an email outage. The person produces, along with the service desk staff, incident reports for critical business and IT services, and they might lead a post-mortem on major incidents. They also maintain a knowledgebase of problems and incidents.
In DevOps organizations, software developers are considered responsible for production-ready code, under a mantra of, "You build it, you own it." In the event of a software incident, the developer is expected to participate in or lead incident management.
Incident management tools
Help desk and incident management teams rely on a mix of tools to resolve incidents, including monitoring tools to gather operations data, root cause analysis systems, incident management and automation platforms, and other support products.
Monitoring tools enable an IT staff to pull operations data from across multiple systems, such as on-premises or cloud-based hardware and software. Root cause analysis tools help sort through operational data, such as logs, which were collected by systems management, application performance monitoring and infrastructure monitoring tools. Root cause analysis tools help in understanding how a system operates and where any incidents reside.
Incident response tools correlate that monitoring data and facilitate response to events, typically with a sophisticated escalation path and method to document the response process. PagerDuty, VictorOps and xMatters are examples of incident management tools. PagerDuty establishes escalation policies, as well as creates automated workflows and alerts users of incidents based on preconfigured parameters.
ITSM service desk tools log data such as what the incident was, its cause and what steps were taken to solve the incident. ServiceNow and Zendesk are two major vendors in this space. ServiceNow Incident Management is a root cause analysis and auditing tool that can both log and prioritize IT incidents. ServiceNow can prioritize incident events through a self-service portal, email, incoming events and more. It logs incidents by the instance, classifies them by level of impact and urgency, escalates as required and performs analysis for future improvements.