Help! An IT system has gone down and it's affecting the business. What happened, and what needs to be done to get everything back on steady footing?
This sort of predicament occurs all too often in modern businesses that have built commercial capabilities on top of a technical platform. IT incident response cannot be left to purely reactive processes -- which are about as organized as headless chickens running around -- and instead requires a well-thought-out management and resolution system.
IT incident management and resolution is core to how an organization upholds systems availability and uptime on its technical platform.
Under the ITIL service management framework, IT incident management is described as a defined process for logging, recording and resolving incidents. The aim is to restore service to customers as quickly as possible, often starting with a workaround or temporary fix rather than a permanent solution.
Fast resolution is laudable -- but just how does an IT department ensure that this happens on a hybrid mix of physical, virtual and cloud environments, with all the complexities that come along with heterogeneous IT?
IT incident types
ITIL differentiates incidents from problems: An incident is an occurrence that tends to affect the user and occurs in isolation; a problem can be the repetition of an incident or the identification of issues in the IT infrastructure before an incident occurs. Tracking incidents and applying pattern matching algorithms to the records helps identify and deal with problems. Let's concentrate on the one-off incidents that generate a response from the IT organization.
Incidents fall into hard, soft and software categories:
- A hard incident is the failure of a physical asset within the IT platform, such as a server, a network link or a storage array, or a component within any of these.
- A soft incident occurs due to a failure within a virtual construct within the IT platform, such as a virtual server, storage volume or network link.
- Software incidents are faults within the software caused by coding errors or the corruption of data upon which the application depends.
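The three categories can be sketched as a simple lookup. This is only an illustration: the asset-type labels and the `categorize` function are hypothetical names, and a real tool would draw asset types from its own inventory rather than from hard-coded sets.

```python
from enum import Enum


class IncidentCategory(Enum):
    HARD = "hard"          # failure of a physical asset
    SOFT = "soft"          # failure of a virtual construct
    SOFTWARE = "software"  # coding error or corrupted data


# Hypothetical asset-type labels, chosen for this sketch only.
PHYSICAL_ASSETS = {"server", "network_link", "storage_array"}
VIRTUAL_ASSETS = {"virtual_server", "storage_volume", "virtual_network_link"}


def categorize(asset_type: str) -> IncidentCategory:
    """Map the failed asset's type to one of the three incident categories."""
    if asset_type in PHYSICAL_ASSETS:
        return IncidentCategory.HARD
    if asset_type in VIRTUAL_ASSETS:
        return IncidentCategory.SOFT
    # Anything else is treated as an application-level fault in this sketch.
    return IncidentCategory.SOFTWARE
```

For example, `categorize("storage_array")` yields the hard category, while `categorize("virtual_server")` yields the soft one.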
The IT incident management process
The first aspect of any IT incident management approach has to be root cause analysis: What exactly is causing the incident in the first place? Therefore, the first focus of management tooling has to be to uncover whether the incident comes down to a hard, soft or software issue.
The second focus must be to remediate or circumvent the issue as rapidly as possible, so as to minimize the damage caused by the incident. Full remediation is the best result of an IT incident response. Returning a system to its previous state with no loss of performance or data counts as business continuity, but is not always possible. A complete fix may take time to put in place. Partial remediation, where there may be a slight negative effect on user experience, or a known amount of data is lost, should be the minimum goal.
The ultimate safety blanket -- disaster recovery -- should only be used in a complete disaster. Disaster recovery always results in a loss of capability for a period of time, and a marked loss of data.
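The order of preference described above (full remediation first, partial remediation as the minimum goal, disaster recovery only as a last resort) can be expressed as a small decision function. This is a sketch under simplifying assumptions: the two boolean inputs and the names `Response` and `choose_response` are invented for illustration, and a real tool would derive them from its own diagnostics.

```python
from enum import IntEnum


class Response(IntEnum):
    # Lower values are preferred outcomes.
    FULL_REMEDIATION = 1
    PARTIAL_REMEDIATION = 2
    DISASTER_RECOVERY = 3


def choose_response(full_fix_available: bool, partial_fix_available: bool) -> Response:
    """Pick the least disruptive response that is currently possible."""
    if full_fix_available:
        return Response.FULL_REMEDIATION
    if partial_fix_available:
        return Response.PARTIAL_REMEDIATION
    # Nothing else works: fall back to disaster recovery.
    return Response.DISASTER_RECOVERY
```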
Tools should also ensure that incidents do not become problems, meaning that any eventual fix is long term and prevents the incident from recurring. If the appropriate IT incident response first requires a tactical workaround to get customers working again, then a longer process should follow to identify and implement the long-term fix.
Leave a trail
In the event of an IT audit, incident management (IM) tools can prove useful. For example, records from the IM tools help to prove what was done and when, how incidents were dealt with and what steps were taken to stop them from becoming problems. An audited company, whether it is being held to internal standards, ISO 9001 or regulatory compliance requirements, might be required to have IT incident management tooling in place.
Many service desk systems, such as BMC Remedy IT Service Management Suite, Vivantio Pro and Zendesk, embed IT incident management tools, but some only oversee the process of IT incident management and do not provide the actual capabilities to carry out full remediation.
Other tools integrate fully into service desk systems. They provide the functionality for IT asset management, root cause analysis and remediation, and use the service desk system to raise trouble tickets and keep administrators informed of what is happening. IT management vendors, such as ManageEngine, BMC Software, SolarWinds, ServiceNow and Cherwell Software, provide full incident resolution capabilities rather than trouble tickets alone.
The tools that you choose to mount an effective IT incident response must have the capability to:
- Create an understanding of the physical architecture of the IT platform under management;
- Create an understanding of the virtual architecture of the IT platform under management, including platforms on public clouds;
- Fully understand all dependencies between virtual and physical entities;
- Rapidly pick up that an IT incident has occurred and log it;
- Carry out root cause analysis of the incident and log it;
- Figure out if the incident can be fixed through automated means and alert administrators via tickets if it cannot;
- Either create the means of remediation, or provide sufficient data to a remediation system so that the incident can be fixed;
- Raise a trouble ticket for complete remediation in situations where only partial remediation can be carried out;
- Log all details of what was done and store them in a manner where any repetition of the incident can be identified and details of the resulting problem logged; and
- Provide meaningful and useful reports, based on all logged information, on all incidents found, including steps taken, the results, outstanding tickets and more.
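Several of the capabilities above (dependency awareness, root cause analysis, automated remediation with a ticket as fallback, and logging) can be tied together in a minimal sketch. Everything here is an assumption made for illustration: the class and field names are hypothetical, the dependency map is a flat virtual-to-physical dictionary, and "root cause" is reduced to following failed dependencies downward. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IncidentRecord:
    asset: str       # asset on which the incident was detected
    root_cause: str  # deepest failed asset it depends on
    outcome: str     # "remediated" or "ticket_raised"


@dataclass
class IncidentHandler:
    depends_on: dict[str, str]                     # e.g. virtual server -> physical host
    failed_assets: set[str]                        # assets currently known to be failed
    auto_fixes: dict[str, Callable[[str], bool]]   # asset -> automated fix, returns success
    log: list[IncidentRecord] = field(default_factory=list)
    tickets: list[str] = field(default_factory=list)

    def root_cause(self, asset: str) -> str:
        """Follow dependencies downward while the underlying asset has also failed."""
        cause, seen = asset, {asset}
        while True:
            parent = self.depends_on.get(cause)
            if parent is None or parent not in self.failed_assets or parent in seen:
                return cause
            seen.add(parent)
            cause = parent

    def handle(self, asset: str) -> IncidentRecord:
        cause = self.root_cause(asset)
        fix = self.auto_fixes.get(cause)
        if fix is not None and fix(cause):
            outcome = "remediated"
        else:
            # No automated fix, or it failed: alert administrators via a ticket.
            self.tickets.append(cause)
            outcome = "ticket_raised"
        record = IncidentRecord(asset, cause, outcome)
        self.log.append(record)  # keep a trail for reports and audits
        return record
```

In this sketch, a failed virtual server whose physical host has also failed is traced back to the host, and because no automated fix is registered for hardware, a ticket is raised for manual work.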
Where human intervention is required, for example when a physical system has failed, the IT incident management tool should integrate bidirectionally with the operations tools, such as service desk software, that enable the manual work. Once hardware is replaced or fixed, the IT incident management tool should receive this information to keep its records up to date. Should the same incident occur again, the tool's records will help determine if it is endemic.
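Spotting a repeated incident in the tool's records can be as simple as counting matching signatures. The sketch below assumes each logged incident is reduced to an (asset, root cause) pair and that two occurrences are enough to flag a potential problem; both the signature and the threshold are illustrative choices, not ITIL definitions.

```python
from collections import Counter


def find_recurring(incidents, threshold=2):
    """Return the (asset, root_cause) signatures seen at least `threshold` times."""
    counts = Counter(incidents)
    return {signature for signature, n in counts.items() if n >= threshold}


# Hypothetical incident history for illustration.
history = [
    ("host-a", "disk failure"),
    ("host-b", "power supply failure"),
    ("host-a", "disk failure"),
]
recurring = find_recurring(history)
```

Here `recurring` contains only the `("host-a", "disk failure")` signature, which would be a candidate for logging as a problem.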
Organizations should look at how best to implement such tools to support the flexibility they want from a changing IT platform, ensuring that the tools cover both private and public infrastructures.