

Build an incident response runbook based on these 3 components

The secret to a good runbook is balancing the effort to create and maintain it against the benefit it delivers to IT staff. Application layout and reboot procedures are ideal starting points.

Runbooks are collections of procedures and information that guide IT ops staff as they resolve issues. These documents can cover anything from troubleshooting processes to interconnections, as well as how to restart complex applications.

While a runbook is not itself an ITIL practice, it can follow the same guidelines and triggers that a service catalog puts in place, and it aligns with many of ITIL's guidelines. The runbook's purpose is to keep IT fixes and processes consistent, even when the staff changes. This consistency can improve uptime, reduce staff effort and cut costs. However, nothing in IT is as black and white as it might seem.

To write an effective incident response runbook, begin with a focus on the fundamentals. Then, address rebooting and troubleshooting practices.

The fundamentals

A company must be willing to pay staff and dedicate resources to create and maintain comprehensive documentation. And the argument that those investments will decrease once the runbooks are completed is fundamentally flawed. IT systems are fluid; they are always changing and updating. To keep up, applicable runbooks must also be fluid, and they require resources to remain up to date.

Resources might come from areas outside IT operations. They might include application owners and engineers, which raises staff costs. Determine, for example, whether it's worth having a developer dedicate several hours to a runbook so that an IT operations admin saves a few hours in the future. The answer is tricky, because employee cost differs between IT roles. These considerations might make comprehensive documentation seem difficult to achieve, but there is middle ground.
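The trade-off described above can be roughed out as simple arithmetic. The following is a minimal sketch, not a costing method from this article: the function name, the hourly rates and the expected number of incidents are all hypothetical values chosen for illustration.

```python
def runbook_net_saving(author_hours, author_rate,
                       hours_saved_per_incident, ops_rate,
                       expected_incidents):
    """Return the net saving of writing a runbook entry.

    A positive result suggests the entry pays for itself; a negative
    result suggests the authoring effort outweighs the benefit.
    """
    authoring_cost = author_hours * author_rate
    expected_saving = hours_saved_per_incident * ops_rate * expected_incidents
    return expected_saving - authoring_cost


# Hypothetical figures: a developer at 90 per hour spends 4 hours so that
# an admin at 60 per hour saves 2 hours on each of 5 expected incidents.
net = runbook_net_saving(4, 90.0, 2, 60.0, 5)
print(net)  # 240.0
```

Real estimates would also need to account for ongoing maintenance of the entry, which this sketch leaves out.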

Rather than document everything, start with the processes that are fundamental and rarely change. Adjusting the target is not admitting defeat; it strikes a balance between the end value and the effort put in. A cornerstone of any runbook should be the application or IT environment layout. Staff can't troubleshoot a problem with a report server if they don't know the server's IP address. If IT staff know the incident relates to a report server, but must first determine which server that is, the struggle becomes twofold: first, to identify where the problem is, and second, to fix it.

Document the location of each component of the IT environment and who is responsible for it -- along with that person's contact information, such as a cellphone number. An application can span multiple virtual environments or extend to the public cloud. Staff must know this arrangement so they can focus on issue resolution. Infrastructure details such as server names, functions and IP addresses rarely change, which makes them a solid foundation on which to start an incident response runbook.
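One way to keep this layout information searchable is to store it as structured data rather than free text. The sketch below is an assumption about how such an inventory might look: the component names, owners and phone numbers are invented, and the IP addresses come from the range reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    function: str
    ip_address: str
    location: str
    owner: str
    owner_phone: str


# Hypothetical inventory; every value here is illustrative only.
INVENTORY = [
    Component("rpt-01", "report server", "192.0.2.10",
              "on-premises virtual environment", "A. Example", "555-0100"),
    Component("db-01", "database", "192.0.2.20",
              "on-premises virtual environment", "B. Example", "555-0101"),
    Component("web-01", "web front end", "192.0.2.30",
              "public cloud", "C. Example", "555-0102"),
]


def find_by_function(function):
    """Return every component whose function matches, ignoring case."""
    wanted = function.lower()
    return [c for c in INVENTORY if c.function == wanted]


for component in find_by_function("Report Server"):
    print(component.name, component.ip_address, component.owner_phone)
```

With a lookup like this, staff who know only that "the report server" is failing can get its address and owner in one step.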


Rebooting

Once the runbook details the environment, document the startup and shutdown order for IT systems and the effect each step has. Every system needs to be rebooted at some point. To do so correctly, the IT operations team needs instructions on the correct order of actions and on how each server in the application stack affects the process. Include any issues or events staff might see during the reboot and how to address them.
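If the dependencies between components are recorded, a startup order can be derived from them instead of maintained by hand. The following sketch assumes a hypothetical four-tier stack and uses Python's standard `graphlib` module (Python 3.9 or later); the shutdown order is simply taken as the reverse of startup, which is a simplifying assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each component maps to the components
# that must already be running before it starts.
DEPENDS_ON = {
    "database": [],
    "app server": ["database"],
    "report server": ["database"],
    "web front end": ["app server"],
}


def startup_order(depends_on):
    """Return an order in which every dependency starts first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


print("Start:", startup_order(DEPENDS_ON))
print("Stop: ", shutdown_order(DEPENDS_ON))
```

`TopologicalSorter` raises `graphlib.CycleError` if two components depend on each other, which is itself worth flagging in the runbook.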

Runbook vs. playbook

In some enterprises, the terms runbook and playbook are used interchangeably. In others, they carry subtle differences. But both types of documents ultimately aim to capture key IT practices and processes.


Troubleshooting

Another important part of an incident response runbook is troubleshooting processes and how to handle incidents. Do not try to include every event or ticket type in the runbook: It would be huge, and it would be out of date after the first software update. Be selective and focus on common issues -- those that multiple IT staff members encounter repeatedly. Even if everyone already knows the fix, include it in the runbook so new staff don't have to repeat the research. Also include issues that take considerable time or effort to correct, but only if they might recur. There is no significant benefit in including a one-time issue.
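The selection rule above can be written down as a simple filter over past tickets. This is a sketch only: the `Issue` fields and both thresholds (two occurrences, four hours) are assumptions, since the article gives no specific numbers.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    occurrences: int     # how many times the issue has been seen
    hours_to_fix: float  # typical effort to resolve it
    may_recur: bool      # whether staff expect to see it again


def belongs_in_runbook(issue, min_occurrences=2, costly_hours=4.0):
    """Apply the two inclusion rules: common, or costly and likely to recur."""
    common = issue.occurrences >= min_occurrences
    costly_and_recurring = issue.hours_to_fix >= costly_hours and issue.may_recur
    return common or costly_and_recurring


# Hypothetical ticket history.
issues = [
    Issue("Report service hangs after patching", 6, 0.5, True),
    Issue("Storage migration cutover failure", 1, 12.0, False),
    Issue("Certificate renewal breaks logins", 1, 8.0, True),
]
for issue in issues:
    print(issue.title, "->", belongs_in_runbook(issue))
```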

The troubleshooting section requires more thought and planning than the other two sections described above. Add a table of contents and a thorough index. A runbook is no good if staff can't find what they're looking for.
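If troubleshooting entries are tagged with keywords, an index can be generated rather than typed by hand. The sketch below assumes a hypothetical mapping of entry titles to keywords; the titles and keywords are invented for illustration.

```python
from collections import defaultdict

# Hypothetical runbook entries, each tagged with search keywords.
ENTRIES = {
    "Report server returns HTTP 503": ["report server", "http 503"],
    "Database connection pool exhausted": ["database", "connection pool"],
    "Report generation times out": ["report server", "timeout"],
}


def build_index(entries):
    """Map each keyword to the alphabetized titles that mention it."""
    index = defaultdict(list)
    for title, keywords in entries.items():
        for keyword in keywords:
            index[keyword.lower()].append(title)
    # Sort keywords and titles so the index reads like a printed one.
    return {kw: sorted(titles) for kw, titles in sorted(index.items())}


for keyword, titles in build_index(ENTRIES).items():
    print(f"{keyword}: {'; '.join(titles)}")
```

Regenerating the index whenever an entry changes keeps it from drifting out of date along with the rest of the document.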
