Monitoring thresholds determine IT performance alerts

An IT monitoring strategy depends on the applications and systems it governs. Static and dynamic thresholds each have benefits and drawbacks, but it's possible to find a balance.

Alastair Cooke

Published: 16 Apr 2018

IT monitoring is a complex field with several approaches to manage monitoring and alerts. Most systems work with monitoring thresholds and notify IT operations staff when resource utilization breaches them. The question is how to set monitoring thresholds for the best results.

Some IT monitoring tools use static thresholds that are manually adjusted, while others use a learning system to set thresholds specific to the given environment. Both methods have a common objective: inform the IT operations team when there is an issue and point to a cause -- ideally, before users notice any effects. Both static and dynamic monitoring thresholds have advantages and disadvantages.

Static monitoring thresholds

Static thresholds are fixed values that represent the limits of acceptable performance. For example, sustained CPU utilization above 90% is usually treated as a problem, regardless of when it happens or on which server. For other performance counters, it is less obvious what is acceptable and what is dangerous. Monitoring products come with default thresholds for each performance counter that the IT team can adjust. Not all IT workloads benefit from the same monitoring thresholds. A bank's IT team needs to know about CPU utilization that goes above 60% for a few minutes, for example, while a manufacturer might not.

Tuning static monitoring thresholds is a major challenge for IT teams. In practice, the effort involved limits how many distinct thresholds a team can maintain, so the same thresholds usually end up applied to every VM, even though those VMs serve markedly different business applications. For example, a reporting server can be healthy at 90% CPU utilization, while a web server at the same utilization rate requires IT support. Applications with different requirements like these need further manual tuning to override the standard threshold. Until that tuning is complete, the monitoring tool may miss real issues, over- or underreport the severity of an issue, or report issues where none exist.
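As a rough illustration of default thresholds with per-workload overrides, here is a minimal Python sketch. The role names, counter names and percentage values are all hypothetical and chosen only to mirror the reporting-server and web-server example; real monitoring products expose this through their own configuration.

```python
# Hypothetical default thresholds (percent), like those a product ships with.
DEFAULT_THRESHOLDS = {"cpu": 90.0, "memory": 85.0}

# Hypothetical per-role overrides; each entry is a piece of manual tuning.
ROLE_OVERRIDES = {
    "reporting": {"cpu": 98.0},  # batch reporting is expected to run hot
    "web": {"cpu": 75.0},        # user-facing, so alert earlier
}


def threshold_for(role: str, counter: str) -> float:
    """Return the role's override if one exists, otherwise the default."""
    return ROLE_OVERRIDES.get(role, {}).get(counter, DEFAULT_THRESHOLDS[counter])


def breaches(role: str, counter: str, value: float) -> bool:
    """True when the sampled value exceeds the static threshold."""
    return value > threshold_for(role, counter)
```

With these assumed values, 90% CPU on a "web" server breaches its threshold while the same reading on a "reporting" server does not. Every override is another value someone has to choose and maintain, which is the tuning burden described above.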

Static thresholds do not allow for cyclic variation. It is common in IT environments for CPU utilization to hit 95% for two hours overnight while the backup runs, and only during that window. Some tools enable users to set in-hours and after-hours thresholds separately. However, IT infrastructure can also experience normal weekly and monthly variations in load. Static thresholds do not respond to these cyclic workloads, so avoiding false positives and missed issues requires a lot of manual work.
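A separate in-hours and after-hours threshold might look like the following sketch. The 01:00 to 03:00 backup window and both limits are assumptions made for illustration, not values from any particular tool.

```python
from datetime import datetime, time

# Hypothetical backup window and CPU limits (percent).
BACKUP_START = time(1, 0)
BACKUP_END = time(3, 0)
IN_HOURS_CPU_LIMIT = 90.0
BACKUP_CPU_LIMIT = 99.0


def cpu_limit_at(ts: datetime) -> float:
    """Pick the static limit that applies at this time of day."""
    if BACKUP_START <= ts.time() < BACKUP_END:
        return BACKUP_CPU_LIMIT
    return IN_HOURS_CPU_LIMIT


def cpu_alert(ts: datetime, cpu_percent: float) -> bool:
    """True when CPU utilization exceeds the limit for that time of day."""
    return cpu_percent > cpu_limit_at(ts)
```

This handles the nightly backup, but a weekly or monthly cycle would need yet more hand-written windows, which is where the approach stops scaling.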

Dynamic, learning monitoring thresholds

Intelligent IT monitoring tools learn what is normal in the environment and only send an alert when things are outside of the understood normal cycles and parameters. Dynamic thresholds usually learn the normal range for a performance counter -- both a high and low threshold -- at each point in the day, week and month. They, therefore, identify daily, weekly, monthly and even annual cycles in IT systems. A dynamic system knows the high CPU load during backup is normal, but that 80% CPU utilization on a Tuesday morning is abnormal. Because tuning is automatic, the IT monitoring strategy can include thousands of thresholds, even ones that change over time to follow business cycles.
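One simple way to picture a learned threshold is a per-slot baseline: group past samples by weekday and hour, then treat anything far from that slot's average as abnormal. The sketch below assumes a band of the mean plus or minus k standard deviations; the class name, the value of k and the minimum sample count are illustrative choices, and commercial tools may use quite different models.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


class HourOfWeekBaseline:
    """Learns a high and low threshold for each (weekday, hour) slot."""

    def __init__(self, k: float = 3.0, min_samples: int = 4):
        self.k = k                      # band width in standard deviations
        self.min_samples = min_samples  # samples needed before judging a slot
        self._samples = defaultdict(list)

    @staticmethod
    def _slot(ts: datetime):
        return (ts.weekday(), ts.hour)

    def learn(self, ts: datetime, value: float) -> None:
        """Record one historical sample for its weekday/hour slot."""
        self._samples[self._slot(ts)].append(value)

    def normal_range(self, ts: datetime):
        """Return (low, high) for the slot, or None if too little history."""
        values = self._samples.get(self._slot(ts), [])
        if len(values) < self.min_samples:
            return None
        centre = mean(values)
        spread = self.k * stdev(values)
        return (centre - spread, centre + spread)

    def is_abnormal(self, ts: datetime, value: float) -> bool:
        """True when the value falls outside the learned band for the slot."""
        band = self.normal_range(ts)
        if band is None:
            return False  # not enough history yet, so stay quiet
        low, high = band
        return value < low or value > high
```

Fed a few weeks of history, a baseline like this would accept 95% CPU in the backup hour while flagging 80% on a Tuesday morning slot that normally sits near 30%. It also flags unusually low values, which is why a quiet public holiday can trigger an alert.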

Dynamic thresholds are not as intelligent as people. A dynamic monitoring setup can become confused when cyclic activity doesn't happen according to usual patterns. For example, the support staff will get an alert that system load is low on a public holiday, because the users are at the beach instead of at their desks creating load.

Dynamic monitoring tools deployed in a broken or poorly performing IT environment can learn that state as normal and even send alerts when the environment improves. For example, an application has a memory leak, so memory utilization climbs over time, but the server is rebooted monthly for patches, which resets it. The dynamic system will accept this monthly cycle of rising memory utilization as normal. Dynamic systems are also inclined to treat a gradual failure as the new normal: if a storage array slowly becomes overloaded and unresponsive, the dynamic threshold monitoring system will register the overloaded state as normal.

An IT monitoring strategy for the real world

In the real world, most monitoring tools do more than just watch thresholds, and even dynamic threshold systems incorporate some static parameters. Overall, IT monitoring tools that build thresholds automatically are more useful than those that require a lot of manual tuning. Tedious tuning rarely gets completed in a busy IT organization, which leads to a habit of ignoring noisy false alerts.
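To illustrate how a static parameter can sit alongside a learned range, the sketch below checks a fixed ceiling first and only then consults the learned band. The function name, the severity labels and the idea of a single hard ceiling are assumptions for illustration; they are not taken from any specific product.

```python
from typing import Optional, Tuple


def evaluate(value: float,
             learned_range: Optional[Tuple[float, float]],
             hard_ceiling: float) -> str:
    """Classify a sample using a static ceiling plus a learned normal range."""
    # The static ceiling applies even if the learned range says this is normal,
    # which guards against a baseline trained on a degraded environment.
    if value >= hard_ceiling:
        return "critical"
    if learned_range is not None:
        low, high = learned_range
        if value < low or value > high:
            return "warning"
    return "ok"
```

With a learned memory range of 90% to 99% (say, from a leak the system has accepted as normal) and an assumed ceiling of 95%, a 97% reading is still reported as critical instead of being silently treated as business as usual.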

A smart monitoring strategy uses more than just performance counters. Tools incorporate system logs to help identify issues and pair infrastructure monitoring with application monitoring. This setup tracks app availability and response time and correlates them with infrastructure performance. A monitoring system with all its dials showing green does not tell the complete story; look for multiple ways to identify issues in the environment.
