Systems administrators and IT managers are bombarded by IT infrastructure monitoring alerts. Some alarms are immediately important, some contribute to future decisions and some are simply noise.
With a good monitoring strategy, you can reduce a flood of IT alerts to meaningful information that is quickly prioritized and routed to the appropriate team. Take it a step further and attach actions to relevant monitoring metrics, so you no longer have to watch IT monitoring tools constantly.
First, monitor only the things you can act on. IT infrastructure monitoring tools often have an auto-configure feature that discovers all available metrics; this results in hundreds or thousands of configured monitors, and thus IT alerts, on even a small, 10-device network. That is simply overkill. To get meaningful information, reduce the monitors to those that could actually require action. Start with the most critical ones, and add monitors over time so you don't go from zero alerts to being inundated in one day.
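The pruning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the behavior of any particular monitoring product: the Monitor record, its actionable and priority fields, and the select_monitors function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Monitor:
    """Hypothetical record for one auto-discovered monitor."""
    name: str
    actionable: bool  # can staff do something when this fires?
    priority: int     # 1 = most critical (assumed ranking scheme)


def select_monitors(discovered: List[Monitor], max_enabled: int) -> List[Monitor]:
    """Keep only actionable monitors, most critical first, capped at max_enabled."""
    if max_enabled < 0:
        raise ValueError("max_enabled must not be negative")
    actionable = [m for m in discovered if m.actionable]
    return sorted(actionable, key=lambda m: m.priority)[:max_enabled]
```

Raising max_enabled in small steps is one way to add monitors over time rather than enabling everything the discovery feature finds on day one.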
To further refine IT alert management, differentiate between critical status information and long-term trending information. Sift through the auto-generated alarms that come along with off-the-shelf monitoring tools by asking whether each alarm represents a critical piece of information with a time requirement, or a piece of information that's valuable only for trend purposes. Keep the trending information around and available, but avoid polluting the tactical views of IT infrastructure monitoring. For example, your IT staff needs to know at a glance whether a switch port is up or down, not how its usage has grown over the last three months. Utilization and capacity reports are important in a strategic sense, and should be reviewed separately from the day-to-day tactical monitoring information.
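As a rough sketch of that sorting exercise, the Python below splits alarms into a tactical view and a trending view based on a single yes-or-no question: does the alarm carry a time requirement? The Alarm record, the time_sensitive flag and the view names are assumptions made for illustration; a real tool would expose its own fields.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm definition from an off-the-shelf tool."""
    name: str
    time_sensitive: bool  # does it need a response within a time window?


def split_views(alarms: List[Alarm]) -> Dict[str, List[str]]:
    """Route time-sensitive alarms to the tactical view, the rest to trending."""
    views: Dict[str, List[str]] = {"tactical": [], "trending": []}
    for alarm in alarms:
        key = "tactical" if alarm.time_sensitive else "trending"
        views[key].append(alarm.name)
    return views
```

In this sketch, a switch port up/down alarm would land in the tactical view, while three-month port utilization growth would be kept but shown only in the trending view.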
Next, customize IT alerting in the infrastructure monitoring setup by creating alarms at varying severity levels for the same IT resource. Some alarms are noisy by nature, and their threshold settings contribute to the noise. On the other side of the spectrum, some threshold configurations cause IT engineers to miss critical issues because the alert fires too early to convey real urgency. For example, IT teams generally set up disk space utilization alarms at 20% free space for a warning, with a critical alarm at 10% free. Given the size of newer hard drives, that critical alarm can go off when there is upward of 500 GB of free space; on a 5 TB drive, 10% is 500 GB. It could be days or weeks before a system consumes the remaining 500 GB. Instead, set up multiple disk space alarms that activate as severity increases, with each level dictating the required speed of response. Leave the standard 20% and 10% thresholds in your IT alerting strategy if they work for you, but also configure another alarm for when free disk space goes below 10 GB, indicating that the device needs immediate attention. Or take it a step further and add an alarm for 0 bytes free on a device, indicating an all-hands-on-deck emergency. This practice can be extended to other scenarios.
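The layered disk space thresholds can be expressed as a small Python function. Treat this as a sketch under stated assumptions: the severity labels, the decimal definition of a gigabyte and the disk_severity function are hypothetical, and only the 20%, 10%, 10 GB and 0-byte thresholds come from the example above.

```python
from typing import Optional

GB = 10**9  # assumption: decimal gigabytes


def disk_severity(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the most severe level that applies, or None if no alarm fires."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100 * free_bytes / total_bytes
    if free_bytes <= 0:
        return "emergency"   # 0 bytes free: all hands on deck
    if free_bytes < 10 * GB:
        return "immediate"   # absolute floor, independent of drive size
    if free_pct < 10:
        return "critical"    # standard percentage threshold
    if free_pct < 20:
        return "warning"     # standard percentage threshold
    return None


# A 6 TB drive with 500 GB free trips the percentage-based critical alarm,
# but it is the absolute 10 GB floor that signals real urgency.
print(disk_severity(500 * GB, 6000 * GB))
print(disk_severity(5 * GB, 6000 * GB))
```

Checking the absolute thresholds before the percentage ones means the most urgent condition always wins when several apply to the same device.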
Picking and implementing the best IT monitoring tool or tools is only half the battle. Always put significant effort into IT alert management as well: reducing useless alarms and refining what information gets collected. Otherwise, IT managers can drown in an onslaught of information. With confidence that the IT monitoring strategy is tailored to catch the important things and alert IT staff in a timely manner, organizations can focus more on business growth and operations.