

Avoid the deluge: How to monitor alerts without getting overwhelmed

In a virtualized environment, IT admins can receive alerts around the clock. The trick is to filter the data to get warnings only when there's an actual problem.

It can be hard to remember these days, but there was a time in IT when each application was on its own physical infrastructure. Dedicated administrators for both applications and systems could easily identify where things were going wrong and fix them without delay.

Sure, those IT architectures were horrendously resource-inefficient, but they were relatively simple, and it was at least nominally easier for admins to get to the root of problems. With IT resources now spread across a mix of private and public cloud platforms, admins need tools that dig deep into everything to know what is happening.

This can lead to problems. Dig too deep, and the sys admins get deluged as they monitor alerts. Stay out of the trenches, and you might miss the small problem that causes a major issue. If you approach tooling in a piecemeal, best-of-breed manner, you limit the ability to identify the root cause of any problem.

What causes all these alerts?

Why has it become more tedious for admins to monitor alerts? For one, in the old days, many organizations could easily define what each of their applications did and how it worked. A business might use SAP for ERP, Oracle for customer relationship management and so on.

Many organizations now break things into collections of functional components that are pulled together as required. This microservices-like approach changes what administrators must monitor: an event inside a single microservice means little in isolation. Instead, it's better to monitor what happens along the whole of the business process -- but that's difficult.
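One common way to monitor a business process rather than individual microservices is to correlate per-service events by a shared transaction or trace ID. The sketch below assumes a hypothetical `txn_id` field on each event; real deployments typically propagate a trace ID via something like OpenTelemetry.

```python
from collections import defaultdict

def group_by_transaction(events):
    """Group per-service events by a shared transaction ID so a business
    process can be watched end to end instead of per microservice.
    The 'txn_id' field is illustrative, not from any specific product."""
    flows = defaultdict(list)
    for event in events:
        flows[event["txn_id"]].append(event)
    return dict(flows)

def failed_transactions(events):
    """Return IDs of business transactions in which any step failed."""
    return sorted(
        txn for txn, steps in group_by_transaction(events).items()
        if any(step["status"] == "error" for step in steps)
    )

events = [
    {"txn_id": "t1", "service": "orders",  "status": "ok"},
    {"txn_id": "t1", "service": "billing", "status": "error"},
    {"txn_id": "t2", "service": "orders",  "status": "ok"},
    {"txn_id": "t2", "service": "billing", "status": "ok"},
]
print(failed_transactions(events))  # ['t1']
```

Here an error in the billing step surfaces as a failure of the whole order flow, which is what the business actually cares about.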


With multiple workloads sharing resources, virtualization complicates monitoring. For example, Workload A requires and receives extra storage, which might be an automated process. But what effect does this have on workloads B, C, D and E? Remember to take into account context and the domino effect these connected workloads have. Increasing the server resources for a workload means that there could be insufficient network bandwidth available for another workload's needs. Also, if a workload receives increased storage resources, that can create a lower ceiling for other workloads.
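The domino effect described above can be made concrete with a simple headroom check: before granting one workload more of a shared resource, verify the grant still leaves enough of that resource for everything else. All names and capacity figures below are made up for illustration.

```python
def grant_ok(capacity, allocations, resource, amount):
    """Check that granting `amount` more of a shared `resource` does not
    push total allocations past the platform's capacity, which would
    starve the other workloads (the domino effect)."""
    total = sum(a[resource] for a in allocations.values()) + amount
    return total <= capacity[resource]

# Illustrative shared pool and current per-workload allocations.
capacity = {"storage_tb": 10, "net_gbps": 4}
allocations = {
    "A": {"storage_tb": 3, "net_gbps": 1},
    "B": {"storage_tb": 4, "net_gbps": 1},
    "C": {"storage_tb": 2, "net_gbps": 1},
}
print(grant_ok(capacity, allocations, "storage_tb", 2))  # False: 9 + 2 > 10
print(grant_ok(capacity, allocations, "net_gbps", 1))    # True: 3 + 1 <= 4
```

A real orchestrator would also weigh priorities and burst behavior, but even this check prevents one workload's automated storage grant from silently lowering the ceiling for the rest.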

That hybridization raises other questions, too. Will a public cloud give you visibility into what happens in its environment? Will the cloud provider expose events and alerts, and, if so, what kinds? And how will they be formatted? If you can, choose cloud vendors that provide access to suitable amounts of event data in a way that enables you to include the data in your overall system. An admin's desired end result is a single view of everything, but that's not possible when a cloud provider offers only a limited portal view.

Filter that data for meaningful alerts

Admins need to address three areas to solve their alert gluts: an adjustable method to take event data from any source, adequate filtering of that data and strong data analysis.

When admins have the flexibility to pull event data from any source, they can normalize the data so that it can be analyzed effectively. Broad-scale data center infrastructure management (DCIM) systems -- such as those from Nlyte, Future Facilities and Vertiv -- allow admins to monitor a wide range of IT and peripheral equipment. To be useful, however, a DCIM tool will need to fully understand the systems in place.

Where this is not the case, vendors such as Splunk specialize in taking event data from open and proprietary data stores associated with IT equipment. These products perform first-level analysis and then pass the data on for further analysis as required.
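Normalization of event data, as described above, usually amounts to mapping each source's vendor-specific fields onto one common schema before analysis. The field names for the two example sources below are invented for illustration; real DCIM or syslog payloads will differ.

```python
def normalize(event, source):
    """Map vendor-specific event fields onto a common schema so events
    from different tools can be analyzed together. Both source formats
    here are hypothetical examples."""
    if source == "dcim":
        return {"host": event["device_name"],
                "severity": event["alarm_level"].lower(),
                "message": event["description"]}
    if source == "syslog":
        return {"host": event["hostname"],
                "severity": event["priority"],
                "message": event["msg"]}
    raise ValueError(f"unknown source: {source}")

raw = {"device_name": "pdu-04", "alarm_level": "WARNING",
       "description": "inlet temperature high"}
print(normalize(raw, "dcim"))
# {'host': 'pdu-04', 'severity': 'warning', 'message': 'inlet temperature high'}
```

Once every source emits the same shape, downstream filtering and analysis only has to be written once.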

Adequate filtering prevents data from being sent to yet another data lake, which wastes money, time and effort. IT equipment creates a lot of useless data that does not need to be sent to a central place. Anything that essentially says things are completely normal can be dropped quickly.
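In practice, that first-level filter can be as simple as dropping events at the edge whose severity says everything is normal. A minimal sketch, assuming normalized events with a `severity` field:

```python
# Severities that just confirm normal operation and can be dropped
# at the edge instead of being shipped to a central store.
NOISE = {"ok", "info", "heartbeat"}

def worth_forwarding(event):
    """Forward only events that could need attention or later analysis."""
    return event["severity"] not in NOISE

events = [
    {"severity": "ok", "msg": "disk check passed"},
    {"severity": "warning", "msg": "disk 85% full"},
    {"severity": "heartbeat", "msg": "agent alive"},
    {"severity": "critical", "msg": "RAID degraded"},
]
kept = [e for e in events if worth_forwarding(e)]
print([e["severity"] for e in kept])  # ['warning', 'critical']
```

Filtering this early keeps the "all normal" chatter out of the central data lake entirely, rather than paying to store it and filter it later.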

With data analysis, there are two keys: Perform the analysis in real time, and institute a system that reports the findings to those who need the information in a manner that will not be easily ignored.

The reports do not need to be comprehensive. A color-coded dashboard allows admins who monitor alerts to quickly see if all is well -- or when something is wrong. The report should also let admins drill down to the back-end data, so they can look at the underlying cause of an alert and get the extra detail that they need.
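The color-plus-drill-down idea can be sketched as a roll-up that reduces a set of alerts to a single status color while keeping the raw alerts attached for investigation. The severity names are assumptions carried over from the earlier examples.

```python
def rollup(alerts):
    """Summarize alerts as one dashboard color, keeping the raw alerts
    attached so an admin can drill down from color to cause."""
    if any(a["severity"] == "critical" for a in alerts):
        color = "red"
    elif any(a["severity"] == "warning" for a in alerts):
        color = "amber"
    else:
        color = "green"
    return {"color": color, "detail": alerts}

status = rollup([{"severity": "warning", "msg": "disk 85% full"}])
print(status["color"])   # amber
print(status["detail"])  # the underlying alerts, for drill-down
```

The top-level view stays glanceable, while nothing is thrown away: the detail an admin needs for root-cause work rides along under the color.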

Automate, automate, automate

People are the most common source of error in the data center. Wherever possible, choose orchestration systems -- such as HashiCorp Terraform, Flexiant or Electric Cloud -- that use automation to fix problems. A system that automates error correction to gain a desired outcome can be much more effective than a team of administrators.

Automation can identify and fix minor problems before they have any noticeable effect on the overall performance or stability of the platform itself. Ensure that a high level of granularity is in place. For example, the system should automatically reach into areas such as the SMART data of disk drives and the patch levels of device drivers and operating systems. Use the intelligence provided by the vendors to understand where automated updates to systems make sense and where to trust the systems when they flag something as possibly suspect.
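The safe pattern for automated error correction is a dispatch table: known, low-risk problems get a scripted fix, and anything unrecognized is escalated to a human rather than guessed at. The event types and handlers below are purely illustrative.

```python
def auto_remediate(event, handlers):
    """Run the automated fix for a known problem type; escalate
    anything the automation does not recognize."""
    handler = handlers.get(event["type"])
    if handler:
        return handler(event)
    return f"escalate to admin: {event['type']}"

def clear_tmp(event):
    # Placeholder for a real, low-risk remediation script.
    return f"cleared /tmp on {event['host']}"

handlers = {"disk_full_tmp": clear_tmp}
print(auto_remediate({"type": "disk_full_tmp", "host": "web01"}, handlers))
print(auto_remediate({"type": "smart_failure", "host": "db02"}, handlers))
```

Starting with a short list of trusted remediations and widening it as confidence grows is how automation ends up more reliable than a team reacting to pages at 3 a.m.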

It should not be a major headache to monitor alerts. The optimum approach for everyone is to take an intelligent path where automation leads the way.
