While we prefer that the data center hum along without problems, snags can -- and will -- occur. And when they do, operations must investigate.
Decentralized and distributed application designs create performance problems that can be much harder to trace than those in monolithic architectures. Operations admins must become detectives who scrutinize the issues that occur -- logs are the primary source of data at the scene. These records of events and possible triggers are clues to the recent operational history and provide insights that point to what could happen next. Log analytics use cases fit into three distinct categories: incident fixes, trend discovery and IT optimization.
Incident remediation and management
This log analytics use case is a reaction to something that went wrong. The admin uses logs to investigate the root cause, fix it and prevent it from happening again. Collect only the logs from systems identified as involved in the issue, but remember that those systems are often distributed across the infrastructure. To give the response team the best chance of determining where and how the incident started, ensure that all possible data points are captured in detail for analysis.
The easiest way to overlook an issue is to confuse where the flash point occurred with what caused it. The source of the issue isn't necessarily the trigger that made it visible, so it's easy to misdiagnose a problem from insufficient data on the event. Logs reveal events that could have caused the incident but rarely pinpoint it. Cast a wide, shallow net in log analytics for incident management: collect from every potentially involved system, but only for a narrow window around the incident. The net doesn't have to go back several days or weeks -- this timing constraint limits the amount of data and effort needed for log collection and analysis.
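As a sketch of that wide, shallow net, the hypothetical Python snippet below pulls log lines from several sources but keeps only entries within a narrow window around the incident time. The source names, log format and window size are assumptions for illustration, not a real tool's behavior:

```python
from datetime import datetime, timedelta

# Hypothetical log lines from distributed systems; in practice these would
# come from files or a log aggregator. Assumed format: "ISO-timestamp level message".
LOGS = {
    "web-frontend": [
        "2024-05-01T10:14:02 INFO request served",
        "2024-05-01T10:15:31 ERROR upstream timeout",
    ],
    "auth-service": [
        "2024-05-01T10:15:29 WARN token cache miss spike",
        "2024-05-01T09:50:00 INFO routine key rotation",
    ],
}

def events_near(logs, incident_time, window_minutes=10):
    """Wide, shallow net: every source, but only a narrow time window
    around the incident."""
    lo = incident_time - timedelta(minutes=window_minutes)
    hi = incident_time + timedelta(minutes=window_minutes)
    hits = []
    for source, lines in logs.items():
        for line in lines:
            ts = datetime.fromisoformat(line.split(" ", 1)[0])
            if lo <= ts <= hi:
                hits.append((source, line))
    # Interleave sources chronologically so the response team sees one timeline.
    return sorted(hits, key=lambda h: h[1])

incident = datetime.fromisoformat("2024-05-01T10:15:30")
for source, line in events_near(LOGS, incident):
    print(source, "|", line)
```

The width comes from querying every source; the shallowness comes from the tight time bound, which keeps the data volume small enough to analyze quickly.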
Trend discovery and reporting
Unlike the reactive work of remediation, trend tracking is a more predictive use of logs. Predicting trends helps a company prevent errors. Even if some issues slip through, the analysis that illuminates their source and cause helps minimize an outage. That is a huge victory for IT ops.
There is a fundamental issue with predictive logs, however. Unlike an incident, where log collection focuses on a narrow time window, trend analysis requires logs from weeks or months of ongoing operations. The volume of data is significant compared to incident remediation, and the log analytics tools IT organizations set up to comb through it all must be more comprehensive and often rely on machine learning. Trend analysis log management tools bring a lot to the table but come at a higher cost than incident alert products. Still, the benefit of trending logs can't be overstated.
Timing is a hurdle for this log analytics use case: The efforts required for predictive analysis don't occur in real time, because the data volume is simply too large. The tool won't capture an incident without a lead-up until after the event, which can be frustrating for management, who might look at trend discovery as a catch-all bucket. It does improve incident detection over time but not in the moment, nor is it an all-encompassing effort.
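The core idea behind this kind of trend analysis can be sketched in a few lines, setting aside the machine learning real products use. The snippet below is a minimal, assumed example: daily error counts (invented numbers) are compared against a trailing baseline, and days that deviate sharply are flagged:

```python
from statistics import mean, stdev

# Hypothetical daily error counts gathered over weeks of operation.
daily_errors = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15, 12, 13, 48, 14]

def flag_anomalies(counts, window=7, threshold=3.0):
    """Flag days whose count deviates from the trailing window's baseline
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(counts[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

print(flag_anomalies(daily_errors))  # the spike at index 12 stands out
```

Note the limitation the article describes: the spike is only flagged after enough history has accumulated to establish a baseline, so this improves detection over time rather than in the moment.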
Deployment hardening and optimization
IT operations logs also inform preventive measures so that deployments survive and remain available even under duress. This log analytics use case is akin to a checklist that ensures all steps have been completed. These logs create a snapshot-in-time view of an environment to pass compliance checks or to satisfy other certifications or qualifications. Log data for deployment hardening captures a single moment in time, and once taken, the picture is already out of date. That doesn't mean snapshots of operational state are ineffective, but they have a limited lifespan and must be taken multiple times over the course of a year or another predefined timeframe to demonstrate value. Snapshots of deployments are relatively low impact and low cost to run.
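A snapshot like this can be as simple as a timestamped, hashed record of the settings being audited. The sketch below uses invented configuration keys; a real check would gather state from config management or the environment itself:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical hardening-relevant settings; the keys and values are
# placeholders for whatever a real compliance check would gather.
deployment_state = {
    "tls_min_version": "1.2",
    "open_ports": [443],
    "debug_mode": False,
}

def snapshot(state):
    """Capture a timestamped picture of the environment. The digest lets a
    later run detect drift with a single comparison."""
    canonical = json.dumps(state, sort_keys=True)
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "digest": hashlib.sha256(canonical.encode()).hexdigest(),
        "state": state,
    }

baseline = snapshot(deployment_state)
# A later run with identical state yields the same digest; any change
# to the settings produces a different one, signaling drift.
later = snapshot(deployment_state)
print(baseline["digest"] == later["digest"])
```

Because each snapshot is cheap to take, rerunning it on a schedule -- as the article recommends -- is what turns a single out-of-date picture into an ongoing drift check.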
These log analytics use cases directly affect the data center or cloud-hosted environment. Each method provides IT operations administrators with different data: the means to address an issue at the moment it occurs, predict a future event or verify the current environment status. Each requires different levels of effort and associated cost to obtain the necessary data. In an ideal world, every operational group would engage all three of these log use types, but predictive log data can be one of the most difficult -- and expensive -- to derive. Incident management and deployment hardening and optimization carry relatively lower cost and effort to collect.