A few years ago, the IT operations staff at BT Infonet's data center in El Segundo, Calif., had trouble keeping up with the numerous alerts and volume of data generated by multiple monitoring tools. "We have different monitoring tools that generate a lot of data," said Armand Shirikian, BT Infonet's senior IT manager of operations. "With multiple tools and multiple owners, we had to hold a meeting with eight engineers whenever we generated a report and wanted to provide a single voice from IT."
A communications services provider, BT Infonet has a range of offerings, from broadband to Voice over Internet Protocol. The data center in El Segundo serves as the nerve center for the company. Shirikian wanted to streamline troubleshooting processes that had become increasingly burdensome for IT operations staff.
About three years ago, Shirikian opted to deploy Alive, an analytical software package from Irvine, Calif.-based Integrien Corp. The premise behind Alive is unusual among systems management tools, according to Gartner Inc. analyst David Williams. By looking at the historical performance of IT infrastructure components such as applications, servers, databases, networks and firewalls, and combining that with behavioral analytics (through what Integrien calls "dynamic threshold algorithms"), Alive can predict problems well before static thresholds set off alerts. "Integrien is trying to take data and apply intelligence around it," Williams said.
Shirikian said that within weeks of deploying Alive, he could see a vast improvement in how IT operations staff responded to problems. With the previous siloed monitoring tools, engineers relied on static thresholds; by the time the various servers, databases, applications and processes were close to hitting those thresholds, engineers had to scramble to avert problems.
"When we noticed our levels were approaching thresholds, engineers were constantly changing the numbers so they would have time to fix the underlying problems that were making a system approach its thresholds to begin with," he said. It was a classic case of IT operations going into reactive mode to put out a fire.
Predicting problems before they happen
With Alive, engineers are tipped off by seemingly normal activities that, based on historical behavior, indicate an impending problem. The Alive system delivers alerts based on the behavior it predicts and can pinpoint the problematic components, such as the database server or application server, based on whether the activities caused problems on those components in the past. Alive also tells engineers the probability that a behavior will occur and predicts when it will occur. "The system alerts us to activities that are within our thresholds, but they aren't normal and they're indicative of a problem," Shirikian said.
Predicting problems with Alive 6.0
Alive 6.0 pulls time series metric data from an organization's existing monitoring systems, then applies dynamic thresholding algorithms and a scoring model to learn the normal behavior of the measured metrics. Organizations can input historical metric performance data to determine ranges of normal behavior immediately, or they can install Alive, and the software will determine normal behavior within two to six weeks.
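To make the idea of a learned threshold concrete, here is a minimal sketch in Python. It is not Integrien's algorithm, which the article does not describe in detail; it assumes a simple stand-in in which "normal" is the mean of past readings for a given hour of the day, plus or minus a few standard deviations. The function names, the hour-of-day grouping and the parameter `k` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(samples):
    """Learn a per-hour band of normal values.

    samples: iterable of (hour_of_day, value) pairs from historical data.
    Returns {hour: (mean, standard deviation)} for hours with >= 2 samples.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_abnormal(baseline, hour, value, k=3.0):
    """Flag a reading more than k standard deviations from that hour's mean.

    Hours with no learned baseline are never flagged (an assumption).
    """
    if hour not in baseline:
        return False
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * sigma


# Hypothetical CPU-utilization history: quiet at 3 a.m., busy at 2 p.m.
history = [(3, v) for v in (20, 22, 21, 19, 23)]
history += [(14, v) for v in (70, 75, 72, 68, 74)]
baseline = learn_baseline(history)

print(is_abnormal(baseline, 3, 60))   # 60% at 3 a.m. is far outside the learned band
print(is_abnormal(baseline, 14, 70))  # 70% at 2 p.m. is ordinary for that hour
```

In this toy example, a 60% reading at 3 a.m. is flagged even though it would sit comfortably under a static 90% threshold, while a 70% reading in the afternoon is treated as normal.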
Any deviation from normal behavior serves as an early indicator of problems. Alive users can set key performance indicators; when these indicators are breached, Alive creates a model of the abnormalities, known as a Problem Fingerprint. Once a Problem Fingerprint is identified, Alive uses it to recognize recurring patterns. If there's a high probability that a pattern will result in a problem, Alive sends a predictive alert to IT administrators, indicating the problem and how it was resolved previously.
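The fingerprint-matching step can be sketched in a similar way. The following Python is an illustration only: it assumes a fingerprint is simply the set of metrics that were abnormal before a past KPI breach, together with a note on how the problem was resolved, and it uses Jaccard similarity as a stand-in for whatever scoring model Alive actually applies. The class, function names, metric names and the `min_score` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    problem: str                 # what went wrong last time
    abnormal_metrics: frozenset  # metrics that were abnormal beforehand
    resolution: str              # how it was resolved previously


def similarity(a, b):
    """Jaccard similarity between two sets of metric names (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predictive_alerts(current_abnormal, fingerprints, min_score=0.6):
    """Return (score, fingerprint) pairs resembling the current state,
    best match first. min_score is an arbitrary illustrative cutoff."""
    current = frozenset(current_abnormal)
    hits = []
    for fp in fingerprints:
        score = similarity(current, fp.abnormal_metrics)
        if score >= min_score:
            hits.append((score, fp))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Hypothetical fingerprints recorded after earlier incidents.
known = [
    Fingerprint(
        "database connection pool exhausted",
        frozenset({"db.connections", "db.query_latency", "app.queue_depth"}),
        "recycled the connection pool",
    ),
    Fingerprint(
        "network link degraded",
        frozenset({"net.packet_loss", "net.latency"}),
        "failed over to the backup link",
    ),
]

for score, fp in predictive_alerts({"db.connections", "db.query_latency"}, known):
    print(f"{score:.2f} {fp.problem}: previously resolved by '{fp.resolution}'")
```

Here two of the three metrics from the earlier database incident are abnormal again, so only that fingerprint clears the cutoff, and the alert carries the earlier resolution with it.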
Alive uses an agentless method to collect data by integrating with existing monitoring tools and using the time series data that is already available. Alive gathers business performance data, user experience data and IT infrastructure metrics from all tiers of an application including application server, database, network and Web server. – M.S.
This predictive capability has eliminated the false positive alerts that BT Infonet experienced in the past. "When a system hit 90% of the threshold, an alarm would go off, even if the system was just spiking for a split second," Shirikian explained. "That was always happening, and we would have to take the time to figure out that there really wasn't a problem."
This past December, BT Infonet upgraded to Alive 6.0, a version that includes enhanced analytics, a role-based graphical user interface and adapters that support integration with third-party monitoring tools. The adapters allow Alive 6.0 to complement existing monitoring tools from the likes of Compuware Corp., Hewlett-Packard Co., and Symantec Corp. by collecting the data and presenting it in a single dashboard. "It really reduces a lot of manual processes because we don't have to go into each tool separately to get at the data," Shirikian said.
The Alive system is installed on the 175 servers that make up the company's customer relationship management platform, including Siebel on the front end and an Oracle database on the back end. "We monitor every element in our end-to-end environment," Shirikian said. That includes servers, applications, databases and the network.
A proactive approach to IT operations
Rather than putting out fires, Shirikian likens troubleshooting activities today to noticing the smoke before a blaze erupts. An engineer can look at the data in Alive and know that a problem lies with the database server, not the application server or network, thereby eliminating those confabs with the entire engineering staff. "We fix problems before the business even knows that anything has happened," Shirikian said.
Gartner's Williams said that what makes Integrien's approach unique is also what makes it challenging to market. The use of behavioral analytics "is a tough sell to data centers because it's different from the way IT operations typically get alerted to problems," he said.
BT Infonet's engineers initially were skeptical of the predictive approach. "When I first introduced Alive to engineers, they wondered how it would ever be able to tell them what would happen," Shirikian said. "I convinced them to try it, and as soon as it started pinpointing problems proactively, they thought it was great." In everyday terms, Shirikian said, the tool, which has a base price of $62,000, has changed the nature of the engineers' jobs. "They're not just soldiers gathering reports," he said. "They don't have to look at whether a problem is with the database server, the application or the network; the tool tells them."
Let us know what you think about the story; email Megan Santosus, Features Writer.