DevOps shops use time-series monitoring systems to glean a nuanced, historical view of IT infrastructure that improves troubleshooting, autoscaling and capacity forecasting.
Time-series monitoring tools are based on time-series databases, which are optimized for time-stamped data collected continuously or at fine-grained intervals. Because they store fine-grained data for longer than many traditional metrics-based monitoring tools, they can be used to compare long-term trends in DevOps monitoring data. They can also bring together data from sources beyond the IT infrastructure alone, linking developer and business activity with the behavior of the infrastructure.
Time-series monitoring systems include the open source project Prometheus, which is popular among Kubernetes shops, as well as commercial offerings from InfluxData and Wavefront, the latter of which VMware acquired last year.
DevOps monitoring with these tools gives enterprise IT shops such as Houghton Mifflin Harcourt, an educational book and software publisher based in Boston, a unified view of both business and IT infrastructure metrics. It does so over a longer period of time than the Datadog monitoring product the company used previously, which retains data for only up to 15 months in its Enterprise edition.
"Our business is very cyclical as an education company," said Robert Allen, director of engineering at Houghton Mifflin Harcourt. "Right before the beginning of the school year, our usage goes way up, and we needed to be able to observe that [trend] year over year, going back several years."
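The year-over-year comparison Allen describes can be illustrated with a minimal Python sketch. It is not Houghton Mifflin Harcourt's actual query; the function names and the sample figures are made up for illustration, and the data is assumed to be non-negative daily usage counts already pulled out of the time-series store.

```python
from collections import defaultdict
from datetime import date


def peak_by_year(samples, month):
    """Return {year: peak value} for one calendar month.

    samples: iterable of (date, value) pairs. Values are assumed to be
    non-negative usage counts (hypothetical data, not a real export).
    """
    peaks = defaultdict(float)
    for day, value in samples:
        if day.month == month:
            peaks[day.year] = max(peaks[day.year], value)
    return dict(peaks)


def year_over_year_change(peaks):
    """Return {year: fractional change versus the previous year present in the data}."""
    years = sorted(peaks)
    return {
        year: (peaks[year] - peaks[prev]) / peaks[prev]
        for prev, year in zip(years, years[1:])
        if peaks[prev]  # skip years whose baseline is zero
    }


# Made-up sample data: late-August usage across three years.
samples = [
    (date(2015, 8, 20), 1000.0),
    (date(2015, 8, 25), 1200.0),
    (date(2016, 3, 1), 300.0),
    (date(2016, 8, 24), 1500.0),
    (date(2017, 8, 23), 1800.0),
]
august_peaks = peak_by_year(samples, month=8)
changes = year_over_year_change(august_peaks)
```

A comparison like this only works if the raw data points from earlier years are still available, which is the retention argument the company makes for a time-series database.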
Allen's engineering team got its first taste of InfluxData as a long-term storage back end for Prometheus, which at the time was limited in how much data could be held in its storage subsystem -- Prometheus has since overhauled its storage system in version 2.0. Eventually, Allen and his team decided to work with InfluxData directly.
Houghton Mifflin Harcourt uses InfluxData to monitor traditional IT metrics, such as network performance, disk space, and CPU and memory utilization, in its Amazon Web Services (AWS) infrastructure, as well as developer activity in GitHub, such as pull requests and number of users. The company developed its own load-balancing system using Linkerd and Finagle. InfluxData collects data on network latencies in that system and ties in with Zipkin's tracing tool to troubleshoot network performance issues.
Multiple years of highly granular infrastructure data empower Allen's team of just five people to support nearly 500 engineers who deliver applications to the company's massive Apache Mesos data center infrastructure.
Time-series monitoring tools boost DevOps automation
Time-series data also allows DevOps teams to ask more nuanced questions about the infrastructure to inform troubleshooting decisions.
"It allows you to apply higher-level statistics to your data," said Louis McCormack, lead DevOps engineer for Space Ape Games, a mobile video game developer based in London and an early adopter of Wavefront's time-series monitoring system. "Instead of something just being OK or not OK, you can ask, 'How bad is it?' Or, 'Will it become very problematic before I need to wake up tomorrow morning?'"
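One simple way to picture the "will it become problematic before morning?" question is a straight-line forecast over recent samples. The sketch below is an assumption-laden illustration, not Wavefront's query language or Space Ape's method: it fits a least-squares line to hypothetical disk-usage readings and estimates how many hours remain before a threshold is crossed, assuming that higher values are worse.

```python
def linear_fit(points):
    """Ordinary least-squares line through (t_hours, value) pairs.

    Returns (slope, intercept).
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("need at least two distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def hours_until(points, threshold):
    """Hours after the last sample until the fitted line reaches threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or moving away from it.
    """
    slope, intercept = linear_fit(points)
    last_t = points[-1][0]
    current = slope * last_t + intercept
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_t


# Hypothetical hourly disk-usage readings, in percent.
disk_usage = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
remaining = hours_until(disk_usage, threshold=90.0)
```

With these made-up readings the line climbs two percentage points per hour, so the estimate is seven hours of headroom. A linear trend is only one of many possible models, and real workloads are rarely this well behaved.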
Space Ape manages a smaller infrastructure than Houghton Mifflin Harcourt, at about 600 AWS instances compared to about 64,000. But Space Ape also has highly seasonal business cycles, and time-series monitoring with Wavefront helps it not only to collect granular historical data, but also to scale the IT infrastructure in response to seasonal fluctuations in demand.
"A service in AWS consumes Wavefront data to make the decision about when to scale DynamoDB tables," said Nic Walker, head of technical operations for Space Ape Games. "Auto scaling DynamoDB is something Amazon has only just released as a feature, and our version is still faster."
The company's apps use the Wavefront API to trigger the DynamoDB autoscaling, which makes the tool much more powerful, but also requires DevOps engineers to learn how to interact with the Wavefront query language, which isn't always intuitive, Walker said. In Wavefront's case, this learning curve is balanced by the software's various prebuilt data visualization dashboards. This was the primary reason Walker's team chose Wavefront over open source alternatives, such as Prometheus. Wavefront is also offered as a service, which takes the burden of data management out of Space Ape's hands.
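The decision step in a metrics-driven autoscaler can be sketched in a few lines of Python. This is a hypothetical illustration of the general idea, not Space Ape's service: the function name, the 70% utilization target and the capacity bounds are all assumptions, and the calls that would fetch metrics from a monitoring API or apply the new capacity to a table are deliberately left out.

```python
import math


def desired_capacity(consumed, provisioned, target_pct=70,
                     min_capacity=5, max_capacity=10000):
    """Pick a provisioned capacity that puts peak recent usage at target_pct.

    consumed: recent consumed-capacity samples (hypothetical units per second).
    provisioned: current provisioned capacity, returned unchanged if there
    is no data to act on.
    All defaults are illustrative assumptions, not recommended settings.
    """
    if not consumed:
        return provisioned
    peak = max(consumed)
    wanted = math.ceil(peak * 100 / target_pct)
    # Clamp to the allowed range so a bad reading cannot scale without limit.
    return max(min_capacity, min(max_capacity, wanted))


# Made-up samples: usage has climbed past the current provisioned level of 200.
new_capacity = desired_capacity([120, 180, 210], provisioned=200)
```

In a real service, the output of a function like this would be compared with the current setting and applied through the cloud provider's API, usually with a cooldown to avoid scaling back and forth too often.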
Houghton Mifflin Harcourt chose a different set of tradeoffs with InfluxData. Its SQL-like query language was easy for developers to learn, but the DevOps team must work with outside consultants to build custom dashboards. Because that work isn't finished, InfluxData has yet to completely replace Datadog at Houghton Mifflin Harcourt, though Allen said he hopes to make the switch this quarter.
Time-series monitoring tools scale up beyond the capacity of traditional metrics monitoring tools, but both companies said there's room to improve performance when crunching large volumes of data in response to broad queries. Houghton Mifflin Harcourt, for example, queries millions of data points at the end of each month to calculate Amazon billing trends for each of its Elastic Compute Cloud instances.
"It still takes a little bit of a hit sometimes when you look at those tags, but [InfluxEnterprise version] 1.3 was a real improvement," Allen said.
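The month-end billing rollup amounts to grouping tagged data points by instance and month. The sketch below shows that grouping in plain Python under stated assumptions: the instance IDs, the tuple layout and the cost figures are invented, and a production system would push this aggregation into the database query rather than do it in application code.

```python
from collections import defaultdict


def monthly_cost_by_instance(points):
    """Sum cost per instance tag per month.

    points: iterable of (instance_id, "YYYY-MM", cost) tuples. The layout is
    a hypothetical stand-in for tagged time-series data points.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for instance_id, month, cost in points:
        totals[instance_id][month] += cost
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {inst: dict(months) for inst, months in totals.items()}


# Made-up data points tagged with hypothetical instance IDs.
points = [
    ("i-aaa", "2017-06", 1.5),
    ("i-aaa", "2017-06", 2.5),
    ("i-aaa", "2017-07", 5.0),
    ("i-bbb", "2017-07", 3.0),
]
costs = monthly_cost_by_instance(points)
```

The number of distinct tag values, here one per instance, is what makes broad queries like this expensive at the scale of tens of thousands of instances.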
Allen added that he hopes to use InfluxData's time-series monitoring system to inform decisions about multi-cloud workload placement based on cost. Space Ape Games, meanwhile, will explore AI and machine learning capabilities available for Wavefront, though the jury's still out for Walker and McCormack on whether AIOps will be worth the time it takes to implement. In particular, Walker said he's concerned about false positives from AI analysis against time-series data.