Kubernetes monitoring took a step toward production for enterprise IT shops, thanks to a revamped storage engine in Prometheus 2.0.
Prometheus, originally built at SoundCloud, traces its lineage, as Kubernetes does, to container management projects at Google. Kubernetes is descended from Google's Borg project, and Prometheus has its roots in an adjacent utility called Borgmon. Both projects are hosted by the Cloud Native Computing Foundation, and vendors such as CoreOS develop and support Prometheus alongside their Kubernetes distros. Prometheus uses labels to track instances in the Kubernetes infrastructure, which dovetails with Kubernetes' own label-based method of tracking infrastructure.
Prometheus' approach to Kubernetes monitoring differs from that of traditional IT monitoring tools, which use agents to detect problems and send alerts back to a central system. Instead, Prometheus pulls data from the Kubernetes infrastructure at regular intervals and stores it as time-series data, which compiles a more complete picture of the application environment.
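To make the pull model concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Prometheus code: the `Scraper` class, the `label_set` and `make_counter` helpers, the `job`/`pod` label names and the 15-second interval are all hypothetical, and the targets are simulated in memory rather than scraped over HTTP.

```python
from collections import defaultdict


def label_set(**kv):
    """Build a hashable label set that identifies one time series (hypothetical helper)."""
    return frozenset(kv.items())


def make_counter(step):
    """Simulate a target whose metric grows by `step` each time it is read."""
    state = {"n": 0}

    def read():
        state["n"] += step
        return state["n"]

    return read


class Scraper:
    """Pulls a value from every target on each scrape and stores it as a sample."""

    def __init__(self, targets):
        self.targets = targets            # {label set: callable returning a value}
        self.series = defaultdict(list)   # {label set: [(timestamp, value), ...]}

    def scrape(self, now):
        for key, read in self.targets.items():
            self.series[key].append((now, read()))


targets = {
    label_set(job="web", pod="web-1"): make_counter(5),
    label_set(job="web", pod="web-2"): make_counter(3),
}
scraper = Scraper(targets)
for t in (0, 15, 30):  # simulated 15-second scrape interval (an assumption)
    scraper.scrape(t)

print(scraper.series[label_set(job="web", pod="web-1")])
# [(0, 5), (15, 10), (30, 15)]
```

The point of the sketch is that the monitoring side decides when to collect, so every series has a regular history of samples even when nothing is wrong.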
The time-series approach to monitoring makes Prometheus best suited to guide developers' choices about how to improve an application's performance, rather than to inform operators on how to troubleshoot infrastructure problems, said Edwin Yuen, analyst at Enterprise Strategy Group in Milford, Mass. But operators must understand both perspectives to have a good grasp of Kubernetes monitoring.
"It will have an impact on people who are interested in the old model of alerts and being aware of what's different than normal, rather than tracking what's normal to understand if normal is good or bad," he said.
Improved Kubernetes monitoring prompts a second look
Enterprise IT professionals who have previously evaluated Prometheus said a newly scalable storage engine with better support for long-term data retention and better query performance will make them re-evaluate the software. Others said the data storage upgrade will help manage apps deployed on Kubernetes clusters more effectively when those apps reach production.
Dish Technologies, the engineering arm of Dish Network in Englewood, Colo., has conducted load tests on a new Kubernetes-based application slated for release by the end of 2017 and used Prometheus to track the application's performance during those tests.
"We were able to see traffic flowing at pretty high levels through the Kubernetes components of our applications, and Prometheus tools with Grafana [data visualization] dashboards let us pinpoint problems pretty quickly," said Brad Linder, DevOps and big data evangelist at Dish Technologies.
The load tests require the Prometheus Kubernetes monitoring tool to absorb hundreds of thousands of messages per minute, and it must handle up to 2.5 million messages per minute in production, Linder said.
"We need a tool that can support web-scale traffic, and from what we've seen so far, it seems like Prometheus can," he said.
Prometheus 2.0 turns enterprise heads
Some enterprises expect to use both the traditional alerting approach, with tools such as New Relic, and the detailed time-series approach, because Prometheus 2.0 can support the data retention they need.
At SAP's Concur Technologies in Bellevue, Wash., New Relic gives site reliability engineers a centralized dashboard for server-based production systems. And as more Kubernetes clusters roll out, Prometheus may offer a similarly centralized view of the container infrastructure.
"From there, things go into PagerDuty [alert management tools], which we can integrate really well with Alertmanager within Prometheus, and get alerts even before our customers do," said Dale Ragan, principal software engineer at Concur.
Ragan evaluated Prometheus in early 2016 but decided against it for Kubernetes monitoring because it supported only a two-week data retention period at the time. With a data export plug-in released in version 1.7 and the new storage engine in Prometheus 2.0, he'll re-evaluate the tool. Prometheus 2.0 also marks data as stale when instances in the ephemeral Kubernetes infrastructure disappear, which will make the tool easier to use.
"The other major feature that's going to be really nice is marking data as stale, and making sure we're not alerting on data that's no longer relevant," he said. "It also makes sure we're not alerting that something's down because data took five minutes to collect."
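The idea behind stale marking can be sketched in a few lines of Python. This is a simplified illustration, not how Prometheus implements it: the `STALE` sentinel, the `record_scrape` and `is_down` functions, and the pod names are hypothetical, and the "down" rule assumes a gauge that reads 0 when a target is unhealthy.

```python
STALE = object()  # hypothetical sentinel that marks the end of a series


def record_scrape(series, discovered, readings, now):
    """Append samples for live targets; mark a series stale once its target is gone."""
    for name in discovered:
        series.setdefault(name, []).append((now, readings[name]))
    for name, samples in series.items():
        if name not in discovered and samples[-1][1] is not STALE:
            samples.append((now, STALE))


def is_down(series, name):
    """Alert only on live series whose latest value is 0; stale series never alert."""
    last = series[name][-1][1]
    return last is not STALE and last == 0


series = {}
record_scrape(series, {"pod-a", "pod-b"}, {"pod-a": 1, "pod-b": 1}, now=0)
record_scrape(series, {"pod-a"}, {"pod-a": 0}, now=15)  # pod-b was removed

print(is_down(series, "pod-a"))  # True: still discovered and reporting 0
print(is_down(series, "pod-b"))  # False: marked stale, so no alert
```

Without the marker, an alert rule would have to guess whether a missing sample means a slow collection or a removed instance, which is the situation Ragan describes.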
Despite the intertwined histories of Kubernetes and Prometheus, Dish Technologies' Linder said he'd like to see future releases of Prometheus offer more out-of-the-box hints about key metrics in Kubernetes monitoring data at scale.
"It'd be nice to have at least some recommendations" about the top data points to monitor as an organization learns how to manage Kubernetes, Linder said. "Finding the right data points in the massive amounts of data that flow through this platform is where the challenge is."