Now that enterprises understand how to deploy Kubernetes, their focus is on observability techniques that can keep complex microservices running smoothly on the platform and fix problems fast and comprehensively.
That's one conclusion that surfaced in responses to the Cloud Native Computing Foundation (CNCF) 2020 end-user survey, released this week. The report showed that the shift follows enterprise experience with containers that has grown along multiple dimensions over the last three years.
Overall, 92% of 1,324 respondents to the survey use containers in production, three times the number that reported production container use in 2016, with 61% using more than 250 containers. Organizations with more than 5,000 containers hit 23% in 2020, compared with 11% three years ago.
This growth in production container use means microservices management has also gone mainstream, and with that comes rising interest in observability. The term has become a buzzword among vendors such as Splunk, Sumo Logic and Instana, and refers to an IT monitoring approach that flexibly queries a centralized repository of data from a wide variety of IT systems.
Even though emerging CNCF observability standards efforts such as OpenTelemetry and OpenMetrics remain at the sandbox stage of development, they garnered the most interest among CNCF survey respondents evaluating new projects, at 20% and 14% respectively.
These trends also prompted the launch of a CNCF observability special interest group, SIG Observability, in June. It will focus on furthering observability standards among CNCF projects.
"With people moving to Kubernetes, sometimes they're not aware of the complexities they're taking on with microservices," said Matt Young, principal cloud architect at online insurance marketplace EverQuote in Cambridge, Mass., and co-founder of SIG Observability. "Whereas a lot of the tooling around logging, tracing and monitoring used to be viewed as nonessential, there's not just a VM anymore -- there are 20 replicas of my service talking to 20 replicas of somebody else's service, and being able to reason on that is huge."
OpenMetrics, OpenTelemetry seek observability accord
So far, OpenTelemetry is the more mature of the CNCF observability standards efforts. It began in May 2019 with the merger of CNCF's OpenTracing and Google's OpenCensus, and has since been adopted by distributed tracing tools such as Jaeger and Grafana Labs' Tempo. Proprietary APM vendors such as New Relic, Dynatrace and Datadog have also signed on to use OpenTelemetry, along with cloud service providers such as AWS.
For enterprises, OpenTelemetry could bring order to an otherwise chaotic explosion of observability tools in the industry this year.
"Anything we use is OpenTelemetry-aware," said Pratik Wadher, vice president of product development at financial software maker Intuit, which this week announced it had finished migrating its TurboTax environment to Kubernetes. "It essentially gives us the ability to take data from anywhere, put it into an operational data lake and apply algorithms and [machine learning] models on it."
The OpenMetrics project was declared stable on Nov. 13, and seeks to standardize a wire format for Prometheus and other open source monitoring tools. OpenTelemetry, which encompasses logs, metrics and traces, will also support OpenMetrics standards.
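To give a sense of what a metrics wire format looks like in practice, here is a minimal sketch of rendering one counter in an OpenMetrics-style text exposition. The function names, the `deployments` metric and the `team` label are hypothetical and chosen only for illustration; the escaping rules and the `_total` and `# EOF` conventions reflect the published text format as commonly described, and should be checked against the specification before being relied on.

```python
from typing import Dict, List, Tuple


def escape_text(value: str) -> str:
    """Escape backslash, double quote and newline for the text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render one counter family as OpenMetrics-style text (illustrative only)."""
    lines = [f"# TYPE {name} counter", f"# HELP {name} {escape_text(help_text)}"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_text(val)}"' for key, val in sorted(labels.items())
        )
        label_part = "{" + pairs + "}" if pairs else ""
        # Counter samples carry a _total suffix; the family name does not.
        lines.append(f"{name}_total{label_part} {value}")
    # An OpenMetrics exposition is terminated by an EOF marker.
    lines.append("# EOF")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: deployments observed per team.
    print(render_counter("deployments", "Deployments observed.", [({"team": "quotes"}, 3)]), end="")
```

A scraper that understands the format can read this output from any tool that emits it, which is the interoperability a standardized wire format is meant to provide.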
Ultimately, OpenMetrics maintainers will submit their standard to the Internet Engineering Task Force for publication under its official memo format, the Request for Comments, which IT pros say could broaden its reach well beyond the CNCF.
"That should give [Prometheus] legitimacy in a much larger ecosystem," said Phil Fenstermacher, a systems engineer at William & Mary, a university in Williamsburg, Va. "I'm more likely to have success asking a vendor to support an internet standard than I am asking them to support yet another monitoring tool."
Enterprises apply observability to BizDevOps
As observability standards emerge, the goals for large enterprises such as Intuit are twofold: improving the mean time to repair (MTTR) issues in microservices environments and using richer IT observability data to deliver business insights.
When Intuit began the move to Kubernetes with TurboTax in 2018, it was motivated primarily by the potential to speed up application releases and increase developer velocity, but it also expected improvements to its MTTR and mean time to detect (MTTD) issues. MTTD did shorten from hours to minutes, but MTTR hasn't decreased as much as Wadher's team would like, he said.
"The focus for us is shifting to a new observability platform that we're investing in heavily now, to bring down MTTR and MTTD further," Wadher said.
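For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from incident timestamps. It is not Intuit's method: the `Incident` record and its field names are hypothetical, and it assumes MTTD is measured from the start of an incident to its detection and MTTR from detection to resolution. Definitions vary between organizations, and some measure MTTR from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: start of the incident to detection (assumed definition)."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to repair: detection to resolution (assumed definition)."""
    return mean_delta([i.resolved - i.detected for i in incidents])


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Incident(datetime(2020, 11, 2, 10, 0), datetime(2020, 11, 2, 10, 4), datetime(2020, 11, 2, 11, 4)),
        Incident(datetime(2020, 11, 3, 12, 0), datetime(2020, 11, 3, 12, 6), datetime(2020, 11, 3, 12, 36)),
    ]
    print("MTTD:", mttd(sample))
    print("MTTR:", mttr(sample))
```

Tracking the two separately shows how detection can improve sharply while repair time lags behind, the pattern Wadher describes.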
Intuit feeds a centralized observability data lake on AWS S3 through Kafka data pipelines, where machine learning algorithms look for anomalies and patterns. Two homegrown interfaces present that data to IT teams: an enterprise health dashboard that shows the status of any service and its dependencies, and a troubleshooting tool that ranks the most significant anomalies in the infrastructure associated with each service.
"Based on that, you can quickly determine if there's something happening in the system and be very surgical in directing people to solve the right problem," Wadher said. These tools need more internal testing and development but will likely join other open source projects Intuit has created based on its internal tools, such as Keiko and Admiral.
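The article does not describe Intuit's models, so the following is only a rough sketch of the general idea of ranking anomalies per service. It swaps in a simple z-score (how many standard deviations the latest reading sits from its recent history) as a stand-in technique. The function names, metric names and sample values are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def anomaly_score(history: List[float], latest: float) -> float:
    """Absolute z-score of the latest reading against its history."""
    center = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A flat history: any deviation at all is treated as maximally anomalous.
        return 0.0 if latest == center else float("inf")
    return abs(latest - center) / spread


def rank_anomalies(
    metrics: Dict[str, Tuple[List[float], float]],
) -> List[Tuple[str, float]]:
    """Return (metric name, score) pairs, most anomalous first."""
    scored = [(name, anomaly_score(hist, latest)) for name, (hist, latest) in metrics.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up readings for one service: (recent history, latest value).
    service_metrics = {
        "latency_ms": ([100, 102, 98, 100], 130),
        "error_rate": ([1, 1, 1, 1], 1),
        "cpu_percent": ([50, 60, 50, 60], 58),
    }
    for name, score in rank_anomalies(service_metrics):
        print(f"{name}: {score:.2f}")
```

A production system would likely account for seasonality, service dependencies and noisy baselines, but even a ranking this crude shows how an operator can be pointed at the most unusual signal first.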
Intuit also imports data into its observability repository from systems beyond its IT infrastructure, ranging from developer IDEs to Zoom and Outlook, to get a better sense of how to improve developer productivity.
"We're collecting Outlook and Zoom data to analyze how many meetings we're having, the average time our developers are spending in meetings, and apply a model to categorize the types of meetings and the size of meetings they're having," Wadher said. "We can then do an experiment to say, 'What if we don't have meetings from 1 to 5 in the afternoon -- will that allow us to give developers more uninterrupted coding time?'"
EverQuote is also working on BizDevOps systems informed by IT observability standards. The company sends data collected via the Linkerd service mesh to Grafana Labs' Cortex repository, along with data from Atlassian's Jira issue tracking software via its own internally written Prometheus exporters.
"We can correlate everything from when deployments happened to when teams were making commits, to how much money we're making on a given day" using this observability system, Young said. "With Grafana, we can take data from the metrics that come out of Linkerd about what the services are actually doing, and directly overlay that on top [of business data] to show when a new service was deployed or systems scaled up and down."