Serverless monitoring tools have come a long way in only a few years, but comprehensive visibility into serverless infrastructures is still the coveted holy grail.
Application performance monitoring (APM) tools from most vendors, such as New Relic, AppDynamics, and Dynatrace, pull in logs and metrics for serverless monitoring in AWS Lambda environments based on Amazon CloudWatch APIs. Enterprise serverless users also export CloudWatch logs to log monitoring tools such as Splunk and Sumo Logic.
For users at the trial stage, these tools offer enough insights into serverless functions to get started. But they tend to focus on system-level data, such as memory usage, while different metrics, such as request rates and fine-grained network latencies, are more important.
"In my experience, [CloudWatch metrics] are good for you to take baby steps," said Yan Cui, principal engineer at Dazn, a sports streaming service in London. Cui is also an independent consultant and training specialist in serverless systems and has put serverless systems into production at several previous companies, such as Space Ape Games and social network Yubl Ltd. "But once you run in production with a number of functions that grow more and more complicated, you find yourself needing more help than you can get from Amazon."
Specialized serverless monitoring tools instrument code
Serverless monitoring tools' focus has moved beyond individual functions and infrastructure resources to explore data that reveals how functions combine to form an application. These tools typically inject code into the function and send monitoring data to a dashboard or metrics-collection API.
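To make the instrumented-code idea concrete, here is a minimal Python sketch of a wrapper around a Lambda-style handler. It is not any vendor's actual agent: the `instrument` decorator, the record fields and the `emit` sink are all hypothetical names chosen for illustration, and a real tool would send the record to its own metrics-collection API rather than print it.

```python
import functools
import json
import time


def instrument(emit=print):
    """Wrap a Lambda-style handler and report one record per invocation.

    `emit` is a hypothetical sink (an assumption for this sketch); a real
    tool would ship the record to its own collection endpoint.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            start = time.perf_counter()
            record = {"function": handler.__name__, "error": None}
            try:
                return handler(event, context)
            except Exception as exc:
                # Note the failure type, then let the error propagate unchanged.
                record["error"] = type(exc).__name__
                raise
            finally:
                # Runs on success and on failure, so every call is reported.
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["duration_ms"] = round(elapsed_ms, 3)
                emit(json.dumps(record))
        return wrapper
    return decorator


@instrument()
def hello(event, context):
    return {"statusCode": 200, "body": "ok"}
```

Because the wrapper runs inside the function's own invocation, whatever work it does adds to the function's execution time, which is why the overhead of this approach matters for latency-sensitive workloads.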
Datadog is the first among major APM vendors to offer this capability with custom metrics, though New Relic is testing customers' interest in the instrumented-code approach. Amazon also offers its own such tool for Lambda, called X-Ray.
For startups with totally serverless environments, X-Ray and CloudWatch suffice. In fact, further monitoring information would defeat the purpose of a purely serverless infrastructure, since it would require ops specialists to watch a dashboard and interpret the data, said Joe Emison, founder of a stealth startup that will be entirely serverless.
"Instead of needing a monitoring team, DevOps people or in-house expertise, I can just have developers, and if performance is inconsistent, Amazon gives us enough tools to measure [it]," Emison said.
Outside the startup world, however, the NoOps approach isn't viable, and for some serverless applications X-Ray has limitations.
Lambda user Matson Inc., a cargo shipping company in Honolulu, first looked to X-Ray for a serverless deployment in 2017. But the X-Ray software development kit for Node.js introduced up to six seconds of latency on Lambda cold starts, meaning the first invocation of a function, when AWS has to spin up infrastructure, launch the runtime and start the function's code. Some latency is to be expected with cold starts, but six seconds per function was unacceptable, said Dave Townsend, principal software engineer at Matson.
Instead, Townsend deployed IOpipe, a third-party tool for AWS Lambda, which added far less latency and also worked better than CloudWatch monitoring, he said.
"With CloudWatch, you have to wait for the information to get into the dashboard, but with IOpipe, it's almost instant, almost within a second or two," he said.
IOpipe also facilitates useful queries into serverless monitoring data that surpass raw monitoring and enter the realm of observability, Townsend said. Observability, a term that's rapidly gained momentum in DevOps circles, refers to a system-level grasp of how complex apps function from moment to moment, as opposed to monitoring individual system components for potential problems.
"If CloudWatch says something's gone wrong, I can go into IOpipe and start querying the system about how many Lambda functions executed over a given timeframe," he said. "One tells me something bad happened -- the other allows me to go find more details about what's happening."
However, for latency-sensitive apps, such as those built on Amazon API Gateway and Alexa Skills Kit, IOpipe's approach is not always a good fit, as it adds between five and 20 milliseconds to the overall execution time of a function.
Neither X-Ray nor IOpipe yet offers a complete map of serverless functions and all their dependencies. In Amazon's case, X-Ray traces don't yet persist through API Gateway and DynamoDB calls, and X-Ray doesn't support third-party services.
"The problem with X-Ray is that it's only in certain services ... there's always going to be parts of it where the trace just disappears," said Ben Kehoe, cloud robotics research scientist at consumer robot maker iRobot Corporation in Bedford, Mass. "It's going to be a long time before X-Ray gets to that point and has the task of coordinating across AWS' distributed organization."
This limitation sends some users back to the DIY drawing board, where they add correlation ID tags to each function and collect them with a separate tool, as Cui's team did in his previous role at Yubl Ltd.
"As you start using more and more event sources on Lambda, it becomes harder and harder to debug problems because of the lack of traceability," Cui said.
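The DIY correlation-ID approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of what Cui's team built at Yubl: the `x-correlation-id` key, the event shape and every function name here are hypothetical. The idea is that each function reuses the ID it received, or mints one if it is the first hop, stamps that ID on every log line and passes it along to whatever it calls next.

```python
import json
import uuid

# Hypothetical header/attribute name, chosen for this sketch.
CORRELATION_KEY = "x-correlation-id"


def get_correlation_id(event):
    """Reuse the caller's ID if present; otherwise start a new trace."""
    headers = event.get("headers") or {}
    return headers.get(CORRELATION_KEY) or str(uuid.uuid4())


def log(correlation_id, message, **fields):
    """Emit one structured log line tagged with the correlation ID."""
    line = json.dumps(
        {"correlation_id": correlation_id, "message": message, **fields}
    )
    print(line)
    return line


def outgoing_message(correlation_id, body):
    """Build a downstream payload that carries the same ID forward."""
    return {"headers": {CORRELATION_KEY: correlation_id}, "body": body}


def handler(event, context):
    cid = get_correlation_id(event)
    log(cid, "order received")
    return outgoing_message(cid, {"status": "queued"})
```

With every function logging the same ID, a separate log tool can filter on that one value to reassemble the path a request took, even across event sources that a tracing service does not follow.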
Serverless monitoring and the quest for observability
The newest serverless monitoring tools, which are in development by startups such as Thundra and Epsagon, look to fill the end-to-end observability gap for serverless infrastructures, with insight into their dependencies on cloud and third-party services.
Epsagon uses machine learning and AI techniques to automatically discover all the parts of a serverless infrastructure and how they fit together. The tool is in private beta, and Cui said a demo he saw looked promising.
"Epsagon is focused on giving you traceability across multiple systems and functions, as opposed to a lot of the vendors I've spoken to," Cui said.
Serverless-compatible observability tools are also available from emerging vendors that don't just focus on serverless apps. Matson's Townsend said he's begun to evaluate a tool from Honeycomb.io, built by former Facebook engineers who faced traceability problems in complex webscale systems. As with IOpipe, Honeycomb offers sophisticated queries into back-end data. Honeycomb also stores back-end data in a database optimized for fast responses to complex questions.
"It's about really going in there and asking questions of the system," Townsend said of his interest in Honeycomb. "I just want that at my fingertips."