Monitoring an application holistically is still a challenge for many companies, whether they run conventional infrastructure or newer platforms such as serverless or containers.
Monitoring an application generally has two aspects:
- System Metrics such as errors, invocations, latency, memory, and CPU usage
- Business Analytics such as number of signups, number of emails sent, transactions processed, etc.
The former is fairly universal and applies to any stack to a varying degree. This is what I would call the undifferentiated aspect of monitoring an application. The ability to detect errors and track performance metrics is absolutely necessary to operate an application.
Everything that is old is new again. I am a huge fan of the Twelve-Factor App. If you aren’t familiar with it, I highly suggest taking a look. Drafted in 2011 by developers at Heroku, the Twelve-Factor App is a methodology and set of best practices designed to make applications portable and resilient when deployed to the web.
The Twelve-Factor App manifesto states that applications should treat logs as event streams and leave it up to the execution environment to aggregate them. If we are going to gather information from our application, why not put it in the logs? We can use our event stream (i.e. the application log) to create time-series metrics. Time-series metrics are just data points that have been sampled and aggregated over time, which lets developers and engineers track performance and correlate it with events at a specific time.
AWS Lambda works almost exactly this way by default, aggregating its logs via AWS CloudWatch. CloudWatch organizes logs by function, version, and container, while Lambda adds metadata for each invocation. It is up to the developer to add application-specific logging to their function. CloudWatch, however, will only get you so far. If we want to track more than invocations, latency, or memory utilization, we need to analyze the logs more deeply. This is where tools like Splunk or Kibana come into play.
To get to the meat of our application and the value it is delivering, we need to make sure additional information (telemetry) goes to the logs as well, for example:
- Timeouts
- Configuration failures
- Stack traces
- Event objects
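As a rough illustration, telemetry like this could be written to the log as tagged lines with a JSON payload. This is only a sketch: the "-telemetry." prefix, the telemetryLine helper, and the sample payloads are assumptions made up for this example, not part of any AWS or logging API.

```go
package main

import (
	"encoding/json"
	"log"
	"runtime/debug"
)

// telemetryLine renders one log line: a kind tag followed by a JSON payload.
// The "-telemetry." prefix is a hypothetical convention chosen so a log tool
// can filter on it; any stable, searchable marker would do.
func telemetryLine(kind string, payload interface{}) string {
	body, err := json.Marshal(payload)
	if err != nil {
		// Still emit the tag so the failure itself is visible in the logs.
		return "-telemetry." + kind + " {\"marshal_error\":true}"
	}
	return "-telemetry." + kind + " " + string(body)
}

func main() {
	// Event object: the (made-up) input that triggered this invocation.
	event := map[string]string{"method": "GET", "path": "/blob/42"}
	log.Println(telemetryLine("event", event))

	// Timeout and configuration failure, with illustrative values.
	log.Println(telemetryLine("timeout", map[string]string{"after": "3s"}))
	log.Println(telemetryLine("config_failure", map[string]string{"missing": "BUCKET_NAME"}))

	// Stack trace of the current goroutine, captured as a string field.
	log.Println(telemetryLine("stacktrace", map[string]string{"trace": string(debug.Stack())}))
}
```

Keeping the payload as JSON means most log tools can extract individual fields later without custom parsing rules.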
Logging these kinds of events lets tools with rich query languages build dashboards showing just about anything we want.
For instance, let’s say we added the following line of code to our application to record an event from a specific invocation and capture additional information about its execution:
log.Printf("-metrics.%s.blob.%s", environment, method)
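In context, that line might sit behind a small helper so every handler emits the marker in the same format. A minimal sketch, assuming made-up environment and method values; metricLine is a name invented here for illustration:

```go
package main

import (
	"fmt"
	"log"
)

// metricLine builds the marker string that a log query can later match on.
// The "blob" segment mirrors the service name used in the line above.
func metricLine(environment, method string) string {
	return fmt.Sprintf("-metrics.%s.blob.%s", environment, method)
}

func main() {
	environment := "prod" // hypothetical; typically read from configuration

	// Emit one marker per handled request method.
	for _, method := range []string{"GET", "PUT", "DELETE"} {
		log.Println(metricLine(environment, method))
	}
}
```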
In a system that tracks time-series metrics in logs (e.g. Sumo Logic), we could build a query like this:
"-metrics.prod.blob." | parse "-metrics.prod.blob.*" as method | timeslice 5m | count(method) group by _timeslice, method | transpose row _timeslice column method
This would give us a breakdown, per five-minute slice, of the different methods called in a CRUD or RESTful service, which can then be visualized in the very same tool.
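To make the query's steps concrete, here is a rough Go equivalent of what it computes: filter on the marker, parse the method that follows it, bucket entries into five-minute slices, and count per method. The entry type, sampleEntries, and the timestamps are made up for illustration; in practice the log tool does this work for us.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const marker = "-metrics.prod.blob."
const sliceWidth = 5 * time.Minute

// entry is a hypothetical parsed log record: when it was written and its text.
type entry struct {
	at   time.Time
	text string
}

// countByTimeslice keeps entries containing the marker, takes the text after
// the marker as the method, and counts methods per timeslice. The outer key
// is the slice start in RFC 3339 form (UTC).
func countByTimeslice(entries []entry, width time.Duration) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	for _, e := range entries {
		i := strings.Index(e.text, marker)
		if i < 0 {
			continue // not a metrics line
		}
		method := strings.TrimSpace(e.text[i+len(marker):])
		slice := e.at.UTC().Truncate(width).Format(time.RFC3339)
		if counts[slice] == nil {
			counts[slice] = make(map[string]int)
		}
		counts[slice][method]++
	}
	return counts
}

// sampleEntries returns invented log data for the demonstration.
func sampleEntries() []entry {
	start := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	return []entry{
		{start.Add(1 * time.Minute), "-metrics.prod.blob.GET"},
		{start.Add(2 * time.Minute), "-metrics.prod.blob.GET"},
		{start.Add(3 * time.Minute), "-metrics.prod.blob.PUT"},
		{start.Add(6 * time.Minute), "-metrics.prod.blob.GET"},
		{start.Add(7 * time.Minute), "unrelated log line"},
	}
}

func main() {
	// Map iteration order is not fixed, so slices may print in any order.
	for slice, methods := range countByTimeslice(sampleEntries(), sliceWidth) {
		fmt.Println(slice, methods)
	}
}
```

The nested map plays the role of the transposed result: one row per timeslice, one column per method.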
While visualization is nice, particularly when taking a closer look at a problem, a dashboard alone might not make it immediately apparent that there is a problem. For that we need some way to grab the attention of the engineers or developers working on the application. Many of the tools mentioned here support some level of monitoring and alerting.
In the next installment of this series we will talk about increasing visibility into your operations and battling dashboard apathy! Check back next week.
-Lars Cromley, Director, Cloud Advocacy and Innovation