Operating and maintaining systems at scale with automation

Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools that 2nd Watch develops to solve these problems. Below I will outline our approach to this problem and its 3 main components: Collect, Model, and React.

Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.

Most of the data I described above can be gathered from the AWS & Azure APIs directly, while others may need to be ingested with an agent or by custom scripts. We also need to make sure we have a consistent core set of data being brought in for each of our customers, while also expanding that to include some specialized data that perhaps only certain customers may have. All the data is gathered and sent to our Splunk indexers. We build an index for every customer to ensure that data stays segregated and secure.

Model: Next we need to present the data in a useful way. The modeling of the data can vary depending on who is using it or how it is going to be consumed. A dashboard with a quick look at several important metrics can be very useful to an engineer to see the big picture. Seeing this data daily or throughout the day will make anomalies very apparent. This is especially helpful because gathering and organizing the data at scale can be time consuming, and thus could reasonably only be done during periodic audits.

Modeling the data in Splunk allows for a low overhead view with up-to-date data so the engineer can do more important things. A great example of this is provisioned resources by region. If the engineer looks at the data on a regular basis, they would quickly notice that the number of provisioned resources has drastically changed. A 20% increase in the number of EC2 resources could mean several things; Perhaps the customer is doing a large deployment, or maybe Justin accidently put his AWS access key and secret key on GitHub (again).

We provide our customers with regular reports and reviews of their cloud environments. We also use the data collected and modeled in this tool for providing that data. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used the historical tending data to show progress of a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project,t but we used that same view to ensure that waste did not creep back up and that the new tagging standards were being applied going forward.

React: Finally, it’s time to act on the data we collected and modeled. Using Splunk alerts we can provide conditional logic to the data patterns and act upon them. From Splunk we can call our ticketing system’s API and create a new incident for an engineer to investigate concerning trends or to notify the customer of a potential security risk. We can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, requesting cloud provider limit increases, etc.

Because we have several independent data sources providing information, we can also correlate events and have more advanced conditional logic. If we see that a server is failing status checks, we can also look to see if it recently changed instance families or if it has all the appropriate drivers. This data can be included in the incident and available for the engineer to review without having to check it themselves.

The entire premise of this idea and the solution it outlines is about efficiency and using data and automation to make quicker and smarter decisions. Operating and maintaining systems at scale brings forth numerous challenges and if you are unable to efficiently accommodate the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.

For help getting started in automating your systems, contact us.

-Kenneth Weinreich, Managed Cloud Operations

Leave a Reply

Your email address will not be published. Required fields are marked *