The holy grail of IT Operations is a state where every mundane, repeatable remediation happens without intervention, and a human is woken only for actions that truly cannot be automated. That not only makes for more restful nights, it also lets IT operations teams become more agile while maintaining a proactive, highly optimized enterprise cloud. Reaching that state can feel like something out of the greatest online fantasy game, but the growing popularity of "AIOps" gives real hope that it is closer to reality than once thought.
Skeptics will tell you that automation, autonomics, orchestration, and optimization have been alive and well in the datacenter for more than a decade. Companies like Microsoft with System Center, IBM with Tivoli, and ServiceNow are just a few examples of autonomic platforms that collect, analyze, and act on sensor data derived from physical and virtual infrastructure and appliances. But when you couple these capabilities with the advancements brought by AIOps, you are able to take advantage of the previously missing components: big data analytics along with artificial intelligence (AI) and machine learning (ML).
As you can imagine, these advancements have brought an explosion of new tooling and services from cloud ISVs aimed at making the once-utopian autonomic cloud a reality. Palo Alto Networks' Prisma Public Cloud is a great example of a technology with autonomic capabilities. Its security and compliance features are impressive on their own, but it also includes a component known as User and Entity Behavior Analytics (UEBA). UEBA analyzes user activity data from logs, network traffic, and endpoints and correlates it with security threat intelligence to identify activities, or behaviors, likely to indicate a malicious presence in your environment. After analyzing the current vulnerability and risk landscape, it reports on that state and derives a set of guided remediations that can be performed manually against the infrastructure in question or automated entirely, providing a proactive, hands-off response so that vulnerabilities are addressed and security compliance is maintained.
Another ISV focused on AIOps is Moogsoft, which is bringing a next-generation platform for IT incident management to the cloud. Moogsoft has purpose-built machine learning algorithms designed to better correlate alerts and reduce much of the noise across all those data points. Married with its artificial intelligence capabilities for IT operations, this helps DevOps teams operate smarter, faster, and more effectively when automating traditional IT operations tasks.
As we move forward, expect to see more AI- and ML-based functionality move into the core cloud management platforms as well. Amazon recently released AWS Control Tower to aid your company's journey toward AIOps. Along with some impressive features for new account creation and increased multi-account visibility, it uses service control policies (SCPs) based upon established guardrails (rules and policies). As new resources and accounts come online, Control Tower can force compliance with those policies automatically, preventing "bad behavior" by users and eliminating the need for IT to configure resources after they come online. Once AWS Control Tower is in use, these guardrails apply across multi-account environments and to new accounts as they are created.
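Control Tower manages its guardrails for you, but under the hood a preventive guardrail boils down to a service control policy. As a rough, hand-written approximation only (not a policy Control Tower actually generates, and all names and Ids below are placeholders), a guardrail along the lines of "disallow changes to CloudTrail" looks something like this:

# Hand-rolled approximation of a preventive guardrail; names and Ids are placeholders
$ aws organizations create-policy \
    --name DenyCloudTrailChanges \
    --description "Deny disabling or altering CloudTrail" \
    --type SERVICE_CONTROL_POLICY \
    --content '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":["cloudtrail:StopLogging","cloudtrail:DeleteTrail","cloudtrail:UpdateTrail"],"Resource":"*"}]}'

# Attach the SCP to an OU so it applies to every account underneath it
$ aws organizations attach-policy \
    --policy-id p-examplepolicyid \
    --target-id ou-exampleouid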
It is an exciting time for autonomic platforms and autonomic systems capabilities in the cloud, and we are excited to help customers realize the benefits of automating, orchestrating, and proactively maintaining and optimizing their core cloud infrastructure.
To learn more about autonomic systems and capabilities, check out Gartner’s AIOps research and reach out to 2nd Watch. We would love to help you realize the potential of autonomic platforms and autonomic technologies in your cloud environment today!
Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools 2nd Watch develops to solve these problems. Below I outline our approach and its three main components: Collect, Model, and React.
Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.
Most of the data described above can be gathered directly from the AWS and Azure APIs, while the rest needs to be ingested by an agent or by custom scripts. We also need a consistent core set of data brought in for every customer, while leaving room for specialized data that only certain customers may have. All of it is gathered and sent to our Splunk indexers, and we build an index for each customer to ensure data stays segregated and secure.
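To make the Collect step a little more concrete, here is a minimal sketch of one such path: pulling a single CloudWatch datapoint with the AWS CLI and forwarding it to a per-customer Splunk index over the HTTP Event Collector (HEC). The Splunk hostname, HEC token, index name, and instance Id are all placeholders.

# Pull one CloudWatch metric (instance Id is a placeholder)
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2019-06-01T00:00:00Z --end-time 2019-06-01T01:00:00Z \
    --period 300 --statistics Average > cpu.json

# Forward it to the customer's dedicated Splunk index via HEC
# (hostname, token, and index name are placeholders)
$ curl -k https://splunk.example.com:8088/services/collector/event \
    -H "Authorization: Splunk 00000000-0000-0000-0000-000000000000" \
    -d "{\"index\": \"customer_acme\", \"sourcetype\": \"aws:cloudwatch\", \"event\": $(cat cpu.json)}"

In practice this kind of collection runs on a schedule per customer and per data source rather than as one-off commands.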
Model: Next we need to present the data in a useful way. How the data is modeled can vary depending on who is using it and how it will be consumed. A dashboard with a quick look at several important metrics can help an engineer see the big picture, and looking at that data daily, or throughout the day, makes anomalies readily apparent. This is especially valuable because gathering and organizing the data at scale is time consuming and would otherwise only happen during periodic audits.
Modeling the data in Splunk allows for a low-overhead, up-to-date view so the engineer can focus on more important things. A great example of this is provisioned resources by region. An engineer who looks at the data regularly will quickly notice when the number of provisioned resources changes drastically. A 20% increase in EC2 resources could mean several things: perhaps the customer is doing a large deployment, or maybe Justin accidentally put his AWS access key and secret key on GitHub (again).
We provide our customers with regular reports and reviews of their cloud environments, and we use the data collected and modeled in this tool to drive them. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used historical trending data to show progress on a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project, but we used that same view to ensure waste did not creep back up and that the new tagging standards were applied going forward.
React: Finally, it's time to act on the data we have collected and modeled. Using Splunk alerts, we can apply conditional logic to data patterns and act on them. From Splunk we can call our ticketing system's API and create a new incident for an engineer to investigate a concerning trend or to notify the customer of a potential security risk. We can also call our own APIs to trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, and requesting cloud provider limit increases.
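As a concrete example of what one of those remediation calls can look like, here is the kind of AWS CLI command a workflow might issue to turn on default encryption for an S3 bucket that was created without it (the bucket name is a placeholder):

# Enable default AES256 encryption on a bucket flagged by an alert
# (bucket name is a placeholder)
$ aws s3api put-bucket-encryption \
    --bucket example-unencrypted-bucket \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'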
Because we have several independent data sources providing information, we can also correlate events and apply more advanced conditional logic. If we see that a server is failing status checks, we can also check whether it recently changed instance families or whether it has all the appropriate drivers. This data can be included in the incident and made available for the engineer to review without having to gather it themselves.
The premise of this approach is efficiency: using data and automation to make quicker, smarter decisions. Operating and maintaining systems at scale brings numerous challenges, and if you cannot efficiently accommodate the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.
For help getting started in automating your systems, contact us.
In my previous blog I gave a fairly high-level overview of what automated AWS account management could (or rather should) entail. This blog will drill deeper into the processes and give you some real-world code samples of what this looks like.
AWS Organizations and Linked Account Creation:
As mentioned in my last blog, AWS recently announced the general availability of AWS Organizations, allowing you to create linked or nested AWS accounts under a master account and apply policy-based management under the umbrella of the root account. It also allows for hierarchical management (up to five levels deep) of linked accounts via Organizational Units (OUs). Policies can be applied at the global level, the OU level, and the individual account level. It is important to note that conflicting policies always defer to the parent entity's permission set: an IAM user/role in an account may have permission to perform some action, but if the account, OU, or global settings at the Organizations level deny that action, the resulting action for the IAM resource will be denied. Likewise, the effective permissions for a resource are the intersection of the resource's direct permissions assigned in IAM and the permissions controlled by Organizations. This means you can lock linked accounts down to do things like "only manage Route53 DNS resources" or "only manage S3 resources" using Organizations policies. It's a pretty nice way of segmenting off security and reducing the potential blast radius.
I am going to pick the lowest common denominator for the following examples: the AWS CLI. Though I rarely use it for actual automation code, most folks are familiar with it, and it has a pretty intuitive syntax.
Step 1: Enable Organizations on your root account
Ensure that your AWS profile environment variable is set to the root account profile that has the necessary permissions to work with AWS Organizations. Alternatively, if you don't want to use an environment variable, you can either make sure the default AWS profile is the one with permissions on your root account or specify the --profile argument with your AWS CLI commands. I'm going to use the AWS_DEFAULT_PROFILE environment variable in my examples here (output redacted).
> export AWS_DEFAULT_PROFILE=myrootacctadmin
This of course assumes you have a profile set up under your HOME dir in the .aws/credentials file named myrootacctadmin.
Now that we have our environment set, we can get on with running the AWS CLI commands to create our organization.
Let’s be safe and make sure we don’t already have an organization created under our root account:
$ aws organizations list-roots
An error occurred (AWSOrganizationsNotInUseException) when calling the ListRoots operation: Your account is not a member of an organization.
As the error message indicates, this account is not currently a part of any organization and will need to be configured to use organizations if we want to use this as our master account and create linked accounts underneath it.
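Creating the organization is a single call. The output below is abridged and illustrative; the account Id, email, and root Id are placeholders.

$ aws organizations create-organization --feature-set ALL
{
    "Organization": {
        "Id": "o-exampleorgid",
        "MasterAccountId": "111111111111",
        "MasterAccountEmail": "myrootacctadmin@example.com",
        "FeatureSet": "ALL"
    }
}

Re-running the earlier command now returns a root instead of an error:

$ aws organizations list-roots
{
    "Roots": [
        {
            "Id": "r-examplerootid",
            "Name": "Root",
            "PolicyTypes": [
                {
                    "Type": "SERVICE_CONTROL_POLICY",
                    "Status": "ENABLED"
                }
            ]
        }
    ]
}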
Indeed! Our myrootacctadmin account is listed as the root (i.e., master) of our entire organization. This is exactly what we wanted. Now let's see what AWS accounts are identified as part of this organization:
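The list-accounts call shows only the master account so far (Ids, names, and emails below are placeholders, and the output is abridged):

$ aws organizations list-accounts
{
    "Accounts": [
        {
            "Id": "111111111111",
            "Name": "myrootacctadmin",
            "Email": "myrootacctadmin@example.com",
            "Status": "ACTIVE"
        }
    ]
}

Step 2: Create a New Linked Account

Creating a linked account only requires a unique email address and an account name. Something along these lines (email and name are placeholders):

$ aws organizations create-account \
    --email newlinkedaccount@example.com \
    --account-name my-new-linked-account
{
    "CreateAccountStatus": {
        "Id": "car-examplecreateaccountrequestid",
        "AccountName": "my-new-linked-account",
        "State": "IN_PROGRESS"
    }
}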
The actual creation of the account is not instantaneous; the API responds to the create-account call before the new account creation is complete. While it is usually quick, unless we ensure the request has completed before performing any additional automation against the account, we may receive an error from the API indicating the account is not yet ready. So prior to performing additional configuration on the new account, we need to ensure the State has reached SUCCEEDED. In your automation code you will generally loop until State equals SUCCEEDED before moving on to the next step, and it is a good idea to catch failures (e.g. State == "FAILED") and handle them gracefully. The account creation status can be retrieved as follows:
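Polling looks like this, using the request Id returned by create-account (Ids are placeholders and the output is abridged):

$ aws organizations describe-create-account-status \
    --create-account-request-id car-examplecreateaccountrequestid
{
    "CreateAccountStatus": {
        "Id": "car-examplecreateaccountrequestid",
        "AccountName": "my-new-linked-account",
        "State": "SUCCEEDED",
        "AccountId": "333333333333"
    }
}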
Congratulations! You’ve just enabled AWS Organizations and created your first linked account!
At this point you should have a couple of emails from AWS in the inbox of the email address used to create the new account. They are standard boilerplate emails: one is a "Welcome to Amazon Web Services" email, and the other tells you that your account is ready and includes some "getting started" links.
Step 3: Reset New Linked Account Root Password
Now that your linked account has been created, you will need to go through the AWS root account password reset workflow to make your new account accessible from either the AWS Web Console or the AWS APIs. The recommended approach is to reset the root account password, enable MFA, create an IAM user with administrator privileges, store the root account secrets in a VERY secure place, and only use them as a last resort for account access.
Here’s a shortened URL that will take you directly to the root account password reset page: https://amzn.pw/45Nxe
Step 4: (Optionally) Create Organizational Units
Let's go through a couple of examples of Organizational Units (both are sketched below):
1.) An OU that only allows S3 services
2.) An OU that only allows services in the us-west-2 and us-east-1 regions
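Here is a rough sketch of both, using the AWS CLI against the master account. All Ids and names are placeholders, and the exact policy wording is my own assumption rather than a canonical recipe. Note that the first, whitelist-style SCP only has the intended effect once the default FullAWSAccess policy is detached from the OU, and the second relies on the aws:RequestedRegion condition key while exempting global services.

# 1.) OU that only allows S3 (Ids and names are placeholders)
$ aws organizations create-organizational-unit \
    --parent-id r-examplerootid --name s3-only
$ aws organizations create-policy \
    --name AllowS3Only \
    --description "Whitelist SCP that only allows S3 actions" \
    --type SERVICE_CONTROL_POLICY \
    --content '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:*","Resource":"*"}]}'
$ aws organizations attach-policy \
    --policy-id p-s3onlyexample --target-id ou-s3onlyexample

# 2.) OU that only allows services in us-west-2 and us-east-1
$ aws organizations create-organizational-unit \
    --parent-id r-examplerootid --name approved-regions-only
$ aws organizations create-policy \
    --name DenyOtherRegions \
    --description "Deny requests made outside us-west-2 and us-east-1" \
    --type SERVICE_CONTROL_POLICY \
    --content '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","NotAction":["iam:*","organizations:*","sts:*","support:*"],"Resource":"*","Condition":{"StringNotEquals":{"aws:RequestedRegion":["us-west-2","us-east-1"]}}}]}'
$ aws organizations attach-policy \
    --policy-id p-regionsexample --target-id ou-regionsexample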
“What if I want to bring my existing accounts under the umbrella of Organizations?” you ask
Good news! You can invite existing AWS accounts to join your organization. Using the API you can issue an invitation to an existing account by Account ID, Email, or Organization. For the sake of simplicity, let’s use an Account ID (222222222222) for the following example (again, using the root/master account AWS profile):
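The invitation itself is one call. The handshake Id in the response below is a placeholder and the output is trimmed, but the timestamps are worth paying attention to:

$ aws organizations invite-account-to-organization \
    --target '{"Type": "ACCOUNT", "Id": "222222222222"}' \
    --notes "Please join our organization"
{
    "Handshake": {
        "Id": "h-examplehandshakeid",
        "State": "REQUESTED",
        "RequestedTimestamp": 1524610827.55,
        "ExpirationTimestamp": 1525906827.55
    }
}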
A couple of things to note: the handshake Id is what will be required to accept the invitation on the linked account side. Also notice the difference between the RequestedTimestamp (epoch 1524610827.55) and the ExpirationTimestamp (epoch 1525906827.55): 1,296,000 seconds. Divide that by the 86,400 seconds in a day and we get 15 days.
At this point you have 15 days to issue an acceptance of the invitation (aka the handshake) from the target AWS account. You could simply log in to the AWS Web Console, navigate to Organizations, and accept the invitation, but that's not what this article is about, is it? We're talking automation here! And, as all good DevOpsers know, we utilize security entities that employ PoLP (Principle of Least Privilege) to perform process-specific tasks.
This means we aren't going to do something ludicrous like adding AWS access keys to our root account login (please don't ever do this). Nor are we going to create an IAM user with administrator access for this very specific task. You can create either a User or a Role in the target account to accept the handshake, although creating a Role will require you to assume that Role using STS, which might be overkill. On the other hand, you might use a Lambda function to automate the handshake, in which case you most certainly would use an IAM Role. Either way, the following IAM policy document will provide the User/Role with the required permissions to accept (or decline) the invitation:
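A minimal policy along these lines should do it. The exact action list here is my own sketch; per AWS documentation, accepting a handshake to join an organization with all features enabled may additionally require iam:CreateServiceLinkedRole.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "organizations:ListHandshakesForAccount",
                "organizations:DescribeHandshake",
                "organizations:AcceptHandshake",
                "organizations:DeclineHandshake"
            ],
            "Resource": "*"
        }
    ]
}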
Using the AWS CLI (leveraging a profile of a User/Role with the aforementioned permissions under the existing target account), you would issue the following command to accept the invitation/handshake:
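Something like this, where the handshake Id is the one returned when the invitation was issued (placeholder shown) and the response is trimmed:

$ aws organizations accept-handshake --handshake-id h-examplehandshakeid
{
    "Handshake": {
        "Id": "h-examplehandshakeid",
        "State": "ACCEPTED",
        "RequestedTimestamp": 1524610827.55,
        "ExpirationTimestamp": 1525906827.55
    }
}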
The returned JSON struct is the exact same handshake struct returned by the API when we issued the invitation, with one important difference: the State property now reflects a value of ACCEPTED.
That’s it. You’ve successfully linked an existing account into your Organization under the master billing account.
In the next installment, I will go into depth on the processes involved in automating the Account Initialization, Configuration, and Continuous Compliance.
Thanks for tuning in!
-Ryan Kennedy, Principal Cloud Automation Architect
Let's start with a small look at the current landscape of technology and how we arrived here. There aren't many areas of tech that have not been, or are not currently, in a state of flux. Everything from software delivery vehicles and development practices to infrastructure creation has experienced some degree of transformation over the past several years. From VMs to containers, it seems like almost every day the technology tool belt grows a little bigger, and our world gets a little better (though perhaps more complex) because of these advancements. For me, this was incredibly apparent when I began to delve into configuration management, which later evolved into what we now call "infrastructure as code".
The transformation began with the simple tools we once used to manage a few machines (like bash scripts or Makefiles), which morphed into more complex systems (CFEngine, Puppet, and Chef) to manage thousands of them. As configuration management software matured, engineers and developers began leaning on it to do more. With the advent of hypervisors and the rise of virtual machines, it was only a short time before hardware requests became API requests, and infrastructure as a service (IaaS) was born. With all the new capabilities and options in this brave new world, we once again started to lean on our configuration management systems, this time for provisioning and not just convergence.
Provisioning & Convergence
I mentioned two terms that I want to clarify: provisioning and convergence. Say you are a car manufacturer and you want to make a car. Provisioning is the step in which you request the raw materials to make the parts for your automobile; this is where we use tools like Terraform, CloudFormation, or Heat. Convergence is the assembly line, where we check each part and assemble the final product (utilizing config management software).
By and large, the former tends to be declarative with little in the way of conditionals or logic, while the latter is designed to be robust and malleable software that supports all the systems we run and plan on running. This is the frame for the remainder of what we are going to talk about.
By separating the concerns of our systems, we can create a clear delineation of purpose for each tool, so we don't feel like we are trying to jam everything into an interface that isn't well suited to our platform or, more importantly, our users. The remainder of this post is directed toward the provisioning aspect of configuration management.
Standards and Standardization
These are two different things in my mind. Standardization is extremely prescriptive and can often feel oppressive to professional knowledge workers such as engineers or developers; it can be seen as taking the innovation out of the job. Standards, on the other hand, provide boundaries, frame the problem, and allow for innovative approaches to solutions. I am not saying standardization is entirely bad in some areas, but we should let the people who do the work have the opportunity to grow and innovate in their own way, with guidance. The topic of standards and standardization is part of a larger conversation about culture and change, and we intend to follow up with a series of blog articles on organizational change in the era of the public cloud in the coming weeks.
So, let's say we make a standard for our new EC2 instances running Ubuntu: all instances must be built from the latest official Canonical Ubuntu 14.04 AMI and must carry these three tags: Owner, Environment, and Application. How can we enforce that during development of our infrastructure? On AWS we can create AWS Config Rules, but that is reactive and requires ad-hoc remediation. What we really want is a more prescriptive approach that brings our standards closer to the development pipeline. One of the ways I like to solve this is by creating an abstraction. Say we have a Terraform template that looks like this:
# Create a new instance of the latest Ubuntu 14.04 on a t2.micro

provider "aws" {
  region = "us-west-2"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "web" {
  ami           = "${data.aws_ami.ubuntu.id}"
  instance_type = "t2.micro"

  tags {
    Owner       = "DevOps Ninja"
    Environment = "Dev"
    Application = "Web01"
  }
}
This would meet the standard that we have set forth, but we are relying on the developer or engineer to adhere to that standard. What if we enforce this standard by codifying it in an abstraction? Let’s take that existing template and turn it into a terraform module instead.
Module
# Create a new instance of the latest Ubuntu 14.04 on a t2.micro

variable "aws_region" {}
variable "ec2_owner" {}
variable "ec2_env" {}
variable "ec2_app" {}
variable "ec2_instance_type" {}

provider "aws" {
  region = "${var.aws_region}"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "web" {
  ami           = "${data.aws_ami.ubuntu.id}"
  instance_type = "${var.ec2_instance_type}"

  tags {
    Owner       = "${var.ec2_owner}"
    Environment = "${var.ec2_env}"
    Application = "${var.ec2_app}"
  }
}
Now we can have our developers and engineers leverage our tf_ubuntu_ec2_instance module.
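A consuming configuration might then look something like this; the module source is a placeholder for wherever you host the module (a Git repo, the Terraform registry, or a local path):

# Consume the module; source is a placeholder for your module's location
module "web01" {
  source = "git::https://github.com/example-org/tf_ubuntu_ec2_instance.git"

  aws_region        = "us-west-2"
  ec2_owner         = "DevOps Ninja"
  ec2_env           = "Dev"
  ec2_app           = "Web01"
  ec2_instance_type = "t2.micro"
}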
This doesn't enforce the usage of the module, but it does create an abstraction that provides an easy way to maintain standards without a ton of overhead. It also provides a pattern for creating further modules that enforce these particular standards.
This leads us to another method of implementing standards, one that is more prescriptive and falls squarely into the category of standardization (eek!). One of the most underutilized services in the AWS product stable has to be AWS Service Catalog.
AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS. These IT services can include everything from virtual machine images, servers, software, and databases to complete multi-tier application architectures. AWS Service Catalog allows you to centrally manage commonly deployed IT services, and helps you achieve consistent governance and meet your compliance requirements, while enabling users to quickly deploy only the approved IT services they need.
The Interface
Once we have a few of these projects in place (e.g. a Service Catalog or a repo full of composable infrastructure modules that meet our standards), how do we serve them out? How you spur adoption of these tools, and how they are consumed, can vary quite a bit depending on your organizational structure. We don't want to upset how work flows in and gets done; we just want it to go faster and be more reliable. This is what we mean by the interface: whichever way work flows in, we should supplement it with software or automation that links those pieces of work together. Here are a few examples of how this might look, depending on your organization:
1.) Central IT Managed Provisioning
If you have an organization that manages requests for infrastructure, this paradigm shift might seem daunting. The interface in this case is the ticketing system. This is where we would create an integration with our ticketing software to automatically pull the correct project from the service catalog or module repo based on some criteria in the ticket. The interface doesn't change but is instead supplemented by automation to answer these requests, saving time and providing faster delivery of service.
2.) Full Stack Engineers
If you have engineers that develop software and the infrastructure that runs their applications this is the easiest scenario to address in some regards and the hardest in others. Your interface might be a build server, or it could simply be the adoption of an internal open source model where each team develops modules and shares them in a common place, constantly trying to save time and not re-invent the wheel.
“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.” – John Gall
All good automation starts with a manual, well-defined process. Standardizing and automating infrastructure development begins with understanding how our internal teams are organized so that work can be performed efficiently before we start automating it. Work with your teammates to create a value stream map and understand the process entirely before putting any effort into automating a workflow.
With 2nd Watch designs and automation you can deploy quicker, learn faster and modify as needed with Continuous Integration / Continuous Deployment (CI/CD). Our Workload Solutions transform on-premises workloads to digital solutions in the public cloud with next generation products and services. To accelerate your infrastructure development so that you can deploy faster, learn more often and adapt to customer requirements more effectively, speak with a 2nd Watch cloud deployment expert today.
– Lars Cromley, Director of Engineering, Automation, 2nd Watch