Automating Windows Patching of EC2 Autoscaling Group Instances

Background

Dealing with Windows patching can be a royal pain, as you may know. At least once a month, Windows machines are subject to system security and stability patches, thanks to Microsoft’s Patch Tuesday. With Windows 10 (and its derivatives), Microsoft has shifted toward more of a Continuous Delivery model in how it manages system patching. It is a welcome change; however, it still doesn’t guarantee that Windows patching won’t require a system reboot.

Rebooting an EC2 instance that is a member of an Auto Scaling Group (depending upon how you have your Auto Scaling health-check configured) is something that will typically cause an Elastic Load Balancing (ELB) HealthCheck failure and result in instance termination (this occurs when Auto Scaling notices that the instance is no longer reporting “in service” with the load balancer). Auto Scaling will of course replace the terminated instance with a new one, but the new instance will be launched using an image that is presumably unpatched, thus leaving your Windows servers vulnerable.

The next patch cycle will once again trigger a reboot, and the vicious cycle continues. Furthermore, if the patching and reboots aren’t carefully coordinated, they could severely impact your application performance and availability (think multiple Auto Scaling Group members rebooting simultaneously). If you are running an earlier version of Windows (e.g. Windows Server 2012 R2), rebooting at least once a month on Patch Tuesday is almost a certainty.

Another major problem with utilizing the AWS stock Windows AMIs with Auto Scaling is that AWS makes those AMIs unavailable after just a few months. This means that unless you update your Auto Scaling Launch Configuration to use the newer AMI IDs on a continual basis, future Auto Scaling instance launches will fail as they try to access an AMI that is no longer accessible. Anguish.

Automatically and Reliably Patch your Auto-Scaled Windows instances

Given the aforementioned scenario, how on earth are you supposed to automatically and reliably patch your Auto-Scaled Windows instances?!

One approach would be to write some sort of orchestration layer that detects when Auto Scaling members have been patched and are awaiting their obligatory reboot, suspends the Auto Scaling processes that would detect and replace perceived failed instances (e.g. HealthCheck), and then reboots the instances one by one. This would be rather painful to orchestrate and has a potentially severe drawback: cluster capacity is reduced to N-1 during the reboots (or lower if you don’t take service availability between reboots into account).

Reducing capacity to N-1 might not be a big deal if you have a cluster of 20 instances, but if you are running a smaller cluster (say 4, 3, or 2 instances), it has a significant impact on your overall cluster capacity. And if you are running an Auto Scaling group with a single instance (not as uncommon as you might think), your application is completely down while that single member reboots. This approach also doesn’t solve the issue of expired stock AWS AMIs.

Another approach is to maintain and patch a “golden image” that the Auto Scaling Launch Configuration uses to create new instances. If you are unfamiliar with the term, a golden image is an operating system image that has everything pre-installed, configured, and saved in a pre-baked image file (an AMI in the case of Amazon EC2). Doing this in a reasonably automated fashion requires a significant amount of work and has numerous potential pitfalls.

While it prevents a service outage by replacing the soon-to-be-unavailable public stock AMI with an AMI that you own, you still need a way to reliably and automatically handle this process. Using a tool like Hashicorp’s Packer can get you partially there, but you would still have to write a number of provisioners to handle the installation of Windows Updates and anything else you need to do in order to prep the system for imaging. In the end, you would still have to develop or employ a fair number of tools and processes to completely automate the entire process of detecting new Windows Updates, creating a patched AMI with those updates, and orchestrating the update of your Auto Scaling Groups.

A Cloud-Minded Approach

I believe that Auto Scaling Windows servers intelligently requires a paradigm shift. One assumption we have to make is that some form of configuration management (e.g. Puppet, Chef)—or at least a basic bootstrap script executed via cfn-init/UserData—is automating the configuration of the operating system, applications, and services upon instance launch. If configuration management or bootstrap scripts are not in play, then it is likely that a golden-image is being utilized. Without one of these two approaches, you don’t have true Auto Scaling because it would require some kind of human interaction to configure a server (ergo, not “auto”) every time a new instance was created.

Both approaches (launch-time configuration vs. golden-image) have their pros and cons. I generally prefer launch-time configuration as it allows for more flexibility, provides for better governance/compliance, and enables pushing changes dynamically. But…(and this is especially true of Windows servers) sometimes launch-time configuration simply takes longer to happen than is acceptable, and the golden-image approach must be used to allow for a more rapid deployment of new Auto Scaling group instances.

Either approach can be easily automated using a solution like the one I am about to outline, and thankfully AWS publishes new stock Windows Server AMIs immediately following every Patch Tuesday. This means that if you aren’t using a golden image, patching your instances is as simple as updating your Auto Scaling Launch Configuration to use the new AMI(s) and performing a rolling replacement of the instances. Even if you are using a golden image or applying some level of customization to the stock AMI, you can easily integrate Packer into the process to create a new patched image that includes your customizations.
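As a rough illustration of how newly published stock AMIs might be detected, here is a minimal Python sketch. It only builds the arguments for the EC2 DescribeImages call and picks the newest image from the response. The name pattern shown in the usage note and the idea of remembering the "last seen" image ID are assumptions made for illustration, not part of the original solution.

```python
def stock_ami_query(name_pattern):
    """Build keyword arguments for an EC2 DescribeImages request
    limited to Amazon-owned images whose name matches the pattern."""
    return {
        "Owners": ["amazon"],
        "Filters": [
            {"Name": "name", "Values": [name_pattern]},
            {"Name": "state", "Values": ["available"]},
        ],
    }


def newest_image(images):
    """Return the image with the latest CreationDate, or None if empty.

    CreationDate is an ISO 8601 string, so string comparison orders
    images chronologically.
    """
    if not images:
        return None
    return max(images, key=lambda image: image["CreationDate"])


def is_new_release(latest, last_seen_image_id):
    """True when the newest image differs from the one already processed."""
    return latest is not None and latest["ImageId"] != last_seen_image_id
```

With boto3 available, the query could be used as `boto3.client("ec2").describe_images(**stock_ami_query("Windows_Server-2016-English-Full-Base-*"))["Images"]`, with the result passed to `newest_image`. Where the last seen image ID is stored (a parameter store, a tag, a file) is left open here.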

The Solution

At a high level, the solution can be summarized as:

  1. An Orchestration Layer (e.g. AWS SNS and Lambda, Jenkins, AWS Step Functions) that detects and responds when new patched stock Windows AMIs have been released by Amazon.
  2. A Packer Launcher process that manages launching Packer jobs in order to create custom AMIs. Note: This step is only required if you want to copy AWS stock AMIs to your own AWS account OR apply customizations to the stock AMI. Either use case requires that the custom images remain available indefinitely. We solved this problem by creating an EC2 instance with a Python UserData script that launches Packer jobs (in parallel) to create copies of the new stock AMIs in our AWS account. Note: if you are using something like Jenkins, this could be handled by having Jenkins launch a local script or even a Docker container to manage launching Packer jobs.
  3. A New AMI Messaging Layer (e.g. Amazon SNS) to publish notifications when new/patched AMIs have been created.
  4. Some form of Auto Scaling Group Rolling Updater to replace existing Auto Scaling Group instances with new ones based on the patched AMI.
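The Packer Launcher in step 2 could look something like the following Python sketch, which builds one `packer build` command per source AMI and runs them in parallel. This is not the original UserData script. The template file name and the `source_ami` and `ami_name` variables are hypothetical and would have to be declared in your own Packer template.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def packer_command(template, source_ami, ami_name):
    """Build the argument list for one Packer job.

    The variable names are hypothetical; they must match variables
    declared in the Packer template being built.
    """
    return [
        "packer", "build",
        "-var", f"source_ami={source_ami}",
        "-var", f"ami_name={ami_name}",
        template,
    ]


def run_job(command, runner=subprocess.run):
    """Run a single Packer job and return its exit code."""
    return runner(command).returncode


def launch_all(commands, runner=subprocess.run, max_workers=4):
    """Run all Packer jobs in parallel; exit codes keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda command: run_job(command, runner), commands))
```

A caller would build one command per new stock AMI, pass the list to `launch_all`, and publish to the New AMI messaging layer only for jobs that returned exit code 0. The `runner` argument exists so the launcher can be exercised without Packer installed.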

Great news for anyone using AWS CloudFormation… CFT inherently supports Rolling Updates for Auto Scaling Groups! Utilizing it requires attaching an UpdatePolicy and adding a UserData or cfn-init script to notify CloudFormation when the instance has finished its configuration and is reporting as healthy (e.g. InService on the ELB). There are some pretty good examples of how to accomplish this using CloudFormation out there, but here is one specifically that AWS provides as an example.
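For orientation, the UpdatePolicy attribute mentioned above has roughly the following shape, shown here as a Python dictionary rendered to JSON. The property names are the ones CloudFormation defines for AutoScalingRollingUpdate; the specific values (batch size, pause time) are placeholders to tune for your own group, not recommendations from the original article.

```python
import json


def rolling_update_policy(min_in_service, max_batch_size=1, pause_time="PT15M"):
    """Return an UpdatePolicy body for an Auto Scaling Group resource.

    WaitOnResourceSignals makes CloudFormation wait for each new
    instance to signal success (for example via cfn-signal) before
    moving on to the next batch.
    """
    return {
        "AutoScalingRollingUpdate": {
            "MinInstancesInService": min_in_service,
            "MaxBatchSize": max_batch_size,
            "PauseTime": pause_time,
            "WaitOnResourceSignals": True,
        }
    }


if __name__ == "__main__":
    # Print the fragment as it would appear inside a JSON template.
    print(json.dumps({"UpdatePolicy": rolling_update_policy(2)}, indent=2))
```

The fragment belongs on the Auto Scaling Group resource itself, alongside its Properties section.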

If you aren’t using CloudFormation, all hope is not lost. Despite Hashicorp Terraform’s ever-increasing popularity for deploying and managing AWS infrastructure as code, Terraform has yet to implement a Rolling Update feature for AWS Auto Scaling Groups. There is a Terraform feature request from a few years ago for this exact feature, but as of today, it is not yet available, nor do the Terraform developers have any short-term plans to implement it. However, several people (including Hashicorp’s own engineers) have developed a number of ways to work around the lack of an integrated Auto Scaling Group Rolling Updater in Terraform. Here are a few I like:

Of course, you can always roll your own solution using a combination of AWS services (e.g. SNS, Lambda, Step Functions), or whatever tooling best fits your needs. Creating your own solution gives you added flexibility if you have requirements that can’t be met by CloudFormation, Terraform, or another orchestration tool.

The following is an example framework for performing automated Rolling Updates to Auto Scaling Groups utilizing AWS SNS and AWS Lambda:

a.  An Auto Scaling Launch Config Modifier worker that subscribes to the New AMI messaging layer performs an update to the Auto Scaling Launch Configuration(s) when a new AMI is released. In this use case, we are using an AWS Lambda function to subscribe to an SNS topic. Upon notification of new AMIs, the worker must then update the predefined (or programmatically derived) Auto Scaling Launch Configurations to use the new AMI. This is best handled by using infrastructure templating tools like CloudFormation or Terraform to make updating the Auto Scaling Launch Configuration ImageId as simple as updating a parameter/variable in the template and performing an update/apply operation.

b.  An Auto Scaling Group Instance Cycler messaging layer (again, an Amazon SNS topic) to be notified when an Auto Scaling Launch Configuration ImageId has been updated by the worker.

c.  An Auto Scaling Group Instance Cycler worker that replaces the Auto Scaling Group instances in a safe, reliable, and automated fashion. For example, another AWS Lambda function that subscribes to the SNS topic and triggers new instance launches by increasing the Auto Scaling Desired Instance count to twice the current number of ASG instances.

d.  Once the scale-up event generated by the Auto Scaling Group Instance Cycler worker has completed and the new instances are reporting as healthy, another message will be published to the Auto Scaling Group Instance Cycler SNS topic indicating scale-up has completed.

e.  The Auto Scaling Group Instance Cycler worker will respond to the prior event and return the Auto Scaling group back to its original size which will terminate the older instances leaving the Auto Scaling Group with only the patched instances launched from the updated AMI. This assumes that we are utilizing the default AWS Auto Scaling Termination Policy which ensures that instances launched from the oldest Launch Configurations are terminated first.
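To make step (a) more concrete, here is a minimal Python sketch of a Launch Config Modifier that reacts to an SNS notification by preparing a CloudFormation stack update that changes only the image parameter. The SNS message format (a JSON body with an `ami_id` key) and the stack parameter name are assumptions for illustration. The functions only build the request, so they can be checked without AWS access.

```python
import json


def extract_ami_id(event):
    """Pull the new AMI ID out of an SNS-triggered Lambda event.

    Assumes the publisher sends a JSON message with a hypothetical
    "ami_id" key.
    """
    message = event["Records"][0]["Sns"]["Message"]
    return json.loads(message)["ami_id"]


def build_stack_parameters(existing_keys, image_param, ami_id):
    """Set the image parameter and keep every other parameter unchanged."""
    parameters = []
    for key in existing_keys:
        if key == image_param:
            parameters.append({"ParameterKey": key, "ParameterValue": ami_id})
        else:
            parameters.append({"ParameterKey": key, "UsePreviousValue": True})
    return parameters


def update_stack_request(stack_name, existing_keys, image_param, ami_id):
    """Build keyword arguments for a CloudFormation UpdateStack call."""
    return {
        "StackName": stack_name,
        "UsePreviousTemplate": True,
        "Parameters": build_stack_parameters(existing_keys, image_param, ami_id),
        # Only needed when the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }
```

Inside a Lambda handler, the existing parameter keys could be read from `describe_stacks` and the result passed along as `boto3.client("cloudformation").update_stack(**request)`. After the update succeeds, the worker would publish to the Instance Cycler topic described in step (b).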

NOTE: The AWS Auto Scaling default termination policy will not guarantee that the older instances are terminated first! If the Auto Scaling Group spans multiple Availability Zones (AZs) and there is an imbalance in the number of instances in each AZ, it will terminate the extra instance(s) in the AZ with the most instances before terminating based on the oldest Launch Configuration. Terminating on Launch Configuration age will certainly ensure that the oldest instances will be replaced first. My recommendation is to use the OldestInstance termination policy to make absolutely certain that the oldest (i.e. unpatched) instances are terminated during the Instance Cycler scale-down process. Consult the AWS documentation on the Auto Scaling termination policies for more on this topic.
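The Instance Cycler in steps (c) through (e) can be sketched in Python as three small functions: one that plans the scale-up to twice the current size (setting the OldestInstance termination policy at the same time), one that checks whether the scale-up has finished, and one that plans the return to the original size. The dictionaries follow the shape of the Auto Scaling DescribeAutoScalingGroups response and UpdateAutoScalingGroup request. How the original size is remembered between the two SNS events is left as an assumption for your own design.

```python
def plan_scale_up(group):
    """Return (original_settings, update_request) for doubling the group."""
    original = {
        "DesiredCapacity": group["DesiredCapacity"],
        "MaxSize": group["MaxSize"],
    }
    target = group["DesiredCapacity"] * 2
    request = {
        "AutoScalingGroupName": group["AutoScalingGroupName"],
        "DesiredCapacity": target,
        # MaxSize must be raised if it is below the doubled capacity.
        "MaxSize": max(group["MaxSize"], target),
        "TerminationPolicies": ["OldestInstance"],
    }
    return original, request


def scale_up_complete(group, target):
    """True once at least `target` instances are in service and healthy."""
    ready = [
        instance for instance in group["Instances"]
        if instance["LifecycleState"] == "InService"
        and instance["HealthStatus"] == "Healthy"
    ]
    return len(ready) >= target


def plan_scale_down(group_name, original):
    """Build the request that restores the original group size."""
    return {"AutoScalingGroupName": group_name, **original}
```

Each request dictionary could be passed to `boto3.client("autoscaling").update_auto_scaling_group(**request)`. With OldestInstance in place, the scale-down removes the instances launched before the AMI change.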

In Conclusion

Whichever solution you choose to implement to handle the Rolling Updates to your Auto Scaling Group, the solution outlined above will provide you with a sure-fire way to ensure your Windows Auto Scaled servers are always patched automatically, and it will minimize the operational overhead of ensuring patch compliance and server security. And the good news is that the heavy lifting is already being handled by AWS Auto Scaling and Hashicorp Packer. There is a bit of trickery to getting the Packer configs and provisioners working just right with the EC2Config service and Windows Sysprep, but there are a number of good examples out on GitHub to get you headed in the right direction. The one I referenced in building our solution can be found here.

One final word of caution... if you do not disable the EC2Config Set Computer Name option when baking a custom AMI, your Windows hostname will ALWAYS be reset to the EC2Config default upon reboot. This is especially problematic for configuration management tools like Puppet or Chef which may use the hostname as the SSL Client Certificate subject name (default behavior), or for deriving the system role/profile/configuration.

Here is my ec2config.ps1 Packer provisioner script which disables the Set Computer Name option:

    # Path to the EC2Config settings file
    $EC2SettingsFile = "C:\\Program Files\\Amazon\\Ec2ConfigService\\Settings\\Config.xml"
    $xml = [xml](Get-Content $EC2SettingsFile)
    $xmlElement = $xml.get_DocumentElement()
    $xmlElementToModify = $xmlElement.Plugins

    # Set the state of each EC2Config plugin by name
    foreach ($element in $xmlElementToModify.Plugin)
    {
        if ($element.name -eq "Ec2SetPassword")
        {
            $element.State = "Enabled"
        }
        elseif ($element.name -eq "Ec2SetComputerName")
        {
            # Disabled so the hostname is not reset on reboot
            $element.State = "Disabled"
        }
        elseif ($element.name -eq "Ec2HandleUserData")
        {
            $element.State = "Enabled"
        }
        elseif ($element.name -eq "Ec2DynamicBootVolumeSize")
        {
            $element.State = "Enabled"
        }
    }

    # Write the modified settings back to disk
    $xml.Save($EC2SettingsFile)

Hopefully, at this point, you have a pretty good idea of how you can leverage existing software, tools, and services—combined with a bit of scripting and automation workflow—to reliably and automatically manage the patching of your Windows Auto Scaling Group EC2 instances!  If you require additional assistance, are resource-bound for getting something implemented, or you would just like the proven Cloud experts to manage Automating Windows Patching of your EC2 Autoscaling Group Instances, contact 2nd Watch today!

Disclaimer

We strongly advise that processes like the ones described in this article be performed in a test environment prior to production to properly validate that the changes have not negatively affected your application’s functionality, performance, or availability.

This is something that your orchestration layer in the first step should be able to handle. It should also integrate well with a Continuous Integration and/or Delivery workflow.

-Ryan Kennedy, Principal Cloud Automation Architect, 2nd Watch


AWS Sticky Sessions

Amazon Web Services best practices tell us to build stateless systems: in a perfect world, any server can serve any function with absolutely no impact to customers. Sounds great, but unfortunately reality intrudes on our perfect world, and we find many websites and applications are not so perfectly stateless. So how can we make use of the strengths of AWS in areas like elasticity and auto scaling without completely rewriting applications to conform? After all, one of the key benefits of moving into the Cloud is cost savings, which get eaten away by spending development resources rewriting code.

The solution is thankfully built into Amazon’s Elastic Load Balancer (ELB), so those that require sessions to remain open for a customer can enable that “sticky” option. This keeps transactions processing and real-time communication alive, and it keeps businesses from needing to redesign such code or give up auto scaling. So how does it work?

The first option is to create duration-based session stickiness. This is enabled at the ELB under port configuration. From there, the “stickiness” option can be enabled, and the ELB will generate a session cookie with a limited duration (default is 60 seconds). So long as the client checks in with the ELB before the cookie expires, the ELB keeps routing that client to the same instance.

The second option is to enable application-controlled stickiness. This requires more development effort unless the existing platform already makes use of custom cookies; however, it gives application developers far more control than a basic number of seconds before timeout. By using application control, a web developer can keep a client connection directed to a specific instance through the ELB for as long as the application’s own cookie is valid. In both cases, stickiness governs routing only; it does not by itself stop Auto Scaling from terminating an instance, so scale-in settings should take active sessions into account.
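As a minimal sketch of the two options, the Python functions below build the request parameters for the classic ELB API calls that create a stickiness policy and attach it to a listener. The load balancer name, policy names, port, and cookie name are hypothetical placeholders.

```python
def duration_policy_request(lb_name, policy_name, seconds):
    """Arguments for a duration-based (load balancer generated) cookie policy."""
    return {
        "LoadBalancerName": lb_name,
        "PolicyName": policy_name,
        "CookieExpirationPeriod": seconds,
    }


def app_policy_request(lb_name, policy_name, cookie_name):
    """Arguments for an application-controlled cookie policy.

    The ELB follows the lifetime of the named application cookie.
    """
    return {
        "LoadBalancerName": lb_name,
        "PolicyName": policy_name,
        "CookieName": cookie_name,
    }


def attach_policy_request(lb_name, port, policy_name):
    """Arguments for enabling a policy on one listener port."""
    return {
        "LoadBalancerName": lb_name,
        "LoadBalancerPort": port,
        "PolicyNames": [policy_name],
    }
```

With a boto3 `elb` client, these map onto `create_lb_cookie_stickiness_policy`, `create_app_cookie_stickiness_policy`, and `set_load_balancer_policies_of_listener` respectively, for example `elb.create_lb_cookie_stickiness_policy(**duration_policy_request("my-elb", "sticky-60s", 60))`.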

-Keith Homewood, Cloud Architect


Distributed Functional Testing on AWS

To leverage the full benefits of Amazon Web Services (AWS) and features such as instant elasticity and scalability, every AWS architect eventually considers Elastic Load Balancing and Auto Scaling.   These features enable the ability to instantly scale-in or scale-out an environment based on the flow of internet traffic.

Once implemented, how do you test the configuration and application to make sure they’re scaling with the parameters you’ve set? You could always trust the design and logic, then wait for the environment to scale naturally with organic traffic. However, in most production environments this is not an option. You want to make sure the environment operates adequately under load. One cool way to do this is by generating a distributed traffic load through a program called Bees with Machine Guns.

The author describes Bees with Machine Guns as “A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).” This is a perfect solution for testing performance and functionality of an AWS environment because it allows you to use one master controller to call many bees for a distributed attack on an application. Using a distributed attack from several bees gives a more realistic attack profile that you can’t get from a single node. Bees with Machine Guns enables you to mount an attack with one or several bees with the same amount of effort.

Bees with Machine Guns isn’t just a randomly found open source tool. AWS endorses the project in several places on their website. AWS recommends Bees with Machine Guns for distributed testing in their article “Best Practices in Evaluating Elastic Load Balancing”. The author says “…you could consider tools that help you distribute tests, such as the open source Fabric framework combined with an interesting approach called Bees with Machine Guns, which uses the Amazon EC2 environment for launching clients that execute tests and report the results back to a controller.” AWS also provides a CloudFormation template for deploying Bees with Machine Guns on their AWS CloudFormation Sample Templates page.

To install Bees with Machine Guns you can either use the template provided on the AWS CloudFormation Sample Templates page called bees-with-machineguns.template or follow the install instructions from the GitHub project page. (Please be aware the template also deploys a scalable spot instance auto scale group behind an elastic load balancer, all of which you are responsible to pay for.)

Once Bees with Machine Guns is installed, you can run its commands: bees up, bees report, bees attack, and bees down.


The first command we run will start up five bees that we will have control over for testing. We can use the -s option to specify the number of bees we want to spin up. The -k option is the SSH key pair name used to connect to the new servers. The -i option is the AMI used for each bee. The -g option is the security group in which the bees will be launched. If the key pair, security group, and AMI already exist in the region where you’re launching the bees, there is less chance you will see errors when running the command.

Once launched, you can see the bees that were instantiated and are under the control of the Bees with Machine Guns controller with the “bees report” command.

To make our bees attack, we use the command “bees attack”. The -u option is the URL of the target to attack. Make sure to include the trailing slash in your URL or the command will error out. The -n option is the total number of connections to make to the target. The -c option is the number of concurrent connections made to the target. Here is an example run of an attack:

Notice that the attack was distributed among the bees in the following manner: “Each of 5 bees will fire 20 rounds, 2 at a time.” Since we had our total number of connections set to 100, each bee received an equal share of the requests. Depending on your choices for the -n and -c options, you can configure a different type of attack profile. For example, if you wanted to increase the time of an attack, you would increase the total number of connections and the bees would take longer to complete the attack. This comes in useful when testing an auto scale group in AWS because you can configure an attack that will trigger one of your CloudWatch alarms, which will in turn activate a scaling action. Another trick is to use the Linux “time” command before your “bees attack” command; once the attack completes, you can see the total duration of the attack.
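The way the options fit together can be sketched in Python. The functions below assemble the `bees up` and `bees attack` argument lists from the options described above and reproduce the per-bee split quoted in the output. The key pair, security group, AMI ID, and target URL in the usage note are hypothetical, and the concurrency of 10 is inferred from “2 at a time” across 5 bees rather than taken from the original command.

```python
from urllib.parse import urlsplit, urlunsplit


def with_root_path(url):
    """Add the trailing slash when the URL has no path at all."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def bees_up_command(count, key_name, security_group, ami_id):
    """Argument list that starts `count` bees."""
    return ["bees", "up", "-s", str(count), "-k", key_name,
            "-i", ami_id, "-g", security_group]


def bees_attack_command(url, total, concurrent):
    """Argument list for an attack of `total` requests, `concurrent` at a time."""
    return ["bees", "attack", "-n", str(total), "-c", str(concurrent),
            "-u", with_root_path(url)]


def per_bee_load(total, concurrent, bees):
    """Requests and concurrent connections assigned to each bee."""
    return total // bees, concurrent // bees
```

For the run described above, `per_bee_load(100, 10, 5)` gives 20 requests per bee, 2 at a time. The argument lists can be handed to `subprocess.run`, for example `subprocess.run(bees_attack_command("http://example.com", 100, 10))`.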

Once the command completes you get output for the number of requests that actually completed, the requests that were made per second, the time per request, and a “Mission Assessment,” in this case the “Target crushed bee offensive”.

To spin down your fleet of bees, you run the “bees down” command.

This is a quick intro on how to use Bees with Machine Guns for distributed testing within AWS. The one big caution in using Bees with Machine Guns is that, as explained by the author, “they are, more-or-less a distributed denial-of-service attack in a fancy package,” which means you should only use it against resources that you own, and you will be liable for any unauthorized use.

As you can see, Bees with Machine Guns can be a powerful tool for distributed load tests. It’s extremely easy to set up and tremendously easy to use. It is a great way to artificially create a production load to test the elasticity and scalability of your AWS environment.

-Derek Baltazar, Senior Cloud Engineer