
Troubleshooting VMware HCX

I was on a project recently where we had to set up VMware HCX to connect an on-premises datacenter to VMware Cloud on AWS for a proof-of-concept (POC) migration. The workloads varied, ranging from 100 MB to 5 TB in size. The customer wanted to stretch two L2 subnets and have the ability to migrate slowly, in waves. During the POC, we found problems with latency between the two stretched networks and decided that the best course of action would be to NOT stretch the networks and, instead, do an all-at-once migration.

While setting up this POC, I had occasion to do some troubleshooting on HCX due to connectivity issues.  I’m going to walk through some of the troubleshooting I needed to do.

The first thing we did was enable SSH on the HCX Manager. To do this, go into the HCX Manager appliance GUI and, under Appliance Summary, start the SSH service. Once SSH is enabled, you can log in to the appliance CLI, which is where the real troubleshooting can begin.

You’ll want to log in to the appliance using “admin” as the user name and the password you entered when you installed the appliance. Then run su to switch to “root” and enter the “root” password. This gives you access to the appliance, which has a limited set of Linux commands.

You’ll want to enter the HCX Central CLI (CCLI) to use the HCX commands.  Since you’re already logged in as “root,” you just type “ccli” at the command prompt.  After you’re in the CCLI, you can type “help” to get a list of commands.

One of the first tests to run would be the Health Checker. Type “hc” at the command prompt, and the HCX manager will run through a series of tests to check on the health of the environment.

“list” will give you a list of the HCX appliances that have been deployed.

You’ll want to connect to an appliance to run the commands specific to that appliance. Use the node numbers from the list output: for example, if the Interconnect appliance is node 0, you would type “go 0” to connect to it. From here, you can run a ton of commands, such as “show ipsec status,” which shows a plethora of information related to the tunnel. Type “q” to exit this command.

You can also run the Health Check on this node from here, export a support bundle (under the debug command), and run a multitude of other “show” commands. Under the “show” command, you can get firewall information, flow runtime information, and a lot of other useful troubleshooting info.

If you need to actually get on the node and run Linux commands for troubleshooting, you’ll enter “debug remoteaccess enable,” which enables SSH on the remote node.  Then you can just type “ssh” and it will connect you to the interconnect node.

Have questions about this process? Contact us or leave a reply.

-Michael Moore, Associate Cloud Consultant


Automating Windows Server 2016 Builds with Packer

It might not be found on a curated list of urban legends, but trust me, IT IS possible (!!!) to fully automate the building and configuration of Windows virtual machines for AWS.  While Windows Server may not be your first choice of operating systems for immutable infrastructure, sometimes you as the IT professional are not given the choice due to legacy environments, limited resources for re-platforming, corporate edicts, or lack of internal *nix knowledge.  Complain all you like, but after the arguing, fretting, crying, and gnashing of teeth is done, at some point you will still need to build an automated deployment pipeline that requires Windows Server 2016.  With some PowerShell scripting and HashiCorp Packer, it is relatively easy and painless to securely build and configure Windows AMIs for your particular environment.

Let’s dig into an example of how to build a custom configured Windows Server 2016 AMI.  You will need access to an AWS account and have sufficient permissions to create and manage EC2 instances.  If you need an AWS account, you can create one for free.

I am using VS Code and its built-in terminal with Windows Subsystem for Linux (WSL) to create the Packer template and run Packer; however, Packer is available for many Linux distros, macOS, and Windows.

First, download and unzip Packer:

~/packer-demo:\> wget https://releases.hashicorp.com/packer/1.3.4/packer_1.3.4_linux_amd64.zip 
Resolving releases.hashicorp.com (releases.hashicorp.com)...,,, ... 
Connecting to releases.hashicorp.com (releases.hashicorp.com)||:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 28851840 (28M) [application/zip] 
Saving to: ‘packer_1.3.4_linux_amd64.zip’ 
‘packer_1.3.4_linux_amd64.zip’ saved [28851840/28851840] 
~/packer-demo:\> unzip packer_1.3.4_linux_amd64.zip 
Archive:  packer_1.3.4_linux_amd64.zip   
 inflating: packer

Now that we have Packer unzipped, verify that it is executable by checking the version:

~/packer-demo:\> packer --version

Packer can build machine images for a number of different platforms.  We will focus on the amazon-ebs builder, which will create an EBS-backed AMI.  At a high level, Packer performs these steps:

  1. Reads configuration settings from a JSON file
  2. Uses the AWS API to stand up an EC2 instance
  3. Connects to the instance and provisions it
  4. Shuts down and snapshots the instance
  5. Creates an Amazon Machine Image (AMI) from the snapshot
  6. Cleans up the mess

The amazon-ebs builder can create temporary keypairs, security group rules, and establish a basic communicator for provisioning.  However, in the interest of tighter security and control, we will want to be prescriptive for some of these settings and use a secure communicator to reduce the risk of eavesdropping while we provision the machine.

There are two communicators Packer uses to upload scripts and files to a virtual machine: SSH (the default) and WinRM. While there is a nice Win32 port of OpenSSH for Windows, it is not currently installed by default on Windows machines. WinRM, on the other hand, is available natively in all current versions of Windows, so we will use it to provision our Windows Server 2016 machine.

Let’s create the userdata file, userdata.ps1, that Packer will use to bootstrap the EC2 instance:

<powershell>
Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
$ErrorActionPreference = "stop"
# Remove any existing Windows Management listeners
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
# Create self-signed cert for encrypted WinRM on port 5986
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer-ami-builder"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
# Configure WinRM
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="false"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="false"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer-ami-builder`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall add rule name="WinRM-SSL (5986)" dir=in action=allow protocol=TCP localport=5986
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>

There are four main things going on here:

  1. Set the execution policy and error handling for the script (if an error is encountered, the script terminates immediately)
  2. Clear out any existing WS Management listeners, just in case there are any preconfigured insecure listeners
  3. Create a self-signed certificate for encrypting the WinRM communication channel, and then bind it to a WS Management listener
  4. Configure WinRM to:
    • Require an encrypted (SSL) channel
    • Enable basic authentication (usually this is not secure, as the password goes across in plain text, but we are forcing encryption)
    • Configure the listener to use port 5986 with the self-signed certificate we created earlier
    • Add a firewall rule to open port 5986

Now that we have userdata to bootstrap WinRM, let’s create a Packer template:

~/packer-demo:\> touch windows2016.json

Open this file with your favorite text editor and add this text:

{
    "variables": {
        "build_version": "{{isotime \"2006.01.02.150405\"}}",
        "aws_profile": null,
        "vpc_id": null,
        "subnet_id": null,
        "security_group_id": null
    },
    "builders": [
        {
            "type": "amazon-ebs",
            "region": "us-west-2",
            "profile": "{{user `aws_profile`}}",
            "vpc_id": "{{user `vpc_id`}}",
            "subnet_id": "{{user `subnet_id`}}",
            "security_group_id": "{{user `security_group_id`}}",
            "source_ami_filter": {
                "filters": {
                    "name": "Windows_Server-2016-English-Full-Base-*",
                    "root-device-type": "ebs",
                    "virtualization-type": "hvm"
                },
                "most_recent": true,
                "owners": [
                    "801119661308"
                ]
            },
            "ami_name": "WIN2016-CUSTOM-{{user `build_version`}}",
            "user_data_file": "userdata.ps1",
            "associate_public_ip_address": true,
            "communicator": "winrm",
            "winrm_username": "Administrator",
            "winrm_port": 5986,
            "winrm_timeout": "15m",
            "winrm_use_ssl": true,
            "winrm_insecure": true
        }
    ]
}

A couple of things to call out here. First, the variables block at the top declares values that the template needs. Replace the null values with ones specific to your AWS account and VPC (the list below also covers two related builder settings):

  • aws_profile: I use a local credentials file to store IAM user credentials (the file is shared between WSL and Windows). Specify the name of a credential block that Packer can use to connect to your account.  The IAM user will need permissions to create and modify EC2 instances, at a minimum
  • vpc_id: Packer will stand up the instance in this VPC
  • region: Your VPC should be in this region (the builder hardcodes us-west-2). As an exercise, change this value to be set by a variable instead
  • user_data_file: We created this file earlier, remember? If you saved it in another location, make sure the path is correct
  • subnet_id: This should belong to the VPC specified above. Use a public subnet if you do not have Direct Connect or VPN connectivity to the VPC
  • security_group_id: This security group should belong to the VPC specified above. This security group will be attached to the instance that Packer stands up.  It will need, at a minimum, inbound TCP 5986 from where Packer is running
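Packer treats a variable whose default is null as required, so instead of editing the nulls in windows2016.json you can also supply the values on the command line with -var. The Python sketch below only assembles that command line; the helper name packer_command and the example IDs are hypothetical placeholders, not values from this walkthrough.

```python
def packer_command(action, template, variables):
    """Build an argv list for `packer validate` or `packer build`.

    Each key/value in `variables` becomes a `-var key=value` pair, which fills
    in the template's null (required) variables without editing the file.
    """
    if action not in ("validate", "build"):
        raise ValueError(f"unsupported Packer action: {action}")
    argv = ["packer", action]
    # Sort the keys so the generated command is stable and easy to diff.
    for key, value in sorted(variables.items()):
        argv += ["-var", f"{key}={value}"]
    # Packer expects its options before the template path.
    argv.append(template)
    return argv


# Hypothetical values; substitute the IDs from your own account and VPC.
cmd = packer_command(
    "validate",
    "windows2016.json",
    {
        "aws_profile": "default",
        "vpc_id": "vpc-0123456789abcdef0",
        "subnet_id": "subnet-0123456789abcdef0",
        "security_group_id": "sg-0123456789abcdef0",
    },
)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run the command without any shell quoting concerns.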

Next, let’s validate the template to make sure the syntax is correct, and we have all required fields:

~/packer-demo:\> packer validate windows2016.json
Template validation failed. Errors are shown below.
Errors validating build 'amazon-ebs'. 1 error(s) occurred:
* An instance_type must be specified

Whoops, we missed instance_type.  Add that to the template and specify a valid EC2 instance type.  I like using beefier instance types so that the builds get done quicker, but that’s just me.

            "ami_name": "WIN2016-CUSTOM-{{user `build_version`}}",
            "instance_type": "t3.xlarge",
            "user_data_file": "userdata.ps1",

Now validate again:

~/packer-demo:\> packer validate windows2016.json
Template validated successfully.

Awesome. A couple of things to call out in the template:

  • source_ami_filter: we are using a base Amazon AMI for Windows Server 2016. Note the wildcard in the AMI name (Windows_Server-2016-English-Full-Base-*) and the owner (801119661308). Because most_recent is true, this filter will always pull the most recent match for the AMI name. Amazon updates its AMIs about once a month
  • associate_public_ip_address: I specify true because I don’t have Direct Connect or VPN from my workstation to my sandbox VPC. If you are building your AMI in a public subnet and want to connect to it over the internet, do the same
  • ami_name: this is a required property, and it must also be unique within your account. We use a special function here (isotime) to generate a timestamp, ensuring that the AMI name will always be unique (e.g. WIN2016-CUSTOM-2019.03.01.000042)
  • winrm_port: the default unencrypted port for WinRM is 5985. Our userdata sets up the encrypted HTTPS listener and firewall rule on 5986, which is the port Packer should use, so be sure to specify it here
  • winrm_use_ssl: as noted above, we are encrypting communication, so set this to true
  • winrm_insecure: this is a rather misleading property name. It really means that Packer should not check whether the encryption certificate is trusted. We are using a self-signed certificate in userdata, so set this to true to skip certificate validation
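The isotime format string in build_version uses Go's reference-time layout, where 2006 is the year, 01 the month, 02 the day, 15 the hour, 04 the minute, and 05 the second. As a rough illustration rather than Packer's own code, the Python sketch below reproduces the same AMI naming pattern with an equivalent strftime pattern. It assumes a UTC timestamp, which is how Packer documents isotime; the ami_name helper is hypothetical.

```python
from datetime import datetime, timezone

# Go layout "2006.01.02.150405" expressed as a strftime pattern:
# year.month.day.hour(24h)minutesecond, all zero-padded.
STRFTIME_PATTERN = "%Y.%m.%d.%H%M%S"


def ami_name(prefix, when=None):
    """Return an AMI name like WIN2016-CUSTOM-2019.03.01.000042 for `when` (UTC)."""
    if when is None:
        when = datetime.now(timezone.utc)
    return f"{prefix}-{when.strftime(STRFTIME_PATTERN)}"


print(ami_name("WIN2016-CUSTOM", datetime(2019, 3, 1, 0, 0, 42, tzinfo=timezone.utc)))
# prints WIN2016-CUSTOM-2019.03.01.000042
```

Because the timestamp has one-second resolution, each build started in a different second gets a distinct name, which is what keeps ami_name unique within the account.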

Now that we have our template, let’s inspect it to see what will happen:

~/packer-demo:\> packer inspect windows2016.json
Optional variables and their defaults:
  aws_profile       = default
  build_version     = {{isotime "2006.01.02.150405"}}
  security_group_id = sg-0e1ca9ba69b39926
  subnet_id         = subnet-00ef2a1df99f20c23
  vpc_id            = vpc-00ede10ag029c31e0
Note: If your build names contain user variables or template
functions such as 'timestamp', these are processed at build time,
and therefore only show in their raw form here.

Looks good, but we are missing a provisioner!  If we ran this template as is, all that would happen is Packer would make a copy of the base Windows Server 2016 AMI and make you the owner of the new private image.

There are several different Packer provisioners, ranging from very simple (the windows-restart provisioner just reboots the machine) to complex (the chef client provisioner installs chef client and does whatever Chef does from there).  We’ll run a basic powershell provisioner to install IIS on the instance, reboot the instance using windows-restart, and then we’ll finish up the provisioning by executing sysprep on the instance to generalize it for re-use.

We will use the PowerShell cmdlet Enable-WindowsOptionalFeature to install the web server role and IIS with defaults. Add this text to your template below the builders [ ] section (remember the comma after the closing bracket of builders):

"provisioners": [
    {
        "type": "powershell",
        "inline": [
            "Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerRole",
            "Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServer"
        ]
    },
    {
        "type": "windows-restart",
        "restart_check_command": "powershell -command \"& {Write-Output 'Machine restarted.'}\""
    },
    {
        "type": "powershell",
        "inline": [
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\InitializeInstance.ps1 -Schedule",
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\SysprepInstance.ps1 -NoShutdown"
        ]
    }
]

A couple of things to call out about this section:

  • The first powershell provisioner uses the inline option: Packer simply writes these lines, in order, into a file, then transfers and executes the file on the instance using PowerShell
  • The second provisioner, windows-restart, simply reboots the machine while Packer waits. While this isn’t always necessary, it is helpful to catch instances where settings do not persist after a reboot, which was probably not your intention
  • The final powershell provisioner executes two PowerShell scripts that are present on Amazon Windows Server 2016 AMIs and part of the EC2Launch application (earlier versions of Windows use a different application called EC2Config). They are helper scripts that you can use to prepare the machine for generalization, and then execute sysprep

After validating your template again, let’s build the AMI!

~/packer-demo:\> packer validate windows2016.json
Template validated successfully.
~/packer-demo:\> packer build windows2016.json
amazon-ebs output will be in this color.
==> amazon-ebs: Prevalidating AMI Name: WIN2016-CUSTOM-2019.03.01.000042
    amazon-ebs: Found Image ID: ami-0af80d239cc063c12
==> amazon-ebs: Creating temporary keypair: packer_5c78762a-751e-2cd7-b5ce-9eabb577e4cc
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
    amazon-ebs: Adding tag: "Name": "Packer Builder"
    amazon-ebs: Instance ID: i-0384e1edca5dc90e5
==> amazon-ebs: Waiting for instance (i-0384e1edca5dc90e5) to become ready...
==> amazon-ebs: Waiting for auto-generated password for instance...
    amazon-ebs: It is normal for this process to take up to 15 minutes,
    amazon-ebs: but it usually takes around 5. Please wait.
    amazon-ebs: Password retrieved!
==> amazon-ebs: Using winrm communicator to connect:
==> amazon-ebs: Waiting for WinRM to become available...
    amazon-ebs: WinRM connected.
    amazon-ebs: #> CLIXML
    amazon-ebs: System.Management.Automation.PSCustomObjectSystem.Object1Preparing modules for first 
use.0-1-1Completed-1 1Preparing modules for first use.0-1-1Completed-1 
==> amazon-ebs: Connected to WinRM!
==> amazon-ebs: Provisioning with Powershell...
==> amazon-ebs: Provisioning with powershell script: /tmp/packer-powershell-provisioner561623209
    amazon-ebs: Path          :
    amazon-ebs: Online        : True
    amazon-ebs: RestartNeeded : False
    amazon-ebs: Path          :
    amazon-ebs: Online        : True
    amazon-ebs: RestartNeeded : False
==> amazon-ebs: Restarting Machine
==> amazon-ebs: Waiting for machine to restart...
    amazon-ebs: Machine restarted.
    amazon-ebs: EC2AMAZ-LJV703F restarted.
    amazon-ebs: #> CLIXML
    amazon-ebs: System.Management.Automation.PSCustomObjectSystem.Object1Preparing modules for first 
==> amazon-ebs: Machine successfully restarted, moving on
==> amazon-ebs: Provisioning with Powershell...
==> amazon-ebs: Provisioning with powershell script: /tmp/packer-powershell-provisioner928106343
    amazon-ebs: TaskPath                                       TaskName                          State
    amazon-ebs: --------                                       --------                          -----
    amazon-ebs: \                                              Amazon Ec2 Launch - Instance I... Ready
==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance, attempt 1
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating unencrypted AMI WIN2016-CUSTOM-2019.03.01.000042 from instance i-0384e1edca5dc90e5
    amazon-ebs: AMI: ami-0d6026ecb955cc1d6
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.
==> Builds finished. The artifacts of successful builds are:
==> amazon-ebs: AMIs were created:
us-west-2: ami-0d6026ecb955cc1d6

The entire process took approximately 5 minutes from start to finish.  If you inspect the output, you can see where the first PowerShell provisioner installed IIS (Provisioning with powershell script: /tmp/packer-powershell-provisioner561623209), where sysprep was executed (Provisioning with powershell script: /tmp/packer-powershell-provisioner928106343) and where Packer yielded the AMI ID for the image (ami-0d6026ecb955cc1d6) and cleaned up any stray artifacts.

Note: most of the issues that occur during a Packer build are related to firewalls and security groups. Verify that the machine running Packer can reach the VPC and EC2 instance on port 5986. You can also use packer build -debug windows2016.json to step through the build process manually, and you can even connect to the machine via RDP to troubleshoot a provisioner.
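A quick way to rule out firewall and security group problems is to test whether TCP 5986 is reachable from the machine running Packer. The sketch below uses only Python's standard library; the helper name is hypothetical, and a successful connection shows only that the port is open, not that WinRM is answering correctly.

```python
import socket


def winrm_port_open(host, port=5986, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all surface as OSError.
        return False
```

For example, winrm_port_open("203.0.113.10") (a placeholder address; use the build instance's public or private IP) returning False points at the security group, network ACLs, or the Windows firewall rule rather than at Packer itself.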

Now, if you launch a new EC2 instance using this AMI, it will already have IIS installed and configured with defaults.  Building and securely configuring Windows AMIs with Packer is easy!

For help getting started securely building and configuring Windows AMIs for your particular environment, contact us.

-Jonathan Eropkin, Cloud Consultant


How to use waiters in boto3 (And how to write your own!)

What is boto3?

Boto3 is the Python SDK for interacting with the AWS API. Boto3 makes it easy to use the Python programming language to manipulate AWS resources and automate infrastructure.

What are boto3 waiters and how do I use them?

A number of requests made to AWS through boto3 are not instant. Common examples are deploying a new server or RDS instance. For some long-running requests, we are OK with initiating the request and checking for completion at some later time. But in many cases, we want to wait for the request to complete before moving on to subsequent parts of the script that rely on the long-running process having finished. One example would be a script that copies an AMI to another account by sharing all of its snapshots. After sharing the snapshots to the other account, you would need to wait for the local snapshot copies to complete before registering the AMI in the receiving account. Luckily, a snapshot completed waiter already exists, and here’s what that waiter would look like in Python:
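A minimal sketch of that waiter call, written as a function that takes an already-created EC2 client (for example boto3.client("ec2")). The function name is hypothetical, and the 15-second delay and 40 attempts passed through WaiterConfig are assumptions that give a 600-second ceiling. Before asking for the waiter, it checks the client's waiter_names list, which is one way to confirm that a waiter exists.

```python
def wait_for_snapshots(ec2_client, snapshot_ids, delay=15, max_attempts=40):
    """Block until every snapshot in `snapshot_ids` reaches the 'completed' state.

    `ec2_client` is assumed to be a boto3 EC2 client. With the defaults here the
    waiter polls every 15 seconds, for at most 15 * 40 = 600 seconds.
    """
    # Clients expose the names of the waiters they support.
    if "snapshot_completed" not in ec2_client.waiter_names:
        raise ValueError("this client has no snapshot_completed waiter")
    waiter = ec2_client.get_waiter("snapshot_completed")
    # WaiterConfig overrides the polling interval and the number of attempts.
    waiter.wait(
        SnapshotIds=snapshot_ids,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

If a snapshot reports an error or the attempts run out, the waiter raises botocore.exceptions.WaiterError, so the script stops before trying to register the AMI.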

As for the default configuration of the waiters and how long they wait, you can find the details in the boto3 documentation on waiters; for many of them it works out to 600 seconds (many EC2 waiters, for example, poll every 15 seconds for up to 40 attempts). Each one is configurable to be as short or as long as you’d like.

Writing your own custom waiters.

As you can see, using boto3 waiters is an easy way to set up a loop that waits for completion without having to write the code yourself. But how do you find out if a specific waiter exists? The easiest way is to explore the particular boto3 client on the docs page and check out the list of waiters at the bottom. Let’s walk through the anatomy of a boto3 waiter. Waiters are actually implemented in botocore and then exposed through boto3, so by looking at the code there we can derive what’s needed to create our own waiter:

  1. Waiter Name
  2. Waiter Config
    1. Delay
    2. Max Attempts
    3. Operation
    4. Acceptors
  3. Waiter Model

The first step is to name your custom waiter. You’ll want something descriptive; in our example, it will be “CertificateIssued”. This waiter waits for an ACM certificate to be issued. (Note that a CertificateValidated waiter already exists; this example is only to showcase how a waiter is created.) Next, we pick out the configuration for the waiter, which boils down to four parts. Delay is the number of seconds to wait between attempts. Max Attempts is how many attempts it will make before it fails. Operation is the boto3 client operation you’re calling to get the result you’re testing; in our example, that’s “DescribeCertificate”. Acceptors define how we test the result of the Operation call. Acceptors are probably the most complicated portion of the configuration: they determine how to match the response and what result to return. Acceptors have four parts: Matcher, Expected, Argument, and State.

  • State: This is what the acceptor will return based on the result of the matcher function.
  • Expected: This is the expected response that you want from the matcher to return this result.
  • Argument: This is the expression (a JMESPath expression, in botocore) evaluated against the response and handed to the matcher to determine whether the result is expected.
  • Matcher: Matchers come in five flavors: Path, PathAll, PathAny, Status, and Error. The Status and Error matchers check the HTTP status code of the response and check for a specific error, respectively. They are typically paired with a failure state so the waiter short-circuits, and you don’t have to wait until the end of the time period when the command has already failed. The Path matcher compares the result of the Argument with a single expected value. In our example, DescribeCertificate returns a “Certificate.Status” field. Using that as the argument, the desired expected result is “ISSUED”. We also add an acceptor that sets the state to “retry” when the status is “PENDING_VALIDATION”, so the waiter keeps trying until it gets the result we want. The PathAny/PathAll matchers work with arguments that evaluate to a list: PathAny matches if any item in the list matches, and PathAll matches if all the items in the list match.
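To make the acceptor rules concrete, here is a small, self-contained Python simulation of how path, pathAll, and pathAny acceptors pick a state. It is an illustration only, not botocore's implementation: the hypothetical resolve helper understands just dotted paths plus a simple [] list projection, whereas botocore evaluates Argument as a full JMESPath expression, and the Status and Error matchers are not simulated.

```python
def resolve(data, path):
    """Follow a dotted path such as 'Certificate.Status' through nested dicts.

    A segment ending in '[]' (e.g. 'Snapshots[].State') applies the rest of
    the path to every item of that list, a tiny subset of JMESPath projection.
    """
    parts = path.split(".")
    value = data
    for i, part in enumerate(parts):
        if not isinstance(value, dict):
            return None
        if part.endswith("[]"):
            items = value.get(part[:-2])
            if not isinstance(items, list):
                return None
            rest = ".".join(parts[i + 1:])
            return [resolve(item, rest) if rest else item for item in items]
        value = value.get(part)
    return value


def evaluate(response, acceptors):
    """Return the state of the first acceptor that matches, else 'retry'.

    botocore keeps polling when no acceptor matches a non-error response,
    which is what the 'retry' fallback stands in for here.
    """
    for acceptor in acceptors:
        actual = resolve(response, acceptor["argument"])
        expected = acceptor["expected"]
        matcher = acceptor["matcher"]
        if matcher == "path":
            matched = actual == expected
        elif matcher == "pathAll":
            # An empty list never satisfies pathAll.
            matched = isinstance(actual, list) and len(actual) > 0 and all(v == expected for v in actual)
        elif matcher == "pathAny":
            matched = isinstance(actual, list) and any(v == expected for v in actual)
        else:
            raise ValueError(f"matcher not simulated in this sketch: {matcher}")
        if matched:
            return acceptor["state"]
    return "retry"


CERT_ACCEPTORS = [
    {"matcher": "path", "argument": "Certificate.Status", "expected": "ISSUED", "state": "success"},
    {"matcher": "path", "argument": "Certificate.Status", "expected": "PENDING_VALIDATION", "state": "retry"},
    {"matcher": "path", "argument": "Certificate.Status", "expected": "FAILED", "state": "failure"},
]

print(evaluate({"Certificate": {"Status": "PENDING_VALIDATION"}}, CERT_ACCEPTORS))  # retry
```

Acceptors are checked in order and the first match wins, which is why a failure acceptor can end the wait early instead of burning through the remaining attempts.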

Once the configuration is complete, we wrap it in a WaiterModel and pass that, along with the waiter name and a client, to create_waiter_with_client. Now our custom waiter is ready to wait. If you need more examples of waiters and how they are configured, check out the botocore GitHub repository and poke through the various services. Services that have waiters define them in a file called waiters-2.json. Here’s the finished code for our custom waiter.
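Here is a minimal sketch of the CertificateIssued waiter described above, built with botocore's WaiterModel and create_waiter_with_client. The function names and the delay and attempt values (30 seconds, 60 attempts) are assumptions chosen for illustration, and the botocore import is deferred into the function so the configuration can be built and inspected without botocore installed.

```python
def certificate_issued_config(delay=30, max_attempts=60):
    """Waiter definition in the same shape as botocore's waiters-2.json files."""
    status = "Certificate.Status"  # field returned by DescribeCertificate
    return {
        "version": 2,
        "waiters": {
            "CertificateIssued": {
                "operation": "DescribeCertificate",
                "delay": delay,
                "maxAttempts": max_attempts,
                "acceptors": [
                    {"matcher": "path", "argument": status, "expected": "ISSUED", "state": "success"},
                    {"matcher": "path", "argument": status, "expected": "PENDING_VALIDATION", "state": "retry"},
                    {"matcher": "path", "argument": status, "expected": "FAILED", "state": "failure"},
                ],
            }
        },
    }


def wait_for_certificate(acm_client, certificate_arn):
    """Block until the ACM certificate reports ISSUED, or raise WaiterError."""
    # Deferred import: only this function needs botocore.
    from botocore.waiter import WaiterModel, create_waiter_with_client

    model = WaiterModel(certificate_issued_config())
    waiter = create_waiter_with_client("CertificateIssued", model, acm_client)
    waiter.wait(CertificateArn=certificate_arn)
```

Calling wait_for_certificate(boto3.client("acm"), certificate_arn) polls DescribeCertificate every 30 seconds; it returns once the status is ISSUED and raises botocore.exceptions.WaiterError if the status becomes FAILED or the 60 attempts run out.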

And that’s it. Custom waiters let you chain automation steps together without having to build redundant code or complicated loops. Have questions about writing custom waiters or boto3? Contact us.

-Coin Graham, Principal Cloud Consultant


Using Docker Containers to Move Your Internal IT Orgs Forward

Many people are looking to take advantage of containers to isolate their workloads on a single system. Unlike traditional hypervisor-based virtualization, in which each virtual machine runs its own full guest operating system, containers share the host’s kernel while letting you segment off multiple applications, each with its own set of processes, on the same instance.

Let’s walk through some grievances that many of us have faced at one time or another in our IT organizations:

Say, for example, your development team is setting up a web application. They want to set up a traditional 3-tier system with app, database, and web servers. They notice there is a lot of support in the open source community for their app when it is run on Ubuntu Trusty (Ubuntu 14.04 LTS) and later. They’ve developed the app in their local sandbox with an Ubuntu image they downloaded; however, their company is a Red Hat shop.

Now, depending on the type of environment you’re in, chances are you’ll have to wait for the admins to provision an environment for you. This often entails (but is not limited to) spinning up an instance, reviewing the most stable version of the OS, creating a new hardened AMI, adding it to Packer, figuring out which configs to manage, and refactoring provisioning scripts to use aptitude and Ubuntu’s directory structure (e.g., Debian has over 50K packages to choose from and manage). In addition, the most stable version of Ubuntu is missing some newer packages that you’ve tested in your sandbox, which need to be pulled from source or another repository. At this point, the developers are procuring configuration runbooks to support the app while the admin gets up to speed with the OS (not significant, but time-consuming nonetheless).

You can see my point here. A significant amount of overhead has been introduced, and it’s stagnating development. And think about the poor sysadmins. They have other environments that they need to secure, networking spaces to manage, operations to improve, and existing production stacks they have to monitor and support while getting bogged down supporting this app that is still in the very early stages of development. This could mean that mission-critical apps are potentially losing visibility and application modernization is stagnating. Nobody wins in this scenario.

Now let us revisit the same scenario with containers:

I was able to run my Jenkins build server and an NGINX web proxy, both on a hardened CentOS 7 AMI with Docker installed, provided by the Systems Engineers. From there, I executed a docker pull command pointed at our local repository and deployed two Docker images with Debian as the underlying OS.

$ docker pull my.docker-repo.com:4443/jenkins
$ docker pull my.docker-repo.com:4443/nginx


$ docker ps

7478020aef37 my.docker-repo.com:4443/jenkins/jenkins:lts   "/sbin/tini -- /us…"  16 minutes ago   Up 16 minutes  8080/tcp,>80/tcp, 50000/tcp jenkins

d68e3b96071e my.docker-repo.com:4443/nginx/nginx:lts   "nginx -g 'daemon of…"  16 minutes ago   Up 16 minutes  >80/tcp,>443/tcp nginx

$ sudo systemctl status jenkins-docker

jenkins-docker.service – Jenkins
Loaded: loaded (/etc/systemd/system/jenkins-docker.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-11-08 17:38:06 UTC; 18min ago
Process: 2006 ExecStop=/usr/local/bin/jenkins-docker stop (code=exited, status=0/SUCCESS)

The processes above were executed on the actual instance. Note how I’m able to cat the OS release file from within the container:

$ sudo docker exec d68e3b96071e cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION="9 (stretch)"

I was able to do so because Docker containers do not have their own kernel; they share the kernel of the underlying host and reach it through Linux system calls (e.g., setuid, stat, umount) like any other application. These system calls (or syscalls for short) are largely stable across kernel versions, and Docker requires a host kernel of version 3.10 or higher. In the event older syscalls are deprecated and replaced with new ones, you can update the kernel of the underlying host, which can be done independently of an OS upgrade. As far as containers go, the binaries and apt package management tools are the same as if you had installed Ubuntu on an EC2 instance (or VM).
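If you want to confirm that a Linux host meets the 3.10 minimum before installing Docker, a small check like the following works. It is a sketch: the function name is hypothetical, and it compares only the major and minor version numbers taken from platform.release(), as numbers rather than strings.

```python
import platform
import re


def kernel_at_least(minimum=(3, 10), release=None):
    """Return True if the kernel release (default: this host's) is >= minimum (major, minor)."""
    if release is None:
        release = platform.release()  # e.g. '3.10.0-957.el7.x86_64' on CentOS 7
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Compare as integer tuples so that 3.2 correctly sorts below 3.10.
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)
```

Comparing integer tuples matters here: a plain string comparison would wrongly rank "3.2" above "3.10".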

Q: But I’m running a Windows environment. Those OSes don’t use a Linux kernel.

Yes, and developers may want to remove the cost overhead associated with Windows licenses by exploring running their apps on a Linux OS. Others may simply want to modernize their .NET applications by testing out the latest versions in containers. Docker lets you run Linux containers on Windows 10 and Windows Server 2016. Because Docker was originally written to run on Linux distributions, taking advantage of multitenant hosting requires Hyper-V containers, which provision a thin VM on top of your host. You can then manage your mixed environment of Windows and Linux containers via the --isolation option. More information can be found in the Microsoft and Docker documentation.
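On a Windows container host, the isolation mode is chosen per container with docker run --isolation, which accepts default, process, or hyperv. The Python sketch below only builds that command line; the helper name and the image names used with it are hypothetical placeholders.

```python
VALID_ISOLATION = ("default", "process", "hyperv")


def docker_run_argv(image, isolation="hyperv", name=None):
    """Build a `docker run` argv list that requests a Windows isolation mode."""
    if isolation not in VALID_ISOLATION:
        raise ValueError(f"isolation must be one of {VALID_ISOLATION}, got {isolation!r}")
    # --rm removes the container when it exits, keeping the example tidy.
    argv = ["docker", "run", "--rm", f"--isolation={isolation}"]
    if name:
        argv += ["--name", name]
    argv.append(image)
    return argv
```

Defaulting to hyperv mirrors the point above: each container gets its own thin VM, at the cost of a little more startup time than process isolation.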


IT teams need to be able to help drive the business forward. New technologies and security patches are released daily. Developers need to be able to freely work on modernizing their code and applications. Concurrently, Operations needs to be able to support and enhance the pipelines and platforms that get the code out faster and more securely. Leveraging Docker containers in conjunction with these pipelines helps ensure both are happening in parallel without unnecessary overhead. This allows teams to work independently in the early stages of the development cycle and yet more collaboratively to get releases out the door.

For help getting started leveraging your environment to take advantage of containerization, contact us.

-Sabine Blair, Systems Engineer & Cloud Consultant


Fully Coded And Automated CI/CD Pipelines: The Weeds

The Why

In my last post we went over why we’d want to go the CI/CD/Automated route and the more cultural reasons of why it is so beneficial. In this post, we’re going to delve a little bit deeper and examine the technical side of tooling. Remember, a primary point of doing a release is mitigating risk. CI/CD is all about mitigating risk… fast.

There’s a Process

The previous article noted that you can’t do CI/CD without building on a set of steps, and I’m going to take this approach here as well. Unsurprisingly, we’ll follow the steps we laid out in the “Why” article, and tackle each in turn.

Step I: Automated Testing

You must automate your testing. There is no other way to put it. In this particular step, however, we can concentrate on unit testing: testing the small chunks of code you produce (usually functions or methods). There’s some chatter in the development community about TDD (Test-Driven Development) vs. BDD (Behavior-Driven Development), but I don’t think it really matters, as long as you are writing test code alongside your production code. On our team, we prefer the BDD-style testing paradigm. I’ve always liked the semantically descriptive nature of BDD testing over strictly code-driven tests. That said, both are effective and either is better than none, so this is more of a personal preference. On our team we’ve been coding in golang, and our BDD framework of choice is the Ginkgo/Gomega combo.

Here’s a snippet of one of our tests that’s not entirely simple:

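Describe("IsValidFormat", func() {
  for _, check := range AvailableFormats {
    check := check // copy the loop variable (required before Go 1.22) so each It closure sees its own value
    Context("when checking "+check, func() {
      It("should return true", func() { Expect(IsValidFormat(check)).To(BeTrue()) })
    })
  }
  Context("when checking foo", func() {
    It("should return false", func() { Expect(IsValidFormat("foo")).To(BeFalse()) })
  })
})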

As you can see, the Ginkgo (i.e., BDD) formatting is pretty descriptive about what’s happening. I can instantly understand what’s expected: the function IsValidFormat should return true for every format in the range (list) of AvailableFormats, and a format of foo (which is not a valid format) should return false. It’s both tested and understandable to the future change agent (me or someone else).
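The spec exercises an IsValidFormat function and an AvailableFormats slice that the post doesn’t show. Here is a minimal sketch of what they might look like, assuming formats are plain strings matched exactly. The three format names are hypothetical. The main function checks the same two cases the Ginkgo spec describes:

```go
package main

import "fmt"

// AvailableFormats is a hypothetical list of supported formats; the real list isn't shown in the post.
var AvailableFormats = []string{"json", "yaml", "toml"}

// IsValidFormat reports whether format exactly matches one of AvailableFormats.
func IsValidFormat(format string) bool {
	for _, f := range AvailableFormats {
		if f == format {
			return true
		}
	}
	return false
}

func main() {
	// Mirror the spec: every listed format is valid, "foo" is not.
	for _, f := range AvailableFormats {
		fmt.Printf("%s -> %v\n", f, IsValidFormat(f))
	}
	fmt.Printf("foo -> %v\n", IsValidFormat("foo"))
}
```

In a real project, the Ginkgo spec would live in a _test.go file beside this code and run under go test or the ginkgo CLI.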

Step II: Continuous Integration

Continuous Integration takes Step I further: it brings all the changes to your codebase to a single point and builds an object for deployment. This means you’ll need an external system to automatically handle merges and pushes. We use Jenkins as our automation server, running it in Kubernetes with the Pipeline style of job description. I’ll get into how we do our builds using Make in a bit, but the fact that we can keep our build code in with our projects is a huge win.

Here’s a (modified) Jenkinsfile we use for one of our CI jobs:

def notifyFailed() {
  slackSend (color: '#FF0000', message: "FAILED: '${env.JOB_NAME} [${env.BUILD_NUMBER}]' (${env.BUILD_URL})")
}

podTemplate(
  label: 'fooProject-build',
  containers: [
    containerTemplate(
      name: 'jnlp',
      image: 'some.link.to.a.container:latest',
      args: '${computer.jnlpmac} ${computer.name}',
      alwaysPullImage: true
    ),
    containerTemplate(
      name: 'image-builder',
      image: 'some.link.to.another.container:latest',
      ttyEnabled: true,
      alwaysPullImage: true,
      command: 'cat'
    )
  ],
  volumes: [
    // Share the host's Docker daemon so the build can run tool containers
    hostPathVolume(
      hostPath: '/var/run/docker.sock',
      mountPath: '/var/run/docker.sock'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/workspace/fooProject',
      mountPath: '/home/jenkins/workspace/fooProject'
    ),
    secretVolume(
      secretName: 'jenkins-creds-for-aws',
      mountPath: '/home/jenkins/.aws-jenkins'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/.aws',
      mountPath: '/home/jenkins/.aws'
    )
  ]
) {
  node ('fooProject-build') {
    try {
      checkout scm
      wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
        container('image-builder') {
          stage('Prep') {
            // Copy AWS credentials from the Secret mount into ~/.aws, then pull tool images
            sh '''
              cp /home/jenkins/.aws-jenkins/config /home/jenkins/.aws/.
              cp /home/jenkins/.aws-jenkins/credentials /home/jenkins/.aws/.
              make get_images
            '''
          }
          stage('Unit Test'){
            sh '''
              make test
              make profile
            '''
            // Publish the Cobertura coverage report (report.xml)
            step([
              $class:              'CoberturaPublisher',
              autoUpdateHealth:    false,
              autoUpdateStability: false,
              coberturaReportFile: 'report.xml',
              failUnhealthy:       false,
              failUnstable:        false,
              maxNumberOfBuilds:   0,
              sourceEncoding:      'ASCII',
              zoomCoverageChart:   false
            ])
          }
          stage('Build and Push Container'){
            sh '''
              make push
            '''
          }
        }
      }
      // Deploy the new image to the integration environment
      container('image-builder') {
        sh '''
          make deploy_integration
          make toggle_integration_service
        '''
      }
      // Run the integration tests; clean up and rethrow if they fail
      try {
        wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
          container('image-builder') {
            sh '''
              sleep 45
              export KUBE_INTEGRATION=https://fooProject-integration
              export SKIP_TEST_SERVER=true
              make integration
            '''
          }
        }
      } catch(e) {
        container('image-builder') {
          sh '''
            make clean
          '''
        }
        throw e
      }
      stage('Deploy to Production'){
        container('image-builder') {
          sh '''
            make clean
            make deploy_dev
          '''
        }
      }
    } catch(e) {
      container('image-builder') {
        sh '''
          make clean
        '''
      }
      currentBuild.result = 'FAILED'
      notifyFailed()
      throw e
    }
  }
}

There’s a lot going on here, but the important thing to notice is that I grabbed this from the project repo. The build instructions are included with the project itself. It’s creating an artifact, running our tests, etc., but it’s all part of our project codebase. It’s checked into git. It’s code like all the other code we work with. The individual steps matter less for this discussion; what matters is that it works. We also have it set up to run whenever there’s a push to GitHub (AND nightly). This ensures that we are continuously running this build and integrating everything that has happened to the repo in a day. It helps us keep on top of all the changes to the repo as well as to our environment.

Hey… what’s all that make crap?


Our team uses a lot of tools. We subscribe to the maxim: Use what’s best for the particular situation. I can’t remember every tool we use. Neither can my teammates. Neither can 90% of the people who “do the devops.” I’ve heard a lot of folks say, “No! We must solidify on our toolset!” Let your teams use what they need to get the job done the right way. The fear of tool “overload” is a legitimate one, but the problem isn’t the number of tools… it’s how you manage and use them.

Enter Makefiles! (aka: make)

Make has been a mainstay in the UNIX world for a long time (especially in the C world). It is a build tool used to satisfy dependencies, create system-specific configurations, and compile code from various sources independent of platform. That’s fantastic, but we couldn’t care less about any of it in the context of our CI/CD pipelines. We use Make because it’s great at running “buildy” commands.

Make is our unifier. It links our Jenkins CI/CD build functionality with our dev functionality. Specifically, mounting the Docker socket here in the Jenkinsfile:

volumes: [
    hostPathVolume(
        hostPath: '/var/run/docker.sock',
        mountPath: '/var/run/docker.sock'
    )
]

…allows us to run THE SAME COMMANDS WHEN WE’RE DEVELOPING AS WE DO IN OUR CI/CD PROCESS. The socket lets the Docker CLI inside a container talk to the host’s Docker daemon, so we can run containers from containers. Since Jenkins is running in a container, this lets us run our toolset containers in Jenkins using the same commands we’d use in our local dev environment. On our local dev machines, we use Docker almost exclusively as a wrapper for our tools. This ensures we have library, version, and platform consistency across all of our dev environments as well as our build system. We use containers for our production microservices, so production is part of that “chain of consistency” as well. It ensures that we see consistent behavior from development all the way through production. It’s a beautiful thing! We use the Makefile as the means to consistently interface with the Docker “tool” across differing environments.
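In concrete terms, “Docker as a wrapper for our tools” means each tool command becomes a single docker run call. That call mounts the current directory into the tool’s container and forwards the arguments. Here is a minimal Go sketch that builds (but doesn’t run) that command with os/exec. It uses the same flags as the wrapper script the Makefile below generates (-itP and -v $PWD:/root). The ToolCommand name and the terraform image in main are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ToolCommand builds the docker invocation that runs image as a CLI tool:
// -i keeps stdin open, -t allocates a TTY, -P publishes exposed ports,
// and workdir is mounted at /root inside the container.
func ToolCommand(image, workdir string, args ...string) *exec.Cmd {
	dockerArgs := append([]string{"run", "-itP", "-v", workdir + ":/root", image}, args...)
	cmd := exec.Command("docker", dockerArgs...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		panic(err)
	}
	// Hypothetical tool image; call cmd.Run() to actually execute it.
	cmd := ToolCommand("terraform", wd, "version")
	fmt.Println(cmd.Args)
}
```

Inside the Jenkins pod, the same command works because the docker CLI reaches the host’s daemon through the mounted /var/run/docker.sock. One consequence is that the -v source path is resolved on the host, not inside the Jenkins container. That is presumably why the Jenkinsfile mounts the workspace at the same path on both sides.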

OK, I know your interest is piqued at this point. (Or at least I really hope it is!)
So here’s a generic Makefile we use for many of our projects:

CONTAINER=$(shell basename $$PWD | sed -E 's/^ia-image-//')
# Recipes use bash's [[ ]] test, so run them with bash instead of /bin/sh
SHELL := /bin/bash
.PHONY: install install_exe install_test_exe deploy test

install:
	docker pull sweet.path.to.a.repo/$(CONTAINER)
	docker tag sweet.path.to.a.repo/$(CONTAINER):latest $(CONTAINER):latest

# Write a ~/bin wrapper script that runs the tool container against the current directory
install_exe:
	if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
	echo "docker run -itP -v \$$PWD:/root $(CONTAINER) \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
	chmod u+x $(HOME)/bin/$(CONTAINER)

install_test_exe:
	if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
	echo "docker run -itP -v \$$PWD:/root $(CONTAINER)-test \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
	chmod u+x $(HOME)/bin/$(CONTAINER)

test:
	docker build -t $(CONTAINER)-test .
deploy:
	captain push

This is a Makefile we use to build our tooling images. It’s much simpler than our project Makefiles, but I think it illustrates how you can use Make to wrap EVERYTHING you use in your development workflow. It also lets us settle on similar, consistent terminology across different projects. %> make test? That’ll run the tests regardless of whether we’re working on a golang project or a Python Lambda project, or, in this case, build a test container and tag it as whatever-test. Make unifies “all the things.”
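The Makefile’s first line is what makes it generic. It names the image after the repository directory and strips an ia-image- prefix, so the same targets work in every tool repo without edits. Here is the same derivation as a minimal Go sketch (ContainerName and the example paths are hypothetical):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// ContainerName mirrors `basename $PWD | sed -E 's/^ia-image-//'`:
// take the last path element and strip a leading "ia-image-" prefix once.
func ContainerName(dir string) string {
	return strings.TrimPrefix(filepath.Base(dir), "ia-image-")
}

func main() {
	for _, dir := range []string{"/src/ia-image-terraform", "/src/packer"} {
		name := ContainerName(dir)
		fmt.Printf("%s -> image %s, test image %s-test\n", dir, name, name)
	}
}
```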

The Makefile also codifies how to execute the commands, i.e., what arguments to pass, what inputs to provide, and so on. If I can’t even remember the name of a command, I’m not going to remember its arguments. Instead, I just open the Makefile and see them instantly.
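Because the Makefile doubles as the command reference, it can also answer the question “which targets can I run here?” Here is a minimal Go sketch, assuming targets are declared on .PHONY lines as in the Makefile above. It is a plain line scan that doesn’t expand variables or follow include directives, and PhonyTargets is a hypothetical name:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// PhonyTargets returns the targets declared on .PHONY lines, in order of appearance.
// Multiple .PHONY lines are merged; variables and includes are not processed.
func PhonyTargets(makefile string) []string {
	var targets []string
	sc := bufio.NewScanner(strings.NewReader(makefile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, ".PHONY:") {
			targets = append(targets, strings.Fields(strings.TrimPrefix(line, ".PHONY:"))...)
		}
	}
	return targets
}

func main() {
	// Sample input: the .PHONY line from the Makefile above plus one recipe.
	mk := ".PHONY: install install_exe install_test_exe deploy test\n\ntest:\n\tdocker build -t tool-test .\n"
	fmt.Println(PhonyTargets(mk))
}
```

Given the contents of the Makefile above (read with os.ReadFile and converted to a string), it lists install, install_exe, install_test_exe, deploy, and test.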

Step III: Continuous Deployment

After the last post (you read it, right?), some might have noticed that I skipped the “Delivery” portion of the “CD” pipeline. As far as I’m concerned, there is no “Delivery” in a “Deployment” pipeline. The “Delivery” is the actual deployment of your artifact. Since the ultimate goal should be Deployment, I’ve just skipped over that intermediate step.

Okay, sure, if you want to hold off on deploying automatically to Prod, then have that gate. But Dev, Int, QA, etc? Deployment to those non-prod environments should be automated just like the rest of your code.

If you guessed we use make to deploy our code, you’d be right! We keep all our deployment code with the project itself, just like the rest of the code concerning that particular object. For services, we use a Dockerfile that describes the service container and several YAML files (e.g. deployment_<env>.yaml) that describe the configurations (e.g. ingress, services, deployments) we use to deploy to our Kubernetes cluster.
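The rule from the last two paragraphs fits in a few lines of code. Every non-production environment deploys automatically, production may wait behind a manual gate, and each environment has its own deployment_<env>.yaml. Here is a minimal Go sketch under those assumptions. The DeployPlan type, the environment names (including “prod” as the gated one), and the deploy_<env> Make target pattern (extrapolated from the Jenkinsfile’s make deploy_dev and make deploy_integration) are all hypothetical:

```go
package main

import "fmt"

// DeployPlan describes how one environment gets deployed.
type DeployPlan struct {
	Env           string
	Manifest      string // per-environment Kubernetes config, e.g. deployment_dev.yaml
	MakeTarget    string // e.g. deploy_dev
	NeedsApproval bool   // true only for the gated production environment
}

// PlanFor builds the plan for env; "prod" is assumed to be the production name.
func PlanFor(env string) DeployPlan {
	return DeployPlan{
		Env:           env,
		Manifest:      "deployment_" + env + ".yaml",
		MakeTarget:    "deploy_" + env,
		NeedsApproval: env == "prod",
	}
}

func main() {
	for _, env := range []string{"dev", "integration", "qa", "prod"} {
		p := PlanFor(env)
		if p.NeedsApproval {
			fmt.Printf("%s: waiting for approval before make %s\n", env, p.MakeTarget)
			continue
		}
		fmt.Printf("%s: make %s (applies %s)\n", env, p.MakeTarget, p.Manifest)
	}
}
```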

Here’s an example for the dev environment:

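# apps/v1 replaces extensions/v1beta1, which Kubernetes 1.16 stopped serving for Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: sweet-aws-service
    stage: dev
  name: sweet-aws-service-dev
  namespace: sweet-service-namespace
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: sweet-aws-service
  template:
    metadata:
      labels:
        app: sweet-aws-service
      name: sweet-aws-service
    spec:
      containers:
      - name: sweet-aws-service
        image: path.to.repo.for/sweet-aws-service:latest
        imagePullPolicy: Always
        env:
          - name: PORT
            value: "50000"
          - name: TLS_KEY
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: key
          - name: TLS_CERT
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: cert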

This is an example of a deployment into Kubernetes for dev. That %> make deploy_dev from the Jenkinsfile above? That’s pushing this to our Kubernetes cluster.
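On the service side, the environment variables in that Deployment are the whole configuration contract. PORT arrives as a plain value, while TLS_KEY and TLS_CERT are filled from the key and cert entries of the grpc-tls Secret, so they carry the PEM contents themselves rather than file paths. Here is a minimal Go sketch of loading that configuration, assuming the service should fail fast when anything is missing. Config and LoadConfig are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Config holds the settings the Deployment injects as environment variables.
type Config struct {
	Port    int
	TLSKey  []byte // PEM private key, from key "key" of Secret grpc-tls
	TLSCert []byte // PEM certificate, from key "cert" of Secret grpc-tls
}

// LoadConfig reads the configuration through getenv (pass os.Getenv in the service),
// so it can be exercised without touching the real environment.
func LoadConfig(getenv func(string) string) (Config, error) {
	port, err := strconv.Atoi(getenv("PORT"))
	if err != nil || port <= 0 || port > 65535 {
		return Config{}, fmt.Errorf("invalid PORT %q", getenv("PORT"))
	}
	key, cert := getenv("TLS_KEY"), getenv("TLS_CERT")
	if key == "" || cert == "" {
		return Config{}, errors.New("TLS_KEY and TLS_CERT must both be set")
	}
	return Config{Port: port, TLSKey: []byte(key), TLSCert: []byte(cert)}, nil
}

func main() {
	// Stand-in environment so the example runs anywhere; the PEM values are placeholders.
	env := map[string]string{"PORT": "50000", "TLS_KEY": "<pem key>", "TLS_CERT": "<pem cert>"}
	cfg, err := LoadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("port:", cfg.Port)
}
```

With real PEM data, the service would then call tls.X509KeyPair(cfg.TLSCert, cfg.TLSKey) (certificate first, then key) to build its server certificate. Given the Secret’s name, that is presumably a gRPC server.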


There is a lot of information to take in here, but there are two points to really take home:

  1. It is totally possible.
  2. Use a unifying tool to… unify your tools. (“one tool to rule them all”)

For us, Point 1 is moot… it’s what we do. For Point 2, we use Make, and we use Make THROUGH THE ENTIRE PROCESS. I use Make locally in dev and on our build server. It ensures we’re using the same commands, the same containers, the same tools to do the same things. Test, integrate (test), and deploy. It’s not just about writing functional code anymore. It’s about writing a functional process to get that code, that value, to your customers!

And remember, as with anything, this stuff gets easier with practice. Once you start doing it, you’ll get the hang of it, and life becomes easier and better. If you’d like some help getting started, download our datasheet to learn about our Modern CI/CD Pipeline.

-Craig Monson, Sr Automation Architect



How We Organize Terraform Code at 2nd Watch

When IT organizations adopt infrastructure as code (IaC), the benefits in productivity, quality, and ability to function at scale are manifold. However, the first few steps on the journey to full automation and immutable infrastructure bliss can be a major disruption to a more traditional IT operations team’s established ways of working. One of the common problems faced in adopting infrastructure as code is how to structure the files within a repository in a consistent, intuitive, and scalable manner. Even IT operations teams whose members have development skills will still face this anxiety-inducing challenge, simply because adopting IaC involves new tools whose conventions differ somewhat from more familiar languages and frameworks.

In this blog post, we’ll go over how we structure our IaC repositories within 2nd Watch professional services and managed services engagements, with a particular focus on Terraform, an open-source tool by HashiCorp for provisioning infrastructure across multiple cloud providers with a single interface.

First Things First: README.md and .gitignore

The first task in any new repository is to create a README file. Many git repositories (especially on GitHub) have adopted Markdown as a de facto standard format for README files. A good README file will include the following information:

  1. Overview: A brief description of the infrastructure the repo builds. A high-level diagram is often an effective method of expressing this information. 2nd Watch uses LucidChart for general diagrams (exported to PNG or a similar format) and mscgen_js for sequence diagrams.
  2. Pre-requisites: Installation instructions (or links thereto) for any software that must be installed before building or changing the code.
  3. Building The Code: What commands to run in order to build the infrastructure and/or run the tests when applicable. 2nd Watch uses Make in order to provide a single tool with a consistent interface to build all codebases, regardless of language or toolset. If you use Make on Windows 10, we recommend the Windows Subsystem for Linux so you don’t have to write two sets of commands (Bash and PowerShell) in your Makefiles.

It’s important that you do not neglect this basic documentation for two reasons (even if you think you’re the only one who will work on the codebase):

  1. The obvious: Writing this critical information down in an easily viewable place makes it easier for other members of your organization to onboard onto your project and will prevent the need for a panicked knowledge transfer when projects change hands.
  2. The not-so-obvious: The act of writing a description of the design clarifies your intent to yourself and will result in a cleaner design and a more coherent repository.

All repositories should also include a .gitignore file with the appropriate settings for Terraform. GitHub’s default Terraform .gitignore is a decent starting point, but in most cases you will not want to ignore .tfvars files because they often contain environment-specific parameters that allow for greater code reuse as we will see later.

Terraform Roots and Multiple Environments

A Terraform root is the unit of work for a single terraform apply command. We group our infrastructure into multiple terraform roots in order to limit our “blast radius” (the amount of damage a single errant terraform apply can cause).

  • Repositories with multiple roots should contain a roots/ directory with a subdirectory for each root (e.g. VPC, one per application), each with a main.tf file as the primary entry point.
  • Note that the roots/ directory is optional for repositories that only contain a single root, e.g. infrastructure for an application team which includes only a few resources that should be deployed in concert. In this case, modules/ may be placed in the same directory as main.tf.
  • Roots which are deployed into multiple environments should include an env/ subdirectory at the same level as main.tf. Each environment corresponds to a .tfvars file under env/ named after the environment, e.g. staging.tfvars. Each .tfvars file contains parameters appropriate for that environment, e.g. EC2 instance sizes.
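With that layout, each terraform apply is addressed by a (root, environment) pair, which is what keeps the blast radius to one root. Here is a minimal Go sketch that builds the matching arguments, assuming Terraform 0.14 or later for the global -chdir option (after -chdir, the -var-file path is resolved inside the root directory). ApplyArgs and the app-a root name are hypothetical, and the environment names come from the example below. The sketch deliberately leaves out how each environment gets its own state (for example, separate backend configuration or workspaces), which this post doesn’t cover:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ApplyArgs returns the terraform arguments that apply one root with one
// environment's variables, following the roots/<root>/env/<env>.tfvars layout.
// With -chdir, the -var-file path is interpreted relative to the root directory.
func ApplyArgs(root, env string) []string {
	return []string{
		"-chdir=" + path.Join("roots", root),
		"apply",
		"-var-file=" + path.Join("env", env+".tfvars"),
	}
}

func main() {
	// Hypothetical root names: one VPC root and one application root.
	for _, root := range []string{"vpc", "app-a"} {
		for _, env := range []string{"qa", "staging", "production"} {
			fmt.Println("terraform " + strings.Join(ApplyArgs(root, env), " "))
		}
	}
}
```

Wrapping these invocations in Make targets keeps the same commands in use locally and in CI, in line with the earlier posts.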

Here’s what our roots directory might look like for a sample with a VPC and 2 application stacks, and 3 environments (QA, Staging, and Production):

Terraform modules

Terraform modules are self-contained packages of Terraform configurations that are managed as a group. Modules are used to create reusable components, improve organization, and treat pieces of infrastructure as a black box. In short, they are the Terraform equivalent of functions or reusable code libraries.

Terraform modules come in two flavors:

  1. Internal modules, whose source code is consumed by roots that live in the same repository as the module.
  2. External modules, whose source code is consumed by roots in multiple repositories. The source code for external modules lives in its own repository, separate from any consumers and separate from other modules to ensure we can version the module correctly.

In this post, we’ll only be covering internal modules.

  • Each internal module should be placed within a subdirectory under modules/.
  • Module subdirectories/repositories should follow the standard module structure per the Terraform docs.
  • External modules should always be pinned at a version: a git revision or a version number. This practice allows for reliable and repeatable builds. Failing to pin module versions may let a module be updated between builds, breaking the build without any obvious change in our code. Even worse, failing to pin our module versions might cause a plan to be generated with changes we did not anticipate.
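The pinning rule is mechanical enough to check automatically. Here is a minimal Go sketch, assuming a module source takes one of three forms:

  1. A local path: an internal module, pinned by the repository’s own commit.
  2. A git source pinned with a ?ref= query. Only the git:: and github.com/ forms are recognized here.
  3. A registry address pinned by the module block’s version argument. A range constraint such as ~> 3.0 does not count as a pin.

IsPinned and the example.com URL are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// IsPinned reports whether a Terraform module reference is pinned to a fixed version.
// source and version correspond to the module block's source and version arguments.
func IsPinned(source, version string) bool {
	// Internal modules live in the same repo, so the repo's own commit pins them.
	if strings.HasPrefix(source, "./") || strings.HasPrefix(source, "../") {
		return true
	}
	// Git sources (only the git:: and github.com/ forms are recognized) pin via ?ref=.
	if strings.HasPrefix(source, "git::") || strings.HasPrefix(source, "github.com/") {
		i := strings.Index(source, "?")
		if i < 0 {
			return false
		}
		q, err := url.ParseQuery(source[i+1:])
		return err == nil && q.Get("ref") != ""
	}
	// Anything else is treated as a registry address: require one exact version,
	// optionally written as "= X.Y.Z"; range operators mean it is not pinned.
	v := strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(version), "="))
	return v != "" && !strings.ContainsAny(v, "<>~!=, ")
}

func main() {
	fmt.Println(IsPinned("git::https://example.com/vpc.git?ref=v1.2.0", "")) // true
	fmt.Println(IsPinned("git::https://example.com/vpc.git", ""))            // false
	fmt.Println(IsPinned("terraform-aws-modules/vpc/aws", "~> 3.0"))         // false
}
```

Run over every module block in a repository, a check like this turns the pinning convention into a build failure instead of a code-review reminder.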

Here’s what our modules directory might look like:

Terraform and Other Tools

Terraform is often used alongside other automation tools within the same repository. Some frequent collaborators include Ansible for configuration management and Packer for compiling identical machine images across multiple virtualization platforms or cloud providers. When using Terraform in conjunction with other tools within the same repo, 2nd Watch creates a directory per tool from the root of the repo:

Putting it all together

The following illustrates a sample Terraform repository structure with all of the concepts outlined above:


There’s no single repository format that’s optimal, but we’ve found that this standard works for the majority of our use cases in our extensive use of Terraform on dozens of projects. That said, if you find a tweak that works better for your organization – go for it! The structure described in this post will give you a solid and battle-tested starting point to keep your Terraform code organized so your team can stay productive.

Additional resources

  • The Terraform Book by James Turnbull provides an excellent introduction to Terraform all the way through repository structure and collaboration techniques.
  • The Hashicorp AWS VPC Module is one of the most popular modules in the Terraform Registry and is an excellent example of a well-written Terraform module.
  • The source code for James Nugent’s Hashidays NYC 2017 talk code is an exemplary Terraform repository. Although it’s based on an older version of Terraform (before providers were broken out from the main Terraform executable), the code structure, formatting, and use of Makefiles is still current.

For help getting started adopting Infrastructure as Code, contact us.

-Josh Kodroff, Associate Cloud Consultant