4 Unexpected Customer Insights Uncovered Through Analytics

Analytics can uncover valuable information about your customers, allowing you to connect on a deeper, more personal level.

A customer insight is a piece of information or metric that helps a business better understand how their customers think or behave. Unlike just a few years ago, businesses don’t need to rely on a stodgy market research firm to gain these insights. Today’s most successful companies are digging deep into their own datasets — past superficial metrics like gender, age, and location — to uncover valuable knowledge that was unattainable until very recently.

Here are a few key examples.

Eloquii Discovers New E-Commerce Revenue Streams

Because e-commerce companies exist in such a data-rich world, it makes sense that they’d be ahead of the curve in terms of using analytics to gain new insights. That’s exactly how Eloquii, a fast-fashion house catering to plus-size women, has solved several of its marketing problems.

After noticing that customers were returning white dresses at a higher proportion than other products, Eloquii’s marketing department dug into its data and discovered that many of those customers had actually bought multiple dresses, with the intention of using one of them as a wedding dress. That unexpected insight enabled Eloquii to have a more effective conversation with its customers around those products and better serve their needs.

According to Eloquii VP of Marketing, Kelly Goldston, the company also relies on analytics to anticipate customer behavior and tailor their marketing efforts to proactively engage each of the brand’s customer profiles, such as customers who indicate a potential high lifetime value and those who have started to shop less frequently at the site.

DirecTV Uses Customer Insight to Create Double-Digit Conversion Boost

Satellite media provider DirecTV used data to uncover an underserved portion of its customer base – those who had recently moved. The company discovered that, statistically, people who have recently moved are more likely to try new products and services, especially within the first seven days after the move.

Armed with this information, and change of address data from the U.S. Postal Service, DirecTV created a special version of their homepage that would appear only for people who had recently moved. Not only did the targeted campaign result in a double-digit conversion improvement of the homepage, it did so with a reduced offer compared to the one on the standard website.

Whirlpool Uses Customer Insight to Drive Positive Social Change

While analyzing customer data, Whirlpool discovered that 1 in 5 children in the U.S. lack access to clean clothes, and that not having clean laundry directly contributes to school absenteeism and increases the risk of dropping out. This further predisposes these children to a variety of negative outcomes as adults, including a 70% increased risk of unemployment.

To help stop this vicious cycle, Whirlpool created the Care Counts Laundry Program, which installs washers and dryers in schools with high numbers of low-income students. The machines are outfitted with data collection devices, enabling the Whirlpool team to record laundry usage data for each student and correlate their usage with their attendance and performance records.

The program has yielded dramatic results, including a 90% increase in student attendance, an 89% improvement in class participation, and a 95% increase in extracurricular activity participation among target students. As a result of its success, the program has attracted interest from over 1000 schools. It’s also drawn support from other organizations like Teach for America, which partnered with Whirlpool on the initiative for the 2017/2018 school year.

Prudential Better Serves its Customers with Data-Driven Insight

Financial services firms are leading adopters of data analytics technology, and Prudential has established itself as one of the forward thinkers in the field. In August of this year, the company announced the launch of a completely new marketing model built on customer insights gleaned from analytics and machine learning.

A central part of that initiative is the Prudential LINK platform, a direct-to-consumer investing service that allows customers to create a detailed profile, set and track personal financial goals, and get on-demand human assistance through a video chat. The LINK platform not only provides a more convenient customer experience, it also gives the Prudential team access to customer data they can use to make optimizations to other areas, such as the new PruFast Track system, which uses data to streamline the normally tedious insurance underwriting process.

Quality Customer Insights Have Become Vital to Business Success

As customers grow used to data-driven marketing, businesses will be forced to approach prospects with customized messages, or run the risk of losing competitive advantage. Research from Salesforce shows that 52% of customers are either extremely likely or likely to switch brands if a company doesn’t personalize communication with them.

2nd Watch helps organizations uncover high-value insights from their data. If you’re looking to get more insights from your data or just want to ask one of our analytics experts a question, send us a message. We’re happy to help.


Amazon Forecast: Best Practices

In part one of this article, we offered an overview of Amazon Forecast and how to use it. In part two, we get into Amazon Forecast best practices:

Know your business goal

In our data and analytics practice, business value comes first. We want to know and clarify use cases before we talk about technology. Using Amazon Forecast is no different. When creating a forecast, do you want to make sure you always have enough inventory on hand? Or do you want to make sure that all your inventory gets used all the time? The answer will drive which “quantile” you look at.

Each quantile – the defaults are 10%, 50%, and 90% – is important for its own reasons, and they should be looked at together to give a range. What is the 50% quantile? The actual value has a 50% chance of coming in above the forecast at this quantile and a 50% chance of coming in below it. The forecast at the 90% quantile has a 90% chance of being higher than the actual value, while the forecast at the 10% quantile has only a 10% chance of being higher. So, if you want to make sure you sell all your inventory, use the 10% quantile forecast.
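
If you generate forecasts programmatically, you choose the quantiles when you create the forecast. Here is a minimal boto3 sketch (the predictor ARN and forecast name are placeholders):

import boto3

# Forecast control-plane client; assumes AWS credentials are already configured.
forecast = boto3.client("forecast")

# Request the 10%, 50%, and 90% quantiles explicitly (these are also the defaults).
forecast.create_forecast(
    ForecastName="demand_forecast_p10_p50_p90",          # placeholder
    PredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/demand_predictor",  # placeholder
    ForecastTypes=["0.10", "0.50", "0.90"],
)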

Use related time series

Amazon has made it so easy to use related time series in Forecast that you have nothing to lose by adding one to make your forecast more robust. All you have to do is use the same time units as your target time series.

One way to create a related dataset is to use categorical or binary data whose future values are already known – for example, whether the future time is on a weekend or a holiday or there is a concert playing – anything that is on a schedule that you can rely on.

Even if you don’t know if something will happen, you can create multiple forecasts where you vary the future values. For example, if you want to forecast attendance at a baseball game this Sunday, and you want to model the impact of weather, you could create a feature is_raining and try one forecast with “yes, it’s raining” and another with “no, it’s not raining.”
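
As a sketch of that what-if approach, you could build one related time series file per scenario with pandas (column names and values here are illustrative) and generate a forecast from each:

import pandas as pd

# Hypothetical related time series covering the forecast horizon (hourly, one item).
future = pd.DataFrame({
    "item_id": ["stadium"] * 24,
    "timestamp": pd.date_range("2020-06-07", periods=24, freq="H").strftime("%Y-%m-%d %H:%M:%S"),
})

# Scenario A: it rains all day. Scenario B: it stays dry.
for scenario, is_raining in [("rain", 1), ("dry", 0)]:
    df = future.assign(is_raining=is_raining)
    df.to_csv(f"related_{scenario}.csv", index=False)  # stage each file for import to S3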

Look at a range of forecasted values, not a singular forecasted value

Don’t expect the numbers to be precise. One of the biggest values from a forecast is knowing what the likely range of actual values will be. Then, take some time to analyze what drives that range. Can it be made smaller (more accurate) with more related data? If so, can you control any of that related data?

Visualize the results

Show historical and forecast values on one chart. This will give you a sense of how the forecast is trending. You can backfill the chart with actuals as they come in, so you can learn more about your forecast’s accuracy.
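
Here is a minimal matplotlib sketch, assuming you have exported the forecast to a CSV with p10/p50/p90 columns (file and column names are placeholders):

import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("history.csv", parse_dates=["timestamp"])            # actuals
forecast = pd.read_csv("forecast_export.csv", parse_dates=["timestamp"])   # exported forecast

fig, ax = plt.subplots()
ax.plot(history["timestamp"], history["demand"], label="actuals")
ax.plot(forecast["timestamp"], forecast["p50"], label="p50 forecast")
# Shade the p10-p90 band so the likely range is visible at a glance.
ax.fill_between(forecast["timestamp"], forecast["p10"], forecast["p90"], alpha=0.3, label="p10-p90 range")
ax.legend()
plt.show()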

Choose a “medium term” time horizon

Your time horizon – how far into the future your forecast looks – can be at most 500 timesteps or ⅓ the length of your time series data, whichever is smaller. We recommend starting with a horizon of up to 10% of your series length. This will give you enough forward-looking forecasts to evaluate the usefulness of your results without taking too long.
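
As a quick sanity check on those limits, assuming two years of daily data:

n_timesteps = 730                              # e.g. two years of daily observations
max_horizon = min(500, n_timesteps // 3)       # the service limit described above
starter_horizon = max(1, n_timesteps // 10)    # our ~10% rule of thumb
print(max_horizon, starter_horizon)            # 243, 73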

Save your data prep code

Save the code you use to stage your data for the forecast. You will be doing this again, and you don’t want to repeat yourself. An efficient way to do this is to write PySpark code inside a SageMaker notebook. If you end up using your forecast in production, you will eventually move that code into a Glue ETL pipeline (which also uses PySpark), so it is best to use PySpark from the start.

Another advantage of using PySpark is that its utilities for reading and writing CSV-formatted data to and from S3 are dead simple, and you will be using CSV for all of your Forecast work.
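
For example, here is a minimal PySpark sketch for staging a CSV in S3 (bucket, paths, and column names are placeholders, and it assumes your Spark environment – e.g., a SageMaker notebook or Glue job with S3 access – can read and write S3 paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("forecast-data-prep").getOrCreate()

# Read raw data, reshape it into Forecast's expected columns, and write CSV back to S3.
raw = spark.read.csv("s3://my-bucket/raw/sales/", header=True, inferSchema=True)
staged = raw.selectExpr("sku as item_id", "sold_at as timestamp", "units as target_value")
staged.coalesce(1).write.mode("overwrite").csv("s3://my-bucket/forecast/target/", header=True)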

Interpret the results!

The guide to interpreting results is here, but admittedly it is a little dense if you are not a statistician. One easy metric to look at, especially if you use multiple algorithms, is Root Mean Squared Error (RMSE). You want this to be as low as possible and, in fact, Amazon will choose its winning algorithm largely on this value.
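
If you want to sanity-check RMSE yourself against a holdout set, it is only a couple of lines (the numbers below are made up):

import numpy as np

actuals = np.array([120, 95, 130, 110])
predicted = np.array([115, 100, 128, 105])   # e.g. the p50 forecast for the same periods
rmse = np.sqrt(np.mean((actuals - predicted) ** 2))
print(round(rmse, 2))                        # ~4.44 here; lower is better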

It will take some time

How long will it take? If you select AutoML, expect model training to take a while – at least 20 minutes for even the smallest datasets. If your dataset is large, it can take an hour or several hours. The same is true when you generate the actual forecast. So, start it at the beginning of the day so you can work with it before lunch, or near the end of your day so you can look at it in the morning.

Data prep details (for your data engineer)

  • Match the ‘forecast frequency’ to the frequency of your observation timestamps.
  • Set the demand datatype to a float prior to import (it might be an integer).
  • Get comfortable with `strptime` and `strftime` – you have only two options for timestamp format.
  • Assume all data are from the same time zone. If they are not, make them that way. Use python datetime methods.
  • Split out a validation set like this: https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/1.Getting_Data_Ready.ipynb
  • If using pandas DataFrames, do not write the index when saving to CSV (see the sketch after this list).
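
Here is a small pandas sketch of those last few points (file and column names are placeholders):

import pandas as pd

df = pd.read_csv("raw_demand.csv")

# Forecast only accepts two timestamp formats; normalize everything to yyyy-MM-dd HH:mm:ss
# and to a single time zone (UTC here).
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True).dt.strftime("%Y-%m-%d %H:%M:%S")

# Demand must be a float, not an integer or string.
df["target_value"] = df["target_value"].astype(float)

# Do not write the pandas index into the CSV.
df.to_csv("target_time_series.csv", index=False)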

Conclusion

If you’re ever asked to produce a forecast or predict some number in the future, you now have a robust method at your fingertips to get there. With Amazon Forecast, you have access to Amazon.com’s optimized algorithms for time series forecasting. If you can get your target data into CSV format, then you can use a forecast. Before you start, have a business goal in mind – it is essential to think about ranges of possibilities rather than a discrete number. And be sure to keep in mind our best practices for creating a forecast, such as using a “medium term” time horizon, visualizing the results, and saving your data preparation code.

If you’re ready to make better, data-driven decisions, trust your dashboards and reports, confidently bring in new sources for enhanced analysis, create a culture of DataOps, and become AI-ready, contact us to schedule a demo of our DataOps Foundation.

-Rob Whelan, Practice Director, Data & Analytics


How to Use Amazon Forecast for your Business

How to use Amazon Forecast: What Is it Good For?

How many times have you been asked to predict revenue for next month or next quarter? Do you mostly rely on your gut? Have you ever been asked to support your numbers? Cue sweaty palms frantically churning out spreadsheets.

Maybe you’ve suffered from the supply chain “bullwhip” effect: you order too much inventory, which makes your suppliers hustle, only to deliver a glut of product that you won’t need to replace for a long time, which makes your suppliers sit idle.

Wouldn’t it be nice to plan for your supply chain as tightly as Amazon.com does? With Amazon Forecast, you can do exactly that. In part one of this two-part article, I’ll provide an overview of the Amazon Forecast service and how to get started. Part two of the article will focus on best practices for using Amazon Forecast.

Amazon Forecast: The backstory

Amazon knows a thing or two about inventory planning, given its intense focus on operations. Over the years, it has used multiple algorithms for accurate forecasting. It even fine-tuned them to run in an optimized way on its cloud compute instances. Forecasting demand is important, if nothing else to get a “confidence interval” – a range where it’s fairly certain reality will fall, say, 80% of the time.

In true Amazon Web Services fashion, Amazon decided to provide its forecasting service for sale in Amazon Forecast, a managed service that takes your time series data in CSV format and spits out a forecast into the future. Amazon Forecast gives you a customizable confidence interval that you can set to 95%, 90%, 80%, or whatever percentage you need. And, you can re-use and re-train the model with actuals as they come in.

When you use Amazon Forecast, you can tell it to run up to five different state-of-the-art algorithms and pick a winner. This saves you the time of deliberating over which algorithm to use.

The best part about Amazon Forecast is that you can make the forecast more robust by adding in “related” time series – any data that you think is correlated to your forecast. For example, you might be predicting electricity demand based on macro scales such as season, but also on a micro level such as whether or not it rained that day.

Amazon Forecast: How to use

Amazon Forecast is a serverless service: you don’t have to manage any compute instances to use it. Because it is serverless, you can run multiple scenarios simultaneously – up to three at once – rather than working through them in series. Additionally, Amazon Forecast is low-cost, so it is worth trying and experimenting with often. As is generally the case with AWS, you end up paying mostly for the underlying compute and storage, rather than any major premium for using the service. Like any other machine learning task, you have a huge advantage if you have invested in keeping your data orderly and accessible.

Here is a general workflow for using Amazon Forecast:

  1. Create a Dataset Group. This is just a logical container for all the datasets you’re going to use to create your predictor.
  2. Import your source datasets. A nice thing here is that Amazon Forecast facilitates the use of different “versions” of your datasets. As you go about feature engineering, you are bound to create different models which will be based on different underlying datasets. This is absolutely crucial for the process of experimentation and iteration.
  3. Create a predictor. This is another way of saying “create a trained model on your source data.”
  4. Create a forecast using the predictor. This is where you actually generate a forecast looking into the future (a minimal code sketch of the full workflow follows this list).
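
Here is a minimal boto3 sketch of those four steps (names, ARNs, schema, and the S3 path are placeholders, and the waits between steps – each resource must finish creating before the next one can use it – are omitted):

import boto3

forecast = boto3.client("forecast")

# 1. Dataset group: a logical container for your datasets.
dsg = forecast.create_dataset_group(DatasetGroupName="demand_dsg", Domain="CUSTOM")

# 2. Dataset, associated with the group, plus an import job pointing at the CSV in S3.
ds = forecast.create_dataset(
    DatasetName="demand_ts",
    Domain="CUSTOM",
    DatasetType="TARGET_TIME_SERIES",
    DataFrequency="D",
    Schema={"Attributes": [
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
    ]},
)
forecast.update_dataset_group(DatasetGroupArn=dsg["DatasetGroupArn"], DatasetArns=[ds["DatasetArn"]])
forecast.create_dataset_import_job(
    DatasetImportJobName="demand_import",
    DatasetArn=ds["DatasetArn"],
    DataSource={"S3Config": {
        "Path": "s3://my-bucket/forecast/target/",                       # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastRole",        # placeholder
    }},
    TimestampFormat="yyyy-MM-dd HH:mm:ss",
)

# 3. Predictor: train a model; with AutoML, several algorithms compete.
predictor = forecast.create_predictor(
    PredictorName="demand_predictor",
    ForecastHorizon=30,
    PerformAutoML=True,
    InputDataConfig={"DatasetGroupArn": dsg["DatasetGroupArn"]},
    FeaturizationConfig={"ForecastFrequency": "D"},
)

# 4. Forecast: generate future values once the predictor has finished training.
forecast.create_forecast(ForecastName="demand_forecast", PredictorArn=predictor["PredictorArn"])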

To get started, stage your time series data in a CSV file in S3. You have to follow AWS’s naming conventions for the column names. You can also optionally use your domain knowledge to enrich the data with “related time series”: if you think external factors drive the forecast, add those data series, too. You can add multiple complementary time series.
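
For example, a target time series CSV for the custom domain might look like this (values are made up):

item_id,timestamp,target_value
sku_123,2020-01-01,42.0
sku_123,2020-01-02,37.0
sku_456,2020-01-01,12.0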

When your datasets are staged, you create a Predictor. A Predictor is just a trained machine learning model. If you choose the “AutoML” option, Amazon will make up to five algorithms compete. It will save the results of all of the models that trained successfully (sometimes an algorithm clashes with the underlying data).

Finally, when your Predictor is done training, you can generate a forecast. The results are stored in S3, where they can easily be shared with your organization or pulled into any business intelligence tool. It’s always a good idea to visualize the results to give them a reality check.

In part two of this article, we’ll dig into best practices for using Amazon Forecast. And if you’re interested in learning even more about transforming your organization to be more data-driven, check out our DataOps Foundation service that helps you transform your data analytics processes.

-Rob Whelan, Practice Director, Data & Analytics


AWS re:Invent 2019: AWS Product/Service Review, a Networking Perspective

Announcements for days!

AWS re:Invent 2019 has come and gone, and now the collective audience has to sort through the massive list of AWS announcements released at the event.  According to the AWS re:Invent 2019 Recap communication, AWS released 77 products, features and services in just 5 days!  Many of the announcements were in the Machine Learning (ML) space (20 total), closely followed by announcements around Compute (16 total), Analytics (6 total), Networking and Content Delivery (5 total), and AWS Partner Network (5 total), amongst others.   In the area of ML, things like AWS DeepComposer, Amazon SageMaker Studio, and Amazon Fraud Detector topped the list.  While in the Compute, Analytics, and Networking space, Amazon EC2 Inf1 Instances, AWS Local Zones, AWS Outposts, Amazon Redshift Data lake, AWS Transit Gateway Network Manager, and Inter-Region Peering were at the forefront. Here at 2nd Watch we love the cutting-edge ML feature announcements like everyone else, but we always have our eye on those announcements that key-in on what our customers need now – announcements that can have an immediate benefit for our customers in their ongoing cloud journey.

All About the Network

In Matt Lehwess’ presentation, Advanced VPC design and new capabilities for Amazon VPC, he kicked off the discussion with a poignant note: “Networking is the foundation of everything, it’s how you build things on AWS, you start with an Amazon VPC and build up from there. Networking is really what underpins everything we do in AWS. All the services rely on Networking.” This statement strikes a chord here at 2nd Watch, as we have seen that sentiment in action. Over the last couple of years, our customers have been accelerating their use of VPCs, and, as of 2018, Amazon VPC is the number one AWS service used by our customers, with 100% of them using it. We look for that same trend to continue as 2019 comes to an end. It’s not the sexiest part of AWS, but networking provides the foundation that brings all of the other services together. So, focusing on newer and more efficient networking tools and architectures to get services to communicate is always at the top of the list when we look at new announcements. Here are our takes on these key announcements.

AWS Transit Gateway Inter-Region Peering (Multi-Region)

One exciting feature announcement in the networking space is Inter-Region Peering for AWS Transit Gateway. This feature allows you to establish peering connections between Transit Gateways in different AWS Regions. Previously, connectivity between two Transit Gateways could only be achieved through a Transit VPC, which carried the overhead of running your own networking devices as part of that VPC. Inter-region peering for AWS Transit Gateway enables you to remove the Transit VPC and connect Transit Gateways directly.

The solution uses a new static attachment type called a Transit Gateway Peering Attachment that, once created, requires acceptance or rejection from the accepter Transit Gateway. In the future, AWS will likely allow dynamic attachments, so they advise assigning unique ASNs to each Transit Gateway now for the easiest transition. The solution also uses encrypted VPC peering across the AWS backbone. Currently, Transit Gateway inter-region peering is available for gateways in the US East (Virginia), US East (Ohio), US West (Oregon), EU (Ireland), and EU (Frankfurt) AWS Regions, with support for other regions coming soon. You also can’t peer Transit Gateways in the same region.

(Source: Matt Lehwess: Advanced VPC design and new capabilities for Amazon VPC (NET305))
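
As a rough sketch of the create-then-accept flow described above, using boto3 (gateway IDs, the account ID, and regions are placeholders):

import boto3

# Requester side: create the peering attachment from the us-west-2 Transit Gateway.
ec2_west = boto3.client("ec2", region_name="us-west-2")
attachment = ec2_west.create_transit_gateway_peering_attachment(
    TransitGatewayId="tgw-0123456789abcdef0",      # placeholder
    PeerTransitGatewayId="tgw-0fedcba9876543210",  # placeholder
    PeerAccountId="123456789012",                  # placeholder
    PeerRegion="us-east-1",
)["TransitGatewayPeeringAttachment"]

# Accepter side: the peer Transit Gateway must explicitly accept the attachment.
ec2_east = boto3.client("ec2", region_name="us-east-1")
ec2_east.accept_transit_gateway_peering_attachment(
    TransitGatewayAttachmentId=attachment["TransitGatewayAttachmentId"]
)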

On the surface the ability to connect two Transit Gateways is just an incremental additional feature, but when you start to think of the different use cases as well as the follow-on announcement of Multi-Region Transit Gateway peering and Accelerated VPN solutions, the options for architecture really open up.  This effectively enables you to create a private and highly-performant global network on top of the AWS backbone.  Great stuff!

AWS Transit Gateway Network Manager

This new feature is used to centrally monitor your global network across AWS and on premises. Transit Gateway Network Manager simplifies the operational complexity of managing networks across regions and remote locations. It is another AWS feature that takes a dashboard approach, providing a simpler overview of resources that may be spread over several regions and accounts. To use it, you create a Global Network within the tool – an object in the AWS Transit Gateway Network Manager service that represents your private global network in AWS. It includes your AWS Transit Gateway hubs, their attachments, and your on-premises devices, sites, and links. Once the Global Network is created, you extend the configuration by adding Transit Gateways, information about your on-premises devices, sites, links, and the Site-to-Site VPN connections with which they are associated, and then start using it to visualize and monitor your network. It includes a nice geographic world map view to visualize VPNs (whether they’re up, down, or impaired) and Transit Gateway peering connections.

(Image: AWS Transit Gateway Network Manager geographic view – source: https://d1.awsstatic.com/re19/gix/gorgraphic.cdb99cd59ba34015eccc4ce5eb4b657fdf5d9dd6.png)

There’s also a nice Topology feature that shows VPCs, VPNs, Direct Connect gateways, and AWS Transit Gateway-AWS Transit Gateway peering for all registered Transit gateways.  It provides an easier way to understand your entire global infrastructure from a single view.

Another key feature is the integration with SD-WAN providers like Cisco, Aviatrix, and others. Many of these solutions will integrate with AWS Transit Gateway Network Manager and automate the branch-cloud connectivity and provide end-to-end monitoring of the global network from a single dashboard. It’s something we look forward to exploring with these SD-WAN providers in the future.

AWS Local Zones

AWS Local Zones is an interesting new service that addresses challenges we’ve encountered with customers. Although listed under Compute rather than Networking and Content Delivery on the re:Invent 2019 announcement list, Local Zones is a powerful new feature with networking at its core.

Latency tolerance for application stacks running in a hybrid scenario (i.e. app servers in AWS, database on-prem) is a standard conversation when planning a migration. Historically, those conversations were predicated on the customer’s proximity to an AWS Region. Depending on requirements, customers in Portland, Oregon might have the option to run a hybrid application stack, while those in Southern California might have been excluded. The announcement of Local Zones (initially just in Los Angeles) opens up those options to markets that were not previously available. I hope this is the first of many localized resource deployments.

That’s no Region…that’s a Local Zone

Local Zones are interesting in that they only offer a subset of the services available in a standard region. Local Zones are organized as children of a parent region; notably, the Los Angeles Local Zone is a child of the Oregon Region. API communication is done through Oregon, and even the name of the LA Local Zone AZ maps to Oregon (Oregon AZ1 = us-west-2a, Los Angeles AZ1 = us-west-2-lax-1a). Organizationally, it’s easiest to think of them as remote Availability Zones of existing regions.

As of December 2019, only a limited number of services is available, including EC2, EBS, FSx, ALB, VPC and single-zone RDS. Pricing seems to be roughly 20% higher than in the parent region. Given that this is the first Local Zone, we don’t know whether this will always be true or if it depends on location. One would assume that Los Angeles would be a higher-cost location whether it was a Local Zone or a full region.

All the Things

To see all of the things that were launched at re:Invent 2019, you can check out the re:Invent 2019 Announcement Page. For all AWS announcements, not just re:Invent 2019 launches (e.g. things that launched just prior to re:Invent), check out the What’s New with AWS webpage. If you missed the show completely or just want to re-watch your favorite AWS presenters, you can see many of the re:Invent presentations on the AWS Events YouTube Channel. After you’ve done all that research and watched all those videos and are ready to get started, you can always reach out to us at 2nd Watch. We’d love to help!

-Derek Baltazar, Managing Consultant

-Travis Greenstreet, Principal Architect


Serverless Aurora – Is it Production-Ready Yet?

In the last few months, AWS has made several announcements around its Aurora offering, such as:

All of these features work towards the end goal of making serverless databases a production-ready solution. Even with the latest offerings, should you explore migrating to a serverless architecture? This blog highlights some considerations when looking to use Backend-as-a-Service (BaaS) at your data layer.

Aurora Models

Let’s assume that you’ve either already made the necessary schema changes and migrated, or that you have a general familiarity with implementing a new database on Aurora Classic. Aurora currently comes in two models – Provisioned and Serverless. A traditional provisioned AWS database either runs on a self-managed EC2 instance or operates as a PaaS offering on an AWS-managed RDS instance. In both cases, you have to allocate memory and CPU, and create security groups so applications can reach the database over its TCP connection string.

In this pattern, issues can arise right at the connection. There are limits to how many connections a database can handle before you start to see performance degradation, or an inability to connect altogether once the limit is maxed out. In addition to that, your application may also receive varying degrees of traffic (e.g., a retail application used during a peak season or promotion). Even if you implement a caching layer in front, such as Memcache or Redis, you still have scenarios where the instance will eventually have to scale either vertically to a more robust instance or horizontally with replicas to distribute reads and writes.

This area is where serverless provides some value. It’s worth recalling that a serverless database does not mean no servers. There are servers there, but they are abstracted away from the user (or in this case the application). Following recent compute trends, serverless lets you focus more on writing business logic and less on infrastructure management and provisioning, so you can get from the requirements stage to production-ready faster. In the traditional database model, you are still responsible for securing the box, authentication, encryption, and other operations unrelated to the actual business functions.

How Aurora Serverless works

What serverless Aurora provides to help alleviate issues with scaling and connectivity is a Backend as a Service solution. The application and Aurora instance must be deployed in the same VPC and connect through endpoints that go through a network load balancer (NLB). Doing so allows for connections to terminate at the load balancer and not at the application.

By abstracting the connections, you no longer have to write logic to manage load-balancing algorithms or worry about making DNS changes to accommodate database endpoint changes. The NLB has routing logic through request routers that make the connection to whichever instance is available at the time, which then maps to the underlying serverless database storage. If the serverless database needs to scale up, a pool of resources is always available and kept warm. In the event the instances scale down to zero, a connection cannot persist.

By having an available pool of warm instances, you now have a pay-as-you-go model where you pay for what you utilize. You can still run into the issue of max connections, which can’t be modified, but the number allowed for smaller 2 and 4 ACU implementations has increased since the initial release.

Note: Cooldowns are not instantaneous and can take up to 5 minutes after the instance is entirely idle, and you are still billed for that time. Also, even though the instances are kept warm, the connection to those instances still has to be initiated. If you make a query to the database during that time, you can see wait times of 25 seconds or more before the query fully executes.
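
Those scaling and pause behaviors are configured when the cluster is created. Here is a minimal boto3 sketch (identifiers, credentials, and capacity values are illustrative placeholders; in practice you would pull the password from Secrets Manager):

import boto3

rds = boto3.client("rds")

rds.create_db_cluster(
    DBClusterIdentifier="orders-serverless",     # placeholder
    Engine="aurora",                             # Aurora Serverless v1, MySQL-compatible
    EngineMode="serverless",
    MasterUsername="admin",                      # placeholder
    MasterUserPassword="change-me-please",       # placeholder; use Secrets Manager in practice
    ScalingConfiguration={
        "MinCapacity": 2,                        # ACUs; the floor discussed above
        "MaxCapacity": 16,
        "AutoPause": True,                       # allow the cluster to pause when idle
        "SecondsUntilAutoPause": 300,            # the ~5 minute idle window discussed above
    },
)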

Cost considerations

Can you really scale down completely? Technically yes, if certain conditions are met:

  • CPU below 30 percent utilization
  • Less than 40 percent of connections being used

To achieve this and get the cost savings, the database must be completely idle. There can’t be long-running queries or locked tables. Also, activities outside of the application – open sessions, monitoring tools, health checks, and so on – can generate queries. The database only pauses when the conditions are met AND there is zero activity.

Serverless Aurora, at $0.06 per ACU-hour, starts at a higher price than its provisioned predecessor at $0.041 per hour. Aurora Classic also bills hourly, whereas Serverless Aurora bills by the second with a 5-minute minimum AND a 5-minute cool-down period. We already discussed that cool-downs in many cases are not instantaneous, and on top of that, billing doesn’t stop until an additional 5 minutes after that period. If you go with the traditional minimal setup of 2 ACUs and never scale the instances down, the cost is higher by a factor of at least 3x. Therefore, to get the same cost payoff, your database would have to run only about a third of the time – achievable for dev/test boxes that are parked, or for apps only used during business hours in a single time zone. Serverless Aurora is supposed to be highly available by default, so if you are getting two instances at this price point, then you are getting a better bargain performance-wise than a single provisioned instance, at an only slightly higher price point.
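
To make that comparison concrete, here is a quick back-of-the-envelope calculation using the prices above (assuming the provisioned figure is an hourly instance rate):

hours_per_month = 730
serverless = 2 * 0.06 * hours_per_month    # 2 ACUs that never pause: ~$87.60/month
provisioned = 0.041 * hours_per_month      # comparable provisioned instance: ~$29.93/month
print(round(serverless / provisioned, 1))  # roughly 2.9x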

The new 1 ACU minimum gives you the option of scaling a serverless database down further and makes the price point more comparable to RDS even without enabling pausing.

Migration and Data API

Migrating to Serverless Aurora is relatively simple, as you can just load in a snapshot from an existing database. With the Data API, you no longer need a persistent connection to query the database. In previous scenarios, a fetch could take 25 seconds or more if the query executed after a cool-down period. With the Data API, you can query the serverless database even if it’s been idle for some time. You can also leverage a Lambda function via API Gateway, which works around the VPC requirement. AWS has mentioned it will be publishing performance metrics on the average time it takes to execute a query with the Data API in the coming months.
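
As a sketch, querying through the Data API with boto3 looks like this (the cluster and secret ARNs, database, and table are placeholders):

import boto3

rds_data = boto3.client("rds-data")

result = rds_data.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:orders-serverless",  # placeholder
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db",  # placeholder
    database="orders",
    sql="SELECT id, status FROM orders WHERE created_at > :since",
    parameters=[{"name": "since", "value": {"stringValue": "2019-12-01"}}],
)
print(result["records"])  # rows come back as structured JSON, no persistent connection needed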

Conclusion

With the creation of EC2, Docker, and Lambda functions, we’ve seen more innovation in the area of compute and not as much on the data layer. Traditional provisioned relational databases have difficulties scaling and have a finite limit on the number of connections. By eliminating the need for an instance, this level of abstraction presents a strong use case for unpredictable workloads. Kudos to AWS for engineering a solution at this layer.

The latest updates over these last few months underscore AWS’ willingness to solve complex problems. Running 1 ACU does bring the cost down to a rate comparable to RDS while providing a mechanism for better performance if you disable pauses. However, while it is now possible to run Aurora Serverless 24/7 more cost-effectively, this scenario runs contrary to its signature use case of an on/off database.

Serverless still seems a better fit for databases that are rarely used and only see occasional spikes, or for applications primarily used during business hours. Administration time is still a cost, and serverless databases, despite the progress, still have many unknowns. It can take an administrator some time and patience to truly get a configuration that is performant, highly available, and not overly expensive. Even though you don’t have to rely on automation and can manually scale your Aurora Serverless cluster, it takes some effort to do so in a way that doesn’t immediately terminate the connections.

Today, you can leverage ECS or Fargate with spot instances and implement a solution that yields similar or better results at a lower cost if a truly serverless data layer is the goal. I would still recommend Serverless Aurora for dev/test workloads, and then see if you can work your way up to production for smaller workloads, as the tool still provides much value. Hopefully, AWS releases GA offerings for MySQL 5.7 and Postgres soon.

Want more tips and info on Serverless Aurora or serverless databases? Contact our experts.

-Sabine Blair, Cloud Consultant


AWS re:Invent 2018: Daily Recap – Wednesday

Every year AWS re:Invent gets bigger and better. There are more people attending and even more who will participate remotely than any previous year. There are also more vendors showing the strength of the AWS ecosystem.

You realized why when Andy Jassy started his keynote session Wednesday morning. The growth rate of AWS is phenomenal. Adoption is up, revenues are up, and AWS responds with customer-driven changes. Three years ago, there were fewer than 100 AWS services out there, and now, with yesterday’s announcements, there are more than 140. Jassy discussed a lot at the keynote, but the focus was on three major themes:

Storage/Database

The first theme was around Storage/Database, with services such as Amazon FSx, which provides fully managed file systems like FSx for Windows File Server. This is like Amazon EFS, but instead of supporting the NFS protocol it supports the SMB protocol, so those running workloads on Windows now have a shared filesystem. If you need a file system for a High Performance Computing cluster, FSx also supports Lustre. I would look for more protocols and services in the future.

FSx was just the tip of the iceberg, with new options like DynamoDB Read/Write Capacity On Demand, another storage tier for Glacier called Deep Archive, a time-series database named Timestream, a fully managed ledger database (QLDB), and even a Managed Blockchain service. Read more about these from AWS:

Glacier Deep Archive
Amazon FSx for Windows File Servers
Amazon FSx for Lustre
DynamoDB Read/Write Capacity On Demand
Amazon Timestream
Amazon Quantum Ledger Database
Amazon Managed Blockchain

Security

The second theme was around Security.  It surprises no one that AWS is always expanding their offerings in this space.  They are fond of saying that security is Job One at AWS.  Two interesting announcements here were AWS Control Tower and AWS Security Hub. These will assist in many aspects of managing your AWS accounts and increasing your security posture across your entire AWS account footprint.

Machine Learning/Artificial Intelligence

The final theme was around Machine Learning/Artificial Intelligence. We see a lot of effort being put into AWS’ Machine Learning and Artificial Intelligence solutions, and it shows in the number of announcements this year. New SageMaker offerings, Elastic Inference, and even their own specialized chip all point to a focus in this area.

Amazon Elastic Inference
AWS Inferentia
Amazon SageMaker Ground Truth
AWS Marketplace for machine learning
Amazon SageMaker RL
AWS DeepRacer

Amazon Textract
Amazon Personalize
Amazon Forecast

And we can’t forget the cool toy of the show – DeepRacer. Like Amazon DeepLens from last year, this “toy” car will help you explore machine learning. It has sensors and compute onboard, so you can teach it how to drive. There’s even a DeepRacer League, where you can compete for a trophy at AWS re:Invent 2019!

Outposts

Although not one of the three main themes, and not available until 2019, AWS Outposts was another exciting announcement yesterday. Want to run your own “region” in your datacenter? Take a look at this. It is fully managed, maintained, and supported infrastructure for your datacenter. It comes in two variants: 1) VMware Cloud on AWS Outposts, which lets you use the same VMware control plane and APIs you already use to run your infrastructure, and 2) the AWS-native variant of AWS Outposts, which lets you use the exact same APIs and control plane you use in the AWS cloud, but on-premises.

If you can’t come to the cloud, it can come to you.

Sessions and Events

There are more sessions than ever at this year’s re:Invent, and the conference agenda is full of interesting and useful events and demos. It’s always great to know that, even if you missed a session, you can stream it on-demand later on the AWS re:Invent YouTube channel. And we can’t forget the expo hall, which has been very heavily-trafficked. If you haven’t yet, stop by and see 2nd Watch in booth 2440. We’re giving away one more of those awesome Amazon DeepLens cameras we mentioned earlier in this post. This year’s re:Invent shows that AWS is bigger and better than ever!

David Nettles – Solutions Architect


Fully Coded And Automated CI/CD Pipelines: The Weeds

The Why

In my last post, we went over why we’d want to go the CI/CD/automated route and the more cultural reasons why it is so beneficial. In this post, we’re going to delve a little deeper and examine the technical side of tooling. Remember, a primary point of doing a release is mitigating risk. CI/CD is all about mitigating risk… fast.

There’s a Process

The previous article noted that you can’t do CI/CD without building on a set of steps, and I’m going to take this approach here as well. Unsurprisingly, we’ll follow the steps we laid out in the “Why” article, and tackle each in turn.

Step I: Automated Testing

You must automate your testing. There is no other way to describe this. In this particular step, however, we can concentrate on unit testing: testing the small chunks of code you produce (usually functions or methods). There’s some chatter about TDD (Test Driven Development) vs. BDD (Behavior Driven Development) in the development community, but I don’t think it really matters, just so long as you are writing test code alongside your production code. On our team, we prefer the BDD style testing paradigm. I’ve always liked the semantically descriptive nature of BDD tests over strictly code-driven ones. However, it should be said that both are effective and either is better than none, so this is more of a personal preference. On our team we’ve been coding in golang, and our BDD framework of choice is the Ginkgo/Gomega combo.

Here’s a snippet of one of our tests that’s not entirely simple:

Describe("IsValidFormat", func() {
  for _, check := range AvailableFormats {
    Context("when checking "+check, func() {
      It("should return true", func() {
        Ω(IsValidFormat(check)).To(BeTrue())
      })
    })
  }
 
  Context("when checking foo", func() {
    It("should return false", func() {
      Ω(IsValidFormat("foo")).To(BeFalse())
    })
  })
})

Describe("IsValidFormat", func() { for _, check := range AvailableFormats { Context("when checking "+check, func() { It("should return true", func() { Ω(IsValidFormat(check)).To(BeTrue()) }) }) } Context("when checking foo", func() { It("should return false", func() { Ω(IsValidFormat("foo")).To(BeFalse()) }) }) )

So as you can see, the Ginkgo (i.e. BDD) formatting is pretty descriptive about what’s happening. I can instantly understand what’s expected. The function IsValidFormat should return true for each entry in the range (list) of AvailableFormats. A format of foo (which is not a valid format) should return false. It’s both tested and understandable to the future change agent (me or someone else).

Step II: Continuous Integration

Continuous Integration takes Step 1 further, in that it brings all the changes to your codebase together at a single point and builds an artifact for deployment. This means you’ll need an external system to automatically handle merges / pushes. We use Jenkins as our automation server, running it in Kubernetes using the Pipeline style of job description. I’ll get into the way we do our builds using Make in a bit, but the fact that we can include our build code in with our projects is a huge win.

Here’s a (modified) Jenkinsfile we use for one of our CI jobs:

def notifyFailed() {
  slackSend (color: '#FF0000', message: "FAILED: '${env.JOB_NAME} [${env.BUILD_NUMBER}]' (${env.BUILD_URL})")
}
 
podTemplate(
  label: 'fooProject-build',
  containers: [
    containerTemplate(
      name: 'jnlp',
      image: 'some.link.to.a.container:latest',
      args: '${computer.jnlpmac} ${computer.name}',
      alwaysPullImage: true,
    ),
    containerTemplate(
      name: 'image-builder',
      image: 'some.link.to.another.container:latest',
      ttyEnabled: true,
      alwaysPullImage: true,
      command: 'cat'
    ),
  ],
  volumes: [
    hostPathVolume(
      hostPath: '/var/run/docker.sock',
      mountPath: '/var/run/docker.sock'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/workspace/fooProject',
      mountPath: '/home/jenkins/workspace/fooProject'
    ),
    secretVolume(
      secretName: 'jenkins-creds-for-aws',
      mountPath: '/home/jenkins/.aws-jenkins'
    ),
    hostPathVolume(
      hostPath: '/home/jenkins/.aws',
      mountPath: '/home/jenkins/.aws'
    )
  ]
)
{
  node ('fooProject-build') {
    try {
      checkout scm
 
      wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
        container('image-builder'){
          stage('Prep') {
            sh '''
              cp /home/jenkins/.aws-jenkins/config /home/jenkins/.aws/.
              cp /home/jenkins/.aws-jenkins/credentials /home/jenkins/.aws/.
              make get_images
            '''
          }
 
          stage('Unit Test'){
            sh '''
              make test
              make profile
            '''
          }
 
          step([
            $class:              'CoberturaPublisher',
            autoUpdateHealth:    false,
            autoUpdateStability: false,
            coberturaReportFile: 'report.xml',
            failUnhealthy:       false,
            failUnstable:        false,
            maxNumberOfBuilds:   0,
            sourceEncoding:      'ASCII',
            zoomCoverageChart:   false
          ])
 
          stage('Build and Push Container'){
            sh '''
              make push
            '''
          }
        }
      }
 
      stage('Integration'){
        container('image-builder') {
          sh '''
            make deploy_integration
            make toggle_integration_service
          '''
        }
        try {
          wrap([$class: 'AnsiColorBuildWrapper', 'colorMapName': 'XTerm']) {
            container('image-builder') {
              sh '''
                sleep 45
                export KUBE_INTEGRATION=https://fooProject-integration
                export SKIP_TEST_SERVER=true
                make integration
              '''
            }
          }
        } catch(e) {
          container('image-builder'){
            sh '''
              make clean
            '''
          }
          throw(e)
        }
      }
 
      stage('Deploy to Production'){
        container('image-builder') {
          sh '''
            make clean
            make deploy_dev
          '''
        }
      }
    } catch(e) {
      container('image-builder'){
        sh '''
          make clean
        '''
      }
      currentBuild.result = 'FAILED'
      notifyFailed()
      throw(e)
    }
  }
}


There’s a lot going on here, but the important part to notice is that I grabbed this from the project repo. The build instructions are included with the project itself. It’s creating an artifact, running our tests, etc. But it’s all part of our project code base. It’s checked into git. It’s code like all the other code we mess with. The steps are somewhat inconsequential for this level of topic, but it works. We also have it setup to run when there’s a push to github (AND nightly). This ensures that we are continuously running this build and integrating everything that’s happened to the repo in a day. It helps us keep on top of all the possible changes to the repo as well as our environment.

Hey… what’s all that make crap?

Make

Our team uses a lot of tools. We subscribe to the maxim: use what’s best for the particular situation. I can’t remember every tool we use. Neither can my teammates. Neither can 90% of the people that “do the devops.” I’ve heard a lot of folks say, “No! We must solidify on our toolset!” Let your teams use what they need to get the job done the right way. Now, the fear of experiencing tool “overload” seems like a legitimate one in this scenario, but the problem isn’t the number of tools… it’s how you manage and use them.

Enter Makefiles! (aka: make)

Make has been a mainstay in the UNIX world for a long time (especially in the C world). It is a build tool that’s utilized to help satisfy dependencies, create system-specific configurations, and compile code from various sources independent of platform. This is fantastic, except, we couldn’t care less about that in the context of our CI/CD Pipelines. We use it because it’s great at running “buildy” commands.

Make is our unifier. It links our Jenkins CI/CD build functionality with our Dev functionality. Specifically, opening up the docker port here in the Jenkinsfile:

volumes: [
  hostPathVolume(
    hostPath: '/var/run/docker.sock',
    mountPath: '/var/run/docker.sock'
  ),


…allows us to run THE SAME COMMANDS WHEN WE’RE DEVELOPING AS WE DO IN OUR CI/CD PROCESS. This socket allows us to run containers from containers, and since Jenkins is running on a container, this allows us to run our toolset containers in Jenkins, using the same commands we’d use in our local dev environment. On our local dev machines, we use docker nearly exclusively as a wrapper to our tools. This ensures we have library, version, and platform consistency on all of our dev environments as well as our build system. We use containers for our prod microservices so production is part of that “chain of consistency” as well. It ensures that we see consistent behavior across the horizon of application development through production. It’s a beautiful thing! We use the Makefile as the means to consistently interface with the docker “tool” across differing environments.

Ok, I know your interest is piqued at this point. (Or at least I really hope it is!)
So here’s a generic makefile we use for many of our projects:

CONTAINER=$(shell basename $$PWD | sed -E 's/^ia-image-//')
.PHONY: install install_exe install_test_exe deploy test
 
install:
    docker pull sweet.path.to.a.repo/$(CONTAINER)
    docker tag sweet.path.to.a.repo/$(CONTAINER):latest $(CONTAINER):latest
 
install_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER) \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
install_test_exe:
    if [[ ! -d $(HOME)/bin ]]; then mkdir -p $(HOME)/bin; fi
    echo "docker run -itP -v \$$PWD:/root $(CONTAINER)-test \"\$$@\"" > $(HOME)/bin/$(CONTAINER)
    chmod u+x $(HOME)/bin/$(CONTAINER)
 
test:
    docker build -t $(CONTAINER)-test .
 
deploy:
    captain push


This is a Makefile we use to build our tooling images. It’s much simpler than our project Makefiles, but I think it illustrates how you can use Make to wrap EVERYTHING you use in your development workflow. This also allows us to settle on similar, consistent terminology between different projects. %> make test? That’ll run the tests regardless of whether we are working on a golang project or a Python Lambda project – or, in this case, building a test container and tagging it as whatever-test. Make unifies “all the things.”

This also codifies how to execute the commands – i.e., what arguments to pass, what inputs to provide, etc. If I can’t even remember the name of a command, I’m not going to remember its arguments. To remedy that, I just open up the Makefile and I can see instantly.

Step III: Continuous Deployment

After the last post (you read it, right?), some might have noticed that I skipped the “Delivery” portion of the “CD” pipeline. As far as I’m concerned, there is no “Delivery” in a “Deployment” pipeline. The “Delivery” is the actual deployment of your artifact. Since the ultimate goal should be Deployment, I’ve just skipped over that intermediate step.

Okay, sure, if you want to hold off on deploying automatically to Prod, then have that gate. But Dev, Int, QA, etc? Deployment to those non-prod environments should be automated just like the rest of your code.

If you guessed we use make to deploy our code, you’d be right! We put all our deployment code with the project itself, just like the rest of the code concerning that particular object. For services, we use a Dockerfile that describes the service container and several yaml files (e.g. deployment_<env>.yaml) that describe the configurations (e.g. ingress, services, deployments) we use to configure and deploy to our Kubernetes cluster.

Here’s an example:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: sweet-aws-service
    stage: dev
  name: sweet-aws-service-dev
  namespace: sweet-service-namespace
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sweet-aws-service
      name: sweet-aws-service
    spec:
      containers:
      - name: sweet-aws-service
        image: path.to.repo.for/sweet-aws-service:latest
        imagePullPolicy: Always
        env:
          - name: PORT
            value: "50000"
          - name: TLS_KEY
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: key
          - name: TLS_CERT
            valueFrom:
              secretKeyRef:
                name: grpc-tls
                key: cert


This is an example of a deployment into Kubernetes for dev. That %> make deploy_dev from the Jenkinsfile above? That’s pushing this to our Kubernetes cluster.

Conclusion

There is a lot of information to take in here, but there are two points to really take home:

  1. It is totally possible.
  2. Use a unifying tool to… unify your tools. (“one tool to rule them all”)

For us, Point 1 is moot… it’s what we do. For Point 2, we use Make, and we use Make THROUGH THE ENTIRE PROCESS. I use Make locally in dev and on our build server. It ensures we’re using the same commands, the same containers, the same tools to do the same things. Test, integrate (test), and deploy. It’s not just about writing functional code anymore. It’s about writing a functional process to get that code, that value, to your customers!

And remember, as with anything, this stuff gets easier with practice. Once you start doing it, you will get the hang of it, and life becomes easier and better. If you’d like some help getting started, download our datasheet to learn about our Modern CI/CD Pipeline.

-Craig Monson, Sr Automation Architect

 


How We Organize Terraform Code at 2nd Watch

When IT organizations adopt infrastructure as code (IaC), the benefits in productivity, quality, and ability to function at scale are manifold. However, the first few steps on the journey to full automation and immutable infrastructure bliss can be a major disruption to a more traditional IT operations team’s established ways of working. One of the common problems faced in adopting infrastructure as code is how to structure the files within a repository in a consistent, intuitive, and scalable manner. Even IT operations teams whose members have development skills will still face this anxiety-inducing challenge, simply because adopting IaC involves new tools whose conventions differ somewhat from more familiar languages and frameworks.

In this blog post, we’ll go over how we structure our IaC repositories within 2nd Watch professional services and managed services engagements with a particular focus on Terraform, an open-source tool by Hashicorp for provisioning infrastructure across multiple cloud providers with a single interface.

First Things First: README.md and .gitignore

The first task in any new repository is to create a README file. Many git repositories (especially on GitHub) have adopted Markdown as a de facto standard format for README files. A good README file will include the following information:

  1. Overview: A brief description of the infrastructure the repo builds. A high-level diagram is often an effective method of expressing this information. 2nd Watch uses LucidChart for general diagrams (exported to PNG or a similar format) and mscgen_js for sequence diagrams.
  2. Pre-requisites: Installation instructions (or links thereto) for any software that must be installed before building or changing the code.
  3. Building The Code: What commands to run in order to build the infrastructure and/or run the tests when applicable. 2nd Watch uses Make in order to provide a single tool with a consistent interface to build all codebases, regardless of language or toolset. If using Make in Windows environments, Windows Subsystem for Linux is recommended for Windows 10 in order to avoid having to write two sets of commands in Makefiles: one in Bash and one in PowerShell.

It’s important that you do not neglect this basic documentation for two reasons (even if you think you’re the only one who will work on the codebase):

  1. The obvious: Writing this critical information down in an easily viewable place makes it easier for other members of your organization to onboard onto your project and will prevent the need for a panicked knowledge transfer when projects change hands.
  2. The not-so-obvious: The act of writing a description of the design clarifies your intent to yourself and will result in a cleaner design and a more coherent repository.

All repositories should also include a .gitignore file with the appropriate settings for Terraform. GitHub’s default Terraform .gitignore is a decent starting point, but in most cases you will not want to ignore .tfvars files because they often contain environment-specific parameters that allow for greater code reuse as we will see later.
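As a concrete starting point, a Terraform .gitignore along these lines (adapted from GitHub’s template, with the usual *.tfvars exclusion deliberately omitted) works well:

    # Local .terraform directories and plugin caches
    **/.terraform/*

    # State files and backups (state should live in a remote backend, not git)
    *.tfstate
    *.tfstate.*

    # Crash logs
    crash.log

    # Note: *.tfvars is intentionally NOT ignored here, because env/*.tfvars
    # files hold per-environment parameters we want under version control.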

Terraform Roots and Multiple Environments

A Terraform root is the unit of work for a single terraform apply command. We group our infrastructure into multiple terraform roots in order to limit our “blast radius” (the amount of damage a single errant terraform apply can cause).

  • Repositories with multiple roots should contain a roots/ directory with a subdirectory for each root (e.g. VPC, one per application), each with a main.tf file as its primary entry point.
  • Note that the roots/ directory is optional for repositories that only contain a single root, e.g. infrastructure for an application team which includes only a few resources which should be deployed in concert. In this case, modules/ may be placed in the same directory as main.tf.
  • Roots which are deployed into multiple environments should include an env/ subdirectory at the same level as main.tf. Each environment corresponds to a tfvars file under env/ named after the environment, e.g. staging.tfvars. Each .tfvars file contains parameters appropriate for its environment, e.g. EC2 instance sizes.

Here’s what our roots directory might look like for a sample repository with a VPC, 2 application stacks, and 3 environments (QA, Staging, and Production):
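The original post shows a screenshot here; the layout below is an illustrative reconstruction, with app1 and app2 standing in for the two application stacks:

    roots/
      vpc/
        main.tf
        env/
          qa.tfvars
          staging.tfvars
          prod.tfvars
      app1/
        main.tf
        env/
          qa.tfvars
          staging.tfvars
          prod.tfvars
      app2/
        (same layout as app1)

To build the staging environment for app1, for example, you would run terraform plan and terraform apply from roots/app1 with -var-file=env/staging.tfvars.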

Terraform modules

Terraform modules are self-contained packages of Terraform configurations that are managed as a group. Modules are used to create reusable components, improve organization, and to treat pieces of infrastructure as a black box. In short, they are the Terraform equivalent of functions or reusable code libraries.

Terraform modules come in two flavors:

  1. Internal modules, whose source code is consumed by roots that live in the same repository as the module.
  2. External modules, whose source code is consumed by roots in multiple repositories. The source code for external modules lives in its own repository, separate from any consumers and separate from other modules to ensure we can version the module correctly.

In this post, we’ll only be covering internal modules.

  • Each internal module should be placed within a subdirectory under modules/.
  • Module subdirectories/repositories should follow the standard module structure per the Terraform docs.
  • External modules should always be pinned at a version: a git revision or a version number. This practice allows for reliable and repeatable builds. Failing to pin module versions may cause a module to be updated between builds, breaking the build without any obvious changes in our code. Even worse, failing to pin our module versions might cause a plan to be generated with changes we did not anticipate. (See the sketch after this list.)
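As an illustration, here’s a minimal sketch of a root consuming an internal module by relative path and an external module pinned to a version; the module names, repository URL, and tag are hypothetical:

    # Internal module: source is a relative path within the same repository.
    module "app_alb" {
      source = "../../modules/alb"
    }

    # External module: source lives in its own repository and is pinned to a
    # tag (a git revision also works) so builds are reliable and repeatable.
    module "vpc" {
      source = "git::https://github.com/example-org/terraform-aws-vpc.git?ref=v1.2.0"
    }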

Here’s what our modules directory might look like:
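Again, the original post includes a screenshot; an illustrative layout (the module names are hypothetical) following the standard module structure would be:

    modules/
      alb/
        main.tf
        variables.tf
        outputs.tf
        README.md
      ecs_service/
        main.tf
        variables.tf
        outputs.tf
        README.md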

Terraform and Other Tools

Terraform is often used alongside other automation tools within the same repository. Some frequent collaborators include Ansible for configuration management and Packer for compiling identical machine images across multiple virtualization platforms or cloud providers. When using Terraform in conjunction with other tools within the same repo, 2nd Watch creates a directory per tool from the root of the repo:
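For example, a repository that combines these tools might be laid out as follows (an illustrative sketch; the specific tools will vary by project):

    README.md
    Makefile
    ansible/
    packer/
    terraform/
      modules/
      roots/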

Putting it all together

The following illustrates a sample Terraform repository structure with all of the concepts outlined above:
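The diagram from the original post isn’t reproduced here, but a plausible end-to-end layout, using the same hypothetical names from the earlier examples, would be:

    README.md
    .gitignore
    Makefile
    ansible/
    packer/
    terraform/
      modules/
        alb/
        ecs_service/
      roots/
        vpc/
          main.tf
          env/
            qa.tfvars
            staging.tfvars
            prod.tfvars
        app1/
          main.tf
          env/
        app2/
          main.tf
          env/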

Conclusion

There’s no single repository format that’s optimal, but we’ve found that this standard works for the majority of our use cases in our extensive use of Terraform on dozens of projects. That said, if you find a tweak that works better for your organization – go for it! The structure described in this post will give you a solid and battle-tested starting point to keep your Terraform code organized so your team can stay productive.

Additional resources

  • The Terraform Book by James Turnbull provides an excellent introduction to Terraform all the way through repository structure and collaboration techniques.
  • The Hashicorp AWS VPC Module is one of the most popular modules in the Terraform Registry and is an excellent example of a well-written Terraform module.
  • The source code for James Nugent’s Hashidays NYC 2017 talk is an exemplary Terraform repository. Although it’s based on an older version of Terraform (before providers were broken out from the main Terraform executable), the code structure, formatting, and use of Makefiles are still current.

For help getting started adopting Infrastructure as Code, contact us.

-Josh Kodroff, Associate Cloud Consultant

Corrupted Stolen CPU Time

There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.”  When the guest OS requests CPU time from the host system and the host CPU is currently tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage.  This is a great feature, as it shows exactly how busy the host system is, and it is available on many AWS instances because AWS hosts them on Xen.  It is said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server, as a proactive step to ensure their systems are utilizing their resources to the fullest.

What I want to discuss here is a bug in Linux kernel versions 4.8, 4.9, and 4.10 in which the steal percentage can be corrupted during a live migration on the physical Xen server, causing the CPU utilization to be reported as 100% by the agent.

When looking at Top you will see something like this:
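The original screenshot isn’t reproduced here, but the corrupted line looks roughly like this (illustrative values only, not output from a real affected host):

    %Cpu(s):  0.3 us,  0.2 sy,  0.0 ni,  1.2 id,  0.0 wa,  0.0 hi,  0.0 si, 98.3 st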

As you can see in that Top output, the %st metric on the CPU(s) line shows an obviously incorrect number.

During a live migration on the physical Xen server, the steal time gets slightly out of sync and ends up being decremented.  If the time was already at or close to zero, it becomes negative and, due to type conversions in the code, it overflows.

CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together.  However, this only gives a partial view into your system.  With our agent, we can see what the OS sees.

What our agent shows in that case is the steal percentage spiking due to the corruption.  Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives.  If steal were legitimately high, the applications on that instance would be running much slower.

There is some discussion online about how to fix this issue, and there are kernel patches that essentially say, “if the steal time is less than zero, just make it zero.”  Eventually this fix will make it through the Linux releases and into the latest OS versions, but until then it needs to be dealt with.

We have found that a reboot will clear the corrupted percentage.  The other option is to patch the kernel… which also requires a reboot.  If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.

It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.

If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.

-James Brookes, Product Manager


CI/CD for Infrastructure as Code with Terraform and Atlantis

In this post, we’ll go over a complete workflow for continuous integration (CI) and continuous delivery (CD) for infrastructure as code (IaC) with just two tools: Terraform and Atlantis.

What is Terraform?

So what is Terraform? According to the Terraform website:

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

In practice, this means that Terraform allows you to declare what you want your infrastructure to look like – in any cloud provider – and will automatically determine the changes necessary to make it so. Because of its simple syntax and cross-cloud compatibility, it’s 2nd Watch’s choice for infrastructure as code.
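For a sense of what that declaration looks like, a few lines of Terraform are enough to describe an EC2 instance; the AMI ID and instance type below are illustrative placeholders:

    resource "aws_instance" "example" {
      ami           = "ami-0123456789abcdef0"
      instance_type = "t2.micro"
    }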

Pain You May Be Experiencing Working With Terraform

When you have multiple collaborators (individuals, teams, etc.) working on a Terraform codebase, some common problems are likely to emerge:

  1. Enforcing peer review becomes difficult. In any codebase, you’ll want to ensure that your code is peer reviewed in order to ensure better quality in accordance with The Second Way of DevOps: Feedback. The role of peer review in IaC codebases is even more important. IaC is a powerful tool, but that power is double-edged: we are clearly more productive for using it, but that increased productivity also means that a simple typo could potentially cause a major issue with production infrastructure. In order to minimize the potential for bad code to be deployed, you should require peer review on all proposed changes to a codebase (e.g. GitHub Pull Requests with at least one reviewer required). Terraform’s open source offering has no facility to enforce this rule.
  2. Terraform plan output is not easily integrated in code reviews. In all code reviews, you must examine the source code to ensure that your standards are followed, that the code is readable, that it’s reasonably optimized, etc. In this aspect, reviewing Terraform code is like reviewing any other code. However, Terraform code has the unique requirement that you must also examine the effect the code change will have upon your infrastructure (i.e. you must also review the output of a terraform plan command). When you potentially have multiple feature branches in the review process, it becomes critical that you are assured that the terraform plan output is what will be executed when you run terraform apply. If the state of infrastructure changes between a run of terraform plan and a run of terraform apply, the effect of this difference in state could range from inconvenient (the apply fails) to catastrophic (a significant production outage). Terraform itself offers locking capabilities but does not provide an easy way to integrate locking into a peer review process in its open source product.
  3. Too many sets of privileged credentials. Highly privileged credentials are often required to perform Terraform actions, and the greater the number of principals you have with privileged access, the larger your attack surface becomes. Therefore, from a security standpoint, we’d like to have fewer sets of admin credentials which can potentially be compromised.

What is Atlantis?

And what is Atlantis? Atlantis is an open source tool that allows safe collaboration on Terraform projects by making sure that proposed changes are reviewed and that the proposed change is the actual change which will be executed on your infrastructure. Atlantis is compatible (at the time of writing) with GitHub and GitLab, so if you’re not using either of these Git hosting systems, you won’t be able to use Atlantis.

How Atlantis Works With Terraform

Atlantis is deployed as a single binary executable with no system-wide dependencies. An operator adds a GitHub or GitLab token for a repository containing Terraform code. The Atlantis installation process then adds hooks to the repository which allow communication with the Atlantis server during the Pull Request process.

You can run Atlantis in a container or a small virtual machine – the only requirement is that the Atlantis instance can communicate with both your version control (e.g. GitHub) and the infrastructure (e.g. AWS) you’re changing. Once Atlantis is configured for a repository, the typical workflow is:

  1. A developer creates a feature branch in git, makes some changes, and creates a Pull Request (GitHub) or Merge Request (GitLab).
  2. The developer enters atlantis plan in a PR comment.
  3. Via the installed web hooks, Atlantis locally runs terraform plan. If there are no other Pull Requests in progress, Atlantis adds the resulting plan as a comment to the Merge Request.
    • If there are other Pull Requests in progress, the command fails because we can’t ensure that the plan will be valid once applied.
  4. The developer ensures the plan looks good and adds reviewers to the Merge Request.
  5. Once the PR has been approved, the developer enters atlantis apply in a PR comment. This will trigger Atlantis to run terraform apply and the changes will be deployed to your infrastructure.
    • The command will fail if the Merge Request has not been approved.

The following sequence diagram illustrates the sequence of actions described above:

Atlantis sequence diagram

We can see how our pain points in Terraform collaboration are addressed by Atlantis:

  1. In order to enforce code review, you can launch Atlantis with the --require-approval flag (see the sketch after this list): https://github.com/runatlantis/atlantis#approvals
  2. In order to ensure that your terraform plan accurately reflects the change to your infrastructure that will be made when you run terraform apply, Atlantis performs locking on a project or workspace basis: https://github.com/runatlantis/atlantis#locking
  3. In order to prevent creating multiple sets of privileged credentials, you can deploy Atlantis to run on an EC2 instance with a privileged IAM role in its instance profile (e.g. in AWS). In this way, all of your Terraform commands run through a single set of privileged credentials and obviate the need to distribute multiple sets of privileged credentials: https://github.com/runatlantis/atlantis#aws-credentials
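Putting points 1 and 3 together, launching the server might look roughly like this; the flag names reflect the Atlantis documentation at the time of writing (they may have changed since), and the user, token, and repository values are placeholders:

    # Run on an EC2 instance whose IAM instance profile carries the privileged
    # role, so no long-lived AWS credentials need to be handed out.
    atlantis server \
      --gh-user=atlantis-bot \
      --gh-token=$ATLANTIS_GH_TOKEN \
      --repo-whitelist='github.com/your-org/*' \
      --require-approval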

Conclusion

You can see that with minimal additional infrastructure you can establish a safe and reliable CI/CD pipeline for your infrastructure as code, enabling you to get more done safely! To find out how you can deploy a CI/CD pipeline in less than 60 days, Contact Us.

-Josh Kodroff, Associate Cloud Consultant