
5 Steps to Cloud Cost Optimization: Hurdles to Optimization are Organizational, Not Technical

In my last blog post, I covered the basics of cloud cost optimization using the Six Pillars model and focused on the ‘hows’ of optimization and the ‘whys’ of its importance. In this blog, I’d like to talk about what comes next: preparing your organization for your optimization project. The main reason most clients delay and/or avoid confronting issues regarding cloud optimization is that it’s incredibly complex. Challenges from cloud sprawl to misaligned corporate priorities can cause a project to come to a screeching halt. Understanding the challenges before you begin is essential to getting off on the right foot. Here are the 5 main challenges we’ve seen when implementing a cloud cost optimization project:

  • Cloud sprawl refers to the unrestricted, unregulated creation and use of cloud resources; cloud cost sprawl, therefore, refers to the costs incurred related to the use of each and every cloud resource (i.e., storage, instances, data transfer, etc.). This typically presents as decentralized account or subscription management.
  • Billing complexity, in this case, specifically refers to the ever-changing and variable billing practices of cloud providers and the invoices they provide you. Considering all the possible configurations across an organization, Amazon Web Services (AWS) alone has more than 500,000 SKUs you could see on any single invoice. If you cannot make sense of your bill up front, your cost optimization efforts will languish.
  • Lack of Access to Data and Application Metrics is one of the biggest barriers to entry. Cost optimization is a data-driven exercise. Without billing data and application metrics over time, many incorrect assumptions end up being made, resulting in higher costs.
  • Misaligned policies and methods can be the obstacle that will make or break your optimization project. When every team, organization or department has their own method for managing cloud resources and spend, the solution becomes more organizational change and less technology implementation. This can be difficult to get a handle on, especially if the teams aren’t on the same page with needing to optimize.
  • A lack of incentives may seem surprising (after all, who doesn’t want to save money?), yet it is the number one blocker we have seen in large enterprises to achieving optimization end goals. Central IT is laser focused on cost management, while application/business units are focused more on speed and innovation. Both goals are important, but without the right incentives, process, and communication, this fails every time. Building executive support to directly reapply realized optimization savings back to the business units to increase their application and innovation budgets is the only way to bridge misaligned priorities and build the foundation for lasting optimization motivation.

According to many cloud software vendors, waste accounts for 30% to 40% of all cloud usage. In the RightScale State of the Cloud Report 2019, a survey revealed that 35% of cloud spend is wasted. 2nd Watch has found that within large enterprise companies, there can be up to 70% savings through a combination of software and services.  It often starts by just implementing a solid cost optimization methodology.

When working on a cloud cost optimization project, it’s essential to first get the key stakeholders of your organization to agree on the benefits of optimizing your cloud spend. Once the executive team is on board and an owner is assigned, the path to optimization is clear, covering each of the 6 Pillars of Optimization.

THE PATH TO OPTIMIZATION

STEP ONE – Scope It Out!

As with any project, you first want to identify the goals and scope and then uncover the current state environment. Here are a few questions to ask to scope out your work:

  • Overall Project Goal – Are you focused on cost savings, workload optimization, uptime, performance or a combination of these factors?
  • Budget – Do you want to sync to a fiscal budget? What is the cycle? What budget do you have for upfront payments? Do you budget at an account level or organization level?
  • Current State – What number of instances and accounts do you have? What types of agreements do you have with your cloud provider(s)?
  • Growth – Do you grow seasonally, or do you have planned growth based on projects? Do you anticipate existing workloads growing or shrinking over time?
  • Measurement – How do you currently view your cloud bill? Do you have detailed billing enabled? Do you have performance metrics over time for your applications?
  • Support – Do you have owners for each application? Are people available to assess each app? Are you able to shut down apps during off hours? Do you have resources to modernize applications?

STEP TWO – Get Your Org Excited

One of the big barriers to a true optimization is gaining access to data. In order to gather the data (step 3) you first need to get the team onboard to grant you or the optimization project team access to the information.

During this step, get your cross-functional team excited about the project, share the goals and current state info you gathered in the previous step and present your strategy to all your stakeholders.

Stakeholders may include application owners, cloud account owners, IT Ops, IT security and/or developers who will have to make changes to applications.

Remember, data is key here, so find the people who own the data. Those who are monitoring applications or own the accounts are the typical stakeholders to involve. Then share with them the goals and bring them along this journey.

STEP THREE – Gather Your Data

Data is grouped into a few buckets:

  1. Billing Data – Get a clear view of your cloud bill over time.
  2. Metrics Data – CPU, I/O, Bandwidth and Memory for each application over time is essential.
  3. Application Data – Conduct interviews of application owners to understand the nuances. Graph out risk tolerance, growth potential, budget constraints and identify the current tagging strategy.

A month’s worth of data is good, though three months of data is much better to understand the capacity variances for applications and how to project into the future.

STEP FOUR – Visualize and Assess Your Usage

This step takes a bit of skill. There are tools like CloudHealth that can help you understand your cost and usage in the cloud. Then there are other tools that can help you understand your application performance over time. Using the data from each of these sources and correlating it across the pillars of optimization is essential to understanding where you can find the optimal cost savings.

I often recommend bringing in an optimization expert for this step. Someone with a data science, cloud and accounting background can help you visualize data and find the best options for optimization.

STEP FIVE – Plan Your Remediation Efforts and Get to Work!

Now that you know where you can save, take that information and build out a remediation plan. This should include addressing workloads in one or more of the pillars.

For example, you may shut down resources at night for an application and move it to another family of instances/VMs based on current pricing.

Your remediation should include changes by application as well as:

  1. RI Purchase Strategy across the business on a 1 or 3-year plan.
  2. Auto-Parking Implementation to park your resources when they’re not in use.
  3. Right-Sizing based on CPU, memory, I/O.
  4. Family Refresh or movement to the newer, more cost-effective instance families or VM-series.
  5. Elimination of Waste like unutilized instances, unattached volumes, idle load balancers, etc.
  6. Storage reassessment based on size, data transfer, retrieval time and number of retrieval requests.
  7. Tagging Strategy to track each instance/VM and track it back to the right resources.
  8. IT Chargeback process and systems to manage the process.

Remediation can take anywhere from one month to a year’s time based on organization size and the support of application teams to make necessary changes.

Download our ‘5 Steps to Cloud Cost Optimization’ infographic for a summary of this process.

End Result

With as much as 70% savings possible after implementing one of these projects, you can see the compelling reason to start. A big part of the benefits is organizational and long-lasting, including:

  • Visibility to make the right cloud spending decisions
  • Break-down of your cloud costs by business area for chargeback or showback
  • Control of cloud costs while maintaining or increasing application performance
  • Improved organizational standards to keep optimizing costs over time
  • Identification of short and long-term cost savings across the various optimization pillars

Many companies reallocate the savings to innovative projects to help their company grow. The outcome of a well-managed cloud cost optimization project can propel your organization into a focus on cloud-native architecture and application refactoring.

Though complex, cloud cost optimization is an achievable goal. By cross-referencing the 6 pillars of optimization with your organization’s policies, applications, and teams, you can quickly find savings of 30-40% and grow from there.

By addressing project risks like lack of awareness, decentralized account management, lack of access to data and metrics, and lack of clear goals, your team can quickly achieve savings.

Ready to get started with your cloud cost optimization? Schedule a Cloud Cost Optimization Discovery Session for a free 2-hour session with our team of experts.

-Stefana Muller, Sr Product Manager


Serverless Aurora – Is it Production-Ready Yet?

In the last few months, AWS has made several announcements around its Aurora offering.

These announcements all work towards the end goal of making serverless databases a production-ready solution. Even with the latest offerings, should you explore migrating to a serverless architecture? This blog highlights some considerations when looking to use Backend-as-a-Service (BaaS) at your data layer.

Let’s assume that you’ve either already made the necessary schema changes and migrated, or have a general familiarity with implementing a new database with Aurora Classic. Aurora currently comes in two models: Provisioned and Serverless. A traditional provisioned AWS database either runs on a self-managed EC2 instance or operates as a PaaS model using an AWS-managed RDS instance. In both use cases, you have to allocate memory and CPU in addition to creating security groups to allow applications to connect over a TCP connection string.

In this pattern, issues can arise right at the connection. There are limits to how many connections can access a database before you start to see performance degradation or an inability to connect altogether once the limit is maxed out. In addition, your application may receive varying degrees of traffic (e.g., a retail application used during a peak season or promotion). Even if you implement a caching layer in front, such as Memcached or Redis, you still have scenarios where the instance will eventually have to scale vertically to a more robust instance or horizontally with replicas to distribute reads and writes.

This area is where serverless provides some value. It’s worth recalling that a serverless database does not equal no servers. There are servers there, but they are abstracted away from the user (or in this case, the application). Following recent compute trends, serverless focuses more on writing business logic and less on infrastructure management and provisioning, so you can go from the requirements stage to production-ready more quickly. In the traditional database model, you are still responsible for securing the box, authentication, encryption, and other operations unrelated to the actual business functions.

How Aurora Serverless works

What serverless Aurora provides to help alleviate issues with scaling and connectivity is a Backend as a Service solution. The application and Aurora instance must be deployed in the same VPC and connect through endpoints that go through a network load balancer (NLB). Doing so allows for connections to terminate at the load balancer and not at the application.

By abstracting the connections, you no longer have to write logic to manage load balancing algorithms or worry about making DNS changes to accommodate database endpoint changes. The NLB has routing logic through request routers that make the connection to whichever instance is available at the time, which then maps to the underlying serverless database storage. If the serverless database needs to scale up, a pool of resources is always available and kept warm. In the event the instances scale down to zero, a connection cannot persist.

By having an available pool of warm instances, you now have a pay-as-you-go model where you pay for what you utilize. You can still run into the issue of max connections, which can’t be modified, but the number allowed for smaller 2 and 4 ACU implementations has increased since the initial release.

Note: Cooldowns are not instantaneous and can take up to 5 mins after the instance is entirely idle, and you are still billed for that time. Also, even though the instances are kept warm, the connection to those instances still has to initiate. If you make a query to the database during that time, you can see wait times of 25 seconds or more before the query fully executes.

Cost considerations:

Can you really scale down completely? Technically yes, if certain conditions are met:

  • CPU below 30 percent utilization
  • Less than 40 percent of connections being used

To achieve this and get the cost savings, the database must be completely idle. There can’t be long-running queries or locked tables. Also, activity outside of the application can generate queries: open sessions, monitoring tools, health checks, and so forth. The database only pauses when the conditions are met AND there is zero activity.

Serverless Aurora, at $0.06 per ACU-hour, starts at a higher price than its provisioned predecessor at $0.041 per hour. Aurora Classic also charges hourly, where Serverless Aurora charges by the second with a 5-minute minimum AND a 5-minute cool-down period. We already discussed that cool-downs in many cases are not instantaneous, and now you pile on that billing doesn’t stop until an additional 5 minutes after that period. If you go with the traditional minimal setup of 2 ACUs and never scale down the instances, the cost is more expensive by a magnitude of at least 3x. Therefore, to get the same cost payoff, your database would have to run only 1/3 of the time, which can be achievable for dev/test boxes that are parked or apps only used during business hours in a single time zone. Serverless Aurora is supposed to be highly available by default, so if you are getting two instances at this price point, then you are getting a better bargain performance-wise than running a single provisioned instance at an only slightly higher price point.

Allowing for a minimum of 1 ACU gives you the option of scaling a serverless database all the way down and makes the price point more comparable to RDS without enabling pausing.

Migration

Migrating to Serverless Aurora is relatively simple as you can just load in a snapshot from an existing database.

Data API

With Data API, you no longer need a persistent connection to query the database. In previous scenarios, a fetch could take 25 seconds or more if the query executes after a cool-down period. In this scenario, you can query the serverless database even if it’s been idle for some time. You can leverage a Lambda function via API Gateway, which works around the VPC requirement. AWS has mentioned it will be providing performance metrics around the average time it takes to execute a query with Data API in the coming months.
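As a rough illustration, here is a minimal sketch of what a Data API call can look like with boto3; the cluster ARN, secret ARN, database name, and query are placeholders you would swap for your own:

/*
import boto3

# Hypothetical identifiers; substitute your own cluster ARN, secret ARN, and database.
CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:my-serverless-cluster"
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:my-db-credentials"

rds_data = boto3.client("rds-data")

# Execute a query without holding open a persistent database connection.
response = rds_data.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database="mydb",
    sql="SELECT id, name FROM customers LIMIT 10",
)

for record in response.get("records", []):
    print(record)
*/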

Conclusion

With the creation of EC2, Docker, and Lambda functions, we’ve seen more innovation in the area of compute and not as much at the data layer. Traditional provisioned relational databases have difficulty scaling and have a finite limit on the number of connections. By eliminating the need for an instance, this level of abstraction presents a strong use case for unpredictable workloads. Kudos to AWS for engineering a solution at this layer. The latest updates over the last few months underscore AWS’s willingness to solve complex problems. Running 1 ACU does bring the cost down to a rate comparable to RDS while providing a mechanism for better performance if you disable pauses. However, while it is now possible to run Aurora Serverless 24/7 more cost-effectively, this scenario contrasts with their signature use case of having an on/off database.

Serverless still seems a better fit for databases that are rarely used and only see spikes on occasion, or for applications primarily used during business hours. Administration time is still a cost, and serverless databases, despite the progress, still have many unknowns. It can take an administrator some time and patience to truly get a configuration that is performant, highly available, and not overly expensive. Even though you don’t have to rely on automation and can manually scale your Aurora Serverless cluster, it takes some effort to do so in a way that doesn’t immediately terminate the connections. Today, you can leverage ECS or Fargate with spot instances and implement a solution that yields similar or better results at a lower cost if a true serverless database is the desired goal. I would still recommend Serverless Aurora for dev/test workloads, and see if you can work your way up to prod for smaller workloads, as the tool still provides much value. Hopefully, AWS releases GA offerings for MySQL 5.7 and Postgres soon.

Want more tips and info on Serverless Aurora or serverless databases? Contact our experts.

-Sabine Blair, Cloud Consultant


The 6 Pillars of Cloud Cost Optimization

Let me start by painting the picture: You’re the CFO. Or the manager of a department, group, or team, and you’re ultimately responsible for any and all financial costs incurred by your team/group/department. Or maybe you’re in IT and you’ve been told to keep a handle on the costs generated by application use and code development resources. Your company has moved some or all of your projects and apps to the public cloud, and since things seem to be running pretty smoothly from a production standpoint, most of the company is feeling pretty good about the transition.

Except you.

The promise of moving to the cloud to cut costs hasn’t materialized, and attempting to figure out the monthly bill from your cloud provider has you shaking your head.

Source: Amazon Web Services (AWS). “Understanding Consolidated Bills – AWS Billing and Cost Management”. (2017). Retrieved from https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/con-bill-blended-rates.html

From Reserved Instances and on-demand costs, to the “unblended” and “blended” rates, attempting to even make sense of the bill has you no closer to understanding where you can optimize your spend.

It’s not just the pricing structure that requires an entire department of accountants to make sense of; the breakdown of the services themselves is just as mind-boggling. In fact, there are at least 500,000 SKUs and price combinations in AWS alone! In addition, your team likely has no limitation on who can spin up any specific resource at any time, intrinsically compounding the problem, especially when staff leave resources running, the proverbial meter racking up the $$ in the background.

Addressing this complex and ever-moving problem is not, in fact, a simple matter, and requires a comprehensive and intimate approach that starts with understanding the variety of opportunities available for cost and performance optimization. This is where 2nd Watch and our Six Pillars of Cloud Optimization come in.

The Six Pillars of Cloud Cost Optimization

  1. Reserved Instances (RIs)

AWS Reserved Instances, Azure Reserved VM Instances, and Google Cloud Committed Use Discounts take the ephemeral out of cloud resources, allowing you to estimate up front what you’re going to use. This also entitles you to steep discounts for pre-planning, which ends up as a great financial incentive.

Most cloud cost optimizations, erroneously, begin and end here, providing you and your organization with a less than optimal solution. Resources to estimate RI purchases are available through cloud providers directly and through 3rd-party optimization tools. For example, CloudHealth by VMware provides a clear picture of where to purchase RIs based on your current cloud use over a number of months and will help you manage your RI lifecycle over time.

Two of the major factors to consider with cloud cost optimization are Risk Tolerance and Centralized RI Management portfolios.

  • Risk Tolerance refers to identifying how much you’re willing to spend up front in order to increase the possibility of future gains or recovered profits. For example, can your organization take a risk and cover 70% of your workloads with RIs? Or do you worry about consumption, and will therefore want to limit that to around 20-30%? Also, how long, in years, are you able to project ahead? One year is the least risky, sure, but three years, while also a larger financial commitment, comes with larger cost savings.
  • Centralized RI Management portfolios allow for deeper RI coverage across organizational units, resulting in even greater savings opportunities. For instance, a single application team might have a limited pool of cash in which to purchase RIs. Alternatively, a centralized, whole organization approach would cover all departments and teams for all workloads, based on corporate goals. This approach, of course, also requires ongoing communication with the separate groups to understand current and future resources needed to create and execute a successful RI management program.

Once you identify your risk tolerance and centralize your approach to RIs, you can take advantage of this optimization option. However, an RI-only optimization strategy is short-sighted. It only allows you to take advantage of the pricing options your cloud vendor offers. It is important to overlay RI purchases with the 5 other optimization pillars to achieve the most effective cloud cost optimization.

  2. Auto-Parking

One of the benefits of the cloud is the ability to spin up (and down) resources as you need them. However, the downside of this instant technology is that there is very little incentive for individual team members to terminate these processes when they are finished with them. Auto-Parking refers to scheduling resources to shut down during off hours—an especially useful tool for development and test environments. Identifying your idle resources via a robust tagging strategy is the first step; this allows you to pinpoint resources that can be parked more efficiently. The second step involves automating the spin-up/spin-down process. Tools like ParkMyCloud, AWS Instance Scheduler, Azure Automation, and Google Cloud Scheduler can help you manage the entire auto-parking process.
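To give a feel for the automation piece, here is a minimal boto3 sketch that stops running EC2 instances carrying a hypothetical auto-park tag; in practice, something like this would run on a schedule (for example, from a Lambda function triggered by an EventBridge rule) rather than by hand:

/*
import boto3

ec2 = boto3.client("ec2")

# Find running instances tagged for auto-parking (the tag key/value are assumptions).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:auto-park", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    # Stop (not terminate) the instances so they can be started again during business hours.
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Parked {len(instance_ids)} instances: {instance_ids}")
*/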

  3. Right-Sizing

Ah, right-sizing, the best way to ensure you’re using exactly what you need and not too little or too much. It seems like a no-brainer to just “enable right-sizing” immediately when you start using a cloud environment. However, without the ability to analyze resource consumption or enable chargebacks, right-sizing becomes a meaningless concept. Performance and capacity requirements for cloud applications often change over time, and this inevitably results in underused and idle resources.

Many cloud providers share best practices in right-sizing, though they spend more time explaining the right-sizing options that exist prior to a cloud migration. This is unfortunate as right-sizing is an ongoing activity that requires implementing policies and guardrails to reduce overprovisioning, tagging resources to enable department level chargebacks, and properly monitoring CPU, Memory and I/O, in order to be truly effective.
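As a small example of what that monitoring can look like, here is a minimal boto3 sketch that pulls two weeks of average CPU utilization for a single, hypothetical instance from CloudWatch and flags it as a right-sizing candidate; a real assessment would also consider memory and I/O and loop over the whole fleet:

/*
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Hypothetical instance ID; in practice you would iterate over every instance.
instance_id = "i-0123456789abcdef0"

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,            # one-hour data points
    Statistics=["Average"],
)

averages = [point["Average"] for point in stats["Datapoints"]]
if averages and max(averages) < 20:
    print(f"{instance_id} peaked at {max(averages):.1f}% average CPU; right-sizing candidate")
*/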

Right-sizing must also take into account auto-parked resources and RIs available. Do you see a trend here with the optimization pillars?

  4. Family Refresh

Instance types, VM-series and “Instance Families” all describe methods by which cloud providers package up their instances according to the hardware used. Each instance/series/family offers different varieties of compute, memory, and storage parameters. Instance types within their set groupings are often retired as a unit when the hardware required to keep them running is replaced by newer technology. Cloud pricing changes directly in relationship to this changing of the guard, as newer systems replace the old. This is called Family Refresh.

Up-to-date knowledge of the instance types/families being used within your organization is a vital component to estimating when your costs will fluctuate. Truth be told, though, with over 500,000 SKU and price combinations for any single cloud provider, that task seems downright impossible.

Some tools exist, however, that can help monitor/estimate Family Refresh, though they often don’t take into account the overlap that occurs with RIs—or upon application of any of the other pillars of optimization. As a result, for many organizations, Family Refresh is the manual, laborious task it sounds like. Thankfully, we’ve found ways to automate the suggestions through our optimization service offering.

  5. Waste

Related to the issue of instances running long past their usefulness, waste is prevalent in the cloud. Waste may seem like an abstract concept when it comes to virtual resources, but each wasted unit in this case = $$ spent for no purpose. And, when there is no limit to the amount of resources you can use, there is also no incentive for individuals using the resources to self-regulate their unused/under-utilized instances. Some examples of waste in the cloud include:

  • AWS RDSs or Azure SQL DBs without a connection
  • Unutilized AWS EC2s
  • Azure VMs that were spun up for training or testing
  • Dated snapshots that are holding storage space that will never be useful
  • Idle load balancers
  • Unattached volumes

Identifying waste takes time and accurate reporting. It is a great reason to invest the time and energy in developing a proper tagging strategy, however, since waste will be instantly traceable to the organizational unit that incurred it, and therefore, easily marked for review and/or removal. We’ve often seen companies buy RIs before they eliminate waste, which, without fail, causes them to overspend in the cloud for at least a year.
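As one concrete example of waste hunting, here is a minimal boto3 sketch that lists unattached EBS volumes; volumes in the 'available' state are not attached to any instance:

/*
import boto3

ec2 = boto3.client("ec2")

# Volumes in the 'available' state are not attached to any instance.
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    for volume in page["Volumes"]:
        print(
            f"Unattached volume {volume['VolumeId']}: "
            f"{volume['Size']} GiB, created {volume['CreateTime']}"
        )
*/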

  6. Storage

Storage in the cloud is a great way to reduce on-premises hardware spend. That said, though, because it is so effortless to use, cloud storage can, in a very short matter of time, expand exponentially, making it nearly impossible to predict accurate cloud spend. Cloud storage is usually charged by four characteristics:

  • Size – How much storage do you need?
  • Data Transfer (bandwidth) – How often does your data need to move from one location to another?
  • Retrieval Time – How quickly do you need to access your data?
  • Retrieval Requests – How often do you need to access your data?

There are a variety of options for different use cases, including file storage, databases, data backup, and/or data archives. Having a solid data lifecycle policy will help you estimate these numbers and ensure you are both right-sizing and using your storage quantity and bandwidth to their greatest potential at all times.
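To show what a data lifecycle policy can look like in practice, here is a minimal boto3 sketch that applies an S3 lifecycle rule; the bucket name, prefix, and transition/expiration windows are placeholders to adapt to your own retention requirements:

/*
import boto3

s3 = boto3.client("s3")

# Transition aging log objects to cheaper storage classes, then expire them.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",          # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
*/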

So, you see, each of these six pillars of cloud cost optimization houses many moving parts, and with public cloud providers constantly modifying their service offerings and pricing, it can seem that wrangling your wayward cloud is unlikely. Plus, optimizing only one of the pillars without considering the others offers little to no improvement and can, in fact, unintentionally cost you more money over time. An efficacious optimization process must take all pillars and the way they overlap into account, institute the right policies and guardrails to ensure cloud sprawl doesn’t continue, and implement the right tools to allow your team to regularly make informed decisions.

The good news is that the future is bright! Once you have completely assessed your current environment, taken the pillars into account, made the changes required to optimize your cloud, and found a method by which to make this process continuous, you can investigate optimization through application refactoring, ephemeral instances, spot instances and serverless architecture.

The promised cost savings of public cloud is reachable, if only you know where to look.

2nd Watch offers a Cloud Cost Optimization service that can help guide you through this process. Our Cloud Cost Optimization service can reduce your current cloud computing costs by as much as 25% to 40%, increasing efficiency and performance. Our proven methodology empowers you to make data driven decisions in context, not relying on tools alone. Cloud cost optimization doesn’t have to be time consuming and challenging. Start your cloud cost optimization plan with our proven method for success at https://offers.2ndwatch.com/optimization.

-Stefana Muller, Sr. Product Manager


Podcast: Cisco Cloud Unfiltered – Half a Million SKUs and Counting

How do you make the right decisions about moving to the cloud when there are over half a million SKUs in AWS alone? In this week’s episode of the Cloud Unfiltered podcast, Ali Amagasu and Pete Johnson talk with Jeff Aden of 2nd Watch about how AWS has changed, how moving to the cloud is helping some companies save a LOT of money, and where the cost savings opportunities are for others that haven’t made the move yet. Check it out.

Hosted by Ali Amagasu and Pete Johnson, Cisco Cloud Unfiltered is a series of interviews with the people that are working to move cloud technology and implementation forward.


Automating Windows Server 2016 Builds with Packer

It might not be found on a curated list of urban legends, but trust me, IT IS possible (!!!) to fully automate the building and configuration of Windows virtual machines for AWS.  While Windows Server may not be your first choice of operating systems for immutable infrastructure, sometimes you as the IT professional are not given the choice due to legacy environments, limited resources for re-platforming, corporate edicts, or lack of internal *nix knowledge.  Complain all you like, but after the arguing, fretting, crying, and gnashing of teeth is done, at some point you will still need to build an automated deployment pipeline that requires Windows Server 2016.  With some PowerShell scripting and HashiCorp Packer, it is relatively easy and painless to securely build and configure Windows AMIs for your particular environment.

Let’s dig into an example of how to build a custom configured Windows Server 2016 AMI.  You will need access to an AWS account and have sufficient permissions to create and manage EC2 instances.  If you need an AWS account, you can create one for free.

I am using VS Code and the built-in terminal with Windows Subsystem for Linux to create the Packer template and run Packer; however, Packer is available for many Linux distros, macOS, and Windows.

First, download and unzip Packer:

/*
~/packer-demo:\> wget https://releases.hashicorp.com/packer/1.3.4/packer_1.3.4_linux_amd64.zip 
https://releases.hashicorp.com/packer/1.3.4/packer_1.3.4_linux_amd64.zip 
Resolving releases.hashicorp.com (releases.hashicorp.com)... 151.101.1.183, 151.101.65.183, 151.101.129.183, ... 
Connecting to releases.hashicorp.com (releases.hashicorp.com)|151.101.1.183|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 28851840 (28M) [application/zip] 
Saving to: ‘packer_1.3.4_linux_amd64.zip’ 
‘packer_1.3.4_linux_amd64.zip’ saved [28851840/28851840] 
~/packer-demo:\> unzip packer_1.3.4_linux_amd64.zip 
Archive:  packer_1.3.4_linux_amd64.zip   
 inflating: packer
*/

Now that we have Packer unzipped, verify that it is executable by checking the version:

/*
~/packer-demo:\> packer --version
1.3.4
*/

Packer can build machine images for a number of different platforms.  We will focus on the amazon-ebs builder, which will create an EBS-backed AMI.  At a high level, Packer performs these steps:

  1. Reads configuration settings from a JSON file
  2. Uses the AWS API to stand up an EC2 instance
  3. Connects to the instance and provisions it
  4. Shuts down and snapshots the instance
  5. Creates an Amazon Machine Image (AMI) from the snapshot
  6. Cleans up the mess

The amazon-ebs builder can create temporary keypairs, security group rules, and establish a basic communicator for provisioning.  However, in the interest of tighter security and control, we will want to be prescriptive for some of these settings and use a secure communicator to reduce the risk of eavesdropping while we provision the machine.

There are two communicators Packer uses to upload scripts and files to a virtual machine: SSH (the default) and WinRM.  While there is a nice Win32 port of OpenSSH for Windows, it is not currently installed by default on Windows machines, but WinRM is available natively in all current versions of Windows, so we will use that to provision our Windows Server 2016 machine.

Let’s create and edit the userdata file that Packer will use to bootstrap the EC2 instance:

/*
 
<powershell>
# USERDATA SCRIPT FOR AMAZON SOURCE WINDOWS SERVER AMIS
# BOOTSTRAPS WINRM VIA SSL
# (EC2Launch only executes userdata wrapped in <powershell> tags)
 
Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
$ErrorActionPreference = "stop"
 
# Remove any existing Windows Management listeners
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
 
# Create self-signed cert for encrypted WinRM on port 5986
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer-ami-builder"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
 
# Configure WinRM
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="false"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="false"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer-ami-builder`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall add rule name="WinRM-SSL (5986)" dir=in action=allow protocol=TCP localport=5986
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>
 
*/

There are four main things going on here:

  1. Set the execution policy and error handling for the script (if an error is encountered, the script terminates immediately)
  2. Clear out any existing WS Management listeners, just in case there are any preconfigured insecure listeners
  3. Create a self-signed certificate for encrypting the WinRM communication channel, and then bind it to a WS Management listener
  4. Configure WinRM to:
    • Require an encrypted (SSL) channel
    • Enable basic authentication (usually this is not secure as the password goes across in plain text, but we are forcing encryption)
    • Configure the listener to use port 5986 with the self-signed certificate we created earlier
    • Add a firewall rule to open port 5986

Now that we have userdata to bootstrap WinRM, let’s create a Packer template:

/*
~/packer-demo:\> touch windows2016.json
*/

Open this file with your favorite text editor and add this text:

/*
{
    "variables": {
        "build_version": "{{isotime \"2006.01.02.150405\"}}",
        "aws_profile": null,
        "vpc_id": null,
        "subnet_id": null,
        "security_group_id": null
    },
    "builders": [
        {
            "type": "amazon-ebs",
            "region": "us-west-2",
            "profile": "{{user `aws_profile`}}",
            "vpc_id": "{{user `vpc_id`}}",
            "subnet_id": "{{user `subnet_id`}}",
            "security_group_id": "{{user `security_group_id`}}",
            "source_ami_filter": {
                "filters": {
                    "name": "Windows_Server-2016-English-Full-Base-*",
                    "root-device-type": "ebs",
                    "virtualization-type": "hvm"
                },
                "most_recent": true,
                "owners": [
                    "801119661308"
                ]
            },
            "ami_name": "WIN2016-CUSTOM-{{user `build_version`}}",
            "instance_type": "t3.xlarge",
            "user_data_file": "userdata.ps1",
            "associate_public_ip_address": true,
            "communicator": "winrm",
            "winrm_username": "Administrator",
            "winrm_port": 5986,
            "winrm_timeout": "15m",
            "winrm_use_ssl": true,
            "winrm_insecure": true
        }
    ]
}
*/

Couple of things to call out here.  First, the variables block at the top references some values that are needed in the template.  Insert values here that are specific to your AWS account and VPC:

  • aws_profile: I use a local credentials file to store IAM user credentials (the file is shared between WSL and Windows). Specify the name of a credential block that Packer can use to connect to your account.  The IAM user will need permissions to create and modify EC2 instances, at a minimum
  • vpc_id: Packer will stand up the instance in this VPC
  • aws_region: Your VPC should be in this region. As an exercise, change this value to be set by a variable instead
  • user_data_file: We created this file earlier, remember? If you saved it in another location, make sure the path is correct
  • subnet_id: This should belong to the VPC specified above. Use a public subnet if you do not have Direct Connect or a VPN
  • security_group_id: This security group should belong to the VPC specified above. This security group will be attached to the instance that Packer stands up.  It will need, at a minimum, inbound TCP 5986 from where Packer is running

Next, let’s validate the template to make sure the syntax is correct, and we have all required fields:

/*
~/packer-demo:\> packer validate windows2016.json
Template validation failed. Errors are shown below.
Errors validating build 'amazon-ebs'. 1 error(s) occurred:
* An instance_type must be specified
*/

Whoops, we missed instance_type.  Add that to the template and specify a valid EC2 instance type.  I like using beefier instance types so that the builds get done quicker, but that’s just me.

/*
            "ami_name": "WIN2016-CUSTOM-{{user `build_version`}}",
            "instance_type": "t3.xlarge",
            "user_data_file": "userdata.ps1",
*/

Now validate again:

/*
~/packer-demo:\> packer validate windows2016.json
Template validated successfully.
*/

Awesome.  Couple of things to call out in the template:

  • source_ami_filter: we are using a base Amazon AMI for Windows Server 2016 server. Note the wildcard in the AMI name (Windows_Server-2016-English-Full-Base-*) and the owner (801119661308).  This filter will always pull the most recent match for the AMI name.  Amazon updates its AMIs about once a month
  • associate_public_ip_address: I specify true because I don’t have Direct Connect or VPN from my workstation to my sandbox VPC. If you are building your AMI in a public subnet and want to connect to it over the internet, do the same
  • ami_name: this is a required property, and it must also be unique within your account. We use a special function here (isotime) to generate a timestamp, ensuring that the AMI name will always be unique (e.g. WIN2016-CUSTOM-2019.03.01.000042)
  • winrm_port: the default unencrypted port for WinRM is 5985. We disabled 5985 in userdata and enabled 5986, so be sure to call it out here
  • winrm_use_ssl: as noted above, we are encrypting communication, so set this to true
  • winrm_insecure: this is a rather misleading property name. It really means that Packer should not check whether the encryption certificate is trusted.  We are using a self-signed certificate in userdata, so set this to true to skip certificate validation

Now that we have our template, let’s inspect it to see what will happen:

/*
~/packer-demo:\> packer inspect windows2016.json
Optional variables and their defaults:
  aws_profile       = default
  build_version     = {{isotime "2006.01.02.150405"}}
  security_group_id = sg-0e1ca9ba69b39926
  subnet_id         = subnet-00ef2a1df99f20c23
  vpc_id            = vpc-00ede10ag029c31e0
Builders:
  amazon-ebs
Provisioners:
 
Note: If your build names contain user variables or template
functions such as 'timestamp', these are processed at build time,
and therefore only show in their raw form here.
*/

Looks good, but we are missing a provisioner!  If we ran this template as is, all that would happen is Packer would make a copy of the base Windows Server 2016 AMI and make you the owner of the new private image.

There are several different Packer provisioners, ranging from very simple (the windows-restart provisioner just reboots the machine) to complex (the chef client provisioner installs chef client and does whatever Chef does from there).  We’ll run a basic powershell provisioner to install IIS on the instance, reboot the instance using windows-restart, and then we’ll finish up the provisioning by executing sysprep on the instance to generalize it for re-use.

We will use the PowerShell cmdlet Enable-WindowsOptionalFeature to install the web server role and IIS with defaults. Add this text to your template below the builders [ ] section:

/*
"provisioners": [
        {
            "type": "powershell",
            "inline": [
                "Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerRole",
                "Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServer"
            ]
        },
        {
            "type": "windows-restart",
            "restart_check_command": "powershell -command \"& {Write-Output 'Machine restarted.'}\""
        },
        {
            "type": "powershell",
            "inline": [
                "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\InitializeInstance.ps1 -Schedule",
                "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\SysprepInstance.ps1 -NoShutdown"
            ]
        }
    ]
*/

Couple things to call out about this section:

  • The first powershell provisioner uses the inline method. Packer simply appends these lines, in order, to a file, then transfers and executes the file on the instance using PowerShell
  • The second provisioner, windows-restart, simply reboots the machine while Packer waits. While this isn’t always necessary, it is helpful to catch instances where settings do not persist after a reboot, which was probably not your intention
  • The final powershell provisioner executes two PowerShell scripts that are present on Amazon Windows Server 2016 AMIs and part of the EC2Launch application (earlier versions of Windows use a different application called EC2Config). They are helper scripts that you can use to prepare the machine for generalization, and then execute sysprep

After validating your template again, let’s build the AMI!

/*
~/packer-demo:\> packer validate windows2016.json
Template validated successfully.
*/
/*
~/packer-demo:\> packer build windows2016.json
amazon-ebs output will be in this color.
==> amazon-ebs: Prevalidating AMI Name: WIN2016-CUSTOM-2019.03.01.000042
    amazon-ebs: Found Image ID: ami-0af80d239cc063c12
==> amazon-ebs: Creating temporary keypair: packer_5c78762a-751e-2cd7-b5ce-9eabb577e4cc
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
    amazon-ebs: Adding tag: "Name": "Packer Builder"
    amazon-ebs: Instance ID: i-0384e1edca5dc90e5
==> amazon-ebs: Waiting for instance (i-0384e1edca5dc90e5) to become ready...
==> amazon-ebs: Waiting for auto-generated password for instance...
    amazon-ebs: It is normal for this process to take up to 15 minutes,
    amazon-ebs: but it usually takes around 5. Please wait.
    amazon-ebs:  
    amazon-ebs: Password retrieved!
==> amazon-ebs: Using winrm communicator to connect: 34.220.235.82
==> amazon-ebs: Waiting for WinRM to become available...
    amazon-ebs: WinRM connected.
    amazon-ebs: #> CLIXML
    amazon-ebs: System.Management.Automation.PSCustomObjectSystem.Object1Preparing modules for first 
use.0-1-1Completed-1 1Preparing modules for first use.0-1-1Completed-1 
==> amazon-ebs: Connected to WinRM!
==> amazon-ebs: Provisioning with Powershell...
==> amazon-ebs: Provisioning with powershell script: /tmp/packer-powershell-provisioner561623209
    amazon-ebs:
    amazon-ebs:
    amazon-ebs: Path          :
    amazon-ebs: Online        : True
    amazon-ebs: RestartNeeded : False
    amazon-ebs:
    amazon-ebs: Path          :
    amazon-ebs: Online        : True
    amazon-ebs: RestartNeeded : False
    amazon-ebs:
==> amazon-ebs: Restarting Machine
==> amazon-ebs: Waiting for machine to restart...
    amazon-ebs: Machine restarted.
    amazon-ebs: EC2AMAZ-LJV703F restarted.
    amazon-ebs: #> CLIXML
    amazon-ebs: System.Management.Automation.PSCustomObjectSystem.Object1Preparing modules for first 
use.0-1-1Completed-1 
==> amazon-ebs: Machine successfully restarted, moving on
==> amazon-ebs: Provisioning with Powershell...
==> amazon-ebs: Provisioning with powershell script: /tmp/packer-powershell-provisioner928106343
    amazon-ebs:
    amazon-ebs: TaskPath                                       TaskName                          State
    amazon-ebs: --------                                       --------                          -----
    amazon-ebs: \                                              Amazon Ec2 Launch - Instance I... Ready
==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance, attempt 1
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating unencrypted AMI WIN2016-CUSTOM-2019.03.01.000042 from instance i-0384e1edca5dc90e5
    amazon-ebs: AMI: ami-0d6026ecb955cc1d6
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.
 
==> Builds finished. The artifacts of successful builds are:
==> amazon-ebs: AMIs were created:
us-west-2: ami-0d6026ecb955cc1d6
*/

The entire process took approximately 5 minutes from start to finish.  If you inspect the output, you can see where the first PowerShell provisioner installed IIS (Provisioning with powershell script: /tmp/packer-powershell-provisioner561623209), where sysprep was executed (Provisioning with powershell script: /tmp/packer-powershell-provisioner928106343) and where Packer yielded the AMI ID for the image (ami-0d6026ecb955cc1d6) and cleaned up any stray artifacts.

Note: most of the issues that occur during a Packer build are related to firewalls and security groups.  Verify that the machine that is running Packer can reach the VPC and EC2 instance using port 5986.  You can also use packer build -debug windows2016.json to step through the build process manually, and you can even connect to the machine via RDP to troubleshoot a provisioner.

Now, if you launch a new EC2 instance using this AMI, it will already have IIS installed and configured with defaults.  Building and securely configuring Windows AMIs with Packer is easy!

For help getting started securely building and configuring Windows AMIs for your particular environment, contact us.

-Jonathan Eropkin, Cloud Consultant


Hat Trick for 2nd Watch

2nd Watch has again been named in the Magic Quadrant for Public Cloud Infrastructure Professional and Managed Services, Worldwide.

For the third year in a row, 2nd Watch was recognized in Gartner’s Magic Quadrant (MQ), this year positioned highest among companies in the Challengers quadrant for its ability to execute.

Our team is honored to be included in Gartner’s research, which is the gold standard for technology thought leadership. We attribute our success year-over-year to an amazing team of individuals who work at 2nd Watch to provide best-in-class cloud services to our customers day in and day out!

Each year, Gartner has continued to raise the bar for inclusion in the MQ because the market is evolving so rapidly.  For the past few years, professional and managed services have been tightly coupled and could be combined into cloud services, as reflected in this year’s new Magic Quadrant title.  With the rate of innovation in cloud services, customers require providers that are proactive, knowledgeable, and highly skilled in architecture, engineering, and optimization, which is embodied by cloud services.  Lines may be blurred between professional and managed services, which is why 2nd Watch is able to react to customer needs quickly, whereas some of the larger providers may have difficulties sharing knowledge across businesses or departments.

From our perspective, here are some key take-aways from the report. Read the full report here.

  • 2nd Watch’s investment and development in the science of optimization is paying dividends. Our customers tell us they save more money when working with 2nd Watch compared to other providers or by managing their cloud spend on their own.
  • 2nd Watch’s reputation of mastering core capabilities for the leading CSPs is being recognized by our customers.
  • 2nd Watch is easy to work with from our pre-sales to our final deliverables.
  • 2nd Watch customers are extremely satisfied with our results, as reflected in one of the highest Net Promoter Scores in the industry. Our customer retention rates remain high, which is a reflection of our teams’ expertise and sensitivity to customer needs.

As we drive forward in 2019, we will continue to focus on our customers by delivering extremely valuable solutions for them every day. Being recognized by Gartner is very valuable and none of this would be possible without the highly-talented individuals that are part of the 2nd Watch team – thank you!

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

-Jeff Aden, Co-Founder & EVP of Marketing & Business Development


How to use waiters in boto3 (And how to write your own!)

What is boto3?

Boto3 is the Python SDK for interacting with the AWS API. It makes it easy to use the Python programming language to manipulate AWS resources and automate infrastructure.

What are boto3 waiters and how do I use them?

A number of requests in AWS using boto3 are not instant. Common examples of boto3 requests are deploying a new server or RDS instance. For some long running requests, we are ok to initiate the request and then check for completion at some later time. But in many cases, we want to wait for the request to complete before we move on to the subsequent parts of the script that may rely on a long running process to have been completed. One example would be a script that might copy an AMI to another account by sharing all the snapshots. After sharing the snapshots to the other account, you would need to wait for the local snapshot copies to complete before registering the AMI in the receiving account. Luckily a snapshot completed waiter already exists, and here’s what that waiter would look like in Python:
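Here is a minimal sketch of that pattern, assuming hypothetical snapshot IDs for the copies in the receiving account:

/*
import boto3

ec2 = boto3.client("ec2")

# Hypothetical snapshot IDs for the local copies in the receiving account.
snapshot_ids = ["snap-0123456789abcdef0", "snap-0fedcba9876543210"]

# The built-in waiter polls DescribeSnapshots until every snapshot reports
# 'completed' or the waiter times out (roughly 600 seconds by default).
waiter = ec2.get_waiter("snapshot_completed")
waiter.wait(SnapshotIds=snapshot_ids)

print("Snapshot copies completed; safe to register the AMI.")
*/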

As far as the default configuration for the waiters and how long they wait, you can view the information in the boto3 docs on waiters, but it’s 600 seconds in most cases. Each one is configurable to be as short or long as you’d like.

Writing your own custom waiters.

As you can see, using boto3 waiters is an easy way to setup a loop that will wait for completion without having to write the code yourself. But how do you find out if a specific waiter exists? The easiest way is to explore the particular boto3 client on the docs page and check out the list of waiters at the bottom. Let’s walk through the anatomy of a boto3 waiter. The waiter is actually instantiated in botocore and then abstracted to boto3. So looking at the code there we can derive what’s needed to generate our own waiter:

  1. Waiter Name
  2. Waiter Config
    1. Delay
    2. Max Attempts
    3. Operation
    4. Acceptors
  3. Waiter Model

The first step is to name your custom waiter. You’ll want it to be something descriptive and, in our example, it will be “CertificateIssued”. This will be a waiter that waits for an ACM certificate to be issued (Note there is already a CertificateValidated waiter, but this is only to showcase the creation of the waiter). Next we pick out the configuration for the waiter, which boils down to 4 parts. Delay is the amount of time it will take between tests in seconds. Max Attempts is how many attempts it will try before it fails. Operation is the boto3 client operation that you’re using to get the result you’re testing. In our example, we’re calling “DescribeCertificate”. Acceptors is how we test the result of the Operation call. Acceptors are probably the most complicated portion of the configuration. They determine how to match the response and what result to return. Acceptors have 4 parts: Matcher, Expected, Argument, State.

  • State: This is what the acceptor will return based on the result of the matcher function.
  • Expected: This is the expected response that you want from the matcher to return this result.
  • Argument: This is the argument sent to the matcher function to determine if the result is expected.
  • Matcher: Matchers come in 5 flavors. Path, PathAll, PathAny, Status, and Error. The Status and Error matchers will effectively check the status of the HTTP response and check for an error respectively. They return failure states and short circuit the waiter so you don’t have to wait until the end of the time period when the command has already failed. The Path matcher will match the Argument to a single expected result. In our example, if you run DescribeCertificate you would get back a “Certificate.Status” as a result. Taking that as an argument, the desired expected result would be “ISSUED”. Notice that if the expected result is “PENDING_VALIDATION” we set the state to “retry” so it will continue to keep trying for the result we want. The PathAny/PathAll matchers work with operations that return a python list result. PathAny will match if any item in the list matches, PathAll will match if all the items in the list match.

Once we have the configuration complete, we feed this into the Waiter Model and call the “create_waiter_with_client” function. Now our custom waiter is ready to wait. If you need more examples of waiters and how they are configured, check out the botocore github and poke through the various services. If they have waiters configured, they will be in a file called waiters-2.json. Here’s the finished code for our custom waiter.
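A minimal sketch of what that finished waiter could look like, assuming a hypothetical certificate ARN:

/*
import boto3
from botocore.waiter import WaiterModel, create_waiter_with_client

acm = boto3.client("acm")

# Waiter config: poll DescribeCertificate every 10 seconds, up to 30 times.
waiter_config = {
    "version": 2,
    "waiters": {
        "CertificateIssued": {
            "operation": "DescribeCertificate",
            "delay": 10,
            "maxAttempts": 30,
            "acceptors": [
                {
                    "matcher": "path",
                    "argument": "Certificate.Status",
                    "expected": "ISSUED",
                    "state": "success",
                },
                {
                    "matcher": "path",
                    "argument": "Certificate.Status",
                    "expected": "PENDING_VALIDATION",
                    "state": "retry",
                },
            ],
        }
    },
}

waiter_model = WaiterModel(waiter_config)
custom_waiter = create_waiter_with_client("CertificateIssued", waiter_model, acm)

# Hypothetical certificate ARN; substitute the certificate you requested.
custom_waiter.wait(
    CertificateArn="arn:aws:acm:us-west-2:123456789012:certificate/example-cert-id"
)
print("Certificate issued.")
*/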

And that’s it. Custom waiters will allow you to create automation in a series without having to build redundant code or complicated loops. Have questions about writing custom waiters or boto3? Contact us.

-Coin Graham, Principal Cloud Consultant


The Cloudcast Podcast with Jeff Aden, Co-Founder and EVP at 2nd Watch

The Cloudcast’s Aaron and Brian talk with Jeff Aden, Co-Founder and EVP at 2nd Watch, about the evolution of 2nd Watch as a Cloud Integrator as AWS has grown and shifted its focus from startups to enterprise customers. Listen to the podcast at http://www.thecloudcast.net/2019/02/evolution-of-public-cloud-integrator.html.

Topic 1 – Welcome to the show Jeff. Tell us about your background, the founding of 2nd Watch, and how the company has evolved over the last few years.

Topic 2 – We got to know 2nd Watch at one of the first AWS re:Invent shows, as they had one of the largest booths on the floor. At the time, they were listed as one of AWS’s best partners. Today, 2nd Watch provides management tools, migration tools, and systems-integration capabilities. How does 2nd Watch think of themselves?

Topic 3 –  What are the concerns of your customers today, and how does 2nd Watch think about matching customer demands and the types of tools/services/capabilities that you provide today?

Topic 4 – We'd like to pick your brain about the usage and insights you're seeing from your customers' usage of AWS. It's mentioned that 100% are using DynamoDB, 53% are using Elastic Kubernetes Service, and a fast-growing segment is using things like Athena, Glue, and SageMaker. What are some of the types of applications that you're seeing customers build that leverage these new models?

Topic 5 – With technologies like Outpost being announced, after so many years of AWS saying “Cloud or legacy Data Center,” how do you see this impacting the thought process of customers or potential customers?


Leveraging the cloud for SOC 2 compliance

In a world of high-profile attacks, breaches, and information compromises, companies that rely on third parties to manage and/or store their data sets are wise to consider a roadmap for their security, risk, and compliance strategy. Failure to detect or mitigate the loss of data or other security breaches, including breaches of their suppliers' information systems, could expose a cloud user and their customers to a loss or misuse of information so harmful that it becomes difficult to recover from. In 2018 alone, nearly 500 million records were exposed in data breaches, according to the Identity Theft Resource Center's findings, https://www.idtheftcenter.org/2018-end-of-year-data-breach-report/. While absolute security can never be attained while running your business, there are frameworks, tools, and strategies that can be applied to minimize risks to acceptable levels while maintaining continuous compliance.

SOC 2 is one of those frameworks, and it is particularly beneficial in the Managed Service Provider space. It is built on the AICPA's Trust Services Principles (TSP) for service security, availability, confidentiality, processing integrity, and privacy. SOC 2 is well suited to a wide range of applications, especially in the cloud services space. Companies have realized that their security and compliance frameworks must stay aligned with the inherent changes that come with cloud evolution, which includes staying abreast of developing capabilities and feature enhancements. AWS, for example, announced a flurry of new services and features at its annual re:Invent conference in 2018 alone. When embedded into a company's cloud strategy, the common controls that SOC 2 offers can build the foundation for a robust information systems security program.

CISOs, CSOs, and company stakeholders must not form the company security strategy in a vacuum. Involving core leaders in the organization, both at the management level and at the individual contributor level, should be part of the overall security development strategy, just as it is with successful innovation strategies. In fact, the security strategy should be integrated with the company's innovation strategy. One of the best approaches to ensure this happens is to develop a steering committee with participation from all major divisions and/or groups. This is easier in smaller organizations, where information can quickly flow vertically and horizontally; larger organizations simply need to ensure that the vehicles are in place to allow a quick flow of information to all stakeholders.

Organizations with strong security programs have good controls in place to address each of the major domain categories under the Trust Services Principles. Each of the Trust Services Principles can be described through the controls that the company has established. Below are some ways that Managed Cloud Service providers like 2nd Watch meet the requirements for security, availability, and confidentiality while simultaneously lowering the overall risk to their business and their customers' businesses:

Security

  • Change Management – Implement both internal and external system change management using effective ITSM tools to track, at a minimum, the change subject, descriptions, requester, urgency, change agent, service impact, change steps, evidence of testing, back-out plan, and appropriate stakeholder approvals.
  • End-User Security – Implement full-disk encryption for end-user devices, deploy centrally managed Directory Services for authorization, use multi-factor authentication, follow password/key management best practices, use role-based access controls, segregate permissions using a least-privilege approach, and document the policies and procedures. These are all effective ways to secure environments fairly quickly.
  • Facilities – While “security of the cloud” falls under the responsibility of your cloud infrastructure provider, your Managed Services Provider should still work to adequately protect their own, albeit out-of-scope, physical spaces. Door access badges, logs, and monitoring of entry/exit points are positive ways to prevent unauthorized physical entry.
  • AV Scans – Ensure that your cloud environments are built with AV scanning solutions.
  • Vulnerability Scans and Remediation – Ensure that your Managed Services Provider or third-party provider is running regular vulnerability scans and performing prompt risk remediation. Independent testing of the provider’s environment will help identify any unexpected risks, so implementing an annual penetration test is important.

Availability

  • DR and Incident Escalations – Ensure that your MSP maintains current, documented disaster recovery plans with at least annual exercises. Well-thought-out plans include testing of upstream and downstream elements of the supply chain, including a plan for notifying all stakeholders.
  • Risk Mitigation – Implement an annual formal risk assessment with a risk mitigation plan for the most likely situations.

Confidentiality

  • DLP – Implement ways and techniques to prevent data from being lost by unsuspecting employees or customers. Examples may include limiting the use of external media ports to authorized devices, deprecating old cipher protocols, and blocking unsafe or malicious downloads.
  • HTTPS – Use secure protocols and connections for the safe transmission of confidential information.
  • Classification of Data – Identify the elements of your cloud environment so that your Managed Service Providers or third parties can properly secure and protect those elements with a tagging strategy (a small audit sketch follows this list).
  • Emails – Use email encryption when sending any confidential information. Also, check with your Legal department on the proper use of a Confidentiality Statement at the end of emails, as appropriate to your business.
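
As one illustration of how a tagging strategy can be verified programmatically, here is a minimal sketch that flags S3 buckets missing a hypothetical DataClassification tag. The tag key and the focus on S3 are assumptions for the example, not a prescribed standard:

```python
import boto3
from botocore.exceptions import ClientError

REQUIRED_TAG = "DataClassification"  # hypothetical tag key for this example

s3 = boto3.client("s3")

# Flag buckets that have no classification tag so they can be reviewed.
untagged_buckets = []
for bucket in s3.list_buckets()["Buckets"]:
    try:
        tags = s3.get_bucket_tagging(Bucket=bucket["Name"])["TagSet"]
    except ClientError:
        tags = []  # no tags on the bucket (or tagging could not be read)
    if not any(tag["Key"] == REQUIRED_TAG for tag in tags):
        untagged_buckets.append(bucket["Name"])

print("Buckets missing a classification tag:", untagged_buckets)
```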

By implementing these SOC 2 controls, companies can expect to have a solid security framework to build on. Regardless of their stage in the cloud adoption lifecycle, businesses must continue to demonstrate to their stakeholders (customers, board members, employees, shareholders) that they have a secure and compliant business. As with any successful relationship between customer and service provider, properly formed contracts and agreements come into play. Without these elements in place and in constant use, it is difficult to evaluate how well a company is measuring up. This is where controls and a compliance framework like SOC 2 play a critical role.

Have questions on becoming SOC 2 compliant? Contact us!

– By Eddie Borjas, Director of Risk & Compliance


Operating and maintaining systems at scale with automation

Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools that 2nd Watch develops to solve these problems. Below I will outline our approach to this problem and its 3 main components: Collect, Model, and React.

Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.

Most of the data described above can be gathered directly from the AWS and Azure APIs, while some of it needs to be ingested with an agent or by custom scripts. We also need to make sure we have a consistent core set of data being brought in for each of our customers, while expanding that to include specialized data that only certain customers may have. All the data is gathered and sent to our Splunk indexers. We build an index for every customer to ensure that data stays segregated and secure.
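
As a simplified illustration of that collection step, the sketch below pulls a page of recent CloudTrail events with boto3 and forwards them to a Splunk HTTP Event Collector endpoint. The endpoint URL, token, and per-customer index name are placeholders, and a production pipeline would be considerably more involved:

```python
import boto3
import requests

# Placeholders: substitute your own HEC endpoint, token, and customer index.
SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"
SPLUNK_HEC_TOKEN = "00000000-0000-0000-0000-000000000000"
CUSTOMER_INDEX = "customer_acme"

cloudtrail = boto3.client("cloudtrail")

# Pull a page of recent management events from CloudTrail.
events = cloudtrail.lookup_events(MaxResults=50)["Events"]

# Forward each event to Splunk, tagged with the customer's dedicated index.
for event in events:
    payload = {
        "index": CUSTOMER_INDEX,
        "sourcetype": "aws:cloudtrail",
        "event": event["CloudTrailEvent"],  # raw JSON string of the API event
    }
    requests.post(
        SPLUNK_HEC_URL,
        json=payload,
        headers={"Authorization": f"Splunk {SPLUNK_HEC_TOKEN}"},
        timeout=10,
    )
```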

Model: Next we need to present the data in a useful way. The modeling of the data can vary depending on who is using it or how it is going to be consumed. A dashboard with a quick look at several important metrics can be very useful for an engineer to see the big picture, and seeing this data daily or throughout the day makes anomalies very apparent. This is especially helpful because gathering and organizing the data manually at scale is time-consuming and could otherwise reasonably only be done during periodic audits.

Modeling the data in Splunk allows for a low-overhead view with up-to-date data so the engineer can focus on more important things. A great example of this is provisioned resources by region. If the engineer looks at the data on a regular basis, they will quickly notice when the number of provisioned resources changes drastically. A 20% increase in the number of EC2 resources could mean several things: perhaps the customer is doing a large deployment, or maybe Justin accidentally put his AWS access key and secret key on GitHub (again).
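
A lightweight version of that provisioned-resources-by-region view can be generated directly from the APIs. The sketch below counts running EC2 instances per region with boto3; in practice this data is indexed and trended in Splunk rather than printed, and the running-state filter here is just an assumption for the example:

```python
import boto3

# Discover the available regions for EC2.
ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

# Count running EC2 instances in each region so sudden jumps stand out.
counts = {}
for region in regions:
    regional_ec2 = boto3.client("ec2", region_name=region)
    paginator = regional_ec2.get_paginator("describe_instances")
    total = 0
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            total += len(reservation["Instances"])
    counts[region] = total

for region, total in sorted(counts.items()):
    print(f"{region}: {total} running instances")
```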

We provide our customers with regular reports and reviews of their cloud environments, and we use the data collected and modeled in this tool to produce them. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used the historical trending data to show the progress of a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project, but we used that same view to ensure that waste did not creep back up and that the new tagging standards were being applied going forward.

React: Finally, it’s time to act on the data we have collected and modeled. Using Splunk alerts, we can apply conditional logic to the data patterns and act upon them. From Splunk we can call our ticketing system’s API and create a new incident for an engineer to investigate concerning trends or to notify the customer of a potential security risk. We can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, requesting cloud provider limit increases, etc.
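
As one example of such a remediation workflow, the sketch below enables default encryption on any S3 bucket that does not already have it. This is a simplified stand-in for the alert-driven workflows described above; applying SSE-S3 (AES256) as the baseline is an assumption, and a real remediation would be driven by the alert payload and ticketing context:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Enable default encryption on any bucket that lacks a configuration.
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        s3.get_bucket_encryption(Bucket=name)
        continue  # bucket already has a default encryption configuration
    except ClientError as err:
        if err.response["Error"]["Code"] != "ServerSideEncryptionConfigurationNotFoundError":
            raise  # unexpected error (e.g., access denied)

    # No default encryption found: apply SSE-S3 (AES256) as a baseline.
    s3.put_bucket_encryption(
        Bucket=name,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )
    print(f"Enabled default encryption on {name}")
```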

Because we have several independent data sources providing information, we can also correlate events and have more advanced conditional logic. If we see that a server is failing status checks, we can also look to see if it recently changed instance families or if it has all the appropriate drivers. This data can be included in the incident and available for the engineer to review without having to check it themselves.

The entire premise of this approach is efficiency: using data and automation to make quicker, smarter decisions. Operating and maintaining systems at scale brings numerous challenges, and if you cannot efficiently accommodate the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.

For help getting started in automating your systems, contact us.

-Kenneth Weinreich, Managed Cloud Operations
