
Operating and maintaining systems at scale with automation

Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools that 2nd Watch develops to solve these problems. Below I will outline our approach to this problem and its 3 main components: Collect, Model, and React.

Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.

Most of the data described above can be gathered directly from the AWS and Azure APIs, while some of it needs to be ingested with an agent or by custom scripts. We also need to make sure we have a consistent core set of data being brought in for each of our customers, while also expanding it to include specialized data that only certain customers may have. All the data is gathered and sent to our Splunk indexers. We build an index for every customer to ensure that data stays segregated and secure.
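
As an illustration of that last step, here is a minimal Go sketch of shipping a data point into a per-customer Splunk index via Splunk's HTTP Event Collector. The endpoint, token, index name, and payload fields are placeholders for illustration only, not our production values:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// hecEvent is the envelope Splunk's HTTP Event Collector expects.
type hecEvent struct {
	Time       int64                  `json:"time"`
	Index      string                 `json:"index"` // per-customer index keeps data segregated
	Sourcetype string                 `json:"sourcetype"`
	Event      map[string]interface{} `json:"event"`
}

func sendToSplunk(hecURL, token, index string, payload map[string]interface{}) error {
	body, err := json.Marshal(hecEvent{
		Time:       time.Now().Unix(),
		Index:      index,
		Sourcetype: "cloud:inventory",
		Event:      payload,
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequest("POST", hecURL+"/services/collector/event", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Splunk "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("splunk HEC returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical endpoint, token, index, and metric values.
	err := sendToSplunk("https://splunk.example.com:8088", "HEC-TOKEN", "customer_acme",
		map[string]interface{}{"source_api": "ec2", "region": "us-west-2", "running_instances": 42})
	if err != nil {
		log.Fatal(err)
	}
}

A collector along these lines would typically run on a schedule per data source; the key point is that every event lands in the customer's own index.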

Model: Next, we need to present the data in a useful way. The modeling of the data can vary depending on who is using it or how it is going to be consumed. A dashboard with a quick look at several important metrics can be very useful for an engineer to see the big picture, and seeing this data daily or throughout the day makes anomalies very apparent. This is especially helpful because gathering and organizing the data at scale is time consuming and would otherwise only be practical during periodic audits.

Modeling the data in Splunk allows for a low-overhead view with up-to-date data so the engineer can spend time on more important things. A great example of this is provisioned resources by region. If the engineer looks at the data on a regular basis, they will quickly notice when the number of provisioned resources changes drastically. A 20% increase in the number of EC2 resources could mean several things: perhaps the customer is doing a large deployment, or maybe Justin accidentally put his AWS access key and secret key on GitHub (again).

We provide our customers with regular reports and reviews of their cloud environments, and the data collected and modeled in this tool feeds those reports as well. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used the historical trending data to show progress on a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project, but we used that same view to ensure that waste did not creep back up and that the new tagging standards were being applied going forward.

React: Finally, it’s time to act on the data we have collected and modeled. Using Splunk alerts, we can apply conditional logic to the data patterns and act upon them. From Splunk we can call our ticketing system’s API and create a new incident for an engineer to investigate a concerning trend, or notify the customer of a potential security risk. We can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, requesting cloud provider limit increases, etc.
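
As a rough sketch of what one of those remediation workflows might do under the hood, the following Go snippet applies default encryption to a bucket flagged by an alert, using the AWS SDK for Go. The bucket name and the choice of AES256 are illustrative, and the real workflows include ticketing and approval steps not shown here:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// enableDefaultEncryption turns on default server-side encryption (SSE-S3)
// for a bucket that an alert has flagged as unencrypted.
func enableDefaultEncryption(svc *s3.S3, bucket string) error {
	_, err := svc.PutBucketEncryption(&s3.PutBucketEncryptionInput{
		Bucket: aws.String(bucket),
		ServerSideEncryptionConfiguration: &s3.ServerSideEncryptionConfiguration{
			Rules: []*s3.ServerSideEncryptionRule{
				{
					ApplyServerSideEncryptionByDefault: &s3.ServerSideEncryptionByDefault{
						SSEAlgorithm: aws.String("AES256"),
					},
				},
			},
		},
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// "example-flagged-bucket" is a placeholder; in practice the bucket name
	// would come from the alert payload.
	if err := enableDefaultEncryption(svc, "example-flagged-bucket"); err != nil {
		log.Fatalf("failed to enable encryption: %v", err)
	}
	log.Println("default encryption enabled")
}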

Because we have several independent data sources providing information, we can also correlate events and build more advanced conditional logic. If we see that a server is failing status checks, we can also look to see whether it recently changed instance families or whether it has all the appropriate drivers. This data can be included in the incident and made available for the engineer to review without them having to check it themselves.
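
To give a feel for that kind of correlation, here is a hedged Go sketch that, given an instance flagged for failed status checks, looks back through CloudTrail for recent ModifyInstanceAttribute events so that context can be attached to the incident. The instance ID and the 24-hour look-back window are examples only:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudtrail"
)

// recentInstanceChanges returns CloudTrail events that reference the given
// instance within the look-back window.
func recentInstanceChanges(svc *cloudtrail.CloudTrail, instanceID string, window time.Duration) ([]*cloudtrail.Event, error) {
	out, err := svc.LookupEvents(&cloudtrail.LookupEventsInput{
		LookupAttributes: []*cloudtrail.LookupAttribute{
			{
				AttributeKey:   aws.String("ResourceName"),
				AttributeValue: aws.String(instanceID),
			},
		},
		StartTime: aws.Time(time.Now().Add(-window)),
		EndTime:   aws.Time(time.Now()),
	})
	if err != nil {
		return nil, err
	}
	return out.Events, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudtrail.New(sess)

	// "i-0123456789abcdef0" is a placeholder for the instance named in the alert.
	events, err := recentInstanceChanges(svc, "i-0123456789abcdef0", 24*time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range events {
		// Attach anything interesting, such as a recent instance-type change, to the incident.
		if aws.StringValue(e.EventName) == "ModifyInstanceAttribute" {
			fmt.Printf("%s: %s by %s\n",
				aws.TimeValue(e.EventTime).Format(time.RFC3339),
				aws.StringValue(e.EventName),
				aws.StringValue(e.Username))
		}
	}
}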

The premise of this approach is efficiency: using data and automation to make quicker, smarter decisions. Operating and maintaining systems at scale brings numerous challenges, and if you cannot efficiently absorb the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.

For help getting started in automating your systems, contact us.

-Kenneth Weinreich, Managed Cloud Operations


Cloud Autonomics and Automated Management and Optimization

Autonomic systems are an exciting new arena within cloud computing, although the underlying technology is not new by any means. Automation, orchestration and optimization have been alive and well in the datacenter for almost a decade now. Microsoft System Center, IBM Tivoli and ServiceNow are just a few examples of platforms that harness the ability to collect, analyze and make decisions on how to act against sensor data derived from physical and virtual infrastructure and appliances.

Autonomic cloud capabilities are emerging quickly across the cloud ecosystem. These systems can monitor infrastructure, services and applications and make decisions to support remediation and healing, failover and failback, and snapshot and recovery. These abilities come from workflow creation and runbook and playbook development, which pair insight with action and corrective policy enforcement.

In the security and compliance world, we are seeing many great companies come into the mix to bring autonomic-style functionality to life.

Evident is a great example of a technology that functions with autonomic-type capabilities. The product can do some amazing things in terms of automation and action. It provides visibility across the entire cloud platform and identifies and manages risk associated with the operation of core cloud infrastructure and applications within an organization.

Using signatures and control insight as well as custom-defined controls, it can identify exploitable and vulnerable systems at scale and report the current state of risk within an organization. On face value that is not autonomic; however, the next phase it performs is what makes it a great example of autonomics in action.

After analyzing the vulnerability and risk landscape, it reports the current state and derives a set of guided remediations that can either be performed manually against the infrastructure in question or automated, ensuring a proactive, hands-off response that keeps vulnerabilities in check and security compliance maintained.

Moving beyond Evident, the focus going forward is a marriage of many capabilities to enhance autonomic cloud operations. Cloud operations management systems will light up advanced artificial intelligence and machine learning capabilities that take in large amounts of sensor data across many cloud-based technologies and services and derive analysis, insight and proactive remediation – not just for security compliance, but across the board for cloud stabilization, core operations and optimization.

CloudHealth Technologies and many others in the cloud management platform space are looking deeply into how to turn that sensor data into core cloud optimization via automation.

AIOps is a term gaining traction year over year, and it aptly describes how autonomic systems have evolved from the datacenter to the cloud. Gartner is looking deeply into this space, and we at 2nd Watch see promising advancement coming from companies like Palo Alto Networks with their native security platform capabilities along with Evident for continuous compliance and security.

MoogSoft is bringing a next-generation IT incident management platform to the cloud, and its artificial intelligence capabilities for IT operations are helping DevOps teams operate smarter, faster and more effectively by automating traditional IT operations tasks and freeing up engineers to work on the important business-level needs of the organization rather than day-to-day operations. By bringing intelligence to the response to systems issues and challenges, IT operations teams become more agile and better able to solve mission-critical problems and maintain a proactive, highly optimized enterprise cloud.

As we move forward, expect to see more and more AI- and ML-based functionality move into the core cloud management platforms. Cloud ISVs will leverage ever more sensor data to determine response, action and resolution, and this will become tightly coupled to the virtual machine topology and the cloud-native services underlying every cloud provider.

It is an exciting time for autonomic systems capabilities in the cloud, and we are excited to help customers realize the many capabilities and benefits that can help automate, orchestrate and proactively maintain and optimize core cloud infrastructure.

To learn more about autonomic systems and capabilities, check out Gartner’s AIOps research and reach out to 2nd Watch. We would love to help you realize the potential of these technologies in your cloud environment today!

-Peter Meister, Sr Director of Product Management


Logging and Monitoring in the era of Serverless – Part 1

Figuring out monitoring in a holistic sense is still a challenge for many companies, whether with conventional infrastructure or newer platforms like serverless or containers.

For most applications, there are two aspects to monitoring:

  • System Metrics, such as errors, invocations, latency, memory and CPU usage
  • Business Analytics, such as number of signups, number of emails sent, transactions processed, etc.

The former is fairly universal and applicable to any stack to varying degrees. This is what I would call the undifferentiated aspect of monitoring an application. The ability to detect errors and track performance metrics is absolutely necessary to operate an application.

Everything that is old is new again. I am a huge fan of the Twelve-Factor App. If you aren’t familiar with it, I highly suggest taking a look. Drafted in 2011 by developers at Heroku, the Twelve-Factor App is a methodology and set of best practices designed to enable applications to be built with portability and resiliency when deployed to the web.

In the Twelve-Factor App manifesto, it is stated that applications should produce “logs as event streams” and leave it up to the execution environment to aggregate them. If we are to gather information from our application, why not make that present in the logs? We can use our event stream (i.e. application log) to create time-series metrics. Time-series metrics are just datapoints that have been sampled and aggregated over time, which enable developers and engineers to track performance. They allow us to make correlations with events at a specific time.

AWS Lambda works almost exactly in this way by default, aggregating its logs via AWS CloudWatch. CloudWatch organizes logs into groups and streams based on function, version, and container instance, while Lambda adds metadata for each invocation. It is up to the developer to add application-specific logging to their function. CloudWatch, however, will only get you so far. If we want to track more information than just invocations, latency, or memory utilization, we need to analyze the logs more deeply. This is where something like Splunk, Kibana, or other tools come into play.
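
For example, a minimal Go Lambda handler that emits its own application-specific telemetry to stdout (and therefore to CloudWatch Logs) might look like the sketch below; the event shape and field names are made up purely for illustration:

package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/lambda"
)

// signupEvent is a hypothetical payload used only for this example.
type signupEvent struct {
	Email string `json:"email"`
	Plan  string `json:"plan"`
}

func handler(ctx context.Context, evt signupEvent) error {
	// Business-level telemetry: CloudWatch captures stdout/stderr automatically,
	// and a log platform can later parse these structured lines into metrics.
	line, _ := json.Marshal(map[string]string{
		"metric": "signup",
		"plan":   evt.Plan,
	})
	log.Println(string(line))

	// Actual business logic would go here.
	return nil
}

func main() {
	lambda.Start(handler)
}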

In order to get to the meat of our application and the value it is delivering, we need to ensure that additional information (telemetry) is going to the logs as well:

  • Timeouts
  • Configuration failures
  • Stack traces
  • Event objects

Logging out these types of events or information will enable those other tools, with their rich query languages, to build dashboards with just about anything we want on them.

For instance, let’s say we added the following line of code to our application to track an event that was happening from a specific invocation and pull out additional information about execution:

log.Println(fmt.Sprintf("-metrics.%s.blob.%s", environment, method))

In a system that tracks time-series metrics in logs (e.g. SumoLogic), we could build a query like this:

"-metrics.prod.blob." | parse "-metrics.prod.blob.*" as method | timeslice 5m | count(method) group by _timeslice, method | transpose row _timeslice column method

This would give us a nice breakdown of the different methods used in a CRUD or RESTful service, which can then be visualized in the very same tool.

While visualization is nice, particularly when taking a closer look at a problem, it might not be immediately apparent where there is a problem. For that we need some way to grab the attention of the engineers or developers working on the application. Many of the tools mentioned here support some level of monitoring and alerting.

In the next installment of this series we will talk about increasing visibility into your operations and battling dashboard apathy! Check back next week.

-Lars Cromley, Director, Cloud Advocacy and Innovation


Managing Azure Cloud Governance with Resource Policies

I love an all-you-can-eat buffet. There is a ton of value in having a lot to choose from, and you can eat as much (or as little) as you want for a fixed price.

In the same regard, I love the freedom and vast array of technologies that the cloud allows you. A technological all-you-can-eat buffet, if you will. However, there is no fixed price when it comes to the cloud. You pay for every resource! And as you can imagine, it can become quite costly if you are not mindful.

So, how do organizations govern and ensure that their cloud spend is managed efficiently? Well, in Microsoft’s Azure cloud you can mitigate this issue using Azure resource policies.

Azure resource policies let you define what, where, and how resources are provisioned, enabling an organization to set restrictions and gain granular control over its cloud spend.

Azure resource policies allow an organization to control things like:

  • Where resources are deployed – Azure has more than 20 regions all over the world. Resource policies can dictate what regions their deployments should remain within.
  • Virtual Machine SKUs – Resource policies can define only the VM sizes that the organization allows.
  • Azure resources – Resource policies can define the specific resources that are within an organization’s supportable technologies and restrict others that are outside the standards. For instance, if your organization supports SQL and Oracle databases but not Cosmos or MySQL, resource policies can enforce these standards.
  • OS types – Resource policies can define which OS flavors and versions are deployable in an organization’s environment. No longer support Windows Server 2008, or want to limit the Linux distros to a small handful? Resource policies can assist.

Azure resource policies are applied at the resource group or the subscription level. This allows granular control of the policy assignments. For instance, in a non-prod subscription you may want to allow non-standard and non-supported resources to allow the development teams the ability to test and vet new technologies, without hampering innovation. But in a production environment standards and supportability are of the utmost importance, and deployments should be highly controlled. Policies can also be excluded from a scope. For instance, an application that requires a non-standard resource can be excluded at the resource level from the subscription policy to allow the exception.

A number of pre-defined Azure resource policies are available for your use, including:

  • Allowed locations – Used to enforce geo-location requirements by restricting which regions resources can be deployed in.
  • Allowed virtual machine SKUs – Restricts the virtual machine sizes/SKUs that can be deployed to a predefined set of SKUs. Useful for controlling costs of virtual machine resources.
  • Enforce tag and its value – Requires resources to be tagged. This is useful for tracking resource costs for purposes of department chargebacks.
  • Not allowed resource types – Identifies resource types that cannot be deployed. For example, you may want to prevent a costly HDInsight cluster deployment if you know your group would never need it.

Azure also allows custom resource policies when you need a restriction not covered by the pre-defined policies. A policy definition is described using JSON and includes a policy rule.

This JSON example denies the creation of a storage account that does not have blob encryption enabled:

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Storage/storageAccounts"
      },
      {
        "field": "Microsoft.Storage/storageAccounts/enableBlobEncryption",
        "equals": "false"
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}
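
For context, the rule above is only the policyRule portion of a custom definition. As an illustrative sketch (the display name and description below are made up), the full definition body you would submit when creating the custom policy typically looks along these lines:

{
  "properties": {
    "displayName": "Deny unencrypted storage accounts",
    "description": "Denies storage accounts that do not have blob encryption enabled.",
    "mode": "All",
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          { "field": "Microsoft.Storage/storageAccounts/enableBlobEncryption", "equals": "false" }
        ]
      },
      "then": { "effect": "deny" }
    }
  }
}

The definition can then be assigned at the subscription or resource group scope, as described earlier.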

Azure resource policies can go a long way toward ensuring that your organization’s Azure deployments meet your governance and compliance goals. For more information on Azure resource policies, visit https://docs.microsoft.com/en-us/azure/azure-policy/azure-policy-introduction.

For help in getting started with Azure resource policies, contact us.

-David Muxo, Sr Cloud Consultant


Why buy Amazon Web Services through a Partner and who “owns” the account?

As an AWS Premier Partner and audited, authorized APN Managed Service Provider (MSP), 2nd Watch offers comprehensive services to help customers accelerate their journey to the cloud.  For many of our customers we not only provide robust Managed Cloud Services, we also resell Amazon products and services.  What are the advantages for customers who purchase from a value-added AWS reseller?  Why would a customer do this? With an AWS reseller, who owns the account? These are all great questions and the subject of today’s blog post.

I am going to take these questions in reverse order and deal with the ownership issue first, as it is the most commonly misconstrued part of the arrangement.  Let me be clear – when 2nd Watch resells Amazon Web Services as an AWS reseller, our customer “owns” the account.  At 2nd Watch we work hard every day to earn our customers’ trust and confidence and thereby, their business.  Our pricing model for Managed Cloud Services is designed to leverage the advantages of cloud computing’s consumption model – pay for what you use.  2nd Watch customers who purchase AWS through us have the right to move their account to another MSP or purchase direct from AWS if they are unhappy with our services.

I put the word “own” in quotes above because I think it is worth digressing for a minute on how different audiences interpret that word.  Some people see the ownership issue as a vendor lock-in issue, some as an intellectual property concern and still others a liability and security requirement.  For all of these reasons it is important we are specific and precise with our language.

With 2nd Watch’s Managed Cloud Services consumption model, you are not locked in to 2nd Watch as your AWS reseller or MSP.  AWS accounts and usage purchased through us belong to the customer, not 2nd Watch, and therefore any intellectual property contained therein is the responsibility and property of the customer.  Additionally, as the account owner, the customer operates under AWS’s shared responsibility model.  With regard to liability and security, however, our role as an MSP can be a major benefit.

Often MSPs “govern or manage” the IAM credentials for an AWS account to ensure consistency, security and governance.  I use the words govern or manage and not “own” because I want to be clear that the customer still has the right to take back the credentials and overall responsibility for managing each account, which is the opposite of lock-in.  So why would a customer want their MSP to manage their credentials?  The reason is pretty simple: similar to a managed data center or colocation facility, you own the infrastructure, but you hire experts to handle the day-to-day management for increased limits of liability, tighter security and enhanced SLAs.

Simply put, if you, as a customer, want your MSP to carry the responsibility for your AWS account and provide service level agreements (complete with financial repercussions), you are going to want to make sure administrative access to the environment is limited with regards to who can make changes that may impact stability or performance.  As a 2nd Watch Managed Cloud Services customer, allowing us to manage IAM credentials also comes with the benefit of our secure SOC 2 Type 2 (audited) compliant systems and processes.  Often our security controls exceed the capabilities of our customers.

Also worth noting – as we on-board a Managed Cloud Services customer, we often will audit their environment and provide best practice recommendations.  These recommendations are aligned with the excellent AWS Well-Architected framework and help customers achieve greater stability, performance, security and cost optimization.  Our customers have the option of completing the remediation or having 2nd Watch perform the remediation.  Implementing best practices for managing user access along with leveraging cutting-edge technology results in a streamlined journey to the cloud.

So now we have addressed the question of who owns the account, but we haven’t addressed why a customer would want to procure AWS through a partner.  First, see my earlier blog post regarding Cloud Cost Complexity for some background.  Second, buying AWS through 2nd Watch as an AWS partner, or AWS reseller, comes with several immediate advantages:

  • All services are provided at AWS market rates or better.
  • Pass through all AWS volume tier discounts and pricing
  • Pass through AWS Enterprise Agreement terms, if one exists
  • Solution-based and enhanced SLAs (above and beyond what AWS provides) shaped around your business requirements
  • Familiarity with your account – our two U.S.-based NOCs are staffed 24x7x365 and have access to a comprehensive history of your account and governance policies.
  • Access to Enterprise class support including 2nd Watch’s multiple dedicated AWS Technical Account Managers with Managed Cloud Services agreements
  • Consolidate usage across many AWS accounts (see AWS volume discount tiers above)
  • Consolidated billing for both Managed Cloud Services and AWS Capacity
  • Access to our Cloud Management Platform, a web-based console that greatly simplifies the management and analysis of AWS usage
    • Ability to support complex show-back or charge-back bills for different business units or departments as well as enterprise-wide roll-ups for a global view
    • Ability to allocate Volume and Reserved Instance discounts to business units per your requirements
    • Set budgets with alerts, trend analysis, tag reporting, etc.
  • Ability to provide Reserved Instance recommendations and management services
    • Helps improve utilization and prevent spoilage
  • You can select the level of services for Managed Cloud Services on any or all accounts – you can consolidate your purchasing without requiring services you don’t need.
  • Assistance with AWS Account provisioning and governance – we adhere to your corporate standards (and make pro-active recommendations).

In short, buying your AWS capacity through 2nd Watch as your MSP is an excellent value that will help you accelerate your cloud adoption.  We provide the best of AWS with our own services layered on top to enhance the overall offering.  Please contact us for more information about our Managed Cloud Services including Managed AWS Capacity, and 2nd Watch as an AWS Reseller Partner.

-Marc Kagan, Director, Account Management
