
A Case for Enterprises to Leverage Managed Cloud Services

Cloud Adoption is almost mainstream. What are you doing to get on board?

If you follow the hype, you’d think that every enterprise has already migrated its applications to the cloud and that you’re ‘behind the times’ with your on-premises or co-located datacenter. The truth is, many cloud computing technologies are still a few years away from mainstream adoption. Companies find the prospect of moving the majority of their workloads to the cloud daunting, not only because of the cost to migrate, but because their IT organizations aren’t ready to operate in this new world. The introduction of new practices like Infrastructure as Code, CI/CD, serverless, and containers, along with concerns over security and compliance, can leave IT operations teams in a state of flux for years, which causes uptime, reliability, and costs to suffer.

Despite the challenges, Gartner predicts that cloud computing and Software as a Service (SaaS) are less than two years from mainstream adoption (Gartner, Hype Cycle for Cloud Computing, 2018, published July 31, 2018 by David Smith and Ed Anderson).

One group expected to be early adopters of cloud technologies and IaaS is Independent Software Vendors (ISVs). Delivering their software as a service, so their customers can pay as they go, has become an industry requirement. Most ISVs, however, are not working with green-field technology. They have legacy code and monolithic architectures to contend with, which in many cases require a rewrite to function effectively in the cloud. I remember a time when my team (at a multi-national ISV) thought it was ‘good enough’ to fork-lift our executable into Docker and call it a day. That method of delivery will not compete with the Salesforces, ServiceNows, and Splunks of the world.

But how do ISVs compete when Cloud or SaaS Ops isn’t their core competency; when SaaS Ops has now become a distinct part of their product value stream?

The answer is Managed Cloud Services – outsourcing daily IT management for cloud-based services and technical support to automate and enhance your business operations.

Gartner says 75% of fully successful implementations will be delivered by highly skilled, forward-looking boutique managed services providers with a cloud-native, DevOps-centric service delivery approach.

Though this has traditionally been considered a solid solution for small to medium-sized companies looking to adopt cloud without the operational overhead, it has proven to be a game-changer for large enterprises, especially ISVs who can’t ramp up qualified SaaS operations staff fast enough to meet customer demand.

AWS has jumped on board with their own managed services offering called AWS Managed Services (AMS), which provides companies with access to AWS infrastructure, allowing them to scale their software deployments for end-users without increasing resources to manage their operations. The result is a reduction in operational overhead and risk as the company scales up to meet customer demand.

The AMS offering includes:

  • Logging, Monitoring, and Event Management
  • Continuity Management
  • Security and Access Management
  • Patch Management
  • Change Management
  • Provisioning Management
  • Incident Management
  • Reporting

In addition, if the ISV leverages AWS Marketplace to sell their SaaS solution, billing, order processing, and fulfillment can be automated from start to finish, letting them focus on their software and features rather than the minutiae of operating a SaaS business and infrastructure, further reducing the strain of IT management. An example of an integration between AWS Marketplace and AMS that our team at 2nd Watch built for Cherwell Software is pictured here:

[Image: An example of an integration between AWS Marketplace and AMS]

This AMS/AWS Marketplace integration is a win-win for any ISV looking to up their game with a SaaS offering. According to 451 Research, 41% of companies indicate they lack the platform expertise required to fully adopt hosting and cloud services within their organization. For companies whose core competency is not infrastructure or cloud, a managed service is a natural fit.
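
To make that billing automation a bit more concrete, here is a minimal sketch, assuming a SaaS listing with an hourly usage dimension, of how a product might report metered usage to AWS Marketplace with boto3. The product code, customer identifier, and dimension name are placeholders, and this is not the actual Cherwell integration.

import datetime
import boto3

# AWS Marketplace Metering Service client. A SaaS listing reports usage for each
# customer on a schedule (e.g. hourly) and Marketplace handles billing and invoicing.
metering = boto3.client("meteringmarketplace", region_name="us-east-1")

def report_hourly_usage(product_code, customer_identifier, dimension, quantity):
    """Send one usage record to AWS Marketplace for a single customer."""
    response = metering.batch_meter_usage(
        ProductCode=product_code,  # placeholder; comes from the Marketplace listing
        UsageRecords=[{
            "Timestamp": datetime.datetime.utcnow(),
            "CustomerIdentifier": customer_identifier,  # resolved from the buyer's registration token
            "Dimension": dimension,                      # e.g. "ActiveUsers", as defined in the listing
            "Quantity": quantity,
        }],
    )
    return response["Results"], response["UnprocessedRecords"]

# Example: report 42 active users for one customer of a hypothetical listing.
results, unprocessed = report_hourly_usage("exampleproductcode", "CUSTOMER-1234", "ActiveUsers", 42)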

If you’re really looking to get up to speed quickly, our new onboarding service for AWS Managed Services (AMS) helps enterprises accelerate the process to assess, migrate, and operationalize their applications from on-premises to AWS. In addition, our Managed Cloud solutions help clients save 42% more than managing cloud services alone. Schedule a Discovery Workshop to learn more or get started.

I’ll throw one more stat at you: 72% of companies globally, across industries, will adopt cloud computing by 2022, according to the latest Future of Jobs Survey by the World Economic Forum (WEF). If you want to beat the “mainstream” crowd, start your migration now, knowing there are MSPs like 2nd Watch who can help with the transition while minimizing strain on your IT operations team.

-Stefana Muller, Sr Product Manager


Cloud Autonomics and Automated Management and Optimization: Update

The holy grail of IT operations is to achieve a state where all mundane, repeatable remediations occur without intervention, with a human only being woken for actions that simply cannot be automated. This not only allows for many restful nights, it also lets IT operations teams become more agile while maintaining a proactive and highly optimized enterprise cloud. Getting to that state seems like something found only in the greatest online fantasy game, but the growing popularity of “AIOps” gives great hope that it may be closer to reality than once thought.

Skeptics will tell you that automation, autonomics, orchestration, and optimization have been alive and well in the datacenter for more than a decade now. Companies like Microsoft with System Center, IBM with Tivoli, and ServiceNow are just a few examples of autonomic platforms that can collect, analyze, and decide how to act on sensor data derived from physical and virtual infrastructure and appliances. But when you couple these capabilities with the advancements brought by AIOps, you are able to take advantage of the previously missing components: big data analytics along with artificial intelligence (AI) and machine learning (ML).

As you can imagine, these advancements have brought an explosion of new tooling and services from cloud ISVs aiming to make the once-utopian autonomic cloud a reality. Palo Alto Networks’ Prisma Public Cloud is a great example of a technology that functions with autonomic capabilities. The security and compliance features of Prisma Public Cloud are impressive, but it also has a component known as User and Entity Behavior Analytics (UEBA). UEBA analyzes user activity data from logs, network traffic, and endpoints and correlates it with security threat intelligence to identify activities, or behaviors, likely to indicate a malicious presence in your environment. After analyzing the current vulnerability and risk landscape, it reports the current state of risk and derives a set of guided remediations that can either be performed manually against the infrastructure in question or automated, ensuring a proactive, hands-off response so that vulnerabilities and security compliance can always be kept in check.

Another ISV focused on AIOps is Moogsoft, which is bringing a next-generation platform for IT incident management to the cloud. Moogsoft has purpose-built machine learning algorithms designed to better correlate alerts and reduce much of the noise associated with all those data points. When you marry this with its artificial intelligence capabilities for IT operations, it helps DevOps teams operate smarter, faster, and more effectively by automating traditional IT operations tasks.

As we move forward, expect to see more and more AI- and ML-based functionality move into the core cloud management platforms as well. Amazon recently released AWS Control Tower to aid your company’s journey toward AIOps. Along with some pretty incredible features for new account creation and increased multi-account visibility, it uses service control policies (SCPs) based upon established guardrails (rules and policies). As new resources and accounts come online, Control Tower can enforce compliance with those policies automatically, preventing “bad behavior” by users and eliminating the need to have IT configure resources after they come online. Once AWS Control Tower is in use, these guardrails apply to multi-account environments and to new accounts as they are created.
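
To make the guardrail idea concrete, here is a minimal sketch, assuming a boto3 environment with AWS Organizations access, that creates and attaches an SCP denying activity outside approved regions. Control Tower manages its own guardrails behind the scenes; this only illustrates the underlying SCP mechanism, and the region list and OU ID are placeholders.

import json
import boto3

org = boto3.client("organizations")

# Deny all actions outside the approved regions (an illustrative, preventive guardrail).
scp_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": ["us-east-1", "us-west-2"]}},
    }],
}

policy = org.create_policy(
    Name="deny-unapproved-regions",
    Description="Restrict activity to approved regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the SCP to an organizational unit (the OU id below is a placeholder).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examp-12345678",
)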

It is an exciting time for autonomic platforms and autonomic systems capabilities in the cloud, and we are excited to help customers realize the many capabilities and benefits that can help automate, orchestrate, and proactively maintain and optimize core cloud infrastructure.

To learn more about autonomic systems and capabilities, check out Gartner’s AIOps research and reach out to 2nd Watch. We would love to help you realize the potential of autonomic platforms and autonomic technologies in your cloud environment today!

-Dusty Simoni, Sr Product Manager

Operating and maintaining systems at scale with automation

Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools that 2nd Watch develops to solve these problems. Below I outline our approach to this problem and its three main components: Collect, Model, and React.

Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.

Most of the data described above can be gathered from the AWS and Azure APIs directly, while other data may need to be ingested with an agent or by custom scripts. We also need to make sure we have a consistent core set of data being brought in for each of our customers, while also expanding that to include specialized data that only certain customers may have. All the data is gathered and sent to our Splunk indexers. We build an index for every customer to ensure that data stays segregated and secure.
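
As a simplified illustration of this collection step, the sketch below pulls a CloudWatch metric with boto3 and forwards it to a Splunk HTTP Event Collector (HEC) endpoint. The Splunk host, token, and index name are placeholders, and our production pipeline is considerably more involved.

import datetime
import boto3
import requests

SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
SPLUNK_TOKEN = "00000000-0000-0000-0000-000000000000"                        # placeholder
CUSTOMER_INDEX = "customer_acme"                                             # one index per customer

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

def collect_cpu_datapoints(instance_id):
    """Pull the last hour of average CPU utilization for one EC2 instance."""
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return stats["Datapoints"]

def send_to_splunk(event):
    """Forward one event to the per-customer Splunk index via HEC."""
    requests.post(
        SPLUNK_HEC_URL,
        headers={"Authorization": f"Splunk {SPLUNK_TOKEN}"},
        json={"index": CUSTOMER_INDEX, "sourcetype": "aws:cloudwatch", "event": event},
        timeout=10,
    )

for point in collect_cpu_datapoints("i-0123456789abcdef0"):
    send_to_splunk({"metric": "CPUUtilization", **{k: str(v) for k, v in point.items()}})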

Model: Next we need to present the data in a useful way. The modeling of the data can vary depending on who is using it or how it is going to be consumed. A dashboard with a quick look at several important metrics can be very useful for an engineer to see the big picture. Seeing this data daily or throughout the day makes anomalies very apparent. This is especially helpful because manually gathering and organizing the data at scale is time consuming, and could otherwise only reasonably be done during periodic audits.

Modeling the data in Splunk allows for a low-overhead view with up-to-date data so the engineer can focus on more important things. A great example of this is provisioned resources by region. If the engineer looks at the data on a regular basis, they will quickly notice when the number of provisioned resources changes drastically. A 20% increase in the number of EC2 resources could mean several things: perhaps the customer is doing a large deployment, or maybe Justin accidentally put his AWS access key and secret key on GitHub (again).
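
A bare-bones version of that “provisioned resources by region” view can be produced straight from the EC2 API; a rough sketch is below (in practice we chart this from the data already indexed in Splunk).

from collections import Counter
import boto3

def running_instances_by_region():
    """Count running EC2 instances in every region the account can reach."""
    counts = Counter()
    regions = [r["RegionName"] for r in boto3.client("ec2").describe_regions()["Regions"]]
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(Filters=[{"Name": "instance-state-name", "Values": ["running"]}])
        for page in pages:
            for reservation in page["Reservations"]:
                counts[region] += len(reservation["Instances"])
    return counts

if __name__ == "__main__":
    for region, count in running_instances_by_region().most_common():
        print(f"{region}: {count}")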

We provide our customers with regular reports and reviews of their cloud environments, and the data collected and modeled in this tool feeds those reports as well. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used the historical trending data to show progress on a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project, but we used that same view to ensure that waste did not creep back up and that the new tagging standards were being applied going forward.

React: Finally, it’s time to act on the data we have collected and modeled. Using Splunk alerts, we can apply conditional logic to the data patterns and act upon them. From Splunk we can call our ticketing system’s API and create a new incident for an engineer to investigate concerning trends or to notify the customer of a potential security risk. We can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, and requesting cloud provider limit increases.
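
As a sketch of one such remediation workflow, the snippet below finds S3 buckets without a default encryption configuration and enables AES-256 on them. It is a simplified stand-in for the workflow our APIs actually trigger.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def ensure_default_encryption():
    """Enable AES-256 default encryption on any bucket that lacks an encryption config."""
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            s3.get_bucket_encryption(Bucket=name)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ServerSideEncryptionConfigurationNotFoundError":
                raise
            s3.put_bucket_encryption(
                Bucket=name,
                ServerSideEncryptionConfiguration={
                    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
                },
            )
            print(f"Enabled default encryption on {name}")

if __name__ == "__main__":
    ensure_default_encryption()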

Because we have several independent data sources providing information, we can also correlate events and apply more advanced conditional logic. If we see that a server is failing status checks, we can also look to see whether it recently changed instance families or whether it has all the appropriate drivers. This data can be included in the incident and made available for the engineer to review without having to check it themselves.
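
A simplified version of that correlation might look like the following sketch: check the instance’s status checks, then look back through CloudTrail for a recent ModifyInstanceAttribute call that could explain the failure. The 24-hour lookback window is an arbitrary choice for illustration.

import datetime
import boto3

ec2 = boto3.client("ec2")
cloudtrail = boto3.client("cloudtrail")

def correlate_status_failure(instance_id, lookback_hours=24):
    """Pair a failing status check with any recent instance attribute changes."""
    status = ec2.describe_instance_status(InstanceIds=[instance_id], IncludeAllInstances=True)
    checks = status["InstanceStatuses"][0] if status["InstanceStatuses"] else {}

    now = datetime.datetime.utcnow()
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "ResourceName", "AttributeValue": instance_id}],
        StartTime=now - datetime.timedelta(hours=lookback_hours),
        EndTime=now,
    )["Events"]
    recent_changes = [e["EventName"] for e in events if e["EventName"] == "ModifyInstanceAttribute"]

    return {
        "instance_status": checks.get("InstanceStatus", {}).get("Status"),
        "system_status": checks.get("SystemStatus", {}).get("Status"),
        "recent_attribute_changes": recent_changes,
    }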

The entire premise of this idea and the solution it outlines is efficiency: using data and automation to make quicker, smarter decisions. Operating and maintaining systems at scale brings numerous challenges, and if you cannot efficiently accommodate the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.

For help getting started in automating your systems, contact us.

-Kenneth Weinreich, Managed Cloud Operations


Cloud Autonomics and Automated Management and Optimization

Autonomic systems are an exciting arena within cloud computing, although the underlying technology is not new by any means. Automation, orchestration and optimization have been alive and well in the datacenter for almost a decade now. Companies like Microsoft with System Center, IBM with Tivoli and ServiceNow are just a few examples of platforms that can collect, analyze and decide how to act on sensor data derived from physical and virtual infrastructure and appliances.

Autonomic cloud capabilities are lighting up quickly across the cloud ecosystem. These systems can monitor infrastructure, services and systems and make decisions to support remediation and healing, failover and failback, and snapshot and recovery. The abilities come from workflow creation and runbook and playbook development, which support a broad range of insight paired with action and corrective policy enforcement.

In the compliance world, we are seeing many great companies come into the mix to bring autonomic type functionality to life in the world of security and compliance.

Evident is a great example of a technology that functions with autonomic-type capabilities. The product can do some amazing things in terms of automation and action. It provides visibility across the entire cloud platform and identifies and manages risk associated with the operation of core cloud infrastructure and applications within an organization.

Using signatures and control insight, as well as custom-defined controls, it can determine exploitable and vulnerable systems at scale and report the current state of risk within an organization. That, on face value, is not autonomic; however, the next phase it performs is critical to why it is a great example of autonomics in action.

After analyzing the current vulnerability and risk landscape, it reports the current state of risk and derives a set of guided remediations that can either be performed manually against the infrastructure in question or automated, ensuring a proactive, hands-off response so that vulnerabilities and security compliance can always be kept in check.

Moving away from Evident, the focus going forward is a marriage of many things to increase systems capabilities and enhance autonomic cloud operations. Operations management systems in the cloud will light up advanced artificial intelligence (AI) and machine learning (ML) capabilities that take in large amounts of sensor data across many cloud-based technologies and services and derive analysis, insight and proactive remediation, not just for security compliance, but across the board in terms of cloud stabilization, core operations and optimization.

CloudHealth Technologies and many others in the cloud management platform space are looking deeply into how to turn that sensor data into core cloud optimization through automation.

AIOps is a term growing year over year, and it fits well to describe how autonomic systems have evolved from the datacenter to the cloud. Gartner is looking deeply into this space, and we at 2nd Watch see promising advancement coming from companies like Palo Alto Networks with their native security platform capabilities along with Evident for continuous compliance and security.

Moogsoft is bringing a next-generation platform for IT incident management to the cloud, and its artificial intelligence capabilities for IT operations are helping DevOps teams operate smarter, faster and more effectively by automating traditional IT operations tasks and freeing up IT engineers to work on the important business-level needs of the organization rather than day-to-day IT operations. By bringing intelligence to the response to systems issues and challenges, IT operations teams can become more agile and more capable of solving mission-critical problems while maintaining a proactive and highly optimized enterprise cloud.

As we move forward, expect to see more and more AI- and ML-based functionality move into the core cloud management platforms. Cloud ISVs will leverage more and more sensor data to determine response, action and resolution, and this will become tightly coupled to the virtual machine topology and the cloud-native services underlying all cloud providers moving forward.

It is an exciting time for autonomic systems capabilities in the cloud, and we are excited to help customers realize the many capabilities and benefits that can help automate, orchestrate, and proactively maintain and optimize core cloud infrastructure.

To learn more about autonomic systems and capabilities, check out Gartner’s AIOps research and reach out to 2nd Watch. We would love to help you realize the potential of these technologies in your cloud environment today!

-Peter Meister, Sr Director of Product Management


Logging and Monitoring in the era of Serverless – Part 1

Figuring out monitoring in a holistic sense is still a challenge for many companies, whether with conventional infrastructure or newer platforms like serverless and containers.

In most applications there are two aspects to monitoring:

  • System metrics, such as errors, invocations, latency, and memory and CPU usage
  • Business analytics, such as the number of signups, number of emails sent, transactions processed, etc.

The former is fairly universal and generally applicable in any stack to a varying degree. This is what I would call the undifferentiated aspect of monitoring an application. The abilities to perform error detection and track performance metrics are absolutely necessary to operate an application.

Everything that is old is new again. I am a huge fan of the Twelve-Factor App. If you aren’t familiar with it, I highly suggest taking a look. Drafted in 2011 by developers at Heroku, the Twelve-Factor App is a methodology and set of best practices designed to enable applications to be built with portability and resiliency when deployed to the web.

In the Twelve-Factor App manifesto, it is stated that applications should produce “logs as event streams” and leave it up to the execution environment to aggregate them. If we are going to gather information from our application, why not make it present in the logs? We can use our event stream (i.e., the application log) to create time-series metrics. Time-series metrics are simply datapoints that have been sampled and aggregated over time, enabling developers and engineers to track performance. They allow us to make correlations with events at a specific time.

AWS Lambda works almost exactly this way by default, aggregating its logs via Amazon CloudWatch. CloudWatch organizes logs based on function, version, and containers, while Lambda adds metadata for each invocation. It is up to the developer to add application-specific logging to their function. CloudWatch, however, will only get you so far. If we want to track more information than just invocations, latency, or memory utilization, we need to analyze the logs more deeply. This is where something like Splunk, Kibana, or other tools come into play.
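
As a minimal sketch of that idea (shown here in Python purely for illustration), the Lambda handler below writes a structured metric line to stdout on every invocation; CloudWatch Logs captures stdout as the function’s log stream, and a tool such as Splunk or Kibana can parse the fields into time-series metrics.

import json
import time

def handler(event, context):
    """Example Lambda handler that emits a structured metric line per invocation."""
    started = time.time()

    # ... application logic would go here ...
    records_processed = len(event.get("Records", []))

    # One structured log line per invocation; CloudWatch Logs stores stdout,
    # and a log analysis tool can turn these fields into time-series metrics.
    print(json.dumps({
        "metric": "records_processed",
        "value": records_processed,
        "duration_ms": round((time.time() - started) * 1000, 2),
        "function": context.function_name,
        "request_id": context.aws_request_id,
    }))

    return {"statusCode": 200}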

In order to get to the meat of our application and the value it is delivering, we need to ensure that additional information (telemetry) is going to the logs as well, for example:

  • Timeouts
  • Configuration failures
  • Stack traces
  • Event objects

Logging these types of events and information enables tools with rich query languages to build dashboards with just about anything we want on them.

For instance, let’s say we added the following line of code to our application to track an event that was happening from a specific invocation and pull out additional information about execution:

log.Println(fmt.Sprintf("-metrics.%s.blob.%s", environment, method))

In a system that tracks time-series metrics in logs (e.g. SumoLogic), we could build a query like this:

"-metrics.prod.blob." | parse "-metrics.prod.blob.*" as method | timeslice 5m | count(method) group by _timeslice, method | transpose row _timeslice column method

This would give us a nice breakdown of the different methods used in a CRUD or RESTful service, which can then be visualized in the very same tool.

While visualization is nice, particularly when taking a closer look at a problem, it might not be immediately apparent where there is a problem. For that we need some way to grab the attention of the engineers or developers working on the application. Many of the tools mentioned here support some level of monitoring and alerting.

In the next installment of this series we will talk about increasing visibility into your operations and battling dashboard apathy! Check back next week.

-Lars Cromley, Director, Cloud Advocacy and Innovation


Managing Azure Cloud Governance with Resource Policies

I love an all-you-can-eat buffet. There is a lot to choose from, and you can eat as much or as little as you want, all for a fixed price.

In the same regard, I love the freedom and vast array of technologies that the cloud offers you. A technological all-you-can-eat buffet, if you will. However, there is no fixed price when it comes to the cloud. You pay for every resource! And as you can imagine, it can become quite costly if you are not mindful.

So, how do organizations govern and ensure that their cloud spend is managed efficiently? Well, in Microsoft’s Azure cloud you can mitigate this issue using Azure resource policies.

Azure resource policies allow you to define what, where or how resources are provisioned, thus allowing an organization to set restrictions and enable some granular control over their cloud spend.

Azure resource policies allow an organization to control things like:

  • Where resources are deployed – Azure has more than 20 regions all over the world. Resource policies can dictate what regions their deployments should remain within.
  • Virtual Machine SKUs – Resource policies can define only the VM sizes that the organization allows.
  • Azure resources – Resource policies can define the specific resources that fall within an organization’s supported technologies and restrict those outside the standards. For instance, if your organization supports SQL and Oracle databases but not Cosmos DB or MySQL, resource policies can enforce those standards.
  • OS types – Resource policies can define which OS flavors and versions are deployable in an organization’s environment. No longer support Windows Server 2008, or want to limit the Linux distros to a small handful? Resource policies can assist.

Azure resource policies are applied at the resource group or subscription level, which allows granular control of policy assignments. For instance, in a non-production subscription you may want to allow non-standard, unsupported resources so development teams can test and vet new technologies without hampering innovation. But in a production environment, standards and supportability are of the utmost importance, and deployments should be tightly controlled. Policies can also exclude parts of a scope; for instance, an application that requires a non-standard resource can be excluded at the resource level from the subscription policy to allow the exception.
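
As a rough, assumption-laden sketch of that scoping model, the Azure CLI commands below assign the built-in Allowed locations policy at the subscription scope while excluding a single resource group. The subscription ID, resource group name, allowed regions, and definition lookup are all placeholders.

# Find the built-in "Allowed locations" policy definition, then assign it to the
# subscription while excluding one resource group that needs an exception.
az policy definition list --query "[?displayName=='Allowed locations'].name" -o tsv

az policy assignment create \
  --name restrict-locations \
  --policy <definition-name-from-above> \
  --scope "/subscriptions/<subscription-id>" \
  --not-scopes "/subscriptions/<subscription-id>/resourceGroups/<excluded-resource-group>" \
  --params '{ "listOfAllowedLocations": { "value": ["eastus", "westus2"] } }'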

A number of pre-defined Azure resource policies are available for your use, including:

  • Allowed locations – Used to enforce geo-location requirements by restricting which regions resources can be deployed in.
  • Allowed virtual machine SKUs – Restricts the virtual machine sizes/SKUs that can be deployed to a predefined set. Useful for controlling the costs of virtual machine resources.
  • Enforce tag and its value – Requires resources to be tagged. This is useful for tracking resource costs for purposes of department chargebacks.
  • Not allowed resource types – Identifies resource types that cannot be deployed. For example, you may want to prevent a costly HDInsight cluster deployment if you know your group would never need it.

Azure also allows custom resource policies when you need a restriction not covered by a pre-defined policy. A policy definition is described using JSON and includes a policy rule.

This JSON example denies a storage account from being created without blob encryption being enabled:

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Storage/storageAccounts"
      },
      {
        "field": "Microsoft.Storage/storageAccounts/enableBlobEncryption",
        "equals": "false"
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}
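
Assuming the rule above is saved to a local file (the file name below is purely illustrative), a hedged sketch of registering and assigning it with the Azure CLI might look like this:

# Register the custom policy definition from the JSON rule above, then assign it
# at the subscription scope so non-compliant storage accounts are denied.
az policy definition create \
  --name deny-unencrypted-storage \
  --display-name "Deny storage accounts without blob encryption" \
  --rules deny-unencrypted-storage.json

az policy assignment create \
  --name deny-unencrypted-storage \
  --policy deny-unencrypted-storage \
  --scope "/subscriptions/<subscription-id>"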

The use of Azure resource policies can go a long way toward ensuring that your organization’s Azure deployments meet your governance and compliance goals. For more information on Azure resource policies, visit https://docs.microsoft.com/en-us/azure/azure-policy/azure-policy-introduction.

For help in getting started with Azure resource policies, contact us.

-David Muxo, Sr Cloud Consultant
