Cloud DR: Recovery Begins Well Before Disaster Strikes

In the world of IT, disasters come in all shapes and sizes: infrastructure and application outages, human error, data corruption, ransomware, malicious attacks, and other unplanned events.  Other than perhaps a hurricane or blizzard, we often don’t have visibility into when a disaster will occur.  After the immediate impact of the disaster subsides, the focus rapidly shifts to recovery.

At the core of disaster recovery is a focus on how quickly applications and data can be restored so you can resume servicing your customers. Downtime means lost productivity, lost revenue, and even lost profit when credits must be paid out to customers for failure to maintain service levels.

But disaster recovery goes well beyond the post-crisis events, and its success hinges on the preparation done well in advance of any disaster occurring. A disaster recovery strategy should not be confused with a business continuity plan. A business continuity plan is far greater in scope, covering not only how to recover your IT systems, data, and applications to service customers again, but how to continue running your business beyond IT system disruptions.  For example, a business continuity plan will outline what steps to take when the physical building becomes unavailable and your employees can’t come into the office, or how to handle supply chain disruptions.

When discussing disaster recovery strategies, backup and disaster recovery are often used synonymously.  Backup should factor into your business continuity planning, and in some cases a backup may be sufficient to restore your systems and meet compliance requirements.  However, backups are a point-in-time solution and can take significant time to restore your systems, delaying your recovery. Compounding this dilemma, backups are only as up to date as the last snapshot taken, which, for many, could mean losing a complete day’s worth of sales.  A solid disaster recovery strategy should not only focus on recovering your systems but do so in a manner that meets or exceeds the business requirements and minimizes the disruption to your customers.

Traditional disaster recovery solutions have required significant investment, both financially and in human resources.  It’s not unusual for enterprises to purchase fully redundant hardware and duplicative software licenses, locate that hardware in geographically dispersed colocation facilities, set up connectivity and replication between the two sites, and have IT admins maintain the second site, which is commonly under-utilized.

Cloud-based disaster recovery solves many of these problems at a fraction of the price. To help bring this solution to our customers, 2nd Watch has partnered with CloudEndure, an AWS Company, to help enterprises accelerate their adoption of cloud disaster recovery.

The CloudEndure Disaster Recovery solution replicates everything in real time, meaning your data is always up to date, down to the second, allowing you to achieve your Recovery Point Objectives (RPOs).  CloudEndure provisions a very low-cost staging area in AWS, eliminating the need for duplicate resource provisioning. Should a disaster occur, automated orchestration combined with machine conversion enables you to achieve a Recovery Time Objective (RTO) of minutes and pay for cloud instances only when they are actually needed.

Our Cloud Disaster Recovery service provides you a disaster recovery proof of concept for 100 machines in less than 30 days, while allowing you to continue to leverage your entire existing infrastructure.  We apply our proven methodology to ensure your organization is getting optimal value from your existing infrastructure while allowing fast, easy, and cost-effective recovery in the AWS cloud.

Download our datasheet to learn more about our Cloud Disaster Recovery service.

-Dusty Simoni, Sr Product Manager

How to Achieve Redundancy for High Availability in the Cloud

In the last of our four-part blog series with our strategic partner, Alert Logic, we explore business resumption for cloud environments. Check out last week’s article on Free Tools and Tips for Testing the Security of Your Environment Against Attacks first.

Business resumption, also known as disaster recovery, has always been a challenge for organizations. Aside from those in the banking and investment industry, many businesses don’t take business resumption as seriously as they should.

I formerly worked at a financial institution that would send teams to another city in another state where production data was backed up and could be restored in the event of a disaster. Employees would go to this location and use the production systems to complete their daily workloads. This provided the redundancy of a single site, but what if you could have many redundant sites? What if you could have a global backup option and have redundancy not only when you need it, but as a daily part of your business strategy?

To achieve true redundancy, I recommend understanding your service provider’s offerings. Each service provider has facilities located in different regions, spread across different telecom service providers.

From a customer’s perspective, this creates a good opportunity to build out an infrastructure with fully redundant load balancers, giving your business a regional presence in almost every part of the world. In addition, you are able to deliver application speed and efficiency to your regional consumers.

Look closely at your provider’s services like hardware health monitoring, log management, security monitoring and all the management services that accompany those solutions.  If you need to conform to certain compliance regulations, you also need to make sure the services and technologies meet each regulation.

Organize your vendors and managed service providers so that your data is centralized by service across all providers and all layers of the stack. This is when you need to make sure that your partners share data, have the ability to ingest logs, and exchange APIs with each other to effectively secure your environment.

Additionally, centralize the notification process so you are getting one call per incident instead of multiple calls across providers. This means that API connectivity or log collection needs to happen between technologies that correlate triggered events across multiple platforms. This will centralize your notifications, increase efficiency, and decrease detection time, mitigating risks introduced into your environment by outside and inside influences.

Lastly, to find incidents as quickly as possible, you need a managed services provider that can ingest and correlate all events and logs across all of your infrastructure. There are also cloud migration services that will help you with all these decisions as they help move you to the cloud.

Learn more about 2W Managed Cloud Security and how our partnership with Alert Logic can ensure your environment’s security.

Article contributed by Alert Logic

A Refresher Course on Disaster Recovery with AWS

IT infrastructure is the hardware, network, services and software required for enterprise IT. It is the foundation that enables organizations to deliver IT services to their users. Disaster recovery (DR) is preparing for and recovering from natural and people-related disasters that impact IT infrastructure for critical business functions. Natural disasters include earthquakes, fires, etc. People-related disasters include human error, terrorism, etc. Business continuity differs from DR as it involves keeping all aspects of the organization functioning, not just IT infrastructure.

When planning for DR, companies must establish a recovery time objective (RTO) and recovery point objective (RPO) for each critical IT service. RTO is the acceptable amount of time in which an IT service must be restored. RPO is the acceptable amount of data loss measured in time; for example, an RPO of one hour means the business can tolerate losing, at most, the last hour of data. Companies establish both RTOs and RPOs to mitigate financial and other types of loss to the business. Companies then design and implement DR plans to effectively and efficiently recover the IT infrastructure necessary to run critical business functions.

For companies with corporate datacenters, the traditional approach to DR involves duplicating IT infrastructure at a secondary location to ensure available capacity in a disaster. The key downside is IT infrastructure must be bought, installed and maintained in advance to address anticipated capacity requirements. This often causes IT infrastructure in the secondary location to be over-procured and under-utilized. In contrast, Amazon Web Services (AWS) provides companies with access to enterprise-grade IT infrastructure that can be scaled up or down for DR as needed.

The four most common DR architectures on AWS are:

  • Backup and Restore ($) – Companies can use their current backup software to replicate data into AWS. Companies use Amazon S3 for short-term archiving and Amazon Glacier for long-term archiving. In the event of a disaster, data can be made available on AWS infrastructure or restored from the cloud back onto an on-premise server.
  • Pilot Light ($$) – While Backup and Restore focuses on data, Pilot Light includes applications. Companies provision only the core infrastructure needed for critical applications. When disaster strikes, Amazon Machine Images (AMIs) and other automation services are used to quickly provision the remaining environment for production.
  • Warm Standby ($$$) – Taking the Pilot Light model one step further, Warm Standby creates an active/passive cluster. The minimum amount of capacity is provisioned in AWS. When needed, the environment rapidly scales up to meet full production demands, giving companies near-100% uptime with near-zero downtime.
  • Hot Standby ($$$$) – Hot Standby is an active/active cluster with both cloud and on-premise components. Using weighted DNS load-balancing, IT determines how much application traffic to process in-house and on AWS (a sketch of weighted DNS routing follows this list). If a disaster or spike in load occurs, more or all of it can be routed to AWS with auto-scaling.
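Weighted DNS load-balancing is the mechanism behind the Hot Standby split. Below is a minimal sketch using boto3, the AWS SDK for Python (one option among several; the hosted zone ID, record name, and IP addresses are placeholders). In a disaster, shifting the weights to 0 and 100 routes all traffic to AWS.

    import boto3

    route53 = boto3.client("route53")

    # Send roughly 70% of traffic to the on-premise endpoint and 30% to AWS.
    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",  # placeholder zone ID
        ChangeBatch={"Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "app.example.com", "Type": "A",
                 "SetIdentifier": "on-premise", "Weight": 70, "TTL": 60,
                 "ResourceRecords": [{"Value": "203.0.113.10"}]}},
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "app.example.com", "Type": "A",
                 "SetIdentifier": "aws", "Weight": 30, "TTL": 60,
                 "ResourceRecords": [{"Value": "198.51.100.20"}]}},
        ]},
    )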

In a non-disaster environment, Warm Standby DR is not scaled for full production but is fully functional. To help absorb/justify the cost, companies can use the DR site for non-production work, such as quality assurance, testing, etc. For Hot Standby DR, cost is determined by how much production traffic is handled by AWS in normal operation. In the recovery phase, companies pay only for the additional capacity they use, and only for the duration the DR site runs at full scale. In Hot Standby, companies can further reduce the costs of their “always on” AWS servers with Reserved Instances (RIs).

Smart companies know disaster is not a matter of if, but when. According to a study done by the University of Oregon, every dollar spent on hazard mitigation, including DR, saves companies four dollars in recovery and response costs. In addition to cost savings, smart companies also view DR as critical to their survival. For example, 51% of companies that experienced a major data loss closed within two years (Source: Gartner), and 44% of companies that experienced a major fire never re-opened (Source: EBM). Again, disaster is not a matter of if, but when. Be ready.

-Josh Lowry, General Manager – West

Increasing Your Cloud Footprint

The jump to the cloud can be a scary proposition.  For an enterprise with systems deeply embedded in traditional infrastructure like back-office computer rooms and datacenters, the move to the cloud can be daunting. The thought of having all of your data in someone else’s hands can make some IT admins cringe.  However, once you start looking into cloud technologies, you start seeing some of the great benefits, especially with providers like Amazon Web Services (AWS).  The cloud can be cost-effective, elastic, scalable, flexible, and secure.  That same IT admin cringing at the thought of their data in someone else’s hands may finally realize that AWS is a bit more secure than a computer rack sitting under an employee’s desk in a remote office.  Once the decision is finally made to “try out” the cloud, the planning phase can begin.

Most of the time the biggest question is, “How do we start with the cloud?”  The answer is to use a phased approach.  By picking applications and workloads that are less mission critical, you can try the newest cloud technologies with less risk.  When deciding which workloads to move, ask yourself the following questions: Is there a business need for moving this workload to the cloud?  Is the technology a natural fit for the cloud?  What impact will this have on the business?  If all those questions are suitably answered, your workloads will be successful in the cloud.

One great place to start is with archiving and backups.  These types of workloads are important, but the data you’re dealing with is likely just a copy of data you already have, so it is considerably less risky.  The easiest way to start with archives and backups is to try out S3 and Glacier.  Many of the backup utilities you may already be using, like Symantec NetBackup and Veeam Backup & Replication, have cloud versions that can back up directly to AWS. This allows you to start using the cloud without changing much of your embedded backup processes.  By moving less critical workloads, you are taking the first steps in increasing your cloud footprint.
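If you would rather script the upload step yourself, the sketch below shows the idea with boto3, the AWS SDK for Python (the bucket name and file path are placeholders, and your backup tool may already handle this for you):

    import boto3

    s3 = boto3.client("s3")

    # Ship the nightly backup archive to S3 as the final step of the backup job.
    s3.upload_file(
        Filename="/backups/nightly.tar.gz",  # whatever your backup job produces
        Bucket="example-backup-bucket",      # placeholder bucket name
        Key="nightly/nightly.tar.gz",
    )

From there, an S3 lifecycle rule can transition older archives to Glacier for long-term retention.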

Now that you have moved your backups to AWS using S3 and Glacier, what’s next?  The next logical step is to try some of the other services AWS offers.  Another workload that can often be moved to the cloud is Disaster Recovery.  DR is an area that will allow you to use more AWS services, like VPC, EC2, EBS, RDS, Route53, and ELBs.  DR is a perfect way to increase your cloud footprint because it allows you to reconstruct your current environment, which you should already be very familiar with, in the cloud.  A Pilot Light DR solution is one type of DR solution commonly seen in AWS.  In the Pilot Light scenario, the DR site has minimal systems and resources, with the core elements already configured to enable rapid recovery once a disaster happens.  To build a Pilot Light DR solution, you would create the AWS network infrastructure (VPC), deploy the core AWS building blocks needed for the minimal Pilot Light configuration (EC2, EBS, RDS, and ELBs), and determine the process for recovery (Route53).  When it is time for recovery, all the other components can be quickly provisioned to give you a fully working environment. By moving DR to the cloud, you’ve increased your cloud footprint even more and are on your way to cloud domination!
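The recovery step in a Pilot Light design is largely scriptable. Here is a minimal boto3 sketch of what “quickly provisioning the remaining components” might look like; the AMI, subnet, and security group IDs are placeholders for resources you would have staged in advance:

    import boto3

    ec2 = boto3.client("ec2")

    # Launch the application tier from pre-built AMIs into the standby VPC.
    response = ec2.run_instances(
        ImageId="ami-12345678",            # placeholder pre-built AMI
        InstanceType="m3.large",
        MinCount=2,
        MaxCount=2,
        SubnetId="subnet-12345678",        # placeholder standby-VPC subnet
        SecurityGroupIds=["sg-12345678"],  # placeholder security group
    )
    instance_ids = [i["InstanceId"] for i in response["Instances"]]

    # Wait until the instances are running before cutting DNS (Route53) over.
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)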

The next logical step is to move Test and Dev environments into the cloud. Here you can get creative with the way you use the AWS technologies.  When building systems on AWS, make sure to follow the Architecting Best Practices: Designing for failure means nothing will fail, decouple your components, take advantage of elasticity, build security into every layer, think parallel, and don’t fear constraints!  Start with a proof of concept (POC) in the development environment, and use AWS reference architectures to aid in the learning and planning process.  Next, test your legacy application in the new environment and migrate data.  The POC is not complete until you validate that it works and performs to your expectations.  Once you get to this point, you can reevaluate the build and optimize it to the exact specifications needed. Finally, you’re one step closer to deploying actual production workloads to the cloud!

Production workloads are obviously the most important, but with the phased approach you’ve taken to increase your cloud footprint, it’s not that far of a jump from the other workloads you now have running in AWS.   Some of the important things to remember to be successful with AWS include being aware of the rapid pace of the technology (this includes improved services and price drops), that security is your responsibility as well as Amazon’s, and that there isn’t a one-size-fits-all solution.  Lastly, all workloads you implement in the cloud should still have stringent security and comprehensive monitoring as you would on any of your on-premises systems.

Overall, a phased approach is a great way to start using AWS.  Start with simple services and traditional workloads that have a natural fit for AWS (e.g. backups and archiving).  Next, start to explore other AWS services by building out environments that are familiar to you (e.g. DR). Finally, experiment with POCs and the entire gamut of AWS services to enable more efficient production operations.  Like many new technologies, adoption takes time. By increasing your cloud footprint over time, you can set expectations for cloud technologies in your enterprise and make the cloud a more comfortable proposition for all.

-Derek Baltazar, Senior Cloud Engineer

The next frontier: recovering in the cloud

The pervasive technology industry has created the cloud and all the acronyms that go with it.   Growth is fun, and the cloud is the talk of the town. From the California Sun to the Kentucky coal mines we are going to the cloud, although Janis Joplin may have been there before her time.  Focus and clarity will come later.

There is so much data being stored today that the biggest challenge is going to be how to quantify it, store it, access it, and recover it. Cloud-based disaster recovery has broad-based appeal across industries and segment sizes.  Using a service from the AWS cloud enables more efficient disaster recovery of mission-critical applications without any upfront cost or commitment.  AWS allows customers to provision virtual private clouds using its infrastructure, which offers complete network isolation and security. The cloud can be used to configure a “pilot light” architecture, which dramatically reduces cost over traditional data centers, where the concept of “pilot” or “warm” is not an option – you pay for your infrastructure continually, whether it’s used or not. With AWS, you only pay for what you use, and you have complete control of your data and its security.

Backing data up is relatively simple: select an object to be backed up and click a button. More often than not, the encrypted data reaches its destination, whether to a local storage device or to an S3 bucket in an AWS region in Ireland.  Restoring the data has always been the real challenge. What the cloud does is make testing of those backup capabilities more flexible and more cost effective.  As the cost of cloud-based testing falls rapidly, from thousands of dollars (or dinars) to hundreds, it results in more testing and, therefore, more success after a failure, whether it comes from a superstorm or something far more mundane.

-Nick Desai, Solutions Architect

Disaster Recovery – Don’t wait until it’s too late!

October 28th marked the one-year anniversary of Hurricane Sandy, the epic storm that ravaged the northeastern part of the United States. Living in NJ, where the hurricane made landfall, and having family across much of the state, we personally lived through the hurricane and its aftermath. It’s hard to believe that it’s been a year already. It’s an experience we’ll never forget, and we have made plans to ensure that we’re prepared in case anything like that happens again. Business mirrors life in many cases, and when I speak with customers across the country, the topic of disaster recovery comes up often. The conversations typically follow predictable patterns:

  • I’ve just inherited the technology and systems of company X (we’ll call it company X to protect the innocent), and we have absolutely no backup or disaster recovery strategy at all. Can you help me?
  • We had a disaster recovery strategy, but we haven’t really looked at it in a very long time, I’ve heard Cloud Computing can help me. Is that true?
  • We have a disaster recovery approach we’re thinking about. Can you review it and validate that we’re leveraging best practices?
  • I’m spending a fortune on disaster recovery gear that just sits idle 95% of the time. There has to be a better way.

The list is endless with permutations, and yes, there is a better way. Disaster recovery is a very common workload for a cloud computing solution, and there are a number of ways you can approach it. As with anything, there are tradeoffs between cost and functionality, and the right choice typically depends on the business requirements. For example, a full active/active environment where you need complete redundancy and sub-second failover can be costly, but potentially necessary depending on your business requirements. In the Financial Services industry, for example, having revenue-generating systems down for even a few seconds can cost a company millions of dollars.

We have helped companies of all sizes think about, design, and implement disaster recovery strategies: from Pilot Lights, where there’s just the glimmer of an environment, to warm standbys, to fully redundant systems. The first step is to plan for the future and not wait until it’s too late.

-Mike Triolo, General Manager East

AWS DR in Reverse!

Amazon and 2nd Watch have published numerous white papers and blog articles on various ways to use Amazon Web Services™ (AWS) for a disaster recovery strategy.  And there is no doubt at all that AWS is an excellent place to run a disaster recovery environment for on premise data centers and save companies enormous amounts of capital while preserving their business with the security of a full DR plan.  For more on this, I’ll refer you to our DR is Dead article as an example of how this works.

What happens, though, when you truly cannot have any downtime for your systems or a subset of your systems?  Considering recent events like Hurricanes Sandy and Katrina, how do critical systems use AWS for DR when Internet connectivity cannot be guaranteed?  How can cities prone to earthquakes justify putting DR systems in the Cloud when the true disasters they have to consider involve large craters and severed fiber cables?  Suddenly having multiple Internet providers doesn’t seem like quite enough redundancy when all systems are Cloud-based.  Now to be fair, in such major catastrophes most users have greater concerns than ‘can I get my email?’ or ‘where’s that TPS report?’ But what if your systems are supporting first responders?  Then DR has an even greater level of importance.

Typically, this situation is what keeps systems that support first responders, medical providers, and government from adopting a Cloud strategy or Cloud DR strategy.  This is where a Reverse DR strategy has merit: moving production systems into AWS but keeping a pilot light environment on premise.  I won’t reiterate the benefits of moving to AWS; there are more articles on this than I can possibly reference (but please, contact the 2nd Watch sales team and they’ll be happy to expound upon the benefits!) or rehash Ryan’s article on DR is Dead.  What I will say is this: if you can move to AWS without risking those critical disaster response systems, why aren’t you?

By following the pilot light model in reverse, customers can leave just enough on premise to keep the lights on in the event of disaster.  With regularly scheduled tests to make sure those on-premise systems are running and sufficient for emergencies, customers can take advantage of the Cloud for a significant portion of their environments.  From my experience, once an assessment is completed to validate which systems are required on premise to support enough staff in the event of a disaster, most customers find themselves able to move 90%+ of their environment to the Cloud, save a considerable amount of money, and suffer no loss of functionality.

So put everything you’ve been told about DR in the Cloud in reverse, move your production environments to AWS and leave just enough to handle those pesky hurricanes on premise, and you’ve got yourself a reverse DR strategy using AWS.

-Keith Homewood, Cloud Architect

Database Recovery: DR Your DB

Databases tend to host the most critical data your business has. From orders, customers, products and even employee information – it’s everything that your business depends on. How much of that can you afford to lose?

With AWS you have options for database recovery depending on your budget and Recovery Time Objective (RTO).

Low budget/Long RTO

  • Whether you are in the cloud or on premise, you can use the AWS Command Line Interface (CLI) tools to script uploads of your database backups directly to S3 (see the sketch after this list). This can be added as a step to an existing backup job or an entirely new job.
  • Another option is to use a third-party tool to mount an S3 bucket as a drive. It’s possible to back up directly to the S3 bucket, but if you run into write issues you may need to write the backup locally and then move it to the mounted drive.
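The list above mentions the CLI; the same step can be scripted with boto3, the AWS SDK for Python, as in this minimal sketch (the bucket name and dump path are placeholders):

    import datetime
    import boto3

    s3 = boto3.client("s3")

    # Ship the nightly SQL dump to S3 as the last step of the backup job,
    # keyed by date and encrypted at rest.
    today = datetime.date.today().isoformat()
    s3.upload_file(
        Filename="/var/backups/ordersdb.dump",  # placeholder dump path
        Bucket="example-db-backups",            # placeholder bucket
        Key=f"ordersdb/{today}/ordersdb.dump",
        ExtraArgs={"ServerSideEncryption": "AES256"},
    )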

These methods have a longer RTO, as they will require you to stand up a new DB server and then restore the backups, but they are a low-cost way to ensure you can recover your business.

The catch here is that you can only restore to the last backup that you have taken and copied to S3. You may want to review your backup plans to ensure you are comfortable with what you may lose. Just make sure you use the native S3 lifecycle policies to purge old backups; otherwise, your storage bill will slowly get out of hand. A lifecycle sketch follows.
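A lifecycle rule like the boto3 sketch below (bucket and prefix are placeholders matching the upload example above) expires backups after 30 days:

    import boto3

    s3 = boto3.client("s3")

    # Expire database backups after 30 days so old dumps don't pile up.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-db-backups",            # placeholder bucket
        LifecycleConfiguration={"Rules": [{
            "ID": "purge-old-db-backups",
            "Filter": {"Prefix": "ordersdb/"},  # placeholder prefix
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }]},
    )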

High budget/short RTO

  • Almost all mainstream Relational Database Management Systems (RDBMS) have a native method of replication. You can set up a database server on an EC2 instance to replicate your database to. This can happen in real time, so you can be positive that you will not lose a single transaction.
  • What about RDS? While you cannot use native RDBMS replication, there are third-party replication tools that will do Change Data Capture (CDC) replication directly to RDS. These can be easier to set up than the native replication methods, but you will want to monitor these tools to ensure you do not get into a situation where you can lose transactional data.

Since this is DR, you can lower the cost of these solutions by downsizing the RDS or EC2 instance. This will increase the RTO, as you will need to manually resize the instances in the event of a failure, but it can be a significant cost saver. Both of these solutions require connectivity to the instance over VPN or Direct Connect.
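The resize-on-failover step is a one-liner against the RDS API. A minimal boto3 sketch (the instance identifier and target class are placeholders):

    import boto3

    rds = boto3.client("rds")

    # Failover step: scale the undersized DR replica up to a
    # production-class instance.
    rds.modify_db_instance(
        DBInstanceIdentifier="ordersdb-dr",  # placeholder identifier
        DBInstanceClass="db.m3.xlarge",      # placeholder target size
        ApplyImmediately=True,
    )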

Another benefit of this solution is that it can easily be utilized for QA, testing, and development needs. You can easily snapshot the RDS or EC2 instance and stand up a new one to work against. When you are done – just terminate it.
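For the RDS case, that snapshot-and-stand-up cycle might look like the following boto3 sketch (all identifiers are placeholders):

    import boto3

    rds = boto3.client("rds")

    # Snapshot the DR replica and stand up a disposable copy for QA.
    rds.create_db_snapshot(
        DBInstanceIdentifier="ordersdb-dr",           # placeholder
        DBSnapshotIdentifier="ordersdb-qa-snapshot",  # placeholder
    )
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier="ordersdb-qa-snapshot"
    )
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="ordersdb-qa",
        DBSnapshotIdentifier="ordersdb-qa-snapshot",
    )
    # When QA is done, just terminate it:
    # rds.delete_db_instance(DBInstanceIdentifier="ordersdb-qa",
    #                        SkipFinalSnapshot=True)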

With all database DR solutions, make sure you script out the permissions and server configurations. These either need to be saved off with the backups or applied to the RDS/EC2 instances. They change constantly and can create recovery issues if you do not account for them.

With an AWS database recovery plan you can avoid losing critical business data.

-Mike Izumi, Cloud Architect

Storage Gateway with Amazon Web Services

Backup and disaster recovery often require solutions that add complexity and additional cost to properly synchronize your data and systems.  Amazon Web Services™ (AWS) helps drive down this cost and complexity with a number of services.  Amazon S3 provides a highly durable (99.999999999%) storage platform for your backups.  This service stores your data redundantly across multiple Availability Zones (AZs) to provide the ultimate peace of mind for your data.  AWS also provides an ultra-low-cost service for long-term cold storage that is aptly named Glacier.  At $0.01 per GB/month, this service will force you to ask, “Why am I not using AWS today?”

AWS has developed the AWS Storage Gateway to make your backups secure and efficient.  For only $125 per backup location per month, you will have a robust solution that provides the following features:

  • Secure transfers of all data to AWS S3 storage
  • Compatible with your current architecture – there is no need to call up your local storage vendor for a special adapter or firmware version to use Storage Gateway
  • Designed for AWS – this provides a seamless integration of your current environment to AWS services

AWS Storage Gateway and Amazon EC2 (snapshots of machine images) together provide a simple cloud-hosted DR solution.  Amazon EC2 allows you to quickly launch images of your production environment in AWS when you need them.  The AWS Storage Gateway works seamlessly with S3 to provide a robust backup and disaster recovery solution that meets almost any budget.
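As a rough illustration of the snapshot side of this pairing, the boto3 sketch below snapshots an EBS data volume (the volume ID is a placeholder) so a recovery instance can later be launched from it:

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot the production data volume for use by a DR instance.
    snapshot = ec2.create_snapshot(
        VolumeId="vol-12345678",  # placeholder volume ID
        Description="DR snapshot of production data volume",
    )
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=[snapshot["SnapshotId"]]
    )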

-Matt Whitney, Sales Executive

DR is Dead

Having been in the IT industry since the 90s, I’ve seen many iterations on disaster recovery principles and methodologies.  The concept of DR, of course, far predates my tenure in the field, as the idea began taking shape in the 1970s when businesses started to realize their dependence on information systems and the criticality of those services.

Over the past decade or so, we’ve seen the concept of running a DR site at a colo facility (either leased or owned) become a popular way for organizations to have a rapidly available disaster recovery option.  The problem with a colo facility is that it is EXPENSIVE!  In addition to potentially huge CapEx (if you are buying your own infrastructure), you have the facility and infrastructure OpEx and all the overhead expense of managing those systems and everything that comes along with that.  In steps the cloud… AWS and the other players in the public cloud arena give you the ability to run a DR site with essentially no CapEx.  Now you are paying only for the virtual infrastructure you actually use, as an operational cost.

An intelligently designed DR solution can leverage something like Amazon’s Pilot Light to keep costs down by running only the absolute minimal core infrastructure needed to keep the DR site ready to scale up to production.  That is a big improvement over purchasing millions of dollars of hardware and incurring thousands and thousands of dollars in OpEx and overhead costs every month.  Even still… there is a better way.  If you architect your infrastructure and applications following the AWS best practices, then in a perfect world there is really no reason to have DR at all.  By architecting your systems to balance across multiple AWS regions and availability zones, correctly designing architecture and applications to handle unpredictable and cascading failure, and automatically and elastically scaling to meet increases and decreases in demand, you can effectively eliminate the need for DR.  Your data and infrastructure are distributed in a way that is highly available and resilient to failure or spikes/drops in demand.  So in addition to inherent DR, you are getting HA and true capacity-on-demand.  The whole concept of a disaster taking down a data center, and the subsequent effects on your systems, applications, and users, becomes irrelevant.  It may take a bit of work to design (or redesign) an application to this new cloud geo-distributed model, but I assure you that from a business continuity perspective, reduced TCO, scalability, and uptime, it will pay off in spades.
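The “balance across multiple availability zones and scale elastically” piece of this argument is concrete in practice. Here is a minimal boto3 sketch (the group name, launch configuration, and subnet IDs are placeholders) of a web tier that spans three AZs and replaces failed instances automatically:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Spread the web tier across three Availability Zones; the group
    # automatically replaces instances that fail ELB health checks.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-tier",        # placeholder name
        LaunchConfigurationName="web-tier-lc",  # placeholder launch config
        MinSize=3,
        MaxSize=12,
        DesiredCapacity=3,
        VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # placeholders
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300,
    )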

That ought to put the proverbial nail in the coffin. RIP.

-Ryan Kennedy, Senior Cloud Engineer