A Refresher Course on Disaster Recovery with AWS

IT infrastructure is the hardware, network, services and software required for enterprise IT. It is the foundation that enables organizations to deliver IT services to their users. Disaster recovery (DR) is preparing for and recovering from natural and people-related disasters that impact IT infrastructure for critical business functions. Natural disasters include earthquakes, fires, etc. People-related disasters include human error, terrorism, etc. Business continuity differs from DR as it involves keeping all aspects of the organization functioning, not just IT infrastructure.

When planning for DR, companies must establish a recovery time objective (RTO) and recovery point objective (RPO) for each critical IT service. RTO is the acceptable amount of time in which an IT service must be restored. RPO is the acceptable amount of data loss measured in time. For example, an RPO of four hours means backups or replication must capture data at least every four hours, while an RTO of four hours means the service must be running again within four hours of an outage. Companies establish both RTOs and RPOs to mitigate financial and other types of loss to the business. Companies then design and implement DR plans to effectively and efficiently recover the IT infrastructure necessary to run critical business functions.

For companies with corporate datacenters, the traditional approach to DR involves duplicating IT infrastructure at a secondary location to ensure available capacity in a disaster. The key downside is IT infrastructure must be bought, installed and maintained in advance to address anticipated capacity requirements. This often causes IT infrastructure in the secondary location to be over-procured and under-utilized. In contrast, Amazon Web Services (AWS) provides companies with access to enterprise-grade IT infrastructure that can be scaled up or down for DR as needed.

The four most common DR architectures on AWS are:

  • Backup and Restore ($) – Companies can use their current backup software to replicate data into AWS. Companies use Amazon S3 for short-term archiving and Amazon Glacier for long-term archiving. In the event of a disaster, data can be made available on AWS infrastructure or restored from the cloud back onto an on-premise server.
  • Pilot Light ($$) – While Backup and Restore focuses on data, Pilot Light includes applications. Companies provision only the core infrastructure needed for critical applications. When disaster strikes, Amazon Machine Images (AMIs) and other automation services are used to quickly provision the remaining environment for production.
  • Warm Standby ($$$) – Taking the Pilot Light model one step further, warm standby creates an active/passive cluster. The minimum amount of capacity is provisioned in AWS. When needed, the environment rapidly scales up to meet full production demands, giving companies near-100% uptime with minimal downtime.
  • Hot Standby ($$$$) – Hot standby is an active/active cluster with both cloud and on-premise components. Using weighted DNS load-balancing, IT determines how much application traffic to process in-house and how much on AWS. If a disaster or spike in load occurs, more or all of it can be routed to AWS with auto-scaling (see the sketch after this list).
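
As a rough illustration of the weighted-DNS piece of hot standby (not from the original post), here is a minimal sketch using boto3, the AWS SDK for Python. The hosted zone ID, record name, and endpoint values are placeholders.

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z1EXAMPLE"        # placeholder hosted zone
RECORD_NAME = "www.example.com."    # placeholder record

def set_weight(set_identifier, target, weight):
    # Upsert one weighted CNAME record; the relative weights decide the traffic split.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR traffic shift",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

# Normal operations: most traffic stays on-premise, a slice runs on AWS.
set_weight("onprem", "www.onprem.example.com", 80)
set_weight("aws", "elb-example.us-east-1.elb.amazonaws.com", 20)

# Disaster (or load spike): route everything to the auto-scaled AWS tier.
set_weight("onprem", "www.onprem.example.com", 0)
set_weight("aws", "elb-example.us-east-1.elb.amazonaws.com", 100)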

In a non-disaster environment, warm standby DR is not scaled for full production, but it is fully functional. To help absorb/justify cost, companies can use the DR site for non-production work, such as quality assurance, testing, etc. For hot standby DR, cost is determined by how much production traffic is handled by AWS in normal operation. In the recovery phase, companies pay only for the additional capacity they use and only for the duration the DR site runs at full scale. In hot standby, companies can further reduce the costs of their “always on” AWS servers with Reserved Instances (RIs).

Smart companies know disaster is not a matter of if, but when. According to a study done by the University of Oregon, every dollar spent on hazard mitigation, including DR, saves companies four dollars in recovery and response costs. In addition to cost savings, smart companies also view DR as critical to their survival. For example, 51% of companies that experienced a major data loss closed within two years (Source: Gartner), and 44% of companies that experienced a major fire never re-opened (Source: EBM). Again, disaster is not a matter of if, but when. Be ready.

-Josh Lowry, General Manager – West

Backups – Don’t Do It

In migrating customers to AWS, one of the questions we are consistently asked is, “How do we extend our backup services into the Cloud?”  My answer?  You don’t.  This is often met with incredulous stares where the customer is wondering if I’m joking, crazy, or I just don’t understand IT.  After all, backups are fundamental to data centers as well as IT systems in general, so why on Earth would I tell someone not to back up their systems?

The short answer to backups is just not to do it, honestly.  The more in depth answer is, of course, more complicated than that.  To be clear, I am talking about system backups; those backups typically used for bare metal restores.  Backups of databases, of file services – these we’ll tackle separately.  For the bulk of systems, however, we’ll leave backups as a relic of on premise data centers.

How?  Why?  Consider a typical three tiered architecture: web servers, application servers, and database servers.  In AWS, ideally your application and web servers are stateless, auto scaled systems.  With that in mind, why would you ever want to spend time, money, or resources on backing up and restoring one of these systems?  The design should be set so if and when a system fails, the health check/monitoring automatically terminates the instance, which in turn automatically creates an auto scale event to launch a new instance in its place.  No painfully long hours working through a restore process.
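
To make the self-healing web/app tier concrete, here is a minimal sketch (again with boto3, not from the original post) of an Auto Scaling group that replaces instances when they fail the load balancer health check. The names, launch template, subnets, and target group ARN are placeholders.

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",
    LaunchTemplate={"LaunchTemplateName": "web-tier-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # placeholder subnets
    TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"],
    HealthCheckType="ELB",           # fail the load balancer health check -> instance is terminated and replaced
    HealthCheckGracePeriod=300,
)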

Similarly, your database systems can work without large scale backup systems.  Yes, by all means run database backups!  Database backups are not for server instance failures but for database application corruption or updates/upgrade rollbacks. Unfortunately, the Cloud doesn’t magically make your databases any more immune to human error.  For the database servers (assuming non-RDS), however, maintaining a snapshot of the server instance is likely good enough for backups.  If and when the database server fails, the instance can be terminated and the standby system can become the live system to maintain system integrity.  Launch a new database server based on the snapshot, restore the database and/or configure replication from the live system, depending on database technology, and you’re live.
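
For the snapshot side of this, a minimal sketch (boto3, placeholder volume ID and tags) of snapshotting a database server’s EBS data volume might look like the following; RDS handles this for you with its own automated backups.

import boto3

ec2 = boto3.client("ec2")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",                   # placeholder data volume
    Description="nightly snapshot of db01 data volume",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "role", "Value": "db01-backup"}],
    }],
)
print("started", snapshot["SnapshotId"])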

So yes, in a properly configured AWS environment, the backup and restore you love to loathe from your on premise environment is a thing of the past.

-Keith Homewood, Cloud Architect

Increasing Your Cloud Footprint

The jump to the cloud can be a scary proposition.  For an enterprise with systems deeply embedded in traditional infrastructure like back office computer rooms and datacenters, the move to the cloud can be daunting. The thought of having all of your data in someone else’s hands can make some IT admins cringe.  However, once you start looking into cloud technologies you start seeing some of the great benefits, especially with providers like Amazon Web Services (AWS).  The cloud can be cost-effective, elastic and scalable, flexible, and secure.  That same IT admin cringing at the thought of their data in someone else’s hands may finally realize that AWS is a bit more secure than a computer rack sitting under an employee’s desk in a remote office.  Once the decision is finally made to “try out” the cloud, the planning phase can begin.

Most of the time the biggest question is, “How do we start with the cloud?”  The answer is to use a phased approach.  By picking applications and workloads that are less mission critical, you can try the newest cloud technologies with less risk.  When deciding which workloads to move, you should ask yourself the following questions: Is there a business need for moving this workload to the cloud?  Is the technology a natural fit for the cloud?  What impact will this have on the business?  If all those questions are suitably answered, your workloads will be successful in the cloud.

One great place to start is with archiving and backups.  These types of workloads are important, but the data you’re dealing with is likely just a copy of data you already have, so it is considerably less risky.  The easiest way to start with archives and backups is to try out S3 and Glacier.  Many of the backup utilities you may already be using, like Symantec NetBackup and Veeam Backup & Replication, have cloud versions that can back up directly to AWS. This allows you to start using the cloud without changing much of your embedded backup processes.  By moving less critical workloads you are taking the first steps in increasing your cloud footprint.
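
A minimal sketch (boto3, placeholder bucket name, prefix, and day counts) of the S3-to-Glacier tiering mentioned above: a lifecycle rule keeps recent backups in S3 and moves older ones to Glacier automatically.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-backup-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-backups",
            "Filter": {"Prefix": "backups/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],   # short-term in S3, long-term in Glacier
            "Expiration": {"Days": 365},                                # expire archives per retention policy
        }]
    },
)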

Now that you have moved your backups to AWS using S3 and Glacier, what’s next?  The next logical step would be to try some of the other services AWS offers.  Another workload that can often be moved to the cloud is Disaster Recovery.  DR is an area that will allow you to use more AWS services, like VPC, EC2, EBS, RDS, Route53 and ELBs.  DR is a perfect way to increase your cloud footprint because it will allow you to re-create your current environment, which you should already be very familiar with, in the cloud.  A Pilot Light DR solution is one type of DR solution commonly seen in AWS.  In the Pilot Light scenario the DR site has minimal systems and resources with the core elements already configured to enable rapid recovery once a disaster happens.  To build a Pilot Light DR solution you would create the AWS network infrastructure (VPC), deploy the core AWS building blocks needed for the minimal Pilot Light configuration (EC2, EBS, RDS, and ELBs), and determine the process for recovery (Route53).  When it is time for recovery, all the other components can be quickly provisioned to give you a fully working environment. By moving DR to the cloud you’ve increased your cloud footprint even more and are on your way to cloud domination!
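
As a rough sketch of the Pilot Light network skeleton (boto3, with placeholder CIDR ranges and Availability Zone), the VPC piece might start like this; the EC2, RDS, and ELB building blocks would follow the same pattern.

import boto3

ec2 = boto3.client("ec2")

vpc = ec2.create_vpc(CidrBlock="10.10.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock="10.10.1.0/24",
    AvailabilityZone="us-west-2a",
)

igw = ec2.create_internet_gateway()
ec2.attach_internet_gateway(
    InternetGatewayId=igw["InternetGateway"]["InternetGatewayId"],
    VpcId=vpc_id,
)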

The next logical step is to move Test and Dev environments into the cloud. Here you can get creative with the way you use the AWS technologies.  When building systems on AWS make sure to follow the Architecting Best Practices: designing for failure means nothing will fail, decouple your components, take advantage of elasticity, build security into every layer, think parallel, and don’t fear constraints! Start with a proof-of-concept (POC) in the development environment, and use AWS reference architectures to aid in the learning and planning process.  Next, test your legacy application in the new environment and migrate data.  The POC is not complete until you validate that it works and performance meets your expectations.  Once you get to this point, you can reevaluate the build and optimize it to the exact specifications needed. Finally, you’re one step closer to deploying actual production workloads to the cloud!

Production workloads are obviously the most important, but with the phased approach you’ve taken to increase your cloud footprint, it’s not that far of a jump from the other workloads you now have running in AWS.   Some of the important things to remember to be successful with AWS include being aware of the rapid pace of the technology (including improved services and price drops), remembering that security is your responsibility as well as Amazon’s, and accepting that there isn’t a one-size-fits-all solution.  Lastly, all workloads you implement in the cloud should still have stringent security and comprehensive monitoring, as you would on any of your on-premises systems.

Overall, a phased approach is a great way to start using AWS.  Start with simple services and traditional workloads that have a natural fit for AWS (e.g. backups and archiving).  Next, start to explore other AWS services by building out environments that are familiar to you (e.g. DR). Finally, experiment with POCs and the entire gamut of AWS services to benefit from more efficient production operations.  Like many new technologies, it takes time for adoption. By increasing your cloud footprint over time you can set expectations for cloud technologies in your enterprise and make it a more comfortable proposition for all.

-Derek Baltazar, Senior Cloud Engineer

How secure is secure?

There have been numerous articles, blogs, and whitepapers about the security of the Cloud as a business solution.  Amazon Web Services has a site devoted to extolling their security virtues and there are several sites that devote themselves entirely to the ins and outs of AWS security.  So rather than try to tell you about each and every security feature of AWS and try to convince you how secure the environment can be, my goal is to share a real world example of security that can be improved by moving from on premise datacenters to AWS.

Many AWS implementations are used for hosting web applications, most of which are Internet accessible.  Obviously, if your environment is for internal use only you can lock down security even further, but for the purposes of this exercise, we’re assuming Internet-facing web applications.  The inherent risk, of course, with any Internet-accessible application is that the Internet gives access to your environment not only to hackers and malicious users, but also to honest users whose machines are infected with malware, viruses, or Trojans.

As with on premise and colocation based web farms, AWS offers the standard security practices of isolating customers from one another so that if one customer experiences a security breach, all other customers remain secure.  And of course, AWS Security Groups function like traditional firewalls, allowing traffic only through permitted ports to/from specific destinations/sources.  AWS moves ahead of traditional datacenters starting with Security Groups and Network ACLs by offering more flexibility to respond to attacks.  Consider the case of a web farm with components suspected of being compromised; AWS Security Groups can be created in seconds to isolate the suspected components from the rest of the network.  In a traditional datacenter environment, isolating those components may require complex network changes to move them onto isolated networks in order to prevent the infection from spreading over the network, something AWS blocks by default.
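
A minimal sketch of that quarantine move (boto3, placeholder VPC and instance IDs, not from the original post): create an empty Security Group and swap the suspect instance onto it.

import boto3

ec2 = boto3.client("ec2")

# A group with no inbound rules; instances attached to it accept no traffic.
quarantine = ec2.create_security_group(
    GroupName="quarantine",
    Description="No access - for suspected compromised hosts",
    VpcId="vpc-0123456789abcdef0",
)

# Remove the default allow-all outbound rule so the host cannot talk out either.
ec2.revoke_security_group_egress(
    GroupId=quarantine["GroupId"],
    IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)

# Swap the suspect instance onto the quarantine group only.
ec2.modify_instance_attribute(
    InstanceId="i-0123456789abcdef0",
    Groups=[quarantine["GroupId"]],
)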

AWS often talks about scalability – the ability to grow and shrink the environment to meet demand.  That capability extends to security features as well!  Need another firewall?  Just add another Security Group; there is no need to install another device.  Adding another subnet, VPN, or firewall can be done in minutes with no action required from on premise staff.  No more waiting while network cables are moved, hardware is installed, or devices are physically reconfigured when you need security updates.

Finally, no matter how secure an environment, no security plan is complete without a remediation plan.  AWS has tools that provide remediation with little to no downtime.  Part of standard practice for AWS environments is to take regular snapshots of EC2 instances (servers).  These snapshots can be used to re-create a compromised or non-functional component in minutes rather than going through the lengthy restore process for a traditional server.  Additionally, 2nd Watch recommends taking an initial image of each component so that in the event of a failure, there is a fallback point to a known good configuration.
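
Capturing that "known good" baseline image is a one-liner with boto3; a minimal sketch with placeholder instance ID and image name follows.

import boto3

ec2 = boto3.client("ec2")

image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="web01-known-good-baseline",
    Description="Baseline image taken after initial configuration",
    NoReboot=True,   # avoid downtime; accept a crash-consistent image
)
print("fallback AMI:", image["ImageId"])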

So how secure is secure?  With the ability to respond faster, scale as necessary, and recover in minutes – the Amazon Cloud is pretty darn secure!  And of course, this is only the tip of the iceberg for AWS Cloud Security, more to follow the rest of December here on our blog and please check out the official link above for Amazon’s Security Center and Whitepapers.

-Keith Homewood, Cloud Architect

The next frontier: recovering in the cloud

The pervasive technology industry has created the cloud and all the acronyms that go with it.   Growth is fun, and the cloud is the talk of the town. From the California Sun to the Kentucky coal mines we are going to the cloud, although Janis Joplin may have been there before her time.  Focus and clarity will come later.

There is so much data being stored today that the biggest challenge is going to be how to quantify it, store it, access it and recover it. Cloud-based disaster recovery has broad-based appeal across industries and segment sizes.  Using a service from the AWS cloud enables more efficient disaster recovery of mission critical applications without any upfront cost or commitment.   AWS allows customers to provision virtual private clouds using its infrastructure, which offers complete network isolation and security. The cloud can be used to configure a “pilot-light” architecture, which dramatically reduces cost over traditional data centers, where the concept of “pilot” or “warm” is not an option – you pay for continual use of your infrastructure whether it’s used or not. With AWS, you only pay for what you use, and you have complete control of your data and its security.

Backing data up is relatively simple: select an object to be backed up and click a button. More often than not, the encrypted data reaches its destination, whether in a local storage device or an S3 bucket in an AWS region in Ireland.  Restoring the data has always been the perpetual challenge. What the cloud does is make testing of the backup capabilities more flexible and more cost effective.  As the cost of cloud-based testing falls rapidly, from thousands of dollars or dinars to hundreds, it results in more testing, and therefore more success after a failure, whether it’s from a superstore or superstorm, or even a supermodel one.

-Nick Desai, Solutions Architect

Disaster Recovery on Amazon Web Services

After a seven-year career at Cisco, I am thrilled to be working with 2nd Watch to help make cloud work for our customers.  AWS disaster recovery is a great use case for companies looking for a quick cloud win.  Traditional on-premise backup and disaster recovery technologies are complex and can require major capital investments.  I have seen many emails from IT management to senior management explaining the risk to their business if they do not spend money on backup/DR.  The costs associated with on-premise solutions for DR are often put off to next year’s budget and seem to always get cut midway through the year.  When the business does invest in backup/DR, countless corners are cut in order to squeeze as much reliability and performance as possible out of a limited budget.

The public cloud is a great resource for addressing the requirements of backup and disaster recovery.  Organizations can avoid the sunk costs of data center infrastructure (power, cooling, flooring, etc.) while having all of the performance and resources available for their growing needs.  Atlanta based What’s Up Interactive saved over $1 million with their AWS Disaster Recovery solution (case study here).

I will highlight a few of the top benefits any company can expect when leveraging AWS for their disaster recovery project.

1. Eliminate costly tape backups that require transporting, storing, and retrieving the tape media.  This is replaced by fast disk-based storage that provides the performance needed to run your mission-critical applications.

2.  AWS provides an elastic solution that can scale to the growth of data or application requirements without the costly capital expenses of traditional technology vendors.   Companies can also expire and delete archived data according to organizational policy.  This allows companies to pay for only what they use.

3.  AWS provides a secure, durable platform that helps companies meet compliance requirements by making data easy to access when deadlines demand it.

Every day we help new customers leverage the cloud to support their business.  Our team of Solutions Architects and Cloud Engineers can assist you in creating a plan to reduce risk in your current backup/DR solution.  Let us know how we can help prepare you with AWS Disaster Recovery.

-Matt Whitney, Cloud Sales Executive

HA vs HR

The other day I was working with my neighbor’s kid on his baseball fundamentals and I found myself repeating the phrase “Remember the Basics.”  Over the course of the afternoon we played catch, worked on swinging the bat, and fielded grounders until our hands hurt. As the afternoon slipped into the evening hours, I started to see that baseball and my business have several similarities.

My business is Cloud Computing, and my company, 2nd Watch, is working to pioneer Cloud Adoption with Enterprise businesses.  As we discover new ways of integrating systems, performing workloads, and running applications, it is important for us not to forget the fundamentals.

One of the most basic elements of this is using the proper terminology. I’ve found that in many cases my customers, partners, and even my colleagues can have different meanings for many of the same terms. One example that comes up frequently is the difference between having a Highly Available infrastructure vs. Highly Reliable infrastructure. I want to bring special attention to these two terms and help to clearly define their meaning.

HA (High Availability)

High Availability (HA) is about designing and implementing systems that proactively provide the operational capacity to meet their required performance. For example, within Cloud Computing we leverage tools like Elastic Load Balancing and Auto Scaling to automate the scaling of infrastructure to handle the variable demand for web sites. As traffic increases, servers are spun up to handle the load, and vice versa as it decreases.  If a user cannot access a website or it is deemed “unavailable,” then you risk the potential loss of readership, sales, or the attrition of customers.
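
As a rough sketch of that Auto Scaling behavior (boto3, placeholder group name and target value), a target-tracking policy adds instances as load rises and removes them as it falls.

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier",
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Scale out/in to hold average CPU across the group near 50%.
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)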

HR (Highly Reliable)

On the other hand, Highly Reliable (HR) systems have to do with your Disaster Recovery (DR) model and how well you prepare for catastrophic events. In Cloud Computing, we design for failure because anything can happen at any time. Having a proper Disaster Recovery plan in place will enable your business to keep running if problems arise.

Any company with sensitive IT operations should look into a proper DR strategy, which will support their company’s ability to be resilient in the event of failure. While a well-planned DR schema may cost you more money upfront, being able to support both your internal and external customers will pay off in spades if an event takes place that requires you to fail over.

HA vs HR

In today’s business market it is important to take the assumptions out of our day-to-day conversations and make sure that we’re all on the same page. The difference between Highly Available and Highly Reliable systems is a great example of this. By simply going back to the fundamentals, we can easily make sure that our expectations are met and our colleagues, partners, and clients understand both our spoken & written words.

-Blake Diers, Cloud Sales Executive

Amazon S3: Backup Your Laptop for Less

For quite some time I’ve been meaning to tinker around with using Amazon S3 as a backup tool.  Sure, I’ve been using S3-backed Dropbox for years now and love it, and there are a multitude of other desktop client apps out there that do the same sort of thing with varying price points and feature sets (including Amazon’s own cloud drive).  The primary reason I wanted to look into something specific to S3 is that it is economical, highly available, and secure, and it also scales well in a more enterprise setting.  It is just a logical and compelling choice if you are already running IaaS in AWS.

If you’re unfamiliar with rsync, it is a UNIX tool for copying files or sets of files with many cool features. Probably the most distinctive feature is that it does differential copying, which means that it will only copy files that have changed on the source.  This means if you have a file set containing thousands of files that you want to sync between the source and the destination it will only have to copy the files that have changed since the last copy/sync.

Being an engineer, my initial thought was, “Hey, why not just write a little python program using the boto AWS API libs and librsync to do it?”, but I am also kind of lazy, and I know I’m not that forward-thinking, so I figured someone had probably already done this.  I consulted the Google machine and sure enough… 20 seconds later I had discovered Duplicity (http://duplicity.nongnu.org/).   Duplicity is an open source GPL python-based application that does exactly what I was aiming for – it allows you to rsync files to an S3 bucket.  In fact, it even has some additional functionality like encryption and password protection for the data.

A little background info on AWS storage/backups

Tying in to my earlier point about wanting to use S3 for EC2 Linux instances, traditional Linux AWS EC2 instance backups are achieved using EBS snapshots. This can work fairly well but has a number of limitations and potential pitfalls/shortcomings.

Here is a list of advantages and disadvantages of using EBS snapshots for Linux EC2 instance backup purposes. In no way are these lists fully comprehensive:

Advantages:

  • Fast
  • Easy/Simple
  • Easily scriptable using API tools
  • Pre-baked functionality built into the AWS APIs and Management Console

Disadvantages:

  • Non-selective (requires backing up an entire EBS volume)
  • More expensive
    • EBS is more expensive than S3
    • Backing up an entire EBS volume can be overkill for what you actually need backed up and result in a lot of extra cost for backing up non-essential data
  • Pitfalls with multiple EBS volume software RAID or LVM sets
    • Multiple EBS volume sets are difficult to snapshot synchronously
    • Using the snapshots for recovery requires significant work to reconstruct volume sets
  • No ability to capture only files that have changed since the previous backup (i.e. rsync-style backups)
  • Only works on EBS-backed instances

Compare that to a list of advantages/disadvantages of using the S3/Duplicity solution:

Advantages:

  • Inexpensive (S3 is cheap)
  • Data security (redundancy and geographically distributed)
  • Works on any Linux system that has connectivity to S3
  • Should work on any UNIX-style OS (including Mac OS X) as well
  • Only copies the deltas in the files and not the entire file or file-set
  • Supports “Full” and “Incremental” backups
  • Data is compressed with gzip
  • Lightweight
  • FOSS (Free and Open Source Software)
  • Works independently of underlying storage type (SAN, Linux MD, LVM, NFS, etc.) or server type (EC2, Physical hardware, VMWare, etc.)
  • Relatively easy to set up and configure
  • Uses syntax that is congruent with rsync (e.g. --include, --exclude)
  • Can be restored anywhere, anytime, and on any system with S3 access and Duplicity installed

Disadvantages:

  • Slower than a snapshot, which is virtually instantaneous
  • Not ideal for backing up data sets with large deltas between backups
  • No out-of-the-box type of AWS API or Management Console integration (though this is not really necessary)
  • No “commercial” support

On to the important stuff!  How to actually get this thing up and running

Things you’ll need:

  • The Duplicity application (should be installable via either yum, apt, or other pkg manager). Duplicity itself has numerous dependencies but the package management utility should handle all of that.
  • An Amazon AWS account
  • Your Amazon S3 Access Key ID
  • Your Amazon S3 Secret Access Key
  • A list of files/directories you want to back up
  • A globally unique name for an Amazon S3 bucket (the bucket will be created if it doesn’t yet exist)
  • If you want to encrypt the data:
    • A GPG key
    • The corresponding GPG key passphrase
  1. Obtain/install the application (and its pre-requisites):

If you’re running a standard Linux distro you can most likely install it from a ‘yum’ or ‘apt’ repository (depending on distribution).  Try something like “sudo yum install duplicity” or “sudo apt-get install duplicity”.  If all else fails (perhaps you are running some esoteric Linux distro like Gentoo?) you can always do it the old-fashioned way and download the tarball from the website and compile it (that is outside of the scope of this blog).  “Use the source Luke.”  If you are a Mac user you can also compile it and run it on Mac OSX (http://blog.oak-tree.us/index.php/2009/10/07/duplicity-mac), which I have not tested/verified actually works.

  • NOTE: On Fedora Core 18, Duplicity was already installed and worked right out of the box.  On a Debian Wheezy box I had to apt-get install duplicity and python-boto.  YMMV
  2. Generate a GPG key if you don’t already have one:
  • If you need to create a GPG key use ‘gpg --gen-key’ to create a key with a passphrase. The default values supplied by ‘gpg’ are fine.
  • NOTE: record the GPG Key value that it generates because you will need that!
  • NOTE: keep a backup copy of your GPG key somewhere safe.  Without it you won’t be able to decrypt your backups, and that could make restoration a bit difficult.
  3. Run Duplicity, backing up whatever files/directories you want saved on the cloud.  I’d recommend reading the man page for a full rundown on all the options and syntax.

I used something like this:

$ export AWS_ACCESS_KEY_ID='AKBLAHBLAHBLAHMYACCESSKEY'

$ export AWS_SECRET_ACCESS_KEY='99BIGLONGSECRETKEYGOESHEREBLAHBLAH99'

$ export PASSPHRASE='mygpgpassphrase'

$ duplicity incremental --full-if-older-than 1W --s3-use-new-style --encrypt-key=MY_GPG_KEY --sign-key=MY_GPG_KEY --volsize=10 --include=/home/rkennedy/bin --include=/home/rkennedy/code --include=/home/rkennedy/Documents --exclude=** /home/rkennedy s3+http://myS3backup-bucket

 

  4. Since we are talking about backups and rsync this is probably something that you will want to run more than once.  Writing a bash script or something along those lines and kicking it off automatically with cron seems like something a person may want to do.  Here is a pretty nice example of how you could script this – http://www.cenolan.com/2008/12/how-to-incremental-daily-backups-amazon-s3-duplicity/

 

  5. Recovery is also pretty straightforward:

$ duplicity --encrypt-key=MY_GPG_KEY --sign-key=MY_GPG_KEY --file-to-restore Documents/secret_to_life.docx --time 05-25-2013 s3+http://myS3backup-bucket /home/rkennedy/restore

Overwhelmed or confused by all of this command line stuff?  If so, Deja-dup might be helpful.  It is a GNOME-based GUI application that provides the same functionality as Duplicity (it turns out the two projects actually share a lot of code and are worked on by some of the same developers).  Here is a handy guide on using Deja-dup for making Linux backups: (http://www.makeuseof.com/tag/dj-dup-perfect-linux-backup-software/)

This is pretty useful, and for $4 a month, or about the average price of a latte, you can store nearly 50GB of compressed, de-duped backups in S3 (standard tier). For just a nickel you can get at least 526MB of S3 backup for a month.  Well, that and the 5GB of S3 you get for free.

-Ryan Kennedy, Senior Cloud Engineer

Get Yourself the Right Cloud Provider SLA

Making sure your cloud provider is giving you what you need starts with the Service Level Agreement (SLA). This wasn’t so difficult in years past because off-site computing was usually about non-critical functions like archive storage, web serving, and sometimes “dark” disaster recovery infrastructure. Keeping mission-critical infrastructure and applications in-house made for an easier time measuring service levels because you had full visibility across your entire infrastructure and knew your staff capabilities, your budget, your compliance needs, and your limitations. But the cloud is allowing companies to save significant money by moving more and more IT functionality into the cloud. When that happens, SLAs not only change, they become very important, often critically so.

A cloud SLA can’t be accepted as “boiler plate” from the customer’s perspective. You need to carefully analyze what your provider is offering, and you need to ensure that it’s specific and measurable from your side. Not every provider can give you a customized SLA, especially the largest providers; but most, including AWS, can give you augmented SLAs via partners that can be more easily bent to your needs. If you’ve analyzed your provider’s standard SLA and it’s not cutting the mustard, then working with a partner is really your best option.

The most common criterion in an SLA is downtime. Most providers will offer “five 9s” in this regard, or 99.999% uptime, though often this is for cloud services, not necessarily cloud infrastructure. That’s because cloud infrastructure downtime is subject to more factors than a service. In a service model, the provider knows they’re responsible for all aspects of delivery; so similar to an internal SLA, they have full visibility over their own infrastructure, software, datacenter locations and so on. But when customers move infrastructure into the cloud, there are two sides of possible downtime – yours and theirs. Virtual networks may crash because one of your network administrators goofed, not necessarily the provider. Those issues need to be resolved before help can be provided and systems restored. It’s very important that not only this situation be covered in your SLA, but also the steps that will be taken by both sides to resolve the issue. A weak SLA here gives the provider too much leeway to push back or delay. And on the flip side, your IT staff needs to have clear steps in place as well as time-to-resolve metrics so they aren’t the resolution blocker either.

Another important concern, and often an unnecessary blocker to the benefits of cloud computing, is making sure your applications are properly managed so they can comply with regulatory requirements specific to your business. You and your customers can feel safe putting data into the cloud, and compliance audits won’t give your architecture unnecessary audit problems. We’ve seen customers who thought this was a show stopper when considering cloud adoption, when really it just takes some planning and foresight.

Last, take a long look at your business processes. Aside from cost savings, what impact is the cloud having on the way you do business? What will be the impact if it fails? Are you dead in the water, or are there backup processes in place? Answering these questions will effectively provide you with two cloud SLAs: the provider’s and your own. The two need to be completely in sync to protect your business, make your cloud adoption a success, and keep the door open to new opportunities to leverage cloud computing.

As I mentioned earlier, a good way to do this is to work with one of the larger cloud providers’ value-add partners. With AWS, for example, you’re able to work with a certified partner like 2nd Watch to purchase tiered SLAs based on your needs that build off the SLAs offered by AWS. For example, 2nd Watch is the first partner to offer 99.99% uptime on top of AWS’ uptime SLA for all enterprise applications. We also offer both technical incident response and Managed Services, which takes managing your applications off your plate, works to analyze possible technical problems before they happen, and helps ensure your application adheres to compliance regulations. For any of these offerings, customers can opt for Select, Executive, or Premier SLAs. Each of these has its own market-leading uptime, problem response, and management service agreements, so you can tailor an SLA based on your needs and budget.

-Jeff Aden, President

Cloud Backup: More is Always Better

You’ve seen podcasts, blogs, whitepapers, and magazine articles all promoting backups: the concept, the benefits they provide, and the career-crushing disasters that can happen when they’re ignored. Bottom line: you do it, and you do it now. Hey, do more than one; the more the merrier. Run at the right time (meaning low user load) and kept in a well-maintained hierarchical storage process, backups can only do good. A solid backup strategy can even get you business insurance discounts in some cases.

But backing up in-house servers is a relatively simple and time-tested process. Most operating systems have one or more backup utilities built in, and then there are third-party options that take the concept to a whole new level. But backup in or from the cloud is a new concept with new variables – data security, multi-tenant segregation, virtualized services, and especially cloud and internet bandwidth. That’s got to be difficult at the enterprise computing level, right? Wrong! Not for AWS customers.

For a while, cloud storage was mainly a backup solution on its own. But Amazon has matured its backup and data redundancy offerings over the years to a point where you’ll find a range of sophisticated options at your disposal, depending on your needs. Obviously, we start with S3.

S3 has been and still is a basic backup solution for many companies – some rely on it exclusively, since data in S3 gets additional redundancy and backup features in AWS’ datacenters, so customers can be fairly confident their data will be there when they need it. S3’s storage architecture is pleasantly simple to understand. You’ll find it revolves primarily around two key nouns: buckets and objects. Buckets are just ways to organize your objects, and objects represent every individual file that’s been backed up. Every object has its own secure URL, which allows easy organization and management of data using a number of homegrown tools like basic access control or Amazon-prescribed bucket policies. But you’re also able to choose from new third-party solutions with even easier interfaces, like Symantec for enterprises or Dropbox for small businesses.
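
To make the bucket/object model concrete, a minimal sketch (boto3, with placeholder bucket name, region, and file paths, not from the original post): create a bucket and back a file up into it as an object.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="example-backup-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Each backed-up file becomes an object, addressed by bucket + key.
s3.upload_file(
    Filename="/var/backups/payroll-2013-12-01.tar.gz",
    Bucket="example-backup-bucket",
    Key="payroll/payroll-2013-12-01.tar.gz",
)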

Recently, Amazon has fleshed out its backup solution even more with the introduction of Glacier, which doesn’t relate to melting polar ice caps as much as it does to slow data retrieval times. Glacier’s mission is to provide a much lower-cost backup solution (as little as a penny per gigabyte in some situations). The tradeoff is that because of its low cost, it’s significantly slower than S3. But for long-term backup of important files, Glacier removes the need for repetitive backup operations, capacity planning, hardware provisioning and more. These are all time-consuming tasks that add to the hidden costs of secure in-house backup. Glacier takes all that off your plate for very little money. Bottom line: use S3 if you’re storing data you’ll access often or that requires high-speed access; use Glacier for less frequently accessed data for which slow retrieval times (sometimes several hours) are acceptable.

That’s only a short glimpse into the S3 portion of Amazon’s backup capabilities. If we went into detail about all of AWS’ data protection features, we could publish this as a book. We’ll hit each of them in detail in future posts, but here’s a quick list of Amazon’s data protection solutions across their service portfolio:

  • Multi-zone and multi-region redundancy
  • AWS relational database service (RDS) bundled, automated back up
  • RDS backups you can control manually, like snapshots
  • AWS Import/Export
  • EC2 and S3
  • S3 and Elastic Block Store (EBS)
  • Route 53
  • Elastic Load Balancing (ELB)
  • Third-party backup platforms you can use to construct your own AWS storage hierarchy

Data safety has been a hot-button issue for folks still on the fence about adopting the cloud. But AWS has done an excellent job innovating sophisticated new data redundancy solutions that can make backing up to the cloud safer than in your own datacenter. Check it out.

-Travis Greenstreet, Senior Cloud Systems Engineer