For quite some time I’ve been meaning to tinker around with using Amazon S3 for a backup tool. Sure, I’ve been using S3-backed Dropbox for years now and love it, and there are a multitude of other desktop client apps out there that do the same sort of thing with varying price points and feature sets (including Amazon’s own Cloud Drive). The primary reason I wanted to look into something specific to S3 is that it is economical, highly available, and secure, and it also scales well in a more enterprise setting. It is just a logical and compelling choice if you are already running IaaS in AWS.
If you’re unfamiliar with rsync, it is a UNIX tool for copying files or sets of files, with many cool features. Probably the most distinctive feature is that it does differential copying, meaning it will only copy files that have changed on the source. So if you have a file set containing thousands of files that you want to sync between the source and the destination, it only has to copy the files that have changed since the last copy/sync.
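If you have never used it, a bare-bones invocation looks something like this (the paths and host here are just placeholders):
$ rsync -avz --delete /home/rkennedy/Documents/ backuphost:/backups/Documents/
Run it once and everything gets copied; run it again and only the files that changed in the meantime get sent over the wire.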
Being an engineer my initial thought was, “Hey, why not just write a little Python program using the boto AWS API libs and librsync to do it?”, but I am also kind of lazy, and I know I’m not that forward-thinking, so I figured someone had probably already done this. I consulted the Google machine and sure enough… 20 seconds later I had discovered Duplicity (https://duplicity.nongnu.org/). Duplicity is an open source, GPL-licensed, Python-based application that does exactly what I was aiming for – it allows you to rsync files to an S3 bucket. In fact, it even has some additional functionality like encrypting and password-protecting the data.
A little background info on AWS storage/backups
Tying in to my earlier point about wanting to use S3 for EC2 Linux instances, traditional Linux AWS EC2 instance backups are achieved using EBS snapshots. This can work fairly well but has a number of limitations and potential pitfalls/shortcomings.
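For reference, taking a snapshot really is a one-liner; for example with the AWS CLI, assuming you have it installed and configured (the volume ID below is made up):
$ aws ec2 create-snapshot --volume-id vol-1a2b3c4d --description "nightly backup"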
Here is a list of advantages and disadvantages of using EBS snapshots for Linux EC2 instance backup purposes. In no way are these lists fully comprehensive:
Advantages:
- Fast
- Easy/Simple
- Easily scriptable using API tools
- Pre-baked functionality built into the AWS APIs and Management Console
Disadvantages:
- Non-selective (requires backing up an entire EBS volume)
- More expensive
  - EBS is more expensive than S3
  - Backing up an entire EBS volume can be overkill for what you actually need backed up and result in a lot of extra cost for backing up non-essential data
- Pitfalls with multiple EBS volume software RAID or LVM sets
  - Multiple EBS volume sets are difficult to snapshot synchronously
  - Using the snapshots for recovery requires significant work to reconstruct volume sets
- No ability to capture only files that have changed since previous backup (ie rsync style backups)
- Only works on EBS-backed instances
Compare that to a list of advantages/disadvantages of using the S3/Duplicity solution:
Advantages:
- Inexpensive (S3 is cheap)
- Data security (redundancy and geographically distributed)
- Works on any Linux system that has connectivity to S3
- Should work on any UNIX-style OS (including Mac OS X) as well
- Only copies the deltas in the files and not the entire file or file-set
- Supports “Full” and “Incremental” backups
- Data is compressed with gzip
- Lightweight
- FOSS (Free and Open Source Software)
- Works independently of underlying storage type (SAN, Linux MD, LVM, NFS, etc.) or server type (EC2, Physical hardware, VMWare, etc.)
- Relatively easy to set up and configure
- Uses syntax that is congruent with rsync (e.g. --include, --exclude)
- Can be restored anywhere, anytime, and on any system with S3 access and Duplicity installed
Disadvantages:
- Slower than a snapshot, which is virtually instantaneous
- Not ideal for backing up data sets with large deltas between backups
- No out-of-the-box type of AWS API or Management Console integration (though this is not really necessary)
- No “commercial” support
On to the important stuff! How to actually get this thing up and running
Things you’ll need:
- The Duplicity application (should be installable via yum, apt, or another package manager). Duplicity itself has numerous dependencies, but the package management utility should handle all of that.
- An Amazon AWS account
- Your Amazon S3 Access Key ID
- Your Amazon S3 Secret Access Key
- A list of files/directories you want to back up
- A globally unique name for an Amazon S3 bucket (the bucket will be created if it doesn’t yet exist)
- If you want to encrypt the data:
  - A GPG key
  - The corresponding GPG key passphrase
- Obtain/install the application (and its prerequisites):
If you’re running a standard Linux distro you can most likely install it from a ‘yum’ or ‘apt’ repository (depending on distribution). Try something like “sudo yum install duplicity” or “sudo apt-get install duplicity”. If all else fails (perhaps you are running some esoteric Linux distro like Gentoo?), you can always do it the old-fashioned way: download the tarball from the website and compile it (that is outside the scope of this blog). “Use the source, Luke.” If you are a Mac user you can also compile it and run it on Mac OS X (https://blog.oak-tree.us/index.php/2009/10/07/duplicity-mac), though I have not tested/verified that this actually works.
- NOTE: On Fedora Core 18, Duplicity was already installed and worked right out of the box. On a Debian Wheezy box I had to apt-get install duplicity and python-boto. YMMV
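A quick sanity check once the install finishes is to confirm the binary is on your path and that the boto library is importable (the print() form works on either Python 2 or 3):
$ duplicity --version
$ python -c 'import boto; print(boto.__version__)'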
- Generate a GPG key if you don’t already have one:
- If you need to create a GPG key use ‘gpg --gen-key’ to create a key with a passphrase. The default values supplied by ‘gpg’ are fine.
- NOTE: record the GPG key ID that it generates because you will need it!
- NOTE: keep a backup copy of your GPG key somewhere safe. Without it you won’t be able to decrypt your backups, and that could make restoration a bit difficult.
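For the uninitiated, the whole key dance is only a few commands. MY_GPG_KEY below is a placeholder for the short key ID that --list-keys prints, and it is the same value plugged into the duplicity commands further down:
$ gpg --gen-key
$ gpg --list-keys
$ gpg --export-secret-keys --armor MY_GPG_KEY > duplicity-backup-key.asc
Stash that exported .asc file (and the passphrase) somewhere that is not the machine you are backing up.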
- Run Duplicity, backing up whatever files/directories you want saved in the cloud. I’d recommend reading the man page for a full rundown of all the options and syntax.
I used something like this:
$ export AWS_ACCESS_KEY_ID='AKBLAHBLAHBLAHMYACCESSKEY'
$ export AWS_SECRET_ACCESS_KEY='99BIGLONGSECRETKEYGOESHEREBLAHBLAH99'
$ export PASSPHRASE='mygpgpassphrase'
$ duplicity incremental --full-if-older-than 1W --s3-use-new-style --encrypt-key=MY_GPG_KEY --sign-key=MY_GPG_KEY --volsize=10 --include=/home/rkennedy/bin --include=/home/rkennedy/code --include=/home/rkennedy/Documents --exclude=** /home/rkennedy s3+https://myS3backup-bucket
- Since we are talking about backups and rsync, this is probably something you will want to run more than once. Writing a bash script and kicking it off automatically with cron is the obvious move; a rough sketch follows below, and here is another pretty nice example of how you could script this – https://www.cenolan.com/2008/12/how-to-incremental-daily-backups-amazon-s3-duplicity/
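The bucket name, paths, keys, and retention period below are just the placeholders from the example above, so adjust to taste:
#!/bin/bash
# Nightly Duplicity backup to S3; drop it in cron, e.g.:
#   0 2 * * * /home/rkennedy/bin/s3-backup.sh
export AWS_ACCESS_KEY_ID='AKBLAHBLAHBLAHMYACCESSKEY'
export AWS_SECRET_ACCESS_KEY='99BIGLONGSECRETKEYGOESHEREBLAHBLAH99'
export PASSPHRASE='mygpgpassphrase'

# Incremental run, rolled up into a fresh full backup once a week
duplicity incremental --full-if-older-than 1W --s3-use-new-style \
  --encrypt-key=MY_GPG_KEY --sign-key=MY_GPG_KEY --volsize=10 \
  --include=/home/rkennedy/bin --include=/home/rkennedy/code \
  --include=/home/rkennedy/Documents --exclude='**' \
  /home/rkennedy s3+https://myS3backup-bucket

# Optionally prune backup chains older than two months so the bucket doesn't grow forever
duplicity remove-older-than 2M --force s3+https://myS3backup-bucket

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY PASSPHRASE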
- Recovery is also pretty straightforward:
$ duplicity --encrypt-key=MY_GPG_KEY --sign-key=MY_GPG_KEY --file-to-restore Documents/secret_to_life.docx --time 05-25-2013 s3+https://myS3backup-bucket /home/rkennedy/restore
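A couple of other Duplicity verbs come in handy before (or instead of) an actual restore: collection-status shows the state of the full/incremental chains sitting in the bucket, and list-current-files lists what is in the latest backup.
$ duplicity collection-status s3+https://myS3backup-bucket
$ duplicity list-current-files s3+https://myS3backup-bucket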
Overwhelmed or confused by all of this command line stuff? If so, Deja-dup might be helpful. It is a GNOME-based GUI application that provides the same functionality as Duplicity (it turns out the two projects share a lot of code and some of the same developers). Here is a handy guide on using Deja-dup for making Linux backups: (https://www.makeuseof.com/tag/dj-dup-perfect-linux-backup-software/)
This is pretty useful, and for $4 a month, or about the average price of a latte, you can store nearly 50GB of compressed, de-duplicated backups in S3 (standard tier). For just a nickel you can get at least 526MB of S3 backup for a month. Well, that and the 5GB of S3 you get for free.
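(For the curious, that nickel figure is just the standard-tier rate at the time of writing, roughly $0.095 per GB-month, worked backwards: $0.05 / $0.095 ≈ 0.53GB, or about 526MB.)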
-Ryan Kennedy, Senior Cloud Engineer