Section 7.3. rdiff-backup | Backup & Recovery: Inexpensive Backup Solutions for Open Systems

7.3. rdiff-backup

rdiff-backup is a program written in Python and C that uses the same rolling-checksum algorithm as rsync. Although the two programs are similar, they do not share any code and must be installed separately.

When backing up, both rsnapshot and rdiff-backup create a mirror of the source directory. For both, the current backup is just a copy of the source, ready to be copied and verified like an ordinary directory. And both can be used over ssh in either push or pull mode. The most important conceptual differences between rsync snapshots and rdiff-backup are how they store older backups and how they store file metadata.

An rsync-snapshot system basically stores older backups as complete copies of the source. As mentioned earlier in the chapter, by being clever with hard links, these copies do not take long to create and usually do not take up nearly as much disk space as unlinked copies. However, every distinct version of every file in the backup is stored as a separate copy of that file. For instance, if you add one line to a file or change a file's permissions, that file is stored twice in the backup archive in its entirety. This can be especially troublesome with logfiles, which grow only slightly each time but change very often.
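To make the hard-link behavior concrete, here is a minimal shell sketch of a single snapshot rotation. It is not rsnapshot or any particular rsync script; the directory names (snap.0, snap.1), the file names, and the inode helper are made up for illustration, and the rotation is done by hand with ln and mv.

```shell
# Toy snapshot rotation; all paths and names here are hypothetical.
work=$(mktemp -d)
mkdir "$work/snap.1"
printf 'static notes\n' > "$work/snap.1/readme.txt"
printf 'line 1\n'       > "$work/snap.1/app.log"

# New snapshot: start by hard-linking every file from the previous one.
mkdir "$work/snap.0"
ln "$work/snap.1/readme.txt" "$work/snap.0/readme.txt"
ln "$work/snap.1/app.log"    "$work/snap.0/app.log"

# The log grew by one line on the source. Writing the new version to a
# temporary file and renaming it into place breaks the hard link, so the
# older snapshot keeps its own full copy.
printf 'line 1\nline 2\n' > "$work/snap.0/app.log.new"
mv "$work/snap.0/app.log.new" "$work/snap.0/app.log"

# Helper: print the inode number of a file.
inode() { ls -i "$1" | awk '{print $1}'; }

# Unchanged file: one inode shared by both snapshots (stored once).
echo "readme: $(inode "$work/snap.1/readme.txt") $(inode "$work/snap.0/readme.txt")"
# Changed file: two inodes, so both versions are stored in full.
echo "log:    $(inode "$work/snap.1/app.log") $(inode "$work/snap.0/app.log")"
```

The two inode numbers printed for readme match, while the two printed for log differ: a one-line change to the log costs a second complete copy.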

On the other hand, rdiff-backup does not keep complete copies of older files in the backup archive. Instead, it stores only the compressed differences between current files and their older versions, called diffs or deltas. For logfiles, rdiff-backup would not keep a separate copy of the older and slightly shorter log. Instead, it would save to the archive a delta file that contains the information "the older version is the current version but without the last few lines." These deltas are often much smaller than an entire copy of the older file. When a file has changed completely, the delta is about the same size as the older version (but is then compressed).

When an rdiff-backup archive has multiple versions of a file, the program stores a series of deltas. Each one contains instructions on how to construct an earlier version of a file from a later one. When restoring, rdiff-backup starts with the current version and applies deltas in reverse order.
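The idea of a reverse-delta chain can be sketched with ordinary tools. The example below is only an illustration: it uses line-based diff and patch rather than the librsync binary deltas that rdiff-backup actually writes, and the file names (v1, v2, v3, delta.3to2, and so on) are invented for the example.

```shell
# Toy reverse-delta chain using diff and patch (NOT rdiff-backup's format).
work=$(mktemp -d)
printf 'boot\n'                > "$work/v1"   # oldest version
printf 'boot\nlogin\n'         > "$work/v2"
printf 'boot\nlogin\nlogout\n' > "$work/v3"   # current version

# Each delta says how to turn a later version into the one before it.
# diff exits with status 1 when the files differ, which is expected here.
diff "$work/v3" "$work/v2" > "$work/delta.3to2" || [ "$?" -eq 1 ]
diff "$work/v2" "$work/v1" > "$work/delta.2to1" || [ "$?" -eq 1 ]

# The archive keeps only the current version plus the deltas.
cp "$work/v3" "$work/current"
rm "$work/v1" "$work/v2" "$work/v3"

# Restore the oldest version: start from current, apply the newest delta first.
cp "$work/current" "$work/restored"
patch -i "$work/delta.3to2" "$work/restored" > /dev/null
patch -i "$work/delta.2to1" "$work/restored" > /dev/null

cat "$work/restored"
```

Note that the older a version is, the more deltas must be applied to reach it, so in this scheme the most recent backup is always the cheapest one to restore.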

Besides storing older versions as deltas instead of copies, rdiff-backup also stores the (compressed) metadata of all files in the backup archive. Metadata is data associated with a file that describes the file's real data. Some examples of file metadata are ownership, permissions, modification time, and file length. This metadata does not take up much space because metadata is generally very compressible. Newer versions go further and store only deltas of the metadata, for even more space efficiency.

At the cost of some disk space, storing metadata separately has several benefits. First, data loss is avoided even if the destination filesystem does not support all the features of the source filesystem. For instance, ownership can be preserved even without root access, and Linux filesystems with symbolic links, device files, and ACLs can be backed up to a Windows filesystem. You don't have to examine the details of each filesystem to know that the backup will work. Second, with metadata stored separately, rdiff-backup is less disk-intensive on the backup server. When backing up, rdiff-backup does not need to traverse the mirror's directory structure to determine which files have changed. Third, metadata such as SHA-1 checksums can be used to verify the integrity of backups.
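The third point, verifying a mirror against separately stored checksums, can be sketched in a few lines of shell. This is not rdiff-backup's metadata format or its verification command; it assumes a sha1sum command is available (as in GNU coreutils), and the directory names and the verify helper are hypothetical.

```shell
# Toy integrity check with checksums stored outside the mirror.
# Assumes a sha1sum command (GNU coreutils); paths are hypothetical.
work=$(mktemp -d)
mkdir "$work/mirror" "$work/meta"
printf 'hello\n' > "$work/mirror/a.txt"
printf 'world\n' > "$work/mirror/b.txt"

# At backup time: record one checksum per file, outside the mirrored tree.
( cd "$work/mirror" && sha1sum a.txt b.txt ) > "$work/meta/sha1.manifest"

# Helper: succeed only if every file still matches its recorded checksum.
verify() {
    ( cd "$work/mirror" && sha1sum -c "$work/meta/sha1.manifest" ) > /dev/null 2>&1
}

if verify; then first_check=ok; else first_check=failed; fi

# Simulate silent corruption of one mirrored file, then check again.
printf 'w0rld\n' > "$work/mirror/b.txt"
if verify; then second_check=ok; else second_check=failed; fi

echo "before corruption: $first_check"
echo "after corruption:  $second_check"
```

Because the manifest lives outside the mirrored tree, damage to a mirrored file does not also damage the record used to detect it.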

7.3.1. Advantages

Here are some advantages of using rdiff-backup instead of an rsync script or rsnapshot:

Backup size: Because rdiff-backup does not store complete copies of older files but only the compressed differences between older and current files, backups generally consume less disk space.
Easier to use: Unlike rsync, rdiff-backup was written originally for backups. It has sensible defaults (so there is no need for the -av --delete -e ssh options) and fewer quirks (for instance, there is no distinction between <destination>, <destination>/, and <destination>/.).
Preserves all information: With rsync, all information is stored in the filesystem itself. If you log in to your backup repository as a nonroot user (generally a good idea), the rsync method forgets who owns all your files! rdiff-backup keeps a copy of all metadata in a separate file, so no information is lost, even if you aren't root or if you back up to a different kind of filesystem.
Handy backup features: rdiff-backup has several miscellaneous handy features. For example, it keeps detailed logs on what is changing and has commands to process those logs so that you know which files are using up your space and time. Also, newer versions keep SHA-1 checksums of all files so you can verify the integrity of backups. Some rsync scripts have similar features; check their documentation.

7.3.2. Disadvantages

Let's be honest. rdiff-backup has some disadvantages, too:

Speed: rdiff-backup consumes more CPU than rsync and is therefore slower than most rsync scripts. This difference is often not noticeable when the bottleneck is the network or a disk drive but can be significant for local backups.
Transparency: With rsync scripts, all past backups appear as copies and are thus easy to verify, restore, and delete. With rdiff-backup, only the current backup appears as a true copy. (Earlier backups are stored as compressed deltas.)
Requirements: rdiff-backup is written in Python and requires the librsync library. Unless you use a distribution that includes rdiff-backup (most of them include it), installation could entail downloading and installing other files.

7.3.3. Quick Start

Here's a basic, but complete, example of how to use rdiff-backup to back up and restore a directory. Suppose the directory to be backed up is called <source>, and we want our archive directory to be called <destination>:

$ rdiff-backup source destination

This command backs up the <source> directory into <destination>. If you look into <destination>, you'll see that it is just like <source> but contains a directory called <destination>/rdiff-backup-data where the metadata and deltas are stored. The rdiff-backup-data directory is laid out in a fairly straightforward way (all information is either in possibly gzipped text files or in deltas readable with the rdiff utility), but we don't have the space to go into the data format here.

The first time you run this command, it creates the <destination> and <destination>/rdiff-backup-data directories. On subsequent runs, it sees that <destination> exists and makes an incremental backup instead. For daily backup usage, no special switches are necessary.

Suppose you accidentally delete the file <source>/foobar and want to restore it from backups. Both of these commands do that:

$ cp -a destination/foobar source
$ rdiff-backup -r now destination/foobar source

The first command works because <destination>/foobar is a mirror of <source>/foobar, so you can use cp or any other utility to restore. The second command contains the -r switch, which tells rdiff-backup to enter restore mode and restore the specified file as it was at the given time. In the example, now is specified, meaning restore the most recent version of the file. rdiff-backup accepts a large variety of time formats.

Now suppose you realize you deleted the important file <source>/foobar a week ago and want to restore it. You can't use cp to restore because the file is no longer present in <destination> in its original form (in this case it's gzipped in the <destination>/rdiff-backup-data directory). However, the -r syntax still works; this time you pass 7D for seven days:

$ rdiff-backup -r 7D destination/foobar source

Finally, suppose that the <destination> directory is getting too big, and you need to delete older backups to save disk space. This command deletes backup information more than one year old:

$ rdiff-backup --remove-older-than 1Y destination
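The backup and pruning commands are usually run together from a scheduled job. The following is a minimal sketch of such a wrapper, not an official script: the variable names (SRC, DEST, KEEP, DRY_RUN), the default paths, and the run and nightly_backup helpers are all assumptions made for this example. It uses only the two rdiff-backup invocations shown in this section, and by default it merely prints what it would do.

```shell
#!/bin/sh
# Hypothetical nightly wrapper; every name and path below is illustrative.
SRC=${SRC:-/home/alice}        # directory to back up
DEST=${DEST:-/backups/alice}   # rdiff-backup archive directory
KEEP=${KEEP:-1Y}               # discard backup information older than this
DRY_RUN=${DRY_RUN:-1}          # 1 = only print the commands

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Back up first; prune old increments only if the backup succeeded.
nightly_backup() {
    run rdiff-backup "$SRC" "$DEST" || return 1
    run rdiff-backup --remove-older-than "$KEEP" "$DEST"
}

nightly_backup
```

Set DRY_RUN=0 once the printed commands look right. Pruning only after a successful backup avoids shrinking the archive on a night when no new backup was made. Depending on the version, rdiff-backup may refuse to remove more than one increment in a single run unless told to force it, so check the man page before relying on a long first cleanup.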

Just like rsync, rdiff-backup allows the source or destination directory (or both) to be on a remote computer. For example, to back up the local directory <source> to the <destination> directory on the computer host.net, use the command:

$ rdiff-backup source user@host.net::destination

This works as long as rdiff-backup is installed on both computers, and host.net can receive ssh connections. The earlier commands also work if user@host.net::<destination> is substituted for <destination>.
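Since either side may be remote, the same syntax covers both the push and pull modes mentioned at the start of this section. The sketch below only builds and prints the two command lines; the remote_spec helper is a hypothetical convenience for this example, and user, host.net, source, and destination are the placeholders used above.

```shell
# Hypothetical helper: builds the user@host::path form used above.
remote_spec() {
    printf '%s@%s::%s' "$1" "$2" "$3"
}

# Push: run on the client, which sends its local directory to the server.
push_cmd="rdiff-backup source $(remote_spec user host.net destination)"

# Pull: run on the backup server, which fetches the directory from the client.
pull_cmd="rdiff-backup $(remote_spec user host.net source) destination"

echo "$push_cmd"
echo "$pull_cmd"
```

In pull mode the backup server initiates the ssh connection, so the machine being backed up does not need login access to the archive.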

7.3.4. Windows, Mac OS X, and the Future

Although rdiff-backup was originally developed under Linux for Unix-style systems, newer versions have features that are useful to Windows and Mac users. For instance, rdiff-backup can back up case-sensitive filesystems and files whose names contain colons (:) to Windows filesystems. Also, rdiff-backup supports Mac resource forks and Finder information, and is easy to install on Mac OS X because it is included in the Fink distribution. Unfortunately, rdiff-backup is a bit trickier to install natively under Windows; currently, Cygwin is probably the easiest way.

Future development of rdiff-backup may consist mostly of making sure that the newer features like full Mac OS X support are as stable as the core Unix support, and adding support for new filesystem features as they emerge. For more information on rdiff-backup, including full documentation and a pointer to the mailing list, see the rdiff-backup project home page at http://rdiff-backup.nongnu.org/.

BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.