23.2. Volatile Filesystems

A volatile filesystem is one that changes almost constantly while it is being backed up. Backing up a volatile filesystem could result in a number of negative side effects. The degree to which a backup is affected is directly proportional to the volatility of the filesystem and highly dependent on the backup utility that you are using. Some files could be missing or corrupted within the backup, or the wrong versions of files may be found within the backup. The worst possible problem, of course, is that the backup itself could become corrupted, although this should happen only under the most extreme circumstances. (See the section "Demystifying dump" for details on what can happen when performing a dump backup of a volatile filesystem.)

23.2.1. Missing or Corrupted Files

Files that are changing during the backup do not always make it to the backup correctly. This is especially true if the filename or inode changes during the backup. The extent to which your backup is affected by this problem depends on what type of utility you're using and how volatile the filesystem is.

For example, suppose that the utility performs the equivalent of a find command at the beginning of the backup. This utility then begins backing up those files based on the list that it created at the beginning of the backup. If a filename changes during a backup, the backup utility receives an error when it attempts to back up the old filename. The file, with its new name, is simply overlooked.
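To make this failure mode concrete, here is a minimal Python sketch of a list-first backup (all file and directory names are invented). The list is built, the file is renamed mid-run, and the copy pass both fails on the old name and never sees the new one:

    from pathlib import Path
    import os, shutil, tempfile

    src = Path(tempfile.mkdtemp())
    dst = Path(tempfile.mkdtemp())
    (src / "report.txt").write_text("contents")

    # Phase 1: build the file list, as a find-style pass would.
    file_list = [p.name for p in src.iterdir()]

    # The file is renamed after the list is built but before it is copied.
    os.rename(src / "report.txt", src / "report-final.txt")

    # Phase 2: back up from the stale list.
    for name in file_list:
        try:
            shutil.copy2(src / name, dst / name)
        except FileNotFoundError:
            print(f"error: {name} disappeared")   # the old name fails...

    print([p.name for p in dst.iterdir()])        # ...prints []: the new name was never listed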

Another scenario occurs when the filename does not change, but its contents do. The backup utility begins backing up the file, and the file changes while being backed up. This is probably most common with a large database file. The backup of this file would be essentially worthless because different parts of it were created at different times. (This is actually what happens when backing up Oracle database files in hot-backup mode. Without Oracle's ability to rebuild the file, the backup of these files would be worthless.)
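A single-threaded sketch can stand in for this race, assuming POSIX-style semantics in which an open file descriptor sees data rewritten beneath it. Half the file is read, the whole file is rewritten as a database might rewrite it, and then the rest is read; the resulting "backup" mixes two versions that never existed together on disk:

    from pathlib import Path
    import tempfile

    db = Path(tempfile.mkdtemp()) / "data.db"
    db.write_bytes(b"A" * 16)              # version 1 of the "database"

    backup = bytearray()
    with db.open("rb") as f:
        backup += f.read(8)                # first half comes from version 1
        db.write_bytes(b"B" * 16)          # the file is rewritten mid-backup
        f.seek(8)
        backup += f.read(8)                # second half comes from version 2

    print(bytes(backup))                   # b'AAAAAAAABBBBBBBB': neither version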

23.2.2. Referential Integrity Problems

This is similar to the corrupted files problem but on a filesystem level. Backing up a particular filesystem may take several hours, which means that different files within the backup are backed up at different times. If these files are unrelated, this creates no problem. However, suppose that two files are related in such a way that a change to one requires a corresponding change to the other, and an application needs them to stay consistent with each other. This means that if you restore one, you must restore the other; it also means that if you restore one file to 11:00 p.m. yesterday, you should restore the other file to 11:00 p.m. yesterday. (This scenario is most commonly found in databases but can be found in other applications that use multiple, interrelated files.)

Suppose that last night's backup began at 10:00 p.m. Because of the name or inode order of the files, one is backed up at 10:15 p.m. and the other at 11:05 p.m. However, the two files were changed together at 11:00 p.m., between their separate backup times. Under this scenario, you would be unable to restore the two files to the way they looked at any single point in time. You could restore the first file to how it looked at 10:15, and the second file to how it looked at 11:05. However, they need to be restored together. If you think of files within a filesystem as records within a database, this would be referred to as a referential integrity problem.
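The same 10:15/11:05 sequence is easy to simulate (the filenames and contents here are invented). The backup captures the first file before the application's update and the second file after it, so the restored pair is internally inconsistent:

    from pathlib import Path
    import tempfile

    d = Path(tempfile.mkdtemp())
    (d / "orders.dat").write_text("order 1 -> item 1")
    (d / "items.dat").write_text("item 1: in stock")

    backup = {}
    backup["orders.dat"] = (d / "orders.dat").read_text()   # backed up "at 10:15"

    # The application changes BOTH files together "at 11:00".
    (d / "orders.dat").write_text("order 2 -> item 2")
    (d / "items.dat").write_text("item 2: in stock")

    backup["items.dat"] = (d / "items.dat").read_text()     # backed up "at 11:05"

    # The restored orders file references item 1, but the restored items file
    # only knows about item 2: a combination that never existed on disk.
    print(backup)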

23.2.3. Corrupted or Unreadable Backup

If the filesystem changes significantly while it is being backed up, some utilities may actually create a backup that they cannot read. This is obviously one of the most dangerous things that can happen to a backup, and it would happen only under the most extreme circumstances.

23.2.4. Torture-Testing Backup Programs

In 1991, Elizabeth Zwicky presented a paper at the LISA[*] conference called "Torture-testing Backup and Archive Programs: Things You Ought to Know But Probably Would Rather Not." Although this paper and its information are somewhat dated now, people still refer to it when talking about this subject. Elizabeth graciously consented to allow us to include some excerpts in this book:

[*] Large Installation System Administration Conference, sponsored by USENIX and SAGE (http://www.usenix.org).

Many people use tar, cpio, or some variant to back up their filesystems. There are a certain number of problems with these programs documented in the manual pages, and there are others that people hear of on the street, or find out the hard way. Rumors abound as to what does and does not work, and what programs are best. I have gotten fed up, and set out to find Truth with only Perl (and a number of helpers with different machines) to help me.

As everyone expects, there are many more problems than are discussed in the manual pages. The rest of the results are startling. For instance, on Suns running SunOS 4.1, the manual pages for both tar and cpio claim bugs that the programs don't actually have any more. Other "known" bugs in these programs are also mysteriously missing. On the other hand, new and exciting bugs (bugs with symptoms like confusions between file contents and their names) appear in interesting places.

Elizabeth performed two different types of tests. The first were static tests that checked which programs could handle strangely named files, files with extra-long names, named pipes, and so on. Since at this point we are talking only about volatile filesystems, I will not include her static tests here. Her active tests, reproduced in the sketch that follows this list, included:

  • A file that becomes a directory

  • A directory that becomes a file

  • A file that is deleted

  • A file that is created

  • A file that shrinks

  • Two files that grow at different rates
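Here is a minimal harness in the spirit of those active tests; it is not Elizabeth's actual test code, and the tar invocation is merely a placeholder for whichever archiver you want to torture-test. A background thread mutates the tree while the archiver reads it (the timings are illustrative and may need tuning):

    import os, subprocess, tempfile, threading, time

    tree = tempfile.mkdtemp()
    os.mkdir(os.path.join(tree, "dir_to_file"))
    open(os.path.join(tree, "file_to_dir"), "w").close()
    open(os.path.join(tree, "doomed"), "w").close()
    with open(os.path.join(tree, "shrinker"), "w") as f:
        f.write("x" * 1_000_000)
    with open(os.path.join(tree, "grower"), "w") as f:
        f.write("y")

    def mutate():
        time.sleep(0.1)                                       # let the archiver start
        os.remove(os.path.join(tree, "file_to_dir"))
        os.mkdir(os.path.join(tree, "file_to_dir"))           # file becomes a directory
        os.rmdir(os.path.join(tree, "dir_to_file"))
        open(os.path.join(tree, "dir_to_file"), "w").close()  # directory becomes a file
        os.remove(os.path.join(tree, "doomed"))               # a file is deleted
        open(os.path.join(tree, "created"), "w").close()      # a file is created
        os.truncate(os.path.join(tree, "shrinker"), 10)       # a file shrinks
        with open(os.path.join(tree, "grower"), "a") as f:
            f.write("y" * 1_000_000)                          # a file grows

    t = threading.Thread(target=mutate)
    t.start()
    subprocess.run(["tar", "-cf", "/tmp/torture.tar", tree])  # archiver under test
    t.join()

After the run, compare the archive's contents against the final state of the tree and note which changes the archiver detected, which it reported, and which it silently mishandled.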

Elizabeth explains how the degree to which a utility would be affected by these problems depends on how that utility works:

Programs that do not go through the filesystem, like dump, write out the directory structure of a filesystem and the contents of files separately. A file that becomes a directory or a directory that becomes a file will create nasty problems, since the content of the inode is not what it is supposed to be. Restoring the backup will create a file with the original type and the new contents.

Similarly, if the directory information is written out and then the contents of the files, a file that is deleted during the run will still appear on the volume, with indeterminate contents, depending on whether or not the blocks were also reused during the run.

All of the above cases are particular problems for dump and its relatives; programs that go through the filesystem are less sensitive to them. On the other hand, files that shrink or grow while a backup is running are more severe problems for tar and other filesystem-based programs. dump will write the blocks it intends to, regardless of what happens to the file. If the file has been shortened by a block or more, this will add garbage to the end of it. If it has lengthened, it will truncate it. These are annoying but nonfatal occurrences. Programs that go through the filesystem write a file header, which includes the length, and then the data. Unless the programmer has thought to compare the original length with the amount of data written, these may disagree. Reading the resulting archive, particularly attempting to read individual files, may have unfortunate results.

Theoretically, programs in this situation will either truncate or pad the data to the correct length. Many of them will notify you that the length has changed, as well. Unfortunately, many programs do not actually do truncation or padding; some programs even provide the notification anyway. (The "cpio out of phase: get help!" message springs to mind.) In many cases, the side reading the archive will compensate, making this hard to catch. SunOS 4.1 tar, for instance, will warn you that a file has changed size, and will read an archive with a changed size in it without complaints. Only the fact that the test program, which runs until the archiver exits, got ahead of tar, which was reading until the file ended, demonstrated the problem. (Eventually the disk filled up, breaking the deadlock.)
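The defensive behavior described here is simple to sketch. In this toy archiver (the four-byte header format is invented for illustration), the member's length is promised in the header before any data is copied, so the writer pads if the file shrank, stops at the promised length if it grew, and warns in both cases:

    import os, struct

    def archive_member(path, out):
        size = os.stat(path).st_size           # the length promised in the header
        out.write(struct.pack(">I", size))     # toy header: 4-byte big-endian length
        written = 0
        with open(path, "rb") as f:
            while written < size:
                chunk = f.read(min(65536, size - written))
                if not chunk:                  # the file shrank mid-read:
                    out.write(b"\0" * (size - written))  # pad to the promised length
                    print(f"warning: {path} shrank during backup")
                    return
                out.write(chunk)
                written += len(chunk)
        if os.stat(path).st_size > size:       # the file grew past the header's length
            print(f"warning: {path} grew during backup; member is truncated")

Comparing the bytes actually written against the length recorded in the header is precisely the check that, as Elizabeth found, many programs skip.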

23.2.4.1. Other warnings

Most of the things that people told me were problems with specific programs weren't; on the other hand, several people (including me) confidently predicted correct behavior in cases where it didn't happen. Most of this was due to people assuming that all versions of a program were identical, but the name of a program isn't a very good predictor of its behavior. Beware of statements about what tar does, since most of them are either statements about what it ought to do, or what some particular version of it once did.... Don't trust programs to tell you when they get things wrong, either. Many of the cases in which things disappeared, got renamed, or ended up linked to fascinating places involved no error messages at all.

23.2.4.2. Conclusions

These results are in most cases stunningly appalling. dump comes out ahead, which is no great surprise. The fact that it fails the name length tests is a nasty surprise, since theoretically it doesn't care what the full name of a file is; on the other hand, it fails late enough that it does not seem to be an immediate problem. Everything else fails in some crucial area. For copying portions of filesystems, afio appears to be about as good as it gets, if you have long filenames. If you know that all of the files will fit within the path limitations, GNU tar is probably better, since it handles large numbers of links and permission problems better.

There is one comforting statement in Elizabeth's paper: "It's worth remembering that most people who use these programs don't encounter these problems." Thank goodness!

23.2.5. Using Snapshots to Back Up a Volatile Filesystem

What if you could back up a very large filesystem in such a way that its volatility was irrelevant? A recovery of that filesystem would restore all files to the way they looked when the entire backup began, right? A technology called the snapshot allows you to do just that. A snapshot provides a static view of an active filesystem. If your backup utility is viewing a filesystem through its snapshot, it could take all night long to back up that filesystem; yet it would be able to restore that filesystem to exactly the way it looked when the entire backup began.

23.2.5.1. How do snapshots work?

When you create a snapshot, the software records the time at which the snapshot was taken. Once the snapshot is taken, it gives you and your backup utility another name through which you may view the filesystem. For example, when a Network Appliance filer creates a snapshot of /home, the snapshot may be viewed at /home/.snapshot. Creating the snapshot doesn't actually copy data from /home to /home/.snapshot, but it appears as if that's exactly what happened. If you look inside /home/.snapshot, you'll see the entire filesystem as it looked at the moment /home/.snapshot was created.

Actually creating the snapshot takes only a few seconds. Sometimes people have a hard time grasping how the software can create a separate view of the filesystem without copying it. This is why it is called a snapshot: it doesn't actually copy the data; it merely takes a "picture" of it.

Once the snapshot has been created, the software monitors the filesystem for activity. Before it changes a block of data, it records the way that block looked when the snapshot was taken. How a given product holds on to these previous images varies from product to product.

When you view the filesystem through the snapshot directory, the software watches which blocks you request. If you request a block of data that has not changed since the snapshot was taken, it retrieves that block from the actual filesystem. However, if you request a block of data that has changed since the snapshot was taken, it retrieves that block from the location where the previous image was preserved. This is, of course, completely invisible to the user or application accessing the data. The user or application simply views the filesystem through the snapshot, and where the blocks come from is managed by the snapshot software.
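A toy copy-on-write model makes this read path concrete. The class below is only a sketch of the mechanism just described, with invented names; real products (NetApp filers, LVM, and so on) differ considerably in how they preserve and index the old blocks:

    class Volume:
        def __init__(self, nblocks):
            self.blocks = [b"\0" * 512] * nblocks
            self.saved = None                    # block number -> pre-change contents

        def take_snapshot(self):
            self.saved = {}                      # instant: no data is copied

        def write(self, n, data):
            if self.saved is not None and n not in self.saved:
                self.saved[n] = self.blocks[n]   # preserve the old image first
            self.blocks[n] = data

        def read_snapshot(self, n):
            # Changed blocks come from the preserved copies; unchanged blocks
            # are read straight from the live filesystem.
            return self.saved.get(n, self.blocks[n])

    vol = Volume(8)
    vol.write(0, b"original")
    vol.take_snapshot()
    vol.write(0, b"changed")
    print(vol.read_snapshot(0))   # b'original': the snapshot view is frozen
    print(vol.blocks[0])          # b'changed': the live view moves on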

23.2.5.2. Available snapshot software

Most NAS filers now have built-in snapshot capabilities. In addition, many filesystem drivers now support the ability to create snapshots as well.



