Section 14.1. Backup Strategies

14.1. Backup Strategies

To the uninitiated, computer backup can be an intimidating topic, filled with its own list of things that must be learned. These include backup hardware, complete and incremental backups, local and network backups, and client- versus server-initiated network backups. These topics all require at least minimal description before you can make an informed decision about how to set up a network backup system.

14.1.1. Backup Hardware

The first choice you must make when putting together a network backup solution is what type of hardware to use. The choices can be baffling because there are so many. If you want to use Linux with an existing backup device, you must consider Linux's compatibility with your hardware. In any event, backup hardware falls into several broad classes, each of which has many specific models and subtypes:

Tapes: Tape backup has historically been the most common form of backup medium, due largely to the low cost per gigabyte of tapes, their high capacities, and the fact that they're highly portable, which is a boon for storing some of your backups off-site. Tape, though, is an inconvenient backup medium because of its sequential-access nature, meaning that data must be read or written sequentially; you can't randomly seek to and read a particular file, as you can with disks and other random-access media. The price advantage of tapes is less advantageous in recent years, as hard disk prices have plummeted. Tape is less reliable than many other media; finding that a tape has lost some or all of its data is a sadly common occurrence. Tape is unusual because most mid-range and high-end tape drives provide built-in data compression features. In fact, manufacturers often advertise their typical capacities when using compression. Be sure to remember this fact when comparing tapes to other backup media.
Optical media: Optical media include CD-R, CD-RW, and various recordable DVD formats. These media have the advantages of being extremely common and inexpensive, but their capacities (even of DVDs) are low, at least for full network backups. Nonetheless, optical media can be important for backing up individual projects or for creating basic desktop system recovery disks.
Removable disks: This category includes floppy disks, magneto-optical (MO) disks, Zip disks, Jaz disks, and similar devices. These devices use technologies similar to those of hard disks (MO disks are a cross between magnetic disk and optical technologies, though), but individual disks can be removed from their drives for storage or transport to other computers with compatible drives. As computer backup tools, however, they're poor choices because the media are expensive and usually low in capacity. These disks work well for backing up individual users' files or specific projects.
Removable hard disks: A variant on the removable disk idea is to place a hard disk in a special housing that enables it to be easily removed. This can be either an external disk that connects to the computer using SCSI, IEEE-1394, or USB-2.0 connectors, or an internal disk with a special mounting bay. Removable hard disks have the advantage of fast random access and, increasingly, low cost. Hard disks are fairly delicate, though, so they aren't good for routine transport between sites; the risk of a shock causing damage is too great.

Of these broad classes, tape is still the medium of choice for backing up entire networks, but the initial cost can be high. A high-capacity single-tape drive can cost over $1,000, and a tape changer, which automatically changes several tapes, enabling you to treat several as one, is even more expensive. However, high-end tape formats, use tape media that are relatively inexpensivetypically about $1 per gigabyte, uncompressed. Individual tape capacities range from 4 GB to 160 GB uncompressed, for current models.

Removable hard disks have fallen in price enough that they're now competitive with tape, particularly for small sites. A typical removable disk system costs about $100, with extra trays going for another $50 or so. You'll need one tray for each hard disk you use, which is likely to raise the price for the media (tray plus disk) to $1 per gigabyte or thereabouts, at least in early 2005. Hard disk capacities, of course, compete with those of tapes.

Removable disks (other than hard disks) and optical media simply lack the capacity to be used for full network backups, or even for full backups of individual servers or desktop systems. You might still want to use them as part of your backup plan, however. For instance, if your desktop systems hold an OS but little or no user data (that is, if you store user data on a server), you can create CD-R or recordable DVD backups of your OS installations when you first set the systems up or when you perform major OS upgrades, then omit these computers from your normal backup schedules. If your OS installations are small enough, they might fit (with compression) on a single CD-R, and almost certainly on a recordable DVD. Because most desktop systems have CD-ROM drives, and many now have DVD-ROM drives, you can restore these backups without using the network, which greatly simplifies the restore process. You could also use this approach in conjunction with selective network backups of user data directories (such as /home on a Linux desktop system) to protect data stored on users' desktop systems.

If you elect to use tapes for some or all of your backup needs, you must choose a tape format. Quite a few exist, with varying capacities, prices, and speed. Table 14-1 summarizes some of the more common tape formats. Prices in this table were taken from Internet retailers in late summer 2004; they may change by the time you read this. Also, existing tape formats are often extended to support higher capacities, and new formats are periodically introduced. Thus, you may find something better suited to your needs than anything described here. Table 14-1 summarizes drives that are currently on the market and tapes for these drives; tapes for lower-capacity variants of these units are still available and may cost less than indicated here. This table also shows prices for single-tape units; changers for many of these formats are also available, but cost more.

Table 14-1. Common tape formats
Drive type	Drive price	Media price	Uncompressed capacity	Speed
Travan	$250-550	$30-50	10-20 GB	0.6-2 MB/s
DAT/DDS	$400-1,200	$5-30	4-20 GB	1.5-5 MB/s
8mm	$800-4,000	$8-90	7-60 GB	3-12 MB/s
VXA	$600-1,300	$30-100	33-80 GB	3-6 MB/s
AIT	$800-3,800	$75-120	35-100 GB	3-12 MB/s
DLT and SuperDLT	$800-4,700	$50-170	80-160 GB	3-16 MB/s

One more consideration in your choice of backup hardware is how the hardware interacts with software. Removable disks and removable hard disks can be accessed like internal hard disks, by creating a filesystem on the disk and copying files to the disk. You can also compress files and store them in carrier archives, such as tarballs. Tapes must be accessed using special tape device files, which provide sequential access to the drive. Typically, files are backed up using a carrier archive file. Optical media are usually written using a special program, such as cdrecord, which writes the entire disc's contents at once. The disc usually holds a filesystem, though, so that it can be read as if it were an ordinary magnetic disk. Some software enables more direct read/write access to the drive, but it is still relatively new in Linux and may not be suitable for backup purposes. In all cases, using a carrier archive file can help preserve file permissions, time stamps, and so on, even if the carrier file isn't a strict requirement.

14.1.2. Complete Versus Incremental Backups

One of the difficult questions you must answer when designing a backup solution is how much to back up. Most computers hold gigabytes of data, but only some of that data changes frequently. For instance, most executable program files change infrequently. Even many user data files can go unchanged for extended periods of time. Thus, if you can identify the changed files and update them without updating unchanged files, you can save considerable time (and backup media space) on your backups. Doing this is called an incremental backup, which contrasts with a complete backup or full backup, in which every file is backed up.

Incremental backups sound like a great idea, but they do have a drawback: they complicate restores. Suppose for the sake of argument that you perform a complete backup on Monday and an incremental backup every day thereafter. If the hard disk dies on Friday, you need to restore Monday's full backup followed by either every intervening incremental backup or the last one, depending on whether the incremental backups copy files that have changed since the last backup of any type or just the last full backup. What's more, your restored system will have files that might have been intentionally deleted during the week. This can cause serious problems if the system sees heavy turnover in large files, such as if users routinely create and then quickly destroy large multimedia files. (Some backup packages can spot such deletions and handle them automatically, but not all backup software can do this.) These problems become more severe the longer you go between full backups.

Generally speaking, using a small number of incremental backups between full backups can be a great time-saver. For instance, on critical systems that see lots of activity, you might perform a weekly full backup and a daily incremental backup. A less busy or less critical system might manage with monthly full backups and weekly incremental backups.

Given these examples, you may be wondering just how often you need to perform backups. There's no easy answer to this question because it depends on your own needs. You should ask yourself how much trouble a complete system failure would cause and design a backup schedule from there. For instance, if losing a single day's work would be a major hassle, that system should be backed up daily; however, if losing even a week's worth of data would not be a major inconvenience, weekly or even less frequent backups might suffice. The answer to this question, of course, can vary from one system to another; a major file server might need daily backups, whereas desktop computers might need much less frequent backups, or even none at all if they just hold stock OS installations.

14.1.3. Local Versus Network Backups

Much of the preceding description has assumed that individual computers are being backed up. You can certainly back up computers one by one, equipping each one with its own backup hardware or using portable backup hardware that you can move between computers. This is likely to be tedious and expensive, though. When it comes to users' desktop systems, getting them to perform backups can be difficult. One solution to these problems is to perform network backups. These use network protocols to transfer data from the system being backed up (the backup client) to the computer that holds the backup hardware (the backup server).

The main advantages of performing network backups are reduced hardware cost and the potential for simplified backup administration. This second advantage has a corollary: because backups are likely to be less tedious, they're more likely to be done. On the other hand, network backups have certain disadvantages: they can consume a great deal of network bandwidth, they require larger backup storage devices than do individual backups, they require careful planning so as to operate smoothly, and they may require overcoming cross-platform differences (such as Linux versus Windows filename conventions).

Overall, network backups are worth doing on all but the smallest networksor at least, on any network with more than a tiny number of computers that are worth backing up. Typically, your first priority will be your servers, followed by workstations on which users store their data files. You may want to create your own priority list, though; knowing what's most important on your own network will help you plan what hardware to buy and what software will best back up the data.

The backup server computer itself can be fairly unassuming, aside from its backup device and a decent network connection. The computer most likely won't be running any RAM-intensive programs. (Some high-end backup software uses large RAM buffers, however.) If you compress your backups, the CPU might need to be adequate to back up the data, but this task won't strain a CPU unless you've paired it with much more modern network and data storage systems. You might be tempted to equip a major file server with the backup hardware and make it your backup server, and this does have the advantage of simplifying the backup of this important server. On the other hand, it also imposes an extra load on the file server, both in terms of CPU (particularly if you use it to compress data) and network bandwidth. This might be acceptable if you expect to be able to fully complete backups in off hours, but if you expect your backups to occur partly when the network is in use, you might want to use a dedicated backup server. Also, a backup server may have increased vulnerability to certain types of attack, so placing it on its own computer can have security implications compared to having a file server do double duty.

14.1.4. Client- Versus Server-Initiated Backups

When doing network backups, one critical detail is which system controls the backup process: the backup server or the backup client. Both approaches have several consequences:

Scheduling: When the backup server initiates the backup process, it can do so in a way that makes scheduling sense for the network as a whole, and you can specify this schedule from a single computer (namely, the backup server). When the backup client initiates the process, by contrast, scheduling can become difficult, because the possibility of conflicts increases dramatically. This is particularly true if backups are performed on an as-needed basis rather than being strictly scheduled.
Computer availability: When the backup server initiates the backup process, the backup clients must be turned on and available for backup whenever the server does its job. This might be a hassle when backing up desktop computers, which are often powered down at night or over the weekend. When the backup clients initiate the process, though, the server must be available at all times, or at least at scheduled backup times. Because this requirement is placed on just one computer, it's usually less onerous.
Security: When the backup server initiates the backup, the backup client computers must all run a server to respond to backup requests. This server is a potential security risk, making the backup clients vulnerable to outside intrusion. When the backup client initiates the backup, by contrast, it means that the backup server must run a server program. Again, this is a potential security risk, but it applies to just one computer. (The client's files must typically be accessed using a program running as root or its equivalent, but this program need not respond to outside accesses, and therefore needn't be as much of a security risk.) Thus, server-initiated backups can be more of a risk to your network as a whole, particularly if the server software used for backups isn't something you'd otherwise run. (Some backup methods, however, use protocols, such as SMB/CIFS, that you might use even if you didn't perform network backups.)

Network backups use the terms client and server in an unusual way. Typically, the backup server is the computer that houses the backup hardware, and the backup client is the computer that holds data to be backed up. When the backup client initiates the backup, the client/server relationship is as you'd expect; however, when the backup server initiates the backup, the backup client runs network server software, and the backup server runs network client software. This relationship can be confusing if you're unfamiliar with the terminology.

Both client- and server-initiated backups have their uses. Broadly speaking, client-initiated backups work best on small networks with few users and irregular backup schedules, such as in a business with half a dozen employees. As the number of computers grows, though, the scheduling hassles of client-initiated backups become virtually impossible to manage, so server-initiated backups become preferable. You might also prefer server-initiated backups even on a small network because of software features of specific packages or for other reasons; don't feel compelled to use a client-initiated backup strategy on a small network.

14.1.5. Backup Pitfalls

Backups don't always proceed as planned. Worse, restores don't always work the way you expect, and a backup is useless if you can't restore it. Some common problems, particularly in cross-platform network backups, include:

Network bandwidth consumption: Backing up over the network necessarily consumes a certain amount of bandwidth. Ideally, you should schedule backups during off hours to minimize the impact of this activity on day-to-day work.
Metadata support: Every filesystem supports its own types of metadata (data about files, such as file creation times and permissions), and not all backup tools support all the metadata you need. This issue comes up again later in this chapter.
In-use files: Sometimes it's not possible to read a file that's in use by another program, or a file's backup may be corrupted if it was being modified at the moment of the backup. This can cause problems with such Windows files as the Registry, the Outlook mail file, and files used by Microsoft Exchange. One radical solution is to shut down the system and boot it into a secondary OS installation (of the same OS or of another one) for backup, but this is a disruptive process. Some program-specific solutions exist, such as creating backups of the affected files from the programs that create them. These backups should then be handled by the backup software and can be restored to the main files if it becomes necessary.
Restore glitches: No matter what backup solution you choose, you should perform periodic tests of your ability to restore data, simply to ensure that it can be done. Unused and untested procedures have a tendency to "rot" as you upgrade software, rendering a formerly working procedure inoperable.

Unfortunately, backup pitfalls can be very site-specific because they often involve details of your own network, the systems you're backing up, your backup hardware, and the programs you use (both for backup and on the systems being backed up). You may need to rely on testing and experience to discover these problems, then try to find a solution on the Web or in some other way. This is why testing your backups is so critically important; it's far better to discover problems before you need to restore data than after such a restore is needed!