Disaster Recovery

Even the most fault-tolerant networks will eventually fail. When those costly and carefully implemented fault-tolerance strategies do fail, you are left with disaster recovery.

Disaster recovery can take many forms. In addition to real disasters such as fire, flood, and theft, many other potential business disruptions can fall under the banner of disaster recovery. For example, the failure of the electrical supply to your city block might interrupt the business function. Such an event, although not a disaster per se, might invoke the disaster recovery methods.

The cornerstone of every disaster recovery strategy is the preservation and recoverability of data. When we talk about preservation and recoverability, we are talking about backups, and when we talk about backups, we are likely talking about tape backups. Implementing a regular backup schedule can save you a lot of grief when fault tolerance fails or when you need to recover a file that has been accidentally deleted. When it comes time to design a backup schedule, there are three key types of backups that are used: full, differential, and incremental.

Full Backup

The preferred method of backup is the full backup method, which copies all files and directories from the hard disk to the backup media. There are a few reasons why doing a full backup is not always possible. First among them is likely the time involved in performing a full backup.

Of the methods discussed here, a full backup is the fastest to restore from because only one tape, or set of tapes, is required for a full restore.

Depending on the amount of data to be backed up, full backups can take an extremely long time and can use extensive system resources. Depending on the configuration of the backup hardware, this can slow down the network considerably. In addition, some environments have more data than can fit on a single tape. This makes taking a full backup awkward, as someone may need to be there to manually change the tapes.

The main advantage of full backups is that a single tape or tape set holds all the data you need backed up. In the event of a failure, a single tape might be all that is needed to get all data and system information back. The upshot of all this is that any disruption to the network is greatly reduced.

Unfortunately, the full backup's strength can also be its weakness. A single tape holding an organization's data can be a security risk. If the tape were to fall into the wrong hands, all the data could be restored on another computer. Using passwords on tape backups and storing tapes in secure onsite and offsite locations can minimize the security risk.

Differential Backup

For those companies that just don't quite have enough time to complete a full backup daily, there is the differential backup. Differential backups are faster than a full backup, as they back up only the data that has changed since the last full backup. This means that if you do a full backup on a Saturday and a differential backup on the following Wednesday, only the data that has changed since Saturday is backed up. Restoring the differential backup will require the last full backup and the latest differential backup.

Differential backups know what files have changed since the last full backup by using a setting known as the archive bit. The archive bit flags files that have changed or been created and identifies them as ones that need to be backed up. Full backups do not concern themselves with the archive bit, as all files are backed up regardless of date. A full backup, however, will clear the archive bit after data has been backed up to avoid future confusion. Differential backups take notice of the archive bit and use it to determine which files have changed. The differential backup does not reset the archive bit information.

If you experience trouble with any type of backup, you should clean the tape drive and then try the backup again. Also visually inspect the tape for physical damage.

Incremental Backup

Some companies have a very limited amount of time they can allocate to backup procedures. Such organizations are likely to use incremental backups in their backup strategy. Incremental backups save only the files that have changed since the last full or incremental backup. Like differential backups, incremental backups use the archive bit to determine the files that have changed since the last full or incremental backup. Unlike differentials, however, incremental backups clear the archive bit, so files that have not changed since the previous backup are not backed up again.

Full and incremental backups clear the archive bit after files have been backed up.
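The archive bit behavior of the three backup types can be sketched in a few lines of Python. This is only an illustrative model, not how any particular backup product works: the FileEntry class and the run_backup and modify functions are hypothetical names invented for this example, and the archive bit is simulated with a Boolean rather than read from a real file system.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    name: str
    archive_bit: bool = True  # new or changed files are flagged for backup


def run_backup(files, kind):
    """Return the names a backup of the given kind copies, updating archive bits."""
    if kind == "full":
        selected = list(files)  # a full backup ignores the archive bit
    elif kind in ("differential", "incremental"):
        selected = [f for f in files if f.archive_bit]
    else:
        raise ValueError(f"unknown backup type: {kind}")

    # Full and incremental backups clear the bit; differential backups do not.
    if kind in ("full", "incremental"):
        for f in selected:
            f.archive_bit = False
    return [f.name for f in selected]


def modify(files, name):
    """Simulate a file change, which sets the archive bit again."""
    for f in files:
        if f.name == name:
            f.archive_bit = True
            return
    raise KeyError(name)


files = [FileEntry("a.txt"), FileEntry("b.txt"), FileEntry("c.txt")]
print(run_backup(files, "full"))          # ['a.txt', 'b.txt', 'c.txt']
modify(files, "a.txt")
print(run_backup(files, "differential"))  # ['a.txt']
modify(files, "b.txt")
print(run_backup(files, "differential"))  # ['a.txt', 'b.txt']
print(run_backup(files, "incremental"))   # ['a.txt', 'b.txt']
print(run_backup(files, "incremental"))   # []
```

Notice that the second differential backup copies a.txt again because the bit was never cleared, whereas the second incremental backup copies nothing because the first one cleared every bit it used.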

The faster backup times of incremental backups come at a price: the amount of time required to restore. Recovering from a failure with incremental backups requires numerous tapes: all the incremental tapes and the most recent full backup. For example, if you had a full backup from Sunday and an incremental for Monday, Tuesday, and Wednesday, you would need four tapes to restore the data. Keep in mind that each tape in the rotation is an additional step in the restore process and an additional failure point. One damaged incremental tape and you will be unable to restore the data. Table 9.4 summarizes the various backup strategies.

Table 9.4. Backup Strategies
Backup Type	Advantages	Disadvantages	Data Backed Up	Archive Bit
Full	Backs up all data on a single tape or tape set. Restoring data requires the fewest tapes.	Depending on the amount of data, full backups can take a long time.	All files and directories are backed up.	Does not use the archive bit, but resets it after data has been backed up.
Differential	Faster backups than a full.	Uses more tapes than a full backup. Restore process takes longer than a full backup.	All files and directories that have changed since the last full backup.	Uses the archive bit to determine the files that have changed, but does not reset the archive bit.
Incremental	Faster backup times.	Requires multiple tapes; restoring data takes more time than the other backup methods.	The files and directories that have changed since the last full or incremental backup.	Uses the archive bit to determine the files that have changed, and resets the archive bit.
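The restore requirements summarized in the table can be worked through with a small Python sketch. The tapes_needed function and the (label, kind) history format are assumptions made for this illustration. The handling of a schedule that mixes differential and incremental backups is not covered in the text; it is inferred here from the archive bit behavior described earlier.

```python
def tapes_needed(history):
    """Given backups in chronological order as (label, kind) pairs,
    return the tape labels required to restore to the latest backup."""
    last_full = None
    for index, (_, kind) in enumerate(history):
        if kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup in history")

    needed = [history[last_full][0]]  # every restore starts from the last full
    latest_differential = None
    for label, kind in history[last_full + 1:]:
        if kind == "incremental":
            needed.append(label)        # every incremental tape is required
            latest_differential = None  # it also covers any earlier differential
        elif kind == "differential":
            latest_differential = label  # only the newest differential matters
    if latest_differential is not None:
        needed.append(latest_differential)
    return needed


incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]

print(tapes_needed(incremental_week))   # ['Sun', 'Mon', 'Tue', 'Wed']
print(tapes_needed(differential_week))  # ['Sun', 'Wed']
```

The incremental schedule reproduces the four-tape restore from the Sunday-to-Wednesday example, while the same week of differential backups needs only two tapes.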

Tape Rotations

After you have decided on the backup type you will use, you are ready to choose a backup rotation. Several backup rotation strategies are in use: some good, some bad, and some really bad. The most common, and perhaps the best, rotation strategy is the Grandfather, Father, Son rotation (GFS).

The GFS backup rotation is the most widely used and for good reason. An example GFS rotation may require 12 tapes: four tapes for daily backups (son), five tapes for weekly backups (father), and three tapes for monthly backups (grandfather).

Using this rotation schedule, it is possible to recover data from days, weeks, or months previous. Some network administrators choose to add tapes to the monthly rotation to be able to retrieve data even further back, sometimes up to a year. In most organizations, however, data that is a week old is out of date, let alone six months or a year.
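One way a 12-tape GFS schedule might be laid out is sketched below in Python. The calendar rules are assumptions chosen for this example and are not prescribed by the rotation itself: backups run Monday through Friday, the four son tapes cover Monday to Thursday, the five father tapes rotate across successive Fridays, and one of the three grandfather tapes is used on the last business day of each month. The gfs_tape and last_business_day functions and the tape labels are hypothetical.

```python
import calendar
from datetime import date, timedelta


def last_business_day(year, month):
    """Return the last Monday-to-Friday date in the given month."""
    day = date(year, month, calendar.monthrange(year, month)[1])
    while day.weekday() > 4:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def gfs_tape(day):
    """Return the tape label to use on `day`, or None when no backup runs."""
    if day.weekday() > 4:
        return None  # assumption: no weekend backups
    if day == last_business_day(day.year, day.month):
        return f"monthly-{(day.month - 1) % 3 + 1}"  # three grandfather tapes
    if day.weekday() == 4:
        week_index = (day.toordinal() - 1) // 7      # counts Monday-based weeks
        return f"weekly-{week_index % 5 + 1}"        # five father tapes in rotation
    return f"daily-{day.weekday() + 1}"              # four son tapes, Mon to Thu


for offset in range(7):
    current = date(2024, 1, 29) + timedelta(days=offset)
    print(current.isoformat(), gfs_tape(current))
```

With these rules, each daily tape is overwritten after a week, each weekly tape after five weeks, and each monthly tape after three months, which is what lets you reach back days, weeks, or months.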

Backup Best Practices

Many details go into making a backup strategy a success. The following list contains issues to consider as part of your backup plan.

Offsite storage: Consider having backup tapes stored offsite so that in the event of a disaster in a building, a current set of tapes is still available offsite. The offsite tapes should be as current as any onsite and should be secure.
Label tapes: The goal is to restore the data as quickly as possible, and trying to find the tape you need can be difficult if the tapes are not marked. Further, labeling can prevent you from recording over a tape you need.
New tapes: Like old cassette tapes, the tape cartridges used for the backups wear out over time. One strategy used to prevent this from becoming a problem is to introduce new tapes periodically into the rotation schedule.
Verify backups: Never assume that the backup was successful. Seasoned administrators know that checking backup logs and performing periodic test restores are parts of the backup process.
Cleaning: From time to time, it is necessary to clean the tape drive. If the inside gets dirty, backups can fail.

A backup strategy must include offsite storage to account for theft, fire, flood, or other disasters.
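The advice to verify backups with periodic test restores can be partly automated. The following Python sketch compares a test restore against the original files by checksum. It assumes the tape has already been restored to a scratch directory; the verify_restore and sha256_of functions are hypothetical helpers for this example, not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing from, or differ in, the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty result suggests the test restore matches the source. Files that were legitimately modified after the backup ran will also be reported, so a check like this is most meaningful when run against data that has not changed since the backup.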

Hot and Cold Spares

The impact that a failed component has on a system or network depends largely on the pre-disaster preparation and on the recovery strategies used. Hot and cold spares represent a strategy for recovering from failed components.

Hot Spare and Hot Swapping

Hot spares give system administrators another mechanism for recovering quickly from component failure. In a common use, a hot spare enables a RAID system to automatically fail over to a spare hard drive should one of the other drives in the RAID array fail. A hot spare does not require any manual intervention; rather, a redundant drive resides in the system at all times, just waiting to take over if another drive fails. The hot spare drive will take over automatically, leaving the failed drive to be removed at a later time. Even though hot-spare technology adds an extra level of protection to your system, after a drive has failed and the hot spare has been used, the situation should be remedied as soon as possible.
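The reason a consumed hot spare should be replaced promptly can be seen in a toy model. The Python sketch below is purely illustrative: the RaidArray class and its methods are invented for this example and do not model striping, parity, or rebuild times.

```python
class RaidArray:
    """Toy model of an array with hot spares (not a real RAID implementation)."""

    def __init__(self, active, spares):
        self.active = list(active)  # drives currently serving data
        self.spares = list(spares)  # idle drives already installed in the system
        self.failed = []            # drives awaiting physical replacement

    def drive_failed(self, drive):
        """Retire a failed drive and promote a spare if one is available."""
        self.active.remove(drive)
        self.failed.append(drive)
        if not self.spares:
            return None             # no automatic recovery is possible
        replacement = self.spares.pop(0)
        self.active.append(replacement)
        return replacement

    def is_protected(self):
        """True while at least one spare remains for the next failure."""
        return bool(self.spares)


array = RaidArray(active=["disk0", "disk1", "disk2"], spares=["spare0"])
print(array.drive_failed("disk1"))  # spare0
print(array.is_protected())         # False
```

After the first failure the array keeps running on the spare, but nothing is left to absorb a second failure until the failed drive is replaced.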

Hot swapping is the ability to replace a failed component while the system is running. Perhaps the most commonly identified hot-swap component is the hard drive. In certain RAID configurations, when a hard drive crashes, hot swapping allows you simply to take the failed drive out of the server and install a new one.

The benefits of hot swapping are very clear in that it allows a failed component to be recognized and replaced without compromising system availability. Depending on the system's configuration, the new hardware will normally be recognized automatically by both the current hardware and the operating system. Nowadays, most internal and external RAID subsystems support the hot-swapping feature. Some hot-swappable components include power supplies and hard disks.

Cold Spare and Cold Swapping

The term cold spare refers to a component, such as a hard disk, that resides within a computer system but requires manual intervention in case of component failure. A hot spare will engage automatically, but a cold spare might require configuration settings or some other action to engage it. A cold spare configuration will typically require a reboot of the system.

The term cold spare has also been used to refer to a redundant component that is stored outside the actual system but is kept in case of component failure. To replace the failed component with a cold spare, the system would need to be powered down.

Cold swapping refers to replacing components only after the system is completely powered off. This strategy is by far the least attractive for servers because the services provided by the server will be unavailable for the duration of the cold-swap procedure. Modern systems have come a long way to ensure that cold swapping is a rare occurrence. For some situations and for some components, however, cold swapping is the only method to replace a failed component. The only real defense against having to shut down the server is to have redundant components residing in the system.

The term "warm swap" is applied to a device that can be replaced while the system is still running but that requires some kind of manual intervention to disable the device before it can be removed. Using a PCI hot plug is technically a warm-swap strategy because it requires that the individual PCI slot be powered down before the PCI card is replaced. Of course, a warm swap is not as efficient as a hot swap, but it is far and away better than a cold swap.

Hot, Warm, and Cold Sites

A disaster recovery plan might include the provision for a recovery site that can be brought quickly into play. These sites fall into three categories: hot, warm, and cold. The need for each of these types of sites depends largely on the business you are in and the funds available. Disaster recovery sites represent the ultimate in precautions for organizations that really need them. As a result, they don't come cheap.

The basic concept of a disaster recovery site is that it can provide a base from which the company can be operated during a disaster. The disaster recovery site is not normally intended to provide a desk for every employee, but is intended more as a means to allow key personnel to continue the core business function.

In general, a cold recovery site is a site that can be up and operational in a relatively short time span, such as a day or two. Provision of services, such as telephone lines and power, is taken care of, and the basic office furniture might be in place, but there is unlikely to be any computer equipment, even though the building might well have a network infrastructure and a room ready to act as a server room. In most cases, cold sites provide the physical location and basic services.

Cold sites are useful if there is some forewarning of a potential problem. Generally speaking, cold sites are used by organizations that can weather the storm for a day or two before they get back up and running. If you are the regional office of a major company, it might be possible to have one of the other divisions take care of business until you are ready to go; but if you are the one and only office in the company, you might need something a little hotter.

For organizations with the dollars and the desire, hot recovery sites represent the ultimate in fault-tolerance strategies. Like cold recovery sites, hot sites are designed to provide only enough facilities to continue the core business function, but hot recovery sites are set up to be ready to go at a moment's notice.

A hot recovery site will include phone systems with the phone lines already connected. Data networks will also be in place, with any necessary routers and switches plugged in and turned on. Desks will have desktop PCs installed and waiting, and server areas will be replete with the necessary hardware to support business-critical functions. In other words, within a few hours, the hot site can become a fully functioning element of an organization.

The issue that confronts potential hot-recovery site users is simply that of cost. Office space is expensive at the best of times, but having space sitting idle 99.9 percent of the time can seem like a tremendously poor use of money. A very popular strategy to get around this problem is to use space provided in a disaster recovery facility, which is basically a building, maintained by a third-party company, in which various businesses rent space. Space is usually apportioned according to how much each company pays.

Sitting in between the hot and cold recovery sites is the warm site. A warm site will typically have computers in place, but they are not configured and ready to go. This means that data might need to be updated or other manual interventions might need to be performed before the network is again operational. The time it takes to get a warm site operational lands right in the middle of the other two options, as does the cost.

A hot site that mirrors the organization's production network will be capable of assuming network operations at a moment's notice. Warm sites have the equipment needed to bring the network to an operational state but require configuration and potential database updates. A cold site has the space available with basic services but typically requires equipment delivery.