Recovering from Disaster: Restoring Data

The reason for backing up data is to be able to restore it later. Restore differs from failover in that it is an offline process. With failover, a duplicate of the data and an approximation of the computing environment must be available at all times, in case the primary system is lost. In the event of such a loss, the failover system uses the secondary resources, be they servers, network paths, or storage, instead of the primary ones, to ensure that the applications keep running. Failover can be very expensive, because sufficient duplicate resources, which are seldom used, must be available all the time.

The restore process uses copies of data to return a system to its original state. Data previously stored on backup media is transferred by the backup software to the repaired or new storage devices, which then act as the primary storage system. In some cases, only specific objects, such as files and directories, are restored. At other times, entire volumes and disk sets need to be recovered from backups. While data is being restored, applications that depend on that data cannot be used unless a failover data store is available.

The two most critical issues associated with restoring data are the speed of the restore and the validity of the restored data. Many factors are involved in how fast a restore operation can occur. One of the most important is the speed at which the data can be streamed from the backup media to the new disk drives. The speed of the interface is also critical to overall restore time. The time it takes to perform a restore over a 100-megabit-per-second network is vastly different from what can be accomplished over a 2-gigabit Fibre Channel SAN. How fast the backup server can facilitate the transfer of data (which itself is a function of the server resources) also figures in the speed with which data can be restored. The speed at which the disks can accept and write data is a minor factor in the performance of restore operations.
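
A rough calculation makes the point. The short Python sketch below converts nominal interface speeds into transfer times; the 100-gigabyte data set is an arbitrary example, and the rates are wire speeds before protocol overhead, so real restores would take longer.

# Illustrative only: the 100 GB volume is an arbitrary example, and the
# rates are nominal wire speeds, not measured throughput.

def hours_to_transfer(data_gb, rate_mb_per_s):
    """Hours needed to move data_gb gigabytes at rate_mb_per_s megabytes/second."""
    return data_gb * 1000 / rate_mb_per_s / 3600

interfaces = {
    "100 Mb/s Ethernet": 12.5,       # 100 megabits/s is about 12.5 MB/s
    "2 Gb/s Fibre Channel": 200.0,   # 2 gigabits/s is about 200 MB/s
}

for name, rate in interfaces.items():
    print(f"{name}: {hours_to_transfer(100, rate):.1f} hours for 100 GB")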

Architecting for restore is an exercise in finding and eliminating bottlenecks. Simply changing one component of the system may not achieve the performance goals for that system. Consider the case of a backup system built on a 100-megabit Ethernet network with a backup server running Windows 2003. Attached to the server, via an Ultra2 LVD SCSI host adapter, is a small autoloader housing a Quantum DLT VS160 tape drive. (An autoloader is a tape backup device that has one tape drive but can load tapes automatically.)

The backup solution is designed to back up three servers, each roughly 100 gigabytes in size and each running a moderate-size database, with another 30 gigabytes taken up by system software. Given that the tapes hold 160 gigabytes of compressed data each, only one tape is needed for each server. Although the tape drive can stream compressed data at about 16 megabytes per second under optimal conditions, the complete path from backup server to restored disks sustains closer to 8 megabytes per second, so the best case for restoring any one server's 130 gigabytes is around 4.5 hours. In reality, it will take longer once network congestion and server resources are taken into account, perhaps much longer.
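
The arithmetic behind that figure can be checked in a few lines of Python. The calculation assumes decimal units (1 gigabyte = 1,000 megabytes) and a sustained effective rate of about 8 megabytes per second, which is an assumption consistent with the 4.5-hour estimate rather than a measured value.

# Best-case restore time for one server, assuming roughly 8 MB/s of
# sustained throughput end to end (an assumption, not a measurement).

server_data_gb = 100 + 30              # database data plus system software
effective_rate_mb_per_s = 8

seconds = server_data_gb * 1000 / effective_rate_mb_per_s
print(f"Best-case restore: {seconds / 3600:.1f} hours")    # about 4.5 hours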

Several bottlenecks need to be addressed if the restore time is to be cut down. The Ultra2 LVD SCSI connection can be ignored: at 40 megabytes per second, it transfers data much faster than any of the other system components. The tape drive and network interface are serious bottlenecks, however. Even under the best of circumstances, the 100-megabit Ethernet connection has a wire speed of only 12.5 megabytes per second. Overhead from Ethernet frames, collisions, and the TCP and IP protocols typically reduces this to between 8 and 10 megabytes per second, even when the servers and network are lightly loaded. The tape drive operates at a maximum throughput of 16 megabytes per second for compressed data; real throughput is probably a bit lower, because decompressing data introduces some overhead into the system.
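
The following sketch shows how the nominal wire speed shrinks to usable throughput. The 20 to 35 percent overhead range is an assumption chosen to match the 8-to-10-megabyte-per-second figure above, not a measurement.

# How protocol overhead erodes the 100 Mb/s Ethernet wire speed.
# The overhead percentages are assumptions, not measured values.

wire_speed_mb_per_s = 100 / 8            # 100 Mb/s is 12.5 MB/s on the wire

for overhead in (0.20, 0.35):
    usable = wire_speed_mb_per_s * (1 - overhead)
    print(f"{overhead:.0%} overhead leaves {usable:.1f} MB/s usable")

tape_ceiling_mb_per_s = 16               # DLT VS160 with 2:1 compressed data
print(f"Tape drive ceiling: {tape_ceiling_mb_per_s} MB/s")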

It would seem that the network is the major bottleneck. If the 100-megabit-per-second network connection is changed to Gigabit Ethernet, the bottleneck would shift to the tape drive. Even if the tape drive stayed the same, the theoretical restore time would be reduced to 2.25 hours, or about half of the current best time. Upgrading the tape drive to an SDLT 220, with a maximum throughput of 22 megabytes per second, would further drop the restore time to 1.6 hours. Clearly, changing the network interface has a major impact, and upgrading the tape drive has a lesser one.
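
Because the slowest component in the path governs the restore, the three configurations can be compared with a simple minimum-of-components model. The throughput numbers below are the nominal figures used in this example; real-world rates would be somewhat lower.

# Restore time is set by the slowest link in the chain.
# Rates are the nominal figures from the example (MB/s), not benchmarks.

def restore_hours(data_gb, *rates_mb_per_s):
    """Hours to restore data_gb gigabytes through the slowest component."""
    bottleneck = min(rates_mb_per_s)
    return data_gb * 1000 / bottleneck / 3600

data_gb = 130      # one server: database plus system software
scsi_bus = 40      # Ultra2 LVD SCSI host adapter

scenarios = {
    "100 Mb Ethernet + DLT VS160": (8, 16, scsi_bus),     # network limits: ~4.5 h
    "Gigabit Ethernet + DLT VS160": (100, 16, scsi_bus),  # tape drive limits: ~2.25 h
    "Gigabit Ethernet + SDLT 220": (100, 22, scsi_bus),   # faster drive: ~1.6 h
}

for name, rates in scenarios.items():
    print(f"{name}: {restore_hours(data_gb, *rates):.2f} hours")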
