Things That Go Wrong with Restore Operations | Data Protection and Information Lifecycle Management


Many things can cause a restore operation to fail. Unfortunately, most of these are due to factors that occurred long before the restore was attempted. The best policy is to rotate media beforehand and to enable options to read back and verify data written to the media. Check error logs from backups as well, to see whether there were recoverable errors. Too many recoverable errors signal that something is wrong. An analysis needs to take place to identify the cause of the problem.

Tip

Try to determine the cause of any error encountered during backup. It may look like backups are being performed correctly despite errors. Errors, especially read verification errors, mean that something is not right and may become an issue when you least want it: during restore. Recoverable errors are still errors and need to be explained and fixed.
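As a rough illustration of checking backup logs for recoverable errors, the sketch below counts them per job and lists the jobs with the most. The log layout (`<job-id> <LEVEL> <message>`), the `RECOVERABLE` level name, and the threshold are all assumptions made for the example; real backup software has its own log format. The threshold only sets the order of investigation, since every recoverable error still needs an explanation.

```python
from collections import Counter

# Hypothetical log format: "<job-id> <LEVEL> <message>", one event per line.
# Real backup products use their own layouts; adapt the parsing accordingly.
RECOVERABLE_LEVEL = "RECOVERABLE"


def recoverable_errors_per_job(log_lines):
    """Count recoverable-error events for each backup job."""
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=2)
        if len(parts) >= 2 and parts[1] == RECOVERABLE_LEVEL:
            counts[parts[0]] += 1
    return counts


def jobs_needing_analysis(log_lines, threshold=3):
    """Return the job ids whose recoverable-error count exceeds the threshold."""
    counts = recoverable_errors_per_job(log_lines)
    return sorted(job for job, n in counts.items() if n > threshold)
```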

Bad Media

Bad media is like bad tuna: something to be avoided at all costs. One reason that restore operations fail is damaged media. Improper storage and overuse of volatile materials like magnetic tapes can lead to situations in which backups are complete, yet restore is impossible.

It is assumed that if, during backup, the media is noticeably damaged or worn, it will not be used. The problems begin with borderline media, such as an old tape. It may complete a backup without error or with only small errors, but when it is time to perform a restore, it breaks or can no longer be read reliably. All removable media, including floppy disks, CD-RW, and magnetic tapes, have these problems.

The best way to avoid this situation is to rotate the media on a regular basis, discarding media that has become too old. The data sheets of most media list an average lifespan or durability specification. This specification translates into the number of uses the media can reasonably withstand.
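One way to put the durability specification to work is to count uses per piece of media and retire it before it reaches the rated figure. The sketch below assumes a hypothetical `Medium` record and an arbitrary safety margin of 80 percent; the rated number of uses would come from the media's data sheet.

```python
from dataclasses import dataclass


@dataclass
class Medium:
    """One tape or disc in the rotation (illustrative record, not a real API)."""
    label: str
    rated_uses: int  # durability figure from the data sheet
    uses: int = 0

    def record_use(self):
        """Call once for every backup written to this medium."""
        self.uses += 1

    def should_retire(self, safety_margin=0.8):
        """Retire once uses reach a fraction of the rated figure."""
        return self.uses >= self.rated_uses * safety_margin
```

Retiring at a fraction of the rated life, rather than at the limit itself, is a design choice meant to keep borderline media out of the rotation.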

Bad drives can also lead to bad media. A tape head that needs cleaning may damage a tape, or a misaligned laser may ruin some data on a CD-RW. The backup may complete, but the data is damaged. To avoid this issue, clean and maintain all drives on a regular basis. Use of read verification options in the backup software will detect whether data has been written incorrectly. This alerts system administrators that there is a problem that needs to be addressed.
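The idea behind read verification can be sketched as writing the data, reading it back, and comparing checksums. This is a minimal illustration using an ordinary file and SHA-256, not the mechanism of any particular backup product. The function names are made up for the example, and a real implementation would have to make sure the read-back comes from the media itself rather than from an operating system cache.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(data, dest_path):
    """Write data to dest_path, read it back, and report whether it matches."""
    expected = hashlib.sha256(data).hexdigest()
    with open(dest_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    # Caveat: this read may still be served from the OS page cache.
    return sha256_of(dest_path) == expected
```

A `False` result is the signal the text describes: the backup appeared to complete, but what is on the media is not what was sent.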

Poor media storage is another culprit. It is all too common for system administrators to leave tapes and CDs in hot cars, causing damage to the underlying plastic. Keeping anything magnetic near a magnetic or electrical field is another media-storage-related issue. One interesting mistake is keeping magnetic media on top of a computer. Computers, despite shielding, generate low-level electromagnetic fields that can erase portions of some media over a long period of time.

Data Corruption

Nothing is worse than a perfectly good backup that has perfectly bad data on it. Corrupt data can also lead to a bad restore operation that the backup software doesn't even detect. The restore worked perfectly; it was just bad data that was restored.

Some causes of data corruption are random errors, failing but not failed hardware, and data that was damaged to begin with. In some cases, only selected objects may be corrupt; sometimes, the entire backup set is.

Random errors do occur. No hardware is perfect, and the laws of physics apply. Everything from electromagnetic interference to stray cosmic rays can change the data on the way to the media or after it's already there. Enabling read verification usually detects this and allows the backup software to rewrite the corrupt data.

Another cause of corruption is hardware that has become unstable but has not failed completely. Controllers that have damaged components may act erratically and write data erroneously to the media. This can occur with all types of media. A laser or tape head that is damaged may also corrupt data. Having read verification enabled will detect errors caused by failing hardware. Rewriting data to the media provides a quick fix but not a permanent solution. Regular maintenance of system components will prevent some of these problems. A few read verification errors may be random; a larger number signals a media or hardware failure about to happen.

A source of near failures is the environment within which the hardware is kept. Hot system rooms will cause components to overheat, but not to the point of complete failure. Monitoring the environment in the data center can help eliminate data corruption caused by components operating outside their environmental specifications.

The most insidious data error occurs when the backed-up data is already corrupted. Technically speaking, the backup and restore worked perfectly. Unfortunately, bad data was backed up and cannot be relied upon. This occurs quite frequently. One reason to restore data is that the primary storage hardware failed, and there are plenty of opportunities to ruin data before complete failure kicks in.

Network Congestion

Just as network congestion can cause backups to go wrong, it can cause restore operations to fail. Quite simply, if it takes too long for data to get between the backup device and host, the backup software will time out. It may even assume that the devices are no longer reachable and abandon the restore instead of retrying.
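The alternative to abandoning a restore on the first timeout is to retry with increasing delays. The sketch below is a generic retry loop with exponential backoff, not the behavior of any specific backup product. `read_block` stands in for whatever call fetches data from the backup device, and the attempt count and delays are arbitrary.

```python
import time


def restore_with_retry(read_block, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call read_block(), retrying on TimeoutError with exponential backoff.

    read_block is a hypothetical callable that fetches data from the backup
    device and raises TimeoutError when the transfer takes too long.
    """
    for attempt in range(attempts):
        try:
            return read_block()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the operator
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Backing off gives a congested network time to clear instead of adding more traffic to it, and the final re-raise ensures that a device that really is unreachable is still reported.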

There are two good ways to avoid this. First, design the backup system so that it provides necessary bandwidth all the time. Many system architects have used Fibre Channel networks for this purpose. Even if Fibre Channel is not used, a dedicated network for backup would give the system a better chance of avoiding congestion during restore (and backup) operations. Simply having enough bandwidth to perform a restore should suffice, though there is clearly a cost involved with it.
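Whether the bandwidth is "enough" comes down to simple arithmetic: the amount of data divided by the throughput the link can actually sustain. The sketch below estimates restore time under stated assumptions: decimal units (1 GB = 8,000 megabits) and a `usable_fraction` that discounts the nominal link speed for protocol overhead and competing traffic. The default of 0.7 is a placeholder, not a measured value.

```python
def restore_hours(data_gb, link_mbps, usable_fraction=0.7):
    """Estimate the hours needed to move data_gb gigabytes over a link.

    link_mbps is the nominal link speed in megabits per second;
    usable_fraction is the share of it assumed available to the restore.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * usable_fraction)
    return seconds / 3600
```

For example, `restore_hours(450, 1000, usable_fraction=1.0)` works out to one hour: 3,600,000 megabits at 1,000 megabits per second is 3,600 seconds. Running the same numbers with a smaller usable fraction shows how quickly a shared network stretches the restore window.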

Another way to ensure that a restore operation can complete without interruption is to cut off all other traffic on the system. Shutting down other high-volume systems or disconnecting end-users from the network while restore takes place will ensure that there is sufficient bandwidth to complete the operation. This is highly disruptive to operations. To shut down a large portion of a network to restore data should be a last resort. Shutting down noncritical systems while building the network with enough overhead to accommodate critical restore operations should suffice.

What If the Backup Is Bad?

Every system administrator's bad dream is trying to restore a system, only to find that the backup is corrupt or ruined. Applications can be restored from original disks, but the data is unrecoverable. For some organizations, the costs of restoring from paper records could destroy them.

There are some options, depending on how the backups were conducted. Using a rotating backup scheme, rather than reusing the same media every day, means that only data since the previous backup is lost. Ideally, that is only a day's worth of data. Recovering a single day's data from paper and other electronic media is a chore but not an organization-killer.
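A rotating scheme can be sketched as a round-robin over a fixed number of media sets, with a fallback to the most recent backup that can still be read. The seven-set rotation and both function names below are illustrative assumptions, not a scheme prescribed by the text.

```python
from datetime import date


def media_set_for(day, set_count=7):
    """Pick which media set to write on a given date (simple round-robin)."""
    return day.toordinal() % set_count


def latest_good_backup(backups):
    """Return the most recent readable backup date, or None if none is readable.

    backups is a list of (date, is_readable) pairs.
    """
    good = [d for d, readable in backups if readable]
    return max(good) if good else None
```

With separate media for each day, a bad backup on one day means falling back to the previous day's set, so the loss is limited to the data created since then. If the same media is reused every day, there is no earlier set to fall back on.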

If the backup process simply reused the same media each day without provision for making a copy of that media, the problem is more acute. Some database data may be recoverable through transaction logs. Perhaps static content, such as web pages, is archived as well as backed up. If not, there are few options outside of hiring specialized companies to try to recover data from the damaged disks and backup media. They may be able to cobble the recovered data together into some semblance of the original data.

The best solution is to take preventive action and not reuse media each day. It takes a lot of media, but it's cheap insurance.
