Things That Go Wrong with Restore Operations
Many things can cause a restore operation to fail. Unfortunately, most of these are due to factors that occurred long before the restore was attempted. The best policy is to rotate media beforehand and to enable options to read back and verify data written to the media. Check error logs from
Tip
Try to determine the cause of any error
Bad Media
Bad media is like bad tuna: something to be avoided at all costs. One reason that restore operations fail is damaged media. Improper storage and overuse of volatile materials like magnetic tapes can lead to situations in which backups are complete, yet restore is
It is assumed that if, during backup, the media is noticeably damaged or worn, it will not be used. The problems begin with borderline media, such as an old tape. It may complete a backup without error or with only small errors, but when it is time to perform a restore, it breaks or can no longer be read from reliably. All removable media, including floppy
The best way to avoid this situation is to rotate the media on a regular basis,
Bad drives can also lead to bad media. A tape head that needs cleaning may damage a tape, or a
Poor media storage is another culprit. It is all too common for system administrators to leave tapes and CDs in hot
Data Corruption
Nothing is
Some causes of data corruption are random errors, hardware that is failing but has not yet failed, and data that was damaged to begin with. In some cases, only selected objects may be corrupt; sometimes, the entire backup set is. Random errors do occur. No hardware is perfect, and the laws of physics apply. Everything from electromagnetic interference to stray cosmic rays can change the data on the way to the media or after it's already there. Enabling read verification usually detects such errors and allows the backup software to rewrite the corrupt data.
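To make the idea of read verification concrete, here is a minimal sketch in Python. It is not how any particular backup product works; the function names, the retry limit, and the use of SHA-256 checksums are assumptions chosen for illustration. The data is written, read back, compared against a checksum computed beforehand, and rewritten if the comparison fails.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # read back in 1 MiB pieces


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_with_verify(data, dest_path, max_attempts=3):
    """Write `data`, read it back, and rewrite it if the checksums differ.

    Hypothetical helper: returns the number of attempts used and raises
    IOError if no attempt produces a matching read-back.
    """
    expected = hashlib.sha256(data).hexdigest()
    for attempt in range(1, max_attempts + 1):
        with open(dest_path, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to the device
        if file_digest(dest_path) == expected:
            return attempt
    raise IOError(f"verification failed after {max_attempts} attempts: {dest_path}")
```

On an ordinary file system the read-back may be served from the operating system's cache rather than from the media itself, so a sketch like this shows the logic only. Backup software that verifies tapes or discs rereads the physical media.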
Another cause of corruption is hardware that has become unstable but has not failed completely. Controllers that have damaged components may act erratically and write data erroneously to the media. This can occur in all types of media. A laser or tape head that is damaged may also corrupt data. Having read verification enabled will detect errors caused by failing hardware. Rewriting data to the media provides a quick fix but not a permanent solution. Regular maintenance of system
One source of near failures is the environment in which the hardware is kept. Hot system rooms can cause components to overheat without failing completely. Monitoring the environment in the data center can help eliminate data corruption caused by components operating outside their environmental specifications.
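A monitoring check of this kind can be very simple. The following Python sketch compares sensor readings against allowed operating ranges and reports any that fall outside them. The reading names and the limits are hypothetical placeholders; real limits come from the hardware vendor's environmental specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRange:
    """Allowed range for one reading (illustrative values, not vendor specs)."""
    low: float
    high: float


# Hypothetical limits for illustration only.
LIMITS = {
    "temperature_c": OperatingRange(10.0, 35.0),
    "humidity_pct": OperatingRange(20.0, 80.0),
}


def out_of_spec(readings, limits=LIMITS):
    """Return a sorted list of reading names that fall outside their range."""
    violations = []
    for name, value in readings.items():
        allowed = limits.get(name)
        # Readings with no configured limit are ignored rather than flagged.
        if allowed is not None and not (allowed.low <= value <= allowed.high):
            violations.append(name)
    return sorted(violations)
```

Calling out_of_spec({"temperature_c": 41.0, "humidity_pct": 45.0}) flags only the temperature reading, which is the kind of warning that should arrive before a backup is written by an overheated drive.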
The most insidious data error occurs when the
Network Congestion
Just as network congestion can cause backups to go wrong, it can cause restore operations to fail. Quite simply, if it takes too long for data to get between the backup device and host, the backup software will time out. It may even assume that the devices are no longer
There are two good ways to avoid this. First, design the backup system so that it provides necessary bandwidth all the time. Many system
Another way to ensure that a restore operation can complete without interruption is to cut off all other traffic on the system. Shutting down other high-volume systems or disconnecting end users from the network while the restore takes place will ensure that there is sufficient bandwidth to complete the operation. This is highly disruptive to operations, however, and shutting down a large portion of a network to restore data should be a last resort. Shutting down noncritical systems, combined with building the network with enough headroom to accommodate critical restore operations, should suffice.
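A rough calculation shows why available bandwidth matters so much. The Python sketch below estimates how long a restore takes when only part of a link is free for restore traffic. The function name and the figures in the example are assumptions for illustration, and the estimate ignores protocol overhead and the speed of the backup device.

```python
def restore_seconds(data_bytes, link_bits_per_sec, usable_fraction):
    """Estimate transfer time for a restore over a shared link.

    `usable_fraction` is the share of the link (0 < fraction <= 1) that is
    actually available to restore traffic. Hypothetical helper; it ignores
    protocol overhead and device throughput.
    """
    if data_bytes < 0:
        raise ValueError("data_bytes must not be negative")
    if link_bits_per_sec <= 0:
        raise ValueError("link_bits_per_sec must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in the range (0, 1]")
    return (data_bytes * 8) / (link_bits_per_sec * usable_fraction)


# Example figures (assumed): 500 GB over a 1 Gbit/s link.
busy = restore_seconds(500 * 10**9, 10**9, 0.25)   # link mostly in use by others
quiet = restore_seconds(500 * 10**9, 10**9, 0.90)  # noncritical systems shut down
print(f"busy network:  {busy / 3600:.1f} hours")
print(f"quiet network: {quiet / 3600:.1f} hours")
```

With these assumed figures, the same restore takes roughly four and a half hours on the busy network and a little over an hour on the quiet one.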