System Failure | Microsoft SQL Server 2000 Administrator's Companion

You might be wondering whether backups are really necessary if you use technologies such as Microsoft Cluster Services and RAID fault tolerance. The answer is a resounding "yes." Your system can fail in a number of ways, and those fault-tolerance and fault-recovery methods protect it against only some of them. In this section, we'll explore some of the potential causes of failure and ways to survive those failures.

Some system failures can be mild; others can be devastating. To understand why backups are so important, you need to know about the three main categories of failures: hardware failures, software failures, and human error.

Hardware Failures

Hardware failures are probably the most common type of failure you will encounter. Although these failures are becoming less frequent as computer hardware becomes more reliable, components will still wear out over time. Typical hardware failures include the following:

CPU, memory, or bus failure: These failures usually result in a system crash. After you replace the faulty component and restart the system, SQL Server automatically performs a database recovery. The database itself is intact, so it does not need to be restored; SQL Server simply uses the transaction log to roll forward committed transactions and roll back uncommitted ones.
Disk failure: If you're using RAID fault tolerance, this type of failure will probably not affect the state of the database at all; you simply repair the RAID array. If you are not using RAID fault tolerance, or if an entire RAID array fails, your only alternative is to restore the database from the backup and then apply the transaction log backups to recover it.
Catastrophic system failure or permanent loss of server: If the entire system is destroyed in a fire or some other disaster, you might have to start over from scratch. The hardware must be rebuilt or replaced, the database restored from the data backup, and then recovered by applying the transaction log backups.
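
The restore-then-recover sequence described above can be sketched as a short script generator. This is only an illustration, not a procedure taken from this book: the helper name build_restore_script, the database name, and the backup file paths are all hypothetical. It assumes the usual Transact-SQL pattern in which every restore step except the last is run WITH NORECOVERY, so that further log backups can still be applied, and the last step is run WITH RECOVERY to bring the database online. The sketch does no escaping, so it assumes names and paths contain no quotes or brackets.

```python
from typing import List


def build_restore_script(database: str, full_backup: str,
                         log_backups: List[str]) -> List[str]:
    """Return T-SQL statements that restore a full backup, then each
    transaction log backup in the order given (oldest first)."""
    statements = []
    # Leave the database unrecovered if log backups still have to be applied.
    full_option = "NORECOVERY" if log_backups else "RECOVERY"
    statements.append(
        f"RESTORE DATABASE [{database}] FROM DISK = '{full_backup}' "
        f"WITH {full_option}"
    )
    for index, log_backup in enumerate(log_backups):
        # Only the final log restore recovers the database.
        is_last = index == len(log_backups) - 1
        option = "RECOVERY" if is_last else "NORECOVERY"
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = '{log_backup}' "
            f"WITH {option}"
        )
    return statements


# Hypothetical database name and backup paths, for illustration only.
for statement in build_restore_script(
    "Sales",
    r"D:\backup\sales_full.bak",
    [r"D:\backup\sales_log1.trn", r"D:\backup\sales_log2.trn"],
):
    print(statement)
```

The generated statements would still have to be reviewed and run by an administrator. The point of the sketch is the ordering: one data restore followed by the log backups in sequence, with recovery performed only at the end.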

Software Failures

Software failures are rare, and your system will probably never experience one. When one does occur, however, it is usually more disastrous than a hardware failure. Software has built-in features that minimize the effect of hardware failures, and when the software itself fails, those protections are lost, leaving the system vulnerable to disaster if a hardware failure also occurs. The transaction log is an example of a software feature designed to help systems recover from hardware failures. Typical software failures include the following:

Operating system failure: If a failure of this type occurs in the I/O subsystem, data on disk can be corrupted. If no database corruption occurs, only recovery is necessary. If database corruption occurs, your only option is to restore the database from a backup.
RDBMS failure: SQL Server itself can fail. If this type of failure causes corruption to occur, the database must be restored from a backup and recovered. If no corruption occurs, only the automatic recovery is needed to return the system to the state it was in at the point of failure.
Application failure: Applications can fail, which can cause data corruption. As with an RDBMS failure, if this type of failure causes corruption to occur, the database must be restored from a backup. If no corruption occurs, no restore is necessary; the automatic recovery will return the system to the state it was in at the point of failure. You might also need to obtain a patch from your application vendor to prevent this type of failure from recurring.
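
The three cases above share one decision rule: whether the failure corrupted the database determines whether a restore is needed. A minimal sketch of that rule follows. The enum, the function name, and the wording of the steps are hypothetical and serve only to summarize the list; whether corruption has occurred is assumed to be known already, because detecting it is outside the scope of the sketch.

```python
from enum import Enum
from typing import List


class SoftwareFailure(Enum):
    OPERATING_SYSTEM = "operating system"
    RDBMS = "RDBMS"
    APPLICATION = "application"


def response_steps(failure: SoftwareFailure,
                   corruption_detected: bool) -> List[str]:
    """Summarize the response to a software failure as a list of steps."""
    steps = []
    if corruption_detected:
        # Corrupted data on disk cannot be repaired by automatic recovery.
        steps.append("restore the database from a backup")
        steps.append("recover the database")
    else:
        # The database files are intact, so no restore is needed.
        steps.append("let automatic recovery run at restart")
    if failure is SoftwareFailure.APPLICATION:
        # Applies whether or not corruption occurred.
        steps.append("ask the application vendor about a patch")
    return steps


print(response_steps(SoftwareFailure.APPLICATION, corruption_detected=True))
```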

NOTE
Companies often try beta versions of SQL Server. Beta versions of software are designed for evaluation and testing only and should not be used in a production environment. Sometimes, beta versions contain software bugs and include features that have not been fully tested. You should use the production release of Microsoft SQL Server 2000, which has been fully tested and is ready for production use.

Human Error

The third main category of failure is human error. Human errors can occur at any time and without notice. They can be mild or severe. Unfortunately, these types of errors can go unnoticed for days or even weeks, which can make recovery more difficult. By establishing a good relationship (including good communication) with your users, you can help make recovery from user errors easier and faster. Users should not be afraid to come to you immediately to report a mistake. The earlier you find out about an error, the better. The following failures can be caused by human error:

Database server loss: Human errors that can cause the server to fail include accidentally shutting off the power or shutting down the server without first shutting down SQL Server. Recovery is automatic when SQL Server is restarted, but it might take some time. Because the database is intact on disk, a restore is not necessary.
Data loss: This type of loss can be caused by, for example, someone accidentally deleting a data file, which causes loss of the database. Restore and recovery operations must be performed to return the database to its prefailure state.
Table loss or corrupted data: If a table is dropped by mistake or its data is somehow incorrectly changed, you can use backup and recovery to return the table to its original state. Recovery from this type of failure can be quite complex because a single table or a small set of lost data cannot simply be recovered from a backup. An example of restoring data after this type of failure is shown in Chapter 33.