Types of Failure | Professional SQL Server 2005 Administration (Wrox Professional Guides)

Different failures can bring down your database. Anything from a user error to a natural disaster could take your database offline. Your backup and recovery plan needs to account for all of these failures.

Hardware Failure

Nowadays, hardware is fairly reliable. However, components can still fail, including the CPU, memory, bus, network card, disk drives, and controllers. A database system on a high-availability solution can mitigate a hardware failure so that if one database server fails, SQL Server will failover to the standby database server that includes fault-tolerant disk arrays and perhaps using redundant IO controllers. All of this will help keep the database online. However, what high-availability solutions cannot protect is when a faulty controller or a disk causes IO failures and corrupts the data.

Note

Use SQLIOStress to help identify the optimal disk configuration or troubleshoot IO faults. Download from http://support.microsoft.com/default.aspx?scid=kb;en-us;231619.

User Error

A common user error is not including a restrictive WHERE clause during an update or delete operation and modifying more rows than expected. As a preventive measure, users should start data modifications inside a BEGIN TRANSACTION, and then the user can verify that the correct numbers of rows were updated before executing a COMMIT TRANSACTION. If the data modification was not performed inside a transaction, the data will be permanently changed and the user would have no capability to undo the changes. Then, to recover the data, a DBA will need to restore from backup. One option in the case of user error is to use a log explorer utility to read the transaction log and undo the data changes.

Application Failure

The user application may contain a bug that causes unwanted data modifications. To prevent it, the application should go through a strict QA process to uncover any such bugs. However, there is no guarantee that an undetected bug may not cause unwanted data modifications in the future. Once the bug is detected and corrected, the DBA may need to recover the database from backup or possibly using a log explorer utility. The scenario is to identify the time that the problem occurred and to recover to that point in time.

Software Failure

The operating system can fail, as can the relational database system. A software driver may be faulty and causing data corruption; for example, the IO hardware controller may be functioning correctly, but the software driver may not be. One preventive measure is to maintain the system current with service packs and patches, including security patches. The DBA may choose any of the following patch-management solutions: Automatic updates from Microsoft Update, Windows Server Update Services, SMS, or a partner solution to keep the servers updated. Unfortunately, some of these updates may require a reboot either to SQL Server or to the Windows OS that will cause some planned downtime. But planned downtime can be mitigated by a high-availability solution to fail over to the standby database server. Choose a maintenance time window when there is lowest user activity; identify the patches that require a reboot ahead of time and apply them at one time whenever possible, so as to require only a single reboot. Record each software driver version and check the vendor Web site for the most current updates. Additionally, the driver must be approved for the computer hardware and the Windows version. Having a supported and updated driver version can make a significant performance difference to the hardware device.

Too Much Privilege

Oftentimes, applications are using SQL Server logins that have more privilege than necessary. That is, instead of restricting security to just what the application needs, it is faster and easier to just grant DBO or SA security. Then the application using this privilege may delete data from the wrong table, either as a result of a bug or by a user accidentally using a freeform query window. To reduce this risk, to give application users only the database permissions to do their work, limit access to freeform SQL tools, and restrict SA and DBO permissions only to users who need it and have the experience to know how to use it.

Local Disasters

An unexpected disaster can devastate an area where the data center may become inoperable or completely destroyed. You'll need to relocate the data center, and that is where the disaster planning comes into effect: to quickly bring up the new data center and reopen for business. Depending on the disaster, there can be data loss where the location is inaccessible, and you may be unable to extract the last few data records. To reduce the exposure from a local disaster, an organization may set up a disaster recovery site by means of data mirroring, a geographically dispersed failover cluster, log shipping, or SAN replication.