Transaction Recovery | Microsoft SQL Server 2005: The Complete Reference: Full Coverage of all New and Improved Features

I have been careful not to interchange disaster recovery with transaction recovery, because the latter deals mainly with SQL Server’s capability to automatically recover transactions and restore the state of a database after a system crash. It has nothing to do with restoring data and backed-up, or otherwise duplicated, transaction logs, a subject we will cover in this chapter.

Depending on how your backups are done and the nature of your backup technology, just starting up the disaster recovery process could take anywhere from ten minutes to several hours, spent reading the catalog, finding the right backups, and so on. In cases where backup media are off-site, you would need to take into consideration how long it takes after placing a call to the backup bank for the media to arrive at the data center. This could be anything from 30 minutes to 6 hours. And you may be charged for rush delivery.

Mission-critical disaster recovery, having backup media on-site, thus has to override the risk of data center destruction, fire, and theft. However, data center security has nothing to do with SQL Server 2005, so let’s go direct to backup and restore options.

There is no question that you need a good tape backup environment with any server. For starters, you need to back up your system databases and schema catalogs as well, because that’s where all system- or server-wide configurations, such as logins and maintenance plans, are stored. You also need to regularly back up your metadata repositories, data warehouses, and so on.

The most common form of backup is to a tape drive, and I will discuss the formats a little later in this chapter. The initial cost of a tape backup solution is really insignificant in relation to the benefit: the ability to back up and recover large amounts of data as quickly as possible. A good tape drive can run anywhere from $500 for good Quarter-Inch Cartridge (QIC) systems to $3,000–$4,000 for the high-speed, high-capacity Digital Linear Tape (DLT) systems. A robotic library system can cost as much as $30,000 on the low end and six and seven figures on the high end.

Let’s now consider two classes of restoration. I came up with this data while managing DR for a score of data centers in the food distribution business. If Burger King were out of fries or Long John Silver out of fish, I would always call to find out if my servers were still up, so I decided to come up with a formula to help me relax while away from the data center. This data also helped get the database owners and me rowing in the same direction.

The following list defines levels of database and transaction log restoration in terms of currency (was this the last transaction written to the log, or close to it?):

Zero loss (0) Transactions in the transaction log must be the last transactions taken before failure. In other words, you have zero tolerance for transaction loss (see Chapter 2 for information on transaction log management).
Ten minutes loss (10) Transactions received from real time back to ten minutes can be lost. It is thus acceptable to lose 10 minute’s worth of transactions. The client process or order entry person can reenter the transactions.
Sixty minutes loss (60) Transactions received in the first hour can be lost. It is thus acceptable to lose one hour’s worth of transactions.
Three hundred and sixty minutes loss (360) Transactions received in the last six hours can be lost.

The next list assumes you are at one of the preceding levels, but it defines how quickly you need the databases and transaction logs restored.

Zero wait (0) Transaction log restoration is required in real time (now). Let’s call this the critical restoration level.
Ten minutes wait (10) Restore is required within ten minutes of losing the databases and transaction logs. Let’s call this emergency restore. (By the way, loss could mean total loss as well as bad data.)
Sixty minutes wait (60) Restore is required within one hour of losing the databases and transaction logs. Let’s call this urgent restore.
Three hundred and sixty minutes wait (360) Restore is required within six hours of losing the databases and transaction logs. Let’s call this important restore.

All other restores that can occur later than six hours could be considered casual restores and are not factors for this exercise.

If we now look at both scales, we can establish a database and transaction log restoration rating in terms of an acuity index that considers the two scales: the highest level being 0/0 (zero loss for zero wait time to online status) and the lowest level being 360/360 (six hours to get back to online state, and restored data should be at the most six hours old at the time of disaster). It is no accident that the index sounds like a course in optometry I felt it gave both the DBA and the database owner a “vision” of what was expected to be met on the service level agreement.

You can then present this index to a database owner, with the key of course, and then agree to the terms to meet the agreed-upon backup/restore acuity index. Figure 7–1 shows this in a visual hierarchy.

image from book
Figure 7-1: The backup-level pyramid

The pyramid in Figure 7-1 illustrates that the faster the response to a transaction log or database restore, the higher the chance of restoring poor data. Each layer of the pyramid covers the critical level of the database or transaction log backup. This does not mean that critical transaction log restores are always going to be a risk, or that the restored transaction logs are flawed. However, it shows that databases or transaction logs backed up closest to the point of failure are more likely to be at risk compared to data that was backed up hours or days before the failure. If a hard disk crashes, the data on the backup tapes is probably sound, but if the crash is due to corrupt data or bad data (which might not cause system failure), the likelihood of bad transaction log data is going to be high closest to the point in time of the crash, and you'll need to consider restoring to a safe point in time (discussed later).

Another factor to consider is that often you'll find that the cleanest backup data is the furthest away from the point of restoration, or it is the most out-of-date. In other words, your soundest database or transaction log backup media may in fact be miles away, one hundred feet underground, in a vault.

In 1999, just prior to Y2K, I was maintaining backups of several SQL Server systems. But I was not required to check database consistency or database health. The database owner kept me out of the system except when I needed to monitor the backups or check on the schedules. One fine day the database went bad, really bad. A bunch of users somehow turned life in the easy lane upside down for the database owner. The DBA responsible for the integrity and consistency of the databases in question had me go back more than seven days to find backups that were healthy By the time we found clean backups, the loss was running wild. The entire on-site backups for the week were corrupt (that was on a SQL Server 6.5 system, in case you were wondering). I started the restore process, and the DBA left to find a rope. Today's systems are much more resilient against such scenarios as long as you actually do the backups.