Establishing Baselines for Disaster Recovery


Establishing baselines is essential to assessing risk, managing disaster recovery, and limiting loss, no matter how big the database or how critical the application. When formulating a DR plan, whether you are starting from scratch or taking over from another DBA, it is critical to fully understand the business plan and the application's availability and service-level needs. You thus need to establish how long your enterprise or service can tolerate being without access to its data and databases. In other words, how long do you have, after a database goes offline, before the business begins to suffer serious harm? Many database or OLTP system clients do not understand their own needs, so establishing a baseline is one of the most important exercises in a data center.

This might seem a frivolous activity, but it's not. You might think that all DR plans should aim to restore service immediately, or at least very soon, after failure. However, achieving the dream of almost instant recovery requires a number of provisions. They include, in no particular order, trained DBAs, the right equipment, the right premises, the right demeanor, the right software, the right technology, and money. No doubt the last item is the most important one: a well-funded data center or contract makes life easy for everyone involved. In other words, establish the needs first and then see what you and SQL Server 2005 can do to rise to the challenge, while making sure you have the budget to fully implement the plan.

Baselines allow you to establish certain tolerance levels and then work outward from there. For example, you might have service level agreements (SLAs) or quality-of-support requirements in place that will dictate requirements to you. A bank or an Internet ordering application will require a very different collection of DR ingredients than a casual order entry system.

Let’s consider some baselines in terms of quality of support. First, you need to understand that no matter what technology or regimen you are using, you can restore only up to the point of your last complete backup plus any transactions captured in available transaction log backups. The time from the last backup and the last duplication of the transaction log to the arrival of a new transaction is what I refer to as void time, or more precisely void-backup time. Some DBAs have called this the transaction twilight zone: the interval just before a transaction is saved to the transaction log and thereby saved from disaster.

I coined the term void time in the network administration realm to explain to users why the data or files they lost could not be restored to the last minute or second of their work. Void time is the useless time in which it is impossible, using conventional tape backup technology, to make a backup copy of a file. Backing up or duplicating data is a sequential process. The slowest technology, where the void-time window is the widest, is tape backup. However, tape backup offers the highest capacity for the least cost, along with the best media safety, durability, storability, and security. The fastest technology, on the other end of the scale, is a mirrored disk, but mirrored disks are expensive (if they are really large and fast) and score low on security, storability, and safety.
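As a trivial illustration of the concept (the function name and the times are mine, not from any tool), the void-time window is simply the gap between the moment of failure and the last successful duplication of the log:

```python
from datetime import datetime

def void_time(failure, last_log_duplication):
    """Width of the void-time window: work committed since the
    last duplication of the transaction log that no backup
    medium yet holds."""
    return failure - last_log_duplication

# With hourly log backups, a crash just before the next run leaves
# almost a full hour of committed work unprotected.
crash = datetime(2006, 3, 1, 10, 59)
last_log = datetime(2006, 3, 1, 10, 0)
print(void_time(crash, last_log))  # 0:59:00
```

The narrower you want this window, the more frequently (and the faster) you must duplicate the log, which is where the cost curve discussed next comes in.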

My aim in network administration is to narrow or close the void-time window as much as possible using whatever technology is available at the time. The higher the value of the work, the easier it is to justify the cost of fast backup technology. Conversely, the lower the value of the work, the harder it is to justify the cost.

This means that unless you are backing up every millisecond of the day, it is very difficult, in the event of a disaster, to restore to the last transaction or commit point saved to the log. As you learned in Chapters 3 and 4, SQL Server can back up the databases and logs to local tape devices or remote disk files while they remain in use. This costs the DBMS some bandwidth, CPU, and memory, and you need powerful and expensive equipment to transfer data at terrific speeds to close the void-time window as much as possible. This integrated feature of SQL Server backup is known in the industry as active backup, and SQL Server has demonstrated record-breaking capabilities for concurrent, or active, backup during highly active OLTP states, on the order of hundreds of transactions per second (TPS). So you need to understand what is possible and whether it works for you and the application owner.
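As a minimal sketch of the conventional mechanism (the database and file names here are hypothetical examples, not anything prescribed by the product), a full backup followed by routine log backups looks like this, and both run while the database stays online:

```sql
-- Full backup while the database remains in use (names are examples)
BACKUP DATABASE Orders
    TO DISK = N'D:\Backups\Orders_full.bak'
    WITH INIT, NAME = N'Orders full backup';

-- Subsequent log backups narrow the void-time window
BACKUP LOG Orders
    TO DISK = N'D:\Backups\Orders_log.trn'
    WITH NOINIT, NAME = N'Orders log backup';
```

The more frequently the second statement runs, the narrower the void-time window, at the cost of bandwidth and device time.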

I established a collection of backup baseline parameters to help me determine and then establish the database owner’s backup and restoration needs, and I have been refining this scheme over the years. The order of the items in the list of baselines determines what can be considered adequate in terms of the age of tape- or file-backed-up databases and transaction logs. It helps database owners focus on a restore level, because often they are not fully informed about what is possible or what it will cost. They just tell you that losing even one transaction is a disaster, without understanding what it will take to ensure zero transaction loss and 100 percent availability. Neither is truly possible, because a system can crash while data is still in volatile memory and has not yet been written to the hard disk, where it can be swept up to the backup device.

Once I have established the loss-tolerance level (the baseline) with the owner, I work out how to cater to it for an individual system or application (the service-level need) and at what cost. First let’s establish the critical nature of our backup requirement, an exercise in pigeonholing. Consider the following list:

  1. The backup is one month or more old. The database owner needs monthly and annual backup for archival purposes, analysis, or some legal requirement. Pretty standard stuff; probably backup of temporal data, a data warehouse, and so on.

  2. The backup is between one and four weeks old. The database owner needs weekly backups for access to analysis data and possible rollback of the database to a certain point in time. Also pretty standard stuff; any inexpensive tape device will work.

  3. The backup is between four and seven days old. This is the same as number 2, but the rollback period is days and not weeks.

  4. The backup is between one and three days old. The database may have become corrupt and the database owner requires a restore of a recent backup. The lost data can be easily or cheaply recovered. This is not casual stuff any more. You need to establish a rotation scheme, backup windows, and schedules.

  5. The backup is between 6 and 12 hours old. The lost data costs more to recover, but the database owner cannot afford to wait longer than a day, nor can he or she afford to return to a state older than 12 hours. The lost data of the past 12 hours can be manually recovered. We are getting onto delicate ground with this option. Backup/restore management is now a daily chore requiring operator duties and daily log reports, typical data center jobs.

  6. The backup is between two and five hours old. The database owner cannot afford to return to a database state older than five hours. The database must be brought current before the close of business. This involves data center operations, 24×7 attention of operators, access to DBAs via beepers and e-mail, and so on.

  7. The backup has captured all transactions saved to the transaction log files in the past sixty minutes. The database owner cannot afford to lose more than an hour’s worth of data. We now need highly sophisticated backup/restore operations, hourly backups, warm standby servers with log shipping (see Chapter 9), and so on.

  8. The backup has captured the last transactions saved to the transaction log files. The database owner cannot afford to lose any transactions up to the point of system or database failure. There is usually little chance of manually getting back lost transactions. We now need full-blown active backup, with redundancy and fail-over architectures to ensure that all transactions can survive a disaster.

If you look at the list, you can safely say that levels 1–5 address very low transaction-traffic situations and that any crash of a database or loss of the database server after the last backup will not cause the business serious harm. Such environments are typically order entry (not order taking) scenarios, and lost data can usually be recovered manually at low cost, reentered if need be.

Cost, of course, is relative. To lose a staff member for the day to rekey information is itself a costly exercise. Losing the day’s data may mean little to one company, but a lot to another (my motto is “time lost is life lost”). If you feel you should be at a higher level, but you or the database owner can only afford a lower level, you may be courting disaster, risking a critical application on a shoestring budget; something to think about if you outsource to data centers or roll your own. If you are supporting a low transaction rate application, however, backing up databases overnight is usually not a problem, because no new data is entered into the database until the following business day. You are usually safe as long as you diligently perform backups.

From level 6 to level 8, you start to get into a gray area that requires careful attention to human and technology resources. At level 8, full or differential database backups will not capture the most recent transactions; you need to be duplicating transaction logs at the same time the transactions are being flushed to the primary log. Some method of duplicating transaction logs to the last commit point is thus required. With online transaction processing systems, especially order-taking systems on the Internet, the loss of even one transaction can mean the loss of a customer. One lost customer can cause significant collateral damage.
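One conventional tool for reaching toward level 8 is the tail-log backup: after a media failure, you attempt to capture whatever the log still holds before restoring. A hedged sketch, with hypothetical names:

```sql
-- After a data-file failure, capture the tail of the log so the
-- last committed transactions survive the disaster. NO_TRUNCATE
-- lets the backup proceed even though the database is damaged;
-- NORECOVERY leaves the database in a restoring state.
BACKUP LOG Orders
    TO DISK = N'D:\Backups\Orders_tail.trn'
    WITH NO_TRUNCATE, NORECOVERY;
```

This only works if the log files themselves survived the failure, which is why level 8 ultimately demands redundant copies of the log, not just frequent backups of it.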

SQL Server 2005 has the mechanisms to continue backing up data the whole time it is in use, but as with any system, this can place a significant load on the server if you don’t have sufficient bandwidth, in both network and backup-device capability. However, it is not so much the backing up, duplication, or replication that takes time, but rather the restoration of the lost server, the lost database, the hard disks, and so on. When you are in a situation that dictates you cannot afford to lose any transactions written to a transaction log, you might as well stop at level 6 with respect to tape backup and restore, because you cannot achieve the higher levels with conventional tape backup practice (remember, even a restore of the most recent backup still takes time to execute). Continuous or very aggressive conventional backups of both databases and transaction logs are not going to put you back in business seconds or minutes after a meltdown.
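To see why restoration, not backup, dominates the downtime, consider the conventional restore sequence itself (names are hypothetical examples): the full backup is restored without recovery, then each log backup is rolled forward in order, and only the last one brings the database online.

```sql
-- Restore the last full backup, leaving the database unrecovered
RESTORE DATABASE Orders
    FROM DISK = N'D:\Backups\Orders_full.bak'
    WITH NORECOVERY;

-- Roll forward each log backup in sequence
RESTORE LOG Orders
    FROM DISK = N'D:\Backups\Orders_log.trn'
    WITH NORECOVERY;

-- The tail-log backup comes last; RECOVERY brings the
-- database back online at the final committed transaction
RESTORE LOG Orders
    FROM DISK = N'D:\Backups\Orders_tail.trn'
    WITH RECOVERY;
```

Every step in this chain takes time, which is why conventional restore alone cannot satisfy the availability side of levels 6–8.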

In this regard, we might then revisit the list and classify levels 1–5 as archival or general availability backups. This means that you treat the backups as necessary only to restore a database to a certain point in time, after which time the data is not there or is bad. Levels 6–8 then require an alternative classification in terms of database access.

Levels 6–8 need to address three needs:

  • Requirement 1, Focus on availability   The system must be back online as soon as possible so as not to lose any new transactions. For example, a customer on the Internet finds he or she cannot make a transaction or place an order and goes elsewhere if the situation persists longer than a few seconds. Money lost. At this level, you are already into automatic fail-over systems for service level.

  • Requirement 2, Focus on transaction recovery (disaster recovery)   All transactions written to the log up to the disaster “event horizon” need to be recovered. In other words, being offline is not as critical as losing transactions.

  • Requirement 3, Equal attention to availability and transaction recovery   We need both the availability level of requirement 1 and the disaster recovery functions of requirement 2, or the application or service will not fly. The risk of transaction loss is just too high, as is being offline.

If requirement 1 is the baseline you are dealing with, then you (the DBA) and the database owner (and you could be both) need to investigate redundancy options, such as warm standby servers, mirrors, replication, and fail-over clusters. Your options here fall under the industry terms and practices of high availability, the proactive arm of service-level management.

Transaction loss due to media or data file failure is not usually an issue with requirement 1. The system goes offline, and at the point of failure no further transactions come in. You still have your database, and you still have your transaction logs. Hard-disk crashes are dealt with using RAID-level hardware, as discussed in Chapters 6 and 9, so having the most current database available when services come back up is not an issue. The issue is how quickly the actual system, the server and the hardware, returns to operational state.

If requirement 2 is the baseline you are dealing with, then you need to consider having highly available backups or duplicates of your data and transaction logs. In other words, as soon as a hard disk crashes or a database or system gets destroyed, you would need to rapidly restore to the last transaction in the transaction log prior to the disaster. You would likely need both warm standbys and redundant hardware in addition to fully up-to-the-last-transaction duplicate copies of your databases and transaction logs. In requirement 2, the transaction log is the focus of your backup and duplication strategy, because only the transaction log, and the safe backup or alternate duplication of it (its availability after a crash), can completely close the void-time window for highly valuable OLTP-based systems.

Requirement 2 dictates disaster recovery procedures, but it is perfectly clear that the dividing lines between availability and disaster recovery are blurry and involve considerable overlap. To clearly see the two IT disciplines, think of DR as a reactive, after-the-crash activity (like restoring from tape), and availability as a proactive activity (making sure that, when the system does crash, the application can continue). When you need the services of both requirements 1 and 2, you are really looking at requirement 3, which will dictate what you need to meet both levels; disaster recovery and availability are in full swing.

There are many situations in which a company or a client will not need a fail-over service or a mirrored or redundant server as urgently as some data centers can provide. The database users or customers might be inconvenienced, but if they cannot go elsewhere, there is no need to go into warp drive to get the service back up. In other words, losing the service a few times a year does not constitute bad customer support.

A good example: A few years ago I won a contract from a state department to set up an online database system servicing a call center. Part of the RFP (request for proposal) was an application that allowed parents dependent on child support to phone a voice-activated system to obtain account balances and information about the status of their child support payments from the database. Agents on the back end were updating information as well as servicing callers with special needs on rotary dial phones. The RFP called for a duplicate call center to be established on the other side of the country. In my presentation, I advised that a completely redundant call center and database application would cost almost three times what a single center would, given the cost of the additional standby staff. I advised that it was a complete waste of money. It was not critical that the database never be offline for a day, because no one was losing money, a life, or sleep over not gaining access.

I cited the number of times the existing system was “downed” for service for several hours, and sometimes longer than a day, without anyone complaining. In that situation, it was more critical to have up-to-the-last-transaction restorability, because callers were actually updating information on every call.

What SQL Server 2005 offers in terms of system availability and scalability is covered extensively in Chapter 9. We will also look at log shipping in the next chapter, along with database mirroring, a new feature in SQL Server 2005. The rest of this chapter is dedicated to all DBAs who have lost their logs; it covers database and transaction recovery.




Microsoft SQL Server 2005. The Complete Reference
ISBN: 0072261528
Year: 2006
Pages: 239
