Planning for Recovery

To mitigate the risk and extent of data loss, one of the DBA's most important tasks is database backup and planning for recovery. You need to develop a backup plan that will minimize the data loss and is able to be implemented within the maintenance time window allowed. You need to choose the best SQL Server backup capabilities that will achieve the preferred backup plan that meets the business continuity and data loss requirements. Also, you must set up the backup procedure and monitor it every day to ensure that it is working successfully. You must validate that the database backup will restore properly. An organization may be current with their backups and may think that they have the backups to restore their database, only to find that the database was corrupted and some of the recent database backups will not restore. Cases like these can go undiscovered for months until someone needs to restore a database and finds out that it is not recoverable. To reduce this risk, you should run the database-consistency checks against each database and design a process to test the recoverability of the database backup. Additionally, the database backups should be sent offsite to protect them in case of a local disaster, but keep local copies of recent backups, if you need to perform a quick restore. Another critical task is disaster recovery planning. If the organization data center were to be completely destroyed, you should be able to quickly deploy a new data center with minimum data loss and minimum business disruptions. Disaster recovery planning is not complete until a team periodically simulates a data center failure and proceeds through the test drill to deploy a new data center.

Recovery Requirements

Any backup planning should start with the end goal in mind: the recoverability requirements. Categorize each database server and database with respect to the recovery requirements. Some database servers are mission critical; they keep the business activity going, and when they are down the business will incur a financial loss. Other servers may have decision support or reporting functions that are less critical. Before you can begin to plan, you have to know what items the plan should include.

Determine the availability requirements for each database. How critical is it to maintain the business daily activity, and what is the financial loss if this database is not available? Can the business continue to operate if the database is down, and how long can the business afford to have it down? Also, what part of the database must be online? You can consider a piecemeal restore to reduce your restore time, especially on larger databases where the restore can take a long time. Determine what parts of the database must be available, and arrange the data into filegroups so that you can recover the most critical filegroups first. Archived data or reporting data is less critical and can be recovered last.

For each database, consider the financial loss during downtime that the business will incur, as that may determine the database recovery priority. Additionally, the organization may allocate newer, redundant hardware and RAID array with a high-availability solution to mitigate downtime. The organization may also consider faster and more backup devices to quickly restore the database.

When you ask how much data the organization can afford to lose, oftentimes, the answer is none. Zero data loss can be achieved but at a very high cost in hardware, infrastructure, and processes that would need to be implemented. It is more feasible to identify which data-loss scenarios to protect from and plan for those risks. There may be scenarios that are less likely to happen and some data-loss risk may be acceptable, as protecting from every possible event will drive up the total cost of ownership. Furthermore, identify the databases that are most critical to reduce data loss and plan the processes around those databases.

Determine how easy or difficult it would be to recreate lost data for each database. For certain databases, data can be easily recreated by either extracting data from another system or from flat-file data loads. Typically, decision-support databases use ETL tools to extract data; for example, if some unrecoverable data loss had occurred, the ETL tool can be executed to reload the data.

What is the acceptable downtime in a media failure, such as a failed disk drive? As disk technology keeps becoming less expensive, most organizations will deploy databases on a fault-tolerant disk array that will reduce the exposure of one of the disk drive failing, causing the database to become unavailable. For example, on a RAID 5 set, a loss of a single drive will cause a noticeable performance slowdown. Also if a second drive in the same RAID 5 were to fail, the data would be lost. To mitigate this risk, have spare drives in the disk array system, and get a service-level agreement from the hardware provider to deliver and install the drives. Another scenario to consider that often happens is a department inside the organization deploying a database in a less than ideal hardware environment. With time, the database becomes mission critical to that department, but it lives under the DBA's radar with no accountability. The DBA should attempt to identify all database sources within the organization and develop a recovery plan.

Consider the size of each database. If an organization has databases that are terabytes in size, the restore time will be considerably longer. You may need additional strategies, such as a piecemeal restore. Also, determine if the database has any external dependencies with other databases that require that both databases be restored for users to be able to perform their daily activity. Find out if there are there any linked servers, external applications, or mainframe connectivity that this database has dependencies with.

Make a list of local disasters that can happen in the area and plan how to react to each. In case of a local disaster (for example, a server lost in a fire), what is the acceptable downtime? Prioritize the databases starting with the most critical and identify the available hardware that can be allocated for redeployment and where it is located.

Identify the staff required for backup, restore, and disaster recovery. They need to understand the disaster recovery procedures and where they fit in these procedures. Record what times each staff member will be available, their contact numbers, the chain of communication, and the responsibility of each member. Determine the chain of command, and find out, if the lead is unavailable, if the backup members have the expertise to carry on the duties for backup, restore, and disaster recovery. Find out the expertise of the staff and what additional training will they need to support the environment. Identify any training classes that may be beneficial.

Finally, document any information about store SQL jobs, linked servers, and logins that may be needed when the database is restored onto another database server.

Data Usage Patterns

Part of your recovery plan should include analyzing how your data is used in a typical scenario. Determine for each database how often the data is modified. You'll require different backup strategies for a database that may have a data load once a day than for others that may be read-only or some that change every minute. Separate the tables that get modified from read-only tables. Each type can be placed on different filegroups and a backup plan developed around it.

Identify the usage pattern of the databases during the day to determine the backup strategy to use. For example, during high activity, a DBA may schedule more frequent differential or transaction-log backups, while full backups may be performed during off-peak hours.

Determine the disk space used by the transaction log during peak times and its performance. For example, during peak times, the transaction log may fill the available disk drive allocated to it. Moreover, during peak time, the number of disks allocated for the transaction log may not be adequate to sustain the database performance. The database recovery model setting will affect both disk space and performance.

For a database in the full recovery model, consider switching to bulk-logged mode during bulk operations to improve performance, as that will do minimal transaction logging. Prior to the start of the bulk operations, you should at minimum perform a transactional or differential backup to protect from the risk of a data-drive failure when the tail transaction log may not be accessible.

Maintenance Time Window

Oftentimes, the backup strategy is dictated by the maintenance time window available to perform database defragmentation, backups, statistics updates, and other maintenance activities. To keep enhancing the customer experience, organizations are demanding more timely information and giving customers greater access to information, and customers are more dependant on having this information. This creates a challenge to create the best customer experience, mitigate the risk of data loss, and enable quick restores if the database system fails.

The task of the DBA is to find the best backup strategy to meet the organization's business requirements. Usually, the maintenance time window is limited. SQL Server 2005 implements various backup options that can be used in combination to meet these requirements. Some of the challenges you have to face when you design a backup strategy are:

Available backup time may be limited in a mission-critical, highly available database. Oftentimes, organizations have service-level agreements and must finish their maintenance by certain times of the next morning when users are back on the system. If the backup takes longer, it may make other database activities start later and not finish by the time users log in to the system, costing the organization opportunity loss.
There may be a large number of databases to back up during the maintenance time window. However, you can try to optimize your time for all available backup media by performing concurrent backups within the capacity of the database server.
As the database keeps growing, it will put pressure on the maintenance window. Additional backup devices, higher- performance database servers, and faster IO may be needed to relieve the pressure. Sometimes the maintenance time window can be increased, but oftentimes it cannot. You may need to consider a SAN copy solution to speed the backup process.
Other database activities are likely performed on all of the databases in the database server (for example, database-consistency checking, defragmentation, update statistics, and perhaps data loads). As the database grows, these other activities may take longer to perform, too.
Software updates, security patches, service packs, and database structure updates may need to fit within this maintenance time window.
Full text catalogs may need to be processed.
As more organizations see the benefit of decision-support systems such as SQL Server Analysis Services, the analysis services database may need to be processed during this time.

As a result, the solution will need to address and balance among the following goals:

A backup strategy that can be accomplished within the allocated maintenance time window
A restore strategy based on the backup strategy that will require the minimum number of steps and complexity to recovery the databases
Minimal to no data loss
The backup strategy that is least expensive to maintain and manage

To meet these requirements, a small database can use a full database backup every night. However, as the database becomes larger, that may not be possible. A good next step is to perform a full database backup on the weekend and nightly full differential backups. As the database becomes larger, consider moving read-only and read/write data to different filegroups, then using full partial backups during the weekend and partial differential during the night. As the database continues to grow, consider an individual file nightly backup. A file backup solution is more difficult to maintain, and another preferred solution for these larger database may be a SAN copy operation. It is a more expensive solution, but for these larger databases, it can better meet the backup and restore requirements with the fewest steps and highest availability.

Also consider how the database is to be used. If the database is mission critical, apply redundancy around the hardware; for example, a failover cluster solution with fault-tolerant disk drives. Additionally, identify the data loss that the organization can afford and plan to back up the transaction log to meet the time requirement. Plan to use the full recovery model as at that point, so that if the data files are lost, the tail transaction log may be accessible to restore to that same point in time.

The other important component is the number of steps required to restore the database and the amount of time required. The more steps required, the higher the probability that something may not work or one of the backup images may be unavailable. When a database failure occurs, the goal is to restore it as fast as possible.

Recovery Requirements

Data Usage Patterns

Maintenance Time Window

Other High-Availability Solutions