Whether you implement failover clustering, log shipping, replication, or a combination of these, they cannot supplant a solid backup and recovery plan. Things can, and will, go wrong. Even a well-planned highly available system is subject to user error, administrative error, procedural failure, or a catastrophic hardware failure. Creating, testing, and maintaining a database environment in which little to no data is lost and downtime is entirely avoided in a disaster is no trivial task. Because backup and restore are important, required parts of any disaster recovery plan, your backup and restore strategy should minimize both data loss and downtime. This chapter gives you the basic understanding to be able to proceed to Chapter 10, Implementing Backup and Restore, where you learn how to implement a backup and restore plan in your environment.
No matter what size the database or the availability requirements, restore is always an option. In some cases, such as in the event of accidental data modifications or deletions, it is the only option that lets you restore the database to the state it was in before the modification. But what does it mean to have a backup and restore strategy focused on high availability?
To be focused on high availability, you must be focused on whether the system is accessible and, if it is not, on how long it will be down. A successful disaster recovery plan lets you recover your database within your company's defined acceptable amount of downtime and data loss. If downtime must be kept to an absolute minimum, the key requirement for your backup strategy is recovery speed. How do you make your recovery fast? Are there database and server settings that can impact the recovery of your database? What options must be determined as part of your strategy? There are many possible backup and restore strategies, each of which offers different trade-offs between hardware costs, administrative complexity, potential work loss exposure (data loss in the event of a failure), performance during the backup, performance of batch operations or maintenance operations, and day-to-day performance of user operations around the clock. Finally, the options that you choose to employ could also affect the transaction log in terms of active size, backup size, and whether log backups are negatively affected during other operations.
For example, did you know that log backups are paused while a full backup is running? Do you know the impact of pausing log backups on other features that depend on frequent log backups, such as log shipping? Do you know what else could go wrong if the log backup does not occur? The decisions you make here are truly critical to the overall success of your recovery plan. But where do you start?
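The two backup types just mentioned are expressed with the BACKUP statement. The following is a minimal sketch; the database name and file paths are hypothetical, chosen only for illustration:

```sql
-- Full database backup; while this runs, log backups are paused.
-- Database name and paths are hypothetical.
BACKUP DATABASE Sales
    TO DISK = 'E:\Backups\Sales_Full.bak'
    WITH INIT

-- Transaction log backup, normally run on a frequent schedule
-- between full backups (and relied on by features such as
-- log shipping).
BACKUP LOG Sales
    TO DISK = 'E:\Backups\Sales_Log.trn'
    WITH NOINIT
```

If a scheduled log backup coincides with a long-running full backup, the log backup waits, which lengthens your potential work loss exposure for that window.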
First, you must know the barriers for which backup and restore will be the solution. This helps you determine where you are at risk. Second, you must know your environment. Knowing your data; the database structures; how the data is being used; changes made to database settings throughout the day, week, or month; and the acceptable amount of downtime and data loss (which could vary at different times of day) is important. You must be familiar with user, administrative, and batch processes so that you are aware of all that could fail and what was happening at the time of the failure. This information will help you estimate how much time is acceptable for repair and recovery, which in turn helps to dictate your hardware choices. Third, you must have a recovery-oriented plan that fully aligns with the process of database recovery and the restoration phases.
Using these key facts, you can decide among the backup types available and come up with the best strategy for your environment. Only after you fully understand each one of these factors should you determine which backup strategy is best. Unfortunately, administrators commonly define backup strategies by learning only the backup technologies available without considering recovery. This is exactly the wrong approach; instead, you should consider technology last and recovery first. If you do not have a recovery-oriented plan, you will more than likely suffer data loss and significant downtime.
Whether you are recovering from accidental data deletion, hardware failure, natural disaster, or other unplanned incident, you will want your recovery to be well thought out. There are really two categories of barriers that your backup and restore strategy is likely to overcome: hardware failure and application or user error.
You should not use backups to recover from hardware failure as a common practice, as it is likely your system already has hardware redundancy in place. Whether you are trying to set up a highly available server or just a production database server, you should always start by using some form of disk redundancy, such as RAID. In Chapter 4, Disk Configuration for High Availability, you looked at many disk considerations for the foundation of your database, and it is likely you have chosen some form of mirroring, striped mirrors, or striping with parity. However, what if you lose an entire RAID array? What if a single disk is lost in a RAID 5 set and the administrator replaces the wrong disk during the hot swap to replace the failed disk? In the case of hardware failure, you might choose to minimize downtime by bringing a secondary or standby server online; however, you will need to recover the failed primary. Often, this is done with backups.
You might think that your job would be great if there were no users or even other DBAs or system administrators. Quite frankly, no human intervention of any sort would be preferable for most! If you could create a database and then never use it, it would be much easier to manage. Nevertheless, if users can modify data, inevitably someone at some time will modify something incorrectly. Similarly, if system administrators have direct access to the production server, they have the ability to directly change your production data. In Chapter 14, Administrative Tasks to Increase Availability, you will look at the administrative processes that should be in place to maintain and secure a highly available system, but even with extensive preparations, accidents will happen. In fact, accidental damage is the most difficult from which to recover, and it can spread much further than just an incorrectly dropped table. Application, user, and process error could occur almost anywhere. Examples include the following scenarios:
Administrators or database owners (DBOs) dropping a table incorrectly because they are connected to their production and development databases all day long, within the same tools.
Users accidentally modifying the wrong data because they have direct base table permission to INSERT, UPDATE, and DELETE, and although they normally remember a WHERE clause, they forget to highlight it when executing their query.
Batch processes accidentally dropping the wrong database because the script performs a drop and re-create of the database. It is the first time the script is being run on that server, where a different database, named the same as the other database but supporting different functionality, already exists.
Batch processes accidentally creating objects or making changes to the wrong database because the initial database creation fails due to a path error or a not enough disk space error. With little or no error handling in the script, it continues to run incorrectly in the wrong database. All of the script's objects end up in the connected user's default database, which in this case is set to the master database.
All of these scenarios are possible, and these few examples are really only the tip of the iceberg. More important than the original failure is the recovery process that follows. Incorrect data modifications are the most difficult to recover from because the longer any problem is left unmanaged, the more likely you are to lose data. Additionally, the longer you wait, the more difficult it is to recover the data from the still potentially changing database.
How quickly do your users come running down the hall or pick up the phone to tell you about their accidental data deletion? Does the DBA immediately refer to the disaster recovery plan when he or she makes a mistake, or does he or she try to troubleshoot the problem, possibly compounding it? What is the best plan of recovery and, more important, how can you prevent some of these mistakes from happening in the first place?
To make a system both secure and highly available, you need to have administrative change control as well as maintenance processes in place to minimize direct access to production databases. There is a common question asked by DBAs: Is there a way that SQL Server system administrators can be prevented from dropping tables? This sometimes garners the answer: Get new system administrators. All kidding aside, this can and does happen. This problem is so common that many system administrators have learned quite a few tricks, and the next several paragraphs include a few ideas, not specific to backup and restore, picked up from them along the way.
To make inadvertently dropping a table more difficult, consider declarative referential integrity (DRI) and schema-bound views. However, DRI prevents a drop only when the table is being referenced with a foreign key constraint. A sales table, for example, often references other tables, but other tables do not always reference it. So how can you prevent an accidental table drop? Consider using a schema-bound view. If a view is created with SCHEMABINDING, then the table's structure cannot be altered (for all columns listed in the view), and the table cannot be dropped (unless the view is dropped first).
More Info
For details on how to create schema-bound views in Microsoft SQL Server, see Instant DocID#22073 on the SQL Server Magazine Web site at http://www.sqlmag.com . This article does not require a subscription.
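A minimal sketch of the schema-binding technique follows; the table and view names are hypothetical and used only for illustration:

```sql
-- Hypothetical table; the column list is illustrative only.
CREATE TABLE dbo.Sales
(
    SalesID  int      NOT NULL PRIMARY KEY,
    Amount   money    NOT NULL,
    SaleDate datetime NOT NULL
)
GO

-- A schema-bound view must use two-part object names and an
-- explicit column list (no SELECT *).
CREATE VIEW dbo.SalesProtect
WITH SCHEMABINDING
AS
    SELECT SalesID, Amount, SaleDate
    FROM dbo.Sales
GO

-- DROP TABLE dbo.Sales now fails with an error stating that the
-- table is referenced by the view; the view must be dropped first.
```

The view need not ever be queried; its existence alone guards the table against an accidental DROP or a schema change to the listed columns.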
To prevent data from being incorrectly modified, consider eliminating all direct access to the base tables. Applications designed to manipulate the data are often best; however, users might then require direct ad hoc access to your data. Is it really necessary? This requirement usually indicates that the users are not getting the information they need and the developers gave up. Regardless, creating boundaries within the ad hoc environment is better than complete chaos that is marked by bad queries, poor performance, and unhappy users. Instead of granting direct access (SELECT, INSERT, UPDATE, and DELETE) to the base tables, create views, stored procedures, and functions to handle the data access. Using these objects, you can add error handling, trap unwanted change, manage data redundancy, and generally prevent accidental modifications where a WHERE clause has been left off inadvertently.
If this seems like a lot of work, you might be surprised at the secondary benefits. Typically when users write ad hoc queries, performance suffers because of mistakes in writing Transact-SQL. Users who do not write a lot of Transact-SQL code are prone to writing poorly performing code. You can optimize objects through the development and quality assurance testing of the views, procedures, and functions, providing a better outcome for everyone.
Finally, to prevent mistakes in batch processing, consider error handling and a scheduled code review with all key personnel. The code review allows other experienced DBAs to determine if anything could interfere with what they are responsible for. Additionally, another set of eyes to review the batch could prevent something that might otherwise be a problem with running the batch in the existing environment. This need not be a line-by-line code review (although it could be), but it should at least explain the general principles behind the script's execution. Give special attention to all components of the script that drop or modify already existing data, objects, or databases. If the script includes proper error handling, the time for the code review could be reduced.
As a trick in batch processing, consider using RAISERROR. There is a special value for the state parameter of RAISERROR that might help by forcing the termination of a complex script and preventing further execution when the script is processing incorrectly. Raising an error with a state of 127 causes the script to stop processing, and this can be especially helpful when the script might end up processing in the wrong database. However, setting the state value does not always appear to terminate the session. Applications such as SQL Server Query Analyzer might automatically reconnect when a connection is broken. To fully realize the benefit of the state option, you need to use a tool such as Osql.exe, which does not reconnect automatically after the connection is terminated.
More Info
For details on some of the benefits of using RAISERROR in your Transact-SQL statements, see Instant DocID#22980 on the Web at http://www.sqlmag.com . This article does not require a subscription.
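The technique can be sketched as a guard at the top of a batch script. The database name and message text are hypothetical; the point is the state value of 127:

```sql
-- Guard clause at the top of a destructive batch script: verify
-- we are connected to the intended database before proceeding.
-- 'Sales' is a hypothetical database name.
DECLARE @db sysname
SET @db = DB_NAME()

IF @db <> 'Sales'
BEGIN
    -- State 127 signals the client tool to stop executing the
    -- script; osql honors this and terminates the connection.
    RAISERROR ('Connected to %s instead of Sales; halting script.',
               16, 127, @db)
END

-- DROP/CREATE statements and other destructive work would follow
-- here, reached only if the check above passed.
```

Run through Osql.exe, the remainder of the script never executes against the wrong database; run through a tool that reconnects automatically, the guard is weaker, which is why the choice of client tool matters.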
However, be aware that nothing is guaranteed. Even if you prevent many errors using these techniques, the database probably will still need to be recovered after some form of human or application error.
Recovering a database after hardware failures or incorrect data modifications can be quite complex, as there are numerous elements that can fail. Even more numerous are the options for recovery. The failure might be isolated to one disk, one RAID array, one table, a group of tables, or only part of the data. Remember, to create a strategy for high availability you want to recover as fast as possible. If the damage is isolated, can your restore and recovery be isolated? Possibly, if you plan for it.