Providing Good Recoverability | Mastering Microsoft Exchange Server 2007 SP1

A wise technology curmudgeon once said, "Don't create a backup plan, create a restore plan." That is good advice. An administrator will typically look at the data on a server and select the data that they think they need to back up. A better approach is to back up the data that you need to restore and back it up frequently so you will not lose more data than you can afford to lose.

What Does Good Recoverability Mean?

Simply put, providing good recoverability means that you have available the information necessary to meet your organization's recovery objectives. This means that you are backing up everything you might need to restore and that you have procedures in place to ensure that you are able to restore the information that you need to restore. Here are some important requirements of a good recoverability plan:

You have documentation necessary to recover every server role.
You have analyzed the information necessary to perform a complete recovery of each Exchange server role.
You perform regular backups of all data.
Your backups allow you to meet your recovery point objectives.
Your backups allow you to meet your recovery time objectives.
You perform periodic verifications to ensure that you can restore data, even if you are simply restoring to a recovery storage group.
Your transaction logs are available if you need to restore the last full backup.

Recovery Point Objectives

The term recovery point objective (RPO) defines the time to which you are able to recover data. The RPO could be the previous day's backup or it could be a specific number of hours. If you think about it in terms of a traditional tape backup solution being your only method of recovery, then the RPO can be a long period of time. If you specify that your RPO is the previous day's backup, then a catastrophic failure of your Exchange server's storage system could conceivably mean you could lose up to 24 hours' worth of data. Depending on the type of data that you are backing up or restoring, that can be a lot of critical data, especially for an e-mail system that is being updated continually. If you specify that the RPO is a fixed number of hours (such as six hours), you must perform some type of backup every six hours.
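The relationship between a backup schedule and an RPO can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the schedule is modeled as whole hours of the day that repeat daily, and the worst case is assumed to be a failure just before the next backup starts.

```python
import math
from datetime import timedelta


def largest_backup_gap(backup_hours):
    """Longest stretch between two consecutive backups in a daily schedule.

    backup_hours is a list of hours of the day (0-23) at which a backup
    starts. The schedule is assumed to repeat every day, so the gap from
    the last backup of one day to the first backup of the next counts too.
    """
    hours = sorted(set(backup_hours))
    if not hours:
        raise ValueError("at least one backup per day is required")
    gaps = [later - earlier for earlier, later in zip(hours, hours[1:])]
    gaps.append(24 - hours[-1] + hours[0])  # wrap around midnight
    return timedelta(hours=max(gaps))


def meets_rpo(backup_hours, rpo):
    """True if the worst-case data loss never exceeds the RPO."""
    return largest_backup_gap(backup_hours) <= rpo


def backups_needed_per_day(rpo):
    """Minimum number of evenly spaced daily backups for a given RPO."""
    if rpo <= timedelta(0):
        raise ValueError("RPO must be a positive duration")
    return math.ceil(timedelta(days=1) / rpo)
```

With a single nightly backup, the largest gap is a full 24 hours, which matches the exposure described above. A six-hour RPO requires at least four evenly spaced backups per day.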

Determining your RPO and the maximum amount of acceptable data loss is important when it comes to planning your backup capacity and backup schedules. There are technologies that can help you to reduce the RPO for your organization by ensuring that you reduce the likelihood of a failure or that you keep your databases and transaction logs more highly available:

Implementing local continuous replication (LCR) or clustered continuous replication (CCR)
Storing databases and transaction logs on SAN- or iSCSI-based storage systems that are not affected by failure of an Exchange server
Using SAN-based replication solutions that can keep replicated copies of databases and transaction logs

Of course, a simple way to reduce your RPO is to perform more frequent backups of your data.

Recovery Time Objectives

The recovery time objective (RTO) is the targeted amount of time it takes you to restore service for a particular application. Depending on the type of outage you experience, you may have different types of RTOs. An RTO for a single failed database might be one hour, while recovery of an entire failed server might be eight hours.

For Exchange, there are two types of recoveries that you should consider. The first type is called a dial-tone restore; messaging services are restored quickly, but actual Exchange data is restored at some point in the future (hopefully the very near future). The second type is restoring the entire messaging system (messaging services plus data) at the same time.

Defining and understanding your RTO can help you decide if your current processes, backup hardware, and backup software are sufficient.
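One way to judge whether your backup hardware is sufficient is to estimate how long a restore would take and compare that with the RTO. The following Python sketch assumes a very simple model: restore time is the amount of data divided by a sustained restore rate, plus a fixed overhead for tasks such as locating media. The function names and every number used with them are hypothetical, so measure the real figures during a test restore.

```python
from datetime import timedelta


def estimated_restore_time(data_gb, restore_rate_gb_per_hour,
                           fixed_overhead=timedelta(0)):
    """Rough restore duration: fixed overhead plus data size / throughput.

    Both the throughput and the overhead are assumptions that should be
    replaced with values measured during an actual test restore.
    """
    if restore_rate_gb_per_hour <= 0:
        raise ValueError("restore rate must be positive")
    transfer_time = timedelta(hours=data_gb / restore_rate_gb_per_hour)
    return fixed_overhead + transfer_time


def meets_rto(data_gb, restore_rate_gb_per_hour, rto,
              fixed_overhead=timedelta(0)):
    """True if the estimated restore finishes within the RTO."""
    estimate = estimated_restore_time(
        data_gb, restore_rate_gb_per_hour, fixed_overhead)
    return estimate <= rto
```

For example, a hypothetical 100 GB database restored at 50 GB per hour with 30 minutes of overhead takes an estimated two and a half hours. That would miss a one-hour RTO for a single database but fit comfortably inside an eight-hour RTO for a full server.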

Information Necessary for Recovery

Probably the worst possible thing that can happen to an Exchange server is a complete server failure involving the operating system and data disks. This type of failure would require not only an operating system rebuild, but reinstallation of Exchange, reconfiguring any customizations that have been made to the system, and restoring the data. This type of recovery is called a bare metal restore.

If any one of your Exchange servers completely or partially failed, would you have the information necessary to rebuild it? First and foremost, the Active Directory data must be available since that is where much of the Exchange server configuration is located. Medium and large businesses should always have multiple Active Directory domain controllers, and the domain controllers and Exchange servers should run on separate hardware. Even a small business can simplify recovery by ensuring that Exchange Server is not installed on Active Directory domain controllers.

There is, of course, other important information that you will need in order to recover a server. Here is a list of the documentation that you should collect to simplify recovery:

Server disk configuration (drive letters, capacity, LUN configuration if applicable)
Backups of certificates used for SSL (including private keys)
OWA, Autodiscover, ActiveSync, and other web services virtual directory configurations
Security customizations such as configuration files created by the Security Configuration Wizard
Any custom scripts that you have created

Building a Crash Cart

One of the first things that usually happens when even the most skilled IT department realizes that they have to do a server recovery is the "where is that?" syndrome. This naturally occurring phenomenon is a result of no one knowing where to find things that are needed right away. Too frequently, CD-ROMs disappear, product keys get lost (or they are in a mailbox that is on the server that is down), and operating system setup CDs must be downloaded and burned again.

We recommend building a disaster recovery kit or a crash cart for each type of server that you support. This can be something as simple as a three-ring binder that contains the following:

Escalation procedures such as who makes the decision to restore
Contact list including cell phone numbers for IT management and a notification list of people that should be notified of an outage
Server hardware, Windows, and Exchange documentation
Product keys, activation codes, and license key disks
Windows operating system and service pack CD-ROMs
Hardware vendor CD-ROMs (such as a CD-ROM used to run setup for a particular piece of hardware)
Exchange Server 2007 DVD
CD or floppy with device drivers that are currently in use on the server
Third-party CD-ROMs such as antivirus software and backup software
Anything else that is critical to the server being rebuilt

Once you have built this kit, set it aside and do not loan any of the contents to anyone. If you have a shrink-wrap machine, it might be a good idea to shrink-wrap the binder so that no one is tempted to borrow anything.

Testing Backups

Everyone who writes about backups always warns you to test your backups. Don't let the monotony of repetition lead you to ignore this warning. As many have said, a backup is useless if you can't restore from it. We recommend testing backups when you first set them up and then again whenever you change anything from server software and hardware to backup hardware and software. Try restoring to hardware that is as much like the hardware on your real server as possible, though sometimes it is a good idea to restore to really different hardware just to be sure you can handle some diversity.

A number of system managers tell us that they just don't have the time to test backups. Actually, they usually use the past tense, as in, "I just didn't have time to test my backups." And this is usually in response to calls from clients in the middle of a résumé-producing event. These are usually administrators who have drifted far up the Exchange creek (crashed server) and now find themselves without a paddle (backup).

What can we say? If you really don't have time to test backups, tell your boss and ask for more resources or work with the boss to prioritize the tasks you have. While you're talking with the boss, be sure to add that without backup tests, you can't guarantee you'll be able to bring e-mail back up in case of a hardware or software failure. You can also use this argument when requesting the kinds of redundant hardware that we discussed in Chapter 15. If nothing else comes of these discussions, you will have at least set your boss's expectations at a more realistic level should all or part of all heck break loose.

Understanding Online Backups

The Exchange database file structure is a pretty complex thing. At the most basic level, the database is organized in a modified "b-tree" structure. All data is stored in 8KB pages; each page consists of a "next page" pointer, a "previous page" pointer, a checksum, and some data. If any of these pieces of the page of data becomes corrupted, the database itself could be in serious jeopardy.

Note

Whichever backup type you choose to use, it should be compatible with Exchange Server 2007.

There are essentially four methods of Exchange backups that you can perform against an Exchange database. The first method is a backup that is not Exchange aware. This backup type makes a backup of the actual file (or possibly the disk blocks in the case of some hardware solutions). If you use any method of standard tape backup software and attempt to back up the Exchange database files, you won't succeed unless you dismount the database files or stop the information store service. Some backup systems have open-file agents that can back up the Exchange database file even when it is in use. These should not be used with Exchange Server unless they have been approved by Microsoft for use with Exchange. These types of backups also do not purge the transaction log files once the backup is completed.

The second method uses the Exchange APIs to perform the backup. This method is called a streaming backup. This backup type does not back up the actual database file but rather requests the database a page at a time. As the Exchange database engine reads each page of the database, it verifies the next page and previous page pointers, reads the data, and checks that the checksum is correct. If the backup process detects corruption in the database, the backup halts. This is a feature of Exchange that is designed to help prevent you from continually making backups of corrupted data.

However, if the backup completes successfully, you know that the database has no problems with corrupted pages. Once the database has been successfully backed up, the transaction logs are backed up and then purged.
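The verify-as-you-read behavior of a streaming backup can be illustrated with a toy model in Python. This is not the real Exchange page format or backup API. The `Page` class, the use of CRC-32 as the checksum, and the simple linear chain of pages are all assumptions made purely to show the idea of halting on the first corrupted page.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Toy stand-in for a database page (not the real on-disk layout)."""
    number: int
    prev_page: Optional[int]
    next_page: Optional[int]
    data: bytes
    checksum: int


class CorruptPageError(Exception):
    """Raised when a page fails verification; the backup stops here."""


def make_page(number, prev_page, next_page, data):
    """Build a page whose checksum matches its data (CRC-32 is an assumption)."""
    return Page(number, prev_page, next_page, data, zlib.crc32(data))


def stream_backup(pages: List[Page]) -> List[bytes]:
    """Read pages one at a time, verifying each before it is backed up."""
    backed_up = []
    for index, page in enumerate(pages):
        # Recompute the checksum and compare it with the stored value.
        if zlib.crc32(page.data) != page.checksum:
            raise CorruptPageError(f"page {page.number}: checksum mismatch")
        # In this toy chain, the pointers must name the neighboring pages.
        expected_prev = pages[index - 1].number if index > 0 else None
        expected_next = (pages[index + 1].number
                         if index + 1 < len(pages) else None)
        if page.prev_page != expected_prev or page.next_page != expected_next:
            raise CorruptPageError(f"page {page.number}: broken page pointer")
        backed_up.append(page.data)
    return backed_up
```

In this sketch, a run that returns normally has checked every page, which mirrors the guarantee described above. A single damaged page raises `CorruptPageError` and nothing after it is copied.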

The third method of backup is a Volume Shadow Copy Service (VSS) backup; a VSS backup usually works in conjunction with third-party hardware and software and backs the data up at the disk level. The advantage of a VSS backup system is that it is much faster than streaming tape backups. If the VSS backup is compatible with Exchange, it will perform a verification to ensure that the database is not corrupted and purge the transaction logs just like a streaming backup.

The fourth method of Exchange backup is a mailbox-level backup. This type is more commonly referred to as a brick-level backup and is supported by a number of third-party vendors. These backups use MAPI client software and back up each mailbox message by message and folder by folder. The advantage of these is that you can restore a single folder or single message without having to restore an entire database. However, with brick-level backups, there are often problems getting the backup client to work properly and they usually take at least three to four times longer to back up a mailbox database than a streaming backup takes. We do not recommend brick-level backups.

These are the four basic types of Exchange backups. Undoubtedly, though, you will come across some specialized or vendor-specific solutions. While many of these solutions will have some merit, ultimately you should consider a solution that you know will allow you to recover the data you need to recover. If you choose a specialized backup and recovery solution, ensure that the vendor offers good support and that you are able to contact its support department when necessary.

What Can Go Wrong?

Anything that can go wrong will go wrong. Pessimists make good system administrators, or at least pessimists that prepare for any possible contingency do. When you are making plans for system backups, think about each type of restore, outage, or disaster that might occur within your organization and how you can mitigate those risks. While we are going to discuss most of these in more detail, here is a list of possible issues that would require administrator intervention:

Deleted message or folder: Users can recover their own deleted items or folders using the Recover Deleted Items feature of Outlook or Outlook Web Access. The only requirement is that the recover deleted items cache is enabled for the mailbox database. By default, deleted items and folders are retained for 14 days. If a message or folder cannot be retrieved, the administrator can always restore the database on which the mailbox resides to a recovery storage group.
Deleted mailboxes: Accidents do sometimes happen; user accounts and mailboxes can accidentally be deleted. By default, a deleted mailbox can be reconnected to a user account (that does not have a mailbox) for 30 days after it was deleted.
Corrupted databases: While a rare occurrence, an Exchange database can become corrupted. If you are using local continuous replication (LCR), you can swap out the production database for the LCR database. If you are using clustered continuous replication (CCR), you can move the clustered mailbox server to the passive node of the cluster and start using the CCR copy of the database. If you are not using either LCR or CCR, then you will need to restore the database from the last complete backup.
Server failure: The worst-case failure situation is when the entire server fails. If you have not clustered the server (in the case of a Mailbox server role) or provided a high-availability solution such as network load balancing for Client Access, Hub Transport, Unified Messaging, or Edge Transport server roles, then you will need to rebuild the server (possibly from scratch) in order to restore the messaging services the server is providing.
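The default retention periods mentioned in the list above lend themselves to a quick check of which recovery path is still open. The Python sketch below is illustrative only. The function names are hypothetical, the 14-day and 30-day values are the defaults quoted above (your databases may be configured differently), and it treats the retention period as a hard cutoff even though actual purging depends on when database maintenance runs.

```python
from datetime import datetime, timedelta

# Defaults quoted in the text; real values are set per mailbox database.
DELETED_ITEM_RETENTION = timedelta(days=14)
DELETED_MAILBOX_RETENTION = timedelta(days=30)


def still_retained(deleted_at, now, retention):
    """True if the deletion is recent enough to fall inside the retention period."""
    age = now - deleted_at
    return timedelta(0) <= age <= retention


def item_recovery_path(deleted_at, now,
                       retention=DELETED_ITEM_RETENTION):
    """Suggest how a deleted message or folder could be recovered."""
    if still_retained(deleted_at, now, retention):
        return "user recovers it with Recover Deleted Items"
    return "administrator restores the database to a recovery storage group"


def mailbox_can_be_reconnected(deleted_at, now,
                               retention=DELETED_MAILBOX_RETENTION):
    """True if a deleted mailbox is still inside its retention period."""
    return still_retained(deleted_at, now, retention)
```

A message deleted 9 days ago is still a self-service recovery under the defaults, while one deleted 19 days ago calls for a restore to a recovery storage group. A mailbox deleted 19 days ago can still be reconnected.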