6.9 Best practice 9: Disaster-recovery management | Mission-Critical Microsoft Exchange 2003: Designing and Building Reliable Exchange Servers (HP Technologies)

6.9 Best practice #9: Disaster-recovery management

Although we will never be able to prevent all data loss and catastrophes or plan for every contingency, there are some good disaster-recovery management practices that should be part of your deployment routines to help alleviate problems when they occur. The first item is the creation of an Exchange Server disaster-recovery toolkit. This toolkit is unique to an Exchange deployment and should go beyond the typical kit that your organization may have for a Windows server. It must “add a layer” and provide tools for successfully recovering not only the operating system, but also Exchange Server. The disaster-recovery toolkit ensures that all materials and documentation are available when and where you need them in the event a disaster occurs. The following are some typical items that I recommend be included in your disaster-recovery toolkit.

A server hardware configuration worksheet: Provides documentation on how the server hardware components were installed and configured. Most important are hardware CMOS/BIOS settings configured using the system configuration program. Critical to recovery is the configuration of the server storage including how disk devices are configured and RAID levels if applicable. Don’t forget to keep this updated when configuration changes are made!
An operating system configuration worksheet: Provides documentation on installation and configuration parameters needed to return the operating system to the same state it was in before the disaster occurred. This should include any additional device drivers or utilities installed, as well as registry settings that were modified from defaults. If the server is a Windows domain controller or global catalog server, the worksheet should contain any information settings pertinent to this configuration.
An Exchange Server configuration worksheet: Provides information on Exchange-specific server configuration such as services installed and configurations. Critical to recovery operations is configuration data on how Exchange storage groups and databases are configured and allocated across server storage. Details on where log files and databases are stored will be critical to successful restoration to the last known good state. The Exchange worksheet should also contain data about any Exchange-specific configurations such as routing and administrative groups the server belongs to or Active Directory, SMTP, IIS, and X.400 connector settings.
A contact information worksheet: Provides a source listing of proper individuals to contact in the event of an emergency or if specific configuration or security data is required. May also contain escalation procedures and contacts for both hardware and software issues encountered during recovery operations.
Recovery disks and CD-ROMs: Includes all necessary software for successful installation, setup, and recovery of the Exchange Server including Windows emergency repair disk, hardware system configuration disks, device driver disks/CDs, and third-party software disks/CDs. May include Windows and Exchange Server CD-ROMs, as well as other media for third-party software if they are not readily available. Make sure those CDs that need to be bootable indeed are!

These key components form the disaster-recovery toolkit for your Exchange deployment. Your toolkit should also include any components specific to your deployment or organizational needs. The disaster-recovery toolkit forms a solid cornerstone for excellent configuration-management practices in your Exchange deployment. Configuration management with disaster-recovery in mind will ensure that disaster-recovery operations are performed efficiently and smoothly across the entire population of servers in an Exchange deployment. In Chapter 10, I will discuss configuration management in greater detail and beyond the limited scope of disaster-recovery. The following are some key points that are part of good configuration-management practices.

Tightly control the configuration of all Exchange servers.
Document all server configurations and keep a change log.
Ensure that hardware device drivers and firmware updates are consistent across the deployment.
Ensure that operating system and application service packs are consistently applied across the deployment.
Use like hardware configurations for all Exchange servers.
Deploy management software that provides configuration management capabilities.
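
The consistency checks in the list above lend themselves to a simple automated comparison. The following is a minimal sketch, assuming you already keep a per-server record of component versions; the server names, component keys, and version strings are hypothetical placeholders, and the rule of treating the most common version as the baseline is my own simplification rather than a feature of any particular management product.

```python
from collections import Counter

# Hypothetical inventory: server name -> recorded component versions.
# All names and version strings here are illustrative placeholders.
inventory = {
    "EXCH01": {"os_service_pack": "SP1", "exchange_service_pack": "SP1", "raid_firmware": "2.34"},
    "EXCH02": {"os_service_pack": "SP1", "exchange_service_pack": "SP1", "raid_firmware": "2.34"},
    "EXCH03": {"os_service_pack": "SP1", "exchange_service_pack": "SP1", "raid_firmware": "2.18"},
}


def find_drift(inventory):
    """Return {component: {server: version}} for servers that differ
    from the most common version of each component."""
    drift = {}
    components = sorted({key for versions in inventory.values() for key in versions})
    for component in components:
        # A server that does not record the component at all shows up as None.
        values = {server: versions.get(component) for server, versions in inventory.items()}
        baseline, _ = Counter(values.values()).most_common(1)[0]
        outliers = {server: v for server, v in values.items() if v != baseline}
        if outliers:
            drift[component] = outliers
    return drift


print(find_drift(inventory))
```

Run against the sample inventory, the sketch flags only the RAID firmware on EXCH03. A report like this is a reasonable input to the change log, since it shows where the documented configuration and the deployed configuration have diverged.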

Another solid practice for successful disaster-recovery management is a good media rotation scheme. Here I am referring to the rotation of physical media to an off-site location and not to a backup strategy (full, incremental, differential, and so forth). Many organizations rotate tapes to both on-site and off-site “vault” locations. The purpose of this is to protect the media from disasters like fire and flood at a primary business location. If you store your backup media in the closet and the building burns down, there will be no media available from which to perform recovery operations. When planning for Exchange disaster-recovery, develop an off-site rotation scheme that provides protection from disaster as well as a good tracking mechanism for knowing where a particular media set is located. Some backup software applications will make recommendations and provide suggested routines and techniques for this practice. Don’t leave out this often neglected part of successful disaster-recovery planning.

Disaster-recovery for Exchange is severely handicapped if the disaster-recovery management practice of backup verification and validation is neglected. If you are never sure whether your backups are good, you can never be certain that they can be used to recover your data come crunch time. Your ability to restore Exchange data depends on the quality and usability of your backup sets. Many things can happen to a backup set between backup and restore that render the set useless. Therefore, it is important to implement processes that will ensure that these backups are reliable. Performing disaster-recovery drills with personnel may cover part of this. However, I would not rely solely on disaster-recovery drills to provide this function. For complete assurance, you should form a triple-verification strategy.

First, verify the event. It is important for you to know that the backup was successful and completed without any errors. As I discussed in earlier chapters, the Exchange database engine verifies every physical page of the database during backup (as well as every transaction log record during recovery) to ensure that corrupt data is not written to backup media. When problems are encountered, however, errors are only reported to the Windows event logs. It is important that you establish measures that will check both Windows event logs and backup application logs to ensure that a backup completed without error. This should be done at every point at which a backup operation is completed.
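
As a rough illustration of checking both sources after each backup, here is a minimal sketch. It assumes the Windows event log entries have already been exported into simple records and that the backup application writes a plain-text log; the record fields (`source`, `level`, `message`) and the matched strings are hypothetical, so adapt them to whatever your export tool and backup software actually produce.

```python
def verify_backup_event(event_records, backup_log_lines):
    """Return a list of problems found; an empty list means both the
    event records and the backup log look clean."""
    problems = []

    # Check 1: any Error or Warning record from the backup window is a problem.
    for record in event_records:
        if record["level"] in ("Error", "Warning"):
            problems.append(
                f"event log: {record['source']} {record['level']}: {record['message']}"
            )

    # Check 2: the backup log must contain a completion message.
    # "completed successfully" is a placeholder phrase, not a real product string.
    if not any("completed successfully" in line.lower() for line in backup_log_lines):
        problems.append("backup log: no completion message found")

    # Check 3: report every backup log line that mentions an error.
    for line in backup_log_lines:
        if "error" in line.lower():
            problems.append(f"backup log: {line.strip()}")

    return problems


# Example with made-up records and log lines.
events = [{"source": "ESE", "level": "Information", "message": "placeholder text"}]
log_lines = ["Backup of storage group 1 completed successfully"]
print(verify_backup_event(events, log_lines))
```

The point of returning a list rather than a simple pass or fail is that the operator sees every reason a backup is suspect, which matters when the same job is checked after every run.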

The second phase in the triple-verification strategy is to verify the data. Verification of the data includes periodic restores of backup sets to another test or recovery server in order to ensure that the data was actually recoverable. Of course, you certainly cannot expect your operations staff to verify each and every backup using this method. However, you may want to establish a pattern or procedure of random sampling or rotation from server to server that allows you to identify potential problems proactively and to feel confident that your disaster-recovery measures will be successful. While testing the integrity of your disaster-recovery procedures and facilities, you can provide invaluable training for your operations staff and can identify potential problems and issues before data is actually lost.
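
One way to put the rotation idea into practice is to generate the test-restore schedule rather than leave it to memory. The sketch below is only an illustration under my own assumptions: the server names are hypothetical, one test restore is performed per period (a week, for example), and the order is shuffled once and then repeated so that every server is tested before any server is tested twice.

```python
import random
from itertools import cycle, islice


def restore_test_schedule(servers, periods, seed=None):
    """Return the server to test-restore in each period.

    The servers are shuffled once, then visited round-robin, so each
    server is covered once per full pass through the list.
    """
    order = list(servers)
    random.Random(seed).shuffle(order)  # seed makes the schedule reproducible
    return list(islice(cycle(order), periods))


# Hypothetical server names; eight periods over three servers.
schedule = restore_test_schedule(["EXCH01", "EXCH02", "EXCH03"], periods=8, seed=2003)
for period, server in enumerate(schedule, start=1):
    print(f"Period {period}: restore latest backup of {server} to the recovery server")
```

A purely random pick each period would also satisfy the sampling goal, but it can leave one server untested for a long stretch; the shuffled rotation trades a little unpredictability for guaranteed coverage.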

Finally, the third phase of our triple-verification strategy is to verify the configuration. Exchange API-based backups (online backups) do not protect the operating system or server configuration data. This information is stored in three primary locations: the AD, the IIS metabase, and the Windows registry. I recommend that you develop a strategy for protecting this data as well. You may choose to perform periodic offline or file system backups of your Exchange server that include all of this data. However, I recommend that additional measures be taken.

For AD data, most organizations will have multiple domain controllers and global catalog servers deployed that contain replicas of AD data. In most cases, I do not recommend that your Exchange server be configured as a domain controller unless it is absolutely necessary; keeping the two roles separate makes recovery operations much simpler. Since AD is a “multimaster” replicated directory service, and since AD data may be handled by a different organization, the Exchange deployment personnel may not need to worry about it at all. Also, in the event that a domain controller is lost, a new server can be added to AD and the information replicated (similar to the Exchange 5.5 directory database). If you are responsible for disaster-recovery operations for AD, you should take every precaution and planning approach you use for Exchange and apply it to AD recovery. You should perform regular backups to ensure that, if all AD instances are lost, you will be able to recover your Exchange data. If you would like more information on Windows and AD recovery procedures, see the Windows Server documentation and white papers available on Microsoft TechNet and the Microsoft Windows Web site.

Windows NT4.0 and Windows 2000 Disaster-Recovery and Backup and Restore Procedures

http://support.microsoft.com/default.aspx?scid=kb;en-us;287061

Microsoft AD Disaster-recovery

http://www.microsoft.com/technet/prodtechnol/ad/windows2000/support/adrecov.asp

IIS data is also critical to Exchange 2000/2003. Since Exchange relies on IIS for storing configuration data and servicing Internet client protocols such as IMAP, POP3, HTTP, and SMTP, you must provide measures to recover the IIS information. Since this information is stored in the IIS metabase, the simplest method is to back up the IIS metabase using the Configuration Backup option in the Internet Services Manager snap-in for Microsoft Management Console in Windows (it is not really as simple as checking the system state box for the server backup). Configuration data for Exchange 2000/2003 is very important, and steps need to be taken to ensure that your current configuration for each server in your deployment can be recreated. Without this information, you could spend days trying to recreate your server configuration from scratch. As you are developing your disaster-recovery plans for Exchange 2000/2003, don’t forget to include this vital piece that will save you critical time.

One final important component of successful disaster-recovery management is an archival strategy for your Exchange data (I also referred to this earlier). What is amazing to me is the degree of control you can achieve over your disaster-recovery concerns by simply practicing some good archival and retention policies for your Exchange deployment. There are several ways to accomplish this level of control over the amount of Exchange server data that must be backed up and recovered. Good data archival and retention measures, in my opinion, start at the inbox. Educating users on how to reduce the amount of mail they and their peers must deal with is the first line of offense in combating bursting inboxes. E-mail users must be cognizant and considerate of the amount of mail they store and send. For example, sensitivity to the number and size of attachments can be a critical part of users’ contributions to disaster-recovery management. User discipline can only be accomplished through established policies and recognized good habits. The Gartner Group proposed several good habits for “Conquering E-Mail Overload” in its November 1999 research report. These good habits can be categorized into sender and receiver habits and are provided below.

Good sender habits:

Use distribution lists with caution.
Be succinct and to the point in e-mail communications.
Keep focused on the topic.
Use a descriptive subject line to help recipients prioritize, file, and search.
Use message tags and flags appropriately.
Do not reply to all unless necessary.
Use URL (http://server/file.doc) or UNC links (\\server\share) instead of sending large and numerous attachments.
Avoid long dialogs and discussion threads via e-mail.
Avoid complex or large graphical signatures in messages.

Good receiver habits:

Establish regular intervals or time blocks for e-mail.
Delete messages when they are no longer needed.
Organize messages into folders outside of your inbox.
Browse messages by subject line to prioritize.
Eliminate unnecessary replies or acknowledgements.
Delegate mailbox access when unable to retrieve messages.
Use inbox rules sparingly and keep them simple.
Avoid membership in unnecessary distribution lists or subscriptions.

Good e-mail user habits and awareness can aid immeasurably in controlling the amount of data stored on an Exchange server. By reducing data, you also impact disaster-recovery and make it much easier and potentially faster for messaging data to be recovered. When you have done everything to ensure that your users are disciplined in their e-mail use, the next line of offense is administratively imposing methods of controlling the amount of data stored in Exchange. Through the use of administrative settings such as mailbox size, attachment size, and others, an administrator can limit the amount of data a user is able to store, thereby forcing good habits and practices by users. In addition, through the use of add-on features to Exchange like the Exchange Mailbox Manager, the Exchange Archive Agent (Exchange 5.5), and deleted-item retention, system managers can control mailbox size (and therefore disaster-recovery windows), provide basic message archival functions, and cache deleted items (in the event that messages are inadvertently deleted or need to be recovered without a full disaster recovery exercise).

Another administrative option available in Exchange 2000/2003 is administrative policies. Policies can be configured for a wide variety of controls and applied based on membership or other global attributes. Setting sound policies and administrative limits on e-mail usage can have a drastic impact on disaster-recovery. Table 6.5 shows one example of a published policy for e-mail usage that can be administratively implemented and controlled.

Table 6.5: Sample Three-Tier E-Mail Policy for an Organization
Tier	Policy
Level 1 (standard users)	40-MB inbox limit; 35-MB warning/45-MB send prohibit; 7-day deleted-item retention
Level 2 (heavy users)	60-MB inbox limit; 55-MB warning/65-MB send prohibit; 14-day deleted-item retention; requires justification and manager approval
Level 3 (custom/VIP)	Increase inbox limits of levels 1 and 2 in increments of 10 MB or 20 MB; custom warning and send prohibit settings; custom deleted-item-retention settings require department head (director/VP) approval
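
To make the thresholds in Table 6.5 concrete, the following minimal sketch encodes the Level 1 and Level 2 values and reports where a mailbox of a given size falls. The function name, the status labels, and the choice to treat each threshold as inclusive are my own assumptions for illustration; Level 3 is omitted because its limits are custom per mailbox.

```python
# Thresholds in MB, taken from Table 6.5 (Levels 1 and 2 only).
POLICY_TIERS = {
    1: {"warning": 35, "limit": 40, "send_prohibit": 45, "retention_days": 7},
    2: {"warning": 55, "limit": 60, "send_prohibit": 65, "retention_days": 14},
}


def mailbox_status(size_mb, tier):
    """Classify a mailbox size against its tier's thresholds.

    Assumption: a mailbox exactly at a threshold counts as having reached it.
    """
    policy = POLICY_TIERS[tier]
    if size_mb >= policy["send_prohibit"]:
        return "send_prohibited"
    if size_mb >= policy["limit"]:
        return "over_limit"
    if size_mb >= policy["warning"]:
        return "warning"
    return "ok"


# Hypothetical mailboxes: (user, size in MB, tier).
for user, size_mb, tier in [("alice", 30, 1), ("bob", 42, 1), ("carol", 66, 2)]:
    print(user, mailbox_status(size_mb, tier))
```

Keeping the tiers in one data structure means a policy change is a one-line edit, and the same table can drive both reporting and any notices sent to users who are approaching their limits.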

The final measure or line of offense in controlling message store size is the use of an advanced third-party product designed for this specific purpose. CommVault Systems (http://www.commvault.com), kVault Software (www.k-vault.com/), IXOS Software (http://www.ixos.com), SRA International (http://assentor.com), and Veritas (www.veritas.com) are five such companies that offer products that work with Exchange to provide advanced message-archival functions. Like the Microsoft Exchange Archive Agent (for the low-budget minded), these products provide a hierarchical storage management (HSM) approach to archiving through the use of specialized hardware and software components that apply high performance, reliability, and security to this practice. While none of these products is by itself a panacea, each can provide a high-end solution that offers an effective means of archiving messaging and collaboration data if you are prepared to make the investment required. Whether you can afford only freeware (the Exchange Archive Agent) or the Cadillac of archival solutions, you should visit this option as you plan disaster-recovery for your Exchange deployment.

Through proactive measures to control and reduce the size of Exchange information stores, you can manage and maintain disaster-recovery windows and meet associated SLAs. These measures, as well as a well-thought-out disaster-recovery toolkit, solid configuration management, and verification of disaster-recovery events, data, and configuration, are key to excellence in disaster-recovery management.