Chapter 10: Proactive Management for Mission-Critical Exchange Servers


Over the last several years, as I have met with organizations deploying Exchange Server, one fundamental concept has been continually driven home to me. It may seem obvious to some of you, but proactive management is one of the most important, yet underutilized, techniques for increasing system availability. This point was illustrated by one organization that I have dealt with quite often. This organization was facing severe pressure from senior management and its client base because of unacceptable levels of downtime for the Exchange deployment. The cause of downtime was not hardware or environment; it was frequent Exchange information store corruption. Since it was impossible to solve this issue and guarantee that it would never occur again, the organization chose to overcome the problem through proactive management practices. By simply deploying a management application that was Exchange-aware and by putting in place some problem notification, tracking, and resolution/response procedures, they were able to significantly reduce the user impact of this issue.

In the short term, the number of occurrences did not significantly change (although Microsoft and the hardware vendor diligently continued to work on the issue). However, the system-management staff’s alerting, notification, and procedures for handling the issue did change. Instead of reacting to an occurrence of database corruption after the fact (which caused several hours of downtime), support staff was alerted to the condition right away and was able to move mailboxes immediately from the server experiencing the error to a spare standby server. This minimized the actual downtime a user experienced, while support staff could troubleshoot the problem server. Since this organization measured downtime in terms of lost client opportunity (versus actual hours of server downtime), availability metrics improved, and the organization was successful in meeting its reliability SLAs.
Senior management and users cheered, and in the long run, Exchange system managers were heroes. This was all accomplished without actually solving the cause of downtime (this organization continues to experience the same issue, although less frequently). It was proactive management that virtually eliminated the issue (in the eyes of clients, anyway).

10.1 Proactive versus reactive management

Every system manager has a choice to make when he or she takes on the job. You can choose to manage problems after they occur (flying by the seat of your pants, or reactive management), or you can make a conscious choice to build a foundation and set a precedent by anticipating problems and managing them before and as they occur (proactive management).

This choice has already been made for some of you based on budgetary constraints or legacy practices. After all, proactive management is more expensive. More training is required, more planning and development are needed, and the tools necessary for proactive management are costly. In fact, the staffing, training, and tools necessary to provide proactive management for your Exchange deployment could very well be one of the most costly components of the overall messaging system. The choice must be made with this in mind. Also, organizational management (the people who will pay for it) needs to buy into a proactive management strategy from the beginning. They will have to be educated on why the extra staff, training, and tools are required. Everyone should have an understanding of the trade-offs involved as well. Without proactive management, you may need to commit to lower levels of service for the messaging system. If the organization understands and accepts this, all is well. You, as the system manager, however, will have to work hard to understand these trade-offs in cost and service and adequately communicate them to your customers and management. To some, proactive management is not a requirement, but a luxury (although the point could be argued extensively). If the messaging system is not a business-critical tool for an organization, the costs of proactive management may not be justified.

Let me give you a scenario with two possible outcomes to illustrate my point. Suppose an Exchange server supporting 1,500 users has three 20-GB information stores (each in a separate storage group and each supporting 500 users). Suppose one of these 20-GB information stores were to become corrupt. In the case of a proactive management approach, the corruption would be reported during a daily full backup and logged to the application log on the server. Proactive measures would be in place to scan the event log and backup log for errors. These errors would be reported to support staff who would immediately take action and either restore the database from tape or perform some other proactive measure such as moving user mailboxes from that database to a spare database reserved for this purpose. Most likely, in this case of proactive management, the problem would be corrected and/or alleviated before the users even became aware of the problem. The result: minimal or no impact on SLA achievement.
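To make the "scan the event log and backup log" step a little more concrete, here is a minimal sketch in Python. It is not how any particular management product works. The log-line layout (date, time, level, message), the severity labels, the keywords, and the store name SG1-DB1 are all assumptions made up for illustration; a real deployment would read the Windows application log and the backup application's own log through whatever interface those products provide.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical severity labels and keywords; adjust to what your logs contain.
ALERT_LEVELS = {"ERROR", "CRITICAL"}
ALERT_KEYWORDS = ("corrupt", "checksum", "backup terminated")


@dataclass
class Alert:
    source: str  # which log the line came from
    line: str    # the offending log line, unchanged


def scan_log(source: str, lines: Iterable[str]) -> List[Alert]:
    """Return an Alert for each line that looks like a store or backup error."""
    alerts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Assumed layout: "<date> <time> <level> <message>"
        parts = line.split(maxsplit=3)
        level = parts[2].upper() if len(parts) >= 3 else ""
        text = line.lower()
        if level in ALERT_LEVELS or any(k in text for k in ALERT_KEYWORDS):
            alerts.append(Alert(source, line))
    return alerts


# Made-up sample lines standing in for exported log text.
sample_app_log = [
    "2003-05-01 02:10:00 INFO Backup of store SG1-DB1 started",
    "2003-05-01 02:13:07 ERROR Store SG1-DB1 page checksum mismatch",
]
sample_backup_log = [
    "2003-05-01 02:13:08 WARN Backup terminated: database SG1-DB1 reported corrupt",
]

alerts = scan_log("application", sample_app_log) + scan_log("backup", sample_backup_log)
for alert in alerts:
    # In practice this is where support staff would be paged or e-mailed.
    print(f"[{alert.source}] {alert.line}")
```

The point of the sketch is the workflow rather than the parsing: the error is already being written somewhere, and the proactive part is simply that something reads it every day and tells a person.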

Now, let’s turn to the case of reactive management. Taking our scenario with a corrupt 20-GB database, daily full backups would still report the corrupted database. However, with no proactive measures in place to alert support staff, the error would go unnoticed and continue to occur for several weeks. All along, the database would never be backed up (because a corruption error will terminate backup). After several weeks, the users begin reporting lost messages or corrupted mailboxes. Eventually, the Exchange database engine dismounts the database and reports that it is corrupt. The 500 users on this database are now without service. Support staff now scrambles to begin repair and/or recovery operations for this 20-GB database. Upon closer examination, they determine that the last good backup was taken several weeks ago. Depending on several factors, the recovery operation could take several hours to complete, and in the worst-case scenario, users may have lost several weeks of e-mail or other content. In the meantime, users are unhappy, and management is disappointed because 500 users were without service for an extended period—a miserable failure to meet SLAs.
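The most damaging part of this scenario is that nobody noticed the last good backup was several weeks old. A check for that condition can be very small. The sketch below assumes you can obtain, by some means, the time each database last completed a successful backup; the database names, timestamps, and the 36-hour threshold are hypothetical values chosen only to fit a daily full backup schedule.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_backups(last_good: Dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(hours=36)) -> List[str]:
    """Return database names whose last successful backup is older than max_age."""
    return sorted(name for name, finished in last_good.items()
                  if now - finished > max_age)


# Hypothetical data: one store has not had a good backup for about three weeks.
now = datetime(2003, 5, 20, 8, 0)
last_good = {
    "SG1-DB1": datetime(2003, 4, 28, 2, 30),
    "SG2-DB1": datetime(2003, 5, 20, 2, 30),
    "SG3-DB1": datetime(2003, 5, 20, 3, 10),
}

for name in stale_backups(last_good, now):
    age_days = (now - last_good[name]).days
    print(f"{name}: no good backup for {age_days} days")
```

Run daily, a check like this would have surfaced the problem after the first missed backup instead of after the database dismounted.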

Hopefully, I have convinced you that proactive management is the best approach. I do realize that not all organizations have the resources available to implement every aspect of a complete proactive management strategy. However, if you are able to take some key insights and ideas from this chapter, I am confident you will improve service levels for your Exchange deployment.

In this chapter, I will focus on the aspects of proactive management that can have a significant impact on your Exchange deployment. In my thinking, proactive management has three key components (illustrated in Figure 10.1):

Figure 10.1: A three-pronged approach to proactive management.

  1. Performance management: This involves the monitoring and management of system performance and capacity characteristics to ensure that performance and delivery service levels are achieved. It involves establishment of performance baselines in order to compare observed statistics. Performance management also includes the capacity-planning functions critical to anticipating growth.

  2. Configuration management: This involves the documentation, monitoring, and management of system hardware, software, firmware, and configuration data to ensure the highest levels of homogeneity among Exchange servers within a deployment. Configuration management facilitates troubleshooting and minimizes the occurrence of anomalies across a deployment. Configuration management also establishes guidelines and a means for managing change across a population of servers.

  3. Fault management: This establishes the process, procedures, and tools required for alerting and notification, action and decision making, resolution, and reporting of system faults and anomalies. The goal of fault management is to identify and resolve error conditions before they impact the end user.
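As a rough illustration of the first two components, the Python sketch below shows one simple way to compare an observed statistic against a performance baseline and one simple way to spot servers whose configuration differs from the rest of the population. Both are assumptions on my part rather than prescribed methods: the three-standard-deviation threshold, the choice of "most common value" as the expected configuration, and the server names, settings, and sample numbers are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def exceeds_baseline(baseline: Sequence[float], observed: float,
                     k: float = 3.0) -> bool:
    """True if observed is more than k sample standard deviations above the baseline mean."""
    return observed > mean(baseline) + k * stdev(baseline)


def config_drift(servers: Dict[str, Dict[str, str]]) -> List[Tuple[str, str, str, str]]:
    """List (server, setting, actual, expected) wherever a server differs
    from the most common value of that setting across all servers."""
    drift = []
    settings = sorted({key for cfg in servers.values() for key in cfg})
    for key in settings:
        values = [cfg.get(key, "<missing>") for cfg in servers.values()]
        # On a tie, Counter keeps the value it encountered first.
        expected, _ = Counter(values).most_common(1)[0]
        for name, cfg in sorted(servers.items()):
            actual = cfg.get(key, "<missing>")
            if actual != expected:
                drift.append((name, key, actual, expected))
    return drift


# Hypothetical baseline samples (for example, disk latency in milliseconds).
baseline_latency_ms = [10, 12, 11, 13, 12]
print(exceeds_baseline(baseline_latency_ms, 25))  # well above the baseline
print(exceeds_baseline(baseline_latency_ms, 12))  # within the normal range

# Hypothetical inventory of three servers.
inventory = {
    "EX01": {"firmware": "4.2", "service_pack": "SP1"},
    "EX02": {"firmware": "4.2", "service_pack": "SP1"},
    "EX03": {"firmware": "4.0", "service_pack": "SP1"},
}
for server, setting, actual, expected in config_drift(inventory):
    print(f"{server}: {setting} is {actual}, expected {expected}")
```

Anything either check reports is a candidate input to fault management: an alert to raise, a decision to make, and a resolution to record.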

After exploring each component area in greater detail, I will also discuss some popular tools available that provide these features and support Exchange Server and Windows Server. Closely tied to proactive management is the establishment of SLAs. Since I believe that service levels must be established in order to determine the level of proactive management necessary, I will first devote some time to the definition and establishment of SLAs for your messaging system. By the end of the chapter, I hope to give you some excellent tools and practices that you can implement in your own deployment. While I cannot guarantee availability and reliability for your deployment, I can promise that proactive management techniques can cover a multitude of downtime woes.




Mission-Critical Microsoft Exchange 2003: Designing and Building Reliable Exchange Servers (HP Technologies)
ISBN: 155558294X
Year: 2003
Pages: 91
Authors: Jerry Cochran
