2.3 Step 2: Understanding the anatomy of Exchange downtime


Before you can properly assess the causes of downtime for your Exchange deployment, you must recognize that downtime is not a mysterious "black box," that is, a singular event whose internals are hidden from view. Downtime for any environment, including Exchange, is actually a series of incidents and activities that can be exposed and analyzed individually. In other words, it is possible to open the black box of downtime and view its internal components. Figure 2.3 illustrates this concept.

Figure 2.3: The black box of Exchange downtime.

If you look at an outage event for Exchange, you will find many sub-events that together make up the entire outage. If a single outage lasts 8 hours, be assured that the 8-hour period can be divided into smaller units that can be analyzed individually. On the time continuum of a downtime event, understanding the components that make up the downtime is key to reducing its overall duration. For example, suppose that your Exchange server is down for 4 hours. During that period, 1 hour is spent diagnosing the problem, 2 hours are spent finding a good backup tape, and the remaining hour is spent actually restoring from tape. Knowing that 2 hours were spent searching for a good backup tape would be an important data point in your effort to decrease downtime for your deployment. In another scenario, suppose the server was down for 3 hours before any of the operations staff was notified. In that case, this knowledge would indicate that effective outage-notification measures could have trimmed as much as 3 hours from the total outage period. Let's take a look at the typical components within the black box of downtime for Exchange.
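One simple way to make this decomposition concrete is to tabulate each component's duration and its share of the outage. The following sketch uses the hypothetical 4-hour example above; the component names and durations are purely illustrative.

```python
# Break a single outage into its component durations and show where the
# time actually went. The figures are the hypothetical 4-hour example above.
outage_components = [
    ("Diagnosing the problem",       60),   # minutes
    ("Locating a good backup tape", 120),
    ("Restoring from tape",          60),
]

total = sum(minutes for _, minutes in outage_components)
print(f"Total outage: {total} minutes ({total / 60:.1f} hours)")
for name, minutes in sorted(outage_components, key=lambda c: c[1], reverse=True):
    print(f"  {name:<32} {minutes:>4} min  ({minutes / total:.0%} of the outage)")
```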

2.3.1 Prefailure errors

Prefailure errors are indicators and conditions that point to an impending problem that will result in downtime. These indications are not always obvious, and not all system administrators or operators may understand their significance. They can come in many forms, such as hardware events, software indications, performance degradation, or other seemingly benign indicators. For example, one common occurrence in the Exchange environment that I would classify as a prefailure error is the infamous "–1018" error. This error occurs when the Exchange database engine encounters a problem with pages in the database (I expand on this error in later chapters). The error (or its subsequent errors) is logged to the application event log when encountered. It is a precursor to much bigger problems and indicates that the database (or portions of it) is corrupt. The error is usually logged only during backup operations or online maintenance. If it is ignored (in my experience, it is ignored all too often), you may continue to operate your Exchange server, worsening an already corrupted database. You may also develop a false sense of security about your backups: if a –1018 error is encountered during an online backup, the backup operation is terminated, and if best practices for proactive scanning of event logs are not in place, no one will be the wiser that the backup did not succeed. I have seen organizations go for days or even weeks without realizing that their last good backup was completed weeks earlier. Awareness of the errors that indicate more serious problems, problems that later cause or extend downtime, gives operators the ability to assess the impact on users and the associated business risk.
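As a minimal illustration of what proactive event-log scanning might look like, the sketch below searches an exported copy of the Application event log for –1018 entries. The export path, file format, and the simple text match are assumptions for illustration only; a production deployment would rely on a proper monitoring tool rather than an ad hoc script.

```python
# Minimal sketch: scan an exported (text/CSV) copy of the Application event
# log for ESE -1018 page-level errors so a failed online backup is not missed.
# The export path and the simple "-1018" text match are assumptions.
from pathlib import Path

EXPORTED_LOG = Path(r"C:\logs\application_event_log.csv")  # hypothetical export

def find_1018_errors(log_file: Path) -> list[str]:
    hits = []
    for line in log_file.read_text(errors="ignore").splitlines():
        if "-1018" in line and "ESE" in line:
            hits.append(line.strip())
    return hits

if __name__ == "__main__":
    errors = find_1018_errors(EXPORTED_LOG)
    if errors:
        print(f"WARNING: {len(errors)} possible -1018 events found; "
              "verify the last good backup before trusting recent backups.")
    else:
        print("No -1018 events found in the exported log.")
```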

By establishing best practices that provide visibility into the prefailure component of downtime, you gain insight into how overall downtime can be reduced or eliminated. For Exchange Server, the leading shortfalls encountered in understanding the prefailure component include the following:

  • No proactive error checking and monitoring

  • No ability to correlate error events with downtime

  • No predictive or root-cause analysis capability

  • No ability to link hardware and software errors

  • Lack of tools or improper use of tools

Overall, an understanding of this component can have the greatest impact on your ability to manage your Exchange deployment proactively and head off downtime. We will discuss this key component in greater detail in Chapter 10.

2.3.2 Failure point

The actual failure point is where the downtime clock begins to tick. For Exchange, this can be triggered when a user calls the help desk complaining that he or she cannot access the server. Alternatively, the operator or Exchange administrator may be notified of the failure point by a service management and monitoring tool such as Microsoft Operations Manager, when the management application shows an Exchange service failure or a configured Exchange server or link monitor alerts him or her to an incident. Of course, if an organization does not implement management tools, the user's call to the help desk will be the first word (similar to the if-a-tree-falls-in-a-forest scenario). In many cases, the failure point is not the point of notification. Furthermore, the root cause of the failure may not be something your monitoring tools or capabilities are watching at all. In many cases, the failure point and the next component, the notification point, can be treated as a combined event.
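Short of deploying a full management tool such as Microsoft Operations Manager, even a scheduled probe of the core Exchange services can narrow the gap between the failure point and the notification point. The sketch below shells out to the standard Windows sc query command; the service names are the usual Exchange 2003 core services, and the alerting action is left as a placeholder.

```python
# Minimal sketch: poll the core Exchange services with the built-in
# "sc query" command and flag anything not reported as RUNNING.
# How you alert (pager, e-mail from another system, MOM) is a placeholder.
import subprocess

CORE_SERVICES = ["MSExchangeIS", "MSExchangeSA", "MSExchangeMTA"]

def service_is_running(name: str) -> bool:
    result = subprocess.run(["sc", "query", name],
                            capture_output=True, text=True)
    return "RUNNING" in result.stdout

if __name__ == "__main__":
    stopped = [s for s in CORE_SERVICES if not service_is_running(s)]
    if stopped:
        # Placeholder: raise an alert so the help desk is not the first to know.
        print(f"ALERT: services not running: {', '.join(stopped)}")
    else:
        print("All core Exchange services report RUNNING.")
```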

2.3.3 Notification point

The notification point is an important component of the larger black box of downtime. This is the point at which human awareness, brainpower, and decision making are applied to the situation. An operator or administrator takes notice of whatever event has occurred and then must take action. This may include scrambling to determine the root cause, consulting the operations procedures, or calling Microsoft Product Support Services (PSS). What happens at this point of the outage can be critical in minimizing its duration. If operators are not notified, are not properly trained, or ignore procedures, the length of downtime may be extended. If no ability exists to determine root cause, or if no resources exist to escalate the problem, support staff will struggle to get the system operational and may waste valuable time or make matters worse. Problems often encountered with the notification point include the following:

  • No response to notification

  • Inability to troubleshoot the fault condition

  • Poor or inaccurate procedures

  • Poor support staff training

  • Ineffective escalation procedures

  • Incorrect support information from vendors

2.3.4 Decision/action point

Once again, the actions and decisions of operators and administrators can drastically affect the duration of an outage event for your Exchange deployment. The decision point component of downtime has little to do with the software, hardware, or any other system component. Unlike many of the other components of downtime, the decision point is, for the most part, strictly reliant upon the support staff. Therefore, the duration of downtime is largely determined by how well equipped they are to manage the situation. This is the point in the downtime event when an environmental cause must be rectified, defective hardware or software must be replaced, troubleshooting must begin, or recovery procedures must be initiated. The error condition must be analyzed and resolved. Troubleshooting skills and system knowledge are key during this period. Again, training and experience, as well as procedures and best practices, come into play here. If support staff makes the wrong decision at this point in an outage, downtime will increase. In a worst-case scenario, a bad decision could also result in permanent data loss and cost the organization even more than the downtime itself. The bottom line is that if you do not know what you are doing, there is no justification for not calling Microsoft PSS. Some common problems or failures at this point include the following:

  • Lack of documentation, training, and procedures for support staff

  • Corrective action that fails to resolve the issue properly

  • Incorrect information from vendor support

  • Operators choosing an incorrect action

  • Failure to recognize the correct condition

2.3.5 Recovery action point

The recovery action point assumes that the actual root cause has been determined and resolved. For example, if a power supply on a server has failed and been replaced, the server can be restarted and any necessary recovery steps initiated. In another example, if the disk drive holding the Exchange information store has failed, the information store must be recovered once the failed drive has been replaced (assuming no RAID capability). This may also be the point at which diagnostic and repair tools such as ESEUTIL and ISINTEG are utilized (a brief sketch of these tools appears after the list below). In many cases, this is the point at which a restore from backup media occurs. However, it may be as simple as rebooting the server and allowing Exchange to perform a soft recovery. Issues such as support staff training, documentation, procedures, and decision-making ability affect the duration of this downtime component in much the same way as they affect the other components. In addition, unnecessary staff intervention in, or second-guessing of, the recovery process can do real harm to the system. Problems that can occur at this stage in the downtime timeline include the following:

  • Missing, corrupt, or incomplete backup media

  • Inoperable backup hardware or software

  • Poor procedures, documentation, or training

  • Improper interference in the recovery process

  • Improper use of recovery or diagnostic tools (ESEUTIL, ESEFILE, and so forth)
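As promised, here is a minimal sketch of how the diagnostic tools mentioned above might be invoked once the hardware has been replaced: dumping the database header with ESEUTIL to confirm the store is in a clean (consistent) shutdown state, followed by an application-level check with ISINTEG. The database path and server name are placeholders, and the switches should be confirmed against the documentation for your Exchange version before you rely on them.

```python
# Minimal sketch: invoke ESEUTIL and ISINTEG during the recovery action point.
# The database path and server name are placeholders for illustration; confirm
# the switches against your Exchange version's documentation before use.
import subprocess

DATABASE = r"D:\Exchsrvr\MDBDATA\priv1.edb"   # hypothetical store location
SERVER = "EXCHSVR01"                          # hypothetical server name

def run(cmd: list[str]) -> None:
    print(f"\n>>> {' '.join(cmd)}")
    subprocess.run(cmd, check=False)

if __name__ == "__main__":
    # Dump the database header and look for "State: Clean Shutdown"
    # before attempting to mount the store.
    run(["eseutil", "/mh", DATABASE])

    # Application-level integrity check of the store; note that ISINTEG
    # prompts interactively for the database to check and expects the
    # store to be dismounted.
    run(["isinteg", "-s", SERVER, "-test", "alltests"])
```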

2.3.6 Postrecovery action point

Once recovery of an Exchange server is complete, there are, in many cases, several operational steps that follow. These actions are usually taken before the server is made available for end-user services. For example, after you restore an information store from backup, you may want to run integrity-checking tools such as ISINTEG, ESEFILE, or ESEUTIL. This step is an extra measure to ensure that the original problem that caused data loss or corruption has been rectified. Another common postrecovery step is to initiate a postrecovery backup. Because a normal backup operation checks every page of the database for corruption, performing a postrecovery backup of your information stores helps verify that the databases are error-free; it is a good best practice to adopt as part of normal postrecovery procedures. Other common postrecovery checks include verifying server connectivity; ensuring that all services are started; verifying third-party software; and checking directory, public folder, and free/busy replication activity for the recovered server (a simple checklist sketch follows the list below). Issues that sometimes cause problems during postrecovery include the following:

  • Omission of a postrecovery backup

  • Improper use of tools

  • Poor training and procedures

  • Failure to resolve the original root cause
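The postrecovery checks just listed lend themselves to a simple, repeatable checklist. The sketch below is only a skeleton: the first check reuses the service probe shown earlier, and the verification-backup check is a placeholder you would wire up to your actual backup reporting.

```python
# Minimal sketch: a postrecovery checklist runner. Each check is a named
# function returning True or False; the checks shown here are placeholders
# standing in for your real postrecovery verification steps.
import subprocess

def information_store_running() -> bool:
    out = subprocess.run(["sc", "query", "MSExchangeIS"],
                         capture_output=True, text=True).stdout
    return "RUNNING" in out

def postrecovery_backup_verified() -> bool:
    # Placeholder: confirm the postrecovery backup job completed without
    # errors before declaring normal operations.
    return False

CHECKS = [
    ("Information store service running", information_store_running),
    ("Postrecovery verification backup completed", postrecovery_backup_verified),
]

if __name__ == "__main__":
    for name, check in CHECKS:
        print(f"[{'PASS' if check() else 'FAIL'}] {name}")
```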

2.3.7 Normal operational point

The downtime clock stops ticking once the recovery operation reaches the point at which normal operations can be declared. At this point, clients should again be able to access the services they were denied during the outage. Resumed normal operation is the point every system administrator wants to reach as quickly as possible when an outage occurs. This is also the point at which incident documentation should be completed, including a full report of the failure and of the activities that occurred at each point in the outage (a postmortem). The incident reporting and documentation process should be the beginning of an operational feedback loop that drives a process-improvement program for your Exchange deployment. Without this feedback loop and improvement program, the same issues will continue to extend downtime as personnel run into the same problems each time a crisis occurs. While the point of resumed normal operations may not seem like an important part of an outage (everyone is just glad to get the Exchange server back up and running), it may be one of the most important points where process improvement is concerned.

I have taken the time here to guide you through the individual components that make up a downtime incident for one reason: only by understanding what a downtime incident actually comprises can we devise methods of reducing downtime. Too often, we look at the black box of downtime and are afraid to look inside. Many organizations I work with are frustrated by the amount of downtime they experience across their Exchange deployment. They often see a downtime incident for Exchange and charge the entire incident to the Exchange software or their hardware vendor. Many times, I have seen downtime incidents last for many hours and result in Exchange Server getting a bad rap. For example, in one case in which I was involved, the Exchange information store was found to be corrupt and had to be restored from backup. The recovery process lasted over 8 hours, resulting in a huge downtime incident charged against the messaging team. On looking deeper into this incident, we found that several things contributed to the 8 hours of downtime. First, the problem was not recognized until an hour into the outage (the notification point). Once the problem was identified, an inexperienced staff scrambled to find the backup tapes required to perform the restore. However, after the restore process began, it was determined that the backup was not good; in fact, the last successful backup had occurred 3 days earlier. Because the staff was not well trained and procedures were not well documented, there was a further delay in finding the correct backup tapes from 3 days earlier. Overall, approximately 4 hours were spent identifying and locating the last known good backup of the downed Exchange server. Once the correct tape was located, the recovery operation began, and the server was fully recovered in about 3 hours. However, since the last good backup was 3 days old, 3 days' worth of messaging data was lost. This extreme case illustrates how several components of downtime can each consume excessive amounts of time.

By looking at downtime as a series of components instead of a singular event, we can start to identify problem spots. These problem spots can be analyzed further and targeted for process improvement in order to reduce the overall outage period. In my example, if better notification methods had existed, the operations staff could have responded more quickly. In addition, if backup verification procedures had been in place to validate each day's backup, 3 days' worth of data would not have been lost. You can be certain that this particular organization quickly recognized the need to examine the notification, decision/action, and recovery points within the overall outage in order to eliminate the tremendous amount of unneeded downtime that occurred in this incident. By looking deeper into the outage, rather than treating it as a single, opaque event, we can look for ways to increase system availability.

2.3.8 Recovery-point focus versus recovery-time focus

High availability means different things to different people. As I have discussed, at the pinnacle we have continuous availability, or nonstop computing. What is your definition of high availability for your Exchange deployment? Perhaps you don't need a full "five nines" (99.999%), but you would like to get as close as you can. Regardless of your needs, you should consider the difference between focusing on recovery point and focusing on recovery time. Over the years, the reliability of system components has consistently increased, either because the components themselves have become more reliable or because we have been able to use redundant components (such as power supplies, fans, or disks) to guard against single points of failure. Downtime still happens, however: software crashes, hardware fails, and data can be lost. A recovery-point focus means that your primary concern is being able to recover your system to an exact point in time or system state, with no tolerance for data loss.

With a recovery-point focus, the disaster-recovery planner is concerned with the ability to recover to a specific system transaction point without data loss. What is the impact when you measure your operations using a recovery-point standard? If you are not able to pick up where you left off, will it be inconvenient, damaging, or completely catastrophic? With a focus on recovery time, you are concerned with continuous operations and care most about the time it takes to restore the system to operational status. The big question is this: Do you need fast recovery, recovery to the exact system state prior to the failure, or both?

For Exchange, we could contrast recovery point and recovery time with this example. A recovery-time focus would favor simply getting the server back in operation and would be less concerned with whether the users’ mailbox data was intact. The end result would be that users could send and receive mail as soon as possible, but data in their mailboxes may not be immediately available. With a recovery-point focus, the importance might be placed on recovering mailbox data to an exact point preceding a failure with less concern for quickly restoring system operation. Obviously, when planning for a real-world situation, disaster-recovery specialists must strike a balance between a focus on recovery point and recovery time. For an Exchange deployment, we would like to have our users back online with access to their messaging data as soon as possible—both recovery point and recovery time are important. However, the two focal points can be in direct competition with one another. The ability to ensure recovery-point capability may impact the ability to ensure a rapid recovery time. The focus you choose or, more likely, the balance you strike will be important as you plan your recovery procedures and strategy for Exchange. Using the criteria of recovery point and recovery time will help you determine what is most important in your attempts to reduce downtime for your Exchange deployment.
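One way to make the tradeoff tangible is to put rough numbers on both measures for a given backup and restore strategy. The sketch below does the arithmetic for a hypothetical daily-full-backup scenario; every figure in it is an assumption for illustration, not a measured value.

```python
# Illustrative arithmetic only: estimate worst-case recovery point (data at
# risk) and recovery time for a hypothetical daily full backup. All figures
# are assumptions, not measurements.
backup_interval_hours   = 24.0   # one full online backup per day
restore_rate_gb_per_hr  = 20.0   # assumed tape restore throughput
store_size_gb           = 40.0   # assumed information store size
detect_and_decide_hours = 1.0    # notification + decision/action points

worst_case_recovery_point = backup_interval_hours   # hours of mail at risk
worst_case_recovery_time = (detect_and_decide_hours
                            + store_size_gb / restore_rate_gb_per_hr)

print(f"Worst-case data loss (recovery point): "
      f"{worst_case_recovery_point:.0f} hours of messaging data")
print(f"Estimated outage length (recovery time): "
      f"{worst_case_recovery_time:.1f} hours")
```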



