2.4 Step 3: Assessing the causes of downtime for Exchange


The causes of downtime vary from one Exchange deployment to another, and your leading causes will depend on how well you understand the technology and how well you plan and implement your operational procedures. The architecture and design of Exchange Server itself may also contribute. For most mission-critical environments, however, the leading causes of downtime fall into four key areas, as illustrated in Figure 2.4: software, hardware, operator, and environment.

Figure 2.4: Categorizing the leading causes of downtime.

2.4.1 Software

Bugs, configuration errors, performance problems, and other software issues can cause significant downtime for any environment. Software causes of downtime span many areas, such as the operating system, drivers, applications, and tools, and a problem in any of these can render a server unavailable. For Exchange Server, software problems, in my experience, usually come as the result of configuration errors; since Exchange Server is a very complex system, there are many opportunities for misconfiguration. Third-party applications and drivers can also be a source of much grief for administrators. For Exchange, third-party software such as hardware device drivers, antivirus scanners, management agents, fax servers, and connectors are well-known culprits, frequently found to have memory leaks and other defects that can result in downtime. In addition, anything that destabilizes Windows will eventually wreak havoc on Exchange as well.

Microsoft is not without fault either. We are all painfully aware of the continuous wave of service packs and patches that Microsoft must provide to address issues discovered in the Windows operating systems or Exchange Server. Although we are often frustrated when software bugs are discovered, Microsoft is no worse an offender than any other developer. Left unchecked or misunderstood, software problems can be a significant contributor to system downtime. In practice, measures such as regularly scheduled system reboots often become part of operational procedures in an attempt to compensate for shortcomings in the operating system, third-party software, or Exchange Server itself. We must accept this as a reality of the technology we are using and seek to understand how and why software can be a significant cause of downtime. Only then can we plan and implement methods of reducing our exposure to software problems.
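As one example of reducing that exposure, a simple monitor can flag a leaking process before it exhausts memory and forces an unplanned outage or reboot. The following is a minimal sketch in Python using the psutil library; the process name, sampling interval, and alert threshold are illustrative assumptions, not a prescription.

```python
# Minimal leak-watchdog sketch: sample a process's resident memory on a
# schedule and warn when it grows for many consecutive samples.
# PROCESS_NAME, SAMPLE_INTERVAL, and GROWTH_SAMPLES are assumed values.
import time
import psutil

PROCESS_NAME = "store.exe"   # hypothetical target process to watch
SAMPLE_INTERVAL = 300        # seconds between samples
GROWTH_SAMPLES = 12          # consecutive growing samples before warning

def find_process(name):
    """Return the first running process whose name matches, or None."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and proc.info["name"].lower() == name:
            return proc
    return None

def watch_for_leak():
    proc = find_process(PROCESS_NAME)
    if proc is None:
        print(f"{PROCESS_NAME} is not running")
        return
    last_rss = 0
    growing = 0
    while True:
        rss = proc.memory_info().rss
        # Count consecutive samples in which memory only grew.
        growing = growing + 1 if rss > last_rss else 0
        if growing >= GROWTH_SAMPLES:
            print(f"WARNING: {PROCESS_NAME} memory grew for "
                  f"{GROWTH_SAMPLES} straight samples ({rss} bytes)")
            growing = 0
        last_rss = rss
        time.sleep(SAMPLE_INTERVAL)

if __name__ == "__main__":
    watch_for_leak()
```

Catching sustained growth early lets you schedule a restart on your own terms rather than waiting for the leak to take the server down.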

2.4.2 Hardware

Hardware is another area in which problems can cause downtime. Having previously worked for a hardware manufacturer and dealt daily with the impact of hardware on overall system reliability, I am painfully aware of this issue. What is most interesting about hardware failures is that, according to most research, hardware is rarely cited as the leading cause of downtime; software and administrative error are the more likely culprits. The hardware platform that runs Exchange Server is not a single device but a collection of interoperating devices, and all of its components are subject to the laws of physics, over which we have no control. The server on which you run Exchange is composed of electronic circuits and mechanical devices that break down and degrade with time and use. For example, a hard disk drive is a magnetic platter (or several platters) rotating at speeds of up to 15,000 RPM; a device like this is bound to fail eventually. Not coincidentally, hard disk drives are the component most likely to fail in a server, followed by memory, power supplies, and fans. While we cannot prevent hardware failures from occurring, we can reduce their impact by building systems that are tolerant of these faults.

Another key point is that most hardware also contains software, usually called firmware. The system board in a server contains firmware, the Basic Input/Output System (BIOS), for controlling low-level hardware operations. Likewise, devices such as disk drives and controllers contain firmware microcode that controls their operation. In my experience, firmware problems are all too often the cause of device failures.

Although hardware manufacturers are continually improving technology, there is little hope of eliminating hardware device failures entirely. Best results are usually achieved by carefully choosing hardware with an appropriate degree of redundancy and by aligning your organization with a top-tier hardware vendor that has excellent service and support offerings as well as solid experience and expertise with Exchange Server and Windows Server. The key point about hardware failures is that we can design and manage our systems in a manner that significantly reduces, or possibly even eliminates, hardware as a cause of system downtime.
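To see why redundancy pays off, consider a rough back-of-the-envelope model of disk mirroring, sketched below in Python. The annual failure rate and repair window are assumed values, and failures are assumed to be independent, so treat this as an illustration rather than a reliability analysis.

```python
# Rough illustration of how mirroring changes outage probability.
# AFR and REPAIR_DAYS are assumed values for the sake of the example.
AFR = 0.03          # assumed annual failure rate of a single disk (3%)
REPAIR_DAYS = 2     # assumed time to replace a failed disk and rebuild

# Single disk: any failure during the year takes the volume down.
p_single = AFR

# Mirrored pair: an outage requires the surviving disk to fail during
# the repair window (either disk may fail first, hence the factor of 2).
p_mirror = 2 * AFR * (AFR * REPAIR_DAYS / 365)

print(f"Unmirrored disk: {p_single:.2%} chance of outage per year")
print(f"Mirrored pair:   {p_mirror:.6%} chance of outage per year")
```

Even under these modest assumptions, mirroring cuts the annual outage probability from a few percent to roughly one in a hundred thousand, which is why redundant disks, power supplies, and fans figure so heavily in mission-critical server designs.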

2.4.3 Operator

Next we come to the operator. Unfortunately, all of the research I have seen points to the operator (the human factor in the equation) as the leading cause of downtime for mission-critical systems. A study by OnTrack Data International found that as much as 32% of downtime is due to human error. How often I hear stories of people clicking the mouse when they shouldn't have, hitting the wrong key, tripping over a cable, selecting the wrong options, or just plain not knowing what to do. It isn't that people are stupid (I know you are thinking of someone in particular right now ...); they often just are not well trained, or they work from incorrect procedures or instructions. And no matter how well trained people are, humans are not infallible. For example, many operators never actually perform a server recovery until they are faced with a real-world crisis and must perform flawlessly under pressure. Operators who have practiced disaster-recovery drills many times beforehand avoid many of the mistakes an inexperienced operator will make in a crisis.

Knowing that operator error is the leading cause of downtime is half the battle. The good news is that this is the cause of downtime over which you have the most control: proper planning, procedures, and training can do wonders toward eliminating operator error.

2.4.4 Environment

The last category of downtime causes is environmental. The environment includes everything that is outside of or connected to the system.

Environmental causes of downtime include power failure, poor power conditioning, loss of heating and cooling, hardware destruction, poor physical security, and a host of other external occurrences that can directly cause a system outage. One of my favorite tales of poor physical security involves a system administrator who entered the computer room at night, shut down each server, and removed a 128-MB memory module from each. On most servers this created no problem, since nobody seemed to notice the memory was gone (evidently, most of the servers had plenty to spare). For the database server, however, the missing 128 MB soon became noticeable: when month-end reports and summaries were generated, the server would run out of memory, grind to a halt, and finally crash. Upon closer examination, it was discovered that the missing memory was sorely needed once a month, when system activity peaked during the report runs.

The most extreme environmental causes of system outages are natural and man-made disasters that result in total system destruction. While impossible to predict (but not to plan for), these incidents deserve some thought in your disaster-recovery planning, and technologies are available that can tolerate even such disasters, which are a very real threat in the world we live in. Most environmental causes of downtime are tamer, however, such as problems with power or with heating and cooling, and can be compensated for with countermeasures. It is within our control to provide adequate heating and cooling for data centers and to provide power conditioning and backup, and solid data-center design practices eliminate most of these issues.

Even so, I have seen power and cooling cause problems in well-designed data centers. While working (in a past life) in an MIS department with a state-of-the-art data center, I found myself troubleshooting an elusive problem: memory modules and disk drives were failing at rates too high to be a coincidence. Upon further investigation, we discovered that our very expensive UPS was providing a continuous but dirty supply of power to the data center. The bad power, full of spikes and sags, had taken its toll on our equipment; memory modules and disk drives were the most susceptible and began to fail at a rapid rate. Once the power-conditioning problem was resolved, the failures tapered off. While environmental conditions are not the leading cause of outages, they are often the leading contributor to outage duration.

It is important to understand that within each of these four areas (software, hardware, operator, and environment) lie many potential causes of downtime. Every organization is different, so you must determine which of these are your leading causes. Understanding the issues that could cause downtime in your environment is critical to planning and implementing mission-critical systems. Armed with that knowledge, you can take the next steps toward designing systems that address your leading issues. For example, if you determine that hardware is the single largest contributor to downtime, you can design and purchase systems that eliminate or reduce those failures. If operator error is your biggest issue, you can provide better training, documentation, procedures, and experience for your operations staff. Whatever your leading causes of downtime, the advantage goes to the organizations that understand them and take steps toward their eradication.
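A simple way to start the assessment is to tally your incident records by category, comparing both how often each category strikes and how much downtime it accounts for (recall that environmental causes are rarely the most frequent but often dominate duration). The sketch below uses a hypothetical record format and made-up numbers purely for illustration.

```python
# Tally incident records by cause category, ranking by total downtime.
# The incident list below is hypothetical sample data.
from collections import defaultdict

# Each record: (cause category, outage duration in minutes)
incidents = [
    ("operator", 120), ("software", 45), ("hardware", 30),
    ("software", 240), ("operator", 60), ("environment", 480),
]

counts = defaultdict(int)
minutes = defaultdict(int)
for category, duration in incidents:
    counts[category] += 1
    minutes[category] += duration

total = sum(minutes.values())
for category in sorted(minutes, key=minutes.get, reverse=True):
    print(f"{category:12s} {counts[category]} incidents, "
          f"{minutes[category]:4d} min ({minutes[category] / total:.0%})")
```

Ranked by duration rather than by count, a category such as environment can surface as the real problem even when operator error generates more incidents.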



