Monitoring, Managing, and Troubleshooting Server Health


The key to having a successful Exchange Server organization is to keep it running smoothly. To do that, you must keep a watchful eye on your Exchange servers so that problems are prevented where possible and, when they do occur, the required corrective actions are carried out as quickly as possible.

Several tools within Exchange Server 2003 and Windows Server 2003 are available and should be part of your monitoring tool bag. They include

Event Viewer
Exchange server monitors
Exchange Server monitor notifications
Exchange diagnostics logging
System Monitor

We briefly examine each of these tools in the following sections.

Using Event Viewer

The Event Viewer should always be your first stop for both routine and nonroutine checkups on your servers because it offers a concise and fairly easy-to-read historical view of what has occurred on a server. By default, three logs are provided in the Event Viewer on all servers:

Application: The application log is where Windows and other applications write event entries for their status, such as the successful startup of a service or the failure of a service to start. The application log can provide a treasure trove of information to the administrator looking to find out when a problem first started or what might have led to it.
Security: The security log contains all audit entries that Windows has written, depending on what auditing policies your network has in place. In the security log, you can track unauthorized user access attempts that have failed, locked out user accounts, privilege usage, or changes made to files and other objects.
System: The system log contains entries that pertain to the operation of Windows itself, such as information about drivers that have failed to load or devices that have started up successfully. The system log is useful in tracking down network and other hardware problems.

You need to understand how to configure and implement auditing for your servers before the security log will yield any useful information for you. Auditing is not enabled on most servers in Windows Server 2003, and is only configured for minimal auditing on domain controllers.

Your servers might have additional logs, such as the DNS log, depending on their role in the network. Within each log are recorded events of varying types, depending on the information being recorded. Information events are fairly routine and report actions that completed successfully, such as the successful startup of a service. Warning events indicate possible future problems that require further investigation. Error events indicate that a significant problem has occurred on the system, such as the failure to start a required service. You need to investigate and correct Error events immediately upon their discovery. You can see all three event types in Figure 8.7. Note that the security log does not use these event types, but instead uses Success and Failure events.

Figure 8.7. The Event Viewer should always be your first stop when performing monitoring on your servers.
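As a rough illustration of reviewing a log by event type, the following Python sketch sorts a few made-up entries so that Error events surface first, followed by Warning and Information events. The LogEvent class, the sample entries, and the function names are hypothetical; the sketch does not read a real Windows event log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogEvent:
    """One entry from an event log (hypothetical, simplified shape)."""
    log: str         # "Application" or "System"
    event_type: str  # "Information", "Warning", or "Error"
    source: str
    message: str


# Review order: Error events first, then Warning, then Information.
REVIEW_ORDER = {"Error": 0, "Warning": 1, "Information": 2}


def triage(events):
    """Return the events sorted so that the most urgent types come first."""
    return sorted(events, key=lambda e: REVIEW_ORDER[e.event_type])


def count_by_type(events):
    """Count how many events of each type were recorded."""
    return Counter(e.event_type for e in events)


# Made-up sample entries, for illustration only.
SAMPLE_EVENTS = [
    LogEvent("Application", "Information", "ExampleService", "Service started."),
    LogEvent("System", "Error", "ExampleDriver", "Driver failed to load."),
    LogEvent("Application", "Warning", "ExampleService", "Disk space is low."),
]

if __name__ == "__main__":
    for event in triage(SAMPLE_EVENTS):
        print(f"{event.event_type:<12}{event.log:<13}{event.source}: {event.message}")
```

The security log is left out of this sketch because it uses Success and Failure events rather than these three types.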

Using Exchange Server Monitors

Exchange Server 2003 creates, by default during its installation, monitors that are used to monitor servers and connectors. As you might guess, server monitors are used to monitor the status of specified services and usage of resources on servers. Connection status monitors are used to monitor the status of a connector object between servers.

The Exchange monitors are located in the Tools, Monitoring and Status, Status folder of the Exchange System Manager, as shown in Figure 8.8.

Figure 8.8. The Exchange monitors can be used to get real-time information about the status of critical services and resources.


By default, several key services are monitored on each Exchange server, as shown in Figure 8.9. To access this dialog box, you simply double-click on the desired server, select the Default Microsoft Exchange Services item, and click the Details button.

Figure 8.9. By default, key Exchange services are being monitored.


You can add additional items to be monitored by returning to the server monitoring Properties dialog box and clicking the Add button to open the Add Resource dialog box, as shown in Figure 8.10. Note the six different categories of resources that you can monitor; we discuss each of them briefly in turn.

Figure 8.10. You can create additional monitors if desired.


Available Virtual Memory

Exchange is very memory intensive; this should come as no surprise to any experienced Exchange administrator. Exchange Server 2003 requires a minimum of 256MB of RAM and has a recommended value of 512MB, although in larger organizations you might have 1GB of RAM or more installed in your servers. To keep an eye on how effectively your servers are using memory, you might want to consider monitoring the available virtual memory on the server. When it dips too low for an extended period of time, you have a situation that warrants further investigation.

CPU Utilization

In addition to RAM, Exchange needs a lot of CPU cycles to accomplish its assigned tasks. By monitoring the CPU utilization of your Exchange servers, you can help identify those servers that have too many services running or do not have enough processing power currently. This information can be used to justify the need for additional servers for off-loading of Exchange (or other) services or perhaps the addition of another or more powerful processor into the server.

Free Disk Space

One thing that all databases have in common is that they require large amounts of disk space to be available for the database itself as well as the various logs that go along with the database. Exchange is no different in this regard; adequate disk space is a must for both Exchange and Windows to operate smoothly. You can configure monitoring on a per-volume basis to keep you informed of when one or more of your server's drives are becoming dangerously full.

Offline database defragmentations require free disk space equal to at least 110% of the database size. You need to plan for events such as this when creating, managing, and monitoring Exchange servers.
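To make the 110% figure concrete, the short Python sketch below works out how much free space an offline defragmentation would need for a given database size. The function names are hypothetical and the calculation simply applies the 110% rule stated above; it does not query a real Exchange database or volume.

```python
GB = 1024 ** 3


def required_free_bytes(database_bytes):
    """Free space needed for an offline defragmentation: 110% of the database size.

    Integer arithmetic is used (rounding up) to avoid floating-point error.
    """
    return (database_bytes * 110 + 99) // 100


def can_defragment(database_bytes, free_bytes):
    """Return True if the volume has enough free space for the defragmentation."""
    return free_bytes >= required_free_bytes(database_bytes)


if __name__ == "__main__":
    needed = required_free_bytes(20 * GB)
    print(f"A 20GB database needs {needed / GB:.0f}GB of free disk space.")
```

By this rule, a 20GB database needs 22GB of free disk space, so a volume with only 21GB free would not be sufficient.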

Windows 2000 Service

Don't let the name mislead you; this category is used to monitor Windows services on Windows 2000 Server or Windows Server 2003 computers. Any service that is installed and operational on the server can be monitored here, such as a service for a third-party add-in to Exchange that is considered to be a part of your critical Exchange infrastructure.

SMTP and X.400 Queue Growth

Monitoring of queue growth, particularly of SMTP queue growth, can be quite useful in keeping a watchful eye out for abnormal messaging situations. For example, suppose that you've somehow been exposed as an open SMTP relay. You might notice a large increase in your queue growth without a corresponding increase in the number of messages actually being delivered, due to spam being delivered through your open relay. Another situation you might encounter is a script-based email worm that replicates itself through email messages; this might also create a large increase in your queue growth. Lastly, monitoring of the queue growth is helpful in situations in which outgoing messages cannot be transported out of the Exchange organization for one reason or another, indicating a possible problem with your Internet connectivity.
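The idea of watching for sustained queue growth can be sketched in a few lines of Python. This is only an illustration: the sampling interval, the 10-minute warning and 30-minute critical thresholds, and the function names are assumptions chosen for the example, not values taken from Exchange.

```python
def minutes_of_continuous_growth(samples, interval_minutes=1):
    """Length, in minutes, of the growth streak that ends at the latest sample.

    `samples` is a list of queue lengths taken at a fixed interval.
    The streak resets whenever the queue stops growing.
    """
    streak = 0
    for previous, current in zip(samples, samples[1:]):
        streak = streak + 1 if current > previous else 0
    return streak * interval_minutes


def queue_state(samples, warning_minutes=10, critical_minutes=30, interval_minutes=1):
    """Classify the queue as "ok", "warning", or "critical" (assumed thresholds)."""
    grown = minutes_of_continuous_growth(samples, interval_minutes)
    if grown >= critical_minutes:
        return "critical"
    if grown >= warning_minutes:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(queue_state([5, 6, 7, 3, 4]))    # growth streak was interrupted
    print(queue_state(list(range(12))))    # queue grew for 11 straight minutes
```

A queue that rises and then drains resets the streak, so only uninterrupted growth, such as that caused by an open relay or a connectivity failure, would trip the thresholds.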

Creating Notifications

As useful as server (and connection) monitoring is, it really doesn't do you any good unless you configure notifications to occur when the monitors reach a warning or critical state.

You can create one of two types of notifications: those that send email messages or those that run a script. Email notifications are pretty straightforward and do just what you'd expect they would. Script notifications can get a bit more complex and are really only limited by your needs, your scripting capabilities, and your imagination.

Notifications are located in the Notifications folder shown previously in Figure 8.8. To create a new notification, right-click on the folder and select New, E-mail Notification or New, Script Notification depending on your needs. The email notification configuration is shown in Figure 8.11. You need to select the server to perform the monitoring, the item(s) to be monitored, the state at which to send the email, the addresses to send the email to, and the server that is to be used to send the email.

Figure 8.11. You should create email notifications for all of your servers as soon as possible after installation.


When creating a notification, it's usually best to configure the monitoring to be done by a different server than you are using to send the email.

Using Exchange Diagnostics Logging

Using the Exchange diagnostics logging allows you to perform very deep monitoring of all Exchange services, with the output being written to the application log of the Event Viewer. The diagnostics logging options can be configured on the Diagnostics Logging tab of the server Properties dialog box, as shown in Figure 8.12.

Figure 8.12. You can configure diagnostics logging to gather extremely detailed information about your Exchange servers.


By default, all categories are configured for a logging level of None. To configure diagnostics logging, select the service you are interested in and then configure the logging level for the appropriate categories under that service. There are four levels of logging available:

None: Events with a logging level of 0 are logged. These events include application and system failure.
Minimum: Events with a logging level of 1 or lower are logged.
Medium: Events with a logging level of 3 or lower are logged.
Maximum: Events with a logging level of 5 or lower are logged.

The logging level assigned to each event is set by the developers of the application and cannot be modified after the fact. Be wary of using Medium and Maximum logging for a large number of categories or for a long period of time, as they can quickly flood the application log with event entries, masking otherwise important lower-level events that you might need to see to successfully troubleshoot a server problem.
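The relationship between the four settings and the event levels they capture can be expressed as a simple comparison. The Python sketch below encodes the thresholds listed above (0, 1, 3, and 5); the function names and the sample event levels are hypothetical.

```python
# Highest event logging level captured by each diagnostics logging setting.
LEVEL_THRESHOLD = {"None": 0, "Minimum": 1, "Medium": 3, "Maximum": 5}


def is_logged(event_level, configured_setting):
    """Return True if an event with this logging level is written to the log."""
    return event_level <= LEVEL_THRESHOLD[configured_setting]


def logged_events(event_levels, configured_setting):
    """Filter a list of event levels down to those that would be logged."""
    return [lvl for lvl in event_levels if is_logged(lvl, configured_setting)]


if __name__ == "__main__":
    sample_levels = [0, 1, 2, 3, 4, 5, 5, 5]  # made-up mix of event levels
    for setting in LEVEL_THRESHOLD:
        count = len(logged_events(sample_levels, setting))
        print(f"{setting:<8} logs {count} of {len(sample_levels)} events")
```

Running the sketch against the made-up sample shows how quickly the number of logged events grows as the setting moves from None toward Maximum.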

Although you're not likely to be heavily tested on diagnostics logging, you should at least be aware of what it is, what it does, and how it works. You might have a need to use diagnostics logging later in your own Exchange organization.

Using System Monitor

The last of the basic monitoring tools that you have is the Windows System Monitor, located in the Performance console of the Administrative Tools folder. When monitoring performance, we are usually interested in collecting information about four general items (or problem areas):

Bottlenecks: Spotting that a bottleneck exists is usually not difficult, but identifying its cause correctly can be. A bottleneck should be thought of as a lack of adequate resources, meaning that the network is demanding more resources than your servers have available to them. Bottlenecks often manifest themselves as slowdowns, which makes them more difficult to identify and correct. Items such as the number of requests and the frequency of the requests are large contributors to bottlenecks.
Queue length: A queue is a temporary holding location for objects until they can be processed. Short queues are indicative of healthy systems in which all processes and services are functioning well, whereas longer (and growing) queues are indicative of insufficient resources or other problems.
Response time: Response time is, as you might expect, the time it takes a service to respond to a given request. As the loading on a server or service increases, so does the response time. You've likely seen this at very popular Internet Web sites or on an Exchange server that has too many mailboxes on it. Response time is typically best tracked over long periods of time, although you can get a snapshot idea of how a server or resource is performing by monitoring response times over a short period of time.
Throughput rate: The concept of throughput is typically applied to data transmission over a network from one location to another, and rightly so. Throughput, however, comes in other forms, such as disk throughput. When the demand for throughput becomes greater than the resource can support, response times and queue lengths increase and a bottleneck will likely appear. A resource with a sustained high throughput rate but a very low (and not growing) queue length is not considered a bottleneck, but merely a candidate for future upgrading, because throughput typically increases over time for various reasons, such as additional users and more data being moved from one point to another.

Monitoring performance begins with the collection of data. The System Monitor allows you various methods of working with data, although all methods use the same means of collecting data. Data collected by the System Monitor is broken down into objects, counters, and instances. An object is the software or device being monitored, such as memory or processor. A counter is a specific statistic for an object. Memory has a counter called Available Bytes, and a processor has a counter called % Processor Time. An instance is the specific occurrence of an object you are watching; in a multiprocessor server with two processors, you will have three instances: 0, 1, and _Total.
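The object, instance, and counter parts fit together in a counter path of the form \Object(Instance)\Counter, with the instance omitted for objects such as Memory that have none. The following Python sketch splits such a path into its three parts. The CounterPath type and the parse_counter_path function are hypothetical helpers written for this illustration, and the sketch handles only local paths with no computer name.

```python
import re
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    obj: str                 # performance object, such as "Processor"
    instance: Optional[str]  # instance, such as "_Total", or None if not applicable
    counter: str             # counter, such as "% Processor Time"


# \Object(Instance)\Counter, where "(Instance)" is optional.
_PATH = re.compile(
    r"^\\(?P<obj>[^\\(]+)(?:\((?P<instance>[^)]*)\))?\\(?P<counter>.+)$"
)


def parse_counter_path(path):
    """Split a local counter path into object, instance, and counter."""
    match = _PATH.match(path)
    if match is None:
        raise ValueError(f"not a recognizable counter path: {path!r}")
    return CounterPath(match["obj"], match["instance"], match["counter"])


if __name__ == "__main__":
    for text in (r"\Processor(_Total)\% Processor Time", r"\Memory\Available Bytes"):
        print(parse_counter_path(text))
```

On a two-processor server, the same counter would be available under the instances 0, 1, and _Total, giving three distinct paths for the one counter.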

The primary difference between using the System Monitor and counter logs/trace logs is that you typically watch performance in real time in System Monitor (or play back saved logs), whereas you use counter logs and trace logs to record data for later analysis. Alerts function in real time by providing you with (you guessed it) an alert when a user-defined threshold is exceeded, much the same as with the Exchange server monitors.

The basic use of the System Monitor is straightforward. You decide which object/instance/counter combinations you want to display and then configure the monitor accordingly. At that point, information begins to appear. You can also change the properties of the monitor to display information in different ways.

Figure 8.13 shows a typical Add Counters dialog box. At the top of the dialog box is a set of radio buttons with which you can obtain statistics from the local machine or a remote machine. This is useful when you want to monitor a computer in a location that is not within reasonable physical distance from you. Under the radio buttons is a pull-down menu naming the performance objects that can be monitored. Which performance objects are available depends on the features (and applications) you have installed on your server. Also, some counters come with specific applications. These performance counters enable you to monitor statistics relating to that application from System Monitor.

Figure 8.13. Adding a counter in the Add Counters dialog box.


When you first start the System Monitor, you might want to add the counters discussed later in this section from the memory, processor, hard disk, and network objects; however, you can add any combination of counters that you find helpful in tuning and monitoring your computers.

Under the Performance object is a list of counters. When applied to a specific instance of an object, counters are what you are really after, and the object just narrows down your search. The counters are the actual statistical information you want to monitor. Each object has its own set of counters from which you can choose. Counters enable you to move from the abstract concept of an object to the concrete events that reflect that object's activity. For example, if you choose to monitor the processor, you can watch for the average processor time and how much time the processor spent doing nonidle activity. In addition, you can watch % User Time (time spent executing user application processes) versus % Privileged Time (time spent executing system processes).

To the right of the counter list is the instances list. If applicable, instances enumerate the physical objects that fall under the specific object class you have chosen. In some cases, the instances list is not applicable. For example, there is no instances list with memory. In cases in which the instances list is applicable, you might see multiple instance variables. One variable represents the average of all the instances, and the rest of the variables represent the values for the individual physical objects (numbered 0, 1, and so on). For example, if you have two processors in your server, you will see (and be able to choose from) three instance variables: _Total, 0, and 1. This enables you to watch each processor individually and to watch them as a collective unit.

Don't forget about the Task Manager. Even though it's a very simple tool, do not underestimate its usefulness. You can quickly launch the Task Manager to get a real-time look at network utilization (and process performance) without having to open the System Monitor and configure counters.

Using System Monitor to Discover Bottlenecks

Every chain, regardless of its strength, has its weakest link. When pulled hard enough, some point will give before all the others. Your server is similar to a chain. When it's under stress, some component will not be able to keep up with the others. This results in a degradation of overall performance. The weak link in the server is referred to as a bottleneck because it's the component that slows everything else down. As an administrator responsible for ensuring efficient operation of your Windows 2000 server, you need to determine the following two things:

Which component is causing the bottleneck?
Is the stress on the server typical enough that action is warranted either now or in the future?

As mentioned previously, under normal operation, only four system components greatly affect system performance: memory, processor, disk, and network. Therefore, you should monitor the counters that will tell you the most about how those four components affect system performance so you can determine the answer to the two diagnostic questions.

The biggest monitoring problem is not collecting the data, but interpreting it. Not only is it difficult to determine what a specific value for a particular counter means, it is also difficult to determine what it means in the context of other counters. The biggest difficulty is that no subsystem (disk, network, processor, or memory) exists in isolation. As a result, weaknesses in one might show up as weaknesses in another. Unless you take them all into consideration, you might end up adding another processor when all you need is more RAM.

Understanding how the subsystems interact is important to understanding the significance of the counter values that are recorded. For example, if you detect that your processor is constantly running at 90%, you might be tempted to purchase a faster processor (or another processor if you have a system board that accommodates more than one). However, it is important to look at memory utilization and disk utilization as well because the problem could be originating there instead. If you do not have enough memory, the processor must swap pages to the disk frequently. This results in high memory utilization, high disk utilization, and higher processor utilization. By purchasing more RAM, you could alleviate all those problems.
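The reasoning in this example can be sketched as a small decision function in Python. The thresholds are borrowed from the counter tables later in this section (20 pages per second, a disk queue longer than 2, and 80% processor time), while the function name and the order of the checks are assumptions made for this illustration rather than a definitive diagnostic procedure.

```python
def likely_bottleneck(cpu_percent, pages_per_sec, disk_queue_length):
    """Suggest which subsystem to investigate first.

    Memory is checked first because heavy paging also drives up disk
    and processor activity, which can make those subsystems look guilty.
    """
    if pages_per_sec >= 20:
        return "memory"
    if disk_queue_length > 2:
        return "disk"
    if cpu_percent >= 80:
        return "processor"
    return "none"


if __name__ == "__main__":
    # Busy processor, but heavy paging and a long disk queue as well.
    print(likely_bottleneck(cpu_percent=90, pages_per_sec=45, disk_queue_length=4))
    # Busy processor with little paging and a short disk queue.
    print(likely_bottleneck(cpu_percent=90, pages_per_sec=2, disk_queue_length=0.5))
```

In the first call the processor is at 90%, yet the sketch points at memory, which mirrors the case in which more RAM, not a faster processor, is the appropriate fix.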

That one example illustrates how no one piece of information is enough to analyze your performance problems or your solution. You must monitor the server as a whole unit by putting together the counters from a variety of objects. Only then will you be able to see the big picture and solve problems that might arise.

The recommended method of monitoring is to use a counter log, which captures data over a period of time. This helps you eliminate questions of whether the current stress on the server is typical. If you log over a period of a week or a month and consistently see a certain component under excessive load, you can be sure the stress is typical.
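One simple way to judge whether logged stress is typical is to measure how much of the logging period a counter spent above its threshold. The Python sketch below does this for a list of logged samples. The function names and the 50% cutoff for calling the load "typical" are assumptions for illustration; the sample values are made up.

```python
def fraction_over(samples, threshold):
    """Fraction of logged samples that exceeded the threshold."""
    if not samples:
        raise ValueError("no samples were logged")
    return sum(1 for value in samples if value > threshold) / len(samples)


def is_typical_stress(samples, threshold, min_fraction=0.5):
    """Treat the stress as typical if enough of the samples exceeded the threshold.

    The 0.5 default is an assumed cutoff, not an official guideline.
    """
    return fraction_over(samples, threshold) >= min_fraction


if __name__ == "__main__":
    week_of_cpu = [90, 85, 40, 95]  # made-up % Processor Time samples
    print(fraction_over(week_of_cpu, 80))
    print(is_typical_stress(week_of_cpu, 80))
```

A single spike in an otherwise quiet log produces a low fraction, whereas a component that is consistently overloaded produces a fraction close to 1.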

Counters to Monitor for the Exchange Organization

To keep a watchful eye on your Exchange servers, you need to monitor several key counters from each of the four areas previously discussed.

Table 8.2 presents the counters that you should monitor for memory use.

Table 8.2. Useful Memory Counters to Monitor in System Monitor
Counter	Description
Memory \ Pages/Sec	This counter displays the number of hard page faults occurring per second. A hard page fault occurs when data or code is not in memory and must be retrieved from the hard drive. Each time this happens, disk activity is required, and the process is temporarily halted (because disk access is much slower than RAM access). A bottleneck in memory is likely when this number is 20 or greater.
Memory \ Available Bytes	This counter indicates the total amount of physical memory available to processes running on the computer. This number's significance varies as the amount of memory in the computer varies, but if this number is less than 4MB, you generally have a memory deficiency.
Paging File (_Total) \ % Usage	This counter measures the total amount of the available paging file that is currently in use on the computer. Sustained high values indicate a paging file that is too small or that more memory is required in the computer. A sustained value above 75 is usually indicative of a problem.
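The three guideline values in Table 8.2 can be checked together, as in the following Python sketch. The function name and the sample readings are hypothetical; the thresholds are those given in the table, and the readings should be sustained values rather than momentary spikes.

```python
MB = 1024 * 1024


def memory_warnings(pages_per_sec, available_bytes, paging_file_usage_percent):
    """Return the names of the memory counters that exceed the Table 8.2 guidelines."""
    warnings = []
    if pages_per_sec >= 20:
        warnings.append(r"Memory\Pages/Sec")
    if available_bytes < 4 * MB:
        warnings.append(r"Memory\Available Bytes")
    if paging_file_usage_percent > 75:
        warnings.append(r"Paging File(_Total)\% Usage")
    return warnings


if __name__ == "__main__":
    print(memory_warnings(5, 200 * MB, 30))  # healthy readings
    print(memory_warnings(25, 2 * MB, 80))   # all three guidelines exceeded
```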

Table 8.3 presents the counters that you should monitor for processor use.

Table 8.3. Useful Processor Counters to Monitor in System Monitor
Counter	Description
Processor \ % Processor Time	This counter indicates the amount of time the processor spends executing nonidle threads. This is an indication of how busy the processor is. The processor for a single-processor system should not exceed 80% capacity for a significant period of time. The processors in a multiple-processor system should not exceed 50% for a significant period of time. High processor utilization can be an indication of processor bottlenecks, but it could also indicate lack of memory.
System \ Processor Queue Length	This counter displays the number of processes that are ready but waiting to be serviced by the processor(s). There is a single queue for all processors, even in a multiprocessor environment. A sustained queue of more than 2 generally indicates processor congestion.
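The guidelines in Table 8.3 differ for single-processor and multiple-processor systems, which the Python sketch below takes into account. It averages the per-processor instances in the same way that the _Total instance does and then applies the 80% or 50% limit along with the queue length guideline. The function names and the sample readings are hypothetical.

```python
def total_processor_time(per_processor):
    """Average % Processor Time across all processor instances (like _Total)."""
    return sum(per_processor) / len(per_processor)


def processor_limit(processor_count):
    """Sustained % Processor Time guideline: 80 for one processor, 50 for several."""
    return 80.0 if processor_count == 1 else 50.0


def processor_congested(per_processor, queue_length):
    """Return True if either Table 8.3 guideline is exceeded."""
    busy = total_processor_time(per_processor) > processor_limit(len(per_processor))
    queued = queue_length > 2
    return busy or queued


if __name__ == "__main__":
    print(processor_congested([60.0, 70.0], queue_length=1))  # two processors
    print(processor_congested([70.0], queue_length=1))        # one processor
```

A reading of 70% is acceptable on a single-processor server under these guidelines, but an average of 65% across two processors is not.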

Table 8.4 presents the counters that you should monitor for disk usage.

Table 8.4. Useful Disk Counters to Monitor in System Monitor
Counter	Description
Physical Disk \ Avg. Disk Sec/Transfer	This counter indicates the average time, in seconds, that each disk transfer takes, which shows how quickly data is being moved to and from the disk. High values might indicate that the disk is retrying requests because of a long queue or a disk failure. By comparing the current value to a baseline value, you can determine where differences exist.
Physical Disk \ Avg. Disk Queue Length	This counter indicates the number of disk requests that are queued and waiting to be handled. A sustained queue of more than 2 generally indicates disk congestion.
Physical Disk \ Disk Bytes/Sec	This counter indicates disk throughput and displays the rate at which data is being moved on the disk.
Physical Disk \ Disk Transfers/Sec	This counter indicates the number of read and write operations that the disk is performing each second. Sustained values over 50 indicate a possible disk bottleneck.

When using RAID arrays, you should monitor the Logical Disk counters. The Logical Disk counters represent the overall status of the RAID array, which is composed of several individual disks.

Table 8.5 presents the counters that you should monitor for network usage.

Table 8.5. Useful Network Counters to Monitor in System Monitor
Counter	Description
Network Interface \ Output Queue Length	This counter measures the length of the output packet queue (in packets). A sustained value greater than 2 means that packets are being delayed; you should find and eliminate the bottleneck, if possible.
Network Interface \ Packets Outbound Discarded	This counter measures the number of outbound packets that were chosen to be discarded even though no errors had been detected to prevent transmission. One possible reason for discarding packets could be to free up buffer space.
Network Interface \ Bytes Total/Sec	This counter indicates the total throughput of the network interface. It can be used for general capacity planning and does not necessarily indicate a network bottleneck. Higher numbers indicate more successful transmissions.
Network Segment \ Broadcast Frames Received/Sec	This counter displays the total broadcast frames being put out on the network and must be compared to earlier baselines to detect potential problems.
Network Segment \ % Network Utilization	This counter displays the percentage of available network bandwidth being used on the local segment. Sustained values over 40% result in a high number of collisions and reduced throughput capability.
IP \ Datagrams/Sec	This counter displays the rate at which IP datagrams (packets) are sent from and received at the network interface.
TCP \ Segments Received/Sec	This counter displays the rate at which segments are received on the network interface. A sustained lower than normal value might indicate an excessive amount of broadcast traffic on the network.
TCP \ Segments Retransmitted/Sec	This counter displays the rate at which segments are being retransmitted. Sustained higher than normal values are indicative of a saturated network or a hardware malfunction on the network.
Redirector \ Network Errors/Sec	This counter displays the number of serious network errors as indicated by the Redirector or one or more servers having communications problems.
Server \ Pool Paged Failures	This counter displays the number of paged pool allocation failures that have occurred. A sustained high or increasing value is an indicator of too little RAM or too small a paging file.
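When the % Network Utilization counter is not available, a rough utilization figure can be estimated from Bytes Total/Sec and the bandwidth of the interface. The Python sketch below shows the arithmetic and compares the result with the 40% guideline from Table 8.5. The function names are hypothetical, and the calculation is an approximation: it assumes that the reading is compared against the one-way bandwidth of the link, which overstates utilization on a full-duplex connection.

```python
def network_utilization_percent(bytes_total_per_sec, bandwidth_bits_per_sec):
    """Estimate utilization as a percentage of the link bandwidth.

    Bytes are converted to bits (x 8) before dividing by the bandwidth.
    """
    return bytes_total_per_sec * 8 * 100 / bandwidth_bits_per_sec


def over_utilized(bytes_total_per_sec, bandwidth_bits_per_sec, limit_percent=40):
    """Return True if the estimated utilization exceeds the 40% guideline."""
    used = network_utilization_percent(bytes_total_per_sec, bandwidth_bits_per_sec)
    return used > limit_percent


if __name__ == "__main__":
    fast_ethernet = 100_000_000  # 100Mbps link
    print(network_utilization_percent(5_000_000, fast_ethernet))
    print(over_utilized(6_250_000, fast_ethernet))
```

For example, 5,000,000 bytes per second on a 100Mbps link is 40,000,000 bits per second, or 40% of the available bandwidth, which sits right at the guideline value.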