Monitoring System Performance


Capacity analysis is not about how much information you can collect; it is about collecting the appropriate system health indicators and the right amount of information. Without a doubt, you can capture and monitor an overwhelming amount of information from performance counters. There are more than 1,000 counters, so you'll want to choose carefully what to monitor. Otherwise, you may collect so much information that the data will be hard to manage and difficult to decipher. Keep in mind that more is not necessarily better with regard to capacity analysis; the process is about efficiency. Therefore, you need to tailor your capacity-analysis monitoring as closely as possible to how the server is configured.

Every Windows Server 2003 system has a common set of resources that can affect performance, reliability, stability, and availability. For this reason, it's important that you monitor this common set of resources.

Four resources make up the common set: memory, processor, disk subsystem, and network subsystem. They are also the most common contributors to performance bottlenecks. A bottleneck can be defined in two ways. The most common perception of a bottleneck is that it is the slowest part of your system. It can be either hardware or software, though hardware is generally faster than software. When a resource is overburdened or simply not equipped to handle higher workload capacities, the system may experience a slowdown in performance. For any system, the slowest component is, by definition, the bottleneck. For example, a Web server may be equipped with ample RAM, disk space, and a high-speed network interface card (NIC), but if the disk subsystem has older drives that are relatively slow, the Web server may not be able to handle requests effectively. The bottleneck (that is, the antiquated disk subsystem) can drag the other resources down.

A less common, but equally important, form of bottleneck is one where a system has significantly more RAM, processors, or other resources than the application requires. In these cases, the system creates extremely large pagefiles and has to manage very large disk or memory sets, yet never uses the resources. When an application needs to access memory, processors, or disks, the system may be busy managing the idle resources, creating an unnecessary bottleneck caused by having too many resources allocated to a system. Thus, performance problems can stem not only from having too few resources, but also from having too many resources allocated to a system.

In addition to the common set of resources, the functions that a Windows Server 2003 system performs can influence what you should consider monitoring. For example, you would monitor certain aspects of system performance on a file server differently than you would on a domain controller (DC). Windows Server 2003 can serve in many functional roles (such as file and print sharing, application sharing, database functions, Web server duties, domain controller roles, and more), and it is important to understand all the roles that pertain to each server system. By identifying these functions and monitoring them along with the common set of resources, you gain much greater control over and understanding of the system.

The following sections go into more depth on the specific counters you should monitor for the components that make up the common set of resources. It's important to realize, though, that there are several other counters you should consider monitoring in addition to the ones described in this chapter. Consider the following material a baseline: the minimum set of counters with which to begin your capacity-analysis and performance-optimization procedures.

Later in the chapter, we will identify several server roles and cover monitoring baselines, describing the minimum number of counters to monitor.

Key Elements to Monitor

The key elements to begin your capacity analysis and performance optimization are the common contributors to bottlenecks. They are memory, processor, disk subsystem, and network subsystem.

Monitoring System Memory

Available system memory is usually the most common source of performance problems on a system. The reason is simply that systems are often configured with an incorrect amount of memory, and Windows Server 2003, by design, tends to consume a lot of it. Fortunately, the easiest and most economical way to resolve the performance issue is to configure the system with additional memory. This can significantly boost performance and improve reliability.

When you first start the Performance Console in Windows Server 2003, three counters are monitored. One of these counters is an important one related to memory: the Pages/sec counter. The Performance Console's default setting is illustrated in Figure 35.4. It shows three counters being monitored in real-time. The purpose is to provide a simple and quick way to get a basic idea of system health.

There are many significant counters in the memory object that could help determine system memory requirements. Most network environments shouldn't need to consistently monitor every single counter to get an accurate representation of performance. For long-term monitoring, two very important counters can give you a fairly accurate picture of memory requirements: Page Faults/sec and Pages/sec. These two memory counters alone can indicate whether the system is configured with the proper amount of memory.

Systems experience page faults when a process requires code or data that it can't find in its working set. A working set is the amount of memory committed to a particular process. In this case, the process has to retrieve the code or data from another part of physical memory (referred to as a soft fault) or, in the worst case, from the disk subsystem (a hard fault). Systems today can handle a large number of soft faults without significant performance hits. However, because hard faults require disk subsystem access, they can force the process to wait significantly, which can drag performance to a crawl. The gap between memory and disk subsystem access speeds spans several orders of magnitude, even with the fastest drives available.
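A rough back-of-the-envelope calculation illustrates that gap. The latency figures below are assumptions chosen for illustration (roughly 100 nanoseconds for a memory access, 10 milliseconds for a random disk access), not measured values:

```python
# Assumed latencies for illustration only; real hardware varies widely.
RAM_ACCESS_NS = 100          # soft fault: resolved from physical memory
DISK_ACCESS_NS = 10_000_000  # hard fault: resolved from disk (~10 ms)

def fault_stall_ms(soft_faults, hard_faults):
    """Approximate total stall time, in milliseconds, for a mix of faults."""
    total_ns = soft_faults * RAM_ACCESS_NS + hard_faults * DISK_ACCESS_NS
    return total_ns / 1_000_000

# Hundreds of soft faults cost far less than a handful of hard faults:
print(fault_stall_ms(soft_faults=500, hard_faults=0))  # 0.05 ms
print(fault_stall_ms(soft_faults=0, hard_faults=5))    # 50.0 ms
```

Under these assumptions, five hard faults stall the process a thousand times longer than five hundred soft faults, which is why the hard-fault counters deserve the closer watch.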

The Page Faults/sec counter reports both soft and hard faults. It's not uncommon to see this counter displaying rather large numbers. Depending on the workload placed on the system, this counter can display several hundred faults per second. When it gets beyond several hundred page faults per second for long durations, you should begin checking other memory counters to identify whether a bottleneck exists.

Probably the most important memory counter is Pages/sec. It reveals the number of pages read from or written to disk and is therefore a direct representation of the number of hard page faults the system is experiencing. Microsoft recommends upgrading the amount of memory in systems that are seeing Pages/sec values consistently averaging above 5 pages per second. In actuality, you'll begin noticing slower performance when this value is consistently higher than 20. So, it's important to carefully watch this counter as it nudges higher than 10 pages per second.

Note

The Pages/sec counter is also particularly useful in determining whether a system is thrashing. Thrashing is a term used to describe systems experiencing more than 100 pages per second. Thrashing should never be allowed to occur on Windows Server 2003 systems because the reliance on the disk subsystem to resolve memory faults greatly affects how efficiently the system can sustain workloads.
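The thresholds given above (5, 20, and 100 pages per second) can be folded into a simple classification helper for logged averages. This is an illustrative sketch; the function name and labels are our own, not Microsoft terminology:

```python
def memory_pressure(pages_per_sec):
    """Classify a sustained Pages/sec average against the chapter's rough
    thresholds: above 5 is worth watching, above 20 is noticeably slow,
    and above 100 indicates thrashing."""
    if pages_per_sec > 100:
        return "thrashing"
    if pages_per_sec > 20:
        return "degraded"
    if pages_per_sec > 5:
        return "watch"
    return "healthy"

for reading in (3, 12, 45, 150):
    print(reading, memory_pressure(reading))
```

Feeding sustained averages (rather than momentary spikes) into a check like this mirrors the chapter's advice to react to values that stay high for long durations.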


Analyzing Processor Usage

Most often the processor resource is the first one analyzed when there is a noticeable decrease in system performance. For capacity-analysis purposes, you should monitor two counters: % Processor Time and Interrupts/sec.

The % Processor Time counter indicates the percentage of overall processor utilization. If more than one processor exists on the system, an instance for each one is included along with a total (combined) value counter. If this counter averages a usage rate of 50% or greater for long durations, you should first consult other system counters to identify any processes that may be using the processors improperly, or consider upgrading the processor or processors. Generally speaking, consistent utilization in the 50% range doesn't necessarily adversely affect how the system handles given workloads. When average processor utilization rises above 65%, however, performance may become intolerable.

The Interrupts/sec counter is also a good gauge of processor health. It indicates the number of device interrupts (either hardware or software driven) that the processor is handling per second. Like the Page Faults/sec counter mentioned in the "Memory" section, this counter may display very high numbers (in the thousands) without significantly impacting how the system handles workloads.
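As a sketch of how the % Processor Time guidance might be applied to a series of logged samples, the helper below averages the readings and grades the result. The 50% and 65% cutoffs come from the discussion above; the labels are illustrative:

```python
def cpu_assessment(samples):
    """Average a series of % Processor Time samples and grade the result:
    sustained averages of 50% or more warrant investigation; above 65%,
    performance may become intolerable."""
    avg = sum(samples) / len(samples)
    if avg > 65:
        return avg, "likely intolerable"
    if avg >= 50:
        return avg, "investigate"
    return avg, "acceptable"

print(cpu_assessment([40, 45, 50, 55]))   # averages below 50%
print(cpu_assessment([60, 70, 75, 80]))   # averages well above 65%
```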

Evaluating the Disk Subsystem

Hard disk drives and hard disk controllers are the two main components of the disk subsystem. Windows Server 2003 provides Performance Console objects and counters only for hard disk statistics; some manufacturers, however, may provide add-in counters to monitor their hard disk controllers. The two objects that gauge hard disk performance are Physical Disk and Logical Disk. Unlike its predecessor (Windows 2000), Windows Server 2003 automatically enables the disk objects by default when the system starts.

Although the disk subsystem components are becoming more and more powerful, they are often a common bottleneck because their speeds are exponentially slower than other resources. The effects, though, may be minimal and maybe even unnoticeable, depending on the system configuration.

Monitoring with the Physical and Logical Disk objects does come with a small price. Each object requires a little resource overhead when you use them for monitoring. As a result, you should keep them disabled unless you are going to use them for monitoring purposes. To deactivate the disk objects, type diskperf -n. To activate them at a later time, use diskperf -y or diskperf -y \\mycomputer to enable them on remote machines that aren't running Windows Server 2003. Windows Server 2003 is also very flexible when it comes to activating or deactivating each object separately. To specify which object to enable or disable, use a d for the Physical Disk object, or a v for the Logical Disk object.

To minimize system overhead, disable the disk performance counters if you don't plan on monitoring them in the near future. For capacity-analysis purposes, though, it's important to always watch the system and keep informed of changes in usage patterns. The only way to do this is to keep these counters enabled.

So, what specific disk subsystem counters should be monitored? The most informative counters for the disk subsystem are % Disk Time and Avg. Disk Queue Length. The % Disk Time counter monitors the time that the selected physical or logical drive spends servicing read and write requests. The Avg. Disk Queue Length counter monitors the number of requests not yet serviced on the physical or logical drive. Its value is an interval average, a mathematical representation of the delays the drive is experiencing. If this value is frequently greater than 2, the disks are not equipped to service the workload, and delays in performance may occur.
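Because Avg. Disk Queue Length is an interval average, a series of sampled queue depths reduces to a single average that can be checked against the rule-of-thumb threshold of 2. The helper below is an illustrative sketch, not a Windows API:

```python
def disk_queue_overloaded(queue_samples, threshold=2):
    """Return the interval-average queue length and whether it exceeds
    the rule-of-thumb threshold (about 2 outstanding requests per disk)."""
    avg = sum(queue_samples) / len(queue_samples)
    return avg, avg > threshold

print(disk_queue_overloaded([1, 3, 4, 2, 5]))  # averages 3.0: overloaded
print(disk_queue_overloaded([0, 1, 2, 1, 1]))  # averages 1.0: keeping up
```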

Monitoring the Network Subsystem

The network subsystem is by far one of the most difficult subsystems to monitor because of the many different variables. The number of protocols used in the network, the network interface cards, network-based applications, topologies, subnetting, and more play vital roles in the network, but they also add to its complexity when you're trying to determine bottlenecks. Each network environment has different variables; therefore, the counters that you'll want to monitor will vary.

The information that you'll want to gain from monitoring the network pertains to network activity and throughput. You can find this information with the Performance Console alone, but it will be difficult at best. Instead, it's important to use other tools, such as Network Monitor, in conjunction with the Performance Console to get as accurate a representation of network performance as possible. You may also consider using third-party network analysis tools, such as sniffers, to ease monitoring and analysis efforts. Using these tools simultaneously can broaden the scope of monitoring and more accurately depict what is happening on the wire.

Because the TCP/IP suite is the underlying set of protocols for a Windows Server 2003 network subsystem, this discussion of capacity analysis focuses on that protocol. The TCP/IP counters are added when the protocol is installed, which it is by default.

There are several different network performance objects relating to the TCP/IP protocol, including ICMP, IPv4, IPv6, Network Interface, TCPv4, UDPv6, and more. Other counters such as FTP Server and WINS Server are added after these services are installed. Because entire books are dedicated to optimizing TCP/IP, this section focuses on a few important counters that you should monitor for capacity-analysis purposes.

First, examining error counters, such as Network Interface: Packets Received Errors or Packets Outbound Errors, is extremely useful in determining whether traffic is traversing the network cleanly. The greater the number of errors, the more packets must be retransmitted, causing more network traffic. If a high number of errors persists on the network, throughput will suffer. This may be caused by a bad NIC, unreliable links, and so on.
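One way to make the error counters actionable is to express errors as a percentage of total traffic over the same interval. The function below is an illustrative sketch built from two Network Interface-style readings:

```python
def packet_error_rate(packets_received, receive_errors):
    """Percentage of inbound packets that arrived with errors, computed
    from good-packet and error-packet counts taken over the same interval
    (e.g., Network Interface: Packets Received Errors)."""
    total = packets_received + receive_errors
    return 100.0 * receive_errors / total if total else 0.0

# 500 errored packets out of 100,000 total is a 0.5% error rate:
print(packet_error_rate(99_500, 500))  # 0.5
```

A rate that stays elevated across many intervals is a stronger signal of a bad NIC or unreliable link than any single raw error count.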

If network throughput appears to be slowing because of excessive traffic, you should keep a close watch on the traffic being generated from network-based services such as the ones described in Table 35.3.

Table 35.3. Network-Based Service Counters to Monitor Network Traffic

Counter                            Description
NBT Connection: Bytes Total/sec    Monitors the network traffic generated by NBT connections
Redirector: Bytes Total/sec        Monitors the data bytes processed by the Redirector service
Server: Bytes Total/sec            Monitors the network traffic generated by the Server service





Microsoft Windows Server 2003 Unleashed (R2 Edition)
ISBN: 0672328984
Year: 2006
Pages: 499
