The first required task is to understand which subsystems are critical to scrutinize. Begin with one or more of the following subsystems: processor, memory, disk, and network. A weakness in any one of these areas can create a significant bottleneck that affects one or more of the others. Examining each in detail will help you determine what sort of impact might be incurred.
The processor is the most obvious component that is critical to the performance of the system. But with a long list of potential counters, you need to pare the list down to what is important to monitor and why. There are multiple counters that can be monitored for potential CPU bottlenecks, but the following three cover the majority of issues:
When we refer to an object, counter, or instance in this chapter, the format will be as follows: Object\Counter\Instance.
Processor\% Processor Time\_Total This counter shows the real-time utilization of the processor or processors. A value that is consistently above 50 percent demonstrates an emerging bottleneck at the processor. Consistent values at or above 75 percent require additional CPUs or farm servers to reduce the load on the processors being monitored.
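To make the thresholds concrete, here is a minimal sketch of a triage rule based on the 50 percent and 75 percent values above; the function name and return labels are illustrative, not part of any monitoring API.

```python
def cpu_pressure(avg_processor_time: float) -> str:
    """Classify a sustained Processor\\% Processor Time\\_Total reading."""
    if avg_processor_time >= 75:
        return "critical"  # add CPUs or farm servers
    if avg_processor_time > 50:
        return "warning"   # emerging bottleneck
    return "ok"
```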
System\Processor Queue Length\(N/A) This counter measures how many items are waiting to be processed. The maximum this counter should read for an extended period is the number of CPUs multiplied by 2. So in the case of a two-processor system, a value of four or less is acceptable. Sustained values above four (in this example) require upgrading to CPUs with more L2 cache, adding processors, or scaling out by adding more servers to the same farm role.
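The queue-length rule of thumb, expressed as a short helper (names are illustrative):

```python
def max_queue_length(cpu_count: int) -> int:
    """Red-line value for System\\Processor Queue Length: number of CPUs x 2."""
    return cpu_count * 2

def queue_is_backed_up(sustained_queue_length: float, cpu_count: int) -> bool:
    """True when the sustained queue length exceeds the rule-of-thumb maximum."""
    return sustained_queue_length > max_queue_length(cpu_count)
```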
Processor\Interrupts/sec This counter measures the average rate at which the processor must service hardware interrupts from devices such as the network card, hard drive, mouse, and system clock. This counter should be monitored over a longer period of time, looking for upward trends. Less than 1,000 is acceptable for unhindered performance. A dramatic increase in this counter without a corresponding increase in system use can indicate faulty hardware. Use your vendor-provided system diagnostics to check for hardware anomalies.
In many cases, system administrators are tempted to "throw memory" at the problem. This can work in the short term, but a correctly diagnosed problem will help you to avoid spending potentially thousands of dollars without actually resolving the issue. Monitoring memory counters can reap significant rewards.
Memory\Pages/sec\(N/A) This counter measures the number of times per second that memory must either be written to or read from the hard disk. Consistent values above 150 to 200 typically mean the system is hard page faulting. This means the server is swapping content from memory to the pagefile on the disk or is thrashing for some other reason. Even the newest, fastest drives are still orders of magnitude slower than system memory, which can potentially cause a severe system impact. This counter should be monitored over a longer period of time, as normal activity can cause short periods of paging.
Memory\Page Faults/sec\(N/A) This counter measures the number of hard and soft page faults per second. Soft page faults, in which the needed page is found elsewhere in physical memory, are not critical because modern processors can handle many thousands of them per second. Hard page faults, which require reading from disk, will create a serious bottleneck even in small numbers because of the very slow speed of disk compared to memory. To determine whether a system is experiencing hard page faults, multiply the value of Memory\Pages/sec by 4,096 (the size of a memory page in bytes) to get the paging throughput in bytes per second, and compare that figure to the disk's read throughput (PhysicalDisk\Disk Read Bytes/sec). If these values are approximately equal, most disk reads are servicing page faults and the system is paging excessively. To resolve this issue, increase the amount of physical memory.
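A page is 4,096 bytes, so Memory\Pages/sec multiplied by 4,096 gives paging throughput in bytes per second; comparing that figure to total disk read throughput shows whether reads are mostly paging. A sketch of that arithmetic (the function name and 20 percent tolerance are my own illustrative choices):

```python
PAGE_SIZE = 4096  # size of a memory page in bytes on 32-bit Windows

def reads_are_mostly_paging(pages_per_sec: float,
                            disk_read_bytes_per_sec: float,
                            tolerance: float = 0.20) -> bool:
    """Compare paging throughput (pages/sec x page size) with total disk
    read throughput; if they are approximately equal, most disk reads
    are servicing hard page faults."""
    if disk_read_bytes_per_sec == 0:
        return False
    ratio = (pages_per_sec * PAGE_SIZE) / disk_read_bytes_per_sec
    return abs(ratio - 1.0) <= tolerance
```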
Memory\Available Mbytes\(N/A) This counter measures the amount of physical memory available to the system. Although this counter is something you should obviously monitor, it is often overlooked. You will find it helpful to monitor this alongside other predictors. A low value, such as less than 10 percent of total physical memory, over even a short period of time indicates a dire need for additional memory. The longer a low physical memory condition persists, the greater the impact on system performance due to the use of the pagefile.
Memory\Pool Nonpaged Bytes\(N/A) This counter measures the number of bytes that cannot be paged to disk and must remain in physical memory. It is not widely monitored but has a drastic effect on system performance. Monitor it in combination with Available Mbytes to determine whether an application is consuming large amounts of memory that cannot be paged. This condition could indicate either a need for additional memory or a poorly written application. You can also monitor specific SharePoint Server 2007 processes directly by using Process\Pool Nonpaged Bytes\ with the process instance. The most important processes are Office SharePoint Server Search and Windows SharePoint Services Search (both called mssearch.exe), Windows SharePoint Services Timer (owstimer.exe), Windows SharePoint Services Tracing (wsstracing.exe), and Internet Information Services (IIS) (inetinfo.exe). The two largest consumers of memory will be the search processes and IIS. If any combination of these processes claims 90 percent or more of the available nonpaged bytes, IIS will stop serving requests without reporting any obvious symptoms. To resolve the issue, restart IIS, and then determine which processes are causing the excessive use of nonpaged memory.
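A minimal sketch of the 90 percent check, assuming you have already sampled Process\Pool Nonpaged Bytes for each process of interest (the dictionary layout and function name are illustrative):

```python
def nonpaged_pool_exhausted(process_bytes: dict, total_nonpaged_bytes: int,
                            threshold: float = 0.90) -> bool:
    """True when the monitored processes together claim 90 percent or
    more of the nonpaged pool -- the point at which IIS can stop
    serving requests without reporting errors."""
    return sum(process_bytes.values()) >= threshold * total_nonpaged_bytes
```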
There are two types of disk counters: physical and logical. Physical disk counters refer to a disk without regard for grouping configurations, such as concatenations of disks or RAID sets. Logical disk counters report only on the activity of the logical disk within a grouping. A great deal of performance benefit can be gained by tracking down and resolving disk issues: because even the newest hard drives are orders of magnitude slower than memory or processor, small gains return large rewards. Note that if you are focusing your monitoring on disk-related issues, you should log your data to another server to ensure you are not adding load to your disk subsystem.
PhysicalDisk\% Disk Time\DriveLetter This counter measures the percentage of time within the reporting window that the physical drive is active. If this counter consistently shows values above 80 percent, there is a lack of system memory or a disk controller issue. There are other counters you will use in conjunction with this one to determine the fault.
PhysicalDisk\Current Disk Queue Length\DriveLetter This counter measures the number of requests waiting to be serviced by the disk at the instant of the poll. Disk drives with multiple spindles can handle multiple requests. If the value of this counter stays above two times the number of spindles for a sustained period of time, along with a high % Disk Time, the disk subsystem requires an upgrade. Typically, individual drives have only a single spindle, so for a RAID set you should add more disks; if this is a single drive, consider moving to a RAID 0 or RAID 5 configuration.
PhysicalDisk\Avg. Disk sec/Transfer\DriveLetter This counter measures the average time, in seconds, of each disk transfer. The value for this counter should remain below 0.3. Higher values indicate possible failures of the disk controller to access the drive. If this occurs, confirm that the drive, as well as the disk controller, is functioning normally.
The counters just listed for physical drives pertain to logical disk as well, and in the same manner. Differences occur with RAID sets and dynamic disks. With a RAID set, it is possible to have greater than 100 % Disk Time. Use the Avg Disk Queue Length counter to determine the requests pending for the disks. When dynamic disks are in use, logical counters are removed. When you have a dynamic volume with more than one physical disk, instances will be listed as 'Disk 0 C:', 'Disk 1 C:', 'Disk 0 D:', and so on. In situations where you have multiple volumes on a single drive, instances will be listed as '0 C: D:'.
There are differences when monitoring disks on a Storage Area Network (SAN). A SAN differs from a physical disk in that you must be concerned with how many disks make up the logical unit number (LUN); your SAN administrator will be able to provide that information. Most SANs return a value to the Performance tool as if a single physical disk were being monitored. This number is misleading because it is the additive value of all the disks in the LUN. To determine the per-disk value, divide the Performance tool result by the number of disks in the LUN. Typically, physical disk counters and logical disk counters return the same value on a SAN. It is a good idea to check with your SAN team before you start using the Performance tool, because tools written specifically for the SAN hardware generally give better information. When that data is not available in a usable format, however, the Performance tool can be very useful.
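Applying the division to a queue-length reading, for example (names and sample numbers are illustrative):

```python
def san_per_disk_value(reported_value: float, disks_in_lun: int) -> float:
    """A SAN LUN reports the additive value of all member disks;
    divide by the disk count to estimate the per-disk figure."""
    return reported_value / disks_in_lun

# A reported queue length of 16 on an 8-disk LUN works out to 2 requests
# per disk, which is right at the (2 x spindles) rule-of-thumb limit.
```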
Many companies employ server administrators who must wear multiple hats. It is not uncommon for the person who maintains servers to also maintain personal computers and the network. Windows Server exposes some very good counters for helping to track network-related issues. If you must play the role of the network engineer in a smaller company, be aware that there are a multitude of helpful counters. In large companies with distinct network and server teams, these counters can be invaluable in coordinating with other groups to resolve complex challenges.
In most modern servers, the network card has a processor to handle the moving and encoding of network traffic. However, you might still administer systems that do not have server-level network cards. It is important to monitor processor and memory along with network statistics to determine the root cause of problems that arise. Unlike other counters previously covered in this chapter, network monitoring is done at different layers of the OSI model ranging from the Data-link layer up to the Presentation layer. Because most companies use Ethernet as the network medium and TCP/IP as the protocol, that will be the focus of this section. The TCP/IP layer model maps directly to the OSI model. All layers are monitored with different counters due to the unique nature of each.
More Info For more information on how TCP/IP is implemented on Microsoft Windows platforms, see the online book titled TCP/IP Fundamentals for Microsoft Windows, found at http://www.microsoft.com/technet/itsolutions/network/evaluate/technol/tcpipfund/tcpipfund.mspx. For a map of the TCP/IP and OSI models, go to the following Web site: http://www.microsoft.com/library/media/1033/technet/images/itsolutions/network/evaluate/technol/tcpipfund/caop0201_big.gif.
The data-link layer is the bottom layer in the TCP/IP protocol stack. Even though the processes within this layer depend on the physical medium (Ethernet, SONET, ATM, and so on) and its device drivers, the information is passed up to the TCP/IP stack. It is crucial that you monitor these counters when exploring network-related bottlenecks.
Network Interface\Bytes Sent, Received, and Total/sec This counter measures the number of bytes sent, the number of bytes received, or the sum of both passing in and out of the network interface per second during the polling period. These counters can be monitored individually or as a total. Typically, the total is the important counter, unless a specific application has heavy data flow in one direction. Monitor these counters for a longer period during normal production hours. This approach will help you chart a baseline for network activity so that you will be able to determine whether issues are network related.
A good rule of thumb for maximum expected throughput is ((Network Card Speed x 2) / 8) x 75%. Most networks use switches that allow full duplex (sending and receiving at the same time), which is why the speed is doubled in the formula. Divide the result by 8 to convert bits to bytes. Only 75 percent of the listed speed is counted because of TCP/IP's and Ethernet's error checking and packet assembly/disassembly overhead. For a 100-Mbit Ethernet card, you can expect a maximum throughput of 18.75 megabytes (MB) per second. If applications or users are experiencing slow data-transfer speeds, confirm that your network cards are set to full duplex if you are in a switched environment. If you are not sure, set the card to auto-detect duplex or ask your network administrator for the correct settings.
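The rule of thumb is easy to verify. This sketch reproduces the ((speed x 2) / 8) x 75% calculation; the function name and parameters are illustrative.

```python
def max_expected_throughput_mb(link_speed_mbit: float,
                               full_duplex: bool = True,
                               efficiency: float = 0.75) -> float:
    """Rule-of-thumb throughput ceiling in megabytes per second: double
    the link speed for full duplex, divide by 8 to convert bits to
    bytes, then keep 75 percent to account for protocol overhead."""
    effective_mbit = link_speed_mbit * (2 if full_duplex else 1)
    return (effective_mbit / 8) * efficiency
```

For a 100-Mbit card this returns 18.75, matching the figure above.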
This is the first layer that is independent of the physical medium. The network layer handles the routing of packets across a heterogeneous network. When referring to the OSI network model, this layer and its functions are referred to as layer 3.
IP\Datagrams (Forwarded, Received, Sent, Total)/sec As with the data-link layer, each of these counters should be monitored for a specified length of time during normal production hours so that a baseline can be established. The throughput of datagrams depends on a variety of factors. Problems at the network layer arise from the inability of the card and server to process packets quickly; server-class cards offload this work from the server. If a significant sustained increase occurs, consider upgrading your network cards to server-class cards or upgrading the speed of your network.
The transport layer is responsible for ensuring that packets arrive intact (retransmitting them when they do not), for congestion control, and for packet ordering. This layer does much of the heavy lifting in the network stack. Many network problems surface here, making this one of the most critical layers to monitor.
TCP\Segments (Received, Sent, Total)/sec Again, you should establish a baseline level of performance with these counters to help assess future problems. Transport-layer work is handled by the servers on each end of the communication and is not an intensive task for modern servers.
TCP\Segments Retransmitted/sec If this value shows a sudden significant increase, check the status of your network card. Retransmissions occur when duplexing is set incorrectly or when there are issues with a network route.
This layer ensures that the information from the network layer is available in the correct format to the system. It ensures translation and encryption or decryption is performed before the data is passed.
There are two types of counters under this heading to be concerned with: Server and Redirector. The Server object is for monitoring the machine serving the information; the Redirector object is used when monitoring client machines. Either machine, or neither, could be a server in the hardware sense; the objects refer to roles in the client-server paradigm.
Server\Nonpaged Pool Failures The number of times an allocation from the nonpaged pool has failed; this is the memory that cannot be paged to disk. After you have established a baseline for this counter, consider upgrading the memory in your system if the value increases by 10 to 20 percent.
Server\Work Item Shortages The number of times during the polling interval that no work item was available, or could be allocated, to service an incoming request. This counter should remain steady, with a value under 3. If it does not, consider increasing the value of InitWorkItems or MaxWorkItems in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer. If the values do not exist, create them as REG_DWORD values in decimal format. InitWorkItems can range from 1 through 512, and MaxWorkItems can range from 1 through 65,535. A common starting point for MaxWorkItems is 4,096; double the value until the counter stays below 3.
Redirector\Server Sessions Hung The number of sessions either hung or unable to be processed due to a server that is too busy. Any number above zero indicates some kind of bottleneck, but do not be concerned until the number is higher than one per second. If the number is higher, check the other counters for memory and processor on the server side to help trace the issue.
It is helpful to involve networking staff when tracking down possible network issues. Network engineers understand the network and are familiar with how it should respond. Be cautious when monitoring network counters without the cooperation of the network team.
There are many references to baselining in this chapter, but what exactly does it mean? Baselining means recording performance statistics for a relevant set of counters during normal usage. Gather statistics during time frames with heavy, light, and no usage, covering both regular operating hours and off-peak times; this will help define what is normal for an individual system. There are quite a few counters to choose from when monitoring your front-end SharePoint servers, but the most important ones are listed in Table 13-1.
| Counter | Red-line value |
| --- | --- |
| Processor\% Processor Time\_Total | Consistently above 50 percent; at or above 75 percent requires action |
| System\Processor Queue Length\(N/A) | < # of CPUs x 2 |
| PhysicalDisk\% Disk Time\DataDrive | Consistently above 80 percent |
| PhysicalDisk\Current Disk Queue Length\DataDrive | < # of Disks x 2 |
|  | There is no hard limit. Determine this total by baselining. |
| ASP.NET\Worker Processes Restarts | Any number above zero can indicate that problems exist. |
| .NET CLR Memory\% Time in GC | Time spent on garbage collection. Thresholds depend on many factors, but a value over 25% could indicate there are too many unreachable objects. |