Typical Processor-related Problems and Solutions

A processor bottleneck occurs when demand for processor time outstrips the supply available to the system or the applications deployed on it. Requests for processor time are queued, keeping CPU utilization high until the queue empties, and system response degrades as a result.

When processor utilization on a server is consistently high (90 percent or higher), processes typically queue up waiting for processor time, causing a bottleneck. Such a sustained level of processor usage is unacceptable for a server.

Let's discuss an example of high processor utilization. If you are monitoring an IIS server hosting a single Web site that relies upon a legacy COM+ application written in Visual Basic 6 to parse extensive XML documents, you may find that the COM+ application is consuming more than 90 percent of the processor's time. This high processor utilization by the COM+ application affects the Web application's ability to handle new connections to the site. If you understand the type of bottleneck (in this case, a processor bottleneck) and its root cause (a processor-hungry COM+ application), you can decide how to handle the resource problem. One solution may be to physically separate the COM+ application from the Web server; another is to convert the code to more efficient, faster-performing managed code.

NOTE
When examining processor usage, keep in mind the role of the computer and the type of work being done. High processor values on a SQL server are less desirable than on a Web server.

There are two methods for correcting most processor bottlenecks. The first is to add faster or additional processors to your system. The downside to this option is that it is not cost-effective and is only a temporary fix: the next surge in traffic to your Web site will leave you scrambling to add more hardware or replace the old servers with newer, faster ones. The other, more appropriate route is to analyze the software to see which specific process or portion of the application is causing the bottleneck. As a rule, you should always try to performance-tune your software before resorting to the more costly route of adding hardware. In addition to the counters found under the Processor object, there are counters under the System object that you should monitor when verifying the existence of a processor bottleneck.

System Object

The System object and its associated counters measure aggregate data for threads running on the processor. They provide valuable insights into your overall system performance. The following system counters are the most important to monitor.

  • Processor Queue Length

    The number of threads in the processor queue. Unlike the disk counters (discussed later in the chapter), this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload.

    One way to determine if a processor bottleneck exists with your application is to monitor the System\Processor Queue Length counter. A sustained queue length along with an over-utilized processor (90 percent and above) is a strong indicator of a processor bottleneck.

    When monitoring the Processor Queue Length counter, a sustained queue length of 2 or more per processor combined with high processor utilization confirms a bottleneck. If you find that the queue length is 2 or higher but your processor utilization is consistently low, you may be dealing with some form of processor blocking rather than a bottleneck.

    You can also monitor Processor\% Interrupt Time for an indirect indicator of the activity of disk drivers, network adapters, and other devices that generate interrupts.

  • Context Switches/sec

    The combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher-priority ready thread, or switches between user mode and privileged (kernel) mode to use an Executive or subsystem service. This counter is the sum of Thread\Context Switches/sec for all threads running on all processors in the computer and is measured in numbers of switches. There are context switch counters on both the System and Thread objects. The counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

    A system that experiences excessive context switching due to inefficient application code or poor system architecture can be extremely costly in terms of resource usage. Your goal should always be to decrease the amount of context switching occurring on your application or database servers. Context switches prevent the server from getting real work done: valuable processor resources are spent dealing with threads that can no longer run because they are blocked waiting for a logical or physical resource, or because they have put themselves to sleep. Symptoms of high context switching include lower throughput coupled with high CPU utilization, which begins to occur at switching levels of 15,000 per second or higher. You can determine whether context switching is excessive by comparing it with the value of Processor\% Privileged Time. If this counter is at 40 percent or more and the context-switching rate is high, you should investigate the cause of the high rate of context switches.

    Finally, when monitoring your system you should make sure that the System\Context Switches/sec counter, which reports system-wide context switches, is close to, if not identical to, the value provided by the _Total instance of the Thread\Context Switches/sec counter. Monitoring this over time can help you determine the range by which the two counters' values might vary. A short sketch after this list shows one way to apply the thresholds discussed above to logged counter data.
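
To make these rules of thumb concrete, the following minimal Python sketch applies them to a handful of counter samples, such as you might export from a performance log. It is an illustrative sketch only: the counter names match those discussed above, but the diagnose_processor helper and the sample values are hypothetical.

    # Heuristic processor-bottleneck check based on the thresholds above:
    # sustained utilization of 90 percent or more with a queue length of
    # 2+ per processor, and context-switch rates of 15,000/sec or more
    # combined with % Privileged Time at 40 percent or more.

    def diagnose_processor(samples, num_processors):
        def avg(key):
            return sum(s[key] for s in samples) / len(samples)

        utilization = avg("% Processor Time")
        queue_per_cpu = avg("Processor Queue Length") / num_processors
        switches = avg("Context Switches/sec")
        privileged = avg("% Privileged Time")

        if utilization >= 90 and queue_per_cpu >= 2:
            return "Likely processor bottleneck: tune the code or add CPUs."
        if queue_per_cpu >= 2:
            return "Queue without high utilization: suspect processor blocking."
        if switches >= 15000 and privileged >= 40:
            return "Excessive context switching: review thread behavior."
        return "No processor bottleneck indicated by these samples."

    # Hypothetical samples taken from a performance log:
    samples = [
        {"% Processor Time": 95, "Processor Queue Length": 6,
         "Context Switches/sec": 4200, "% Privileged Time": 18},
        {"% Processor Time": 97, "Processor Queue Length": 5,
         "Context Switches/sec": 3900, "% Privileged Time": 20},
    ]
    print(diagnose_processor(samples, num_processors=2))  # likely bottleneck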

Disk Bottlenecks

Disk space is a recurring problem. No matter how much drive space you configure your servers or network storage devices with, your software seems to consume it. However, disk bottleneck problems are related to time, not disk space. When the disk becomes the limiting factor in your server, it is because the components involved in reading from and writing to the disk cannot keep pace with the rest of the system.

The parts of the disk that create a time bottleneck are less familiar than the megabytes or gigabytes of space. They include the I/O bus, the device bus, the disk controller, and the head stack assembly. Each of these components contributes to and, in turn, limits the performance of the disk configuration.

System Monitor measures different aspects of physical and logical disk performance. To truly understand the state of disk resource consumption, you will need to monitor several disk counters, and in some instances you will need to monitor them for several days. On top of this, you will probably find yourself working through some mathematical formulas to determine whether a disk bottleneck exists at your server. These formulas are detailed in the real-world example below. Before we delve into them, however, let's review some of the counters you will monitor when hunting down a disk bottleneck. These counters allow you to troubleshoot, capacity-plan, and measure the activity of your disk subsystem, and some of them supply the data required by the aforementioned disk bottleneck formulas.

  • Average Disk Queue Length

    The average number of both read and write requests that were queued for the selected disk during the sample interval.

  • Average Disk Read Queue Length

    The average number of read requests that were queued for the selected disk during the sample interval.

  • Average Disk Write Queue Length

    The average number of write requests that were queued for the selected disk during the sample interval.

  • Average Disk sec/Read

    The average time, in seconds, of a read of data from the disk.

  • Average Disk sec/Transfer

    The time, in seconds, of the average disk transfer.

  • Disk Reads/sec

    The rate of read operations on the disk.

  • Disk Writes/sec

    The rate of write operations on the disk.

How the ACE Team Discovered a Disk Bottleneck

An internal product team at Microsoft was interested in evaluating server hardware from two different vendors. These servers would host the SQL database for a Web application the team was designing. The Web application would be accessed by several thousand customers simultaneously; therefore, selecting the right hardware was critical to the success of the project. The product team wanted to conduct several stress tests and monitor the effect these tests had on the SQL server's resources.

A stress test harness was developed that simulated production environment activity. The stress harness was written in Visual Basic and run on client machines as a Win32 application. One hundred client machines were configured to execute the stress test harness. The harness was designed to spawn instances that simulated five users per instance, each connecting to a different database (that is, db1 through db5) on the server. Under this workflow, each client executed a SQL batch file via ADO or an OSQL instance for each operation. These batch files were generated by using SQL Profiler to trace manual user navigation of the site and then saving the trace as a SQL batch file. The operations performed in this manner for these tests were:

  • Load the login page

  • Select a user name and hit enter

  • Load the tasks page

  • Submit actual work times to the manager

  • Load the resource views page

  • Set and save notification reminders

  • Delegate one task to another resource

The client machines were configured so that all 500 databases on the SQL server would be accessed during the tests. This helped prevent any one database from receiving a majority of the SQL transactions. After configuring the client machines, the stress test harness was started and run for 20 minutes (15 minutes were set aside as a warm-up period). During these 20 minutes, performance data at the SQL server was collected for benchmark purposes.

Wait times of 10 and 60 seconds were used when executing the load against the targeted databases. Each simulated user started the test at a random offset from the global start time of the test and performed one operation. The user would then wait either 10 or 60 seconds before beginning the next operation.

Executing both scenarios produced significantly high disk read and write times, which prompted an investigation into the disk capacity of the hardware being utilized. The calculations indicated that the I/Os per disk exceeded the manufacturer's specification for the number of I/Os each disk can successfully handle.

The performance data collected during the 10 second and 60 second wait-time benchmark indicated the existence of a disk bottleneck at Server 1. In order to verify this, our team applied the performance data gathered from the physical disk activity to the following formula:

I/Os per Disk = [Reads + (4 × Writes)] / Number of Disks

If the calculated I/Os per disk exceeded the capacity for the server, this would verify the existence of a disk bottleneck. The disk I/O capacity and calculated disk I/Os per disk are outlined below. Note that for each of the calculations, 85 random I/Os per disk is used as the capacity for a disk in a RAID 5 configuration; the factor of 4 applied to writes reflects the RAID 5 write penalty, in which each logical write requires reading and rewriting both a data block and a parity block.

10-Second Wait Time Test Scenario on Server 1

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [269.7 + (4 × 74.6)] / 5

Calculated I/Os per disk = 113.62 random I/Os per disk

At 113.62 random I/Os per disk, Server 1 is suffering from a disk bottleneck, as the capacity for each disk in the server is only 85 random I/Os per disk.

10-Second Wait Time Test Scenario on Server 2

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [138.3 + (4 × 43.0)] / 4

Calculated I/Os per disk = 77.6 random I/Os per disk

At 77.6 random I/Os per disk, Server 2 is below the capacity of 85 random I/Os per disk; therefore no disk bottleneck exists.

60-Second Wait Time Test Scenario on Server 1

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [294.8 + (4 × 71.8)] / 5

Calculated I/Os per disk = 116.4 random I/Os per disk

At 116.4 random I/Os per disk, Server 1 is suffering from a disk bottleneck, as the capacity for each disk in the server is only 85 random I/Os per disk.

60-Second Wait Time Test Scenario on Server 2

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [68.9 + (4 × 24.0)] / 4

Calculated I/Os per disk = 41.2 random I/Os per disk

At 41.2 random I/Os per disk, Server 2 is significantly below the capacity of 85 random I/Os per disk; therefore no disk bottleneck exists. At 113.62 and 116.4 random I/Os per disk respectively, Server 1 exceeds the manufacturer's specified number of disk I/Os the hardware can sustain and is therefore suffering from a disk bottleneck in both scenarios.
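
The arithmetic above is easy to script when you have many test runs to evaluate. The following Python sketch simply reproduces the four calculations; the capacity of 85 random I/Os per disk and the factor of 4 applied to writes come straight from the formula and assumptions stated above, and the reads, writes, and disk counts are the measured values from the scenarios.

    # Disk bottleneck check: I/Os per disk = [Reads + (4 x Writes)] / disks,
    # compared against 85 random I/Os per disk for the RAID 5 arrays above.

    RAID5_CAPACITY = 85  # random I/Os per disk, per the vendor specification

    def ios_per_disk(reads_per_sec, writes_per_sec, num_disks):
        return (reads_per_sec + 4 * writes_per_sec) / num_disks

    scenarios = [
        ("Server 1, 10-second wait", 269.7, 74.6, 5),
        ("Server 2, 10-second wait", 138.3, 43.0, 4),
        ("Server 1, 60-second wait", 294.8, 71.8, 5),
        ("Server 2, 60-second wait", 68.9, 24.0, 4),
    ]

    for name, reads, writes, disks in scenarios:
        load = ios_per_disk(reads, writes, disks)
        verdict = "disk bottleneck" if load > RAID5_CAPACITY else "within capacity"
        print(f"{name}: {load:.2f} I/Os per disk ({verdict})")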

Disk Architecture Matters to Performance

Today, many Web applications are built to interact with a database server. Many if not all of the applications we test use SQL Server 2000, and in most cases we find significant performance gains by tuning the SQL server. These wins come through optimization of the SQL code, the database schema, or disk utilization. When designing the architecture of your database, you will need to decide how data and log files are read from and written to disk. For example, do you want to write your log files to a RAID device or a non-RAID device? If you do not make the right choices, the result can be a disk bottleneck. In one such case we were able to apply formulas that proved or disproved the existence of a disk bottleneck; you will find details of the project and the formulas used in the real-world example above.

Memory

When analyzing the performance of your Web applications, you should determine whether a system is starving for memory due to a memory leak or other application fault, or whether the system is simply over-used and requires more hardware. In this section we discuss the counters you should monitor to determine the existence, and then the cause, of a memory bottleneck. (Note that tools other than System Monitor are available for analyzing the memory utilization of a server. It may be worth your while to investigate some of them, as they can save time when monitoring the system.)

  • Page faults/sec

    The average number of pages faulted per second. It is measured in number of pages faulted per second because only one page is faulted in each fault operation; hence this is also equal to the number of page fault operations. This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory). Most processors can handle large numbers of soft faults without significant consequences. However, hard faults, which require disk access, can cause significant delays.

  • Available Bytes

    Indicates how many bytes of memory are currently available for use by processes.

  • Page Reads/sec

    The rate at which the disk was read to resolve hard page faults. It shows the number of read operations, without regard to the number of pages retrieved in each operation. A hard page fault occurs when a process references a page in virtual memory that is not in the working set or elsewhere in physical memory, and must be retrieved from disk. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It includes read operations to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files. Compare the value of Memory\Page Reads/sec to the value of Memory\Pages Input/sec to determine the average number of pages read during each operation (a small worked example follows this list).

  • Page Writes/sec

    The rate at which pages are written to disk to free up space in physical memory. Pages are written to disk only if they are changed while in physical memory, so they are likely to hold data, not code. This counter shows write operations, without regard to the number of pages written in each operation. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

  • Pages/sec

    The rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.
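
As a quick worked example of the comparison suggested under Page Reads/sec, the short Python sketch below divides Memory\Pages Input/sec by Memory\Page Reads/sec to estimate how many pages are retrieved per read operation. The counter values here are hypothetical.

    # Average pages retrieved per hard-fault read operation:
    # Memory\Pages Input/sec divided by Memory\Page Reads/sec.
    # The sampled values below are hypothetical.

    pages_input_per_sec = 120.0  # Memory\Pages Input/sec
    page_reads_per_sec = 30.0    # Memory\Page Reads/sec

    if page_reads_per_sec > 0:
        pages_per_read = pages_input_per_sec / page_reads_per_sec
        # 120 / 30 = 4 pages fetched per read operation in this sample
        print(f"Average pages per read operation: {pages_per_read:.1f}")

Retrieving several pages per read operation is normal; it is a sustained, high rate of read operations itself that signals hard-fault pressure.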

How the ACE Team Discovered a Memory Leak

In this example we discuss how we determined the existence of a memory leak in an application that was submitted to our team for performance testing. Performance analysts on our team met with the development team to understand some of the common user scenarios for the Web application, and discussed existing performance issues the development team was aware of. The developers were concerned about memory usage by COM+ applications running on the Web server. Keeping this in mind, the analyst decided the best approach to ruling out memory issues would be to execute a series of stress tests, which would uncover memory utilization issues at the server if they truly existed.

The analyst built test scripts of the user scenarios provided by the development team and executed a short stress test, with performance logs recording resource utilization at the server hosting the COM+ application. During this one-hour test the analyst observed memory consumption of approximately 20 MB, and noted that this memory had still not been released three hours after the test was stopped. These findings prompted a further investigation into the application's memory consumption (see Table 4-2).

A 12-hour continuous stress test was then conducted to analyze the application's memory behavior. At the end of this test it was discovered that, in addition to heavy CPU activity, growth in private bytes was significant for the test period and the server was extremely low on virtual memory (see Table 4-3). Of the 671 MB acquired by the dllhost process's private bytes, 640 MB was still allocated three hours after the test ended. Virtual memory growth appeared to be centered almost entirely on private bytes for the dllhost process. For the 1-hour test, the memory grew only from 38 to 58 megabytes; for the 12-hour test, the growth was much greater, from 368 to 671 megabytes. The memory was not released until the server was rebooted.

The dllhost process was then analyzed to identify the processes involved in its execution and narrow the potential memory leak down to a specific process. After the exact process causing the leak was identified, its code was profiled, and the developer was able to pinpoint exactly where memory was not being managed correctly. Of course, with managed code you won't run into the slew of memory management issues you did in the days of unmanaged code.

Table 4-2. Summary of 1-hour Test Results

Windows 2000 IIS 5.0                 ~Average-IIS    ~Maximum/Total-IIS
System - % Total Processor Time      55%             100%
Inetinfo - % Total Processor Time    .5%             1%
Dllhost - % Total Processor Time     41%             100%
Memory: Available in Megabytes       164 MB          185 MB
Memory: Pages/sec                    0               .2
Inetinfo: Private in Megabytes       14 MB           14 MB
Dllhost: Private in Megabytes        38 MB           56 MB

Table 4-3. Summary of 12-hour Test Results

Windows 2000 IIS 5.0                 ~Average-IIS    ~Maximum/Total-IIS
System - % Total Processor Time      69%             100%
Inetinfo - % Total Processor Time    .6%             1.5%
Dllhost - % Total Processor Time     71%             100%
Memory: Available in Megabytes       56 MB           196 MB
Memory: Pages/sec                    51              295
Inetinfo: Private in Megabytes       14 MB           14.4 MB
Dllhost: Private in Megabytes        368 MB          671 MB

Memory leaks should be investigated by monitoring Memory\Available Bytes, Process\Private Bytes, and Process\Working Set. A memory leak typically shows Process\Private Bytes and Process\Working Set increasing while Memory\Available Bytes decreases. Verify this in Task Manager by identifying the process ID (PID) and tracing it back to your application. Memory leaks should always be confirmed by running a performance test for an extended period to observe the application's behavior when all available memory is depleted. A rough sketch of this trend check follows.
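
The Python code below flags a potential leak when Private Bytes and Working Set climb steadily while Available Bytes falls across the sampled interval. Only the 368 MB and 671 MB private-bytes endpoints echo the case study above; the intermediate values and the working-set and available-memory series are hypothetical, standing in for samples you would pull from a performance log captured during an extended test.

    # Trend check for a suspected memory leak: Process\Private Bytes and
    # Process\Working Set rising while Memory\Available Bytes falls.
    # Values are in megabytes, e.g. sampled at intervals during a long test.
    # Only the 368 MB and 671 MB endpoints come from the case study above;
    # the remaining values are hypothetical.

    private_bytes = [368, 420, 489, 552, 610, 671]  # Process\Private Bytes
    working_set   = [300, 352, 410, 470, 520, 575]  # Process\Working Set
    available     = [196, 170, 141, 110, 80, 56]    # Memory\Available Bytes

    def rising(series):
        return all(a < b for a, b in zip(series, series[1:]))

    def falling(series):
        return all(a > b for a, b in zip(series, series[1:]))

    if rising(private_bytes) and rising(working_set) and falling(available):
        print("Pattern consistent with a memory leak; profile the process.")
    else:
        print("No sustained leak pattern in these samples.")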

Create and Configure Alerts

You can configure the Performance Logs and Alerts service to fire off alerts when a specified performance event occurs at the server. For example, if the available memory at the Web server drops below 20 MB, an event could be triggered that satisfies one or all of the following conditions:

  • Logs an entry to the application event log

  • Sends a network message to a specified user

  • Starts a performance data log

  • Runs a specified program

There are several instances in which configuring an alert to trigger an event helps increase your testing efficiency. One is when you are running an extended stress test. Let's say the stress test must run over a 24-hour period and you are particularly interested in what happens with the Web server's memory. You could configure an alert that records an event to the application event log each time a spike occurs in the Pages/sec counter. This way, you don't have to count the spikes in an enormous log file; you can simply sort the application event log for the instances you are most concerned with.

To create an alert follow these steps:

  1. Open Performance: click Start, point to Programs, point to Administrative Tools, and then click Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts. Any existing alerts will be listed in the details pane. A green icon indicates that an alert is running; a red icon indicates an alert has been stopped or is not currently active.

  3. Right-click a blank area of the details pane and click New Alert Settings.

  4. In Name, type the name of the alert, and then click OK.

  5. Use the General tab to define a comment for your alert, along with counters, alert thresholds, and the sample interval. Use the Action tab to define actions that should occur when counter data triggers an alert, and use the Schedule tab to define when the service should begin scanning for alerts.

NOTE
You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. The subkey is:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log Queries

In general, administrators have this access by default. Administrators can grant access to users using the Security menu in Regedt32.exe. In addition, to run the Performance Logs and Alerts service (which is installed by Setup and runs in the background when you configure a log to run), you must have the right to start or otherwise configure services on the system. Administrators have this right by default and can grant it to users by using Group Policy.

CAUTION
Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

To define counters and thresholds for an Alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts.

  3. In the details pane, double-click the alert.

  4. In Comment, type a comment to describe the alert as needed.

  5. Click Add.

For each counter or group of counters that you want to add to the log, perform the following steps:

  1. To monitor counters from the computer on which the Performance Logs and Alerts service will run, click Use Local Computer Counters.

    Or, to monitor counters from a specific computer regardless of where the service is run, click Select Counters From Computer and specify the name of the computer you want to monitor.

  2. In Performance object, click an object to monitor.

  3. In Performance counters, click one or more counters to monitor.

  4. To monitor all instances of the selected counters, click All Instances. (Binary logs can include instances that are not available at log startup but subsequently become available.)

    Or, to monitor particular instances of the selected counters, click Select Instances From List, and then click an instance or instances to monitor.

  5. Click Add.

  6. In Alert When The Value Is, specify Under or Over, and in Limit, specify the value that triggers the alert.

  7. In Sample Data Every, specify the amount and the unit of measure for the update interval.

  8. Complete the alert configuration using the Action and Schedule tabs.

NOTE
When creating a monitoring console for export, be sure to select Use Local Computer Counters. Otherwise, counter logs will obtain data from the computer named in the text box, regardless of where the console file is installed.

To define actions for an alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts.

  3. In the details pane, double-click the alert.

  4. Click the Action tab.

  5. To have the Performance Logs and Alerts service create an entry visible in Event Viewer, select Log An Entry in the Application Event Log.

  6. To have the service trigger the messenger service to send a message, select Send a Network Message to and type the name of the computer on which the alert message should be displayed.

  7. To run a counter log when an alert occurs, select Start Performance Data Log and specify the counter log you want to run.

  8. To have a program run when an alert occurs, select Run This Program and type the file path and name or click Browse to locate the file. When an alert occurs, the service creates a process and runs the specified command file. The service also copies any command-line arguments you define to the command line that is used to run the file. Click Command Line Arguments and select the appropriate check boxes for arguments to include when the program is run.

To start or stop a counter log, trace log, or alert manually, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and click Counter Logs, Trace Logs, or Alerts.

  3. In the details pane, right-click the name of the log or alert you want to start or stop, and click Start to begin the logging or alert activity you defined, or click Stop to terminate the activity.

    NOTE
    There may be a slight delay before the log or alert starts or stops, indicated when the icon changes color (from green for started to red for stopped, and vice versa).

To remove counters from a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Counter Logs or Alerts.

  3. In the details pane, double-click the name of the log or alert.

  4. Under Counters, click the counter you want to remove, and then click Remove.

To view or change properties of a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts.

  3. Click Counter Logs, Trace Logs, or Alerts.

  4. In the details pane, double-click the name of the log or alert.

  5. View or change the log properties as needed.

To define start or stop parameters for a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Counter Logs, Trace Logs, or Alerts.

  3. In the details pane, double-click the name of the log or alert.

  4. Click the Schedule tab.

  5. Under Start log, click one of the following options:

    • To start the log or alert manually, click Manually. When this option is selected, to start the log or alert, right-click the log name in the details pane, and click Start.

    • To start the log or alert at a specific time and date, click At, and then specify the time and date.

  6. Under Stop Log, select one of the following options:

    • To stop the log or alert manually, click Manually. When this option is selected, to stop the log or alert, right-click the log or alert name in the details pane, and click Stop.

    • To stop the log or alert after a specified duration, click After, and then specify the number of intervals and the type of interval (days, hours, and so on).

    • To stop the log or alert at a specific time and date, click At, and then specify the time and date. (The year box accepts four characters; the others accept two characters.)

    • To stop a log when the log file becomes full, select options as follows:

      • For counter logs, click When the Log File is Full. The file will continue to accumulate data according to the file-size limit you set on the Log Files tab (in kilobytes up to two gigabytes).

      • For trace logs, click When the n-MB Log File is Full. The file will continue to accumulate data according to the file-size limit you set on the Log Files tab (in megabytes).

        When setting these file-size options, take into consideration your available disk space and any disk quotas that are in place. An error might occur if your disk runs out of space due to logging.

  7. Complete the properties as appropriate for logs or alerts:

    • For logs, under When a Log File Closes, select the appropriate option:

      • If you want to configure circular (continuous, automated) counter or trace logging, select Start a New Log File.

      • If you want to run a program after the log file stops (for example, a copy command for transferring completed logs to an archive site), select Run This Command. Also type the path and file name of the program to run, or click Browse to locate the program.

    • For alerts, under When An Alert Scan Finishes, select Start a New Alert Scan if you want to configure continuous alert scanning.

To delete a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts.

  3. Click Counter Logs, Trace Logs, or Alerts.

  4. In the details pane, right-click the name of the log or alert, and click Delete.

NOTE
When you schedule a log to close at a specific time and date, or when you close the log manually, the Start a New Log File option is unavailable.


