Monitoring for Bottlenecks


Bottlenecks reveal themselves when your system's performance deviates from the baselines you established earlier, so it behooves you to have guidelines for what constitutes a worrying deviation. Table 18–4 suggests thresholds for a minimum set of system counters.

Table 18–4: Thresholds for a Minimum Set of System Counters

| Resource                 | Object             | Counter                           | Threshold                              |
|--------------------------|--------------------|-----------------------------------|----------------------------------------|
| Disk (1)                 | LogicalDisk        | % Free Space                      | 15%                                    |
| Disk (2)                 | LogicalDisk        | % Disk Time                       | 80%                                    |
| Disk (3)                 | PhysicalDisk       | Disk Reads/Sec or Disk Writes/Sec | Check manufacturer's specifications    |
| Disk (4)                 | PhysicalDisk       | Current Disk Queue Length         | Number of spindles plus 2              |
| Memory (5)               | Memory             | Available Bytes                   | 4 MB, but best not to drop below 16 MB |
| Memory (6)               | Memory             | Pages/Sec                         | 20 per second                          |
| Network (7)              | Network Segment    | % Net Utilization                 | 30%                                    |
| Paging File (8)          | Paging File        | % Usage                           | 70%+                                   |
| Processor (9)            | Processor          | % Processor Time                  | 85%                                    |
| Processor (10)           | Processor          | Interrupts/Sec                    | 1,500 per second                       |
| Server (11)              | Server             | Bytes Total/Sec                   | N/A                                    |
| Server (12)              | Server             | Work Item Shortages               | 3                                      |
| Server (13)              | Server Work Queues | Queue Length                      | 4                                      |
| Multiple Processors (14) | System             | Processor Queue Length            | 2                                      |

The values in the table are recommended for a minimum performance monitoring set; some may vary and still be satisfactory for SQL Server. The following notes, keyed to the numbers in the table, provide additional guidance (a query sketch for cross-checking several of these counters from inside SQL Server follows the notes):

  1. Depending on what your server is doing (a high-end OLTP system or an OLAP server, for example), a threshold of 15 percent may be too low. You can also ensure that the disk quota threshold is never suddenly exceeded; you might have to manage the quota manually, because most SQL Server databases are installed on standard disks rather than dynamic disks, relying instead on hardware RAID technology that is not compatible with the dynamic disk architecture. While not all processes can be blocked from using disk space, it is still a good idea to configure alerts that raise the alarm when this threshold is exceeded.

  2. The value given for % Disk Time in the table is the portion of time the disk is in use; in other words, the disk should not be busy more than 80 percent of the time. Check this value against the manufacturer's advice, however, because disks that regularly exceed it may not last long. I have seen disks overheat and crash when this counter climbed to 100 percent and stayed pegged there for a considerable length of time. This is an important factor to watch in a high-end OLTP system.

  3. The transfer rate of your disk is usually printed on the disk itself or provided in a booklet that ships with the product. Program alerts to fire if the monitor reports that your rates exceed this specification. If services are hammering away at your disks, upgrade to faster technology, such as Ultra-Wide SCSI, and favor 15,000 RPM disks over the slower 10,000 RPM technology. I know the former are more expensive, but they may be worth it.

  4. Current Disk Queue Length is a snapshot value, not an average, so observe it over several intervals before comparing it with the number of spindles plus 2. You can also use Avg. Disk Queue Length for analysis.

  5. If available memory drops below 4 MB, paging activity begins to increase and the system becomes sluggish. If the condition continues, you get that familiar message that pops up on servers advising you that system resources are running low.

  6. If the server's memory use increases, you need to keep an eye on this threshold, lest the server exceed its baselines and run out of memory.

  7. This threshold varies according to the type of network you are running. In the case of a typical Ethernet network, the threshold will be around 30 percent.

  8. Before you try to use this counter, make sure you fully understand how paging works. The threshold varies according to the nature of the hardware and the number of SQL Server instances running on the server, so understanding how all these factors influence paging is an important prerequisite.

  9. % Processor Time can easily be observed in Task Manager, and any sustained processor use at the 85 percent mark or higher should make you nervous. First use Task Manager to identify the process that is consuming your CPU bandwidth. If it is a critical SQL Server function, you might need to add another processor or upgrade to a faster CPU. When SQL Server is idle, the System Idle Process uses the CPU most of the time.

  10. This processor counter can raise the alarm about possible hardware problems. If it increases dramatically without a corresponding increase in server activity, a piece of hardware is responsible for the flood of interrupts. The culprit might be a disk controller card or interface, a network interface card, or something less obvious.

  11. You can sum Bytes Total/Sec for all servers using this counter. If the sum is roughly equal to the maximum transfer rate of the network, you may have some network topology problems to sort out.

  12. Don't let the Work Item Shortages value exceed 3. If it does, you may have to change parameters in the registry. Advanced information on WorkItems is beyond the scope of this book, but the Microsoft Knowledge Base has a complete discussion of the Work Item Shortages counter.

  13. The server work queue length is another snapshot counter that may signal a processor bottleneck. Keep a watch on this counter over several intervals.

  14. Another snapshot counter, Processor Queue Length should be monitored over several intervals. If you notice a value higher than 2 across those intervals, start an investigation.
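Several of these counters can be cross-checked from inside SQL Server itself. As a minimal sketch, the following query reads the engine's own counters from the sys.dm_os_performance_counters view (available in SQL Server 2005 and later). The three counters named here are illustrative choices; object names carry a SQLServer: prefix on a default instance and MSSQL$<instance>: on a named instance.

```sql
-- Minimal sketch: read SQL Server's internal performance counters as a
-- cross-check on the values System Monitor reports. Counter choices are
-- illustrative; adjust them to the counters you are baselining.
SELECT  object_name,
        counter_name,
        instance_name,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   counter_name IN (N'Page life expectancy',    -- memory pressure indicator
                         N'Memory Grants Pending',    -- queries waiting for memory
                         N'User Connections')         -- compare with your baseline
ORDER BY object_name, counter_name;
```

Compare the results with what System Monitor reports to confirm that you are watching the right instance.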

Identifying Bottlenecks

Bottlenecks will arise, as surely as the sun sets in the west, when excessive demand is placed on a system's resources. Bottlenecks are present in every system, albeit to varying degrees; even the human body can only take so much before the soul is squeezed out of it. When you monitor SQL Server for bottlenecks, you can determine whether changes to the limiting component would let it perform at an optimal level.

There are several reasons that bottlenecks occur:

  • The server has insufficient resources, and you are thus required to add or upgrade components.

  • You have resources of the same type that do not share workloads evenly. A good example is a collection of hard disks in which one is being monopolized (see the query sketch after this list).

  • You have malfunctioning resources.

  • Resources are incorrectly configured.
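For the second point in particular, a quick way to see whether one disk is being monopolized is to compare cumulative file I/O from inside SQL Server. The sketch below, which assumes SQL Server 2005 or later, joins the sys.dm_io_virtual_file_stats function to sys.master_files; the figures are cumulative since the instance last started, so sample them over an interval if you need rates.

```sql
-- Minimal sketch: cumulative I/O and stall time per database file, sorted so
-- the most I/O-bound files (and the disks behind them) float to the top.
SELECT  DB_NAME(vfs.database_id)  AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```

If the files on one drive dominate both the I/O counts and the stall times, consider redistributing files or rebalancing the RAID layout.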

Analyzing Bottlenecks on SQL Server

When you analyze event data, you will often discover that low numbers can be just as meaningful as high numbers. If you find that a number is lower than you expected, it may point to a problem in another area. For example,

  • There might be another component that is preventing the load from reaching the component in question.

  • Network congestion may be preventing client requests from reaching the server. It is also possible that your network connection is down and the server is quiet simply because no one is connected to it (the query sketch after this list offers a quick check).

  • A bottleneck somewhere in or on the network may be preventing client computers from accessing the server as frequently as they usually do. On the other hand, the Internet might have finally shut down.

  • You may also not have configured the performance monitoring tools correctly. For example, if you have not turned on the disk counters, or you are looking at the wrong instance, the wrong counters, or the wrong computer, event data may appear inexplicably low or even nonexistent. I guess you might need a counter for Murphy's Law, all things considered.
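As a quick sanity check on the last two points, the sketch below asks the server itself whether anyone is connected and whether batches are arriving; a near-zero user session count on an otherwise healthy instance points at the clients or the network rather than at SQL Server. Note that Batch Requests/sec in sys.dm_os_performance_counters is a cumulative value, so sample it twice and take the difference to get a rate.

```sql
-- Minimal sketch: confirm the server is actually receiving work before
-- concluding that a low counter value means good news.
SELECT  COUNT(*) AS user_sessions
FROM    sys.dm_exec_sessions
WHERE   is_user_process = 1;

-- Cumulative value: sample twice and subtract to get batches per second.
SELECT  cntr_value AS batch_requests_cumulative
FROM    sys.dm_os_performance_counters
WHERE   object_name LIKE N'%SQL Statistics%'
  AND   counter_name = N'Batch Requests/sec';
```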

So when a low number indicates that the system is performing better than expected, you should not become nonchalant. Table 18–5 lists five key areas to monitor when tracking server performance and identifying bottlenecks.

Table 18–5: Tracking SQL Server Performance

| Bottleneck Candidate      | Effects on the Server |
|---------------------------|-----------------------|
| Memory usage              | Insufficient memory allocated or available to SQL Server degrades performance. Data must be read continually from disk rather than residing in the data cache, and Windows NT 4.0 and Windows 2000 perform excessive paging, swapping data to and from disk as pages are needed. |
| CPU processor utilization | A constantly high CPU rate may indicate the need for a CPU upgrade or the addition of processors. Be sure of the cause first, though; adding another CPU might require you to fork out another $20,000 in licensing. |
| Disk I/O performance      | Slow disk I/O (disk reads and writes) causes transaction throughput to degrade. The higher the RPM, the better the performance. |
| User connections          | An improperly configured number of users can cause the system to run slowly and can limit the amount of memory that would otherwise have been made available to SQL Server. |
| Blocking locks            | A process may force other processes to wait, thereby slowing down or stopping the blocked processes (see the query sketch after this table). |
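For the last row in the table, a minimal sketch such as the following lists the requests that are currently blocked and the sessions blocking them, using the sys.dm_exec_requests view available in SQL Server 2005 and later. Once you have a blocking session ID, sys.dm_exec_sessions and sys.dm_os_waiting_tasks will tell you who owns it and what it is waiting on.

```sql
-- Minimal sketch: show currently blocked requests and the sessions
-- responsible for the blocking.
SELECT  r.session_id          AS blocked_session,
        r.blocking_session_id AS blocking_session,
        r.wait_type,
        r.wait_time           AS wait_time_ms,
        r.command
FROM    sys.dm_exec_requests AS r
WHERE   r.blocking_session_id <> 0;
```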



