Common Performance Bottlenecks


Now that you know how to use performance-monitoring tools such as System Monitor, Enterprise Manager, Query Analyzer, and Profiler, you are ready to tackle performance bottlenecks. In this section, we'll look at some of the most common performance bottlenecks and at various solutions to them. Many of these bottlenecks are closely related, and one bottleneck can disguise itself as another. You must look for both hardware-caused and software-caused bottlenecks—many performance problems stem from a combination of bottlenecks. Hardware components that can cause bottlenecks include the CPU, memory, and the I/O subsystem; software that can cause bottlenecks includes SQL Server applications and SQL statements. We'll examine each of these types of problems in more detail in the following sections.

CPU

One common performance problem is simply a lack of horsepower. The processing power of the system is determined by the number, type, and speed of the CPUs in the system. If your system doesn't have enough CPU power, it will not be able to process transactions fast enough to satisfy users. To use System Monitor to determine the percentage of CPU utilization, check the % Processor Time counter of the Processor object. (If you have a multiple-processor system, select all CPU instances.) If your CPUs are running at 75 percent or higher for long periods (75 percent is the desired maximum, according to the rules set out in Chapter 6), your system is likely experiencing a CPU bottleneck. Even if you consistently see CPU utilization as low as 60 percent, you might still benefit from adding faster CPUs or more CPUs to the system.

Be sure to monitor the other characteristics of the system before making the decision to upgrade your CPUs. If your SQL statements are highly inefficient, for example, your system might be doing much more processing than is necessary, and optimizing the statements can help lower the CPU utilization. Or suppose the cache hit ratio for the SQL Server data cache is below 90 percent. You might need to add memory to the data cache (either by increasing the max server memory parameter or by adding physical memory to the system). This will allow more data to be cached and therefore fewer physical disk I/O operations, leading to lower CPU utilization since less processing time will be spent dealing with I/O requests. You can sometimes get better performance out of your system by simply adding processors, especially if you are starting with a single-processor system. However, not all applications scale across multiple-processor systems. SQL Server scales well, but not all of the SQL statements that you might run will scale. A process will run on only one CPU at a time, and a single CPU is certainly sufficient for a single SQL statement. For the performance to improve by using multiple processors, your SQL Server system must be running many SQL statements concurrently—multiple statements can be running simultaneously on different CPUs, so the multiple CPUs can concurrently process the statements.

In general, you can almost always improve performance by using more and faster CPUs. However, in some cases, you must increase the memory and I/O capacity when you add faster CPUs to prevent moving the bottleneck to another location. For example, if you originally had a CPU bottleneck and you solved that problem by adding a CPU, you might find that your system is able to perform more work, resulting in more disk I/O. A disk I/O bottleneck could then occur. Also, be sure to check the Level 2 (L2) cache size of any new CPUs that you decide to add. The larger the CPU cache, the better the performance will be, especially in a multiple-processor system.

If you do decide you need more processing power, you can either add CPUs or replace your existing CPUs with faster ones. For example, if your system has two CPUs but can be expanded to contain four, try adding two more CPUs of the same type or switching to four new, faster CPUs. If your system already holds the maximum number of CPUs and you need more processing power, try getting faster CPUs to replace them. For example, let's say you have four CPUs that each run at a speed of 200 MHz. You could replace them with four CPUs that each run at a speed of, say, 500 MHz. Faster CPUs will allow processing to be completed more quickly.

Memory

The amount of memory available to SQL Server is one of the most critical factors in SQL Server performance. The relation of memory to the I/O subsystem is also an important factor. For example, in I/O-intensive systems, the more memory SQL Server can use for caching data, the fewer physical I/O operations must be performed. This is because the data is retrieved from the data cache instead of from the disk. SQL Server uses a complex caching system to speed up system performance: it accesses as much data as possible from memory and as little data as possible from disks, which provide slower performance. The more data SQL Server can access from memory, the better your system performance will be. Memory accesses are much faster than physical disk accesses.

In some cases, an insufficient amount of memory can result in an apparent disk bottleneck because more physical disk I/O is caused by the system's inability to cache effectively. To see how much of the system memory SQL Server is using, use System Monitor to check the Total Server Memory (KB) counter of the SQL Server: Memory Manager object. If SQL Server is not using the amount of memory you expect it to, you might need to adjust the memory configuration parameters described in the section "SQL Server Configuration Settings" later in this chapter.

To determine whether SQL Server has a sufficient amount of cache, use System Monitor to check the Buffer Cache Hit Ratio counter of the SQL Server: Buffer Manager object. The general rule is that SQL Server should have a 90 percent or higher cache hit rate, if possible. Add more memory to the cache if the hit rate is lower than 90 percent. Note that some systems never achieve a 90 percent cache hit rate because of the nature of the application being used. This might be the case if data pages are rarely reused and the system continually flushes out old data pages in the cache to bring in new ones.
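The 90 percent rule of thumb is easy to apply by hand. The following sketch shows the arithmetic, using made-up counter values rather than output from a real server (SQL Server computes the Buffer Cache Hit Ratio counter for you; this is only an illustration of what the number means):

```python
def cache_hit_ratio(cache_hits, total_lookups):
    """Return the cache hit ratio as a percentage of page lookups
    satisfied from the buffer cache rather than from disk."""
    if total_lookups == 0:
        return 100.0
    return 100.0 * cache_hits / total_lookups

# Illustrative values: 9,450 of 10,000 page lookups found in cache.
ratio = cache_hit_ratio(cache_hits=9450, total_lookups=10000)
print(f"Buffer cache hit ratio: {ratio:.1f}%")
if ratio < 90.0:
    print("Below 90 percent -- consider adding memory to the cache.")
```

A ratio of 94.5 percent, as in this example, would be within the guideline; a sustained value below 90 percent suggests adding memory, subject to the caveat above about applications that rarely reuse data pages.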

NOTE


Microsoft SQL Server 2000 dynamically allocates memory for the buffer cache based on available system memory and the memory parameter settings. Some external processes, such as printing processes and other applications, can cause SQL Server to free up much of its memory for their use. Monitor the system memory carefully, and isolate SQL Server on its own system, if possible. See Chapter 30 and the section "SQL Server Configuration Settings" later in this chapter for details about using the memory parameters.

I/O Subsystem

Bottlenecks occurring in the I/O subsystem are the most common hardware-related problems database systems experience. Poorly configured I/O subsystems are second only to poorly written SQL statements as the top cause of performance problems. Fortunately, problems with the I/O subsystem are also some of the easiest performance problems to fix. In many cases, the addition of disk drives will completely eliminate the performance bottleneck.

The I/O subsystem problem stems from the fact that the number of I/O operations a disk drive can perform is limited—for example, a drive might be able to handle no more than 85 random I/O operations per second. If disk drives are overloaded, the I/O operations to those disk drives are queued and SQL Server experiences long I/O latencies. These long I/O latencies can cause locks to be held longer or threads to sit idle while waiting for a resource. The net result is that the performance of the entire system suffers, which, in turn, results in angry users who complain that their transactions take too long to be completed.

In most cases, I/O subsystem performance problems occur because the I/O subsystem has not been properly sized for the load that it is experiencing. The topic of sizing was covered in detail in Chapters 5 and 6, but let's review it briefly here.

Sizing is necessary because, as mentioned, the number of I/O operations per second a single disk drive can perform is limited. When a disk drive performs mostly sequential I/O operations, as it does for the transaction log, it might achieve over 150 I/O operations per second without becoming overloaded. (As mentioned, overloading results in long latencies.) When the disk performs random I/O operations, on the other hand, as is common for data files, it might have a capacity of only 85 I/O operations per second. If the demand for I/O is higher, increasingly long latencies will occur.
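The sizing arithmetic described above can be sketched as a simple calculation. The per-drive limits here (85 random I/Os per second, 150 sequential I/Os per second) are the rule-of-thumb figures cited in the text, not measurements from any particular drive:

```python
import math

# Rule-of-thumb per-drive limits from the text.
RANDOM_IOPS_PER_DRIVE = 85       # typical for data files (random I/O)
SEQUENTIAL_IOPS_PER_DRIVE = 150  # typical for the transaction log

def drives_needed(required_iops, per_drive_capacity):
    """Minimum number of drives so that no drive exceeds its capacity."""
    return math.ceil(required_iops / per_drive_capacity)

# A workload generating 600 random I/Os per second against data files:
print(drives_needed(600, RANDOM_IOPS_PER_DRIVE))      # → 8 drives
# A log workload of 120 sequential I/Os per second:
print(drives_needed(120, SEQUENTIAL_IOPS_PER_DRIVE))  # → 1 drive
```

In practice you would size for peak load, not average load, and leave headroom below the per-drive limit so that latencies stay short.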

To determine whether you are overdriving your disk drives, monitor the counters provided by the System Monitor objects PhysicalDisk and LogicalDisk. These counters, some of which are described in more detail later in this section, collect data about physical and logical disk I/O activity, such as the number of reads and writes per second that occur on a disk. These counters are enabled when you install the operating system, but you should know how to enable and disable them. These counters use system resources, such as CPU time, when they gather statistics, so you should enable them only when you want to monitor the system I/O. To enable or disable these counters, you use the Windows NT/2000 command-line utility DISKPERF.

To see if DISKPERF is enabled, run the following command at a command prompt:

 diskperf 

If DISKPERF is enabled, you will see this message: "Physical Disk Performance counters on this system are currently set to start at boot." If DISKPERF is disabled, you will see this message: "Both Logical and Physical Disk Performance counters on this system are now set to never start."

To enable DISKPERF if it is disabled, type the following command at a command prompt:

 diskperf -Y 

To disable DISKPERF, run this command:

 diskperf -N 

You must restart the computer for the changes to DISKPERF to take effect. Type the following command to see more options for DISKPERF:

 diskperf -? 

Of particular importance are the counters Disk Writes/Sec, Disk Reads/Sec, Avg. Disk Queue Length, Avg. Disk Sec/Write, and Avg. Disk Sec/Read. These counters can be invaluable in helping you determine whether you are overdriving the disk subsystem. Both the PhysicalDisk object and the LogicalDisk object provide these counters. To view details about the information these counters provide, click Explain in the Add Counters dialog box when you are adding the counter in System Monitor.

Let's look at an example of using these counters. Suppose you believe your system is experiencing an I/O bottleneck. You would check the Avg. Disk Sec/Transfer, Avg. Disk Sec/Read, and Avg. Disk Sec/Write counters of the PhysicalDisk object because they monitor disk latency (the amount of time the disk requires to perform a read or write), and increased latency is a sign that you are overdriving your disk drives or disk array. The general rule is that a normal reading for these counters is between 1 and 15 milliseconds (between 0.001 and 0.015 seconds), but you should not be concerned if latencies are as long as 20 milliseconds (0.020 seconds) at times of peak usage. If you are seeing values higher than 20 milliseconds, your system is definitely experiencing an I/O subsystem performance problem.

You would also look at Disk Writes/Sec and Disk Reads/Sec. Let's say these counters show that a disk performs 20 writes and 20 reads per second, for a total of 40 I/O operations, and the disk's capacity is 85 I/O operations per second. If, at the same time, the disk's latency is long, the disk drive might be faulty. Now let's say that the disk is performing 100 I/O operations per second and that the latencies are around 20 milliseconds or more. In this case, you need to add drives to improve performance.
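The reasoning in this example can be summarized as a simple decision rule. The sketch below encodes the thresholds from the text (20-millisecond latency ceiling, 85 I/Os per second of random capacity per drive); the specific numbers passed in are illustrative, not real counter output:

```python
def diagnose_disk(reads_per_sec, writes_per_sec, latency_ms,
                  capacity_iops=85):
    """Apply the rules of thumb from the text: latency above 20 ms
    signals an I/O problem, and comparing the load against the drive's
    capacity suggests the likely cause."""
    total_iops = reads_per_sec + writes_per_sec
    if latency_ms <= 20:
        return "latency acceptable"
    if total_iops < capacity_iops:
        return "long latency at low load -- drive may be faulty"
    return "disk overdriven -- add drives"

# 40 I/Os per second with 35 ms latency: well under capacity, so suspect
# a faulty drive rather than overload.
print(diagnose_disk(20, 20, 35))
# 100 I/Os per second with 25 ms latency: over capacity, so add drives.
print(diagnose_disk(60, 40, 25))
```

A real diagnosis would of course consider more than two counters, but this captures the core distinction the example draws: long latency at low load points to hardware, while long latency at high load points to an undersized subsystem.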

To determine how many I/O operations your system is performing if you are using a RAID array, divide the number of I/O operations per second that you see in System Monitor by the number of drives in the array and factor in the RAID overhead. The following table lists the number of physical I/O operations that are generated for reads and writes when RAID technology is used.

Table 36-1. Number of physical I/Os performed for a read and a write per RAID level

RAID Level    Physical I/Os per Read    Physical I/Os per Write
0             1                         1
1 or 10       1                         2
5             1                         4
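Combining the multipliers in Table 36-1 with the drive count gives the per-drive load calculation described above. A minimal sketch (the workload numbers are illustrative):

```python
# Physical I/Os generated per logical read and write, per Table 36-1.
RAID_IO_COST = {
    "0": (1, 1),
    "1": (1, 2),   # also applies to RAID 10
    "5": (1, 4),
}

def physical_iops_per_drive(reads, writes, raid_level, num_drives):
    """Convert logical reads/writes per second into physical I/Os using
    the RAID multipliers, then spread them across the array's drives."""
    read_cost, write_cost = RAID_IO_COST[raid_level]
    physical = reads * read_cost + writes * write_cost
    return physical / num_drives

# Example: 200 logical reads/sec and 100 logical writes/sec on a
# six-drive RAID 5 array: (200*1 + 100*4) / 6 = 100 I/Os per drive.
print(physical_iops_per_drive(200, 100, "5", 6))  # → 100.0
```

In this example each drive would be handling about 100 I/Os per second, well over the 85-per-second rule of thumb, so the array would be overdriven even though the logical load is only 300 I/Os per second. Note how the RAID 5 write penalty dominates the result.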

In general, the best way to correct an I/O subsystem bottleneck is simply to add more disk drives. But remember to look at other possible causes of the I/O bottleneck, such as a low cache hit rate and transactions that perform more I/O operations than are necessary. (As mentioned, in most cases, a cache hit rate below 90 percent is too low.) If you do find an I/O bottleneck, review the directions in Chapter 6 for determining the number of disk drives needed for your system.

Faulty Components

Occasionally, your system might experience performance problems because of faulty components. If the faulty component has not failed completely but is degrading, the problem could be difficult to debug. Because these problems and their solutions take so many different forms and are so complex, this book does not cover them in detail. Instead, we'll look at a few basic tips for identifying faulty-component problems:

  • Compare disk drives and arrays. When you view statistics in System Monitor, compare like components. If you notice, for example, that two disks are performing about the same number of I/O operations but are displaying different latencies, the slower disk drive might be experiencing a problem.
  • Watch the lights. Network hubs usually have collision lights. If you notice that a particular network segment is experiencing an unusually high number of collisions, you might have a faulty component, possibly the network card or the network cable.
  • Get to know your system. The more time you spend observing your system, the better you will understand its quirks. You will begin to recognize when the system is not performing normally.
  • Use System Monitor. This is a good way to monitor the behavior of the system on a regular basis.
  • Read the logs. Make a habit of regularly checking the SQL Server and the Windows 2000 Event Viewer system and application logs. Review these logs every day to catch problems before they get out of hand. This can't be emphasized enough.

Application

Another system component that commonly causes performance problems is the SQL Server applications. These problems can originate in the application code or in the SQL statements that the application is running. This section provides tips and guidelines for solving performance problems related to SQL Server applications.

Optimize Execution Plans

As you saw in Chapter 35, it's important to choose the best execution plan and data access method for a query. Unfortunately, there is no secret formula for determining the best plan. SQL Server automatically chooses the execution plan that Query Optimizer calculates is the best. You can use various types of hints in your queries, including query hints, table hints, and join hints. As you become familiar with these hints, you might be able to analytically determine the best plan for a query. Usually, you will need to look at various plans before you find an acceptable one.

Use Indexes Judiciously

As you saw in Chapters 17 and 35, the correct use of indexes is critical to good performance. Finding the desired data by using an index might require only 10 to 20 I/O operations, whereas finding the desired data via a table scan could require thousands or millions of I/O operations. However, indexes should be used with caution. Remember that when you modify data in a table by using an INSERT, UPDATE, or DELETE statement, the index or indexes associated with that data will be automatically updated to reflect the changed data, which produces I/O operations in addition to those produced by the table modification. Be careful to not create too many indexes; if you do, the performance of data modification operations will be affected by the overhead needed to maintain the indexes.

Use Stored Procedures

As you saw in Chapter 21, stored procedures are used to run prepackaged and precompiled SQL statements on the server. Calling stored procedures from your applications instead of calling individual SQL statements can help performance both by improving the reusability of SQL statements on the server and by greatly reducing network traffic. The amount of data sent between clients and servers can be reduced because a stored procedure resides on the server and because you can program processing and filtering of the data within the stored procedure, rather than within the application.



Microsoft SQL Server 2000 Administrator's Companion
ISBN: B001HC0RPI
EAN: N/A
Year: 2005
Pages: 264
