Performance Monitoring


There are three essential concepts you need to grasp before you can fully understand performance monitoring. These concepts are throughput, queues, and response time. Only after you fully understand these terms can you broaden your scope of analysis and perform calculations to report transfer rate, access time, latency, tolerance, thresholds, bottlenecks, and so on, and be sure SQL Server is performing at optimum efficiency.

What Is Rate and Throughput?

Throughput is the amount of work done in a unit of time. An example I like to use is drawn from observing my son. If he is able to assemble 100 pieces of Lego or K’nex per hour, I can say that his assembly rate is 100 pieces per hour, assessed over a period of x hours, as long as the rate remains constant and he gets enough chocolate milk. However, if the rate of assembly varies, through fatigue, lack of cheese slices, lack of milk, and so forth, I can still calculate the throughput as the average rate over the whole period.
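The arithmetic above can be sketched in a few lines. This is a minimal illustration; the piece counts and hours are made-up values from the Lego example, not measurements.

```python
# Throughput = work completed / elapsed time.

def throughput(work_completed, time_elapsed):
    """Average work done per unit of time."""
    return work_completed / time_elapsed

# Constant rate: 100 pieces every hour for 3 hours.
print(throughput(300, 3))            # 100.0 pieces/hour

# Varying rate: 100, then 60, then 20 pieces over 3 hours --
# the throughput is the average over the whole period.
print(throughput(100 + 60 + 20, 3))  # 60.0 pieces/hour
```

The same function covers both cases; only the interpretation changes when the instantaneous rate varies.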

The throughput will fall as the workload grows beyond capacity or as the available resources are reduced. In a computer system, or any system for that matter, the slowest point in the system sets the throughput for the system as a whole, which is why you often hear people use the cliché “the chain is only as strong as its weakest link.” We might be able to make references to millions of instructions per second, but that would be meaningless if a critical resource, such as memory, was not available to hold the instruction information, or worse, someone switched off the power.
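The weakest-link idea reduces to taking a minimum. In this sketch, the stage names and per-stage rates are hypothetical illustration values, not real measurements of any system.

```python
# The slowest stage sets the throughput of the whole system.
# Rates below are hypothetical requests/sec each resource could satisfy alone.

stage_throughput = {
    "cpu": 5000,
    "memory": 3000,
    "disk": 400,
}

bottleneck = min(stage_throughput, key=stage_throughput.get)
system_throughput = stage_throughput[bottleneck]

print(bottleneck, system_throughput)  # disk 400
```

However fast the CPU and memory are, the system as a whole delivers only what the disk can: 400 requests per second.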

What Is a Queue?

If I give my son too many K’nex to assemble, or reduce the time he has available to build the model, the number of pieces will begin to pile up. The same thing happens in software and information systems, where threads can back up, one behind the other, forming a queue. A typical scenario is the line at the bank or the supermarket that forms because there are more people waiting for service than there are tellers or cashiers. When a queue develops, we say that a bottleneck has occurred. Looking for bottlenecks in the system is the essence of monitoring for performance and troubleshooting or problem detection. If there are no bottlenecks, the system might be considered healthy; on the other hand, a bottleneck might soon start to develop.

Queues can also form if requests for resources are not evenly spread over the unit of time. If my son assembles 45 K’nex pieces at the rate of one piece per minute, he will get through every piece in 45 minutes. But if he does nothing for 30 minutes and then suddenly gets inspired, a bottleneck will occur in the final 15 minutes because there are more pieces than can be processed in the remaining time. On computer systems, when queues and bottlenecks develop, systems become unresponsive. No matter how good SQL Server 2005 may be as a DBMS, if queues develop (due, for example, to a denial-of-service attack on the server), additional requests for processor or disk resources will be stalled. When requests for service are not satisfied, the system begins to break down. To be alerted to this possibility, you need to track the response time of the system.
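The K’nex example can be replayed minute by minute to show how bursty arrivals create a backlog that steady arrivals never do. This is a simple illustrative simulation, not a model of a real server.

```python
# Simulate a queue: each minute, new work arrives and at most
# `service_rate` items can be processed.

def simulate(arrivals, service_rate):
    queue = 0
    backlog = []
    for arriving in arrivals:
        queue += arriving                 # new pieces arrive
        queue -= min(queue, service_rate) # capacity per minute
        backlog.append(queue)
    return backlog

# Even arrivals: one piece per minute for 45 minutes.
even = simulate([1] * 45, 1)
# Bursty arrivals: nothing for 30 minutes, then all 45 pieces at once.
bursty = simulate([0] * 30 + [45] + [0] * 14, 1)

print(max(even))    # 0  -- steady demand never queues
print(max(bursty))  # 44 -- the backlog right after the burst
print(bursty[-1])   # 30 -- pieces still unbuilt when the 45 minutes are up
```

Both scenarios present exactly 45 pieces in 45 minutes to a worker who can do one per minute, yet only the bursty one ends with work left over: the queue is a product of timing, not just volume.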

What Is Response Time?

When we talk about response time, we talk about the measure of how much time is required to complete a task. Response time will increase as the load increases. A system that has insufficient memory or processing capability will process a huge database sort or a complex join a lot more slowly than a better-endowed system with faster hard disks, CPUs, and memory. If response time is not satisfactory, we will have to either work with less data or increase the resources, which can be achieved by scale-up of the server and its resources (such as by adding more CPUs) or scale-out, which means adding more servers. Scale-up and scale-out are fully discussed in Chapter 9.

How do we measure response time? Easy. You just divide the queue length by the throughput. Response time, queues, and throughput are reported and calculated by the Windows Server 2003 reporting tools, so the work is done for you.
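The division described above is one line of code. The queue length and completion rate below are made-up numbers for illustration.

```python
# Response time = queue length / throughput.

def response_time(queue_length, throughput):
    """Average time a request spends waiting to be completed."""
    return queue_length / throughput

# 20 requests queued, and the server completes 4 requests per second:
print(response_time(20, 4))  # 5.0 seconds
```

Intuitively: if 20 requests are ahead of you and 4 are cleared every second, you will wait about 5 seconds, which is why response time climbs whenever the queue grows faster than the throughput can drain it.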

How the Performance Objects Work

Windows Server 2003’s performance monitoring objects are endowed with certain functionality known as performance counters. These counters perform the actual analysis for you. For example, the hard-disk object is able to calculate transfer-rate averages, while a processor-associated object is able to calculate processor-time averages.

To gain access to the data or to start the data collection, you first have to instantiate the performance object. The base performance object is stored in the operating system, but you need to make a copy of it to work with. This is done by calling a create function from a user interface or some other process. As soon as the object is created, its methods, or functions, are called to begin the data collection process and either store the data in properties or stream it out to files on disk or to RAM. You can then retrieve the data, assess it, and present it in some meaningful way.
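The create-collect-read lifecycle described above can be sketched as a small class. The class and method names here are hypothetical stand-ins; the real objects are exposed through the Windows performance-monitoring APIs, not through this Python code.

```python
# A hedged sketch of the lifecycle: instantiate a copy of the base object,
# call its methods to collect data, then read the data from its properties.

class PerformanceObject:
    def __init__(self, name):
        self.name = name
        self.samples = []          # collected data lives in properties

    def collect(self, value):
        """Method called to record one measurement."""
        self.samples.append(value)

    @property
    def average(self):
        """Counter-style computed value derived from the raw samples."""
        return sum(self.samples) / len(self.samples)

# Instantiate, collect, then assess the data.
disk = PerformanceObject("PhysicalDisk")
for transfer_rate in (120, 135, 110):   # made-up transfer-rate readings
    disk.collect(transfer_rate)
print(disk.average)
```

The shape is the point: creation yields a working copy, methods drive collection, and properties hold the results for later presentation.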

Every performance object can be instantiated, or created, at least once. This means that, depending on the object, your analysis software can create at least one copy of the performance object and analyze the counter information it generates; some performance objects can also be instantiated more than once. Windows allows you to instantiate an object for a local computer’s services, or you can create an object that operates on a remote computer.

Two methods of data collection and reporting are made possible using performance objects. First, the objects can sample the data. In other words, data is collected periodically rather than when a particular event occurs. This is a good idea because all forms of data collection place a burden on resources, and you don’t want to be taxing a system when the number of connections it is serving begins to skyrocket. So sampled data has the advantage of being a period-driven load, but it carries the disadvantage that the values may be inaccurate when activity falls outside the sampling window.
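A periodic sampling loop looks like this in outline. The `read_metric` function is a hypothetical stand-in for reading a real counter; only the timer-driven structure is the point.

```python
# Sampling: read the metric on a fixed interval, not on every event.
import time

def read_metric():
    # Stand-in for reading a real counter (e.g., % processor time).
    return 42

def sample(interval_seconds, count):
    readings = []
    for _ in range(count):
        readings.append(read_metric())
        time.sleep(interval_seconds)   # activity between samples goes unseen
    return readings

print(sample(0.01, 3))  # [42, 42, 42]
```

The `time.sleep` call is exactly where the stated disadvantage lives: anything that happens between two wake-ups is invisible to the samples.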

The other method of data collection is called event tracing. Event tracing, which is new to the Windows platform, enables you to collect data as and when certain events occur. And because there is no sampling window, you can correlate resource usage against events. As an example, you can “watch” an application consume memory when it executes a certain function and monitor when, and if, it releases that memory when the function completes.
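The event-driven alternative can be sketched with a callback that fires at each event. The event names and handler below are illustrative only; they are not the Windows event-tracing API.

```python
# Event tracing: record data when the event occurs, not on a timer,
# so resource usage can be correlated with the event itself.

collected = []

def on_event(event_name, memory_in_use):
    """Hypothetical handler invoked at each traced event."""
    collected.append((event_name, memory_in_use))

# Simulated application activity: memory is recorded per event, not per period.
on_event("function_enter", 1024)
on_event("allocation", 4096)
on_event("function_exit", 1024)

print(collected)
```

Because every event carries its own reading, you can see that memory grew inside the function and returned to its prior level on exit, a correlation a sampling window might miss entirely.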

But there is a downside to too much of a good thing: event tracing can consume more resources than sampling. You would thus want to perform event tracing only for short periods, where the objective of the trace is to troubleshoot rather than simply to monitor.

Counters report their data in one of two ways: instantaneous counting or average counting. An instantaneous counter displays the data as it happens; you could call it a snapshot. In other words, the counter does not compute the data it receives; it just reports it. On the other hand, average counting computes the data for you. For example, it is able to compute bits per second, or pages per second, and so forth. There are other counters you can use that are better able to report percentages, differences, and so on.
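The two reporting styles differ only in whether the counter computes anything. In this sketch the page counts and interval are made-up values.

```python
# Instantaneous counter: report the latest value as-is (a snapshot).
def instantaneous(latest_value):
    return latest_value                         # no computation, just a report

# Average counter: compute a rate from two raw readings over an interval.
def average_rate(prev_count, curr_count, seconds):
    return (curr_count - prev_count) / seconds  # e.g., pages per second

print(instantaneous(812))            # 812 pages currently in use
print(average_rate(1000, 1500, 10))  # 50.0 pages/second
```

The snapshot is cheap but tells you nothing about trend; the averaged counter needs two readings and a clock, which is exactly the computation the text says it performs for you.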




Microsoft SQL Server 2005: The Complete Reference
ISBN: 0072261528
Year: 2006
