Section 12.3. Defining Good Performance



Load is one thing, performance another. Good performance proves an elusive notion to define. Using CPU or performing a large number of I/O operations is not wrong in itself; your company, presumably, didn't buy powerful hardware with the idea of keeping it idle.

When the time comes to assess performance, there is a striking similarity between the world of databases and the world of corporate finance. You find in both worlds some longing for "key performance indicators" and magical ratios, and in both worlds, global indicators and ratios can be extremely misleading. A good average can hide distressing results during the peaks, and a significant part of the load may be traced back to a batch program that is far from optimal but that runs at a time of night when no one cares what the load is. To get a true appreciation of the real state of affairs, you must drill down to a lower level of detail.

To a large extent, getting down to the details is an exercise similar to what is known in managerial circles as "activity-based costing." In a company, knowing in some detail how much you spend is relatively easy. However, relating costs to benefits is an exercise fraught with difficulties, notoriously so for cross-cutting operations such as information technology. Determining whether you spend the right amount on hardware, software, and staff, as well as the rubber bands and duct tape required to hold everything together, is extremely difficult, particularly when the people who actually earn money are "customers" of the IT department.

Assessing whether you do indeed spend what you should has three prerequisites:

  • Knowing what you spend

  • Knowing what you get for the money

  • Knowing how your return on investment compares with acknowledged standards

In the following subsections, I shall consider each of these points in turn in the context of database systems.

12.3.1. Knowing What You Spend

In the case of database performance, what we spend means, first and foremost, how many data pages we are hitting. The physical I/Os that some people tend to focus on are an ancillary matter. If you hit a very large number of different data pages, this will necessarily entail sustained I/O activity unless your database entirely fits in memory. But CPU load is also often a direct consequence of hitting the same data pages in memory again and again. Reducing the number of data pages accessed is not a panacea, as there are cases when the global throughput is higher when some queries hit a few more pages than is strictly necessary. But as far as single indicators go, the number of data pages hit is probably the most significant one. The other cost to watch is excessive SQL statement parsing, an activity that can consume an inordinate amount of CPU (massive hardcoded insertions can easily take 75% of the CPU available for parsing alone).

The two most significant indicators of database load are the amount of CPU spent on statement parsing and the number of data pages visited when executing queries.
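The parsing cost is easy to underestimate. The following sketch (using Python's built-in sqlite3 module, purely as an illustration) shows how hardcoded literals turn what is logically one statement into a thousand distinct SQL texts, each of which the engine must parse separately, while the parameterized version is a single text parsed once and executed many times:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# Hardcoded style: every call produces a distinct SQL text that the
# engine must parse separately.
hardcoded = {f"INSERT INTO orders (id, amount) VALUES ({i}, {i * 1.5})"
             for i in range(1000)}

# Parameterized style: one SQL text, parsed once, executed many times.
parameterized = {"INSERT INTO orders (id, amount) VALUES (?, ?)"}

for stmt in hardcoded:
    conn.execute(stmt)
conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)",
                 [(i + 1000, i * 1.5) for i in range(1000)])

print(len(hardcoded), "distinct statements vs", len(parameterized))
```

Both loops insert the same number of rows; only the number of distinct statement texts, and therefore the parsing work, differs.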

12.3.2. Knowing What You Get

There is a quote that is famous among advertisers, a quip attributed to John Wanamaker, a 19th-century American retailer:

Half the money I spend on advertising is wasted; the trouble is I don't know which half.

The situation is slightly better with database applications, but only superficially. You define what you get in terms of the number of rows (or bytes) returned by select statements; and similarly the number of rows affected by change operations. But such an apparently factual assessment is far from providing a true measure of the work performed on your behalf by the SQL engine, for a number of reasons:

  • First, from a practical point of view, not all products provide you with such statistics.

  • Second, the effort required to obtain a result set may not be in proportion to the size of the result set. As a general rule, you can be suspicious of a very large number of data page hits when only a few rows are returned. However, such a proportion may be perfectly legitimate when data is aggregated. It is impossible to give a hard-and-fast rule in this area.

  • Third, should data returned from the database for the sole purpose of using it as input to other queries be counted as useful work? What about systematically updating to N a column in a table without using a where clause when N already happens to be the value stored in most rows? In both cases, the DBMS engine performs work that can be measured in terms of bytes returned or changed. Unfortunately, most of the work performed can be avoided.

There are times when scanning large tables or executing a long-running query may be perfectly justified (or indeed inescapable). For instance, when you run summary reports on very large volumes of data, you cannot expect an immediate answer. If an immediate answer is required, then it is likely that the data model (the database representation of reality) is inappropriate to the questions you want to see answered. This is a typical case in which a decision support database that doesn't necessarily require the level of detail of the main operational database may be suitable. Remember what you saw in Chapter 1: correct modeling depends both on the data and on what you want to do with the data. You may share data with your suppliers or customers and yet have a totally different database model than they do. Naturally, feeding a decision support system will require long and costly operations both on the source operational database and on the target decision support database.

Because what you do with the data matters so much, you cannot judge performance if you don't relate the load to the execution of particular SQL statements. The global picture that may be available through monitoring utilities (that most often provide cumulative counters) is not of much interest if you cannot assign to each statement its fair share of the load.

As a first stage in the process of load analysis, you must therefore capture and collect SQL statements, and try to determine how much each one contributes to the overall cost. It may not be important to capture absolutely every statement. Database activity is one of those areas where the 80/20 rule, the empirical assessment that 80% of the consequences result from 20% of the causes, often describes the situation rather well. Usually, much of the load comes from a small number of SQL statements. We must be careful not to overlook the fact that hardcoded SQL statements may distort the picture. With hardcoded statements, the DBMS may record thousands of distinct statements where a properly coded query would be referenced only once, even though it might be called thousands of times, each time with differing parameters. Such a situation can usually be spotted quite easily by the great number of SQL statements, and sometimes by global statistics. For instance, a procedure such as sp_trace_setevent in Transact-SQL lets you obtain a precise count of executed cursors, reexecutions of prepared cursors, and so on.

If nothing else is available and if you can access the SQL engine cache, a snapshot taken at a relatively low frequency of once every few minutes may in many cases prove quite useful. Big bad queries are usually hard to miss, as are queries that are being executed dozens of times a minute. Global costs should in any case be checked in order to validate the hypothesis that what has been missed contributes only marginally to the global load. It's when SQL statements are hardcoded that taking snapshots will probably give less satisfactory results; you should then try to get a more complete picture, either through logging (as already mentioned, a high-overhead solution), or by use of less intrusive "sniffer" utilities. I should note that even if you catch all hardcoded statements, they then have to be "reverse soft-coded" by taking constant values out of the SQL text before you can estimate the relative load, not of a single SQL statement, but of one particular SQL statement pattern.
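The "reverse soft-coding" step can be sketched as a simple statement normalizer. The Python function below is purely illustrative; its regular expressions are deliberately simplistic and would need refinement for real-world SQL dialects:

```python
import re

def normalize(sql: str) -> str:
    """Reduce a hardcoded statement to its pattern ("reverse soft-coding"):
    string and numeric literals are replaced by a ? placeholder so that
    thousands of hardcoded variants collapse into one pattern whose
    aggregate load can then be estimated."""
    s = re.sub(r"'(?:[^']|'')*'", "?", sql)    # string literals -> ?
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)   # numeric literals -> ?
    return re.sub(r"\s+", " ", s).strip().lower()
```

With such a normalizer, `WHERE id = 42` and `WHERE id = 17` map to the same pattern, and the per-pattern costs can simply be summed.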

Identifying the statements that keep the DBMS busy, though, is only part of the story. You will miss much if you don't then relate SQL activity to the essential business activity of the organization that the database supports. Having an idea of how many SQL statements are issued on average each time you process a customer order is more important to SQL performance than knowing the disk transfer rate or the CPU speed under standard conditions of temperature and pressure. For one thing, it helps you anticipate the effect of the next advertising campaign; and if the said number of SQL statements is in the hundreds, you can raise interesting questions about the program (could there be, by chance, SQL statements executed inside loops that fetch the results of other statements? Could there be a statement that is repeatedly executed when it needs to be executed only once?). Similarly, costly massive updates of one column in a table, accompanied by near-identical numbers of equally massive updates of other columns in the same table with similar where clauses, immediately raise the question of whether a single pass over the table wouldn't have been enough.
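Counting statements per business operation can be done with a thin instrumentation wrapper. The `CountingConnection` class below is a hypothetical sketch (again using sqlite3 for demonstration); it also shows the tell-tale pattern of statements executed inside a loop that fetches the results of another statement:

```python
import sqlite3

class CountingConnection:
    """Illustrative wrapper: count how many SQL statements one business
    operation (here, processing an order) actually issues. A surprisingly
    high count often betrays statements executed inside loops."""
    def __init__(self, conn):
        self.conn = conn
        self.statements = 0

    def execute(self, sql, params=()):
        self.statements += 1
        return self.conn.execute(sql, params)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (order_id INT, qty INT)")
conn.executemany("INSERT INTO order_lines VALUES (1, ?)",
                 [(q,) for q in range(50)])

db = CountingConnection(conn)

# Naive processing: one extra query per line fetched by a previous query.
lines = db.execute("SELECT qty FROM order_lines WHERE order_id = ?",
                   (1,)).fetchall()
total = 0
for (qty,) in lines:
    total += db.execute("SELECT ? * 2", (qty,)).fetchone()[0]

print("statements per order:", db.statements)
```

Fifty-one statements to process a fifty-line order is exactly the kind of figure that should prompt the questions raised above.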

Load figures must be related to SQL statements. SQL statements must be related to business activity. Business activity must be related to business requirements.

12.3.3. Checking Against Acknowledged Standards

Collecting SQL statements, evaluating their cost and roughly relating them to what makes a company or agency tick is an exercise that usually points you directly to the parts of the code that require in-depth review. The questionable code may be SQL statements, algorithms, or both. But knowing what you can expect in terms of improvement or how far you could or should go is a very difficult part of the SQL expert's craft; experience helps, but even the seasoned practitioner can be left with a degree of uncertainty.

It can be useful to establish a baseline, for instance by carrying out simple insertion tests and having an idea about the rate of insertion that is sustainable on your hardware. Similarly, you should check the fetch rate that can be obtained when performing those dreaded full scans on some of the biggest tables. Comparing bare-bones rates to what some applications manage to accomplish is often illuminating: there may be an order of magnitude or more between the fetch or insert speed that the SQL engine can attain and what is achieved by application programs.
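A baseline of this kind can be as simple as the following sketch (an in-memory sqlite3 database here, purely for illustration; on your own systems you would run the equivalent against your actual DBMS and tables, and the absolute figures are of course hardware-dependent):

```python
import sqlite3
import time

N = 100_000
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bench (id INTEGER, payload TEXT)")

# Baseline 1: raw sustainable insert rate.
t0 = time.perf_counter()
conn.executemany("INSERT INTO bench VALUES (?, ?)",
                 ((i, "x" * 20) for i in range(N)))
insert_secs = time.perf_counter() - t0

# Baseline 2: raw full-scan rate on the same table.
t0 = time.perf_counter()
rows = conn.execute("SELECT count(*) FROM bench").fetchone()[0]
scan_secs = time.perf_counter() - t0

print(f"insert: {N / insert_secs:,.0f} rows/s, "
      f"scan: {rows / scan_secs:,.0f} rows/s")
```

Comparing these bare-bones rates to what an application program achieves on the same hardware is what reveals the order-of-magnitude gaps mentioned above.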

Know the limits of your environment. Measure how many rows you can insert, fetch, update, or delete per unit of time on your machines.

Once you have defined a few landmarks, you can identify where you will obtain the best "return on improvement," in terms of both relevance to business activities and technical feasibility. You can then focus on those parts of your programs and get results where it matters.

Some practitioners tend to think that as long as end users don't complain about performance, there is no issue and therefore no time to waste on trying to make operations go faster. There is some wisdom in this attitude, but there is also some short-sightedness, for two reasons:

  • First, end users often have a surprisingly high level of tolerance for poor performance; or perhaps it would be more appropriate to say that their perception of slowness differs widely from that of someone who has a better understanding of what happens behind the scenes. End users may complain loudly about the performance of processes that cannot possibly do better, and express only mild dissatisfaction about other processes when I would have long since gone ballistic. A low level of complaint doesn't necessarily mean that everything is fine, nor does vocal dissatisfaction necessarily mean that there is anything wrong with an application, except perhaps trying to do too much.

  • Second, a slight increase in the load on a server may mean that performance deteriorates from acceptable to unacceptable very quickly. If the environment is perfectly stable, there is indeed nothing to fear from a slight increase in load. But if your activity records a very high peak during one particular month of the year, the same program that looks satisfactory for 11 months can suddenly be the reason for riots. Here the background noise matters a lot. An already overloaded machine cannot keep on providing the same level of service when activity increases. There is always a threshold past which mediocre performance tumbles all of a sudden. It is therefore important to study an entire system before a burst of activity is encountered, to see whether the load can be reduced by improving the code. If improving the code isn't enough to ensure acceptable performance, it may be time to switch to bigger iron and upgrade the hardware.

Do not forget that "return on improvement" is not simply a technical matter. The perception of end users should be given the highest priority, even if it is biased and sometimes disconnected from the most severe technical issues. They have to work with the program, and ergonomics have to be taken into account. It is not unusual to meet well-meaning individuals concentrating on improving statistics rather than program throughput, let alone end-user satisfaction. These well-intentioned engineers can feel somewhat frustrated and misunderstood when end users, who only see a very local improvement, welcome the result of mighty technical efforts with lukewarm enthusiasm. An eighteenth-century author reports that somebody once said to a physician, "Well, Mr. X has died, in spite of the promise you had made to cure him." The splendid answer from the physician was, "You were away, and didn't check the progress of the treatment: he died cured."

A database with excellent statistics and yet unsatisfactory performance from an end-user point of view is like a patient cured of one ailment, but who died of another. Improving performance usually means both delivering a highly visible improvement to end users, even if it affects a query that is run only once a month but that is business-critical, and the more humble, longer-term work of streamlining programs, lowering the background noise, and ensuring that the server will be able to deliver that power boost when it is needed.

Performance improvement as perceived by end users is what matters most, but never forget the narrow margin between acceptable and unacceptable performance in a loaded environment.

12.3.4. Defining Performance Goals

Performance goals are often defined in terms of elapsed time, for example, "this program must run in under 2 hours." It is better though to define them primarily in terms of business items processed by unit of time, such as "50,000 invoices per hour" or "100 loans per minute," for several reasons:

  • It gives a better idea of the service actually provided by a given program.

  • It makes a decrease in performance more understandable to end users when it can be linked to an increase in activity. This makes meetings less stormy.

  • Psychologically speaking, it is slightly more exciting when trying to improve a process to boost throughput rather than diminish the elapsed time. An upward curve makes a better chart in management presentations than a downward one.

More than anything else, improved performance means first, doing more work in the same time, and second, doing it in even less time.




The Art of SQL
ISBN: 0596008945