2.1 Specification Reliability

A project specification can be called "reliable" only if any project that successfully fulfills the letter of that specification also fulfills the specification's true intent. Unfortunately, the most commonly used specifications for performance improvement projects are unreliable. Examples of specifications include:

  • Distribute disk I/O as uniformly as possible across many disk drives.

  • Ensure that there is at least x% of unused CPU capacity during peak hours.

  • Increase the database buffer cache hit ratio to at least x%.

  • Eliminate all full-table scans from the system.

Each of these specifications is unreliable because the letter of each specification can be accomplished without actually producing a desired impact upon your system. There is a simple game that enables you to determine whether you have a reliable specification or not:

To establish whether or not the specification for a performance improvement project is reliable, ask yourself the question: "Is it possible to achieve the stated goal (the specification) of such a project without actually improving system performance?"

One easy way to get the game going is to imagine the existence of an evil genie. Is it possible for an evil genie to adhere to the letter of your "wish" (the project specification) while producing a project result that actually contradicts your obvious underlying goal? If the evil genie can create a system on which she could meet your project specification but still produce an unsatisfactory performance result, then the project specification has been proved unreliable.

The evil genie game is a technique employed in thought experiments by René Descartes in the 1600s and, more recently, by Elizabeth Hurley's character in the film Bedazzled. Here's how the evil genie game can play out for the bad specifications listed earlier:

Distribute disk I/O as uniformly as possible across many disk drives

This specification is a perfectly legitimate goal for trying to prevent performance problems when you are configuring a new system, but it is an unreliable specification for performance improvement projects. On many systems, even a significant improvement in disk I/O performance will have a negligible, or even negative, impact on overall performance.

For example, imagine a system in which each of the most important business processes needing performance repair consumes less than 5% of the system's total response time performing disk I/O operations. (We have hundreds of trace files that fit this description at www.hotsos.com.) On such a system, no amount of disk I/O "tuning" can improve response time by more than 5%. Since distributing disk I/O uniformly across many disk drives can result in a system without meaningfully improved performance, this specification is unreliable.
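To see why 5% is a hard ceiling here, consider a minimal back-of-the-envelope sketch (in Python, with hypothetical numbers rather than figures from any real trace file): even eliminating the disk I/O component entirely cannot reduce response time by more than that component's share of the total.

    # Hypothetical illustration of the bound described above: if disk I/O
    # contributes only 5% of a program's response time, then even eliminating
    # that I/O entirely improves response time by at most 5%.
    total_response_time = 100.0   # seconds for one execution (hypothetical)
    io_fraction = 0.05            # share of response time spent on disk I/O

    io_time = total_response_time * io_fraction    # 5.0 seconds
    best_case = total_response_time - io_time      # 95.0 seconds if I/O were free

    improvement = (total_response_time - best_case) / total_response_time
    print(f"Best possible improvement from I/O tuning alone: {improvement:.0%}")  # 5%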

Ensure that there is at least x% of unused CPU capacity during peak hours

There are several ways that an evil genie could accomplish this goal without helping the performance of your system. One way is to introduce a horrific disk I/O bottleneck, such as by placing the entire database on one gigantic disk drive with excessively poor I/O-per-second capacity. As more and more user processes stack up in the disk I/O queue, much CPU capacity will go unused. Since increasing the amount of unused CPU can result in worse performance, this specification is unreliable.

Increase the database buffer cache hit ratio to at least x%

This one's easy: simply use Connor McDonald's innovative demonstration that I include in Appendix B. The application will show you how to increase your database buffer cache hit ratio to as many nines as you like, by adding CPU-wasting unnecessary workload. This additional wasted workload will of course degrade the performance of your system, but it will "improve" your buffer cache hit ratio. Connor's application is, of course, a trick designed to demonstrate that it is a mistake to rely on the buffer cache hit ratio as a measure of system "goodness." (I happen to know that Connor is definitely not evil, although I have on occasion noticed him exhibit behavior that is at least marginally genie-like.)

There are subtler ways to degrade a system's performance while "improving" its cache hit ratio. For example, SQL "tuners" often do it when they engage in projects to eradicate TABLE SCAN FULL row source operations (discussed again in the next specification I'll show). Another way an evil genie could improve your cache hit ratio in a way that harms performance is to reduce all your array fetch sizes to a single row [Millsap (2001b)]. Because it is so easy to increase the value of your buffer cache hit ratio in ways that degrade system performance, this specification is particularly unreliable.
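To make the arithmetic of this trick concrete, here is a minimal sketch in Python using one common formulation of the ratio, (logical reads - physical reads) / logical reads, and entirely hypothetical figures; it is not Connor's actual demonstration, only the effect it exploits:

    # Adding wasted, cache-resident workload "improves" the hit ratio.
    def hit_ratio(logical_reads, physical_reads):
        return (logical_reads - physical_reads) / logical_reads

    # Original workload: 1,000,000 logical reads, 100,000 satisfied from disk.
    print(f"before: {hit_ratio(1_000_000, 100_000):.4f}")    # 0.9000

    # Pile on 9,000,000 unnecessary logical reads that all hit the cache.
    # The system burns far more CPU doing wasted work, yet the ratio rises.
    print(f"after:  {hit_ratio(10_000_000, 100_000):.4f}")   # 0.9900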

Eliminate all full-table scans from the system

Unfortunately, many students of SQL performance optimization learn early the untrue rule of thumb that "all full-table scans are bad." An evil genie would have an easy time concocting hundreds of SQL statements whose performance would degrade as TABLE SCAN FULL row source operations were eliminated [Millsap (2001b); (2002)]. Because eliminating full-table scans can actually degrade performance, the action is an unreliable basis for a performance improvement project specification.

The cure for unreliable performance improvement specifications is conceptually simple. Just say what you mean. But of course, by the same logic, golf is simple: just hit the ball into the hole every time you swing. The problem in curing unreliable performance improvement specifications is to figure out how to specify what you really mean in a manner that doesn't lead to other errors. For example, a performance specification that comes closer to saying what you really mean is this one:

Make the system go faster.

However, even this specification is unreliable. I've seen dozens of projects with this specification result in ostensible success but practical failure. For example, a consultant finds, by examining V$SQL, a batch job that consumes four hours. He "tunes" it so that it runs in 30 minutes. This is a project success; the consulting engagement summary says so. However, the success was meaningless. The batch program was already as fast as it needed to be, because it ran in an otherwise vacant eight-hour batch window. The expensive input into performance improvement (the consultant's fee) produced no positive value to the business.

Worse yet, I've seen analysts make some program A go faster, but at the expense of making another vastly more important program B go slower. Many systems contain process interdependencies that can cause this situation. On these systems, "tuning" the wrong program not only consumes time and money to execute the tuning project, it results in actual degradation of a system's value to the business (see Section 12.1 for an example).

This "make the system go faster" specification is just too vague to be useful. In my service line management role at Oracle Corporation, I had many discussions about how to specify projects ”the whole idea of packaged services requires contract-quality specification of project goals. Most participants in those meetings understood very quickly that "make the system go faster" is too vague. What I find remarkable today is that most of these people saw the vagueness in entirely the wrong place.

Most people identify the go faster part of the specification as the root of the problem. People commonly suggest that "make the system go faster" is deficient because the statement doesn't say, numerically, how much faster. In my Oracle meetings, explorations of how to improve "make the system go faster" generally led to discussion of various ways to measure actual and perceived speeds, ways to establish "equivalency" metrics such as count-based utilization measures (like hit ratios), and so on. Of course, the search for "equivalency" measures finds a dead end quickly because, if you execute the evil genie test correctly, such presumed equivalency measures are usually unreliable.

Figuring out how much faster a system "needs to go" often leads into expensive project rat holes. (An exception is when an analyst has found the maximum allowable service time for an operation by using a model like the queueing theory one that I describe in Chapter 9.) When our students today discuss the "make the system go faster" spec, it usually takes very little leading for them to realize that the real problem is actually hidden in the word system. For example, consider the following commonly suggested "improvements" to the original "make the system go faster" specification:

  • Make the system go 10% faster.

  • Make the system complete all business functions in less than one second.

First of all, each specification expressed in this style is susceptible to the same evil genie tricks as the original spec. But by adding detail, we've actually weakened the original statement. For example:

Make the system go 10% faster

Do you really expect that every business transaction on the system can go 10% faster? Even those that perform only a couple of Oracle logical I/O calls (LIOs) to begin with? On the other hand, is 10% really enough of an improvement for an online query that consumes seventeen minutes of response time?

Make the system complete all business functions in less than one second

Is it really good enough for a single-row fetch via a primary key to consume 0.99 seconds of response time? On the other hand, is it really reasonable to expect that an Oracle application should be able to emit a 72-page report in less than one second?

Do these two formats actually improve the original "make the system go faster" specification? They do not. The bigger problem is the lack of definition for the word "system."

2.1.1 The System

What is the system? Most database and system administrators interpret the term very differently from anyone else in the business. To most database and system administrators, the system is a complex collection of processes and shared memory and files and locks and latches, and all sorts of technical things that can be measured by looking at "V$ tables" and operating system utilities and maybe even graphical system monitoring dashboards. However, nobody else in the business sees a system this way. A user thinks of the system as the collection of the few forms and batch jobs in that user's specific job domain. A manager thinks of the system as a means for helping improve the efficiency of the business. To users and managers, the redness, yellowness, or greenness of your dashboard dials is completely and utterly irrelevant.

Here's a simple test to determine for yourself whether I'm telling the truth. Try to imagine yourself as a user who has just waited two hours past your reporting deadline this morning because your "fifteen-minute report" required three full hours to run. Try to imagine your reaction to a database administrator who would say the following words in front of your colleagues during a staff meeting: "There was absolutely nothing wrong with the system while your report was running, because all our dashboard dials were green during the entire three-hour period."

Please remember this when you are acting in the role of performance analyst: a system is a collection of end-user programs. An end-user is watching each of these programs attentively. (If no one is watching a particular program attentively, then it should be running only during off-peak time periods, or perhaps not at all.) The duration that each program requires to deliver a requested chunk of business value is that program's response time. The response time of an individual user action is practically the only performance metric that your business cares about. Hence:

Response time for an end-user action is the first metric that you should care about.
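To illustrate what this means in practice, here is a minimal sketch (in Python, with entirely hypothetical component timings) of the kind of response-time accounting the statement implies: the metric that matters is the total duration the user experiences, broken down by where that time actually went.

    # Response-time accounting for a single end-user action (hypothetical data).
    profile = {                        # seconds contributed to one user action
        "CPU service":         1.2,
        "disk I/O wait":       0.3,
        "lock/latch wait":     4.8,
        "network round trips": 0.7,
    }

    response_time = sum(profile.values())
    for component, seconds in sorted(profile.items(), key=lambda kv: -kv[1]):
        print(f"{component:22s} {seconds:5.1f}s  {seconds / response_time:6.1%}")
    print(f"{'total response time':22s} {response_time:5.1f}s")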

2.1.2 Economic Constraints

When you eliminate the ambiguity of the word "system," you take one big step closer to a foolproof goal:

Improving the performance of program A during the weekday 2:00 p.m. to 3:00 p.m. window is critical to the business. Improve the performance of A as much as possible for this time period.

But is this specification evil genie-proof? Not yet. Imagine that the average run time of program A is two minutes. Suppose that the evil genie could reduce the response time from two minutes to 0.25 seconds. Great... But at a cost of $1,000,000,000. Oops. Maybe improving response time to only 0.5 seconds would have been good enough and would have cost only $2,000. The specification omits any mention of an economic constraint.
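A quick cost-per-benefit calculation (a Python sketch using the figures from this example; the dollar amounts and run times are purely illustrative) shows how lopsided the two outcomes are once cost enters the picture:

    # Cost per second of response time saved, for the two hypothetical outcomes.
    baseline = 120.0   # seconds: average run time of program A today

    proposals = {
        "genie's 0.25 s result": {"new_time": 0.25, "cost": 1_000_000_000},
        "modest 0.5 s result":   {"new_time": 0.50, "cost": 2_000},
    }

    for name, p in proposals.items():
        saved = baseline - p["new_time"]
        print(f"{name:22s} saves {saved:6.2f}s at ${p['cost'] / saved:,.2f} per second saved")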

There is an optimization project specification that I believe may actually be evil genie-proof. It is the optimization goal described by Eli Goldratt in [Goldratt (1992), 49]:

Make money by increasing net profit, while simultaneously increasing return on investment, and simultaneously increasing cash flow.

This specification gives us the ultimate acid test by which to judge any other project specification. However, it does fall prey to the same "hit the ball into the hole on every swing" lack of detail that I discussed earlier.


   