1.3 Three Important Advances

In the Preface, I began with the statement:

Optimizing Oracle response time is, for the most part, a solved problem.

This statement stands in stark contrast to the gloomy picture I painted at the beginning of this chapter: that, "For many people, Oracle system performance is a very difficult problem." The contrast, of course, has a logical explanation. It is this:

Several technological advances have added impact, efficiency, measurability, predictive capacity, reliability, determinism, finiteness, and practicality to the science of Oracle performance optimization.

In particular, I believe that three important advances are primarily responsible for the improvements we have today. Curiously, while these advances are new to most professionals who work with Oracle products, none of these advances is really "new." Each is used extensively by optimization analysts in non-Oracle fields; some have been in use for over a century.

1.3.1 User Action Focus

The first important advance in Oracle optimization technology follows from a simple mathematical observation:

You can't extrapolate detail from an aggregate.

Here's a puzzle to demonstrate my point. Imagine that I told you that a collection of 1,000 rocks contains 999 grey rocks and one special rock that's been painted bright red. The collection weighs 1,000 pounds. Now, answer the following question: "How much does the red rock weigh?" If your answer is, "I know that the red rock weighs one pound," then, whether you realize it or not, you've told a lie. You don't know that the red rock weighs one pound. With the information you've been given, you can't know. If your answer is, "I assume that the red rock weighs one pound," then you're too generous in what you're willing to assume. Such an assumption puts you at risk of forming conclusions that are incorrect, perhaps even stunningly incorrect.

The correct answer is that the red rock can weigh virtually any amount between zero and 1,000 pounds. The only thing limiting the low end of the weight is the definition of how many atoms must be present in order for a thing to be called a rock. Once we define how small a rock can be, then we've also defined the high end of our answer: it is 1,000 pounds minus the weight of 999 of the smallest possible rocks. Answering with any more precision is wrong unless you happen to be very lucky. But being very lucky at games like this is a skill that can be neither learned nor taught, nor repeated with acceptable reliability.
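To make the bounds concrete, here is a quick sketch in Python. The one-gram minimum rock weight is purely a hypothetical assumption for illustration; nothing in the puzzle itself fixes how small a rock can be:

```python
# Bounds on the red rock's weight, given only the aggregate.
# ASSUMPTION (not from the puzzle): the smallest "rock" weighs 1 gram.
POUNDS_PER_GRAM = 1 / 453.59237

total_weight_lb = 1000.0            # aggregate weight of all 1,000 rocks
n_grey = 999
min_rock_lb = 1 * POUNDS_PER_GRAM   # hypothetical minimum rock weight

low = min_rock_lb                              # red rock itself is minimal
high = total_weight_lb - n_grey * min_rock_lb  # every grey rock is minimal

print(f"red rock weighs between {low:.4f} and {high:.2f} pounds")
# → red rock weighs between 0.0022 and 997.80 pounds
```

The aggregate constrains the answer to a range spanning almost the entire 1,000 pounds, which is the point: the detail simply isn't in the aggregate.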

This is one reason why Oracle analysts find it so frustrating to diagnose performance problems armed only with system-wide statistics such as those produced by Statspack (or any of its cousins derived from the old SQL scripts called bstat and estat). Two analysts looking at exactly the same Statspack output can "see" two completely different things, neither of which is completely provable or completely disprovable by the Statspack output. It's not Statspack's fault. It's a problem that is inherent in any performance analysis that uses system-wide data as its starting point (V$SYSSTAT, V$SYSTEM_EVENT, and so on). You can in fact instruct Statspack to collect sufficiently granular data for you, but no Statspack documentation of which I'm aware makes any effort to tell you why you might ever want to.

A fine illustration is the case of an Oracle system whose red rock was a payroll processing problem. The officers of the company described a performance problem with Oracle Payroll that was hurting their business. The database administrators of the company described a performance problem with latches: cache buffers chains latches, to be specific. Both arguments were compelling. The business truly was suffering from a problem with payroll being too slow. You could see it, because checks weren't coming out of the system fast enough. The "system" truly was suffering from latch contention problems. You could see it, because queries of V$SYSTEM_EVENT clearly showed that the system was spending a lot of time waiting for the event called latch free.

The company's database and system administration staff had invested three frustrating months trying to fix the "latch free problem," but the company had found no relief for the payroll performance problem. The reason was simple: payroll wasn't spending time waiting for latches. How did we find out? We acquired operational timing data for one execution of the slow payroll program. What we found was amazing. Yes, lots of other application programs in fact spent time waiting to acquire cache buffers chains latches. But of the slow payroll program's total 1,985.40-second execution time, only 23.69 seconds were consumed waiting on latches. That's 1.2% of the program's total response time. Had the company completely eradicated waits for latch free from the face of their system, they would have made only a 1.2% performance improvement in the response time of their payroll program.
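The arithmetic behind that 1.2% figure is simple enough to check directly. The two durations are the ones quoted above; the script itself is just illustrative:

```python
# Share of the payroll program's response time spent waiting on latches.
total_s = 1985.40   # total execution time of the slow payroll run
latch_s = 23.69     # time spent waiting on 'latch free' within that run

pct = 100 * latch_s / total_s
print(f"latch free accounts for {pct:.1f}% of response time")
# → latch free accounts for 1.2% of response time
```

Even a total cure for the system's most visible wait event could never have bought back more than that 1.2%.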

How could system-wide statistics have been so misleading? Yes, lots of non-payroll workload was prominently afflicted by latch free problems. But it was a grave error to assume that the payroll program's problem was the same as the system-wide average problem. The error in assuming a cause-effect relationship between latch free waiting and payroll performance cost the company three months of wasted time and frustration and thousands of dollars in labor and equipment upgrade costs. By contrast, diagnosing the real payroll performance problem consumed only about ten minutes of diagnosis time once the company saw the correct diagnostic data.

My colleagues and I encounter this type of problem repeatedly. The solution is for you (the performance analyst) to focus entirely upon the user actions that need optimizing. The business can tell you what the most important user actions are. The system cannot. Once you have identified a user action that requires optimization, then your first job is to collect operational data exactly for that user action: no more, and no less.

1.3.2 Response Time Focus

For a couple of decades now, Oracle performance analysts have labored under the assumption that there's really no objective way to measure Oracle response time [Ault and Brinson (2000), 27]. In the perceived absence of objective ways to measure response time, analysts have settled for the next-best thing: event counts. And of course from event counts come ratios. And from ratios come all sorts of arguments about which "tuning" actions are important, and which ones are not.

However, users don't care about event counts and ratios and arguments; they care about response time: the duration that begins when they request something and ends when they get their answer. No matter how much complexity you build atop any timing-free event-count data, you are fundamentally doomed by the following inescapable truth, the subject of the second important advance:

You can't tell how long something took by counting how many times it happened.

Users care only about response times. If you're measuring only event counts, then you're not measuring what the users care about. If you liked the red rock quiz, here's another one for you: What's causing the performance problem in the program that produced the data in Example 1-1?

Example 1-1. Components of response time listed in descending order of call volume

 Response Time Component          # Calls
 ------------------------------ ---------
 CPU service                       18,750
 SQL*Net message to client          6,094
 SQL*Net message from client        6,094
 db file sequential read            1,740
 log file sync                        681
 SQL*Net more data to client          108
 SQL*Net more data from client         71
 db file scattered read                34
 direct path read                       5
 free buffer waits                      4
 log buffer space                       2
 direct path write                      2
 log file switch completion             1
 latch free                             1

Example 1-2 shows the same data from the same program execution, this time augmented with timing data (reported in seconds) and sorted by descending response time impact. Does it change your answer?

Example 1-2. Components of response time listed in descending order of contribution to response time

 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 SQL*Net message from client       166.6s  91.7%          6,094    0.027338s
 CPU service                         9.7s   5.3%         18,750    0.000515s
 unaccounted-for                     2.2s   1.2%
 db file sequential read             1.6s   0.9%          1,740    0.000914s
 log file sync                       1.1s   0.6%            681    0.001645s
 SQL*Net more data from client       0.3s   0.1%             71    0.003521s
 SQL*Net more data to client         0.1s   0.1%            108    0.001019s
 free buffer waits                   0.1s   0.0%              4    0.022500s
 SQL*Net message to client           0.0s   0.0%          6,094    0.000007s
 db file scattered read              0.0s   0.0%             34    0.001176s
 log file switch completion          0.0s   0.0%              1    0.030000s
 log buffer space                    0.0s   0.0%              2    0.005000s
 latch free                          0.0s   0.0%              1    0.010000s
 direct path read                    0.0s   0.0%              5    0.000000s
 direct path write                   0.0s   0.0%              2    0.000000s
 ----------------------------- ----------------- -------------- ------------
 Total                             181.8s 100.0%
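Once durations are available, producing a profile like Example 1-2 from raw measurements is purely mechanical: sum the durations, sort descending, and compute each component's share and per-call cost. Here is a minimal sketch using a subset of the Example 1-2 figures; the profile-building code is mine, purely illustrative, not anything Oracle ships:

```python
# Turn raw (component, duration_seconds, call_count) measurements into a
# resource profile sorted by descending response time contribution.
# The sample rows are a subset of Example 1-2, not new data.
events = [
    ("CPU service",                   9.7, 18750),
    ("SQL*Net message from client", 166.6,  6094),
    ("db file sequential read",       1.6,  1740),
    ("log file sync",                 1.1,   681),
]

total = sum(dur for _, dur, _ in events)
profile = sorted(events, key=lambda e: e[1], reverse=True)

for name, dur, calls in profile:
    share = 100 * dur / total
    print(f"{name:30s}{dur:8.1f}s {share:5.1f}% {calls:8,d} {dur / calls:.6f}s")
```

Sorting by duration rather than by call count is the whole trick: it puts the component that dominates response time at the top, where your attention belongs.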

Of course it changes your answer, because response time is dominatingly important, and event counts are inconsequential by comparison. The problem with the program that generated this data is what's going on with SQL*Net message from client, not what's going on with CPU service.

If you are an experienced Oracle performance analyst, you may have heard that SQL*Net message from client is an idle event that can be ignored. You must not ignore the so-called idle events if you collect your diagnostic data in the manner I describe in Chapter 3.

If the year were 1991, we'd be in big trouble right now, because in 1991 the data that I've shown in this second table wasn't available from the Oracle kernel. But if you've upgraded by now to at least Oracle7, then you don't need to settle for event counts as the "next-best thing" to response time data. The basic assumption that you can't tell how long the Oracle kernel takes to do things is simply incorrect, and it has been since Oracle release 7.0.12.

1.3.3 Amdahl's Law

The final "great advance" in Oracle performance optimization that I'll mention is an observation published in 1967 by Gene Amdahl, which has become known as Amdahl's Law [Amdahl (1967)]:

The performance enhancement possible with a given improvement is limited by the fraction of the execution time that the improved feature is used.

In other words, performance improvement is proportional to how much a program uses the thing you improved. Amdahl's Law is why you should view response time components in descending response time order. In Example 1-2, it's why you don't work on the CPU service "problem" before figuring out the SQL*Net message from client problem. If you were to reduce total CPU consumption by 50%, you'd improve response time by only about 2.7%. But if you could reduce the response time attributable to SQL*Net message from client by the same 50%, you'd reduce total response time by 46%. In Example 1-2, each percentage point of reduction in SQL*Net message from client duration produces about seventeen times the impact of a percentage point of CPU service reduction (166.6 seconds of contribution versus 9.7 seconds).
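Applying the law to the Example 1-2 figures makes the comparison concrete. This small script is illustrative; the 50% improvement factor is just the hypothetical from the discussion above:

```python
# Amdahl's Law: the overall speedup from improving a component is
# limited by that component's share of total time.
# Durations are taken from Example 1-2.
total_s = 181.8
components = {
    "SQL*Net message from client": 166.6,
    "CPU service": 9.7,
}

results = {}
for name, dur in components.items():
    saved = 0.5 * dur                      # cut this component's time in half
    results[name] = 100 * saved / total_s  # % reduction in TOTAL response time
    print(f"halving {name}: {results[name]:.1f}% overall improvement")
```

The same 50% improvement yields roughly a 46% overall gain applied to the dominant component, but under 3% applied to CPU service.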

Amdahl's Law is a formalization of optimization common sense. It tells you how to get the biggest "bang for the buck" from your performance improvement efforts.

1.3.4 All Together Now

Combining the three advances in Oracle optimization technology into one statement results in the following simple performance method:

Work first to reduce the biggest response time component of a business' most important user action.

It sounds easy, right? Yet I can be almost certain that this is not how you optimize your Oracle system back home. It's not what your consultants do or what your tools do. This way of "tuning" is nothing like what your books or virtually any of the other papers presented at Oracle seminars and conferences since 1980 tell you to do. So what is the missing link?

The missing link is that unless you know how to extract and interpret response time measurements from your Oracle system, you can't implement this simple optimization method. Explaining how to extract and interpret response time measurements from your Oracle system is a main point of this book.

I hope that by the time you read this book, my claims that "this is not how you do it today" don't make sense anymore. As I write this chapter, many factors are converging to make the type of optimization I'm describing in this book much more common among Oracle practitioners. If the book you're holding has played an influencing role in that evolution, then so much the better.


   
Optimizing Oracle Performance
ISBN: 059600527X
Year: 2002