1.5 Method R

The real goal of this book is not just to help you make an Oracle system go faster. The real goal of this book is to optimize the project that makes an Oracle system go faster. I don't just want to help you make one system faster. I want to help you make any system faster, and I want you to be able to accomplish that task in the most economically efficient way possible for your business. Method R is the method I will describe by which you can achieve this goal. Method R is in fact the basis for the remainder of this book.

Method R: A Response Time-Based Performance Improvement Method That Yields Maximum Economic Value to Your Business

  1. Select the user actions for which the business needs improved performance.

  2. Collect properly scoped diagnostic data that will allow you to identify the causes of response time consumption for each selected user action while it is performing sub-optimally.

  3. Execute the candidate optimization activity that will have the greatest net payoff to the business (a simple way to frame this comparison is sketched after this list). If even the best net-payoff activity produces insufficient net payoff, then suspend your performance improvement activities until something changes.

  4. Go to step 1.
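
One way to read step 3 is as a plain cost-benefit comparison. The sketch below is not a formula from the method itself; it simply makes explicit the arithmetic the step implies, with benefit and cost expressed in the units the business cares about (money or time):

    net payoff of a candidate activity = expected business benefit of the time saved
                                         - expected cost of performing the activity

    Execute the candidate with the greatest net payoff, but only if that payoff is
    positive; otherwise, suspend the project until something changes.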

Method R is conceptually very simple. As you should expect, it is merely a formalization of the simple "Work first to reduce the biggest response time component of a business' most important user action" objective that you've seen many times by now.

1.5.1 Who Uses the Method

An immediately noticeable distinction of Method R is the type of person who will be required to execute it. Method R specifically cannot be performed in isolation by a technician who has no interest in your business. As I have said, the goal of Method R is to improve the overall value of the system to the business. This goal cannot be achieved in isolation from the business. But how does a person who leads the execution of Method R fit into an information technology department?

1.5.1.1 The abominable smokestack

Most large companies organize their technical infrastructure support staff in a manner that I call the "abominable smokestacks," like the departmental segmentation shown in Figure 1-2. Organizational structures like this increase the difficulty of optimizing the performance of a system, for one fundamental reason:

Compartmentalized organizational units tend to optimize in isolation from other organizational units, resulting in locally optimized components. Even if they succeed in doing this, it's not necessarily good enough. A system consisting of locally optimized components is not necessarily itself an optimized system.

One of Goldratt's many contributions to the body of system optimization knowledge is a compelling illustration of how local optimization does not necessarily lead to global optimization [Goldratt (1992)].

Figure 1-2. Typical organizational structure for a technical infrastructure department

The smokestack mentality is pervasive. Even the abstract submission forms we use to participate in Oracle conferences require that we choose a smokestack for each of our presentations (conference organizers tend to call them tracks instead of smokestacks). There is, for example, one track for papers pertaining to database tuning, and a completely distinct track for papers pertaining to operating system tuning. What if a performance optimization solution requires that attention be paid iteratively to both components of the technology stack? I believe the mere attempt at categorization discourages analysts from considering such solutions. At least analysts who do implement solutions that span stack layers are assured of having a difficult time choosing the perfect track for their paper proposals.

One classic aspect of segmentation is particularly troublesome for almost every Oracle system owner I've ever talked with: the distinction between application developers and database administrators. Which group is responsible for system performance? The answer is both. There are performance problems that application developers will not detect without assistance from database administrators. Likewise, there are performance problems that database administrators will not be able to repair without assistance from application developers.

The Goal

One inspiration behind Method R is the story told in Eli Goldratt's The Goal [Goldratt (1992)]. The Goal describes the victory of a revolutionary new performance optimization method over a method that is culturally ingrained but produces inferior results. Goldratt's method applies to factory optimization, but his story is eerily reminiscent of what the Oracle community is going through today: the overthrow of an optimization method based upon a faulty measurement system.

The Goal dismantles many false ideas that analysts think they "know" about optimization. Two of the most illuminating lessons that I learned from the book were:

  • Cost accounting practices often promote bad optimization decisions. Oracle practitioners use cost accounting practices when they target a system's hit ratios for optimization.

  • A collection of optimized components is itself not necessarily optimized. This explains why systems with 100% "best in class" componentry can have performance problems. It explains why so many slow Oracle systems have dozens of component administrators standing behind them who each swears that his component "can't possibly be the cause of a performance problem."

If you haven't read The Goal, then I think you're in for a real treat. If you have read it already, then consider reading it again with the intent to apply what you read by analogy to the world of Oracle performance. The cover says that "Goal readers are now doing the best work of their lives." This statement is a completely accurate portrayal of my personal relationship with the book.

1.5.1.2 The optimal performance analyst

A company's best defense against performance problems begins with a good performance analyst who can diagnose and discourse intelligently in all the layers of the technology stack. In the context of Figure 1-2, this person is able to engage successfully "in the smoke." The performance analyst can navigate above the smokestacks long enough to diagnose which pipes to dive into. And the best analyst has the knowledge, intelligence, charisma, and motivation to drive change in the interactions among smokestacks once he's proven where the best leverage is.

Of the dozens of great Oracle performance analysts I've had the honor of meeting, most share a common set of behavioral qualities that I believe form the basis for their success. The best means I know for describing the capabilities of these talented analysts is a structure described by Jim Kennedy and Anna Everest [Kennedy and Everest (1994)], which decomposes personal behavioral qualities into four groups:

Education/experience/knowledge factors

In the education/experience/knowledge category, the capabilities required of the optimal analyst are knowledge of the business goals, processes, and user actions that comprise the life of the business. The optimal analyst knows enough about finance to understand the types of input information that will be required for a financially-minded project sponsor to make informed investment decisions during a performance improvement project. And the optimal analyst of course understands the technical components of his application system, including the hardware, the operating system, the database server, the application programs, and any other computing tiers that join clients to servers. I describe many important technical factors in Part II of this book.

Intellectual factors

The optimal performance analyst exhibits several intellectual factors as well. Foremost, I believe, is a strong sense of relevance: the ability to understand what's important and what's not. Sense of relevance is a broad category. It combines the attributes of perceptiveness, common sense, and good judgment. General problem-solving skills are indispensable, as is the ability to acquire and assimilate new information quickly.

Interpersonal factors

The optimal performance analyst exhibits several interpersonal factors. Empathy is key to acquiring accurate information from users, business owners, and component administration staff. Poise is critical for maintaining order during a performance crisis, especially during the regularly scheduled panic phase of a project. Self-confidence is necessary to inspire adequate morale among the various project victims and perpetrators to ensure that the project is allowed to complete. The optimal analyst is tactful and successful in creating collaborative effort to implement a solution plan.

Motivational factors

Finally, the optimal performance analyst exhibits several important motivational factors. She is customer oriented and interested in the business. She enjoys a difficult challenge, and she is resourceful. I have found the best performance analysts to be always mindful that technical, intellectual, interpersonal, and motivational challenges are all surmountable, but that different problem types often require drastically different solution approaches. The best performance analysts seem not only to understand this, but to actually thrive on the variety.

1.5.1.3 Your role

As a result of buying this book, I want you to become so confident in your performance problem diagnosis skills that a scenario like the following doesn't scare you one bit:

Scene: Big meeting. Participants include several infrastructure department managers, you, and a special guest: the CEO, whose concerns about online order form performance are critical enough that he has descended upon your meeting to find out what you're going to do about it....

Senior manager of the system administration department ("System manager"): In two weeks, we're going to upgrade our CPU capacity, at a cost to the business of US$65,000 in hardware and upgraded software license fees. However, we expect that because we're doubling our CPU speeds, this upgrade will improve performance significantly for our users.

CEO: (Nods.) We must improve the performance of our online order form, or we'll lose one of our biggest retail customers.

You: But our online order form consumes CPU service for only about 1.2 seconds of the order form's 45-second commit time. Even if we could totally eliminate the response time consumed by CPU service, we would make only about a one-second improvement in the form's response time.

System manager: I disagree. I think there are so many unexplained discrepancies in the response time data you're looking at that there's no way you can prove what you're saying.

You: Let's cover this offline. I'll show you how I know.

(Later, after reconvening the meeting.)

System manager: Okay, I get it. He's right. Upgrading our CPU capacity won't help order form performance in the way that we'd hoped.

You: But by modifying our workload in a way that I can describe, we can achieve at least a 95% improvement in the form's commit response time, without having to spend the money on upgrading our CPUs. As you can see in this profile of the order form's response time, upgrading CPU capacity wouldn't have helped us here anyway.
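
The arithmetic behind the You character's pushback on the CPU upgrade is worth making explicit. It is nothing more than the bound that you can never save more time than a component actually consumes. Using the figures from the scene:

    best possible savings from eliminating CPU service entirely = 1.2 s
    best possible commit time after a CPU-only remedy           = 45 s - 1.2 s = 43.8 s
    best possible improvement                                    = 1.2 / 45, or about 2.7%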

I've witnessed the results of a lot of conversations that began this way but never veered back on-course when it was the You character's first turn to speak. The result is often horrifying. A company works its way through the alphabet in search of something that might help performance. Sometimes it stops only when the company runs out of time or money, or both.

Perhaps even more painful to watch is the conversation in which the You character does speak up on cue but then is essentially shouted down by a group of people who don't believe the data. Unless you can defend your diagnostic data set all the way to its origin, and explain how it fits in with the data your debaters are collecting, you stand a frighteningly large chance of losing important debates, even when you're right.

1.5.2 Overcoming Common Objections

I hope that I've written this book effectively enough that you will want to try Method R on your own system. If you can work alone, then most of the obstacles along your way will be purely technical, and you'll probably do a great job of figuring those out. I've tried hard to help you overcome those with the information in this book.

However, it's more likely that improving the performance of your system will be a collaborative effort. You'll probably have to engage your colleagues in order to implement your recommendations. The activities you recommend as a result of using Method R will fall into one of two categories:

  • Your colleagues have heard the ideas before and rejected them

  • They've never heard the ideas before

Otherwise, your system would have been fixed by now. Either way, you will probably find yourself in an environment that is ready to challenge your ideas. To make any progress, you will have to justify your recommendations in language that makes sense to the people who doubt you.

Justifying your recommendations this way is healthy for you to do anyway, even in the friendliest of environments where your words become other people's deeds almost instantaneously.

The most effective ways I've found to justify such recommendations are:

Proof-of-concept tests

There's no better way to prove a result than to actually demonstrate it. Dave Ensor describes this as the Jeweler's Method. Any good jeweler will place interesting merchandise into a prospective customer's hands as early in the sales process as possible. Holding the piece activates all the buyer's senses in appreciating the beauty and goodness of the thing being sold. The buyer's full imagination goes to work for the seller as the buyer locks in on the vision of how much better life will become if only the thing being held can be obtained. The method works wonderfully for big-ticket items, including jewelry, cars, houses, boats, and system performance. There's probably no surer way to build enthusiasm for your proposal than to let your users actually feel how much better their lives will become as a result of your work.

Direct statistics that make sense to end users

If proof-of-concept tests are too complicated to provide, the next best thing is to speak in direct statistics that make sense to end users. There are only three acceptable units of measure for such statistics:

  • Your local currency

  • The duration by which you'll improve someone's response time

  • The number of business actions per unit of time by which you'll improve someone's throughput

Any other measure will cause one of two problems. Either your argument will be too weak to convince the people you're trying to persuade, or, worse yet, you'll succeed in your persuasions, but because you were thinking in the wrong units of measure you'll risk producing end results with inadequate "real" benefit. Real benefit is always measured in units of either money or time. Succeeding in your proposal but failing in your end result of course causes an erosion of your credibility for future recommendations.
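
To make this concrete, here is a hypothetical illustration (the figures are invented purely for illustration) of a statement expressed in acceptable units:

    "This change should cut the order form's commit time from about 45 seconds to about
     2 seconds. At roughly 500 orders per day, that returns about 43 s x 500 = 21,500 s,
     or roughly 6 hours of order-entry time per business day."

By contrast, a statement such as "this will raise a hit ratio from 92% to 99%" expresses no benefit in units of money or time, and so gives the business no basis for valuing the work.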

Track record of actualized predictions

If you have the luxury of a strong reputation to enhance your persuasive power, then merely making your wishes known may be enough to inspire action. However, if this is the case, beware. Every prediction you make runs the risk of eroding your credibility. Even if you have the power to convert your instructions into other people's tasks, I strongly encourage you to assess your recommendations privately using proof-of-concept tests or direct statistics that make sense to end users. Don't borrow from the account of your own credibility until you're certain of your recommendations.

1.5.2.1 "But my whole system is slow"

At hotsos.com, we use Method R for our living. After using the method many times, I can state categorically that the most difficult step of Method R is one that's not even listed: it is the step of convincing people to use it. The first objection my colleagues and I encounter to our focus on user actions is as predictable as the sunrise:

"But my whole system is slow."

"I need to tune my whole system, not just one user."

"When are you going to come out with a method that helps me tune my whole system?"

We hear it everywhere we go.

What if the whole system is slow? Practitioners often react nervously to a performance improvement method that restricts analysis to just one user action at a time. Especially if users perceive that the "whole system" is slow, there is often an overwhelming compulsion to begin an analysis with the collection of system-wide statistics. The fear is that if you restrict the scope of analysis to anything less than the entire system, you might miss something important. Well, in fact, a focus on prioritized user actions does cause you to miss some things:

A focus on high-priority user actions causes you to overlook irrelevant performance data. By "irrelevant," I mean any data that would abate your progress in identifying and repairing your system's most important performance problem.

Here's why Method R works regardless of whether a system's problem is an individual user action or a whole mess of different user actions. Figure 1-3 shows the first information that analysts get when they learn of system performance problems. Legitimate information about performance problems usually comes first from the business in the form of user complaints.

It is possible for information providers to be the first to know about performance problems. In Chapter 9 I describe one way in which you can acquire such a priori knowledge. But it is rare for information providers to know about performance problems before their information consumers tell them.

Figure 1-3. What performance analysts first see when there's a performance problem. Shaded circles represent user actions that are experiencing performance problems

Upon receipt of such information, the first impulse of most analysts is to establish a cause-effect relationship between the symptoms being observed and one or more root causes that might be motivating the symptoms. I wholeheartedly agree that this step is the right step. However, many projects fail because analysts fail to establish the correct cause-effect relationships. A core strength of Method R is that it allows you to determine cause-effect relationships more quickly and accurately than with any other method.

Figure 1-4 shows why. It depicts three possible sets of cause-effect relationships between problem root causes and performance problem symptoms. Understanding the effectiveness of Method R for each of these scenarios compared to conventional tuning methods will help you decide for yourself whether Method R is an effective system-wide optimization method. The three possible scenarios depicted in Figure 1-4 are:

  • At one extreme, case (a) depicts that every user-discernible symptom on the system is caused by a single "universal" root cause.

  • In case (b), there is a many-to-many relationship between symptoms and root causes. Some symptoms have two or more contributory root causes, and some root causes contribute to more than one symptom.

  • At the other extreme, case (c) depicts a situation in which every symptom is linked to its own distinct root cause. No single root cause creates negative performance impact for more than one user action.

Figure 1-4. Three possible sets of cause-effect relationships (depicted by arrows) between root causes and performance problem symptoms

Of course it is easy to draw pictures of cause-effect relationships between root causes and performance problem symptoms. It's another matter entirely to determine such cause-effect relationships in reality. The ability to do this is, I believe, the most distinguishing strength of Method R. Let me explain.

For problems resembling Figure 1-4(a), Method R works quite well. Even if you were to completely botch the business prioritization task inherent in the method's step 1, you'd still stumble upon the root cause in the first diagnostic data you examined. The reason is simple. If all symptoms have the same root cause, then no matter which symptom you investigate, you'll find the single, universal root cause in that symptom's response time profile.

Method R also works well for problems resembling Figure 1-4(b) and (c). In these cases, the only way to provide system-wide relief is to respond to each of the root causes that contributes to a symptom. Constraints on analyst labor (your time) probably make it impossible to respond to all the symptoms simultaneously, so it will probably be important to prioritize which activities you'll conduct first. This requirement is precisely the motive for the work prioritization inherent in Method R. Remembering that the true goal of any performance improvement project is economic, the proper way to prioritize the project activities is to respond to the most important symptoms first. Method R is distinctive in that it encourages alignment of project priorities with business priorities.

By contrast, let's examine the effectiveness of Method C for each of the same three scenarios. Remember, the first step of Method C is:

Hypothesize that some performance metric x has an unacceptable value.

In the context of Figure 1-4, this step is analogous to searching for the shaded circles in the portion of the diagram labeled root causes . After identifying probable root causes of performance problems, Method C next requires the analyst to establish a cause-effect relationship between root causes and symptoms. One problem with Method C is that it forces you to compute this cause-effect relationship rather more by accident than by plan. The conventional method for determining this cause-effect relationship is literally to "fix" something and then see what impact you created. It's a trial-and-error approach.

The challenge to succeeding with Method C is how quickly you can identify the right "unacceptable" system metric value. The longer it takes you to find it, the longer your project will drag on. Certainly, your chances of finding the right problem to solve are greatest when there's only one problem in the whole system. However, it's not certain that finding the root cause will be easy, even in an "easy" case like Figure 1-4(a). Just because there's only one root cause for a bunch of problems doesn't mean that there will be only one system-wide performance statistic that looks "unacceptable."

The real problem with Method C becomes apparent when you consider its effectiveness in response to the cases shown in Figure 1-4(b) and (c). In both of these cases, when we look "from the bottom up," there are several potential root causes to choose from. How will you determine which root cause to work on first? The best prioritization scheme would be to "follow the arrows" backward from the most important business symptoms to their root causes. The root causes you'd like to address first are the ones causing the most important symptoms.

However, Method C creates a big problem for you at this point:

System-wide performance metrics provide insufficient information to enable you to draw the cause-effect arrows.

You cannot reliably compute the cause-effect relationships shown in Figure 1-4 unless you measure response time consumption for each user action, "from the top down" in the context of the drawing. Understanding what information is required to draw the cause-effect arrows reveals both the crippling flaw of Method C and the distinctive strength of Method R. It is impossible to draw the cause-effect arrows reliably from root causes to symptoms (from the bottom to the top). However, it is very easy to draw the arrows from symptoms to root causes (from the top down), because the resource profile format for targeted user actions tells you exactly where the arrows belong.

Without the cause-effect arrows, a project is rudderless. Any legitimate prioritization of performance improvement activities must be driven top-down by the economic priorities of the business. Without the arrows, you can't prioritize your responses to the internal performance metrics you might find in your Statspack reports. Without the arrows, about the only place you can turn is to "cost accounting" metrics like hit ratios, but unfortunately, these metrics don't reliably correspond to the economic motives of the business. The Oracle Payroll situation that I described earlier in this chapter was rudderless for three months. The project concluded on the day that the team acquired the data shown in Example 1-3.
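
The resource profile mentioned above is simply a ranked decomposition of one user action's response time. This book builds such profiles from the response time data emitted by the Oracle kernel; as a rough, hedged approximation of the idea, the sketch below ranks a single session's response time components using Oracle's fixed views. The figures it returns are cumulative for the whole session rather than scoped to one user action, and :sid is a bind variable you supply, so treat it only as an illustration of the "top down" view:

    -- Crude per-session "resource profile": wait time by event plus CPU service,
    -- sorted so the largest response time contributor appears first.
    -- v$session_event.time_waited and 'CPU used by this session' are both
    -- reported in centiseconds.
    select event, time_waited as centiseconds
      from v$session_event
     where sid = :sid
    union all
    select 'CPU service' as event, s.value as centiseconds
      from v$sesstat s, v$statname n
     where s.statistic# = n.statistic#
       and n.name = 'CPU used by this session'
       and s.sid = :sid
     order by 2 desc;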

Ironically, then, the popular objection to Method R actually showcases the method's greatest advantage. We in fact designed Method R specifically to respond efficiently to systems afflicted with several performance root causes at once.

The reason Method R works so well in system-wide performance crises is that your "whole system" is not a single entity; it's a collection of user actions, some more important than others. Your slow user actions may not all be slow for the same reason. If they're not, then how will you decide which root cause to attack first? The smart way is by prioritizing your user actions in descending order of value to your business. What if all your slow user actions actually are caused by the same root cause? Then it's your lucky day, because the first diagnostic data you collect for a single process is going to show you the root cause of your single system-wide performance problem. When you fix it for one session, you'll have fixed it for every session. Table 1-1 summarizes the merits of conventional methods versus the new method.

Table 1-1. The merits of Method C and Method R. Method R yields its greatest comparative advantage when "the whole system is slow"

Figure 1-4 case (a)

  Method C effectiveness: Effective in some cases. Existence of only one problem root cause increases the likelihood that this root cause will be prominent in the analysis of system-wide statistics.

  Method R effectiveness: Effective. Even if business prioritization is performed incorrectly, the method will successfully identify the sole root cause on the first attempt.

Figure 1-4 case (b)

  Method C effectiveness: Unacceptable. Inability to link cause with effect means that problems are attacked "from the bottom up" in an order that may not suit business priorities.

  Method R effectiveness: Effective. Business prioritization of user actions ensures that the most important root cause will be found and addressed first.

Figure 1-4 case (c)

  Method C effectiveness: Unacceptable. Same reasons as for (b).

  Method R effectiveness: Effective. Same reasons as above.

1.5.2.2 "The method only works if the problem is the database"

Another common objection to Method R is the perception that it is incapable of finding and responding to performance problems whose root causes originate outside the database tier. In a world whose new applications are almost all complicated multi-tier affairs, this perception causes a feeling that Method R is severely limited in its effective scope.

Method R itself is actually not restricted at all in this manner. Notice that nowhere in the four-step method is there any directive to collect response time data just for the database. The perception of database focus arises in the implementation of step 2, which is the step in which you will collect detailed response time diagnostic data. This book, as you shall see, provides coverage only of the response time metrics produced specifically by the Oracle kernel. There are several reasons for my writing the book this way:

  • When performance problems occur, people tend to point the finger of blame first at the least well-understood component of a system. Thus, the Oracle database is often the first component blamed for performance problems. The Oracle kernel indeed emits sufficient diagnostic data to enable you to prove conclusively whether or not a performance problem's root cause lies within the database kernel.

  • At the time of this writing, the Oracle kernel is in fact the most robustly instrumented layer in the technology stack; however, many analysts fail to exploit the diagnostic power inherent in the data this instrumentation emits. Oracle's diagnostic instrumentation model is very robust in spite of its simplicity and efficiency (Chapter 7). Vendors of other layers in the application technology stack have already begun to catch on to this notion. I believe that the response time diagnostic instrumentation built into the Oracle kernel will become the standard model for instrumenting other application tiers.

Even without further instrumentation of non-database tiers, if your performance problem is in the database, Method R helps you solve it quickly and efficiently. If your problem is not caused by something going on in your database, then Method R helps you prove that fact quickly and efficiently. Regardless of where in your architecture your root cause resides, Method R prevents you from trying to fix the wrong problem.
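
To make the database-tier data collection concrete, here is a minimal sketch of one widely used way to obtain response time diagnostic data from the Oracle kernel: extended SQL trace. Exact syntax, trace levels, and required privileges vary by Oracle release, so treat this purely as an illustration of what collecting properly scoped diagnostic data can look like for the database tier:

    -- Enable extended SQL trace (event 10046) for the current session before
    -- running the targeted user action. Level 8 records wait events; level 12
    -- records wait events and bind values. The trace file is written to the
    -- server's user dump destination.
    alter session set timed_statistics = true;
    alter session set max_dump_file_size = unlimited;
    alter session set events '10046 trace name context forever, level 12';

    -- ...execute the targeted user action here...

    alter session set events '10046 trace name context off';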

The proof is in the experience. Method R routinely leads us to the doorstep of problems whose repair must be enacted either inside or outside of the database, including such cases as:

  • Query mistakes caused by inefficiently written application SQL statements, poor data designs, ill-advised indexing strategies, data density mistakes, etc.

  • Application software mistakes caused by excessive parsing, poorly designed serialization (locking) mechanisms, misuse (or disuse) of array processing features, etc.

  • Operational mistakes caused by errors in collection of statistics used by the cost-based optimizer, accidental schema changes (e.g., dropped indexes), inattention to full file systems, etc.

  • Network mistakes caused by software configuration mistakes, hardware faults, topology design errors, etc.

  • Disk I/O mistakes caused by poorly sized caches, imbalances in I/O load to different devices, etc.

  • Capacity planning mistakes resulting in capacity shortages of resources like CPU, memory, disk, network, etc.

1.5.2.3 "The method is unconventional"

Even if Method R proves to be the best thing since the invention of rows and columns, I expect some pockets of resistance to exist for at least a couple of years after the publication of this book. The method is new and different, and it's not what people are accustomed to seeing. As more practitioners, books, and tools adopt the techniques described in this book, I expect that resistance will fade. In the meantime, some of your colleagues are going to require careful explanations about why you're recommending a completely unconventional performance optimization method that doesn't rely on Statspack or any of the several popular performance monitoring tools for which your company may have paid dearly. They may cite your use of an unconventional method as one of the reasons to reject your proposals.

One of my goals for this book is certainly to arm you with enough knowledge about Oracle technology that you can exploit your data to its fullest diagnostic capacity. I hope by the end of this book I'll have given you enough ammunition that you can defend your recommendations to the limit of their validity. I hope this is enough to level the playing field for you so that any debates about your proposed performance improvement activities can be judged on their economic merits, and not on the name of the method you used to derive them.

1.5.3 Evaluation of Effectiveness

Earlier in this chapter, I listed eight criteria against which I believe you should judge a performance improvement method. I'll finish the chapter by describing how Method R has measured up against these criteria in contrast to conventional methods:

Impact

Method R causes you to produce the highest possible impact because you are always focused on the goal that has meaning to the business: the response time of targeted user actions.

Efficiency

Method R provides excellent project efficiency because it keeps you focused on the top priorities for the business, and because it allows you to make fully informed decisions during every step of the project. Project efficiency is in fact the method's key design constraint.

Measurability

Method R uses end-user response time as its measurement criterion, not internal technical metrics that may or may not translate directly to end-user benefit.

Predictive capacity

Method R gives you the unprecedented ability to predict the impact of a proposed tuning activity upon a targeted user action, without having to invest in expensive experimentation.

Reliability

Method R performs reliably in virtually every performance problem situation imaginable; a distinction of the method is its ability to pinpoint the root cause of any type of performance problem without having to resort to experience, intuition, or luck.

Determinism

Method R eliminates diagnostic guesswork first by maintaining your focus on business priority, and second by providing a foolproof method for determining the true relationships between problem symptoms and their root causes.

Finiteness

Method R has a clearly stated termination condition. The method provides the distinctive capacity to prove when no further optimization effort is economically justifiable.

Practicality

Method R is a teachable method that has been used successfully by hundreds of analysts of widely varying experience levels to resolve Oracle performance problems quickly and permanently.

The next chapters show you how to use Method R.


   