10.1 How to Work a Resource Profile

Although most people seem to innately understand the resource profile format, some formal guidelines usually help people make the most efficient use of the information. After analyzing several hundred resource profiles since the year 2000, my colleagues and I have refined our approach into the following guiding principles:

  • Work the resource profile in descending order of response time contribution.

  • Eliminate unnecessary calls before attempting to reduce per-call latency.

  • If a response time component is still prominent after you have eliminated unnecessary calls to the resource, then eliminate unnecessary competition for the resource.

  • Only after eliminating unnecessary calls to a resource and eliminating unnecessary competition for the resource should you consider increasing the capacity of the resource.

These guidelines have consistently helped us to produce effective optimizations quickly. The following sections describe the guidelines in detail.

Dave Ensor's "Three Approaches" Model

In public appearances, my friend and fellow O'Reilly author Dave Ensor has noted that there are three approaches to repairing a response time problem at a specified resource:

  • The commercial approach is to add more capacity. This approach indeed optimizes net profit, return on investment, and cash flow. Unfortunately, it usually optimizes these three measurements only for your vendors...not for you.

  • The geek approach is to fiddle with configurations and settings and physical layouts and anything else that might be "tuned" to provide some reduction in request service times. When per-call latencies are really bad, the technician's compulsion to "tune" those numbers can be irresistible. This kind of tuning is Most Excellent Fun for the geek in all of us. The best thing is that it keeps us looking so busy that we can avoid doing a lot of other things that might not be fun at all. But unfortunately, the effort often results in lots of time invested in return for no noticeable impact. The ultimate test of relevance is Amdahl's Law.

  • The smart approach is to reduce the calls to the resource. The question becomes how to do this.

If you've met Dave or watched him speak, you will understand why I actually quite admire his restraint in how he has named the three approaches.

10.1.1 Work in Descending Response Time Order

It is easy to work with a resource profile that is sorted in descending order of response time contribution. You simply use the data in top-down order. The top-line response time consumer is the resource that provides the greatest performance improvement leverage. Remember Amdahl's Law: the lower an item appears in the profile, the less opportunity that item provides for overall response time improvement.

Example 10-2 shows a resource profile that was created for a targeted user action on a system with "an obvious disk I/O problem." The I/O subsystem was occasionally providing single-block I/O latencies in excess of 0.600 seconds. Most technicians would deem single-block I/O latencies greater than 0.010 seconds to be unacceptable. The ones produced by this system were fully sixty times worse than this threshold.

Example 10-2. A resource profile created for a targeted user action on a system with a known disk I/O performance problem
 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 CPU service                        37.7s  68.9%            214    0.175981s
 unaccounted-for                     8.4s  15.4%
 db file sequential read             5.5s  10.1%            568    0.009630s
 db file scattered read              2.1s   3.8%             89    0.024157s
 latch free                          0.9s   1.6%             81    0.011605s
 log file sync                       0.1s   0.2%              3    0.026667s
 SQL*Net more data to client         0.0s   0.0%              4    0.002500s
 file open                           0.0s   0.0%             12    0.001667s
 SQL*Net message to client           0.0s   0.0%             58    0.000000s
 ----------------------------- ----------------- -------------- ------------
 Total                              54.7s 100.0%

However, the resource profile for this targeted user action indicates strongly that addressing the disk I/O problem is not the first thing you should do to improve response time for the action. The only response time components that better I/O performance will impact are the db file sequential read and db file scattered read line items. Even if you could totally eliminate both these lines from the resource profile, response time would improve by only about 14%.
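
To make the arithmetic concrete, here is a minimal Python sketch of the Amdahl's Law reasoning, using only the durations shown in Example 10-2; "best case" assumes a component could be eliminated entirely:

    # Best-case response time improvement per component: eliminating a
    # component entirely can improve response time by at most its share
    # of the total. Durations (in seconds) are copied from Example 10-2.
    profile = {
        "CPU service":             37.7,
        "unaccounted-for":          8.4,
        "db file sequential read":  5.5,
        "db file scattered read":   2.1,
        "latch free":               0.9,
    }
    total = 54.7  # total response time from Example 10-2

    for component, seconds in sorted(profile.items(), key=lambda kv: -kv[1]):
        print(f"{component:<26}  best case {seconds / total:5.1%}")

    # Eliminating *both* disk read components helps by only about 14%:
    io = profile["db file sequential read"] + profile["db file scattered read"]
    print(f"all disk reads              best case {io / total:5.1%}")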

Ironically, the I/O subsystem "problem" did impact the performance of the program profiled in Example 10-2. Evidence in the raw trace data revealed that some single-block I/O calls issued by this user action consumed as much as 0.620 seconds apiece. But even this knowledge is irrelevant. For this user action, the upside of fixing any I/O subsystem problem is so severely limited that your analysis and repair time will be better spent somewhere else.

10.1.1.1 Why targeting is vital

Now is a good time to test your commitment to Method R. You might ask, "But fixing such a horrible I/O problem would surely provide some benefit to system performance...." The answer is that yes, fixing the I/O problem will in fact provide some benefit to system performance. But it is vital for you to understand that fixing the I/O problem will not materially benefit this user action (the one corresponding to the resource profile in Example 10-2, the one that the business was desperate to improve). Understanding this is vital for two reasons:

  • Any time or materials that you might invest into fixing the "I/O problem" will be resources that you cannot invest into making material performance improvements to the user action profiled in Example 10-2. If you have correctly targeted the user action, then working on an I/O problem will be at best an unproductive distraction.

  • Fixing the I/O problem can actually degrade the performance of the user action profiled in Example 10-2. This is not just a theoretical possibility; we see this type of phenomenon in the field (see Chapter 12). Here's one way it can happen: imagine that at the same time as the user action profiled in Example 10-2 runs, several other programs are running on the system as well. Further imagine that these programs consume a lot of CPU capacity, but they presently spend a lot of time queued for service from the slow disk. Removing the disk queueing delay for those processes will actually intensify competition for the CPU capacity that presently dominates our targeted user action's response time. In this case, fixing the I/O problem will actually degrade the performance of the targeted user action.

    Yes, fixing the I/O problem will provide some performance benefit to those other programs. But if you have properly targeted the user action of Example 10-2 for performance improvement, then fixing the I/O problem degrades the performance of an important action in exchange for improving the performance of user actions that are less important. This result is contrary to the priorities of the business.

For both reasons, if you have properly targeted the user action depicted in Example 10-2, then working on the "I/O problem" is a mistake. If the user action is not a proper target for performance improvement, then you have not correctly done the job that I described in Chapter 2.

10.1.1.2 Possible benefits of low-return improvements

Having said this, it is possible that the most economically advantageous first response to a resource profile is to address an issue that is not the top-line issue. For example, imagine that the I/O subsystem problem of Example 10-2 could be "repaired" simply by deactivating a particular long-running, disk-I/O-intensive report that runs every day during the targeted user action's execution. Imagine that the fix is to simply eliminate the report from the system's workload, because you discover that absolutely nobody in the business ever reads it. Then the problem's repair (simply turning off the report) is so inexpensive that you'd be crazy not to implement it.

Mathematically, the return on investment (ROI) of some repair activities can be extremely high because even though the return R is small, the investment I is so small that R/I is large. It happens sometimes. However, realize that high ROI is not your only targeted goal. My finance professor, Michel Vetsuypens, once illustrated this concept by tossing a five-cent coin to a student in our classroom. The student who kept the nickel enjoyed a nearly infinite ROI for the experience: it cost virtually nothing to catch the coin, and the return was five cents. Although it was probably the highest-ROI event in the student's whole life, of course adding five cents to his net worth produced an overall impact that was completely inconsequential. This story illustrates why it is so important that your performance improvement goal include the notions of net profit and cash flow in addition to ROI, as I describe in Chapter 2.

This story also illustrates the fundamental flaw of ratios: they conceal information about magnitude.
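
A tiny numerical sketch of the same point, using entirely made-up figures: the ratio R/I can be enormous while the absolute return stays inconsequential, which is precisely the information the ratio conceals:

    # Hypothetical figures only: two candidate "repairs" compared by ROI
    # (return divided by investment) and by absolute return.
    candidates = {
        "catching the nickel":  {"investment": 0.0001, "return": 0.05},
        "workload elimination": {"investment": 5000.0, "return": 250000.0},
    }
    for name, c in candidates.items():
        roi = c["return"] / c["investment"]
        print(f"{name:<22} ROI {roi:>8,.0f}x   absolute return ${c['return']:>12,.2f}")
    # The nickel "wins" on ROI (500x versus 50x), yet its contribution to
    # net worth is negligible; the ratio hides the magnitude.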

10.1.2 Eliminate Unnecessary Calls

A well-worn joke in our Hotsos Clinic classrooms is this one:

Question: What's the fastest way to do x? (In our classes, x can be virtually anything, from executing database calls, to flying from one city to another, to going to the bathroom.)

Answer: Don't.

The fastest way to do anything is to avoid doing it at all. (This axiom should hold until someone invents a convenient means of human-scale time travel. Until we can figure out how to make task durations negative, the best we're going to be able to do is make them zero.)

The most economically efficient way to improve a system's performance is usually to eliminate workload waste. Waste is any workload that can be eliminated from a system with no loss of functional value to its owner. Analysts who are new to Method R are often shocked to find the following maxim alive and well within their systems:

Many systems' workloads consist of more than 50% waste.

It's been true for almost every system I've measured since 1989, and chances are that it's true for your system as well. It's true for good reason: throughout the 1980s and 1990s, when many Oracle performance analysts were trained, we were actually taught principles that encouraged waste. For example, the once-popular belief that higher database buffer cache hit ratios are better encourages many application inefficiencies. Several sources illustrate this fallacy, including [Millsap (2001b; 2001c); Lewis (2001a); Vaidyanatha et al. (2001); McDonald (2000)].

10.1.2.1 Why workload elimination works so well

Eliminating unnecessary work has an obvious first-order impact upon the performance of the job formerly doing the work. However, many people fail to understand the fabulous collateral benefit of workload elimination. Every time you eliminate unnecessary requests for a resource, it reduces the probability that other users of that resource will have to queue for it. It's easy to appreciate the second-order benefits of workload reduction. Imagine, for example, a program that consumes CPU capacity pretty much non-stop for about 14 hours (the resource profile in Example 1-4 shows such a program). Further imagine that the program's performance could be repaired so that it would consume only ten minutes of CPU capacity. (Such repairs usually involve manipulation of a critical SQL statement's query execution plan.)

It's easy to see why the user of the report that now takes ten minutes instead of fourteen hours will be delighted. However, imagine also the benefits that the other users on the system will enjoy. Before the repair, users who were competing for CPU service had to fight for capacity against a process that consumed a whole CPU for more than half of a day. In the post-repair scenario, the report competes for CPU only for ten minutes. The probability of queueing behind the report for CPU service drops to a mere sliver of its original value. For the 14-hour period, the benefit to the system will approximate the effect of installing another CPU into the system.

The benefit of reducing workload in this case will actually be greater than the benefit of adding a new CPU, because adding a new chip would have incrementally increased the operating system overhead required to schedule the additional capacity. Plus, reducing the workload costs far less than actually installing another CPU would have.

The collateral benefits of workload reduction can be stunning. Chapter 9 explains the mathematics of why.
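
Chapter 9 develops the queueing mathematics; the sketch below merely gestures at the effect using a textbook M/M/m (Erlang C) model with assumed utilization figures, not measurements from the system described above. Removing roughly one CPU's worth of demand from a busy four-CPU host slashes the probability that other requests must queue:

    import math

    def erlang_c(m, rho):
        """Probability that an arriving request must queue in an M/M/m
        system with m servers at per-server utilization rho (0 <= rho < 1)."""
        a = m * rho  # offered load in erlangs
        top = a**m / (math.factorial(m) * (1.0 - rho))
        bottom = sum(a**k / math.factorial(k) for k in range(m)) + top
        return top / bottom

    m = 4                             # assumed: a four-CPU server
    rho_before = 0.85                 # assumed per-CPU utilization with the 14-hour job
    rho_after = rho_before - 1.0 / m  # the repaired job frees roughly one CPU's worth

    print(f"P(queue for CPU) before repair: {erlang_c(m, rho_before):.2f}")  # ~0.69
    print(f"P(queue for CPU) after  repair: {erlang_c(m, rho_after):.2f}")   # ~0.29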

10.1.2.2 Supply and demand in the technology stack

So how does one eliminate unnecessary workload? The answer varies by level in the technology stack. I introduced the concept of a system's technology stack in Chapter 1 when I described the sequence diagram notation of depicting response time for a user action. The technology stack consists of layers that interact with each other through a supply-and-demand relationship, as shown in Figure 10-1. The relationship is simple. Demand goes in; supply comes out; and everything takes time (hence, the demand and supply arrows are tilted downward).

Figure 10-1. This sequence diagram illustrates the supply and demand relationships among technology stack layers as time moves forward (downward on the page)

Considering your technology stack in this way will help you to understand a fundamental axiom of performance improvement:

Almost every performance problem is caused by excessive demand for one or more resources.

Virtually any performance problem can be solved by reducing demand for some resource. You will accomplish the task of demand reduction by looking "upward" from a high-demand device in the technology stack. (To look upward in the stack actually means to look leftward in the sequence diagram shown in Figure 10-1.) The question that will guide your performance improvement effort is this:

Is the apparent requirement to use so much of this resource actually a legitimate requirement?

Consider the resource profile shown in Example 10-3. Almost 97% of the targeted user action's 1.3-hour response time was consumed waiting for disk I/O calls. The resource profile suggests two possible solutions:

  • Reduce the number of calls to some number smaller than 12,165.

  • Reduce the duration per call from 0.374109 seconds.

Notice that any improvement to either of these two numbers will translate linearly into the duration for the response time component. For example, if you can cut the number of calls in half, you will cut the duration in half. Similarly, if you can cut the per-call duration in half, you will cut the duration in half. Although reductions in call count and duration per call translate with equal potency to the duration column, it is generally much easier to achieve spectacular call count reductions than to achieve spectacular per-call latency reductions. I specifically chose the left-to-right column order of the resource profiles shown in this book to encourage you to see the better solution first.

Example 10-3. A targeted user action whose response time is dominated by read calls of the disk I/O subsystem
 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 db file scattered read          4,551.0s  96.9%         12,165    0.374109s
 CPU service                        78.5s   1.7%            215    0.365023s
 db file sequential read            64.9s   1.4%            684    0.094883s
 SQL*Net message from client         0.1s   0.0%             68    0.001324s
 log file sync                       0.0s   0.0%              4    0.010000s
 SQL*Net message to client           0.0s   0.0%             68    0.000000s
 latch free                          0.0s   0.0%              1    0.000000s
 ----------------------------- ----------------- -------------- ------------
 Total                           4,694.5s 100.0%
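
To see the linear relationship concretely, here is a minimal Python sketch using only the db file scattered read figures from Example 10-3 (the small difference from the reported 4,551.0s is rounding in the profile):

    # Component duration = (number of calls) x (duration per call).
    calls, dur_per_call = 12165, 0.374109   # from Example 10-3
    print(f"duration          ~ {calls * dur_per_call:,.1f}s")        # ~4,551.0s
    print(f"half the calls    ~ {(calls / 2) * dur_per_call:,.1f}s")
    print(f"half the latency  ~ {calls * (dur_per_call / 2):,.1f}s")
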
10.1.2.3 How to eliminate calls

How do you reduce the number of events executed by a user action? First, figure out what the resource that's being consumed does. What causes the Oracle process profiled in Example 10-3 to execute 12,165 multiblock read calls? Then figure out whether there's any way you can meet your functional requirements with fewer calls to that resource. In Chapter 11, I explain how to do this for a few commonly occurring Oracle events. You proceed by assessing whether you can reduce demand for the busy resource at each level as you move up the technology stack. For example (a rough arithmetic sketch follows this list):

  • Many analysts assume that by increasing the size of the database buffer cache (that is, by allocating more memory to an Oracle system), they can ensure that fewer of their memory lookups will motivate visits to disk devices.

  • However, moving up the stack a little farther often provides better results without motivating a system memory upgrade. By improving the query execution plan that your SQL uses to fetch rows from the database, you can often eliminate even the memory accesses.

  • Moving even farther up the stack reveals potential benefits that cost even less to implement. For example, perhaps running the targeted user action less frequently, or perhaps not at all (maybe running something else instead), would not diminish the business value of the system at all.
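
The following rough sketch compares the payoff of the two lower-stack approaches; every count and rate in it is hypothetical (none comes from Example 10-3). A larger cache only trims the fraction of logical reads that miss to disk, while a better execution plan eliminates the logical reads, and therefore the disk reads they motivate, outright:

    # Hypothetical demand-reduction comparison for one user action.
    lio           = 2_000_000  # assumed logical reads with the current plan
    miss_rate     = 0.05       # assumed: 5% of LIOs miss the cache and go to disk
    miss_rate_big = 0.02       # assumed miss rate after a memory upgrade
    lio_better    = 50_000     # assumed logical reads with an improved plan

    print(f"disk reads today:            {lio * miss_rate:>9,.0f}")
    print(f"after bigger buffer cache:   {lio * miss_rate_big:>9,.0f}")
    print(f"after better execution plan: {lio_better * miss_rate:>9,.0f}")
    # The cache upgrade leaves all the memory accesses in place and only trims
    # the disk reads; the plan improvement removes both.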

10.1.2.4 Thinking in a bigger box

Technicians sometimes confine their work to a zone of comfort within the bottom layers of the technology stack. Such behavior increases the risk of missing significant performance improvement opportunities. For example, at one customer site I visited in the mid-1990s, the accounting department generated a three-foot-deep stack of General Ledger (GL) Aged Trial Balance reports every day. Upon learning why the users were running this report so frequently, the GL implementation leader from Oracle Consulting taught the users how they could acquire the information they needed more efficiently by running a fast online form. As a result, we were able to eliminate billions of computer instructions per day from the system's overall workload, with absolutely no "tuning" investment. Not only was the solution easier on the system, using the online form was actually more convenient for the users than trying to visually pluck details out of an inch-thick report.

My www.hotsos.com company cofounder, Gary Goodman, tells stories of application implementation projects he led while he was at Oracle Corporation. One technique that he practiced during an implementation was to simply turn off every application report on the system. When users would come asking for a report they were missing, his project team would reactivate the requested report. In Gary's experience, not once did he ever reinstate more than 80% of the system's original reporting workload. Can you figure out which 20% of your reports your users never use?

At the business requirement layer in your technology stack, the right question for you to answer is:

Which apparent business requirements are actually legitimate business requirements?

For user actions that provide no legitimate business value, simply turn them off. For user actions that really are necessary, try to eliminate any unnecessary work within them (Chapter 11 describes several ways). Your performance diagnostic data will drive your analysis from the bottom up, but it's usually cheaper to implement solutions from the top down. For example, find out whether a report should be deactivated before you tune it. Don't limit your optimization work to studying only the technical details of how something works. As I described in Chapter 1, the optimal performance analyst must also invest himself into understanding the relationship between technical workload and the business requirements that the workload is ostensibly required to support.

Finally, don't forget that from a business's perspective, the users don't just use a system, they are part of the system. A story told by my colleague Rick Minutella illustrates the point. A company had called him to optimize the performance of a recently upgraded order entry application. Table 10-1 shows the performance difference before and after the upgrade. In a classic Big Meeting, with the company's CFO, users, IS department managers, and hardware vendor all in attendance, the company demanded that Oracle Corporation fix the performance of the order entry form because it was killing their business.

Table 10-1. Order Entry performance before and after an upgrade

 Performance measurement        Value before upgrading   Value after upgrading
 ----------------------------   ----------------------   ---------------------
 Order throughput               10 calls/hr              6 calls/hr
 Entry form response time       5 sec/screen             60 sec/screen

Waiting 60 seconds for a response from an online order entry form is almost certainly much too long. However, the argument that "performance of the order entry form is killing our business" is simply not true. Here's why. If the business processes an average of six calls per hour, then the average duration per phone call is ten minutes. Example 10-4 shows the resource profile for such a call using a 60-second form.

Example 10-4. Resource profile for the order entry process when the form is behaving objectionably
 Before optimizing the online order entry form

 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 other                               540s  90.0%              1         540s
 wait for the order entry form        60s  10.0%              1          60s
 ----------------------------- ----------------- -------------- ------------
 Total                               600s 100.0%

What is the maximum impact to the business that can be obtained by optimizing the form? Example 10-5 shows the answer. If the form's response time can be completely eliminated, total order processing time will drop only to nine minutes.

Example 10-5. Resource profile for the order entry process if the form's response time could be entirely eliminated
 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 other                               540s 100.0%              1         540s
 wait for the order entry form         0s   0.0%              1           0s
 ----------------------------- ----------------- -------------- ------------
 Total                               540s 100.0%

The problem is that this isn't good enough. If an order clerk could process an order in an average of nine minutes, then a single clerk could process an average of only 6.67 orders per hour. This of course is still well short of the ten calls-per-hour requirement. Fixing the performance of this business's online order entry form, by itself, will never improve throughput to the required ten calls per hour. It's not the form that's "killing the business," it's what the order takers are doing for the other nine minutes.
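
Here is a minimal sketch of the throughput arithmetic behind Examples 10-4 and 10-5; the only inputs are figures already given in the text:

    # Per-clerk throughput ceiling implied by the per-order service time.
    other_time = 540.0  # seconds per order spent outside the form
    form_time  = 60.0   # seconds per order waiting on the form

    for label, per_order in [("60-second form ", other_time + form_time),
                             ("form eliminated", other_time)]:
        print(f"{label}: {3600.0 / per_order:.2f} orders/hr per clerk")
    # 6.00 versus 6.67 orders/hr: even a zero-latency form cannot reach the
    # required 10 orders/hr, so the form alone is not the constraint.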

Solving this performance problem will require thinking outside the box of conventional "system tuning." What is it that consumes this "other" time? A few possibilities include:

  • If most of the "other" duration results in customer inconvenience (for example, long waits for product ID lookups), then you should find ways to reduce the "other" duration.

  • If most of the "other" duration is spent improving the company's relationship with the customer, then perhaps it's a better idea to hire more clerks so that an average per-clerk order throughput of about six calls per hour yields sufficient total order throughput for the business.

Another interesting problem to figure out is whether incoming calls cluster in time in such a manner that customers spend a lot of time on hold during busy parts of the day. The queueing theory lessons presented in Chapter 9 can help you understand how to deal with peak incoming call times more effectively, by either using more clerks or reducing per-call durations. There's a lot to think about. The point is not to constrict your view of your "system" to just a few bits of hardware and software. Your business needs you to think of the "order entry system" more broadly, as all the participants in the order entry process that influence net profit, return on investment, and cash flow.
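
A back-of-the-envelope staffing sketch under the same assumptions (nine minutes of work per order once the form is fixed, and a required average throughput of ten orders per hour) shows how adding clerks, rather than tuning the form further, meets the requirement; sizing for peak call arrival so that customers don't wait on hold needs the Erlang-style analysis from Chapter 9:

    import math

    required_rate  = 10.0                     # orders/hr the business must handle
    per_order_secs = 540.0                    # assumed: 9 minutes of work per order
    per_clerk_rate = 3600.0 / per_order_secs  # ~6.67 orders/hr per clerk

    print(f"clerks needed on average: {math.ceil(required_rate / per_clerk_rate)}")  # 2
    # This is average-rate sizing only; peak periods generally need more
    # capacity than the average implies.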

10.1.3 Eliminate Inter-Process Competition

What happens when you have eliminated all the unnecessary calls that you can in the user action under diagnosis, but its response time is still unacceptable? Your next step is to assess whether its individual per-call latencies are acceptable. Understanding whether a given latency is acceptable requires some knowledge of what numbers you should expect. There are surprisingly few numbers that constitute such knowledge. The ones that Jeff and I have found to be the most important are listed in Table 10-2. These constants will evolve as hardware speeds improve, but the numbers are reasonable upper bounds for many systems at the time of this writing in 2003. In particular, LIO numbers vary as CPU speeds vary, and of course CPU speeds are a rapidly moving target these days. The footnote to Table 10-2 explains.

Table 10-2. Useful constants for the performance analyst [Millsap and Holt (2002)]

 Event                            Maximum tolerated latency   Events-per-second rate
                                  per event                   at this latency
 ------------------------------   -------------------------   ----------------------
 Logical read (LIO) [1]           20 µs or 0.000 020 s        50,000
 Single-block disk read (PIO)     10 ms or 0.010 000 s        100
 SQL*Net transmission via WAN     200 ms or 0.200 000 s       5
 SQL*Net transmission via LAN     15 ms or 0.015 000 s        67
 SQL*Net transmission via IPC     1 ms or 0.001 000 s         1,000

[1] Experiments published by Jonathan Lewis indicate strongly that you can expect a CPU to perform roughly 10,000 LIO/sec per 100 MHz of CPU capacity [Lewis (2003)]. Hence, on a 1 GHz CPU, you should expect performance of roughly 100,000 LIO/sec, or 10 µs per LIO. If you are using a 500 MHz system, you should expect approximately the 20-µs figures listed here.
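
A small sketch of the scaling rule in the footnote, treating Lewis's figure of 10,000 LIO/sec per 100 MHz as the only input; the resulting latencies are rough expectations for comparison against a resource profile, not guarantees:

    # Expected LIO rate and latency as a function of CPU clock speed.
    LIO_PER_SEC_PER_100MHZ = 10_000.0

    for mhz in (500, 1000, 2000):
        rate = LIO_PER_SEC_PER_100MHZ * (mhz / 100.0)   # LIO/sec
        latency_us = 1_000_000.0 / rate                 # microseconds per LIO
        print(f"{mhz:>5} MHz: ~{rate:>9,.0f} LIO/sec, ~{latency_us:.0f} microseconds per LIO")
    # 500 MHz -> ~50,000 LIO/sec (20 us); 1 GHz -> ~100,000 LIO/sec (10 us).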

Latencies that violate the expectations listed in Table 10-2 sometimes indicate malfunctioning devices, but more often they indicate resource queueing delays. What causes long queueing delays? The most likely answer by far is (can you guess?) excessive demand for the resource. What could be causing that excessive demand? The answer to this question is, of course, one or more other programs that are competing for resources your targeted user action needs while it is running.

10.1.3.1 How to attack a latency problem

Example 10-6 depicts a situation in which the overall response time of a targeted user action is excessive because of excessive individual I/O call latencies. By the time this resource profile was generated, the analyst had eliminated unnecessary disk read calls, leaving only eighteen necessary calls. However, the average disk read latency of more than two seconds per call (2.023048 s) is far out of bounds compared to the expectation of 0.010 seconds from Table 10-2. From this resource profile alone, it is impossible to determine whether, for example, each of the 18 disk reads consumed about 2.023 seconds apiece, or just one of the disk reads consumed so much time that it dominated the average. (Remember, it is impossible to extrapolate detail from an aggregate...even in resource profiles.)

Example 10-6. A resource profile for a user action whose response time is dominated by unacceptable disk I/O latencies
 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 db file sequential read            36.4s  98.9%             18    2.023048s
 CPU service                         0.4s   1.1%              4    0.091805s
 SQL*Net message from client         0.0s   0.0%              3    0.004295s
 SQL*Net message to client           0.0s   0.0%              3    0.000298s
 ----------------------------- ----------------- -------------- ------------
 Total                              36.8s 100.0%

However, one thing is clear: something is desperately wrong with the latency for at least one disk read call for this user action. The following steps will help you get to the bottom of the problem:

  1. Determine which block or blocks are participating in the high-latency I/O calls. Your extended SQL trace file contains the answers; Chapter 5 and Chapter 6 provide the information you need to find them. (A small trace-parsing sketch follows this list.)

  2. Once you know which blocks are taking so long to read, you can work with your disk subsystem manager to figure out on which devices the blocks reside.

  3. Once you've figured out exactly which devices are at the root of the problem, determine whether programs that compete with your targeted action for the "hot" devices themselves use those devices wastefully. If they do, then eliminating the waste will reduce queueing delays for the hot device.

  4. Assess whether the configuration of the slow device is itself generating wasted workload. For example:

    • I've seen systems with two or more mirrors set up so that reads and writes to separate devices bottleneck on a single controller.

    • RAID level 5 disk systems commonly have inadequate I/O call capacities. Using RAID level 5 is not necessarily a mistake. However, people commonly fail to realize that providing adequate I/O performance with a RAID level 5 configuration typically requires the purchase of two to four times more disk drives than they initially might have believed [Millsap (2000a)].

    • It is sometimes possible to move workload from a hot device to one that's less busy during the problem time interval. System administrators refer to this operation as I/O load balancing . In the early 1990s, I visited a lot of Oracle sites that had I/O latency problems caused by extremely poor file layouts (such as putting all of an Oracle database's files on one disk). I don't think this kind of thing happens very often anymore. However, if you happen to suffer from such a dreadful configuration problem, then of course it's highly likely that you'll be plagued by excessive I/O latencies, regardless of whether your application issues a wastefully large number of disk I/O calls or not.

    • Faulty hardware can of course cause performance problems as well. A bad disk controller that causes unnecessary retry or timeout operations can contribute significantly to response time. For inexplicably slow I/O devices, check your system logs to ensure that your operating system isn't having a hard time getting your hardware to cooperate.

Steps 3 and 4 are the ones in which experience and creativity can produce excellent payoffs.
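
To support steps 1 and 2, here is a hedged Python sketch that scans an extended SQL trace file for the slowest db file sequential read waits and reports the file and block numbers involved. It assumes release-9i-style WAIT lines, in which ela is reported in microseconds and p1/p2 carry the file# and block# for this event; adjust the pattern and the units for other releases:

    import re
    import sys

    # Matches lines such as:
    # WAIT #7: nam='db file sequential read' ela= 2023048 p1=8 p2=24545 p3=1
    WAIT_RE = re.compile(
        r"WAIT #\d+: nam='db file sequential read' "
        r"ela=\s*(\d+)\s+p1=(\d+)\s+p2=(\d+)"
    )

    def slowest_reads(trace_path, top_n=10):
        """Return the top_n slowest single-block reads as (ela_us, file#, block#)."""
        reads = []
        with open(trace_path) as trace:
            for line in trace:
                m = WAIT_RE.search(line)
                if m:
                    ela_us, file_no, block_no = (int(g) for g in m.groups())
                    reads.append((ela_us, file_no, block_no))
        return sorted(reads, reverse=True)[:top_n]

    if __name__ == "__main__":
        for ela_us, file_no, block_no in slowest_reads(sys.argv[1]):
            print(f"{ela_us / 1e6:9.6f}s  file#={file_no}  block#={block_no}")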

10.1.3.2 How to find competing workload

The job of learning which programs are out there competing against your user action resembles conventional performance tuning, at least insofar as the tools you'll use.

There are lots of tools available for analyzing a specific resource in detail. The Oracle fixed views described in Chapter 8 are excellent places to look first.

Though the job of digging through details about some high-latency device may remind you of the old trial-and-error tuning approach (Method C from Chapter 1), there is an important distinction. That distinction is the hallmark of Method R: the ever-present companion of deterministic targeting. You won't be sifting through innumerable performance metrics wondering which ones might have a meaningful performance impact and which ones don't. Instead, you'll know exactly which resource it is that you're trying to improve. You'll know, because your resource profile has told you.

The most difficult part of finding a user action's competitors is collecting properly scoped diagnostic data. This is where it would really pay off to have a detailed X$TRACE-like history of everything that happened on a system during your performance problem time interval. Without such a history of detailed diagnostic data, it can be difficult to find out which programs were competing against your targeted user action, even if the action you're trying to improve finished running just a few minutes ago. There are several ways to make progress anyway, including:

Batch queue manager logs

Practically by definition, the most intensely competitive workload on a system is that motivated by batch programs. Most good batch management software maintains a log of which jobs ran at what times. Beginning with this information, it is often easy to guess which programs motivated significant competition for a given resource. You can graduate from guesses to complete information by collecting properly scoped diagnostic data for these programs the next time they're scheduled to run.

Oracle connect-level auditing

It is easy to configure an Oracle instance to perform lightweight logging of session-level resource consumption statistics. These statistics can help you determine which sessions were responsible for the greatest workloads on the system over a specified duration. Once you have that information, then usually a brief end-user interview is all it takes to construct a good guess about which programs might have motivated the competition for a given device. Again, you can graduate from guesses to complete information by collecting properly scoped diagnostic data for the suspects at some time in the future. To get started, search your Oracle documentation for information about the DBA_AUDIT_SESSION view.
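
As a starting point, here is an illustrative sketch of the kind of query you might run against DBA_AUDIT_SESSION to rank recently ended sessions by disk reads over a problem interval. It assumes session-level auditing (AUDIT SESSION) is enabled; the connection string is a placeholder, and you should verify the column names against your release's documentation:

    import cx_Oracle  # assumes the cx_Oracle driver is installed

    SQL = """
        select username, os_username, sessionid,
               logoff_lread, logoff_pread, logoff_lwrite, logoff_time
          from dba_audit_session
         where logoff_time > sysdate - :days
         order by logoff_pread desc
    """

    connection = cx_Oracle.connect("system/placeholder_password@placeholder_tns")
    cursor = connection.cursor()
    cursor.execute(SQL, days=1)       # sessions that logged off in the last day
    for row in cursor.fetchmany(10):  # the ten heaviest disk readers
        print(row)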

Operating system process accounting

Some operating systems provide the capability to collect and record relevant performance statistics for individual programs. This capability can be important, because not all competition for a specified resource is necessarily motivated by another Oracle process.

Custom timing instrumentation

There's nothing better for the performance analyst than application code that can tell you where it spends all of its time. If you have the ability to instrument the code that is performing poorly (for example, because it is code that you wrote), then instrument it. Chapter 7 explains how, in detail.

When you find the programs that are competing with your targeted user action for a "hot" resource, use the techniques described earlier in Section 10.1.2. Your job becomes the familiar one of determining whether the requirement to overburden the resource is really a legitimate requirement.

10.1.4 Upgrade Capacity

Capacity upgrades are the last place you should look for performance improvement opportunities. The reasons for last-place status are straightforward:

  • It is seldom possible to make as much progress with an expensive capacity upgrade as you can make with an inexpensive round of wasted workload elimination.

  • Capacity upgrades, if executed without sufficient forethought, can actually degrade the performance of the user action you're trying to improve.

Any capacity upgrade is a gamble. The first observation says that an investment into faster hardware has a potentially lower payoff than you'd like. Many managers think of capacity upgrades as guaranteed investment successes, because "How can you ever have too much CPU [or memory, or disk, or whatever]?" The popular belief is that even if the performance problem at hand doesn't benefit directly from the upgrade, how can it hurt? You'll use the spare capacity eventually anyway, right? Well, not exactly. The gamble has a downside that a lot of decision-makers don't realize. I've already described one downside situation in Section 10.1.1.1. The case in Section 12.1 is another example of the same problem:

A capacity upgrade is going to help some part of a system's workload, but the key issue is whether a capacity upgrade will help a system in alignment with the business priorities of its owner.

The first formal explanation that I ever read about such a counterintuitive possibility was in Neil Gunther's Practical Performance Analyst [Gunther (1998) 117-122]. When I presented Gunther's example to Oracle conference audiences worldwide, participants without fail would approach the podium to share the news that by seeing Gunther's example they could finally explain the bizarre result that had plagued some past project. I was pleased but actually a little surprised by how many different people had seen hardware upgrades degrade system performance. After gaining some intimacy with Amdahl's Law, it became clear to me that any capacity upgrade can degrade the performance of some user action by unleashing extra competition for a resource that was not upgraded. The real key is whether or not the harmed user actions are ever noticed.

When capacity upgrades fail to improve performance, the results are some of the worst project disasters imaginable. Here's what happens. A company lives with a performance problem long enough that the collective pain rises above some threshold that triggers the expenditure of cash for an upgrade. Expectations form in direct proportion to the size of the expenditure. So, on the Friday before the Big Upgrade, a whole company is nervously awaiting Monday, when "We're spending so much money to fix this problem that performance is bound to be spectacular." Then when Monday rolls around, not only is performance unspectacular, it's actually worse. By Tuesday, the business is assessing whether the person who suggested the upgrade should bother to come to work on Wednesday.

Capacity upgrades motivate interesting ironies:

  • Decision-makers often regard capacity upgrades as inexpensive alternatives to expensive analysis, yet the upside potential of capacity upgrades is severely limited in comparison to the upside potential of workload reduction.

  • Decision-makers often perceive capacity upgrades as completely safe, yet they bear significant downside risk. Their downside potential actually demands serious, careful, and possibly even expensive analytical forethought.

Even when capacity upgrades work, they usually don't work as well as the people doing the upgrade had hoped. When capacity upgrades don't work, they jeopardize careers. The failures are often so visible and so spectacular that the project sponsors never regain their credibility.

Are hardware upgrades ever necessary? Certainly, there are many cases in which they are. But I implore you not to consider hardware upgrades as a first-line defense against performance problems. Is your system really undersized? Odds are that its workload is just bigger than it needs to be. So, please, eliminate wasteful calls to a resource before you upgrade it. And when you do upgrade a resource, make sure you think it through first:

Don't upgrade capacity until you know that the resource you're upgrading is going to (a) help important user actions, and (b) harm only unimportant ones.

And, of course, don't lose sight of the fact that a user action that's unimportant today might become important tomorrow if you slow it down.


   