11.2 Response Time Components

When I began work on this chapter, I had in mind that I would write details about the twenty or so Oracle wait events that you would encounter most commonly in your work. However, Oracle Corporation has done such a good job improving its wait event documentation in the past few years that I think rehashing that information would be a waste of your time. The Oracle9i Release 2 performance documentation [Oracle (2002)] is actually quite good. Rather than rehash what you can obtain freely from other sources, I focus my effort in this book upon three topics that are missing from the standard Oracle documentation:

  • Treatment of the two pseudoevents ("events" that are not really events) whose durations you can obtain from extended SQL trace data: CPU service and unaccounted-for time.

  • More comprehensive description of a few so-called idle wait events that other authors dismiss as unimportant. These events become very important in light of the improved diagnostic capabilities that you gain if you collect performance diagnostic data with proper time and action scope.

  • More emphasis on how to eliminate wasteful workload in reaction to the appearance of various wait events.

In this section, I cover the pseudoevents and so-called idle wait events. Later in the chapter, I cover the topic of workload elimination.

11.2.1 Oracle Pseudoevents

You may by now have noticed that I use the term "response time component" in places where you might have expected the term Oracle "wait event." The reason I do this is simple. What I'm calling response time components consists of two different things: actual Oracle wait events described in V$EVENT_NAME , and two other important components of response time that do not show up in V$EVENT_NAME :

CPU service
unaccounted-for

Though neither of these components is officially an "Oracle wait event," each is a measurable (and often significant) component of response time for every user action imaginable. I include them on par with the Oracle wait events because every microsecond an Oracle kernel process spends "working" contributes just as much to user action response time as a microsecond spent "waiting." Recent versions of Statspack include CPU time in its list of top five "wait events" as well.

11.2.1.1 CPU service

CPU service will be a response time component of virtually every Oracle resource profile you'll ever see. It is often the dominant contributor to response time, both for efficient user actions and for extremely inefficient ones. The key is to understand whether the apparent requirement to consume as much CPU capacity as you're seeing is actually a legitimate requirement (Chapter 5).

Your first step in diagnosing excessive CPU service durations is to learn which database calls are predominantly responsible. Forward attribution is not necessary for determining the root cause of excessive consumption of CPU capacity, because the c statistics that define CPU service durations sit right on the trace file lines for the database calls themselves. Once you've identified the calls that are most responsible for the CPU service consumption, it's easy to search backward in the trace file for the appropriate PARSING IN CURSOR sections that identify the source text (SQL or PL/SQL) motivating those calls.
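As a sketch of this first step, the following Python fragment tallies the c statistic per cursor from the database call lines of a trace file. This is my own illustration, not part of any Oracle toolset; the embedded trace excerpt is hypothetical and heavily abbreviated (real trace lines carry many more fields):

```python
import re
from collections import defaultdict

# Hypothetical, abbreviated extended SQL trace excerpt.
trace = """\
PARSING IN CURSOR #1 len=32 dep=0
select count(*) from some_table
END OF STMT
PARSE #1:c=2000,e=2500,cr=1,cu=0,dep=0
EXEC #1:c=1000,e=1100,cr=0,cu=0,dep=0
FETCH #1:c=740000,e=751000,cr=8042,cu=0,dep=0
PARSE #2:c=1000,e=1200,cr=0,cu=0,dep=0
FETCH #2:c=5000,e=5600,cr=12,cu=0,dep=0
"""

call_line = re.compile(r"^(PARSE|EXEC|FETCH|UNMAP|SORT UNMAP) #(\d+):(.*)")

cpu_by_cursor = defaultdict(int)   # total c (microseconds) per cursor
for line in trace.splitlines():
    m = call_line.match(line)
    if m:
        stats = dict(kv.split("=") for kv in m.group(3).split(","))
        cpu_by_cursor[m.group(2)] += int(stats["c"])

# Rank cursors by CPU service consumed; cursor #1 dominates here.
ranking = sorted(cpu_by_cursor.items(), key=lambda kv: -kv[1])
print(ranking)  # [('1', 743000), ('2', 6000)]
```

With the worst cursor identified, you would then search backward for its PARSING IN CURSOR section to recover the SQL or PL/SQL text.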

The database calls you should attend to first are the ones with the greatest progeny- exclusive c values. If a database call's high c value is the result of consumptions rolled up from its expensive recursive children, then determine the call's progeny-exclusive CPU capacity consumption by using the technique described in Section 5.3.3. Remember, a database call's CPU and LIO statistics ( c , cr , and cu ) are among those that roll up from child to parent.

Once you've found the database calls that contribute the most to your targeted user action's CPU service duration, the following items describe what action to take, depending upon the type of database call that has consumed the most CPU service:

Large c value for a FETCH , EXEC , UNMAP , or SORT UNMAP database call

If you have many small c values distributed across a large number of FETCH or EXEC calls, then eliminate as many unnecessary database calls as possible, and consolidate the calls that remain into the smallest number of database calls that you can (for example, process arrays of rows instead of processing one row at a time).

If you have a large c value for an individual FETCH or EXEC database call, then first determine whether the large CPU duration was due to logical I/O (LIO) processing:

High LIO count (more than about 10 LIO calls per non-aggregate row returned per table in the FROM clause)

If the LIO count for the call is high, then optimize the SQL as I outline in Section 11.3.1 later in this chapter.
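The "10 LIO calls per non-aggregate row returned per table in the FROM clause" rule of thumb is easy to encode as a quick check. This small Python function is my own sketch; the function name and the guards against zero counts are assumptions, not anything from the Oracle toolset:

```python
def lio_looks_high(lio_count, rows_returned, tables_in_from):
    # Rule of thumb: more than ~10 LIOs per non-aggregate row returned
    # per table in the FROM clause suggests an inefficient plan.
    budget = 10 * max(rows_returned, 1) * max(tables_in_from, 1)
    return lio_count > budget

# A three-table join returning 20 rows has a "budget" of 600 LIOs.
print(lio_looks_high(8042, 20, 3))   # True: 8042 LIOs is far too many
print(lio_looks_high(450, 20, 3))    # False: within the rule of thumb
```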

Low LIO count

If the call's LIO count is low, then the Oracle kernel process may have consumed an excessive amount of CPU capacity performing sort or hash operations. Events 10032 and 10033 provide detailed information about Oracle sorting operations, and event 10104 provides detailed information about hash joins.

Another possibility is that the time consumption was the result of type coercion operations. For example, a table scan that does a date comparison for every row could use a lot of CPU.

Finally, perhaps the time consumption was the result of excessive PL/SQL language processing. For example, PL/SQL instructions of course consume CPU capacity, even when they don't make any database calls. High c durations in EXEC trace file lines for PL/SQL blocks that have low progeny-exclusive LIO counts often indicate that branching, assignments, and other language processing elements are consuming excessive CPU capacity. You can use Oracle's DBMS_PROFILER package to diagnose PL/SQL language processing performance in detail [Kyte (2001)].

Large c value for a PARSE call

If you have many small c values distributed across a large number of PARSE calls, then eliminate as many parse calls as you can, using methods outlined in Section 11.3.2 later in this chapter.

If you have a large c value for a PARSE call, then investigate whether the SQL statement can be simplified. Also consider reducing the value of OPTIMIZER_MAX_PERMUTATIONS (see http://www.ixora.com.au/q+a/0010/19140702.htm for more information).

11.2.1.2 unaccounted-for

As I describe in Chapter 7, there are five sources of unaccounted-for time in a trace file, which create non-zero D values in the equation e = c + Σ ela + D:

  • Measurement intrusion effect

  • Double-counting of CPU service between c and ela statistics

  • Quantization error

  • Time spent not executing

  • Un-instrumented time

Having so many unknowns in a single equation might suggest that it is impossible to isolate the effect of any one contribution to D. However, in reality dealing with unaccounted-for time is simple. Three of the five sources of unaccounted-for time are constrained so that their overall impact upon response time tends to be negligible. If unaccounted-for time is the largest component of a user action's resource profile for properly scoped diagnostic data, then the indication is almost always either un-instrumented time or time spent not executing. If un-instrumented Oracle kernel code is the source of your unaccounted-for time, then you're probably an Oracle patch application away from reducing the number of material unaccounted-for time sources to just one.

Most real-life instances of large unaccounted-for time durations are caused by errors in data collection such as the ones described in Chapter 3 and Chapter 6. If you are very careful to target exactly the data you need when you collect your performance diagnostic data, then you can exploit the full benefit of the knowledge contained within your unaccounted-for component durations.

Chapter 7 explains the details of the five different contributors to unaccounted-for response time components. Here is a recap:

Measurement intrusion effect

The effect of measurement intrusion is small (on the order of a few microseconds per gettimeofday or getrusage call), so overall it is generally safe for you to ignore it. If you are concerned about measurement intrusion, you can measure its exact effect using techniques outlined in Chapter 7. Measurement intrusion effect influences D slightly in the positive direction.
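If you want a feel for the size of measurement intrusion, you can time the timer itself, much as Chapter 7 suggests. This Python sketch measures the approximate cost of one perf_counter call (standing in for the kernel's gettimeofday calls); the loop count is arbitrary, and the result varies by platform and load:

```python
import time

# Estimate the cost of one timer call, the analog of the gettimeofday
# calls that instrument each line of an extended SQL trace file.
N = 100_000
start = time.perf_counter()
for _ in range(N):
    time.perf_counter()          # the call whose cost we're measuring
elapsed = time.perf_counter() - start

per_call_us = elapsed / N * 1e6  # average microseconds per timer call
print(f"~{per_call_us:.2f} microseconds per timer call")
```

On modern hardware the figure is typically well under a microsecond, consistent with the claim that the intrusion is safe to ignore in most analyses.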

Double-counting of CPU service between c and ela statistics

The double-counting effect is small for most events. The largest impact I've seen from this effect occurs with db file scattered read events involving large data transfers (on the order of 100 KB or more). In such cases, I have seen apparent CPU double-counting of roughly one centisecond per read event. CPU double-counting influences D slightly in the negative direction.

Quantization error

The overall effect of quantization error is also small. Because positive and negative quantization errors occur with equal probability, the errors tend to counteract each other, resulting in a nearly net-zero effect upon D .
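A quick simulation illustrates why the errors tend to net out. Here each observation's quantization error is modeled as uniformly distributed over plus or minus half a clock tick, which is a simplifying assumption of mine, not a statement about Oracle's timer implementation:

```python
import random

random.seed(42)

# Each measured duration is rounded to the timer's resolution, so each
# observation carries an error in [-0.5, +0.5] ticks. Over many events
# the positive and negative errors tend to cancel.
n_events = 100_000
errors = [random.uniform(-0.5, 0.5) for _ in range(n_events)]

net = sum(errors)               # net contribution to D, in ticks
print(abs(net) / n_events)      # mean error per event: very near zero
```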

Time spent not executing

If the magnitude of total unaccounted-for duration of a resource profile is significant, then you have almost certainly discovered either time spent not executing or un-instrumented time. A good rule of thumb is given in Table 11-2. Time spent not executing always influences D in the positive direction.

Un-instrumented time

If your application executes a significant amount of un-instrumented Oracle kernel code path, then the effect in the resource profile is indistinguishable from the effect of time spent not executing. Chapter 7 describes how you can detect the existence of un-instrumented code path. If unaccounted-for time dominates total response time even on a system that's not excessively context switching, paging, or swapping, then patch your Oracle kernel so that your user action's code path is correctly instrumented. Un-instrumented Oracle code segments always influence D in the positive direction.

Table 11-2. Rule of thumb for dealing with unaccounted-for duration D, where R is the total response time for the user action under analysis

Condition: D is negative and has magnitude of more than 10% of total response time (that is, D < -0.1R)

Indication: This is an extremely rare (but hypothetically possible) case in which error from CPU double-counting dominates your statistics for an entire trace file. (You can usually detect this phenomenon by analyzing extended SQL trace data for only a single database call.)

Condition: D is between -10% and +10% of total response time (that is, -0.1R ≤ D ≤ +0.1R)

Indication: Ignore the unaccounted-for duration. It is a small enough contributor to total response time that it is not necessary to understand its exact cause.

Condition: D is positive and more than 10% of total response time (that is, D > +0.1R)

Indication: You have almost certainly discovered either time spent not executing or un-instrumented time.
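The rule of thumb in Table 11-2 reduces to a simple three-way test. The following Python function is my own encoding of it; the return strings are merely labels for the three indications:

```python
def classify_unaccounted_for(D, R):
    """Apply the Table 11-2 rule of thumb. D is the unaccounted-for
    duration and R the total response time, in the same units."""
    if D < -0.1 * R:
        return "suspect CPU double-counting (rare)"
    if D <= 0.1 * R:
        return "ignore: too small to matter"
    return "suspect time spent not executing or un-instrumented time"

print(classify_unaccounted_for(5.0, 100.0))   # within +/-10%: ignore
print(classify_unaccounted_for(40.0, 100.0))  # > +10%: investigate
```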

"Idle" Is a Four-Letter Word

Any good book about the Oracle wait interface must discuss the so-called idle wait events . Authors tag "idle" events as special because they represent code path in the Oracle kernel in which a kernel process awaits an instruction to do something. System monitoring tools specifically omit statistics about idle wait events. For example, Oracle's Statspack utility contains a table called STATS$IDLE_EVENT that holds the names of events database administrators commonly omit from the Statspack reports upon V$SYSTEM_EVENT . Those events are:

dispatcher timer
lock manager wait for remote message
pipe get
pmon timer
PX Idle Wait
PX Deq Credit: need buffer
PX Deq Credit: send blkd
rdbms ipc message
smon timer
SQL*Net message from client
virtual circuit status

As you've already seen in Chapter 8 (Section 8.3.7.1), authors drop various wait events into the "idle" bucket to solve a problem that is caused by improperly scoped diagnostic data. Improperly scoped diagnostic data forces the analyst to ignore "idle" events, even when those events contain vital information. However, if you collect properly scoped diagnostic data, then whether an event is "idle" becomes an unproductive distraction. Rather than classify events as "idle" or "non-idle," in Chapter 5, I introduce the classification of events that execute within the context of a database call versus events that execute between database calls.

While it is tempting to equate what other authors refer to as idle events with what I refer to as between-call events, even this is not a reliable mapping. Between-call events and the idle wait events are not synonyms. For example, the PX Deq Credit: need buffer event is a within-call event. Other events that are often considered idle, like SQL*Net more data from client and SQL*Net more data from dblink, also occur within the context of a database call, not between calls.

11.2.2 No Event Is Inherently "Unimportant"

In most publications that are in print while I'm writing this chapter, authors still distinguish carefully between events that are idle and events that are non-idle. Those authors generally describe the non-idle events as the important ones, and the idle ones as unimportant. However, I encourage you to make only one classification of Oracle wait events: the between/within distinction that I introduce in Chapter 5. You need to understand this distinction to use the fundamental relation among trace file statistics, e ≅ c + Σ ela, without introducing omissions or double-counting errors. The reason I encourage you to make no other classification of Oracle wait events is that any event can be an important contributor to response time.

There is only one legitimate criterion for determining whether a response time component is important:

If a component contributes significantly to the response time of a properly targeted user action, then it is important; otherwise , it is not.

Therefore, I disagree with generalized claims that some events are "important" and that others are not. Any event can be important. It doesn't matter what event it is or how many authors say it's not important. If an event contributes significantly to the response time of a properly targeted user action, then the event is important to you.

How can "idle" events ever be important? When the time spent executing them contributes significantly to response time. Two events that you will encounter in almost every resource profile you will ever see are the events SQL*Net message to client and SQL*Net message from client . The from client event is irrefutably a so-called idle event. The Oracle kernel uses these to client and from client events to measure the performance of interprocess communications that take place through SQL*Net. Example 11-1 depicts how the events work inside the Oracle kernel.

Example 11-1. Typical code path that the Oracle kernel executes upon the completion of a database call
    # database call completes here

    # write the result of the db call to the client through SQL*Net
    ela0 = gettimeofday;
    write(SQLNET, ...);
    ela1 = gettimeofday;
    nam = 'SQL*Net message to client';
    ela = ela1 - ela0;
    printf(TRCFILE, "WAIT #%d: nam='%s' ela= %d ...", cursor, nam, ela, ...);
    printf(TRCFILE, "\n");

    # listen to SQL*Net for further instructions from the client
    ela0 = gettimeofday;
    read(SQLNET, ...);
    ela1 = gettimeofday;
    nam = 'SQL*Net message from client';
    ela = ela1 - ela0;
    printf(TRCFILE, "WAIT #%d: nam='%s' ela= %d ...", cursor, nam, ela, ...);
    printf(TRCFILE, "\n");

    # next database call gets processed here

SQL*Net message to client events typically consume only a couple of microseconds. But SQL*Net message from client events can take much longer. For example:

User think time

If you connect to Oracle at 08:00 and then do not issue a single database call until 10:00, Oracle will tally 7,200 seconds of SQL*Net message from client to V$SYSTEM_EVENT.TIME_WAITED at 10:00.

Client program execution time

An Oracle Applications Financial Statement Generator report (a batch program that executes as an Oracle client process) will typically make some database calls and then spend a comparatively long time executing C code in the client program upon the retrieved data. From the Oracle kernel process's perspective, this time consumed by the client is simply time consumed blocked upon the read call depicted in Example 11-1, so the time is logged as SQL*Net message from client .

Inter-call latency

Even when an application issues two database calls in rapid succession, it is common for the SQL*Net message from client event that occurs between the calls to have a latency of hundreds of microseconds.

Notice a couple of things from these examples. First, from looking at diagnostic data alone, it is impossible to determine which "database idle" time is actually part of someone's response time, and which "database idle" time is simply time that a user spends not paying attention to her screen. Knowing the difference requires knowledge of the user experience. Second, although the latency between rapid-fire database calls is only a few hundred microseconds, the total time adds up. For example, even if your average SQL*Net message from client latency is a microscopic 500 μs, then 1,000,000 database calls will generate a full 500 seconds (8.3 minutes) of response time.
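The arithmetic behind that second observation is worth spelling out:

```python
# 1,000,000 rapid-fire database calls at 500 microseconds of
# SQL*Net message from client latency apiece:
calls = 1_000_000
latency_seconds = 500e-6
total_seconds = calls * latency_seconds
print(total_seconds, total_seconds / 60)  # 500.0 seconds, about 8.3 minutes
```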

11.2.2.1 Responding to large SQL*Net response time contributions

These observations lead us to an understanding of how to respond to a resource profile that is dominated by between-call event durations. If a between-call event dominates your resource profile, here is what to do:

  1. Confirm that the between-call event duration is actually a component of someone's response time. If not, then correct the data collection error and construct a new resource profile.

  2. If a between-call event's response time contribution is high because of a few large latencies, then investigate why the application spends so much time between database calls.

  3. Otherwise, if the between-call event's response time contribution is high because the number of calls to the event is so large, then investigate why the application makes so many distinct database calls.

  4. If you cannot reduce the number of distinct database calls, then investigate whether you can improve the individual wait event's average latency. For example, eliminate other processes' unnecessary database calls that might be causing queueing delays for network resources.

    Notice that this sequence of steps is completely consistent with the procedure laid out in Chapter 5. First, eliminate unnecessary calls. Next, eliminate unnecessary interprocess competition.

The SQL*Net events whose latencies are most likely to dominate a resource profile are:

SQL*Net message from client

The remedy for this event is to eliminate as many unnecessary database calls as possible. For example, eliminate redundant parse calls (see Section 11.3.2 for details). Use Oracle's array processing features to manipulate many rows per database call instead of just one. [1]

[1] The same advice applies to the job of eliminating single-task message events. The single-task message event is not a SQL*Net event, but its use is the single-task analog of the SQL*Net message from client event on two-task process configurations.

SQL*Net more data from client

If you are passing enormous SQL text strings in your application's parse calls, then stop it. Instead, issue stored procedure calls from the application. However, if you are sending large arrays of values to be bound into placeholders ("bind variables ") in your SQL text, then you may not be able to reduce the time spent waiting for this event without creating some other, worse problem.

SQL*Net message from dblink

Consider replicating data locally instead of joining across a database link. Replication can be operationally efficient, especially for tables that change very rarely, or for which using slightly stale data produces negligible functional harm. It may be possible to eliminate executions of this event by reworking a SQL statement's execution plan so that less data are transmitted between instances.

You can derive a good approximation of the number of network round-trips an application generates by counting the number of event executions whose nam values have the string SQL*Net in them. Studying the ela values for these event executions of course results in powerful evidence about whether your network latencies are a contributor to a response time problem.
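Counting round-trips this way is straightforward to automate. This Python sketch is my own; the embedded WAIT lines are hypothetical and abbreviated (real trace lines carry additional fields after ela=):

```python
import re

# Hypothetical, abbreviated WAIT lines from an extended SQL trace.
trace = """\
WAIT #1: nam='SQL*Net message to client' ela= 4
WAIT #1: nam='SQL*Net message from client' ela= 721
WAIT #1: nam='db file sequential read' ela= 11507
WAIT #1: nam='SQL*Net message to client' ela= 3
WAIT #1: nam='SQL*Net message from client' ela= 64285
"""

wait = re.compile(r"WAIT #\d+: nam='([^']*)' ela= (\d+)")

round_trips = 0   # approximation: count of SQL*Net event executions
sqlnet_ela = 0    # total ela attributed to SQL*Net events
for nam, ela in wait.findall(trace):
    if "SQL*Net" in nam:
        round_trips += 1
        sqlnet_ela += int(ela)

print(round_trips, sqlnet_ela)   # 4 65013
```

Comparing the total SQL*Net ela against total response time tells you immediately whether network and between-call latencies are worth pursuing.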

11.2.2.2 Responding to large response time contributions from other events

Occasionally, you'll find an unsatisfactorily large response time contribution from an event you've never heard of. There are many Oracle wait events that I don't cover in much detail within this book. However, the pattern of your response shouldn't vary, even when your number one performance problem is something you don't know how to fix. The sequence of steps should sound familiar to you by now:

  1. Identify the event that's consuming most of your user action's response time.

  2. Determine what action causes the event to be executed so many times.

  3. Make the application do that action less often.

Of course, step 1 is pretty thoroughly documented throughout this book. Steps 2 and 3 are not much more difficult. The names of most events give you a big clue about their meanings. For example, the PX events are emitted by code path within Oracle's parallel execution capability. Once you learn how parallel execution works, the meanings of the PX events become clear: a master session assigns work to slave sessions and waits for the results. The slaves do most of the work. Another example: a global cache cr request event is what a RAC-enabled Oracle kernel process emits when it needs to access a database buffer held by another RAC instance. How do you get rid of global cache cr request executions? Require fewer buffers from the other instance, or perhaps fewer buffer visits in general (see Section 11.3.1, below).

Even if the event is something you've never heard of, and you find its name completely incomprehensible, Method R takes you to the threshold of a solution. As I mention at the beginning of this chapter, there are several references available to you, free of charge on the Internet. Even the worst case imaginable isn't very bad. If you find absolutely no help on the Internet, your call to Oracle Corporation product support should be much easier than you might expect:

Support call without Method R : My system is slow. We're desperate, but we have no clue what to do. What can we do to fix it?

Depending upon which support analyst you get, be prepared for just about anything. However, with Method R, your support call should be considerably more tightly scoped.

Support call with Method R : The response time for the most important user action on my system is 75.32 seconds. Of that time, more than 73 seconds are spent executing an event called resmgr:waiting in check2 . Can you please tell me (1) what action in my application is causing this Oracle kernel code path to be executed? And (2) what can I do to execute it less often?

It's a rhythm you get into: find out why an event was executed; find out how to avoid it next time. It's a rhythm that works.


   
Optimizing Oracle Performance
ISBN: 059600527X
Year: 2002
Pages: 102