Section 3.2. Data Scope

   

3.2 Data Scope

Good Oracle performance data collection requires good decision-making in two dimensions. You must collect data for the right time scope and the right action scope . Let's begin by drawing a user action's response time consumption as a sequence of chunks of time spent consuming various resources. Figure 3-2 shows the result. To keep it simple, our imaginary system consists of only three types of resource, called C , D , and S . Imagine that these symbols stand for CPU, disk, and serialization (such as the one-at-a-time access that the Oracle kernel imposes for locks, latches, and certain memory buffer operations). In Figure 3-2, the time dimension extends in the horizontal direction.

Figure 3-2. The consumption of three types of resource over the duration of a user action
figs/oop_0302.gif

We can denote a system that is executing several user actions at the same time by stacking such drawings vertically, as shown in Figure 3-3. In this drawing, the action dimension extends in the vertical direction.

Figure 3-3. By adding a vertical dimension, this drawing depicts a system containing seven concurrent actions, each consuming three different types of resource through time
figs/oop_0303.gif

The following sections use this graphical notation to illustrate why the data collection methods that many Oracle experts have been teaching since the 1980s are actually what have been killing performance improvement projects all over the world.

3.2.1 Scoping Errors

In the system shown in Figure 3-3, imagine that the targeting process described in Chapter 2 has revealed the following: the most important performance problem for the business is that a user named Wallace endures an unacceptably long response time between times t 1 and t 2 , as shown in Figure 3-4.

Figure 3-4. This system's most important user, Wallace, experiences unacceptable performance in the time interval [t 1 , t 2 ]
figs/oop_0304.gif

In the following discussions, I shall use the mathematical notation for a closed interval . The notation [ a , b ] represents the set of values between a and b , inclusive:

[ a , b ] = {all x values for which a x b }

From the picture in Figure 3-4, it is easy to see that during the problem time interval, Wallace's response time was consumed predominantly by S and secondarily by C , as shown in Table 3-1. Of course, repairing Wallace's performance problem will require a reduction in time for Wallace's action spent consuming either S , or C , or both. Amdahl's Law indicates that any percentage reduction in consumption of S will have 1.5 times the response time impact that an equivalent percentage reduction in consumption of C will have, because the response time contribution of S is 1.5 times the size of the response time contribution of C .

Table 3-1. Resource profile for Wallace's action for the time interval [t 1 , t 2 ]

Resource

Elapsed time

Percentage of total time

S

3

60.0%

C

2

40.0%

Total

5

100.0%

Perhaps the most common data collection error is to collect data that are aggregated in both dimensions. Figure 3-5 shows what this mistake looks like. The heavy, dark line around all the blocks in the entire figure indicate that data were aggregated for all processes (not just Wallace's), and for the whole time interval [ t , t 3 ] (not just [ t 1 , t 2 ]). Counting the time units attributable system-wide during the [ t , t 3 ] interval produces the resource profile shown in Table 3-2. As you can see, Wallace's performance problem ”which we know to have been too much time spent doing S ”has been thoroughly buried by all of the irrelevant data that we collected. The result of the botched data collection will be a longer and probably less fruitful performance improvement project than we want.

Figure 3-5. Collecting data that are improperly scoped on both the time and action dimensions will completely conceal the nature of Wallace's problem in the time interval [t 1 , t 2 ]
figs/oop_0305.gif

From the data shown in Table 3-2, you simply cannot see that S is Wallace's principal problem root cause. It would actually be irresponsible to assume that S might be the root cause of Wallace's problem.

Table 3-2. Resource profile for the entire system for the time interval [t , t 3 ]

Resource

Elapsed time

Percentage of total time

D

66

47.1%

C

58

41.4%

S

16

11.4%

Total

140

100.0%

Unfortunately, the deeply flawed data collection method illustrated here is the default behavior of Statspack , the utlbstat .sql and utlestat.sql script pair, and virtually every other Oracle performance tool created between 1980 and 2000. Of the most deeply frustrating performance improvement projects with which I've ever assisted, this type of data collection error is far and away the most common root cause of their failure.

The remedy to the data collection problem must be executed on both dimensions. Repairing the collection error in only one dimension is not enough. Observe, for example, the result of collecting the data shown in Figure 3-6. Here, the time scoping is done correctly, but the action scope is still too broad. The accompanying resource profile is shown in Table 3-3. Again, remember that you know the root cause of Wallace's performance problem: it is a combination of S and C . But the data collected system-wide provides apparent "evidence" quite to the contrary, even though the data were collected for the correct time interval.

Figure 3-6. Collecting data that are scoped improperly on the action dimension also conceals the nature of Wallace's performance problem, even though the data were collected for the correct time scope
figs/oop_0306.gif
Table 3-3. Resource profile for the entire system for the time interval [t 1 , t 2 ]

Resource

Elapsed time

Percentage of total time

D

23

65.7%

C

9

25.7%

S

3

8.6%

Total

35

100.0%

Finally, examine the result of collecting data for the correct action scope but the wrong time scope, as shown in Figure 3-7. Table 3-4 shows the resource profile. Once again, presented with these data, even a competent performance analyst will botch the problem diagnosis job. Wallace's problem is S and C , but you certainly wouldn't figure it out by looking at Table 3-4.

Figure 3-7. Collecting data that are scoped improperly on the time dimension also conceals the nature of Wallace's performance problem, even though the data were collected for the correct action scope
figs/oop_0307.gif
Table 3-4. Resource profile for Wallace's action for the time interval [t , t 3 ]

Resource

Elapsed time

Percentage of total time

D

8

40.0%

C

8

40.0%

S

4

20.0%

Total

20

100.0%

From this sequence of simple examples, it is easy to see why proper diagnostic data collection is so vital to a performance improvement project. The examples also clearly reveal the identity of the two dimensions along which you can assess whether or not a given diagnostic data collection can be deemed proper :

Reliable problem diagnosis cannot proceed unless the data collection phase produces response time data for exactly the right time scope and exactly the right action scope .

3.2.2 Long-Running User Actions

When you have a really long-running user action, do you need to collect performance diagnostic data for the whole thing? Perhaps you have an action that ran in ten minutes last week, but today it has already run for over four hours, and you're wondering whether you should kill it. Do you have to restart the job in order to collect diagnostic data for it? Sometimes, I hear about batch jobs that run for several days before their users give up and terminate the jobs instead of letting them finish. [1] Do you really need to collect performance diagnostic data for the whole job?

[1] In some of these cases, I've been able to prove that if the job were left to run to completion, it would not be able to complete in our lifetimes.

The answer is no. Of course, collecting performance diagnostic data for some subset of an action's performance problem duration introduces a type of time-scoping error, but it is actually useful to collect time-subset diagnostic data in some circumstances. For example:

  • If a user action is supposed to run in n minutes, then collecting data for just n + m minutes will reveal at least m minutes of response time that shouldn't exist. For example, if a job is supposed to run in 10 minutes, then 25 minutes' worth of diagnostic data will reveal at least 15 minutes of workload that shouldn't exist.

  • If a user action consists of a long sequence of repetitive tasks, then performance diagnostic data collected for a small number of the tasks will reveal the resources consumed by the whole action, as long as the tasks are homogeneous.

In Chapter 6, I discuss some collection errors that might occur if your data collection process begins in the midst of a database action. But in many cases, collecting time-subset diagnostic data can help you along your way.

3.2.3 "Too Much Data" Is Really Not Enough Data

It is tempting to say that the scoping problems in Tables Table 3-2 through Table 3-4 were the result of collecting "too much data." However, the problem with these three resource profiles was not necessarily in what data were collected , it is more an issue of how the data were aggregated . Look again at Figure 3-5. There is plenty of information here to produce a correctly scoped resource profile. The problem with Table 3-2 is in how the data from Figure 3-5 were aggregated. The same can be said for Figures Figure 3-6 and Figure 3-7 and their resource profiles. The problem is not that the figures contain too much data, it's that their corresponding resource profiles are aggregated incorrectly.

Poor aggregation is an especially big problem for projects that use SQL queries of Oracle V$ fixed views as their performance diagnostic data sources. Oracle V$ views by their nature provide data that are either aggregated for an entire instance since instance startup, or for an entire session since connection. For example, using V$SYSSTAT or V$SYSTEM_EVENT is guaranteed to produce the action scoping errors depicted in Tables Table 3-2 and Table 3-3. Even meticulous use of V$SESSTAT and V$SESSION_EVENT makes you prone to the type of time scoping error depicted in Table 3-4 (as you can see by experimenting with my vprof program described in Chapter 8).

When used with careful attention to time scope, Oracle's V$SESSTAT and V$SESSION_EVENT views provide a high-level perspective of why a user action is taking so long. However, for the next level of your diagnosis, you'll need to know details that V$SESSTAT and V$SESSION_EVENT can't provide. For example, what if your preliminary analysis indicates that your targeted user action is spending most of its time waiting for the event called latch free ? Then you'll wish you had collected data from V$LATCH (and perhaps V$LATCH_CHILDREN ) for the same time interval. But even if you had, you'll notice that neither fixed view contains a session ID attribute, so collecting properly action-scoped data about latches on a busy system will be impossible .

The problem of acquiring secondary detail data from V$ views is an extremely serious one. It's by no means just a problem with V$LATCH . What if the dominant consumer of response time had been CPU service? Then you need properly time- and action-scoped data at least from V$SQL . What if the dominant consumer had been waits for db file scattered read ? Then you need properly time- and action-scoped data at least from V$FILESTAT . What if the problem had been waits for buffer busy waits ? Then you need V$WAITSTAT . In Oracle9 i there are roughly 300 events that beg for details from any of dozens of V$ fixed views. Even if you could query from all these V$ views at exactly the right times to produce accurately time-scoped data, you'd still be left with aggregations whose values fall short of what you could acquire through other means.

Happily, there are at least three ways to acquire the drill-down data you need. The first doesn't work very well. The second is expensive, but you might already have the capability. The third is available to you for the price of the book that you are holding. The following section explains.


   
Top


Optimizing Oracle Performance
Optimizing Oracle Performance
ISBN: 059600527X
EAN: 2147483647
Year: 2002
Pages: 102

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net