Section 1.4. Tools for Analyzing Response Time | Optimizing Oracle Performance

1.4 Tools for Analyzing Response Time

The definition of response time set forth by the International Organization for Standardization is plain but useful:

Response time is the elapsed time between the end of an inquiry or demand on a computer system and the beginning of a response; for example, the length of the time between an indication of the end of an inquiry and the display of the first character of the response at a user terminal (source: http://searchnetworking.techtarget.com/sDefinition/0,,sid7_gci212896,00.html).

Response time is an objective measure of the interaction between a consumer and a provider. Consumers of computer service want the right answer with the best response time for the lowest cost. Your goal as an Oracle performance analyst is to minimize response time within the confines of the system owner's economic constraints. The ways to do that become more evident when you consider the components of response time.

1.4.1 Sequence Diagram

A sequence diagram is a convenient way to depict the response time components of a user action. A sequence diagram shows the flow of control as a user action consumes time in different layers of a technology stack. The technology stack is a model that considers system components such as the business users, the network, the application software, the database kernel, and the hardware in a stratified architecture. The component at each layer in the stack demands service from the layer beneath it and supplies service to the layer above it. Figure 1-1 shows a sequence diagram for a multi-tier Oracle system.

Figure 1-1. A sequence diagram for a multi-tier Oracle system

Figure 1-1 denotes the following sequence of actions, allowing us to literally see how each layer in the technology stack contributes to the consumption of response time:

After considering what she wants from the system, a user initiates a request for data from a browser by pressing the OK button. Almost instantaneously, the request arrives at the browser. The user's perception of response time begins with the click of the OK button.
After devoting a short bit of time to rendering the pixels on the screen to make the OK button look like it has been depressed, the browser sends an HTTP packet to the wide-area network (WAN). The request spends some time on the WAN before arriving at the application server.
After executing some application code on the middle tier, the application server issues a database call via SQL*Net across the local-area network (LAN). The request spends some time on the LAN (less than a request across a WAN) before arriving at the database server.
After consuming some CPU time on the database server, the Oracle kernel process issues an operating system function call to perform a read from disk.
After consuming some time in the disk subsystem, the read call returns control of the request back to the database CPU.
After consuming more CPU time on the database server, the Oracle kernel process issues another read request.
After consuming some more time in the disk subsystem, the read call returns control of the request again to the database CPU.
After a final bit of CPU consumption on the database server, the Oracle kernel process passes the results of the application server's database call. The return is issued via SQL*Net across the LAN.
After the application server process converts the results of the database call into the appropriate HTML, it passes the results to the browser across the WAN via HTTP.
After rendering the result on the user's display device, the browser returns control of the request back to the user. The user's perception of response time ends when she sees the information she requested.
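The sequence above can be sketched as an ordered list of time segments, one per hop through the technology stack. The following Python sketch is illustrative only: the durations are invented, and the tier name "DB Disk" is an assumed label for the disk subsystem. It shows how summing the segments by tier accounts for the whole of the user's response time.

```python
from collections import defaultdict

# Hypothetical segment durations in seconds, in the order the request
# visits each tier. All numbers are invented for illustration.
segments = [
    ("Browser", 0.002),      # render the pressed OK button
    ("WAN", 0.150),          # HTTP request to the application server
    ("Apps Server", 0.020),  # application code on the middle tier
    ("LAN", 0.001),          # database call via SQL*Net
    ("DB CPU", 0.005),
    ("DB Disk", 0.008),      # first read from disk
    ("DB CPU", 0.004),
    ("DB Disk", 0.007),      # second read from disk
    ("DB CPU", 0.003),
    ("LAN", 0.001),          # results returned via SQL*Net
    ("Apps Server", 0.015),  # convert results to HTML
    ("WAN", 0.150),          # HTTP response to the browser
    ("Browser", 0.010),      # render the result
]

def time_by_tier(segments):
    """Sum the elapsed time spent in each tier."""
    totals = defaultdict(float)
    for tier, seconds in segments:
        totals[tier] += seconds
    return dict(totals)

totals = time_by_tier(segments)
response_time = sum(seconds for _, seconds in segments)

# Print tiers in descending order of time consumed.
for tier, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tier:<12} {seconds:8.3f}s {seconds / response_time:6.1%}")
print(f"{'Total':<12} {response_time:8.3f}s")
```

Because every segment belongs to exactly one tier, the per-tier totals always add up to the end-to-end response time.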

A good sequence diagram reveals only the amount of detail that is appropriate for the analysis at hand. For example, to simplify the content of Figure 1-1, I have made no effort to show the tiny latencies that occur within the Browser, Apps Server, and DB CPU tiers as their operating systems' schedulers transition processes among running and ready to run states. In some performance improvement projects, understanding this level of detail will be vital. I describe the performance impact of such state transitions in Chapter 7.

In my opinion, the ideal Oracle performance optimization tool does not exist yet. The graphical user interface of the ideal performance optimization tool would be a sequence diagram that could show how every microsecond of response time had been consumed for any specified user action. Such an application would have so much information to manage that it would have to make clever use of summary and drill-down features to show you exactly what you wanted when you wanted it.

Such an application will probably be built soon. As you shall see throughout this book, much of the information that is needed to build such an application is already available from the Oracle kernel. The biggest problems today are:

Most of the non-database tiers in a multi-tier system aren't instrumented to provide the type of response time data that the Oracle kernel provides. Chapter 7 details the response time data that I'm talking about.
Depending upon your application architecture, it can be very difficult to collect properly scoped performance diagnostic data for a specific user action. Chapter 3 explains what constitutes proper scoping for diagnostic data, and Chapter 6 explains how to work around the data collection difficulties presented by various application architectures.

However, much of what we need already exists. Beginning with Oracle release 7.0.12, and improving ever since, the Oracle kernel is well instrumented for response time measurement. This book will help you understand exactly how to take advantage of those measurements to optimize your approach to the performance improvement of Oracle systems.

1.4.2 Resource Profile

A complete sequence diagram for anything but a very simple user action would show so much data that it would be difficult to use all of it. Therefore, you need a way to summarize the details of response time in a useful way. In Example 1-2, I showed a sample of such a summary, called a resource profile. A resource profile is simply a table that reveals a useful decomposition of response time. Typically, a resource profile reveals at least the following attributes:

Response time category
Total duration consumed by actions in that category
Number of calls to actions in that category

A resource profile is most useful when it lists its categories in descending order of elapsed time consumption per category. The resource profile is an especially handy format for performance analysts because it focuses your attention on exactly the problem you should solve first. The resource profile is the most important tool in my performance diagnostic repertory.
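As a minimal sketch of the idea, the following Python function builds a resource profile from a list of timed events. The function name `resource_profile`, the row layout, and the sample events are assumptions made for illustration; they are not taken from any Oracle tool.

```python
from collections import defaultdict

def resource_profile(events):
    """Summarize (category, seconds) events into one row per category,
    sorted in descending order of total duration."""
    duration = defaultdict(float)
    calls = defaultdict(int)
    for category, seconds in events:
        duration[category] += seconds
        calls[category] += 1
    total = sum(duration.values())
    rows = [
        {
            "category": category,
            "duration": duration[category],
            "percent": 100.0 * duration[category] / total if total else 0.0,
            "calls": calls[category],
            "dur_per_call": duration[category] / calls[category],
        }
        for category in duration
    ]
    rows.sort(key=lambda row: row["duration"], reverse=True)
    return rows

# Hypothetical events: (response time category, elapsed seconds).
events = [
    ("CPU service", 0.5),
    ("db file sequential read", 1.5),
    ("CPU service", 0.25),
    ("db file sequential read", 2.0),
    ("SQL*Net message from client", 0.25),
]

profile = resource_profile(events)
for row in profile:
    print(f"{row['category']:<29} {row['duration']:8.2f}s {row['percent']:5.1f}% "
          f"{row['calls']:8d} {row['dur_per_call']:10.6f}s")
```

Sorting by total duration puts the largest contributor to response time on the first row, which is where the analysis should start.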

The idea of the resource profile is nothing new, actually. The idea for using the resource profile as our company's focus was inspired by an article on profilers published in the 1980s [Bentley (1988) 3-13], which itself was based on work that Donald Knuth published in the early 1970s [Knuth (1971)]. The idea of decomposing response time into components is so sensible that you probably do it often without realizing it. Consider how you optimize your driving route to your favorite destination. Think of a "happy place" where you go when you want to feel better. For me it's my local Woodcraft Supply store (http://www.woodcraft.com), which sells all sorts of tools that can cut fingers or crush rib cages, and all sorts of books and magazines that explain how not to.

If you live in a busy city and schedule the activity during rush-hour traffic, the resource profile for such a trip might resemble the following (expressed in minutes):

Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
rush-hour expressway driving         90m    90%              2          45m
neighborhood driving                 10m    10%              2           5m
----------------------------- ----------------- -------------- ------------
Total                               100m   100%

If the store were, say, only fifteen miles away, you might find the prospect of sitting for an hour and a half in rush-hour traffic to be disappointing. Whether or not you believe that your brain works in the format of a resource profile, you probably would consider the same optimization that I'm thinking of right now: perhaps you could go to the store during an off-peak driving period.

Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
off-peak expressway driving          30m    75%              2          15m
neighborhood driving                 10m    25%              2           5m
----------------------------- ----------------- -------------- ------------
Total                                40m   100%

The driving example is simple enough, and the stakes are low enough, that a formal analysis is almost definitely unnecessary. However, for more complex performance problems, the resource profile provides a convenient format for proving a point, especially when decisions about whether or not to invest lots of time and money are involved.

Resource profiles add unequivocal relevance to Oracle performance improvement projects. Example 1-3 shows a resource profile for the Oracle Payroll program described earlier in Section 1.3.1. Before the database administrators saw this resource profile, they had worked for three months fighting a perceived problem with latch contention. In desperation, they had spent several thousand dollars on a CPU upgrade, which had actually degraded the response time of the payroll action whose performance they were trying to improve. Within ten minutes of creating this resource profile, the database administrator knew exactly how to cut this program's response time by roughly 50%. The problem and its solution are detailed in Part II of this book.

Example 1-3. The resource profile for a network configuration problem that had previously been misdiagnosed as both a latch contention problem and a CPU capacity problem

Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
SQL*Net message from client       984.0s  49.6%         95,161    0.010340s
SQL*Net more data from client     418.8s  21.1%          3,345    0.125208s
db file sequential read           279.3s  14.1%         45,084    0.006196s
CPU service                       248.7s  12.5%        222,760    0.001116s
unaccounted-for                    27.9s   1.4%
latch free                         23.7s   1.2%         34,695    0.000683s
log file sync                       1.1s   0.1%            506    0.002154s
SQL*Net more data to client         0.8s   0.0%         15,982    0.000052s
log file switch completion          0.3s   0.0%              3    0.093333s
enqueue                             0.3s   0.0%            106    0.002358s
SQL*Net message to client           0.2s   0.0%         95,161    0.000003s
buffer busy waits                   0.2s   0.0%             67    0.003284s
db file scattered read              0.0s   0.0%              2    0.005000s
SQL*Net break/reset to client       0.0s   0.0%              2    0.000000s
----------------------------- ----------------- -------------- ------------
Total                           1,985.4s 100.0%

Example 1-4 shows another resource profile that saved a project from a frustrating and expensive ride down a rat hole. Before seeing the resource profile shown here, the proposed solution to this report's performance problem was to upgrade either memory or the I/O subsystem. The resource profile proved unequivocally that upgrading either could result in no more than a 2% response time improvement. Almost all of this program's response time was attributable to a single SQL statement that motivated nearly a billion visits to blocks stored in the database buffer cache.

You can't tell by looking at the resource profile in Example 1-4 that the CPU capacity was consumed by nearly a billion memory reads. Each of the 192,072 "calls" to the CPU service resource represents one Oracle database call (for example, a parse, an execute, or a fetch). From the detailed SQL trace information collected for each of these calls, I could determine that the 192,072 database calls had issued nearly a billion memory reads. How you can do this is detailed in Chapter 5.

Problems like this are commonly caused by operational errors like the accidental deletion of schema statistics used by the Oracle cost-based query optimizer (CBO).

Example 1-4. The resource profile for an inefficient SQL problem that had previously been diagnosed as an I/O subsystem problem

Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
CPU service                    48,946.7s  98.0%        192,072    0.254835s
db file sequential read           940.1s   2.0%        507,385    0.001853s
SQL*Net message from client        60.9s   0.0%        191,609    0.000318s
latch free                          2.2s   0.0%            171    0.012690s
other                               1.4s   0.0%
----------------------------- ----------------- -------------- ------------
Total                          49,951.3s 100.0%
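The bound on what an I/O or memory upgrade could achieve follows directly from the durations in Example 1-4. The short Python sketch below uses those durations; the function name `max_improvement` is a hypothetical helper, and the calculation assumes the best case in which a component's time drops to zero while every other component stays the same.

```python
# Durations in seconds, copied from Example 1-4.
profile = {
    "CPU service": 48946.7,
    "db file sequential read": 940.1,
    "SQL*Net message from client": 60.9,
    "latch free": 2.2,
    "other": 1.4,
}

def max_improvement(profile, component):
    """Best-case fraction of total response time saved if `component`
    were eliminated entirely and nothing else changed."""
    total = sum(profile.values())
    return profile[component] / total

for component in profile:
    print(f"{component:<29} {max_improvement(profile, component):6.1%}")
```

No amount of work on the disk reads can save more than their own share of the total, which is why the resource profile rules out the proposed upgrades before any money is spent.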

Example 1-4 is a beautiful example of how a resource profile can free you from victimization to myth. In this case, the myth that had confused the analyst about this slow session was the proposition that a high database buffer cache hit ratio is an indication of SQL statement efficiency. The statement causing this slow session had an exceptionally high buffer cache hit ratio. It is easy to understand why, by looking at the computation of the cache hit ratio (CHR) metric for this case:

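$$\mathrm{CHR} = \frac{\mathrm{LIO} - \mathrm{PIO}}{\mathrm{LIO}} \approx \frac{10^{9} - 507{,}385}{10^{9}} \approx 0.9995$$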

In this formula, LIO (logical I/O) represents the number of Oracle blocks obtained from Oracle memory (the database buffer cache), and PIO (physical I/O) represents the number of Oracle blocks obtained from operating system read calls. ^[1] The expression LIO - PIO thus represents the number of blocks obtained from Oracle memory that did not motivate an operating system read call.

^[1] This formula has many problems other than the one illustrated in this example. Many authors, including Adams, Lewis, Kyte, and myself, have identified dozens of critical flaws in the definition of the database buffer cache hit ratio statistic. See especially [Lewis (2003)] for more information.

Although most analysts would probably consider a ratio value of 0.9995 to be "good," it is of course not "perfect." In the absence of the data shown in Example 1-4, many analysts I've met would have assumed that it was the imperfection in the cache hit ratio that was causing the performance problem. But the resource profile shows clearly that even if the 507,385 physical read operations could have been serviced from the database buffer cache, the best possible total time savings would have been only 940.1 seconds. The maximum possible impact of fixing this "problem" would have been to shave a 14-hour execution by a mere 16 minutes.
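The arithmetic behind these figures can be checked with a few lines of Python. This is a sketch under two stated assumptions: the LIO count is taken as exactly one billion (the text says only "nearly a billion"), and each of the 507,385 physical read operations is counted as one block.

```python
def cache_hit_ratio(lio, pio):
    """Buffer cache hit ratio: the share of logical reads that did not
    require an operating system read call."""
    return (lio - pio) / lio

LIO = 1_000_000_000   # assumption: "nearly a billion" rounded to 1e9
PIO = 507_385         # physical read operations from Example 1-4

chr_value = cache_hit_ratio(LIO, PIO)

read_seconds = 940.1      # db file sequential read duration
total_seconds = 49951.3   # total response time

print(f"CHR                 = {chr_value:.4f}")
print(f"Best-case savings   = {read_seconds / 60:.1f} minutes")
print(f"Total response time = {total_seconds / 3600:.1f} hours")
```

A ratio this close to 1 says nothing about how long the statement runs; the duration columns of the resource profile do.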

Considering the performance of user actions using the resource profile format has revolutionized the effectiveness of many performance analysts. For starters, it is the perfect tool for determining what to work on first, in accordance with our stated objective:

Work first to reduce the biggest response time component of a business' most important user action.

Another huge payoff of using the resource profile format is that it is virtually impossible for a performance problem to hide from it. The informal proof of this conjecture requires only two steps:

Proof: If something is a response time problem, then it shows up in the resource profile. If it's not a response time problem, then it's not a performance problem. QED

Part II of this book describes how to create resource profiles from which performance problems cannot hide.

In Case You've Heard That More Memory Makes All Your Performance Problems Go Away

Example 1-4 brings to mind the first "tuning" class I ever attended. The year was 1989, during one of my first weeks as a new Oracle Corporation employee. Our instructor advised us that the way to tune an Oracle query was simple: just eliminate physical I/O operations. I asked, "What about memory accesses?", referring to a big number in the query column of the tkprof output we were looking at. Our instructor responded that fetches from memory are so fast that their performance impact is negligible. I thought this was a weird answer, because prior to the beginning of my Oracle career, I had tuned a lot of C code. One of the most important steps in doing that job was eliminating unnecessary memory accesses [Dowd (1993)].

Example 1-4 illustrates why eliminating unnecessary memory accesses should be a priority for you, too. Unnecessary memory accesses consume response time. Lots of them can consume lots of response time. With 2GHz CPUs, the code path associated with each Oracle logical I/O operation (LIO) typically motivates tens of microseconds of user-mode CPU time consumption. Therefore, a million LIOs will consume tens of seconds of response time. Excessive LIO processing inhibits system scalability in a number of other ways as well, as I explain in Parts II and III of this book. See [Millsap (2001c)] for even more information.
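The scaling claim is simple multiplication, sketched below in Python. The per-LIO costs of 10, 20, and 50 microseconds are illustrative picks within "tens of microseconds." The final figure is a rough upper bound only: it assumes a round one billion LIOs and charges all of the CPU service time in Example 1-4 to LIO processing, which overstates the true per-LIO cost.

```python
def lio_cpu_seconds(lio_count, microseconds_per_lio):
    """CPU seconds consumed by `lio_count` logical I/Os at a given
    per-LIO cost in microseconds."""
    return lio_count * microseconds_per_lio / 1_000_000

# A million LIOs at a few illustrative per-LIO costs.
for cost_us in (10, 20, 50):
    seconds = lio_cpu_seconds(1_000_000, cost_us)
    print(f"{cost_us:>3} us per LIO -> {seconds:5.1f} s per million LIOs")

# Rough upper bound on the per-LIO cost implied by Example 1-4,
# assuming 1e9 LIOs and that all CPU service time went to LIO processing.
cpu_service_seconds = 48946.7
assumed_lio_count = 1_000_000_000
implied_us_per_lio = cpu_service_seconds / assumed_lio_count * 1_000_000
print(f"Implied cost in Example 1-4: about {implied_us_per_lio:.1f} us per LIO")
```

Even under these crude assumptions the implied cost lands in the tens of microseconds, consistent with the rule of thumb above.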
