12.1 Case 1: Misled by System-Wide Data

After several months of trying to arrange a sales call into a local account in Dallas, our www.hotsos.com sales staff finally got a phone call on a Monday. The people with whom we'd been trying to arrange a visit had finally reached their wits' end with a performance problem, and they were going to give us a shot at fixing it. So on Wednesday, we got our chance.

The scenario was similar to ones we had seen before. The company had been fighting a performance problem with a particular program's response time for several months. All of the in-house experts had taken their shot at the problem over those months, and none of their attempts had produced any appreciable performance gain. Frustration had finally reached the point at which management decided to invest a large chunk of cash in fixing the problem. So on the weekend prior to our phone call, the company upgraded their system's CPUs. The upgrade went smoothly, and of course everyone was excited, and perhaps a bit nervous, to see the improvement on Monday.

To their horror, the performance of the slow program was actually worse after the very expensive upgrade. It didn't just seem worse; it was measurably worse. So on Monday we got a phone call. The invitation said, "Come in and show us what you can do." Two days later, we got in our car and drove across town. This is what we found:

  • The company used the Oracle Payroll product. It was configured in a conventional way, with batch jobs running on the database server, and dozens of browser-based users scattered throughout the building on a local area network (LAN).

  • PYUGEN program performance had been hurting the business for several months. When we arrived, the PYUGEN program, a batch job, was able to process about 27 assignments per minute. The performance target was twice this throughput.

  • The customer used an internationally renowned, full-featured performance monitoring tool that queried data from fixed views such as V$SYSTEM_EVENT and V$LATCH (a sketch of this kind of query appears after this list). This tool showed that the system's bottleneck was waits for the Oracle latch free wait event. The vast majority of latch free waits were waits for cache buffers chains latches.

  • The customer understood correctly that contention for cache buffers chains latches was a likely indication of inefficient (that is, high-LIO) SQL. However, the customer's application developers had analyzed the PYUGEN program and found no way to reduce the SQL's LIO count.

  • A recent upgrade of all twelve of the system's CPUs from 700 MHz to 1 GHz had made PYUGEN performance measurably worse. The failure of the CPU upgrade to improve performance was the "final straw" that motivated the customer to invite www.hotsos.com staff onsite.
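For context, the kind of system-wide query such a monitoring tool relies on looks roughly like the following. This is a sketch, not the customer's actual tool; the views and columns are standard Oracle fixed views, and TIME_WAITED is reported in centiseconds.

 -- System-wide wait and latch summaries of the kind the monitoring tool used.
 -- These views aggregate across all sessions since instance startup, which is
 -- precisely why they can mislead you about any one program.
 select event, total_waits, time_waited     -- time_waited is in centiseconds
   from v$system_event
  order by time_waited desc;

 select name, gets, misses, sleeps          -- sleeps indicate latch contention
   from v$latch
  order by sleeps desc;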

12.1.1 Targeting

By the time we arrived onsite, the customer had already completed the user action targeting step by identifying PYUGEN as the system's most important performance problem. Thus, our first step with the customer was to begin the collection of properly scoped diagnostic data. At this customer, collecting Oracle extended SQL trace data was straightforward because user action response time consisted exclusively of PYUGEN program execution. We used our free tool called Sparky (http://www.hotsos.com) to manage the extended SQL trace activation and deactivation.
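For readers who want to see what Sparky was automating: level-8 extended SQL trace (event 10046 with wait data) can be activated with statements like the following. This is a minimal sketch; DBMS_SUPPORT is not installed by default and DBMS_SYSTEM.SET_EV is unsupported, so the exact mechanism varies by site and release.

 -- Level-8 extended SQL trace for your own session:
 alter session set events '10046 trace name context forever, level 8';

 -- For another session (such as the PYUGEN Concurrent Manager process),
 -- identified by its SID and SERIAL# from V$SESSION:
 execute sys.dbms_system.set_ev(:sid, :serial, 10046, 8, '');

 -- Deactivation for your own session:
 alter session set events '10046 trace name context off';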

The execution that we traced consumed slightly more than half an hour, producing roughly 70 MB of extended trace data. After the program completed, we ran the Hotsos Profiler on the data, producing the resource profile shown in Example 12-1.

Example 12-1. Resource profile for an Oracle Payroll program
 Response Time Component                Duration        # Calls     Dur/Call
 ----------------------------- ----------------- -------------- ------------
 SQL*Net message from client       984.0s  49.6%         95,161    0.010340s
 SQL*Net more data from client     418.8s  21.1%          3,345    0.125208s
 db file sequential read           279.3s  14.1%         45,084    0.006196s
 CPU service                       248.7s  12.5%        222,760    0.001116s
 unaccounted-for                    27.9s   1.4%
 latch free                         23.7s   1.2%         34,695    0.000683s
 log file sync                       1.1s   0.1%            506    0.002154s
 SQL*Net more data to client         0.8s   0.0%         15,982    0.000052s
 log file switch completion          0.3s   0.0%              3    0.093333s
 enqueue                             0.3s   0.0%            106    0.002358s
 buffer busy waits                   0.2s   0.0%             67    0.003284s
 SQL*Net message to client           0.2s   0.0%         95,161    0.000003s
 db file scattered read              0.0s   0.0%              2    0.005000s
 SQL*Net break/reset to client       0.0s   0.0%              2    0.000000s
 ----------------------------- ----------------- -------------- ------------
 Total                           1,985.3s 100.0%

The data in the resource profile came as a surprise to everyone who had worked on the project for the past few months. From the resource profile alone, we could determine beyond a shadow of a doubt that waits for latch free were virtually irrelevant to total PYUGEN response time. Even if the company had completely eliminated latch free waits from the system, it would have made only about a 1% difference in the runtime of this program.

This kind of thing happens frequently in our field work: you cannot detect many types of user action performance problems by examining system-wide data. The data from V$SYSTEM_EVENT was true; it was just irrelevant to the problem at hand. You cannot extrapolate detail for a specific session from aggregated system-wide data.

Actually, the V$SYSTEM_EVENT view had indicated clearly that the top wait event was SQL*Net message from client , but of course every good Oracle performance analyst knows that you have to discard all the SQL*Net events because they are "idle" events.

Roughly 50% of the total PYUGEN response time was consumed by executions of read system calls to the SQL*Net mechanism. The occurrence of SQL*Net message from client events at the top of the resource profile motivated a quick re-check of the collected data to ensure that the prominence of this between-calls event was not the result of data collection error. It wasn't. The SQL*Net message from client events and their durations were distributed uniformly throughout the trace file. These wait events were the results of thousands of database calls. When you add in the effect of the other SQL*Net event, SQL*Net more data from client , we had discovered the cause for over 70% of PYUGEN 's total response time.
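The evidence for this consisted of WAIT lines scattered throughout the raw trace file, of which the following are illustrative (not verbatim) examples. Note that in release 8.1.6 the ela field is expressed in centiseconds; from Oracle9i onward it is in microseconds.

 WAIT #4: nam='SQL*Net message from client' ela= 1 p1=... p2=... p3=...
 WAIT #4: nam='SQL*Net more data from client' ela= 12 p1=... p2=... p3=...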

12.1.2 Diagnosis and Response

You of course cannot ignore 70% of a program's response time, even if people do call the motivating events "idle." Idle or not, this time was part of someone's response time, so we needed to deal with it. If we hadn't collected our statistics so carefully (with proper time scope and proper program scope), then we would probably have seen much more SQL*Net message from client time in our data. If you make that particular collection error, then you must disregard the so-called idle wait time.

The top line of the resource profile was naturally the symptom we investigated first. Because this was a prepackaged application, we expected that the number of database calls would be difficult for us to manipulate, so we turned our attention to the duration-per-call column. Here, we found a number that looked suspiciously LAN-like (on the order of 0.010 seconds, as described in Chapter 10), not IPC-like (on the order of 0.001 seconds or less). So we reconfirmed that the PYUGEN batch program had indeed run on the database server host (alongside its corresponding Oracle kernel process) by checking the V$SESSION data collected automatically by Sparky upon collection activation (Example 12-2).

Example 12-2. The data that Sparky obtained from V$SESSION verified that the PYUGEN process was in fact running on the same host as its Oracle kernel process
   Oracle instance = prod (8.1.6.3.0)
              host = dalunix150.xyz.com (OSF1 V5.1)
           program = PYUGEN@dalunix150.xyz.com (TNS V1-V2) (session 611)
        trace file = /prod/u001/app/oracle/admin/prod/udump/ora_922341.trc
        line count = 1,760,351 (0 ignored, 0 oracle error)
                t0 = Wed Sep 12 2001 14:10:27 (388941433)
                t1 = Wed Sep 12 2001 14:43:32 (389139973)
 interval duration = 1,985.40s
      transactions = 672 (672 commits, 0 rollbacks)
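Sparky gathered this information automatically; absent such a tool, a query along these lines against V$SESSION yields the same facts. This is a sketch, and the PROGRAM filter is an assumption about how the process identifies itself.

 -- Which host is the client process running on, and what is its session?
 select sid, serial#, program, machine
   from v$session
  where program like 'PYUGEN%';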

Sure enough, the hostname reported to the right of the @ character in V$SESSION exactly matched the Node name reported in the preamble of the SQL trace file. PYUGEN had definitely run on the same host as the database server. So why would the PYUGEN program suffer from such large SQL*Net message from client latencies? We examined the system's tnsnames.ora file to find out. It turned out that, to conserve system administration effort, the system's managers had decided to use a single TNS alias system-wide. The batch jobs were using the same TCP/IP protocol adapter as the system's browser clients.

It was easy to devise a strategy that was perfectly acceptable in terms of system administration overhead. We could add a second alias to the existing tnsnames.ora file. The second alias would be identical to the existing alias except that it would have a different name, and it would use the syntax (PROTOCOL=BEQ) instead of (PROTOCOL=TCP). The customer would shut down the Oracle Applications Concurrent Manager and restart it, specifying the new alias that used the bequeath protocol adapter. The new tnsnames.ora file could be pushed out to everyone on the system without side effect. Everyone except the person who started the Concurrent Manager would use the same TNS alias as before.
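A sketch of what the resulting tnsnames.ora might contain is shown below. The host name and SID come from Example 12-2; the alias names, the listener port, and the PROGRAM/ARGV0/ARGS values of the bequeath entry are assumptions that vary by platform and release.

 # Existing alias, used by the browser clients (and, until now, the batch jobs).
 # The port is an assumption.
 PROD =
   (DESCRIPTION =
     (ADDRESS = (PROTOCOL = TCP)(HOST = dalunix150.xyz.com)(PORT = 1521))
     (CONNECT_DATA = (SID = prod))
   )

 # Hypothetical second alias for the Concurrent Manager: identical except for
 # its name and the bequeath protocol adapter, which bypasses the network stack.
 PROD_BEQ =
   (DESCRIPTION =
     (ADDRESS =
       (PROTOCOL = BEQ)
       (PROGRAM  = oracle)
       (ARGV0    = oracleprod)
       (ARGS     = '(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=BEQ)))')
     )
     (CONNECT_DATA = (SID = prod))
   )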

Before implementing this change, the customer ran a simple test. From a SQL*Plus session on the database server itself, he executed and timed a SELECT statement that required a few thousand database calls. He ran it once through a session established with the old alias, which used the TCP/IP protocol adapter, and again through a session established with the new alias, which used the bequeath protocol adapter. The test showed that the bequeath protocol adapter reduced SQL*Net message from client latencies to less than 0.001 seconds. We could expect this one change alone to eliminate at least 40% of the program's total response time, as shown in Figure 12-1.
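A test along these lines can be reproduced with a few SQL*Plus commands. The following is a hedged sketch rather than the customer's actual script: the connection strings, table, and row count are placeholders, and the small ARRAYSIZE serves only to force a fetch call (and hence a SQL*Net round trip) for nearly every row.

 -- On the database server host, run once through each alias:
 --   sqlplus user/password@PROD       (TCP/IP adapter)
 --   sqlplus user/password@PROD_BEQ   (bequeath adapter)
 set timing on
 set arraysize 1               -- roughly one fetch call per row
 select id                     -- placeholder column
   from some_table             -- placeholder table with a few thousand rows
  where rownum <= 5000;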

Figure 12-1. We could expect that reducing the per-call latency of SQL*Net message from client events from 0.010 seconds to 0.001 seconds would eliminate more than 40% of PYUGEN's response time
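The arithmetic behind that expectation follows directly from Example 12-1, assuming the per-call latency drops from the measured 0.010340 seconds to roughly 0.001 seconds over the bequeath adapter:

 \[ 95{,}161 \times (0.010340\,\mathrm{s} - 0.001000\,\mathrm{s}) \approx 888.8\,\mathrm{s}, \qquad \frac{888.8\,\mathrm{s}}{1{,}985.3\,\mathrm{s}} \approx 44.8\% \]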

We actually had reason to expect better than a 40% improvement. The second most important contributor to PYUGEN response time was another SQL*Net event called SQL*Net more data from client. The cause of this event was a sequence of parse calls that passed excessively long SQL text strings through SQL*Net from the client to the server (instead of using stored procedure calls to accomplish the same thing). The long SQL text strings wouldn't fit into a single SQL*Net packet, so the Oracle kernel spent a considerable amount of time awaiting second and subsequent SQL*Net packets during parse calls. Of course, because Oracle Payroll was a prepackaged application, our short-term hopes for reducing the number of executions of this event were dim. However, we had reason to believe that because some of the SQL*Net more data from client latency was network transport, the protocol adapter change would improve the performance of this event's executions as well.

12.1.3 Results

The bottom-line results were excellent. Payroll processing performance improved from 27 assignments per minute to 61 assignments per minute. The proposed tnsnames.ora change took 15 minutes to test and about a week to navigate through change control. Our whole engagement at the client lasted less than four hours. Of this time, two hours were consumed installing Sparky (which required a Perl upgrade on the database server host), and a little more than half an hour was consumed by letting the PYUGEN program run with level-8 extended SQL trace turned on. The remaining hour and a half covered the meeting, greeting, analysis, testing, and recap activities.

Oh yes... Why did the Payroll program get slower after the CPU upgrade? Not much Payroll program time was spent consuming CPU service, so the upgrade had very little direct positive effect upon PYUGEN. Most of the program's time was spent queueing for the network. Other programs ran at the same time as this Payroll job. The CPU upgrade made them faster, which compressed their network calls (whose number remained unchanged after the upgrade) into a smaller time window. The result was increased competition for the network during the Payroll run. Therefore, every network I/O call the Payroll program made was a little slower than before the CPU upgrade. The degradation in network response time overwhelmed the small direct improvement from the CPU time reduction, resulting in a net degradation of Payroll performance...not a good thing, because this Payroll program had a higher business priority than everything else on the system.
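One way to quantify this effect is with the simple M/M/1 response-time formula, in which R is the response time of a network call, S its service time, and rho the utilization of the network. The numbers below are illustrative only, not measurements from this system:

 \[ R = \frac{S}{1-\rho}, \qquad \text{for example, } \rho = 0.5 \Rightarrow R = 2S, \quad \rho = 0.7 \Rightarrow R \approx 3.3S \]

Compressing an unchanged number of network calls into a smaller time window raises rho, so every call queues longer even though S itself does not change.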

12.1.4 Lessons Learned

This case is a classic illustration of the following important points:

  • Don't let your V$ data tell you what your system's problem is. Your business should decide that. The real performance problem in your system is whatever is causing response time problems for your business's most important user action.

  • You can't extrapolate detail from an aggregate. You cannot necessarily determine what's wrong with an individual program by examining only the system-wide statistics for an instance.

  • Capacity upgrades are a riskier performance improvement activity than many people think. Not only can they waste a lot of money by being ineffective, but they can actually degrade performance for the very programs you're trying to improve.

  • It's nearly impossible to find and repair your performance problems with the old trial-and-error approach; there are simply too many things that might be the cause. Instead of checking everything that could conceivably be at fault, it's easy enough to ask your targeted user actions themselves what is consuming their response time.


   