12.3 Case 3: Large SQL*Net Event Duration

This case came to us in the final segment of a Hotsos Clinic. The game is that another instructor and I lecture for two and a half days, and then at the end of the course, we offer to open up people's trace files and try to diagnose them in a public forum. The theory is that we're brave enough to try to diagnose the trace file of anyone who is brave enough to show off their slow applications in public. It's great fun. The students get to see whether the techniques we've been talking about actually work in reality, and they get to practice their new ideas by shouting them out in the classroom. The submitter usually gets a serious problem fixed. And we get to see lots of very interesting application performance problems.

In this class, a very nice young lady who had sat at the back of the class handed us a CD on the final day of the course. The trace file on this disk, she explained, was from a purchased application built with PowerBuilder that had been slow for as long as she could remember. In fact, the company across the street used the same application and was having severe performance problems with it as well. At every local user group meeting, her story continued, she and the other users would routinely ask each other whether anyone had yet figured out how to make this thing run faster. Nobody had figured out why it was so slow.

No pressure.

12.3.1 Targeting

It seemed like an excellent opportunity to demonstrate the power of Method R: what better setup than a performance problem that people had looked at for years without solving? It was an opportunity straight out of the first few paragraphs of Chapter 1!

My heart sank when we looked at the resource profile for the file. What we saw is in Example 12-6: lots of SQL*Net message from client, and not really much else. The first sentence out of my mouth expressed my disappointment that we might not be able to help as much as we would like to, because this resource profile apparently included a lot of user think time, or time spent in a probably un-instrumented application server tier, or something like that.

Example 12-6. Resource profile for an application written in PowerBuilder
Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
SQL*Net message from client       166.6s  91.8%          6,094    0.027338s
CPU service                         9.7s   5.3%         18,750    0.000515s
unaccounted-for                     1.9s   1.1%
db file sequential read             1.6s   0.9%          1,740    0.000914s
log file sync                       1.1s   0.6%            681    0.001645s
SQL*Net more data from client       0.3s   0.1%             71    0.003521s
SQL*Net more data to client         0.1s   0.1%            108    0.001019s
free buffer waits                   0.1s   0.0%              4    0.022500s
db file scattered read              0.0s   0.0%             34    0.001176s
SQL*Net message to client           0.0s   0.0%          6,094    0.000007s
log file switch completion          0.0s   0.0%              1    0.030000s
latch free                          0.0s   0.0%              1    0.010000s
log buffer space                    0.0s   0.0%              2    0.005000s
direct path read                    0.0s   0.0%              5    0.000000s
direct path write                   0.0s   0.0%              2    0.000000s
----------------------------- ----------------- -------------- ------------
Total                             181.5s 100.0%

No, she asserted, she had in fact been awake for the duration of the course. She sincerely and patiently explained that she knew what collection error and think time were, and this trace file didn't have any. She had begun the extended SQL trace data collection immediately before a user clicked an OK button to initiate an online action, and she had stopped the data collection immediately after she had noticed that the system had returned the result, by having the user disconnect from the application. The user had actually waited about three minutes from button click to disconnect. This was a two-tier application, with no application middle tier and consequently no un-instrumented application server code. And furthermore, she had used our Sparky tool to collect the data.

Oh.

I had missed a clue, actually. In addition to the average duration per call information, our actual Hotsos Profiler output shows the minimum and maximum single wait time for each event. (It's too much information for me to show in the limited page width allowed for Example 12-6.) The longest wait for a SQL*Net message from client event had been on the order of a few seconds. And of course, the call count (# Calls in Example 12-6), had I paid more attention to it, was actually a big clue that this wasn't a collection error or a think time problem.

At the time, I really wasn't certain of how to attack the problem, so, with Jeff's lead, we started looking through our Hotsos Profiler output. Using the principle of forward attribution as our guide (Chapter 11), we looked for database calls that followed the SQL*Net message from client events. Hotsos has since modified the code so that this type of problem is very easy to solve within just a couple of minutes. I'll describe the diagnostic process in terms of an analysis that would take place today with the improved Hotsos Profiler.

12.3.2 Diagnosis and Response

To tie up any possible loose ends, we examined the SQL*Net message from client time a little more closely. Example 12-6 doesn't show it, because the pages of this book aren't wide enough, but the Hotsos Profiler output showed that the maximum SQL*Net message from client execution duration was 17.43 seconds. A quick search in the raw trace file for the string ela= 1743 (notice the blank space that the Oracle kernel emits between the = and the 1), revealed that there was actually a bit of collection error at the tail of the file. Sitting between two XCTEND lines was a SQL*Net message from client execution with an ela value of 17.43 seconds. The first commit had been the end of the user action. The second commit occurred when the user disconnected from the application. It had taken the ladies a few seconds to notice that the action had completed. After correcting for this little bit of collection error, the resource profile for the user action is the one shown in Example 12-7.
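If you want to try the same kind of search yourself, the following is a minimal sketch in Python. It simply prints each line of a raw trace file containing the string ela= 1743, along with one line of context on either side, which in this case is enough to show the surrounding XCTEND lines. The script and its file-name argument are illustrative, not part of the case.

# Print every "ela= 1743" line from a raw trace file, with one line of
# context on each side, to reveal the XCTEND lines around it.
import sys

lines = open(sys.argv[1]).readlines()
for i, line in enumerate(lines):
    if "ela= 1743" in line:
        for ctx in lines[max(0, i - 1):i + 2]:
            print(ctx.rstrip())
        print("-" * 40)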

Example 12-7. The resource profile from Example 12-6, after correcting for a 17.43-second collection error
Response Time Component                Duration        # Calls     Dur/Call
----------------------------- ----------------- -------------- ------------
SQL*Net message from client       149.2s  91.0%          6,093    0.024482s
CPU service                         9.7s   5.9%         18,750    0.000515s
unaccounted-for                     1.9s   1.2%
db file sequential read             1.6s   1.0%          1,740    0.000914s
log file sync                       1.1s   0.7%            681    0.001645s
SQL*Net more data from client       0.3s   0.2%             71    0.003521s
SQL*Net more data to client         0.1s   0.1%            108    0.001019s
free buffer waits                   0.1s   0.1%              4    0.022500s
db file scattered read              0.0s   0.0%             34    0.001176s
SQL*Net message to client           0.0s   0.0%          6,094    0.000007s
log file switch completion          0.0s   0.0%              1    0.030000s
latch free                          0.0s   0.0%              1    0.010000s
log buffer space                    0.0s   0.0%              2    0.005000s
direct path read                    0.0s   0.0%              5    0.000000s
direct path write                   0.0s   0.0%              2    0.000000s
----------------------------- ----------------- -------------- ------------
Total                             164.0s 100.0%

The next question to answer in this situation is, "Which SQL is responsible for the remaining SQL*Net message from client duration?" The Hotsos Profiler output provides the answer automatically, as shown in Example 12-8. [1]

[1] Unfortunately, tkprof and Trace File Analyzer provide no help on this type of problem. However, using the principle of forward attribution, you can find the top contributing statements by searching the raw trace data.
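The search described in the footnote can be scripted. The following is a minimal sketch in Python that attributes each SQL*Net message from client wait to the database call that follows it (forward attribution), summed by the statement hash value (hv) from the PARSING IN CURSOR lines. It assumes the trace-line layout used in this case, with ela values reported in centiseconds; the regular expressions are illustrative and may need adjusting for your Oracle release.

# Sum SQL*Net message from client time per statement, using forward attribution:
# each wait is charged to the next PARSE, EXEC, or FETCH call in the trace.
import re
import sys
from collections import defaultdict

wait_re   = re.compile(r"^WAIT #(\d+): nam='SQL\*Net message from client' ela= *(\d+)")
call_re   = re.compile(r"^(?:PARSE|EXEC|FETCH) #(\d+):")
cursor_re = re.compile(r"^PARSING IN CURSOR #(\d+) .*\bhv=(\d+)")

hv_for_cursor = {}                 # cursor number -> statement hash value
total_ela     = defaultdict(int)   # hash value    -> accumulated ela (centiseconds)
pending       = 0                  # wait time not yet attributed to a call

for line in open(sys.argv[1]):
    m = cursor_re.match(line)
    if m:
        hv_for_cursor[m.group(1)] = m.group(2)
        continue
    m = wait_re.match(line)
    if m:
        pending += int(m.group(2))         # hold it until the next database call
        continue
    m = call_re.match(line)
    if m and pending:
        hv = hv_for_cursor.get(m.group(1), "unknown")
        total_ela[hv] += pending           # attribute forward to this call's statement
        pending = 0

for hv, ela in sorted(total_ela.items(), key=lambda kv: -kv[1]):
    print("%-12s %8.2fs" % (hv, ela / 100.0))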

Example 12-8. The Hotsos Profiler identifies the contribution to SQL*Net message from client duration by SQL statement
SQL Statement Id             Duration
------------------ ------------------
1525010069              23.1s   15.5%
2038166283              18.7s   12.5%
1966856986              17.6s   11.8%
1547563725              13.9s    9.3%
3230460720              10.8s    7.2%
77 others               65.1s   43.6%
------------------ ------------------
Total                  149.2s  100.0%

From the percentages shown in the list of SQL statements in Example 12-8, you can see that no single SQL statement accounts for a disproportionately large part of the SQL*Net message from client contribution. Therefore, to make a big impact on the reduction of SQL*Net message from client time, it will actually be necessary in this case to look at more than just one SQL statement. However, we began by looking at the first one, whose statement text and statement statistics are shown in Example 12-9.

Example 12-9. SQL text and performance statistics for statement 1525010069, the top contributor to the session's SQL*Net message from client duration
Statement Text
INSERT INTO STAGING_AREA (
  DOC_OBJ_ID, TRADE_NAME_ID, LANGUAGE_CODE, OBJECT_RESULT, GRAPHIC_FLAG,
  USER_LAST_UPDT, TMSP_LAST_UPDT
) VALUES (
  1000346, 54213, 'ENGLISH', '<BLANK>', 'N', 'sa',
  TO_DATE('11/05/2001 16:40:54', 'MM/DD/YYYY HH24:MI:SS')
)

Statement Cumulative Database Call Statistics
Cursor  Action         ------ Response Time -------         LIO        PIO
Action   Count   Rows    Elapsed       CPU    Other      Blocks     Blocks
------- ------ ------  --------- --------- --------  ---------- ----------
Parse      696      0        0.9       0.8      0.1           0          0
Execute    348    348        1.7       1.6      0.0       5,251        351
Fetch        0      0        0.0       0.0      0.0           0          0
------- ------ ------  --------- --------- --------  ---------- ----------
Total    1,044    348        2.6       2.4      0.1       5,251        351

Per Exe      1      1        0.0       0.0      0.0          15          1
Per Row      1      1        0.0       0.0      0.0          15          1

Though I've not shown it in the output here, the Hotsos Profiler revealed that there were 347 "similar" statements in the trace file. The Profiler defines two SQL statements as similar if and only if the statements are identical (as regarded by the Oracle kernel) except for the values of their literals. Oracle's tkprof would not have aggregated the unshared SQL statements this way; instead, it would have listed all 348 distinct SQL texts as each consuming a very small amount of capacity. This, of course, makes analysis a bit more difficult, but all the information you need to determine that the statements should have been sharable is present in the SQL trace file.
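This kind of grouping is easy to approximate yourself. The following sketch is not the Hotsos Profiler's actual algorithm; it's a minimal illustration of the idea, which replaces literal values with placeholders so that "similar" statements normalize to the same text. The regular expressions and sample statements are illustrative only.

# Group "similar" SQL statements by normalizing their literal values.
import re

def normalize(sql_text):
    """Reduce a statement to a sharable form: literal values become placeholders."""
    s = re.sub(r"'[^']*'", ":s", sql_text)        # quoted string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", ":n", s)       # numeric literals
    return re.sub(r"\s+", " ", s).strip()         # collapse whitespace

# Two of the 348 INSERT statements differ only in their literal values,
# so they normalize to the same text and can be aggregated together.
a = "INSERT INTO STAGING_AREA (DOC_OBJ_ID) VALUES (1000346)"
b = "INSERT INTO STAGING_AREA (DOC_OBJ_ID) VALUES (1000401)"
assert normalize(a) == normalize(b)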

The data in Example 12-9 make it clear why there were so many distinct SQL texts. The statement uses several literal values instead of placeholder ("bind") variables:

1000346
54213
'ENGLISH'
'<BLANK>'
'N'
'sa'
'11/05/2001 16:40:54'

The value that sticks out immediately as completely unsharable is the date value in the final variable position. This value corresponds to a column called TMSP_LAST_UPDT, the timestamp of the last update. How many times would you expect an application ever to reuse the exact SQL text that contains a hard-coded timestamp value with one-second resolution?

Right, you would expect the answer to be zero, but actually there's a little twist to the answer. This particular application reused the statement exactly once (that is, the application used this statement twice in total). Notice the action count for parses and executions of this statement (and its similar statements): the 348 statements accounted for a total of 696 parse calls and 348 executions. That's right, exactly twice as many parses as executions! In Section 11.3.2, I admonish you to extract parse calls from within loops so that an application can reuse the cursor prepared by a single parse call many times. The Hotsos Profiler output shows (it's not visible in Example 12-9) that for the 696 parse calls, there are 348 misses on the library cache (from the mis statistic of the raw trace data). What's happening here is that, incredibly, this application actually parses each of these SQL statements twice for each execute call, as shown in Example 12-10.

Example 12-10. An application that parses twice for every execution...a really bad idea
# REALLY BAD, unscalable application code
for each v in (897248, 897249, ...) {
  c = parse("select ... where orderid = ".v);  # just ignore the result
  c = parse("select ... where orderid = ".v);  # do the same parse again
  execute(c);
  data = fetch(c);
  close_cursor(c);
}

The first parse call for each statement results in a library cache miss (a "hard" parse), and the second parse call for each statement results in a library cache hit (a "soft" parse).

At this point, it's important to remember our overall goal. Executions of the SQL*Net message from client wait event dominate user action response time. Example 12-7 shows that this user action executes a total of 6,093 such events, for a total response time contribution of 149.2 seconds, each consuming an average of 0.024482 seconds per call. SQL*Net message from client is a wait event that the Oracle kernel executes between database calls. Therefore, eliminating database calls will eliminate some response time attributable to SQL*Net message from client events. Happily, we had just found an opportunity to eliminate 348 parse calls. Simply find a way to stop parsing twice for every execute call. The expected savings: 348 calls × 0.024482 seconds per call, or about 8.5 seconds for the session (about 5% of the session's total response time).

Not a huge start, but we're not done yet. As I mention in Chapter 11, by extracting a parse call from within a loop, the student should be able to eliminate all but one parse call. Using bind variables and making the code look like the scalable application code shown in Example 10-2 will result in the elimination of 695 unnecessary parse calls. Expected total impact to our student's user action: 695 calls × 0.024482 seconds per call, or about 17 seconds, roughly 10% of total session response time.
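To make the recommendation concrete, here is a minimal sketch of the parse-once, bind, execute-many pattern. It is written in Python with the cx_Oracle driver purely for illustration; the real application was built with PowerBuilder, and the connect string and row data below are hypothetical. The column names come from Example 12-9.

# Parse the statement once, outside the loop, then execute it repeatedly
# with new bind values. One parse call instead of 696.
import datetime
import cx_Oracle

conn = cx_Oracle.connect("app_user", "app_password", "dbhost/ORCL")  # hypothetical
cur = conn.cursor()

cur.prepare("""
    INSERT INTO STAGING_AREA
      (DOC_OBJ_ID, TRADE_NAME_ID, LANGUAGE_CODE, OBJECT_RESULT,
       GRAPHIC_FLAG, USER_LAST_UPDT, TMSP_LAST_UPDT)
    VALUES
      (:doc_obj_id, :trade_name_id, :language_code, :object_result,
       :graphic_flag, :user_last_updt, :tmsp_last_updt)
""")

rows_to_stage = [  # hypothetical data; the real application builds these rows itself
    {"doc_obj_id": 1000346, "trade_name_id": 54213, "language_code": "ENGLISH",
     "object_result": "<BLANK>", "graphic_flag": "N", "user_last_updt": "sa",
     "tmsp_last_updt": datetime.datetime(2001, 11, 5, 16, 40, 54)},
    # ... 347 more rows
]

for row in rows_to_stage:
    cur.execute(None, row)   # re-executes the already-parsed statement with new binds

conn.commit()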

We wondered how many other SQL statements might be suffering from the same problems as the first one we had examined. By visiting the detailed statistics for all the SQL statements in the Hotsos Profiler output, we found that over 3,000 total database calls should be candidates for elimination. Expected total impact to the user action: over 73 seconds, or about 45% of the session's total response time.

But even that's not all. Remember, the user clicked OK once and then waited over 180 seconds for a result, with no opportunity for further input provided to the application. But look again at Example 12-9. It is a single-row INSERT statement, executed 348 times, manipulating a grand total of 348 rows. Why would the application need to make 348 database calls to insert 348 rows? Oracle provides an array insert capability that might reduce the number of insert calls from 348 to perhaps 4 (if the application could use an array size of 100 rows). If this database call reduction could be implemented, it would reduce the session's total database call count further by more than 650 calls. Expected additional impact to the user action: about 16 seconds, or 10% more of the session's total response time.
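Continuing the same illustrative sketch, an array interface such as cx_Oracle's executemany() sends a whole batch of rows per execute call, so 348 rows can travel in roughly 4 round trips when the array size is 100. As before, the connect string and sample data are hypothetical, and the array size of 100 is simply the figure used in the text.

# Array insert: one execute call carries up to 100 rows instead of one.
import datetime
import cx_Oracle

conn = cx_Oracle.connect("app_user", "app_password", "dbhost/ORCL")  # hypothetical
cur = conn.cursor()

insert_sql = """
    INSERT INTO STAGING_AREA
      (DOC_OBJ_ID, TRADE_NAME_ID, LANGUAGE_CODE, OBJECT_RESULT,
       GRAPHIC_FLAG, USER_LAST_UPDT, TMSP_LAST_UPDT)
    VALUES (:1, :2, :3, :4, :5, :6, :7)
"""

rows = [  # hypothetical sample of the 348 rows the application inserts
    (1000346, 54213, 'ENGLISH', '<BLANK>', 'N', 'sa',
     datetime.datetime(2001, 11, 5, 16, 40, 54)),
    # ... 347 more tuples
]

ARRAY_SIZE = 100                              # 348 rows -> 4 execute calls
for i in range(0, len(rows), ARRAY_SIZE):
    cur.executemany(insert_sql, rows[i:i + ARRAY_SIZE])
conn.commit()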

If all the proposed database call eliminations could be implemented, the total savings would amount to the elimination of more than 3,650 calls, for a grand total of roughly 89 seconds saved, or a 54% reduction in response time. If the database calls could be eliminated, the user could expect a response time improvement from 164 seconds to about 75 seconds. It is possible that if a lot of users run applications like this one simultaneously, then a lot of the network subsystem's capacity is eaten up by wasted database calls. If this is the case, then eliminating those wasteful calls might reduce network queueing delays so that the 0.024-second latency of SQL*Net message from client events might actually drop to about 0.015 seconds. If this were to happen, then the approximately 2,400 SQL*Net message from client events that remain after optimization might consume only about 37 seconds, which would represent a grand total of about 110 seconds' worth of response time reduction.

12.3.3 Results

When we assess trace files in class, we don't always get to see the end result of our recommendations. This case is one such example. Because the user action in question was part of a packaged application, the three manipulations that we suggested required vendor participation. The student did pass our suggestions along to the vendor:

  • Don't parse each statement twice for every execution, which would cut the number of parse calls for many cursors in half.

  • Use bind variables and extract parse calls from loops to reduce the number of parse calls further to one per cursor.

  • Reduce overall database call counts by using array processing instead of processing only one row at a time.

As I write this chapter, the student and I are still waiting to see what might happen. Her company is planning to upgrade to the vendor's next release shortly after this book goes to print. We'll keep you posted on the Web.

12.3.4 Lessons Learned

This case highlights the following important ideas:

  • You can't just ignore SQL*Net message from client wait events. When they contribute significantly to the response time of a properly targeted user action, you have to pay attention to them.

  • Too many database calls can ruin your performance, even when your SQL is just fine. In this case, the SQL might have in fact been wasteful. But fixing it before fixing the problem with too many database calls would have produced unnoticeably small results.


   