Section 12.4. Case 4: Large Read Event Duration | Optimizing Oracle Performance

12.4 Case 4: Large Read Event Duration

Regularly, we get nice letters from people who are interested in letting us know that our method and tools are helping. This case is the result of one such letter from a friend in Iceland who reduced the response time of a query from 6.5 hours to 10.9 seconds ”and fixed a previously undiscovered functional bug ”by adding four bytes to its SQL text. This is the story of how Method R helped him identify the SQL statement that was causing the problem. In the end, he improved the response time of an important batch job from nearly eight hours to just one hour .

12.4.1 Targeting

Application targeting in this case was typical. The system owner was a bank. One of the application's batch jobs was taking so long to run that it was unable to finish in its assigned batch window. The job started at 11:00 p.m. each night, and it would sometimes run until noon the next day. The bank was limiting the number of accounts to update so that the batch job would finish before opening hours. So our friend targeted this batch job for the collection of extended SQL trace data. Example 12-11 shows the resource profile for a run that consumed nearly eight hours of run time.

Example 12-11. Resource profile for a batch job that consumed almost eight hours

 Response Time Component                Duration        # Calls     Dur/Call ----------------------------- ----------------- -------------- ------------ db file scattered read         19,051.1s  68.4%      1,828,249    0.010420s CPU service                     6,889.3s  24.7%        959,148    0.007183s db file sequential read         1,892.7s   6.8%        406,417    0.004657s latch free                         29.0s   0.1%          1,071    0.027106s log file switch completion          1.6s   0.0%             14    0.112143s SQL*Net message from client         0.3s   0.0%             10    0.034000s log buffer space                    0.1s   0.0%              1    0.100000s log file sync                       0.1s   0.0%              4    0.022500s file open                           0.1s   0.0%             54    0.001296s buffer busy waits                   0.0s   0.0%             14    0.002143s undo segment extension              0.0s   0.0%          2,111    0.000014s SQL*Net message to client           0.0s   0.0%             10    0.000000s ----------------------------- ----------------- -------------- ------------ Total                          27,864.4s 100.0%

The top lines of Example 12-11 bear the distinct signature of inefficient SQL: lots of file reading and CPU capacity consumption. The next task is to determine which SQL statements are consuming so many resources. In this case, executing tkprof with the option sort =prsdsk,exedsk,fchdsk will produce a report with the SQL having the largest number of PIO blocks at the top.

Notice, however, that basing your tkprof sort order upon the PIO block count is not the same thing as sorting by total PIO call duration . What you really want is statements sorted by total I/O call duration, but tkprof does not provide this information unless you're using the 9 i version. Therefore, when you use tkprof , you have to examine your output visually to make sure you have identified the SQL statement that you really want to analyze.

The analyst in this story knew which SQL statement contributed the most to the db file scattered read problem, because he used the Hotsos Profiler contribution table shown in Example 12-12.

Example 12-12. The Hotsos Profiler identifies the contribution to db file scattered read duration by SQL statement

 SQL Statement Id             Duration ------------------ ------------------ 1163242303          19,028.9s   99.9% 1626975503               6.7s    0.0% 808413641                5.2s    0.0% 3187134541               3.7s    0.0% 1594054818               2.4s    0.0% 8 others                 4.3s    0.0% ------------------------------------- Total               19,051.1s  100.0%

12.4.2 Diagnosis and Repair

The hunt thus progressed quickly to the performance analysis of the SQL statement 1163242303 (this is the statement's hv value in the PARSING IN CURSOR section of the raw trace data). Example 12-13 shows the text and performance statistics for this statement.

Note that hv is only almost unique (that is, hv is not unique). Two different SQL statements can share the same hv value. You won't see it very often, but it can happen.

Example 12-13. SQL text and performance statistics for statement 1163242303, the top contributor to the session's db file scattered read duration

 Statement Text SELECT EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR  FROM V_SKULDABREF_AVOXTUN  WHERE EIGANDI = :b1  AND INNLENT_ERLENT = :b2  AND ((RAFVAETT = :b3 )  OR ((RAFVAETT = :b4 )  AND (INNLAUSN IS NULL  OR INNLAUSN > :b5 )  AND (VIDMIDDAGS <= :b6 )))  GROUP BY EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR  ORDER BY EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR     Statement Cumulative Database Call Statistics  Cursor  Action         ------ Response Time -------         LIO        PIO Action   Count   Rows    Elapsed       CPU    Other      Blocks     Blocks ------- ------ ------  --------- --------- --------  ---------- ---------- Parse    3,739      0        1.9       0.7      1.2         147         17 Execute  3,739      0        1.7       1.6      0.2           0          0 Fetch    4,212    473   23,466.4   4,135.6 19,330.8  36,566,201 36,550,345 ------- ------ ------  --------- --------- --------  ---------- ---------- Total   11,690    473   23,470.0   4,137.9 19,332.1  36,566,348 36,550,452     Per Exe     1       0        6.3       1.1      5.2       9,780      9,092 Per Row     8       1       49.6       8.8     40.9      77,307     71,868

The SQL text for this statement is not too difficult to understand. In spite of the Icelandic object names , the statement is simply a query from a single object. There's apparently not even a join. However, the statement's execution plan obtained from the trace file's STAT lines and shown in Example 12-14 reveal a different story.

Something a little more complicated than a simple one-table query is going on here. In fact, V_SKULDABREF_AVOXTUN is a view. The raw trace file confirms it. The parse call for the SELECT shown in Example 12-13 required a recursive parse, execute, and fetch calls against VIEW$ , in the same manner as the query from DBA_OBJECTS that I describe in Chapter 5. The definition of the view V_SKULDABREF_AVOXTUN becomes the next target of our attention. One thing that an Oracle extended SQL trace file does not contain is the definition of each view accessed by SQL identified within the trace file. However, because the trace file makes it clear that V_SKULDABREF_AVOXTUN is a view, it's a simple enough matter to query DBA_VIEWS to determine the definition. Example 12-15 shows that definition.

Example 12-14. The execution plan for the time-consuming SELECT statement before optimization

 Rows Row Source Operation (Object Id) --------------- ----------------------------------------------------------             473 SORT ORDER BY             473  SORT GROUP BY             974   VIEW V_SKULDABREF_AVOXTUN             974    SORT UNIQUE             974     UNION-ALL             686      HASH JOIN             886       TABLE ACCESS FULL SKULDABREF_AVOXTUN(21435)         103,247       TABLE ACCESS FULL VBR_FLOKKAR(19409)             288      FILTER             288       HASH JOIN             940        TABLE ACCESS BY INDEX ROWID BREF(19460)           2,649         INDEX RANGE SCAN(20593)          10,744        TABLE ACCESS FULL VBR_FLOKKAR(19409)

Example 12-15. The definition of the view behind the performance problem

 CREATE OR REPLACE VIEW v_skuldabref_avoxtun (    eigandi,    innlent_erlent,    verdtrygging,    skuldflokkun,    vbrtegund,    flokkur,    rafvaett,    ostadladur,    brefnumer,    vidmiddags,    innlausn,    nafnverd,    kaupkrafa,    ees,    vextir ) AS select   s.eigandi,   s.innlent_erlent,   s.verdtrygging,   s.skuldflokkun,   s.vbrtegund,   s.flokkur,   'N' rafvaett,   nvl(f.ostadlad,'N') ostadladur,   s.brefnumer,   s.vidmiddags,   s.innlausn,   s.nafnverd,   s.kaupkrafa,   s.ees,   null vextir from fjastofn.vbr_flokkar f,      fja_pfm.skuldabref_avoxtun s where f.audkenni=s.flokkur and       f.rafvaett is null  union  select    s.eig eigandi,    'I',    Fja_pfm.Ymis_Foll.TegundTryggingar(f.visitala) verdtrygging,    f.skuldaranumer skuldflokkun,    f.vbrtegund vbrtegund,    s.aud flokkur,    'J' rafvaett,    'N' ostadladur,    'x' brefnumer,    to_date(null) vidmiddags,    to_date(null) innlausn,    s.nav nafnverd,    0 kaupkrafa,    '+' ees,    null vextir from fjastofn.vbr_flokkar f,      fja_pfm.bref s where f.audkenni=s.aud and       f.rafvaett is not null /

The analyst discussed the view definition with a developer who understood its business purpose. From that discussion, the analyst and developer were able to prove that the view definition was flawed. Using a UNION of the two SELECT statements in the view definition instead of a UNION ALL was causing two problems:

It caused a very costly but unneeded SORT UNIQUE row source operation. (The new data that the Oracle release 9.2 kernel emits in the STAT lines would better highlight the enormous cost.)
It caused a bug as well, because the UNION erroneously eliminated rows that the application users needed.

Example 12-16 shows the execution plan of the statement shown in Example 12-13 after the view definition was corrected by inserting the bytes ALL into the view definition.

Example 12-16. The execution plan for the time-consuming SELECT statement after optimizing the view definition

 Rows Row Source Operation (Object Id) --------------- ----------------------------------------------------------             473 SORT ORDER BY             473  SORT GROUP BY             974   VIEW V_SKULDABREF_AVOXTUN             974    UNION-ALL             686     HASH JOIN             886      INDEX RANGE SCAN(21436)         103,314      TABLE ACCESS FULL VBR_FLOKKAR(19409)             288     FILTER             288      HASH JOIN             940       INDEX RANGE SCAN(29887)          10,744       TABLE ACCESS FULL VBR_FLOKKAR(19409)

Example 12-17 shows the more exciting news. The query that had previously consumed 23,470.0 seconds of response time now consumes only 10.9 seconds. It produces the same (actually better) application output; it just consumes about 6.5 fewer hours to do it.

Example 12-17. SQL text and performance statistics for statement 1163242303, after the view definition change. Note that this is the same SQL text as shown in Example 12-13; only the underlying view definition has changed. Total time reduction for the statement: from over 23,000 seconds to just over 10 seconds

 Statement Text SELECT EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR  FROM V_SKULDABREF_AVOXTUN  WHERE EIGANDI = :b1  AND INNLENT_ERLENT = :b2  AND ((RAFVAETT = :b3 )  OR ((RAFVAETT = :b4 )  AND (INNLAUSN IS NULL  OR INNLAUSN > :b5 )  AND (VIDMIDDAGS <= :b6 )))  GROUP BY EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR  ORDER BY EIGANDI,INNLENT_ERLENT,VERDTRYGGING,SKULDFLOKKUN,VBRTEGUND,FLOKKUR     Statement Cumulative Database Call Statistics Cursor  Action         ------ Response Time -------         LIO        PIO Action   Count   Rows    Elapsed       CPU    Other      Blocks     Blocks ------- ------ ------  --------- --------- --------  ---------- ---------- Parse    3,722      0        2.0       0.6      1.4          44          1 Execute  3,722      0        1.3       1.4     -0.1          14          0 Fetch    4,195    473        7.6       2.8      4.8      44,764        792 ------- ------ ------  --------- --------- --------  ---------- ---------- Total   11,639    473       10.9       4.8      6.2      44,822        793     Per Exe      1      0        0.0       0.0      0.0          12          0 Per Row      8      1        0.0       0.0      0.0          95          2

12.4.3 Results

The results were stunning. Example 12-18 shows the resource profile for the job after optimization. Total response time for the eight-hour batch job dropped to slightly more than one hour.

Example 12-18. Resource profile for the same batch job that consumed almost eight hours (compare Example 12-11). After optimization, the job consumed only slightly more than one hour

 Response Time Component                Duration        # Calls     Dur/Call ----------------------------- ----------------- -------------- ------------ CPU service                     2,684.7s  73.9%        953,452    0.002816s db file sequential read           847.6s  23.3%         77,944    0.010874s unaccounted-for                    93.2s   2.6% db file scattered read              5.8s   0.2%            295    0.019627s log file switch completion          1.6s   0.0%              7    0.234286s latch free                          1.0s   0.0%            362    0.002873s file open                           0.1s   0.0%             49    0.002041s log file sync                       0.1s   0.0%              7    0.011429s buffer busy waits                   0.0s   0.0%              1    0.010000s SQL*Net message from client         0.0s   0.0%             10    0.001000s SQL*Net message to client           0.0s   0.0%             10    0.000000s ----------------------------- ----------------- -------------- ------------ Total                           3,634.1s 100.0%

Before correcting the view definition for V_SKULDABREF_AVOXTUN , the bank had restricted the number of accounts upon which they would allow the job to run. Otherwise, the job's execution would violate the morning's online window by several hours. After making the correction, they can feed virtually any parameters they want to the query and it will not take considerably longer to run. Since optimization, the batch job has never even gotten close to the eight-hour runtime that the "before" job required.

12.4.4 Lessons Learned

This case illustrates the following points:

The resource profile pattern of large db file ... and CPU service durations usually indicates the use of inefficient SQL somewhere within the user action. To fix the problem, you have to find that SQL.
The extended SQL trace file contained exactly the information that the analyst needed to optimize the targeting process for finding the root cause of his batch job's performance problem. Even though the view definition was not present in the trace data, the trace file contained information that focused our attention upon the view definition as the source of the performance problem.
Sometimes, the scrutiny of performance analysis exposes functional bugs . In this case, the performance analyst was able to determine that not only was a specific query doing more work than it should have, it was actually returning an incomplete result set in some circumstances.

Top