Chapter 7: Interpreting Common Latency-Related Wait Events


There are hundreds of wait events in an Oracle Database, but the nice thing about OWI is that you need to know only a handful of common wait events to cover most of your tuning needs. So far, you have been exposed to 10 common I/O- and lock-related wait events. This chapter covers another 6 common latency-related wait events. It equips you to diagnose and correct problems related to: log file sync, log buffer space, free buffer waits, write complete waits, log file switch completion, and log file switch (checkpoint incomplete) . Oracle Real Application Clusters (RAC) related events are covered in Chapter 8.

log file sync

The log file sync wait event has one parameter: buffer#. In Oracle Database 10 g , this wait event falls under the Commit wait class. Keep the following key thoughts in mind when dealing with the log file sync wait event.

  • The log file sync wait event is related to transaction terminations (commits or rollbacks ).

  • When a process spends a lot of time on the log file sync event, it is usually indicative of too many commits or short transactions.

Common Causes, Diagnosis, and Actions

Oracle records both the transactions and the block changes in the log buffer in SGA, a method known as physiological logging. The LGWR process is responsible for making room in the log buffer by writing the content to log files at various schedules, including:

  • At every three seconds

  • When the log buffer is 1/3 full or has 1MB worth of redo entries

  • When a user commits or rolls back a transaction

  • When signaled by the DBWR process (write ahead logging)

Writes that are initiated by user commits and rollbacks are known as sync writes; the rest are known as background writes . A log file sync wait is related only to the sync write. In other words, a user process may be processing a large transaction and generating a lot of redo entries that trigger the LGWR to perform background writes, but the user session never has to wait for the background writes to complete. However, as soon as the user session commits or rolls back its transaction and the _WAIT_FOR_SYNC parameter is TRUE, the process posts the LGWR and waits on the log file sync event for the LGWR to flush the current redo entries, including the commit marker, to log files. During this log synchronization, the LGWR process waits for the sync write to complete on the log file parallel write event while the user session waits for the synchronization process to complete on the log file sync event.

Once a process enters the log file sync wait, there are two possible exits. One is when the foreground process is posted by the LGWR when the log synchronization is complete. The other is when the wait has timed out (typically at 1 second), at which point the foreground process checks the current log SCN (System Change Number) to determine if its commit has made it to disk. If it has, the process continues processing, otherwise the process re-enters the wait.

Note ‚  

You must never set the parameter _WAIT_FOR_SYNC to FALSE, not even in a development or test database, because committed transactions are not guaranteed to be recoverable upon instance failure. People use this feature to cheat on benchmarks.

Typically, a log file sync wait is a very uneventful event. It is brief and hardly noticeable by the end user. However, a multitude of them can contribute to longer response time and chalk up noticeable wait statistics in both the V$SYSTEM_EVENT and V$SESSION_EVENT views. Use the following query to find the current sessions that spend the bulk of their processing time waiting on the log file sync event since logon. Evaluate the TIME_WAITED with the USER_COMMITS and HOURS_CONNECTED. To discover who performs a lot of commits between two specific points of time, you can compute the delta using the example shown in the ‚“How To Use V$SYSTEM_EVENT View ‚½ section in Chapter 2.

 select a.sid, 
a.event,
a.time_waited,
a.time_waited / c.sum_time_waited * 100 pct_wait_time,
d.value user_commits,
round((sysdate - b.logon_time) * 24) hours_connected
from v$session_event a, v$session b, v$sesstat d,
(select sid, sum(time_waited) sum_time_waited
from v$session_event
where event not in (
'Null event',
'client message',
'KXFX: Execution Message Dequeue - Slave',
'PX Deq: Execution Msg',
'KXFQ: kxfqdeq - normal deqeue',
'PX Deq: Table Q Normal',
'Wait for credit - send blocked',
'PX Deq Credit: send blkd',
'Wait for credit - need buffer to send',
'PX Deq Credit: need buffer',
'Wait for credit - free buffer',
'PX Deq Credit: free buffer',
'parallel query dequeue wait',
'PX Deque wait',
'Parallel Query Idle Wait - Slaves',
'PX Idle Wait',
'slave wait',
'dispatcher timer',
'virtual circuit status',
'pipe get',
'rdbms ipc message',
'rdbms ipc reply',
'pmon timer',
'smon timer',
'PL/SQL lock timer',
'SQL*Net message from client',
'WMON goes to sleep')
having sum(time_waited) > 0 group by sid) c
where a.sid = b.sid
and a.sid = c.sid
and a.sid = d.sid
and d.statistic# = (select statistic#
from v$statname
where name = 'user commits')
and a.time_waited > 10000
and a.event = 'log file sync'
order by pct_wait_time, hours_connected;

SID EVENT TIME_WAITED PCT_WAIT_TIME USER_COMMITS HOURS_CONNECTED
---- ------------- ----------- ------------- ------------ ---------------
423 log file sync 13441 84.9352291 62852 15
288 log file sync 13823 85.3535042 63007 15
422 log file sync 13580 85.7648099 62863 15
406 log file sync 13460 87.0239866 62865 15
260 log file sync 13808 87.0398386 62903 15
Note ‚  

In a RAC environment, the LMS background process may register waits on the log file sync event. This is due to the write ahead logging mechanism. When a foreground process requests a block from another instance, all the redo entries associated with the block must be flushed to disk prior to transferring the block. In this case, the LMS background process initiates the log synchronization and waits on the log file sync event.

The following sections discuss the three main causes of high log file sync waits.

High Commit Frequency

High commit frequency is the number one cause for foreground log file sync waits. Find out if the session that spends a lot of time on the log file sync event belongs to a batch or OLTP process or if it is a middle- tier (Tuxedo, Weblogic, etc.) persistent connection.

If the session belongs to a batch process, it may be committing each database change inside a loop. Discover the module name and ask the developer to review the code to see if the number of commits can be reduced. This is an application issue, and the solution is simply to eliminate unnecessary commits and reduce the overall commit frequency.

Some application developers have learned that if they commit infrequently, jobs may fail due to rollback segments running out of space, and they get calls in the middle of the night. Those who have been bitten by deadlocks may have been told to commit more frequently. Naturally, they become commit-happy people. The right thing to do is to properly define what a transaction is and commit at the end of each transaction. A transaction is a unit of work. A unit of work should either succeed or fail in its entirety. The proper place for a commit or rollback is at the end of each unit of work. Do not introduce additional commits for the sake of rollback segments space or deadlocks. That is treating the symptom and not the problem. If none of the existing rollback segments can handle the unit of work, then you as the DBA should provide one that will. (If you are using the Automatic Undo Management (introduced in Oracle9 i Database), then allocate enough space to the undo tablespace and set an appropriate undo retention time.)

Introducing additional commits can create other problems, among them, the infamous ORA-01555: snapshot too old error because the rollback (or undo) data can be overwritten. A high commit frequency also increases the overhead that is associated with starting and ending transactions. At the beginning of each transaction, Oracle assigns a rollback segment (called a rollback segment binding) and updates the transaction table in the rollback segment header. The transaction table must also be updated at the end of each transaction, followed by a commit cleanout activity. Updates to the rollback segment headers must also be recorded in the log buffer because the blocks are dirtied. Therefore, make the necessary adjustments in the application so that it only commits at the end of a unit of work. The commit statement that is inside a loop may need to be moved out of the loop so that the job commits only once instead of once in every loop iteration.

If the session that spends a lot of time on the log file sync event is a persistent connection from a middle-tier layer, then this is a tough case because it services many front-end users. You have to trace the session with event 10046 and observe the application behavior over time. Look for the log file sync event in the trace file. It may give you an idea of the commit frequency. Alternatively, you can mine the redo log files with Oracle Log Miner. This will show you the systemwide commit behavior.

In OLTP databases, you normally notice a high log file sync wait time at the system level (V$SYSTEM_EVENT) but not at the session level. The high system-level wait time may be driven by many short transactions from OLTP sessions that actively log in and out of the database. If this is your scenario, then about the only thing you can do in the database is to ensure a smooth I/O path for the LGWR process. This includes using asynchronous I/O and putting your log files on raw devices or an equivalent, such as the Veritas Quick I/O, that is serviced by dedicated I/O controllers ‚ or better yet, using solid state disks for the log files. (James Morle has an excellent article and benchmark numbers on solid state disks for redo logs at http://www.oaktable.net/fullArticle.jsp?id=5) However, you only need to do this if the log file parallel write average wait time is bad. Otherwise, there isn ‚ t much benefit. The real solution has to come from the application.

Slow I/O Subsystem

Query the V$SYSTEM_EVENT view as follows to discover the LGWR average wait time ( log file parallel write event). An average wait time of 10ms (1 centisecond) or less is generally acceptable.

 select * 
from v$system_event
where event in ('log file sync','log file parallel write');

EVENT TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT
------------------------ ----------- -------------- ----------- ------------
log file parallel write 6411138 0 3936931 .614076783
log file sync 6597655 9819 6534755 .99046631

A higher system I/O throughput can improve the average wait time of the log file sync and log file parallel write events. However, this is not an excuse for not fixing the application if it is poorly designed and is committing too frequently. You may be tempted to mess with the database layout and I/O subsystem, perhaps because it is difficult to deal with the application group or it is a third-party application. However, just because it is a third-party application does not give the vendor the right to throw junk code at your database, right? Remember, you cannot solve an application problem from the database. Any change you make in the database may give your users the impression that it is a database problem. Besides, the problem won ‚ t go away because the root cause is in the application.

That said, there are many things that you and your system administrator can do to provide some relief to the log file sync wait by increasing the I/O throughput to the log files. This includes using fiber channel (FC) connection to databases on SAN (storage area network); gigabit Ethernet (Gig-E) or Infiniband connection to databases on NAS (network attached storage); ultrawide SCSI or FC connection to databases on DAS (direct attached storage); private networks; high-speed switches; dedicated I/O controllers; asynchronous I/O; placing your log files on raw device and binding the LUN in RAID 0 or 0+1 instead of RAID 5, and so on. The questions you have to ask are which business unit is paying for the new hardware and how much faster the commits will be. Let ‚ s take the preceding listing, for example. The log file parallel write average wait baseline is 6.1ms, and the log file sync average wait baseline is 9.9ms. What numbers can you expect to get with the new hardware ‚ 25 percent improvement, 50 percent improvement? How does a 50 percent improvement in the average log file sync wait translate to the session response time? Will the users be happy? You must weigh the hardware cost with the potential benefits.

In practice, you often have to compromise because of company policies and because you don ‚ t make hardware decisions. Your log files may be on RAID 5 LUNs that are shared with other systems. This is especially true in SAN environments. The storage companies are not helping either by rolling out large capacity disk drives . Most likely, your management makes storage decisions based on cost per MB instead of performance. As a result, you end up with fewer disks. Furthermore, you may be required to manage disk utilization into the 90 percentiles. Believe us, we feel your pain!

Oversized Log Buffer

Depending on the application, an oversized log buffer (that is, greater than 1MB) can prolong the log file sync waits. A large log buffer reduces the number of background writes and allows the LGWR process to become lazy and causes more redo entries to pile up in the log buffer. When a process commits, the sync write time will be longer because the LGWR process needs to write out many more redo entries. Your log buffer size should not be so small that it causes sessions to start waiting on the log buffer space event or so large that it prolongs the log file sync waits. On one hand, you want a larger log buffer to cope with the log buffer space demands during redo generation bursts, especially immediately after a log switch. On the other hand, you want a smaller log buffer to increase background writes and reduce the log file sync wait time. How can you satisfy these competing requirements? The answer is the _LOG_IO_SIZE parameter, which defaults to 1/3 of the LOG_BUFFER or 1MB, whichever is less. It is one of the thresholds that signal the LGWR process to start writing. A proper _LOG_IO_SIZE setting will let you have your cake and eat it too. In other words, you can have a large log buffer, but a smaller _LOG_IO_SIZE will increase background writes, which in turn reduces the log file sync wait time. However, as mentioned in Chapter 5, this method is not overhead free. A more active LGWR will put a heavier load on the redo copy and redo writing latches. You may refer back to the log file parallel write wait event section in Chapter 5.

By default, the _LOG_IO_SIZE parameter is reduced to 1/6 of the LOG_BUFFER size in Oracle Database 10 g . This is because the default value of the _LOG_PARALLELISM_MAX parameter is 2 when the COMPATIBLE is set to 10.0 or higher. The default value of the _LOG_IO_SIZE in Oracle Database 10 g is discovered by dividing the LOG_BUFFER size with the log block size (LEBSZ) and the value of kcrfswth , as shown next :

 orcl> oradebug setospid 14883 
Oracle pid: 5, Unix process pid: 14883, image: oracle@aoxn1000 ( LGWR )
orcl> oradebug unlimit
Statement processed.
orcl> oradebug call kcrfw_dump_interesting_data
Function returned 90F42748
orcl> oradebug tracefile_name
/orahome/admin/orcl/bdump/orcl_lgwr_14883.trc
orcl> !grep kcrfswth /orahome/admin/orcl/bdump/orcl_lgwr_14883.trc
kcrfswth = 1706

orcl> select value from v$parameter where name = 'log_buffer';
VALUE
---------------
5242880

orcl> select max(LEBSZ) from x$kccle;
MAX(LEBSZ)
----------
512

orcl> select 5242880/512/1706 from dual;
5242880/512/1706
----------------
6.00234467
Note ‚  

A larger PROCESSES parameter value also increases the waits for log file sync. During every sync operation, the LGWR has to scan the processes ‚ data structure to find out which sessions are waiting on this event and write their redo to disk. Lowering the number of PROCESSES can help to reduce the overall log file sync wait. Use the V$RESOURCE_LIMIT view for guidance. According to Oracle, this problem is fixed in Oracle9i Database Release 2.




Oracle Wait Interface
Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning (Osborne ORACLE Press Series)
ISBN: 007222729X
EAN: 2147483647
Year: 2004
Pages: 114

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net