Chapter 8: Wait Events in a Real Application Clusters Environment


In recent years, you may have heard Oracle say that Real Application Clusters (RAC) delivers high availability, fault tolerance, and an excellent return on your hardware investment, and that the use of RAC is on the rise. As a DBA, you may already be supporting a RAC environment. This chapter introduces Oracle wait events in a RAC environment and shows you how to identify and resolve bottlenecks using wait events specific to RAC. We will focus on the global cache waits because they affect the entire cluster.

What's So Special About Waits in Real Application Clusters (RAC)?

Before we begin discussing wait events in a RAC environment, you need to understand how the buffer cache works in a RAC environment.

In a single database instance environment, there will be only one set of shared memory segments for the database. All I/O operations, SQL processing, and library cache operations are routed through one set of shared memory structures. In other words, the buffer cache and shared pool are local to that particular instance; at any point in time the processes attached to that particular instance will be accessing only one set of memory structures.

However, in the RAC environment, the scenario is entirely different. Multiple instances share the same database, and these instances typically run on different servers or nodes. The buffer cache is split among these instances; each instance has its own buffer cache and the other well-known SGA structures. It is possible for a buffer (also known as a block) to be in the buffer cache of an instance on another node. Processes local to a machine access its buffer cache for reading. In addition to the local foreground processes, the background processes of remote instances access the local buffer cache: the remote instance's Lock Manager Service (LMS) processes access the global buffer cache, and the DBWR process accesses the local buffer cache.

Because the buffer cache is global and spread across multiple instances, the management operations associated with the buffer cache and shared pool are also different from typical single-instance Oracle environments, as are the wait events.

Note

Real Application Clusters requires a shared disk system for storage of datafiles. This is usually achieved by using raw devices or a cluster file system. Both allow the datafiles to be accessed by more than one node simultaneously.

Note

The terms block and buffer are used interchangeably throughout our discussions for ease of understanding. A block is stored on disk and can be loaded into any one of the buffers in any of the buffer caches. Oracle always accesses blocks in the buffer cache. If a block is already loaded in some instance's buffer cache and can be transferred without more work from the holding node, it is transferred to the requesting instance's buffer cache. Otherwise, the block is read into a buffer from disk.

Global Buffer Cache in Real Application Clusters

In RAC, the data buffer cache is shared by more than one instance, and the combined cache is called the global cache. Each instance has its own buffer cache local to that particular instance, and all of the buffer caches together make up the global cache.

Because the cache is global, consistent read (CR) processing differs from the single-instance environment. In a single-instance environment, when a process needs to modify or read a block, it reads the block from disk into memory, pins the buffer, and may modify it. In the RAC environment, a process likewise reads the block from disk and modifies it in the buffer. But because the buffer cache is global, there is a good chance the buffer has already been read by another process and resides in one of the buffer caches; in that case, simply reading the block from disk could lead to data corruption. The following section explains how Oracle avoids data corruption.

Parallel Cache Management

In Oracle Parallel Server, the global buffer management operation was called Parallel Cache Management (PCM), and buffer locks, known as PCM locks, were used to protect the buffers in the cache. In simple terms, a PCM lock is a data structure that usually covers a set of blocks. The number of PCM locks for a file was configured using the GC_FILES_TO_LOCKS init.ora parameter. A detailed discussion of PCM locks is beyond the scope of this chapter. Here, we will discuss cache management in RAC environments and the related wait events.

Lock Mastering and Resource Affinity

In normal cluster environments, the resources are split between the various nodes, which are typically different servers participating in the cluster. Each node, called a master node, owns a subset of the total resources, and those resources are controlled from their respective master nodes. This process is known as resource mastering or lock mastering. If one node wants to acquire a resource that is mastered by another node, the requesting node asks the master node (or the holding node) to grant it access. In this way, a resource is handled by only one node at a time, avoiding data corruption in clustered environments.

In Oracle's RAC implementation, if one node (the requesting node) wants to acquire a resource that is mastered by another node (mastering node), the requesting node makes a request to the Global Cache Service on the mastering node, and the global cache service grants access to the requesting node. The requestor does not make the request to the holding node unless it is also the master node.

In the RAC environment, the Global Resource Directory (GRD), maintained by the Global Enqueue Service (GES) and Global Cache Service (GCS), handles the "lock mastering" operations. These services dynamically remaster resources based on resource affinity, which enhances performance because the ownership of those resources is localized. For example, if one instance uses a resource more frequently than the other instances, that resource is dynamically remastered to that instance. Controlling the resource from that instance improves cluster performance by reducing lock management operations. This new feature is called dynamic remastering. The initialization parameter _LM_DYNAMIC_REMASTERING controls this behavior; setting it to FALSE disables dynamic remastering.
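
Should you need to check or change this behavior, the following is a minimal sketch. It assumes the commonly used (but unsupported) join of the fixed tables X$KSPPI and X$KSPPSV to read hidden parameters; it must be run as SYS, and underscore parameters should be changed only under the guidance of Oracle Support.

 -- Check the current setting of dynamic remastering (run as SYS).
 select p.ksppinm parameter, v.ksppstvl value
 from x$ksppi p, x$ksppsv v
 where p.indx = v.indx
 and p.ksppinm = '_lm_dynamic_remastering';

 -- Disable dynamic remastering (spfile assumed; takes effect after restart).
 alter system set "_lm_dynamic_remastering" = false scope=spfile sid='*';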

 

A buffer can be in any one of the following states in the buffer cache; the state of each buffer is visible in the STATUS column of the V$BH view (see the query after this list):

FREE - not currently in use
XCUR - exclusive current
SCUR - shared current
CR   - consistent read
READ - being read from disk
MREC - in media recovery mode
IREC - in instance recovery mode
WRI  - in write clone mode
PI   - past image
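
To see how the buffers in the local instance's cache are currently distributed across these states, you can group on the STATUS column of V$BH. This simple query is only an illustration:

 select status, count(*)
 from v$bh
 group by status;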

CR (Consistent Read) Processing

In a buffer cache, a buffer can be in any of the preceding states; we will focus on the XCUR, SCUR, and CR states in this discussion. Any SELECT operation on the buffer cache requires the buffer in SCUR mode during the execution of the statement. DML commands require buffers in XCUR mode (called current mode buffers), and the process that makes the changes (in this case, the DML operator) owns those buffers exclusively. During that time, if any other process requires access to those blocks, it clones the buffer within the buffer cache and uses the cloned copy, called a CR copy, for its processing. The cloning increments the statistic consistent gets reported in V$SYSSTAT.
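
You can observe the cloning activity through that statistic, for example:

 select name, value
 from v$sysstat
 where name = 'consistent gets';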

Note

Buffer cloning is the process of making a copy of a buffer and rolling it back, using the undo vectors, to the point where the buffer contents are deemed consistent. Such a copy is also called a pre-image or past image, abbreviated as PI.

During buffer cloning, there is a chance that a buffer will be cloned indefinitely because of the high volatility of the buffer contents. This can flood the buffer cache with clones of the same buffer, which restricts the buffer cache usage for the rest of the buffers. To avoid this, Oracle limits the number of cloned buffers (or CR copies) to six per DBA (Data Block Address, not to be confused with database administrator). Once the limit is reached, Oracle waits for the buffer; the normal wait time is 10 centiseconds before it attempts to clone or reread the buffer. The number of cloned copies per DBA is controlled by the initialization parameter _DB_BLOCK_MAX_CR_DBA, which defaults to six. Having more CR copies does not add much buffer cache management overhead because the CR copies are not subject to the normal MFU (Most Frequently Used) algorithm: they are always kept at the cold end of the buffer cache, unless the parameter _DB_AGING_FREEZE_CR is changed from its default value of TRUE to FALSE, or unless a recycle cache is configured. Note that the CR buffers at the cold end of the buffer cache can be flushed out at any time.
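
To inspect these two hidden parameters on your own system, you can reuse the X$KSPPI/X$KSPPSV sketch shown earlier (again, run it as SYS and treat the result as informational only):

 select p.ksppinm parameter, v.ksppstvl value
 from x$ksppi p, x$ksppsv v
 where p.indx = v.indx
 and p.ksppinm in ('_db_block_max_cr_dba', '_db_aging_freeze_cr');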

The New Buffer Cache Management

The Oracle Database buffer cache algorithm is no longer LRU/MRU (least recently used/most recently used) based. Instead, the new algorithm is based on how frequently a buffer is accessed. The most frequently used buffers are always kept at the hot end of the buffer cache, and newly accessed buffers (a.k.a. MRU buffers) are now placed in the middle of the buffer cache (in earlier Oracle versions they were kept at the hot end). This is called mid-point insertion. In this new MFU algorithm, a buffer is heated based on the frequency of its usage and slowly moves to the hot end; the buffer's temperature is reduced if it is not accessed (touched) for a certain period of time. Each buffer structure has a counter called a touch count, and a buffer is moved to the hot or cold region based on this counter. The touch count is halved if the buffer is not accessed for three seconds by default (or the time specified in the initialization parameter _DB_AGING_TOUCH_TIME). The CR buffers are placed at the cold end of the buffer cache.
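
The touch count itself is visible in the TCH column of the fixed table X$BH. The following sketch, which assumes SYS access and is meant purely as an illustration of the internal structure, lists the ten most heavily touched buffers:

 -- Ten hottest buffers by touch count (run as SYS).
 select *
 from ( select dbarfil file#, dbablk block#, tch touch_count
        from x$bh
        order by tch desc )
 where rownum <= 10;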

Having discussed the buffer cache basics and CR processing in a single database instance, we will now discuss CR handling in RAC environments. As you saw earlier, a RAC environment has two or more buffer caches managed by different instances. We will review how buffer transfer occurred in the pre-RAC days and then how it changed in the RAC environment.

Pings and False Pings

In OPS (Oracle Parallel Server), whenever a process (belonging to one instance) wants to read a resource/buffer (we'll call it a resource for simplicity), it acquires a lock on the resource and reads it into its own buffer cache. The Distributed Lock Manager (DLM) structures keep track of the resources and their owners in their own lock structures. In this scenario, if the resource is held and used by another instance, Oracle sends a request to that instance to release the lock on the resource. For example, suppose instance A wants a block that is in instance B's cache. To perform the transfer, instance B writes the block to disk, and instance A reads it. The writing of a block to disk upon the request of another instance is called a ping.

Another type of disk write happens when a block is written to disk by one instance because another instance requires the same lock, even though that lock covers different blocks. This is called a false ping. Once the holder downgrades the lock and writes the buffer to disk, the requester can acquire the lock and read the block into its own buffer cache. For example, instance A wants a block that shares the same PCM lock as a different block in the cache of instance B. Configuring additional PCM locks greatly reduces false pings, but covering every block with its own lock would be too resource-intensive for anything but a very small database. A single read request can require multiple write operations as well as the read, with the disk used as the data transfer medium. True pings and false pings put a heavy load on the I/O subsystem and limit the scalability of the system if the application is not partitioned correctly based on its workload characteristics. This is one of the strongest reasons for workload partitioning in Oracle Parallel Server environments.

In addition, OPS imposes a huge administrative overhead: the administrator must allocate PCM locks (fixed, releasable, and hashed) for each database file based on the frequency and concurrency of access. Improper configuration of these lock objects causes excessive false pings, and systems supposedly designed for scalability never scale to the required level.

The DLM keeps track of the ownership attributes of the blocks, such as which instance holds which blocks in shared and exclusive mode. At any point in time, only one instance can hold a block in exclusive mode, while more than one instance can hold the block in shared mode. During a lock downgrade (or ping), the holder writes the changes to the redo log, flushes the redo to disk, and downgrades the lock from exclusive mode to null/shared mode. The requester can then acquire the lock in the required mode and read the block into its own buffer cache.

Cache Fusion

Starting with Oracle8i Database, CR processing was simplified, and a new background process, the Block Server Process (BSP), was introduced to handle it. When a requester pings the holder for a CR copy, the CR copy is constructed from the holder's undo information and shipped to the requester's buffer cache via the interconnect (a high-speed, dedicated network). The disk is not used as a data transfer medium. Because the interconnect is used to fuse buffers across the caches, this data transfer method is called Cache Fusion. The BSP handles the cache transfer between instances.

Starting with Oracle9i Database, the Global Cache Service (GCS) handles the Cache Fusion traffic. The current buffer (XCUR) can also be transferred through the network connection, usually a dedicated fast interconnect. The internals of current mode transfer are complex and beyond the scope of this discussion, but one interesting point is that Oracle limits the number of CR copies per DBA that can be created and shipped to other instances through the network.

There is a fairness counter kept in every CR buffer, and the holder increments the counter after it makes a CR copy and sends it to the requestor. Once the holder reaches the threshold defined by the parameter _FAIRNESS_THRESHOLD, it stops making more CR copies, flushes the redo to the disk, and downgrades the locks.

From that point onward, the requester has to read the block from disk after acquiring the lock for the buffer from the lock master. The _FAIRNESS_THRESHOLD parameter defaults to 4, which means up to four consistent-version buffers are shipped to the other instance's cache; thereafter, the CR server stops responding to CR requests for that particular DBA. The view V$CR_BLOCK_SERVER has the details about the requests, including the FAIRNESS_DOWN_CONVERTS count.

 V$CR_BLOCK_SERVER
 Name                   Type   Notes
 ---------------------- ------ -----------------------------------
 CR_REQUESTS            NUMBER CR + CUR = total requests
 CURRENT_REQUESTS       NUMBER
 DATA_REQUESTS          NUMBER
 UNDO_REQUESTS          NUMBER
 TX_REQUESTS            NUMBER DATA + UNDO + TX = CR + CUR
 CURRENT_RESULTS        NUMBER
 PRIVATE_RESULTS        NUMBER
 ZERO_RESULTS           NUMBER
 DISK_READ_RESULTS      NUMBER
 FAIL_RESULTS           NUMBER
 FAIRNESS_DOWN_CONVERTS NUMBER # of down converts from X mode
 FAIRNESS_CLEARS        NUMBER # of times fairness counter cleared
 FREE_GC_ELEMENTS       NUMBER
 FLUSHES                NUMBER log flushes
 FLUSHES_QUEUED         NUMBER
 FLUSH_QUEUE_FULL       NUMBER
 FLUSH_MAX_TIME         NUMBER
 LIGHT_WORKS            NUMBER # of times light work rule invoked
 ERRORS                 NUMBER

The column CR_REQUESTS contains the number of requests received for a particular block at a specific version or a specific SCN; any request that involves SCN verification is called a consistent get. The total number of requests handled by the LMS process equals the sum of CR_REQUESTS and CURRENT_REQUESTS, and the same total is split across DATA_REQUESTS, UNDO_REQUESTS, and TX_REQUESTS (undo header block) requests.

 select cr_requests cr,
        current_requests cur,
        data_requests data,
        undo_requests undo,
        tx_requests tx
 from v$cr_block_server;

         CR        CUR       DATA       UNDO         TX
 ---------- ---------- ---------- ---------- ----------
      74300       3381      74300       3381          0
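
Note that the identity from the table above holds in this output: DATA + UNDO + TX (74300 + 3381 + 0) equals CR + CUR (74300 + 3381).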

Light Work Rule

In some cases, constructing the CR copy may be too much work for the holder; for example, it may require reading data and undo blocks from disk or from some other instance's cache, which is too CPU intensive. In such cases, the holder simply sends an incomplete CR copy to the requester's cache, and the requester completes the CR copy itself by performing a block cleanout, which may include reading several undo blocks and data blocks. This behavior is known as the light work rule, and the LIGHT_WORKS column indicates the number of times the light work rule was applied while constructing CR blocks.

The number of times the light work rule has been applied is visible in the view V$CR_BLOCK_SERVER:

 select cr_requests,light_works 
from v$cr_block_server;

CR_REQUESTS LIGHT_WORKS
----------- -----------
23975 133

_fairness_threshold

Reducing _FAIRNESS_THRESHOLD from its default to a lower value can provide some performance improvement if the ratio of down converts (downgrades) to data requests is greater than 30 percent. Setting the threshold to a lower value greatly reduces the interconnect traffic for CR messages. Setting _FAIRNESS_THRESHOLD to 0 disables fairness down converts entirely and is usually advised only for systems that mainly perform SELECT operations.

In Oracle8i Database, the holder down converts the lock and writes the buffer to disk, and the requester always has to read the block from disk. From Oracle9i Database onward, the holder down converts the lock from exclusive to shared mode, and the shared mode buffer is transferred to the requester's cache through the interconnect.

 select data_requests,fairness_down_converts 
from v$cr_block_server;

DATA_REQUESTS FAIRNESS_DOWN_CONVERTS
------------- ----------------------
74304 6651
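
In this output, the ratio of down converts to data requests is 6651/74304, or roughly 9 percent, well below the 30 percent guideline, so lowering _FAIRNESS_THRESHOLD would not be expected to help this system. The ratio can be computed directly from the same documented columns:

 select round(100 * fairness_down_converts / nullif(data_requests, 0), 2) pct_down_converts
 from v$cr_block_server;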


