4.10 Failover

In earlier chapters it was established that one of the greatest advantages of a clustered configuration is availability: users can continue using the system when one node in the cluster fails. In a database-clustered configuration, the database likewise remains accessible through the surviving nodes when one node fails. Failover in a RAC environment can be implemented in two ways: host-based failover and server-side failover. On most platforms the two differ significantly in architecture.

Host-based failover  The host-based failover option is normally implemented on a two-node clustered configuration. One node acts as the active (primary) node and runs the services; the other acts as the passive (secondary) node. The passive node monitors the services on the active node and, when it finds the primary's services unavailable, initiates failover so that the services move to the secondary node. Under RAC, this feature is supported through the Real Application Cluster Guard feature.
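
The active/passive pattern reduces to a short monitoring loop on the passive node. The sketch below is illustrative only, assuming a hypothetical listener address and a placeholder take_over_services action; it is not the Real Application Cluster Guard interface.

    import socket
    import time

    # Hypothetical address of the active node's database listener; a real
    # cluster framework would read this from its own configuration.
    ACTIVE_NODE = ("node1.example.com", 1521)
    CHECK_INTERVAL = 2    # seconds between health probes
    MAX_MISSES = 3        # consecutive failed probes before failing over

    def active_node_is_healthy() -> bool:
        """Probe the active node's listener port; any error counts as a miss."""
        try:
            with socket.create_connection(ACTIVE_NODE, timeout=2):
                return True
        except OSError:
            return False

    def take_over_services() -> None:
        # Placeholder for the failover action: mount the shared disks,
        # start the instance, and move the service address to this node.
        print("active node unreachable: passive node taking over services")

    def monitor_from_passive_node() -> None:
        misses = 0
        while True:
            misses = 0 if active_node_is_healthy() else misses + 1
            if misses >= MAX_MISSES:
                take_over_services()
                break
            time.sleep(CHECK_INTERVAL)

Requiring several consecutive misses before declaring failure avoids failing over on a single dropped probe, the same concern the Note later in this section raises about false failures on busy systems.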

Oracle 9i 

New Feature: The Real Application Cluster Guard feature replaces the Oracle Fail Safe feature that was available in prior versions of Oracle.

Server-side failover  The server-side failover option takes the opposite architectural approach: it is accomplished with a concurrent active–active node configuration. In an active–active configuration, every instance runs applications and monitoring services, and each node monitors the services on the next node in a circular fashion. Because all nodes share access to all disks, there is no disk ownership to transfer when a failure occurs.
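
The circular monitoring arrangement can be illustrated with a short sketch. The node names are hypothetical, and a real cluster manager would derive the ring from its membership list rather than a hard-coded one.

    # Each node watches exactly one neighbor, so the watch graph forms a
    # ring: every node is monitored without a dedicated passive standby.
    NODES = ["node1", "node2", "node3", "node4"]   # hypothetical names

    def watch_target(node: str) -> str:
        """Return the neighbor this node is responsible for monitoring."""
        i = NODES.index(node)
        return NODES[(i + 1) % len(NODES)]   # wrap around to close the ring

    for node in NODES:
        print(f"{node} monitors {watch_target(node)}")
    # node1 monitors node2, node2 monitors node3, node3 monitors node4,
    # and node4 monitors node1, completing the circle.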

RAC relies on the CM of the operating system for failure detection because the CM maintains the heartbeat functions. Using the heartbeat mechanism, each node communicates continuously with the other available nodes at preset intervals, e.g., every 2 seconds on Sun and Linux clusters. When a node, or the communication path to a node, fails, the heartbeat from that node no longer gets through. After a timeout period (configurable on most operating systems), the remaining nodes detect the failure and attempt to reform the cluster. (Like the heartbeat interval, the heartbeat timeout varies from operating system to operating system; the default is 12 seconds on Sun clusters and 10 seconds on Linux clusters.) If the remaining nodes form a quorum, they reorganize the cluster membership: the reorganization regroups the nodes that are still accessible and removes the nodes that have failed. For example, in a four-node cluster, if one node fails, the CM regroups the remaining three nodes. The CM performs the same step when a node is added to or removed from the cluster. This membership information is exposed to the respective Oracle instances by the LMON process running on each cluster node.
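
The heartbeat bookkeeping described above amounts to a table of last-seen timestamps. The sketch below is a minimal model using the 2-second interval and 10-second Linux-cluster timeout quoted in the text; record_heartbeat and presumed_failed are invented names, and a real CM exchanges heartbeats across the interconnect rather than within a single process.

    import time

    HEARTBEAT_INTERVAL = 2.0    # seconds between heartbeats (Sun/Linux)
    HEARTBEAT_TIMEOUT = 10.0    # default timeout on Linux clusters

    # Last time a heartbeat was received from each peer node.
    last_seen = {"node1": time.time(), "node2": time.time(),
                 "node3": time.time()}

    def record_heartbeat(node: str) -> None:
        """Called whenever a heartbeat message arrives from a peer."""
        last_seen[node] = time.time()

    def presumed_failed() -> list[str]:
        """Peers whose last heartbeat is older than the timeout are presumed
        failed; the surviving members would then reform the cluster."""
        now = time.time()
        return [n for n, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

In this model, a node that misses roughly five consecutive 2-second heartbeats exceeds the 10-second timeout and is reported by presumed_failed(), at which point the surviving nodes would attempt to reform the cluster.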

Note 

On busy systems, the heartbeat response may not arrive within the timeframe defined by the timeout parameter, and the CM may then signal a false node failure. On systems where this could occur, the timeout parameter should be tuned so that it does not. It is also advisable to carry the heartbeat over a private interconnect isolated from the interconnect used for regular data transfers.

The next step in the failover process is database recovery. During this step, the remaining nodes remaster the GCS resources held by the failed instance, perform cache recovery (rolling forward all changes) from the failed instance's redo logs, and perform transaction recovery (rolling back all uncommitted transactions). Because the redo logs reside on the shared disk system in RAC implementations, recovery proceeds smoothly: the instance that first detects the failure reads the failed instance's redo logs and applies them to the database.
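
The roll-forward/roll-back sequence can be seen in a toy model. The redo record format below is invented purely for illustration and bears no resemblance to Oracle's actual redo format.

    # Toy model of instance recovery from a failed node's redo log: cache
    # recovery replays every change (roll forward), then transaction
    # recovery undoes the changes of transactions that never committed.
    redo_log = [
        ("T1", "change", "set A = 100"),
        ("T2", "change", "set B = 200"),
        ("T1", "commit", None),
        # the instance failed here, so T2 never committed
    ]

    applied, committed = [], set()
    for txn, kind, change in redo_log:      # cache recovery: roll forward
        if kind == "change":
            applied.append((txn, change))
        elif kind == "commit":
            committed.add(txn)

    for txn, change in reversed(applied):   # transaction recovery: roll back
        if txn not in committed:
            print(f"rolling back {txn}: undo '{change}'")

Running the model rolls back only T2, mirroring the rule that committed work (T1) survives the failure while uncommitted work is undone.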


