As we learned in the 'Protection Modes' section, the speed and latency of the network can have a considerable effect on how Data Guard operates. An equally important issue to understand is how network disconnects or dead connections affect Data Guard. In order to understand their importance, let's consider what occurs during a simple network disconnect between the primary and standby systems.
When the network between two hosts is disconnected, or when one host in a TCP session is no longer available, the session is known as a dead connection. In a dead connection there is no physical connection, yet to the processes on each system the connection still appears to exist. If the LGWR and RFS processes are involved in a dead connection, LGWR will notice that the connection appears to be broken the next time it attempts to send a message to the RFS process. At this point, LGWR will wait for the TCP layer to time out the network session between the primary and standby before concluding that network connectivity has indeed been lost.
The TCP timeout, as defined by TCP kernel parameter settings, is key to how long either LGWR or ARCH will remain in a wait state before abandoning the network connection. On some platforms, the default for TCP timeout can be as high as two hours. In order to better control LGWR timeouts for network connections, the MAX_FAILURE, REOPEN, and NET_TIMEOUT attributes were developed.
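As a minimal sketch, these attributes are set on the LOG_ARCHIVE_DEST_n parameter of the primary database. The destination number, the Oracle Net service name standby_db, and the values shown are illustrative assumptions, not recommendations, and SCOPE=BOTH assumes the instance was started with a server parameter file.

```sql
-- Hypothetical destination 2 and hypothetical service name "standby_db".
-- NET_TIMEOUT: seconds LGWR waits on the network before giving up on the connection.
-- REOPEN:      minimum seconds before a failed destination is tried again.
-- MAX_FAILURE: consecutive failures after which the destination is no longer tried.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
  'SERVICE=standby_db LGWR SYNC AFFIRM NET_TIMEOUT=30 REOPEN=60 MAX_FAILURE=3'
  SCOPE=BOTH;
```

A shorter NET_TIMEOUT frees LGWR sooner after a disconnect, but a value that is too low for the network may cause the destination to be marked as failed during a brief interruption.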
On the standby side, the RFS process is always synchronously waiting for new information to arrive from the LGWR or ARCH process on the primary. The RFS process that is doing the network read operation is blocked until more data arrives or until the operating system's network software determines that the connection is dead.
Once the RFS process receives notification of the dead network connection, it will terminate itself. However, until the RFS process terminates itself, it will retain lock information on the archivelog on the standby site, or the standby redo log, whose redo information was being received from the primary database. Any attempt to perform a failover using the RECOVER MANAGED STANDBY DATABASE FINISH command will fail while the RFS process maintains a lock on the standby redo log. The RECOVER command will fail with the following errors:
ORA-00283: recovery session canceled due to errors
ORA-00261: log 4 of thread 1 is being archived or modified
ORA-00312: online log 4 thread 1: '/database/10gDR/srl1.dbf'
At this point, we must either wait for the operating system network software to clean up the dead connection or kill the RFS process ourselves before the failover attempt will succeed. One method to decrease the time it takes the operating system network software to clean up the dead connection is the use of Oracle's Dead Connection Detection feature.
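To kill the RFS process, it must first be identified. One way to do so, sketched here under the assumption that the V$MANAGED_STANDBY view is available on the standby, is to list the RFS processes together with their operating system process identifiers.

```sql
-- Run on the standby database.
-- PID is the operating system process identifier of each RFS process;
-- THREAD# and SEQUENCE# show which redo the process was receiving.
SELECT process, pid, status, thread#, sequence#
  FROM v$managed_standby
 WHERE process = 'RFS';
```

The PID returned can then be passed to the operating system's own facility for terminating a process (for example, kill on UNIX platforms), after confirming that it belongs to the stale connection and not to a healthy one.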
With Oracle's Dead Connection Detection feature, Oracle Net periodically sends a network probe to verify that a client/server connection is still active. This ensures that connections are not left open indefinitely due to an abnormal client termination. If the probe finds a dead connection or a connection that is no longer in use, it returns an error that causes the RFS process to exit. However, as we are still dependent on the operating system network software for timeouts and retries, it can take Dead Connection Detection up to 9 minutes to terminate the RFS network connection.
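Dead Connection Detection is enabled through the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file on the server side of the connection, which here is the standby host. The fragment below is a sketch only; the value, expressed in minutes between probes, is an illustrative assumption.

```text
# sqlnet.ora on the standby host
# Interval, in minutes, between Dead Connection Detection probes (illustrative value).
SQLNET.EXPIRE_TIME = 1
```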
Once the network problem is resolved, and the primary database processes are again able to establish network connections to the standby database, a new RFS process will automatically be spawned on the standby database for each new network connection. These new RFS processes will resume the reception of redo data from the primary database.
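One way to confirm that redo transport has resumed is to check the state of the archive destination on the primary. The query below is a sketch that assumes the standby is configured as archive destination 2.

```sql
-- Run on the primary database.
-- STATUS reads VALID once the destination is usable again;
-- ERROR holds the text of the last error recorded for the destination, if any.
SELECT dest_id, status, error
  FROM v$archive_dest
 WHERE dest_id = 2;
```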