Recovery Considerations | DB2 for z/OS Version 8 DBA Certification Guide

In a data sharing environment, each DB2 member writes to its own recovery logs and BSDS data sets, achieving a sort of striping effect. Each DB2 member must be able to read the logs and BSDS data sets of every other member in the group, so they must be on shared DASD with appropriate access granted. This access is required because media recovery may need the logs from multiple DB2 members, and a group restart requires access to all logs.

The SCA contains information about all members' logs and BSDS data sets; during the backward log-recovery phase of DB2 start-up, each member updates the SCA with new log information, which is then read by all DB2 members during a recovery. The BSDS data sets back up the contents of the SCA. Because the SCA is held in the coupling facility and can be lost if that structure fails, every member's BSDS contains the same information that is held in the SCA.

Logs and Recovery

The logs from several DB2 members may need to be merged for object recovery. The log record sequence number (LRSN) provides common log-record sequencing across members and is used to control REDO/UNDO records for data sharing. The LRSN is a 6-byte value derived from the sysplex timer timestamp, as returned by the store clock instruction. The RBA is used for sequencing in a non-data sharing environment.
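As a rough illustration of how a 6-byte sequence number can be derived from a store-clock timestamp, the sketch below truncates an 8-byte TOD clock value to its high-order 6 bytes. Treating the LRSN as exactly that truncation is an assumption made for illustration, as are the function names `lrsn_from_tod` and `lrsn_to_datetime`; the conversion to a date also ignores leap seconds.

```python
from datetime import datetime, timedelta

# The TOD clock counts from 1900-01-01; bit 51 ticks once per microsecond,
# so the low-order 12 bits of the 8-byte value are sub-microsecond.
TOD_EPOCH = datetime(1900, 1, 1)


def lrsn_from_tod(tod: int) -> int:
    """Keep the high-order 6 bytes of an 8-byte TOD value (assumed LRSN form)."""
    if not 0 <= tod < 1 << 64:
        raise ValueError("TOD value must fit in 8 bytes")
    return tod >> 16  # drop the low-order 2 bytes


def lrsn_to_datetime(lrsn: int) -> datetime:
    """Approximate wall-clock time for an LRSN; one unit is 16 microseconds."""
    return TOD_EPOCH + timedelta(microseconds=lrsn * 16)


if __name__ == "__main__":
    tod = 0x7D91048BCA000000  # TOD value for 1970-01-01 00:00:00
    lrsn = lrsn_from_tod(tod)
    print(f"{lrsn:012X}", lrsn_to_datetime(lrsn))
```

Because every member reads the same sysplex time source, values produced this way are comparable across members, which is what makes a merged ordering of their logs possible.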

Archive logs on tape should be avoided, because recovery time increases with the number of members whose archive logs must be processed. Depending on where archive logs are kept, this can become a lengthy process. Keep the number of archive logs needed for recovery to a minimum by having large active logs, taking incremental copies, and committing frequently. Archiving to DASD is best; if using tape, never archive logs for more than one DB2 in the data sharing group to the same tape.

During a recovery, DB2 accesses the logs of all DB2 subsystems in the group and merges the log records in LRSN sequence. DB2 then compares the LRSN in each log record to the LRSN on the data page; if the log record's LRSN is larger, the change is applied. See Figure 9-9.

Figure 9-9. LRSNs and recovery
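The merge-and-compare logic described above can be sketched in a few lines. This is a simplified model, not DB2's actual implementation: the `LogRecord` type, the `merge_and_apply` function, and the representation of a page as an `(lrsn, value)` pair are all hypothetical, and each member's log is assumed to be already ordered by LRSN.

```python
import heapq
from typing import Dict, Iterable, List, NamedTuple, Tuple


class LogRecord(NamedTuple):
    lrsn: int    # log record sequence number
    member: str  # DB2 member that wrote the record
    page: str    # identifier of the page the record changes
    value: str   # new page content (stand-in for the real redo data)


def merge_and_apply(
    member_logs: Iterable[List[LogRecord]],
    pages: Dict[str, Tuple[int, str]],
) -> Dict[str, Tuple[int, str]]:
    """Merge per-member logs by LRSN and apply only records newer than the page."""
    # heapq.merge performs a streaming k-way merge of already-sorted inputs.
    for rec in heapq.merge(*member_logs, key=lambda r: r.lrsn):
        page_lrsn, _ = pages.get(rec.page, (0, ""))
        if rec.lrsn > page_lrsn:  # the page does not yet reflect this change
            pages[rec.page] = (rec.lrsn, rec.value)
    return pages


if __name__ == "__main__":
    db2a = [LogRecord(10, "DB2A", "P1", "a1"), LogRecord(30, "DB2A", "P1", "a3")]
    db2b = [LogRecord(20, "DB2B", "P1", "b2"), LogRecord(40, "DB2B", "P2", "b4")]
    restored = {"P1": (20, "from-copy")}  # page as restored from an image copy
    print(merge_and_apply([db2a, db2b], restored))
```

In the sample data, the records with LRSN 10 and 20 are skipped because the restored copy of P1 already carries LRSN 20; only the later changes are applied.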

Recovery Scenarios

DASD Failure

When recovering from a DASD failure, you execute the RECOVER utility to restore the most recent image copy and then apply log records, in LRSN sequence, to the end of the logs. This requires merging logs from all updating members, and the logs should be on DASD. If tape archives must be read, enough drives must be available for all RECOVER jobs, and there must be no deallocation delay, so that a tape does not remain inaccessible to other members.

DB2 Failure

When a DB2 member subsystem fails, its locks are retained and the other members remain active, which is one of the biggest benefits of a data sharing environment. When the failed DB2 is restarted, forward recovery processing begins from the oldest in-doubt unit of recovery (UR) and the oldest pending write from the virtual buffer pool to the GBP. This is less work than in a non-data sharing environment, where only a checkpoint forces data to DASD; in data sharing, updated pages are forced to the GBP at commit. Retained locks are freed at the end of the restart. A DB2 subsystem can be restarted on the same or a different MVS system. The LIGHT(YES) option of the -START DB2 command (restart light) can be used here to bring the DB2 subsystem up quickly just to release the retained locks.

Coupling Facility Failure

If a coupling facility fails or connectivity to it is lost, a dynamic rebuild of the lock or SCA structure is triggered. When connectivity is lost, this rebuild is triggered only if the weight of the system that lost connectivity exceeds the rebuild threshold in the CFRM policy; alternatively, the rebuild can be started with an operator command.
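The threshold check can be pictured as a simple percentage comparison. The sketch below assumes the comparison is between the share of total system weight that lost connectivity and a rebuild percentage from the policy, and that reaching the threshold is enough to trigger the rebuild; the function name `should_rebuild`, its parameters, and the system names are hypothetical, and the exact rule used by the operating system should be checked against its documentation.

```python
from typing import Dict, Set


def should_rebuild(weights: Dict[str, int], lost: Set[str], rebuild_percent: int) -> bool:
    """Return True if the lost share of system weight reaches the rebuild threshold.

    weights:         weight assigned to each system using the structure (assumed input)
    lost:            names of systems that lost connectivity to the coupling facility
    rebuild_percent: threshold percentage taken from the policy (assumed 0..100)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    lost_weight = sum(weights[name] for name in lost)
    # Integer form of lost_weight / total * 100 >= rebuild_percent.
    return lost_weight * 100 >= rebuild_percent * total


if __name__ == "__main__":
    weights = {"SYSA": 10, "SYSB": 10, "SYSC": 80}
    print(should_rebuild(weights, {"SYSA"}, 50))  # a low-weight system lost connectivity
    print(should_rebuild(weights, {"SYSC"}, 50))  # a high-weight system lost connectivity
```

With a low threshold, the loss of almost any system triggers a rebuild; with a high one, only the loss of heavily weighted systems does.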

When a rebuild is caused by a storage failure and the rebuild itself fails, a group restart is necessary. In this case, all DB2 members come down, and then a coordinated group restart of all members is performed to rebuild the SCA and lock structures from the logs.

Some page sets may be marked GRECP (group buffer pool recover pending) and require page set recovery. These are the page sets that had pages in the GBP when the CF failed. This is a LOGONLY recovery; no image copy is needed. A -START DATABASE command will start the recovery. The GBP checkpoint determines how far back in the log to process.

Structure Failures

In these cases, you can lose a structure in the coupling facility. Each case has different recovery needs.

A lock structure failure is detected by all active members at the time of failure; all members initiate a rebuild, but only one rebuild occurs. The lock structure is rebuilt in the backup coupling facility, with each active member rebuilding its locks from local information. If connectivity to the CF is lost, the weights from the SFM policy determine whether the structure is rebuilt.

An SCA structure failure is detected by all members, and attempts are made to rebuild the structure in the backup coupling facility. During the rebuild, the remaining active members rebuild the SCA from local information. If connectivity to a coupling facility is lost, the weights from the SFM policy determine whether the structure is rebuilt.

For a group buffer pool failure, the GBP is rebuilt in the backup CF; any page sets that had pages in the GBP are marked GRECP. To recover page sets marked GRECP, you can issue a -START DATABASE() SPACENAM() command with ACCESS(RO) or ACCESS(RW). DB2 merges the logs from the oldest modified page LRSN, or the oldest pending write to DASD from the GBP, to the end of the logs. Alternatively, you can run the RECOVER utility or the LOAD REPLACE utility, or issue a DROP TABLE statement, to remove the GRECP status. If connectivity is lost, DB2 quiesces GBP-dependent page sets, and a requesting application receives SQLCODE -904. Writes then go on the logical page list (LPL), and GBP-dependent page sets are marked GRECP.

NOTE

If the structures are duplexed (the primary exists in one CF and an active secondary exists in the other), then during a failure the activity is switched from the primary to the secondary without the rebuild process.