In a data sharing environment, each DB2 member writes to its own recovery logs and BSDS data sets, achieving a sort of striping effect. Each DB2 member must be able to read the logs and BSDS data sets of every other member in the group, so they must be on shared DASD with appropriate access granted. This access is required because media recovery may need the logs of multiple DB2 members, and a group restart requires access to all logs.
The SCA contains information about all members' logs and BSDS data sets; during the backward log-recovery phase of DB2 start-up, each member updates the SCA with new log information, which is then read by all DB2 members during a recovery. The BSDS data sets back up the contents of the SCA. Every member's BSDS contains the same information that is held in the SCA, because the SCA resides in the coupling facility and can fail.
Logs and Recovery
The logs from several DB2 members may need to be merged for object recovery. The log record sequence number (LRSN) provides common log-record sequencing across members and is used to control REDO/UNDO records for data sharing. The LRSN is a 6-byte value derived from the sysplex timer timestamp, obtained through the store clock instruction. In a non-data sharing environment, the relative byte address (RBA) is used instead.
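The derivation of a 6-byte LRSN from a timestamp can be illustrated with a minimal sketch. It assumes, beyond what the text states, that the LRSN is simply the high-order 6 bytes of an 8-byte store-clock value; the function name `lrsn_from_tod` is hypothetical and is not a DB2 interface.

```python
def lrsn_from_tod(tod_clock: int) -> bytes:
    """Return a 6-byte LRSN-like value from an 8-byte store-clock reading.

    Assumption for illustration: the LRSN is the high-order 6 bytes of the
    8-byte clock value, so the low-order 2 bytes are dropped.
    """
    if not 0 <= tod_clock < 1 << 64:
        raise ValueError("store-clock value must fit in 8 bytes")
    # Shift out the low-order 16 bits, keep the remaining 48 bits (6 bytes).
    return (tod_clock >> 16).to_bytes(6, "big")


# Two readings taken on different members of the same sysplex.
earlier = lrsn_from_tod(0x0123456789ABCDEF)
later = lrsn_from_tod(0x0123456789AC0000)

# Big-endian byte strings compare in the same order as the timestamps,
# which is what makes the value usable for sequencing across members.
in_order = earlier < later
```

Because every member derives the value from the same synchronized clock, log records written by different members can be placed in one sequence, which an RBA local to a single member cannot provide.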
Archive logs on tape should be avoided, because recovery time increases with the number of members whose archive logs must be processed. Depending on where archive logs are kept, this can become a lengthy process. Keep the number of archive logs needed to a minimum by using large active logs, incremental copies, and frequent commits. Archiving to DASD is best; if using tape, never archive logs for more than one DB2 in the data sharing group to the same tape.
During a recovery, DB2 accesses the logs of all DB2 subsystems in the group and merges the log records in sequence by their LRSNs. DB2 then compares the LRSN in the log record to that on the data page; if the LRSN in the log record is larger, the change is applied. See Figure 9-9.
Figure 9-9. LRSNs and recovery
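The merge-and-compare step described above can be sketched as follows. This is a simplified model under stated assumptions, not DB2's implementation: `LogRecord`, `merge_logs`, and `apply_logs` are hypothetical names, each member's log is assumed to be already in ascending LRSN order, and a page is modeled as a dictionary holding its current LRSN.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lrsn: int      # sequence number comparable across all members
    member: str    # member that wrote the record
    page_id: str   # data page the change belongs to
    change: str    # stand-in for the actual redo information


def merge_logs(member_logs):
    """Merge per-member logs (each sorted by LRSN) into one LRSN-ordered stream."""
    return heapq.merge(*member_logs, key=lambda rec: rec.lrsn)


def apply_logs(pages, member_logs):
    """Apply a record only if its LRSN is larger than the LRSN on the page."""
    applied = []
    for rec in merge_logs(member_logs):
        page = pages[rec.page_id]
        if rec.lrsn > page["lrsn"]:
            page["data"].append(rec.change)
            page["lrsn"] = rec.lrsn
            applied.append(rec)
    return applied


# Logs of two members; page P1 was restored from a copy taken at LRSN 110.
db2a = [LogRecord(100, "DB2A", "P1", "a1"), LogRecord(130, "DB2A", "P1", "a2")]
db2b = [LogRecord(110, "DB2B", "P1", "b1"), LogRecord(140, "DB2B", "P2", "b2")]
pages = {
    "P1": {"lrsn": 110, "data": ["image-copy"]},
    "P2": {"lrsn": 0, "data": []},
}
applied = apply_logs(pages, [db2a, db2b])
```

In this example the records with LRSN 100 and 110 are skipped for page P1 because the restored page already reflects them, while the records with LRSN 130 and 140 are applied in order regardless of which member wrote them.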
When recovering from a DASD failure, you execute the RECOVER utility to restore the most recent image copy and then apply log records, in LRSN sequence, to the end of the logs. This requires merging logs from all updating members, and the logs must be on DASD. If tape archives must be read, enough drives must be available for all RECOVERs, and there must be no deallocation delay, so that a tape does not remain inaccessible to other members.
When a DB2 member subsystem fails, its locks are retained, and the other members remain active, which is one of the biggest benefits of a data sharing environment. When DB2 is restarted, forward recovery processing begins from the start of the oldest in-doubt unit of recovery (UR) and the oldest pending write from the virtual pool to the GBP. This is not as bad as in non-data sharing, where only a checkpoint forces data to DASD, because data sharing forces updated pages to the GBP at commit. Retained locks are freed at the end of the subsequent restart. A DB2 subsystem can be restarted on the same or a different MVS system. The restart light option, LIGHT(YES) on the START DB2 command, is used here to bring the DB2 subsystem up quickly just to release the retained locks.
Coupling Facility Failure
If a coupling facility fails or connectivity to it is lost, a dynamic rebuild of the lock or SCA structure is triggered. This rebuild is triggered automatically only if the loss exceeds the rebuild threshold specified in the CFRM policy; alternatively, a rebuild can be started with an operator command.
When a rebuild is caused by a storage failure and the rebuild fails, a group restart is necessary. In this case, all DB2 members come down, and then a coordinated group restart of all members is performed to rebuild the SCA and lock structures from the logs.
Some pages may be marked GRECP (group buffer pool recover pending) and require page set recovery. These pages were in the GBP when the CF failed. This situation requires a LOGONLY recovery; no image copy is needed. A -START DATABASE command will start the recovery. The GBP checkpoint determines how far back in the log to process.
In these cases, you can lose a structure in the coupling facility. Each case has different recovery needs.
If the structures are duplexed (the primary exists in one CF and an active secondary exists in the other), during a failure the activity is switched from the primary to the secondary, without the rebuild process.