2.8 System Logger recovery

In this section we discuss how System Logger handles errors and CF structure state changes, how log streams and the applications connected to them might be affected, and what actions (if any) you should take to resolve the issue. We start with how System Logger provides failure independence, because this improves the recoverability of your data, and then describe how System Logger attempts to recover both types of log stream. We'll close out this section by discussing how System Logger handles some other common errors.

2.8.1 Failure independence

System Logger can protect your log data from loss caused by the failure of a system component; that is, by duplexing log data, System Logger tries to provide an environment in which no single point of failure exists for the log stream. A single point of failure is a configuration in which one failure can result in the simultaneous loss of both copies of the log data currently resident in interim storage. For example, if the CF resides in the same CPC as the System Logger that is connected to it, the CF structure for a log stream and the local buffer copy of the data would both be lost if that CPC were to fail.

When your configuration contains a single point of failure, you can safeguard your data by telling System Logger to duplex log data to staging data sets (System Logger will do this depending on the environment the system is in and the log stream definition; see 2.5, "Log streams" on page 22 for more information).

In Figure 2-14 on page 66, a log stream has been defined with STG_DUPLEX=YES and DUPLEXMODE=COND. There are two systems connected to the log stream. Sys 1 resides in the same CPC as the CF containing the log stream, so the data that Sys 1 writes to the log stream is duplexed to a staging data set because there is a single point of failure between the system and the CF. For example, a machine error might require a reset of CPC1, which would take down both the CF and Sys 1 (with its local buffer copy of the CF log data); without the copy in the staging data set, the log data would be lost. On the other hand, there is no single point of failure between Sys 2 (which resides in a different CPC) and the CF. In this case, System Logger on Sys 2 will keep the duplex copy of the log data in its local buffers. If the CF fails, the data is still available in the local buffers, and if the CPC or Sys 2 fails, the data is still available in the CF. So, you can see that the single definition results in two instances of System Logger behaving differently because of their different environments.

Figure 2-14: A single point of failure exists between Sys1 and the CF

A single point of failure can also exist for a CF structure that uses System-Managed Duplexing, where one failure can result in the simultaneous loss of both structure instances. This is the case if the two CFs are in the same CPC, or if both CFs are volatile. The two structure instances in this environment are failure-dependent, meaning that a single failure can cause all the structure data to be lost, even though it is held in two copies of the structure. Because a single point of failure can cause the loss of all the log data in this case, System Logger will continue to safeguard the data by using one of its own duplexing methods.
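The failure-dependence test for a duplexed structure can be expressed as a small predicate. The following is only an illustrative sketch of the two conditions named above; the CouplingFacility class, its field names, and the CF and CPC names are hypothetical and are not part of any System Logger interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CouplingFacility:
    name: str       # hypothetical CF name, for illustration only
    cpc: str        # CPC in which the CF runs
    volatile: bool  # True if the CF is volatile


def instances_failure_dependent(cf_a: CouplingFacility, cf_b: CouplingFacility) -> bool:
    """Return True when one failure could destroy both structure instances."""
    same_cpc = cf_a.cpc == cf_b.cpc                  # a CPC failure takes out both CFs
    both_volatile = cf_a.volatile and cf_b.volatile  # a power failure takes out both CFs
    return same_cpc or both_volatile


# Two non-volatile CFs in different CPCs are failure-independent.
independent = instances_failure_dependent(
    CouplingFacility("CF1", "CPC1", False),
    CouplingFacility("CF2", "CPC2", False),
)
```

When the predicate is true, the two structure copies do not protect the data on their own, which is why System Logger keeps using its own duplexing in that environment.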

System Logger evaluates a connection for a single point of failure based on the location of the CF and its status of volatile or non-volatile. The CF can reside in one of the following configurations:

The CF executes in an LPAR in the same CPC as a system that is connected to a log stream resident in the CF.
The CF is in a stand-alone CPC and therefore does not share the CPC with a system with a connection to the log stream.

System Logger uses the following rules for determining whether a log stream connection contains a single point of failure:

When a CF containing a LOGR structure shares a CPC with a system that is connected to that structure, each active connection between the system and the CF contains a single point of failure, regardless of the volatility state of the CF. The connections have a single point of failure because if the CPC should fail, CF resident log data will be lost when both the system, with local buffers, and the CF fail with the CPC.
When a CF (or the composite view of a duplex-mode CF structure) is separate from the system connected to the associated log stream, the volatility status of the CF determines whether the connection is vulnerable to a single point of failure:
- If the CF (or the composite view of a structure using System-Managed Duplexing) is non-volatile, the connection is failure independent.
- If the CF (or the composite view of a structure using System-Managed Duplexing) is volatile, the connection is vulnerable to a single point of failure. A power failure could affect both the system, with its local copy of log data, and the CF.
A connection is failure independent when it is from a system on a separate CPC from the non-volatile CF to which it is connected.
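The rules above reduce to a two-input decision. The sketch below restates them in Python purely as an illustration; the function and parameter names are our own, and for a structure using System-Managed Duplexing the cf_volatile argument is assumed to describe the composite view.

```python
def has_single_point_of_failure(cf_shares_cpc_with_system: bool, cf_volatile: bool) -> bool:
    """Evaluate one log stream connection against the rules listed above."""
    if cf_shares_cpc_with_system:
        # Same CPC: a CPC failure loses the CF and the local buffers together,
        # regardless of the volatility state of the CF.
        return True
    # Separate CPC: only the volatility of the CF matters.
    return cf_volatile


# Figure 2-14, assuming the CF is non-volatile:
sys1 = has_single_point_of_failure(cf_shares_cpc_with_system=True, cf_volatile=False)   # True
sys2 = has_single_point_of_failure(cf_shares_cpc_with_system=False, cf_volatile=False)  # False
```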

If a connection contains a single point of failure and STG_DUPLEX(YES) DUPLEXMODE(COND) has been specified in the log stream definition, System Logger will duplex CF log data to staging data sets for that connection. In the event of a CF or CPC failure, System Logger can retrieve lost CF log data from the staging data set. See "CF-Structure based log streams" on page 23 and "Duplexing log data for CF-Structure based log streams" on page 42 for information on using the STG_DUPLEX and DUPLEXMODE keywords.
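To make the interaction between the log stream definition and the environment concrete, here is a minimal sketch of how the duplexing medium follows from STG_DUPLEX, DUPLEXMODE, and the single point of failure evaluation. It is a simplification for illustration: the function name is hypothetical, and it deliberately ignores LOGGERDUPLEX and System-Managed Duplexing, which are covered later in this section.

```python
def duplex_medium(stg_duplex: str, duplexmode: str, single_point_of_failure: bool) -> str:
    """Return where the duplex copy of CF log data is kept for one connection."""
    if stg_duplex not in ("YES", "NO") or duplexmode not in ("COND", "UNCOND"):
        raise ValueError("unexpected STG_DUPLEX or DUPLEXMODE value")
    if stg_duplex == "YES":
        # UNCOND always uses staging data sets; COND uses them only when the
        # connection contains a single point of failure.
        if duplexmode == "UNCOND" or single_point_of_failure:
            return "staging data sets"
    # STG_DUPLEX(NO), or COND without a single point of failure.
    return "local buffers"


# The two systems in Figure 2-14 share one definition but behave differently:
sys1_medium = duplex_medium("YES", "COND", single_point_of_failure=True)
sys2_medium = duplex_medium("YES", "COND", single_point_of_failure=False)
```

Here sys1_medium is "staging data sets" and sys2_medium is "local buffers", matching the behavior described for Sys 1 and Sys 2.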

Note

If you specify STG_DUPLEX(NO), make sure your application is not vulnerable to data loss that might occur if your environment contains a single point of failure.

2.8.2 Failure recovery

Now that you have an understanding of how System Logger works to provide a failure independent environment, we can move on to the question "What happens if System Logger, the system or the sysplex fails?" Recovery is different for the two types of log stream (CF-Structure based and DASD-only) and there are a few new operations that we will discuss as well, such as structure rebuild. We'll start at the highest level of System Logger recovery (system/sysplex level recovery) and work down to the details of CF-Structure based log stream recovery; we'll then discuss DASD-only log stream recovery as a separate topic.

System level recovery

When System Logger fails, it will attempt to restart itself, provided you have installed the PTF for APAR OW53349. If it is unable to restart itself after a few attempts, or you have issued the FORCE IXGLOGR,ARM command to terminate the System Logger address space, you must restart it with the S IXGLOGRS command (for more information on these commands, see Chapter 7, "Logger operations" on page 243).

During startup, System Logger runs through a series of operations for all CF-Structure based log streams to attempt to recover and clean up any failed connections, and to ensure all data is valid. DASD-only log stream recovery is not handled at this time, so we'll discuss it as a separate topic.

As part of this process, System Logger will issue connect requests for each CF-Structure based log stream that the system was connected to at the time it failed (that is, for each CF-Structure based log stream that there now exists a failed-persistent connection to). For each of these log streams, System Logger will, if necessary, clean up each failed persistent connection to it (regardless of the system it originated on); System Logger will then attempt the same operation for each log stream connected to the same CF structure (that is, perform peer recovery, which is covered in the next topic). Note that recovery will attempt to offload any log data in interim storage to offload data sets and then clean up the failed connection. If recovery fails for some reason, the failed persistent connection will be kept to ensure data is not lost.

Let's look at Figure 2-15 for an example of a recovery operation. Assume both system A and system B failed, and system A is IPLed first. Since system A was connected to log stream A when the system failed, recovery would be attempted for log stream A. As part of this recovery operation, System Logger would clean up the connections from system A and system B because both systems will have failed-persistent connections to the log stream. System Logger would then try to recover any failed connectors for all the other log streams residing in structure A; here, log stream B only has one connection, from system B. However, since log stream B is in the same structure as log stream A (for which System Logger on system A initiated recovery), the failed-persistent connection from system B to log stream B would be recovered as well. As a result of these recovery operations (assuming they are successful) there would be no connections to structure A, and all log data in log streams A and B would have been offloaded to offload data sets.

Figure 2-15: Structure A has peer connectors
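The order of operations in the recovery example for structure A can be traced with a small simulation. This is a toy model under stated assumptions, not System Logger logic: the structure is represented as a dictionary of hypothetical log stream names mapped to the systems holding failed-persistent connections, and the offload of interim storage that precedes each cleanup is not modeled.

```python
from typing import Dict, List, Set, Tuple


def recover_on_restart(structure: Dict[str, Set[str]], restarted_system: str) -> List[Tuple[str, str]]:
    """Clean up failed-persistent connections and return them in cleanup order."""
    cleaned: List[Tuple[str, str]] = []
    # Log streams the restarted system was connected to when it failed.
    own = sorted(ls for ls, systems in structure.items() if restarted_system in systems)
    if not own:
        return cleaned  # nothing in this structure triggers recovery
    # Remaining log streams in the same structure are recovered afterwards.
    others = sorted(ls for ls in structure if ls not in own)
    for log_stream in own + others:
        for system in sorted(structure[log_stream]):
            cleaned.append((log_stream, system))
        structure[log_stream].clear()
    return cleaned


# Systems A and B both failed; system A is IPLed first.
structure_a = {"LOGSTREAM.A": {"SYSA", "SYSB"}, "LOGSTREAM.B": {"SYSB"}}
cleaned = recover_on_restart(structure_a, "SYSA")
```

After the call, no failed-persistent connections remain in the model, and the connection from system B to log stream B has been cleaned up even though system A was never connected to that log stream.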

Peer and same system log stream recovery

Going back to Figure 2-15, let's now assume only system A failed. Since system B is connected to the same CF structure as system A, system B will be informed that system A has failed and will attempt to recover and clean up the connection from system A to log stream A; this is an example of peer recovery. Peer recovery does not take place for DASD-only log streams.

In order for peer recovery to take place, the CF structure must be connected to by more than one system, and at least one system must still be available after the failure. Note that when a system containing a System Logger application fails, another System Logger will try to safeguard the CF log data for the failed system, either by offloading it to offload data sets, or by ensuring that the log data is secure in a persistent duplex-mode structure. This recovery processing is done by a peer connector, which is another system in the sysplex with a connection to the same CF structure as the failing system. In general, when you set up your logging configuration, you should try to make sure that each log stream structure is accessed by more than one system.

If there is no peer connection available to perform recovery for a failed system, recovery is delayed until either the failing system re-IPLs or another system connects to a log stream in the same CF structure to which the failing system was connected. In Figure 2-16, for example, peer recovery can be performed on structure B. However, recovery for structure A log data is delayed until either Sys 1 re-IPLs, or another system connects to a log stream in that structure.

Figure 2-16: Recovery delayed for some data
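Whether recovery happens immediately or is delayed depends only on whether a surviving system is connected to the same structure. The following sketch captures that check; the function name and the system name SYS2 are assumptions made for illustration.

```python
from typing import Set


def recovery_status(structure_connectors: Set[str], failed_system: str, active_systems: Set[str]) -> str:
    """Report whether a peer connector can recover the failed system's log data."""
    # A peer is any other system that is still active and connected to the
    # same CF structure; it need not be connected to the same log stream.
    peers = (structure_connectors - {failed_system}) & active_systems
    if peers:
        return "peer recovery by " + ", ".join(sorted(peers))
    return "delayed until re-IPL or a new connection to the structure"


# Sys 1 fails. Structure A has no other connector; structure B does.
status_a = recovery_status({"SYS1"}, "SYS1", active_systems={"SYS2"})
status_b = recovery_status({"SYS1", "SYS2"}, "SYS1", active_systems={"SYS2"})
```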

Every time a new first connection to a log stream structure is established from a particular system, System Logger examines the log stream to determine if recovery is needed. If this is also the first connection to this structure, System Logger will check if recovery is needed for any other log streams in the structure.

Note

To ensure peer recovery, a peer connector is only needed to the same structure from another system in the sysplex; it does not have to be to the same log stream.

To avoid having a structure with no peer connectors, we recommend one of the following:

Make sure that another log stream with connectors from other systems maps to the same CF structure.
Write a dummy application to run on another system that will connect to the same log stream or CF structure to which the failed system was connected.

What if both the system and the CF fail?

If the CF containing the log stream structure and the CPC containing System Logger both fail (because of a power failure, for example), all log data for that log stream that was still in interim storage will be lost if staging data sets were not being used for that log stream. If staging data sets exist, System Logger uses them to recover the log data. If no staging data sets were in use when both the system and the CF fail, any log streams affected are marked as damaged. This is an example of a situation where choosing not to use staging data sets results in System Logger not being able to prevent a single point of failure from existing.

DASD-only log stream recovery

Like a CF-Structure based log stream, a DASD-only log stream can be affected by a system or System Logger failure. However, because DASD-only log streams are always duplexed to DASD, it is not necessary to go through the same recovery process that is required for CF-Structure based log streams, where System Logger attempts to move the log data to a non-volatile medium as quickly as possible. Also, because DASD-only log streams can only be accessed by one system at a time, the concept of peer recovery does not apply.

Recovery for a DASD-only log stream takes place only when an application reconnects to the log stream. As part of connect processing, System Logger reads log data from the staging data set (associated with the last connection to the log stream) into the local buffers of the connecting system. This allows the application to control recovery by selecting which system reconnects to the log stream, and when. Note that for another system to connect to the log stream and perform recovery, the staging data sets must reside on devices accessible by both systems.

2.8.3 Other recovery processes and errors

As we are discussing System Logger recovery, there are a few other internal recovery processes that you should be aware of. In this section, we discuss how System Logger reacts to CF structure state changes and common errors.

Structure state changes

The following CF problems can occur, resulting in rebuild processing for the structure:

Damage to or failure of the CF structure.
Loss of connectivity to a CF.
A CF becomes volatile.

In this topic, we'll discuss each of these situations individually, as well as giving some insight into System Logger processing during a structure rebuild.

Note that this section only applies to CF-Structure based log streams and not DASD-only log streams.

Structure rebuild

Structure rebuild from a System Logger perspective is the process by which a new instance of a CF structure is created and repopulated after System Logger has been notified by XES that a rebuild has been initiated (either by an operator command, structure failure, or state change). Because System Logger provides support for user-managed rebuilds, system-managed rebuild is not used to rebuild the structures unless there are no active connections to the structure.

You can read more about structure rebuild processing from an overall system perspective in Chapter 5.5 of z/OS Sysplex Services Guide, SA22-7617.

CF structure rebuilds are intended for planned reconfiguration and recovery scenarios. The next few sections describe the most common events that trigger a structure rebuild.

As we mentioned, System Logger responds to rebuild events communicated to it by XES. System Logger has its own internal exits to notify connectors of the current state of the process. While the rebuild is in progress, System Logger rejects any service requests against the log stream; it is up to each connector to understand the resulting reason codes and retry its requests. The new structure instance is then repopulated and the connectors that were actively connected to the original structure instance are connected to the new instance. Log data that was resident in interim storage will be copied from the duplex copy (that is, the local buffers or staging data sets, never from the old structure) to the new structure instance. Once the rebuild is complete, the system deallocates the original structure and connectors can use the new structure. In some cases, System Logger will initiate a directed offload to move all log data in the structure at this point.
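From the connector's point of view, a rejected request during a rebuild is a temporary condition to be retried. The sketch below shows only the general shape of such a retry loop. Everything in it is hypothetical: RebuildInProgress stands in for whatever reason code the service returns during a rebuild, and write is any callable supplied by the caller. A real connector would use the System Logger services and their documented return and reason codes.

```python
import time


class RebuildInProgress(Exception):
    """Hypothetical stand-in for a 'rebuild in progress' reason code."""


def write_with_retry(write, data, attempts=5, delay_seconds=0.0):
    """Call write(data), retrying while the structure is being rebuilt."""
    for attempt in range(1, attempts + 1):
        try:
            return write(data)
        except RebuildInProgress:
            if attempt == attempts:
                raise  # give up and let the caller decide what to do
            time.sleep(delay_seconds)  # wait before the next attempt
```

A fixed delay keeps the sketch short; a connector that is told when the log stream becomes available again would wait for that notification instead of polling.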

Note that while the structure is duplexed using System-Managed Duplexing, rebuild is not possible. If the structure needs to be rebuilt (for example, to clear a POLICY CHANGE PENDING condition), the structure must be reverted to simplex mode first.

Note

If the new structure instance cannot be successfully repopulated for any reason, and System Logger does not have connectivity to the original structure instance, staging data sets will be created to recover the log data to, regardless of the log stream's STG_DUPLEX parameter.

Damage to or failure of the CF structure

If a structure was being duplexed using System-Managed Duplexing and a CF failure or structure damage occurs to only one instance of the structure, XES will automatically revert the structure back to simplex mode. System Logger will only be notified of the mode switch change and will not be aware of any CF failure or damage to the structure. See "When the CF structure duplex/simplex mode changes" on page 75 for information on how System Logger handles mode switches for the structure.

If the structure was in simplex mode and the failure or damage occurs to the only instance of the CF structure, all systems connected to the CF structure detect the failure. The first system whose System Logger component detects the failure initiates the structure rebuild process. The structure rebuild process results in the recovery of one or more of the affected CF structure's log streams. All the systems in the sysplex that are connected to the CF structure participate in the process of rebuilding the log streams in a new CF structure. We discussed CF structure rebuilds in "Structure rebuild" on page 71.

Loss of connectivity to the CF structure

If a CF structure was in duplex mode and a loss of connectivity occurs to only one instance of the CF structure (due to a hardware link failure), XES will automatically switch the structure from duplex mode to simplex mode. System Logger will only be notified of the mode switch change and will not be aware of any loss of connectivity to the structure. See "When the CF structure duplex/simplex mode changes" on page 75 for how System Logger handles mode switches for the CF structure.

For a simplex mode structure, System Logger detects the loss of connectivity to the single instance of the structure. Then, based on the rebuild threshold specified, if any, in the structure definition in the CFRM policy, the system that lost connectivity may initiate a rebuild for the structure.

If XES cannot allocate a new structure instance in a CF that can be connected to from all the systems using that structure, System Logger does one of the following, depending on whether the system or systems that cannot connect to the new CF structure were using staging data sets:

If the system was using staging data sets, the rebuild process continues and the CF log data for the system is recovered from the staging data sets.
If the system was not using staging data sets, the rebuild process is stopped. The systems go back to using the source structure. The log stream will be available on the systems that have connectivity to the source CF structure; it will be unavailable on those that do not have connectivity.

When the CF structure volatility state changes

The following CF volatility state changes can occur, which may result in different reactions by System Logger:

A CF becomes volatile, or if the structure is duplexed using System-Managed Duplexing, both CFs could become volatile.
A CF becomes non-volatile, or if the structure is duplexed using System-Managed Duplexing, both CFs could become non-volatile.

System Logger's reaction to the state change depends on:

The type of state change.
The current duplexing environment for the log streams in the affected structures.
The log stream and CF structure duplexing specifications from the LOGR CDS and the CFRM policy.

Some of the environments may result in rebuild processing for the structure. For more information on rebuild processing see "Structure rebuild" on page 71.

When a CF becomes volatile

If a CF changes to the volatile state, the System Logger on each system using the CF structure is notified.

For structures that are in simplex mode, a rebuild of the structure is initiated so that the log data can be moved to a non-volatile CF. During rebuild processing, System Logger rejects any service requests (connect, and so on).

If no non-volatile CF is available to hold the structure, System Logger will still rebuild the structure in a new volatile CF. System Logger may then change the way it duplexes CF data because the volatile CF constitutes a single point of failure:

For log streams defined with STG_DUPLEX=YES, System Logger will begin duplexing data to staging data sets, if it was not already doing so.
For log streams defined with STG_DUPLEX=NO, System Logger will keep on duplexing data to local buffers on each system.

For structures that are duplexed using System-Managed Duplexing, structure rebuild is not possible because there are already two copies of the structure. However, System Logger may change whether it will duplex the log data and the way it duplexes the data since the new composite CF volatility state constitutes a single point of failure.

If the initial environment had a non-volatile composite structure CF state, but there was a failure-dependent relationship between the connecting system and the composite structure view, the log stream was considered to be in a single point of failure environment and still is after the state change, so System Logger makes no changes to the duplexing method.
If the initial environment had a non-volatile composite structure CF state and there was a failure-independent connection between the connecting system and the composite structure view, but there was a failure-dependent relationship between the two CF structure instances, the log stream was not considered to be in a single point of failure environment, but would be after the CF state changed to volatile. System Logger would continue to duplex the log data, but might change the duplexing method.
- For log streams defined with STG_DUPLEX=YES, System Logger will begin duplexing data to staging data sets, if they were not already in use.
If the initial environment had a non-volatile composite structure CF state, there was a failure-independent connection between the connecting system and the composite structure view, and there was a failure-independent relationship between the two CF structure instances, the log stream was not considered to be in a single point of failure environment, but would be after the CF state changed to volatile:
- For log streams defined with LOGGERDUPLEX(UNCOND), System Logger will continue to duplex the log data. However, for log streams defined with STG_DUPLEX=YES, System Logger will begin duplexing data to staging data sets, if they were not already in use.
- For log streams defined with LOGGERDUPLEX(COND), System Logger will first offload the log data from the CF structure, then begin duplexing any new log data written to the log stream.
  - For log streams defined with STG_DUPLEX=NO, System Logger will begin duplexing data to local buffers.
  - For log streams defined with STG_DUPLEX=YES, System Logger will begin duplexing data to staging data sets.

When a CF becomes non-volatile

If a CF changes to a non-volatile state, the System Logger on each system using the CF structure is notified.

For simplex mode structures, System Logger may change the way it duplexes CF data because the change to a non-volatile CF may have removed the single point of failure condition.

If the initial environment had a volatile structure CF state and there was a failure-dependent relationship between the connecting system and the CF structure, the log stream was considered to be in a single point of failure environment and continues to be so after the state change, so System Logger makes no changes to the duplexing method.
If the initial environment had a volatile structure CF state and there was a failure-independent relationship between the connecting system and the CF structure, System Logger may then change the way it duplexes CF data because the non-volatile CF no longer constitutes a single point of failure.
- System Logger will duplex the log data using local buffers, unless the log stream definition specified STG_DUPLEX(YES) DUPLEXMODE(UNCOND).

For duplex-mode structures, System Logger may also change the way it duplexes CF data because the change to a non-volatile CF may have removed the single point of failure condition.

If the initial environment had a volatile composite structure CF state but there was a failure-dependent relationship between the connecting system and the composite structure view, the log stream was considered to be in a single point of failure environment and continues to be so after the state change (because of the failure-dependent connection), so System Logger makes no changes to the duplexing method.
If the initial environment had a volatile composite structure CF state and there was a failure-independent connection between the connecting system and the composite structure view, but there was a failure-dependent relationship between the two structure instances, the log stream was considered to be in a single point of failure environment, but would not be after the CF state changes to non-volatile. System Logger would continue to duplex the log data, but might change the duplexing method.
- The data will be duplexed by System Logger using local buffers, unless staging data sets were specifically requested by having STG_DUPLEX(YES) DUPLEXMODE(UNCOND) on the log stream definition.
If the initial environment had a volatile composite structure CF state, there was a failure-independent connection between the connecting system and the composite structure view, and there was a failure-independent relationship between the two structure instances, the log stream was considered to be in a single point of failure environment, but would not be after the CF state changed to non-volatile.
- For log streams defined with LOGGERDUPLEX(UNCOND), System Logger will continue to duplex the log data.
  - The log data will be duplexed by System Logger using local buffers, unless staging data sets were specifically requested by having STG_DUPLEX(YES) DUPLEXMODE(UNCOND) on the log stream definition.
- For log streams defined with LOGGERDUPLEX(COND), System Logger will stop providing its own duplexing because the system-managed duplexed structure provides sufficient back-up capability for the log data.

When the CF structure duplex/simplex mode changes

This section applies to CF log streams only. If a CF structure changes modes, the System Logger on each system using the CF structure is notified.

System Logger's behavior after a structure switches from simplex-mode to duplex-mode or vice versa depends on whether a single point of failure environment exists after the mode switch and the duplexing specifications for the structure and the log streams within the structure.

It is possible for System Logger to change from using local buffers to staging data sets or to change from using staging data sets to local buffers. It is also possible for System Logger to start providing its own duplexing of the log data or stop duplexing the log data after the mode switch.

"Duplexing log data for CF-Structure based log streams" on page 42 shows how System Logger decides what storage medium to duplex log data to based on the environment System Logger is operating in and the log stream definition.
