The next few sections discuss additional IMS functions that take advantage of the sysplex environment:
Rapid Network Reconnect

Rapid Network Reconnect (RNR) is IMS's support for VTAM persistent sessions for non-XRF systems. RNR can provide great benefits for some environments, but it is not appropriate for all environments, and it is optional. VTAM persistent session support eliminates session cleanup and restart when a host failure occurs. There are two kinds of persistent session support: single-node persistent sessions (SNPS) and multinode persistent sessions (MNPS).
With persistent sessions, end users do not lose their sessions for the supported failures; they remain logged on. Even though their IMS system fails, their sessions are not terminated, so unbind traffic does not flow through the network when the failure occurs. And when their IMS system is restarted, their sessions do not have to be reestablished, so bind traffic does not flow. For LU types that typically have users, such as SLUTYPE2 and SLUTYPE1 (CONSOLE), a sign-on is required. For LU types that typically are programmable, such as SLUTYPEP, or that do not have direct users, such as LU1 (PRINTER1), a sign-on is not required.

Single Node Persistent Sessions

When using RNR with SNPS, only outages due to IMS abends are mitigated; the VTAM used by this IMS must not fail. The scenario illustrated in Figure 27-13 shows how SNPS works. The numbers in Figure 27-13 refer to the descriptions in the following list.

Figure 27-13. SNPS Example Scenario: Logon Is Not Terminated When Its IMS Fails
Multinode Persistent Sessions

With MNPS, the session data is stored in a coupling facility structure where it is available to other systems in the sysplex. All types of failures are supported with MNPS. As with SNPS support, when IMS is restarted, the users are automatically reconnected in a "logged on" state. When using RNR with MNPS, all outages of the IMS, VTAM, or processor are mitigated. The scenario illustrated in Figure 27-14 shows how MNPS works. The numbers in Figure 27-14 refer to the descriptions in the following list.

Figure 27-14. MNPS Example Scenario: Logon Is Not Terminated When Its IMS Fails
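Whether RNR is used, and which kind of persistent session support VTAM provides, is controlled by definitions on both the IMS and the VTAM side. The following is only an illustrative sketch: the member contents, APPL name, parameter values, and timer value are assumptions for illustration, not taken from the text, so verify them against the IMS and VTAM definition references for your release.

```
*  Illustrative sketch only -- names and values are assumptions.
*  IMS side: the RNR= execution parameter selects the RNR behavior,
*  for example (values shown are assumptions):
*    RNR=ARC   request automatic reconnection (RNR active)
*    RNR=NONE  request no rapid network reconnect
RNR=ARC
*
*  VTAM side: the APPL definition for the IMS ACB requests
*  persistent session support.  PERSIST=MULTI asks for multinode
*  persistent sessions (MNPS); PSTIMER limits how long VTAM
*  keeps the sessions alive after a failure.
IMSA     APPL  ACBNAME=IMSA,PERSIST=MULTI,PSTIMER=3600
```

With such definitions in place, a session that is logged on when IMS abends remains bound, and the restarted IMS reconnects to it rather than driving unbind/bind traffic through the network.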
Benefits of Rapid Network Reconnect

The benefit of RNR is that sessions are maintained when IMS fails. RNR eliminates the time required to terminate and reestablish the sessions, and with it the bind and unbind traffic that would otherwise flow through the network, which can be time consuming. Service to the end users is reestablished more quickly. Of course, the IMS system must still be restarted. When using RNR, the end user does not have the option of logging on to another IMS in the Parallel Sysplex, so the value of RNR depends on how quickly IMS is restarted. If the restart is slow, there is not much benefit; if the restart is quick, the benefit can be substantial. However, if another system with the same capabilities is available, the users would get quicker restoration of service by logging on to it, which means that RNR is probably not a good solution for IMS systems with clones.

Persistent session support for IMS users of APPC (LU 6.2) is provided by APPC/MVS, not IMS. With APPC, the sessions are persistent, but the conversations are not.

Sysplex Failure Recovery

Parallel Sysplex adds more components to a system. These include clones of systems and subsystems and new components such as coupling facilities and coupling facility links. Even though you might have another component available to do your work when one component fails, you want to restore the sysplex to full robustness as soon as possible. Recoveries from most failures in a sysplex can be automated.

Advantages of Multiple Copies of Servers

The main advantage in a sysplex is that you have multiple copies of your servers. When one fails, another is available to do its work. This applies to subsystems, processors, and coupling facilities.
Recovery Using the Automatic Restart Manager (ARM)

When IMS fails, you need to restart it as quickly as possible. Even though other IMS systems might be available to do work, the failed IMS might have inflight or indoubt work that needs to be resolved. This resolution releases locks on database resources, releases DBRC authorizations, and allows new work to have access to all of the data. ARM can be used to provide rapid restarts of IMS. ARM is a sysplex capability that allows automatic restart of subsystems such as IMS, DB2, CICS, and IRLM. If the subsystem abends, the restart is on the same z/OS instance (LPAR). If the z/OS (LPAR) fails, the restart is on another z/OS instance in the sysplex. Figure 27-15 illustrates the actions of ARM when an IMS abends.

Figure 27-15. ARM Restarting an Abended IMS
In Figure 27-15, IMS is restarted on the same z/OS system. This IMS was providing DBCTL services to a CICS and was using a DB2 subsystem. You must restart IMS on the same z/OS so that services between these subsystems can be restored. For example, indoubt threads must be resolved. In the case of a z/OS or processor failure, IMS is restarted on another candidate z/OS (see Figure 27-16 on page 490). The z/OS is chosen according to a user-defined ARM policy. Subsystems that must remain together can be restarted as a group on the same z/OS. In the example in Figure 27-16, an IMS subsystem is using DB2 for database services. A CICS AOR (Application Owning Region) is using the same DB2 and the IMS for database services. When the z/OS system fails, the IMS, the DB2, and the CICS AOR must be moved together, but the CICS TOR (Terminal Owning Region) can be restarted on another z/OS in the sysplex because CICS AORs and TORs can communicate with other z/OS systems.

Figure 27-16. ARM Restarting IMS, CICS, and DB2 after a z/OS Failure
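The restart grouping just described is expressed in the user-defined ARM policy. As a hedged sketch only (the policy name, group names, and element names below are invented for illustration, and the IMS element name is normally derived from the IMSID), an ARM policy can be defined with the IXCMIAPU administrative data utility along these lines:

```
//DEFARM   EXEC PGM=IXCMIAPU
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  DATA TYPE(ARM)
  DEFINE POLICY NAME(ARMPOL01) REPLACE(YES)
    /* Restart IMS, DB2, and the CICS AOR together on the  */
    /* same z/OS image after a system failure.             */
    RESTART_GROUP(IMSGRP1)
      ELEMENT(IMSA)
      ELEMENT(DB2A)
      ELEMENT(CICSAOR1)
    /* The CICS TOR can run on any z/OS image, so it goes  */
    /* in its own restart group.                           */
    RESTART_GROUP(TORGRP1)
      ELEMENT(CICSTOR1)
/*
```

Elements placed in the same RESTART_GROUP are restarted together on the same z/OS after an LPAR failure, which is exactly the behavior required for the IMS, DB2, and CICS AOR in the scenario above.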
For ARM to restart subsystems, it must be active in the sysplex. ARM is controlled by a user-defined policy, which is stored in an ARM couple data set. The policy groups subsystems for restart together, and it also controls whether or not a subsystem is restarted at all; for example, an installation might not want to restart test subsystems. ARM restarts only subsystems that register with ARM, which they do when they initialize. IMS uses the ARMRST parameter to control whether or not IMS registers with ARM; ARMRST=Y is the default. IMS has full ARM support: ARM can be used to restart IMS control regions, Common Queue Server regions, Fast Database Recovery regions, Common Service Layer components, and IRLMs. ARM does not directly restart IMS dependent regions; these are typically started by automation when the control region is started. ARMWRAP is a program that registers an address space for ARM restarts. It is used for a step in a job; if the following step fails, ARM restarts the job. IMS Connect does not register with ARM, but ARMWRAP can be used to get ARM support for IMS Connect.

Recovery After Coupling Facility Failures

Much of the sysplex support is provided through the use of coupling facility structures. If a CF is lost, it is important to have access to structures elsewhere. If a CF survives but you lose all of the links from a processor to the CF, you need to resolve the problem; this can be treated like the loss of the CF itself. You can either rebuild the structures on CFs that have connectivity to the processors that require them, or you can use duplicate structures.

Recovery Using Structure Rebuild

Some structures can be rebuilt automatically when either a CF failure or a CF link failure occurs. These include IRLM lock structures, OSAM and VSAM cache structures, and IMS shared-queues structures. The example in Figure 27-17 on page 491 uses an IRLM lock structure.

Figure 27-17.
Three IMSs on Three z/OSs Sharing One IRLM Lock Structure on a Coupling Facility
Figure 27-17 shows an IRLM structure on CF1 before a CF failure. Figure 27-18 on page 492 shows a scenario where CF1 (on which the IRLM structure resides) fails. When the CF fails, the system automatically recognizes the loss and rebuilds the lock structure on another CF (CF2). Each IRLM retains the information necessary to restore its lock information in the structure. The IRLMs together rebuild the lock structure on another CF. Data sharing is resumed. Similar rebuild and recovery occurs for OSAM, VSAM, and shared-queue structures.

Figure 27-18. IRLM Structure on Failed Coupling Facility Is Rebuilt on Another Coupling Facility
In Figure 27-19 on page 492, the CF does not fail. Instead, the connectivity between one of the processors and the CF fails. This case is treated the same as the loss of the CF. That is, the system automatically rebuilds the structure on another CF to which all processors have connectivity. This means that data sharing can continue. Similar rebuild and recovery occurs for OSAM, VSAM, and shared-queue structures.

Figure 27-19. IRLM Structure Rebuilt on Another Coupling Facility After a Connectivity Failure
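Where a structure can be allocated or rebuilt is governed by the CFRM policy. As a hedged illustration (the policy, structure, CF names, and sizes below are invented, and the SIZE units are an assumption to verify against the CFRM documentation for your release), the preference list names the candidate CFs, and REBUILDPERCENT influences rebuild when connectivity is lost:

```
  DATA TYPE(CFRM)
  DEFINE POLICY NAME(CFRMPOL1) REPLACE(YES)
    /* The IRLM lock structure is normally allocated on CF1  */
    /* but can be rebuilt on CF2 after a CF failure or a     */
    /* loss of connectivity.                                 */
    STRUCTURE NAME(IRLMLOCK1)
              SIZE(32768)
              PREFLIST(CF1,CF2)
              REBUILDPERCENT(1)
```

With both CFs in the preference list, the scenarios in Figures 27-18 and 27-19 can complete automatically: the surviving or still-connected CF is a valid target for the rebuilt structure.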
Recovery Using Structure Duplexing

Fast Path shared VSO does not rebuild its cache structures. Instead, it relies on a duplicate copy to provide failure survival. The duplicate copy can be created in either of two ways: IMS itself can maintain the duplicate (IMS-managed duplexing), or the system can maintain it (system-managed duplexing).
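The two ways of creating the duplicate copy can be sketched as follows. This is a hedged illustration only: the database, area, and structure names are invented, and the exact keyword syntax should be verified against the DBRC command and CFRM policy references for your release.

```
  /* IMS-managed duplexing: the DBRC registration for the     */
  /* shared VSO area names a primary and a secondary          */
  /* structure (CFSTR1/CFSTR2); IMS writes to both.           */
  CHANGE.DBDS DBD(DEDBX) AREA(AREA1) CFSTR1(VSOSTR1) CFSTR2(VSOSTR2)

  /* System-managed duplexing: the CFRM policy enables        */
  /* duplexing for the single structure named by CFSTR1, and  */
  /* the system maintains the duplicate copy.                 */
  STRUCTURE NAME(VSOSTR1)
            SIZE(16384)
            PREFLIST(CF1,CF2)
            DUPLEX(ENABLED)
```

In either case the result is the configuration shown in Figure 27-20: two copies of the shared VSO structure on two different coupling facilities.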
Figure 27-20 shows a duplexed DEDB VSO structure on two coupling facilities that are being shared by three IMSs.

Figure 27-20. Shared VSO Structure Duplexed on Two Coupling Facilities
If a CF is lost (as shown in Figure 27-21 on page 494), then the duplicate structure on another CF is used. With system-managed duplexing, a new duplicate is immediately built if another CF is available; if not, a duplicate structure is built when another CF becomes available.

Figure 27-21. System-Managed Duplicate Shared VSO Structure Is Used After a Coupling Facility Failure
Similarly, if connectivity to a CF is lost, then the use of its structure is discontinued. The duplicate structure on another CF is used instead.