High Availability Clustering


High availability (HA) is the term used to describe systems that run and are available to customers more or less all the time.

Failover protection can be achieved by keeping a copy of your database on another machine that is perpetually rolling the log files forward. Log shipping is the process of copying whole log files to a standby machine, either from an archive device or through a user exit program running against the primary database. With this approach, the primary database is restored to the standby machine, using either the DB2 restore utility or the split mirror function. You can use the new suspended I/O support to initialize the new database quickly. The secondary database on the standby machine continuously rolls the log files forward.

If the primary database fails, any remaining log files are copied over to the standby machine. After a rollforward to the end of the logs and stop operation, all clients are reconnected to the secondary database on the standby machine.

Failover strategies are usually based on clusters of systems. A cluster is a group of connected systems that work together as a single system. Clustering allows servers to back each other up when failures occur by picking up the workload of the failed server.

IP address takeover (or IP takeover) is the ability to transfer a server IP address from one machine to another when a server goes down; to a client application, the two machines appear at different times to be the same server.

Failover software may use heartbeat monitoring or keepalive packets between systems to confirm availability. Heartbeat monitoring involves system services that maintain constant communication between all the servers in a cluster. If a heartbeat is not detected, failover to a backup system starts. End users are usually not aware that a system has failed.

NOTE

For clarification and consistency with the naming convention throughout the book, a database node is now called a database partition, and when referencing a node name in the cluster, we refer to it as a server.


The two most common failover strategies on the market are known as idle standby and mutual takeover, although vendors may use different terms for these configurations.

Idle Standby

In this configuration, one system is used to run a DB2 instance, and the second system is idle, or in standby mode, ready to take over the instance if there is an operating system or hardware failure involving the first system. Overall system performance is not impacted, because the standby system is idle until needed.

Mutual Takeover

In this configuration, each system is the designated backup for another system. Overall system performance may be impacted, because the backup system must do extra work following a failover: It must do its own work plus the work that was being done by the failed system.

Failover strategies can be used to fail over an instance, a single database partition, or multiple database partitions.

When designing and testing a cluster:

  1. Ensure that the administrator of the cluster is familiar with the system and what should happen when a failover occurs.

  2. Ensure that each part of the cluster is truly redundant and can be replaced quickly if it fails.

  3. Force a test system to fail in a controlled environment, and make sure that it fails over correctly each time.

  4. Keep track of the reasons for each failover. Although this should not happen often, it is important to address any issues that make the cluster unstable. For example, if one piece of the cluster caused a failover five times in one month, find out why and fix it.

  5. Ensure that the support staff for the cluster is notified when a failover occurs.

  6. Do not overload the cluster. Ensure that the remaining systems can still handle the workload at an acceptable level after a failover.

  7. Check failure-prone components (such as disks) often, so that they can be replaced before problems occur.

In order to implement a split mirror scenario with DB2 Universal Database (UDB) Enterprise Server Edition, it is very important to understand the following three concepts.

Split Mirror

A split mirror is an identical and independent copy of disk volumes that can be attached to a different system and used in various ways: for example, to populate a test system, to serve as a warm standby copy of the database, or to offload backups from the primary machine.

A split mirror of a database includes the entire contents of the database directory, all the table space containers, the local database directory, and the active log directory, if it does not reside on the database directory. The active log directory needs to be split only for creating a clone database using the "snapshot" option of the "db2inidb" tool.

Suspend I/O Feature

When splitting the mirror, it is important to ensure that no page writes are occurring on the source database. One way to ensure this is to bring the database offline. However, because of the required downtime, this method is not feasible in a true 24x7 production environment.

In an effort to provide continuous system availability during the split mirror process, DB2 UDB Enterprise Server Edition (ESE) provides a feature known as suspend I/O, which allows online split mirroring without shutting down the database. The suspend I/O feature prevents partial page writes by suspending all write operations on the source database. While the database is in write suspend mode, all of the table space states change to a new state, SUSPEND_WRITE, and all operations function normally.

However, some transactions may wait if they require disk I/O, such as flushing dirty pages from the buffer pool or flushing logs from the log buffer. These transactions will proceed normally, once the write operations on the database are resumed. The following command is used to suspend or resume write operations on the source database:

  db2 set write <suspend | resume> for database

The db2inidb Tool

The split mirror created using the suspend I/O feature continues to stay in a write-suspend mode until it is initialized to a useable state. To initialize the split mirror, you can invoke the db2inidb tool.

This tool can either perform a crash recovery on a split mirror image or can put it in a rollforward pending state, depending on the options provided in the db2inidb command, the syntax of which is as follows :

  db2inidb <database_alias> as <snapshot | standby | mirror> [relocate using <config_file>]
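For example (the database alias sample and the file name relocate.cfg are illustrative):

  db2inidb sample as snapshot
  db2inidb sample as standby
  db2inidb sample as mirror
  db2inidb sample as standby relocate using relocate.cfg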

The snapshot option clones the primary database to offload work from the source database, such as running reports, performing analysis, or populating a target system.

The standby option leaves the database in rollforward pending state so that it can continue rolling forward through the logs; even new logs created by the source database can be continually fetched from the source system and applied.

The mirror option uses the mirrored copy as a backup image that can be restored over the source system.

The relocate option allows the split mirror to be relocated in terms of the database name, database directory path, container path, log path, and the instance name associated with the database.
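The configuration file named in the relocate option lists old and new values, one keyword per line. The following is a minimal sketch with illustrative names and paths (see the db2inidb and db2relocatedb documentation for the complete keyword list):

  DB_NAME=proddb,testdb
  DB_PATH=/db2/proddb,/db2/testdb
  INSTANCE=db2inst1,db2inst2
  LOG_DIR=/db2/proddb/logs,/db2/testdb/logs
  CONT_PATH=/db2/proddb/cont1,/db2/testdb/cont1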

Common Usage of Suspend I/O and db2inidb

The combination of the suspend I/O feature and the db2inidb tool is necessary to bring the split mirror database into a functional state. With the functionalities of the three options (snapshot, standby, mirror) provided in the db2inidb tool, in conjunction with the suspend I/O feature, it is possible to create a fast snapshot of a database, which can be used to:

  • Populate a test system by making a copy of the current data.

  • Create a standby database that can be used as a warm standby (a DB2 backup can be taken if the database contains only DMS table spaces).

  • Provide a quick file system level recovery option.

  • Take database backups that can be restored on the database server.

The suspend I/O feature is necessary to ensure that all DB2 data gets written out to the disk consistently (no partial page write) before splitting the mirror. This assures a well-defined state where the database can be recovered to later, using the db2inidb tool.

The db2inidb tool can either force the database to perform a crash recovery (when the snapshot option is specified) or put the database into a rollforward pending state (when the standby or mirror option is specified) to allow processing of additional log files.

High Availability through Log Shipping

Log shipping is the process of copying whole log files to a standby machine, either from an archive device or through a user exit program running against the primary database. The standby database is continuously rolling forward through the log files produced by the production machine. When the production machine fails, a switchover occurs, and the following takes place:

  • The remaining logs are transferred over to the standby machine, if possible.

  • The standby database rolls forward to the end of the logs and stops.

  • The clients reconnect to the standby database and resume operations.

The standby machine has its own resources (i.e., disks) but must have the same physical and logical definitions as the production database. When using this approach, the primary database is restored to the standby machine by using the restore utility or the split mirror function.
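The apply side of this process can be as simple as a loop on the standby machine. The following is a minimal sketch, assuming an illustrative host name, archive path, and database name (in practice, a user exit program usually automates the log copying):

  while true
  do
    # copy newly archived log files from the primary's archive location
    rcp 'primary:/db2archive/*.LOG' /db2logs/
    # apply them; the standby remains in rollforward pending state
    db2 rollforward db proddb to end of logs
    sleep 300
  done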

To ensure that you are able to recover your database in a disaster recovery situation, consider the following:

  • The archive location should be geographically separate from the primary site.

  • Remotely mirror the log at the standby database site.

  • Use a synchronous mirror for no loss support. You can do this through:

    • DB2 log mirroring (see the example following this list) or modern disk subsystems, such as ESS and EMC.

    • NVRAM cache (both local and remote) is also recommended to minimize the performance impact of a disaster recovery situation.
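For example, DB2 log mirroring is enabled with the MIRRORLOGPATH database configuration parameter (the database name and path here are illustrative):

  db2 update db cfg for proddb using MIRRORLOGPATH /mirror/logs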

NOTE

  1. When the standby database processes a log record indicating that an index rebuild took place on the primary database, the indexes on the standby server are not automatically rebuilt. The index will be rebuilt on the standby server, either at the first connection to the database or at the first attempt to access the index after the standby server is taken out of rollforward pending state. It is recommended that the standby server be resynchronized with the primary server if any indexes on the primary server are rebuilt.

  2. If the load utility is run on the primary database with the COPY YES option specified, the standby database must have access to the copy image.

  3. If the load utility is run on the primary database with the COPY NO option specified, the standby database should be resynchronized; otherwise the table space will be placed in restore pending state.

  4. There are two ways to initialize a standby machine:

    1. Restoring it from a backup image

    2. Creating a split mirror of the production system and issuing the db2inidb command with the STANDBY option (the ROLLFORWARD command can be issued on the standby system only after the standby machine has been initialized)

  5. Operations that are not logged (i.e., activities performed on a table created with NOT LOGGED INITIALLY) will not be replayed on the standby database. As a result, it is recommended that you resync the standby database after such operations. You can do this through online split mirror and suspended I/O support.


High Availability through Online Split Mirror and Suspended I/O Support

Suspended I/O supports continuous system availability by providing a full implementation for online split mirror handling; that is, splitting a mirror without shutting down the database. A split mirror is an "instantaneous" copy of the database that can be made by mirroring the disks containing the data and splitting the mirror when a copy is required. Disk mirroring is the process of writing all of your data to two separate hard disks; one is the mirror of the other. Splitting a mirror is the process of separating the primary and secondary copies of the database.

If you would rather not back up a large database using the DB2 backup utility, you can make copies from a mirrored image by using suspended I/O and the split mirror function. This approach also:

  • Eliminates backup operation overhead from the production machine.

  • Represents a fast way to clone systems.

  • Represents a fast implementation of idle standby failover. There is no initial restore operation, and if a rollforward operation proves to be too slow or encounters errors, reinitialization is very quick.

The db2inidb command initializes the split mirror so that it can be used:

  • As a clone database

  • As a standby database

  • As a backup image

In a partitioned database environment, you do not have to suspend I/O writes on all partitions simultaneously. You can suspend a subset of one or more partitions to create split mirrors for performing offline backups. If the catalog partition is included in the subset, it must be the last partition to be suspended.

In a partitioned database environment, the db2inidb command must be run on every partition before the split image from any of the partitions can be used. The tool can be run on all partitions simultaneously, using the db2_all command.
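For example (the database name is illustrative):

  db2_all "db2inidb proddb as standby"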

NOTE

Ensure that the split mirror contains all containers and directories that comprise the database, including the volume directory (each autonumbered directory within a volume).


Split Mirror to Clone a Database

Clone the primary database to offload work from the source database, such as running reports, performing analysis, or populating a target system.

The following scenario shows how to create a clone database on the target system, using the suspend I/O feature. In this scenario, the split mirror database goes through a crash recovery initiated by the db2inidb tool with the snapshot parameter. A clone database generated in this manner can be used to populate a test database or to generate reports. Due to crash recovery, the clone database will start a new log chain; therefore, it will not be able to replay any future log files from the source database. A database backup taken from this clone database can be restored to the source database. However, it will not be able to roll forward through any log records generated after the database was split. Thus, it will be a version-level copy only.

  1. Suspend I/O on the source system.

    The following commands will suspend I/O (all write activities from DB2 clients) on the source database so that the mirrors of the database containers can be split without the possibility of a partial page write occurring. Note that suspending I/O on a database will not disconnect the existing connections to the database, and all operations will function normally. However, some transactions may wait if they require disk I/O; as soon as I/O is resumed on the database, these transactions will proceed normally.

      db2 connect to <source-database>
      db2 set write suspend for database
  2. Use appropriate operating system-level commands to split the mirror or mirrors from the source database.

    The process to split a mirror differs from vendor to vendor. Please consult the storage vendor documentation applicable to your device on how to create a split mirror. Regardless of the variations on the split mirror process, the entire contents of the database directory, all the table space containers, the local database directory, and the active log directory (if it does not reside on the database directory) must be split at the same time.

  3. Resume I/O on the source system.

    The following command will resume I/O (all write activities from DB2 clients) on the source database, and the currently running transactions will proceed as normal. It is essential that the same database connection that was used to issue the db2 set write suspend command be used to issue the write resume command.

      db2 set write resume for database  
  4. Attach to the mirrored database from the target machine.

    After the split of the mirror, the administrator for the target machine must use the facilities of the storage vendor to make the split mirror copy accessible, a process referred to here as mounting. For initial setup, the following steps need to be taken on the target system:

    • Create the same database instance as it is on the source machine.

    • Catalog the database (system database directory).

    • Mount the database directory into the same directory as it is on the source machine.

    • Mount all of the containers to the same paths as they are on the source machine. If the containers are located in several directories, all container directories must be mounted.

    • If the log files are located in a directory other than the database directory, the log directory should also be mounted into the same directory as it is on the source machine.

  5. Start the database instance on the target system.

    Start the database manager on the target machine, assuming that the DB2INSTANCE environment variable is set to the same instance name as on the source machine.

      db2start  
  6. Bring the clone database into a consistent state.

    The following command will initiate a crash recovery and will roll back all uncommitted transactions, making the database consistent. It is essential to have all the log files that were active at the time of the split. The active log directory should not contain any log file that is not a part of the split mirror. After the crash recovery a new log chain will be started; therefore, the database will not be able to roll forward through any of the logs from the source database. The database will now be available for any operation.

      db2inidb <dbname> as snapshot  

    NOTE

    This command will roll back transactions that were in flight when the split occurred and start a new log chain sequence so that any logs from the primary database cannot be replayed on the cloned database.
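In summary, the clone scenario's command flow looks like the following sketch (the database name proddb is illustrative; the split and mount steps use storage vendor tools):

  (on the source machine)
  db2 connect to proddb
  db2 set write suspend for database
  ... split the mirror with storage vendor tools ...
  db2 set write resume for database

  (on the target machine, after mounting the split copy)
  db2start
  db2inidb proddb as snapshot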

Split Mirror as a Standby Database

The standby database continues to roll forward through the logs; even new logs created by the source database are continually fetched from the source system and applied.

The following scenario shows how to create a standby database on the target system, using the suspend I/O feature. In a warm standby database scenario, the log files of the source database will be applied on the target (standby) database. The standby database will be kept in a rollforward pending state until the rollforward has been stopped. A DB2 backup image taken on the standby database (DMS-only table spaces) can be used for restoring on the source database for the purpose of performing a rollforward recovery, using the log files produced on the source database after the mirror was split. Please see the following steps:

  1. Suspend I/O on the source system.

    The following commands will suspend I/O (all write activities from DB2 clients) on the database so that the mirrors of the database containers can be split without the possibility of a partial page write occurring. Note that suspending I/O on a database will not disconnect the existing connections to the database, and all operations will function normally. However, some transactions may wait if they require disk I/O; as soon as I/O is resumed on the database, these transactions will proceed normally.

      db2 connect to <source-database>
      db2 set write suspend for database
  2. Use appropriate operating system-level commands to split the mirror or mirrors from the source database.

    The process to split a mirror differs from vendor to vendor. Please consult the storage vendor documentation applicable to your device on how to create a split mirror. Regardless of the variations on the split mirroring process, the entire contents of the database directory, all the table space containers, and the local database directory must be split at the same time. It is NOT necessary to split the active log directory in this case.

  3. Resume I/O on the source system.

    The following command will resume I/O (all write activities from DB2 clients) on the source database, and the currently running transactions will proceed as normal. It is essential that the same database connection that was used to issue the db2 set write suspend command be used to issue the write resume command.

      db2 set write resume for database  
  4. Attach to the mirrored database from the target machine.

    After the split of the mirror, the administrator for the target machine must use the facilities of the storage vendor to make the split mirror copy accessible, a process referred to here as mounting. For initial setup, the following steps need to be taken on the target system:

    • Create the same database instance as it is on the source machine.

    • Catalog the database (system database directory).

    • Mount the database directory into the same directory as it is on the source machine.

    • Mount all of the containers to the same paths as they are on the source machine. If the containers are located in several directories, all container directories must be mounted.

    • If the log files are located in a directory other than the database directory, the log directory should also be mounted into the same directory as it is on the source machine.

  5. Start the database instance on the target system.

    Start the database manager on the target machine, assuming that the DB2INSTANCE environment variable is set to the same instance name as on the source machine.

      db2start  
  6. Put the mirrored database in rollforward mode.

    This places the split mirror database into a rollforward pending state. Crash recovery is not performed, and the database remains inconsistent.

      db2inidb <dbname> as standby  
  7. Continually copy over the log files and roll forward.

    Once the database is placed into a rollforward pending state, the log files from the source database can be used to roll forward the target database. A user exit program can be used in this case to automate the continuous archival of the inactive log files. If a user exit is used, both the source and target databases must be configured with the same user exit program.

      db2 rollforward db <dbname> to end of logs  
  8. Activate the standby database.

    If the source database crashes, the standby database on the target machine can be activated for user access. The user applications will have to make new connections to this standby database. To activate it, the standby database must be taken out of the rollforward pending state: issue the rollforward command with the "stop" or "complete" option to bring the database into a consistent state. Once the database is in a consistent state, the users can switch over to the standby database to continue their work. The log files generated on the standby database cannot be applied on the source database.

    While the target database is in rollforward pending state, it is possible to perform an offline backup if the database has DMS only table spaces.

      db2 rollforward db <dbname> stop  

    NOTE

    If you have only DMS table spaces, you can take a full database backup to offload the overhead of taking a backup on the production database.
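    A hedged example of taking such an offline backup while the standby is in rollforward pending state (the database name and backup path are illustrative):

      db2 backup db proddb to /backup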

Split Mirror as a Backup Image

Use the mirrored system as a backup image to restore over the source system.

The following scenario shows how to create a mirror database on the target system, using the suspend I/O feature. The purpose of this option is to provide the possibility of using a split mirror database for restoring on top of the source database, then to roll forward the log files of the source database. It is important to note that the split mirror must remain in the SUSPEND_WRITE state until it has been copied over on top of the source database.

Split Mirror
  1. Suspend I/O on the source database.

    The following commands will suspend I/O (all write activities from DB2 clients) on the database so that the mirrors of the database containers can be split without the possibility of a partial page write occurring. Note that suspending I/O on a database will not disconnect the existing connections to the database, and all operations will function normally. However, some transactions may wait if they require disk I/O; as soon as I/O is resumed on the database, these transactions will proceed normally.

      db2 connect to <source-database>
      db2 set write suspend for database
  2. Split the mirror.

    The process to split the mirror differs from vendor to vendor. Please consult the storage vendor documentation applicable to your device on how to create a split mirror. Regardless of the variations on the split mirroring process, the entire contents of the database directory, all the table space containers, and the local database directory must be split at the same time. It is not necessary to split the active log directory in this case.

  3. Resume I/O on the source database.

    The following command will resume I/O (all write activities from DB2 clients) on the source database, and the currently running transactions will proceed as normal. It is essential that the same database connection that was used to issue the db2 set write suspend command be used to issue the write resume command.

      db2 set write resume for database  
Restore the Split Mirror Image

There is no "target" database in this scenario. The intent of this scenario is to use the mirror copy to restore on top of the "source" database to recover from a disk failure. The split mirror cannot be backed up using the DB2 backup utility, but it can be backed up using operating system tools. If the source database happens to crash, it can be restored with the split mirror image by copying it on top of the source database. Please see the following steps:

  1. Stop the source database instance.

    The database instance needs to be shut down using the following DB2 command before restoring the split mirror image into it.

      db2stop  
  2. Restore the split mirror image.

    Using the storage vendor utilities, copy the data files of the split mirror database over the original database. Please do not use the operating system utilities in this case because the operating system does not have any knowledge of this split image.

  3. Start the source database instance after restoring the split mirror image.

      db2start  
  4. Initialize the mirror copy on the source database.

    This step initializes the mirror copy, which now replaces the source database, and places it into a rollforward pending state. No crash recovery is initiated, and the database will remain inconsistent until it has been rolled forward to the end of the logs.

      db2inidb <database> as mirror  
  5. Rollforward to end of logs.

    The log files from the source database must be used to roll forward the database.

      db2 rollforward database <database> to end of logs and complete  

    In a multi-partitioned database environment, every database partition is treated as a separate database. Therefore, the I/O on each partition needs to be suspended during the split mirror process and should be resumed afterward. The same applies to the db2inidb tool, which needs to be run on each mirrored partition before the database can be used.

    Following are some examples of how to issue the commands simultaneously on all partitions.

      db2_all "db2 connect to <source-database>; db2 set write resume for database"   db2_all "db2inidb <target-database> as <options>"  

High Availability on AIX

Enhanced scalability (ES) is a feature of High Availability Cluster Multi-Processing (HACMP) for AIX. This feature provides the same failover recovery and has the same event structure as HACMP. Enhanced scalability also has other provisions:

  • Larger clusters.

  • Additional error coverage through user-defined events.

  • Monitored areas can trigger user-defined events, which can be as diverse as the death of a process or the fact that paging space is nearing capacity. Such events include pre- and post-events that can be added to the failover recovery process, if needed. Extra functions that are specific to the different implementations can be placed within the HACMP pre-event and post-event streams.

  • A rules file (/usr/sbin/cluster/events/rules.hacmprd) contains the HACMP events. User-defined events are added to this file. The script files that are to be run when events occur are part of this definition.

  • HACMP client utilities for monitoring and detecting status changes (in one or more clusters) from an AIX physical server outside of the HACMP cluster.

The servers in HACMP ES clusters exchange messages called heartbeats or keepalive packets, by which each server informs the other servers about its availability. A server that has stopped responding causes the remaining servers in the cluster to invoke recovery. The recovery process is called a server-down-event and may also be referred to as failover. The completion of the recovery process is followed by the reintegration of the server into the cluster. This is called a server-up-event.

There are two types of events: standard events that are anticipated within the operations of HACMP ES and user-defined events that are associated with the monitoring of parameters in hardware and software components. One of the standard events is the server-down-event . When planning what should be done as part of the recovery process, HACMP allows two failover options: hot (or idle) standby and mutual takeover.

NOTE

When using HACMP, ensure that DB2 instances are not started at boot time by using the db2iauto utility, as follows:

 db2iauto off InstName 

where InstName is the login name of the instance.


Cluster Configuration

In a hot-standby configuration, the AIX server that is the takeover server is not running any other workload. In a mutual takeover configuration, the AIX server that is the takeover server is running other workloads.

Generally, in a partitioned database environment, DB2 UDB runs in mutual takeover mode with multiple database partitions on each server. One exception is a scenario in which the catalog partition is part of a hot-standby configuration.

When planning a large DB2 installation on an RS/6000 SP using HACMP ES, you need to consider how to divide the servers of the cluster within or between the RS/6000 SP frames. Having a server and its backup in different SP frames allows takeover in the event that one frame goes down (that is, the frame power/switch board fails). However, such failures are expected to be exceedingly rare, because there are N+1 power supplies in each SP frame, and each SP switch has redundant paths, along with N+1 fans and power. In the case of a frame failure, manual intervention may be required to recover the remaining frames. This recovery procedure is documented in the SP Administration Guide. HACMP ES provides for recovery of SP server failures; recovery of frame failures depends on the proper layout of clusters within one or more SP frames.

Another planning consideration is how to manage big clusters. It is easier to manage a small cluster than a big one; however, it is also easier to manage one big cluster than many smaller ones. When planning, consider how your applications will be used in your cluster environment. If there is a single, large, homogeneous application running, for example, on 16 servers, it is probably easier to manage the configuration as a single cluster, rather than as eight two-server clusters. If the same 16 servers contain many different applications with different networks, disks, and server relationships, it is probably better to group the servers into smaller clusters. Keep in mind that servers integrate into an HACMP cluster one at a time; it will be faster to start a configuration of multiple clusters, rather than one large cluster. HACMP ES supports both single and multiple clusters, as long as a server and its backup are in the same cluster.

HACMP ES failover recovery allows predefined (also known as cascading) assignment of a resource group to a physical server. The failover recovery procedure also allows floating (or rotating) assignment of a resource group to a physical server. Each resource group can contain IP addresses, external disk volume groups, file systems, or NFS file systems, as well as application servers, which specify an application or an application component that HACMP ES can move between physical servers through failover and reintegration. Failover and reintegration behavior is specified by the type of resource group created and by the number of servers placed in the resource group.

For example, consider a partitioned database environment: if the logs and table space containers for a database partition were placed on external disks, and other servers were linked to those disks, those other servers could access the disks and restart the database partition on a takeover server. It is this type of operation that is automated by HACMP. HACMP ES can also be used to recover NFS file systems used by DB2 instance main user directories.

Read the HACMP ES documentation thoroughly as part of your planning for recovery with DB2 UDB in a partitioned database environment. You should read the Concepts, Planning, Installation, and Administration guides, then build the recovery architecture for your environment. For each subsystem that you have identified for recovery, based on known points of failure, identify the HACMP clusters that you need, as well as the recovery servers (either hot standby or mutual takeover).

It is strongly recommended that both disks and adapters be mirrored in your external disk configuration. For DB2 servers that are configured for HACMP, care is required to ensure that each server can vary on the volume group from the shared external disks. In a mutual takeover configuration, this arrangement requires some additional planning, so that the paired servers can access each other's volume groups without conflicts. In a partitioned database environment, this means that all container names must be unique across all databases.

One way to achieve uniqueness is to include the partition number as part of the name. You can specify a database partition expression for container string syntax when creating either SMS or DMS containers. When you specify the expression, the database partition number can be part of the container name or, if you specify additional arguments, the results of those arguments can be part of the container name. Use the argument $N, preceded by a blank (" $N"), to indicate the database partition expression. The argument must occur at the end of the container string.

Following are some examples of how to create containers using this special argument:

  • Creating containers for use on a two-database partition system

The following containers would be used:

  CREATE TABLESPACE TS1 MANAGED BY DATABASE USING (device '/dev/rcont $N' 20000)

  /dev/rcont0 on DATABASE PARTITION 0
  /dev/rcont1 on DATABASE PARTITION 1
  • Creating containers for use on a four-database partition system

The following containers would be used:

  CREATE TABLESPACE TS2 MANAGED BY DATABASE USING (file '/DB2/containers/TS2/container $N+100' 10000)

  /DB2/containers/TS2/container100 on DATABASE PARTITION 0
  /DB2/containers/TS2/container101 on DATABASE PARTITION 1
  /DB2/containers/TS2/container102 on DATABASE PARTITION 2
  /DB2/containers/TS2/container103 on DATABASE PARTITION 3
  • Creating containers for use on a two-database partition system

The following containers would be used:

  CREATE TABLESPACE TS3 MANAGED BY SYSTEM USING ('/TS3/cont $N%2', '/TS3/cont $N%2+2')

  /TS3/cont0 on DATABASE PARTITION 0
  /TS3/cont2 on DATABASE PARTITION 0
  /TS3/cont1 on DATABASE PARTITION 1
  /TS3/cont3 on DATABASE PARTITION 1

A script file, rc.db2pe, is packaged with DB2 UDB Enterprise Server Edition (and installed on each server in /usr/bin) to assist in configuring for HACMP ES failover or recovery in either hot standby or mutual takeover servers. In addition, DB2 buffer pool sizes can be customized during failover in mutual takeover configurations from within rc.db2pe. Buffer pool sizes can be configured to ensure proper resource allocation when two database partitions run on one physical server.

HACMP ES Event Monitoring and User-Defined Events

Initiating a failover operation if a process dies on a given server is an example of a user-defined event. Examples that illustrate user-defined events, such as shutting down a database partition and forcing a transaction abort to free paging space, can be found in the sqllib/samples/hacmp/es subdirectory.

A rules file, /usr/sbin/cluster/events/rules.hacmprd, contains HACMP events. Each event description in this file has the following nine components:

  • Event name, which must be unique.

  • State, or qualifier for the event. The event name and state are the rule triggers. HACMP ES Cluster Manager initiates recovery only if it finds a rule with a trigger corresponding to the event name and state.

  • Resource program path, a full path specification of the xxx.rp file containing the recovery program.

  • Recovery type. This is reserved for future use.

  • Recovery level. This is reserved for future use.

  • Resource variable name, which is used for Event Manager events.

  • Instance vector, which is used for Event Manager events. This is a set of elements of the form name=value . The values uniquely identify the copy of the resource in the system and, by extension, the copy of the resource variable.

  • Predicate, which is used for Event Manager events. This is a relational expression between a resource variable and other elements. When this expression is true, the Event Management subsystem generates an event to notify the Cluster Manager and the appropriate application.

  • Rearm predicate, which is used for Event Manager events. This is a predicate used to generate an event that alters the status of the primary predicate. This predicate is typically the inverse of the primary predicate. It can also be used with the event predicate to establish an upper and a lower boundary for a condition of interest.

Each object requires one line in the event definition, even if the line is not used. If these lines are removed, HACMP ES Cluster Manager cannot parse the event definition properly, and this may cause the system to hang. Any line beginning with "#" is treated as a comment line.
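As an illustration of the nine-line layout, a hypothetical user-defined event that fires when a monitored db2 process dies might look like the following. The event name, recovery program path, resource variable, instance vector, and predicates shown here are assumptions for illustration only, not stanzas shipped with DB2 or HACMP ES:

  # hypothetical event: a monitored db2sysc process has died
  UE_DB2_PROC_DOWN
  0
  /usr/sbin/cluster/events/ue_db2_proc_down.rp
  2
  0
  IBM.PSSP.Prog.pcount
  NodeNum=*;ProgName=db2sysc;UserName=db2inst1
  X@0==0 && X@1!=0
  X@0>0

The nine non-comment lines correspond, in order, to the event name, state, recovery program path, recovery type, recovery level, resource variable name, instance vector, predicate, and rearm predicate described above.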

NOTE

The rules file requires exactly nine lines for each event definition, not counting any comment lines. When adding a user-defined event at the bottom of the rules file, it is important to remove the unnecessary empty line at the end of the file, or the server will hang.


HACMP ES uses PSSP event detection to treat user-defined events. The PSSP Event Management subsystem provides comprehensive event detection by monitoring various hardware and software resources.

The process can be summarized as follows:

  1. Either Group Services/ES (for predefined events) or Event Management (for user-defined events) notifies HACMP ES Cluster Manager of the event.

  2. Cluster Manager reads the rules.hacmprd file and determines the recovery program that is mapped to the event.

  3. Cluster Manager runs the recovery program, which consists of a sequence of recovery commands.

  4. The recovery program executes the recovery commands, which may be shell scripts or binary commands. (In HACMP for AIX, the recovery commands are the same as the HACMP event scripts.)

  5. Cluster Manager receives the return status from the recovery commands. An unexpected status "hangs" the cluster until manual intervention (using smit cm_rec_aids or the /usr/sbin/cluster/utilities/clruncmd command) is carried out.

In Figure 3.1, both servers have access to the installation directory, the instance directory, and the database directory. The database instance db2inst is being actively executed on server 1. Server 2 is not active and is being used as a hot standby. A failure occurs on server 1, and the instance is taken over by server 2. Once the failover is complete, both remote and local applications can access the database within instance db2inst. The database will have to be manually restarted or, if AUTORESTART is on, the first connection to the database will initiate a restart operation. In the sample script provided, it is assumed that AUTORESTART is off and that the failover script performs the restart for the database.

Figure 3.1. Failover on a two-server HACMP cluster.

graphics/03fig01.gif
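If AUTORESTART is off, a minimal sketch of the restart step such a failover script might perform is the following (the database name is illustrative):

  su - db2inst -c "db2start; db2 restart database <dbname>"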

Partition Failover (Hot Standby)

In the following hot-standby failover scenario, we fail over a single database partition instead of the entire instance. The scenario includes a two-server HACMP cluster as in the previous example, but here the cluster hosts one of the partitions of a partitioned database system. Server 1 is running a single partition of the overall configuration, and server 2 is being used as the failover server. When server 1 fails, the failover updates the db2nodes.cfg file to point to server 2's host name and net name, then restarts the partition on the new server.

Following is a portion of the db2nodes.cfg file, both before and after the failover. In this example, database partition number 2 is running on server 1 of the HACMP machine, which has both a host name and a net name of srvr201. Server 2, srvr202, is running as a hot standby, ready to take over the execution of the partition if there is a failure on srvr201. After the failover, database partition number 2 is running on server 2 of the HACMP machine, which has both a host name and a net name of srvr202.

  Before:

  1 srvr101 0 srvr101
  2 srvr201 0 srvr201   <= HACMP running on primary server

  db2start dbpartitionnum 2 restart hostname srvr202 port 0 netname srvr202

  After:

  1 srvr101 0 srvr101
  2 srvr202 0 srvr202   <= HACMP running on standby server
Multiple Logical Partition Database Failover

A more complex variation on the previous example involves the failover of multiple logical database partitions from one server to another. Again, we are using the same two-server HACMP cluster configuration as above. However, in this scenario, server 1, srvr201, is actively running three logical database partitions, while server 2, srvr202, is running as a hot standby, ready to take over their execution if there is a failure on srvr201. The setup is the same as that for the simple database partition failover scenario, but in this case, when server 1 fails, each of the logical database partitions must be started on server 2. It is critical that the logical database partitions be started in the order defined in the db2nodes.cfg file: The logical database partition with port number 0 must always be started first.

Following is a portion of the db2nodes.cfg file, both before and after the failover. In this example, there are three logical database partitions defined on server 1 of a two-server HACMP cluster. After the failover, database partitions 2, 3, and 4 are running on server 2 of the HACMP machine, which has both a host name and a net name of srvr202 .

  Before:

  1 srvr101 0 srvr101
  2 srvr201 0 srvr201   <= HACMP running on the primary server
  3 srvr201 1 srvr201   <= HACMP
  4 srvr201 2 srvr201   <= HACMP

  db2start dbpartitionnum 2 restart hostname srvr202 port 0 netname srvr202
  db2start dbpartitionnum 3 restart hostname srvr202 port 1 netname srvr202
  db2start dbpartitionnum 4 restart hostname srvr202 port 2 netname srvr202

  After:

  1 srvr101 0 srvr101
  2 srvr202 0 srvr202   <= HACMP running on the standby server
  3 srvr202 1 srvr202   <= HACMP
  4 srvr202 2 srvr202   <= HACMP
Partition Failover (Mutual Takeover)

In this example, we are running two of the partitions of a multi-partitioned database system on the two separate servers of an HACMP configuration. The database partition for each server is created on the path /db2, which is not shared with other partitions. The following is the contents of the db2nodes.cfg file associated with the overall multi-partition instance, before and after the failover. The srvr201 server crashes and fails over to srvr202. After the failover, the database partition that was executing on srvr201, defined as database partition number 2, starts up on srvr202. Because srvr202 is already running database partition number 3 for this database, database partition number 2 is started as a logical database partition on srvr202 with logical port 1.

  Before:

  1 srvr101 0 srvr101
  2 srvr201 0 srvr201   <= HACMP failover server
  3 srvr202 0 srvr202   <= HACMP

  db2start dbpartitionnum 2 restart hostname srvr202 port 1 netname srvr202

  After:

  1 srvr101 0 srvr101
  2 srvr202 1 srvr202   <= srvr201 failover to srvr202
  3 srvr202 0 srvr202   <= HACMP

Scenario #1: Hot Standby with a Cascading Resource Group

In this HACMP configuration (hot standby with a cascading resource group), we use HACMP/ES 4.3 and DB2 UDB Enterprise Server Edition running on AIX 4.3.3. The cluster being defined is called dbcluster. This cluster has two servers (dbserv1 and dbserv2), one resource group (db2grp), and one application server (db2as). Because we want the resource group and the application server to be active on the dbserv1 server when there are no failovers, we define the dbserv1 server in the resource group first. Each of these servers will have two network adapters and one serial port. The servers will have a shared external disk, with only one server accessing the disk at a time. Both servers will have access to a volume group (havg), three file systems (/home/db2inst1, /db1, and /home/db2fenc1), and a logical volume (/dev/udbdata).

If the dbserv1 server has a hardware or software failure, the dbserv2 server will acquire the resources that are defined in the resource group. The application server is then started on the dbserv2 server; in our case, the application server that is started is DB2 UDB ESE for the instance db2inst1. Some failures would not cause the application to move to the dbserv2 server; these include a disk failure or a network adapter failure.

Here is one example of a failover: DB2 UDB ESE is running on a server called dbserv1; it has a home directory of /home/db2inst1, a database located on the /db1 file system, and a /dev/udbdata logical volume. These two file systems and the logical volume are in a volume group called havg. The dbserv2 server is currently not running any application except HACMP, but it is ready to take over from the dbserv1 server, if necessary. Suppose someone unplugs the dbserv1 server. The dbserv2 server detects this event and begins taking over resources from the dbserv1 server. These resources include the havg volume group, the three file systems, the logical volume, and the hostname dbs1.

Once the resources are available on the dbserv2 server, the application server start script runs. The instance ID can log on to the dbserv2 server (now called dbs1) and connect to the database. Remote clients can also connect to the database, because the hostname dbs1 is now located on the dbserv2 server.

Follow these steps to set up shared disk drives and the logical volume manager:

  1. Set up the disk drives.

  2. Create the volume group (VG). The VG must have a unique name and a major number for all servers in the cluster.

  3. Create a JFSLog.

  4. Create the LVs and the JFSs.

  5. Unmount all of the file systems and deactivate the VG. The VG is varied off on the dbserv1 server before it is activated on the dbserv2 server.

  6. Import the VG on the dbserv2 server.

  7. Move the active VG to the dbserv1 server.

User Setup and DB2 Installation

Now that the components of the LVM are set up, DB2 can be installed. The db2setup utility can be used to install and configure DB2. To understand the configuration better, we will define some of the components manually and use the db2setup utility to install only the DB2 product and license.

All commands described in this chapter must be invoked by the root user. Although the steps used to install DB2 are outlined below, for complete details, please refer to the DB2 for UNIX Quick Beginnings guide and the DB2 Client Installation Guide.

NOTE

Before any groups or IDs are created, ensure that the volume group is activated and that the file systems /home/db2inst1 and /home/db2fenc1 are mounted.


To install and configure DB2 on the dbserv1 server:

  1. Create the group for the DB2 instance.

  2. Create the user ID for the DB2 instance.

  3. Create the group and user ID for the DB2 fenced ID.

  4. Mount the CD-ROM.

  5. Install DB2 and set up the license key.

  6. Create the DB2 instance.

  7. Install the DB2 HACMP scripts. To copy the HACMP scripts to /usr/bin from /usr/opt/db2_08_01/samples/hacmp/es, use the db2_inst_ha.local command. The db2_inst_ha.local script also copies over the HACMP/ES event stanzas. These events are defined in the db2_event_stanzas file.

  8. Fail over $HOME to the dbserv2 server, and repeat steps 1 to 7 on the dbserv2 server.

NOTE

Once DB2 HACMP is configured and set up, any changes made (for example, to the IDs, groups, AIX system parameters, or the level of DB2 code) must be done on both servers.


Following are some examples:

  • The HACMP cluster is active on the dbserv1 server, and the password is changed on that server. When failover happens to the dbserv2 server and the user tries to log on, the new password will not work. Therefore, the administrator must ensure that passwords are kept synchronized.

  • If the ulimit parameter on the dbserv1 server is changed, it must also be changed on the dbserv2 server. For example, suppose the file size is set to unlimited on the dbserv1 server. When a failover happens to the dbserv2 server and the user tries to access a file that is greater than the default size of 1 GB, an error is returned.

  • If the AIX parameter maxuproc is changed on the dbserv1 server, it also must be changed on the dbserv2 server. When a failover occurs, and DB2 is now running on the dbserv2 server, it may hit the maxuproc value and return errors.

  • If non-DB2 software is installed or a DB2 upgrade is installed on the dbserv1 server but not on the dbserv2 server, the new software will not be available when a failover takes place.

  • Suppose that the database manager configuration parameter svcename is used and that /etc/services is updated on the dbserv1 server. If the dbserv2 server does not receive the same update, DB2 clients will report errors after a failover.
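For example, if svcename were set to db2c_db2inst1, both servers would need the same entry in /etc/services (the service name and port number here are illustrative):

  db2c_db2inst1   50000/tcp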

The testing procedure itself is simple. First, connect to the cluster from a client machine; next, cause one of the points of failure to fail; then watch to ensure that the failover takes place properly and that the application is available and correctly configured afterward. If the cluster is built using a cascading cluster configuration, check again after service has been restored to the original server. If the cluster is built using a rotating cluster configuration, bring up the original server again, then cause the second server to fail, which should restore the system to its original server.

When testing the availability of the application, be sure that accounts and passwords work as expected, that hostnames and IP addresses work as expected, that the data is complete and up to date, and that the failover is essentially transparent to the user.

Configure a remote machine to be able to connect to the highly available DB2 UDB database. A script can be easily written that will connect to our database, select some data from a table, record the results, and disconnect from the database. If these steps are set inside a loop that will run until interrupted by the operator, the procedure can be used to monitor the state of the cluster.

Keep in mind that the script should continue even if the database cannot be contacted. This way, when the database restarts, it will provide a benchmark for the length of time failover is expected to take. Here is a brief sample script that may be useful for testing an HACMP cluster:

  while true
  do
    db2 connect to <dbname>
    db2 "select count(*) from syscat.tables"
    db2 terminate
    sleep 60
  done
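To record results with timestamps, so that the log doubles as a benchmark of how long failover takes, the loop can be extended along these lines (the log path and database name are illustrative):

  while true
  do
    echo "`date`: attempting connection" >> /tmp/hatest.log
    db2 connect to <dbname> >> /tmp/hatest.log 2>&1
    db2 "select count(*) from syscat.tables" >> /tmp/hatest.log 2>&1
    db2 terminate > /dev/null 2>&1
    sleep 60
  done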

Scenario #2: Mutual Takeover with a Cascading Resource Group

This configuration involves six database partitions and two clusters, each with mutual takeover and cascading resource groups. It uses HACMP/ES 4.3 and DB2 UDB ESE running on AIX 4.3.3.

The clusters being defined are named cl1314 and cl1516, with cluster IDs of 1314 and 1516, respectively. We arbitrarily selected these numbers because we are using SP servers 13, 14, 15, and 16.

  • The cl1314 cluster has two servers (bf01n013 and bf01n014), two cluster server names (clsrv13 and clsrv14), two resource groups (rg1314 and rg1413), and two application servers (as1314 and as1413).

  • The cl1516 cluster has two servers (bf01n015 and bf01n016), two cluster server names (clsrv15 and clsrv16), two resource groups (rg1516 and rg1615), and two application servers (as1516 and as1615).

  • Each of these servers will have one SP switch and one Ethernet adapter.

The servers within a cluster will have a shared external disk (Table 3.1).

  • The cl1314 cluster will have access to two volume groups, havg1314 and havg1413.

  • The cl1516 cluster will have access to two volume groups, havg1516 and havg1615.

Table 3.1. Scenario #2 Configuration

Cluster   Cluster Server   Ethernet   Switch     Resource Group   Volume Group   File System
cl1314    clsrv13          bf01n13    b_sw_013   rg1314           havg1314       /homehalocal
                                                                                 /db1ha/svtha1/SRV130
                                                                                 /db1ha/svtha1/SRV131
          clsrv14          bf01n14    b_sw_014   rg1413           havg1413       /db1ha/svtha1/SRV140
cl1516    clsrv15          bf01n15    b_sw_015   rg1516           havg1516       /db1ha/svtha1/SRV150
          clsrv16          bf01n16    b_sw_016   rg1615           havg1615       /db1ha/svtha1/SRV160
                                                                                 /db1ha/svtha1/SRV161
In the initial target configuration, the db2nodes.cfg will have the following entries:

  130 b_sw_013 0 b_sw_013
  131 b_sw_013 1 b_sw_013
  140 b_sw_014 0 b_sw_014
  150 b_sw_015 0 b_sw_015
  160 b_sw_016 0 b_sw_016
  161 b_sw_016 1 b_sw_016

If one of the two servers within the cluster (for example, cl1314) has a failure, the other server in the cluster will acquire the resources that are defined in the resource group. The application server is then started on the server that has taken over the resource group. In our case, the application server that is started is DB2 UDB ESE for the instance svtha1.

In our example of a failover, DB2 UDB ESE is running on a server called clsrv13; it has an NFS-mounted home directory and a database located on the /db1ha/svtha1/SRV130 and /db1ha/svtha1/SRV131 file systems. These file systems are in a volume group called havg1314. The clsrv14 server is currently running DB2 for partition 140 and is ready to take over from the clsrv13 server, if necessary. Suppose someone unplugs the clsrv13 server.

The clsrv14 server detects this event and begins taking over resources from the clsrv13 server. These resources include the havg1314 volume group, the file systems, and the hostname swserv13. Once the resources are available on the clsrv14 server, the application server start script runs. The instance ID can log on to the clsrv14 server (now with an additional hostname, swserv13) and can connect to the database. Remote clients can also connect to the database, because the hostname swserv13 is now located on the clsrv14 server.

User Setup and DB2 Installation

Now that the components of the LVM are set up, DB2 can be installed. The db2setup utility can be used to install and configure DB2. To illustrate the configuration better, we will define some of the components manually and will use the db2setup utility to install only the DB2 product and license.

All commands described in this chapter must be invoked by the root user. Although the steps used to install DB2 are outlined below, for complete details, refer to the DB2 UDB ESE for UNIX Quick Beginnings guide and to the DB2 UDB ESE Installation and Configuration Supplement guide.

Before running db2icrt, make sure that the $HOME directory for the instance is available and that the svtha1 ID can write to the directory. Also make sure that a .profile file exists, because db2icrt will append to the file but will not create a new one.

For this example, we are using the svtha1 id that already exists on the SP complex.
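Before starting, it may help to confirm these prerequisites from the command line. A minimal check (the testfile name is only illustrative):

  # lsuser -a home svtha1
  # su - svtha1 -c "touch testfile && rm testfile"
  # ls -a ~svtha1/.profile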

  1. Mount the CD-ROM.

      # crfs -v cdrfs -p ro -d'cd0' -m'/cdrom'
      # mount /cdrom
  2. Install DB2 and set up the license key.

      # cd /cdrom
      # ./db2setup

    Note: Select DB2 UDB Enterprise Server Edition and install.
  3. Create the DB2 instance.

      # cd /usr/opt/db2_08_01/instance
      # ./db2icrt -u svtha1 svtha1
  4. Test db2start and file system setup.

    Because db2icrt adds only one line to the $HOME/sqllib/db2nodes.cfg file, we must update the file to add the other servers, so that the db2nodes.cfg file looks like the following:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_016 0 b_sw_016
      161 b_sw_016 1 b_sw_016

    We must also create the $HOME/.rhosts file because db2start and other DB2 programs require it to run remote shells from one server to another. The .rhosts file would look like the following in our example:

      swserv13 svtha1
      swserv14 svtha1
      swserv15 svtha1
      swserv16 svtha1
      b_sw_013 svtha1
      b_sw_014 svtha1
      b_sw_015 svtha1
      b_sw_016 svtha1
      bf01n013 svtha1
      bf01n014 svtha1
      bf01n015 svtha1
      bf01n016 svtha1

    NOTE

    Ensure that the permissions on the $HOME/.rhosts file are correct (i.e., run chmod 600 .rhosts).

    This is a good place to see whether db2start will work. Log on as the svtha1 instance owner and run the db2start command. To test the file system setup on each server, try creating a database. Be sure to create it on /db1ha and not in $HOME, which is the default. Use the following command to create the database:

      $ db2 create database testdb on /db1ha  

    Ensure that all errors are corrected before proceeding to the next step; also be sure to stop DB2 using the db2stop command before proceeding to the next step.
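    Putting the verification together, a minimal sketch of the whole sequence (connecting to and dropping testdb are illustrative cleanup steps, not requirements of the installation procedure):

      $ db2start
      $ db2 create database testdb on /db1ha
      $ db2 connect to testdb
      $ db2 connect reset
      $ db2 drop database testdb
      $ db2stop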

  5. Install the DB2 HACMP scripts.

    DB2 UDB ESE supplies sample scripts for failover and user-defined events. These files are located in the /usr/opt/db2_08_01/samples/hacmp/es directory.

    In our example, we copied this directory to a special directory on the control workstation of the SP complex. Our example used /spdata/sys1/hacmp on the control workstation.

    The db2_inst_ha.local script is the tool used for installing scripts and events on multiple database partitions in an HACMP ESE environment. It was used in the following manner for our examples:

      # cd /spdata/sys1/hacmp
      # ./db2_inst_ha.local svtha1 . 15-16 TESTDB

    This will install the scripts into the /usr/bin directory on all of the servers listed (i.e., servers 15 and 16) and prepare them to work with the TESTDB database. Note that the database name needs to be in uppercase. The server selection can also be written in the form "15,16" if you want to copy the files to specific servers.

    When the application server is set up and the start and stop scripts are defined, they will call /usr/bin/rc.db2pe with a number of parameters.

    NOTE

    The start and stop scripts that are called from the application server must exist on both servers and have the same name. They do not need to have the same content if, for example, some customizing is needed.

    The db2_inst_ha.local script also copies over the HACMP/ES event stanzas. These events are defined in the db2_event_stanzas file. One example is the DB2_PROC_DOWN event, which will restart DB2 if it terminates for some reason.

    NOTE

    DB2 will also restart if terminated by the db2stop or db2stop force commands. To stop DB2 without triggering a failure event, use the ha_db2stop command.

  6. Test a failover of the resources on bf01n015 to bf01n016 .

    On the bf01n015 server:

      # unmount /db1ha/svtha1/SRV150
      # varyoffvg havg1516

    On the bf01n016 server:

      # varyonvg havg1516
      # mount /db1ha/svtha1/SRV150

    NOTE

    These are the actual steps that the HACMP software takes during failover of the necessary file systems.

    Once DB2 HACMP is configured and set up, any changes made (for example, to the ID, groups, AIX system parameters, or the level of DB2 code) must be done on all servers. Following are some examples:

    • The HACMP cluster is active on the bf01n015 server, and the password is changed on that server. When failover happens to the bf01n016 server and the user tries to log on, the new password will not work. Therefore, the administrator must ensure that passwords are kept synchronized.

    • If the ulimit parameter on the bf01n015 server is changed, it must also be changed on the bf01n016 server. For example, suppose the file size is set to unlimited on the bf01n015 server. When a failover happens to the bf01n016 server and the user tries to access a file that is greater than the default size of 1 GB, an error is returned.

    • If the AIX parameter maxuproc is changed on the bf01n015 server, it also must be changed on the bf01n016 server. When a failover occurs and DB2 begins running on the bf01n016 server, it may reach the maxuproc value and return errors.

    • If non-DB2 software is installed on the bf01n015 server but not on the bf01n016 server, the software will not be available when a failover takes place.

    • Suppose that the database manager configuration parameter svcename is used and that /etc/services is updated on the bf01n015 server. If the bf01n016 server does not receive the same update and a failover occurs, DB2 will report warnings during db2start, will not start the TCP/IP communications listeners, and DB2 clients will report errors.
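    For illustration, assume the instance uses a connection service named db2c_svtha1 on port 50000 (both values are hypothetical). The entry in /etc/services would then need to be identical on both servers:

      db2c_svtha1     50000/tcp

    The instance would be configured to use it with db2 update dbm cfg using svcename db2c_svtha1.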

High Availability on the Windows Operating System

Microsoft Cluster Service (MSCS) is a feature of Windows NT Server, Windows 2000 Server, and Windows .NET Server operating systems. It is the software that supports the connection of two servers (up to four servers in DataCenter Server) into a cluster for high availability and easier management of data and applications.

MSCS can also automatically detect and recover from server or application failures. It can be used to move server workloads to balance machine utilization and to provide for planned maintenance without downtime.

DB2 MSCS Components

A cluster is a configuration of two or more servers, each of which is an independent computer system. The cluster appears to network clients as a single server.

The servers in an MSCS cluster are connected using one or more shared storage buses and one or more physically independent networks. A network that connects only the servers but does not connect the clients to the cluster is referred to as a private network . The network that supports client connections is referred to as the public network. There are one or more local disks on each server. Each shared storage bus attaches to one or more disks. Each disk on the shared bus is owned by only one server of the cluster at a time. The DB2 software resides on the local disk. DB2 database files (tables, indexes, log files, etc.) reside on the shared disks. Because MSCS does not support the use of raw partitions in a cluster, it is not possible to configure DB2 to use raw devices in an MSCS environment.

The DB2 Resource

In an MSCS environment, a resource is an entity that is managed by the clustering software. For example, a disk, an IP address, or a generic service can be managed as a resource. DB2 integrates with MSCS by creating its own resource type called DB2. Each DB2 resource manages a DB2 instance; when running in a partitioned database environment, each DB2 resource manages a database partition. The name of the DB2 resource is the instance name; in the case of a partitioned database environment, the name of the DB2 resource consists of both the instance name and the partition number.

Pre-Online and Post-Online Scripts

You can run scripts both before and after a DB2 resource is brought online. These scripts are referred to as pre-online and post-online scripts. Pre-online and post-online scripts are .BAT files that can run DB2 and system commands.

In a situation when multiple instances of DB2 may be running on the same machine, you can use the pre-online and post-online scripts to adjust the configuration so that both instances can be started successfully. In the event of a failover, you can use the post-online script to perform manual database recovery. Post-online scripts can also be used to start any applications or services that depend on DB2.
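As a minimal, hypothetical sketch, a post-online script might start an application service that depends on DB2 (the service name here is an assumption for illustration only):

  @echo off
  REM post-online script: runs after the DB2 resource is brought online
  REM start an application service that depends on DB2 (name is illustrative)
  net start "MyAppService"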

The DB2 Group

Related or dependent resources are organized into resource groups. All resources in a group move between cluster servers as a unit. For example, in a typical DB2 single-partition cluster environment, there will be a DB2 group that contains the following resources:

  1. DB2 resource. The DB2 resource manages the DB2 instance.

  2. IP Address resource. The IP Address resource allows client applications to connect to the DB2 server.

  3. Network Name resource. The Network Name resource allows client applications to connect to the DB2 server by using a name, rather than an IP address. The Network Name resource has a dependency on the IP Address resource. The Network Name resource is optional. (Configuring a Network Name resource may affect the failover performance.)

  4. One or more Physical Disk resources. Each Physical Disk resource manages a shared disk in the cluster.

NOTE

The DB2 resource is configured to depend on all other resources in the same group, so the DB2 server can be started only after all other resources are online.


Two types of configuration are available:

  • Hot standby

  • Mutual takeover

In a partitioned database environment, the clusters do not all have to have the same type of configuration. You can have some clusters that are set up to use hot standby and others that are set up for mutual takeover. For example, if your DB2 instance consists of five workstations, you can have two machines set up to use a mutual takeover configuration, two to use a hot-standby configuration, and one machine not configured for failover support.

Hot standby configuration

In a hot standby configuration, one machine in the MSCS cluster provides dedicated failover support, and the other machine participates in the database system. If the machine participating in the database system fails, the database server on it will be started on the failover machine. If, in a partitioned database system, you are running multiple logical database partitions on a machine and it fails, the logical database partitions will be started on the failover machine.

Mutual takeover configuration

In a mutual takeover configuration, both workstations participate in the database system (i.e., each machine has at least one database server running on it). If one of the workstations in the MSCS cluster fails, the database server on the failing machine will be started to run on the other machine. In a mutual takeover configuration, a database server on one machine can fail independently of the database server on another machine.

Clustered Servers for High Availability

In this example, we define a dbclust cluster. The members of the cluster are serv1 and serv2. Clients communicate over the public network to the cluster through the IP address assigned to the cluster's host name. The cluster's host name can be assigned to only one member of the cluster at any given time but can move to any member of the cluster. The shared storage is accessible to all members in the cluster but can be assigned to only one member server at any given time. The member servers use a private network to check on the vitality of other members in the cluster (this is called the server's heartbeat).

The basic components consist of two servers that establish a cluster when MSCS software is installed and configured on both servers. Prior to installing the MSCS software, these two servers must be able to communicate with each other over a network.

A dedicated private network between the two servers is highly recommended, so that heartbeats can be communicated without interference from traffic on the public network. Both servers must also have access to shared storage.

The MSCS software configuration process will create several default cluster resource types. These include a network name, an IP address, and at least one physical disk that is referred to as the quorum drive and usually assigned the Q: drive letter. The network name represents the cluster's host name that is registered in DNS and assigned the IP address. The primary purpose of the cluster's network name is to manage the cluster by a DNS name.

Before Installing Microsoft Cluster Service

Prior to installing and configuring the MSCS software, there are a number of pre-installation tasks that need to be addressed.

  • Verify the hardware compatibility list prior to performing the install.

  • Verify that your system or components are on the Hardware Compatibility List (HCL) as part of your planning effort. (Microsoft maintains a HCL specifically for cluster implementations. The HCL includes both complete systems and system components that are certified for MSCS.)

  • Verify that each server can access the storage once shared storage is physically connected to all servers in the cluster.

  • Verify that all of the shared disks that will be used in the cluster are defined as type basic, have a drive letter assigned, are formatted, and contain no mounted volumes, because MSCS does not support dynamic disk volumes, physical disk names, mounted volumes, or raw devices.

  • Verify that you are not using Network Adapter Fault Tolerance or Load Balancing for the private network adapters, because MSCS does not support this type of configuration for private heartbeat communications; MSCS provides fault tolerance by using one or more public networks as a backup to the private heartbeat network.

  • Verify the order in which your network connections are accessed by DNS and network services. Select Start, Settings, Control Panel, Network and Dial-up Connections, Advanced, Advanced Settings. Verify that the public network connections are listed first.

  • Verify that no other client, service, or protocol network components are used by the private network connection except Internet Protocol (TCP/IP). Select Start, Settings, Control Panel, Network and Dial-up Connections, Private LAN, Properties, and scroll the list found on the properties dialog.

  • Verify that your private network connection has a static TCP/IP address. Select Start, Settings, Control Panel, Network and Dial-up Connections, Private LAN, Properties, Internet Protocol (TCP/IP), Properties.

  • Verify that NetBIOS over TCP/IP is disabled. Select Start, Settings, Control Panel, Network and Dial-up Connections, Private LAN, Properties, Internet Protocol (TCP/IP), Properties, Advanced, WINS.

  • Verify that the Link Speed & Duplex setting on the private network adapter is set to 10Mbps/Half Duplex. Select Start, Settings, Control Panel, Network and Dial-up Connections, Private LAN, Properties, Configure, Advanced, Link Speed & Duplex.

Installing Microsoft Cluster Service

If the MSCS was installed as part of the initial operating system load, you can start the Cluster Service Configuration Wizard by selecting the Control Panel, Add/Remove Programs, Add/Remove Windows Components, Configure.

If the MSCS was not installed as part of the initial operating system load, you can install it by selecting Control Panel, Add/Remove Programs, Add/Remove Windows Components, Components, and the Cluster Service Configuration Wizard will start as part of this installation process.

Tasks to be performed during the installation and configuration using the Cluster Service Configuration Wizard:

  • Create a new cluster or join an existing cluster. Because this is the first node in the cluster, we will create a new cluster.

  • Define the cluster name, in this case, dbclust. This will be the name we use to manage the cluster, either locally or remotely, with the MSCS Cluster Administrator.

  • Create a domain user account that will be used to run the Cluster Service, and add this account to the local Administrators group. If possible, set the account password to never expire. Otherwise, be aware that you will need to change the password within the Services Microsoft Management Console (MMC) when the account password expires.

  • The Cluster Service Configuration Wizard presents a list of all disks on the shared storage that are supported by the Clustering Service. If your storage does not appear as expected, go back to the preinstall tasks and verify that the storage is configured correctly.

  • Select the disk partition that will be used for the quorum drive.

  • Select the network connection that will be used for the private heartbeat communications.

  • Select All communications (mixed network). This will enable the public network connection to work as a backup to the private heartbeat network.

  • Verify that the private network connection has the priority for internal cluster communication.

  • Assign a TCP/IP address to the cluster. This address will be used to manage the cluster over the public network. As the cluster moves from one server to another, this address, along with the cluster name, moves as well.

  • If the Cluster Service Configuration Wizard is successful, you will see a final confirmation that the cluster service has started. At this point, the shared storage is managed by the Cluster Service, and the other servers can be booted.

  • Once the cluster has been created, you can open the Cluster Administrator MMC and see the default Cluster Group containing the Cluster Name, Cluster IP Address, and Disk Q: quorum drive.

  • Start the Cluster Service Configuration Wizard on the next server that will be added to the cluster.

  • Joining an existing cluster requires that we enter the cluster name.

  • Enter the password for the domain user account that was created to run the MSCS service and added to the local Administrators group.

As each individual node is added to the cluster, it will appear within the MSCS Cluster Administrator. We can see that both serv1 and serv2 are now members of the cluster dbclust from the left panel.

After Installing Microsoft Cluster Service

Once the MSCS software has been installed on all servers within the cluster, we need to perform post-install tasks to verify that everything is in working order. To prepare for these tasks, we will consolidate all of the resources into one group and rename the cluster's quorum drive from Disk Q: to a more meaningful name.

  • Consolidating the Physical Disks resources

  • Renaming the quorum drive resource

  • Moving the Cluster Group

  • Initiating failure on the Cluster Group resources

  • Testing the Cluster Group

The following is a list of tests that can be performed to verify that the Cluster Service is working properly.

Test 1

Log on to the first server in the cluster, verify that the Cluster Group is currently online at this server, and open a Windows command prompt. Verify that you can ping the Cluster Group by IP address and name. Verify that you can access the quorum drive (Disk Q:). Move the Cluster Group to another member in the cluster and repeat.

  • Ping 192.168.1.51

  • Ping DBCLUST

  • DIR Q:

Test 2

Log on to a server that is not a member of the cluster, verify that the Cluster Group is currently online at the primary server, and open a Windows command prompt. Verify that you can ping the Cluster Group by IP address and name. Verify that you can access the quorum drive. Do this while moving the Cluster Group from one member of the cluster to another.

  • Ping 192.168.1.51 -t

  • Ping DBCLUST -t

  • NET USE Q: \\dbclust\q$

Test 3

Log on to a client that will use the resources of this cluster, verify that the Cluster Group is currently online at the primary server, and open a Windows command prompt. Verify that you can ping the Cluster Group by IP address and name. Verify that you can access the quorum drive.

  • Ping 192.168.1.51 -t

  • Ping DBCLUST -t

  • NET USE Q: \\dbclust\q$

Before Enabling DB2 MSCS Support

There are tasks that should be performed prior to enabling DB2 UDB HA support with MSCS.

  • Install DB2 on a local (non-clustered) drive on all servers that will participate in the cluster.

  • Create the DB2 instance on the shared storage.

  • Configure the DB2 instance that will be clustered to start manually in the Windows Services dialog.

  • Enable DB2 to fall back to the primary server as soon as the primary server is available. You may also want DB2 to move back and forth between servers for testing purposes only. To accomplish this, you must set the DB2 registry variable DB2_FALLBACK to YES.

      db2set DB2_FALLBACK=YES  
  • Create the database on the shared storage.

Enabling DB2 MSCS Support

Enabling DB2 MSCS support includes the following:

  • Enable a DB2 instance

  • Modify the DB2 MSCS configuration file

  • Modify DB2 dependencies

  • Modify DAS restart option

  • Implement pre-online and post-online scripts
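In DB2 UDB V8, these steps are typically driven by the db2mscs utility, which reads a configuration file and transforms the instance into a clustered instance. What follows is a minimal sketch of such a file for the dbclust example; the resource names, address, and disk letter are illustrative assumptions, and the full set of keywords is described in the DB2 documentation:

  DB2_INSTANCE = DB2
  CLUSTER_NAME = dbclust
  GROUP_NAME   = DB2 Group
  IP_NAME      = db2ip
  IP_ADDRESS   = 192.168.1.52
  IP_SUBNET    = 255.255.255.0
  IP_NETWORK   = Public Network
  DISK_NAME    = Disk E:

The utility is then run against the file (for example, db2mscs -f:db2mscs.cfg).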

High Availability on Sun Solaris

Although there are a number of methods to increase availability for a data service, the most common is an HA cluster. A cluster, when used for HA, consists of two or more machines, a set of private network interfaces, one or more public network interfaces, and some shared disks. This special configuration allows a data service to be moved from one machine to another. By moving the data service to another machine in the cluster, it should be able to continue providing access to its data. Moving a data service from one machine to another is called a failover.

The private network interfaces are used to send heartbeat messages, as well as control messages, among the machines in the cluster. The public network interfaces are used to communicate directly with clients of the HA cluster. The disks in an HA cluster are connected to two or more machines in the cluster, so that if one machine fails, another machine has access to them.

A data service running on an HA cluster has one or more logical public network interfaces and a set of disks associated with it. The clients of an HA data service connect via TCP/IP to the logical network interfaces of the data service only. If a failover occurs, the data service, along with its logical network interfaces and set of disks, are moved to another machine.

One of the benefits of an HA cluster is that a data service can recover without the aid of support staff, and it can do so at any time. Another benefit is redundancy. All of the parts in the cluster should be redundant, including the machines themselves. The cluster should be able to survive any single point of failure.

Even though HA data services can be very different in nature, they have some common requirements. Clients of an HA data service expect the network address and host name of the data service to remain the same and expect to be able to make requests in the same way, regardless of which machine the data service is on.

Consider a Web browser that is accessing an HA Web server. The request is issued with a URL (Uniform Resource Locator), which contains both a host name and the path to a file on the Web server. The browser expects both the host name and the path to remain the same after a failover of the Web server. If the browser is downloading a file from the Web server and the server is failed over, the browser will need to reissue the request.

Availability of a data service is measured by the amount of time the data service is available to its users. The most common unit of measurement for availability is the percentage of "up time"; this is often referred to as the number of nines:

99.99% => service is down for (at most) 52.6 minutes/year

99.999% => service is down for (at most) 5.26 minutes/year

99.9999% => service is down for (at most) 31.5 seconds/year
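(These figures follow from a 365.25-day year of 525,960 minutes: 525,960 × 0.0001 ≈ 52.6 minutes, 525,960 × 0.00001 ≈ 5.26 minutes, and 525,960 × 0.000001 ≈ 31.5 seconds.)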

Hot Standby

Hot standby is the simplest HA cluster topology. In this scenario, the primary machine is hosting the production database instance and associated resources. A second idle machine is available to host the production database instance and associated resources, should a failure occur on the primary machine. The second machine can also be running a workload (perhaps another DB2 instance) in order to maximize resource use.

Mutual Takeover

In the mutual takeover case, you can envision a cluster of N servers as N/2 pairs of servers: within each pair, each server is responsible for failover support of the other, and so on, until you reach the Nth server.

Note that this scenario requires that N be an even number.

The advantage of this configuration is that in the normal (non-failure) case, all machines are hosting database resources and are performing productive work. The primary disadvantage is that, during the failure period (the period after one of the hardware resources has failed and before its repair), there is one server that is required to support, on average, twice the workload of any other physical server.

Mutual takeover ( N + 1)

A single defined server serves as the standby. This case relies on an N server cluster, with one defined server as the standby for all N servers. The advantage of this scenario is that there is no performance degradation during the failure period (the period after one of the hardware resources has failed and before its repair). The primary disadvantage is that approximately 1/(N + 1) of the aggregate physical computing resource goes unused during normal operation.

Pair + M (N + M)

M defined servers serve as the hot standby for the N servers. This case relies on an N server cluster, with M defined servers as the hot standby for each of the N servers. Essentially, this is the default cluster topology configured by the regdb2udb utility, where N is equal to the number of physical servers in the cluster and M is equal to N − 1. The prime advantage of this configuration is that the environment is fully redundant; up to N − 1 server failures can be tolerated while still maintaining full database access (subject, of course, to increased query response times due to capacity constraints when there are fewer than N servers in the cluster). In this way, DB2 UDB ESE, used in conjunction with Sun Cluster 3.0, ensures full database software redundancy and is most appropriate for environments requiring the highest degree of availability.

Fault Tolerance

Another way to increase the availability of a data service is fault tolerance. A fault tolerant machine has all of its redundancy built in and should be able to withstand a single failure of any part, including CPU and memory. Fault-tolerant machines are most often used in niche markets and are usually expensive to implement. An HA cluster with machines in different geographical locations has the added advantage of being able to recover from a disaster affecting only a subset of those locations.

An HA cluster is the most common solution to increase availability because it is scalable, easy to use, and relatively inexpensive to implement.

Failover

Sun Cluster 3.0 provides HA by enabling application failover. Each server is periodically monitored, and the cluster software automatically relocates a cluster-aware application from a failed primary server to a designated secondary server. When a failover occurs, clients may experience a brief interruption in service and may have to reconnect to the server.

However, they will not be aware of the physical server from which they are accessing the application and the data. By allowing other servers in a cluster automatically to host workloads when the primary server fails, Sun Cluster 3.0 can significantly reduce downtime and increase productivity.

Multihost Disks

Sun Cluster 3.0 requires multihost disk storage. This means that disks can be connected to more than one server at a time. In the Sun Cluster 3.0 environment, multihost storage allows disk devices to become highly available. Disk devices that reside on multihost storage can tolerate single-server failures because there is still a physical path to the data through the alternate server. Multihost disks can be accessed globally through a primary server. If client requests are accessing the data through one server and that server fails, the requests are switched over to another server that has a direct connection to the same disks. A volume manager provides for mirrored or RAID 5 configurations for data redundancy of the multihost disks.

Currently, Sun Cluster 3.0 supports Solstice DiskSuite and VERITAS Volume Manager as volume managers. Combining multihost disks with disk mirroring and striping protects against both server failure and individual disk failure.

Global Devices

Global devices are used to provide cluster-wide HA access to any device in a cluster, from any server, regardless of the physical device location. All disks are included in the global namespace with an assigned device ID (DID) and are configured as global devices. Therefore, the disks themselves are visible from all cluster servers.

File Systems/Global File Systems

A cluster or global file system is a proxy between the kernel (on one server) and the underlying file system volume manager (on a server that has a physical connection to one or more disks). Cluster file systems are dependent on global devices with physical connections to one or more servers. They are independent of the underlying file system and volume manager. Currently, cluster file systems can be built on UFS, using either Solstice DiskSuite or VERITAS Volume Manager. The data becomes available to all servers only if the file systems on the disks are mounted globally as a cluster file system.

Device Group

All multihost disks must be controlled by the Sun Cluster framework. Disk groups, managed by either Solstice DiskSuite or VERITAS Volume Manager, are first created on the multihost disk. Then they are registered as Sun Cluster disk device groups. A disk device group is a type of global device.

Multihost device groups are HA. Disks are accessible through an alternate path if the server currently mastering the device group fails. The failure of the server mastering the device group does not affect access to the device group, except for the time required to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.

Resource Group Manager

The Resource Group Manager (RGM) provides the mechanism for HA and runs as a daemon on each cluster server. It automatically starts and stops resources on selected servers according to preconfigured policies. The RGM allows a resource to be highly available in the event of a server failure or reboot by stopping the resource on the affected server and starting it on another. The RGM also automatically starts and stops resource-specific monitors that can detect resource failures and relocate failing resources onto another server.

Data Services

The term data service is used to describe a third-party application that has been configured to run on a cluster, rather than on a single server. A data service includes the application software and Sun Cluster 3.0 software that starts, stops, and monitors the application. Sun Cluster 3.0 supplies data service methods that are used to control and monitor the application within the cluster. These methods run under the control of the RGM, which uses them to start, stop, and monitor the application on the cluster servers. These methods, along with the cluster framework software and multihost disks, enable applications to become HA data services. As HA data services, they can prevent significant application interruptions after any single failure within the cluster, regardless of whether the failure is on a server, on an interface component, or in the application itself. The RGM also manages resources in the cluster, including network resources (logical host names and shared addresses) and application instances.

Resource Type, Resource, and Resource Group

A resource type is made up of the following:

  1. A software application to be run on the cluster.

  2. Control programs used as callback methods by the RGM to manage the application as a cluster resource.

  3. A set of properties that form part of the static configuration of a cluster.

The RGM uses resource type properties to manage resources of a particular type.

A resource inherits the properties and values of its resource type. It is an instance of the underlying application running on the cluster. Each instance requires a unique name within the cluster. Each resource must be configured in a resource group. The RGM brings all resources in a group online and offline together on the same server. When the RGM brings a resource group online or offline, it invokes callback methods on the individual resources in the group.

The servers on which a resource group is currently online are called its primary servers, or its primaries. A resource group is mastered by each of its primaries. Each resource group has an associated server list property, set by the cluster administrator, to identify all potential primaries or masters of the resource group.

High Availability with VERITAS Cluster Server

VERITAS Cluster Server (VCS) can be used to eliminate both planned and unplanned downtime. It can facilitate server consolidation and effectively manage a wide range of applications in heterogeneous environments.

VCS supports clusters of up to 32 servers in both Storage Area Network (SAN) and traditional client/server environments. VCS can protect everything from a single critical database instance to very large multi-application clusters in networked storage environments. This section provides a brief summary of the features of VCS.

Failover

VCS is an availability clustering solution that manages the availability of application services, such as DB2 UDB, by enabling application failover. The states of each individual cluster server and its associated software services are regularly monitored. When a failure occurs that disrupts the application service (in this case, the DB2 UDB service), VCS and/or the VCS HA-DB2 Agent detect the failure and automatically take steps to restore the service. This can include restarting DB2 UDB on the same server or moving DB2 UDB to another server in the cluster and restarting it on that server. If an application needs to be migrated to a new server, VCS moves everything associated with the application (i.e., network IP addresses, ownership of underlying storage) to the new server so that users will not be aware that the service is actually running on another server. They will still access the service using the same IP addresses, but those addresses will now point to a different cluster server.

When a failover occurs with VCS, users may or may not see a disruption in service. This will be based on the type of connection (stateful or stateless) that the client has with the application service. In application environments with stateful connections (such as DB2 UDB), users may see a brief interruption in service and may need to reconnect after the failover has completed. In application environments with stateless connections (such as NFS), users may see a brief delay in service but generally will not see a disruption and will not need to log back on.

By supporting an application as a service that can be automatically migrated between cluster servers, VCS can not only reduce unplanned downtime, but can also shorten the duration of outages associated with planned downtime (i.e., for maintenance and upgrades). Failovers can also be initiated manually. If a hardware or operating system upgrade must be performed on a particular server, DB2 UDB can be migrated to another server in the cluster, the upgrade can be performed, and DB2 UDB can then be migrated back to the original server.

Applications recommended for use in these types of clustering environments should be crash tolerant. A crash tolerant application can recover from an unexpected crash while still maintaining the integrity of committed data.

Crash tolerant applications are sometimes referred to as cluster friendly applications . DB2 UDB is a crash tolerant application.

Shared Storage

When used with the VCS HA-DB2 Agent, VCS requires shared storage. Shared storage is storage that has a physical connection to multiple servers in the cluster. Disk devices resident on shared storage can tolerate server failures because a physical path to the disk devices still exists through one or more alternate cluster servers.

Through the control of VCS, cluster servers can access shared storage through a logical construct called disk groups . Disk groups represent a collection of logically defined storage devices whose ownership can be atomically migrated between servers in a cluster. A disk group can be imported to only a single server at any given time. For example, if Disk Group A is imported to Server1 and Server1 fails, Disk Group A can be exported from the failed server and imported to a new server in the cluster. VCS can simultaneously control multiple disk groups within a single cluster.

In addition to allowing disk group definition, a volume manager can provide for redundant data configurations, using mirroring or RAID 5, on shared storage. VCS supports VERITAS Volume Manager and Solstice DiskSuite as logical volume managers. Combining shared storage with disk mirroring and striping can protect against both server failure and individual disk or controller failure.

VERITAS Cluster Server Global Atomic Broadcast and Low Latency Transport

An interserver communication mechanism is required in cluster configurations so that servers can exchange information concerning hardware and software status, keep track of cluster membership, and keep this information synchronized across all cluster servers. The Global Atomic Broadcast (GAB) facility, running across a low-latency transport (LLT), provides the high-speed, low-latency mechanism used by VCS to do this. GAB is loaded as a kernel module on each cluster server and provides an atomic broadcast mechanism that ensures that all servers get status update information at the same time.

By leveraging kernel-to-kernel communication capabilities, LLT provides the high-speed transport for all information that needs to be exchanged and synchronized between cluster servers. GAB runs on top of LLT. VCS does not use IP as a heartbeat mechanism but offers two other, more reliable options. GAB with LLT can be configured to act as a heartbeat mechanism, or a GABdisk can be configured as a disk-based heartbeat. The heartbeat must run over redundant connections. These connections can be either two private Ethernet connections between cluster servers or one private Ethernet connection and one GABdisk connection. The use of two GABdisks is not a supported configuration, because the exchange of cluster status between servers requires a private Ethernet connection.

For more information about GAB or LLT, or how to configure them in VCS configurations, please consult the VERITAS Cluster Server User's Guide for Solaris.

Bundled and Enterprise Agents

An agent is a program that is designed to manage the availability of a particular resource or application. When an agent is started, it obtains the necessary configuration information from VCS, then periodically monitors the resource or application and updates VCS with the status. In general, agents are used to bring resources online, take resources offline, or monitor resources and provide four types of services: start, stop, monitor, and clean.

Start and stop are used to bring resources online or offline, monitor is used to test a particular resource or application for its status, and clean is used in the recovery process.

A variety of bundled agents are included as part of VCS and are installed when VCS is installed. The bundled agents are VCS processes that manage predefined resource types commonly found in cluster configurations (i.e., IP, mount, process, and share), and they help to simplify cluster installation and configuration considerably. There are over 20 bundled agents with VCS.

Enterprise agents tend to focus on specific applications, such as DB2 UDB. The VCS HA-DB2 Agent can be considered an Enterprise Agent, and it interfaces with VCS through the VCS Agent framework.

VCS Resources, Resource Types, and Resource Groups

A resource type is an object definition used to define resources within a VCS cluster that will be monitored. A resource type includes the resource type name and a set of properties associated with the resource that are salient from an HA point of view. A resource inherits the properties and values of its resource type, and resource names must be unique on a cluster-wide basis.

There are two types of resources: persistent and standard (non-persistent). Persistent resources are resources such as network interface controllers (NICs) that are monitored but are not brought online or taken offline by VCS. Standard resources are those whose online and offline status is controlled by VCS.

The lowest level object that is monitored is a resource, and there are various resource types (e.g., share, mount). Each resource must be configured into a resource group, and VCS will bring all resources in a particular resource group online and offline together. To bring a resource group online or offline, VCS will invoke the start or stop methods for each of the resources in the group. There are two types of resource groups: failover and parallel. An HA DB2 UDB configuration, regardless of whether it is partitioned or not, will use failover resource groups.

A "primary" or "master" server is a server that can potentially host a resource. A resource group attribute called systemlist is used to specify which servers within a cluster can be primaries for a particular resource group. In a two-server cluster, usually both servers are included in the systemlist, but in larger, multiserver clusters that may be hosting several HA applications, there may be a requirement to ensure that certain application services (defined by their resources at the lowest level) can never fail over to certain servers.

Dependencies can be defined between resource groups, and VCS depends on this resource group dependence hierarchy in assessing the impact of various resource failures and in managing recovery. For example, if the resource group ClientApp1 cannot be brought online unless the resource group DB2 has already been successfully started, resource group ClientApp1 is considered dependent on resource group DB2.

Logical Hostname/IP Failover

A logical hostname, together with the IP address to which it maps, must be associated with a particular DB2 UDB ESE instance. Client programs will access the DB2 database instance using this logical hostname instead of the physical hostname of a server in the cluster. This logical hostname is the entry point to the cluster, and it shields the client program from addressing the physical servers directly. That is, this logical hostname/IP address is what DB2 TCP/IP clients catalog (via the catalog tcpip node DB2 command).
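For example, a client might catalog the instance as follows (the node name, database alias, and port are illustrative; sc30 is the logical hostname used in the sample configuration later in this chapter):

  db2 catalog tcpip node sc30nd remote sc30 server 50000
  db2 catalog database proddb at node sc30nd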

This logical hostname is configured as a logical hostname resource and must be added to the same resource group as the instance resource. In the case of a failure, the entire resource group, including the instance and the logical host name, will be failed over to the backup server. This floating IP setup provides HA DB2 service to client programs.

Ensure that this hostname maps to an IP address and that this name-to-IP address mapping is configured on all servers in the cluster, preferably in /etc/inet/hosts on each server. More information on configuration for public IP addresses can be found in the Sun Cluster 3.0 Installation Guide.
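A hedged example of such a mapping in /etc/inet/hosts on every server (the address is an assumption):

  192.168.10.30   sc30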

Considerations for High Availability with DB2 ESE

  • The logical hostname/IP must be collocated with the instance to ensure that it will always be local to the DB2 instance.

  • Ensure that the instance is not autostarted; the instance start and stop should be under the control of the Sun Cluster infrastructure.

  • The DB2 registry setting DB2SYSTEM should refer to the logical hostname, rather than the physical hostname.

  • Configure the $INSTHOME/sqllib/db2nodes.cfg file and the /etc/services file in order to allow for communications between database partitions.

  • Ensure that the port range in the /etc/services file is sufficiently large to support all failover scenarios envisioned.

  • File containers of DMS table spaces and containers of SMS table spaces need to reside on mounted file systems.

  • The disks for the file system must be in a disk group of the logical host responsible for the database partitions that need them.

One Logical Hostname

The DB2-HA package will create one logical hostname resource for a particular DB2 UDB ESE instance, and this logical hostname resource is added to the same resource group as the first partition in the instance (as defined by the first entry in the $INSTHOME/sqllib/db2nodes.cfg). In this case, client programs will use the logical hostname to access this DB2 UDB ESE instance. Therefore, this partition will be the coordinator partition (regardless of where that particular DB2 partition is physically hosted). This is the default install behavior of the DB2-HA package and is the most common configuration scenario.

DB2 UDB ESE is designed with symmetrical data access across partitions in the sense that client programs may access any database partition as an entry point to the DB2 UDB ESE instance and receive the same result sets from their queries, regardless of the coordinator partition used to process the query. Thus, a DB2 UDB ESE installation provides access redundancy (when the DB2 UDB ESE instance exists initially on more than one physical server). Here, the client program can access the DB2 UDB ESE instance through a round-robin selection of all available physical server names for the instance (or for a subset, provided that the subset contains at least two distinct physical servers). In the case of a failover, the DB2 UDB ESE instance can be accessed through any of the remaining healthy host names/IP addresses.

N Logical Hostnames

If the demands of the application require access to a particular DB2 UDB ESE partition, a logical hostname can be associated with each partition. Using Sun Cluster 3.0 administrative commands, each logical hostname resource can be added and grouped with its corresponding DB2 partition resource.

Consequently, the logical hostname/IP address will failover, together with its associated DB2 UDB ESE partition resource. Thus, connections to a logical hostname/IP address will always be associated with a connection to a particular DB2 UDB ESE coordinator partition.

Sun Cluster 3.0 DB2-HA Agent Packages

There are four methods that are used to control the way DB2 UDB is registered, removed, brought online, or taken offline in a Sun Cluster 3.0 environment. Note that, although there are a number of other components in the package, only these four can be called directly.

 regdb2udb 

This method will register appropriate resources and resource groups for a specified instance. Note that it will not attempt to bring online any resources. This will usually be the first script called, because it will perform all necessary steps to prepare DB2 UDB for Sun Cluster 3.0 control.

 unregdb2udb 

This method will execute the required Sun Cluster 3.0 commands in order to remove DB2-HA (including resources and groups registered for this instance) from the cluster. Essentially, this method is the inverse of regdb2udb and will generally be called if the instance is no longer required to be HA.

 onlinedb2udb 

This method will execute required Sun Cluster 3.0 commands in order to bring a DB2-HA instance online. It will not create any resources or resource groups.

 offlinedb2udb 

This method will execute required Sun Cluster 3.0 commands in order to bring a DB2-HA instance offline. It will not remove any resources or groups from the Sun Cluster 3.0 infrastructure.

Note the naming convention of the resources and resource groups and their structure. The instance we have made HA is clearly a two-database partition DB2 UDB instance. The partition numbers (also referred to as DB2 logical database partition numbers) are 0 and 1, and the instance name is db2inst1. For each partition, we can see that exactly one resource group is created, and within that resource group, there is exactly one resource (the HA hostname/IP address has been discussed earlier). This allows for fine-grained control of the movement of the DB2 UDB database partitions across the physical servers of the complex.

The naming system is rather mechanical and is chosen to ensure name uniqueness, regardless of the number of instances or partitions that are to be made HA.

The naming convention is as follows:

  • The string "db2_"

  • Followed by the name of the instance (in this case "db2inst1")

  • Followed by the string "_"

  • Followed by the partition number of the instance (note that, for a single-partition instance, the partition number will be represented as the number 0)

  • Followed by the string "-"

  • Followed by the string "rs" to represent a resource, or the string "rg" to represent a resource group.
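For example, database partition 0 of the instance db2inst1 yields the resource name db2_db2inst1_0-rs and the resource group name db2_db2inst1_0-rg, the names that appear in the verification tests later in this section.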

Note the one-to-one mapping of DB2 resources to DB2 resource groups (and of a particular DB2 instance's logical partition resources to the Sun Cluster 3.0 resources).

In addition, there is one HA hostname/IP address. This is the address that, for example, will be used by clients to catalog the databases in this instance. This hostname/IP address (if present) is always associated with the first DB2 resource group for the instance (first when reading the db2nodes.cfg file from top to bottom). For a single-partition instance, the address is associated with the only DB2 resource group defined for that instance.

Sample Configuration Sun Cluster 3.x and DB2 UDB

We will create a simple DB2 UDB environment on a Sun Cluster platform. The DB2 UDB instance name is db2inst1, and the HA hostname is sc30.

Assumptions

It is presumed that:

  • The reader is familiar with the motivation for the implementation of an HA solution.

  • The reader is familiar with the basic terminology in this field.

  • The reader has some experience with DB2 UDB and with Sun Cluster.

NOTE

When using Sun Cluster 3.0 or VCS, ensure that DB2 instances are not started at boot time by using the db2iauto utility, as follows:

 db2iauto -off InstName 

where InstName is the login name of the instance.


Installation of DB2 binary

The DB2 Universal Database setup utility will install the executable files on the path /opt/IBM/db2.

Prior to performing the install, you must ensure that this mount point is on a global device. This can be accomplished by mounting this path directly or by providing a symbolic link from this path to a global mount point.

For example, on one cluster server, run:

  mkdir -p /global/scdg2/scdg2/opt/IBM/db2  

On remaining cluster servers, run:

  ln -s /global/scdg2/scdg2/opt/IBM/db2 /opt/IBM/db2  

Besides /opt/IBM/db2, /var/db2 can also be placed on a global file system. Some profile registry values and environment variables are stored in the files in /var/db2. Use the db2setup tool to create the instance. Ensure that the instance is not autostarted; the instance start and stop should be under the control of the Sun Cluster infrastructure. Additionally, the DB2 registry setting DB2SYSTEM should refer to the logical hostname, rather than the physical hostname.

The DB2 binary should be installed on a global shared file system. You must also take steps to ensure that the license key is available in the case of failover. You can achieve this in one of two ways:

  1. Install the license key on each machine in the cluster using the db2licm tool.

  2. Mount the license key location as a global mount point (this location, at the time of writing, is /var/lum) and, afterward, install the license key on exactly one server in the cluster.

The /etc/services file reserves a range of ports required for DB2 UDB communications. Ensure that the port range is sufficiently large to support all failover scenarios envisioned. For simplicity, we recommend that you configure the port range to be as large as the number of database partitions in the instance. You must configure the same port range for all cluster servers in the cluster.
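For illustration, a reserved range for the two-partition db2inst1 instance might look like the following in /etc/services on every cluster server (the port numbers are assumptions):

  DB2_db2inst1      60000/tcp
  DB2_db2inst1_1    60001/tcp
  DB2_db2inst1_2    60002/tcp
  DB2_db2inst1_END  60003/tcp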

The entries that compose the db2nodes.cfg file determine the logical-to-physical mapping of the DB2 logical database partition to the appropriate physical host.

For each database partition that is expected to be subjected to significant disk activity, it is strongly recommended that the partition exist on a physical server with at least one local cluster file system mount point (i.e., a mount point for which that physical server is the primary).

The cluster must be configured to give the user remote shell access from every server in the cluster to every server in the cluster (this step is required for multiple database partitions only). Generally, this is accomplished through the creation of a .rhosts file in the instance home directory. When this is completed, remote shell commands should proceed unprompted, and the db2_all command should execute without error.

As instance owner, issue the following command:

  db2_all date  

This should return the correct date and time from each server. If not, you'll have to troubleshoot and solve this problem before proceeding.

For the instance in question, issue the following command (again, as the instance owner):

  db2start  

This should complete successfully. If it does not complete successfully at all servers, that likely means a configuration error. You must review the DB2 UDB ESE Quick Beginnings guide and resolve the problem before proceeding.

Next, attempt to stop the instance with the following command (again, as the instance owner):

  db2stop  

This should also complete successfully. Again, if for DB2 UDB ESE it does not complete successfully at all servers, that likely means a configuration error. You must review the DB2 UDB Quick Beginnings guide and resolve the problem before proceeding.

Once you've verified that the instance can be started and stopped, attempt to create the sample database (or an empty test database, if you prefer). Create the database on the path of the global device you plan to use for storage of the actual production database. When you're certain the create database command has completed successfully, remove the test or sample database, using the drop database command.
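A minimal sketch of this verification, assuming /global/db2data is the global path intended for the production database:

  db2 create database testdb on /global/db2data
  db2 drop database testdb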

The instance is now ready to be made HA. Steps to configure:

Step #1

First, use the regdb2udb utility to register the instance with the Sun Cluster 3.0 infrastructure. Use the scstat command to investigate the status of the cluster. We should see the necessary resources and resource groups registered.

  sun-ha1 # /opt/IBM/db2/V8.1/ha/sc30/util/regdb2udb -a db2inst1 -h sc30
  sun-ha1 # scstat -p

NOTE

The result of the regdb2udb processing is that the appropriate DB2 resources and resource groups are created and registered with Sun Cluster 3.0.


Step #2

Next, use the onlinedb2udb utility to bring these registered resources online. Use the scstat command to investigate the status of the cluster. We should see the necessary resources and resource groups online.

  sun-ha1 # /opt/IBM/db2/V8.1/ha/sc30/util/onlinedb2udb -a db2inst1 -h sc30
  sun-ha1 # scstat -p

NOTE

The result of the online processing is that the db2inst1 instance (and its associated HA IP address) is online and under the control of Sun Cluster 3.0.


As a result of this, you should see the resources brought online at the appropriate server (for example, on the physical hostname sun-ha1, you should see that it hosts the HA IP address for sc30, as well as the processes for the instance db2inst1, database partition 0).

There are two more supplied scripts that we will now discuss: offlinedb2udb and unregdb2udb.

Typically, you may wish to take a DB2 instance offline in order to remove the DB2 resources from Sun Cluster 3.0 control; for example, you may wish to bring the database engine down for an extended period of time. Directly issuing the appropriate DB2 commands (for example, db2start, db2stop) will be ineffective, because Sun Cluster 3.0 will interpret the absence of resources caused by the successful completion of the db2stop command as a failure and attempt to restart the appropriate database instance resources.

Instead, you must bring the resources offline as follows:

  sun-ha1 # offlinedb2udb -a db2inst1 -h sc30
  sun-ha1 # scstat -p

As you can see from the scstat output, all resources are now offline. No resources associated with the instance db2inst1 should be present on either server, nor will Sun Cluster 3.0 take any action to protect this instance, should it be brought online manually (i.e., via db2start) and a failure occur.

Let's assume that you've decided to remove this instance permanently from Sun Cluster 3.0 monitoring and control. For this task, you may use the unregdb2udb utility. Note that this utility merely interfaces with Sun Cluster 3.0 to perform the deregistration; the instance itself is neither dropped nor removed.

  sun-ha1 # unregdb2udb -a db2inst1 -h sc30
  sun-ha1 # scstat -p

Configuration of Multiple DB2 Instances

For each additional instance, including the DAS, that you wish to make HA, you must run the regdb2udb command to register the instance with Sun Cluster 3.0. To enable multiple HA DB2 instances, each HA DB2 instance requires a distinct HA hostname/IP address, and each HA hostname/IP address is uniquely associated with exactly one instance.
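
For example, a hypothetical second instance, db2inst2, with its own HA hostname, sc31, would be registered as follows:

  sun-ha1 # /opt/IBM/db2/V8.1/ha/sc30/util/regdb2udb -a db2inst2 -h sc31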

For the DAS instance, the DAS instance name is used as the instance name argument when running the regdb2udb command. For example:

  sun-ha1 # /opt/IBM/db2/V8.1/ha/sc30/util/regdb2udb -a db2as -h daslogicalhostname  
Cluster Verification Testing

Testing is an important aspect of an HA cluster. The purpose of testing is to gain some confidence that the HA cluster will function as envisioned for various failure scenarios. What follows is a set of minimum recommended scenarios for cluster testing and verification. These tests should be run regularly to ensure that the cluster continues to function as expected. Timing will vary, depending on production schedules, the degree to which the cluster state evolves over time, and management diligence.

Test 1

In this test, we use Sun Cluster 3.0 management commands to ensure that the db2inst1 instance can be controlled correctly.

First, verify that the instance is accessible from the clients (or locally) and that various database commands complete successfully (for example, create database).
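
A minimal local check, assuming the sample database exists in the instance, might look like this:

  db2 connect to sample
  db2 "select count(*) from syscat.tables"
  db2 terminate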

Take the db2_db2inst1_0-rs and sc30 resources offline, using the following:

  scswitch -n -j db2_db2inst1_0-rs
  scswitch -n -j sc30

Observe that the DB2 instance resources no longer exist on any server in the cluster, and the HA hostname sc30 is likewise inaccessible.

From the perspective of a client of this instance, the existing DB2 connections are closed, and new connection attempts wait until the appropriate DB2 and IP resources are brought back online.

Test 2

To return the resources to their previous states, bring them online with the following Sun Cluster 3.0 commands:

  scswitch -e -j sc30
  scswitch -e -j db2_db2inst1_0-rs

DB2 clients left waiting in the previous test will now be able to connect and resubmit their transactions, picking up from the point of the last failure. The client program must issue retries to accomplish this, as in the sketch below.
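
A minimal client-side retry loop, assuming the DB2 CLP and a database alias of sample, might look like this:

  until db2 connect to sample
  do
      sleep 10                 # Wait for the DB2 and IP resources to return
  done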

Test 3

At this point, the cluster is again in its initial state. Now test the failover of the DB2 instance and its associated resources from sun-ha1 onto sun-ha2.

Bring the resources contained within the resource group db2_db2inst1_0-rg offline, using the commands described in Test 1.

Then move the containing resource group to the secondary server, using the following Sun Cluster 3.0 command:

  scswitch -z -g db2_db2inst1_0-rg -h sun-ha2  

Now attempt to enable the relevant resources, using the same commands described in Test 2.

You should see the DB2 resource for db2inst1 and the associated hostname/IP address now hosted by the secondary machine, sun-ha2. Verify by executing the scstat -p command.

Test 4

Here, test the failover capabilities of the Sun Cluster 3.0 software itself. Bring the resources back into their initial state (i.e., have the db2_db2inst1_0-rg hosted on sun-ha1).

Once the instance and its associated resources are hosted on the sun-ha1 machine, perform a power-off operation on that physical server. This will cause the internal heartbeat mechanism to detect a physical server failure, and the DB2 resources will be restarted on the surviving server.

Verify that the results are identical to those seen in Test 3 (i.e., the DB2 resources should be hosted on sun-ha2, and the clients should behave similarly in both cases).

Test 5

Bring the cluster back to its initial state. In this test, verify that the software monitoring is working as expected. To perform the test, issue one of the following command sequences:

  ps -ef | grep db2sysc
  kill -9 <pid>

or

  ps -ef | grep db2tcpcm
  kill -9 <pid>

The Sun Cluster 3.0 monitor should detect that a required process is not running and attempt to restart the instance on the same server. Verify that this, in fact, does occur. The client connections should experience a brief delay in service while the restart process continues.
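
One way to observe the restart, assuming the resource name used earlier in this section, is:

  ps -ef | grep db2sysc                 # The engine process should be gone briefly
  scstat -p | grep db2_db2inst1_0-rs    # Watch the resource state during the restart
  ps -ef | grep db2sysc                 # The engine process should reappear on the same server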

Note that there are many distinct testing scenarios that could be executed, limited only by your resources and imagination. Those discussed here are the minimum you should run to verify correct functioning of the cluster.

High Availability on HP/UX

HP MC/ServiceGuard monitors the health of each server and quickly responds to failures in a way that minimizes or eliminates application downtime. MC/ServiceGuard is able to detect and respond automatically to failures in the following components:

  • System processors

  • System memory

  • LAN media and adapters

  • System processes

  • Application processes

Application Packages

With HP MC/ServiceGuard, application services and all the resources needed to support the application are bundled into special entities called application packages. These application packages are the basic units that are managed and moved within an enterprise cluster. Packages simplify the creation and management of HA services and provide outstanding levels of flexibility for workload balancing.

Fast Detection of Failure, Fast Restoration of Applications

Within an enterprise cluster, HP MC/ServiceGuard monitors hardware and software components, detects failures, and responds by promptly allocating new resources to support mission-critical applications. The process of detecting the failure and restoring the application service is completely automated; no operator intervention is needed.

Recovery times for failures requiring the switch of an application to an alternate server will vary, depending on the software services being used by the application. For example, a database application that is using a logging facility would need to perform transaction rollbacks as part of the recovery process. The time needed to perform this transaction rollback would be part of the total time to recover the application. MC/ServiceGuard will detect the server failure, reconfigure the cluster, and begin executing the startup script for the application package on an alternate server in a short period of time.

Installation Outline for DB2 and MC/ServiceGuard

The following steps outline the installation process and the configuration changes required for DB2 in an HA environment, and summarize the full installation procedure.

  • Create a volume group, /dev/db2, on which to place the shared logical volumes (see the sketch after this list).

  • Create a logical volume on the new volume group, with a mount point, /db2, for the shared file system.

  • Mount the shared logical volume.

  • Install DB2 on the mounted shared disk, as detailed in the DB2 UDB ESE Installation guide.

  • Unmount the shared directories.

  • Issue a "vgchange -a n /dev/db2" to deactivate the volume group.

  • Issue a "vgchange -c y /dev/db2" to mark the volume group for cluster use.

  • Issue a "vgexport -m db2.map -s -p -v /dev/db2" to create the db2.map file without removing the volume group.

  • FTP the db2.map file to the adoptive server.

  • Telnet into the adoptive server.

  • Create the /dev/db2 directory and group file with the same major and minor numbers.

  • Issue a "vgimport -m db2.map -s -v /dev/db2" to import the volume group on the adoptive server.

  • Mount /db2 to confirm the import was successful.

  • Install DB2 UDB ESE.

  • Once the shared file system can be mounted on both systems, set up and configure the MC/ServiceGuard scripts for DB2.
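
The volume group and logical volume steps above might look like the following on HP-UX; the disk device, logical volume size, and minor number are illustrative only and must be adapted to your system:

  pvcreate /dev/rdsk/c0t1d0               # Initialize the shared disk (hypothetical device)
  mkdir /dev/db2
  mknod /dev/db2/group c 64 0x010000      # Group file; major 64, minor number unique per volume group
  vgcreate /dev/db2 /dev/dsk/c0t1d0       # Create the volume group
  lvcreate -L 1024 -n db2lv /dev/db2      # 1-GB logical volume (size is illustrative)
  newfs -F vxfs /dev/db2/rdb2lv           # Build a file system on the raw logical volume
  mkdir /db2
  mount /dev/db2/db2lv /db2               # Mount the shared file system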

Configuring the Cluster

All of the MC/ServiceGuard scripts developed during the certification process have been provided in this document. The following section describes the creation and use of these scripts.

Create the ASCII cluster template file:

  cmquerycl -v -C /etc/cmcluster/cluster.ascii -n ptac171 -n ptac178

Modify the template (cluster.ascii) to reflect the environment and to verify the cluster configuration:

  cmcheckconf -v -C /etc/cmcluster/cluster.ascii

Create the cluster by applying the configuration file. This will create the binary file cmclconfig and automatically distribute it among the servers defined in the cluster:

  cmapplyconf -v -C /etc/cmcluster/cluster.ascii

Start the cluster and check the cluster status; also test halting the cluster:

  cmruncl -v -n ptac171 -n ptac178
  cmviewcl -v
  cmhaltcl -f -v
  cmruncl -n ptac171 -n ptac178
Configuring a ServiceGuard Package (on a Single Server)

Create the db2inst1 package configuration file and tailor it to the test environment. Do not include the second server at this stage.

  cd /etc/cmcluster
  mkdir db2inst1
  cmmakepkg -p db2inst1.conf          # Edit db2inst1.conf

Create the db2inst1 package control script and tailor it to the test environment. Do not include application startup/shutdown, service monitoring, or the relocatable IP address at this stage.

  cd db2inst1
  cmmakepkg -s db2inst1.cntl

Shut down the cluster; then verify and distribute the binary configuration files:

  cmhaltcl -f -v
  cmapplyconf -v -C /etc/cmcluster/cluster.ascii -P \
      /etc/cmcluster/db2inst1/db2inst1.conf

Test cluster and package startup. First shut down DB2 if it is running, unmount all logical volumes on /dev/db2, and deactivate the volume group, as shown in the sketch below. Copy the db2inst1.cntl and db2inst1.ascii scripts into the /etc/cmcluster/db2inst1 directory.
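
The preparation might look like this, assuming the shared file system is /db2 on volume group /dev/db2:

  db2stop                      # Stop the instance if it is running
  umount /db2                  # Unmount the logical volumes on /dev/db2
  vgchange -a n /dev/db2       # Deactivate the volume group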

  cmruncl                             # Start the cluster and package
  cmviewcl -v                         # Check that the package has started

Edit db2inst1.cntl and assign the relocatable IP address of the db2inst1 package.

  cmhaltpkg db2inst1
  vi db2inst1.cntl                    # Edit to add the package IP
  cmrunpkg -v db2inst1                # Start the DB2 package
  cmviewcl -v                         # Check that the package has started and clients can connect
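
In the package control script, the relocatable address is assigned through the IP and SUBNET arrays; the addresses below are hypothetical:

  IP[0]="192.168.1.100"        # Relocatable IP address for the db2inst1 package
  SUBNET[0]="192.168.1.0"      # Subnet to which the address belongs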

Enable switching to a local standby LAN card.

  vi db2inst1.conf                    # Net switching enabled = YES
  cmapplyconf -v -C /etc/cmcluster/cluster.ascii -P db2inst1.ascii
  cmhaltcl -f -v
  cmruncl -v
Configuring a ServiceGuard Package (Adding a Second Server)

Enable db2inst1 to switch to a second server by editing the package configuration file:

  vi db2inst1.conf                    # Add SERVER_NAME ptac178
  cmapplyconf -v -C /etc/cmcluster/cluster.ascii -P db2inst1/db2inst1.conf
  cmhaltcl -f -v
  cmruncl -v

Test the package switch to ptac178 and back to ptac171:

  cmhaltpkg db2inst1                  # Halt the package
  cmrunpkg -n ptac178 db2inst1        # Run the package on ptac178; run DB2 and check the application
  cmmodpkg -e db2inst1                # Re-enable package switching
  cmhaltpkg db2inst1                  # Halt the package again
  cmrunpkg -n ptac171 db2inst1        # Run the package on ptac171 and test DB2
  cmmodpkg -e db2inst1                # Re-enable package switching
Configuring DB2 in the MC/ServiceGuard Environment

Once DB2 is installed and configured in the MC/ServiceGuard cluster, the DB2 package scripts can be configured.

In testing the MC/ServiceGuard integration with IBM engineers, the db2inst1.cntl file was configured so that the db2inst1 service has 0 restarts and fails over to the adoptive server in the case of a software or hardware failure. This number can be changed to suit the needs of each installation, but it is recommended that the restart value be left at 0: DB2 is a robust product, and if it does fail, the probability that a restart in place will succeed is low. To ensure a stable DB2 operating environment, it is suggested that MC/ServiceGuard be allowed to move the db2inst1 package to an adoptive server in the case of any failure.
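
In the package control script, the restart behavior is set through the SERVICE_RESTART array; the service name and monitor script path below are illustrative:

  SERVICE_NAME[0]="db2inst1_mon"
  SERVICE_CMD[0]="/etc/cmcluster/db2inst1/db2inst1_mon.sh"
  SERVICE_RESTART[0]=""        # Empty string = 0 restarts; fail over to the adoptive server instead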

The DB2 daemons that are monitored are db2sysc, db2tcpcm, db2srvlst, db2resyn, db2gds, and db2ipccm. Testing was performed both with the full list of processes monitored and with just db2sysc monitored; the results were the same. If any of the other DB2 processes failed, db2sysc failed as well. Because the MC/ServiceGuard monitor script for db2inst1 was monitoring the db2sysc process, the db2inst1 package was moved to the adoptive server whenever any DB2 process failed.
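
A monitor service is typically a script that exits when the monitored process disappears, an event that MC/ServiceGuard treats as a service failure. A minimal sketch follows; the script name and polling interval are assumptions:

  #!/bin/sh
  # db2inst1_mon.sh -- exit when the db2sysc engine process is no longer running
  while ps -ef | grep db2sysc | grep -v grep > /dev/null
  do
      sleep 30                 # Polling interval
  done
  exit 1                       # Exiting signals a service failure to MC/ServiceGuard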


