27.4 Adding a New Package to the Cluster Utilizing a Serviceguard Toolkit

     

Before we get into actually configuring the new package, I want to point out some differences that I introduce with the configuration of this new package:

 

 FAILOVER_POLICY                 MIN_PACKAGE_NODE 

This failover policy introduces a level of intelligence into the choice of the next adoptive node. Instead of simply using the next node in the list, i.e., FAILOVER_POLICY CONFIGURED_NODE, Serviceguard will ascertain how many packages each node is currently running. The next node to run a failed package is the node that is enabled to accept the package and has the fewest packages currently running. Serviceguard makes no judgment other than the number of packages currently running on a node. This type of configuration allows us to implement a Rolling Standby cluster where any node can be the standby for any other node, and the standby is not necessarily the next node in the list. It is a good idea if every node in the cluster is of a similar type and configuration; you don't want a performance-hungry application running on a performance-deficient server, do you?
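As a rough illustration of the arithmetic Serviceguard performs, you could count the running packages per node yourself from the plain cmviewcl output. This is only a sketch; it assumes the default cmviewcl output format shown later in this section, where each package line has five fields ending with the owning node:

 # Count running packages per node by pulling the NODE column from the
 # package lines of cmviewcl output (5-field lines whose STATUS is "up").
 cmviewcl | awk 'NF == 5 && $2 == "up" { print $5 }' | sort | uniq -c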

 

 FAILBACK_POLICY                 AUTOMATIC 

This option means that an application will restart on the original node it came from when that node becomes re-enabled to run it, possibly after a critical failure. I would be very careful employing this configuration because it means that as soon as the original node becomes enabled ( cmmodpkg -e -n <nodename> <package> ), Serviceguard will instigate the process of halting the package on its current node and starting the package back on its original node. I have seen this configuration used where the adoptive nodes were not as powerful as the primary nodes and it was thought right to have the application running back on the most powerful machine as quickly as possible. In the instance I am referring to, the slight interruption to service while this happened was not deemed significant because the application would poll for anything up to 1 minute waiting for a connection to the application server.

With these two changes in mind, let us proceed by first discussing a Serviceguard Toolkit.

27.4.1 A Serviceguard Toolkit

There are quite a number of Serviceguard Toolkits currently available, covering everything from NFS to Netscape to Oracle, Sybase, Progress, Informix, and DB2, to name a few. The primary value of the toolkits is the inclusion of an application monitoring script for each application. This eliminates the need for you to code an application monitoring script from scratch. The monitoring scripts included with the toolkits require only minor modifications to customize them for your application.

I am going to look at the Oracle Toolkit. Once you have installed it, you will find the Enterprise Cluster Master Toolkit files under the directory /opt/cmcluster/toolkit . Here are the files for the current Oracle Toolkit:

 

 root@hpeos001[] # cd /opt/cmcluster/toolkit
 root@hpeos001[toolkit] # ll
 total 6
 dr-xr-xr-x   3 bin        bin           1024 Aug  3  2002 SGOSB
 dr-xr-xr-x   2 bin        bin             96 Aug  3  2002 db2
 dr-xr-xr-x   2 bin        bin           1024 Aug  3  2002 domain
 dr-xr-xr-x   2 bin        bin             96 Aug  3  2002 fasttrack
 dr-xr-xr-x   2 bin        bin             96 Feb 27  2002 foundation
 dr-xr-xr-x   2 bin        bin           1024 Aug  3  2002 informix
 dr-xr-xr-x   2 bin        bin             96 Aug  3  2002 oracle
 dr-xr-xr-x   2 bin        bin             96 Aug  3  2002 progress
 dr-xr-xr-x   2 bin        bin             96 Aug  3  2002 sybase
 root@hpeos001[toolkit] # ll oracle
 total 140
 -r-xr-xr-x   1 bin        bin          14069 Mar 21  2002 ORACLE.sh
 -r-xr-xr-x   1 bin        bin          17118 Mar 21  2002 ORACLE_7_8.sh
 -r--r--r--   1 bin        bin          18886 Mar 21  2002 README
 -r--r--r--   1 bin        bin          19554 Mar 21  2002 README_7_8
 root@hpeos001[toolkit] #

The files I am interested in are the ORACLE.sh script and the README file. I have just reviewed them and the ORACLE.sh file told me this:

 

 # ***************************************************************************
 # *                 This script supports Oracle 9i only.                    *
 # *                 For Oracle 7 or 8, use the ORACLE_7_8.sh script.        *
 # ***************************************************************************

This seems quite clear to me. I am not running Oracle 9i, so I need to concentrate on the ORACLE_7_8.sh script and the README_7_8 file. Essentially, the process is exactly the same and is well documented in the associated README file; you should spend some time reviewing it. The ORACLE*.sh file is our application monitoring script. We update this script to reflect the particulars of our Oracle instance, i.e., ORACLE_HOME, SID_NAME, etc. Now, I am no Oracle expert; I worked on this with my Oracle DBA to fill in some of the configuration parameters in this application monitoring script. Once you have that information, you can proceed in much the same way as we did when adding the clockwatch application to the cluster. In fact, if you remember, I based the clockwatch application monitoring script on a Serviceguard Toolkit. Like CLOCKWATCH.sh, ORACLE*.sh has three functions: start, stop, and monitor. Each is triggered by the relevant command-line argument. This means that we can use exactly the same procedure as we did to add clockwatch to the cluster:

Table 27-1. Cookbook for Setting Up a Package

1. Set up and test a package-less cluster.

2. Understand how a Serviceguard package works.

3. Establish whether you can utilize a Serviceguard Toolkit.

4. Understand the workings of any in-house applications.

 

5. Create package monitoring scripts, if necessary.

 

6. Distribute the application monitoring scripts to all relevant nodes in the cluster.

 

7. Create and update an ASCII package configuration file ( cmmakepkg -p ).

 

8. Create and update an ASCII package control script ( cmmakepkg -s ).

 

9. Distribute manually to all relevant nodes the ASCII package control script.

 

10. Check the ASCII package configuration file ( cmcheckconf ).

 

11. Distribute the updated binary cluster configuration file ( cmapplyconf ).

 

12. Ensure that any data files and programs that are to be shared are loaded onto shared disk drives.

 

13. Start the package ( cmrunpkg or cmmodpkg ).

 

14. Ensure that package switching is enabled.

 

15. Test package failover functionality.


We start with "Create package monitoring scripts, if necessary," item 5 in the cookbook.

27.4.1.1 CREATE PACKAGE MONITORING SCRIPTS, IF NECESSARY

This is where we need to work with our Oracle DBA to check the application monitoring script supplied in the Toolkit and configure it according to our needs. Read the associated README file carefully; it details the steps involved in ensuring that the package is configured properly. The Oracle database was already installed, running, and tested; this is important, because we need to know the specifics of what the instance is called and where it is located. Here is what my DBA and I did to set up our application monitoring script:

 

 root@hpeos001[toolkit] # pwd
 /opt/cmcluster/toolkit
 root@hpeos001[toolkit] # mkdir /etc/cmcluster/oracle1
 root@hpeos001[toolkit] # cp oracle/ORACLE_7_8.sh \
 /etc/cmcluster/oracle1/ORACLE1.sh
 root@hpeos001[toolkit] # cd /etc/cmcluster/oracle1
 root@hpeos001[oracle1] # vi ORACLE1.sh

Here are the lines we updated to reflect the configuration of our Oracle database:

 

 ORA_7_3_X=yes
 ORA_8_0_X=
 ORA_8_1_X=
 SID_NAME=oracle1
 ORACLE_HOME=/u01/home/dba/oracle/product/8.1.6
 SQLNET=no
 NET8=
 LISTENER_NAME=
 LISTENER_PASS=
 MONITOR_INTERVAL=10
 PACKAGE_NAME=oracle1
 TIME_OUT=20
 set -A MONITOR_PROCESSES ora_smon_oracle1 ora_pmon_oracle1 \
   ora_lgwr_oracle1

These configuration parameters started on line 130 of the application monitoring script. The application binaries have been installed on each node under the directory /u01 . The database itself is stored on a filesystem accessible to all relevant nodes under the directory /ora1 . We can now distribute the application monitoring script to all relevant nodes in the cluster.
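Before moving on, it may help to see the overall shape of the toolkit script. The following is a heavily simplified sketch, not the verbatim ORACLE_7_8.sh code; the variable names match the parameters above, but the start and stop bodies are reduced to placeholders:

 #!/usr/bin/sh
 # Simplified sketch of how a toolkit script such as ORACLE1.sh is driven
 # (NOT the verbatim toolkit code): one script, three entry points chosen
 # by the first command-line argument.

 SID_NAME=oracle1
 MONITOR_INTERVAL=10
 set -A MONITOR_PROCESSES ora_smon_oracle1 ora_pmon_oracle1 ora_lgwr_oracle1

 monitor_processes()
 {
     # Loop forever; if any monitored process disappears, exit non-zero so
     # the oracle1_mon service fails and Serviceguard fails the package over.
     while :
     do
         for ORA_PROC in ${MONITOR_PROCESSES[@]}
         do
             ps -ef | grep -v grep | grep ${ORA_PROC} > /dev/null
             if [ $? -ne 0 ]
             then
                 echo "${ORA_PROC} is not running - failing service"
                 exit 1
             fi
         done
         sleep ${MONITOR_INTERVAL}
     done
 }

 case "$1" in
     start)    echo "Starting Oracle instance ${SID_NAME}"   # real script starts the instance here
               ;;
     stop)     echo "Stopping Oracle instance ${SID_NAME}"   # real script shuts the instance down here
               ;;
     monitor)  monitor_processes
               ;;
     *)        echo "Usage: $0 {start|stop|monitor}"
               exit 1
               ;;
 esac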

27.4.1.2 DISTRIBUTE THE APPLICATION MONITORING SCRIPT(S) TO ALL RELEVANT NODES IN THE CLUSTER

As with the clockwatch package, we need to manually distribute the application monitoring script to all relevant nodes in the cluster. When we say relevant nodes, we mean all nodes that are required to run this package; in this case, that means all nodes in the cluster.
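There is nothing special about how you copy the script; any method that leaves an identical copy on each node will do. Here is a simple sketch, assuming remsh/rcp access is already configured between the cluster nodes:

 root@hpeos001[oracle1] # for NODE in hpeos002 hpeos003
 > do
 >   remsh ${NODE} mkdir -p /etc/cmcluster/oracle1
 >   rcp /etc/cmcluster/oracle1/ORACLE1.sh ${NODE}:/etc/cmcluster/oracle1/
 > done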

27.4.1.3 CREATE AND UPDATE AN ASCII PACKAGE CONFIGURATION FILE ( cmmakepkg -p )

This step follows the same pattern as creating the package configuration file for the clockwatch package. Below are the commands executed and the lines I updated in the configuration file:

 

 root@hpeos001[oracle1] # cmmakepkg -v -p oracle1.conf
 Begin generating package template...
 Done.
 Package template is created.
 This file must be edited before it can be used.
 root@hpeos001[oracle1] # vi oracle1.conf

Here are the lines I updated in the configuration file oracle1.conf :

 

 PACKAGE_NAME                    oracle1 

This is just the package name:

 

 FAILOVER_POLICY                 MIN_PACKAGE_NODE 

The parameter FAILOVER_POLICY controls the level of intelligence we are allowing Serviceguard to use to decide which node to use to run the package should the original node fail. This can override the list of nodes we see below:

 

 FAILBACK_POLICY                 AUTOMATIC 

With AUTOMATIC set for FAILBACK_POLICY , when the original node comes back online, we should see the application move back to its original node:

 

 NODE_NAME                       hpeos003
 NODE_NAME                       hpeos001
 NODE_NAME                       hpeos002

This is the list of the nodes the package is allowed to run on. Just to reiterate, with FAILOVER_POLICY set to MIN_PACKAGE_NODE, this is not the definitive order of the nodes on which the package will run:

 

 RUN_SCRIPT                      /etc/cmcluster/oracle1/oracle1.cntl
 RUN_SCRIPT_TIMEOUT              NO_TIMEOUT
 HALT_SCRIPT                     /etc/cmcluster/oracle1/oracle1.cntl
 HALT_SCRIPT_TIMEOUT             NO_TIMEOUT

The package control script is created in the next step of this process:

 

 SERVICE_NAME                   oracle1_mon
 SERVICE_FAIL_FAST_ENABLED      YES
 SERVICE_HALT_TIMEOUT           300

I have chosen a SERVICE_NAME of oracle1_mon . This is equated to the actual application monitoring script in the package control script:

 

 SUBNET                  192.168.0.0 

I use the same subnet address as in the clockwatch application and assign an IP address in the package control script. I can now proceed to create the package control script.

27.4.1.4 CREATE AND UPDATE AN ASCII PACKAGE CONTROL SCRIPT ( cmmakepkg -s )

The name I chose for the package control script was oracle1.cntl . This seems to follow the pattern for other applications. Obviously, you can name these files whatever you like. Let's look at my progress:

 

 root@hpeos001[oracle1] # cmmakepkg -v -s oracle1.cntl
 Begin generating package control script...
 Done.
 Package control script is created.
 This file must be edited before it can be used.
 root@hpeos001[oracle1] # vi oracle1.cntl

As before, I will list only the configuration parameters that I have updated, not the entire file:

 

 VG[0]="/dev/vgora" 

This is the name of the volume group accessible by all nodes in the cluster that contains the data file for this application:

 

 LV[0]="/dev/vgora/ora1"; FS[0]="/ora1"; FS_MOUNT_OPT[0]="-o rw"; FS_UMOUNT_OPT[0]=""; FS_FSCK_OPT[0]=""
 FS_TYPE[0]="vxfs"

This lists the names of all filesystems to be mounted and is a simplified configuration. I would suspect that in a real-life Oracle installation there would be significantly more filesystems involved:

 

 IP[0]="192.168.0.230"
 SUBNET[0]="192.168.0.0"

I have allocated another IP address for this application. Remember, clients have to access this application by this new IP address, not the unique server's IP address:

 

 SERVICE_NAME[0]="oracle1_mon"
 SERVICE_CMD[0]="/etc/cmcluster/oracle1/ORACLE1.sh monitor"
 SERVICE_RESTART[0]=""

Here, we equate the SERVICE_NAME s from the package configuration file with SERVICE_CMD s. As you can see, we are running the application monitoring script with the monitor command line argument. It is crucial that we have listed all the application processes correctly within the monitoring script:

 

 function customer_defined_run_cmds
 {
 # ADD customer defined run commands.
         /etc/cmcluster/oracle1/ORACLE1.sh start
         test_return 51
 }

The customer_defined_run_cmds is where we actually start the application. Note again that it is the application monitoring script that starts the application:

 

 function customer_defined_halt_cmds
 {
 # ADD customer defined halt commands.
         /etc/cmcluster/oracle1/ORACLE1.sh stop
         test_return 52
 }

The customer_defined_halt_cmds is where we actually stop the application. If you need to be reminded of the order in which these functions are executed, I direct you to Chapter 26, "Configuring Packages in a Serviceguard Cluster."

Those are the only changes I made. I can now distribute the package control script to all relevant nodes.

27.4.1.5 DISTRIBUTE MANUALLY TO ALL NODES THE ASCII PACKAGE CONTROL SCRIPT

I don't want you to think that I am just repeating myself all the time, but I do need to remind you that the ASCII package control script is not distributed by the cmapplyconf command. You need to ensure that it exists on all relevant nodes when it is first created and, in some ways more importantly, whenever changes are made to it.
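A simple way to satisfy yourself that every node really does hold the same copy is to compare checksums across the nodes, for example (again assuming remsh access between the nodes):

 root@hpeos001[oracle1] # for NODE in hpeos001 hpeos002 hpeos003
 > do
 >   remsh ${NODE} cksum /etc/cmcluster/oracle1/oracle1.cntl
 > done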

27.4.1.6 CHECK THE ASCII PACKAGE CONFIGURATION FILE ( cmcheckconf )

At this stage, we find out whether we have forgotten to distribute the package control script to a particular node. We also determine whether our volume group has been configured on all nodes:

 

 root@hpeos001[oracle1] # cmcheckconf -v -k -P oracle1.conf
 Checking existing configuration ... Done
 Gathering configuration information ... Done
 Parsing package file: oracle1.conf.
 Attempting to add package oracle1.
 Maximum configured packages parameter is 10.
 Configuring 2 package(s).
 8 package(s) can be added to this cluster.
 Adding the package configuration for package oracle1.

 Verification completed with no errors found.
 Use the cmapplyconf command to apply the configuration.
 root@hpeos001[oracle1] #

All looks well, so let us continue.

27.4.1.7 DISTRIBUTE THE UPDATED BINARY CLUSTER CONFIGURATION FILE ( cmapplyconf )

Now we update and distribute the binary cluster configuration file. Remember, this is all happening with the cluster up and running and other packages running. Also remember that when we add a package, it does not start automatically the first time. We first check that everything is in place and then enable it to run:

 

 root@hpeos001[oracle1] # cmapplyconf -v -k -P oracle1.conf
 Checking existing configuration ... Done
 Gathering configuration information ... Done
 Parsing package file: oracle1.conf.
 Attempting to add package oracle1.
 Maximum configured packages parameter is 10.
 Configuring 2 package(s).
 8 package(s) can be added to this cluster.
 Modify the package configuration ([y]/n)? y
 Adding the package configuration for package oracle1.
 Completed the cluster update.
 root@hpeos001[oracle1] #
 root@hpeos001[oracle1] # cmviewcl -v -p oracle1

 UNOWNED_PACKAGES

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      down         halted       disabled     unowned

       Policy_Parameters:
       POLICY_NAME     CONFIGURED_VALUE
       Failover        min_package_node
       Failback        automatic

       Script_Parameters:
       ITEM       STATUS   NODE_NAME    NAME
       Subnet     up       hpeos003     192.168.0.0
       Subnet     up       hpeos001     192.168.0.0
       Subnet     up       hpeos002     192.168.0.0

       Node_Switching_Parameters:
       NODE_TYPE    STATUS       SWITCHING    NAME
       Primary      up           enabled      hpeos003
       Alternate    up           enabled      hpeos001
       Alternate    up           enabled      hpeos002
 root@hpeos001[oracle1] #

Looking at the output from cmviewcl , we can see that the configuration choices we made are now evident; the FAILOVER and FAILBACK policies are in place. Before we start the package, we need to perform an important step. Let's move on and discuss that next step.

27.4.1.8 ENSURE THAT ANY DATA FILES AND PROGRAMS THAT ARE TO BE SHARED ARE LOADED ONTO SHARED DISK DRIVES

You may be saying, "I thought you said earlier the database files had already been loaded onto a volume group accessible by all nodes in the cluster?" This is perfectly true. What we need to ensure here is that the volume group is cluster-aware. In our first package, clockwatch, the volume group used (/dev/vg01) happened to be listed in the ASCII cluster configuration file as a cluster lock volume group. In that instance, cmapplyconf made sure that the volume group was cluster-aware. This is not necessarily the case for subsequent packages and subsequent volume groups. We need to ensure that the flag has been set in the VGDA that allows the Serviceguard daemon cmlvmd to monitor and track which node currently has the volume group active. To accomplish this, we need to execute the following command on one of the nodes attached to the volume group:

 

 root@hpeos003[] vgchange -c y /dev/vgora
 Performed Configuration change.
 Volume group "/dev/vgora" has been successfully changed.

If we were to now try to activate the volume group in anything other than Exclusive Mode, we should receive an error:

 

 root@hpeos003[] vgchange -a y /dev/vgora
 vgchange: Activation mode requested for the volume group "/dev/vgora" conflicts with configured mode.

We must remember this when performing maintenance on a volume group after it has been made cluster-aware, e.g., when adding logical volumes to the volume group. In those situations, we would need to do the following (a command-level sketch of the second approach follows the list):

  • Halt the package.

  • Perform the maintenance using either of the following approaches:

    - Activate the volume group in exclusive mode on one node ( vgchange -a e <VG> ), perform the maintenance, and then deactivate the volume group ( vgchange -a n <VG> ).

    Or:

    - Temporarily activate the volume group in Non-Exclusive Mode:

      • Remove the cluster-aware flag ( vgchange -c n <VG> ).

      • Activate the volume group ( vgchange -a y <VG> ).

      • Perform the maintenance.

      • Deactivate the volume group ( vgchange -a n <VG> ).

      • Reapply the cluster-aware flag ( vgchange -c y <VG> ).

  • Update and distribute the package control script.

  • Restart the package.
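To make the second approach concrete, here is a command-level sketch for the vgora volume group used in this example; the new logical volume name and size are purely illustrative, and any new logical volume or filesystem would also have to be added to the package control script:

 root@hpeos003[] cmhaltpkg oracle1                    # halt the package first
 root@hpeos003[] vgchange -c n /dev/vgora             # remove the cluster-aware flag
 root@hpeos003[] vgchange -a y /dev/vgora             # normal (non-exclusive) activation
 root@hpeos003[] lvcreate -L 512 -n ora2 /dev/vgora   # example maintenance task (hypothetical LV)
 root@hpeos003[] vgchange -a n /dev/vgora             # deactivate the volume group
 root@hpeos003[] vgchange -c y /dev/vgora             # reapply the cluster-aware flag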

We have performed this task and can now consider starting the package.

27.4.1.9 START THE PACKAGE ( cmrunpkg OR cmmodpkg )

Reviewing the output from cmviewcl , we can see that all nodes are enabled to receive the package. All we need to do is to enable AUTO_RUN , and the package will run on the most suitable node. In this instance, I expect it to run on node hpeos003 because it currently has no packages running on it and it is first on the list. Let's see what happens:

 

 root@hpeos001[oracle1] # cmmodpkg -v -e oracle1
 Enabling switching for package oracle1.
 cmmodpkg  : Successfully enabled package oracle1.
 cmmodpkg  : Completed successfully on all packages specified.
 root@hpeos001[oracle1] #
 root@hpeos001[oracle1] # cmviewcl -v -p oracle1

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      up           running      enabled      hpeos003

       Policy_Parameters:
       POLICY_NAME     CONFIGURED_VALUE
       Failover        min_package_node
       Failback        automatic

       Script_Parameters:
       ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
       Service    up                  0         0   oracle1_mon
       Subnet     up                                192.168.0.0

       Node_Switching_Parameters:
       NODE_TYPE    STATUS       SWITCHING    NAME
       Primary      up           enabled      hpeos003     (current)
       Alternate    up           enabled      hpeos001
       Alternate    up           enabled      hpeos002
 root@hpeos001[oracle1] #

As we can see, oracle1 is up and running on node hpeos003. If we look at the package logfile (/etc/cmcluster/oracle1/oracle1.cntl.log) on that node, we should see the package startup process in all its glory:

 

 root@hpeos003[oracle1] more oracle1.cntl.log

 ########### Node "hpeos003": Starting package at Mon Aug 11 15:05:19 BST 2003 ###########
 Aug 11 15:05:19 - "hpeos003": Activating volume group /dev/vgora with exclusive option.
 Activated volume group in Exclusive Mode.
 Volume group "/dev/vgora" has been successfully changed.
 Aug 11 15:05:19 - Node "hpeos003": Checking filesystems:
    /dev/vgora/ora1
 /dev/vgora/rora1:file system is clean - log replay is not required
 Aug 11 15:05:19 - Node "hpeos003": Mounting /dev/vgora/ora1 at /ora1
 Aug 11 15:05:19 - Node "hpeos003": Adding IP address 192.168.0.230 to subnet 192.168.0.0

 *** /etc/cmcluster/oracle1/ORACLE1.sh called with start argument. ***

 "hpeos003": Starting Oracle SESSION oracle1 at Mon Aug 11 15:05:19 BST 2003

 Oracle Server Manager Release 3.1.6.0.0 - Production

 Copyright (c) 1997, 1999, Oracle Corporation. All Rights Reserved.

 Oracle8i Release 8.1.6.0.0 - Production
 JServer Release 8.1.6.0.0 - Production

 SVRMGR> Connected.
 SVRMGR> ORACLE instance started.
 Total System Global Area                         25004888 bytes
 Fixed Size                                          72536 bytes
 Variable Size                                    24649728 bytes
 Database Buffers                                   204800 bytes
 Redo Buffers                                        77824 bytes
 Database mounted.
 Database opened.
 SVRMGR> Server Manager complete.
 Oracle startup done.
 Aug 11 15:05:24 - Node "hpeos003": Starting service oracle1_mon using
    "/etc/cmcluster/oracle1/ORACLE1.sh monitor"

 *** /etc/cmcluster/oracle1/ORACLE1.sh called with monitor argument. ***

 ########### Node "hpeos003": Package start completed at Mon Aug 11 15:05:25 BST 2003 ###########

 Monitored process = ora_smon_oracle1, pid = 4786
 Monitored process = ora_pmon_oracle1, pid = 4778
 Monitored process = ora_lgwr_oracle1, pid = 4782
 root@hpeos003[oracle1]

I am no Oracle expert, but this looks okay to me. The next stage before putting this package into a production environment would be to test package failover.

27.4.1.10 ENSURE THAT PACKAGE SWITCHING IS ENABLED

This should already have been accomplished because we started the package by enabling package switching ( cmmodpkg -e oracle1 ). We can quickly check it with cmviewcl :

 

 root@hpeos003[oracle1] cmviewcl -v -p oracle1

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      up           running      enabled      hpeos003

       Policy_Parameters:
       POLICY_NAME     CONFIGURED_VALUE
       Failover        min_package_node
       Failback        automatic

       Script_Parameters:
       ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
       Service    up                  0         0   oracle1_mon
       Subnet     up                                192.168.0.0

       Node_Switching_Parameters:
       NODE_TYPE    STATUS       SWITCHING    NAME
       Primary      up           enabled      hpeos003     (current)
       Alternate    up           enabled      hpeos001
       Alternate    up           enabled      hpeos002
 root@hpeos003[oracle1]

We are now ready to move on to the final step in this cookbook: testing package failover functionality.

27.4.1.11 TEST PACKAGE FAILOVER FUNCTIONALITY

I am going to perform two tests:

  • Kill a critical application process.

    This should fail the package over to the most suitable node in the cluster.

  • Re-enable the original node to receive the package.

    The package should automatically move back to its original node.

Let's take a quick review of the status of current packages:

 

 root@hpeos003[oracle1] cmviewcl

 CLUSTER      STATUS
 McBond       up

   NODE         STATUS       STATE
   hpeos001     up           running

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     clockwatch   up           running      enabled      hpeos001

   NODE         STATUS       STATE
   hpeos002     up           running
   hpeos003     up           running

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      up           running      enabled      hpeos003
 root@hpeos003[oracle1]

So oracle1 is running on node hpeos003 and clockwatch is running on node hpeos001 . If oracle1 were to fail, which node would it move to?

 

 root@hpeos003[oracle1] cmviewcl -v -p oracle1

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      up           running      enabled      hpeos003

       Policy_Parameters:
       POLICY_NAME     CONFIGURED_VALUE
       Failover        min_package_node
       Failback        automatic

       Script_Parameters:
       ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
       Service    up                  0         0   oracle1_mon
       Subnet     up                                192.168.0.0

       Node_Switching_Parameters:
       NODE_TYPE    STATUS       SWITCHING    NAME
       Primary      up           enabled      hpeos003     (current)
       Alternate    up           enabled      hpeos001
       Alternate    up           enabled      hpeos002
 root@hpeos003[oracle1]

If I interpret this output correctly, oracle1 should fail over to node hpeos002 because it is the node with the fewest packages currently running, even though node hpeos001 is listed before it in the output from cmviewcl -v -p oracle1 above. Remember, that level of intelligence comes from Serviceguard via the package configuration parameter FAILOVER_POLICY being set to MIN_PACKAGE_NODE. Let's put that theory to the test.

  • Kill a critical application process.

    Let's kill one of the Oracle daemons. This will test the application monitoring script as well as test the working of the FAILOVER_POLICY :

     

     root@hpeos003[oracle1] tail oracle1.cntl.log
     Aug 11 15:05:24 - Node "hpeos003": Starting service oracle1_mon using
        "/etc/cmcluster/oracle1/ORACLE1.sh monitor"

     *** /etc/cmcluster/oracle1/ORACLE1.sh called with monitor argument. ***

     ########### Node "hpeos003": Package start completed at Mon Aug 11 15:05:25 BST 2003 ###########

     Monitored process = ora_smon_oracle1, pid = 4786
     Monitored process = ora_pmon_oracle1, pid = 4778
     Monitored process = ora_lgwr_oracle1, pid = 4782
     root@hpeos003[oracle1]
     root@hpeos003[oracle1] ps -fp 4786
          UID   PID  PPID  C    STIME TTY       TIME COMMAND
       oracle  4786     1  0 15:05:20 ?        0:00 ora_smon_oracle1
     root@hpeos003[oracle1]
     root@hpeos003[oracle1] kill 4786

    This fatal problem in the application should be picked up by the application monitoring script, and the package should move to node hpeos002 :

     

     root@hpeos003[oracle1] cmviewcl -v -p oracle1

         PACKAGE      STATUS       STATE        AUTO_RUN     NODE
         oracle1      up           running      enabled      hpeos002

           Policy_Parameters:
           POLICY_NAME     CONFIGURED_VALUE
           Failover        min_package_node
           Failback        automatic

           Script_Parameters:
           ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
           Service    up                  0         0   oracle1_mon
           Subnet     up                                192.168.0.0

           Node_Switching_Parameters:
           NODE_TYPE    STATUS       SWITCHING    NAME
           Primary      up           disabled     hpeos003
           Alternate    up           enabled      hpeos001
           Alternate    up           enabled      hpeos002     (current)
     root@hpeos003[oracle1]

    As predicted, the FAILOVER_POLICY has meant that Serviceguard moved the package to the most suitable node in the cluster. From a user's perspective, this does not matter because the application can still be accessed via the application IP address:

     

     C:\Work> ping 192.168.0.230

     Pinging 192.168.0.230 with 32 bytes of data:

     Reply from 192.168.0.230: bytes=32 time<1ms TTL=255
     Reply from 192.168.0.230: bytes=32 time<1ms TTL=255
     Reply from 192.168.0.230: bytes=32 time<1ms TTL=255
     Reply from 192.168.0.230: bytes=32 time<1ms TTL=255

     Ping statistics for 192.168.0.230:
         Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
     Approximate round trip times in milli-seconds:
         Minimum = 0ms, Maximum = 0ms, Average = 0ms

     C:\Work> arp -a

     Interface: 192.168.0.1 --- 0x2
       Internet Address      Physical Address      Type
       192.168.0.202         08-00-09-c2-69-c6     dynamic
       192.168.0.230         08-00-09-c2-69-c6     dynamic
     C:\Work>

    As we can see, the MAC address associated with the application IP address (192.168.0.230) is the same MAC address as for node hpeos002. This is of no consequence to the user, apart from the slight delay in getting a connection to the application while the package moves to its adoptive node. In this case, the delay was only 10 seconds:

     

     ########### Node "hpeos003": Package halt completed at Sat Aug  9 17:15:36 BST 2003 ###########
     ########### Node "hpeos002": Package start completed at Sat Aug  9 17:15:46 BST 2003 ###########

    The next test we are going to undertake is seeing the impact of re-enabling the original node hpeos003 .

  • Re-enable the original node to receive the package.

    The package configuration parameter FAILBACK_POLICY has been set to AUTOMATIC. This should mean that when node hpeos003 becomes enabled to receive the package, Serviceguard should start the process of halting the package on its current node (hpeos002) and starting it on node hpeos003. As mentioned previously, use of this option should take into consideration the fact that this AUTOMATIC movement of a package will cause a slight interruption in the availability of the application. Let's conduct the test and see what happens:

     

     root@hpeos001[oracle1] # cmmodpkg -v -e -n hpeos003 oracle1
     Enabling node hpeos003 for switching of package oracle1.
     cmmodpkg  : Successfully enabled package oracle1 to run on node hpeos003.
     cmmodpkg  : Completed successfully on all packages specified.

This will enable node hpeos003 to receive the package. Serviceguard should now move the package back to this node.

 

 root@hpeos001[oracle1] # cmviewcl -v -p oracle1

     PACKAGE      STATUS       STATE        AUTO_RUN     NODE
     oracle1      up           running      enabled      hpeos003

       Policy_Parameters:
       POLICY_NAME     CONFIGURED_VALUE
       Failover        min_package_node
       Failback        automatic

       Script_Parameters:
       ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
       Service    up                  0         0   oracle1_mon
       Subnet     up                                192.168.0.0

       Node_Switching_Parameters:
       NODE_TYPE    STATUS       SWITCHING    NAME
       Primary      up           enabled      hpeos003     (current)
       Alternate    up           enabled      hpeos001
       Alternate    up           enabled      hpeos002
 root@hpeos001[oracle1] #

As predicted, the package is now back on node hpeos003.

I would not consider these tests sufficient to put this package into a production environment. I would thoroughly test the failover and failback of the package between all nodes in the cluster, and I would review both the Standard and Stress Tests undertaken for the clockwatch package. You need to be confident that your package will perform as expected in all situations within the cluster. I leave you to perform the remainder of those tests.
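For example, a simple additional test is a controlled manual move of the package to each node in turn, using the commands we have already seen in this chapter (node names as per this cluster):

 root@hpeos001[oracle1] # cmhaltpkg oracle1                 # halt the package wherever it is running
 root@hpeos001[oracle1] # cmrunpkg -n hpeos001 oracle1      # start it on a specific node
 root@hpeos001[oracle1] # cmmodpkg -e oracle1               # re-enable package switching afterward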

As we have seen, a Serviceguard Toolkit makes setting up packages much easier than having to write your own application monitoring scripts. It is important that you spend time reading the associated README file as well as fine-tuning the monitoring script itself. The only Toolkit that is slightly different is the Toolkit for Highly Available NFS. As the name implies, the package deals with exporting a number of NFS filesystems from a server. Because it is a Serviceguard package, the exported filesystems must be accessible by all relevant nodes in the cluster. Serviceguard will mount and then export these filesystems as part of starting up the package. Users will access their NFS filesystems via the package name/IP address instead of the server name/IP address. Here are the files supplied with this Toolkit:
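To illustrate the client's view, an NFS client mounts the exported filesystem using the package's relocatable hostname or IP address rather than any physical server's hostname. A sketch, using a hypothetical package hostname nfspkg and the /manuals export used in the illustrative configuration below:

 # On an HP-UX NFS client: mount via the package's relocatable hostname/IP,
 # so the mount is unaffected by the package moving to another cluster node.
 mount -F nfs nfspkg:/manuals /manuals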

 

 root@hpeos003[nfs] pwd
 /opt/cmcluster/nfs
 root@hpeos003[nfs] ll
 total 158
 -rwxr-xr-x   1 bin        bin          12740 Sep 20  2001 hanfs.sh
 -rwxr-xr-x   1 bin        bin          36445 Sep 12  2001 nfs.cntl
 -rwxr-xr-x   1 bin        bin          12469 Sep 12  2001 nfs.conf
 -rwxr-xr-x   1 bin        bin          13345 Sep 12  2001 nfs.mon
 -rwxr-xr-x   1 bin        bin           2111 Sep 12  2001 nfs_xmnt
 root@hpeos003[nfs]

As you can see, you are supplied with the package configuration and control scripts. This is because, instead of specifying a SERVICE_NAME in the package control script, we specify an NFS_SERVICE_NAME; you will see this later. This means that we don't perform the cmmakepkg -p or cmmakepkg -s steps. Essentially, the steps to configure such a package are the same as for any other package. First, copy the entire contents of the directory /opt/cmcluster/nfs to /etc/cmcluster. Obviously, you would need to perform the preliminary steps for setting up a package as we did for all other packages, e.g., ensure that shared files are accessible to all relevant nodes, and so on. The modifications I would make to the supplied Toolkit files include:

  • Update the package configuration file. The updates listed below are for illustrative purposes only:

     

     #  vi nfs.conf
     PACKAGE_NAME        nfs
     NODE_NAME           node1
     NODE_NAME           node2
     RUN_SCRIPT          /etc/cmcluster/nfs/nfs.cntl
     HALT_SCRIPT         /etc/cmcluster/nfs/nfs.cntl
     SERVICE_NAME        nfs_service
     SUBNET              192.168.0.0    (or whatever is relevant)

  • Update the package control script. The updates listed below are for illustrative purposes only:

     

     #  vi nfs.cntl
     VG[0]=/dev/vg02
     LV[0]=/dev/vg02/NFSvol1
     LV[1]=/dev/vg02/NFSvol2
     FS[0]=/manuals
     FS[1]=/docs
     IP[0]=192.168.0.240    (or whatever is relevant)
     SUBNET[0]=192.168.0.0

    Prior to Serviceguard version 11.14, you would have to update the package control script with the following lines as well:

     

     XFS[0]="/manuals"
     XFS[1]="/docs"
     NFS_SERVICE_NAME[0]="nfs_service"
     NFS_SERVICE_CMD[0]="/etc/cmcluster/nfs/nfs.mon"
     NFS_SERVICE_RESTART[0]="-r 0"

    As of Serviceguard version 11.14, these last updates are handled by the additional script hanfs.sh .

  • Update the script hanfs.sh, if appropriate. This script, new in Serviceguard version 11.14, lists the exported filesystems and the NFS_SERVICE_NAME. The updates listed below are for illustrative purposes only:

     

     XFS[0]="/manuals"
     XFS[1]="/docs"
     NFS_SERVICE_NAME[0]="nfs_service"
     NFS_SERVICE_CMD[0]="/etc/cmcluster/nfs/nfs.mon"
     NFS_SERVICE_RESTART[0]="-r 0"

  • The supplied script nfs.mon needs no updating because it is simply monitoring all the relevant NFS daemons.

  • We would continue with adding a package as before:

    - Distribute control and monitoring scripts to all nodes.

    - Check the package configuration file ( cmcheckconf ).

    - Update the cluster binary file ( cmapplyconf ).

    - Start the package ( cmmodpkg or cmrunpkg ).

    - Test that the package works as expected (a quick check from an NFS client is shown after this list).
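One quick functional check, once the NFS package is up, is to query the exports via the package's relocatable IP address from any NFS client (IP address as per the illustrative control script above):

 # List the filesystems exported by whichever node currently runs the package:
 showmount -e 192.168.0.240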

We now continue our discussions regarding managing a cluster by looking at updating an existing package by adding EMS resource monitoring.


