26.16 Testing Package Failover Functionality | HP-UX CSE(c) Official Study Guide and Desk Reference

In this section, we use quite a few of the commands used daily in the administration of packages. We look at situations where we want to move a package to another node in a controlled fashion, possibly to allow urgent system maintenance on the original node. We call such tests Standard Tests. We also look at situations where an application fails for whatever reason and we expect Serviceguard to move the package to an adoptive node by itself. We call such tests Stress Tests. With both types of tests, we want to ensure that packages can move from and to their original node. This two-way functionality is crucial.

If we think about it, the package control script performs the nuts and bolts of starting and stopping our applications. An important point to remember is that the control script is distributed manually. If we change a control script and forget to distribute it to all nodes, the two-way functionality we are testing could fail: when the package moves back to its original node, that node runs its own copy of the control script, which could be an old version. While performing these tests, we should also bear in mind access to the application from a user's perspective. We should have access to a user input screen to witness the impact of these tests, or at least ping the package IP address while the tests are being performed.
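One way to catch a stale control script before testing is to compare a checksum of the script from every node. The following is only a sketch: the function name cksum_mismatch is hypothetical, and the control script path and the use of remsh in the commented usage are assumptions about how your cluster is laid out.

```shell
# Hypothetical helper: read lines of the form "<node> <checksum>" on stdin
# and print every node whose checksum differs from the first node listed.
# Exit status is 1 if any mismatch was found, 0 otherwise.
cksum_mismatch() {
    awk '
    NR == 1   { ref = $2; next }        # first node is the reference copy
    $2 != ref { print $1; bad = 1 }     # report any node that differs
    END       { exit (bad ? 1 : 0) }
    '
}

# Possible usage (assumed path and remote shell, adjust to your site):
#   for n in hpeos001 hpeos002; do
#       echo "$n $(remsh "$n" cksum /etc/cmcluster/clockwatch/clockwatch.cntl | awk '{print $1}')"
#   done | cksum_mismatch
```

Any node name printed is running a copy of the control script that differs from the first node's copy and should be looked at before the failover tests begin.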

Standard Tests

Move a package to another node in the cluster.

In this test, we are simply halting the application on one node and running it on another. The reason for doing this could be to allow urgent maintenance on the original node. When I first used Serviceguard, I thought there would be a cmmovepkg command. There isn't. We halt the package on one node and run it on another. It's as simple as that. We ensure that package and node switching are enabled as appropriate. Once the original node has completed its maintenance, we move the package back to the original node if we want.
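Since there is no single move command, the halt, run, and re-enable sequence can be wrapped in a small shell function. This is a minimal sketch, not a Serviceguard feature: the names move_pkg, run_cmd, and DRY_RUN are my own, and the only cluster commands used are the ones shown in this section.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: move_pkg <package> <target_node>
move_pkg() {
    pkg=$1
    node=$2
    if [ -z "$pkg" ] || [ -z "$node" ]; then
        echo "usage: move_pkg <package> <target_node>" >&2
        return 2
    fi
    # Stop at the first failure so we never try to start a package
    # that did not halt cleanly.
    run_cmd cmhaltpkg -v "$pkg" || return 1
    run_cmd cmrunpkg -v -n "$node" "$pkg" || return 1
    # Halting a package disables AUTO_RUN; turn package switching back on.
    run_cmd cmmodpkg -e "$pkg" || return 1
}
```

Running it first with DRY_RUN=1 (for example, DRY_RUN=1 followed by move_pkg clockwatch hpeos002) prints the three commands without touching the cluster, which is a cheap way to check the arguments before a real move.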

I am starting from the position in which we left the package in the previous section; clockwatch is running on node hpeos001 with node and package switching enabled.

Halting the package = cmhaltpkg:

    root@hpeos002[] # cmhaltpkg -v clockwatch
    Halting package clockwatch.
    cmhaltpkg  : Successfully halted package clockwatch.
    cmhaltpkg  : Completed successfully on all packages specified.
    root@hpeos002[] # cmviewcl -v -p clockwatch
    UNOWNED_PACKAGES
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   down         halted       disabled     unowned
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   NODE_NAME    NAME
          Subnet     up       hpeos001     192.168.0.0
          Subnet     up       hpeos002     192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001
          Alternate    up           enabled      hpeos002
    root@hpeos002[] #

The first thing to notice is that I don't need to run cmhaltpkg from the node the package is running on. The second thing is that simply halting the package leaves all node switching parameters configured as they were. Package switching, i.e., AUTO_RUN, has been disabled, as you might expect after a cmhaltpkg.
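Because AUTO_RUN is easy to overlook in a long listing, it can help to pull the package row out of the cmviewcl output. The sketch below assumes the column layout shown in this section (PACKAGE, STATUS, STATE, AUTO_RUN, NODE); the function names pkg_summary and auto_run_enabled are hypothetical.

```shell
# Hypothetical helper: print "STATUS STATE AUTO_RUN NODE" for one package.
# Reads the output of `cmviewcl -v -p <package>` on stdin.
pkg_summary() {
    awk -v pkg="$1" '
    # The package row starts with the package name and has five columns.
    $1 == pkg && NF == 5 { print $2, $3, $4, $5; found = 1; exit }
    END { exit (found ? 0 : 1) }
    '
}

# Succeeds only when package switching (AUTO_RUN) is enabled.
auto_run_enabled() {
    summary=$(pkg_summary "$1") || return 1
    set -- $summary
    [ "$3" = "enabled" ]
}
```

A possible use is cmviewcl -v -p clockwatch | auto_run_enabled clockwatch || echo "AUTO_RUN is off", which gives a one-line reminder to run cmmodpkg -e once the package is where we want it.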

In this situation, if node hpeos001 were being shut down for urgent maintenance, I would either disable this node from running this package (cmmodpkg -d -n hpeos001 clockwatch), or more likely, I would stop cluster services on that node altogether (cmhaltnode -v hpeos001). The important thing here is to get this application back up and running on another node. I would immediately follow a cmhaltpkg with a suitable cmrunpkg command. In this case, I want to run the package on node hpeos002.

    root@hpeos002[] # cmrunpkg -v -n hpeos002 clockwatch
    Running package clockwatch on node hpeos002.
    cmrunpkg  : Successfully started package clockwatch.
    cmrunpkg  : Completed successfully on all packages specified.
    root@hpeos002[] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      disabled     hpeos002
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001
          Alternate    up           enabled      hpeos002     (current)
    root@hpeos002[] #

In this example, I have used the -n hpeos002 option to cmrunpkg. I could have omitted it because I am logged into hpeos002. In reality, I almost always include it, just in case I am accidentally logged on to another node in the cluster. Fully specifying the command ensures that you know exactly what is going to happen. Now we can deal with node hpeos001 and the package switching parameter, AUTO_RUN.

    root@hpeos002[] # cmhaltnode -v hpeos001
    Disabling package switching to all nodes being halted.
    Disabling all packages from running on hpeos001.
    Warning:  Do not modify or enable packages until the halt operation is completed.
    Halting cluster services on node hpeos001.
    ..
    Successfully halted all nodes specified.
    Halt operation complete.
    root@hpeos002[] # cmmodpkg -e clockwatch
    cmmodpkg  : Completed successfully on all packages specified.
    root@hpeos002[] # cmviewcl
    CLUSTER      STATUS
    McBond       up
      NODE         STATUS       STATE
      hpeos001     down         halted
      hpeos002     up           running
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos002
    root@hpeos002[] #

We can see now that package switching is enabled for the clockwatch application; if we had a third node in the cluster and node hpeos002 failed, our application would now be able to fail over to that third node. We can also see that node hpeos001 is down for its urgent system maintenance.

Moving a package back to its original node.

This task highlights two things:

- What happens when the original node comes back online

- Moving a package back to its original node

Let's bring the original node hpeos001 back online.

    root@hpeos001[] # cmrunnode -v hpeos001
    Successfully started $SGLBIN/cmcld on hpeos001.
    cmrunnode  : Waiting for cluster to form.....
    cmrunnode  : Cluster successfully formed.
    cmrunnode  : Check the syslog files on all nodes in the cluster
    cmrunnode  : to verify that no warnings occurred during startup.
    root@hpeos001[] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos002
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001
          Alternate    up           enabled      hpeos002     (current)
    root@hpeos001[] #

As you can see, simply restarting the original node does not move the package; it stays on its current node. If you look carefully at the above output from cmviewcl -v -p clockwatch, you will notice a configuration option known as Failback. This policy, default = manual, controls what happens to a package when its original node comes back online. I personally think the default of manual is suitable in most situations. If we were to set it to automatic, we would see clockwatch move back to that node automatically. This means a package outage, because clockwatch has to be stopped on node hpeos002 and restarted on node hpeos001.

To move clockwatch back to node hpeos001, it is as simple as our first Standard Test:

- Halt it on its current node (cmhaltpkg).

- Run it on the target node (cmrunpkg).

- Ensure that package and node switching are enabled (cmmodpkg).

We need to carefully consider when we want to perform this task because it will mean taking the package down for a short time.

    root@hpeos001[] # cmhaltpkg -v clockwatch
    Halting package clockwatch.
    cmhaltpkg  : Successfully halted package clockwatch.
    cmhaltpkg  : Completed successfully on all packages specified.
    root@hpeos001[] # cmrunpkg -v -n hpeos001 clockwatch
    Running package clockwatch on node hpeos001.
    cmrunpkg  : Successfully started package clockwatch.
    cmrunpkg  : Completed successfully on all packages specified.
    root@hpeos001[] # cmmodpkg -e clockwatch
    cmmodpkg  : Completed successfully on all packages specified.
    root@hpeos001[] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos001
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001     (current)
          Alternate    up           enabled      hpeos002
    root@hpeos001[] #

Stress Tests

During the first of these stress tests, I will ping the package IP address from a third independent node. I am expecting a gap in the responses from the ping, but they should resume when the package fails over to the adoptive node.

Kill one of the major application processes.

In this test, we are testing the validity of the application monitoring script. If a critical application process dies, we should see the application monitoring script die. Hence, Serviceguard will detect that a service process has died and move the package to an adoptive node.

    root@hpeos001[] # ps -ef | grep clockwatch
        root  7929     1  0 06:12:56 ?         0:00 /clockwatch/bin/clockwatch /clockwatch/logs
        root  7933  4449  0 06:12:56 ?         0:00 /sbin/sh /etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor
        root  7952  6259  2 06:13:16 pts/1     0:00 grep clockwatch
    root@hpeos001[] # kill 7929

We may have to wait a few seconds for the application monitoring script to wake up and detect the failure.

    root@hpeos001[clockwatch] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos002
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           disabled     hpeos001
          Alternate    up           enabled      hpeos002     (current)
    root@hpeos001[clockwatch] #

The application has failed over to its adoptive node. It is worthwhile checking the package logfile /etc/cmcluster/clockwatch/clockwatch.cntl.log to review what happened during the failure.

    Aug  9 06:12:56 - Node "hpeos001": Starting service clock_mon using
       "/etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor"
            ########### Node "hpeos001": Package start completed at Sat Aug  9 06:12:57 BST 2003 ###########
            ########### Node "hpeos001": Halting package at Sat Aug  9 06:13:28 BST 2003 ###########
    Aug  9 06:13:28 - Node "hpeos001": Halting service clock_mon
    cmhaltserv : Service name clock_mon is not running.
    PID = 7929 does not exist. Application may already be dead.
    Removing /clockwatch/logs/.watchpid to allow clockwatch to be restarted on another node.
    Aug  9 06:13:29 - Node "hpeos001": Remove IP address 192.168.0.220 from subnet 192.168.0.0
    Aug  9 06:13:29 - Node "hpeos001": Unmounting filesystem on /dev/vg01/progs
    Aug  9 06:13:31 - Node "hpeos001": Unmounting filesystem on /dev/vg01/db
    Aug  9 06:13:31 - Node "hpeos001": Deactivating volume group /dev/vg01
    Deactivated volume group in Exclusive Mode.
    Volume group "/dev/vg01" has been successfully changed.
            ########### Node "hpeos001": Package halt completed at Sat Aug  9 06:13:32 BST 2003 ###########

We can now see the importance of the 'stop' function in the monitoring scripts; we detected an abnormal event in the application, i.e., the PID did not exist. This alerted us to the problem of the .watchpid file existing, which we removed to allow the application to move to an adoptive node.

Here is the output from the ping I performed on node hpeos003 during the failover:

    root@hpeos003[]  ping 192.168.0.220
    PING 192.168.0.220: 64 byte packets
    64 bytes from 192.168.0.220: icmp_seq=0. time=0. ms
    64 bytes from 192.168.0.220: icmp_seq=1. time=0. ms
    ...
    64 bytes from 192.168.0.220: icmp_seq=13. time=0. ms
    64 bytes from 192.168.0.220: icmp_seq=14. time=0. ms
    64 bytes from 192.168.0.220: icmp_seq=23. time=1. ms
    64 bytes from 192.168.0.220: icmp_seq=24. time=0. ms
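In a long ping capture, the outage window can be hard to spot by eye. Here is a small sketch that scans saved ping output for jumps in the icmp_seq numbers. The function name ping_gaps is hypothetical, and it assumes replies appear in the format shown in the listing above.

```shell
# Hypothetical helper: report every jump in icmp_seq numbers.
# Reads saved ping output on stdin.
ping_gaps() {
    awk '
    /icmp_seq=/ {
        # Isolate the digits that follow "icmp_seq=".
        s = $0
        sub(/.*icmp_seq=/, "", s)
        sub(/[^0-9].*/, "", s)
        cur = s + 0
        if (seen && cur > prev + 1)
            printf "no reply after seq %d until seq %d (%d lost)\n", prev, cur, cur - prev - 1
        prev = cur
        seen = 1
    }'
}
```

If ping is sending one packet per second, the number of lost packets is a rough measure, in seconds, of how long clients could not reach the package IP address.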

As we can see, there is a period when the application cannot be contacted: we get no response after packet 14 until packet 23, i.e., while the package was being moved to the adoptive node. If we look at the ARP cache on this node, we should find that both the host and the package IP address map to the same LAN card:

    root@hpeos003[]  arp -a
    hpeos002 (192.168.0.202) at 8:0:9:c2:69:c6 ether
    192.168.0.220 (192.168.0.220) at 8:0:9:c2:69:c6 ether
    root@hpeos003[]

As you can see, that is exactly the case. The IP address for the package has moved to node hpeos002. How this affects clients is entirely dependent on the application. Most applications will retry requests a number of times before they issue a serious error message. In this case, it has taken only 8 seconds to move the application from one node to another:

    ########### Node "hpeos001": Package halt completed at Sat Aug  9 08:20:30 BST 2003 ###########
    ########### Node "hpeos002": Package start completed at Sat Aug  9 08:20:38 BST 2003 ###########
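The same arithmetic can be done by a script when comparing many test runs. The sketch below reads the "Package halt completed" line from the failed node's logfile and the "Package start completed" line from the adoptive node's logfile and prints the difference in seconds. The function name failover_seconds is hypothetical; it assumes both nodes' clocks are synchronized and that the failover does not span midnight.

```shell
# Hypothetical helper: print the seconds between "Package halt completed"
# and "Package start completed" lines supplied on stdin.
failover_seconds() {
    awk '
    # Return the first HH:MM:SS field on the line as seconds since midnight.
    function hms(line,    n, f, i, p) {
        n = split(line, f, " ")
        for (i = 1; i <= n; i++)
            if (f[i] ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) {
                split(f[i], p, ":")
                return p[1] * 3600 + p[2] * 60 + p[3]
            }
        return -1
    }
    BEGIN { halt = -1; start = -1 }
    /Package halt completed at/  { halt = hms($0) }
    /Package start completed at/ { start = hms($0) }
    END {
        if (halt >= 0 && start >= halt) print start - halt
        else exit 1
    }'
}
```

Feeding it the two lines shown above prints 8. Note that this measures only the gap between halt completion and start completion; the time clients are affected also includes how long the halt itself takes.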

Kill the application monitoring script.

In this test, we are ensuring that Serviceguard is operating as expected, i.e., it should detect that the service process has died, halt the application, and move it to another node. Before running this test, I am going to assume that we have fixed any problems with node hpeos001 from the previous test and we want to enable this node to run the application. In our cluster, that will mean if clockwatch fails on its current node, hpeos002, it will have another node to run on, namely hpeos001. If I do not re-enable hpeos001 to run the clockwatch application, then after a failure the application will remain in a down, unowned state.

    root@hpeos002[clockwatch] # cmmodpkg -v -e -n hpeos001 clockwatch
    Enabling node hpeos001 for switching of package clockwatch.
    cmmodpkg  : Successfully enabled package clockwatch to run on node hpeos001.
    cmmodpkg  : Completed successfully on all packages specified.
    root@hpeos002[clockwatch] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos002
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001
          Alternate    up           enabled      hpeos002     (current)
    root@hpeos002[clockwatch] #
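Since a node is left with switching disabled for the package after the package fails on it, a quick check before each stress test is to list any nodes still disabled. This sketch parses the Node_Switching_Parameters section of cmviewcl -v -p output; the function name disabled_nodes is hypothetical, and the column layout is assumed to match the listings in this section.

```shell
# Hypothetical helper: print the name of every node whose SWITCHING
# column is "disabled". Reads `cmviewcl -v -p <package>` output on stdin.
disabled_nodes() {
    awk '
    /Node_Switching_Parameters:/ { in_sect = 1; next }
    # Rows look like: NODE_TYPE STATUS SWITCHING NAME [(current)]
    in_sect && ($1 == "Primary" || $1 == "Alternate") && $3 == "disabled" { print $4 }
    '
}
```

For example, cmviewcl -v -p clockwatch | disabled_nodes prints nothing when every node is eligible; each name it does print is a candidate for cmmodpkg -e -n <node> clockwatch once that node has been checked.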

Now we can kill the application monitoring script and see the application move to node hpeos001 .

    root@hpeos002[clockwatch] # ps -ef | grep clockwatch
        root  6016     1  0 06:13:39 ?         0:00 /clockwatch/bin/clockwatch /clockwatch/logs
        root  6020  3728  0 06:13:40 ?         0:00 /sbin/sh /etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor
        root  6523  3945  4 06:30:09 pts/1     0:00 grep clockwatch
    root@hpeos002[clockwatch] # kill 6020

The failover should start almost immediately because this is the service process that Serviceguard is monitoring. We should check the state of the package and review the package logfile on node hpeos002.

    root@hpeos002[clockwatch] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos001
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001     (current)
          Alternate    up           disabled     hpeos002
    root@hpeos002[clockwatch] #

As you can see, the package is running on the node we expected it to be. Here is the relevant extract from the package logfile on node hpeos002.

       "/etc/cmcluster/clockwatch/CLOCKWATCH.sh monitor"
            ########### Node "hpeos002": Package start completed at Sat Aug  9 06:13:40 BST 2003 ###########
            ########### Node "hpeos002": Halting package at Sat Aug  9 06:30:20 BST 2003 ###########
    Aug  9 06:30:20 - Node "hpeos002": Halting service clock_mon
    cmhaltserv : Service name clock_mon is not running.
    Attempting to stop clockwatch; PID = 6016
    Aug  9 06:30:20 - Node "hpeos002": Remove IP address 192.168.0.220 from subnet 192.168.0.0
    Aug  9 06:30:21 - Node "hpeos002": Unmounting filesystem on /dev/vg01/progs
    Aug  9 06:30:22 - Node "hpeos002": Unmounting filesystem on /dev/vg01/db
    Aug  9 06:30:23 - Node "hpeos002": Deactivating volume group /dev/vg01
    Deactivated volume group in Exclusive Mode.
    Volume group "/dev/vg01" has been successfully changed.
            ########### Node "hpeos002": Package halt completed at Sat Aug  9 06:30:23 BST 2003 ###########
    clockwatch.cntl.log: END

From the logfile, it appears that, unlike the previous test, no cleanup operations were necessary. This is good news. I would want to ensure that my application is definitely dead on node hpeos002. That would mean checking that no stray clockwatch processes were left running. If there were old clockwatch processes that were not killed successfully, they would pose a problem the next time we run the application on this node.

    root@hpeos002[clockwatch] # ps -ef | grep -i clock
        root  6785  6616  4 07:00:43 pts/3     0:00 grep -i clock
    root@hpeos002[clockwatch] #
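The same check can be scripted so that it gives a yes/no answer and ignores the grep command itself. This is only a sketch: the function names stray_procs and node_is_clean are hypothetical, and the pattern to search for is whatever identifies your application's processes.

```shell
# Hypothetical helper: list processes matching a pattern.
# Reads `ps -ef` output on stdin.
stray_procs() {
    pattern=$1
    # Drop any grep command line, which would otherwise match its own
    # pattern. This also hides a real process whose command line happens
    # to contain the word "grep".
    grep -i -e "$pattern" | grep -v 'grep'
}

# Succeeds when no matching process is left on the node.
node_is_clean() {
    if stray_procs "$1" >/dev/null; then
        return 1
    fi
    return 0
}
```

For example, ps -ef | node_is_clean clock && echo "no stray processes" confirms the node is ready to accept the package again.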

From what I can observe, it looks as though both the application monitoring script and Serviceguard are working as expected. Before we conclude this test, I will ensure that node hpeos002 is enabled to accept the package again.

    root@hpeos002[clockwatch] # cmmodpkg -v -e -n hpeos002 clockwatch
    Enabling node hpeos002 for switching of package clockwatch.
    cmmodpkg  : Successfully enabled package clockwatch to run on node hpeos002.
    cmmodpkg  : Completed successfully on all packages specified.
    root@hpeos002[clockwatch] # cmviewcl -v -p clockwatch
        PACKAGE      STATUS       STATE        AUTO_RUN     NODE
        clockwatch   up           running      enabled      hpeos001
          Policy_Parameters:
          POLICY_NAME     CONFIGURED_VALUE
          Failover        configured_node
          Failback        manual
          Script_Parameters:
          ITEM       STATUS   MAX_RESTARTS  RESTARTS   NAME
          Service    up                  0         0   clock_mon
          Subnet     up                                192.168.0.0
          Node_Switching_Parameters:
          NODE_TYPE    STATUS       SWITCHING    NAME
          Primary      up           enabled      hpeos001     (current)
          Alternate    up           enabled      hpeos002
    root@hpeos002[clockwatch] #

The order of performing the two Stress Tests is not necessarily important. You should spend some time trying both types of tests on all nodes in your cluster. This can be time-consuming in a large 16-node cluster. I hope that you have seen the importance of reviewing the output of cmviewcl -v -p <package> as well as carefully reviewing the package logfiles themselves.