Testing Your Heartbeat Configuration


Before you put your Heartbeat high-availability server pair into production, here are a few things to try:

Unplug the power cord on the primary server

  • Heartbeat on the backup server should detect the loss of heartbeat packets from the primary server and initiate a failover. Using Stonith, the backup server should turn off or reset the power to the primary server. Heartbeat on the backup server should then run the proper resource scripts (when the Stonith event has "cleared" or completed) to take ownership of the resources. Heartbeat on the backup server should also send gratuitous ARP broadcasts to notify clients and/or network equipment that the MAC addresses for the resource IP addresses have changed.

    Client computers and network equipment, such as routers and switches, should update their ARP caches to reflect the new MAC address of the backup server. On Cisco IOS network equipment, for example, you can check the ARP cache with the command:

     show ip arp 

    Or use the command:

     show ip arp 209.100.100.3 

    where 209.100.100.3 is the IP alias that fails over to the backup server. The MAC address should change automatically when the backup server sends out gratuitous ARP broadcasts.[9]

    On each client computer that shares the same network broadcast address, check the ARP cache with this command, which works on both Windows PCs and Linux hosts:

     arp -a 
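On a Linux client, the before-and-after comparison can be scripted. The following is a minimal sketch, not part of Heartbeat: the function name mac_for_ip is hypothetical, and it assumes the Linux net-tools output format ("name (IP) at MAC [ether] on interface"). Windows arp output is laid out differently and would need a different parser.

```shell
#!/bin/sh
# mac_for_ip IP
# Reads Linux "arp -a" / "arp -an" style output on stdin and prints the
# hardware address listed for IP. Prints nothing if IP has no entry.
# Assumed line format: name (IP) at MAC [ether] on IFACE
mac_for_ip() {
    # Field 2 is the address in parentheses; field 4 is the MAC address
    # (or "<incomplete>" for an unresolved entry).
    awk -v ip="($1)" '$2 == ip { print $4 }'
}
```

Run `arp -an | mac_for_ip 209.100.100.3` once before the failover and once after it; the two results should differ if the client accepted the gratuitous ARP broadcast.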

Test the behavior of the hb_standby command

  • Use the hb_standby command on the primary server to force the resources to fail over to the backup server. Then use the command again on the backup server to force the resources back to the primary server. ipfail will not work properly unless the hb_standby command works properly.

Unplug the production network cable on the primary server

  • ipfail (or Mon,[10] or a similar monitoring tool) should detect the network connection failure, and the resources and IP aliases should fail over to the backup server.
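For reference, ipfail is enabled with two kinds of entries in /etc/ha.d/ha.cf: one or more ping nodes and a respawn line. The fragment below is a sketch only. The ping address (assumed here to be a gateway on the production network) and the path to the ipfail binary are assumptions that depend on your network and distribution.

```text
# /etc/ha.d/ha.cf (fragment)
# Ping node: an assumed always-up address on the production network.
ping 209.100.100.1
# Have Heartbeat start ipfail and restart it if it exits.
respawn hacluster /usr/lib/heartbeat/ipfail
```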

Remove one of the heartbeat paths between the two servers

  • Use more than one heartbeat path between the servers to avoid false positives (the backup server incorrectly assumes the primary server has died). When you remove only one of these paths, such as the crossover network cable or serial cable connecting them, nothing should happen.
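Two independent heartbeat paths might be declared in /etc/ha.d/ha.cf as sketched below. The serial device and the Ethernet interface name are assumptions; substitute the ones your servers actually use.

```text
# /etc/ha.d/ha.cf (fragment)
# Path 1: null-modem serial cable (device name is an assumption).
serial /dev/ttyS0
baud 19200
# Path 2: crossover Ethernet cable (interface name is an assumption).
bcast eth1
```

With both entries in place, pulling either cable alone should leave the resources where they are.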

Remove all of the heartbeat paths between the two servers

  • What happens when you remove all heartbeat paths between the two servers? If you are using Stonith, the backup server should assume that the primary server has died, initiate a Stonith event, and take over the resources. What happens next depends upon how you have Stonith configured and whether or not you are using the auto_failback option.

    With two Stonith devices (each server controlling the other server's power supply) and the auto_failback option turned on, the two servers may start repeatedly cycling each other's power or Stonithing each other. To avoid this, you can disable auto_failback or use the method described earlier in this chapter to power off rather than power cycle the primary server.
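Disabling automatic failback is a one-line setting in /etc/ha.d/ha.cf. A minimal sketch of the fragment:

```text
# /etc/ha.d/ha.cf (fragment)
# Leave resources on the backup server after a failover until an
# operator moves them back.
auto_failback off
```

With this setting, the resources stay on the backup server until you move them back yourself, for example with the hb_standby command.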

Kill the heartbeat daemon on the primary server (killall -9 heartbeat)

  • Stonith is especially important when you are using IP aliases to offer resources to client computers. The backup server must Stonith or power off/reset the primary server before trying to assume ownership of the resources to avoid a split-brain condition.

Kill the resource daemon(s) on the primary server

  • This case is not addressed by the Heartbeat configuration used in this chapter. Depending on your needs, you can use cl_status or cl_respawn, both included with the Heartbeat package, to monitor services or automatically restart them when they fail. You can also use the Mon application (described in detail in Chapter 17) to monitor daemons and then take specific action (such as sending an alert to your cell phone or pager) when the service daemons fail.
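To illustrate the kind of check such a monitor performs, here is a minimal shell sketch that tests whether a daemon's recorded process is still alive. It is not how cl_respawn or Mon works internally; the function name pid_alive is hypothetical, and it assumes the daemon writes its process ID to a pidfile whose path you supply.

```shell
#!/bin/sh
# pid_alive PIDFILE
# Succeeds if PIDFILE holds the process ID of a running process.
# Run it as root or as the daemon's own user: "kill -0" also fails
# when you lack permission to signal the process.
pid_alive() {
    [ -r "$1" ] || return 1          # pidfile missing or unreadable
    pid=$(cat "$1")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;     # empty or not a plain number
    esac
    kill -0 "$pid" 2>/dev/null       # signal 0 only tests for existence
}
```

A monitoring script could call it as `pid_alive /var/run/sshd.pid || echo "sshd is down"`, where the pidfile path is only an example.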

Power reset (or reboot) both servers

  • Do the servers boot properly and leave the resources on the primary server, where they belong, once both systems have finished booting? If the backup server tries to grab the resources before the primary server has finished booting, you may need to increase the initdead time in the /etc/ha.d/ha.cf file.
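The timing entries in /etc/ha.d/ha.cf might look like the sketch below. The values are illustrative assumptions, not recommendations; the Heartbeat documentation suggests making initdead at least twice the deadtime value.

```text
# /etc/ha.d/ha.cf (fragment); all times are in seconds
keepalive 2
deadtime 30
# Extra grace period that applies only when Heartbeat first starts.
initdead 120
```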

[9] You may need to make sure the Cisco equipment accepts gratuitous ARP broadcasts with the Cisco IOS command ip gratuitous-arps.

[10] See Chapter 17.



The Linux Enterprise Cluster. Build a Highly Available Cluster with Commodity Hardware and Free Software
ISBN: 1593270364
Year: 2003
Pages: 219
Authors: Karl Kopper
