Testing Your Heartbeat Configuration | Linux Enterprise Cluster: Build a Highly Available Cluster with Commodity Hardware and Free Software

Before you put your Heartbeat high-availability server pair into production, here are a few things to try:

Unplug the power cord on the primary server

Test the behavior of the hb_standby command

Use the hb_standby command on the primary server to force resources to fail over to the backup server. Then use the command again on the backup server to force the resources back to the primary server. ipfail will not work properly unless the hb_standby command works properly.

Unplug the production network cable on the primary server

If you are using ipfail (or Mon,^[10] or a similar monitoring tool), the network connection failure should be detected, and the resources and IP aliases should fail over to the backup server.
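While you run this test, it can help to measure how long clients actually lose the service. The following is a minimal sketch, not part of Heartbeat: it probes a TCP port on the service's IP alias at a fixed interval and reports the longest gap between a failed probe and the next successful one. The function names and the probe interval are assumptions made for illustration.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, duration, interval=1.0,
          probe=tcp_probe, clock=time.monotonic, sleep=time.sleep):
    """Probe host:port every interval seconds for roughly duration seconds.

    Returns a list of (seconds_since_start, is_up) samples.
    """
    samples = []
    start = clock()
    while clock() - start < duration:
        samples.append((clock() - start, probe(host, port)))
        sleep(interval)
    return samples


def longest_outage(samples):
    """Return the longest span, in seconds, between the first failed probe
    of an outage and the next successful probe.

    An outage still in progress at the end of the samples is not counted.
    """
    longest = 0.0
    down_since = None
    for stamp, is_up in samples:
        if not is_up and down_since is None:
            down_since = stamp
        elif is_up and down_since is not None:
            longest = max(longest, stamp - down_since)
            down_since = None
    return longest
```

From a client computer you might start something like longest_outage(watch("10.1.1.10", 80, duration=120)) just before unplugging the cable, where 10.1.1.10 and port 80 are hypothetical stand-ins for your own IP alias and service port.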

Remove one of the heartbeat paths between the two servers

Use more than one heartbeat path between the servers to avoid false positives (the backup server incorrectly assumes the primary server has died). When you remove only one of these paths, such as the crossover network cable or serial cable connecting them, nothing should happen.
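Before you pull a cable, you may want to confirm that the configuration really does declare more than one heartbeat path. The sketch below counts the media directives (serial, bcast, ucast, and mcast) in ha.cf-style text. The parsing is deliberately simplified, and the sample configuration is hypothetical rather than taken from this chapter.

```python
# Media directives this sketch recognizes in ha.cf.
MEDIA_DIRECTIVES = {"serial", "bcast", "ucast", "mcast"}


def heartbeat_paths(ha_cf_text):
    """Return a list of (directive, target) pairs, one per heartbeat path."""
    paths = []
    for raw_line in ha_cf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        directive, args = fields[0], fields[1:]
        if directive not in MEDIA_DIRECTIVES or not args:
            continue
        if directive == "bcast":
            # bcast may list several interfaces; count each as its own path.
            paths.extend((directive, iface) for iface in args)
        else:
            # serial names a device; ucast and mcast name an interface first.
            paths.append((directive, args[0]))
    return paths


# Hypothetical example configuration (an assumption, not the chapter's file).
SAMPLE_HA_CF = """\
serial /dev/ttyS0
bcast eth1      # crossover cable
keepalive 2
"""
```

With the sample text, heartbeat_paths(SAMPLE_HA_CF) lists one serial path and one broadcast path, so removing either one alone should leave the servers still hearing each other.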

Remove all of the heartbeat paths between the two servers

What happens when you remove all heartbeat paths between the two servers? If you are using Stonith, the backup server should assume that the primary server has died, initiate a Stonith event, and take over the resources. What happens next depends upon how you have Stonith configured and whether or not you are using the auto_failback option.

With two Stonith devices (each server controlling the other server's power supply) and the auto_failback option turned on, the two servers may end up repeatedly power cycling (Stonithing) each other. To avoid this, you can disable auto_failback or use the method described earlier in this chapter to power off rather than power cycle the primary server.
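One way to spot this combination before you test is to scan ha.cf for it. The sketch below is an illustration only: it flags a configuration that contains a stonith or stonith_host directive together with a literal "auto_failback on". It cannot tell whether each server really controls the other's power supply, so treat a True result as a prompt to look closer, not as a diagnosis. The function names are assumptions.

```python
def parse_directives(ha_cf_text):
    """Map each ha.cf directive name to a list of its argument lists."""
    directives = {}
    for raw_line in ha_cf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        name, *args = line.split()
        directives.setdefault(name, []).append(args)
    return directives


def power_cycle_loop_risk(ha_cf_text):
    """True when Stonith is configured and auto_failback is explicitly on."""
    directives = parse_directives(ha_cf_text)
    uses_stonith = "stonith" in directives or "stonith_host" in directives
    # If auto_failback appears more than once, look at the last occurrence.
    failback = directives.get("auto_failback", [[]])[-1]
    return uses_stonith and failback == ["on"]
```

Run it against the text of /etc/ha.d/ha.cf on both servers, since the risk described above involves the Stonith configuration on each side.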

Kill the heartbeat daemon on the primary server (killall -9 heartbeat)

Stonith is especially important when you are using IP aliases to offer resources to client computers. The backup server must Stonith or power off/reset the primary server before trying to assume ownership of the resources to avoid a split-brain condition.

Kill the resource daemon(s) on the primary server

This case was not addressed by the Heartbeat configuration used in this chapter. Depending on your needs, you can use cl_status or cl_respawn, both included with the Heartbeat package, to monitor services or automatically restart them when they fail. You can also use the Mon application (described in detail in Chapter 17) to monitor daemons and then take specific action (such as sending an alert to your cell phone or pager) when the service daemons fail.
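To make the restart-on-failure idea concrete, here is a minimal sketch of a respawn loop. It is not how cl_respawn is implemented and does not use its options; it only shows the general technique of rerunning a command each time it exits. The restart limit and the example command are assumptions chosen so that the sketch terminates.

```python
import subprocess
import sys
import time


def respawn(command, max_restarts=3, pause=0.0):
    """Run command, starting it again each time it exits.

    Stops after max_restarts restarts and returns the exit codes seen,
    one per run. A real supervisor would normally loop indefinitely.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        if attempt and pause:
            time.sleep(pause)  # brief delay between restarts
        exit_codes.append(subprocess.call(command))
    return exit_codes


# Stand-in for a real daemon: a child process that exits with status 3.
EXAMPLE_COMMAND = [sys.executable, "-c", "raise SystemExit(3)"]
```

Calling respawn(EXAMPLE_COMMAND, max_restarts=1) runs the stand-in command twice and collects both exit codes. For a production service you would still want alerting of the kind Mon provides, because a daemon that exits immediately every time it starts is not fixed by restarting it.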

Power reset (or reboot) both servers

Do the servers boot properly and leave the resources on the primary server where they belong once both systems have finished booting? You may need to adjust the initdead time in the /etc/ha.d/ha.cf file if the backup server tries to grab the resources away from the primary server before it finishes booting.
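A quick sanity check on the timing values can save a few reboot cycles. The sketch below reads deadtime and initdead from ha.cf-style text and flags an initdead that is less than twice deadtime. The factor of two is a common rule of thumb and is an assumption here, not a requirement stated in this chapter. The sketch handles only bare seconds and values with an "ms" suffix.

```python
def to_seconds(value):
    """Convert an ha.cf time value such as '120' or '500ms' to seconds."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value)


def timing_settings(ha_cf_text):
    """Return the deadtime and initdead values found, in seconds."""
    settings = {}
    for raw_line in ha_cf_text.splitlines():
        fields = raw_line.split("#", 1)[0].split()  # drop comments
        if len(fields) >= 2 and fields[0] in ("deadtime", "initdead"):
            settings[fields[0]] = to_seconds(fields[1])
    return settings


def initdead_looks_short(ha_cf_text, factor=2.0):
    """True when initdead is set and is less than factor times deadtime."""
    found = timing_settings(ha_cf_text)
    if "deadtime" not in found or "initdead" not in found:
        return False  # nothing to compare; this sketch does not guess defaults
    return found["initdead"] < factor * found["deadtime"]
```

For example, a file with deadtime 30 and initdead 40 is flagged, while deadtime 30 with initdead 120 is not. If the backup server still grabs the resources during boot with a value that passes this check, time how long the primary server actually takes to boot and raise initdead accordingly.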

^[9]You may need to make sure the Cisco equipment accepts Gratuitous ARP broadcasts with the Cisco IOS command ip gratuitous-arps.

^[10]See Chapter 17.