Heartbeat Maintenance


Because Heartbeat resource scripts are called by the heartbeat daemon with start, stop, or status requests, you can restart a resource without causing a cluster transition event. For example, say your Apache web server daemon is running on the primary.mydomain.com web server, and the backup.mydomain.com server is not running anything; it is waiting to offer the web server resource in the event of a failure of the primary computer. If you needed to make a change to your httpd.conf file (on both servers!) and wanted to stop and restart the Apache daemon on the primary computer, you would not want this to cause Heartbeat to start offering the service on the backup computer. Fortunately, you can run the /etc/init.d/httpd restart command (or /etc/init.d/httpd stop followed by /etc/init.d/httpd start) without causing any change to the cluster status as far as Heartbeat is concerned.

Thus, you can safely stop and restart all of the cluster resources Heartbeat has been asked to manage, with the possible exception of filesystems, without changing resource ownership or causing a cluster transition event. Many daemons also recognize the SIGHUP signal (sent with kill -HUP <process-ID-number>), so you can force a resource daemon to reload its configuration files after making a change without stopping and restarting it.

Again, in the case of the Apache httpd daemon, if you change the httpd.conf file and want to notify the running daemons of the change, you would send them the SIGHUP signal with the following command:

 #kill -HUP `cat /var/run/httpd.pid` 

Note 

The file containing the httpd parent process ID number is controlled by the PidFile entry in the httpd.conf file (this file is located in the /etc/httpd/conf directory on Red Hat Linux systems).
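For example, an httpd.conf entry that matches the path used in the kill command above would look like the following (the exact default path varies by distribution and Apache version, so treat this as an illustration rather than a universal default):

 PidFile /var/run/httpd.pid 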

Changing Heartbeat Configuration Files

If you need to make a change to the Heartbeat configuration file /etc/ha.d/authkeys or /etc/ha.d/ha.cf, you can force the running heartbeat daemon to reload these configuration files with the following command:

 #/etc/init.d/heartbeat reload 

or

 #service heartbeat reload 

When you change a haresources file, you need to restart Heartbeat on both the primary and the backup server to make your changes take effect (the reload option will not work).
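For example, after editing /etc/ha.d/haresources on both servers, run the following command first on the primary server and then on the backup server:

 #/etc/init.d/heartbeat restart 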

Server Maintenance and the Heartbeat auto_failback Option

Normally, when the primary server crashes and the backup server takes ownership of a resource, the backup server will only hold this resource until the primary server comes back up. Once the primary server is up and running again, the backup server will release the resource and the primary server will assume ownership once again; it will start the resource script and start offering the service to client computers. This is the default heartbeat failback configuration.

To modify this Heartbeat behavior, add the following line before the node entries in your /etc/ha.d/ha.cf file (through version 1.1.2 of Heartbeat):

 nice_failback on 

For Heartbeat versions 1.1.2 and later, the syntax is more intuitive:

 auto_failback off 
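In context, this option goes before the node entries in ha.cf. Using the example hostnames from the beginning of this section, the relevant portion of the file might look like this:

 auto_failback off
 node primary.mydomain.com
 node backup.mydomain.com 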

These options tell Heartbeat to leave the resource on the backup server even after the primary server comes back on line. Make this change to the ha.cf file on both heartbeat servers and then issue the following command on both servers to tell Heartbeat to re-read its configuration files:

 #/etc/init.d/heartbeat reload 

You should see a message like the following in the /var/log/messages file:

 heartbeat[1032]: info: nice_failback is in effect. 

This configuration is useful when you want to perform system maintenance tasks that require you to reboot the primary server. When you take down the primary server and the resources are moved to the backup server, they will not automatically move back to the primary server.

Once you are happy with your changes and want to move the resource back to the primary server, remove the nice_failback on (or auto_failback off) entry from the ha.cf file on both servers and again run the following command on the backup server:

 #/etc/init.d/heartbeat reload 

(You do not need to run this command on the primary server because we are about to restart the heartbeat daemon on the primary server anyway.) Now, force the resource back over to the primary server by entering the following command on the primary server:

 #/etc/init.d/heartbeat restart 

Note 

If you want auto_failback turned off as the default, or normal, behavior of your Heartbeat configuration, be sure to place the auto_failback option in the ha.cf file on both servers. If you neglect to do so, you may end up with a split-brain condition; both servers will think they should own the cluster resources. The setting you use for the auto_failback (or the deprecated nice_failback) option has subtle ramifications and will be discussed in more detail in Chapter 9.

Forcing the Primary Server into Standby Mode

In a two-node Heartbeat configuration, you can force the primary server to relinquish its resources, without stopping Heartbeat, by forcing the primary server into standby mode. This causes the backup server to start the resource scripts and take ownership of the Heartbeat resources. Run this command as root on the primary server:

 #/usr/lib/heartbeat/hb_standby 

The primary server will not go into standby mode if it cannot talk to the heartbeat daemon on the backup server. In Heartbeat versions through 1.1.2, the hb_standby command required nice_failback to be turned on; with the change in syntax from nice_failback to auto_failback, this restriction was lifted, and hb_standby no longer requires auto_failback to be turned off. If you are using an older version of Heartbeat that still uses nice_failback, however, it must be turned on before you can use the hb_standby command.

In this example, the primary server (where we ran the hb_standby command) is requesting that the backup server take over the resources. When the backup server receives this request, it asks the primary server to release its resources. If the primary server does not release the resources,[11] the backup server will not start them.

Note 

The hb_standby command accepts an argument of local, foreign, or all to indicate which resources should go into standby (that is, fail over to the backup server).
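For example, to ask the backup server to take over all of the resources, you would run:

 #/usr/lib/heartbeat/hb_standby all 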

Tuning Heartbeat's Deadtime Value

Sometimes Heartbeat will report that it cannot hear its own heartbeat, or that heartbeat times are too long. If the heartbeat logs indicate that a heartbeat was not received within the deadtime timeout period and the backup server tried to take over for the primary server when you did not want it to, you need to tune your deadtime value properly to account for system and network conditions that may cause heartbeats to be lost or delayed. This can occur on systems that are heavily loaded with network processing tasks, or even with heavy CPU utilization.

To tune the heartbeat deadtime value for these conditions, set the deadtime value to a large value such as 60 seconds or higher, and set the warntime value to the number of seconds you would like to use for your deadtime value.
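For example, if you ultimately want a deadtime of about 10 seconds, you might start with ha.cf entries like the following (the keepalive value, the interval between heartbeats, is an assumed value shown only for context):

 keepalive 2
 warntime 10
 deadtime 60 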

Now run the system for a few weeks and carefully watch the /var/log/messages file and the logfile /var/log/ha-log for warntime messages indicating the longest period of time your system went without hearing a heartbeat. Armed with that information, set your warntime to this amount, and multiply this warntime value by 1.5 to 2 to arrive at the smallest possible value you should use for your deadtime. Leave logging enabled and continue to monitor your logs to make sure you have not set the value too low.
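One simple way to scan for these warning messages (assuming the log locations named above) is a command like:

 #grep -i warn /var/log/ha-log 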

Informational Messages in Heartbeat's Log

You may see messages such as the following in your message log file:

 heartbeat: info: RealMalloc stats: 976 total malloc bytes. [pid369/HBREAD]
 heartbeat: info: MSG stats: 0/441708 age 0 [pid370/MST_STATUS]
 heartbeat: info: ha_malloc stats: 0/9035987 0/0 

These messages will appear every 24 hours, beginning when Heartbeat was started. After a few days of operation, the total number of bytes used should not grow. These informational messages from Heartbeat can be ignored when Heartbeat is operating normally.

Failover and Respawn (Automatically Restarting Failed Resources)

On a normal Unix/Linux server, the init daemon will start daemons (usually serial line communication services or tty related services) based on entries in the /etc/inittab file. If the entry contains the word respawn, init will monitor the daemon and restart it if it dies for any reason.

If you need to run a service that needs to be restarted or respawned automatically when it fails, you have a few options:

  • If the application should run on both the primary and the backup server all of the time, create an entry for the application in /etc/inittab (see the man page for inittab for syntax details; a sample entry appears after this list).

  • If the application should only run when Heartbeat is running, you can create a respawn entry for the service in the /etc/ha.d/ha.cf file that looks like this:

     respawn root /usr/sbin/faxgetty ttyQ01e0 

    This line tells Heartbeat to run the /usr/sbin/faxgetty program and pass it the argument ttyQ01e0. Heartbeat will do this when it first starts up on both the primary and the backup Heartbeat servers (recall that the Heartbeat configuration files should always be the same on both servers).

  • If the application should run on only one of the Heartbeat servers at a time, you will have to place it under the control of a service that knows how to restart failed daemons, such as the Daemontools package (http://cr.yp.to/daemontools.html), or use the cl_respawn utility included with the Heartbeat package. To use the cl_respawn utility, create a line such as the following in the start section of your resource script:

     cl_respawn /usr/sbin/faxgetty ttyQ01e0 

    When Heartbeat calls your resource script and passes it the start argument, it will then run the utility cl_respawn with the arguments /usr/sbin/faxgetty and ttyQ01e0. The cl_respawn utility is a small program that is unlikely to crash, because it doesn't do much—it just hangs around in the background and watches the daemon it started (faxgetty in this example) and restarts the daemon if it dies. (Run cl_respawn -h from a shell prompt for more information on the capabilities of this utility.)
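For the inittab approach mentioned in the first bullet, a minimal entry might look like the following (the identifier fx and the runlevels 2345 are assumptions; adjust them for your system):

 fx:2345:respawn:/usr/sbin/faxgetty ttyQ01e0 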

License Manager Failover

A license manager daemon such as lmgrd from GlobeTrotter Software can be configured to fail over in conjunction with an IP address, just like any other daemon. However, before you can fail over a license manager, you will need a second set of licenses from your software vendor for the backup server's hostid. Some software vendors will allow you to have two sets of licenses if you agree to use the second set only when the primary license server goes down.
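As a sketch, assuming a hypothetical cluster IP address of 209.100.100.3 and an init script for the license manager named lmgrd, the /etc/ha.d/haresources entry might look like this (a bare IP address in haresources is shorthand for an IPaddr resource):

 primary.mydomain.com 209.100.100.3 lmgrd 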

[11]There is a 20-minute timeout period (it was a 10-second timeout period prior to version 1.0.2).


