Step 7: Launch Heartbeat

Before starting the heartbeat daemon, run the following ResourceManager test on both the primary server and the backup server to make sure you have set up the configuration files correctly:

 #/usr/lib/heartbeat/ResourceManager listkeys `/bin/uname -n` 

This command looks at the /etc/ha.d/haresources file and returns the list of resources (or resource keys) from this file. The only resource we have defined so far is the test resource, so the output of this command should simply be the following:

 test 
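This key comes from the resource line added to the /etc/ha.d/haresources file in the earlier step. As a reminder, that entry looks roughly like the following (a sketch based on the resource group shown in the log output later in this step; your hostname may differ):

 primary.mydomain.com test 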

Launch Heartbeat on the Primary Server

Once you have the test resource script configured properly on the primary server, start the Heartbeat program, and watch what happens in the /var/log/messages file.

Start Heartbeat (on the primary server) with one of these commands:

 #/etc/init.d/heartbeat start 

or

 #service heartbeat start 

Then look at the system log again, with this command:

 #tail /var/log/messages 

To avoid retyping this command every few seconds while Heartbeat is coming up, tell the tail command to display new information on your screen as it is appended to the /var/log/messages file by using this command:

 #tail -f /var/log/messages 

Press CTRL-C to break out of this command.

Note 

To change the logging file Heartbeat uses, uncomment the following line in your /etc/ha.d/ha.cf file:

 logfile        /var/log/ha-log 

You can then use the tail -f /var/log/ha-log command to watch what Heartbeat is doing more closely. However, the examples in this recipe will always use the /var/log/messages file.[4] (This does not change the amount of logging taking place.)

Heartbeat will wait for the amount of time configured for initdead in the /etc/ha.d/ha.cf file before it finishes its startup procedure, so you will have to wait at least two minutes for Heartbeat to start up (initdead was set to 120 seconds in the test configuration file).
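For reference, the timing and communication directives in the test /etc/ha.d/ha.cf file look roughly like the following sketch. The initdead value, port, interface, and node names match the configuration and log output used in this recipe; the keepalive and deadtime values shown here are only illustrative:

 keepalive 2                # seconds between heartbeat packets (illustrative value)
 deadtime 30                # seconds of silence before declaring a node dead (illustrative value)
 initdead 120               # extra grace period at startup (the two-minute wait described above)
 udpport 694                # UDP port for broadcast heartbeats
 bcast eth1                 # send heartbeats as broadcasts on this interface
 node primary.mydomain.com
 node backup.mydomain.com 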

When Heartbeat starts successfully, you should see messages like the following in the /var/log/messages file (I've removed the timestamp information from the beginning of each line to make the messages more readable):

 primary root: test called with status
 primary heartbeat[4410]: info: **************************
 primary heartbeat[4410]: info: Configuration validated. Starting heartbeat <version>
 primary heartbeat[4411]: info: heartbeat: version <version>
 primary heartbeat[2882]: WARN: No Previous generation - starting at 1[5]
 primary heartbeat[4411]: info: Heartbeat generation: 1
 primary heartbeat[4411]: info: UDP Broadcast heartbeat started on port 694 (694) interface eth1
 primary heartbeat[4414]: info: pid 4414 locked in memory.
 primary heartbeat[4415]: info: pid 4415 locked in memory.
 primary heartbeat[4416]: info: pid 4416 locked in memory.
 primary heartbeat[4416]: info: Local status now set to: 'up'
 primary heartbeat[4411]: info: pid 4411 locked in memory.
 primary heartbeat[4416]: info: Local status now set to: 'active'
 primary logger: test called with status
 primary last message repeated 2 times
 primary heartbeat: info: Acquiring resource group: primary.mydomain.com test
 primary heartbeat: info: Running /etc/init.d/test start
 primary logger: test called with start
 primary heartbeat[4417]: info: Resource acquisition completed.
 primary heartbeat[4416]: info: Link primary.mydomain.com:eth1 up. 

The test script created earlier in the chapter does not return the word OK, Running, or running when called with the status argument, so Heartbeat assumes the resource is not running and runs the script with the start argument to acquire the test resource (which doesn't really do anything at this point). You can see this in the preceding output. You should also see a line like this:

 primary.mydomain.com heartbeat[2886]: WARN: node backup.mydomain.com: is dead 

Heartbeat warns you that the backup server is dead because the heartbeat daemon hasn't yet been started on the backup server.
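For reference, the test resource script mentioned above behaves roughly like the following sketch (the actual script was created earlier in the chapter, and yours may differ). The important point is that it logs each call and never prints OK, Running, or running for the status argument, so Heartbeat always treats the resource as stopped and calls the script with start:

 #!/bin/sh
 # Sketch of a minimal test resource script (yours may differ).
 # It logs how it was called; because it never prints OK, Running, or
 # running for "status", Heartbeat runs it again with "start".
 case "$1" in
     start|stop|status)
         logger "test called with $1"
         ;;
 esac
 exit 0 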

Note 

Heartbeat messages never contain the word ERROR or CRIT for anything that should occur under normal conditions (even during a failover). If you see an ERROR or CRIT message from Heartbeat, action is probably required on your part to resolve the problem.
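A quick way to scan for such messages (a simple example, assuming Heartbeat is logging to the /var/log/messages file) is:

 #grep -E 'heartbeat.*(ERROR|CRIT)' /var/log/messages 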

Launch Heartbeat on the Backup Server

Once Heartbeat is running on the primary server, log on to the backup server, and start Heartbeat with this command:

 # /etc/init.d/heartbeat start 

The /var/log/messages file on the backup server should soon contain the following:

 backup heartbeat[4650]: info: **************************
 backup heartbeat[4650]: info: Configuration validated. Starting heartbeat <version>
 backup heartbeat[4651]: info: heartbeat: version <version>
 backup heartbeat[4651]: info: Heartbeat generation: 9
 backup heartbeat[4651]: info: UDP Broadcast heartbeat started on port 694 (694) interface eth1
 backup heartbeat[4654]: info: pid 4654 locked in memory.
 backup heartbeat[4655]: info: pid 4655 locked in memory.
 backup heartbeat[4656]: info: pid 4656 locked in memory.
 backup heartbeat[4656]: info: Local status now set to: 'up'
 backup heartbeat[4651]: info: pid 4651 locked in memory.
 backup heartbeat[4656]: info: Link backup.mydomain.com:eth1 up.
 backup heartbeat[4656]: info: Node primary.mydomain.com: status active
 backup heartbeat: info: Running /etc/ha.d/rc.d/status status
 backup heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
 backup heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
 backup heartbeat[4656]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys backup.mydomain.com]
 backup.mydomain.com heartbeat[4656]: info: Resource acquisition completed. 

Notice in this output how Heartbeat declares that this machine (the backup server) does not have any local resources in the /etc/ha.d/haresources file. This machine will act as a backup server and sit idle, simply listening for heartbeats from the primary server until that server fails. Heartbeat did not need to run the test script (/etc/ha.d/resource.d/test). The Resource acquisition completed message is a bit misleading because there were no resources for Heartbeat to acquire.

Note 

All resource script files that you refer to in the haresources file must exist and have execute permissions[6] before Heartbeat will start.
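For example, to confirm that the test resource script is in place and executable, you can run a check like this (adjust the path if you placed the script in /etc/init.d instead):

 #ls -l /etc/ha.d/resource.d/test 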

Examining the Log Files on the Primary Server

Now that the backup server is up and running, Heartbeat on the primary server should be detecting heartbeats from the backup server. At the end of the /var/log/messages file you should see the following:

 primary heartbeat[2886]: info: Heartbeat restart on node backup.mydomain.com
 primary heartbeat[2886]: info: Link backup.mydomain.com:eth2 up.
 primary heartbeat[2886]: info: Node backup.mydomain.com: status up
 primary heartbeat: info: Running /etc/ha.d/rc.d/status status
 primary heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
 primary heartbeat[2886]: info: Node backup.mydomain.com: status active
 primary heartbeat: info: Running /etc/ha.d/rc.d/status status 

If the primary server does not automatically recognize that the backup server is running, check to make sure that the two machines are on the same network, that they have the same broadcast address, and that no firewall rules are filtering out the packets. (Use the ifconfig command on both systems and compare the broadcast (Bcast) addresses; they should be the same on both.) You can also use the tcpdump command to see whether the heartbeat broadcasts are reaching both nodes:

 #tcpdump -i any -n -p udp port 694 

When everything is working, this command should capture and display the heartbeat broadcast packets from both the primary and the backup server.
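To compare broadcast addresses, as suggested above, a quick check on each server (assuming eth1 is the heartbeat interface, as in the earlier log output) is to run the following and make sure the Bcast value matches on both machines:

 #ifconfig eth1 | grep -i bcast 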

[4]The recipe in this chapter uses this method so that you can also watch for messages from the resource scripts with a single command (tail -f /var/log/messages).

[5]The Heartbeat "generation" number is incremented each time Heartbeat is started. If Heartbeat notices that its partner server changes generation numbers, it can take the proper action depending upon the situation. (For example, if Heartbeat thought a node was dead and then later receives another Heartbeat from the same node, it will look at the generation number. If the generation number did not increment, Heartbeat will suspect a split-brain condition and force a local restart.)

[6]Script files are normally owned by user root, and they have their permission bits configured using a command such as this: chmod 755 <scriptname>.


