Network Failures | Linux Enterprise Cluster: Build a Highly Available Cluster with Commodity Hardware and Free Software

Even with these preparations, we still have not eliminated all single points of failure in our two-server Heartbeat failover design. For example, what happens if the primary server simply loses the ability to communicate properly with the client computers over the normal, or production, network?

In such a case, if Heartbeat is properly configured, the heartbeats will continue to travel to the backup server thanks to the redundancy you built into your Heartbeat paths (as described in Chapter 8), and no failover will occur. The client computers, however, will no longer be able to access the resource daemons (the cluster resources) on the primary server.

We can solve this problem in at least two ways:

Run an external monitoring package, like the Perl program Mon, on the primary server and watch for the failure of the public NIC. When Mon detects the failure of this NIC, it should shut down the Heartbeat daemon (or force it into standby mode) on the primary server. The backup server will then take over the resources and, assuming it is healthy and can communicate on its public network interface, the client computers will once again have access to the resources. (See Chapter 17 for more information about Mon.)
Use the ipfail API plug-in, which allows you to specify one or more ping servers in the Heartbeat configuration file. If the primary server suddenly fails to see one of the ping servers, it asks the backup server, "Did you see that ping server go down too?" If the backup server can still talk to the ping server, the backup server knows that the primary server is not communicating on the network properly and that it should now take ownership of the resources.
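
The question-and-answer exchange described for ipfail can be pictured as a small decision function. The following Python sketch is a simplified model for illustration only, not ipfail's actual implementation; the function name should_fail_over and its two boolean arguments are hypothetical.

```python
def should_fail_over(primary_sees_ping_server: bool,
                     backup_sees_ping_server: bool) -> bool:
    """Return True when the backup server should take over the resources.

    Simplified model (an assumption, not ipfail's real code): fail over only
    when the primary has lost the ping server but the backup still reaches it.
    """
    if primary_sees_ping_server:
        # The primary's production network connection looks healthy.
        return False
    # The primary lost the ping server. If the backup lost it too, the ping
    # server itself is the likely culprit, and moving resources would not help.
    return backup_sees_ping_server
```

In this model, a ping server that goes down for both servers does not trigger a failover, which is the reason the primary asks the backup for its view before giving up the resources.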

ipfail

Beginning with Heartbeat version 0.4.9d, the ipfail plug-in is included with the Heartbeat RPM package as part of the standard Heartbeat distribution. To use ipfail, decide which network device (IP address) both Heartbeat servers should be able to ping at all times (such as a shared router, a network switch that is never supposed to go offline, and so on). Next, enter this IP address in your /etc/ha.d/ha.cf file and tell Heartbeat to start the ipfail plug-in each time it starts up:

 #vi /etc/ha.d/ha.cf

Add three lines before the final server lines at the end of the file like so:

 respawn hacluster /usr/lib/heartbeat/ipfail
 ping 10.1.1.254 10.1.1.253
 auto_failback off

The first line above tells Heartbeat to start the ipfail program on both the primary and backup servers,^[6] and to restart, or respawn, it if it stops, using the hacluster user created during installation of the Heartbeat RPM package. The second line specifies one or more ping servers or network nodes that Heartbeat should ping at heartbeat intervals to be sure that its network connections are working properly. (If you are building a firewall machine, for example, you will probably want to use ping servers on both interfaces, or networks.^[7]) The third line turns auto_failback off, so the primary server does not automatically reclaim the resources when it recovers (see the note that follows).
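
As a quick sanity check before starting Heartbeat, the three directives can be read back out of the configuration text. The Python sketch below is a minimal illustration, not a full ha.cf parser: the helper names parse_ha_cf and ipfail_configured are hypothetical, and it assumes one directive per line with "#" starting a comment.

```python
def parse_ha_cf(text: str) -> dict:
    """Collect the respawn, ping, and auto_failback settings from ha.cf text."""
    settings = {"respawn": [], "ping": [], "auto_failback": None}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        directive, args = fields[0], fields[1:]
        if directive == "respawn" and len(args) >= 2:
            # respawn <user> <program path>
            settings["respawn"].append((args[0], args[1]))
        elif directive == "ping":
            # One or more ping servers may share a single ping line.
            settings["ping"].extend(args)
        elif directive == "auto_failback" and args:
            settings["auto_failback"] = args[0]
    return settings


def ipfail_configured(settings: dict) -> bool:
    """True if ipfail is respawned and at least one ping server is listed."""
    runs_ipfail = any(path.endswith("/ipfail") for _, path in settings["respawn"])
    return runs_ipfail and bool(settings["ping"])


# The three lines from this section, one directive per line.
sample = """\
respawn hacluster /usr/lib/heartbeat/ipfail
ping 10.1.1.254 10.1.1.253
auto_failback off
"""
parsed = parse_ha_cf(sample)
print(parsed["ping"], parsed["auto_failback"], ipfail_configured(parsed))
```

To check a real file, pass the contents of /etc/ha.d/ha.cf to parse_ha_cf on each server; because the file should be identical on both, the results should match.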

Note

If you are using a version of Heartbeat prior to version 1.1.2, you must turn nice_failback on. Version 1.1.2 and later allow auto_failback (the replacement for nice_failback but with the opposite meaning) to be either on or off.

Now start Heartbeat on both servers and test your configuration. You should see a message in /var/log/messages indicating that Heartbeat started the ipfail child client process. Try removing the network cable on the primary server to break the primary server's ability to ping one of the ping servers, and watch as ipfail forces the primary server into standby mode. The backup server should then take over the resources listed in haresources.
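
Checking the log for the ipfail start-up message can be done with a simple filter. The Python sketch below is only an illustration: the helper name ipfail_log_lines is hypothetical, and the sample log lines are invented placeholders, since the exact wording of Heartbeat's messages depends on the version you are running.

```python
def ipfail_log_lines(log_text: str) -> list:
    """Return the log lines that mention ipfail (case-insensitive match)."""
    return [line for line in log_text.splitlines() if "ipfail" in line.lower()]


# Invented placeholder lines, NOT real Heartbeat output.
sample_log = """\
(placeholder) heartbeat: started child client ipfail
(placeholder) unrelated message from another daemon
"""
for line in ipfail_log_lines(sample_log):
    print(line)
```

On a live system you would pass in the contents of /var/log/messages instead of the sample text, which normally requires root privileges to read.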

^[6]The /etc/ha.d/ha.cf configuration file should be the same on the primary and backup servers.

^[7]And, of course, be sure to configure your iptables or ipchains rules to accept ICMP traffic (see Chapter 2).