Watchdog and Softdog


The kernel has its own method to handle a hung system, called watchdog. Watchdog is simply a kernel module that checks a timer to make sure the system is healthy. If watchdog thinks the kernel is hung, it can take drastic action such as a system reboot. If you want to protect your high-availability server configuration from a server hang that causes an interruption in services even when the server hang is not detected by Heartbeat, you should enable watchdog in your kernel.

Note 

We are talking about a server hang here and not an application problem. Heartbeat (prior to Heartbeat release 2, which is not yet available as of this writing) does not monitor resources or the applications under its control to see if they are healthy—to do this you need to use another package such as the Mon monitoring system discussed in Part IV.

A watchdog device is normally connected to a system to allow the kernel to determine whether the system has hung (when the kernel no longer sees the external timer device updating properly, it knows that something has gone wrong).

The watchdog code also supports a software replacement for external hardware timers called softdog. Softdog maintains an internal timer that is reset each time a process on the system writes to the /dev/watchdog device file. If softdog doesn't see a process write to the /dev/watchdog file before the timer expires, it assumes that the system must be malfunctioning, and it will initiate a kernel panic. Normally a kernel panic will cause a system to halt, but you can modify this default behavior and, instead, cause the system to reboot.
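The write-to-reset behavior just described can be illustrated with a minimal sketch. The helper name feed_watchdog and its three arguments are assumptions made for this example; they are not part of softdog or Heartbeat. Point it at an ordinary file when experimenting, because opening the real /dev/watchdog arms the timer.

```shell
#!/bin/sh
# Sketch only: feed_watchdog is a hypothetical helper, not part of softdog or Heartbeat.
# WARNING: opening the real /dev/watchdog arms the timer. If the writes stop
# afterwards, the system is expected to panic or reboot.
feed_watchdog() {
    device=$1      # path to write to, for example /dev/watchdog
    interval=$2    # seconds to wait between writes
    count=$3       # number of writes to perform before returning

    # Hold one file descriptor open for the whole loop, as a daemon would.
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3              # each write resets the timer
            i=$((i + 1))
            if [ "$i" -lt "$count" ]; then
                sleep "$interval"
            fi
        done
    } 3>"$device"
}
```

For example, feed_watchdog /tmp/fake-watchdog 1 5 performs five writes one second apart against a scratch file, which shows the pattern without touching the real device.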

Enable Watchdog in the Kernel

To enable watchdog in the kernel, you first need to make sure that the softdog module is compiled for your kernel.

Note 

On a normal Red Hat or SuSE distribution you will not need to add watchdog to your kernel because the modular version of the distribution kernel already contains a compiled copy of the softdog module.

If you have compiled your own kernel from source code, run make menuconfig from the /usr/src/linux directory, and check or enable the "Software Watchdog" option on the following submenu:

  • Character Devices

    • Watchdog Cards --->

      • [*] Watchdog Timer Support

        • [M] Software Watchdog (NEW)

If this option was not already selected in the kernel, follow the steps in Chapter 3 to recompile and install your new kernel. If you are using the standard modular version of the kernel included with Red Hat (or if you have just finished compiling your own kernel with modular support for the Software Watchdog), enter the following commands to make sure that the module loads into your running kernel:

 #insmod softdog
 #lsmod

You should see softdog listed. Normally the Heartbeat init script will insert this module for you if you have enabled watchdog support in /etc/ha.d/ha.cf as described later in this section. Once you have confirmed that the module loads, you should remove it from the kernel and allow Heartbeat to add it for you when it starts. Remove softdog from the kernel with the command:

 #modprobe -r softdog 
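If you would rather check for the module from a script than read the lsmod listing by eye, something like the following sketch may help. The helper name module_loaded is an assumption made for this example; it reads /proc/modules, which is the same list that lsmod prints, and accepts an alternate file as a second argument purely so it can be tried against sample data.

```shell
#!/bin/sh
# Sketch only: module_loaded is a hypothetical helper.
# Returns 0 if the named module appears in the module list, 1 otherwise.
module_loaded() {
    name=$1
    modules_file=${2:-/proc/modules}
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

Running module_loaded softdog && echo "softdog is loaded" after the insmod command should print the message, and should print nothing after the modprobe -r command.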

Kernel Panic—Hang or Reboot?

To force the system to reboot instead of halting when the kernel panics, modify the boot arguments passed to the kernel. To do this on a system using the Lilo[8] boot loader, for example, edit /etc/lilo.conf and add the following line near the top of the file, before any image= lines, so that it takes effect for every kernel version you have configured:

 append="panic=60" 

Then be sure to run the command:

 #lilo -v 

Alternatively, you could also use the command:

 #echo 60 > /proc/sys/kernel/panic 

But you would need to add this command to an init script so that it would be executed each time the system boots.
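A minimal sketch of what such an init script fragment might look like follows. The helper name set_panic_timeout and its optional second argument (an alternate target file, useful for a dry run) are assumptions made for this example. As a simplification, the sketch accepts only non-negative whole numbers.

```shell
#!/bin/sh
# Sketch only: set_panic_timeout is a hypothetical helper for an init script.
# It writes the number of seconds the kernel should wait after a panic
# before rebooting (0 means do not reboot).
set_panic_timeout() {
    seconds=$1
    panic_file=${2:-/proc/sys/kernel/panic}

    # Reject anything that is not a non-negative integer.
    case $seconds in
        ''|*[!0-9]*)
            echo "invalid panic timeout: $seconds" >&2
            return 1
            ;;
    esac

    echo "$seconds" > "$panic_file"
}
```

Calling set_panic_timeout 60 as root from a local startup script would have the same effect as the echo command above. On systems that apply /etc/sysctl.conf at boot, a kernel.panic = 60 entry in that file is another way to get the same result.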

Configure Heartbeat to Support Watchdog

In addition to using the softdog timer as we've just described (as part of the normal configuration of your server to improve its reliability when the system hangs) you can tell Heartbeat to update the softdog timer. This lets watchdog know that Heartbeat is running and healthy. If the timer doesn't get updated, watchdog will notice and force a kernel panic. In effect, we are telling watchdog to watch Heartbeat.

Note 

With Heartbeat release 1.2.3, you can have apphbd watch Heartbeat and then let watchdog watch apphbd instead.

When you enable the watchdog option in your /etc/ha.d/ha.cf file, Heartbeat will write to the /dev/watchdog file (or device) at an interval equal to the deadtime timer rounded up to the next whole second. Thus, should anything cause Heartbeat to fail to update the watchdog device, watchdog will initiate a kernel panic once the watchdog timeout period has expired (which is one minute, by default).

 #vi /etc/ha.d/ha.cf 

Uncomment the line:

 watchdog /dev/watchdog 
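If you manage several servers, you may prefer to make this change from a script instead of an editor. The following is a minimal sketch, assuming the line appears in your ha.cf as a commented-out "watchdog /dev/watchdog" entry; the helper name enable_watchdog_directive is an assumption made for this example.

```shell
#!/bin/sh
# Sketch only: enable_watchdog_directive is a hypothetical helper.
# It removes the leading "#" from a commented-out "watchdog /dev/watchdog"
# line and leaves every other line of the file unchanged.
enable_watchdog_directive() {
    conf=$1
    scratch=$(mktemp) || return 1

    # Write the edited copy to a scratch file, then copy it back over the
    # original so that the original file's permissions are preserved.
    sed 's|^[[:space:]]*#[[:space:]]*\(watchdog[[:space:]][[:space:]]*/dev/watchdog\)[[:space:]]*$|\1|' \
        "$conf" > "$scratch" && cat "$scratch" > "$conf"
    status=$?

    rm -f "$scratch"
    return $status
}
```

For example, enable_watchdog_directive /etc/ha.d/ha.cf, run as root, should leave the file with an active watchdog line. Make a backup copy of ha.cf first if you are unsure how the line is written in your version of the file.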

Now restart Heartbeat to give the Heartbeat init script the chance to properly configure the watchdog device, with the command:

 #service heartbeat restart 

You should see softdog listed when you run:

 #lsmod 

Note 

You should do this on all of your Heartbeat servers to maintain a consistent Heartbeat configuration.

To test the watchdog behavior, kill all of the running Heartbeat daemons on the primary server with the following command:

 #killall -9 heartbeat 

You will see the following warning on the system console and in the /var/log/messages file:

 Softdog: WDT device closed unexpectedly. WDT will not stop! 

This error warns you that the kernel will panic. Your system should reboot instead of halting if you have modified the /proc/sys/kernel/panic value as described previously.
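After the server comes back up, you can confirm from a script that the warning was recorded. This is a minimal sketch; the helper name softdog_warning_logged is an assumption made for this example, and it simply searches the log file for the text of the warning shown above.

```shell
#!/bin/sh
# Sketch only: softdog_warning_logged is a hypothetical helper.
# Returns 0 if the softdog warning appears in the log file, 1 otherwise.
softdog_warning_logged() {
    log_file=${1:-/var/log/messages}
    grep -q 'WDT device closed unexpectedly' "$log_file"
}
```

Running softdog_warning_logged && echo "softdog fired" should print the message if the test above triggered the watchdog, provided your syslog configuration writes kernel messages to /var/log/messages.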

[8]See Chapter 3 for a discussion of the Lilo boot loader.



The Linux Enterprise Cluster. Build a Highly Available Cluster with Commodity Hardware and Free Software
ISBN: 1593270364
Year: 2003
Pages: 219
Authors: Karl Kopper
