Stonith


Stonith, or "shoot the other node in the head,"[1] is a component of the Heartbeat package that allows the system to automatically reset the power of a failing server using a remote or "smart" power device connected to a healthy server. A Stonith device is a device that can turn power off and on in response to software commands. A serial or network cable allows a server running Heartbeat to send commands to this device, which controls the electrical power supply to the other server in a high-availability pair of servers. The primary server, in other words, can reset the power to the backup server, and the backup server can reset the power to the primary server.
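The arrangement described above can be modeled in a few lines of Python. This is only an illustrative sketch: the PowerSwitch class, its method names, and the server labels are hypothetical and do not correspond to any Heartbeat interface or real device protocol. It shows one power device with two outlets, where each server in the pair can ask the device to cycle the power of the other.

```python
from dataclasses import dataclass, field


@dataclass
class PowerSwitch:
    """Hypothetical stand-in for a remote or "smart" power device."""

    # Maps each server name to the state of its outlet (True = powered).
    outlets: dict = field(default_factory=dict)
    # Commands received by the device, oldest first.
    log: list = field(default_factory=list)

    def plug_in(self, server):
        """Attach a server to an outlet that is switched on."""
        self.outlets[server] = True

    def reset(self, requester, target):
        """Turn the target's outlet off and back on for the requester."""
        if target not in self.outlets:
            raise KeyError(f"no outlet for {target}")
        self.outlets[target] = False
        self.log.append((requester, "off", target))
        self.outlets[target] = True
        self.log.append((requester, "on", target))


switch = PowerSwitch()
for name in ("primary", "backup"):
    switch.plug_in(name)

# Either server can reset the power of its peer.
switch.reset("backup", "primary")
switch.reset("primary", "backup")
print(switch.log)
```

A real device would receive these commands over the serial or network cable mentioned above; the in-memory log here merely records the order in which they arrive.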

Note 

Although there is no theoretical limitation on the number of servers that can be connected to a remote or "smart" power device capable of cycling system power, the majority of Stonith implementations use only two servers. Because a two-server Stonith configuration is the simplest and easiest to understand, it is likely in the long run to contribute to—rather than detract from—system reliability and high availability.[2]

This section will describe how to get started with Stonith using a two-server (primary and backup server) configuration with Heartbeat. When the backup server in this two-server configuration no longer hears the heartbeat of the primary server, it will power cycle the primary server before taking ownership of the resources. There is no need to configure sophisticated cluster quorum election algorithms in this simple two-server configuration;[3] the backup server can be sure it will be the exclusive owner of the resource while the primary server is booting. If the primary server cannot boot and reclaim its resources, the backup server will continue to maintain ownership of them indefinitely. The backup server may also keep control of the resources after the primary server has rebooted if you set the auto_failback option to off in the Heartbeat ha.cf configuration file (as discussed in Chapter 8).
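The order of operations in this paragraph (notice the silence, reset the primary server, and only then take the resources) can be sketched as a single decision function. This is a minimal sketch and not Heartbeat's implementation: the function name, the deadtime parameter, the callbacks, and the choice to refuse takeover when the reset fails are all assumptions made for illustration.

```python
def backup_step(now, last_heartbeat, deadtime, fence, take_over):
    """One decision made by the backup server (illustrative only).

    Returns "standby", "took_over", or "fence_failed".
    """
    if now - last_heartbeat <= deadtime:
        return "standby"  # the primary is still being heard
    # The primary is silent: power cycle it BEFORE touching any resource.
    if not fence("primary"):
        # Assumption: without a confirmed reset, do not take over,
        # because exclusive ownership cannot be guaranteed.
        return "fence_failed"
    take_over()
    return "took_over"


events = []


def fake_fence(node):
    """Hypothetical reset callback that always succeeds."""
    events.append(f"reset {node}")
    return True


def fake_take_over():
    """Hypothetical callback that acquires the cluster resources."""
    events.append("acquire resources")


# Heartbeat heard 5 seconds ago with a 10-second limit: stay on standby.
print(backup_step(5, 0, 10, fake_fence, fake_take_over))
# Heartbeat last heard 31 seconds ago: reset the primary, then take over.
print(backup_step(31, 0, 10, fake_fence, fake_take_over))
print(events)
```

Because the reset always comes before the takeover in the recorded events, the backup server never holds the resources while the primary server might still be running.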

Forcing the primary server to reboot with a power reset is the crudest and surest way to avoid a split-brain condition. As mentioned in Chapter 6, a split-brain condition can have dire consequences when two servers share access to an external storage device such as a single SCSI bus connection to one disk drive. If the server with write permission, the primary server, malfunctions, the backup server must take great precautions to ensure that it will have exclusive access to the storage device before it modifies data.[4]

Stonith also ensures that the primary server is not still trying to claim ownership of an IP address after its resources have failed over to the backup server. This is more important than it sounds, because a failover often occurs when the primary server is simply not behaving properly while the lower-level networking protocols that allow it to respond to ARP requests ("Who owns this IP address?") are still working. Once communication with the daemons on the primary server, especially with the Heartbeat daemon, is lost, the backup server has no reliable way of knowing whether the primary server is still engaging in this sort of improper behavior.

[1]Other high-availability solutions sometimes call this Stomith, or "shoot the other machine in the head."

[2]"Complexity is the enemy of reliability," writes Alan Robertson, the lead Heartbeat developer.

[3]With three or more servers competing for cluster resources, a quorum, or majority-wins, election process is possible (quorum election will be included in Release 2 of Heartbeat).

[4]More sophisticated methods are possible through advanced SCSI commands, which are not implemented in all SCSI devices and are not currently a part of Heartbeat.



The Linux Enterprise Cluster. Build a Highly Available Cluster with Commodity Hardware and Free Software
ISBN: 1593270364
Year: 2003
Pages: 219
Authors: Karl Kopper
