Clustering Terminology | Novell Open Enterprise Server Administrators Handbook, SUSE LINUX Edition

Clustering provides a high-availability platform for your network infrastructure. High availability is becoming increasingly important for two purposes: file access and network services. The following sections discuss Novell Cluster Services (NCS) configuration for both of these situations. Before you start working with an NCS cluster, however, you should be familiar with the terms described here.

Master Node

The first server that comes up in an NCS cluster is assigned the cluster IP address and becomes the master node. (Other nodes in the cluster are often referred to as slave nodes.) The master node updates information transmitted between the cluster and eDirectory, and monitors the health of the cluster nodes. If the master node fails, NCS migrates the cluster IP address to another server in the cluster, and that server becomes the master node.
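The master-node rule can be pictured with a small model. The following Python sketch is illustrative only and is not NCS code: the `Cluster` class and its `join` and `fail` methods are hypothetical names, and the choice of the longest-running surviving node as the new master is an assumption, because the text says only that "another server" takes over.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    """Toy model of master-node assignment (not NCS code)."""
    cluster_ip: str
    nodes: List[str] = field(default_factory=list)  # running nodes, in join order
    master: Optional[str] = None                     # node currently holding cluster_ip

    def join(self, node: str) -> None:
        self.nodes.append(node)
        if self.master is None:
            # The first server to come up receives the cluster IP address.
            self.master = node

    def fail(self, node: str) -> None:
        self.nodes.remove(node)
        if node == self.master:
            # Assumption: the longest-running survivor takes over the cluster IP.
            self.master = self.nodes[0] if self.nodes else None
```

In this model, if node1, node2, and node3 join in that order, node1 is the master, and the failure of node1 moves the master role (and the cluster IP address) to node2.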

Cluster-Enabled Volume

A cluster-enabled volume is an NSS volume configured to provide location-transparent access to OES Linux file services. The volume is associated with an eDirectory virtual server object that provides a unique secondary IP address for locating the volume on the cluster's shared storage device. The volume provides read-write file access to users.

NOTE

OES Linux clusters fail over storage pools. This means you can migrate more than one volume at a time to another node if the volumes are part of the same storage pool. For more information on Novell Storage Services (NSS), see Chapter 11, "OES Linux File Storage and Management."
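To see why the pool is the unit of migration, consider the following Python sketch. It is a simplified illustration, not an NSS or NCS interface: the function name `migrate_pool` and the volume and pool names in the usage note are hypothetical, and the sketch assumes that every volume in a pool moves with that pool.

```python
from typing import Dict, List


def migrate_pool(pool: str,
                 target_node: str,
                 pool_location: Dict[str, str],
                 volume_to_pool: Dict[str, str]) -> List[str]:
    """Move a pool to target_node and return the volumes that moved with it."""
    # The pool, not the individual volume, is what changes nodes.
    pool_location[pool] = target_node
    # Every volume stored in that pool is now served from the target node.
    return sorted(v for v, p in volume_to_pool.items() if p == pool)
```

For example, if the hypothetical volumes VOL1 and VOL2 both reside in POOL1, migrating POOL1 to node2 returns both volume names, while a volume in a different pool stays where it is.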

Cluster Resource

A cluster resource is an object in eDirectory that represents an application or other type of service (such as DHCP or the master IP address) that you can migrate or fail over from one node to another in an NCS cluster. The cluster resource object includes scripts for unloading the service from one node and loading it on another node. In most cases, the service must be installed on every node in the cluster that will host it.

Heartbeats and the Split-Brain Detector

NCS uses heartbeats on the LAN and a Split-Brain Detector (SBD) on the shared storage device to keep all services highly available on the cluster when a node fails. NCS detects a node failure over the LAN and casts off the failed node through the following process:

Every second (by default), each node in an NCS cluster sends out a heartbeat message over the network.
The master node monitors the heartbeats of all other nodes in the cluster to determine whether they are still functioning.
If a heartbeat is not received from a node during a predefined timeout (eight seconds by default), that node is removed (cast off) from the cluster, and migration of services begins.

NOTE

If the master node fails to send a heartbeat within the predefined timeout, it is cast off, and another node takes over as the master node.
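The LAN heartbeat check described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the NCS implementation: the class `HeartbeatMonitor` and its methods are hypothetical, time is passed in as plain numbers of seconds, and a node is treated as failed only when more than the full timeout has elapsed since its last heartbeat. The two constants reflect the default values given in the text.

```python
from typing import Dict, Iterable, List

HEARTBEAT_INTERVAL = 1.0  # seconds between heartbeats (default per the text)
TIMEOUT = 8.0             # seconds without a heartbeat before cast-off (default)


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (illustrative only)."""

    def __init__(self, nodes: Iterable[str], now: float = 0.0,
                 timeout: float = TIMEOUT) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {n: now for n in nodes}

    def record(self, node: str, now: float) -> None:
        # Ignore heartbeats from nodes that have already been cast off.
        if node in self.last_seen:
            self.last_seen[node] = now

    def cast_off_expired(self, now: float) -> List[str]:
        """Remove and return the nodes whose heartbeat is overdue."""
        expired = sorted(n for n, t in self.last_seen.items()
                         if now - t > self.timeout)
        for n in expired:
            del self.last_seen[n]  # migration of services would start here
        return expired
```

The same timeout rule applies to the master node itself; in a fuller model, casting off the master would also trigger selection of a new master.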

NCS also uses the SBD to detect a node failure, through the following process:

Each node writes an epoch number to a special SBD partition on the shared storage device. A new epoch begins each time a node leaves or joins the cluster. Each node writes its epoch number at an interval equal to half the predefined timeout value (four seconds by default).
Each node reads all epoch numbers for all other nodes in the SBD partition.
When the master node sees an epoch number for a specific node that is lower than the others, it knows that the node has failed, and the node is cast off.
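The epoch comparison in the last step can be expressed as a short Python function. This is a sketch only: `find_stale_nodes` is a hypothetical name, the SBD partition is modeled as a plain dictionary of node names to epoch numbers, and "lower than the others" is interpreted here as lower than the highest epoch recorded by any node.

```python
from typing import Dict, List


def find_stale_nodes(sbd_epochs: Dict[str, int]) -> List[str]:
    """Return the nodes whose recorded epoch lags the highest recorded epoch."""
    if not sbd_epochs:
        return []
    current = max(sbd_epochs.values())
    # A node that stopped writing keeps an old epoch number and falls behind.
    return sorted(n for n, e in sbd_epochs.items() if e < current)
```

For example, if node1 and node2 have both written epoch 7 but node3 still shows epoch 6, the function reports node3 as the node to cast off.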

Fan-Out Failover

When a node fails in an NCS cluster, the cluster-enabled volumes and resources assigned to that node are migrated to other nodes in the cluster. Although this migration happens automatically, you must design and configure where each volume and resource migrates during failover.

TIP

You will probably want to distribute, or fan out, the volumes and resources to several nodes based on factors such as server load and the availability of installed applications. NCS relies on you to define where clustered resources will be assigned should a failure occur.
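A failover plan of this kind can be modeled as an ordered list of candidate nodes for each resource. The Python sketch below is an illustration under assumptions, not an NCS configuration format: the function `plan_failover`, the per-resource candidate lists, and the resource and node names in the usage note are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def plan_failover(failed_node: str,
                  assignments: Dict[str, str],
                  preferred_nodes: Dict[str, List[str]],
                  live_nodes: Set[str]) -> Dict[str, Optional[str]]:
    """Map each resource on failed_node to its first available candidate node."""
    plan: Dict[str, Optional[str]] = {}
    for resource, node in assignments.items():
        if node != failed_node:
            continue  # resources on healthy nodes stay where they are
        candidates = preferred_nodes.get(resource, [])
        # Take the first candidate that is still running; None if there is none.
        plan[resource] = next(
            (n for n in candidates if n in live_nodes and n != failed_node),
            None,
        )
    return plan
```

If VOL1 lists node2 before node3 and DHCP lists node3 before node2, a failure of node1 sends VOL1 to node2 and DHCP to node3, which spreads the load of the failed node across the remaining servers.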