24.4 Serviceguard and High Availability Clusters


We look at the basic ideas behind a simple High Availability Cluster and the technologies used to accomplish the goals of the cluster.

We are specifically interested in High Availability Clusters in which we try to maximize the availability of applications by providing redundancy in hardware and software, eliminating as many Single Points Of Failure (SPOFs) as possible and, in turn, as many unplanned outages as possible. Management and coordination of the cluster is handled by software. Figure 24-5 shows an example of a High Availability Cluster from Hewlett-Packard's perspective.

Figure 24-5. A High Availability Cluster.

The software that controls the creation and management of the cluster is Serviceguard. The name itself hints at its purpose: the nodes themselves are not what we are protecting; the service they provide is. Here are some points regarding the configuration and setup of such a cluster:

  • A single cluster is a collection of at least two and up to 16 nodes. A cluster can run with a single node, but there is no protection against a system failure. Larger clusters offer better flexibility when dealing with a failure. Serviceguard can be configured to distribute the workload of individual applications to machines based on the current workload of available machines within the cluster.

  • Each node runs its own operating system. It is advised that the operating system disk(s) have redundancy in the form of at least RAID 1 (mirror) protection.

  • The use of advanced disk arrays is advisable wherever possible due to their built-in resilience and the fact that they can be housed in a remote site, if necessary (utilizing technologies such as Fibre Channel).

  • All nodes should be running intelligent diagnostic software to ensure that the health of individual nodes is constantly monitored.

  • All nodes designated to run a specific application package must be connected to the disks used by that application package.

  • The cluster is viewed as a collection of nodes in a single IP subnet; hence, geographically the cluster is limited only by the capabilities of the LAN/WAN.

  • A system with two LANICs cannot be configured with IP addresses in the same subnet. If multiple LANICs are used, then multiple network addresses will need to be used.

  • The members of the cluster must not be on different segments of a routed network; they are all located on the same IP subnet.

  • Standby LAN cards are recommended for redundancy and fast local LAN failover. This offers LAN failure protection (fast local switch to standby LAN adapter inside the same node). The IP address for the Active LAN card will be relocated to the Standby LAN card. At least one Standby LAN is required.

  • Standby LANs need to be located on the same LAN segment as the active LAN. This means that both physical networks must be bridged. The bridge, switch, or hub used must support the 802.1 Spanning Tree Algorithm. Use of multiple bridges/hubs/switches is advisable to eliminate another SPOF.

  • A Dedicated Heartbeat LAN is recommended for performance, reliability, and redundancy. A Standby LAN card for the Heartbeat LAN is also advisable (a configuration sketch covering heartbeat, standby, and cluster lock entries follows this list).

  • The primary and standby LAN must be of the same type (FDDI to FDDI, 10Base-T to 10Base-T, Token Ring to Token Ring).

  • Application packages allow all resources for a package to be defined in one place, including the following:

    - Disk/volume groups

    - Processes to be monitored

    - An IP address associated with each application. This is an important feature because it serves two purposes:

    • Disassociates an application from a single node.

    • Allows the IP address to be "relocated" to another node if the original node fails. The concept of a Relocatable IP address is the essence of how Serviceguard can maximize the uptime of applications; node failures need not lead to long application outages.

  • Automatic cluster reconfiguration after a node failure means that no manual intervention is required after a node failure.

  • Intelligent cluster reconfiguration after a node failure, using either a "cluster lock disk" or a "quorum server," preserves data and cluster integrity, i.e., no "split-brain" syndrome.

  • No resources are idle because every node is attached to all the data disks for an application and, hence, can run its own or any other application. This also facilitates hardware and software upgrades because application packages can be moved from node to node with minimal interruption.

  • It is advisable that each node able to run an application package be of a similar hardware configuration to alleviate issues relating to expected performance levels.

  • Advanced performance tools such as Process Resource Manager (PRM) and Work Load Manager (WLM) may need to be employed to balance the workload on individual nodes when a package is moved from one node to another.

  • Cluster-wide security policies need to be employed so that the availability of application packages is not compromised. If users need to log in to individual nodes, then sophisticated measures may need to be employed to distribute and replicate user login details across the entire cluster, e.g., single-sign-on, LDAP, and NIS+.

  • Advanced clusters are considered later, where we provide site-wide redundancy in an attempt to protect against unplanned outages such as natural disasters and acts of terrorism that can render an entire data center inoperable.
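
To make the networking and cluster-lock points above more concrete, here is a sketch of the kind of entries that appear in the cluster ASCII configuration file. The cluster name, node name, LAN interface names, IP addresses, volume group, and timing values are invented for illustration; the real template, complete with the exact parameter names and their defaults, is generated by the cmquerycl command discussed below.

    CLUSTER_NAME            cluster1
    FIRST_CLUSTER_LOCK_VG   /dev/vglock         # volume group holding the cluster lock disk

    NODE_NAME               node1
      NETWORK_INTERFACE     lan0
        HEARTBEAT_IP        192.168.10.1        # dedicated Heartbeat LAN
      NETWORK_INTERFACE     lan1
        STATIONARY_IP       10.1.1.1            # user/data LAN
      NETWORK_INTERFACE     lan2                # Standby card: same bridged segment as lan1, no IP
      FIRST_CLUSTER_LOCK_PV /dev/dsk/c1t2d0     # physical volume used as the cluster lock

    HEARTBEAT_INTERVAL      1000000             # microseconds between heartbeats (illustrative value)
    NODE_TIMEOUT            2000000             # microseconds of silence before a node is deemed failed

Note that the Standby interface carries no IP address of its own; it simply stands ready to inherit the IP address of a failed Active card on the same bridged segment.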

The cluster is configured and detailed via a binary configuration file called /etc/cmcluster/cmclconfig. This binary file is described by an associated ASCII file; any name can be used for the ASCII file. The binary file is distributed to all nodes in the cluster. As we will see, updates and amendments to the cluster can occur while the cluster is up and running.
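
As a sketch of the usual sequence (the node and file names here are arbitrary; see the cmquerycl, cmcheckconf, and cmapplyconf man pages for the full options), the ASCII file is created, checked, and then compiled and distributed as the binary cmclconfig:

    # Generate an ASCII template describing the chosen nodes
    cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node1 -n node2

    # Edit cluster.ascii as required, then verify it
    cmcheckconf -v -C /etc/cmcluster/cluster.ascii

    # Compile the ASCII file into the binary cmclconfig and distribute it to all nodes
    cmapplyconf -v -C /etc/cmcluster/cluster.ascii

    # Start the cluster and confirm its status
    cmruncl -v
    cmviewcl -v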

An application package is essentially all of the resources an application requires to run: disk/volume groups, filesystems, processes, and an IP address. Users connect to the package IP address rather than to a particular server, hence removing the relationship between an individual server and an individual application. Individual application packages have their own control file and startup script. The control file is compiled and distributed into the cluster configuration file (/etc/cmcluster/cmclconfig), while it is up to the administrator to ensure that the startup script is distributed to all nodes able to run the application package.
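
By way of illustration (the package name, node names, devices, and monitor command are all invented), a package configuration file based on the template produced by cmmakepkg -p contains entries along these lines, and is applied with cmapplyconf -P:

    PACKAGE_NAME            oracle1
    NODE_NAME               node1               # primary node
    NODE_NAME               node2               # adoptive node
    RUN_SCRIPT              /etc/cmcluster/oracle1/oracle1.cntl
    HALT_SCRIPT             /etc/cmcluster/oracle1/oracle1.cntl
    SERVICE_NAME            oracle1_mon
    SUBNET                  10.1.1.0

The matching startup/control script, built from the cmmakepkg -s template and copied by the administrator to every node able to run the package, defines the package's resources as shell variables, for example:

    VG[0]="vg_oracle"                             # volume group(s) to activate
    LV[0]="/dev/vg_oracle/lvol1"; FS[0]="/oracle" # filesystem(s) to mount
    IP[0]="10.1.1.100"                            # relocatable package IP address
    SUBNET[0]="10.1.1.0"
    SERVICE_NAME[0]="oracle1_mon"
    SERVICE_CMD[0]="/usr/local/bin/monitor_oracle" # process to be monitored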

One of the most important processes within Serviceguard is cmcld, a high-priority process known as the Cluster Management Daemon. A fundamental task of cmcld is to maintain the current list of members in the cluster. The technique to achieve this is for each node to send and receive regular "heartbeat" packets. Because this is so fundamental to cmcld, multiple and/or dedicated LAN cards are used for heartbeat traffic to ensure that the heartbeat packets arrive at their destination. When a node has failed to send/receive heartbeat packets within a prescribed timeframe, cmcld will deem that node to have "failed," for whatever reason. At this time, cmcld will conduct a cluster reformation. The remaining nodes will run any packages that are now deemed to be unowned due to the node failure. Serviceguard can be managed either from the command line or via a GUI interface.
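
For reference, a few of the day-to-day command-line operations look like this (the package and node names are again illustrative):

    cmviewcl -v                   # view cluster, node, and package status
    cmhaltpkg oracle1             # halt a package
    cmrunpkg -n node2 oracle1     # run the package on another node
    cmmodpkg -e oracle1           # re-enable package switching after a manual move
    cmhaltnode -f node1           # halt a node, failing its packages over to an adoptive node
    cmrunnode node1               # rejoin the node to the running cluster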

Let's move on to look at the actual mechanics of constructing a basic High Availability Cluster. First, we will configure and test a simple cluster with no packages; this is known as a "package-less" cluster. We will then add our applications into the cluster to improve their availability to our user communities.


