2.7 TruCluster Server Overview

This might be a good place for a quick comparison of the V1.X TruCluster product and the V5.X TruCluster Server product. As we mentioned in Chapter 1, there has been a steady evolution of the clustering software. In the early days (1994), the cluster was referred to as the Available Server Environment (ASE). Most of the cluster support daemons ran in user mode and used TCP/IP to interface with the Memory Channel once it became available in 1996. Also, each member of the cluster had its own copy of the operating system, which ruled out any real ease of management. In 1996, the product (V1.4) comprised three major components: ASE, Production Server, and Memory Channel. A cluster site could graduate from using ASE for application failover support to the Production Server (PS), which added support for a Connection Manager, a Distributed Lock Manager, and Distributed Raw Disks. (Note that Oracle Parallel Server was an example of a PS V1.X application.)

The PS environment encompassed ASE and Memory Channel. As such, it laid the foundation for a significant leap forward to today's TruCluster Server software product (V5.X). Prior to TruCluster V5.0A, administering a cluster was actually more difficult than managing a single system or a group of systems, because each system required its own installation effort. Today most administrators find the cluster relatively easy to manage.

With its inclusion of the Cluster File System (CFS), the Cluster Application Availability Subsystem (CAA) replacing ASE, and the Cluster Alias Subsystem (CLUA), the TruCluster software provides excellent availability through CAA and component failure detection techniques. Using CFS, applications are able to access file data transparently from any cluster member. The following subsections introduce many of the cluster software components. Subsequent chapters will provide more detail on each component.

2.7.1 Communication between Cluster Members

Remember how difficult it was to synchronize code running on multiple CPUs in an SMP environment? The solution was to design the kernel mode and ISR code such that it acquired the simple lock protecting the requested resource before going forward and accessing it. What happens in the world of clusters when code running on two nodes needs to access a common database of information? The answer includes several levels. First we need to discuss the mechanism for communication between cluster members. Then we'll look at the software that uses the medium. In order to use TruCluster Server version 5.0A or 5.1, the systems must have Memory Channel adapters installed and be connected directly (for a two-member cluster) or through a hub (for more than two members). In TruCluster Server version 5.1A, LAN hardware can alternatively be used as the cluster interconnect.

2.7.2 Memory Channel (MC)

Cluster-based code will be executing on one (or more) CPUs within each member of the cluster (remember that a cluster member can be an SMP-based multiprocessor). We saw earlier that multiple CPUs within one member can remain synchronized through simple locks, but simple locks require access to memory that is equally accessible to all of the CPUs involved. Through the Memory Channel, portions of the memory of one cluster member can be mapped into the address space of another member. Don't jump to conclusions here. Standard Tru64 UNIX simple locks are not used in this instance due to their reliance on inter-processor interrupts and CPU-specific registers (note that the Memory Channel does set aside a page for MC spinlocks). Interestingly, the fact that there is some ability to access data over a common medium (Memory Channel) opens up many possibilities for speedy communications between members. A normal memory access across the Memory Channel should complete within 3 – 5 microseconds.
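
The MC spinlock page mentioned above is internal to the cluster software, but the underlying idea is the familiar test-and-set spinlock built on memory that both parties can read and write. The following sketch is a generic C11 illustration of that idea only; it is not the Memory Channel API, and the lock here lives in ordinary process memory rather than on an MC page.

    #include <stdatomic.h>
    #include <stdio.h>

    /*
     * Conceptual illustration only: a test-and-set spinlock over memory
     * that more than one thread of execution can read and write.  The
     * real MC spinlocks live on a reserved Memory Channel page inside
     * the cluster software; this sketch just shows the technique.
     */
    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    static void spin_lock(spinlock_t *sl)
    {
        /* Spin until the test-and-set observes the flag clear. */
        while (atomic_flag_test_and_set_explicit(&sl->locked,
                                                 memory_order_acquire))
            ;   /* busy-wait */
    }

    static void spin_unlock(spinlock_t *sl)
    {
        atomic_flag_clear_explicit(&sl->locked, memory_order_release);
    }

    int main(void)
    {
        spinlock_t sl = { ATOMIC_FLAG_INIT };
        int shared_counter = 0;

        spin_lock(&sl);
        shared_counter++;              /* critical section */
        spin_unlock(&sl);

        printf("counter = %d\n", shared_counter);
        return 0;
    }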

Figure 2-8 shows two cluster members connected by a cluster interconnect. Currently, the cluster interconnect can be implemented through Memory Channel hardware or Ethernet hardware[4] (new for V5.1A).

Figure 2-8: Cluster Interconnect

2.7.3 Internode Communication Subsystem

No hardware, even something as sophisticated as Memory Channel, can communicate without the help of software. The software responsible for communications across the Memory Channel is referred to as the Internode Communication Subsystem (ICS). Just as a disk has a disk driver, and a terminal has a terminal driver, you can consider the ICS as the software that interacts with the device driver for the MC.

In a very basic sense, the MC is present to provide a speedy communication path between the members of a cluster. Remove the MC from the picture for a minute. Don't we still have a mechanism to communicate between the members? A network interconnect should also suffice as the communication medium.

In version 5.1A of TruCluster Server, Ethernet is also supported as the Cluster Interconnect. This changes the equation slightly since the ICS will also drive the data across the Ethernet. The Ethernet is supported by the Ethernet driver. Don't think about TCP/IP here; that's further up the food chain. Consider the fact that the same Ethernet driver (and device) can send TCP/IP, LAT[5], DECnet[6], and other network protocols across the medium. If that's true, why shouldn't we be able to create a portion of the ICS code such that it handles cluster communication over the Ethernet rather than relying on having a Memory Channel adapter installed?

From this point forward, we will refer to the Cluster Interconnect (CI), rather than specifically mention the Memory Channel. Essentially, the rest of the concepts will be the same whether the MC or the LAN hardware is used for the CI. As you will see, there are many major cluster components that use the ICS (and indirectly the CI). Note that ICS actually has communication channels for every cluster subsystem.

Figure 2-9 shows the ICS above the CI. Note that the cluster interconnect will exist on each member in the cluster and have some kind of medium connecting the members.

Figure 2-9: ICS to CI Subsystem Relationship

See Chapter 18 for more information on the Internode Communication Subsystem.

2.7.4 Connection Manager

In order for the cluster to function as a unit, there must be a component responsible for checking and maintaining the status of the members. The Connection Manager (CNX) is built into the kernel of Tru64 UNIX. It runs on each member in a cluster and provides the software glue to keep the cluster stuck together (or to unstick it as the case may be). The CNX instances on the cluster members establish communication with one another and periodically check to be sure that the communication path is intact. In the event of a communication path error, the ICS notifies the CNX so that it can take appropriate action, such as removing a member from the cluster and notifying other cluster components of the cluster's change in state.

During cluster formation, the heavy lifting is done by the CNXs as they check for a predetermined number of votes from the members in order to establish the cluster. This agreement is referred to as the cluster quorum. The cluster will not form if enough votes to establish quorum are not present. The CNXs and the quorum mechanism also prevent cluster partitioning, where some members come up as a cluster separate from and unaware of the rest of the cluster.
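
Quorum is covered in depth later in the book, but at its heart it is a simple majority test over the expected votes. The sketch below assumes a plain floor(votes/2) + 1 majority threshold purely for illustration; the actual TruCluster vote rules (including the quorum disk) are described in the later quorum discussion.

    #include <stdio.h>

    /*
     * Illustrative majority test only.  Assumed threshold:
     * floor(expected_votes / 2) + 1.  The real TruCluster vote rules
     * (including quorum disk votes) are covered later in the book.
     */
    static int quorum_threshold(int expected_votes)
    {
        return expected_votes / 2 + 1;
    }

    static int have_quorum(int current_votes, int expected_votes)
    {
        return current_votes >= quorum_threshold(expected_votes);
    }

    int main(void)
    {
        /* A three-vote cluster where one voting member is unreachable. */
        printf("threshold = %d, have quorum: %s\n",
               quorum_threshold(3),
               have_quorum(2, 3) ? "yes" : "no");
        return 0;
    }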

The CNX uses the ICS and the CI to communicate with the other members (or potential members) of the cluster. The CNX will also communicate with the Tru64 UNIX Event Manager (EVM) in order to log/report cluster events and communicate with the Cluster Application Availability daemon. The CNX is also responsible for notifying other cluster subsystems when certain events occur. Finally, the CNX must rebuild the Distributed Lock Manager (DLM)[7] and Kernel Group Services (KGS)[8] when members leave the cluster.

Figure 2-10 depicts the CNX as a kernel component interacting with the ICS and the Event Manager.

Figure 2-10: Connection Manager Subsystem Relationships

2.7.5 Cluster Application Availability

Applications can be created to run on the cluster in one of two ways. They can either be cluster-aware or cluster-unaware. Most cluster-unaware applications will need some software intervention if the member on which the application is running goes down. Assuming that it is possible to get the application running on another member in the cluster, the Cluster Application Availability (CAA) component will be responsible for reacting to EVM messages (which may have emanated from CNX or another cluster component such as NIFF – Network Interface Failure Finder) and starting the application on another member. Once the application has restarted, CAA will resume its indirect watch over the cluster-unaware applications. Note that CAA can directly check the status of an application if the application script contains a "check" entry point. See Chapters 23 and 24 for more information on CAA. Take note that cluster-unaware applications come in two flavors: single-instance – the software can run on one cluster member at a time; and multi-instance – the application can run on multiple members at a time. A multi-instance cluster-unaware application could be using system calls (such as fcntl(2)) to synchronize file access, in which case CAA would not be necessary for failover situations since the application code might already be running a synchronized instance on each cluster member. However, a multi-instance cluster-unaware application that uses an interprocess communication (IPC) method such as shared memory and semaphores is a candidate for CAA.
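
As a quick illustration of the fcntl(2) style of coordination just mentioned, the sketch below takes an exclusive whole-file lock before updating a shared file and releases it afterward. The path is hypothetical; the point is that each instance of a multi-instance application can coordinate through the same file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Sketch of fcntl(2)-based coordination for a multi-instance,
     * cluster-unaware application.  The path below is hypothetical.
     */
    int main(void)
    {
        struct flock fl;
        int fd = open("/shared/app.state", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            exit(1);
        }

        fl.l_type   = F_WRLCK;    /* exclusive (write) lock          */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;          /* 0 means "to the end of file"    */

        if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until granted */
            perror("fcntl(F_SETLKW)");
            exit(1);
        }

        /* ... update the shared state here ... */

        fl.l_type = F_UNLCK;                  /* release the lock    */
        (void) fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }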

Figure 2-11 shows the CAA software responsible for marshalling cluster-unaware applications. Note that CAA is implemented as a user mode software mechanism.

Figure 2-11: CAA Component Relationship

2.7.6 Cluster-Aware Applications

A cluster-aware application is one that has been written to make full use of the cluster-based APIs such as the Distributed Lock Manager (DLM) or the MC API. A cluster-aware application can be arranged in several ways: some of its processing can take place on multiple cluster members, it can perform parallel processing across members, or it can duplicate a service on more than one member. But in each of these cases, there will inevitably be the need to coordinate process-based activities. DLM is available for exactly that kind of synchronization.

2.7.7 Distributed Lock Manager

The Distributed Lock Manager (DLM) is used by cluster-aware applications and by some of the cluster components themselves for cluster-wide synchronization. DLM goes far beyond standard file locking capabilities (although it can be used for file locking as well). In essence, DLM represents a resource (which can be just about anything) with a Resource Name. The Resource Name will have queues of locks (i.e., granted, waiting, waiting for conversion) that represent the applications contending for the resource. Each requested lock will contain the Process Identifier (PID) of the requesting process. Note that in V5.X, PIDs are unique cluster-wide (see Chapter 6 for more information). Therefore each lock can be traced back to a process and a cluster member.
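
The following fragment is purely a conceptual model of the structure just described: a named resource with granted, waiting, and conversion queues whose entries carry the requester's cluster-wide PID. It is not the DLM API or its internal layout; the type and field names are made up for illustration.

    #include <sys/types.h>

    /*
     * Conceptual model only -- not the DLM API or its on-member layout.
     * A resource is identified by name and carries queues of lock
     * requests; each request records the cluster-wide PID of its owner.
     */
    typedef enum { LK_NULL, LK_CR, LK_CW, LK_PR, LK_PW, LK_EX } lock_mode_t;

    struct lock_request {
        pid_t                pid;          /* cluster-wide unique PID    */
        lock_mode_t          mode;         /* requested or granted mode  */
        struct lock_request *next;
    };

    struct dlm_resource {
        char                 name[32];     /* the Resource Name          */
        struct lock_request *granted;      /* locks currently held       */
        struct lock_request *waiting;      /* new requests that block    */
        struct lock_request *converting;   /* waiting for a mode change  */
    };

    int main(void)
    {
        struct dlm_resource r = { "my_database_row_17", 0, 0, 0 };
        (void) r;   /* empty queues; lock requests would be linked in here */
        return 0;
    }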

Note that the component is called the DISTRIBUTED Lock Manager. This is because the lock database is distributed across all of the members of the cluster. The DLM database is not fully replicated on all cluster members. Portions of the lock database will live on each member in the cluster. Upon a cluster state transition, one of the items to be resolved is the redistribution of the lock database.

The beauty of the DLM is that it is a cluster-wide mechanism for synchronization. It is one of the few ways that processes can quickly communicate when they exist on separate cluster members. Another way to communicate is to put data into shared files. But file sharing is one of those good news, bad news situations. The good news is that the file can be accessed by several processes at once. The bad news is the same: the file can be accessed by several processes at once. File locking can be used by the applications to coordinate their access to shared files. Furthermore, disk access tends to be much slower than memory access, even when the access is through MC or other ICS options. The lock manager is used by cluster components as well as user processes through the DLM API. Figure 2-12 includes the kernel mode DLM component.

Figure 2-12: DLM Subsystem Relationships

2.7.8 Cluster File System

File system level access to disk storage is supported by the Cluster File System (CFS). CFS makes all file systems mounted on any member in the cluster visible to every other member in the cluster. CFS is implemented using a client-server strategy, where the first member to mount a file system will be the serving member for the rest of the cluster. All file systems created on disks that are found on a non-shared bus will be served by the local member. The Virtual File System (VFS) software (part of the Tru64 UNIX kernel) will interact with the CFS, which will then issue the request to the physical file system. The physical file system will most likely be the Advanced File System (AdvFS) but could be the UNIX File System (UFS), which is currently supported in read-only mode when mounted cluster-wide. Note that in V5.1A, a UFS (and MFS – Memory File System) can be available in read-write mode if mounted using the "-o server_only" option on the mount(8) command. The physical file system will need to interact with another software level that will be responsible for accessing the correct disk.

Figure 2-13 shows the Tru64 UNIX VFS kernel mode code, which feeds the CFS code and ultimately the local disk driver to get access to file system data. Note that a portion of the CFS client activities is still under wraps but will be examined in the next section. Specifically, there must be a component that is responsible for identifying the target disk if it exists on another member in the cluster.

Figure 2-13: CFS Subsystem Relationships

2.7.9 Device Request Dispatcher

The Device Request Dispatcher (DRD) accepts requests for data from the physical file systems and sends each request to the appropriate serving member. Note that the DRD, like CFS, is organized as client-server in some cases. The DRD is at a lower level than the CFS and does not concern itself with file system specifics. It dispatches the requests for data to be retrieved from the devices (thus the name).

The DRD provides cluster-wide access to disk storage regardless of whether the disk storage is accessible locally. The DRD will use the ICS and ultimately the CI to send its requests for data to the appropriate serving member. Note that DRD is not limited to providing access to disks. It also provides access to tape drives, floppy drives, CD-ROM, and DVD-ROM devices.
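
In rough terms, the dispatch decision described here comes down to "serve the request locally if this member sees the device, otherwise ship it over the ICS to the serving member." The sketch below is a hypothetical illustration of that flow; none of the function names belong to the real DRD, and the stubs stand in for the local driver and the ICS path.

    #include <stdio.h>

    /*
     * Hypothetical sketch of the dispatch decision described above.
     * None of these names belong to the real DRD; the stubs stand in
     * for the local disk driver and the ICS path purely to show the
     * "local versus remote serving member" flow.
     */
    struct io_request { int device_id;  /* ... offset, length, buffer ... */ };

    static int device_is_local(int device_id) { return device_id < 100; }

    static int local_driver_submit(struct io_request *r)
    {
        printf("local driver handles device %d\n", r->device_id);
        return 0;
    }

    static int ics_send_to_server(struct io_request *r)
    {
        printf("ship device %d request over the ICS/CI\n", r->device_id);
        return 0;
    }

    static int drd_dispatch(struct io_request *req)
    {
        if (device_is_local(req->device_id))
            return local_driver_submit(req);  /* this member serves it */

        /* Otherwise forward the request to the serving member. */
        return ics_send_to_server(req);
    }

    int main(void)
    {
        struct io_request local = { 7 }, remote = { 200 };
        drd_dispatch(&local);
        drd_dispatch(&remote);
        return 0;
    }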

Be aware that prior to V5.0, DRD stood for Distributed Raw Disk. In V5.0 and later, the Device Request Dispatcher provides access to raw disks and other block devices as well, hence the change in what the acronym stands for.

Figure 2-14 includes the DRD handling requests for device access.

Figure 2-14: DRD Subsystem Relationships

2.7.10 Cluster LSM

The Logical Storage Manager (LSM) provides the ability to create logical storage devices referred to as volumes. The volumes can be arranged in a mirrored fashion, or striped, or concatenated. In any of these cases, LSM presents to the system a pseudo-device that potentially represents multiple physical devices. For this reason, many Tru64 UNIX users consider LSM to be a software RAID option.

LSM is supported in a cluster except for use on some specialized storage such as the quorum disk and the member's boot disk (discussed later in this book). Note that LSM will ultimately need to specify the I/O to be performed to the lower levels of the software. It would be easy to say that LSM (which is primarily implemented as a pseudo-device driver) will ultimately pass its request to the local driver for processing. The statement would be true on a standalone system. However, on a cluster, LSM must ask the DRD to deliver its data requests to the device drivers. Consider the fact that the device may only be physically accessible to another member of the cluster. DRD will deliver the request to the correct target member.

Figure 2-15 positions the Cluster LSM component just above the DRD.

Figure 2-15: CLSM Subsystem Relationships

2.7.11 Cluster Alias

As a closely coupled computing system, a cluster may need to present itself to network applications (such as telnet(1) or ftp(1)) as a single entity. A cluster may in fact consist of up to eight members, and all eight members may have network interfaces. A client system will be able to reference a single IP address that can represent all of the cluster members. The single cluster alias will transparently provide access to the cluster. Note that multiple aliases can be defined representing subsets of the cluster members if desired.

Let's go back to the problem of the cluster presenting itself to the network as a single entity. This is achieved through the use of the Cluster Alias (CLUA) software. The CLUA software allows client systems to connect to the cluster as if it were a single server for NFS, FTP, telnet, or other network services. The CLUA software directs each request to a target member within the cluster and attempts to load balance and distribute the client requests. Later on (in Chapter 16), we will discuss the mechanisms behind CLUA (in addition to the other components, for that matter).
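
One purely illustrative way to picture the "direct and distribute" behavior is a simple round-robin pick over the members participating in an alias. The real CLUA selection and weighting mechanism is the subject of Chapter 16, and the member names below are made up.

    #include <stdio.h>

    /*
     * Purely illustrative round-robin selection over alias members.
     * The member names are made up; the real CLUA selection and
     * weighting mechanism is described in Chapter 16.
     */
    static const char *members[] = { "member1", "member2", "member3" };
    static const int   nmembers  = 3;

    static const char *pick_member(void)
    {
        static int next = 0;                /* rotate through the members */
        const char *m = members[next];
        next = (next + 1) % nmembers;
        return m;
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            printf("request %d -> %s\n", i, pick_member());
        return 0;
    }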

Each cluster member may have multiple network interfaces for redundancy, but even with a single network interface, the cluster member will want to be a 'member' of one or more cluster aliases. Notice that the network interfaces are supported by the Network Interface Failure Finder (NIFF) software, which is responsible for reporting an event involving a network interface to the EVM, which forwards the event to the CAA software, the CLUA software, or other software relying on the network. The routing tables would be altered to reflect the currently usable interfaces. NIFF would ultimately interact with the Redundant Array of Independent Network interfaces (NetRAIN) software, which would attempt to fail over the network activities such that another interface is used.

Figure 2-16 includes the user mode NIFF components, as well as the kernel mode NIFF and NetRAIN components. Note that this picture positions the CLUA software in both user mode space (aliasd) and kernel mode space.

Figure 2-16: CLUA Subsystem Relationships

2.7.12 Cluster Big Picture

Figure 2-17 should give you an appreciation of the complexity of the TruCluster software. Each component is there to solve a particular implementation problem. Subsequent chapters will build a full meal out of the appetizers presented in this chapter. You may want to dog-ear this page for reference as you move through the book as it helps to drop back occasionally and look at the big cluster picture.

Figure 2-17: Cluster Subsystem Components

[4]V5.1A supports 100 Mb full-duplex connections with a switch and 100 Mb half-duplex connections with a hub (supported but not recommended). Gigabit Ethernet is also supported.

[5]Local Area Transport – An aging, non-routable network protocol used in terminal servers.

[6]DECnet – HP's OSI-based network software.

[7]See section 2.7.7.

[8]See Chapter 18 for more information on KGS.



