3.3 The Remaining Details – Cluster Interconnect, Network Connections, and Storage

Once the number of hosts and their size have been chosen, there is a second level of decisions concerning the configuration of the cluster. This includes whether to have a quorum disk, the configuration of the cluster interconnect, external network interfaces, and storage.

3.3.1 Should a Quorum Disk be Configured?

The TruCluster Server Connection Manager (CNX) component is responsible for maintaining a list of which cluster members are actively participating at any moment. This list is made available to the cluster software running on all members, and it accounts for hardware failures that cause members to crash as well as for loss of cluster interconnect communication between members. As part of the CNX design, a quorum disk is a configuration option whose necessity or usefulness depends on the number of members in the cluster. We will discuss it briefly here; you can read more about the quorum disk in Chapter 17.

3.3.1.1 When is a Quorum Disk "Required"?

The short answer is: for two-node clusters. The longer explanation starts with a note about the word "required". The TruCluster Server software never insists that a quorum disk be defined for a cluster of any node count in order for that cluster to boot and become operational. Yet, if we want the remaining member to stay operational in the event of an individual node failure, a quorum disk is a "requirement" for two-node clusters. If a quorum disk is not configured for a two-node cluster and either member fails, or all redundant cluster interconnect rails fail, the cluster will suspend operations.

3.3.1.2 When is a Quorum Disk Beneficial (As Opposed to "Required")?

A quorum disk is also beneficial for clusters with an even number of members, such as four, six, and eight nodes. For a cluster with an even member count, the quorum disk allows one additional member to be unavailable before the cluster suspends operations, compared with a same-size cluster that has no quorum disk. For instance, in a four-node cluster without a quorum disk, if two nodes are down, fail, or become unreachable on the cluster interconnect, the cluster will suspend operations. If a quorum disk is added to the four-node cluster's configuration, it takes three unavailable nodes for the cluster to suspend operations. Note that this benefit does not exist for odd member counts.

See Chapter 17 for more details.

3.3.1.3 Example

The Biotech cluster has six nodes, so a quorum disk will be added to raise the number of unavailable nodes at which the cluster suspends operations from three to four.
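
To make the vote arithmetic concrete, here is a minimal Python sketch of the membership math, assuming one vote per member and one vote for the quorum disk, and using the simplified rule quorum = floor(expected votes / 2) + 1; see Chapter 17 for the authoritative CNX rules.

    def quorum_votes(expected_votes):
        # Simplified model of the CNX rule: quorum = floor(expected/2) + 1.
        return expected_votes // 2 + 1

    def cluster_has_quorum(members_up, members_total, quorum_disk=False):
        # One vote per member plus one for the quorum disk (if configured
        # and still reachable by the surviving members).
        expected = members_total + (1 if quorum_disk else 0)
        current = members_up + (1 if quorum_disk else 0)
        return current >= quorum_votes(expected)

    # The Biotech case: six members, with and without a quorum disk.
    for down in range(1, 5):
        up = 6 - down
        print(down, "members down:",
              "no qdisk ->", cluster_has_quorum(up, 6),
              "| qdisk ->", cluster_has_quorum(up, 6, quorum_disk=True))
    # Without a quorum disk the cluster loses quorum with 3 members down;
    # with a quorum disk it keeps running until 4 members are down.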

3.3.2 What Type of Cluster Interconnect Should Be Used?

TruCluster Server now supports two types of cluster interconnect: Memory Channel and LAN-based. Either type can be configured as a single rail or in a dual-redundant configuration. The short, conservative answer is to use Memory Channel because it has been supported the longest and has the best performance characteristics. There are, of course, overriding reasons to use a LAN-based cluster interconnect instead – cost, distance, and slot restrictions on the member systems. These are discussed in more detail in the following sections.

3.3.2.1 The Role of the Cluster Interconnect

The performance and integrity of the cluster interconnect are critical to the operation of the cluster. The cluster interconnect is used for communication and synchronization between the cluster software running on all members. For instance, in the case of the Connection Manager (CNX) component of TruCluster Server, partial or total failure of the cluster interconnect can result in individual members suspending, or the entire cluster suspending, operations. With this in mind, a successful cluster depends on using a redundant interconnect configuration and selecting the cluster interconnect with the best possible performance.

The cluster interconnect is presented to the system and to users as an IP network, whether it is Memory Channel or LAN-based. A conservative approach to ensuring that the full performance of the cluster interconnect is available to the cluster software is not to use this IP network for high-bandwidth application or administrative tasks between cluster members. If applications on different members require high-performance communication, a conservative solution is to implement an additional network common to all members for that traffic.

3.3.2.2 Memory Channel Interconnect

Memory Channel is the original and longest-supported cluster interconnect for the TruCluster Server software. It is based on PCI Memory Channel cards that reside in the members and, in the standard configuration, are connected in a star fashion to a Memory Channel hub. If only two hosts are being connected, the hub can be omitted by setting jumpers on the Memory Channel boards (this is covered in detail in Chapter 4). In this configuration, the maximum point-to-point distance between two members is twenty meters. Memory Channel has high bandwidth and very low latency compared to traditional network technology, making it attractive for use between tightly connected cluster members. As an alternative configuration that trades performance for distance, two hubs with a Fibre Channel bridge between them can be used instead of one hub to span distances of up to 6000 meters point to point. For high availability, two Memory Channel rails should be configured between member systems. The cluster automatically detects the two interconnects and uses one as active and the other as a passive standby.

3.3.2.3 LAN-Based Cluster Interconnect

Starting with V5.1A, TruCluster Server supports LAN-based cluster interconnects built from industry-standard Fast Ethernet or Gigabit Ethernet hardware. Both technologies offer an industry-standard hardware solution for the cluster interconnect, with varying degrees of performance trade-off compared to Memory Channel: Gigabit Ethernet is nearly comparable to Memory Channel in both bandwidth and latency, while Fast Ethernet is significantly slower in both. Be aware that LAN interconnect support is not a wildcard that lets you use any "network" you might otherwise be able to configure between two systems; restrictions exist concerning distance and the public or private nature of the network. First, the total point-to-point distance for NSPOF LAN interconnect configurations is limited to 1100 meters. Although this is much greater than a standard Memory Channel configuration, a Memory Channel configuration with a Fibre Channel bridge can span even longer distances, up to 6000 meters. Additional restrictions apply to the physical LAN interconnect configuration, such as the number of switches allowed between members (currently three); these are described in the Cluster LAN Interconnect guide that is part of the TruCluster Server documentation set from HP.

Additionally, the LAN interconnect is expected to be a private network dedicated solely to the cluster, with no other systems or applications on it. With a Memory Channel configuration, it is difficult if not impossible to connect non-cluster systems to the interconnect. With a "network," it would be easy to do – and disastrous if the additional traffic made the interconnect unusable by the cluster members.

For availability, the LAN interconnect should be configured as dual-redundant at the hardware level. The two rails must then be configured using the base system's NetRAIN facility so that they are treated as one logical rail. The NetRAIN facility is a standard component of the Tru64 UNIX operating system.
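
The following Python sketch is a conceptual model, not the NetRAIN implementation, of what treating two physical rails as one logical rail buys you: traffic keeps flowing as long as at least one rail in the set is healthy. The interface names are illustrative only.

    class LogicalRail:
        # Conceptual model of a NetRAIN-style logical interface:
        # one physical rail is active, the others are idle standbys.

        def __init__(self, physical_rails):
            self.rails = list(physical_rails)    # e.g., ["ee0", "ee1"]
            self.failed = set()

        @property
        def active(self):
            for rail in self.rails:
                if rail not in self.failed:
                    return rail
            return None                          # all rails down: interconnect lost

        def fail(self, rail):
            self.failed.add(rail)
            return self.active                   # surviving standby takes over, if any

    interconnect = LogicalRail(["ee0", "ee1"])
    print(interconnect.active)         # ee0 carries the traffic
    print(interconnect.fail("ee0"))    # ee1 transparently takes over
    print(interconnect.fail("ee1"))    # None: both rails down, cluster at risk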

3.3.2.4 Comparing and Contrasting Memory Channel and LAN-Based Interconnects

Table 3-1 lists the attributes of Memory Channel and LAN-based cluster interconnects.

Table 3-1: Memory Channel vs. LAN Comparison

Interconnect Type                  Bandwidth  Latency   Max Distance   Relative  Slot               Redundancy       1st Supported
                                                        (NSPOF)        Cost      Consumption        Model            Release
---------------------------------  ---------  --------  -------------  --------  -----------------  ---------------  -------------
Memory Channel                     High       Low       20 meters      High      1 PCI slot/rail    Active/Passive   5.0
Memory Channel with Fibre Bridge   Medium     Medium    6000 meters    High      1 PCI slot/rail    Active/Passive   5.0
Fast Ethernet                      Medium     Med/High  1100 meters    Low       slot/rail          Active/Passive   5.1A
                                                                                 (4-port card)      (NetRAIN)
Gigabit Ethernet                   High       Med/High  1100 meters    Med/High  1 PCI slot/rail    Active/Passive   5.1A
                                                                                                    (NetRAIN)

Max Distance (NSPOF) is the maximum point-to-point distance supported in a no-single-point-of-failure configuration.

Note the following:

  • In performance, Memory Channel has bandwidth comparable to Gigabit Ethernet but a much better latency characteristic.

  • In the area of maximum distance, LAN-based solutions are better unless you use the Fibre Channel bridge configuration for Memory Channel.

  • In relative cost, the highest option is Memory Channel. In absolute terms, however, Memory Channel cards and a hub are not excessively priced vis-à-vis Gigabit Ethernet hardware.

  • In NSPOF configurations, Memory Channel and LAN-based cluster interconnects are comparable: both support multiple rails in an active/passive mode, with the passive rail taking over when the active rail fails. Memory Channel configurations support a maximum of two physical rails. The LAN interconnect supports as many rails as the NetRAIN subsystem allows (two is typically adequate).

  • In the area of release history, although both types of interconnect are fully supported by HP for TruCluster Server, Memory Channel hardware and the TruCluster Server software that drives it have been around longer and have significantly more test and deployment hours behind them.

  • Memory Channel detects communication failures much more quickly because it does not have to wait for IP timeouts to occur.

The conservative approach and recommended rule of thumb is to use Memory Channel unless you have a compelling reason to do otherwise – distance, cost, or member system slot restrictions.
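
That rule of thumb can be summarized in a few lines of Python. The thresholds come from Table 3-1 and the exceptions listed above; this is a planning aid only, not a configuration rule enforced by the software.

    def pick_interconnect(max_distance_m, budget_constrained=False,
                          slots_constrained=False):
        # Conservative default: Memory Channel, unless distance, cost,
        # or slot availability forces another choice.
        if max_distance_m > 6000:
            return "beyond supported point-to-point distances; rethink the layout"
        if max_distance_m > 1100:
            # Only Memory Channel with a Fibre Channel bridge reaches this far.
            return "Memory Channel with Fibre Channel bridge (dual rail)"
        if budget_constrained or slots_constrained:
            return "LAN interconnect: Gigabit Ethernet with NetRAIN (dual rail)"
        if max_distance_m <= 20:
            return "Memory Channel (dual rail)"
        return "Memory Channel with Fibre Channel bridge (dual rail)"

    print(pick_interconnect(max_distance_m=5))
    # -> Memory Channel (dual rail), the choice made for the Biotech cluster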

3.3.2.5 Example

The Biotech research department started with the assumption that they would use Memory Channel. Looking at cost, they do not find the cost of the Memory Channel components disproportionate to the cost of LAN-based components or to the total cost of the cluster. In addition, the six ES45 nodes of the cluster will be located in a pair of racks in a common equipment room, with no more than five meters point-to-point between the most distant systems.

3.3.3 Should the Cluster Interconnect Be Dual-Redundant?

Given the previous discussion of the critical role of the cluster interconnect in the operation of the cluster, the answer is "Yes." Although the cluster software does not enforce redundant interconnect hardware as a configuration rule, redundancy is the reasonable choice to protect the stability and availability of the cluster as a whole.

3.3.3.1 Example

For the Biotech cluster, a dual-redundant Memory Channel cluster interconnect configuration will be used.

3.3.4 Should Network Interfaces be Highly Available?

If clients and users cannot access the cluster over network interfaces, it is as good as down. So yes, configure network interfaces for high availability.

How is it done? TruCluster Server solutions have mechanisms at two levels to ensure that network connectivity stays up in the face of failure. At the first level, the TruCluster Server software's Cluster Alias capability can be used: the cluster alias can be configured so that if a member's network interface fails, another member with a working interface acts as a proxy, accepting network traffic and forwarding it to the original member over the cluster interconnect. At the Tru64 UNIX level, the NetRAIN and LAG[5] interfaces provide a means to create a single logical IP interface from multiple physical network cards that remains available when one of the physical interface cards fails. You can read about NetRAIN and LAG in more detail in Chapters 9 and 12.

3.3.5 The Default Remedy for a Failed Network Interface on a Member – Using a CLUA Common-Subnet Alias

The Cluster Alias (CLUA) subsystem is a component of the TruCluster Server software that allows virtual network addresses to be created and used so that clients can connect to services in the cluster, independent of the members' native IP addresses and of failures of those interfaces or members. The Cluster Alias subsystem is described in detail in Chapter 16. Here we will briefly describe configuration options and alternatives when designing a cluster solution.

Using the Cluster Alias with the default, common-subnet alias type requires no special hardware beyond each member system having a network interface on the public network, which would be the norm anyway. Under this scheme, if the network interface of the member currently receiving traffic from the outside world fails, the Cluster Alias subsystem automatically chooses another member to receive incoming traffic on its interface on the same subnet and to forward any traffic destined for the original host over the cluster interconnect. This scheme will meet most requirements by offering the following:

  • Even when member systems are configured with only a single NIC on the public network, network traffic will still be delivered to a member after its single physical NIC on that network fails.
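
The following Python sketch is a purely conceptual model of this behavior (the real CLUA subsystem is described in Chapter 16): when a member's only public NIC fails, another member with a working NIC on the same subnet accepts the traffic and forwards it over the cluster interconnect. Member names and state are illustrative, not a real configuration.

    members = {
        "mem1": {"public_nic_ok": True},
        "mem2": {"public_nic_ok": True},
        "mem3": {"public_nic_ok": True},
    }

    def route_incoming(dest_member):
        # Decide how traffic addressed to the alias reaches dest_member.
        if members[dest_member]["public_nic_ok"]:
            return f"delivered directly on {dest_member}'s public NIC"
        # Any member with a working public NIC can act as the proxy.
        for name, state in members.items():
            if state["public_nic_ok"]:
                return (f"accepted on {name}'s public NIC, then forwarded to "
                        f"{dest_member} over the cluster interconnect")
        return "no member has a working public NIC; clients cannot reach the cluster"

    members["mem2"]["public_nic_ok"] = False     # mem2 loses its only public NIC
    print(route_incoming("mem2"))
    # -> accepted on mem1's public NIC, then forwarded to mem2 over the
    #    cluster interconnect (note the extra forwarding hop)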

In rare cases, routing incoming traffic over the cluster interconnect could cause the following issues:

  • For some applications, the additional step of being rerouted over the cluster interconnect might result in an unacceptable performance impact.

  • The additional traffic on the cluster interconnect could, in extreme cases, impact the ability of the cluster to use the cluster interconnect for other purposes, such as the Connection Manager subsystem and forwarding I/O for devices and file systems.

If you encounter a situation where you believe using the Alias is causing excessive traffic on the cluster interconnect, you can use software configuration options and additional hardware to minimize the effect. These methods are described next.

3.3.6 Additional Options That Eliminate Network Traffic Being Re-Routed Over the Cluster Interconnect

As described in the previous section, using the default cluster alias with single network interfaces can result in undesirable network traffic on the cluster interconnect. If this is an issue for the cluster you are designing, you have four options.

3.3.6.1 Use CAA to Relocate an Application When a Network Interface Fails

Utilize the Cluster Application Availability (CAA) framework to create a dependency between an application and a network interface, triggering failover of the application to a member with a working external network interface when the network interface on the member where it is running fails. This option has the following pros and cons:

  • Does not require additional hardware in terms of physical NICs

  • Does result in downtime for the application as it is relocated to another member.

3.3.6.2 Use a CLUA Virtual Subnet Alias

Create your alias as a CLUA "virtual subnet alias" and configure two external network interfaces on each member, connected to two different subnets that are joined by a common router. In this scheme, if one network interface fails, the router recognizes the member's other network interface as a valid path and forwards network traffic to the same host over that alternative path. This option has the following pros and cons (a conceptual sketch follows the list):

  • Eliminates the need to re-route network traffic over the cluster interconnect when a single physical network interface fails

  • Is transparent to the application and does not require it to restart

  • Survives not only the failure of an individual NIC but also the failure of an entire subnet between the members and a router

  • Requires an additional physical network interface on each member

  • Requires all members to be configured on two subnets connected by a common router
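
A minimal Python sketch of this scheme, assuming two illustrative subnets and member names that are not part of any real configuration: the router simply uses whichever of the member's interfaces still works, so nothing crosses the cluster interconnect and the application never restarts.

    # Each member has one NIC on subnet A and one on subnet B; a common
    # router knows both paths to every member.
    member_nics = {
        "mem1": {"subnetA": "up", "subnetB": "up"},
        "mem2": {"subnetA": "up", "subnetB": "up"},
    }

    def router_path(member):
        # The router forwards alias traffic to any working interface of the member.
        for subnet, state in member_nics[member].items():
            if state == "up":
                return f"via {subnet} to {member}"
        return f"no path to {member}; traffic must be handled by another member"

    member_nics["mem1"]["subnetA"] = "down"   # mem1 loses its NIC on subnet A
    print(router_path("mem1"))                # -> via subnetB to mem1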

3.3.6.3 Utilize Tru64 UNIX NetRAIN Feature

Utilize Tru64 UNIX's NetRAIN configuration option for network interfaces. NetRAIN is independent of the cluster and allows a logical network interface to be created from multiple physical network interfaces on the same host. If one interface fails, the kernel transparently fails over to another physical network interface in the set without disruption to the application. This option has the following pros and cons:

  • Eliminates the need to re-route network traffic over the cluster interconnect when a single physical network interface fails

  • Is transparent to the application and does not require it to restart

  • Requires multiple physical network interfaces

3.3.6.4 Utilize Tru64 UNIX LAG Feature

Tru64 UNIX's LAG feature, like NetRAIN, allows multiple physical network interfaces to be combined into a single logical network interface with high-availability characteristics. The advantage of LAG is that all physical network interfaces are used in parallel in an active/active scheme, increasing the aggregate bandwidth of the logical interface. Although LAG can survive the failure of a physical network interface in the set, this performance gain comes with an availability downside: all of the physical network interfaces must be connected to a common network switch. This option has the following pros and cons:

  • Eliminates the need to re-route network traffic over the cluster interconnect when a single physical network interface fails

  • Is transparent to the application and does not require it to restart

  • Offers performance scaling

  • Requires multiple physical network interfaces

  • The switch that all NICs have to be connected to becomes a single point of failure

As stated earlier, single network interfaces and the default common-subnet alias will be sufficient for most clusters and applications. Only if excessive network traffic on the cluster interconnect is known to be unacceptable should further measures be considered.
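
To make the NetRAIN versus LAG trade-off concrete, here is a small Python model of the two schemes. The numbers are illustrative only; this is not a Tru64 UNIX API.

    def netrain_model(nic_bandwidths, failed=()):
        # NetRAIN-style active/passive: one surviving NIC carries all traffic.
        survivors = [bw for i, bw in enumerate(nic_bandwidths) if i not in failed]
        return max(survivors) if survivors else 0

    def lag_model(nic_bandwidths, failed=(), switch_ok=True):
        # LAG-style active/active: surviving NICs aggregate their bandwidth,
        # but every NIC hangs off one switch, so the switch is a SPOF.
        if not switch_ok:
            return 0
        return sum(bw for i, bw in enumerate(nic_bandwidths) if i not in failed)

    nics = [1000, 1000]                       # two Gigabit NICs (Mb/s)
    print(netrain_model(nics))                # 1000: only the active NIC is used
    print(lag_model(nics))                    # 2000: both NICs used in parallel
    print(netrain_model(nics, failed={0}))    # 1000: the standby takes over
    print(lag_model(nics, failed={0}))        # 1000: degraded but still up
    print(lag_model(nics, switch_ok=False))   # 0: the common switch failed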

3.3.6.5 Example

For the Biotech cluster, two networks are required. One network will be used only by cluster members, for communication between the user "login" system and the compute farm. An additional network will connect all members to the outside world. The primary traffic will come into the login system and the web server.

For both networks, the site will start with single NICs and use the Cluster Alias subsystem to reroute traffic in the event of a failure. If the private network cannot provide the required performance, the site is prepared to invest in additional NICs and convert the interface to a LAG.

3.3.7 How Should Storage Be Configured?

The cluster and applications that run in it need dependable, reliable, and redundant access to storage devices. The minimum disk configuration for a cluster is the following:

  • A disk with three partitions for cluster root (/), /usr, and /var file systems

  • A disk for each member of the cluster to contain that member's boot_partition, swap, and CNX data areas

  • An additional disk to be used as a quorum disk for cluster configurations with an even number of members

Beyond these minimum requirements, a cluster can spread its cluster root (/), /usr, and /var over multiple disks, and will certainly have additional disks for application and user file systems.
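
A minimal sketch, following the checklist above, that enumerates the minimum disk set for a planned cluster; the descriptions are placeholders, not real device names.

    def minimum_disks(member_count):
        # List the minimum disks for a TruCluster Server configuration.
        disks = ["cluster common disk: / (cluster root), /usr, /var partitions"]
        for m in range(1, member_count + 1):
            disks.append(f"member {m} boot disk: boot_partition, swap, CNX area")
        if member_count % 2 == 0:
            disks.append("quorum disk (recommended for even member counts)")
        return disks

    for d in minimum_disks(6):       # the six-node Biotech cluster
        print(d)
    # -> 1 shared disk + 6 member boot disks + 1 quorum disk = 8 disks minimum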

The availability features at your disposal in TruCluster Server for making these storage containers highly available sit at different levels of the I/O hierarchy:

  • Using LSM and/or Hardware RAID to create highly available volumes

  • Using multiple paths from different members, relying on the TruCluster Server software's Device Request Dispatcher (DRD) to ensure that if all paths to storage from one member fail, another member can access the device on its behalf and transfer the required I/O across the cluster interconnect

  • Using multipathing between HBAs on the same host (intrinsic in Tru64 UNIX)

  • Implementing redundant fabrics connecting HBAs to storage endpoints

As a general rule, try to make the storage architecture for the cluster as symmetrical as possible for ease of implementation, management, and troubleshooting.

General default rules of thumb:

  • Have at least two HBAs in each host.

  • Have at least two fabrics between the hosts and their storage.

  • Assume the use of hardware RAID arrays unless it is a small-storage cluster and cost is an issue. Evaluate the use of LSM for ease of management or additional features on a case-by-case basis.

3.3.8 Use Hardware RAID and/or LSM?

In today's computing environments, for standalone systems as well as clusters, two methods are available to implement highly available RAID configurations for disk devices – software-based RAID and hardware-based RAID. In Tru64 UNIX and the TruCluster Server software, these are available as the Logical Storage Manager (LSM) and the StorageWorks family of RAID arrays, respectively. The standard question is, "Which of these two solutions should be used?"

3.3.8.1 Enterprise Servers with 100 GB+ of Storage

The original question of software versus hardware RAID is slightly off target for almost all enterprise solutions using Tru64 UNIX. The reason is that in these cases the site will already be investing in StorageWorks devices, which come with RAID capability, simply to physically attach and reasonably manage the total amount of physical storage connected to the cluster. In other words, hardware RAID is almost always available, so the real question becomes: "I've already paid for hardware RAID capability in the storage hardware. Do I also use software RAID with LSM?"

The answer is usually "No," for cost and simplicity, unless the following factors cause the site to choose LSM:

  • The site wants to gain additional performance by creating stripe sets that span hardware RAID sets hosted by different RAID arrays. This capability is rarely used, as it only makes sense for sites with very large amounts of storage and very high bandwidth requirements.

  • The site is ultra-paranoid and wants to mirror storage with LSM across hardware cabinets.

  • The site wants to make use of additional features in LSM other than RAID for their own merits. These include flexible volume management and the ability to create a volume of any arbitrary size independent of physical disk or hardware RAID properties.

3.3.8.2 Small Scale Cost-Conscious Configurations

You may ask, "What if the solution I'm putting together does not have 100+ GB of storage and could be built with simple storage devices that do not have RAID capability built in?" In this case, we get into the classic question of hardware RAID versus software RAID, and the answer will generally come down to cost.

  • Can a simpler hardware solution with LSM be less expensive than a more complex hardware solution without LSM and still provide the total bytes of storage to the cluster?

Another, lesser consideration is that LSM cannot be used with certain file systems (and partitions) used by the cluster. The quorum disk, if configured, and the member boot disks cannot be placed under LSM control (swap can be an LSM volume, but the entire boot disk – boot_partition, CNX partition, and swap – cannot, which is the point here). How much of a disadvantage is this?

  • If a member boot disk fails, that member will crash. If quorum is configured properly and all other hosts are available, the cluster will continue to operate. Any applications running on the failed node will need to be restarted on a surviving node.

  • If the quorum disk fails on a cluster that has not lost any members, the cluster will continue to operate, so there is no immediate impact. The failure only affects the cluster if it goes undetected and unrepaired and a member host then fails or shuts down, causing the cluster to lose quorum.

    Online replacement procedures for both member boot disks and the quorum disk are covered in Chapter 22.

Taking these two factors into account, the major non-cost availability difference between a cluster configured with software RAID and one configured with hardware RAID is that a member boot disk failure in the software RAID configuration results in an application restart.

3.3.9 Independent Paths from Each Member to All Storage Devices

The next level of storage availability in a TruCluster Server cluster is that if a member does not have a working path to a storage device but another member does, the member with the working path will "serve" the I/O to the member without a path. This feature is implemented in the Device Request Dispatcher (DRD) subsystem of the TruCluster Server software and is described in more detail in Chapter 15. As long as one member has a working path to a storage device, all members can continue to access it.

This encourages us to configure paths from more than one member to all storage containers the cluster uses. Before we think too deeply about which members should have connections to which storage, the issue becomes moot with the typical use of Fibre Channel today. The overwhelmingly common configuration for Tru64 UNIX and the TruCluster Server software is to connect all members to a Fibre Channel fabric that is connected to all storage devices. This symmetry is the simplest, most cost-effective approach and exploits the ability of Fibre Channel to address a large number of devices and host bus adapters through a single fabric.

So, assuming that all members are connected by a common fabric to all storage devices, is utilizing DRD the final answer? Not quite.

  • DRD alone cannot protect us if there is a single fabric between all members and that fabric, or the storage, fails.

    • The solution here is to implement multiple redundant fabrics between the hosts and the storage. This implies at least two HBAs in each host connected to two different fabrics, and the storage arrays connected through different ports to the two different fabrics.

  • DRD's rerouting of I/O over the cluster interconnect might, in an extreme case, impact the performance of the application or the stability of the cluster as a whole. To alleviate this concern:

    • Utilize CAA to detect failure of storage paths on the host an application is running on and re-start the application on a host that does have working connectivity, or

    • Utilize Tru64 UNIX's multipathing capability and configure multiple redundant host bus adapters from the same member to the storage. Note that availability is not the only motivation: a single member will also utilize all of its HBAs in an active/active scheme that scales performance with the number of paths (a conceptual sketch of the resulting path-selection order follows this list).
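
The following Python sketch is a conceptual model, not the DRD implementation, of that path-selection order: local HBA paths first (Tru64 UNIX multipathing), then service by another member over the cluster interconnect (DRD), and only when no member has a path does the device become unavailable. The member, HBA, and fabric names are illustrative only.

    # paths[member] = list of working HBA paths from that member to the device.
    paths = {
        "mem1": ["hba0->fabricA", "hba1->fabricB"],
        "mem2": ["hba0->fabricA", "hba1->fabricB"],
    }

    def access_device(requesting_member):
        local = paths.get(requesting_member, [])
        if local:
            # Local multipathing: use one of the member's own HBAs.
            return f"direct I/O via {local[0]}"
        # DRD-style serving: another member with a working path does the I/O
        # and ships the data over the cluster interconnect.
        for member, remote in paths.items():
            if member != requesting_member and remote:
                return f"served by {member} over the cluster interconnect"
        return "device unavailable to the cluster"

    paths["mem1"] = []              # mem1 loses both of its HBA/fabric paths
    print(access_device("mem1"))    # -> served by mem2 over the cluster interconnect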

3.3.9.1 Example

The Biotech company's six-node ES45 cluster will have two HBAs in each system for availability and to meet performance requirements. Two fabrics using two switches will be implemented to connect each member to each HSG RAID array through two independent paths. LSM will also be used as a convenience for database volumes.

[5] Link Aggregation (or trunking)



