Cluster File Systems


System clusters are high-availability designs where two or more systems provide mission-critical application support. If a system in the cluster stops working, another system assumes the work without having to stop and restart any applications.

Systems in clusters often have the exact same configurations, including the same type of processors, memory, network adapters, and host bus adapters (HBAs). They also usually have the same software environment, including the operating system level, file system, and application versions.

File systems for clustered systems, referred to as cluster file systems (CFSs), have special requirements. Traditional file systems typically assume that only one instance of the file system is running on a single machine. Detailed information about the data being accessed and its storage status (in system memory, pending completion to disk media, or on disk) is contained within a single system. In contrast, cluster file systems have to assume that multiple systems are accessing data, each one having a detailed storage status about the data it is working on. If any of them fail, the detailed storage status has to be retrieved by another system so it can finish the task and continue operating. This may involve rolling back interrupted transactions so they can be executed again by another system.

Clustered file systems are designed for distributed operations. File system internal operations such as cache control, locking, and updates to the layout reference system become much more challenging when multiple systems are concurrently working on the same storage address space. For instance, if two systems attempt to access the same data at the same time, how are locks granted and released? If two systems simultaneously access the same data, how do you know where the most recent version of the data is? Is it on disk, or is it in one of the system's memory caches?

Things that seemingly would not interfere with data access must also be considered. For instance, installation and license information written in the superblocks of a clustered file system must be usable by all the systems in the cluster. Likewise, private areas written by volume managers also need to work correctly in clusters.

Basic Cluster Designs

Three basic cluster designs determine cluster file system operations:

  • Two-node active/passive clusters

  • Multinode active/passive clusters

  • Active/active clusters

Two-Node Active/Passive Clusters

The simplest clustering design is a two-node active/passive cluster where one of the systems is actively processing the application while the other system is operating in standby mode. By isolating the processing and data access to a single system, most of the complexities of cache control, locking, and metadata updates can be avoided. This simple active/passive cluster design is shown in Figure 16-1.

Figure 16-1. A Two-Node Cluster with Active and Standby Systems


In the design shown in Figure 16-1, the systems are connected by a heartbeat connection. Heartbeats are single-frame, low-latency messages exchanged by systems in a cluster that are used to determine if other systems in the cluster are functioning normally. In the two-node active/passive cluster, if the standby system determines that the active system has failed, it initiates a failover process so it can take over processing operations.
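
As a rough illustration of the standby side of this process, the following Python sketch monitors heartbeat datagrams and starts a takeover after several consecutive misses. The port number, timing values, and take_over() routine are illustrative assumptions, not details of any particular clustering product.

```python
# Minimal heartbeat monitor for the standby node of a two-node
# active/passive cluster. Addresses, ports, and thresholds are
# illustrative assumptions, not values from any real product.
import socket

HEARTBEAT_PORT = 7777          # assumed port for heartbeat frames
HEARTBEAT_INTERVAL = 1.0       # active node sends one frame per second
MISSED_LIMIT = 3               # declare failure after 3 missed heartbeats

def take_over():
    """Placeholder for the failover process: mount the shared storage,
    replay the journal, and start the application."""
    print("Active node declared failed; starting failover")

def monitor():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", HEARTBEAT_PORT))
    sock.settimeout(HEARTBEAT_INTERVAL)
    missed = 0
    while missed < MISSED_LIMIT:
        try:
            sock.recvfrom(64)          # any datagram counts as a heartbeat
            missed = 0
        except socket.timeout:
            missed += 1
    take_over()

if __name__ == "__main__":
    monitor()
```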

NOTE

Some would argue that the two-node active/passive cluster is just a simple example of the general active/passive clustering approach. I agree, but I believe it's worth singling out because it operates more like a single system than a cluster. The two-node active/passive cluster provides system-level redundancy, but its operating environment is contained in a single system as opposed to being distributed, as it is with other cluster architectures.


Multinode Active/Passive Clusters

An expansion of the two-node active/passive cluster is a multinode (more than two) n+1 cluster, where one or more systems are running in standby mode, waiting to take over for other systems that fail. Cache management and locking are necessarily more complicated, because the active systems must have a way of coordinating these functions.

Failover operations involve multiple systems agreeing that another system has failed. The failover operation is simplified somewhat by the fact that the standby system can assume the workload of the failed system without having any impact on other application processes it is running.
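
A common way for the surviving systems to reach that agreement is a simple majority vote on missed heartbeats. The sketch below illustrates the idea; the vote-collection mechanism, node names, and threshold are illustrative assumptions rather than any product's protocol.

```python
# Hypothetical majority-vote failure detector for an n+1 cluster.
# Each surviving node reports whether it has stopped seeing heartbeats
# from the suspect node; the standby takes over only on a majority.

def suspect_is_failed(votes):
    """votes: dict mapping node name -> True if that node has missed
    heartbeats from the suspect node, False otherwise."""
    missing = sum(1 for v in votes.values() if v)
    return missing > len(votes) / 2

# Example: three surviving nodes vote on whether node "db2" is down.
votes = {"db1": True, "db3": True, "db4": False}
if suspect_is_failed(votes):
    print("Majority agrees: standby node assumes db2's workload")
```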

The multinode active/passive cluster design is shown in Figure 16-2.

Figure 16-2. An n+1 Active/Passive Cluster


Active/Active Clusters

An active/active cluster has two or more systems sharing all responsibilities, including failover operations, without a standby system. All nodes in the cluster determine if another system has failed and distribute the work of the failed system to other nodes. In general, it is the most intricate and scalable cluster design.

An active/active cluster design is shown in Figure 16-3.

Figure 16-3. An Active/Active Cluster


NOTE

Database clustering can be done with rip-roaring success using active/active designs. Some of the fastest database systems on the planet are implemented this way. The Transaction Processing Performance Council (TPC) is an organization that produces benchmark results for database configurations, and active/active clusters are often at the top of its lists. The TPC's website has a lot of interesting information about the configurations used, including storage configurations. Interested readers might check them out at http://www.tpc.org.


Two File System Approaches for Clusters: Share Everything or Share Nothing

In addition to the three basic cluster designs, two basic cluster file system designs provide access to data:

  • Shared everything

  • Shared nothing

These approaches differ significantly in how the layout reference system and locking are handled.

Shared Everything Cluster

Shared everything cluster file systems are designed to give equal access to all storage. Each system in the cluster mounts all storage resources and accesses data as requested by applications. The relationships between servers and storage in a shared everything cluster are depicted in Figure 16-4.

Figure 16-4. Relationships of Systems and Storage in a Shared Everything Cluster


The shared everything approach depends on having a single layout reference system used by all cluster systems to locate file data within the storage address space. As updates occur, layout references are also updated and made available to all cluster systems.

Locking is an intricate process in shared everything clusters. The locking mechanism needs to be able to resolve "ties" where two systems access data simultaneously. It's not as easy as it may seem because there has to be a way to recognize that concurrent access is being made.
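
One way to break such ties is to impose a deterministic ordering on lock requests, for example by request timestamp with the requester's node ID as a tiebreaker. The following sketch illustrates that idea only; the LockRequest format and grant_order() function are hypothetical, not part of any real distributed lock manager.

```python
# Simplified tie-breaking for concurrent lock requests in a shared
# everything cluster. Each request carries a timestamp and the node ID
# of the requester; the oldest request wins, and equal timestamps are
# broken by the lower node ID. Purely illustrative.
from dataclasses import dataclass

@dataclass
class LockRequest:
    block_range: tuple   # (start_block, end_block) being locked
    timestamp: float     # when the request was issued
    node_id: int         # unique ID of the requesting system

def grant_order(requests):
    """Return the requests in the order they should be granted."""
    return sorted(requests, key=lambda r: (r.timestamp, r.node_id))

# Two systems ask for the same blocks at exactly the same time.
a = LockRequest((100, 107), timestamp=42.000, node_id=2)
b = LockRequest((100, 107), timestamp=42.000, node_id=1)
print([r.node_id for r in grant_order([a, b])])   # node 1 is granted first
```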

Shared Nothing Cluster

Shared nothing cluster file systems combine a traditional file system architecture with a peer-to-peer network communications architecture. Systems in shared nothing cluster file systems have their own "semiprivate" storage, which they access directly like a traditional file system. However, data that is stored on a peer system's storage is accessed using a high-speed peer-to-peer communications facility. A requesting system sends a message to its peer, which accesses the data and transmits it to the requesting system. In clustering terminology, this is sometimes called cross-shipping data. Using the shared nothing approach, cluster systems mount only their semiprivate storage, as opposed to mounting all storage like shared everything cluster systems do.

While the storage mountings may be different between shared everything and shared nothing clusters, the physical connections might be exactly the same. Failover processes require the system that assumes the processing load to be able to mount the failed system's storage. This mounting can't be done without a physical connection to the failed system's storage.

Each system has its own layout reference for semiprivate storage but does not have detailed layout references for data residing on peer systems' storage. Nonetheless, all cluster systems must be able to locate all data, which requires a modified layout reference that can identify the server where the data is stored. A two-system design is easiest: if the data is not in the local layout reference system, it is in the other one.
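
The routing decision itself is simple, as the hedged sketch below shows: consult an ownership map, read locally if the data is in semiprivate storage, and otherwise request it from the peer. The OWNER_MAP, read_local(), and cross_ship() names are illustrative placeholders for the cluster file system's real mechanisms.

```python
# Illustrative read path for a shared nothing cluster. The ownership
# map, local read, and cross-ship request are placeholders standing in
# for the cluster file system's real mechanisms.

LOCAL_NODE = "node1"
OWNER_MAP = {                 # file -> system whose semiprivate storage holds it
    "/data/orders.db": "node1",
    "/data/mail/inbox": "node2",
}

def read_local(path):
    with open(path, "rb") as f:          # ordinary local file system access
        return f.read()

def cross_ship(owner, path):
    # In a real cluster this would be a message over the high-speed
    # peer-to-peer link asking 'owner' to read the data and send it back.
    raise NotImplementedError(f"request {path} from {owner}")

def read(path):
    owner = OWNER_MAP[path]
    if owner == LOCAL_NODE:
        return read_local(path)          # semiprivate storage, direct access
    return cross_ship(owner, path)       # data lives on a peer's storage
```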

Locking is much easier to understand in shared nothing clusters than in shared everything clusters because a single system can perform locking for data stored in its semiprivate storage.

Failover operations require another system in the cluster to mount the failed system's storage and ensure it is ready for use, including ensuring the consistency of data. Considering that a system has failed, there is a reasonable chance that data consistency errors may have occurred. Therefore, journaled file systems are required to quickly identify incomplete I/O operations and minimize the time it takes to resume cluster operations.

The obvious downside of the shared nothing approach is the overhead needed for peer-to-peer communications to cross-ship data. The latency added by cross-shipping data can be minimized through the use of high-speed networking technologies and I/O processes. Even if the intercluster link is not optimized for performance, many low-I/O-rate applications like e-mail can be implemented successfully on shared nothing designs.

Any type of network can be used for cross-shipping data, as long as the performance meets the requirements of the application. For example, a storage area network (SAN) or Gigabit Ethernet local area network (LAN) would likely meet the requirements of many applications.

The relationships between servers and storage for a shared nothing cluster are shown in Figure 16-5.

Figure 16-5. Relationships of Systems and Storage in a Shared Nothing Cluster


Structural Considerations for Cluster File Systems

Cluster file systems have the same type of structural elements as traditional file systems, but the cluster versions are often altered to fit the requirements of multisystem clusters. Some of the more interesting modifications are discussed in the following sections.

Layout Reference System

All file systems map files and data objects to block address locations through their layout reference system. Clusters add a level of complexity by requiring all changes to the layout reference system to be immediately recognized by all systems in the cluster. If the layout references for a data file are inconsistent among systems in the cluster, the data accessed by an application could also be inconsistent.

Buffer Memory

Traditional file systems place data in buffer (cache) memory. In addition to application data, systems also place system information such as layout reference data in system buffer memory to accelerate storage performance. While this practice works well for file systems in single systems, it creates data consistency quandaries for cluster file system designers.

Data updates written to one system's buffers may be requested shortly thereafter by a second system. If the data is not flushed to disk from the first system before the second system reads it, the second system reads the wrong data. Buffering data in clusters requires a mechanism to handle this scenario.

Conversely, a system that has data in buffers may be unaware that another system has updated some of that data "behind its back." In this case, the cluster file system needs a way to invalidate the obsolete data in system buffers.

Some cluster designs disable system buffering to avoid the overhead of constantly "snooping" buffer memory. Other designs implement mirrored cache, or mirrored memory, where systems copy the contents of buffer memory to other systems in the cluster using a high-speed link. For practical reasons, these cluster designs are often limited to two systems.
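
One common approach, sketched below in miniature, is a write-invalidate protocol: the system that writes a block tells its peers to discard any cached copy before their next read. The BufferCache class and its write-through behavior are simplifying assumptions, not a description of any specific cluster file system.

```python
# Miniature write-invalidate protocol for cluster buffer caches.
# The peer list stands in for the real intracluster messaging layer.

class BufferCache:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers       # other BufferCache instances in the cluster
        self.blocks = {}         # block number -> cached data

    def read(self, block, disk):
        if block not in self.blocks:             # miss: fetch from disk
            self.blocks[block] = disk[block]
        return self.blocks[block]

    def write(self, block, data, disk):
        disk[block] = data                       # write through to disk
        self.blocks[block] = data
        for peer in self.peers:                  # invalidate stale copies
            peer.blocks.pop(block, None)

disk = {7: b"old"}
a, b = BufferCache("a", []), BufferCache("b", [])
a.peers, b.peers = [b], [a]
b.read(7, disk)                  # b now caches the old contents
a.write(7, b"new", disk)         # a's write invalidates b's copy
print(b.read(7, disk))           # b re-reads from disk and sees b"new"
```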

Locking

Locking is used to reserve the right of certain applications or users to perform certain I/O operations on data objects. In a cluster, a single application can have modules running on different systems. Therefore, shared everything clusters require a locking mechanism that all systems use. This locking mechanism could be implemented in a distributed fashion on all systems. It also could be done by running a lock process on one of the systems (with a backup system ready to assume the job). Locking is considerably easier with shared nothing clusters, where only a single system manages the locks for any particular data object.

Journaling

The journaling function in a cluster file system records the sequence of all I/O operations for all systems in the cluster. If something occurs that stops the entire cluster, the journal is used to reestablish its last valid data state. Similar to locking, journaling is easier to understand with shared nothing designs than it is with shared everything designs. With shared nothing designs, journaling can be done by a single system to its semiprivate storage. With shared everything designs, there could be many possible ways to implement journaling. A single journal (with backup) could be used, or each system could maintain its own journal record that could be reconciled with other journals through the use of time stamps.
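
Conceptually, a journal is an ordered log of intended operations written ahead of the data itself, so that operations started but never committed can be identified after a failure. The intent/commit record format in the sketch below is an illustrative assumption.

```python
# Minimal intent/commit journal. Records whose "commit" entry is missing
# after a crash represent interrupted operations to roll back or replay.
# The record format is an illustrative assumption.

def journal_intent(journal, op_id, description):
    journal.append({"op": op_id, "state": "intent", "desc": description})

def journal_commit(journal, op_id):
    journal.append({"op": op_id, "state": "commit"})

def incomplete_ops(journal):
    """Return IDs of operations that were started but never committed."""
    committed = {r["op"] for r in journal if r["state"] == "commit"}
    started = {r["op"] for r in journal if r["state"] == "intent"}
    return started - committed

log = []
journal_intent(log, 1, "update blocks 200-207")
journal_commit(log, 1)
journal_intent(log, 2, "update blocks 512-519")   # crash before commit
print(incomplete_ops(log))                        # {2}
```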

Global Name Space

A file system's name space represents its file and directory contents to applications and users. Just as the layout reference system has to be the same for all systems, the name space must also be uniform across all systems in the cluster so that all systems can locate data accurately. The term used to describe this is global name space.

Implementing Storage for Clustered File Systems

Storage implementations should match the requirements of the applications they support. Clusters add their own storage requirements to those imposed by the applications running on them. Some of these are discussed in the following sections.

Multipathing and Clustering

Clustering is similar to I/O multipathing, as discussed in Chapter 11, "Connection Redundancy in Storage Networks and Dynamic Multipathing." The difference is that multipathing provides dual connections from a single system to storage in case one of them should fail, whereas clustering provides access to storage from two or more systems. Multipathing is usually used in clustering designs, but it is not an architectural necessity.

Importance of SANs in Clusters

Clustered storage depends on having a connecting technology that allows all systems to access all storage. SANs are by far the easiest way to connect clustered systems and storage, especially for clusters with more than two systems.

Clusters require multiple storage initiators due to the presence of multiple systems, most of them using two initiators in multipathing configurations. Whereas parallel Small Computer Systems Interface (SCSI) was designed primarily for single-initiator connectivity, SANs were designed for multi-initiator environments.

NOTE

Although SCSI was designed for single-initiator connectivity, it was, and is, used for multi-initiator schemes as well. Multi-initiator SCSI implementations are usually proprietary and expensive, but they do work and provide excellent availability. The point is that SANs make this a whole lot easier.


SAN flexibility is also an important advantage with clusters. Systems and storage can be connected to the cluster SAN without having to stop the operation of the cluster. It may be necessary to stop cluster operations in order to incorporate new elements into the cluster configuration, but it is not necessary to stop operations just to connect the cables.

Uniformity of Storage Address Spaces in Clusters

As described in previous chapters, storage address spaces can be formed by many different techniques. These techniques can also be used with clusters as long as the unique requirements of clusters are considered. For starters, it's essential that all systems in a cluster work with the exact same storage address spaces, however they are generated. If SAN-based virtualization is used, all systems need to use the same virtualization "translation lenses." Similarly, if volume managers are used, all systems must have the same volume definitions. All maintenance operations that impact storage address spaces must be carefully considered to avoid problems.
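
One practical way to guard against such drift is to compare the volume definitions reported by every node before bringing the cluster online. The sketch below illustrates such a check; the definition format and node names are purely illustrative assumptions.

```python
# Illustrative check that every node reports identical volume definitions.
# In a real cluster these definitions would come from the volume manager
# or virtualization layer; here they are hard-coded assumptions.

def mismatched_nodes(per_node_defs):
    """per_node_defs: dict of node name -> volume definition dict.
    Returns the nodes whose definitions differ from the first node's."""
    reference = None
    mismatched = []
    for node, defs in per_node_defs.items():
        if reference is None:
            reference = defs
        elif defs != reference:
            mismatched.append(node)
    return mismatched

cluster = {
    "node1": {"vol0": {"size_gb": 500, "raid": "10"}},
    "node2": {"vol0": {"size_gb": 500, "raid": "10"}},
    "node3": {"vol0": {"size_gb": 500, "raid": "5"}},   # misconfigured
}
print(mismatched_nodes(cluster))   # ['node3'] must be fixed before use
```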

All systems in the cluster need to use the same type of file system software. While there may be more than one storage address space used by a cluster, and therefore more than one file system, they all need to have the same logical structure, including the location of superblocks, internal reference system data, name spaces, and attributes.

Data Caching in Clusters

Buffer (cache) memory in host systems needs to be distinguished from cache memory implemented in storage subsystems or SAN virtualization systems. In general, I/O latencies should be kept to a minimum in clusters to avoid complicating intricate functions like locking. Subsystem cache can significantly reduce I/O latencies.

Using caching to minimize the mechanical latencies of disk drives is generally a good idea, but the variables of multisystem operations and the possibility of creating inconsistent data must be accounted for. While caching lowers latency, it also increases the possibility of creating inconsistent data.

All possible paths from all servers to storage logical units must be identified and understood for both normal and post-failover operations. Mirrored or shared cache between two different subsystem controllers can be used to ensure that two distinct I/O paths have access to the same cache contents. With mirrored cache, all logical unit numbers (LUNs) used to access a specific logical unit in a subsystem are given access to the same cache data. Cache in storage should be implemented with battery backup, both for the entire system or subsystem where the cache is located and for the cache memory itself.

The most conservative approach to caching in clusters is to avoid write caching altogether. While caching is likely to be the most effective way to reduce write latencies, several other techniques and technologies can be applied, such as broad spreading of data across a relatively high number of smaller-capacity disk drives and avoiding the RAID 5 write penalty by using mirroring or RAID 10.

Disk Drive Considerations for Clusters

With availability and performance driving the selection of disk technologies for clusters, Fibre Channel and parallel SCSI drives are obvious choices. However, SCSI drives should be used only when they are implemented inside a Fibre Channel storage subsystem.

The command queuing capabilities of both SCSI and Fibre Channel drives increase performance significantly over drives that do not implement command queuing. Serial ATA (SATA) drive manufacturers have developed a similar technology called Native Command Queuing (NCQ), but it is unclear how well SATA NCQ will compare with the stellar results of command queuing in SCSI and Fibre Channel disk drives.
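
The performance benefit of command queuing comes from reordering outstanding requests so the actuator sweeps across the platter rather than jumping back and forth. The sketch below models that idea with a simple elevator-style sort by logical block address; real drive firmware uses far more sophisticated scheduling that also accounts for rotational position.

```python
# Elevator-style reordering of queued block requests, a simplified model
# of what command queuing accomplishes inside SCSI/FC drives.

def elevator_order(requests, head_position):
    """Sort requests so the head sweeps upward from its current position,
    then services the remaining lower addresses."""
    ahead = sorted(r for r in requests if r >= head_position)
    behind = sorted(r for r in requests if r < head_position)
    return ahead + behind

queued = [9000, 120, 55000, 4300, 31000]
print(elevator_order(queued, head_position=5000))
# [9000, 31000, 55000, 120, 4300]
```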

The mean time between failures (MTBF) ratings of drives used in clusters should be at least 1 million hours, a mark that server-class SATA drives also meet. Desktop-class ATA drives are relatively poor choices for high-availability, high-performance cluster environments.

A technique to reduce latency in disk drives is short-stroking, as discussed briefly in Chapter 4, "Storage Devices." By limiting the physical distance that the disk arms must move, the average access time of disk drives can be shortened considerably. The capacity lost by short-stroking may be able to be compensated for by using RAID 10.
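
As a rough back-of-the-envelope illustration, the sketch below computes the usable capacity that remains after short-stroking a set of drives and mirroring them with RAID 10. The drive count, drive capacity, and stroke fraction are assumed values chosen only to show the arithmetic.

```python
# Back-of-the-envelope capacity math for a short-stroked RAID 10 array.
# All figures are illustrative assumptions, not vendor specifications.

drives = 24                 # assumed number of disk drives in the array
drive_capacity_gb = 300     # assumed raw capacity per drive
stroke_fraction = 0.25      # only the outer 25% of each drive is used

raw_gb = drives * drive_capacity_gb
short_stroked_gb = raw_gb * stroke_fraction
usable_gb = short_stroked_gb / 2          # RAID 10 mirrors every stripe

print(f"raw: {raw_gb} GB, short-stroked: {short_stroked_gb:.0f} GB, "
      f"usable after RAID 10: {usable_gb:.0f} GB")
# raw: 7200 GB, short-stroked: 1800 GB, usable after RAID 10: 900 GB
```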

NOTE

Short-stroking is not necessarily a feature of all disk subsystems, even though it would be relatively simple to add in a disk partitioning utility. Reducing the usable capacity of disk drives in a storage subsystem might seem stupid, but it's not, especially when performance is the ultimate goal. For the fastest storage, minimize mechanical latencies and get as many disk actuators as possible involved in doing the work in parallel.


RAID Levels for High Availability and Performance in Clusters

Emphasizing performance over capacity, the best RAID levels for cluster storage are RAID 1 and RAID 10. There is no reason to slow the processing of write I/Os due to the write penalty of RAID 5. RAID 10 provides the best performance by allowing data to be striped across many sets of mirrored pairs. Not only that, but RAID 10 can survive the loss of more than one disk drive in the array. As the size of arrays increases, this becomes much more significant.
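
The RAID 5 write penalty can be quantified with the standard back-end I/O counts: a small random write costs four disk I/Os on RAID 5 (read old data, read old parity, write new data, write new parity) but only two on RAID 1/10 (one write per mirror). The sketch below turns those counts into effective write IOPS using assumed drive numbers.

```python
# Effective small-random-write IOPS for RAID 10 versus RAID 5, using the
# standard back-end I/O counts. Drive count and per-drive IOPS are
# illustrative assumptions.

drives = 16
iops_per_drive = 180        # assumed small-random IOPS per drive

backend_iops = drives * iops_per_drive
raid10_write_iops = backend_iops / 2    # 2 back-end I/Os per host write
raid5_write_iops = backend_iops / 4     # 4 back-end I/Os per host write

print(f"RAID 10: {raid10_write_iops:.0f} write IOPS, "
      f"RAID 5: {raid5_write_iops:.0f} write IOPS")
# RAID 10: 1440 write IOPS, RAID 5: 720 write IOPS
```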

If higher levels of data availability are needed to guarantee data availability after the loss of two disk drives, it is possible to use double parity protection. For instance, the concept of RAID 6 uses an additional parity calculation besides XOR and writes parity information to two separate disks.

NAS Clusters

Traditional network attached storage (NAS) systems are single points of failure for the clients that access files through them. A web server farm that depends on a single NAS system runs a realistic risk of having its entire website go down due to a problem with the NAS system. To address this, some NAS vendors have implemented clustering technology in their products.

Beyond the obvious availability benefits of clustered NAS, clustering can also be used to increase the capacity of a NAS system. For instance, if a single NAS system has a maximum capacity of 10 terabytes, a two-system cluster could have a maximum capacity of 20 terabytes. Clusters also can be used to improve performance by doubling the number of network file services and I/O channels. Even shared nothing designs that cross-ship data can improve performance if the intercluster connections and processes are fast enough.

NAS clusters can use any combination of cluster designs, including active/active or active/passive failover and shared nothing or shared everything storage designs. All systems in a NAS cluster export the same name space for clients to access data in the cluster. Any data caching done by NAS cluster systems needs to have a mechanism to prevent the loss of data consistency.

NOTE

An example of NAS clustering can be found in Network Appliance Filer products. NetApp implements NAS clustering with two-node, active/active, shared nothing clusters. Data consistency is assured through mirrored memory that is carried over a high-speed link between the two systems.



