Dynamic Multipathing


The rest of the chapter discusses the logical operations of path redundancy, which is commonly called dynamic multipathing (DMP). In this chapter, the shortened term multipathing means the same thing as dynamic multipathing.

The Big Picture of Dynamic Multipathing

Multipathing is concerned with a very small subset of the overall I/O path: the Small Computer System Interface (SCSI) architectural elements responsible for SCSI-layer transmissions between host systems and storage. In a nutshell, multipathing establishes two or more SCSI communication connections between a host system and the storage it uses. If one of these connections fails, another SCSI communication connection is used in its place. Figure 11-1 illustrates this arrangement.

Figure 11-1. A High-Level View of Multipathing


NOTE

Notice the lack of a network or bus in Figure 11-1. While the word "path" might be assumed to refer to a network path or route, it doesn't. The path in this case is actually better thought of as a SCSI nexus. (See the section "SCSI Nexus and Connection Relationships" in Chapter 6, "SCSI Storage Fundamentals and SAN Adapters.") Of course, nobody in their right mind refers to this topic as multinexusing, or nexi-ing, or whatever it would be, because nobody understands how to modify the word nexus, much less use it.


Differentiating Between Mirroring, Routing, and Multipathing

Multipathing is sometimes confused with other redundancy functions, including storage mirroring and network routing. Often all three are combined to support high-availability data access to mission-critical applications. Table 11-1 summarizes the different roles and relationships these technologies have in providing redundancy for SAN communications.

Table 11-1. Differentiating Redundancy Functions

Redundancy Function      Relationship                                    Role
Mirroring                Generates two I/Os to two storage targets      Creates two copies of data
Routing (convergence)    Determined by switches, independent of SCSI    Recreates network routes after a failure
Multipathing             Two initiators to one target                   Selects the initiator-LUN pair to use


Subsystem Multipathing Structures

Among the more challenging aspects of SCSI communications are the various relationships between SCSI initiators and targets, LUNs, and logical units. Multipathing provides a context that helps clarify how these elements are related, but it is practically impossible to understand multipathing without understanding these fundamental relationships.

A Review of SCSI Logical Units and LUNs

As discussed in Chapter 6, a SCSI logical unit (LU) is the command process running in a storage subsystem controller that manages I/O operations for a particular storage address space. Most storage address spaces in storage network subsystems are composed of redundant storage from multiple disk partitions configured using mirroring, RAID, or other virtualization techniques.

Logical units in SAN storage subsystems are accessed through the combination of the subsystem SAN port and a particular LUN that is associated with the LU. A single subsystem port can have multiple LUNs, each of them associated with a single LU. Figure 11-2 shows a subsystem where I/O commands enter through SAN Port P and are directed to LU abc through LUN X. The storage address space is formed by a pair of mirrored disk drives.

Figure 11-2. I/O Traffic Enters a Subsystem Through SAN Port P and Is Directed to LU abc Via LUN X. A Pair of Mirrored Disks Forms a Single Storage Address Space.


It is important to understand that the LUN is not an identifier for the LU, but simply provides an access role. If you consider the entire subsystem and all its exported storage, LUNs provide a mapping method that allows storage I/O traffic to be directed through subsystem SAN ports to the proper LUs. Each subsystem SAN port has one or more LUNs that are available for directing I/Os to specific LUs.

The LU has its own unique identifier within the subsystem, such as a serial number or a universally unique identifier (UUID) created by the subsystem. Multiple LUNs associated with different SAN ports in a subsystem can all index the same LU by its unique UUID.

The way LUNs are assigned to SAN ports is an administrative decision. While LUs are unique in a subsystem, the LUNs that index them do not have to be; it is possible to have the same LUN ID defined on multiple ports that map I/Os to different LUs. For instance, LUN 3 could be defined on Ports 1 and 2, with LUN 3 on Port 1 mapping I/Os to LU aaa and LUN 3 on Port 2 mapping I/Os to LU bbb. In general, it is a good practice to associate all occurrences of the same LUN ID with the same LU within a subsystem. In other words, all occurrences of a particular LUN ID would map to the same LU, regardless of the subsystem port.
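To make this good practice concrete, the sketch below models one subsystem's port/LUN assignments as a simple table and checks that every occurrence of a LUN ID maps to the same LU. The port and LU names are hypothetical, and this is an illustration of the idea rather than any product's interface.

# Hypothetical (port, LUN) -> LU assignments for one subsystem.
lun_map = {
    ("Port 1", 3): "LU aaa",
    ("Port 2", 3): "LU aaa",   # good practice: same LUN ID, same LU
    ("Port 1", 4): "LU bbb",
    ("Port 2", 4): "LU bbb",
}

def lun_ids_consistent(lun_map):
    # True if each LUN ID maps to one LU, regardless of port.
    seen = {}
    for (port, lun), lu in lun_map.items():
        if seen.setdefault(lun, lu) != lu:
            return False
    return True

print(lun_ids_consistent(lun_map))   # True for the table above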

NOTE

For what it's worth, no architectural limit is placed on the number of host system initiators that can communicate with each port/LUN in a subsystem. Limits on host/LUN communications are imposed through LUN masking, a process discussed in Chapter 5.


Figure 11-3 expands Figure 11-2 by showing two identical LUNs in two different SAN ports mapping to LU abc.

Figure 11-3. Two Identical LUNs Accessed Through Different SAN Ports Mapping I/Os to a Single Logical Unit


World Wide Node Names and World Wide Port Names

Fibre Channel uses the notion of global, unique identifiers to locate resources in the network. All port hardware in a Fibre Channel network has a 64-bit identifier assigned at the factory called a world wide port name (WWPN). It is used to uniquely identify the port, even after the network is powered down or rebuilt. The idea of the WWPN is to provide persistence in the SAN, to facilitate fast recovery of functions following some sort of failure or other disaster. For instance, an initiator that retrieves configuration information about the storage it was using should be able to find it in the network again, following a complete network power cycle.

The term world wide name (WWN) usually refers to the WWPN, but where multipathing is concerned, the world wide node name (WWNN) is also needed to uniquely identify the subsystem. The WWNN is the ID of a system or subsystem that has multiple ports. There are some interesting problems in creating a unique WWNN for a system that may be sold without any SAN HBAs whatsoever, but for multipathing you can assume that the subsystem can identify itself as an entity containing multiple WWPNs.

Figure 11-4 illustrates a subsystem with four SAN ports, each with its own WWPN.

Figure 11-4. Multiple Ports with Unique WWPNs in a Storage Subsystem with a Unique WWNN


WWPNs, LUNs, and LUs

The whole picture of the subsystem can now be assembled. Each port has a unique WWPN. Associated with those ports are one or more LUNs, which map storage I/O traffic to specific LUs, each having its own specific UUID. The LU is the command processor for I/O commands operating on the disk drive partitions that form the storage address space. All these entities are pictured in Figure 11-5.

Figure 11-5. Subsystem SAN Ports with WWPNs and Associated LUNs Mapping I/O Traffic to a Specific LU
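Pulling the identifiers together, a minimal sketch of this model in Python follows. The class names and identifier values are invented for illustration; the point is simply the containment: one WWNN, several WWPNs, and per-port LUN tables that index LUs by UUID.

from dataclasses import dataclass, field

@dataclass
class SubsystemPort:
    wwpn: str                                  # factory-assigned world wide port name
    luns: dict = field(default_factory=dict)   # LUN ID -> UUID of the LU it maps to

@dataclass
class Subsystem:
    wwnn: str                                  # world wide node name for the whole subsystem
    ports: list = field(default_factory=list)

# Hypothetical subsystem: two ports whose LUN 0 maps to the same LU.
subsystem = Subsystem(
    wwnn="50:01:23:45:67:89:ab:cd",
    ports=[
        SubsystemPort("50:01:23:45:67:89:ab:01", {0: "uuid-lu-abc"}),
        SubsystemPort("50:01:23:45:67:89:ab:02", {0: "uuid-lu-abc"}),
    ],
)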


Host System Multipathing Functions

Subsystem architectures for multipathing are half the story; the other half of the multipathing equation occurs in host systems.

Host Storage Initiators in Multipathing

Multipathing software monitors host storage initiator functions, where storage I/Os originate and where communications failures are detected. When a failure is detected, multipathing software changes the initiator port being used.

In many multipathing implementations, the HBA has a single port and therefore a single instance of an initiator process. This has created the perception that multipathing software requires multiple HBAs. However, it is certainly possible to use multiported HBAs, with each port's operations controlled by its own discrete initiator process. When multiported HBAs are used, multipathing software can switch to a different path through another port, and its initiator process, on the same HBA where the failure occurred.

Nonetheless, multiple HBAs will continue to be commonly used in SAN multipathing solutions in order to protect against HBA failures.

Implementing Multipathing Software

Multipathing software typically runs in kernel space in host systems, which means it must execute quickly and without errors. It does not create or alter storage transfers. Instead, it determines the storage path that is used, which in turn determines the network connections and all other parts of the I/O path, all the way to the LUN in a subsystem.

There are many different ways to implement multipathing software, depending on the operating system. Some operating systems, like Microsoft Windows Server operating systems, have application programming interfaces (APIs) for integrating third-party multipathing software. Other operating systems do not have APIs, which gives multipathing software vendors much more leeway in implementing their solutions but also creates significant challenges for testing and debugging.

Using a stack analysis, the multipathing software is typically placed between the SCSI command driver and the low-level connecting HBA device driver, as shown in Figure 11-6.

Figure 11-6. Multipathing Software in the Storage Software Stack
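One way to picture this placement is as an interposer: the multipathing layer accepts commands from the SCSI command driver above and hands each one, unmodified, to whichever HBA device driver serves the currently selected path. The sketch below is structural only; the class and method names are invented, and real implementations use OS-specific kernel interfaces.

class MultipathLayer:
    # Sits between the SCSI command driver and the low-level HBA
    # drivers; it never creates or alters transfers, it only selects
    # which path (and therefore which driver) carries them.
    def __init__(self, hba_drivers):
        self.hba_drivers = hba_drivers  # one driver per initiator port
        self.active = 0                 # index of the currently active path

    def submit(self, scsi_command):
        # Forward the unmodified command down the active path.
        return self.hba_drivers[self.active].send(scsi_command)

    def fail_over(self):
        # Select the next available path; the command stream is unchanged.
        self.active = (self.active + 1) % len(self.hba_drivers)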


Determining Paths

Paths in multipathing solutions have three elements:

  • The initiator, which originates commands

  • The subsystem SAN port and LUN (WWPN+LUN) where commands are sent

  • The LU that processes commands for a given storage address space

In the remaining sections, these three elements of the path are sometimes indicated using the construct "initiator/WWPN+LUN/LU."
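Because this triple recurs throughout the remaining sections, it can help to see it written down as a record. A minimal sketch in Python; the field names simply mirror the construct above:

from collections import namedtuple

# One storage path: the initiator that originates commands, the
# subsystem port and LUN the commands are sent to, and the LU that
# processes them.
Path = namedtuple("Path", ["initiator", "wwpn", "lun", "lu_uuid"])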

One of the most interesting aspects of multipathing is how the software determines which initiator/WWPN+LUN/LU paths can be used for I/O transmissions. Multipathing software can be thought of as a SCSI investigator that gathers information about a subsystem's storage resources and uses deductive reasoning to determine all the paths that reach a specific logical unit.

There are several ways multipathing software could be designed to discover multiple storage paths between systems and an LU in a subsystem. One possible discovery process is outlined in the following steps:

1. For each initiator in the system, get a list of all storage target WWPNs (subsystem ports) from the name service of a SAN switch.

2. Query all WWPNs to report all associated LUNs. (LUN masking would prevent "masked" LUNs from being reported to certain HBAs.)

3. For each reported LUN, acquire the UUID for the LU it references.

4. Create a list of all initiator/WWPN+LUN/LU paths that can be used to transmit I/Os between a system and a particular LU.
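A rough sketch of these four steps follows, reusing the Path record shown earlier. The helpers query_name_service, report_luns, and get_lu_uuid are hypothetical stand-ins for the fabric name-service query, the LUN report (which would honor LUN masking), and the LU identity inquiry.

def discover_paths(initiators):
    # Steps 1-4: enumerate every initiator/WWPN+LUN/LU path the
    # system could use to reach its storage.
    paths = []
    for initiator in initiators:
        for wwpn in query_name_service(initiator):             # step 1
            for lun in report_luns(initiator, wwpn):           # step 2
                lu_uuid = get_lu_uuid(initiator, wwpn, lun)    # step 3
                paths.append(Path(initiator, wwpn, lun, lu_uuid))  # step 4
    return paths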

For full redundancy in the storage path, there need to be at least two paths with different pairs of initiators and WWPN+LUNs. Obviously the LU UUIDs have to be the same. The LUNs could be the same or different, as long as they both refer to the same LU.

Figure 11-7 shows a hypothetical list of storage paths discovered by multipathing software in a system. Redundant storage paths are indicated by dotted lines.

Figure 11-7. A Hypothetical List of Storage Paths Maintained by Multipathing Software


Active/Passive Configurations and Static Load Balancing

A common configuration for multipathing follows an active/passive model where the active storage path carries all I/O traffic while the passive storage path is idle. Active/passive configurations such as these require an administrator to select which storage path will be active and which path will be passive.

Active storage paths are used for all I/O until something occurs that keeps I/O transmissions from reaching their destination. When a path failure is recognized, the passive path automatically is elevated to become the active path, and the formerly active path is made inactive to prevent it from inadvertently restarting and creating problems.

It is important to differentiate between physical SAN links and the initiator/WWPN+LUN/LU paths used by multipathing software. A single physical link can be used by multiple paths concurrently. For instance, a single host system with two HBAs (or a single multiported HBA) running multiple applications can define active and passive paths for both host controllers, allowing I/O traffic to be divided between them.

Assigning the I/O traffic from different applications to active paths defined on different host initiators is a practice referred to as static load balancing. For instance, consider a server system with two HBAs (Initiator 1 and Initiator 2) running two applications, #1 and #2, storing data through two different LUs, A and B, respectively. Assume both initiators can access both LUs. It is relatively simple to define an active path for application #1 using Initiator 1 and a passive path using Initiator 2. Conversely, the active path for application #2 can use Initiator 2, and the passive path can use Initiator 1.
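In configuration terms, static load balancing is nothing more than an administrator-supplied table. A sketch of the assignments just described, with names following the example:

# Administrator-defined active/passive path assignments. Each
# application's active path uses a different initiator, dividing
# normal I/O traffic between the two host ports.
static_assignments = {
    "Application #1": {"active": ("Initiator 1", "LU A"),
                       "passive": ("Initiator 2", "LU A")},
    "Application #2": {"active": ("Initiator 2", "LU B"),
                       "passive": ("Initiator 1", "LU B")},
}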

Figure 11-8 shows this simple static load balancing configuration, where Application #1 stores data through LU A and Application #2 stores data through LU B. This figure uses dotted lines to represent a pair of point-to-point links connecting the system initiators with the subsystem SAN ports (Ports A and B). In actual SANs, these connections would most likely be made through a pair of SAN switches. The bold solid lines denote active paths, and the thin solid lines denote passive paths within the system and subsystem. The diamond shapes in the subsystem represent LUNs; each subsystem port has two LUNs associated with it. LUNs 1 and 3 are associated with LU A, and LUNs 2 and 4 are associated with LU B.

Figure 11-8. An Active/Passive Multipathing Configuration


NOTE

Some readers might wonder why Figure 11-8 does not include SAN switches. While the addition of switches would have been more representative of actual SANs, Figure 11-8 is already busy enough without the presence of switches and switch connections. It would have been possible to use the familiar network cloud to indicate network connections, but it probably would have required two clouds (for redundancy) and a lengthy discussion of single SANs versus dual SANs. So let's just say you could use one or two SANs and that multipathing software doesn't care what you do because the switches and switch ports are not defined as part of the initiator/WWPN+LUN/LU path.


Active/Active Configurations with Dynamic Load Balancing

More advanced multipathing solutions allow active/active connections, where multiple storage paths provide redundancy as well as dynamic load balancing. In a nutshell, active/active configurations use two or more storage paths as part of normal operations. If one of the paths experiences a failure, the remaining paths carry all I/O traffic.

Dynamic load balancing distributes I/O transmissions over the available paths. Three basic algorithms are used to distribute storage traffic in dynamic load balancing:

  • Round robin: Each I/O command is sent to the next path available for an initiator/LU connection. When two paths are available, the I/O commands alternate between both paths.

  • Least blocks: The next I/O command is sent over the path that has the fewest blocks in transit. This method is most useful for streaming applications.

  • Least I/O: The next I/O command is sent over the path with the lightest measured I/O load. This method is most useful for transaction processing applications.
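The three algorithms differ only in how they pick the next path. A minimal sketch, assuming each path object tracks its own in-flight statistics (the attribute names blocks_in_transit and outstanding_ios are invented for illustration):

import itertools

_turn = itertools.count()  # persistent counter for round robin

def round_robin(paths):
    # Alternate I/O commands across the available paths in order.
    return paths[next(_turn) % len(paths)]

def least_blocks(paths):
    # Choose the path with the fewest blocks currently in transit.
    return min(paths, key=lambda p: p.blocks_in_transit)

def least_io(paths):
    # Choose the path with the lightest measured I/O load.
    return min(paths, key=lambda p: p.outstanding_ios)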

Failover Processing

The process of changing paths with multipathing technology is called failing over. Failing over involves recognizing that a storage path has failed, preparing to restart operations on a redundant path, and then reinitiating operations on the redundant path.

Recognizing Path Failures

Path failure recognition is the responsibility of the host-based multipathing software. Failures are detected when I/O operations are not acknowledged within a defined period of time, or timeout value. Usually the operation is retried to allow for an intermittent problem of some sort, but repeated timeouts are interpreted as a path failure. Timeout values are not specified by any standard and are determined by the design of the multipathing software. Some vendors allow users to select the timeout values they want to use. Common timeout values range from 30 seconds to several minutes.
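In rough terms, failure recognition is a timeout-and-retry loop along the lines of the sketch below. The 30-second default and the single retry are illustrative only, and send_io is a hypothetical stand-in for issuing a SCSI command down a path.

class PathFailure(Exception):
    # Raised when repeated timeouts indicate a failed storage path.
    pass

def io_with_failure_detection(path, command, timeout_s=30, retries=1):
    for attempt in range(1 + retries):
        try:
            return send_io(path, command, timeout=timeout_s)
        except TimeoutError:
            continue  # allow for an intermittent problem and retry
    raise PathFailure(path)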

Preparing to Fail Over

There may be several "housekeeping" actions that need to be done prior to commencing operations on the passive path. This can involve changing the state of system variables and flushing "dirty" data stored in write-back cache to disk drives.

Initiating Operations on the New Active Path

It might not be possible to determine the status of a pending I/O that was never acknowledged. The multipathing software should assume that the I/O failed before reaching the subsystem and that the operation needs to be retried on the newly active path. From that point on, all I/O operations use the newly activated path.
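Putting the failover steps together, the sequence might look like the following sketch. The helpers flush_write_back_cache and send_io are hypothetical, and the state strings are invented for illustration.

def fail_over(failed_path, standby_path, pending_command):
    # Housekeeping before restarting operations on the standby path:
    # mark the failed path inactive so it cannot inadvertently restart,
    # and flush any dirty write-back cache data to disk.
    failed_path.state = "inactive"
    flush_write_back_cache()
    standby_path.state = "active"
    # Assume the unacknowledged I/O never reached the subsystem
    # and retry it on the newly active path.
    return send_io(standby_path, pending_command)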

Failing Back

In an active/passive configuration, after all problems with the former active path have been resolved, it may be desirable to "fail back" to the original configuration in order to restore the balance of I/O processes in the SAN. The failback process is identical to the failover process except that it is not precipitated by a path failure. Failing back stops I/O operations on the active path, prepares the repaired path for operations, and then starts processing I/O commands again through the original active path.

Multipathing and Network Route Convergence

Multipathing is not the only automated way to recover from a network problem in a SAN. SAN switches use the Fabric Shortest Path First (FSPF) routing protocol to converge new routes through the network following a change to the network configuration, including link or switch failures.

Multipathing software in a host system can use new network routes that have been created in the SAN by switch routing algorithms. This depends on switches in the network recognizing a change to the network and completing their route convergence process prior to the I/O operation timing out in the multipathing software. As long as the new network route allows the storage path's initiator and LU to communicate, the storage process uses the route. Considering this, it could be advantageous to implement multipathing so that the timeout values for storage paths exceed the times needed to converge new network routes.
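This guidance reduces to a simple inequality between two tunables. Both numbers below are illustrative assumptions, not values from any product:

# The multipathing timeout should exceed the worst-case time the
# fabric needs to converge new routes, so FSPF can repair the network
# before the host declares the storage path failed.
FSPF_CONVERGENCE_S = 5        # assumed worst-case route convergence time
MULTIPATH_TIMEOUT_S = 30      # assumed multipathing timeout value

assert MULTIPATH_TIMEOUT_S > FSPF_CONVERGENCE_S

If the timeout expires before the fabric converges, the host fails over unnecessarily; if it comfortably exceeds the convergence time, the fabric repair is transparent to the multipathing software.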


