Chapter 15: The Device Request Dispatcher (DRD)


In this chapter we discuss the Device Request Dispatcher cluster subsystem. The Device Request Dispatcher, as you have probably already guessed, is the subsystem that dispatches requests to devices. More precisely, it dispatches I/O requests to any storage device in the cluster (in fact, some engineers within Compaq prefer the name DeviceIO Request Dispatcher). The DRD also enables a cluster-wide view of all storage devices connected to any and all members of the cluster. The storage devices can be located on any bus on any system in the cluster, and not just shared buses; private buses are handled by the DRD as well. The DRD does not discriminate; any storage device is fair game. And the really great thing is that the DRD does all of this transparently from the user's perspective.

You may be starting to see a trend here. Every member can see all the file systems, every member can see all the devices (even those on a private bus), and everything is accomplished transparently. Not only will your users not require special training to use the cluster, they don't even have to know that they're using a cluster! With few exceptions, the goal is for all the cluster members to cooperate such that users see the cluster as simply one system.

15.1 DRD Concepts

The primary function of the DRD is to coordinate access to the storage devices in a cluster. This coordination is accomplished through a cluster-wide namespace for all storage devices, communication among the DRD components on each cluster member (the DRD is a client/server implementation), and the judicious use of I/O barriers to prevent rogue nodes from corrupting data on the cluster-wide storage devices.

15.1.1 Clusterwide Device Namespace

The DRD subsystem is the mechanism that enables every member in the cluster to see every storage device in the cluster. Devices are given names as they are discovered, so when a member is added to a cluster, any devices it brings with it are named at that time. Device naming was covered in detail in chapter 7.

The first member in the cluster will assign names to the devices it detects in the order it detects them (see Figure 15-1).

Figure 15-1: Cluster-wide Device Naming

It is important to remember that a device name is associated with the device's worldwide identifier (WWID) and not the device's location on the bus. For additional information, see the discussion on WWIDs in chapter 7. As you can see in the figure, the devices that are directly connected to member1 have the lowest device names (i.e., floppy0, cdrom0, dsk0-dsk6, and tape0). As the second member is added to the cluster, the devices directly connected to it (and not connected directly to member1) are detected, and the naming continues with the next number in sequence (i.e., floppy1, cdrom1, dsk7-dsk9). When the third member is added to the cluster, the devices directly connected to it (and not connected directly to member1 or member2) are detected and are named in sequence following those connected to member2 (i.e., floppy2, cdrom2, dsk10-dsk11).
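To see the cluster-wide device namespace for yourself, you can list the devices and where they are physically connected from any member. The listing below is only a sketch: it assumes that the hwmgr -view devices command accepts a -cluster option on your version, and the HWIDs, models, and locations shown are illustrative rather than captured from a real cluster (the device names follow Figure 15-1).

    # hwmgr -view devices -cluster
     HWID:  Device Name       Mfg   Model   Hostname  Location
       38:  /dev/disk/dsk0c   DEC   RZ28    member1   bus-1-targ-0-lun-0
       ...
       51:  /dev/disk/dsk7c   DEC   RZ28    member2   bus-1-targ-0-lun-0
       ...
       60:  /dev/disk/dsk10c  DEC   RZ28    member3   bus-1-targ-0-lun-0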

There are three instances when the DRD will look for devices:

  • Boot time.

    The DRD will query the hardware management subsystem at boot time for devices it needs to manage.

  • Device open.

    If the DRD does not know about a device when an open on that device is requested, it will query the hardware management subsystem for that specific device.

  • Device detection.

    When a new device is added to the system, the hardware management subsystem will post an event to EVM upon detection (e.g., as a result of executing the "hwmgr -scan scsi" command). The DRD subscribes to the events that the hardware management subsystem posts to EVM.
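As a rough illustration of the third case, the following command sequence shows how you might add a disk and watch the DRD learn about it. The evmwatch(1) and evmshow(1) commands are the standard EVM tools for watching events as they are posted; the device name dsk12 is purely hypothetical, and the exact event names you will see vary.

    # evmwatch | evmshow     (in one window, to watch events as they arrive)

    # hwmgr -scan scsi       (in another window, after attaching the new disk)

    # drdmgr dsk12           (once the device has been named; dsk12 is hypothetical)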

15.1.2 Clusterwide I/O

All I/O to a storage device in the cluster will pass through the DRD subsystem as you can see in the now familiar TruCluster I/O Subsystems Architecture diagram (Figure 15-2). Direct I/O (see chapter 13), character I/O, and block/file I/O eventually travel through the DRD. Even the "client" paths in the diagram are sent through the Internode Communication Subsystem (ICS) to another cluster member's DRD.

Figure 15-2: TruCluster Server I/O Subsystems Architecture

The DRD is a client/server implementation per device. And, just to make things interesting, certain devices can be served simultaneously by more than one member. So how does the DRD accomplish this feat? There are actually two approaches that the DRD can take: an indirect approach or a direct approach.

The indirect approach is handled using a client/server (or served) technique in which one cluster member acts as the DRD server for a device, and all other members are DRD clients (accessing the device through the DRD server only). This is similar to how the Distributed Raw Disk (the other DRD) component of the TruCluster Production Server product functions[1].

The direct approach is accomplished using devices that are Direct Access I/O (DAIO) capable. The DAIO implementation grew out of the Multi-Node Simultaneous Access (MUNSA)[2] work that was done for TruCluster Production Server version 1.5.

15.1.2.1 Direct Access I/O (DAIO) Device

A device can be served by more than one DRD if the device is Direct Access I/O capable. What determines whether or not a device is DAIO capable?

  • The device must have hardware bad block replacement enabled.

  • The device must support tagged queuing to allow command ordering from a host.

  • The device must support simultaneous access from multiple initiators.

All disk devices supported by TruCluster Server versions 5.0A, 5.1, and 5.1A are DAIO capable.

Figure 15-3 illustrates how a DAIO device is accessed in a four-member cluster. In the figure you can see that every member in the cluster can directly access the disk. The captions for each member show a slightly modified view of the output you would get if you ran the drdmgr(8) command on that member. Since every member in the cluster has a physical connection to the bus where the disk is located, the "number of servers" attribute has a value of four. The servers are listed in the "server name" fields. The fact that all four members' "server state" is "Server" indicates that all four members can actively access this disk. The valid server states are "Server" and "Not Server".

Figure 15-3: Direct Access I/O

Although there are four active servers for this device, each member will access the disk via one server at a time. The server that the member is currently using to access the device is indicated by the "access member name". By default, if the member is also a server, its "access member name" will always be itself.

In Figure 15-3, every member has direct access because they are physically connected to the bus where the device is attached and the device supports DAIO (as indicated by the "device type" field). As of this writing, there are four valid device types that you will see in the output of the drdmgr command: "Direct Access IO Disk" (indicated as DAIO in the figure), "Served Disk", "Served Tape", and "Served Changer".
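To make these attributes concrete, here is roughly what the drdmgr output for a DAIO disk on a shared four-member bus might look like when run on member1. The field names come from the discussion above, but the device name is hypothetical and the exact labels, ordering, and formatting vary between TruCluster versions, so treat this listing as a sketch rather than verbatim output.

    # drdmgr dsk4
    View of Data from member member1

               Device Name: dsk4
               Device Type: Direct Access IO Disk
         Number of Servers: 4
               Server Name: member1      Server State: Server
               Server Name: member2      Server State: Server
               Server Name: member3      Server State: Server
               Server Name: member4      Server State: Server
        Access Member Name: member1

Run the same command on member3 and the only field you would expect to change is the access member name, which would then be member3, since a member that is also a server accesses the device through itself.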

A device can be DAIO capable and yet have members in the cluster that cannot directly access it. Although the device is capable of having more than one server, physical connectivity may leave it with only one server. Figure 15-4 illustrates a DAIO device on a private bus. The DRD does not really care where the device is located. If a member wants to access a device to which it is not physically connected, it dispatches the request to a DRD on a member that has a direct connection.

Figure 15-4: DAIO Device, Private Bus
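For a private-bus DAIO device like the one in Figure 15-4, the drdmgr view from a member with no physical path to the disk might look something like the following sketch (the device and member names are hypothetical, and the formatting is approximate). Note that there is only one server and that the access member is the serving member rather than the local member.

    # drdmgr dsk6            (run on member3, which has no path to the disk)
    View of Data from member member3

               Device Name: dsk6
               Device Type: Direct Access IO Disk
         Number of Servers: 1
               Server Name: member1
              Server State: Server
        Access Member Name: member1

Any I/O that member3 submits to dsk6 is shipped across the ICS to the DRD on member1, which performs the actual I/O and returns the results.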

Note

Older SCSI devices like the RZ26, RZ28, RZ29, and RZ1CB-CA do not have bad block replacement enabled by default, so if you add a device of this type, you can run the clu_disk_install script to enable bad block replacement. This script may take a while to complete if you install several devices at the same time.

15.1.2.2 Served Devices

If a device is a served device, one member in the cluster actively serves the device to the other cluster members regardless of whether or not the device is accessible by more than one member via physical connection (i.e., shared bus).

Tape and changer devices as well as all CD-ROM, DVD-ROM[3], floppy disk devices, and some hard disk devices are served devices. Unlike tape and changer devices, though, CD-ROM, DVD-ROM, and floppy disk devices cannot, as of this writing, be located on a shared bus.

Figure 15-5 shows a two-member cluster with a served tape device on a shared bus. Notice that there are two potential servers; however, member1's server state is "Not Server". This is because a tape device is not a DAIO device. In other words, even though both members can serve the device, only one member can serve it at a time; therefore the device has a device type of "served" – in this case "Served Tape".

Figure 15-5: Served Device, Shared Bus
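Continuing the example, a drdmgr query against the served tape in Figure 15-5 might look roughly like this (again a sketch, with a hypothetical device name and approximate formatting). Both members appear as potential servers, but only member2 is currently in the "Server" state, so all access to the tape flows through member2.

    # drdmgr tape0
    View of Data from member member1

               Device Name: tape0
               Device Type: Served Tape
         Number of Servers: 2
               Server Name: member1      Server State: Not Server
               Server Name: member2      Server State: Server
        Access Member Name: member2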

15.1.3 I/O Barriers

I/O Barriers are used in a cluster to emulate how a single system would handle I/O on shutdown or failure. In other words, when a standalone system is shut down or crashes, any outstanding I/O will not be delivered to the device when the system reboots. An I/O Barrier is a combination of software and hardware/firmware. On the software end, the DRD creates a barrier by blocking I/O on nodes without quorum, while on the hardware/firmware end I/O Barriers are implemented at the CAM layer using bus resets, bus device resets, and persistent reservations.

When a cluster is running normally, all members are allowed to perform I/O to all storage devices in the cluster. Things get interesting when a member is shut down, crashes, or becomes incommunicative. (Note: the DRD also erects I/O Barriers for all devices new to the cluster as the cluster is being formed and when a member joins the cluster.)

When the Connection Manager[4] (CNX) senses a member leaving the cluster, either voluntarily or otherwise, it initiates a cluster membership transition to determine which nodes are allowed to continue in the cluster and which nodes are not. When the membership transition begins, the DRD blocks and queues any new I/O while draining any I/O that was previously queued. At least that's what is supposed to happen. However, since a member may be experiencing the very hardware or software errors that caused the membership transition in the first place, that member may not drain properly but rather continue to deliver I/O to the shared storage.

Once the cluster membership transition is complete, the CNX directs the DRD to erect an I/O Barrier around any node that is no longer a cluster member. The I/O Barrier guarantees that any I/O submitted to the DRD by a node that is no longer a member will not be committed to any cluster-wide shared storage.

For example, using the three-member cluster from Figure 15-1 earlier in the chapter, let's say that member1 has been writing to dsk4 when something happens that causes it to shut down. Table 15-1 shows that at time T1, all three members are able to perform I/O to any storage device in the cluster; then member1 experiences a glitch that causes it to shut down. At time T2, the cluster, as a whole, must go through a cluster membership transition, handled by the CNX on each member, to determine which members can continue to participate in the cluster. At this time, the DRD will block any new I/O and queue it to be submitted after the cluster shakeout is complete. Any I/O queued prior to the transition will be drained. After the CNX determines the new cluster membership (time T3), those members that have been voted out of the "cluster club" are locked out – in other words, I/O Barriers are erected and any Distributed Lock Manager[5] (DLM) locks are reconfigured.

Table 15-1: Cluster Membership Transition and the I/O Barrier

  T1 - member1, member2, and member3 are all cluster members; each can
       perform I/O to any storage device.

         member1: I/O proceeds
         member2: I/O proceeds
         member3: I/O proceeds

  T2 - The CNX detects a connectivity issue with member1 and starts a
       cluster membership transition. The DRD blocks and queues new I/O
       until the cluster transition is complete and drains I/O that was
       queued prior to the transition.

         member1: new I/O is blocked and queued, and previously queued
                  I/O is drained, unless prevented by a hardware and/or
                  software error
         member2: new I/O is blocked and queued; previously queued I/O
                  is drained
         member3: new I/O is blocked and queued; previously queued I/O
                  is drained

  T3 - The I/O Barrier is erected and any DLM locks are reconfigured.

         member1: I/O Barrier
         member2: new I/O is blocked and queued
         member3: new I/O is blocked and queued

  T4 - member1 is no longer a cluster member. Any I/O submitted by
       member1 after time T2 is rejected. I/O on member2 and member3
       continues.

         member1: I/O rejected
         member2: I/O proceeds
         member3: I/O proceeds

You may wonder what happens when the DRD "drains" previously queued I/O. We put the question to one of the TruCluster Server engineers, who explained that pending I/O requests are processed (read from or written to a shared storage device). If an I/O receives an error and needs to be retried, it is re-queued. I/Os that are returned to the DRD from the shared storage devices are returned to the calling thread. When the drain is complete, no outstanding I/Os remain to be processed and any new I/Os will have been queued.
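If you want to confirm the outcome of a membership transition from one of the surviving members, the clu_get_info(8) command (not otherwise discussed in this chapter) reports each member's state. The output below is only an approximation of what you might see after member1 has been removed from the cluster; the cluster name is hypothetical, and the exact fields and wording differ between versions.

    # clu_get_info
    Cluster name = accounting_cluster
    Number of members configured in this cluster = 3

      memberid = 1, hostname = member1, state = DOWN
      memberid = 2, hostname = member2, state = UP
      memberid = 3, hostname = member3, state = UP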

[1]See chapter 2 for more information on Distributed Raw Disk and TruCluster Production Server.

[2]MUNSA was not part of the shipping product. It was a special patch created for a customer.

[3]DVD-ROM devices are supported in V5.1 and later.

[4]See chapter 17 for more information on the Connection Manager.

[5]See chapter 18 for more information on the Distributed Lock Manager.



