The Architecture and Logic of SCSI


The SCSI communications architecture is a logical system of commands and responses exchanged between systems and storage, encompassing such things as addressing, naming, and error-correction procedures. The primary role of the SCSI architecture is to provide a reliable abstraction layer between systems and storage devices. Without a storage abstraction layer, every application would need to incorporate details about the operations of every storage device used with it. This situation would clearly be unacceptable, so it was necessary for the computer and storage industries to develop standard interfaces for both systems and storage devices. SCSI was developed as a standard storage abstraction for open-systems computers.

History of SCSI

SCSI began its development in 1981 with work done by Shugart Associates and NCR Corporation, which were both looking for ways to connect disk drives to systems. SCSI was originally called SASI, for Shugart Associates System Interface. In December of 1981, the ANSI standards organization created the X3T9.2 technical committee for the continued development of this work, and it was renamed SCSI. The first SCSI standard was approved and published in 1986. At the time, the protocol and the interconnect were tightly integrated as a single combined technology.

Since then SCSI has undergone two major expansions. SCSI-2 expanded the width of the SCSI bus and increased its clock speed. SCSI-3 articulated the architectural structure of SCSI communications, separated the various technology elements, and created multiple, separate committees to work on these elements in parallel.

The T10 SCSI Standards Committees

The ongoing standards work in SCSI is performed by the T10 Technical Committee of the International Committee for Information Technology Standards (INCITS). Readers interested in reading draft standards documents should visit the T10 website at http://www.t10.org/.

SCSI-3 Connection Independence

The most significant change between SCSI-2 and SCSI-3 was the abstraction of logical storage functions from the underlying connection technology. This separation allows SCSI processes to be transmitted as an application over virtually any kind of network. In general, you can assume the use of SCSI-3 logic, processes, and protocols in any type of network storage implementation. The SCSI-3 standards documents make it clear that SCSI protocols are intended to be implemented independently of the connecting technology.

SCSI Architecture Model

One of the key elements of the SCSI protocol is the communications architecture for exchanging storage commands and data, which is defined by the SCSI Architecture Model (SAM). This section covers the following topics from the SCSI architecture model:

  • Initiators and targets

  • Initiator and target ports

  • SCSI remote procedure call structure

  • Overlapped I/O

  • Asymmetrical communications in SCSI

  • Dual-mode controllers

  • No guarantee for ordered delivery

  • SCSI ports, IDs, and names

  • SCSI logical units

  • Tasks, task sets, and tagged tasks

  • SCSI nexus and connection relationships

  • Tagged command queuing

Initiators and Targets

The SCSI protocol is based on using distributed communications between initiators and targets. In general, initiators are implemented in HBAs and systems, and targets are implemented in devices and subsystems, but there is no reason to limit one's concept of storage I/O to systems and storage. Initiators and targets can be implemented many different ways as long as the roles of both are clear.

The initiator controller issues a command, and the target controller acts on the request and makes a response. Figure 6-1 shows an HBA initiator issuing a command to a disk drive controller target.

Figure 6-1. An HBA SCSI Initiator Communicating with a Disk Drive SCSI Target


NOTE

The SCSI specification is the ultimate source of clarity and confusion about SCSI. Written by engineers, it contains more than a few sentences like this one: "An initiator device name is a name (see 3.1.64) that is a SCSI device name (see 4.7.6) for a SCSI initiator device." As many times as I read this sentence, I cannot help but feel like a wiener dog chasing my own tail.

In an attempt to clarify some of the geek-speak in the SCSI spec, I've changed some of the terms to be more intuitive. For instance, I use the word controller where the standard language uses device. I simply prefer to call something that processes commands a controller as opposed to calling it a device, as the SCSI standard does. I also prefer to use connecting technology or network as opposed to service delivery subsystem.

The standard also refers to initiators and targets as clients and servers. Although SCSI uses a client/server communication model, SCSI components don't resemble what most people think of as clients and servers, especially in an environment that includes network attached storage (NAS) clients and servers.


Initiator and Target Ports

In the SCSI architecture model, the network ports that are part of initiator and target controllers are considered to be part of the connecting network and not part of the initiator or target controller function. This might seem counterintuitive, but it is the correct functional distinction. Network ports and the low-level drivers that control the formation and recognition of protocol data units (PDUs) and network operation are not involved with the storage processes of SCSI. They have connecting roles, but not storing roles. Figure 6-2 shows the same HBA and disk drive as Figure 6-1, but it identifies the communication ports as part of the connecting network, not part of the SCSI logical process.

Figure 6-2. Communication Ports in SCSI as Part of the Connecting Network


SCSI Remote Procedure Call Structure

The SCSI architecture model specifies a pair of controllers exchanging information as a sequence of commands and responses. Initiator-target communications use remote procedure calls, in which the initiator transmits a command, including any data being transferred and any associated execution parameters. This command is addressed to a specific target controller ID. The target processes the command and responds with any requested outgoing data and any accompanying information about the command's completion status, including errors or failures.

The communication between the initiator and the target is asynchronous, which means the command is sent and then both the initiator and target disengage and go about conducting their respective tasks. When the target has a response to send to the initiator, it notifies the initiator, and they reconnect to manage the data transfer.

SCSI was designed with the assumption that a host system would be multitasking. Therefore, there is an inherent understanding in the SCSI communications model that targets might have multiple tasks to perform and might be busy doing other work and, therefore, cannot respond to new commands immediately.

When the initiator finishes transferring a command to its network port, the command is said to be pending. Any data transfers that accompany the command (READ or WRITE commands) are processed, and the command is completed when the target sends a response to the initiator. The response can be a command completion confirmation or a status message, such as an error or failure. Figure 6-3 illustrates the sequence of the SCSI command/response mechanism.

Figure 6-3. Command/Response Sequence in SCSI
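To make the command/response structure concrete, here is a minimal Python sketch of the exchange. The class names (SCSICommand, SCSIResponse, Target) and their fields are illustrative assumptions rather than real SCSI data structures or any library's API; actual commands are encoded as command descriptor blocks (CDBs).

from dataclasses import dataclass

@dataclass
class SCSICommand:
    """Simplified command: an opcode, a starting logical block address,
    a transfer length in blocks, and any data to write."""
    opcode: str
    lba: int
    length: int
    data: bytes = b""

@dataclass
class SCSIResponse:
    """Completion status returned by the target, plus any data read."""
    status: str
    data: bytes = b""

class Target:
    """Toy target controller: acts on a command and returns a response."""
    def __init__(self, blocks, block_size=512):
        self.block_size = block_size
        self.media = bytearray(blocks * block_size)

    def execute(self, cmd):
        start = cmd.lba * self.block_size
        end = start + cmd.length * self.block_size
        if end > len(self.media):
            return SCSIResponse("CHECK CONDITION")          # error reported in the response
        if cmd.opcode == "WRITE":
            self.media[start:end] = cmd.data.ljust(end - start, b"\x00")
            return SCSIResponse("GOOD")
        if cmd.opcode == "READ":
            return SCSIResponse("GOOD", bytes(self.media[start:end]))
        return SCSIResponse("CHECK CONDITION")

# The initiator issues a command; the command is pending until the response arrives.
target = Target(blocks=1024)
response = target.execute(SCSICommand("WRITE", lba=10, length=1, data=b"hello"))
assert response.status == "GOOD"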


After sending a command, the initiator does not maintain constant contact with the target, which leaves the initiator available for other work. This means that responses from targets arrive as interrupts that are handled by the HBA's device driver. Considering the number of interrupts generated by storage I/O processes and the necessity of handling them correctly and quickly, it is easy to appreciate how important the interaction between the operating system kernel and the HBA's device driver software is. This is why it is so important to check the OS support for storage HBAs and the various device drivers that are available. Unless an OS level is explicitly named in a device driver's support list, it should be assumed that the HBA and driver will not work in the system.

NOTE

Storage device driver development tends to be more of a black art, or alchemical process, than a clinical, predictable, or scientific one. Experience working with devices and operating system kernels counts big time.

One of the challenges is the rate at which target-side responses occur, which determines the rate of host I/O interrupts. Target-side variables include cache size, RAID level, device capabilities, and file system fragmentation in addition to the application mix. The fact is, it is extremely difficult, if not impossible, to replicate actual scenarios in a development or test environment. That's one of the reasons things sometimes go awry.


Overlapped I/O

The SCSI communications architecture allows I/Os to be overlapped over the connecting bus or network. In other words, when a host initiator is finished issuing a command to one target, it can issue another command to the same or other targets before receiving a response for the first command. Responses from targets for different I/Os can be received in whatever order they finish processing.

For instance, an initiator can send a command to a tape drive that takes several minutes to complete and subsequently complete thousands of commands sent to disk drives that take only a fraction of a second each. Overlapped I/Os provide a high degree of parallelism for I/O communications and enable SCSI communications to be very efficient.
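As a rough Python sketch of overlapped I/O, the initiator below issues several commands back to back and matches each completion to its pending command by tag, regardless of the order in which the targets finish. The Initiator class and its methods are hypothetical, used only to illustrate the bookkeeping.

import random

class Initiator:
    """Toy initiator that overlaps I/Os: it keeps a table of pending commands,
    keyed by tag, instead of waiting for each command to complete."""
    def __init__(self):
        self.next_tag = 0
        self.pending = {}                       # tag -> description of outstanding command

    def issue(self, description):
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = description         # command is now pending; keep working
        return tag

    def complete(self, tag, status):
        description = self.pending.pop(tag)     # match the response to its command
        print(f"completed {description} (tag {tag}): {status}")

initiator = Initiator()
tags = [initiator.issue("tape SPACE to end of data"),    # may take minutes
        initiator.issue("disk READ lba=100"),
        initiator.issue("disk WRITE lba=200")]

random.shuffle(tags)                            # responses arrive in any order
for tag in tags:
    initiator.complete(tag, "GOOD")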

Asymmetrical Communications in SCSI

Unlike most data networks, the communications model for SCSI is not symmetrical. Both sides perform different functions and interact with distinctly different users/applications. Initiators work on behalf of applications, issuing commands and then waiting for targets to respond. Targets do their work on behalf of storage media, waiting for commands to arrive from initiators and then reading and writing data to media.

NOTE

If you think it seems goofy to say the user at the target side is storage media, you're right; it struck me as goofy, too, when I stumbled across that analysis. However, media is the thing at the end of the line on the target side. SCSI communications are asymmetrical because there is something intelligent (a processor running applications) at one end and something unintelligent (media) at the other end.

So, while it is possible to put intelligent processors in storage subsystems and devices, the SCSI protocol was developed to manipulate unintelligent media with block storage addresses. The ramification of all this is that an "intelligent" storage controller on the target side of the exchange is limited to manipulating commands and addresses and is not capable of working with data objects.


There are huge differences between typical host HBA processes and storage target processes in devices or subsystems. The HBA needs to carefully manage operating system details, while the target has to manage the details of communications with multiple external initiators as well as internal storage targets. It might be tempting to describe a subsystem port as an HBA located in a subsystem, but that is not really true. The two implementations are very different.

Dual-Mode Controllers

Controllers can be designed to implement both initiator and target functions. Dual-mode controllers are useful for implementing the SCSI EXTENDED COPY command, discussed later in this chapter. Dual-mode controllers might also be implemented in SAN routers and virtualization appliances, as discussed in Chapter 12, "Storage Virtualization: The Power in Volume Management Software and SAN Virtualization Systems."

The simplest way to picture a dual-mode controller is a circuit board with two different physical ports, one functioning as an initiator and the other as a target. A storage subsystem could hypothetically use this type of design.

Another implementation model for a dual-mode controller is to use a single network port that operates as both an initiator and a target. Obviously the controller in this case needs to differentiate between the two different roles.
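As a sketch of the single-port, dual-mode idea, the hypothetical controller below accepts a copy request in its target role and then, in its initiator role, issues the READ and WRITE commands needed to carry it out. The class names, the dictionary-style commands, and the StubTarget backend are all assumptions made for illustration; this is not how EXTENDED COPY is actually encoded.

class StubTarget:
    """Minimal downstream target, present only to make the example runnable."""
    def __init__(self):
        self.blocks = {}

    def execute(self, cmd):
        if cmd["opcode"] == "READ":
            return self.blocks.get(cmd["lba"], b"\x00" * 512)
        if cmd["opcode"] == "WRITE":
            self.blocks[cmd["lba"]] = cmd["data"]
        return "GOOD"

class DualModeController:
    """Hypothetical controller that plays both roles through one port."""
    def __init__(self, backend):
        self.backend = backend                  # another target this controller can drive

    # Target role: a command arrives addressed to this controller.
    def receive_command(self, cmd):
        if cmd["opcode"] == "COPY":
            # Initiator role: issue a READ to the downstream target...
            data = self.issue_command({"opcode": "READ", "lba": cmd["src_lba"]})
            # ...and then a WRITE, acting as an initiator both times.
            self.issue_command({"opcode": "WRITE", "lba": cmd["dst_lba"], "data": data})
            return "GOOD"
        return "CHECK CONDITION"

    # Initiator role: send a command to a downstream target.
    def issue_command(self, cmd):
        return self.backend.execute(cmd)

mover = DualModeController(StubTarget())
print(mover.receive_command({"opcode": "COPY", "src_lba": 10, "dst_lba": 20}))   # -> GOOD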

No Guarantee for Ordered Delivery

It is often assumed that SCSI provides in-order delivery to maintain data integrity. In-order delivery was traditionally provided by the SCSI bus and, therefore, was not needed by the SCSI protocol layer. In SCSI-3, the SCSI protocol assumes that proper ordering is provided by the underlying connection technology. In other words, the SCSI protocol does not provide its own reordering mechanism; the network is responsible for reordering transmission frames that are received out of order. This is the main reason why TCP was considered essential for the iSCSI protocol, which transports SCSI commands and data transfers over IP networking equipment: TCP provides ordered delivery, whereas other transport protocols, such as UDP, do not.

SCSI Ports, IDs, and Names

SCSI controllers can communicate over multiple networks simultaneously through different ports. They can also communicate through multiple ports connected to a single network. With this kind of flexibility built into the architecture, there obviously needs to be a way to identify ports distinctly to ensure safe, consistent operations.

The SCSI standard provides a mechanism for identifying controllers and their ports. All controller ports must have a unique identifier (port ID) on each network they are connected to. For example, the port ID on the SCSI bus is a number between 0 and 15. The port ID on a FC fabric network is a 24-bit network address.

In addition to the port ID, there is also a port name that identifies each port uniquely. In FC, this port name is known as the worldwide name (WWN), a 64-bit value, usually written in hexadecimal, that is assigned to each port when the controller is manufactured. These worldwide name values provide a mechanism to identify and address initiators and targets in a SAN.
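A small Python sketch of the naming scheme: the helper below renders a 64-bit worldwide name in the familiar colon-separated hexadecimal form, alongside a 24-bit fabric port ID. Both example values and the format_wwn helper are made up for illustration.

def format_wwn(value):
    """Render a 64-bit worldwide name as colon-separated hexadecimal bytes."""
    return ":".join(f"{b:02x}" for b in value.to_bytes(8, byteorder="big"))

port_wwn = 0x210000E08B05AF67        # hypothetical WWPN, burned in at manufacture
fabric_port_id = 0x0A1B2C            # hypothetical 24-bit address assigned by the FC fabric

print(format_wwn(port_wwn))          # 21:00:00:e0:8b:05:af:67
print(f"{fabric_port_id:06x}")       # 0a1b2c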

NOTE

Naming in storage networking can be confusing. In addition to the port name, there is also an optional node name, known as the worldwide node name (WWNN), which sometimes forces the use of the acronym WWPN to differentiate the port name from the node name.

The idea of identifying the node could be useful as a way to identify a system or subsystem uniquely, but there are several tricky issues involved. For example, the WWNN is intended to identify a system, but it is generated by the HBA, not by the system itself. The problem is that a system might have multiple HBAs for different purposes, and those HBAs could each have a different WWNN. There might be cases when it would make sense to have a single WWNN for all the HBAs in the system, but there might be other cases where it is preferable to have different WWNNs.


Before SANs, there was not a big need for naming controllers, because the address space was limited to fewer than 16 controllers on a parallel SCSI bus. The parallel SCSI method of setting numerical values or physically positioning jumpers obviously could not work in a SAN environment with millions of addresses.

The situation becomes clear when considering recovery from a disaster such as a large-scale loss of power, where it is paramount to recreate storage configurations quickly and accurately. Without a way to uniquely identify storage resources, it could be extremely difficult to rebuild the logical structure of the SAN. In other words, it's essential to have a persistent method of discovering and addressing storage resources. The combination of WWNs and the use of name services in SAN switches provides this mechanism.

SCSI Logical Units

SCSI targets have logical units that provide the processing context for SCSI commands. Essentially, a logical unit is a virtual machine (or virtual controller) that handles SCSI communications on behalf of real or virtual storage devices in a target. Commands received by targets are directed to the appropriate logical unit by a task router in the target controller.

The work of the logical unit is split between two different functions: the device server and the task manager. The device server executes commands received from initiators and is responsible for detecting and reporting errors that might occur. The task manager is the work scheduler for the logical unit, determining the order in which commands are processed in the queue and responding to requests from initiators about pending commands.

The logical unit number (LUN) identifies a specific logical unit (think virtual controller) in a target. Although we tend to use the term LUN to refer to a real or virtual storage device, a LUN is an access point for exchanging commands and status information between initiators and targets. Metaphorically, a logical unit is a "black box" processor, and the LUN is simply a way to identify SCSI black boxes.

Logical units are architecturally independent of target ports and can be accessed through any of the target's ports, via a LUN. A target must have at least one LUN, LUN 0, and might optionally support additional LUNs. For instance, a disk drive might use a single LUN, whereas a subsystem might allow hundreds of LUNs to be defined.

The process of provisioning storage in a SAN storage subsystem involves defining a LUN on a particular target port and then assigning that particular target/LUN pair to a specific logical unit. An individual logical unit can be represented by multiple LUNs on different ports. For instance, a logical unit could be accessed through LUN 1 on Port 0 of a target and also accessed as LUN 12 on port 1 of the same target. Figure 6-4 shows a two-port subsystem with a single logical unit being accessed this way.

Figure 6-4. A Single Logical Unit Being Accessed Through Two Different Subsystem Ports Using Two Different LUNs
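The following Python sketch shows the mapping that Figure 6-4 illustrates: a task router in the target directs incoming commands to logical units by (port, LUN) pair, so a single logical unit can be reached as LUN 1 on one port and LUN 12 on another. The class and method names (LogicalUnit, SubsystemTarget, provision, route) are hypothetical.

class LogicalUnit:
    """Virtual controller inside the target; its device server handles the command."""
    def __init__(self, name):
        self.name = name

    def execute(self, cmd):
        return f"{self.name} handled {cmd}"

class SubsystemTarget:
    """Target whose task router directs commands to logical units by (port, LUN)."""
    def __init__(self):
        self.lun_map = {}                                  # (port, lun) -> LogicalUnit

    def provision(self, port, lun, logical_unit):
        self.lun_map[(port, lun)] = logical_unit           # bind the LUN to a logical unit

    def route(self, port, lun, cmd):                       # the task router
        return self.lun_map[(port, lun)].execute(cmd)

# One logical unit exposed through two ports under two different LUNs (as in Figure 6-4).
lu = LogicalUnit("LU-A")
target = SubsystemTarget()
target.provision(port=0, lun=1, logical_unit=lu)
target.provision(port=1, lun=12, logical_unit=lu)
assert target.route(0, 1, "READ") == target.route(1, 12, "READ")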


NOTE

I've always thought the word "provisioning" was a bit off the mark for this, given its preexisting usage in data networking. The storage process described above always seemed much more like network protocol "binding" than it did network provisioning. Here is a definition for storage provisioning: it is the process of binding a virtual storage machine to a specific set of storage resources and making them available by one or more LUN IDs across one or more target ports.


Tasks, Task Sets, and Tagged Tasks

Commands in logical units are managed as tasks. SCSI tasks are placed in one or more queues, which are called task sets. The device server executes the tasks, and the task manager does what it sounds like: it manages the various tasks in task sets. The use of multiple queues in SCSI provides both prioritization and optimization of storage I/O. SCSI queue management provides a multitasking environment for storage I/O processes to match the I/O requirements of multiprocessing servers.

SCSI's queuing capabilities are much more flexible than simple first-in, first-out (FIFO) or last-in, first-out (LIFO) queues, where low-priority applications can create I/O system bottlenecks. Instead, SCSI's queuing accommodates many different application I/O requirements simultaneously. This is one way the SCSI architecture supports high-throughput, multiprocessing computing environments. Figure 6-5 shows how multiple commands are managed as tasks in multiple task sets by a SCSI logical unit.

Figure 6-5. SCSI Commands Are Managed as Tasks in Task Sets by a SCSI Logical Unit


Tasks are further designated as being tagged or untagged. Tagging allows a group of commands to be transferred from an initiator to a logical unit using a sequential identifier for each command. Tagged commands and their associated tasks are placed together in a task set where the task manager can change the order in which they are processed. As each task completes, the logical unit responds to the initiator using the tag to identify the task. Tagging is used by multitasking applications that can have many independent I/Os, such as database applications.
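A rough Python sketch of tagged tasks in a task set follows. The LogicalUnit class and its rule for picking the next task are invented purely to show that the task manager may process tagged tasks in a different order than they arrived and that each completion is reported back by its tag.

from collections import deque

class LogicalUnit:
    """Toy logical unit: the task manager queues tagged tasks in a task set,
    and the device server executes whichever task the manager picks next."""
    def __init__(self):
        self.task_set = deque()

    def enqueue(self, tag, command):
        self.task_set.append((tag, command))               # tagged task joins the task set

    def run_one(self):
        # The task manager may choose any task, not just the oldest; this toy rule
        # prefers READs over WRITEs simply to show that the order can change.
        reads = [task for task in self.task_set if task[1].startswith("READ")]
        tag, command = reads[0] if reads else self.task_set[0]
        self.task_set.remove((tag, command))
        return tag, f"{command} -> GOOD"                   # completion identified by tag

lu = LogicalUnit()
lu.enqueue(1, "WRITE lba=500")
lu.enqueue(2, "READ lba=8")
lu.enqueue(3, "READ lba=9")

while lu.task_set:
    print(lu.run_one())                                    # tags 2 and 3 complete before tag 1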

SCSI Nexus and Connection Relationships

If you have been reading between the lines in the preceding sections, it might have occurred to you that there are multiple ways initiators can identify the communications they have with storage. The nexus object describes the initiator/storage communication relationship.

There are three nexus objects in SCSI:

  • Initiator/target (an I_T nexus)

  • Initiator/target/LUN (an I_T_L nexus)

  • Initiator/target/LUN/tag (an I_T_L_Q nexus)

The type of nexus object used determines the number of concurrent commands that can be pending at any time. An I_T nexus allows only a single command between an initiator and a specific target. An I_T_L nexus allows a single command between an initiator and a specific logical unit. An I_T_L_Q nexus allows many possible commands to be pending, as long as the commands are tagged.
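The three nexus relationships can be pictured as tuples of increasing specificity, as in the Python sketch below; the namedtuple names are illustrative, and the initiator and target identifiers are made up.

from collections import namedtuple

IT_Nexus = namedtuple("IT_Nexus", "initiator target")
ITL_Nexus = namedtuple("ITL_Nexus", "initiator target lun")
ITLQ_Nexus = namedtuple("ITLQ_Nexus", "initiator target lun tag")

it = IT_Nexus("hba0", "array1")                     # one command between initiator and target
itl = ITL_Nexus("hba0", "array1", lun=3)            # one command between initiator and logical unit

# With tags, many commands can be pending at once, each distinguished by its own nexus.
outstanding = {ITLQ_Nexus("hba0", "array1", 3, tag) for tag in range(8)}
print(len(outstanding))                             # 8 concurrently pending tagged commands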

NOTE

The SCSI nexus defines the SCSI path elements that are used for storage I/O processes, including multipathing. This definition of path is not necessarily intuitive to many networking people, who are accustomed to thinking about paths in TCP/IP networks. The underlying connecting bus or network is transparent to SCSI logical processes, which means that the SCSI path cannot include the network. The SCSI nexus entities, therefore, define the complete SCSI path.


Tagged Command Queuing

The most important feature of tagging in SCSI is tagged command queuing (TCQ), a mechanism that allows the logical unit's task manager to reorder tasks to optimize the performance of a storage device or subsystem.

Tagged command queuing was developed to optimize the performance of mechanical components in disk drives, particularly the disk arms and actuators. The basic idea is to reorder a group of commands to reduce the overall latency involved in seeking tracks on disk platters.

Assume there are 20 tagged tasks in a task set, each with a directive to read or write data across a random distribution of tracks on disk media. Without the ability to rearrange tasks, the seek time latency would be the average seek time for the drive. Using command queuing, the tasks could be structured so the actuator moves the minimal amount for each task as it moves from one task's track to the track of its nearest neighbor. Figure 6-6 contrasts the impact on seek times between targets that use tagged command queuing and those that do not.

Figure 6-6. Differences in Seek Processes Used for Tagged Command Queuing and Untagged Commands


Tagged command queuing significantly reduces seek time latency for disk I/O operations. The degree of improvement depends heavily on the queue depth (the number of tasks in a task set). In general, the greater the queue depth is, the shorter the average seek time will be and the better storage performance will be.
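To see the effect, here is a small Python sketch that compares total actuator travel for tasks serviced in arrival order against the same tasks reordered so that each seek goes to the nearest remaining track. The nearest-track-first rule is only a simple stand-in for the scheduling a real drive's firmware performs, and the track numbers are random; increasing the number of queued tasks shows the queue-depth effect described above.

import random

def nearest_track_first(tracks, start=0):
    """Reorder queued tasks so each seek moves to the closest remaining track."""
    remaining, order, position = list(tracks), [], start
    while remaining:
        nearest = min(remaining, key=lambda track: abs(track - position))
        remaining.remove(nearest)
        order.append(nearest)
        position = nearest
    return order

def total_seek_distance(order, start=0):
    """Sum the track-to-track distances the actuator travels for a given order."""
    distance, position = 0, start
    for track in order:
        distance += abs(track - position)
        position = track
    return distance

tasks = [random.randrange(10_000) for _ in range(20)]      # 20 tagged tasks, random tracks

fifo = total_seek_distance(tasks)                           # untagged: arrival order
reordered = total_seek_distance(nearest_track_first(tasks)) # tagged: task manager reorders
print(f"seek distance in arrival order: {fifo}, after reordering: {reordered}")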


