iSCSI Operational Details

This section explores the operational details of iSCSI in depth. It complements and builds upon the iSCSI overview provided in Chapter 4, "Overview of Modern SCSI Networking Protocols." You are also encouraged to consult IETF RFC 3347 for an overview of the requirements that shaped the development of iSCSI.

iSCSI Addressing Scheme

iSCSI implements SCSI device names, port names, and port identifiers. Both device and port names are required by iSCSI. The SCSI device name is equivalent to the iSCSI node name. Three types of iSCSI node names are currently defined: iSCSI qualified name (IQN), extended unique identifier (EUI), and network address authority (NAA). Each iSCSI node name begins with a three-letter designator that identifies the type. RFC 3720 requires all iSCSI node names to be globally unique, but some node name types support locally administered names that are not globally unique. As with FC node names, global uniqueness can be guaranteed only by using globally administered names. To preserve global uniqueness, take care when moving iSCSI entities from one operating system image to another.

Currently, the IQN type is most prevalent. The IQN type provides globally unique iSCSI node names. The IQN format has four components: a type designator followed by a dot, a date code followed by a dot, the reverse domain name of the naming authority, and an optional device specific string that, if present, is prefixed by a colon. The type designator is "iqn". The date code indicates the year and month that the naming authority registered its domain name in the DNS. The reverse domain name identifies the naming authority. Sub-domains within the naming authority's primary domain are permitted; they enable delegation of naming tasks to departments, subsidiaries, and other subordinate organizations within the naming authority. The optional device specific string provides a unique iSCSI node name for each SCSI device. If the optional device specific string is not used, only one SCSI device can exist within the naming authority's namespace (as identified by the reverse domain name itself). Therefore, the optional device specific string is a practical requirement for real-world deployments. iSCSI node names of type IQN are variable in length up to a maximum of 223 characters. Examples include

iqn.1987-05.com.cisco:host1

iqn.1987-05.com.cisco.apac.singapore:ccm-host1

iqn.1987-05.com.cisco.erp:dr-host8-vpar1
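The IQN layout described above can be split mechanically into its components. The following is a minimal sketch rather than a complete validator: the names Iqn and parse_iqn are illustrative assumptions, and character-set rules are not checked.

```python
from typing import NamedTuple, Optional


class Iqn(NamedTuple):
    date_code: str                 # year and month, for example "1987-05"
    naming_authority: str          # reverse domain name, for example "com.cisco"
    device_string: Optional[str]   # optional part after the first colon


def parse_iqn(name: str) -> Iqn:
    """Split an IQN-type node name into the components described in the text."""
    if len(name) > 223:
        raise ValueError("IQN exceeds 223 characters")
    # Only the first colon separates the naming authority from the
    # device-specific string; any later colon belongs to that string.
    head, colon, device = name.partition(":")
    parts = head.split(".", 2)
    if len(parts) != 3 or parts[0] != "iqn":
        raise ValueError("expected iqn.<date>.<reverse-domain>[:<string>]")
    _, date_code, authority = parts
    year, dash, month = date_code.partition("-")
    if not (dash and len(year) == 4 and len(month) == 2
            and year.isdigit() and month.isdigit()):
        raise ValueError("date code must look like YYYY-MM")
    if not authority:
        raise ValueError("reverse domain name is missing")
    return Iqn(date_code, authority, device if colon else None)
```

Sub-domains stay inside naming_authority, so iqn.1987-05.com.cisco.apac.singapore:ccm-host1 yields the authority com.cisco.apac.singapore.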

The EUI type provides globally unique iSCSI node names assuming that the Extension Identifier sub-field within the EUI-64 string is not locally administered (see Chapter 5, "OSI Physical and Data-Link Layers"). The EUI format has two components: a type designator followed by a dot and a device specific string. The type designator is "eui". The device specific string is a valid IEEE EUI-64 string. Because the length of an EUI-64 string is eight bytes, and EUI-64 strings are expressed in hexadecimal, the length of an iSCSI node name of type EUI is fixed at 20 characters. For example

eui.02004567A425678D

The NAA type is based on the ANSI T11 NAA scheme. As explained in Chapter 5, the ANSI T11 NAA scheme supports many formats. Some formats provide global uniqueness; others do not. In iSCSI, the NAA format has two components: a type designator followed by a dot and a device specific string. The type designator is "naa". The device specific string is a valid ANSI T11 NAA string. Because the length of ANSI T11 NAA strings can be either 8 or 16 bytes, iSCSI node names of type NAA are variable in length. ANSI T11 NAA strings are expressed in hexadecimal, so the length of an iSCSI node name of type NAA is either 20 or 36 characters. Examples include

naa.52004567BA64678D

naa.62004567BA64678D0123456789ABCDEF
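The length rules for the three node name types can be checked in a few lines. This is a sketch under the assumptions stated in the text (223-character IQN limit, 20-character EUI names, 20- or 36-character NAA names); the function name node_name_type is hypothetical, and the check does not verify that the hexadecimal string is a valid EUI-64 or NAA value.

```python
import string

_HEX_DIGITS = set(string.hexdigits)


def node_name_type(name: str) -> str:
    """Return "iqn", "eui", or "naa" after basic length and hex checks."""
    designator, dot, rest = name.partition(".")
    if not dot:
        raise ValueError("missing type designator")
    if designator == "iqn":
        if len(name) > 223:
            raise ValueError("IQN exceeds 223 characters")
        return "iqn"
    if designator in ("eui", "naa"):
        if not rest or not set(rest) <= _HEX_DIGITS:
            raise ValueError("device-specific string must be hexadecimal")
        # 4 characters of designator plus dot, then 16 or 32 hex digits.
        allowed = {"eui": (20,), "naa": (20, 36)}[designator]
        if len(name) not in allowed:
            raise ValueError("unexpected length for this name type")
        return designator
    raise ValueError("unknown type designator")
```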

iSCSI allows the use of aliases for iSCSI node names. The purpose of an alias is to provide an easily recognizable, meaningful tag that can be displayed in tools, utilities, and other user interfaces. Although IQNs are text-based and somewhat intuitive, the device-specific portion might need to be very long and even cryptic to adequately ensure a scalable nomenclature in large iSCSI environments. Likewise, EUI and NAA formatted node names can be very difficult for humans to interpret. Aliases solve this problem by associating a human-friendly tag with each iSCSI node name. Aliases are used only by humans. Aliases must not be used for authentication or authorization. Aliases are variable in length up to a maximum of 255 characters.

SCSI port names and identifiers are handled somewhat differently in iSCSI as compared to other SAM Transport Protocols. iSCSI uses a single string as both the port name and port identifier. The string is globally unique, and so the string positively identifies each port within the context of iSCSI. This complies with the defined SAM functionality for port names. However, the string does not contain any resolvable address to facilitate packet forwarding. Thus, iSCSI requires a mapping of its port identifiers to other port types that can facilitate packet forwarding. For this purpose, iSCSI employs the concept of a network portal. Within an iSCSI device, an Ethernet (or other) interface configured with an IP address is called a network portal. Network portals facilitate packet forwarding, while iSCSI port identifiers serve as session endpoint identifiers. Within an initiator device, each network portal is identified by its IP address. Within a target device, each network portal is identified by its IP address and listening TCP port (its socket). Network portals that share compatible operating characteristics may form a portal group. Within a target device, each portal group is assigned a target portal group tag (TPGT).

SCSI ports are implemented differently for iSCSI initiators versus iSCSI targets. Upon resolution of a target iSCSI node name to one or more sockets, an iSCSI initiator logs into the target. After login completes, a SCSI port is dynamically created within the initiator. In response, the iSCSI port name and identifier are created by concatenating the initiator's iSCSI node name, the letter "i" (indicating this port is contained within an initiator device) and the initiator session identifier (ISID). The ISID is 6 bytes long and is expressed in hexadecimal. These three fields are comma-separated. For example

iqn.1987-05.com.cisco:host1,i,0x00023d000002

The order of these events might seem counter-intuitive, but recall that iSCSI port identifiers do not facilitate packet forwarding; network portals do. So, iSCSI port identifiers do not need to exist before the iSCSI login. Essentially, iSCSI login signals the need for a SCSI port, which is subsequently created and then used by the SCSI Application Layer (SAL).

Recall from Chapter 5 that SAM port names must never change. One might think that iSCSI breaks this rule, because initiator port names seem to change regularly. iSCSI creates and destroys port names regularly, but never changes port names. iSCSI generates each new port name in response to the creation of a new SCSI port. Upon termination of an iSCSI session, the associated initiator port is destroyed. Because the iSCSI port name does not change during the lifetime of its associated SCSI port, iSCSI complies with the SAM persistence requirement for port names.

In a target device, SCSI ports are also created dynamically in response to login requests. The target port name and identifier are created by concatenating the target's iSCSI node name, the letter "t" (indicating this port is contained within a target device) and the TPGT. The target node infers the appropriate TPGT from the IP address at which the login request is received. The TPGT is 2 bytes long and is expressed in hexadecimal. These three fields are comma-separated. For example

iqn.1987-05.com.cisco:array1,t,0x4097
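The two concatenation rules (initiator and target) can be sketched as follows. The function names are illustrative assumptions; the output format simply mirrors the examples shown in the text.

```python
def initiator_port_name(node_name: str, isid: int) -> str:
    """Node name, the letter "i", and the 6-byte ISID in hexadecimal."""
    if not 0 <= isid < 1 << 48:
        raise ValueError("ISID must fit in 6 bytes")
    return f"{node_name},i,0x{isid:012x}"


def target_port_name(node_name: str, tpgt: int) -> str:
    """Node name, the letter "t", and the 2-byte TPGT in hexadecimal."""
    if not 0 <= tpgt < 1 << 16:
        raise ValueError("TPGT must fit in 2 bytes")
    return f"{node_name},t,0x{tpgt:04x}"


print(initiator_port_name("iqn.1987-05.com.cisco:host1", 0x00023D000002))
print(target_port_name("iqn.1987-05.com.cisco:array1", 0x4097))
```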

All network portals within a target portal group share the same iSCSI port identifier and represent the same SCSI port. Because iSCSI target port names and identifiers are based on TPGTs, and because a network portal may operate independently (not part of a portal group), a TPGT must be assigned to each network portal that is not part of a portal group. Thus, a network portal that operates independently forms a portal group of one.

Some background information helps us fully understand ISID and TPGT usage. According to the SAM, the relationship between a SCSI initiator port and a SCSI target port is known as the initiator-target nexus (I_T nexus). The I_T nexus concept underpins all session-oriented constructs in all modern storage networking technologies. At any point in time, only one I_T nexus can exist between a pair of SCSI ports. According to the SAM, an I_T nexus is identified by the conjunction of the initiator port identifier and the target port identifier. The SAM I_T nexus is equivalent to the iSCSI session. Thus, only one iSCSI session can exist between an iSCSI initiator port identifier and an iSCSI target port identifier at any point in time.

iSCSI initiators adhere to this rule by incorporating the ISID into the iSCSI port identifier. Each new session is assigned a new ISID, which becomes part of the iSCSI port identifier of the newly created SCSI port, which becomes part of the I_T nexus. Each ISID is unique within the context of an initiator-target-TPGT triplet. If an initiator has an active session with a given target device and establishes another session with the same target device via a different target portal group, the initiator may reuse any active ISID. In this case, the new I_T nexus is formed between a unique pair of iSCSI port identifiers because the target port identifier includes the TPGT. Likewise, any active ISID may be reused for a new session with a new target device. Multiple iSCSI sessions may exist simultaneously between an initiator device and a target device as long as each session terminates on a different iSCSI port identifier (representing a different SCSI port) on at least one end of the session. Initiators accomplish this by connecting to a different target portal group or assigning a new ISID. Note that RFC 3720 encourages the reuse of ISIDs in an effort to promote initiator SCSI port persistence for the benefit of applications, and to facilitate target recognition of initiator SCSI ports in multipath environments.

Note

RFC 3720 officially defines the I_T nexus identifier as the concatenation of the iSCSI initiator port identifier and the iSCSI target port identifier (initiator node name + "i" + ISID + target node name + "t" + TPGT). This complies with the SAM definition of I_T nexus identifier. However, RFC 3720 also defines a session identifier (SSID) that can be used to reference an iSCSI session. The SSID is defined as the concatenation of the ISID and the TPGT. Because the SSID is ambiguous, it has meaning only in the context of a given initiator-target pair.
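The one-session-per-I_T-nexus rule can be modeled as a set keyed by the pair of port identifiers. The sketch below is a toy model, not an initiator implementation; the class name SessionTable and its methods are hypothetical. It shows that an active ISID can be reused toward a different target portal group, but not toward the same one.

```python
from typing import Set, Tuple

Nexus = Tuple[str, str]  # (initiator port identifier, target port identifier)


class SessionTable:
    """Tracks active sessions keyed by I_T nexus."""

    def __init__(self) -> None:
        self._active: Set[Nexus] = set()

    def open(self, initiator_node: str, isid: int,
             target_node: str, tpgt: int) -> Nexus:
        nexus = (f"{initiator_node},i,0x{isid:012x}",
                 f"{target_node},t,0x{tpgt:04x}")
        if nexus in self._active:
            raise ValueError("an I_T nexus already exists between these ports")
        self._active.add(nexus)
        return nexus

    def close(self, nexus: Nexus) -> None:
        self._active.discard(nexus)


table = SessionTable()
host, array = "iqn.1987-05.com.cisco:host1", "iqn.1987-05.com.cisco:array1"
table.open(host, 0x00023D000002, array, 1)
table.open(host, 0x00023D000002, array, 2)   # same ISID, different TPGT: allowed
```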

iSCSI may be implemented as multiple hardware and software components within a single network entity. As such, coordination of ISID generation in an RFC-compliant manner across all involved components can be challenging. For this reason, RFC 3720 requires a single component to be responsible for the coordination of all ISID generation activity. To facilitate this rule, the ISID format is flexible. It supports a namespace hierarchy that enables coordinated delegation of ISID generation authority to various independent components within the initiator entity. Figure 8-1 illustrates the general ISID format.

Figure 8-1. General iSCSI ISID Format

A brief description of each field follows:

T: This is 2 bits long and indicates the format of the ISID. Though not explicitly stated in RFC 3720, T presumably stands for type.
A: This is 6 bits long and may be concatenated with the B field. Otherwise, the A field is reserved.
B: This is 16 bits long and may be concatenated with the A field or the C field. Otherwise, the B field is reserved.
C: This is 8 bits long and may be concatenated with the B field or the D field. Otherwise, the C field is reserved.
D: This is 16 bits long and may be concatenated with the C field or used as an independent field. Otherwise, the D field is reserved.

Table 8-1 summarizes the possible values of T and the associated field descriptions.

Table 8-1. ISID Format Descriptions
T Value	ISID Format	Field Descriptions
00b	OUI	A & B form a 22-bit field that contains the OUI of the vendor of the component that generates the ISID. The I/G and U/L bits are omitted. C & D form a 24-bit qualifier field that contains a value generated by the component.
01b	EN	A is reserved. B & C form a 24-bit field that contains the IANA enterprise number (EN) of the vendor of the component that generates the ISID. D is a 16-bit qualifier field that contains a value generated by the component.
10b	Random	A is reserved. B & C form a 24-bit field that contains a value that is randomly generated by the component responsible for ISID coordination. A unique value is assigned to each subordinate component that generates ISIDs. D is a 16-bit qualifier field that contains a value generated by subordinate components.
11b	Reserved	All fields are reserved. This T value is currently not used.
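The field widths and the T-dependent groupings in Table 8-1 translate directly into bit operations. The sketch below assumes the 6-byte ISID is in network byte order with T in the two most significant bits of the first byte; the names IsidFields, unpack_isid, and describe_isid are illustrative.

```python
from typing import Dict, NamedTuple, Union


class IsidFields(NamedTuple):
    t: int  # 2 bits
    a: int  # 6 bits
    b: int  # 16 bits
    c: int  # 8 bits
    d: int  # 16 bits


def unpack_isid(isid: bytes) -> IsidFields:
    if len(isid) != 6:
        raise ValueError("ISID must be exactly 6 bytes")
    return IsidFields(
        t=isid[0] >> 6,
        a=isid[0] & 0x3F,
        b=int.from_bytes(isid[1:3], "big"),
        c=isid[3],
        d=int.from_bytes(isid[4:6], "big"),
    )


def describe_isid(isid: bytes) -> Dict[str, Union[str, int]]:
    """Group the raw fields according to the T value, as in Table 8-1."""
    f = unpack_isid(isid)
    if f.t == 0b00:   # A and B form the 22-bit OUI, C and D the qualifier
        return {"format": "OUI", "oui": (f.a << 16) | f.b,
                "qualifier": (f.c << 16) | f.d}
    if f.t == 0b01:   # B and C form the 24-bit enterprise number
        return {"format": "EN", "enterprise_number": (f.b << 8) | f.c,
                "qualifier": f.d}
    if f.t == 0b10:   # B and C form the 24-bit random value
        return {"format": "Random", "random": (f.b << 8) | f.c,
                "qualifier": f.d}
    return {"format": "Reserved"}
```

Applied to the ISID from the earlier initiator port example, describe_isid(bytes.fromhex("00023d000002")) reports the OUI format with OUI 0x00023d and qualifier 0x000002.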

The target also assigns a session identifier to each new session. This is known as the target session identifying handle (TSIH). During login for a new session, the initiator uses a TSIH value of zero. The target generates the TSIH value during login and sends the new value to the initiator in the final login response. In all subsequent packets, the assigned TSIH is used by the initiator to enable the target to associate received packets with the correct session. The TSIH is two bytes long, but the format of the TSIH is not defined in RFC 3720. Each target determines its own TSIH format. For more information about iSCSI device names, port names, port identifiers, and session identifiers, readers are encouraged to consult IETF RFCs 3720, 3721, 3722, and 3980, and ANSI T10 SAM-3.

iSCSI Name Assignment and Resolution

iSCSI is designed to operate with a single node name per operating system image. iSCSI node names can be preset in iSCSI hardware at time of manufacture, dynamically created by iSCSI software at time of installation, or manually assigned by the storage network administrator during initial configuration. If the node name is preset in iSCSI hardware at time of manufacture, RFC 3720 mandates that the storage network administrator be given a way to change the node name. As previously discussed, iSCSI port names are dynamically assigned by the iSCSI hardware or software upon creation of a SCSI port.

iSCSI name resolution is tied to iSCSI name discovery. Because three methods of target discovery exist (see Chapter 3, "Overview of Network Operating Principles"), three methods of name resolution exist: manual, semi-manual, and automated. A manually configured iSCSI initiator node is given the iSCSI node names of all targets to which access is permitted. The socket(s) associated with each target node also must be manually configured. In this environment, name resolution occurs within the initiator node and does not involve any network service.

A semi-manually configured iSCSI initiator node is given the socket(s) to which the SendTargets command should be sent. The SendTargets response contains the TPGT for the network portal at which the request was received. The SendTargets response also contains the iSCSI node name of each target accessible via that portal group. This constitutes reverse name resolution (that is, address-to-name resolution). The SendTargets response also may contain additional socket and TPGT information for each target node name. This constitutes normal name resolution (that is, name-to-address resolution).
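To make the two directions of resolution concrete, the sketch below turns a SendTargets response into a name-to-portal map. It assumes the response has already been split into key=value strings and that it uses the TargetName and TargetAddress text keys in the form address:port,tag, which is my reading of RFC 3720; the function name and the default of TCP port 3260 when the port is omitted are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple

Portal = Tuple[str, int, int]  # (address, TCP port, TPGT)


def parse_send_targets(pairs: List[str]) -> Dict[str, List[Portal]]:
    """Map each target node name to the portals listed after it."""
    targets: Dict[str, List[Portal]] = {}
    current: Optional[str] = None
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key == "TargetName":
            current = value
            targets.setdefault(current, [])
        elif key == "TargetAddress":
            if current is None:
                raise ValueError("TargetAddress before any TargetName")
            addr_port, _, tpgt = value.rpartition(",")
            host, colon, port = addr_port.rpartition(":")
            # No port given, or the last colon is inside a bracketed
            # IPv6 literal: fall back to the assumed default port.
            if not colon or addr_port.endswith("]"):
                host, port = addr_port, "3260"
            targets[current].append((host, int(port), int(tpgt)))
    return targets
```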

When using SLP with a DA, each target entity registers its target nodes in the DA store. Each target node is registered independently as a service URL containing the iSCSI node name, IP address, TCP port, and TPGT. If a target node is accessible via multiple network portals, a service URL is registered for each network portal. Upon booting, an initiator queries the DA to discover accessible targets. The DA response contains the service URL(s) to which access has been administratively granted (based on scope membership). When using SLP without a DA, each SA entity responds to queries it receives. The SA response contains all the service URLs implemented within that target entity to which the initiator has been administratively granted access (based on scope membership). With or without a DA, name resolution is inherent to the discovery process because the service URL contains the iSCSI node name and all relevant network portal information. SLP does not support RSCN functionality, so initiators must periodically send update requests to the DA or to each SA to discover any new targets that come online after initial discovery.

The iSNS model is very similar to the SLP model. When using iSNS, clients (target and initiator entities) must register with the iSNS server before they can query the iSNS server. Each target entity registers its target node names, network portal information (including IP addresses and TCP port numbers), and TPGT information. Initiator entities query the iSNS server and receive a response containing the iSCSI node name of each target node to which access has been administratively granted (based on Discovery Domain membership). The response also contains the relevant network portal and TPGT information. Thus, name resolution is inherent to the discovery process. iSNS also has the advantage of RSCN support. Clients do not need to periodically query for current status of other devices. Instead, clients may register to receive SCN messages. For more information about iSCSI name resolution, readers are encouraged to consult IETF RFCs 2608, 3720, 3721, 4018, and 4171.
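The register-then-query behavior and the filtering by Discovery Domain can be illustrated with a toy in-memory model. This is not the iSNS protocol or any real API; the class ToyNameServer and all of its methods are hypothetical and exist only to show why name resolution falls out of discovery.

```python
from typing import Dict, List, Set, Tuple

Portal = Tuple[str, int, int]  # (IP address, TCP port, TPGT)


class ToyNameServer:
    def __init__(self) -> None:
        self._registered: Set[str] = set()
        self._portals: Dict[str, List[Portal]] = {}   # target node -> portals
        self._domains: Dict[str, Set[str]] = {}       # domain -> member nodes

    def register_initiator(self, node_name: str) -> None:
        self._registered.add(node_name)

    def register_target(self, node_name: str, portals: List[Portal]) -> None:
        self._registered.add(node_name)
        self._portals[node_name] = list(portals)

    def add_to_domain(self, domain: str, node_name: str) -> None:
        self._domains.setdefault(domain, set()).add(node_name)

    def query(self, initiator: str) -> Dict[str, List[Portal]]:
        """Return targets sharing at least one domain with the initiator."""
        if initiator not in self._registered:
            raise PermissionError("clients must register before they can query")
        visible: Set[str] = set()
        for members in self._domains.values():
            if initiator in members:
                visible |= members
        return {name: list(portals)
                for name, portals in self._portals.items() if name in visible}
```

Each query result already pairs a target node name with its portal and TPGT information, so no separate resolution step is needed.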

iSCSI Address Assignment and Resolution

As previously discussed, iSCSI port identifiers are dynamically assigned by the iSCSI hardware or software upon creation of a SCSI port. However, iSCSI port identifiers are not addresses per se. To facilitate forwarding of iSCSI packets, IP addresses are required. IP address assignment is handled in the customary manner (manually or via DHCP). No special addressing requirements are mandated by iSCSI, and no special addressing procedures are implemented by network entities that host iSCSI processes. After an IP address is assigned to at least one network portal within an iSCSI device, iSCSI can begin communication. Likewise, address resolution is accomplished via the customary mechanisms. For example, IP addresses are resolved to Ethernet addresses via ARP. IP address assignment and resolution are transparent to iSCSI.

iSCSI Session Types, Phases, and Stages

As discussed in Chapter 3, iSCSI implements two types of session: discovery and normal. Both session types operate on TCP. Each discovery session uses a single TCP connection. Each normal session can use multiple TCP connections for load balancing and improved fault tolerance. A discovery session is used to discover iSCSI target node names and network portal information via the SendTargets command. A normal session is used for all other purposes. Discovery sessions are optional, and normal sessions are mandatory. We discussed discovery sessions in Chapter 3, so this section focuses on normal sessions.

The login phase always occurs first and is composed of two stages: security parameter negotiation and operational parameter negotiation. Each stage is optional, but at least one of the two stages must occur. If the security parameter negotiation stage occurs, it must occur first. Authentication is optional. If authentication is implemented, it occurs during the security parameter negotiation stage. Therefore, the security parameter negotiation stage must occur if authentication is implemented. Currently, authentication is the only security parameter negotiated during this stage.
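The ordering rules for the two login stages reduce to a small check. The sketch below encodes only what the paragraph above states; the enum and function names are illustrative assumptions.

```python
from enum import Enum
from typing import Sequence


class Stage(Enum):
    SECURITY = "security parameter negotiation"
    OPERATIONAL = "operational parameter negotiation"


def valid_login_stages(stages: Sequence[Stage],
                       authentication: bool = False) -> bool:
    """Check a proposed stage sequence against the login phase rules."""
    allowed = (
        (Stage.SECURITY,),                     # security stage only
        (Stage.OPERATIONAL,),                  # operational stage only
        (Stage.SECURITY, Stage.OPERATIONAL),   # security must come first
    )
    if tuple(stages) not in allowed:
        return False
    # Authentication happens in the security stage, so that stage is
    # required whenever authentication is implemented.
    if authentication and Stage.SECURITY not in stages:
        return False
    return True
```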

Although the operational parameter negotiation stage is optional according to RFC 3720, it is a practical requirement for real-world deployments. Each initiator and target device must support the same operational parameters to communicate successfully. It is possible for the default settings of every iSCSI device to match, but it is not probable. So, negotiable parameters must be configured manually or autonegotiated. Manually setting all negotiable parameters on every iSCSI device can be operationally burdensome. Thus, the operational parameter negotiation stage is implemented by all iSCSI devices currently on the market. Support for unsolicited writes, the maximum burst length and various other parameters are negotiated during this stage.

Following the login phase, the iSCSI session transitions to the full feature phase. During the full feature phase of a normal session, initiators can issue iSCSI commands as well as send SCSI commands and data. Additionally, certain iSCSI operational parameters can be re-negotiated during the full feature phase. When all SCSI operations are complete, a normal iSCSI session can be gracefully terminated via the iSCSI Logout command. If a normal session is terminated unexpectedly, procedures are defined to clean up the session before reinstating the session. Session cleanup prevents processing of commands and responses that might have been delayed in transit, thus avoiding data corruption. Procedures are also defined to re-establish a session that has been terminated unexpectedly, so SCSI processing can continue from the point of abnormal termination. After all normal sessions have been terminated gracefully, the discovery session (if extant) can be terminated gracefully via the iSCSI Logout command. For more information about iSCSI session types, phases, and stages, readers are encouraged to consult IETF RFC 3720. Figure 8-2 illustrates the flow of iSCSI sessions, phases, and stages.

Figure 8-2. iSCSI Session Flow

iSCSI Data Transfer Optimizations

iSCSI handles data transfer very flexibly. To understand the data transfer options provided by iSCSI, some background information about SCSI I/O operations is required. SCSI defines the direction of data transfer from the perspective of the initiator. Data transfer from initiator to target (write command) is considered outbound, and data transfer from target to initiator (read command) is considered inbound. The basic procedure for a SCSI read operation involves three steps. First, the initiator sends a SCSI read command to the target. Next, the target sends the requested data to the initiator. Finally, the target sends a SCSI status indicator to the initiator. The read command specifies the starting block address and the number of contiguous blocks to transfer. If the data being retrieved by the application client is fragmented on the storage medium (that is, stored in non-contiguous blocks), then multiple read commands must be issued (one per set of contiguous blocks). For each set of contiguous blocks, the initiator may issue more than one read command if the total data in the set of contiguous blocks exceeds the initiator's available receive buffer resources. This eliminates the need for a flow-control mechanism for read commands. A target always knows it can send the entire requested data set because an initiator never requests more data than it is prepared to receive. When multiple commands are issued to satisfy a single application client request, the commands may be linked together as a single SCSI task. Such commands are called SCSI linked commands.
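The way an initiator keeps each read within its own receive buffer can be sketched as a simple splitting routine. The function name and parameters are hypothetical; real initiators also account for limits that are not modeled here.

```python
from typing import List, Tuple


def split_read(start_block: int, block_count: int,
               block_size: int, buffer_bytes: int) -> List[Tuple[int, int]]:
    """Split one contiguous range into (start block, count) read commands,
    none of which requests more data than the receive buffer can hold."""
    max_blocks = buffer_bytes // block_size
    if max_blocks < 1:
        raise ValueError("receive buffer is smaller than one block")
    commands = []
    remaining = block_count
    block = start_block
    while remaining > 0:
        count = min(remaining, max_blocks)
        commands.append((block, count))
        block += count
        remaining -= count
    return commands
```

With 512-byte blocks and a 2048-byte buffer, a 10-block read starting at block 100 becomes three commands covering 4, 4, and 2 blocks.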

The basic procedure for a SCSI write operation involves four steps. First, the initiator sends a SCSI write command to the target. Next, the target sends an indication that it is ready to receive the data. Next, the initiator sends the data. Finally, the target sends a SCSI status indicator to the initiator. The write command specifies the starting block address and the number of contiguous blocks that will be transferred by this command. If the data being stored by the application client exceeds the largest contiguous set of available blocks on the medium, multiple write commands must be issued (one per set of contiguous blocks). The commands may be linked as a single SCSI task. A key difference between read and write operations is the initiator's knowledge of available receive buffer space. When writing, the initiator does not know how much buffer space is currently available in the target to receive the data. So, the target must inform the initiator when the target is ready to receive data (that is, when receive buffers are available). The target must also indicate how much data to transfer. In other words, a flow-control mechanism is required for write operations. The SAM delegates responsibility for this flow-control mechanism to each SCSI Transport Protocol.

In iSCSI parlance, the data transfer steps are called phases (not to be confused with phases of an iSCSI session). iSCSI enables optimization of data transfer through phase-collapse. Targets may include SCSI status as part of the final data PDU for read commands. This does not eliminate any round-trips across the network, but it does reduce the total number of PDUs required to complete the read operation. Likewise, initiators may include data with write command PDUs. This can be done in two ways. Data may be included as part of the write command PDU. This is known as immediate data. Alternatively, data may be sent in one or more data PDUs immediately following a write command PDU without waiting for the target to indicate its readiness to receive data. This is known as unsolicited data. In both cases, one round-trip is eliminated across the network, which reduces the total time to completion for the write operation. In the case of immediate data, one data PDU is also eliminated. The initiator must negotiate support for immediate data and unsolicited data during login. Each feature is negotiated separately. If the target supports phase-collapse for write commands, the target informs the initiator (during login) how much data may be sent using each feature. Both features may be supported simultaneously. Collectively, immediate data and unsolicited data are called first burst data. First burst data may be sent only once per write command (the first sequence of PDUs). For more information about iSCSI phase-collapse, readers are encouraged to consult IETF RFC 3720.
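How a single write divides into immediate, unsolicited, and solicited data can be sketched as simple arithmetic. The sketch assumes one first-burst limit that covers immediate and unsolicited data together and a separate per-PDU data limit, which is my simplified reading of the negotiated limits; all parameter names are hypothetical.

```python
from typing import NamedTuple


class WritePlan(NamedTuple):
    immediate: int     # bytes carried inside the write command PDU
    unsolicited: int   # bytes sent in data PDUs without waiting for the target
    solicited: int     # bytes sent only after the target signals readiness


def plan_write(length: int, *, immediate_ok: bool, unsolicited_ok: bool,
               first_burst_limit: int, max_pdu_data: int) -> WritePlan:
    # First burst data may be sent only once per write command.
    budget = min(length, first_burst_limit)
    immediate = min(budget, max_pdu_data) if immediate_ok else 0
    unsolicited = budget - immediate if unsolicited_ok else 0
    return WritePlan(immediate, unsolicited, length - immediate - unsolicited)
```

For a 256 KiB write with both features enabled, a 64 KiB first-burst limit, and 8 KiB per PDU, the plan is 8192 immediate, 57344 unsolicited, and 196608 solicited bytes.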

Note

In the generic sense, immediate data is actually a subset of unsolicited data. Unsolicited data generically refers to any data sent to the target without first receiving an indication from the target that the target is ready for the data transfer. By that generic definition, immediate data qualifies as unsolicited data. However, the term unsolicited data has specific meaning in the context of iSCSI. Note that data sent in response to an indication of receiver readiness is called solicited data.

iSCSI PDU Formats

iSCSI uses one general PDU format for many purposes. The specific format of an iSCSI PDU is determined by the type of PDU. RFC 3720 defines numerous PDU types to facilitate communication between initiators and targets. Of these, the primary PDU types include login request, login response, SCSI command, SCSI response, data-out, data-in, ready to transfer (R2T), selective negative acknowledgment (SNACK) request, task management function (TMF) request, TMF response, and reject. All iSCSI PDUs begin with a basic header segment (BHS). The BHS may be followed by one or more additional header segments (AHS), a header-digest, a data segment, or a data-digest. The data-digest may be present only if the data segment is present. iSCSI PDUs are word-oriented, and an iSCSI word is 4 bytes. All iSCSI PDU segments and digests that do not end on a word boundary must be padded to the nearest word boundary. RFC 3720 encourages, but does not require, the value of 0 for all padding bits. Figure 8-3 illustrates the general iSCSI PDU format.

Figure 8-3. General iSCSI PDU Format

A brief description of each field follows:

BHS: This is 48 bytes long. It is the only mandatory field. The BHS field indicates the type of PDU and contains most of the control information used by iSCSI.
Optional AHS: Each is variable in length. The purpose of the AHS field is to provide protocol extensibility. Currently, only two AHS types are defined: extended command descriptor block (CDB) and expected bidirectional read data length.
Optional Header-Digest: This is variable in length as determined by the kind of digest used. This field provides additional header integrity beyond that of the IP and TCP checksums. During login, each initiator-target pair negotiates whether a header digest will be used and, if so, what kind. Separating the header digest from the data digest improves the operational efficiency of iSCSI gateways that modify header fields.
Optional Data: This is variable in length. It contains PDU-specific data.
Optional Data-Digest: This is variable in length as determined by the kind of digest used. This field provides additional data integrity beyond that of the IP and TCP checksums. During login, each initiator-target pair negotiates whether a data digest will be used and, if so, what kind.
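The word-alignment rule determines how many bytes a PDU occupies on the wire. The sketch below is a simplified calculation with hypothetical function names; digest lengths are passed in because they depend on the kind of digest negotiated.

```python
WORD = 4        # an iSCSI word is 4 bytes
BHS_LEN = 48    # the basic header segment is always present


def pad_len(n: int) -> int:
    """Number of padding bytes needed to reach the next word boundary."""
    return -n % WORD


def pdu_wire_length(ahs_len: int = 0, data_len: int = 0,
                    header_digest_len: int = 0,
                    data_digest_len: int = 0) -> int:
    if data_digest_len and not data_len:
        raise ValueError("a data-digest requires a data segment")
    total = BHS_LEN + ahs_len + pad_len(ahs_len)
    total += header_digest_len + pad_len(header_digest_len)
    if data_len:
        total += data_len + pad_len(data_len)
        total += data_digest_len + pad_len(data_digest_len)
    return total
```

A PDU with a 5-byte data segment and no digests occupies 56 bytes: 48 for the BHS, 5 of data, and 3 of padding.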

The remainder of this section focuses on the BHS because the two defined AHSs are less commonly used. Details of the BHS are provided for each of the primary iSCSI PDU types. Figure 8-4 illustrates the general format of the iSCSI BHS. All fields marked with "." are reserved.

Figure 8-4. iSCSI BHS Format

A brief description of each field follows:

Reserved This is 1 bit.
I This is 1 bit. I stands for immediate delivery. When an initiator sends an iSCSI command or SCSI command that should be processed immediately, the I bit is set to 1. When this bit is set to 1, the command is called an immediate command. This should not be confused with immediate data (phase-collapse). When this bit is set to 0, the command is called a non-immediate command.
Opcode This is 6 bits long. The Opcode field contains an operation code that indicates the type of PDU. Opcodes are defined as initiator opcodes (transmitted only by initiators) and target opcodes (transmitted only by targets). RFC 3720 defines 18 opcodes (see Table 8-2).
F This is 1 bit. F stands for final PDU. When this bit is set to 1, the PDU is the final (or only) PDU in a sequence of PDUs. When this bit is set to 0, the PDU is followed by one or more PDUs in the same sequence. The F bit is redefined by some PDU types.
Opcode-specific Sub-fields These are 23 bits long. The format and use of all Opcode-specific sub-fields are determined by the value in the Opcode field.
TotalAHSLength This is 8 bits long. It indicates the total length of all AHS fields. This field is needed because the AHS fields are variable in number and length. The value is expressed in 4-byte words and includes padding bytes (if any exist). If no AHS fields are present, this field is set to 0.
DataSegmentLength This is 24 bits long. It indicates the total length of the Data segment. This field is needed because the Data segment is variable in length. The value is expressed in bytes and does not include padding bytes (if any exist). If no Data segment is present, this field is set to 0.
LUN field and the Opcode-specific Sub-fields These are 64 bits (8 bytes) long. They contain the destination LUN if the Opcode field contains a value that is relevant to a specific LUN (such as a SCSI command). When used as a LUN field, the format complies with the SAM LUN format. When used as Opcode-specific sub-fields, the format and use of the sub-fields are opcode-specific.
Initiator Task Tag (ITT) This is 32 bits long. It contains a tag assigned by the initiator. An ITT is assigned to each iSCSI task. Likewise, an ITT is assigned to each SCSI task. A SCSI task can represent a single SCSI command or multiple linked commands. Each SCSI command can have many SCSI activities associated with it. A SCSI task encompasses all activities associated with a SCSI command or multiple linked commands. Likewise, an ITT that represents a SCSI task also encompasses all associated activities of the SCSI command(s). An ITT value is unique only within the context of the current session. The iSCSI ITT is similar in function to the FC fully qualified exchange identifier (FQXID).
Opcode-specific Sub-fields These are 224 bits (28 bytes) long. The format and use of the sub-fields are opcode-specific.
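
The field widths listed above add up to a 48-byte header, so the generic BHS can be split with plain byte slicing. The following is a minimal sketch, not a reference implementation. The function name parse_bhs and the dictionary keys are hypothetical, and the sketch assumes the byte ordering of RFC 3720 (big-endian fields, Data segment padded to a 4-byte boundary).

```python
BHS_LENGTH = 48  # sum of the field widths described above, in bytes


def parse_bhs(bhs: bytes) -> dict:
    """Split a 48-byte BHS into its generic (opcode-independent) fields."""
    if len(bhs) != BHS_LENGTH:
        raise ValueError("a BHS must be exactly 48 bytes long")
    data_len = int.from_bytes(bhs[5:8], "big")
    return {
        "immediate": bool(bhs[0] & 0x40),         # I bit
        "opcode": bhs[0] & 0x3F,                  # 6-bit Opcode
        "final": bool(bhs[1] & 0x80),             # F bit
        "total_ahs_bytes": bhs[4] * 4,            # field counts 4-byte words
        "data_segment_length": data_len,          # excludes padding
        "data_segment_padded": (data_len + 3) & ~3,
        "lun_or_specific": bhs[8:16],             # LUN or opcode-specific
        "itt": int.from_bytes(bhs[16:20], "big"),
        "opcode_specific": bhs[20:48],            # final 28 bytes
    }
```

A receiver could use total_ahs_bytes and data_segment_padded to work out how many more bytes to read from the TCP stream before the next PDU begins (header and data digests, if negotiated, are not considered in this sketch).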

Table 8-2 summarizes the iSCSI opcodes that are currently defined in RFC 3720. All opcodes excluded from Table 8-2 are reserved.

Table 8-2. iSCSI Operation Codes
Category	Value	Description
Initiator	0x00	NOP-out
Initiator	0x01	SCSI command
Initiator	0x02	SCSI task management request
Initiator	0x03	Login request
Initiator	0x04	Text request
Initiator	0x05	SCSI data-out
Initiator	0x06	Logout request
Initiator	0x10	SNACK request
Initiator	0x1C-0x1E	Vendor-specific codes
Target	0x20	NOP-in
Target	0x21	SCSI response
Target	0x22	SCSI task management response
Target	0x23	Login response
Target	0x24	Text response
Target	0x25	SCSI data-in
Target	0x26	Logout response
Target	0x31	Ready to transfer (R2T)
Target	0x32	Asynchronous message
Target	0x3C-0x3E	Vendor-specific codes
Target	0x3F	Reject
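
Table 8-2 maps naturally onto a lookup table. The sketch below is illustrative only; the names INITIATOR_OPCODES, TARGET_OPCODES, and describe_opcode are hypothetical, and the values are copied from the table above.

```python
INITIATOR_OPCODES = {
    0x00: "NOP-out",
    0x01: "SCSI command",
    0x02: "SCSI task management request",
    0x03: "Login request",
    0x04: "Text request",
    0x05: "SCSI data-out",
    0x06: "Logout request",
    0x10: "SNACK request",
}

TARGET_OPCODES = {
    0x20: "NOP-in",
    0x21: "SCSI response",
    0x22: "SCSI task management response",
    0x23: "Login response",
    0x24: "Text response",
    0x25: "SCSI data-in",
    0x26: "Logout response",
    0x31: "Ready to transfer (R2T)",
    0x32: "Asynchronous message",
    0x3F: "Reject",
}


def describe_opcode(opcode: int) -> str:
    """Return the category and description for a 6-bit iSCSI opcode."""
    if not 0 <= opcode <= 0x3F:
        raise ValueError("the Opcode field is only 6 bits wide")
    if opcode in INITIATOR_OPCODES:
        return "Initiator: " + INITIATOR_OPCODES[opcode]
    if opcode in TARGET_OPCODES:
        return "Target: " + TARGET_OPCODES[opcode]
    if 0x1C <= opcode <= 0x1E:
        return "Initiator: Vendor-specific code"
    if 0x3C <= opcode <= 0x3E:
        return "Target: Vendor-specific code"
    return "Reserved"  # every value absent from Table 8-2
```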

The first login request of a new session is called the leading login request. The first TCP connection of a new session is called the leading connection. Figure 8-5 illustrates the iSCSI BHS of a Login Request PDU. Login parameters are encapsulated in the Data segment (not shown) as text key-value pairs. All fields marked with "." are reserved.

Figure 8-5. iSCSI Login Request BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
I This is always set to 1.
Opcode This is 6 bits long. It is set to 0x03.
T This is the F bit redefined as the T bit. T stands for transit. This bit is set to 1 when the initiator is ready to change to the next login stage.
C This is 1 bit. It indicates whether the set of text keys in this PDU is complete. C stands for continue. When the set of text keys is too large for a single PDU, the C bit is set to 1, and another PDU follows containing more text keys. When all text keys have been transmitted, the C bit is set to 0. When the C bit is set to 1, the T bit must be set to 0.
Reserved This is 2 bits long.
CSG This is 2 bits long. It indicates the current stage of the login procedure. The value 0 indicates the security parameter negotiation stage. The value 1 indicates the operational parameter negotiation stage. The value 2 is reserved. The value 3 indicates the full feature phase. These values are also used by the NSG field.
NSG This is 2 bits long. It indicates the next stage of the login procedure.
Version-Max This is 8 bits long. It indicates the highest supported version of the iSCSI protocol. Only one version of the iSCSI protocol is currently defined. The current version is 0x00.
Version-Min This is 8 bits long. It indicates the lowest supported version of the iSCSI protocol.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
ISID This is 48 bits (6 bytes) long. For a new session, this value is unique within the context of the initiator-target-TPGT triplet. For a new connection within an existing session, this field indicates the session to which the new connection should be added.
TSIH This is 16 bits long. For a new session, the initiator uses the value 0. Upon successful completion of the login procedure, the target provides the TSIH value to the initiator in the final Login Response PDU. For a new connection within an existing session, the value previously assigned to the session by the target must be provided by the initiator in the first and all subsequent Login Request PDUs.
ITT This is 32 bits long.
Connection Identifier (CID) This is 16 bits long. The initiator assigns a CID to each TCP connection. The CID is unique only within the context of the session to which the connection belongs. Each TCP connection may be used by only a single session.
Reserved This is 16 bits long.
Command Sequence Number (CmdSN) This is 32 bits long. The initiator assigns a unique sequence number to each non-immediate SCSI command issued within a session. This field enables in-order delivery of all non-immediate commands within a session even when multiple TCP connections are used. For a leading login request, the value of this field is arbitrarily selected by the initiator. The same value is used until login successfully completes. The same value is then used for the first non-immediate SCSI Command PDU. The CmdSN counter is then incremented by one for each subsequent new non-immediate command, regardless of which TCP connection is used. If the login request is for a new connection within an existing session, the value of this field must reflect the current CmdSN of that session. The first non-immediate SCSI Command PDU sent on the new TCP connection also must use the current CmdSN value, which can continue to increment while login processing occurs on the new connection. Both initiator and target track this counter. The iSCSI CmdSN is similar in function to the FCP command reference number (CRN).
ExpStatSN or Reserved This is 32 bits long. It may contain the iSCSI Expected Status Sequence Number. Except during login, the initiator uses the ExpStatSN field to acknowledge receipt of SCSI Response PDUs. The target assigns a status sequence number (StatSN) to each SCSI Response PDU. While the CmdSN is session-wide, the StatSN is unique only within the context of a TCP connection. For a leading login request, this field is reserved. If the login request is for a new connection within an existing session, this field is reserved. If the login request is for recovery of a lost connection, this field contains the last value of ExpStatSN from the failed connection. If multiple Login Request PDUs are sent during recovery, this field is incremented by one for each Login Response PDU received. Thus, during login, this field is either not used or is used to acknowledge receipt of Login Response PDUs.
Reserved This is 128 bits (16 bytes) long.
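
The Login Request fields described above can be assembled into a 48-byte header with the standard struct module. This is a sketch under stated assumptions: the function name build_login_request_bhs and its parameter names are hypothetical, TotalAHSLength is assumed to be 0, and the field order follows the list above with big-endian encoding.

```python
import struct


def build_login_request_bhs(*, transit: bool, cont: bool, csg: int, nsg: int,
                            isid: bytes, tsih: int, itt: int, cid: int,
                            cmd_sn: int, exp_stat_sn: int = 0,
                            data_segment_length: int = 0,
                            version_max: int = 0x00,
                            version_min: int = 0x00) -> bytes:
    """Pack a Login Request BHS (opcode 0x03, I bit always set to 1)."""
    if transit and cont:
        raise ValueError("T must be 0 when C is 1")
    if csg == 2 or nsg == 2 or not (0 <= csg <= 3 and 0 <= nsg <= 3):
        raise ValueError("stage must be 0, 1, or 3 (2 is reserved)")
    if len(isid) != 6:
        raise ValueError("ISID is 48 bits (6 bytes) long")
    flags = (0x80 if transit else 0) | (0x40 if cont else 0) | (csg << 2) | nsg
    return struct.pack(
        ">BBBBB3s6sHIHHII16x",
        0x40 | 0x03,                            # I bit plus Opcode
        flags,                                  # T, C, CSG, NSG
        version_max,
        version_min,
        0,                                      # TotalAHSLength (assumed 0)
        data_segment_length.to_bytes(3, "big"),
        isid,
        tsih,                                   # 0 for a new session
        itt,
        cid,
        0,                                      # Reserved (16 bits)
        cmd_sn,
        exp_stat_sn,                            # reserved outside recovery
    )                                           # 16x = final 16 reserved bytes
```

For a leading login request that is ready to move from security negotiation to operational negotiation, a caller would pass transit=True, csg=0, nsg=1, and tsih=0.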

Each Login Request PDU or sequence of Login Request PDUs precipitates a Login Response PDU or sequence of Login Response PDUs. Figure 8-6 illustrates the iSCSI BHS of a Login Response PDU. Login parameters are encapsulated in the Data segment (not shown) as text key-value pairs. All fields marked with "." are reserved.

Figure 8-6. iSCSI Login Response BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the I bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x23.
T bit This is the F bit redefined as the T bit. This bit is set to 1 when the target is ready to change to the next login stage. The target may set this bit to 1 only if the most recently received Login Request PDU had a T bit value of 1.
C This is 1 bit.
Reserved This is 2 bits long.
CSG This is 2 bits long.
NSG This is 2 bits long.
Version-Max This is 8 bits long.
Version-Active This is 8 bits long. It indicates the highest iSCSI protocol version that both the target and the initiator have in common. If no version is common to both target and initiator, the target rejects the login request, and this field reverts to Version-Min.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
ISID This is 48 bits (6 bytes) long.
TSIH This is 16 bits long.
ITT This is 32 bits long.
Reserved This is 32 bits long.
Status Sequence Number (StatSN) This is 32 bits long. iSCSI assigns a sequence number to each new Login Response PDU and each new SCSI Response PDU sent within a session. This field enables recovery of lost status and Login Response PDUs. When multiple TCP connections are used, the value of StatSN is maintained independently for each TCP connection. Mandatory support for connection allegiance makes this possible. For the first login response sent on each TCP connection, the value of this field is arbitrarily selected by the target. Each time a new Login Response PDU or a new SCSI Response PDU is transmitted, the StatSN of the associated TCP connection is incremented by one. A retransmitted SCSI Response PDU carries the same StatSN as the original PDU. This field is valid only when the Status-Class field is set to 0.
Expected Command Sequence Number (ExpCmdSN) This is 32 bits long. This field enables the target to acknowledge receipt of commands. Both initiator and target track this counter. The target calculates this value by adding 1 to the highest CmdSN received in a valid PDU. The initiator does not calculate this value. Instead, the initiator uses the value received most recently from the target. This field is valid only when the Status-Class field is set to 0.
Maximum Command Sequence Number (MaxCmdSN) This is 32 bits long. Whereas lower layers of the OSI model implement flow control at the granularity of a byte or a frame or packet, this field enables flow control at the granularity of a SCSI command. The maximum number of SCSI commands that the target can queue is determined by the resources (such as memory) within the target. This field allows the target to inform the initiator of available resources at a given point in time. Both initiator and target track this counter. The target increments its MaxCmdSN counter by 1 each time it transmits a SCSI Response PDU. The initiator does not calculate this value. Instead, the initiator uses the value received most recently from the target. The maximum number of commands the target can accept from the initiator at a given point in time is calculated as the current value of the target's MaxCmdSN counter minus the current value of the target's CmdSN counter. When the target's CmdSN counter equals its MaxCmdSN counter, the target cannot accept new commands within this session. This field is valid only when the Status-Class field is set to 0.
Status-Class This is 8 bits long. It indicates the status of the most recently received Login Request PDU. A value of 0 indicates that the request was understood and processed properly. A value of 1 indicates that the initiator is being redirected. Redirection usually indicates the target IP address has changed; a text key-value pair must be included in the Data segment to indicate the new IP address. A value of 2 indicates that the target has detected an initiator error. The login procedure should be aborted, and a new login phase should be initiated if the initiator still requires access to the target. A value of 3 indicates that the target has experienced an internal error. The initiator may retry the request without aborting the login procedure. All other values are currently undefined but not explicitly reserved.
Status-Detail This is 8 bits long. It provides more detail within each category of Status-Class. This field is meant to be used by sophisticated iSCSI initiator implementations and may be ignored by simple iSCSI initiator implementations.
Reserved This is 80 bits long.
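
An initiator might interpret a Login Response roughly as sketched below. The names parse_login_response and LOGIN_STATUS_CLASSES are hypothetical. The command window is computed as MaxCmdSN - ExpCmdSN + 1 (modulo 2^32), which is how RFC 3720 expresses the queuing capacity; because ExpCmdSN is the highest received CmdSN plus 1, this matches the MaxCmdSN minus CmdSN calculation described above.

```python
LOGIN_STATUS_CLASSES = {
    0: "success",
    1: "redirection",
    2: "initiator error",
    3: "target error",
}


def parse_login_response(bhs: bytes) -> dict:
    """Decode the PDU-specific fields of a Login Response BHS."""
    if len(bhs) != 48 or (bhs[0] & 0x3F) != 0x23:
        raise ValueError("not a Login Response BHS")
    status_class = bhs[36]
    result = {
        "transit": bool(bhs[1] & 0x80),          # T bit
        "continue": bool(bhs[1] & 0x40),         # C bit
        "csg": (bhs[1] >> 2) & 0x3,
        "nsg": bhs[1] & 0x3,
        "version_active": bhs[3],
        "tsih": int.from_bytes(bhs[14:16], "big"),
        "status_class": status_class,
        "status_detail": bhs[37],
        "meaning": LOGIN_STATUS_CLASSES.get(status_class, "undefined"),
    }
    if status_class == 0:
        # StatSN, ExpCmdSN and MaxCmdSN are valid only for Status-Class 0.
        exp_cmd_sn = int.from_bytes(bhs[28:32], "big")
        max_cmd_sn = int.from_bytes(bhs[32:36], "big")
        result["stat_sn"] = int.from_bytes(bhs[24:28], "big")
        result["exp_cmd_sn"] = exp_cmd_sn
        result["max_cmd_sn"] = max_cmd_sn
        result["command_window"] = (max_cmd_sn - exp_cmd_sn + 1) % 2**32
    return result
```

A simple initiator could act on the "meaning" string alone and ignore status_detail, as the Status-Detail description permits.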

Following login, the initiator may send SCSI commands to the target. Figure 8-7 illustrates the iSCSI BHS of a SCSI Command PDU. All fields marked with "." are reserved.

Figure 8-7. iSCSI SCSI Command BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
I This is 1 bit.
Opcode This is 6 bits long. It is set to 0x01.
F This is 1 bit.
R This is 1 bit. It indicates a read command when set to 1. For bidirectional commands, both the R and W bits are set to 1.
W This is 1 bit. It indicates a write command when set to 1. For bidirectional commands, both the R and W bits are set to 1.
Reserved This is 2 bits long.
ATTR This is 3 bits long. It indicates the SCSI Task Attribute. A value of 0 indicates an untagged task. A value of 1 indicates a simple task. A value of 2 indicates an ordered task. A value of 3 indicates a Head Of Queue task. A value of 4 indicates an Auto Contingent Allegiance (ACA) task. All other values are reserved. For more information about SCSI Task Attributes, see the ANSI T10 SAM-3 specification.
Reserved This is 16 bits long.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
LUN This is 64 bits (8 bytes) long.
ITT This is 32 bits long.
Expected Data Transfer Length This is 32 bits long. It indicates the total amount of data expected to be transferred unidirectionally by this command. This field is expressed in bytes. When the data transfer is bidirectional, this field represents the write data, and the Expected Bidirectional Read Data Length AHS must follow the BHS. This field is set to 0 for certain commands that do not transfer data. This field represents an estimate. After all data is transferred, the target informs the initiator of how much data was actually transferred.
CmdSN This is 32 bits long. It contains the current value of the CmdSN counter. The CmdSN counter is incremented by 1 immediately following transmission of a new non-immediate command. Thus, the counter represents the number of the next non-immediate command to be sent. The only exception is when an immediate command is transmitted. For an immediate command, the CmdSN field contains the current value of the CmdSN counter, but the counter is not incremented after transmission of the immediate command. Thus, the next non-immediate command to be transmitted carries the same CmdSN as the preceding immediate command. The CmdSN counter is incremented by 1 immediately following transmission of the first non-immediate command to follow an immediate command. A retransmitted SCSI Command PDU carries the same CmdSN as the original PDU. Note that a retransmitted SCSI Command PDU also carries the same ITT as the original PDU.
ExpStatSN This is 32 bits long.
SCSI CDB This is 128 bits (16 bytes) long. Multiple SCSI CDB formats are defined by ANSI. SCSI CDBs are variable in length up to a maximum of 260 bytes, but the most common CDB formats are 16 bytes long or less. Thus, the most common CDB formats can fit into this field. When a CDB shorter than 16 bytes is sent, this field is padded with zeros. When a CDB longer than 16 bytes is sent, the BHS must be followed by an Extended CDB AHS containing the remainder of the CDB. All CDBs longer than 16 bytes must end on a 4-byte word boundary, so the Extended CDB AHS does not require padding.
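
The CmdSN rules described above for immediate and non-immediate commands can be modeled with a small counter. This is an illustrative sketch only; the class name CmdSNCounter and the method stamp are hypothetical, and wrapping at 32 bits is an assumption based on the field width.

```python
class CmdSNCounter:
    """Initiator-side CmdSN counter for one session."""

    def __init__(self, initial: int) -> None:
        self.value = initial & 0xFFFFFFFF

    def stamp(self, immediate: bool) -> int:
        """Return the CmdSN to place in the next new SCSI Command PDU.

        The counter advances only after a non-immediate command, so an
        immediate command and the next non-immediate command carry the
        same CmdSN.
        """
        cmd_sn = self.value
        if not immediate:
            self.value = (self.value + 1) & 0xFFFFFFFF
        return cmd_sn
```

Starting from 100, the sequence non-immediate, immediate, non-immediate, non-immediate yields CmdSN values 100, 101, 101, 102. A retransmitted PDU would reuse the CmdSN it was originally stamped with rather than calling stamp again.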

The final result of each SCSI command is a SCSI status indicator delivered in a SCSI Response PDU. The SCSI Response PDU also conveys iSCSI status for protocol operations. Figure 8-8 illustrates the iSCSI BHS of a SCSI Response PDU. All fields marked with "." are reserved.

Figure 8-8. iSCSI SCSI Response BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the I bit redefined as Reserved.
Opcode This is 6 bits long and is set to 0x21.
F bit This is always set to 1.
Reserved This is 2 bits long.
o This is 1 bit. The o stands for overflow. When a bidirectional command generates more read data than expected by the initiator, this bit is set to 1. This condition is known as a Bidirectional Read Residual Overflow. This bit is not used for unidirectional commands. This bit is valid only if the Response field is set to 0x00.
u This is 1 bit. The u stands for underflow. When a bidirectional command generates less read data than expected by the initiator, this bit is set to 1. This condition is known as a Bidirectional Read Residual Underflow. This bit is not used for unidirectional commands. This bit is valid only if the Response field is set to 0x00.
O This is 1 bit. The O stands for overflow. When a bidirectional command generates more write data than expected by the initiator, this bit is set to 1. This condition is known as a Bidirectional Write Residual Overflow. When a unidirectional read command generates more data than expected by the initiator, this bit is set to 1. When a unidirectional write command generates more data than expected by the initiator, this bit is set to 1. In both unidirectional cases, this condition is known as a Residual Overflow. This bit is valid only if the Response field is set to 0x00.
U This is 1 bit. The U stands for underflow. When a bidirectional command generates less write data than expected by the initiator, this bit is set to 1. This condition is known as a Bidirectional Write Residual Underflow. When a unidirectional read command generates less data than expected by the initiator, this bit is set to 1. When a unidirectional write command generates less data than expected by the initiator, this bit is set to 1. In both unidirectional cases, this condition is known as a Residual Underflow. This bit is valid only if the Response field is set to 0x00.
Reserved This is 1 bit.
Response This is 8 bits long and contains a code that indicates the presence or absence of iSCSI protocol errors. The iSCSI response code is to the SCSI service delivery subsystem what the SCSI status code is to SAL. An iSCSI response code of 0x00 is known as Command Completed at Target. It indicates the target has completed processing the command from the iSCSI perspective. This iSCSI response code is roughly equivalent to a SCSI service response of LINKED COMMAND COMPLETE or TASK COMPLETE. This iSCSI response code is also roughly equivalent to an FCP_RSP_LEN_VALID bit set to zero. An iSCSI response code of 0x00 conveys iSCSI success but does not imply SCSI success. An iSCSI response code of 0x01 is known as Target Failure. It indicates failure to process the command. This iSCSI response code is roughly equivalent to a SCSI service response of SERVICE DELIVERY OR TARGET FAILURE. This iSCSI response code is also roughly equivalent to an FCP_RSP_LEN_VALID bit set to 1. iSCSI response codes 0x80-0xff are vendor-specific. All other iSCSI response codes are reserved.
Note

The SCSI service response is passed to the SAL from the SCSI service delivery subsystem within the initiator. The SCSI service response indicates success or failure for delivery operations. Whereas the iSCSI response code provides status between peer layers in the OSI Reference Model, the SCSI service response provides inter-layer status between provider and subscriber.
Status This is 8 bits long. This field contains a status code that provides more detail about the final status of the SCSI command and the state of the logical unit that executed the command. This field is valid only if the Response field is set to 0x00. Even if the Response field is set to 0x00, the target might not have processed the command successfully. If the status code indicates failure to process the command successfully, error information (called SCSI sense data) is included in the Data segment. All iSCSI devices must support SCSI autosense. iSCSI does not define status codes. Instead, iSCSI uses the status codes defined by the SAM. Currently, 10 SCSI status codes are defined in the SAM-3 specification (see Table 8-3). All other values are reserved.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
Reserved This is 64 bits (8 bytes) long.
ITT This is 32 bits long.
SNACK Tag or Reserved This is 32 bits long. Each SNACK Request PDU is assigned a tag by the initiator. If the initiator sends one or more SNACK Request PDUs for read data after the SCSI Response PDU has been received, the SCSI Response PDU must be discarded. After retransmitting the missing data, the target retransmits the SCSI Response PDU containing the SNACK Tag of the most recently received SNACK Request PDU. The SAM mandates that no more than one status indicator be sent for each SCSI command. However, a single status indicator may be sent more than once. So, a retransmitted SCSI Response PDU always carries the same StatSN as the original PDU. This represents multiple instances of a single status indicator. The initiator must be able to distinguish between multiple instances of the status indicator to process the most recent instance. This field enables the initiator to detect and discard older SCSI Response PDUs. Only the SCSI Response PDU containing the most recent SNACK Tag is considered valid. If the target does not receive any SNACK Request PDUs before sending status, this field is reserved.
StatSN This is 32 bits long.
ExpCmdSN This is 32 bits long.
MaxCmdSN This is 32 bits long. If a SCSI Response PDU is dropped, the target and initiator can temporarily lose synchronization of their MaxCmdSN counters. During this time, the initiator does not know that the target can accept another command. Depending on the current level of I/O activity, this can result in slight performance degradation. Upon receipt of the retransmitted SCSI Response PDU, the MaxCmdSN counters resynchronize, and performance returns to normal.
ExpDataSN or Reserved This is 32 bits long. This field contains the expected data sequence number. It enables the initiator to verify that all Data-In/Data-Out PDUs were received without error. The target assigns a data sequence number (DataSN) to each Data-In PDU sent during a read command. Likewise, the initiator assigns a DataSN to each Data-Out PDU sent during a write command. For read commands, the ExpDataSN is always calculated as the current DataSN plus one. Therefore, a value of DataSN plus one is always reported in the SCSI Response PDU regardless of whether the read command completes successfully. For write commands, the initiator resets the value of DataSN to 0 upon transmission of the first PDU in each new sequence of PDUs, and the target resets the value of ExpDataSN to 0 after successful reception of each sequence of PDUs. Therefore, a value of 0 is reported in the SCSI Response PDU if the write command completes successfully. If the write command completes unsuccessfully, the value reported in the SCSI Response PDU may vary depending on choices made by the target. In the case of bidirectional commands, the target uses this field to report the number of Data-In and R2T PDUs transmitted. This field is reserved if a command does not complete or if no Data-In PDUs were sent during a read command.
Bidirectional Read Residual Count or Reserved This is 32 bits long. When the o bit or the u bit is set to 1, this field indicates the residual read byte count for a bidirectional command. When neither the o bit nor the u bit is set to 1, this field is reserved. This field is not used for unidirectional commands. This field is valid only if the Response field is set to 0x00.
Residual Count or Reserved This is 32 bits long. When either the O bit or the U bit is set to 1, this field indicates the residual read byte count for a read command, the residual write byte count for a write command, or the residual write byte count for a bidirectional command. When neither the O bit nor the U bit is set to 1, this field is reserved. This field is valid only if the Response field is set to 0x00.
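
The validity rules for the o, u, O, and U bits and the two residual count fields can be expressed compactly in code. The sketch below is a hedged illustration; the function name decode_scsi_response and the dictionary keys are hypothetical, and the bit positions follow the order of the field list above.

```python
def decode_scsi_response(bhs: bytes) -> dict:
    """Decode response, status, and residual information from a SCSI Response BHS."""
    if len(bhs) != 48 or (bhs[0] & 0x3F) != 0x21:
        raise ValueError("not a SCSI Response BHS")
    flags = bhs[1]
    out = {
        "response": bhs[2],
        "itt": int.from_bytes(bhs[16:20], "big"),
        "stat_sn": int.from_bytes(bhs[24:28], "big"),
    }
    if out["response"] != 0x00:
        # Status, flag bits, and residual counts are valid only when the
        # iSCSI response code is 0x00 (Command Completed at Target).
        return out
    out["status"] = bhs[3]
    bidi_read_residual = int.from_bytes(bhs[40:44], "big")
    residual = int.from_bytes(bhs[44:48], "big")
    if flags & 0x10:                              # o bit
        out["bidi_read_residual"] = ("overflow", bidi_read_residual)
    elif flags & 0x08:                            # u bit
        out["bidi_read_residual"] = ("underflow", bidi_read_residual)
    if flags & 0x04:                              # O bit
        out["residual"] = ("overflow", residual)
    elif flags & 0x02:                            # U bit
        out["residual"] = ("underflow", residual)
    return out
```

Note that a response code of 0x00 with a status other than GOOD still indicates iSCSI success; the caller would inspect out["status"] separately to determine SCSI success.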

Table 8-3 summarizes the SCSI status codes that are currently defined in the SAM-3 specification. All SCSI status codes excluded from Table 8-3 are reserved.

Table 8-3. SCSI Status Codes
Status Code	Status Name	Associated SCSI Service Response
0x00	GOOD	TASK COMPLETE
0x02	CHECK CONDITION	TASK COMPLETE
0x04	CONDITION MET	TASK COMPLETE
0x08	BUSY	TASK COMPLETE
0x10	INTERMEDIATE	LINKED COMMAND COMPLETE
0x14	INTERMEDIATE-CONDITION MET	LINKED COMMAND COMPLETE
0x18	RESERVATION CONFLICT	TASK COMPLETE
0x28	TASK SET FULL	TASK COMPLETE
0x30	ACA ACTIVE	TASK COMPLETE
0x40	TASK ABORTED	TASK COMPLETE

Outbound data is delivered in Data-Out PDUs. Figure 8-9 illustrates the iSCSI BHS of a Data-Out PDU. All fields marked with "." are reserved. Each Data-Out PDU must include a Data segment.

Figure 8-9. iSCSI Data-Out BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the I bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x05.
F This is 1 bit.
Reserved This is 23 bits long.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
LUN or Reserved This is 64 bits (8 bytes) long. Each R2T PDU provides a LUN. All Data-Out PDUs sent in response to an R2T PDU must carry in this field the LUN provided by the R2T PDU. Upon receipt, the target uses this field and the TTT or 0xFFFFFFFF field to associate the Data-Out PDU with a previously transmitted R2T PDU. For Data-Out PDUs containing first burst data, this field is reserved.
ITT This is 32 bits long.
Target Transfer Tag (TTT) or 0xFFFFFFFF This is 32 bits long. Each R2T PDU provides a TTT. All Data-Out PDUs sent in response to an R2T PDU must carry in this field the TTT provided by the R2T PDU. Upon receipt, the target uses this field and the LUN or Reserved field to associate the Data-Out PDU with a previously transmitted R2T PDU. For Data-Out PDUs containing first burst data, this field contains the value 0xFFFFFFFF.
Reserved This is 32 bits long.
ExpStatSN This is 32 bits long.
Reserved This is 32 bits long.
Data Sequence Number (DataSN) This is 32 bits long. This field uniquely identifies each Data-Out PDU within each sequence of PDUs. The DataSN is similar in function to the FC SEQ_ID. Each SCSI write command is satisfied with one or more sequences of PDUs. Each PDU sequence is identified by the ITT (for unsolicited data) or the TTT (for solicited data). This field is incremented by one for each Data-Out PDU transmitted within a sequence. A retransmitted Data-Out PDU carries the same DataSN as the original PDU. The counter is reset for each new sequence within the context of a single command.
Buffer Offset This is 32 bits long. This field indicates the position of the first byte of data delivered by this PDU relative to the first byte of all the data transferred by the associated SCSI command. This field enables the target to reassemble the data properly.
Reserved This is 32 bits long.
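
The interplay of DataSN, Buffer Offset, and the F bit within one sequence of Data-Out PDUs can be sketched as follows. The function name plan_data_out_sequence is hypothetical, and the max_segment parameter is an assumption standing in for whatever per-PDU data limit the session has negotiated.

```python
def plan_data_out_sequence(buffer_offset: int, length: int, max_segment: int,
                           ttt: int = 0xFFFFFFFF) -> list:
    """Describe the Data-Out PDUs needed to send one sequence.

    buffer_offset is the position of the first byte of this sequence
    relative to the first byte of all data for the command. The default
    ttt of 0xFFFFFFFF corresponds to first burst (unsolicited) data.
    """
    if length <= 0 or max_segment <= 0:
        raise ValueError("length and max_segment must be positive")
    pdus = []
    data_sn = 0                       # DataSN restarts for each new sequence
    sent = 0
    while sent < length:
        chunk = min(max_segment, length - sent)
        pdus.append({
            "data_sn": data_sn,
            "buffer_offset": buffer_offset + sent,
            "data_segment_length": chunk,
            "final": sent + chunk == length,   # F bit on the last PDU only
            "ttt": ttt,
        })
        data_sn += 1
        sent += chunk
    return pdus
```

For example, a 10,000-byte sequence starting at offset 4096 with a 4096-byte limit produces three PDUs with DataSN 0, 1, and 2, and only the third has the F bit set.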

Inbound data is delivered in Data-In PDUs. Figure 8-10 illustrates the iSCSI BHS of a Data-In PDU. All fields marked with "." are reserved. Each Data-In PDU must include a Data segment.

Figure 8-10. iSCSI Data-In BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the I bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x25.
F bit This may be set to 1 before sending the final Data-In PDU. Doing so indicates a change of direction during a bidirectional command. When used to change the direction of transfer, this bit is similar in function to the FC Sequence Initiative bit in the F_CTL field in the FC Header.
A This is 1 bit. The A stands for Acknowledge. The target sets this bit to 1 to request positive, cumulative acknowledgment of all Data-In PDUs transmitted before the current Data-In PDU. This bit may be used only if the session supports an ErrorRecoveryLevel greater than 0 (see the iSCSI Login Parameters section of this chapter).
Reserved This is 3 bits long.
O and U bits These are used in the same manner as previously described. These bits are present to support phase-collapse for read commands. For bidirectional commands, the target must send status in a separate SCSI Response PDU. iSCSI (like FCP) does not support status phase-collapse for write commands. These bits are valid only when the S bit is set to 1.
S This is 1 bit. The S stands for status. When this bit is set to 1, status is included in the PDU.
Reserved This is 8 bits long.
Status or Reserved This is 8 bits long. When the S field is set to 1, this field contains the SCSI status code for the command. Phase-collapse is supported only when the iSCSI response code is 0x00. Thus, a Response field is not required because the response code is implied. Furthermore, phase-collapse is supported only when the SCSI status is 0x00, 0x04, 0x10, or 0x14 (GOOD, CONDITION MET, INTERMEDIATE, or INTERMEDIATE-CONDITION MET, respectively). When the S field is set to zero, this field is reserved.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
LUN or Reserved This is 64 bits (8 bytes) long and contains the LUN if the A field is set to 1. The initiator copies the value of this field into a similar field in the acknowledgment PDU. If the A field is set to 0, this field is reserved.
ITT This is 32 bits long.
TTT or 0xFFFFFFFF This is 32 bits long. It contains a TTT if the A field is set to 1. The initiator copies the value of this field into a similar field in the acknowledgment PDU. If the A field is set to 0, this field is set to 0xFFFFFFFF.
StatSN or Reserved This is 32 bits long. It contains the StatSN if the S field is set to 1. Otherwise, this field is reserved.
ExpCmdSN This is 32 bits long.
MaxCmdSN This is 32 bits long.
DataSN This is 32 bits long. It uniquely identifies each Data-In PDU within a sequence of PDUs. The DataSN is similar in function to the FC SEQ_ID. Each SCSI read command is satisfied with a single sequence of PDUs. Each PDU sequence is identified by the CmdSN. This field is incremented by 1 for each Data-In PDU transmitted within a SCSI task. A retransmitted Data-In PDU carries the same DataSN as the original PDU. The DataSN counter does not reset within a single read command or after each linked read command within a single SCSI task. Likewise, for bidirectional commands in which the target periodically sets the F field to 1 to allow the transfer of write data, the DataSN counter does not reset after each sequence. This field is also incremented by one for each R2T PDU transmitted during bidirectional command processing. In other words, the target maintains a single counter for both DataSN and R2T Sequence Number (R2TSN) during bidirectional command processing.
Buffer Offset This is 32 bits long.
Residual Count This is 32 bits long. It is used in the same manner as previously described. This field is present to support phase-collapse for read commands. This field is valid only when the S field is set to 1.
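
Several Data-In fields are valid only when the A bit or the S bit is set. The sketch below illustrates those dependencies, including the restriction on which SCSI status codes may be phase-collapsed. The names decode_data_in and PHASE_COLLAPSE_STATUSES are hypothetical, and the bit positions follow the order of the field list above.

```python
# GOOD, CONDITION MET, INTERMEDIATE, INTERMEDIATE-CONDITION MET
PHASE_COLLAPSE_STATUSES = {0x00, 0x04, 0x10, 0x14}


def decode_data_in(bhs: bytes) -> dict:
    """Decode the conditional fields of a Data-In BHS."""
    if len(bhs) != 48 or (bhs[0] & 0x3F) != 0x25:
        raise ValueError("not a Data-In BHS")
    flags = bhs[1]
    out = {
        "final": bool(flags & 0x80),              # F bit
        "ack_requested": bool(flags & 0x40),      # A bit
        "has_status": bool(flags & 0x01),         # S bit
        "data_sn": int.from_bytes(bhs[36:40], "big"),
        "buffer_offset": int.from_bytes(bhs[40:44], "big"),
    }
    if out["ack_requested"]:
        # LUN and TTT are meaningful only when acknowledgment is requested;
        # the initiator copies both into the acknowledgment PDU.
        out["lun"] = bhs[8:16]
        out["ttt"] = int.from_bytes(bhs[20:24], "big")
    if out["has_status"]:
        status = bhs[3]
        if status not in PHASE_COLLAPSE_STATUSES:
            raise ValueError("status 0x%02x cannot be phase-collapsed" % status)
        out["status"] = status
        out["stat_sn"] = int.from_bytes(bhs[24:28], "big")
        residual = int.from_bytes(bhs[44:48], "big")
        if flags & 0x04:                          # O bit
            out["residual"] = ("overflow", residual)
        elif flags & 0x02:                        # U bit
            out["residual"] = ("underflow", residual)
    return out
```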

The target signals its readiness to receive write data via the R2T PDU. The target also uses the R2T PDU to request retransmission of missing Data-Out PDUs. In both cases, the PDU format is the same, but an R2T PDU sent to request retransmission is called a Recovery R2T PDU. Figure 8-11 illustrates the iSCSI BHS of an R2T PDU. All fields marked with "." are reserved.

Figure 8-11. iSCSI R2T BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the I bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x31.
F bit This is always set to 1.
Reserved This is 23 bits long.
TotalAHSLength This is 8 bits long. It is always set to 0.
DataSegmentLength This is 24 bits long. It is always set to 0.
LUN This is 64 bits (8 bytes) in length.
ITT This is 32 bits in length.
TTT This is 32 bits long. It contains a tag that aids the target in associating Data-Out PDUs with this R2T PDU. All values are valid except 0xFFFFFFFF, which is reserved for use by initiators during first burst.
StatSN This is 32 bits long. It contains the StatSN that will be assigned to this command upon completion. This is the same as the ExpStatSN from the initiator's perspective.
ExpCmdSN This is 32 bits long.
MaxCmdSN This is 32 bits long.
R2TSN This is 32 bits long. It uniquely identifies each R2T PDU within the context of a single SCSI task. Each task is identified by the ITT. This field is incremented by 1 for each new R2T PDU transmitted within a SCSI task. A retransmitted R2T PDU carries the same R2TSN as the original PDU. This field is also incremented by 1 for each Data-In PDU transmitted during bidirectional command processing.
Buffer Offset This is 32 bits long. It indicates the position of the first byte of data requested by this PDU relative to the first byte of all the data transferred by the SCSI command.
Desired Data Transfer Length This is 32 bits long. This field indicates how much data should be transferred in response to this R2T PDU. This field is expressed in bytes. The value of this field cannot be 0 and cannot exceed the negotiated value of MaxBurstLength (see the iSCSI Login Parameters section of this chapter).
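The field list above can be sketched as a fixed 48-byte layout. The following Python fragment is a minimal illustration, not a complete implementation: the names R2T, pack_r2t, and unpack_r2t are hypothetical, and the sketch assumes that reserved bits are transmitted as 0 and that all multi-byte fields are in network byte order.

```python
import struct
from dataclasses import dataclass, astuple

R2T_OPCODE = 0x31
RESERVED_TTT = 0xFFFFFFFF
# Opcode byte, F-bit byte, 2 reserved bytes, TotalAHSLength, 3-byte
# DataSegmentLength, 8-byte LUN, then eight 32-bit fields (48 bytes total).
R2T_BHS = struct.Struct(">BB2xB3s8s8I")


@dataclass
class R2T:  # hypothetical container; field order follows the BHS
    lun: bytes
    itt: int
    ttt: int
    stat_sn: int
    exp_cmd_sn: int
    max_cmd_sn: int
    r2t_sn: int
    buffer_offset: int
    desired_length: int


def pack_r2t(pdu, max_burst_length):
    """Build the BHS, enforcing the TTT and transfer-length rules."""
    if pdu.ttt == RESERVED_TTT:
        raise ValueError("TTT 0xFFFFFFFF is reserved")
    if not 0 < pdu.desired_length <= max_burst_length:
        raise ValueError("length must be 1..MaxBurstLength")
    # F bit always 1; TotalAHSLength and DataSegmentLength always 0.
    return R2T_BHS.pack(R2T_OPCODE, 0x80, 0, bytes(3), *astuple(pdu))


def unpack_r2t(raw):
    """Parse a 48-byte BHS back into an R2T record."""
    opcode, _flags, _ahs, _dsl, *fields = R2T_BHS.unpack(raw)
    if (opcode & 0x3F) != R2T_OPCODE:
        raise ValueError("not an R2T PDU")
    return R2T(*fields)
```

A round trip through pack_r2t and unpack_r2t returns an equal record, which is a convenient sanity check on the field offsets.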

iSCSI supports PDU retransmission and PDU delivery acknowledgment on demand via the SNACK Request PDU. Each SNACK Request PDU specifies a contiguous set of missing single-type PDUs. Each set is called a run. Figure 8-12 illustrates the iSCSI BHS of a SNACK Request PDU. All fields marked with "." are reserved.

Figure 8-12. iSCSI SNACK Request BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the 1 bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x10.
F bit This is always set to 1.
Reserved This is 3 bits long.
Type This is 4 bits long. The SNACK Request PDU serves multiple functions, so RFC 3720 defines multiple SNACK Request PDU types. This field indicates the PDU function. The PDU format is the same for all SNACK Request PDUs regardless of type, but some fields contain type-specific information. All PDU types must be supported if an ErrorRecoveryLevel greater than 0 is negotiated during login (see the iSCSI Login Parameters section of this chapter). Currently, only four PDU types are defined (see Table 8-4). All other types are reserved.
Reserved This is 16 bits long.
TotalAHSLength This is 8 bits long.
DataSegmentLength This is 24 bits long.
LUN or Reserved This is 64 bits (8 bytes) long. It contains a LUN if the PDU type is DataACK. The value in this field is copied from the LUN field of the Data-In PDU that requested the DataACK PDU. Otherwise, this field is reserved.
ITT or 0xFFFFFFFF This is 32 bits long. It is set to 0xFFFFFFFF if the PDU type is Status or DataACK. Otherwise, this field contains the ITT of the associated task.
TTT or SNACK Tag or 0xFFFFFFFF This is 32 bits long. It contains a TTT if the PDU type is DataACK. The value in this field is copied from the TTT field of the Data-In PDU that requested the DataACK PDU. This field contains a SNACK Tag if the PDU type is R-Data. Otherwise, this field is set to 0xFFFFFFFF.
Reserved This is 32 bits long.
ExpStatSN This is 32 bits long.
Reserved This is 64 bits (8 bytes) long.
BegRun or ExpDataSN This is 32 bits long. For Data/R2T and Status PDUs, this field contains the identifier (DataSN, R2TSN or StatSN) of the first PDU to be retransmitted. This value indicates the beginning of the run. Note that the SNACK Request does not request retransmission of data based on relative offset. Instead, one or more specific PDUs are requested. This contrasts with the FCP model. For DataACK PDUs, this field contains the initiator's ExpDataSN. All Data-In PDUs up to but not including the ExpDataSN are acknowledged by this field. For R-Data PDUs, this field must be set to 0. In this case, all unacknowledged Data-In PDUs are retransmitted. If no Data-In PDUs have been acknowledged, the entire read sequence is retransmitted beginning at DataSN 0. If some Data-In PDUs have been acknowledged, the first retransmitted Data-In PDU is assigned the first unacknowledged DataSN.
RunLength This is 32 bits long. For Data/R2T and Status PDUs, this field specifies the number of PDUs to retransmit. This field may be set to 0 to indicate that all PDUs with a sequence number equal to or greater than BegRun must be retransmitted. For DataACK and R-Data PDUs, this field must be set to 0.

Table 8-4 summarizes the SNACK Request PDU types that are currently defined in RFC 3720. All PDU types excluded from Table 8-4 are reserved.

Table 8-4. iSCSI SNACK Request PDU Types
Type	Name	Function
0	Data/R2T	Initiators use this PDU type to request retransmission of one or more Data-In or R2T PDUs. By contrast, targets use the Recovery R2T PDU to request retransmission of one or more Data-Out PDUs.
1	Status	Initiators use this PDU type to request retransmission of one or more Login Response PDUs or a SCSI Response PDU. By contrast, targets do not request retransmission of SCSI Command PDUs.
2	DataACK	Initiators use this PDU type to provide explicit, positive, cumulative acknowledgment for Data-In PDUs. This frees buffer space within the target device and enables efficient recovery of dropped PDUs during long read operations. By contrast, targets do not provide acknowledgment for Data-Out PDUs. This is not necessary because the SAM requires initiators to keep all write data in memory until a SCSI status of GOOD, CONDITION MET, or INTERMEDIATE-CONDITION MET is received.
3	R-Data	Initiators use this PDU type to request retransmission of one or more Data-In PDUs that need to be resegmented. The need for resegmentation occurs when the initiator's MaxRecvDataSegmentLength changes during read command processing. By contrast, targets use the Recovery R2T PDU to request retransmission of one or more Data-Out PDUs if the target's MaxRecvDataSegmentLength changes during write command processing. Even when resegmentation is not required, initiators use this PDU type. If a SCSI Response PDU is received before all associated Data-In PDUs are received, this PDU type must be used to request retransmission of the missing Data-In PDUs. In such a case, the associated SCSI Response PDU must be retransmitted after the Data-In PDUs are retransmitted. The SNACK Tag must be copied into the duplicate SCSI Response PDU to enable the initiator to discern between the duplicate SCSI Response PDUs.
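To make the notion of a run concrete, the following Python sketch derives (BegRun, RunLength) pairs from the sequence numbers an initiator has received and packs one Data/R2T SNACK Request BHS per run. The function names are hypothetical, and the sketch assumes that the initiator knows how many PDUs to expect and that reserved fields are transmitted as 0.

```python
import struct

SNACK_OPCODE = 0x10
SNACK_TYPE_DATA_R2T = 0
# Opcode, F bit + Type, 2 reserved bytes, TotalAHSLength, DataSegmentLength,
# LUN, ITT, TTT, 4 reserved bytes, ExpStatSN, 8 reserved bytes,
# BegRun, RunLength (48 bytes total).
SNACK_BHS = struct.Struct(">BB2xB3s8sII4xI8xII")


def missing_runs(received, expected_count):
    """Return a (BegRun, RunLength) pair for each contiguous gap."""
    have = set(received)
    runs, start = [], None
    for sn in range(expected_count):
        if sn not in have:
            if start is None:
                start = sn  # a new gap begins here
        elif start is not None:
            runs.append((start, sn - start))
            start = None
    if start is not None:  # gap extends to the end of the sequence
        runs.append((start, expected_count - start))
    return runs


def pack_data_r2t_snack(itt, exp_stat_sn, beg_run, run_length):
    """Pack a Data/R2T SNACK Request BHS for a single run."""
    flags = 0x80 | SNACK_TYPE_DATA_R2T  # F bit is always 1
    # For this type the LUN is reserved and the TTT is 0xFFFFFFFF.
    return SNACK_BHS.pack(SNACK_OPCODE, flags, 0, bytes(3), bytes(8),
                          itt, 0xFFFFFFFF, exp_stat_sn, beg_run, run_length)
```

For example, if PDUs 0, 1, 4, 5, and 8 of an expected 10 have arrived, missing_runs yields three runs, and each run would be requested with its own SNACK Request PDU.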

iSCSI initiators manage SCSI and iSCSI tasks via the TMF Request PDU. Figure 8-13 illustrates the iSCSI BHS of a TMF Request PDU. All fields marked with "." are reserved.

Figure 8-13. iSCSI TMF Request BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
I This is 1 bit.
Opcode This is 6 bits long. It is set to 0x02.
F bit This is always set to 1.
Function This is 7 bits long. It contains the TMF Request code of the function to be performed. iSCSI currently supports six of the TMFs defined in the SAM-2 specification and one TMF defined in RFC 3720 (see Table 8-5). All other TMF Request codes are reserved.
Reserved This is 16 bits long.
TotalAHSLength This is 8 bits long. It is always set to 0.
DataSegmentLength This is 24 bits long. It is always set to 0.
LUN or Reserved This is 64 bits (8 bytes) long. It contains a LUN if the TMF is ABORT TASK, ABORT TASK SET, CLEAR ACA, CLEAR TASK SET, or LOGICAL UNIT RESET. Otherwise, this field is reserved.
ITT This is 32 bits long. It contains the ITT assigned to this TMF command. This field does not contain the ITT of the task upon which the TMF command acts.
Referenced Task Tag (RTT) or 0xFFFFFFFF This is 32 bits long. If the TMF is ABORT TASK or TASK REASSIGN, this field contains the ITT of the task upon which the TMF command acts. Otherwise, this field is set to 0xFFFFFFFF.
CmdSN This is 32 bits long. It contains the CmdSN of the TMF command. TMF commands are numbered the same way SCSI read and write commands are numbered. This field does not contain the CmdSN of the task upon which the TMF command acts.
ExpStatSN This is 32 bits long.
RefCmdSN or Reserved This is 32 bits long. If the TMF is ABORT TASK, this field contains the CmdSN of the task upon which the TMF command acts. The case of linked commands is not explicitly described in RFC 3720. Presumably, this field should contain the highest CmdSN associated with the RTT. This field is reserved for all other TMF commands.
ExpDataSN or Reserved This is 32 bits long. It is used only if the TMF is TASK REASSIGN. Otherwise, this field is reserved. For read and bidirectional commands, this field contains the highest acknowledged DataSN plus one for Data-In PDUs. This is known as the data acknowledgment reference number (DARN). If no Data-In PDUs were acknowledged before connection failure, this field contains the value 0. The initiator must discard all unacknowledged Data-In PDUs for the affected task(s) after a connection failure. The target must retransmit all unacknowledged Data-In PDUs for the affected task(s) after connection allegiance is reassigned. For write commands and write data in bidirectional commands, this field is not used. The target simply requests retransmission of Data-Out PDUs as needed via the Recovery R2T PDU.
Reserved This is 64 bits long.

Table 8-5 summarizes the TMF Request codes that are currently supported by iSCSI. All TMF Request codes excluded from Table 8-5 are reserved.

Table 8-5. iSCSI TMF Request Codes
TMF Code	TMF Name	Description
1	ABORT TASK	This function instructs the Task Manager of the specified LUN to abort the task identified in the RTT field. This TMF command cannot be used to terminate TMF commands.
2	ABORT TASK SET	This function instructs the Task Manager of the specified LUN to abort all tasks issued within the associated session. This function does not affect tasks instantiated by other initiators.
3	CLEAR ACA	This function instructs the Task Manager of the specified LUN to clear the ACA condition. This has the same effect as ABORT TASK for all tasks with the ACA attribute. Tasks that do not have the ACA attribute are not affected.
4	CLEAR TASK SET	This function instructs the Task Manager of the specified LUN to abort all tasks identified by the task set type (TST) field in the SCSI Control Mode Page. This function can abort all tasks from a single initiator or all tasks from all initiators.
5	LOGICAL UNIT RESET	This function instructs the Task Manager of the specified LUN to abort all tasks, clear all ACA conditions, release all reservations, reset the logical unit's operating mode to its default state and set a Unit Attention condition. In the case of hierarchical LUNs, these actions also must be taken for each dependent logical unit.
6	TARGET WARM RESET	This function instructs the Task Manager of LUN 0 to perform a LOGICAL UNIT RESET for every LUN accessible via the target port through which the command is received. This function is subject to SCSI access controls and also may be subject to iSCSI access controls.
7	TARGET COLD RESET	This function instructs the Task Manager of LUN 0 to perform a LOGICAL UNIT RESET for every LUN accessible via the target port through which the command is received. This function is not subject to SCSI access controls but may be subject to iSCSI access controls. This function also instructs the Task Manager of LUN 0 to terminate all TCP connections for the target port through which the command is received.
8	TASK REASSIGN	This function instructs the Task Manager of the specified LUN to reassign connection allegiance for the task identified in the RTT field. Connection allegiance is reassigned to the TCP connection on which the TASK REASSIGN command is received. This function is supported only if the session supports an ErrorRecoveryLevel of two. This function must always be transmitted as an immediate command.
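The function-dependent field rules described for the TMF Request PDU can be collected in one place. The Python sketch below is illustrative only: pack_tmf_request and its parameter names are hypothetical, reserved fields are assumed to be transmitted as 0, and the LUN rule follows the LUN or Reserved field description given earlier.

```python
import struct
from enum import IntEnum

TMF_REQUEST_OPCODE = 0x02
NO_TAG = 0xFFFFFFFF
# Opcode, F bit + Function, 2 reserved bytes, TotalAHSLength,
# DataSegmentLength, LUN, ITT, RTT, CmdSN, ExpStatSN, RefCmdSN,
# ExpDataSN, 8 reserved bytes (48 bytes total).
TMF_BHS = struct.Struct(">BB2xB3s8s6I8x")


class TMF(IntEnum):
    ABORT_TASK = 1
    ABORT_TASK_SET = 2
    CLEAR_ACA = 3
    CLEAR_TASK_SET = 4
    LOGICAL_UNIT_RESET = 5
    TARGET_WARM_RESET = 6
    TARGET_COLD_RESET = 7
    TASK_REASSIGN = 8


LUN_FUNCTIONS = {TMF.ABORT_TASK, TMF.ABORT_TASK_SET, TMF.CLEAR_ACA,
                 TMF.CLEAR_TASK_SET, TMF.LOGICAL_UNIT_RESET}
RTT_FUNCTIONS = {TMF.ABORT_TASK, TMF.TASK_REASSIGN}


def pack_tmf_request(function, itt, cmd_sn, exp_stat_sn, *, lun=bytes(8),
                     referenced_itt=NO_TAG, ref_cmd_sn=0, exp_data_sn=0,
                     immediate=False):
    function = TMF(function)  # rejects reserved function codes
    if function is TMF.TASK_REASSIGN:
        immediate = True  # TASK REASSIGN is always an immediate command
    byte0 = (0x40 if immediate else 0) | TMF_REQUEST_OPCODE
    byte1 = 0x80 | function  # F bit is always 1
    lun_field = lun if function in LUN_FUNCTIONS else bytes(8)
    rtt = referenced_itt if function in RTT_FUNCTIONS else NO_TAG
    ref = ref_cmd_sn if function is TMF.ABORT_TASK else 0
    eds = exp_data_sn if function is TMF.TASK_REASSIGN else 0
    return TMF_BHS.pack(byte0, byte1, 0, bytes(3), lun_field,
                        itt, rtt, cmd_sn, exp_stat_sn, ref, eds)
```

Arguments that do not apply to the requested function are ignored rather than rejected, so a caller cannot accidentally populate a reserved field.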

Each TMF Request PDU precipitates one TMF Response PDU. Figure 8-14 illustrates the iSCSI BHS of a TMF Response PDU. All fields marked with "." are reserved.

Figure 8-14. iSCSI TMF Response BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the 1 bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x22.
F bit This is always set to 1.
Reserved This is 7 bits long.
Response This is 8 bits long. This field indicates the completion status for the TMF command identified in the ITT field. RFC 3720 currently defines eight TMF Response codes (see Table 8-6). All other values are reserved.
Reserved This is 8 bits long.
TotalAHSLength This is 8 bits long. It is always set to 0.
DataSegmentLength This is 24 bits long. It is always set to 0.
Reserved This is 64 bits (8 bytes) long.
ITT This is 32 bits long.
Reserved This is 32 bits long.
StatSN This is 32 bits long.
ExpCmdSN This is 32 bits long.
MaxCmdSN This is 32 bits long.
Reserved This is 96 bits (12 bytes) long.

Table 8-6. iSCSI TMF Response Codes
Response Code	Response Name	Description
0	Function Complete	The TMF command completed successfully.
1	Task Does Not Exist	The task identified in the RTT field of the TMF request PDU does not exist. This response is valid only if the CmdSN in the RefCmdSN field in the TMF request PDU is outside the valid CmdSN window. If the CmdSN in the RefCmdSN field in the TMF request PDU is within the valid CmdSN window, a function complete response must be sent.
2	LUN Does Not Exist	The LUN identified in the LUN or Reserved field of the TMF request PDU does not exist.
3	Task Still Allegiant	Logout of the old connection has not completed. A task may not be reassigned until logout of the old connection successfully completes with reason code "remove the connection for recovery".
4	Task Allegiance Reassignment Not Supported	The session does not support ErrorRecoveryLevel 2.
5	TMF Not Supported	The target does not support the requested TMF command. Some TMF commands are optional for targets.
6	Function Authorization Failed	The initiator is not authorized to execute the requested TMF command.
255	Function Rejected	The initiator attempted an illegal TMF request (such as ABORT TASK for a different TMF task).

Table 8-6 summarizes the TMF Response codes that are currently supported by iSCSI. All TMF Response codes excluded from Table 8-6 are reserved.

The Reject PDU signals an error condition and rejects the PDU that caused the error. The Data segment (not shown in Figure 8-15) must contain the header of the PDU that caused the error. If a Reject PDU causes a task to terminate, a SCSI Response PDU with status CHECK CONDITION must be sent. Figure 8-15 illustrates the iSCSI BHS of a Reject PDU. All fields marked with "." are reserved.

Figure 8-15. iSCSI Reject BHS Format

A brief description of each field follows. The description of each field is abbreviated unless a field is used in a PDU-specific manner:

Reserved This is 1 bit.
Reserved This is the 1 bit redefined as Reserved.
Opcode This is 6 bits long. It is set to 0x3F.
F bit This is always set to 1.
Reserved This is 7 bits long.
Reason This is 8 bits long. This field indicates the reason the erroneous PDU is being rejected. RFC 3720 currently defines 11 Reject Reason codes (see Table 8-7). All other values are reserved.
Reserved This is 8 bits long.
TotalAHSLength This is 8 bits long. It is always set to 0.
DataSegmentLength This is 24 bits long.
Reserved This is 64 bits (8 bytes) long.
ITT This is 32 bits long. It is set to 0xFFFFFFFF.
Reserved This is 32 bits long.
StatSN This is 32 bits long.
ExpCmdSN This is 32 bits long.
MaxCmdSN This is 32 bits long.
DataSN/R2TSN or Reserved This is 32 bits long. This field is valid only when rejecting a Data/R2T SNACK Request PDU. The Reject Reason code must be 0x04 (Protocol Error). This field indicates the DataSN or R2TSN of the next Data-In or R2T PDU to be transmitted by the target. Otherwise, this field is reserved.
Reserved This is 64 bits (8 bytes) long.

Table 8-7. iSCSI Reject Reason Codes
Reason Code	Reason Name
0x02	Data-Digest Error
0x03	SNACK Reject
0x04	Protocol Error
0x05	Command Not Supported
0x06	Immediate Command Rejected (Too Many Immediate Commands)
0x07	Task In Progress
0x08	Invalid DataACK
0x09	Invalid PDU Field
0x0a	Long Operation Reject (Cannot Generate TTT, Out Of Resources)
0x0b	Negotiation Reset
0x0c	Waiting For Logout

Table 8-7 summarizes the Reject Reason codes that are currently supported by iSCSI. All Reject Reason codes excluded from Table 8-7 are reserved.
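As an illustration of how a recipient might interpret a Reject PDU, the following Python sketch decodes the Reason field and extracts the rejected header from the Data segment. The function name and the returned dictionary keys are hypothetical. The sketch assumes that no Header-Digest is in use, so the Data segment begins immediately after the 48-byte BHS.

```python
import struct

REJECT_OPCODE = 0x3F
SNACK_OPCODE = 0x10
REJECT_REASONS = {
    0x02: "Data-Digest Error",
    0x03: "SNACK Reject",
    0x04: "Protocol Error",
    0x05: "Command Not Supported",
    0x06: "Immediate Command Rejected",
    0x07: "Task In Progress",
    0x08: "Invalid DataACK",
    0x09: "Invalid PDU Field",
    0x0A: "Long Operation Reject",
    0x0B: "Negotiation Reset",
    0x0C: "Waiting For Logout",
}
# Opcode, F bit, Reason, reserved byte, TotalAHSLength, DataSegmentLength,
# 8 reserved bytes, ITT, 4 reserved bytes, StatSN, ExpCmdSN, MaxCmdSN,
# DataSN/R2TSN, 8 reserved bytes (48 bytes total).
REJECT_BHS = struct.Struct(">BBBxB3s8xI4x4I8x")


def parse_reject(pdu):
    """Decode a Reject PDU given as BHS plus Data segment."""
    (opcode, _flags, reason, _ahs, dsl, _itt,
     stat_sn, _exp_cmd_sn, _max_cmd_sn, sn) = REJECT_BHS.unpack(pdu[:48])
    if (opcode & 0x3F) != REJECT_OPCODE:
        raise ValueError("not a Reject PDU")
    length = int.from_bytes(dsl, "big")
    rejected = pdu[48:48 + min(length, 48)]  # header of the offending PDU
    # DataSN/R2TSN is meaningful only for a rejected Data/R2T SNACK
    # Request (type 0) with reason Protocol Error.
    is_data_r2t_snack = (len(rejected) >= 2
                         and (rejected[0] & 0x3F) == SNACK_OPCODE
                         and (rejected[1] & 0x0F) == 0)
    return {
        "reason": REJECT_REASONS.get(reason, "Reserved"),
        "stat_sn": stat_sn,
        "next_data_or_r2t_sn": sn if reason == 0x04 and is_data_r2t_snack else None,
        "rejected_header": rejected,
    }
```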

The preceding discussion of iSCSI PDU formats is simplified for the sake of clarity. Comprehensive exploration of all the iSCSI PDUs and their variations is outside the scope of this book. For more information, readers are encouraged to consult IETF RFC 3720 and the ANSI T10 SAM-2, SAM-3, SPC-2, and SPC-3 specifications.

iSCSI Login Parameters

During the Login Phase, security and operating parameters are exchanged as text key-value pairs. As previously stated, text keys are encapsulated in the Data segment of the Login Request and Login Response PDUs. Some operating parameters may be re-negotiated after the Login Phase completes (during the Full Feature Phase) via the Text Request and Text Response PDUs. However, most operating parameters remain unchanged for the duration of a session. Security parameters may not be re-negotiated during an active session. Some text keys have a session-wide scope, and others have a connection-specific scope. Some text keys may be exchanged only during negotiation of the leading connection for a new session. Some text keys require a response (negotiation), and others do not (declaration). Currently, RFC 3720 defines 22 operational text keys. RFC 3720 also defines a protocol extension mechanism that enables the use of public and private text keys that are not defined in RFC 3720. This section describes the standard operational text keys and the extension mechanism. The format of all text key-value pairs is:

<key name>=<list of values>
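A Data segment carrying several keys can be encoded and decoded with a few lines of code. The Python sketch below is a simplified illustration with hypothetical function names. It relies on two details of RFC 3720 that are not spelled out above: each key-value pair is terminated by a NUL byte, and multiple values in a list are separated by commas.

```python
def encode_text_keys(pairs):
    """Encode {key: [values]} as NUL-terminated key=value pairs."""
    chunks = []
    for key, values in pairs.items():
        chunks.append(f"{key}={','.join(values)}".encode("utf-8") + b"\x00")
    return b"".join(chunks)


def decode_text_keys(segment):
    """Decode a Data segment into {key: [values]}."""
    result = {}
    for item in segment.split(b"\x00"):
        if not item:
            continue  # skips the final terminator and any zero padding
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            raise ValueError(f"malformed text key: {item!r}")
        result[key] = value.split(",")
    return result
```

For example, an initiator offering both digest options in order of preference would send HeaderDigest=CRC32C,None, which decode_text_keys returns as a two-element list.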

The SessionType key declares the type of iSCSI session. Only initiators send this key. This key must be sent only during the Login Phase on the leading connection. The valid values are Normal and Discovery. The default value is Normal. The scope is session-wide.

The HeaderDigest and DataDigest keys negotiate the use of the Header-Digest segment and the Data-Digest segment, respectively. Initiators and targets send these keys. These keys may be sent only during the Login Phase. Values that must be supported include CRC32C and None. Other public and private algorithms may be supported. The default value is None for both keys. The chosen digest must be used in every PDU sent during the Full Feature Phase. The scope is connection-specific.

As discussed in Chapter 3, the SendTargets key is used by initiators to discover targets during a Discovery session. This key may also be sent by initiators during a Normal session to discover changed or additional paths to a known target. Sending this key during a Normal session is fruitful only if the target configuration changes after the Login Phase. This is because, during a Discovery session, a target network entity must return all target names, sockets, and TPGTs for all targets that the requesting initiator is permitted to access. Additionally, path changes occurring during the Login Phase of a Normal session are handled via redirection. This key may be sent only during the Full Feature Phase. The scope is session-wide.

The TargetName key declares the iSCSI device name of one or more target devices within the responding network entity. This key may be sent by targets only in response to a SendTargets command. This key may be sent by initiators only during the Login Phase of a Normal session, and the key must be included in the leading Login Request PDU for each connection. The scope is session-wide.

The TargetAddress key declares the network addresses, TCP ports, and TPGTs of the target device to the initiator device. An address may be given in the form of DNS host name, IPv4 address, or IPv6 address. The TCP port may be omitted if the default port of 3260 is used. Only targets send this key. This key is usually sent in response to a SendTargets command, but it may be sent in a Login Response PDU to redirect an initiator. Therefore, this key may be sent during any phase. The scope is session-wide.

The InitiatorName key declares the iSCSI device name of the initiator device within the initiating network entity. This key identifies the initiator device to the target device so that access controls can be implemented. Only initiators send this key. This key may be sent only during the Login Phase, and the key must be included in the leading Login Request PDU for each connection. The scope is session-wide.

The InitiatorAlias key declares the optional human-friendly name of the initiator device to the target for display in relevant user interfaces. Only initiators send this key. This key is usually sent in a Login Request PDU for a Normal session, but it may be sent during the Full Feature Phase as well. The scope is session-wide.

The TargetAlias key declares the optional human-friendly name of the target device to the initiator for display in relevant user interfaces. Only targets send this key. This key usually is sent in a Login Response PDU for a Normal session, but it may be sent during the Full Feature Phase as well. The scope is session-wide.

The TargetPortalGroupTag key declares the TPGT of the target port to the initiator port. Only targets send this key. This key must be sent in the first Login Response PDU of a Normal session unless the first Login Response PDU redirects the initiator to another TargetAddress. The range of valid values is 0 to 65,535. The scope is session-wide.

The ImmediateData and InitialR2T keys negotiate support for immediate data and unsolicited data, respectively. Immediate data may not be sent unless both devices support immediate data. Unsolicited data may not be sent unless both devices support unsolicited data. Initiators and targets send these keys. These keys may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default settings support immediate data but not unsolicited data. The scope is session-wide for both keys.

The MaxOutstandingR2T key negotiates the maximum number of R2T PDUs that may be outstanding simultaneously for a single task. This key does not include the implicit R2T PDU associated with unsolicited data. Each R2T PDU is considered outstanding until the last Data-Out PDU is transferred (initiator's perspective) or received (target's perspective). A sequence timeout can also terminate the lifespan of an R2T PDU. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The range of valid values is 1 to 65,535. The default value is 1. The scope is session-wide.

The MaxRecvDataSegmentLength key declares the maximum amount of data that a receiver (initiator or target) can receive in a single iSCSI PDU. Initiators and targets send this key. This key may be sent during any phase of any session type and is usually sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 8,192. The scope is connection-specific.

The MaxBurstLength key negotiates the maximum amount of data that a receiver (initiator or target) can receive in a single iSCSI sequence. This value may exceed the value of MaxRecvDataSegmentLength, which means that more than one PDU may be sent in response to an R2T PDU. This contrasts with the FC model. For write commands, this key applies only to solicited data. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 262,144. The scope is session-wide.

The FirstBurstLength key negotiates the maximum amount of data that a target can receive in a single iSCSI sequence of unsolicited data (including immediate data). Thus, the value of this key minus the amount of immediate data received with the SCSI command PDU yields the amount of unsolicited data that the target can receive in the same sequence. If neither immediate data nor unsolicited data is supported within the session, this key is invalid. The value of this key cannot exceed the target's MaxBurstLength. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in bytes. The range of valid values is 512 to 16,777,215. The default value is 65,536. The scope is session-wide.
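The interaction among FirstBurstLength, MaxBurstLength, and MaxRecvDataSegmentLength is easiest to see with numbers. The following Python sketch plans the R2T PDUs a target might issue for the solicited portion of a write. The function name plan_r2ts is hypothetical, and the sketch assumes in-order sequences with each R2T requesting as much data as MaxBurstLength allows.

```python
def plan_r2ts(total_length, unsolicited_length, max_burst_length,
              target_max_recv_dsl):
    """Return (R2TSN, Buffer Offset, Desired Length, Data-Out PDU count)."""
    plan = []
    offset = unsolicited_length  # solicited data starts after the first burst
    r2t_sn = 0
    while offset < total_length:
        length = min(max_burst_length, total_length - offset)
        # Ceiling division: each Data-Out PDU carries at most
        # MaxRecvDataSegmentLength bytes as declared by the target.
        pdus = -(-length // target_max_recv_dsl)
        plan.append((r2t_sn, offset, length, pdus))
        offset += length
        r2t_sn += 1
    return plan


# 1 MiB write; 65,536 bytes already delivered as unsolicited data;
# MaxBurstLength 262,144; target MaxRecvDataSegmentLength 8,192.
for entry in plan_r2ts(1_048_576, 65_536, 262_144, 8_192):
    print(entry)
```

Under these assumed values, the remaining 983,040 bytes require four R2T PDUs: three that each request 262,144 bytes (32 Data-Out PDUs apiece) and a final one that requests 196,608 bytes (24 Data-Out PDUs).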

The MaxConnections key negotiates the maximum number of TCP connections supported by a session. Initiators and targets send this key. Discovery sessions are restricted to one TCP connection, so this key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The range of valid values is 1 to 65,535. The default value is 1. The scope is session-wide.

The DefaultTime2Wait key negotiates the amount of time that must pass before attempting to log out a failed connection. Task reassignment may not occur until after the failed connection is logged out. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in seconds. The range of valid values is 0 to 3600. The default value is 2. A value of 0 indicates that logout may be attempted immediately upon detection of a failed connection. The scope is session-wide.

The DefaultTime2Retain key negotiates the amount of time that task state information must be retained for active tasks after DefaultTime2Wait expires. When a connection fails, this key determines how much time is available to complete task reassignment. If the failed connection is the last (or only) connection in a session, this key also represents the session timeout value. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. This key is expressed in seconds. The range of valid values is 0 to 3600. The default value is 20. A value of 0 indicates that task state information is discarded immediately upon detection of a failed connection. The scope is session-wide.
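The two timers combine to define a recovery window after a connection failure. The Python sketch below shows the arithmetic under the default values. The function name is hypothetical, and times are expressed in seconds relative to an arbitrary clock.

```python
def recovery_window(failure_time, time2wait=2, time2retain=20):
    """Return (earliest logout time, time at which task state expires)."""
    for value in (time2wait, time2retain):
        if not 0 <= value <= 3600:
            raise ValueError("timer values must be in the range 0..3600")
    earliest_logout = failure_time + time2wait
    state_expires = earliest_logout + time2retain
    return earliest_logout, state_expires
```

With the defaults, a connection that fails at time 100 may be logged out no earlier than time 102, and task reassignment must complete before time 122.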

The DataPDUInOrder key negotiates in-order transmission of data PDUs within a sequence. Because TCP guarantees in-order delivery, the only way for PDUs of a given sequence to arrive out of order is to be transmitted out of order. Initiators and targets send this key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default value requires in-order transmission. The scope is session-wide.

The DataSequenceInOrder key negotiates in-order transmission of data PDU sequences within a command. For sessions that support in-order transmission of sequences and retransmission of missing data PDUs (ErrorRecoveryLevel greater than zero), the MaxOutstandingR2T key must be set to 1. This is because requests for retransmission may be sent only for the lowest outstanding R2TSN, and all PDUs already received for a higher outstanding R2TSN must be discarded until retransmission succeeds. This is inefficient. It undermines the goal of multiple outstanding R2T PDUs. Sessions that do not support retransmission must terminate the appropriate task upon detection of a missing data PDU, and all data PDUs must be retransmitted via a new task. Thus, no additional inefficiency is introduced by supporting multiple outstanding R2T PDUs when the ErrorRecoveryLevel key is set to 0. Initiators and targets send the DataSequenceInOrder key. This key may be sent only during Normal sessions and must be sent during the Login Phase on the leading connection. The default value requires in-order transmission. The scope is session-wide.

The ErrorRecoveryLevel key negotiates the combination of recovery mechanisms supported by the session. Initiators and targets send this key. This key may be sent only during the Login Phase on the leading connection. The range of valid values is 0 to 2. The default value is 0. The scope is session-wide.

The OFMarker and IFMarker keys negotiate support for PDU boundary detection via the fixed interval markers (FIM) scheme. Initiators and targets send these keys. These keys may be sent during any session type and must be sent during the Login Phase. The default setting is disabled for both keys. The scope is connection-specific.

The OFMarkInt and IFMarkInt keys negotiate the interval for the FIM scheme. These keys are valid only if the FIM scheme is used. Initiators and targets send these keys. These keys may be sent during any session type and must be sent during the Login Phase. These keys are expressed in 4-byte words. The range of valid values is 1 to 65,535. The default value is 2048 for both keys. The scope is connection-specific.

A mechanism is defined to enable implementers to extend the iSCSI protocol via additional key-value pairs. These are known as private and public extension keys. Support for private and public extension keys is optional. Private extension keys are proprietary. All private extension keys begin with "X-" to convey their proprietary status. Public extension keys must be registered with the IANA and must also be described in an informational RFC published by the IETF. All public extension keys begin with "X#" to convey their registered status. Private extension keys may be used only in Normal sessions but are not limited by phase. Public extension keys may be used in either type of session and are not limited by phase. Initiators and targets may send private and public extension keys. The scope of each extension key is determined by the rules of that key. The format of private extension keys is flexible but generally takes the form:

X-ReversedVendorDomainName.KeyName

The format of public extension keys is mandated as:

X#IANA-Registered-String
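The prefix and session-type rules for extension keys can be expressed as a short check. The Python sketch below is illustrative; the function names and the example key names are hypothetical.

```python
def classify_key(name):
    """Classify a text key by its prefix."""
    if name.startswith("X-"):
        return "private"
    if name.startswith("X#"):
        return "public"
    return "standard"


def extension_key_allowed(name, session_type):
    """Check whether an extension key may be used in a session type."""
    kind = classify_key(name)
    if kind == "standard":
        raise ValueError("not an extension key")
    if session_type not in ("Normal", "Discovery"):
        raise ValueError("unknown session type")
    # Private keys are limited to Normal sessions; public keys are not.
    return kind == "public" or session_type == "Normal"
```

For example, extension_key_allowed("X-com.example.QueueDepth", "Discovery") returns False, whereas the same key is permitted in a Normal session.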

For more information about iSCSI text key-value pairs, readers are encouraged to consult IETF RFC 3720.

iSCSI Delivery Mechanisms

The checksum used by TCP does not detect all errors. Therefore, iSCSI must use its own CRC-based digests (as does FC) to ensure the utmost data integrity. This has two implications:

When a PDU is dropped due to digest error, the iSCSI protocol must be able to detect the beginning of the PDU that follows the dropped PDU. Because iSCSI PDUs are variable in length, iSCSI recipients depend on the BHS to determine the total length of a PDU. The BHS of the dropped PDU cannot always be trusted (for example, if dropped due to CRC failure), so an alternate method of determining the total length of the dropped PDU is required. Additionally, when a TCP packet containing an iSCSI header is dropped and retransmitted, the received TCP packets of the affected iSCSI PDU and the iSCSI PDUs that follow cannot be optimally buffered. An alternate method of determining the total length of the affected PDU resolves this issue.
To avoid SCSI task abortion and re-issuance in the presence of digest errors, the iSCSI protocol must support PDU retransmission. An iSCSI device may retransmit dropped PDUs (optimal) or abort each task affected by a digest error (suboptimal).
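The digest algorithm that every implementation must support is CRC32C (see the HeaderDigest and DataDigest keys). The following Python sketch is a minimal table-driven CRC32C written for illustration; real implementations typically rely on hardware offload or an optimized library. The function names are hypothetical, and the handling of digest byte order on the wire is omitted.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data):
    """Compute the CRC32C of a bytes-like object."""
    crc = 0xFFFFFFFF  # initial value is all ones
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF  # final inversion


def digest_matches(segment, received_digest):
    """Return True if the received digest matches the segment."""
    return crc32c(segment) == received_digest
```

A recipient that finds digest_matches to be False must treat the PDU as dropped, which is what triggers the two requirements listed above.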

Additionally, problems can occur in a routed IP network that cause a TCP connection or an iSCSI session to fail. Currently, this does not occur frequently in iSCSI environments because most iSCSI deployments are single-subnet environments. However, iSCSI is designed in such a way that it supports operation in routed IP networks. Specifically, iSCSI supports connection and session recovery to prevent IP network problems from affecting the SAL. This enables iSCSI users to realize the full potential of TCP/IP. RFC 3720 defines several delivery mechanisms to meet all these requirements.

Error Recovery Classes

RFC 3720 permits each iSCSI implementation to select its own recovery capabilities. Recovery capabilities are grouped into classes to simplify implementation and promote interoperability. Four classes of recoverability are defined:

Recovery within a command (lowest class)
Recovery within a connection
Recovery of a connection
Recovery of a session (highest class)

RFC 3720 mandates the minimum recovery class that may be used for each type of error. RFC 3720 does not provide a comprehensive list of errors, but does provide representative examples. An iSCSI implementation may use a higher recovery class than the minimum required for a given error. Both initiator and target are allowed to escalate the recovery class. The number of tasks that are potentially affected increases with each higher class. So, use of the lowest possible class is encouraged. The two lowest classes may be used in only the Full Feature Phase of a session. Table 8-8 lists some example scenarios for each recovery class.

Table 8-8. iSCSI Error Recovery Classes
Class Name	Scope of Effect	Example Error Scenarios
Recovery Within A Command	Low	Lost Data-In PDU, Lost Data-Out PDU, Lost R2T PDU
Recovery Within A Connection	Medium-Low	Request Acknowledgement Timeout, Response Acknowledgement Timeout, Response Timeout
Recovery Of A Connection	Medium-High	Connection Failure (see Chapter 7, "OSI Transport Layer"), Explicit Notification From Target Via Asynchronous Message PDU
Recovery Of A Session	High	Failure Of All Connections Coupled With Inability To Recover One Or More Connections

Error Recovery Hierarchy

RFC 3720 defines three error recovery levels that map to the four error recovery classes. The three recovery levels are referred to as the Error Recovery Hierarchy. During the Login Phase, the recovery level is negotiated via the ErrorRecoveryLevel key. Each recovery level is a superset of the capabilities of the lower level. Thus, support for a higher level indicates a more sophisticated iSCSI implementation. Table 8-9 summarizes the mapping of levels to classes.

Table 8-9. iSCSI Error Recovery Hierarchy
ErrorRecoveryLevel	Implementation Complexity	Error Recovery Classes
0	Low	Recovery Of A Session
1	Medium	Recovery Within A Command, Recovery Within A Connection
2	High	Recovery Of A Connection

At first glance, the mapping of levels to classes may seem counter-intuitive. The mapping is easier to understand after examining the implementation complexity of each recovery class. The goal of iSCSI recovery is to avoid affecting the SAL. However, an iSCSI implementation may choose not to recover from errors. In this case, recovery is left to the SCSI application client. Such is the case with ErrorRecoveryLevel 0, which simply terminates the failed session and creates a new session. The SCSI application client is responsible for reissuing all affected tasks. Therefore, ErrorRecoveryLevel 0 is the simplest to implement. Recovery within a command and recovery within a connection both require iSCSI to retransmit one or more PDUs. Therefore, ErrorRecoveryLevel 1 is more complex to implement. Recovery of a connection requires iSCSI to maintain state for one or more tasks so that task reassignment may occur. Recovery of a connection also requires iSCSI to retransmit one or more PDUs on the new connection. Therefore, ErrorRecoveryLevel 2 is the most complex to implement. Only ErrorRecoveryLevel 0 must be supported. Support for ErrorRecoveryLevel 1 and higher is encouraged but not required.
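The superset relationship between levels can be sketched in a few lines of Python. This is only an illustration of Table 8-9; the names `RecoveryClass`, `CLASSES_ADDED_BY_LEVEL`, and `supported_classes` are hypothetical and do not come from RFC 3720.

```python
from enum import Enum


class RecoveryClass(Enum):
    WITHIN_COMMAND = "Recovery Within A Command"
    WITHIN_CONNECTION = "Recovery Within A Connection"
    OF_CONNECTION = "Recovery Of A Connection"
    OF_SESSION = "Recovery Of A Session"


# Classes that each ErrorRecoveryLevel adds, as listed in Table 8-9.
CLASSES_ADDED_BY_LEVEL = {
    0: {RecoveryClass.OF_SESSION},
    1: {RecoveryClass.WITHIN_COMMAND, RecoveryClass.WITHIN_CONNECTION},
    2: {RecoveryClass.OF_CONNECTION},
}


def supported_classes(error_recovery_level):
    """Return every recovery class available at the given level.

    Each level is a superset of the lower levels, so the result is the
    union of the classes added by this level and all levels below it.
    """
    if error_recovery_level not in CLASSES_ADDED_BY_LEVEL:
        raise ValueError("ErrorRecoveryLevel must be 0, 1, or 2")
    supported = set()
    for level in range(error_recovery_level + 1):
        supported |= CLASSES_ADDED_BY_LEVEL[level]
    return supported
```

For example, `supported_classes(1)` contains session recovery plus the two classes that rely on PDU retransmission, but not recovery of a connection.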

PDU Boundary Detection

To determine the total length of a PDU without relying solely on the iSCSI BHS, RFC 3720 permits the use of message synchronization schemes. Even though RFC 3720 encourages the use of such schemes, no such scheme is mandated. That said, a practical requirement for such schemes arises from the simultaneous implementation of header digests and ErrorRecoveryLevel 1 or higher. As a reference for implementers, RFC 3720 provides the details of a scheme called fixed interval markers (FIM). The FIM scheme works by inserting an 8-byte marker into the TCP stream at fixed intervals. Both the initiator and target may insert the markers. Each marker contains two copies of a 4-byte pointer that indicates the starting byte number of the next iSCSI PDU. Support for the FIM scheme is negotiated during the Login Phase.
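The following Python sketch illustrates the idea behind fixed interval markers in a deliberately simplified form. It is not the RFC 3720 wire format. It assumes that the interval is counted in payload bytes (markers excluded), that the pointer is the number of payload bytes between the marker and the start of the next PDU, and that the pointer is encoded big-endian. The function names are hypothetical.

```python
import struct

MARKER_LEN = 8  # two copies of a 4-byte pointer


def insert_markers(pdus, interval):
    """Return the byte stream for `pdus` with a marker after every
    `interval` payload bytes (simplified model, see assumptions above)."""
    payload = b"".join(pdus)

    # Start offset of each PDU in the marker-free stream, plus the offset
    # at which the next (not yet sent) PDU would begin.
    starts = []
    pos = 0
    for pdu in pdus:
        starts.append(pos)
        pos += len(pdu)
    starts.append(pos)

    out = bytearray()
    for chunk_start in range(0, len(payload), interval):
        chunk = payload[chunk_start:chunk_start + interval]
        out += chunk
        if len(chunk) == interval:
            marker_pos = chunk_start + interval
            next_start = next(s for s in starts if s >= marker_pos)
            pointer = next_start - marker_pos
            out += struct.pack(">II", pointer, pointer)
    return bytes(out)


def read_marker(stream, interval, index):
    """Return the pointer stored in the index-th marker (0-based)."""
    offset = (index + 1) * interval + index * MARKER_LEN
    first, second = struct.unpack_from(">II", stream, offset)
    if first != second:
        raise ValueError("marker copies disagree; marker is corrupt")
    return first
```

Because markers sit at positions the receiver can compute without reading any BHS, a receiver that has lost track of PDU boundaries can read the next marker and skip ahead to a known PDU start.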

PDU Retransmission

iSCSI guarantees in-order data delivery to the SAL. When PDUs arrive out of order due to retransmission, the iSCSI protocol does not reorder PDUs per se. Upon receipt of all TCP packets composing an iSCSI PDU, iSCSI places the ULP data in an application buffer. The position of the data within the application buffer is determined by the Buffer Offset field in the BHS of the Data-In/Data-Out PDU. When an iSCSI digest error or a dropped or delayed TCP packet causes a processing delay for a given iSCSI PDU, the Buffer Offset field in the BHS of other iSCSI data PDUs that are received error-free enables continued processing without delay regardless of PDU transmission order. Thus, iSCSI PDUs do not need to be reordered before processing. Of course, the use of a message synchronization scheme is required under certain circumstances for PDU processing to continue in the presence of one or more dropped or delayed PDUs. Otherwise, the BHS of subsequent PDUs cannot be read. Assuming this requirement is met, PDUs can be processed in any order.
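A minimal Python sketch of placement by Buffer Offset follows. The function name `place_data` and the use of a `bytearray` as the application buffer are illustrative assumptions, not part of any iSCSI implementation.

```python
def place_data(buffer, buffer_offset, data):
    """Copy the data segment of one data PDU into the application buffer
    at the position given by the PDU's Buffer Offset field."""
    end = buffer_offset + len(data)
    if buffer_offset < 0 or end > len(buffer):
        raise ValueError("data segment falls outside the application buffer")
    buffer[buffer_offset:end] = data


# Data PDUs processed in a different order than they were transmitted.
app_buffer = bytearray(12)
for offset, segment in [(8, b"IJKL"), (0, b"ABCD"), (4, b"EFGH")]:
    place_data(app_buffer, offset, segment)
```

Because each segment carries its own offset, the buffer ends up with the same contents regardless of the order in which the three segments are processed.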

Retransmission occurs as the result of a digest error, protocol error, or timeout. Despite differences in detection techniques, PDU retransmission is handled in a similar manner for data digest errors, protocol errors, and timeouts. However, header digest errors require special handling. When a header digest error occurs and the connection does not support a PDU boundary detection scheme, the connection must be terminated. If the session supports ErrorRecoveryLevel 2, the connection is recovered, tasks are reassigned, and PDU retransmission occurs on the new connection. If the session does not support ErrorRecoveryLevel 2, the connection is not recovered. In this case, the SCSI application client must re-issue the terminated tasks on another connection within the same session. If no other connections exist within the same session, the session is terminated, and the SCSI application client must re-issue the terminated tasks in a new session. When a header digest error occurs and the connection supports a PDU boundary detection scheme, the PDU is discarded. If the session supports ErrorRecoveryLevel 1 or higher, retransmission of the dropped PDU is handled as described in the following paragraphs. Note that detection of a dropped PDU because of header digest error requires successful receipt of a subsequent PDU associated with the same task. If the session supports only ErrorRecoveryLevel 0, the session is terminated, and the SCSI application client must re-issue the terminated tasks in a new session. The remainder of this section focuses primarily on PDU retransmission in the presence of data digest errors.
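The header digest error handling described above can be summarized as a small decision function. The Python sketch below only restates the rules in this paragraph; the `Action` names and the function signature are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RETRANSMIT_PDU = "discard the PDU and retransmit it"
    RECOVER_CONNECTION = "recover the connection, reassign tasks, retransmit"
    REISSUE_ON_OTHER_CONNECTION = "re-issue tasks on another connection"
    REISSUE_IN_NEW_SESSION = "terminate the session, re-issue tasks in a new session"


def header_digest_error_action(boundary_detection, error_recovery_level,
                               other_connections_exist):
    """Select the response to a header digest error."""
    if boundary_detection:
        # The damaged PDU can be skipped, so the connection survives.
        if error_recovery_level >= 1:
            return Action.RETRANSMIT_PDU
        return Action.REISSUE_IN_NEW_SESSION

    # Without boundary detection the connection must be terminated.
    if error_recovery_level == 2:
        return Action.RECOVER_CONNECTION
    if other_connections_exist:
        return Action.REISSUE_ON_OTHER_CONNECTION
    return Action.REISSUE_IN_NEW_SESSION
```

The last two outcomes leave recovery to the SCSI application client, which is why they correspond to the lowest level of iSCSI sophistication.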

Targets explicitly notify initiators when a PDU is dropped because of data digest failure. The Reject PDU facilitates such notification. Receipt of a Reject PDU for a SCSI Command PDU containing immediate data triggers retransmission if ErrorRecoveryLevel is 1 or higher. When an initiator retransmits a SCSI Command PDU, certain fields (such as the ITT, CmdSN, and operational attributes) in the BHS must be identical to the original PDU's BHS. This is known as a retry. A retry must be sent on the same connection as the original PDU unless the connection is no longer active. Receipt of a Reject PDU for a SCSI Command PDU that does not contain immediate data usually indicates a non-digest error that prevents retrying the command. Receipt of a Reject PDU for a Data-Out PDU does not trigger retransmission. Initiators retransmit Data-Out PDUs only in response to Recovery R2T PDUs. Thus, targets are responsible for requesting retransmission of missing Data-Out PDUs if ErrorRecoveryLevel is 1 or higher. Efficient recovery of dropped data during write operations is accomplished via the Buffer Offset and Desired Data Transfer Length fields in the Recovery R2T PDU. In the absence of a Recovery R2T PDU (in other words, when no Data-Out PDUs are dropped), all Data-Out PDUs are implicitly acknowledged by a SCSI status of GOOD in the SCSI Response PDU. When a connection fails and tasks are reassigned, the initiator retransmits a SCSI Command PDU or Data-Out PDUs as appropriate for each task in response to Recovery R2T PDUs sent by the target on the new connection. When a session fails, an iSCSI initiator does not retransmit any PDUs. At any point in time, an initiator may send a No Operation Out (NOP-Out) PDU to probe the sequence numbers of a target and to convey the initiator's sequence numbers to the same target. Initiators also use the NOP-Out PDU to respond to No Operation In (NOP-In) PDUs received from a target. A NOP-Out PDU may also be used for diagnostic purposes or to adjust timeout values. A NOP-Out PDU does not directly trigger retransmission.
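As a rough illustration of how a target might derive the Buffer Offset and Desired Data Transfer Length values for Recovery R2T PDUs, the Python sketch below computes the byte ranges of a write that have not yet been received. The function name and the representation of received data as `(buffer_offset, length)` pairs are assumptions made for this example.

```python
def missing_ranges(expected_length, received):
    """Return (buffer_offset, desired_data_transfer_length) pairs for the
    gaps in a write, given the segments that arrived intact.

    `received` is an iterable of (buffer_offset, length) pairs taken from
    the Data-Out PDUs that passed digest verification.
    """
    gaps = []
    cursor = 0  # first byte not yet known to be received
    for offset, length in sorted(received):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < expected_length:
        gaps.append((cursor, expected_length - cursor))
    return gaps
```

Each returned pair describes one contiguous gap, so one Recovery R2T PDU per pair would request exactly the dropped data rather than the entire write.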

Initiators do not explicitly notify targets when a Data-In PDU or SCSI Response PDU is dropped due to data digest failure. Because R2T PDUs do not contain data, detection of a missing R2T PDU via an out-of-order R2TSN means a header digest error occurred on the original R2T PDU. When a Data-In PDU or SCSI Response PDU containing data is dropped, the initiator requests retransmission via a Data/R2T SNACK Request PDU if ErrorRecoveryLevel is 1 or higher. Efficient recovery of dropped data during read operations is accomplished via the BegRun and RunLength fields in the SNACK Request PDU. The target infers that all Data-In PDUs associated with a given command were received based on the ExpStatSN field in the BHS of a subsequent SCSI Command PDU or Data-Out PDU. Until such acknowledgment is inferred, the target must be able to retransmit all data associated with a command.
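The BegRun and RunLength values can be thought of as a compact description of each run of consecutive missing PDUs. The Python sketch below derives such runs from the DataSN values an initiator has received. It assumes, for simplicity, that DataSN starts at 0 and that the initiator already knows how many Data-In PDUs to expect; the function name is hypothetical.

```python
def missing_runs(received_datasns, expected_count):
    """Return (beg_run, run_length) pairs describing the missing DataSNs."""
    received = set(received_datasns)
    runs = []
    run_start = None
    for sn in range(expected_count):
        if sn not in received:
            if run_start is None:
                run_start = sn  # a new run of missing PDUs begins here
        elif run_start is not None:
            runs.append((run_start, sn - run_start))
            run_start = None
    if run_start is not None:
        runs.append((run_start, expected_count - run_start))
    return runs
```

Each pair would populate one Data/R2T SNACK Request PDU, so a burst of consecutive losses costs a single request instead of one request per PDU.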

This requirement can consume a lot of the target's resources during long read operations. To free resources during long read operations, targets may periodically request explicit acknowledgment of Data-In PDU receipt via a DataACK SNACK Request PDU. When a connection fails and tasks are reassigned, the target retransmits Data-In PDUs or a SCSI Response PDU as appropriate for each task. Initiators are not required to explicitly request retransmission following connection recovery. All unacknowledged Data-In PDUs and SCSI Response PDUs must be automatically retransmitted after connection recovery. The target uses the ExpDataSN of the most recent DataACK SNACK Request PDU to determine which Data-In PDUs must be retransmitted for each task. Optionally, the target may use the ExpDataSN field in the TMF Request PDU received from the initiator after task reassignment to determine which Data-In PDUs must be retransmitted for each task. If the target cannot reliably maintain state for a reassigned task, all Data-In PDUs associated with that task must be retransmitted. If the SCSI Response PDU for a given task was transmitted before task reassignment, the PDU must be retransmitted after task reassignment. Otherwise, the SCSI Response PDU is transmitted at the conclusion of the command or task as usual. When a session fails, an iSCSI target does not retransmit any PDUs. At any point in time, a target may send a NOP-In PDU to probe the sequence numbers of an initiator and to convey the target's sequence numbers to the same initiator. Targets also use the NOP-In PDU to respond to NOP-Out PDUs received from an initiator. A NOP-In PDU may also be used for diagnostic purposes or to adjust timeout values. A NOP-In PDU does not directly trigger retransmission.

iSCSI In-Order Command Delivery

According to the SAM, status received for a command finalizes the command under all circumstances. So, initiators requiring in-order delivery of commands can simply restrict the number of outstanding commands to one and wait for status for each outstanding command before issuing the next command. Alternatively, the SCSI Transport Protocol can guarantee in-order command delivery. This enables the initiator to maintain multiple simultaneous outstanding commands.

iSCSI guarantees in-order delivery of non-immediate commands to the SAL within a target. Each non-immediate command is assigned a unique CmdSN. The CmdSN counter must be incremented sequentially for each new non-immediate command without skipping numbers. In a single-connection session, the in-order guarantee is inherent due to the properties of TCP. In a multi-connection session, commands are issued sequentially across all connections. In this scenario, TCP cannot guarantee in-order delivery of non-immediate commands because TCP operates independently on each connection. Additionally, the configuration of a routed IP network can result in one connection using a "shorter" route to the destination node than other connections. Thus, iSCSI must augment TCP by ensuring that non-immediate commands are processed in order (according to the CmdSN) across multiple connections. So, RFC 3720 requires each target to process non-immediate commands in the same order as transmitted by the initiator. Note that a CmdSN is assigned to each TMF Request PDU. The rules of in-order delivery also apply to non-immediate TMF requests.
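One way to picture the target-side ordering requirement is a holding area keyed by CmdSN that releases commands to the SAL only when the next expected CmdSN has arrived. The Python sketch below is a simplified model: the class and attribute names are hypothetical, and sequence number wraparound is ignored.

```python
class CommandSequencer:
    """Deliver non-immediate commands in CmdSN order, regardless of the
    connection or the order in which they arrive."""

    def __init__(self, first_cmdsn, deliver):
        self.exp_cmdsn = first_cmdsn  # next CmdSN to hand to the SAL
        self.pending = {}             # commands that arrived early
        self.deliver = deliver        # callback standing in for the SAL

    def receive(self, cmdsn, command):
        self.pending[cmdsn] = command
        # Release every command that is now contiguous with exp_cmdsn.
        while self.exp_cmdsn in self.pending:
            self.deliver(self.pending.pop(self.exp_cmdsn))
            self.exp_cmdsn += 1


delivered = []
sequencer = CommandSequencer(first_cmdsn=1, deliver=delivered.append)
sequencer.receive(2, "WRITE")         # arrived early on a faster connection
sequencer.receive(3, "READ")
sequencer.receive(1, "TEST UNIT READY")
```

In this run nothing is delivered until the command with CmdSN 1 arrives, at which point all three commands are released in CmdSN order.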

Immediate commands are handled differently than non-immediate commands. An immediate command is not assigned a unique CmdSN and is not subject to in-order delivery guarantees. The initiator increments its CmdSN counter after transmitting a new non-immediate command. Thus, the value of the initiator's CmdSN counter (the current CmdSN) represents the CmdSN of the next non-immediate command to be issued. The current CmdSN is also assigned to each immediate command issued, but the CmdSN counter is not incremented following issuance of immediate commands. Moreover, the target may deliver immediate commands to the SAL immediately upon receipt regardless of the CmdSN in the BHS. The next non-immediate command is assigned the same CmdSN. For that PDU, the CmdSN in the BHS is used by the target to enforce in-order delivery. Thus, immediate commands are not acknowledged via the ExpCmdSN field in the BHS. Immediate TMF requests are processed like non-immediate TMF requests. Therefore, marking a TMF request for immediate delivery does not expedite processing.
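The initiator-side counter behavior can be sketched as follows. The class name `InitiatorCmdSN` and the starting value are illustrative assumptions, and wraparound of the counter is ignored for brevity.

```python
class InitiatorCmdSN:
    """Track the current CmdSN as described for immediate and
    non-immediate commands."""

    def __init__(self, initial):
        self.current = initial  # CmdSN of the next non-immediate command

    def next_cmdsn(self, immediate):
        """Return the CmdSN to place in the BHS of a new command."""
        cmdsn = self.current
        if not immediate:
            # Only non-immediate commands advance the counter.
            self.current += 1
        return cmdsn


counter = InitiatorCmdSN(initial=10)
issued = [
    counter.next_cmdsn(immediate=False),  # non-immediate
    counter.next_cmdsn(immediate=True),   # immediate, borrows current CmdSN
    counter.next_cmdsn(immediate=False),  # reuses the CmdSN just borrowed
    counter.next_cmdsn(immediate=False),
]
```

The second and third commands carry the same CmdSN, which is why the value in an immediate command's BHS cannot be used to enforce ordering.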

Note

The order of command delivery does not necessarily translate to the order of command execution. The order of command execution can be changed via TMF request as specified in the SCSI standards.

iSCSI Connection and Session Recovery

When ErrorRecoveryLevel equals 2, iSCSI supports stateful recovery at the connection level. Targets may choose whether to maintain state during connection recovery. When state is maintained, active commands are reassigned, and data transfer resumes on the new connection from the point at which data receipt is acknowledged. When state is not maintained, active commands are reassigned, and all associated data must be transferred on the new connection. Connections may be reinstated or recovered. Reinstatement means that the same CID is reused. Recovery means that a new CID is assigned. RFC 3720 does not clearly define these terms, but the definitions provided herein appear to be accurate. When MaxConnections equals one, and ErrorRecoveryLevel equals two, the session must temporarily override the MaxConnections parameter during connection recovery. Two connections must be simultaneously supported during recovery. Additionally, the failed connection must be cleaned up before recovery to avoid receipt of stale PDUs following recovery.

In a multi-connection session, each command is allegiant to a single connection. In other words, all PDUs associated with a given command must traverse a single connection. This is known as connection allegiance. Connection allegiance is command-oriented, not task-oriented. This can be confusing because connection recovery involves task reassignment. When a connection fails, an active command is identified by its ITT (not CmdSN) for reassignment purposes. This is because SCSI defines management functions at the task level, not at the command level. However, multiple commands can be issued with a single ITT (linked commands), and each linked command can be issued on a different connection within a single session. No more than one linked command can be outstanding at any point in time for a given task, so the ITT uniquely identifies each linked command during connection recovery. The PDUs associated with each linked command must traverse the connection on which the command was issued. This means a task may be spread across multiple connections over time, but each command is allegiant to a single connection at any point in time.

When a session is recovered, iSCSI establishes a new session on behalf of the SCSI application client. iSCSI also terminates all active tasks within the target and generates a SCSI response for the SCSI application client. iSCSI does not maintain any state for outstanding tasks. All tasks must be reissued by the SCSI application client.

The preceding discussion of iSCSI delivery mechanisms is simplified for the sake of clarity. For more information about iSCSI delivery mechanisms, readers are encouraged to consult IETF RFCs 3720 and 3783.
