Conceptual Underpinnings


Networking professionals understand some of the topics discussed in this chapter, but not others. Before we discuss each network technology, we need to discuss some of the less understood conceptual topics, to clarify terminology and elucidate key points. This section provides foundational knowledge required to understand addressing schemes, address formats, delivery mechanisms, and link aggregation.

Addressing Schemes

The SPI, Ethernet, IP, and Fibre Channel all use different addressing schemes. To provide a consistent frame of reference, we discuss the addressing scheme defined by the SAM in this section. As we discuss addressing schemes of the SPI, Ethernet, IP, and Fibre Channel subsequently, we will compare each one to the SAM addressing scheme. The SAM defines four types of objects known as application client, logical unit, port, and device. Of these, three are addressable: logical unit, port, and device.

The SCSI protocol implemented in an initiator is called a SCSI application client. A SCSI application client can initiate only SCSI commands. No more than one SCSI application client may be implemented per SCSI Transport Protocol within a SCSI initiator device. Thus, no client ambiguity exists within an initiator device. This eliminates the need for SCSI application client addresses.

The SCSI protocol implemented in a target is called a SCSI logical unit. A SCSI logical unit can execute only SCSI commands. A SCSI target device may (and usually does) contain more than one logical unit. So, logical units require addressing to facilitate proper forwarding of incoming SCSI commands. A SCSI logical unit is a processing entity that represents any hardware component capable of providing SCSI services to SCSI application clients. Examples include a storage medium, an application-specific integrated circuit (ASIC) that supports SCSI enclosure services (SES) to provide environmental monitoring services, a robotic arm that supports SCSI media changer (SMC) services, and so forth. A SCSI logical unit is composed of a task manager and a device server. The task manager is responsible for queuing and managing the commands received from one or more SCSI application clients, whereas the device server is responsible for executing SCSI commands.

SCSI ports facilitate communication between SCSI application clients and SCSI logical units. A SCSI port consists of the hardware and software required to implement a SCSI Transport Protocol and associated SCSI Interconnect. One notable exception is the SPI, which does not implement a SCSI Transport Protocol.

A SCSI initiator device is composed of at least one SCSI port and at least one SCSI application client. A SCSI target device consists of at least one SCSI port, one task router per SCSI port, and at least one SCSI logical unit. Each task router directs incoming SCSI commands to the task manager of the appropriate logical unit. An FC HBA or iSCSI TOE is considered a SCSI device. This is somewhat confusing because the term device is commonly used to generically refer to a host, storage subsystem, switch, or router. To avoid confusion, we use the terms enclosure and network entity in the context of SCSI to describe any host, storage subsystem, switch, or router that contains one or more SCSI devices. A SCSI device often contains only a single SCSI port, but may contain more than one. For example, most JBOD chassis in use today contain dual-port disk drives that implement a single SCSI logical unit. Each disk drive is a single SCSI device with multiple ports. Likewise, many intelligent storage arrays contain multiple SCSI ports and implement a single SCSI device accessible via all ports. However, most multi-port FC HBAs in use today implement a SCSI application client per port (that is, multiple single-port SCSI devices). Many of the early iSCSI implementations are software-based to take advantage of commodity Ethernet hardware. In such an implementation, a multi-homed host (the network entity) typically contains a single SCSI device that consists of a single SCSI software driver (the application client) bound to a single iSCSI software driver (the SCSI Transport Protocol) that uses multiple IP addresses (the initiator ports making up the SCSI Interconnect) that are assigned to multiple Ethernet NICs.

The SAM defines two types of addresses known as name and identifier. A name positively identifies an object, and an identifier facilitates communication with an object. Names are generally optional, and identifiers are generally mandatory. Names are implemented by SCSI Interconnects and SCSI Transport Protocols, and identifiers are implemented only by SCSI Interconnects. The SAM addressing rules are not simple, so a brief description of the rules associated with each SAM object follows:

  • Device names are optional in the SAM. However, any particular SCSI Transport Protocol may require each SCSI device to have a name. A device name never changes and may be used to positively identify a SCSI device. A device name is useful for determining whether a device is accessible via multiple ports. A device may be assigned only one name within the scope of each SCSI Transport Protocol. Each device name is globally unique within the scope of each SCSI Transport Protocol. Each SCSI Transport Protocol defines its own device name format and length.

  • Device identifiers are not defined in the SAM. Because each SCSI device name is associated with one or more SCSI port names, each of which is associated with a SCSI port identifier, SCSI device identifiers are not required to facilitate communication.

  • Port names are optional in the SAM. However, any particular SCSI Transport Protocol may require each SCSI port to have a name. A port name never changes and may be used to positively identify a port in the context of dynamic port identifiers. A port may be assigned only one name within the scope of each SCSI Transport Protocol. Each port name is globally unique within the scope of each SCSI Transport Protocol. Each SCSI Transport Protocol defines its own port name format and length.

  • Port identifiers are mandatory. Port identifiers are used by SCSI Interconnect technologies as source and destination addresses when forwarding frames or packets. Each SCSI Interconnect defines its own port identifier format and length.

  • Logical unit names are optional in the SAM. However, any particular SCSI Transport Protocol may require each SCSI logical unit to have a name. A logical unit name never changes and may be used to positively identify a logical unit in the context of dynamic logical unit identifiers. A logical unit name is also useful for determining whether a logical unit has multiple identifiers. That is the case in multi-port storage arrays that provide access to each logical unit via multiple ports simultaneously. A logical unit may be assigned only one name within the scope of each SCSI Transport Protocol. Each logical unit name is globally unique within the scope of each SCSI Transport Protocol. Each SCSI Transport Protocol defines its own logical unit name format and length.

  • Logical unit identifiers are mandatory. A logical unit identifier is commonly called a logical unit number (LUN). If a target device provides access to multiple logical units, a unique LUN is assigned to each logical unit on each port. However, a logical unit may be assigned a different LUN on each port through which the logical unit is accessed. In other words, LUNs are unique only within the scope of a single target port. To accommodate the wide range of LUN scale and complexity from a simple SPI bus to a SAN containing enterprise-class storage subsystems, the SAM defines two types of LUNs known as flat and hierarchical. Each SCSI Transport Protocol defines its own flat LUN format and length. By contrast, the SAM defines the hierarchical LUN format and length. All SCSI Transport Protocols that support hierarchical LUNs must use the SAM-defined format and length. Up to four levels of hierarchy may be implemented, and each level may use any one of four defined formats. Each level is 2 bytes long. The total length of a hierarchical LUN is 8 bytes regardless of how many levels are used. Unused levels are filled with null characters (binary zeros). Support for hierarchical LUNs is optional.
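The hierarchical LUN layout described above can be sketched in Python. This is an illustrative simplification (function names are hypothetical) that treats each 2-byte level as an opaque value and ignores the per-level format selection defined by the SAM:

```python
import struct

def pack_hierarchical_lun(levels):
    """Pack up to four 16-bit level values into an 8-byte SAM-style
    hierarchical LUN, zero-filling any unused levels."""
    if not 1 <= len(levels) <= 4:
        raise ValueError("a hierarchical LUN has 1 to 4 levels")
    padded = list(levels) + [0] * (4 - len(levels))
    return struct.pack(">4H", *padded)  # four big-endian 16-bit fields

def unpack_hierarchical_lun(lun_bytes):
    """Return the four 16-bit levels of an 8-byte hierarchical LUN."""
    return struct.unpack(">4H", lun_bytes)
```

Note that the total length is always 8 bytes: a two-level LUN simply carries binary zeros in its third and fourth levels, consistent with the null-fill rule above.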

Note

In common conversation, the term LUN is often used synonymously with the terms disk and volume. For example, one might hear the phrases present the LUN to the host, mount the volume, and partition the disk all used to describe actions performed against the same unit of storage.


SCSI logical unit numbering is quite intricate. Because LUNs do not facilitate identification of nodes or ports, or forwarding of frames or packets, further details of the SAM hierarchical LUN scheme are outside the scope of this book. For more information, readers are encouraged to consult the ANSI T10 SAM-3 specification and Annex C of the original ANSI T10 FCP specification. A simplified depiction of the SAM addressing scheme is shown in Figure 5-1. Only two levels of LUN hierarchy are depicted.

Figure 5-1. SAM Addressing Scheme


A separate addressing scheme is used to identify physical elements (storage shelves, media load/unload slots, robotic arms, and drives) within a media changer. Each element is assigned an address. In this scenario, the element address scheme is the conceptual equivalent of the logical block addressing (LBA) scheme used by magnetic disks. Just as SCSI initiators can copy data from one block to another within a disk, SCSI initiators can move media cartridges from one element to another within a media changer. Because media cartridges can be stored in any location and subsequently loaded into any drive, a barcode label (or equivalent) is attached to each media cartridge to enable identification as cartridges circulate inside the media changer. This label is called the volume tag. Application software (for example, a tape backup program) typically maintains a media catalog to map volume tags to content identifiers (for example, backup set names). An initiator can send a read element status command to the logical unit that represents the robotic arm (called the media transport element) to discover the volume tag at each element address. The initiator can then use element addresses to move media cartridges via the move medium command. After the robotic arm loads the specified media cartridge into the specified drive, normal I/O can occur in which the application (for example, a tape backup program) reads from or writes to the medium using the drive's LUN to access the medium, and the medium's LBA scheme to navigate the medium. Element and barcode addressing are not part of the SAM LUN scheme. The element and barcode addressing schemes are both required and are both complementary to the LUN scheme. Further details of physical element addressing, barcode addressing, and media changers are outside the scope of this book. For more information, readers are encouraged to consult the ANSI T10 SMC-2 specification. 
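The element addressing model described above can be illustrated with a toy Python model. The method names merely echo the SMC command names (read element status, move medium); this is not a representation of actual SMC CDBs, and all names are hypothetical:

```python
class MediaChanger:
    """Toy model of SMC-style element addressing: each storage slot
    and drive is an element address that holds at most one cartridge,
    identified by its volume tag."""

    def __init__(self, slot_addresses, drive_addresses):
        self.elements = {addr: None for addr in slot_addresses + drive_addresses}

    def read_element_status(self):
        """Map each element address to the volume tag stored there (or None)."""
        return dict(self.elements)

    def move_medium(self, source, destination):
        """Move the cartridge at 'source' to the empty 'destination' element."""
        if self.elements[source] is None:
            raise ValueError("source element is empty")
        if self.elements[destination] is not None:
            raise ValueError("destination element is occupied")
        self.elements[destination] = self.elements[source]
        self.elements[source] = None
```

A backup application would first read element status to learn which volume tag sits at which address, then issue moves by element address, just as described above.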
A simplified depiction of media changer element addressing is shown in Figure 5-2 using tape media.

Figure 5-2. Media Changer Element Addressing


Address Formats

Several address formats are in use today. All address formats used by modern storage networks are specified by standards organizations. In the context of addressing schemes, a standards body that defines an address format is called a network address authority (NAA) even though the standards body might be engaged in many activities outside the scope of addressing. Some network protocols use specified bit positions in the address field of the frame or packet header to identify the NAA and format. This enables the use of multiple address formats via a single address field. The most commonly used address formats include the following:

  • MAC-48 specified by the IEEE

  • EUI-64 specified by the IEEE

  • IPv4 specified by the IETF

  • IQN specified by the IETF

  • WWN specified by the ANSI T11 subcommittee

  • FC Address Identifier specified by the ANSI T11 subcommittee

The IEEE formats are used for a broad range of purposes, so a brief description of each IEEE format is provided in this section. For a full description of each IEEE format, readers are encouraged to consult the IEEE 802-2001 specification and the IEEE 64-bit Global Identifier Format Tutorial. Descriptions of the various implementations of the IEEE formats appear throughout this chapter and Chapter 8, "OSI Session, Presentation, and Application Layers." Description of IPv4 addressing is deferred to Chapter 6, "OSI Network Layer." Note that IPv6 addresses can be used by IPS protocols, but IPv4 addresses are most commonly implemented today. Thus, IPv6 addressing is currently outside the scope of this book. Description of iSCSI qualified names (IQNs) is deferred to Chapter 8, "OSI Session, Presentation, and Application Layers." Descriptions of world wide names (WWNs) and FC address identifiers follow in the FC section of this chapter.

The IEEE 48-bit media access control (MAC-48) format is a 48-bit address format that guarantees universally unique addresses in most scenarios. The MAC-48 format supports locally assigned addresses which are not universally unique, but such usage is uncommon. The MAC-48 format originally was defined to identify physical elements such as LAN interfaces, but its use was expanded later to identify LAN protocols and other non-physical entities. When used to identify non-physical entities, the format is called the 48-bit extended unique identifier (EUI-48). MAC-48 and EUI-48 addresses are expressed in dash-separated hexadecimal notation such as 00-02-8A-9F-52-95. Figure 5-3 illustrates the IEEE MAC-48 address format.

Figure 5-3. IEEE MAC-48 Address Format


A brief description of each field follows:

  • The IEEE registration authority committee (RAC) assigns a three-byte organizationally unique identifier (OUI) to each organization that wants to produce network elements or protocols. No two organizations are assigned the same OUI. All LAN interfaces produced by an organization must use that organization's assigned OUI as the first half of the MAC-48 address. Thus, the OUI field identifies the manufacturer of each LAN interface.

  • Embedded within the first byte of the OUI is a bit called the universal/local (U/L) bit. The U/L bit indicates whether the Extension Identifier is universally administered by the organization that produced the LAN interface or locally administered by the company that deployed the LAN interface.

  • Embedded within the first byte of the OUI is a bit called the individual/group (I/G) bit. The I/G bit indicates whether the MAC-48 address is an individual address used for unicast frames or a group address used for multicast frames.

  • The Extension Identifier field, which is three bytes long, identifies each LAN interface. Each organization manages the Extension Identifier values associated with its OUI. During the interface manufacturing process, the U/L bit is set to 0 to indicate that the Extension Identifier field contains a universally unique value assigned by the manufacturer. The U/L bit can be set to 1 by a network administrator via the NIC driver parameters. This allows a network administrator to assign an Extension Identifier value according to a particular addressing scheme. In this scenario, each LAN interface address may be a duplicate of one or more other LAN interface addresses. This duplication is not a problem if no duplicate addresses exist on a single LAN. That said, local administration of Extension Identifiers is rare. So, the remainder of this book treats all MAC-48 Extension Identifiers as universally unique addresses.
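The field and bit layout described above can be made concrete with a short Python sketch (the helper name is hypothetical). In canonical bit ordering, the I/G bit is the least significant bit of the first octet, and the U/L bit is the next:

```python
def parse_mac48(mac):
    """Split a dash-separated MAC-48 address into its OUI and
    Extension Identifier, and extract the I/G and U/L bits
    from the first octet (canonical bit ordering)."""
    octets = [int(b, 16) for b in mac.split("-")]
    first = octets[0]
    return {
        "oui": "-".join(f"{o:02X}" for o in octets[:3]),
        "extension": "-".join(f"{o:02X}" for o in octets[3:]),
        "ig_bit": first & 0x01,         # 0 = individual (unicast), 1 = group
        "ul_bit": (first >> 1) & 0x01,  # 0 = universally administered
    }
```

Applied to the example address used earlier (00-02-8A-9F-52-95), this yields a universally administered individual address with OUI 00-02-8A.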

The growing number of devices that require a MAC-48 address prompted the IEEE to define a new address format called EUI-64. The EUI-64 format is a 64-bit universally unique address format used for physical network elements and for non-physical entities. Like the MAC-48 format, the EUI-64 format supports locally assigned addresses which are not universally unique, but such usage is uncommon. MAC-48 addresses are still supported but are no longer promoted by the IEEE. Instead, vendors are encouraged to use the EUI-64 format for all new devices and protocols. A mapping is defined by the IEEE for use of MAC-48 and EUI-48 addresses within EUI-64 addresses. EUI-64 addresses are expressed in hyphen-separated hexadecimal notation such as 00-02-8A-FF-FF-9F-52-95. Figure 5-4 illustrates the IEEE EUI-64 address format.

Figure 5-4. IEEE EUI-64 Address Format


A brief description of each field follows:

  • OUI: identical in format and usage to the OUI field in the MAC-48 address format.

  • U/L bit: identical in usage to the U/L bit in the MAC-48 address format.

  • I/G bit: identical in usage to the I/G bit in the MAC-48 address format.

  • Extension Identifier: identical in purpose to the Extension Identifier field in the MAC-48 address format. However, the length is increased from 3 bytes to 5 bytes. The first two bytes can be used to map MAC-48 and EUI-48 Extension Identifier values into the remaining three bytes. Alternately, the first 2 bytes can be concatenated with the last 3 bytes to yield a 5-byte Extension Identifier for new devices and protocols. Local administration of Extension Identifiers is rare. So, the remainder of this book treats all EUI-64 Extension Identifiers as universally unique addresses.
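The mapping of MAC-48 addresses into the EUI-64 space can be sketched in Python (the function name is hypothetical): the reserved FF-FF label is inserted between the OUI and the Extension Identifier, exactly as in the example address 00-02-8A-FF-FF-9F-52-95 shown earlier. (EUI-48 addresses map with an FF-FE label instead.)

```python
def mac48_to_eui64(mac):
    """Map a dash-separated MAC-48 address into the EUI-64 space by
    inserting the reserved FF-FF label between the 3-byte OUI and
    the 3-byte Extension Identifier."""
    octets = [int(b, 16) for b in mac.split("-")]
    eui64 = octets[:3] + [0xFF, 0xFF] + octets[3:]
    return "-".join(f"{o:02X}" for o in eui64)
```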

Note that the U/L and I/G bits are rarely used in modern storage networks. In fact, some FC address formats omit these bits. However, omission of these bits from FC addresses has no effect on FC-SANs.

Delivery Mechanisms

Delivery mechanisms such as acknowledgement, frame/packet reordering, and error notification vary from one network technology to the next. Network technologies are generally classified as connection-oriented or connectionless depending on the suite of delivery mechanisms employed. However, these terms are not well defined. Confusion can result from assumed meanings when these terms are applied to disparate network technologies because their meanings vary significantly depending on the context.

In circuit-switched environments, such as the public switched telephone network (PSTN), the term connection-oriented implies a physical end-to-end path over which every frame or packet flows. A connection is established hop-by-hop by using a signaling protocol (such as Signaling System 7 [SS7]) to communicate with each switch in the physical end-to-end path. Once the connection is established, no information about the source node or destination node needs to be included in the frames or packets. By contrast, packet-switched environments use the term connection-oriented to describe logical connections made between end nodes. For example, in IP networks, TCP is used to form a logical connection between end nodes. Packets associated with a single logical connection may traverse different physical paths. Each router in the path must inspect each IP packet as it enters the router and make a forwarding decision based on the destination information contained in the packet header. This means that packets can be delivered out of order, so reordering protocols are sometimes implemented by end nodes to process packets arriving out of order before they are passed to ULPs. A logical connection such as this is typically characterized by a unique identifier (carried in the header of each packet), reserved resources in the end nodes (used to buffer frames or packets associated with each connection), and negotiated connection management procedures such as frame/packet acknowledgement and buffer management.

In circuit-switched environments, the term connectionless generally has no meaning. By contrast, the term connectionless is used in packet-switched environments to describe the absence of an end-to-end logical connection. For example, UDP is combined with IP to provide connectionless communication. Most packet-switching technologies support both connectionless and connection-oriented communication.

This book attempts to avoid the terms connection-oriented and connectionless, opting instead to refer to specific delivery mechanisms as appropriate. Where these terms do appear, readers should consider the context each time, and should not equate the collection of delivery mechanisms that defines connection-oriented communication in one network technology with that of any other network technology. The remainder of this section discusses the details of each type of delivery mechanism so that readers can clearly understand the subsequent descriptions of network technologies. The delivery mechanisms discussed in this section include the following:

  • Detection of missing frames or packets (drops)

  • Detection of duplicate frames or packets

  • Detection of corrupt frames or packets

  • Acknowledgement

  • Guaranteed delivery (retransmission)

  • Flow control (buffering)

  • Guaranteed bandwidth

  • Guaranteed latency

  • Fragmentation, reassembly, and PMTU discovery

  • In-order delivery and frame/packet reordering

The SAM does not explicitly require all of these delivery guarantees, but the SAM does assume error-free delivery of SCSI requests or responses. How the SCSI Transport Protocol or SCSI Interconnect accomplishes error-free delivery is self-determined by each protocol suite. Client notification of delivery failure is explicitly required by the SAM. That implies a requirement for detection of failures within the SCSI Transport Protocol layer, the SCSI Interconnect layer, or both. A brief discussion of each delivery mechanism follows.

Dropped Frames or Packets

Several factors can cause drops. Some examples include the following:

  • Buffer overrun

  • No route to the destination

  • Frame/packet corruption

  • Intra-switch forwarding timeout

  • Fragmentation required but not permitted

  • Administrative routing policy

  • Administrative security policy

  • Quality of Service (QoS) policy

  • Transient protocol error

  • Bug in switch or router software

For these reasons, no network protocol can guarantee that frames or packets will never be dropped. However, a network protocol may guarantee to detect drops. Upon detection, the protocol may optionally request retransmission of the dropped frame or packet. Because this requires additional buffering in the transmitting device, it is uncommon in network devices. However, this is commonly implemented in end nodes. Another option is for the protocol that detects the dropped frame or packet to notify the ULP. In this case, the ULP may request retransmission of the frame or packet, or simply notify the next ULP of the drop. If the series of upward notifications continues until the application is notified, the application must request retransmission of the lost data. A third option is for the protocol that detects the dropped frame or packet to take no action. In this case, one of the ULPs or the application must detect the drop via a timeout mechanism.

Duplicate Frames or Packets

Duplicates can result from several causes. This might seem hard to imagine. After all, how is a duplicate frame or packet created? Transient congestion provides a good example. Assume that host A sends a series of packets to host B using a protocol that guarantees delivery. Assume also that there are multiple paths between host A and host B. When load balancing is employed within the network, and one path is congested while the other is not, some packets will arrive at host B while others are delayed in transit. Host B might have a timer to detect dropped packets, and that timer might expire before all delayed packets are received. If so, host B may request retransmission of some packets from host A. When host A retransmits the requested packets, duplicate packets eventually arrive at host B. Various actions may be taken when a duplicate frame or packet arrives at a destination. The duplicate can be transparently discarded, discarded with notification to the ULP, or delivered to the ULP.
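Receiver behavior in the first case (transparent discard) can be sketched in Python, assuming each frame carries a sequence number (names are hypothetical):

```python
def dedupe(frames):
    """Transparently discard duplicate frames by sequence number,
    keeping the first copy of each frame received.

    'frames' is a list of (sequence_number, payload) pairs in
    arrival order; the return value preserves arrival order."""
    seen = set()
    unique = []
    for seq, payload in frames:
        if seq in seen:
            continue  # duplicate arrival: transparently discarded
        seen.add(seq)
        unique.append((seq, payload))
    return unique
```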

Corrupt Frames or Packets

Corrupt frames or packets can result from several causes. Damaged cables, faulty transceivers, and electromagnetic interference (EMI) are just some of the potential causes of frame/packet corruption. Even when operating properly, optical transceivers will introduce errors. OSI physical layer specifications state the minimum interval between transceiver errors. This is known as the bit error ratio (BER). For example, 1000BASE-SX specifies a BER of 1:10^12. Most network protocols include a checksum or CRC in the header or trailer of each frame or packet to allow detection of frame/packet corruption. When a corrupt frame or packet is detected, the protocol that detects the error can deliver the corrupt frame or packet to the ULP (useful for certain types of voice and video traffic), transparently drop the corrupt frame or packet, drop the corrupt frame or packet with notification to the ULP, or drop the frame or packet with notification to the transmitter. In the last case, the transmitter may abort the transmission with notification to the ULP, or retransmit the dropped frame or packet. A retransmit retry limit is usually enforced, and the ULP is usually notified if the limit is reached.
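The practical meaning of a BER can be illustrated with simple arithmetic. Assuming the 1.25 Gbaud line rate of 1000BASE-SX (10 line bits per data byte due to 8B/10B encoding), a BER of 1:10^12 corresponds to roughly one bit error every 800 seconds, or about one every 13 minutes, at the specified worst case:

```python
def mean_seconds_between_errors(ber, line_rate_bps):
    """Expected seconds between single-bit errors on a link running
    at line_rate_bps with the given bit error ratio."""
    return 1.0 / (ber * line_rate_bps)

# 1000BASE-SX: BER of 1 in 10^12 at a 1.25 Gbaud line rate
interval = mean_seconds_between_errors(1e-12, 1.25e9)
```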

Acknowledgement

Acknowledgement provides notification of delivery success or failure. You can implement acknowledgement as positive or negative, and as explicit or implicit. Positive acknowledgement involves signaling from receiver to transmitter when frames or packets are successfully received. The received frames or packets are identified in the acknowledgement frame or packet (usually called an ACK). This is also a form of explicit acknowledgement. Negative acknowledgement is a bit more complicated. When a frame or packet is received before all frames or packets with lower identities are received, a negative acknowledgement frame or packet (called a NACK) may be sent to the transmitter for the frames or packets with lower identities, which are assumed missing. A receiver timeout value is usually implemented to allow delivery of missing frames or packets prior to NACK transmission. NACKs may be sent under other circumstances, such as when a frame or packet is received but determined to be corrupt. With explicit acknowledgement, each frame or packet that is successfully received is identified in an ACK, or each frame or packet that is dropped is identified in a NACK. Implicit acknowledgement can be implemented in several ways.

For example, a single ACK may imply the receipt of all frames or packets up to the frame or packet identified in the ACK. A retransmission timeout value is an example of implicit NACK.
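Transmitter behavior under cumulative (implicit) acknowledgement can be sketched as follows; the class and method names are hypothetical:

```python
from collections import OrderedDict

class CumulativeAckSender:
    """Transmitter-side sketch of implicit acknowledgement: a single
    ACK naming sequence number N retires every buffered frame with a
    sequence number up to and including N."""

    def __init__(self):
        self.unacked = OrderedDict()  # seq -> frame, held for retransmission

    def send(self, seq, frame):
        self.unacked[seq] = frame     # buffer until acknowledged

    def receive_ack(self, ack_seq):
        """Retire all frames with sequence numbers <= ack_seq."""
        for seq in [s for s in self.unacked if s <= ack_seq]:
            del self.unacked[seq]
```

Frames remaining in the buffer after a retransmission timeout would be resent; the timeout itself acts as the implicit NACK mentioned above.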

The SAM does not require explicit acknowledgement of delivery. The SCSI protocol expects a response for each command, so frame/packet delivery failure eventually will generate an implicit negative acknowledgement via SCSI timeout. The SAM assumes that delivery failures are detected within the service delivery subsystem, but places no requirements upon the service delivery subsystem to take action upon detection of a delivery failure.

Guaranteed Delivery

Guaranteed delivery requires retransmission of every frame or packet that is dropped. Some form of frame/packet acknowledgement is required for guaranteed delivery, even if it is implicit. Frames or packets must be held in the transmitter's memory until evidence of successful delivery is received. Protocols that support retransmission typically impose a limit on the number of retransmission attempts. If the limit is reached before successful transmission, the ULP is usually notified.

Flow Control and QoS

Flow control attempts to minimize drops due to buffer overruns. Buffer management can be proactive or reactive. Proactive buffer management involves an exchange of buffering capabilities before transmission of data frames or packets. The downside to this approach is that the amount of available buffer memory determines the maximum distance that can be traversed while maintaining line rate throughput. The upside to this approach is that it prevents frames or packets from being dropped due to buffer overrun. Reactive buffer management permits immediate transmission of frames or packets. The receiver must then signal the transmitter in real time when buffer resources are exhausted. This eliminates the relationship between distance and available buffer memory, but it allows some in-flight frames or packets to be dropped because of buffer overrun. Flow control can be implemented between two directly connected devices to manage the buffers available for frame reception. This is known as device-to-device, buffer-to-buffer, or link-level flow control. Flow control can also be implemented between two devices that communicate across an internetwork to manage the buffers available for frame/packet processing. This is known as end-to-end flow control.
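The distance/buffer relationship described above for proactive buffer management follows from the bandwidth-delay product: the receiver must advertise enough buffers to cover one round trip's worth of frames. An illustrative Python sketch (names and example figures are hypothetical; propagation in fiber is taken as roughly 2 x 10^5 km/s):

```python
import math

def buffers_for_line_rate(distance_km, data_rate_bps, frame_bits,
                          propagation_km_per_s=2e5):
    """Minimum receive buffers a credit-based (proactive) flow-control
    scheme must advertise to keep a link of the given length running
    at line rate: one round trip's worth of frames."""
    rtt_s = 2 * distance_km / propagation_km_per_s
    frames_per_s = data_rate_bps / frame_bits
    return math.ceil(rtt_s * frames_per_s)
```

For example, a 100 km link at 1 Gbps carrying 2148-byte (17,184-bit) frames needs on the order of 59 buffers; halving the buffer count would halve the achievable throughput at that distance, which is the downside noted above.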

Another key point to understand is the relationship between flow control and QoS. Flow control limits the flow of traffic between a pair of devices, whereas QoS determines how traffic will be processed by intermediate network devices or destination nodes based on the relative priority of each frame or packet. Flow control and QoS are discussed further in Chapter 9, "Flow Control and Quality of Service."

The networking industry uses the phrases quality of service and class of service inconsistently. Whereas quality of service generally refers to queuing policies based on traffic prioritization, some networking technologies use class of service to convey this meaning. Ethernet falls into this category, as some Ethernet documentation refers to Class of Service instead of Quality of Service. Other networking technologies use class of service to convey the set of delivery mechanisms employed. FC falls into this category.

Guaranteed Bandwidth

Circuit-switching technologies like that used by the PSTN inherently dedicate end-to-end link bandwidth to the connected end nodes. The drawback of this model is inefficient use of available bandwidth within the network. Packet-switching technologies seek to optimize use of bandwidth by sharing links within the network. The drawback of this model is that some end nodes might be starved of bandwidth or might be allotted insufficient bandwidth to sustain acceptable application performance. Thus, many packet-switching technologies support bandwidth reservation schemes that allow end-to-end partial or full-link bandwidth to be dedicated to individual traffic flows or specific node pairs.

Guaranteed Latency

All circuit-switching technologies inherently provide consistent latency for the duration of a connection. Some circuit-switching technologies support transparent failover at OSI Layer 1 in the event of a circuit failure. The new connection might have higher or lower latency than the original connection, but the latency will be consistent for the duration of the new connection. Packet-switching technologies do not inherently provide consistent latency. Circuit emulation service, if supported, can guarantee consistent latency through packet-switched networks. Without circuit-emulation services, consistent latency can be achieved in packet-switching networks via proper network design and the use of QoS mechanisms.

Fragmentation, Reassembly, and PMTU Discovery

Fragmentation is discussed in Chapter 3, "Overview of Network Operating Principles," in the context of IP networks, but fragmentation also can occur in other networks. When the MTU of an intermediate network link or the network link connecting the destination node is smaller than the MTU of the network link connecting the source node, the frame or packet must be fragmented. If the frame or packet cannot be fragmented, it must be dropped. Once a frame or packet is fragmented, it typically remains fragmented until it reaches the destination node. When fragmentation occurs, header information is updated to indicate the data offset of each fragment so that reassembly can take place within the destination node. The network device that fragments the frame or packet and the destination node that reassembles the fragments both incur additional processing overhead. Furthermore, protocol overhead increases as a percentage of the data payload, which contributes to suboptimal use of network bandwidth. For these reasons, it is desirable to have a common MTU across all network links from end to end. When this is not possible, it is desirable to dynamically discover the smallest MTU along the path between each pair of communicating end nodes. Each network technology that supports fragmentation typically specifies its own PMTU discovery mechanism.
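The overhead effect described above can be quantified with a simple sketch (names are hypothetical; alignment rules such as IP's 8-byte fragment-offset units are ignored). Each fragment repeats the full header, so the header's share of the bytes on the wire grows with the fragment count:

```python
import math

def fragment_count(payload_bytes, link_mtu, header_bytes):
    """Number of fragments needed to carry one payload across a link
    whose MTU is smaller than the original packet, assuming each
    fragment carries a full copy of the header."""
    per_fragment = link_mtu - header_bytes
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    return math.ceil(payload_bytes / per_fragment)
```

For example, a 4000-byte payload with a 20-byte header crossing a 1500-byte MTU link becomes 3 fragments, tripling the header overhead relative to the unfragmented packet.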

Fragmentation should not be confused with segmentation. Fragmentation occurs in a network after a frame or packet has been transmitted by a node. Segmentation occurs within a node before transmission. Segmentation is the process of breaking down a chunk of application data into smaller chunks that are equal to or less than the local MTU. If PMTU discovery is supported, the PMTU value is used instead of the local MTU. The goal of segmentation is to avoid unnecessary fragmentation in the network without imposing any additional processing burden on applications. Segmentation reduces application overhead by enabling applications to transmit large amounts of data without regard for PMTU constraints. The tradeoff is more overhead imposed on the OSI layer that performs segmentation on behalf of the application. Remember this when calculating protocol throughput, because the segmenting layer might introduce additional protocol overhead.
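The overhead tradeoff described above is easy to quantify. The sketch below assumes TCP-like header sizes (20 bytes each at OSI Layers 3 and 4) purely for illustration; it shows how the segmenting layer shields the application from PMTU constraints while adding per-segment header overhead that must be counted when calculating protocol throughput.

```python
def segment(app_data: bytes, pmtu: int, l3_header: int = 20, l4_header: int = 20):
    """Segment application data before transmission (TCP-like model).

    The segmenting layer, not the application, honors the PMTU: each
    segment plus its headers fits within the PMTU, so no fragmentation
    should occur in the network.
    """
    mss = pmtu - l3_header - l4_header      # maximum data per segment
    segments = [app_data[i:i + mss] for i in range(0, len(app_data), mss)]
    # Every segment repeats the L3/L4 headers: this is the protocol
    # overhead the segmenting layer adds on the application's behalf.
    wire_bytes = sum(len(s) + l3_header + l4_header for s in segments)
    return segments, wire_bytes

# A 64,000-byte application write over a 1500-byte PMTU yields 44
# segments of at most 1460 data bytes, plus 44 * 40 bytes of headers.
segs, wire = segment(b"d" * 64_000, pmtu=1500)
```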

In-order Delivery

If there is only one path between each pair of nodes, in-order delivery is inherently guaranteed by the network. When multiple paths exist between node pairs, network routing algorithms can suppress all but one link and optionally use the suppressed link(s) as backup in the event of primary link failure. Alternately, network routing algorithms can use all links in a load-balanced fashion. In this scenario, frames can arrive out of order unless measures are taken to ensure in-order delivery. In-order delivery can be defined as the receipt of frames at the destination node in the same order as they were transmitted by the source node. Alternately, in-order delivery can be defined as delivery of data to a specific protocol layer within the destination node in the same order as it was transmitted by the same protocol layer within the source node. In both of these definitions, in-order delivery applies to a single source and a single destination. The order of frames or packets arriving at a destination from one source relative to frames or packets arriving at the same destination from another source is not addressed by the SAM and is generally considered benign with regard to data integrity.

In the first definition of in-order delivery, the network must employ special algorithms or special configurations for normal algorithms to ensure that load-balanced links do not result in frames arriving at the destination out of order. There are four levels of granularity for load balancing in serial networking technologies: frame level, flow level, node level, and network level.

Frame-level load balancing spreads individual frames across the available links and makes no effort to ensure in-order delivery of frames. Flow-level load balancing spreads individual flows across the available links. All frames within a flow follow the same link, thus ensuring in-order delivery of frames within each flow. However, in-order delivery of flows is not guaranteed. It is possible for frames in one flow to arrive ahead of frames in another flow without respect to the order in which the flows were transmitted. Some protocols map each I/O operation to a uniquely identifiable flow. In doing so, these protocols enable I/O operation load balancing. If the order of I/O operations must be preserved, node-level load balancing must be used.
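The difference between flow-level and node-level granularity comes down to which fields feed the link-selection hash. The sketch below is a simplified model (the function name, field names, and use of CRC-32 are assumptions for illustration): hashing on a flow identifier pins each flow to one link, while hashing only on the node pair pins all flows between two nodes to one link.

```python
import zlib

def pick_link(frame: dict, num_links: int, granularity: str) -> int:
    """Choose an egress link by hashing a key; the key's scope sets
    the load-balancing granularity."""
    if granularity == "flow":
        # All frames of one flow share a link: in-order within the flow.
        key = (frame["src"], frame["dst"], frame["flow_id"])
    elif granularity == "node":
        # All flows between a node pair share a link: in-order for
        # everything exchanged between the two nodes.
        key = (frame["src"], frame["dst"])
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return zlib.crc32(repr(key).encode()) % num_links

# Frames of the same flow always hash to the same link; two different
# flows between the same node pair may diverge under flow granularity
# but never under node granularity.
f1 = {"src": "A", "dst": "B", "flow_id": 7}
f2 = {"src": "A", "dst": "B", "flow_id": 9}
```

Frame-level balancing would simply rotate across links per frame (e.g., round-robin), which is why it cannot preserve order.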

Node-level load balancing spreads node-to-node connections across the available links, thus ensuring that all frames within all flows within each connection traverse the same link. Multiple simultaneous connections may exist between a source and destination. Node-level load balancing forwards all such connections over the same link. In this manner, node-level load balancing ensures in-order delivery of all frames exchanged between each pair of nodes. Node-level load balancing can be configured for groups of nodes at the network level by effecting a single routing policy for an entire subnet. This is typically (but not always) the manner in which IP routing protocols are configured. For example, a network administrator who wants to disable load balancing usually does so for a routing protocol (affecting all subnets reachable by that protocol) or for a specific subnet (affecting all nodes on that single subnet). In doing so, the network administrator forces all traffic destined for the affected subnet(s) to follow a single link. This has the same effect on frame delivery as implementing a node-level load-balancing algorithm, but without the benefits of load balancing.

Conversely, network-level load balancing can negate the intended effects of node-level algorithms that might be configured on a subset of intermediate links in the end-to-end path. To illustrate this, you need only to consider the default behavior of most IP routing protocols, which permit equal-cost path load balancing. Assume that node A transmits two frames. Assume also that no intervening frames destined for the same subnet are received by node A's default gateway. If the default gateway has two equal-cost paths to the destination subnet, it will transmit the first frame via the first path and the second frame via the second path. That action could result in out-of-order frame delivery. Now assume that the destination subnet is a large Ethernet network with port channels between each pair of switches. If a node-level load-balancing algorithm is configured on the port channels, the frames received at each switch will be delivered across each port channel with order fidelity, but could still arrive at the destination node out of order. Thus, network administrators must consider the behavior of all load-balancing algorithms in the end-to-end path. That raises an important point: network-level load balancing is not accomplished with a special algorithm, but rather with the algorithm embedded in the routing protocol. Also remember that all load-balancing algorithms are employed hop-by-hop.

In the second definition of in-order delivery, the network makes no attempt to ensure in-order delivery in the presence of load-balanced links. Frames or packets may arrive out of order at the destination node. Thus, one of the network protocols operating within the destination node must support frame/packet reordering to ensure data integrity.

The SAM does not explicitly require in-order delivery of frames or packets composing a SCSI request/response. However, because the integrity of a SCSI request/response depends upon in-order delivery of its constituent frames, the SAM implicitly requires the SCSI service delivery subsystem (including the protocols implemented within the end nodes) to provide in-order delivery of frames. The nature of some historical SCSI Interconnects, such as the SPI, inherently provides in-order delivery of all frames. With frame-switched networks, such as FC and Ethernet, frames can be delivered out of order. Therefore, when designing and implementing modern storage networks, take care to ensure in-order frame delivery. Note that in-order delivery may be guaranteed without a guarantee of delivery. In this scenario, some data might be lost in transit, but all data arriving at the destination node will be delivered to the application in order. If the network does not provide notification of non-delivery, delivery failures must be detected by a ULP or the application.

Similarly, the SAM does not require initiators or targets to support reordering of SCSI requests or responses or the service delivery subsystem to provide in-order delivery of SCSI requests or responses. The SAM considers such details to be implementation-specific. Some applications are insensitive to the order in which SCSI requests or responses are processed. Conversely, some applications fail or generate errors when SCSI requests or responses are not processed in the desired order. If an application requires in-order processing of SCSI requests or responses, the initiator can control SCSI command execution via task attributes and queue algorithm modifiers (see ANSI T10 SPC-3). In doing so, the order of SCSI responses is also controlled. Or, the application might expect the SCSI delivery subsystem to provide in-order delivery of SCSI requests or responses. Such applications might exist in any storage network, so storage network administrators typically err on the side of caution and assume the presence of such applications. Thus, modern storage networks are typically designed and implemented to provide in-order delivery of SCSI requests and responses. Some SCSI Transport protocols support SCSI request/response reordering to facilitate the use of parallel transmission techniques within end nodes. This loosens the restrictions on storage network design and implementation practices.

Reordering should not be confused with reassembly. Reordering merely implies that the order of frame/packet receipt at the destination node does not determine the order of delivery to the application. If a frame or packet is received out of order, it is held until the missing frame or packet is received. At that time, the frames/packets are delivered to the ULP in the proper order. Reassembly requires frames/packets to be held in a buffer until all frames/packets that compose an upper layer protocol data unit (PDU) have been received. The PDU is then reassembled, and the entire PDU is delivered to the ULP. Moreover, reassembly does not inherently imply reordering. A protocol that supports reassembly may discard all received frames/packets of a given PDU upon receipt of an out-of-order frame/packet belonging to the same PDU. In this case, the ULP must retransmit the entire PDU. In the context of fragmentation, lower-layer protocols within the end nodes typically support reassembly. In the context of segmentation, higher-layer protocols within the end nodes typically support reassembly.
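The hold-until-the-gap-fills behavior described above can be sketched as a small reorder buffer. This is an illustrative model, not any particular protocol's implementation: an out-of-order frame is held, and as soon as the missing sequence number arrives, the contiguous run is released to the ULP in order.

```python
class ReorderBuffer:
    """Deliver frames to the ULP in sequence order (reordering, not
    reassembly): frames are released individually, not held until an
    entire PDU is complete."""

    def __init__(self):
        self.next_seq = 0   # next sequence number owed to the ULP
        self.held = {}      # out-of-order frames, keyed by sequence number

    def receive(self, seq: int, data: bytes) -> list:
        """Accept one frame; return the frames now deliverable in order."""
        self.held[seq] = data
        delivered = []
        # Release the contiguous run starting at next_seq, if any.
        while self.next_seq in self.held:
            delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return delivered

# Frame 1 arrives early and is held; frame 0 then releases both in order.
rb = ReorderBuffer()
```

A reassembly buffer would differ in that nothing is delivered until every frame of the PDU is present, after which the whole PDU is passed up at once.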

Link Aggregation

Terminology varies in the storage networking industry related to link aggregation. Some FC switch vendors refer to their proprietary link aggregation feature as trunking. However, trunking is well defined in Ethernet environments as the tagging of frames transmitted on an inter-switch link (ISL) to indicate the virtual LAN (VLAN) membership of each frame. With the advent of virtual SAN (VSAN) technology for FC networks, common sense dictates consistent use of the term trunking in both Ethernet and FC environments.

Note

One FC switch vendor uses the term trunking to describe a proprietary load-balancing feature implemented via the FC routing protocol. This exacerbates the confusion surrounding the term trunking.


By contrast, link aggregation is the bundling of two or more physical links so that the links appear to ULPs as one logical link. Link aggregation is accomplished within the OSI data-link layer, and the resulting logical link is properly called a port channel. Cisco Systems invented Ethernet port channels, which were standardized in March 2000 by the IEEE via the 802.3ad specification. Standardization enabled interoperability in heterogeneous Ethernet networks. Aggregation of FC links is not yet standardized. Thus, link aggregation must be deployed with caution in heterogeneous FC networks.

Transceivers

Transceivers can be integrated or pluggable. An integrated transceiver is built into the network interface on a line card, HBA, TOE, or NIC such that it cannot be replaced if it fails. This means the entire interface must be replaced if the transceiver fails. For switch line cards, this implication can be very problematic because an entire line card must be replaced to return a single port to service when a transceiver fails. Also, the type of connector used to mate a cable to an interface is determined by the transceiver. So, the types of cable that can be used by an interface with an integrated transceiver are limited. An example of an integrated transceiver is the traditional 10/100 Ethernet NIC, which has an RJ-45 connector built into it providing cable access to the integrated electrical transceiver.

By contrast, a pluggable transceiver incorporates all required transmit/receive componentry onto a removable device. The removable device can be plugged into an interface receptacle without powering down the device containing the interface (hot-pluggable). This allows the replacement of a failed transceiver without removal of the network interface. For switch line cards, this enables increased uptime by eliminating scheduled outages to replace failed transceivers. Also, hot-pluggable transceivers can be easily replaced to accommodate cable plant upgrades. In exchange for this increased flexibility, the price of the transceiver is increased. This results from an increase in research and development (R&D) costs and componentry. Pluggable transceivers are not always defined by networking standards bodies. Industry consortiums sometimes form to define common criteria for the production of interoperable, pluggable transceivers. The resulting specification is called a multi-source agreement (MSA). The most common types of pluggable transceivers are

  • Gigabit interface converter (GBIC)

  • Small form-factor pluggable (SFP)

A GBIC or SFP can operate at any transmission rate. The rate is specified in the MSA. Some MSAs specify multi-rate transceivers. Typically, GBICs and SFPs are not used for rates below 1 Gbps. Any pluggable transceiver that has a name beginning with the letter X operates at 10 Gbps. The currently available 10-Gbps pluggable transceivers include the following:

  • 10 Gigabit small form-factor pluggable (XFP)

  • XENPAK

  • XPAK

  • X2

XENPAK is the oldest 10 gigabit MSA. The X2 and XPAK MSAs build upon the XENPAK MSA. Both X2 and XPAK use the XENPAK electrical specification. XFP incorporates a completely unique design.




Storage Networking Protocol Fundamentals
ISBN: 1587051605
Year: 2007
Pages: 196
Author: James Long