TCPIP Suite

TCP/IP Suite

As Chapter 6, "OSI Network Layer," stated, IP does not provide reliable delivery. IP delivers packets on a best-effort basis and provides notification of non-delivery in some situations. This is because not all applications require reliable delivery. Indeed, the nature of some applications suits them to unreliable delivery mechanisms. For example, a network management station (NMS) that must poll thousands of devices using the Simple Network Management Protocol (SNMP) could be heavily burdened by the overhead associated with reliable delivery. Additionally, the affect of dropped SNMP packets is negligible as long as the proportion of drops is low. So, SNMP uses unreliable delivery to promote NMS scalability. However, most applications require reliable delivery. So, the TCP/IP suite places the burden of reliable delivery on the transport layer and provides a choice of transport layer protocols. For unreliable delivery, UDP is used. For reliable delivery, TCP is used. This provides flexibility for applications to choose the transport layer protocol that best meets their delivery requirements. For additional architectural information about the TCP/IP suite, readers are encouraged to consult IETF RFC 1180.

UDP

When packet loss has little or no negative impact on an application, UDP can be used. Broadcast traffic is usually transmitted using UDP encapsulation (for example, DHCPDIS-COVER packets defined in IETF RFC 2131). Some unicast traffic also uses UDP encapsulation. Examples of UDP-based unicast traffic include SNMP commands and DNS queries.

UDP Operational Overview

UDP is connectionless. RFC 768 defines UDP, and RFC 1122 defines procedural requirements for UDP. RFC 768 defines the packet format, the packet-level error detection mechanism, and the minimum/maximum packet sizes. UDP leaves all other communication parameters to the upper-layer protocols (ULPs). Thus, UDP adds very little overhead to IP. The only services that UDP provides to ULPs are data-integrity validation (via checksum) and ULP multiplexing.

ULP multiplexing is accomplished by assigning a 16-bit identifier (called a port number) to each session layer protocol. Port numbers are conceptually similar to EtherTypes and PPP protocol numbers at the data-link layer, and IP protocol numbers at the network layer. Both UDP and TCP use port numbers to multiplex ULPs. When a port number is assigned to a session layer protocol, it is assigned for use by UDP and TCP. This is done for several reasons. One reason is that some session layer protocols use both UDP and TCP. For example, DNS uses UDP in most situations, but DNS can use TCP to pass through firewalls that block UDP traffic. Another reason is to simplify network management. For example, a session layer protocol that uses UDP today might be modified in the future to use TCP. If that session layer protocol's UDP port number is not reserved for use with TCP, a different session layer protocol could be assigned that TCP port number. This would require assignment of a new port number for the first session layer protocol to use TCP. This complicates firewall configuration, content switching, protocol decoding, and other network management tasks related to that session layer protocol. IANA manages the range of valid port numbers. The range of valid port numbers is 0 to 65,535. It is divided into the following three categories:

Well known ports (WKPs) range from 0 to 1023 inclusive.
Registered Ports range from 1024 to 49,151 inclusive.
Dynamic Ports range from 49,152 to 65,535 inclusive.

Tip

The meaning of the term port varies depending on the context. Readers should not confuse the meaning of port in the context of UDP and TCP with the meaning of port in other contexts (such as switching hardware or the SAM addressing scheme).

IANA assigns WKPs. WKPs derive their name from their reserved status. Once IANA assigns a port number in the WKP range to a session layer protocol, that port number is reserved for use by only that session layer protocol. Assigned WKPs are used on servers to implement popular services for clients. By reserving port numbers for popular services, clients always know which port number to contact when requesting a service. For example, a server-based implementation of DNS "listens" for incoming client queries on UDP port 53. All unassigned WKPs are reserved by IANA and may not be used. IANA does not assign port numbers outside the WKP range. However, IANA maintains a public register of the server-based port numbers used by session layer protocols that have not been assigned a WKP. These are called Registered Ports. The third category is Dynamic Ports. It is used by clients when communicating with servers. Each session layer protocol within a client dynamically selects a port number in the Dynamic Port range upon initiating communication with a server. This facilitates server-to-client communication for the duration of the session. When the session ends, the Dynamic Port number is released and becomes available for reuse within the client.

UDP does not provide data segmentation, reassembly, or reordering for ULPs. So, ULPs that use UDP might suffer performance degradation due to IP fragmentation of large UDP packets. To avoid fragmentation, ULPs that use UDP are required to assess the maximum data segment size that can be transmitted. To facilitate this, UDP transparently provides Session Layer protocols access to the IP service interface. The session layer protocol in the source host queries UDP, which in turn queries IP to determine the local MTU. The session layer protocol then generates packets equal to or less than the local MTU, minus the overhead bytes of the IP and UDP headers. Alternately, path maximum transmission unit (PMTU) discovery can be used. The session layer protocol in the destination host must reorder and reassemble received segments using its own mechanism.

UDP Packet Formats

UDP offers a very limited set of services to ULPs, so the UDP packet format does not need to contain many fields. UDP uses a header but does not use a trailer. UDP packets are word-oriented, and an UDP word is 4 bytes. Figure 7-1 illustrates the UDP packet format.

Figure 7-1. UDP Packet Format

A brief description of each field follows:

Source Port 16 bits long. It indicates the port number of the session layer protocol within the transmitting host. The receiving host copies this value to the Destination Port field in reply packets.
Destination Port 16 bits long. It indicates the port number of the session layer protocol within the receiving host. This enables the receiving host to properly route incoming packets to the appropriate session layer protocol. The receiving host copies this value to the Source Port field in reply packets.
Length 16 bits long. It indicates the overall length of the packet (header plus data) expressed in bytes. Because this field is 16 bits long, the theoretical maximum length of a UDP packet is 65,536 bytes. The actual maximum length of a UDP packet is 65,516 bytes (the maximum size of the Data field in an IP packet).
Checksum 16 bits long. It contains a checksum that is calculated on the entire UDP packet prepended with a pseudo-header consisting of specific fields from the not-yet-generated IP header. The transport layer must inform the network layer where to send each packet, so certain information in the IP header is known to UDP (and TCP) before generation of the IP header. This enables UDP (and TCP) to generate the IP pseudo-header before calculating the checksum. Figure 7-2 illustrates the IP pseudo-header format. The IP pseudo-header is not transmitted as part of each packet. The IP pseudo-header does not include any IP header fields that are modified by IP routers during normal packet forwarding (such as the TTL field); therefore the UDP checksum does not need to be recalculated by each router. However, NAT devices must recalculate the checksum because they modify the source or destination IP address fields, which are included in the IP pseudo-header. By including the destination IP address field in the IP pseudo-header, UDP (and TCP) provide protection against accidental delivery of packets to the wrong destination. If the UDP Data field does not end on a 16-bit boundary, it is padded with zeros for the purpose of calculating the checksum. The pad bits are not transmitted as part of each packet. The value of the checksum field is zero for the purpose of calculating the checksum. Many Ethernet NICs now calculate the checksum in hardware, which offloads the processing burden from the host's CPU. Note that use of the Checksum field is optional (unlike TCP). The transmitting host may choose to not calculate a checksum. In this case, the transmitted value of the Checksum field is 0.

Figure 7-2. IP Pseudo-Header Format
Data(if present) variable in length. It contains an ULP (such as SNMP). The length of this field is variable with no minimum length and a maximum length of 65,508 bytes (65,536 bytes minus 20 bytes of IP header and 8 bytes of UDP header).

For more information about the UDP packet format, readers are encouraged to consult IETF RFCs 768 and 1122. A new variation of UDP (called UDP-Lite) is documented in RFC 3828. UDP-Lite is optimized for streaming media such as voice and video. As such, UDP-Lite is not applicable to modern storage networks. UDP-Lite is mentioned herein only to prevent confusion.

UDP Delivery Mechanisms

UDP supports only one set of delivery mechanisms that provide unacknowledged, connectionless service. Thus, UDP does not augment the delivery service provided by IP. UDP implements the following delivery mechanisms:

UDP does not detect packets dropped in transit. When the UDP process within a destination host drops a packet because of checksum failure, UDP does not report the drop to the ULP or source host. When the UDP process within a destination host drops a packet because the destination port number is not active, IP may report the drop to the source host by transmitting an ICMP Type 3 Code 3 message. However, notification is not required. In the absence of notification, ULPs are expected to detect drops via their own mechanisms. Likewise, recovery behavior is determined by the ULP that generated the packets.
UDP does not detect duplicate packets. If a duplicate packet is received, UDP delivers the packet to the appropriate ULP. ULPs are expected to detect duplicates via their own mechanisms.
UDP can detect corrupt packets via the Checksum field. Upon detection of a corrupt packet, the packet is dropped.
UDP does not provide acknowledgement of successful packet delivery. Each ULP is expected to define its own acknowledgement mechanism (if required).
UDP does not provide retransmission. Each ULP is expected to define its own retransmission mechanism (if required).
UDP does not provide end-to-end flow control.
Guaranteeing reserved bandwidth is not a transport layer function. UDP relies on QoS policies implemented by IP for bandwidth guarantees.
Guaranteeing fixed or minimal latency is not a transport layer function. UDP relies on QoS policies implemented by IP for latency guarantees.
Fragmentation and reassembly is not a transport layer function. UDP relies on IP for fragmentation and reassembly. To avoid fragmentation, UDP supports PMTU discovery as defined in RFC 1191 by providing ULPs access to the IP service interface. When UDP is notified of a drop via an ICMP Type 3 Code 4 message, the constricting MTU information is passed to the appropriate ULP. The ULP is responsible for resegmenting and retransmitting the data.
UDP does not guarantee in-order delivery. UDP does not provide packet reordering. Each ULP is expected to define its own out-of-order packet detection and reordering mechanism (if required).

For more information about UDP delivery mechanisms, readers are encouraged to consult IETF RFCs 768, 1122, and 1180.

UDP Connection Initialization

No UDP parameters are exchanged between hosts before communication. This reflects the connectionless nature of UDP.

TCP

When packet loss affects an application negatively, TCP is used. Most unicast traffic uses TCP encapsulation.

TCP Operational Overview

TCP is connection-oriented. Each TCP connection comprises a full-duplex virtual-circuit between a pair of end nodes. Like UDP, TCP provides data integrity validation (via checksum) and ULP multiplexing services (via port numbers). Unlike UDP, TCP provides data segmentation and reordering services, and several end-to-end delivery guarantees to ULPs. RFC 793 defines TCP, whereas RFC 1122 defines procedural requirements for TCP. RFC 793 defines the packet format, minimum and maximum packet sizes, error detection and recovery mechanisms (including a method of packet-loss detection and a procedure for packet retransmission), a flow-control mechanism, connection establishment/maintenance/teardown procedures, a packet reordering mechanism, and timeout values.

TCP packetizes and transmits data as TCP sees fit. In other words, data received from a ULP may or may not be sent immediately, and separate transmission requests may or may not be aggregated by TCP. For example, a message-based ULP (such as iSCSI) can pass multiple discrete "chunks" of data to TCP. These data chunks are called Protocol Data Units (PDUs). TCP may aggregate multiple PDUs into a single packet for transmission (subject to PMTU constraints).

Likewise, when an ULP passes to TCP a PDU that exceeds the PMTU, TCP may segment and transmit the PDU when and how TCP sees fit. This is known as byte streaming. To TCP, each ULP is the source of a stream of bytes. At the receiving host, TCP does not reassemble segments per se. Because TCP does not segment ULP data as discrete PDUs, TCP cannot reassemble segments into discrete PDUs. So, TCP in a receiving host merely ensures the proper order of received segments before passing data to a ULP. The ULP in the receiving host is responsible for detecting the boundaries between PDUs. Note that TCP in a receiving host may delay passing data to a ULP even when all segments are properly ordered. Sometimes this is done to improve buffer management efficiency. However, a mechanism (the Push bit) is defined to enable ULPs in the transmitting host to instruct TCP in both the transmitting and receiving hosts to immediately forward data (see Chapter 9, "Flow Control and Quality of Service"). In other words, the first PDU received from the ULP may be held by TCP while TCP waits for additional PDUs from the ULP. TCP does not recognize and process data received from a ULP as discrete PDUs.

TCP supports an optional keep-alive mechanism to maintain an open connection during periods of inactivity. The keep-alive mechanism is not included in RFC 793, so RFC 1122 addresses the need for a keep-alive mechanism. RFC 1122 requires the inactivity timer to be configurable. So, most modern TCP implementations support an inactivity timer ranging from a few seconds to many hours. Network administrators determine the optimum value for the inactivity timer based on specific deployment requirements. The TCP keep-alive mechanism works by transmitting an invalid packet, which elicits an acknowledgement packet from the peer host. The transmitting host indicates via the Sequence Number field that it is ready to transmit a byte of data that has already been acknowledged by the peer host. The keep-alive packet contains no data. Upon processing the keep-alive packet, the peer host detects the error in the Sequence Number field and retransmits the most recent acknowledgement packet. This two-packet exchange implicitly notifies each host that the connection is still valid. If the connection is no longer valid, the receiving host returns a reset packet to the transmitting host.

Multiple simultaneous TCP connections can be open between each pair of hosts. In fact, this is required when multiple ULPs are communicating because ULPs may not share a TCP connection. The combination of a TCP port number and an IP address is called a socket. Each TCP connection is uniquely identified by the pair of communicating sockets. This enables each host to use a single socket to communicate simultaneously with multiple remote sockets without confusing the connections. In the context of traditional IP networks, this is common in server implementations. In the context of storage networks, this is common in storage implementations. For example, an iSCSI target (storage array) may use a single socket to conduct multiple simultaneous sessions with several iSCSI initiators (hosts). The storage array's iSCSI socket is reused over time as new sessions are established.

Clients may also reuse sockets. For example, a single-homed host may select TCP port 55,160 to open a TCP connection for file transfer via FTP. Upon completing the file transfer, the client may terminate the TCP connection and reuse TCP port 55,160 for a new TCP connection. Reuse of Dynamic Ports is necessary because the range of Dynamic Ports is finite and would otherwise eventually exhaust. A potential problem arises from reusing sockets. If a client uses a given socket to connect to a given server socket more than once, the first incarnation of the connection cannot be distinguished from the second incarnation. Reincarnated TCP connections can also result from certain types of software crashes. TCP implements multiple mechanisms that work together to resolve the issue of reincarnated TCP connections. Note that each TCP host can open multiple simultaneous connections with a single peer host by selecting a unique Dynamic Port number for each connection.

TCP Packet Formats

TCP offers a rich set of services to ULPs, so the TCP packet format is quite complex compared to UDP. TCP uses a header but does not use a trailer. The TCP header format defined in RFC 793 is still in use today, but some fields have been redefined. TCP packets are word-oriented, and a TCP word is 4 bytes. Figure 7-3 illustrates the TCP packet format.

Figure 7-3. TCP Packet Format

A brief description of each field follows:

Source Port 16 bits long. It indicates the port number of the session layer protocol within the transmitting host. The receiving host copies this value to the Destination Port field in reply packets.
Destination Port 16 bits long. It indicates the port number of the session layer protocol within the receiving host. This enables the receiving host to properly route incoming packets to the appropriate session layer protocol. The receiving host copies this value to the Source Port field in reply packets.
Sequence Number 32 bits long. It contains the sequence number of the first byte in the Data field. TCP consecutively assigns a sequence number to each data byte that is transmitted. Sequence numbers are unique only within the context of a single connection. Over the life of a connection, this field represents an increasing counter of all data bytes transmitted. The Sequence Number field facilitates detection of dropped and duplicate packets, and reordering of packets. The purpose and use of the TCP Sequence Number field should not be confused with the FC Sequence_Identifier (SEQ_ID) field.
Acknowledgement Number 32 bits long. It contains the sequence number of the next byte of data that the receiving host expects. This field acknowledges that all data bytes up to (but not including) the Acknowledgement Number have been received. This technique is known as cumulative acknowledgement. This field facilitates flow control and retransmission. This field is meaningful only when the ACK bit is set to 1.
Data Offset(also known as the TCP Header Length field) 4 bits long. It indicates the total length of the TCP header expressed in 4-byte words. This field is necessary because the TCP header length is variable due to the Options field. Valid values are 5 through 15. Thus, the minimum length of a TCP header is 20 bytes, and the maximum length is 60 bytes.
Reserved 4 bits long.
Control Bits 8 bits long and contains eight sub-fields.
Congestion Window Reduced (CWR) bit when set to 1, it informs the receiving host that the transmitting host has reduced the size of its congestion window. This bit facilitates flow control in networks that support explicit congestion notification (ECN). See Chapter 9, "Flow Control and Quality of Service," for more information about flow control in IP networks.
ECN Echo (ECE) bit when set to 1, it informs the transmitting host that the receiving host received notification of congestion from an intermediate router. This bit works in conjunction with the CWR bit. See Chapter 9, "Flow Control and Quality of Service," for more information about flow control in IP networks.
Urgent (URG) bit when set to 1, it indicates the presence of urgent data in the Data field. Urgent data may fill part or all of the Data field. The definition of "urgent" is left to ULPs. In other words, the ULP in the source host determines whether to mark data urgent. The ULP in the destination host determines what action to take upon receipt of urgent data. This bit works in conjunction with the Urgent Pointer field.
Acknowledgement (ACK) bit when set to 1, it indicates that the Acknowledgement Number field is meaningful.
Push (PSH) bit when set to 1, it instructs TCP to act immediately. In the source host, this bit forces TCP to immediately transmit data received from all ULPs. In the destination host, this bit forces TCP to immediately process packets and forward the data to the appropriate ULP. When an ULP requests termination of an open connection, the push function is implied. Likewise, when a packet is received with the FIN bit set to 1, the push function is implied.
Reset (RST) bit when set to 1, it indicates a request to reset the connection. This bit is used to recover from communication errors such as invalid connection requests, invalid protocol parameters, and so on. A connection reset results in abrupt termination of the connection without preservation of state or in-flight data.
Synchronize (SYN) bit when set to 1, it indicates a new connection request. Connection establishment is discussed in the following Parameter Exchange section.
Final (FIN) bit when set to 1, it indicates a connection termination request.
Window 16 bits long. It indicates the current size (expressed in bytes) of the transmitting host's receive buffer. The size of the receive buffer fluctuates as packets are received and processed, so the value of the Window field also fluctuates. For this reason, the TCP flow-control mechanism is called a "sliding window." This field works in conjunction with the Acknowledgement Number field.
Checksum 16 bits long. It is used in an identical manner to the UDP Checksum field. Unlike UDP, TCP requires use of the Checksum field. The TCP header does not have a Length field, so the TCP length must be calculated for inclusion in the IP pseudo-header.
Urgent Pointer 16 bits long. It contains a number that represents an offset from the current Sequence Number. The offset marks the last byte of urgent data in the Data field. This field is meaningful only when the URG bit is set to 1.
Options if present, it is variable in length with no minimum length and a maximum length of 40 bytes. This field can contain one or more options. Individual options vary in length, and the minimum length is 1 byte. Options enable discovery of the remote host's maximum segment size (MSS), negotiation of selective acknowledgement (SACK), and so forth. The following section discusses TCP options in detail.
Padding variable in length with no minimum length and a maximum length of 3 bytes. This field is used to pad the header to the nearest 4-byte boundary. This field is required only if the Options field is used, and the Options field does not end on a 4-byte boundary. If padding is used, the value of this field is set to 0.
Data if present, it is variable in length with no minimum length and a maximum length of 65,496 bytes (65,536 bytes minus 20 bytes of IP header and 20 bytes of TCP header). This field contains ULP data (such as FCIP).

The preceding descriptions of the TCP header fields are simplified for the sake of clarity. For more information about the TCP packet format, readers are encouraged to consult IETF RFCs 793, 1122, 1323, and 3168.

TCP Options

TCP options can be comprised of a single byte or multiple bytes. All multi-byte options use a common format. Figure 7-4 illustrates the format of TCP multi-byte options.

Figure 7-4. TCP Multi-Byte Option Format

A brief description of each field follows:

Kind 8 bits long. It indicates the kind of option.
Length 8 bits long. It indicates the total length of the option (including the Kind and Length fields) expressed in bytes.
Option Data if present, it is variable in length and contains option-specific data. The format of this sub-field is determined by the kind of option.

Table 7-1 lists the currently defined TCP options that are in widespread use.

Table 7-1. TCP Options
Kind	Length In Bytes	Description
0	1	End of Option List (Pad)
1	1	No-Operation (No-Op)
2	4	Maximum Segment Size (MSS)
3	3	Window Scale (WSopt)
4	2	SACK-Permitted
5	Variable From 1034	SACK
8	10	Timestamps (TSopt)
19	18	MD5 Signature

Of the options listed in Table 7-1, only the MSS, WSopt, SACK-Permitted, SACK, and TSopt are relevant to modern storage networks. So, only these options are discussed in detail in this section.

Maximum Segment Size

The maximum segment size (MSS) option is closely related to PMTU discovery. Recall that TCP segments ULP data before packetizing it. MSS refers to the maximum amount of ULP data that TCP can packetize and depacketize. While this could be limited by some aspect of the TCP software implementation, it is usually limited by the theoretical maximum size of the Data field in a TCP packet (65,496 bytes). Each host can advertise its MSS to the peer host during connection establishment to avoid unnecessary segmentation. Note that the MSS option may be sent only during connection establishment, so the advertised MSS value cannot be changed during the life of an open connection. If the theoretical MSS is advertised, the peer host can send TCP packets that each contain up to 65,496 bytes of ULP data. This would result in severe fragmentation of packets at the network layer due to the ubiquity of Ethernet. So, the best current practice (BCP) is to advertise an MSS equal to the local MTU minus 40 bytes (20 bytes for the IP header and 20 bytes for the TCP header). This practice avoids unnecessary fragmentation while simultaneously avoiding unnecessary segmentation. If a host implements PMTU discovery, TCP uses the lesser of the discovered PMTU and the peer's advertised MSS when segmenting ULP data for transmission. Figure 7-5 illustrates the MSS option format.

Figure 7-5. TCP Maximum Segment Size Option Format

A brief description of each field follows:

Kind set to 2.
Length set to 4.
Maximum Segment Size 16 bits long. It indicates the largest segment size that the sender of this option is willing to accept. The value is expressed in bytes.

Window Scale

One drawback of all proactive flow control mechanisms is that they limit throughput based on the amount of available receive buffers. As previously stated, a host may not transmit when the receiving host is out of buffers. To understand the effect of this requirement on throughput, it helps to consider a host that has enough buffer memory to receive only one packet. The transmitting host must wait for an indication that each transmitted packet has been processed before transmitting an additional packet. This requires one round-trip per transmitted packet. While the transmitting host is waiting for an indication that the receive buffer is available, it cannot transmit additional packets, and the available bandwidth is unused. So, the amount of buffer memory required to sustain maximum throughput can be calculated as:

Bandwidth * Round-trip time (RTT)

where bandwidth is expressed in bytes per second, and RTT is expressed in seconds. This is known as the bandwidth-delay product. Although this simple equation does not account for the protocol overhead of lower-layer protocols, it does provide a reasonably accurate estimate of the TCP memory requirement to maximize throughput.

As TCP/IP matured and became widely deployed, the maximum window size that could be advertised via the Window field proved to be inadequate in some environments. So-called long fat networks (LFNs), which combine very long distance with high bandwidth, became common as the Internet and corporate intranets grew. LFNs often exceed the original design parameters of TCP, so a method of increasing the maximum window size is needed. The Window Scale option (WSopt) was developed to resolve this issue. WSopt works by shifting the bit positions in the Window field to omit the least significant bit(s). To derive the peer host's correct window size, each host applies a multiplier to the window size advertisement in each packet received from the peer host. Figure 7-6 illustrates the WSopt format.

Figure 7-6. TCP Window Scale Option Format

A brief description of each field follows:

Kind set to 3.
Length set to 3.
Shift Count 8 bits long. It indicates how many bit positions the Window field is shifted. This is called the scale factor.

Both hosts must send WSopt for window scaling to be enabled on a connection. Sending this option indicates that the sender can both send and receive scaled Window fields. Note that WSopt may be sent only during connection establishment, so window scaling cannot be enabled during the life of an open connection.

Selective Acknowledgement

TCP's cumulative acknowledgement mechanism has limitations that affect performance negatively. As a result, an optional selective acknowledgement mechanism was developed. The principal drawback of TCP's cumulative acknowledgement mechanism is that only contiguous data bytes can be acknowledged. If multiple packets containing non-contiguous data bytes are received, the non-contiguous data is buffered until the lowest gap is filled by subsequently received packets. However, the receiving host can acknowledge only the highest sequence number of the received contiguous bytes. So, the transmitting host must wait for the retransmit timer to expire before retransmitting the unacknowledged bytes. At that time, the transmitting host retransmits all unacknowledged bytes for which the retransmit timer has expired. This often results in unnecessary retransmission of some bytes that were successfully received at the destination. Also, when multiple packets are dropped, this procedure can require multiple roundtrips to fill all the gaps in the receiving host's buffer. To resolve these deficiencies, TCP's optional SACK mechanism enables a receiving host to acknowledge receipt of non-contiguous data bytes. This enables the transmitting host to retransmit multiple packets at once, with each containing non-contiguous data needed to fill the gaps in the receiving host's buffer. By precluding the wait for the retransmit timer to expire, retransmitting only the missing data, and eliminating multiple roundtrips from the recovery procedure, throughput is maximized.

SACK is implemented via two TCP options. The first option is called SACK-Permitted and may be sent only during connection establishment. The SACK-Permitted option informs the peer host that SACK is supported by the transmitting host. To enable SACK, both hosts must include this option in the initial packets transmitted during connection establishment. Figure 7-7 illustrates the format of the SACK-Permitted option.

Figure 7-7. TCP SACK-Permitted Option Format

A brief description of each field follows:

Kind set to 4.
Length set to 2.

After the connection is established, the second option may be used. The second option is called the SACK option. It contains information about the data that has been received. Figure 7-8 illustrates the format of the SACK option.

Figure 7-8. TCP SACK Option Format

A brief description of each field follows:

Kind set to 5.
Length set to 10, 18, 26, or 34, depending on how many blocks are conveyed. The requesting host decides how many blocks to convey in each packet.
Left Edge of Block "X" 32 bits long. It contains the sequence number of the first byte of data in this block.
Right Edge of Block "X" 32 bits long. It contains the sequence number of the first byte of missing data that follows this block.

TCP's SACK mechanism complements TCP's cumulative ACK mechanism. When the lowest gap in the receive buffer has been filled, the receiving host updates the Acknowledgement Number field in the next packet transmitted. Note that TCP's SACK mechanism provides an explicit ACK for received data bytes currently in the receive buffer, and an explicit NACK for data bytes currently missing from the receive buffer. This contrasts TCP's cumulative ACK mechanism, which provides an implicit ACK for all received data bytes while providing an implicit NACK for all missing data bytes except the byte referenced in the Acknowledgement Number field. Note also the difference in granularity between the retransmission procedures of TCP and FC. Whereas TCP supports retransmission of individual packets, the finest granularity of retransmission in FC is a sequence.

Timestamps

The Timestamps option (TSopt) provides two functions. TSopt augments TCP's traditional duplicate packet detection mechanism and improves TCP's RTT estimation in LFN environments. TSopt may be sent during connection establishment to inform the peer host that timestamps are supported. If TSopt is received during connection establishment, timestamps may be sent in subsequent packets on that connection. Both hosts must support TSopt for timestamps to be used on a connection.

As previously stated, a TCP connection may be reincarnated. Reliable detection of duplicate packets can be challenging in the presence of reincarnated TCP connections. TSopt provides a way to detect duplicates from previous incarnations of a TCP connection. This topic is discussed fully in the following section on duplicate detection.

TCP implements a retransmission timer so that unacknowledged data can be retransmitted within a useful timeframe. When the timer expires, a retransmission time-out (RTO) occurs, and the unacknowledged data is retransmitted. The length of the retransmission timer is derived from the mean RTT. If the RTT is estimated too high, retransmissions are delayed. If the RTT is estimated too low, unnecessary retransmissions occur. The traditional method of gauging the RTT is to time one packet per window of transmitted data. This method works well for small windows (that is, for short distance and low bandwidth), but severely inaccurate RTT estimates can result from this method in LFN environments. To accurately measure the RTT, TSopt may be sent in every packet. This enables TCP to make real-time adjustments to the retransmission timer based on changing conditions within the network (such as fluctuating levels of congestion, route changes, and so on). Figure 7-9 illustrates the TSopt format.

Figure 7-9. TCP Timestamps Option Format

A brief description of each field follows:

Kind set to 8.
Length set to 10.
Timestamp Value (TSval) 32 bits long. It records the time at which the packet is transmitted. This value is expressed in a unit of time determined by the transmitting host. The unit of time is usually seconds or milliseconds.
Timestamp Echo Reply (TSecr) 32 bits long. It is used to return the TSval field to the peer host. Each host copies the most recently received TSval field into the TSecr field when transmitting a packet. Copying the peer host's TSval field eliminates the need for clock synchronization between the hosts.

The preceding descriptions of the TCP options are simplified for the sake of clarity. For more information about the TCP option formats, readers are encouraged to consult IETF RFCs 793, 1122, 1323, 2018, 2385, and 2883.

TCP Delivery Mechanisms

TCP supports only one set of delivery mechanisms that provide acknowledged, connection-oriented service. Thus, TCP augments the delivery service provided by IP. TCP implements the following delivery mechanisms:

TCP can detect packets dropped in transit via the Sequence Number field. A packet dropped in transit is not acknowledged to the transmitting host. Likewise, a packet dropped by the receiving host because of checksum failure or other error is not acknowledged to the transmitting host. Unacknowledged packets are eventually retransmitted. TCP is connection-oriented, so packets cannot be dropped because of an inactive destination port number except for the first packet of a new connection (the connection request packet). In this case, TCP must drop the packet, must reply to the source host with the RST bit set to 1, and also may transmit an ICMP Type 3 Code 3 message to the source host.
TCP can detect duplicate packets via the Sequence Number field. If a duplicate packet is received, TCP drops the packet. Recall that TCP is permitted to aggregate data any way TCP chooses. So, a retransmitted packet may contain the data of the dropped packet and additional untransmitted data. When such a duplicate packet is received, TCP discards the duplicate data and forwards the additional data to the appropriate ULP. Since the Sequence Number field is finite, it is likely that each sequence number will be used during a long-lived or high bandwidth connection. When this happens, the Sequence Number field cycles back to 0. This phenomenon is known as wrapping. It occurs in many computing environments that involve finite counters. The time required to wrap the Sequence Number field is denoted as Twrap and varies depending on the bandwidth available to the TCP connection. When the Sequence Number field wraps, TCP's traditional method of duplicate packet detection breaks, because valid packets can be mistakenly identified as duplicate packets. Problems can also occur with the packet reordering process. So, RFC 793 defines a maximum segment lifetime (MSL) to ensure that old packets are dropped in the network and not delivered to hosts. The MSL mechanism adequately protected TCP connections against lingering duplicates in the early days of TCP/IP networks. However, many modern LANs provide sufficient bandwidth to wrap the Sequence Number field in just a few seconds. Reducing the value of the MSL would solve this problem, but would create other problems. For example, if the MSL is lower than the one-way delay between a pair of communicating hosts, all TCP packets are dropped. Because LFN environments combine high one-way delay with high bandwidth, the MSL cannot be lowered sufficiently to prevent Twrap from expiring before the MSL. Furthermore, the MSL is enforced via the TTL field in the IP header, and TCP's default TTL value is 60. For reasons of efficiency and performance, most IP networks (including the Internet) are designed to provide any-to-any connectivity with fewer than 60 router hops. So, the MSL never expires in most IP networks. To overcome these deficiencies, RFC 1323 defines a new method of protecting TCP connections against wrapping. The method is called protection against wrapped sequence numbers (PAWS). PAWS uses TSopt to logically extend the Sequence Number field with additional high-order bits so that the Sequence Number field can wrap many times within a single cycle of the TSopt field.
TCP can detect corrupt packets via the Checksum field. Upon detection of a corrupt packet, the packet is dropped.
TCP acknowledges receipt of each byte of ULP data. Acknowledgement does not indicate that the data has been processed; that must be inferred by combining the values of the Acknowledgement Number and Window fields. When data is being transmitted in both directions on a connection, acknowledgment is accomplished without additional overhead by setting the ACK bit to one and updating the Acknowledgement Number field in each packet transmitted. When data is being transmitted in only one direction on a connection, packets containing no data must be transmitted in the reverse direction for the purpose of acknowledgment.
TCP guarantees delivery by providing a retransmission service for undelivered data. TCP uses the Sequence Number and Acknowledgment Number fields to determine which data bytes need to be retransmitted. When a host transmits a packet containing data, TCP retains a copy of the data in a retransmission queue and starts the retransmission timer. If the data is not acknowledged before the retransmission timer expires, a retransmission time-out (RTO) occurs, and the data is retransmitted. When the data is acknowledged, TCP deletes the data from the retransmission queue.
TCP provides proactive end-to-end flow control via the Window field. TCP implements a separate window for each connection. Each time a host acknowledges receipt of data on a connection, it also advertises its current window size (that is, its receive buffer size) for that connection. A transmitting host may not transmit more data than the receiving host is able to buffer. After a host has transmitted enough data to fill the receiving host's receive buffer, the transmitting host cannot transmit additional data until it receives a TCP packet indicating that additional buffer space has been allocated or in-flight packets have been received and processed. Thus, the interpretation of the Acknowledgement Number and Window fields is somewhat intricate. A transmitting host may transmit packets when the receiving host's window size is zero as long as the packets do not contain data. In this scenario, the receiving host must accept and process the packets. See Chapter 9, "Flow Control and Quality of Service," for more information about flow control. Note that in the IP model, the choice of transport layer protocol determines whether end-to-end flow control is used. This contrasts the FC model in which the class of service (CoS) determines whether end-to-end flow control is used.
Guaranteeing reserved bandwidth is not a transport layer function. TCP relies on QoS policies implemented by IP for bandwidth guarantees.
Guaranteeing fixed or minimal latency is not a transport layer function. TCP relies on QoS policies implemented by IP for latency guarantees. The PSH bit can be used to lower the processing latency within the source and destination hosts, but the PSH bit has no affect on network behavior.
Fragmentation and reassembly is not a transport layer function. TCP relies on IP for fragmentation and reassembly. To avoid fragmentation, TCP supports PMTU discovery as defined in RFC 1191. When TCP is notified of a drop via an ICMP Type 3 Code 4 message, TCP resegments and retransmits the data.
TCP guarantees in-order delivery by reordering packets that are received out of order. The receiving host uses the Sequence Number field to determine the correct order of packets. Each byte of ULP data is passed to the appropriate ULP in consecutive sequence. If one or more packets are received out of order, the ULP data contained in those packets is buffered by TCP until the missing data arrives.

Comprehensive exploration of all the TCP delivery mechanisms is outside the scope of this book. For more information about TCP delivery mechanisms, readers are encouraged to consult IETF RFCs 793, 896, 1122, 1180, 1191, 1323, 2018, 2309, 2525, 2581, 2873, 2883, 2914, 2923, 2988, 3042, 3168, 3390, 3517, and 3782.

TCP Connection Initialization

TCP connections are governed by a state machine. As each new connection is initialized, TCP proceeds through several states. A TCP client begins in the CLOSED state and then proceeds to the SYN-SENT state followed by the ESTABLISHED state. A TCP server begins in the LISTEN state and then proceeds to the SYN-RECEIVED state followed by the ESTABLISHED state. ULP data may be exchanged only while a connection is in the ESTABLISHED state. When the ULP has no more data to transmit, TCP terminates the connection. TCP proceeds through several additional states as each connection is terminated. The state of each connection is maintained in a small area in memory called a transmission control block (TCB). A separate TCB is created for each TCP connection.

RFC 793 defines procedures for establishment, maintenance and teardown of connections between pairs of hosts. Following IP initialization, TCP may initiate a new connection. TCP's connection initialization procedure is often called a three-way handshake because it requires transmission of three packets between the communicating hosts. The initiating host selects an initial sequence number (ISN) for the new connection and then transmits the first packet with the SYN bit set to 1 and the ACK bit set to 0. This packet is called the SYN segment. If the initiating host supports any TCP options, the options are included in the SYN segment. Upon receipt of the SYN segment, the responding host selects an ISN for the new connection and then transmits a reply with both the SYN and ACK bits set to one. This packet is called the SYN/ACK segment. If the responding host supports any TCP options, the options are included in the SYN/ACK segment. The Acknowledgement Number field contains the initiating host's ISN incremented by one. Upon receipt of the SYN/ACK segment, the initiating host considers the connection established. The initiating host then transmits a reply with the SYN bit set to 0 and the ACK bit set to 1. This is called an ACK segment. (Note that ACK segments occur throughout the life of a connection, whereas the SYN segment and the SYN/ACK segment only occur during connection establishment.) The Acknowledgement Number field contains the responding host's ISN incremented by one. Despite the absence of data, the Sequence Number field contains the initiating host's ISN incremented by one. The ISN must be incremented by one because the sole purpose of the ISN is to synchronize the data byte counter in the responding host's TCB with the data byte counter in the initiating host's TCB. So, the ISN may be used only in the first packet transmitted in each direction. Upon receipt of the ACK segment, the responding host considers the connection established. At this point, both hosts may transmit ULP data. The first byte of ULP data sent by each host is identified by incrementing the ISN by one. Figure 7-10 illustrates TCP's three-way handshake.

Figure 7-10. TCP Three-Way Handshake

TCP/IP Suite

UDP

UDP Operational Overview

UDP Packet Formats

Figure 7-1. UDP Packet Format

Figure 7-2. IP Pseudo-Header Format

UDP Delivery Mechanisms

UDP Connection Initialization

TCP

TCP Operational Overview

TCP Packet Formats

Figure 7-3. TCP Packet Format

TCP Options

Figure 7-4. TCP Multi-Byte Option Format

Table 7-1. TCP Options

Maximum Segment Size

Figure 7-5. TCP Maximum Segment Size Option Format

Window Scale

Figure 7-6. TCP Window Scale Option Format

Selective Acknowledgement

Figure 7-7. TCP SACK-Permitted Option Format

Figure 7-8. TCP SACK Option Format

Timestamps

Figure 7-9. TCP Timestamps Option Format

TCP Delivery Mechanisms

TCP Connection Initialization

Figure 7-10. TCP Three-Way Handshake