Transmission Control Protocol (TCP) Retransmission and Time-Out

The reliable service of TCP requires that all segments containing data be acknowledged by the receiver. When an acknowledgment (ACK) for a segment is not received within a determined amount of time, the sender retransmits the segment. The sender might retransmit the segment multiple times before abandoning the connection. The retransmission and time-out behaviors of TCP directly affect TCP performance and can help prevent congestion on the internetwork.

Retransmission Time Out and Round Trip Time

For each connection, TCP maintains a variable called the retransmission time-out (RTO), which is the amount of time within which an ACK for the segment is expected. If TCP does not receive an ACK before the RTO expires, the segment is retransmitted.

The RTO must allow enough time for the following:

The initially sent TCP segment to traverse the internetwork (the transit time from source to destination).
The initially sent TCP segment to be received and processed by the destination node (the destination's inbound packet-processing time).
The generation of an ACK for the segment (the ACK generation time). One component of the ACK generation time is the delayed acknowledgment time of the destination node. Rather than sending an ACK segment for each TCP data segment received, TCP delays ACKs. These delayed ACKs can contain data, include updated window sizes, and acknowledge multiple segments received.
The generated ACK to traverse the internetwork (the transit time from destination to source).
The generated ACK to be received and processed by the sending node (the source's inbound packet-processing time).

The sum of all these times is known as the round-trip time (RTT). The RTT varies over time and must be constantly measured throughout the TCP connection's life. The RTO is based on the currently known RTT and should always be greater than the currently known RTT to prevent unnecessary retransmissions.

The RTO should be neither too large nor too small to prevent the following behaviors:

When the RTO is too large, the sending TCP peer must wait too long before retransmitting a lost segment. This lowers throughput for connections with some degree of packet loss.
When the RTO is too small, segments are retransmitted unnecessarily. Retransmitted segments increase the load on the internetwork and waste internetwork capacity.

If the ACK for the initially sent segment does not arrive within the RTO, the ACK is either arriving late or not at all. The main causes of ACK segments arriving late are either an increase in the transit time from the source to the destination or an increase in the transit time from the destination to the source.

The following are reasons why the ACK is not received at all:

The initially sent TCP segment is dropped at a router because of congestion.
The initially sent TCP segment is dropped at a router or the destination because of damage to the packet, which occurs when electronic or optical errors corrupt the encoded signal, causing bits within the packet to change values. Damaged packets are silently discarded after failing checksum calculations.
The ACK for the TCP segment is dropped at a router because of congestion.
The ACK for the TCP segment is dropped at a router or the destination because of damage to the packet.

It is much more likely that the TCP segment or its ACK was discarded by a congested router rather than being damaged and silently discarded.

Note

Unlike TCP segments containing data, ACKs that contain no data are not sent reliably. The ACK sender does not set an RTO for the ACK and does not retransmit the ACK segment. Therefore, a lost ACK is recovered by the sender retransmitting the segment(s) that the lost ACK is acknowledging, and not by the sender of the lost ACK retransmitting the ACK.

Congestion Collapse

The proper measurement of the RTT and determination of the RTO for sent TCP segments are important to prevent a phenomenon of routed internetworks known as congestion collapse. Congestion collapse occurs when the buffers of the internetwork routers fill to capacity and the routers begin to discard packets.

Congestion collapse begins with a steady increase in the load on the internetwork. As hosts send more data, more data is queued in the buffers of the internetwork routers. As this occurs, the transit time from the source to the destination and from the destination to the source increases. Therefore, the actual RTT grows larger than the currently known RTT of sending hosts.

The current RTO for sent segments is based on the currently known RTT. When the actual RTT increases to the extent that it is greater than the current RTO, sent TCP segments have ACKs that arrive late. When the ACKs do not arrive in the time based on the current RTO, the segments are retransmitted. There are then two copies of each retransmitted segment, effectively doubling the load on the internetwork at a time when the load needs to be decreased. As more TCP segments are retransmitted, eventually the buffers on the internetwork routers fill and the routers begin to discard packets.

Congestion collapse can be avoided through the ongoing determination of the current RTT, which is monitored on a per-window or per-segment basis. Changes in the current RTT are used to update the RTO.

The recurrence of congestion collapse is avoided through the combination of the slow start and congestion avoidance algorithms of the sending host, as discussed in Chapter 14, "Transmission Control Protocol (TCP) Data Flow." When the RTO for a segment expires, TCP assumes that RTO expiration is a result of the segment being discarded by a router experiencing congestion. Slow start and congestion avoidance are used to slowly scale the number of segments sent before waiting for an ACK up to the number of segments that fit in the receiver's advertised receive window.

Slow start and congestion avoidance are used together to prevent congestion collapse from recurring. Without slow start and congestion avoidance, once an internetwork becomes congested, it becomes congested again as the sending hosts begin transmitting new data and the internetwork oscillates between congested and uncongested states.

Retransmission Behavior

TCP uses the following exponential backoff behavior to determine the RTO of successive retransmissions of the same segment:

1. When the TCP segment is initially sent, the RTO for the segment is set to the currently known RTO for the connection.
2. When the RTO expires, the segment's RTO is set to twice the RTO of its previous transmission and the segment is retransmitted.

Step 2 is repeated for the maximum number of retransmissions before the TCP connection is abandoned. For TCP/IP for the Microsoft Windows Server 2003 family and Windows XP, the TcpMaxDataRetransmissions registry setting controls the maximum number of retransmissions.

TcpMaxDataRetransmissions

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 0–0xFFFFFFFF
Default value: 5
Present by default: No

TcpMaxDataRetransmissions sets the maximum number of retransmissions of a TCP segment containing data before the connection is abandoned.

The following Network Monitor trace (Capture 15-01, included in the Captures folder on the companion CD-ROM) shows the maximum number of retransmissions and the doubling of the RTO between successive retransmissions:

1 0.000000 LOCAL 0060083E4607 TCP .A...., len: 0, seq: 1311725-1311725, ack:23 FTP Server FTP Client
2 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
3 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
4 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
5 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
6 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
7 0.000000 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
8 0.500720 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
9 1.001440 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
10 2.002880 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
11 4.005760 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server
12 8.011520 0060083E4607 LOCAL FTP Data Transfer To Server, Port = 1296, size 1460 FTP Client FTP Server

This Network Monitor trace was captured from a File Transfer Protocol (FTP) client on which the uploading of a file was in progress and the cable connecting the network adapter of the FTP server was pulled. Frames 8 through 12 show the retransmission behavior of TCP/IP for the Windows Server 2003 family and Windows XP. Notice how the initial RTO is 0.5 seconds and successive retransmissions have RTOs that are doubled. After the last retransmission, the FTP client waits 16 seconds before abandoning the connection and recovering the connection's resources. It takes a total of 31.5 seconds to abandon the connection. The connection abandonment time is 63 times the RTO for the connection (the sum of RTO for the initial segment sent, 2*RTO for the first retransmission, 4*RTO for the second retransmission, 8*RTO for the third retransmission, 16*RTO for the fourth retransmission, and 32*RTO for the fifth retransmission).
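The arithmetic in the preceding paragraph can be reproduced with a short Python sketch. The function name backoff_schedule is hypothetical and is used only for illustration; the sketch assumes that every timer is exactly double the previous one, which ignores the small processing delays visible in the trace.

```python
def backoff_schedule(initial_rto, max_retransmissions):
    """Return the timer values: one for the initial send plus one per retransmission."""
    timers = []
    rto = initial_rto
    for _ in range(max_retransmissions + 1):
        timers.append(rto)
        rto *= 2  # exponential backoff: each retransmission doubles the previous RTO
    return timers


# Initial RTO of 0.5 seconds and the default of 5 data retransmissions.
timers = backoff_schedule(0.5, 5)
print(timers)       # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(timers))  # 31.5
```

The last timer (16 seconds) is the wait after the fifth retransmission, and the sum is the 31.5-second abandonment time.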

Note

The RTOs are doubled, but the elapsed time for sending the retransmitted segment might not be exactly doubled for other Network Monitor traces because of delays in processing, queuing, and the physical transmission of network frames.

Retransmission Behavior for New Connections

For new connections initiated by a Windows Server 2003 family– or Windows XP–based host, the TcpMaxConnectRetransmissions registry setting determines the maximum number of retransmissions of the synchronize (SYN) segment.

TcpMaxConnectRetransmissions

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 0–255
Default value: 2
Present by default: No

TcpMaxConnectRetransmissions sets the maximum number of retransmissions of a SYN segment before the connection attempt is abandoned. Exponential backoff is used between successive retransmissions of the SYN segment. With an initial RTO value of 3 seconds, it takes 21 seconds to abandon a connection attempt (the sum of 3 seconds for the initial SYN, 6 seconds for the first retransmission, and 12 seconds for the second retransmission). The initial RTO's value is controlled using the TcpInitialRTT registry setting described in the section entitled "Calculating the RTO," later in this chapter.

For new connections initiated by a TCP peer to a Windows Server 2003 family– or Windows XP–based host, the TcpMaxConnectResponseRetransmissions registry setting determines the SYN-ACK segment's maximum number of retransmissions.

TcpMaxConnectResponseRetransmissions

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 0–255
Default value: 2
Present by default: No

TcpMaxConnectResponseRetransmissions sets the maximum number of retransmissions of a SYN-ACK segment sent in response to a SYN segment before the connection attempt is abandoned. Exponential backoff is used between successive retransmissions of the SYN-ACK segment. With an initial RTO value of 3 seconds, it takes 21 seconds to abandon the connection (the sum of 3 seconds for the initial SYN-ACK, 6 seconds for the first retransmission, and 12 seconds for the second retransmission). If TcpMaxConnectResponseRetransmissions is greater than 1, SYN attack protection is used. See Chapter 13, "Transmission Control Protocol (TCP) Connections," for more information on the SYN attack.

Dead Gateway Detection

Dead gateway detection is an algorithm that detects the failure of the currently configured default gateway. If it detects a failure, dead gateway detection automatically switches to a new default gateway, provided there are multiple default gateways configured. Dead gateway detection uses TCP retransmission behavior to detect and recover from a downed router configured as the default gateway.

When an individual TCP connection retransmits a segment multiple times (half of TcpMaxDataRetransmissions), its next-hop IP address is changed to the next default gateway. When 25 percent of all TCP connections using the failed default gateway have been moved to the next default gateway, the default route in the IP routing table is updated with the next default gateway as the next-hop IP address.

If the new default gateway is unavailable, dead gateway detection is used to switch to the next default gateway in the configured list. When the last default gateway in the list is reached and becomes unavailable, the next default gateway is the first default gateway in the list. When the computer is restarted, the first default gateway in the list is used.

For a detailed example of how dead gateway detection works, consider a host with the following configuration:

The IP address of 10.0.0.99/24.
Two default gateways are configured: 10.0.0.1 and 10.0.0.2.
The default route 0.0.0.0/0 has 10.0.0.1 as its next-hop IP address.
There are currently 10 TCP connections for locations off the 10.0.0.0/24 subnet using 10.0.0.1 as their next-hop IP address.
TcpMaxDataRetransmissions is set at its default value of 5.

When the router at 10.0.0.1 fails, dead gateway detection uses the following process to change the default route to use the next-hop IP address of 10.0.0.2:

A TCP connection (one of the 10 TCP connections at the host) sends a data segment. Because no ACK is received, the segment is retransmitted. After the third retransmission, the next-hop IP address for this specific TCP connection is changed to 10.0.0.2. At this point, 10 percent of the TCP connections using the next-hop IP address of 10.0.0.1 have been switched to 10.0.0.2.
Another TCP connection sends a data segment. Because no ACK is received, the segment is retransmitted. After the third retransmission, the next-hop IP address for this specific TCP connection is changed to 10.0.0.2. At this point, 20 percent of the TCP connections using the next-hop IP address of 10.0.0.1 have been switched to 10.0.0.2.
Another TCP connection sends a data segment. Because no ACK is received, the segment is retransmitted. After the third retransmission, the next-hop IP address for this specific TCP connection is changed to 10.0.0.2. At this point, 30 percent of the TCP connections using the next-hop IP address of 10.0.0.1 have been switched to 10.0.0.2.
Because more than 25 percent of the TCP connections using 10.0.0.1 as their next-hop IP address have had their next-hop IP addresses changed, the default route in the IP routing table is updated to use 10.0.0.2 as the next-hop IP address.
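The process above can be sketched in Python. This is an illustration rather than the actual TCP/IP implementation: the function name simulate_dead_gateway is hypothetical, "half of TcpMaxDataRetransmissions" is assumed to round up (which matches the third retransmission in the example), and the route change is assumed to trigger once at least 25 percent of the connections have moved.

```python
import math


def simulate_dead_gateway(num_connections, max_data_retransmissions, gateways):
    """Return (retransmissions before a connection switches, per-step history)."""
    # Assumption: "half of TcpMaxDataRetransmissions" rounds up (3 when the setting is 5).
    switch_after = math.ceil(max_data_retransmissions / 2)
    failed, backup = gateways[0], gateways[1]
    default_route = failed
    history = []
    for moved in range(1, num_connections + 1):
        # One more connection has retransmitted `switch_after` times and moved to `backup`.
        # Assumption: the route changes once at least 25 percent of connections have moved.
        if 4 * moved >= num_connections:
            default_route = backup
        history.append((moved, default_route))
        if default_route == backup:
            break
    return switch_after, history


switch_after, history = simulate_dead_gateway(10, 5, ["10.0.0.1", "10.0.0.2"])
print(switch_after)  # 3
for moved, route in history:
    print(moved, "connection(s) moved; default route via", route)
```

With 10 connections, the default route stays at 10.0.0.1 after the first and second connections move and changes to 10.0.0.2 when the third one moves.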

The EnableDeadGWDetect registry setting controls dead gateway detection in TCP/IP for the Windows Server 2003 family and Windows XP.

EnableDeadGWDetect

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 0–1
Default value: 1
Present by default: Yes

EnableDeadGWDetect enables (when set to 1) or disables (when set to 0) dead gateway detection. Dead gateway detection is enabled by default.

Note

Dead gateway detection can change the default gateway configuration even when the local default gateway is functioning and a remote router fails. If a remote router in the path of traffic for TCP connections fails, TCP retransmissions for multiple TCP connections can cause dead gateway detection to switch default gateways.

Using the Selective Acknowledgment (SACK) TCP Option

The SACK TCP option allows the receiver to selectively acknowledge noncontiguous blocks of data received. However, the sender should not discard selectively acknowledged segments from its transmission queue until the segments are included in a cumulative acknowledgment.

RFC 2018 allows the data receiver to discard noncontiguous segments even though they have been selectively acknowledged. This is known as reneging on a selective acknowledgment, and its practice is discouraged. To keep reneged data from being lost on a connection, the sender must retransmit selectively acknowledged data until it is acknowledged by the Acknowledgment Number field in an ACK from the receiver.

More Info

TCP selective acknowledgments are described in RFC 2018, which can be found in the Rfc folder on the companion CD-ROM.

The retransmission behavior of selectively acknowledged segments is as follows:

For each segment, maintain a selective acknowledgment flag that is enabled when the segment is selectively acknowledged.
When initial RTO timers begin to expire, only retransmit the segments that have not been selectively acknowledged (segments for which the selective acknowledgment flag is disabled).
If an ACK is received that cumulatively acknowledges the retransmitted segment, the send window closes and opens depending on the new Acknowledgment Number + Window sum and new segments can be sent. The selective acknowledgment flags on noncumulatively acknowledged segments are maintained.
If a retransmitted segment times out, indicating that the receiver might have reneged on the selectively acknowledged segments, disable the selective acknowledgment flags of all segments in the current window and retransmit them normally.

This mechanism recovers from the possibility that the receiver discarded the noncontiguous received segments. If necessary, the entire window of data is resent.
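The flag handling described above can be sketched as follows. The Segment class and the three function names are hypothetical, and segments are identified by simple index numbers rather than byte sequence numbers to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int              # illustrative segment index, not a byte sequence number
    sacked: bool = False  # selective acknowledgment flag


def on_initial_rto(window):
    """Retransmit only the segments that have not been selectively acknowledged."""
    return [s.seq for s in window if not s.sacked]


def on_cumulative_ack(window, ack_number):
    """Drop cumulatively acknowledged segments; SACK flags on the rest are kept."""
    return [s for s in window if s.seq >= ack_number]


def on_retransmit_timeout(window):
    """The receiver might have reneged: clear every flag and resend the whole window."""
    for s in window:
        s.sacked = False
    return [s.seq for s in window]


# Five outstanding segments; the receiver has selectively acknowledged 3 and 4.
window = [Segment(n) for n in range(1, 6)]
window[2].sacked = window[3].sacked = True

print(on_initial_rto(window))         # [1, 2, 5]
print(on_retransmit_timeout(window))  # [1, 2, 3, 4, 5]
```

The second call shows the fallback case in which a retransmitted segment times out and the entire window is resent.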

Calculating the RTO

The determination of the RTO is an important function of TCP. The RTO must be adjusted to the internetwork's changing conditions. If the determined RTO is less than the RTT, segments are unnecessarily retransmitted.

In RFC 793, the suggested method of computing the RTO—known as the smoothed round-trip time (SRTT)—is based on the following formulas:

SRTT = (a*SRTT) + ((1-a)*RTT)

RTO = min[UpperBound, max[LowerBound, (b*SRTT)]]

Thus, the new RTO is based on the determination of the current RTT, the previous SRTT, a smoothing factor (a), and a variance factor (b). RFC 793 cited this formula as an example method of computing the RTO. In practice, this formula was found to be inadequate in determining the RTO in an environment in which the RTT changed suddenly. Instead, RFC 1122 states that TCP must use the following formulas as documented in "Congestion Avoidance and Control," a paper written by Van Jacobson and Michael J. Karels:

Err = New_RTT – SRTT

SRTT = SRTT + Err/8

Dev = Dev + (|Err| – Dev)/4

RTO = SRTT + 4*Dev

This new way of calculating the RTO is based on the average and variance (Dev) of the RTT. The RTO is self-tuning for different environments (the low-delay local area network [LAN] and the high-delay wide area network [WAN]) and is sensitive to sudden changes in the RTT for environments such as the Internet.
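Both approaches can be compared with a small Python sketch. It assumes the commonly cited gains of 1/8 for the smoothed RTT and 1/4 for the deviation, and an RTO of the smoothed RTT plus four times the deviation. The names rfc793_rto and RtoEstimator, the default parameter values (taken from the example ranges in RFC 793), and the way the estimator is seeded from the first sample are assumptions made for illustration.

```python
def rfc793_rto(srtt, rtt, alpha=0.875, beta=2.0, lower=1.0, upper=60.0):
    """RFC 793 example method: smooth the RTT, then scale and clamp it (seconds)."""
    srtt = alpha * srtt + (1 - alpha) * rtt
    return srtt, min(upper, max(lower, beta * srtt))


class RtoEstimator:
    """Mean-and-deviation estimator in the style of Jacobson and Karels."""

    def __init__(self, first_rtt):
        # Assumption: seed the state from the first sample.
        self.srtt = first_rtt
        self.dev = first_rtt / 2

    @property
    def rto(self):
        return self.srtt + 4 * self.dev

    def update(self, new_rtt):
        err = new_rtt - self.srtt
        self.srtt += err / 8                   # gain of 1/8 on the smoothed RTT
        self.dev += (abs(err) - self.dev) / 4  # gain of 1/4 on the deviation
        return self.rto


est = RtoEstimator(100)   # milliseconds
print(est.update(100))    # steady RTT: the deviation shrinks, so the RTO falls
print(est.update(500))    # sudden jump: the deviation term raises the RTO at once
print(rfc793_rto(2.0, 4.0))
```

The deviation term is what lets the second estimator react to a sudden RTT change within a single sample instead of waiting for the smoothed average to catch up.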

More Info

RTO calculation is described in RFCs 793 and 1122, which can be found in the Rfc folder on the companion CD-ROM.

For TCP/IP for the Windows Server 2003 family and Windows XP, the TcpInitialRTT registry setting controls the RTO's initial value for establishing connections or sending data on new connections.

TcpInitialRTT

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\InterfaceGUID
Data type: REG_DWORD
Valid range: 0–0xffff
Default value: 3
Present by default: No

TcpInitialRTT sets the number of seconds for the initial RTO for SYN segments, SYN-ACK segments, and initial data segments sent on a new connection for each interface. Increasing this value from its default has a multiplicative effect on the amount of time it takes to time-out from a connection establishment or when sending data on a new connection.

For new connections being established by a host, the connection abandonment time is 7*TcpInitialRTT (assuming the default value of TcpMaxConnectRetransmissions). For arbitrary values of TcpInitialRTT and TcpMaxConnectRetransmissions, the connection abandonment time is

TcpInitialRTT*[2^(TcpMaxConnectRetransmissions+1) – 1]

For new connections being requested from a host, the connection abandonment time is 7*TcpInitialRTT (assuming the default value of TcpMaxConnectResponseRetransmissions). For arbitrary values of TcpInitialRTT and TcpMaxConnectResponseRetransmissions, the connection abandonment time is

TcpInitialRTT*[2^(TcpMaxConnectResponseRetransmissions+1) – 1]
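As a quick check, the closed form can be compared against the sum of the doubled timers. The sketch below reads the bracketed term as 2 raised to the power of (retransmissions + 1), minus 1, which is the value that yields the factor of 7 for the default setting of 2. The function names are hypothetical.

```python
def abandonment_time(initial_rtt, max_retransmissions):
    """Closed form: initial_rtt * (2^(n+1) - 1)."""
    return initial_rtt * (2 ** (max_retransmissions + 1) - 1)


def abandonment_time_by_sum(initial_rtt, max_retransmissions):
    """Same value, computed by adding each doubled timer in turn."""
    return sum(initial_rtt * 2 ** k for k in range(max_retransmissions + 1))


print(abandonment_time(3, 2))         # 21
print(abandonment_time_by_sum(3, 2))  # 21 (3 + 6 + 12)
```

The two agree because the timers form a geometric series with a ratio of 2.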

As data segments are sent, the RTO is adjusted from the TcpInitialRTT to a value closer to the connection's RTT. By default, the connection's RTT is not sampled for each segment sent. Rather, the RTT is sampled once for every full send window of data sent. If the send window is 12*MSS (maximum segment size), the RTT is sampled once every 12 segments. For each sample of the RTT, the time that the sampled segment is sent is recorded based on the current value of an internal clock. When the ACK for the segment is received, the RTT is determined from the difference between the recorded value of when the segment was sent and the current value of the internal clock.

The RTT sampling rate is 1/(window size). For small window sizes, this sampling rate is adequate. However, for large windows, the sampling rate is inadequate and cannot keep up with rapid changes in the RTT. The result is increased network bandwidth utilization by unnecessary retransmissions when the currently known RTO is less than the current RTT. In these situations, the TCP Timestamps option is used to provide a sampling rate that is equal to the sending rate.

Using the TCP Timestamps Option

As described in Chapter 12, "Transmission Control Protocol (TCP) Basics," the TCP Timestamps option allows TCP peers to place a timestamp value on each segment. The TCP Timestamps option contains two 32-bit fields to track timestamps: TS Value and TS Echo Reply. The TS Value field stores the current timestamp value. The TS Echo Reply field stores the timestamp echo, the value of the TS Value field of the segment being acknowledged.

The use of TCP timestamps allows an RTT to be calculated by subtracting the timestamp echo in the ACK from the current time value of the timestamp clock.

As an example, TCP Peer A sends a data segment to TCP Peer B, which sends an ACK back. The data segment's TS Value is 1285458 when it is sent and is echoed in the ACK segment's TS Echo Reply field. When the ACK is received and processed, the current value of TCP Peer A's timestamp clock is 1286506. Therefore, the RTT for this segment is based on the TCP timestamp value of 1048, or 1286506 – 1285458.
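The subtraction in this example can be written as a one-line Python function. The name rtt_ticks is hypothetical. Because the timestamp fields are 32 bits wide, the sketch performs the subtraction modulo 2^32 so that it still works when the timestamp clock wraps. The result is in timestamp clock ticks; converting it to a time depends on the clock granularity, which the example does not state.

```python
TS_MODULUS = 2 ** 32  # TS Value and TS Echo Reply are 32-bit fields


def rtt_ticks(ts_clock_now, ts_echo_reply):
    """Elapsed timestamp-clock ticks between sending a segment and receiving its ACK."""
    return (ts_clock_now - ts_echo_reply) % TS_MODULUS


print(rtt_ticks(1286506, 1285458))  # 1048
```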

This basic method of RTT determination is complicated by the following factors:

There might be pauses in sending data.
ACKs are delayed and can acknowledge multiple TCP segments.
Segments can arrive out of sequence.
Segments can be dropped and must be retransmitted.

Figure 15-1 illustrates the problem with pauses in sending data. TCP Peer A sends TCP Peer B a series of segments and then pauses. Then TCP Peer A sends more segments. The new segment after the pause has the TS Echo Reply field set to the TS Value field of the last ACK received. If TCP Peer B now calculates the RTT for the last ACK sent, the RTT is inflated by the time of the pause in sending data.

Figure 15-1: The behavior of TCP timestamps with pauses in data.

From Figure 15-1, the TCP timestamp interval calculated from TCP segment 5 is 1898 (10951 – 9053), clearly the wrong value, as it includes the pause in sending data. With an RTO adjusted to this higher value of the RTT, throughput for data sent by TCP Peer B is not optimal because the RTO is too high. To prevent this behavior, the RTT is calculated only for TCP segments that acknowledge new data sent. Therefore, in the example shown in Figure 15-1, the RTT is calculated only by TCP Peer A. TCP Peer B does not calculate RTT because the segments received by TCP Peer B do not acknowledge data sent by TCP Peer B.

For delayed ACKs, segments that arrive out of order, and retransmitted segments, the value of TS Echo Reply for ACKs is based on the following algorithm:

For correct TCP timestamp behavior, TCP keeps track of two variables for each connection: tsrecent is the value of the TS Echo Reply that will be sent in the next ACK, and lastack is the value of the Acknowledgment Number field from the last ACK sent.
After receipt of a new segment, if the segment contains the byte numbered lastack, which means that a contiguous segment has arrived, update tsrecent with the value of the TS Value field from the arriving segment. If the segment does not contain lastack, ignore the value of the TS Value field of the arriving segment.
When sending a segment with the TCP Timestamp option, set the value of TS Echo Reply to the value of tsrecent.
When sending an ACK, set the value of lastack to the value of the Acknowledgment Number field in the ACK.
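The four rules above can be sketched as a small Python class. The class name TimestampEcho and the segment sizes and TS Value numbers in the usage example are hypothetical; the sketch also ignores sequence number wraparound for brevity.

```python
class TimestampEcho:
    """Receiver-side tracking of tsrecent and lastack for one connection."""

    def __init__(self, tsrecent, lastack):
        self.tsrecent = tsrecent  # value to place in the next TS Echo Reply
        self.lastack = lastack    # Acknowledgment Number of the last ACK sent

    def on_segment(self, seq, length, ts_value):
        # Update tsrecent only if the segment contains the byte numbered lastack.
        if seq <= self.lastack < seq + length:
            self.tsrecent = ts_value

    def send_ack(self, ack_number):
        ts_echo_reply = self.tsrecent
        self.lastack = ack_number
        return ts_echo_reply


# Delayed ACK: three contiguous 100-byte segments arrive before one ACK is sent.
peer = TimestampEcho(tsrecent=10, lastack=1000)
peer.on_segment(seq=1000, length=100, ts_value=100)  # contains byte 1000
peer.on_segment(seq=1100, length=100, ts_value=110)  # ignored
peer.on_segment(seq=1200, length=100, ts_value=120)  # ignored
print(peer.send_ack(1300))  # 100, the TS Value of the first segment acknowledged
```

Echoing the first segment's timestamp means that the sender's RTT measurement includes the acknowledgment delay.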

For delayed acknowledgments, the RTT determination must include the acknowledgment delay. Therefore, when sending a delayed acknowledgment, the TS Echo Reply of the delayed ACK is set to the TS Value of the first segment being acknowledged. Figure 15-2 illustrates this behavior.

Figure 15-2: The behavior of TCP timestamps for delayed acknowledgments.

Prior to receiving any TCP segments, the value of tsrecent is 10 and the value of lastack is 1000. When TCP segment 1 arrives, it contains the lastack byte and therefore tsrecent is updated with the TS Value of 100. When TCP segment 2 arrives, it does not contain the lastack byte and tsrecent remains at the value of 100. When TCP segment 3 arrives, it does not contain the lastack byte and tsrecent remains at the value of 100. When the delayed ACK is sent, the value of TS Echo Reply is set to tsrecent and lastack is set to the value of the Acknowledgment Number field.

When segments arrive out of sequence, the value of tsrecent, and therefore the value of TS Echo Reply, is not updated. TS Echo Reply and tsrecent are updated only when the missing segment(s) arrives. Figure 15-3 illustrates this behavior.

Figure 15-3: The behavior of TCP timestamps for out-of-order segments.

Prior to receiving any TCP segments, the value of tsrecent is 10 and the value of lastack is 1000. When TCP segment 1 arrives, it contains the lastack byte and therefore tsrecent is updated with the TS Value field value of 100. When the ACK on segment 1 is sent, the value of TS Echo Reply field is set to tsrecent and lastack is set to the Acknowledgment Number field's value.

When TCP segment 3 arrives, it does not contain the lastack byte, and tsrecent remains at the value of 100. When TCP segment 2 arrives, it does contain the lastack byte and the value of tsrecent is updated.

When a segment is dropped and must be retransmitted and the segments arrive out of sequence, the value of tsrecent, and therefore the value of the TS Echo Reply field, is not updated. Because the RTT does not include the RTO for the retransmitted segment, tsrecent and TS Echo Reply are updated only when the missing retransmitted segment arrives. Figure 15-4 illustrates this behavior.

Figure 15-4: The behavior of TCP timestamps for retransmitted segments.

Prior to receiving any TCP segments, the value of tsrecent is 10 and the value of lastack is 1000. When TCP segment 1 arrives, it contains the lastack byte and therefore tsrecent is updated with the TS Value of 100. When the ACK on segment 1 is sent, the value of TS Echo Reply is set to tsrecent and lastack is set to the value of the Acknowledgment Number field.

When TCP segment 3 arrives, it does not contain the lastack byte and tsrecent remains at the value of 100. When the retransmitted TCP segment 2 arrives, it does contain the lastack byte and the value of tsrecent is updated.

Karn's Algorithm

When calculating the RTT for a TCP segment being sent, the time at which the segment is sent is recorded. If the RTO expires, an exact duplicate is sent and its time is recorded. When the ACK is received, how is the RTT computed? When the TCP Timestamps option is not being used, the ACK does not distinguish between the original TCP segment and its retransmitted copy. TCP has the problem of acknowledgment ambiguity. When multiple copies of a TCP segment are sent, the ACK does not identify a specific instance of the TCP segment being acknowledged.

If we choose to calculate the RTT based on the first instance of the segment and the first instance is lost, the measured RTT is larger than the actual RTT for the connection because it includes the RTO for retransmitting the segment. The measured RTT is the difference between the time the first segment was sent and the time the ACK for the retransmitted instance was received. The new RTO grows larger than it should, resulting in lowered throughput for retransmitted segments. As more TCP segments are lost, the RTO based on this method of RTT calculation grows larger.

If we choose to calculate the RTT based on the retransmitted instance of the segment, and the RTO expired as a result of a sudden increase in the RTT, the ACK for the first instance arrives soon after the retransmitted segment is sent. The measured RTT (the difference between the time the retransmitted segment was sent and the time the ACK for the first instance was received) is now smaller than the connection's actual RTT. The updated RTO gets smaller when it should get larger, eventually resulting in unnecessary retransmissions for subsequent segments.

To prevent these conditions from incorrectly changing the RTO, RTT measurements for TCP segments that have been retransmitted are ignored. Only the RTT for an ACK that acknowledges a single instance of a TCP segment is considered. However, ignoring the RTT for retransmitted segments introduces a new problem. When the actual RTT increases suddenly, the RTO for a TCP segment is too small and results in a retransmission. Because the RTT is not calculated for the retransmitted segment, the RTO remains at its inadequate value. Subsequent TCP segments sent would also be retransmitted.

To keep subsequent TCP segments from being sent with an inadequate RTO when the actual RTT increases suddenly, TCP/IP implementations, including TCP/IP for the Windows Server 2003 family and Windows XP, use Karn's algorithm. Karn's algorithm is named after its creator, Phil Karn, and is described in the paper "Improving Round-Trip Time Estimates in Reliable Transport Protocols," by Phil Karn and Craig Partridge. Karn's algorithm states that when an ACK for a retransmitted segment arrives, it should not be used to update the RTO. However, the RTO of the retransmitted segment (that has been exponentially backed off) should be used as a temporary RTO for subsequent TCP segments. When an ACK for a nonretransmitted TCP segment arrives, use its RTT to update the RTO. Then, use the updated RTO for subsequent TCP segments.

For example, if the RTO for a TCP connection is 300 ms and the actual RTT for the connection suddenly rises to 400 ms, Karn's algorithm causes the following behavior:

Segment A is sent and its RTO is set to 300 ms.
Because the RTO for Segment A is lower than the connection's actual RTT, the RTO for Segment A expires. Segment A's RTO is set to 600 ms and retransmitted (using exponential backoff and a factor of 2).
The ACK for Segment A arrives (400 ms after the first instance of Segment A was sent).
Because the ACK is for a retransmitted segment, it is not used to update the RTO.
TCP temporarily sets the RTO for subsequent segments to 600 ms (the RTO of the retransmitted Segment A).
Segment B is transmitted and Segment B's RTO is set to 600 ms.
The ACK for Segment B arrives in 400 ms.
Because the ACK is for a segment that has not been retransmitted, its RTT is calculated and used to update the RTO.
Subsequent segments are sent using the updated RTO.
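
The steps above can be traced with a minimal Python sketch. The class name KarnRtoTracker, its method names, and the estimate_rto callback are hypothetical and introduced only for illustration. The default estimator (twice the measured RTT) is a stand-in assumption, not the RTO calculation that TCP actually uses.

```python
class KarnRtoTracker:
    """Illustrative RTO bookkeeping for one connection under Karn's algorithm."""

    def __init__(self, rto_ms, estimate_rto=lambda rtt_ms: 2 * rtt_ms):
        self.rto_ms = rto_ms
        # Stand-in estimator (assumption): a real stack derives the RTO
        # from smoothed RTT samples rather than from a single measurement.
        self.estimate_rto = estimate_rto

    def on_timeout(self):
        # Exponential backoff with a factor of 2. The backed-off value is
        # also kept as the temporary RTO for subsequent segments.
        self.rto_ms *= 2
        return self.rto_ms

    def on_ack(self, measured_rtt_ms, was_retransmitted):
        if was_retransmitted:
            # Ambiguous ACK: do not use its RTT; keep the backed-off RTO.
            return self.rto_ms
        # Unambiguous ACK: the RTT sample is used to update the RTO.
        self.rto_ms = self.estimate_rto(measured_rtt_ms)
        return self.rto_ms


tracker = KarnRtoTracker(rto_ms=300)
after_backoff = tracker.on_timeout()                      # Segment A times out
after_ack_a = tracker.on_ack(400, was_retransmitted=True)   # ACK for Segment A
after_ack_b = tracker.on_ack(400, was_retransmitted=False)  # ACK for Segment B
print(after_backoff, after_ack_a, after_ack_b)
```

With the stand-in estimator, the RTO stays at 600 ms after the ambiguous ACK for Segment A and changes only when the ACK for Segment B supplies a usable RTT sample.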

Karn's Algorithm and the Timestamps Option

Karn's algorithm applies when ACKs are ambiguous, that is, when TCP cannot tell whether an ACK is for the original TCP segment or for a retransmitted instance. However, with the TCP Timestamps option, each TCP segment carries a steadily increasing timestamp clock value (the TS Value field in the TCP Timestamps option header) and is therefore unique within the time that segments are being retransmitted. The ACKs for different instances of a TCP segment can be distinguished from one another because each ACK contains the echo of the timestamp value of the segment being acknowledged. Therefore, Karn's algorithm does not apply when TCP timestamps are being used.

If a segment is retransmitted because of a segment loss, the ACK for the retransmitted segment contains the timestamp value for the retransmitted segment, and not the original segment. Therefore, the RTT is accurately calculated as the difference in the current TCP time clock and the ACK's timestamp echo.

If a segment is retransmitted because of a sudden increase in RTT, the ACK contains the timestamp value of the first instance. Therefore, the RTT is accurately calculated as the difference in the current TCP time clock and the timestamp echo in the ACK for the first segment.
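
Both cases reduce to the same subtraction, which the following Python sketch illustrates. The function name rtt_from_echo and the clock values are hypothetical. The sketch assumes the result is wanted in ticks of the TCP timestamp clock and handles wraparound of the 32-bit TS Value field.

```python
TS_MODULUS = 2 ** 32  # the TS Value field is 32 bits wide


def rtt_from_echo(current_clock, ts_echo):
    """Return the RTT in clock ticks: current clock minus the echoed timestamp."""
    # The modulo keeps the result correct if the clock wrapped past 2**32.
    return (current_clock - ts_echo) % TS_MODULUS


# Segment loss: the original (sent at tick 1000) is lost, the retransmission
# is sent at tick 1300, and its ACK arrives at tick 1700 echoing 1300.
rtt_after_loss = rtt_from_echo(1700, 1300)

# Sudden RTT increase: the original (sent at tick 1000) is only delayed, and
# its ACK arrives at tick 1400 echoing 1000, after the retransmission was sent.
rtt_after_delay = rtt_from_echo(1400, 1000)

print(rtt_after_loss, rtt_after_delay)
```

In both hypothetical cases the measured RTT is 400 ticks, because the echo identifies which instance of the segment the ACK belongs to.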

Fast Retransmit

When a TCP segment arrives and the sequence number is not the next sequence number the receiver was expecting (a noncontiguous, out-of-order segment), an immediate ACK is sent with the Acknowledgment Number field set to the next sequence number the receiver was expecting. This ACK is a duplicate of an ACK that was previously sent and is not subject to the delayed acknowledgment behavior for new contiguous data received.

After receipt of this duplicate ACK, the sender cannot determine whether the duplicate ACK was sent by the receiver because of a TCP segment that arrived out of order or because a segment was lost.

If a TCP segment arrived out of order, the TCP segment that contains the next byte the receiver expects to receive should arrive at the receiver shortly thereafter and a cumulative ACK is sent. Therefore, for out-of-order segments, only one or two duplicate ACKs are likely to be sent.
If a TCP segment is lost, every segment beyond the missing one that arrives at the receiver generates an immediate duplicate ACK. Therefore, if three or more duplicate ACKs arrive at the sender, the TCP segment containing the next byte the receiver expects is most likely lost and must be retransmitted.

Fast retransmit is the retransmission of a TCP segment before the RTO for the segment expires, based on the receipt of three duplicate ACKs where the ACK's acknowledgment number is the retransmitted segment's sequence number. The retransmitted segment is the missing segment.

More Info

Fast retransmit and fast recovery are described in RFC 2581, which can be found in the Rfc folder on the companion CD-ROM.

As Figure 15-5 illustrates, TCP Peer A sends five TCP segments and the first segment is lost. As the noncontiguous segments arrive, TCP Peer B sends an immediate ACK with the ACK number it expects to receive. After the third duplicate ACK for sequence number 1000, TCP Peer A retransmits the first segment.

Figure 15-5: Fast retransmit behavior when the first of five segments is dropped.
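
The sender-side counting behind this behavior can be sketched in Python as follows. The class name DupAckDetector and its fields are hypothetical, and the sketch is simplified: it compares only acknowledgment numbers, whereas a real TCP implementation applies additional checks before treating an ACK as a duplicate. The ACK stream assumes one original ACK for sequence number 1000 followed by four duplicates.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should occur."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # duplicate ACKs needed to retransmit
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_number):
        """Return the sequence number to retransmit, or None."""
        if ack_number == self.last_ack:
            self.dup_count += 1
            # Trigger exactly once, when the threshold is reached.
            if self.dup_count == self.threshold:
                return ack_number
            return None
        # A new acknowledgment number resets the duplicate count.
        self.last_ack = ack_number
        self.dup_count = 0
        return None


detector = DupAckDetector(threshold=3)
# One original ACK for 1000, then a duplicate for each later segment received.
decisions = [detector.on_ack(1000) for _ in range(5)]
print(decisions)
```

The fourth ACK in the stream is the third duplicate, so that is the point at which the sketch reports that the segment starting at sequence number 1000 should be retransmitted.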

For TCP/IP for the Windows Server 2003 family and Windows XP, the TcpMaxDupAcks registry value controls fast retransmit behavior.

TcpMaxDupAcks

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 1–3
Default value: 2
Present by default: No

TcpMaxDupAcks sets the number of duplicate ACKs (ACKs that are duplicates of an original ACK received) that must be received before fast retransmit is used to retransmit the missing segment. The default value of TcpMaxDupAcks is 2, rather than the value of 3 discussed in RFC 2581.
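
Because the value is not present by default, it has to be created before it can be changed. The following Python sketch shows one possible way to do so with the standard winreg module. The function names are hypothetical, the write requires administrative rights on a Windows computer, and the sketch makes no claim about when the TCP/IP stack reads the new value.

```python
import sys

TCPIP_PARAMETERS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def validate_tcp_max_dup_acks(value):
    """Return value if it is within the documented valid range (1 to 3)."""
    if not isinstance(value, int) or not 1 <= value <= 3:
        raise ValueError("TcpMaxDupAcks must be an integer from 1 to 3")
    return value


def set_tcp_max_dup_acks(value):
    """Create or overwrite the TcpMaxDupAcks REG_DWORD value (Windows only)."""
    value = validate_tcp_max_dup_acks(value)
    if sys.platform != "win32":
        raise OSError("the registry is available only on Windows")
    import winreg  # standard library module that exists only on Windows

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMETERS,
                            0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "TcpMaxDupAcks", 0, winreg.REG_DWORD, value)
```

A call such as set_tcp_max_dup_acks(3) would request the threshold discussed in RFC 2581, and an out-of-range argument is rejected before the registry is touched.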

Fast Recovery

Fast retransmit causes the sender to retransmit the missing TCP segment before its RTO expires. When an RTO expires, the slow start and congestion avoidance algorithms are used to gradually increase the actual send window up to the advertised receive window. With fast retransmit, the RTO did not expire, so congestion avoidance is performed, but not slow start. This behavior is known as fast recovery, described in RFC 2581. For more information on slow start and congestion avoidance, see Chapter 14, "Transmission Control Protocol (TCP) Data Flow."

Fast recovery assumes that the arrival of duplicate ACKs indicates that segments sent before the missing TCP segment have already been received, and are not adding to the internetwork congestion. Therefore, TCP can scale the congestion window faster than when using slow start.

The fast recovery algorithm is defined as follows:

After receipt of the third duplicate ACK, the value of the slow start threshold (ssthresh) is set to one half the value of the congestion window (cwind), with a minimum value of 2*MSS.
The missing segment is retransmitted and cwind is set to (ssthresh + 3*MSS). This increases cwind to a value that reflects the receipt of three TCP segments at the receiver (based on the receipt of three duplicate ACKs).
For each additional duplicate ACK, cwind is increased by MSS. Once again, cwind is being increased because of an additional segment that has arrived at the receiver.
If allowed by the values of cwind and the advertised receive window size, the next TCP segment(s) is transmitted.
When the ACK arrives that acknowledges the receipt of the missing segment and all other contiguous segments, cwind is set to the value of ssthresh. At this value of cwind, slow start is avoided and congestion avoidance is performed.
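
The window arithmetic in these steps can be sketched in Python as follows. The function names, the MSS of 1460 bytes, and the starting cwind of eight segments are assumptions chosen for illustration. All values are in bytes.

```python
def enter_fast_recovery(cwind, mss):
    """Third duplicate ACK: halve into ssthresh, then inflate cwind by 3*MSS."""
    ssthresh = max(cwind // 2, 2 * mss)  # minimum value of 2*MSS
    return ssthresh, ssthresh + 3 * mss


def on_additional_dup_ack(cwind, mss):
    """Each further duplicate ACK means one more segment reached the receiver."""
    return cwind + mss


def usable_window(cwind, advertised_window):
    """New segments may be sent only within both windows."""
    return min(cwind, advertised_window)


def exit_fast_recovery(ssthresh):
    """ACK for the missing segment: deflate cwind back to ssthresh."""
    return ssthresh


MSS = 1460
cwind = 8 * MSS                                     # 11680 bytes before the loss
ssthresh, cwind = enter_fast_recovery(cwind, MSS)   # after the third duplicate ACK
inflated = on_additional_dup_ack(cwind, MSS)        # after a fourth duplicate ACK
final = exit_fast_recovery(ssthresh)                # after the missing segment is ACKed
print(ssthresh, cwind, inflated, final)
```

In this hypothetical run, ssthresh becomes 5840, cwind is inflated to 10220 and then 11680 during recovery, and it drops to 5840 when the missing segment is acknowledged, at which point congestion avoidance continues from that value.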

Summary

To recover from lost TCP segments, TCP connections maintain an RTO for each segment. If the RTO expires, the segment is retransmitted and the RTO is doubled for the retransmitted segment. After a maximum number of retransmissions, the TCP connection is abandoned. The RTO is based on calculations from samples of the RTT, using either a single sample per window of data or TCP timestamps. When TCP segments are sent without timestamps, Karn's algorithm is used to update the RTO when an ACK for a retransmitted segment is received. Fast retransmit is used to resend a missing segment before its RTO expires, based on the receipt of multiple duplicate ACK segments. Fast recovery is used to increase the size of the actual send window more quickly when fast retransmit occurs.
