24.4 Protocol Mechanisms For Data Exchange

   


The following subsections introduce several protocol mechanisms of the TCP protocol. TCP uses a large number of algorithms, and a detailed description of all of them would go beyond the scope of this chapter, so we discuss selected parts of the TCP instance here.

To begin with, Section 24.4.1 discusses flow control by use of the sliding-window method. In addition to flow control, we look at the methods for window scaling and zero-window probing and at the PAWS mechanism, including the timestamp option. Subsequently, Section 24.4.2 discusses methods for detecting, handling, and avoiding congestion: the slow-start, congestion-avoidance, fast-retransmit, and fast-recovery methods. Finally, Section 24.4.3 covers methods for load avoidance, concentrating on the Nagle algorithm and the transmission of delayed acknowledgements.

The different timers used by a TCP instance and their management will be discussed in Section 24.5.

24.4.1 Flow Control

The TCP protocol uses flow control to regulate the data flow (i.e., the data volume exchanged between a sender and a receiver per unit of time).[1] Flow control limits the number of bytes sent in one communication direction to prevent buffer overflows in the receiving TCP instance and to meet the requirements of the service consumer. Reasons for the receiving TCP instance to limit the data flow include the following:

[1] This section considers flow control in only one direction of a TCP connection and treats the two TCP instances as a sender and a receiver. In bidirectional data exchange, the flow is controlled separately for each direction, where each instance assumes both the role of a sender and the role of a receiver.

  • The computing performance of the sending TCP instance can be higher than that of the receiving instance. This means that the sender creates segments faster than the receiving TCP instance can process them. In such a situation, the receive buffer at the receiver's end overflows, causing segments to be discarded.

  • An application removes data from the socket receive buffer only at specific intervals, which means that this buffer empties only occasionally. Examples include applications that play back multimedia content and receive it faster than the playback rate.

Consequently, flow control can be used to prevent the receive buffer of a receiving TCP instance from overflowing, which would cause additional incoming packets to be dropped. To implement flow control, the TCP protocol uses the sliding-window mechanism. This mechanism is based on the assignment of explicit transmit credits by the receiver [Pete00]. We will introduce it in the following section.

The Sliding-Window Mechanism

The sliding-window protocol mechanism is commonly used in transport protocols and other connection-oriented protocols, because it handles three important tasks:

  • The original order of a set of data that was segmented and sent in several packets can be restored at the receiver.

  • Missing or duplicate packets can be identified by the ordering of packets. Together with additional packet-retransmission methods, this makes it possible to guarantee reliable data transport.

  • The data flow between two TCP instances can be controlled by assigning transmit credits. A distinction is made between a fixed credit quantity (e.g., in HDLC) and explicit credit assignment (e.g., in TCP).

The following elements are added to the protocol header (using the TCP protocol as our example) to handle these tasks; a brief sketch of how the sender combines them follows the list:

  • All data is numbered consecutively by sequence numbers. The sequence number of the first payload byte in a data packet is carried in the packet header and denotes the sequence number of this segment. The tp->snd_nxt variable in the tcp_opt structure stores the sequence number of the packet to be sent next.

  • Together with each segment, a TCP instance informs its communication partner about the number of bytes it can still receive in the Window field of the packet header. This is an explicit transmit credit granted to the partner. When a segment is sent, this value is specified by the tcp_select_window() function (described later).

  • Together with each segment, the other TCP instance is informed about the sequence number up to which data has been received correctly (cumulative acknowledgement). More specifically, this value is higher by one, because it specifies the sequence number of the data expected next, so that all data up to this sequence number is implicitly acknowledged (ACK).
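Taken together, these header elements allow the sender to decide at any time whether a further segment may be sent. The following minimal sketch (illustration only, not kernel code; the field names merely follow the tcp_opt naming convention) shows this send test:

 /* Minimal sketch of the sliding-window send test (illustration only,
  * not kernel code). */
 #include <stdio.h>

 struct sw_state {
         unsigned int snd_una;   /* first unacknowledged byte */
         unsigned int snd_nxt;   /* next byte to be sent      */
         unsigned int snd_wnd;   /* transmit credit (window)  */
 };

 /* A segment of len bytes may be sent only if it still fits into
  * the window [snd_una, snd_una + snd_wnd). */
 static int may_send(const struct sw_state *s, unsigned int len)
 {
         return s->snd_nxt + len <= s->snd_una + s->snd_wnd;
 }

 int main(void)
 {
         struct sw_state s = { .snd_una = 1, .snd_nxt = 2049,
                               .snd_wnd = 4096 };

         /* 2048 bytes are in flight; another 2048 bytes still fit: */
         printf("send 2048: %d\n", may_send(&s, 2048));   /* 1 */
         /* ...but 4096 bytes would exceed the credit: */
         printf("send 4096: %d\n", may_send(&s, 4096));   /* 0 */
         return 0;
 }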

Example: Figure 24-10 shows how the sliding-window mechanism is used to implement flow control [Stev94a]. For simplicity, this example leaves the slow-start algorithm out of consideration. The figure shows a scenario where a fast sender ships 8192 bytes to a slow receiver.

Figure 24-10. A fast sender transmits 8192 bytes to a slow receiver.

graphics/24fig10.gif


In this example, segments 1 through 3 belong to the connection-establishment phase. In this phase, the receiver grants the sender an initial transmit credit of 4096 bytes in segment 2 (in the win field). With segments 4 through 7, instance A sends 4096 bytes, exhausting its transmit credit. It may not send any more segments and has to wait for an acknowledgement from instance B that grants it a new transmit credit. Segment 8 acknowledges all 4096 bytes sent.

However, because it has not yet been able to empty its buffer, the slow receiver grants no new transmit credit to the sender (passing a value of 0 in the win field). Later, an acknowledgement arrives at TCP instance A with segment 9, granting it a new credit of 4096 bytes. The source instance can now send its remaining data (segments 10 through 13), arriving again at the situation described above, which results in the transmission of segments 14 and 15. Segments 16 and 17 belong to the connection-teardown phase.

To keep control over all bytes with regard to error handling and flow control, the sender has to take a certain view of the data. The sending TCP instance's view of the sliding-window mechanism is shown in Figure 24-11; the receiver's view is shown in Figure 24-12.

Figure 24-11. Byte-sequence range from the sender's view.

graphics/24fig11.gif


Figure 24-12. Byte-sequence range from the receiver's view.

graphics/24fig12.gif


Notice that the byte-sequence range between the segment received last and the right-hand window margin corresponds to the remaining free buffer space. When the receiving TCP instance receives a segment that moves the byte-sequence number of the last received data byte to the right-hand margin of the receive window, its buffer is full. All segments arriving after that cause a buffer overflow, until the right-hand window margin moves further to the right and buffer capacity becomes available again.

Specifying the Transmit Credit

The flow-control mechanism takes effect in many places within a TCP instance. A detailed description of all of these situations would go beyond the scope of this chapter, especially because Section 24.2 discussed several flow-control aspects. This section explains how the explicit transmit credit for a partner instance is specified in each TCP segment sent.

tcp_select_window()

net/ipv4/tcp_output.c


 static __inline__ u16 tcp_select_window(struct sock *sk)
 {
         struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
         u32 cur_win = tcp_receive_window(tp);
         u32 new_win = __tcp_select_window(sk);

         /* Never shrink the offered window */
         if (new_win < cur_win) {
                 new_win = cur_win;
         }
         tp->rcv_wnd = new_win;
         tp->rcv_wup = tp->rcv_nxt;

         /* RFC1323 scaling applied */
         new_win >>= tp->rcv_wscale;

         (...)

         return new_win;
 }

tcp_select_window(sk) is invoked in the tcp_transmit_skb() method when a TCP segment is sent (except for SYN and SYN-ACK segments) to specify the size of the transmit credit (i.e., the advertised window). First, tcp_receive_window() determines the credit that is currently still open. Subsequently, the __tcp_select_window() function checks how much buffer space is available in the computer; this forms the basis for the new transmit credit offered to the partner instance. However, as specified in RFC 793, care is taken never to reduce a previously granted credit.

Once the new advertised window has been computed, the credit is stored in the tcp_opt structure of the connection (tp->rcv_wnd). In addition, tp->rcv_wup is updated; it records the value of the tp->rcv_nxt variable at the time the new credit is computed and sent, known as a window update. The reason is that each segment arriving after the advertised window has been sent has to be charged against the credit granted, as happens in the tcp_receive_window() function.

In addition, this method applies another TCP algorithm, known as window scaling, which will be introduced in the next section. It was introduced to make it possible to work meaningfully with a 16-bit number for the transmit and receive windows. The scaling factor specifies the number of bits by which the receiver of the value shifts the window size to the left; with the value F in tp->rcv_wscale, this corresponds to increasing the advertised window by a factor of 2^F.
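The effect can be illustrated with a few lines of plain C (an illustration, not kernel code): the true window is shifted right by F bits before it travels in the 16-bit Window field, and the peer restores it by shifting left:

 #include <stdio.h>

 /* Illustration of RFC 1323 window scaling (not kernel code). */
 int main(void)
 {
         unsigned int real_window = 131072;  /* 128-Kbyte receive window */
         unsigned int wscale = 2;            /* negotiated shift count F */

         /* Only the scaled value travels in the 16-bit Window field: */
         unsigned short window_field = real_window >> wscale;  /* 32768 */

         /* The peer reconstructs the window in units of 2^F bytes: */
         unsigned int advertised = (unsigned int) window_field << wscale;

         printf("on the wire: %u, reconstructed: %u\n",
                window_field, advertised);   /* 32768, 131072 */
         return 0;
 }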

tcp_receive_window()

include/net/tcp.h


 /* Compute the actual receive window we are currently
  * advertising. Rcv_nxt can be after the window if the sending
  * peer pushes more data than the offered window.
  */
 static __inline__ u32 tcp_receive_window(struct tcp_opt *tp)
 {
         s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;

         if (win < 0)
                 win = 0;
         return (u32) win;
 }

This method calculates how much is left of the transmit credit granted in the last segment, taking the quantity of data received since then into account. Slightly rewriting the equation, we can clearly identify the objective of this computation:

 s32 win = tp->rcv_wnd - (tp->rcv_nxt - tp->rcv_wup); 

The difference in parentheses is the data quantity received since the last window update was sent: the sequence number expected at the beginning of the next packet, minus the sequence number that was expected next when the credit was sent, which is stored in the tp->rcv_wup variable. The resulting value, win, is therefore the data quantity that can still be received against the credit granted last, corresponding to the actual size of the receive window.
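A short worked example with invented values: if a credit of 4096 bytes was granted when rcv_nxt was 8193, and 1024 bytes have arrived since then, 3072 bytes of the credit remain:

 #include <stdio.h>

 /* Worked example for the tcp_receive_window() computation
  * (sample values invented for illustration). */
 int main(void)
 {
         unsigned int rcv_wup = 8193;  /* rcv_nxt when the credit was sent */
         unsigned int rcv_wnd = 4096;  /* credit granted with that segment */
         unsigned int rcv_nxt = 9217;  /* 1024 bytes have arrived since    */

         int win = rcv_wup + rcv_wnd - rcv_nxt;   /* 8193 + 4096 - 9217 */
         if (win < 0)
                 win = 0;              /* peer overran the offered window  */

         printf("remaining credit: %d bytes\n", win);   /* 3072 */
         return 0;
 }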

__tcp_select_window()

net/ipv4/tcp_output.c


 u32 __tcp_select_window(struct sock *sk)
 {
         // free_space is being computed
         free_space = tcp_space(sk);

         (...)

         if (free_space < tp->ack.rcv_mss)
                 return 0;

         window = tp->rcv_wnd;
         if ((((int) window) <= (free_space - ((int) mss))) ||
             (((int) window) > free_space))
                 window = (((unsigned int) free_space) / mss) * mss;

         return window;
 }

The __tcp_select_window() method checks how much memory is available for the receive buffer of this connection and what size can be selected for the receive window.

In the first step, tcp_space() determines how much buffer memory is available. After a few adaptations of the determined buffer space, the function checks whether there is enough room for a TCP segment of maximum size (tp->ack.rcv_mss). If there is not, granting the remaining few bytes as credit could cause the Silly Window Syndrome (SWS) [Stev94a]. To avoid SWS, no credit smaller than the negotiated maximum segment size is granted in such a case.

If the available buffer space is larger than the maximum segment size, then the computation of the new receive window continues. The window variable is initially set to the value of the credit granted last (tp->rcv_wnd). If this old credit is either larger than the available buffer space or smaller than the buffer space by at least one maximum segment size, window is set to the largest multiple of the maximum segment size (MSS) that fits into the free buffer space (free_space). The value of window calculated in this way is returned as a recommendation for the new receive credit.
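For illustration (invented values): with free_space = 3700 bytes and an MSS of 1460 bytes, the recommended window is rounded down to 2 · 1460 = 2920 bytes:

 #include <stdio.h>

 /* Rounding the free buffer space down to a multiple of the MSS, as at
  * the end of __tcp_select_window() (sample values invented). */
 int main(void)
 {
         unsigned int free_space = 3700;  /* bytes of free receive buffer */
         unsigned int mss = 1460;         /* negotiated maximum segment   */

         unsigned int window = (free_space / mss) * mss;

         printf("recommended window: %u bytes\n", window);   /* 2920 */
         return 0;
 }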

The Window-Scaling Option

This option can be used to increase the value range for the flow-control transmit window from 2^16 to 2^16 · 2^F bytes, where F is the exponent specified by this option. The window-based flow-control method of the TCP protocol uses the 16-bit Window field in the TCP header. This field can grant transmit credits of up to 65,535 bytes (without window scaling, the basic unit of the Window field is 1 byte). However, connections with a large path capacity (bandwidth times packet round-trip time) require a larger transmit credit to be able to send data continuously. For example, a connection with a transmission rate of 10 Mbps and a round-trip time of 100 ms requires a transmit window of at least 10 Mbps · 100 ms = 125,000 bytes for a constant data flow.

The window-scaling TCP option was introduced in [JaBB92] to solve this problem. Window scaling enlarges the effective TCP window from 16 to at most 30 bits. To ensure backward compatibility, the size of the Window field remains 16 bits; instead, the option changes the basic unit of the Window field to 2^F bytes. Figure 24-13 shows the packet-header format of the window-scaling option used by the TCP instances to negotiate the parameter F [Stev94a]. The Shift Count[2] field contains the scaling factor for the receive window. If the value F of the Shift Count field is not zero, then the basic unit used to compute the receive window is 2^F bytes rather than 1 byte.

[2] Field name as used in [JaBB92].

Figure 24-13. Format of the window-scaling TCP option.

graphics/24fig13.gif


The window-scaling option can be sent only in a SYN or SYN-ACK segment during the connection-establishment phase (see Section 24.3):

  • In a SYN segment, it assumes two tasks [JaBB92]: It indicates that the TCP protocol instance can use window scaling both when sending and when receiving, and it announces the scaling factor for the receive window of this TCP instance.

  • In a SYN-ACK segment, this option may be used only if it was specified in the SYN segment. To enable the use of window scaling, both TCP instances have to have the window-scaling option set in their SYN segments during the connection-establishment phase. This ensures that window scaling is used only provided that both instances are actually able to do so.

The maximum scaling factor is limited to 2^14, and the maximum window size is thereby limited to 2^16 · 2^14 = 2^30 < 2^31, to prevent byte-sequence-number overflow. The negotiated window-scaling value is stored in the tp->rcv_wscale variable.
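A small sketch (not kernel code) shows how the smallest sufficient shift count could be determined for a desired window size, here for the 125,000-byte example from the beginning of this section:

 #include <stdio.h>

 /* Determine the smallest shift count F so that the desired window
  * fits into the 16-bit Window field (illustration only; the limit
  * of 14 corresponds to the maximum scaling factor of 2^14). */
 int main(void)
 {
         unsigned int desired = 125000;  /* window for 10 Mbps * 100 ms */
         unsigned int wscale = 0;

         while ((desired >> wscale) > 65535 && wscale < 14)
                 wscale++;

         printf("shift count F = %u, unit = %u bytes\n",
                wscale, 1u << wscale);   /* F = 1, unit = 2 bytes */
         return 0;
 }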

Zero-Window Probing

No minimum size is guaranteed for the advertised window, so it can happen that the sender has fully used up its transmit credit while the receiver has no new buffer space available. The consequence is that the sender receives an acknowledgement with a new advertised window of size zero, which means that it may not send any more data.

The problem now is that the sender can receive a new window size (i.e., a new transmit credit) only with a new packet, and no additional acknowledgement will arrive, because the sender cannot send more data. The sender therefore depends on the other end's sending data. If the receiver has no data to send, or if it was itself granted a window of size zero, then the two communication partners would wait forever to be able to continue sending data.

To solve this problem, the Transmission Control Protocol includes a mechanism known as zero-window probing. This mechanism is based on an additional timer, the probe timer. When this timer expires, a TCP packet is sent even though the advertised window has size zero. This packet consists only of a packet header; it carries no payload, because data must not be transmitted in this situation.

The communication peer acknowledges this packet, and, if the other end has been able to free at least some buffer space in the meantime, then a window larger than zero is granted with this acknowledgement, so that the transmission of data can resume. If the window size is again zero, then the communication peers have to wait until the probe timer expires and another probe packet is sent.

tcp_probe_timer()

net/ipv4/tcp_timer.c


 static void tcp_probe_timer(struct sock *sk)
 {
         (...)

         if (tp->probes_out > max_probes) {
                 tcp_write_err(sk);
         } else {
                 /* Only send another probe if we didn't close things up. */
                 tcp_send_probe0(sk);
         }
 }

tcp_probe_timer(sk) is the handling routine invoked when the zero-window probe timer expires. The routine first checks whether the peer has failed to offer a usable transmit window over a lengthy period: if several (more than max_probes[3]) window probes were sent without success, then an error is signalled to announce that there are serious problems in this TCP connection (tcp_write_err()), and the connection is closed by tcp_done().

[3] The value for max_probes is initialized to 15 and can be set in the proc file /proc/sys/net/ipv4/tcp_retries2.

If the maximum number of window probes has not yet been reached, tcp_send_probe0() sends a TCP segment without payload, and the acknowledgement segment will then, it is hoped, grant a credit (advertised window) larger than zero.

tcp_send_probe0()

net/ipv4/tcp_output.c


 void tcp_send_probe0(struct sock *sk)
 {
         struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
         int err;

         err = tcp_write_wakeup(sk);

         if (tp->packets_out || !tp->send_head) {
                 /* Cancel probe timer, if it is not required. */
                 tp->probes_out = 0;
                 tp->backoff = 0;
                 return;
         }

         if (err <= 0) {
                 tp->backoff++;
                 tp->probes_out++;
                 tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0,
                                      min(tp->rto << tp->backoff, TCP_RTO_MAX));
         } else {
                 (...)
         }
 }

tcp_send_probe0(sk) uses tcp_write_wakeup(sk) to generate and send a zero-window probe packet. If the probe timer is no longer needed, either because there is currently nothing to send (tp->send_head == NULL) or because packets are still under way (tp->packets_out) whose ACKs could contain new credits, then the timer is not restarted, and the probes_out and backoff parameters are reset.

Otherwise, these parameters are incremented after the probe packet has been sent, and the zero-window probe timer is restarted, so that it expires again after some time. This time is tp->rto · 2^(tp->backoff), up to a maximum of TCP_RTO_MAX (120 seconds).
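For example, assuming a purely illustrative RTO of 1500 milliseconds, the probe intervals grow as follows:

 #include <stdio.h>

 #define TCP_RTO_MAX 120000   /* 120 seconds, expressed in milliseconds */

 /* Successive zero-window probe timeouts: rto * 2^backoff, capped at
  * TCP_RTO_MAX (illustration with an invented RTO of 1500 ms). */
 int main(void)
 {
         unsigned int rto = 1500, backoff;

         for (backoff = 0; backoff < 8; backoff++)
                 printf("probe %u after %u ms\n", backoff + 1,
                        (rto << backoff) < TCP_RTO_MAX ?
                        (rto << backoff) : TCP_RTO_MAX);
         return 0;
 }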

tcp_write_wakeup()

net/ipv4/tcp_output.c


 int tcp_write_wakeup(struct sock *sk)
 {
         (...)
         if ((skb = tp->send_head) != NULL &&
             before(TCP_SKB_CB(skb)->seq, tp->snd_una + tp->snd_wnd)) {
                 (...)
                 /* We are probing the opening of a window
                  * but the window size is != 0
                  * must have been a result of SWS avoidance (sender)
                  */
                 (...)
                 err = tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC));
                 (...)
         } else {
                 return tcp_xmit_probe_skb(sk);
         }
         (...)
 }

tcp_write_wakeup(sk) checks whether there is a segment waiting to be sent and whether its beginning still lies within the transmit window (snd_una + snd_wnd). In the zero-window situation discussed so far, this condition fails, so the tcp_xmit_probe_skb() method is invoked in the else branch; it generates the desired zero-window probe packet.

However, if the condition is met and there is at least one packet in the transmit queue that lies within the transmit window, then we probably have a silly-window-syndrome situation. A packet that corresponds to the requirements of the current transmit window is generated and sent by tcp_transmit_skb().

tcp_xmit_probe_skb()

net/ipv4/tcp_output.c


 /* This routine sends a packet with an out of date sequence
  * number. It assumes the other end will try to ack it.
  */
 static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
 {
         (...)
         skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
         (...)

         /* Reserve space for headers and set control bits. */
         skb_reserve(skb, MAX_TCP_HEADER);
         skb->csum = 0;
         TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
         TCP_SKB_CB(skb)->sacked = urgent;

         /* Use a previous sequence. This should cause the other
          * end to send an ack. Don't queue or clone SKB, just send it.
          */
         TCP_SKB_CB(skb)->seq = urgent ? tp->snd_una : tp->snd_una - 1;
         TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq;
         TCP_SKB_CB(skb)->when = tcp_time_stamp;

         return tcp_transmit_skb(sk, skb);
 }

tcp_xmit_probe_skb(sk, urgent) creates a TCP segment without payload, because the transmit window has size zero, and data currently must not be sent. alloc_skb() procures a socket buffer of length MAX_TCP_HEADER. As mentioned above, no payload is sent; the packet consists only of the TCP packet header.

The trick of this routine is that an old (i.e., previously acknowledged) sequence number is written to the packet. tp->snd_una holds the first as-yet-unacknowledged sequence number, so tp->snd_una - 1 serves this purpose. The old sequence number causes the counterpart to send an acknowledgement; with this ACK, it also sends its current window size, which is now, it is hoped, larger than zero.

In the next line, the last sequence number is set to the first sequence number of this packet, which results in the payload size end_seq - seq = 0. Subsequently, a timestamp is added to the segment, and the segment is sent by tcp_transmit_skb().

tcp_ack_probe()

net/ipv4/tcp_input.c


 static void tcp_ack_probe(struct sock *sk)
 {
         struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);

         /* Was it a usable window open? */
         if (!after(TCP_SKB_CB(tp->send_head)->end_seq,
                    tp->snd_una + tp->snd_wnd)) {
                 tp->backoff = 0;
                 tcp_clear_xmit_timer(sk, TCP_TIME_PROBE0);
                 /* Socket must be waked up by subsequent
                  * tcp_data_snd_check(). This function is
                  * not for random using!
                  */
         } else {
                 tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0,
                                      min(tp->rto << tp->backoff, TCP_RTO_MAX));
         }
 }

tcp_ack_probe(sk) is invoked from tcp_ack() when a TCP segment with the ACK flag set is received and is suspected to be a reply to a zero-window probe segment. Depending on whether the acknowledgement opens the receive window (i.e., whether the first packet in the transmit queue now fits into the transmit window), tcp_clear_xmit_timer() stops the zero-window probe timer, or tcp_reset_xmit_timer() restarts it.

Protection Against Wrapped Sequence Numbers (PAWS)

The PAWS mechanism prevents a byte-sequence-number overflow, which would lead to inconsistencies, from occurring in connections with a very large bandwidth-delay product (i.e., when there is a very large amount of data "in the pipe").

The byte-sequence numbers of a TCP connection are 32 bits long. With a sufficiently high transmission rate and an accordingly large packet round-trip time, the byte-sequence numbers can run through the entire byte-sequence-number range within the time that a segment can be delayed in router queues, thereby causing a byte-sequence-number overflow: the byte-sequence numbers wrap around to the beginning of the range. In this situation, the TCP protocol cannot identify delayed segment duplicates whose byte-sequence numbers fall within the receive window, and consequently cannot drop these duplicates.

The PAWS mechanism was introduced to solve this problem. It uses the Timestamp option to protect the TCP protocol within a connection from problems caused by several segments' having the same byte-sequence numbers. The PAWS mechanism operates basically by attaching timestamps to TCP segments and dropping a segment as a duplicate if its timestamp is smaller than the timestamp of the segment received last that caused the receive window to move forward. (See Section 24.4.1.)

All segments arriving over an established connection are subject to the following checks by the PAWS algorithm [JaBB92]:

  • The RST flag should not be set.

  • The TCP partner instance should have received a valid Timestamp option (i.e., the tp->ts_recent variable contains a valid value).

  • If an incoming segment contains the Timestamp option, and if its timestamp, tp->rcv_tsval, is smaller than the timestamp stored last, tp->ts_recent, then this segment should be dropped. This check is done in the tcp_paws_discard() function shown below, which is invoked in the slow path in the tcp_rcv_established() function. (See Section 24.2.1.)

 extern __inline__ int tcp_paws_discard(struct tcp_opt *tp,
                                        struct sk_buff *skb)
 {
         return ((s32)(tp->ts_recent - tp->rcv_tsval) > TCP_PAWS_WINDOW &&
                 xtime.tv_sec < tp->ts_recent_stamp + TCP_PAWS_24DAYS &&
                 !tcp_disordered_ack(tp, skb));
 }

The second condition checks the validity of the tp->ts_recent timestamp. The tp->ts_recent_stamp variable stores the time at which the timestamp was taken from a received segment. There is a very small probability that an extremely long time has passed since then; if more than 24 days have passed, the ts_recent timestamp is considered invalid. [Stev94b] includes an example of such a situation.

If the TCP connection is not in the ESTABLISHED state, then the PAWS check on incoming packets is done in the tcp_paws_check() method (net/ipv4/tcp_input.c).

The PAWS mechanism requires the use of the timestamp option and requires timestamps to be monotonically increasing. In addition, it requires that the timestamp increase at least once per window and that the time within which the same timestamp value can recur be greater than the maximum segment lifetime.

The timestamp option is a prerequisite for the PAWS algorithm and additionally is used to determine the packet round-trip time and the retransmission timeout; we will introduce it next.

The Timestamp Option

This option offers TCP instances a way to continually probe and adapt the packet round-trip time (RTT) of a connection. The timestamp option allows you to set a timestamp in each segment. The receiver returns this timestamp in the acknowledgement segment, allowing the sender to compute the round-trip time for each acknowledgement received. A timestamp is a monotonically increasing value taken from a timer. Figure 24-14 shows the format of the timestamp option in the TCP packet header [JaBB92].

Figure 24-14. Format of the timestamp option in the TCP packet header.

graphics/24fig14.gif


The timestamp option was introduced to make a more exact measurement of the packet round-trip time possible. When sending a segment, the sender sets the current timestamp in the TSval field. The receiver returns this timestamp in the TSecr field of the ACK segment, together with its own timestamp. Because the receiver echoes the timestamp it received without interpreting its value, the timestamp unit is not relevant for the receiver; in particular, no timer synchronization between the communicating TCP instances is required [Stev94a]. The sender can therefore compute the packet round-trip time for each acknowledgement it receives; in the Linux kernel, this calculation is done in the tcp_rtt_estimator() method of the TCP instance.
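A minimal sketch of the resulting RTT sample (values and millisecond tick units are invented for illustration; the kernel's tcp_rtt_estimator() additionally smooths such samples):

 #include <stdio.h>

 /* Sketch of an RTT sample taken from the timestamp option
  * (illustration only, not kernel code). */
 int main(void)
 {
         unsigned int now   = 104711;  /* local clock when the ACK arrives */
         unsigned int tsecr = 104533;  /* our TSval echoed by the peer     */

         unsigned int rtt_sample = now - tsecr;   /* 178 ms */

         /* Such samples feed the smoothed RTT estimate and thus the
          * retransmission timeout (RTO). */
         printf("RTT sample: %u ms\n", rtt_sample);
         return 0;
 }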

Notice that only a single timestamp variable per connection is used, to minimize the number of states the communicating TCP instances have to maintain. The value of this variable is updated by the following algorithm:

  • TCP holds the value for the timestamp to be sent in the next acknowledgement in the tp->ts_recent parameter.

  • When an incoming TCP segment with the timestamp option set is handled in the tcp_rcv_established() method, the routine tcp_replace_ts_recent() checks whether the timestamp tp->rcv_tsval contained in the segment should be accepted into the tp->ts_recent variable. If so, this is done by the tcp_store_ts_recent() method. In the fast path, acceptance of the timestamp is checked directly.

  • Whenever a TCP segment with the timestamp option is to be sent, the value of the timestamp received last is copied from the tp->ts_recent variable into the TSecr field when the packet header is assembled in tcp_transmit_skb().

The above algorithm leads to the following behavior when segments are delayed or lost (a simplified sketch of the underlying update rule follows the list):

  • When acknowledgements are delayed by the receiver, the timestamp in the acknowledgement refers to the first of the segments being acknowledged.

  • When an incoming segment belongs to the current window but arrives out of order (which implies that an earlier segment was lost), the timestamp of the out-of-order segment is not stored; instead, the timestamp of the missing earlier segment is echoed as soon as that segment arrives.
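Both behaviors follow from a single update rule in [JaBB92]: the stored timestamp is replaced only by a segment that covers the byte acknowledged last. The following is a hedged sketch of that rule (simplified; the wrap-around-safe comparisons of the kernel, such as the after() macro, are omitted, and the helper name is chosen for illustration):

 /* Simplified ts_recent update rule from RFC 1323 (illustration only).
  * last_ack_sent is the acknowledgement value sent last (tp->rcv_wup
  * in the Linux implementation). */
 static void update_ts_recent(unsigned int seg_seq, unsigned int seg_tsval,
                              unsigned int last_ack_sent,
                              unsigned int *ts_recent)
 {
         /* Out-of-order segments start behind last_ack_sent and thus
          * leave ts_recent untouched; the timestamp of the missing
          * earlier segment is stored (and echoed) once it arrives. */
         if (seg_seq <= last_ack_sent && seg_tsval >= *ts_recent)
                 *ts_recent = seg_tsval;
 }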

Figure 24-15 shows an example of the use of the timestamp option [JaBB92]. The timestamp option is negotiated by the same principle as the window-scaling option during the connection-establishment phase. A TCP instance can set the timestamp option in the initial SYN segment (i.e., in a segment with the SYN bit set and the ACK bit not set), and it may set this option in other segments only if it received this option in the initial SYN segment for this connection.

Figure 24-15. Example using the timestamp option.

graphics/24fig15.gif


24.4.2 Detecting, Avoiding, and Handling Congestion

The flow-control mechanisms introduced in Section 24.4.1 ensure that no more data is sent to a receiving TCP instance than that instance can accommodate, to avoid packet losses in the receiving end system. However, buffer overflows can occur in forwarding systems (more specifically, in the queues of the Internet Protocol) when these queues are not emptied fast enough and more data arrives than can be sent over a network adapter. This situation is called congestion.

Detecting Congestion

Congestion occurs mainly in the IP instances of forwarding systems. The question now is how the TCP instances in the end systems can detect such a loss, for it is the TCP instances that suffer from congestion and have to retransmit their data. In addition, they have to reduce their transmission rate, initially to resolve the congestion and then to avoid further congestion. The Internet Protocol itself, unfortunately, gives no feedback about congestion.

This means that recognizing congestion is a basic prerequisite for using the congestion-handling and congestion-prevention methods. The TCP protocol has several mechanisms to detect congestion and uses various algorithms to respond to it:

  • A retransmission timer waits for a certain period of time for an acknowledgement once a packet has been sent. If no acknowledgement arrives, it is assumed that a congestion caused the packet to be lost. The initial response to a lost packet is the slow-start phase, which, from a certain point on, is replaced by the congestion-avoidance algorithm. Both methods will be introduced in the following section.

  • The receipt of duplicate acknowledgements (dupacks) indicates that a data segment was lost: although subsequent segments arrive, the cumulative ACK scheme cannot acknowledge them. In this case, it is normally not assumed that a serious congestion has occurred, because subsequent segments were actually received. For this reason, the more recent TCP versions (TCP Reno, TCP NewReno) do not respond to such a loss with the slow-start phase, but instead with the fast-retransmit and fast-recovery methods, which will be introduced further below.

Slow-Start and Congestion Avoidance

The slow-start algorithm is used once a connection has been initialized, and again whenever the retransmission timer has expired. It serves for stepwise approximation of the transmitted data volume to the transmission capacity available on this connection. The slow-start algorithm is normally implemented together with the congestion-avoidance algorithm.

At the beginning of a connection, or after a congestion, minimal transmission capacity is assumed; subsequently, the transmit volume is increased exponentially until segments are lost or until a threshold value is reached that represents an approximate measure of the available capacity. The following parameters are defined:

  • The congestion window (snd_cwnd) denotes the amount of data that may be under way (unacknowledged) at a certain point in time. This means that, in addition to the normal transmit window of the sliding-window mechanism, it defines a second credit, which also has to be available and sufficient before data can be sent.

    The congestion window is initialized to one segment at the beginning of the slow-start phase (i.e., at first, only a single segment may be sent). Subsequently, the congestion window is increased by one segment for each acknowledged segment. This corresponds to doubling the congestion window whenever the data volume of the entire window has been sent and acknowledged.

  • The slow-start threshold (ssthresh) limits the exponential growth of the slow-start phase once the available transmission capacity is approximately known. At the beginning of a connection, the threshold is set to the maximum value, so that the slow-start phase can probe the available capacity. If a packet loss occurs, the threshold is set to half the size of the current congestion window, to limit the exponential growth of the next slow-start phase.

The congestion-avoidance algorithm starts once the threshold value for the congestion window has been reached in the slow-start phase. The threshold represents an approximate measure of the available transmission capacity; for this reason, the congestion window grows only linearly during the congestion-avoidance phase, as shown in Figure 24-16: it increases by one once n acknowledgements have arrived, where n corresponds to the size of the current congestion window. This means that the TCP instance increases its transmission rate only slowly, but still in a strictly monotonic way. Naturally, the total data volume that may be sent corresponds to the minimum of the transmit window of the sliding-window algorithm and the congestion window.

Figure 24-16. Operation of the slow-start and congestion avoidance mechanisms.

graphics/24fig16.gif


Figure 24-16 also shows how the two mechanisms cooperate. The slow-start algorithm operates at the beginning of a data transmission. Its exponential growth ends as soon as a congestion occurs, detected here by the expiry of the retransmission timer. According to the approach described above, the slow-start threshold, snd_ssthresh, is set to half of the current congestion window (i.e., 20 segments in the figure). Consequently, snd_cwnd takes the value 1 (2, in the Linux kernel), because the congestion was identified by the expiry of the retransmission timer. Next, the TCP instance returns to the slow-start phase, which continues until snd_cwnd reaches the value of snd_ssthresh. Subsequently, the congestion-avoidance algorithm is used, until another congestion occurs at a window size of 28 segments; this congestion is handled by the same approach.
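The overall behavior of Figure 24-16 can be reproduced with a few lines of C. The following is a simplified, segment-granular model for illustration only (not kernel code); one loop iteration corresponds to one round-trip time:

 #include <stdio.h>

 /* Simplified model of slow start and congestion avoidance in segment
  * units (illustration of Figure 24-16, not kernel code). */
 int main(void)
 {
         unsigned int cwnd = 1, ssthresh = 20, rtt;

         for (rtt = 1; rtt <= 12; rtt++) {
                 printf("RTT %2u: cwnd = %2u segments (%s)\n", rtt, cwnd,
                        cwnd < ssthresh ? "slow start"
                                        : "congestion avoidance");
                 if (cwnd < ssthresh) {
                         cwnd *= 2;               /* exponential growth  */
                         if (cwnd > ssthresh)
                                 cwnd = ssthresh; /* stop at threshold   */
                 } else {
                         cwnd += 1;               /* linear growth       */
                 }
         }
         return 0;
 }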

The following example shows how the two mechanisms described above are implemented in the TCP instance of the Linux kernel:

tcp_v4_init_sock()

net/ipv4/tcp_ipv4.c


 static int tcp_v4_init_sock(struct sock *sk)
 {
         struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);

         (...)

         /* So many TCP implementations out there (incorrectly) count the
          * initial SYN frame in their delayed-ACK and congestion control
          * algorithms that we must have the following bandaid to talk
          * efficiently to them. -DaveM
          */
         tp->snd_cwnd = 2;

         /* See draft-stevens-tcpca-spec-01 for discussion of the
          * initialization of these values.
          */
         tp->snd_ssthresh = 0x7fffffff;   /* Infinity */
         tp->snd_cwnd_clamp = ~0;
         tp->mss_cache = 536;

         (...)
 }

Parameters for a TCP connection are initialized in the tcp_v4_init_sock() method, which is defined as an init() function of the proto structure. The most important parameters for congestion control are as follows:

  • The size of the initial congestion window (tp->snd_cwnd) is set to the value two.

  • The threshold value for the exponential growth (tp->snd_ssthresh) is set to "infinite" (the largest value that can be represented by an int variable).

The reason why the congestion-window size is initialized to two, rather than to one as the algorithm prescribes, is explained by the comment in the source code (as in many other places in the source text):

  • It is an efficient way to work around errors in the TCP implementations of other operating systems and thus to ensure smooth operation: some TCP instances incorrectly count the SYN packet, even though the slow-start algorithm (like all congestion-control and flow-control algorithms) should not become active before the connection has been successfully established.

The following source-code fragment of the TCP instance shows how the congestion window grows exponentially during the slow-start phase and how the subsequent linear increase during the congestion-avoidance phase is implemented in the Linux kernel.

tcp_cong_avoid()

net/ipv4/tcp_input.c


 /* This is Jacobson's slow start and congestion avoidance.
  * SIGCOMM '88, p. 328.
  */
 static __inline__ void tcp_cong_avoid(struct tcp_opt *tp)
 {
         if (tp->snd_cwnd <= tp->snd_ssthresh) {
                 /* In 'safe' area, increase. */
                 if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                         tp->snd_cwnd++;
         } else {
                 /* In dangerous area, increase slowly.
                  * In theory this is tp->snd_cwnd += 1 / tp->snd_cwnd
                  */
                 if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
                         if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                                 tp->snd_cwnd++;
                         tp->snd_cwnd_cnt = 0;
                 } else
                         tp->snd_cwnd_cnt++;
         }
 }

tcp_cong_avoid(tp) implements the congestion-window growth of the slow-start and congestion-avoidance algorithms. It is invoked when an incoming TCP segment with a valid acknowledgement (ACK) is handled in tcp_ack(). (See Section 24.2.1.)

Initially, the code checks whether the TCP connection is still in the slow-start phase or already in the congestion-avoidance phase:

  • In the slow-start phase (i.e., while the current value of the congestion window tp->snd_cwnd has not yet exceeded the threshold tp->snd_ssthresh), the congestion window is increased by one. However, it must not exceed the upper limit, tp->snd_cwnd_clamp. In this phase, the congestion window thus increases by one upon each incoming acknowledgement, which in practice means that the amount of data that can be sent doubles per round trip. This behavior corresponds to the exponential growth shown in Figure 24-16.

  • In the congestion-avoidance phase, the congestion window will be increased by one only if n acknowledgements have been previously received, where n corresponds to the current congestion-window value.

    To implement this behavior, the additional variable tp->snd_cwnd_cnt is used; it is incremented by one upon each incoming acknowledgement (i.e., upon each call of tcp_cong_avoid() in the congestion-avoidance phase). When it reaches the congestion-window value tp->snd_cwnd, tp->snd_cwnd is increased by one, and tp->snd_cwnd_cnt is reset. This produces linear growth.

In summary: the congestion window initially increases exponentially; once the threshold value has been reached, it grows only linearly, as shown in Figure 24-16. Note that tcp_cong_avoid() only ever increases the congestion window. It grows until the criteria for a congestion are met, whereupon it is reduced in the function described next.

tcp_enter_loss()

net/ipv4/tcp_input.c


 void tcp_enter_loss(struct sock *sk, int how)
 {
         struct tcp_opt *tp = &sk->tp_pinfo.af_tcp;

         (...)

         /* Reduce ssthresh if it has not yet been
            made inside this window. */
         if ((tp->ca_state <= TCP_CA_Disorder)
             || (tp->snd_una == tp->high_seq)
             || (tp->ca_state == TCP_CA_Loss && !tp->retransmits)) {
                 tp->prior_ssthresh = tcp_current_ssthresh(tp);
                 tp->snd_ssthresh = tcp_recalc_ssthresh(tp);
         }
         tp->snd_cwnd = 1;
         tp->snd_cwnd_cnt = 0;
         tp->snd_cwnd_stamp = tcp_time_stamp;

         (...)

tcp_enter_loss(sk, how) is invoked in the handling routine of the retransmission timer (tcp_retransmit_timer). This timer triggers whenever a transmitted data segment has not been acknowledged before the retransmission timeout (RTO) expires; it is then assumed that the data segment or its acknowledgement was lost. In modern networks (wireless networks excepted), packet losses occur almost exclusively in congestion situations, when forwarding systems have to discard packets because of buffer overflows.

In such a situation, which is detected either because no acknowledgement arrives or because duplicate acknowledgements arrive, the TCP congestion-handling routine has to ensure that the data flow of the TCP connection involved is reduced.

Initially, the threshold value for exponential growth in the slow-start phase is set to a new value, computed by the tcp_recalc_ssthresh() method; beforehand, the current threshold is saved in the tp->prior_ssthresh variable. Subsequently, the other congestion-control variables, primarily the congestion window, are reset: tp->snd_cwnd is set to one segment, which causes the TCP connection to start its next transmission in the slow-start phase again.

tcp_recalc_ssthresh()

include/net/tcp.h


 /* Recalculate snd_ssthresh; we want to set it to:
  *
  * one-half the current congestion window, but no
  * less than two segments
  */
 static inline __u32 tcp_recalc_ssthresh(struct tcp_opt *tp)
 {
         return max(tp->snd_cwnd >> 1, 2);
 }

The threshold value for the exponential growth of the slow-start phase is recalculated in tcp_recalc_ssthresh(tp) as soon as a congestion situation is detected: the current size of the congestion window, tp->snd_cwnd, is halved and returned as the new threshold value. Care is taken, however, that the threshold value not become smaller than two segments.

Fast Retransmit and Fast Recovery

The fast-retransmit algorithm was integrated into the TCP protocol for fast detection of single packet losses. Previously, the only way to detect packet losses was the expiry of the retransmission timer, and TCP responded by reducing the transmission rate in the slow-start phase.

The fast-retransmit algorithm enables TCP to detect a packet loss before the retransmission timer expires, that is, when a single segment out of a series of many segments is lost. When one segment is missing, the receiver responds to each subsequent incoming segment with a duplicate acknowledgement, because it is now receiving packets out of order. Figure 24-17 shows such a situation.

Figure 24-17. Detecting a single packet loss by duplicate acknowledgements (segments 5 to 7) and handling by fast retransmit.

graphics/24fig17.gif


The duplicate acknowledgements tell the sender that a segment must have been lost, but that subsequent segments did reach the receiver; a massive congestion is therefore very unlikely. The sender retransmits the data segment that begins at the sequence number carried in the duplicate acknowledgements, without waiting for the retransmission timer to trigger.

A new slow-start phase would be the normal response to a packet loss, but that reaction would significantly reduce the transmission rate. Given duplicate acknowledgements, however, TCP knows that there is no serious congestion. For this reason, the fast-recovery method was introduced as an extension of the fast-retransmit algorithm: after a fast retransmit, the congestion window is not set to a minimum value, but merely halved, and the connection then continues directly in the congestion-avoidance phase with linear growth.

We will next describe how these two algorithms cooperate in the TCP instance of the Linux kernel[4]; a simplified sketch follows the list:

[4] We will keep our discussion short, because the implementation of these two algorithms is extensive and is distributed over many positions within the TCP instance.

  • When the third duplicate acknowledgement is received, the variable tp->snd_ssthresh is set to half of the current congestion window. The missing segment is retransmitted, and the congestion window tp->snd_cwnd takes the value tp->snd_ssthresh + 3 · MSS, where MSS denotes the maximum segment size.

  • Each time another duplicate acknowledgement is received, the congestion window tp->snd_cwnd increases by one maximum segment size, and an additional segment is sent (if the transmit window permits).

  • When the first acknowledgement of new data arrives, tp->snd_cwnd takes the value of tp->snd_ssthresh (the previous threshold having been stored in tp->prior_ssthresh). This acknowledgement should acknowledge the data segment that was originally lost, as well as all segments sent between the lost packet and the third duplicate acknowledgement.
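The following is a condensed, hedged sketch of these three steps in Reno style, with the congestion window counted in segment units. The struct and function names are chosen purely for illustration and do not correspond to the kernel's actual code paths, which are distributed over net/ipv4/tcp_input.c:

 /* Reno-style fast retransmit / fast recovery in segment units
  * (hedged sketch of the three steps above, not kernel code). */
 struct cc_state {
         unsigned int snd_cwnd;
         unsigned int snd_ssthresh;
         unsigned int dupacks;
 };

 static void on_dupack(struct cc_state *s)
 {
         if (++s->dupacks == 3) {
                 /* Step 1: halve the window, retransmit the lost segment,
                  * inflate cwnd by the three segments that left the net. */
                 s->snd_ssthresh = s->snd_cwnd / 2;
                 /* retransmit_lost_segment(); */
                 s->snd_cwnd = s->snd_ssthresh + 3;
         } else if (s->dupacks > 3) {
                 /* Step 2: each further dupack means another segment has
                  * arrived at the receiver; inflate cwnd by one segment. */
                 s->snd_cwnd++;
         }
 }

 static void on_new_ack(struct cc_state *s)
 {
         /* Step 3: the retransmission was acknowledged; deflate the
          * window to the threshold and resume congestion avoidance. */
         if (s->dupacks >= 3)
                 s->snd_cwnd = s->snd_ssthresh;
         s->dupacks = 0;
 }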

24.4.3 Load Avoidance

Even when a relatively small amount of payload is sent, the TCP protocol can produce considerable load. This effect is due to the size of the TCP packet header, which comprises 20 bytes. If we add the IP packet header and an Ethernet frame header, the protocol-control information sent with each packet adds up to more than fifty bytes. With sparsely filled data packets or frequent acknowledgement packets (consisting of packet headers only), a large share of the bandwidth is wasted on packet headers. The two methods described below aim to minimize this effect.

Delayed Acknowledgements

The principle of delayed acknowledgements enables the TCP protocol to delay the acknowledgement for a segment. The acknowledgement is delayed so that it can eventually be sent together with outgoing data (which is also called piggybacking), and so that TCP can accumulate several acknowledgements. The delayed transmission of acknowledgements is implemented by the delack timer, which will be introduced in Section 24.5.

Nagle Algorithm

John Nagle's algorithm, also known as the small-packet-avoidance algorithm, serves to avoid excessive network load caused by a large number of small TCP packets. For this purpose, data to be sent is held back as long as possible, because, in the worst case, two complete TCP packets would have to be sent for each payload byte: the first packet to transport the data, and a second packet, sent by the receiver, to acknowledge the first one.

This worst case occurs when data to be sent accumulates more slowly than it can be sent (e.g., in a Telnet connection, where the data to be transmitted trickles to the TCP instance very slowly, at a few characters per second). The TCP instance would pack each character into a packet and send these packets separately. The packet-header overhead created by this approach was described above.

To avoid high network load due to these many small segments (ones smaller than the negotiated maximum segment size, MSS), the Nagle algorithm [Nagl84] introduces the limitation that at most one small segment per connection may be unacknowledged at any time. As long as such a sent segment is unacknowledged, further small segments are held back, and their data is collected into a larger segment. However, if an application (e.g., Telnet) wants to avoid packet delays for some specific reason, it can use the socket option TCP_NODELAY to disable the Nagle algorithm.

The following fragment shows the complete source text of the Nagle algorithm. These few lines are sufficient to achieve the behavior discussed above.

tcp_nagle_check()

include/net/tcp.h


 static __inline__ int tcp_nagle_check(struct tcp_opt *tp,
                                       struct sk_buff *skb, unsigned mss_now)
 {
         return (skb->len < mss_now &&
                 !(TCP_SKB_CB(skb)->flags &
                   (TCPCB_FLAG_URG | TCPCB_FLAG_FIN)) &&
                 (tp->nonagle == 2 ||
                  (!tp->nonagle &&
                   tp->packets_out &&
                   tcp_minshall_check(tp))));
 }

The Nagle algorithm (i.e., the tcp_nagle_check() method) is asked in tcp_snd_test() whether the present segment may be sent. If the Nagle algorithm wants to prevent the segment from being sent at this time, tcp_nagle_check() returns true; if there are reasons for immediate transmission of this data, it returns false.

The first check is whether there is enough data to fill a complete segment; if so, there is no reason to delay its transmission further. For this purpose, the segment length (skb->len) is compared with mss_now, the amount of data that can currently be sent in a single segment.

The next check is whether particularly important control packets (e.g., FIN or URG packets) are involved. Such packets have to be sent immediately; Nagle's algorithm must not delay them.

The next line checks whether the socket option TCP_CORK was activated (tp->nonagle == 2). This option forces only completely filled packets to be sent.

If all checks so far have failed, the next step checks whether the Nagle algorithm has been disabled by the socket option TCP_NODELAY. If so, tp->nonagle has the value one, and tcp_nagle_check() returns false, which means that the segment may be sent.

However, if the Nagle algorithm is active, then tcp_minshall_check() (shown below) checks whether small, incompletely filled packets are on their way (i.e., packets that have been sent but not yet acknowledged). If there are no small unacknowledged packets in the connection, this data packet may be sent; otherwise, it has to be delayed.

The line before checks whether acknowledgements for any packets are still outstanding (tp->packets_out). Only one variable has to be checked there, so this query is much faster than the method call tcp_minshall_check(tp) that follows. The short-circuit evaluation of the && operator saves time, because the slower method is invoked only if the faster query has not already decided the result.

tcp_minshall_check()

include/net/tcp.h


 static __inline__ int tcp_minshall_check(struct tcp_opt *tp)
 {
         /* tp->snd_sml holds the last byte of the small segment sent
          * most recently. The check returns true if this segment has
          * already been sent (not after snd_nxt) but has not yet been
          * acknowledged (after snd_una), i.e., a small packet is still
          * under way. */
         return after(tp->snd_sml, tp->snd_una) &&
                !after(tp->snd_sml, tp->snd_nxt);
 }


       

