13.5. TCP Algorithms

Now that we have introduced TCP, its state machine, and its sequence space, we can begin to examine the implementation of the protocol in FreeBSD. Several aspects of the protocol implementation depend on the overall state of a connection. The TCP connection state, output state, and state changes depend on external events and timers. TCP processing occurs in response to one of three events:

  1. A request from the user, such as sending data, removing data from the socket receive buffer, or opening or closing a connection

  2. The receipt of a packet for the connection

  3. The expiration of a timer

These events are handled in the routines tcp_usr_send(), tcp_input(), and a set of timer routines. Each routine processes the current event and makes any required changes in the connection state. Then, for any transition that may require sending a packet, the tcp_output() routine is called to do any output that is necessary.

The criteria for sending a packet with data or control information are complicated, and therefore the TCP send policy is the most interesting and important part of the protocol implementation. For example, depending on the state- and flow-control parameters for a connection, any of the following may allow data to be sent that could not be sent previously:

  • A user send call that places new data in the send queue

  • The receipt of a window update from the peer

  • The expiration of the retransmission timer

  • The expiration of the window-update (persist) timer

In addition, the tcp_output() routine may decide to send a packet with control information, even if no data may be sent, for any of these reasons:

  • A change in connection state (e.g., open request, close request)

  • Receipt of data that must be acknowledged

  • A change in the receive window because of removal of data from the receive queue

  • A send request with urgent data

  • A connection abort

We shall consider most of these decisions in greater detail after we have described the states and timers involved. We begin with algorithms used for timing, connection setup, and shutdown; they are distributed through several parts of the code. We continue with the processing of new input and an overview of output processing and algorithms.

Timers

Unlike a UDP socket, a TCP connection maintains a significant amount of state information, and, because of that state, some operations must be done asynchronously. For example, data might not be sent immediately when a process presents them because of flow control. The requirement for reliable delivery implies that data must be retained after they are first transmitted so that they can be retransmitted if necessary. To prevent the protocol from hanging if packets are lost, each connection maintains a set of timers used to recover from losses or failures of the peer. These timers are stored in the protocol control block for a connection. The kernel provides a timer service via a set of callout() routines. The TCP module can register up to five timeout routines with the callout service, as shown in Table 13.3. Each routine has its own associated time at which it will be called. In earlier versions of BSD, timeouts were handled by the tcp_slowtimo() routine, which was called every 500 milliseconds and would then do timer processing when necessary. Using the kernel's timer service is both more accurate, since each timer is handled independently, and less expensive, because no routine is called unless a timer actually expires.

Table 13.3. TCP timer routines.

    Routine              Timeout        Description
    tcp_timer_2msl       60s            wait on close
    tcp_timer_keep       75s            send keepalive or drop dormant connection
    tcp_timer_persist    5s-60s         force a connection to persist
    tcp_timer_rexmt      3 ticks-64s    called when retransmission is necessary
    tcp_timer_delack     100ms          send a delayed acknowledgment to the peer

Two timers are used for output processing. Whenever data are sent on a connection, the retransmit timer (tcp_timer_rexmt()) is started by a call to callout_reset(), unless it is already running. When all outstanding data are acknowledged, the timer is stopped. If the timer expires, the oldest unacknowledged data are resent (at most one full-sized packet), and the timer is restarted with a longer value. The rate at which the timer value is increased (the timer backoff) is determined by a table of multipliers that provides an exponential increase in timeout values up to a ceiling.
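The table-of-multipliers scheme described above can be sketched as follows. The multiplier values and the 64-second ceiling here are illustrative, chosen to show the exponential shape; they are not FreeBSD's actual tcp_backoff[] table.

```c
/* Illustrative backoff multipliers: each retransmission doubles the
 * timeout until the table (and the ceiling) flatten it out. */
static const int backoff[] = { 1, 2, 4, 8, 16, 32, 64, 64 };
#define NBACKOFF    (int)(sizeof(backoff) / sizeof(backoff[0]))
#define RTO_CEILING 64  /* seconds */

/* Return the retransmit timeout (in seconds) to use after the given
 * number of consecutive retransmissions of the same data. */
int
rexmt_timeout(int base_rto, int nretries)
{
    if (nretries >= NBACKOFF)       /* stay at the last multiplier */
        nretries = NBACKOFF - 1;
    int rto = base_rto * backoff[nretries];
    return (rto > RTO_CEILING ? RTO_CEILING : rto);
}
```

With a 3-second base timeout, successive retransmissions would wait 3, 6, 12, 24 seconds and so on, until the ceiling clamps the interval.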

The other timer used for maintaining output flow is the persist timer (tcp_timer_persist()). This timer protects against the other type of packet loss that could cause a connection to constipate: the loss of a window update that would allow more data to be sent. Whenever data are ready to be sent but the send window is too small to bother sending (zero, or less than a reasonable amount), and no data are already outstanding (the retransmit timer is not set), the persist timer is started. If no window update is received before the timer expires, the routine sends as large a segment as the window allows. If that size is zero, it sends a window probe (a single octet of data) and restarts the persist timer. If a window update was lost in the network, or if the receiver neglected to send a window update, the acknowledgment will contain current window information. On the other hand, if the receiver is still unable to accept additional data, it should send an acknowledgment for previous data with a still-closed window. The closed window might persist indefinitely; for example, the receiver might be a network-login client, and the user might stop terminal output and leave for lunch (or vacation).

The third timer used by TCP is the keepalive timer (tcp_timer_keep()). The keepalive timer has two different purposes at different phases of a connection. During connection establishment, this timer limits the time for the three-way handshake to complete. If the timer expires during connection setup, then the connection is closed. Once the connection completes, the keepalive timer monitors idle connections that might no longer exist on the peer because of a network partition or a crash. If a socket-level option is set and the connection has been idle since the most recent keepalive timeout, the timer routine will send a keepalive packet designed to produce either an acknowledgment or a reset (RST) from the peer TCP. If a reset is received, the connection will be closed; if no response is received after several attempts, the connection will be dropped. This facility is designed so that network servers can avoid languishing forever if the client disappears without closing. Keepalive packets are not an explicit feature of the TCP protocol. The packets used for this purpose by FreeBSD set the sequence number to 1 less than snd_una, which should elicit an acknowledgment from the peer if the connection still exists.

The fourth TCP timer is known as the 2MSL timer ("twice the maximum segment lifetime"). TCP starts this timer when a connection is completed by sending an acknowledgment for a FIN (from FIN_WAIT_2) or by receiving an ACK for a FIN (from CLOSING state, where the send side is already closed). Under these circumstances, the sender does not know whether the acknowledgment was received. If the FIN is retransmitted, it is desirable that enough state remain that the acknowledgment can be repeated. Therefore, when a TCP connection enters the TIME_WAIT state, the 2MSL timer is started; when the timer expires, the control block is deleted. If a retransmitted FIN is received, another ACK is sent, and the timer is restarted. To prevent this delay from blocking a process closing the connection, any process close request is returned successfully without the process waiting for the timer. Thus, a protocol control block may continue its existence even after the socket descriptor has been closed. In addition, FreeBSD starts the 2MSL timer when FIN_WAIT_2 state is entered after the user has closed; if the connection is idle until the timer expires, it will be closed. Because the user has already closed, new data cannot be accepted on such a connection in any case. This timer is set because certain other TCP implementations (incorrectly) fail to send a FIN on a receive-only connection. Connections to such hosts would remain in FIN_WAIT_2 state forever if the system did not have a timeout.

The final timer is the delayed-acknowledgment timer (tcp_timer_delack()), which sends delayed acknowledgments; it is described in Section 13.6.

Estimation of Round-Trip Time

When connections must traverse slow networks that lose packets, an important decision determining connection throughput is the value to be used when the retransmission timer is set. If this value is too large, data flow will stop on the connection for an unnecessarily long time before the dropped packet is resent. Another round-trip time interval is required for the sender to receive an acknowledgment of the resent segment and a window update, allowing it to send new data. (With luck, only one segment will have been lost, and the acknowledgment will include the other segments that had been sent.) If the timeout value is too small, however, packets will be retransmitted needlessly. If the cause of the network slowness or packet loss is congestion, then unnecessary retransmission only exacerbates the problem. The traditional solution to this problem in TCP is for the sender to estimate the round-trip time (rtt) for the connection path by measuring the time required to receive acknowledgments for individual segments. The system maintains an estimate of the round-trip time as a smoothed moving average, srtt [Postel, 1981b], using

    srtt = (α × srtt) + ((1 − α) × rtt)
In addition to a smoothed estimate of the round-trip time, TCP keeps a smoothed variance (estimated as the mean difference, to avoid square-root calculations in the kernel). It employs an α value of 0.875 for the round-trip time and a corresponding smoothing factor of 0.75 for the variance. These values were chosen in part so that the system could compute the smoothed averages using shift operations on fixed-point values rather than floating-point arithmetic, which is expensive on many hardware architectures. The initial retransmission timeout is then set to the current smoothed round-trip time plus four times the smoothed variance. This algorithm is substantially more efficient on long-delay paths with little variance in delay, such as satellite links, because it effectively computes the fixed β factor of the original specification dynamically [Jacobson, 1988].
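The fixed-point update described above can be sketched in a few lines. A gain of 1/8 on the smoothed round-trip time corresponds to α = 0.875, and a gain of 1/4 on the variance corresponds to the 0.75 smoothing factor; both reduce to right shifts. This is a minimal sketch in whole ticks; the kernel's actual code scales the variables to preserve fractional bits, which this sketch omits.

```c
#include <stdlib.h>

struct rtt_state {
    int srtt;    /* smoothed round-trip time */
    int rttvar;  /* smoothed mean deviation */
};

/* Fold a new round-trip measurement into the estimators and return
 * the retransmission timeout: srtt + 4 * rttvar. */
int
rtt_update(struct rtt_state *rs, int measured)
{
    int delta = measured - rs->srtt;

    rs->srtt += delta >> 3;                       /* 7/8 old + 1/8 new */
    rs->rttvar += (abs(delta) - rs->rttvar) >> 2; /* 3/4 old + 1/4 new */
    return (rs->srtt + 4 * rs->rttvar);
}
```

A steady stream of identical measurements leaves the estimate unchanged, while a sudden jump raises both the average and the variance, inflating the timeout quickly.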

For simplicity, the variables in the TCP protocol control block allow measurement of the round-trip time for only one sequence value at a time. This restriction prevents accurate time estimation when the window is large; only one packet per window can be timed. However, if the TCP timestamps option is supported by both peers, a timestamp is sent with each data packet and is returned with each acknowledgment. Here, estimates of round-trip time can be obtained with each new acknowledgment; the quality of the smoothed average and variance is thus improved, and the system can respond more quickly to changes in network conditions.

Connection Establishment

There are two ways in which a new TCP connection can be established. An active connection is initiated by a connect call, whereas a passive connection is created when a listening socket receives a connection request. We consider each in turn.

The initial steps of an active connection attempt are similar to the actions taken during the creation of a UDP socket. The process creates a new socket, resulting in a call to the tcp_attach() routine. TCP creates an inpcb protocol control block and then creates an additional control block (a tcpcb structure), as described in Section 13.1. Some of the flow-control parameters in the tcpcb are initialized at this time. If the process explicitly binds an address or port number to the connection, the actions are identical to those for a UDP socket. Then a tcp_connect() call initiates the actual connection. The first step is to set up the association with in_pcbconnect(), again identically to this step in UDP. A packet-header template is created for use in construction of each output packet. An initial sequence number is chosen from a sequence-number prototype, which is then advanced by a substantial amount. The socket is then marked with soisconnecting(), the TCP connection state is set to TCPS_SYN_SENT, the keepalive timer is set (to 75 seconds) to limit the duration of the connection attempt, and tcp_output() is called for the first time.

The output-processing module tcp_output() uses an array of packet control flags indexed by the connection state to determine which control flags should be sent in each state. In the TCPS_SYN_SENT state, the SYN flag is sent. Because it has a control flag to send, the system sends a packet immediately using the prototype just constructed and including the current flow-control parameters. The packet normally contains three option fields: a maximum-segment-size option, a window-scale option, and a timestamps option (see Section 13.4). The maximum-segment-size option communicates the largest segment size that TCP is willing to accept. To compute this value, the system locates a route to the destination. If the route specifies a maximum transmission unit (MTU), the system uses that value after allowing for packet headers. If the connection is to a destination on a local network, the maximum transmission unit of the outgoing network interface is used, possibly rounding down to a multiple of the mbuf cluster size for efficiency of buffering. If the destination is not local and nothing is known about the intervening path, the default segment size (512 octets) is used.
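The decision sequence above can be sketched as a small function. The 40-octet header allowance, the 512-octet default, and the cluster-rounding step follow the text; choose_mss() and its parameters are illustrative names, not the kernel's actual interface.

```c
#define TCPIP_HDRLEN 40   /* IPv4 + TCP headers, no options */
#define DEFAULT_MSS  512  /* default when nothing is known about the path */
#define MCLBYTES     2048 /* mbuf cluster size */

/* Hypothetical sketch of the maximum-segment-size choice:
 * prefer the route's MTU, then the local interface's MTU
 * (rounded down to a cluster multiple), then the default. */
int
choose_mss(int route_mtu, int if_mtu, int dest_is_local)
{
    if (route_mtu > 0)                  /* route specifies an MTU */
        return (route_mtu - TCPIP_HDRLEN);
    if (dest_is_local) {                /* destination on a local network */
        int mss = if_mtu - TCPIP_HDRLEN;
        if (mss > MCLBYTES)             /* round down for buffering */
            mss -= mss % MCLBYTES;
        return (mss);
    }
    return (DEFAULT_MSS);               /* unknown intervening path */
}
```

For a typical Ethernet MTU of 1500, this yields the familiar 1460-octet segment size.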

In earlier versions of FreeBSD, many of the important variables relating to a TCP connection, such as the MTU of the path between the two endpoints, were stored as a set of route metrics in the routing entry that described the path to the remote endpoint. The TCP host cache was developed to centralize all this information in one easy-to-find place so that information gathered on one connection could be reused when a new connection was opened to the same endpoint. The data recorded on a connection are shown in Table 13.4. The variables stored in a host-cache entry are described in later sections of this chapter as they become relevant to our discussion of how TCP manages a connection.

Table 13.4. TCP host cache metrics.

    Variable          Description
    rmx_mtu           MTU for this path
    rmx_ssthresh      outbound gateway buffer limit
    rmx_rtt           estimated round-trip time
    rmx_rttvar        estimated rtt variance
    rmx_bandwidth     estimated bandwidth
    rmx_cwnd          congestion window
    rmx_sendpipe      outbound delay-bandwidth product
    rmx_recvpipe      inbound delay-bandwidth product

Whenever a new connection is opened, a call is made to tcp_hc_get() to find any information on past connections. If an entry exists in the cache for the target endpoint, TCP uses the cached information to make better-informed decisions about managing the connection. When a connection is closed, the host cache is updated with all the relevant information that was discovered during the connection between the two hosts. Each host cache entry has a default lifetime of one hour. Anytime that the entry is accessed or updated, its lifetime is reset to one hour. Every five minutes the tcp_hc_purge() routine is called to clean out any entries that have passed their expiration time. Cleaning out old entries ensures that the host cache does not grow too large and that it always has reasonably fresh data.
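The expiry policy described above (a one-hour lifetime that is reset on every access, with a periodic purge) can be sketched as follows. The structure layout and function names here are illustrative stand-ins, not the kernel's actual tcp_hostcache code.

```c
#include <stddef.h>

#define HC_LIFETIME 3600 /* seconds: one hour from last access */

struct hc_entry {
    long expire;  /* absolute time at which the entry goes stale */
    int  valid;   /* nonzero while the entry is live */
};

/* On every access or update, push the expiration out another hour. */
void
hc_touch(struct hc_entry *e, long now)
{
    e->expire = now + HC_LIFETIME;
}

/* Periodic purge (the text's tcp_hc_purge() runs every five minutes):
 * invalidate entries past their expiration; return how many were dropped. */
int
hc_purge(struct hc_entry *tbl, size_t n, long now)
{
    int dropped = 0;

    for (size_t i = 0; i < n; i++)
        if (tbl[i].valid && tbl[i].expire <= now) {
            tbl[i].valid = 0;
            dropped++;
        }
    return (dropped);
}
```

Because every lookup calls hc_touch(), only endpoints that have gone a full hour without a connection are discarded, keeping the cache small but its data fresh.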

TCP can use path-MTU discovery, as described in Mogul & Deering [1990]. Path-MTU discovery is a process whereby the system probes the network to find the maximum transmission unit on a particular route between two nodes. It does so by setting the IP don't-fragment flag on each packet that it sends. If a packet reaches a link on the path to its destination over which it would have to be fragmented, it is dropped by the intervening router, and an error is returned to the sender. The error message contains the largest packet size that the link will accept. This information is recorded in the TCP host cache for the appropriate endpoint, and transmission is attempted with the smaller MTU. Once the connection completes, because enough packets have made it through the network to establish a TCP connection, the revised MTU recorded in the host cache is confirmed. Packets continue to be transmitted with the don't-fragment flag set so that, if the path to the node changes and the new path has an even smaller MTU, the smaller MTU will be recorded. FreeBSD currently has no way of raising the MTU back to a larger size when a route changes.
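The probing process can be simulated in a few lines. Here link_mtus[] stands in for the routers along the path, and each iteration models one probe with the don't-fragment flag set: a router whose link is too small "drops" the packet and reports its MTU, and the sender retries with that value. This is a simulation of the mechanism, not the kernel's implementation.

```c
/* Simulate path-MTU discovery: probe with successively smaller packet
 * sizes until one fits every link on the path.  Each blocked probe
 * models a dropped packet plus an ICMP error carrying the link's MTU. */
int
discover_path_mtu(const int *link_mtus, int nlinks, int mtu)
{
    for (int probes = 0; probes < 32; probes++) {
        int blocked = 0;

        for (int i = 0; i < nlinks; i++)
            if (link_mtus[i] < mtu) {   /* would need fragmentation */
                mtu = link_mtus[i];     /* error reports this link's MTU */
                blocked = 1;
                break;
            }
        if (!blocked)
            return (mtu);               /* probe traversed the whole path */
    }
    return (mtu);
}
```

Note that the discovered value can only shrink, mirroring the text's observation that FreeBSD has no mechanism for raising the MTU again when the route changes.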

When a connection is first being opened, the retransmit timer is set to the default value (6 seconds) because no round-trip time information is available yet. With a bit of luck, a responding packet will be received from the target of the connection before the retransmit timer expires. If not, the packet is retransmitted and the retransmit timer is restarted with a greater value. If no response is received before the keepalive timer expires, the connection attempt is aborted with a "Connection timed out" error. If a response is received, however, it is checked for agreement with the outgoing request. It should acknowledge the SYN that was sent and should include a SYN. If it does both, the receive sequence variables are initialized, and the connection state is advanced to TCPS_ESTABLISHED. If a maximum-segment-size option is present in the response, the maximum segment size for the connection is set to the minimum of the offered size and the maximum transmission unit of the outgoing interface; if the option is not present, the default size (512 data bytes) is recorded. The flag TF_ACKNOW is set in the TCP control block before the output routine is called so that the SYN will be acknowledged immediately. The connection is now ready to transfer data.

The events that occur when a connection is created by a passive open are different. A socket is created and its address is bound as before. The socket is then marked by the listen call as willing to accept connections. When a packet arrives for a TCP socket in TCPS_LISTEN state, a new socket is created with sonewconn(), which calls the TCP tcp_usr_attach() routine to create the protocol control blocks for the new socket. The new socket is placed on the queue of partial connections headed by the listening socket. If the packet contains a SYN and is otherwise acceptable, the association of the new socket is bound, both the send and the receive sequence numbers are initialized, and the connection state is advanced to TCPS_SYN_RECEIVED. The keepalive timer is set as before, and the output routine is called after TF_ACKNOW has been set to force the SYN to be acknowledged; an outgoing SYN is sent as well. If this SYN is acknowledged properly, the new socket is moved from the queue of partial connections to the queue of completed connections. If the owner of the listening socket is sleeping in an accept call or does a select, the socket will indicate that a new connection is available. Again, the socket is finally ready to send data. Up to one window of data may have already been received and acknowledged by the time that the accept call completes.

SYN Cache

One problem in previous implementations of TCP was that it was possible for a malicious program to flood a system with SYN packets, thereby preventing it from doing any useful work or servicing any real connections. This type of denial of service attack became common during the commercialization of the Internet in the late 1990s. To combat this attack, a syncache was introduced to efficiently store, and possibly discard, SYN packets that do not lead to real connections. The syncache handles the three-way handshake between a local server and connecting peers.

When a SYN packet is received for a socket that is in the LISTEN state, the TCP module attempts to add a new syncache entry for the packet using the syncache_add() routine. If there are any data in the received packet, they are not acknowledged at this time. Acknowledging the data would use up system resources, and an attacker could exhaust these resources by flooding the system with SYN packets that included data. If this SYN has not been seen before, a new entry is created in the hash table based on the packet's foreign address, foreign port, the local port of the socket, and a mask. The syncache module responds to the SYN with a SYN/ACK and sets a timer on the new entry. If the syncache contains an entry that matches the received packet, then it is assumed that the original SYN/ACK was not received by the peer initiating the connection, and another SYN/ACK is sent and the timer on the syncache entry is reset. There is no limit set on the number of SYN packets that can be sent by a connecting peer. Any limit would not follow the TCP RFCs and might impede connections over lossy networks.
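A syncache lookup key combines the fields the text lists (foreign address, foreign port, and local port) and reduces them to a bucket index under a mask. The mixing function below is illustrative only; FreeBSD's actual syncache uses a keyed hash so that an attacker cannot aim a flood of SYN packets at a single bucket.

```c
#include <stdint.h>

#define SYNCACHE_HASHSIZE 512 /* bucket count; must be a power of two */

/* Hypothetical bucket selection for a syncache entry, derived from the
 * connection's foreign address, foreign port, and local port.  The
 * multiply-and-xor mixing step is an arbitrary integer hash, not the
 * kernel's (secret-keyed) function. */
uint32_t
syncache_bucket(uint32_t faddr, uint16_t fport, uint16_t lport)
{
    uint32_t h = faddr ^ (((uint32_t)fport << 16) | lport);

    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return (h & (SYNCACHE_HASHSIZE - 1)); /* apply the bucket mask */
}
```

Keeping the table a power of two lets the mask replace a modulo operation, and hashing on the full four-tuple fields lets a retransmitted SYN find its existing entry so that the SYN/ACK can simply be resent.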

Connection Shutdown

A TCP connection is symmetrical and full-duplex, so either side may initiate disconnection independently. As long as one direction of the connection can carry data, the connection remains open. A socket may indicate that it has completed sending data with the shutdown system call, which results in a call to the tcp_usr_shutdown() routine. The response to this request is that the state of the connection is advanced; from the ESTABLISHED state, the state becomes FIN_WAIT_1. The ensuing output call will send a FIN, indicating an end-of-file. The receiving socket will advance to CLOSE_WAIT but may continue to send. The procedure may be different if the process simply closes the socket. In that case, a FIN is sent immediately, but if new data are received, they cannot be delivered. Normally, higher-level protocols conclude their own transactions such that both sides know when to close. If they do not, however, TCP must refuse new data. It does so by sending a packet with the RST flag set if new data are received after the user has closed. If data remain in the send buffer of the socket when the close is done, TCP will normally attempt to deliver them. If the socket option SO_LINGER was set with a linger time of zero, the send buffer is simply flushed; otherwise, the user process is allowed to continue, and the protocol waits for delivery to conclude. Under these circumstances, the socket is marked with the state bit SS_NOFDREF (no file-descriptor reference). The completion of data transfer and the final close can take place an arbitrary amount of time later. When TCP finally completes the connection (or gives up because of timeout or other failure), it calls tcp_close(). The protocol control blocks and other dynamically allocated structures are freed at this time. The socket also is freed if the SS_NOFDREF flag has been set. Thus, the socket remains in existence as long as either a file descriptor or a protocol control block refers to it.

The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
Year: 2003