24.2 Implementing The TCP Protocol Instance | Linux Network Architecture

The protocol instance of the Transmission Control Protocol is one of the most complex parts in the Linux network architecture. The protocol uses a large number of algorithms and features that require extensive mechanisms to implement them. This section explains how these mechanisms are implemented and how they interact in the TCP implementation.

First, we will have a look at "normal" receive and transmit processes in the TCP instance, where we will leave out many details. Too much detail would make it difficult at this point to understand the entire process in the TCP instance and the features of each of the TCP algorithms.

Section 24.3 discusses connection management, that is, how TCP connections are established and torn down; Section 24.4 discusses each of the algorithms used to exchange data (e.g., congestion control and window scaling). Finally, Section 24.5 introduces the tasks of the TCP protocol instance and how its timers are managed.

The TCP protocol instance is extremely complex. It consists of a large number of functions, inline functions, structures, and macros. In addition, the large number of algorithms used within the TCP protocol makes its description rather difficult. For this reason, we will begin with a general overview of the process involved when receiving, and then when sending, a TCP segment. A detailed discussion of the large number of algorithms used in TCP will follow in Section 24.4. In addition, this section assumes that data is exchanged over an existing connection. The complex management of TCP connections is dealt with in Section 24.3.

24.2.1 Handling Incoming TCP Segments

The transport protocol of an incoming packet is determined in the IP layer, so that the packet can be passed to the appropriate protocol-handling routine in the transport layer. (See Section 14.2.5.) In the TCP instance, this routine is the tcp_v4_rcv() function (net/ipv4/tcp_ipv4.c).

Figure 24-2 shows how packets are processed in the TCP instance, and Figure 24-3 gives an overview of what happens when the TCP instance receives a segment.

Figure 24-2. Partial representation of how packets are handled in the TCP instance.

Figure 24-3. Overview of the process for receiving a segment in the TCP instance.


`tcp_v4_rcv()`	net/ipv4/tcp_ipv4.c

tcp_v4_rcv(skb, len) checks whether the packet, which arrives in the form of the socket buffer skb, is really addressed to this computer (skb->pkt_type == PACKET_HOST). If so, the IP packet header is removed and protocol processing continues; otherwise, the socket buffer is dropped.

tcp_v4_lookup() searches the hash table of the active sockets for the matching socket (i.e., its sock structure). The parameters used are the IP addresses and ports of the two communication partners and the index of the network device at which this segment arrived (skb->dst->rt_iif). If a socket with these addresses and ports is found, the tcp_v4_do_rcv() function continues with an appropriate handling routine, depending on the connection state. If no socket is found, tcp_send_reset() sends a RESET segment.
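The lookup step can be pictured with a small user-space sketch. It is not the kernel's implementation: the types struct conn and struct conn_table, the functions conn_hash(), conn_insert(), and conn_lookup(), the hash function, and the table size are assumptions made for illustration, and the device index is left out for brevity. A NULL result corresponds to the case in which a RESET segment is sent.

```c
#include <stddef.h>
#include <stdint.h>

#define CONN_BUCKETS 16u  /* hypothetical table size (power of two) */

/* One established connection, identified by addresses and ports. */
struct conn {
    uint32_t saddr, daddr;   /* remote and local IPv4 address */
    uint16_t sport, dport;   /* remote and local port */
    struct conn *next;       /* chaining within one hash bucket */
};

struct conn_table {
    struct conn *bucket[CONN_BUCKETS];
};

/* Illustrative hash over the four-tuple; not the kernel's hash function. */
unsigned conn_hash(uint32_t saddr, uint16_t sport, uint32_t daddr, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (CONN_BUCKETS - 1u);
}

/* Adds a connection at the head of its hash bucket. */
void conn_insert(struct conn_table *t, struct conn *c)
{
    unsigned i = conn_hash(c->saddr, c->sport, c->daddr, c->dport);

    c->next = t->bucket[i];
    t->bucket[i] = c;
}

/* Returns the matching connection, or NULL if none exists. */
struct conn *conn_lookup(const struct conn_table *t,
                         uint32_t saddr, uint16_t sport,
                         uint32_t daddr, uint16_t dport)
{
    struct conn *c = t->bucket[conn_hash(saddr, sport, daddr, dport)];

    for (; c != NULL; c = c->next)
        if (c->saddr == saddr && c->sport == sport &&
            c->daddr == daddr && c->dport == dport)
            return c;
    return NULL;
}
```

Chaining within each bucket keeps the sketch short; all four fields are compared on every candidate because different connections can share a bucket.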

`tcp_v4_do_rcv()`	net/ipv4/tcp_ipv4.c

First, if socket filters are activated, the sk_filter() function checks the socket buffer. If the result is negative, the packet is dropped. Otherwise, the process continues with either of the following functions, depending on the TCP connection state (sk->state):

TCP_ESTABLISHED: With a connection established, the socket buffer is further processed in tcp_rcv_established() (as is shown later).
If the socket is in one of the other states, the function tcp_rcv_state_process() processes the socket buffer. This function is described in Section 24.3.

The latter cases (i.e., how all packets that do not arrive in the TCP_ESTABLISHED state are handled) are introduced in Section 24.3, which also explains how the TCP finite state machine was implemented. As we can already see at this point, the implementation of the TCP instance in the Linux kernel deviates from the usual concept of such an implementation. In educational operating systems, the state machine would probably be implemented with a central case statement that branches depending on the state. However, real-world systems often prefer a "fast" over an "elegant" implementation for performance reasons.
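The dispatch in tcp_v4_do_rcv() can be summarized in a few lines. The following is a simplified sketch, not kernel code: the enumerations and the function select_rcv_handler() are hypothetical, and only a handful of connection states are listed.

```c
/* Hypothetical subset of connection states. */
enum conn_state {
    ST_ESTABLISHED,
    ST_SYN_SENT,
    ST_SYN_RECV,
    ST_FIN_WAIT1,
    ST_LISTEN
};

/* Which routine would handle the socket buffer next. */
enum rcv_handler {
    HANDLE_DROP,           /* socket filter rejected the buffer */
    HANDLE_ESTABLISHED,    /* corresponds to tcp_rcv_established() */
    HANDLE_STATE_PROCESS   /* corresponds to tcp_rcv_state_process() */
};

/* filter_accepts is nonzero if the socket filter lets the buffer pass. */
enum rcv_handler select_rcv_handler(int filter_accepts, enum conn_state state)
{
    if (!filter_accepts)
        return HANDLE_DROP;
    if (state == ST_ESTABLISHED)      /* the common case is tested first */
        return HANDLE_ESTABLISHED;
    return HANDLE_STATE_PROCESS;      /* all other states */
}
```

Testing the established state first, instead of branching over all states in one central statement, keeps the most frequent case as short as possible.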

At this point, let's return to the tcp_rcv_established() function and how it handles packets as they arrive. Receiving packets over an established connection is the most common case, but we will not discuss it in detail for lack of space. The TCP protocol has a large number of algorithms, which are discussed in Section 24.4 rather than in this section. This approach allows us to concentrate on the path a packet takes through the TCP instance without having to discuss each algorithm. Section 24.4 describes the features of these algorithms and how they were implemented.

`tcp_rcv_established()`	net/ipv4/tcp_input.c

tcp_rcv_established(sk, skb, th, len) handles TCP packets incoming over an established connection (i.e., in the data-exchange phase (TCP_ESTABLISHED)). Once again, this function is a good example of the intended objective: to achieve efficient protocol handling. In fact, tcp_rcv_established() distinguishes between two paths for packet-handling purposes:

Fast Path is used to handle the ideal case of an incoming packet. The most common cases occurring in a normal TCP connection should be detected as fast as possible and processed optimally, without having to test for marginal cases, which normally won't occur in this situation.
Slow Path is used for all packets not corresponding to the ideal case and requiring some special handling. For example, if a packet had to be manipulated to deal with a transmission error, or if it is a retransmitted packet, it is processed in Slow Path by appropriate error-correction mechanisms.

Fast Path and Slow Path are not specific to the Linux kernel. In fact, this differentiation was proposed by Van Jacobson [Jaco90a] at the beginning of the nineties, and it was implemented in BSD UNIX in a similar form. A study in [Stev94b] showed that Fast Path was applied to between 97% and 100% of all packets incoming over a TCP connection within a local area network and to between 83% and 99% of all packets on a WAN connection. Though these results are not necessarily representative and depend on the actual load of the networks, they show that it appears useful to differentiate between Slow Path and Fast Path.

Packets are processed in Fast Path in the following two situations:

The segment received is a pure ACK segment for the data sent last (no duplicate ACK).
The segment received includes data expected next, so that they are consecutive with the data received until then.

Fast Path is not accessed in the following situations (and so a detailed protocol-handling process is performed in Slow Path):

Unexpected TCP flags: The process continues in Slow Path if a SYN, URG, FIN, or RST flag is set. These cases are detected by the Header Prediction described later.
If the sequence number of an incoming segment does not correspond to the sequence number expected next (tp->rcv_nxt), then the segment is either a retransmitted segment or an out-of-order segment.
Both communication partners exchange data: Fast Path cannot be used except in situations where the relevant TCP instance either only sends or only receives (i.e., where either the sequence number or the acknowledgement number remains constant).
The current TCP instance sent a Zero Window (i.e., no transmit credit can currently be granted to the communication partner).
Unexpected TCP options are processed in Slow Path. The Timestamp is the only option that can be handled in Fast Path.

To make the differentiation between Fast Path and Slow Path worthwhile, Fast Path has to be detected quickly and reliably. Most of the cases mentioned above can be detected by the so-called Header Prediction, which uses a simple comparative operation on the Header Length, Flags, and Window Size fields and a predicted value to decide whether the Fast Path can be used:

    /* pred_flags is 0xS?10 << 16 + snd_wnd
     * if header_prediction is to be made
     * 'S' will always be tp->tcp_header_len >> 2
     * '?' will be 0 for the fast path, otherwise pred_flags is 0 to
     *       turn it off (when there are holes in the receive space
     *       for instance), PSH flag is ignored.
     */
    if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
        TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
        (...FAST PATH...)
    } else {
        (...SLOW PATH...)
    }

The comparison value (tp->pred_flags) is computed in advance in the tcp_fast_path_on() function; setting it activates processing over the Fast Path (for packets that meet the preconditions):

    static __inline__ void __tcp_fast_path_on(struct tcp_opt *tp, u32 snd_wnd)
    {
        tp->pred_flags = htonl((tp->tcp_header_len << 26) |
                               ntohl(TCP_FLAG_ACK) |
                               snd_wnd);
    }

    static __inline__ void tcp_fast_path_on(struct tcp_opt *tp)
    {
        __tcp_fast_path_on(tp, tp->snd_wnd >> tp->snd_wscale);
    }

Should the TCP connection get into a situation where Fast Path cannot be used, the problem is solved by simply writing zero to the comparison value of the Header Prediction (tp->pred_flags). This is done, for example, when a Zero Window is sent in tcp_select_window().
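The interplay of the predicted value and the header check can be tried out in user space. The following is a minimal sketch, not kernel code: struct hp_state and the hp_*() functions are hypothetical names, all values are kept in host byte order for simplicity, and the masks assume the standard TCP header layout, in which the fourth 32-bit word holds the data offset, the reserved bits, the flags, and the window.

```c
#include <stdint.h>

/* Host-order masks for the fourth 32-bit word of a TCP header
 * (data offset | reserved | flags | window). Assumed layout. */
#define HP_FLAG_ACK  0x00100000u
#define HP_FLAG_PSH  0x00080000u
#define HP_FLAG_FIN  0x00010000u
#define HP_RESERVED  0x0F000000u
#define HP_BITS      (~(HP_RESERVED | HP_FLAG_PSH))  /* PSH is ignored */

struct hp_state {
    uint32_t header_len;  /* expected TCP header length in bytes */
    uint32_t snd_wnd;     /* unscaled window last advertised by the peer */
    uint32_t rcv_nxt;     /* sequence number expected next */
    uint32_t pred_flags;  /* predicted header word; 0 disables Fast Path */
};

/* header_len << 26 places (header_len / 4) into the top four bits. */
void hp_fast_path_on(struct hp_state *s)
{
    s->pred_flags = (s->header_len << 26) | HP_FLAG_ACK |
                    (s->snd_wnd & 0xFFFFu);
}

/* A valid header never has a data offset of 0, so 0 never matches. */
void hp_fast_path_off(struct hp_state *s)
{
    s->pred_flags = 0;
}

/* Builds the header word of a segment, for experiments. */
uint32_t hp_flag_word(uint32_t header_len, uint32_t flags, uint16_t window)
{
    return (header_len << 26) | flags | window;
}

/* Returns 1 if the segment qualifies for the Fast Path, else 0. */
int hp_predict(const struct hp_state *s, uint32_t flag_word, uint32_t seq)
{
    return (flag_word & HP_BITS) == s->pred_flags && seq == s->rcv_nxt;
}
```

A single masked comparison covers header length, flags, and window at once, which is why turning the Fast Path off needs nothing more than storing a value that no valid header can produce.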

Fast Path

Once an incoming segment has successfully passed the Header Prediction, its processing in the Fast Path begins, where the following operations are done:

The sequence number is checked to filter out-of-order packets (TCP_SKB_CB(skb)->seq == tp->rcv_nxt).
The Timestamp option is checked, but only by evaluating the length of the packet header. All other options fail the Header Prediction, which means that a simple check of the packet-header length is sufficient. Subsequently, the Timestamp values, TSval and TSecr, are read directly. (See Section 24.4.1.) If the subsequent PAWS check fails, then the process continues in Slow Path; otherwise, the segment is all right. If the condition to update the tp->ts_recent timestamp is met, it is accepted by tcp_store_ts_recent().
Subsequently, the packet-header length is compared with the segment length to distinguish pure acknowledgement packets from payload packets, and segments that are too short are dropped.
- ACK segment: The acknowledgement number, if present, in a packet that contains no payload is processed in tcp_ack(). Subsequently, __kfree_skb() releases the socket buffer, which completes the process of handling this segment. Finally, tcp_data_snd_check() checks whether local packets can be sent.
- Data segment: In this case, the segment contains the data expected next (was previously checked).
  At this point, the only thing tested is whether the payload can be copied directly into the user-address space:
  - If the payload can be copied directly into the user-address space, then the statistics of this connection are updated, the relevant process is informed, the payload is copied into the receive memory of the process, the TCP packet header is removed, and, finally, the variable with the sequence number expected next is updated.
  - If the payload cannot be directly copied into the user-address space, then the availability of buffer memory in the socket is checked, statistical information is updated, the TCP packet header is removed, the packet is added to the end of the socket's receive queue, and, finally, the sequence number expected next is set.
- All management tasks arising from the receipt of a payload segment are completed in the tcp_event_data_recv() function.
- If the segment's acknowledgement number confirms data not yet acknowledged, then the actions required are done in tcp_ack(). Subsequently, tcp_data_snd_check() initiates the transmission of waiting data that may be sent now that the acknowledgement was received.
- Finally, the process checks whether an acknowledgement has to be sent as response to the receipt of this segment, in the form of either Delayed ACK or Quick ACK.
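The PAWS test mentioned in the Timestamp step boils down to a wraparound-safe comparison of two 32-bit timestamps. The following is a minimal sketch, assuming that a segment is suspect when its TSval is older than the value remembered in ts_recent; the function name paws_suspect() is hypothetical, and the aging of ts_recent on long-idle connections is ignored here.

```c
#include <stdint.h>

/* Returns 1 if tsval is older than ts_recent, taking wraparound of the
 * 32-bit timestamp clock into account. The cast to a signed type relies
 * on two's-complement conversion, as is usual for this idiom. */
int paws_suspect(uint32_t ts_recent, uint32_t tsval)
{
    return (int32_t)(tsval - ts_recent) < 0;
}
```

In the Fast Path, a suspect segment is not dropped on the spot; processing merely continues in Slow Path, where the full check is done.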

Slow Path

A packet is processed in Slow Path if the prerequisites for Fast Path are not met or the Header Prediction fails. Slow Path processing considers all possibilities of a segment received over an established connection. The following operations are done consecutively:

The checksum is verified.
The Timestamp option is checked in tcp_fast_parse_options(), and the PAWS check works out whether the packet has to be dropped (tcp_paws_discard()).
Using the sequence number, tcp_sequence() checks whether the packet arrived out of order (OfO). If it is an OfO packet, then the QuickACK mode is activated to send acknowledgements as fast as possible.
If the RST flag is set, tcp_reset() resets the connection (changes the connection state and deactivates the timers), and the socket buffer is freed.
If the TCP packet header contains a Timestamp option, then tcp_replace_ts_recent() updates the recent timestamp stored locally.
If the SYN flag is set to signal an error case in an established connection, then tcp_reset() resets the connection.
If the ACK flag is set, the tcp_ack() function processes the acknowledgement.
If the URG flag denotes that the packet contains priority data, then this data is processed in tcp_urg().
tcp_data() and tcp_data_queue() process the payload. Among other things, this includes a check for sufficient space in the receive buffer and insertion of the socket buffer into the receive queue or the out-of-order queue.
Finally, two methods, tcp_data_snd_check() and tcp_ack_snd_check(), are invoked to check on whether data or acknowledgements waiting can be sent.

These actions complete the process of handling a received TCP segment. This section is only an overview of the rough process involved in handling incoming segments. At several points, we mentioned functions invoked in Fast Path or Slow Path without explaining them in detail. For this reason, we will briefly describe these functions in the following subsection.

Helper Functions to Handle Incoming TCP Segments

`tcp_ack()`	net/ipv4/tcp_input.c

tcp_ack(sk, th, ack_seq, ack, len) handles all tasks involved in receiving an acknowledgement packet or a data packet with valid ACK number (piggybacking):

Adapt the send window, that is, the receive window advertised by the communication partner (tcp_ack_update_window()).
Delete acknowledged packets from the retransmission queue (tcp_clean_rtx_queue()).
Check for Zero Window Probing acknowledgement.
Adapt the congestion window (tcp_may_raise_cwnd()).
Update the packet round-trip time (RTT) and the timeout for packet retransmissions (Retransmission TimeOut, RTO).
Retransmit packets and update the retransmission timer.
Activate the Fast Retransmit mode, if necessary.
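The RTT and RTO update in the list above is commonly done with Jacobson's fixed-point estimator. The following sketch shows that classic form, assuming that the smoothed RTT is stored scaled by 8 and the mean deviation scaled by 4. It is not the kernel's exact code, which additionally tracks mdev_max and rttvar; the names struct rtt_est, rtt_update(), and rtt_rto() are hypothetical.

```c
#include <stdint.h>

struct rtt_est {
    uint32_t srtt8;  /* smoothed round-trip time, scaled by 8 */
    uint32_t mdev4;  /* mean deviation, scaled by 4 */
};

/* Feeds one RTT sample (in clock ticks) into the estimator. */
void rtt_update(struct rtt_est *e, uint32_t sample)
{
    int32_t err;

    if (e->srtt8 == 0) {            /* first sample initializes both */
        e->srtt8 = sample << 3;     /* srtt = sample */
        e->mdev4 = sample << 1;     /* mdev = sample / 2 */
        return;
    }
    err = (int32_t)sample - (int32_t)(e->srtt8 >> 3);
    e->srtt8 = (uint32_t)((int32_t)e->srtt8 + err);  /* srtt += err / 8 */
    if (err < 0)
        err = -err;
    err -= (int32_t)(e->mdev4 >> 2);
    e->mdev4 = (uint32_t)((int32_t)e->mdev4 + err);  /* mdev += (|err| - mdev) / 4 */
}

/* RTO = srtt + 4 * mdev; the scaling makes this a shift and an add. */
uint32_t rtt_rto(const struct rtt_est *e)
{
    return (e->srtt8 >> 3) + e->mdev4;
}
```

Storing the values pre-scaled avoids divisions and keeps the fractional part of the averages between updates.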

`tcp_event_data_recv()`	net/ipv4/tcp_input.c

tcp_event_data_recv(tp, skb) handles all management work required when payload is received. This includes updating the maximum segment size, the timestamp, and the timer for delayed acknowledgements (Acknowledgement Timeout, ATO).

`tcp_data_snd_check()`	net/ipv4/tcp_input.c

tcp_data_snd_check(sk) checks whether data is ready and waiting in the transmit queue, and it starts the transmission if this is permitted by the transmit window of the sliding-window mechanism and by the congestion-control window. The actual transmission is initiated by tcp_write_xmit():

    static __inline__ void tcp_data_snd_check(struct sock *sk)
    {
        struct sk_buff *skb = sk->tp_pinfo.af_tcp.send_head;
        struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);

        if (skb != NULL) {
            if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
                tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
                tcp_write_xmit(sk))
                tcp_check_probe_timer(sk, tp);
        }
        tcp_check_space(sk);
    }
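The after() test in this listing compares 32-bit sequence numbers, which wrap around at 2^32. A minimal sketch of such wraparound-safe comparisons follows; it uses the common signed-difference idiom, and the names seq_before(), seq_after(), and exceeds_send_window() are chosen here for illustration only.

```c
#include <stdint.h>

/* 1 if sequence number a lies before b, modulo 2^32. The cast relies on
 * two's-complement conversion, as is usual for this idiom. */
int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* 1 if sequence number a lies after b, modulo 2^32. */
int seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

/* 1 if the end of a segment lies beyond the right edge of the send
 * window (snd_una + snd_wnd), as in the first test of the listing. */
int exceeds_send_window(uint32_t end_seq, uint32_t snd_una, uint32_t snd_wnd)
{
    return seq_after(end_seq, snd_una + snd_wnd);
}
```

The comparison is meaningful only as long as the two numbers are less than 2^31 apart, which holds for sequence numbers within one window.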

`tcp_ack_snd_check()`	net/ipv4/tcp_input.c

tcp_ack_snd_check(sk, ofo_possible) checks the various cases in which acknowledgements can be sent. It also determines the type of acknowledgement (i.e., whether it should be sent immediately or delayed):

    static __inline__ void tcp_ack_snd_check(struct sock *sk)
    {
        struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);

        if (!tcp_ack_scheduled(tp)) {
            /* We sent a data segment already. */
            return;
        }

        /* More than one full frame received... */
        if (((tp->rcv_nxt - tp->rcv_wup) > tp->ack.rcv_mss
             /* ... and right edge of window advances far enough.
              * (tcp_recvmsg() will send ACK otherwise). Or... */
             && __tcp_select_window(sk) >= tp->rcv_wnd) ||
            /* We ACK each frame or ... */
            tcp_in_quickack_mode(tp) ||
            /* We have out of order data. */
            (skb_peek(&tp->out_of_order_queue) != NULL)) {
            tcp_send_ack(sk);          /* Then ack it now */
        } else {
            tcp_send_delayed_ack(sk);  /* Else, send delayed ack. */
        }
    }
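The decision taken in this listing can be isolated into a pure function, which makes the three triggers for an immediate ACK easier to see. This is a sketch under simplifying assumptions: struct ack_input, enum ack_action, and ack_decide() are hypothetical, and the values that the kernel obtains from helper functions are passed in as plain fields.

```c
#include <stdint.h>

enum ack_action { ACK_NONE, ACK_NOW, ACK_DELAYED };

struct ack_input {
    int      ack_scheduled;  /* is an ACK pending at all? */
    uint32_t rcv_nxt;        /* sequence number expected next */
    uint32_t rcv_wup;        /* rcv_nxt at the last window update sent */
    uint32_t rcv_mss;        /* MSS estimate used for ACK decisions */
    uint32_t new_window;     /* window that could be advertised now */
    uint32_t rcv_wnd;        /* window advertised last */
    int      quickack_mode;  /* connection is in Quick ACK mode */
    int      ofo_queued;     /* out-of-order queue is not empty */
};

enum ack_action ack_decide(const struct ack_input *in)
{
    if (!in->ack_scheduled)
        return ACK_NONE;

    /* More than one full frame received and the window does not shrink,
     * or Quick ACK mode, or out-of-order data present: ACK at once. */
    if (((in->rcv_nxt - in->rcv_wup) > in->rcv_mss &&
         in->new_window >= in->rcv_wnd) ||
        in->quickack_mode ||
        in->ofo_queued)
        return ACK_NOW;

    return ACK_DELAYED;
}
```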

`tcp_fast_parse_options()`	net/ipv4/tcp_input.c

tcp_fast_parse_options(sk, th, tp) handles the Timestamp option in the TCP packet header. (See Section 24.4.1.) tcp_parse_options() is invoked if the packet header contains several options.

Handling Incoming Packets in Other States

`tcp_rcv_state_process()`	net/ipv4/tcp_input.c

The tcp_rcv_state_process() function processes incoming segments when the TCP connection is not in the ESTABLISHED state. It mainly handles state transitions and management work for the connection. The detailed process is described in Section 24.3.

24.2.2 Sending TCP Segments

This section describes how payload is sent over a TCP instance (i.e., how TCP segments containing payload are transmitted). The transmission of acknowledgements (ACKs) is initiated by incoming TCP segments or by the Delayed ACK timer. (See Sections 24.2.1 and 24.5.)

A TCP instance uses the send() system call to send payload. The send() system call causes the tcp_sendmsg() function to be invoked. This function is present as a handling routine for this system call in the tcp_prot structure (net/ipv4/tcp_ipv4.c). Figure 24-4 shows the invocation hierarchy during the process of sending payload over the TCP instance.

Figure 24-4. Sending payload in the TCP protocol instance.


`tcp_sendmsg()`	net/ipv4/tcp.c

tcp_sendmsg(sock, msg, size) copies payload from the user-address space into the kernel and starts sending this data in the form of TCP segments. Before it starts sending, however, it checks whether the connection has already been established (i.e., whether it is in the TCP_ESTABLISHED state). If no connection has been established yet, the system call waits in wait_for_tcp_connect() for a connection.

The next step computes the maximum segment size (tcp_current_mss()) and starts copying the data from the user-address space. First, it checks whether a "half empty" segment is present at the end of the socket's transmit queue (sk->write_queue), which could be used to pack data. Subsequently, or if no such small segment was available, tcp_alloc_skb() creates new socket buffers. The data to be sent is copied from the user-address space into the socket buffers, and tcp_send_skb() is invoked to append the data to the socket's transmit queue.
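The packing rule described above (fill a partly filled tail segment first, then allocate new buffers) can be expressed as simple arithmetic. The following sketch is an illustration only: the functions tail_room() and new_segments_needed() are hypothetical, and an empty transmit queue is modeled by passing tail_len equal to mss.

```c
#include <stddef.h>

/* Bytes of new data that still fit into the last queued segment. */
size_t tail_room(size_t tail_len, size_t mss)
{
    return tail_len < mss ? mss - tail_len : 0;
}

/* Number of new socket buffers needed to queue `size` bytes of payload
 * when the last queued segment already holds tail_len bytes. */
size_t new_segments_needed(size_t size, size_t tail_len, size_t mss)
{
    size_t room, rest;

    if (mss == 0)
        return 0;                       /* nothing sensible to compute */
    room = tail_room(tail_len, mss);
    rest = size > room ? size - room : 0;
    return (rest + mss - 1) / mss;      /* round up to whole segments */
}
```

For example, 3000 bytes of payload with an MSS of 1460 and a tail segment that already holds 1000 bytes first fill the remaining 460 bytes of the tail; the other 2540 bytes need two new buffers.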

Finally, the __tcp_push_pending_frames() routine takes TCP segments from the socket's transmit queue (sk->write_queue) and starts sending them.

`tcp_send_skb()`	net/ipv4/tcp_output.c

tcp_send_skb(sk, skb, force_queue, cur_mss) adds the socket buffer, skb, to the socket transmit queue (sk->write_queue) and decides whether transmission can be started or it has to wait in the queue. It uses the tcp_snd_test() routine to make this decision. If the result is negative, the socket buffer remains in the transmit queue.

If the result is positive, it starts sending the present segment, that is, it uses the tcp_transmit_skb() function to complete the TCP packet header and pass the segment to the IP instance. As shown in Figure 24-2, the latter function is also used in other places within the TCP instance for this purpose.

The timer for automatic retransmission is started in tcp_reset_xmit_timer() if the transmit process was successful. This timer expires if no acknowledgement for this packet has arrived after a specific time.

`tcp_snd_test()`	include/net/tcp.h

tcp_snd_test(tp, skb, cur_mss, nonagle) is an inline function that checks whether the TCP segment, skb, may be sent at the time of invocation. It verifies the criteria specified in RFC 1122 to ensure standard-compliant behavior: mainly, that there is sufficient space in the transmit and congestion-control windows, and that the Nagle algorithm permits sending this packet. The behavior of the participating algorithms is described in Section 24.4.

    /* This checks if the data bearing packet SKB (usually tp->send_head)
     * should be put on the wire right now.
     */
    static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
                                       unsigned cur_mss, int nonagle)
    {
        /*    RFC 1122 - Section 4.2.3.4:
         *     We must queue if
         *     a) The right edge of this frame exceeds the window
         *     b) There are packets in flight and we have a small segment
         *        [SWS avoidance and Nagle algorithm]
         *        (part of SWS is done on packetization)
         *        Minshall version sounds: there are no _small_
         *        segments in flight. (tcp_nagle_check)
         *     c) We have too many packets 'in flight'
         *
         *     Don't use the nagle rule for urgent data (or
         *     for the final FIN -DaveM).
         */
        return ((nonagle==1 || tp->urg_mode
                 || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
                ((tcp_packets_in_flight(tp) < tp->snd_cwnd) ||
                 (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
                !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
    }
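The three conditions of this test can be modeled in user space. The sketch below is a simplification, not the kernel's code: struct snd_state, struct segment, nagle_blocks(), and may_send_now() are hypothetical, and the Nagle check is reduced to the classic rule (hold back a small segment while data is in flight) instead of the Minshall variant mentioned in the comment.

```c
#include <stdint.h>

struct snd_state {
    uint32_t snd_una;            /* first unacknowledged byte */
    uint32_t snd_wnd;            /* window advertised by the peer */
    uint32_t snd_cwnd;           /* congestion window, in packets */
    uint32_t packets_in_flight;  /* sent but not yet acknowledged */
    int      nonagle;            /* nonzero: Nagle algorithm disabled */
    int      urg_mode;           /* nonzero: urgent data is being sent */
};

struct segment {
    uint32_t end_seq;  /* sequence number following the last byte */
    uint32_t len;      /* payload length in bytes */
    int      fin;      /* nonzero: segment carries the FIN flag */
};

/* Wraparound-safe "a lies after b" for sequence numbers. */
static int after32(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Simplified Nagle rule: hold back a small non-FIN segment while
 * other packets are still in flight. */
int nagle_blocks(const struct snd_state *s, const struct segment *seg,
                 uint32_t mss)
{
    return seg->len < mss && !seg->fin && !s->nonagle &&
           s->packets_in_flight > 0;
}

/* Returns 1 if the segment may be sent now, 0 if it must stay queued. */
int may_send_now(const struct snd_state *s, const struct segment *seg,
                 uint32_t mss)
{
    int nagle_ok = s->nonagle == 1 || s->urg_mode ||
                   !nagle_blocks(s, seg, mss);
    int cwnd_ok  = s->packets_in_flight < s->snd_cwnd || seg->fin;
    int wnd_ok   = !after32(seg->end_seq, s->snd_una + s->snd_wnd);

    return nagle_ok && cwnd_ok && wnd_ok;
}
```

As in the listing, the final FIN is exempt from both the Nagle rule and the congestion-window limit, but never from the peer's advertised window.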

`tcp_transmit_skb()`	net/ipv4/tcp_output.c

tcp_transmit_skb(sk, skb) is responsible for completing the TCP segment in the socket buffer, skb, and for subsequently sending it over the Internet Protocol. To this end, it first fills the TCP packet header with the appropriate values from the tcp_opt structure and with computed values such as the transmit credit determined by tcp_select_window(). tcp_syn_build_options() registers the TCP options for SYN packets, and tcp_build_and_update_options() registers the options for all other packets.

Subsequently, all actions required to send a payload-carrying segment or a segment with its ACK flag set are executed:

If the ACK flag is set, the number of permitted Quick ACK packets is decremented in the tcp_event_ack_sent() method (provided that the connection is in the Quick ACK mode). Subsequently, the timer for delayed ACKs is stopped, because the next step sends an acknowledgement.
If the segment to be sent carries payload, we first have to check whether an interval corresponding to the retransmission timeout has elapsed since the last data segment was sent (this time is stored in tp->lsndtime). If this is the case, then the congestion window, snd_cwnd, is set to the minimum value (tcp_cwnd_restart()), as specified in RFC 2861.
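RFC 2861 describes the reduction after an idle period as a decay: the congestion window is halved for each retransmission timeout that has elapsed without transmission, but it does not fall below a restart window. The following sketch illustrates that rule under stated assumptions; the function cwnd_after_idle() and its parameters are hypothetical, and the kernel's tcp_cwnd_restart() may differ in detail.

```c
#include <stdint.h>

/* cwnd and restart_cwnd are counted in packets; idle and rto are
 * measured in the same time unit (e.g., clock ticks). */
uint32_t cwnd_after_idle(uint32_t cwnd, uint32_t restart_cwnd,
                         uint32_t idle, uint32_t rto)
{
    if (rto == 0)
        return cwnd;                 /* no timeout known: leave as is */
    if (restart_cwnd > cwnd)
        restart_cwnd = cwnd;         /* never increase the window here */

    /* Halve once for every RTO that has fully elapsed while idle. */
    while (idle > rto && cwnd > restart_cwnd) {
        idle -= rto;
        cwnd >>= 1;
    }
    return cwnd > restart_cwnd ? cwnd : restart_cwnd;
}
```

After a long idle period, the result is the restart window, which corresponds to the minimum value mentioned above.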

Subsequently, the function pointer tp->af_specific->queue_xmit(), which references the transmit function matching the Internet Protocol version (e.g., ip_queue_xmit() for IPv4), passes the socket buffer to the IP layer for transmission.

Finally, the tcp_enter_cwr() method adapts the threshold value for the slow-start algorithm. This completes the transmit process for the TCP segment within the TCP instance. Only if no acknowledgement arrives for this packet within the retransmission timeout does the segment have to be retransmitted; see details in Section 24.5.

`tcp_push_pending_frames()`	include/net/tcp.h

tcp_push_pending_frames() checks whether there are segments ready for transmission that couldn't be sent during the regular transmission attempt. If this is the case, then tcp_write_xmit() initiates the transmission of these segments, provided that the tcp_snd_test() function agrees:

    struct sk_buff *skb = tp->send_head;

    if (skb) {
        if (!tcp_skb_is_last(sk, skb))
            nonagle = 1;
        if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
            tcp_write_xmit(sk))
            tcp_check_probe_timer(sk, tp);
    }
    tcp_cwnd_validate(sk, tp);

`tcp_write_xmit()`	net/ipv4/tcp_output.c

tcp_write_xmit(sk) continues to send segments from the transmit queue of the socket, sk, as long as it is allowed to do so by tcp_snd_test(). In this way, it checks whether the conditions of the TCP algorithms (e.g., slow-start method and congestion-control algorithm) are met. Data segments that would exceed the maximum segment size are fragmented first (tcp_fragment()). tcp_transmit_skb() handles the final completion of the TCP packet and passes it to the Internet Protocol.
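The loop structure of such a transmit routine can be sketched as follows. This is a deliberately reduced model, not the kernel's code: struct txq and write_xmit_sketch() are hypothetical, the send test is narrowed to the congestion window and the remaining peer window, and fragmentation is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* A transmit queue modeled as an array of segment lengths. */
struct txq {
    const uint32_t *seg_len;  /* payload length of each queued segment */
    size_t count;             /* number of segments in the array */
    size_t head;              /* index of the next segment to send */
};

/* Sends segments from the head of the queue until the send test fails.
 * Returns the number of segments sent in this call. */
size_t write_xmit_sketch(struct txq *q, uint32_t *in_flight,
                         uint32_t cwnd, uint32_t *wnd_left)
{
    size_t sent = 0;

    while (q->head < q->count) {
        uint32_t len = q->seg_len[q->head];

        if (*in_flight >= cwnd || len > *wnd_left)
            break;               /* send test failed: stay queued */
        *wnd_left -= len;        /* consume peer window */
        (*in_flight)++;          /* one more packet in flight */
        q->head++;
        sent++;
    }
    return sent;
}
```

The loop stops at the first segment that fails the test, so the order of the queue is preserved and the remaining segments wait for the next acknowledgement.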

`tcp_retransmit_skb()`	net/ipv4/tcp_output.c

tcp_retransmit_skb(sk, skb) retransmits a TCP segment. The segment may have to be fragmented (tcp_fragment()), or it might be joined with the next segment (tcp_retrans_try_collapse()).

`tcp_send_ack()`	net/ipv4/tcp_output.c

tcp_send_ack(sk) is responsible for building and sending ACK packets. For this purpose, it requests a socket buffer and assigns the corresponding values to this buffer. Subsequently, it uses tcp_transmit_skb() to send an ACK segment.

24.2.3 Data Structures of the TCP Instance

`struct tcp_opt`	include/net/sock.h

The tcp_opt structure contains all variables of the TCP algorithms for a TCP connection. For the sake of better understanding, the names of these variables were adopted from RFCs 793 and 1122 (apart from capitalization). This makes it easier to read the source code and to compare it with the TCP standard.

The tcp_opt structure is very complex, which shouldn't come as a surprise considering the complexity of the TCP protocol and the large number of its algorithms. A detailed description of the tcp_opt structure would go beyond the scope and volume of this chapter, especially because its variables are well documented and their names correspond to the TCP standard. Nevertheless, we will list the variables and their tasks, to facilitate quick reference when reading the following sections.

Among other things, the tcp_opt structure consists of variables for the following algorithms or protocol mechanisms:

sequence and acknowledgement numbers;
flow-control information;
packet round-trip time;
congestion control and congestion handling;
timers;
TCP options in the packet header; and
automatic and selective packet retransmission.

    struct tcp_opt {
        int   tcp_header_len;   /* Bytes of tcp header to send */

        /* Header prediction flags
         * 0x5?10 << 16 + snd_wnd in net byte order
         */
        __u32 pred_flags;

        __u32 rcv_nxt;          /* What we want to receive next */
        __u32 snd_nxt;          /* Next sequence we send */
        __u32 snd_una;          /* First byte we want an ack for */
        __u32 snd_sml;          /* Last byte of most recently xmitted small packet */
        __u32 rcv_tstamp;       /* timestamp of last received ACK (for keepalives) */
        __u32 lsndtime;         /* timestamp of last sent data packet (for restart window) */

        /* Delayed ACK control data */
        struct {
            __u8  pending;          /* ACK is pending */
            __u8  quick;            /* Scheduled number of quick acks */
            __u8  pingpong;         /* The session is interactive */
            __u8  blocked;          /* Delayed ACK was blocked by socket lock */
            __u32 ato;              /* Predicted tick of soft clock */
            unsigned long timeout;  /* Currently scheduled timeout */
            __u32 lrcvtime;         /* timestamp of last received data packet */
            __u16 last_seg_size;    /* Size of last incoming segment */
            __u16 rcv_mss;          /* MSS used for delayed ACK decisions */
        } ack;

        __u32 snd_wl1;          /* Sequence for window update */
        __u32 snd_wnd;          /* The window we expect to receive */
        __u32 max_window;       /* Maximal window ever seen from peer */
        __u32 pmtu_cookie;      /* Last pmtu seen by socket */
        __u16 mss_cache;        /* Cached effective mss, not including SACKS */
        __u16 mss_clamp;        /* Maximal mss, negotiated at connection setup */
        __u16 ext_header_len;   /* Network protocol overhead (IP/IPv6 options) */
        __u8  ca_state;         /* State of fast-retransmit machine */
        __u8  retransmits;      /* Number of unrecovered RTO timeouts. */
        __u8  reordering;       /* Packet reordering metric. */
        __u8  queue_shrunk;     /* Write queue has been shrunk recently. */
        __u8  defer_accept;     /* User waits for some data after accept */

        /* RTT measurement */
        __u8  backoff;          /* backoff */
        __u32 srtt;             /* smoothed round trip time << 3 */
        __u32 mdev;             /* medium deviation */
        __u32 mdev_max;         /* maximal mdev for the last rtt period */
        __u32 rttvar;           /* smoothed mdev_max */
        __u32 rtt_seq;          /* sequence number to update rttvar */
        __u32 rto;              /* retransmit timeout */
        __u32 packets_out;      /* Packets which are "in flight" */
        __u32 left_out;         /* Packets which leaved network */
        __u32 retrans_out;      /* Retransmitted packets out */

        /* Slow start and congestion control (see also Nagle and Karn & Part.) */
        __u32 snd_ssthresh;     /* Slow start size threshold */
        __u32 snd_cwnd;         /* Sending congestion window */
        __u16 snd_cwnd_cnt;     /* Linear increase counter */
        __u16 snd_cwnd_clamp;   /* Do not allow snd_cwnd to grow above this */
        __u32 snd_cwnd_used;
        __u32 snd_cwnd_stamp;

        /* Two commonly used timers in both sender and receiver paths. */
        unsigned long      timeout;
        struct timer_list  retransmit_timer;    /* Resend (no ack) */
        struct timer_list  delack_timer;        /* Ack delay */

        struct sk_buff_head out_of_order_queue; /* Out of order segments go here */

        struct tcp_func *af_specific;   /* AF_INET{4,6} specific operations */
        struct sk_buff  *send_head;     /* Front of stuff to transmit */

        __u32 rcv_wnd;          /* Current receiver window */
        __u32 rcv_wup;          /* rcv_nxt on last window update sent */
        __u32 write_seq;        /* Tail(+1) of data held in tcp send buffer */
        __u32 pushed_seq;       /* Last pushed seq, required to talk to windows */
        __u32 copied_seq;       /* Head of yet unread data */

        /* Options received (usually on last packet, some only on SYN packets) */
        char  tstamp_ok,        /* TIMESTAMP seen on SYN packet */
              wscale_ok,        /* Wscale seen on SYN packet */
              sack_ok;          /* SACK seen on SYN packet */
        char  saw_tstamp;       /* Saw TIMESTAMP on last packet */
        __u8  snd_wscale;       /* Window scaling received from sender */
        __u8  rcv_wscale;       /* Window scaling to send to receiver */
        __u8  nonagle;          /* Disable Nagle algorithm? */
        __u8  keepalive_probes; /* num of allowed keep alive probes */

        /* PAWS/RTTM data */
        __u32 rcv_tsval;        /* Time stamp value */
        __u32 rcv_tsecr;        /* Time stamp echo reply */
        __u32 ts_recent;        /* Time stamp to echo next */
        long  ts_recent_stamp;  /* Time we stored ts_recent (for aging) */

        /* SACKs data */
        __u16 user_mss;         /* mss requested by user in ioctl */
        __u8  dsack;            /* D-SACK is scheduled */
        __u8  eff_sacks;        /* Size of SACK array to send with next packet */
        struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
        struct tcp_sack_block selective_acks[4]; /* The SACKS themselves */

        __u32 window_clamp;     /* Maximal window to advertise */
        __u32 rcv_ssthresh;     /* Current window clamp */
        __u8  probes_out;       /* unanswered 0 window probes */
        __u8  num_sacks;        /* Number of SACK blocks */
        __u16 advmss;           /* Advertised MSS */

        __u8  syn_retries;      /* num of allowed syn retries */
        __u8  ecn_flags;        /* ECN status bits. */
        __u16 prior_ssthresh;   /* ssthresh saved at recovery start */
        __u32 lost_out;         /* Lost packets */
        __u32 sacked_out;       /* SACK'd packets */
        __u32 fackets_out;      /* FACK'd packets */
        __u32 high_seq;         /* snd_nxt at onset of congestion */

        __u32 retrans_stamp;    /* Timestamp of the last retransmit, also used in
                                 * SYN-SENT to remember stamp of the first SYN */
        __u32 undo_marker;      /* tracking retrans started here. */
        int   undo_retrans;     /* number of undoable retransmissions. */

        __u32 syn_seq;          /* Seq of received SYN. */
        __u32 fin_seq;          /* Seq of received FIN. */
        __u32 urg_seq;          /* Seq of received urgent pointer */
        __u16 urg_data;         /* Saved octet of OOB data and control flags */
        __u8  pending;          /* Scheduled timer event */
        __u8  urg_mode;         /* In urgent mode */
        __u32 snd_up;           /* Urgent pointer */

        /* The syn_wait_lock is necessary only to avoid tcp_get_info having
         * to grab the main lock sock while browsing the listening hash
         * (otherwise it's deadlock prone).
         */
        rwlock_t               syn_wait_lock;
        struct tcp_listen_opt *listen_opt;

        /* FIFO of established children */
        struct open_request   *accept_queue;
        struct open_request   *accept_queue_tail;

        int   write_pending;    /* A write to socket waits to start.
*/   unsigned int    keepalive_time;      /* time before keep alive takes place  */   unsigned int    keepalive_intvl;     /* interval between keep alive probes  */   int             linger2; };
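
The comment on srtt notes that the smoothed round-trip time is stored shifted left by 3, that is, as a fixed-point value scaled by 8. The following minimal sketch shows how such a scaled estimator can be updated with integer arithmetic only. It is a simplified, classic Jacobson-style update, not the kernel's own estimator; the type demo_rtt, the function demo_rtt_update(), and the scaling of the deviation by 4 are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical illustration type; not the kernel's struct tcp_opt. */
struct demo_rtt {
    uint32_t srtt;  /* smoothed RTT, scaled by 8 (as in the srtt comment) */
    uint32_t mdev;  /* mean deviation, scaled by 4 (assumption)           */
    uint32_t rto;   /* retransmission timeout, unscaled                   */
};

/* Feed one RTT sample (in arbitrary time units) into the estimator. */
static void demo_rtt_update(struct demo_rtt *r, uint32_t sample)
{
    if (r->srtt == 0) {
        /* First measurement: srtt = sample, mdev = sample / 2. */
        r->srtt = sample << 3;
        r->mdev = sample << 1;
    } else {
        /* Error between the new sample and the current estimate. */
        int32_t err = (int32_t)sample - (int32_t)(r->srtt >> 3);

        /* srtt = 7/8 * srtt + 1/8 * sample, in scaled form. */
        r->srtt = (uint32_t)((int32_t)r->srtt + err);

        if (err < 0)
            err = -err;

        /* mdev = 3/4 * mdev + 1/4 * |err|, in scaled form. */
        r->mdev = (uint32_t)((int32_t)r->mdev + err - (int32_t)(r->mdev >> 2));
    }

    /* rto = srtt + 4 * mdev; mdev is already stored multiplied by 4. */
    r->rto = (r->srtt >> 3) + r->mdev;
}
```

Keeping the values scaled avoids both floating-point arithmetic and the rounding loss that repeated integer division by 8 would cause. Starting from a zeroed demo_rtt, two samples of 100 give an rto of 300 and then 250, because the deviation decays while the samples stay constant.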

`struct tcp_skb_cb`	include/net/tcp.h

 struct tcp_skb_cb {
        ...
        __u32         seq;                /* Starting sequence number */
        __u32         end_seq;            /* SEQ + FIN + SYN + datalen */
        __u32         when;               /* used to compute rtt's */
        __u8          flags;              /* TCP header flags */
#define TCPCB_FLAG_FIN            0x01
#define TCPCB_FLAG_SYN            0x02
#define TCPCB_FLAG_RST            0x04
#define TCPCB_FLAG_PSH            0x08
#define TCPCB_FLAG_ACK            0x10
#define TCPCB_FLAG_URG            0x20
#define TCPCB_FLAG_ECE            0x40
#define TCPCB_FLAG_CWR            0x80
        __u8          sacked;             /* State flags for SACK/FACK */
#define TCPCB_SACKED_ACKED        0x01    /* SKB ACK'd by a SACK block */
#define TCPCB_SACKED_RETRANS      0x02    /* SKB retransmitted */
        __u16         urg_ptr;            /* Valid w/URG flags is set */
        __u32         ack_seq;            /* Sequence number ACK'd */
 };

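The preceding structure represents the TCP control block of a packet. Every socket buffer carries one such block in the cb[] field of its sk_buff structure, so the per-segment information that TCP needs (sequence numbers, header flags, and SACK state) travels with the packet.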
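
The comment on end_seq (SEQ + FIN + SYN + datalen) reflects the fact that the SYN and FIN flags each consume one sequence number, just like a payload byte. The following sketch illustrates that arithmetic in user space. The helpers demo_end_seq() and demo_seq_before() are hypothetical names chosen for this example; only the two flag values are taken from the listing above.

```c
#include <stdint.h>

/* Flag values copied from the TCPCB_FLAG_* definitions above. */
#define TCPCB_FLAG_FIN 0x01
#define TCPCB_FLAG_SYN 0x02

/*
 * Hypothetical helper: sequence number that follows a segment starting
 * at seq with the given flags and datalen payload bytes.
 * Unsigned arithmetic wraps modulo 2^32, as TCP sequence numbers do.
 */
static uint32_t demo_end_seq(uint32_t seq, uint8_t flags, uint32_t datalen)
{
    uint32_t end = seq + datalen;

    if (flags & TCPCB_FLAG_SYN)
        end++;                  /* SYN occupies one sequence number */
    if (flags & TCPCB_FLAG_FIN)
        end++;                  /* FIN occupies one sequence number */
    return end;
}

/*
 * Hypothetical helper: wrap-safe test whether a lies before b in
 * sequence space (valid while the two are less than 2^31 apart).
 */
static int demo_seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

A plain data segment with seq 1000 and 500 bytes of payload ends at 1500, whereas a bare SYN with seq 1000 ends at 1001. Comparing sequence numbers through a signed difference rather than with < keeps the result correct when the 32-bit counter wraps around.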

`struct tcphdr`	include/linux/tcp.h

 struct tcphdr {
        __u16         source;
        __u16         dest;
        __u32         seq;
        __u32         ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
        __u16         res1:4, doff:4, fin:1, syn:1, rst:1,
                      psh:1, ack:1, urg:1, ece:1, cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
        __u16         doff:4, res1:4, cwr:1, ece:1, urg:1,
                      ack:1, psh:1, rst:1, syn:1, fin:1;
#else
#error "Adjust your <asm/byteorder.h> defines"
#endif
        __u16         window;
        __u16         check;
        __u16         urg_ptr;
 };

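Section 24.1.2 introduced the protocol header. The tcphdr structure maps this header onto memory. Because compilers lay out bit fields differently on little-endian and big-endian architectures, the structure declares doff, res1, and the flag bits in one of two orders, selected by __LITTLE_ENDIAN_BITFIELD or __BIG_ENDIAN_BITFIELD.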
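
To illustrate why the two bit-field orders are needed, the following sketch extracts the same header fields from raw bytes with shifts and masks, which gives the same result regardless of the host's byte order. The structure demo_tcp_fields and the function demo_parse_tcp() are hypothetical names used only for this example; they are not part of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields, in host byte order. */
struct demo_tcp_fields {
    uint16_t source;
    uint16_t dest;
    uint32_t seq;
    uint32_t ack_seq;
    uint8_t  doff;    /* header length in 32-bit words          */
    uint8_t  flags;   /* CWR ECE URG ACK PSH RST SYN FIN (bit 0) */
    uint16_t window;
};

/*
 * Decode the fixed part of a TCP header from p (len bytes available).
 * Returns 0 on success, -1 if the buffer or the data offset is invalid.
 */
static int demo_parse_tcp(const uint8_t *p, size_t len,
                          struct demo_tcp_fields *out)
{
    if (len < 20)                 /* minimum TCP header size */
        return -1;

    out->source  = (uint16_t)((p[0] << 8) | p[1]);
    out->dest    = (uint16_t)((p[2] << 8) | p[3]);
    out->seq     = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
                   ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
    out->ack_seq = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16) |
                   ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
    out->doff    = p[12] >> 4;    /* upper nibble of byte 12 */
    out->flags   = p[13];         /* flag bits fill byte 13  */
    out->window  = (uint16_t)((p[14] << 8) | p[15]);

    /* The data offset must cover the fixed header and fit the buffer. */
    if (out->doff < 5 || (size_t)out->doff * 4 > len)
        return -1;
    return 0;
}
```

With this layout, the flag byte uses the same bit values as the TCPCB_FLAG_* constants of struct tcp_skb_cb (FIN is 0x01, SYN is 0x02, and so on), so a SYN/ACK segment yields a flags value of 0x12.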