7.4 Network Protocols

If we think of the media layers -- Ethernet, FDDI, and ATM -- as highways, then the network protocols are the rules of the road. It is not enough simply to have a strip of concrete; rather, we must paint lines to guide the traffic on its way. How can we get from San Francisco to Boston? When cars arrive simultaneously at a stop sign, who is allowed to pass first? [15] These rules are the domain of the two major network protocols: the Internet Protocol (IP) and the Transmission Control Protocol (TCP). Continuing the highway analogy, IP is analogous to the rules for laying out the interstate highway system and describing how to get from one city to the next, whereas TCP is a set of rules for how to travel on the roads.

[15] As we'll see later when we discuss TCP, the cars are allowed to smash into each other, and then new cars are dispatched. Wouldn't it be nice if automobile accidents were so easily managed?

7.4.1 IP

IP is a network-layer protocol (see Section 7.1.1 earlier in this chapter) conceived in the early 1970s by Vint Cerf and Bob Kahn. Other examples of network-layer protocols include IPX and AppleTalk, but we will not discuss them here.

IP specifies a header attached to the beginning of every piece of data transferred. Together, the header and data form a packet. Table 7-8 describes the format of the IP header.

Table 7-8. IP header

    Bits      Content
    0-3       Version
    4-7       Header length
    8-15      Type of service
    16-31     Total length
    32-47     Identifier
    48        Unused
    49        Flag: don't fragment (DF)
    50        Flag: more fragments (MF)
    51-63     Fragment offset
    64-71     Time-to-live
    72-79     Protocol
    80-95     Header checksum
    96-127    Source address
    128-159   Destination address
    160-191   Options and padding

The version field is usually set to 4, as the current version of IP in widespread use is Version 4 (IPv4). IP Version 6 (IPv6) is slowly gaining ground, but is not yet in widespread deployment. Its performance impact has not been well characterized, so we will not discuss it here. The header length specifies the length of the header in 32-bit words. The type of service field identifies special handling requirements for the packet as implemented by quality of service protocols, such as classification and priority handling. The total length field specifies the length of the entire packet, including the header, in bytes. [16]

[16] The maximum decimal number representable by 16 bits (the size of the total length field) is 65,535; as a result, the maximum possible size of an IP packet is 65,535 bytes.

7.4.1.1 Fragmentation

The identifier field controls fragmentation of a packet, in conjunction with the fragment offset field and the two flag bits. A packet must be fragmented into multiple smaller packets if its length exceeds the maximum transmission unit (MTU) of the underlying data link. For example, a 6,500-byte packet travelling along an Ethernet network with an MTU of 1,500 bytes will be fragmented into a number of packets, each no larger than 1,500 bytes. When the "Don't fragment" flag is set, the packet cannot be fragmented; if fragmentation becomes necessary, the packet is dropped and an error message is returned to the sender. When a router fragments a packet, it sets the "More fragments" bit to 1 in all but the last fragment, so that the receiver knows when the last fragment arrives. The fragment offset field encodes the information necessary to reassemble the fragments in the correct order.
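To make the arithmetic concrete, here is a small Python sketch of how a router might split that 6,500-byte packet. The function is illustrative only -- it assumes a fixed 20-byte header with no options, and is not any stack's actual code:

    def fragment(total_len, mtu, header_len=20):
        # Sketch of IPv4 fragmentation arithmetic. The fragment offset
        # field counts 8-byte units, so every fragment's data (except
        # the last) must be a multiple of 8 bytes.
        data_len = total_len - header_len
        per_frag = (mtu - header_len) // 8 * 8
        frags, offset = [], 0
        while data_len > 0:
            chunk = min(per_frag, data_len)
            data_len -= chunk
            # The MF bit is 1 on all but the last fragment
            frags.append((offset // 8, chunk + header_len, data_len > 0))
            offset += chunk
        return frags  # (offset field, fragment length, MF flag)

    # Our 6,500-byte packet over a 1,500-byte MTU becomes five fragments:
    # [(0, 1500, True), (185, 1500, True), (370, 1500, True),
    #  (555, 1500, True), (740, 580, False)]
    print(fragment(6500, 1500))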

7.4.1.2 Time-to-live

The time-to-live field is set to a fixed number (generally 64, but values such as 15 and 32 are often seen) when the packet is first generated. As the packet makes its way through the routers in an inter-network, this number is decremented. If the number reaches zero, the packet is discarded and an error message is returned to the source. This prevents "zombie" packets that are unable to find their destinations from roaming the inter-network and feasting on the packets of the living.

7.4.1.3 Protocols

The protocol field describes the type of data being transported. There are more than 100 assigned protocol numbers, but a few of the more common ones are noted in Table 7-9.

Table 7-9. Common IP protocol numbers

    Protocol number   Description
    1                 Internet Control Message Protocol (ICMP)
    6                 Transmission Control Protocol (TCP)
    17                User Datagram Protocol (UDP)
    88                Enhanced Interior Gateway Routing Protocol (EIGRP)
    89                Open Shortest Path First (OSPF, a routing protocol)

The header checksum contains the checksum for the header, obviously. However, this field must be recalculated at every router, because the time-to-live will have been decremented. The source and destination addresses are the 32-bit IP addresses.

7.4.1.4 IP addresses

The IP address is a 32-bit binary value that consists of a network portion, which uniquely identifies a network, and a host portion, which uniquely identifies a host on that network. Because it is cumbersome to write down 32-bit binary values, IP addresses are often represented in "dotted decimal" notation, in which each of the four 1-byte chunks is represented as a decimal number between 0 and 255. Dotted decimal notation is just a convenient way for humans to read IP addresses; all the network hardware ever sees is a 32-bit binary value.
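Because this conversion comes up constantly when working with masks, here is a quick Python sketch of it; the helper names are mine, not from the original text:

    def to_int(dotted):
        # Dotted decimal to the 32-bit value the hardware actually sees
        a, b, c, d = (int(x) for x in dotted.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d

    def to_dotted(value):
        return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))

    print(hex(to_int("172.16.146.38")))   # 0xac109226
    print(to_dotted(0xAC109226))          # 172.16.146.38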

One of the most interesting characteristics of an IP address is that the size of the network and host portions can vary depending on the size of the network. IP was designed to be very flexible, supporting everything from huge networks with millions of hosts down to small networks with only a few. This makes IP addresses somewhat difficult to manage.

7.4.1.5 Classful addressing

The first scheme developed to manage IP addresses divides networks into three broad categories, or classes:

Class A network

Very large, with many hosts. There are few class A networks.

Class B network

Of moderate size, with a moderate number of hosts. Class B networks are much more common than class A networks.

Class C network

Very small (fewer than 250 attached hosts). Class C networks are common.

As a result, the network and host portions are split up according to Table 7-10.

Table 7-10. Network class summary

    Class   Network portion   Available       Host portion   Available hosts
            (bytes)           networks [17]   (bytes)        per network
    A       1                 127             3              16,777,216
    B       2                 32,767          2              65,536
    C       3                 8,388,607       1              256

[17] These are not quite what you would expect, because a portion of the class A address space is used for class B and C addresses.

An IP address in which the host portion is set to all zeros refers to the network as a whole, and is called the network address. Determining what network a host belongs to is the role of the address mask, a 32-bit binary number in which the bits that correspond to the network portion are all set to 1. Computing the bitwise AND of any IP address with its address mask will always return the network address. For an example, see Table 7-11.

Table 7-11. IP addresses, network masks, and network addresses on a class B network

                       Dotted decimal   1st byte   2nd byte   3rd byte   4th byte
    Host IP address    172.16.146.38    10101100   00010000   10010010   00100110
    Network mask       255.255.0.0      11111111   11111111   00000000   00000000
    Network address
    (bitwise AND)      172.16.0.0       10101100   00010000   00000000   00000000
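You can cross-check Table 7-11 with Python's standard ipaddress module, which performs the bitwise AND for you; a minimal sketch:

    import ipaddress

    # strict=False lets us pass a host address rather than a network address
    net = ipaddress.ip_network("172.16.146.38/255.255.0.0", strict=False)
    print(net.network_address)   # 172.16.0.0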

Public Versus Private Address Space

Some IP addresses are reserved, per RFC 1918, for "private" use; that is, they may never appear on a public network. The reserved addresses are 10.0.0.0-10.255.255.255, 172.16.0.0-172.31.255.255, and 192.168.0.0-192.168.255.255.

Private addresses are often used on IP networks that don't participate in the Internet, such as point-to-point interconnects in a datacenter, or in networks that are "hidden" from the public Internet by means of a firewall or a Network Address Translation (NAT) mechanism.

7.4.1.6 Subnetting classful networks

Subnetworking can be a complex topic; it's something most people don't understand very well. Before we get started, then, a word of advice.

When working with subnets, it pays to work in binary. Dotted decimal notation exists merely for human convenience; the machines think in binary.

In order to improve the efficiency of assigning IP addresses, a concept called subnetting evolved. Subnetting breaks up a single large network into multiple smaller networks by extending the network portion by some number of bits. That is, the network portion intrudes into the host portion.

For example, let's say we wish to segment the 172.16.0.0 class B network. If we extend the network mask by 8 bits, that is, from 2 bytes to 3 bytes, we are now able to assign up to 256 subnets underneath the single class B address. Let's look at Table 7-12 to see how this looks in binary.

Table 7-12. Subnetting a class B network

                       Dotted decimal   1st byte   2nd byte   3rd byte   4th byte
    Host IP address    172.16.146.38    10101100   00010000   10010010   00100110
    Subnet mask        255.255.255.0    11111111   11111111   11111111   00000000
    Subnet address
    (bitwise AND)      172.16.146.0     10101100   00010000   10010010   00000000

We say that the extra eight bits in the subnet mask are the subnet portion, in complement to the host and network portions that we've already defined. Note that the network address is still 172.16.0.0!

Subnet masks are represented in three ways: the dotted decimal form, as is shown in Table 7-12; the hexadecimal form, which is simply the 32-bit binary value converted to a hexadecimal value (e.g., in our example in Table 7-12, the subnet mask in hexadecimal would be 0xFFFFFF00); and the bitcount form, which is the network address followed by a slash and the number of bits used for the subnet mask (e.g., continuing with the example given, 172.16.0.0/24).

If you need to design the addressing for a subnet, there are a few things you must keep in mind:

  • There are two "unusable" IP addresses in every subnet: the subnet address itself and the broadcast address.

  • The number of usable subnets in a network for a given subnet mask is given by 2^ls - 2, where ls is the length of the subnet portion in bits.

  • The number of usable host addresses in a subnet is given by 2^lh - 2, where lh is the length of the host portion in bits.

Our subnet example deals with a case where the subnet space falls on a byte boundary. This is not always the case -- nor even advantageous. Let's say you have the 192.168.144.0 class C network, and you need to support 6 departments, each with 25 systems. This is easily done, but only if the subnet portion is 3 bits long (2^3 - 2 gives us 6 possible subnets -- hope we don't add another department!). This leaves us with 5 bits for each subnetwork's host portion, or 30 hosts per network. How do we find the subnet mask for this division? Let's look at Table 7-13.

Table 7-13. Subnetting a class C network

                       Dotted decimal    1st byte   2nd byte   3rd byte   4th byte
    Network address    192.168.144.0     11000000   10101000   10010000   00000000
    Subnet mask        255.255.255.224   11111111   11111111   11111111   11100000

With three bits for the subnet portion, we can write down all the possible combinations: 000, 001, 010, 011, 100, 101, 110, and 111. We also know that the first and last addresses, 000 and 111, cannot be used. By calculating all the possible combinations for the five bits of the host portion of the address, we can work out the range of allowed host addresses for that subnet. For a summary of these calculations, see Table 7-14.

Table 7-14. Computing address ranges for a subnetted class C network

    Subnet   Network address   Broadcast address   Host address range
    001      192.168.144.32    192.168.144.63      192.168.144.33-192.168.144.62
    010      192.168.144.64    192.168.144.95      192.168.144.65-192.168.144.94
    011      192.168.144.96    192.168.144.127     192.168.144.97-192.168.144.126
    100      192.168.144.128   192.168.144.159     192.168.144.129-192.168.144.158
    101      192.168.144.160   192.168.144.191     192.168.144.161-192.168.144.190
    110      192.168.144.192   192.168.144.223     192.168.144.193-192.168.144.222
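Table 7-14 is also easy to cross-check with a sketch using the ipaddress module. Note that a modern, classless tool lists all eight /27s, including the all-zeros (000) and all-ones (111) subnets that the classful rules above exclude:

    import ipaddress

    net = ipaddress.ip_network("192.168.144.0/24")
    # Borrow 3 bits for the subnet portion: /24 -> /27 (255.255.255.224)
    for sub in net.subnets(prefixlen_diff=3):
        hosts = list(sub.hosts())           # the 30 usable host addresses
        print(sub.network_address, sub.broadcast_address,
              f"{hosts[0]}-{hosts[-1]}")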

The reason for the advice I gave when we started this adventure should now be (perhaps painfully) clear: when you are given an address, you have no idea just by looking at it whether it is a subnet address, a broadcast address, or a host address, and even knowing the subnet mask does not make things obvious. The key to working with subnets is to work in binary.

7.4.1.7 Moving to a classless world

By about 1990, the Internet was faced with two serious growth problems. As the Internet grew in popularity, there was a flood of new classful networks, and every one had to be included in the routing tables. As a result, the routers were starting to run out of memory, and spending far too much time doing address lookups. It had also become apparent that the pace of requests for new class B networks would soon exhaust the available supply. The Internet Engineering Task Force (IETF) devised a solution, which is known as classless interdomain routing (CIDR), simply classless routing, or supernetting.

CIDR is based on an extension of the principle of subnetting that we've already discussed. Supernetting involves moving the boundary between the host and network portion to the left, not just the right. Groups of neighboring classful networks can be combined into single routing table entries. Groups of class C networks can be assigned in batches of 2, 4, 8, or 16 to fill the needs of organizations that would otherwise have requested a class B network.

Instead of a subnet mask to distinguish between the network portion and the host portion of the address, classless IP addressing uses a length indication. This length indication, which consists of a slash and a number following the network address, is known as a variable length subnet mask, or VLSM. The number after the slash specifies how many bits make up the network portion of the address. For example, 192.168.208.0/21 specifies 21 bits for the network portion. Using the standard subnet mask notation, this is equivalent to 255.255.248.0.

The big benefit of classless addressing is that it improves the efficiency of allocating IP addresses. For example, if an organization is assigned a class B address, it has 65,536 addresses available for use. If that organization has only 1,000 hosts, the remainder of the addresses are reserved but remain unused. In a classless world, the organization could simply be assigned one network with a VLSM of 22, which would provide it with 1,024 addresses (the equivalent of 4 "aggregated" class C networks). The equivalent netmask would be 255.255.252.0.
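A quick sketch of that arithmetic, using a hypothetical 10.1.0.0/22 assignment (the address block is mine, chosen for illustration):

    import ipaddress

    net = ipaddress.ip_network("10.1.0.0/22")   # hypothetical /22 assignment
    print(net.num_addresses)                    # 1024
    print(net.netmask)                          # 255.255.252.0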

7.4.1.8 Routing

One thing that we will not discuss here is the means by which routers discover and share information with each other, facilitating communications across IP-based internetworks. If you are interested in learning more about that, there is an excellent (if voluminous) reference, Routing TCP/IP by Jeff Doyle, published by Cisco Press.

7.4.2 TCP

The Transmission Control Protocol, or TCP, describes a connection-oriented, reliable communications protocol; it provides the appearance of a point-to-point connection. Such a connection has two characteristics: there is only one route from the source to the destination, and packets arrive in the same order that they are sent. However, because the underlying delivery mechanism (IP) is connectionless, TCP can only provide the appearance of a connection. Just like a telephone connection, TCP must establish a connection, transfer data, then disconnect when the data transmission is complete. TCP uses three mechanisms to accomplish this:

  • The first byte of each packet is labeled with a sequence number, so that the receiving service can reassemble out-of-order packets correctly.

  • There is a system of acknowledgments, checksums, and timers that provide reliability. For example, the receiving end can notify the sender that a packet has failed to arrive, and the sender will resend any packet that is not acknowledged by the receiving end within a certain amount of time.

  • There is a mechanism to control the flow of packets, called windowing . This decreases the chances of a packet being dropped because the received-packet buffer has overflowed.

TCP attaches a header to the data being sent, which contains fields for the source and destination addresses of the application (ports) as well as all the TCP control data. The data plus the TCP header is then generally encapsulated into an IP packet for delivery. The TCP header is rather complex; it is summarized in Table 7-15.

Table 7-15. Transmission Control Protocol header

    Bits       Content
    0-15       Source port
    16-31      Destination port
    32-63      Sequence number
    64-95      Acknowledgment number
    96-99      Header length
    100-105    Reserved
    106        Flag: Urgent (URG)
    107        Flag: Acknowledgment (ACK)
    108        Flag: Push (PSH)
    109        Flag: Reset (RST)
    110        Flag: Synchronize (SYN)
    111        Flag: Final (FIN)
    112-127    Window size
    128-143    Checksum
    144-159    Urgent pointer
    160-191+   Options and padding

The sequence number identifies the position of the encapsulated data byte within the stream. For example, if a segment has a sequence number of 5,748 and contains 256 bytes of data, the next segment will have a sequence number of 6,004. This "expected next sequence number" value is reported in the acknowledgment number field, so that the sending host knows not only whether packets have been lost, but which ones as well.
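The arithmetic is simple enough to sketch in Python; the only subtlety is that sequence numbers count bytes and wrap modulo 2^32:

    def next_seq(seq, payload_len):
        # Sequence numbers count bytes and wrap modulo 2**32
        return (seq + payload_len) % 2**32

    print(next_seq(5748, 256))   # 6004, the expected next sequence number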

The header length field specifies the length of the header in 32-bit increments. The reserved field is always set to six zeroes. The window size is used for flow control: it specifies the number of bytes, starting with the byte indicated by the acknowledgment number, that the sender of the packet will accept from the other end of the connection before requiring an acknowledgment.

The checksum is a 16-bit checksum of both the header and the encapsulated data, used for error detection. If the urgent flag is set, the urgent pointer field is used; when added to the sequence number, it specifies the end of the urgent data. The options field contains data specified by the application; the most common use of this field is to send the maximum segment size, which tells the receiver the largest segment size the sender is willing to accept. The remainder of the field is padded with zeroes to make an even 32-bit length.

Viewing and Setting TCP Tunables: ndd

ndd is a tool that allows you to view and set tunables on Solaris systems without rebooting. While it can be used on most devices that have a device file in /dev and a kernel module, we will use it primarily to tune TCP.

You can use ndd interactively:

    # ndd /dev/tcp
    name to get/set ? tcp_slow_start_initial
    value ?
    length ?
    1
    name to get/set ? ^D

You can also use it to display all available parameters:

    # ndd /dev/icmp \?
    ?                        (read only)
    icmp_wroff_extra         (read and write)
    icmp_ipv4_ttl            (read and write)
    icmp_ipv6_hoplimit       (read and write)
    icmp_bsd_compat          (read and write)
    icmp_xmit_hiwat          (read and write)
    icmp_xmit_lowat          (read and write)
    icmp_recv_hiwat          (read and write)
    icmp_max_buf             (read only)
    icmp_status              (read only)

You can also query many parameters at once:

    # ndd -get /dev/hme link_status link_speed link_mode
    1
    0
    0

And, finally, you can use it to set tunables:

    # ndd -set /dev/tcp tcp_slow_start_initial 2

In Linux, the TCP tunables are generally adjusted permanently by editing the kernel source code and recompiling. They can also be adjusted at runtime by editing the files in /proc/sys/net/ipv4.
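Those /proc files can be read and written like any other file; a minimal Python sketch (the helper names are mine, and setting a value requires root):

    def get_tunable(name):
        with open("/proc/sys/net/ipv4/" + name) as f:
            return f.read().strip()

    def set_tunable(name, value):
        with open("/proc/sys/net/ipv4/" + name, "w") as f:
            f.write(str(value))

    print(get_tunable("tcp_window_scaling"))   # typically 1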

Please note that in our further discussions of TCP, I will provide details about the protocol's inner workings. Those of you with experience in this area will immediately realize that I am simplifying some things -- TCP is a complex protocol, and a full treatment is well beyond the scope of this text. The best reference for TCP is, by far, W. Richard Stevens's TCP/IP Illustrated, Volume 1 (Addison-Wesley). It is simply outstanding in explaining how TCP works, and I strongly suggest you read it.

7.4.2.1 Connection initiation and SYN flooding

TCP defines a specific way of creating a new TCP connection, called the three-way handshake. There are many queues and tunable values that are used during this instantiation process. In order to tune effectively, you need to understand not only what is happening on the wire, but also what is happening in the system; namely, how listen() and accept() interact with the queues.

When the server calls listen(), the kernel moves the socket from the TCP state CLOSED into the LISTEN state. Concurrently, the kernel creates and initializes various data structures, including the socket buffers and two queues: the incomplete connection queue and the completed connection queue.

When the first packet in a TCP connection (in which only the SYN bit is set) arrives from the client, an entry is added to the incomplete connection queue. The server sends a packet back to the client in which the ACK bit is set (to signal acknowledgment of the client's SYN), as well as the SYN bit. The socket is now in the SYN_RCVD state. The entry remains in this queue for about one round-trip time, until the client's ACK of the server's SYN is received; the connection is then moved to the completed connection queue.

The completed connection queue contains an entry for every active connection (one in which the three-way handshake has been successfully completed, but the user application has not yet accepted the connection). The sockets in the completed connection queue are in the ESTABLISHED state. Every call to accept() removes the front entry from the queue; if there are no entries, then the call to accept() usually blocks. There is a parameter, tcp_conn_req_min, which defines the minimum number of available connections in the completed connection queue in order for select() or poll() to return "readable" on a listening socket descriptor. You shouldn't ever need to touch this tunable.

Both of the queues I described are limited in size. When listen() is called, the server can specify the size of the second queue. Historically, this value has defined the number of entries for the sum of both queues. Solaris systems since 2.6 include a "fudge factor" (multiplying the argument by three-halves). The incomplete connection queue typically requires more entries than the completed connection queue; the only reason to specify a large backlog value is to enable the incomplete connection queue to grow as SYNs arrive from clients. Solaris has two kernel tunables that set the size of these lists: tcp_conn_req_max_q0 is the maximum number of connections with incomplete handshakes (the incomplete connection queue), whereas tcp_conn_req_max_q is the maximum number of completed connections waiting to be returned by an accept() call (the completed connection queue).

The default values for tcp_conn_req_max_q0 and tcp_conn_req_max_q are 1,024 and 128, respectively. You can inspect the values of tcpListenDrop and tcpListenDropQ0 in the output of the netstat -sP tcp command. If you see many drops, you may need to increase these tunables. tcp_conn_req_max_q0 probably shouldn't be set above about 10,000; a reasonable upper bound for tcp_conn_req_max_q is the current value of tcp_conn_req_max_q0.

If you're working under Linux, the equivalent to the tcp_conn_req_max_q0 parameter is tcp_max_syn_backlog , which can be found in /proc/sys/net/ipv4/ .
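To see where the backlog argument fits, here is a minimal sketch of a listening server in Python; the kernel silently clamps the requested backlog to the tunables just described:

    import socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", 8080))
    # Request a completed-connection queue of 1,024 entries; the kernel
    # clamps this to its own limits (tcp_conn_req_max_q on Solaris)
    server.listen(1024)
    # accept() removes the front entry from the completed connection
    # queue, blocking if the queue is empty
    conn, peer = server.accept()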

Solaris versions prior to 2.5.1 (with patches 103630-09 and 103582-12 or above applied) have a single, unified connection queue that is governed by tcp_conn_req_max . While you can tune this parameter, the recommended course of action is to upgrade or apply the current version of the two TCP patches to your system.

A certain type of denial-of-service attack, called SYN flooding, involves sending a large number of SYN packets with nonexistent source addresses. Because the server's SYN (the second packet of the handshake) is never acknowledged, the listen queue fills up, and new connections get through only as old ones time out and are discarded from the queue. Whenever a dubious connection is discarded, the tcpHalfOpenDrop counter is incremented; a high value indicates that a SYN flood was likely attempted. If you observe this behavior, you can improve your protection by increasing tcp_conn_req_max_q0.

7.4.2.2 Path MTU discovery and the maximum segment size

During the three-way handshake, the value for the maximum segment size (MSS) over that connection is determined. This value defines the largest amount of information that will be sent in a single TCP packet. The MSS is set to the minimum of the smallest MTU (maximum transmission unit, the largest amount of data that can be sent in one packet on a given interface) of an outgoing interface and the MSS announced by the other end of the session. If the remote peer does not announce an MSS, the value 536 will normally be assumed. If path MTU discovery is active, all outgoing packets have the IP option DF set (for "don't fragment").

All versions of Solaris set the default MSS to 536 bytes. This is not always what you want, and can be remedied by adjusting the tcp_mss_def_ipv4 parameter: [18]

[18] This parameter is called tcp_mss_def in pre-8 versions of Solaris.

    # ndd -set /dev/tcp tcp_mss_def_ipv4 1460

The value to which the MSS should be set is determined by the MTU of the most commonly used outgoing interface, less 40 bytes for the TCP/IP headers. Since Ethernet has an MTU of 1,500 bytes, tcp_mss_def_ipv4 should be set to 1,460. Note that if the other side of the connection demands a lower MSS, that request will be honored. You shouldn't have to tune this under Linux.

If the TCP stack receives the ICMP error "Message fragmentation needed," it means that a router on the way to the destination needed to fragment the message, but was not allowed to do so (because the IP DF bit was set). As a result, the router had to throw the packet away, and sent back the ICMP error to notify the sender. Most newer routers include the MTU of the restricting link in the error message, but if it isn't included, the correct MSS must be determined by trial and error.

RFC 1191 recommends rediscovering the path MTU of a TCP circuit every ten minutes. Unfortunately, early versions of Solaris (pre-2.5) try to rediscover the path MTU every 30 seconds. This is acceptable behavior in LAN environments, but it is very rude for wide area networks, as it unnecessarily consumes network capacity. The tunable is ip_ire_pathmtu_interval, which should be set to 600,000. You can control path MTU discovery in Solaris by means of the ip_path_mtu_discovery tunable, which you should leave set to 1. In Linux, it is set by the ip_no_pmtu_disc tunable in /proc/sys/net/ipv4/, which you should leave set to 0.

7.4.2.3 Buffers, watermarks, and windows

In order to understand the various buffers and watermarks in a modern TCP stack, we need to briefly discuss what happens when you send a piece of data through the various layers of the stack. This section provides a simplified explanation in order to ease understanding. An application can send almost any size of data to the socket buffers, which reside in the transport layer -- in this case, TCP. All of the sent application data will be copied to the socket buffer. If there is insufficient space in the buffer, the application will be put to sleep. The TCP module then segments the contents of this socket buffer, with no segment exceeding the MSS. Data is only removed from the socket buffer when it is acknowledged by the remote peer. These segments are then passed down to the IP layer, where they are fragmented if they are too large for the underlying physical link interface (larger than the MTU).

The kernel-wide tunable for the size of a specific protocol's receive buffer is the receive high watermark. In Solaris, these are given by the tcp_recv_hiwat and udp_recv_hiwat tunables, or by calling setsockopt() with the SO_RCVBUF parameter. By default, both the tcp_recv_hiwat and udp_recv_hiwat tunables are 8,192 before Solaris 8, and 24,576 in Solaris 8. I would set these to something in the range of 16 KB to 32 KB for most applications, but it may be necessary to experiment in your particular environment. Also, keep in mind that this parameter can strongly influence memory usage: if you have 512 concurrent connections, and both the reception and transmission high watermarks are set to 64 KB, you will need 64 MB of memory for your TCP socket buffers alone. A related parameter is tcp_recv_hiwat_minmss, which is the number of maximum-sized segments that must be able to fit into a receive buffer; it's not possible to set a receive buffer size that can't fit at least this many segments. By default, this parameter is 4, and should be left alone.

The transmit high watermark, tcp_xmit_hiwat, effectively determines the size of the send buffer. [19] This can be set on a per-socket basis by the socket option SO_SNDBUF. When the amount of data in the send buffer is greater than tcp_xmit_lowat but less than tcp_xmit_hiwat, the socket is reported as nonwriteable. The default for tcp_xmit_hiwat is 8,192, and you should set it to something in the neighborhood of 32 KB to 48 KB. Since the transmit buffer has to store all unacknowledged segments, it's reasonable to make it larger than the receive buffer.

[19] UDP effectively has no transmit buffering.

As a word of warning, setting this watermark near 65,535 may have the side effect of turning on the TCP window scaling option, because these values are rounded up to multiples of the MTU for each connection. In some cases, you may not want to use window scaling. In order to avoid accidentally using it, you should make sure that tcp_recv_hiwat and tcp_xmit_hiwat are set to at least 1 MTU below 65,535. For Ethernet interfaces, 64,000 is a good choice.

There is also a set of low watermarks, tcp_xmit_lowat and udp_xmit_lowat, which you should not need to change. Applications can use the SO_SNDLOWAT and SO_RCVLOWAT socket options to change these if necessary.

The tcp_max_buf and udp_max_buf parameters specify the maximum buffer size that can be requested by a setsockopt( ) call. By default, these values are 1,048,576 and 262,144, respectively. An attempt to specify a larger buffer size will fail with an EINVAL return code.

On pre-2.4 Linux systems, the equivalent interfaces are located in /proc/sys/net/core. These are rmem_max, rmem_default, wmem_max, and wmem_default. By default, they are all set to 65,536. Starting with 2.4, there is a new socket buffer autosizing algorithm, loosely controlled by a set of TCP-specific parameters, tcp_rmem and tcp_wmem, located in /proc/sys/net/ipv4/. Each of these files contains three values: the minimum, the default, and the maximum size allowed for each buffer.

  • The minimum size for each buffer is guaranteed to each TCP socket, even when the system is memory-constrained. By default, it is set to 8 KB.

  • The default size for each buffer overrides the rmem_default value in /proc/sys/net/core/ , which is still used for non-TCP protocols. The default is 87,380 bytes, which results in a TCP window size of 65,535 bytes.

  • The maximum specified size of the receive buffers does not override the value set by rmem_max in /proc/sys/net/core/ . Using setsockopt( ) with SO_RCVBUF does, however. The default is 174,760 bytes.

The TCP window is closely related to the socket buffer. A TCP window is the amount of outstanding data (unacknowledged by the recipient) that can be sent on a particular connection before the sending stack receives an acknowledgment from the receiver. For example, if two hosts are communicating over a TCP connection that has a window size of 64 KB, the sender can only send 64 KB of data; then it must stop and wait for an acknowledgment of some, or all, of the data. If the receiver acknowledges that all the data has been received, the sender is free to transmit another 64 KB. If the receiver acknowledges only the first 32 KB (for example, if the last 32 KB hasn't gotten there yet, or has been lost in transit), then the sender can only transmit another 32 KB. The primary intent of the window is congestion control. The entire end-to-end network connection, which encompasses both hosts, all the intermediate routers, and all the intermediate connections (be they fiber, copper, or satellite-based relay), can only handle data at some peak speed. If transmission occurs too fast, the bottleneck is overrun, and some data will be lost. The TCP window serves to "throttle down" the transmission speed to a level where congestion and the accompanying data loss don't occur. The TCP header specifies a 16-bit field to convey the window size. Therefore, the maximum window size is 2^16, or 65,536 bytes.

The fastest bulk data transfer rate can be found by dividing the window size your peer offers by the round-trip time. The optimum window size is therefore the bandwidth-delay product: how fast your connection can run, multiplied by the time it takes to send a data packet out and receive an acknowledgment back. [20] This makes intuitive sense: the lower the round-trip time or the larger the window, the faster you can transmit. For example, if you are using a 10 Mb/s (~1.0 MB/s) Ethernet link and you observe an average latency of 6 ms to your most frequently visited hosts, then the product is 6 KB and the default value of 8 KB is sufficient. If the latency increases past 8 ms, however, increasing the buffer size may be helpful. This sort of tuning is particularly useful over slow, high-latency links; with a round-trip time of 4 seconds and a 12 KB/s data rate, our buffers should be 48 KB! TCP window tuning really needs to be done on the client end, and is most effective on busy systems that take a long time to turn around a reply. Systems on high speed networks will probably also benefit.

[20] You can get a rough idea of the latency by using ping -s hostname .
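The calculation is trivial, but worth writing down; here is a sketch that reproduces the Ethernet example above:

    def bdp_bytes(bytes_per_second, rtt_seconds):
        # Bandwidth-delay product: the window needed to keep the pipe full
        return bytes_per_second * rtt_seconds

    # ~1.0 MB/s Ethernet with a 6 ms round trip: the product is 6 KB,
    # so the default 8 KB buffer is sufficient
    print(bdp_bytes(1.0e6, 0.006))   # 6000.0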

In order to work around the performance limitations imposed by a maximum window size of 64 KB over very high latency links, RFC 1323 introduced a technique known as window scaling. Window scaling expands the definition of the TCP window to 32 bits, and uses a scaling factor to carry this 32-bit value in the original 16-bit field. The scaling factor is carried in a new TCP option, Window Scale. This option is sent only in segments that have the SYN bit set, meaning window scaling is fixed in each direction when a connection is opened. This three-byte option accomplishes two things:

  • Indicates that TCP is prepared to do both send and receive window scaling.

  • Specifies the scale factor that should be applied to its receive window. A TCP stack that is prepared to scale its windows should send this option, even if its own scale factor is 1. The scale factor is a logarithmically encoded power of 2. Both sides must send Window Scale options in their SYN segments in order to enable window scaling in either direction.

The window scaling behavior is controlled in Solaris by means of the tcp_wscale_always TCP tunable, which is zero by default. If tcp_wscale_always is set to a nonzero value, then the TCP window scaling option will always be negotiated during connection initiation. Otherwise, window scaling is only used if the buffer size is larger than 64 KB. The Linux equivalent is tcp_window_scaling in /proc/sys/net/ipv4/ .

7.4.2.4 Retransmissions

Sometimes data is lost on the network. The reasons for data loss are many: perhaps a backhoe cut a buried network cable, a building service worker unplugged a router by accident, some piece of network hardware failed, etc. The exact reasons for data loss aren't important; what we will consider are the mechanisms TCP uses to control retransmission of data.

The most basic tunable for this behavior is tcp_rexmit_interval_initial, which specifies how long to wait, in milliseconds, before data that has been sent but not acknowledged is retransmitted. The default value of this parameter since Solaris 2.5.1 has been 3,000, which is a reasonable length of time for general use; laboratory environments not connected to the Internet might do better with a shorter time (say, 400 ms).

After the initial retransmission, further retransmissions will start after tcp_rexmit_interval_min milliseconds have elapsed. Since Solaris 8, this value defaults to 400; in earlier releases, it was 200. It should probably be set to roughly half of the tcp_rexmit_interval_initial value.

Further retransmissions follow an exponential backoff algorithm, but the inter-retransmission delay is capped at tcp_rexmit_interval_max milliseconds. Since Solaris 2.6, the default has been 240,000 (which is compliant with RFC 1122); Solaris 8 reduced this time to 60,000.
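One plausible reading of these three tunables, sketched in Python; this illustrates the shape of the schedule, not the actual timer code:

    def rexmit_schedule(initial=3000, minimum=400, maximum=60000, tries=8):
        # First retransmission after `initial` ms; subsequent ones back off
        # exponentially from `minimum` ms, capped at `maximum` ms
        # (Solaris 8 defaults)
        delays, d = [initial], minimum
        for _ in range(tries - 1):
            delays.append(min(d, maximum))
            d *= 2
        return delays

    print(rexmit_schedule())
    # [3000, 400, 800, 1600, 3200, 6400, 12800, 25600]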

Two related values are tcp_ip_abort_interval and tcp_ip_abort_cinterval. These specify the time, in milliseconds, that retransmissions should be attempted before the RST segment is sent and the TCP connection is reset. The tcp_ip_abort_cinterval parameter is specific to connections that have not yet reached the ESTABLISHED state (that is, they are still in the process of handshaking). The defaults for these parameters are 480,000 and 180,000, respectively. tcp_ip_abort_interval should probably be increased to about 600,000.

7.4.2.5 Deferring acknowledgments

One very useful set of tunables on Solaris systems is the set that governs deferred acknowledgments. These parameters interact to control when acknowledgments are sent by the receiver for data transmissions on a local area network. The first parameter is tcp_deferred_ack_interval, which is measured in milliseconds; the second is tcp_deferred_acks_max, which defines the maximum number of segments that can be received before an ACK is immediately sent.

In order to understand how these parameters interact, let's look at an example. When the first inbound packet is received, two things happen: the deferred acknowledgment timer is started and the packet count is increased. As more packets come in, there are two possible outcomes. One is that the packet count becomes equal to tcp_deferred_acks_max, causing an acknowledgment to be sent, the timer to be reset, and the packet count to be set to zero. The other outcome is that the deferred acknowledgment timer fires (when it reaches tcp_deferred_ack_interval). This causes an acknowledgment to be sent, even if the number of accumulated packets is less than tcp_deferred_acks_max; thereafter, deferred acknowledgments are disabled, and acknowledgments are sent every other packet (which is how the TCP stack functions when talking to hosts farther away than the local area network). [21] By default, tcp_deferred_ack_interval is 100 milliseconds in Solaris 8; it was 50 milliseconds before that. The default for tcp_deferred_acks_max is 8 packets. This behavior is compliant with RFC 1122, which specifies the delayed acknowledgment behavior.

[21] This is done to avoid causing problems with systems that are following the slow-start algorithm.
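The decision reduces to a simple predicate; a sketch with the Solaris 8 defaults (the function name is mine):

    def should_ack_now(packets_seen, ms_since_first,
                       max_packets=8, interval_ms=100):
        # Acknowledge as soon as either limit is reached: the packet
        # count (tcp_deferred_acks_max) or the timer
        # (tcp_deferred_ack_interval)
        return packets_seen >= max_packets or ms_since_first >= interval_ms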

7.4.2.6 Window congestion and the slow start algorithm

When the connection is negotiated, the latency and bandwidth available to the connection are not yet known. To avoid flooding a slow network, TCP implements a slow start algorithm, which limits the number of packets it sends until it sees an acknowledgment. This limit is called the congestion window; the standard says that the initial congestion window should be one packet, and that the window should grow with each acknowledgment received, effectively doubling every round-trip time. [22] There are two defects in most TCP implementations that relate to the slow start algorithm:

[22] This does not work well for short connections because there are not enough packets sent to get the window size increased.

  • Most TCP implementations start with a default congestion window of 2. This is so widely done that changing the tcp_slow_start_initial variable from 1 to 2 in Solaris releases prior to 7 is a good idea; it (slightly) increases transmission speed on lightly loaded networks. Solaris releases after 7 have this configured by default.

  • A related problem occurs in the Microsoft TCP implementation: it does not immediately acknowledge receipt of a single packet, instead waiting either for a second packet or for a short delay period. After receiving two packets, it acknowledges immediately. This causes higher response times than normal over high speed links; the latency of wide area networks or low speed links hides the problem. Changing the tcp_slow_start_initial variable as described solves the problem. You can't tune this behavior in Linux.

7.4.2.7 TCP timers and intervals

There are three other interesting timers at play in TCP. These are the keepalive timer, the TIME_WAIT timer, and the FIN_WAIT_2 timer.

The keepalive timer is one of the more contentious issues in TCP stack tuning. The intent of the keepalive system is to free host resources by scavenging TCP connections in which the remote host is not responding. This comes at a cost in network utilization. The most common use of the keepalive timer system is in web servers. Recall that most HTTP transactions follow a simple, four-step process:

  1. The HTTP client opens a connection to the HTTP server.

  2. The client forwards its query (requests a file).

  3. The server responds with the file (responds to the request).

  4. The server closes the connection.

If the client system or browser crashes, hangs, or is otherwise stopped prior to issuing its request, the server is forced to keep the connection open indefinitely. If you are running a busy web server, you should probably set this value to some reasonably small value, on the order of 5 to 10 minutes. It is controllable by the tcp_keepalive_interval parameter in Solaris. Linux actually implements three parameters to control this behavior, all of which reside in /proc/sys/net/ipv4. tcp_keepalive_time specifies how often TCP sends out keepalive messages when keepalive is enabled (by default, every two hours). The tcp_keepalive_probes tunable specifies how many probes TCP will send out before it decides that the connection is broken (by default, nine). Finally, tcp_keepalive_interval specifies how frequently the probes are dispatched; when multiplied by tcp_keepalive_probes, it describes how long a nonresponsive connection is allowed to linger before being killed. tcp_keepalive_interval has a default value of 75 seconds.

The TCP TIME_WAIT state has a negative reputation, and people often frantically try to avoid it. It is actually a very good thing. A TCP connection is assigned to the TIME_WAIT state after it is closed, but before its resources are reused. The maximum segment lifetime (MSL) is the maximum time that a TCP segment may survive in the network. Waiting twice this time ensures that there are no leftover segments coming in, and that it is safe to reuse the socket resource. This is the reasoning behind the TIME_WAIT state. You can probably safely reduce the length of time a socket waits in the TIME_WAIT state. A good starting point is twice the round-trip time to an extremely remote site: 60,000 or so is reasonable. You can tune this directly in Solaris by means of the tcp_time_wait_interval tunable, which is in units of milliseconds (note that this parameter was called tcp_close_wait_interval in releases of Solaris before 7). Linux systems cannot tune this parameter directly, as they already implement a fast TCP TIME_WAIT recycling algorithm (governed by the tcp_tw_recycle parameter, which you shouldn't change). However, Linux does implement a tcp_max_tw_buckets parameter, which is the maximum number of TIME_WAIT sockets held by the kernel at any one time. If this number is exceeded, enough TIME_WAIT sockets are immediately destroyed to get underneath the limit, and a warning is printed. This is intended primarily as a defense against simple denial-of-service attacks; never lower this limit.

The FIN_WAIT_2 timer is used when a connection is being closed. FIN_WAIT_2 is a TCP state that occurs when a connection has been actively closed, but the FIN from the other side hasn't arrived yet -- and it might not ever arrive. A crashed or misbehaving client may cause the connection to reside in FIN_WAIT_2 for a long time: until the FIN_WAIT_2 timer expires. If netstat -f inet (netstat --inet in Linux) shows many connections in the FIN_WAIT_2 state, you should probably decrease the value of this timer; it is governed by the tcp_fin_wait_2_flush_interval TCP tunable in Solaris, or the /proc/sys/net/ipv4 Linux tunable tcp_fin_timeout. The Solaris default is 675,000 ms, and the Linux value as of the 2.4 kernels is 60 seconds (it was 180 seconds in 2.2). Solaris systems should probably tune the value down to 67,500 (one order of magnitude less).

7.4.2.8 The Nagle algorithm

The Nagle algorithm is a means of trading latency for throughput by aggregating small packets. It states that, under some circumstances, there will be a waiting period of 200 milliseconds before data is sent. This algorithm was published in RFC 896 in an attempt to solve a simple problem involving interactive connections such as telnet; every interactive keystroke across a telnet connection normally generates a single data packet. Furthermore, the remote telnet server is responsible for echoing the character back. This generates up to four segments (the transmission of the keystroke, the transmission of the echoed keystroke, and two acknowledgment packets). These small packets, called tinygrams, are not normally a problem on LANs, but they can contribute to severe congestion on WANs. In essence, the Nagle algorithm serves to collect small amounts of data and send them in a single segment. The algorithm has the further advantage of being self-clocking: the faster acknowledgments come back, the faster the data is sent.

The following rules determine when data should be sent in the absence of an acknowledgment (see the sketch after this list):

  • If there are no unacknowledged small packets, send this one immediately.

  • If the TCP window is not full and the packet size is equal to or larger than the maximum segment size, send an MTU-sized buffer immediately.

  • If the TCP window is not full, and either the interface is idle or the TCP_NODELAY flag is set, send the buffer immediately.

  • If there is less than half the TCP window in outstanding data, send the data immediately.

  • Otherwise, wait up to 200 milliseconds for more data to be sent before sending the buffer.
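Expressed as code, the decision looks roughly like the following sketch -- a paraphrase of the rules above, not any stack's actual implementation:

    def nagle_send_now(unacked_small_packets, window_full, segment_size,
                       mss, interface_idle, nodelay, outstanding, window):
        if not unacked_small_packets:
            return True                  # nothing small in flight
        if not window_full and segment_size >= mss:
            return True                  # a full-sized segment can go
        if not window_full and (interface_idle or nodelay):
            return True
        if outstanding < window / 2:
            return True                  # less than half the window is out
        return False                     # else buffer for up to 200 ms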

When an acknowledgment arrives, the Nagle buffer is flushed immediately (notice that we have now sent a number of MTU-sized packets and, perhaps, one small packet as well).

The Nagle algorithm can interact very poorly with the deferred acknowledgment timer: each side ends up waiting on the other, so no more data can be sent until the receiver's acknowledgment timer fires (unless the small packet follows a full packet, in which case we can expect things to work properly). This causes severe performance problems.

For many (but certainly not all) workloads, the Nagle algorithm should probably be disabled, by setting the tcp_naglim_def variable to 1. Applications that want to disable the Nagle algorithm on a particular socket should use the TCP_NODELAY socket option when setting up the socket.

Tuning TCP parameters carries with it the possibility of causing grief to everyone on your network, not just yourself. Please be careful.

7.4.3 UDP

The User Datagram Protocol, or UDP, is a connectionless, "best effort" protocol. In light of the high reliability afforded by TCP, it can be hard to imagine why anyone would wish to use UDP. The advantage is that UDP avoids the connection setup and teardown that TCP requires; the data is simply sent. This results in very low latencies, and applications that send data in short bursts over reliable networks will see a performance advantage.

Another reason for UDP's better performance is that it uses a small header (only 64 bits). See Table 7-16.

Table 7-16. User Datagram Protocol header

    Bits     Content
    0-15     Source port
    16-31    Destination port
    32-47    UDP length
    48-63    Checksum

The source and destination ports are identical to those discussed in the TCP header. The UDP length is the length of the entire segment in bytes, and the checksum is an optional field that contains the checksum for the entire segment.
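The whole header fits in a single struct format string; a Python sketch (the port numbers are arbitrary):

    import struct

    def udp_header(src_port, dst_port, length, checksum=0):
        # Four 16-bit fields, packed in network (big-endian) byte order
        return struct.pack("!HHHH", src_port, dst_port, length, checksum)

    hdr = udp_header(53, 33333, 8 + 12)   # 8-byte header plus 12 data bytes
    print(len(hdr))                       # 8 bytes: 64 bits, per Table 7-16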

As UDP is a much simpler protocol, it does not have nearly as many tunable parameters. The most important are the sizes of the transmission and reception buffers, which are discussed along with their TCP counterparts; see Section 7.4.2.3 earlier in this chapter.

7.4.4 TCP Versus UDP for Network Transport

One common conundrum is whether TCP or UDP should be used as the network transport for a specific application that offers a choice (e.g., NFS). In general, there are three key differences to take into consideration:

  • UDP provides higher performance, in theory. [23]

    [23] The truth is that most vendors have spent their time tuning their TCP implementations, so you will need to conduct experiments to measure the actual difference in your specific environment.

  • TCP has higher overhead, but also guarantees reliability, which UDP cannot.

  • TCP controls transmission costs in error-prone environments very well, e.g., over wide area networks.

In general, UDP can provide a significant performance increase over fast, well-maintained, switched local area networks (for example, in the context of a machine room), but this is highly dependent on your vendor's particular implementation, and the only way to know is to test.


