TCP/IPv4 Protocol Kernel Parameters

This section covers some of the more useful IPv4 parameters. The IPv4 category includes the protocol-specific parameters. We pay particular attention here to the TCP protocol parameters. When a socket is created, the protocol and address families are specified. For a TCP socket, the address family is AF_INET and the type is SOCK_STREAM. For a UDP protocol socket, the address family is AF_INET but the type is SOCK_DGRAM. TCP socket buffer sizes are controlled by their own parameters rather than the core kernel buffer size parameters.

TCP Buffer and Memory Management

tcp_rmem

net.ipv4.tcp_rmem (/proc/sys/net/ipv4/tcp_rmem)

This variable is an array of three integers:

net.ipv4.tcp_rmem[0] = minimum size of the read buffer

net.ipv4.tcp_rmem[1] = default size of the read buffer

net.ipv4.tcp_rmem[2] = maximum size of the read buffer

When a socket is created, the initial size of the read buffer, in bytes, is controlled by the core default socket sizes. If it is a TCP socket, however (address family AF_INET, type SOCK_STREAM), the read buffer size is set to the TCP protocol-specific default, which is the second integer of this array. To increase the initial size of the TCP read buffer, this is the variable to increase. Note that the maximum value it can take is still limited by the core socket maximum size, so this value cannot be greater than net.core.rmem_max.

The minimum and maximum TCP read buffer size limits (the first and the third integers in the tcp_rmem array) are used by the kernel while dynamically tuning the size of the buffer. The default values of this parameter are shown in Table 12-2.

Table 12-2. Default TCP Socket Read Buffer Sizes

    Buffer Parameters   Minimum                Default                Maximum
                        net.ipv4.tcp_rmem[0]   net.ipv4.tcp_rmem[1]   net.ipv4.tcp_rmem[2]
    Low Memory          PAGE_SIZE              43689                  43689*2
    Normal              4KB                    87380                  87380*2
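
For example, the current values can be inspected and raised at run time with the sysctl command; the numbers shown here are purely illustrative:

    # Show the current TCP read buffer settings and the core cap
    sysctl net.ipv4.tcp_rmem
    sysctl net.core.rmem_max

    # Raise the core cap, then the TCP-specific min/default/max (example values)
    sysctl -w net.core.rmem_max=262144
    sysctl -w net.ipv4.tcp_rmem="4096 87380 262144"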


tcp_wmem

net.ipv4.tcp_wmem (/proc/sys/net/ipv4/tcp_wmem)

As with the read buffer, the TCP socket write buffer is also an array of three integers:

net.ipv4.tcp_wmem[0] = minimum size of the write buffer

net.ipv4.tcp_wmem[1] = default size of the write buffer

net.ipv4.tcp_wmem[2] = maximum size of the write buffer

As with the TCP read buffer, the TCP write buffer is controlled by the TCP protocol-specific parameters listed previously. Having a large write buffer is beneficial in that it allows the application to transfer a large amount of data to the write buffer without blocking. The default TCP socket write buffer sizes are shown in Table 12-3.

Table 12-3. Default TCP Socket Write Buffer Sizes

    Buffer Parameters   Minimum                Default                Maximum
                        net.ipv4.tcp_wmem[0]   net.ipv4.tcp_wmem[1]   net.ipv4.tcp_wmem[2]
    Low Memory          4KB                    16KB                   64KB
    Normal              4KB                    16KB                   128KB
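
The write buffer limits are adjusted in the same way; again, the numbers are only illustrative:

    # Raise the core send-buffer cap, then the TCP write buffer min/default/max (example values)
    sysctl -w net.core.wmem_max=262144
    sysctl -w net.ipv4.tcp_wmem="4096 16384 262144"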


TCP is a "windowing" transmission protocol. The window that a receiver advertisesin other words, the amount of data that the receiver says it can consumeis used by the sender to determine how much data can be sent. The larger the receiver socket buffer, the larger the window advertised by TCP and the faster the sender can send data.

A significant improvement in performance can be obtained by simply increasing these values. Note that the kernel attempts to round the value provided to the nearest multiple of the approximate segment size.

Table 12-4 shows some typical values these parameters are set to for large networking applications.

Table 12-4. Examples of Increased TCP Socket Buffer Sizes

    Buffer Parameters                     Minimum [0]   Default [1]   Maximum [2]
    TCP Write Buffer (net.ipv4.tcp_wmem)  8K            436600        873200
    TCP Read Buffer (net.ipv4.tcp_rmem)   32K           436600        873200
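
To make settings such as those in Table 12-4 persist across reboots, the equivalent lines can be placed in /etc/sysctl.conf and loaded with sysctl -p; the entries below simply restate the values from the table:

    # /etc/sysctl.conf entries corresponding to Table 12-4 (8K = 8192, 32K = 32768)
    net.ipv4.tcp_wmem = 8192 436600 873200
    net.ipv4.tcp_rmem = 32768 436600 873200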


tcp_mem

net.ipv4.tcp_mem[] (/proc/sys/net/ipv4/tcp_mem)

This kernel parameter is also an array of three integers that are used to control memory management behavior by defining the boundaries of memory management zones:

net.ipv4.tcp_mem[0] = pages below which TCP does not consider itself under memory pressure

net.ipv4.tcp_mem[1] = pages at which TCP enters memory pressure region

net.ipv4.tcp_mem[2] = pages at which TCP refuses further socket allocations (with some exceptions)

The term pages refers to the amount of memory, in pages, allocated globally to sockets in the system. The Linux kernel maintains limits on how much memory can be allocated at any given time.

Therefore, memory allocation failures while opening sockets can occur even though the system still has available memory.

For systems that need to support busy workloads, increasing the tcp_mem[] parameter benefits performance. For example, here are some values that it can be set to for a large networking workload:

 net.ipv4.tcp_mem = 1000000 1001024 1002048 

This change should be made with caution, because it comes with a trade-off: a large amount of memory is consumed by each socket buffer, which limits the memory available for other activity in the system.
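
One quick way to see how close the system is to these limits is to compare the current TCP page consumption, reported by /proc/net/sockstat, with the configured thresholds:

    # The "mem" field on the TCP line is the number of pages currently allocated to TCP sockets
    cat /proc/net/sockstat
    sysctl net.ipv4.tcp_mem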

TCP Options

The following sections describe the TCP options most likely to be of use to the system administrator.

tcp_window_scaling

net.ipv4.tcp_window_scaling (/proc/sys/net/ipv4/tcp_window_scaling)

This kernel variable enables the TCP window scaling feature. It is on by default. Window scaling implements the window scale option defined in RFC 1323.

The window scale option is required to employ TCP window sizes larger than 64K. Because the window field in the TCP header, which carries the advertised receive buffer size, is only 16 bits wide, the window size cannot otherwise be larger than 64K.

Window scaling allows buffers larger than 64K to be advertised, thereby enabling the sender to fill network pipes whose bandwidth-delay product is larger than 64K. Particularly for connections over satellite links and on networks with large round-trip times, employing large windows (greater than 64K) results in a significant improvement in performance.

Optimum throughput is achieved by maintaining a window at least as large as the bandwidth-delay product of the network. If that quantity is larger than 64K, window scaling can potentially provide improved throughput, with a trade-off between the overhead of creating and processing the TCP option and the much larger amount of data that can be sent in transit without delay.
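
As an illustration, consider a 100Mbps path with a 100ms round-trip time. The bandwidth-delay product is then

    (100,000,000 bits/s / 8) x 0.1 s = 1,250,000 bytes (roughly 1.2MB)

which is far larger than 64K, so window scaling, together with suitably large socket buffers, is needed to keep such a pipe full.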

It should be noted that socket buffers larger than 64K are still potentially beneficial even when window scaling is turned off.

Although the kernel does not advertise or allow windows larger than that limit, having a larger buffer enables the kernel to absorb a greater amount of data from the application one write() call at a time, and hence return earlier.

When applications perform a write() operation in normal blocking mode, they have to wait until the write() system call completes. Having a large enough buffer to stage the data in the kernel allows the call to return immediately. If the buffer were not large enough, the application would have to wait until the send buffer drained sufficiently to absorb all the data from the write; that draining typically requires the data to be sent to the destination and an acknowledgment to be received in return, which frees up the space. When the send buffer is set to a very small value, the throughput measured is therefore a more accurate reflection of how fast data is actually transferred to the destination across the network.

tcp_sack

net.ipv4.tcp_sack (/proc/sys/net/ipv4/tcp_sack)

This variable enables the TCP Selective Acknowledgments (SACK) feature. SACK is a TCP option for congestion control that allows the receiving side to convey information to the sender regarding missing sequence numbers in the byte stream. This reduces the number of segments the sender has to retransmit in the case of a lost segment and also reduces the delays that might be suffered, thereby improving overall throughput.

SACK is expected to be beneficial when loss occurs frequently and round-trip times are long. In environments such as high-speed local networks with very short round-trip times and negligible loss, performance can actually be improved by turning SACK off, because doing so avoids the overhead of processing SACK options.
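
On such a network, SACK can be turned off at run time (and re-enabled the same way with a value of 1):

    # Disable Selective Acknowledgments
    sysctl -w net.ipv4.tcp_sack=0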

tcp_dsack

net.ipv4.tcp_dsack (/proc/sys/net/ipv4/tcp_dsack)

This variable enables the TCP D-SACK feature, which is an enhancement to SACK to detect unnecessary retransmits. It is enabled by default, and it should be disabled if SACK is disabled.

tcp_fack

net.ipv4.tcp_fack (/proc/sys/net/ipv4/tcp_fack)

This variable enables the TCP Forward Acknowledgment (FACK) feature. FACK is a refinement of the SACK protocol to improve congestion control in TCP. It should also be disabled if SACK is disabled.
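
Because D-SACK and FACK both build on SACK, a site that disables SACK would typically disable all three together; for example:

    # If SACK is turned off, turn off D-SACK and FACK as well
    sysctl -w net.ipv4.tcp_sack=0
    sysctl -w net.ipv4.tcp_dsack=0
    sysctl -w net.ipv4.tcp_fack=0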

TCP Connection Management

TCP is a connection-oriented protocol. The following describes some of the parameters that help the kernel manage its connections to remote hosts. Tuning these parameters can affect the number of connections the system can support simultaneously, an important consideration for busy servers.

tcp_max_syn_backlog

net.ipv4.tcp_max_syn_backlog (/proc/sys/net/ipv4/tcp_max_syn_backlog)

This variable controls the length of the TCP SYN queue for each port. Incoming connection requests (SYN segments) are queued until they are accepted by the local server. If more connection requests arrive than the queue can hold, as specified by this variable, the excess requests are dropped. If clients experience failures connecting to busy servers, this value could be increased.
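
For example, a busy server that is dropping connection requests under load might raise the queue length; the value shown is only an example:

    # Allow more pending connection requests per listening port (example value)
    sysctl -w net.ipv4.tcp_max_syn_backlog=4096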

tcp_synack_retries

net.ipv4.tcp_synack_retries (/proc/sys/net/ipv4/tcp_synack_retries)

This variable controls the number of times the kernel retransmits the SYN/ACK reply to an incoming connection request (SYN segment) before giving up. Reducing this number results in earlier detection of a failed connection attempt from the remote host.

tcp_retries2

net.ipv4.tcp_retries2 (/proc/sys/net/ipv4/tcp_retries2)

This variable controls the number of times the kernel tries to resend data to a remote host with which it has an established connection. Reducing this number results in earlier detection of a failed connection to the remote host. This allows busy servers to quickly free up the resources tied to the failed connection and makes it easier for the server to support a larger number of simultaneous connections. The default value is 15, and, because TCP exponentially backs off with each attempt before retrying, it can normally take quite a while before it abandons an established connection. Lowering the value to 5, for example, results in five retransmission attempts being made on unacknowledged data.
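
For example, to abandon an established connection after five unacknowledged retransmission attempts instead of the default 15:

    # Give up on unacknowledged data after 5 retransmissions
    sysctl -w net.ipv4.tcp_retries2=5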

TCP Keep-Alive Management

Once a connection is established, the TCP protocol does not stipulate that data be exchanged. A connection can remain idle permanently. In such a situation, if one host fails or becomes unavailable, it is not detected by the remaining host. The Keep-Alive mechanism allows a host to monitor the connection and learn of such a failure within a reasonable time. This section describes the global kernel parameters associated with the TCP Keep-Alive mechanism. Applications need to have the TCP Keep-Alive option enabled using the setsockopt() system call in order to make use of the kernel mechanism.

tcp_keepalive_time

net.ipv4.tcp_keepalive_time (/proc/sys/net/ipv4/tcp_keepalive_time)

If a connection is idle for the number of seconds specified by this parameter, the kernel initiates a probing of the connection to the remote host.

tcp_keepalive_intvl

net.ipv4.tcp_keepalive_intvl (/proc/sys/net/ipv4/tcp_keepalive_intvl)

This parameter specifies the time interval, in seconds, between the keepalive probes sent by the kernel to the remote host.

tcp_keepalive_probes

net.ipv4.tcp_keepalive_probes (/proc/sys/net/ipv4/tcp_keepalive_probes)

This parameter specifies the maximum number of keepalive probes the kernel sends to the remote host to detect if it is still alive. If it has sent this number of probes without receiving a response in return, the kernel concludes that the remote host is no longer available and closes the connection, freeing up all the local resources associated with the connection.

The default values are as follows:


tcp_keepalive_time = 7200 seconds (2 hours)
tcp_keepalive_probes = 9
tcp_keepalive_intvl = 75 seconds

These settings result in a connection getting dropped after approximately two hours and eleven minutes of idle time. Such large values are typically used to accommodate large delays and round-trip times on the Internet.
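
The figure of roughly two hours and eleven minutes follows directly from these defaults:

    7200 s + (9 probes x 75 s) = 7875 s, or about 2 hours and 11 minutes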

It can be desirable to lower the preceding parameters so that remote hosts that have gone away are detected sooner. This minimizes the resources (such as memory and port space) tied up in extinct connections.
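
For example, to probe an idle connection after 10 minutes and give up after roughly one further minute of unanswered probes (the values are illustrative):

    # Probe after 600 seconds of idle time, sending up to 6 probes 10 seconds apart
    sysctl -w net.ipv4.tcp_keepalive_time=600
    sysctl -w net.ipv4.tcp_keepalive_intvl=10
    sysctl -w net.ipv4.tcp_keepalive_probes=6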

IP Port Space Range

ip_local_port_range

net.ipv4.ip_local_port_range (/proc/sys/net/ipv4/ip_local_port_range)

This parameter specifies the range of ephemeral ports that are available to the system. A port is a logical abstraction that the IP protocols use as an address of sorts to distinguish between individual sockets, and it is simply an integer sequence space. When an application actively connects to a remote endpoint, it typically asks the kernel to dynamically assign it an ephemeral port (the next available free port), as opposed to a well-known port. Port numbers are allocated independently for each protocol (TCP and UDP). The default value is configured at boot time based on the memory in the system. For systems with more than 128MB of memory, it is set to 32768 to 61000. Thus, a maximum of 28,232 ports can be in use simultaneously. Increasing this range allows a larger number of simultaneous connections for each protocol (TCP and UDP).
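
For example, to widen the ephemeral port range at run time (the bounds shown are illustrative):

    # Lower and upper bounds of the ephemeral port range (example values)
    sysctl -w net.ipv4.ip_local_port_range="10000 65535"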
