7.5 Optimizing protocols

7.5.1 Frame and packet size

In packet-switched networks, packet size has a significant influence on throughput and efficiency. Smaller packets clearly incur more overhead than larger packets, since the ratio of protocol and encapsulation headers to actual user data is proportionally much higher. Therefore, the user data throughput of a link decreases as packet size decreases (even though the number of bytes passed over the link may be the same). The advantage of smaller packets in packet-switched networks is that they have a much better chance of getting through without error, and hence fewer retransmissions are likely. Line quality clearly plays an important part in deciding optimum packet size. On high-quality lines larger packets are less likely to become corrupted.

This phenomenon produces a bell-shaped curve [1], where throughput increases as packet size increases but then begins to degrade. The optimum packet size is the highest point on the curve. For example, an IP network suffers from low throughput with very small packet sizes but will also degrade when packet sizes are too large (due to fragmentation). The goal is to increase packet size to just below the point where fragmentation is needed and increase buffers accordingly to compensate for transmission delays. This will generally vary by packet service; in networks where each node must read an entire frame before transmission (Frame Relay, etc.), very large packets should be avoided. In many cases you will have to use trial and error. For example, you could start with a packet size of 1,500 bytes and measure performance as this value is increased or decreased.
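The shape of this curve can be illustrated with a deliberately simple model. The sketch below assumes a fixed 40-byte header, independent bit errors at a given bit error rate, and that any corrupted packet is wasted; the function names and default values are illustrative assumptions, not measurements from a real network, and the model ignores fragmentation and queuing entirely.

```python
def goodput_fraction(payload_bytes, header_bytes=40, bit_error_rate=1e-5):
    """Fraction of link capacity that carries intact user data (simple model)."""
    total_bits = (payload_bytes + header_bytes) * 8
    # Probability that every bit of the packet arrives intact,
    # assuming bit errors are independent of one another.
    p_intact = (1.0 - bit_error_rate) ** total_bits
    # Header overhead lowers efficiency; corrupted packets deliver nothing.
    return payload_bytes / (payload_bytes + header_bytes) * p_intact


def best_payload(sizes, header_bytes=40, bit_error_rate=1e-5):
    """Return the candidate payload size with the highest modeled goodput."""
    return max(sizes, key=lambda s: goodput_fraction(s, header_bytes, bit_error_rate))


if __name__ == "__main__":
    candidates = [64, 128, 256, 512, 1024, 1460, 4096, 8192]
    for size in candidates:
        print(f"{size:5d} bytes -> {goodput_fraction(size):.3f}")
    print("best candidate:", best_payload(candidates))
```

Lowering bit_error_rate in this model moves the peak toward larger packets, which matches the observation that high-quality lines favor larger packet sizes.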

When vendors document the throughput supported by their equipment, they often do so using the minimum packet size. While this may not be the most efficient data rate, it leads to the highest raw throughput numbers possible (usually expressed in Packets Per Second [PPS]). These numbers may bear no relation to the traffic characteristics of your network, so beware. Reference [39] suggests the following frame sizes to be used in throughput tests:

Ethernet: 64, 128, 256, 512, 1,024, 1,280, 1,518 bytes
Token Ring: 54, 64, 128, 256, 1,024, 1,518, 2,048, 4,472 bytes
FDDI: 54, 64, 128, 256, 1,024, 1,518, 2,048, 4,472 bytes
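To see why minimum-size frames give the most flattering PPS figures, it helps to work out the theoretical maximum frame rate for a link. The sketch below assumes Ethernet, where each frame also occupies 20 bytes of line time for the preamble and inter-frame gap; the function name and the 10-Mbps example are illustrative choices.

```python
ETHERNET_PER_FRAME_OVERHEAD = 20  # bytes: 8 preamble/start delimiter + 12 inter-frame gap


def max_frames_per_second(link_bps, frame_bytes,
                          per_frame_overhead=ETHERNET_PER_FRAME_OVERHEAD):
    """Theoretical maximum frame rate for back-to-back frames of one size."""
    bits_per_frame = (frame_bytes + per_frame_overhead) * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    link = 10_000_000  # 10-Mbps Ethernet
    for size in (64, 512, 1518):
        pps = max_frames_per_second(link, size)
        print(f"{size:5d}-byte frames: {pps:9.0f} frames per second")
```

The frame rate falls steeply as frame size grows, so a device quoted at its 64-byte PPS figure may be handling far fewer, larger packets on a real network.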

7.5.2 Fragmentation

Fragmentation (sometimes called segmentation) occurs frequently in packet-switched internetworks and can be a significant factor in degrading performance. Fragmentation occurs when one media technology cannot support the Maximum Transmission Unit (MTU) of another, in which case it must break up packets and insert the appropriate encapsulation for each fragment before forwarding. The remote end system typically has the responsibility of reassembling these smaller fragments, although intermediate devices such as routers or firewalls may also need to reassemble fragments for security reasons (some security attacks attempt to conceal malicious activity within fragments). Clearly, this additional encapsulation overhead and the increase in the number of packets transmitted decrease overall throughput and add both latency and additional bandwidth overheads to the network. As a consequence, applications that routinely generate very large packets (such as NFS) may exhibit poor performance and occasionally time out, due to excessive delays and subsequent retransmissions (wasting yet more bandwidth). Where possible, set higher-level protocols and applications to limit their MTU to the maximum packet size supported on the network. This may compromise the efficiency of these applications on local networks, but this may be an acceptable trade-off. For the interested reader, [40] provides a formal analysis of optimal fragmentation algorithms in computer networks.
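The cost of fragmentation can be estimated with a little arithmetic. The sketch below assumes IPv4 with a 20-byte header and no options, and relies on the rule that every fragment except the last carries a multiple of 8 data bytes. The 8,200-byte example (8,192 bytes of application data plus an 8-byte UDP header) is an illustrative figure for a large NFS-style datagram, not a value taken from a specific implementation.

```python
import math


def ipv4_fragments(ip_payload_bytes, mtu, ip_header_bytes=20):
    """Return (fragment_count, total_bytes_sent) for one IPv4 datagram."""
    # Data carried by each non-final fragment must be a multiple of 8 bytes.
    per_fragment = (mtu - ip_header_bytes) // 8 * 8
    count = math.ceil(ip_payload_bytes / per_fragment)
    # Every fragment repeats the IP header.
    total = ip_payload_bytes + count * ip_header_bytes
    return count, total


if __name__ == "__main__":
    payload = 8192 + 8  # application data plus UDP header
    for mtu in (4352, 1500, 576):
        count, total = ipv4_fragments(payload, mtu)
        print(f"MTU {mtu}: {count} fragments, {total} bytes on the wire")
```

Note that the loss of any single fragment forces the whole original datagram to be resent, so the practical penalty on a lossy path is larger than the header overhead alone suggests.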

7.5.3 Window sizes

Connection-oriented protocols often use the concept of a window to enable the transmission of multiple packets in sequence before requiring an acknowledgment. This is inherently more efficient than requiring a separate acknowledgment for each packet transmitted, as long as all packets within the window reach their destination intact. The larger the window size, the more system resources are consumed at the receiver, since the recipient must buffer and check all packets in the sequence until the receive window is filled, at which point an acknowledgment can be sent. Theoretically, the optimal window size can be expressed as:

WindowSize = CircuitBandwidth × CircuitDelay

To maximize throughput, the transmit window should be large enough to fill the pipe with data completely before stopping and waiting for an acknowledgment. In packet-switched internetworks, protocol window sizes are often tuned at packet-switched nodes (such as routers) to meet the characteristics of the circuit media. Broadly speaking, window size can be increased on reliable networks (such as high-speed fiber-optic links with low Bit Error Rates [BER]) and decreased on less reliable networks (such as analog leased lines or X.25 PSNs). Large windows are usually a bad idea on low-speed links or links with potentially high bit error rates, since packet loss or timeouts may result in frequent retransmissions and could dramatically degrade performance.
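As a worked example of the window size formula, the sketch below computes the bandwidth-delay product in bytes and the number of packets needed to fill the pipe. It assumes that CircuitDelay is the round-trip time (the time until an acknowledgment can return) and that bandwidth is given in bits per second; the link figures are illustrative.

```python
import math


def window_size_bytes(bandwidth_bps, round_trip_delay_s):
    """Bandwidth-delay product, converted from bits to bytes."""
    return bandwidth_bps * round_trip_delay_s / 8


def packets_in_window(bandwidth_bps, round_trip_delay_s, packet_bytes):
    """Whole packets of the given size needed to keep the pipe full."""
    return math.ceil(window_size_bytes(bandwidth_bps, round_trip_delay_s) / packet_bytes)


if __name__ == "__main__":
    # A 64-kbps leased line with a 50-ms round trip.
    print(window_size_bytes(64_000, 0.05), "bytes")
    # A 1.544-Mbps satellite circuit with a 500-ms round trip.
    print(window_size_bytes(1_544_000, 0.5), "bytes")
    print(packets_in_window(1_544_000, 0.5, 1460), "packets of 1,460 bytes")
```

The satellite example needs a window well above 65,535 bytes, which is why long, fast paths expose the limits of a standard window.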

Some protocols monitor session throughput and automatically adjust window sizes, timers, and retransmission attempts to suit. For example, TCP dynamically adjusts both its transmit and receive window sizes based on throughput using a number of well-documented mechanisms. As a general rule, window size should not be modified without a thorough understanding of the circuit quality and buffering capabilities of the intermediate network. Setting the wrong window size could be disastrous. Where performance problems are observed, examining changes in TCP window size within a detailed packet trace may help you locate problems elsewhere in the network.

7.5.4 Tuning TCP/IP

TCP/IP is a general-purpose protocol suite and, therefore, it may be possible to tune it to improve performance on mission- or business-critical resources such as network servers. There are several factors that affect TCP/IP application performance, as follows:

Every TCP connection starts with a three-way handshake (SYN, SYN-ACK, ACK) and this can take 500 ms or more to complete (depending on the length and characteristics of the end-to-end path). All TCP applications require that a session be established before user data can be transferred.
TCP also uses a sliding window algorithm for efficient bulk data transfer. The loss of even one TCP packet in the window sequence requires the whole block of data to be retransmitted.
TCP's built-in slow-start algorithm, intended to cut down on congestion and traffic spikes, has the effect of degrading throughput by effectively dampening TCP's ability to transfer data rapidly at session start up.
TCP's Nagle algorithm [41] is designed to reduce small packet traffic by batching small packets for a round-trip time; under some circumstances, there can be a waiting period of 200 ms before data are transmitted.

Since applications such as Web browsing or FTP can involve dozens of connections, connection latency is incurred by each new session. The Nagle algorithm can also impact applications that require near real-time feedback such as some X-Windows applications (and may be disabled by setting the TCP_NODELAY flag via the Socket API).
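Disabling the Nagle algorithm through the socket API takes a single option call. The following minimal sketch uses Python's standard socket module; the helper name make_low_latency_socket is a hypothetical name chosen for illustration.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the stack to send small segments immediately
    # rather than batching them while earlier data is unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    # A nonzero value confirms the option is set.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```

This trades bandwidth efficiency for responsiveness, so it is best reserved for interactive applications that send many small messages.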

Of particular importance is TCP's behavior in response to packet loss. When a TCP sender detects a dropped segment, it retransmits that segment and then adjusts its transmission rate to half of what it was previously by going into slow start. Although this back-off behavior is responsive to congestion, problems occur when many TCP sessions are affected simultaneously, as is the case with tail drop in overloaded routers. This situation can be improved using techniques such as RED gateways, as discussed in section 7.2.3.

Tunable parameters

With most TCP/IP implementations (including Windows 95, 98, and NT and Solaris v2) you should be able to override the default settings of a few parameters, including the following:

IP Maximum Transmission Unit (MTU)—determines the maximum packet size.
TCP Maximum Segment Size (MSS)—determines the maximum segment size.
IP Path MTU Discovery (PMTUD)—a probe algorithm that may be disabled or enabled.
TCP window size—not normally configurable by the user, but some applications may change default settings.

In some cases it may actually be necessary to change these values if problems are encountered with data transfers involving maximum sized packets (where data transfers with short packet lengths work fine). If sessions hang or time out, then it is likely that there is a problem due to fragmentation or inconsistent MTU sizes within the network. Often this is observed as a unidirectional problem, with large data transfers succeeding in one direction but failing in the other. This can be a particular problem where mixed-media bridging is deployed, where different MTU sizes are configured and there is no capability to fragment (FDDI to Ethernet or Ethernet to Token Ring). Another potential problem could occur when a router interface MTU is set correctly, but the router cannot forward datagrams of that size over the interface (perhaps due to buffering limitations, a malfunctioning CSU/DSU/modem, bad cable, or software/firmware failures). In this case the router may not generate any useful ICMP messages, because the states that trigger such events are not entered. Finally, tunneling protocols such as GRE and L2TP introduce additional packet overheads on links that may not be taken into account when configuring MTU sizes.

TCP MSS and IP MTU

The TCP Maximum Segment Size (MSS) parameter specifies the maximum amount of TCP data in a single IP datagram that the local system can accept (i.e., specifically the maximum amount of data it is willing to reassemble). Theoretically, the MSS could be as large as 65,495, but in practice it is usually much lower to avoid fragmentation (hosts typically subtract 40 bytes from the MTU of the output interface to calculate the MSS, where 40 is the combined size of an IPv4 and TCP header). For example, the MSS value for an Ethernet interface would typically be set to 1,460 bytes (i.e., 1,500 - 40). To set the maximum MSS to 1,460 on Solaris 2, for example, we would use the command $ ndd -set /dev/tcp tcp_mss_max 1460.
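The MSS arithmetic is easily captured in a short helper. The sketch below assumes IPv4 and TCP headers without options (20 bytes each) and adds an optional tunnel_overhead parameter, a hypothetical name, to account for encapsulation on tunneled links. The 24-byte figure for GRE assumes a 20-byte outer IPv4 header plus a basic 4-byte GRE header; other tunnel types and options will differ.

```python
IPV4_HEADER_BYTES = 20
TCP_HEADER_BYTES = 20
GRE_OVER_IPV4_BYTES = 24  # assumed: 20-byte outer IPv4 header + 4-byte basic GRE header


def mss_for_mtu(mtu, tunnel_overhead=0):
    """Largest TCP segment that fits in one unfragmented IPv4 datagram."""
    return mtu - tunnel_overhead - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    print("Ethernet:", mss_for_mtu(1500))
    print("Ethernet via GRE:", mss_for_mtu(1500, GRE_OVER_IPV4_BYTES))
    print("Theoretical maximum:", mss_for_mtu(65535))
```

A host that is unaware of a tunnel in the path will advertise the larger value, which is one way the MTU problems described above arise.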

Note that if the MSS is set larger than the MTU, then IP datagrams must be fragmented into several packets when transmitted. Receiving stations must be prepared to accept the same MSS size. Note also that if you change the interface MTU on a router or end station, then all systems connected to the same broadcast domain must be configured with consistent MTUs. If systems on the same broadcast domain use different MTUs, they may exhibit problems communicating (large packets are discarded if sent from stations with a larger MTU to stations with a smaller MTU).

PMTUD

Prior to Path MTU Discovery (PMTUD), the default MTU used by most IP systems was set to 576 bytes. This is the minimum size that must be supported by any IP node when communicating over different subnets and leads in many cases to throughput inefficiency, since often we could use much larger packets to transfer bulk data. PMTUD is an algorithm that attempts to maximize data transfer throughput by first discovering the largest IP datagram that may be sent without incurring fragmentation by probing the network. PMTUD is implemented in relatively new TCP/IP stacks and described in [42]. PMTUD works by setting the Don't Fragment (DF) flag in the IP header of a probe packet. Initially very large probes are sent, with the size being progressively reduced until a successful end-to-end transmission is achieved. Unsuccessful transmissions are indicated by the ICMP message Type 3, Code 4 (fragmentation needed and DF set), which is returned by intermediate routers along the path if the MTU of the next-hop interface will not support this packet size [43].
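The probing behavior can be illustrated with a small simulation rather than real packets. The sketch below assumes that each ICMP Type 3, Code 4 message reports the MTU of the hop that rejected the probe, which lets the sender drop straight to that size; routers that do not report the value would force the sender to guess a smaller size instead. The function names and hop MTUs are illustrative assumptions.

```python
def send_probe(size, hop_mtus):
    """Simulate a probe sent with DF set.

    Returns None if the probe crosses every hop, otherwise the MTU of the
    first hop that cannot forward it (standing in for the ICMP
    'fragmentation needed and DF set' message).
    """
    for mtu in hop_mtus:
        if size > mtu:
            return mtu
    return None


def discover_path_mtu(first_hop_mtu, hop_mtus):
    """Return (path_mtu, probes_sent) for a simulated path."""
    size = first_hop_mtu  # start with the largest size the local link allows
    probes = 0
    while True:
        probes += 1
        reported = send_probe(size, hop_mtus)
        if reported is None:
            return size, probes
        size = reported  # always smaller than size, so the loop terminates


if __name__ == "__main__":
    # FDDI ring, then Ethernet, then a 576-byte link, then Ethernet again.
    path = [4352, 1500, 576, 1500]
    print(discover_path_mtu(4352, path))
```

If the simulated ICMP reply were silently dropped, the sender would never learn of the smaller MTU, which is the failure mode that makes PMTUD fragile in practice.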

There are a number of scenarios where PMTUD will not work, and the IP sender will continue to use a large MTU with consequent retransmissions. Essentially these problems revolve around the need for PMTUD to see an explicit ICMP failure message to know the path MTU status. Some routers may not generate ICMP errors, or intermediate routers may discard ICMP messages from farther upstream. There are also potential host stack problems where received ICMP messages may be misinterpreted or filtered before PMTUD can process them. In these cases you will need to disable PMTUD on the sending stations. To disable PMTUD on Solaris 2, for example, we would use the command $ ndd -set /dev/ip ip_path_mtu_discovery 0. This causes the IP sender to send IP datagrams with the DF flag clear. This may result in fragmentation if the MSS value exceeds intermediate media MTU values, so it may also be beneficial to lower the MSS if fragmentation issues are evident.

TCP window size

Generally it is not possible to configure TCP window sizes from the user's interface; it may be an option available to programmers wishing to optimize performance if they have direct access to the stack. Reference [44], for example, illustrates the potential for larger window sizes than the standard TCP window of 65,535 bytes, when TCP is used over very high-speed pipes with long delays (e.g., high-bandwidth satellite channels and long-distance transcontinental fiber-optic links).
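For programmers who do have access to the socket API, the sketch below shows two related calculations. The first assumes the TCP window scale option, under which the 16-bit window is shifted left by up to 14 bits, and finds the smallest shift that covers a required window. The second requests a larger receive buffer with SO_RCVBUF; the operating system may adjust or cap the value it actually grants. The function names are hypothetical, and whether this matches the scheme described in [44] is an assumption.

```python
import socket

MAX_UNSCALED_WINDOW = 65535
MAX_WINDOW_SHIFT = 14  # upper limit of the TCP window scale option


def window_scale_shift(required_window_bytes):
    """Smallest shift whose scaled window covers the required size."""
    shift = 0
    while ((MAX_UNSCALED_WINDOW << shift) < required_window_bytes
           and shift < MAX_WINDOW_SHIFT):
        shift += 1
    return shift


def request_receive_buffer(sock, nbytes):
    """Ask for a receive buffer of nbytes and return what the OS reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("shift for a 1,000,000-byte window:", window_scale_shift(1_000_000))
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("granted buffer:", request_receive_buffer(s, 131072))
    s.close()
```

The buffer should be requested before the connection is established, since the scale factor is negotiated during the handshake.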

In recent years there have been some innovative approaches to these problems, including plug-in Web server and browser software, which reduces handshaking latency by transmitting data in the first TCP packet of a connection and selectively retransmits only dropped packets in a sequence rather than the whole window. These modifications are reported to boost Web browsing performance by a factor of three or more. They might also assist transaction-based services, such as credit card authorization (where short data messages could be sent as part of the connection request and the session could be quickly terminated—rather like X.25's expedited data facility). The disadvantage is that they require special software to be installed on all server and client platforms.