Enhancements in the Linux 2.4 and 2.6 Kernels

Linux network stack performance depends on many factors, such as the system where the kernel is running, the network stack implementation, and the network stack's interaction with other parts of the Linux kernel. Although this chapter does not deal with all the other subsystem enhancements in the kernel and their effect on network performance, this section does look at a few of the kernel features that are new in 2.4 and 2.6, such as the O(1) scheduler and read copy update (RCU), as well as the effects of these changes on some of the network benchmarks.

In the current evolution of computer systems, interconnect bus speed and memory access latency are not increasing proportionally to processor speed (GHz) and network bandwidth capacity (gigabit networks). Reducing the number of accesses to these relatively slower components, such as the interconnect bus and memory, improves the system's performance in general, including the network. In the TCP/IP stack, the data is copied many times as part of transmit and receive processes, generating heavy traffic in the interconnect buses. This traffic leads to increased memory accesses, which can result in poor network performance and scalability issues. The scalability issues are exacerbated in SMP and NUMA systems by additional high-speed CPUs and relatively slower interconnect buses.

The bus speed, its throughput rate, and how much bus traffic is generated by the workload are crucial for improving the performance of the Linux network stack. Any improvement to the Linux network stack that eliminates or reduces the number of times data flows through these buses and reduces the memory accesses improves the Linux stack performance. The following sections look at some of the enhancements that were made to both the 2.4 and 2.6 kernels that reduce the number of copies made in the TCP/IP stack and thereby boost the network stack performance. Specifically, these enhancements are in the following areas:

  • SendFile support

  • TCP Segmentation Offloading (TSO) support

  • Process and IRQ affinity

  • Network device driver API (NAPI)

  • TCP Offload Engine (TOE)

We'll start by looking at SendFile support.

SendFile Support

The Linux 2.4 network stack supports the SendFile API, which allows application data to be sent directly from the file system buffer cache to the network card's buffer through direct memory access (DMA). Through the use of Zerocopy support in the network stack and network interface card (NIC), the SendFile API lets the application data be sent via DMA directly to the network for transfer, without having to do the usual user space-to-kernel space copying. In the typical non-SendFile case, the application follows these steps to send a file:

1. Allocates a buffer.

2. Copies the file data into its buffer (the file data is first copied from disk to the kernel buffer cache and then to the application buffer).

3. Issues a send with the data, which gets copied to the kernel socket buffer.

4. The data is then transferred from the socket buffer to the NIC via DMA.

The SendFile API and Zerocopy bypass the extra copies and thereby avoid the copy and context-switch overhead, which involves many complex and CPU-intensive operations. Moreover, multiple copies of the same data, if processed on multiple CPUs, adversely affect performance by increasing memory latency, the cache miss rate, and the translation lookaside buffer (TLB) miss rate. The extra copies have direct and indirect consequences that can lead to poor network performance.

As the name implies, the SendFile API can transfer only file system data. Therefore, applications that deal with dynamic and interactive data, such as web servers generating dynamic content, Telnet, and so on, cannot use the SendFile API for that traffic. Because a major part of the data that travels the Internet is static (file) data and file transfer is a major part of that traffic, the SendFile API and Zerocopy are important for improving the performance of network applications. Applications that use them, where applicable, can improve network performance significantly.

SendFile API support in Linux and Zerocopy support in both the Linux TCP/IP stack and the network driver achieve this single copy during TCP/IP data processing. SendFile is not transparent to the applications, because the applications need to implement the SendFile API to take advantage of this feature.
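The following sketch (not from the book; the file and connected socket descriptors are assumed to be set up by the caller) contrasts the conventional read-and-send loop with the sendfile(2) call, which is how an application typically takes advantage of SendFile/Zerocopy support:

#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Conventional path: data is copied into a user buffer and then into the
 * kernel socket buffer (steps 1-3 in the list above). */
static int send_with_copies(int sock_fd, int file_fd)
{
    char buf[8192];
    ssize_t n;

    while ((n = read(file_fd, buf, sizeof(buf))) > 0)
        if (send(sock_fd, buf, n, 0) != n)   /* partial sends treated as errors in this sketch */
            return -1;
    return n < 0 ? -1 : 0;
}

/* SendFile path: the kernel moves data from the buffer cache to the socket
 * without a user-space copy; with a Zerocopy-capable NIC it is DMAed directly. */
static int send_with_sendfile(int sock_fd, int file_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(file_fd, &st) < 0)
        return -1;
    while (offset < st.st_size)
        if (sendfile(sock_fd, file_fd, &offset, st.st_size - offset) <= 0)
            return -1;
    return 0;
}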

Two of the benchmarks that are used in this study, NetBench and SPECWeb99, use the SendFile API. Table 21-1 shows the SendFile throughput performance improvement using NetBench. The Baseline column shows results using a Linux kernel without SendFile API/Zerocopy support; the Samba application uses the SendFile API. Zerocopy in the table refers to the SendFile/Zerocopy support in the Linux TCP/IP stack and the Intel Gigabit Ethernet NIC driver. The Linux kernel used for this study was 2.4.4, with Samba 2.2.0 patched to include SendFile support.

Server: 8 x 700 MHz PIII, 1MB L2 (four CPUs used), Intel Profusion chipset

4 GB memory, 4 x 1 Gbps Ethernet, 18 GB SCSI

Linux 2.4.4, Samba 2.2.0

Clients: 866 MHz PIII, 256k L2

Windows 2000, NetBench 7.0.1

Table 21-1. NetBench Results Showing SendFile+Zerocopy Benefits

Number of Clients | Linux 2.4.4 Baseline+Samba SendFile (Mbps) | Linux 2.4.4 Baseline+Samba SendFile+Zerocopy (Mbps) | % Improvement
24                | 516.8                                      | 571.0                                               | 10
28                | 540.0                                      | 609.2                                               | 12
32                | 551.9                                      | 635.1                                               | 15
36                | 556.6                                      | 647.2                                               | 16
40                | 560.6                                      | 655.4                                               | 17
44                | 562.5                                      | 661.8                                               | 18


The Baseline+Samba SendFile configuration, without the SendFile API+Zerocopy support in the Linux kernel, used the regular send path of the TCP/IP stack. This is the baseline in this case, and when the SendFile+Zerocopy support was applied to the TCP/IP stack and to the Intel Gigabit driver, the results improved, as shown in Table 21-1. The SendFile feature is clearly a winner in this case. As the number of clients increases, the amount of data that is processed also increases, as does the network performance when the SendFile feature is used. SendFile is enabled in the kernel by default, and some network device drivers support Zerocopy. Therefore, if an application supports the SendFile API and the system where the application runs has a NIC with Zerocopy support, the SendFile feature is used; otherwise, even though the application has implemented the SendFile API, the send simply falls back to the normal send path of the network stack. The network stack learns the capabilities of the NIC during initialization.
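As a rough illustration of how the stack learns those capabilities, a 2.4/2.6-era driver advertises its offload features through the net_device features flags at initialization; the stack takes the Zerocopy path only when scatter-gather and hardware checksumming are present. The fragment below is a hypothetical sketch, not taken from any particular driver:

#include <linux/netdevice.h>

/* Hypothetical init fragment: advertise the capabilities that SendFile/Zerocopy
 * (and, if available, TSO) depend on. Real drivers such as e1000 set similar flags. */
static void example_setup_features(struct net_device *netdev)
{
    netdev->features |= NETIF_F_SG;        /* scatter-gather DMA */
    netdev->features |= NETIF_F_HW_CSUM;   /* hardware checksum offload */
#ifdef NETIF_F_TSO
    netdev->features |= NETIF_F_TSO;       /* TCP segmentation offload, if available */
#endif
}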

TCP Segmentation Offloading Support

For every TCP/IP transmit or receive of a network packet, multiple PCI bus accesses occur as the data is transmitted to and received from the NIC. With the TCP Segmentation Offloading (TSO) feature in the NIC and the Linux TCP/IP stack, the number of PCI bus accesses on the send side is reduced to one per 64KB buffer instead of one per network packet (1518 bytes) on an Ethernet network. Without TSO, if the application issues a buffer larger than the frame size, the TCP/IP stack splits it into multiple frames, each with its own TCP/IP header, and DMAs each frame to the NIC separately.

When TSO is enabled, 64KB data is sent via DMA to the NIC with one pseudo header. The network adapter's controller parses the 64KB block into standard Ethernet packets, reducing the host processor utilization and access to the PCI bus. TSO increases efficiency by reducing CPU utilization and increasing the network throughput. The network silicon is designed specifically for this task. Enabling TSO in Linux and the Gigabit Ethernet silicon designed for TSO enhances system performance. Figure 21-1 explains TSO operation versus non-TSO operation in the Linux kernel.

Figure 21-1. Standard frames compared to TSO segmentation.


Server applications that use large message sizes (and large socket buffer sizes), such as SPECWeb99, benefit the most from TSO. A gain in the performance of microbenchmarks, such as Netperf3, is also seen when large message sizes are used. Unlike SendFile, TSO can be used for both static and dynamic content to improve the network performance of a SPECWeb99 workload. In 2.5.33 and later Linux kernels, TSO is enabled by default. TSO is transparent to the application, so no application change is needed to use it.
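One way to check from user space whether a driver reports TSO as enabled is the ETHTOOL_GTSO ioctl (ETHTOOL_STSO sets it); the ethtool utility wraps the same interface. The following is a minimal sketch; the interface name eth0 is an assumption:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(void)
{
    struct ethtool_value eval;
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);       /* any socket works for SIOCETHTOOL */

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */
    eval.cmd = ETHTOOL_GTSO;                       /* query the TSO state */
    ifr.ifr_data = (char *)&eval;

    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_GTSO");
        return 1;
    }
    printf("TSO is %s on %s\n", eval.data ? "enabled" : "disabled", ifr.ifr_name);
    close(fd);
    return 0;
}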

The SPECWeb99 results on Linux kernel 2.5.33 using the TSO feature of the Intel e1000 gigabit driver showed a 10% improvement in simultaneous connections (2,906 versus 2,656). For this test, the Apache 2.0.36 web server was used, which uses the SendFile API for static content larger than 8KB. Because SendFile is not used for dynamic content, the benefits of TSO are realized only for dynamic content. The system configuration for this workload included an 8-way 900MHz Pentium III server and four Intel e1000 gigabit NICs. Pentium III desktops were used as clients.

Figure 21-2 depicts a study that was conducted at Intel, which shows the benefits of TSO in an environment using Intel PRO network connections and Dell PowerEdge servers running Linux.

Figure 21-2. NetIQ Chariot network testing software results: TCP segmentation on servers running Linux, on versus off. Source: Intel Labs, September 2002.


From the experiments shown in Figure 21-2, it is obvious that TCP segmentation offloading provides a significant boost to Linux network performance. TSO support is available in the 2.5.x and 2.6 mainline kernels. The Linux distributor SUSE AG has this support in the SUSE Linux Enterprise Server 9 (SLES 9) kernel through a configuration option.

Process and IRQ Affinity in Network Load

Affinity, or binding, refers to forcing processes or interrupts to execute only on one or more chosen CPUs. In a multiprocessor system, processes are migrated to other processors under certain conditions, based on the policies used by the scheduler in the operating system. By keeping a process on the same processor (that is, forcing the scheduler to schedule a particular task on a specified CPU), the likelihood of the required data being in that processor's cache is greatly improved, which reduces memory latency. There are cases where, even if the process is bound to the same CPU, factors such as a large working set or multiple processes sharing the same CPU tend to flush the cache, so process affinity may not help in all situations. The case study shown in this chapter is perfectly suited to take advantage of process affinity.

The Linux kernel implements the following two system calls to facilitate binding processes with processors:

 asmlinkage int sys_sched_set_affinity(pid_t pid,
     unsigned int mask_len, unsigned long *new_mask_ptr);

 asmlinkage int sys_sched_get_affinity(pid_t pid,
     unsigned int *user_mask_len_ptr, unsigned long *user_mask_ptr);

The sched_set_affinity() syscall also ensures that the target process will run on the right CPU (or CPUs). A call such as sched_get_affinity(pid, &mask_len, NULL) can be used to query the kernel's supported CPU bitmask length.
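The glibc wrappers for these syscalls take a cpu_set_t mask rather than a raw bitmask pointer, and the wrapper signature has varied across glibc versions, so treat the following as a minimal sketch of binding the calling process to CPU 0:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                                 /* allow CPU 0 only */

    if (sched_setaffinity(0, sizeof(mask), &mask)) {   /* pid 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }

    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        printf("CPU 0 allowed: %s\n", CPU_ISSET(0, &mask) ? "yes" : "no");

    return 0;
}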

Process affinity refers to binding a process or thread to a CPU. IRQ affinity refers to forcing the interrupt processing of an individual IRQ to execute on a particular CPU. In Linux, the /proc interface can be used to set IRQ affinity. You set the IRQ affinity by finding the IRQ used for an I/O interface and then changing the mask for this selected IRQ. For example, the following command sets the affinity mask of IRQ 63 to CPU 0, routing the interrupts generated by IRQ 63 to CPU 0:

 echo "1" > /proc/irq/63/smp_affinity 

Process and IRQ affinity are common performance optimization practices for improving the throughput and scalability of applications and systems. These methods require that the application's processes be identified and isolated, and that the interrupts be identified and mapped to the application processes and the processors where they run. This n-to-n mapping is not always feasible in complex environments.

Our Netperf3 case study clearly indicates that Linux network SMP scalability and network throughput improved when IRQ and process affinity were used. Netperf3 multiadapter support was used to measure network scalability in this case. For each NIC, one process was created and bound to one of the four CPUs, and the IRQ of the NIC used by that process was bound to the same CPU. In this way, the processing of each Netperf3 process and the execution of the interrupts generated by its NIC were isolated to a single processor. This was done on the Netperf3 server using the TCP_STREAM test.

On the server, receiving data dominates the processing in this workload, which results in multiple data copies (a copy from the NIC to kernel memory and from kernel memory to application memory). The binding boosted performance because these copies were done on a single processor, which eliminated the need to load the data into two different processor caches; binding reduces memory latency when both affinities are applied. Although the IRQ and process affinities can be applied separately, applying them together can boost performance further, as shown in Figure 21-3. Figure 21-3 shows the data collected using the Linux 2.4.17 kernel (from www.kernel.org) with IRQ affinity, process affinity, and the O(1) scheduler patch. The O(1) scheduler patch was selected to compare with process affinity because the O(1) scheduler enforces affinity when scheduling a process/task. The O(1) scheduler tries to schedule a task on the same processor as much as it can, which improves Netperf3 TCP_STREAM performance much like the run with process affinity.

Figure 21-3. Netperf3 TCP stream: 4-way scalability.


The best performer in Figure 21-3 is the case where both IRQ and process affinity are applied. It achieved almost 3-out-of-4 scalability for most of the message sizes, whereas the base kernel achieved only 2 out of 4 for two message sizes and less than 2 for the rest. The results of the cases where these affinities were applied individually, and of the O(1) scheduler patch, hover between 2 and 2.5 out of 4 scalability.

In Figure 21-3, the legend 2.4.17+irq/proc aff means that there are four Netperf3 processes, each bound to one processor, and that the four IRQs from the four network interface cards are each bound to one of the four processors. Therefore, Netperf3 process 1 is bound to processor 0, and the first NIC's IRQ is bound to processor 0. The connections established by Netperf3 process 1 go through NIC 1 on processor 0.

The 2.4.17+irq aff label means that four IRQs are bound to four CPUs.

The 2.4.17+proc aff label means that the four Netperf3 processes are bound to four of the processors in the system.

NAPI Support

The network device driver API (NAPI) was developed to improve Linux network performance under heavy interrupt load. The system's load is used to determine whether polling or interrupts are used to process incoming data. Polling improves performance when the interrupt load is heavy but wastes CPU cycles when the interrupt load is light. Interrupts improve latency under a low interrupt load but make the system vulnerable to livelock when the interrupt load exceeds the Maximum Loss Free Forwarding Rate (MLFFR).

NAPI works only with NICs that implement a "ring" of DMA buffers. When a network packet is received, it is placed in the next buffer in the ring. Normally, the processor is interrupted for each packet, and the system is expected to empty the packet from the ring. When NAPI is enabled, the driver responds to the first interrupt by telling the NIC to stop interrupting. NAPI then polls the ring; as it processes packets, it pulls new ones without the need for further interrupts, leading to a dramatic reduction in receive interrupts that helps the system function normally under heavy network load.

The implementation of NAPI in the Linux 2.5 kernel saves network device driver developers from manually implementing polling in each driver, manages the flood of interrupts, and protects the system from denial-of-service (DoS) attacks; the flood ceiling is much lower without NAPI because of the CPU overhead of interrupt handlers. NAPI also eliminates the need for much of the code that implements the various NIC hardware interrupt mitigation schemes. To take advantage of NAPI, a network device driver has to be written to the new API. Currently, many network device drivers have NAPI support. Some network drivers, such as the Intel Gigabit Ethernet driver, support NAPI as an option that can be built into the driver, whereas others, such as the Broadcom tg3 driver, support NAPI by default. If a network driver does not use NAPI, the NAPI code in the kernel is not used either. For more information on NAPI, see Documentation/networking/NAPI_HOWTO.txt under the Linux source directory on your system.
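To give a feel for the driver-side changes, the fragment below condenses the 2.6-era receive path described in NAPI_HOWTO.txt: the interrupt handler disables further receive interrupts and schedules the device for polling, and the poll routine drains the DMA ring within its budget. The my_* helpers are hypothetical hardware-specific stand-ins, not real kernel or e1000 functions:

#include <linux/netdevice.h>
#include <linux/interrupt.h>
#include <linux/skbuff.h>

/* Hypothetical hardware-specific helpers assumed to exist in the driver. */
extern void my_disable_rx_interrupts(struct net_device *dev);
extern void my_enable_rx_interrupts(struct net_device *dev);
extern struct sk_buff *my_next_rx_skb(struct net_device *dev);

/* Interrupt handler: stop RX interrupts and hand the device to the poll list. */
static irqreturn_t my_intr(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = dev_id;

    if (netif_rx_schedule_prep(dev)) {
        my_disable_rx_interrupts(dev);
        __netif_rx_schedule(dev);
    }
    return IRQ_HANDLED;
}

/* dev->poll: pull packets from the DMA ring without further interrupts. */
static int my_poll(struct net_device *dev, int *budget)
{
    int limit = min(*budget, dev->quota);
    int done = 0;
    struct sk_buff *skb;

    while (done < limit && (skb = my_next_rx_skb(dev)) != NULL) {
        netif_receive_skb(skb);          /* hand the packet up the stack */
        done++;
    }
    *budget -= done;
    dev->quota -= done;

    if (done < limit) {                  /* ring drained: return to interrupt mode */
        netif_rx_complete(dev);
        my_enable_rx_interrupts(dev);
        return 0;
    }
    return 1;                            /* more work pending: stay on the poll list */
}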

For our NAPI test, we used the Intel Gigabit Ethernet driver built with NAPI support. The Intel Gigabit NIC also provides another option to mitigate transmit and receive interrupts. This receive interrupt delay enables the driver to delay interrupting the processor until a configurable time has elapsed, so the processor is not interrupted for every packet. With the advent of high-bandwidth network cards, the interrupt rate has increased tremendously. Without techniques like NAPI or interrupt delay, the processor ends up spending most of its cycles in interrupt processing. Interrupt delay is discussed in more detail in the section "Netperf3 (Gigabit Ethernet Tuning Case Study)."

NAPI has helped improve network throughput performance in Internet packet forwarding services such as routers. NAPI test results are shown in Tables 21-2, 21-3, and 21-4. Netperf TCP single stream, UDP stream, and packet forwarding were used to obtain these results. The hardware and software configurations used for these tests include the following:

  • Two uniprocessor Pentium III @ 933MHz with Intel Gigabit Ethernet (e1000) NIC

  • Linux 2.4.20-pre4 kernel with e1000 driver version 4.3.2-k1

  • NAPI patch

Table 21-2. Netperf TCP Single Stream with and Without NAPI

Message Size (Bytes) | e1000 Baseline (Mbps) | e1000 NAPI (Mbps)
4                    | 20.74                 | 20.69
128                  | 458.14                | 465.26
512                  | 836.40                | 846.71
1024                 | 936.11                | 937.93
2048                 | 940.65                | 939.92
4096                 | 940.86                | 937.59
8192                 | 940.87                | 939.95
16384                | 940.88                | 937.61
32768                | 940.89                | 939.92
65536                | 940.90                | 939.48
131070               | 940.84                | 939.74


Table 21-3. Netperf UDP_STREAM with and Without NAPI

Intel e1000 Driver Baseline Throughput (Mbps) | Intel e1000 Driver with NAPI Support Throughput (Mbps)
955.7                                         | 955.7


Table 21-4. Packet Forwarding (Routing) with and Without NAPI

Intel e1000 Driver Baseline Throughput (Packets Forwarded) | Intel e1000 Driver with NAPI Support Throughput (Packets Forwarded)
305                                                        | 298580


The Receive Interrupt Delay (RxIntDelay) was set to 1 in the case of NAPI because RxIntDelay=0 caused a problem, and RxIntDelay was set to 64 in the case of base e1000 driver without NAPI support. The receive and transmit descriptors were set to 256 on both systems for the e1000 driver. For the TCP Stream test, the TCP socket size was set to 131070. The test duration for both the TCP Stream and the UDP Stream was 30 seconds. One million packets at 970 kpackets per second were injected into the packet-forwarding experiments.

The packet-forwarding results in Table 21-4 clearly show that NAPI is a winner for routing packets. The TCP and UDP stream tests using Netperf, shown in Tables 21-2 and 21-3, indicate that NAPI does not improve network throughput for these workloads; in fact, throughput regressed moderately in some cases.

The results of TCP stream, UDP stream, and packet forwarding in Tables 21-2, 21-3, and 21-4 show that NAPI helped improve performance significantly in routing tests. Because NAPI is optionally available in many network drivers, you can choose to use NAPI depending on your environment.

TCP Offload Engine Support

Server applications that do not use some of the unique features of the Linux TCP/IP stack, such as IP filtering or quality of service (QoS), can instead use network interface cards that support TCP/IP offloading to improve network performance. TCP Offload Engine (TOE) technology consists of software extensions to existing TCP/IP stacks that enable the use of hardware planes implemented on specialized TOE network interface cards (TNICs). This hardware/software combination lets operating systems offload all TCP/IP traffic to the specialized hardware on the TNIC. As a result, server resources are no longer needed to process TCP/IP frames and can be used instead for application and other system needs, bolstering overall system performance. This sounds like a great solution for improving network performance; however, general-purpose TCP offload solutions have repeatedly failed, and TOE is suitable only for some specific application environments. One context where TCP offload is suitable is storage-specific interconnects over commoditized network hardware; another is high-performance clustering solutions.

Analysis of Gigabit Ethernet workloads indicates that for every gigabit of network traffic a system processes, roughly a gigahertz of CPU processing power is needed to perform the work. With the advent of higher-speed processors, high-performance serial PCI-Express system buses, and Ethernet controllers with built-in checksum logic and interrupt mitigation, users will be able to use multiple Gigabit Ethernet connections in servers without performance degradation. However, network bandwidth continues to improve along with, and often ahead of, the other system components, which is what motivates a solution like TOE. At the same time, TCP offload as a general-purpose solution has failed in the past because of fundamental performance issues and the complexities of deploying TCP offload in practice. TOE has both proponents and opponents in the academic and commercial worlds. Nevertheless, many vendors develop TNICs that are used for various workloads in addition to clustering and network storage solutions.

The TOE solutions are proprietary and vary in functionality when compared to the Linux TCP/IP software stack; unique features such as IPCHAINS, IPTABLES, and general-purpose packet-handling capability may not be available in these proprietary TOE solutions. Many vendors that build TNICs and drivers support different operating systems. Although our early results showed a 17% improvement in NetBench performance on Linux, the drivers used were not stable at the time they were tested, so further investigation is needed for this feature. More stable drivers and NICs should yield higher performance results.

Note that TOE is different from TCP Segmentation Offloading (TSO). TSO offloads only the segmenting of data: the TCP/IP stack hands the whole message that the application passed to it to the NIC in one large buffer, and the NIC splits the data into multiple frames, building a header for each frame before transmission. This offload applies only to the send side (outbound traffic); it does not handle inbound traffic. TOE, on the other hand, offloads the whole TCP/IP stack processing to the NIC, including both inbound and outbound processing.

Although the TOE solution is a much-needed technology for improving network performance, how it will be adopted by the Linux open source community remains to be seen. The contention of some in the open source community is that because the Linux TCP/IP stack already supports Zerocopy, the number of copies is already reduced, and the checksum offload, segmentation offload, and interrupt mitigation support in the NIC already provide most of the benefit that TOE would bring; in their view there is no need for TOE, because the remaining functions are basic socket management and process wakeup. If the TOE solution evolves and can show a tremendous advantage over the software network stack, adoption of TOE technology might increase.

For now, only commercial companies are developing TOE engines. As a result, the implementation of this solution varies from vendor to vendor. Perhaps the vendors will be able to mitigate or resolve the difficulties in the implementation and deployment of TOE and change the minds of skeptics who are unwilling to accept the TOE solution as a viable alternative for improving network performance.
