Case Study

This section presents the results of the benchmarks used for the case study and shows the cumulative improvements of all the performance enhancement features discussed so far. Some of the benchmarks captured the cumulative gain of all the features, whereas others captured selected gains for specific workloads.

NetBench

Figure 21-4 summarizes the performance improvements made to NetBench benchmark throughput through various Linux kernel and Samba enhancements and tuning. These tests were conducted on a system with four 1.5GHz Pentium 4 processors, four Gigabit Ethernet adapters, 2GB of memory, and fourteen 15,000 RPM SCSI disks. SUSE SLES 8.0 was used for all tests. Each subsequent test adds one new kernel, tuning, or Samba change to the previous configuration. The NetBench Enterprise Disk Suite by Ziff-Davis was used for all tests.

Figure 21-4. NetBench case study.


The following list describes the column names in Figure 21-4:

  • Baseline. Represents a clean installation of SUSE Enterprise Linux Server 8.0 (SUSE SLES 8) with no performance configuration changes.

  • data=writeback. A configuration change was made to the default Ext3 mount option, from ordered to writeback, for the /data file system (where the Samba shares reside). This greatly improved file system performance on metadata-intensive workloads such as this one. (The sketch following this list shows how this and several of the other configuration changes can be applied.)

  • smblog=1. The Samba logging level was changed from 2 to 1 to reduce disk I/O to the Samba log files. A level of 1 is verbose enough to log critical errors.

  • SendFile/Zerocopy. A patch that makes Samba use SendFile for client read requests. This, combined with Linux Zerocopy support (first available in 2.4.4), eliminates two very costly memory copies.

  • O(1) scheduler. A small improvement on this benchmark that also enables further performance improvements. The O(1) scheduler is the multiqueue scheduler that improves performance on symmetric multiprocessor (SMP) systems. It is the default scheduler in the Linux 2.5 and 2.6 kernels.

  • evenly affined IRQs. Each of the four network adapters' interrupts was handled by a unique processor. On the P4 architecture, SUSE SLES 8 defaults to a round-robin assignment (destination = irq_num % num_cpus) for IRQ-to-CPU mappings. With this system's IRQ numbering, that mapping routed all of the network adapters' IRQs to CPU0. Concentrating the interrupts on one CPU can be good for performance because cache warmth for the interrupt-handling code improves, but one CPU may not be able to handle the entire network load as more NICs are added to the system. The ideal solution is to affine these IRQs evenly so that each processor handles the interrupts from one NIC. Combined with process affinity, this keeps the process serving a particular NIC on the same CPU that handles that NIC's interrupts, for maximum performance.

  • process affinity. This technique ensures that for each network interrupt that is processed, the corresponding smbd process is scheduled on the same CPU to further improve cache warmth.

    This is primarily a benchmark technique and is not commonly used elsewhere. If a workload can logically be divided evenly across many CPUs, this can be a big gain. Most workloads in practice are dynamic, and affinity cannot be predetermined.


  • /proc/sys/net/hl=763. Increases the number of buffers cached by the network stack so that the stack does not have to call back into the memory allocator to get and free a buffer for every packet. This tunable is not available in the 2.6 kernel.

  • case sensitivity enforced. When case sensitivity is not enforced, Samba may have to try several case variations of a filename before it can locate the file, because many combinations of upper- and lowercase can exist for the same name. Enforcing case sensitivity eliminates those extra lookups.

  • spinlocks. Samba uses fcntl() locking for its internal databases, which can be costly: the kernel's posix_lock_file() path takes the Big Kernel Lock, so heavy fcntl() use increases contention and wait time on that lock. Using spinlocks avoids the fcntl() call. To use this feature, run smbd with --use-spin-locks, as shown in the following example:

     smbd --use-spin-locks -p <port_number> -O <socket options> -s <configuration file>

  • dcache read copy update. Directory entry lookup times are reduced with a new implementation of d_lookup() that uses the read-copy update (RCU) technique. Read-copy update is a two-phase update method for mutual exclusion in Linux that lets readers avoid the overhead of spin-waiting locks. For more information, see the locking section of the Linux Scalability Effort project.
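
The following sketch collects several of the changes described in this list as they might be applied on a similar system. The device name, mount point, IRQ numbers, CPU masks, and smbd PID are illustrative assumptions rather than values from the study; the use sendfile option shown is the smb.conf setting that corresponds to the SendFile patch, and taskset is one way to apply process affinity.

    # Remount the file system holding the Samba shares with data=writeback
    # (/dev/sdb1 and /data are example names; the option can also go in /etc/fstab).
    mount -o remount,data=writeback /dev/sdb1 /data

    # smb.conf settings corresponding to the logging, case sensitivity, and SendFile items:
    #   [global]
    #       log level = 1
    #       case sensitive = yes
    #       use sendfile = yes

    # Affine each NIC's IRQ to its own CPU (IRQ numbers are examples; see /proc/interrupts).
    echo 1 > /proc/irq/24/smp_affinity    # eth0 -> CPU0
    echo 2 > /proc/irq/25/smp_affinity    # eth1 -> CPU1
    echo 4 > /proc/irq/26/smp_affinity    # eth2 -> CPU2
    echo 8 > /proc/irq/27/smp_affinity    # eth3 -> CPU3

    # Pin an smbd process to the CPU that services its clients' NIC (PID is an example).
    taskset -p 0x1 12345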

Netperf3 (Gigabit Ethernet Tuning Case Study)

Gigabit Ethernet NICs are becoming cheaper and are quickly replacing 100Mb Ethernet cards. The system manufacturers are including Gigabit Ethernet on motherboards, and system suppliers and integrators are choosing Gigabit Ethernet network cards and switches for connecting disk servers, PC computer farms, and departmental backbones. This section looks at gigabit network performance in the Linux operating system and how tuning the Gigabit Ethernet improves network performance.

The Gigabit Ethernet network cards (Intel Gigabit Ethernet and AceNIC Gigabit Ethernet) have a few additional features that facilitate handling high throughput. These features include support for jumbo frame (MTU) sizes, interrupt delay, and TX/RX descriptors (a short sketch following this list shows how these settings can be adjusted):

  • Jumbo frame size. With Gigabit Ethernet NICs, the MTU size can be greater than 1500 bytes; the 1500-byte MTU limit of 100Mb Ethernet no longer applies. Increasing the MTU usually improves network throughput, but make sure that the switches and routers in the path support the jumbo frame size. Note also that when the system is connected to a 100Mb Ethernet network, the Gigabit Ethernet NIC drops back to 100Mb operation.

  • Interrupt delay/interrupt coalescence. Interrupt coalescence can be set for receive and transmit interrupts; it lets the NIC delay generating an interrupt for a set time period. For example, when RxInt is set to 1.024us (the default value on the Intel Gigabit Ethernet adapter), the NIC places received frames in memory and generates an interrupt only after 1.024us has elapsed. This improves CPU efficiency because it reduces context switches, but it also increases receive packet latency. If properly tuned for the network traffic, interrupt coalescence can improve both CPU efficiency and network throughput.

  • Transmit and receive descriptors. These values are used by the Gigabit Ethernet driver to allocate buffers for sending and receiving data. Increasing them allows the driver to buffer more incoming packets. Each descriptor consists of the descriptor entry itself plus an associated data buffer, and the data buffer size depends on the MTU; the maximum MTU size is 16110 for this driver.
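
As a rough sketch of how these NIC settings can be adjusted on a system like the one in this study, the commands below raise the MTU and reload the e1000 driver with larger descriptor rings and a longer interrupt delay. The interface name is an example, and the module parameter names and per-adapter comma syntax follow the e1000 driver documentation of that era; verify them with modinfo e1000 on the target system.

    # Enable jumbo frames on one Gigabit interface (the switches and routers must support it).
    ifconfig eth1 mtu 9000

    # Reload the e1000 driver with 4096-entry descriptor rings and an interrupt delay of 64
    # on all four adapters (one comma-separated value per adapter).
    rmmod e1000
    modprobe e1000 RxDescriptors=4096,4096,4096,4096 TxDescriptors=4096,4096,4096,4096 \
        RxIntDelay=64,64,64,64 TxIntDelay=64,64,64,64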

Other Gigabit Ethernet studies indicate that for every gigabit of network traffic a system processes, approximately 1GHz of CPU processing power is needed to perform the work. Our experiments confirmed this, but adding faster processors and more Gigabit Ethernet network cards did not scale network throughput linearly, even when the number of processors equaled the number of Gigabit Ethernet NICs. Other bottlenecks, such as the system buses, limit the scalability of Gigabit Ethernet NICs on SMP systems. In the four-NIC test, only three of the four NICs achieved media speed.

These tests were run between a four-way 1.6GHz Pentium 4 machine and four client machines (1.0GHz Pentium III), each capable of driving one Gigabit NIC at media speed. All machines ran the vanilla Linux 2.4.17 SMP kernel, and the e1000 driver was version 4.1.7. These are Netperf3 PACKET_STREAM and PACKET_MAERTS runs, all with an MTU of 1500 bytes. The Pentium 4 machine had 16GB of RAM, and its four NICs were distributed equally between 100MHz and 133MHz PCI-X slots. Hyper-Threading was disabled on the Pentium 4 system.

The PACKET_STREAM test transfers raw data without any TCP/IP headers; none of the packets transmitted or received go through the TCP/IP layers. This test exercises only the CPU, memory, PCI bus, and the NIC's driver, to look for bottlenecks in those areas. The interrupt delay and the transmit and receive descriptors for the Gigabit driver were tuned with different values to determine what works best for this environment. Different socket buffer sizes were tried as an additional tuning element.
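
The socket buffer sizes used in these runs can also be raised system-wide through the net.core settings, as in the following sketch. The benchmark itself may set its socket buffer size per connection; these system-wide values simply allow and default to the largest size used in Table 21-5 (131072 bytes).

    # Allow and default to 128KB (131072-byte) socket buffers.
    sysctl -w net.core.rmem_max=131072
    sysctl -w net.core.wmem_max=131072
    sysctl -w net.core.rmem_default=131072
    sysctl -w net.core.wmem_default=131072

    # The same values can also be written directly under /proc/sys/net/core/, for example:
    echo 131072 > /proc/sys/net/core/rmem_default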

Table 21-5 shows that the maximum throughput achieved with four NICs is 2808Mbps. The tuning that achieved this throughput was 4096 transmit and receive descriptors, an interrupt delay of 64 on both the receive and send sides, and a 131072-byte (128KB) socket buffer.

Table 21-5. Intel Gigabit NICs Tuning Case Study

Receive / Four Adapters / Four Processors / 2.4.17 Kernel

Descriptors,      Socket        Total Throughput   Dropped Packets %        % CPU
Interrupt Delay   Size (bytes)  (Mbps)             Driver   Upper Layers    (vmstat)

256,16            131072        2472               35       72              41.0
256,16            65536         2496               35       72              39.2
256,16            32768         2544               33       71              38.2
256,64            131072        2628               37       68              46.6
256,64            65536         2640               35       69              43.2
256,64            32768         2700               24       68              39.6
256,96            131072        2628               25       64              45.2
256,96            65536         2664               23       64              45.6
256,96            32768         2724               19       61              41.2
4096,16           131072        2628               28       98              27.8
4096,16           65536         2664               29       98              28.0
4096,16           32768         2640               28       98              27.8
4096,64           131072        2808               19       97              28.6
4096,64           65536         2784               20       97              28.4
4096,64           32768         2664               25       97              30.4
4096,96           131072        2772               18       97              28.2
4096,96           65536         2748               19       97              28.6
4096,96           32768         2616               21       97              28.6


VolanoMark

The VolanoMark benchmark creates 10 chat rooms of 20 clients each. Each room echoes the messages from one client to the other 19 clients in the room. The benchmark, which is not open source, consists of the VolanoChat server and a second program that simulates the clients in the chat rooms. It is used to measure raw server performance and network scalability. VolanoMark can be run in two modes: loopback and network. The loopback mode tests raw server performance, and the network mode tests network scalability. VolanoMark uses two parameters to control the size and number of chat rooms.

The VolanoMark benchmark creates client connections in groups of 20 and measures how long it takes the server to take turns broadcasting all the clients' messages to the group. At the end of the loopback test, it reports the average number of messages transferred per second as its score. In network mode, the metric is the number of connections between the clients and the server. The Linux kernel components stressed by this benchmark include TCP/IP, the scheduler, and signals.

Figure 21-5 shows the results of the VolanoMark benchmark run in loopback mode. The improvements shown resulted from several factors, such as network tuning, kernel enhancements, and two prototype patches. The Performance team at the IBM Linux Technology Center created the prototype patches, but they have not been submitted to the upstream kernel. The first patch is the priority preemption patch, which enables a process to run longer without being preempted by a higher-priority process. Because turning off priority preemption is not acceptable for all workloads, the patch is enabled through a new scheduler tuning parameter. The other patch, the TCP soft affinity patch, is related to TCP/IP, so a detailed discussion of it is beyond the scope of this chapter.

Figure 21-5. VolanoMark case study.


The loopback code path in the TCP/IP stack is inefficient. The send and receive threads of this benchmark always execute on different CPUs in an SMP system, which results in the same data being loaded into two different L2 caches. To make the sender and receiver threads of a connection execute on the same CPU, the code was changed to wake the receiver thread on the sender's CPU. Because the O(1) scheduler tries to keep a thread on the same processor, once the receiver thread is moved to the sender's CPU it stays there, eliminating the duplicate loading of the same data on two different CPUs.

Patch1 in Figure 21-5 is the SMP Scalable Timer patch, Patch2 is the Scheduler Priority Preemption patch, and Patch3 is the TCP Soft Affinity patch. The SMP Scalable Timer patch is already included in the 2.5 Linux kernel. The two prototype patches were written to illustrate problems specific to this benchmark; they are enabled through a kernel configuration option and can be turned on through tuning for workloads that benefit from them. The Priority Preemption patch was submitted to the open source community, but the TCP Soft Affinity patch has not been published outside this book.

Additional enhancements that are part of the 2.5 and 2.6 kernels and that improved the performance of VolanoMark include the O(1) scheduler and the SMP scalable timer.

SPECWeb99

The SPECWeb99 benchmark presents a demanding workload to a web server. This workload requests 70% static pages and 30% simple dynamic pages. Sizes of the web pages range from 102 bytes to 921,000 bytes. The dynamic content models GIF advertisement rotation; there is no SSL content. SPECWeb99 is relevant because web serving, especially with Apache, is one of the most common uses of Linux servers. Apache is rich in functionality but is not designed for high performance. Apache was chosen as the web server for this benchmark because it currently hosts more web sites than any other web server on the Internet. SPECWeb99 is the accepted standard benchmark for web serving. SPECWeb99 stresses the following kernel components:

  • Scheduler

  • TCP/IP

  • Various threading models

  • SendFile

  • Zerocopy

  • Network drivers

Figures 21-6 and 21-7 show the results of SPECWeb99 using web static content and web dynamic content, respectively. Also included is a description of the hardware and software configurations used for each test.

Figure 21-6. SPECWeb99 static content case study.


Figure 21-7. SPECWeb99 dynamic content case study.


Some of the issues we addressed that resulted in the improvements shown include adding the O(1) scheduler and RCU dcache kernel patches and adding a new dynamic-API mod_specweb module to Apache.
