This section presents the results of the benchmarks used for the case study and shows the cumulative improvements from all the performance enhancement features discussed so far. Some of the benchmarks captured the cumulative gain of all the features, whereas others captured selected gains for specific workloads.

NetBench

Figure 21-4 summarizes the improvements in NetBench benchmark throughput achieved through various Linux kernel and Samba enhancements and tuning. These tests were conducted on a four-way Pentium 4 system with 1.5GHz processors, four gigabit Ethernet adapters, 2GB of memory, and fourteen 15,000 RPM SCSI disks. SUSE version 8.0 was used for all tests. Each subsequent test added one new kernel, tuning, or Samba change. The NetBench Enterprise Disk Suite by Ziff-Davis was used for all tests.

Figure 21-4. NetBench case study.

The following list describes the column names in Figure 21-4:
Netperf3 (Gigabit Ethernet Tuning Case Study)

Gigabit Ethernet NICs are becoming cheaper and are quickly replacing 100Mb Ethernet cards. System manufacturers now include Gigabit Ethernet on motherboards, and system suppliers and integrators are choosing Gigabit Ethernet network cards and switches for connecting disk servers, PC computer farms, and departmental backbones. This section looks at gigabit network performance on the Linux operating system and at how tuning the Gigabit Ethernet adapter improves network performance. The Gigabit Ethernet network cards used here (Intel Gigabit Ethernet and Acenic Gigabit Ethernet) have a few additional features that facilitate handling high throughput. These features include support for jumbo frames (larger MTU sizes), interrupt delay, and configurable TX/RX descriptor counts:
Other Gigabit Ethernet studies indicate that for every gigabit of network traffic a system processes, approximately 1GHz of CPU processing power is needed to perform the work. Our experiments confirmed this, but adding faster processors and more Gigabit Ethernet network cards did not scale network throughput, even when the number of gigahertz-class processors equaled the number of Gigabit Ethernet NICs. Other bottlenecks, such as the system buses, limit the scalability of Gigabit Ethernet NICs on SMP systems. The NIC test showed that media speed was achieved on only three of the four NICs. These tests were run between a four-way 1.6GHz Pentium 4 machine and four client machines (1.0GHz Pentium 3), each capable of driving one gigabit NIC at the media limit. All machines ran the Linux 2.4.17 SMP vanilla kernel, and the e1000 driver was version 4.1.7. These are Netperf3 PACKET_STREAM and PACKET_MAERTS runs, all with an MTU of 1500 bytes. The Pentium 4 machine had 16GB of RAM and four CPUs, with hyperthreading disabled. The four NICs were distributed equally between 100MHz and 133MHz PCI-X slots. The PACKET_STREAM test transfers raw data without TCP/IP headers: none of the packets transmitted or received passes through the TCP/IP layers, so the test exercises only the CPU, memory, PCI bus, and the NIC's driver, exposing bottlenecks in those areas. The interrupt delay and the transmit and receive descriptor counts for the gigabit driver were tuned with different values to determine what worked best for this environment, and different socket buffer sizes were tried as well. Table 21-5 shows that the maximum throughput achieved was 2808Mbps across the four NICs, with the best results coming from 4096 transmit and receive descriptors, an interrupt delay of 64 on both the receive and send sides, and a 132KB socket buffer size.
VolanoMark

The VolanoMark benchmark creates 10 chat rooms of 20 clients each. Each room echoes the messages from one client to the other 19 clients in the room. This benchmark, which is not open source, consists of the VolanoChat server and a second program that simulates the clients in the chat rooms. It measures raw server performance and network scalability. VolanoMark can be run in two modes: loopback, which tests raw server performance, and network, which tests network scalability. Two parameters control the size and number of chat rooms. The benchmark creates client connections in groups of 20 and measures how long the server takes to broadcast, in turn, all the clients' messages to the group. At the end of the loopback test, it reports a score of the average number of messages transferred per second; in network mode, the metric is the number of connections between the clients and the server. The Linux kernel components stressed by this benchmark include TCP/IP, the scheduler, and signals.

Figure 21-5 shows the results of the VolanoMark benchmark run in loopback mode. The improvements shown resulted from several factors, including network tuning, kernel enhancements, and two prototype patches. The Performance team at the IBM Linux Technology Center created the prototype patches, but they have not been submitted to the upstream kernel. The first is the priority preemption patch, which enables a process to run longer without being preempted by a higher-priority process. Because turning off priority preemption is not acceptable for all workloads, the patch is enabled through a new scheduler tuning parameter. The other, the TCP soft affinity patch, relates to TCP/IP, so a detailed discussion of it is beyond the scope of this chapter.

Figure 21-5. VolanoMark case study.

The loopback code path in the TCP/IP stack is inefficient. The send and receive threads of this benchmark always execute on different CPUs in an SMP system, which loads the same data into two different L2 caches. To make the sender and receiver threads of a connection execute on the same CPU, the code was changed to wake the receiver thread on the sender's CPU. Because the O(1) scheduler tries to keep a thread on the same processor, once the receiver thread is moved to the sender's CPU it stays there, eliminating the duplicate loading of the same data into two different caches. Patch1 in Figure 21-5 is the SMP Scalable Timer patch, Patch2 is the Scheduler Priority Preemption patch, and Patch3 is the TCP Soft Affinity patch. The SMP Scalable Timer patch is already included in the 2.5 Linux kernel. The two prototype patches were written to address problems unique to this benchmark; they are enabled in the kernel through a configuration option and can be made available through tuning to suit appropriate workloads. The Priority Preemption patch was submitted to the open source community, but the TCP soft affinity patch has not been disclosed except in this book. Additional enhancements that are part of the 2.5 and 2.6 kernels and that improved VolanoMark performance include the O(1) scheduler and the SMP scalable timer.

SPECWeb99

The SPECWeb99 benchmark presents a demanding workload to a web server. The workload requests 70% static pages and 30% simple dynamic pages, with page sizes ranging from 102 bytes to 921,000 bytes. The dynamic content models GIF advertisement rotation; there is no SSL content. SPECWeb99 is relevant because web serving, especially with Apache, is one of the most common uses of Linux servers. Apache is rich in functionality but is not designed for high performance.
Apache was chosen as the web server for this benchmark because it currently hosts more web sites than any other web server on the Internet. SPECWeb99 is the accepted standard benchmark for web serving. SPECWeb99 stresses the following kernel components:
Figures 21-6 and 21-7 show the results of SPECWeb99 with static and dynamic web content, respectively, along with a description of the hardware and software configurations used for each test.

Figure 21-6. SPECWeb99 static content case study.

Figure 21-7. SPECWeb99 dynamic content case study.

The improvements shown resulted from addressing several issues, including adding the O(1) scheduler and RCU dcache kernel patches and adding a new dynamic API mod_specweb module to Apache.