Analysis Methodology

A strategy for improving Linux performance and scalability includes running several industry-accepted and component-level benchmarks, selecting the appropriate hardware and software, developing benchmark run rules, setting performance and scalability targets, and measuring, analyzing, and improving performance and scalability.

Performance is defined as raw throughput on a uniprocessor (UP) or symmetric multiprocessor (SMP) system. A distinction is made between SMP scalability (the number of CPUs) and resource scalability (for example, the number of network connections).

Hardware and Software

The architecture assumed for the majority of this discussion is IA-32 (that is, x86), with one to eight processors. Also examined are the issues associated with nonuniform memory access (NUMA) IA-32 and NUMA IA-64 architectures. The selection of hardware typically aligns with the selection of the benchmark and the associated workload, while the selection of software aligns with the evaluator's middleware strategy and/or open-source middleware. The following list describes several workloads that are typically targeted for Linux server performance evaluation, each with the sample hardware discussed in this chapter.

  • Database. You can use a query database or an online transaction processing benchmark. The hardware is an eight-way SMP system with a large disk configuration. IBM DB2 for Linux is the database software used, and the SCSI controllers are IBM ServeRAID 4H. The database is targeted for eight-way SMP.

  • SMB file serving. A typical benchmark is NetBench. The hardware is a four-way SMP system with as many as 48 clients driving the SMP server. The middleware is Samba (open source). SMB file serving is targeted for four-way SMP.

  • Web serving. The benchmark is SPECweb99. The hardware is an eight-way SMP with a large memory configuration and as many as 32 clients. For the purposes of this discussion, the benchmarking was conducted for research purposes only and was noncompliant (see this chapter's "Acknowledgments" section for details). The web server is Apache, the most popular open-source web server.

  • Linux kernel version. The level of the kernel.org kernel (2.4.x, 2.6.x, or 2.7.x) used is benchmark-dependent and is discussed later, in the "Benchmark" section. The Linux distribution selected is Red Hat 7.1 or 7.2, to simplify administration. The focus is kernel performance, not the performance of the distribution, so the Red Hat kernel is replaced with a kernel.org kernel plus the patches under evaluation.

Run Rules

During benchmark setup, run rules are developed that detail how the benchmark is installed, configured, and run, and how results are to be interpreted. The run rules serve several purposes:

  • They define the metric that will be used to measure benchmark performance and scalability (for example, messages/sec).

  • They ensure that the benchmark results are suitable for measuring the performance and scalability of the workload and kernel components.

  • They provide a documented set of instructions that will allow others to repeat the performance tests.

  • They define the set of data that is collected so that the performance and scalability of the system under test (SUT) can be analyzed to determine where bottlenecks exist. (A minimal driver-script sketch illustrating the last two points follows this list.)
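
To make the last two points concrete, the following is a minimal sketch of a driver script that runs a benchmark according to documented run rules, records the defined metric, and saves enough context to repeat the run. The launcher name, its output format, and the messages/sec metric are hypothetical placeholders, not part of any particular benchmark kit.

    #!/usr/bin/env python3
    """Minimal run-rules driver sketch; launcher, output format, and metric are hypothetical."""
    import re
    import subprocess
    import time
    from pathlib import Path

    BENCH_CMD = ["./run_benchmark.sh"]                    # assumed benchmark launcher
    METRIC_RE = re.compile(r"messages/sec:\s*([\d.]+)")   # assumed output format

    def run_once(results_dir: Path) -> float:
        """Run the benchmark once, archive the run context, and return the metric."""
        results_dir.mkdir(parents=True, exist_ok=True)
        # Record enough context to make the run repeatable and auditable.
        (results_dir / "kernel").write_text(Path("/proc/version").read_text())
        (results_dir / "command").write_text(" ".join(BENCH_CMD) + "\n")
        out = subprocess.run(BENCH_CMD, capture_output=True, text=True, check=True).stdout
        (results_dir / "benchmark.out").write_text(out)
        match = METRIC_RE.search(out)
        if match is None:
            raise RuntimeError("metric not found in benchmark output")
        return float(match.group(1))

    if __name__ == "__main__":
        metric = run_once(Path("results") / time.strftime("%Y%m%d-%H%M%S"))
        print("messages/sec =", metric)

In practice the run rules would also specify warm-up periods, the number of runs to average, and exactly which SUT data to archive with each result.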

These run rules are the foundation of benchmark execution. Setting benchmark targets, which typically occurs after the run rules have been defined, is the next step in the evaluation process.

Setting Targets

Performance and scalability targets for a benchmark are associated with a specific SUT (hardware and software configuration). Setting performance and scalability targets requires the following:

  • Baseline measurements to determine the performance of the benchmark on the baseline kernel version. Baseline scalability is then calculated.

  • Initial performance analysis to determine a promising direction for performance gains (for example, a profile indicating the scheduler is very busy might suggest trying an O(1) scheduler).

  • Comparison of baseline results with similar published results (for example, SPECweb99 publications for the same web server on a similar eight-way system, available from spec.org). It is also desirable to compare the Linux results with those of other operating systems. Given the competitive data and the baseline results, select performance targets for UP and SMP machines.

  • Finally, a target may be predicated on making changes to the application. For example, if the method the application uses for asynchronous I/O is known to be inefficient, it may be desirable to set the performance target on the assumption that the I/O method will be changed.

Measurement, Analysis, and Tuning

Benchmark runs are initiated according to the run rules to measure both performance and scalability in terms of the defined metric. When calculating SMP scalability for a given machine, there is a choice between computing it relative to the performance of a UP kernel and computing it relative to the performance of an SMP kernel with the number of processors set to one (1P). What matters is consistency: either baseline is acceptable as long as the same one is used whenever results are compared.
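
As a worked illustration (with hypothetical throughput numbers), the choice affects only which single-processor figure serves as the baseline. The sketch below reports both speedup and scaling efficiency against each baseline.

    def speedup(smp_throughput, baseline_throughput):
        """N-way throughput relative to a single-processor baseline."""
        return smp_throughput / baseline_throughput

    def efficiency(smp_throughput, num_cpus, baseline_throughput):
        """Speedup divided by the processor count; 1.0 would be perfect scaling."""
        return speedup(smp_throughput, baseline_throughput) / num_cpus

    # Hypothetical throughput numbers, for illustration only.
    tput_8way = 5200.0    # SMP kernel, 8 processors
    tput_1p = 800.0       # SMP kernel restricted to 1 processor
    tput_up = 850.0       # UP kernel

    for label, baseline in (("1P baseline", tput_1p), ("UP baseline", tput_up)):
        print(label, speedup(tput_8way, baseline), efficiency(tput_8way, 8, baseline))

Either baseline yields a valid trend; mixing the two within one set of comparisons is what invalidates the results.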

Both the hardware and software configurations are tuned before performance and scalability are analyzed. Tuning is an iterative cycle of adjusting and measuring: components of the system such as CPU utilization and memory usage are measured, and system hardware parameters, system resource parameters, and middleware parameters are adjusted as appropriate. Tuning is one of the first steps of performance analysis; without it, scaling results can be misleading, reflecting some other issue rather than a kernel limitation.

The first step in analyzing the SUT's performance and scalability is to understand the benchmark and the workload tested. Initial performance analysis is conducted against a tuned system; sometimes that analysis uncovers the need for further changes to the tuning parameters.

Analyzing the SUT's performance and scalability requires a set of performance tools. Using open-source community (OSC) tools whenever possible is desirable, because it makes it easier to post analysis data to the OSC to illustrate performance and scalability bottlenecks. It also allows others in the OSC to replicate the results, or to understand them better by running the same tool on another application with which they can experiment. In many instances, ad hoc performance tools are developed to gain a better understanding of a specific performance bottleneck. Ad hoc performance tools are usually simple tools that instrument a specific component of the Linux kernel, and it is advantageous to share them with the OSC. A sample of the available performance tools includes the following:

  • /proc file system. meminfo, slabinfo, interrupts, network stats, I/O stats, and so on. (A minimal snapshot sketch follows this list.)

  • SGI's lockmeter. For SMP lock analysis.

  • SGI's kernel profiler (kernprof). Time-based profiling, performance counter-based profiling, annotated call graph (ACG) of kernel space only.

  • IBM Trace Facility. Single-step tracing (mtrace), plus time-based and performance counter-based profiling for both user and system space.
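
The /proc collection mentioned in the first item above can be as simple as saving a few files before and after a run and comparing them. The following is a minimal sketch under that assumption; note that some files (for example, /proc/slabinfo on some kernels) may require root access to read.

    """Sketch: snapshot selected /proc files before and after a benchmark run."""
    import time
    from pathlib import Path

    PROC_FILES = ["/proc/meminfo", "/proc/slabinfo", "/proc/interrupts"]

    def snapshot(tag, out_dir="proc-snapshots"):
        dest = Path(out_dir) / f"{tag}-{time.strftime('%H%M%S')}"
        dest.mkdir(parents=True, exist_ok=True)
        for name in PROC_FILES:
            src = Path(name)
            # /proc files report a size of zero, so read the text rather than copying the file.
            (dest / src.name).write_text(src.read_text())

    snapshot("before")
    # ... run the benchmark here ...
    snapshot("after")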

Ad hoc performance tools help you further understand a specific aspect of the system. Examples are as follows (a rough sketch of such a tool appears after this list):

  • sstat. Collects scheduler statistics.

  • schedret. Determines which kernel functions are blocking for investigation of idle time.

  • acgparse. Post-processes kernprof ACG.

  • copy in/out instrumentation. Determines alignment of buffers, size of copy, and CPU utilization of copy in/out algorithm.
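
None of these ad hoc tools ships with the kernel, and their sources are not reproduced here. As a rough stand-in for the kind of small, single-purpose collector they represent (not the actual sstat), the following sketch samples the per-CPU scheduler counters that a kernel with scheduler statistics enabled exposes through /proc/schedstat, and prints the counter deltas over an interval.

    """Rough stand-in for an ad hoc scheduler-statistics collector (not the actual sstat).

    Assumes a kernel that exposes /proc/schedstat; field meanings vary with the
    schedstat version, so only raw per-CPU counter deltas are printed.
    """
    import time

    def read_cpu_counters():
        counters = {}
        with open("/proc/schedstat") as f:
            for line in f:
                fields = line.split()
                if fields and fields[0].startswith("cpu"):
                    counters[fields[0]] = [int(v) for v in fields[1:]]
        return counters

    before = read_cpu_counters()
    time.sleep(10)                     # sampling interval
    after = read_cpu_counters()

    for cpu, end in sorted(after.items()):
        start = before.get(cpu, [0] * len(end))
        print(cpu, [e - s for e, s in zip(end, start)])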

Performance analysis data is then used to identify performance and scalability bottlenecks. Locating them requires a broad understanding of the SUT, a more specific understanding of the Linux kernel components that the benchmark stresses, and an understanding of the kernel source code responsible for the bottleneck. In addition, the Linux OSC can be leveraged to help isolate performance-related issues and to develop patches that resolve them.

Exit Strategy

An evaluation of Linux kernel performance may require several cycles of running the benchmark, analyzing the results to identify performance and scalability bottlenecks, addressing those bottlenecks by integrating patches into the Linux kernel, and running the benchmark again. The patches may be existing ones found in the OSC or newly developed ones. A set of criteria determines when Linux is "good enough" and the process can stop.

First, if the targets have been met and no outstanding Linux kernel issues remain that would significantly improve the benchmark's performance, Linux is "good enough," and it is better to move on to other issues. Second, if several cycles of performance analysis have been completed and bottlenecks still remain, consider the trade-off between the development cost of continuing and the benefit of any additional performance gains. If the development cost is too high relative to the potential improvement, discontinue the analysis and document the rationale.

In either case, review the remaining outstanding kernel-related issues, assess which benchmarks could be used to address them, examine any existing data on those issues, and decide, based on this collective information, whether to analyze the kernel component (or collection of components) involved.
