Microbenchmarks

This section looks at several types of microbenchmarks, including operating system benchmarks, disk benchmarks, network benchmarks, and application benchmarks.

Operating System Benchmarks

The operating system benchmarks discussed in this section are as follows:

  • LMbench

  • AIM7

  • AIM9

  • Reaim

  • SPEC SDET

LMbench

(http://www.bitmover.com/lmbench/)

LMbench is a suite of simple benchmarks in the true sense of a microbenchmark. It contains a series of small tests for measuring the latency and bandwidth of some of the most fundamental UNIX or Linux APIs. The bandwidth measures include reading from cached files, copying memory entirely at user level, moving data through a UNIX pipe, and some simple TCP benchmarks. Usually, these bandwidth measures are done by copying a block of memory or issuing a read() call in a loop, with calls to a system clock before and after the loop. Counting the number of bytes transferred per unit of time provides a measure of the overall bandwidth of the various APIs. The measures selected can then be compared between different operating systems, processor types, hardware memory subsystems, and so on for these basic APIs.
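The timing-loop technique can be sketched in a few lines of C. The following is an illustrative sketch only, not LMbench source; the file path, buffer size, and iteration count are arbitrary assumptions.

/* Sketch of a bandwidth measurement loop in the style described above.
 * Not LMbench code; the file name and sizes are arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BUFSIZE (64 * 1024)
#define ITERATIONS 1024

int main(void)
{
    char *buf = malloc(BUFSIZE);
    int fd = open("/tmp/testfile", O_RDONLY);   /* assume a large, cached file */
    struct timeval start, end;
    long long bytes = 0;

    if (fd < 0 || buf == NULL)
        return 1;

    gettimeofday(&start, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        ssize_t n = read(fd, buf, BUFSIZE);
        if (n <= 0) {                  /* wrap around to keep reading cached data */
            lseek(fd, 0, SEEK_SET);
            continue;
        }
        bytes += n;
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1e6;
    printf("%.2f MB/sec\n", bytes / secs / (1024 * 1024));
    free(buf);
    close(fd);
    return 0;
}

The same skeleton, with the read() replaced by a memory copy or a pipe transfer, yields the other bandwidth measures described above.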

LMbench also measures the rate at which an operating system switches from user level into the operating system's protected mode using a very simple system call. Because events like this happen much faster than the system's timer granularity, a common technique is to run a simple primitive in a loop and measure the number of calls to the primitive per some unit of time. Relatively straightforward math then enables a calculation of the average rate per transaction. Other primitives measured by LMbench include the establishment of TCP connections, creation of pipes, creation of processes, the rate at which signals are received, and so on. LMbench is fairly mature and has been careful to take into account some of the more complex side effects that typically plague benchmarking efforts. For instance, the test for process creation looks at the cost of process creation via fork(), as well as the cost of fork()+exit() and fork()+exec(). On memory tests, the documentation describing how to interpret the results points out the performance impacts of various hardware configurations, most specifically the impact of various sizes of memory caches. LMbench also takes into account compiler differences by recommending a common compiler that should provide equivalent results for all architectures. It is possible to use different compilers on different architectures, but the performance analyst must take this into account when comparing results. Although all of the tests are fairly simple in concept, the wealth of experience they embody makes it much cheaper to use an off-the-shelf benchmark for simple comparisons. Also, the publicly available test results, and in this case the publicly available source code, allow a performance analyst to easily compare any differences in different environments. In the following results, some entries are simple measures, such as the time for a basic system call like read() or write(); other results show throughputs for a variety of data transfer sizes.
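A minimal sketch of the loop-and-divide approach, timing an inexpensive system call, might look like the following; this is not LMbench's implementation, and the choice of getppid() and the iteration count are assumptions for illustration.

/* Sketch of timing a cheap system call by looping it many times and
 * dividing; illustrates the general technique only. */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const long iterations = 1000000;    /* arbitrary; long enough to swamp timer granularity */
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iterations; i++)
        getppid();                      /* a simple, fast system call */
    gettimeofday(&end, NULL);

    double usecs = (end.tv_sec - start.tv_sec) * 1e6 +
                   (end.tv_usec - start.tv_usec);
    printf("average cost: %.4f microseconds per call\n", usecs / iterations);
    return 0;
}

Because the loop overhead is tiny compared with even a fast system call, dividing the elapsed time by the iteration count gives a reasonable estimate of the per-call cost.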

LMbench Sample Output

This first section summarizes the machine being tested, including the kernel version (the output of uname(1)), the memory sizes to be tested, processor speeds, and so on.

 [lmbench2.0 results for Linux herkimer.ltc.austin.ibm.com 2.6.3  #1 SMP Wed Mar 10 19:51:47 CST 2004 i686 i686 i386 GNU/Linux] [LMBENCH_VER: Version-2.0.4 20030113111940] [ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m  32m 64m 128m 256m 512m] [DISKS: ] [DISK_DESC: ] [ENOUGH: 5000] [FAST: ] [FASTMEM: NO] [FILE: /usr/tmp/XXX] [FSDIR: /usr/tmp] [HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m  32m 64m 128m 256m] [INFO: INFO.herkimer.ltc.austin.ibm.com] [LOOP_O: 0.00000234] [MB: 512] [MHZ: 495 MHz, 2.02 nanosec clock] [...] [OS: i686-pc-linux-gnu] [TIMING_O: 0] [LMBENCH VERSION: lmbench-2alpha13] [USER: ] [HOSTNAME: herkimer.ltc.austin.ibm.com] [NODENAME: herkimer.ltc.austin.ibm.com] [SYSNAME: Linux] [PROCESSOR: i686] [MACHINE: i686] [RELEASE: 2.6.3] [VERSION: #1 SMP Wed Mar 10 19:51:47 CST 2004] [...]   

This section provides the output of several system calls run in a loop for a short period of time. The average time for each system call is calculated as the elapsed time divided by the number of calls successfully executed.

 Simple syscall: 0.4189 microseconds
 Simple read: 0.7907 microseconds
 Simple write: 0.6517 microseconds
 Simple stat: 47.4274 microseconds
 Simple fstat: 1.5631 microseconds
 Simple open/close: 54.3922 microseconds
 Select on 10 fd's: 6.7666 microseconds
 Select on 100 fd's: 34.5312 microseconds
 Select on 250 fd's: 82.4412 microseconds
 Select on 500 fd's: 159.6176 microseconds
 Select on 10 tcp fd's: 8.2204 microseconds
 Select on 100 tcp fd's: 52.6321 microseconds
 Select on 250 tcp fd's: 127.5116 microseconds
 Select on 500 tcp fd's: 248.5455 microseconds
 Signal handler installation: 1.742 microseconds
 Signal handler overhead: 9.899 microseconds
 Protection fault: 1.310 microseconds
 Pipe latency: 28.2951 microseconds
 AF_UNIX sock stream latency: 97.5933 microseconds
 Process fork+exit: 806.4286 microseconds
 Process fork+execve: 2265.0000 microseconds
 Process fork+/bin/sh -c: 10137.0000 microseconds
 File /usr/tmp/XXX write bandwidth: 5923 KB/sec n=2048, usecs=12993
 Pagefaults on /usr/tmp/XXX: 6 usecs

This section provides the size of an mmap region and the number of microseconds to complete a mapping of that size.

 "mappings 0.524288 30 1.048576 43 2.097152 71 4.194304 121 8.388608 219 16.777216 412 33.554432 848 67.108864 1691 134.217728 3326 268.435456 6750 536.870912 14383 

This section shows how long in microseconds it takes to complete a read from a file system.

 "File system latency 0k      1000    5307    10741 1k      1000    3835    7716 4k      1000    3770    7717 10k     1000    2602    6395 

This section provides latency timings for various network connections and a summary of the bandwidth of several local networking calls.

 UDP latency using localhost: 93.2020 microseconds
 TCP latency using localhost: 177.5119 microseconds
 RPC/tcp latency using localhost: 239.6613 microseconds
 RPC/udp latency using localhost: 140.2291 microseconds
 TCP/IP connection cost to localhost: 272.5500 microseconds
 Socket bandwidth using localhost: 29.04 MB/sec
 Avg xfer: 3.2KB, 41.8KB in 13.5080 millisecs, 3.09 MB/sec
 AF_UNIX sock stream bandwidth: 50.00 MB/sec
 Pipe bandwidth: 196.79 MB/sec

This section provides the rate in MBps of reads of various byte sizes, from 512 bytes to 512MB.

 "read bandwidth 0.000512 100.16 0.001024 182.24 0.002048 313.72 0.004096 469.94 0.008192 402.96 0.016384 238.67 0.032768 245.45 0.065536 244.97 0.131072 242.77 0.262144 228.47 0.524288 209.52 1.05 197.06 2.10 198.86 4.19 196.74 8.39 198.87 16.78 197.20 33.55 199.74 67.11 193.44 134.22 199.26 268.44 196.50 536.87 199.57 

This section provides the bandwidth of a complete open/read/close cycle in MBps (shown in the second column) for various block sizes (shown in the first column).

 "read open2close bandwidth 0.000512 7.85 0.001024 15.73 0.002048 30.36 0.004096 57.20 0.008192 80.01 0.016384 117.05 0.032768 156.55 0.065536 191.84 0.131072 208.20 0.262144 207.66 0.524288 195.78 1.05 194.32 2.10 192.59 4.19 197.70 8.39 195.38 16.78 198.77 33.55 196.26 67.11 197.06 134.22 197.39 268.44 187.94 536.87 197.56 

This section provides the throughput in MBps (shown in the second column) of mmap read access of various sizes (shown in the first column) ranging from 512 bytes to 512MB. A read implies that the data is mmap()'d and accessed/touched by the processor, leading to page faults by the operating system. This should provide a fair comparison to similar operations doing a read() system call from a file.

 "Mmap read bandwidth 0.000512 1589.89 0.001024 1747.58 0.002048 1849.51 0.004096 1876.64 0.008192 1925.34 0.016384 1806.04 0.032768 945.75 0.065536 939.72 0.131072 949.63 0.262144 786.75 0.524288 389.81 1.05 285.60 2.10 280.14 4.19 277.49 8.39 280.36 16.78 277.84 33.55 280.44 67.11 277.87 134.22 280.45 268.44 277.83 536.87 280.41 

This section provides the same measures as given in the preceding section, with the full open/mmap/close cycle, again where the data is accessed after the mmap(), and before the close().

 "Mmap read open2close bandwidth 0.000512 6.47 0.001024 12.99 0.002048 25.54 0.004096 49.57 0.008192 88.09 0.016384 142.55 0.032768 222.20 0.065536 303.94 0.131072 382.90 0.262144 373.09 0.524288 255.05 1.05 210.47 2.10 212.69 4.19 213.20 8.39 216.84 16.78 214.64 33.55 217.02 67.11 215.04 134.22 217.51 268.44 215.38 536.87 216.09 

This section measures the rate at which the C library's bcopy() function can copy unaligned data. Unaligned data often requires more complex copying algorithms than aligned data. For instance, aligned data may be copied a double word at a time (64 bits at a time), but unaligned data may need to be copied by the library one byte at a time. In particular, the bcopy() routine needs to ensure that it never attempts to read a byte that is not present in the calling process's address space. This often makes an unoptimized bcopy() routine extremely slow. As the size of the data to be copied (in the first column) increases, the rate of data copying of a good bcopy() implementation should directly approach that of the aligned bcopy() measurements that follow. A simplistic solution may iterate through the data one byte at a time and be substantially slower for copying large blocks of data. An incorrectly optimized implementation could manage to generate a segmentation fault via a reference to data that is not mapped into the application's address space.

 "libc bcopy unaligned 0.000512 870.40 0.001024 1300.36 0.002048 1755.31 0.004096 2095.21 0.008192 2166.23 0.016384 376.98 0.032768 383.18 0.065536 379.35 0.131072 382.55 0.262144 252.74 0.524288 196.17 1.05 183.64 2.10 186.51 4.19 185.17 8.39 187.81 16.78 187.52 33.55 190.23 67.11 188.79 134.22 190.93 268.44 189.05 

This version is the same as the preceding version, where the first column is the amount of data being copied (512 bytes to 512MB) and the second is the throughput in MBps.

 "libc bcopy aligned 0.000512 870.49 0.001024 1299.05 0.002048 1755.31 0.004096 2093.22 0.008192 2048.00 0.016384 376.47 0.032768 383.11 0.065536 379.29 0.131072 382.55 0.262144 259.08 0.524288 196.61 1.05 183.96 2.10 186.65 4.19 185.68 8.39 188.25 16.78 187.65 33.55 190.75 67.11 189.32 134.22 191.57 268.44 191.24 "unrolled bcopy unaligned 0.000512 1045.57 0.001024 1033.53 0.002048 1044.55 0.004096 1033.31 0.008192 996.75 0.016384 342.01 0.032768 335.21 0.065536 332.38 0.131072 318.58 0.262144 241.38 0.524288 173.98 1.05 159.09 2.10 160.30 4.19 159.16 8.39 160.63 16.78 159.23 33.55 159.44 67.11 159.55 134.22 161.25 268.44 159.70 "unrolled partial bcopy unaligned 0.000512 5399.13 0.001024 5122.18 0.002048 5342.10 0.004096 5665.45 0.008192 3491.16 0.016384 386.58 0.032768 387.40 0.065536 387.93 0.131072 388.87 0.262144 245.68 0.524288 180.23 1.05 163.28 2.10 164.81 4.19 164.77 8.39 166.05 16.78 165.24 33.55 167.56 67.11 166.40 134.22 166.58 

AIM7 and AIM9

AIM Technologies released two commonly used benchmarks in the mid-1980s. Caldera/SCO acquired AIM Technologies and released the AIM7 and AIM9 benchmarks under the GPL license in 1999. (See http://www.caldera.com/developers/community/contrib/aim.html for details.) The source can be downloaded from ftp://ftp.sco.com/pub/opensource/ or http://sourceforge.net/projects/aimbench. Although both AIM7 and AIM9 are old enough to be nearly obsolete, they are interesting primarily because they are freely available under the GPL in source form and provide a useful performance perspective in understanding today's Linux systems and the workloads running on those systems.

AIM7

(http://cvs.sourceforge.net/viewcvs.py/ltp/benchmarks/)

The AIM7 benchmark is often referred to as the AIM Multiuser Benchmark because it focuses on systems that support multiple interactive users. Although the benchmark is somewhat dated compared to the tasks that multiple users run on a single system today, the test still stresses some of the core subsystems of the underlying operating system in a fairly generic way. For instance, the AIM7 test focuses on running a large number of tasks simultaneously, which stresses the operating system's scheduler, as well as the process creation and process exit capabilities. The test attempts to use a reasonable amount of storage that is scaled to the number of users in this synthetic workload; however, the amount of storage actually used by the benchmark is fairly low for today's storage-hungry workloads. The test minimizes the use of floating-point and complex mathematical operations, making it less useful as a benchmark for the high-performance computing needs of the scientific and technical communities. Other benchmarks focus on more commonly used matrix multiplications, problem partitioning, and computation-intensive workloads. AIM7 spends a fairly high percentage of time sorting and searching through relatively large quantities of data, although the amount of data is probably low by today's standards. Also, today's workload mix tends to offload operations like that to databases rather than more standard brute-force search and sort algorithms. AIM7 also attempts to focus more on the system libraries and system calls than some other benchmarks. When used as part of a comprehensive set of benchmarking, AIM7 provides useful input into the comparative performance of various UNIX systems, most of which purport to supply similar capabilities on different implementations of the operating system and on different hardware.

AIM9

(http://cvs.sourceforge.net/viewcvs.py/ltp/benchmarks/)

AIM9 is referred to as the AIM Independent Resource Benchmark. It has a more component-oriented focus on benchmarking than AIM7. In many respects, it provides a view that is similar to that of LMbench in that it measures discrete capabilities such as additions per second or sorts per second. This is in contrast to AIM7, which provides a measure of more user-visible operations. AIM9 attempts to avoid measures that are impacted by the operating system's memory subsystem or scheduler; instead it focuses on the processor-bound capabilities. AIM9 measures a slightly different set of underlying operations than LMbench. In particular, AIM9 has a larger focus on integer and floating-point calculations, while also providing another view of disk performance testing, and localhost networking tests such as TCP, UDP, fifos, and pipes. The tests typically run for a short, fixed period of time and measure the number of underlying operations completed during that time. The specific tests, the amount of time for each test to run, and a number of environmental factors, such as locations of various files and which compiler to use, can be easily reconfigured.

The performance of a system is more than the sum of its parts. The complex interactions of system calls, locking, network latency, disk latency, user input, application dependencies, and many other factors combine in ways that are often difficult to predict. Component viewpoints such as those provided by LMbench and AIM9 may be important in understanding raw processor power, general operating system comparisons, and performance impacts of certain critical application operations, but it is rarely possible to extrapolate the performance of a complex workload from such a narrowly focused benchmark. However, understanding the component limitations in terms of performance and throughput helps provide useful background when trying to analyze performance problems in more complex workload scenarios.

Sample AIM9 Output

AIM Independent Resource Benchmark - Suite IX v1.1, January 22, 1996 Copyright (c) 1996 - 2001 Caldera International, Inc. All Rights Reserved Machine's name : Machine's configuration : Number of seconds to run each test [2 to 1000] : Path to disk files : Starting time: Fri Feb 21 11:41:25 2003 Projected Run Time: 1:00:00 Projected finish: Fri Feb 21 12:41:25 2003 ---------------------------------------------------------------------------- Test Test Elapsed Iteration Iteration Operation Number Name Time (sec) Count Rate Rate (loops/sec) (ops/sec) ---------------------------------------------------------------------------- 1 add_double 60.06 835 13.90276 250249.75 Thousand Double Precision Additions/second 2 add_float 60.03 1252 20.85624 250274.86 Thousand Single Precision Additions/second 3 add_long 60.02 2060 34.32189 2059313.56 Thousand Long Integer Additions/second 4 add_int 60.01 2060 34.32761 2059656.72 Thousand Integer Additions/second 5 add_short 60.00 5148 85.80000 2059200.00 Thousand Short Integer Additions/second 6 creat-clo 60.00 16990 283.16667 283166.67 File Creations and Closes/second 7 page_test 60.00 12495 208.25000 354025.00 System Allocations & Pages/second 8 brk_test 60.01 4645 77.40377 1315864.02 System Memory Allocations/second 9 jmp_test 60.00 372707 6211.78333 6211783.33 Non-local gotos/second 10 signal_test 60.00 17050 284.16667 284166.67 Signal Traps/second 11 exec_test 60.00 3499 58.31667 291.58 Program Loads/second 12 fork_test 60.00 2607 43.45000 4345.00 Task Creations/second 13 link_test 60.00 97321 1622.01667 102187.05 Link/Unlink Pairs/second 14 disk_rr 60.00 1147 19.11667 97877.33 Random Disk Reads (K)/second 15 disk_rw 60.01 976 16.26396 83271.45 Random Disk Writes (K)/second 16 disk_rd 60.00 3934 65.56667 335701.33 Sequential Disk Reads (K)/second 17 disk_wrt 60.00 1843 30.71667 157269.33 Sequential Disk Writes (K)/second 18 disk_cp 60.01 1296 21.59640 110573.57 Disk Copies (K)/second 19 sync_disk_rw 60.41 14 0.23175 593.28 Sync Random Disk Writes (K)/second 20 sync_disk_wrt 60.61 32 0.52797 1351.59 Sync Sequential Disk Writes (K)/second 21 sync_disk_cp 61.57 33 0.53598 1372.10 Sync Disk Copies (K)/second 22 disk_src 60.00 52799 879.98333 65998.75 Directory Searches/second 23 div_double 60.03 1541 25.67050 77011.49 Thousand Double Precision Divides/second 24 div_float 60.00 1540 25.66667 77000.00 Thousand Single Precision Divides/second 25 div_long 60.00 1854 30.90000 27810.00 Thousand Long Integer Divides/second 26 div_int 60.00 1854 30.90000 27810.00 Thousand Integer Divides/second 27 div_short 60.00 1854 30.90000 27810.00 Thousand Short Integer Divides/second 28 fun_cal 60.00 5082 84.70000 43366400.00 Function Calls (no arguments)/second 29 fun_cal1 60.00 11919 198.65000 101708800.00 Function Calls (1 argument)/second 30 fun_cal2 60.00 9282 154.70000 79206400.00 Function Calls (2 arguments)/second 31 fun_cal15 60.01 2860 47.65872 24401266.46 Function Calls (15 arguments)/second 32 sieve 60.63 67 1.10506 5.53 Integer Sieves/second 33 mul_double 60.01 975 16.24729 194967.51 Thousand Double Precision Multiplies/second 34 mul_float 60.00 973 16.21667 194600.00 Thousand Single Precision Multiplies/second 35 mul_long 60.00 88643 1477.38333 354572.00 Thousand Long Integer Multiplies/second 36 mul_int 60.00 88643 1477.38333 354572.00 Thousand Integer Multiplies/second 37 mul_short 60.00 70502 1175.03333 352510.00 Thousand Short Integer Multiplies/second 38 num_rtns_1 60.00 38632 643.86667 64386.67 Numeric Functions/second 39 new_raph 60.00 93107 1551.78333 
310356.67 Zeros Found/second 40 trig_rtns 60.01 2523 42.04299 420429.93 Trigonometric Functions/second 41 matrix_rtns 60.00 406791 6779.85000 677985.00 Point Transformations/second 42 array_rtns 60.04 1100 18.32112 366.42 Linear Systems Solved/second 43 string_rtns 60.00 840 14.00000 1400.00 String Manipulations/second 44 mem_rtns_1 60.01 2465 41.07649 1232294.62 Dynamic Memory Operations/second 45 mem_rtns_2 60.00 165114 2751.90000 275190.00 Block Memory Operations/second 46 sort_rtns_1 60.00 2375 39.58333 395.83 Sort Operations/second 47 misc_rtns_1 60.00 80427 1340.45000 13404.50 Auxiliary Loops/second 48 dir_rtns_1 60.00 17727 295.45000 2954500.00 Directory Operations/second 49 shell_rtns_1 60.00 4519 75.31667 75.32 Shell Scripts/second 50 shell_rtns_2 60.01 4516 75.25412 75.25 Shell Scripts/second 51 shell_rtns_3 60.00 4517 75.28333 75.28 Shell Scripts/second 52 series_1 60.00 1701469 28357.81667 2835781.67 Series Evaluations/second 53 shared_memory 60.00 203788 3396.46667 339646.67 Shared Memory Operations/second 54 tcp_test 60.00 59795 996.58333 89692.50 TCP/IP Messages/second 55 udp_test 60.00 113791 1896.51667 189651.67 UDP/IP DataGrams/second 56 fifo_test 60.00 326544 5442.40000 544240.00 FIFO Messages/second 57 stream_pipe 60.00 232668 3877.80000 387780.00 Stream Pipe Messages/second 58 dgram_pipe 60.00 228686 3811.43333 381143.33 DataGram Pipe Messages/second 59 pipe_cpy 60.00 376647 6277.45000 627745.00 Pipe Messages/second 60 ram_copy 60.00 2262411 37706.85000 943425387.00 Memory to Memory Copy/second -------------------------------------------------------------------------------------Projected Completion time: Fri Feb 21 12:41:25 2003 (u@Actual Completion time: Fri Feb 21 12:41:29 2003 Difference: 0:00:04 AIM Independent Resource Benchmark - Suite IX Testing over

Reaim

Reaim is a project that was in development by the Open Source Development Lab (OSDL) at the time this book was written. It is an effort to update AIM7 for today's workloads. Reaim's goal is to provide a mixed application workload with repeatable, concrete measures of throughput while remaining self-contained and easy to set up, configure, and run. It also aims to provide quick results that allow operating system developers to measure the benefits of specific changes or allow application performance analysts to quickly understand the benefits of system configuration changes or modifications of underlying operating system tuning parameters.

SPEC SDET

SPEC SDET is a retired benchmark that is still available to members of the Standard Performance Evaluation Corporation (SPEC) organization. SPEC SDET is another benchmark that measures a mixed workload, although the components of that workload mix are somewhat dated. However, it provides a fairly simple, self-contained environment to measure the throughput of a set of scripts. Each script is run for five iterations by default, with an increasing number of users. The number of times the scripts are run and the number of users are configurable. The output shows both scalability under load and absolute performance at any stress point.

SPEC SDET is the first benchmark covered in this chapter that is a controlled benchmark. Specifically, the SPEC organization has strict rules governing how the test must be run and how the results must be portrayed. In many cases, results may be subject to audit to establish validity. In some cases, there are slightly less strict rules for research or evaluation purposes, and such runs must be clearly identified, as specified by the license the end user agreed to when she acquired the benchmark. In our case, the results were run on an internal machine and were not conformant. In particular, the operating system had custom modifications, and the lack of controls means that the results may not be repeatable and should not be compared against any other SPEC SDET results.

Many major benchmarks come with such restrictions on how the tests can be run, how the results should be repeatable, and how the results should be auditable. These restrictions and publication guidelines are intended to protect the integrity of the benchmark, preventing anyone from generating misleading comparisons. The organizations providing the benchmarks strive to ensure that anyone attempting to compare performance or price performance numbers can do so and therefore make informed decisions about technologies, capabilities, and so on of a platform or software offering.

Because these benchmarks become a standard for comparison, the competition to publish the best numbers can be intense. And because the costs of running, tuning, and optimizing these large benchmarks can be so high, typically only large companies purchase licenses to use the benchmarks, procure the complex hardware environment required to run the benchmark, and apply the experience and resources needed to generate publishable numbers. Benchmarks also tend to drive new technologies, which improve the components measured by that benchmark. One constant concern that the major benchmark providers diligently address is the tendency of technologists to optimize specifically for the benchmark. If any component of the benchmark does not reflect actual, common usage of the components being measured, the benchmark could become less useful to real users. Therefore, the benchmarking companies constantly evaluate the results and work to improve the validity and utility of their benchmarks.

With all of that in mind, the following are the results from one such test that is less stringently controlled, primarily because it is retired, but also because the test is less complex and easily repeatable by an end user.

SPEC and the benchmark name SPECsdm are registered trademarks of the Standard Performance Evaluation Corporation. This benchmarking was performed for research purposes only. This benchmark run is noncompliant, and the results may not be compared with other results.

 SDET results 1-20 users, 5 iterations: 1 users: 2553 2011 2553 2553 2307 mean: 2395 stddev: 10.01% 2 users: 4931 4022 4260 4186 4586 mean: 4397 stddev: 8.24% 3 users: 5869 5373 5934 5046 6206 mean: 5685 stddev: 8.22% 4 users: 6666 5830 6000 6233 7422 mean: 6430 stddev: 9.91% 5 users: 6642 7142 7929 7826 7317 mean: 7371 stddev: 7.13% 6 users: 7970 8089 8470 8605 8307 mean: 8288 stddev: 3.16% 7 users: 9403 8936 9264 8456 8344 mean: 8880 stddev: 5.32% 8 users: 9230 10034 9085 9350 9411 mean: 9422 stddev: 3.86% 9 users: 10031 10836 10351 10220 10657 mean: 10419 stddev: 3.13% 10 users: 11320 11764 11842 10843 11726 mean: 11499 stddev: 3.64% 11 users: 10909 12036 12492 12336 11478 mean: 11850 stddev: 5.51% 12 users: 12485 11250 12743 12067 12378 mean: 12184 stddev: 4.73% 13 users: 13371 13333 12219 11011 11878 mean: 12362 stddev: 8.13% 14 users: 12923 11325 11004 13298 13298 mean: 12369 stddev: 9.03% 15 users: 11297 10588 11739 14285 13953 mean: 12372 stddev:13.34% 16 users: 13395 13150 14117 13457 13333 mean: 13490 stddev: 2.73% 17 users: 13275 12830 14069 14036 13814 mean: 13604 stddev: 3.95% 18 users: 13584 13251 13360 13360 14117 mean: 13534 stddev: 2.57% 19 users: 13680 14279 12930 13790 13902 mean: 13716 stddev: 3.60% 20 users: 13584 13308 13740 13872 13636 mean: 13628 stddev: 1.54% 

For reference, the metric of interest in the preceding output is the number of scripts per hour executed at each simulated user count. Each test is run five times; the metric for each run is printed, followed by the mean and standard deviation of those five values.
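As an illustration, the mean and percentage standard deviation printed for any row can be recomputed with a few lines of C. The values below are the 20-user row from the output above; the printed percentage matches the sample standard deviation (n-1 divisor) expressed relative to the mean, which appears to be what the benchmark reports. Link with -lm for sqrt().

/* Recompute the mean and relative standard deviation for one row of the
 * SDET output above (the 20-user run). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double runs[] = { 13584, 13308, 13740, 13872, 13636 };
    int n = sizeof(runs) / sizeof(runs[0]);
    double sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < n; i++)
        sum += runs[i];
    double mean = sum / n;

    for (int i = 0; i < n; i++)
        sumsq += (runs[i] - mean) * (runs[i] - mean);
    double stddev = sqrt(sumsq / (n - 1));   /* sample standard deviation */

    printf("mean: %.0f  stddev: %.2f%%\n", mean, 100.0 * stddev / mean);
    return 0;
}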

In this particular case, we were looking at how the Linux scheduler performed with a couple of algorithms, attempting to identify whether a new approach would have benefits for multitask workloads. With only the output of a single run here, we cannot compare the results or come to any specific conclusions; however, we can see what the output format shows us. The first two columns indicate the number of users, or the amount of stress that the benchmark was applying to the system, moving upward from simulating a single user to simulating 20 users. The next five columns show the number of tasks completed in each run, followed by the mean and standard deviation of those five runs. In particular, it is interesting to note that this test generates results that are probably less stable than might be ideal. Using the mean value as the primary comparison key and noting the standard deviation between each set of five runs indicates the runs' stability. Some standardized benchmarks provide highly stable results where variations of a single unit of the measurement metric are common on repeated runs. Other benchmarks derive more stable metrics from a mathematical analysis of the underlying raw data. Deriving metrics is best left to the experts in defining and developing benchmarks. Benchmark metrics can be as misleading as any other statistics if results are derived without appropriate context and without a solid understanding of the underlying sources of variance in the raw numbers.

In this particular benchmark, the high standard deviation may actually be a side effect of the operation of the Linux scheduler that was measured at the time, or it may be inherent in the benchmark. Without a wide variety of tests against which to correlate the results, the metrics could easily be taken out of context, and a performance analyst could draw invalid conclusions. On the other hand, such microbenchmarks can be invaluable when measuring the impact of various tuning parameters and hardware or software configurations. And results from the slightly more complex microbenchmarking workloads, such as AIM7, Reaim, and SPEC SDET, often have a higher correlation with real customer workloads without the added complexity and difficulty in measurement that those real workloads entail.

Disk Benchmarks

This section discusses the following disk benchmarks:

  • Bonnie/Bonnie++

  • IOzone

  • IOmeter

  • tiobench

  • dbench

We have examined two classes of microbenchmarks: those that focus on the primitive, primary capabilities of the underlying processor and operating system, and those that use simplistic workload simulations that exercise some of the more commonly used operating system components such as the scheduler or IPC mechanisms. Next, we'll look at a few other classes of benchmarks that focus on other components of the system. These include benchmarks for the disk subsystem, various file systems, and the networking subsystem. These components often have a higher correlation with specific aspects of the end-user workload and therefore are quite valuable in isolating and identifying the key performance characteristics of the file systems and disk subsystems underlying the end-user workload. Many of these benchmarks are developed from or related to traces of actual workloads running on specific file systems. Others are simply designed to create a stressful environment in which throughput can be measured in best-case and worst-case scenarios.

Bonnie/Bonnie++

Bonnie is primarily a file system benchmark, useful for measuring underlying file system performance and identifying bottlenecks. Bonnie and Bonnie++ focus on a set of sequential read and write tests and some random read/write testing. Bonnie measures both character-at-a-time reads and writes and block-at-a-time reads and writes. The character-at-a-time tests use the C library's getc() and putc() functions to provide some measure of the overhead of the stdio library, whereas the block transfers use the read() and write() system calls to provide a measure of the file system throughput and latency with very little additional overhead. The random tests run a user-specified number of threads, each of which seeks to a random location and performs a read(); at a random interval, the block that was read is updated and written back via write().

 $ ./Bonnie -d bonnie-tmp -m laptop  File 'bonnie-tmp/Bonnie.31541', size: 104857600 Writing with putc()...done Rewriting...done Writing intelligently...done Reading with getc()...done Reading intelligently...done Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...               -------Sequential Output-------- ---Sequential Input-- --Random--               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks--- Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU laptop    100  5911 56.2 14987 27.5  3703  4.7  8889 70.7 12414  4.5 981.6  7.6 

In the preceding run, the default file size was used (100MB). Bonnie did about 100 million putc() operations (100MB, one character at a time) at a rate of 5911KBps, utilizing 56% of the CPU. The same I/O done in blocks ran at about 15MBps (14987KBps) using only 27% of the CPU. The per-byte overhead of doing a byte at a time was more than three times that of doing block I/O operations, so the much lower CPU utilization of the block transfers indicates that the character-at-a-time library interface is fairly CPU-intensive. Application writers may use this same data to decide whether they want the advantages of using buffered I/O, with an eye toward the worst case being a byte at a time, or the advantages of doing direct read() and write() operations, with the resulting overhead as quantified in the preceding output for a particular hardware and operating system combination. Note that the comparison may not be completely fair. The block size used by Bonnie is 16K, but the underlying block size for getc() and putc() is only 4K on the version of Linux and glibc used on the test machine. Also, the random seeks and reads using the strategy of 10% modify and write may not map directly to any particular application, although they do provide an initial basis for comparison.
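The difference Bonnie is quantifying can be pictured with a sketch like the following, which writes the same amount of data once through stdio a character at a time and once in large blocks via write(). This is not Bonnie's code; the file names are arbitrary and error handling is minimal.

/* Contrast character-at-a-time stdio output with block write() calls. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TOTAL   (100 * 1024 * 1024L)   /* 100MB, as in the Bonnie run above */
#define BLKSIZE (16 * 1024)            /* Bonnie's block size */

int main(void)
{
    /* One byte per putc(): every byte goes through the stdio buffering layer. */
    FILE *f = fopen("/tmp/chunk.putc", "w");
    if (!f)
        return 1;
    for (long i = 0; i < TOTAL; i++)
        putc('x', f);
    fclose(f);

    /* The same data in 16KB write() calls: one system call per block. */
    char block[BLKSIZE];
    memset(block, 'x', sizeof(block));
    int fd = open("/tmp/chunk.write", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    for (long i = 0; i < TOTAL / BLKSIZE; i++)
        write(fd, block, sizeof(block));
    close(fd);
    return 0;
}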

Bonnie++ is a benchmark derived directly from Bonnie. It begins with the same tests that are in Bonnie but extends the very simple application with support for larger files (more than 2GB on a 32-bit machine), a number of parameters to control block sizes, the capability to set the random number generator seed, the capability to generate data for use in a spreadsheet, and so on.

 $ ./bonnie++ Writing a byte at a time...done Writing intelligently...done Rewriting...done Reading a byte at a time...done Reading intelligently...done start 'em...done...done...done...done...done... Create files in sequential order...done. Stat files in sequential order...done. Delete files in sequential order...done. Create files in random order...done. Stat files in random order...done. Delete files in random order...done. Version 1.93c       ------Sequential Output------ --Sequential Input- --Random- Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP laptop         1G   121  88 11920  23  4797   8   452  73 17249  13  71.9   1 Latency               529ms    3201ms    3105ms     127ms     376ms    2487ms Version 1.93c       ------Sequential Create------ --------Random Create-------- laptop            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP                  16   294  69 26593  90 11278  72   346  82 27319  90   889  70 Latency              1128ms    3338us     195ms     199ms    3321us    1187ms 1.93c,1.93c,laptop,1,1087268459,1G,,121,88,11920,23,4797,8,452,73,17249,13,71.9,1,16 ,,,,,294,69,26593,90,11278,72,346,82,27319,90,889,70,529ms,3201ms,3105ms,127ms,376ms, 2487ms,1128ms,3338us,195ms,199ms,3321us,1187ms 

Note that Bonnie++ includes measures of I/O latency and shows the rate of file creation and deletion with interspersed reads. This model provides some insight into the sort of operations that a mail server might perform. In fact, Bonnie++ has an option to perform an fsync() after each write operation, as many mail servers might do. The last line of the output describes the same results in a format that might be suitable for inclusion in a spreadsheet or for analysis by a Perl script.

IOzone

IOzone is another file system benchmark that can be used to compare a number of different workloads or operating system platforms. In addition to common tests for read and write, IOzone can test such capabilities as reading backward, reading a stride of data (for example, every third or fifth block of data), various block sizes, and random mixes of reads and writes. IOzone can also test reads and writes through stdio, for example via fread()/fwrite(). IOzone allows a performance analyst to model different types of workloads with its rich set of configuration parameters, including simulating diverse workloads such as primarily sequential decision-support database workloads or transaction-processing workloads, which consist of read-mostly, highly random I/O patterns. IOzone can also use asynchronous I/O and can model performance differences between primarily synchronous and asynchronous I/O-based workloads. Finally, IOzone can model applications using threaded and nonthreaded workloads and can use mmap() as the underlying API for reads and writes of file operations.
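As a hedged sketch of one such access pattern (not IOzone's code), a strided read that touches every third block might look like the following; the file path, block size, and stride are arbitrary assumptions.

/* Sketch of a strided read pattern: read one block, then skip ahead by a
 * fixed stride before reading the next. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLKSIZE 4096
#define STRIDE  3          /* read every third block */

int main(void)
{
    char buf[BLKSIZE];
    int fd = open("/tmp/testfile", O_RDONLY);
    long blocks_read = 0;

    if (fd < 0)
        return 1;

    while (read(fd, buf, BLKSIZE) == BLKSIZE) {
        blocks_read++;
        /* Skip the next STRIDE-1 blocks to produce the strided pattern. */
        if (lseek(fd, (off_t)(STRIDE - 1) * BLKSIZE, SEEK_CUR) < 0)
            break;
    }
    printf("read %ld blocks with a stride of %d\n", blocks_read, STRIDE);
    close(fd);
    return 0;
}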

IOmeter

IOmeter is another I/O and file system subsystem test that is useful for I/O testing on a single machine but can also be used on clustered systems. The test was originally written by Intel and later released into the public domain. IOmeter is available for download at http://sourceforge.net/projects/iometer/. IOmeter is another test that runs on multiple operating systems and can be used to compare operating systems. However, it also can be used on a single system to compare changes in kernel configuration parameters, application tuning, hardware configuration, and so on.

IOmeter provides a slightly different view of performance tuning. Its goal is not to run a task and see how quickly it completes or see how many operations it completes in a period of time. Instead, it provides a steady-state workload and allows you to monitor various aspects of the system under load. You can also vary the workload and see how the underlying system responds. In addition, you can vary the system parameters, configuration, and so on to see how that impacts the workload. In other words, it provides more tuning control points for analyzing changes in workloads than any of the other benchmarks discussed previously. A similar tool that launched before IOmeter was publicly available is pgmeter, also known as Penguinometer. pgmeter is available at http://pgmeter.sourceforge.net/ (or http://sourceforge.net/projects/pgmeter).

tiobench

tiobench is a file system benchmark that is threaded and can generate tests using one or more simultaneous threads of I/O. This allows the simulation of applications that generate I/O in parallel, either independently of each other (truly random) or with some coordination (for example, with some sequential I/O).

 $ ./tiobench.pl gerrit@w-gerrit2:/nue/tiobench-0.3.3$ ./tiosum.pl No size specified, using 1022 MB Run #1: ./tiotest -t 8 -f 127 -r 500 -b 4096 -d . -TTT Unit information ================ File size = megabytes Blk Size  = bytes Rate      = megabytes per second CPU%      = percentage of CPU used during the test Latency   = milliseconds Lat%      = percent of requests that took longer than X seconds CPU Eff   = Rate divided by CPU% - throughput per cpu load File size in megabytes, Blk Size in bytes.  Read, write, and seek rates in MB/sec.  Latency in milliseconds. Percent of requests that took longer than 2 and 10 seconds. Sequential Reads         File  Blk   Num                   Avg     Maximum    Lat%    Lat%     CPU Kernel  Size  Size  Thr  Rate  (CPU%)  Latency   Latency     >2s    >10s      Eff --------------------------------------------------------------------------------- 2.6.3   1022  4096    1  12.82 5.645%    0.303   1450.99   0.00000  0.00000   227 2.6.3   1022  4096    2  17.45 7.706%    0.446   1749.80   0.00000  0.00000   226 2.6.3   1022  4096    4  16.77 7.419%    0.901   2100.15   0.00038  0.00000   226 2.6.3   1022  4096    8  18.54 8.404%    1.585   1717.18   0.00000  0.00000   221 Random Reads         File  Blk   Num                    Avg     Maximum   Lat%     Lat%    CPU Kernel  Size  Size  Thr  Rate  (CPU%)   Latency   Latency    >2s     >10s     Eff ---------------------------- ------ ----- ---  ------------------------------------ 2.6.3   1022  4096    1   0.26 0.637%    15.081   2199.50   0.02500  0.00000   41 2.6.3   1022  4096    2   0.46 0.685%    16.754    293.99   0.00000  0.00000   66 2.6.3   1022  4096    4   0.41 0.717%    37.062   1367.18   0.00000  0.00000   57 2.6.3   1022  4096    8   0.35 0.591%    84.926   3423.89   0.40000  0.00000   59 Sequential Writes         File  Blk   Num                   Avg     Maximum   Lat%     Lat%     CPU Kernel  Size  Size  Thr   Rate  (CPU%)  Latency   Latency    >2s     >10s     Eff ---------------------------- ------ ----- ---  ---------------------------------- 2.6.3   1022  4096    1   18.68 24.68%    0.194    799.52   0.00000  0.00000   76 2.6.3   1022  4096    2   17.21 22.51%    0.411   3206.45   0.00076  0.00000   76 2.6.3   1022  4096    4   18.78 24.89%    0.684   3277.35   0.00191  0.00000   75 2.6.3   1022  4096    8   18.10 23.86%    1.357   6397.58   0.01615  0.00000   76 Random Writes         File  Blk   Num                    Avg     Maximum   Lat%    Lat%     CPU Kernel  Size  Size  Thr   Rate (CPU%)   Latency    Latency   >2s     >10s     Eff ---------------------------- ------ ----- ---  ---------------------------------- 2.6.3   1022  4096    1    0.58 0.698%    0.236    66.66    0.00000  0.00000   83 2.6.3   1022  4096    2    0.73 0.911%    1.106   967.64    0.00000  0.00000   81 2.6.3   1022  4096    4    0.70 0.911%    2.677  1474.37    0.00000  0.00000   77 2.6.3   1022  4096    8    0.78 1.035%    6.960  3392.60    0.05000  0.00000   75 

tiobench allows for a variety of patterns, as shown by the help output that follows. A single run can be set up to do a large set of tests with multiple block sizes and multiple numbers of threads, with each thread issuing a random set of interspersed I/Os, which can be configured to resemble a number of applications sharing a single file system.

 $ tiobench.pl --help
 Usage: ./tiobench.pl [<options>]
 Available options:
         [--help] (this help text)
         [--identifier IdentString] (use IdentString as identifier in output)
         [--nofrag] (don't write fragmented files)
         [--size SizeInMB]+
         [--numruns NumberOfRuns]+
         [--dir TestDir]+
         [--block BlkSizeInBytes]+
         [--random NumberRandOpsPerThread]+
         [--threads NumberOfThreads]+
 + means you can specify this option multiple times to cover multiple cases,
 for instance: ./tiobench.pl --block 4096 --block 8192 will first run through
 with a 4KB block size and then again with a 8KB block size.
 --numruns specifies over how many runs each test should be averaged

Overall, tiobench is a simple set of code with some fairly flexible configuration capabilities for simulating multiple-application workloads or very simple random or sequential database-style access patterns from a purely file-system-oriented perspective.

dbench

dbench is a workload created to model another benchmark: NetBench. NetBench is the de facto industry-standard benchmark for measuring Windows file servers. However, as with several benchmarks discussed later in this chapter, NetBench requires access to a large number of clients and a very large server to measure performance. As with many benchmarks, this complexity makes it better suited to benchmark teams sponsored by larger vendors, which typically create a run only after a product is generally available. NetBench is also poorly suited to helping developers analyze code during the development cycle because of the long cycle time for testing and the complex testing environment required.

Many Linux developers prefer to analyze the performance of various aspects of their code as that code is developed. That analysis requires benchmarks that are easy to run, benchmarks that can be easily understood, and benchmarks that require minimal hardware configurations. dbench was written by the developers of the Windows file system support for Linux and achieves those goals. Their goal was to use dbench to analyze their code for compatibility, functionality, performance, and scalability on Linux.

dbench was developed based on a set of network traces captured while running NetBench. The theory was that if you can simulate the traffic pattern that accessing a network file system generates on the network, you can generate effectively the same load with a program that does nothing except generate those patterns. NetBench used a more brute-force approach in which each network client was an individual computer, generating file system accesses at the rate a real user might generate them. By effectively simulating the file system traffic on the local machine, the server portion of the network file system code should be stressed in much the same way.

dbench effectively eliminates the need for a network in this case, simply simulating locally the traffic that would otherwise come in over the network. This simplifies the overall analysis to a point that allows the developers to focus on a single component of the entire system: the way that the file system code responds to a relatively standardized client load. In fact, with this level of simulation, it is now possible to simulate workloads even larger than those often done in a NetBench configuration, and with only a single machine.

Running dbench is very simple; its only argument is the number of clients to run, so the following output is the result of running dbench while simulating 16 clients.

 $ dbench 16 16 clients started   16       637  234.00 MB/sec   16      3070  285.37 MB/sec   16      4086  214.91 MB/sec   16      4909  177.61 MB/sec   16      7051  177.85 MB/sec   16      8999  174.34 MB/sec   16     11030  174.34 MB/sec   16     13069  173.72 MB/sec   16     15765  179.60 MB/sec   16     17505  176.47 MB/sec   16     20106  180.02 MB/sec   16     22086  179.87 MB/sec   16     24246  179.19 MB/sec   16     26469  179.88 MB/sec   16     27781  175.42 MB/sec   16     30133  177.20 MB/sec   16     32413  177.82 MB/sec   16     34923  179.28 MB/sec   16     37410  181.32 MB/sec   16     39026  179.35 MB/sec   16     42022  182.70 MB/sec   16     44060  182.13 MB/sec   16     46126  181.81 MB/sec   16     48693  183.46 MB/sec   16     50478  182.01 MB/sec   14     53321  183.62 MB/sec   12     55804  185.15 MB/sec    7     59312  188.96 MB/sec    1     62463  191.78 MB/sec    0     62477  191.49 MB/sec Throughput 191.494 MB/sec 16 procs 

The final line indicates the overall throughput of the underlying file system when responding to the rough equivalent of 16 NetBench client systems. The interim lines provide a running summary of the throughput. As with any benchmark, it is important to understand exactly what the workload being measured is. It is also important for the performance analyst to understand how that measurement might relate to his own workload. As an example, NetBench tends to model a workload that is very write intensive, as much as 90% writes. Most real-user situations rarely have a need to write so much data to the server or to write that percentage of data for a long period of time. However, writes tend to be more resource-intensive and harder to optimize than reads. In this case, dbench could be useful for some worst-case analysis of a workload or an environment, but it is unlikely to be directly representative of that workload.

dbench was originally written to model another benchmark. However, dbench lacks a couple of properties that would make it a much more useful benchmark. In particular, dbench seems to have a very wide variation in throughput results between runs, even when run repeatedly on identical hardware under what are believed to be identical conditions. This deviation in the run results means that the test often needs to be run many times and an average and standard deviation for those results needs to be calculated. This means that running dbench from a development and stress perspective can be very useful but that using dbench as a multiplatform or multiconfiguration test requires additional runs and more stringent analysis of the results. Also, absolute results from dbench tend to be nearly meaningless because of these deviations. However, dbench remains a useful tool for generating stress, seeing how the system responds under that stress, and providing some reasonable guidelines on the benefits of configuration changes, such as comparing different file systems on the same hardware and software stack, or changing some file system or related kernel parameters on a given configuration. As with all benchmarks, a solid understanding of their operation, strengths, and weaknesses, as well as an ability to compare the benchmark's operation to that of the end-user workload to be run on a given machine, is critical.

A companion tool to dbench is tbench. Whereas dbench factored out all the networking and real client component of NetBench, tbench factors out all the disk I/O. tbench was constructed from the same NetBench traces and models the related networking traffic in a network file system. Here is the output from a 32-thread (client) run; the output is nearly identical to that of dbench but measures the effective throughput of only the networking component of a NetBench-style workload.

 32 clients started   32       735  402.57 MB/sec   32      1978  395.83 MB/sec   32      3735  399.99 MB/sec   32      5781  390.50 MB/sec   32      7870  382.69 MB/sec   32     10164  382.61 MB/sec   32     12356  380.21 MB/sec   32     14491  379.11 MB/sec   32     16614  376.35 MB/sec   32     18638  372.94 MB/sec   32     20838  372.71 MB/sec   31     23087  373.25 MB/sec   31     25178  371.87 MB/sec   31     27435  373.10 MB/sec   31     29571  372.40 MB/sec   30     31702  371.65 MB/sec   29     33870  370.35 MB/sec   28     36058  371.11 MB/sec   26     38345  371.14 MB/sec   22     40935  374.55 MB/sec   20     43394  376.73 MB/sec   20     45815  377.82 MB/sec   18     48291  379.80 MB/sec   17     50970  383.07 MB/sec   15     53593  385.23 MB/sec   13     56601  389.88 MB/sec    7     59426  393.44 MB/sec    2     61201  390.04 MB/sec    2     61966  381.01 MB/sec    1     62323  370.27 MB/sec    0     62477  364.31 MB/sec Throughput 364.306 MB/sec 32 procs 

By using these two tests independently, it is possible to simulate the conditions under which the untested component is assumed to be infinitely fast. This helps find bottlenecks in individual subsystems without being dependent on bottlenecks or performance constraints in other subsystems. For instance, some of the complex benchmarks discussed later in this chapter provide a single or small set of numbers to represent the overall throughput of a highly complex workload. In these complex workloads, a combination of processing speed, network capacity, disk latency, file system performance, database workload, Java performance, web server response times, and other factors combine to yield a single throughput and/or latency result. But if a single component of the workload is misconfigured or inadequate for the environment being tested, the overall number suffers, often without any direct feedback as to the cause of the deficiency. At that point, an analyst needs to look at the detailed statistics provided by the benchmark, understand what portions are most critical to the workload's throughput, and attempt to tune or otherwise correct that subsystem and rerun. This process can take days for an expert, and for the uninitiated, this process of analysis, retuning, reconfiguration, and so on can take months!

It may appear that tests like dbench, tbench, and other component tests are "the only way to go." The initial answer is: Definitely. However, even detailed analysis of the various components does not provide a comprehensive view of how those components interact. For instance, dbench can indicate that the file system component is working well, and tbench can help tune the networking component to work well. But if there are interactions between the networking component and the underlying file system code, such as locking, scheduling, or latency issues, the overall result of a comparable NetBench may be quite different. If it is different, it is most likely that the more holistic view is the one that will perform the worst. As with software design, integrated circuit design, home building, automobile manufacture, and so on, there is distinct value in doing performance analysis on the various components before integrating them into a greater whole. But they are not a replacement for doing full integration testing and performance analysis on the integrated set of those tools and components. As with automobile design, it is possible to design an engine that performs so well that it could push an automobile to 200 mph. But if the remainder of the automobile's components are not designed and tested to meet those same performance specifications, the end user of that automobile may not be happy with how the vehicle performs and handles under that user's workload.

Network Benchmarks

This section discusses the following network benchmarks:

  • Netperf

  • SPEC SFS

Netperf

Netperf is a fairly comprehensive set of network subsystem tests, including support for TCP, UDP, DLPI (Data Link Provider Interface), and UNIX Domain Sockets. Netperf is actually two executable programs: a client-side program, Netperf, and an accompanying server-side program, Netserver. Netserver is usually automatically executed out of a system's inetd daemon; all command specification is done via the Netperf command-line interface. Netperf communicates with Netserver over a standard TCP-based socket, using that communication channel to establish the subsequent tests. A secondary control and status connection is opened between the client and server, depending on the test options selected, and performance statistics are returned to the client over this secondary connection. During a test, there should be no traffic (other than potential keep-alive packets) on the primary control channel. Therefore, if the performance analyst shuts off all other network traffic to the machine, Netperf can measure the full capability of the network performance of a pair of machines and their corresponding network fabric.

Netperf is primarily useful for measuring TCP throughput, which accounts for the majority of network traffic generated by modern applications. Netperf comes with a couple of scripts that help automatically generate some basic tests covering socket size and send buffer size: tcp_stream_script and tcp_range_script. The first one runs Netperf with some specific socket sizes (56K, 32K, 8K) and send sizes (4K, 8K, 32K). The second script uses a socket size of 32K by default and iterates over send sizes from 1K to 64K.

Here is some sample output from a run of Netperf.

 RCV    SND    MSG  TIME   THRUPUT  CPU   CPU   Snd   RCV
                                    send  recv  lat   lat
 131070 131070 8192 60.00  798.59   9.28 11.34 1.903 2.327
 131070 131070 8192 60.00  796.62   9.12 11.48 1.876 2.360
 131070 131070 8192 60.00  800.74   9.14 11.78 1.869 2.411
 131070 131070 8192 60.00  796.92   9.24 11.90 1.899 2.447
 131070 131070 8192 60.00  800.05   9.20 11.48 1.885 2.350

 RCV = RCV socket size (bytes)
 SND = send socket size (bytes)
 MSG = send msg size (bytes)
 TIME = elapsed time (seconds)
 THRUPUT = throughput (10^6 bits/second)
 CPU send = CPU utilization send local (%)
 CPU recv = CPU utilization receive remote (%)
 Snd lat = Service Send local (us/KB)
 RCV lat = Demand RCV remote (us/KB)

See http://www.netperf.org for the benchmark and some archived results.

SPEC SFS

SPEC (http://www.spec.org/) provides a benchmark for network file systems. It is primarily used today for benchmarking NFS v2 and v3 servers. It allows for benchmarking both the UDP and TCP transports. SPEC SFS also provides a breakdown of all the common network file operations used by NFS, including reading directories, reading files, writing files, checking access rights, removing files, updating access rights, and so on. At the completion of a test run, an indication of the throughput for each operation is generated. The test attempts to simulate a real-world workload with an appropriate mix of the underlying operations so that the results should be relevant to any reasonable NFS workload. A performance analyst may want to monitor the operations of her own workload and compare it to the published breakdown of operations as run by the SPEC SFS benchmark before drawing too many conclusions.

As with other SPEC-based tests, there are strict run rules, configuration requirements, data integrity requirements, and reporting requirements for test results. It is still possible to use the test for internal testing and to use it to compare the results of various configuration changes to an environment.

Application Benchmarks

This section discusses the following benchmarks:

  • The Java benchmarks Volanomark, SPECjbb, and SPECjvm

  • PostMark

  • Database benchmarks

Java Benchmarks

Three commonly used benchmarks for comparing implementations of Java, Java settings, or Java on particular platforms are Volanomark (http://www.volano.com/benchmarks.html), SPECjbb, and SPECjvm. Volanomark is a benchmark designed to simulate the VolanoChat application. VolanoChat is a chat room application written entirely in Java and has characteristics that are similar to many Java applications. In particular, Volanomark simulates long-running network connections with many threads instantiated on the local operating system. Because chat applications are inherently interactive, latency and response times are very important, and the Volanomark benchmark's goal is to see how many connections to the chat room can be maintained within a specified response time. Obviously, the more connections that can be maintained with reasonable latency, the less overhead the operating system or the Java implementation consumes. Volanomark can be run with both the server application and all clients on the local host. This is useful for removing the network component from the equation and focusing exclusively on the Java implementation and the underlying operating system performance characteristics. However, a more realistic workload involves setting up clients on networked machines to drive a server. This measures end-to-end latency as well as overall Java performance. It also simulates a more typical customer deployment of Java, in which Java is used as a middleware layer in a multitier client/server configuration.

Java benchmarks such as Volanomark, SPECjvm, and SPECjbb also add a new level of complexity over the benchmarks discussed thus far. Most of the preceding benchmarks may not be affected significantly by system tuning; they are more likely to be affected by hardware configuration changes, total available system memory, I/O configuration, and so on. For the Java benchmarks, operating system and hardware configuration options are often secondary to the tuning parameters of the version of Java in use. Java itself has configuration parameters governing the size of the heap it uses, the number of threads, the type of threads, and so on. Each of these configuration parameters interacts with the test and the underlying operating system in ways that are often difficult to predict. As a result, benchmark publication efforts often run waves of tests, changing various configuration parameters at each level with some intelligent guesses as to their potential interactions. This technique is fairly common for finding the best results a particular benchmark can provide. However, a benchmark like this can also be used to model the impact of various configuration changes; in that case, the benchmark provides a scientific control against which to compare the changes.

Using the benchmark as a control allows the performance analyst to try out different versions of Java, different operating systems or kernels, different configuration parameters for Java, and so on. This gives the performance analyst more insights into optimizing similar workloads.
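As a concrete illustration, the Java-level parameters mentioned above are ordinary JVM command-line options. The following is only a minimal sketch; the main class name is hypothetical, the values are arbitrary, and the available options and their defaults differ between JVM vendors and versions:

 # Fix the heap at 1GB so the collector never resizes it during the run,
 # log garbage collection activity so pauses can be correlated with
 # throughput dips, and set the stack size used by each benchmark thread.
 $ java -Xms1024m -Xmx1024m -Xss128k -verbose:gc SomeBenchmarkMain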

Volanomark provides high-level throughput metrics, including messages sent, messages received, total messages, elapsed time, and average throughput in messages per second. The following is the output from a sample run.

 $ volanomark 1024 1024 1000
 Running with start heap = 1024 max heap=1024 msg_count=1000
 [ JVMST080: verbosegc is enabled ]
 [ JVMST082: -verbose:gc output will be written to stderr ]
 <GC[0]: Expanded System Heap by 65536 bytes
 java version "1.4.1"
 Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1)
 Classic VM (build 1.4.1, J2RE 1.4.1 IBM build cxia32dev-20030702 (JIT enabled: jitc))
 VolanoMark(TM) Benchmark Version 2.1.2
 Copyright (C) 1996-1999 Volano LLC.  All rights reserved.
 Creating room number 1 ... 20 connections so far.
 Creating room number 2 ... 40 connections so far.
 Creating room number 3 ... 60 connections so far.
 Creating room number 4 ... 80 connections so far.
 Creating room number 5 ... 100 connections so far.
 Creating room number 6 ... 120 connections so far.
 Creating room number 7 ... 140 connections so far.
 Creating room number 8 ... 160 connections so far.
 Creating room number 9 ... 180 connections so far.
 Creating room number 10 ... 200 connections so far.
 Running the test ...
 <AF[1]: Allocation Failure. need 528 bytes, 0 ms since last AF>
 <AF[1]: managing allocation failure, action=1 (0/1019991016) (53683736/53683736)>
 <GC(4): GC cycle started Mon Nov 24 10:13:30 2003
 <GC(4): freed 1018220216 bytes, 99% free (1071903952/1073674752), in 83 ms>
 <GC(4): mark: 68 ms, sweep: 15 ms, compact: 0 ms>
 <GC(4): refs: soft 0 (age >= 32), weak 0, final 0, phantom 0>
 <AF[1]: completed in 94 ms>
 Test complete.
 VolanoMark version = 2.1.2
 Messages sent      = 200000
 Messages received  = 3800000
 Total messages     = 4000000
 Elapsed time       = 79.49 seconds
 Average throughput = 50321 messages per second

SPECjvm is a benchmark controlled by and made available through the Standard Performance Evaluation Corporation (SPEC). SPECjvm is primarily used to measure the client-side performance of a Java virtual machine (JVM), including how well it runs on the underlying operating system. SPECjvm also measures the performance of the just-in-time (JIT) compiler. The underlying operating system and hardware performance also affect the throughput measures of SPECjvm.

SPECjbb measures a more complex application environment, specifically that of an order-processing application for a wholesale supplier. Like SPECjvm, SPECjbb also measures the system's underlying hardware and operating system performance. SPECjbb is loosely modeled on the TPC-C benchmark; however, it is written entirely in Java. SPECjbb models a three-tier system (user interface, application, database) with a focus on the Java-based application layer in the middle tier. The middle tier models a typical business application and is the basis for the generated metrics of interest. SPECjbb is conveniently self-contained: it does not need a complex database to be installed (it implements a simple tree-based data structure within the JVM for the database tier), and it does not need a web server because it provides a randomized simulation of user input for the first tier.

PostMark

PostMark is another type of application test, although a bit different from the preceding Java application tests. PostMark provides a means of testing a specific usage pattern on a file system: specifically, the usage pattern that might be seen on a mail server. Mail servers tend to operate on many small files, whereas many of the other benchmarks focus on file system throughput in terms of the amount of data sent to or fetched from the file system. PostMark reproduces the pattern used by mail servers (and often news servers, and possibly some web-based e-commerce servers) in which small files are frequently created, written, appended to, read, and deleted. The access patterns of most mail servers exercise only a small subset of all possible file system operations, so an I/O and disk storage subsystem that performs that set of operations well should support a higher-performance mail server.

PostMark provides statistics at the completion of a run that indicate how many files can be manipulated in these common patterns per second. The statistics include the overall elapsed time, the time spent performing the actual transactions and the average transaction rate in files per second, the total number of files created and the creation rate, the total number of files read and the average rate at which they were read, and similar figures for file appends and deletions. The test also reports the overall amount and rate of data read and written.

As with many of the more complex benchmarks, many of the parameters, such as the number of files, the range of file sizes, and the bias between reads and appends or creations and deletions, can be modified to better represent a particular workload. See http://www.netapp.com/tech_library/3022.html for more information on PostMark.
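To illustrate such customization, PostMark can read its commands from a simple file named on the command line rather than being driven interactively. The following is a minimal sketch; the file name, directory, and values are illustrative only, and the exact set of set commands accepted can vary between PostMark versions:

 # Contents of a PostMark command file, here called pm-mail.cfg:
 # 20,000 files between 512 bytes and 16KB, 50,000 mixed transactions,
 # with the working files placed under /var/tmp/pmtest.
 $ cat pm-mail.cfg
 set location /var/tmp/pmtest
 set number 20000
 set size 512 16384
 set transactions 50000
 run
 quit

 # Run PostMark with that command file; the statistics described above
 # are printed when the run completes.
 $ postmark pm-mail.cfg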

Database Benchmarks

Database benchmarks are worthy of an entire book of their own and are mentioned here only briefly. Databases are often the backbone of many enterprises, maintaining customer records, sales records, inventory, marketing patterns, video clips, images, sounds, order fulfillment information, and so on. Although there are several major usage patterns for databases, two of the most common are online transaction processing (OLTP) and decision support (DS), sometimes referred to as business intelligence (BI). These two models have fairly different operational characteristics, especially in how they stress the underlying hardware and operating system. OLTP is typically used to record all aspects of transactions in a dynamic environment. For instance, an OLTP system can be viewed as a set of point-of-sale (POS) terminals all querying current inventory, updating sales numbers, looking up prices, and so on, much like a checkout stand at a store or an e-commerce web front end to a company's inventory. These operations use the database's organizational capabilities to run queries on selected attributes, performing as few disk I/Os as possible to locate the specific record or set of records an operator is searching for. Occasionally, when a record is located, some data related to that record is updated, such as the quantity available when an item is sold. In the OLTP model, most disk I/O is done in fairly small chunks (sometimes as small as 2KB or 4KB), and the processor is used to run joins on tables or calculate the next offset in the table to look up. The access pattern is typically read-mostly, with a small number of writes.

Decision support, on the other hand, is often run against a database generated through OLTP operation. In this case, however, the entire database is often searched for trends, summary operations, full reports on a day's activities, and so on. In DS workloads, nearly the entire database is read, usually in large chunks (often 64KB to 2MB at a time), and a limited amount of processing is done on the data returned. Occasionally, summary results are written back to the database.

These two models are different enough that distinct benchmarks exist to model each workload. For the OLTP model, the gold standard is the TPC-C benchmark, published by the Transaction Processing Performance Council (http://www.tpc.org/). The same organization also publishes the TPC-H benchmark, which models the decision support style of workload. Both of these benchmarks have evolved over the years to better represent real-life workloads. The tests, and the reporting of their results, are very rigorously controlled. As with the SPEC tests discussed earlier, the run rules are well established, requirements are placed on the integrity of the data stored in the database, benchmark runs must be tightly controlled, results must be published and auditable, and all publications must be based on hardware and software that is generally available or soon will be. Because these benchmarks are often used by major companies to decide what hardware, operating system, and database to purchase, they provide not only throughput and latency measures but also price/performance metrics.

The downside of these tests is that they can be very resource-intensive to set up and run. In some cases, the database to be used can be very large, and it can take hours or days to create and load it. After the database is loaded, runs can take anywhere from a few hours to many days to complete. In some cases, the cost of the required hardware can be prohibitive, even for large companies. As a result, most benchmark runs on larger configurations come from hardware or software vendors with budgets dedicated to demonstrating the value of their products in terms of raw performance or a significant cost advantage over their competitors.

As an alternative to these expensive, complex, and rigorously controlled database benchmarks, the Open Source Development Labs (OSDL) (http://www.osdl.org/) has developed a set of similar, but vastly simpler, tests that performance analysts can run on a smaller scale. For OLTP workloads, the OSDL's Database Test #1 (OSDL DBT-1: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/) models an e-commerce site (similar to TPC-W). Database Test #2 (OSDL DBT-2: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-2/) attempts to model the TPC-C benchmark, and Database Test #3 (OSDL DBT-3: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-3) attempts to model the TPC-H benchmark.

All three of these tests are freely available and can be used to compare hardware and software configuration changes, application changes, and different application or operating system stacks. Of course, because these tests and their results are not tightly controlled like the TPC or SPEC benchmarks, publicly published numbers may be inconsistent with one another, especially because the results are not typically audited or validated by anyone other than the individual or company that publishes them. However, as tools for a performance analyst, these test suites are much more accessible for day-to-day testing than the TPC or SPEC tests.
