This section looks at several types of microbenchmarks, including operating system benchmarks, disk benchmarks, network benchmarks, and application benchmarks.

Operating System Benchmarks

The operating system benchmarks discussed in this section are as follows:
LMbench (http://www.bitmover.com/lmbench/)

LMbench is a suite of simple benchmarks in the true sense of a microbenchmark. It contains a series of small tests for measuring the latency and bandwidth of some of the most fundamental UNIX and Linux APIs. The bandwidth measures include reading from cached files, copying memory entirely at user level, moving data through a UNIX pipe, and some simple TCP transfers. Usually, these bandwidth measures are done by copying a block of memory or issuing a read() call in a loop, with calls to a system clock before and after the loop. Counting the number of bytes transferred per unit of time provides a measure of the overall bandwidth of the various APIs. The measures selected can then be compared between different operating systems, processor types, hardware memory subsystems, and so on for these basic APIs. LMbench also measures the rate at which an operating system switches from user level into the operating system's protected mode using a very simple system call. Because events like this happen much faster than the system's timer granularity, a common technique is to run a simple primitive in a loop and measure the number of calls to the primitive per unit of time. Relatively straightforward math then yields the average time per operation. Other primitives measured by LMbench include the establishment of TCP connections, creation of pipes, creation of processes, the rate at which signals are received, and so on. LMbench is fairly mature and has been careful to take into account some of the more complex side effects that can typically plague benchmarking efforts. For instance, the test for process creation looks at the cost of process creation via fork(), as well as the cost of fork()+exit() and fork()+exec().
On memory tests, the documentation describing how to interpret the results points out the performance impacts of various hardware configurations, most specifically the impact of various sizes of memory caches. LMbench also takes into account compiler differences by recommending a common compiler that should provide equivalent results for all architectures. It is possible to use different compilers on different architectures, but the performance analyst must take this into account when comparing results. Although all of the tests are fairly simple in concept, the wealth of experience included makes it much cheaper to use an off-the-shelf benchmark for simple comparisons. Also, the publicly available test results, and, in this case, the publicly available source code, allow a performance analyst to easily compare any differences in different environments. In the following results, some information provides simple measures, such as the time for a simple system call such as read() or write(). Other results show throughputs for a variety of data transfer sizes.

LMbench Sample Output

This first section summarizes the machine being tested, including the kernel version (output of uname(1)), memory sizes to be tested, processor speeds, and so on.

[lmbench2.0 results for Linux herkimer.ltc.austin.ibm.com 2.6.3 #1 SMP Wed Mar 10 19:51:47 CST 2004 i686 i686 i386 GNU/Linux]
[LMBENCH_VER: Version-2.0.4 20030113111940]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m 256m 512m]
[DISKS: ]
[DISK_DESC: ]
[ENOUGH: 5000]
[FAST: ]
[FASTMEM: NO]
[FILE: /usr/tmp/XXX]
[FSDIR: /usr/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m 256m]
[INFO: INFO.herkimer.ltc.austin.ibm.com]
[LOOP_O: 0.00000234]
[MB: 512]
[MHZ: 495 MHz, 2.02 nanosec clock]
[...]
[OS: i686-pc-linux-gnu]
[TIMING_O: 0]
[LMBENCH VERSION: lmbench-2alpha13]
[USER: ]
[HOSTNAME: herkimer.ltc.austin.ibm.com]
[NODENAME: herkimer.ltc.austin.ibm.com]
[SYSNAME: Linux]
[PROCESSOR: i686]
[MACHINE: i686]
[RELEASE: 2.6.3]
[VERSION: #1 SMP Wed Mar 10 19:51:47 CST 2004]
[...]

This section provides the output of several system calls run for a short period of time in a loop. It also calculates the average time for each system call as the amount of time run divided by the number of system calls successfully executed.

Simple syscall: 0.4189 microseconds
Simple read: 0.7907 microseconds
Simple write: 0.6517 microseconds
Simple stat: 47.4274 microseconds
Simple fstat: 1.5631 microseconds
Simple open/close: 54.3922 microseconds
Select on 10 fd's: 6.7666 microseconds
Select on 100 fd's: 34.5312 microseconds
Select on 250 fd's: 82.4412 microseconds
Select on 500 fd's: 159.6176 microseconds
Select on 10 tcp fd's: 8.2204 microseconds
Select on 100 tcp fd's: 52.6321 microseconds
Select on 250 tcp fd's: 127.5116 microseconds
Select on 500 tcp fd's: 248.5455 microseconds
Signal handler installation: 1.742 microseconds
Signal handler overhead: 9.899 microseconds
Protection fault: 1.310 microseconds
Pipe latency: 28.2951 microseconds
AF_UNIX sock stream latency: 97.5933 microseconds
Process fork+exit: 806.4286 microseconds
Process fork+execve: 2265.0000 microseconds
Process fork+/bin/sh -c: 10137.0000 microseconds
File /usr/tmp/XXX write bandwidth: 5923 KB/sec n=2048, usecs=12993
Pagefaults on /usr/tmp/XXX: 6 usecs

This section provides the size of an mmap region (in MB, in the first column) and the number of microseconds to complete a mapping of that size.

"mappings
0.524288 30
1.048576 43
2.097152 71
4.194304 121
8.388608 219
16.777216 412
33.554432 848
67.108864 1691
134.217728 3326
268.435456 6750
536.870912 14383

This section shows how long in microseconds it takes to complete a read from a file system.
"File system latency
0k 1000 5307 10741
1k 1000 3835 7716
4k 1000 3770 7717
10k 1000 2602 6395

This section provides latency timings for various network connections and a summary of the bandwidth of several local networking calls.

UDP latency using localhost: 93.2020 microseconds
TCP latency using localhost: 177.5119 microseconds
RPC/tcp latency using localhost: 239.6613 microseconds
RPC/udp latency using localhost: 140.2291 microseconds
TCP/IP connection cost to localhost: 272.5500 microseconds
Socket bandwidth using localhost: 29.04 MB/sec
Avg xfer: 3.2KB, 41.8KB in 13.5080 millisecs, 3.09 MB/sec
AF_UNIX sock stream bandwidth: 50.00 MB/sec
Pipe bandwidth: 196.79 MB/sec

This section provides the rate in MBps (shown in the second column) of reads of various sizes (shown in the first column, in MB), from 512 bytes to 512MB.

"read bandwidth
0.000512 100.16
0.001024 182.24
0.002048 313.72
0.004096 469.94
0.008192 402.96
0.016384 238.67
0.032768 245.45
0.065536 244.97
0.131072 242.77
0.262144 228.47
0.524288 209.52
1.05 197.06
2.10 198.86
4.19 196.74
8.39 198.87
16.78 197.20
33.55 199.74
67.11 193.44
134.22 199.26
268.44 196.50
536.87 199.57

This section provides the bandwidth of a complete open/read/close cycle in MBps (shown in the second column) for various block sizes (shown in the first column).

"read open2close bandwidth
0.000512 7.85
0.001024 15.73
0.002048 30.36
0.004096 57.20
0.008192 80.01
0.016384 117.05
0.032768 156.55
0.065536 191.84
0.131072 208.20
0.262144 207.66
0.524288 195.78
1.05 194.32
2.10 192.59
4.19 197.70
8.39 195.38
16.78 198.77
33.55 196.26
67.11 197.06
134.22 197.39
268.44 187.94
536.87 197.56

This section provides the throughput in MBps (shown in the second column) of mmap read access of various sizes (shown in the first column) ranging from 512 bytes to 512MB. A read implies that the data is mmap()'d and accessed/touched by the processor, causing the operating system to service page faults. This should provide a fair comparison to similar operations doing a read() system call from a file.
"Mmap read bandwidth
0.000512 1589.89
0.001024 1747.58
0.002048 1849.51
0.004096 1876.64
0.008192 1925.34
0.016384 1806.04
0.032768 945.75
0.065536 939.72
0.131072 949.63
0.262144 786.75
0.524288 389.81
1.05 285.60
2.10 280.14
4.19 277.49
8.39 280.36
16.78 277.84
33.55 280.44
67.11 277.87
134.22 280.45
268.44 277.83
536.87 280.41

This section provides the same measures as given in the preceding section, but for the full open/mmap/close cycle, again with the data accessed after the mmap() and before the close().

"Mmap read open2close bandwidth
0.000512 6.47
0.001024 12.99
0.002048 25.54
0.004096 49.57
0.008192 88.09
0.016384 142.55
0.032768 222.20
0.065536 303.94
0.131072 382.90
0.262144 373.09
0.524288 255.05
1.05 210.47
2.10 212.69
4.19 213.20
8.39 216.84
16.78 214.64
33.55 217.02
67.11 215.04
134.22 217.51
268.44 215.38
536.87 216.09

This section measures the rate at which the C library's bcopy() function can copy unaligned data. Unaligned data often requires more complex copying algorithms than aligned data. For instance, aligned data may be copied a double word (64 bits) at a time, but unaligned data may need to be copied by the library one byte at a time. In particular, the bcopy() routine needs to ensure that it never attempts to read a byte that is not present in the calling process's address space. This often makes an unoptimized bcopy() routine extremely slow. As the size of the data to be copied (in the first column) increases, the copy rate of a good bcopy() implementation should approach that of the aligned bcopy() measurements that follow. A simplistic solution may iterate through the data one byte at a time and be substantially slower for copying large blocks of data. An incorrectly optimized implementation could manage to generate a segmentation fault via a reference to data that is not mapped into the application's address space.
"libc bcopy unaligned
0.000512 870.40
0.001024 1300.36
0.002048 1755.31
0.004096 2095.21
0.008192 2166.23
0.016384 376.98
0.032768 383.18
0.065536 379.35
0.131072 382.55
0.262144 252.74
0.524288 196.17
1.05 183.64
2.10 186.51
4.19 185.17
8.39 187.81
16.78 187.52
33.55 190.23
67.11 188.79
134.22 190.93
268.44 189.05

This version is the same as the preceding one, where the first column is the amount of data being copied (512 bytes to 512MB) and the second is the throughput in MBps.

"libc bcopy aligned
0.000512 870.49
0.001024 1299.05
0.002048 1755.31
0.004096 2093.22
0.008192 2048.00
0.016384 376.47
0.032768 383.11
0.065536 379.29
0.131072 382.55
0.262144 259.08
0.524288 196.61
1.05 183.96
2.10 186.65
4.19 185.68
8.39 188.25
16.78 187.65
33.55 190.75
67.11 189.32
134.22 191.57
268.44 191.24

"unrolled bcopy unaligned
0.000512 1045.57
0.001024 1033.53
0.002048 1044.55
0.004096 1033.31
0.008192 996.75
0.016384 342.01
0.032768 335.21
0.065536 332.38
0.131072 318.58
0.262144 241.38
0.524288 173.98
1.05 159.09
2.10 160.30
4.19 159.16
8.39 160.63
16.78 159.23
33.55 159.44
67.11 159.55
134.22 161.25
268.44 159.70

"unrolled partial bcopy unaligned
0.000512 5399.13
0.001024 5122.18
0.002048 5342.10
0.004096 5665.45
0.008192 3491.16
0.016384 386.58
0.032768 387.40
0.065536 387.93
0.131072 388.87
0.262144 245.68
0.524288 180.23
1.05 163.28
2.10 164.81
4.19 164.77
8.39 166.05
16.78 165.24
33.55 167.56
67.11 166.40
134.22 166.58

AIM7 and AIM9

AIM Technologies released two commonly used benchmarks in the mid-1980s. Caldera/SCO acquired AIM Technologies and released the AIM7 and AIM9 benchmarks under the GPL in 1999. (See http://www.caldera.com/developers/community/contrib/aim.html for details.) The source can be downloaded from ftp://ftp.sco.com/pub/opensource/ or http://sourceforge.net/projects/aimbench.
Although both AIM7 and AIM9 are old enough to be nearly obsolete, they are interesting primarily because they are freely available under the GPL in source form and provide a useful performance perspective for understanding today's Linux systems and the workloads running on them.

AIM7 (http://cvs.sourceforge.net/viewcvs.py/ltp/benchmarks/)

The AIM7 benchmark is often referred to as the AIM Multiuser Benchmark because it focuses on systems that support multiple interactive users. Although the benchmark is somewhat dated compared to the tasks that multiple users run on a single system today, the test still stresses some of the core subsystems of the underlying operating system in a fairly generic way. For instance, the AIM7 test focuses on running a large number of tasks simultaneously, which stresses the operating system's scheduler, as well as its process creation and process exit capabilities. The test attempts to use a reasonable amount of storage, scaled to the number of users in this synthetic workload; however, the amount of storage actually used by the benchmark is fairly low for today's storage-hungry workloads. The test minimizes the use of floating-point and complex mathematical operations, making it less useful as a benchmark for the high-performance computing needs of the scientific and technical communities. Other benchmarks focus on the matrix multiplications, problem partitioning, and computation-intensive workloads more common in those communities. AIM7 spends a fairly high percentage of its time sorting and searching through relatively large quantities of data, although the amount of data is probably low by today's standards. Also, today's workload mix tends to offload such operations to databases rather than relying on brute-force search and sort algorithms. AIM7 also attempts to focus more on the system libraries and system calls than some other benchmarks do.
When used as part of a comprehensive set of benchmarking, AIM7 provides useful input into the comparative performance of various UNIX systems, most of which purport to supply similar capabilities on different implementations of the operating system and on different hardware.

AIM9 (http://cvs.sourceforge.net/viewcvs.py/ltp/benchmarks/)

AIM9 is referred to as the AIM Independent Resource Benchmark. It has a more component-oriented focus than AIM7. In many respects, it provides a view similar to that of LMbench in that it measures discrete capabilities such as additions per second or sorts per second. This is in contrast to AIM7, which provides a measure of more user-visible operations. AIM9 attempts to avoid measures that are impacted by the operating system's memory subsystem or scheduler; instead, it focuses on processor-bound capabilities. AIM9 measures a slightly different set of underlying operations than LMbench. In particular, AIM9 has a larger focus on integer and floating-point calculations, while also providing another view of disk performance testing and localhost networking tests such as TCP, UDP, FIFOs, and pipes. The tests typically run for a short, fixed period of time and measure the number of underlying operations completed during that time. The specific tests, the amount of time each test runs, and a number of environmental factors, such as the locations of various files and which compiler to use, can be easily reconfigured. The performance of a system is more than the sum of its parts. The complex interactions of system calls, locking, network latency, disk latency, user input, application dependencies, and many other factors combine in ways that are often difficult to predict.
Component viewpoints such as those provided by LMbench and AIM9 may be important in understanding raw processor power, general operating system comparisons, and performance impacts of certain critical application operations, but it is rarely possible to extrapolate the performance of a complex workload from such a narrowly focused benchmark. However, understanding the component limitations in terms of performance and throughput helps provide useful background when trying to analyze performance problems in more complex workload scenarios.

Sample AIM9 Output
Reaim

Reaim is a project that was in development by the Open Source Development Labs (OSDL) at the time this book was written. It is an effort to update AIM7 for today's workloads. Reaim's goal is to provide a mixed application workload with some repeatable, concrete measures for throughput. It aims to be self-contained and easy to set up, configure, and run. It also aims to provide quick results that allow operating system developers to measure the benefits of specific changes, or allow application performance analysts to quickly understand the benefits of system configuration changes or modifications to underlying operating system tuning parameters.

SPEC SDET

SPEC SDET is a retired benchmark that is still available to members of the Standard Performance Evaluation Corporation (SPEC). It is another benchmark that measures a mixed workload, although the components of that workload mix are somewhat dated. However, it provides a fairly simple, self-contained environment for measuring the throughput of a set of scripts. Each script is run for five iterations by default, with an increasing number of users. The number of times the scripts are run and the number of users are configurable. The output shows both scalability under load and absolute performance at any stress point. SPEC SDET is the first benchmark covered in this chapter that is a controlled benchmark. Specifically, the SPEC organization has strict rules governing how the test must be run and how the results must be portrayed. In many cases, results may be subject to audit to establish validity. In some cases, there are slightly less strict rules for research or evaluation purposes, and such runs must be clearly identified as specified by the license the end user agreed to when she acquired the benchmark. In our case, the results were run on an internal machine and were not conformant.
In particular, the operating system had custom modifications and the lack of controls means that these may not be repeatable and that the specific results should not be compared against any other SPEC SDET results. Many major benchmarks come with such restrictions on how the tests can be run, how the results should be repeatable, and how the results should be auditable. These restrictions and publication guidelines are intended to protect the integrity of the benchmark, preventing anyone from generating misleading comparisons. The organizations providing the benchmarks strive to ensure that anyone attempting to compare performance or price performance numbers can do so and therefore make informed decisions about technologies, capabilities, and so on of a platform or software offering. Because these benchmarks become a standard for comparison, the competition to publish the best numbers can be intense. And because the costs of running, tuning, and optimizing these large benchmarks can be so high, typically only large companies purchase licenses to use the benchmarks, procure the complex hardware environment required to run the benchmark, and apply the experience and resources needed to generate publishable numbers. Benchmarks also tend to drive new technologies, which improve the components measured by that benchmark. One constant concern that the major benchmark providers diligently address is the tendency of technologists to optimize specifically for the benchmark. If any component of the benchmark does not reflect actual, common usage of the components being measured, the benchmark could become less useful to real users. Therefore, the benchmarking companies constantly evaluate the results and work to improve the validity and utility of their benchmarks. 
With all of that in mind, the following results are from one such test that is less stringently controlled, primarily because it is retired, but also because the test is less complex and easily repeatable by an end user. SPEC and the benchmark name SPECsdm are registered trademarks of the Standard Performance Evaluation Corporation. This benchmarking was performed for research purposes only. This benchmark run is noncompliant, and the results may not be compared with other results.

SDET results 1-20 users, 5 iterations:
 1 users:  2553  2011  2553  2553  2307  mean:  2395 stddev: 10.01%
 2 users:  4931  4022  4260  4186  4586  mean:  4397 stddev:  8.24%
 3 users:  5869  5373  5934  5046  6206  mean:  5685 stddev:  8.22%
 4 users:  6666  5830  6000  6233  7422  mean:  6430 stddev:  9.91%
 5 users:  6642  7142  7929  7826  7317  mean:  7371 stddev:  7.13%
 6 users:  7970  8089  8470  8605  8307  mean:  8288 stddev:  3.16%
 7 users:  9403  8936  9264  8456  8344  mean:  8880 stddev:  5.32%
 8 users:  9230 10034  9085  9350  9411  mean:  9422 stddev:  3.86%
 9 users: 10031 10836 10351 10220 10657  mean: 10419 stddev:  3.13%
10 users: 11320 11764 11842 10843 11726  mean: 11499 stddev:  3.64%
11 users: 10909 12036 12492 12336 11478  mean: 11850 stddev:  5.51%
12 users: 12485 11250 12743 12067 12378  mean: 12184 stddev:  4.73%
13 users: 13371 13333 12219 11011 11878  mean: 12362 stddev:  8.13%
14 users: 12923 11325 11004 13298 13298  mean: 12369 stddev:  9.03%
15 users: 11297 10588 11739 14285 13953  mean: 12372 stddev: 13.34%
16 users: 13395 13150 14117 13457 13333  mean: 13490 stddev:  2.73%
17 users: 13275 12830 14069 14036 13814  mean: 13604 stddev:  3.95%
18 users: 13584 13251 13360 13360 14117  mean: 13534 stddev:  2.57%
19 users: 13680 14279 12930 13790 13902  mean: 13716 stddev:  3.60%
20 users: 13584 13308 13740 13872 13636  mean: 13628 stddev:  1.54%

For reference, the metric of interest in the preceding output is the number of scripts per hour executed by each of five runs of the test with the indicated number of users.
The tests are run five times, and the metric of interest for each run is printed, followed by the mean and standard deviation of those values. In this particular case, we were looking at how the Linux scheduler performed with a couple of algorithms, attempting to identify whether a new approach would have benefits for multitask workloads. With only the output of a single run here, we cannot compare the results or come to any specific conclusions; however, we can see what the output format shows us. The first two columns indicate the number of users, or the amount of stress the benchmark was applying to the system, moving upward from simulating a single user to simulating 20 users. The next five columns show the number of tasks completed in each run, followed by the mean and standard deviation of those five runs. In particular, it is interesting to note that this test generates results that are probably less stable than might be ideal. Using the mean value as the primary comparison key and noting the standard deviation between each set of five runs indicates the runs' stability. Some standardized benchmarks provide highly stable results, with repeated runs varying by only a single unit of the measurement metric. Other benchmarks derive more stable metrics from a mathematical analysis of the underlying raw data. Deriving metrics is best left to the experts who define and develop benchmarks. Benchmark metrics can be as misleading as any other statistics if results are derived without appropriate context and without a solid understanding of the underlying sources of variance in the raw numbers. In this particular benchmark, the high standard deviation may be a side effect of the operation of the Linux scheduler being measured at the time, or it may be inherent in the benchmark.
Without a wide variety of tests against which to correlate the results, the metrics could easily be taken out of context, and a performance analyst could draw invalid conclusions. On the other hand, such microbenchmarks can be invaluable when measuring the impact of various tuning parameters and hardware or software configurations. And results from slightly more complex microbenchmarking workloads, such as AIM7, Reaim, and SPEC SDET, often have a higher correlation with real customer workloads, without the added complexity and difficulty of measurement that those real workloads entail.

Disk Benchmarks

This section discusses the following disk benchmarks: Bonnie/Bonnie++, IOzone, IOmeter, tiobench, and dbench.
We have examined two classes of microbenchmarks: those that focus on the primitive, primary capabilities of the underlying processor and operating system, and those that use simplistic workload simulations to exercise some of the more commonly used operating system components, such as the scheduler or IPC mechanisms. Next, we'll look at a few other classes of benchmarks that focus on other components of the system. These include benchmarks for the disk subsystem, various file systems, and the networking subsystem. These components often have a higher correlation with specific aspects of the end-user workload and therefore are quite valuable in isolating and identifying the key performance characteristics of the file systems and disk subsystems underlying the end-user workload. Many of these benchmarks are developed from or related to traces of actual workloads running on specific file systems. Others are simply designed to create a stressful environment in which throughput can be measured in best-case and worst-case scenarios.

Bonnie/Bonnie++

Bonnie is primarily a file system test, useful for measuring underlying file system performance and identifying bottlenecks. Bonnie and Bonnie++ focus on a set of sequential read and write tests and some random read/write testing. Bonnie measures both character-at-a-time and block-at-a-time reads and writes. The character-at-a-time tests use the stdio getc() and putc() functions to provide some measure of the overhead of the stdio library, whereas the block transfers use the read() and write() system calls to provide a measure of file system throughput and latency with very little additional overhead. The random tests run a user-specified number of threads, each of which seeks to a random location and performs a read(); at a random interval, the block that was read is updated and written back via write().
$ ./Bonnie -d bonnie-tmp -m laptop
File 'bonnie-tmp/Bonnie.31541', size: 104857600
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
laptop    100  5911 56.2 14987 27.5  3703  4.7  8889 70.7 12414  4.5 981.6  7.6

In the preceding run, the default file size was used (100MB). Bonnie did about 100 million (100MB, one character at a time) putc() operations at a rate of 5911KB/sec, utilizing 56% of the CPU. The same I/O done in blocks ran at about 15MB/sec (14987KB/sec) using only 27% of the CPU. The CPU cost per byte of the character interface was several times that of the block I/O operations, so the library interface is clearly fairly CPU-intensive. Application writers may use this same data to decide whether they want the advantages of using buffered I/O, with an eye toward the worst case being byte-at-a-time access, or the advantages of doing direct read() and write() operations, with the resulting overhead as quantified in the preceding output for a particular hardware and operating system combination. Note that the comparison may not be completely fair. The block size used by Bonnie is 16K, but the underlying block size for getc() and putc() is only 4K in the version of Linux and glibc used on the test machine. Also, the random seeks and reads, using the strategy of 10% modify and write, may not map directly to any particular application, although they do provide an initial basis for comparison. Bonnie++ is a benchmark derived directly from Bonnie.
It begins with the same tests that are in Bonnie but extends the very simple application with support for larger files (more than 2GB on a 32-bit machine), a number of parameters to control block sizes, the capability to set the random number generator seed, the capability to generate data for use in a spreadsheet, and so on.

$ ./bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
laptop          1G   121  88 11920  23  4797   8   452  73 17249  13  71.9   1
Latency            529ms    3201ms    3105ms     127ms     376ms    2487ms
Version 1.93c      ------Sequential Create------ --------Random Create--------
laptop             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16   294  69 26593  90 11278  72   346  82 27319  90   889  70
Latency            1128ms    3338us     195ms     199ms    3321us    1187ms
1.93c,1.93c,laptop,1,1087268459,1G,,121,88,11920,23,4797,8,452,73,17249,13,71.9,1,16,,,,,294,69,26593,90,11278,72,346,82,27319,90,889,70,529ms,3201ms,3105ms,127ms,376ms,2487ms,1128ms,3338us,195ms,199ms,3321us,1187ms

Note that Bonnie++ includes measures of I/O latency and shows the rate of file creation and deletion with interspersed reads. This model provides some insight into the sort of operations that a mail server might perform. In fact, Bonnie++ has an option to perform an fsync() after each write operation, as many mail servers might do.
The last line of the output describes the same results in a format suitable for inclusion in a spreadsheet or for analysis by a Perl script.

IOzone

IOzone is another file system benchmark that can be used to compare a number of different workloads or operating system platforms. In addition to common tests for read and write, IOzone can test such capabilities as reading backward, reading a stride of data (for example, every third or fifth block of data), various block sizes, and random mixes of reads and writes. IOzone can also test reads and writes through stdio (for example, via fread()/fwrite()). IOzone allows a performance analyst to model different types of workloads through its rich set of configuration parameters, from primarily sequential decision-support database workloads to transaction-processing workloads that consist of read-mostly, highly random I/O patterns. IOzone can also use asynchronous I/O and can model performance differences between primarily synchronous and asynchronous I/O-based workloads. It can model applications using threaded and nonthreaded workloads and can use mmap() as the underlying API for file reads and writes.

IOmeter

IOmeter is another I/O and file system subsystem test that is useful for I/O testing on a single machine but can also be used on clustered systems. The test was originally written by Intel and later released into the public domain. IOmeter is available for download at http://sourceforge.net/projects/iometer/. IOmeter also runs on multiple operating systems and can be used to compare them. However, it can also be used on a single system to compare changes in kernel configuration parameters, application tuning, hardware configuration, and so on. IOmeter provides a slightly different view of performance tuning.
Its goal is not to run a task and see how quickly it completes, or to see how many operations it completes in a period of time. Instead, it provides a steady-state workload and allows you to monitor various aspects of the system under load. You can also vary the workload and see how the underlying system responds. In addition, you can vary the system parameters, configuration, and so on to see how that impacts the workload. In other words, it provides more tuning control points for analyzing changes in workloads than any of the other benchmarks discussed previously. A similar tool that launched before IOmeter was publicly available is pgmeter, also known as Penguinometer. pgmeter is available at http://pgmeter.sourceforge.net/ (or http://sourceforge.net/projects/pgmeter).

tiobench

tiobench is a threaded file system benchmark that can generate tests using one or more simultaneous threads of I/O. This allows the simulation of applications that generate I/O in parallel, either independently of each other (truly random) or with some coordination (for example, with some sequential I/O).

$ ./tiobench.pl
gerrit@w-gerrit2:/nue/tiobench-0.3.3$ ./tiosum.pl
No size specified, using 1022 MB
Run #1: ./tiotest -t 8 -f 127 -r 500 -b 4096 -d . -TTT
Unit information
================
File size = megabytes
Blk Size  = bytes
Rate      = megabytes per second
CPU%      = percentage of CPU used during the test
Latency   = milliseconds
Lat%      = percent of requests that took longer than X seconds
CPU Eff   = Rate divided by CPU% - throughput per cpu load

File size in megabytes, Blk Size in bytes. Read, write, and seek rates in MB/sec. Latency in milliseconds. Percent of requests that took longer than 2 and 10 seconds.
Sequential Reads
        File  Blk   Num                 Avg      Maximum  Lat%     Lat%     CPU
Kernel  Size  Size  Thr  Rate   (CPU%)  Latency  Latency  >2s      >10s     Eff
-------------------------------------------------------------------------------
2.6.3   1022  4096  1    12.82  5.645%  0.303    1450.99  0.00000  0.00000  227
2.6.3   1022  4096  2    17.45  7.706%  0.446    1749.80  0.00000  0.00000  226
2.6.3   1022  4096  4    16.77  7.419%  0.901    2100.15  0.00038  0.00000  226
2.6.3   1022  4096  8    18.54  8.404%  1.585    1717.18  0.00000  0.00000  221

Random Reads
        File  Blk   Num                 Avg      Maximum  Lat%     Lat%     CPU
Kernel  Size  Size  Thr  Rate   (CPU%)  Latency  Latency  >2s      >10s     Eff
-------------------------------------------------------------------------------
2.6.3   1022  4096  1    0.26   0.637%  15.081   2199.50  0.02500  0.00000  41
2.6.3   1022  4096  2    0.46   0.685%  16.754   293.99   0.00000  0.00000  66
2.6.3   1022  4096  4    0.41   0.717%  37.062   1367.18  0.00000  0.00000  57
2.6.3   1022  4096  8    0.35   0.591%  84.926   3423.89  0.40000  0.00000  59

Sequential Writes
        File  Blk   Num                 Avg      Maximum  Lat%     Lat%     CPU
Kernel  Size  Size  Thr  Rate   (CPU%)  Latency  Latency  >2s      >10s     Eff
-------------------------------------------------------------------------------
2.6.3   1022  4096  1    18.68  24.68%  0.194    799.52   0.00000  0.00000  76
2.6.3   1022  4096  2    17.21  22.51%  0.411    3206.45  0.00076  0.00000  76
2.6.3   1022  4096  4    18.78  24.89%  0.684    3277.35  0.00191  0.00000  75
2.6.3   1022  4096  8    18.10  23.86%  1.357    6397.58  0.01615  0.00000  76

Random Writes
        File  Blk   Num                 Avg      Maximum  Lat%     Lat%     CPU
Kernel  Size  Size  Thr  Rate   (CPU%)  Latency  Latency  >2s      >10s     Eff
-------------------------------------------------------------------------------
2.6.3   1022  4096  1    0.58   0.698%  0.236    66.66    0.00000  0.00000  83
2.6.3   1022  4096  2    0.73   0.911%  1.106    967.64   0.00000  0.00000  81
2.6.3   1022  4096  4    0.70   0.911%  2.677    1474.37  0.00000  0.00000  77
2.6.3   1022  4096  8    0.78   1.035%  6.960    3392.60  0.05000  0.00000  75

tiobench allows for a variety of patterns, as shown from the help option.
A single run can be set up to do a large set of tests with multiple block sizes and multiple numbers of threads, with each thread issuing a random set of interspersed I/Os, which can be configured to resemble a number of applications using a single file system.

$ tiobench.pl --help
Usage: ./tiobench.pl [<options>]
Available options:
        [--help]                            (this help text)
        [--identifier IdentString]          (use IdentString as identifier in output)
        [--nofrag]                          (don't write fragmented files)
        [--size SizeInMB]+
        [--numruns NumberOfRuns]+
        [--dir TestDir]+
        [--block BlkSizeInBytes]+
        [--random NumberRandOpsPerThread]+
        [--threads NumberOfThreads]+
+ means you can specify this option multiple times to cover multiple
cases, for instance: ./tiobench.pl --block 4096 --block 8192 will first run
through with a 4KB block size and then again with an 8KB block size.
--numruns specifies over how many runs each test should be averaged

Overall, tiobench is a simple program with some fairly flexible configuration capabilities for simulating multiple-application workloads, or very simple database-style random or sequential access models, from a purely file-system-oriented perspective.

dbench

dbench is a workload created to model another benchmark: NetBench. NetBench is the de facto industry standard benchmark for measuring Windows file servers. However, as with several benchmarks discussed later in this chapter, NetBench requires access to a large number of clients and a very large server to measure performance. With many benchmarks, that complexity makes them better suited to benchmark teams sponsored by larger vendors, who usually create a run after a product is generally available. NetBench is also poorly suited to help developers analyze code during the development cycle because of the long cycle time for testing and the complex testing environment required. Many Linux developers prefer to analyze the performance of various aspects of their code as that code is developed.
That analysis requires benchmarks that are easy to run, benchmarks that can be easily understood, and benchmarks that require minimal hardware configurations. dbench was written by the developers of the Windows file system support for Linux and achieves those goals. Their goal was to use dbench to analyze their code for compatibility, functionality, performance, and scalability on Linux. dbench was developed based on a set of network traces captured while running NetBench. The theory was that if you can simulate the traffic pattern that accessing a network file system generates on the network, you can generate effectively the same load with a program that does nothing except generate those patterns. NetBench used a more brute-force approach in which each network client was an individual computer, generating file system accesses at the rate a real user might generate them. By effectively simulating the file system traffic on the local machine, the server portion of the network file system code should be stressed in much the same way. dbench effectively eliminates the need for a network in this case, simply simulating locally the traffic that would otherwise come in over the network. This simplifies the overall analysis to a point that allows the developers to focus on a single component of the entire system: the way that the file system code responds to a relatively standardized client load. In fact, with this level of simulation, it is now possible to simulate workloads even larger than those often run in a NetBench configuration, and with only a single machine. Running dbench is very simple: its only argument is the number of clients to run, so the following output is the result of running dbench while simulating 16 clients.
$ dbench 16
16 clients started
  16       637   234.00 MB/sec
  16      3070   285.37 MB/sec
  16      4086   214.91 MB/sec
  16      4909   177.61 MB/sec
  16      7051   177.85 MB/sec
  16      8999   174.34 MB/sec
  16     11030   174.34 MB/sec
  16     13069   173.72 MB/sec
  16     15765   179.60 MB/sec
  16     17505   176.47 MB/sec
  16     20106   180.02 MB/sec
  16     22086   179.87 MB/sec
  16     24246   179.19 MB/sec
  16     26469   179.88 MB/sec
  16     27781   175.42 MB/sec
  16     30133   177.20 MB/sec
  16     32413   177.82 MB/sec
  16     34923   179.28 MB/sec
  16     37410   181.32 MB/sec
  16     39026   179.35 MB/sec
  16     42022   182.70 MB/sec
  16     44060   182.13 MB/sec
  16     46126   181.81 MB/sec
  16     48693   183.46 MB/sec
  16     50478   182.01 MB/sec
  14     53321   183.62 MB/sec
  12     55804   185.15 MB/sec
   7     59312   188.96 MB/sec
   1     62463   191.78 MB/sec
   0     62477   191.49 MB/sec
Throughput 191.494 MB/sec 16 procs

The final line indicates the overall throughput of the underlying file system when responding to the rough equivalent of 16 NetBench client systems. The interim lines provide a running summary of the throughput. As with any benchmark, it is important to understand exactly what workload is being measured. It is also important for the performance analyst to understand how that measurement might relate to his own workload. As an example, NetBench tends to model a workload that is very write intensive (as much as 90% writes). Most real-user situations rarely have a need to write so much data to the server, or to write that percentage of data for a long period of time. However, writes tend to be more resource-intensive and harder to optimize than reads. In this case, dbench could be useful for some worst-case analysis of a workload or an environment, but it is unlikely to be directly representative of that workload. dbench was originally written to model another benchmark; however, it falls short in a couple of properties that would otherwise make it an even more useful benchmark.
In particular, dbench seems to have a very wide variation in throughput results between runs, even when run repeatedly on identical hardware under what are believed to be identical conditions. This deviation in run results means that the test often needs to be run many times, with an average and standard deviation calculated for those results. Running dbench from a development and stress perspective can therefore be very useful, but using dbench as a multiplatform or multiconfiguration test requires additional runs and more stringent analysis of the results. Also, absolute results from dbench tend to be nearly meaningless because of these deviations. However, dbench remains a useful tool for generating stress, seeing how the system responds under that stress, and providing some reasonable guidelines on the benefits of configuration changes, such as comparing different file systems on the same hardware and software stack, or changing some file system or related kernel parameters on a given configuration. As with all benchmarks, a solid understanding of their operation, strengths, and weaknesses, as well as an ability to compare the benchmark's operation to that of the end-user workload to be used on a given machine, is critical. A companion tool to dbench is tbench. Whereas dbench factors out all the networking and real client components of NetBench, tbench factors out all the disk I/O. tbench was constructed from the same NetBench traces and models the related networking traffic in a network file system. Here is the output from a 32-thread (client) run; the output is nearly identical to that of dbench but measures the effective throughput of only the networking component of a NetBench-style workload.
32 clients started
  32       735   402.57 MB/sec
  32      1978   395.83 MB/sec
  32      3735   399.99 MB/sec
  32      5781   390.50 MB/sec
  32      7870   382.69 MB/sec
  32     10164   382.61 MB/sec
  32     12356   380.21 MB/sec
  32     14491   379.11 MB/sec
  32     16614   376.35 MB/sec
  32     18638   372.94 MB/sec
  32     20838   372.71 MB/sec
  31     23087   373.25 MB/sec
  31     25178   371.87 MB/sec
  31     27435   373.10 MB/sec
  31     29571   372.40 MB/sec
  30     31702   371.65 MB/sec
  29     33870   370.35 MB/sec
  28     36058   371.11 MB/sec
  26     38345   371.14 MB/sec
  22     40935   374.55 MB/sec
  20     43394   376.73 MB/sec
  20     45815   377.82 MB/sec
  18     48291   379.80 MB/sec
  17     50970   383.07 MB/sec
  15     53593   385.23 MB/sec
  13     56601   389.88 MB/sec
   7     59426   393.44 MB/sec
   2     61201   390.04 MB/sec
   2     61966   381.01 MB/sec
   1     62323   370.27 MB/sec
   0     62477   364.31 MB/sec
Throughput 364.306 MB/sec 32 procs

By using these two tests independently, it is possible to simulate the conditions under which the untested component is assumed to be infinitely fast. This helps find bottlenecks in individual subsystems without being dependent on bottlenecks or performance constraints in other subsystems. For instance, some of the complex benchmarks discussed later in this chapter provide a single number or small set of numbers to represent the overall throughput of a highly complex workload. In these complex workloads, a combination of processing speed, network capacity, disk latency, file system performance, database workload, Java performance, web server response times, and other factors combine to yield a single throughput and/or latency result. But if a single component of the workload is misconfigured or inadequate for the environment being tested, the overall number suffers, often without any direct feedback as to the cause of the deficiency. At that point, an analyst needs to look at the detailed statistics provided by the benchmark, understand what portions are most critical to the workload's throughput, and attempt to tune or otherwise correct that subsystem and rerun.
This process can take days for an expert, and for the uninitiated, this process of analysis, retuning, reconfiguration, and so on can take months! It may appear, then, that component tests like dbench and tbench are "the only way to go." At first glance, the answer is: definitely. However, even detailed analysis of the various components does not provide a comprehensive view of how those components interact. For instance, dbench can indicate that the file system component is working well, and tbench can help tune the networking component to work well. But if there are interactions between the networking component and the underlying file system code, such as locking, scheduling, or latency issues, the overall result of a comparable NetBench run may be quite different. If it is different, the more holistic measurement is most likely the one that will show the worse result. As with software design, integrated circuit design, home building, automobile manufacturing, and so on, there is distinct value in doing performance analysis on the various components before integrating them into a greater whole. But component tests are not a replacement for full integration testing and performance analysis on the integrated set of those tools and components. As with automobile design, it is possible to design an engine that performs so well that it could push an automobile to 200 mph. But if the remainder of the automobile's components are not designed and tested to meet those same performance specifications, the end user of that automobile may not be happy with how the vehicle performs and handles under that user's workload.

Network Benchmarks

This section discusses the following network benchmarks:
Netperf

Netperf is a fairly comprehensive set of network subsystem tests, including support for TCP, UDP, DLPI (Data Link Provider Interface), and UNIX Domain Sockets. Netperf is actually two executable programs: a client-side program, Netperf, and an accompanying server-side program, Netserver. Netserver is usually executed automatically out of a system's inetd daemon; all command specification is done via the Netperf command-line interface. Netperf communicates with Netserver over a standard TCP-based socket, using that communication channel to establish the subsequent tests. A secondary control and status connection is opened between the client and server, depending on the test options selected, and performance statistics are returned to the client over this secondary connection. During a test, there should be no traffic (other than potential keep-alive packets) on the primary control channel. Therefore, if the performance analyst shuts off all other network traffic to the machine, Netperf can measure the full capability of the network performance of a pair of machines and their corresponding network fabric. Netperf is primarily useful for measuring TCP throughput, which accounts for the majority of network traffic generated by modern applications. Netperf comes with a couple of scripts that help automatically generate some basic tests covering socket size and send buffer size: tcp_stream_script and tcp_range_script. The first one runs Netperf with some specific socket sizes (56K, 32K, 8K) and send sizes (4K, 8K, 32K). The second script uses a socket size of 32K by default and iterates over send sizes from 1K to 64K. Here is some sample output from a run of Netperf.
RCV     SND     MSG   TIME   THRUPUT  CPU    CPU    Snd    RCV
                                      send   recv   lat    lat
131070  131070  8192  60.00  798.59   9.28   11.34  1.903  2.327
131070  131070  8192  60.00  796.62   9.12   11.48  1.876  2.360
131070  131070  8192  60.00  800.74   9.14   11.78  1.869  2.411
131070  131070  8192  60.00  796.92   9.24   11.90  1.899  2.447
131070  131070  8192  60.00  800.05   9.20   11.48  1.885  2.350

RCV      = receive socket size (bytes)
SND      = send socket size (bytes)
MSG      = send message size (bytes)
TIME     = elapsed time (seconds)
THRUPUT  = throughput (10^6 bits/second)
CPU send = CPU utilization, send side, local (%)
CPU recv = CPU utilization, receive side, remote (%)
Snd lat  = service demand, send side, local (us/KB)
RCV lat  = service demand, receive side, remote (us/KB)

See http://www.netperf.org for the benchmark and some archived results.

SPEC SFS

SPEC (http://www.spec.org/) provides a benchmark for network file systems. It is primarily used today for benchmarking NFS v2 and v3 servers. It allows for benchmarking both the UDP and TCP transports. SPEC SFS also provides a breakdown of all the common network file operations used by NFS, including reading directories, reading files, writing files, checking access rights, removing files, updating access rights, and so on. At the completion of a test run, an indication of the throughput for each operation is generated. The test attempts to simulate a real-world workload with an appropriate mix of the underlying operations so that the results should be relevant to any reasonable NFS workload. A performance analyst may want to monitor the operations of her own workload and compare it to the published breakdown of operations as run by the SPEC SFS benchmark before drawing too many conclusions. As with other SPEC-based tests, there are strict run rules, configuration requirements, data integrity requirements, and reporting requirements for test results. It is still possible to use the test for internal testing and to use it to compare the results of various configuration changes to an environment.
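Before moving on to application benchmarks, the bandwidth-measurement loop at the heart of a Netperf-style TCP stream test can be sketched in a few lines. This is a loopback-only illustration, not how Netperf itself is implemented; the 8192-byte send size echoes the sample output above, but the send count is an arbitrary choice, and a real test would run between two machines.

```python
import socket
import threading
import time

MSG_SIZE = 8192     # bytes per send, matching the 8192-byte MSG column above
COUNT = 2000        # number of sends; arbitrary, small for illustration

def sink(listener):
    # Server side: accept one connection and drain everything sent to it.
    conn, _ = listener.accept()
    with conn:
        while conn.recv(65536):
            pass

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # loopback only for this sketch
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=sink, args=(listener,), daemon=True).start()

payload = b"\0" * MSG_SIZE
with socket.create_connection(("127.0.0.1", port)) as client:
    start = time.perf_counter()
    for _ in range(COUNT):
        client.sendall(payload)
    elapsed = time.perf_counter() - start

# Throughput in 10^6 bits/second, the unit Netperf reports.
mbits_per_sec = MSG_SIZE * COUNT * 8 / 1e6 / elapsed
print(f"{mbits_per_sec:.1f} x 10^6 bits/second")
```

Note that this sketch only times the sends into the kernel's socket buffers; Netperf coordinates with Netserver over a separate control connection precisely so that it can also report accurate receive-side and CPU statistics.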
Application Benchmarks

This section discusses the following benchmarks:
Java Benchmarks

Three commonly used benchmarks for comparing implementations of Java, Java settings, or Java on particular platforms are Volanomark (http://www.volano.com/benchmarks.html), SPECjbb, and SPECjvm. Volanomark is a benchmark designed to simulate the VolanoChat application. VolanoChat is a chat room application written entirely in Java that has application characteristics similar to many Java applications. In particular, Volanomark simulates long-running network connections with many threads instantiated on the local operating system. Because chat applications are inherently interactive, latency and response times are very important, and the Volanomark benchmark's goal is to see how many connections to the chat room can be maintained within a specified response time. Obviously, the more connections you can maintain with a reasonable latency, the less overhead the operating system or the Java implementation consumes. Volanomark can be run with both the server application and all clients on the local host. This is useful for removing the network component from the overall equation and focusing exclusively on the Java implementation and the underlying operating system performance characteristics. However, a more realistic workload involves setting up clients on a network to drive a server. This helps measure end-to-end latency as well as overall Java performance. It also simulates a more reasonable customer deployment of Java, in which Java is used as a middleware layer in a multitier client/server configuration. Volanomark, SPECjvm, and SPECjbb also add a new level of complexity over the benchmarks discussed thus far. Most of the preceding benchmarks may not be affected significantly by system tuning. They are more likely to be affected by hardware configuration changes, total available system memory, I/O configuration, and so on.
For these benchmarks, the same operating system configuration options, hardware configuration options, and so on are often secondary to the tuning configuration parameters of the version of Java in use. Java itself has configuration parameters governing the size of the heap it uses, the number of threads, the type of threads, and so on. Each of these configuration parameters interacts with the test and the underlying operating system in ways that are often difficult to predict. As a result, benchmark publication efforts often run waves of tests, changing various configuration parameters at each level with some intelligent guesses as to the potential interactions of the configuration parameters. This technique is fairly common for finding the best results a particular benchmark can provide. However, a benchmark like this can also be used to model the impacts of various configuration changes. The benchmark in this case can provide a scientific control against which to compare various changes. Using the benchmark as a control allows the performance analyst to try out different versions of Java, different operating systems or kernels, different configuration parameters for Java, and so on. This gives the performance analyst more insights into optimizing similar workloads. Volanomark provides a high-level metric for indicating throughput, including messages sent, messages received, total messages, elapsed time, and average throughput in messages per second. The following is the output from a sample run.

$ volanomark 1024 1024 1000
Running with start heap = 1024 max heap=1024 msg_count=1000
[ JVMST080: verbosegc is enabled ]
[ JVMST082: -verbose:gc output will be written to stderr ]
<GC[0]: Expanded System Heap by 65536 bytes
java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1)
Classic VM (build 1.4.1, J2RE 1.4.1 IBM build cxia32dev-20030702 (JIT enabled: jitc))

VolanoMark(TM) Benchmark Version 2.1.2
Copyright (C) 1996-1999 Volano LLC.
All rights reserved.

Creating room number 1 ... 20 connections so far.
Creating room number 2 ... 40 connections so far.
Creating room number 3 ... 60 connections so far.
Creating room number 4 ... 80 connections so far.
Creating room number 5 ... 100 connections so far.
Creating room number 6 ... 120 connections so far.
Creating room number 7 ... 140 connections so far.
Creating room number 8 ... 160 connections so far.
Creating room number 9 ... 180 connections so far.
Creating room number 10 ... 200 connections so far.
Running the test ...
<AF[1]: Allocation Failure. need 528 bytes, 0 ms since last AF>
<AF[1]: managing allocation failure, action=1 (0/1019991016) (53683736/53683736)>
<GC(4): GC cycle started Mon Nov 24 10:13:30 2003
<GC(4): freed 1018220216 bytes, 99% free (1071903952/1073674752), in 83 ms>
<GC(4): mark: 68 ms, sweep: 15 ms, compact: 0 ms>
<GC(4): refs: soft 0 (age >= 32), weak 0, final 0, phantom 0>
<AF[1]: completed in 94 ms>
Test complete.

VolanoMark version = 2.1.2
Messages sent      = 200000
Messages received  = 3800000
Total messages     = 4000000
Elapsed time       = 79.49 seconds
Average throughput = 50321 messages per second

SPECjvm is a benchmark controlled by and made available through the Standard Performance Evaluation Corporation (SPEC). SPECjvm is primarily used to measure the client-side performance of a JVM, including how well it runs on the underlying operating system. SPECjvm also measures the performance of the just-in-time (JIT) compiler. The underlying operating system and hardware performance also affect the throughput measures of SPECjvm. SPECjbb measures a more complex application environment, specifically that of an order processing application for a wholesale supplier. Like SPECjvm, SPECjbb also measures the system's underlying hardware and operating system performance. SPECjbb is loosely modeled on the TPC-C benchmark; however, it is written entirely in Java.
SPECjbb models a three-tier system (database, application, user interface) with a focus on the Java-based application layer in the middle tier. The middle tier models a typical business application and is the basis for the generated metrics of interest. SPECjbb is conveniently self-contained, which means that it does not need a complex database to be installed (it implements a simple tree-based data structure in the JVM for the database tier) and does not need a web server, because it provides a randomized simulation of user input on the first tier.

PostMark

PostMark is another type of application test, although a bit different from the preceding Java application tests. PostMark provides a means of testing a specific usage pattern on a file system: specifically, the usage pattern that might be seen on a mail server. In particular, mail servers tend to operate on many small files. Many of the other benchmarks focus specifically on file system throughput in terms of the amount of data sent to or fetched from the file system. PostMark recognizes the pattern used by mail servers (and often news servers, and possibly some web-based e-commerce servers) in which small files are frequently created and written, appended to, read, and deleted. In fact, the access patterns of most mail servers are a subset of all possible file system operations. Therefore, an I/O and disk storage subsystem that performs that set of operations well should provide a higher-performance mail server. PostMark provides statistics at the completion of a run that indicate how many files can be modified in common patterns per second. These common patterns include the overall elapsed time, elapsed time performing actual transactions and the average transaction rate in terms of files per second, total files created and the creation rate, total number of files read and the average rate at which those files were read, and so on for file appends, deletions, and the like.
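The small-file transaction pattern PostMark measures can be sketched with a minimal Python harness. This is not PostMark itself: the file count, file sizes, and 50/50 read/append mix below are arbitrary stand-ins for PostMark's configurable parameters.

```python
import os
import random
import tempfile
import time

random.seed(7)            # fixed seed so the run is reproducible
NUM_FILES = 200           # arbitrary stand-ins for PostMark's configurable
TRANSACTIONS = 400        # file-pool size and transaction count

workdir = tempfile.mkdtemp()
files = []
for i in range(NUM_FILES):
    # Initial pool of small "mail message" files, 512-4096 bytes each.
    path = os.path.join(workdir, f"msg{i}")
    with open(path, "wb") as f:
        f.write(os.urandom(random.randint(512, 4096)))
    files.append(path)

reads = appends = 0
start = time.perf_counter()
for _ in range(TRANSACTIONS):
    # Mail-server-like mix: read a whole message or append to one.
    path = random.choice(files)
    if random.random() < 0.5:
        with open(path, "rb") as f:
            f.read()
        reads += 1
    else:
        with open(path, "ab") as f:
            f.write(os.urandom(512))
        appends += 1
elapsed = time.perf_counter() - start

rate = TRANSACTIONS / elapsed
for path in files:        # clean up the scratch directory
    os.unlink(path)
os.rmdir(workdir)
print(f"{reads} reads, {appends} appends, {rate:.0f} transactions/second")
```

Even a toy harness like this exposes the property PostMark is built around: with many small files, per-operation overhead (open, close, metadata updates) dominates, so the transactions-per-second figure matters far more than raw bytes per second.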
The test also reports on the overall amount and rate of data read and written. As with many of the more complex benchmarks, many of the parameters for rates of file creation, appends, and so on can be modified to better represent a particular workload. See http://www.netapp.com/tech_library/3022.html for more information on PostMark.

Database Benchmarks

Database benchmarks are worthy of an entire book of their own and are mentioned here only briefly. Databases are often the backbone of many enterprises, maintaining customer records, sales records, inventory, marketing patterns, video clips, images, sounds, order fulfillment information, and so on. Although there are several major patterns for databases, two of the most common are online transaction processing (OLTP) and decision support (DS), sometimes referred to as business intelligence (BI). These two major models have fairly different operational characteristics, especially as to how they stress the underlying hardware and operating system. OLTP tends to be used more often for recording all aspects of transactions done in a dynamic environment. For instance, an OLTP system can be viewed as a set of point of sale (POS) terminals all doing queries on current inventory, updating sales numbers, looking up prices, and so on, much like a checkout stand at a store or an e-commerce web front end to a company's inventory. These operations tend to use the database's organizational capabilities to generate queries based on selected attributes. They do the smallest number of disk I/Os needed to locate the specific record or set of records for which an operator is searching. Occasionally, when a record is located, some data related to that record is updated, such as the number available when one is sold. In the OLTP model, most disk I/O is done in fairly small chunks (sometimes as small as 2K or 4K), and the processor is used to help run joins on tables or calculate the next offset in the table to look up.
The access pattern is typically read-mostly, with a small number of writes. Decision support, on the other hand, is often run on a database generated through an OLTP mode of operation. However, in this case, the entire database is often searched for trends, summary operations, full reports on a day's activities, and so on. In DS workloads, nearly the entire database is read, often in large chunks (more often 64K or 2MB at a time), and a limited amount of processing is done on the data returned. Occasionally, summary writes are written back to the database. These two models are sufficiently different that distinct benchmarks exist to model them. For the OLTP model, the gold standard for benchmarking is the TPC-C benchmark, published by the Transaction Processing Council (http://www.tpc.org/). The same organization also publishes the TPC-H benchmark, which models the decision support style of workload. Both of these benchmarks have evolved over the years to attempt to represent real-life workloads. These tests are very rigorously controlled, as is the reporting of the results. As with the SPEC tests discussed earlier, the run rules are well established and controlled, various requirements are placed on the coherence of data stored in the database, benchmark runs must be highly controlled, results must be published and auditable, and all publications must be done on hardware and software that is generally available or soon will be. Because these benchmarks are often used by major companies to decide what hardware, operating system, and database to purchase, they provide not only throughput and latency measures but also performance-relative-to-cost metrics. The downside of these tests is that they can be very resource-intensive to set up and run. In some cases, the size of the database to be used can be very large, and it can take hours or days to create and load the initial database.
After the database is loaded, runs can take anywhere from a few hours to many days to complete. And, in some cases, the cost of the required hardware can be prohibitive, even for large companies. As a result, most benchmark runs on the larger configurations come from large departments of large hardware or software vendor companies that have large budgets dedicated to demonstrating the value of their products in terms of raw performance or significant cost advantage over their competitors. As an alternative to the expensive, complex, and rigorously controlled databases, the Open Source Development Lab (http://www.osdl.org/) has developed a couple of similar, but vastly simpler, tests for performance analysts to run on a smaller scale. For OLTP workloads, the OSDL's Database Test #1 (OSDL DBT-1: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/) models an e-commerce site (similar to TPC-W). Database Test #2 (OSDL DBT-2: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-2/) attempts to model the TPC-C benchmark, and Database Test #3 (OSDL DBT-3: http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-3) attempts to model the TPC-H benchmark. All three of these tests are freely available and can be used to compare hardware and software configuration changes, application changes, and different application or operating system stacks. Of course, because the applications and their results are not tightly controlled like the TPC or SPEC benchmarks, keep in mind that comparing publicly published numbers may show inconsistencies in results, especially because the results are not typically audited or validated by anyone other than the individual or company that publishes the results. However, as a tool for a performance analyst, these test suites are much more accessible for day-to-day testing than the TPC or SPEC tests.
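The contrast between OLTP-style small random reads and DS-style large sequential scans described above can be sketched directly. The file below is a tiny stand-in for a database; the 2K random-read size and 2MB sequential-chunk size come from the chunk sizes mentioned in the text, while the file size and read count are arbitrary.

```python
import os
import random
import tempfile

random.seed(1)
MB = 1 << 20
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(8 * MB))   # 8MB stand-in for a "database" file
os.close(fd)

# OLTP-style: many small reads (2K) at random offsets, as a POS terminal
# looking up individual records might generate.
oltp_bytes = 0
with open(path, "rb") as f:
    for _ in range(100):
        f.seek(random.randrange(0, 8 * MB - 2048))
        oltp_bytes += len(f.read(2048))

# DS-style: scan the whole file in large sequential chunks (2MB), as a
# full-table scan for a summary report would.
ds_bytes = 0
with open(path, "rb") as f:
    while chunk := f.read(2 * MB):
        ds_bytes += len(chunk)

os.unlink(path)
print(oltp_bytes, ds_bytes)
```

The OLTP loop touches only a tiny fraction of the data but pays a seek for every read, while the DS loop reads everything with a handful of large requests; this is why the two workload styles stress storage subsystems so differently and why separate benchmarks (TPC-C versus TPC-H, DBT-2 versus DBT-3) exist to model them.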