2.3 Benchmarking

A benchmark measures the performance of a standard workload. By holding the workload fixed, you can vary the underlying system parameters to generate relative performance numbers.

In general, there are three types of benchmarks: component-specific benchmarks, like SPECint and SPECfp for microprocessors; whole-system benchmarks designed to emulate commercial environments, like the TPC series and SPECweb99; and user-developed benchmarks. In this section, I touch on these different approaches to benchmarking, providing examples of real-world benchmarks throughout.

This book is not about tuning systems to produce world-record benchmark results. In general, producing a world record requires an immense amount of work, strong support from the engineers responsible for the underlying subsystems, and an understanding that the end result is a world record (which may not necessarily be directly applicable to a customer environment).

2.3.1 MIPS and Megaflops

Before we get started in a discussion of benchmarks, we need to discuss a little bit of terminology and history: the MIPS and megaflops metrics.

2.3.1.1 MIPS

One of the first ways of estimating the performance of an application on a computer system was to look at how fast the system could execute instructions. However, this is a dismal metric. The execution rate of a processor (typically expressed in millions of instructions per second, or MIPS) is strongly correlated to the clock rate of the processor. It's probably safe to assume that a 100-MIPS CPU will outperform a 5-MIPS CPU, but a 50-MIPS processor might well outperform a 100-MIPS CPU. Using MIPS to compare the performance of different computers is flawed in a basic way: a million instructions on one microprocessor might not accomplish the same work as the same number of instructions on another. For example, say you have a processor that does all floating-point work in software and takes 50 integer instructions to perform 1 floating-point operation; your 50-MIPS CPU now performs only 1 million useful floating-point operations per second. Coupled with this is the fact that different instructions can take different lengths of time to complete, especially on CISC processors. Were those million instructions "do-nothing" operations, or floating-point multiplies?
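
To make that arithmetic explicit, here is a trivial sketch in C using the hypothetical numbers from the text (a 50-MIPS processor that spends 50 integer instructions per software-emulated floating-point operation); the variable names are mine, chosen only for illustration.

    #include <stdio.h>

    int main(void)
    {
        double mips           = 50.0;  /* raw execution rate: 50 million instructions/sec */
        double insns_per_flop = 50.0;  /* hypothetical cost of one emulated FP operation  */

        /* Useful work: 50 MIPS / 50 instructions per operation = 1 million FP ops/sec. */
        double useful_mflops = mips / insns_per_flop;

        printf("Useful floating-point rate: %.1f million operations/sec\n", useful_mflops);
        return 0;
    }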

To make things worse, some vendors have redefined the meaning of "MIP" as their own internal metric. DEC was particularly famous for this; for a long time, the VAX 11/780 was considered the standard "1 MIP" machine. DEC publicly stated that, for a reasonable job mix, the 11/780 was capable of executing about 470,000 instructions per second -- less than half a MIP. For this reason, you sometimes hear people refer to "VAX MIPS" or "VUPs" (VAX units of processing) to make the comparison clear.

The key point to take away from this is that MIPS is a useful metric only for comparing processors from the same vendor that implement the exact same instruction set and use the same compilers. The fundamental weaknesses in MIPS as a performance metric have often led people to joke that it's really short for "Meaningless Indicator of Processor Speed."

2.3.1.2 Megaflops

Floating-point performance is often measured in terms of millions of floating-point operations per second, often called megaflops or MFLOPS. A floating-point operation is commonly regarded as one that executes on a microprocessor's specialized floating-point hardware; this includes floating-point additions, multiplications, comparisons, and format conversions. Less commonly included are square roots and divisions. Typically, though, megaflop rates are computed from some mixture of additions and multiplications. While it is considered misleading to present MIPS values, vendors regularly quote peak MFLOPS values. The peak MFLOPS for a processor is usually a small integer multiple (1, 2, or 4) of the clock rate. This is because microprocessor hardware includes optimizations that perform fast multiply-add operations (such as accumulating a running sum of products, as in the dot product of two vectors). As microprocessors become faster and faster, the peak MFLOPS value becomes less useful as a reasonable upper bound on application floating-point performance: memory bandwidth (how fast data can be moved in and out of the processor) becomes the limiting factor.
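
The multiply-add pattern is easiest to see in a dot product of two vectors. The sketch below is my own example, not a benchmark kernel; it shows how flops are typically tallied: two per loop iteration, which hardware with a fused multiply-add unit can, in principle, complete in a single cycle.

    /* Dot product of two n-element vectors. Each iteration performs one
     * multiplication and one addition, so the loop does 2*n floating-point
     * operations; a fused multiply-add unit can issue both together, which is
     * where peak MFLOPS figures of 1x, 2x, or 4x the clock rate come from. */
    double dot(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];   /* one multiply-add per iteration */
        return sum;
    }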

2.3.2 Component-Specific Benchmarks

The first set of benchmarks I discuss is designed to stress specific areas of the system, typically memory performance and/or microprocessor performance. I discuss two of the most common: the Linpack benchmark and the SPECcpu suite.

2.3.2.1 Linpack

One of the first commonly accepted benchmarks is the Linpack benchmark, written by Jack Dongarra, then of Argonne National Laboratory. Interestingly, the goal of Linpack was not to develop a benchmark; rather, it was intended to provide a high-performance library of routines for linear algebra problems. The Linpack benchmark derives megaflop rates by measuring the time and number of operations required by a few of these routines to solve a dense system of linear equations using Gaussian elimination. There are several versions of Linpack, differentiated by the size of the system of linear equations they work with, the numerical precision required, and the ground rules for the benchmark. The most commonly quoted version uses a 100 x 100 matrix, double-precision arithmetic, and strictly compiled FORTRAN (no hand optimizations permitted).

In addition to the 100 x 100 benchmark, there is also a 1000 x 1000 benchmark, in which the system of equations can be solved via any method of the vendor's choice. The larger Linpack is an important benchmark because it tends to give a reasonable upper bound for the floating-point performance of very large, highly optimized scientific problems on a given system.

The heart of the 100 x 100 Linpack benchmark is a routine called daxpy, which scales a vector by a constant and adds it to another vector: [7]

[7] Daxpy is actually written in FORTRAN; it is represented here in C.

    for (i = 0; i < N; i++) {
        dy[i] = dy[i] + da * dx[i];
    }

Each iteration of this loop performs two floating-point operations (a multiplication and an addition) and three memory operations (two loads and a store). It turns out that the Linpack benchmark stressed both the floating-point and memory performance of computer systems up through the early 1990s. When microprocessor caches grew large enough to hold the entire benchmark, the Linpack 100 x 100 ceased to be a useful benchmark -- the data structures for the entire benchmark are only 320 KB. The first system widely credited with "breaking" the Linpack 100 x 100 was IBM's RS/6000 series. Unfortunately, optimization techniques prevent the Linpack 100 x 100 benchmark from being scaled up to the point where it stresses the memory subsystem again.
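
As an illustration of how a megaflop rate is derived from this loop, here is a minimal, self-contained timing sketch in C. It is my own harness, not the actual Linpack driver; the vector length and repetition count are arbitrary, and it simply counts two floating-point operations per iteration, as described above.

    #include <stdio.h>
    #include <time.h>

    #define N    100000   /* vector length (arbitrary, not the Linpack size)   */
    #define REPS 10000    /* repeat the loop to amortize timer resolution      */

    static double dx[N], dy[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { dx[i] = 1.0; dy[i] = 2.0; }

        double da = 3.0;
        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                dy[i] = dy[i] + da * dx[i];   /* two flops per iteration */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* 2 * N * REPS floating-point operations were performed in 'secs' seconds. */
        printf("%.1f MFLOPS (check value %.1f)\n",
               (2.0 * N * REPS) / (secs * 1e6), dy[0]);
        return 0;
    }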

So, the Linpack 100 x 100 went from being the most prevalent single-number benchmark in history to being relegated to a historical footnote in about two years, as vendors scaled their caches to fit the entire benchmark.

2.3.2.2 SPECint and SPECfp

In the late 1980s, a group then called the Systems Performance Evaluation Cooperative, or SPEC, formed to establish, maintain, and endorse a standardized set of benchmarks. [8] The first SPEC microprocessor benchmark was called SPEC89, and it was revised in 1992, 1995, and 2000. By releasing new versions, the SPEC organization keeps current with the latest trends in hardware and compiler design, preventing vendors from becoming too good at tuning their systems to the benchmarks. While the most commonly quoted SPEC benchmarks concern CPU performance, SPEC publishes many other standard benchmarks. For more information on SPEC, consult http://www.spec.org.

[8] They are now the Standard Performance Evaluation Corporation.

The first SPEC benchmark was SPEC89, which consisted of the geometric mean of the runtimes for 10 programs. It did not differentiate between floating-point and integer performance, which had the side effect of cloaking performance subtleties. For example, a system that did very poorly on integer performance but very well in floating-point could have the same SPECmark score as a system that performed in exactly the opposite way. Interestingly, when SPEC89 was released, it was not a particularly important benchmark; the different competing architectures of the time had very different performance characteristics, and no single benchmark could effectively relate all of their performances. In the field of high-performance computing, the Linpack 100 x 100 code was completely dominant.
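
A SPECmark-style composite is just a geometric mean of per-program results. The sketch below computes one from a set of performance ratios (reference time divided by measured time, one common convention); the function name and the ratios are made up for illustration and are not SPEC's code or data.

    #include <stdio.h>
    #include <math.h>

    /* Geometric mean of k performance ratios. Because all k results are folded
     * into a single figure, a machine that is weak on some codes and strong on
     * others can end up with the same composite as one with the opposite
     * profile -- the masking effect described in the text. */
    double spec_composite(const double *ratios, int k)
    {
        double log_sum = 0.0;
        for (int i = 0; i < k; i++)
            log_sum += log(ratios[i]);
        return exp(log_sum / k);
    }

    int main(void)
    {
        double ratios[4] = { 8.0, 9.0, 1.5, 12.0 };   /* made-up per-program ratios */
        printf("composite = %.2f\n", spec_composite(ratios, 4));
        return 0;
    }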

To improve the quality of the benchmark, SPEC revised it and released the successor as SPEC92. SPEC92 consisted of 20 test programs, broken up as 6 integer and 14 floating-point codes. The benchmark reported two summary figures, SPECint92 and SPECfp92, which measure integer and floating-point performance, respectively. As the Linpack 100 x 100 benchmark started to fall, SPECfp92 became the battleground metric for floating-point performance. Around 1994, large caches and smart compilers broke the SPECfp92 benchmark. Thankfully, in 1995, SPEC again released a revision to the benchmark suite: it increased the size of the benchmarks so that they wouldn't fit into caches, greatly lengthened the runtimes, and made the codes more like large user applications. A similar process occurred in 2000, when SPEC again rescaled the benchmark.

The applications under test in the SPECcpu2000 benchmark are quite representative of typical user codes, and are summarized in Table 2-3.

Table 2-3. Codes under test in SPECcpu2000

    Class           Code           Language   Description
    Integer         164.gzip       C          Data compression
                    175.vpr        C          FPGA circuit placement and routing
                    176.gcc        C          C compiler
                    181.mcf        C          Minimum-cost network flow solver
                    186.crafty     C          Chess
                    197.parser     C          Natural language parser
                    252.eon        C++        Ray tracer
                    253.perlbmk    C          Perl
                    254.gap        C          Computational group theory
                    255.vortex     C          Object-oriented database
                    256.bzip2      C          Data compression
                    300.twolf      C          Circuit placement/routing simulator
    Floating-point  168.wupwise    F77        Physics (quantum chromodynamics)
                    171.swim       F77        Shallow water modeling
                    172.mgrid      F77        Multigrid solver: 3D potential field
                    173.applu      F77        Partial differential equation solver
                    177.mesa       C          3D graphics
                    178.galgel     F90        Computational fluid dynamics
                    179.art        C          Image recognition
                    183.equake     C          Seismic wave propagation simulation
                    187.facerec    F90        Face recognition image processing
                    188.ammp       C          Computational chemistry
                    189.lucas      C          Prime number testing
                    191.fma3d      F90        Finite-element crash simulation
                    200.sixtrack   F77        Physics (high-energy accelerator design)
                    301.apsi       F77        Meteorology (pollutant distribution)

The SPEC benchmark reports three numbers for both integer and floating-point performance: the pure metric, the base metric, and the rate metric.

  • The pure metric (e.g., SPECint95) is the peak figure: the vendor may tune compiler flags individually for each application to obtain the best possible result.

  • The base metric (e.g., SPECint_base95) measures performance under a more restrictive set of rules: applications may use at most four compiler flags, the flags must be identical for all applications, etc. The idea is that the base metric more accurately represents the typically observed performance of user applications.

  • The rate metric (e.g., SPECfp_rate95) measures the throughput of an entire, possibly multiprocessor system. The vendor can run as many concurrent copies of the benchmark as necessary; the rate metric is the aggregate throughput across the entire system.

2.3.3 Commercial Workload Benchmarks

Of course, not all applications are computationally bound; in fact, most business applications stress almost every part of the system. In recognition of this fact, benchmarks have been developed to measure more typical user environments. Here, I discuss two of the most common of these commercial workload benchmarks: the TPC series, which focuses on database performance, and the SPECweb benchmarks for web servers.

2.3.3.1 TPC

The Transaction Processing Performance Council, or TPC (http://www.tpc.org), is an industry-based organization representing computer system and database vendors. TPC provides a framework for measuring the relative performance of these systems. Performance is measured by the amount of time it takes the computer to complete transactions under various loads. The TPC results are extremely well regarded in the database field. Like SPEC, TPC has had to modify and scale its benchmarks over time, as user needs and technological capabilities have changed. Consequently, TPC-A and TPC-B have fallen out of favor; TPC-C is the most common, TPC-H and TPC-R have succeeded TPC-D for decision support, and TPC-W is gaining favor as an e-commerce benchmark.

There are seven benchmarks in the TPC series:

  • TPC-A was issued in November 1989, and is designed to measure performance in update-intensive, terminal-driven database environments, such as those seen in online transaction processing (OLTP) scenarios. These environments typically have multiple terminal sessions, significant disk I/O, moderate system and application execution time, and transaction integrity requirements. TPC-A is obsolete.

  • TPC-B was issued in August of 1990. Unlike TPC-A, TPC-B is not an OLTP benchmark. Rather, TPC-B is a database stress test. It tends to emphasize disk I/O without having any remote terminal sessions. TPC-B is obsolete.

  • TPC-C was approved in July 1992. TPC-C is an OLTP benchmark like TPC-A, but is significantly more complex; it implements multiple transaction types and a more complex database. TPC-C simulates a complete computing environment, similar to that of a wholesale supplier, where a population of users executes transactions against a database. The benchmark is based around the principal transactions of an order-entry environment: entering and delivering orders, recording payments, checking order status, and monitoring the level of stock on hand.

  • TPC-D was introduced in December of 1995, and is designed to measure decision-support performance. The test essentially consists of a database that is being slowly modified online while other processes are performing large queries, joins, and scans. TPC-D is obsolete.

  • TPC-H is a decision-support benchmark that consists of a suite of business-oriented ad hoc queries and concurrent data modifications, much like TPC-D. This benchmark was designed to measure the performance of decision-support systems that examine large amounts of data, execute very complex queries, and give answers to critical business questions.

  • TPC-R is another decision-support benchmark. It is very similar to TPC-H, except that it allows additional optimizations based on advance knowledge of the queries.

  • TPC-W is a benchmark designed to measure transactional web e-commerce performance. The workload simulates an Internet commerce environment driven by a transactional web server. There are multiple concurrent browser sessions, dynamic page generation with associated database accesses and updates, consistent web objects, online transaction execution modes, transaction integrity, and contention on data access and update.

2.3.3.2 SPECweb99

As the Web began to explode in popularity in the mid-1990s, SPEC released a benchmark for web performance, called SPECweb96. SPECweb96 was focused entirely on the speed at which static pages could be retrieved by a set of clients. SPECweb96 was eventually broken by a software technique pioneered by Sun Microsystems, called the Network Cache Accelerator, or NCA. NCA uses a kernel module to transparently cache static web content in a kernel memory buffer, and replies to HTTP requests for documents in its cache without ever waking up the application web server.

By the end of the 1990s, there was an increasing trend toward dynamic content -- the precise content of a page depended on form data or was in some other way personalized for the end user. Driven by this change, SPEC released a new revision of the SPECweb benchmark, known as SPECweb99.

SPECweb99 measures the number of simultaneous conforming connections a system can support; in order for a connection to conform, it must sustain a bit rate between 320 and 400 Kbps. The transaction mix includes both static and dynamic content; the default mix is shown in Table 2-4.

Table 2-4. SPECweb99 transaction mixture

    Request type                          Percentage
    Static GET                            70.00%
    Dynamic GET (plain)                   12.45%
    Dynamic GET (custom advertisement)    12.60%
    Dynamic POST                           4.80%
    Dynamic POST calling CGI code          0.15%

The static portion of the workload models a hypothetical web content provider. The pages of content are represented by files of different sizes that are accessed with varying frequency; SPEC obtained the precise values of these parameters by inspecting the log files of a handful of busy web sites. The working set of content files fits into four classes, as illustrated in Table 2-5.

Table 2-5. Static content classes

    Class    Size range    Size step    Percentage of hits
    0        < 1 KB        0.1 KB       35%
    1        < 10 KB       1 KB         50%
    2        < 100 KB      10 KB        14%
    3        < 1000 KB     100 KB        1%

The workload file set consists of a number of directories, each of which contains 9 files per class, for a total of 36 files per directory. The files in Class 0 are in increments of 0.1 KB, those in Class 1 are in increments of 1 KB, etc. For example, the files for Class 2 will be 10 KB, 20 KB, 30 KB, 40 KB, 50 KB, 60 KB, 70 KB, 80 KB, and 90 KB. Since larger web servers are expected to service more files, the exact size of the workload file set (the number of directories to be created) is determined by the number of simultaneous connections.
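
To make the directory layout concrete, here is a small C sketch (my own illustration, not SPEC's file-set generator) that prints the 36 file sizes in a single workload directory, nine per class, using the size steps from Table 2-5.

    #include <stdio.h>

    /* Print the 36 file sizes in one SPECweb99 workload directory: 9 files in
     * each of the 4 classes, where Class 0 steps by 0.1 KB, Class 1 by 1 KB,
     * Class 2 by 10 KB, and Class 3 by 100 KB (so Class 2 holds 10 KB .. 90 KB). */
    int main(void)
    {
        const double step_kb[4] = { 0.1, 1.0, 10.0, 100.0 };

        for (int cls = 0; cls < 4; cls++)
            for (int file = 0; file < 9; file++)
                printf("class %d, file %d: %6.1f KB\n",
                       cls, file, (file + 1) * step_kb[cls]);
        return 0;
    }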

During a benchmark test, the class of file to be retrieved from a random directory is determined by a fixed distribution (the "Percentage of hits" column in Table 2-5). A Zipf distribution is used to choose which file should be retrieved from the selected class. [9] The relative hit rates for each file in a class are shown in Table 2-6.

[9] The probability of selecting the nth item in a Zipf distribution is proportional to 1/n. Zipf distributions are usually associated with situations where there are many equal-cost alternatives, such as picking books at a public library.
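
As an illustration of the footnote's definition, the following sketch (my own, not SPEC code) computes the normalized Zipf weights for the nine files in a class. Sorting Table 2-6 by percentage shows the same set of values, just assigned to particular file numbers.

    #include <stdio.h>

    /* Zipf weights over the 9 files in a class: the weight of the item with
     * rank n (n = 1..9) is proportional to 1/n. Normalizing by the harmonic
     * sum H(9) ~= 2.829 yields 35.3%, 17.7%, 11.8%, ... -- the values that
     * appear, in a different order, in Table 2-6. */
    int main(void)
    {
        const int items = 9;
        double norm = 0.0;

        for (int n = 1; n <= items; n++)
            norm += 1.0 / n;

        for (int n = 1; n <= items; n++)
            printf("rank %d: %4.1f%%\n", n, 100.0 * (1.0 / n) / norm);
        return 0;
    }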

Table 2-6. Relative hit rates of files in a given class

    File number    Percentage
    0               3.9%
    1               5.9%
    2               8.8%
    3              17.7%
    4              35.3%
    5              11.8%
    6               7.1%
    7               5.0%
    8               4.4%

SPECweb99 is an emerging benchmark, and vendors are starting to play a lot of performance-boosting tricks that you might not want to play on a production system. If you're trying to use SPECweb99 results to evaluate systems for purchase, take them with a grain of salt, and read the reported configuration very carefully; vendors are required to report exactly what hardware was used and which parameters were tuned. [10]

[10] Note that the way you tune a system when you're going for a world record can be quite different from the way you'd tune a system that you had to use every day in production. Vendors take world records very seriously.

2.3.4 User Benchmarks

The most meaningful benchmarks are the ones you develop yourself, because they are relevant to your environment. In general, there are a few principles you should follow to set up a good benchmark: it should be relevant to your typical problems, have a relatively short runtime, be heavily automated, and come with clear rules for how it is run.

2.3.4.1 Choose your problem set

In general, pick problems that represent the spectrum of work you actually do. If, for example, you are buying a computer for genetic sequence analysis, you might want to pick a few representative sequences, each with a separate computational code. It's okay if most of your work uses one particular application -- just weight the performance of that application higher. You should also design the benchmark so that it tests not only the work you are doing now, but the work you plan on doing over the lifetime of the machine (say, three to five years). Try to benchmark the big-memory, computation-heavy problems, or a reasonable subset thereof, that you'll be doing five years from now.

2.3.4.2 Choose your runtime

Of course, it's hard to predict runtime. One of the best guidelines for this is a well-respected, industry-standard benchmark like SPECfp2000. If your current system is rated at 100 SPECfp2000, a system that's rated at 1000 SPECfp2000 should go ten times as fast (work with me, here). A benchmark that took 20 minutes to run on your current system would be reduced to a trivial 2-minute run on the new system. As a rule of thumb, each "problem" should run for 10 to 20 minutes on the new hardware under consideration. It should take less than an hour or two to run the entire suite, if for no other reason than that you will have to sit down and run it many, many times.
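
The rule-of-thumb calculation above is easy to sketch: scale a measured runtime by the ratio of the two SPEC ratings. The numbers below are the hypothetical ones from the text, and the estimate is only as good as the assumption that your code scales the way the benchmark does.

    #include <stdio.h>

    int main(void)
    {
        double current_rating = 100.0;   /* SPECfp2000 of the current system (hypothetical)   */
        double new_rating     = 1000.0;  /* SPECfp2000 of the candidate system (hypothetical)  */
        double current_min    = 20.0;    /* measured runtime of one problem, in minutes        */

        /* First-order estimate: runtime scales inversely with the SPEC rating. */
        double estimated_min = current_min * (current_rating / new_rating);

        printf("Estimated runtime on the new system: %.1f minutes\n", estimated_min);
        return 0;
    }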

2.3.4.3 Automate heavily

You, or a vendor trying to get you to give them money, will be running this benchmark frequently. As a consequence, you will want things to be as automated as possible. The best way to do this is to write a wrapper (often a shell script) that will do everything from compiling the benchmark, to running it, to printing a summary of the results, along with a snapshot of the most important elements of the system configuration. These elements might include things like the number of configured processors, the amount of memory installed, or the compilation flags used.

2.3.4.4 Set benchmark runtime rules

It is important to set rules, especially if anyone other than you will be running the benchmark. There are a lot of questions to keep in mind when writing benchmark rules: are modifications allowed? Are code changes allowed only for portability, or can algorithms be replaced? What types of compiler optimizations are allowed? How closely must the benchmark results match your baseline results? In fact, you probably need some rules about the rules, just to make sure you have a clear understanding with the vendor.

We've only touched on the highlights of benchmark design. I strongly recommend that you refer to Kevin Dowd and Charles Severance's High Performance Computing (O'Reilly) for a much more in-depth discussion of this topic.


