A benchmark measures the performance of a standard workload. By holding the workload fixed, you can vary the underlying system parameters to generate relative performance
In general, there are three types of benchmarks:
This book is not about tuning systems to produce world-record benchmark results. In general, producing a world record requires an immense amount of work, strong support from the
Before discussing benchmarks, we need to cover a little terminology and history: the MIPS and megaflops metrics.
One of the first ways of estimating the performance of an application on a computer system was to look at how fast the system could execute instructions. However, this is a dismal metric. The execution rate of a processor (typically
To make things
The key point to take away from this is that MIPS is useful only for comparing processors from the same vendor that support exactly the same instruction set and use the same compilers. The fundamental weaknesses in MIPS as a performance metric have often led people to joke that it's really short for "Meaningless Indicator of Processor Speed."
Floating-point performance is often measured in terms of millions of floating-point instructions per second, often called megaflops or MFLOPS. A floating-point instruction is commonly regarded as one that executes on a microprocessor's specialized floating-point hardware; this includes floating-point additions, multiplications, comparisons, and format conversions. Less commonly included are square roots and divisions. Typically, though, megaflop rates are computed from some mixture of additions and multiplications. While it is considered misleading to present MIPS values, vendors regularly quote peak MFLOPS values. The peak MFLOPS for a processor is usually close to a small integer multiple (1, 2, or 4) of the clock rate. This is because optimizations in microprocessor hardware perform fast multiply-add operations (such as for
The first set of benchmarks I discuss is designed to stress specific areas of the system, typically memory performance, microprocessor performance, or both. I discuss two of the most common: the Linpack benchmark and the SPECcpu suite.
One of the first commonly accepted benchmarks is the Linpack benchmark, written by Jack Dongarra, then of Argonne National Laboratory. Interestingly, the goal of Linpack was not to develop a benchmark; rather, it was intended to provide a high-performance library of routines for linear algebra problems. The Linpack benchmark derives megaflop rates by measuring the time and number of operations required by a few of these routines to solve a dense system of linear equations using Gaussian
In addition to the 100 x 100 benchmark, there is also a 1000 x 1000 benchmark, in which the system of equations can be
The heart of the 100 x 100 Linpack benchmark is a routine called daxpy, which
[7] Daxpy is actually written in FORTRAN; it is represented here in C.
for (i = 0; i < N; i++) {
    dy[i] = dy[i] + da * dx[i];
}
Each iteration of this loop performs two floating-point operations (a multiplication and an addition) and three memory operations (two loads and a store). It turns out that the Linpack benchmark stressed both the floating-point and memory performance of computer systems up through the early 1990s. When microprocessor cache sizes increased to a size where they could hold the entire benchmark, the Linpack 100 x 100 ceased to be a useful benchmark -- the data structures for the entire benchmark are only 320 KB. The first system that was widely credited with "breaking" the Linpack 100 x 100 was IBM's RS/6000 series. Unfortunately, optimization techniques prevent the Linpack 100 x 100 benchmark from being scaled up to the point where it stresses the memory subsystem again.
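As a rough illustration of how a megaflop rate falls out of this loop, the sketch below wraps it in a function that also reports its operation count (two per element, as described above). The function signature and the `mflops` helper are assumptions made for this example and are not the interface of the actual Linpack routine; timing the call is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of daxpy: dy = dy + da * dx over n elements.
 * Returns the number of floating-point operations performed
 * (one multiplication and one addition per element).
 * The signature is an assumption for illustration only. */
double daxpy(size_t n, double da, const double *dx, double *dy)
{
    assert(dx != NULL && dy != NULL);
    for (size_t i = 0; i < n; i++) {
        dy[i] = dy[i] + da * dx[i];
    }
    return 2.0 * (double)n;
}

/* Hypothetical helper: convert an operation count and an elapsed
 * time in seconds into millions of operations per second. */
double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1.0e6;
}
```

A caller would record the time before and after one or more calls to `daxpy`, sum the returned operation counts, and pass both to `mflops`. On vectors small enough to sit in cache, the result says little about the memory subsystem, which is the weakness the text describes.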
So, the Linpack 100 x 100 went from being the most
In the late 1980s, a
[8] They are now the Standard Performance Evaluation Corporation.
The first SPEC benchmark was SPEC89, which consisted of the geometric mean of the
To improve the quality of the benchmark, SPEC revised it and released the successor as SPEC92. SPEC92 consisted of 20 test programs, broken up as 6 integer and 14 floating-point codes. The benchmark also
The applications under test in the SPECcpu2000 benchmark are quite representative of typical user codes, and are summarized in Table 2-3.
| Class | Code | Language | Description |
|---|---|---|---|
| Integer | 164.gzip | C | Data compression |
| | 175.vpr | C | FPGA circuit placement and routing |
| | 176.gcc | C | C compiler |
| | 181.mcf | C | Minimum-cost network flow solver |
| | 186.crafty | C | Chess |
| | 197.parser | C | Natural language parser |
| | 252.eon | C++ | Ray tracer |
| | 253.perlbmk | C | Perl |
| | 254.gap | C | Computational group theory |
| | 255.vortex | C | Object-oriented database |
| | 256.bzip2 | C | Data compression |
| | 300.twolf | C | Circuit placement/routing simulator |
| Floating-point | 168.wupwise | F77 | Physics (quantum chromodynamics) |
| | 171.swim | F77 | Shallow water modeling |
| | 172.mgrid | F77 | Multigrid solver: 3D potential field |
| | 173.applu | F77 | Partial differential equation solver |
| | 177.mesa | C | 3D graphics |
| | 178.galgel | F90 | Computational fluid dynamics |
| | 179.art | C | Image recognition |
| | 183.equake | C | Seismic wave propagation simulation |
| | 187.facerec | F90 | Face recognition image processing |
| | 188.ammp | C | Computational |
| | 189. | C | Prime number testing |
| | 191.fma3d | F90 | Finite-element crash simulation |
| | 200.sixtrack | F77 | Physics (high-energy accelerator design) |
| | 301.apsi | F77 | Meteorology (pollutant distribution) |
The SPEC benchmark reports three numbers for both integer and floating-point performance: the pure metric, the base metric, and the rate metric.
The base metric (e.g., SPECint_base95) measures the performance under a
The rate metric (e.g., SPECfp_rate95) measures the throughput of an entire, possibly multiprocessor system. The vendor can run as many concurrent copies of the benchmark as necessary; the rate metric is the aggregate across the entire system.
Of course, not all applications are
The Transaction Processing Performance Council, or TPC (http://www.tpc.org), is an industry-based organization representing computer system and database vendors. TPC provides a framework for measuring the relative performance of these systems. Performance is measured by the amount of time it takes the computer to complete transactions under various loads. The TPC results are extremely well regarded in the database field. Like SPEC, TPC has had to modify and scale its benchmarks over time, as user needs and technological capabilities have changed. Consequently, TPC-A and TPC-B have
There are seven benchmarks in the TPC series:
TPC-A was issued in November 1989, and is designed to measure performance in update-
TPC-B was issued in August of 1990. Unlike TPC-A, TPC-B is not an OLTP benchmark. Rather, TPC-B is a database stress test. It tends to
TPC-C was approved in July 1992. TPC-C is an OLTP benchmark like TPC-A, but is significantly more complex; it implements multiple transaction types and a more complex database. TPC-C simulates a complete computing environment, similar to that of a wholesale supplier, where a population of users executes transactions against a database. The benchmark is based around the principal transactions of an
TPC-D was introduced in December of 1995, and is designed to measure decision-support performance. The test
TPC-H is a decision-support benchmark that consists of a suite of business-oriented ad hoc queries and concurrent data modifications, much like TPC-D. This benchmark was designed to measure the performance of decision-support systems that examine large amounts of data, execute very complex queries, and give answers to critical business questions.
TPC-R is another decision-support benchmark. It is very similar to TPC-H, except that it allows additional optimizations based on advance knowledge of the queries.
TPC-W is a benchmark designed to measure transactional web e-commerce performance. The workload simulates an Internet commerce environment driven by a transactional web server. There are multiple concurrent browser sessions, dynamic page generation with associated database
As the Web
By the end of the 1990s, there was an increasing trend toward dynamic content -- the precise contents of a page depended on form data or were in some other way personalized to the end user. Driven by this change, SPEC released a new revision of the SPECweb benchmarks, known as SPECweb99.
SPECweb99 measures the number of concurrent conforming connections per second; to conform, a connection needs to sustain between 320 and 400 Kbps. The transaction mix includes both static and dynamic content; the default mix is shown in Table 2-4.
| Request type | Percentage |
|---|---|
| Static GET | 70.00% |
| Dynamic GET (plain) | 12.45% |
| Dynamic GET (custom advertisement) | 12.60% |
| Dynamic POST | 4.80% |
| Dynamic POST calling CGI code | 0.15% |
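One way a load generator could honor a mix like the one in Table 2-4 is to walk the cumulative percentages with a random draw. The sketch below is only an illustration under that assumption; the enum names and the `pick_request_type` function are invented for this example and are not taken from the SPECweb99 client code. Percentages are stored in hundredths of a percent so that the arithmetic stays in integers.

```c
#include <assert.h>

/* Request types from Table 2-4 (names are hypothetical). */
enum request_type {
    STATIC_GET,
    DYNAMIC_GET_PLAIN,
    DYNAMIC_GET_CUSTOM_AD,
    DYNAMIC_POST,
    DYNAMIC_POST_CGI,
    REQUEST_TYPE_COUNT
};

/* Default mix in hundredths of a percent; the entries sum to 10000. */
static const unsigned request_mix[REQUEST_TYPE_COUNT] = {
    7000, 1245, 1260, 480, 15
};

/* Map a draw in [0, 10000) to a request type by walking
 * the cumulative distribution. */
enum request_type pick_request_type(unsigned draw)
{
    assert(draw < 10000u);
    unsigned cumulative = 0;
    for (int t = 0; t < REQUEST_TYPE_COUNT; t++) {
        cumulative += request_mix[t];
        if (draw < cumulative) {
            return (enum request_type)t;
        }
    }
    return DYNAMIC_POST_CGI; /* not reached when draw < 10000 */
}
```

The caller supplies the draw, for example from a pseudorandom generator reduced to the range 0 to 9999. Keeping the draw outside the function makes the mapping easy to check at the boundaries between request types.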
The static portion of the workload models a hypothetical web content provider. The pages of content are represented by files of different sizes that are accessed with varying frequency; SPEC obtained the precise values of these parameters by inspecting the log files of a handful of busy web sites. The working set of content files fit into four classes, as
| Class | Size range | Size step | Percentage of hits |
|---|---|---|---|
| 0 | < 1 KB | 0.1 KB | 35% |
| 1 | < 10 KB | 1 KB | 50% |
| 2 | < 100 KB | 10 KB | 14% |
| 3 | < 1000 KB | 100 KB | 1% |
The workload file set consists of a number of directories, each of which contains 9 files per class, for a total of 36 files per directory. The files in Class 0 are in increments of 0.1 KB, those in Class 1 are in
During a benchmark test, the class of file to be retrieved from a random directory is determined by a fixed distribution (illustrated in Table 2-5). A Zipf distribution is used to choose which file should be retrieved from the selected class. [9] The relative hit rates for each file in a class are shown in Table 2-6.
[9] The probability of selecting the nth item in a Zipf distribution is proportional to 1/n. Zipf distributions are usually associated with situations where there are many equal-cost alternatives, such as picking books at a public library.
| File number | Percentage |
|---|---|
| 0 | 3.9% |
| 1 | 5.9% |
| 2 | 8.8% |
| 3 | 17.7% |
| 4 | 35.3% |
| 5 | 11.8% |
| 6 | 7.1% |
| 7 | 5.0% |
| 8 | 4.4% |
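The 1/n rule from the footnote can be turned into probabilities by dividing each weight by the sum of all the weights. The sketch below does this for an arbitrary number of items; the function name `zipf_probabilities` is an assumption for this example. It works in terms of rank (most popular first) and does not attempt to reproduce how SPECweb99 assigns ranks to file numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Fill prob[0..n-1] with Zipf probabilities by rank:
 * prob[k] is the chance of picking the item of rank k+1,
 * proportional to 1/(k+1) and normalized to sum to 1.
 * The function name is hypothetical. */
void zipf_probabilities(size_t n, double *prob)
{
    assert(n > 0 && prob != NULL);

    /* Normalizing constant: 1 + 1/2 + ... + 1/n. */
    double harmonic = 0.0;
    for (size_t k = 1; k <= n; k++) {
        harmonic += 1.0 / (double)k;
    }

    for (size_t k = 1; k <= n; k++) {
        prob[k - 1] = (1.0 / (double)k) / harmonic;
    }
}
```

With nine items, the top-ranked item receives a little over 35% of the hits and the second-ranked item a little under 18%, which lines up with the two largest entries in Table 2-6.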
SPECweb99 is an emerging benchmark, and vendors are starting to play a lot of tricks to improve performance that you might not want to play on a production system. If you're trying to use SPECweb99 to evaluate systems for purchase,
[10] Note that the way you tune a system when you're going for a world record can be quite different from the way you'd tune a system that you had to use every day in production. Vendors take world records very seriously.
The most meaningful benchmarks are the ones you develop that are relevant to your environment. In general, there are a few principles to follow in setting up a good benchmark: it should be relevant to your typical problems, have a relatively short runtime, and be heavily automated, and you should set rules for running it.
In general, pick problems that represent the spectrum of work you actually do. If, for example, you are buying a computer for genetic sequence analysis, you might want to pick a few representative sequences, each with a separate computational code. It's okay if most of your work uses one particular application -- just weight the performance of that application higher. You should also design the benchmark so that it tests not only the work that you are doing now, but the work that you plan on doing over the lifetime of the machine (say, three to five years). Try to benchmark the big-memory, high-computation problems, or a reasonable subset thereof, that you'll be doing five years from now.
Of course, it's hard to predict runtime. One of the best guidelines for this is a well respected,
You, or a vendor trying to get you to give them money, will be running this benchmark frequently. As a consequence, you will want things to be as automated as possible. The best way to do this is to write a wrapper (often a shell script) that will do everything from compiling the benchmark, to running it, to printing a summary of the results, along with a snapshot of the most important elements of the system configuration. These elements might include things like the number of configured processors, the amount of memory installed, or the compilation flags used.
It is important to set rules, especially if
We've only touched on the highlights of benchmark design. I strongly recommend that you refer to Kevin Dowd and Charles Severance's High Performance Computing (O'Reilly) for a much more in-depth discussion of this topic.