Server Performance Benchmarks

"IQ tests measure whatever IQ tests measure." That's the common wisdom in applied psychology, but it's just as true when applied to computer benchmarks.

A benchmark suite is designed to measure and compare the relative performance of computer systems. In some cases, a benchmark may be extremely specific: Back in the mid-1980s, as technical editor of Portable Computing, I measured the relative CPU performance of different DOS-based laptops with a test that launched WordPerfect for DOS, loaded a very large document, and timed how long the system took to replace every letter e with the letters xyz. Crude, yes, but that special-purpose benchmark did its job. The benchmarks I'm going to talk about here are more general, and quite a bit more sophisticated.
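If you wanted to reproduce that sort of quick-and-dirty test today, the recipe is the same: load a big file, time one well-defined operation, and repeat it a few times. The following Python sketch is purely illustrative; the file name and the e-to-xyz substitution are my assumptions, not any standard benchmark.

import time

def crude_replace_benchmark(path="large_document.txt", runs=3):
    """Time a bulk search-and-replace, in the spirit of the old WordPerfect test."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        _ = text.replace("e", "xyz")           # the operation being timed
        timings.append(time.perf_counter() - start)
    return min(timings)                         # best of n runs reduces noise

if __name__ == "__main__":
    print(f"Fastest run: {crude_replace_benchmark():.3f} seconds")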

What Benchmarks Do

A well-crafted benchmark is designed to meet one of two goals: to measure the performance of an entire system based on a well-defined set of tasks, or to attempt to isolate the performance characteristics of a particular subsystem. You can use the benchmark results to compare systems or, if you know the performance characteristics that you require for a certain project, you can use benchmarks to "size" the hardware and/or software appropriately.

In the laptop performance test example, I used a well-defined set of tasks (a certain word processor, a certain test document, and a certain word processor operation) to crudely test a single aspect of laptop performance. Had I been concerned with measuring the performance of word processing software, I would have tested many word processing packages on one laptop computer.

There are many better-designed benchmarks in general use. Many are administered by independent organizations, such as the Standard Performance Evaluation Corp. (SPEC) and the Transaction Processing Performance Council (TPC). Others are created and promoted by vendors, such as Intel's iComp Index, which measures relative performance of the company's various x86 processors.

Several magazines have also written benchmark suites, such as ZD Labs' WinBench suite (www.zdnet.com/zdbop/), whose results are cited by many Ziff-Davis magazines, including PC Magazine, PC Computing, and Computer Shopper. Another respected test is Khornerstone, which is used by our sister publication UNIX Review's Performance Computing (www.performancecomputing.com) for its Tested Mettle server reviews.

A challenge when designing benchmarks or deciding how to interpret them is knowing what you wish to benchmark. Of course, a benchmark measures performance, but which performance do you mean? Some tests measure a system's raw CPU throughput, attempting to minimize the impact on relatively slow I/O systems or network throughput; such tests would be appropriate for a CAD workstation, but not for a file/print server. Other tests stress a storage subsystem by moving large files across a SCSI bus, or by attempting to flood the network; neither test would be greatly affected by raw CPU performance.

The best bets are so-called real-world benchmarks, which attempt to measure how a system would respond in a realistic situation; my old laptop test tried to measure real-world results. No benchmark can truly be real-world, but if used judiciously, the proper benchmark may provide valuable information for comparing systems, as well as assessing what size system may be required to run a certain application.

Benchmark results may only be valid on the exact configurations tested, and only if the primary application you use that server for has characteristics (such as I/O-centric or CPU-intensive) similar to the benchmark software (see "Rules to Bench By"). Also, note that an application's performance characteristics may change as the workload increases; a transaction server might offer a flat response time when handling 100, 500, or 1,000 messages per hour, but it might choke at 2,500 messages per hour.
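To see why response time can look flat right up until it doesn't, a simple queueing estimate helps. The sketch below is my own illustration, not part of any benchmark: it uses the textbook M/M/1 approximation R = S / (1 - lambda*S), where S is the service time per message and lambda is the arrival rate, and assumes a server that tops out at 3,000 messages per hour. The numbers are invented; the shape of the curve is the point.

# Illustrative only: estimate response time as offered load rises.
SERVICE_TIME_HOURS = 1.0 / 3000            # assume the server can do 3,000 msgs/hour flat out

def est_response_time(msgs_per_hour, service=SERVICE_TIME_HOURS):
    utilization = msgs_per_hour * service
    if utilization >= 1.0:
        return float("inf")                 # past saturation the queue grows without bound
    return service / (1.0 - utilization)    # hours per message (M/M/1 estimate)

for load in (100, 500, 1000, 2500):
    r = est_response_time(load)
    print(f"{load:>5} msgs/hour -> ~{r * 3600:.2f} seconds per message")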

Rules To Bench By

When reading benchmark results, particularly tables or charts that compare various systems' performance, be certain you are comparing apples to apples.

It is imperative that you know which benchmark was run, and which version. Benchmark results are generally not comparable between tests or different test versions.

Know the exact system configuration used for the benchmark, including processor model and speed, the amount of L2 cache, the amount of RAM, and which peripherals were in use. In many cases, the operating system and OS version will be important.

Up To SPEC

Perhaps the most widely known benchmarks have historically come from SPEC (www.spec.org), a nonprofit corporation based in Manassas, VA. SPEC is an umbrella organization that covers three groups, each with its own benchmarks: the Open Systems Group (OSG), which produces server and processor benchmarks for Unix and Windows NT; the Graphics Performance Characterization (GPC) group, which tests the graphics performance of OpenGL and X Windows systems; and the High-Performance Group (HPG), which tests numeric computing systems such as engineering workstations.

A number of system tests are designed by SPEC OSG and are licensed to testing organizations (typically vendors) who run those tests. The testers may publish the results, and they may also send the results back to SPEC, which reviews the results and publishes them on its Web site. As only tests published on the SPEC Web site have been checked by the organization, they (and we) recommend that you look at the SPEC site as the official results repository.

SPEC OSG's flagship benchmark suite, CPU95, isn't particularly valuable for measuring the performance of a network server; its components measure a system's CPU performance, not the OS or I/O functions where many servers find their bottlenecks. OSG's newer SPECweb96 benchmark, however, is well-respected as a test of Web servers.

SPECweb96 runs on any server that supports HTTP/1.0, and it measures basic GET performance on static pages using randomized requests. The higher the benchmark score, the better. As of mid-summer 1998, the three highest SPECweb96 results were all Unix systems: NCR's four-CPU model 4400 (score: 7,800), Silicon Graphics' eight-CPU Origin 2000 (score: 7,214), and IBM's 12-processor RS/6000 Model S70 RS64-2 (score: 7,013).

I found it interesting to note that Hewlett-Packard submitted two results for its uniprocessor NetServer LH 3/400; it scored 2,131 when running Netscape Enterprise Server 3.5 for NetWare, and the score was 1,342 using Internet Information Server 4.0 for Windows NT 4.0.
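For a feel of what a static-page GET test involves (though this is emphatically not SPECweb96 itself), here is a minimal Python sketch that fetches randomly chosen static pages for a fixed interval and reports operations per second. The base URL and page list are placeholders you would point at your own test server.

import random
import time
import urllib.request

BASE = "http://localhost:8080"                        # assumption: your test server
PATHS = ["/index.html", "/doc1.html", "/doc2.html"]   # assumption: static files to fetch

def measure_gets(duration_seconds=10):
    """Issue GETs for a fixed interval and return operations per second."""
    done = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_seconds:
        url = BASE + random.choice(PATHS)              # pick a page at random
        with urllib.request.urlopen(url) as resp:
            resp.read()                                # drain the response body
        done += 1
    elapsed = time.perf_counter() - start
    return done / elapsed

if __name__ == "__main__":
    print(f"{measure_gets():.1f} GET operations per second")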

Another test to watch is SPECsfs97, which measures system file-server performance for Network File System (NFS)-based servers, which are typically Unix-based. SPECsfs97 tests CPU, mass storage, and network components. Unfortunately, only four vendors (Digital Equipment, IBM, Network Appliance, and Sun Microsystems) have submitted results to SPEC, and that's not really enough to help you make decisions.

TPC-C And TPC-D

No new product preview with a server manufacturer would be complete without the vendor mentioning either their record-breaking TPC-C or TPC-D results. Using those results, manufacturers hope to prove they're offering either the most powerful transaction processing systems ever built (at least until next week) or the best value, or both.

The TPC (www.tpc.org) is a nonprofit consortium based in San Jose, CA; its 42 current members are primarily systems manufacturers (like the Acer Group and IBM) or database software developers (like Oracle and Sybase). Because the members are often arch competitors, the TPC is widely perceived as being unbiased. But because the members are corporations, make no mistake: the purpose of the TPC is primarily to help vendors sell hardware or software, as well as to provide timely access to competitive test data.

Unlike SPEC, TPC releases its benchmark software freely into the public domain. As with SPEC, vendors run their own tests and may submit the results back to the TPC for publication. The two main tests are designed to measure real-world performance of servers and software.

The simpler TPC-C test measures online transaction processing; in particular, the speed of entering new-order transactions into a nine-table database while the system is simultaneously executing payment, order-status, delivery, and stock-level queries. The results, presented in a unit called tpmC (transactions per minute for TPC-C), represent the number of new-order transactions performed without allowing response time to exceed five seconds; the higher the number, the better. Based on the total retail price of the tested system (including hardware and software), a price/performance value known as $tpmC can be derived. For a list of the current top 10 TPC-C machines in both performance and value categories, see Table 1 and Table 2. [Editor's note: This data was current in early 1998.]
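The $tpmC arithmetic itself is simple: divide the total price of the tested configuration by its tpmC rating. The short Python sketch below uses the 11,649-tpmC figure from Table 1 together with a hypothetical system price, purely to show the calculation.

def price_per_tpmc(total_system_price_usd, tpmc):
    """Price/performance: total tested-system price divided by tpmC throughput."""
    return total_system_price_usd / tpmc

# Hypothetical price for a system rated at 11,649 tpmC (Table 1's top entry):
print(f"${price_per_tpmc(320_000, 11_649):.2f} per tpmC")   # roughly $27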

Table 1: These figures are effective as of July 1998. All of these systems were running Windows NT 4.0 and Microsoft's SQL Server.

The Top 10 $tpmC Machines

Rank  Machine                       Processors                     tpmC     $tpmC
 1    Compaq ProLiant 5500 6/200    Four 200MHz Pentium Pro        11,649   $27
 2    Unisys Aquanta QS/2 Server    Four 400MHz Pentium II Xeon    17,700   $27
 3    Compaq ProLiant 7000 6/400    Four 400MHz Pentium II Xeon    18,127   $27
 4    Acer Altos 19000 Pro4         Four 200MHz Pentium Pro        11,082   $28
 5    Dell PowerEdge 6000           Four 200MHz Pentium Pro        10,984   $30
 6    Compaq ProLiant 3000          Two 333MHz Pentium II           8,228   $31
 7    Compaq Server 7105            Four 200MHz Pentium Pro        11,359   $33
 8    NEC Express 5800 HX4100       Four 200MHz Pentium Pro        12,106   $33
 9    Unisys Aquanta HS/6 Server    Four 200MHz Pentium Pro        13,729   $33
10    NEC Express 5800 HX6100       Six 200MHz Pentium Pro         14,144   $33

Table 2: This table shows the top 10 TPC-C clusters and machines as of July 1998.

The Top 10 TPC-C Clusters and Machines

Rank  Machine                            Processors               Software                          tpmC      $tpmC
 1    Compaq AlphaServer 8400 5/625      500MHz Alpha 21164       Digital Unix 4.0D and Oracle8     102,542   $140
 2    IBM RS/6000 SP Model 309           233MHz PowerPC 604e      AIX 4.2.1 and Oracle8              57,054   $148
 3    HP 9000 V2250                      160MHz PA-RISC 7300LC    HP-UX 11 and Sybase 11.5           52,118   $82
 4    Sun Enterprise 6000                167MHz UltraSparc        Solaris 2.6 and Oracle8            51,822   $135
 5    HP 9000 V2200                      160MHz PA-RISC 7300      HP-UX 11 and Sybase 11.5           39,469   $95
 6    Fujitsu GranPower 7000 Model 800   250MHz UltraSparc        UXP/DS 20 and Oracle8              34,117   ¥57,883
 7    Sun Ultra Enterprise 6000          167MHz UltraSparc        Solaris 2.6 and Oracle8            31,147   $109
 8    Compaq AlphaServer 8400 5/350      266MHz Alpha 21064A      Digital Unix 4.0A and Oracle7      30,390   $305
 9    Tandem ServerNet Cluster           300MHz Pentium II        Windows NT Server 4 and Oracle8    27,383   $72
10    HP 9000 K580                       160MHz PA-RISC 7300LC    HP-UX 11 and Sybase 11.5           25,363   $77

The more complex TPC-D test is designed to test decision-support systems. Instead of updating databases with new orders, TPC-D submits complex queries against nine read-only database tables. The sizes of the databases can scale from 1Gbyte to 3Tbytes, so make sure you're comparing results at similar database scales. The test produces three metrics: the power metric, based on a geometric mean of response times on all of the test's SQL queries (called QppD@Size); a throughput metric measured in queries per hour (called QphD@Size); and a price/performance metric (total hardware/software price divided by the QphD@Size metric). Roughly speaking, the power metric reflects how quickly the system completes a single stream of queries, while the throughput metric reflects how much concurrent query work it can sustain. Because of its complexity, vendors must have TPC-D ratings approved by independent auditors before the TPC will publish the results.
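The geometric-mean step behind the power metric is easy to illustrate. The Python sketch below is a simplification of QppD@Size (the real metric also factors in database size and the update functions), and the per-query response times are invented.

import math

def geometric_mean(values):
    """Geometric mean: the nth root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

query_seconds = [12.0, 48.0, 7.5, 95.0, 22.0]   # assumed response times for five queries
print(f"Geometric mean response time: {geometric_mean(query_seconds):.1f} seconds")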

Measuring It Up

Before you use any benchmarks, keep in mind that the vendor runs the benchmark and, in particular, chooses the platform, picks the best configuration, runs the tests, and has the option not to report numbers that don't reflect well on the brand. Furthermore, the tests by their very nature are somewhat simplistic and won't reflect the actual application mix you're installing onto any server.

So, would I choose a single benchmark to compare servers or to help scale a server to my needs? No. No single number can do a complex server justice, any more than a new car's zero-to-60 measurement reflects the true capability of an automobile. That said, I would recommend perusing the TPC-C benchmark results next time you're shopping for a transaction server, and the SPECweb96 results to similarly benchmark Web servers.

Resources

Links to numerous benchmarking organizations can be found at www.nullstone.com/htmls/benchmark/benchmark.htm.

Intel's latest iComp Index results can be found at www.intel.com/procs/perf/index.htm. Additional information on benchmarks that Intel recommends can be found at www.intel.com/procs/perf/PentiumII/index.htm.

SPEC's home base is at www.spec.org, and the TPC can be found at www.tpc.org.

An out-of-date, but still interesting, FAQ on benchmarks was written by Dave Sill; the latest version I can find, dated March 16, 1996, is at http://sacam.oren.ortn.edu/~dave/benchmark-faq.html.

This tutorial, number 122, by Alan Zeichick, was originally published in the September 1998 issue of Network Magazine.

 