12.3 Benchmarking Tools

A number of tools are available for benchmarking proxy caches. Some are self-contained because they generate all requests and responses internally. Others rely on trace log files for requests and on live origin servers for responses. Each technique has advantages and disadvantages.

Using trace files is attractive because the client and server programs are simpler to implement. A self-contained benchmark is more complicated because it uses mathematical formulas to generate new requests and responses. For example, a particular request has some probability of being a cache hit, of being cachable, and of being a certain size. With trace files, instead of managing complex workload parameters, the client just reads a file of URLs and sends HTTP requests. In essence, the workload parameters are embedded in the log files. Trace files have problems of their own, however. They don't normally record all the information needed to correctly play back the requests. For example, a log file doesn't normally say whether a particular request was on a persistent connection. It's also unlikely to indicate certain request headers, such as Cache-control and If-Modified-Since.
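
To make the trace-driven approach concrete, here is a rough sketch of such a client in Python. The trace filename, proxy address, and timeout are invented for the example; a real replay tool would add logging, concurrency, and rate control.

    # Minimal trace-driven client: read URLs from a file and send each
    # one through a proxy. Filename and proxy address are hypothetical.
    import urllib.request

    proxy = urllib.request.ProxyHandler({"http": "http://localhost:3128"})
    opener = urllib.request.build_opener(proxy)

    with open("trace.urls") as trace:        # one URL per line
        for line in trace:
            url = line.strip()
            if not url:
                continue
            try:
                with opener.open(url, timeout=30) as resp:
                    resp.read()              # drain the response body
            except OSError:
                pass                         # a real tool would log this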

Trace log files are usually taken from production proxy caches. This is good because the trace represents real web traffic on your network, generated by real users. If you want to run a trace-based benchmark or simulation but don't have any log files, you might be out of luck. Log files are not usually shared between organizations because of privacy issues.

A benchmark that uses real URLs requested from live origin servers is likely to give inconsistent and unreproducible results. For example, if you run the same test on two different days, you may get significantly different results. Live origin servers have different performance characteristics depending on the time of day, day of the week, and unpredictable network conditions.

A self-contained benchmark has the advantage of being configurable, and it's also reproducible when run on dedicated systems. For example, it's easy to change a workload's hit ratio and other parameters as needed. However, some people find the myriad of workload parameters daunting. If your goal is to simulate the real traffic on your network, it's less work to take a trace file than to characterize the traffic and enter the parameters into a configuration file. On the other hand, investing the time to characterize your traffic gives you many more options. Since simulated data sets are essentially infinite, you can run tests for as long as you like, and you don't need to worry about whether a trace file has enough URLs for the test you want to run.

Another distinguishing feature of benchmarking tools is how they submit new requests. Two techniques are commonly used: best-effort and constant rate. The best-effort method uses a finite number of simultaneous client agents. Each agent sends one request at a time, waiting for the current request to complete before sending the next one. As the name implies, the constant rate method submits new requests at a constant average rate, regardless of the number of active requests.

The best-effort technique is usually easier to implement. Often, each agent is a separate thread or process. This leads to a natural feedback mechanism: once all threads become busy, no new requests are submitted. New requests are sent only as fast as the cache responds to them. A fixed number of agents may not sufficiently stress a proxy cache. Furthermore, using too few agents leads to very misleading throughput measurements. For example, consider a caching proxy that, for whatever reason, delays every request by at least 100 milliseconds. A test with 10 threads can produce no more than 100 requests per second. However, a test with 100 threads can generate up to 1,000 requests per second.
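
The following sketch shows the best-effort model in miniature, with worker threads standing in for agents. The URL and agent count are placeholders; the point is that NUM_AGENTS is the only thing limiting throughput.

    # Best-effort submission: each agent has one request outstanding at
    # a time. If every response takes at least 0.1 sec, 10 agents can
    # never exceed 100 requests/sec, no matter how fast the cache is.
    import queue
    import threading
    import urllib.request

    NUM_AGENTS = 10
    urls = queue.Queue()
    for _ in range(1000):
        urls.put("http://example.com/obj")   # placeholder workload

    def agent():
        while True:
            try:
                url = urls.get_nowait()
            except queue.Empty:
                return
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    resp.read()              # wait for the full response
            except OSError:
                pass

    threads = [threading.Thread(target=agent) for _ in range(NUM_AGENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()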

The constant rate method usually results in a better workload. Even if the cache starts to slow down (i.e., response time increases), new requests continue to arrive at the same rate. Note that constant doesn't necessarily mean that requests arrive at perfectly spaced time intervals (e.g., one request every 10 milliseconds). Real Internet traffic is bursty, with a lot of variation between consecutive interarrival times. Some benchmarks model interarrival times with Poisson distributions or self-similar models that have constant average request rates. This technique also provides a better simulation of real traffic because the total number of users is usually much larger than the number of concurrent connections. When my browser submits requests, it doesn't care how many other browsers already have requests pending.
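
A constant-rate submitter with Poisson arrivals takes only a few lines. Drawing each gap from an exponential distribution produces bursty traffic whose long-run average is RATE requests per second; the rate and URL here are arbitrary examples.

    # Constant-rate submission: new requests start on a schedule that
    # ignores how many earlier requests are still pending.
    import random
    import threading
    import time
    import urllib.request

    RATE = 100.0                             # mean requests per second

    def send(url):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
        except OSError:
            pass

    for _ in range(10000):
        time.sleep(random.expovariate(RATE)) # exponential gap, mean 1/RATE
        threading.Thread(target=send,
                         args=("http://example.com/obj",)).start()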

12.3.1 Web Polygraph

Web Polygraph (http://www.web-polygraph.org) is a comprehensive cache benchmarking tool. It's freely available and distributed as source code that runs on both Unix and Microsoft Windows. Polygraph is designed to be high-performance and very flexible. It simulates both HTTP clients and servers based on a rich specification language. (I have a personal preference for Polygraph because I am involved in its development.)

Polygraph is high-performance because it makes very efficient use of system resources. A single midrange PC system can easily saturate a fast Ethernet segment, which corresponds to about 1,000 requests per second. A lot of attention is given to optimizing the Polygraph source code. You can scale the offered load to very high rates by running Polygraph on many systems in parallel.

Simulating both clients and servers gives Polygraph a lot of flexibility. For example, clients and servers can exchange certain state information with each other via custom HTTP headers. This allows Polygraph to know whether a specific response was generated by the server or served from the cache. Polygraph can detect false hits, foreign requests, foreign responses, and other errors.
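
The idea behind this state exchange can be illustrated with a toy scheme (the header name and details here are invented; Polygraph's actual protocol is richer). The simulated server stamps each response it generates with a unique transaction id. If the client receives an id it has seen before, the response must have come from the cache rather than the server.

    import uuid

    # Server side: tag every generated response with a fresh id.
    def tag_response(headers):
        headers["X-Bench-Xact"] = uuid.uuid4().hex   # hypothetical header

    # Client side: a repeated id means the cache, not the server,
    # produced this response; a missing id marks a foreign response.
    seen = set()

    def classify(response_headers):
        xact = response_headers.get("X-Bench-Xact")
        if xact is None:
            return "foreign response"
        if xact in seen:
            return "cache hit"
        seen.add(xact)
        return "cache miss"

Checking the returned identifiers against the object the client actually requested is what makes false hits detectable: a hit is false when the cache returns a response belonging to some other object.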

Polygraph's configuration language allows users to customize the simulated workload. For example, you can specify mathematical distributions for reply sizes, request popularity, think time delays, and more.

Another important part of Web Polygraph is the development of standard workloads. These workloads allow anyone to run their own tests and compare the results with previously published data. Workloads are developed and evolved with input from users, caching vendors, and other interested parties. The standard workloads mimic real web traffic as closely as possible. The parameters are taken from analyses of real systems, log files, and published research results.

Polygraph models the following web traffic characteristics:

Request submission

Polygraph supports both constant rate and best-effort request submission. In the best-effort approach, a finite number of agents each submit one request at a time. When the agent receives the complete response, it immediately submits another request. The constant rate model, on the other hand, submits new requests at a constant average rate. Polygraph allows you to specify an interarrival distribution for the constant rate model.

As I mentioned previously, the best-effort method may not put much stress on a proxy cache, especially if a small number of agents are used. New requests are submitted only as fast as the cache can respond to them. Constant rate submission is more likely to test the cache under stressful conditions. If the cache cannot keep up with the client, a significant backlog develops. Polygraph continues to submit new requests until it reaches a configurable limit or runs out of unused file descriptors.

Popularity

Popularity is an important aspect of web traffic. It determines how often some objects get requested relative to others. Numerous studies have shown that web access patterns are best described by a Zipf-like distribution [Breslau, Cao, Fan, Phillips and Shenker, 1999].

Among other things, the popularity distribution affects memory hit ratios. If the popular objects are too popular, then too many cache hits are served from memory. In this case, the workload does not sufficiently exercise the disk system, and the product achieves better performance than it does with live traffic. Getting the popularity model and memory hit ratio "just right" is one of the hardest aspects of proxy cache benchmarking.

Recurrence

The recurrence parameter also affects the hit ratio. It specifies the probability that a given request is for an object that was previously requested. Recurrence applies to both cachable and uncachable responses; repeated requests for uncachable objects do not contribute to the hit ratio. Note that recurrence determines only whether a particular request is for a previously requested object. The popularity model actually determines which object gets requested. (The sketch following this list shows one way recurrence and popularity can combine to select the next object.)

Server delays

Polygraph supports server-side think time delays as a way to approximate server and network latencies. Such delays are important because they simulate busy servers and, to some extent, congested networks. They increase response times and cause the proxy cache to maintain a large number of open connections.

Content types

Polygraph allows you to specify a distribution of different content types. For example, you can say that 70% are images, 20% are HTML, and 10% are "other." Each content type can have different characteristics, such as reply size, cachability, and last-modified timestamps.

Reply sizes

It's very important to use a realistic distribution of reply sizes. In particular, a product's disk performance is probably sensitive to the reply size distribution. If the sizes are too small, the benchmark underestimates the actual performance. As mentioned previously, Polygraph allows you to assign different reply size distributions to different content types.

Object lifecycles

In this context, lifecycle refers to the times at which objects are created, modified, and destroyed. The lifecycle model determines values for a response's Expires and Last-modified headers and whether these headers are present at all. These parameters determine how caches and simulated clients revalidate cached responses. Unfortunately, the lifecycle of real web content is quite difficult to characterize.

Persistent connections

HTTP/1.1 defines persistent connections, which enable a client to make multiple requests over a single TCP connection. Polygraph varies the number of requests sent over each connection, as occurs in real traffic.

Embedded objects

The bulk of web traffic consists of images embedded in HTML pages. Although Polygraph doesn't currently generate HTML, it can model pages with embedded objects. Furthermore, Polygraph clients mimic graphical browsers by immediately requesting a small number of embedded objects in parallel.

Uncachable responses

Polygraph marks some percentage of responses as uncachable with the no-cache and other Cache-control directives. Of course, uncachable responses affect response time, as well as hit ratio. Polygraph reports an error if it detects a cache hit for an uncachable response.
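
As promised above, here is one way a self-contained benchmark can combine recurrence and popularity to choose the next object. The constants are purely illustrative; they are not Polygraph's standard-workload values.

    import random

    RECURRENCE = 0.55      # probability a request repeats an old object
    ZIPF_ALPHA = 0.7       # skew of the Zipf-like popularity curve
    next_oid = 0           # objects created so far have ids 0..next_oid-1

    def zipf_rank(n):
        # Zipf-like sampling: p(rank i) is proportional to i**(-alpha).
        # For 0 < alpha < 1 the CDF is roughly (k/n)**(1-alpha), which
        # inverts to the expression below (rank 1 is the most popular).
        u = random.random()
        return max(1, int(n * u ** (1.0 / (1.0 - ZIPF_ALPHA))))

    def next_object():
        global next_oid
        if next_oid > 0 and random.random() < RECURRENCE:
            return zipf_rank(next_oid) - 1   # revisit an existing object
        next_oid += 1                        # otherwise mint a new one
        return next_oid - 1

The chosen object id would then be embedded in the request URL; whether the resulting request is a hit depends on what the cache did with that object the first time it was requested.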

During an experiment, Polygraph measures the following at regular intervals:

  • Client- and server-side throughput

  • Cache hit ratio and byte hit ratio

  • The distribution of response times, separately for hits and misses

  • The distribution of reply sizes, separately for hits, misses, cachable responses, and uncachable responses

  • Persistent connection usage

  • Number and type of errors

  • Number of requests and bytes for hits, misses, cachable responses, and uncachable responses
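
Two of these metrics are worth distinguishing carefully: hit ratio counts requests, while byte hit ratio counts bytes, and the two diverge when hits skew toward small objects. A minimal per-interval accumulator might look like this (an illustration, not Polygraph's code):

    # Per-interval statistics: hit ratio counts requests, byte hit
    # ratio counts bytes. Reset an instance at each interval boundary.
    class IntervalStats:
        def __init__(self):
            self.hits = self.misses = 0
            self.hit_bytes = self.miss_bytes = 0

        def record(self, was_hit, size):
            if was_hit:
                self.hits += 1
                self.hit_bytes += size
            else:
                self.misses += 1
                self.miss_bytes += size

        def hit_ratio(self):
            total = self.hits + self.misses
            return self.hits / total if total else 0.0

        def byte_hit_ratio(self):
            total = self.hit_bytes + self.miss_bytes
            return self.hit_bytes / total if total else 0.0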

12.3.2 Blast

The blast software, developed by Jens-S. Vöckler, replays trace log files. Blast launches a number of child processes in parallel, each one handling one request at a time. In this way, it's similar to Polygraph's best-effort request submission model.

Blast also includes a simple program called junker that simulates an origin server. Junker supports the GET method, HTTP/1.1 persistent connections, and If-Modified-Since validation requests.
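
To show what If-Modified-Since support involves, here is a minimal origin simulator in Python. It sketches only the validation behavior; junker's actual implementation differs, and the port, object size, and timestamp here are arbitrary.

    # A toy origin server that answers GETs and honors
    # If-Modified-Since with 304 Not Modified responses.
    import datetime
    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LAST_MODIFIED = datetime.datetime(2001, 1, 1,
                                      tzinfo=datetime.timezone.utc)

    class Origin(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"        # allow persistent connections

        def do_GET(self):
            ims = self.headers.get("If-Modified-Since")
            try:
                fresh = ims and parsedate_to_datetime(ims) >= LAST_MODIFIED
            except (TypeError, ValueError):
                fresh = False                # unparsable date: ignore it
            if fresh:
                self.send_response(304)      # cached copy is still valid
                self.end_headers()
                return
            body = b"x" * 1024               # fixed-size dummy entity
            self.send_response(200)
            self.send_header("Last-Modified",
                             formatdate(LAST_MODIFIED.timestamp(),
                                        usegmt=True))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), Origin).serve_forever()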

You can get blast, in the form of Unix source code, from http://www.cache.dfn.de/DFN-Cache/Development/blast.html.

12.3.3 Wisconsin Proxy Benchmark

The Wisconsin Proxy Benchmark (WPB) was developed at the University of Wisconsin-Madison. Like Polygraph, WPB generates requests and responses on demand (rather than from trace files). As with blast, WPB uses a one-request-per-process approach, which makes it similar to Polygraph's best-effort mode. WPB supports a configurable think time delay in its server. Many other workload characteristics, such as popularity and hit ratio, appear to be hardcoded.

You can download the Unix source code for WPB from http://www.cs.wisc.edu/~cao/wpb1.0.html. Since the project leader recently left the University of Wisconsin, it is unlikely that WPB development will continue.

12.3.4 WebJamma

WebJamma is similar to blast in that it plays back trace log files and uses best-effort request submission. One unique feature is the ability to distribute requests to multiple proxy caches. This allows you to evaluate more than one product at the same time and under the same conditions. The WebJamma home page is http://www.cs.vt.edu/~nrg/webjamma.html.

12.3.5 Other Benchmarks

A number of other benchmarks, such as WebStone, SPECweb99, WebBench, and httperf, have been designed to test origin servers. Although it may be possible to use them with proxy caches (via layer four redirection, for example), I do not recommend it. The workload for a single origin server is much different from the workload for a proxy cache.
