12.4 Benchmarking Gotchas

Benchmarking a proxy cache is a complicated endeavor. Problems arise at all layers of the networking model, from the physical to the application. Bottlenecks or inefficiencies can appear in many places. The following sections describe some common problems I have observed while performing benchmarks.

12.4.1 TCP Delayed ACKs

TCP includes a mechanism known as delayed ACKs [Clark, 1982; IETF, 1989]. The idea is to not immediately acknowledge every data packet. ACK-only packets are very small and thus not particularly efficient. If the TCP stack waits a little while, there is a chance that a data packet is headed in the same direction. Piggybacking the ACK with the data is much more efficient. Most TCP implementations delay ACKs for up to 200 milliseconds. For some, the timeout is configurable.

Delayed ACKs are a big win for interactive flows (e.g., telnet) where small packets flow in both directions in bursts. HTTP, however, is largely unidirectional. HTTP requests normally fit inside a single TCP packet and, therefore, they are not affected by delayed ACKs. Responses, however, typically require many packets. At the beginning of the transfer, TCP slow start is also in effect. This means the sender won't transmit the second packet until the first one is acknowledged. Thus, clients that use delayed ACKs increase the response time of each request by about 100 to 200 milliseconds.

Most operating systems allow you to disable delayed ACKs. There is a tradeoff in doing so, however. Response times improve, but the number of packets in the network increases. I have seen caching products that achieve a higher throughput (responses per second) with delayed ACKs enabled because the network interface was handling fewer packets.
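
On Linux, you can also experiment with this tradeoff on a per-connection basis using the TCP_QUICKACK socket option rather than a system-wide setting. The following sketch is an illustration only; the proxy host name and port are placeholders, and the option must be re-set after each read because the kernel may silently turn it back off.

    import socket

    # Minimal sketch: disable delayed ACKs on a single connection (Linux only).
    # Fall back to the numeric value from linux/tcp.h if the constant is missing.
    TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", 12)

    # Hypothetical proxy host and port; replace with your own.
    sock = socket.create_connection(("proxy.example.com", 3128))
    sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
    sock.sendall(b"GET http://origin.example.com/ HTTP/1.0\r\n\r\n")

    while True:
        data = sock.recv(4096)
        if not data:
            break
        # TCP_QUICKACK is not permanent; re-arm it after each read.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)

    sock.close()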

In 1999, a magazine attempted to benchmark a handful of proxy caches with a best-effort workload. They used only 20 agents. Because they did not consider TCP delayed ACKs, the mean response time was approximately 200 milliseconds. In this environment, none of the products could achieve more than 100 requests per second (20 agents divided by 0.2 seconds per request). Many of the same products had previously been benchmarked at higher than 1,000 requests per second. Even worse, the delayed ACKs affected different products differently. Slower products actually appeared to perform better than faster ones.

For more information on TCP delayed ACKs, refer to Section 19.3 of [Stevens, 1994].

12.4.2 Port Number Exhaustion

Port numbers are a limited resource for busy clients and servers. When an application closes a connection, TCP requires the host to keep the socket in the TIME_WAIT state for some amount of time [Stevens, 1994, Section 18.6]. The TIME_WAIT duration is defined as twice the Maximum Segment Lifetime (MSL). For most TCP implementations, MSL is set to 30 seconds, so the TIME_WAIT value is 60 seconds. During this time, the same port number cannot be used for a new connection, although Stevens notes some exceptions.

When benchmarking, we usually want to push the limits of our load-generating machines. For example, a reasonable goal is to have the machine generate 1,000 HTTP requests per second. At this rate, with a 60-second TIME_WAIT, we'll have 60,000 ports in that state after running for just a minute or so. TCP has only 65,536 ports, some of which we can't use, so this is dangerously close to the limit. In order to support high request rates, we need to decrease the Maximum Segment Lifetime (MSL) value on the load-generating hosts. It's probably safe to use an MSL value as low as three seconds, but only on the load-generating machines and only in a lab environment. If you change the MSL value on a production system, you risk receiving incorrect data. TCP may get confused and believe that packets from an old connection belong to a new connection.
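
Before a test, it's worth estimating how many ports will sit in TIME_WAIT at your target request rate. A minimal sketch, using the figures above (the usable-port count is a rough assumption):

    # Steady-state TIME_WAIT port consumption on a load generator.
    # TIME_WAIT lasts 2 * MSL, so ports in TIME_WAIT = request_rate * 2 * MSL.
    def time_wait_ports(request_rate, msl_seconds):
        return request_rate * 2 * msl_seconds

    usable_ports = 65536 - 1024          # rough figure; skip the reserved low ports

    for msl in (30, 3):                  # default 30-second MSL vs. a lab-only 3-second MSL
        used = time_wait_ports(1000, msl)
        print("MSL %2d s: %6d ports in TIME_WAIT (%.0f%% of usable range)"
              % (msl, used, 100.0 * used / usable_ports))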

12.4.3 NIC Duplex Mode

Many Ethernet interfaces and switches available today support both 10BaseT and 100BaseTX. Furthermore, 100BaseTX supports both half- and full-duplex modes. These devices tend to have autonegotiation features that automatically determine the speed and duplex setting. However, autonegotiation doesn't always work as advertised, especially when it comes to the duplex mode.

A duplex mismatch can be tricky to detect because the link usually works properly for very low-bandwidth uses such as telnet and ping. As the bandwidth increases, however, a large number of errors and/or collisions occur. This is one of the reasons why running simple TCP throughput tests is very important.
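
A throughput test doesn't have to be elaborate. The following sketch is an illustration only, not a substitute for a dedicated tool; the port number and test duration are arbitrary. Run it in both directions between the client and proxy machines and compare the rates. A duplex mismatch usually shows up as a dramatic drop in one or both directions, along with errors or collisions on the interface.

    import socket, sys, time

    # Minimal one-directional TCP throughput test (illustration only).
    # Run "python tcptest.py server" on one host, then
    # "python tcptest.py client <server-ip>" on the other.
    PORT, SECONDS, CHUNK = 5001, 10, 64 * 1024

    if sys.argv[1] == "server":
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, addr = srv.accept()
        total, start = 0, time.time()
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
        elapsed = time.time() - start
        print("received %.1f MB in %.1f s (%.1f Mbit/s)"
              % (total / 1e6, elapsed, total * 8 / elapsed / 1e6))
    else:
        sock = socket.create_connection((sys.argv[2], PORT))
        buf, stop = b"x" * CHUNK, time.time() + SECONDS
        while time.time() < stop:
            sock.sendall(buf)
        sock.close()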

12.4.4 Bad Ethernet Cables

100BaseTX requires high-quality, well-made Ethernet cables. A poorly made cable is likely to cause problems such as poor throughput, errors, and collisions. Remember that netstat -i shows error and collision counters on Unix systems. If you observe these conditions on an interface, replace the cables and try again. Bidirectional TCP throughput tests are useful for identifying bad cables.
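
If you prefer to watch these counters from a script rather than with netstat, Linux also exposes them under /sys/class/net. This is a minimal sketch assuming a Linux host; the interface name eth0 is a placeholder.

    import os

    # Print error and collision counters for one interface (Linux only).
    # The counters live under /sys/class/net/<iface>/statistics.
    IFACE = "eth0"  # replace with your interface name
    STATS = ("rx_errors", "tx_errors", "collisions")

    for name in STATS:
        path = os.path.join("/sys/class/net", IFACE, "statistics", name)
        with open(path) as f:
            print("%-12s %s" % (name, f.read().strip()))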

12.4.5 Full Caches

To be really useful, measurements must be made on full caches. A cache that is not full has a performance advantage because it does not delete old objects. Furthermore, it is able to write data faster because the filesystem has large amounts of contiguous free space.

Once the disk becomes full, the cache replaces old objects to make room for new ones. This affects performance in two ways. The act of deleting an object typically has some impact, for example, updating a directory to indicate that certain disk blocks are now free. Object removal also leads to fragmentation; thus, when writing new objects, there are fewer contiguous free blocks available.

Some products have smarter filesystems than others. Those that have filesystems optimized for web caching might show roughly similar performance for empty and full caches. However, many products do exhibit a significant performance decrease over time.

12.4.6 Test Duration

It takes a long time to properly benchmark a proxy cache. Measurements should be taken when the cache has reached a steady state. Proxy caches usually have very good performance at the start of a test. As the test progresses, performance decreases slowly. The longer you run a test, the closer you get to the steady-state conditions. A production cache handling live traffic can take days or weeks to stabilize. A week-long benchmark is not usually an option, so we must settle for shorter tests. Personally, I recommend running benchmarks for at least six hours after the cache is full.

12.4.7 Long-Lived Connections

Long-lived connections are important because they tie up valuable resources in the proxy. By long-lived, I mean a few seconds, not hours or days. It's easy for a proxy to achieve 1,000 requests per second when the mean response time is just 10 milliseconds. Achieving the same rate when the response time is three seconds is significantly harder: at 10 milliseconds, 1,000 requests per second implies only about 10 concurrent connections, while at three seconds it implies about 3,000. We simulate long connections because they exist in real web traffic. Slow modem speeds, wide area network delays, and loaded servers all contribute to response time.
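
The arithmetic behind this is just Little's Law: the number of simultaneously open connections a proxy must handle is the request rate multiplied by the mean response time. A minimal sketch, using the figures from the paragraph above:

    # Concurrency needed to sustain a given request rate (Little's Law):
    #   concurrent_connections = request_rate * mean_response_time
    def concurrency(request_rate, mean_response_time):
        return request_rate * mean_response_time

    for rt in (0.010, 3.0):   # 10 ms vs. 3 s mean response time
        print("%.3f s response time -> %5.0f concurrent connections at 1,000 req/s"
              % (rt, concurrency(1000, rt)))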

12.4.8 Small Working Sets

Benchmarking with a small working set overestimates performance. The working set is the set of all objects that clients can request at any given time. The size of the working set affects memory hit ratios. Note that the contents of the working set can change over time, but its size remains approximately the same.

The working set is too small if it fits entirely in the cache's memory. Since the cache doesn't need to read from disk for memory hits, it achieves higher performance. Conversely, a working set that is too large results in a very low memory hit ratio and worse performance.

For a cache to achieve the maximum possible hit ratio for a given workload, the working set size must be less than the cache size. Otherwise, the cache replaces some objects that are requested again later, resulting in cache misses.
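
The effect is easy to demonstrate with a toy simulation. The sketch below is purely illustrative: it assumes objects are requested uniformly at random and uses a simple LRU cache, neither of which matches real traffic, but it shows the hit ratio falling once the working set no longer fits in the cache.

    import random
    from collections import OrderedDict

    # Toy LRU cache simulation: hit ratio vs. working-set size.
    # Uniform random popularity is unrealistic, but it is enough to show
    # what happens when the working set exceeds the cache size.
    def hit_ratio(cache_size, working_set_size, requests=100000):
        cache, hits = OrderedDict(), 0
        for _ in range(requests):
            obj = random.randrange(working_set_size)
            if obj in cache:
                hits += 1
                cache.move_to_end(obj)         # mark as most recently used
            else:
                cache[obj] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
        return hits / float(requests)

    CACHE_SIZE = 10000
    for ws in (5000, 10000, 20000, 40000):
        print("working set %6d, cache %d -> hit ratio %.2f"
              % (ws, CACHE_SIZE, hit_ratio(CACHE_SIZE, ws)))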

12.4.9 Clock Sync

All systems involved in a benchmark should have their clocks synchronized to a single source. On more than one occasion, I have seen system clocks differ by more than a year. Such a difference can cause unexpected behavior. HTTP's date-related headers affect validation and may affect cacheability as well. Thus, clock skew can result in no cache hits or a decrease in performance due to excessive validations.

12.4.10 MSL (TIME_WAIT) Values

As I described previously, a TCP stack's MSL determines how quickly port numbers are recycled. When comparing different caching products, you should make sure they all use the same reasonable MSL values. Web Polygraph includes a program called msl_test that reports the MSL setting for a remote host.
