12.2 Performance Bottlenecks


The performance of a proxy cache system depends on numerous factors. Any one of the system components can become a bottleneck. In some cases, it's possible to use benchmarking tools to identify the bottlenecks. For example, you can benchmark two different hardware configurations and compare their results. Does a faster CPU improve response time? How much does your hit ratio improve if you double the disk space?

12.2.1 Disk Throughput

Disk I/O is often a bottleneck for web caches. Due to the nature of web traffic, web caches make very heavy use of their disk drives. If you've ever looked at the disk lights on a busy cache, you probably noticed the lights are almost constantly on.

The performance of disk systems varies greatly between products, and a number of factors are involved. First, the controller's bus speed limits how quickly chunks of data move from memory to the disk controller. For SCSI, this limit is usually 20, 40, or 80MB per second. For IDE controllers, the limit is either 33, 66, or 100MB per second. These rates correspond to theoretical upper limits. It's rare for a controller to sustain its maximum rate for any length of time, partly because individual disk drives have much lower limits. When multiple devices are connected to the bus, as is common with SCSI, then the bus bandwidth limit becomes more relevant.

Another, and more realistic, limit is how quickly an individual drive can transfer data to and from its magnetic media. This limit relates to the drive's rotational speed and other mechanical properties. For most drives, this limit is on the order of 10–20MB per second. Note that 20MB per second is roughly the capacity of an OC-3 network link (155 megabits/sec). Again, this rate is achievable only under ideal conditions, when the disk can write large chunks of data continuously.

Due to the nature of web traffic, cache disk drives are unlikely to achieve sustained data transfer rates as high as 10MB per second. Cached objects are relatively small (about 10KB), so a disk spends a lot of time seeking back and forth between different positions. While some caching products are able to optimize writes into large, contiguous chunks, it's harder to optimize reads for cache hits. In most cases, the real limit for disk performance is the rate of disk accesses. In other words, how many reads and writes can the disk handle per second? The maximum rate for most disks available today is about 100 operations per second. The determining factor is usually the seek time. To maximize your disk performance, use hard drives with low average seek times.
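The gap between a drive's media transfer rate and what a cache actually achieves is easy to see with some back-of-the-envelope arithmetic. The numbers below are the illustrative figures from the text, not measurements:

```python
# Seek-limited throughput estimate for a cache disk. If each small
# object costs at least one seek, the access rate, not the media
# transfer rate, caps throughput.

OPS_PER_SEC = 100        # typical random-access limit for one drive
MEAN_OBJECT_KB = 10      # typical cached object size, per the text

throughput_kb = OPS_PER_SEC * MEAN_OBJECT_KB
print(throughput_kb, "KB/sec")          # 1000 KB/sec, i.e., about 1MB/sec,
                                        # an order of magnitude below the
                                        # drive's 10-20MB/sec media rate
```

This is why seek time, rather than raw transfer rate, usually determines cache disk performance.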

It's relatively easy to find out if your disk system is a bottleneck. First, benchmark the cache with a 100% cachable workload. This should cause the cache to store every response, thereby stressing the disks. Next, run another benchmark with 0% cachable responses. Since disks are probably the slowest devices in the system, chances are you'll see a big difference in performance.

12.2.2 CPU Power

For the most part, web caches don't require really powerful CPUs. However, an underpowered CPU can limit a cache's performance. Caching and proxying involve moving data buffers from one place to another (e.g., between the network and disk). Copying buffers and scanning for string patterns probably account for the majority of CPU time.

One exception is caches that encrypt or decrypt HTTP/TLS traffic. The encryption algorithms are very CPU-intensive. A number of companies sell "SSL accelerator" cards or devices that offload encryption processing from the CPU onto a dedicated processor.

Testing for CPU bottlenecks can be difficult. It's probably easiest to use system tools (e.g., top and vmstat on Unix) to watch the CPU usage while running the benchmark. Other than that, you may simply need to try different CPUs in the system with the same workload.

12.2.3 NIC Bandwidth

A cache's network interface may be a limiting factor, although this is unlikely as long as the NIC is configured properly. Fast Ethernet is quite common today and sufficient for most available products. A single 100baseTX interface can usually handle 900–1,000 HTTP requests per second. Products that support higher request rates have Gigabit Ethernet or multiple 100baseTX interfaces.
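A rough sanity check shows where that figure comes from, assuming the 10KB mean object size mentioned earlier (the arithmetic is illustrative, ignoring TCP and HTTP header overhead):

```python
# Upper bound on HTTP requests/second for a 100baseTX interface,
# assuming a 10KB mean response size. Real-world rates are lower
# because of protocol overhead and bidirectional traffic.

LINK_BITS_PER_SEC = 100_000_000     # Fast Ethernet
MEAN_RESPONSE_BYTES = 10 * 1024     # assumed mean response size

bytes_per_sec = LINK_BITS_PER_SEC / 8
requests_per_sec = bytes_per_sec / MEAN_RESPONSE_BYTES
print(round(requests_per_sec))      # about 1,200 before overhead
```

After protocol overhead, this lands in the 900–1,000 requests/second range quoted above.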

To find out if your network interface is a bottleneck, look for a large percentage of collisions. Many Ethernet hubs and switches have lights on the front that indicate utilization or collisions. On Unix systems, the netstat -i command reports collision and error counters for each interface.

12.2.4 Memory

Memory is a precious resource for web caches. It's used to buffer data, index cached objects, store popular objects, and more. A caching application has two ways to deal with a memory shortage: swap some pages to disk or postpone certain actions until memory becomes available. In either case, running out of memory results in a very sudden performance degradation.

To find out if memory is limiting your performance, you can use system tools (e.g., top and vmstat on Unix) to monitor memory usage. Alternatively, you may be able to adjust how the application uses memory. For example, Squid uses less memory if you decrease the cache size. Lowering Squid's cache_mem parameter may help as well.
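With Squid, both adjustments are made in squid.conf. The directive names are real; the values below are purely illustrative and should be tuned to your hardware:

```
# Shrink the memory pool used for in-transit and hot objects:
cache_mem 8 MB

# A smaller disk cache also shrinks the in-memory object index.
# The third argument is the cache size in megabytes:
cache_dir ufs /var/spool/squid 1000 16 256
```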

12.2.5 Network State

TCP ports are a limited resource that people sometimes forget about. A TCP connection is defined by a pair of IP addresses and port numbers. Since the TCP port number is a 16-bit field, there are 65,536 ports available. When a system initiates a new TCP connection, it must first find an unused port number. When a connection is closed, the endpoint that initiated the close must keep the port in the TIME_WAIT state for a certain amount of time. Most implementations use one minute, even though the standards specify four minutes.

The number of ports in the TIME_WAIT state is roughly equal to the rate of connection establishment multiplied by the timeout value. Thus, with 65,536 ports and a 60-second timeout, a proxy cache can establish no more than about 1,000 connections per second. In fact, the situation may be worse because some operating systems have fewer ephemeral ports by default. On FreeBSD, for example, the ephemeral port range is 1,024–5,000. You can easily increase the limit with the sysctl program.
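The arithmetic can be sketched as follows, using the port counts and timeout quoted above:

```python
# In steady state, ports in TIME_WAIT = connection rate x timeout,
# so the sustainable rate is the port count divided by the timeout.

def max_conn_rate(num_ports, time_wait_secs):
    """Maximum sustainable outbound connection rate (per second)."""
    return num_ports / time_wait_secs

# Full 16-bit port space, 60-second TIME_WAIT:
print(max_conn_rate(65_536, 60))        # about 1,092 connections/sec

# FreeBSD's default ephemeral range (1,024-5,000):
print(max_conn_rate(5_000 - 1_024, 60)) # only about 66 connections/sec
```

The second figure shows why the default ephemeral range, not the 16-bit port space, is often the binding limit.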

Persistent HTTP connections may help to achieve a higher throughput if this limit is reached. However, persistent connections also tie up resources when the connection is idle. Very busy servers should use a low timeout (such as 1–2 seconds) for idle persistent connections.

Generally, there are three ways to avoid running out of ports: increase the ephemeral port range, decrease the TCP MSL value, or use additional IP addresses. Decreasing the MSL value reduces the time spent in the TIME_WAIT state and recycles ports faster, but you probably shouldn't use an MSL value smaller than 15 seconds on a production server. Adding more IP addresses effectively increases the number of available ports because sockets are bound to address/port tuples. Thus, two connections can use the same TCP port as long as they use different IP addresses. On Unix systems you can use ifconfig to add IP aliases to an interface.
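Because each local address carries its own set of ephemeral ports, adding addresses multiplies the available connection tuples. A small sketch, assuming a 60-second TIME_WAIT:

```python
# Sustainable connection rate grows linearly with the number of
# local IP addresses, since sockets bind to (address, port) tuples.

def max_conn_rate(num_addrs, ports_per_addr, time_wait_secs):
    return num_addrs * ports_per_addr / time_wait_secs

# One address versus four IP aliases, full port space:
print(max_conn_rate(1, 65_536, 60))     # about 1,092 connections/sec
print(max_conn_rate(4, 65_536, 60))     # about 4,369 connections/sec
```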


Web Caching
ISBN: 156592536X
Year: 2001
Pages: 160