Different caching products provide different statistics and measurements for monitoring purposes. In general, you should collect as much information as you can. It's better to be on the safe side, since you never know when some information may come in handy. The following descriptions cover the parameters you should collect and monitor, if possible.
The client request rate is the rate at which clients issue HTTP requests to the cache. In a corporate environment, you'll probably observe a daily peak that begins in the morning when people arrive at work and ends when people leave in the evening. ISPs with dial-up customers usually see an evening peak. University traffic usually has both trends combined, with one long peak from early morning until late at night.
The shape of the request rate data is less important than how it differs from the normal pattern. As the administrator, you should be familiar with how the request rate changes over the course of a day. If you notice a significant change, you should probably investigate the cause. Less traffic than usual can indicate local network problems. More requests than usual could mean someone is using prefetching or web crawling software.
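If your product doesn't report the request rate directly, you can derive a rough picture from the access log. Here's a quick sketch that assumes Squid's native access.log format, where the first field is a Unix timestamp; it counts requests in one-minute buckets:
awk '{print int($1/60)*60}' /var/log/squid/access.log | sort -n | uniq -c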
There is normally a strong correlation between bandwidth and request rate. However, you may want to monitor it separately. If bandwidth becomes a bottleneck, it might be more obvious in this data. Also, large file downloads, such as MP3s, show up as spikes on a bandwidth plot but do not stand out on a request rate plot.
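For a quick look at the bandwidth your cache delivers to clients, you can total the reply sizes from the access log. This sketch again assumes Squid's native format, where field 5 is the reply size in bytes, and prints bytes per one-minute bucket:
awk '{bytes[int($1/60)*60] += $5} END {for (t in bytes) print t, bytes[t]}' /var/log/squid/access.log | sort -n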
If your cache doesn't give you bandwidth measurements, you can probably get it from other sources. In fact, you might prefer to take readings from a switch or router. These should enable you to separate out the internal and external traffic as well.
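For example, if your switch or router runs an SNMP agent, the standard interface counters give you raw byte counts. The hostname and community string below are placeholders, and the interface index (here, 2) depends on your device:
snmpget -v2c -c public router.example.com IF-MIB::ifInOctets.2 IF-MIB::ifOutOctets.2
Successive readings, divided by the polling interval, give you the average bandwidth; tools such as MRTG automate exactly this kind of polling.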
Response time is the most important way to measure how well the cache serves your users. A normal value probably means your users are happy. A high response time means your users are upset and are probably calling you to complain about it. Any number of things can affect response time, including upstream network congestion, an overloaded cache server, a broken DNS server, etc.
You should monitor server-side response times separately from the client side, if possible. Server-side requests are sent to origin servers or upstream caches for cache misses. If you notice an increase in client response times, but no change in server response times, it probably indicates a local problem. On the other hand, if both measurements increase, then you can blame the slowness on upstream Internet delays.
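If your cache doesn't report service times directly, you can at least approximate the client-side response time from the log. The following sketch assumes Squid's native format, where field 2 is the elapsed time in milliseconds, and prints a rough median:
awk '{print $2}' /var/log/squid/access.log | sort -n | awk '{a[NR]=$1} END {print a[int((NR+1)/2)], "msec"}'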
Resolution of DNS names to addresses is a critical aspect of web caching. Some caching products have an internal "DNS cache" but use an external DNS server to resolve unknown names. Tracking the response time for DNS lookups is a good way to monitor the status of the external DNS server. Because internal DNS caches have very high hit ratios, a DNS server failure may not show up in the client-side response time data.
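A simple way to spot-check the external DNS server is to time a query by hand. The dig utility reports the query time directly; the server and hostname below are just placeholders:
dig @ns.example.com www.example.com | grep 'Query time'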
Monitoring your cache's hit ratio tells you how much bandwidth is being saved. Don't be too surprised if the hit ratio varies significantly over the course of the day.
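If your product doesn't report the hit ratio, you can compute a rough figure from the access log. This sketch assumes Squid's native format, where field 4 holds the result code, and considers only client (TCP) requests:
awk '$4 ~ /^TCP/ {total++; if ($4 ~ /HIT/) hits++} END {printf "%.1f%% hit ratio\n", 100*hits/total}' /var/log/squid/access.log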
If you have an appliance-based product, you may not have access to memory-usage information. For software products, however, it is important to understand how much memory the application is using. If it's near the limit, then you'll need to buy more memory or change the configuration to use less of it. Monitoring memory usage is also a good way to find some software bugs, such as memory leaks.
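On a Unix system, ps shows you the process size. The exact options and columns vary between systems, but the resident set size (RSS) is the number to watch; the bracketed grep pattern keeps grep itself out of the output:
ps aux | grep '[s]quid'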
Normally, your cache should use a constant amount of disk space. If you notice it changing significantly over time, then something interesting may be going on. Some caches may lose their disk contents if the system crashes. Such an event would show up in this data. Also, it is interesting to see how quickly the disk fills up. As I mentioned in Section 10.2, a cache that fills very quickly (in less than three days) can benefit from having more disk space.
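Tracking disk usage is as simple as running du from cron and recording the output. The path below is only a guess at where your cache keeps its data:
du -sk /var/spool/squid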
Monitoring the system's CPU usage tells you if the CPU is becoming a bottleneck. CPU usage can be tricky because a high amount of utilization isn't necessarily bad. For example, if your caching proxy is blocking on disk I/O, the CPU utilization may be quite low. Removing the disk bottleneck allows the cache to support higher throughput, which should result in higher CPU utilization as well. I wouldn't worry about upgrading the CPU until utilization reaches 75% for hours at a time.
Another thing to keep in mind is that some products may use spare CPU cycles to poll for events more frequently. The CPU appears to be busy, but the application isn't necessarily working as hard as it can, so CPU utilization doesn't increase in proportion to the supported load. Even when the system reports 100% utilization, the cache may be able to handle additional load without negatively affecting overall performance.
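On most Unix systems, vmstat gives you a running breakdown of user, system, and idle CPU time, which is usually enough to tell whether the processor, rather than the disks, is the bottleneck:
vmstat 5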
Unix-based caches, such as Squid, use file descriptors to read and write disk files and network sockets. Each file descriptor identifies a different file or TCP connection that is currently open for reading or writing. Usually, a Unix system places limits on the number of file descriptors available to a single process and to the system as a whole. Running out of file descriptors results in service denial for your users, so you probably want to track the usage. All modern Unix systems let you raise the per-process and systemwide limits. It's a good idea to be conservative here. Give yourself plenty of extra descriptors and familiarize yourself with the procedure.
The number of file descriptors in use at any given time is approximately equal to the average request rate multiplied by the average response time. For most locations, both request rate and response time increase during the middle of the day. If your Internet connection goes down, response time becomes essentially infinite, and your cache is likely to reach its file descriptor limit. Thus, when you see a huge spike in file descriptor usage, it's usually due to an upstream network failure.
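As a back-of-the-envelope check, a cache handling 50 requests per second with a 3-second average response time needs roughly 50 × 3 = 150 descriptors at any instant. On Linux you can also count a process's open descriptors directly through /proc; the pgrep usage below assumes your cache process is named squid:
ls /proc/$(pgrep -n squid)/fd | wc -l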
My definition of an abnormal request is one that results in an error, is denied access, or is aborted by the user. You should always expect to see a small percentage of errors, due to transient network errors, broken origin servers, etc. However, a significant increase in errors probably indicates a local problem that requires your attention.
Changes in the percentage of requests that are denied access may require investigation as well. If you normally deny a lot of requests due to request filtering, this data can assure you it's working properly. A sudden drop in the percentage of denied requests could mean that the filtering software is broken, and users are visiting sites that should be blocked. On the other hand, if you usually deny only a small number of requests, an increase may indicate that outsiders are trying to send requests through your proxy to hide their tracks.
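An easy way to watch for these changes is to tally the result codes in the access log and compare today's distribution with yesterday's. Again, this sketch assumes Squid's native format with the result code in field 4:
awk '{codes[$4]++} END {for (c in codes) print codes[c], c}' /var/log/squid/access.log | sort -rn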
If you're using ICP or HTCP and acting as a parent or sibling for others, it's a good idea to track the query rate as well. Often, it's helpful to look at the ratio of HTCP to ICP requests. A sharp change in this ratio can indicate that your neighbor changed its configuration. Perhaps it went from a parent-child to a sibling relationship, or maybe it turned ICP on or off.
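If your cache logs ICP queries to the access log, as Squid does by default, you can count them there. The UDP_ prefix below is a Squid convention and may not apply to other products:
awk '$4 ~ /^UDP_/ {n++} END {print n, "ICP queries"}' /var/log/squid/access.log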
TCP ports are another limited resource, especially for busy servers. If you've ever looked at netstat output on a Unix system, you probably remember seeing connections in various states such as ESTABLISHED, FIN_WAIT_1, and TIME_WAIT. Monitoring connection states can provide you with important information. For example, your system may be configured with a relatively small number of ephemeral ports. These are the local port numbers for outgoing connections. If too many of the ephemeral ports are in the TIME_WAIT state, your cache won't be able to open new connections to origin servers. In the rare event that you're the target of a "SYN flood" attack, this would appear as a spike in the number of connections in the SYN_RCVD state. It's also interesting to see how enabling or disabling HTTP persistent connections affects these statistics.
Some caching programs and appliances may not make TCP state information available. If you're running on Unix, you can get it with a simple shell script. The following command should be enough to get you started:
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'