A.6 Popularity

Figure A-4 shows the Zipf-like distribution of object popularity. The X-axis is the popularity rank. The most popular object has a rank of 1, the second most popular object a rank of 2, and so on. The Y-axis is the number of requests for each object. Note that both axes have a logarithmic scale.

Figure A-4. Popularity distributions (IRCache and Anon-U data)
figs/webc_a04.gif

We say the distribution is Zipf-like because it almost follows Zipf's law. This law, named after George Kingsley Zipf, describes things such as the frequency of words in English texts and the populations of cities. It is also useful for characterizing the popularity of web objects. Specifically, the probability of access for the i-th most popular object is proportional to $i^{-a}$. In Zipf's law, the exponent $a$ is close to 1. For web traffic, the exponent is typically between 0.6 and 0.8.
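To make the proportionality concrete, here is a minimal sketch that turns the rule into actual probabilities by normalizing over a finite set of objects. The function name, the object count of 1000, and the exponent of 0.7 are illustrative assumptions, not values taken from the traces.

```python
def zipf_like_probabilities(n_objects, a):
    """Return access probabilities for ranks 1..n_objects, each proportional to rank**-a."""
    weights = [rank ** -a for rank in range(1, n_objects + 1)]
    total = sum(weights)  # normalizing constant so the probabilities sum to 1
    return [w / total for w in weights]

# Hypothetical population of 1000 objects with an exponent in the typical web range.
probs = zipf_like_probabilities(1000, 0.7)
top10_share = sum(probs[:10])  # fraction of requests going to the 10 most popular objects
print(f"top 10 objects receive {top10_share:.1%} of requests")
```

With an exponent below 1, the curve is flatter than classic Zipf, so the most popular objects account for a smaller share of the requests than they would with an exponent of 1.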

This data is derived by counting the number of times each URL occurs in Squid's access.log. Once the counts are in hand, the particular URLs are unimportant. The counts are sorted in descending order and plotted against their rank in the list.
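The counting step can be sketched as follows. This assumes Squid's native access.log format, where the URL is the seventh whitespace-separated field. The function name and the sample log lines are made up for illustration and are not the scripts used to produce the figure.

```python
from collections import Counter

def rank_counts(lines, url_field=6):
    """Count requests per URL, then return (rank, count) pairs, most popular first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > url_field:      # skip short or malformed lines
            counts[fields[url_field]] += 1
    # The URLs themselves are discarded; only the sorted counts matter.
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

# Hypothetical log lines in the native format (assumed field layout).
sample = [
    "981173030.123 45 10.0.0.1 TCP_MISS/200 1234 GET http://example.com/a - DIRECT/192.0.2.1 text/html",
    "981173031.456 12 10.0.0.2 TCP_HIT/200 1234 GET http://example.com/a - NONE/- text/html",
    "981173032.789 80 10.0.0.1 TCP_MISS/200 9876 GET http://example.com/b - DIRECT/192.0.2.1 image/gif",
    "981173033.012 10 10.0.0.3 TCP_HIT/200 1234 GET http://example.com/a - NONE/- text/html",
]
for rank, count in rank_counts(sample):
    print(rank, count)
```

Plotting the resulting pairs with both axes on a logarithmic scale gives a curve of the kind shown in Figure A-4.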

For comparison, I show curves for both the IRCache and Anon-U data sets. The Anon-U plot is below the IRCache plot because it has significantly fewer accesses. The two appear to be similar, except that the Anon-U line slopes down more steeply over the 100 most popular objects.

A.6.1 Size and Popularity

Back in Section 2.4 and Section 12.1.3, I mentioned that byte hit ratios are typically lower than cache hit ratios because small files are more popular than large ones. Here, I back up that assertion by analyzing the IRCache access logs. We'll look at the data in two slightly different ways.

Figure A-5 shows the mean number of requests for objects of different sizes. This is a histogram where the bin number is proportional to the logarithm of the object size. The histogram value is the mean number of requests for all objects in that bin. In other words, it is the number of requests divided by the number of unique objects in each bin. Although there are peaks and valleys, the mean number of requests generally decreases as object size increases. For example, the largest objects were requested about two times on average, while the smallest objects were requested hundreds or even thousands of times.
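A histogram of this kind could be computed along the following lines. This is only a sketch: the choice of base-10 logarithms, four bins per decade, and the input shape (one size and request count per unique object) are assumptions made for the example.

```python
import math
from collections import defaultdict

def mean_requests_by_log_bin(objects, bins_per_decade=4):
    """objects: iterable of (size_bytes, request_count), one entry per unique object.

    Returns {bin_number: mean requests per object}, where the bin number is
    proportional to log10 of the object size.
    """
    requests = defaultdict(int)   # total requests falling in each bin
    uniques = defaultdict(int)    # number of unique objects in each bin
    for size, count in objects:
        if size <= 0:
            continue              # the logarithm is undefined for empty objects
        b = math.floor(math.log10(size) * bins_per_decade)
        requests[b] += count
        uniques[b] += 1
    return {b: requests[b] / uniques[b] for b in sorted(requests)}

# Hypothetical objects: two small popular ones and one large, rarely requested one.
objects = [(150, 10), (120, 30), (2_000_000, 2)]
print(mean_requests_by_log_bin(objects))
```

Dividing by the number of unique objects is what separates this view from a plain request count: a bin holding many unpopular objects does not look popular simply because it is crowded.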

Figure A-5. Mean number of requests versus object size (IRCache data)
figs/webc_a05.gif

Figure A-6 is a little more complicated. First, the Y-axis is the total number of requests, rather than the mean. Second, the bins are constant size. Third, in order to see all of the data, the plot has three views of the data at three different scales. The first trace is with a bin size of 1 byte, the second with 100-byte bins, and the third with 10 KB bins. In other words, the first trace shows file sizes up to 1 KB, the second up to 100 KB, and the third up to 10 MB.
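The three traces can be thought of as the same binning routine run with three bin widths. The sketch below assumes roughly 1000 bins per trace (which matches the stated upper limits), takes 1 KB to mean 1024 bytes, and expects one object size per request. All names are hypothetical.

```python
from collections import Counter

def total_requests_by_bin(request_sizes, bin_size, n_bins=1000):
    """Count requests in constant-width size bins; sizes beyond the last bin are ignored."""
    totals = Counter()
    for size in request_sizes:
        b = int(size // bin_size)
        if 0 <= b < n_bins:
            totals[b] += 1
    return totals

# Bin widths for the three views (assumed: 10 KB = 10 * 1024 bytes).
TRACES = {"1 byte bins": 1, "100-byte bins": 100, "10 KB bins": 10 * 1024}

# Hypothetical object sizes, one entry per request.
request_sizes = [50, 50, 450, 5000, 250000]
for label, width in TRACES.items():
    print(label, dict(total_requests_by_bin(request_sizes, width)))
```

Each trace covers only the sizes that fit in its bins, which is why the finest view stops near 1 KB while the coarsest one reaches about 10 MB.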

Figure A-6. Total number of requests versus object size (IRCache data)
figs/webc_a06.gif

In the "1 byte bin" trace, you can see that popularity increases with object size for objects between 0 and 400 bytes. From 400 to 600 bytes, the curve is relatively flat, but keep in mind that the Y-axis is logarithmic. In the other traces with larger bin sizes, the decreasing trend is quite obvious.

The procedure for generating this data is similar to the one used for reply and object sizes. Unlike the object size distribution, however, I want to count each request, not each object. Unlike the reply size data, I want to count the object size, not the reply size. For this data, I take only GET requests with a status code of 200 or 304 and filter out any Squid-specific requests. For some objects, the size, as logged by Squid, may change significantly over time. To account for this, I use the average size across all requests for a particular URL. The 304 responses do not contribute to the size calculation but are still counted as requests.
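The filtering and averaging rules described above can be sketched roughly as follows. Several details are assumptions made for the example rather than part of the original procedure: log entries are taken to be already parsed into (method, status, size, URL) tuples, "Squid-specific requests" is interpreted as cache manager URLs beginning with cache_object:, and URLs seen only through 304 responses are dropped because no size is known for them.

```python
from collections import defaultdict

def size_and_requests_per_url(entries):
    """entries: iterable of (method, status, size_bytes, url) tuples.

    Returns {url: (average_size, request_count)}.
    """
    size_sum = defaultdict(int)   # sum of logged sizes from 200 responses
    size_n = defaultdict(int)     # number of 200 responses per URL
    requests = defaultdict(int)   # all counted requests (200 and 304) per URL
    for method, status, size, url in entries:
        if method != "GET" or status not in (200, 304):
            continue
        if url.startswith("cache_object:"):   # assumed meaning of "Squid-specific"
            continue
        requests[url] += 1
        if status == 200:                     # 304s count as requests but not toward size
            size_sum[url] += size
            size_n[url] += 1
    # Assumption: a URL with no 200 response has no usable size and is skipped.
    return {u: (size_sum[u] / size_n[u], requests[u]) for u in requests if size_n[u]}

# Hypothetical parsed entries.
entries = [
    ("GET", 200, 1000, "http://example.com/a"),
    ("GET", 200, 3000, "http://example.com/a"),
    ("GET", 304, 250, "http://example.com/a"),
    ("POST", 200, 10, "http://example.com/b"),
    ("GET", 404, 10, "http://example.com/c"),
    ("GET", 200, 5, "cache_object://localhost/info"),
    ("GET", 304, 200, "http://example.com/d"),
]
print(size_and_requests_per_url(entries))
```

Each URL's average size, repeated once per request, is then what gets binned to produce plots like Figure A-5 and Figure A-6.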

Web Caching
ISBN: 156592536X
Year: 2001
Pages: 160
