Hack 24. Bust the Cache for Accuracy
Measurement solutions based on web server logfiles suffer from a variety of factors that decrease their accuracy. Caching devices are the primary culprits but, in some cases, the cache can be beaten and accuracy improved.

Web server logfiles suffer from a handful of accuracy issues, perhaps the most significant arising from caching devices on the Internet. A caching device is any piece of hardware or software designed to store temporary copies of a file, most often to improve delivery performance. There are two types of caching devices that create problems for web server logfiles: client-side caches and server-side caches. Client-side caches are deployed locally in corporate network operation centers and at Internet Service Providers to improve performance. The most extreme example of a client-side cache is the browser cache, software built into your Internet browser that is designed to save local copies of files. Server-side caches are often placed in front of your own web servers to reduce load. (See Web Caching [O'Reilly] for a complete treatise on the subject, or, if you prefer going online, Wikipedia has an excellent entry on the subject at http://en.wikipedia.org/wiki/Web_cache.)

The essential problem with caching is this: when a document is served from a cache, the request never reaches your web server, so it never appears in the web server log. Depending on how many of your pages are cached, the result can be a dramatic undercounting of page views, which then cascades into a number of related problems (gaps in path analysis, misleading calculation of key ratios, etc.).

So what's a web measurement guru to do? One thing you can consider is busting the cache: adding code to your pages that forces caching devices to request the page from your web servers so you're able to see the request.

2.12.1. Bust the Cache Using Document Headers

Through relatively simple modification of your document headers and the use of META tags, you can request that the document not be cached.
Use the HTTP cache-control and pragma directives (HTTP 1.1 and HTTP 1.0, respectively) as follows, remembering to change the expires content from CURRENT DATE AND TIME to the real date and time the page is generated. The complete description of how these headers work can be found at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9 (HTTP 1.1) and http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.32 (HTTP 1.0).

To bust the cache, place the following tags in your document's header section:

    <HEAD>
    <TITLE>Your document's title</TITLE>
    <META HTTP-EQUIV="cache-control" CONTENT="no-cache">
    <META HTTP-EQUIV="pragma" CONTENT="no-cache">
    <META HTTP-EQUIV="expires" CONTENT="CURRENT DATE AND TIME">
    </HEAD>

You may also want to hedge your bets by writing each directive directly to the document header, something easily done if you're using a dynamic page generation platform like ASP, PHP, or JSP. If you're using PHP, simply add the following at the top of each page:

    <?php
    // Headers must be sent before any page output
    header("Expires: 0");
    header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
    header("Cache-Control: no-cache");
    header("Pragma: no-cache");
    ?>

In Microsoft's Active Server Pages (ASP), add the following:

    <%
    Response.Buffer = false
    ' A negative value marks the page as already expired
    Response.Expires = -1
    Response.ExpiresAbsolute = Now() - 2
    Response.AddHeader "pragma","no-cache"
    Response.AddHeader "cache-control","no-cache"
    Response.CacheControl = "no-cache"
    %>

Finally, in Java Server Pages (JSP), use something like this:

    <%
    response.setHeader("pragma","no-cache");
    response.setHeader("cache-control","no-cache");
    response.setDateHeader("expires", 0);
    %>

The best recommendation would be to save your header control code as a small file called cache-control.inc and include that as a server-side include at the top of every page. Using both the document header and META tag strategies described above allows you to hedge your bets.
Since some caching devices and browser types may ignore one directive or the other, doubling up increases your chances of seeing the request.
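The PHP, ASP, and JSP snippets above all send the same small set of headers. As a language-neutral illustration, here is a minimal Python sketch of that header set. Python is used only for illustration, and the function name `cache_busting_headers` is hypothetical; the point is that Expires and Last-Modified carry a real HTTP date rather than the CURRENT DATE AND TIME placeholder.

```python
from email.utils import formatdate


def cache_busting_headers(now=None):
    """Return (name, value) header pairs asking caches not to reuse the page.

    `now` is a Unix timestamp; when omitted, the current time is used.
    """
    # formatdate(usegmt=True) produces an HTTP-style date string,
    # for example "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = formatdate(timeval=now, usegmt=True)
    return [
        ("Cache-Control", "no-cache"),  # HTTP 1.1 directive
        ("Pragma", "no-cache"),         # HTTP 1.0 directive
        ("Expires", stamp),             # expires the moment it is generated
        ("Last-Modified", stamp),       # page was generated just now
    ]


if __name__ == "__main__":
    for name, value in cache_busting_headers():
        print(f"{name}: {value}")
```

Whatever platform generates your pages, the same four headers would be written before any page content is sent.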
2.12.2. How Cache Busting Affects the Visitor Experience

One of the great deliberations for web data analysts relying on web server logfiles is the choice between improved visitor performance and improved data accuracy. Unfortunately, it's a pretty binary issue: you're either for performance or accuracy. While the most widely used argument is that doing anything that compromises the user experience should be avoided, the counterargument is that unless you have an accurate picture of what visitors are doing, you cannot hope to improve usability.

You can mitigate some of the problems associated with cache busting by delivering other documents as quickly and efficiently as possible. Here are some suggestions for doing so.

2.12.3. "Unbusting" the Cache for Images and Scripts

Because your primary concern is the measurement of page views and not the successful delivery of images and other document objects, one strategy for optimizing page delivery is "unbusting" the cache for non-content objects. There are three moderately simple things you can do to make this happen.

2.12.3.1 Deploy an "images never expire" policy.

The idea behind "images never expire" is that, for the most part, once images, multimedia files, and PDF documents are created, they rarely change. You can use Apache's <FilesMatch> directive, together with the Header directive from mod_headers, to set the expiration date for images well out into the future: [4]
    # Works with HTTP/1.1 only
    <FilesMatch "\.(gif|jpe?g|png|pdf|wav|rm)$">
        Header set Cache-Control "max-age=315360000"
    </FilesMatch>

    # Works with both HTTP/1.0 and HTTP/1.1
    # (substitute a date well in the future)
    <FilesMatch "\.(gif|jpe?g|png|pdf|wav|rm)$">
        Header set Expires "Mon, 28 Jul 2014 23:30:00 GMT"
    </FilesMatch>

Make sure that the regular expression list (gif|jpe?g|png|pdf|wav|rm) contains the file extensions of the images and multimedia files on your servers, each separated by a pipe character (|). (The max-age=315360000 is 10 years measured in seconds, just in case you were wondering.)
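To sanity-check the numbers in the configuration above, the following Python sketch recomputes the ten-year max-age value, derives a matching Expires date, and tests which filenames the extension pattern would match. Python is used only for illustration, and the helper names `far_future_expires` and `never_expires` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

SECONDS_PER_DAY = 24 * 60 * 60

# Ten 365-day years in seconds (leap days ignored), the max-age value.
TEN_YEARS = 10 * 365 * SECONDS_PER_DAY

# Same extension list as the FilesMatch pattern (case-sensitive).
STATIC_FILE = re.compile(r"\.(gif|jpe?g|png|pdf|wav|rm)$")


def far_future_expires(start, seconds=TEN_YEARS):
    """Return an HTTP date string `seconds` after the UTC datetime `start`."""
    return format_datetime(start + timedelta(seconds=seconds), usegmt=True)


def never_expires(filename):
    """True if `filename` ends in one of the long-lived extensions."""
    return STATIC_FILE.search(filename) is not None


if __name__ == "__main__":
    print(TEN_YEARS)  # 315360000
    print(far_future_expires(datetime.now(timezone.utc)))
    for name in ("logo.png", "brochure.pdf", "site.css", "menu.js"):
        print(name, never_expires(name))
```

Running a handful of your own filenames through the pattern before deploying it is a cheap way to confirm that stylesheets and scripts are left out of the policy.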
Now your visitors will be required to download the images on your site only once, which is very handy for images that are used frequently throughout the site (navigation elements, logos, bullets and buttons, etc.).

2.12.3.2 Use caching defaults for occasionally changing content.

While your images and PDFs are unlikely to change, the same cannot be said for CSS and JavaScript files. You may be tempted to add the css and js file extensions to the <FilesMatch> directive, but fight the urge. Although uninteresting from a measurement standpoint, these files are often necessary for rendering your pages. Practically speaking, these files do change, and forcing your web developers to rename their code every time they make even a minor update will incur their ire, perhaps even their direct and immediate wrath.

2.12.3.3 Consider a content distribution network (CDN) for images and code.

If you have some money to spend, one alternative to the "images never expire" policy is a content distribution network (CDN). The idea behind a CDN is that the closer you can get a large object to users, the faster they'll get those files. A CDN acts as a proxy server for the static content you're unconcerned about measuring, and it often lets you manually control that content's expiration dates, simplifying the refresh process. While the details of content distribution networks are outside of the scope of this book, I would refer you to two vendors well known for their CDN platforms: Akamai (www.akamai.com) and Speedera (www.speedera.com). A more complete list of caching device and network vendors can be found at http://www.caching.com/vendors/index.htm.

2.12.4. The Obvious Alternative to Cache Busting

Something to consider if this hack was a bit overwhelming is the fact that the JavaScript page tag data source [Hack #6] completely sidesteps the issue of caching.
Page tags are cleverly designed to report back regardless of where the document that contains them was delivered from, nullifying the caching effect. Moreover, page tags even beat the browser cache, because they appear to be a completely new request every time the code is executed (based on random number generating functions and other cache-beating technology baked into the tags). Perhaps best of all, any externally housed code required for page tags [Hack #28] can often be placed on a content distribution network to reduce the latency associated with that file, again improving performance for your visitors.
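The random-number trick mentioned above can be illustrated with a short sketch. Real page tags are JavaScript running in the visitor's browser; the Python below only mimics the URL-building step, and the collector address, the parameter names, and the function name `tag_request_url` are all hypothetical. Because each generated URL is different, no cache along the way holds a stored copy of it, so the request has to travel to the data collection server.

```python
import random
from urllib.parse import urlencode

# Hypothetical collection endpoint; real page tag vendors use their own.
COLLECTOR = "http://stats.example.com/tag.gif"


def tag_request_url(page, rng=random):
    """Build a page tag image request that differs on every call."""
    params = {
        "page": page,                  # the document being viewed
        "rnd": rng.randrange(10**12),  # random value makes the URL unique
    }
    return f"{COLLECTOR}?{urlencode(params)}"


if __name__ == "__main__":
    # Two views of the same page yield two different request URLs.
    print(tag_request_url("/index.html"))
    print(tag_request_url("/index.html"))
```

The design choice is simply that the varying query string makes every request look new to browsers and proxies, while the data collection server can ignore the random value.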