Hack 21. Understand Where Data Gets Lost
In web measurement, there are a number of ways that data can be lost. By understanding these sources of loss, you can work to minimize their likelihood. Losing valuable data can taint your web measurement efforts. The more data that is lost, the more inaccurate the web measurement information you produce will be. Data loss generally occurs as a result of one of the following scenarios:
Data loss also depends on the data collection method that you use. Each data collection method used in the field of web measurement is associated with specific potential causes of data loss.

2.9.1. Data Collection Issues Common to Page Tags
Client-side ("page tagging") data collection [Hack #6] methods have become a common way of acquiring web measurement data. The following points describe some of the additional ways data can be lost when a client-side data collection method is used.

2.9.1.1 A page tag did not get placed in the page.
The most difficult aspect of this threat is that if a page is not tagged due to some mistake, there is no indication that data is lost. If you rely completely on a page tagging-based method for collecting data throughout your site, then you must build and maintain a process for ensuring that a page tag gets placed in every page before that page goes live.

2.9.1.2 The page tag doesn't work as intended.
Page tags are generally implemented using JavaScript. Over time, there is a danger that the JavaScript in the page tag will no longer function as originally intended as modifications are made throughout the site. In addition, many types of errors may be made in JavaScript that cause a page tag to not collect data, such as forgetting to correctly close a statement in the JavaScript code. For this reason, each page that is page tagged must be tested to make sure that the page tag functions correctly.

2.9.1.3 Page tagging is your only data collection method.
When page tagging is the only data collection method, pages are tagged to the exclusion of all other types of content accessible through HTTP on the Internet, and you are measuring only pages being loaded by web browsers. If there are other types of measurement that are important to your company (such as media downloads, server error responses, software downloads, or image impressions), then you should consider a log-based data collection method to collect those measurements.
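To get a rough sense of how much traffic a tag-only setup never sees, you can count the non-page requests in a web server logfile. The following is a minimal sketch, not a definitive implementation: it assumes the log is in Common Log Format, and the extension groups in CATEGORIES and the function names classify and count_requests are hypothetical choices made for illustration. Adjust them to match your own site's content.

```python
import re
from collections import Counter

# Assumption: Common Log Format, e.g.
# host ident user [time] "METHOD path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical extension groups; these are illustrative, not exhaustive.
CATEGORIES = {
    "image": (".gif", ".jpg", ".jpeg", ".png"),
    "media": (".mp3", ".mov", ".wmv", ".avi"),
    "download": (".zip", ".exe", ".pdf", ".dmg"),
}


def classify(path, status):
    """Label one request as 'error', 'image', 'media', 'download', or 'page'."""
    if status >= 400:
        # Error responses usually carry no page tag, so only the log sees them.
        return "error"
    clean = path.split("?", 1)[0].lower()  # ignore any query string
    for name, extensions in CATEGORIES.items():
        if clean.endswith(extensions):
            return name
    return "page"


def count_requests(lines):
    """Count requests per category, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        counts[classify(match.group("path"), int(match.group("status")))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '10.0.0.1 - - [10/Oct/2005:13:55:37 -0700] "GET /img/logo.gif HTTP/1.0" 200 512',
        '10.0.0.3 - - [10/Oct/2005:13:57:12 -0700] "GET /missing.html HTTP/1.0" 404 209',
    ]
    for category, total in sorted(count_requests(sample).items()):
        print(category, total)
```

Everything counted outside the "page" category is traffic that a page tag alone would typically not report, which is a reasonable starting point for deciding whether a log-based method is worth adding.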

2.9.1.4 The visitor's browser has JavaScript turned off.
Because page tags rely heavily on JavaScript and the document object model [Hack #30], if a web visitor has JavaScript disabled, data collection is often minimized or disabled as well. Most tag-based solutions support a <NOSCRIPT> tag as a fallback to ensure minimal data collection. Fortunately, browsing the Internet with JavaScript disabled is a painful process, and so few people actually do it.

2.9.1.5 The request was blocked by security software.
An increasing number of software products that block third-party requests are being used with Internet browsers. Popular HTML email clients already allow the blocking of third-party and other requests when the page is loaded by the email client's browser software. To decrease the number of times that page tag requests are blocked by such software, have your page tags make their requests to your own domain on a first-party basis instead of to third-party domains [Hack #16].

2.9.1.6 The information is never received by the data collector.
Data is often lost when a page tag is implemented correctly but the HTTP request is never received or recorded. This problem may be caused by a number of factors, one of which may be the DNS "black holing" of certain data collection service provider domains. In other cases, the network or network equipment may fail and cause this result. To avoid this problem, use first-party tracking services and regularly verify the capacity of their network infrastructure.

2.9.2. Data Collection Issues Common to Web Server Logfiles
Log-based data collection [Hack #6] methods are the mainstay of web measurement data collection on the Internet. Web and application servers by default collect data about all of the HTTP requests made to them. The following points describe some of the ways in which data can be lost when log-based data collection is used.

2.9.2.1 The browser serves the request from the local cache.
Most web browsers cache web page HTML and the embedded objects called by that HTML. If the page HTML is cached by a browser, then no request is made to the server to record the page request. This can occur on page reloads and on back button or forward button actions. This kind of problem can be overcome by some additional work on your part [Hack #24].

2.9.2.2 The content is served from a content distribution network (CDN).
Data may also be lost when a request made by a browser is served from the cache of a content distribution network (CDN), such as those provided by Akamai, Speedera, or AT&T. Since the content is served from an external server, no request will be made to the original web server, and no request will be logged. This type of data loss may be prevented by inserting cache control headers into your HTML pages or setting the cache control headers on your web servers as noted above. As an alternative, a site operator may also set the CDN configuration for its site to forbid the CDN to cache HTML pages but allow it to cache images and other content. If you use a CDN, consult your provider directly about concerns related to web data measurement.

2.9.3. Knowing All This, What Should You Do?
At the end of the day, knowing where the data can be lost is the first step in ensuring that it isn't. Specifically, make sure you have answers to the following questions before signing any contract:
Finally, ask yourself these questions and review these issues with your web measurement provider before signing any contract to make sure you know how they deal with data collection and what guarantees you have that the problems described herein won't occur.

Jim MacIntyre and Eric T. Peterson