Hack 21. Understand Where Data Gets Lost | Web Site Measurement Hacks: Tips & Tools to Help Optimize Your Online Business

Hack 21. Understand Where Data Gets Lost

In web measurement, there are a number of ways that data can be lost. By understanding these sources of loss, you can work to minimize their likelihood.

Losing valuable data can taint your web measurement efforts. The more data that is lost, the more inaccurate the web measurement information you produce will be. Data loss generally occurs as a result of one of the following scenarios:

Measurements are not taken: The web data that you collect begins with the measurements that you take to record the activity on your site. If a measurement is not taken, usually you have lost that data forever. For this reason, it is important that you carefully plan your measurement strategy in advance.
A vendor loses your data (and won't recover it): Many companies rely on third-party vendors to collect data for them. Because of the volume of this data and the pressure to reduce costs and stay competitive, vendors often extract parts of your originally collected data for their databases and throw away the rest. In still other cases, vendors mix your data with the data of other customers, making the future extraction of your data time-consuming and expensive.
Your data is incorrectly indexed, misplaced, or destroyed: Whether your data is collected in your facilities or at the facilities of a vendor, it must be reliably indexed and stored. While you may have a reliable contact with your web measurement vendor, at the end of the day, it is always your responsibility to ensure that your data is reliably protected.

Data loss also depends on the data collection method that you use. Each data collection method used in the field of web measurement is associated with specific potential causes of data loss.

2.9.1. Data Collection Issues Common to Page Tags

Client-side ("page tagging") data collection [Hack #6] methods have become a common way of acquiring web measurement data. The following points describe some of the additional ways data can be lost when a client-side data collection method is used.

2.9.1.1 A page tag did not get placed in the page.

The most difficult aspect of this threat is that if a page is not tagged due to some mistake, there is no indication that data is lost. If you rely completely on a page tagging-based method for collecting data throughout your site, then you must build and maintain a process for ensuring that a page tag gets placed in every page before that page goes live.
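
One possible piece of such a process is an automated audit that runs before pages go live. The sketch below is only an illustration: it assumes the pages exist as static .html files on disk and that the page tag is loaded through a script element whose URL contains a known marker. The marker "/metrics/tag.js" and the function names are hypothetical, so substitute whatever your own tag actually uses.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List

# Hypothetical marker: a substring of the script URL your page tag loads.
TAG_SRC_MARKER = "/metrics/tag.js"


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> element in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def has_page_tag(html: str, marker: str = TAG_SRC_MARKER) -> bool:
    """Return True if any <script src=...> in the page contains the marker."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    collector.close()
    return any(marker in src for src in collector.sources)


def untagged_pages(root: str, marker: str = TAG_SRC_MARKER) -> List[str]:
    """List every .html file under root that lacks the page tag."""
    missing = []
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        if not has_page_tag(html, marker):
            missing.append(str(path))
    return missing
```

Calling untagged_pages on your document root returns the files that would go live without a tag. A check like this only confirms that the tag is present, not that it works, and it does not see pages generated dynamically at request time.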

2.9.1.2 The page tag doesn't work as intended.

Page tags are generally implemented using JavaScript. Over time, there is a danger that the JavaScript in the page tag will no longer function as originally intended as modifications are made throughout the site. In addition, many types of errors may be made in JavaScript that cause a page tag to not collect data, such as forgetting to correctly close a statement in the JavaScript code. For this reason, each page that is page tagged must be tested to make sure that the page tag functions correctly.

2.9.1.3 Page tagging is your only data collection method.

When page tagging is the only data collection method, pages are tagged to the exclusion of all other types of content accessible through HTTP on the Internet, and you are measuring only pages being loaded by web browsers. If there are other types of measurement that are important to your company (such as media downloads, server error responses, software downloads, or image impressions), then you should consider a log-based data collection method to collect those measurements.
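
To get a feel for how much of this activity a tag-only approach misses, you could count the non-page requests already recorded in your web server logfiles. The following is a rough sketch, assuming the logs use the Common Log Format (or the combined format, which begins the same way). The extension lists and category names are illustrative assumptions, not a standard.

```python
import posixpath
import re
from collections import Counter
from typing import Iterable

# Matches the start of a Common Log Format line; trailing fields are ignored.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+'
)

# Illustrative extension lists; adjust them to the content your site serves.
IMAGE_EXT = {".gif", ".jpg", ".jpeg", ".png"}
MEDIA_EXT = {".mp3", ".wav", ".mov", ".avi", ".wmv"}
SOFTWARE_EXT = {".exe", ".zip", ".msi", ".dmg"}


def classify(line: str) -> str:
    """Assign one log line to a coarse request category."""
    match = LOG_LINE.match(line)
    if match is None:
        return "unparsed"
    if int(match.group("status")) >= 400:
        return "error"
    path = match.group("path").split("?", 1)[0]  # drop any query string
    ext = posixpath.splitext(path)[1].lower()
    if ext in IMAGE_EXT:
        return "image"
    if ext in MEDIA_EXT:
        return "media"
    if ext in SOFTWARE_EXT:
        return "software"
    return "page"


def count_requests(lines: Iterable[str]) -> Counter:
    """Count log lines per category."""
    return Counter(classify(line) for line in lines)
```

Everything counted outside the "page" category is activity that a page tag alone would typically never report.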

2.9.1.4 The visitor's browser has JavaScript turned off.

Because page tags rely heavily on JavaScript and the document object model [Hack #30], if a web visitor has JavaScript disabled, data collection is often minimized or disabled as well. Most tag-based solutions support a <NOSCRIPT> tag as a fallback to ensure minimal data collection. Fortunately, browsing the Internet with JavaScript disabled is a painful process, and so few people actually do it.

2.9.1.5 The request was blocked by security software.

A growing number of software products used alongside web browsers block third-party requests, including the requests made by page tags. Popular HTML email clients already allow the blocking of third-party and other requests when the page is loaded by the email client's browser software. To decrease the number of times that page tag requests are blocked by such software, have your page tags make their requests to your own domain on a first-party basis instead of to third-party domains [Hack #16].
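
As a quick sanity check, you can compare the host that serves a page with the host its tag sends data to. The sketch below uses a deliberately naive rule: two hosts count as the same party when their last two DNS labels match. That rule is an assumption made for illustration and gives wrong answers for suffixes such as co.uk, and the function names and URLs are hypothetical.

```python
from urllib.parse import urlsplit


def site_suffix(host: str, labels: int = 2) -> str:
    """Return the last `labels` DNS labels of a host name (naive heuristic)."""
    return ".".join(host.lower().rstrip(".").split(".")[-labels:])


def is_first_party(page_url: str, collector_url: str) -> bool:
    """Return True if the collector appears to share the page's domain."""
    page_host = urlsplit(page_url).hostname or ""
    collector_host = urlsplit(collector_url).hostname or ""
    if not page_host or not collector_host:
        return False
    return site_suffix(page_host) == site_suffix(collector_host)
```

For example, a page on www.example.com reporting to metrics.example.com passes this check, while the same page reporting to a collector on example.net does not.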

2.9.1.6 The information is never received by the data collector.

Data is often lost when a page tag is implemented correctly but the HTTP request is never received or recorded. This problem may be caused by a number of factors, one of which may be the DNS "black holing" of certain data collection service provider domains. In other cases, network or network equipment may fail and cause this result. To avoid this problem, use first-party tracking services and regularly verify the capacity of their network infrastructure.
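
One symptom that can be checked from a test machine is a collector hostname that does not resolve, or that resolves to a loopback or unspecified address, which is one common form of DNS black holing. The sketch below illustrates that single check under those assumptions. The function name is hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable


def looks_black_holed(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname fails to resolve or resolves to a dead-end address."""
    try:
        address = ipaddress.ip_address(resolve(hostname))
    except (OSError, ValueError):
        # Resolution failed (socket.gaierror is an OSError) or gave a bad address.
        return True
    return address.is_loopback or address.is_unspecified
```

A result of False only means the name resolves to a routable-looking address from where the check ran. It says nothing about what a visitor's own resolver returns or whether the collector actually recorded the request.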

2.9.2. Data Collection Issues Common to Web Server Logfiles

Log-based data collection [Hack #6] methods are the mainstay of web measurement data collection on the Internet. Web and application servers by default collect data about all of the HTTP requests made to them. The following points describe some of the ways in which data can be lost when log-based data collection is used.

2.9.2.1 The browser serves the request from the local cache.

Most web browsers cache web page HTML and the embedded objects called by that HTML. If the page HTML is cached by a browser, then no request is made to the server to record the page request. This can occur on page reloads and on back button or forward button actions. This kind of problem can be overcome by some additional work on your part [Hack #24].

2.9.2.2 The content is served from a content distribution network (CDN).

Data may also be lost when a request made by a browser is served from the cache of a content distribution network (CDN), such as those provided by Akamai, Speedera, or AT&T. Since it's served from an external server, no request will be made to the original web server, and no request will be logged.

This type of data loss may be prevented by inserting cache control headers into your HTML pages or setting the cache control headers on your web servers as noted above. As an alternative, a site operator may also set the CDN configuration for its site to forbid the CDN to cache HTML pages but allow it to cache images and other content. If you use a CDN, consult your provider directly about concerns related to web data measurement.
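
The split described above, with HTML pages that caches must not reuse without checking and images that may be cached freely, can be sketched as a small function that picks response headers by content type. This is only an illustration: the function name, the one-day lifetime, and the choice to treat paths of unknown type as pages are all assumptions, and the way a given CDN honors these headers should be confirmed with the provider.

```python
import mimetypes
from typing import Dict


def cache_headers_for(path: str) -> Dict[str, str]:
    """Choose cache headers so page requests keep reaching the origin server."""
    content_type, _encoding = mimetypes.guess_type(path)
    if content_type is None or content_type == "text/html":
        # "no-cache" lets a cache store the page but forces it to revalidate
        # with the origin before reuse; Pragma covers HTTP/1.0 caches.
        return {"Cache-Control": "no-cache", "Pragma": "no-cache"}
    # Images and other static content may be cached (illustrative lifetime).
    return {"Cache-Control": "public, max-age=86400"}
```

With headers like these, a compliant cache contacts the origin server for each page view, so the request can be logged, while images continue to be served from the cache.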

2.9.3. Knowing All This, What Should You Do?

At the end of the day, knowing where the data can be lost is the first step in ensuring that it isn't. Specifically, make sure you have answers to the following questions before signing any contract:

How will your data be collected? Will you use page tags, logfiles, or a combination of the two? Knowing how the data will be collected will let you better understand where the data could be lost.
Who will take day-to-day responsibility for data storage? If you're outsourcing, this is likely the vendor, but you want to make sure they're not outsourcing that responsibility to yet another party, thus increasing your risk. If you're planning to maintain the data in-house, who will be doing that? Make sure you know which group or person in your company will be on the hook if problems occur.
How is the data being stored and maintained? Regardless of where the data is stored, ask about the hardware your data will be stored on and the backup/rotation plan. If your data is "unrecoverable" in case of a catastrophe, you want to know that, allowing proper expectations to be set. Ask the seemingly stupid question, "What if a bomb goes off?" and see if the answer makes you comfortable enough.
What kind of cookies are used to collect the data? Cookies, while tremendously important to web measurement, are not without risks. You should always be using first-party cookies [Hack #16] to increase your chances of collecting the data in the first place.
Who is responsible for placing page tags on your pages? Since the most common problem with tag-based data collection is simply "the tag was broken," make sure you know who is on the hook if (some would say "when") this happens. Understand your strategy for deploying and testing page tags, looking for weak spots like, "Well, we'll probably just have whoever is available place the code."
Does your organization use a content delivery network? If you're using a log-based solution, make sure you know if the data is likely to make it back to you or if you need to be thinking about busting the cache to increase accuracy [Hack #24].

Finally, ask yourself these questions and review these issues with your web measurement provider before signing any contract to make sure you know how they deal with data collection and what guarantees you have that the problems described herein won't occur.

Jim MacIntyre and Eric T. Peterson