Hack 6. Understand Common Data Sources
Before you start analyzing, determine where your data will come from. In the early days of web site measurement, there was only a single source of data: web server logfiles [Hack #22]. Generated automatically by web servers like Apache and Microsoft Internet Information Server, these flat text files were simply a record of which IP addresses were requesting which objects. At some point, a smart and enterprising soul realized that these files could be parsed and that the results could tell you roughly what people were doing on your web site. Some say, "In the beginning there were logfiles, and they were good...." Unfortunately, web server logfiles weren't good enough.

1.7.1. Problems with Web Server Logfiles

People began to see problems creeping into their log-based analysis (missing information, and requests that could not possibly be coming from a human being), and gradually programmers realized that something better would be needed. Problems arose from the emergence of forward caching devices, the addition of page caching in browsers, and the explosion of nonhuman user agents attempting to catalog the rapidly expanding Internet. As more and more people came online, caching devices were needed to improve the overall browsing experience; while they helped the average surfer connecting over a 28K modem, they dramatically reduced the accuracy of log-parsing applications.

1.7.2. Enter the Packet Sniffers

Packet sniffing, also referred to as the network data collection model [Hack #8], puts a listening device on a major node in the web delivery architecture, as shown in Figure 1-6. The listening device passively logs requests for resources, essentially sitting in front of your web server farm. While sniffing provided some advantages (centralized data collection, more detail about failed or cancelled requests, and improved accuracy under server overload conditions), few applications ever supported the model, and it never really took off.

Figure 1-6.
Network data collectors in an archetypical web server farm

1.7.3. JavaScript Page Tags Are Born

About the same time the packet sniffers were starting to struggle, a handful of companies realized that by embedding a small JavaScript file into web pages, you could collect a great deal of useful information about the page being viewed and the visitor doing the viewing. The JavaScript dynamically generates an image request to an external server, appending all of the collected information to the query string, and thus the client-side page tag was born [Hack #28]. Once initial security and privacy concerns were worked out, JavaScript page tags quickly took off. Companies eagerly bought into the use of tags to report traffic data in real time, and others were delighted by the relative simplicity of the hosted application model. Eventually, though, companies began to discover the limits of their utility as well. A major limitation of page tags is that if a tag is not on a page, no data is collected for that page, making deployment mistakes quite costly from a data collection perspective. Some companies also balk at the seemingly ever-expanding size of the tags, which occasionally exceeds 20 KB!

1.7.4. Evaluating Data Sources

Given that each data source has its benefits and limitations, you might be asking yourself, "How should I decide which source is best for my business?" For the most part, the decision comes down to a handful of needs and limitations. In general, a "yes" answer to any of the criteria presented in Table 1-2 will point to the "best" data source to use.
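Before weighing those criteria, it may help to see the page-tag mechanism concretely. The sketch below shows the core of what a tag does: encode details about the page and visitor into the query string of an image request. The host name, image path, and parameter names are invented for illustration; every vendor defines its own.

```javascript
// Minimal sketch of a client-side page tag's core job: take a set of
// collected values and encode them into the query string of a 1x1
// image request aimed at the vendor's collection server.
// "stats.example.com" and the parameter names are hypothetical.
function buildBeaconUrl(info) {
  // encodeURIComponent guards against spaces, slashes, ampersands, etc.
  var query = Object.keys(info)
    .map(function (key) {
      return key + "=" + encodeURIComponent(info[key]);
    })
    .join("&");
  return "http://stats.example.com/collect.gif?" + query;
}

var url = buildBeaconUrl({
  page: "/products/widgets.html",
  ref: "http://www.google.com/search?q=widgets",
  res: "1024x768",
});
```

In a real tag, the script would read values such as `document.title`, `document.referrer`, and the screen dimensions, then fire the request by assigning the URL to a new `Image` object's `src` property; the "response" (a tiny transparent GIF) is thrown away, because the data collection happens entirely in the request.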
While there are no absolutes, it is important to decide which source you prefer before you start contacting software or service providers. There are also often differences in the pricing models associated with each data source: page tags are most commonly associated with hosted applications, which are typically paid for on an ongoing basis, while web server logfiles and packet sniffers are associated with software-based solutions, which are typically paid for up front at the beginning of the relationship. Further complicating your decision-making process is the recent emergence of vendors offering hybrid solutions, usually combining page tags and logfiles to create a more holistic view of your data. Vendors like Visual Sciences, Urchin, and Sane Solutions now offer the ability to blend data sources in new and interesting ways. While a hybrid vendor may not be the best solution for you, if you have experience with both logs and page tags and are looking for an alternative, these vendors are worth a look.
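Whichever source you settle on, it helps to understand what the tools are doing under the hood. As a minimal sketch of the logfile parsing described in "Problems with Web Server Logfiles," here is one way to pull the basic fields out of a single line in Apache's Common Log Format; the function name and the field names in the returned object are our own.

```javascript
// One line of Apache Common Log Format looks like:
//   192.168.1.10 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
// The regular expression below captures the IP address, timestamp,
// request method and path, HTTP status, and bytes sent.
var COMMON_LOG = /^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+) (\S+) [^"]*" (\d{3}) (\d+|-)/;

function parseLogLine(line) {
  var m = COMMON_LOG.exec(line);
  if (!m) return null; // malformed or unexpected line, skip it
  return {
    ip: m[1],
    timestamp: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    bytes: m[6] === "-" ? 0 : Number(m[6]), // "-" means nothing was sent
  };
}

var hit = parseLogLine(
  '192.168.1.10 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
);
// hit.ip is "192.168.1.10", hit.path is "/index.html", hit.status is 200
```

Early log analyzers were essentially this loop run over millions of lines, aggregating hits by page, IP address, and time; every limitation described above (caching, nonhuman agents) shows up as lines this loop either never sees or should not count.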