Hack 22. Deconstruct Web Server Logfiles

The history of web site measurement is, for the most part, the history of web server logfiles. Understanding the data logfiles provide and their limitations will help you better plan for their use.

Web measurement got its start over 10 years ago with simple log analysis tools. These early tools did little more than scan the logfiles produced by web servers to count hits and visits, report on server errors and page load times, and process other data pertinent to early site administrators.

2.10.1. Anatomy of a Web Server Logfile

Generally speaking, each entry in the logfile will contain the IP address of the requesting client, the requested URL, the number of bytes transferred to the client, the date/time of the request, the URL from which the request was made (also called the referring URL [Hack #1]), and much more. The log will contain not only each explicitly requested page (commonly a file with an extension of HTM, HTML, ASP, or JSP), but also each image (e.g., GIF and JPG), JavaScript file (JS), and other object needed to complete the loading of the page. Not surprisingly, logfiles can get excessively large [Hack #19].
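
Because every image and script produces its own log entry, a raw count of entries (hits) overstates the number of pages viewed. The following Python sketch separates the two by file extension. The extension list comes from the paragraph above; the sample paths and the decision to treat extensionless paths such as / as pages are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit
import posixpath

# Extensions treated as explicitly requested pages (taken from the text above).
PAGE_EXTENSIONS = {".htm", ".html", ".asp", ".jsp"}

def classify(path):
    """Label a requested URL path as 'page' or 'object' from its file extension."""
    # Drop any query string before looking at the extension.
    ext = posixpath.splitext(urlsplit(path).path)[1].lower()
    # Assumption: a path with no extension (e.g. "/") resolves to a default page.
    return "page" if ext in PAGE_EXTENSIONS or ext == "" else "object"

# Hypothetical requested paths, for illustration only.
paths = ["/index.htm", "/images/logo.gif", "/scripts/menu.js",
         "/products.asp?id=7", "/"]
counts = Counter(classify(p) for p in paths)
print("page views:", counts["page"], "other hits:", counts["object"])
```

A real analyzer would likely make the extension list configurable, since sites differ in which file types count as pages.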

Using the following sample line from the author's web server logfile, let's step through the fields captured in the combined log format (see below for more formats).

 216.219.177.29 - elvis [15/May/2000:23:03:36 -0800] "GET /index.htm HTTP/1.0" 200 956 "http://www.webanalyticsdemystified.com/index.asp" "Mozilla/2.0 (compatible; MSIE4.0; SK; Windows 98)"

Each element tells us something about the visitor or application making the request.

Remote host (remotehost)

The 216.219.177.29 entry in the logfile tells us the remote hostname or IP address for the requestor. Sometimes the entry is resolved to a domain name (e.g., webtrends.com), depending on how your web servers are set up.

Authentication server (RFC931)

The second element of the logfile entry is called RFC931, or the authentication server (a hyphen in this example, meaning no value was recorded). This poorly understood and little-used element provides us some insight into the extremely technical history of web server logfiles. For more information, see the FAQs.org document on RFC931 (www.faqs.org/rfcs/rfc931.html).

Authenticated username (auth-username)

If the visitor has been required to log into your site, as is extremely common on intranets and some extranets [Hack #33], this entry will contain the username portion of their login (elvis, in this example). When this entry is available, a variety of additional tracking options are available to you, including individual identification of activities on the site.

Date and time (timestamp)

The date and time the request was made as recorded by your web server ([15/May/2000:23:03:36 -0800], in this example). The -0800 is the offset from Greenwich Mean Time (GMT).

Requested information (request-line)

Arguably the most important component in the log entry, the "request" is the name and location of the actual object being requested by the visitor to your web site, including the method and HTTP protocol of the request. In this example, the visitor is asking for the web site's home page (index.htm):

GET /index.htm HTTP/1.0

You can ignore the GET and HTTP/1.0 information and instead focus on the /index.htm component; that's what your log analyzer will do.

HTTP status (response-code)

The numeric HTTP status code returned with the document (200, in this example), letting the requestor know whether the document is available or whether some error has been generated. A fuller overview of status codes is available in [Hack #34] or online at www.w3.org/Protocols/HTTP/HTRESP.html.

Content length returned (response-size)

The length in bytes of the content that was returned to the requestor (956, in this example). This data point becomes especially interesting if your visitors are complaining about your web pages (the download is not complete, causing rendering issues in the visitor's browser) or if one of your business goals is to have visitors download a document (PDF, EXE, etc.). In the latter case, content length can be compared against the known file sizes to find incomplete downloads (where the number of bytes delivered differs from the known file size).

Referring URL (referrer)

The referring URL is the exact URL that contained the link to the requested document (http://www.webanalyticsdemystified.com/index.asp, in this example). Arguably the second most important element in a web server logfile, the referrer is present only when a link has physically been clicked (and not always even in that case, due to a number of technologies that for whatever reason remove the referrer from the HTTP request). The referring URL is usually available only in combined or extended log formats (see below).

User Agent (user-agent)

The user agent is the description of the application making the request, such as a web browser, robot, or spider (Mozilla/2.0 (compatible; MSIE4.0; SK; Windows 98), in this example). User agents can help you understand your visitors' distribution of browser usage [Hack #71], but the recent proliferation of user agent strings has made deeper analysis nearly impossible.

For more information on the details of web server logfiles, see the W3C document on logging control (www.w3.org/Daemon/User/Config/Logging.html) or Chapter 21 of HTTP: The Definitive Guide (O'Reilly).
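
To make the field walkthrough concrete, here is a minimal Python sketch that splits one combined-format line into the elements described above. The regular expression, the group names, and the parse_combined function are illustrative assumptions, not part of any particular log analyzer; the pattern also assumes that quoted fields contain no embedded quotation marks, which real user agent strings occasionally violate.

```python
import re
from datetime import datetime

# One combined-format entry; group names mirror the field labels used above.
COMBINED = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<auth_username>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request_line>[^"]*)" '
    r'(?P<response_code>\d{3}) (?P<response_size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    m = COMBINED.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    # The request line is "METHOD path PROTOCOL"; malformed requests may have fewer parts.
    parts = entry["request_line"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else None
    # %z reads numeric offsets such as -0800; %b expects English month names
    # under the default C locale.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    entry["response_code"] = int(entry["response_code"])
    # Servers write "-" when no content was returned.
    size = entry["response_size"]
    entry["response_size"] = 0 if size == "-" else int(size)
    return entry

sample = ('216.219.177.29 - elvis [15/May/2000:23:03:36 -0800] '
          '"GET /index.htm HTTP/1.0" 200 956 '
          '"http://www.webanalyticsdemystified.com/index.asp" '
          '"Mozilla/2.0 (compatible; MSIE4.0; SK; Windows 98)"')
entry = parse_combined(sample)
print(entry["remotehost"], entry["auth_username"], entry["path"],
      entry["response_code"], entry["response_size"])
```

Returning None for lines that do not match lets a caller count and inspect malformed entries rather than stop on the first one.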

Requests from clients are interspersed throughout the logfile in the order they are received. The job of the logfile analysis tool is to parse this large file and stitch together the individual visits from their respective clients. This can be a difficult problem if the only means of identifying the client is the IP address, as IP addresses are apt to change for a given client (especially when the IP address is dynamically assigned by a corporate DHCP server or commercial ISP service, such as AOL [Hack #78]).

Fortunately, vastly superior visitor tracking methods are available to site designers, including session parameters [Hack #17] and cookies [Hack #15]. Use of a session parameter or cookie to identify visitors will dramatically improve the accuracy of your analysis results.
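
The stitching step can be sketched in a few lines of Python. The stitch_visits function below is a hypothetical illustration, not how any specific analysis tool works: it groups requests by a visitor key (an IP address, or a cookie or session value where one is logged) and starts a new visit after a period of inactivity. The 30-minute timeout and the sample requests are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def stitch_visits(requests, timeout=timedelta(minutes=30)):
    """Group (visitor_key, timestamp, path) tuples into visits.

    A new visit starts when a key is seen for the first time or has been
    idle for longer than `timeout`.
    """
    visits = []        # every visit found, in order of first request
    latest = {}        # visitor_key -> that visitor's most recent visit
    for key, ts, path in sorted(requests, key=lambda r: r[1]):
        visit = latest.get(key)
        if visit is None or ts - visit["last_seen"] > timeout:
            visit = {"key": key, "paths": [], "last_seen": ts}
            visits.append(visit)
            latest[key] = visit
        visit["paths"].append(path)
        visit["last_seen"] = ts
    return visits

# Hypothetical requests from two clients, interleaved as they would be in a log.
t0 = datetime(2000, 5, 15, 23, 0, 0)
requests = [
    ("216.219.177.29", t0, "/index.htm"),
    ("10.0.0.5", t0 + timedelta(minutes=1), "/index.htm"),
    ("216.219.177.29", t0 + timedelta(minutes=2), "/products.htm"),
    ("216.219.177.29", t0 + timedelta(hours=2), "/index.htm"),
]
visits = stitch_visits(requests)
for v in visits:
    print(v["key"], v["paths"])
```

Swapping the IP address for a cookie value as the key changes nothing in the logic, which is why a better identifier improves accuracy without complicating the analysis.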

2.10.2. Types of Web Server Logfiles

Logfiles also come in a variety of formats and types (Table 2-2). While industry standards exist for web servers, there are many other types of servers that can use other logging formats. Streaming media servers are one example of this, dedicated to serving media-only files and using a unique, often proprietary logging format that often requires a special processing engine to prepare for analysis.

Table 2-2. The most widely used web server logfile formats
Format	Description
Common Log Format (CLF), also referred to as the NCSA Common format	The most widely used logfile format. Originally defined by the NCSA, it is now available in a variety of server applications, including Apache and Microsoft's Internet Information Server. More information can be found at www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format.
Combined Log Format	An extension of the Common Log Format that includes the referring URL and User Agent fields. More information is available at httpd.apache.org/docs/logs.html#combined.
W3C Extended Logfile Format	A customizable log format based on the Common Log Format that allows you to collect only the information you need for your analysis. More information is available at www.w3.org/TR/WD-logfile.html.
IIS Logfile Format	A fixed format used in Microsoft's Internet Information Server based on the Common Log Format but that allows collection of additional information, including elapsed time and number of bytes sent (different from bytes received in the CLF). More information is available at msdn.microsoft.com/library/default.asp?url=/library/en-us/iissdk/iis/iis_log_file_formats.asp.

2.10.3. The (Occasional) Need for Translation

Content management servers, commerce servers, portal servers, and other "dynamic" application servers also produce special logfiles. In these instances the files themselves are usually easily parsed by a logfile analysis tool, but the resulting reports may contain unintelligible code values in place of the actual page names, document names, product names, or other elements.

To solve this problem, some analysis tools provide "look up" capabilities into the underlying database used by the application server to translate the code values into names or titles that are commonly understood. Figure 2-3 illustrates the SKUs before translation.

Figure 2-3. Untranslated product SKUs

Figure 2-4 illustrates the SKUs after translation.

If you are using an application server, such as BEA WebLogic, BroadVision, or Microsoft SharePoint, and client-side data tagging is not a viable data collection technique, make sure your web analysis tool can perform the required database lookups for your reports to be usable.

Figure 2-4. Translated product SKUs
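
The lookup itself amounts to a join between report rows and a table of names. The Python sketch below uses an in-memory SQLite database as a stand-in for the application server's database; the products table, its columns, the SKU codes, and the view counts are all hypothetical and serve only to show the shape of the translation.

```python
import sqlite3

# Stand-in for the application server's database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("SKU-1001", "Blue Widget"), ("SKU-1002", "Red Widget")],
)

def translate(report_rows, conn):
    """Replace each SKU in (sku, views) rows with its product name.

    Codes with no match in the database are left as they are, so that
    nothing silently disappears from the report.
    """
    lookup = dict(conn.execute("SELECT sku, name FROM products"))
    return [(lookup.get(sku, sku), views) for sku, views in report_rows]

# Hypothetical report rows as a log analyzer might produce them.
report = [("SKU-1001", 42), ("SKU-1002", 17), ("SKU-9999", 3)]
translated = translate(report, conn)
for name, views in translated:
    print(name, views)
```

Leaving unmatched codes untranslated, rather than dropping them, makes gaps in the lookup table visible in the finished report.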

When all is said and done, web server logfiles remain among the most popular and widely deployed web measurement data sources and will likely continue to be popular for years to come. While there has been a noticeable trend in the last three years towards the use of client-side page tags [Hack #28], many businesses now realize that the decision is not necessarily black and white, and that both sources of data have their intrinsic value.

Jeff Seacrist and Eric T. Peterson