Web Site Measurement Hacks: Tips & Tools to Help Optimize Your Online Business

Hack 35. Build Your Own Web Measurement Application: The Core Code

One thing that every web measurement application needs to deal with, regardless of price or sophistication, is stitching together multiple page views into a visit and assigning that visit to a unique visitor.

In "Build Your Own Web Measurement Application: An Overview and Data Collection" [Hack #12], we saw how to write a small page tag script to record the visits to your web site. The program produced a logfile in this format:

    1104772080 192.168.17.32 /index.html?from=google http://www.google.com/search?q=widgets 192.168.17.32.85261104772101338
    1104772091 192.168.17.32 /products.html http://www.example.com/index.html?from=google 192.168.17.32.85261104772101338

In each line, the fields correspond to the time of the request, the client IP address, the page requested, the referring page, and the visitor's cookie.
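As a minimal sketch, assuming the fields are separated by tab characters (which is what the parsing code later in this hack expects), a single line can be split into its five fields as follows. The variable names are illustrative only.

```perl
use strict;
use warnings;

# One line in the logfile format described above. The fields are assumed
# to be tab-separated.
my $line = join "\t",
    '1104772091',
    '192.168.17.32',
    '/products.html',
    'http://www.example.com/index.html?from=google',
    '192.168.17.32.85261104772101338';

# A limit of -1 makes split keep trailing empty fields, so a line with an
# empty cookie still yields five fields.
my ($time, $host, $file, $referrer, $cookie) = split /\t/, $line, -1;

print "time:     $time\n";
print "host:     $host\n";
print "file:     $file\n";
print "referrer: $referrer\n";
print "cookie:   $cookie\n";
```

A plain split does no validation; the Request class later in this hack uses a regular expression instead so that malformed lines can be rejected.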

Now that we have such a logfile, what should we do with it? One possibility is to analyze it by using one of the existing logfile analyzer programs, as long as the program can be configured to read data in our nonstandard format. For example, you can read the file by using the free web measurement application Analog (www.analog.cx) [Hack #10] and supplying the command:

   LOGFORMAT %U\t%S\t%r\t%f\t%u

In this and subsequent "build your own" hacks we shall build a new program to read this logfile and produce a report. This will demonstrate the basics of what web analytics programs actually do under the hood. We will write our program in Perl. This may not be the best choice of language for high-traffic web sites, but it is adequate for smaller sites, and probably the clearest and most concise language.

2.24.1. Parsing the Data into Sessions

The first task you need to undertake in writing this application is to parse the data into visits and visitors [Hack #1], a task often referred to as sessionization. This is the bulk of the work: once the individual requests to the server are arranged into sessions, extracting data from the sessions is straightforward.

The traditional rule for carrying out this session parsing is to look for requests that appear to be from the same visitor. These requests are counted as one session, unless there is a gap of more than half an hour between two consecutive requests, in which case we start a new session.
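To illustrate the half-hour rule in isolation, here is a small sketch that splits one visitor's request times into sessions. The timestamps are made up for the example, and the code assumes they are already sorted in ascending order.

```perl
use strict;
use warnings;

my $max_gap = 1800;    # half an hour, in seconds

# Hypothetical request times (seconds since the epoch) for a single
# visitor, in ascending order.
my @times = (1104772080, 1104772091, 1104772500, 1104775000, 1104775060);

my @sessions;          # each session is an array ref of request times
for my $t (@times) {
    # Start a new session on the first request, or when the gap since the
    # previous request exceeds $max_gap.
    if (!@sessions || $t > $sessions[-1][-1] + $max_gap) {
        push @sessions, [];
    }
    push @{ $sessions[-1] }, $t;
}

printf "%d sessions\n", scalar @sessions;
printf "  %d requests\n", scalar @$_ for @sessions;
```

Here the 2,500-second gap before the fourth request starts a second session, so the five requests fall into two sessions of three and two requests.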

We can see which requests are from the same visitor by looking at the cookies. A few visitors may block cookies, in which case, we use their IP addresses as an identifier. (This may cause us to combine two visitors from the same company who visit at about the same time, but that's likely to be a very small problem.)

There is one complication. The first time a visitor requests a file from a web server, a cookie is created and sent back to the visitor. Only on the second request does the visitor send the cookie back to the web server, so only then is the cookie recorded in the logfile. When our page tag script is on the same server as the pages, this is not a problem: the request for the page tag will be the second request to the server and will have a cookie. However, when the page tag is running on a separate server, the first page tag requested will not have a cookie. In that case, a logfile like the example at the start of this hack would have an empty cookie field on its first line: the cookie would be missing on the request for /index.html?from=google, and appear only on the request for /products.html.

So our logfile reader must be able to connect sessions when the first line is missing a cookie. To achieve that, we apply the following rule. If the line has a cookie, look for a session from that cookie. If we can't find one, look for a session from that IP address. If that succeeds, re-label the session so that it becomes indexed by cookie instead of by IP address.
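The re-labeling rule can be traced on a two-request example. This is only a simplified sketch: it ignores session expiry, uses a plain hash rather than the classes defined in this hack, and the request data is hypothetical.

```perl
use strict;
use warnings;

# Two hypothetical requests from one visitor: the first arrives without a
# cookie, the second carries one.
my @requests = (
    { host   => '192.168.17.32',
      cookie => '',
      file   => '/index.html?from=google' },
    { host   => '192.168.17.32',
      cookie => '192.168.17.32.85261104772101338',
      file   => '/products.html' },
);

my %sessions;    # session key (cookie or IP address) => array ref of files

for my $req (@requests) {
    my ($cookie, $host) = @{$req}{qw(cookie host)};

    # No session for this cookie yet, but one exists for the IP address:
    # re-label it so that it is indexed by the cookie from now on.
    if ($cookie && !exists $sessions{$cookie} && exists $sessions{$host}) {
        $sessions{$cookie} = delete $sessions{$host};
    }

    my $key = $cookie || $host;
    push @{ $sessions{$key} }, $req->{file};
}

for my $key (sort keys %sessions) {
    print "$key: @{ $sessions{$key} }\n";
}
```

After both requests have been processed, there is a single session, indexed by the cookie, containing both pages.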

2.24.2. The Code

For this code to function properly, we recommend saving it in a file readlog.pl somewhere that has easy access to the files created by readtag.pl [Hack #12].

Here's the top level of the program:

    #!perl -w
    # We start by declaring two classes, Request and Sessions.
    use strict;
    use Request;
    use Sessions;

    # This variable determines how often we check for expired sessions.
    my $purge_interval = 1000;

    # Create an object to hold the sessions.
    my $sessions = new Sessions;

    # For each line in the logfile, check that the line can be parsed,
    # and if not, go on to the next line.
    while (<>) {
        chomp;
        my $req = new Request($_) or next;

        # Find or create the session into which this request falls,
        # and add the request to the session.
        my $sess = $sessions->FindSession($req);
        $sess->AddRequest($req);

        # Every $purge_interval lines, clear up any expired sessions
        # ($. is a Perl variable holding the current input line number).
        if ($. % $purge_interval == 0) { $sessions->Purge($req->{time}); }
    }

    # After reading all the lines, clear up all remaining sessions,
    # and write the report.
    $sessions->Purge();
    WriteReport();

That's the end of the main program. Of course, it's the Request and Sessions classes that will do the real work. The Request class is very simple. It consists solely of a constructor, which constructs a Request object, given a line from the logfile. Save this file to a text file called Request.pm.

    package Request;
    use strict;

    # Does the server on which the pages are stored use a case-insensitive
    # filesystem (e.g., Windows, not UNIX)?
    my $case_insensitive = 0;

    sub new {
        # Take the string that was passed in. Attempt to parse it into its
        # fields using a regular expression, and return undef if failed.
        my ($invocant, $str) = @_;
        return undef
            unless (my ($time, $host, $file, $referrer, $cookie) =
                $str =~
                /^              # start of line
                 (1\d{9})\t     # time: ten digits starting with 1
                 ([^\t]+)\t     # host: non-empty string
                 ([^\t]+)\t     # file: non-empty string
                 ([^\t]*)\t     # referrer: possibly empty string
                 ([^\t]*)       # cookie: possibly empty string
                 $/x);          # end of line

        # If the filesystem is case insensitive, convert the filename to
        # lower case. Then create and return the Request object.
        $file = lc $file if $case_insensitive;
        return bless {
            time     => $time,
            host     => $host,
            file     => $file,
            referrer => $referrer,
            cookie   => $cookie
        };
    }

    # A module file must return a true value when it is loaded.
    1;

The Sessions class represents the collection of all the individual sessions. It is responsible for deciding which session each request belongs to. It also uses a class called Data, which stores all the statistics we shall report. Save this to a file called Sessions.pm.

    package Sessions;
    use strict;
    use Session;
    use Data;

A Sessions object will be a hash table of Session objects, indexed by cookie or client hostname. There will be one special hash key, DATA, to hold the statistics. This is a constructor to set up that hash table.

    sub new {
        my $data = new Data;
        return bless {DATA => $data};
    }

    # Find or create a session containing a certain request. As described
    # above, we first look for a session with this cookie. If that fails,
    # look for a session from this client address.
    sub FindSession {
        my ($self, $req) = @_;
        my $key = $req->{cookie};
        my $sess = $self->{$key};
        if (!defined($sess)) {
            $key = $req->{host};
            $sess = $self->{$key};
        }

        # If we found a session, and it's not expired, return it.
        # However, if we found the session by client address, and this
        # request also has a cookie, first move the session to be indexed
        # by the cookie.
        my $expired = 0;
        if (defined($sess) && !($expired = $sess->IsExpired($req->{time}))) {
            if ($req->{cookie} && $key eq $req->{host}) {
                $self->{$req->{cookie}} = $sess;
                delete $self->{$req->{host}};
            }
            return $sess;
        }

        # If we didn't find an unexpired session, create and return a new
        # session.
        $self->PurgeSession($key) if $expired;
        $sess = new Session;
        $key = $req->{cookie} || $req->{host};
        $self->{$key} = $sess;
        return $sess;
    }

    # The next function purges some expired sessions from memory.
    # If a time is specified, purge all sessions up to that time.
    # If not, purge all sessions.
    sub Purge {
        my ($self, $time) = @_;
        while (my ($key, $sess) = each %$self) {
            next if ($key eq 'DATA');  # don't delete the special DATA key
            $self->PurgeSession($key)
                if (!defined($time) || $sess->IsExpired($time));
        }
    }

    # Delete a single session, after saving its data in the Data object.
    sub PurgeSession {
        my ($self, $key) = @_;
        $self->{DATA}->AddSession($self->{$key});
        delete $self->{$key};
    }

    # A module file must return a true value when it is loaded.
    1;

Finally, for this hack, we shall describe the Session object. A Session will just be an array of Request objects. First we need some variables to describe when a session expires. It is considered stale if there is a gap of more than 1,800 seconds, or if it contains more than 250 requests. Save this to a file called Session.pm.

    package Session;
    use strict;

    my $max_gap_in_session = 1800;
    my $max_requests_in_session = 250;

    # This is a minimal constructor setting up an empty Request array.
    sub new {
        return bless [];
    }

    # Add a request to the session.
    sub AddRequest {
        my ($self, $req) = @_;
        push @$self, $req;
    }

    # The session has expired if the requests array is too large, or if it
    # has been too long since the last request.
    sub IsExpired {
        my ($self, $time) = @_;
        return (@$self > $max_requests_in_session ||
                $time > $$self[-1]->{time} + $max_gap_in_session);
    }

    # A module file must return a true value when it is loaded.
    1;

2.24.3. Next Steps

In this hack, we've seen how to read a logfile, parse it to extract requests, and accumulate those requests into sessions. In "Build Your Own Web Measurement Application: Marketing Data" [Hack #54], we shall describe how to extract data from the session objects and create a report.

Dr. Stephen Turner and Eric T. Peterson