Hack13.Build Your Own RSS Tracking Application: An Overview and Data Collection | Web Site Measurement Hacks: Tips & Tools to Help Optimize Your Online Business

Hack 13. Build Your Own RSS Tracking Application: An Overview and Data Collection

Content syndication via RSS and XML and blogging are extremely hot topics, but there are few tools available to track people reading and interacting with your content and articles. With a little bit of Perl knowledge, you can use our "build your own" hack to write a bare-bones RSS traffic analyzer.

If you're willing to roll up your sleeves a bit and dig into some Perl, you can significantly enhance your ability to track syndicated content compared to the little you're likely able to learn using only web measurement tools [Hack #47]. Using the following scripts to track your own RSS feeds and posts will tell you:

What articles and posts people read
Who refers people to your work
Where readers click out to from your posts (which links are clicked)

For syndicated content, this is pretty much it: the information you need to determine the reach of and response to your blogging activities. It depends on a little more code, and it won't work on every blogging platform or in every RSS reader, but because there is really no better source for this data, the results are very satisfying.

1.14.1. The Data Collection Code

The code for this hack is relatively simple and broken into three parts:

The code that goes into each RSS feed or article you want to track
The code that the RSS feed will call (track_rss.js)
The code that processes the request generated by the first two blocks of code (write_rss_tag.cgi) and writes a log of your RSS activity (rss.log)

This code functions in nearly the same way as a client-side page tag [Hack #28] by leveraging a "round trip" call to an external JavaScript file.

1.14.1.1 Tracking code to be placed into the feed or article you want to track.

In order to enable measurement, you need to add the following code to each post you want tracked.

    <DIV id="NAME OF ARTICLE">
      <!-- YOUR ARTICLE OR CONTENT WOULD GO HERE -->
    </DIV>
    <SCRIPT LANGUAGE="JavaScript">n="NAME OF ARTICLE";</SCRIPT>
    <SCRIPT LANGUAGE="JavaScript" SRC="http://www.yourserverlocation.com/scripts/track_rss.js"></SCRIPT>

Remember to change NAME OF ARTICLE to the actual name of the article as you'd like it tracked, and change http://www.yourserverlocation.com/scripts/track_rss.js to the actual location where that file is kept.

The NAME OF ARTICLE must be identical in the DIV id and in the JavaScript definition of n for this code to work.

For example, if you had written a weblog post about how great Firefox is, the whole code might look like this:

    <DIV id="Firefox is so super cool!">
      I love Firefox, it is so cool.
      <a href="mailto:me@mysite.com">Mail me</a> if you love Firefox as much as I do.
    </DIV>
    <SCRIPT LANGUAGE="JavaScript">n="Firefox is so super cool!";</SCRIPT>
    <SCRIPT LANGUAGE="JavaScript" SRC="http://www.yourserverlocation.com/scripts/track_rss.js"></SCRIPT>

Be sure to include the SCRIPT portion of the code after the text of the article since the JavaScript for tracking clicks depends on being run after the page has loaded. Assuming you've done everything correctly, once you deploy the article or feed via XML, you'll end up with the JavaScript code embedded in the appropriate XML container.

Unfortunately, this code will not work in all weblog publishing applications, since not all of them allow JavaScript to be embedded.
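Because the DIV id and the value of n must match exactly, one way to avoid mismatches is to generate the markup for each post from a single article name. The following is a minimal sketch, not part of the original hack: the function buildTrackedPost and its parameters are hypothetical names introduced here for illustration, and the script URL is the same placeholder used above.

```javascript
// Hypothetical helper (not part of the original hack): builds the markup for
// one tracked post so that the DIV id and the value of n cannot drift apart.
function buildTrackedPost(articleName, bodyHtml, scriptUrl) {
  // Escape characters that would break the id attribute value.
  var idValue = articleName.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  // JSON.stringify yields a safely quoted JavaScript string literal.
  var jsValue = JSON.stringify(articleName);
  return [
    '<DIV id="' + idValue + '">',
    bodyHtml,
    "</DIV>",
    // The SCRIPT elements come after the article text, as the hack requires.
    '<SCRIPT LANGUAGE="JavaScript">n=' + jsValue + ";</SCRIPT>",
    '<SCRIPT LANGUAGE="JavaScript" SRC="' + scriptUrl + '"></SCRIPT>'
  ].join("\n");
}

// Usage: print the markup for the Firefox example post.
console.log(buildTrackedPost(
  "Firefox is so super cool!",
  "I love Firefox, it is so cool.",
  "http://www.yourserverlocation.com/scripts/track_rss.js"
));
```

Whether the generated markup survives publication still depends on your blogging tool accepting embedded JavaScript.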

1.14.1.2 Tracking code to be referenced externally (the track_rss.js file).

The following code is the track_rss.js file referred to in the JavaScript you're placing in the article proper. This code is referenced externally to minimize the amount of code that needs to be placed in the article itself. You need to save the file in a publicly available directory on your web site (for example, /scripts/).

    // Declare and call the tracking image, passing name, location, referrer,
    // and a random number in the query
    var img = new Image();
    img.src = "http://www.yourserverlocation.com/cgi-bin/write_rss_tag.cgi?n=" + escape(n)
      + "&t=v&u=" + escape(document.location) + "&r=" + escape(document.referrer)
      + "&rn=" + RSSRandomNum();

    // Get the article container by id and the links within, and iterate through them
    var articlecontainer = document.getElementById(n);
    var articlelinks = articlecontainer.getElementsByTagName('a');
    for (var i = 0, link; (link = articlelinks[i]); i++) {
      // Build the new function to add
      var addfunc = "RSSClickTrack('" + escape(link.href) + "','" + escape(n) + "');";

      // Test if the link already has an onclick event defined
      if (link.onclick) {
        // Get the body of the existing onclick function
        var previousstart = link.onclick.toString().indexOf('{') + 1;
        var previousend = link.onclick.toString().lastIndexOf('}');
        var previousfunc = link.onclick.toString().substring(previousstart, previousend);

        // Test if the existing onclick already has the RSSClickTrack call
        if (previousfunc.indexOf('RSSClickTrack') < 0) {
          // Define and write the new onclick with both the existing and the new code
          var newfunc = addfunc + previousfunc;
          link.onclick = new Function(newfunc);
        }
      } else {
        // No existing onclick, so create it with the new code
        link.onclick = new Function(addfunc);
      }
    }

    function RSSClickTrack(link, name) {
      // Declare and call the click tracking image, passing link, name, location,
      // and a random number in the query. The location is passed as the
      // referrer to the click. Both link and name arrive already escaped.
      c = new Image();
      c.src = "http://www.yourserverlocation.com/cgi-bin/write_rss_tag.cgi?n=" + name
        + "&t=c&u=" + link + "&r=" + escape(document.location)
        + "&rn=" + RSSRandomNum();
    }

    function RSSRandomNum() {
      // Get a random number to break caching
      var rnum = Math.random() * 1000000;
      return Math.round(rnum);
    }

Use this code at your own risk! Because content syndication is still an emerging field, it is difficult to know how all RSS readers and applications will deal with JavaScript.

For this code to function properly, you need to change the location http://www.yourserverlocation.com/cgi-bin/write_rss_tag.cgi to the location of the write_rss_tag.cgi file (see below). It is worth noting that the variable t is set differently, depending on whether the article is viewed (t=v) or a link is clicked (t=c).
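To make the shape of the two request types concrete, here is a minimal sketch of how the query string is assembled for a view and for a click. The helper name buildTagUrl and the example feed URLs are hypothetical and not part of track_rss.js. It also uses encodeURIComponent where the original uses escape(). That is a substitution made for this sketch, not the hack's own code.

```javascript
// Assumed placeholder location; replace with your own write_rss_tag.cgi URL.
var TAG_URL = "http://www.yourserverlocation.com/cgi-bin/write_rss_tag.cgi";

// Hypothetical helper: builds the request URL that the tag sends.
// type is "v" for an article view or "c" for a link click.
function buildTagUrl(type, name, url, referrer, randomNum) {
  return TAG_URL +
    "?n=" + encodeURIComponent(name) +      // article name
    "&t=" + type +                          // event type
    "&u=" + encodeURIComponent(url) +       // page viewed or link clicked
    "&r=" + encodeURIComponent(referrer) +  // referrer to the event
    "&rn=" + randomNum;                     // cache-busting number
}

// A view of the article, with no referrer available:
console.log(buildTagUrl("v", "Firefox is so super cool!", "http://example.com/feed", "", 12345));

// A click on a link inside the article; the article's location is the referrer:
console.log(buildTagUrl("c", "Firefox is so super cool!", "mailto:me@mysite.com", "http://example.com/feed", 67890));
```

The only difference the server sees between the two events is the t parameter and what u and r refer to.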

1.14.1.3 Code to parse the JavaScript into an RSS logfile (write_rss_tag.cgi).

The following code is very similar to the "page tag" generated in the "Build Your Own Web Measurement Application" hacks [Hack #12]. It is written to accept input from the JavaScript tag above. You need to save this code on your web server in a location where it can be executed by an external script (for example, your /cgi-bin/ directory). The #!perl line may need to be adjusted to point to the location of Perl on your machine (for example, #!/usr/bin/perl).

    #!perl -w
    # The #!perl line above may need to be adjusted to point to the location
    # of perl on your machine, for example #!/usr/bin/perl

    use strict;

    # Declare the location of the logfile. The CGI program needs to be given
    # permission to write to this file. Exactly how to do that is
    # system-dependent.
    my $logfile = '/var/log/apache/rss.log';

    # The name of the cookie, if any.
    # 'Apache' is the default for mod_usertrack cookies.
    my $cookie_name = 'Apache';

    # We shall use the standard CGI module. This does all the work of extracting
    # the parameters from the query string and unescaping them.
    use CGI;
    my $cgi = CGI->new;
    my $name = $cgi->param('n');      # Get the RSS STORY NAME
    my $type = $cgi->param('t');      # Get the event TYPE
    my $param_url = $cgi->param('u'); # Get the u= URL of the event
    my $env_url = $cgi->referer();    # Get the referrer from the environment, used
                                      # as the URL for noscript/image calls
    my $ref = $cgi->param('r');       # Get the r= referrer to the event (only
                                      # captured for JavaScript-executed calls)

    # Use the referrer from the image call for the URL of the page with the tag
    # if it exists and the incoming value for param_url does not exist.
    # If neither exists, set the value to UNKNOWN. The use of UNKNOWN is to cover
    # requests from RSS readers that don't execute JavaScript and/or don't send
    # a referrer to an image request.
    my $url = "UNKNOWN";                # declare url with default value
    $url = $env_url if ($env_url);      # use referrer to the image request if it exists
    $url = $param_url if ($param_url);  # use param_url for url if it exists

    # For brevity, the referrer is not always specified in image tracking calls.
    # If it is not defined, define a blank one.
    $ref = "" unless (defined($ref));

    # As long as we've got a non-empty NAME and a non-empty TYPE,
    # write a line in the logfile.
    if ($name && $type) {
      # Look up the current time, the client name and the cookie. The
      # cookie may not be present for requests from some RSS readers, or it
      # might not be set prior to some events.
      my $time = time();
      my $client = $cgi->remote_host();
      my $cookie_val = $cookie_name ? $cgi->cookie($cookie_name) : "";
      if (!defined($cookie_val)) { $cookie_val = ""; }

      # Build the log line
      my $logout = "$type\t$time\t$client\t$name\t$url\t$ref\t$cookie_val";

      # We need to open the logfile. We also need to lock it, to make sure that
      # we're not writing two requests at the same time. If we can't open it or
      # can't lock it, write a diagnostic message to STDERR, which is the
      # server's error log.
      use Fcntl qw/:flock/; # Import the definition of LOCK_EX
      unless (open (LF, ">>", "$logfile") && flock(LF, LOCK_EX)) {
        my $lt = localtime;
        my $progname = $0 || 'readrsstag.pl';
        print STDERR "[$lt] $progname: Can't open logfile\n";
      }
      # Everything worked, so jump to the end of the logfile (this is necessary
      # in case something was written between the time we opened it and the time
      # we locked it), and write the line.
      else {
        seek(LF, 0, 2);
        print LF "$logout\n";
        close LF;
      }
    }

    # Finally, send a 1x1 pixel transparent gif back to the browser.
    # (The long list of numbers just happens to be that gif, byte by byte.)
    print "Content-Type: image/gif\n\n";
    print 'GIF89a';
    print v1.0.1.0.145.0.0.0.0.0.255.255.255.255.255.255.0.0.0.33.249.4.1.0.0.2.0.44.0.0.0.0.1.0.1.0.0.2.2.84.1.0.59;
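The URL fallback and the layout of each log line are the parts of this script most worth understanding before you parse rss.log. The following JavaScript sketch mirrors that logic for illustration only. It is not a replacement for the Perl script, and the function name buildLogLine and the field names on its argument are hypothetical.

```javascript
// Sketch of the URL fallback and log-line layout used by write_rss_tag.cgi.
// The property names on `req` are hypothetical; the Perl script reads the
// equivalent values from the CGI query string and environment.
// Note: Perl treats the string "0" as false; this sketch does not replicate
// that edge case.
function buildLogLine(req) {
  // The script logs nothing unless both a name and a type are present.
  if (!req.name || !req.type) return null;

  var url = "UNKNOWN";                        // default when nothing is known
  if (req.envReferer) url = req.envReferer;   // Referer header of the image request
  if (req.paramUrl) url = req.paramUrl;       // the u= parameter wins when present

  var ref = req.ref == null ? "" : req.ref;           // r= parameter, may be absent
  var cookie = req.cookie == null ? "" : req.cookie;  // cookie value, may be absent

  // Field order: type, time, client, name, url, referrer, cookie
  return [req.type, req.time, req.client, req.name, url, ref, cookie].join("\t");
}

// Usage with made-up values: a view from a reader that sent no u= parameter.
console.log(buildLogLine({
  type: "v",
  time: 1100000000,
  client: "192.0.2.1",
  name: "Firefox is so super cool!",
  envReferer: "http://reader.example/"
}));
```

Every line therefore has seven tab-separated fields, with empty strings standing in for a missing referrer or cookie.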

1.14.2. Running the Code

Assuming that you've copied the code correctly and set the appropriate permissions for write_rss_tag.cgi on your web server, you should be all set. Again, the most important things to double check are that:

The ID in the <DIV> tag in your post matches the value of n exactly.
The reference http://www.yourserverlocation.com/scripts/track_rss.js in the JavaScript has been changed to the location of the file on your server (likely in your /scripts/ directory).
The http://www.yourserverlocation.com/cgi-bin/write_rss_tag.cgi reference in the track_rss.js file has been changed to the location of the file on your server (likely in your /cgi-bin/ directory).

Also, because some applications for deploying content via RSS (most notably, the blogging tools) will insert HTML tags automatically (usually the <BR> tag), you should double check that the JavaScript renders correctly when the post is viewed.
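One rough way to automate that last check is to scan the published markup for break tags that landed inside a SCRIPT element. This is a minimal sketch under the assumption that you can obtain the rendered HTML of the post as a string. The function name hasBreakInsideScript is hypothetical, and a regular expression scan like this is only a heuristic, not a full HTML parser.

```javascript
// Hypothetical check: returns true if a blogging tool appears to have
// injected a <br> tag inside any <script>...</script> block of the post.
function hasBreakInsideScript(renderedHtml) {
  // Collect every script element, case-insensitively.
  var scriptBlocks = renderedHtml.match(/<script\b[^>]*>[\s\S]*?<\/script>/gi) || [];
  return scriptBlocks.some(function (block) {
    // Look only at the script body, not at the opening or closing tag.
    var body = block.replace(/^<script\b[^>]*>/i, "").replace(/<\/script>$/i, "");
    return /<\/?br\s*\/?>/i.test(body);
  });
}

// Usage: the first post was damaged by an inserted break, the second was not.
console.log(hasBreakInsideScript('<SCRIPT LANGUAGE="JavaScript">n="A";<br /></SCRIPT>'));
console.log(hasBreakInsideScript('<DIV id="A">text<br></DIV><SCRIPT LANGUAGE="JavaScript">n="A";</SCRIPT>'));
```

A break tag inside the article DIV is harmless. Only one inside a SCRIPT element breaks the tracking code.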

1.14.3. The Results

Once you've successfully deployed the data collection code, you'll generate a logfile similar to the one in Figure 1-13.

Figure 1-13. Sample RSS log generated by the write_rss_tag.cgi script

All that's left is to parse this log and generate reports [Hack #36]. We'll do this using a series of Perl objects, a strategy similar to the "build your own" hacks in this book, and one that allows greater flexibility if you want to modify this code for your own purposes.
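Before building full reports, it can help to confirm that the log is being written in the expected layout. The following is a minimal sketch of a tally of views and clicks per article, not the reporting code this book goes on to develop in Perl. The function name summarizeRssLog and the sample lines are made up for illustration, and the field order is taken from the log line built in write_rss_tag.cgi.

```javascript
// Minimal sketch: tally views (t=v) and clicks (t=c) per article name from
// the tab-separated lines written by write_rss_tag.cgi.
function summarizeRssLog(logText) {
  var totals = {};
  logText.split("\n").forEach(function (line) {
    if (!line) return; // skip blank lines
    // Field order: type, time, client, name, url, referrer, cookie
    var fields = line.split("\t");
    if (fields.length < 4) return; // skip malformed lines
    var type = fields[0];
    var name = fields[3];
    if (!Object.prototype.hasOwnProperty.call(totals, name)) {
      totals[name] = { views: 0, clicks: 0 };
    }
    if (type === "v") totals[name].views++;
    else if (type === "c") totals[name].clicks++;
  });
  return totals;
}

// Usage with made-up sample lines (not real log data):
var sample = [
  "v\t1100000000\t192.0.2.1\tFirefox is so super cool!\thttp://reader.example/\t\t",
  "c\t1100000005\t192.0.2.1\tFirefox is so super cool!\tmailto:me@mysite.com\thttp://reader.example/\t",
  "v\t1100000100\t192.0.2.7\tFirefox is so super cool!\tUNKNOWN\t\t"
].join("\n");
console.log(summarizeRssLog(sample));
```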

Ian Houston and Eric T. Peterson