Hack 36. Build Your Own RSS Tracking Application: The Core Code and Reporting

Syndicating content via RSS is similar to, but not the same as, building normal web pages. Because of this, the parsing of information collected is very similar to our "build your own" web measurement application, using a similar architecture. However, because RSS is designed to be presented in any number of applications and environments, the reporting is slightly different (but no less interesting).

Assuming you've already read how to collect data from within RSS feeds [Hack #12], you should have an RSS logfile, rss.log. To learn anything meaningful from rss.log, you need to parse the file and generate human-readable reports.

The reporting code is broken into five packages and driven by a single script called from the command line (rss_report.pl). The packages are:


RSS_Article.pm

Holds the RSS_Event objects and provides methods for accessing events by type from the RSS_Request object


RSS_Articles.pm

Holds articles by name from the RSS_Request and RSS_Data objects and provides methods for accessing information about the article


RSS_Data.pm

Creates the summary data object and provides methods for processing and reporting


RSS_Event.pm

Provides methods for accessing the RSS_Request objects for the event


RSS_Request.pm

Parses the incoming log line to create an object containing data broken down by field name

Be sure to save each .pm file in your Perl /lib directory.

2.25.1. The Code

2.25.1.1 RSS_Article.pm.

The RSS_Article object is a container for the article that holds all of its RSS_Event objects, indexed by the type field from the RSS_Request object. It provides methods for sending the request to the proper event (creating the event if it doesn't exist) and for accessing the individual event objects. Type the following code into a file named RSS_Article.pm.

    package RSS_Article;
    use strict;
    use RSS_Event;

    # An Article object is a hash table of Event objects,
    # indexed by type.

    # This is a simple constructor for setting up an empty hash.
    sub new {
        return bless {};
    }

    # Add the request to the article for the event
    sub AddRequest {
        my ($self, $req) = @_;

        # Look for an event with this type.
        my $key = $req->{type};
        my $event = $self->{$key};

        # If we didn't find an event, create it.
        unless (defined($event)) {
            $event = new RSS_Event;
            $self->{$key} = $event;
        }

        # Add the Request to the Event
        $event->AddRequest($req);
    }

    # Find the event for the given type
    sub FindEvent {
        my ($self, $type) = @_;

        # Return the event if it exists, or undef if it doesn't
        return $self->{$type};
    }

    return 1;

2.25.1.2 RSS_Articles.pm.

The RSS_Articles object is the container that holds all the RSS_Article objects and a single instance of the RSS_Data object used to store the statistical data. It provides methods for finding and creating the article objects by the name field from the RSS_Request object, initiating the processing of each article's data and initiating the writing of reports from rss_report.pl. Type the following code into a file named RSS_Articles.pm.

    package RSS_Articles;
    use strict;
    use RSS_Article;
    use RSS_Data;

    # An Articles object will be a hash table of Article objects,
    # indexed by article name.
    # There will be one special hash key, DATA, to hold the statistics.

    # This is a constructor to set up that hash table.
    sub new {
        my $data = new RSS_Data;
        return bless {DATA => $data};
    }

    # Find or create an article
    sub FindArticle {
        my ($self, $req) = @_;

        # Look for an article with this name.
        my $key = $req->{name};
        my $arti = $self->{$key};

        # If we found an article, return it.
        if (defined($arti)) {
            return $arti;
        }

        # If we didn't find an article, create and return a new article.
        $arti = new RSS_Article;
        $self->{$key} = $arti;
        return $arti;
    }

    # Process Articles Data into reporting data
    sub Process {
        my ($self) = @_;
        while (my ($key, $arti) = each %$self) {
            next if ($key eq 'DATA');   # don't process the special DATA key
            $self->{DATA}->AddArticle($arti, $key);
        }
    }

    # Write the report
    sub WriteReport {
        my $self = shift;
        $self->{DATA}->WriteReport();
    }

    return 1;
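
Taken together, RSS_Articles, RSS_Article, and RSS_Event form a three-level index: article name, then event type, then the list of requests. The following standalone sketch models that layout with plain hashes and arrays rather than the blessed objects, to show what the containers hold after a few log lines have been added. The helper names add_request and num_requests and the sample requests are invented for illustration.

```perl
use strict;
use warnings;

# article name -> event type -> list of requests
my %articles;

# Find-or-create at each level, then append the request. Perl
# autovivifies the intermediate hash and array on first use, which is
# what FindArticle and AddRequest do explicitly with objects.
sub add_request {
    my ($req) = @_;
    push @{ $articles{ $req->{name} }{ $req->{type} } }, $req;
}

# Number of requests of one type for one article (0 if there are none).
sub num_requests {
    my ($name, $type) = @_;
    return 0 unless exists $articles{$name} && exists $articles{$name}{$type};
    return scalar @{ $articles{$name}{$type} };
}

# Made-up requests standing in for parsed log lines.
add_request({ name => 'post-one', type => 'v', url => 'http://example.com/a' });
add_request({ name => 'post-one', type => 'v', url => 'http://example.com/b' });
add_request({ name => 'post-one', type => 'c', url => 'http://example.com/out' });
add_request({ name => 'post-two', type => 'v', url => 'http://example.com/a' });

for my $name (sort keys %articles) {
    printf "%s: %d views, %d clicks\n",
        $name, num_requests($name, 'v'), num_requests($name, 'c');
}
```

The object version buys named methods and a place to hang the DATA key; the underlying shape is the same.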

2.25.1.3 RSS_Data.pm.

The RSS_Data object creates the summary data object and provides the methods for processing the data from an RSS_Article object into summary data and writing the reports. One of the most important things to note about the RSS_Data package is the pair of array declarations near the top of the file, @events_to_sum and @events_to_list, which let you configure the reports that are generated. Type the following code in a file named RSS_Data.pm.

    package RSS_Data;
    use strict;

    # The configuration of summary information to report.
    # The fields are event type and item name.
    my @events_to_sum = (
        [ 'v', 'Views' ],
        [ 'c', 'Clicks' ]
    );

    # The configuration of article itemized information to report.
    # The fields are event type, report title, request field, and
    # number of list items to display.
    my @events_to_list = (
        [ 'v', 'Top 10 Pages in which the Article was Viewed', 'url', 10 ],
        [ 'v', 'Top 10 referrers to the Page with the Article', 'ref', 10 ],
        [ 'c', 'Top 10 Links clicked within the Article', 'url', 10 ]
    );

    # The constructor, with all the variables we will store
    sub new {
        return bless {
            totals_events   => {},
            totals_articles => {},
            article_reports => {}
        };
    }

    # Add Articles data to the totals
    sub AddArticle {
        my ($self, $arti, $articlename) = @_;

        # Loop through each event type defined in @events_to_sum.
        # If the event type exists for the article, collect and add
        # its summary data.
        foreach my $event_sum (@events_to_sum) {
            my $type = $event_sum->[0];

            # Get the event
            my $event = $arti->FindEvent($type) or next;

            # Get a count of the requests for the event
            my $count = $event->NumRequests;
            if ($count) {
                $self->{totals_events}->{$type} += $count;
                $self->{totals_articles}->{$articlename}->{$type} += $count;
            }
        }

        # Loop through each report defined in @events_to_list,
        # process field values, and add report data.
        foreach my $event_list (@events_to_list) {
            my $type = $event_list->[0];

            # Get the event
            my $event = $arti->FindEvent($type) or next;

            # Get the report title and the request field to count
            my $title = $event_list->[1];
            my $field = $event_list->[2];

            # Get the values of that field from the event
            my $values = $event->GetFieldValues($field);

            # Loop through each value and increment a count of that value
            foreach my $value (@$values) {
                ++$self->{article_reports}->{$articlename}->{$title}->{$value}
                    if (defined($value));
            }
        }
    }

    # Write all the reports we have collected
    sub WriteReport {
        my $self = shift;

        # Write the Summary Report and then the Article Reports
        $self->WriteSummaryReport();
        $self->WriteArticleReports();
    }

    # Write Article Reports
    sub WriteArticleReports {
        my ($self) = @_;
        my $hashref = $self->{totals_articles};
        foreach my $articlename (sort {$a cmp $b} keys %$hashref) {
            # Write the report title
            $self->WriteArticleReportTitle($articlename);

            # Write the summary statistics
            $self->WriteArticleSummaryStats($self->{totals_articles}->{$articlename});

            # Write the event reports
            $self->WriteEventReports($self->{article_reports}->{$articlename});
        }
    }

    # Write a report title, = underlined
    sub WriteReportTitle {
        my ($self, $title) = @_;
        print "\n$title\n";
        print "=" for 1..(length $title);
        print "\n";
    }

    # Write an article report title, - underlined
    sub WriteArticleReportTitle {
        my ($self, $title) = @_;
        print "\n$title\n";
        print "-" for 1..(length $title);
        print "\n";
    }

    # Write an event report title, - underlined and indented
    sub WriteEventReportTitle {
        my ($self, $title) = @_;
        print "\n $title\n ";
        print "-" for 1..(length $title);
        print "\n";
    }

    # Write the summary report
    sub WriteSummaryReport {
        my $self = shift;

        # Write the report title
        $self->WriteReportTitle('Summary Statistics');

        # Write the summary statistics
        $self->WriteSummaryStats($self->{totals_events});
    }

    # Write the summary statistics
    sub WriteSummaryStats {
        my ($self, $hashref) = @_;

        # Loop through each event type defined in @events_to_sum
        # and write the summary statistics
        foreach my $event_sum (@events_to_sum) {
            my $type = $event_sum->[0];
            my $name = $event_sum->[1];
            printf "Total %s: %d\n", $name, $hashref->{$type} || 0;
        }
    }

    # Write the article summary statistics, indented
    sub WriteArticleSummaryStats {
        my ($self, $hashref) = @_;

        # Loop through each event type defined in @events_to_sum
        # and write the summary statistics
        foreach my $event_sum (@events_to_sum) {
            my $type = $event_sum->[0];
            my $name = $event_sum->[1];
            printf " Total %s: %d\n", $name, $hashref->{$type} || 0;
        }
    }

    # Write event reports by event type
    sub WriteEventReports {
        my ($self, $hashref) = @_;

        # Loop through each report defined in @events_to_list
        foreach my $event_list (@events_to_list) {
            # Get the title of the report
            my $title = $event_list->[1];

            # Write event report title
            $self->WriteEventReportTitle($title);

            # Get the limit on the number of items to print
            my $toplimit = $event_list->[3];

            # Test if event report data exists and write statistics if it does
            $self->WriteReportStats($hashref->{$title}, $toplimit)
                if ($hashref->{$title});
        }
    }

    # Write the report stats list from most occurrences to least,
    # limiting the length to the incoming $toplimit
    sub WriteReportStats {
        my ($self, $hashref, $toplimit) = @_;

        # Loop through the sorted hash data and print until the line
        # number reaches the limit set for the report
        my $n = scalar keys %$hashref;
        if ($toplimit < $n) { $n = $toplimit; }
        for ((sort {$hashref->{$b} <=> $hashref->{$a}} keys %$hashref)[0..$n-1]) {
            printf "%9s: %s\n", $hashref->{$_}, $_;
        }
        print "\n";
    }

    return 1;
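
Because both configuration arrays are plain lists of array references, adding a report is a one-line change. The sketch below is standalone and only illustrates the idea: it copies the shape of @events_to_list, adds a hypothetical entry that ranks the host field for views (not part of the original hack), and applies count-and-sort logic similar to WriteReportStats to a few made-up requests. The helper top_values is invented for this example, and its alphabetical tie-break is an addition for predictable output.

```perl
use strict;
use warnings;

# Same shape as @events_to_list: event type, report title, request field, limit.
# The second entry is a hypothetical addition, not part of the original hack.
my @events_to_list = (
    [ 'v', 'Top 10 Pages in which the Article was Viewed', 'url',  10 ],
    [ 'v', 'Top 2 Hosts that Viewed the Article',          'host', 2  ],
);

# Made-up requests standing in for parsed log lines.
my @requests = (
    { type => 'v', host => '192.0.2.1', url => 'http://example.com/a' },
    { type => 'v', host => '192.0.2.1', url => 'http://example.com/b' },
    { type => 'v', host => '192.0.2.2', url => 'http://example.com/a' },
    { type => 'c', host => '192.0.2.3', url => 'http://example.com/out' },
    { type => 'v', host => '192.0.2.1', url => 'http://example.com/a' },
    { type => 'v', host => '192.0.2.4', url => 'http://example.com/a' },
);

# Count the values of one field for one event type and return the
# top $limit [value, count] pairs, most frequent first. Ties are
# broken alphabetically so the order is predictable.
sub top_values {
    my ($type, $field, $limit) = @_;
    my %count;
    for my $req (grep { $_->{type} eq $type } @requests) {
        my $value = $req->{$field};
        $count{$value}++ if defined $value && length $value;
    }
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice @sorted, $limit if @sorted > $limit;
    return map { [ $_, $count{$_} ] } @sorted;
}

for my $report (@events_to_list) {
    my ($type, $title, $field, $limit) = @$report;
    print "$title\n";
    printf "%9d: %s\n", $_->[1], $_->[0] for top_values($type, $field, $limit);
}
```

In RSS_Data.pm itself, the equivalent change would be a new row in @events_to_list; any field name that RSS_Request stores (time, host, name, url, ref, cookie) can be used as the third element.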

2.25.1.4 RSS_Event.pm.

The RSS_Event object holds all RSS_Request objects for the event. It provides methods for getting a count of the requests it holds and getting the values from a specified field of the request, if they exist. Type the following code into a file named RSS_Event.pm.

    package RSS_Event;
    use strict;

    # An Event will be an array of Requests.

    # This is a minimal constructor setting up an empty array.
    sub new {
        return bless [];
    }

    # Add a request to the event
    sub AddRequest {
        my ($self, $req) = @_;
        push @$self, $req;
    }

    # The number of requests the event contains is the length of the array
    # of requests
    sub NumRequests {
        my $self = shift;
        return scalar @$self;
    }

    # Return a reference to an array of values for a request field,
    # skipping requests where the field is empty.
    # If no request has a value for that field, the array is empty.
    sub GetFieldValues {
        my ($self, $field) = @_;
        my @values;    # starts empty; "= undef" would add one undef element
        foreach my $req (@$self) {
            my $value = $req->{$field};
            push @values, $value if (defined($value) && length($value));
        }
        return \@values;
    }

    return 1;

2.25.1.5 RSS_Request.pm.

The RSS_Request object parses the incoming rss.log line by line and creates an object that holds the data in individual fields. These fields are time, host, name, URL, ref (as in "referrer"), and cookie. There is an additional field (type) that captures whether the line was logging a page view (v) or a link click (c). Type the following code into a file named RSS_Request.pm.

    package RSS_Request;
    use strict;

    # Construct a request hash from a string
    sub new {
        # Take the string that was passed in. Attempt to parse it into its
        # fields using a regular expression, and return undef if that fails.
        my ($invocant, $str) = @_;
        return undef unless
            (my ($type, $time, $host, $name, $url, $ref, $cookie) = $str =~
                /^              # start of line
                 ([cv])\t       # type: v for view or c for click
                 (1\d{9})\t     # time: ten digits starting with 1
                 ([^\t]+)\t     # host: non-empty string
                 ([^\t]+)\t     # name: non-empty string
                 ([^\t]+)\t     # url: non-empty string
                 ([^\t]*)\t     # ref: possibly empty string
                 ([^\t]*)       # cookie: possibly empty string
                 $/x);          # end of line

        # If the parsing succeeded, create and return an object.
        return bless {
            type   => $type,
            time   => $time,
            host   => $host,
            name   => $name,
            url    => $url,
            ref    => $ref,
            cookie => $cookie
        };
    }

    return 1;
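
As a quick check of the line format the constructor expects, the standalone sketch below builds one tab-separated line and parses it with the same field layout as the regular expression in RSS_Request.pm. The sample values (host, article name, URLs, cookie) and the helper parse_line are invented for illustration and do not come from a real log.

```perl
use strict;
use warnings;

# Hypothetical sample values; a real line is written by write_rss_tag.cgi.
my @fields = (
    'v',                            # type: view
    '1100000000',                   # time: Unix timestamp, ten digits
    '192.0.2.10',                   # host
    'my-first-post',                # article name
    'http://example.com/reader',    # url of the page showing the article
    '',                             # ref: may be empty
    'abc123',                       # cookie: may be empty
);
my $line = join "\t", @fields;

# Same field layout as the regular expression in RSS_Request.pm.
my $pattern = qr/^([cv])\t(1\d{9})\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]*)\t([^\t]*)$/;

# Return a hash reference of named fields, or nothing if the line
# does not match the expected layout.
sub parse_line {
    my ($str) = @_;
    my @got = $str =~ $pattern or return;
    my %req;
    @req{qw(type time host name url ref cookie)} = @got;
    return \%req;
}

my $req = parse_line($line);
print "$req->{type} $req->{name} $req->{url}\n";

# A line with too few fields is rejected rather than half-parsed.
print "short line rejected\n" unless parse_line("v\t1100000000\tonly-three");
```

Lines that fail to match are silently skipped by rss_report.pl, so a quick test like this can help explain why a report comes out emptier than expected.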

2.25.1.6 Bringing the packages together with rss_report.pl.

The rss_report.pl script should be saved to the same directory as the rss.log file generated by write_rss_tag.cgi. Remember, the #!perl line may need to be adjusted to point to the location of Perl on your machine; for example, #!/usr/bin/perl. Type the following code into a file named rss_report.pl.

    #!perl -w
    use strict;

    # The classes we're going to use
    use RSS_Request;
    use RSS_Articles;

    # Create an object to hold the articles
    my $articles = new RSS_Articles;

    # For each line in the logfile
    while (<>) {
        chomp;

        # Check that the line is parseable, and if not, go on to the next line
        my $req = new RSS_Request($_) or next;

        # Find or create the article into which this request falls
        my $arti = $articles->FindArticle($req);

        # Add the request to the article
        $arti->AddRequest($req);
    }

    # After reading all the lines, process them into the final data
    $articles->Process();

    # And write the report
    $articles->WriteReport();

Now you're ready to generate your first RSS traffic report!

2.25.2. Running the Code

To run the program, you will need Perl installed on your computer. If you are using Unix or Linux, you almost certainly have Perl already, but if you are using Windows, you may not. You can download ActiveState's Perl for Windows from http://www.activestate.com/Products/ActivePerl.

All that remains now is to tell rss_report.pl where the rss.log file (generated by the write_rss_tag.cgi script) is located, and the rest is automatic!

From the command line (assuming that rss.log is in the same directory as rss_report.pl), all you need to do is type:

 perl rss_report.pl rss.log 

Figure 2-24 has sample output from the script showing summary statistics (total views and total clicks) for all tracked articles and a per-article breakdown showing the total views, clicks, pages where the article was viewed, referrers to the article, and links clicked in the article, each of which is described below.

Figure 2-24. Sample output from the rss_report.pl script


2.25.3. The Results

Because RSS is a slightly different beastie than other types of web traffic, it is worthwhile to define each of the reports generated by rss_report.pl.


Total article views

Total article views are the count of all page views for all articles listed in your report.


Total article clicks

Total article clicks are the count of all clicks on links contained in the articles listed in your report. The clicks are limited to tracked RSS articles by the use of the <DIV> tag in the JavaScript [Hack #12].


Pages in which the article was viewed

The pages in which the article was viewed report reflects the key complexity of RSS: the fact that your content does not necessarily appear in your web pages. Other people are easily able to grab your XML feed and present your content in their web pages. This report will tell you who is doing that and which URLs you should be looking at to see how your content is reused.


Referrers to the page with the article

Referrers to the page with the article will tell you who is linking to your articles on the Internet. This is perhaps the single most powerful aspect of this application, allowing you to determine how readers respond to your content.


Links clicked in the article

Provided you have normal HREF-based links in your articles, this list will tell you what people are clicking on. Remember, this report is limited to the links in your post by the <DIV> wrapper around the post.

2.25.4. Hacking the Hack

Obviously, this hack provides only the bare essentials for measuring content syndicated via RSS, and there are a handful of things you could do to improve the quality of tracking, including:

  • Adding the idea of "session" to the RSS_Event object, allowing you to generate both "visit" and "page view" counts for each article.

  • Adding a NOSCRIPT tag, extending a lighter version of this tracking for applications that do not allow the use of JavaScript.

  • Giving rss_report.pl the ability to accept a date range from the command line to limit the dates for which the report is generated.

  • Giving rss_report.pl the ability to accept a text string from the command line to limit the names of the articles reported on.
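
The last two ideas can be sketched together. The code below is one assumption about how such filtering might look, not part of the original hack: the option names --from, --to, and --name and the helpers parse_filters and wanted are hypothetical, and dates are given as Unix timestamps to match the time field in the log.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the hypothetical filter options out of an argument list.
# Anything that is not an option (such as the logfile name) stays in @$argv.
sub parse_filters {
    my ($argv) = @_;
    my %opt = (from => undef, to => undef, name => undef);
    GetOptionsFromArray(
        $argv,
        'from=i' => \$opt{from},    # earliest timestamp to include
        'to=i'   => \$opt{to},      # latest timestamp to include
        'name=s' => \$opt{name},    # text the article name must contain
    ) or die "usage: rss_report.pl [--from N] [--to N] [--name TEXT] rss.log\n";
    return \%opt;
}

# True if a request (a hash with time and name fields) passes the filters.
# A filter that was not given on the command line is ignored.
sub wanted {
    my ($req, $opt) = @_;
    return 0 if defined $opt->{from} && $req->{time} < $opt->{from};
    return 0 if defined $opt->{to}   && $req->{time} > $opt->{to};
    return 0 if defined $opt->{name} && index($req->{name}, $opt->{name}) < 0;
    return 1;
}

# Example with made-up arguments and requests.
my @args = ('--from', '1100000000', '--name', 'post', 'rss.log');
my $opt  = parse_filters(\@args);

my @requests = (
    { time => 1099999999, name => 'post-one' },    # too early
    { time => 1100000500, name => 'post-two' },    # kept
    { time => 1100000600, name => 'about-me' },    # name does not match
);
print "$_->{name}\n" for grep { wanted($_, $opt) } @requests;
print "remaining arguments: @args\n";
```

In rss_report.pl itself, the test would sit inside the while loop, just after the RSS_Request is created (for example, next unless wanted($req, $opt);). Plain GetOptions would serve the same purpose there, since it removes the options from @ARGV before while (<>) opens the remaining arguments as files.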

While there is a handful of other methods you can use to track the readership of your weblog (FeedBurner is one, at http://www.feedburner.com; Syndicate IQ is another, at http://www.syndicateiq.com), there are no major providers of this functionality who take such elegant advantage of the existing web measurement model. Until the rest of the measurement world comes around and provides this valuable functionality, enjoy, and remember you read it here first.

Ian Houston and Eric T. Peterson



    Web Site Measurement Hacks
    Web Site Measurement Hacks: Tips & Tools to Help Optimize Your Online Business
    ISBN: 0596009887
    EAN: 2147483647
    Year: 2005
    Pages: 157
