Web Site Measurement Hacks: Tips & Tools to Help Optimize Your Online Business

Hack 100. Build Your Own Web Measurement Application: Reporting

Now that the application is built and you're collecting and analyzing data, the time has come to think about next steps. We make a handful of suggestions about other things this application could do to improve data collection, performance, or reliability.

As we've gone through the book, we've built a program to collect data from visitors to your web site, and a program to analyze the data and produce some basic statistics. In this final hack, we're going to remind you how to run the application and discuss ways in which it could be extended.

You can get both programs from:

http://www.webanalyticsdemystified.com/byo

You will also find a sample logfile there to allow you to run the second program without waiting to collect any data of your own first.

7.11.1. Running the Application

Remember that this application depends on a small piece of JavaScript [Hack #12] code that is embedded in each page you want tracked. All you need to do is add the following code near the top of the <BODY> element of each of your web pages:

 <script>
 document.write('<img src="/cgi-bin/readtag.pl?url='+escape(document.location)+'&amp;ref='+escape(document.referrer)+'">');
 </script>

To process the collected data, you'll use the readlog.pl Perl script that we developed over the course of the book. Assuming that page.log is in the same directory as readlog.pl, and that you've already installed the GeoIP modules [Hack #80], run the program from the command line by typing:

 perl readlog.pl page.log

And again, the sample output will look something like Figure 7-15.

Figure 7-15. Output from the readlog.pl application

7.11.2. Extending This Application

The program gives only a taste of how web analytics programs work. You're probably already thinking of ways to extend it into something that suits your specific needs and draws on the other ideas in this book. Here are some possible extensions.

7.11.2.1 Add simple visitor segmentation.

The program has no way to segment visitors to compare two groups against each other, or one group against the whole population [Hack #48]. As you've learned while reading the rest of this chapter, the most interesting use of web measurement is usually making comparisons.

It's not very useful to know the raw conversion rate; it's much more useful to know the conversion rate this month compared to last month, or to compare response rates between one campaign and another. You may want to allow segmenting by referrer or entry page, or even add a variable to the JavaScript that allows you to differentiate groups of visitors for your analysis.
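As a rough illustration of the comparison described above, the following sketch computes a conversion rate for each segment. It assumes sessions have already been parsed into hash references; the field names segment and converted, and the sub name conversion_by_segment, are invented for this example and are not part of readlog.pl.

```perl
use strict;
use warnings;

# Return a hash reference mapping each segment name to its conversion
# rate (a fraction between 0 and 1). The "segment" and "converted"
# fields are hypothetical; readlog.pl does not produce them as written.
sub conversion_by_segment {
    my (@sessions) = @_;
    my (%total, %converted);
    for my $s (@sessions) {
        $total{ $s->{segment} }++;
        $converted{ $s->{segment} }++ if $s->{converted};
    }
    my %rate;
    $rate{$_} = ($converted{$_} || 0) / $total{$_} for keys %total;
    return \%rate;
}

# Sample data: two sessions from search, two from an email campaign.
my @sessions = (
    { segment => 'search', converted => 1 },
    { segment => 'search', converted => 0 },
    { segment => 'email',  converted => 1 },
    { segment => 'email',  converted => 1 },
);
my $rates = conversion_by_segment(@sessions);
printf "%-8s %.0f%%\n", $_, 100 * $rates->{$_} for sort keys %$rates;
```

The segment value could come from the referrer, the entry page, or an extra variable passed by the JavaScript tag, whichever grouping you decide to compare.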

7.11.2.2 Clean up duplicate page names.

There is no attempt to combine URLs that correspond to the same file. For example, /products/index.html is actually the same file as /products/, and /%7Esret1/ is actually the same as /~sret1/. These transformations should happen automatically, in the same way as we converted filenames to lowercase on a case-insensitive filesystem. In addition, the user should be able to specify other transformations to apply, such as ignoring certain URL parameters. You may want to build in a transformation filter or table that will resolve these kinds of very common problems.
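One possible shape for such a transformation filter is sketched below. It is not taken from readlog.pl: the sub names, the assumption that index.html (or index.htm) is the directory default, and the list of ignored parameters are all illustrative and would need adjusting for your server.

```perl
use strict;
use warnings;

# Decode a %XX escape only when it stands for an "unreserved" character
# (letters, digits, and - . _ ~), so that %2F is not turned into a slash.
sub _decode_unreserved {
    my ($hex) = @_;
    my $char = chr hex $hex;
    return $char =~ /^[A-Za-z0-9._~-]$/ ? $char : '%' . uc $hex;
}

# Canonicalize a URL so that equivalent spellings count as one page.
# Any parameter names passed after the URL are dropped from the query.
sub normalize_path {
    my ($url, @ignore_params) = @_;
    my ($path, $query) = split /\?/, $url, 2;

    # /%7Esret1/ becomes /~sret1/
    $path =~ s/%([0-9A-Fa-f]{2})/_decode_unreserved($1)/ge;

    # /products/index.html becomes /products/ (assumed directory default)
    $path =~ s{/index\.html?\z}{/};

    if (defined $query) {
        my %drop = map { $_ => 1 } @ignore_params;
        my @keep = grep {
            my ($name) = split /=/, $_, 2;
            defined $name && !$drop{$name};
        } split /&/, $query;
        $path .= '?' . join('&', @keep) if @keep;
    }
    return $path;
}

print normalize_path('/products/index.html'), "\n";
print normalize_path('/%7Esret1/'), "\n";
print normalize_path('/item.html?sid=123&id=7', 'sid'), "\n";
```

Applying a filter like this to every URL before counting it keeps the page reports from splitting one page's traffic across several names.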

7.11.2.3 Improve the reporting.

The reporting for the basic system could easily be converted from plain text into a more dynamic HTML format. If you choose to rewrite it in HTML, be careful to encode non-alphanumeric characters (at a minimum <, >, &, and quotation marks) wherever logged data appears in the output, to prevent a cross-site scripting (XSS) attack. This occurs when a visitor to your web site sends a forged referrer that is really a nonsense URL containing malicious JavaScript code. If you were to view the data in a browser without encoding it, your browser would execute the malicious code. Security aside, you may also want to apply more thoughtful formatting than we did with our bare-bones application.
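A minimal sketch of that encoding step follows. The sub name escape_html is our own, and the sample referrer is made up; in practice you might prefer a maintained module over a hand-rolled routine.

```perl
use strict;
use warnings;

# Replace the characters that are significant in HTML with entities, so
# that text taken from the logfile is displayed rather than interpreted.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',  '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# A forged referrer of the kind described above (invented example).
my $referrer = 'http://example.com/?q=<script>alert(1)</script>';
print '<td>', escape_html($referrer), "</td>\n";
```

Every value that originates in the logfile (URLs, referrers, user agents) should pass through a routine like this on its way into the report.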

7.11.2.4 Add user configuration.

At the moment, the program has no way to specify which reports you want to see, or how much data to show in each report, except by editing the source code. You could add the ability to specify these things through command-line arguments using Perl's Getopt::Simple module. Or you could have a text configuration file, or even a graphical user interface.
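The sketch below shows what command-line configuration might look like. Note that it swaps in Getopt::Long, which ships with Perl, rather than Getopt::Simple; the option names --report and --rows, their defaults, and the sub name parse_options are invented for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse report options from an argument list and return them as a hash
# reference. The option names and defaults are hypothetical.
sub parse_options {
    my (@args) = @_;
    my %opt = (report => [], rows => 10);
    GetOptionsFromArray(
        \@args,
        'report=s@' => $opt{report},    # may be given more than once
        'rows=i'    => \$opt{rows},     # lines to show in each report
    ) or die "Usage: $0 [--report NAME]... [--rows N] logfile...\n";
    $opt{logfiles} = [@args];           # whatever remains is a logfile
    return \%opt;
}

my $opt = parse_options(@ARGV);
printf "Showing %d rows for: %s\n", $opt->{rows},
    @{ $opt->{report} } ? join(', ', @{ $opt->{report} }) : 'all reports';
```

With something like this in place, an invocation such as perl readlog.pl --report pages --rows 25 page.log could select a single report without touching the source.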

7.11.2.5 Improve program efficiency.

We have valued code clarity above both speed and memory use. This is usually the right choice for the majority of the code. But real web analysis typically deals with very large quantities of data, and in a production environment, certain parts of the code would have to be written to be both faster and less memory intensive. You may even want to take what you've learned in these hacks and rewrite the application in a faster language such as C++.

7.11.2.6 Add error checking.

There is insufficient error checking. While this helps the clarity of the code as a tool for demonstrating major concepts, it would not be appropriate in a production environment. For example, the program assumes that the logfile lines occur in chronological order. If you were to analyze two logfiles and specify them in the wrong order, or even if some corrupt data crept into the logfile, the results would be wrong. Logfiles are typically very large, and errors do creep into them. In addition, malicious visitors can insert arbitrary text into them. So we should be more careful about trusting the data.
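As one example of the kind of check that is missing, the following sketch separates usable lines from malformed or out-of-order ones. It assumes, purely for illustration, that each line begins with a Unix timestamp followed by a tab; the real page.log layout may differ, so adapt the pattern before relying on it.

```perl
use strict;
use warnings;

# Split lines into accepted and rejected sets.
# ASSUMPTION: each line starts with a Unix timestamp and a tab.
# A line is rejected if it does not match that shape or if its
# timestamp is earlier than the last accepted one.
sub check_lines {
    my (@lines) = @_;
    my (@good, @bad);
    my $last = 0;
    for my $line (@lines) {
        chomp(my $copy = $line);
        if ($copy =~ /^(\d+)\t/ && $1 >= $last) {
            $last = $1;
            push @good, $copy;
        }
        else {
            push @bad, $copy;    # malformed or out of order
        }
    }
    return (\@good, \@bad);
}

my ($good, $bad) = check_lines(
    "100\t/index.html\n", "90\t/late.html\n", "oops\n",
);
printf "%d accepted, %d rejected\n", scalar @$good, scalar @$bad;
```

Rejecting out-of-order lines is the simplest policy but not the only one. Sorting the input by timestamp first would rescue logfiles given in the wrong order, at the cost of holding more data in memory.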

7.11.2.7 Track exits from the site.

As we mentioned in [Hack #67], you could extend the data collection to track exits from the site. This would allow you to measure the time spent on the last page of a session. It would also allow you to see where people went when they followed links out of your site.

7.11.2.8 Add multi-session tracking functionality.

Even if the web site uses persistent cookies rather than just session cookies, there is no attempt to remember a visitor who visited yesterday. This is important for understanding the relationship between new and returning visitors and customers [Hack #89], and for attributing purchases to the lead that generated them [Hack #50]. Doing this usually requires saving the visitors in a database on disk because it is not possible to store them all in memory.
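To make the idea of an on-disk visitor store concrete, here is a small sketch that ties a hash to an SDBM file, one of the simple database formats bundled with Perl. The sub names, the choice of the cookie value as the key, and the filename are assumptions for illustration. A busy site would likely outgrow SDBM and need a full database.

```perl
use strict;
use warnings;
use Fcntl;                      # supplies O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Open (or create) the visitor database and return a reference to a
# hash whose contents are stored on disk rather than in memory.
sub open_visitor_db {
    my ($path) = @_;
    tie(my %seen, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
        or die "Cannot open visitor database $path: $!";
    return \%seen;
}

# Classify a visitor as 'new' or 'returning'. The key is the visitor's
# cookie value; the stored value is the time of the first visit.
sub note_visitor {
    my ($seen, $cookie, $time) = @_;
    return 'returning' if exists $seen->{$cookie};
    $seen->{$cookie} = $time;
    return 'new';
}

# Demonstration in a temporary directory: the second open stands in
# for tomorrow's run of the analysis program.
my $dir = tempdir(CLEANUP => 1);
my $db  = open_visitor_db("$dir/visitors");
print note_visitor($db, 'abc123', time), "\n";
untie %$db;

$db = open_visitor_db("$dir/visitors");
print note_visitor($db, 'abc123', time), "\n";
untie %$db;
```

In real use the database would live at a fixed path rather than in a temporary directory, so that it survives from one run to the next.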

7.11.2.9 Get the logfile from a remote location via FTP.

It would not be difficult to remove the requirement that the readlog.pl application lives in the same filesystem as the page.log file by using FTP to download the logfile from a remote location. The advantage of doing this would be not having to run a Perl script on your web servers.
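A download step along these lines could run before the analysis. The sketch uses the Net::FTP module distributed with Perl; the sub name fetch_logfile and every host, account, and path shown are placeholders, not values from the book.

```perl
use strict;
use warnings;
use Net::FTP;

# Download a logfile over FTP and return the local filename.
# All connection details are supplied by the caller.
sub fetch_logfile {
    my (%arg) = @_;
    my $ftp = Net::FTP->new($arg{host}, Timeout => 30)
        or die "Cannot connect to $arg{host}: $@";
    $ftp->login($arg{user}, $arg{password})
        or die "Login failed: ", $ftp->message;
    $ftp->binary;    # transfer the file byte for byte
    $ftp->get($arg{remote}, $arg{local})
        or die "Download failed: ", $ftp->message;
    $ftp->quit;
    return $arg{local};
}

# Example call (commented out because it needs a real server):
# my $file = fetch_logfile(
#     host     => 'www.example.com',
#     user     => 'stats',
#     password => 'secret',
#     remote   => '/logs/page.log',
#     local    => 'page.log',
# );
```

Once the file has been fetched, the analysis runs exactly as before, with perl readlog.pl page.log on the machine that did the download.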

Dr. Stephen Turner and Eric T. Peterson