Hack 100. Build Your Own Web Measurement Application: Reporting
Now that the application is built and you're collecting and analyzing data, the time has come to think about next steps. We make a handful of suggestions about other things this application could do to improve data collection, performance, or reliability.
As we've gone through the book, we've built a program to collect data from visits to your web site, and a program to analyze the data and produce some basic statistics. In this final hack, we're going to show you how to run the application and discuss ways in which it could be extended.
You can get both programs from:
You will also find a sample logfile there to allow you to run the second program without waiting to collect any data of your own first.
7.11.1. Running the Application
The data collection relies on JavaScript code that is embedded in each page you want tracked. All you need to do is add the tag we developed earlier near the top of the <body> element of each of your web pages.
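A minimal sketch of such a tag looks like the following; the script name pagetag.js and the host are placeholders for whatever you set up when building the collector, not values prescribed here:

    <!-- placeholder page tag; substitute the script you built earlier -->
    <script type="text/javascript" src="http://www.yoursite.com/pagetag.js"></script>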
To process the resulting logfile, you'll use the readlog.pl Perl script that we developed throughout most of the book. To execute this program from the command line, assuming the logfile page.log is in the same directory as readlog.pl, all you need to do (assuming you've already installed the GeoIP modules) is type:
perl readlog.pl page.log
And again, the sample output will look something like Figure 7-15.
Figure 7-15. Output from the readlog.pl application
7.11.2. Extending This Application
The program really only gives a taste of how web analytics programs work; you're probably already thinking of several great ways to extend this application, building something that suits your specific needs and takes the other ideas in this book into consideration. Here are some possible extensions.
7.11.2.1. Add simple visitor segmentation.
The program has no way to segment visitors so that you can compare two groups against each other, or one group against the whole population. As you've learned while reading the rest of this chapter, the most interesting use of web measurement is usually making comparisons.
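As a rough sketch of what segmentation could look like, the following fragment compares pages per visit for search-engine referrals against all visits. It assumes a hypothetical %sessions hash mapping session IDs to records with referrer and pages fields; the real script's data structures will differ.

    # Compare a segment (search-engine referrals) to the whole population.
    my ($seg_pages, $seg_count, $all_pages, $all_count) = (0, 0, 0, 0);
    foreach my $id (keys %sessions) {
        my $s = $sessions{$id};
        $all_pages += $s->{pages};
        $all_count++;
        if ($s->{referrer} =~ /google|yahoo|msn/i) {
            $seg_pages += $s->{pages};
            $seg_count++;
        }
    }
    printf "Search segment: %.1f pages/visit; all visits: %.1f pages/visit\n",
        $seg_count ? $seg_pages / $seg_count : 0,
        $all_count ? $all_pages / $all_count : 0;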
7.11.2.2. Clean up duplicate pages.
There is no attempt to combine URLs that refer to the same file. For example, /index.html is usually the same file as /, and /page is often the same as /page/. These transformations should happen automatically, in the same way that we converted filenames to lowercase on a case-insensitive filesystem. In addition, the user should be able to specify other transformations to apply, such as ignoring certain URL parameters. You may want to build in a transformation filter or table that will resolve these kinds of very common problems.
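One way to build such a filter is a single normalization routine that every URL passes through before it is counted. The rules below (default documents, trailing slashes, a short list of ignorable parameters) are assumptions meant to illustrate the idea; tune them to your own site:

    sub normalize_url {
        my ($url) = @_;
        my ($path, $query) = split /\?/, $url, 2;
        $path =~ s{/(index|default)\.(html?|asp|php)$}{/};      # default documents
        $path .= '/' if $path !~ m{/$} && $path !~ m{\.[^/]+$}; # bare directory names
        if (defined $query) {
            # drop parameters that don't identify distinct content
            my @keep = grep { !/^(sessionid|sid|utm_\w+)=/i } split /&/, $query;
            $query = join '&', @keep;
        }
        return defined $query && length $query ? "$path?$query" : $path;
    }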
7.11.2.3. Improve the reporting.
The reporting for the basic system could easily be converted from plain text into a more dynamic HTML format. If you choose to rewrite it in HTML, be careful to encode non-alphanumeric characters in the output to prevent a type of attack known as a cross-site scripting attack. This occurs when a visitor to your web site pretends that his referrer is some piece of malicious script, which would then run in the browser of anyone viewing the report.
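The CPAN HTML::Entities module does this encoding for you; a minimal sketch (assuming $referrer holds a raw value taken from the logfile) might be:

    use HTML::Entities;
    # turn &, <, >, and " into entities so logged values can't inject markup
    my $safe_referrer = encode_entities($referrer);
    print "<td>$safe_referrer</td>\n";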
7.11.2.4. Add user configuration.
At the moment, the program has no way to specify which reports you want to see, or how much data to show in each report, except by editing the source code. You could add the ability to specify these things through command-line arguments using a module such as Perl's standard Getopt::Long. Or you could use a text configuration file, or even a graphical user interface.
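A sketch of the command-line approach using Getopt::Long (the option names here are invented for illustration):

    use Getopt::Long;
    my $report = 'all';   # which report(s) to print
    my $top    = 10;      # how many rows to show in each report
    GetOptions('report=s' => \$report, 'top=i' => \$top)
        or die "Usage: $0 [--report name] [--top n] logfile\n";

You could then run, say, perl readlog.pl --report pages --top 20 page.log.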
7.11.2.5. Improve program efficiency.
We have valued code clarity above both speed and memory requirements. This is usually the right choice for the majority of the code. But real web analysis typically deals with very large amounts of data, and in a production environment, certain parts of the code would have to be rewritten to be both faster and less memory hungry. You may even want to take what you've learned in these hacks and rewrite the application in a faster language such as C++.
7.11.2.6. Add error checking.
There is insufficient error checking. While this helps the clarity of the code as a tool for demonstrating major concepts, it would not be appropriate in a production environment. For example, the program assumes that the logfile lines occur in chronological order. If you were to analyze two logfiles and specify them in the wrong order, or even if some corrupt data crept into the logfile, the results would be wrong. Logfiles are typically very large, and errors do creep into them. In addition, malicious visitors can insert arbitrary text into them. So we should be more careful about trusting the data.
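As one sketch of the kind of check that helps, the fragment below refuses to process a line whose timestamp goes backward. It assumes the reading loop uses a filehandle LOG and a hypothetical parse_time() helper; the real script's parsing differs:

    my $last_time = 0;
    while (my $line = <LOG>) {
        my $time = parse_time($line);   # however the script extracts the epoch time
        if ($time < $last_time) {
            warn "line $.: timestamp goes backward, skipping\n";
            next;
        }
        $last_time = $time;
        # ...normal processing continues here
    }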
7.11.2.7. Track exits from the site.
As we mentioned earlier in the book, you could extend the data collection to track exits from the site. This would allow you to measure the time spent on the last page of a session. It would also allow you to see where people went when they followed links out of your site.
7.11.2.8. Add multi-session tracking functionality.
Even if the web site uses persistent cookies rather than just session cookies, there is no attempt to remember a visitor who has visited before. This is important for understanding the relationship between new and returning visitors and customers, and for attributing purchases to the lead that generated them. Doing this usually requires saving the visitors in a database on disk, because it is not possible to store them all in memory.
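A simple way to do this in Perl is to tie a hash to an on-disk file with the DB_File module. The sketch below assumes a visitor's cookie ID in $cookie_id and the current request time in $time; both names are placeholders:

    use DB_File;
    tie my %last_visit, 'DB_File', 'visitors.db'
        or die "Cannot open visitors.db: $!";
    if (exists $last_visit{$cookie_id}) {
        # a returning visitor: this cookie was seen in an earlier run
    }
    $last_visit{$cookie_id} = $time;    # remember the visit for next time
    untie %last_visit;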
7.11.2.9. Get the logfile from a remote location via FTP.
It would not be difficult to remove the requirement that the readlog.pl application live on the same filesystem as the page.log file: you could use FTP to download the logfile from a remote location before analyzing it. The advantage of doing this would be not having to run a Perl script on your web servers.
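The standard Net::FTP module makes this straightforward; in the sketch below, the host name, login details, and remote path are placeholders:

    use Net::FTP;
    my $ftp = Net::FTP->new('www.yoursite.com')
        or die "Cannot connect: $@";
    $ftp->login('username', 'password')
        or die "Cannot log in: ", $ftp->message;
    $ftp->get('/logs/page.log', 'page.log')   # fetch the remote log locally
        or die "Cannot download: ", $ftp->message;
    $ftp->quit;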
Dr. Stephen Turner and Eric T. Peterson