Recipe 19.12. Web Page Access Counting

Problem

You want to count the number of times a page has been accessed.

Solution

Implement a hit counter, keyed to the page you want to count. This can be used to display a counter in the page. The same technique can be used to record other types of information as well, such as the number of times each of a set of banner ads has been served.

Discussion

This recipe discusses access counting, using hit counters for the examples. Counters that display the number of times a web page has been accessed are not such a big thing as they used to be, presumably because page authors now realize that most visitors don't really care how popular a page is. Still, the general concept has application in several contexts. For example, if you're displaying banner ads in your pages (Recipe 18.8), you may be charging vendors by the number of times you serve their ads. To do so, you need to count the number of accesses for each one. You can adapt the technique shown in this section for such purposes.

There are several methods for writing a page that displays a count of the number of times it has been accessed. The most basic is to maintain the count in a file. When the page is requested, open the file, read the count, increment it, and write the new count back to the file and display it in the page. This has the advantage of being easy to implement and the disadvantage that it requires a counter file for each page that includes a hit count. It also doesn't work properly if two clients access the page at the same time, unless you implement some kind of locking protocol in the file access procedure. It's possible to reduce counter file litter by keeping multiple counts in a single file, but that makes it more difficult to access particular values within the file, and it doesn't solve the simultaneous-access problem. In fact, it makes it worse, because a multiple-counter file has a higher likelihood of being accessed by multiple clients simultaneously than does a single-counter file. So you end up implementing storage and retrieval methods for processing the file contents, and locking protocols to keep multiple processes from interfering with each other. Hmm... those sound suspiciously like the problems that a database management system such as MySQL already takes care of! Keeping the counts in the database centralizes them into a single table, SQL provides the storage and retrieval interface, and the locking problem goes away because MySQL serializes access to the table so that clients can't interfere with each other. Furthermore, depending on how you manage the counters, you might be able to update the counter and retrieve the new sequence value using a single statement.

I'll assume that you want to log hits for more than one page. To do that, create a table that has one row for each page to be counted. This means that it's necessary to have a unique identifier for each page, so that counters for different pages don't get mixed up. You could assign identifiers somehow, but it's easier just to use the page's path within your web tree. Web programming languages typically make this path easy to obtain; in fact, we've already discussed how to do so in Section 19.1. On that basis, you can create a hitcount table as follows:

CREATE TABLE hitcount (   path  VARCHAR(255)           CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,   hits  BIGINT UNSIGNED NOT NULL,   PRIMARY KEY (path) );

This table definition involves some assumptions:

The path column that stores page pathnames has a character set of latin1 and a case-sensitive collation of latin1_general_cs. Use of a case-sensitive collation is appropriate for a web platform where pathnames are case-sensitive, such as most versions of Unix. For Windows or for HFS+ filesystems under Mac OS X, filenames are not case-sensitive, so you would choose a collation that is not case-sensitive, such as latin1_swedish_ci. If your filesystem is set up to use pathnames in a different character set, you should change the character set and collation.
The path column has a maximum length of 255 characters, which limits you to page paths no longer than that.
The path column is indexed as a PRIMARY KEY to require unique values. Either a PRIMARY KEY or UNIQUE index is required because we will implement the hit-counting algorithm using an INSERT statement with an ON DUPLICATE KEY UPDATE clause to insert a row if none exists for the page or update the row if it does exist. (Section 11.14 provides background that further explains ON DUPLICATE KEY UPDATE.)
The table is set up to count page hits for a single document tree, such as when your web server is used to serve pages for a single domain. If you institute a hit count mechanism on a host that serves multiple virtual domains, you may want to add a column for the domain name. This value is available in the SERVER_NAME value that Apache puts into your script's environment. In this case, the hitcount table index should include both the hostname and the page path.

The general logic involved in hit counter maintenance is to increment the hits column of the row for a page, and then retrieve the updated counter value:

UPDATE hitcount SET hits = hits + 1 WHERE path = 'page path'; SELECT hits FROM hitcount WHERE path = 'page path';

Unfortunately, if you use that approach, you may often not get the correct value. If several clients request the same page simultaneously, several UPDATE statements may be issued in close temporal proximity. The following SELECT statements then wouldn't necessarily get the corresponding hits value. This can be avoided by using a transaction or by locking the hitcount table, but that slows down hit counting. MySQL provides a solution that enables each client to retrieve its own count, no matter how many updates happen at the same time:

UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = 'page path'; SELECT LAST_INSERT_ID();

The basis for updating the count here is LAST_INSERT_ID( expr ), which is discussed in Section 11.14. The UPDATE statement finds the relevant row and increments its counter value. The use of LAST_INSERT_ID(hits+1) rather than just hits+1 tells MySQL to treat the value as though it were an AUTO_INCREMENT value. This allows it to be retrieved in the second statement using LAST_INSERT_ID⁠(⁠ ⁠ ⁠). The LAST_INSERT_ID⁠(⁠ ⁠ ⁠) function returns a connection-specific value, so you always get back the value corresponding to the UPDATE issued on the same connection. In addition, the SELECT statement doesn't need to query a table, so it's very fast.

However, there's still a problem here. What if the page isn't listed in the hitcount table? In that case, the UPDATE statement finds no row to modify and you get a counter value of zero. You could deal with this problem by requiring that any page that includes a hit counter must be registered in the hitcount table before the page goes online. An easier approach is to use MySQL's INSERT ... ON DUPLICATE KEY UPDATE syntax, which inserts a row with a count of 1 if it does not exist, and updates its counter if it does exist:

INSERT INTO hitcount (path,hits) VALUES('some path',LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1);

The counter value then can be retrieved as the value of the LAST_INSERT_ID⁠(⁠ ⁠ ⁠) function:

SELECT LAST_INSERT_ID();

The first time you request a count for a page, the statement inserts a row because the page won't be listed in the table yet. The statement creates a new counter and returns a value of one. For each request thereafter, the statement updates the existing row for the page with the new count. That way, web page designers can include counters in pages with no advance preparation required to initialize the hitcount table with a row for the page.

A further efficiency can be gained by eliminating the SELECT statement altogether, which is possible if your API provides a means for direct retrieval of the most recent sequence number. For example, in Perl, you can update the count and get the new value with only one SQL statement like this:

$dbh->do ("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1))            ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)",           undef, $page_path); $count = $dbh->{mysql_insertid};

To make the counter mechanism easier to use, put the code in a utility function that takes a page path as an argument and returns the count. In Perl, a hit-counting function might look like this, in which the arguments are a database handle and the page path:

sub get_hit_count { my ($dbh, $page_path) = @_; my $count;   $dbh->do ("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1))              ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)",             undef, $page_path);   $count = $dbh->{mysql_insertid};   return $count }

The CGI.pm script_name⁠(⁠ ⁠ ⁠) function returns the local part of the URL, so you use get_hit_count⁠(⁠ ⁠ ⁠) like this:

my $count = get_hit_count ($dbh, script_name ()); print p ("This page has been accessed $count times.");

The counter-update mechanism involves a single SQL statement, so it is unnecessary to use transactions or explicit table locking to prevent race conditions that might result if multiple clients simultaneously request a page.

A Ruby version of the hit counter looks like this:

def get_hit_count(dbh, page_path)   dbh.do("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1))           ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)",          page_path)   return dbh.func(:insert_id) end

Use the counter method as follows:

self_path = ENV["SCRIPT_NAME"] count = get_hit_count(dbh, self_path) page << cgi.p { "This page has been accessed " + count.to_s + " times." }

In Python, the counting function looks like this:

def get_hit_count (conn, page_path):   cursor = conn.cursor ()   cursor.execute ("""              INSERT INTO hitcount (path,hits) VALUES(%s,LAST_INSERT_ID(1))              ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)              """, (page_path,))   cursor.close ()   return (conn.insert_id ())

Use the function as follows:

self_path = os.environ["SCRIPT_NAME"] count = get_hit_count (conn, self_path) print "<p>This page has been accessed %d times.</p>" % count

The recipes distribution includes demonstration hit counter scripts for Perl, Ruby, PHP, and Python under the apache/hits directory. A JSP version is under the tomcat/mcb directory. Install any of these in your web tree, invoke it from your browser a few times, and watch the count increase. First, you'll need to create the hitcount table, as well as the hitlog table described in Section 19.13. (The hit-counting scripts show a count and also a log of the most recent hits. Section 19.13 discusses the logging mechanism.) Both tables can be created using the hits.sql script provided in the tables directory.