7.3. Server Log Analysis

Individual log records can be revealing but often even greater insights come from looking through access logs over a period of time and finding patterns in the data. There is a whole industry devoted to log analysis of large sites involved in news or e-commerce, trying to assess what visitors are most interested in, where they are coming from, how the server performs under load, and so on. I'm going to take a much simpler approach and use the tools that I have at hand to uncover some very interesting needles hidden in my haystack. Hopefully these examples will inspire you to take a closer look at your own server logs.

7.3.1. Googlebot Visits

Given that Google is such a powerful player in the field of Internet search, you might like to know how often they update their index of your site. To see how often their web robot, or spider, pays you a visit, simply search through the access log for a User-Agent string containing Googlebot. Do this using the standard Unix command grep:

     % grep -i googlebot access_log | grep 'GET / ' | more 

The first grep pulls out every Googlebot request, and the second limits the output to requests for the home page. Here is a sample of the output from my site:

     66.249.71.9 - - [01/Feb/2005:22:33:27 -0800] "GET / HTTP/1.0" 304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
     66.249.71.14 - - [02/Feb/2005:21:11:30 -0800] "GET / HTTP/1.0" 304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
     66.249.64.54 - - [03/Feb/2005:22:39:17 -0800] "GET / HTTP/1.0" 304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
     66.249.71.17 - - [04/Feb/2005:20:04:59 -0800] "GET / HTTP/1.0" 304 - "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

We can see that Googlebot comes around every day. The IP address of the machine doing the indexing varies, as does the time, but every evening one of their swarm visits my server and looks for any changes. This is quite reassuring because it means any new pages that I post on the site should be picked up within 24 hours. The next step would be to post a new page and see when that actually shows up in a search for unique text on that page.
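
If you would rather see a daily tally than page through the raw records, a few lines of Perl will produce one. This is only a sketch that assumes the combined log format used throughout this chapter; the script name is arbitrary:

     #!/usr/bin/perl -w
     # count_googlebot.pl - tally Googlebot requests for the home page by date
     use strict;

     my %visits;
     while (<>) {
         next unless /googlebot/i;        # keep only Googlebot records
         next unless /"GET \/ HTTP/;      # keep only requests for the home page
         # pull the date out of a timestamp such as [01/Feb/2005:22:33:27 -0800]
         $visits{$1}++ if /\[(\d{2}\/\w{3}\/\d{4}):/;
     }
     # note that this sorts the dates alphabetically, not chronologically
     foreach my $date (sort keys %visits) {
         print "$date  $visits{$date}\n";
     }

Run it as ./count_googlebot.pl access_log, or feed it from a pipeline in the same way as the scripts later in this section.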

7.3.2. Bad Robots

Googlebot is a polite and well-behaved robot that indexes only the pages on my site that I want it to. The first thing it does when it visits is check the file /robots.txt to see where it can and cannot crawl. Furthermore, it checks each page for the presence of a robots meta tag to see whether that particular page is not to be indexed. All robots are supposed to uphold this Robot Exclusion Standard, but not all do. Apache logs can help identify the rogues.

Create a simple page in your web tree that you will use as bait. I call my file robots_test.html:

     <html><head>
     <title>You can't get here from there</title>
     <meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
     </head><body>
     <p>You can't get here from there...</p>
     <p>This is a test page that helps identify web spiders
     that do not adhere to the robots exclusion protocol.</p>
     </body></html>

Add an entry for this file to the robots.txt file, instructing robots not to retrieve it:

     Disallow: /robots_test.html 
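
A Disallow rule only takes effect as part of a record that begins with a User-agent line, so if you are creating robots.txt from scratch the complete entry would look something like this, with the wildcard applying the rule to every robot:

     User-agent: *
     Disallow: /robots_test.html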

Place a link to the bait page on your home page, but do not enter any text between the <a> and </a> tags. This will make it invisible to the casual viewer but the robots will find it.

     <a href="robots_test.html"></a> 

Let it sit there for a week or so and then look for the filename in your logs. You might not have to wait long.

     % grep -i robots_test access_log
     220.181.26.70 - - [08/Feb/2005:10:16:31 -0800]
     "GET /robots_test.html HTTP/1.1" 200 447 "-" "sohu-search"

This tells us that a robot called sohu-search found it on the 8th of February. The file was placed there on the 7th! Further investigation tells me that this is a search engine for sohu.com, a portal site in China.
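
That sort of follow-up usually starts with the IP address. A reverse DNS lookup, followed by a whois query on the same address, will often tell you who operates the machine. Here is a minimal Perl sketch of the lookup step; the script name is arbitrary:

     #!/usr/bin/perl -w
     # lookup_host.pl - reverse DNS lookup for an IP address taken from a log record
     use strict;
     use Socket;

     my $ip   = shift or die "Usage: $0 <ip address>\n";
     my $addr = inet_aton($ip) or die "$0: not a valid IP address: $ip\n";
     my $name = gethostbyaddr($addr, AF_INET);
     print defined $name ? "$ip resolves to $name\n"
                         : "$ip has no reverse DNS entry\n";

Running whois on the same address at the command line will usually fill in which network the machine belongs to.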

7.3.3. Google Queries

An interesting search is to look for visits that originated as Google searches. Your visitor entered a specific query into Google and was led to your site. What exactly were they looking for?

This sounds like an impossible task because the search took place on Google's site, not yours. But when a visitor clicks on a link in a Google results page, the URL of that results page is passed along as the referring page, and it contains the search terms. Assuming you have been recording visits using the combined log format, you can use this command to pull out records that are the result of a link from Google:

     % grep -i google access_log | grep '[&?]q='
     [...]
     194.47.254.215 - - [07/Feb/2005:01:54:17 -0800]
     "GET /pdf_docs/oreillynet_bioinfo_compgen.pdf HTTP/1.1" 200 707249
     "http://www.google.com/search?q=comparative+analysis+genomes+
     %22complete+DNA+sequence%22+filetype:pdf&hl=en&lr=&as_qdr=all
     &start=10&sa=N"
     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2) Opera 7.54 [en]"
     [...]
     81.210.54.242 - - [07/Feb/2005:02:01:05 -0800]
     "GET /mobile/ora/apache_config.html HTTP/1.1" 200 1324
     "http://www.google.pl/search?hl=pl&q=rewrite+apache+wap&lr="
     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
     [...]

The first record is a request for a PDF file of an O'Reilly Network article in response to the query comparative analysis genomes "complete DNA sequence", and the second is a request for a page on web programming for mobile phone browsers in response to the query rewrite apache wap. Manually dissecting records is fine the first few times you try it, but it is too tedious for general use. Here are a couple of Perl scripts to make this process easier.

The first one, shown in Example 7-2, will extract specific fields from a combined format log file. You can specify whether you want the hosts that requested the pages, the referring pages, or the user agent used to make the request. The script is set up so that it can open a file or it can be used in a pipeline of several commands, which is helpful when dealing with large log files.

Example 7-2. parse_apache_log.pl
     #!/usr/bin/perl -w
     die "Usage: $0 <field> <log file>\n" unless @ARGV > 0;

     $ARGV[1] = '-' if(@ARGV == 1);
     open INPUT, "< $ARGV[1]" or
         die "$0: Unable to open log file $ARGV[1]\n";
     while(<INPUT>) {
         # capture the host at the start of the record and the last two
         # quoted fields (referer and user agent) at the end
         if(/^(\S+).*(\".*?\")\s+(\".*?\")\s*$/) {
             my $host       = $1;
             my $referer    = $2;
             my $user_agent = $3;
             if($ARGV[0] =~ /host/i) {
                 print "$host\n";
             } elsif($ARGV[0] =~ /refer/i) {
                 print "$referer\n";
             } elsif($ARGV[0] =~ /user/i) {
                 print "$user_agent\n";
             }
         }
     }
     close INPUT;

You can use it to extract the referring pages from Google using this pipe:

     % grep -i google access_log | ./parse_apache_log.pl referrer
     [...]
     http://www.google.com/search?q=comparative+analysis+genomes+
     %22complete+DNA+sequence%22+filetype:pdf&hl=en&lr=&as_qdr=all
     &start=10&sa=N
     http://www.google.pl/search?hl=pl&q=rewrite+apache+wap&lr=
     [...]

That's an improvement on the raw log file format, but it's still pretty ugly. The script shown in Example 7-3 cleans things up further.

Example 7-3. parse_google_queries.pl
     #!/usr/bin/perl -w
     die "Usage: $0 <log file>\n" unless @ARGV < 2;
     $ARGV[0] = '-' if @ARGV == 0;

     open INPUT, "< $ARGV[0]" or
         die "$0: Unable to open log file $ARGV[0]\n";
     while(<INPUT>) {
         if(/[\?\&]q=([^\&]+)/) {
             my $query = $1;
             $query =~ s/\+/ /g;     # a '+' in the query encodes a space
             # decode %XX escapes such as %22 into their characters
             $query =~ s/\%([0-9a-fA-F][0-9a-fA-F])/chr hex $1/ge;
             print "$query\n";
         }
     }
     close INPUT;

Adding it to the previous pipeline produces output like this:

     % grep -i google access_log | ./parse_apache_log.pl referrer |
     ./parse_google_queries.pl
     [...]
     comparative analysis genomes "complete DNA sequence" filetype:pdf
     rewrite apache wap
     [...]

The output of this on a large log file can make for very interesting reading. The vast majority of queries that lead people to my site concern a single article I wrote on mobile phones, but only a few are specifically about my company, which tells me I need to work on my marketing skills!
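
If you want numbers to back up that impression, tally the decoded queries and rank them by frequency. One way is simply to extend the previous pipeline with the standard sort and uniq commands:

     % grep -i google access_log | ./parse_apache_log.pl referrer |
     ./parse_google_queries.pl | sort | uniq -c | sort -rn | more

The first sort groups identical queries together, uniq -c counts each group, and the final sort -rn puts the most popular queries at the top.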


