Recipe 19.14. Using MySQL for Apache Logging


Problem

You don't want to use MySQL to log accesses for just a few pages, as shown in Section 19.13. You want to log all page accesses, and you don't want to have to put logging code in each page explicitly.

Solution

Tell Apache to log page accesses to a MySQL table.

Discussion

The uses for MySQL in a web context aren't limited just to page generation and processing. You can use it to help you run the web server itself. For example, most Apache servers are set up to log a record of page requests to a file. But it's also possible to send log records to a program instead, from which you can write the records wherever you likesuch as to a database. With log records in a database rather than a flat file, the log becomes more highly structured and you can apply SQL analysis techniques to it. Logfile analysis tools may be written to provide some flexibility, but often this is a matter of deciding which summaries to display and which to suppress. It's more difficult to tell a tool to display information it wasn't built to provide. With log entries in a table, you gain additional flexibility. Want to see a particular report? Write the SQL statements that produce it. To display the report in a specific format, issue the statements from within an API and take advantage of your language's output production capabilities.

By handling log entry generation and storage using separate processes, you gain some additional flexibility. Some of the possibilities are to send logs from multiple web servers to the same MySQL server, or to send different logs generated by a given web server to different MySQL servers.

This recipe shows how to set up web request logging from Apache into MySQL and demonstrates some summary queries you may find useful.

Setting up database logging

Apache logging is controlled by directives in the httpd.conf configuration file. For example, a typical logging setup uses LogFormat and CustomLog directives that look like this:

LogFormat "%h %l %u %t \"%r\" %>s %b" common CustomLog /usr/local/apache/logs/access_log common 

The LogFormat line defines a format for log records and gives it the nickname common. The CustomLog directive indicates that lines should be written in that format to the access_log file in Apache's logs directory. To set up logging to MySQL instead, use the following procedure:[*]

[*] If you're using logging directives such as transferLog rather than LogFormat and CustomLog, you'll need to adapt the instructions in this section.

  1. Decide what values you want to record and set up a table that contains the appropriate columns.

  2. Write a program to read log lines from Apache and write them into the database.

  3. Set up a LogFormat line that defines how to write log lines in the format the program expects, and a CustomLog directive that tells Apache to write to the program rather than to a file.

Suppose that you want to record the date and time of each request, the host that issued the request, the request method and URL pathname, the status code, the number of bytes transferred, the referring page, and the user agent (typically a browser or spider name). A table that includes columns for these values can be created as follows:

CREATE TABLE httpdlog (   dt      DATETIME NOT NULL,            # request date   host    VARCHAR(255) NOT NULL,        # client host   method  VARCHAR(4) NOT NULL,          # request method (GET, PUT, etc.)   url     VARCHAR(255)                  # URL path             CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,   status  INT NOT NULL,                 # request status   size    INT,                          # number of bytes transferred   referer VARCHAR(255),                 # referring page   agent   VARCHAR(255)                  # user agent ); 

Most of the string columns use VARCHAR and are not case-sensitive. The exception, url, is declared with a case-sensitive collation as is appropriate for a server running on a system with case-sensitive filenames. See Section 19.12 for notes on choosing the character set and collation for the path column.

The httpdlog table definition shown here doesn't include any indexes. If you plan to run summary queries, you should add appropriate indexes to the table. Otherwise, the summaries will slow down dramatically as the table becomes large. The choice of which columns to index will be based on the types of statements you intend to run to analyze the table contents. For example, statements to analyze the distribution of client host values will benefit from an index on the host column.

Next, you need a program to process log lines produced by Apache and insert them into the httpdlog table. The following script, httpdlog.pl, opens a connection to the MySQL server, and then loops to read input lines. It parses each line into column values and inserts the result into the database. When Apache exits, it closes the pipe to the logging program. That causes httpdlog.pl to see end of file on its input, terminate the loop, disconnect from MySQL, and exit.

#!/usr/bin/perl # httpdlog.pl - Log Apache requests to httpdlog table # path to directory containing Cookbook.pm (CHANGE AS NECESSARY) use lib qw(/usr/local/lib/mcb); use strict; use warnings; use Cookbook; my $dbh = Cookbook::connect (); my $sth = $dbh->prepare (qq{                   INSERT DELAYED INTO httpdlog                     (dt,host,method,url,status,size,referer,agent)                     VALUES (?,?,?,?,?,?,?,?)                   }); while (<>)  # loop reading input {   chomp;   my ($dt, $host, $method, $url, $status, $size, $refer, $agent)                           = split (/\t/, $_);   # map "-" to NULL for some columns   $size = undef if $size eq "-";   $agent = undef if $agent eq "-";   $sth->execute ($dt, $host, $method, $url,                  $status, $size, $refer, $agent); } $dbh->disconnect (); 

Install the httpdlog.pl script where you want Apache to look for it. On my system, the Apache root directory is /usr/local/apache, so /usr/local/apache/bin is a reasonable installation directory. The path to this directory will be needed shortly for constructing the CustomLog directive that instructs Apache to log to the script.

The purpose of including the use lib line is so that Perl can find the Cookbook.pm module. This line will be necessary if the environment of scripts invoked by Apache for logging does not enable Perl to find the module. Change the path as necessary for your system.

The script uses INSERT DELAYED rather than INSERT. The advantage of using DELAYED is that the MySQL server buffers the row in memory and then later inserts a batch of rows at a time, which is more efficient. This also enables the client to continue immediately rather than having to wait if the table is busy. The disadvantage is that if the MySQL server crashes, any rows buffered in memory at the time are lost. I figure that this is not very likely, and that the loss of a few log records is not a serious problem. If you disagree, just remove DELAYED from the statement.

httpdlog.pl assumes that input lines contain httpdlog column values delimited by tabs (to make it easy to break apart input lines), so Apache must write log entries in a matching format. The LogFormat field specifiers to produce the appropriate values are shown in the following table.

SpecifierMeaning
%{%Y-%m-%d %H:%M:%S}t The date and time of the request, in MySQL's DATETIME format
%h The host from which the request originated
%m The request method (get, post, and so forth)
%U The URL path
%>s The status code
%b The number of bytes transferred
%{Referer}i The referring page
%{User-Agent}i The user agent


To define a logging format named mysql that produces these values with tabs in between, add the following LogFormat directive to your httpd.conf file:

LogFormat \ "%{%Y-%m-%d %H:%M:%S}t\t%h\t%m\t%U\t%>s\t%b\t%{Referer}i\t%{User-Agent}i" \ mysql 

Most of the pieces are in place now. We have a log table, a program that writes to it, and a mysql format for producing log entries. All that remains is to tell Apache to write the entries to the httpdlog.pl script. However, until you know that the output format really is correct and that the program can process log entries properly, it's premature to tell Apache to log directly to the program. To make testing and debugging a bit easier, have Apache log mysql-format entries to a file instead. That way, you can look at the file to check the output format, and you can use it as input to httpdlog.pl to verify that the program works correctly. To instruct Apache to log lines in mysql format to the file test_log in Apache's log directory, use this CustomLog directive:

CustomLog /usr/local/apache/logs/test_log mysql 

Then restart Apache to enable the new logging directives. After your web server receives a few requests, take a look at the test_log file. Verify that the contents are as you expect, and then feed the file to httpdlog.pl:

% /usr/local/apache/bin/httpdlog.pl test_log                

After httpdlog.pl finishes, take a look at the httpdlog table to make sure that it looks correct. Once you're satisfied, tell Apache to send log entries directly to httpdlog.pl by modifying the CustomLog directive as follows:

CustomLog "|/usr/local/apache/bin/httpdlog.pl" mysql 

The | character at the beginning of the pathname tells Apache that httpdlog.pl is a program, not a file. Restart Apache and new entries should appear in the httpdlog table as visitors request pages from your site.

Nothing you have done to this point changes any logging you may have been doing originally. For example, if you were logging to an access_log file before, you still are now. Thus, Apache will be sending entries both to the original logfile and to MySQL. If that's what you want, fine. Apache doesn't care if you log to multiple destinations. But you'll use more disk space if you do. To disable file logging, comment out your original CustomLog directive by placing a # character in front of it, and then restart Apache.

Analyzing the logfile

Now that you have Apache logging into the database, what can you do with the information? That depends on what you want to know. Here are some examples that show the kinds of questions you can use MySQL to answer easily:

  • How many rows are in the request log?

    SELECT COUNT(*) FROM httpdlog; 

  • How many different client hosts have sent requests?

    SELECT COUNT(DISTINCT host) FROM httpdlog; 

  • How many different pages have clients requested?

    SELECT COUNT(DISTINCT url) FROM httpdlog; 

  • What are the 10 most popular pages?

    SELECT url, COUNT(*) AS count FROM httpdlog GROUP BY url ORDER BY count DESC LIMIT 10; 

  • How many requests have been received for those favicon.ico files that certain browsers like to check for?

    SELECT COUNT(*) FROM httpdlog WHERE url LIKE '%/favicon.ico%'; 

  • What is the range of dates spanned by the log?

    SELECT MIN(dt), MAX(dt) FROM httpdlog; 

  • How many requests have been received each day?

    SELECT DATE(dt) AS day, COUNT(*) FROM httpdlog GROUP BY day; 

    Answering this question requires stripping off the time-of-day part from the dt values so that requests received on a given date can be grouped. The statement does this using the DATE⁠(⁠ ⁠ ⁠) function to convert DATETIME values to DATE values. However, if you intend to run a lot of statements that use just the date part of the dt values, it would be more efficient to create the httpdlog table with separate DATE and TIME columns, change the LogFormat directive to produce the date and time as separate output values, and modify httpdlog.pl accordingly. Then you can operate on the request dates directly without stripping off the time, and you can index the date column for even better performance.

  • What is the hour-of-the-day request histogram?

    SELECT HOUR(dt) AS hour, COUNT(*) FROM httpdlog GROUP BY hour; 

  • What is the average number of requests received each day?

    SELECT COUNT(*)/(DATEDIFF(MAX(dt), MIN(dt)) + 1) FROM httpdlog; 

    The numerator is the total number of requests in the table. The denominator is the number of days spanned by the records.

  • What is the longest URL recorded in the table?

    SELECT MAX(LENGTH(url)) FROM httpdlog; 

    If the url column is defined as VARCHAR(255) and this statement produces a value of 255, it's likely that some URL values were too long to fit in the column and were truncated at the end. To avoid this, change the column definition to allow more characters. For example, to allow up to 5,000 characters, modify the url column as follows:

    ALTER TABLE httpdlog   MODIFY url VARCHAR(5000)   CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL; 

  • What is the total number of bytes served and the average bytes per request?

    SELECT   COUNT(size) AS requests,   SUM(size) AS bytes,   AVG(size) AS 'bytes/request' FROM httpdlog; 

    The statement uses COUNT(size) rather than COUNT(*) to count only those requests with a non-NULL size value. If a client requests a page twice, the server may respond to the second request by sending a header indicating that the page hasn't changed rather than by sending content. In this case, the log entry for the request will have NULL in the size column.

  • How much traffic has there been for each kind of file (based on filename extension such as .html, .jpg, or .php)?

    SELECT   SUBSTRING_INDEX(SUBSTRING_INDEX(url,'?',1),'.',-1) AS extension,   COUNT(size) AS requests,   SUM(size) AS bytes,   AVG(size) AS 'bytes/request' FROM httpdlog WHERE url LIKE '%.%' GROUP BY extension; 

    The WHERE clause selects only url values that have a period in them, to eliminate pathnames that refer to files that have no extension. To extract the extension values for the output column list, the inner SUBSTRING_INDEX⁠(⁠ ⁠ ⁠) call strips off any parameter string at the right end of the URL and leaves the rest. (This turns a value like /cgi-bin/script.pl?id=43 into /cgi-bin/script.pl. If the value has no parameter part, SUBSTRING_INDEX⁠(⁠ ⁠ ⁠) returns the entire string.) The outer SUBSTRING_INDEX⁠(⁠ ⁠ ⁠) call strips everything up to and including the rightmost period from the result, leaving only the extension.

Other logging issues

The preceding discussion shows a simple method for hooking Apache to MySQL, which involves writing a short script that communicates with MySQL and then telling Apache to write to the script rather than to a file. This works well if you log all requests to a single file, but certainly won't be appropriate for every possible configuration that Apache is capable of. For example, if you have virtual servers defined in your httpd.conf file, you might have separate CustomLog directives defined for each of them. To log them all to MySQL, you can change each directive to write to httpdlog.pl, but then you'll have a separate logging process running for each virtual server. That brings up two issues:

  • How do you associate log records with the proper virtual server? One solution is to create a separate log table for each server and modify httpdlog.pl to take an argument that indicates which table to use. Another is to use a table that has a vhost column, an Apache log format that includes the %v virtual host format specifier, and a logging script that uses the vhost value when it generates INSERT statements. The apache/httpdlog directory of the recipes distribution contains information about doing this.

  • Do you really want a lot of httpdlog.pl processes running? If you have many virtual servers, you may want to consider using a logging module that installs directly into Apache. Some of these can multiplex logging for multiple virtual hosts through a single connection to the database server, reducing resource consumption for logging activity.

Logging to a database rather than to a file enables you to bring the full power of MySQL to bear on log analysis, but it doesn't eliminate the need to think about space management. Web servers can generate a lot of activity, and log records use space regardless of whether you write them to a file or to a database. One way to save space is to expire records now and then. For example, to remove log records that are more than a year old, run the following statement periodically:

DELETE FROM httpdlog WHERE dt < NOW() - INTERVAL 1 YEAR; 

If you have MySQL 5.1 or higher, you can set up an event that runs the DELETE statement on a scheduled basis (Section 16.7).

With respect to disk space consumed by web logging activity, be aware that if you have query logging enabled for the MySQL server, each request will be written to the httpdlog table and also to the query logfile. Thus, you may find disk space disappearing more quickly than you expect, so it's a good idea to have some kind of logfile rotation or expiration set up for the MySQL server.




MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2004
Pages: 375
Authors: Paul DuBois

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net