Using MySQL for Apache Logging

18.15.1 Problem

You don't want to use MySQL to log accesses for just a few pages, as shown in Recipe 18.14. You want to log all pages accesses, and you don't want to have to put logging actions in each page explicitly.

18.15.2 Solution

Tell Apache to log pages accesses to MySQL.

18.15.3 Discussion

The uses for MySQL in a web context aren't limited just to page generation and processing. You can use it to help you run the web server itself. For example, most Apache servers are set up to log a record of web requests to a file. But it's also possible to send log records to a program instead, from which you can write the records wherever you likesuch as to a database. With log records in a database rather than a flat file, the log becomes more highly structured and you can apply SQL analysis techniques to it. Log file analysis tools may be written to provide some flexibility, but often this is a matter of deciding which summaries to display and which to suppress. It's more difficult to tell a tool to display information it wasn't built to provide. With log entries in a table, you gain additional flexibility. Want to see a particular report? Write the SQL statements that produce it. To display the report in a specific format, issue the queries from within an API and take advantage of your language's output production capabilities.

By handling log entry generation and storage using separate processes, you gain some additional flexibility. Some of the possibilities are to send logs from multiple web servers to the same MySQL server, or to send different logs generated by a given web server to different MySQL servers.

This section shows how to set up web request logging from Apache into MySQL and demonstrates some summary queries you may find useful.

18.15.4 Setting Up Database Logging

Apache logging is controlled by directives in the httpd.conf configuration file. For example, a typical logging setup uses LogFormat and CustomLog directives that look like this:

LogFormat "%h %l %u %t "%r" %>s %b" common
CustomLog /usr/local/apache/logs/access_log common

The LogFormat line defines a format for log records and gives it the nickname common. The CustomLog directive indicates that lines should be written in that format to the access_log file in Apache's logs directory. To set up logging to MySQL instead, use the following procedure:[4]

[4] If you're using logging directives such as TransferLog rather than LogFormat and CustomLog, you'll need to adapt the instructions in this section.

  • Decide what values you want to record and set up a table that contains the appropriate columns.
  • Write a program to read log lines from Apache and write them into the database.
  • Set up a LogFormat line that defines how to write log lines in the format the program expects, and a CustomLog directive that tells Apache to write to the program rather than to a file.

Suppose you want to record the date and time of each request, the host that issued the request, the request method and URL pathname, the status code, the number of bytes transferred, and the user agent (typically a browser or spider name). A table that includes columns for these values can be created as follows:

CREATE TABLE httpdlog
(
 dt DATETIME NOT NULL, # request date
 host VARCHAR(255) NOT NULL, # client host
 method VARCHAR(4) NOT NULL, # request method (GET, PUT, etc.)
 url VARCHAR(255) BINARY NOT NULL, # URL path
 status INT NOT NULL, # request status
 size INT, # number of bytes transferred
 agent VARCHAR(255) # user agent
);

Most of the string columns use VARCHAR and are not case sensitive. The exception, url, is declared as a binary string as is appropriate for a server running on a system with case-sensitive filenames. If you're using a server where URL lettercase doesn't matter, you can omit the word BINARY.

The httpdlog table definition shown here doesn't include any indexes. You should add some, because otherwise any summary queries you run will slow down dramatically as the table becomes large. The choice of which columns to index will be based on the types of queries you intend to run to analyze the table contents. For example, queries to analyze the distribution of client host values will benefit from an index on the host column.

Next, you need a program to process log lines produced by Apache and insert them into the httpdlog table. The following script, httpdlog.pl, opens a connection to the MySQL server, then loops to read input lines. It parses each line into column values and inserts the result into the database. When Apache exits, it closes the pipe to the logging program. That causes httpdlog.pl to see end of file on its input, terminate the loop, disconnect from MySQL, and exit.

#! /usr/bin/perl -w
# httpdlog.pl - Log Apache requests to httpdlog table

use strict;
use lib qw(/usr/local/apache/lib/perl);
use Cookbook;

my $dbh = Cookbook::connect ( );
my $sth = $dbh->prepare (qq{
 INSERT INTO httpdlog (dt,host,method,url,status,size,agent)
 VALUES (?,?,?,?,?,?,?)
 });

while (<>) # loop reading input
{
 chomp;
 my ($dt, $host, $method, $url, $status, $size, $agent)
 = split (/	/, $_);
 # map "-" to NULL for some columns
 $size = undef if $size eq "-";
 $agent = undef if $agent eq "-";
 $sth->execute ($dt, $host, $method, $url, $status, $size, $agent);
}

$dbh->disconnect ( );
exit (0);

Install the httpdlog.pl script where you want Apache to look for it. On my system, the Apache root directory is /usr/local/apache, so /usr/local/apache/bin is a reasonable installation directory. The path to this directory will be needed shortly for constructing the CustomLog directive that instructs Apache to log to the script.

httpdlog.pl assumes that input lines contain httpdlog column values delimited by tabs (to make it easy to break apart input lines), so Apache must write log entries in a matching format. The LogFormat field specifiers to produce the appropriate values are as follows:

%{%Y-%m-%d %H:%M:%S}

The date and time of the request, in MySQL's DATETIME format

%h

The host from which the request originated

%m

The request method (GET, POST, and so forth)

%U

The URL path

%>s

The status code

%b

The number of bytes transferred

%{User-Agent}i

The user agent

To define a logging format named mysql that produces these values with tabs in between, add the following LogFormat directive to your httpd.conf file:

LogFormat "%{%Y-%m-%d %H:%M:%S}t	%h	%m	%U	%>s	%b	%{User-Agent}i" mysql

Most of the pieces are in place now. We have a log table, a program that writes to it, and a mysql format for producing log entries. All that remains is to tell Apache to write the entries to the httpdlog.pl script. However, until you know that the output format really is correct and that the program can process log entries properly, it's premature to tell Apache to log directly to the program. To make testing and debugging a bit easier, have Apache log mysql-format entries to a file instead. That way, you can look at the file to check the output format, and you can use it as input to httpdlog.pl to verify that the program works correctly. To instruct Apache to log lines in mysql format to the file test_log in Apache's log directory, use this CustomLog directive:

CustomLog /usr/local/apache/logs/test_log mysql

Then restart Apache to enable the new logging directives. After your web server receives a few requests, take a look at the test_log file. Verify that the contents are as you expect, then feed the file to httpdlog.pl. If you're in Apache's logs directory and the bin and logs directories are both under the Apache root, the command looks like this:

% ../bin/httpdlog.pl test_log

After httpdlog.pl finishes, take a look at the httpdlog table to make sure that it looks correct. Once you're satisfied, tell Apache to send log entries directly to httpdlog.pl by modifying the CustomLog directive as follows:

CustomLog "|/usr/local/apache/bin/httpdlog.pl" mysql

The | character at the beginning of the pathname tells Apache that httpdlog.pl is a program, not a file. Restart Apache and new entries should appear in the httpdlog table as visitors request pages from your site.

Nothing you have done to this point changes any logging you may have been doing originally. For example, if you were logging to an access_log file before, you still are now. Thus, Apache will be sending entries both to the original log file and to MySQL. If that's what you want, fine. Apache doesn't care if you log to multiple destinations. But you'll use more disk space if you do. To disable file logging, comment out your original CustomLog directive by placing a # character in front of it, then restart Apache.

18.15.5 Analyzing the Log File

Now that you have Apache logging into the database, what can you do with the information? That depends on what you want to know. Here are some examples that show the kinds of questions you can use MySQL to answer easily:

  • How many records are in the request log?

    SELECT COUNT(*) FROM httpdlog;
  • How many different client hosts have sent requests?

    SELECT COUNT(DISTINCT host) FROM httpdlog;
  • How many different pages have clients requested?

    SELECT COUNT(DISTINCT url) FROM httpdlog;
  • What are the ten most popular pages?

    SELECT url, COUNT(*) AS count FROM httpdlog
    GROUP BY url ORDER BY count DESC LIMIT 10;
  • How many requests have been received for those useless, wretched favicon.ico files that certain browsers like to check for?

    SELECT COUNT(*) FROM httpdlog WHERE url LIKE '%/favicon.ico%';
  • What is the range of dates spanned by the log?

    SELECT MIN(dt), MAX(dt) FROM httpdlog;
  • How many requests have been received each day?

    SELECT FROM_DAYS(TO_DAYS(dt)) AS day, COUNT(*) FROM httpdlog GROUP BY day;

    Answering this question requires stripping off the time-of-day part from the dt values so that requests received on a given date can be grouped. The query does this using TO_DAYS( ) and FROM_DAYS( ) to convert DATETIME values to DATE values. However, if you intend to run a lot of queries that use just the date part of the dt values, it would be more efficient to create the httpdlog table with separate DATE and TIME columns, change the LogFormat directive to produce the date and time as separate output values, and modify httpdlog.pl accordingly. Then you can operate on the request dates directly without stripping off the time, and you can index the date column for even better performance.

  • What is the hour-of-the-day request histogram?

    SELECT HOUR(dt) AS hour, COUNT(*) FROM httpdlog GROUP BY hour;
  • What is the average number of requests received each day?

    SELECT COUNT(*)/(TO_DAYS(MAX(dt)) - TO_DAYS(MIN(dt)) + 1) FROM httpdlog;

    The numerator is the total number of requests in the table. The denominator is the number of days spanned by the records.

  • What is the longest URL recorded in the table?

    SELECT MAX(LENGTH(url)) FROM httpdlog;

    If the url column is defined as VARCHAR(255) and this query produces a value of 255, it's likely that some URL values were too long to fit in the column and were truncated at the end. To avoid this, you can convert the column to BLOB or TEXT (depending on whether or not you want the values to be case sensitive). For example, if you want case-sensitive values up to 65,535 characters long, modify the url column as follows:

    ALTER TABLE httpdlog MODIFY url BLOB NOT NULL;
  • What is the total number of bytes served and the average bytes per request?

    SELECT
     COUNT(size) AS requests,
     SUM(size) AS bytes,
     AVG(size) AS 'bytes/request'
    FROM httpdlog;

    The query uses COUNT(size) rather than COUNT(*) to count only those requests with a non-NULL size value. (If a client requests a page twice, the server may respond to the second request by sending a header indicating that the page hasn't changed rather than by sending content. In this case, the log entry for the request will have NULL in the size column.)

  • How much traffic has there been for each kind of file (based on filename extension such as .html, .jpg, or .php)?

    SELECT
     SUBSTRING_INDEX(SUBSTRING_INDEX(url,'?',1),'.',-1) AS extension,
     COUNT(size) AS requests,
     SUM(size) AS bytes,
     AVG(size) AS 'bytes/request'
    FROM httpdlog
    WHERE url LIKE '%.%'
    GROUP BY extension;

    The WHERE clause selects only url values that have a period in them, to eliminate pathnames that name files that have no extension. To extract the extension values for the output column list, the inner SUBSTRING_INDEX( ) call strips off any parameter string at the right end of the URL and leaves the rest. (This turns a value like /cgi-bin/script.pl?id=43 into /cgi-bin/script.pl. If the value has no parameter part, SUBSTRING_INDEX( ) returns the entire string.) The outer SUBSTRING_INDEX( ) call strips everything up to and including the rightmost period from the result, leaving only the extension.

18.15.6 Other Logging Issues

I've chosen a simple method for hooking Apache to MySQL, which is to write a short script that communicates with MySQL and then tell Apache to write to the script rather than to a file. This works well if you log all requests to a single file, but certainly won't be appropriate for every possible configuration that Apache is capable of. For example, if you have virtual servers defined in your httpd.conf file, you might have separate CustomLog directives defined for each of them. To log them all to MySQL, you can change each directive to write to httpdlog.pl, but then you'll have a separate logging process running for each virtual server. That brings up two issues:

  • How do you associate log records with the proper virtual server? One solution is to create a separate log table for each server and modify httpdlog.pl to take an argument that indicates which table to use. Another is to add a virt_host column to the httpdlog table and modify httpdlog.pl to take a hostname argument indicating a server name to write to the virt_host column.
  • Do you really want a lot of httpdlog.pl processes running? If you have many virtual servers, you may want to consider using a logging module that installs directly into Apache. Some of these can multiplex logging for multiple virtual hosts through a single connection to the database server, reducing resource consumption for logging activity.

Logging to a database rather than to a file allows you to bring the full power of MySQL to bear on log analysis, but it doesn't eliminate the need to think about space management. Web servers can generate a lot of activity, and log records use space regardless of whether you write them to a file or to a database. One way to save space is to expire records now and then. For example, to remove log records that are more than a year old, run the following query periodically:

DELETE FROM httpdlog WHERE dt < DATE_SUB(NOW( ),INTERVAL 1 YEAR);

Another option is to archive old records into compressible tables. (This requires that you use MyISAM tables so that you can compress them with the myisampack utility.) For example, when the date changes from September 2001 to October 2001, you know that Apache won't generate any more records with September dates and that you can move them into another table that will remain static. Create a table named httpdlog_2001_09 that has the same structure as httpdlog (including any indexes). Then transfer September's log records from httpdlog into httpdlog_2001_09 using these queries:

INSERT INTO httpdlog_2001_09
 SELECT * FROM httpdlog
 WHERE dt >= '2001-09-01' AND dt < '2001-10-01';
DELETE FROM httpdlog
 WHERE dt >= '2001-09-01' AND dt < '2001-10-01';

Finally, run myisampack on httpdlog_2001_09 to compress it and make it read-only.

This strategy has the potential drawback of spreading log entries over many tables. If you want to treat the tables as a single entity so that you can run queries on your entire set of log records, create a MERGE table that includes them all. Suppose the set of tables includes the current table and tables for September 2001 through April 2002. The statement to create the MERGE table would look like this:

CREATE TABLE httpdlog_all
(
 dt DATETIME NOT NULL, # request date
 host VARCHAR(255) NOT NULL, # client host
 method VARCHAR(4) NOT NULL, # request method (GET, PUT, etc.)
 url VARCHAR(255) BINARY NOT NULL, # URL path
 status INT NOT NULL, # request status
 size INT, # number of bytes transferred
 agent VARCHAR(255) # user agent
)
TYPE = MERGE
UNION = (httpdlog, httpdlog_2001_09, httpdlog_2001_10, httpdlog_2001_11,
httpdlog_2001_12, httpdlog_2002_01, httpdlog_2002_02, httpdlog_2002_03,
httpdlog_2002_04);

The UNION clause should name all the tables that you want to include in the MERGE table. Note that you'll need to drop and recreate the httpdlog_all definition each time you generate a new static monthly log table. (Also, if you add an index, you'll need to add it to each of the individual tables, and recreate the MERGE table to include the index definition as well.)

Reports run against the httpdlog_all table will be based on all log entries. To produce monthly reports, just refer to the appropriate individual table.

With respect to disk space consumed by web logging activity, be aware that if you have query logging enabled for the MySQL server, each request will be written to the httpdlog table and also to the query log. Thus, you may find disk space disappearing more quickly than you expect, so it's a good idea to have some kind of log rotation or expiration set up for the MySQL server.

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References



MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2005
Pages: 412
Authors: Paul DuBois

Flylib.com © 2008-2020.
If you may any questions please contact us: flylib@qtcs.net