Apache can, and usually does, record information about every request it processes. Controlling how this is done and extracting useful information out of these logs after the fact is at least as important as gathering the information in the first place.

The logfiles may record two types of data: information about the request itself, and possibly one or more messages about abnormal conditions encountered during processing (such as file permission problems). You, as the webmaster, have a limited amount of control over the logging of error conditions, but a great deal of control over the format and amount of information logged about request processing (activity logging). The server may log activity information about a request in multiple formats in multiple logfiles, but it will only record a single copy of an error message.

One aspect of activity logging you should be aware of is that the log entry is formatted and written after the request has been completely processed. This means that the interval between the time a request begins and when it finishes may be long enough to make a difference. For example, if your logfiles are rotated while a particularly large file is being downloaded, the log entry for the request will appear in the new logfile when the request completes, rather than in the old logfile when the request was started. In contrast, an error message is written to the error log as soon as it is encountered.

The web server will continue to record information in its logfiles as long as it's running. This can result in extremely large logfiles for a busy site and uncomfortably large ones even for a modest site. To keep the file sizes from growing ever larger, most sites rotate or roll over their logfiles on a semi-regular basis. Rolling over a logfile simply means persuading the server to stop writing to the current file and start recording to a new one.
Due to Apache's determination to see that no records are lost, cajoling it to do this according to a specific timetable may require a bit of effort; some of the recipes in this chapter cover how to accomplish the task successfully and reliably (see Recipe 3.8 and Recipe 3.9).

The log declaration directives, CustomLog and ErrorLog, can appear inside <VirtualHost> containers, outside them (in what's called the main or global server, or sometimes the global scope), or both. Entries will only be logged in one set or the other; if a <VirtualHost> container applies to the request or error and has an applicable log directive, the message will be written only there and won't appear in any globally declared files. On the other hand, if no <VirtualHost> log directive applies, the server will fall back on logging the entry according to the global directives.

However, whichever scope is used for determining what logging directives to use, all CustomLog directives in that scope are processed and treated independently. That is, if you have one CustomLog directive in the global scope and two inside a <VirtualHost> container, both of the directives inside the container will be used for requests handled by that virtual host, and the global one will not. Similarly, if a CustomLog directive uses the env= option, it has no effect on what requests will be logged by other CustomLog directives in the same scope.

Activity logging has been around since the Web first appeared, and it didn't take long for the original users to decide what items of information they wanted logged. The result is called the common log format (CLF).
In Apache terms, this format is:

    "%h %l %u %t \"%r\" %>s %b"

That is, it logs the client's hostname or IP address, the name of the user on the client (as defined by RFC 1413 and if Apache has been told to snoop for it with an IdentityCheck On directive), the username with which the client authenticated (if weak access controls are being imposed by the server), the time at which the request was received, the actual HTTP request line, the final status of the server's processing of the request, and the number of bytes of content that were sent in the server's response.

Before long, as the HTTP protocol advanced, the common log format was found to be wanting, so an enhanced format, called the combined log format, was created:

    "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

The two additions were the Referer (it's spelled incorrectly in the specifications) and the User-agent. These are, respectively, the URL of the page that linked to the document being requested, and the name and version of the browser or other client software making the request.

Both of these formats are widely used, and many logfile analysis tools assume log entries are made in one or the other. The Apache web server's standard activity logging module allows you to create your own formats; it is highly configurable and is called (surprise!) mod_log_config. Apache 2.0 has an additional module, mod_logio, which enhances mod_log_config with the ability to log the number of bytes actually transmitted or received over the network. If these don't meet your requirements, though, there are a significant number of third-party modules available from the module registry at http://modules.apache.org/.

The status code entry in the common and combined log formats deserves some mention, because its meaning is not immediately clear. The status codes are defined by the HTTP protocol specification documents (currently RFC 2616 at ftp://ftp.isi.edu/in-notes/rfc2616.txt).
Table 3-1 gives a brief description of the codes defined at the time of this writing.
The one-line descriptions shown in Table 3-1 are sometimes terse to the point of being confusing, but they should at least give you an inkling of what the server thinks happened. The first digit is used to separate the codes into classes or categories; for example, all codes starting with 5 indicate there is a problem handling the request, and the server thinks the problem is on its end rather than on the client's end. For a complete description of the various status codes, you'll need to read a document about the HTTP protocol or the RFC itself.
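To tie together the format strings and scoping rules described earlier in this introduction, the following configuration fragment is a minimal sketch rather than a complete or recommended setup. The logfile paths, the www.example.com hostname, the is_image variable name, and the format nicknames (common, combined, combinedio) are all assumptions chosen for illustration; the SetEnvIf line additionally assumes mod_setenvif is loaded, and name-based virtual hosting may need further directives depending on your Apache version.

```apache
# Give the two standard formats nicknames (the nicknames are arbitrary labels).
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

# Only if mod_logio is present: %I and %O add the bytes actually
# received and sent over the network.
<IfModule mod_logio.c>
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %I %O" combinedio
</IfModule>

# Global scope: used for requests and errors that are not covered by
# an applicable log directive inside a <VirtualHost> container.
ErrorLog logs/error_log
CustomLog logs/access_log common

<VirtualHost *:80>
    # Hypothetical hostname for illustration.
    ServerName www.example.com

    # Errors for this virtual host go here, not to the global error log.
    ErrorLog logs/example-error_log

    # Both CustomLog directives below are processed independently.
    # Every request to this virtual host is written to the first file...
    CustomLog logs/example-access_log combined

    # ...and only requests for which is_image is set are also written to
    # the second. The env= test does not affect the first directive.
    SetEnvIf Request_URI "\.(gif|jpg|png)$" is_image
    CustomLog logs/example-images_log common env=is_image
</VirtualHost>
```

With a configuration along these lines, a request for an image on www.example.com would be expected to appear in both example-access_log and example-images_log, but not in the global access_log, since the virtual host's own CustomLog directives take precedence over the global one.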