Hack 74 Consolidate Web Server Logs



Automate log processing on a web farm.

As the administrator of multiple web servers, I ran across a few logging problems. The first was the need to collect logs from multiple web servers and move them to one place for processing. The second was the need to do a real-time tail on multiple logs so I could watch for specific patterns, clients, and URLs.

As a result, I wrote a series of Perl scripts collectively known as logproc. These scripts send the log line information to a single log host where some other log analysis tool can work on them, solving the first problem. They also multicast the log data, letting you watch live log information from multiple web servers without having to watch individual log files on each host. A primary goal is never to lose log information, so these scripts are very careful about checking exit codes and such.

The basic model is to feed logs to a program via a pipe. Apache supports this with its standard logging mechanism, and it is the only web server considered in this hack. It should be possible to make the system work with other web servers, even servers that can only write logs to a file, by using a named pipe.
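
If you need to go the named pipe route, the idea is to create a FIFO, keep a batcher reading from it, and point the other web server's log file at it. Here is a minimal sketch, assuming the paths and batcher arguments used later in this hack:

# create the FIFO and hand it to the web server user
mkfifo /home/www/logs/access.fifo
chown www:www /home/www/logs/access.fifo
# run batcher with the FIFO as its standard input, as the www user
su -m www -c '/home/www/bin/batcher site access < /home/www/logs/access.fifo' &
# finally, configure the other web server to write its access log
# to /home/www/logs/access.fifo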

I've used these scripts on production sites at a few different companies, and I've found that they handle high loads quite well.

7.7.1 logproc Described

Download logproc from http://www.peterson.ath.cx/~jlp/software/logproc.tar.gz. Then, extract it:

% gunzip logproc.tar.gz
% tar xvf logproc.tar
% ls -F logproc
./    ../    logserver.bin/    webserver.bin/
% ls -F logserver.bin
./    apache_rrd*    cleantmp*    logwatch*    mining/
../   arclogs*       collect*     meter*
% ls -F webserver.bin
./    ../    batcher*    cleantmp*    copier*

As you can see, there are two parts. One runs on each web server and the other runs on the log server.

The logs are fed to a process called batcher that runs on the web server and writes the log lines to a batch file as they are received. The batch file stays small, containing only five minutes' worth of logs. Each completed batch file moves off to a holding area. A second script on each web server, the copier, takes the completed batch files and copies them to the centralized log host. It typically runs from cron. On the log host, the collect process, also run from cron, collects the batches and sorts the log lines into the appropriate daily log files.

The system can also monitor log information in real time. Each batcher process dumps the log lines as it receives them out to a multicast group. Listener processes can retrieve those log lines and provide real-time analysis or monitoring. See the sample logwatch script included with logproc for details.

7.7.2 Preparing the Web Servers

First, create a home directory for the web server user. In this case, we'll call the user www. Make sure that www's home directory in /etc/master.passwd points to that same location, not to /nonexistent. If necessary, use vipw to modify the location in the password file.

# mkdir ~www
# chown www:www ~www
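
If you'd rather not edit the password file by hand, FreeBSD's pw(8) can show the current entry and point the account at its new home directory. This assumes the www user already exists and that /home/www is the directory you just created:

# pw usershow www
# pw usermod www -d /home/www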

Next, log in as the web server user and create a public/private SSH keypair:

# su www
% ssh-keygen -t dsa
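
Because copier will later run unattended from cron, you will probably want this key to have an empty passphrase (or arrange for ssh-agent instead). For example, to generate it non-interactively:

% ssh-keygen -t dsa -N "" -f ~/.ssh/id_dsa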

Create the directories used by the log processing tools, and copy the scripts over:

% cd ~www
% mkdir -p bin logs/{work,save}/0 logs/tmp logs/work/1
% cp $srcdir/logproc/webserver.bin/* bin/

Examine those scripts, and edit the variables listed in Table 7-1 to reflect your situation.

Table 7-1. Variables and values for logproc's web server scripts

Script      Variable            Value
----------  ------------------  -------------------------------------------------------------
batcher     $loguser            The name of the web server user
            $mcast_if           The name of the interface that can reach the log host
            $logroot            The home directory of the web server user
cleantmp    $logroot            The home directory of the web server user
copier      $loghost            The name of the host where the logs will collect
            $logroot            The home directory of the web server user
            $loghost_logroot    The directory on the collector host where the logs will be collected
            $loghost_loguser    The user on the log host who owns the logs
            $scp_prog           The full path to the scp program, plus any additional options
            $ssh_prog           The full path to ssh, plus any options

Then, make sure you have satisfied all of the dependencies for these programs:

# perl -wc batcher; perl -wc cleantmp; perl -wc copier

The only dependency you likely won't have is IO::Socket::Multicast. Install it via the /usr/ports/net/p5-IO-Socket-Multicast port on FreeBSD systems or from the CPAN site (http://www.cpan.org/).
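
For example, either of the following should install it; the port path is the one mentioned above:

# cd /usr/ports/net/p5-IO-Socket-Multicast && make install clean
# perl -MCPAN -e 'install IO::Socket::Multicast'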

Next, configure httpd.conf to log to the batcher in parallel with normal logging. Note that the batcher command line must include the instance (site, virtual, secure) and type (access, error, ssl) of logging:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \
    \"%{Cookie}i\" %v" full
CustomLog "|/home/www/bin/batcher site access" full
ErrorLog  "|/home/www/bin/batcher site error"

You can adjust the LogFormat directive as necessary to log the information you or your log summarization software needs.
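
If you also run virtual or SSL sites, give each one its own batcher instance and type. For example, inside a <VirtualHost> block you might use something like this (the instance name "virtual" is only an illustration):

CustomLog "|/home/www/bin/batcher virtual access" full
ErrorLog  "|/home/www/bin/batcher virtual error"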

Finally, restart Apache and verify that the batchers are creating batches:

# apachectl configtest
# apachectl graceful
# cd $wwwhome/logs/
# ls tmp         Should list error log files for each batcher instance
# ls work/0      Should list the working batches for each batcher instance
# ls save/0      Verify that batches have moved into the save directory after a
                 five-minute batch interval
# ls work/0      and that new batches are currently being created

7.7.3 Preparing the Log Host

Start by creating a log user to receive the logs, complete with a home directory. Become the log user and copy the public key from the web server into ~log/.ssh/authorized_keys2. Then, as the log user, create the directories the log collection tools use:

# su log
% cd ~log
% mkdir -p bin web/{work,save}/{0,1} web/tmp web/{current,archive}
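
To get each web server's public key into ~log/.ssh/authorized_keys2, one approach is to push it from the web server itself. This assumes the log user is named log, the log host is reachable as loghost, and password authentication is still available:

# su www
% cat ~/.ssh/id_dsa.pub | ssh -l log loghost 'mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys2'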

7.7.4 Testing the Configuration

From a web server (as the web server's user), ssh to the log host manually to verify that the authorized_keys2 file is set up correctly:

# su www
% ssh loghost -l loguser date

If your command fails, check that the permissions on that file are set to 600.
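
As the log user on the log host, you can tighten things up like this (sshd is also picky about the permissions on the .ssh directory itself):

% chmod 700 ~/.ssh
% chmod 600 ~/.ssh/authorized_keys2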


Then, run copier manually to verify that the log files actually make it to the log server. Watch the script's output on the web server, then check that save/0 on the log server contains the newly copied logs.
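
A manual run looks something like this; copier takes no arguments here, matching the cron entry below, and the final command simply lists the save area on the log host:

# su www
% ~/bin/copier
% ssh -l log loghost ls web/save/0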

Once you're satisfied with these manual tests, schedule cron jobs to copy the batches and clean up temporary files. These jobs should run as the web server user:

# crontab -e -u www
----------------------------- cut here -----------------------------
# copy the log files down to the collector host every 15 minutes
0,15,30,45 * * * * /home/www/bin/copier
# clean the tmp directory once an hour
0 * * * * /home/www/bin/cleantmp
----------------------------- cut here -----------------------------

Finally, wait until the next copier run and verify that the batches appear on the log host.

7.7.5 Configuring Scripts on the Log Host

You should now have several batches sitting in save/0 in the log tree. Each batch contains the log lines collected over the batch interval (by default, five minutes) and has a filename indicating the instance (site, virtual, secure), type (access, error, ssl), web server host, timestamp indicating when the batch was originally created, and PID of the batcher process that created each batch.

Now, install the log processing scripts into bin/:

# cp $srcdir/logproc/logserver.bin/{arclogs,cleantmp,collect} bin/

Edit them to have valid paths for their new location and any OS dependencies, as shown in Table 7-2.

Table 7-2. Variables and values for logproc's log host scripts

Script      Variable      Value
----------  ------------  ------------------------------------
arclogs     $logroot      The location of the logs
            $gzip_prog    The full path to the gzip binary
cleantmp    $logroot      The location of the logs
collect     $logroot      The location of the logs
            $gzip_prog    The full path to the gzip binary

Again, make sure all dependencies are satisfied:

# perl -wc arclogs; perl -wc cleantmp; perl -wc collect

If you don't have Time::ParseDate, then install it from the /usr/ports/devel/p5-Time-modules port on FreeBSD or from CPAN.

Run collect manually as the log user to verify that the log batches get collected and that log data ends up in the appropriately dated log file. Once you're satisfied, automate these tasks in a cron job for the log user:

# crontab -e -u log
----------------------------- cut here -----------------------------
# run the collector once an hour
0 * * * * /home/log/bin/collect
# clean the tmp directory once an hour
0 * * * * /home/log/bin/cleantmp
----------------------------- cut here -----------------------------

Wait until the next collect run and verify that the batches are properly collected.

Compare the collected log files with the contents of your old logging mechanism's log file on the web servers, and make sure every hit makes it into the collected log files for the day. You might want to run both logging mechanisms in parallel for several days to satisfy yourself that the system is working as expected.
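
One quick sanity check is to compare hit counts for the same day in both places. The date and filenames below are purely illustrative; substitute whatever your old CustomLog points at and wherever collect actually writes its daily files:

# on the web server: hits for one day in the old access log
% grep -c '12/Jul/2004' /var/log/httpd-access.log
# on the log host, as the log user: hits for the same day in the
# collected daily files (location assumed here to be web/current)
% cat ~/web/current/* | grep -c '12/Jul/2004'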

7.7.6 Viewing Live Log Data

The log server programs provide additional tools for monitoring and summarizing live log data. In a traditional single-web-server environment, you can always tail the log file to see what's going on. That is no longer easy to do here, because the logs are now written in small batches. (And with multiple web servers, you would have to run a separate tail process on each host anyway.)

The batcher process helps with this by multicasting the logs out to a multicast group. Use the logwatch tool on the log server to view the live log data:

% cd ~log/bin
% ./logwatch
<lines of log data spew out here>

On a high-volume web site, there is likely to be too much data to scan manually. logwatch accepts arguments to specify which type of log data you want to see. You can also specify a Perl regular expression to limit the output.

The meter script watches the log data on the multicast stream, in real time, and summarizes some information about the log data. It also stores information in an RRDTool (http://www.rrdtool.org/) database.

The mining directory contains a checklog script that produces a "top ten clients" and "top ten vhosts" report. Alternatively, you can feed the collected log files to your existing web server log processing tools.

7.7.7 See Also

  • The logproc web site (http://www.peterson.ath.cx/~jlp/software/logproc.tar.gz)


