Building such a beast may sound intimidating, but it is actually much simpler and more elegant than the traditional logging approaches. You need only three components: a mechanism for publication, a subscriber that can satisfy on-disk log aggregation needs, and a substrate over which they will operate.
We've already discussed Spread briefly in previous chapters, and we will put it to good use again here. Spread is a fast and efficient messaging bus that can provide exactly the publish/subscribe substrate we are looking for to solve the problem at hand.
We will install Spread on each web server, log host, and monitor server in our architecture; launch it; and verify that it works. This is the first step of using any underlying framework, networking system, or messaging substrate. Because several chapters in this book refer to using Spread, its configuration and installation are described in detail in Appendix A, "Spread."
Before we jump into how to publish logs into a substrate on which no one will be listening, it seems reasonable to first tackle the issue of journaling said logs to storage. You might think that the tool to accomplish such a task would be simple because it is responsible only for reading such messages from Spread and writing those messages to disk. If you thought that, you were right; the tool is brain-dead simple, and it is called spreadlogd.
spreadlogd is a simple C program that uses the Spread C API to connect to Spread, subscribe to a predefined set of groups, and journal the messages it reads from those groups to a set of files. It is amazingly simple.
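The journaling step itself can be sketched in a few lines. The following Python sketch is not spreadlogd's source (spreadlogd is written in C against the Spread C API). It assumes the Spread connection has been replaced by a plain iterable of (group, message) pairs, and the names journal and demo are hypothetical, chosen only for illustration.

```python
import os
import tempfile
from typing import Dict, Iterable, Tuple


def journal(messages: Iterable[Tuple[str, str]],
            group_to_path: Dict[str, str]) -> int:
    """Append each (group, message) pair to the file mapped to its group.

    Messages for groups missing from the mapping are skipped, much as a
    subscriber never sees traffic for groups it has not joined.
    Returns the number of messages written.
    """
    written = 0
    handles = {}
    try:
        for group, message in messages:
            path = group_to_path.get(group)
            if path is None:
                continue
            if group not in handles:
                directory = os.path.dirname(path)
                if directory:
                    os.makedirs(directory, exist_ok=True)
                # Append mode: a journal only ever grows.
                handles[group] = open(path, "a", encoding="utf-8")
            handles[group].write(message.rstrip("\n") + "\n")
            handles[group].flush()
            written += 1
    finally:
        for handle in handles.values():
            handle.close()
    return written


def demo() -> Dict[str, str]:
    """Journal a few sample messages under a temporary directory."""
    root = tempfile.mkdtemp()
    paths = {
        "logdomain1": os.path.join(root, "logdomain1", "debug_log"),
        "logdomain2": os.path.join(root, "logdomain2", "common_log"),
    }
    stream = [
        ("logdomain1", "first debug line"),
        ("logdomain2", "first access line"),
        ("othergroup", "not subscribed, so not journaled"),
    ]
    journal(stream, paths)
    contents = {}
    for group, path in paths.items():
        with open(path, encoding="utf-8") as handle:
            contents[group] = handle.read()
    return contents
```

Calling demo() returns the contents of the two journal files. A real subscriber would loop on messages received from the Spread daemon rather than on a fixed list.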
Listing 9.1. spreadlogd.conf: A Simple spreadlogd Configuration
The sample configuration detailed in Listing 9.1 and presented in diagram form in Figure 9.4 is short and simple. Effectively, we'd like to read messages up to 64 kilobytes in length from the Spread daemon listening on port 4803. We want to subscribe to the groups named logdomain1 and logdomain2 and write them to the files /data/logs/logdomain1/debug_log and /data/logs/logdomain2/common_log, respectively. As an added feature, we want the spreadlogd instance to recognize the logs it reads from logdomain2 as Apache common log format, find the time stamp therein, and rewrite it with the current local time stamp on the machine. spreadlogd is a simple program and as such is reliable and fast.
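To make the time stamp rewriting concrete, here is a minimal Python sketch of the idea: locate the bracketed time stamp in a common log format line and replace it with the logging host's current time. This is an illustration under stated assumptions, not spreadlogd's actual implementation. The function names, the regular expression, and the sample log line are all hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# CLF carries the request time in brackets: [10/Oct/2000:13:55:36 -0700]
CLF_TIMESTAMP = re.compile(
    r"\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}\]")


def format_clf_time(now: datetime) -> str:
    """Render a timezone-aware datetime as a bracketed CLF time stamp."""
    offset = now.strftime("%z")  # e.g. "-0500"; empty for naive datetimes
    if not offset:
        raise ValueError("now must be timezone-aware")
    return "[%02d/%s/%04d:%02d:%02d:%02d %s]" % (
        now.day, MONTHS[now.month - 1], now.year,
        now.hour, now.minute, now.second, offset)


def rewrite_clf_timestamp(line: str, now: datetime) -> str:
    """Replace the first CLF time stamp in line with the given time.

    Lines without a recognizable time stamp are returned unchanged.
    """
    stamp = format_clf_time(now)
    return CLF_TIMESTAMP.sub(lambda _match: stamp, line, count=1)


# A fixed "current time" so that the example is reproducible.
EXAMPLE_NOW = datetime(2005, 3, 4, 9, 8, 7,
                       tzinfo=timezone(timedelta(hours=-5)))
EXAMPLE_LINE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
                '"GET /apache_pb.gif HTTP/1.0" 200 2326')
```

In live use the caller would pass datetime.now().astimezone() rather than a fixed value. Stamping every line with the logging host's clock keeps lines from many web servers on one time base, even if the web servers' clocks drift.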
You can test the operation now (after running spreadlogd on your logging host) by running spuser (a tool that comes with Spread) from any machine in your cluster (logger, monitor, or web server). The following output presents a spuser publishing session:
# /opt/spread/bin/spuser -s 4803
Spread library version is 3.17.3
User: connected to 4803 with private group #user#admin-va-1

==========
User Menu:
----------
    j <group> -- join a group
    l <group> -- leave a group
    s <group> -- send a message
    b <group> -- send a burst of messages
    r -- receive a message (stuck)
    p -- poll for a message
    e -- enable asynchonous read (default)
    d -- disable asynchronous read
    q -- quit

User> s logdomain1
enter message: This is a test message. Walla walla bing bang.
User> q
Bye.
We should see that log line appear almost immediately (within a second) in /data/logs/logdomain1/debug_log on our logging server. If we are running two logging servers, it will appear in both places, and if we were running spuser on another machine subscribed to the logdomain1 group (j logdomain1), we would also see it appear on our spuser console. It's like magic.
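The fan-out behavior just described, in which one published message reaches every member of the group, can be modeled with a toy in-memory bus. This sketch is only a stand-in for Spread, which does the same job across machines. ToyBus and its join and send methods are hypothetical names and are not part of any Spread API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyBus:
    """In-memory stand-in for a group-based publish/subscribe substrate."""

    def __init__(self) -> None:
        self._members: DefaultDict[str, List[Callable[[str], None]]] = \
            defaultdict(list)

    def join(self, group: str, deliver: Callable[[str], None]) -> None:
        """Register a delivery callback as a member of group."""
        self._members[group].append(deliver)

    def send(self, group: str, message: str) -> int:
        """Deliver message to every member of group; return how many got it."""
        receivers = self._members.get(group, [])
        for deliver in receivers:
            deliver(message)
        return len(receivers)


def demo():
    """Two loggers and one console all join logdomain1."""
    logger_a, logger_b, console = [], [], []
    bus = ToyBus()
    for sink in (logger_a, logger_b, console):
        bus.join("logdomain1", sink.append)
    delivered = bus.send("logdomain1", "This is a test message.")
    nobody = bus.send("logdomain2", "No one has joined this group.")
    return delivered, nobody, logger_a, logger_b, console
```

The publisher never names its receivers. Adding a second logging host or a monitoring console is a matter of joining the group, with no change on the publishing side.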
Now we have something in place to write the logs published to our groups to disk, and thus it is safe to explore methods for actually publishing them.
Before I jump into explaining how to configure Apache to log web logs through our new Spread infrastructure, I'd like to rant a bit about the danger of inertia.
Spread has C, Perl, Python, PHP, Ruby, Java, and OCaml APIs, just for starters. It is trivial to write support into any application to publish messages into Spread. Although there is no good reason to fear code modification, it is a common fear nonetheless. Ironically, most systems engineers and developers are comfortable using a modification (patch) written by someone else. I suppose it is a lack of self-confidence or a misplaced faith in the long-term support interests the author has for the changeset. Whatever it is, I suggest we get past that; it hinders thinking outside the box and causes the wrong technologies to be used despite the simplicity of adopting new ones.
A specific example of this is that a few large enterprises I've worked with simply would not consider this logging option, despite the advantages it offered, because it didn't expose a log4j (a popular Java logging specification) implementation. They did not want to invest in the effort to switch from log4j to something new. If you don't know what log4j is, you should be confused by now. log4j is an API more than anything else, and below that API there is an implementation that knows how to write to a disk, or to JMS, or to a database, and so on. There is no good reason why some engineer couldn't spend an hour building a log4j implementation that published to Spread. The fear of new technology was irrational in this case, and the company was prepared to forfeit an extreme advantage due to its fear of having to write code. This is ridiculous. End of rant.
mod_log_spread, available at http://www.backhand.org/, is a version of mod_log_config (a core Apache module for logging) that has been patched to allow publishing logs to a Spread daemon; as such, you have all the features of mod_log_config. mod_log_config allows you to specify the log destination as the path to a file to which it will append or, alternatively, a program to which it will pipe the logs; this is accomplished by preceding the name of the program with a | character, as is conventional on UNIX systems. To this, mod_log_spread adds the capability to publish to a group by specifying the log destination as the group name prefixed with a $ character. In our case, instead of specifying the target of the CustomLog statement as a path to a local file or a pipe to Apache's rotatelogs program, we specify $logdomain2. This can be seen in the following httpd.conf excerpt:
LoadModule log_spread_module libexec/mod_log_spread.so
AddModule mod_log_spread.c
SpreadDaemon 4803
CustomLog $logdomain2 common
The preceding configuration loads the mod_log_spread module into Apache (1.3), configures it to talk to Spread locally on port 4803 (actually through /tmp/4803 on UNIX), and writes logs in Common Log Format (CLF) to the Spread group logdomain2. Start up your server with that, and you should immediately see log lines (in CLF) written as prescribed by your spreadlogd configuration whenever a page is loaded.
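The destination convention described above (a plain path, a | prefix for a piped program, a $ prefix for a Spread group) amounts to a dispatch on the first character. The following Python sketch illustrates that dispatch only. It is not how mod_log_spread is implemented (the module is C code inside Apache), and classify_log_target is a hypothetical name.

```python
from typing import Tuple


def classify_log_target(target: str) -> Tuple[str, str]:
    """Classify a log destination by its leading character.

    Returns a (kind, value) pair where kind is one of:
      "pipe"          target began with "|"; value is the command line
      "spread_group"  target began with "$"; value is the group name
      "file"          anything else; value is the path, unchanged
    """
    if target.startswith("|"):
        return ("pipe", target[1:].strip())
    if target.startswith("$"):
        return ("spread_group", target[1:])
    return ("file", target)


EXAMPLES = [
    "logs/access_log",
    "|/usr/local/apache/bin/rotatelogs /var/log/access_log 86400",
    "$logdomain2",
]
```

Mapping classify_log_target over EXAMPLES yields one destination of each kind. The rotatelogs path and arguments in the second entry are illustrative only.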