Recipe 4.10 Program: Apache Logfile Parsing


The Apache web server is the world's leading web server and has been for most of the web's history. It is one of the world's best-known open source projects, and one of many fostered by the Apache Software Foundation. The name Apache is a pun on the origins of the server: its developers began with the free NCSA server and kept hacking at it, or "patching" it, until it did what they wanted. When it was sufficiently different from the original, a new name was needed. Since it was now "a patchy server," the name Apache was chosen. One place this patchiness shows through is in the log file format. Consider this entry:

123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"

The file format was obviously designed for human inspection but not for easy parsing. The problem is that different delimiters are used: square brackets for the date, quotes for the request line, and spaces sprinkled all through. Consider trying to use a StringTokenizer; you might be able to get it working, but you'd spend a lot of time fiddling with it. However, this somewhat contorted regular expression[5] makes it easy to parse:

[5] You might think this would hold some kind of world record for complexity in regex competitions, but I'm sure it's been outdone many times.

^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"

You may find it informative to refer back to Table 4-2 and review the full syntax used here. Note in particular the use of the non-greedy quantifier +? in "(.+?)" to match a quoted string; you can't just use .+ because, being greedy, it would match too much (everything up to the last quote on the line). Code to extract the various fields such as IP address, request, referer URL, and browser version is shown in Example 4-7.
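The difference between the two quantifiers is easiest to see on the quoted-string sub-pattern in isolation. The following is a minimal sketch, not part of the book's source: the class name QuantifierDemo, the helper firstGroup, and the shortened input fragment are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the recipe's online source.
class QuantifierDemo {

    /** Returns group 1 of the first match of regex in input, or null if there is no match. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A shortened, made-up fragment containing two quoted fields
        String input = "\"GET /index.html HTTP/1.0\" 200 10450 \"-\"";

        // Reluctant +? stops at the first closing quote
        System.out.println("reluctant: " + firstGroup("\"(.+?)\"", input));

        // Greedy + runs on to the last quote in the input
        System.out.println("greedy:    " + firstGroup("\"(.+)\"", input));
    }
}
```

With the reluctant form, the captured text is just the request, GET /index.html HTTP/1.0. With the greedy form, the capture runs from the first quote to the last one and swallows the status code, byte count, and the opening of the next field along the way.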

Example 4-7. LogRegExp.java
import java.util.regex.*;

/**
 * Parse an Apache log file with Regular Expressions
 */
public class LogRegExp implements LogExample {

    public static void main(String argv[]) {

        // One capture group per field of the log entry
        String logEntryPattern =
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";

        System.out.println("Using regex Pattern:");
        System.out.println(logEntryPattern);

        // logEntryLine (and, presumably, NUM_FIELDS) are constants
        // inherited from the LogExample interface
        System.out.println("Input line is:");
        System.out.println(logEntryLine);

        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);

        // matches() succeeds only if the whole line fits the pattern
        if (!matcher.matches() ||
            NUM_FIELDS != matcher.groupCount()) {
            System.err.println("Bad log entry (or problem with regex?):");
            System.err.println(logEntryLine);
            return;
        }

        System.out.println("IP Address: " + matcher.group(1));
        System.out.println("Date&Time: " + matcher.group(4));
        System.out.println("Request: " + matcher.group(5));
        System.out.println("Response: " + matcher.group(6));
        System.out.println("Bytes Sent: " + matcher.group(7));
        // A referer of "-" means none was sent, so skip it
        if (!matcher.group(8).equals("-"))
            System.out.println("Referer: " + matcher.group(8));
        System.out.println("Browser: " + matcher.group(9));
    }
}

The implements clause is for an interface that just defines the input string; it was used in a demonstration to compare the regular expression mode with the use of a StringTokenizer. The source for both versions is in the online source for this chapter. Running the program against the sample input shown above gives this output:

Using regex Pattern:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
Input line is:
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
IP Address: 123.45.67.89
Date&Time: 27/Oct/2000:09:27:09 -0400
Request: GET /java/javaResources.html HTTP/1.0
Response: 200
Bytes Sent: 10450
Browser: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)

The program successfully parsed the entire log entry, with all nine fields, in one call to matcher.matches().
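Example 4-7 depends on the LogExample interface, which lives in the online source and is not reproduced here. For readers who want something that compiles on its own, the following is a minimal sketch under stated assumptions: LogExampleSketch is a hypothetical stand-in that assumes the real interface supplies only the two constants the example uses, and LogLineParser and its parse method are illustrative names, not the author's code. The pattern is the one from this recipe, compiled once and reused.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the name and the parse() helper are assumptions.
class LogLineParser implements LogExampleSketch {

    // The recipe's pattern, compiled once so it can be reused for many lines
    static final Pattern LOG_ENTRY = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
        + "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    /** Returns the captured fields (index 0 holds group 1), or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG_ENTRY.matcher(line);
        if (!m.matches() || m.groupCount() != NUM_FIELDS) {
            return null;
        }
        String[] fields = new String[NUM_FIELDS];
        for (int i = 0; i < NUM_FIELDS; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parse(logEntryLine);
        if (f == null) {
            System.err.println("Bad log entry (or problem with regex?):");
            System.err.println(logEntryLine);
            return;
        }
        System.out.println("IP Address: " + f[0]);
        System.out.println("Request: " + f[4]);
        System.out.println("Response: " + f[5]);
    }
}

// Hypothetical stand-in for LogExample, assumed to hold only these constants.
interface LogExampleSketch {
    /** Number of capture groups the pattern is expected to define. */
    int NUM_FIELDS = 9;

    /** The sample log entry used throughout this recipe. */
    String logEntryLine =
        "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
        + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 \"-\" "
        + "\"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
}
```

Returning null for a non-matching line keeps the sketch short; code that reads a whole log file would probably want to count or report the rejected lines instead of silently dropping them.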



Java Cookbook
Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
