Section 7.2. Apache Web Server Logging



Let's now look at how Apache can be configured to log information about the requests it services and how you, as the operator of a server, can extract specific information from what can become huge log files.

Logging in Apache can be set up in several different ways. For most purposes the default configuration works fine and strikes a good compromise between logging useful information and keeping the log files from filling all available disk space. The configuration options are detailed here: http://httpd.apache.org/docs/logs.html.

You will find the relevant directives buried deep in the configuration file httpd.conf. Look for a block like this (I've edited out some of the comments for readability):

     # The following directives define some format nicknames for
     # use with a CustomLog directive (see below).
     LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
     LogFormat "%h %l %u %t \"%r\" %>s %b" common
     LogFormat "%{Referer}i -> %U" referer
     LogFormat "%{User-agent}i" agent
     #
     # The location and format of the access logfile.
     [...]
     # CustomLog /var/log/httpd/access_log common
     CustomLog logs/access_log combined
     #
     # If you would like to have agent and referer logfiles,
     # uncomment the following directives.
     #CustomLog logs/referer_log referer
     #CustomLog logs/agent_log agent
     #
     # If you prefer a single logfile with access, agent, and referer
     # information (Combined Logfile Format) use the following directive.
     #
     #CustomLog logs/access_log combined

The basic idea is simple. You define what information should go into the log for each visit by creating a LogFormat record in the configuration file. There are several of these predefined, as in the above example. Each format is given a nickname, such as combined or common.

The syntax used on a LogFormat record looks a bit like a C printf format string. The URL http://httpd.apache.org/docs/mod/mod_log_config.html describes the complete syntax, but the key elements are shown in Table 7-2.

Table 7-2. Apache LogFormat directives

Directive    Meaning
%h           The hostname of the machine making the request (or its IP address if hostname lookups are turned off)
%l           The logname of the remote user, if supplied
%u           The username of the person making the request (only relevant if the page requires user authentication)
%t           Date and time the request was made
%r           The first line of the request, which includes the document name
%>s          The status of the response to the request
%b           The number of bytes of content sent to the browser
%{NAME}i     The value of the NAME header line; e.g., Accept, User-Agent, etc.


You then specify which format will be used and the name of the log file in a CustomLog record. Several common setups are predefined in httpd.conf, and you can simply uncomment the one that suits your taste. Remember that when messing with Apache configuration files you should always make a backup copy before you start and add comment lines in front of any directives that you modify.

The default level of logging is defined in the common LogFormat. So in a typical installation these lines are all that you need:

     LogFormat "%h %l %u %t \"%r\" %>s %b" common
     [...]
     CustomLog logs/access_log common

The combined LogFormat extends that to include the Referer and User-Agent:

     LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
     [...]
     CustomLog logs/access_log combined
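Because a LogFormat string is essentially a template, it can be turned into a regular expression for reading the log back in. The following is a minimal sketch in Python, not anything that ships with Apache. The function name logformat_to_regex and the group names (host, status, and so on) are assumptions chosen for illustration, and only the directives used in the common and combined formats are handled.

```python
import re

# Regex fragment for each simple directive. The group names
# (host, logname, ...) are illustrative choices, not Apache terminology.
DIRECTIVE_PATTERNS = {
    "%h": r'(?P<host>\S+)',
    "%l": r'(?P<logname>\S+)',
    "%u": r'(?P<user>\S+)',
    "%t": r'\[(?P<time>[^\]]+)\]',
    "%r": r'(?P<request>[^"]*)',
    "%>s": r'(?P<status>\d{3})',
    "%b": r'(?P<bytes>\d+|-)',
}

# Matches either a %{Header}i directive or one of the simple directives.
TOKEN = re.compile(r'%\{(?P<header>[^}]+)\}i|%>s|%[hlutrb]')


def logformat_to_regex(fmt):
    """Translate a LogFormat string into a compiled regex (sketch only)."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(fmt):
        # Literal text between directives (spaces, quotes) is matched verbatim.
        parts.append(re.escape(fmt[pos:m.start()]))
        if m.group("header"):
            # Group names cannot contain '-', so User-Agent becomes user_agent.
            name = m.group("header").lower().replace("-", "_")
            parts.append('(?P<%s>[^"]*)' % name)
        else:
            parts.append(DIRECTIVE_PATTERNS[m.group(0)])
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("^" + "".join(parts) + "$")


# The format strings from httpd.conf, with the configuration file's
# backslash-escaped quotes written as plain quotes.
COMMON = '%h %l %u %t "%r" %>s %b'
COMBINED = COMMON + ' "%{Referer}i" "%{User-Agent}i"'

common_re = logformat_to_regex(COMMON)
combined_re = logformat_to_regex(COMBINED)
```

Calling common_re.match(line) on a log record then gives access to the fields by name, for example m.group("status"). The sketch assumes that quoted fields never contain a double quote, which is not guaranteed for real traffic.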

You can choose between logs containing the full hostname or just the IP address by setting HostnameLookups to On or Off, respectively:

     HostnameLookups On 

Be aware that turning this on will trigger a DNS lookup for every page requested, which can add an unnecessary burden to busy web servers.
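One way to avoid that burden is to leave HostnameLookups Off and resolve the addresses later, away from the web server. The sketch below shows the idea in Python, under the assumption that each record starts with the client IP address followed by a space. The names reverse_lookup and resolve_hosts are hypothetical, and the lookup function is passed in as a parameter so that it can be replaced when testing.

```python
import socket


def reverse_lookup(ip):
    """Return the hostname for ip, or the ip itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve_hosts(lines, lookup=reverse_lookup):
    """Yield log lines with the leading IP address replaced by a hostname.

    Each distinct address is looked up only once; results are cached
    in a dictionary for the duration of the run.
    """
    cache = {}
    for line in lines:
        ip, sep, rest = line.partition(" ")
        if ip not in cache:
            cache[ip] = lookup(ip)
        yield cache[ip] + sep + rest
```

Because the results are cached, a visitor who requests a hundred pages costs one DNS lookup rather than a hundred.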

By default, every request is logged, which is probably not what you want: it produces a log record for every image on every page, and you end up with massive log files that are much harder to trawl through than they need to be. Fortunately, we can solve this by identifying requests that can be ignored and excluding them in the CustomLog directive. We define a specific environment variable if the requested URI matches any of a set of patterns. The variable is called donotlog in this example, but the name is arbitrary. It gets set if the request is for a regular image, a stylesheet, or one of those mini-icons that appear in browser address windows. We then add a qualifier to the end of the CustomLog line, env=!donotlog, which means: log this record only if donotlog is not defined in the environment variables. Note that this syntax (=!) is reversed from "not equal" (!=) in languages such as Perl. That makes it easy to mistype, and the error will prevent Apache from restarting:

     SetEnvIf Request_URI \.gif donotlog
     SetEnvIf Request_URI \.jpg donotlog
     SetEnvIf Request_URI \.png donotlog
     SetEnvIf Request_URI \.css donotlog
     SetEnvIf Request_URI favicon\.ico donotlog
     CustomLog logs/access_log combined env=!donotlog

This short block will lower the size of your log files dramatically with little or no loss of useful information.
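Before restarting Apache, it can be useful to check which requests the SetEnvIf patterns would suppress. The following Python sketch applies the same five regular expressions to a few sample URIs. It only approximates Apache's behavior, on the assumption that these simple patterns mean the same thing in Python's re module as they do in Apache, and the function name is_logged is hypothetical.

```python
import re

# The same patterns as in the SetEnvIf lines above.
DONOTLOG_PATTERNS = [r"\.gif", r"\.jpg", r"\.png", r"\.css", r"favicon\.ico"]
DONOTLOG = [re.compile(p) for p in DONOTLOG_PATTERNS]


def is_logged(request_uri):
    """Return True if no pattern matches, i.e. donotlog would stay unset."""
    return not any(p.search(request_uri) for p in DONOTLOG)


for uri in ["/index.html", "/images/logo.gif", "/style/main.css", "/favicon.ico"]:
    print(uri, "logged" if is_logged(uri) else "skipped")
```

Note that the patterns are not anchored, so a URI such as /notes.css.html would be skipped as well. Adding a $ to the end of each pattern would restrict the match to the end of the URI.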

Here are some examples of real log records. A simple page fetch as recorded using the common LogFormat, with HostnameLookups turned off, looks like this:

     66.134.177.170 - - [20/Feb/2004:15:34:13 -0800]
         "GET /index.html HTTP/1.1" 200 13952

With HostnameLookups turned on:

     h-66-134-177-170.sttnwaho.covad.net - -
         [20/Feb/2004:15:37:50 -0800]
         "GET /index.html HTTP/1.1" 200 13952

And finally using the combined format:

     h-66-134-177-170.sttnwaho.covad.net - -
         [20/Feb/2004:15:46:03 -0800]
         "GET /index.html HTTP/1.1" 200 13952
         "http://www.craic.com/index.html"
         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
         Gecko/20040207 Firefox/0.8"

Consider the last example. h-66-134-177-170.sttnwaho.covad.net is the hostname of the machine making the request. This would just be the IP address if hostname lookups were turned off. The two dashes that follow are placeholders for logname and username information that is not available in this request, as is the case with most requests that you will come across. Next is the timestamp, followed by the first line of the actual request. "GET /index.html HTTP/1.1" reads as a request for the document index.html, to be delivered using the GET method as it is interpreted in Version 1.1 of the HTTP protocol. The two numbers that follow signify a successful transaction, with status code 200, in which 13,952 bytes were sent to the browser. This request was initiated by someone clicking on a link on a web page, and the URL of that referring page is given next in the record. If the user had typed the URL directly into a browser, then this would be recorded simply as a dash.
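The field-by-field reading above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a complete parser: the name parse_combined and the dictionary keys are hypothetical, the record is assumed to be on a single line as Apache writes it, and quoted fields are assumed not to contain double quotes.

```python
import re
from datetime import datetime

COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_combined(line):
    """Split one combined-format record into a dict, or return None."""
    m = COMBINED.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # A dash in the bytes field means no content was sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # Timestamp layout: day/month/year:hour:minute:second zone
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    # A well-formed request line is "METHOD path protocol".
    parts = rec["request"].split()
    if len(parts) >= 2:
        rec["method"], rec["path"] = parts[0], parts[1]
    else:
        rec["method"], rec["path"] = None, None
    return rec


line = ('h-66-134-177-170.sttnwaho.covad.net - - '
        '[20/Feb/2004:15:46:03 -0800] "GET /index.html HTTP/1.1" 200 13952 '
        '"http://www.craic.com/index.html" '
        '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) '
        'Gecko/20040207 Firefox/0.8"')
rec = parse_combined(line)
print(rec["host"], rec["method"], rec["path"], rec["status"], rec["bytes"])
```

Returning None for lines that do not match lets a caller count or inspect malformed records instead of crashing on them.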

Finally there is the User-Agent header. This is often the most interesting item in the whole record. It tells us in considerable detail what browser was used to make the request, often including the type of operating system used on that computer. This example tells us the browser was Firefox Version 0.8 running under the Linux operating system on a PC:

     "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
         Gecko/20040207 Firefox/0.8"

This one identifies the browser as Safari running under Mac OS X on a PowerPC Macintosh:

     "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/125.5.6
         (KHTML, like Gecko) Safari/125.12"

Notice that the version numbers are very specific. If I were so inclined, I might use those to look up security vulnerabilities on that system that might help me break in to it over the network. You might not want to pass all this information on to every site that you visit.

Even more specific are User-Agent strings like these:

     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
         ESB{837E7A43-A894-47CD-8B49-6C273A84BE29}; SV1)"
     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
         {A0D0A528-5BFC-4FB3-B56C-EC45BCECC088}; SV1; .NET CLR)"

These are two examples of Microsoft Internet Explorer Version 6.0 running on Windows XP systems (Windows NT 5.1 is the version string that Windows XP reports). More importantly, they appear to have a unique identifier string embedded in the User-Agent, for example, {A0D0A528-5BFC-4FB3-B56C-EC45BCECC088}. Every example of this that I have seen is different, so it cannot be a product number, and not all browsers on these systems have a string like this. It appears to be a serial number that identifies either that copy of Windows or that copy of Explorer. I have to admit that I don't fully understand this one, but if it is a unique ID then it could be used to trace a visit to a specific web site all the way back to a specific computer. That may very well be its purpose. Companies concerned about their staff leaking confidential information or visiting inappropriate web sites might want to identify the precise source of any web page request.
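To see whether the same identifier turns up across different visits, the brace-delimited strings can be pulled out of each User-Agent. The Python sketch below assumes the identifiers always follow the 8-4-4-4-12 hexadecimal layout seen in the two examples, which may not hold for every browser. The function name embedded_ids is hypothetical.

```python
import re

# Eight-four-four-four-twelve hexadecimal digits inside curly braces.
GUID = re.compile(
    r"\{([0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-"
    r"[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12})\}"
)


def embedded_ids(user_agent):
    """Return any brace-delimited GUID-style strings in a User-Agent."""
    return GUID.findall(user_agent)


agent = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; "
         "{A0D0A528-5BFC-4FB3-B56C-EC45BCECC088}; SV1; .NET CLR)")
print(embedded_ids(agent))
```

Feeding the results into a counter keyed on the identifier would show how many requests, and from which addresses, carry each one.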

Other User-Agent strings tell us that we are being visited by web robots, also known as crawlers or spiders. Here are the strings for the robots from MSN, Yahoo!, and Google:

     msnbot/1.0 (+http://search.msn.com/msnbot.htm)
     Mozilla/5.0 (compatible; Yahoo! Slurp;
         http://help.yahoo.com/help/us/ysearch/slurp)
     Googlebot/2.1 (+http://www.google.com/bot.html)
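When summarizing a log, it is often useful to separate robot traffic from human visitors. A minimal Python sketch, assuming a simple substring test is good enough, is shown here. The marker list contains only the three robots named above and the function name is_robot is hypothetical. Bear in mind that the User-Agent is supplied by the client and can be set to anything.

```python
# Substrings taken from the three robot User-Agent strings shown above.
# A real list would be much longer; this is only a starting point.
ROBOT_MARKERS = ("msnbot", "Yahoo! Slurp", "Googlebot")


def is_robot(user_agent):
    """Return True if the User-Agent contains a known robot marker."""
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in ROBOT_MARKERS)
```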

When you combine the information present in a log record with some simple dig and whois searches, you can learn a lot about the person making the request. Here is someone based in India, on a Windows 98 PC, looking at my resume, which they found by running a Google search on the name of my Ph.D. supervisor:

     221.134.26.74 - - [02/Feb/2005:07:24:25 -0800]
         "GET /pdf_docs/Robert_Jones_CV.pdf HTTP/1.1" 206 7801
         "http://www.google.com/search?hl=en&ie=ISO-8859-1&q=R.L.+Robson"
         "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
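The search phrase is carried in the q parameter of the referring URL, and it can be extracted with Python's standard urllib.parse module. This is a small sketch: the function name search_terms is hypothetical, and the assumption that the parameter is called q holds for the Google URL in this record but not necessarily for other search engines.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(referer, param="q"):
    """Return the decoded search phrase from a referring URL, or None."""
    query = parse_qs(urlsplit(referer).query)
    values = query.get(param)
    return values[0] if values else None


referer = "http://www.google.com/search?hl=en&ie=ISO-8859-1&q=R.L.+Robson"
print(search_terms(referer))   # prints: R.L. Robson
```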

The next example involves a browser on a mobile phone, specifically a Nokia 3650. Not only that, I know that they use AT&T Wireless as their carrier, because the IP address maps to the host pnupagt11.attwireless.net:

     209.183.48.55 - - [20/Feb/2004:15:47:46 -0800] "GET / HTTP/1.1"
         200 904 "-" "Nokia3650/1.0 SymbianOS/6.1 Series60/1.2
         Profile/MIDP-1.0 Configuration/CLDC-1.0 UP.Link/5.1.2.9"

You can while away many a happy hour looking through server logs like this. It's both fascinating to see what you can uncover and chilling to realize what other people can uncover about you.



Internet Forensics
ISBN: 059610006X