Analyzing Server Log Files | Advanced Linux Networking

Web server log files can be an important source of information to help you manage your Web site. Log files may include information on the clients that are visiting your site, which of your documents are popular with those clients, when your files are being accessed, and so on. Unfortunately, examining raw log files can be a tedious undertaking, so various tools exist to help summarize the data in the log files. Two common tools are Analog and Webalizer.

NOTE

This section describes the routine log files created by Apache, as set by the CustomLog directive, described earlier. Apache may also log errors, startup messages, and so on to a separate log file.

The Apache Log File Format

There are actually several different Apache log file formats, which you can set with the CustomLog directive, as described in "Setting Common Configuration Options." This section describes the combined format, which combines information into a single file. The other options provide a subset of this information.

An entry in the combined log file looks something like this:

 192.168.1.1 - - [06/Nov/2002:16:45:49 -0500] "GET /index.html HTTP/1.0" 200 8597 "-" "Mozilla (X11; I; Linux 2.0.32 i586)"

This entry consists of several parts:

Client hostname or IP address: The first field is the IP address or hostname of the client that made a request.
User identification: The next two fields (both dashes in this example) provide the username of the individual who made the request. The dashes indicate that this information isn't available. If they're present, the first field is the name as identified by the identd server and the second is the name as identified by HTTP user authentication.
Date and time: Apache logs the date and time of the transfer request. This information is recorded in local time, but the log includes the time zone (-0500 in this example, meaning five hours behind GMT).
HTTP request: The HTTP request (GET /index.html HTTP/1.0 in this example) shows the command that the client used (GET), the document requested (/index.html), and the HTTP version used (1.0). You can use this information to discover which of your pages are the most popular. This field also often contains clues to attempted break-ins, because these often rely upon requests for strange documents.
Response code: Apache replies to the client, in part, with a response code that provides information about the ability of the server to fulfill a request. In this example, the response code is 200, which means Apache could fulfill the request. Codes beginning in 3 are redirections, client errors are indicated by codes beginning in 4, and server errors turn up in codes beginning in 5.
Object size: The 8597 entry in this example is the size in bytes of the document that Apache returned, not counting HTTP overhead.
Referrer: When a user clicks a link from another page, most browsers deliver the URL of the referring page to the new page's Web server. Apache records this information in its log file. In the preceding example, the referrer is "-", indicating that no referring page was reported, as happens when the user types the URL into the Web browser directly.
User agent: The final field contains information that the browser sends to Apache about itself, such as its name and the OS on which it runs. (Note that Netscape reports itself as Mozilla.) This information isn't wholly reliable; Web browsers can be programmed to lie, or proxy servers may change the information.

Using this information, you can peruse your Apache log files to determine something about the popularity of your Web pages, when they're being accessed, who's accessing them, and so on. Examining these files "raw" can be tedious, though. That's where log file analysis tools come in handy.
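As a rough illustration of how the fields described above can be pulled apart by a program, the following Python sketch parses one combined-format entry. The regular expression and the parse_combined name are assumptions made for this example rather than anything Apache provides, and the pattern doesn't handle every edge case (for instance, escaped quotation marks inside the request).

```python
import re
from datetime import datetime

# One line of the combined format: host, identd name, authenticated user,
# [timestamp], "request", status, size, "referrer", "user agent".
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = COMBINED_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The timestamp carries its own time zone offset, such as -0500.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Apache logs "-" instead of a number when no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('192.168.1.1 - - [06/Nov/2002:16:45:49 -0500] '
          '"GET /index.html HTTP/1.0" 200 8597 "-" '
          '"Mozilla (X11; I; Linux 2.0.32 i586)"')
entry = parse_combined(sample)
print(entry["host"], entry["status"], entry["size"], entry["request"])
```

Returning None for lines that don't match lets a caller skip damaged entries instead of stopping partway through a large log file.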

NOTE

Most Linux installations include cron jobs to automatically rotate log files, including Apache log files. Check your system's cron jobs (usually stored in /etc/cron.d, /etc/cron.interval, where interval is a period such as daily or weekly, or a similar location) for such a log file rotation system. If your Web log files aren't being rotated, you may want to add this feature to prevent the log files from growing to consume available disk space.

Using Analog

Analog (http://www.analog.cx) claims to be the most popular Web log file analyzer in the world. This package's output is heavily text-based, but includes some bar and pie charts. You can see an example report at http://www.statslab.cam.ac.uk/~sret1/stats/stats.html. Analog ships with some distributions, or you can obtain it from its Web page.

Setting Analog Options

Analog is controlled through its configuration file, analog.cfg, which usually resides in /etc. This file contains various options that help Analog combine data into useful chunks. For instance, SEARCHENGINE specifies search engines that might appear as referrers, so that Analog can summarize search engine links to your sites. Three options you're likely to want to set immediately are the following:

 LOGFILE /path/to/log/file
 OUTFILE /path/to/output/file
 HOSTNAME "Your Organization's Name"

The first two of these items are critically important. If you don't specify them, Analog won't be able to locate your log file, and it will dump its output file directly to standard output. Analog's output is in the form of an HTML file with associated graphics, so you can read it with a Web browser. (You specify only the name for the main file, such as /home/httpd/html/analog/index.html; Analog creates its graphics files in the same directory.) The HOSTNAME specification is purely cosmetic; Analog displays this information at the top of its report.

Unfortunately, some Analog packages are not fully functional out of the box because they make peculiar and contradictory assumptions about the locations of files. These problems can be overcome by creating a few symbolic links:

Configuration files: Some Analog packages are built to assume that the analog.cfg file will be in the same directory as the Analog executable (usually /usr/bin), although the file actually resides in /etc. The /usr/bin directory is a bizarre location for a configuration file, but you can type ln -s /etc/analog.cfg /usr/bin to leave the file in /etc but still satisfy Analog.
Language files: Analog relies upon language files to operate properly. Some packages place these in /var/lib/analog/lang, but Analog may look for them in /usr/bin/lang. Typing ln -s /var/lib/analog/lang /usr/bin allows Analog to function.
Support graphics: Analog generates some graphics, such as pie charts, for each site it summarizes, but it also relies on other fixed graphics files. Some packages drop these files in /var/www/html/images by default, but the HTML that Analog generates looks for them in the images directory under the Analog output directory, so you may need to create another symbolic link. Change to the output directory you specified with the OUTFILE option and type ln -s /var/www/html/images to give Analog access to these graphics files.

Keep in mind that these adjustments may not be required for all Analog packages. (I found them to be necessary with an Analog package intended for Linux Mandrake, analog-5.01-1mdk.)
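If you need to repeat these adjustments on several machines, a short script can create the links only where they're missing. The following Python sketch is one way to do this; the ensure_symlink and apply_links names are invented for the example, and the paths in LINKS are simply those mentioned above, so they apply only to packages that use the same layout.

```python
import os

def ensure_symlink(target, link_path):
    """Create link_path -> target unless something already exists at link_path.

    Returns True if a link was created, False if the path was left alone.
    """
    if os.path.lexists(link_path):  # also true for an existing broken link
        return False
    os.symlink(target, link_path)
    return True

# (target, link) pairs taken from the text above.
LINKS = [
    ("/etc/analog.cfg", "/usr/bin/analog.cfg"),
    ("/var/lib/analog/lang", "/usr/bin/lang"),
]

def apply_links(links=LINKS):
    """Create each missing link and report what was done."""
    for target, link_path in links:
        created = ensure_symlink(target, link_path)
        print("created" if created else "skipped", link_path)
```

Calling apply_links() requires write access to /usr/bin, which normally means running as root. Because existing files and links are never overwritten, running the function a second time is harmless.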

Running Analog

Analog can be run by typing its name: analog. The user who types this command must have read access to the Web access logs, as well as write access to the Analog output directory. If you plan your permissions appropriately, you do not need root access for either task.

In most cases, you'll want to run Analog from a cron job on a regular basis, such as once a month, once a week, or even once a day. Keep in mind that Analog, although not a huge process, does consume some system resources, so running it very frequently (such as once a minute) can cause a performance hit, particularly on a busy server.
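Before you hand Analog over to cron, it can be worth confirming that the account that will run it has the access described above. The following Python sketch performs that check; the access_problems name and the two paths in the usage lines are assumptions for illustration, so substitute the values from your own LOGFILE and OUTFILE settings.

```python
import os

def access_problems(log_path, output_dir):
    """List reasons the current user could not read the log or write reports."""
    problems = []
    if not os.access(log_path, os.R_OK):
        problems.append(f"cannot read log file: {log_path}")
    # Creating files in a directory needs both write and search permission.
    if not os.access(output_dir, os.W_OK | os.X_OK):
        problems.append(f"cannot write to output directory: {output_dir}")
    return problems

if __name__ == "__main__":
    # Hypothetical locations; adjust to match your configuration.
    for problem in access_problems("/var/log/httpd/access_log",
                                   "/home/httpd/html/analog"):
        print(problem)
```

An empty list means that both conditions are met for the user running the script, so run it as the same user that the cron job will use.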

Interpreting Analog Output

The Analog output is broken into several distinct reports, each of which provides information that's been processed and summarized in a different way. The specific sections are as follows:

General summary: This section provides general information that may be useful in judging the overall health of your Web server, such as the average number of requests it processes per day, the average number of successful and failed requests per day, and the total and average daily data transfers.
Monthly report: The monthly report summarizes the number of pages served on a monthly basis. Increasing monthly use and decreasing perceived performance could mean you need to upgrade the server or its network connections.
Daily summary: This section provides information on the number of pages served by the day of week (Monday, Tuesday, and so on).
Hourly summary: This section is like the daily summary, but it summarizes server use by the hour within a day (1:00, 2:00, and so on). If you're experiencing slowdowns, you may want to check this summary and use it to fine-tune your diagnostics; you might miss a problem if you look for it during a less busy time of the day.
Domain report: If your server handles multiple domains, you'll see a summary of the amount of traffic each one processes.
Organizational report: If you associate different organizations with different domains or pages, this report breaks traffic down by organization.
Operating system report: You can see which OSs your clients report using if you use a combined Apache log format or another format that provides this information. Note that because of proxies and other reporting inaccuracies, this information may not be wholly reliable.
Status code report: Analog provides a pie chart showing the number of responses the Web server issued for each status code. This can be useful in quickly spotting problems if there are a lot of 4xx or 5xx responses.
File size report: This section shows the number of files of various sizes that the Web server delivers. This can be very useful in traffic management; if you see your file sizes drifting up, you might want to take steps to check this trend, such as using higher compression levels on your graphics files.
File type report: You can see the types of files (JPEG files, HTML files, and so on) delivered by your Web server. This may be useful in conjunction with the file size report in controlling the expansion of your Web site.
Directory report: Most Web sites are broken into multiple directories, and this report tells you which of these are most popular, by bytes delivered.
Request report: This report displays the popularity of the files in the root directory on the Web site.
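To give a feel for the kind of tallying behind a report such as the status code report, the following Python sketch counts responses by class (success, redirection, client error, and server error). It is not Analog's own method; the status_summary name and the sample entries are made up for illustration.

```python
from collections import Counter

STATUS_CLASSES = {"2": "success", "3": "redirection",
                  "4": "client error", "5": "server error"}

def status_summary(lines):
    """Count log entries by the first digit of their response code."""
    counts = Counter()
    for line in lines:
        # In the combined format the status code follows the quoted request.
        try:
            status = line.split('"')[2].split()[0]
        except IndexError:
            continue  # skip lines that don't look like log entries
        counts[STATUS_CLASSES.get(status[0], "other")] += 1
    return counts

# Made-up sample entries, for illustration only.
sample_lines = [
    '192.168.1.1 - - [06/Nov/2002:16:45:49 -0500] "GET /index.html HTTP/1.0" '
    '200 8597 "-" "Mozilla (X11; I; Linux 2.0.32 i586)"',
    '192.168.1.2 - - [06/Nov/2002:16:46:02 -0500] "GET /missing.html HTTP/1.0" '
    '404 210 "-" "Mozilla (X11; I; Linux 2.0.32 i586)"',
    '192.168.1.3 - - [06/Nov/2002:16:47:10 -0500] "GET /cgi-bin/test HTTP/1.0" '
    '500 305 "-" "Mozilla (X11; I; Linux 2.0.32 i586)"',
]
for label, count in sorted(status_summary(sample_lines).items()):
    print(f"{label}: {count}")
```

A sudden rise in the client error or server error counts is the same warning sign that a large 4xx or 5xx slice in Analog's pie chart would give you.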

You can use these various reports to get a good idea of how your Web site is being used. They can be even more valuable if you maintain a record of Analog reports that spans some time, because then you can examine multiple reports for changes over time. You can do this by creating or editing a cron job to rotate the Apache log files. (Many distributions' Apache packages include such a cron job script.) When the log file rotation occurs, back up the existing Analog directory in a subdirectory. You can then create a master HTML page that links into these backup directories so that you can peruse several weeks' or months' worth of Analog summaries.
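The backup-and-index procedure just described might be scripted along the following lines. This Python sketch is only one possible arrangement: the function names, the use of a date as the subdirectory label, and the assumption that the archive lives outside the live report directory are all choices made for the example.

```python
import shutil
from datetime import date
from pathlib import Path

def archive_report(report_dir, archive_root, label=None):
    """Copy the current report directory to archive_root/<label>.

    The label defaults to today's date. copytree raises FileExistsError if
    that label has already been used, so an old archive is never overwritten.
    """
    label = label or date.today().isoformat()
    destination = Path(archive_root) / label
    shutil.copytree(report_dir, destination)
    return destination

def write_master_index(archive_root, main_file="index.html"):
    """Write a master page linking to each archived report; return the labels."""
    root = Path(archive_root)
    labels = sorted(p.name for p in root.iterdir()
                    if (p / main_file).is_file())
    items = "\n".join(
        f'<li><a href="{label}/{main_file}">{label}</a></li>'
        for label in labels)
    html = ("<html><body><h1>Archived reports</h1>\n"
            f"<ul>\n{items}\n</ul></body></html>\n")
    (root / main_file).write_text(html)
    return labels
```

You could call archive_report() from the same cron job that rotates the Apache log files and then call write_master_index() to refresh the list of links.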

Although Analog is a useful tool that can produce a wealth of data, sifting through that data can sometimes be almost as intimidating as confronting the raw Apache log files. Various additional tools, such as Report Magic (http://www.reportmagic.com), can further summarize Analog's reports and present details in a more readable form.

Using the Webalizer

The Webalizer (http://www.webalizer.org) is a major competitor to Analog in the Web page summary sphere. Like Analog, the Webalizer reads configuration files and creates an output HTML file and supporting graphics so that you can peruse your Web site's traffic patterns in a convenient summary form. Webalizer ships with some distributions, or you can obtain it from its Web page. You can view a sample report at http://www.webalizer.org/sample/.

Setting the Webalizer Options

The Webalizer is controlled through its configuration file, webalizer.conf, which is typically stored in /etc. As with Analog, you must tell the Webalizer where to find your Web server log files and where to store its output. You do this with options like the following:

 LogFile /path/to/log/file
 OutputDir /path/to/output/directory

One important difference between the Analog and Webalizer settings for these options is that Analog requires you to specify an output filename, but the Webalizer has you specify an output directory in which it stores its files. If you set the output directory to a location within your Web server area, you can browse the Webalizer output using a Web browser. If you place the output elsewhere, you can still access it with your Web browser, but only on the Web server computer itself by specifying a file:// URL. There are a few other Webalizer configuration options you might want to adjust, including:

Incremental: If set to yes, this option causes the Webalizer to store its internal state between runs so that you can process logs in chunks. For instance, you can run the Webalizer once a day and it will remember the entries it's already processed and adjust to rotated log files. This option defaults to no, which causes the Webalizer to analyze the log file fresh each time it's run.
HostName: You can set the hostname used in the report title (which is set with the ReportTitle option).
GroupDomains: When reporting hostnames, Webalizer normally analyzes by complete hostname. You can group hostnames within a domain by specifying a nonzero value for GroupDomains, though. The value is the number of elements, starting from the rightmost hostname element, to use as a group. For instance, GroupDomains 2 causes gingko.pangaea.edu and birch.pangaea.edu to be grouped into pangaea.edu. This option can help to unclutter some of the information that the Webalizer produces.
GroupSite: This is another grouping option, but it works on individual sites. For instance, GroupSite *.abigisp.net causes all hostnames under abigisp.net to be grouped together in reports.
HideSite: This option hides the sites under a given domain, which is specified as in the GroupSite option. The GroupSite and HideSite options are frequently used together to create a grouping with no reporting of the individual sites.

Webalizer configuration files are often longer and more complex than are Analog configuration files, and the preceding list covers just a handful of Webalizer options. Most of the options are documented by comments in the standard configuration file, so you can consult it for more information.
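The grouping rules behind GroupDomains and GroupSite are easier to see with a small example. The following Python sketch imitates the behavior described above; it is not the Webalizer's code, and the group_domain and matches_site names are invented for the illustration.

```python
from fnmatch import fnmatch

def group_domain(hostname, levels):
    """Keep only the rightmost `levels` elements of a hostname.

    A value of 0 leaves the hostname ungrouped. Numeric IP addresses
    would need separate handling, which this sketch doesn't attempt.
    """
    if levels <= 0:
        return hostname
    parts = hostname.split(".")
    return ".".join(parts[-levels:])

def matches_site(hostname, pattern):
    """Test a hostname against a wildcard pattern such as *.abigisp.net."""
    return fnmatch(hostname.lower(), pattern.lower())

print(group_domain("gingko.pangaea.edu", 2))
print(matches_site("dialup12.abigisp.net", "*.abigisp.net"))
```

With a value of 2, both gingko.pangaea.edu and birch.pangaea.edu reduce to pangaea.edu, which is the grouping that the GroupDomains example describes. The dialup12.abigisp.net hostname is hypothetical.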

Running the Webalizer

You can run the Webalizer by typing its name: webalizer. Like Analog, Webalizer doesn't need to be run as root unless read access to the Web server access logs or write access to the Webalizer output directory is restricted to root. You may want to run the Webalizer in a cron job with the Incremental option set to yes in order to have the program automatically build a history of Web site access summaries.

TIP

Chances are your Apache installation created a cron job to rotate the Apache log files; if it didn't, you'll want to create such a configuration, as noted earlier. To ensure that Webalizer catches as many Web hits as possible, run Webalizer just before the rotation occurs, even if you also run Webalizer in its own cron job.

Interpreting the Webalizer Output

Webalizer presents a two-tiered report. The first overview tier shows a summary of activity over the past year. (On a newly installed system, most of those months will be empty.) This summary includes information presented in both a table and a bar chart on the number of hits, Web page downloads, kilobytes transferred, and so on for each month. You can click on the month name in the summary table to get to the second tier of the analysis, which breaks down the month's activity in more detail. This page contains several subsections:

Monthly statistics: The first area presents the same information as in the first-tier analysis page, plus a bit more, such as the number of various response codes returned to clients.
Daily statistics: The second area shows a bar graph and table summarizing the Web traffic for each day of the month. Summary statistics include the number of pages, number of hits, number of files, and number of kilobytes transferred.
Hourly statistics: This area presents information similar to the daily statistics area, but broken down by hour of the day. You can use this to locate peak traffic times for your site, which can be important information when planning capacity or debugging capacity-related problems.
Top URLs: The Webalizer presents two tables that summarize the number of hits and kilobytes associated with specific URLs. (You can use grouping options in the Webalizer configuration file to create groups of URLs to appear in this list, if you like.) One table presents the top URLs by hits, the other the top URLs by kilobytes.
Entry and exit pages: Two tables show the most popular entry and exit pages. An entry page is the first page that a user viewed when visiting your site. An exit page is the last page a user viewed when visiting your site.
Top sites: The Webalizer summarizes the clients that accessed your site the most, both by number of hits and by number of kilobytes. You can group sites together in the Webalizer configuration file with options like GroupSite, described earlier.
Top referrers: If your Web log files include referrer information, the Webalizer summarizes this information so you can see which sites produce the most links to yours.
Top search strings: Some Web search engines, when they produce links to your site, include the search string as part of the referrer URL. The Webalizer can break this information out and regenerate the search strings, which the Webalizer then summarizes for you.
Top user agents: The Webalizer summarizes the names of the Web browsers that most frequently accessed your site.
Top countries: The Webalizer's final section summarizes access by what it calls countries. In reality, the Webalizer is summarizing access by top-level domain (TLD) name, so your top "countries" may include US Commercial, Network, and other domains that aren't restricted to particular countries.
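The "countries" summary amounts to counting the last element of each client hostname. The following Python sketch shows the idea; it isn't the Webalizer's implementation, and the function names, the "unresolved" label for numeric addresses, and the sample hostnames are assumptions made for the example.

```python
from collections import Counter

def top_level_domain(hostname):
    """Return the last dot-separated element, or None for IP addresses."""
    last = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if last.isdigit() or ":" in hostname:
        return None  # an IPv4 or IPv6 address that was never resolved
    return last

def tld_counts(hostnames):
    """Tally client hostnames by top-level domain."""
    return Counter(top_level_domain(h) or "unresolved" for h in hostnames)

# Illustrative client names; the last is an address with no hostname.
clients = ["gingko.pangaea.edu", "birch.pangaea.edu",
           "dialup12.abigisp.net", "192.168.1.1"]
for tld, count in tld_counts(clients).most_common():
    print(tld, count)
```

A count for edu or net says nothing about where a client is physically located, which is why the report's "countries" label shouldn't be taken literally.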

If you want to compare trends in your Web server access, the overview tier can give you general trends, but you'll need to compare the monthly reports (say, in side-by-side Web browser windows) to see how specific access patterns change with time.