Recipe 1.10. Monitoring Web Server Activity

Problem

You want to see which programs your web server is running and which pages visitors are requesting.

Solution

Use command-line tools to get a real-time snapshot of web server activity:

tail: Returns the last part of a file, such as the most recent connection entries from the web server logfile
grep: Searches for a pattern in a file, such as specific filenames or error codes from the web server logfile
ps: Reports on the status of web server processes

Discussion

Almost any decent web hosting account will record connections to your web site in logfiles that you can view and process. A good hosting provider may even help you automate the task of purging the connection records (or log rolling) so the files do not consume your account's disk quota, and give you access to web site statistics software, such as Analog or Urchin, that will generate easy-to-read reports about activity on your web site.

If you're serious about your web site, then you should take advantage of the tools available to you and review web site traffic reports often to understand how visitors get to your site, what's popular, and what's working (or not working). How to look at and use web site traffic reports is covered in Recipe 9.9.

The access and error logs that provide the raw material for traffic reports are constantly updated. Traffic reports themselves, on the other hand, are usually generated less frequently: daily, or even weekly, in some cases. A situation may arise when you can't wait for the next traffic report to be created, and you need an up-to-the-minute picture of the who, what, and how many of your web site's current activity. Here are some command-line tools you can use to take your web site's pulse.

Using tail to track web site requests in real time

First, you'll need to find your Apache access and error logfiles. They are usually saved in a separate logs directory and have names like access_log, access.log, or apache.access_log. The error log should be in the same directory as the access log, so once you've found the logs, Telnet into your web server and switch to the logfiles directory.

Now you can watch connections to your web site as they're handled by Apache with the Unix utility tail. Assuming your access log is named access_log, type this command at your Telnet prompt:

 tail -f access_log

Your shell window should be filled with several lines, like this:

 128.118.152.116 - - [14/May/2005:12:49:26 -0500] "GET /swgr/index.php HTTP/1.1" 200 29070 "http://daddison.com/index.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
 68.142.250.83 - - [14/May/2005:12:49:30 -0500] "GET /case_studies/cs01.html HTTP/1.0" 200 19604 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT; .NET CLR 1.1.4322)"
 165.83.120.231 - - [14/May/2005:12:49:33 -0500] "GET /clients/index.html HTTP/1.1" 301 255 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

Each line records one connection, or hit, to your web site, including the visitor's IP address, the file requested, and the status of the request. By default, tail shows the last 10 lines of the file; the -f flag tells it to keep running and echo new lines to the shell window as they are appended to the access log. See for yourself: open a browser window and, with your shell window still visible, hit a page on your web site. Your request should be duly noted by tail. Press Ctrl-C to stop tail and return to the prompt.
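If the full log lines are too wide to scan comfortably, a small filter can trim each one down to the three items described above. The following is a minimal sketch, not part of Apache or tail: summarize_hits is a hypothetical function name, and the field positions assume whitespace-separated lines laid out like the sample entries shown earlier.

```shell
# summarize_hits: read access-log lines on standard input and print
# "IP path status" for each one.
# Assumption: fields are whitespace-separated as in the sample log lines,
# so field 1 is the client IP, field 7 the requested path, and
# field 9 the status code.
summarize_hits() {
    awk '{ print $1, $7, $9 }'
}
```

You could then pipe tail into the function:

 tail -f access_log | summarize_hits

Some awk implementations buffer piped input, so new hits may show up in batches rather than one at a time.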

Using grep to find specific requests in the web server log

Going back to the problem in Recipe 1.8 about automatically updating pages on your site, let's say that your boss wants to know how many hits to the company's latest news release have been recorded today. And she can't wait until tomorrow, when a nice and neat traffic report will be waiting on the site with the answer. With grep, you can narrow your focus on the access log to just see recent requests for a specific file.

At the Telnet prompt on your web server, you can instruct the grep utility to search the current access log for the filename of the news release by typing this command:

 grep "GET /news/newsrelease.html" access_log

With the search string GET /news/newsrelease.html you're looking for all the requests for newsrelease.html in the /news directory in the current server log. The results might look like this:

 24.91.149.141 - - [14/May/2005:13:55:45 -0500] "GET /news/newsrelease.html HTTP/1.1" 200 18912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
 213.219.80.16 - - [14/May/2005:13:56:36 -0500] "GET /news/newsrelease.html HTTP/1.1" 200 18912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)"
 70.176.205.66 - - [14/May/2005:13:58:09 -0500] "GET /news/newsrelease.html HTTP/1.1" 200 18912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
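Since the boss asked how many hits were recorded today, a count may be more useful than the matching lines themselves. Here is one possible sketch: count_hits is a hypothetical function name, and it assumes the timestamp format shown in the sample log lines (day/month/year inside square brackets).

```shell
# count_hits LOGFILE PATH DAY
# Print the number of GET requests for PATH that were logged on DAY.
# DAY must be written the way the log writes it, for example 14/May/2005.
# grep -F treats the search strings as literal text, so the dots and the
# square bracket are not interpreted as pattern characters.
count_hits() {
    grep -F "GET $2 " "$1" | grep -F -c "[$3:"
}
```

For today's total, you might call it like this, assuming your server's date command prints English month abbreviations:

 count_hits access_log /news/newsrelease.html "$(date +%d/%b/%Y)"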

You can also send the results of the search to a file by modifying the command like this:

 grep "newsrelease.html" access_log > newsrelease_report.txt

And if you want to get really fancy, you can put that second grep command in your crontab file, have it run every 15 minutes, and let the boss check the hits herself.
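As a rough sketch of the cron idea, the grep command can be wrapped in a small script for cron to run. Everything named here is an assumption for illustration: the script name hits_report.sh, the function write_hits_report, and the /path/to locations in the sample crontab entry. The */15 step syntax is also an assumption, since not every version of cron accepts it.

```shell
#!/bin/sh
# hits_report.sh: hypothetical helper script for cron to run.
# A matching crontab entry might look like this (paths are placeholders):
#   */15 * * * * /path/to/hits_report.sh /path/to/logs/access_log /path/to/newsrelease_report.txt

# write_hits_report LOGFILE REPORT
# Copy every log line that mentions the news release into REPORT.
write_hits_report() {
    hr_log=$1
    hr_report=$2
    # Build the report in a temporary file first, then rename it, so
    # anyone opening the report never sees a half-written copy.
    grep -F "newsrelease.html" "$hr_log" > "$hr_report.tmp"
    mv "$hr_report.tmp" "$hr_report"
}

# Run only when both paths are supplied on the command line.
if [ "$#" -eq 2 ]; then
    write_hits_report "$1" "$2"
fi
```

Writing to a temporary file and renaming it is a design choice rather than a requirement. Redirecting straight into the report also works, but the file is briefly empty each time the job starts.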

You can also use grep to sift the access log for errors and unsuccessful requests that visitors to your web site are encountering. Each line in the log includes a status code, right after the quoted request, indicating the result of the request. Some common status codes are shown in Table 1-2. For a complete list, see the World Wide Web Consortium (W3C) list referred to in the "See Also" section of this Recipe.

Table 1-2. Common status codes
Code	Meaning
200	OK, the request has succeeded
401	Unauthorized, the request requires authorization
403	Forbidden, the request was refused
404	Not found
500	Internal server error
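To put these codes to work, you can tally how often each one appears and list the requests that failed. This is a minimal sketch under the same assumption as before, that the status code is the ninth whitespace-separated field on each log line. The function names status_summary and failed_requests are hypothetical.

```shell
# status_summary LOGFILE
# Print each status code found in the log with the number of times it
# appears, one "code count" pair per line, sorted by code.
status_summary() {
    awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' "$1" | sort
}

# failed_requests LOGFILE
# Print "status path" for every request that ended in a 4xx or 5xx code.
failed_requests() {
    awk '$9 ~ /^[45][0-9][0-9]$/ { print $9, $7 }' "$1"
}
```

For a quick look at a single code, plain grep also works. For example, this command should show the requests that returned 404:

 grep '" 404 ' access_log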

Using ps to monitor web server processes

Finally, there may come a time when you want to see what processes are running under your user ID on your web server. Use the Unix process report utility, ps, with this command, replacing userid with your own ID (right after the -U flag):

 ps -Uuserid

The results should look something like this, with httpd indicating Apache processes that are currently running on your web server:

 PID    TTY      TIME CMD
 11565  ?        0:00 httpd
  1715  pts/5    0:00 tail
 11569  pts/6    0:00 tcsh
 11560  ?        0:00 httpd
 11567  ?        0:00 sshd
 11512  ?        0:00 sh
 11542  ?        0:01 httpd
 29475  ?        0:01 sshd
 29477  pts/5    0:00 tcsh
  6373  ?        0:00 sshd
 11559  ?        0:00 httpd
 11578  pts/6    0:00 ps
 11557  ?        0:00 httpd
 11553  ?        0:00 httpd
 11554  ?        0:00 httpd
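When the process list is long, counting the httpd entries by eye gets tedious. The sketch below counts them for you. The function name count_cmd is hypothetical, and it assumes ps prints a header row followed by one process per line with the command name in the last column, as in the listing above.

```shell
# count_cmd NAME
# Read ps output on standard input and print how many rows have NAME
# in the last column. The first row is skipped because it is the header.
count_cmd() {
    awk -v name="$1" 'NR > 1 && $NF == name { n++ } END { print n + 0 }'
}
```

To count your Apache processes, pipe ps into the function:

 ps -Uuserid | count_cmd httpd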