8.3 Log Files

only for RuBoard - do not distribute or recompile

While information provided by users may be the most detailed information collected, by far the most pervasive information collection comes from the operation of the network itself. This data is stored in log files created by network programs and devices.

Log files are ubiquitous. Programmers add log files to their programs to assist in writing and debugging. System operators leave log files enabled so they can verify that software is working correctly, and so they can diagnose the cause of problems when things do not operate properly. Governments and marketers use this information because it is an excellent source of data.

Computers are extraordinarily complicated systems; few system operators are aware of all the log files that their computers create. Many times, a system operator will firmly assert that a particular piece of information is not being retained by their computer system, only to discover that in fact the information is being retained, somewhere in a log file.

There is fundamentally no way for the user of a computer system to know with certainty if a log file is being created of the user's activities. Many organizations that have assured users that records were not being kept of user actions have later discovered that activities were in fact logged. Likewise, many organizations that assumed activities were logged have later discovered problems with the logging system.

8.3.1 Retention and Rotation

Some computer systems automatically age and discard old log files, a process that is called rotation. On other computer systems, there is no formal system for discarding old log file information: these systems retain log files until their disks fill up and somebody manually deletes the log file entries.
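As a rough illustration of what rotation involves, the following Python sketch computes the steps for one rotation cycle. The function name rotation_plan and the numbered-suffix naming scheme (access_log.1, access_log.2, and so on) are assumptions made for this example; real rotation tools differ in detail and often compress older generations as well.

```python
def rotation_plan(base, keep):
    """Return the ordered steps for one rotation of the log file `base`.

    `keep` is the number of old generations to retain (base.1 .. base.keep).
    Each step is ("delete", name) or ("rename", old_name, new_name).
    This is a hypothetical helper: it only plans the steps, it does not
    touch the filesystem.
    """
    # The oldest generation is discarded first to make room.
    steps = [("delete", f"{base}.{keep}")]
    # Shift the remaining generations up by one, oldest first,
    # so that no file is overwritten before it has been moved.
    for n in range(keep - 1, 0, -1):
        steps.append(("rename", f"{base}.{n}", f"{base}.{n + 1}"))
    # The live log becomes generation 1; the server then starts a new file.
    steps.append(("rename", base, f"{base}.1"))
    return steps


for step in rotation_plan("access_log", 3):
    print(step)
```

With keep set to 3 and a weekly schedule, a system like this would hold roughly one month of entries on disk at any moment.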

For the same reasons it is impossible for the user of a computer system to know if her actions are being logged, it is also impossible to know how long log files are actually retained. Here is an example of a few log files from a moderately busy web server:

    % ls -l access*
    -rw-r-----  1 root     www    312714072 Apr 19 13:42 access_log
    -rw-r-----  1 root     www    401536508 Apr 15 00:00 access_log.1
    -rw-r-----  1 root     www     32408676 Apr  8 00:00 access_log.2.gz
    -rw-r-----  1 root     www     31062796 Apr  1 00:00 access_log.3.gz
    %

This computer appears to retain log files for one month. The file access_log contains a record for each web page downloaded since the beginning of April 15. The file access_log.1 contains a list of all web pages downloaded from the start of April 8 to the end of April 14. The files access_log.2.gz and access_log.3.gz are for the two preceding weeks. These files are smaller than the first two files because they were compressed.

Despite appearances, the organization that operates this web server actually maintains log files for a significantly longer period of time. This is because the organization backs up the directory that contains the log files to magnetic tape. These tapes are stored off-site in a safe deposit box. Although there are no specific records of which log files are backed up and which are not, in an emergency (or under a court order), it might be possible for this organization to retrieve log file records that are a year old or even older.

8.3.2 Web Logs

Practically every time a web browser downloads a page on the Web, a record of this event is routinely recorded in the log files of the remote web server. If the web page is assembled using a database server, the database server may create log files of its own. Finally, web logs are also routinely kept on network firewalls, web proxies, and web caches. As a result, simple web browsing can result in a plethora of records being created on machines in locations that are controlled by multiple organizations.

Log files are under the control of the person or organization that controls the web server. Log files are frequently subpoenaed and used in lawsuits or criminal investigations. Log files can be used by employers to determine what employees are doing when they are at work. Log files can be used by a nosy system administrator to spy on others. But in the vast majority of cases, the information in log files is never looked at by anybody. Because most log files are never consulted, and because the contents of most log files are never revealed, most users of the Internet do not know the full extent to which their activities are recorded.

8.3.2.1 What's in a web log?

The following information is either stored directly in most web log files or can be readily inferred from other information in web logs:

The name and IP address of the computer that downloaded the web page.
The time of the request.
The URL that was requested.
The time it took to download the file (this is an indication of the speed of the user's Internet connection).
If HTTP authentication was used, the log file contains the username of the person who downloaded the file.
Any errors that occurred.
The previous web page that was downloaded by the web browser (called the refer link; HTTP carries it in the Referer header).
The kind of web browser that was used.

This information can be combined with other log files, such as login/logout information from Internet service providers or logs from mail servers, to discover the actual identity of the person who was doing the downloading. Normally this kind of cross-correlation requires the assistance of another organization, but that is not always the case.

For example, many ISPs dynamically assign IP addresses to computers each time they dial in. A web server may know that a user accessed a page from the host free-dial-77.freeport.mwci.net; someone would then have to go to mwci.net's log files to find out who the actual user was. On the other hand, sometimes computers are assigned permanent IP addresses; for several years, Simson used a computer named pc-slg.vineyard.net, and Spaf would routinely check his email while on the road, dialed in from shire-ppp.cs.purdue.edu.
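The cross-correlation described above amounts to a lookup: given the address and time recorded by the web server, find the login session that held that address at that moment. The following Python sketch shows the idea. The function who_had_address, the usernames, and the session data are all hypothetical, the addresses come from a range reserved for documentation, and the sketch assumes both logs use the same clock and time zone.

```python
from datetime import datetime


def who_had_address(sessions, ip, when):
    """Return the usernames whose login session held `ip` at time `when`.

    `sessions` is an iterable of (username, ip, start, stop) tuples,
    such as an ISP might derive from its login/logout records.
    """
    return [user for user, addr, start, stop in sessions
            if addr == ip and start <= when <= stop]


# Hypothetical ISP records: the same dynamic address is handed to two
# different customers at different times.
sessions = [
    ("alice", "192.0.2.77", datetime(1997, 3, 8, 23, 50), datetime(1997, 3, 9, 0, 30)),
    ("bob", "192.0.2.77", datetime(1997, 3, 9, 1, 0), datetime(1997, 3, 9, 2, 15)),
]

# Hypothetical web server record: a request from that address at 00:04:11.
hit_time = datetime(1997, 3, 9, 0, 4, 11)
print(who_had_address(sessions, "192.0.2.77", hit_time))
```

Neither log identifies the person on its own; the identification comes from joining the two.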

A typical web server log is shown in Example 8-1.

Example 8-1. A sample web server log

    free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:04:11 -0500] "GET /awa/issue2/Woodstock.gif HTTP/1.0" 200 26385 "http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
    free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:04:27 -0500] "GET /awa/issue2/WoodstockWoodcut.gif HTTP/1.0" 200 54467 "http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
    crawl4.atext.com - - [09/Mar/1997:00:04:30 -0500] "GET /org/mvcc/ HTTP/1.0" 200 10768 "-" "ArchitextSpider" ""
    www-as6.proxy.aol.com - - [09/Mar/1997:00:04:34 -0500] "GET /cgi-bin/imagemap/mvol/cat2.map?31,39 HTTP/1.0" 302 - "http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
    www-as6.proxy.aol.com - - [09/Mar/1997:00:04:40 -0500] "GET /mvol/photo.html HTTP/1.0" 200 6801 "http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
    www-as6.proxy.aol.com - - [09/Mar/1997:00:04:48 -0500] "GET /mvol/photo2.gif HTTP/1.0" 200 12748 "http://www.mvol.com/" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
    free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:07 -0500] "GET /awa/issue2/Wood.html HTTP/1.0" 200 37016 "http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=.&q=woodstock" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
    free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:07 -0500] "GET /awa/issue2/Sprocket1.gif HTTP/1.0" 200 4648 "http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
    free-dial-77.freeport.mwci.net - - [09/Mar/1997:00:05:08 -0500] "GET /awa/issue2/Sprocket2.gif HTTP/1.0" 200 5506 "http://www.vineyard.net/awa/issue2/Wood.html" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)" ""
    www-as6.proxy.aol.com - - [09/Mar/1997:00:05:09 -0500] "GET /mvol/peter/index.html HTTP/1.0" 200 891 "http://www.vineyard.net/mvol/photo.html" "Mozilla/2.0 (Compatible; AOL-IWENG 3.0; Win16)" ""
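Entries like those in Example 8-1 are easy to take apart by program, which is part of what makes them so useful for analysis. The following Python sketch uses a regular expression to split one entry into named fields. The pattern and the names LOG_PATTERN and parse_line are our own, written to fit the layout of this example; other servers may be configured to log different fields in a different order.

```python
import re

# One entry: host, two identity fields, [time], "request", status, size,
# and then (optionally) the quoted refer link and browser fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+'
    r'\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)'
    r'(?:\s+"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of the fields in one log entry, or None if it does not fit."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = ('crawl4.atext.com - - [09/Mar/1997:00:04:30 -0500] '
          '"GET /org/mvcc/ HTTP/1.0" 200 10768 "-" "ArchitextSpider" ""')
entry = parse_line(sample)
for field in ("host", "time", "request", "status", "size", "referer", "agent"):
    print(field, "=", entry[field])
```

A size of "-" (as in the entry with status 302) means that no document body was sent; the sketch keeps it as a string rather than guessing at a number.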

8.3.2.2 The refer link field

The refer link field is another source of privacy violations. It works like this: whenever you, as a web surfer, look for a new page, one of the pieces of information that is sent along is the URL of the page that you are currently looking at. (The HTTP specification says that sending this information should be an option left up to the user to decide, but we have never seen a web browser where sending the refer information is optional.)

One of the main uses that companies have found for the refer link is to gauge the effectiveness of advertisements they purchase on other web sites. Another use is charting how customers move through a site. The refer link field can also reveal personal information: namely, the URL of the page that a user was looking at before he or she clicked into your site.

Refer links frequently reveal unintended information. When you click the link of a web search engine, for instance, the refer link that is sent to the remote web server encodes the search that you were performing. Consider this entry from the log file of the www.simson.net web server:

    pc109240.stofanet.dk - - [21/Mar/2001:16:27:25 -0500] "GET /clips/95.SJMN.AltKeyboards.txt HTTP/1.1" 200 9988 "http://www.google.com/search?hl=da&safe=off&q=%22Building+a+better+keyboard+%22&lr=" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"

This log file entry indicates that the user of the computer pc109240.stofanet.dk was searching on the web search engine Google for the phrase "Building a better keyboard" on March 21, 2001. Sometimes, the results of a refer field can give away far more information than the web user might wish.
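Recovering the search phrase from a refer link takes only a few lines of code. The Python sketch below decodes the query string of the URL recorded in the entry above. The function name search_terms is our own, and the assumption that the search phrase travels in a parameter named q holds for this Google URL but not necessarily for other search engines.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(referer, param="q"):
    """Return the decoded value of `param` in the refer link's query string.

    Returns None when the URL carries no such parameter.
    """
    query = urlsplit(referer).query
    values = parse_qs(query).get(param, [])
    return values[0] if values else None


referer = ("http://www.google.com/search"
           "?hl=da&safe=off&q=%22Building+a+better+keyboard+%22&lr=")
print(search_terms(referer))
```

The same URL also reveals, through the hl=da parameter, that the user's interface language was set to Danish.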

As the Google example shows, the largest number of privacy violations involving refer fields occur with HTML forms that use the GET method (as opposed to the POST method). This is because the GET method encodes the contents of each field in the URL itself. The big advantage of using the GET method is that it allows people to bookmark filled-in forms, such as searches. For example, opening the URL http://www.google.com/search?q=simson will automatically perform a Google search for the name "Simson." But if the previous web page was produced by a GET form to which the user supplied a credit card number or other personal information, that information is part of the page's URL, and it will leak from one web site to another in the refer link.^[6]

^[6] The risk of transferring credit card numbers to third-party sites was reduced somewhat in 1997, when Netscape and Microsoft modified their browsers so that the refer link would no longer be passed from an SSL-enabled site to a non-SSL site.
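The difference between the two methods can be seen by encoding the same form both ways. In the Python sketch below, the form address, the field names, and the card number are all made up for illustration (the number is a dummy value, not a real account); get_url and post_request are hypothetical helpers, not part of any browser or server.

```python
from urllib.parse import urlencode


def get_url(action, fields):
    """GET: the field values become part of the URL itself."""
    return f"{action}?{urlencode(fields)}"


def post_request(action, fields):
    """POST: the URL stays bare and the field values travel in the body."""
    return action, urlencode(fields).encode("ascii")


# Hypothetical order form with a dummy card number.
action = "http://shop.example.com/order"
fields = {"item": "keyboard", "card": "4111111111111111"}

print(get_url(action, fields))
url, body = post_request(action, fields)
print(url)
print(body)
```

Only the URL is passed along as the refer link, so the GET version exposes the card number to the next site the user visits, while the POST version does not.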

8.3.2.3 Obscuring web logs

Proxy servers can render web logs less useful. When a user accesses a web server through a proxy, the web server records the proxy's address, rather than the address of the user's machine. For example, most users who access the Internet through America Online do so through the company's proxy server.

Web proxies do not necessarily give web users anonymity: the user's identity can still be learned by referring to the proxy's logs. Proxies simply make the task more difficult.

8.3.3 RADIUS Logs

RADIUS (Remote Authentication Dial-In User Service) is widely used on the Internet by ISPs and large organizations to validate usernames/passwords for dialup users and to provide for proper accounting. Originally designed by Livingston, RADIUS is now widely implemented by Cisco, Nortel, Lucent, Redback, and most other vendors. Although RADIUS provides functionality that is similar to Cisco's proprietary TACACS and TACACS+ protocols, RADIUS became the dominant protocol because clients and servers were distributed in source-code form, because it was extensible, and because it provided for encryption of passwords sent over the wire (unlike TACACS).

RADIUS log files contain an astonishing amount of information, including usernames, times, IP addresses, and even CALLER-ID information. Here are two example RADIUS records that were created with a Livingston Portmaster 3:

    Thu Apr 19 13:54:09 2001
            Acct-Session-Id = "0E027BE9"
            User-Name = "beth"
            NAS-IP-Address = 199.232.91.8
            NAS-Port = 43
            NAS-Port-Type = Async
            Acct-Status-Type = Start
            Acct-Authentic = RADIUS
            Connect-Info = "50666 LAPM/V42BIS"
            Called-Station-Id = "5086292329"
            Calling-Station-Id = "5086962222"
            Service-Type = Framed
            Framed-Protocol = PPP
            Framed-IP-Address = 199.232.91.50
            Acct-Delay-Time = 0

    Thu Apr 19 13:54:52 2001
            Acct-Session-Id = "0E027BD6"
            User-Name = "simson"
            NAS-IP-Address = 199.232.91.8
            NAS-Port = 34
            NAS-Port-Type = Async
            Acct-Status-Type = Stop
            Acct-Session-Time = 2350
            Acct-Authentic = RADIUS
            Connect-Info = "14400 LAPM/V42BIS"
            Acct-Input-Octets = 18321
            Acct-Output-Octets = 108087
            Called-Station-Id = "6173442329"
            Calling-Station-Id = "6178761111"
            Acct-Terminate-Cause = Idle-Timeout
            Vendor-Specific = vLivingston-020e49646c652054696d656f7574
            Service-Type = Framed
            Framed-Protocol = PPP
            Framed-IP-Address = 199.232.91.38
            Acct-Delay-Time = 0
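Because each record is a list of attribute/value pairs, RADIUS logs are simple to load into a program for searching. The Python sketch below assumes a layout in which every record begins with an unindented timestamp line followed by indented "Name = value" lines; the function parse_radius_records is our own, and the sample is an abbreviated copy of the two records above.

```python
def parse_radius_records(text):
    """Split RADIUS accounting text into a list of dicts, one per record."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate records
        if not line[0].isspace():
            # An unindented line is the timestamp that opens a new record.
            current = {"Timestamp": line.strip()}
            records.append(current)
        elif current is not None and "=" in line:
            name, _, value = line.partition("=")
            current[name.strip()] = value.strip().strip('"')
    return records


SAMPLE = (
    "Thu Apr 19 13:54:09 2001\n"
    '\tUser-Name = "beth"\n'
    "\tAcct-Status-Type = Start\n"
    '\tCalling-Station-Id = "5086962222"\n'
    "\tFramed-IP-Address = 199.232.91.50\n"
    "\n"
    "Thu Apr 19 13:54:52 2001\n"
    '\tUser-Name = "simson"\n'
    "\tAcct-Status-Type = Stop\n"
    "\tAcct-Session-Time = 2350\n"
    '\tCalling-Station-Id = "6178761111"\n'
    "\tFramed-IP-Address = 199.232.91.38\n"
)

for record in parse_radius_records(SAMPLE):
    print(record["User-Name"], record["Calling-Station-Id"],
          record["Framed-IP-Address"], record["Acct-Status-Type"])
```

Each record ties together a username, the telephone number the call came from, and the IP address assigned to the session.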

The CALLER-ID information in RADIUS logs was instrumental in determining the identity of the author of the Melissa computer worm.

8.3.4 Mail Logs

Every time an email message is sent, received, or transported through a mail server, there is a good chance that some program somewhere is making note of that fact in a mail log file. Mail logs usually contain the from: and to: email addresses, the time that the message was sent, and the message-id. Subject: lines and content are usually not logged, although they certainly could be.

Here is an example of a mail log:

    Apr 20 11:43:42 <mail.info> r2 sendmail[50422]: f3KGhg150422: from=<owner-august96@groucho.ctel.net>, size=2468, class=-60, nrcpts=1, msgid=<l03130300b706172b0593@[192.168.1.55]>, proto=ESMTP, daemon=Daemon0, relay=groucho.ctel.net [209.222.72.2]
    Apr 20 11:43:42 <mail.info> r2 sendmail[50423]: f3KGhg150422: to=<beth@walden.cambridge.ma.us>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=139359, relay=local, dsn=2.0.0, stat=Sent
    Apr 20 11:43:54 <mail.info> r2 sendmail[50426]: f3KGhs150426: from=<elbows-request@mc.lcs.mit.edu>, size=1138, class=0, nrcpts=1, msgid=<Pine.GSO.4.21.0104201114300.2335-100000@server.genericdomain.net>, proto=ESMTP, daemon=Daemon0, relay=mc.lcs.mit.edu [18.24.10.26]
    Apr 20 11:43:58 <mail.info> r2 sendmail[50427]: f3KGhs150426: to=<beth@walden.cambridge.ma.us>, delay=00:00:04, xdelay=00:00:04, mailer=local, pri=30456, relay=local, dsn=2.0.0, stat=Sent
    Apr 20 11:44:13 <mail.info> r2 sendmail[50432]: f3KGiC150432: from=<owner-august96@groucho.ctel.net>, size=4303, class=-60, nrcpts=1, msgid=<200104201642.NAA18970@kiln.isn.net>, proto=ESMTP, daemon=Daemon0, relay=groucho.ctel.net [209.222.72.2]
    Apr 20 11:44:13 <mail.info> r2 sendmail[50433]: f3KGiC150432: to=<beth@walden.cambridge.ma.us>, delay=00:00:01, xdelay=00:00:00, mailer=local, pri=141458, relay=local, dsn=2.0.0, stat=Sent

Mail logs are useful for determining people who exchange email and users on mailing lists. (In the example above, the user "beth" is evidently on the mailing list august96@groucho.ctel.net.)
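In the log above, the from= and to= entries for a single message share a queue identifier (such as f3KGhg150422), so sender and recipient can be paired mechanically. The Python sketch below does this for entries in the layout shown; the function correspondents and its regular expressions are our own and assume that addresses appear in angle brackets, which may not hold for every mail server configuration.

```python
import re
from collections import defaultdict

# "sendmail[pid]: queue-id: rest-of-entry"
ENTRY = re.compile(r'sendmail\[\d+\]: (?P<qid>\w+): (?P<rest>.*)')
# The entry body starts with either from=<addr> or to=<addr>.
ADDR = re.compile(r'(?P<kind>from|to)=<(?P<addr>[^>]*)>')


def correspondents(lines):
    """Return (sender, recipient) pairs, matched up by queue identifier."""
    senders = {}
    recipients = defaultdict(list)
    for line in lines:
        entry = ENTRY.search(line)
        if not entry:
            continue
        addr = ADDR.match(entry.group("rest"))
        if not addr:
            continue
        if addr.group("kind") == "from":
            senders[entry.group("qid")] = addr.group("addr")
        else:
            recipients[entry.group("qid")].append(addr.group("addr"))
    return [(senders[qid], rcpt)
            for qid, rcpts in recipients.items() if qid in senders
            for rcpt in rcpts]


log = [
    "Apr 20 11:43:42 <mail.info> r2 sendmail[50422]: f3KGhg150422: "
    "from=<owner-august96@groucho.ctel.net>, size=2468, class=-60, nrcpts=1",
    "Apr 20 11:43:42 <mail.info> r2 sendmail[50423]: f3KGhg150422: "
    "to=<beth@walden.cambridge.ma.us>, delay=00:00:00, mailer=local, stat=Sent",
    "Apr 20 11:43:54 <mail.info> r2 sendmail[50426]: f3KGhs150426: "
    "from=<elbows-request@mc.lcs.mit.edu>, size=1138, class=0, nrcpts=1",
    "Apr 20 11:43:58 <mail.info> r2 sendmail[50427]: f3KGhs150426: "
    "to=<beth@walden.cambridge.ma.us>, delay=00:00:04, mailer=local, stat=Sent",
]

for sender, recipient in correspondents(log):
    print(sender, "->", recipient)
```

Run over weeks of logs, a list like this maps out who corresponds with whom, without ever touching the content of a message.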

8.3.5 DNS Logs

The bind DNS nameserver produced by the Internet Software Consortium can be configured to log every DNS query that it receives. The bind log file contains the name of the host from which each query was made, the IP address from which the query was made, and the query itself. An example of such a log file is shown here:

    Apr 20 13:18:17 <local2.info> r2 named[50916]: XX /206.196.128.1/queen.simson.net/A/IN
    Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/2.72.222.209.in-addr.arpa/PTR/IN
    Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/234.15.7.64.in-addr.arpa/PTR/IN
    Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/groucho.ctel.net/A/IN
    Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/groucho.ctel.net/ANY/IN
    Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/walden.cambridge.ma.us/ANY/IN
    Apr 20 13:18:21 <local2.info> r2 named[50916]: XX+/64.7.15.234/ctel.net/ANY/IN
    Apr 20 13:18:21 <local2.info> r2 named[50916]: XX+/64.7.15.234/earthlink.net/ANY/IN
    Apr 20 13:18:36 <local2.info> r2 named[50916]: XX /209.20.178.33/queen.simson.net/A/IN
    Apr 20 13:18:36 <local2.info> r2 named[50916]: XX /200.186.94.1/www.dbz.ex.com/A/IN

Logging DNS queries can be useful for system maintenance and forensic work. It is also a great way to silently monitor the activities of customers or other individuals. Because a computer must look up a hostname in the DNS before it can fetch a URL that contains that hostname, monitoring a user's DNS usage provides an ISP with a detailed report of each web site that the user accesses. Monitoring DNS queries can also give pointers to attackers, as even attackers who launch their attacks from third-party machines frequently perform DNS queries from their home machines first.
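Turning a query log into a per-client report of the kind described above is straightforward. The Python sketch below groups the names looked up by the address that asked for them. The pattern is written to fit the entries shown in this section (an XX marker followed by slash-separated address, name, type, and class); the function names_by_client is our own, and other versions of the nameserver may log queries in a different layout.

```python
import re
from collections import defaultdict

# "named[pid]: XX /client-ip/name/type/class", with an optional "+" after XX.
QUERY = re.compile(
    r'named\[\d+\]: XX\+?\s*/(?P<client>[^/]+)/(?P<name>[^/]+)'
    r'/(?P<qtype>[^/]+)/(?P<qclass>\S+)'
)


def names_by_client(lines):
    """Map each client address to the list of (name, type) pairs it queried."""
    report = defaultdict(list)
    for line in lines:
        match = QUERY.search(line)
        if match:
            report[match.group("client")].append(
                (match.group("name"), match.group("qtype")))
    return dict(report)


log = [
    "Apr 20 13:18:17 <local2.info> r2 named[50916]: XX /206.196.128.1/queen.simson.net/A/IN",
    "Apr 20 13:18:20 <local2.info> r2 named[50916]: XX+/64.7.15.234/groucho.ctel.net/A/IN",
    "Apr 20 13:18:21 <local2.info> r2 named[50916]: XX+/64.7.15.234/earthlink.net/ANY/IN",
]

for client, queries in names_by_client(log).items():
    print(client, queries)
```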
