6.4 Some Useful Programs | Java Network Programming, Third Edition

You now know everything there is to know about the java.net.InetAddress class. The tools in this class alone let you write some genuinely useful programs. Here we'll look at two examples: one that queries your domain name server interactively and another that can improve the performance of your web server by processing log files offline.

6.4.1 HostLookup

nslookup is an old Unix utility that converts hostnames to IP addresses and IP addresses to hostnames. It has two modes: interactive and command-line. If you enter a hostname on the command line, nslookup prints the IP address of that host. If you enter an IP address on the command line, nslookup prints the hostname. If no hostname or IP address is entered on the command line, nslookup enters interactive mode, in which it reads hostnames and IP addresses from standard input and echoes back the corresponding IP addresses and hostnames until you type "exit". Example 6-11 is a simple character mode application called HostLookup, which emulates nslookup. It doesn't implement any of nslookup's more complex features, but it does enough to be useful.

Example 6-11. An nslookup clone

    import java.net.*;
    import java.io.*;

    public class HostLookup {

      public static void main(String[] args) {

        if (args.length > 0) { // use command line
          for (int i = 0; i < args.length; i++) {
            System.out.println(lookup(args[i]));
          }
        }
        else {
          BufferedReader in = new BufferedReader(
           new InputStreamReader(System.in));
          System.out.println(
           "Enter names and IP addresses. Enter \"exit\" to quit.");
          try {
            while (true) {
              String host = in.readLine( );
              // readLine( ) returns null at end of input
              if (host == null || host.equalsIgnoreCase("exit")
               || host.equalsIgnoreCase("quit")) {
                break;
              }
              System.out.println(lookup(host));
            }
          }
          catch (IOException ex) {
            System.err.println(ex);
          }
        }

      } /* end main */

      private static String lookup(String host) {

        InetAddress node;
        // get the bytes of the IP address
        try {
          node = InetAddress.getByName(host);
        }
        catch (UnknownHostException ex) {
          return "Cannot find host " + host;
        }

        if (isHostname(host)) {
          return node.getHostAddress( );
        }
        else {  // this is an IP address
          return node.getHostName( );
        }

      }  // end lookup

      private static boolean isHostname(String host) {

        // Is this an IPv6 address?
        if (host.indexOf(':') != -1) return false;

        char[] ca = host.toCharArray( );
        // if we see a character that is neither a digit nor a period
        // then host is probably a hostname
        for (int i = 0; i < ca.length; i++) {
          if (!Character.isDigit(ca[i])) {
            if (ca[i] != '.') return true;
          }
        }

        // Everything was either a digit or a period
        // so host looks like an IPv4 address in dotted quad format
        return false;

      }  // end isHostname

    } // end HostLookup

Here's some sample output. In the interactive session, each line typed by the user is followed by the program's response:

    $ java HostLookup utopia.poly.edu
    128.238.3.21
    $ java HostLookup 128.238.3.21
    utopia.poly.edu
    $ java HostLookup
    Enter names and IP addresses. Enter "exit" to quit.
    cs.nyu.edu
    128.122.80.78
    199.1.32.90
    star.blackstar.com
    localhost
    127.0.0.1
    stallio.elharo.com
    Cannot find host stallio.elharo.com
    stallion.elharo.com
    127.0.0.1
    127.0.0.1
    stallion.elharo.com
    java.oreilly.com
    208.201.239.37
    208.201.239.37
    www.oreillynet.com
    exit
    $

There are three methods in the HostLookup program: main( ), lookup( ), and isHostname( ). The main( ) method determines whether there are command-line arguments. If there are command-line arguments, main( ) calls lookup( ) to process each one. If there are no command-line arguments, main( ) chains a BufferedReader to an InputStreamReader chained to System.in and reads input from the user with the readLine( ) method. (The warning about this method in Chapter 4 doesn't apply here because the program is reading from the console, not a network connection.) If the line is "exit" or "quit", then the program exits. Otherwise, the line is assumed to be a hostname or IP address and is passed to the lookup( ) method.

The lookup( ) method uses InetAddress.getByName( ) to find the requested host, regardless of the input's format; remember that getByName( ) doesn't care if its argument is a name or a dotted quad address. If getByName( ) fails, lookup( ) returns a failure message. Otherwise, it gets the address of the requested system. Then lookup( ) calls isHostname( ) to determine whether the input string host is a hostname such as cs.nyu.edu, a dotted quad IPv4 address such as 128.122.153.70, or a hexadecimal IPv6 address such as FEDC::DC:0:7076:10.

isHostname( ) first looks for colons, which any IPv6 hexadecimal address will have and no hostname will have. If it finds any, it returns false. Checking for IPv4 addresses is a little trickier because dotted quad addresses don't contain any character that can't appear in a hostname. Instead, isHostname( ) looks at each character of the string; if all the characters are digits or periods, isHostname( ) guesses that the string is a numeric IP address and returns false. Otherwise, isHostname( ) guesses that the string is a hostname and returns true. What if the string is neither? This case rarely reaches isHostname( ): if the string is neither a hostname nor an address, getByName( ) won't be able to do a lookup and will throw an exception. However, it would not be difficult to add a test making sure that the string looks valid; this is left as an exercise for the reader.

If the user types a hostname, lookup( ) returns the corresponding dotted quad or hexadecimal address using getHostAddress( ). If the user types an IP address, lookup( ) uses the getHostName( ) method to look up the hostname corresponding to the address and returns it.
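One way to approach the validity test mentioned above is sketched here. The class name InputKind, the Kind enum, and the classify( ) method are hypothetical names introduced for illustration; they are not part of HostLookup. The checks are deliberately simple and stricter than getByName( ): the IPv6 test only confirms that the characters are plausible, and the hostname test allows only letters, digits, hyphens, and periods.

```java
// Hypothetical helper; a rough sketch, not a full address parser.
final class InputKind {

    enum Kind { IPV4, IPV6, HOSTNAME, INVALID }

    static Kind classify(String s) {
        if (s == null || s.isEmpty()) return Kind.INVALID;

        // Same first test as isHostname: a colon suggests IPv6.
        if (s.indexOf(':') != -1) {
            // Only hex digits, colons, and periods (embedded IPv4) may appear.
            return s.matches("[0-9A-Fa-f:.]+") ? Kind.IPV6 : Kind.INVALID;
        }

        // Only digits and periods: require four parts, each from 0 to 255.
        if (s.matches("[0-9.]+")) {
            String[] parts = s.split("\\.", -1);
            if (parts.length != 4) return Kind.INVALID;
            for (String part : parts) {
                if (part.isEmpty() || part.length() > 3
                        || Integer.parseInt(part) > 255) {
                    return Kind.INVALID;
                }
            }
            return Kind.IPV4;
        }

        // Otherwise accept period-separated labels of letters, digits, hyphens.
        return s.matches("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.?")
            ? Kind.HOSTNAME : Kind.INVALID;
    }
}
```

With a check like this, lookup( ) could reject a string such as 999.1.1.1 before it ever calls getByName( ), at the cost of also rejecting a few unusual forms that getByName( ) might accept.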

6.4.2 Processing Web Server Log Files

Web server logs track the hosts that access a web site. By default, the log reports the IP addresses of the sites that connect to the server. However, you can often get more information from the names of those sites than from their IP addresses. Most web servers have an option to store hostnames instead of IP addresses, but this can hurt performance because the server needs to make a DNS request for each hit. It is much more efficient to log the IP addresses and convert them to hostnames at a later time, when the server isn't busy or even on another machine completely. Example 6-12 is a program called Weblog that reads a web server log file and prints each line with IP addresses converted to hostnames.

Most web servers have standardized on the common log file format, although there are exceptions; if your web server is one of those exceptions, you'll have to modify this program. A typical line in the common log file format looks like this:

    205.160.186.76 unknown - [17/Jun/2003:22:53:58 -0500] "GET /bgs/greenbg.gif HTTP/1.0" 200 50

This line indicates that a web browser at IP address 205.160.186.76 requested the file /bgs/greenbg.gif from this web server at 10:53 p.m. (and 58 seconds) on June 17, 2003. The file was found (response code 200) and 50 bytes of data were successfully transferred to the browser.

The first field is the IP address or, if DNS resolution is turned on, the hostname from which the connection was made. This is followed by a space. Therefore, for our purposes, parsing the log file is easy: everything before the first space is the IP address, and everything after it does not need to be changed.

The Common Log File Format

If you want to expand Weblog into a more general web server log processor, you need a little more information about the common log file format. A line in the file has the format:

 remotehost rfc931 authuser [date] "request" status bytes

remotehost: remotehost is either the hostname or IP address from which the browser connected.
rfc931: rfc931 is the username of the user on the remote system, as specified by Internet protocol RFC 931. Very few browsers send this information, so it's almost always either unknown or a dash. This is followed by a space.
authuser: authuser is the username the client supplied through HTTP authentication, if the requested resource required it. Most requests are not authenticated, so this field usually is filled in with a dash, followed by a space.
[date]: The date and time of the request are given in brackets. This is the local system time when the request was made. Days are a two-digit number ranging from 01 to 31, followed by a slash. The month is Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, or Dec, followed by another slash. The year is indicated by four digits. The year is followed by a colon, the hour (from 00 to 23), another colon, two digits signifying the minute (00 to 59), a colon, and two digits signifying the seconds (00 to 59). After a space comes the offset of the local time zone from Greenwich Mean Time, such as -0500. Then comes the closing bracket and another space.
"request": The request line exactly as it came from the client. It is enclosed in quotation marks because it may contain embedded spaces. It is not guaranteed to be a valid HTTP request since client software may misbehave.
status: A numeric HTTP status code returned to the client. A list of HTTP 1.0 status codes is given in Chapter 3. The most common response is 200, which means the request was successfully processed.
bytes: The number of bytes of data that was sent to the client as a result of this request.
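As a starting point for a more general log processor, the field layout described above can be captured with a regular expression. The following is only a sketch: the class name CommonLogEntry, its field names, and the parse( ) method are hypothetical, and the code uses the java.util.regex and java.time packages from later Java releases rather than anything in Weblog. It also assumes that a server may log a dash in place of the byte count.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one line in the common log file format.
final class CommonLogEntry {

    // remotehost rfc931 authuser [date] "request" status bytes
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(.*)\" (\\d{3}) (\\d+|-)$");

    // For example: 17/Jun/2003:22:53:58 -0500
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String remoteHost;
    final String rfc931;
    final String authUser;
    final OffsetDateTime date;
    final String request;
    final int status;
    final long bytes;   // -1 if the server logged a dash

    private CommonLogEntry(Matcher m) {
        remoteHost = m.group(1);
        rfc931 = m.group(2);
        authUser = m.group(3);
        date = OffsetDateTime.parse(m.group(4), DATE);
        request = m.group(5);
        status = Integer.parseInt(m.group(6));
        bytes = m.group(7).equals("-") ? -1 : Long.parseLong(m.group(7));
    }

    // Throws IllegalArgumentException if the line does not fit the format.
    static CommonLogEntry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                "Not in common log file format: " + line);
        }
        return new CommonLogEntry(m);
    }
}
```

Because the request field may itself contain spaces and even quotation marks, the pattern matches it greedily and relies on the status and byte count at the end of the line to find where the request stops.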

The dotted quad format IP address is converted into a hostname using the usual methods of java.net.InetAddress. Example 6-12 shows the code.

Example 6-12. Process web server log files

    import java.net.*;
    import java.io.*;
    import java.util.*;
    import com.macfaq.io.SafeBufferedReader;

    public class Weblog {

      public static void main(String[] args) {

        Date start = new Date( );
        try {
          FileInputStream fin =  new FileInputStream(args[0]);
          Reader in = new InputStreamReader(fin);
          SafeBufferedReader bin = new SafeBufferedReader(in);

          String entry = null;
          while ((entry = bin.readLine( )) != null) {

            // separate out the IP address
            int index = entry.indexOf(' ', 0);
            String ip = entry.substring(0, index);
            String theRest = entry.substring(index, entry.length( ));

            // find the hostname and print it out
            try {
              InetAddress address = InetAddress.getByName(ip);
              System.out.println(address.getHostName( ) + theRest);
            }
            catch (UnknownHostException ex) {
              System.out.println(entry);
            }

          } // end while
        }
        catch (IOException ex) {
          System.out.println("Exception: " + ex);
        }

        Date end = new Date( );
        long elapsedTime = (end.getTime( )-start.getTime( ))/1000;
        System.out.println("Elapsed time: " + elapsedTime + " seconds");

      }  // end main

    }

The name of the file to be processed is passed to Weblog as the first argument on the command line. A FileInputStream fin is opened from this file and an InputStreamReader is chained to fin. This InputStreamReader is buffered by chaining it to an instance of the SafeBufferedReader class developed in Chapter 4. The file is processed line by line in a while loop.

Each pass through the loop places one line in the String variable entry. entry is then split into two substrings: ip, which contains everything before the first space, and theRest, which is everything from the first space to the end of the line. The position of the first space is determined by entry.indexOf(' ', 0). ip is converted to an InetAddress object using getByName( ). getHostName( ) then looks up the hostname. Finally, the hostname and the rest of the line (theRest, which begins with the space) are printed on System.out. Output can be sent to a new file through the standard means for redirecting output.

Weblog is more efficient than you might expect. Most web browsers generate multiple log file entries per page served, since there's an entry in the log not just for the page itself but for each graphic on the page. And many visitors request multiple pages while visiting a site. DNS lookups are expensive and it simply doesn't make sense to look up each site every time it appears in the log file. The InetAddress class caches requested addresses. If the same address is requested again, it can be retrieved from the cache much more quickly than from DNS.
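To see why caching pays off for log files, it can help to picture the cache as a simple map from address to name. The sketch below is an illustration of the effect only, not a description of how InetAddress implements its cache internally. CachingResolver, resolve( ), lookupCount( ), and dnsBacked( ) are hypothetical names; the lookup function is passed in so that the class can be exercised without touching the network, and it is assumed never to return null.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper that remembers the result of each lookup.
final class CachingResolver {

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;
    private int lookups = 0;

    CachingResolver(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    // Returns the remembered name if this address has been seen before;
    // otherwise performs the slow lookup once and stores the result.
    synchronized String resolve(String ip) {
        String name = cache.get(ip);
        if (name == null) {
            lookups++;
            name = lookup.apply(ip);
            cache.put(ip, name);
        }
        return name;
    }

    // How many times the underlying lookup was actually invoked.
    synchronized int lookupCount() {
        return lookups;
    }

    // A resolver backed by real DNS; keeps the address if the lookup fails.
    static CachingResolver dnsBacked() {
        return new CachingResolver(ip -> {
            try {
                return InetAddress.getByName(ip).getHostName();
            }
            catch (UnknownHostException ex) {
                return ip;
            }
        });
    }
}
```

A page with twenty graphics produces twenty-one log entries from the same address, but with a cache like this only the first of them costs a trip to the name server.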

Nonetheless, this program could certainly be faster. In my initial tests, it took more than a second per log entry. (Exact numbers depend on the speed of your network connection, the speed of the local and remote DNS servers, and network congestion when the program is run.) The program spends a huge amount of time sitting and waiting for DNS requests to return. Of course, this is exactly the problem multithreading is designed to solve. One main thread can read the log file and pass off individual entries to other threads for processing.

A thread pool is absolutely necessary here. Over the space of a few days, even low-volume web servers can easily generate a log file with hundreds of thousands of lines. Trying to process such a log file by spawning a new thread for each entry would rapidly bring even the strongest virtual machine to its knees, especially since the main thread can read log file entries much faster than individual threads can resolve domain names and die. Consequently, reusing threads is essential. The number of threads is stored in a tunable parameter, numberOfThreads, so that it can be adjusted to fit the VM and network stack. (Launching too many simultaneous DNS requests can also cause problems.)

This program is now divided into two classes. The first class, PooledWeblog, shown in Example 6-13, contains the main( ) method and the processLogFile( ) method. It also holds the resources that need to be shared among the threads. These are the pool, implemented as a synchronized LinkedList from the Java Collections API, and the output log, implemented as a BufferedWriter named out. Individual threads have direct access to the pool but have to pass through PooledWeblog's log( ) method to write output.

The key method is processLogFile( ). As before, this method reads from the underlying log file. However, each entry is placed in the entries pool rather than being immediately processed. Because this method is likely to run much more quickly than the threads that have to access DNS, it yields after reading each entry. Furthermore, it goes to sleep if there are more entries in the pool than threads available to process them. The amount of time it sleeps depends on the number of threads. This setup avoids using excessive amounts of memory for very large log files. When the last entry is read, the finished flag is set to true to tell the threads that they can die once they've completed their work.

Example 6-13. PooledWeblog

    import java.io.*;
    import java.util.*;

    public class PooledWeblog {

      private BufferedReader in;
      private BufferedWriter out;
      private int numberOfThreads;
      private List entries = Collections.synchronizedList(new LinkedList( ));
      // volatile so that the lookup threads see the change made by main
      private volatile boolean finished = false;

      public PooledWeblog(InputStream in, OutputStream out,
       int numberOfThreads) {
        this.in = new BufferedReader(new InputStreamReader(in));
        this.out = new BufferedWriter(new OutputStreamWriter(out));
        this.numberOfThreads = numberOfThreads;
      }

      public boolean isFinished( ) {
        return this.finished;
      }

      public int getNumberOfThreads( ) {
        return numberOfThreads;
      }

      public void processLogFile( ) {

        for (int i = 0; i < numberOfThreads; i++) {
          Thread t = new LookupThread(entries, this);
          t.start( );
        }

        try {
          String entry = in.readLine( );
          while (entry != null) {

            // let the lookup threads catch up if the pool is full
            if (entries.size( ) > numberOfThreads) {
              try {
                Thread.sleep((long) (1000.0/numberOfThreads));
              }
              catch (InterruptedException ex) {}
              continue;
            }

            synchronized (entries) {
              entries.add(0, entry);
              entries.notifyAll( );
            }

            entry = in.readLine( );
            Thread.yield( );

          } // end while
        }
        catch (IOException ex) {
          System.out.println("Exception: " + ex);
        }

        this.finished = true;

        // wake up any threads that are still waiting so they can exit
        synchronized (entries) {
          entries.notifyAll( );
        }

      }

      public void log(String entry) throws IOException {
        out.write(entry + System.getProperty("line.separator", "\r\n"));
        out.flush( );
      }

      public static void main(String[] args) {

        try {
          PooledWeblog tw = new PooledWeblog(new FileInputStream(args[0]),
           System.out, 100);
          tw.processLogFile( );
        }
        catch (FileNotFoundException ex) {
          System.err.println("Usage: java PooledWeblog logfile_name");
        }
        catch (ArrayIndexOutOfBoundsException ex) {
          System.err.println("Usage: java PooledWeblog logfile_name");
        }
        catch (Exception ex) {
          System.err.println(ex);
          ex.printStackTrace( );
        }

      }  // end main

    }

The LookupThread class, shown in Example 6-14, handles the detailed work of converting IP addresses to hostnames in the log entries. The constructor provides each thread with a reference to the entries pool it will retrieve work from and a reference to the PooledWeblog object it's working for. The latter reference allows callbacks to the PooledWeblog so that the thread can log converted entries and check to see when the last entry has been processed. It does so by calling the isFinished( ) method in PooledWeblog when the entries pool is empty (i.e., has size 0). Neither an empty pool nor isFinished( ) returning true is sufficient by itself. isFinished( ) returns true after the last entry is placed in the pool, which occurs, at least for a small amount of time, before the last entry is removed from the pool. And entries may be empty while there are still many entries remaining to be read if the lookup threads outrun the main thread reading the log file.

Example 6-14. LookupThread

    import java.net.*;
    import java.io.*;
    import java.util.*;

    public class LookupThread extends Thread {

      private List entries;
      PooledWeblog log;   // used for callbacks

      public LookupThread(List entries, PooledWeblog log) {
        this.entries = entries;
        this.log = log;
      }

      public void run( ) {

        String entry;

        while (true) {

          synchronized (entries) {
            // exit only when the pool is empty and no more entries are coming
            while (entries.size( ) == 0) {
              if (log.isFinished( )) return;
              try {
                entries.wait( );
              }
              catch (InterruptedException ex) {
              }
            }

            entry = (String) entries.remove(entries.size( )-1);
          }

          int index = entry.indexOf(' ', 0);
          String remoteHost = entry.substring(0, index);
          String theRest = entry.substring(index, entry.length( ));

          try {
            remoteHost = InetAddress.getByName(remoteHost).getHostName( );
          }
          catch (Exception ex) {
            // remoteHost remains in dotted quad format
          }

          try {
            log.log(remoteHost + theRest);
          }
          catch (IOException ex) {
          }

          // yield( ) is static, so call it on the class
          Thread.yield( );

        }

      }

    }

Using threads like this lets the same log files be processed in parallel, a huge time savings. In my unscientific tests, the threaded version is 10 to 50 times faster than the sequential version.
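On Java 5 and later, the same division of labor can be expressed with the java.util.concurrent package instead of a hand-built pool. The sketch below swaps in that technique; it is not the design used in Examples 6-13 and 6-14. ExecutorWeblog, process( ), and convert( ) are hypothetical names, and the resolver is passed in as a function so that the example does not depend on DNS. It assumes the entries fit in memory, which the streaming design above does not require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical alternative built on a fixed-size ExecutorService.
final class ExecutorWeblog {

    static List<String> process(List<String> entries, int numberOfThreads,
            Function<String, String> resolver)
            throws InterruptedException, ExecutionException {

        // The executor reuses numberOfThreads worker threads for all tasks.
        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (final String entry : entries) {
                Callable<String> task = () -> convert(entry, resolver);
                pending.add(pool.submit(task));
            }

            // Collect the results in the order the tasks were submitted.
            List<String> converted = new ArrayList<>();
            for (Future<String> result : pending) {
                converted.add(result.get());
            }
            return converted;
        }
        finally {
            pool.shutdown();
        }
    }

    // Replaces everything before the first space with the resolved name.
    static String convert(String entry, Function<String, String> resolver) {
        int index = entry.indexOf(' ');
        if (index < 0) return entry;   // no space; leave the line unchanged
        return resolver.apply(entry.substring(0, index))
            + entry.substring(index);
    }
}
```

Because the Future objects are read back in submission order, this version returns the entries in their original order, but only by holding every pending result in memory until its turn comes.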

The biggest disadvantage to the multithreaded approach is that it reorders the log file. The output statistics aren't necessarily in the same order as the input statistics. For simple hit counting, this doesn't matter. However, there are some log analysis tools that can mine a log file to determine paths users followed through a site. These tools could get confused if the log is out of sequence. If the log sequence is an issue, attach a sequence number to each log entry. As the individual threads return log entries to the main program, the log( ) method in the main program stores any that arrive out of order until their predecessors appear. This is in some ways reminiscent of how network software reorders TCP packets that arrive out of order.
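The sequence-number idea can be sketched as a small buffer that holds early arrivals until the gap before them is filled. ReorderBuffer and accept( ) are hypothetical names, not part of PooledWeblog, and the sketch assumes the main thread numbers the entries 0, 1, 2, and so on as it reads them and that every number eventually arrives exactly once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that releases entries in sequence-number order.
final class ReorderBuffer {

    private final Map<Long, String> waiting = new HashMap<>();
    private long next = 0;   // sequence number of the next entry to release

    // Stores the entry, then returns every entry that is now ready to be
    // written, in order. Returns an empty list if the entry arrived early.
    synchronized List<String> accept(long sequence, String entry) {
        waiting.put(sequence, entry);
        List<String> ready = new ArrayList<>();
        String candidate;
        while ((candidate = waiting.remove(next)) != null) {
            ready.add(candidate);
            next++;
        }
        return ready;
    }
}
```

A log( ) method built along these lines would pass each converted entry and its sequence number to accept( ) and write whatever list comes back. The price is memory: one slow DNS lookup near the start of the file can leave many later entries waiting in the buffer.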