17.2 The ContentHandler Class


A subclass of ContentHandler overrides the getContent() method to return an object that's the Java equivalent of the content. This method can be quite simple or quite complex, depending almost entirely on the complexity of the content type you're trying to parse. A text/plain content handler is quite simple; a text/rtf content handler would be very complex.

The ContentHandler class has only a simple noargs constructor:

 public ContentHandler( ) 

Since ContentHandler is an abstract class, you never call its constructor directly, only from inside the constructors of subclasses.

The primary method of the class, albeit an abstract one, is getContent( ) :

 public abstract Object getContent(URLConnection uc) throws IOException 

This method is normally called only from inside the getContent( ) method of a URLConnection object. It is overridden in a subclass that is specific to the type of content being handled. getContent( ) should use the URLConnection's InputStream to create an object. There are no rules about what type of object a content handler should return. In general, this depends on what the application requesting the content expects. Content handlers for text-like content bundled with the JDK return some subclass of InputStream. Content handlers for images return ImageProducer objects.

The getContent( ) method of a content handler does not get the full InputStream that the URLConnection has access to. The InputStream that a content handler sees should include only the content's raw data. Any MIME headers or other protocol-specific information that come from the server should be stripped by the URLConnection before it passes the stream to the ContentHandler . A ContentHandler is only responsible for content, not for any protocol overhead that may be present. The URLConnection should have already performed any necessary handshaking with the server and interpreted any headers it sends.
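To make the division of labor concrete, here is a minimal sketch of a handler that turns a body into a String. PlainTextHandler and CannedConnection are hypothetical names invented for this illustration, and the UTF-8 decoding is an assumption. CannedConnection stands in for a real URLConnection that has already stripped the headers, so the handler sees nothing but content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ContentHandler;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PlainTextHandlerSketch {

  // Hypothetical handler: converts the body it is given into a String.
  static class PlainTextHandler extends ContentHandler {
    @Override
    public Object getContent(URLConnection uc) throws IOException {
      InputStream in = uc.getInputStream();
      ByteArrayOutputStream body = new ByteArrayOutputStream();
      int b;
      while ((b = in.read()) != -1) body.write(b);
      // Assumption: the body is UTF-8 text
      return new String(body.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  // Hypothetical stand-in for a real connection. It serves only the
  // body bytes, as if the headers had already been read and removed.
  static class CannedConnection extends URLConnection {
    private final byte[] body;

    CannedConnection(URL u, byte[] body) {
      super(u);
      this.body = body;
    }

    @Override
    public void connect() {
      connected = true;
    }

    @Override
    public InputStream getInputStream() {
      return new ByteArrayInputStream(body);
    }
  }

  // Runs the handler against an in-memory body; no network is used.
  static String demo(String body) {
    try {
      URL u = URI.create("http://example.com/note.txt").toURL();
      URLConnection uc =
          new CannedConnection(u, body.getBytes(StandardCharsets.UTF_8));
      return (String) new PlainTextHandler().getContent(uc);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo("Hello from the content handler"));
  }
}
```

Because the handler only ever asks for the InputStream, it works unchanged whether the bytes arrive over HTTP, FTP, or from memory as they do here.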

17.2.1 A Content Handler for Tab-Separated Values

To see how content handlers work, let's create a ContentHandler that handles the text/tab-separated-values content type. We aren't concerned with how the tab-separated values get to us. That's for a protocol handler to deal with. All a ContentHandler needs to know is the MIME type and format of the data.

Tab-separated values are produced by many database and spreadsheet programs. A tab-separated file may look something like this (tabs are indicated by arrows).

 JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
 O'Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472

In database parlance, each line is a record, and each tab-delimited value within a line is a field. It is usually (though not necessarily) true that each field has the same meaning in each record. In the previous example, the first field is the company name.

The first question to ask is: what kind of Java object should we convert the tab-separated values to? The simplest and most general way to store each record is as an array of Strings. Successive records can be collected in a Vector. In many applications, however, you have a great deal more knowledge about the exact format and meaning of the data than we do here. The more you know about the data you're dealing with, the better a ContentHandler you can write. For example, if you know that the data you're downloading represents U.S. addresses, you could define a class like this:

 public class Address {
   private String name;
   private String street;
   private String city;
   private String state;
   private String zip;
 }

This class would also have appropriate constructors and other methods to represent each record. In this example, we don't know anything about the data in advance, or how many records we'll have to store. Therefore, we will take the most general approach and convert each record into an array of strings, using a Vector to store each array until there are no more records. The getContent( ) method can return the Vector of String arrays.
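If you did know the layout in advance, the step from a parsed record to a typed object might look like the following sketch. The fromFields( ) factory method and the rule that a record has exactly five fields in this order are assumptions made for the illustration; they are not part of the handler developed in this chapter.

```java
public class AddressSketch {

  // Hypothetical record class modeled on the Address outline above.
  static final class Address {
    final String name;
    final String street;
    final String city;
    final String state;
    final String zip;

    Address(String name, String street, String city,
            String state, String zip) {
      this.name = name;
      this.street = street;
      this.city = city;
      this.state = state;
      this.zip = zip;
    }

    // Assumption: a record always holds these five fields in this order.
    static Address fromFields(String[] fields) {
      if (fields.length != 5) {
        throw new IllegalArgumentException(
            "expected 5 fields, got " + fields.length);
      }
      return new Address(fields[0], fields[1], fields[2],
                         fields[3], fields[4]);
    }
  }

  public static void main(String[] args) {
    String[] record = {"JPE Associates",
        "341 Lafayette Street, Suite 1025", "New York", "NY", "10012"};
    Address a = Address.fromFields(record);
    System.out.println(a.name + " is in " + a.city + ", " + a.state);
  }
}
```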

Example 17-1 shows the code for such a ContentHandler . The full package-qualified name is com.macfaq.net.www.content.text.tab_separated_values . This unusual class name follows the naming convention for a content handler for the MIME type text/tab-separated-values . Since MIME types often contain hyphens, as in this example, a convention exists to replace these with the underscore (_). Thus text/tab-separated-values becomes text.tab_separated_values . To install this content handler, all that's needed is to put the compiled .class file somewhere the class loader can find it and set the java.content.handler.pkgs property to com.macfaq.net.www.content .
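The naming convention is mechanical enough to write down as code. The sketch below covers only the two substitutions described here, slash to period and hyphen to underscore; handlerClassName( ) is a hypothetical helper, and the lookup Java actually performs may treat other characters differently.

```java
public class HandlerNameSketch {

  // Hypothetical helper: builds the class name a content handler for
  // mimeType would need, given a package prefix from the property.
  static String handlerClassName(String packagePrefix, String mimeType) {
    String mapped = mimeType.replace('/', '.').replace('-', '_');
    return packagePrefix + "." + mapped;
  }

  public static void main(String[] args) {
    System.out.println(handlerClassName(
        "com.macfaq.net.www.content", "text/tab-separated-values"));
  }
}
```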

Example 17-1. A ContentHandler for text/tab-separated-values
 package com.macfaq.net.www.content.text;

 import java.net.*;
 import java.io.*;
 import java.util.*;
 import com.macfaq.io.SafeBufferedReader;  // From Chapter 4

 public class tab_separated_values extends ContentHandler {

   // Returns a Vector of String arrays, one array per line of input
   public Object getContent(URLConnection uc) throws IOException {

     String theLine;
     Vector lines = new Vector( );
     InputStreamReader isr = new InputStreamReader(uc.getInputStream( ));
     SafeBufferedReader in = new SafeBufferedReader(isr);
     while ((theLine = in.readLine( )) != null) {
       String[] linearray = lineToArray(theLine);
       lines.addElement(linearray);
     }
     return lines;

   }

   // Splits one line at each tab, keeping empty fields
   private String[] lineToArray(String line) {

     int numFields = 1;
     for (int i = 0; i < line.length( ); i++) {
       if (line.charAt(i) == '\t') numFields++;
     }
     String[] fields = new String[numFields];
     int position = 0;
     for (int i = 0; i < numFields; i++) {
       StringBuffer buffer = new StringBuffer( );
       while (position < line.length( ) && line.charAt(position) != '\t') {
         buffer.append(line.charAt(position));
         position++;
       }
       fields[i] = buffer.toString( );
       position++;  // skip the tab
     }
     return fields;

   }

 }

Example 17-1 has two methods. The private utility method lineToArray( ) converts a tab-separated string into an array of strings. This method is for the private use of this subclass and is not required by the ContentHandler class. The more complicated the content you're trying to parse, the more such methods your class will need. The lineToArray( ) method begins by counting the number of tabs in the string and sets the numFields variable to one more than that count. An array of length numFields is created for the fields, a for loop fills the array with the strings between the tabs, and this array is returned.

You may have expected a StringTokenizer to split the line into parts. However, that class has unusual ideas about what makes up a token. In particular, it interprets multiple tabs in a row as a single delimiter. That is, it never returns an empty string as a token, so empty fields would silently disappear from the record.
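A quick experiment shows the difference. The sketch below counts the fields in a line whose middle field is empty, first with StringTokenizer and then with String.split( ) using a negative limit, which keeps empty strings. The split( ) call is offered only as a point of comparison; Example 17-1 uses its own loop instead.

```java
import java.util.StringTokenizer;

public class EmptyFieldSketch {

  // Counts fields the way StringTokenizer sees them.
  static int countWithTokenizer(String line) {
    return new StringTokenizer(line, "\t").countTokens();
  }

  // Counts fields with split(); the limit of -1 keeps empty fields,
  // including trailing ones.
  static int countWithSplit(String line) {
    return line.split("\t", -1).length;
  }

  public static void main(String[] args) {
    String line = "JPE Associates\t\tNew York";  // middle field is empty
    System.out.println("StringTokenizer: " + countWithTokenizer(line));
    System.out.println("split:           " + countWithSplit(line));
  }
}
```

For this line, the tokenizer reports two fields and split( ) reports three. Only the second answer keeps the columns aligned from one record to the next.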

The getContent( ) method starts by instantiating a Vector. Then it gets the InputStream from the URLConnection uc and chains this to an InputStreamReader, which is in turn chained to the SafeBufferedReader (introduced in Chapter 4) so getContent( ) can read the data one line at a time in a while loop. Each line is fed to the lineToArray( ) method, which splits it into a String array. This array is then added to the Vector. When no more lines are left, the loop exits and the Vector is returned.

17.2.2 Using Content Handlers

Now that you've written your first ContentHandler , let's see how to use it in a program. Files of MIME type text/tab-separated-values can be served by gopher servers, HTTP servers, FTP servers, and more. Let's assume you're retrieving a tab-separated-values file from an HTTP server. The filename should end with the .tsv or .tab extension so that the server knows it's a text/tab-separated-values file.

Not all servers are configured to support this type out of the box. Consult your server documentation to see how to set up a MIME-type mapping for your server. For instance, to configure my Apache server, I added these lines to my .htaccess file:

 AddType text/tab-separated-values tab
 AddType text/tab-separated-values tsv

You can test the web server configuration by connecting to port 80 of the web server with Telnet and requesting the file manually:

 % telnet www.ibiblio.org 80
 Trying
 Connected to www.ibiblio.org.
 Escape character is '^]'.
 GET /javafaq/addresses.tab HTTP/1.0

 HTTP/1.0 200 OK
 Date: Mon, 15 Nov 1999 18:36:51 GMT
 Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
 Last-Modified: Thu, 04 Nov 1999 18:22:51 GMT
 Content-type: text/tab-separated-values
 Content-length: 163

 JPE Associates 341 Lafayette Street, Suite 1025 New York NY 10012
 O'Reilly & Associates 103 Morris Street, Suite A Sebastopol CA 95472
 Connection closed by foreign host.

You're looking for a line that says Content-type: text/tab-separated-values. If you see a Content-type of text/plain, application/octet-stream, or some other value, or you don't see any Content-type at all, the server is misconfigured and must be fixed before you continue.
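If you'd rather not type HTTP by hand, much the same check can be made from Java. This is a sketch rather than a replacement for the Telnet test: ContentTypeCheck and isTsv( ) are hypothetical names, and getContentType( ) reports whatever type the URLConnection settles on, which normally comes from the Content-type header.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

public class ContentTypeCheck {

  static final String TSV = "text/tab-separated-values";

  // True if the value names the TSV type, ignoring case and any
  // parameters such as "; charset=UTF-8".
  static boolean isTsv(String contentType) {
    if (contentType == null) return false;
    int semicolon = contentType.indexOf(';');
    String type = semicolon < 0
        ? contentType : contentType.substring(0, semicolon);
    return type.trim().equalsIgnoreCase(TSV);
  }

  // Pass one or more URLs on the command line.
  public static void main(String[] args) throws IOException {
    for (String arg : args) {
      URLConnection uc = URI.create(arg).toURL().openConnection();
      String type = uc.getContentType();
      System.out.println(arg + ": " + type
          + (isTsv(type) ? " (OK)" : " (server needs a MIME mapping)"));
    }
  }
}
```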

The application that uses the tab-separated-values content handler does not need to know about it explicitly. It simply has to call the getContent( ) method of URL or URLConnection on a URL with a matching MIME type. Furthermore, the package where the content handler can be found has to be listed in the java.content.handler.pkgs property.

Example 17-2 is a class that downloads and prints a text/tab-separated-values file using the ContentHandler of Example 17-1. Note that it does not import com.macfaq.net.www.content.text and never references the tab_separated_values class. The commented-out lines at the top of main( ) show how to add com.macfaq.net.www.content to the java.content.handler.pkgs property from inside the program, which is the simplest way to make sure a standalone program works. They can stay commented out as long as the property is set in a property file or from the command line.

Example 17-2. The tab-separated-values ContentTester class
 import java.io.*;
 import java.net.*;
 import java.util.*;

 public class TSVContentTester {

   private static void test(URL u) throws IOException {

     Object content = u.getContent( );
     Vector v = (Vector) content;
     for (Enumeration e = v.elements( ) ; e.hasMoreElements( ) ;) {
       String[] sa = (String[]) e.nextElement( );
       for (int i = 0; i < sa.length; i++) {
         System.out.print(sa[i] + "\t");
       }
       System.out.println( );
     }

   }

   public static void main (String[] args) {

     // If you uncomment these lines, then you don't have to
     // set the java.content.handler.pkgs property from the
     // command line or your properties files.
 /*  String pkgs = System.getProperty("java.content.handler.pkgs", "");
     if (!pkgs.equals("")) {
       pkgs = pkgs + "|";  // package prefixes are separated by vertical bars
     }
     pkgs += "com.macfaq.net.www.content";
     System.setProperty("java.content.handler.pkgs", pkgs);
 */
     for (int i = 0; i < args.length; i++) {
       try {
         URL u = new URL(args[i]);
         test(u);
       }
       catch (MalformedURLException ex) {
         System.err.println(args[i] + " is not a good URL");
       }
       catch (Exception ex) {
         ex.printStackTrace( );
       }
     }

   }

 }

Here's how you run this program. The arrows indicate tabs:

 % java -Djava.content.handler.pkgs=com.macfaq.net.www.content \
   TSVContentTester http://www.ibiblio.org/javafaq/addresses.tab
 JPE Associates→341 Lafayette Street, Suite 1025→New York→NY→10012
 O'Reilly & Associates→103 Morris Street, Suite A→Sebastopol→CA→95472
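If you set the property from code rather than from the command line, take care not to clobber package prefixes that are already there. The java.content.handler.pkgs property holds a list of prefixes separated by vertical bars, so a small helper can append to it. appendPackage( ) is a hypothetical name used only in this sketch.

```java
public class HandlerPackagesSketch {

  static final String PROPERTY = "java.content.handler.pkgs";

  // Hypothetical helper: returns the existing list with prefix added
  // at the end, using "|" as the separator.
  static String appendPackage(String existing, String prefix) {
    if (existing == null || existing.isEmpty()) return prefix;
    return existing + "|" + prefix;
  }

  public static void main(String[] args) {
    String current = System.getProperty(PROPERTY, "");
    System.setProperty(PROPERTY,
        appendPackage(current, "com.macfaq.net.www.content"));
    System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
  }
}
```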

17.2.3 Choosing Return Types

There is one overloaded variant of the getContent( ) method in the ContentHandler class:

 public Object getContent(URLConnection uc, Class[] classes) // Java 1.3
  throws IOException

The difference is the array of java.lang.Class objects passed as the second argument. This allows the caller to request that the content be returned as one of the types in the array and enables content handlers to support multiple types. For example, the text/tab-separated-values content handler could return data as a Vector , an array, a string, or an InputStream . One would be the default used by the single argument getContent( ) method, while the others would be options that a client could request. If the client doesn't request any of the classes this ContentHandler knows how to provide, it returns null .

To call this method, the client invokes the method with the same arguments in a URL or URLConnection object. It passes an array of Class objects in the order it wishes to receive the data. Thus, if it prefers to receive a String but is willing to accept an InputStream and will take a Vector as a last resort, it puts String.class in the zeroth component of the array, InputStream.class in the first component of the array, and Vector.class in the last component of the array. Then it uses instanceof to test what was actually returned and either processes it or converts it into the preferred type. For example:

 Class[] requestedTypes = {String.class, InputStream.class,
   Vector.class};
 Object content = url.getContent(requestedTypes);
 if (content instanceof String) {
   String s = (String) content;
   System.out.println(s);
 }
 else if (content instanceof InputStream) {
   InputStream in = (InputStream) content;
   int c;
   while ((c = in.read( )) != -1) System.out.write(c);
 }
 else if (content instanceof Vector) {
   Vector v = (Vector) content;
   for (Enumeration e = v.elements( ) ; e.hasMoreElements( ) ;) {
     String[] sa = (String[]) e.nextElement( );
     for (int i = 0; i < sa.length; i++) {
       System.out.print(sa[i] + "\t");
     }
     System.out.println( );
   }
 }
 else if (content == null) {
   // None of the requested types was available
   System.out.println("No requested content type is available");
 }
 else {
   System.out.println("Unrecognized content type " + content.getClass( ));
 }

To demonstrate this, let's write a content handler that can be used in association with the time protocol. Recall that the time protocol returns the current time at the server as a 4-byte, big-endian, unsigned integer giving the number of seconds since midnight, January 1, 1900, Greenwich Mean Time. There are several obvious candidates for storing this data in a Java content handler, including java.lang.Long ( java.lang.Integer won't work since the unsigned value may overflow the bounds of an int ), java.util.Date , java.util.Calendar , java.lang.String , and java.io.InputStream , which often works as a last resort. Example 17-3 provides all five options. There's no standard MIME type for the time format. We'll use application for the type to indicate that this is binary data and x-time for the subtype to indicate that this is a nonstandard extension type. It will be up to the time protocol handler to return the right content type.

Example 17-3. A time content handler
 package com.macfaq.net.www.content.application;

 import java.net.*;
 import java.io.*;
 import java.util.*;

 public class x_time extends ContentHandler {

   // A Date is the default return type
   public Object getContent(URLConnection uc) throws IOException {

     Class[] classes = new Class[1];
     classes[0] = Date.class;
     return this.getContent(uc, classes);

   }

   public Object getContent(URLConnection uc, Class[] classes)
    throws IOException {

     InputStream in = uc.getInputStream( );
     for (int i = 0; i < classes.length; i++) {
       if (classes[i] == InputStream.class) {
         return in;
       }
       else if (classes[i] == Long.class) {
         long secondsSince1900 = readSecondsSince1900(in);
         return new Long(secondsSince1900);
       }
       else if (classes[i] == Date.class) {
         long secondsSince1900 = readSecondsSince1900(in);
         Date time = shiftEpochs(secondsSince1900);
         return time;
       }
       else if (classes[i] == Calendar.class) {
         long secondsSince1900 = readSecondsSince1900(in);
         Date time = shiftEpochs(secondsSince1900);
         Calendar c = Calendar.getInstance( );
         c.setTime(time);
         return c;
       }
       else if (classes[i] == String.class) {
         long secondsSince1900 = readSecondsSince1900(in);
         Date time = shiftEpochs(secondsSince1900);
         return time.toString( );
       }
     }

     return null; // no requested type available

   }

   private long readSecondsSince1900(InputStream in)
    throws IOException {

     long secondsSince1900 = 0;
     for (int j = 0; j < 4; j++) {
       // Shift in the next byte of the big-endian value
       secondsSince1900 = (secondsSince1900 << 8) | in.read( );
     }
     return secondsSince1900;

   }

   private Date shiftEpochs(long secondsSince1900) {

     // The time protocol sets the epoch at 1900, the Java Date class
     // at 1970. This number converts between them.
     long differenceBetweenEpochs = 2208988800L;

     long secondsSince1970 = secondsSince1900 - differenceBetweenEpochs;
     long msSince1970 = secondsSince1970 * 1000;
     Date time = new Date(msSince1970);
     return time;

   }

 }

Most of the work is performed by the second getContent() method, which checks to see whether it recognizes any of the classes in the classes array. If so, it attempts to convert the content into an object of that type. The for loop is arranged so that classes earlier in the array take precedence; that is, it first tries to match the first class in the array; next it tries to match the second class in the array; then the third class in the array; and so on. As soon as one class is matched, the method returns so later classes won't be matched even if they're an allowed choice.
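The precedence rule can be isolated from the conversions. In this sketch, SUPPORTED lists the five types in the order the handler happens to test them, and choose( ) is a hypothetical method that walks the caller's array just as the for loop does. The order of SUPPORTED has no effect on the answer; only the order of the requested array matters.

```java
import java.io.InputStream;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

public class PrecedenceSketch {

  // The types the handler knows how to produce, in its own test order.
  static final List<Class<?>> SUPPORTED = Arrays.<Class<?>>asList(
      InputStream.class, Long.class, Date.class,
      Calendar.class, String.class);

  // Returns the first requested class the handler supports,
  // or null if it supports none of them.
  static Class<?> choose(Class<?>[] requested) {
    for (Class<?> wanted : requested) {
      if (SUPPORTED.contains(wanted)) return wanted;
    }
    return null;
  }

  public static void main(String[] args) {
    Class<?>[] types = {String.class, Date.class,
        Calendar.class, Long.class};
    System.out.println(choose(types));
  }
}
```

Here String wins because it comes first in the request, even though it is the last type the handler tests for.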

Once a type is matched, a simple algorithm converts the four bytes that the time server sends into the right kind of object, either an InputStream , a Long , a Date , a Calendar , or a String . The InputStream conversion is trivial. The Long conversion is one of those times when it seems a little inconvenient that primitive data types aren't objects. Although you can convert to and return any object type, you can't convert to and return a primitive data type like long , so we return the type wrapper class Long instead. The Date and Calendar conversions require shifting the origin of the time from January 1, 1900 to January 1, 1970 and changing the units from seconds to milliseconds , as discussed in Chapter 9. Finally, the conversion to a String simply converts to a Date and then invokes the Date object's toString( ) method.
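The arithmetic is easy to check by hand. The four bytes 0x83 0xAA 0x7E 0x80 form the big-endian value 2,208,988,800, which is exactly the number of seconds between the 1900 and 1970 epochs, so they should decode to a Date whose millisecond value is zero. The sketch below works on an in-memory byte array. The class name, the toDate( ) helper, and the check for a stream that ends early are additions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Date;

public class TimeDecodeSketch {

  // Seconds from midnight, January 1, 1900 to midnight, January 1, 1970
  static final long SECONDS_1900_TO_1970 = 2208988800L;

  // Reads a 4-byte, big-endian, unsigned integer into a long.
  static long readSecondsSince1900(InputStream in) throws IOException {
    long seconds = 0;
    for (int i = 0; i < 4; i++) {
      int b = in.read();
      // Added check: fail clearly if the stream ends early
      if (b == -1) throw new EOFException("fewer than 4 bytes received");
      seconds = (seconds << 8) | b;
    }
    return seconds;
  }

  // Hypothetical helper: shifts the epoch and converts to milliseconds.
  static Date toDate(byte[] fourBytes) {
    try {
      long secondsSince1900 =
          readSecondsSince1900(new ByteArrayInputStream(fourBytes));
      long secondsSince1970 = secondsSince1900 - SECONDS_1900_TO_1970;
      return new Date(secondsSince1970 * 1000L);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    byte[] epoch = {(byte) 0x83, (byte) 0xAA, 0x7E, (byte) 0x80};
    System.out.println(toDate(epoch).getTime());  // prints 0
  }
}
```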

While it would be possible to configure a web server to send data of MIME type application/x-time , this class is really designed to be used by a custom protocol handler. This handler would know not only how to speak the time protocol, but also how to return application/x-time from the getContentType( ) method. Example 17-4 and Example 17-5 demonstrate such a protocol handler. It assumes that time URLs look like time://vision.poly.edu:3737/ .

Example 17-4. The URLConnection for the time protocol handler
 package com.macfaq.net.www.protocol.time;

 import java.net.*;
 import java.io.*;
 import com.macfaq.net.www.content.application.*;

 public class TimeURLConnection extends URLConnection {

   private Socket connection = null;
   public final static int DEFAULT_PORT = 37;

   public TimeURLConnection (URL u) {
     super(u);
   }

   public String getContentType( ) {
     return "application/x-time";
   }

   public Object getContent( ) throws IOException {
     ContentHandler ch = new x_time( );
     return ch.getContent(this);
   }

   public Object getContent(Class[] classes) throws IOException {
     ContentHandler ch = new x_time( );
     return ch.getContent(this, classes);
   }

   public InputStream getInputStream( ) throws IOException {
     if (!connected) this.connect( );
     return this.connection.getInputStream( );
   }

   public synchronized void connect( ) throws IOException {

     if (!connected) {
       int port = url.getPort( );
       if ( port < 0) {
         port = DEFAULT_PORT;
       }
       this.connection = new Socket(url.getHost( ), port);
       this.connected = true;
     }

   }

 }

In general, it should be enough for the protocol handler to simply know or be able to deduce the correct MIME content type. However, in a case like this, where both content and protocol handlers must be provided, you can tie them a little more closely together by overriding getContent( ) as well. This allows you to avoid messing with the java.content.handler.pkgs property or installing a ContentHandlerFactory. You will still need to set the java.protocol.handler.pkgs property to point to your package or install a URLStreamHandlerFactory, however. Example 17-5 is a simple URLStreamHandler for the time protocol handler.

Example 17-5. The URLStreamHandler for the time protocol handler
 package com.macfaq.net.www.protocol.time;

 import java.net.*;
 import java.io.*;

 public class Handler extends URLStreamHandler {

   protected URLConnection openConnection(URL u) throws IOException {
     return new TimeURLConnection(u);
   }

 }
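For comparison, here is a rough sketch of the factory route. To keep it self-contained and runnable without a time server, it uses CannedTimeConnection, a hypothetical stand-in for TimeURLConnection that serves four fixed bytes instead of opening a socket. TimeHandlerFactory and contentTypeOf( ) are likewise names invented for this illustration. Keep in mind that URL.setURLStreamHandlerFactory( ) can be called only once per virtual machine.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class TimeFactorySketch {

  // Hypothetical stand-in for TimeURLConnection: no socket is opened.
  static class CannedTimeConnection extends URLConnection {
    CannedTimeConnection(URL u) {
      super(u);
    }

    @Override
    public void connect() {
      connected = true;
    }

    @Override
    public String getContentType() {
      return "application/x-time";
    }

    @Override
    public InputStream getInputStream() {
      return new ByteArrayInputStream(
          new byte[] {(byte) 0x83, (byte) 0xAA, 0x7E, (byte) 0x80});
    }
  }

  // Supplies a handler for time URLs and declines everything else.
  static class TimeHandlerFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if (!"time".equals(protocol)) return null;  // fall back to defaults
      return new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) {
          return new CannedTimeConnection(u);
        }

        @Override
        protected int getDefaultPort() {
          return 37;
        }
      };
    }
  }

  private static boolean installed = false;

  // Installs the factory on first use, then asks the resulting
  // connection for its content type.
  static synchronized String contentTypeOf(String url) {
    try {
      if (!installed) {
        URL.setURLStreamHandlerFactory(new TimeHandlerFactory());
        installed = true;
      }
      return URI.create(url).toURL().openConnection().getContentType();
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(contentTypeOf("time://localhost/"));
  }
}
```

With the factory installed, no system property is needed: any time URL created afterward is routed to the handler the factory returns.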

We could install the time protocol handler into HotJava as we did with protocol handlers in the previous chapter. However, even if we place the time content handler in HotJava's class path, HotJava won't use it. Consequently, I've written a simple standalone application, shown in Example 17-6, that uses these protocol and content handlers to tell the time. Notice that it does not need to import or directly refer to any of the classes involved. It simply lets the URL find the right content handler.

Example 17-6. URLTimeClient
 import java.net.*;
 import java.util.*;
 import java.io.*;

 public class URLTimeClient {

   public static void main(String[] args) {

     System.setProperty("java.protocol.handler.pkgs",
       "com.macfaq.net.www.protocol");

     try {
       // You can replace this with your own time server
       URL u = new URL("time://tock.usno.navy.mil/");
       Class[] types = {String.class, Date.class,
         Calendar.class, Long.class};
       Object o = u.getContent(types);
       System.out.println(o);
     }
     catch (IOException ex) {
       // Let's see what went wrong
       ex.printStackTrace( );
     }

   }

 }

Here's a sample run:

 D:\JAVA\JNP3\examples> java URLTimeClient
 Mon Aug 23 21:30:34 EDT 2004

In this case, a String object was returned. This was the first choice of URLTimeClient but the last choice of the content handler. The client choice always takes precedence.

Java Network Programming, Third Edition
ISBN: 0596007213
Year: 2003
Pages: 164
