15.3 Reading the Header | Java Network Programming, Third Edition

HTTP servers provide a substantial amount of information in the header that precedes each response. For example, here's a typical HTTP header returned by an Apache web server:

    HTTP/1.1 200 OK
    Date: Mon, 18 Oct 1999 20:06:48 GMT
    Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
    Last-Modified: Mon, 18 Oct 1999 12:58:21 GMT
    ETag: "1e05f2-89bb-380b196d"
    Accept-Ranges: bytes
    Content-Length: 35259
    Connection: close
    Content-Type: text/html

There's a lot of information there. In general, an HTTP header may include the content type of the requested document, the length of the document in bytes, the character set in which the content is encoded, the date and time, the date the content expires, and the date the content was last modified. However, the information depends on the server; some servers send all this information for each request, others send some information, and a few don't send anything. The methods of this section allow you to query a URLConnection to find out what metadata the server has provided.

Aside from HTTP, very few protocols use MIME headers (and technically speaking, even the HTTP header isn't actually a MIME header; it just looks a lot like one). When writing your own subclass of URLConnection, it is often necessary to override these methods so that they return sensible values. The most important piece of information you may be lacking is the MIME content type. URLConnection provides some utility methods that guess the data's content type based on its filename or the first few bytes of the data itself.
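As a rough illustration of those utility methods, the sketch below calls the two static guessing methods of java.net.URLConnection, guessContentTypeFromName( ) and guessContentTypeFromStream( ). The class name ContentTypeGuesser and its helper methods fromName( ) and fromBytes( ) are hypothetical names chosen for this example, and the sample filename and GIF signature bytes are assumptions rather than data from a real server.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

// Hypothetical helper class; only the URLConnection calls are standard API.
public class ContentTypeGuesser {

  // Guess from the filename extension alone; returns null if unrecognized.
  public static String fromName(String filename) {
    return URLConnection.guessContentTypeFromName(filename);
  }

  // Guess from the leading bytes of the data. The stream handed to
  // guessContentTypeFromStream() must support mark/reset, which
  // ByteArrayInputStream does.
  public static String fromBytes(byte[] firstBytes) {
    try (InputStream in = new ByteArrayInputStream(firstBytes)) {
      return URLConnection.guessContentTypeFromStream(in);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    // Assumed sample inputs: an HTML filename and the start of a GIF file
    System.out.println(fromName("index.html"));
    byte[] gifSignature = {'G', 'I', 'F', '8', '9', 'a'};
    System.out.println(fromBytes(gifSignature));
  }
}
```

Both guesses are heuristics. A protocol handler would normally fall back on them only when the server supplies no content type of its own.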

15.3.1 Retrieving Specific Header Fields

The first six methods request specific, particularly common fields from the header. These are:

Content-type
Content-length
Content-encoding
Date
Last-modified
Expires

15.3.1.1 public String getContentType( )

This method returns the MIME content type of the data. It relies on the web server to send a valid content type. (In a later section, we'll see how recalcitrant servers are handled.) It throws no exceptions and returns null if the content type isn't available. text/html will be the most common content type you'll encounter when connecting to web servers. Other commonly used types include text/plain, image/gif, application/xml, and image/jpeg.

If the content type is some form of text, then this header may also contain a character set part identifying the document's character encoding. For example:

 Content-type: text/html; charset=UTF-8

Or:

 Content-Type: text/xml; charset=iso-2022-jp

In this case, getContentType( ) returns the full value of the Content-type field, including the character encoding. We can use this to improve on Example 15-1 by using the encoding specified in the HTTP header to decode the document, or ISO-8859-1 (the HTTP default) if no such encoding is specified. If a nontext type is encountered, an exception is thrown. Example 15-2 demonstrates:

Example 15-2. Download a web page with the correct character set

    import java.net.*;
    import java.io.*;

    public class EncodingAwareSourceViewer {

      public static void main (String[] args) {

        for (int i = 0; i < args.length; i++) {
          try {
            // set default encoding
            String encoding = "ISO-8859-1";
            URL u = new URL(args[i]);
            URLConnection uc = u.openConnection( );
            String contentType = uc.getContentType( );
            // getContentType( ) returns null if no content type is available
            int encodingStart = (contentType == null)
             ? -1 : contentType.indexOf("charset=");
            if (encodingStart != -1) {
              encoding = contentType.substring(encodingStart+8);
            }
            InputStream in = new BufferedInputStream(uc.getInputStream( ));
            Reader r = new InputStreamReader(in, encoding);
            int c;
            while ((c = r.read( )) != -1) {
              System.out.print((char) c);
            }
            r.close( );
          }
          catch (MalformedURLException ex) {
            System.err.println(args[i] + " is not a parseable URL");
          }
          catch (IOException ex) {
            System.err.println(ex);
          }
        } // end for
      } // end main
    } // end EncodingAwareSourceViewer

In practice, most servers don't include charset information in their Content-type headers, so this is of limited use.
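When a charset is present, the value may carry quotes, mixed case, or further parameters after it, all of which a plain substring( ) call would pass straight to InputStreamReader. The following is a minimal sketch of a more defensive extraction, assuming a hypothetical helper class named CharsetSniffer with a method charsetOf( ); neither name comes from the java.net API. It falls back to a caller-supplied default whenever the header is missing, lacks a charset parameter, or names a charset this Java runtime does not support.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper; the names here are illustrative only.
public class CharsetSniffer {

  // Returns the charset named in a Content-type value, or fallback when
  // the value is null, has no charset parameter, or names a charset
  // that is illegal or unsupported on this runtime.
  public static String charsetOf(String contentType, String fallback) {
    if (contentType == null) return fallback;
    for (String param : contentType.split(";")) {
      String p = param.trim();
      if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
        String name = p.substring("charset=".length()).trim();
        // Strip optional surrounding quotes, as in charset="utf-8"
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
          name = name.substring(1, name.length() - 1);
        }
        try {
          if (Charset.isSupported(name)) return name;
        } catch (IllegalArgumentException ex) {
          // Illegal charset name; fall through to the fallback
        }
        return fallback;
      }
    }
    return fallback;
  }

  public static void main(String[] args) {
    System.out.println(charsetOf("text/html; charset=UTF-8", "ISO-8859-1"));
    System.out.println(charsetOf("text/html", "ISO-8859-1"));
  }
}
```

In Example 15-2, a call such as charsetOf(uc.getContentType( ), "ISO-8859-1") could stand in for the indexOf( ) and substring( ) logic.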

15.3.1.2 public int getContentLength( )

The getContentLength() method tells you how many bytes there are in the content. Many servers send Content-length headers only when they're transferring a binary file, not when transferring a text file. If there is no Content-length header, getContentLength() returns -1. The method throws no exceptions. It is used when you need to know exactly how many bytes to read or when you need to create a buffer large enough to hold the data in advance.

In Chapter 7, we discussed how to use the openStream( ) method of the URL class to download text files from an HTTP server. Although in theory you should be able to use the same method to download a binary file, such as a GIF image or a .class byte code file, in practice this procedure presents a problem. HTTP servers don't always close the connection exactly where the data is finished; therefore, you don't know when to stop reading. To download a binary file, it is more reliable to use a URLConnection's getContentLength( ) method to find the file's length, then read exactly the number of bytes indicated. Example 15-3 is a program that uses this technique to save a binary file on a disk.

Example 15-3. Downloading a binary file from a web site and saving it to disk

    import java.net.*;
    import java.io.*;

    public class BinarySaver {

      public static void main (String args[]) {

        for (int i = 0; i < args.length; i++) {
          try {
            URL root = new URL(args[i]);
            saveBinaryFile(root);
          }
          catch (MalformedURLException ex) {
            System.err.println(args[i] + " is not a URL I understand.");
          }
          catch (IOException ex) {
            System.err.println(ex);
          }
        } // end for
      } // end main

      public static void saveBinaryFile(URL u) throws IOException {

        URLConnection uc = u.openConnection( );
        String contentType = uc.getContentType( );
        int contentLength = uc.getContentLength( );
        // reject text documents and responses of unknown type or length
        if (contentType == null || contentType.startsWith("text/")
         || contentLength == -1) {
          throw new IOException("This is not a binary file.");
        }

        InputStream raw = uc.getInputStream( );
        InputStream in  = new BufferedInputStream(raw);
        byte[] data = new byte[contentLength];
        int bytesRead = 0;
        int offset = 0;
        // read( ) may return fewer bytes than requested, so keep going
        while (offset < contentLength) {
          bytesRead = in.read(data, offset, data.length-offset);
          if (bytesRead == -1) break;
          offset += bytesRead;
        }
        in.close( );

        if (offset != contentLength) {
          throw new IOException("Only read " + offset
           + " bytes; Expected " + contentLength + " bytes");
        }

        String filename = u.getFile( );
        filename = filename.substring(filename.lastIndexOf('/') + 1);
        FileOutputStream fout = new FileOutputStream(filename);
        fout.write(data);
        fout.flush( );
        fout.close( );
      }
    } // end BinarySaver

As usual, the main( ) method loops over the URLs entered on the command line, passing each URL to the saveBinaryFile( ) method. saveBinaryFile( ) opens a URLConnection uc to the URL. It puts the type into the variable contentType and the content length into the variable contentLength. Next, an if statement checks whether the content type is text or the Content-length field is missing or invalid (contentLength == -1). If either of these is true, an IOException is thrown. If these assertions are both false, we have a binary file of known length: that's what we want.

Now that we have a genuine binary file on our hands, we prepare to read it into an array of bytes called data. data is initialized to the number of bytes required to hold the binary object, contentLength. Ideally, you would like to fill data with a single call to read( ), but you probably won't get all the bytes at once, so the read is placed in a loop. The number of bytes read up to this point is accumulated into the offset variable, which also keeps track of the location in the data array at which to start placing the data retrieved by the next call to read( ). The loop continues until offset equals or exceeds contentLength; that is, the array has been filled with the expected number of bytes. We also break out of the while loop if read( ) returns -1, indicating an unexpected end of stream. The offset variable now contains the total number of bytes read, which should be equal to the content length. If they are not equal, an error has occurred, so saveBinaryFile( ) throws an IOException. This is the general procedure for reading binary files from HTTP connections.

Now we are ready to save the data in a file. saveBinaryFile( ) gets the filename from the URL using the getFile( ) method and strips any path information by calling filename.substring(filename.lastIndexOf('/') + 1). A new FileOutputStream fout is opened into this file and the data is written in one large burst with fout.write(data).

15.3.1.3 public String getContentEncoding( )

This method returns a String that tells you how the content is encoded. If the content is sent unencoded (as is commonly the case with HTTP servers), this method returns null. It throws no exceptions. The most commonly used content encoding on the Web is probably x-gzip, which can be straightforwardly decoded using a java.util.zip.GZIPInputStream.
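As a sketch of that decoding step, the helper below wraps a stream in a GZIPInputStream when the reported content encoding is gzip or x-gzip and otherwise returns the stream untouched. The class name ContentDecoder and its methods decode( ) and roundTrip( ) are hypothetical; roundTrip( ) exists only so the idea can be tried without a network connection, by compressing a string in memory and reading it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; only the java.util.zip classes are standard API.
public class ContentDecoder {

  // Wraps raw in a GZIPInputStream when the encoding is gzip or x-gzip;
  // otherwise returns raw unchanged (including when the encoding is null).
  public static InputStream decode(InputStream raw, String contentEncoding)
      throws IOException {
    if (contentEncoding != null) {
      String enc = contentEncoding.trim();
      if (enc.equalsIgnoreCase("gzip") || enc.equalsIgnoreCase("x-gzip")) {
        return new GZIPInputStream(raw);
      }
    }
    return raw;
  }

  // Offline self-check: gzip a string in memory, then read it back
  // through decode() using the given content encoding label.
  public static String roundTrip(String text, String contentEncoding) {
    try {
      ByteArrayOutputStream compressed = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
        gz.write(text.getBytes(StandardCharsets.UTF_8));
      }
      ByteArrayOutputStream plain = new ByteArrayOutputStream();
      try (InputStream in = decode(
          new ByteArrayInputStream(compressed.toByteArray()), contentEncoding)) {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
          plain.write(buffer, 0, n);
        }
      }
      return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("Hello, header", "x-gzip"));
  }
}
```

With a live connection uc, the equivalent call would be decode(uc.getInputStream( ), uc.getContentEncoding( )).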

The content encoding is not the same as the character encoding. The character encoding is determined by the Content-type header or information internal to the document, and specifies how characters are represented as bytes. Content encoding specifies how those bytes are in turn encoded as other bytes.

When subclassing URLConnection, override this method if you expect to be dealing with encoded data, as might be the case for an NNTP or SMTP protocol handler; in these applications, many different encoding schemes, such as BinHex and uuencode, are used to pass eight-bit binary data through a seven-bit ASCII connection.

15.3.1.4 public long getDate( )

The getDate( ) method returns a long that tells you when the document was sent, in milliseconds since midnight, Greenwich Mean Time (GMT), January 1, 1970. You can convert it to a java.util.Date. For example:

 Date documentSent = new Date(uc.getDate( ));

This is the time the document was sent as seen from the server; it may not agree with the time on your local machine. If the HTTP header does not include a Date field, getDate( ) returns 0.

15.3.1.5 public long getExpiration( )

Some documents have server-based expiration dates that indicate when the document should be deleted from the cache and reloaded from the server. getExpiration( ) is very similar to getDate( ), differing only in how the return value is interpreted. It returns a long indicating the number of milliseconds after 12:00 A.M., GMT, January 1, 1970, at which point the document expires. If the HTTP header does not include an Expires field, getExpiration( ) returns 0, which means 12:00 A.M., GMT, January 1, 1970. The only reasonable interpretation of this date is that the document does not expire and can remain in the cache indefinitely.
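The interpretation described above can be sketched as a small check. The class name ExpirationCheck and the method isExpired( ) are hypothetical names for this example. The sketch assumes the caller passes in the value returned by getExpiration( ) and the current time in milliseconds, and it treats 0 as "no expiration date was sent" rather than as a real date in 1970.

```java
// Hypothetical helper illustrating how a cache might read getExpiration().
public class ExpirationCheck {

  // expiration: the value returned by getExpiration(), in milliseconds
  // since the epoch. 0 means the server sent no Expires field, which is
  // treated here as "does not expire".
  public static boolean isExpired(long expiration, long now) {
    if (expiration == 0) return false;
    return expiration <= now;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(isExpired(0L, now));          // no Expires field
    System.out.println(isExpired(now - 1000L, now)); // one second in the past
    System.out.println(isExpired(now + 1000L, now)); // one second in the future
  }
}
```

Passing the current time as a parameter rather than reading the clock inside the method keeps the check easy to test.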

15.3.1.6 public long getLastModified( )

The final date method, getLastModified( ), returns the date on which the document was last modified. Again, the date is given as the number of milliseconds since midnight, GMT, January 1, 1970. If the HTTP header does not include a Last-modified field (and many don't), this method returns 0.

Example 15-4 reads URLs from the command line and uses these six methods to print their content type, content length, content encoding, date of last modification, expiration date, and current date.

Example 15-4. Return the header

    import java.net.*;
    import java.io.*;
    import java.util.*;

    public class HeaderViewer {

      public static void main(String args[]) {

        for (int i=0; i < args.length; i++) {
          try {
            URL u = new URL(args[i]);
            URLConnection uc = u.openConnection( );
            System.out.println("Content-type: " + uc.getContentType( ));
            System.out.println("Content-encoding: "
             + uc.getContentEncoding( ));
            System.out.println("Date: " + new Date(uc.getDate( )));
            System.out.println("Last modified: "
             + new Date(uc.getLastModified( )));
            System.out.println("Expiration date: "
             + new Date(uc.getExpiration( )));
            System.out.println("Content-length: " + uc.getContentLength( ));
          }  // end try
          catch (MalformedURLException ex) {
            System.err.println(args[i] + " is not a URL I understand");
          }
          catch (IOException ex) {
            System.err.println(ex);
          }
          System.out.println( );
        }  // end for
      }  // end main
    }  // end HeaderViewer

Here's the result when used to look at http://www.oreilly.com:

    % java HeaderViewer http://www.oreilly.com
    Content-type: text/html
    Content-encoding: null
    Date: Mon Oct 18 13:54:52 PDT 1999
    Last modified: Sat Oct 16 07:54:02 PDT 1999
    Expiration date: Wed Dec 31 16:00:00 PST 1969
    Content-length: -1

The content type of the file at http://www.oreilly.com is text/html. No content encoding was used. The file was sent on Monday, October 18, 1999 at 1:54 P.M., Pacific Daylight Time. It was last modified on Saturday, October 16, 1999 at 7:54 A.M., Pacific Daylight Time, and it expires on Wednesday, December 31, 1969 at 4:00 P.M., Pacific Standard Time. Did this document really expire some 30 years ago? No. Remember that what's being checked here is whether the copy in your cache is more recent than 4:00 P.M. PST, December 31, 1969. If it is, you don't need to reload it. More to the point, after adjusting for time zone differences, this date looks suspiciously like 12:00 A.M., Greenwich Mean Time, January 1, 1970, which happens to be the default if the server doesn't send an expiration date. (Most don't.)

Finally, the content length of -1 means that there was no Content-length header. Many servers don't bother to provide a Content-length header for text files. However, a Content-length header should always be sent for a binary file. Here's the HTTP header you get when you request the GIF image http://www.oreilly.com/graphics/space.gif. Now the server sends a Content-length header with a value of 57.

    % java HeaderViewer http://www.oreilly.com/graphics/space.gif
    Content-type: image/gif
    Content-encoding: null
    Date: Mon Oct 18 14:00:07 PDT 1999
    Last modified: Thu Jan 09 12:05:11 PST 1997
    Expiration date: Wed Dec 31 16:00:00 PST 1969
    Content-length: 57

15.3.2 Retrieving Arbitrary Header Fields

The last six methods requested specific fields from the header, but there's no theoretical limit to the number of header fields a message can contain. The next five methods inspect arbitrary fields in a header. Indeed, the methods of the last section are just thin wrappers over the methods discussed here; you can use these methods to get header fields that Java's designers did not plan for. If the requested header is found, it is returned. Otherwise, the method returns null.

15.3.2.1 public String getHeaderField(String name )

The getHeaderField( ) method returns the value of a named header field. The name of the header is not case-sensitive and does not include a closing colon. For example, to get the value of the Content-type and Content-encoding header fields of a URLConnection object uc, you could write:

    String contentType = uc.getHeaderField("content-type");
    String contentEncoding = uc.getHeaderField("content-encoding");

To get the Date, Content-length, or Expires headers, you'd do the same:

    String date = uc.getHeaderField("date");
    String expires = uc.getHeaderField("expires");
    String contentLength = uc.getHeaderField("Content-length");

These methods all return String, not int or long as the getContentLength( ), getExpiration( ), getLastModified( ), and getDate( ) methods of the last section did. If you're interested in a numeric value, convert the String to a long or an int.

Do not assume the value returned by getHeaderField() is valid. You must check to make sure it is non-null.
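A minimal sketch of that null check combined with the numeric conversion follows. The class name HeaderNumbers and the method toLong( ) are hypothetical; the caller chooses the fallback value, much as with the default argument of getHeaderFieldInt( ).

```java
// Hypothetical helper for turning a header value into a number safely.
public class HeaderNumbers {

  // Converts a header value such as the result of
  // uc.getHeaderField("content-length") to a long. Returns fallback
  // when the header is absent (null) or is not a valid integer.
  public static long toLong(String value, long fallback) {
    if (value == null) return fallback;
    try {
      return Long.parseLong(value.trim());
    } catch (NumberFormatException ex) {
      return fallback;
    }
  }

  public static void main(String[] args) {
    System.out.println(toLong("35259", -1)); // a well-formed value
    System.out.println(toLong(null, -1));    // header not present
    System.out.println(toLong("bytes", -1)); // not a number
  }
}
```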

15.3.2.2 public String getHeaderFieldKey(int n)

This method returns the key (that is, the field name: for example, Content-length or Server) of the nth header field. In HTTP, header zero is the status line of the response (for example, HTTP/1.1 200 OK) and has a null key. The first header is one. For example, to get the sixth key of the header of the URLConnection uc, you would write:

 String header6 = uc.getHeaderFieldKey(6);

15.3.2.3 public String getHeaderField(int n)

This method returns the value of the nth header field. In HTTP, the status line of the response is header field zero and the first actual header is one. Example 15-5 uses this method in conjunction with getHeaderFieldKey( ) to print the entire HTTP header.

Example 15-5. Print the entire HTTP header

    import java.net.*;
    import java.io.*;

    public class AllHeaders {

      public static void main(String args[]) {

        for (int i=0; i < args.length; i++) {
          try {
            URL u = new URL(args[i]);
            URLConnection uc = u.openConnection( );
            // start at 1 to skip header field zero, which has a null key
            for (int j = 1; ; j++) {
              String header = uc.getHeaderField(j);
              if (header == null) break;
              System.out.println(uc.getHeaderFieldKey(j) + ": " + header);
            }  // end for
          }  // end try
          catch (MalformedURLException ex) {
            System.err.println(args[i] + " is not a URL I understand.");
          }
          catch (IOException ex) {
            System.err.println(ex);
          }
          System.out.println( );
        }  // end for
      }  // end main
    }  // end AllHeaders

For example, here's the output when this program is run against http://www.oreilly.com:

    % java AllHeaders http://www.oreilly.com
    Server: WN/1.15.1
    Date: Mon, 18 Oct 1999 21:20:26 GMT
    Last-modified: Sat, 16 Oct 1999 14:54:02 GMT
    Content-type: text/html
    Title: www.oreilly.com -- Welcome to O'Reilly & Associates! -- computer books, software, online publishing
    Link: <mailto:webmaster@oreilly.com>; rev="Made"

Besides Date, Last-modified, and Content-type headers, this server also provides Server, Title, and Link headers. Other servers may have different sets of headers.
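URLConnection also has a getHeaderFields( ) method that returns every field at once as a Map from field name to a List of values, which avoids the index-based loop. The sketch below shows one way such a map might be printed. The class name HeaderMapPrinter and its methods are hypothetical, and sample( ) builds a hand-made map standing in for the result of uc.getHeaderFields( ) so that the example runs without a network. The null key for the status line is an assumption about how HTTP implementations commonly populate the map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; sample() is stand-in data, not real server output.
public class HeaderMapPrinter {

  // Turns a header map into "Name: value" lines. An entry with a null
  // key (assumed here to hold the status line) is rendered without a name.
  public static List<String> toLines(Map<String, List<String>> headers) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
      for (String value : entry.getValue()) {
        lines.add(entry.getKey() == null
            ? value
            : entry.getKey() + ": " + value);
      }
    }
    return lines;
  }

  // Hand-built stand-in for uc.getHeaderFields(), so no network is needed.
  public static Map<String, List<String>> sample() {
    Map<String, List<String>> headers = new LinkedHashMap<>();
    headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
    headers.put("Content-type", Arrays.asList("text/html"));
    headers.put("Server", Arrays.asList("WN/1.15.1"));
    return headers;
  }

  public static void main(String[] args) {
    for (String line : toLines(sample())) {
      System.out.println(line);
    }
  }
}
```

With a live connection, toLines(uc.getHeaderFields( )) would take the place of toLines(sample( )).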

15.3.2.4 public long getHeaderFieldDate(String name, long default)

This method first retrieves the header field specified by the name argument and tries to convert the string to a long that specifies the milliseconds since midnight, January 1, 1970, GMT. getHeaderFieldDate( ) can be used to retrieve a header field that represents a date: for example, the Expires, Date, or Last-modified headers. To convert the string to a long, getHeaderFieldDate( ) uses the parse( ) method of java.util.Date. The parse( ) method does a decent job of understanding and converting most common date formats, but it can be stumped; for instance, if you ask for a header field that contains something other than a date. If parse( ) doesn't understand the date or if getHeaderFieldDate( ) is unable to find the requested header field, getHeaderFieldDate( ) returns the default argument. For example:

    Date expires = new Date(uc.getHeaderFieldDate("expires", 0));
    long lastModified = uc.getHeaderFieldDate("last-modified", 0);
    Date now = new Date(uc.getHeaderFieldDate("date", 0));

You can use the methods of the java.util.Date class to convert the long to a String.
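On Java 8 and later, the java.time package offers another route for the same conversions. The sketch below is an alternative to java.util.Date, not what getHeaderFieldDate( ) does internally. It uses the standard DateTimeFormatter.RFC_1123_DATE_TIME formatter to parse an HTTP date string into milliseconds and to format milliseconds back into a string. The class name HttpDates and its methods parse( ) and format( ) are hypothetical, and the fallback parameter mirrors the default argument of getHeaderFieldDate( ).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper built on java.time (Java 8 or later assumed).
public class HttpDates {

  // Parses an RFC 1123 date such as "Mon, 18 Oct 1999 20:06:48 GMT"
  // into milliseconds since the epoch. Returns fallback when the value
  // is null or cannot be parsed.
  public static long parse(String value, long fallback) {
    if (value == null) return fallback;
    try {
      return ZonedDateTime
          .parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
          .toInstant()
          .toEpochMilli();
    } catch (DateTimeParseException ex) {
      return fallback;
    }
  }

  // Formats milliseconds since the epoch as an RFC 1123 date in GMT.
  public static String format(long millis) {
    return DateTimeFormatter.RFC_1123_DATE_TIME.format(
        Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC));
  }

  public static void main(String[] args) {
    // Date value taken from the Apache header shown earlier in this section
    long sent = parse("Mon, 18 Oct 1999 20:06:48 GMT", 0L);
    System.out.println(sent);
    System.out.println(format(sent));
    System.out.println(parse("not a date", -1L));
  }
}
```

The RFC 1123 formatter is strict, so older date formats that some servers still send would fall through to the fallback value.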

15.3.2.5 public int getHeaderFieldInt(String name, int default)

This method retrieves the value of the header field name and tries to convert it to an int. If it fails, either because it can't find the requested header field or because that field does not contain a recognizable integer, getHeaderFieldInt( ) returns the default argument. This method is often used to retrieve the Content-length field. For example, to get the content length from a URLConnection uc, you would write:

 int contentLength = uc.getHeaderFieldInt("content-length", -1);

In this code fragment, getHeaderFieldInt( ) returns -1 if the Content-length header isn't present.