HTTP as a Transport Protocol | Processing XML with Javaв„ў: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

XML is just a document format. Documents by themselves don't do anythingthey simply are. They contain information, but they neither read nor write that information. They tend to stay in one place, and do not move unless somebody or something moves them.

If two systems want to exchange messages in an XML format, it is not hard for them to do so. Each message can be encoded as a complete XML document. The sender can transmit the document to the receiver using FTP, HTTP, NFS, named pipes, RPC, floppy disks, a null-modem cable running between the serial ports of two machines, modems communicating over telephone lines, or any other means of moving data between systems. It's even acceptable for the sender to print the XML document on paper, seal it in a stamped envelope, and drop it in the mail. After the post office delivers the mail, the recipient can scan it in. XML is completely transport-protocol neutral. As long as the document arrives at its destination without being corrupted along the way, XML neither knows nor cares how it got there.

Because XML doesn't care how documents are moved from point A to point B, it's sensible to pick the simplest broadly supported protocol you can; and that is HTTP, the Hypertext Transport Protocol used by all web browsers and servers. Using HTTP to transport XML has a number of advantages, among them:

HTTP is well supported by libraries in Java, Perl, C, and many other languages for both client and server programs. This takes a large burden off the shoulders of the developer.
HTTP is platform independent. Windows PCs, Macs, Unix boxes, mainframes, and more are all happy speaking HTTP.
HTTP connections are normally allowed to pass through firewalls.
Because the HTTP protocol is text-based, you can use telnet to test out servers.
The HTTP header provides a convenient place to store out-of- band information, such as the document size and encoding.
HTTP is very well understood in the developer community.

Thus it should be no surprise that one of the most popular ways to move XML documents between systems is to use HTTP. A server can send an XML document to a client just as easily as it can send an HTML document or a JPEG image. Less well known is that a client can easily send an XML document (or anything else) to a server using HTTP POST.

How HTTP Works

The simplest-possible XML protocol merely pulls down an XML document from a known URL on a server. You don't need to send the server anything more than a request for a page. For example, today's Slashdot headlines are available in XML from http://www.slashdot.org/slashdot.xml . To get the headlines, you simply load that URL into a browser, as shown in Figure 2.1.

Figure 2.1. Slashdot Headlines in XML

graphics/02fig01.gif

Behind the scenes, here is what's going over the wire for that request. First the browser opens a socket to www.slashdot.org on port 80. Then it sends a request that looks something like this:

 GET /slashdot.xml HTTP/1.1  Host: www.slashdot.org User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; rv:0.9.2) Accept: text/xml, text/html;q=0.9, image/jpeg, */*;q=0.1 Accept-Language: en, fr;q=0.50 Accept-Encoding: gzip,deflate,compress,identity Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66 Keep-Alive: 300 Connection: keep-alive

The server responds with its own HTTP header, a blank line, and the body of the requested document. The result looks like this:

 HTTP/1.1 200 OK  Date: Fri, 20 Jul 2001 19:31:10 GMT Server: Apache/1.3.12 (Unix) mod_perl/1.24 Last-Modified: Fri, 20 Jul 2001 18:34:06 GMT ETag: "de47c-e5e-3b58799e" Accept-Ranges: bytes Content-Length: 3678 Connection: close Content-Type: text/xml <?xml version="1.0" encoding="ISO-8859-1"?><backslash xmlns:backslash="http://slashdot.org/backslash.dtd">   <story>     <title>TheKompany's Shawn Gordon Responds In Full</title>     <url>http://slashdot.org/article.pl?sid=01/07/20/1637220</url>     <time>2001-07-20 18:30:28</time>     <author>timothy</author>     <department>what's-good-for-thekompany</department>     <topic>linuxbiz</topic>     <comments>3</comments>     <section>interviews</section>     <image>topiclinuxbiz.gif</image>   </story>   <story>     <title>GNOME Usability Study Report</title>     <url>http://slashdot.org/article.pl?sid=01/07/20/1752205</url>     <time>2001-07-20 18:08:03</time>     <author>michael</author>     <department>press-ctrl-alt-shift-k-q-z-to-continue</department>     <topic>gnome</topic>     <comments>68</comments>     <section>developers</section>     <image>topicgnome.gif</image>   </story>   <story>     <title>Mono Unimplementable?</title>     <url>http://slashdot.org/article.pl?sid=01/07/20/1555256</url>     <time>2001-07-20 17:14:39</time>     <author>Hemos</author>     <department>keeping-things-propietary</department>     <topic>microsoft</topic>     <comments>132</comments>     <section>articles</section>     <image>topicms.gif</image>   </story> </backslash>

In both directions the HTTP header is pure text. Lines in the header are delimited by carriage -return linefeed pairs; that is, \r\n . The body of the document is separated from the HTTP header by a blank line. The document may or may not be text. For example, it could be a JPEG image or a gzipped HTML file.

Servers can also respond with a variety of error codes. For example, here's the common 404 Not Found error:

 HTTP/1.0 404 Not found  Server: Netscape-Enterprise/2.01 Date: Wed, 04 Jul 2001 20:35:17 GMT Content-length: 207 Content-type: text/html <TITLE>Not Found</TITLE><H1>Not Found</H1> The requested object does not exist on this server. The link you followed is either outdated, inaccurate, or the server has been instructed not to let you have it.

HTTP in Java

Of course, you don't have to load these documents into a browser. Java lets you write programs that connect to and retrieve information from web sites with hardly any effort. Once you have a document in memory, you can do whatever you want with it: search it, sort it, transform it, forward it to somebody else, clean up after the family dog with it, whatever you want. Future chapters will cover the details of all these operations (except cleaning up after the family dog; you're on your own for that :-) ). But first I want to show you how to use Java to retrieve such information and dump it to the console. Later we'll move from merely retrieving and printing a document to reading and understanding it.

Example 2.2 is a Java class that uses the java.net.URL class to load documents from a server via HTTP (or any other supported protocol). It has four methods . getDocumentAsInputStream() connects to a server and returns the unread stream after stripping off the HTTP header. getDocumentAsString() actually reads the entire document, stores it in a string buffer, and then returns a string containing the document at the URL. Overloaded variants of each method allow you to pass in either the string form of the URL or a java.net.URL object, whichever is more convenient. You would use the method that retrieves the document as an input stream if you wanted to process the document as it arrived. You would use the string version if the document wasn't too big and you wanted to make sure the entire document was available before working with it.

Example 2.2 URLGrabber

 package com.macfaq.net; import java.net.URL; import java.io.IOException; import java.net.MalformedURLException; import java.io.InputStream; public class URLGrabber {   public static InputStream getDocumentAsInputStream(URL url)     throws IOException {     InputStream in = url.openStream();     return in;   }   public static InputStream getDocumentAsInputStream(String url)     throws MalformedURLException, IOException {     URL u = new URL(url);     return getDocumentAsInputStream(u);   }   public static String getDocumentAsString(URL url)     throws IOException {     StringBuffer result = new StringBuffer();     InputStream in = url.openStream();     int c;     while ((c = in.read()) != -1) result.append((char) c);     return result.toString();   }   public static String getDocumentAsString(String url)     throws MalformedURLException, IOException {     URL u = new URL(url);     return getDocumentAsString(u);   } }

One method you might expect to see here I've deliberately left out: There is no getReaderFromURL() . Before you can convert the input stream into a reader, you need to figure out which encoding to use. Because XML documents normally carry their own information about encoding, this requires you to parse the first line or two of the XML document. URLGrabber could do that, but the details are rather complicated and would make this class less generic than it currently is. (Right now it can handle any kind of document, not just XML documents.) It's better to leave all the XML-specific details, such as determining the character encoding, to the XML parser.

Because URLGrabber is a generally useful class, I've placed it in the com.macfaq.net package. I will use this class again later in this book without further comment. To do so I will need to import com.macfaq.net.* , and I will need to make sure my source files are properly organized in my file system. The Java examples in this book are not all trivial, and they cannot all fit in a single class. In general, I will divide programs into different classes, offering small pieces of functionality like this one so that I can debug and explain them separately, as well as mix and match them in different programs.

Warning

Nothing in Java is as pointlessly confusing as the proper organization of .class and .java files in different packages on a file system. Learning how to use packages correctly is one of the major hurdles for novice Java programmers. It is also one of the obstacles a good Integrated Development Environment (IDE) can really help you with. If you are having problems making these programs compile and work as described here because of the package structure, especially if you're seeing error messages that involve " java.lang.NoClassDefFoundError ," please consult a good introductory reference on Java. For these details, I recommend the following books:

C. S. Horstmann and Gary Cornell. 2001. How the virtual machine locates classes. In Core Java 2 . Palo Alto, CA: Sun Microsystems Press. ISBN 0-13-089468-0.
M. Campione, K. Walrath, and A. Hunt. 2001. Managing source and class files, In The Java Tutorial . 3d edition . Boston: Addison-Wesley. ISBN 0-201-70393-9.

Neither of these is complete. In particular neither covers the crucial sourcepath option [http://java.sun.com/j2se/1.3/docs/tooldocs/win32/javac.html#options] to the javac compiler, nor shows you how to compile and run a program divided across multiple packages [http://java.sun.com/j2se/1.3/docs/tooldocs/win32/javac.html#examples]. I'm still looking for a better introductory reference on these topics. If you know of one, please drop an email to elharo@metalab.unc.edu. Thanks! However, in this book I will assume that you have learned how to navigate the obstacles Java places in your CLASSPATH with a reasonable degree of proficiency.

Isolating the code to communicate with the network in the URLGrabber class will allow us to ignore it in the future. For the most part, the details of network transport are not relevant when processing XML. Many of the programs in this book that process XML expect to receive an XML document as a stream or a string. They really don't care where that stream or string comes from, as long as it contains a well- formed XML document. For the remainder of this book, when I need to load an XML document from a URL via GET, I will refer to this class without duplicating the code.

There is another aspect of this program that I find sometimes confuses people: it does not have a main() method. It is not intended to be used directly by typing java URLGrabber at the command line. Rather, this is a library class meant for other programs to use. Example 2.3 is a simple program designed simply to test URLGrabber with a very basic command-line user interface. Because it is not a generally useful class, I have not placed it anywhere in the com.macfaq packages. Instead it is simply a quick, one-off program I can use to make sure that URLGrabber actually works and does what I want it to do.

Example 2.3 URLGrabberTest

 import com.macfaq.net.URLGrabber; import java.io.IOException; import java.net.MalformedURLException; public class URLGrabberTest {   public static void main(String[] args) {     for (int i = 0; i < args.length; i++) {       try {         String doc = URLGrabber.getDocumentAsString(args[i]);         System.out.println(doc);       }       catch (MalformedURLException e) {         System.err.println(args[i]          + " cannot be interpreted as a URL.");       }       catch (IOException e) {         System.err.println("Unexpected IOException: "          + e.getMessage());        }     }   } }

Following is a simple example of using URLGrabberTest to download the XML document from http://www.slashdot.org/slashdot.xml :

 %  java URLGrabberTest http://www.slashdot.org/slashdot.xml  <?xml version="1.0" encoding="ISO-8859-1"?><backslash xmlns:backslash="http://slashdot.org/backslash.dtd">   <story>     <title>TheKompany's Shawn Gordon Responds In Full</title> ...

However, URLGrabberTest is really nothing more than a toy example that fits nicely and neatly in the space available on a printed page. In 2002 serious programs either use graphical user interfaces (GUIs), or they don't have an explicit user interface at all. Example 2.2 is a useful program; Example 2.3 is not.

Many services can use this simple approach to transmitting XML data across HTTP. For example, in addition to headlines, this is a straightforward way to distribute the titles and show times of the movies playing at a particular theater, the viewing conditions at an observatory, the operational state of a machine in a factory, the daily sales at a retail store, the surf conditions at Ehukai Beach, and more. What makes this possible is that all users want pretty much the same document. The response does not need to be customized for each requester.