7.1 The URL Class | Java Network Programming, Third Edition

The java.net.URL class is an abstraction of a Uniform Resource Locator such as http://www.hamsterdance.com/ or ftp://ftp.redhat.com/pub/. It extends java.lang.Object, and it is a final class that cannot be subclassed. Rather than relying on inheritance to configure instances for different kinds of URLs, it uses the strategy design pattern. Protocol handlers are the strategies, and the URL class itself forms the context through which the different strategies are selected:

 public final class URL extends Object implements Serializable

Although storing a URL as a string would be trivial, it is helpful to think of URLs as objects with fields that include the scheme (a.k.a. the protocol), hostname, port, path, query string, and fragment identifier (a.k.a. the ref), each of which may be set independently. Indeed, this is almost exactly how the java.net.URL class is organized, though the details vary a little between different versions of Java.

The fields of java.net.URL are only visible to other members of the java.net package; classes that aren't in java.net can't access a URL's fields directly. However, you can set these fields using the URL constructors and retrieve their values using the various getter methods (getHost(), getPort(), and so on). URLs are effectively immutable. After a URL object has been constructed, its fields do not change. This has the side effect of making them thread-safe.

7.1.1 Creating New URLs

Unlike the InetAddress objects in Chapter 6, you can construct instances of java.net.URL. There are six constructors, differing in the information they require. Which constructor you use depends on the information you have and the form it's in. All these constructors throw a MalformedURLException if you try to create a URL for an unsupported protocol and may throw a MalformedURLException if the URL is syntactically incorrect.

Exactly which protocols are supported is implementation-dependent. The only protocols that have been available in all major virtual machines are http and file, and the latter is notoriously flaky. Java 1.5 also requires virtual machines to support https, jar, and ftp; many virtual machines prior to Java 1.5 support these three as well. Most virtual machines also support mailto and gopher as well as some custom protocols like doc, netdoc, systemresource, and verbatim used internally by Java. The Netscape virtual machine supports the http, file, ftp, mailto, telnet, ldap, and gopher protocols. The Microsoft virtual machine supports http, file, ftp, https, mailto, gopher, doc, and systemresource, but not telnet, netdoc, jar, or verbatim. Of course, support for all these protocols is limited in applets by the security policy. For example, just because an untrusted applet can construct a URL object from a file URL does not mean that the applet can actually read the file the URL refers to. Just because an untrusted applet can construct a URL object from an HTTP URL that points to a third-party web site does not mean that the applet can connect to that site.

If the protocol you need isn't supported by a particular VM, you may be able to install a protocol handler for that scheme. This is subject to a number of security checks in applets and is really practical only for applications. Other than verifying that it recognizes the URL scheme, Java does not make any checks about the correctness of the URLs it constructs. The programmer is responsible for making sure that URLs created are valid. For instance, Java does not check that the hostname in an HTTP URL does not contain spaces or that the query string is x-www-form-URL-encoded. It does not check that a mailto URL actually contains an email address. Java does not check the URL to make sure that it points at an existing host or that it meets any other requirements for URLs. You can create URLs for hosts that don't exist and for hosts that do exist but that you won't be allowed to connect to.
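
To see how little the constructor checks, consider the following rough sketch. The class name UncheckedURLDemo, its constructs() helper, and both URLs are made up for this illustration. The first URL names a host in the reserved .invalid top-level domain, so it cannot exist; the second uses a scheme for which no protocol handler is assumed to be installed. Constructing a URL performs no DNS lookup and opens no connection.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UncheckedURLDemo {

  // Returns true if the URL constructor accepts the string and
  // false if it throws a MalformedURLException.
  static boolean constructs(String spec) {
    try {
      new URL(spec);
      return true;
    }
    catch (MalformedURLException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The scheme is recognized, so construction succeeds even though
    // the host cannot exist.
    System.out.println(constructs("http://no-such-host.invalid/index.html"));
    // No handler is installed for this invented scheme, so the
    // constructor rejects it.
    System.out.println(constructs("nosuchscheme://www.example.com/"));
  }
}
```

Only the second call should fail. Whether the first URL can actually be retrieved is a separate question that is not answered until you try to connect.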

7.1.1.1 Constructing a URL from a string

The simplest URL constructor just takes an absolute URL in string form as its single argument:

 public URL(String url) throws MalformedURLException

Like all constructors, this may only be called after the new operator, and like all URL constructors, it can throw a MalformedURLException . The following code constructs a URL object from a String , catching the exception that might be thrown:

 try {
   URL u = new URL("http://www.audubon.org/");
 }
 catch (MalformedURLException ex) {
   System.err.println(ex);
 }

Example 7-1 is a simple program for determining which protocols a virtual machine supports. It attempts to construct a URL object for each of 16 protocols (9 standard protocols, 3 custom protocols for various Java APIs, and 4 undocumented protocols used internally by HotJava). If the constructor succeeds, you know the protocol is supported. Otherwise, a MalformedURLException is thrown and you know the protocol is not supported.

Example 7-1. ProtocolTester

 /* Which protocols does a virtual machine support? */
 import java.net.*;

 public class ProtocolTester {

   public static void main(String[] args) {

     // hypertext transfer protocol
     testProtocol("http://www.adc.org");

     // secure http
     testProtocol("https://www.amazon.com/exec/obidos/order2/");

     // file transfer protocol
     testProtocol("ftp://metalab.unc.edu/pub/languages/java/javafaq/");

     // Simple Mail Transfer Protocol
     testProtocol("mailto:elharo@metalab.unc.edu");

     // telnet
     testProtocol("telnet://dibner.poly.edu/");

     // local file access
     testProtocol("file:///etc/passwd");

     // gopher
     testProtocol("gopher://gopher.anc.org.za/");

     // Lightweight Directory Access Protocol
     testProtocol(
      "ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US?postalAddress");

     // JAR
     testProtocol(
      "jar:http://cafeaulait.org/books/javaio/ioexamples/javaio.jar!"
          + "/com/macfaq/io/StreamCopier.class");

     // NFS, Network File System
     testProtocol("nfs://utopia.poly.edu/usr/tmp/");

     // a custom protocol for JDBC
     testProtocol("jdbc:mysql://luna.metalab.unc.edu:3306/NEWS");

     // rmi, a custom protocol for remote method invocation
     testProtocol("rmi://metalab.unc.edu/RenderEngine");

     // custom protocols for HotJava
     testProtocol("doc:/UsersGuide/release.html");
     testProtocol("netdoc:/UsersGuide/release.html");
     testProtocol("systemresource://www.adc.org/+/index.html");
     testProtocol("verbatim:http://www.adc.org/");
   }

   private static void testProtocol(String url) {
     try {
       URL u = new URL(url);
       System.out.println(u.getProtocol() + " is supported");
     }
     catch (MalformedURLException ex) {
       String protocol = url.substring(0, url.indexOf(':'));
       System.out.println(protocol + " is not supported");
     }
   }
 }

The results of this program depend on which virtual machine runs it. Here are the results from Java 1.4.1 on Mac OS X 10.2, which turns out to support all the protocols except Telnet, LDAP, RMI, NFS, and JDBC:

 % java ProtocolTester
 http is supported
 https is supported
 ftp is supported
 mailto is supported
 telnet is not supported
 file is supported
 gopher is supported
 ldap is not supported
 jar is supported
 nfs is not supported
 jdbc is not supported
 rmi is not supported
 doc is supported
 netdoc is supported
 systemresource is supported
 verbatim is supported

Results using Sun's Linux 1.4.2 virtual machine were identical. Other 1.4 virtual machines derived from the Sun code will show similar results. Java 1.2 and later are likely to be the same except for maybe HTTPS, which was only recently added to the standard distribution. VMs that are not derived from the Sun codebase may vary somewhat in which protocols they support. For example, here are the results of running ProtocolTester with the open source Kaffe VM 1.1.1:

 % java ProtocolTester
 http is supported
 https is not supported
 ftp is supported
 mailto is not supported
 telnet is not supported
 file is supported
 gopher is not supported
 ldap is not supported
 jar is supported
 nfs is not supported
 jdbc is not supported
 rmi is not supported
 doc is not supported
 netdoc is not supported
 systemresource is not supported
 verbatim is not supported

The nonsupport of RMI and JDBC is actually a little deceptive; in fact, the JDK does support these protocols. However, that support is through various parts of the java.rmi and java.sql packages, respectively. These protocols are not accessible through the URL class like the other supported protocols (although I have no idea why Sun chose to wrap up RMI and JDBC parameters in URL clothing if it wasn't intending to interface with these via Java's quite sophisticated mechanism for handling URLs).

7.1.1.2 Constructing a URL from its component parts

The second constructor builds a URL from three strings specifying the protocol, the hostname, and the file:

 public URL(String protocol, String hostname, String file)
   throws MalformedURLException

This constructor sets the port to -1 so the default port for the protocol will be used. The file argument should begin with a slash and include a path, a filename, and optionally a fragment identifier. Forgetting the initial slash is a common mistake, and one that is not easy to spot. Like all URL constructors, it can throw a MalformedURLException . For example:

 try {
   URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");
 }
 catch (MalformedURLException ex) {
   // All VMs should recognize http
 }

This creates a URL object that points to http://www.eff.org/blueribbon.html#intro, using the default port for the HTTP protocol (port 80). The file specification includes a reference to a named anchor. The code catches the exception that would be thrown if the virtual machine did not support the HTTP protocol. However, this shouldn't happen in practice.
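
The effect of forgetting the initial slash is easiest to see by printing the resulting URL. The following sketch constructs the same URL with and without the slash. The class name MissingSlashDemo and its build() helper are invented for this illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class MissingSlashDemo {

  // Builds a URL from its parts and returns its string form, or null
  // if the virtual machine does not support the protocol.
  static String build(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file).toString();
    }
    catch (MalformedURLException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Correct: the file argument begins with a slash.
    System.out.println(build("http", "www.eff.org", "/blueribbon.html#intro"));
    // Mistake: no leading slash. No exception is thrown.
    System.out.println(build("http", "www.eff.org", "blueribbon.html#intro"));
  }
}
```

On Sun-derived virtual machines, the second URL should print with the hostname and filename run together (www.eff.orgblueribbon.html), which is why this mistake is hard to spot until a connection fails.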

For the rare occasions when the default port isn't correct, the next constructor lets you specify the port explicitly as an int :

 public URL(String protocol, String host, int port, String file)
   throws MalformedURLException

The other arguments are the same as for the URL(String protocol, String host, String file) constructor and carry the same caveats. For example:

 try {
   URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
 }
 catch (MalformedURLException ex) {
   System.err.println(ex);
 }

This code creates a URL object that points to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying port 8000 explicitly.

Example 7-2 is an alternative protocol tester that can run as an applet, making it useful for testing support of browser virtual machines. It uses the three-argument constructor rather than the one-argument constructor in Example 7-1. It also stores the schemes to be tested in an array and uses the same host and file for each scheme. This produces seriously malformed URLs like mailto://www.peacefire.org/bypass/SurfWatch/, once again demonstrating that the only thing Java checks at object construction is whether it recognizes the scheme, not whether the URL is appropriate.

Example 7-2. A protocol tester applet

 import java.net.*;
 import java.applet.*;
 import java.awt.*;

 public class ProtocolTesterApplet extends Applet {

   TextArea results = new TextArea();

   public void init() {
     this.setLayout(new BorderLayout());
     this.add("Center", results);
   }

   public void start() {

     String host = "www.peacefire.org";
     String file = "/bypass/SurfWatch/";

     String[] schemes = {"http",   "https",   "ftp",  "mailto",
                         "telnet", "file",    "ldap", "gopher",
                         "jdbc",   "rmi",     "jndi", "jar",
                         "doc",    "netdoc",  "nfs",  "verbatim",
                         "finger", "daytime", "systemresource"};

     for (int i = 0; i < schemes.length; i++) {
       try {
         URL u = new URL(schemes[i], host, file);
         results.append(schemes[i] + " is supported\r\n");
       }
       catch (MalformedURLException ex) {
         results.append(schemes[i] + " is not supported\r\n");
       }
     }
   }
 }

Figure 7-1 shows the results of Example 7-2 in Mozilla 1.4 with Java 1.4 installed. This browser supports HTTP, HTTPS, FTP, mailto, file, gopher, doc, netdoc, verbatim, systemresource, and jar but not NFS, ldap, Telnet, jdbc, rmi, jndi, finger, or daytime.

Figure 7-1. The ProtocolTesterApplet running in Mozilla 1.4

7.1.1.3 Constructing relative URLs

This constructor builds an absolute URL from a relative URL and a base URL :

 public URL(URL base, String relative) throws MalformedURLException

For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and encounter a link to a file called mailinglists.html with no further qualifying information. In this case, you use the URL to the document that contains the link to provide the missing information. The constructor computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html. For example:

 try {
   URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");
   URL u2 = new URL(u1, "mailinglists.html");
 }
 catch (MalformedURLException ex) {
   System.err.println(ex);
 }

The filename is removed from the path of u1 and the new filename mailinglists.html is appended to make u2. This constructor is particularly useful when you want to loop through a list of files that are all in the same directory. You can create a URL for the first file and then use this initial URL to create URL objects for the other files by substituting their filenames. You also use this constructor when you want to create a URL relative to the applet's document base or code base, which you retrieve using the getDocumentBase() or getCodeBase() methods of the java.applet.Applet class. Example 7-3 is a very simple applet that uses getDocumentBase() to create a new URL object:

Example 7-3. A URL relative to the web page

 import java.net.*;
 import java.applet.*;
 import java.awt.*;

 public class RelativeURLTest extends Applet {

   public void init() {
     try {
       URL base = this.getDocumentBase();
       URL relative = new URL(base, "mailinglists.html");
       this.setLayout(new GridLayout(2,1));
       this.add(new Label(base.toString()));
       this.add(new Label(relative.toString()));
     }
     catch (MalformedURLException ex) {
       this.add(new Label("This shouldn't happen!"));
     }
   }
 }

Of course, the output from this applet depends on the document base. In the run shown in Figure 7-2, the original URL (the document base) refers to the file RelativeURL.html ; the constructor creates a new URL that points to the mailinglists.html file in the same directory.

Figure 7-2. A base and a relative URL

When using this constructor with getDocumentBase() , you frequently put the call to getDocumentBase( ) inside the constructor, like this:

 URL relative = new URL(this.getDocumentBase( ), "mailinglists.html");
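
The relative part does not have to be a bare filename. The following sketch, which runs outside an applet, resolves several kinds of relative references against the base URL used earlier in this section. The class name RelativeURLDemo, its resolve() helper, and every relative path other than mailinglists.html are assumptions made for the illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeURLDemo {

  // Resolves a relative reference against a base URL and returns the
  // absolute result as a string, or null if either part is malformed.
  static String resolve(String base, String relative) {
    try {
      URL context = new URL(base);
      return new URL(context, relative).toString();
    }
    catch (MalformedURLException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    String base = "http://www.ibiblio.org/javafaq/index.html";
    // A sibling file in the same directory
    System.out.println(resolve(base, "mailinglists.html"));
    // A file in a subdirectory of the base's directory
    System.out.println(resolve(base, "books/jnp/index.html"));
    // A path that starts at the server root
    System.out.println(resolve(base, "/images/logo.gif"));
    // A file in the parent directory
    System.out.println(resolve(base, "../index.html"));
  }
}
```

In each case the scheme and host come from the base URL. A leading slash replaces the whole path, and ../ climbs one directory before the rest of the reference is appended.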

7.1.1.4 Specifying a URLStreamHandler // Java 1.2

Two constructors allow you to specify the protocol handler used for the URL. The first constructor builds a relative URL from a base URL and a relative part. The second builds the URL from its component pieces:

 public URL(URL base, String relative, URLStreamHandler handler) // 1.2
  throws MalformedURLException
 public URL(String protocol, String host, int port, String file, // 1.2
  URLStreamHandler handler) throws MalformedURLException

All URL objects have URLStreamHandler objects to do their work for them. These two constructors change from the default URLStreamHandler subclass for a particular protocol to one of your own choosing. This is useful for working with URLs whose schemes aren't supported in a particular virtual machine as well as for adding functionality that the default stream handler doesn't provide, such as asking the user for a username and password. For example:

 URL u = new URL("finger", "utopia.poly.edu", 79, "/marcus",
   new com.macfaq.net.www.protocol.finger.Handler());

The com.macfaq.net.www.protocol.finger.Handler class used here will be developed in Chapter 16.

While the other four constructors raise no security issues in and of themselves, these two do because class loader security is closely tied to the various URLStreamHandler classes. Consequently, untrusted applets are not allowed to specify a URLStreamHandler. Trusted applets can do so if they have the NetPermission specifyStreamHandler. However, for reasons that will become apparent in Chapter 16, this is a security hole big enough to drive the Microsoft money train through. Consequently, you should not request this permission or expect it to be granted if you do request it.
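
To give a feel for what supplying your own handler involves in an application, here is a deliberately trivial sketch. It is not the finger handler from Chapter 16. The echo scheme, the EchoHandler class, and the read() helper are all invented for this illustration, and the handler never touches the network: its connections simply return the URL's own path as their content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class EchoHandlerDemo {

  // A hypothetical handler for a made-up "echo" scheme.
  static class EchoHandler extends URLStreamHandler {
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        public void connect() {
          connected = true; // nothing to connect to
        }
        public InputStream getInputStream() throws IOException {
          // The "content" of the URL is just its path.
          return new ByteArrayInputStream(getURL().getPath().getBytes("UTF-8"));
        }
      };
    }
  }

  // Builds an echo URL with the five-argument constructor, then reads
  // it back through openStream(). Returns null on any I/O problem.
  static String read(String path) {
    try {
      URL u = new URL("echo", "localhost", -1, path, new EchoHandler());
      InputStream in = u.openStream();
      try {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        return sb.toString();
      }
      finally {
        in.close();
      }
    }
    catch (IOException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(read("/hello"));
  }
}
```

Because the handler is passed directly to the constructor, no MalformedURLException is thrown even though no virtual machine knows an echo scheme. The sketch assumes it runs as an application with no security manager installed.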

7.1.1.5 Other sources of URL objects

Besides the constructors discussed here, a number of other methods in the Java class library return URL objects. You've already seen getDocumentBase() from java.applet.Applet. The other common source is getCodeBase(), also from java.applet.Applet. This works just like getDocumentBase(), except it returns the URL of the applet itself instead of the URL of the page that contains the applet. Both getDocumentBase() and getCodeBase() are declared in the java.applet.AppletStub interface, to which java.applet.Applet delegates these calls. You're unlikely to implement this interface yourself unless you're building a web browser or applet viewer.

In Java 1.2 and later, the java.io.File class has a toURL( ) method that returns a file URL matching the given file. The exact format of the URL returned by this method is platform-dependent. For example, on Windows it may return something like file:/D:/JAVA/JNP3/07/ToURLTest.java . On Linux and other Unixes, you're likely to see file:/home/elharo/books/JNP3/07/ToURLTest.java . In practice, file URLs are heavily platform- and program-dependent. Java file URLs often cannot be interchanged with the URLs used by web browsers and other programs, or even with Java programs running on different platforms.

Class loaders are used not only to load classes but also to load resources such as images and audio files. The static ClassLoader.getSystemResource(String name) method returns a URL from which a single resource can be read. The ClassLoader.getSystemResources(String name) method returns an Enumeration containing a list of URLs from which the named resource can be read. Finally, the instance method getResource(String name) searches the path used by the referenced class loader for a URL to the named resource. The URLs returned by these methods may be file URLs, HTTP URLs, or some other scheme. The name of the resource is a slash-separated list of Java identifiers, such as /com/macfaq/sounds/swale.au or com/macfaq/images/headshot.jpg. The Java virtual machine will attempt to find the requested resource in the class path (potentially including parts of the class path on the web server that an applet was loaded from) or inside a JAR archive.

Java 1.4 adds the URI class, which we'll discuss soon. URIs can be converted into URLs using the toURL( ) method, provided Java has the relevant protocol handler installed.
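
As a rough sketch of that conversion, the following code turns strings into URI objects and then into URLs, returning null whenever a step fails. The class name URIToURLDemo and its convert() helper are invented for this illustration, and the sketch assumes the virtual machine has an http handler but no telnet handler, as in the ProtocolTester output shown earlier.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class URIToURLDemo {

  // Converts a string to a URI and then to a URL. Returns the URL's
  // string form, or null if any step of the conversion fails.
  static String convert(String spec) {
    try {
      URI uri = new URI(spec);
      URL url = uri.toURL();
      return url.toString();
    }
    catch (URISyntaxException ex) {
      return null; // not a syntactically legal URI
    }
    catch (MalformedURLException ex) {
      return null; // no protocol handler for this scheme
    }
    catch (IllegalArgumentException ex) {
      return null; // the URI is relative, so it has no scheme
    }
  }

  public static void main(String[] args) {
    System.out.println(convert("http://www.ibiblio.org/javafaq/"));
    System.out.println(convert("telnet://dibner.poly.edu/"));
    System.out.println(convert("javafaq/index.html"));
  }
}
```

A URI object can represent any of these three strings, but only the first has a scheme that this virtual machine can turn into a URL.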

There are a few other methods that return URL objects here and there throughout the class library, but most are simple getter methods that return only a URL you probably already know because you used it to construct the object in the first place; for instance, the getPage() method of javax.swing.JEditorPane and the getURL() method of java.net.URLConnection.

7.1.2 Splitting a URL into Pieces

URLs are composed of five pieces:

The scheme, also known as the protocol
The authority
The path
The fragment identifier, also known as the section or ref
The query string

For example, given the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc, the scheme is http , the authority is www.ibiblio.org , the path is /javafaq/books/jnp/index.html , the fragment identifier is toc , and the query string is isbn=1565922069 . However, not all URLs have all these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc2396.html has a scheme, an authority, and a path, but no fragment identifier or query string.

The authority may further be divided into the user info, the host, and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the authority is admin@www.blackstar.com:8080. This has the user info admin , the host www.blackstar.com , and the port 8080 .

Read-only access to these parts of a URL is provided by five public methods: getFile(), getHost(), getPort(), getProtocol(), and getRef(). Java 1.3 adds four more methods: getQuery(), getPath(), getUserInfo(), and getAuthority().

7.1.2.1 public String getProtocol( )

The getProtocol( ) method returns a String containing the scheme of the URL, e.g., "http", "https", or "file". For example:

 URL page = this.getCodeBase();
 System.out.println("This applet was downloaded via "
   + page.getProtocol());

7.1.2.2 public String getHost( )

The getHost( ) method returns a String containing the hostname of the URL. For example:

 URL page = this.getCodeBase();
 System.out.println("This applet was downloaded from " + page.getHost());

The most recent virtual machines get this method right but some older ones, including Sun's JDK 1.3.0, may return a host string that is not necessarily a valid hostname or address. In particular, URLs that incorporate usernames, like ftp://anonymous:anonymous@wuarchive.wustl.edu/, sometimes include the user info in the host. For example, consider this code fragment:

 URL u = new URL("ftp://anonymous:anonymous@wuarchive.wustl.edu/");
 String host = u.getHost();

Java 1.3 sets host to anonymous:anonymous@wuarchive.wustl.edu , not simply wuarchive.wustl.edu . Java 1.4 would return wuarchive.wustl.edu instead.

7.1.2.3 public int getPort( )

The getPort() method returns the port number specified in the URL as an int. If no port was specified in the URL, getPort() returns -1 to signify that the URL does not specify the port explicitly, and will use the default port for the protocol. For example, if the URL is http://www.userfriendly.org/, getPort() returns -1; if the URL is http://www.userfriendly.org:80/, getPort() returns 80. The following code prints -1 for the port number because it isn't specified in the URL:

 URL u = new URL("http://www.ncsa.uiuc.edu/demoweb/html-primer.html");
 System.out.println("The port part of " + u + " is " + u.getPort());

7.1.2.4 public int getDefaultPort( )

The getDefaultPort( ) method returns the default port used for this URL 's protocol when none is specified in the URL. If no default port is defined for the protocol, getDefaultPort( ) returns -1. For example, if the URL is http://www.userfriendly.org/, getDefaultPort( ) returns 80; if the URL is ftp://ftp.userfriendly.org:8000/, getDefaultPort( ) returns 21.
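
Taken together, getPort() and getDefaultPort() tell you which port a connection would actually use. The following sketch combines them. The class name EffectivePortDemo and its effectivePort() helpers are invented for this illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class EffectivePortDemo {

  // Returns the explicit port if the URL has one; otherwise returns
  // the default port for the URL's protocol (or -1 if none is defined).
  static int effectivePort(URL u) {
    int port = u.getPort();
    return (port != -1) ? port : u.getDefaultPort();
  }

  // Convenience overload for string input. A malformed URL is reported
  // with an unchecked exception.
  static int effectivePort(String spec) {
    try {
      return effectivePort(new URL(spec));
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(spec + " is not a URL I understand");
    }
  }

  public static void main(String[] args) {
    // No explicit port, so the http default applies
    System.out.println(effectivePort("http://www.userfriendly.org/"));
    // The explicit port wins over the ftp default of 21
    System.out.println(effectivePort("ftp://ftp.userfriendly.org:8000/"));
  }
}
```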

7.1.2.5 public String getFile( )

The getFile( ) method returns a String that contains the path portion of a URL; remember that Java does not break a URL into separate path and file parts. Everything from the first slash (/) after the hostname until the character preceding the # sign that begins a fragment identifier is considered to be part of the file. For example:

 URL page = this.getDocumentBase();
 System.out.println("This page's path is " + page.getFile());

If the URL does not have a file part, Java 1.2 and earlier append a slash to the URL and return the slash as the filename. For example, if the URL is http://www.slashdot.org (rather than something like http://www.slashdot.org/), getFile() returns /. Java 1.3 and later simply set the file to the empty string.

7.1.2.6 public String getPath( ) // Java 1.3

The getPath( ) method, available only in Java 1.3 and later, is a near synonym for getFile( ) ; that is, it returns a String containing the path and file portion of a URL. However, unlike getFile( ) , it does not include the query string in the String it returns, just the path.

Note that the getPath( ) method does not return only the directory path and getFile( ) does not return only the filename, as you might expect. Both getPath() and getFile( ) return the full path and filename. The only difference is that getFile() also returns the query string and getPath( ) does not.
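
A small sketch may make the difference concrete. It applies both methods to URLs that appear elsewhere in this chapter. The class name PathVersusFileDemo and its helper methods are invented for this illustration, and the behavior described is that of Java 1.3 and later.

```java
import java.net.MalformedURLException;
import java.net.URL;

class PathVersusFileDemo {

  // Parses the string, reporting malformed input with an unchecked
  // exception so the helpers below stay simple.
  private static URL parse(String spec) {
    try {
      return new URL(spec);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(spec + " is not a URL I understand");
    }
  }

  static String path(String spec) {
    return parse(spec).getPath();
  }

  static String file(String spec) {
    return parse(spec).getFile();
  }

  public static void main(String[] args) {
    String[] urls = {
      "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano",
      "http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914",
      "http://www.slashdot.org"
    };
    for (int i = 0; i < urls.length; i++) {
      System.out.println("path: " + path(urls[i]));
      System.out.println("file: " + file(urls[i]));
    }
  }
}
```

Only the first URL produces different results from the two methods, because it is the only one with a query string. Neither method includes the fragment identifier, and both return the empty string for the URL with no path at all.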

7.1.2.7 public String getRef( )

The getRef( ) method returns the fragment identifier part of the URL. If the URL doesn't have a fragment identifier, the method returns null . In the following code, getRef( ) returns the string xtocid1902914 :

 URL u = new URL(
  "http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914");
 System.out.println("The fragment ID of " + u + " is " + u.getRef());

7.1.2.8 public String getQuery( ) // Java 1.3

The getQuery( ) method returns the query string of the URL. If the URL doesn't have a query string, the method returns null . In the following code, getQuery() returns the string category=Piano :

 URL u = new URL(
  "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
 System.out.println("The query string of " + u + " is " + u.getQuery());

In Java 1.2 and earlier, you need to extract the query string from the value returned by getFile( ) instead.

7.1.2.9 public String getUserInfo( ) // Java 1.3

Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo . Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/ , the user info is mp3:secret . However, most of the time including a password in a URL is a security risk. If the URL doesn't have any user info, getUserInfo() returns null . Mailto URLs may not behave like you expect. In a URL like mailto:elharo@metalab.unc.edu, elharo@metalab.unc.edu is the path, not the user info and the host. That's because the URL specifies the remote recipient of the message rather than the username and host that's sending the message.

7.1.2.10 public String getAuthority( ) // Java 1.3

Between the scheme and the path of a URL, you'll find the authority. The term authority is taken from the Uniform Resource Identifier specification (RFC 2396), where this part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. For example, in the URL ftp://mp3:mp3@138.247.121.61:21000/c%3a/, the authority is mp3:mp3@138.247.121.61:21000. However, not all URLs have all parts. For instance, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority() method returns the authority as it exists in the URL, with or without the user info and port.
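
One way to check your understanding of the authority is to rebuild it from its three parts and compare the result with what getAuthority() returns. The sketch below does that for the two URLs just mentioned. The class name AuthorityDemo and its helpers are invented for this illustration, and the sketch assumes Java 1.4 or later, where getHost() does not include the user info.

```java
import java.net.MalformedURLException;
import java.net.URL;

class AuthorityDemo {

  private static URL parse(String spec) {
    try {
      return new URL(spec);
    }
    catch (MalformedURLException ex) {
      throw new IllegalArgumentException(spec + " is not a URL I understand");
    }
  }

  // The authority exactly as the URL class reports it
  static String authority(String spec) {
    return parse(spec).getAuthority();
  }

  // The authority reassembled from user info, host, and port,
  // leaving out whichever optional parts are absent
  static String rebuilt(String spec) {
    URL u = parse(spec);
    StringBuilder sb = new StringBuilder();
    if (u.getUserInfo() != null) sb.append(u.getUserInfo()).append('@');
    sb.append(u.getHost());
    if (u.getPort() != -1) sb.append(':').append(u.getPort());
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] urls = {
      "ftp://mp3:mp3@138.247.121.61:21000/c%3a/",
      "http://conferences.oreilly.com/java/speakers/"
    };
    for (int i = 0; i < urls.length; i++) {
      System.out.println(authority(urls[i]) + " / " + rebuilt(urls[i]));
    }
  }
}
```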

Example 7-4 uses seven of these methods (getProtocol(), getUserInfo(), getHost(), getPort(), getPath(), getRef(), and getQuery()) to split URLs entered on the command line into their component parts. This program requires Java 1.3 or later.

Example 7-4. The parts of a URL

 import java.net.*;

 public class URLSplitter {

   public static void main(String args[]) {

     for (int i = 0; i < args.length; i++) {
       try {
         URL u = new URL(args[i]);
         System.out.println("The URL is " + u);
         System.out.println("The scheme is " + u.getProtocol());
         System.out.println("The user info is " + u.getUserInfo());

         String host = u.getHost();
         if (host != null) {
           int atSign = host.indexOf('@');
           if (atSign != -1) host = host.substring(atSign+1);
           System.out.println("The host is " + host);
         }
         else {
           System.out.println("The host is null.");
         }

         System.out.println("The port is " + u.getPort());
         System.out.println("The path is " + u.getPath());
         System.out.println("The ref is " + u.getRef());
         System.out.println("The query string is " + u.getQuery());
       }  // end try
       catch (MalformedURLException ex) {
         System.err.println(args[i] + " is not a URL I understand.");
       }
       System.out.println();
     }  // end for
   }  // end main
 }  // end URLSplitter

Here's the result of running this against several of the URL examples in this chapter:

 % java URLSplitter \
   http://www.ncsa.uiuc.edu/demoweb/html-primer.html#A1.3.3.3 \
   ftp://mp3:mp3@138.247.121.61:21000/c%3a/ \
   http://www.oreilly.com \
   http://www.ibiblio.org/nywc/compositions.phtml?category=Piano \
   http://admin@www.blackstar.com:8080/
 The URL is http://www.ncsa.uiuc.edu/demoweb/html-primer.html#A1.3.3.3
 The scheme is http
 The user info is null
 The host is www.ncsa.uiuc.edu
 The port is -1
 The path is /demoweb/html-primer.html
 The ref is A1.3.3.3
 The query string is null

 The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/
 The scheme is ftp
 The user info is mp3:mp3
 The host is 138.247.121.61
 The port is 21000
 The path is /c%3a/
 The ref is null
 The query string is null

 The URL is http://www.oreilly.com
 The scheme is http
 The user info is null
 The host is www.oreilly.com
 The port is -1
 The path is
 The ref is null
 The query string is null

 The URL is http://www.ibiblio.org/nywc/compositions.phtml?category=Piano
 The scheme is http
 The user info is null
 The host is www.ibiblio.org
 The port is -1
 The path is /nywc/compositions.phtml
 The ref is null
 The query string is category=Piano

 The URL is http://admin@www.blackstar.com:8080/
 The scheme is http
 The user info is admin
 The host is www.blackstar.com
 The port is 8080
 The path is /
 The ref is null
 The query string is null

7.1.3 Retrieving Data from a URL

Naked URLs aren't very exciting. What's interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:

 public InputStream openStream() throws IOException
 public URLConnection openConnection() throws IOException
 public URLConnection openConnection(Proxy proxy) throws IOException // 1.5
 public Object getContent() throws IOException
 public Object getContent(Class[] classes) throws IOException // 1.3

These methods differ in that they return the data at the URL as an instance of different classes.

7.1.3.1 public final InputStream openStream( ) throws IOException

The openStream() method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an InputStream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) contents of the file the URL references: ASCII if you're reading an ASCII text file, raw HTML if you're reading an HTML file, binary image data if you're reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:

 try {
   URL u = new URL("http://www.hamsterdance.com");
   InputStream in = u.openStream();
   int c;
   while ((c = in.read()) != -1) System.out.write(c);
 }
 catch (IOException ex) {
   System.err.println(ex);
 }

This code fragment catches an IOException , which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException .
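
The fragment above never closes the stream it opens. In a longer-running program you would normally do so in a finally block. The following sketch shows one way to arrange that. So that it can run without a network connection, it writes a temporary file and reads it back through a file URL; the class name OpenStreamDemo, its helper methods, and the temporary file are all assumptions made for this illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class OpenStreamDemo {

  // Reads every byte available from the URL and returns the bytes as
  // a string, one char per byte. The stream is closed even if a read
  // fails part way through.
  static String readAll(URL u) throws IOException {
    InputStream in = u.openStream();
    try {
      StringBuilder sb = new StringBuilder();
      int c;
      while ((c = in.read()) != -1) sb.append((char) c);
      return sb.toString();
    }
    finally {
      in.close();
    }
  }

  // Writes ASCII text to a temporary file, then reads it back through
  // a file URL. Returns null if any I/O step fails.
  static String roundTrip(String text) {
    try {
      File f = File.createTempFile("urldemo", ".txt");
      f.deleteOnExit();
      FileOutputStream out = new FileOutputStream(f);
      try {
        out.write(text.getBytes("US-ASCII"));
      }
      finally {
        out.close();
      }
      return readAll(f.toURI().toURL());
    }
    catch (IOException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("hello, URL"));
  }
}
```

The readAll() method works the same way for an http URL; only the source of the bytes changes.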

Example 7-5 reads a URL from the command line, opens an InputStream from that URL, chains the resulting InputStream to an InputStreamReader using the default encoding, and then uses InputStreamReader 's read( ) method to read successive characters from the file, each of which is printed on System.out . That is, it prints the raw data located at the URL: if the URL references an HTML file, the program's output is raw HTML.

Example 7-5. Download a web page

 import java.net.*;
 import java.io.*;

 public class SourceViewer {

   public static void main (String[] args) {

     if (args.length > 0) {
       try {
         // Open the URL for reading
         URL u = new URL(args[0]);
         InputStream in = u.openStream( );
         // buffer the input to increase performance
         in = new BufferedInputStream(in);
         // chain the InputStream to a Reader
         Reader r = new InputStreamReader(in);
         int c;
         while ((c = r.read( )) != -1) {
           System.out.print((char) c);
         }
       }
       catch (MalformedURLException ex) {
         System.err.println(args[0] + " is not a parseable URL");
       }
       catch (IOException ex) {
         System.err.println(ex);
       }
     } // end if
   } // end main
 } // end SourceViewer

And here are the first few lines of output when SourceViewer downloads http://www.oreilly.com:

 % java SourceViewer http://www.oreilly.com
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
 <head>
 <title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books,
 software conferences, online publishing</title>
 <meta name="keywords" content="O'Reilly, oreilly, computer books, technical
 books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows
 NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books,
 books online, computer book online, e-books, ebooks, Perl Conference, Open
 Source Conference, Java Conference, open source, free software, XML, Mac OS X,
 .Net, dot net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP,
 bioinformatics, web services, p2p" />
 <meta name="description" content="O'Reilly is a leader in technical and computer book
 documentation, online content, and conferences for UNIX, Perl, Java, Linux, Internet,
 Mac OS X, C, C++, Windows, Windows NT, Security, Sys Admin, System Administration,
 Oracle, Design and Graphics, Online Books, e-books, ebooks, Perl Conference, Java
 Conference, P2P Conference" />

There are quite a few more lines in that web page; if you want to see them, you can fire up your web browser.

The shakiest part of this program is that it blithely assumes that the remote URL is text, which is not necessarily true. It could well be a GIF or JPEG image, an MP3 sound file, or something else entirely. Even if it is text, the document encoding may not be the same as the default encoding of the client system. The remote host and local client may not have the same default character set. As a general rule, for pages that use a character set radically different from ASCII, the HTML will include a META tag in the header specifying the character set in use. For instance, this META tag specifies the Big-5 encoding for Chinese:

 <meta http-equiv="Content-Type" content="text/html; charset=big5">

An XML document will likely have an XML declaration instead:

 <?xml version="1.0" encoding="Big5"?>

In practice, there's no easy way to get at this information other than by parsing the file and looking for a header like this one, and even that approach is limited. Many HTML files hand-coded in Latin alphabets don't have such a META tag. Since Windows, the Mac, and most Unixes have somewhat different interpretations of the characters from 128 to 255, the extended characters in these documents do not translate correctly on platforms other than the one on which they were created.

And as if this isn't confusing enough, the HTTP header that precedes the actual document is likely to have its own encoding information, which may completely contradict what the document itself says. You can't read this header using the URL class, but you can with the URLConnection object returned by the openConnection( ) method. Encoding detection and declaration is one of the thornier parts of the architecture of the Web.
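
As a rough illustration of reading the encoding from the HTTP header, the sketch below pulls the charset parameter out of a Content-type value and uses it to build the Reader. The class name CharsetSniffer, the helper charsetOf( ), and the choice of ISO-8859-1 as the fallback are all assumptions made for this example, and it does not attempt to reconcile the header with a META tag or XML declaration inside the document.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetSniffer {

  // Extracts the charset parameter from a Content-type value such as
  // "text/html; charset=big5"; returns the fallback if none is usable.
  public static Charset charsetOf(String contentType, Charset fallback) {
    if (contentType == null) return fallback;
    for (String part : contentType.split(";")) {
      String p = part.trim();
      if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
        String name = p.substring("charset=".length()).trim();
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
          name = name.substring(1, name.length() - 1); // strip optional quotes
        }
        try {
          return Charset.forName(name);
        } catch (IllegalArgumentException ex) {
          return fallback; // malformed or unsupported charset name
        }
      }
    }
    return fallback;
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("Usage: java CharsetSniffer <url>");
      return;
    }
    URLConnection uc = new URL(args[0]).openConnection();
    // ISO-8859-1 as the fallback is an assumption made for this sketch
    Charset cs = charsetOf(uc.getContentType(), StandardCharsets.ISO_8859_1);
    try (Reader r = new InputStreamReader(uc.getInputStream(), cs)) {
      int c;
      while ((c = r.read()) != -1) System.out.print((char) c);
    }
  }
}
```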

7.1.3.2 public URLConnection openConnection( ) throws IOException

The openConnection( ) method returns a URLConnection object that represents a connection to the resource the URL points to. The network connection itself is not necessarily established at this point; it is made when you call connect( ) or a method that needs it, such as getInputStream( ). If the call fails, openConnection( ) throws an IOException. For example:

 try {
   URL u = new URL("http://www.jennicam.org/");
   try {
     URLConnection uc = u.openConnection( );
     InputStream in = uc.getInputStream( );
     // read from the connection...
   } // end try
   catch (IOException ex) {
     System.err.println(ex);
   }
 } // end try
 catch (MalformedURLException ex) {
   System.err.println(ex);
 }

Use this method when you want to communicate directly with the server. The URLConnection gives you access to everything sent by the server: in addition to the document itself in its raw form (e.g., HTML, plain text, binary image data), you can access all the metadata specified by the protocol. For example, if the scheme is HTTP, the URLConnection lets you access the HTTP headers as well as the raw HTML. The URLConnection class also lets you write data to as well as read from a URL, for instance, in order to send email to a mailto URL or post form data. The URLConnection class will be the primary subject of Chapter 15.
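
To make the idea of protocol metadata concrete, here is a small sketch that asks a URLConnection for the content type and length it reports. The class name MetadataViewer and the method summarize( ) are hypothetical, and getContentLengthLong( ) assumes Java 7 or later. What comes back depends on the protocol handler and the server; either value may be missing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class MetadataViewer {

  // Summarizes what the connection reports about the resource.
  public static String summarize(URL u) throws IOException {
    URLConnection uc = u.openConnection();
    // getInputStream() makes the connection; closing the stream releases it
    try (InputStream in = uc.getInputStream()) {
      return "type=" + uc.getContentType()
          + " length=" + uc.getContentLengthLong();
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("Usage: java MetadataViewer <url>");
      return;
    }
    System.out.println(summarize(new URL(args[0])));
  }
}
```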

Java 1.5 adds one overloaded variant of this method that specifies the proxy server to pass the connection through:

 public URLConnection openConnection(Proxy proxy) throws IOException

This overrides any proxy server set with the usual socksProxyHost , socksProxyPort , http.proxyHost , http.proxyPort , http.nonProxyHosts , and similar system properties. If the protocol handler does not support proxies, the argument is ignored and the connection is made directly if possible.
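
A brief sketch of how the Proxy argument might be built follows. The class name ProxiedConnection, its two helper methods, and the proxy host proxy.example.com with port 8080 are placeholders invented for this example. Proxy.NO_PROXY is the standard constant for requesting a direct connection.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

class ProxiedConnection {

  // Routes the connection through the given HTTP proxy.
  public static URLConnection viaHttpProxy(URL u, String proxyHost, int proxyPort)
      throws IOException {
    // createUnresolved avoids a DNS lookup of the proxy at construction time
    InetSocketAddress address = InetSocketAddress.createUnresolved(proxyHost, proxyPort);
    Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
    return u.openConnection(proxy);
  }

  // Requests a direct connection, bypassing any proxy system properties.
  public static URLConnection direct(URL u) throws IOException {
    return u.openConnection(Proxy.NO_PROXY);
  }

  public static void main(String[] args) throws IOException {
    URL u = new URL("http://www.example.com/");
    // proxy.example.com:8080 is a placeholder, not a real proxy
    URLConnection uc = viaHttpProxy(u, "proxy.example.com", 8080);
    // Nothing has been sent over the network yet; this only prints the
    // implementation class chosen by the protocol handler.
    System.out.println(uc.getClass().getName());
  }
}
```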

7.1.3.3 public final Object getContent( ) throws IOException

The getContent( ) method is the third way to download data referenced by a URL. The getContent( ) method retrieves the data referenced by the URL and tries to make it into some type of object. If the URL refers to some kind of text object such as an ASCII or HTML file, the object returned is usually some sort of InputStream. If the URL refers to an image such as a GIF or a JPEG file, getContent( ) usually returns a java.awt.ImageProducer (more specifically, an instance of a class that implements the ImageProducer interface). What unifies these two disparate classes is that they are not the thing itself but a means by which a program can construct the thing:

 try {
   URL u = new URL("http://mesola.obspm.fr/");
   Object o = u.getContent( );
   // cast the Object to the appropriate type
   // work with the Object...
 }
 catch (Exception ex) {
   System.err.println(ex);
 }

getContent( ) operates by looking at the Content-type field in the MIME header of the data it gets from the server. If the server does not use MIME headers or sends an unfamiliar Content-type , getContent( ) returns some sort of InputStream with which the data can be read. An IOException is thrown if the object can't be retrieved. Example 7-6 demonstrates this.

Example 7-6. Download an object

 import java.net.*;
 import java.io.*;

 public class ContentGetter {

   public static void main (String[] args) {

     if (args.length > 0) {
       // Open the URL for reading
       try {
         URL u = new URL(args[0]);
         try {
           Object o = u.getContent( );
           System.out.println("I got a " + o.getClass( ).getName( ));
         } // end try
         catch (IOException ex) {
           System.err.println(ex);
         }
       } // end try
       catch (MalformedURLException ex) {
         System.err.println(args[0] + " is not a parseable URL");
       }
     } // end if
   } // end main
 } // end ContentGetter

Here's the result of trying to get the content of http://www.oreilly.com:

 % java ContentGetter http://www.oreilly.com/
 I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

The exact class may vary from one version of Java to the next (in earlier versions, it's been java.io.PushbackInputStream or sun.net.www.http.KeepAliveStream ) but it should be some form of InputStream .

Here's what you get when you try to load a header image from that page:

 % java ContentGetter http://www.oreilly.com/graphics_new/animation.gif
 I got a sun.awt.image.URLImageSource

Here's what happens when you try to load a Java applet using getContent( ) :

 % java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class
 I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

Here's what happens when you try to load an audio file using getContent( ) :

 % java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au
 I got a sun.applet.AppletAudioClip

The last result is the most unusual because it is as close as the Java core API gets to a class that represents a sound file. It's not just an interface through which you can load the sound data.

This example demonstrates the biggest problem with using getContent( ): it's hard to predict what kind of object you'll get. You could get some kind of InputStream or an ImageProducer or perhaps an AudioClip. Fortunately, it's easy to check using the instanceof operator. This information should be enough to let you read a text file or display an image.
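
The instanceof check might look something like the following sketch. The class name ContentInspector and the method describe( ) are hypothetical, and the sketch distinguishes only streams and images; any other type, including audio, falls through to the last branch.

```java
import java.awt.image.ImageProducer;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class ContentInspector {

  // Classifies the object returned by getContent() by its broad type.
  public static String describe(Object content) {
    if (content == null) return "nothing";
    if (content instanceof InputStream) return "stream";
    if (content instanceof ImageProducer) return "image";
    return "other: " + content.getClass().getName();
  }

  public static void main(String[] args) {
    if (args.length == 0) {
      System.err.println("Usage: java ContentInspector <url>");
      return;
    }
    try {
      Object o = new URL(args[0]).getContent();
      System.out.println(describe(o));
      if (o instanceof InputStream) {
        ((InputStream) o).close(); // release the connection
      }
    } catch (IOException ex) { // also covers MalformedURLException
      System.err.println(ex);
    }
  }
}
```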

7.1.3.4 public final Object getContent(Class[] classes) throws IOException // Java 1.3

Starting in Java 1.3, it is possible for a content handler to provide different views of an object. This overloaded variant of the getContent( ) method lets you choose what class you'd like the content to be returned as. The method attempts to return the URL's content as an instance of the first class in the array that the content handler supports, so list the classes in order of preference. For instance, if you prefer an HTML file to be returned as a String, but your second choice is a Reader and your third choice is an InputStream, write:

 URL u = new URL("http://www.nwu.org");
 Class[] types = new Class[3];
 types[0] = String.class;
 types[1] = Reader.class;
 types[2] = InputStream.class;
 Object o = u.getContent(types);

You then have to test for the type of the returned object using instanceof . For example:

 if (o instanceof String) {
   System.out.println(o);
 }
 else if (o instanceof Reader) {
   int c;
   Reader r = (Reader) o;
   while ((c = r.read( )) != -1) System.out.print((char) c);
 }
 else if (o instanceof InputStream) {
   int c;
   InputStream in = (InputStream) o;
   while ((c = in.read( )) != -1) System.out.write(c);
 }
 else {
   System.out.println("Error: unexpected type " + o.getClass( ));
 }

7.1.4 Utility Methods

The URL class contains a couple of utility methods that perform common operations on URLs. The sameFile( ) method determines whether two URLs point to the same document. The toExternalForm( ) method converts a URL object to a string that can be used in an HTML link or a web browser's Open URL dialog.

7.1.4.1 public boolean sameFile(URL other)

The sameFile( ) method tests whether two URL objects point to the same file. If they do, sameFile( ) returns true; otherwise, it returns false. The test that sameFile( ) performs is quite shallow; for the most part, all it does is compare the corresponding fields for equality. The one exception is the host: it detects whether the two hostnames are really just aliases for each other. For instance, it can tell that http://www.ibiblio.org/ and http://metalab.unc.edu/ are the same file. However, it cannot tell that http://www.ibiblio.org:80/ and http://metalab.unc.edu/ are the same file or that http://www.cafeconleche.org/ and http://www.cafeconleche.org/index.html are the same file. sameFile( ) is smart enough to ignore the fragment identifier part of a URL, however. Here's a fragment of code that uses sameFile( ) to compare two URLs:

 try {
   URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");
   URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");
   if (u1.sameFile(u2)) {
     System.out.println(u1 + " is the same file as \n" + u2);
   }
   else {
     System.out.println(u1 + " is not the same file as \n" + u2);
   }
 }
 catch (MalformedURLException ex) {
   System.err.println(ex);
 }

The output is:

 http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same file as
 http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD

The sameFile( ) method is similar to the equals( ) method of the URL class. The main difference between sameFile( ) and equals( ) is that equals( ) considers the fragment identifier (if any), whereas sameFile( ) does not. The two URLs shown here do not compare equal although they are the same file. Also, any object may be passed to equals( ) ; only URL objects can be passed to sameFile( ) .
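
The difference can be seen side by side in the short sketch below. The class name FragmentComparison and the method compare( ) are hypothetical. The URLs use the literal loopback address 127.0.0.1 rather than a hostname so that the comparison does not have to perform a DNS lookup; that substitution is an assumption made to keep the example self-contained.

```java
import java.net.MalformedURLException;
import java.net.URL;

class FragmentComparison {

  // Reports the results of both comparisons for the same pair of URLs.
  public static String compare(URL a, URL b) {
    return "sameFile=" + a.sameFile(b) + " equals=" + a.equals(b);
  }

  public static void main(String[] args) throws MalformedURLException {
    // Same host and path; only the fragment identifiers differ
    URL u1 = new URL("http://127.0.0.1/HTMLPrimer.html#GS");
    URL u2 = new URL("http://127.0.0.1/HTMLPrimer.html#HD");
    System.out.println(compare(u1, u2)); // prints "sameFile=true equals=false"
  }
}
```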

7.1.4.2 public String toExternalForm( )

The toExternalForm( ) method returns a human-readable String representing the URL. It is identical to the toString( ) method. In fact, all the toString( ) method does is return toExternalForm( ) . Therefore, this method is currently redundant and rarely used.

7.1.4.3 public URI toURI( ) throws URISyntaxException // Java 1.5

Java 1.5 adds a toURI( ) method that converts a URL object to an equivalent URI object. We'll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. In Java 1.4 and later, the URL class should be used primarily for the actual downloading of content from the remote server.
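
As a small illustration of handing work off to URI, the sketch below converts a URL with toURI( ) and then uses URI's resolve( ) method to absolutize a relative reference. The class name UriConversion, the method resolveAgainst( ), and the www.example.com addresses are placeholders chosen for this example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriConversion {

  // Converts the base URL to a URI and resolves a relative reference against it.
  public static String resolveAgainst(URL base, String relative)
      throws URISyntaxException {
    URI baseUri = base.toURI();
    return baseUri.resolve(relative).toString();
  }

  public static void main(String[] args)
      throws MalformedURLException, URISyntaxException {
    URL base = new URL("http://www.example.com/a/b.html");
    System.out.println(resolveAgainst(base, "../c.html"));
    // prints "http://www.example.com/c.html"

    // The URL class accepts strings that are not legal URIs; a raw space
    // in the path makes toURI() throw.
    try {
      new URL("http://www.example.com/a b.html").toURI();
    } catch (URISyntaxException ex) {
      System.out.println("Not a valid URI: " + ex.getInput());
    }
  }
}
```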

7.1.5 The Object Methods

URL inherits from java.lang.Object , so it has access to all the methods of the Object class. It overrides three to provide more specialized behavior: equals( ) , hashCode( ) , and toString( ) .

7.1.5.1 public String toString( )

Like all good classes, java.net.URL has a toString( ) method. Example 7-1 through Example 7-5 implicitly called this method when URLs were passed to System.out.println( ). As those examples demonstrated, the String produced by toString( ) is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html.

It's uncommon to call toString( ) explicitly. Print statements call toString( ) implicitly. Outside of print statements, it's more proper to use toExternalForm( ) instead. If you do call toString( ) , the syntax is simple:

 URL codeBase = this.getCodeBase( );
 String appletURL = codeBase.toString( );

7.1.5.2 public boolean equals(Object o)

An object is equal to a URL only if it is also a URL, both URLs point to the same file as determined by the sameFile( ) method, and both URLs have the same fragment identifier (or both URLs don't have fragment identifiers). Since equals( ) depends on sameFile( ), equals( ) has the same limitations as sameFile( ). For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html, and http://www.oreilly.com:80/ is not equal to http://www.oreilly.com/. Whether this makes sense depends on whether you think of a URL as a string or as a reference to a particular Internet resource.

Example 7-7 creates URL objects for http://www.ibiblio.org/ and http://metalab.unc.edu/ and tells you if they're the same using the equals() method.

Example 7-7. Are http://www.ibiblio.org and http://metalab.unc.edu the same?

 import java.net.*;

 public class URLEquality {

   public static void main (String[] args) {

     try {
       URL ibiblio = new URL("http://www.ibiblio.org/");
       URL metalab = new URL("http://metalab.unc.edu/");
       if (ibiblio.equals(metalab)) {
         System.out.println(ibiblio + " is the same as " + metalab);
       }
       else {
         System.out.println(ibiblio + " is not the same as " + metalab);
       }
     }
     catch (MalformedURLException ex) {
       System.err.println(ex);
     }
   }
 }

When you run this program, you discover:

 % java URLEquality
 http://www.ibiblio.org/ is the same as http://metalab.unc.edu/

7.1.5.3 public int hashCode( )

The hashCode( ) method returns an int that is used when URL objects are used as keys in hash tables. Thus, it is called by the various methods of java.util.Hashtable . You rarely need to call this method directly, if ever. Hash codes for two different URL objects are unlikely to be the same, but it is certainly possible; there are far more conceivable URLs than there are four-byte integers.
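
A sketch of a URL used as a key follows, with java.util.HashMap standing in for Hashtable. The class name UrlKeys and the method lookup( ) are hypothetical. As in any hash table, the lookup works because a second URL object that equals the first also produces the same hash code. The literal address 127.0.0.1 is used instead of a hostname because hashing a URL may involve resolving its host.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

class UrlKeys {

  // Looks up a value using a freshly constructed URL as the key.
  public static String lookup(Map<URL, String> titles, String spec)
      throws MalformedURLException {
    return titles.get(new URL(spec));
  }

  public static void main(String[] args) throws MalformedURLException {
    Map<URL, String> titles = new HashMap<>();
    titles.put(new URL("http://127.0.0.1/index.html"), "Home page");

    // A different URL object with the same fields finds the same entry
    System.out.println(lookup(titles, "http://127.0.0.1/index.html"));
    // prints "Home page"

    // A URL with a different path is a different key
    System.out.println(lookup(titles, "http://127.0.0.1/other.html"));
    // prints "null"
  }
}
```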

7.1.6 Methods for Protocol Handlers

The last method in the URL class I'll just mention briefly here for the sake of completeness: setURLStreamHandlerFactory( ) . It's primarily used by protocol handlers that are responsible for new schemes, not by programmers who just want to retrieve data from a URL. We'll discuss it in more detail in Chapter 16.

7.1.6.1 public static synchronized void setURLStreamHandlerFactory(URLStreamHandlerFactory factory)

This method sets the URLStreamHandlerFactory for the application and throws a generic Error if the factory has already been set. A URLStreamHandler is responsible for parsing the URL and then constructing the appropriate URLConnection object to handle the connection to the server. Most of the time this happens behind the scenes.
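
To give a feel for what happens behind the scenes, here is a deliberately tiny sketch of a factory for a made-up echo scheme, in which the "content" of echo:hello is simply the text hello. Everything named here (EchoProtocol, EchoHandler, install( ), read( ), and the echo scheme itself) is invented for illustration and is not part of Java. The install( ) method guards against a second call because the factory can be set only once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.nio.charset.StandardCharsets;

class EchoProtocol {

  // Handler for the made-up "echo" scheme: it builds the URLConnection.
  static class EchoHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        @Override
        public void connect() {
          connected = true; // there is no server to connect to
        }

        @Override
        public InputStream getInputStream() {
          // The part after "echo:" is served back as the content
          byte[] data = getURL().getPath().getBytes(StandardCharsets.UTF_8);
          return new ByteArrayInputStream(data);
        }
      };
    }
  }

  private static boolean installed = false;

  // Installs the factory at most once; a second call to
  // URL.setURLStreamHandlerFactory() would throw an Error.
  public static synchronized void install() {
    if (installed) return;
    URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory() {
      @Override
      public URLStreamHandler createURLStreamHandler(String protocol) {
        // Returning null lets Java fall back to its built-in handlers
        return "echo".equals(protocol) ? new EchoHandler() : null;
      }
    });
    installed = true;
  }

  // Reads the content of a URL as UTF-8 text.
  public static String read(String spec) throws IOException {
    install();
    try (InputStream in = new URL(spec).openStream()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      int c;
      while ((c = in.read()) != -1) out.write(c);
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(read("echo:hello")); // prints "hello"
  }
}
```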