7.3 The URI Class | Java Network Programming, Third Edition

A URI is an abstraction of a URL that includes not only Uniform Resource Locators but also Uniform Resource Names (URNs). Most URIs used in practice are URLs, but most specifications and standards such as XML are defined in terms of URIs. In Java 1.4 and later, URIs are represented by the java.net.URI class. This class differs from the java.net.URL class in three important ways:

The URI class is purely about identification of resources and parsing of URIs. It provides no methods to retrieve a representation of the resource identified by its URI.
The URI class is more conformant to the relevant specifications than the URL class.
A URI object can represent a relative URI. The URL class absolutizes all URIs before storing them.

In brief, a URL object is a representation of an application layer protocol for network retrieval, whereas a URI object is purely for string parsing and manipulation. The URI class has no network retrieval capabilities. The URL class has some string parsing methods, such as getFile( ) and getRef( ), but many of these are broken and don't always behave exactly as the relevant specifications say they should. Assuming you're using Java 1.4 or later and therefore have a choice, use the URL class when you want to download the content of a URL and the URI class when you want to use the URI for identification rather than retrieval, for instance, to represent an XML namespace URI. When you need to do both, you can convert from a URI to a URL with the toURL( ) method, and in Java 1.5 and later you can also convert from a URL to a URI using the toURI( ) method of the URL class.
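The conversion in both directions can be sketched as follows. The class name UriUrlConversion, the helper roundTrip( ), and the sample addresses are illustrative assumptions; only toURL( ) and toURI( ) come from the text above. The multi-catch syntax needs Java 7 or later.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriUrlConversion {

    // Converts a URI to a URL and back again.
    // Returns null if either conversion fails.
    static URI roundTrip(String text) {
        try {
            URI uri = new URI(text);
            URL url = uri.toURL();  // fails for relative URIs and unknown schemes
            return url.toURI();     // available in Java 1.5 and later
        } catch (URISyntaxException | MalformedURLException
                 | IllegalArgumentException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("http://www.example.com/index.html"));
        // A relative URI cannot become a URL, so this prints null
        System.out.println(roundTrip("images/logo.png"));
    }
}
```

No network connection is made here; building a URL object only requires that Java has a protocol handler for the scheme.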

7.3.1 Constructing a URI

URIs are built from strings. Unlike the URL class, the URI class does not depend on an underlying protocol handler. As long as the URI is syntactically correct, Java does not need to understand its protocol in order to create a representative URI object. Thus, unlike the URL class, the URI class can be used for new and experimental URI schemes.

7.3.1.1 public URI(String uri) throws URISyntaxException

This is the basic constructor that creates a new URI object from any convenient string. For example,

 URI voice = new URI("tel:+1-800-9988-9938");
 URI web   = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");
 URI book  = new URI("urn:isbn:1-565-92870-9");

If the string argument does not follow URI syntax rules (for example, if the URI begins with a colon), this constructor throws a URISyntaxException. This is a checked exception, so you need to either catch it or declare that the method where the constructor is invoked can throw it. However, one syntactic rule is not checked. In contradiction to the URI specification, the characters used in the URI are not limited to ASCII. They can include other Unicode characters, such as accented letters. Syntactically, there are very few restrictions on URIs, especially once the need to encode non-ASCII characters is removed and relative URIs are allowed. Almost any string can be interpreted as a URI.
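A minimal sketch of handling the checked exception follows. The class UriSyntaxCheck and the helper isParsable( ) are hypothetical names chosen for illustration, as are the sample strings.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriSyntaxCheck {

    // Returns true if the single-argument constructor accepts the string.
    static boolean isParsable(String text) {
        try {
            new URI(text);
            return true;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsable("tel:+1-800-9988-9938"));  // true
        System.out.println(isParsable(":foo"));                  // false: no scheme before the colon
        System.out.println(isParsable("caf\u00e9/menu.html"));   // true: non-ASCII letters are accepted
        System.out.println(isParsable("two words"));             // false: a raw space is illegal
    }
}
```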

7.3.1.2 public URI(String scheme, String schemeSpecificPart, String fragment) throws URISyntaxException

This constructor is mostly used for nonhierarchical URIs. The scheme is the URI's protocol, such as http, urn, tel, and so forth. It must be composed exclusively of ASCII letters and digits and the three punctuation characters +, -, and . (plus sign, hyphen, and period). It must begin with a letter. Passing null for this argument omits the scheme, thus creating a relative URI. For example:

 URI absolute = new URI("http", "//www.ibiblio.org", null);
 URI relative = new URI(null, "/javafaq/index.shtml", "today");

The scheme-specific part depends on the syntax of the URI scheme; it's one thing for an http URL, another for a mailto URL, and something else again for a tel URI. Because the URI class encodes illegal characters with percent escapes, there's effectively no syntax error you can make in this part.

Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.
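The automatic escaping can be seen in a short sketch. The class name ThreePartUri, the helper build( ), and the urn:example values are made up for illustration; the helper wraps the checked exception so the example stays compact.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ThreePartUri {

    // Builds a URI from scheme, scheme-specific part, and fragment.
    static URI build(String scheme, String ssp, String fragment) {
        try {
            return new URI(scheme, ssp, fragment);
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        // The spaces are illegal in a URI, so the constructor escapes them
        URI u = build("urn", "example:two words", "part 1");
        System.out.println(u);                             // urn:example:two%20words#part%201
        System.out.println(u.getRawSchemeSpecificPart());  // example:two%20words
        System.out.println(u.getSchemeSpecificPart());     // example:two words
        System.out.println(u.getFragment());               // part 1
    }
}
```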

7.3.1.3 public URI(String scheme, String host, String path , String fragment) throws URISyntaxException

This constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together (separated by a /) form the scheme-specific part for this URI. For example:

 URI today = new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");

produces the URI http://www.ibiblio.org/javafaq/index.html#today.

If the constructor cannot form a legal hierarchical URI from the supplied pieces (for instance, if there is a scheme so the URI has to be absolute but the path doesn't start with /), then it throws a URISyntaxException.
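As a rough illustration of both outcomes, the following sketch builds one legal URI and one with a relative path. The class FourPartUri and the helper build( ) are hypothetical; the exact wording of the exception's reason is left unspecified because it comes from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

class FourPartUri {

    // Returns the assembled URI as a string, or a note that it was rejected.
    static String build(String scheme, String host, String path, String fragment) {
        try {
            return new URI(scheme, host, path, fragment).toString();
        } catch (URISyntaxException ex) {
            return "rejected: " + ex.getReason();
        }
    }

    public static void main(String[] args) {
        // Legal: the path begins with a slash
        System.out.println(build("http", "www.ibiblio.org", "/javafaq/index.html", "today"));
        // Illegal: a scheme is present but the path is relative
        System.out.println(build("http", "www.ibiblio.org", "javafaq/index.html", "today"));
    }
}
```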

7.3.1.4 public URI(String scheme, String authority, String path, String query, String fragment) throws URISyntaxException

This constructor is basically the same as the previous one, with the addition of a query string component. For example:

 URI today = new URI("http", "www.ibiblio.org", "/javafaq/index.html",
                     "referrer=cnet&date=2004-08-23", "today");

As usual, any unescapable syntax errors cause a URISyntaxException to be thrown and null can be passed to omit any of the arguments.

7.3.1.5 public URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment) throws URISyntaxException

This is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example:

 URI styles = new URI("ftp", "anonymous:elharo@metalab.unc.edu",
                      "ftp.oreilly.com", 21, "/pub/stylesheet", null, null);

However, the resulting URI still has to follow all the usual rules for URIs and again, null can be passed for any argument to omit it from the result.
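To see what the five- and seven-argument constructors assemble, a sketch along these lines may help. The class HierarchicalUris and its two helper methods are illustrative names; the arguments are the ones used in the examples above.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HierarchicalUris {

    // Five-argument form: the authority is passed as one string.
    static URI withQuery() {
        try {
            return new URI("http", "www.ibiblio.org", "/javafaq/index.html",
                           "referrer=cnet&date=2004-08-23", "today");
        } catch (URISyntaxException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Seven-argument form: user info, host, and port are passed separately.
    static URI withUserInfo() {
        try {
            return new URI("ftp", "anonymous:elharo@metalab.unc.edu",
                           "ftp.oreilly.com", 21, "/pub/stylesheet", null, null);
        } catch (URISyntaxException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(withQuery());
        // The @ inside the user info is escaped as %40 in the string form
        System.out.println(withUserInfo());
        // The decoded getter gives back the original user info
        System.out.println(withUserInfo().getUserInfo());
    }
}
```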

7.3.1.6 public static URI create(String uri)

This is not a constructor, but rather a static factory method. Unlike the constructors, it does not throw a URISyntaxException . If you're sure your URIs are legal and do not violate any of the rules, you can use this method. For example, this invocation creates a URI for anonymous FTP access using an email address as password:

 URI styles = URI.create(
   "ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet");

If the URI does prove to be malformed, this method throws an IllegalArgumentException. This is a runtime exception, so you don't have to explicitly declare it or catch it.
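When the string is not fully trusted, the runtime exception can still be caught. This is a minimal sketch; the class UriFactory and the helper parseOrNull( ) are names invented for the example.

```java
import java.net.URI;

class UriFactory {

    // Returns the parsed URI, or null if the string is malformed.
    static URI parseOrNull(String text) {
        try {
            return URI.create(text);
        } catch (IllegalArgumentException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        URI styles = parseOrNull(
            "ftp://anonymous:elharo%40metalab.unc.edu@ftp.oreilly.com:21/pub/stylesheet");
        System.out.println(styles.getHost());      // ftp.oreilly.com
        System.out.println(styles.getPort());      // 21
        System.out.println(parseOrNull(":foo"));   // null
    }
}
```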

7.3.2 The Parts of the URI

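A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. A colon separates the scheme from the scheme-specific part, and a # separates the scheme-specific part from the fragment identifier. The general format is:

   scheme:scheme-specific-part#fragment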

If the scheme is omitted, the URI reference is relative. If the fragment identifier is omitted, the URI reference is a pure URI. The URI class has getter methods that return these three parts of each URI object. The getRawFoo( ) methods return the encoded forms of the parts of the URI, while the equivalent getFoo( ) methods first decode any percent-escaped characters and then return the decoded part:

 public String getScheme( )
 public String getSchemeSpecificPart( )
 public String getRawSchemeSpecificPart( )
 public String getFragment( )
 public String getRawFragment( )

There's no getRawScheme( ) method because the URI specification requires that all scheme names be composed exclusively of URI-legal ASCII characters and does not allow percent escapes in scheme names.

These methods all return null if the particular URI object does not have the relevant component: for example, a relative URI without a scheme or an http URI without a fragment identifier.

A URI that has a scheme is an absolute URI. A URI without a scheme is relative . The isAbsolute() method returns true if the URI is absolute, false if it's relative:

 public boolean isAbsolute( )

The details of the scheme-specific part vary depending on the type of the scheme. For example, in a tel URI, the scheme-specific part has the syntax of a telephone number. However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a particular hierarchical format divided into an authority, a path, and a query string. The authority is further divided into user info, host, and port. The isOpaque( ) method returns false if the URI is hierarchical, true if it's not hierarchical; that is, if it's opaque:

 public boolean isOpaque( )
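The two tests combine naturally. The following sketch classifies a few URIs; the class UriKinds and the helper describe( ) are hypothetical names, and the sample strings are taken from earlier examples.

```java
import java.net.URI;

class UriKinds {

    // Describes a URI as absolute or relative and as opaque or hierarchical.
    static String describe(String text) {
        URI u = URI.create(text);
        String first = u.isAbsolute() ? "absolute" : "relative";
        String second = u.isOpaque() ? "opaque" : "hierarchical";
        return first + " " + second;
    }

    public static void main(String[] args) {
        System.out.println(describe("urn:isbn:1-565-92870-9"));    // absolute opaque
        System.out.println(describe("http://www.example.com/"));   // absolute hierarchical
        System.out.println(describe("/javafaq/index.shtml"));      // relative hierarchical
    }
}
```

A relative URI has no scheme, so it is never reported as opaque.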

If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:

 public String getAuthority( )
 public String getFragment( )
 public String getHost( )
 public String getPath( )
 public int getPort( )
 public String getQuery( )
 public String getUserInfo( )

These methods all return the decoded parts; in other words, percent escapes, such as %3C, are changed into the characters they represent, such as <. If you want the raw, encoded parts of the URI, there are five parallel getRawFoo( ) methods:

 public String getRawAuthority( )
 public String getRawFragment( )
 public String getRawPath( )
 public String getRawQuery( )
 public String getRawUserInfo( )

Remember the URI class differs from the URI specification in that non-ASCII characters are never percent-escaped in the first place, and thus will still be present in the strings returned by the getRawFoo( ) methods unless the strings originally used to construct the URI object were encoded.

There are no getRawPort( ) and getRawHost( ) methods because these components are always guaranteed to be made up of ASCII characters, at least for now. Internationalized domain names are coming, and may require this decision to be rethought in future versions of Java.

In the event that the specific URI does not contain this information (for instance, the URI http://www.example.com has no user info, port, or query string), the relevant methods return null. There are two exceptions. Since getPort( ) is declared to return an int, it can't return null; instead, it returns -1 to indicate an omitted port. And getPath( ) never returns null for a hierarchical URI; when the path is omitted, as in http://www.example.com, it returns the empty string.
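The difference between the raw and decoded getters, and the values returned for missing components, can be checked with a sketch like this one. The class UriComponents, its helper methods, and the sample URIs are assumptions made for the example.

```java
import java.net.URI;

class UriComponents {

    // Returns raw and decoded forms of the path, query, and fragment, in pairs.
    static String[] rawAndDecoded(String text) {
        URI u = URI.create(text);
        return new String[] {
            u.getRawPath(),     u.getPath(),
            u.getRawQuery(),    u.getQuery(),
            u.getRawFragment(), u.getFragment()
        };
    }

    static int portOf(String text) {
        return URI.create(text).getPort();
    }

    static String userInfoOf(String text) {
        return URI.create(text).getUserInfo();
    }

    public static void main(String[] args) {
        for (String part : rawAndDecoded("http://www.example.com/a%20b?q=x%26y#s%3C1")) {
            System.out.println(part);
        }
        System.out.println(portOf("http://www.example.com"));      // -1
        System.out.println(userInfoOf("http://www.example.com"));  // null
    }
}
```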

For various technical reasons that don't have a lot of practical impact, Java can't always initially detect syntax errors in the authority component. The immediate symptom of this failing is normally an inability to return the individual parts of the authority: port, host, and user info. In this event, you can call parseServerAuthority() to force the authority to be reparsed:

 public URI parseServerAuthority( )  throws URISyntaxException

The original URI does not change (URI objects are immutable), but the URI returned will have separate authority parts for user info, host, and port. If the authority cannot be parsed, a URISyntaxException is thrown.
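One case where this is thought to matter is a host name containing an underscore, which java.net.URI does not appear to accept as a server-based host. The sketch below assumes that behavior; the class AuthorityCheck, the helper hostOf( ), and the host name my_host are invented for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class AuthorityCheck {

    // Returns the host after forcing a server-based parse of the authority,
    // or null if the authority cannot be parsed that way.
    static String hostOf(String text) {
        URI u = URI.create(text);
        try {
            return u.parseServerAuthority().getHost();
        } catch (URISyntaxException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.example.com:8080/"));
        // Assumed to fail the server-based parse because of the underscore
        System.out.println(hostOf("http://my_host/"));
        // The authority as a whole is still available
        System.out.println(URI.create("http://my_host/").getAuthority());
    }
}
```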

Example 7-10 uses these methods to split URIs entered on the command line into their component parts. It's similar to Example 7-4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.

Example 7-10. The parts of a URI

 import java.net.*;

 public class URISplitter {

   public static void main(String args[]) {

     for (int i = 0; i < args.length; i++) {
       try {
         URI u = new URI(args[i]);
         System.out.println("The URI is " + u);
         if (u.isOpaque( )) {
           System.out.println("This is an opaque URI.");
           System.out.println("The scheme is " + u.getScheme( ));
           System.out.println("The scheme specific part is "
            + u.getSchemeSpecificPart( ));
           System.out.println("The fragment ID is " + u.getFragment( ));
         }
         else {
           System.out.println("This is a hierarchical URI.");
           System.out.println("The scheme is " + u.getScheme( ));
           try {
             // Force the authority to be parsed as user info, host, and port
             u = u.parseServerAuthority( );
             System.out.println("The host is " + u.getHost( ));
             System.out.println("The user info is " + u.getUserInfo( ));
             System.out.println("The port is " + u.getPort( ));
           }
           catch (URISyntaxException ex) {
             // Must be a registry based authority
             System.out.println("The authority is " + u.getAuthority( ));
           }
           System.out.println("The path is " + u.getPath( ));
           System.out.println("The query string is " + u.getQuery( ));
           System.out.println("The fragment ID is " + u.getFragment( ));
         } // end else
       }  // end try
       catch (URISyntaxException ex) {
         System.err.println(args[i] + " does not seem to be a URI.");
       }
       System.out.println( );
     }  // end for

   }  // end main

 }  // end URISplitter

Here's the result of running this against three of the URI examples in this section:

 % java URISplitter tel:+1-800-9988-9938 \
   http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \
   urn:isbn:1-565-92870-9
 The URI is tel:+1-800-9988-9938
 This is an opaque URI.
 The scheme is tel
 The scheme specific part is +1-800-9988-9938
 The fragment ID is null

 The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc
 This is a hierarchical URI.
 The scheme is http
 The host is www.xml.com
 The user info is null
 The port is -1
 The path is /pub/a/2003/09/17/stax.html
 The query string is null
 The fragment ID is id=_hbc

 The URI is urn:isbn:1-565-92870-9
 This is an opaque URI.
 The scheme is urn
 The scheme specific part is isbn:1-565-92870-9
 The fragment ID is null

7.3.3 Resolving Relative URIs

The URI class has three methods for converting back and forth between relative and absolute URIs.

7.3.3.1 public URI resolve(URI uri)

This method resolves the uri argument against this URI and returns the result as a new URI object. When this URI is absolute, the result is an absolute URI. For example, consider these three lines of code:

 URI absolute = new URI("http://www.example.com/");
 URI relative = new URI("images/logo.png");
 URI resolved = absolute.resolve(relative);

After they've executed, resolved contains the absolute URI http://www.example.com/images/logo.png .

If the invoking URI does not contain an absolute URI itself, the resolve( ) method resolves as much of the URI as it can and returns a new relative URI object as a result. For example, take these three statements:

 URI top = new URI("javafaq/books/");
 URI relative = new URI("jnp3/examples/07/index.html");
 URI resolved = top.resolve(relative);

After they've executed, resolved now contains the relative URI javafaq/books/jnp3/examples/07/index.html with no scheme or authority.
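Both cases, plus a relative URI that climbs a directory with "..", can be tried in one small program. The class ResolveDemo and the helper resolve( ) are illustrative names, and the third pair of URIs is an added assumption rather than an example from the text.

```java
import java.net.URI;

class ResolveDemo {

    // Resolves a relative URI against a base and returns the string form.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(URI.create(relative)).toString();
    }

    public static void main(String[] args) {
        // Absolute base: the result is absolute
        System.out.println(resolve("http://www.example.com/", "images/logo.png"));
        // Relative base: the result is still relative
        System.out.println(resolve("javafaq/books/", "jnp3/examples/07/index.html"));
        // ".." removes the last directory of the base path
        System.out.println(resolve("http://www.example.com/a/b/c.html", "../d.html"));
    }
}
```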

7.3.3.2 public URI resolve(String uri)

This is a convenience method that simply converts the string argument to a URI and then resolves it against the invoking URI, returning a new URI object as the result. That is, it's equivalent to resolve(URI.create(uri)), so a malformed string causes an IllegalArgumentException rather than a URISyntaxException. Using this method, the previous two samples can be rewritten as:

 URI absolute = new URI("http://www.example.com/");
 URI resolved = absolute.resolve("images/logo.png");

 URI top = new URI("javafaq/books/");
 resolved = top.resolve("jnp3/examples/07/index.html");

7.3.3.3 public URI relativize(URI uri)

It's also possible to reverse this procedure; that is, to go from an absolute URI to a relative one. The relativize( ) method creates a new URI object from the uri argument that is relative to the invoking URI . The argument is not changed. For example:

 URI absolute = new URI("http://www.example.com/images/logo.png");
 URI top = new URI("http://www.example.com/");
 URI relative = top.relativize(absolute);

The URI object relative now contains the relative URI images/logo.png .
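A short sketch shows relativize( ) alongside the case where nothing can be relativized. The class RelativizeDemo, the helper relativize( ), and the second host name are assumptions for illustration.

```java
import java.net.URI;

class RelativizeDemo {

    // Relativizes target against base and returns the string form.
    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        System.out.println(relativize("http://www.example.com/",
                                      "http://www.example.com/images/logo.png"));
        // Different authority: the target comes back unchanged
        System.out.println(relativize("http://www.example.com/",
                                      "http://www.example.org/images/logo.png"));
    }
}
```

When the argument does not share the invoking URI's scheme and authority, or its path does not begin with the invoking URI's path, the method returns the argument as given.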

7.3.4 Utility Methods

The URI class has the usual batch of utility methods: equals() , hashCode( ) , toString( ) , and compareTo( ) .

7.3.4.1 public boolean equals(Object o)

URIs are tested for equality pretty much as you'd expect. It's not a direct string comparison. Equal URIs must both either be hierarchical or opaque. The scheme and authority parts are compared without considering case. That is, http and HTTP are the same scheme, and www.example.com is the same authority as www.EXAMPLE.com . The rest of the URI is case-sensitive, except for hexadecimal digits used to escape illegal characters. Escapes are not decoded before comparing. http://www.example.com/A and http://www.example.com/%41 are unequal URIs.
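These rules can be exercised directly. The class EqualityDemo and the helper same( ) are hypothetical, and the URIs are variations on the ones mentioned above.

```java
import java.net.URI;

class EqualityDemo {

    // Compares two URI strings using URI.equals().
    static boolean same(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Scheme and host ignore case
        System.out.println(same("HTTP://www.EXAMPLE.com/", "http://www.example.com/"));
        // The path is case-sensitive
        System.out.println(same("http://www.example.com/A", "http://www.example.com/a"));
        // Hex digits in escapes ignore case
        System.out.println(same("http://www.example.com/%3c", "http://www.example.com/%3C"));
        // Escapes are not decoded before comparing
        System.out.println(same("http://www.example.com/A", "http://www.example.com/%41"));
    }
}
```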

7.3.4.2 public int hashCode( )

The hashCode( ) method is a usual hashCode( ) method, nothing special. Equal URIs do have the same hash code and unequal URIs are fairly unlikely to share the same hash code.

7.3.4.3 public int compareTo(Object o)

URIs can be ordered. The ordering is based on string comparison of the individual parts, in this sequence:

If the schemes are different, the schemes are compared, without considering case.
Otherwise, if the schemes are the same, a hierarchical URI is considered to be less than an opaque URI with the same scheme.
If both URIs are opaque URIs, they're ordered according to their scheme-specific parts.
If both the scheme and the opaque scheme-specific parts are equal, the URIs are compared by their fragments .
If both URIs are hierarchical, they're ordered according to their authority components, which are themselves ordered according to user info, host, and port, in that order.
If the schemes and the authorities are equal, the path is used to distinguish them.
If the paths are also equal, the query strings are compared.
If the query strings are equal, the fragments are compared.

URIs are not comparable to any type except themselves. Comparing a URI to anything except another URI causes a ClassCastException .
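Because URIs can be compared, a list of them can be sorted with the standard collection utilities. This sketch assumes Java 5 or later, where URI implements Comparable<URI>; the class UriSorting, the helper sorted( ), and the file: URIs are illustrative additions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UriSorting {

    // Parses the strings, sorts them by URI.compareTo(), and returns the string forms.
    static List<String> sorted(String... texts) {
        List<URI> uris = new ArrayList<URI>();
        for (String text : texts) {
            uris.add(URI.create(text));
        }
        Collections.sort(uris);
        List<String> result = new ArrayList<String>();
        for (URI u : uris) {
            result.add(u.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Schemes are compared first; within the file scheme the hierarchical
        // URI (file:/notes.txt) sorts before the opaque one (file:notes.txt)
        for (String s : sorted("urn:isbn:1-565-92870-9",
                               "http://www.example.com/b",
                               "file:notes.txt",
                               "http://www.example.com/a",
                               "file:/notes.txt")) {
            System.out.println(s);
        }
    }
}
```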

7.3.4.4 public String toString( )

The toString( ) method returns an unencoded string form of the URI. That is, non-ASCII characters are not percent-escaped unless they were percent-escaped in the strings used to construct this URI. Therefore, the result of calling this method is not guaranteed to be a syntactically correct URI. This form is sometimes useful for display to human beings, but not for retrieval.

7.3.4.5 public String toASCIIString( )

The toASCIIString( ) method returns an encoded string form of the URI. Non-ASCII characters are always percent-escaped whether or not they were originally escaped. This is the string form of the URI you should use most of the time. Even if the form returned by toString( ) is more legible for humans, they may still copy and paste it into areas that are not expecting an illegal URI. toASCIIString( ) always returns a syntactically correct URI.
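The contrast between the two string forms shows up as soon as a URI contains a non-ASCII character. In this sketch the class StringForms, its helpers, and the sample URI are assumptions; the letter e with an acute accent is written as the escape \u00e9 so the source file stays pure ASCII.

```java
import java.net.URI;

class StringForms {

    static String plain(String text) {
        return URI.create(text).toString();
    }

    static String ascii(String text) {
        return URI.create(text).toASCIIString();
    }

    public static void main(String[] args) {
        String text = "http://www.example.com/caf\u00e9";
        // toString() keeps the accented letter as it was given
        System.out.println(plain(text));
        // toASCIIString() escapes its UTF-8 bytes: .../caf%C3%A9
        System.out.println(ascii(text));
    }
}
```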