3.1 URIs | Java Network Programming, Third Edition

A Uniform Resource Identifier (URI) is a string of characters in a particular syntax that identifies a resource. The resource identified may be a file on a server, but it may also be an email address, a news message, a book, a person's name, an Internet host, the current stock price of Sun Microsystems, or something else. An absolute URI is made up of a scheme for the URI and a scheme-specific part, separated by a colon, like this:

   scheme:scheme-specific-part

The syntax of the scheme-specific part depends on the scheme being used. Current schemes include:

data: Base64-encoded data included directly in a link; see RFC 2397
file: A file on a local disk
ftp: An FTP server
http: A World Wide Web server using the Hypertext Transfer Protocol
gopher: A Gopher server
mailto: An email address
news: A Usenet newsgroup
telnet: A connection to a Telnet-based service
urn: A Uniform Resource Name

In addition, Java makes heavy use of nonstandard custom schemes such as rmi, jndi, and doc for various purposes. We'll look at the mechanism behind this in Chapter 16, when we discuss protocol handlers.

There is no specific syntax that applies to the scheme-specific parts of all URIs. However, many have a hierarchical form, like this:

 //authority/path?query

The authority part of the URI names the authority responsible for resolving the rest of the URI. For instance, the URI http://www.ietf.org/rfc/rfc2396.txt has the scheme http and the authority www.ietf.org. This means the server at www.ietf.org is responsible for mapping the path /rfc/rfc2396.txt to a resource. This URI does not have a query part. The URI http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0 has the scheme http, the authority www.powells.com, the path /cgi-bin/biblio, and the query inkey=62-1565928709-0. The URI urn:isbn:156592870 has the scheme urn but doesn't follow the hierarchical //authority/path?query form for scheme-specific parts.

Although most current examples of URIs use an Internet host as an authority, future schemes may not. However, if the authority is an Internet host, optional usernames and ports may also be provided to make the authority more specific. For example, the URI ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3 has the authority mp3:mp3@ci43198-a.ashvil1.nc.home.com:33. This authority has the username mp3, the password mp3, the host ci43198-a.ashvil1.nc.home.com, and the port 33. It has the scheme ftp and the path /VanHalen-Jump.mp3. (In most cases, including the password in the URI is a big security hole unless, as here, you really do want everyone in the universe to know the password.)
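The decomposition described above can be checked with the java.net.URI class. The following is a minimal sketch rather than a definitive recipe; the class name UriPartsDemo and the helper describe() are hypothetical names chosen for illustration, while the getter methods are the standard ones on java.net.URI.

```java
import java.net.URI;

class UriPartsDemo {
    /** Lists the pieces that java.net.URI reports for a URI string. */
    static String describe(String text) {
        URI uri = URI.create(text);
        if (uri.isOpaque()) {
            // Opaque URIs such as urn:... have no //authority/path?query structure.
            return "scheme=" + uri.getScheme()
                 + ", scheme-specific part=" + uri.getSchemeSpecificPart();
        }
        return "scheme=" + uri.getScheme()
             + ", authority=" + uri.getAuthority()
             + ", user info=" + uri.getUserInfo()   // null when absent
             + ", host=" + uri.getHost()
             + ", port=" + uri.getPort()            // -1 when absent
             + ", path=" + uri.getPath()
             + ", query=" + uri.getQuery();         // null when absent
    }

    public static void main(String[] args) {
        System.out.println(describe("http://www.ietf.org/rfc/rfc2396.txt"));
        System.out.println(describe("http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0"));
        System.out.println(describe("ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3"));
        System.out.println(describe("urn:isbn:156592870"));
    }
}
```

Note that java.net.URI reports the username and password together as a single user info string (mp3:mp3 here); splitting them at the colon is left to the caller.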

The path (which includes its initial /) is a string that the authority can use to determine which resource is identified. Different authorities may interpret the same path to refer to different resources. For instance, the path /index.html means one thing when the authority is www.landoverbaptist.org and something very different when the authority is www.churchofsatan.com. The path may be hierarchical, in which case the individual parts are separated by forward slashes, and the . and .. operators are used to navigate the hierarchy. These are derived from the pathname syntax on the Unix operating systems where the Web and URLs were invented. They conveniently map to a filesystem stored on a Unix web server. However, there is no guarantee that the components of any particular path actually correspond to files or directories on any particular filesystem. For example, in the URI http://www.amazon.com/exec/obidos/ISBN%3D1565924851/cafeaulaitA/002-3777605-3043449, all the pieces of the hierarchy are just used to pull information out of a database that's never stored in a filesystem. ISBN%3D1565924851 selects the particular book from the database by its ISBN, cafeaulaitA specifies who gets the referral fee if a purchase is made from this link, and 002-3777605-3043449 is a session key used to track the visitor's path through the site.

Some URIs aren't at all hierarchical, at least in the filesystem sense. For example, snews://secnews.netscape.com/netscape.devs-java has a path of /netscape.devs-java. Although there's some hierarchy to the newsgroup names, indicated by the . between netscape and devs-java, it's not visible as part of the URI.

The scheme part is composed of lowercase letters, digits, and the plus sign, period, and hyphen. The other three parts of a typical URI (authority, path, and query) should each be composed of the ASCII alphanumeric characters; that is, the letters A-Z, a-z, and the digits 0-9. In addition, the punctuation characters - _ . ! ~ * ' may also be used. All other characters, including non-ASCII alphanumerics such as á, should be escaped by a percent sign (%) followed by the hexadecimal code for the character. For instance, á would be encoded as %E1. A URL so transformed is said to have been "x-www-form-urlencoded".

This process assumes that the character set is Latin-1. The URI and URL specifications don't actually say what character set should be used, which means most software tends to use the local default character set. Thus, URLs containing non-ASCII characters aren't very interoperable across different platforms and languages. With the Web becoming more international and less English daily, this situation has become increasingly problematic. Work is ongoing to define Internationalized Resource Identifiers (IRIs) that can use the full range of Unicode. At the time of this writing, the IRI draft specification indicates that non-ASCII characters should be encoded by first converting them to UTF-8, then percent-escaping each byte of the UTF-8, as specified above. For instance, the Greek letter π is Unicode code point 3C0. In UTF-8, this letter is encoded as the two bytes CF, 80. Thus in a URL it would be encoded as %CF%80.
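The dependence on character set can be made concrete with a small hand-written encoder. This is only a sketch of the escaping rule described above, not a library API: the class PercentEncoderDemo and the method encode() are hypothetical, and the code assumes an ASCII-compatible charset such as Latin-1 or UTF-8. It shows that á (U+00E1) becomes %E1 under Latin-1 but %C3%A1 under UTF-8, and that π (U+03C0), which UTF-8 represents as the two bytes CF 80, becomes %CF%80.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncoderDemo {
    // Punctuation that may appear unescaped, in addition to ASCII letters and digits.
    private static final String SAFE_PUNCTUATION = "-_.!~*'";

    /** Percent-escapes every byte of text (in the given charset) that is not a safe character. */
    static String encode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int value = b & 0xFF;              // treat the byte as unsigned
            char c = (char) value;
            boolean asciiAlphanumeric = (c >= 'A' && c <= 'Z')
                                     || (c >= 'a' && c <= 'z')
                                     || (c >= '0' && c <= '9');
            if (asciiAlphanumeric || SAFE_PUNCTUATION.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\u00E1", StandardCharsets.ISO_8859_1)); // %E1
        System.out.println(encode("\u00E1", StandardCharsets.UTF_8));      // %C3%A1
        System.out.println(encode("\u03C0", StandardCharsets.UTF_8));      // %CF%80
    }
}
```

Because this sketch escapes everything outside the safe set, it is suited to encoding a single component such as one path segment or one query value, not a whole URI whose delimiters must survive.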

Punctuation characters such as / and @ must also be encoded with percent escapes if they are used in any role other than what's specified for them in the scheme-specific part of a particular URL. For example, the forward slashes in the URI http://www.cafeaulait.org/books/javaio/ do not need to be encoded as %2F because they serve to delimit the hierarchy as specified for the http URI scheme. However, if a filename includes a / character (for instance, if the last directory were named Java I/O instead of javaio to more closely match the name of the book), the URI would have to be written as http://www.cafeaulait.org/books/Java%20I%2FO/. This is not as farfetched as it might sound to Unix or Windows users. Mac filenames frequently include a forward slash. Filenames on many platforms often contain characters that need to be encoded, including @, $, +, =, and many more.
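The reason the escaped slash matters shows up when the path is split into segments. In the sketch below (the class EscapedSlashDemo and its helper methods are hypothetical names), java.net.URI's getRawPath() returns the path with its escapes intact, while getPath() returns it with escapes decoded. Splitting the raw form gives the two intended segments; splitting the decoded form wrongly gives three.

```java
import java.net.URI;

class EscapedSlashDemo {
    /** The path exactly as written, escapes included. */
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    /** The path with percent escapes decoded. */
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    /** Counts the non-empty pieces between forward slashes. */
    static int segmentCount(String path) {
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String uri = "http://www.cafeaulait.org/books/Java%20I%2FO/";
        System.out.println(rawPath(uri) + " has " + segmentCount(rawPath(uri)) + " segments");
        System.out.println(decodedPath(uri) + " has " + segmentCount(decodedPath(uri)) + " segments");
    }
}
```

The general lesson, under these assumptions, is to split on delimiters first and decode each piece afterward.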

3.1.1 URNs

There are two types of URIs: Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). A URL is a pointer to a particular resource on the Internet at a particular location. For example, http://www.oreilly.com/catalog/javanp3/ is one of several URLs for the book Java Network Programming. A URN is a name for a particular resource but without reference to a particular location. For instance, urn:isbn:1565928709 is a URN referring to the same book. As this example shows, URNs, unlike URLs, are not limited to Internet resources.

The goal of URNs is to handle resources that are mirrored in many different locations or that have moved from one site to another; they identify the resource itself, not the place where the resource lives. For instance, when given a URN for a particular piece of software, an FTP program should get the file from the nearest mirror site. Given a URN for a book, a browser might reserve the book at the local library or order a copy from a bookstore.

A URN has the general form:

   urn:namespace:resource_name

The namespace is the name of a collection of certain kinds of resources maintained by some authority. The resource_name is the name of a resource within that collection. For instance, the URN urn:ISBN:1565924851 identifies a resource in the ISBN namespace with the identifier 1565924851. Of all the books published, this one selects the first edition of Java I/O.

The exact syntax of resource names depends on the namespace. The ISBN namespace expects to see strings composed of 10 or 13 characters, all of which are digits, with the single exception that the last character may be the letter X (either upper- or lowercase) instead. Furthermore, ISBNs may contain hyphens that are ignored when comparing. Other namespaces will use very different syntaxes for resource names. The IANA is responsible for handing out namespaces to different organizations, as described in RFC 3406. Basically, you have to submit an Internet draft to the IETF and publish an announcement on the urn-nid mailing list for public comment and discussion before formal standardization.
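As a rough illustration of the urn:namespace:resource_name form and the ISBN syntax rule just described, the sketch below splits a URN at its first two colons and checks whether a resource name has the shape of an ISBN. The class UrnDemo and both methods are hypothetical; the check covers only the character rules stated above and does not verify the ISBN check digit.

```java
class UrnDemo {
    /** Splits urn:namespace:resource_name into { namespace, resource_name }. */
    static String[] splitUrn(String urn) {
        // Limit of 3 keeps any further colons inside the resource name.
        String[] parts = urn.split(":", 3);
        if (parts.length != 3 || !parts[0].equalsIgnoreCase("urn")) {
            throw new IllegalArgumentException("Not a URN: " + urn);
        }
        return new String[] { parts[1], parts[2] };
    }

    /** True if the name is 10 or 13 digits, ignoring hyphens, with X allowed in last place. */
    static boolean looksLikeIsbn(String resourceName) {
        String compact = resourceName.replace("-", "");
        return compact.matches("\\d{9}[\\dXx]|\\d{12}[\\dXx]");
    }

    public static void main(String[] args) {
        String[] parts = splitUrn("urn:ISBN:1565924851");
        System.out.println("namespace=" + parts[0] + ", resource name=" + parts[1]);
        System.out.println(looksLikeIsbn(parts[1]));        // true
        System.out.println(looksLikeIsbn("1-56592-485-1")); // true: hyphens are ignored
    }
}
```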

3.1.2 URLs

A URL identifies the location of a resource on the Internet. It specifies the protocol used to access a server (e.g., FTP, HTTP), the name of the server, and the location of a file on that server. A typical URL looks like http://www.ibiblio.org/javafaq/javatutorial.html. This specifies that there is a file called javatutorial.html in a directory called javafaq on the server www.ibiblio.org, and that this file can be accessed via the HTTP protocol. The syntax of a URL is:

   protocol://username@hostname:port/path/filename?query#fragment

Here the protocol is another word for what was called the scheme of the URI. (Scheme is the word used in the URI RFC. Protocol is the word used in the Java documentation.) In a URL, the protocol part can be file, ftp, http, https, gopher, news, telnet, wais, or various other strings (though not urn).

The hostname part of a URL is the name of the server that provides the resource you want, such as www.oreilly.com or utopia.poly.edu. It can also be the server's IP address, such as 204.148.40.9 or 128.238.3.21. The username is an optional username for the server. The port number is also optional. It's not necessary if the service is running on its default port (port 80 for HTTP servers).

The path points to a particular directory on the specified server. The path is relative to the document root of the server, not necessarily to the root of the filesystem on the server. As a rule, servers that are open to the public do not show their entire filesystem to clients. Rather, they show only the contents of a specified directory. This directory is called the document root, and all paths and filenames are relative to it. Thus, on a Unix server, all files that are available to the public might be in /var/public/html, but to somebody connecting from a remote machine, this directory looks like the root of the filesystem.

The filename points to a particular file in the directory specified by the path. It is often omitted, in which case it is left to the server's discretion what file, if any, to send. Many servers send an index file for that directory, often called index.html or Welcome.html. Some send a list of the files and folders in the directory, as shown in Figure 3-1. Others may send a 403 Forbidden error message, as shown in Figure 3-2.

Figure 3-1. A web server configured to send a directory list when no index file exists

Figure 3-2. A web server configured to send a 403 error when no index file exists

The query string provides additional arguments for the server. It's commonly used only in http URLs, where it contains form data for input to programs running on the server.

Finally, the fragment references a particular part of the remote resource. If the remote resource is HTML, the fragment identifier names an anchor in the HTML document. If the remote resource is XML, the fragment identifier is an XPointer. Some documents refer to the fragment part of the URL as a "section"; Java documents rather unaccountably refer to the fragment identifier as a "Ref". A named anchor is created in an HTML document with a tag, like this:

 <A NAME="xtocid1902914">Comments</A>

This tag identifies a particular point in a document. To refer to this point, a URL includes not only the document's filename but also the named anchor, separated from the rest of the URL by a #:

 http://www.cafeaulait.org/javafaq.html#xtocid1902914

Technically, a string that contains a fragment identifier is a URL reference, not a URL. Java, however, does not distinguish between URLs and URL references.
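The pieces of the URL syntax map onto getter methods of java.net.URL, including getRef() for the fragment. The following sketch uses the hypothetical class UrlPartsDemo and helper describe(); it builds the URL by way of URI.create(...).toURL(), which is one of several ways to construct a URL object.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlPartsDemo {
    /** Lists the pieces that java.net.URL reports for a URL string. */
    static String describe(String text) {
        URL url;
        try {
            url = URI.create(text).toURL();
        } catch (MalformedURLException ex) {
            // Thrown, for example, when no handler exists for the protocol.
            throw new IllegalArgumentException("Unsupported URL: " + text, ex);
        }
        return "protocol=" + url.getProtocol()
             + ", user info=" + url.getUserInfo()      // null when absent
             + ", host=" + url.getHost()
             + ", port=" + url.getPort()               // -1 when not given explicitly
             + ", default port=" + url.getDefaultPort()
             + ", path=" + url.getPath()
             + ", query=" + url.getQuery()             // null when absent
             + ", ref=" + url.getRef();                // the fragment identifier
    }

    public static void main(String[] args) {
        System.out.println(describe("http://www.cafeaulait.org/javafaq.html#xtocid1902914"));
    }
}
```

An omitted port is reported as -1 by getPort(), while getDefaultPort() supplies the protocol's default, which is 80 for http.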

3.1.3 Relative URLs

A URL tells the web browser a lot about a document: the protocol used to retrieve the document, the name of the host where the document lives, and the path to that document on the host. Most of this information is likely to be the same for other URLs that are referenced in the document. Therefore, rather than requiring each URL to be specified in its entirety, a URL may inherit the protocol, hostname, and path of its parent document (i.e., the document in which it appears). URLs that aren't complete but inherit pieces from their parent are called relative URLs. In contrast, a completely specified URL is called an absolute URL. In a relative URL, any pieces that are missing are assumed to be the same as the corresponding pieces from the URL of the document in which the URL is found. For example, suppose that while browsing http://www.ibiblio.org/javafaq/javatutorial.html you click on this hyperlink:

 <a href="javafaq.html">

The browser cuts javatutorial.html off the end of http://www.ibiblio.org/javafaq/javatutorial.html to get http://www.ibiblio.org/javafaq/. Then it attaches javafaq.html onto the end of http://www.ibiblio.org/javafaq/ to get http://www.ibiblio.org/javafaq/javafaq.html. Finally, it loads that document.

If the relative link begins with a /, then it is relative to the document root instead of relative to the current file. Thus, if you click on the following link while browsing http://www.ibiblio.org/javafaq/javatutorial.html:

 <a href="/boutell/faq/www_faq.html">

the browser would throw away /javafaq/javatutorial.html and attach /boutell/faq/www_faq.html to the end of http://www.ibiblio.org to get http://www.ibiblio.org/boutell/faq/www_faq.html.
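Both resolution rules can be reproduced with the resolve() method of java.net.URI. This is a minimal sketch; the class RelativeUrlDemo and its helper are hypothetical names, and a browser's own resolution logic may differ in edge cases.

```java
import java.net.URI;

class RelativeUrlDemo {
    /** Resolves a relative reference against the URL of the document that contains it. */
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibiblio.org/javafaq/javatutorial.html";

        // Relative to the current file: the last path segment is replaced.
        System.out.println(resolve(base, "javafaq.html"));
        // http://www.ibiblio.org/javafaq/javafaq.html

        // Begins with /, so it is relative to the document root.
        System.out.println(resolve(base, "/boutell/faq/www_faq.html"));
        // http://www.ibiblio.org/boutell/faq/www_faq.html
    }
}
```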

Relative URLs have a number of advantages. First, and least important, they save a little typing. More importantly, relative URLs allow a single document tree to be served by multiple protocols: for instance, both FTP and HTTP. The HTTP might be used for direct surfing, while the FTP could be used for mirroring the site. Most importantly of all, relative URLs allow entire trees of documents to be moved or copied from one site to another without breaking all the internal links.