3.1 URLs

A URL is an address to a digital resource, much like a street address to a house. Both kinds of addresses include specific information; a street address has a house number, and a URL has the resource name. Both kinds of addresses also include high-level information; a street address has the name of the city where the house is located, and a URL has the name of the server. To locate a house based on its street address, postal systems route a letter starting with the least specific information. For example, all letters to Norway get put into the same bag. Once the bag gets to Norway, the next step is to look at the city name, and so on.

Postal addresses are not written down in the order in which they are handled: the country is near the bottom of the address, whereas the house number is near the top. In contrast, programmers and protocol designers think algorithmically. Knowing that addresses are always parsed starting with the most general information, programmers put that first. In a URL, the most general information is the name of the protocol used to get the resource, which is officially called the scheme. Then, in order of increasing specificity, come the server address, a TCP port number, and finally the path of the resource to ask for (see Figure 3-1). This structure holds for most URLs, including HTTP URLs.
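The general-to-specific ordering can be seen by splitting a URL into its parts. The following is a minimal sketch using Python's standard urllib.parse module; the URL is a hypothetical example chosen for illustration and is not taken from Figure 3-1.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "http://www.example.com:80/hr/ergonomics/posture.doc"

parts = urlsplit(url)

# Listed in the order the parts appear in the URL: general to specific.
components = [
    ("scheme", parts.scheme),    # protocol used to get the resource
    ("server", parts.hostname),  # server address
    ("port", parts.port),        # TCP port number
    ("path", parts.path),        # resource to ask the server for
]

for name, value in components:
    print(f"{name:>6}: {value}")
```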

Figure 3-1. URL structure goes from general to specific.

To locate an HTTP resource from an HTTP URL, a client follows roughly these steps:

1. Recognize the protocol. If the client recognizes the protocol, it should know how to parse the rest of the URL. URLs for other protocols might have different structures.
2. Find the server from its name. Domain names are resolved to an IP address through the Domain Name System (DNS) protocol [RFC1035].
3. Open a TCP connection to that IP address on the port given in the URL. If an HTTP URL names no port, the default is port 80.
4. Using this TCP connection, send a request using the specified protocol. Ask for the resource by sending the path part of the URL.
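The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not how any particular browser works: the function names plan_request and fetch are hypothetical, only the http scheme is handled, and the response is returned as raw bytes without being parsed.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80}  # the only scheme this sketch recognizes


def plan_request(url):
    """Parse the URL and return (host, port, request bytes)."""
    parts = urlsplit(url)
    # Recognize the protocol before parsing the rest of the URL.
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unrecognized scheme: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no server name")
    default_port = DEFAULT_PORTS[parts.scheme]
    port = parts.port or default_port
    # The path (and query string, if any) is passed to the server as-is.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    host_header = host if port == default_port else f"{host}:{port}"
    request = (
        f"GET {target} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request


def fetch(url):
    """Resolve the server, connect, send the request, return the raw response."""
    host, port, request = plan_request(url)
    # Find the server from its name (the system resolver normally uses DNS).
    ip_address = socket.gethostbyname(host)
    # Open a TCP connection to that IP address on the requested port.
    with socket.create_connection((ip_address, port), timeout=10) as conn:
        # Send the request over the connection.
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling plan_request alone needs no network access, which makes it easy to inspect the request that fetch would send.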

Domain Names

Domain names are ordered with the most specific piece first, the opposite order from URLs. IP addresses are ordered with the least specific piece first. Either way, address resolution is usually passed off to another module.
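The difference in ordering shows up when a name is split into its dot-separated pieces. A minimal sketch, in which general_to_specific is a hypothetical helper and both addresses are illustrative values:

```python
def general_to_specific(domain_name):
    """Return the labels of a domain name with the most general label first."""
    return domain_name.split(".")[::-1]


# Domain names are written most specific first, so the labels are reversed.
print(general_to_specific("www.example.com"))  # ['com', 'example', 'www']

# A dotted IPv4 address is already written with the most general piece first.
print("192.0.2.10".split("."))                 # ['192', '0', '2', '10']
```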

The server uses the path to identify the resource, and usually the path part is also ordered from least specific to most specific. The path might look like a file system path, as shown in the example in Figure 3-1, with directory names and a file name at the end. However, it might not look like that. The path part of the URL can be opaque or meaningless to the client. The client does not attempt to parse the path but merely provides the entire path to the server to identify the resource. Many e-commerce sites use URLs that are only meaningful to the software running on that server, such as:

http://store.example.com/index.asp?id=CF52&addtocart=true&goBack=Yes
http://www.sharemation.com/_xy-29129_docstore1

WebDAV puts a few more constraints on URL structure than HTTP does, but that won't be covered until we get to WebDAV URLs (Chapter 5, WebDAV Modifications to HTTP).

3.1.1 Relative URLs

In addition to full URLs, HTTP also uses relative or partial URLs. Again, the concept is similar to street addresses. If you and I are standing in Conestogo, Ontario, and I tell you that I used to live at 39 King Street, it's clear from our location that I'm referring to the King Street in Conestogo and not one of the tens of thousands of other King Streets (the same street name appears in every town in the province).

HTTP requests frequently are sent to a relative address. Since TCP is used to establish a connection to the correct server and port, and the protocol is specified anyway, the relative address only has to include the path.

 /hr/ergonomics/posture.doc

It's possible to construct even shorter relative addresses. If the client and server both agree they're talking about resources within /hr/ergonomics/, the relative address could just be posture.doc.
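Resolving a relative address against a known location can be sketched with urljoin from Python's standard urllib.parse module. The base URL here is a hypothetical one; it stands for the context that the client and server already agree on.

```python
from urllib.parse import urljoin

# Hypothetical page the client is currently looking at.
base = "http://www.example.com/hr/ergonomics/index.html"

# A relative address that includes the whole path.
from_path = urljoin(base, "/hr/ergonomics/posture.doc")

# An even shorter relative address, resolved within /hr/ergonomics/.
from_name = urljoin(base, "posture.doc")

print(from_path)  # http://www.example.com/hr/ergonomics/posture.doc
print(from_name)  # http://www.example.com/hr/ergonomics/posture.doc
```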

3.1.2 Links

Links allow Web browsers to find and identify URLs. A Web browser can't construct or make up HTTP URLs on its own; the browser has to find URLs. The most common way of finding a URL is to load another Web page with a link in it.

After the client downloads and displays the page, the user can see if the page contains links that would be useful for further navigation. A link in an HTML page can be a full or relative URL, associated with some text or an image (see Listing 3-1). The user clicks on the linked image or text to request the addressed resource. Linked text and images are organized together with ordinary text and images, all formatted with HTML tags.

A standard link contains no information other than the URL. The resource could be a Web page, an image, a text file, or a data file. Possibly the resource at the URL is unavailable, deleted, or never existed. The URL in the link could be a typo, but there's no way to know up front. The browser can't determine the size, type, or date last modified of the linked resource without querying that address on the server where it's hosted.

Listing 3-1 Simple HTML document with links.

 <html>
 Links: <br>
 <a href="/aboutus.html">About us</a>,
 <a href="/events.html">Upcoming Events</a>,
 <a href="http://other.example.com/index.html">Our Sponsor</a>
 </html>
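To illustrate how a client might find the URLs in such a page, the sketch below collects the href values from Listing 3-1 and resolves the relative ones. It uses Python's standard html.parser and urllib.parse modules. The class name LinkCollector and the base URL (the address the page is assumed to have been loaded from) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the URL of every <a href="..."> link, resolved to a full URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative URLs are resolved against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


page = """<html>
Links: <br>
<a href="/aboutus.html">About us</a>,
<a href="/events.html">Upcoming Events</a>,
<a href="http://other.example.com/index.html">Our Sponsor</a>
</html>"""

# Assumed address of the page containing the links.
collector = LinkCollector("http://www.example.com/index.html")
collector.feed(page)

for link in collector.links:
    print(link)
```

The first two links are relative and resolve to the server the page came from; the third is a full URL and is left unchanged.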

There are some great advantages to links and URLs, compared to the mechanisms in earlier systems such as FTP and Gopher. Thanks to the universality of URLs, networked information repositories can link to other repositories, even repositories run by other organizations on distant machines, without any coordination required. Thanks to links in text, navigation can be organized not just by the way files were stored but by arranging links in any helpful way. Links can be embedded in paragraphs of text, listed as references or footnotes, or organized in hierarchical menus.

3.1.3 URL Escaping

URLs can contain only a limited number of characters from the base US-ASCII set. Any unprintable characters from that set are excluded. Many more printable characters are reserved for special purposes within the Uniform Resource Identifier (URI) or excluded altogether [RFC2396] (see Table 3-1).

Table 3-1. Reserved and Excluded URI Characters
Category	Characters	Names
Reserved	; / ? : @ & = + $ ,	semicolon, slash, question mark, colon, at sign, ampersand, equal sign, plus sign, dollar sign, comma
Delimiters	< > # % "	angle brackets, pound sign, percent sign, double quote
Unwise	{ } | \ ^ [ ] `	curly braces, pipe symbol, backslash, caret, square brackets, backquote
Excluded		space, tab, etc.; ASCII control characters (line feed, carriage return, etc.)

Many characters were excluded because URIs and URLs encompass existing addresses. SMTP mailbox addresses can appear in URLs, so @ must be reserved to act as a separator between the mailbox and the mail server (mailto:alice@mail.example.com). HTTP URLs use colon, slash, pound, question mark, and ampersand to delimit various parts of the URL.

Other characters are excluded because URLs must appear inside HTML pages and HTTP messages. Spaces are used as separator characters in HTTP messages. Angle brackets are used as control characters in HTML, and both single and double quotes are used to contain attribute values, as in the href attribute. Note the use of angle brackets and double quotes in the link examples in the previous section (Listing 3-1).

Characters that can't directly appear in URLs can be encoded by using % followed by two hexadecimal digits. For example, %20 replaces a single space in a URL.
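The encoding rule can be tried out directly. The sketch below first applies the rule by hand to a single character and then uses quote and unquote from Python's standard urllib.parse module. The helper percent_encode and the resource name are hypothetical.

```python
from urllib.parse import quote, unquote


def percent_encode(char):
    """Encode one US-ASCII character as % followed by two hexadecimal digits."""
    return f"%{ord(char):02X}"


print(percent_encode(" "))        # %20

name = "price list <draft>.doc"   # hypothetical resource name
escaped = quote(name)
print(escaped)                    # price%20list%20%3Cdraft%3E.doc

# Decoding restores the original name.
print(unquote(escaped) == name)   # True
```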

In protocols and data formats (HTTP, SMTP, XML, or other), URLs are always formatted legally, that is, with excluded characters encoded. URLs appear unescaped only when displayed to users, never when saved or transmitted.

3.1.4 Requests to Directories

Often, the first HTTP request to a new server is to a URL with no path part:

http://www.example.com/

This address isn't usually the address of a page. The Web server considers it the address of the root directory of the Web content, but there's no standard way to return a directory in HTTP. Two kinds of responses are common.

First, the Web server might construct an HTML page on the fly, containing a simple list of the pages in the directory. This is very common, but there's no standard format for this kind of response.

Second, instead of returning information for the base directory, the server might return a default starting page. The administrator defines one or more default page names for the entire server. If a directory contains a page with that default name, the page is returned in response to a request for the directory. The page has its own URL, such as:

http://www.example.com/index.html

The response includes the URL of the default page (in the Location header; see Section 3.7.13) so that the client can use the "correct" URL in the future for requests such as PUT and DELETE.

The first mechanism is frequently used by Web servers hosting content with very few link-rich pages, such as Office documents, data files, or pictures. These servers return directory listings to allow users to navigate, just as in FTP. That's why Web servers have frequently replaced FTP servers: Web servers can easily handle the FTP navigation model as well as more sophisticated navigation.

The second mechanism is frequently used by Web servers where the content is compiled into carefully organized pages for navigation. The directories might contain resources that users shouldn't see out of context alone, such as partial image maps or footer HTML included in complete pages. On these servers, administrators might forbid directory content listings entirely.
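The choice between the two mechanisms can be sketched as a small decision function. This is only an illustration of the logic described in this section, not the behavior of any particular Web server. The function name, the default page names, and the set of file paths are all hypothetical, and the listing covers only files directly inside the directory.

```python
# Hypothetical administrator setting: default page names for the whole server.
DEFAULT_PAGE_NAMES = ["index.html", "default.htm"]


def respond_to_directory(directory, files, allow_listing=True):
    """Decide how to answer a request for a directory path ending in "/".

    files is the set of resource paths the server holds.
    Returns a (kind, value) pair describing the response.
    """
    # Second mechanism: return a default starting page if one exists.
    for name in DEFAULT_PAGE_NAMES:
        candidate = directory + name
        if candidate in files:
            return ("default-page", candidate)
    # First mechanism: build a listing of the directory's files on the fly.
    if allow_listing:
        children = sorted(
            path for path in files
            if path.startswith(directory) and "/" not in path[len(directory):]
        )
        return ("listing", children)
    # The administrator has forbidden directory listings.
    return ("forbidden", None)


files = {"/index.html", "/events.html", "/docs/a.doc", "/docs/b.doc"}

print(respond_to_directory("/", files))
print(respond_to_directory("/docs/", files))
print(respond_to_directory("/docs/", files, allow_listing=False))
```

Checking for a default page before falling back to a listing is a design choice of this sketch; a real server's order of precedence depends on its configuration.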