7.1 URIs | Secure XML: The New Syntax for Signatures and Encryption

A Uniform Resource Identifier [RFC 2396] is the fundamental way to name or locate something in the World Wide Web. Formally, URIs represent the union of Uniform Resource Locators and Uniform Resource Names [RFC 2141].

7.1.1 URI Syntax

The most general syntax for a URI is

 Scheme ":" scheme-specific-part

If the scheme-specific-part starts with a double slash ("//"), the URI is called a "generic URI." In this case, a specific substructure to the scheme-specific-part is implied as follows:

 Scheme "://" authority path [ "?" query ]

Here square brackets surround the optional parts, italics indicate variables, and quoted text indicates fixed characters.

The scheme name is case independent and must start with a letter. The remainder of the scheme name can consist of any combination of letters, digits, periods ("."), hyphens ("-"), and plus signs ("+"). The most common scheme seen by users is "http:"; it indicates a pointer to a resource the user can obtain through the Hypertext Transfer Protocol (HTTP; [RFC 2616]).
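The scheme rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `SCHEME_RE`, `split_scheme`, and `is_generic` are hypothetical helpers introduced here, not part of any specification or library.

```python
import re

# A scheme is a letter followed by any mix of letters, digits, "+", "-", ".".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def split_scheme(uri):
    """Return (scheme, scheme_specific_part), with the scheme lowercased."""
    scheme, sep, rest = uri.partition(":")
    if not sep or not SCHEME_RE.fullmatch(scheme):
        raise ValueError("no valid scheme in %r" % uri)
    # Scheme names are case independent, so normalize to lowercase.
    return scheme.lower(), rest

def is_generic(uri):
    """True if the scheme-specific part starts with a double slash."""
    return split_scheme(uri)[1].startswith("//")
```

Under these assumptions, `split_scheme("HTTP://www.example.com/")` yields the scheme `"http"`, and `is_generic("mailto:user@example.com")` is false because nothing resembling "//" follows the colon.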

Authorities

According to the specification of URIs, the "authority" portion is scheme dependent and can be null. Thus a scheme can have its own registry, and URIs using that scheme could potentially use names in that registry as their authority section. As a practical matter, however, almost all authority sections seen in real life are "server"-based. That is, they rely on the specification of a server computer on the Internet. When desired, they can refer to particular accounts and/or ports at such computers as described below.

The general syntax for a server-based URI authority portion is

 [ [ userInfo "@" ] host [ ":" portNumber ] ]

Unless the authority is null, it must always specify a host (computer). It can be optionally preceded by user information and an at sign ("@"), where the user information identifies a user or account at that host. A colon and a port number can also follow it.

The user information could consist of a user name or an account number. On many systems, it is structured as follows:

 UserName [ ":" Password ]

This syntax indicates that a user name is optionally followed by a colon and a password. Use of this format to send plain text passwords is dangerous and not recommended except in the extraordinary circumstance that the "password" is intended to be public knowledge.
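As a rough illustration of the server-based authority syntax, Python's standard `urllib.parse.urlsplit` exposes the user name, password, host, and port as separate attributes. The wrapper `authority_parts` and the account name and password in the usage note are hypothetical, chosen only for this sketch.

```python
from urllib.parse import urlsplit

def authority_parts(uri):
    """Split the authority of a server-based URI into its optional pieces."""
    parts = urlsplit(uri)
    return {
        "user": parts.username,      # None when no user information is present
        "password": parts.password,  # None when no ":" password follows the user
        "host": parts.hostname,      # lowercased by urlsplit
        "port": parts.port,          # None when no ":" port number is given
    }
```

For instance, `authority_parts("http://anonymous:guest@www.example.com:8000/foo/bar/")` reports the user `anonymous`, the (deliberately public) password `guest`, the host `www.example.com`, and the port `8000`.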

The host specification can consist of either a numeric address or a domain name [RFC 1034, 1035] to be looked up to find the address. Domain names are a dotted sequence of labels such as "server.example.mp.us" or "www.example.com". Numeric addresses can appear in several formats. The Internet is currently based on version 4 of the Internet Protocol (IPv4), which uses 32-bit addresses [RFC 791]. Version 6 (IPv6), which uses 128-bit addresses, is being slowly deployed, however [RFC 2460]. Numeric host addresses must follow one of the standardized formats:

A "dotted quad" takes the form x.y.z.w, where each letter is an integer in the range 0 to 255. This form represents the IPv4 host address with four bytes of those integer values in that order (i.e., w + z*2⁸ + y*2¹⁶ + x*2²⁴). Example: "10.0.0.1".
An integer in the range 0 to 2³² - 1 represents an IPv4 host with that address. Example: "167772161", which has the same meaning as 10.0.0.1.
Numeric IPv6 addresses within square brackets can be represented in a flexible format that preferably consists of hexadecimal chunks of four digits [RFC 2373]. Example: "[1080:0:0:0:8:800:200C:417A]". (A number of abbreviations are available, such as squeezing out one run of zeros by using "::", as in "[1080::8:800:200C:417A]".)
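The equivalence between these formats can be checked with a short sketch. The helper names `dotted_quad_to_int` and `int_to_dotted_quad` are hypothetical; the `ipaddress` module is part of the Python standard library.

```python
import ipaddress

def dotted_quad_to_int(quad):
    """Compute w + z*2^8 + y*2^16 + x*2^24 for a dotted quad x.y.z.w."""
    x, y, z, w = (int(part) for part in quad.split("."))
    for value in (x, y, z, w):
        if not 0 <= value <= 255:
            raise ValueError("each part must be in the range 0 to 255")
    return w + z * 2**8 + y * 2**16 + x * 2**24

def int_to_dotted_quad(number):
    """Inverse conversion: peel off the four bytes, most significant first."""
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))

# The standard library agrees with the arithmetic above.
same_v4 = str(ipaddress.IPv4Address(167772161)) == "10.0.0.1"
# The "::" abbreviation names the same IPv6 address as the full form.
same_v6 = (ipaddress.IPv6Address("1080::8:800:200C:417A")
           == ipaddress.IPv6Address("1080:0:0:0:8:800:200C:417A"))
```

Here `dotted_quad_to_int("10.0.0.1")` gives 167772161, matching the integer example above.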

Internet protocols historically differentiated between numeric addresses and domain names in user input by placing numeric addresses inside square brackets, for example, "[10.0.0.1]". Although no all-numeric domain names that might be confused with numeric addresses existed, because no numeric top-level domain names (the last label in a domain name) were allowed, such names might be created in the future. The initial URL design violated this principle by using a naked dotted quad, thereby leaving no way to distinguish this type of name from certain all-numeric domain names. For compatibility reasons, this format had to be carried forward into URIs. As a result, it will be very difficult to ever create general-use numeric top-level domain names.

The optional port number part of the authority information indicates how to obtain the resource from the specified host by using the "scheme". Typically communication with the server occurs through the Transmission Control Protocol (TCP; [RFC 793]) or some other method in which a numeric port or other designator selects the type of server process initially contacted. For example, TCP port 80 is assigned to HTTP, so a server expects to speak HTTP on a connection established for a request arriving addressed to that port. The optional port number field, when present, specifies the port to connect to at the server. For example:

 http://www.example.com:8000/foo/bar/

specifies that an HTTP connection will use port 8000 at the computer specified by "www.example.com", instead of port number 80, which is typically assigned to HTTP.

The complex design of the authority portion of URIs, particularly the choice to put the optional user information first, enables semantic security attacks that work because few users understand what a complex authority portion really means. For example, <http://www.navy.mil@167772161/page.html> looks like a URI pointing to a Web page of the U.S. Navy because users are accustomed to believing the domain name that appears immediately after "//". In fact, it is a pointer to the host at numeric address 167772161, which is presumably run by imposters. The identifier "www.navy.mil" will merely be provided to that host as a user name and probably ignored. Browsers should check for such semantic attacks and warn the user. The Opera browser [Opera] does so, but Netscape Navigator and Internet Explorer do not.

It's too late to change URIs, but it would have been better if the host portion always came first.
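A check of the kind a browser might apply to the semantic attack described above could look roughly like the following. The function `looks_deceptive` and its heuristic (user information containing a period, so that it resembles a domain name) are assumptions made for this sketch, not the behavior of any actual browser.

```python
from urllib.parse import urlsplit

def looks_deceptive(uri):
    """Heuristic: flag user information that resembles a host name."""
    parts = urlsplit(uri)
    user = parts.username or ""
    # A dotted "user name" in front of "@" can masquerade as the real host.
    return "." in user and parts.hostname != user

suspect = "http://www.navy.mil@167772161/page.html"
real_host = urlsplit(suspect).hostname  # the host actually contacted
```

For the URI above, `real_host` is `"167772161"` and `looks_deceptive(suspect)` is true, whereas an ordinary URI with no user information is not flagged.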

Paths

A URI path consists of a slash ("/") followed by one or more path segments separated by slashes. Each path segment consists of a string that can take parameters after a semicolon (";"), although the use of such parameters is extremely rare. For example:

 http://foo.example/pub/seg1/bar
 http://bar.example/seg2/seg3;rare/seg4

Path segments are considered hierarchical. They look like file names on some systems and may even be implemented that way, but this need not be. The service at the designated host can generally choose how it interprets paths, with the exception of the few rules in [RFC 2396]. These rules provide for special handling of the special path segments consisting of period (".") and double period (".."). For example, "./" does nothing if the period appears as a full segment, and "segment/../" does nothing. In effect, the double period backs up one level.
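The handling of "." and ".." segments can be illustrated with a simplified sketch. It assumes an absolute path (one starting with a slash) and is not the exact procedure given in [RFC 2396]; the name `remove_dot_segments` is a hypothetical helper.

```python
def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute path (simplified)."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # A full "." segment does nothing; keep a trailing slash if last.
            if is_last:
                output.append("")
        elif segment == "..":
            # ".." backs up one level, but never above the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)
```

With this sketch, both `/a/./b` and `/a/segment/../b` reduce to `/a/b`.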

Queries

The query component of a generic URI occurs at the end of the URI. A question mark ("?") precedes the query component. It is handled opaquely and passed on to the service addressed by the rest of the URI. The most common coding in the query portion consists of a name-value pair format in which the values are quoted strings and the pairs are separated with the ampersand ("&") character. The "mailto:" scheme [RFC 2368] uses this coding, for example. It is also commonly used for "http:" and related schemes that pass on their information to Common Gateway Interface (CGI) programs to calculate a response. For example:

 http://foo.example/path?a="value"&bar="foo"

Services using this format of query also sometimes encode a space as a plus sign ("+"), because spaces are prohibited and standard URI character escaping encodes a space as three characters.
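The name-value coding and the plus-sign convention can be seen with Python's standard `urllib.parse` functions. The URI and the field names `title` and `lang` below are hypothetical, and the values are left unquoted for simplicity.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit

# Split off the query component, then decode its name-value pairs.
query = urlsplit("http://foo.example/path?title=secure+xml&lang=en").query
pairs = parse_qsl(query)  # "+" is decoded back into a space

# Encoding in the other direction: form-style coding uses "+" for a space,
# whereas standard URI escaping uses the three characters "%20".
form_style = urlencode({"title": "secure xml"})
uri_style = quote("secure xml")
```

Here `pairs` holds `("title", "secure xml")` and `("lang", "en")`, while `form_style` is `title=secure+xml` and `uri_style` is `secure%20xml`.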

7.1.2 Relative URIs

In XML, pointers often point elsewhere in the same document or to other documents on the same host or even within the same directory. Similar pointers within an HTML Web page might point to other places in the HTML code or other files that reside on the same host. You can easily create such pointers by omitting the "scheme://<authority>" portion of the URI and as much of the beginning of the "<path>" portion as appropriate. A base URI then supplies this omitted prefix.

For URIs found inside retrieved XML or HTML, the base URI defaults to the URI that retrieves the document or page. For example, if a page is retrieved from "http://foo.example/path/page/" and contains the relative URI "gifs/picture.gif", then that relative URI converts by default to the following absolute URI:

 http://foo.example/path/page/gifs/picture.gif
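The same conversion can be reproduced with `urllib.parse.urljoin` from the Python standard library. This is only a sketch; the second relative reference, `../other.xml`, is a hypothetical addition to show a reference that climbs one level.

```python
from urllib.parse import urljoin

base = "http://foo.example/path/page/"

# The relative URI from the text, resolved against the retrieval URI.
picture = urljoin(base, "gifs/picture.gif")

# A hypothetical reference that backs up one level with "..".
sibling = urljoin(base, "../other.xml")
```

Under these assumptions, `picture` is `http://foo.example/path/page/gifs/picture.gif` and `sibling` is `http://foo.example/path/other.xml`. Note that the trailing slash on the base matters: without it, the last segment "page" would be treated as a file name and dropped during resolution.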

Use of relative URIs provides another advantage: You can move a Web resource or set of resources around in a file system or even between computers. The Web resource will still work if its internal references are relative.

If an entity expansion actually moves some XML into a document, however, it may change the base URI from its former location to that of the document. This switch changes the meaning of relative URIs. In XML, you can fix the base URI for converting relative to absolute URIs by using the xml:base attribute (see Section 7.2).

7.1.3 URI References and Fragment Specifiers

A URI can point to a part or fragment of a resource. You indicate this fact by adding a fragment specifier to the end of the URI. The result is formally called a "URI Reference," although the term "URI," whose formal definition excludes URI References, is frequently stretched in practice to include them. For example, the schema (Chapter 5) restriction of a character string to "anyURI" includes URI References.

URI References have the following syntax:

 <scheme>:<scheme-specific-part>#<fragment>

The generic URI reference syntax is

 <scheme>://<authority><path>?<query>#<fragment>

The interpretation of the "fragment" portion depends on the type of data pointed to. When that type is XML, the format of the fragment specifier is XPointer (see Section 7.3).
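Separating a URI Reference into the URI proper and its fragment specifier is straightforward; the sketch below uses `urllib.parse.urldefrag` from the Python standard library. The reference string, including the fragment `chapter7`, is a made-up example.

```python
from urllib.parse import urldefrag

reference = "http://foo.example/doc.xml?rev=2#chapter7"

# Everything before "#" identifies the resource; the rest is the fragment,
# whose interpretation depends on the type of data retrieved.
uri, fragment = urldefrag(reference)
```

Here `uri` is `http://foo.example/doc.xml?rev=2` and `fragment` is `chapter7`.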

7.1.4 URI Encoding

Trying to write a URI in XML sounds simple until you look at the process closely. First, the original URI specification did not define how to encode anything other than ASCII characters in URIs [ASCII]. Second, if the encoding of a URI is a sequence of octets, no specification dictates what octets with a value higher than x7F (127 decimal) mean.

HTML did, however, provide a rule for encoding characters that URIs or parts thereof do not allow. For example, if a particular label in a path needs to contain a space or a question mark (possibly because those characters occur in a file name), you encode the character as a percent sign ("%") followed by the two hex digits for the value of that character: for example, "%20" for space and "%3F" for question mark ("?"). These disallowed characters were ASCII characters, so the question of how to handle a value that wouldn't fit into one octet didn't arise.

The rules for encoding arbitrary Unicode characters [Unicode] in URIs follow:

1. Determine all characters in the URI that are not allowed. They include all non-ASCII characters (that is, all characters with a character value exceeding 127) and the excluded characters listed in Section 2.4 of [RFC 2396]. The exceptions are the square brackets ("[" and "]"), which are restored by [RFC 2732], the octothorpe ("#"), and the percent sign ("%").
2. Convert each disallowed character in the original URI to UTF-8 [RFC 2279] as one or more octets.
3. Replace each octet produced in Step 2 with the percent sign ("%") followed by the two hexadecimal digits for the value of that octet, as described earlier in this section.
4. Replace each disallowed character in the original URI with the results of the two-stage expansion described in Steps 2 and 3.

For example,

 http://example.com/foo bar©.htm

would be encoded as

 http://example.com/foo%20bar%C2%A9.htm
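The encoding steps listed above can be sketched as follows. The function name `encode_uri` and the constant `EXCLUDED_ASCII` are hypothetical, the set of excluded ASCII characters reflects one reading of Section 2.4.3 of [RFC 2396], and the sample file name in the usage note is invented for illustration.

```python
# ASCII characters excluded by RFC 2396 Section 2.4.3 (space, delimiters,
# "unwise" characters), minus the exceptions "#", "%", "[" and "]".
EXCLUDED_ASCII = set(' <>"{}|\\^`')

def is_disallowed(ch):
    """Step 1: non-ASCII, control, or excluded ASCII characters."""
    code = ord(ch)
    return code > 127 or code < 0x20 or code == 0x7F or ch in EXCLUDED_ASCII

def encode_uri(uri):
    """Escape every disallowed character as %HH over its UTF-8 octets."""
    out = []
    for ch in uri:
        if is_disallowed(ch):
            # Step 2: convert the character to UTF-8 octets.
            # Step 3: write each octet as "%" plus two hexadecimal digits.
            out.append("".join("%%%02X" % octet for octet in ch.encode("utf-8")))
        else:
            out.append(ch)
    # Step 4: the escaped text replaces each disallowed character in place.
    return "".join(out)
```

For example, `encode_uri("http://example.com/r\u00e9sum\u00e9 file.htm")` produces `http://example.com/r%C3%A9sum%C3%A9%20file.htm`, because U+00E9 occupies the two UTF-8 octets C3 and A9. Existing percent signs are left alone, so an already escaped URI passes through unchanged.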

These steps describe the general URI encoding. The URI usage context may require further encoding. For example, if the URI is used in XML, you must further encode the ampersand ("&") and less-than sign ("<").
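As a small illustration of this context-dependent encoding, Python's standard `xml.sax.saxutils.escape` replaces the ampersand and less-than sign with XML entity references (it also replaces the greater-than sign, which is harmless). The wrapper name `uri_for_xml` and the sample URI are assumptions made for this sketch.

```python
from xml.sax.saxutils import escape

def uri_for_xml(uri):
    """Escape "&" and "<" (and ">") so the URI can appear in XML text."""
    return escape(uri)

encoded = uri_for_xml("http://foo.example/path?a=1&b=2")
```

Here `encoded` is `http://foo.example/path?a=1&amp;b=2`. A URI placed in an attribute value would also need its quotation marks handled; the same module provides `quoteattr` for that case.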

The exact specification of the encoding rules for URIs was forced by the development of digital signatures, with their stringent canonicalization requirements. These rules first appeared in an XML digital signature document. At this point, the same description of URI encoding appears in at least the XMLDSIG, XML Base, and XML Pointer Language documents.