A Uniform Resource Identifier [RFC 2396] is the fundamental way to name or locate something in the World Wide Web. Formally, URIs represent the union of Uniform Resource Locators and Uniform Resource Names [RFC 2141].
7.1.1 URI Syntax
The most general syntax for a URI is
Scheme ":" scheme-specific-part
If the scheme-specific-part starts with a double slash ("//"), the URI is called a "generic URI." In this case, a specific substructure to the scheme-specific-part is implied as follows:
Scheme "://" authority path [ "?" query ]
Here square brackets surround the optional parts, italics indicate variables, and quoted text indicates fixed characters.
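As a rough illustration of the generic syntax, the following Python sketch splits a URI into the components named above. The pattern is modeled on the regular expression given in Appendix B of [RFC 2396], with the fragment part left out; the name split_uri and the sample URIs are assumptions made for this example.

```python
import re

# Modeled on the regular expression in RFC 2396, Appendix B (fragment omitted).
URI_PATTERN = re.compile(
    r"^(?:(?P<scheme>[^:/?#]+):)?"      # scheme, up to the first ":"
    r"(?://(?P<authority>[^/?#]*))?"    # authority, introduced by "//"
    r"(?P<path>[^?#]*)"                 # path (may be empty)
    r"(?:\?(?P<query>[^#]*))?"          # optional query, introduced by "?"
)

def split_uri(uri):
    """Return a dict of scheme, authority, path, and query (None if absent)."""
    # Every part of the pattern is optional, so match() always succeeds.
    return URI_PATTERN.match(uri).groupdict()

# Hypothetical sample URIs.
generic = split_uri("http://www.example.com/a/b?x=1")
opaque = split_uri("mailto:user@example.com")
print(generic)
print(opaque)
```

The second sample has no double slash after the scheme, so it is not a generic URI and the sketch reports no authority for it.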
The scheme name is case independent and must start with a letter. The remainder of the scheme name can consist of any combination of letters, digits, periods ("."), hyphens ("-"), and plus signs ("+"). The most common scheme seen by users is "http:"; it indicates a pointer to a resource the user can obtain through the Hypertext Transfer Protocol (HTTP; [RFC 2616]).
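The scheme-name rules above are simple enough to check mechanically. The sketch below is one possible check in Python; the helper names is_valid_scheme and same_scheme are hypothetical and chosen only for illustration.

```python
import re

# A letter, followed by any mix of letters, digits, "+", "-", and ".".
SCHEME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def is_valid_scheme(name):
    """Return True if name follows the scheme-name rules described above."""
    return SCHEME_PATTERN.fullmatch(name) is not None

def same_scheme(a, b):
    """Compare two scheme names without regard to case."""
    return a.lower() == b.lower()

print(is_valid_scheme("http"))       # starts with a letter
print(is_valid_scheme("1http"))      # starts with a digit, so it is rejected
print(same_scheme("HTTP", "http"))   # case does not matter
```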
According to the specification of URIs, the "authority" portion is scheme dependent and can be null. Thus a scheme can have its own registry, and URIs using that scheme could potentially use names in that registry as their authority section. As a practical matter, however, almost all authority sections seen in real life are "server"-based. That is, they rely on the specification of a server computer on the Internet. When desired, they can refer to particular accounts and/or ports at such computers as described below.
The general syntax for a server-based URI authority portion is
[ [ userInfo "@" ] host [ ":" portNumber ] ]
Unless the authority is null, it must always specify a host (computer). The host can optionally be preceded by user information and an at sign ("@"), where the user information identifies a user or account at that host. The host can also be followed by a colon and a port number.
The user information could consist of a user name or an account number. On many systems, it is structured as follows:
UserName [ ":" Password ]
This syntax indicates that a user name is optionally followed by a colon and a password. Use of this format to send plain text passwords is dangerous and not recommended except in the extraordinary circumstance that the "password" is intended to be public knowledge.
The host specification can consist of either a numeric address or a domain name [RFC 1034, 1035] that is looked up to find the address. Domain names are dotted sequences of labels, such as "server.example.mp.us" or "www.example.com". Numeric addresses can appear in several formats. The Internet is currently based on version 4 of the Internet Protocol (IPv4), which uses 32-bit addresses [RFC 791]. Version 6 (IPv6), which uses 128-bit addresses, is slowly being deployed [RFC 2460]. Numeric host addresses must follow one of the standardized formats:
The optional port number part of the authority information indicates how to obtain the resource from the specified host by using the "scheme". Typically communication with the server occurs through the Transmission Control Protocol (TCP; [RFC 793]) or some other method in which a numeric port or other designator selects the type of server process initially contacted. For example, TCP port 80 is assigned to HTTP, so a server expects to speak HTTP on any connection made to that port. The optional port number field, when present, specifies the port to connect to at the server. For example:
http://www.example.com:8000/
specifies that an HTTP connection will use port 8000 at the computer specified by "www.example.com", instead of port number 80, which is typically assigned to HTTP.
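The parts of a server-based authority can be pulled apart with Python's standard urllib.parse module, as sketched below. The DEFAULT_PORTS table, the function name describe_authority, and the "guest:public" credentials are assumptions for this example; the table lists only the HTTP default mentioned in the text.

```python
from urllib.parse import urlsplit

# Assumption: only the default port named in the text is listed here.
DEFAULT_PORTS = {"http": 80}

def describe_authority(uri):
    """Return the user, password, host, and effective port of a URI."""
    parts = urlsplit(uri)
    port = parts.port
    if port is None:
        # No explicit port: fall back to the scheme's default, if known.
        port = DEFAULT_PORTS.get(parts.scheme)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": port,
    }

explicit = describe_authority("http://www.example.com:8000/")
# A deliberately public "password", as in the exceptional case described above.
with_user = describe_authority("http://guest:public@www.example.com/")
print(explicit)
print(with_user)
```

When the URI carries no port, the sketch falls back to the scheme's default, which mirrors how a client would choose port 80 for "http:".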
A URI path consists of a slash ("/") followed by one or more path segments separated by slashes. Each path segment consists of a string that can take parameters after a semicolon (";"), although the use of such parameters is extremely rare. For example:
Path segments are considered hierarchical. They look like file names on some systems and may even be implemented that way, but this need not be the case. The service at the designated host can generally choose how it interprets paths, with the exception of the few rules in [RFC 2396]. These rules provide for special handling of the path segments consisting of a period (".") and a double period (".."). For example, "./" has no effect when the period is a complete segment, and "segment/../" likewise has no effect because the double period backs up one level.
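The handling of "." and ".." can be sketched as a small Python function. This is a simplified illustration that assumes an absolute path beginning with a slash; it is not the full algorithm of [RFC 2396], and the name remove_dot_segments is chosen for this example.

```python
def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute, slash-separated path."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # A lone period refers to the current level and is dropped.
            if is_last:
                output.append("")  # keep the trailing slash
        elif segment == "..":
            # A double period backs up one level, but never above the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")  # keep the trailing slash
        else:
            output.append(segment)
    return "/".join(output)

print(remove_dot_segments("/a/./b"))     # the "." segment is dropped
print(remove_dot_segments("/a/b/../c"))  # "b/.." cancels out
```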
The query component of a generic URI occurs at the end of the URI. A question mark ("?") precedes the query component. It is handled opaquely and passed on to the service addressed by the rest of the URI. The most common coding in the query portion consists of a name-value pair format in which the values are quoted strings and the pairs are separated with the ampersand ("&") character. The "mailto:" scheme [RFC 2368] uses this coding, for example. It is also commonly used for "http:" and related schemes that pass on their information to Common Gateway Interface (CGI) programs to calculate a response. For example:
Services using this format of query also sometimes encode a space as a plus sign ("+"), because spaces are prohibited and standard URI character escaping encodes a space as three characters.
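The name-value coding and the plus-sign convention for spaces can be tried out with Python's standard urllib.parse module. The query string and field names below are hypothetical and serve only as a sketch.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical query: one space written as "+", one as the escape "%20".
query = "name=John+Smith&city=New%20York"

# parse_qs splits the pairs at "&" and decodes both forms of space.
fields = parse_qs(query)
print(fields)

# urlencode builds a query from name-value pairs, writing spaces as "+".
encoded = urlencode({"name": "John Smith", "lang": "en"})
print(encoded)
```

Note that parse_qs returns a list of values for each name, because a name may legitimately appear more than once in a query.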
7.1.2 Relative URIs
In XML, pointers often point elsewhere in the same document or to other documents on the same host or even within the same directory. Similar pointers within an HTML Web page might point to other places in the HTML code or other files that reside on the same host. You can easily create such pointers by omitting the "scheme://<authority>" portion of the URI and as much of the beginning of the "<path>" portion as appropriate. A base URI then supplies this omitted prefix.
For URIs found inside retrieved XML or HTML, the base URI defaults to the URI that retrieves the document or page. For example, if a page is retrieved from "http://foo.example/path/page/" and contains the relative URI "gifs/picture.gif", then that relative URI converts by default to the following absolute URI:
http://foo.example/path/page/gifs/picture.gif
Use of relative URIs provides another advantage: You can move a Web resource or set of resources around in a file system or even between computers. The Web resource will still work if its internal references are relative.
If an entity expansion actually moves some XML into a document, however, it may change the base URI from its former location to that of the document. This switch changes the meaning of relative URIs. In XML, you can fix the base URI for converting relative to absolute URIs by using the xml:base attribute (see Section 7.2).
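The effect of the base URI on a relative URI can be demonstrated with urljoin from Python's standard library. In this sketch the first base is the one used in the text; the second base, "http://bar.example/other/doc.xml", is a hypothetical stand-in for a document into which the XML has been moved.

```python
from urllib.parse import urljoin

relative = "gifs/picture.gif"

# Base URI taken from the example in the text.
original_base = "http://foo.example/path/page/"
# Hypothetical base of a document that now contains the moved XML.
new_base = "http://bar.example/other/doc.xml"

before = urljoin(original_base, relative)
after = urljoin(new_base, relative)
print(before)
print(after)
```

The same relative URI resolves to two different absolute URIs, which is why a moved fragment of XML may need its base URI fixed explicitly.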
7.1.3 URI References and Fragment Specifiers
A URI can point to a part or fragment of a resource. You indicate this fact by adding a fragment specifier to the end of the URI. The result is formally called a "URI Reference," although the term "URI," whose formal definition excludes URI References, is frequently stretched in practice to include them. For example, the schema (Chapter 5) restriction of a character string to "anyURI" includes URI References.
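URI References have the following syntax:
[ absoluteURI | relativeURI ] [ "#" fragment ]
The generic URI reference syntax is
Scheme "://" authority path [ "?" query ] [ "#" fragment ]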
The interpretation of the "fragment" portion depends on the type of data pointed to. When that type is XML, the format of the fragment specifier is XPointer (see Section 7.3).
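Separating a URI Reference into the URI proper and its fragment specifier can be sketched with urldefrag from Python's standard library. The sample reference and the fragment name "chapter7" are assumptions for this example; the sketch only splits the string and does not interpret the fragment.

```python
from urllib.parse import urldefrag

# Hypothetical URI Reference with a fragment specifier.
reference = "http://www.example.com/doc.xml#chapter7"

# urldefrag returns the URI without the fragment, and the fragment itself.
uri, fragment = urldefrag(reference)
print(uri)
print(fragment)

# A reference without a fragment yields an empty fragment string.
plain_uri, no_fragment = urldefrag("http://www.example.com/doc.xml")
print(repr(no_fragment))
```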
7.1.4 URI Encoding
Trying to write a URI in XML sounds simple until you look at the process closely. First, the original URI specification did not define how to encode anything other than ASCII characters in URIs [ASCII]. Second, if the encoding of a URI is a sequence of octets, no specification dictates what octets with a value higher than x7F (127 decimal) mean.
HTML did, however, provide a rule for encoding characters that URIs or parts thereof do not allow. For example, if a particular label in a path needs to contain a space or a question mark (possibly because those characters occur in a file name), you encode the character as a percent sign ("%") followed by the two hex digits for the value of that character: for example, "%20" for space and "%3F" for question mark ("?"). These disallowed characters were ASCII characters, so the question of how to handle a value that wouldn't fit into one octet didn't arise.
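The percent-sign escaping of disallowed ASCII characters can be reproduced with quote and unquote from Python's standard urllib.parse module. The file name "my file?.txt" and the helper name encode_segment are hypothetical and serve only as a sketch.

```python
from urllib.parse import quote, unquote

def encode_segment(segment):
    """Escape one path segment; safe="" means even "/" is escaped."""
    return quote(segment, safe="")

# Hypothetical file name containing a space and a question mark.
escaped = encode_segment("my file?.txt")
print(escaped)

# unquote reverses the escaping.
restored = unquote(escaped)
print(restored)
```

Passing safe="" matters for a single segment: a slash inside a file name would otherwise be left alone and read as a path separator.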
The rules for encoding arbitrary Unicode characters [Unicode] in URIs follow:
would be encoded as