6.2. Referencing Documents: The URL

Every document on the Web has a unique address. (Imagine the chaos if they didn't.) The document's address is known as its uniform resource locator (URL). [ ]

[ ] "URL" usually is pronounced "you are ell," not "earl."

Several HTML/XHTML tags include a URL attribute value, including hyperlinks, inline images, and forms. All use the same URL syntax to specify the location of a web resource, regardless of the type or content of that resource. That's why it's known as a uniform resource locator.

Because they can be used to represent almost any resource on the Internet, URLs come in a variety of flavors. All URLs, however, have the same top-level syntax:

 scheme:scheme_specific_part

The scheme describes the kind of object the URL references; the scheme_specific_part is, well, the part that is peculiar to the specific scheme. The important thing to note is that the scheme is always separated from the scheme_specific_part by a colon, with no intervening spaces.
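As a minimal sketch of that top-level syntax, the following Python function splits a URL at its first colon. The function name split_scheme is our own invention for illustration, not part of any standard.

```python
def split_scheme(url):
    """Split a URL at its first colon into (scheme, scheme_specific_part)."""
    scheme, colon, rest = url.partition(":")
    if not colon:
        # No colon at all, so there is no scheme to report.
        raise ValueError("no scheme found in %r" % url)
    return scheme, rest


print(split_scheme("http://www.kumquat.com/planting/guide.html"))
# ('http', '//www.kumquat.com/planting/guide.html')
print(split_scheme("mailto:chuck@kumquats.com"))
# ('mailto', 'chuck@kumquats.com')
```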

6.2.1. Writing a URL

Write URLs using the displayable characters in the US-ASCII character set. For example, surely you have heard what has become annoyingly common on the radio for an announced business web site: "h, t, t, p, colon, slash, slash, w, w, w, dot, blah-blah, dot, com." That's a simple URL, written:

 http://www.blah-blah.com 

If you need to use a character in a URL that is not part of this character set, you must encode the character using a special notation. The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits whose values correspond to the position of the character in the ASCII character set.

This is easier than it sounds. One of the most common special characters is the space (owners of older Macintoshes, take special notice), whose position in the character set is 20 hexadecimal. [*] You can't type a space in a URL (well, you can, but it won't work). Rather, replace spaces in the URL with %20:

[*] Hexadecimal numbering is based on 16 characters: 0 through 9 followed by A through F, which in decimal are equivalent to values 0 through 15. Also, letter case for these extended values is not significant; "a" (10 decimal) is the same as "A," for example.

 http://www.kumquat.com/new%20pricing.html 

This URL actually retrieves a document named new pricing.html from the www.kumquat.com server.
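If you generate URLs from a program rather than by hand, most languages can do the encoding for you. Here is a short sketch using Python's standard urllib.parse module; the choice of Python is ours, and the filename is the one from the example above.

```python
from urllib.parse import quote, unquote

# A space sits at position 20 hexadecimal in the ASCII character set.
print(format(ord(" "), "02X"))        # 20

# quote() replaces the space with a percent sign and two hex digits.
encoded = quote("new pricing.html")
print(encoded)                        # new%20pricing.html

# unquote() reverses the encoding, as the server does on receipt.
print(unquote(encoded))               # new pricing.html
```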

6.2.1.1. Handling reserved and unsafe characters

In addition to the nonprinting characters, you'll need to encode reserved and unsafe characters in your URLs as well.

Reserved characters are those that have a specific meaning within the URL itself. For example, the slash character separates elements of a pathname within a URL. If you need to include in a URL a slash that is not intended to be an element separator, you'll need to encode it as %2F :

 http://www.calculator.com/compute?3%2f4 

This URL actually references the resource named compute on the www.calculator.com server and passes the string 3/4 to it, as delineated by the question mark ( ? ). Presumably, the resource is a server-side program that performs some arithmetic function on the passed value and returns a result.
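The same idea can be sketched in Python. Note that urllib.parse.quote leaves the slash alone by default, on the assumption that it is a path separator, so you have to ask for it to be encoded by passing an empty safe list.

```python
from urllib.parse import quote, unquote

argument = "3/4"

# By default the slash is treated as a separator and left as is.
print(quote(argument))                # 3/4

# With safe="" every reserved character, slash included, is encoded.
print(quote(argument, safe=""))       # 3%2F4

url = "http://www.calculator.com/compute?" + quote(argument, safe="")
print(url)                            # http://www.calculator.com/compute?3%2F4

# Hexadecimal digits in an encoding are not case-sensitive.
print(unquote("3%2f4"))               # 3/4
```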

Unsafe characters are those that have no special meaning within the URL but may have a special meaning in the context in which the URL is written. For example, double quotes ( "" ) delimit URL attribute values in tags. If you were to include a double quotation mark directly in a URL, you would probably confuse the browser. Instead, you should encode the double quotation mark as %22 to avoid any possible conflict.

Table 6-1 shows other reserved and unsafe characters that should always be encoded.

Table 6-1. Reserved and unsafe characters and their URL encodings

 Character  Description                 Usage     Encoding
 ;          Semicolon                   Reserved  %3B
 /          Slash                       Reserved  %2F
 ?          Question mark               Reserved  %3F
 :          Colon                       Reserved  %3A
 @          At sign                     Reserved  %40
 =          Equals sign                 Reserved  %3D
 &          Ampersand                   Reserved  %26
 <          Less-than sign              Unsafe    %3C
 >          Greater-than sign           Unsafe    %3E
 "          Double quotation mark       Unsafe    %22
 #          Hash symbol                 Unsafe    %23
 %          Percent                     Unsafe    %25
 {          Left curly brace            Unsafe    %7B
 }          Right curly brace           Unsafe    %7D
 |          Vertical bar                Unsafe    %7C
 \          Backslash                   Unsafe    %5C
 ^          Caret                       Unsafe    %5E
 ~          Tilde                       Unsafe    %7E
 [          Left square bracket         Unsafe    %5B
 ]          Right square bracket        Unsafe    %5D
 `          Back single quotation mark  Unsafe    %60


In general, you should always encode a character if there is some doubt as to whether it can be placed as is in a URL. As a rule of thumb, any character other than a letter, number, or any of the symbolic characters like $-_.+!*'() should be encoded.

It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an HTTP URL causes them to be used as regular characters, not as pathname delimiters, breaking the URL. Similarly, encoding an ampersand when it is used as a parameter separator in a URL will defeat the intended purpose. Instead, write these ampersands using &amp; to keep their intended function intact.
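To illustrate the ampersand case, here is a small Python sketch that writes a URL into an href attribute. It relies on the standard html.escape function; the second parameter in the URL (type=sweet) and the link text are made up for the example.

```python
from html import escape

# The ampersand separates two parameters, so it must keep its meaning
# in the URL; we do not percent-encode it.
url = "http://www.kumquat.com/find_a_quat?state=Florida&type=sweet"

# In the HTML source, however, the ampersand is written as &amp;.
link = '<a href="{}">Find a kumquat</a>'.format(escape(url, quote=True))
print(link)
# <a href="http://www.kumquat.com/find_a_quat?state=Florida&amp;type=sweet">Find a kumquat</a>
```

The browser converts &amp; back to a plain ampersand before it requests the URL, so the server still sees two separate parameters.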

6.2.2. Absolute and Relative URLs

You may address a URL in one of two ways: absolute or relative. An absolute URL is the complete address of a resource and has everything your system needs to find a document and its server on the Web. At the very least, an absolute URL contains the scheme and all required elements of the scheme_specific_part of the URL. It may also contain any of the optional portions of the scheme_specific_part.

With a relative URL, you provide an abbreviated document address that, when automatically combined with a base address by the system, becomes a complete address for the document. Within the relative URL, any component of the URL may be omitted. The browser automatically fills in the missing pieces of the relative URL using corresponding elements of a base URL. This base URL is usually the URL of the document containing the relative URL, but it may be another document specified with the <base> tag, as we will discuss later in this chapter. [<base>, 6.7.1]

6.2.2.1. Relative schemes and servers

A common form of a relative URL is missing the scheme and server name. Because many related documents are on the same server, it makes sense to omit the scheme and server name from the relative URL. For instance, assume the base document was last retrieved from the server www.kumquat.com. This relative URL:

 another-doc.html 

is equivalent to the absolute URL:

 http://www.kumquat.com/another-doc.html 

Table 6-2 shows how the base and relative URLs in this example are combined to form an absolute URL.

Table 6-2. Forming an absolute URL

               Protocol  Server           Directory  File
 Base URL      http      www.kumquat.com  /
 Relative URL                                        another-doc.html
 Absolute URL  http      www.kumquat.com  /          another-doc.html


6.2.2.2. Relative document directories

Another common form of a relative URL omits the leading slash and one or more directory names from the beginning of the document pathname. The directory of the base URL is automatically assumed to replace these missing components. It's the most common abbreviation, because most authors place their collections of documents and subdirectories of support resources in the same directory path as the home page. For example, you might have a special subdirectory containing FTP files referenced in your document. Let's say that the absolute URL for that document is:

 http://www.kumquat.com/planting/guide.html 

A relative URL for the file README.txt in the special subdirectory looks like this:

 ftp:special/README.txt 

You'll actually be retrieving:

 ftp://www.kumquat.com/planting/special/README.txt 

Visually, the operation looks like that in Table 6-3.

Table 6-3. Forming an absolute FTP URL

               Protocol  Server           Directory          File
 Base URL      http      www.kumquat.com  /planting          guide.html
 Relative URL  ftp                        special            README.txt
 Absolute URL  ftp       www.kumquat.com  /planting/special  README.txt


6.2.2.3. Using relative URLs

Relative URLs are more than just a typing convenience. Because they are relative to the current server and directory, you can move an entire set of documents to another directory or even another server and never have to change a single relative link. Imagine the difficulties if you had to go into every source document and change the URL for every link every time you moved it. You'd loathe using hyperlinks! Use relative URLs wherever possible.
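You can watch this resolution happen outside a browser. The sketch below uses Python's standard urllib.parse.urljoin to combine a base URL with relative URLs that carry no scheme of their own. The www.example.org address is a hypothetical new home for the documents, used only to show that the relative link survives a move.

```python
from urllib.parse import urljoin

# Scheme and server are filled in from the base URL.
print(urljoin("http://www.kumquat.com/", "another-doc.html"))
# http://www.kumquat.com/another-doc.html

# The directory of the base document is filled in, too.
base = "http://www.kumquat.com/planting/guide.html"
print(urljoin(base, "special/README.txt"))
# http://www.kumquat.com/planting/special/README.txt

# Move the whole collection (hypothetical new server and directory):
# the same relative link still resolves correctly.
moved = "http://www.example.org/garden/planting/guide.html"
print(urljoin(moved, "special/README.txt"))
# http://www.example.org/garden/planting/special/README.txt
```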

6.2.3. The http URL

The http URL is by far the most common. It is used to access documents from a web server, and it has two formats:

 http://server:port/path#fragment
 http://server:port/path?search

Some of the parts are optional. In fact, the most common form of the http URL is simply:

 http://server/path

which designates the unique server and the directory path and name of a document.

6.2.3.1. The http server

The server is the unique Internet name or IP numerical address of the computer system that stores the web resource. We suspect you'll mostly use more easily remembered Internet names for the servers in your URLs. [*] The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period. Typical Internet names look like www.oreilly.com or hoohoo.ncsa.uiuc.edu. [ ]

[*] Each Internet-connected computer has a unique address: a numeric (Internet Protocol, or IP) address, of course, because computers deal only in numbers. Humans prefer names, so the Internet folks provide us with a collection of special servers and software (the domain name system, or DNS) that automatically resolve Internet names into IP addresses.

[ ] The three-letter suffix of the domain name identifies the type of organization or business that operates that portion of the Internet. For instance, "com" is a commercial enterprise, "edu" is an academic institution, and "gov" identifies a government-based domain. Outside the United States, a less-descriptive suffix is often assigned, typically a two-letter abbreviation of the country name, such as "jp" for Japan and "de" for Deutschland. Many organizations around the world now use the generic three-letter suffixes in place of the more conventional two-letter national suffixes.

It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly Media's web server's name is www, which, along with the publisher's domain name, becomes the very easily remembered web site, www.oreilly.com. Similarly, MobileRobots' web server is named www.mobilerobots.com. Being a nonprofit organization, the World Wide Web Consortium's main server has a different domain suffix: www.w3.org. The naming convention has very obvious benefits, which you, too, should take advantage of if you are called upon to create a web server for your organization.

You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, 0 to 255, separated by periods. Valid IP addresses look like 137.237.1.87 or 192.249.1.33.

It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly because you'll rarely, if ever, use one in a URL. Rather, this is a good place to hyperlink: pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog (O'Reilly).

6.2.3.2. The http port

The port is the number of the communication port by which the client browser connects to the server. It's a networking thing: servers perform many functions besides serving up web documents and resources to client browsers, such as electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed "ports" for service-specific communications, something analogous to boxes at your local post office.

The default URL port for web servers is 80. Special secure web servers, running Secure HTTP (SHTTP) or Secure Sockets Layer (SSL), run on port 443. Most web servers today use port 80; you need to include a port number along with an immediately preceding colon in your URL if the target server does not use port 80 for web communication.
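The default-port rule is easy to express in code. This Python sketch uses the standard urllib.parse.urlsplit; the helper name effective_port and the lookup table are our own, and we assume the https scheme name for the secure servers on port 443.

```python
from urllib.parse import urlsplit

# Defaults described above; "https" as the secure scheme is an assumption.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in the URL, or the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.oreilly.com/catalog.html"))  # 80
print(effective_port("http://www.kumquat.com:8080"))          # 8080
print(effective_port("https://www.oreilly.com/"))             # 443
```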

When the Web was in its infancy, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.

Now that web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder whether that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.

6.2.3.3. The http path

The document path is the Unix-style hierarchical location of the file in the server's storage system. The pathname consists of one or more names separated by slashes. All but the last name represent directories leading down to the document. The last name is usually that of the document itself, though the web server will typically default to a file called index.html .

It has become a convention that for easy identification, HTML document names end with the suffix .html (otherwise, they're plain ASCII text files, remember?). Although recent versions of Windows allow longer suffixes, old-time developers often stick to the three-letter .htm name suffix for HTML documents.

Although the server name in a URL is not case-sensitive, the document pathname may be. Because most web servers are run on Linux-based systems, and Linux filenames are case-sensitive, those document pathnames will be case-sensitive, too. Web servers running on Windows machines are not case-sensitive, so those document pathnames are not. Because it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.
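A program that compares or stores URLs can apply the same rule: fold the server name to lowercase, but leave the pathname exactly as written. The following Python sketch does that with the standard urlsplit and urlunsplit functions. The helper name is ours, and it assumes the URL carries no username or password, which would also be lowercased by this simple approach.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_server_case(url):
    """Lowercase the server name only; the path keeps its original case."""
    parts = urlsplit(url)
    # Assumes no user:password@ prefix in the server portion.
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


print(normalize_server_case("http://WWW.Kumquat.COM/Planting/Guide.html"))
# http://www.kumquat.com/Planting/Guide.html
```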

Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing slash character, but in practice, most servers will honor the request even if this character is omitted.

If the directory name is just a slash alone, or nothing at all, the server decides what to serve to your browser, typically a so-called home page in the root directory stored as a file named index.html. Every well-designed web server should have an attractive, well-designed home page; it's a shorthand way for users to access your web collection because they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.oreilly.com into Netscape's Open dialog box and get O'Reilly's home page.

Another twist: if the first component of the document path starts with the tilde character (~), it means that the rest of the pathname begins from the personal directory in the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck would retrieve the top-level page from Chuck's document collection.

Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html . Unix-based servers are fond of the name index.html for home pages. When all else fails, servers tend to cough up a directory listing or the default HTML document in the home page directory.

6.2.3.4. The http document fragment

The fragment is an identifier that points to a specific section of a document. In URL specifications, it follows the server and pathname and is separated by the pound sign ( # ). A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail later in this chapter, you insert fragment names into a document either with the universal id tag attribute or with the name attribute for the <a> tag. In the following example, the browser loads the file named kumquat_locations.html from the www.kumquat.com server, and then displays the document starting at the section of the page named Northeast:

 http://www.kumquat.com/kumquat_locations.html#Northeast 

Like a pathname, a fragment name may be any sequence of characters, as long as you are careful with spaces and other symbolic characters.

The fragment name and the preceding hash symbol are optional; omit them when referencing a document without defined fragments.

Formally, the fragment element applies only to HTML and XHTML documents. If the target of the URL is some other document type, the browser may misinterpret the fragment name.

Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.

As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.

6.2.3.5. The http search parameter

The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.

The actual encoding of parameters in the search component depends upon the server and the resource being referenced. We cover the parameters for searchable resources later in this chapter, when we discuss searchable documents. We discuss parameters for executable resources in Chapter 9.

Although our initial presentation of http URLs indicated that a URL may have either a fragment identifier or a search component, some browsers let you use both in a single URL. If you so desire, you can follow the search parameter with a fragment identifier, telling the browser to begin displaying the results of the search at the indicated fragment. Netscape, for example, supports this usage.

We don't recommend this kind of URL, though. First and foremost, it doesn't work on all browsers. Just as important, using a fragment implies that you are sure that the results of the search will have a fragment of that name defined within the document. For large document collections, this is hardly likely. You are better off omitting the fragment, showing the search results from the beginning of the document, and avoiding potential confusion among your readers.

6.2.3.6. Sample http URLs

Here are some sample http URLs:

 http://www.oreilly.com/catalog.html
 http://www.oreilly.com
 http://www.kumquat.com:8080
 http://www.kumquat.com/planting/guide.html#soil_prep
 http://www.kumquat.com/find_a_quat?state=Florida

The first example is an explicit reference to a bona fide HTML document named catalog.html that is stored in the root directory of the www.oreilly.com server. The second references the top-level home page on that same server. That home page may or may not be catalog.html . Sample three also assumes that there is a home page in the root directory of the www.kumquat.com server and that the web connection is to the nonstandard port 8080.

The fourth example is the URL for retrieving the web document named guide.html from the planting directory on the www.kumquat.com server. Once retrieved, the browser should display the document beginning at the fragment named soil_prep.

The last example invokes an executable resource named find_a_quat with the parameter named state set to the value Florida. Presumably, this resource generates an HTML or XHTML response, most likely a new document about kumquats in Florida, which the browser then displays.
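To see how software pulls these sample URLs apart, here is a sketch using Python's standard urllib.parse module. The describe helper and its dictionary keys are our own naming, chosen to match the terms used in this section.

```python
from urllib.parse import parse_qs, urlsplit


def describe(url):
    """Break an http URL into the components discussed in this section."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        "port": parts.port,              # None when no port is given
        "path": parts.path,              # empty when only a server is named
        "search": parse_qs(parts.query),
        "fragment": parts.fragment,
    }


for sample in (
    "http://www.oreilly.com/catalog.html",
    "http://www.oreilly.com",
    "http://www.kumquat.com:8080",
    "http://www.kumquat.com/planting/guide.html#soil_prep",
    "http://www.kumquat.com/find_a_quat?state=Florida",
):
    print(describe(sample))
```

The third sample reports port 8080 and an empty path, the fourth reports the fragment soil_prep, and the last reports a search dictionary mapping state to Florida.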

6.2.4. The file URL

The file URL is perhaps the second most common one used, but it is not readily recognized by web users and particularly web authors. It points to a file stored on a computer without indicating the protocol used to retrieve the file. As such, it has limited use in a networked environment. That's a good thing. The file URL lets you load and display a locally stored document and is particularly useful for referencing personal HTML/XHTML document collections, such as those "under construction" and not yet ready for general distribution, or document collections on CD-ROM. The file URL has the following format:

 file://server/path

6.2.4.1. The file server

The file server can be, like the http one, an Internet domain name or IP address of the computer containing the file to be retrieved. Unlike http, however, which requires Transmission Control Protocol/Internet Protocol (TCP/IP) networking, the file server may also be the unqualified but unique name of a computer on a personal network, or a storage device on the same computer, such as a CD-ROM, or mapped from another networked computer. No assumptions are made as to how the browser might contact the machine to obtain the file; presumably the browser can make some connection, perhaps via a Network File System or FTP, to obtain the file.

If you omit the server name by including an extra slash ( / ) in the URL, or if you use the special name localhost , the browser retrieves the file from the machine on which the browser is running. In this case, the browser simply accesses the file using the normal facilities of the local operating system. In fact, this is the most common usage of the file URL. By creating document families on a diskette or CD-ROM and referencing your hyperlinks using the file:/// URL, you create a distributable, standalone document collection that does not require a network connection to use.

6.2.4.2. The file path

This is the path of the file to be retrieved on the desired server. The syntax of the path may differ based on the operating system of the server; be sure to encode any potentially dangerous characters in the path.

6.2.4.3. Sample file URLs

The file URL is easy:

 file://localhost/home/chuck/document.html
 file:///home/chuck/document.html
 file://marketing.kumquat.com/monthly_sales.html
 file://D:/monthly_sales.html

The first URL retrieves /home/chuck/document.html from the user's local machine off the current storage device, typically C:\ on a Windows PC. The second is identical to the first, except we've omitted the localhost reference to the server; the server name defaults to the local drive.

The third example uses some protocol to retrieve monthly_sales.html from the marketing.kumquat.com server, and the fourth example uses the local PC's operating system to retrieve the same file from the D:\ drive or device.
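The server rules for file URLs can be sketched in a few lines of Python. The helper below is hypothetical; it simply reports where each sample URL points, treating an empty server name or localhost as the local machine, as described earlier.

```python
from urllib.parse import urlsplit


def file_location(url):
    """Return (where, path) for a file URL; 'where' names the source."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL: %r" % url)
    server = parts.netloc
    # An omitted server or the special name localhost means "this machine".
    where = "local machine" if server in ("", "localhost") else server
    return where, parts.path


print(file_location("file://localhost/home/chuck/document.html"))
# ('local machine', '/home/chuck/document.html')
print(file_location("file:///home/chuck/document.html"))
# ('local machine', '/home/chuck/document.html')
print(file_location("file://marketing.kumquat.com/monthly_sales.html"))
# ('marketing.kumquat.com', '/monthly_sales.html')
print(file_location("file://D:/monthly_sales.html"))
# ('D:', '/monthly_sales.html')
```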

6.2.5. The mailto URL

The mailto URL is very common in HTML/XHTML documents. It has the browser send an electronic mail message to a named recipient. It has the format:

 mailto:address

The address is any valid email address, usually of the form:

 user@server

Thus, a typical mailto URL might look like:

 mailto:chuckandbill@kumquats.com 

You may include multiple recipients in the mailto URL, separated by commas. For example, this URL addresses the message to all three recipients:

 mailto:chuck@kumquats.com,bill@kumquats.com,booktech@ora.com 

There should be no spaces before or after the commas in the URL.

6.2.5.1. Defining mail header fields

The popular browsers open an email helper or plug-in application when the user selects a mailto URL. It may be the default email program for their system, or a common application such as Outlook Express with Internet Explorer or Netscape's built-in Communicator. With some browsers, users can designate their own email programs for handling mailto URLs by altering a specification in their browsers' Options or Preferences.

Like http search parameters that you attach at the end of the URL, separated by question marks ( ? ), you include email-related parameters with the mailto URL in the HTML document. Typically, additional parameters may include the message's header fields, such as the subject, cc (carbon copy), and bcc (blind carbon copy) recipients. How these additional fields are handled depends on the email program.

A few examples are in order:

 mailto:chuckandbill@kumquats.com?subject=Loved your book!
 mailto:chuck@kumquats.com?cc=booktech@oreilly.com
 mailto:bill@kumquats.com?bcc=archive@myserver.com

As you can probably guess, the first URL sets the subject of the message. Note that some email programs allow spaces in the parameter value and others do not. Annoyingly, you can't replace spaces with their hexadecimal equivalent, %20 , because many email programs won't make the proper substitution. It's best to use spaces because the email programs that don't honor the spaces simply truncate the parameter to the first word.

The second URL places the address booktech@oreilly.com in the cc field of the message. Similarly, the last example sets the bcc field. You may also set several fields in one URL by separating the field definitions with ampersands. For example, this URL sets the subject, cc, and bcc fields:

 mailto:chuckandbill@kumquats.com?subject=Loved your book!&cc=booktech@oreilly.com&bcc=archive@myserver.com

Not all email programs accept or recognize the bcc and cc extensions in the mailto URL; some either ignore them or append them to a preceding subject. Thus, when forming a mailto URL, it's best to order the extra fields as subject first, followed by cc and bcc. And don't depend on the cc and bcc recipients being included in the email.
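If you generate mailto URLs from a program, a small helper can enforce the ordering we recommend. The Python function below is a hypothetical sketch: it joins recipients with commas, writes the fields as subject, then cc, then bcc, and leaves spaces in the values untouched, following the advice above.

```python
def build_mailto(recipients, subject=None, cc=None, bcc=None):
    """Assemble a mailto URL with fields ordered subject, cc, bcc."""
    # Recipients are separated by commas, with no surrounding spaces.
    url = "mailto:" + ",".join(recipients)
    ordered = (("subject", subject), ("cc", cc), ("bcc", bcc))
    fields = ["%s=%s" % (name, value) for name, value in ordered if value]
    if fields:
        # A question mark introduces the fields; ampersands separate them.
        url += "?" + "&".join(fields)
    return url


print(build_mailto(["chuck@kumquats.com", "bill@kumquats.com"]))
# mailto:chuck@kumquats.com,bill@kumquats.com
print(build_mailto(["chuckandbill@kumquats.com"],
                   subject="Loved your book!",
                   cc="booktech@oreilly.com",
                   bcc="archive@myserver.com"))
# mailto:chuckandbill@kumquats.com?subject=Loved your book!&cc=booktech@oreilly.com&bcc=archive@myserver.com
```

Remember that when such a URL is written into an HTML attribute, the ampersands between fields should appear as &amp; in the document source.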

6.2.6. The ftp URL

The ftp URL is used to retrieve documents from a File Transfer Protocol (FTP) server. [*] It has the format:

 ftp://user:password@server:port/path;type=typecode

[*] FTP is an ancient Internet protocol that dates back to the Dark Ages, around 1975. It was designed as a simple way to move files among machines and is popular and useful to this day. Many HTML/XHTML authors use FTP to place files on their web servers.

6.2.6.1. The ftp user and password

FTP is an authenticated service, meaning that you usually must have a valid username and password in order to retrieve documents from a server. However, most FTP servers also support restricted, nonauthenticated access known as anonymous FTP. In this mode, anyone can supply the username "anonymous" or "guest" and be granted access to a limited portion of the server's documents. Most FTP servers also assume (but may not grant) anonymous access if the username and password are omitted.

If you are using an authenticated ftp URL to access a site that requires a username and password, include the user and password components in the URL, along with the colon ( : ) and at sign ( @ ). If you keep the user component and at sign but omit the password and the preceding colon, most browsers prompt you for a password after connecting to the FTP server. This is the recommended way of accessing authenticated resources on an FTP server because it prevents others from seeing your password.

We recommend you never place an ftp URL with a username and password in any HTML/XHTML document. The reasoning is simple: anyone can retrieve the simple text document, extract the username and password from the URL, log into the FTP server, and tamper with its documents.

6.2.6.2. The ftp server and port

The ftp server and port operate by the same rules as the server and port in an http URL. The server must be a valid Internet domain name or IP address, and the optional port specifies the port on which the server is listening for requests. If omitted, the default port number is 21.

6.2.6.3. The ftp path and typecode

The path component of an ftp URL represents a series of directories, separated by slashes, leading to the file to be retrieved. By default, the file is retrieved as a binary file; you can change this by adding the typecode (and the preceding ;type= ) to the URL.

If the typecode is set to d, the path is assumed to be a directory. The browser requests a listing of the directory contents from the server and displays this listing to the user. If the typecode is any other letter, it is used as a parameter to the FTP type command before retrieving the file referenced by the path. While some FTP servers may implement other codes, most servers accept i to initiate a binary transfer and a to treat the file as a stream of ASCII text.

6.2.6.4. Sample ftp URLs

Here are some sample ftp URLs:

 ftp://www.kumquat.com/sales/pricing
 ftp://bob@bobs-box.com/results;type=d
 ftp://bob:secret@bobs-box.com/listing;type=a

The first example retrieves the file named pricing from the sales directory on the anonymous FTP server at www.kumquat.com. The second logs into the FTP server on bobs-box.com as user bob, prompting for a password before retrieving the contents of the directory named results and displaying them to the user. The last example logs into bobs-box.com as bob with the password secret and retrieves the file named listing, treating its contents as ASCII characters.
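The pieces of an ftp URL can be picked apart programmatically, too. This Python sketch uses the standard urlsplit function plus a simple string split for the typecode. The describe_ftp helper is our own, and filling in "anonymous" and port 21 when they are omitted reflects the defaults described above, not anything the URL itself states.

```python
from urllib.parse import urlsplit


def describe_ftp(url):
    """Break an ftp URL into user, password, server, port, path, typecode."""
    parts = urlsplit(url)
    # The typecode, if any, follows ";type=" at the end of the path.
    path, marker, typecode = parts.path.partition(";type=")
    return {
        "user": parts.username or "anonymous",   # assumed default
        "password": parts.password,              # None means "prompt for it"
        "server": parts.hostname,
        "port": parts.port or 21,                # default FTP port
        "path": path,
        "typecode": typecode if marker else None,
    }


print(describe_ftp("ftp://www.kumquat.com/sales/pricing"))
print(describe_ftp("ftp://bob@bobs-box.com/results;type=d"))
print(describe_ftp("ftp://bob:secret@bobs-box.com/listing;type=a"))
```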

6.2.7. The javascript URL

The javascript URL actually is a pseudoprotocol, not usually included in discussions of URLs. With advanced browsers such as Netscape, Opera, Firefox, and Internet Explorer, the javascript URL can be associated with a hyperlink and used to execute JavaScript commands when the user selects the link. While these URLs will work, we don't recommend using them. Instead, authors should use the onclick attribute to associate JavaScript commands with elements in their documents.

6.2.7.1. The javascript URL arguments

Following the javascript pseudoprotocol is one or more semicolon-separated JavaScript expressions and methods, including references to multi-expression JavaScript functions that you embed within the <script> tag in your documents (see Chapter 12 for details). For example:

 javascript:window.alert('Hello, world!')
 javascript:doFlash('red', 'blue'); window.alert('Do not press me!')

are valid URLs you may include as the value for a link reference (see section 6.3.1.2). The first example contains a single JavaScript method that activates an alert dialog with the simple message "Hello, world!", if the user allows JavaScript to run with their browser.

The second javascript URL example contains two arguments: the first calls a JavaScript function, doFlash , which presumably you have located elsewhere in the document within the <script> tag and which perhaps flashes the background color of the document window between red and blue. The second expression is the same alert method as in the first example, with a slightly different message.

The javascript URL may appear in a hyperlink sans arguments, too. In that case, the browser may open, if enabled, a special JavaScript editor wherein the user types in and tests various expressions and methods.

6.2.8. The news URL

Although rarely used anymore, the news URL accesses either a single message or an entire newsgroup within the Usenet news system. It has two forms:

 news:newsgroup
 news:message_id

An unfortunate limitation in news URLs is that they don't allow you to specify a news server. Rather, users specify news servers in their browser preferences. At one time, not long ago, Internet newsgroups were nearly universally distributed; all news servers carried all the same newsgroups and their respective articles, so one news server was as good as any. Today, the sheer bulk of disk space needed to store the daily volume of newsgroup activity is often prohibitive for any single news server, and there's also local censorship of newsgroups. Hence, you cannot expect that all newsgroups, and certainly not all articles for a particular newsgroup, will be available on the user's news server.

Many users' browsers may not be correctly configured to read news. We recommend that you avoid placing news URLs in your documents except in rare cases.

6.2.8.1. Accessing entire newsgroups

Several thousand newsgroups are devoted to nearly every conceivable topic under the sun, and beyond. Each group has a unique name, composed of hierarchical elements separated by periods. For example, the World Wide Web announcements newsgroup is:

 comp.infosys.www.announce 

To access this group, use the URL:

 news:comp.infosys.www.announce 

6.2.8.2. Accessing single messages

Every message on a news server has a unique message identifier (ID) associated with it. This ID has the form:

 unique_string@server

The unique_string is a sequence of ASCII characters; the server is usually the name of the machine from which the message originated. The unique_string must be unique among all the messages that originated from the server. A sample URL to access a single message might be:

 news:12A7789B@news.kumquat.com 

In general, message IDs are cryptic sequences of characters not readily understood by humans. Moreover, the life span of a message on a server is usually measured in days, after which the message is deleted and the message ID is no longer valid. The bottom line: single-message news URLs are difficult to create, become invalid quickly, and generally are not used.
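The two forms of the news URL can be told apart by the at sign that every message ID contains. The following Python sketch illustrates the distinction under that assumption; the function name parse_news_url and the keys of the returned dictionary are our own choices, not part of any standard library.

```python
def parse_news_url(url):
    """Classify a news URL as a newsgroup or a single-message reference.

    Hypothetical helper for illustration only.
    """
    scheme, sep, rest = url.partition(":")
    if not sep or scheme.lower() != "news":
        raise ValueError("not a news URL: %r" % url)
    if "@" in rest:
        # Message IDs have the form unique_string@server.
        unique_string, _, server = rest.partition("@")
        return {"kind": "message",
                "unique_string": unique_string,
                "server": server}
    # Newsgroup names are hierarchical elements separated by periods.
    return {"kind": "newsgroup",
            "newsgroup": rest,
            "hierarchy": rest.split(".")}


if __name__ == "__main__":
    print(parse_news_url("news:comp.infosys.www.announce"))
    print(parse_news_url("news:12A7789B@news.kumquat.com"))
```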

6.2.9. The nntp URL

The nntp URL goes beyond the news URL to provide a complete mechanism for accessing articles in the Usenet news system. It has the form:

 nntp://server:port/newsgroup/article

6.2.9.1. The nntp server and port

The nntp server and port are defined similarly to the http server and port, described earlier. The server must be the Internet domain name or IP address of an nntp server; the port is the port on which that server is listening for requests.

If the port and its preceding colon are omitted, the default port of 119 is used.

6.2.9.2. The nntp newsgroup and article

The newsgroup is the name of the group from which an article is to be retrieved, as just defined in section 6.2.8. The article is the numeric ID of the desired article within that newsgroup. Although the article number is easier to determine than a message ID, it suffers from the same limitations as single-message references using the news URL, just described in section 6.2.8. Specifically, articles do not last long on most nntp servers, and nntp URLs quickly become invalid as a result.

6.2.9.3. Sample nntp URLs

A sample nntp URL might be:

 nntp://news.kumquat.com/alt.fan.kumquats/417 

This URL retrieves article 417 from the alt.fan.kumquats newsgroup on news.kumquat.com. Keep in mind that the article will be served only to machines that are allowed to retrieve articles from this server. In general, most nntp servers restrict access to those machines on the same local area network.
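To see how the pieces of an nntp URL fit together, the following Python sketch breaks one apart with the standard urllib.parse module and applies the default port of 119 when none is given. The function name parse_nntp_url and the dictionary keys are our own, chosen for illustration.

```python
from urllib.parse import urlsplit


def parse_nntp_url(url):
    """Split an nntp URL into server, port, newsgroup, and article.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "nntp":
        raise ValueError("not an nntp URL: %r" % url)
    # The path looks like /newsgroup/article.
    newsgroup, _, article = parts.path.lstrip("/").partition("/")
    return {
        "server": parts.hostname,
        # Fall back to the default nntp port when none is given.
        "port": parts.port if parts.port is not None else 119,
        "newsgroup": newsgroup,
        "article": int(article) if article else None,
    }


if __name__ == "__main__":
    print(parse_nntp_url("nntp://news.kumquat.com/alt.fan.kumquats/417"))
```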

6.2.10. The telnet URL

The telnet URL opens an interactive session with a desired server, allowing the user to log in and use the machine. Often, the connection to the machine automatically starts a specific service for the user; in other cases, the user must know the commands to type to use the system. The telnet URL has the form:

 telnet://user:password@server:port

6.2.10.1. The Telnet user and password

The Telnet user and password are defined exactly like the user and password components of the ftp URL, described previously. In particular, the same caveats apply regarding protecting your password and never placing it within a URL.

Just like the ftp URL, if you omit the password from the URL, the browser should prompt you for a password just before contacting the Telnet server.

If you omit both the user and the password, the Telnet connection is made without supplying a username. For some servers, Telnet automatically connects to a default service when no username is supplied. For others, the browser may prompt for a username and password when making the connection to the Telnet server.

6.2.10.2. The Telnet server and port

The Telnet server and port are defined similarly to the http server and port, described earlier. The server must be the Internet domain name or IP address of a Telnet server; the port is the port on which that server is listening for requests. If the port and its preceding colon are omitted, the default port of 23 is used.
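The optional parts of a telnet URL can be illustrated with a short Python sketch built on the standard urllib.parse module. The function name parse_telnet_url and the dictionary keys are our own; the sample URLs omit the password, in keeping with the advice never to place one in a URL.

```python
from urllib.parse import urlsplit


def parse_telnet_url(url):
    """Split a telnet URL into user, password, server, and port.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "telnet":
        raise ValueError("not a telnet URL: %r" % url)
    return {
        "user": parts.username,      # None when omitted
        "password": parts.password,  # None when omitted
        "server": parts.hostname,
        # Fall back to the default Telnet port when none is given.
        "port": parts.port if parts.port is not None else 23,
    }


if __name__ == "__main__":
    print(parse_telnet_url("telnet://bob@bobs-box.com"))
    print(parse_telnet_url("telnet://bobs-box.com:2323"))
```

A user of None corresponds to the case in which the connection is made without supplying a username; a password of None with a user present is the case in which the browser should prompt for the password.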

6.2.11. The gopher URL

Gopher is a web-like document-retrieval system that achieved some popularity on the Internet just before the Web took off, making gopher obsolete. Some gopher servers still exist, though, and the gopher URL lets you access gopher documents.

The gopher URL has the form:

 gopher://server:port/path

6.2.11.1. The gopher server and port

The gopher server and port are defined similarly to the http server and port, described previously. The server must be the Internet domain name or IP address of a gopher server; the port is the port on which that server is listening for requests.

If the port and its preceding colon are omitted, the default port of 70 is used.

6.2.11.2. The gopher path

The gopher path can take one of three forms:

 type/selector
 type/selector%09search
 type/selector%09search%09gopherplus

The type is a single character value denoting the type of the gopher resource. If the entire path is omitted from the gopher URL, the type defaults to 1.

The selector corresponds to the path of a resource on the gopher server. It may be omitted, in which case the top-level index of the gopher server is retrieved.

If the gopher resource is actually a gopher search engine, the search component provides the string for which to search. The search string must be preceded by an encoded horizontal tab (%09).

If the gopher server supports gopher+ resources, the gopherplus component supplies the necessary information to locate that resource. The exact content of this component varies based upon the resources on the gopher server. This component is preceded by an encoded horizontal tab (%09). If you want to include the gopherplus component but omit the search component, you must still supply both encoded tabs within the URL.
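The three path forms can be handled uniformly by splitting the path on the encoded tab. The following Python sketch follows the type/selector layout as given in this section and applies the defaults described above (port 70, type 1). The function name parse_gopher_url, the dictionary keys, and the host gopher.kumquat.com are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, unquote


def parse_gopher_url(url):
    """Split a gopher URL into its server, port, and path components.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "gopher":
        raise ValueError("not a gopher URL: %r" % url)
    result = {
        "server": parts.hostname,
        "port": parts.port if parts.port is not None else 70,
        "type": "1",         # default when the path is omitted
        "selector": "",      # empty selector means the top-level index
        "search": None,
        "gopherplus": None,
    }
    path = parts.path.lstrip("/")
    if not path:
        return result
    # Components after the selector are separated by encoded tabs.
    fields = path.split("%09")
    result["type"], _, selector = fields[0].partition("/")
    result["selector"] = unquote(selector)
    if len(fields) > 1:
        result["search"] = unquote(fields[1])
    if len(fields) > 2:
        result["gopherplus"] = unquote(fields[2])
    return result


if __name__ == "__main__":
    print(parse_gopher_url("gopher://gopher.kumquat.com/7/find%09kumquat%20pie"))
```

When the gopherplus component is present but the search is omitted, both encoded tabs still appear, so the search comes back as an empty string rather than None.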



HTML & XHTML(c) The definitive guide
ISBN: 596527322