An amazing wealth of information is available on the Web. Although many people complain of information overload, the real problem isn't volume. It's relevance. How can a particular piece of information be located quickly and easily? If the Web were ideal, it would be like the computer on the television show Star Trek, which always seems to deliver in a matter of seconds any information a user requests. On the Internet, a request to a search tool often yields an overwhelming list of tens of thousands of entries. Some of these entries might be outdated, the documents to which others refer might have moved, or the server that specifies an entry might be unreachable. Although the Web isn't science fiction, many of the computer and information systems presented in science fiction represent valid goals for the Web. The key problem with building a more organized Web is URL-based addressing.
The primary problem with URLs is that they define location rather than meaning. In other words, URLs specify where something is located on the Web, not what it is or what it's about. This might not seem to be a big deal, but it is. This issue becomes obvious when the problems with URLs are enumerated:
URLs aren't persistent. Documents move around, servers change names, and documents might eventually be deleted. This is the nature of the Web, and the reason why the 404 Not Found message is so common. When users hit a broken link, they might be at a loss to determine what happened to the document and how to locate its new home. Wouldn't it be nice if, no matter what happened, a unique identifier indicated where to get a copy of the information?
URLs are often long and confusing. People often have to transcribe addresses. For example, the following is quite a lot to type, read to someone, or keep from breaking across lines in an e-mail:
http://www.democompany.com/about/press/pressdetail.cfm?id=7&view=screen
Firms are already scrambling for short domain names and paths to improve the typability of URLs, and most folks tend to omit the protocol when discussing things. Despite these minor cleanups, many URLs are very long and "dirty," filled with all sorts of special characters.
URLs create an artificial bottleneck and extreme reliance on DNS services by specifying location rather than meaning. For example, the text of the HTML 4.01 specification is a useful document and certainly has an address at the W3C Web site. But does it live in other places on the Internet? It probably is mirrored in a variety of locations, but what happens if the W3C server is unreachable, or DNS services fail to resolve the host? In this case, the resource is unreachable. URLs create a point source for information. Rather than trying to find a particular document, wherever it might be on the Internet, Web users try to go to a particular location. Rather than talking about where something is, Web users should try to talk about what that something is.
Talking about what a document is rather than where it is makes sense when you consider how information is organized outside the Internet. Nobody talks about which library carries a particular book, or what shelf it is on. The relevant information is the title of the book, its author, and perhaps some other information. But what happens if two or more books have the same title, or two authors have the same name? This actually is quite common. Generally, a book should have a unique identifier such as an ISBN that, when combined with other descriptive information, such as the author, publisher, and publication date, uniquely describes the book. This naming scheme enables people to specify a particular book and then hunt it down.
The Web, however, isn't as ordered as a library. On the Web, people name their documents whatever they like, and search robots organize their indexes however they like. Categorizing things is difficult. The only unique item for documents is the URL, which simply says where the document lives. But how many URLs does the HTML 4 specification have? A document might exist in many places. Even worse than a document with multiple locations, what happens when the content at the location changes? Perhaps a particular URL points to information about dogs one day and cats the next. This is how the Web really is. However, a great deal of research is being done to address some of the shortcomings of the Web and its addressing schemes.
A new set of addressing ideas, including URNs, URCs, and URIs, is emerging to remedy some of the Web's shortcomings. A uniform resource name (URN) can locate a resource by giving it a unique symbolic name rather than a unique address. Network services analogous to the current DNS services will transparently translate a URN into the URL (server IP address, directory path, and filename) needed to actually locate a resource. This translation could be used to select the closest server, to improve document delivery speed, or to try various backup servers in case a server is unavailable. The benefit of the abstraction provided by URNs should be obvious from this simple idea alone.
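The translation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a real resolution service: the registry, the URN `urn:example:html401`, and the mirror host `mirror.example.org` are hypothetical, and the reachability check is passed in as a function so the example runs without a network.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry mapping a URN to the URLs where copies live.
# The URN and the mirror host are illustrative, not real registrations.
REGISTRY: Dict[str, List[str]] = {
    "urn:example:html401": [
        "http://www.w3.org/TR/html401/",
        "http://mirror.example.org/html401/",
    ],
}


def resolve(urn: str, is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable URL registered for urn, or None."""
    for url in REGISTRY.get(urn, []):
        if is_reachable(url):
            return url
    return None


# Simulate the primary server being down: only the mirror answers.
down = {"http://www.w3.org/TR/html401/"}
print(resolve("urn:example:html401", lambda url: url not in down))
```

Because the caller asks for the name rather than a location, the fallback to a backup server happens without the user ever seeing a broken link.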
To better understand the logic behind URNs, consider domain names, such as www.democompany.com. These names are already translated into numeric IP addresses, such as 192.102.249.3, all the time. This mapping provides the ability to change a machine's numeric address or location without seriously disrupting access to it because the name stays the same. Furthermore, numeric addresses provide no meaning to a user, whereas domain names provide some indication of the entity in question. Obviously, the level of abstraction provided by a system such as DNS would make sense on the Web. Rather than typing some unwieldy URL, a URN would be issued that would be translated to an underlying URL. Some experts worry that using a resolving system to translate URNs to URLs is inherently flawed and will not scale well. Because DNS is fairly fragile, there might be some truth to this concern. Another problem is that, in reality, URNs probably won't be something easy to remember, such as urn:booktitle, but will instead be something more difficult, such as urn:isbn:0-12-518408-5.
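To make the shape of such a name concrete, here is a minimal sketch that splits a URN into its namespace and the identifier within that namespace. It assumes the common `urn:<namespace>:<identifier>` layout; the `Urn` type and `parse_urn` function are names invented for this illustration, and the sketch does not validate the identifier itself.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    nid: str  # namespace identifier, e.g. "isbn"
    nss: str  # namespace-specific string, e.g. the ISBN itself


def parse_urn(text: str) -> Urn:
    """Split 'urn:<nid>:<nss>' into its two parts."""
    scheme, sep1, rest = text.partition(":")
    nid, sep2, nss = rest.partition(":")
    if scheme.lower() != "urn" or not sep1 or not sep2 or not nid or not nss:
        raise ValueError(f"not a URN: {text!r}")
    return Urn(nid.lower(), nss)


print(parse_urn("urn:isbn:0-12-518408-5"))
```

A resolver would use the namespace part to decide which naming authority to ask, much as DNS uses the rightmost labels of a domain name.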
A uniform resource characteristic (URC), also known as a uniform resource citation, describes a set of attribute/value pairs that defines some aspect of an information resource. For example, in the case of a book, a URC might describe a publication date, number of pages, author, and so on. The form of a URC is still under discussion; in practice, however, the kind of information a URC would provide is already commonly supplied in the form of simple <meta> tags.
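Since a URC is logically just a set of attribute/value pairs, the connection to <meta> tags can be shown with a short sketch. The function name `urc_to_meta` and the sample values are assumptions made for illustration; no particular URC format is implied.

```python
from html import escape
from typing import Dict


def urc_to_meta(pairs: Dict[str, str]) -> str:
    """Render attribute/value pairs as one <meta> tag per line."""
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in pairs.items()
    )


# Illustrative values only; none of these come from a real catalog record.
book = {
    "author": "A. N. Author",
    "date": "2003-01-01",
    "pages": "320",
}
print(urc_to_meta(book))
```

Escaping the values keeps quotation marks or angle brackets in a title or author name from breaking the surrounding markup.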
Combined, a URL, URN, and a collection of URCs describe an information resource. For example, the document "Demo Company Corporate Summary" might have a unique URN such as urn://corpid:55127.
Note: The syntax of the preceding URN is fictional. It simply shows that URNs probably won't have easily remembered names and that many naming schemes can be used, such as ISBNs or corporate IDs.
The "Demo Company Corporate Summary" also would have a set of URCs that describes the rating of the file, the author, the publisher, and so on. In addition, the document would have one or more locations on the Web where it lives, such as the following traditional URLs:
http://www.democompany.com/about/corp.html
http://www.democompany.co.jp/about/corp.html
Taken all together, a particular information resource has been identified. The collection of information, which is used to identify this document specifically, is termed a uniform resource identifier (URI).
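One way to picture this collection is as a single record that holds the name, the descriptive pairs, and the known locations together. The sketch below is only a mental model under that assumption: the class name `ResourceDescription`, its field names, and the publisher value are hypothetical, and the URN keeps the fictional syntax used in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceDescription:
    urn: str                                             # what the resource is
    urcs: Dict[str, str] = field(default_factory=dict)   # facts about it
    urls: List[str] = field(default_factory=list)        # where copies live


summary = ResourceDescription(
    urn="urn://corpid:55127",  # fictional syntax, as in the text
    urcs={
        "title": "Demo Company Corporate Summary",
        "publisher": "Demo Company",  # assumed value for illustration
    },
    urls=[
        "http://www.democompany.com/about/corp.html",
        "http://www.democompany.co.jp/about/corp.html",
    ],
)
print(summary.urn, len(summary.urls))
```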
Note: Occasionally, URI is used interchangeably with URL. Although this is acceptable, research into the theories behind the names suggests that the term URI is more generic than URL, and encompasses the idea of an information resource. Currently, a URL is the only common way to identify an information resource on the Internet. Although technically a URL could be considered a URI, this confuses the issue and obscures the ultimate goal of trying to talk about information more generally than in terms of a network location.
Although many of the ideas covered here are still being discussed, some systems, such as Persistent URLs, or PURLs (www.purl.org), and Handles (www.handle.net), already implement many of the features of URNs and URCs. Furthermore, many browser vendors and large Web sites are implementing special keyword navigation schemes that mimic many of the ideas of URNs and URCs. Unfortunately, as of the writing of this book, none of these approaches is widely implemented or accepted. Although any of these approaches probably can be considered true URIs when compared to the URLs used today, URLs are likely to remain the most common way to describe information on the Web for the foreseeable future. Therefore, the system has to be improved slightly and even extended to deal with new types of information and access methods.
For now, unusable URLs are commonplace, but they are starting to change for the better. As mentioned earlier in the chapter, URLs might be cleaned, particularly when they have long query strings. It is even possible to clean file extensions off URLs so that a URL such as http://www.democompany.com/about/press/pressdetail.cfm?id=7&view=screen can become http://democompany.com/about/press/pressdetail7. To clean up a URL, Web designers should first name files and directories well. Further cleaning comes primarily from server-side technologies that rewrite URLs; mod_rewrite for Apache or similar products for IIS such as pageXchanger can do the trick, but they obviously require some knowledge of Web server setup as well as site design.
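The rewriting idea itself is simple pattern matching: the server accepts the clean path and quietly maps it to the real script and query string. The following Python sketch shows that mapping for the example above. It is not mod_rewrite configuration; the pattern and the function name `to_internal` are assumptions chosen for this illustration.

```python
import re

# Hypothetical rule: /about/press/pressdetail<N> maps to the .cfm script.
CLEAN = re.compile(r"^/about/press/pressdetail(\d+)$")


def to_internal(path: str) -> str:
    """Map a clean public path to the internal query-string form."""
    match = CLEAN.match(path)
    if match is None:
        return path  # leave unrelated paths untouched
    return f"/about/press/pressdetail.cfm?id={match.group(1)}&view=screen"


print(to_internal("/about/press/pressdetail7"))
```

Visitors see and share only the short form, while the existing server-side script keeps working unchanged.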
In addition to cleaned URLs, new protocols are emerging as the Web starts to converge with TV, video games, and cell phones. For example, a television channel URL form might look like tv://channel, whereby channel is either an alphanumeric name (such as nbc or nbc7-39) or a numeric channel number. Similarly, a phone URL might look like phone://phone-number, with a numeric value for the phone number and any extra digit information required, such as the country code or calling card information. For example, phone://+1-555-270-2086 might dial a phone number in the United States. New schemes are being proposed all the time. A variety of esoteric schemes are out there already. If you are interested in new URL schemes, take a look at the W3C area on addressing (www.w3.org/Addressing/) for more information.
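Because these proposed forms follow the familiar scheme://target layout, a general-purpose URL parser can already take them apart. The sketch below uses Python's standard urllib.parse.urlsplit for that; the `describe` function and the way it interprets each scheme are assumptions for illustration, since the tv and phone forms shown in the text are proposals rather than settled syntax.

```python
from urllib.parse import urlsplit


def describe(url: str) -> str:
    """Summarize a tv:// or phone:// URL of the form shown in the text."""
    parts = urlsplit(url)
    target = parts.netloc  # the text after "//"
    if parts.scheme == "tv":
        kind = "channel number" if target.isdigit() else "channel name"
        return f"tv {kind}: {target}"
    if parts.scheme == "phone":
        # Keep only the digits a dialer would need.
        digits = "".join(ch for ch in target if ch.isdigit())
        return f"phone digits: {digits}"
    return f"unhandled scheme: {parts.scheme}"


for example in ("tv://nbc7-39", "tv://39", "phone://+1-555-270-2086"):
    print(describe(example))
```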