1.1 Web Architecture

only for RuBoard - do not distribute or recompile

Before we can talk more about caching, we need to agree on some terminology. Whenever possible, I use words and meanings taken from Internet standards documents. Unfortunately, colloquial usage of web caching terminology is often just different enough to be confusing.

1.1.1 Clients and Servers

The fundamental building blocks of the Web (and indeed most distributed systems) are clients and servers. A web server manages and provides access to a set of resources. The resources might be simple text files and images, or something more complex, such as a relational database. Clients, also known as user agents, initiate a transaction by sending a request to a server. The server then processes the request and sends a response back to the client.

On the Web, most transactions are download operations; the client downloads some information from the server. In these cases, the request itself is quite small (about 200 bytes) and contains the name of the resource, plus a small amount of additional information from the client. The information being downloaded is usually an image or text file with an average size of about 10,000 bytes. This asymmetry is what makes cable- and satellite-based Internet services viable: their data rates for receiving are much higher than their data rates for sending, which suits web users, who mostly receive information.
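To make the size asymmetry concrete, here is a minimal Python sketch that builds a typical download request by hand and compares its size with the average response size quoted above. The header values and the client name are illustrative assumptions, not taken from any particular browser.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Build a simple HTTP/1.1 GET request (header values are illustrative)."""
    lines = [
        f"GET {path} HTTP/1.1",          # the name of the resource
        f"Host: {host}",
        "User-Agent: ExampleBrowser/1.0",  # hypothetical client name
        "Accept: text/html, image/gif, image/jpeg",
        "Accept-Language: en",
        "Connection: close",
        "",                              # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.zoidberg.net", "/index.html")
AVERAGE_RESPONSE_BYTES = 10_000  # average object size quoted in the text

print(len(request), "request bytes")
print(f"response/request ratio is roughly {AVERAGE_RESPONSE_BYTES / len(request):.0f} to 1")
```

Real browsers send more headers than this, so their requests are somewhat larger, but the response still outweighs the request by a wide margin.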

A small percentage of web transactions are more correctly characterized as upload operations. In these cases, requests are relatively large and responses are very small. Examples of uploads include sending an email message and transferring an image file from your computer to a server.

The most common web clients are called browsers. These are applications such as Netscape Navigator and Microsoft Internet Explorer. The purpose of a browser is to render web content for us to view and interact with. Because of the myriad features present in web browsers, they are very large and complicated programs. In addition to the GUI-based clients, there are a few simple command-line client programs, such as Lynx and Wget.

A number of different servers are in widespread use on the Web. The Apache HTTP server is a popular choice and freely available. Netscape, Microsoft, and other companies also have server products. Many content providers are concerned with the performance of their servers. The most popular sites on the Net can receive ten million requests per day with peak request rates of 1000 per second. At this scale, both the hardware and software must be very carefully designed to cope with the load. Many sites run multiple servers in parallel to handle their high request rates and for redundancy.

Recently, there has been a lot of excitement surrounding peer-to-peer applications, such as Napster. In these systems, clients share files and other resources (e.g., CPU cycles) directly with each other. Napster, which enables people to share MP3 files, does not store the files on its servers. Rather, it acts as a directory and returns pointers to files so that two clients can communicate directly. In the peer-to-peer realm, there are no centralized servers; every client is a server.

The peer-to-peer movement is relatively young but already very popular. It's likely that a significant percentage of Internet traffic today is due to Napster alone. However, I won't discuss peer-to-peer clients in this book. One reason for this is that Napster uses its own transfer protocol, whereas here we'll focus on HTTP.

1.1.2 Proxies

Much of this book is about proxies. A proxy is an intermediary in a web transaction. It is an application that sits somewhere between the client and the origin server. Proxies are often used on firewalls to provide security. They allow (and record) requests from the internal network to the outside Internet.

A proxy behaves like both a client and a server. It acts like a server to clients, and like a client to servers. A proxy receives and processes requests from clients, and then it forwards those requests to origin servers. Some people refer to proxies as "application layer gateways." This name reflects the fact that the proxy lives at the application layer of the OSI reference model, just like clients and servers. An important characteristic of an application layer gateway is that it uses two TCP connections: one to the client and one to the server. This has important ramifications for some of the topics we'll discuss later.
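The two-connection behavior can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a usable proxy: it relays a single request, assumes the request has no body, and assumes the origin server closes its connection when the response is complete. The function names are my own.

```python
import socket


def read_until(sock: socket.socket, marker: bytes) -> bytes:
    """Read from sock until marker appears or the peer closes the connection."""
    data = b""
    while marker not in data:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def read_all(sock: socket.socket) -> bytes:
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def proxy_once(listener: socket.socket, origin_addr) -> None:
    """Accept one client, forward its request to the origin, relay the response."""
    # Connection 1: the proxy acts as a server toward the client.
    client_conn, _ = listener.accept()
    with client_conn:
        request = read_until(client_conn, b"\r\n\r\n")
        # Connection 2: the proxy acts as a client toward the origin server.
        with socket.create_connection(origin_addr) as origin_conn:
            origin_conn.sendall(request)
            response = read_all(origin_conn)
        client_conn.sendall(response)
```

Because the request and response both pass through the proxy process, this is the point where it could log, filter, or cache them.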

Proxies are used for a number of different things, including logging, access controls, filtering, translation, virus checking, and caching. We'll talk more about these and the issues they create in Chapter 3.

1.1.3 Web Objects

I use the term object to refer to the entity exchanged between a client and a server. Some people may use document or page, but these terms are misleading because they imply textual information or a collection of text and images. "Object" is generic and better describes the different types of content returned from servers, such as audio files, ZIP files, and C programs. The standards documents (RFCs) that describe web components and protocols prefer the terms entity, resource, and response. My use of object corresponds to their use of entity, where an object (entity) is a particular response generated from a particular resource. Web objects have a number of important characteristics, including size (number of bytes), type (HTML, image, audio, etc.), time of creation, and time of last modification.

In broad terms, web resources can be considered either dynamic or static. Responses for dynamic resources are generated on the fly when the request is made. Static responses are pregenerated, independent of client requests. When people think of dynamic responses, often what comes to mind are stock quotes, live camera images, and web page counters. Digitized photographs, magazine articles, and software distributions are all static information. The distinction between dynamic and static content is not necessarily so clearly defined. Many web resources are updated at various intervals (perhaps daily) but not uniquely generated on a per-request basis. The distinction between dynamic and static resources is important because it has serious consequences for cache consistency.

1.1.4 Resource Identifiers

Resource identifiers are a fundamental piece of the architecture of the Web. These are the names and addresses for web objects, analogous to street addresses and telephone numbers. Officially, they are called Uniform Resource Identifiers, or URIs. They are used by people and computers alike. Caches use them to identify and index the stored objects. According to the design specification, RFC 2396, URIs must be extensible, printable, and able to encode all current and future naming schemes. Because of these requirements, only certain characters may appear in URIs, and some characters have special meanings.

Uniform Resource Locators (URLs) are the most common form of URI in use today. The URL syntax is described in RFC 1738. Here are some sample URLs:

http://www.zoidberg.net
http://www.oasis-open.org/specs/docbook.shtml
ftp://ftp.freebsd.org/pub/FreeBSD/README.TXT

URLs have a very important characteristic worth mentioning here. Every URL includes a network host address: either a hostname or an IP address. Thus, a URL is bound to a specific server, called the origin server. This characteristic has some negative side effects for caching. Occasionally, the same resource exists on two or more servers, as occurs with mirror sites. When a resource has more than one name, it can get cached under different names. This wastes storage space and bandwidth.
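The following Python sketch illustrates both points: it pulls the host address out of each sample URL, then shows how a cache indexed by URL ends up holding the same bytes twice when a resource is mirrored. The `cache` dictionary is a toy stand-in for real cache storage, and the mirror hostname `mirror.example.org` is hypothetical.

```python
from urllib.parse import urlsplit

urls = [
    "http://www.zoidberg.net",
    "http://www.oasis-open.org/specs/docbook.shtml",
    "ftp://ftp.freebsd.org/pub/FreeBSD/README.TXT",
]

# Every URL names a specific host, which is the origin server.
for url in urls:
    parts = urlsplit(url)
    print(parts.scheme, parts.hostname, parts.path or "/")

# A toy cache indexed by URL (a plain dict standing in for real storage).
cache = {}


def store(url: str, body: bytes) -> None:
    cache[url] = body


readme = b"identical bytes on every mirror"
store("ftp://ftp.freebsd.org/pub/FreeBSD/README.TXT", readme)
store("ftp://mirror.example.org/pub/FreeBSD/README.TXT", readme)  # hypothetical mirror

stored_bytes = sum(len(body) for body in cache.values())
unique_bytes = sum(len(body) for body in set(cache.values()))
print(f"{len(cache)} entries, {stored_bytes} bytes stored, {unique_bytes} bytes unique")
```

Since the cache has no way of knowing that the two names refer to the same resource, it fetches and stores the object once per name.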

Uniform Resource Names (URNs) are similar to URLs, but they refer to resources in a location-independent manner. RFC 2141 describes URNs, which are also sometimes called persistent names. Resources named with URNs can be moved from one server (location) to another without causing problems. Here are some sample (hypothetical) URNs:

  urn:duns:002372413:annual-report-1997
  urn:isbn:156592536X

In 1995, the World Wide Web Project left its birthplace at CERN in Geneva, Switzerland, and became the World Wide Web Consortium. In conjunction with this move, their web site location changed from info.cern.ch to www.w3c.org. Everyone who used a URL with the old location received a page with a link to the new location and a reminder to "update your links and hotlist." ^[1] Had URNs been implemented and in use back then, such a problem could have been avoided.

^[1] Many years after this change, accessing info.cern.ch still generated a response with a link to http://www.w3c.org.

Another advantage of URNs is that a single name can refer to a resource replicated at many locations. When an application processes such a URN request, it must select one of the locations (presumably the closest or fastest) from which to retrieve the object. RFC 2168 describes methods for resolving URNs.
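As a rough sketch of the selection step, the Python below splits a URN into its namespace identifier and namespace-specific string, following the `urn:<NID>:<NSS>` form from RFC 2141, and then picks the fastest of several replicas. It does not implement the DNS-based resolution described in RFC 2168. The `REPLICAS` table, its URLs, and its round-trip times are all hypothetical values invented for illustration.

```python
def parse_urn(urn: str):
    """Split 'urn:<NID>:<NSS>' into (nid, nss); the prefix and NID are case-insensitive."""
    scheme, nid, nss = urn.split(":", 2)
    if scheme.lower() != "urn" or not nid or not nss:
        raise ValueError(f"not a URN: {urn!r}")
    return nid.lower(), nss


# Hypothetical replica table: URN -> list of (location URL, round-trip time in ms).
REPLICAS = {
    "urn:isbn:156592536X": [
        ("http://mirror-a.example.com/books/156592536X", 120.0),
        ("http://mirror-b.example.net/books/156592536X", 35.0),
    ],
}


def resolve(urn: str) -> str:
    """Return the location with the lowest round-trip time for this URN."""
    nid, nss = parse_urn(urn)
    candidates = REPLICAS[f"urn:{nid}:{nss}"]
    url, _rtt = min(candidates, key=lambda candidate: candidate[1])
    return url


print(parse_urn("urn:duns:002372413:annual-report-1997"))
print(resolve("urn:isbn:156592536X"))
```

The name stays the same no matter which replica is chosen, so a cache could index the object once under its URN rather than once per location.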

Unfortunately, URNs have been very slow to catch on. Very few applications are able to handle URNs, while everyone and everything knows about URLs. Throughout the remainder of this book, I'll use URI and URL somewhat interchangeably. I won't say much more about URNs, but keep in mind that URI is a generic term that refers to both URLs and URNs.
