Proxy Servers and Caching | Lan Tutorial With Glossary of Terms: A Complete Introduction to Local Area Networks (Lan Networking Library)

Old ideas aren't the trendiest, but they're often the best. Take the example of Yahoo!, whose founders Jerry Yang and David Filo saw that despite the merits of search engines, the venerable concept of card catalog subject headings still had plenty of merit. By a process of reinvention, and without bothering to hire any actual librarians, Yang and Filo quickly proceeded to fortune and fame.

Or take the example of the proxy server. This is a concept that sounds trendy and cutting-edge, but its roots are also in dusty old library science. Remember in college when you first needed to check out a book housed in your university's locked stacks? Since you weren't allowed to go into this secured part of the library, a staff member acted as your proxy and retrieved the book for you.

All too often, of course, this process took longer than if you'd been able to go to the shelf and get the book yourself. But suppose that each time librarians retrieved a book for one student, they also made several copies, keeping them at the front desk for other users who requested the same title. The result would've been an ideal blend of fast service and airtight security.

This analogy explains the two main functions of a proxy server. First, the proxy server acts as an intermediary, helping users on a private network get information from the Internet when they need it, while ensuring that network security is maintained. Second, a proxy server may store frequently requested information in a local disk cache, rapidly delivering it to multiple users without having to go back to the Internet to get it.

The Layered Approach

A proxy server is usually just one component of software that provides a variety of other services, such as a gateway to connect the local network to the Internet or a firewall to provide protection from outside intrusion.

Because proxy servers and firewalls are so often bundled together, people often confuse the two. However, a packet-filtering firewall operates at the Network layer of the OSI model, while proxy servers work at the Application layer. Packet filters use routers to filter information coming to and from a network. Because routers check each packet against some sort of access-control table (listing, for example, the IP addresses of trusted servers), they make it easy to block traffic that's not trusted. Firewalls can also screen packets based upon TCP and UDP port numbers; therefore, you can permit certain types of connections (Telnet or FTP, for example) to only certain trusted servers.
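The access-control check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular router implements it: the Rule fields, the RULES table, the addresses, and the filter_packet function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                # "permit" or "deny"
    source: str                # CIDR block the packet claims to come from
    destination: str           # CIDR block the packet is addressed to
    protocol: str              # "tcp" or "udp"
    dest_port: Optional[int]   # None matches any destination port


# Hypothetical table: internal hosts may reach one trusted server,
# and only for Telnet (port 23) and FTP control (port 21).
RULES = [
    Rule("permit", "192.168.1.0/24", "203.0.113.10/32", "tcp", 23),
    Rule("permit", "192.168.1.0/24", "203.0.113.10/32", "tcp", 21),
]


def filter_packet(src, dst, protocol, dest_port, rules=RULES):
    """Return the action of the first matching rule, or "deny" if none match."""
    for rule in rules:
        if (ip_address(src) in ip_network(rule.source)
                and ip_address(dst) in ip_network(rule.destination)
                and protocol == rule.protocol
                and rule.dest_port in (None, dest_port)):
            return rule.action
    return "deny"
```

Note that the function looks only at the addresses and ports the packet itself reports; it has no way of knowing whether the source address is genuine.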

Packet-filtering firewalls have the advantage of speed, and they require no special configuration on the part of end-user applications. On the other hand, creating complex access rules can be difficult. Further, all packet filters can do is grant or deny access based on a packet's apparent source or destination address. Hackers can fool such firewalls by forging source addresses via IP packet spoofing. Since client-server connections are direct, hackers can also use packet sniffers to discern a network's address structure with relative ease.

In our library analogy, the equivalent of a packet-filtering firewall would be the librarian keeping a list of trusted students, then allowing only those individuals into the locked stacks to retrieve books. This might make book retrieval faster, but it would require that a list be created and maintained. It would also be vulnerable to impostors who turn up at the front desk bearing fake IDs.

Proxy servers are different. They break the direct link between client and server (or, if you will, between the student and the valuable book). They start by performing network address translation, mapping all of a network's internal IP addresses to a single "safe" IP address. Since the latter is the only address the untrusted network is aware of, attacks that depend on forging a trusted internal address become far more difficult.

Because they operate at the Application layer of the OSI model, proxy servers can do a lot more. Any given proxy server includes a collection of application-specific proxies: an HTTP proxy for Web pages; an FTP proxy; an SMTP/POP proxy for e-mail; a Network News Transfer Protocol (NNTP) proxy for news servers; a RealAudio/RealVideo proxy; and more. Each of these proxies accepts only packets generated by services it is designed to copy, forward, and filter.

Application-specific proxies are almost infinitely configurable. For example, they can be set to block access to certain Web servers at all times, let only certain users play RealAudio files, permit FTP downloads but not uploads, or keep employees from logging on to their personal America Online accounts until after 5 p.m. Proxy servers can also bar specific MIME types and, in conjunction with a third-party plug-in such as SurfWatch, even filter content.
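As a rough sketch of what such rules amount to, the Python function below decides whether a single request should be passed along. It is not taken from any real proxy product: the host names, the blocked MIME type, the 5 p.m. cutoff constant, and the allow_request function are assumptions made for the example. The only protocol details relied on are the standard FTP commands RETR (download) and STOR (upload).

```python
from datetime import time

# Hypothetical policy data, invented for illustration.
BLOCKED_HOSTS = {"blocked.example.com"}       # never reachable
AFTER_HOURS_ONLY = {"personal.example.net"}   # reachable only after work
BLOCKED_MIME_TYPES = {"video/mpeg"}           # content types to bar
WORKDAY_END = time(17, 0)                     # 5 p.m.


def allow_request(protocol, host, command, now, mime_type=None):
    """Return True if the proxy should forward the request, False to block it.

    `now` is a datetime.time giving the local time of the request.
    """
    if host in BLOCKED_HOSTS:
        return False
    if host in AFTER_HOURS_ONLY and now < WORKDAY_END:
        return False
    if protocol == "ftp" and command == "STOR":
        return False                          # uploads are refused
    if mime_type in BLOCKED_MIME_TYPES:
        return False
    return True
```

A packet filter could not express the upload rule at all, because the distinction between RETR and STOR exists only inside the FTP conversation, at the Application layer.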

Proxy servers also do a superior job of logging network traffic, and can ensure that connectivity is always available for certain traffic types. For example, a small office might be connected to the Internet at all times for Web browsing via a single dial-up connection; a proxy server could automatically bring up a second dial-up connection when a user starts a long download via FTP.

As usual, though, the flip side of extensive configurability is complexity. Client applications such as Web browsers and RealAudio players must often be reconfigured to be made aware of proxy servers. In addition, as new Internet services become available and use new protocols and ports, new proxies must be written to support them. The process of adding users and defining permissions can also be complicated, though some proxy servers ease this task by working with Lightweight Directory Access Protocol (LDAP) information.

Establishing A Virtual Circuit

Circuit-level proxy servers were devised to simplify matters. Instead of operating at the Application layer, they work as a "shim" between the Application layer and the Transport layer, monitoring the TCP handshaking between trusted clients or servers and untrusted hosts, and vice versa. The proxy server is still an intermediary between the two parties, but this time it establishes a virtual circuit between them.

With a circuit-level proxy, client software no longer needs to be configured on a case-by-case basis. With Microsoft's Proxy Server, for example, once WinSock Proxy software has been installed onto a client computer (a one-time procedure), client software such as the Windows Media Player, Internet Relay Chat (IRC), or Telnet will perform just as if it were directly connected to the Internet.

The downside of a circuit-level proxy is that it cannot examine the Application-layer content of the packets it passes. Also, some computers (such as Macintosh) may not have the required client software available to them. (In such cases, Web browsers and the like may still operate, but they must be configured manually.) This problem has been addressed by the software technology known as SOCKS.

SOCKS was originally developed in 1990 and has currently reached version 5 (defined in RFC 1928). It provides a cross-platform standard for accessing circuit-level proxies. These may be accessed either by a single "SOCKSified" application on an otherwise unmodified client computer, or by any application running on a computer that has had a SOCKS shim (shared or dynamic link libraries) put onto it.

Apart from standardization, SOCKS has other advantages. Version 5 supports both username/password (RFC 1929) and API-based (RFC 1961) authentication. It also supports both public and private key encryption.
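To give a feel for what a SOCKSified client actually sends, here is a minimal sketch that builds the first two client messages laid out in RFC 1928: the greeting that offers authentication methods, and a CONNECT request for a named host. The function names are made up for this example, and the code only assembles bytes; it does not open a connection or handle the server's replies.

```python
import struct

SOCKS_VERSION = 0x05
METHOD_NO_AUTH = 0x00            # "no authentication required"
METHOD_USERNAME_PASSWORD = 0x02  # username/password, per RFC 1929
CMD_CONNECT = 0x01
ATYP_DOMAIN_NAME = 0x03


def build_greeting(methods):
    """VER, NMETHODS, then one byte per offered authentication method."""
    if not 1 <= len(methods) <= 255:
        raise ValueError("offer between 1 and 255 methods")
    return bytes([SOCKS_VERSION, len(methods), *methods])


def build_connect_request(host, port):
    """VER, CMD, RSV, ATYP, length-prefixed host name, port in network order."""
    name = host.encode("ascii")
    if not 1 <= len(name) <= 255:
        raise ValueError("host name must be 1 to 255 bytes")
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0x00, ATYP_DOMAIN_NAME, len(name)])
    return header + name + struct.pack("!H", port)
```

Because the request carries the destination as a host name and port rather than anything protocol-specific, the same few bytes serve Telnet, IRC, or any other TCP application, which is the point of a circuit-level proxy.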

It is historically difficult to proxy UDP-based services, since these are not connection-based; each packet is sent as a separate message. SOCKS 5 addresses this problem with its UDP ASSOCIATE request: the client sets up an association over its TCP connection to the proxy, and the proxy then relays the client's UDP datagrams for as long as that connection stays open.

Finally, aspects of packet-filtering firewalls, application-level proxies, and circuit-level proxies are combined by stateful-inspection firewalls. These devices are capable of intercepting and examining all of the packets they pass, using algorithms to recognize Application-layer data. Unlike Application-layer proxies, stateful-inspection firewalls do not break the client-server model in order to analyze data.

Cache As Cache Can

Though I have focused on architectural and security issues so far, most users are interested in proxy servers for a single reason: caching. Though theoretically optional, this feature has been closely associated with Web proxy servers ever since they were described at the first International World Wide Web Conference (Geneva, April 1994).

A proxy server's basic caching function works much like what's built into a Web browser, with the exception that the contents of the proxy server cache are available to multiple users. Whenever one user on the local network retrieves pages from the Internet, the pages are stored locally, which dramatically speeds access (see Figure 1). For example, Novell claims that when its BorderManager FastCache is configured to run from RAM, it is capable of processing more than 5,000 hits per second.

Figure 1: Proxy servers offer many features, but they are most commonly associated with caching. Caching gets the most out of any Internet connection by converting random, intermittent HTTP requests into an efficient, rule-based stream.
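The basic shared-cache behavior can be sketched as follows. This is a deliberately bare illustration: the SharedCache class and its fetch callback are hypothetical, the store is an in-memory dictionary rather than a disk cache, and nothing here handles expiry or uncacheable responses.

```python
class SharedCache:
    """One cache serving every user on the local network."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that retrieves a URL from the Internet
        self._store = {}      # url -> page body
        self.hits = 0
        self.misses = 0

    def get(self, url):
        """Serve from the cache if possible; otherwise fetch once and keep a copy."""
        if url in self._store:
            self.hits += 1
            return self._store[url]
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body
```

The first user to request a page pays the cost of the trip to the origin server; every later request for the same URL, from any user, is answered locally.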

Some proxy servers offer read-ahead caching, which is capable of loading images and other objects embedded on a Web page into a cache before a Web browser has requested them. Caches may also be preloaded via a mechanism known as the last-modified multiplier. With the last-modified multiplier, a proxy server examines the creation dates of frequently requested pages, learning when updates are likely to occur and retrieving the pages when appropriate. And of course proxy servers also let administrators schedule batch retrieval of Web pages during any time of day when network traffic is known to be light.
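The article does not give a formula for the last-modified multiplier, so the following is only one plausible reading of it: a page that had gone unchanged for a long time when it was fetched is assumed to stay unchanged for a proportional time afterward. The next_refresh function and the default multiplier of 0.2 are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def next_refresh(last_modified, fetched_at, multiplier=0.2):
    """Estimate when a cached page should be retrieved again.

    The page's age at fetch time (fetched_at - last_modified) is scaled
    by `multiplier` and added to the fetch time.
    """
    age = fetched_at - last_modified
    if age < timedelta(0):
        raise ValueError("page cannot be modified after it was fetched")
    return fetched_at + age * multiplier
```

Under this heuristic, a page last changed ten days before it was fetched would be rechecked sooner than one last changed a year before, on the assumption that recently edited pages tend to change again soon.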

Reverse caching is an additional feature of some proxy servers. In reverse caching, the cache server not only stores pages from the Internet for the benefit of local users, but it also stores local pages for the benefit of Internet users.

Linking Multiple Caches

No matter how large and speedy it is, no single cache server can store everything. Inevitably the time will come when some user requests uncached data, which then has to trek slowly across the Internet. However, it is possible to ameliorate this problem by linking multiple caches together so they can draw information from one another. RFC 2186 defines the Internet Cache Protocol (ICP), and RFC 2187 describes how it is applied to connect caches hierarchically.

In a cache hierarchy (or mesh), one cache establishes peering relationships with other caches. There are two types of relationships: parent and sibling. When one cache does not hold a requested object, it performs an ICP query to ask whether any of its siblings has the object. If a sibling does have it, the original cache requests it. If no siblings have it, then the request is forwarded to the parent or to the origin server. Figure 2 shows a typical cache hierarchy.

Figure 2: The Internet Cache Protocol (ICP) links multiple cache servers together in a sibling-parent hierarchy. The local cache can retrieve hits from sibling caches, hits and misses from parent caches, and misses from origin servers directly.
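The lookup order in a cache hierarchy can be sketched without any networking at all. In the hypothetical resolve function below, sibling caches are plain dictionaries (a membership test stands in for an ICP query answered with a hit), while the parent and the origin server are callables that return a page. None of these names come from the ICP specification.

```python
def resolve(url, local, siblings, parent=None, origin=None):
    """Return (source, body) for `url`, trying the cheapest source first.

    local    -- dict acting as this cache's own store
    siblings -- dict of name -> dict, each standing in for a sibling cache
    parent   -- optional callable that fetches through a parent cache
    origin   -- optional callable that fetches from the origin server
    """
    if url in local:
        return "local", local[url]
    for name, cache in siblings.items():
        if url in cache:                  # sibling reports a hit
            local[url] = cache[url]
            return "sibling:" + name, local[url]
    # No sibling has the object, so forward the request upstream.
    if parent is not None:
        source, fetch = "parent", parent
    elif origin is not None:
        source, fetch = "origin", origin
    else:
        raise ValueError("no parent or origin server configured")
    local[url] = fetch(url)
    return source, local[url]
```

Every miss costs one query per sibling before the request goes upstream, so the amount of query traffic grows with the number of caches in the mesh.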

Although ICP lets cache servers be linked, it does have some problems. One is that ICP queries generate extraneous network traffic as they attempt to locate cached information. The more cache servers in the array, the more traffic there is, which results in negative scalability.

Another problem with ICP is that arrays become redundant over time. Each server tends to wind up holding duplicate copies of the most frequently requested URLs. For these reasons, ICP is gradually being replaced by the Cache Array Routing Protocol (CARP), originally devised by Microsoft.

With CARP, cache servers are tracked via an "array membership list," which is automatically updated via a Time-to-Live (TTL) function that regularly checks for active servers. A hashing algorithm is then used to determine which of the members of the array should be the receptacle of a particular URL request.
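The general idea of hash-based routing can be sketched as below. To be clear about the assumptions: this uses SHA-256 and a simple highest-score rule as stand-ins, not the hash function that CARP itself specifies, and the pick_member function and member names are invented for the example.

```python
import hashlib


def _score(member, url):
    """Combine a member name and a URL into a deterministic number."""
    digest = hashlib.sha256((member + "|" + url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_member(members, url):
    """Choose the array member with the highest score for this URL."""
    if not members:
        raise ValueError("array membership list is empty")
    return max(members, key=lambda member: _score(member, url))
```

Because every client computes the same answer from the same membership list, no queries are needed to find out which server holds a URL, and each URL lands on exactly one member rather than being duplicated across the array. When a server drops off the list, only the URLs that had been assigned to it are routed somewhere new.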

Caching Unplugged

Cache servers used to be viewed as nice-to-have items you got for free when you purchased a proxy server. Now that the Internet is growing steadily more congested and more and more clients have broadband connections, the terms "cache server" and "proxy server" may not be used quite so interchangeably.

Proxy servers will continue to offer caching as one of their features. However, the increasing demand for specialized caching means that cache servers will gain more visibility as separate products. For example, the CacheQube from Cobalt Networks (Mountain View, CA) is an appliance that can simply be connected between a LAN and a router to provide transparent caching. The Streaming Media Cache from Inktomi (San Mateo, CA) and MediaMall from InfoLibria (Waltham, MA) are caches designed specifically for handling streaming audio and video.

Resources

The Internet Caching Resource Center, at www.caching.com, offers a variety of information and articles about caching.

A SOCKS 5 white paper is available from www.aventail.com/index.phtml/solutions/white_papers/sockswp.phtml. NEC also offers an introduction to SOCKS at www.socks.nec.com/introduction.html. RFC 1928 is readable at http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1928.txt/.

"Patrolling the Borders of Your Network," though written to promote Novell's BorderManager, is of general background interest. You can find it at www.novell.com/bordermanager/bmgr3_wp.html.

You'll find an introduction to CARP at www.microsoft.com/proxy/guide/CarpWP.asp/.

Finally, a well-regarded book is Ari Luotonen's Web Proxy Servers (Prentice-Hall, 1998, ISBN 0-13-680612-0). Chief architect of the Netscape Proxy Server, Luotonen was also co-developer of the first Web proxy server (CERN, 1994).

This tutorial, number 129, by Jonathan Angel, was originally published in the May 1999 issue of Network Magazine.