5.4 The Application Layer: HTTP

Recall that standard HTTP requests and proxy-HTTP requests are slightly different (see Section 2.1). The first line of a standard request normally includes only an absolute pathname. Proxy-HTTP requests, on the other hand, use the full URL. Because interception proxying does not require browser configuration, the browser thinks it is connected directly to an origin server and sends only the URL-path in the HTTP request line. The URL-path does not include the origin server hostname, so the cache must determine the origin server hostname by some other means.

The most reliable way to determine the origin server is from the HTTP/1.1 Host header. Fortunately, all of the recent browser products do send the Host header, even if they use "HTTP/1.0" in the request line. Thus, it is a relatively simple matter for the cache to transform this standard request:

 GET /index.html HTTP/1.0
 Host: www.ircache.net

into a proxy-HTTP request, such as:

 GET http://www.ircache.net/index.html HTTP/1.0
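
The transformation above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular cache: the function name to_proxy_request is hypothetical, and the sketch assumes the intercepted traffic is plain HTTP, so it always builds an http:// URL.

```python
def to_proxy_request(raw_request: bytes) -> bytes:
    """Rewrite the request line of a standard request to carry a full URL.

    The request is returned unchanged if it has no Host header or if the
    request line does not contain an absolute path.
    """
    head, sep, body = raw_request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    parts = lines[0].split(b" ")
    if len(parts) != 3:
        return raw_request
    method, target, version = parts
    if not target.startswith(b"/"):
        return raw_request  # already a full URL, or something unexpected

    # Header names are case-insensitive, so compare in lowercase.
    host = None
    for line in lines[1:]:
        name, colon, value = line.partition(b":")
        if colon and name.strip().lower() == b"host":
            host = value.strip()
            break
    if not host:
        return raw_request

    # Assumption: intercepted traffic is plain HTTP, hence the fixed scheme.
    lines[0] = b" ".join([method, b"http://" + host + target, version])
    return b"\r\n".join(lines) + sep + body


request = b"GET /index.html HTTP/1.0\r\nHost: www.ircache.net\r\n\r\n"
rewritten = to_proxy_request(request)
```

The sketch leaves the Host header in place and changes only the request line, which is all the transformation described above requires.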

In the absence of the Host header, the cache might be able to use the socket interface to get the IP address for which the packet was originally destined. The Unix sockets interface allows an application to retrieve the local address of a connected socket with the getsockname() function. In this case, the local address is the origin server that the proxy pretends to be. Whether this actually works depends on how the operating system implements the packet redirection. The native Linux and FreeBSD firewall software preserves the destination IP address, so getsockname() does work. The IP Filter package does not preserve destination addresses, so applications need to access /dev/nat to get the origin server's IP address.
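
As a rough illustration of the getsockname() approach, the following Python sketch reads the local address of an accepted connection. The helper names original_destination and demo are hypothetical. The demonstration runs over the loopback interface with no packet redirection, so the address it returns is simply the listener's own; on a system whose redirection preserves the destination address, the same call would be expected to return the origin server's address instead.

```python
import socket


def original_destination(conn):
    """Return the local (address, port) of an accepted TCP connection.

    When packet redirection preserves the destination address, this is the
    address of the origin server the client was trying to reach.
    """
    host, port = conn.getsockname()[:2]
    return host, port


def demo():
    """Accept one loopback connection and report its local address."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _peer = listener.accept()
    try:
        return original_destination(conn)
    finally:
        conn.close()
        client.close()
        listener.close()
```

This sketch does not cover the /dev/nat lookup needed with IP Filter, which uses a different interface.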

If the cache uses getsockname() or /dev/nat, the resulting request looks something like this:

 GET http://192.52.106.29/index.html HTTP/1.0

While either a hostname or an IP address can be used to build a complete URL, hostnames are highly preferable. The primary reason is that URLs typically use hostnames rather than IP addresses, and in most cases a cache cannot recognize that both forms of a URL are equivalent. If you first request a URL with a hostname, and then again with its IP address, the second request is a cache miss, and the cache now stores two copies of the same object. This problem is made worse because some hostnames have many different IP addresses.
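
The duplicate-object problem can be seen with a toy cache keyed on the URL string. Everything in this sketch is hypothetical (build_url, UrlKeyedCache, and the stand-in fetch function); it assumes only that the cache treats two different URL strings as two different objects, as described above.

```python
def build_url(path, host_header=None, dest_ip=None):
    """Prefer the Host header; fall back to the redirected destination IP."""
    authority = host_header or dest_ip
    if authority is None:
        raise ValueError("no Host header and no destination address")
    return "http://" + authority + path


class UrlKeyedCache:
    """Toy cache that treats each distinct URL string as a separate object."""

    def __init__(self):
        self.store = {}

    def get(self, url, fetch):
        if url in self.store:
            return "HIT", self.store[url]
        self.store[url] = fetch(url)
        return "MISS", self.store[url]


def fetch(url):
    # Stand-in for the origin server: the same object under either URL.
    return b"<html>same object</html>"


cache = UrlKeyedCache()
by_name = build_url("/index.html", host_header="www.ircache.net")
by_addr = build_url("/index.html", dest_ip="192.52.106.29")

results = [
    cache.get(by_name, fetch)[0],  # first request, by hostname
    cache.get(by_name, fetch)[0],  # repeat request, by hostname
    cache.get(by_addr, fetch)[0],  # same object, requested by IP address
]
# results is ["MISS", "HIT", "MISS"], and the cache now holds two copies
```

Preferring the Host header in build_url keeps the cache key in the form that most URLs use, so later requests for the same object are more likely to hit.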
