3.1 Privacy

only for RuBoard - do not distribute or recompile

"Privacy is the power to control what others can come to know about you" [Lessig, 1999, p.143]. In the U.S., most people feel they have a right to privacy. Even though the word does not occur in our Constitution, the Fourth Amendment comes close when it speaks of "the right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures..." In at least one famous case, the Supreme Court ruled that this amendment does provide for an individual's privacy. ^[2] Also of relevance is Article 12 of the United Nations Universal Declaration of Human Rights, which states:

^[2] Katz v. U.S., 389 U.S. 347 (1967).

No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.

Privacy is a very important issue on the Internet as a whole and the Web in particular. Almost everywhere we go in cyberspace, we leave behind a little record of our visit. Today's computer and networking technology makes it almost trivial for information providers to amass huge amounts of data about their audience. As users, you and I might have different feelings about the importance of privacy. As cache operators, however, we have a responsibility always to protect the privacy of our cache users.

Privacy concerns are found in almost every aspect of our daily lives, not just while surfing the Net. My telephone company certainly knows which phone numbers I have dialed. The video store where I rent movies knows what kind of movies I like. My bank and credit card company know where I spend my money. Surveillance cameras are commonplace in stores, offices, and even some outdoor, public places.

In the United States, a consumer's privacy is protected by federal laws on a case-by-case basis. Video stores are not allowed to disclose an individual's rental and sales records without that individual's consent or a court order. ^[3] However, the law does allow video stores to use detailed personal information for marketing purposes if the consumer is given an opportunity to opt out. Similarly, telephone companies must protect their customers' private information, including call records. ^[4] There are no federal laws, however, that address consumer privacy in the banking industry. In fact, under the Bank Secrecy Act, banks must report suspicious transactions, such as large deposits, to federal agencies. ^[5] This reporting requirement is intended to aid in the tracking of money laundering, drug trafficking, and other criminal activities. Banks may be subject to state privacy laws, but for the most part, they are self-regulating in this regard.

^[3] 18 U.S.C. 2710(b). Visit http://www.law.cornell.edu to read the actual text for this and other citations.

^[4] 47 U.S.C. 222(c).

^[5] 31 C.F.R. 103.21.

As with banking, there are no U.S. laws that protect an individual's privacy on the Internet. This is unfortunate, because web transactions can reveal a significant amount of information about an individual user. Internet companies that specialize in marketing and advertising can determine a person's age, sex, place of residence, and other information by examining only a handful of requests. In the normal course of administering your cache, you are going to encounter personal details about your users. You are in a unique position to either strengthen or weaken your users' privacy. I strongly encourage you to develop and publish a policy that protects the privacy of your users to the fullest extent possible.

3.1.1 Access Logs

One way in which users' privacy can be compromised is by cache access logs. Most web caches in operation make a log entry for each and every request received. A typical log entry includes the time of access, URL requested, the requesting client's network address, and, in some cases, a username. As a cache operator, you have access to a large amount of potentially revealing information about your users, who trust you not to abuse it. If you violate that trust relationship, your users will not want to use your cache.

Note that proxies are not the only applications that can log user requests. A network packet sniffer can produce the same result. Proxy log files are easier to understand and generate, but a sniffer also logs requests that don't go through a proxy cache.

As a cache administrator, you are probably free to choose if, and how much, proxy information is logged. Some operators choose to disable logging, but some may be required to log because of regional laws, company policies, settlements, or other reasons. At the same time, it is the policy of some organizations to log all requests for security reasons.

If the choice is yours to make, a number of factors should affect your decision. First of all, log files provide some amount of accountability. Under normal operation, it is unlikely you will need to examine specific requests. But if you suspect criminal activity, the log files might prove useful. Also, many organizations like to analyze their log files to extract certain statistics and trends. Analysis tools can calculate hit ratios, network utilization (bandwidth), and request service times. Long-term analysis can inform you about growth trends and when your organization may require additional bandwidth. Debugging reported problems is generally made easier by having log files around. As many of us know, users often fail to report details important for finding and fixing problems. Your log files might fill in the missing pieces.

If you decide to keep access logs, it might be sufficient to log partial information, so that users retain some amount of privacy. For example, instead of logging full client IP addresses, the cache could log only the first three octets (or apply any other subnetting/netmask scheme). Instead of logging full URLs, the cache might log only the server hostnames. You should be particularly sensitive to logging the query part of a URL. The query follows a "?" and might contain sensitive information such as names, passwords, and account numbers. Here are some slightly edited examples from my cache logs (with long lines wrapped):

 http://komunitas.detik.com/eshare/server/srvgate.dll?action=2&user=XXXP
     &id=68nFdR&rand=975645899050
     &request=248+bye%2Ball%2Bgua%2Bpamit%2Bsebentar
 http://edit.yahoo.com/config/ncclogin?.src=bl
     &login=dream_demon_of_night&passwd=XXXXXXXXXXXXXX&n=1
 http://liveuniverse.com/world?type=world&page=ok
     &dig.password=XXXXXXXX&dig.handle=darkforce
 http://mercury.beseen.com/chat/rooms/e/17671/Chat.html?handle=lanunk
     &passkey=XXXXXXXXXXXXX
 http://msg.edit.yahoo.com/config/ncclogin?.src=bl&login=XXXXXXXXX
     &passwd=XXXXX&n=1
 http://www.creditcardsearchengine.com/?AID=115075&6PID=823653
 http://64.23.37.140/cgi-bin/muslimemail/me.cgi?read=Refresh&login=XXXX
     &myuid=XXXXXXXX&folder=INBOX&startx=1&endx=15
 http://64.40.36.143/cdjoin2.asp?sitename=wetdreams
 http://ekilat.com/email_login.php?who=XXXXX@ekilat.com
 http://join.nbci.com/authlogin_remote_redir.php?u=be_nink
     &hash=XXX...XXX&targetURL=http%3A%2F%2Fwww%2Eemail%2Ecom%2F%3F...
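The partial-logging idea described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `mask_client` and `reduce_url` are hypothetical, the addresses and URLs are made-up documentation examples, and a real cache would need equivalent logic inside its own logging code.

```python
import ipaddress
from urllib.parse import urlsplit


def mask_client(addr, prefix_len=24):
    """Zero the host bits of an IPv4 address, keeping the first prefix_len bits."""
    network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return str(network.network_address)


def reduce_url(url, hostname_only=False):
    """Drop the query string and fragment; optionally keep only the hostname."""
    parts = urlsplit(url)
    if hostname_only:
        return parts.hostname or ""
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


# Hypothetical log values, reduced before they would be written to disk.
print(mask_client("192.0.2.57"))
print(reduce_url("http://www.example.com/login?user=alice&passwd=secret"))
print(reduce_url("http://www.example.com/login?user=alice", hostname_only=True))
```

A prefix length of 24 corresponds to keeping the first three octets. Passing `hostname_only=True` discards the path as well as the query, which is the stricter of the two URL options mentioned in the text.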

Because proxy access logs are sensitive, you should carefully consider who else is able to read them. Limiting the number of people who can read the log files reduces the risk of privacy invasion and other abuses. If your cache runs on a multiuser system such as Unix, pay close attention to file ownership, group, and related permissions. Some caching appliances make log files available for download via FTP. In this case, avoid using a widely known administrator password, and also find out if other (e.g., address-based) access controls are available.
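On a Unix-like system, the permission check suggested above might look like the following sketch. The helper names are hypothetical, the 0640 mode (owner read/write, group read, nothing for others) is just one reasonable choice, and the demonstration runs on a throwaway temporary file rather than a real log.

```python
import os
import stat
import tempfile


def restrict_log(path):
    """Allow owner read/write and group read; deny everything to others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)  # mode 0640


def readable_by_others(path):
    """Return True if the 'other' permission class can read the file."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


# Demonstrate on a temporary file standing in for an access log.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o644)              # start world-readable
before = readable_by_others(demo_path)
restrict_log(demo_path)
after = readable_by_others(demo_path)
os.remove(demo_path)
print(before, after)
```

File ownership and group membership still matter: a mode of 0640 only helps if the file's group contains just the people who should read the logs.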

If you keep access logs, you should also develop a policy for how long the logs are saved. Keeping access logs for long periods of time might be useful for the reasons just described. However, situations might also arise where, in order to protect the privacy of your customers or employees, it would be better to not have the logs at all. In the U.S., the Freedom of Information Act (FOIA) ^[6] requires federal agencies to share certain documents and records with the public upon request. Many state governments have similar laws governing their own agencies. Some people feel that the access logs of systems owned and/or operated by the government are subject to these laws.

^[6] 5 U.S.C. 552.
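Once you settle on a retention period, enforcing it can be automated. The sketch below assumes that rotated log files live in a single directory and that their modification times reflect their age; the function name `purge_old_logs`, the file names, and the 30-day limit are all hypothetical.

```python
import os
import tempfile
import time


def purge_old_logs(directory, max_age_days, now=None):
    """Delete regular files in directory that are older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


# Demonstrate in a temporary directory with one old and one recent file.
with tempfile.TemporaryDirectory() as log_dir:
    for name, age_days in (("access.log.old", 45), ("access.log", 1)):
        path = os.path.join(log_dir, name)
        open(path, "w").close()
        stamp = time.time() - age_days * 86400
        os.utime(path, (stamp, stamp))  # backdate the modification time
    removed = purge_old_logs(log_dir, max_age_days=30)
    remaining = sorted(os.listdir(log_dir))
print(removed, remaining)
```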

In order to do long-term analysis, you probably don't need the full log file information. Instead, you can store shorter data files that summarize the information in such a way that it remains useful but no longer compromises the privacy of your users.
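As a rough illustration of such a summary, the following Python sketch reduces log entries to per-day totals and throws away client addresses and URLs. The log format here is a simplified, hypothetical one invented for this example (timestamp, client, HIT or MISS, size, URL); it is not the format of any particular cache product.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical, simplified log lines: "<unix-time> <client> <HIT|MISS> <bytes> <url>"
SAMPLE_LOG = """\
975645899 192.0.2.57 MISS 4096 http://www.example.com/a?user=alice
975645950 192.0.2.58 HIT 1024 http://www.example.com/a
975732400 192.0.2.57 HIT 2048 http://www.example.org/b
"""


def summarize(lines):
    """Aggregate per-day request, hit, and byte counts; discard clients and URLs."""
    days = defaultdict(lambda: {"requests": 0, "hits": 0, "bytes": 0})
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        when, _client, result, size, _url = fields[:5]
        day = datetime.fromtimestamp(int(when), tz=timezone.utc).date().isoformat()
        days[day]["requests"] += 1
        days[day]["hits"] += int(result == "HIT")
        days[day]["bytes"] += int(size)
    return dict(days)


summary = summarize(SAMPLE_LOG.splitlines())
for day, totals in sorted(summary.items()):
    ratio = totals["hits"] / totals["requests"]
    print(day, totals, f"hit ratio {ratio:.2f}")
```

The daily totals are enough to compute hit ratios and bandwidth trends, yet nothing in them identifies an individual user or request.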

3.1.2 Making Requests Anonymous

Caching proxies also serve to strengthen, not just weaken, user privacy. Many organizations run web caches on their firewalls to hide the details of their internal network. External servers see connections coming from only the firewall host. This makes it harder for outsiders to find out the name, address, or type of user machines. Simply hiding internal names and addresses is not really sufficient. A good firewall has additional defense mechanisms in place, but every little bit helps.

Proxies may also protect privacy by filtering outgoing requests. As I alluded to earlier, HTTP requests often include personal or user-specific information. A request can reveal the browser software, operating system, IP address, and the page being viewed when the new request is made. Older browser software was known to send the user's email address as well, but recent versions should not. Some of this information can be quickly cross-referenced with other online databases to obtain, for example, the user's telephone number and street address.

Content providers are quite happy to collect as much personal information as they can. Usually, much of this information can be filtered out without any negative consequences for the user. Some caching proxies, such as Squid, are able to remove certain HTTP headers before forwarding a request. If you have privacy concerns, you might want to filter some of the following headers:

From

This header, if present, contains the user's email address. In the past, browsers always sent a From header if the user had configured that information. These days, the header is rarely seen, probably because of backlash from privacy advocates. RFC 2616 suggests that user agents should not send the header without the user's consent.

In my experience, it is always safe to remove the From header in outgoing requests. I have not encountered or heard of an origin server that requires it to be present.

Referer

The (historically misspelled) Referer header contains the URI from which the requested URI was obtained. For example, consider a page A that includes a link to page B. When you click on the link for B, the request looks something like this:

 GET B HTTP/1.1
 Referer: A

Content providers love the Referer header because it tells them how people find out about their site. When you use a search engine such as Altavista and follow one of the links, the Referer header is set to the Altavista query URL plus your query terms. Thus, the content provider knows what you are searching for when you find their site through a search engine. The Apache server makes it easy to log Referer headers from requests received. Here are some examples from my own web site:

 http://google.yahoo.com/bin/query?p=%22web+cache%22&hc=0&hs=1 -> /
 http://www.google.com/search?q=wpad+rfc&hl=en -> /writings.html
 http://www.altavista.com/cgi-bin/query?pg=q&q=linux+cache -> /
 http://search.msn.com/spbasic.htm?MT=akamai%20personalization
     -> /services.html
 http://www.google.com/search?q=web+cache+proxy&hl=en -> /
 http://www.l2g.com/topic/Web_Caching -> /writings.html
 http://www.google.com/search?q=what+is+a+web+cache&hl=en -> /
 http://www.altavista.com/cgi-bin/query?q=RFC+WPAD -> /writings.html

In the previous examples, the Referer header is on the left, and the requested URI is to the right of the -> symbol.

The Referer header is useful for fixing broken links. A broken link is a hypertext link to a page or object that doesn't exist anymore. Broken links are, unfortunately, all too common on the Web. With the Referer header, a content provider can discover the pages that point to nonexistent documents. Presumably, the content provider can contact the author of the other document and ask her to update the page or remove the link. However, that's much easier said than done. Getting the broken links fixed is often quite difficult.

Referer violates our privacy by providing origin servers with a history of our browsing activities. It gives content providers probably more information than they really need to satisfy a request. In my experience, it is always safe to filter out Referer headers.
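To see how little effort it takes for a content provider to recover search terms from a Referer value, consider this short Python sketch. The function name `search_terms` is hypothetical, and the parameter names it checks (`q`, `p`, and `MT`) are simply the ones that appear in the sample Referer values above; other search engines may use different names.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(referer, param_names=("q", "p", "MT")):
    """Return decoded values of likely search parameters in a Referer URL."""
    query = parse_qs(urlsplit(referer).query)
    terms = []
    for name in param_names:
        terms.extend(query.get(name, []))
    return terms


print(search_terms("http://www.google.com/search?q=what+is+a+web+cache&hl=en"))
print(search_terms("http://google.yahoo.com/bin/query?p=%22web+cache%22&hc=0&hs=1"))
print(search_terms("http://www.l2g.com/topic/Web_Caching"))
```

A Referer without a query string, such as the last one, yields no search terms, but it still reveals the page the user came from.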

User-agent

The User-agent header contains a string that identifies your web browser, including operating system and version numbers. Here are some sample User-agent values:

 Mozilla/2.0 (compatible; MSIE 3.01; AK; Windows 95)
 Mozilla/4.0 (compatible; BorderManager 3.0)
 Mozilla/4.51 [en] (X11; I; Linux 2.2.5-15 i686)
 NetAnts/1.00
 Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)
 Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)
 Wget/1.5.3

User-agent is useful to content providers because it tells them what their users are using to view their site. For example, knowing that 90% of requests come from browser X, which supports feature Z, makes it easier to design a site that takes advantage of that feature. User-agent is also sometimes used to return browser-specific content. Netscape may support some nonstandard HTML tags that Internet Explorer does not, and vice versa.

As you can see in the previous examples, the User-agent field tells the content provider quite a lot about your computer system. They know if you run Windows 95 or Windows NT. In the Linux example, we can see that the system's CPU is an Intel 686-class processor (Pentium Pro or newer).

Unfortunately, it is not always safe to filter out the User-agent header. Some sites rely on the header to figure out if certain features are supported. With the header removed, I have seen sites that return a page saying, "Sorry, looks like your browser doesn't support Java, so you can't use our site." I have also seen sites (e.g., www.imdb.com) that refuse requests from specific user-agents, usually to prevent intense robot crawling.

Cookie

Cookies are used to maintain session state information between HTTP requests. A so-called shopping basket is a common application of a cookie. However, they may also be used simply to track a person's browsing activities. The prospect is truly frightening when you consider the way that advertising companies such as doubleclick.net operate. DoubleClick places ads on thousands of servers. The Cookie and Referer headers together allow them to track an individual's browsing across all of the servers where they place ads.

Unfortunately, filtering Cookie headers for all requests is not always practical. Doing so may interfere with sites that you really want to use. As an alternative, users should be able to configure their browsers to require approval before accepting a new cookie. There are also client-side programs for filtering and removing cookies. For more information, visit http://www.cookiecentral.com.
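The filtering described for these headers can be sketched as follows. This is not how Squid or any other proxy implements it; it is a minimal Python illustration in which the function name `filter_request_headers`, the `strip_cookies` option, and the sample request are all hypothetical. It always removes From and Referer, leaves User-agent alone, and removes Cookie only on request.

```python
# Headers the text describes as safe to drop; the set itself is an illustrative choice.
ALWAYS_REMOVE = {"from", "referer"}


def filter_request_headers(headers, strip_cookies=False):
    """Return a copy of headers without privacy-sensitive fields.

    Header names are compared case-insensitively, as HTTP requires.
    """
    blocked = set(ALWAYS_REMOVE)
    if strip_cookies:
        blocked.add("cookie")
    return [(name, value) for name, value in headers if name.lower() not in blocked]


# A made-up outgoing request, as a list of (name, value) pairs.
request = [
    ("Host", "www.example.com"),
    ("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)"),
    ("From", "user@example.com"),
    ("Referer", "http://www.example.org/search?q=web+cache"),
    ("Cookie", "session=abc123"),
]
for name, value in filter_request_headers(request):
    print(f"{name}: {value}")
```

Keeping cookie removal optional reflects the trade-off described above: stripping Cookie everywhere is likely to break sites that users actually want to use.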

Unfortunately, even some presumably harmless headers can be used to track individuals. Martin Pool uncovered a trick whereby content providers use the Last-modified and If-modified-since headers to accomplish exactly that [Pool, 2000]. In this scheme, the cache validator serves the same purpose as a cookie. It's a small chunk of information that the user-agent receives, probably stores, and then sends back to the server in a future request. If each user receives a unique validator, the content provider knows when that user revisits a particular page.
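The validator trick is easier to appreciate with a toy model. The Python sketch below is my own simplified illustration, not code from [Pool, 2000]: a hypothetical `TrackingServer` hands each new visitor a slightly different Last-modified time and recognizes that visitor when the same value comes back in If-modified-since.

```python
import itertools
from email.utils import formatdate

BASE_TIME = 946684800  # 2000-01-01 00:00:00 GMT, an arbitrary starting point


class TrackingServer:
    """Toy origin server that uses Last-Modified values as visitor identifiers."""

    def __init__(self):
        self._offsets = itertools.count(1)
        self.visitors = {}  # validator string -> number of revisits seen

    def handle(self, if_modified_since=None):
        """Return (status, headers) for one request to the tracked page."""
        if if_modified_since in self.visitors:
            self.visitors[if_modified_since] += 1  # a known visitor has returned
            return 304, {}
        # New visitor: issue a validator that no one else has received.
        validator = formatdate(BASE_TIME + next(self._offsets), usegmt=True)
        self.visitors[validator] = 0
        return 200, {"Last-Modified": validator}


server = TrackingServer()
status, headers = server.handle()                        # first visit
validator = headers["Last-Modified"]
status2, _ = server.handle(if_modified_since=validator)  # browser revalidates
print(validator, status2, server.visitors[validator])
```

Because the one-second differences between validators look like ordinary modification times, a proxy cannot easily tell this tracking apart from legitimate cache validation.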
