3.3 Copyright

Copyright laws give authors or creators certain rights regarding the copying and distribution of their original works. These laws are intended to encourage people to share their creative works without fear that they will not receive due credit or remuneration. Copyrights are recognized internationally through various treaties (e.g., the Berne Convention and the Universal Copyright Convention). This helps our discussion somewhat, because the Internet tends to ignore political boundaries.

Digital computers and the Internet challenge our traditional thinking about copyrights. Before computers, we only had to worry about making copies of physical objects such as books, paintings, and records. The U.S. copyright statute defines a copy as follows:

Copies are material objects in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. [7]

[7] 17 U.S.C. 101.

When you memorize a poem, thereby making a copy in your brain, you have not violated a copyright law. Tests for physicality are difficult to apply to computer systems where information exists only as electrostatic or magnetic charges representing ones and zeroes.

Copying is a fundamental characteristic of the Internet. An Internet without copying is like a pizza without cheese: what would be the point? People like the Internet because it lets them share information with each other. Email, newsgroups, web pages, chat rooms: all require copying information from one place to another. The Internet also challenges traditional copyrights in another interesting way. Revenue is often the primary reason we establish and enforce copyrights. I don't want you to copy this book and give it to someone else because I get a bigger royalty check if your friend buys his own copy. On the Internet, however, paying for information is the exception rather than the rule. Some sites require subscriptions, but most do not, and a lot of web content is available for free.

3.3.1 Does Caching Infringe?

The question for us is this: does web caching infringe upon an author's copyright? If so, cache operators are potentially liable and may be subject to litigation. If not, copyright owners have arguably lost the right to exercise control over the distribution of their works. We can make good arguments for both sides of this debate.

First, let's examine some arguments that suggest that caching does infringe. The Berne Convention text (Article 9) states the following:

Authors of literary and artistic works protected by this Convention shall have the exclusive right of authorizing the reproduction of these works, in any manner or form.

Web caching most certainly qualifies as "reproduction" in this context, and thus the Berne Convention applies. In fact, judges have already ruled that even relatively transitory and short-term copies of works in computer memory (RAM) are subject to copyright laws. One grey area is whether we can say that a cache has the authorization to make a copy. Some people say "no" because such authorization is almost never explicitly granted, especially at the level of the transfer protocol (HTTP). The author of a web page may write something like "caching of this page is allowed," say, in an HTML comment, but the proxy wouldn't know that.

A more credible pro-caching defense states that web publishers implicitly grant a license to create copies, simply by placing material on a web server. After all, browsers must necessarily copy a page from the server to display it to the user. Content providers are generally aware that such copying between nodes is a fundamental aspect of the Internet's operation, and therefore they may accept that their pages are copied and duplicated at various locations. Furthermore, the laws certainly do not discriminate between caching at a browser versus caching at a proxy. If one infringes, we must conclude the other does as well.

In the U.S., this issue remained unresolved for a number of years. Although the threat of copyright infringement claims always existed, that didn't stop most organizations from deploying caching proxies in their networks. The situation changed dramatically in October of 1998, however, when Congress passed the Digital Millennium Copyright Act (DMCA). But before getting to that, let's look at some previous case law.

3.3.2 Cases and Precedents

It appears that copyright law relating to caching has not yet been tested in any court. Rumors of such lawsuits do exist, but, if the rumors are true, the cases were likely dropped or settled before trial. However, U.S. courts have heard a number of cases alleging copyright infringement by system operators. Eric Schlachter's article in Boardwatch magazine provides a good summary of some of these cases [Schlachter, 1997]. It is enlightening to examine a few details from them.

In 1993, Playboy magazine sued a system operator, George Frena, for, among other things, distribution of copyrighted images. The operator was found guilty of violating Playboy's right of distribution, even though he did not copy the files himself. The court noted that intent is not required for a finding of liability.

The following year, Sega sued the operators of a bulletin board for providing unauthorized copies of game software. The court initially found the operators both directly and contributorily liable. A second hearing of the case in 1996 found the operators were not directly liable because they did not make the copies. However, they remained contributorily liable because their system encouraged users to upload the pirated software.

More recently, the Religious Technology Center, a group affiliated with the Church of Scientology, instigated a lawsuit against system operators (Netcom and Tom Klemesrud) and a user (Dennis Erlich) over some messages posted to a Usenet newsgroup. One of the claims was copyright infringement of material owned by the Religious Technology Center. In the end, neither Netcom nor Klemesrud were found liable. However, this case contains an interesting twist. Both Netcom and Klemesrud were notified that their systems had distributed copyrighted information. The operators did not attempt to cancel the messages because they felt the copyright claims could not reasonably be verified. The court agreed. Although not stated explicitly, this ruling implies operators could be liable if given adequate notice that an infringement has occurred.

The case against Napster is perhaps one of the highest profile copyright lawsuits in recent years. Napster is a service (and software package) that puts its users in touch with each other so they can share music files. Napster doesn't store the files on its servers, but the company maintains a central database detailing the files that each user is willing to share. A number of recording industry companies sued Napster in late 1999 for contributory copyright infringement. One of Napster's defenses is that people can use its system in ways that do not infringe on copyrights. For example, users may share their own music or music that is free from copying restrictions. Sony successfully used this line of reasoning in a 1984 Supreme Court case that established that videocassette recording of television shows qualifies as fair use. After many hearings, the court forced Napster to prevent its customers from swapping copyrighted music unless they pay a royalty to the music producers.

3.3.3 The DMCA

The Digital Millennium Copyright Act was signed into law in October of 1998. Its goal is to bring U.S. law in line with the World Intellectual Property Organization Copyright Treaty and the Performances and Phonograms Treaty. To many people, the DMCA is a travesty. The Electronic Frontier Foundation (http://www.eff.org) has a number of anti-DMCA articles on its web site. However, these are all focused on section 1201 of U.S.C. title 17, which makes it illegal to circumvent copyright protection systems.

For this discussion, we focus on a different part of the DMCA. Title II is called the Online Copyright Infringement Liability Limitation Act. It exempts service providers from liability for copyright infringement if certain conditions are met. Surprisingly, this legislation specifically addresses caching! This part of the Act became law as section 512 of U.S.C. title 17. The text of the first two subsections is included in Appendix G.

Subsection (a) doesn't talk about caching, but it's relevant anyway. It exempts service providers from liability for simply providing "transitory digital network communications," for example, routing, transmitting packets, and providing connections through their networks. This language probably applies to proxying (without caching) as well. To avoid liability, certain conditions must be met.

One condition is that the service provider must not modify the content as it passes through its network. Thus, a caching proxy that alters web pages, images, etc., may be liable for copyright infringement. Filtering out "naughty" words from text and changing the resolution or quality of an image both violate this condition. It's not clear to me whether blocking a request for an embedded image (e.g., an advertisement banner) qualifies as content modification.

Subsection (b) deals entirely with caching. It says that "intermediate and temporary storage of material" does not make service providers liable for copyright infringement, but only if the following conditions are met:

  • The material is made available by somebody other than the ISP.

  • The material is sent to someone other than the content provider.

  • The reason for storing the material is to use it in response to future requests from users, subject to a number of additional conditions.

The additional conditions of subsection (b) are quite long and relate to issues such as modification of content, serving stale responses, and access to protected information.

Paragraph (2)(A) says that the service provider can't modify cached responses. Since subsection (a) says the provider can't modify the original response, this implies that all cache hits must be the same as the original.

Paragraph (2)(B) is interesting because it mandates compliance with "rules concerning the refreshing, reloading, or other updating of the material . . . " (in other words, the HTTP headers). This means that if the caching product is configured or operates in a way that violates HTTP, the service provider may be guilty of copyright infringement. What's even more interesting is that this paragraph applies "only if those rules are not used by the [content owner] to prevent or unreasonably impair" caching. To me it seems that if the origin server is cache busting (see Section 3.7), disobeying HTTP headers does not make the service provider liable for infringement. This is somewhat surprising, because HTTP gives content owners ultimate control over how caches handle their content. However, the language in this paragraph is so vague and confusing that it probably makes this condition altogether worthless.

The next condition, paragraph (2)(C), is tricky as well. It says that service providers cannot interfere with the content owner's ability to receive access statistics from caches. Again, this condition does not apply if the collection of such statistics places an undue burden on the service provider. The condition also does not apply if the content provider somehow uses the cache to collect additional information that it wouldn't normally have if users were connecting directly.

The condition in paragraph (2)(D) is relatively straightforward. It says that caches must respect the content provider's ability to limit access to information. For example, if users must authenticate themselves in order to receive a certain web page, the cache must not give that page to unauthenticated users. Doing so makes the service provider liable for copyright infringement. I think there is a small problem with this requirement, however. In some cases, a caching proxy won't know that the content provider is limiting access. For example, this happens if access is granted based only on the client's IP address. The law should require origin servers to explicitly mark protected responses. Otherwise, service providers may be inadvertently violating the owner's copyright.

The final condition, in paragraph (2)(E), is a requirement to remove material (or deny access to it) when someone makes a claim of copyright infringement. For example, a content provider might place a copyrighted image on its server without the copyright owner's permission. The copyright owner can demand that the content provider remove the image from its site. He can also demand that the image be removed from an ISP's caches. Since such a demand must be made to individual service providers, it's difficult to imagine that the copyright owner will get the content removed from all caches in operation.

The first time I read this legislation, I thought that prefetching could make service providers liable for copyright infringement. On closer examination, however, it seems that prefetching is exempt as well. Subsection (a), which is about routing, not caching, eliminates liability for material requested by "a person other than the service provider." Arguably, when a cache prefetches some objects, those requests are initiated by the service provider, not the user. However, when talking about caching in subsection (b), the exemption covers material requested by a person other than the content provider. Since the ISP is not the content provider, the exemption applies.

One interesting thing to note about this law is that it only exempts ISPs from liability for copyright infringement. The service provider may still be guilty of infringement, but the copyright owner is not entitled to compensation for copies made by caches. Before the Online Copyright Infringement Liability Limitation Act became law, some service providers were afraid that caching would invite lawsuits. With these new additions, however, ISPs in the U.S. have nothing to fear as long as they "play nice" and don't violate the rules of HTTP.

3.3.4 HTTP's Role

Even those who feel strongly that caching is a blatant infringement of copyrights generally agree that caching is necessary for efficient and continued operation of the Web. Because caching is here to stay, we need good technical solutions to deal appropriately with information that authors want to control tightly. HTTP/1.1 provides a number of different directives and mechanisms that content authors can use to control the distribution of their works.

One way is to insert one or more of the Cache-Control directives in the response headers. The no-store directive prevents any cache from storing and reusing the response. The no-cache directive allows a response to be stored but requires validation with the origin server before each subsequent use; must-revalidate requires validation once the response becomes stale. The private directive allows the response to be cached and reused by a user agent cache but not by a caching proxy. We'll talk further about these in Section 6.1.4.
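For example, a response carrying the following headers (the date and content type here are purely illustrative) may be stored by a browser's cache but not by a shared proxy, and must be revalidated with the origin server once it becomes stale:

    HTTP/1.1 200 OK
    Date: Mon, 02 Apr 2001 17:30:00 GMT
    Content-Type: text/html
    Cache-Control: private, must-revalidate

Replacing the Cache-Control value with no-store would instead prevent any cache, shared or private, from keeping the response at all.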

Authentication is another way to control the distribution of copyrighted material. To access the information, users must enter a username and password. By default, caching proxies cannot store and reuse authenticated responses. Passwords may be a pain to administer, however. Not many organizations can afford the resources to maintain a database with millions of users.
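To sketch how this looks on the wire (the hostname, path, and credentials are made up for illustration), a request for protected content carries an Authorization header:

    GET /members/report.html HTTP/1.1
    Host: www.example.com
    Authorization: Basic dXNlcjpwYXNzd29yZA==

HTTP/1.1 forbids a shared cache from reusing the response to such a request unless the origin server explicitly permits it, for example with a public, s-maxage, or must-revalidate directive.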

Encryption is an even more extreme way to control distribution. Recall that end-to-end encryption protocols such as SSL/TLS are opaque to proxies in the middle. In other words, caching proxies cannot interpret encrypted traffic, which makes it impossible to cache. User agents can cache responses received via an encrypted channel. In this sense, encryption is similar to the private cache control.

One of the difficulties with Cache-Control is that the header must be generated by the server itself; it is not a part of the content. Modifying the server's behavior usually requires assistance from the server administrator; even if the author wants to use Cache-Control or other HTTP headers to control caching, she may not be able to unless the administrator is willing and able to help out. Apache actually makes it possible for authors to insert and remove headers after some initial configuration by the administrator. We'll see how to do this in Chapter 6.
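As a rough sketch of what that might look like (assuming the administrator has loaded mod_headers and permitted .htaccess overrides; details vary by installation), an author could place a file named .htaccess in her directory containing:

    # Ask all caches not to store responses for files in this directory.
    # Requires mod_headers and AllowOverride FileInfo (administrator setup).
    Header set Cache-Control "no-store"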

Using the no-store header is not a decision to be made lightly. Though preventing caching gives the provider greater control over the distribution of the content, it also increases the time that cache users wait to view it. Anyone who thinks users and cache operators don't notice the difference is wrong. Even if the provider site has enough capacity to handle the load, the user may be sitting behind a congested and/or high-latency network connection. Because people have short attention spans, if the latency is too high, they simply abort the request and move on to another site. Issues relating to servers that don't allow caching are further discussed in Section 3.7.
