Running an HTTP Proxy Server

Besides HTTP servers and clients, twisted.web includes support for writing HTTP proxies. A proxy is a client and server in one: it accepts requests from clients (acting as a server) and forwards them to servers (acting as a client). Then it sends the response back to the client who originally sent the request. HTTP proxies are useful mostly for the additional services they can provide, such as caching, filtering, and usage reporting. This lab shows how to build an HTTP proxy using Twisted.

4.6.1. How Do I Do That?

The twisted.web package includes twisted.web.proxy, a module with classes for building HTTP proxies. Example 4-7 shows how easy it is to set up a basic proxy.

Example 4-7. simpleproxy.py

from twisted.web import proxy, http
from twisted.internet import reactor
from twisted.python import log
import sys

log.startLogging(sys.stdout)


class ProxyFactory(http.HTTPFactory):
    protocol = proxy.Proxy


reactor.listenTCP(8001, ProxyFactory())
reactor.run()

Run simpleproxy.py from the command line and you'll have an HTTP proxy running on localhost port 8001. Set up a web browser to use this proxy and try surfing some web pages. The call to log.startLogging prints all HTTP log messages to stdout so you can watch the proxy at work:


    $ python simpleproxy.py
    2005/06/13 00:22 EDT [-] Log opened.
    2005/06/13 00:22 EDT [-] __main__.ProxyFactory starting on 8001
    2005/06/13 00:22 EDT [-] Starting factory <__main__.ProxyFactory instance at 0xb7d9d10c>
    2005/06/13 00:23 EDT [Proxy,0,127.0.0.1] Starting factory
    2005/06/13 00:23 EDT [-] Enabling Multithreading.
    2005/06/13 00:23 EDT [Proxy,1,127.0.0.1] Starting factory
    2005/06/13 00:23 EDT [Proxy,2,127.0.0.1] Starting factory
    ...
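If you would rather exercise the proxy from a script than from a browser, something like the following sketch may help. It is written for Python 3 (unlike the Python 2 listings in this lab) and uses only the standard library. The function names build_proxy_opener and fetch_via_proxy are made up for this illustration, and the host and port simply mirror the values used in Example 4-7. Only plain http:// URLs are routed through the proxy here.

```python
import urllib.request

PROXY_HOST = "localhost"  # assumption: simpleproxy.py runs on this machine
PROXY_PORT = 8001         # the port passed to reactor.listenTCP in Example 4-7


def build_proxy_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Return a urllib opener that sends plain-HTTP requests via the proxy."""
    proxy_url = "http://%s:%d" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch url through the proxy and return (status, body bytes)."""
    opener = build_proxy_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()
```

With simpleproxy.py running in another terminal, calling fetch_via_proxy("http://www.twistedmatrix.com/") should produce a matching burst of lines in the proxy's log.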

That gives you a working proxy, but not one that does anything useful. Example 4-8 dives deeper into the twisted.web.proxy module to build a proxy that keeps track of the most frequently used words in the HTML documents being browsed.

Example 4-8. wordcountproxy.py

# Written for Python 2: sgmllib and the print statement are gone in Python 3.
import sgmllib, re
from twisted.web import proxy, http
import sys
from twisted.python import log
log.startLogging(sys.stdout)

WEB_PORT = 8000
PROXY_PORT = 8001


class WordParser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.chardata = []
        self.inBody = False

    def start_body(self, attrs):
        self.inBody = True

    def end_body(self):
        self.inBody = False

    def handle_data(self, data):
        # collect only text that appears inside <body>
        if self.inBody:
            self.chardata.append(data)

    def getWords(self):
        # extract words
        wordFinder = re.compile(r'\w*')
        words = wordFinder.findall("".join(self.chardata))
        # \w* also matches the empty string between words; drop those
        words = filter(lambda word: word.strip(), words)
        print "WORDS ARE", words
        return words



class WordCounter(object):
    ignoredWords = ("the a of in from to this that and or but is was be "
                    "can could i you they we at").split()

    def __init__(self):
        self.words = {}

    def addWords(self, words):
        for word in words:
            word = word.lower()
            if word not in self.ignoredWords:
                currentCount = self.words.get(word, 0)
                self.words[word] = currentCount + 1



class WordCountProxyClient(proxy.ProxyClient):
    def handleHeader(self, key, value):
        proxy.ProxyClient.handleHeader(self, key, value)
        if key.lower() == "content-type":
            if value.split(';')[0] == 'text/html':
                self.parser = WordParser()

    def handleResponsePart(self, data):
        proxy.ProxyClient.handleResponsePart(self, data)
        if hasattr(self, 'parser'):
            self.parser.feed(data)

    def handleResponseEnd(self):
        proxy.ProxyClient.handleResponseEnd(self)
        if hasattr(self, 'parser'):
            self.parser.close()
            self.father.wordCounter.addWords(self.parser.getWords())
            del self.parser



class WordCountProxyClientFactory(proxy.ProxyClientFactory):
    def buildProtocol(self, addr):
        client = proxy.ProxyClientFactory.buildProtocol(self, addr)
        # upgrade proxy.ProxyClient object to WordCountProxyClient
        client.__class__ = WordCountProxyClient
        return client


class WordCountProxyRequest(proxy.ProxyRequest):
    protocols = {'http': WordCountProxyClientFactory}

    def __init__(self, wordCounter, *args):
        self.wordCounter = wordCounter
        proxy.ProxyRequest.__init__(self, *args)


class WordCountProxy(proxy.Proxy):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        proxy.Proxy.__init__(self)

    def requestFactory(self, *args):
        return WordCountProxyRequest(self.wordCounter, *args)


class WordCountProxyFactory(http.HTTPFactory):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        http.HTTPFactory.__init__(self)

    def buildProtocol(self, addr):
        protocol = WordCountProxy(self.wordCounter)
        return protocol



# classes for web reporting interface
class WebReportRequest(http.Request):
    def __init__(self, wordCounter, *args):
        self.wordCounter = wordCounter
        http.Request.__init__(self, *args)

    def process(self):
        self.setHeader("Content-Type", "text/html")
        words = self.wordCounter.words.items()
        # sort by count, most frequent first
        words.sort(lambda (w1, c1), (w2, c2): cmp(c2, c1))
        for word, count in words:
            self.write("<li>%s %s</li>" % (word, count))
        self.finish()


class WebReportChannel(http.HTTPChannel):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        http.HTTPChannel.__init__(self)

    def requestFactory(self, *args):
        return WebReportRequest(self.wordCounter, *args)


class WebReportFactory(http.HTTPFactory):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        http.HTTPFactory.__init__(self)

    def buildProtocol(self, addr):
        return WebReportChannel(self.wordCounter)


if __name__ == "__main__":
    from twisted.internet import reactor
    counter = WordCounter()
    prox = WordCountProxyFactory(counter)
    reactor.listenTCP(PROXY_PORT, prox)
    reactor.listenTCP(WEB_PORT, WebReportFactory(counter))
    reactor.run()

Run wordcountproxy.py and set your browser to use the proxy server at localhost port 8001. Browse to a couple of sites to test the proxy. Then go to http://localhost:8000/ to see a report of word frequency in the sites you've visited. Figure 4-10 shows what your browser might look like after visiting http://www.twistedmatrix.com.

Figure 4-10. List of the most common words in proxied web pages

4.6.2. How Does That Work?

There are a lot of classes in Example 4-8, but most of them are just glue; only a few do real work. The first two classes, WordParser and WordCounter, extract words from the text of HTML documents and count their frequency. The third class, WordCountProxyClient, contains the code that looks for HTML documents and runs them through a WordParser as they come back from the server. That's it for code specific to the problem of counting words.
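Because the word-extraction and counting logic doesn't depend on Twisted, you can try it in isolation. The sketch below is a rough Python 3 equivalent, not the lab's own code: it swaps sgmllib for the standard library's html.parser and the dictionary for collections.Counter, and the names BodyTextParser and count_words are invented for this illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

IGNORED_WORDS = set(("the a of in from to this that and or but is was be "
                     "can could i you they we at").split())


class BodyTextParser(HTMLParser):
    """Collect character data that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.chardata = []
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chardata.append(data)

    def get_words(self):
        # \w+ never matches the empty string, so no filtering is needed
        return re.findall(r"\w+", "".join(self.chardata))


def count_words(html, counter=None):
    """Add the words found in html to counter (a new Counter by default)."""
    counter = Counter() if counter is None else counter
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    for word in parser.get_words():
        word = word.lower()
        if word not in IGNORED_WORDS:
            counter[word] += 1
    return counter
```

Passing the same Counter to several count_words calls accumulates totals across documents, much as the single WordCounter instance does for every page that passes through the proxy.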

Because a proxy acts as both a client and server, it uses a lot of classes. There's a ProxyClientFactory and ProxyClient, which provide the Factory/Protocol pair for client connections to other servers. To accept connections from clients, the proxy module provides the class ProxyRequest, a subclass of http.Request, and Proxy, a subclass of http.HTTPChannel. These are used the same way as they would be in a regular HTTP server: an HTTPFactory uses Proxy for its Protocol, and the Proxy HTTPChannel uses ProxyRequest as its requestFactory. Here's the sequence of events when a client sends a request for a web page:
1. The client establishes a connection to the proxy server. This connection is handled by the HTTPFactory.
2. HTTPFactory.buildProtocol creates a Proxy object to send and receive data over the client connection.
3. When the client sends a request over the connection, the Proxy creates a ProxyRequest to handle it.
4. The ProxyRequest looks at the request to see what server the client is trying to connect to. It creates a ProxyClientFactory and calls reactor.connectTCP to connect the factory to the server.
5. Once the ProxyClientFactory is connected to the server, it creates a ProxyClient Protocol object to send and receive data over the connection.
6. ProxyClient sends the original request to the server. As it receives the reply, it sends it back to the client that sent the request. This is done by calling self.father.transport.write: self.father is the ProxyRequest object that is handling the client's request, and its transport is the client's connection.
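Step 4 relies on the fact that a browser configured to use a proxy sends the full URL in its request line, so the proxy can work out where to connect. The following Python 3 sketch illustrates that idea using only the standard library. It is not Twisted's implementation; the function name split_proxy_uri and the DEFAULT_PORTS table are assumptions made for this example.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80}  # assumption: only plain HTTP is proxied


def split_proxy_uri(uri):
    """Split an absolute request URI into (host, port, rest).

    host and port say where the proxy should connect; rest is the
    path (plus query string) to request from that server.
    """
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    host = parts.hostname
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return host, port, rest
```

For a request for http://localhost:8000/report?x=1, this yields the host localhost, the port 8000, and the path /report?x=1, which is the information a proxy needs before it can open its own connection to the server.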
With such a long chain of classes, it becomes a lot of work to pass an object from one end of the chain to the other. But it is possible, as Example 4-8 demonstrates. By creating a subclass of each class provided by the proxy module, you can have complete control over every step of the process.

At only one step in Example 4-8 is it necessary to resort to a bit of a hack. The ProxyClientFactory class has a buildProtocol method that's hardcoded to use ProxyClient as the protocol. It doesn't give you any easy way to substitute your own subclass of ProxyClient instead. The solution is to use the special Python __class__ attribute to do an in-place upgrade of the ProxyClient object returned by ProxyClientFactory.buildProtocol, which changes the object from a ProxyClient to a WordCountProxyClient.
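The __class__ trick is plain Python and can be seen without Twisted. In this toy sketch, Client, CountingClient, and build_client are hypothetical stand-ins for ProxyClient, WordCountProxyClient, and the hardcoded buildProtocol method.

```python
class Client(object):
    """Stand-in for a library class we cannot change."""

    def __init__(self, name):
        self.name = name

    def handle(self, data):
        return "%s passes on %r" % (self.name, data)


class CountingClient(Client):
    """Subclass that adds behaviour but defines no __init__ of its own."""

    def handle(self, data):
        # 'seen' cannot be set up in __init__: see the note below
        self.seen = getattr(self, "seen", 0) + 1
        return Client.handle(self, data)


def build_client():
    """Stand-in for a factory method hardcoded to create Client."""
    return Client("client-1")


client = build_client()
client.__class__ = CountingClient  # in-place upgrade of an existing object
result = client.handle("x")
```

Reassigning __class__ changes which methods the object uses but never runs the subclass's __init__. That is presumably why WordCountProxyClient creates its parser attribute lazily in handleHeader and checks for it with hasattr rather than initializing it in a constructor.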

In addition to the proxy server, Example 4-8 runs a regular web server on port 8000, which displays the current word count data from the proxy server. The ability to include a lightweight embedded HTTP server in your application is extremely handy, and can be used in any Twisted application where you want to provide a way to view status information remotely.
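The report page itself boils down to sorting the word counts and writing them out as HTML. As a rough Python 3 sketch of that step (the function name render_report and the ul/li markup are assumptions for illustration), sorting with a key function replaces the cmp-style comparison used in the Python 2 listing:

```python
from html import escape


def render_report(words):
    """Return an HTML list of words, most frequent first.

    words maps each word to its count; ties are broken alphabetically.
    """
    ranked = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    lines = ["<li>%s %d</li>" % (escape(word), count) for word, count in ranked]
    return "<ul>\n%s\n</ul>" % "\n".join(lines)
```

Keeping the rendering in a small function like this makes it easy to check the output without starting a server, and escaping each word guards against markup that slips through from proxied pages.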
