Besides HTTP servers and clients, twisted.web includes support for writing HTTP proxies. A proxy is a client and server in one: it accepts requests from clients (acting as a server) and forwards them to servers (acting as a client). Then it sends the response back to the client who originally sent the request. HTTP proxies are useful mostly for the additional services they can provide, such as caching, filtering, and usage reporting. This lab shows how to build an HTTP proxy using Twisted.
4.6.1. How Do I Do That?
The twisted.web package includes twisted.web.proxy, a module with classes for building HTTP proxies. Example 4-7 shows how easy it is to set up a basic proxy.
Example 4-7. simpleproxy.py
from twisted.web import proxy, http
from twisted.internet import reactor
from twisted.python import log
import sys
log.startLogging(sys.stdout)

class ProxyFactory(http.HTTPFactory):
    protocol = proxy.Proxy

reactor.listenTCP(8001, ProxyFactory())
reactor.run()
Run simpleproxy.py from the command line and you'll have an HTTP proxy running on localhost port 8001. Set up a web browser to use this proxy and try surfing some web pages. The call to log.startLogging prints all HTTP log messages to stdout so you can watch the proxy at work:
$ python simpleproxy.py
2005/06/13 00:22 EDT [-] Log opened.
2005/06/13 00:22 EDT [-] __main__.ProxyFactory starting on 8001
2005/06/13 00:22 EDT [-] Starting factory <__main__.ProxyFactory instance at 0xb7d9d10c>
2005/06/13 00:23 EDT [Proxy,0,127.0.0.1] Starting factory
2005/06/13 00:23 EDT [-] Enabling Multithreading.
2005/06/13 00:23 EDT [Proxy,1,127.0.0.1] Starting factory
2005/06/13 00:23 EDT [Proxy,2,127.0.0.1] Starting factory
...
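If you would rather exercise the proxy from a script than from a browser, something like the following may work. It is only a sketch, and it assumes Python 3's standard library rather than the Python 2 used in the examples; the helper name make_proxy_opener is hypothetical and not part of Twisted.

```python
import urllib.request


def make_proxy_opener(host="localhost", port=8001):
    """Build a urllib opener that routes plain-HTTP requests via the proxy.

    make_proxy_opener is a hypothetical helper; host and port default to
    the address simpleproxy.py listens on.
    """
    proxy_url = "http://%s:%d" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


# Usage, with simpleproxy.py already running in another terminal:
#   opener = make_proxy_opener()
#   body = opener.open("http://www.twistedmatrix.com/").read()
```

Each request sent through the opener should show up in the proxy's log output, just as a browser request would.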
That gives you a working proxy, but not one that does anything useful. Example 4-8 dives deeper into the twisted.web.proxy module to build a proxy that keeps track of the most frequently used words in the HTML documents being browsed.
Example 4-8. wordcountproxy.py
import sgmllib, re
from twisted.web import proxy, http
import sys
from twisted.python import log
log.startLogging(sys.stdout)

WEB_PORT = 8000
PROXY_PORT = 8001

class WordParser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.chardata = []
        self.inBody = False

    def start_body(self, attrs):
        self.inBody = True

    def end_body(self):
        self.inBody = False

    def handle_data(self, data):
        # only collect text that appears inside <body>
        if self.inBody:
            self.chardata.append(data)

    def getWords(self):
        # extract words
        wordFinder = re.compile(r'\w*')
        words = wordFinder.findall("".join(self.chardata))
        # \w* also yields empty matches; drop them
        words = filter(lambda word: word.strip(), words)
        print "WORDS ARE", words
        return words

class WordCounter(object):
    ignoredWords = "the a of in from to this that and or but is was be can could i you they we at".split()

    def __init__(self):
        self.words = {}

    def addWords(self, words):
        for word in words:
            word = word.lower()
            if not word in self.ignoredWords:
                currentCount = self.words.get(word, 0)
                self.words[word] = currentCount + 1

class WordCountProxyClient(proxy.ProxyClient):
    def handleHeader(self, key, value):
        proxy.ProxyClient.handleHeader(self, key, value)
        if key.lower() == "content-type":
            if value.split(';')[0] == 'text/html':
                self.parser = WordParser()

    def handleResponsePart(self, data):
        proxy.ProxyClient.handleResponsePart(self, data)
        if hasattr(self, 'parser'):
            self.parser.feed(data)

    def handleResponseEnd(self):
        proxy.ProxyClient.handleResponseEnd(self)
        if hasattr(self, 'parser'):
            self.parser.close()
            self.father.wordCounter.addWords(self.parser.getWords())
            del(self.parser)

class WordCountProxyClientFactory(proxy.ProxyClientFactory):
    def buildProtocol(self, addr):
        client = proxy.ProxyClientFactory.buildProtocol(self, addr)
        # upgrade proxy.ProxyClient object to WordCountProxyClient
        client.__class__ = WordCountProxyClient
        return client

class WordCountProxyRequest(proxy.ProxyRequest):
    protocols = {'http': WordCountProxyClientFactory}

    def __init__(self, wordCounter, *args):
        self.wordCounter = wordCounter
        proxy.ProxyRequest.__init__(self, *args)

class WordCountProxy(proxy.Proxy):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        proxy.Proxy.__init__(self)

    def requestFactory(self, *args):
        return WordCountProxyRequest(self.wordCounter, *args)

class WordCountProxyFactory(http.HTTPFactory):
    def __init__(self, wordCounter):
        self.wordCounter = wordCounter
        http.HTTPFactory.__init__(self)

    def buildProtocol(self, addr):
        protocol = WordCountProxy(self.wordCounter)
        return protocol

# classes for web reporting interface
class WebReportRequest(http.Request):
    def __init__(self, wordCounter, *args):
        self.wordCounter = wordCounter
        http.Request.__init__(self, *args)

    def process(self):
        self.setHeader("Content-Type", "text/html")
        words = self.wordCounter.words.items()
        # sort by count, most frequent first
        words.sort(lambda (w1, c1), (w2, c2): cmp(c2, c1))
        for word, count in words:
            self.write("
Run wordcountproxy.py and set your browser to use the proxy server on localhost port 8001. Browse to a couple of sites to test the proxy. Then go to http://localhost:8000/ to see a report of word frequency in the sites you've visited. Figure 4-10 shows what your browser might look like after visiting http://www.twistedmatrix.com.
Figure 4-10. List of the most common words in proxied web pages
4.6.2. How Does That Work?
There are a lot of classes in Example 4-8, but the majority of them are just glue. Only a few are doing real work. The first two classes, WordParser and WordCounter, do the work of extracting words from the text of HTML documents and counting their frequency. The third class, WordCountProxyClient, contains the code that looks for HTML documents and runs them through a WordParser as they come back from the server. That's it for code specific to the problem of counting words.
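The word-extraction and counting steps can be tried on their own, without a proxy. The sketch below is not the code from Example 4-8: it assumes Python 3, where sgmllib is no longer available, and so swaps in html.parser.HTMLParser and collections.Counter. The names BodyTextParser and count_words are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Same stop-word list as WordCounter.ignoredWords in Example 4-8.
IGNORED_WORDS = set(
    "the a of in from to this that and or but is was be can could "
    "i you they we at".split()
)


class BodyTextParser(HTMLParser):
    """Collect character data that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.chardata = []
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chardata.append(data)

    def get_words(self):
        # \w+ skips the empty matches that \w* would produce
        return re.findall(r"\w+", "".join(self.chardata))


def count_words(html, counter=None):
    """Add the words found in one HTML document to a running Counter."""
    counter = Counter() if counter is None else counter
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    for word in parser.get_words():
        word = word.lower()
        if word not in IGNORED_WORDS:
            counter[word] += 1
    return counter
```

Passing the same Counter to count_words for each document accumulates totals across pages, which is the role the shared WordCounter object plays in the proxy.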
Because a proxy acts as both a client and server, it uses a lot of classes. There's a ProxyClientFactory and ProxyClient, which provide the Factory/Protocol pair for client connections to other servers. To accept connections from clients, the proxy module provides the class ProxyRequest, a subclass of http.Request, and Proxy, a subclass of http.HTTPChannel. These are used the same way as they would be in a regular HTTP server: an HTTPFactory uses Proxy for its Protocol, and the Proxy HTTPChannel uses ProxyRequest as its requestFactory. Here's the sequence of events when a client sends a request for a web page:
With such a long chain of classes, it becomes a lot of work to pass an object from one end of the chain to the other. But it is possible, as Example 4-8 demonstrates. By creating a subclass of each class provided by the proxy module, you can have complete control over every step of the process.
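The pattern of handing one shared object down the chain can be shown with plain stand-in classes. This is a simplified sketch in Python 3, not Twisted code: Factory, Channel, and Request are hypothetical stand-ins for the roles played by WordCountProxyFactory, WordCountProxy, and WordCountProxyRequest.

```python
class Request:
    """Stand-in for the request object that finally uses the counter."""

    def __init__(self, word_counter, path):
        self.word_counter = word_counter
        self.path = path


class Channel:
    """Stand-in for the protocol: creates one Request per incoming request."""

    def __init__(self, word_counter):
        self.word_counter = word_counter

    def request_factory(self, path):
        # pass the shared object one step further down the chain
        return Request(self.word_counter, path)


class Factory:
    """Stand-in for the factory: creates one Channel per connection."""

    def __init__(self, word_counter):
        self.word_counter = word_counter

    def build_protocol(self):
        return Channel(self.word_counter)


shared = {}  # plays the part of the WordCounter instance
factory = Factory(shared)
request = factory.build_protocol().request_factory("/index.html")
request.word_counter["twisted"] = 1  # visible to every holder of `shared`
```

Every link stores a reference to the same object and passes it to whatever it creates next, so an update made at the far end of the chain is visible to whoever created the object in the first place.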
At only one step in Example 4-8 is it necessary to resort to a bit of a hack. The ProxyClientFactory class has a buildProtocol method that's hardcoded to use ProxyClient as the protocol. It doesn't give you any easy way to substitute your own subclass of ProxyClient instead. The solution is to use the special Python __class__ attribute to do an in-place upgrade of the ProxyClient object returned by ProxyClientFactory.buildProtocol, which changes the object from a ProxyClient to a WordCountProxyClient.
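The __class__ trick is easier to see in isolation. Here is a minimal sketch in Python 3 using made-up classes; BaseClient, CountingClient, and build_client are hypothetical names standing in for ProxyClient, WordCountProxyClient, and the hardcoded buildProtocol.

```python
class BaseClient:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "base client %s" % self.name


class CountingClient(BaseClient):
    def describe(self):
        return "counting client %s" % self.name


def build_client(name):
    """Stand-in for a factory method hardcoded to return BaseClient."""
    return BaseClient(name)


client = build_client("example")
# In-place upgrade: the instance keeps its attributes but now looks up
# methods on CountingClient. No __init__ is re-run.
client.__class__ = CountingClient
```

Because no constructor runs during the upgrade, the subclass cannot rely on attributes set in its own __init__. That is presumably why WordCountProxyClient checks hasattr(self, 'parser') instead of initializing a parser attribute up front.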
In addition to the proxy server, Example 4-8 runs a regular web server on port 8000, which displays the current word count data from the proxy server. The ability to include a lightweight embedded HTTP server in your application is extremely handy, and can be used in any Twisted application where you want to provide a way to view status information remotely.
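The report page boils down to sorting the word counts and writing them out as HTML. The following is a rough sketch of that step in Python 3; the function name render_report and the table markup are assumptions made for illustration, not the output produced by Example 4-8.

```python
from html import escape


def render_report(words):
    """Return an HTML table of word counts, most frequent first.

    `words` is a dict mapping word -> count, like WordCounter.words.
    The markup here is an assumption, not the original report format.
    """
    # sort by descending count, then alphabetically to break ties
    rows = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    lines = ["<table>"]
    for word, count in rows:
        lines.append("<tr><td>%s</td><td>%d</td></tr>" % (escape(word), count))
    lines.append("</table>")
    return "\n".join(lines)
```

Keeping the rendering in a plain function like this makes it easy to check without starting a server; a request handler would only need to write the returned string and finish the response.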