3.4. Checking Whether a Page Has Changed

One popular type of HTTP application is the RSS (Really Simple Syndication) aggregator, which downloads news items or blog posts in RSS (or Atom) format. An RSS aggregator downloads a new copy of each feed at regular intervals, typically once an hour. This can waste a lot of the publisher's bandwidth, though: if the contents of the feed change infrequently, the client downloads the same data over and over again.

To prevent this waste of network resources, RSS aggregators (and other applications that request the same page multiple times) are encouraged to use a conditional HTTP GET request. By including conditional HTTP headers with a request, a client instructs the server to return the page data only if certain conditions are met. One such condition is that the page has been modified since the client last checked it.
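To make the server's side of this exchange concrete, here is a minimal sketch in Python 3, using only the standard library, of how a server might choose between a full 200 response and an empty 304. The function name conditional_status and its parameters are hypothetical, and the ETag comparison is deliberately simplified; real servers handle more cases.

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, current_etag, current_last_modified):
    """Return 304 if the client's cached copy is still current, else 200.

    request_headers: dict of request header names to values (assumed shape).
    current_etag / current_last_modified: the values the server would send
    with the page right now, as header strings.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match, when present, is checked instead of If-Modified-Since.
        # Simplification: split on commas and compare tags as plain strings.
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if current_etag in client_tags else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            modified = parsedate_to_datetime(current_last_modified)
            unchanged = modified <= since
        except (TypeError, ValueError):
            unchanged = False  # unusable date: ignore the condition
        if unchanged:
            return 304

    return 200  # no usable condition, or the page has changed
```

A request with no conditional headers always gets 200 here, which is why a client sees no savings until it starts sending them.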

3.4.1. How Do I Do That?

Keep track of the headers returned the first time you download the page. Look for either an ETag header, which uniquely identifies the current revision of the page, or the Last-Modified header, which gives the page's modification time. The next time you request the page, send the headers If-None-Match, with the ETag value, and If-Modified-Since, with the Last-Modified value. If the server supports conditional GET requests, it will return a 304 Not Modified response, with no page data, if the page has not been modified since the last request.
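The bookkeeping described above can be sketched as a small helper. This is an illustration in Python 3 and not part of the example that follows: the name build_conditional_headers is hypothetical, and it assumes the earlier response headers are available as a plain dictionary of strings.

```python
def build_conditional_headers(previous_headers):
    """Build request headers for a conditional GET from an earlier response.

    previous_headers: dict mapping response header names to string values
    (an assumed shape; real client libraries may expose headers differently).
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in previous_headers.items()}

    conditional = {}
    if lowered.get("etag"):
        conditional["If-None-Match"] = lowered["etag"]
    if lowered.get("last-modified"):
        conditional["If-Modified-Since"] = lowered["last-modified"]
    return conditional
```

If the first response carried neither header, the result is an empty dictionary and the second request is an ordinary, unconditional GET.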

The getPage and downloadPage functions provided by twisted.web.client are handy, but they don't allow for the level of control necessary to use conditional requests. Therefore, you'll need to use the slightly lower-level HTTPClientFactory interface. Example 3-5 demonstrates using HTTPClientFactory to test whether a page has been updated.

Example 3-5. updatecheck.py


from twisted.web import client


class HTTPStatusChecker(client.HTTPClientFactory):

    def __init__(self, url, headers=None):
        client.HTTPClientFactory.__init__(self, url, headers=headers)
        self.status = None
        self.deferred.addCallback(
            lambda data: (data, self.status, self.response_headers))

    def noPage(self, reason): # called for non-200 responses
        if self.status == '304': # Page hadn't changed
            client.HTTPClientFactory.page(self, '')
        else:
            client.HTTPClientFactory.noPage(self, reason)


def checkStatus(url, contextFactory=None, *args, **kwargs):
    scheme, host, port, path = client._parse(url)
    factory = HTTPStatusChecker(url, *args, **kwargs)
    if scheme == 'https':
        from twisted.internet import ssl
        if contextFactory is None:
            contextFactory = ssl.ClientContextFactory()
        reactor.connectSSL(host, port, factory, contextFactory)
    else:
        reactor.connectTCP(host, port, factory)
    return factory.deferred


def handleFirstResult(result, url):
    data, status, headers = result
    nextRequestHeaders = {}
    eTag = headers.get('etag')
    if eTag:
        nextRequestHeaders['If-None-Match'] = eTag[0]
    modified = headers.get('last-modified')
    if modified:
        nextRequestHeaders['If-Modified-Since'] = modified[0]
    return checkStatus(url, headers=nextRequestHeaders).addCallback(
        handleSecondResult)


def handleSecondResult(result):
    data, status, headers = result
    print 'Second request returned status %s:' % status,
    if status == '200':
        print 'Page changed (or server does not support conditional requests).'
    elif status == '304':
        print 'Page is unchanged.'
    else:
        print 'Unexpected Response.'
    reactor.stop()


def handleError(failure):
    print "Error", failure.getErrorMessage()
    reactor.stop()


if __name__ == "__main__":
    import sys
    from twisted.internet import reactor

    url = sys.argv[1]
    checkStatus(url).addCallback(
        handleFirstResult, url).addErrback(
        handleError)
    reactor.run()

Run updatecheck.py from the command line with a web URL as the first argument. It will download the page once, and then download it again using a conditional GET. It then reports whether the second response was a 304, which means that the server understood the conditional headers and determined that the page had not changed. It's fairly typical for servers to support conditional GET requests for static files, such as RSS feeds, but not for dynamically generated content, such as the home page:


    $ python updatecheck.py http://slashdot.org/slashdot.rss
    Second request returned status 304: Page is unchanged.
    $ python updatecheck.py http://slashdot.org/
    Second request returned status 200: Page changed (or server does not support conditional requests).

3.4.2. How Does That Work?

The HTTPStatusChecker class is a subclass of client.HTTPClientFactory. It does a couple of notable things. During initialization, it adds an additional callback to self.deferred, using a lambda function. This anonymous function will catch the result of self.deferred before it gets passed to any external callback handlers. It will then replace this result (the downloaded data) with a tuple containing more information: the data, the HTTP status code, and self.response_headers, which is a dictionary of the headers returned with the response.
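The effect of that internal callback is easier to see in isolation. The sketch below, in Python 3, uses a toy class named MiniDeferred that was written for this illustration; it is not Twisted's Deferred and supports only a straight chain of callbacks. FakeFactory is likewise a hypothetical stand-in for the real factory. The point is only that each callback's return value becomes the input to the next one, which is what the lambda relies on.

```python
class MiniDeferred:
    """Toy callback chain for illustration only; NOT Twisted's Deferred."""

    def __init__(self):
        self._callbacks = []

    def add_callback(self, func):
        self._callbacks.append(func)
        return self  # allow chained calls

    def fire(self, result):
        # Pass the result through every callback in the order it was added.
        for func in self._callbacks:
            result = func(result)
        return result


class FakeFactory:
    """Hypothetical stand-in for the factory, holding a status and headers."""

    def __init__(self):
        self.status = "200"
        self.response_headers = {"etag": ['"v1"']}
        self.deferred = MiniDeferred()
        # Internal callback, added first: wrap the raw data in a tuple.
        self.deferred.add_callback(
            lambda data: (data, self.status, self.response_headers))


received = []
factory = FakeFactory()
# External callback, added later: it sees the tuple, not the bare data.
factory.deferred.add_callback(received.append)
factory.deferred.fire("<html>...</html>")
```

Because the lambda reads self.status and self.response_headers only when it is called, it picks up whatever values are set on the factory by the time the result arrives.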

HTTPStatusChecker also overrides the noPage method, which HTTPClientFactory calls to indicate an unsuccessful response code. If the response status is 304 (the Not Modified status code), the noPage method calls HTTPClientFactory.page, which indicates a successful response, instead of the original noPage method. For any other status, the noPage in HTTPStatusChecker passes the call on to the overridden noPage in HTTPClientFactory. In this way, it prevents a 304 response from being considered an error.
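The same wrinkle shows up outside Twisted. As a point of comparison only, Python 3's standard urllib.request also reports a 304 as an error (it raises HTTPError), so a client has to catch it and turn it back into an ordinary result. The function name fetch below is hypothetical, and the sketch assumes a plain dictionary of request headers.

```python
import urllib.error
import urllib.request


def fetch(url, headers=None):
    """Fetch url and return (data, status, response_headers).

    A 304 response is returned as a normal result with empty data,
    instead of being raised as an error.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.status, response.headers
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: the cached copy is still good
            return b"", error.code, error.headers
        raise  # any other error status is still an error
```

As in HTTPStatusChecker, only 304 is special-cased; a 404 or 500 still propagates to the caller as an exception.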

The checkStatus function takes a URL and parses it using the twisted.web.client._parse utility function. It looks at the parts of the URL, gets the hostname it needs to connect to, and whether it's using HTTP (which runs over straight TCP) or HTTPS (which runs over SSL, and establishes the connection using reactor.connectSSL). Next, checkStatus creates an HTTPStatusChecker factory object, and opens the connection. All this code is basically lifted from twisted.web.client.getPage and modified to use the HTTPStatusChecker factory instead of the vanilla HTTPClientFactory.
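For readers who want to see what such a parsing step involves, here is a rough standard-library analogue in Python 3. It is a sketch and not the implementation of twisted.web.client._parse, which is a private helper whose exact behavior may differ; the name parse_url and the default-port table are assumptions made for this illustration.

```python
from urllib.parse import urlsplit

# Assumed defaults for URLs that do not name a port explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_url(url):
    """Split url into (scheme, host, port, path) for opening a connection."""
    parts = urlsplit(url)
    scheme = parts.scheme or "http"
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)
    path = parts.path or "/"  # an empty path means the site root
    if parts.query:
        path += "?" + parts.query  # keep the query string with the path
    return scheme, parts.hostname, port, path
```

The scheme is what decides between a plain TCP connection and an SSL one, and the host and port say where to connect.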

When updatecheck.py runs, it calls checkStatus, setting handleFirstResult as the callback handler. handleFirstResult, in turn, makes a second request using the If-None-Match and If-Modified-Since conditional headers, setting handleSecondResult as the callback handler. The handleSecondResult function reports whether the server returned a 304 response, and then stops the reactor.

handleFirstResult actually returns the deferred result of handleSecondResult. This allows handleError, the error handler function assigned to the first call to checkStatus, to handle any errors that come up in the second call to checkStatus as well.


Twisted Network Programming Essentials
ISBN: 0596100329
Year: 2004
Pages: 107
Authors: Abe Fettig
