Downloading a Web Page

The simplest and most common task for a web client application is fetching the contents of a web page. The client connects to the server, sends an HTTP GET request, and receives an HTTP response containing the requested page.

3.1.1. How Do I Do That?

Here's where you can begin to experience the usefulness of Twisted's built-in protocol support. The twisted.web package includes a complete HTTP implementation, saving you the work of developing the necessary Protocol and ClientFactory classes. Furthermore, it includes utility functions that allow you to make an HTTP request with a single function call. To fetch the contents of a web page, use the function twisted.web.client.getPage. Example 3-1 is a Python script called webcat.py, which fetches a URL that you specify.

Example 3-1. webcat.py


from twisted.web import client
from twisted.internet import reactor
import sys

def printPage(data):
    # Called with the page contents once the download completes.
    print data
    reactor.stop()

def printError(failure):
    # Called with a Failure if the request could not be completed.
    print >> sys.stderr, "Error:", failure.getErrorMessage()
    reactor.stop()

if len(sys.argv) == 2:
    url = sys.argv[1]
    client.getPage(url).addCallback(
        printPage).addErrback(
        printError)
    reactor.run()
else:
    print "Usage: webcat.py <URL>"

Give webcat.py a URL as its first argument, and it will fetch and print the contents of the page:


    $ python webcat.py http://www.oreilly.com/
    oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books, software conferences, online publishing...

3.1.2. How Does That Work?

The printPage and printError functions are simple event handlers that print the downloaded page contents or an error message, respectively. The most important line in Example 3-1 is the call to client.getPage(url). This function returns a Deferred object that will be called back with the contents of the page once it has been completely downloaded.

Notice how the callbacks are added to the Deferred in a single line. This is possible because addCallback and addErrback both return a reference to their Deferred object. Therefore, the statements:


    d = deferredFunction()
    d.addCallback(resultHandler)
    d.addErrback(errorHandler)

can be expressed as:


    deferredFunction().addCallback(resultHandler).addErrback(errorHandler)

Which of these two forms is more readable is probably a matter of personal opinion, but the latter is an idiom that appears frequently in Twisted code.
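To see why the chained form works, consider the toy stand-in below. ToyDeferred is a hypothetical class written only for this illustration (in Python 3 compatible syntax); it is not Twisted's implementation, and it models just one behavior: addCallback and addErrback return the object they were called on.

```python
class ToyDeferred(object):
    """Hypothetical, simplified stand-in for a Deferred (illustration only)."""

    def __init__(self):
        self.callbacks = []  # handlers for successful results
        self.errbacks = []   # handlers for errors

    def addCallback(self, handler):
        self.callbacks.append(handler)
        return self  # returning self is what makes chaining possible

    def addErrback(self, handler):
        self.errbacks.append(handler)
        return self


def resultHandler(result):
    return result


def errorHandler(failure):
    return failure


# Form 1: one statement per handler.
d1 = ToyDeferred()
d1.addCallback(resultHandler)
d1.addErrback(errorHandler)

# Form 2: chained, in the style of Example 3-1.
d2 = ToyDeferred().addCallback(resultHandler).addErrback(errorHandler)

# Both forms register exactly the same handlers.
same_handlers = (d1.callbacks == d2.callbacks and d1.errbacks == d2.errbacks)
```

Because each method hands back the same object, the second form is only a more compact spelling of the first.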

3.1.3. What About...

... writing the page to disk as it's being downloaded? One disadvantage of the webcat.py script in Example 3-1 is that it loads the entire contents of the downloaded page into memory, which could be a problem if you're downloading a large file. A better approach might be to write the data to a temporary file on disk as it's being downloaded, and then read the contents back from the temp file once the download is complete.
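The general idea of spooling data to a temp file chunk by chunk can be sketched without Twisted at all. The snippet below is a blocking, standard-library-only illustration written for Python 3, not the asynchronous approach Twisted takes; save_stream_to_temp_file and fake_response are hypothetical names, and an in-memory byte stream stands in for the network response.

```python
import io
import os
import tempfile


def save_stream_to_temp_file(stream, chunk_size=8192):
    """Copy a file-like object to a temp file in chunks; return the file's name."""
    fd, temp_name = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as temp_file:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            temp_file.write(chunk)
    return temp_name


# Stand-in for a network response: an in-memory byte stream.
fake_response = io.BytesIO(b"<html>" + b"x" * 20000 + b"</html>")
name = save_stream_to_temp_file(fake_response)

# Read the contents back once the "download" is complete.
with open(name, "rb") as saved:
    contents = saved.read()
os.unlink(name)  # delete the temp file once we're done with it
```

Only one chunk is written at a time, so the copy itself never needs the whole page in memory, however large the source is.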

twisted.web.client includes downloadPage, a function that is similar to getPage but that writes data to a file. Call downloadPage with a URL as the first argument, and a filename or file object as the second. The script webcat2.py in Example 3-2 does this.

Example 3-2. webcat2.py


from twisted.web import client
import os
import tempfile

def downloadToTempFile(url):
    """
    Given a URL, returns a Deferred that will be called back with
    the name of a temporary file containing the downloaded data.
    """
    tmpfd, tempfilename = tempfile.mkstemp()
    os.close(tmpfd)  # only the filename is passed on to downloadPage
    return client.downloadPage(url, tempfilename).addCallback(
        returnFilename, tempfilename)

def returnFilename(result, filename):
    return filename

if __name__ == "__main__":
    import sys
    from twisted.internet import reactor

    def printFile(filename):
        for line in open(filename, 'rb'):
            sys.stdout.write(line)
        os.unlink(filename) # delete file once we're done with it
        reactor.stop()

    def printError(failure):
        print >> sys.stderr, "Error:", failure.getErrorMessage()
        reactor.stop()

    if len(sys.argv) == 2:
        url = sys.argv[1]
        downloadToTempFile(url).addCallback(
            printFile).addErrback(
            printError)
        reactor.run()
    else:
        print "Usage: %s <URL>" % sys.argv[0]

The downloadToTempFile function in Example 3-2 returns the Deferred that results from calling twisted.web.client.downloadPage. Before returning it, downloadToTempFile adds returnFilename as a callback to this Deferred, with the temp filename as an additional argument. This means that when the result of downloadPage comes in, Twisted will call returnFilename with that result as the first argument and the filename as the second argument.

Example 3-2 registers another callback for the result of downloadToTempFile. Remember that the Deferred returned from downloadToTempFile already has returnFilename as a callback handler. Therefore, when the result comes in, returnFilename will be called first. The result of this function (the filename) will be used to call printFile.
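The order of events described above can be modeled as a simple pipeline. This is a simplified sketch under stated assumptions, not how Twisted is implemented: run_callback_chain is a hypothetical helper, the path /tmp/example-download and the initial None result are placeholders, and errbacks and Failure objects are left out entirely.

```python
def run_callback_chain(first_result, chain):
    """Feed first_result through each (callback, extra_args) pair in order."""
    result = first_result
    for callback, extra_args in chain:
        # Each callback receives the previous result as its first argument,
        # followed by any extra arguments given when it was registered.
        result = callback(result, *extra_args)
    return result


def returnFilename(result, filename):
    return filename


printed = []


def printFile(filename):
    printed.append(filename)  # stand-in for printing the file's contents


chain = [
    (returnFilename, ("/tmp/example-download",)),  # added by downloadToTempFile
    (printFile, ()),                               # added by the main script
]

# None is a placeholder for whatever result the download delivers.
run_callback_chain(None, chain)
```

In this model returnFilename discards the download's own result and replaces it with the filename, which is why printFile ends up receiving the filename.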

Twisted Network Programming Essentials
ISBN: 0596100329
Year: 2004
Pages: 107
Authors: Abe Fettig
