20.2. Web Surfing with Python: Creating Simple Web Clients

One thing to keep in mind is that a browser is only one type of Web client. Any application that makes a request for data from a Web server is considered a "client." Yes, it is possible to create other clients that retrieve documents or data off the Internet. One important reason to do this is that a browser provides only limited capacity, i.e., it is used primarily for viewing and interacting with Web sites. A client program, on the other hand, has the ability to do more: it can not only download data, but it can also store it, manipulate it, or perhaps even transmit it to another location or application.

Applications that use the urllib module to download or access information on the Web [using either urllib.urlopen() or urllib.urlretrieve()] can be considered a simple Web client. All you need to do is provide a valid Web address.

20.2.1. Uniform Resource Locators

Simple Web surfing involves using Web addresses called URLs (Uniform Resource Locators). Such addresses are used to locate a document on the Web or to call a CGI program to generate a document for your client. URLs are part of a larger set of identifiers known as URIs (Uniform Resource Identifiers). This superset was created in anticipation of other naming conventions that have yet to be developed. A URL is simply a URI which uses an existing protocol or scheme (i.e., http, ftp, etc.) as part of its addressing. To complete this picture, we'll add that non-URL URIs are sometimes known as URNs (Uniform Resource Names), but because URLs are the only URIs in use today, you really don't hear much about URIs or URNs, save perhaps as XML identifiers.

Like street addresses, Web addresses have some structure. An American street address usually is of the form "number street designation," i.e., 123 Main Street. It differs from other countries, which have their own rules. A URL uses the format:

        prot_sch://net_loc/path;params?query#frag


Table 20.1 describes each of the components.

Table 20.1. Web Address Components

URL Component     Description
prot_sch          Network protocol or download scheme
net_loc           Location of server (and perhaps user information)
path              Slash ( / ) delimited path to file or CGI application
params            Optional parameters
query             Ampersand ( & ) delimited set of "key=value" pairs
frag              Fragment to a specific anchor within document


net_loc can be broken down into several more components, some required, others optional. The net_loc string looks like this:

        user:passwd@host:port


These individual components are described in Table 20.2.

Table 20.2. Network Location Components

net_loc Component     Description
user                  User name or login
passwd                User password
host                  Name or address of machine running Web server [required]
port                  Port number (if not 80, the default)


Of the four, the host name is the most important. The port number is necessary only if the Web server is running on a different port number from the default. (If you aren't sure what a port number is, go back to Chapter 16.)

User names and perhaps passwords are used only when making FTP connections, and even then they usually aren't necessary because the majority of such connections are "anonymous."
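
As an illustration (the host, user, and password below are made up), here is a URL whose net_loc carries every one of these pieces. The urlparse module, introduced in the next section, leaves the entire net_loc intact as the second item of its result (shown here as a plain 6-tuple; newer interpreters may display a named result):

        >>> import urlparse
        >>> urlparse.urlparse('ftp://anonymous:guest@ftp.example.com:2121/pub/README')
        ('ftp', 'anonymous:guest@ftp.example.com:2121', '/pub/README', '', '', '')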

Python supplies two different modules for dealing with URLs, each with completely different functionality and capabilities. One is urlparse, and the other is urllib. We will briefly introduce some of their functions here.

20.2.2. urlparse Module

The urlparse module provides basic functionality with which to manipulate URL strings. These functions include urlparse(), urlunparse(), and urljoin().

urlparse.urlparse()

urlparse() breaks up a URL string into some of the major components described above. It has the following syntax:

        urlparse(urlstr, defProtSch=None, allowFrag=None)


urlparse() parses urlstr into a 6-tuple (prot_sch, net_loc, path, params, query, frag). Each of these components has been described above. defProtSch indicates a default network protocol or download scheme in case one is not provided in urlstr. allowFrag is a flag that signals whether or not a fragment part of a URL is allowed. Here is what urlparse() outputs when given a URL:

        >>> urlparse.urlparse('http://www.python.org/doc/FAQ.html')
        ('http', 'www.python.org', '/doc/FAQ.html', '', '', '')
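
For a URL that also carries parameters, a query, and a fragment, all six slots of the tuple are filled in (the host and CGI script below are hypothetical):

        >>> urlparse.urlparse('http://host.example.com/bin/find.py;sort=up?q=python&lang=en#results')
        ('http', 'host.example.com', '/bin/find.py', 'sort=up', 'q=python&lang=en', 'results')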


urlparse.urlunparse()

urlunparse() does the exact opposite of urlparse(): it merges a 6-tuple urltup (prot_sch, net_loc, path, params, query, frag), which could be the output of urlparse(), into a single URL string and returns it. Accordingly, we state the following equivalence:

        urlunparse(urlparse(urlstr)) ≡ urlstr


You may have already surmised that the syntax of urlunparse() is as follows:

        urlunparse(urltup)
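
For instance, feeding the tuple from the earlier urlparse() example straight back in gives us the original URL (a quick sketch using the same python.org address):

        >>> urltup = urlparse.urlparse('http://www.python.org/doc/FAQ.html')
        >>> urlparse.urlunparse(urltup)
        'http://www.python.org/doc/FAQ.html'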


urlparse.urljoin()

The urljoin() function is useful in cases where many related URLs are needed, for example, the URLs for a set of pages to be generated for a Web site. The syntax for urljoin() is:

        urljoin(baseurl, newurl, allowFrag=None)


urljoin() takes baseurl and joins its base path (net_loc plus the full path up to, but not including, a file at the end) with newurl. For example:

        >>> urlparse.urljoin('http://www.python.org/doc/FAQ.html', \
        ...     'current/lib/lib.html')
        'http://www.python.org/doc/current/lib/lib.html'


A summary of the functions in urlparse can be found in Table 20.3.

Table 20.3. Core urlparse Module Functions

urlparse(urlstr, defProtSch=None, allowFrag=None)
    Parses urlstr into separate components, using defProtSch if the protocol or scheme is not given in urlstr; allowFrag determines whether a URL fragment is allowed

urlunparse(urltup)
    Unparses a tuple of URL data (urltup) into a single URL string

urljoin(baseurl, newurl, allowFrag=None)
    Merges the base part of the baseurl URL with newurl to form a complete URL; allowFrag is the same as for urlparse()


20.2.3. urllib Module

Core Module: urllib

Unless you are planning on writing a lower-level network client, the urllib module provides all the functionality you need. urllib provides a high-level Web communication library, supporting the basic Web protocols, HTTP, FTP, and Gopher, as well as providing access to local files. Specifically, the functions of the urllib module are designed to download data (from the Internet, local network, or local host) using the aforementioned protocols. Use of this module generally obviates the need for using the httplib, ftplib, and gopherlib modules unless you desire their lower-level functionality. In those cases, such modules can be considered as alternatives. (Note: Most modules named *lib are generally for developing clients of the corresponding protocols. This is not always the case, however, as perhaps urllib should then be renamed "internetlib" or something similar!)


The urllib module provides functions to download data from given URLs as well as encoding and decoding strings to make them suitable for inclusion as part of valid URL strings. The functions we will be looking at in this upcoming section include: urlopen(), urlretrieve(), quote(), unquote(), quote_plus(), unquote_plus(), and urlencode(). We will also look at some of the methods available to the file-like object returned by urlopen(). They will be familiar to you because you have already learned to work with files back in Chapter 9.

urllib.urlopen()

urlopen() opens a Web connection to the given URL string and returns a file-like object. It has the following syntax:

        urlopen(urlstr, postQueryData=None)


urlopen() opens the URL pointed to by urlstr. If no protocol or download scheme is given, or if a "file" scheme is passed in, urlopen() will open a local file.

For all HTTP requests, the normal request type is "GET." In these cases, the query string provided to the Web server (key-value pairs encoded or "quoted," such as the string output of the urlencode() function [see below]), should be given as part of urlstr.

If the "POST" request method is desired, then the query string (again encoded) should be placed in the postQueryData variable. (For more information regarding the GET and POST request methods, refer to any general documentation or texts on programming CGI applications, which we will also discuss below.) GET and POST requests are the two ways to "upload" data to a Web server.
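
As a rough sketch of both request styles (the CGI script address below is hypothetical, and urlencode() is covered later in this section), the same encoded query data is either tacked onto urlstr for a GET or handed over as postQueryData for a POST:

        import urllib

        base = 'http://localhost/cgi-bin/friends.py'    # hypothetical CGI script
        qs = urllib.urlencode({'person': 'joe mama', 'howmany': 10})

        # GET: the (quoted) query string rides along inside urlstr itself
        f = urllib.urlopen('%s?%s' % (base, qs))

        # POST: the same encoded data is passed as postQueryData instead
        f = urllib.urlopen(base, qs)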

When a successful connection is made, urlopen() returns a file-like object as if the destination was a file opened in read mode. If our file object is f, for example, then our "handle" would support the expected read methods such as f.read(), f.readline(), f.readlines(), f.close(), and f.fileno().

In addition, an f.info() method is available that returns the MIME (Multipurpose Internet Mail Extension) headers. Such headers give the browser information regarding which application can view returned file types. For example, the browser itself can view HTML (HyperText Markup Language) and plain text files, and render PNG (Portable Network Graphics), JPEG (Joint Photographic Experts Group), or the old GIF (Graphics Interchange Format) graphics files. Other files, such as multimedia or specific document types, require external applications in order to be viewed.

Finally, a geturl() method exists to obtain the true URL of the final opened destination, taking into consideration any redirection that may have occurred. A summary of these file-like object methods is given in Table 20.4.

Table 20.4. urllib.urlopen() File-like Object Methods

urlopen() Object Methods     Description
f.read([bytes])              Reads all or bytes bytes from f
f.readline()                 Reads a single line from f
f.readlines()                Reads all lines from f into a list
f.close()                    Closes URL connection for f
f.fileno()                   Returns file number of f
f.info()                     Gets MIME headers of f
f.geturl()                   Returns true URL opened for f
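
Here is a minimal sketch exercising a few of these methods; the exact HTML and headers returned depend on the server, so no output is shown:

        import urllib

        f = urllib.urlopen('http://www.python.org/')
        print f.readline()          # first line of the returned HTML
        print f.info().gettype()    # content type from the MIME headers, e.g. 'text/html'
        print f.geturl()            # true URL opened, after any redirection
        f.close()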


If you expect to be accessing more complex URLs or want to be able to handle more complex situations such as basic and digest authentication, redirections, cookies, etc., then we suggest using the urllib2 module, introduced back in the 1.6 days (mostly as an experimental module). It, too, has a urlopen() function, but it also provides other functions and classes for opening a variety of URLs. For more on urllib2, see the next section of this chapter.

urllib.urlretrieve()

urlretrieve() will do some quick and dirty work for you if you are interested in working with a URL document as a whole. Here is the syntax for urlretrieve():

        urlretrieve(urlstr, localfile=None, downloadStatusHook=None)


Rather than reading from the URL like urlopen() does, urlretrieve() will simply download the entire HTML file located at urlstr to your local disk. It will store the downloaded data into localfile if given or a temporary file if not. If the file has already been copied from the Internet or if the file is local, no subsequent downloading will occur.

The downloadStatusHook, if provided, is a function that is called after each block of data has been downloaded and delivered. It is called with the following three arguments: number of blocks read so far, the block size in bytes, and the total (byte) size of the file. This is very useful if you are presenting "download status" information to the user in a text-based or graphical display.

urlretrieve() returns a 2-tuple, (filename, mime_hdrs). filename is the name of the local file containing the downloaded data. mime_hdrs is the set of MIME headers returned by the responding Web server. For more information, see the Message class of the mimetools module. mime_hdrs is None for local files.
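
The following sketch pulls these pieces together; the hook simply prints its three arguments each time it is called, and the local filename is arbitrary. (The total size may be reported as -1 if the server does not send a content length.)

        import urllib

        def downloadStatusHook(blocks, blockSize, totalSize):
            # called as each block of data is transferred
            print '%d blocks of %d bytes transferred (total: %d bytes)' % (
                blocks, blockSize, totalSize)

        fname, mime_hdrs = urllib.urlretrieve('http://www.python.org/',
            'python-home.html', downloadStatusHook)
        print 'saved to:', fname
        print 'content type:', mime_hdrs.gettype()   # mime_hdrs is None for local files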

For a simple example using urlretrieve(), take a look at Example 11.4 (grabweb.py). A larger piece of code using urlretrieve() can be found later in this chapter in Example 20.2.

Example 20.2. Advanced Web Client: a Web Crawler (crawl.py)

The crawler consists of two classes, one to manage the entire crawling process (Crawler), and one to retrieve and parse each downloaded Web page (Retriever).

#!/usr/bin/env python

from sys import argv
from os import makedirs, unlink, sep
from os.path import dirname, exists, isdir, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO

class Retriever(object):                # download Web pages

    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url, deffile='index.htm'):
        parsedurl = urlparse(url, 'http:', 0)   # parse path
        path = parsedurl[1] + parsedurl[2]
        ext = splitext(path)
        if ext[1] == '':                # no file, use default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)            # local directory
        if sep != '/':                  # os-indep. path separator
            ldir = replace(ldir, '/', sep)
        if not isdir(ldir):             # create archive dir if nec.
            if exists(ldir): unlink(ldir)
            makedirs(ldir)
        return path

    def download(self):                 # download Web page
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % \
                self.url,)
        return retval

    def parseAndGetLinks(self):         # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter( \
            DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

class Crawler(object):                  # manage entire crawling process

    count = 0                           # static downloaded page counter

    def __init__(self, url):
        self.q = [url]
        self.seen = []
        self.dom = urlparse(url)[1]

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0][0] == '*':         # error situation, do not parse
            print retval, '... skipping parse'
            return
        Crawler.count += 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)

        links = r.parseAndGetLinks()    # get and process links
        for eachLink in links:
            if eachLink[:4] != 'http' and \
                    find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)
            print '* ', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):                       # process links in queue
        while self.q:
            url = self.q.pop()
            self.getPage(url)

def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')
        except (KeyboardInterrupt, EOFError):
            url = ''

    if not url: return
    robot = Crawler(url)
    robot.go()

if __name__ == '__main__':
    main()

urllib.quote() and urllib.quote_plus()

The quote*() functions take URL data and "encode" it so that it is "fit" for inclusion as part of a URL string. In particular, certain special characters that are unprintable or cannot be part of valid URLs acceptable to a Web server must be converted. This is what the quote*() functions do for you. Both quote*() functions have the following syntax:

        quote(urldata, safe='/')


Characters that are never converted include commas, underscores, periods, and dashes, as well as alphanumerics. All others are subject to conversion. In particular, the disallowed characters are changed to their hexadecimal ordinal equivalents prepended with a percent sign (%), i.e., "%xx" where "xx" is the hexadecimal representation of a character's ASCII value. When calling quote*(), the urldata string is converted to an equivalent string that can be part of a URL string. The safe string should contain a set of characters which should also not be converted. The default is the slash ( / ).

quote_plus() is similar to quote() except that it also encodes spaces to plus signs ( + ). Here is an example using quote() vs. quote_plus():

        >>> name = 'joe mama'
        >>> number = 6
        >>> base = 'http://www/~foo/cgi-bin/s.py'
        >>> final = '%s?name=%s&num=%d' % (base, name, number)
        >>> final
        'http://www/~foo/cgi-bin/s.py?name=joe mama&num=6'
        >>>
        >>> urllib.quote(final)
        'http%3a//www/%7efoo/cgi-bin/s.py%3fname%3djoe%20mama%26num%3d6'
        >>>
        >>> urllib.quote_plus(final)
        'http%3a//www/%7efoo/cgi-bin/s.py%3fname%3djoe+mama%26num%3d6'


urllib.unquote() and urllib.unquote_plus()

As you have probably guessed, the unquote*() functions do the exact opposite of the quote*() functions: they convert all characters encoded in the "%xx" fashion to their ASCII equivalents. The syntax of unquote*() is as follows:

        unquote*(urldata)


Calling unquote() will decode all URL-encoded characters in urldata and return the resulting string. unquote_plus() will also convert plus signs back to space characters.
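
For example, undoing the encodings shown above (a brief sketch):

        >>> import urllib
        >>> urllib.unquote('joe%20mama')
        'joe mama'
        >>> urllib.unquote_plus('name%3djoe+mama%26num%3d6')
        'name=joe mama&num=6'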

urllib.urlencode()

urlencode(), added to Python back in 1.5.2, takes a dictionary of key-value pairs and encodes them to be included as part of a query in a CGI request URL string. The pairs are in "key=value" format and are delimited by ampersands ( & ). Furthermore, the keys and their values are sent to quote_plus() for proper encoding. Here is an example output from urlencode():

        >>> aDict = { 'name': 'Georgina Garcia', 'hmdir': '~ggarcia' }
        >>> urllib.urlencode(aDict)
        'name=Georgina+Garcia&hmdir=%7eggarcia'


There are other functions in urllib and urlparse which we did not have the opportunity to cover here. Refer to the documentation for more information.

Secure Socket Layer support

The urllib module was given support for opening HTTP connections using the Secure Socket Layer (SSL) in 1.6. The core change to add SSL is implemented in the socket module. Consequently, the urllib and httplib modules were updated to support URLs using the "https" connection scheme. In addition to those two modules, other protocol client modules with SSL support include: imaplib, poplib, and smtplib.
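
Provided your Python interpreter was built with SSL support (and the server accepts HTTPS connections), such URLs can be opened exactly like their "http" counterparts; a quick sketch:

        import urllib

        f = urllib.urlopen('https://www.python.org/')   # note the "https" scheme
        print f.geturl()
        f.close()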

A summary of the urllib functions discussed in this section can be found in Table 20.5.

Table 20.5. Core urllib Module Functions

urlopen(urlstr, postQueryData=None)
    Opens the URL urlstr, sending the query data in postQueryData if a POST request

urlretrieve(urlstr, localfile=None, downloadStatusHook=None)
    Downloads the file located at the urlstr URL to localfile or a temporary file if localfile not given; if present, downloadStatusHook is a function that can receive download statistics

quote(urldata, safe='/')
    Encodes invalid URL characters of urldata; characters in safe string are not encoded

quote_plus(urldata, safe='/')
    Same as quote() except encodes spaces as plus ( + ) signs (rather than as %20)

unquote(urldata)
    Decodes encoded characters of urldata

unquote_plus(urldata)
    Same as unquote() but converts plus signs to spaces

urlencode(dict)
    Encodes the key-value pairs of dict into a valid string for CGI queries and encodes the key and value strings with quote_plus()


20.2.4. urllib2 Module

As mentioned in the previous section, urllib2 can handle more complex URL opening. One example is for Web sites with basic authentication (login and password) requirements. The most straightforward solution to "getting past security" is to use the extended net_loc URL component as described earlier in this chapter, i.e., http://user:passwd@www.python.org. The problem with this solution is that it is not programmatic. Using urllib2, however, we can tackle this problem in two different ways.

We can create a basic authentication handler (urllib2.HTTPBasicAuthHandler) and "register" a login and password for a given base URL and perhaps a realm, meaning a string defining the secure area of the Web site. (For more on realms, see RFC 2617 [HTTP Authentication: Basic and Digest Access Authentication].) Once this is done, you can "install" a URL-opener with this handler so that all URLs opened will use it.

The other alternative is to simulate typing the username and password when prompted by a browser, that is, to send an HTTP client request with the appropriate authorization headers. In Example 20.1 we can easily identify each of these two methods.

Example 20.1. HTTP Auth Client (urlopen-auth.py)

This script uses both techniques described above for basic authentication.

1     #!/usr/bin/env python
2
3     import urllib2
4
5     LOGIN = 'wesc'
6     PASSWD = "you'llNeverGuess"
7     URL = 'http://localhost'
8
9     def handler_version(url):
10        from urlparse import urlparse as up
11        hdlr = urllib2.HTTPBasicAuthHandler()
12        hdlr.add_password('Archives', up(url)[1], LOGIN, PASSWD)
13        opener = urllib2.build_opener(hdlr)
14        urllib2.install_opener(opener)
15        return url
16
17    def request_version(url):
18        from base64 import encodestring
19        req = urllib2.Request(url)
20        b64str = encodestring('%s:%s' % (LOGIN, PASSWD))[:-1]
21        req.add_header("Authorization", "Basic %s" % b64str)
22        return req
23
24    for funcType in ('handler', 'request'):
25        print '*** Using %s:' % funcType.upper()
26        url = eval('%s_version' % funcType)(URL)
27        f = urllib2.urlopen(url)
28        print f.readline()
29        f.close()

Line-by-Line Explanation
Lines 1-7

The usual setup plus some constants for the rest of the script to use.

Lines 9-15

The "handler" version of the code allocates a basic handler class as described earlier, then adds the authentication information. The handler is then used to create a URL-opener that is then installed so that all URLs opened will use the given authentication. This code was adapted from the official Python documentation for the urllib2 module.

Lines 17-22

The "request" version of our code just builds a Request object and adds the simple base64-encoded authentication header into our HTTP request. This request is then used to substitute the URL string when calling urlopen() upon returning back to "main." Note that the original URL was built into the Request object, hence the reason why it was not a problem to replace it in the subsequent call to urllib2.urlopen(). This code was inspired by Mike Foord's and Lee Harr's recipes in the Python Cookbook located at:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305288

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/267197

It would have been great to have been able to use Harr's HTTPRealmFinder class so that we would not need to hardcode the realm in our example.

Lines 24-29

The rest of this script just opens the given URL using both techniques and displays the first line (dumping the others) of the resulting HTML page returned by the server once authentication has been validated. Note that an HTTP error (and no HTML) would be returned if the authentication information is invalid.

The output should look something like this:

        $ python urlopen-auth.py
        *** Using HANDLER:
        <html>

        *** Using REQUEST:
        <html>


In addition to the official Python documentation for urllib2, you may find this companion piece useful: http://www.voidspace.org.uk/python/articles/urllib2.shtml.


