Recipe 14.7. Handling Cookies While Fetching Web Pages

Credit: Mike Foord, Nikos Kouremenos

Problem

You need to fetch web pages (or other resources from the web) that require you to handle cookies (e.g., save cookies you receive and also reload and send cookies you had previously received from the same site).

Solution

The Python 2.4 Standard Library provides a cookielib module exactly for this task. For Python 2.3, a third-party ClientCookie module works similarly. We can write our code to use the best available cookie-handling module, including none at all, in which case our program will still run but without saving and resending cookies. (In some cases, this might still be OK, just maybe slower.) Here is a script to show how this concept works in practice:

import os.path, urllib2
from urllib2 import urlopen, Request
COOKIEFILE = 'cookies.lwp'   # "cookiejar" file for cookie saving/reloading
# first try getting the best possible solution, cookielib:
try:
    import cookielib
except ImportError:                 # no cookielib, try ClientCookie instead
    cookielib = None
    try:
        import ClientCookie
    except ImportError:             # nope, no cookies today
        cj = None                   # so, in particular, no cookie jar
    else:                           # using ClientCookie, prepare everything
        urlopen = ClientCookie.urlopen
        cj = ClientCookie.LWPCookieJar( )
        Request = ClientCookie.Request
else:                               # we do have cookielib, prepare the jar
    cj = cookielib.LWPCookieJar( )
# Now load the cookies, if any, and build+install an opener using them
if cj is not None:
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
    if cookielib:
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
    else:
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)
# for example, try a URL that sets a cookie
theurl = 'http://www.diy.co.uk'
txdata = None  # or, for POST instead of GET, txdata=urllib.urlencode(somedict)
txheaders =  {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
try:
    req = Request(theurl, txdata, txheaders)  # create a request object
    handle = urlopen(req)                     # and open it
except IOError, e:
    print 'Failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'Error code: %s.' % e.code
else:
    print 'Here are the headers of the page:'
    print handle.info( )
# you can also use handle.read( ) to get the page, handle.geturl( ) to get
# the true URL (could be different from `theurl' if there have been redirects)
if cj is None:
    print "Sorry, no cookie jar, can't show you any cookies today"
else:
    print 'Here are the cookies received so far:'
    for index, cookie in enumerate(cj):
        print index, ': ', cookie
    cj.save(COOKIEFILE)                     # save the cookies again

Discussion

The third-party module ClientCookie, available for download at http://wwwsearch.sourceforge.net/ClientCookie/, was so successful that, in Python 2.4, its functionality has been added to the Python Standard Library: specifically, the cookie-handling parts went into the new module cookielib, and the rest into the current version of urllib2.

So, you do need to be careful if you want your code to work just as well on any 2.4 installation (using the latest and greatest cookielib) or on an installation of Python 2.3 with ClientCookie on top. As long as we're at it, we might as well handle running on a 2.3 installation that does not have ClientCookie: run anyway, just don't save and resend cookies when we lack library code to do so. On some sites, the inability to handle cookies will just be a bother and perhaps a performance hit due to the loss of session continuity, but the site will still work. Other sites, of course, will be completely unusable without cookies.

The recipe's code is an exercise in the careful management of an idiom that's an essential part of making your Python code portable among releases and installations, while ensuring minimal graceful degradation when third-party modules you'd like to use just aren't there. The idiom is known as conditional import and is expressed as follows:

try:
    import something
except ImportError:            # 'something' not available
    ...code to do without, degrading gracefully...
else:                          # 'something' IS available, hooray!
    ...code to run only when something is there...
# and then, go on with the rest of your program
...code able to run with or w/o `something'...
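As one concrete, hedged instance of the idiom, the sketch below wraps the try/except ImportError pattern in a small helper that walks a list of candidate module names and returns the first one that imports. The helper name first_available and the module name no_such_module_xyz are hypothetical, made up for illustration. The importlib module and the http.cookiejar name (the Python 3 spelling of cookielib) are later standard-library additions that the recipe itself does not use.

```python
import importlib

def first_available(names):
    """Return the first module in `names` that can be imported, else None."""
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:          # this candidate is not installed, try the next
            continue
        else:                        # import succeeded, so stop looking
            return module
    return None                      # nothing available: caller must degrade gracefully

# Prefer the Python 3 module name, then the Python 2.4 name used in the recipe.
cookie_module = first_available(['http.cookiejar', 'cookielib'])
if cookie_module is None:
    print('no cookie support; continuing without cookies')
else:
    print('using ' + cookie_module.__name__)
```

A helper like this only fits when the candidates really are interchangeable. When they are not, as with ClientCookie and cookielib, the explicit nested try/except of the Solution is easier to follow.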

The use of "conditional import" is particularly delicate in this recipe because ClientCookie and cookielib aren't drop-in replacements for each other; therefore, careful management is indeed necessary. But, if you study this recipe, you will see that it is not rocket science; it just requires attention.

One key technique is to make double use of a small number of names as "flags", with value None when the object to which they would normally refer is not available. In this recipe, we do that for cookielib (which refers to the module of that name when there is one, and otherwise to None) and cj (which refers to a cookie-jar object when there is any, and otherwise to None). Even better, when feasible, is to assign names appropriately to refer to the best available object under the circumstances: the recipe does that for variables urlopen and Request. Note how crucial it is for this purpose that Python treats all objects as first class: urlopen is a function, Request is a class, cookielib (if any) a module, cj (if any) an instance object. The distinction, however, doesn't matter in the least: the name-object reference concept is exactly the same in every case, with total uniformity, simplicity, and power.
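The following minimal sketch illustrates the point about names and first-class objects. It is not the recipe's code. It assumes a modern interpreter, where cookielib is spelled http.cookiejar and urllib2 is spelled urllib.request, and falls back to the Python 2 names otherwise. The URL and the User-agent value are placeholders, and no network request is made.

```python
# Module name: refers to the module when present, otherwise None.
try:
    import http.cookiejar as cookielib      # Python 3 spelling (assumption)
except ImportError:
    try:
        import cookielib                    # Python 2.4+ spelling, as in the recipe
    except ImportError:
        cookielib = None

# Class and function names: bound to the best available objects.
try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2

# Instance name: a cookie jar when possible, otherwise None.
cj = None
if cookielib is not None:
    cj = cookielib.LWPCookieJar()

# The same "is not None" test works whatever kind of object a name refers to.
for label, obj in [('cookielib', cookielib), ('Request', Request),
                   ('urlopen', urlopen), ('cj', cj)]:
    kind = type(obj).__name__ if obj is not None else 'unavailable'
    print('%-9s -> %s' % (label, kind))

# Building a request object needs no network access; nothing is fetched here.
req = Request('http://www.example.com/', None, {'User-agent': 'demo-agent'})
```

Code further down can then use Request, urlopen, and cj without caring which import branch supplied them, and needs only the cj is not None check before touching the jar.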

When either cookielib or ClientCookie is available, the cookies are saved in a file in cookie jar format (a useful plain-text format that is automatically handled by either module but can also be examined and modified with text editors and other programs). If the file already exists when the program runs, cookies are loaded from the file, ready to be sent back to the appropriate sites.
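The save-and-reload cycle can be tried without any network access. The sketch below is an illustration under stated assumptions: it uses the Python 3 spelling http.cookiejar with a fallback to cookielib, builds one cookie by hand (the name, value, and domain are invented), and writes to a temporary directory rather than to the recipe's cookies.lwp.

```python
import os
import tempfile
import time

try:
    from http.cookiejar import Cookie, LWPCookieJar   # Python 3 spelling (assumption)
except ImportError:
    from cookielib import Cookie, LWPCookieJar        # Python 2.4+ spelling

# A hand-built cookie standing in for one that a server would normally set.
cookie = Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='www.example.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False,
    expires=int(time.time()) + 3600,   # one hour ahead, so it is not treated as expired
    discard=False,                     # session-only cookies are skipped by save() by default
    comment=None, comment_url=None, rest={},
)

cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.lwp')

# First "run": put the cookie in a jar and write the jar to disk as plain text.
jar = LWPCookieJar()
jar.set_cookie(cookie)
jar.save(cookie_file)

# Second "run": start with an empty jar and reload whatever was saved.
reloaded = LWPCookieJar()
if os.path.isfile(cookie_file):
    reloaded.load(cookie_file)

for c in reloaded:
    print('%s=%s for %s' % (c.name, c.value, c.domain))
```

Opening the saved file in a text editor shows one Set-Cookie3 line per cookie, which is what makes the format easy to inspect and modify by hand.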

My reason for developing this code is that I'm developing a cgi-proxy, approx.py (http://www.voidspace.org.uk/atlantibots/pythonutils.html#cgiproxy), which needs to be able to handle cookies when feasible. To keep the proxy usable on various versions of Python, and ensure it degrades gracefully when no cookie-handling library is available, I needed to develop the carefully managed conditional imports that are shown in the recipe's Solution. I decided to share them in this recipe since, besides the importance of cookie handling, conditional imports are such a generally important Python idiom. Particularly when installing your code on a server you don't control, it is unfortunately quite common to have little say in which version of Python is running, or in which third-party extensions are installed. That is exactly the kind of situation that requires the conditional import technique to ensure your code does the best it can under the circumstances.

See Also

Documentation on the cookielib and urllib2 standard library modules in the Library Reference for Python 2.4; ClientCookie is at http://wwwsearch.sourceforge.net/ClientCookie/.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
