Hack 38 Archiving Your Favorite Webcams

[Difficulty: moderate]

Got a number of scenic or strategically placed webcams you watch daily? Or would you like to make sure that your coworkers are actually doing the work you've assigned them? Keep on top of your pictorial problems with Python.

Keeping track of a large number of active webcams is a thankless task: half the time the images haven't changed, and the rest of the time it takes just as long to refresh them all, or to wait for them to refresh, as it does to look at and mentally process the images themselves.

This hack alleviates your grief by automatically downloading images from webcams every 15 seconds, but only if they've been updated, so that we don't waste bandwidth. It's also the only Python script in the entire book and, as such, earns special recognition.

To tell the program which URLs to download, we put them in a file, one per line. The program looks for this list in URIs.txt by default, but that can be changed both in the source and on the command line.
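For example, a URIs.txt watching the two (made-up) cameras from the sample run later in this hack would contain:

http://example.org/webcams/someplace.jpg
http://example.org/webcams/phenomic.jpg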

The program puts each picture in its own file, after producing an index file (which defaults to webcams.html) so that we can quickly and easily browse all of the downloaded images in one go.
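Each image file is named after the hex-encoded form of its URI, so you can tell at a glance what will land on disk. Here's a quick interactive session, using one of the sample URLs from later in this hack:

>>> from urllib import quote
>>> quote('http://example.org/webcams/someplace.jpg', safe='')
'http%3A%2F%2Fexample.org%2Fwebcams%2Fsomeplace.jpg'

The matching line that the script writes to webcams.html doubles each % to %25, so that the browser decodes the src attribute back to exactly that filename:

<p><img src="http%253A%252F%252Fexample.org%252Fwebcams%252Fsomeplace.jpg" alt=" " /><br />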

The Code

Save the following code as getcams.py:

#!/usr/bin/python
"""
getcams.py - Archiving Your Favorite Web Cams
Sean B. Palmer, <http://purl.org/net/sbp/>, 2003-07.
License: GPL 2; share and enjoy!
Usage:
    python getcams.py [ <filename> ]
<filename> defaults to URIs.txt
"""
import urllib2, time
from urllib import quote
from email.Utils import parsedate

# # # # # # # # # # # # # # # # #
# Configurable stuff
#

# download how often, in seconds
seconds = 15

# what file we should write to
index = 'webcams.html'

# End of configurable stuff!
# # # # # # # # # # # # # # # # #

def quoteURI(uri):
    # Turn a URI into a filename.
    return quote(uri, safe='')

def makeHTML(uris):
    # Create an HTML index so that we
    # can look at the archived piccies.
    print "Creating a webcam index at", index
    f = open(index, 'w')
    print >> f, '<html xmlns="http://www.w3.org/1999/xhtml">'
    print >> f, '<head><title>My Webcams</title></head>'
    print >> f, '<body>'
    for uri in uris:
        # We use the URI of the image for the filename, but we have
        # to hex encode it first so that our operating systems are
        # happy with it. Doubling each % to %25 makes the browser
        # decode the src attribute back to that filename.
        link = quoteURI(uri).replace('%', '%25')
        # Now we make the image, and provide a link to the original.
        print >> f, '<p><img src="%s" alt=" " /><br />' % link
        print >> f, '-<a href="%s">%s</a></p>' % (uri, uri)
    print >> f, '</body>'
    print >> f, '</html>'
    f.close()
    print "Done creating the index!\n"

metadata = {}

def getURI(uri):
    print "Trying", uri
    # Try to open the URI--we're not downloading it yet.
    try: u = urllib2.urlopen(uri)
    except Exception, e: print "   ...failed:", e
    else:
        # Get some information about the URI; we do this
        # to find out whether it's been updated yet.
        info = u.info()
        meta = (info.get('last-modified'), info.get('content-length'))
        print "   ...got metadata:", meta
        if metadata.get(uri) == meta:
            print "   ...not downloading: no update yet"
        else:
            # The image has been updated, so let's download it.
            metadata[uri] = meta
            print "   ...downloading; type: %s; size: %s" % \
                (info.get('content-type', '?'), info.get('content-length', '?'))
            data = u.read()
            open(quoteURI(uri), 'wb').write(data)
            print "   ...done! %s bytes" % len(data)
            # Save an archived version for later; if the server sent
            # no Last-Modified header, stamp it with the current time.
            t = parsedate(info.get('last-modified', '')) or time.gmtime()
            archv = quoteURI(uri) + '-' + \
                time.strftime('%Y%m%dT%H%M%S', t) + '.jpg'
            open(archv, 'wb').write(data)
        u.close()

def doRun(uris):
    for uri in uris:
        startTime = time.time()
        getURI(uri)
        finishTime = time.time()
        timeTaken = finishTime - startTime
        print "This URI took", timeTaken, "seconds\n"
        timeLeft = seconds - timeTaken # time until the next run
        if timeLeft > 0: time.sleep(timeLeft)

def main(argv):
    # We need a list of URIs to download. We require them to be
    # in a file; the next line defaults the filename to URIs.txt
    # if it can't gather one from the command line.
    fn = (argv + [None])[0] or 'URIs.txt'
    data = open(fn).read()
    uris = data.splitlines()
    # Now make an index, and then
    # continuously download the piccies.
    makeHTML(uris)
    while 1: doRun(uris)

if __name__ == "__main__":
    import sys
    # If the user asks for help, give it to them!
    # Otherwise, just run the program as usual.
    if sys.argv[1:] in (['--help'], ['-h'], ['-?']):
        print __doc__
    else: main(sys.argv[1:])

Running the Hack

Here's a typical run, invoked from the command line:

% python getcams.py
Creating a webcam index at webcams.html
Done creating the index!

Trying http://example.org/webcams/someplace.jpg
   ...got metadata: ('Thu, 10 Jul 2003 15:50:38 GMT', None)
   ...downloading; type: image/jpeg; size: ?
   ...done! 32594 bytes
This URI took 8.2480000257 seconds

Trying http://example.org/webcams/phenomic.jpg
   ...got metadata: ('Thu, 10 Jul 2003 11:35:51 GMT', None)
   ...not downloading: no update yet
This URI took 1.30099999905 seconds
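To read the camera list from somewhere other than URIs.txt, pass a filename on the command line (mycams.txt here is just a stand-in for whatever file holds your list):

% python getcams.py mycams.txt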

The code, complicated though it looks, consists of only a few stages:

  1. Read the list of webcam URLs from the file.

  2. Create an HTML index so that we can view the downloaded webcam images.

  3. For each URL in our list, check whether the image has been updated. If it has, download it. If the whole step took under 15 seconds, sleep for the remainder of the interval, in an attempt to respect the server resources of others.

Hacking the Hack

The code has a number of limitations:

  • We have to know the URL of each picture we want to download, so if we don't know it, or if it changes a lot, we have a problem. But really, the biggest problem here is that digging out the actual URL of each picture is a bit of an inconvenience.

  • If a web site goes down, the script hangs. We could get around this problem by using Python's asyncore module, but that would add quite a bit of complexity; a simpler timeout-based fix is sketched after this list.

  • People have been known to fake Last-Modified HTTP headers, so the metadata we use to decide whether a picture has been updated isn't absolutely reliable. However, most Last-Modified headers are faked to force people to use fresh rather than cached versions, so if they're that passionate about it, we may as well let them. (A more server-friendly variant, using a conditional GET, is sketched at the end of this section.)

  • If you have any files in your directory that have the same names as the quoted versions of the URLs you're trying to download, the program will overwrite them.
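As promised, here's a lighter-weight cure for the hanging problem than asyncore: assuming Python 2.3 or later, set a global socket timeout before the downloads start. A stalled connection then raises an exception, which getURI's except clause already reports as a failure. The 30-second value is an arbitrary choice; tune it to taste:

import socket

# Give up on any single connection after 30 seconds. This affects
# urllib2.urlopen and everything else using sockets in this process.
socket.setdefaulttimeout(30)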

Other than these limitations, the code is safe.
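You can also push the freshness check onto the server itself with a conditional GET: echo the Last-Modified value from the previous fetch back in an If-Modified-Since request header, and a well-behaved server answers 304 Not Modified instead of resending the image. This is a minimal sketch rather than a drop-in patch; lastmod is a hypothetical stand-in for the script's metadata cache:

import urllib2

lastmod = {}  # uri -> Last-Modified value seen on the previous fetch

def fetchIfModified(uri):
    # Build a request that asks the server to send the image
    # only if it has changed since we last saw it.
    req = urllib2.Request(uri)
    if uri in lastmod:
        req.add_header('If-Modified-Since', lastmod[uri])
    try:
        u = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        if e.code == 304:
            print "   ...not downloading: server says no update yet"
            return None
        raise
    lastmod[uri] = u.info().get('last-modified')
    return u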

Sean B. Palmer


