Retrieving Images from HTML Documents

import HTMLParser
import urllib

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name,value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

u = urllib.urlopen(urlString)
lParser.feed(u.read())

A common task when processing HTML documents is to pull all the images out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a subclass of HTMLParser that overrides the handle_starttag() method to find the img tags and save the file pointed to by each src attribute value.
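As a minimal sketch of the override described above, the following version collects the src values instead of downloading them. It assumes Python 3, where the module is named html.parser; the class name ImageSrcCollector and its sources attribute are illustrative and do not appear in the original listing.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this,
        # so <IMG SRC=...> is reported as tag 'img' with attribute 'src'.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


collector = ImageSrcCollector()
collector.feed('<p><img src="logo.gif" alt="Logo"><IMG SRC="a/b.png"></p>')
collector.close()
print(collector.sources)
```

Collecting the values first keeps the parsing step separate from the download step, which makes the handler easy to try out without network access.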

Once the new HTMLParser class has been defined, create an instance of it. Then open the HTML document using urllib.urlopen(url) and read its contents.
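The open-and-read step can be tried offline with a sketch like the one below. It assumes Python 3, where urlopen lives in urllib.request and read() returns bytes that must be decoded before parsing. The temporary page.html file and the UTF-8 encoding are assumptions made for illustration; a real page may declare a different encoding.

```python
import pathlib
import tempfile
import urllib.request

# Write a small HTML page to a temporary file so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp_dir:
    page = pathlib.Path(tmp_dir) / "page.html"
    page.write_text('<html><body><img src="logo.gif"></body></html>',
                    encoding="utf-8")

    # urlopen accepts file: URLs as well as http: and https: URLs.
    with urllib.request.urlopen(page.as_uri()) as response:
        raw = response.read()  # bytes in Python 3

# Decode to str before handing the text to an HTML parser.
html_text = raw.decode("utf-8")
print(html_text)
```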

To parse the HTML contents and save the images they reference, pass the data to the parser object's feed(data) method. feed() parses the data and calls the handler methods defined on the class, here handle_starttag(), for each element it encounters.
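To illustrate how feed() drives the handlers, here is a small sketch that passes a document to the parser in several pieces. It assumes Python 3 (html.parser); the TagLogger class and the sample chunks are hypothetical. Incomplete markup is buffered until more data arrives or close() is called, so a tag split across two calls is still reported once.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag and its attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


parser = TagLogger()

# The img tag is deliberately split across the chunks.
chunks = ['<html><body><im', 'g src="python-lo', 'go.gif"></body></html>']
for chunk in chunks:
    parser.feed(chunk)
parser.close()  # process any data still held in the buffer

for tag, attributes in parser.seen:
    print(tag, attributes)
```

Feeding in chunks is useful when a document is read from a socket or a large file in blocks rather than in a single read() call.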

import HTMLParser
import urllib
import sys

urlString = "http://www.python.org"

#Save image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

#Define HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name,value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

#Create instance of HTML parser
lParser = parseImages()

#Open the HTML file
u = urllib.urlopen(urlString)
print "Opening URL\n===================="
print u.info()

#Feed HTML file into parser
lParser.feed(u.read())
lParser.close()

html_images.py

Opening URL
====================
Date: Wed, 19 Jul 2006 18:47:27 GMT
Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Wed, 19 Jul 2006 16:08:34 GMT
ETag: "601f6-351c-79a6c480"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html
Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving nasa.jpg

Output from html_images.py code
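The listing builds each image address with urlString + "/" + value, which only works when src is a path relative to the site root, and the output above saves trans.gif twice, which suggests the page references that image more than once. A possible refinement, sketched here for Python 3, resolves each src with urljoin and skips addresses already seen. The src values in the list are made up for illustration; the original output shows only file names, not the full attribute values.

```python
from urllib.parse import urljoin

base_url = "http://www.python.org"

# Hypothetical src values covering relative, root-relative, repeated,
# and fully qualified addresses.
src_values = [
    "images/python-logo.gif",
    "/images/trans.gif",
    "/images/trans.gif",
    "http://example.com/nasa.jpg",
]

seen = set()
image_urls = []
for src in src_values:
    # urljoin resolves relative paths and leaves absolute URLs unchanged.
    absolute = urljoin(base_url + "/", src)
    if absolute not in seen:
        seen.add(absolute)
        image_urls.append(absolute)

for url in image_urls:
    print(url)
```

Each address in image_urls could then be passed to a download function such as getImage() in place of the concatenated string.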