Retrieving Links from HTML Documents


import HTMLParser
import urllib

class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

lParser = parseLinks()
lParser.feed(urllib.urlopen(
    "http://www.python.org/index.html").read())

Python's standard library includes the HTMLParser module, which provides simple, efficient parsing of HTML documents: the parser calls handler methods as it encounters each tag in the document. The listings in this section use the Python 2 module names; in Python 3 the module is named html.parser.

A common task when processing HTML documents is to pull all the links out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser subclass that overrides the handle_starttag() method to print the href attribute value of every a tag.

Once the new HTMLParser subclass has been defined, create an instance of it to get a parser object. Then open the HTML document using urllib.urlopen(url) and read its contents.
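For readers on Python 3, the fetch step looks slightly different. The following is a minimal sketch, not the book's code: it assumes Python 3, where urlopen lives in urllib.request and read() returns bytes that should be decoded before parsing. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download a document and return its contents as text.

    Hypothetical helper for this sketch; not part of any library.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset declared in the Content-Type header.
        # Falling back to UTF-8 is an assumption, not a guarantee.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")
```

A call such as fetch_html("http://www.python.org/index.html") would return the page as a string ready to be passed to a parser.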

To parse the HTML file contents and print the links they contain, pass the data to the parser object's feed(data) method. The feed() method accepts the data and parses it, calling the handler methods defined in your HTMLParser subclass as it goes.
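The define, instantiate, and feed steps described above can be sketched in Python 3 as follows. This is an illustrative variant rather than the book's listing: it assumes Python 3's html.parser module, collects the links in a list instead of printing them, and parses an inline sample string so that it runs without network access. The class name LinkCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (name chosen for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample markup; a real document would come from a file or URL.
sample = '<p><a href="psf">PSF</a> <A HREF="http://docs.python.org">Docs</A></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)  # ['psf', 'http://docs.python.org']
```

Storing the links on the instance rather than printing them makes the result easier to reuse, for example to filter or resolve the URLs afterward.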

Note

If the data passed to the feed() method is incomplete, the parser buffers the incomplete tag and parses it the next time feed() is called. This is useful when working with large HTML files that must be fed to the parser in chunks. Once all the data has been fed, call close() to process anything still buffered.
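The buffering behavior described in this note can be observed with a small experiment. The sketch below assumes Python 3's html.parser module; the class name HrefRecorder and the two chunks are invented for illustration. The first chunk ends in the middle of a tag, so no handler is called until the second chunk completes it.

```python
from html.parser import HTMLParser


class HrefRecorder(HTMLParser):
    """Record the href of each <a> tag seen (name chosen for this sketch)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.seen.append(dict(attrs).get("href"))


parser = HrefRecorder()

parser.feed('<a hre')              # chunk ends mid-tag; the parser buffers it
after_first = list(parser.seen)    # nothing has been handled yet

parser.feed('f="dev">dev</a>')     # completes the tag started above
parser.close()                     # flush anything still buffered

print(after_first)   # []
print(parser.seen)   # ['dev']
```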


import HTMLParser
import urllib
import sys

#Define HTML Parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

#Create instance of HTML parser
lParser = parseLinks()

#Open the HTML file
lParser.feed(urllib.urlopen(
    "http://www.python.org/index.html").read())
lParser.close()


html_links.py

<a href="psf" class="" title="Python Software Foundation">
links
<a href="links" class="" title="">
dev
<a href="dev" class="" title="Python Core Language Development">
download/releases/2.4.3
<a href="download/releases/2.4.3">
http://docs.python.org
<a href="http://docs.python.org">
ftp/python/2.4.3/python-2.4.3.msi
<a href="ftp/python/2.4.3/python-2.4.3.msi">
ftp/python/2.4.3/Python-2.4.3.tar.bz2
<a href="ftp/python/2.4.3/Python-2.4.3.tar.bz2">
pypi


Output from html_links.py code



Python Phrasebook: Essential Code and Commands
ISBN: 0672329107
Pages: 138
Authors: Brad Dayley
