Retrieving Text from HTML Documents


import HTMLParser
import urllib
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)
lParser = parseText()
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html"
    ).read())

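A common task when processing HTML documents is to pull all the text out of the document. With the HTMLParser module, this task is fairly simple. The first step is to define a subclass of HTMLParser.HTMLParser that overrides the handle_data() method, which the parser calls with each run of text it finds between tags. In this example, the override appends every piece of text other than a lone newline to the urlText list.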

Once the subclass has been defined, create an instance of it to get a parser object. Then open the HTML document using urllib.urlopen(url) and read its contents with read().
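The fetch step can be sketched on its own. The version below is a minimal sketch that assumes Python 3, where urllib.urlopen() has moved to urllib.request.urlopen() and read() returns bytes that must be decoded before parsing. The fetch_html name and its default_charset parameter are hypothetical, introduced only for illustration.

```python
import urllib.request  # Python 3 home of urlopen(); an assumption, the text uses Python 2


def fetch_html(url, default_charset="utf-8"):
    """Hypothetical helper: return the document at `url` as a decoded string."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the Content-Type header when there is one.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes rather than raising on a mislabeled page.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html() with the address of an HTML page returns a string that can be handed to the parser's feed() method.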

To parse the HTML file contents, pass them to the parser object's feed(data) method. The feed() method parses the data and calls handler methods such as handle_data() as it encounters each piece of the document, so the text ends up in urlText. When all the data has been fed, call close() to process anything still buffered, and then print the collected text.
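The define, instantiate, and feed steps described above might look like this in Python 3. This is a sketch under stated assumptions: the module is imported as html.parser (its Python 3 name), the TextCollector class name and the sample document are made up, the text is stored on the parser instance rather than in a global list, and whitespace-only runs are skipped, which is a looser filter than comparing against '\n'.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TextCollector(HTMLParser):
    """Hypothetical subclass that gathers the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # collected text, kept on the instance

    def handle_data(self, data):
        # The parser calls this for each run of text between tags.
        # Skip runs that contain only whitespace.
        if data.strip():
            self.chunks.append(data)


# Made-up sample document standing in for a downloaded page.
SAMPLE_HTML = (
    "<html><body>\n"
    "<h1>Heading</h1>\n"
    "<p>First paragraph.</p>\n"
    "<p>Second paragraph.</p>\n"
    "</body></html>"
)

parser = TextCollector()
parser.feed(SAMPLE_HTML)
parser.close()  # process anything still buffered

for item in parser.chunks:
    print(item)
```

Keeping the list on the instance means two parsers can run side by side without sharing state, at the cost of a short __init__() method.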

Note

If the data passed to the feed() method ends in the middle of a tag, the incomplete portion is buffered and parsed the next time feed() is called. This is useful when working with large HTML files that need to be fed to the parser in chunks.
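The buffering behavior in the note can be checked with a small experiment. The sketch below assumes Python 3 (html.parser), and the class name, helper function, sample markup, and chunk size are all hypothetical. It cuts a document into three-character pieces so that several tags are split across calls to feed(), then compares the joined text with the result of feeding the document in one call. The pieces are joined before comparing because a run of text may reach handle_data() in more than one part when it straddles a chunk boundary.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TextCollector(HTMLParser):
    """Hypothetical subclass that records every piece of text it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(pieces):
    """Feed each string in `pieces` to a fresh parser and return the joined text."""
    parser = TextCollector()
    for piece in pieces:
        parser.feed(piece)  # an incomplete tag waits for the next call
    parser.close()
    return "".join(parser.chunks)


# Made-up sample markup.
SAMPLE_HTML = "<h1>Title</h1><p>Some <b>bold</b> text.</p>"

# Cut the document every 3 characters, which splits tags such as "<h1" + ">".
CHUNK_SIZE = 3
pieces = [SAMPLE_HTML[i:i + CHUNK_SIZE]
          for i in range(0, len(SAMPLE_HTML), CHUNK_SIZE)]

whole = extract_text([SAMPLE_HTML])
chunked = extract_text(pieces)
print(chunked == whole)
```

The same loop works for data read from a file or a network response in fixed-size blocks, provided each block is decoded to text before it is fed.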


import HTMLParser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

#Create instance of HTML parser
lParser = parseText()

#Feed HTML file into parser
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html"
    ).read())
lParser.close()
for item in urlText:
    print item


html_text.py

13.1 HTMLParser - Simple HTML and XHTML parser
Python Library Reference
Previous: 13. Structured Markup Processing
Up: 13. Structured Markup Processing
Next: 13.1.1 Example HTML Parser
13.1 HTMLParser - Simple HTML and XHTML parser
. . .


Output from html_text.py code



Python Phrasebook(c) Essential Code and Commands
ISBN: 0672329107
Pages: 138
Authors: Brad Dayley
