The htmllib module contains a tag-driven HTML parser, which sends data to a formatting object. Example 5-9 uses this module. For more examples of how to parse HTML files using this module, see the description of the formatter module.
Example 5-9. Using the htmllib Module
File: htmllib-example-1.py
import htmllib
import formatter
import string

class Parser(htmllib.HTMLParser):
    # return a dictionary mapping anchor texts to lists
    # of associated hyperlinks

    def __init__(self, verbose=0):
        self.anchors = {}
        # the null formatter discards all output; we only collect links
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, f, verbose)

    def anchor_bgn(self, href, name, type):
        # start saving character data when an anchor opens
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        # fetch the text saved since anchor_bgn
        text = string.strip(self.save_end())
        if self.anchor and text:
            self.anchors[text] = self.anchors.get(text, []) + [self.anchor]

file = open("samples/sample.htm")
html = file.read()
file.close()

p = Parser()
p.feed(html)
p.close()

for k, v in p.anchors.items():
    print k, "=>", v

print
link => ['http://www.python.org']
If you're only out to parse an HTML file and not render it to an output device, it's usually easier to use the sgmllib module instead.
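Both htmllib and sgmllib exist only in Python 2. For readers on Python 3, the following is a minimal sketch of the same anchor-collecting idea, written against the html.parser module instead. This is a swapped-in technique, not the approach shown in Example 5-9, and the names AnchorParser and extract_anchors are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collect a dictionary mapping anchor texts to lists of hyperlinks."""

    def __init__(self):
        super().__init__()
        self.anchors = {}
        self._href = None   # href of the <a> element being read, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # only keep text that appears inside an anchor with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._text).strip()
            if self._href and text:
                self.anchors.setdefault(text, []).append(self._href)
            self._href = None


def extract_anchors(html):
    # hypothetical helper: parse a string and return the anchor dictionary
    p = AnchorParser()
    p.feed(html)
    p.close()
    return p.anchors
```

As in the example above, anchors without an href or without any text are skipped, and repeated anchor texts accumulate their links in a single list.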