The htmllib Module

The htmllib module contains a tag-driven HTML parser, which sends data to a formatting object. Example 5-9 uses this module to build a dictionary that maps anchor texts to the hyperlinks they point to. For more examples of how to parse HTML files using this module, see the description of the formatter module.

Example 5-9. Using the htmllib Module

File: htmllib-example-1.py

import htmllib
import formatter
import string

class Parser(htmllib.HTMLParser):
    # build a dictionary mapping anchor texts to lists
    # of associated hyperlinks

    def __init__(self, verbose=0):
        self.anchors = {}
        f = formatter.NullFormatter()
        htmllib.HTMLParser.__init__(self, f, verbose)

    def anchor_bgn(self, href, name, type):
        # start collecting the text inside the anchor
        self.save_bgn()
        self.anchor = href

    def anchor_end(self):
        # stop collecting, and file the text under its hyperlink
        text = string.strip(self.save_end())
        if self.anchor and text:
            self.anchors[text] = self.anchors.get(text, []) + [self.anchor]

file = open("samples/sample.htm")
html = file.read()
file.close()

p = Parser()
p.feed(html)
p.close()

for k, v in p.anchors.items():
    print k, "=>", v

print

link => ['http://www.python.org']

If you're only out to parse an HTML file and not render it to an output device, it's usually easier to use the sgmllib module instead.
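Both htmllib and sgmllib exist only in Python 2; they were removed in Python 3.0. As a rough sketch of the same parse-only approach under Python 3, the anchor dictionary from Example 5-9 can be built with the standard html.parser module. The class name AnchorCollector, the helper collect_anchors, and the inline sample string are illustrative assumptions, not part of the original example; the string stands in for samples/sample.htm, whose contents are not shown here.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Map anchor texts to lists of associated hyperlinks.

    Hypothetical Python 3 counterpart of the Parser class in Example 5-9.
    """

    def __init__(self):
        super().__init__()
        self.anchors = {}
        self._href = None   # href of the <a> element being read, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # only collect text while inside an anchor that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if self._href and text:
                self.anchors.setdefault(text, []).append(self._href)
            self._href = None


def collect_anchors(html):
    """Parse an HTML string and return the anchor dictionary."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


if __name__ == "__main__":
    # assumed sample input, not the book's samples/sample.htm
    sample = '<p>A <a href="http://www.python.org">link</a>.</p>'
    for text, links in collect_anchors(sample).items():
        print(text, "=>", links)
```

No formatter object is involved in this sketch: the parser calls the handler methods directly, which is the trade-off the paragraph above describes for code that only needs to extract data rather than render it.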
