Section 23.1. The sgmllib Module | Python in a Nutshell, Second Edition (In a Nutshell)

23.1. The sgmllib Module

The name of the sgmllib module is misleading: sgmllib parses only a tiny subset of SGML, but it is still a good way to get information from HTML files. sgmllib supplies one class, SGMLParser, which you subclass, overriding methods. The most frequently used methods of an instance s of your subclass X of SGMLParser are as follows.

close	`s.close( )` Tells the parser that there is no more input data. When `X` overrides `close`, `X.close` must call `SGMLParser.close` to ensure that buffered data is processed.
do_tag	`s.do_tag(attributes)` `X` supplies a method with such a name for each `tag`, with no corresponding end tag, that `X` wants to process. `tag` must be lowercase in the method name, but can be in any case in the parsed text (the SGML standard, like HTML, is case-insensitive, in contrast to XML and XHTML, which are case-sensitive). `SGMLParser`'s `handle_starttag` method calls `do_tag` when appropriate. `attributes` is a list of pairs `(name,value)`, where `name` is an attribute's name, lowercased, and `value` is the value, processed to resolve entity and character references and remove surrounding quotes.
end_tag	`s.end_tag( )` `X` supplies a method with such a name for each `tag` whose end tag `X` wants to process. `tag` must be lowercase in the method name, but can be in any case in the parsed text. `X` must also supply a method named `start_tag`; otherwise, `end_tag` is ignored. `SGMLParser`'s `handle_endtag` method calls `end_tag` when appropriate.
feed	`s.feed(data)` Passes to the parser some of the text being parsed. The parser may process some prefix of the text, holding the rest in a buffer until the next call to `s.feed` or `s.close`.
handle_charref	`s.handle_charref(ref)` Called to process a character reference `'&#ref;'`. `SGMLParser`'s implementation of `handle_charref` handles only decimal numbers in `range(0,256)`, like:

```python
def handle_charref(self, ref):
    try:
        # ref is the text between '&#' and ';', so it is converted whole
        c = chr(int(ref))
    except (TypeError, ValueError):
        self.unknown_charref(ref)
    else:
        self.handle_data(c)
```

Your subclass `X` may override `handle_charref` or `unknown_charref` in order to support other forms of character references `'&#...;'`.
handle_comment	`s.handle_comment(comment)` Called to handle comments. `comment` is the string within `'<!--`...`-->'`, without the delimiters. `SGMLParser`'s implementation of `handle_comment` does nothing.
handle_data	`s.handle_data(data)` Called to process each arbitrary string `data`. Your subclass `X` normally overrides `handle_data`. `SGMLParser`'s implementation of `handle_data` does nothing.
handle_endtag	`s.handle_endtag(tag,method)` Called to handle termination tags for which `X` supplies methods named `start_tag` and `end_tag`. `tag` is the tag string, lowercased. `method` is the bound method for `end_tag`. `SGMLParser`'s implementation of `handle_endtag` just calls `method( )`, and it's rarely necessary to override it.
handle_entityref	`s.handle_entityref(ref)` Called to process an entity reference `'&ref;'`. `SGMLParser`'s implementation of `handle_entityref` looks `ref` up in `s.entitydefs`. Your subclass `X` may override `handle_entityref` or `unknown_entityref` in order to support entity references `'&...;'` in different ways. `SGMLParser`'s attribute `entitydefs` includes keys `'amp'`, `'apos'`, `'gt'`, `'lt'`, and `'quot'`. Suppose your subclass `X` needs to add entities defined in module `htmlentitydefs`, covered in "The htmlentitydefs Module" on page 582. One approach would be:

```python
import sgmllib, htmlentitydefs

class X(sgmllib.SGMLParser):
    # start from a copy of the base class's table, then add the HTML entities
    entitydefs = dict(sgmllib.SGMLParser.entitydefs)
    entitydefs.update((k, unichr(v))
                      for k, v in htmlentitydefs.name2codepoint.iteritems())
```

Of course, method `X.handle_data` must then also be ready to process Unicode rather than just plain-string arguments (an enhancement that is a good idea in any case).
handle_starttag	`s.handle_starttag(tag, method, attributes)` Called to handle tags for which `X` supplies a method `start_tag` or `do_tag`. `tag` is the tag string, lowercased. `method` is the bound method for `start_tag` or `do_tag`. `attributes` is a list of pairs `(name,value)`, where `name` is each attribute's name, lowercased, and `value` is the value, processed to resolve entity references and character references and to remove surrounding quotes. When `X` supplies both `start_tag` and `do_tag` methods, `start_tag` has precedence and `do_tag` is ignored. `SGMLParser`'s implementation of `handle_starttag` just calls `method(attributes)`, and it's rarely necessary to override it.
report_unbalanced	`s.report_unbalanced(tag)` Called when an end tag is found for which no corresponding start tag is open. `tag` is the tag string, lowercased. `SGMLParser`'s implementation of `report_unbalanced` does nothing.
start_tag	`s.start_tag(attributes)` `X` supplies a method thus named for each `tag`, with an end tag, that `X` wants to process. `tag` must be lowercase in the method name, but can be in any case in the parsed text. `SGMLParser`'s `handle_starttag` method calls `start_tag` when appropriate. `attributes` is a list of pairs `(name,value)`, where `name` is each attribute's name, lowercased, and `value` is the value, processed to resolve entity references and character references and to remove surrounding quotes.
unknown_charref	`s.unknown_charref(ref)` Called to process invalid or unrecognized character references. `SGMLParser`'s implementation of `unknown_charref` does nothing.
unknown_endtag	`s.unknown_endtag(tag)` Called to process termination tags for which `X` supplies no specific method. `SGMLParser`'s implementation of `unknown_endtag` does nothing.
unknown_entityref	`s.unknown_entityref(ref)` Called to process unknown entity references. `SGMLParser`'s implementation of `unknown_entityref` does nothing.
unknown_starttag	`s.unknown_starttag(tag, attributes)` Called to process tags for which `X` supplies no specific method. `tag` is the tag string, lowercased. `attributes` is a list of pairs `(name,value)`, where `name` is each attribute's name, lowercased, and `value` is the value, processed to resolve entity references and character references and to remove surrounding quotes. `SGMLParser`'s implementation of `unknown_starttag` does nothing.
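
The naming convention behind `start_tag`, `do_tag`, and `end_tag` may be easier to see in a toy dispatcher. What follows is a simplified sketch of the lookup order described in the entries above, not sgmllib's actual source: the class `ToyDispatcher` and its entry points `open_tag` and `close_tag` are hypothetical names, and parsing of markup text is left out entirely.

```python
class ToyDispatcher(object):
    """Illustrates the start_/do_/end_ lookup order; it does not parse text."""

    def __init__(self):
        self.log = []

    # Hypothetical entry point: a real parser would get here on '<tag ...>'.
    def open_tag(self, tag, attributes):
        tag = tag.lower()                      # method names use the lowercased tag
        method = getattr(self, 'start_' + tag, None)
        if method is None:                     # do_tag is tried only without start_tag
            method = getattr(self, 'do_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    # Hypothetical entry point: a real parser would get here on '</tag>'.
    def close_tag(self, tag):
        tag = tag.lower()
        end = getattr(self, 'end_' + tag, None)
        if end is None:
            self.unknown_endtag(tag)
        elif hasattr(self, 'start_' + tag):
            self.handle_endtag(tag, end)
        # otherwise: end_tag without a matching start_tag is ignored

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown start', tag))

    def unknown_endtag(self, tag):
        self.log.append(('unknown end', tag))


class Demo(ToyDispatcher):
    def start_b(self, attributes):
        self.log.append(('start_b', attributes))

    def do_b(self, attributes):                # never called: start_b has precedence
        self.log.append(('do_b', attributes))

    def end_b(self):
        self.log.append(('end_b',))

    def do_br(self, attributes):
        self.log.append(('do_br', attributes))


d = Demo()
d.open_tag('B', [('class', 'x')])   # uppercase in the "text", lowercase method name
d.open_tag('br', [])
d.open_tag('i', [])                 # no start_i or do_i: goes to unknown_starttag
d.close_tag('b')
d.close_tag('i')                    # no end_i: goes to unknown_endtag
print(d.log)
```

The sketch shows why overriding `handle_starttag` or `handle_endtag` is rarely needed: the per-tag methods already receive everything, and the `unknown_` hooks catch the rest.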
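
The entries for `handle_charref` and `handle_entityref` note that a subclass may override them to support more kinds of references. As a minimal sketch of what such overrides could delegate to, the two helpers below (hypothetical names `decode_charref` and `decode_entityref`) turn the body of a reference into a Unicode character. Hexadecimal bodies such as `xE9` are an assumption about the "other forms" of interest, and whether a given sgmllib version passes them to `handle_charref` at all is not something this sketch establishes. The import fallbacks only let the same code run where `htmlentitydefs` goes by its later name, `html.entities`.

```python
try:
    from htmlentitydefs import name2codepoint    # Python 2, as in the text
except ImportError:
    from html.entities import name2codepoint     # same table under its Python 3 name

try:
    unichr            # Python 2 builtin
except NameError:
    unichr = chr      # Python 3: chr already returns Unicode text


def decode_charref(ref):
    """Return the character for the body of '&#ref;', or None if invalid.

    Accepts decimal ('233') and, as an assumed extension, hexadecimal ('xE9').
    """
    try:
        if ref[:1] in ('x', 'X'):
            code = int(ref[1:], 16)
        else:
            code = int(ref)
        return unichr(code)
    except (TypeError, ValueError, OverflowError):
        return None


def decode_entityref(ref, extra=None):
    """Return the character for the body of '&ref;', or None if unknown.

    extra is an optional dict of additional entity names, checked first.
    """
    if extra and ref in extra:
        return extra[ref]
    code = name2codepoint.get(ref)
    if code is None:
        return None
    return unichr(code)
```

Inside a subclass, `handle_charref` could pass a non-`None` result to `self.handle_data` and fall back to `self.unknown_charref(ref)` otherwise; `handle_entityref` could do the same with `unknown_entityref`.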

23.1.1. Parsing HTML with sgmllib

The following example uses sgmllib for a typical HTML-related task that could be at the core of a "web spider": fetch a page from the Web with urllib, parse it, and output the targets of outgoing hyperlinks. The example uses urlparse to check the page's links and outputs only links whose URLs have an explicit scheme of 'http'.

```python
import sgmllib, urllib, urlparse

class LinksParser(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.seen = set()          # hrefs already handled
    def do_a(self, attributes):
        for name, value in attributes:
            if name == 'href' and value not in self.seen:
                self.seen.add(value)
                pieces = urlparse.urlparse(value)
                # pieces[0] is the scheme; skip anything but explicit http
                if pieces[0] != 'http': return
                print urlparse.urlunparse(pieces)
                return

p = LinksParser()
f = urllib.urlopen('http://www.python.org/index.html')
BUFSIZE = 8192
# feed the page to the parser in chunks, then flush with close
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close()
```

Class LinksParser only needs to define method do_a. The superclass calls back to this method for all <a> tags, and the method loops on the attributes, looking for one named 'href', then works with the corresponding value (i.e., the relevant URL).
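
The filtering that `do_a` applies to each `href` value can also be tried on its own, without fetching a page. The following is a small sketch under that assumption: the generator `new_http_links` and the `sample` list are hypothetical, and unlike the example it yields the original string rather than the result of `urlunparse`. The import fallback only covers `urlparse` being available under its later name, `urllib.parse`.

```python
try:
    from urlparse import urlparse          # Python 2, as in the example
except ImportError:
    from urllib.parse import urlparse      # same function under its Python 3 name


def new_http_links(hrefs, seen=None):
    """Yield each href once, keeping only those with an explicit 'http' scheme."""
    if seen is None:
        seen = set()
    for href in hrefs:
        if href in seen:
            continue                       # already handled, like self.seen in do_a
        seen.add(href)
        if urlparse(href)[0] == 'http':    # item 0 of the result is the scheme
            yield href


sample = ['http://www.python.org/doc/', '/about/', 'mailto:someone@example.com',
          'http://www.python.org/doc/', 'https://www.python.org/']
links = list(new_http_links(sample))
print(links)
```

Relative links such as `/about/` have an empty scheme, so they are dropped along with `mailto:` and `https:` URLs; a spider that wants to follow them would first need to resolve them against the page's own URL.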