

23.1. The sgmllib Module

The name of the sgmllib module is misleading: sgmllib parses only a tiny subset of SGML, but it is still a good way to get information from HTML files. sgmllib supplies one class, SGMLParser, which you subclass, overriding methods. The most frequently used methods of an instance s of your subclass X of SGMLParser are as follows.

close

s.close( )

Tells the parser that there is no more input data. When X overrides close, s.close must call SGMLParser.close to ensure that buffered data is processed.

do_tag

s.do_tag(attributes)

X supplies a method with such a name for each tag without a corresponding end tag that X wants to process. tag must be lowercase in the method name, but can be in any case in the parsed text (the SGML standard, like HTML, is case-insensitive, in contrast to XML and XHTML, which are case-sensitive). SGMLParser's handle_starttag method calls do_tag when appropriate. attributes is a list of pairs (name,value), where name is an attribute's name, lowercased, and value is the value, processed to resolve entity and character references and remove surrounding quotes.

end_tag

s.end_tag( )

X supplies a method with such a name for each tag whose end tag X wants to process. tag must be lowercase in the method name, but can be in any case in the parsed text. X must also supply a method named start_tag; otherwise, end_tag is ignored. SGMLParser's handle_endtag method calls end_tag when appropriate.

feed

s.feed(data)

Passes to the parser some of the text being parsed. The parser may process some prefix of the text, holding the rest in a buffer until the next call to s.feed or s.close.
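
The buffering behavior can be checked by feeding the same text whole and in small pieces. The sketch below is only an illustration: sgmllib is not available in Python 3, so it uses html.parser.HTMLParser from the Python 3 standard library as a stand-in with the same feed/close protocol. The names Collector and parse_in_chunks are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser  # Python 3 stand-in; sgmllib is Python 2 only


class Collector(HTMLParser):
    """Hypothetical helper: records start tags and text as they are reported."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_in_chunks(html, size):
    """Feed html to a fresh parser size characters at a time, then close it."""
    parser = Collector()
    for start in range(0, len(html), size):
        parser.feed(html[start:start + size])
    parser.close()  # process whatever is still held in the buffer
    return parser.tags, ''.join(parser.text)


sample = '<p>Hello <b>big</b> world</p>'
whole = parse_in_chunks(sample, len(sample))
pieces = parse_in_chunks(sample, 3)  # some tags are split across chunk boundaries
```

Both calls should report the same tags and the same joined text, since an incomplete tag at the end of one chunk is held back until the next chunk completes it.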

handle_charref

s.handle_charref(ref)

Called to process a character reference '&#ref;'. SGMLParser's implementation of handle_charref handles only decimal numbers in range(0,256), roughly as follows:

    def handle_charref(self, ref):
        try:
            c = chr(int(ref))
        except (TypeError, ValueError):
            self.unknown_charref(ref)
        else: self.handle_data(c)

Your subclass X may override handle_charref or unknown_charref in order to support other forms of character references '&#...;'.
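
As one possible shape for such an override, the following sketch accepts both decimal and hexadecimal references and is not limited to range(0,256). It is written for Python 3, where chr covers all of Unicode, and it assumes that ref is the text between '&#' and ';'. The names resolve_charref, CharrefMixin, and Recorder are hypothetical and exist only for this illustration.

```python
def resolve_charref(ref):
    """Return the character for the text between '&#' and ';', or None.

    Accepts decimal ('233') and hexadecimal ('xE9' or 'XE9') forms.
    """
    try:
        if ref[:1] in ('x', 'X'):
            codepoint = int(ref[1:], 16)
        else:
            codepoint = int(ref, 10)
        return chr(codepoint)
    except (TypeError, ValueError, OverflowError):
        # not a number, or not a valid Unicode code point
        return None


class CharrefMixin:
    """Hypothetical mix-in showing the override; not part of any library."""

    def handle_charref(self, ref):
        c = resolve_charref(ref)
        if c is None:
            self.unknown_charref(ref)
        else:
            self.handle_data(c)


class Recorder(CharrefMixin):
    """Minimal stand-in for a parser subclass, used only to exercise the mix-in."""

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)


r = Recorder()
for ref in ('65', 'xE9', '8364', 'bogus'):
    r.handle_charref(ref)
```

Keeping the conversion in a plain function makes it easy to test apart from any parser class.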

handle_comment

s.handle_comment(comment)

Called to handle comments. comment is the string within '<!--...-->', without the delimiters. SGMLParser's implementation of handle_comment does nothing.

handle_data

s.handle_data(data)

Called to process each arbitrary string data. Your subclass X normally overrides handle_data. SGMLParser's implementation of handle_data does nothing.

handle_endtag

s.handle_endtag(tag,method)

Called to handle termination tags for which X supplies methods named start_tag and end_tag. tag is the tag string, lowercased. method is the bound method for end_tag. SGMLParser's implementation of handle_endtag just calls method( ), and it's rarely necessary to override it.

handle_entityref

s.handle_entityref(ref)

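Called to process an entity reference '&ref;'. SGMLParser's implementation of handle_entityref looks ref up in s.entitydefs: if ref is found, it passes the replacement text to handle_data; otherwise, it calls unknown_entityref.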

Your subclass X may override handle_entityref or unknown_entityref in order to support entity references '&...;' in different ways. SGMLParser's attribute entitydefs includes keys 'amp', 'apos', 'gt', 'lt', and 'quot'. Suppose your subclass X needs to add entities defined in module htmlentitydefs, covered in "The htmlentitydefs Module". One approach would be:

    import sgmllib, htmlentitydefs

    class X(sgmllib.SGMLParser):
        # copy the predefined entities, then add the named HTML entities
        entitydefs = dict(sgmllib.SGMLParser.entitydefs)
        entitydefs.update((k, unichr(v))
            for k, v in htmlentitydefs.name2codepoint.iteritems())

Of course, method X.handle_data must then also be ready to process Unicode rather than just plain-string arguments (an enhancement that is a good idea in any case).
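
For readers on Python 3, where htmlentitydefs was renamed html.entities and every string is Unicode, the same table-building idea can be sketched without sgmllib. BASE_ENTITYDEFS below is an assumption standing in for SGMLParser.entitydefs (it holds just the five names listed above), and lookup_entity is a hypothetical helper.

```python
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Stand-in for SGMLParser.entitydefs: only the five predefined names.
BASE_ENTITYDEFS = {'amp': '&', 'apos': "'", 'gt': '>', 'lt': '<', 'quot': '"'}

# Copy the base table, then add each named HTML entity as a one-character string.
entitydefs = dict(BASE_ENTITYDEFS)
entitydefs.update((name, chr(code)) for name, code in name2codepoint.items())


def lookup_entity(ref, table=entitydefs):
    """Return the replacement text for '&ref;', or None if ref is unknown."""
    return table.get(ref)
```

Copying the base table first means the class-level dictionary of the parent is never mutated, which is also why the approach above starts from dict(...) rather than updating in place.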

handle_starttag

s.handle_starttag(tag, method, attributes)

Called to handle tags for which X supplies a method start_tag or do_tag. tag is the tag string, lowercased. method is the bound method for start_tag or do_tag. attributes is a list of pairs (name,value), where name is each attribute's name, lowercased, and value is the value, processed to resolve entity references and character references and to remove surrounding quotes. When X supplies both start_tag and do_tag methods, start_tag has precedence and do_tag is ignored. SGMLParser's implementation of handle_starttag just calls method(attributes), and it's rarely necessary to override it.
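
The lookup order described above can be modeled in a few lines. This is a simplified sketch, not the module's actual source: MiniDispatcher, dispatch_starttag, and Demo are hypothetical names, and the model ignores end tags and the parser's internal stack of open tags.

```python
class MiniDispatcher:
    """Simplified, hypothetical model of name-based start-tag dispatch."""

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()  # method names always use the lowercased tag
        # start_tag takes precedence over do_tag when both exist
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        pass


class Demo(MiniDispatcher):
    """Records which handler was chosen for each tag."""

    def __init__(self):
        self.calls = []

    def start_b(self, attributes):
        self.calls.append(('start_b', attributes))

    def do_b(self, attributes):
        self.calls.append(('do_b', attributes))

    def do_img(self, attributes):
        self.calls.append(('do_img', attributes))

    def unknown_starttag(self, tag, attributes):
        self.calls.append(('unknown', tag))


d = Demo()
d.dispatch_starttag('B', [])                      # both start_b and do_b exist
d.dispatch_starttag('IMG', [('src', 'x.png')])    # only do_img exists
d.dispatch_starttag('p', [])                      # no specific method
```

Routing every call through handle_starttag is what makes that method a single place to add logging or filtering for all tags, although overriding it is rarely needed.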

report_unbalanced

s.report_unbalanced(tag)

Called when an end tag is found for a tag that is not currently open. tag is the tag string, lowercased. SGMLParser's implementation of report_unbalanced does nothing.
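
To illustrate what detecting such a tag involves, here is a minimal stack-based sketch. It uses html.parser.HTMLParser from the Python 3 standard library as a stand-in, since sgmllib exists only in Python 2. BalanceChecker is a hypothetical class, and its report_unbalanced method merely borrows the callback's name; HTMLParser itself does not call it.

```python
from html.parser import HTMLParser  # Python 3 stand-in; sgmllib is Python 2 only


class BalanceChecker(HTMLParser):
    """Hypothetical helper that tracks open tags and records stray end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of start tags seen and not yet closed
        self.unbalanced = []   # end tags that matched nothing on the stack

    def handle_starttag(self, tag, attrs):
        # Simplification: tags that never take an end tag (such as br)
        # would stay on the stack in this sketch.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # pop back to, and including, the most recent matching start tag
            while self.open_tags.pop() != tag:
                pass
        else:
            self.report_unbalanced(tag)

    def report_unbalanced(self, tag):
        self.unbalanced.append(tag)


checker = BalanceChecker()
checker.feed('<div><p>text</p></span></div>')
checker.close()
```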

start_tag

s.start_tag(attributes)

X supplies a method thus named for each tag, with an end tag, that X wants to process. tag must be lowercase in the method name, but can be in any case in the parsed text. SGMLParser's handle_starttag method calls start_tag when appropriate. attributes is a list of pairs (name,value), where name is each attribute's name, lowercased, and value is the value, processed to resolve entity references and character references and to remove surrounding quotes.

unknown_charref

s.unknown_charref(ref)

Called to process invalid or unrecognized character references. SGMLParser's implementation of unknown_charref does nothing.

unknown_endtag

s.unknown_endtag(tag)

Called to process termination tags for which X supplies no specific method. SGMLParser's implementation of unknown_endtag does nothing.

unknown_entityref

s.unknown_entityref(ref)

Called to process unknown entity references. SGMLParser's implementation of unknown_entityref does nothing.

unknown_starttag

s.unknown_starttag(tag, attributes)

Called to process tags for which X supplies no specific method. tag is the tag string, lowercased. attributes is a list of pairs (name,value), where name is each attribute's name, lowercased, and value is the value, processed to resolve entity references and character references and to remove surrounding quotes. SGMLParser's implementation of unknown_starttag does nothing.


23.1.1. Parsing HTML with sgmllib

The following example uses sgmllib for a typical HTML-related task that could be at the core of a "web spider": fetch a page from the Web with urllib, parse it, and output the targets of outgoing hyperlinks. The example uses urlparse to check the page's links and outputs only links whose URLs have an explicit scheme of 'http'.

    import sgmllib, urllib, urlparse

    class LinksParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.seen = set()
        def do_a(self, attributes):
            for name, value in attributes:
                if name == 'href' and value not in self.seen:
                    self.seen.add(value)
                    pieces = urlparse.urlparse(value)
                    # skip links without an explicit 'http' scheme
                    if pieces[0] != 'http': return
                    print urlparse.urlunparse(pieces)
                    return

    p = LinksParser()
    f = urllib.urlopen('http://www.python.org/index.html')
    BUFSIZE = 8192
    # feed the page to the parser one buffer at a time
    while True:
        data = f.read(BUFSIZE)
        if not data: break
        p.feed(data)
    p.close()

Apart from __init__, which sets up the set of URLs already seen, class LinksParser only needs to define method do_a. The superclass calls back to this method for all <a> tags, and the method loops on the attributes, looking for one named 'href', then works with the corresponding value (i.e., the relevant URL).
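
Since sgmllib, urllib.urlopen, and urlparse exist in this form only in Python 2, a rough Python 3 counterpart may be useful for comparison. The sketch below swaps in html.parser.HTMLParser and urllib.parse, parses an in-memory string instead of fetching a page, and collects links in a list rather than printing them. LinksCollector and the sample page are hypothetical, and the example.com and example.org URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, urlunparse


class LinksCollector(HTMLParser):
    """Hypothetical Python 3 counterpart of LinksParser."""

    def __init__(self):
        super().__init__()
        self.seen = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has no do_a hook, so the tag name is checked explicitly
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == 'href' and value and value not in self.seen:
                self.seen.add(value)
                pieces = urlparse(value)
                if pieces.scheme == 'http':
                    self.links.append(urlunparse(pieces))
                return


page = ('<a href="http://example.com/a">A</a>'
        '<A HREF="/relative">B</A>'
        '<a href="http://example.com/a">again</a>'
        '<a href="https://example.com/s">S</a>'
        '<a name="x" href="http://example.org/">O</a>')
collector = LinksCollector()
collector.feed(page)
collector.close()
```

As in the original, a URL is recorded at most once, and only when its scheme is exactly 'http'; relative links and 'https' links are skipped.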




Python in a Nutshell, Second Edition (In a Nutshell)
ISBN: 0596100469
EAN: 2147483647
Year: 2004
Pages: 192
Authors: Alex Martelli
