Adding Quotes to Attribute Values in HTML Documents


    import HTMLParser
    import urllib

    class parseAttrs(HTMLParser.HTMLParser):
        def handle_starttag(self, tag, attrs):
            . . .

    attrParser = parseAttrs()
    attrParser.init_parser()
    attrParser.feed(urllib.urlopen("test2.html").read())

Earlier in this chapter, we discussed parsing HTML files with individual handlers in the HTML parser. Sometimes you need every handler to process an HTML document. Using the HTMLParser module to parse every element of an HTML file is not much more complex than handling just the links or images.

This phrase shows how to use the HTMLParser module to parse an HTML file and add the missing quotes around its attribute values. The first step is to define a new HTMLParser class that overrides the following handlers so that each piece of the document can be written back out with quoted attribute values.

    handle_starttag(tag, attrs)
    handle_charref(name)
    handle_endtag(tag)
    handle_entityref(ref)
    handle_data(text)
    handle_comment(text)
    handle_pi(text)
    handle_decl(text)
    handle_startendtag(tag, attrs)
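As a rough illustration of when these handlers fire, the sketch below records each callback for a short sample string. The class name HandlerLogger and the sample markup are hypothetical, not from this phrase, and the import fallback assumes the module is available as html.parser on Python 3. Only a few of the handlers are overridden to keep it short.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name


class HandlerLogger(HTMLParser):
    """Hypothetical helper: records which handler fired and with what."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3 decodes character references itself unless this is off,
        # in which case handle_entityref is never called.
        # Python 2 ignores this attribute.
        self.convert_charrefs = False
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, text):
        self.events.append(("data", text))

    def handle_entityref(self, ref):
        self.events.append(("entityref", ref))

    def handle_comment(self, text):
        self.events.append(("comment", text))


logger = HandlerLogger()
logger.feed('<a href=test.html>A &amp; B</a><!-- note -->')
logger.close()
for event in logger.events:
    print(event)
```

Note that the attributes arrive as a list of (name, value) pairs with the surrounding quotes, if any, already removed. That is what lets the handler write them back with quotes added.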


You will also need to define a method inside the parser class to initialize the variables used to store the parsed data and another method to return the parsed data.
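A minimal sketch of those two methods, assuming the names init_parser and parsed used in the code for this phrase. The TextOnly class and the sample string are hypothetical. Because only handle_data is overridden here, the tags are silently dropped, which is why the full parser has to override the other handlers as well.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name


class TextOnly(HTMLParser):
    """Hypothetical parser that stores only the text between tags."""

    def init_parser(self):
        # Holds each parsed piece in document order
        self.pieces = []

    def handle_data(self, text):
        self.pieces.append(text)

    def parsed(self):
        # Join the stored pieces back into one string
        return "".join(self.pieces)


parser = TextOnly()
parser.init_parser()   # must run before feed(), or self.pieces does not exist
parser.feed("<h1>Web Listings</h1><a href=test.html>local page</a>")
parser.close()
print(parser.parsed())  # prints: Web Listingslocal page
```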

Once the new HTMLParser class has been defined, create an instance of it. Call the initialization method you created to set up the parser; then open the HTML document using urllib.urlopen(url) and read its contents.

To parse the HTML file contents and add the quotes to the attribute values, pass the data to the parser's feed(data) method. The feed method accepts the data, parses it, and calls the handlers defined in your class as it encounters each element.
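feed(data) does not need the whole document at once. The sketch below, which assumes a hypothetical TagCollector class, passes a start tag in two chunks to show that the parser buffers incomplete markup until the rest arrives.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name


class TagCollector(HTMLParser):
    """Hypothetical parser that remembers every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed("<a hre")                      # tag is incomplete, so it is buffered
pending = list(collector.tags)                # snapshot taken before the rest arrives
collector.feed("f=test.html>local page</a>")  # completes the tag
collector.close()                             # flush anything still buffered

print(pending)          # prints: []
print(collector.tags)   # prints: [('a', {'href': 'test.html'})]
```

Calling close() once all the data has been fed tells the parser that no more input is coming, so it processes whatever is still buffered.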

    #Written for Python 2 (HTMLParser module, urllib.urlopen, print statement)
    import HTMLParser
    import urllib

    #Define the HTML parser
    class parseAttrs(HTMLParser.HTMLParser):
        #Initialize the list that stores the parsed pieces
        def init_parser (self):
            self.pieces = []

        #Rebuild the start tag with every attribute value quoted
        def handle_starttag(self, tag, attrs):
            fixedAttrs = ""
            for name, value in attrs:
                fixedAttrs += "%s=\"%s\" " % (name, value)
            self.pieces.append("<%s %s>" % (tag, fixedAttrs))

        #handle_startendtag is not overridden; the base class version
        #calls handle_starttag and then handle_endtag, so <meta .../>
        #is written out as <meta ...></meta>

        def handle_charref(self, name):
            self.pieces.append("&#%s;" % (name))

        def handle_endtag(self, tag):
            self.pieces.append("</%s>" % (tag))

        #Entity references need the trailing semicolon restored
        def handle_entityref(self, ref):
            self.pieces.append("&%s;" % (ref))

        def handle_data(self, text):
            self.pieces.append(text)

        def handle_comment(self, text):
            self.pieces.append("<!--%s-->" % (text))

        def handle_pi(self, text):
            self.pieces.append("<?%s>" % (text))

        def handle_decl(self, text):
            self.pieces.append("<!%s>" % (text))

        #Return the parsed pieces as a single string
        def parsed (self):
            return "".join(self.pieces)

    #Create instance of HTML parser
    attrParser = parseAttrs()

    #Initialize the parser data
    attrParser.init_parser()

    #Feed HTML file into parser, then flush any buffered data
    attrParser.feed(urllib.urlopen("test2.html").read())
    attrParser.close()

    #Display original file contents
    print "Original File\n========================"
    print open("test2.html").read()

    #Display the parsed file
    print "Parsed File\n========================"
    print attrParser.parsed()


html_quotes.py

    Original File
    ========================
    <html lang="en" xml:lang="en">
    <head>
    <meta content="text/html; charset=utf-8"  http-equiv="content-type"/>
    <title>Web Page</title>
    </head>
    <body>
    <H1>Web Listings</H1>
    <a href=http://www.python.org>Python Web Site</a>
    <a href=test.html>local page</a>
    <img SRC=test.jpg>
    </body>
    </html>

    Parsed File
    ========================
    <html lang="en" xml:lang="en" >
    <head >
    <meta content="text/html; charset=utf-8"  http-equiv="content-type" ></meta>
    <title >Web Page</title>
    </head>
    <body >
    <h1 >Web Listings</h1>
    <a href="http://www.python.org" >Python Web Site</a>
    <a href="test.html" >local page</a>
    <img src="test.jpg" >
    </body>
    </html>


Output from html_quotes.py code
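For readers on Python 3, here is one way the same idea might be sketched. It is an adaptation, not the book's code: the class name QuoteAttrs and the helper _format are hypothetical. It also differs in three ways: it overrides handle_startendtag so that self-closing tags stay self-closing, it writes valueless attributes such as disabled without a value, and it re-escapes & and " in attribute values because the parser hands them over already decoded.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name


class QuoteAttrs(HTMLParser):
    """Hypothetical variant of parseAttrs; not the book's code."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3 decodes &amp; and &#38; into plain text by default, which
        # bypasses handle_entityref and handle_charref. Turn that off so
        # they are written back unchanged. Python 2 ignores this attribute.
        self.convert_charrefs = False
        self.pieces = []

    def _format(self, tag, attrs, closing=">"):
        # Build "<tag name="value" ...>" with every value quoted
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # valueless attribute, e.g. disabled
            else:
                value = value.replace("&", "&amp;").replace('"', "&quot;")
                parts.append('%s="%s"' % (name, value))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.pieces.append(self._format(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.pieces.append(self._format(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % ref)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % text)

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % text)

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % text)

    def parsed(self):
        return "".join(self.pieces)


quoter = QuoteAttrs()
quoter.feed('<a href=test.html>A &amp; B</a><br/><input disabled>')
quoter.close()
print(quoter.parsed())
```

Unlike html_quotes.py, this variant does not leave a trailing space before the closing > of each tag.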



Python Phrasebook: Essential Code and Commands
ISBN: 0672329107
Pages: 138
Authors: Brad Dayley
