15.2 Colorizing Python Source Using the Built-in Tokenizer


Credit: Jürgen Hermann

15.2.1 Problem

You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.

15.2.2 Solution

tokenize.tokenize does most of the work and calls us back for each token found, so we can output it with appropriate colorization:

""" MoinMoin - Python Source Parser """ import cgi, string, sys, cStringIO import keyword, token, tokenize # Python Source Parser (does highlighting into HTML) _KEYWORD = token.NT_OFFSET + 1 _TEXT    = token.NT_OFFSET + 2 _colors = {     token.NUMBER:       '#0080C0',     token.OP:           '#0000C0',     token.STRING:       '#004080',     tokenize.COMMENT:   '#008000',     token.NAME:         '#000000',     token.ERRORTOKEN:   '#FF8080',     _KEYWORD:           '#C00000',     _TEXT:              '#000000', } class Parser:     """ Send colorized Python source as HTML to an output file (normally stdout).     """     def _ _init_ _(self, raw, out = sys.stdout):         """ Store the source text. """         self.raw = string.strip(string.expandtabs(raw))         self.out = out     def format(self):         """ Parse and send the colorized source to output. """         # Store line offsets in self.lines         self.lines = [0, 0]         pos = 0         while 1:             pos = string.find(self.raw, '\n', pos) + 1             if not pos: break             self.lines.append(pos)         self.lines.append(len(self.raw))         # Parse the source and write it         self.pos = 0         text = cStringIO.StringIO(self.raw)         self.out.write('<pre><font face="Lucida,Courier New">')         try:             tokenize.tokenize(text.readline, self) # self as handler callable         except tokenize.TokenError, ex:             msg = ex[0]             line = ex[1][0]             self.out.write("<h3>ERROR: %s</h3>%s\n" % (                 msg, self.raw[self.lines[line]:]))         self.out.write('</font></pre>')     def _ _call_ _(self, toktype, toktext, (srow,scol), (erow,ecol), line):         """ Token handler """         if 0:  # You may enable this for debugging purposes only             print "type", toktype, token.tok_name[toktype], "text", toktext,             print "start", srow,scol, "end", erow,ecol, "<br>"         # Calculate new positions         oldpos = 
self.pos         newpos = self.lines[srow] + scol         self.pos = newpos + len(toktext)         # Handle newlines         if toktype in [token.NEWLINE, tokenize.NL]:             self.out.write('\n')             return         # Send the original whitespace, if needed         if newpos > oldpos:             self.out.write(self.raw[oldpos:newpos])         # Skip indenting tokens         if toktype in [token.INDENT, token.DEDENT]:             self.pos = newpos             return         # Map token type to a color group         if token.LPAR <= toktype <= token.OP:             toktype = token.OP         elif toktype == token.NAME and keyword.iskeyword(toktext):             toktype = _KEYWORD         color = _colors.get(toktype, _colors[_TEXT])         style = ''         if toktype == token.ERRORTOKEN:             style = ' style="border: solid 1.5pt #FF0000;"'         # Send text         self.out.write('<font color="%s"%s>' % (color, style))         self.out.write(cgi.escape(toktext))         self.out.write('</font>') if _ _name_ _ == "_ _main_ _":     import os, sys     print "Formatting..."     # Open own source     source = open('python.py').read(  )     # Write colorized version to "python.html"     Parser(source, open('python.html', 'wt')).format(  )     # Load HTML page into browser     if os.name == "nt":         os.system("explorer python.html")     else:         os.system("netscape python.html &")

15.2.3 Discussion

This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to use the built-in keyword, token, and tokenize modules to scan Python source code and re-emit it with appropriate color markup but no changes to its original formatting ("no changes" is the hard part!).
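
To get a feel for what the tokenizer reports, it can help to dump the tokens of a small fragment. The sketch below is not part of the recipe: it assumes a modern Python 3, where tokenize.generate_tokens yields tokens instead of calling a handler, and the helper name dump_tokens and the "KEYWORD" label are made up for illustration.

```python
import io
import keyword
import token
import tokenize


def dump_tokens(source):
    """Return (type name, text, start, end) for each token in source."""
    rows = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        name = token.tok_name[tok.type]
        # The tokenizer reports keywords as plain NAME tokens; the keyword
        # module tells them apart. "KEYWORD" is our own label, not a real
        # token type.
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            name = "KEYWORD"
        rows.append((name, tok.string, tok.start, tok.end))
    return rows


rows = dump_tokens("if x:  # test\n    y = 1\n")
for row in rows:
    print(row)
```

Each start and end is a (row, column) pair with rows counted from 1, which is why the recipe needs its own table of line offsets to get back to positions in the source string.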

The Parser class's constructor saves the multiline string that is the Python source to colorize and the file object, which is open for writing, where you want to output the colorized results. Then, the format method prepares a self.lines list that holds the offset (the index into the source string, self.raw) of each line's start.
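
The line-offset table is easier to follow in isolation. The following is a minimal sketch of the same loop as a standalone function, written for Python 3; the names line_offsets and absolute_offset are hypothetical and do not appear in the recipe.

```python
def line_offsets(raw):
    """Return a list whose entry n is the index in raw where line n starts.

    Lines are numbered from 1, as the tokenizer numbers them, so entry 0 is
    only padding. A final entry holds len(raw), covering positions reported
    just past the last line.
    """
    lines = [0, 0]
    pos = 0
    while True:
        # find returns -1 when there are no more newlines, so pos becomes 0
        pos = raw.find("\n", pos) + 1
        if not pos:
            break
        lines.append(pos)
    lines.append(len(raw))
    return lines


def absolute_offset(lines, row, col):
    """Convert a (row, col) position into an index into the source string."""
    return lines[row] + col


raw = "a = 1\nbb = 22\nccc"
lines = line_offsets(raw)
start = absolute_offset(lines, 2, 5)  # row 2, column 5: the literal 22
print(lines, repr(raw[start:start + 2]))
```

With this table, turning a token's (row, column) start into an index is one addition, which is what the recipe does with self.lines[srow] + scol.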

format then calls tokenize.tokenize, passing self as the callback. Thus, the __call__ method is invoked for each token, with arguments giving the token type, the token text, and the token's starting and ending positions in the source (each expressed as a line number and a column offset within the line). The body of the __call__ method reconstructs the exact position within the original source code string self.raw, so it can emit exactly the same whitespace that was present in the original source. It then picks a color code from the _colors dictionary (which uses HTML color coding), with help from the keyword standard module to determine if a NAME token is actually a Python keyword (to be emitted in a different color than that used for ordinary identifiers).
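
The callback form of tokenize.tokenize used by the recipe belongs to the Python versions of its day. As a rough illustration of the same idea on Python 3, here is a compact sketch built on tokenize.generate_tokens and html.escape. It is not MoinMoin's code: the function colorize, the KEYWORD label, and the use of span tags are assumptions made for this example, it skips the recipe's special handling of error tokens, and it has not been checked against every token type.

```python
import html
import io
import keyword
import token
import tokenize

KEYWORD = "keyword"  # our own label; not a real token type

COLORS = {
    token.NUMBER: "#0080C0",
    token.OP: "#0000C0",
    token.STRING: "#004080",
    token.COMMENT: "#008000",
    KEYWORD: "#C00000",
}
DEFAULT_COLOR = "#000000"

# Token types with no visible text of their own; the whitespace they stand
# for is copied from the source as part of the gap before the next token.
LAYOUT = (token.NEWLINE, token.NL, token.INDENT, token.DEDENT,
          token.ENDMARKER)


def colorize(raw):
    """Return raw (Python source text) as colorized HTML inside <pre>."""
    raw = raw.expandtabs().strip()

    # offsets[n] is the index in raw where line n starts (lines are 1-based).
    offsets = [0, 0]
    pos = 0
    while True:
        pos = raw.find("\n", pos) + 1
        if not pos:
            break
        offsets.append(pos)

    out = ["<pre>"]
    pos = 0  # how much of raw has been written so far
    try:
        for tok in tokenize.generate_tokens(io.StringIO(raw).readline):
            if tok.type in LAYOUT:
                continue
            start = offsets[tok.start[0]] + tok.start[1]
            # Copy the original whitespace between the previous token and
            # this one, so the layout of the source is preserved.
            out.append(html.escape(raw[pos:start]))
            kind = tok.type
            if kind == token.NAME and keyword.iskeyword(tok.string):
                kind = KEYWORD
            color = COLORS.get(kind, DEFAULT_COLOR)
            out.append('<span style="color: %s">%s</span>'
                       % (color, html.escape(tok.string)))
            pos = start + len(tok.string)
    except tokenize.TokenError:
        pass  # incomplete source: the remainder is emitted uncolored below
    out.append(html.escape(raw[pos:]))
    out.append("</pre>")
    return "".join(out)


sample = "if x:  # test\n    y = '<b>'\n"
result = colorize(sample)
print(result)
```

Unlike the recipe, this version writes newlines and indentation as part of the gap before each token rather than handling NEWLINE, INDENT, and DEDENT tokens separately; the position bookkeeping is otherwise the same.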

The test code at the bottom of the module formats the module itself (which it expects to find saved as python.py) and launches a browser with the result. To stay compatible with stone-age versions of Python, it does not use the standard Python module webbrowser. If you have no such worries, you can change the last few lines of the recipe to:

# Load HTML page into browser
import webbrowser
webbrowser.open("python.html", 0, 1)

and enjoy the result in your favorite browser.

15.2.4 See Also

Documentation for the webbrowser, token, tokenize, and keyword modules in the Library Reference; the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer, part of MoinMoin (http://moin.sourceforge.net).



Python Cookbook
ISBN: 0596007973
Year: 2005
Pages: 346
