Recipe 16.6. Colorizing Python Source Using the Built-in Tokenizer

Credit: Jürgen Hermann, Mike Brown

Problem

You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.

Solution

tokenize.generate_tokens does most of the work. We just need to loop over all the tokens it finds and output each one with the appropriate colorization:

""" MoinMoin - Python Source Parser """
import cgi, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)
_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2
_colors = {
    token.NUMBER:       '#0080C0',
    token.OP:           '#0000C0',
    token.STRING:       '#004080',
    tokenize.COMMENT:   '#008000',
    token.NAME:         '#000000',
    token.ERRORTOKEN:   '#FF8080',
    _KEYWORD:           '#C00000',
    _TEXT:              '#000000',
}

class Parser(object):
    """ Send colorized Python source HTML to output file (normally stdout).
    """
    def __init__(self, raw, out=sys.stdout):
        """ Store the source text. """
        self.raw = raw.expandtabs().strip()
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while True:
            pos = self.raw.find('\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))
        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida, Courier New">')
        try:
            # the loop variable is named tok so that it does not shadow
            # the token module, which the loop body still needs
            for tok in tokenize.generate_tokens(text.readline):
                # unpack the components of each token
                toktype, toktext, (srow, scol), (erow, ecol), line = tok
                if False:  # You may enable this for debugging purposes only
                    print "type", toktype, token.tok_name[toktype],
                    print "text", toktext,
                    print "start", srow,scol, "end", erow,ecol, "<br>"
                # Calculate new positions
                oldpos = self.pos
                newpos = self.lines[srow] + scol
                self.pos = newpos + len(toktext)
                # Handle newlines
                if toktype in (token.NEWLINE, tokenize.NL):
                    self.out.write('\n')
                    continue
                # Send the original whitespace, if needed
                if newpos > oldpos:
                    self.out.write(self.raw[oldpos:newpos])
                # Skip indenting tokens, since they're whitespace-only
                if toktype in (token.INDENT, token.DEDENT):
                    self.pos = newpos
                    continue
                # Map token type to a color group
                if token.LPAR <= toktype <= token.OP:
                    toktype = token.OP
                elif toktype == token.NAME and keyword.iskeyword(toktext):
                    toktype = _KEYWORD
                color = _colors.get(toktype, _colors[_TEXT])
                style = ''
                if toktype == token.ERRORTOKEN:
                    style = ' style="border: solid 1.5pt #FF0000;"'
                # Send text
                self.out.write('<font color="%s"%s>' % (color, style))
                self.out.write(cgi.escape(toktext))
                self.out.write('</font>')
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

if __name__ == "__main__":
    print "Formatting..."
    # Open own source
    source = open('python.py').read()
    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format()
    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html")

Discussion

This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to use the built-in keyword, token, and tokenize modules to scan Python source code and re-emit it with appropriate color markup but no changes to its original formatting ("no changes" is the hard part!).

The Parser class's constructor saves the multiline string that holds the Python source to colorize, along with the file object, open for writing, to which the colorized result is sent. Then, the format method prepares a self.lines list that holds the offset (i.e., the index into the source string, self.raw) of each line's start. The list begins with a placeholder entry so that it can be indexed directly by the tokenizer's 1-based line numbers.
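The offset bookkeeping can be tried out in isolation. The following is a minimal sketch that pulls the same loop out of format into a standalone function; the names line_offsets and absolute_position are made up for this illustration and are not part of the recipe or of MoinMoin:

```python
def line_offsets(source):
    """Return a list where offsets[row] is the index in source at which
    1-based line number row starts (index 0 is only a placeholder)."""
    offsets = [0, 0]
    pos = 0
    while True:
        pos = source.find('\n', pos) + 1
        if not pos:              # find returned -1: no more newlines
            break
        offsets.append(pos)
    offsets.append(len(source))  # final entry marks the end of the text
    return offsets


def absolute_position(offsets, row, col):
    """Convert a tokenizer (row, col) pair into an index into the source."""
    return offsets[row] + col
```

For the three-line source "a = 1\nbb = 2\nccc", line_offsets returns [0, 0, 6, 13, 16], so a token that starts at row 2, column 0 maps to index 6 of the string.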

format then loops over the tokens produced by the generator tokenize.generate_tokens, unpacking each token tuple into items specifying the token type, the token text, and the starting and ending positions in the source (each expressed as line number and offset within the line). The body of the loop reconstructs the exact position within the original source code string self.raw, so it can emit exactly the same whitespace that was present in the original source. It then picks a color code from the _colors dictionary (which uses HTML color coding), with help from the keyword standard module to determine whether a NAME token is actually a Python keyword (to be output in a different color than that used for ordinary identifiers).
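The way token types are mapped to color groups can be seen without any HTML at all. Here is a small sketch, written for Python 3 (where io.StringIO stands in for cStringIO), that only classifies tokens; the function name classify and the lowercase category labels are assumptions chosen for this illustration, not something the recipe defines:

```python
import io
import keyword
import token
import tokenize


def classify(source):
    """Yield (category, text) pairs for the visible tokens in source."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        toktype, toktext = tok[0], tok[1]
        # Layout-only tokens carry no text worth coloring
        if toktype in (token.NEWLINE, tokenize.NL, token.INDENT,
                       token.DEDENT, token.ENDMARKER):
            continue
        if toktype == token.NAME and keyword.iskeyword(toktext):
            category = 'keyword'      # a NAME that is a reserved word
        elif toktype == token.OP:
            category = 'operator'
        else:
            category = token.tok_name[toktype].lower()
        yield category, toktext
```

Calling list(classify("if x: pass\n")) gives the pairs ('keyword', 'if'), ('name', 'x'), ('operator', ':'), and ('keyword', 'pass'), which shows that the tokenizer reports keywords as plain NAME tokens and that keyword.iskeyword is what tells them apart.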

The test code at the bottom of the module formats the module itself and launches a browser with the result, using the standard Python library module webbrowser to enable you to see and enjoy the result in your favorite browser.

If you put this recipe's code into a module, you can then import the module and reuse its functionality in CGI scripts (using the PATH_TRANSLATED CGI environment variable to know what file to colorize), command-line tools (taking filenames as arguments), filters that colorize anything they get from standard input, and so on. See http://skew.org/~mike/colorize.py for versions that support several of these various possibilities.
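One possible way to let a single script serve all three of these roles is to decide where the source comes from before handing it to Parser. The sketch below is written for Python 3 and is only an illustration under stated assumptions: the helper names pick_source and read_source are hypothetical and are not taken from the recipe or from colorize.py.

```python
import os
import sys


def pick_source(environ, argv):
    """Return the path of the file to colorize, or None to read stdin."""
    path = environ.get('PATH_TRANSLATED')
    if path:              # running as a CGI script
        return path
    if len(argv) > 1:     # running as a command-line tool
        return argv[1]
    return None           # running as a filter


def read_source(environ=os.environ, argv=sys.argv, stdin=sys.stdin):
    """Return the source text selected by pick_source."""
    path = pick_source(environ, argv)
    if path is None:
        return stdin.read()
    with open(path) as source_file:
        return source_file.read()
```

The text returned by read_source can then be passed to the Parser class. When the script runs as a CGI script, it must also write a Content-Type: text/html header followed by an empty line before any of the HTML.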

With small changes, it's also easy to turn this recipe into an Apache handler, so that whenever a visitor to your Apache web site requests a .py file, the file is served up as colorized HTML.

To use this recipe as an Apache handler, save the script as colorize.cgi (not .py, lest it confuse Apache), and add the following lines to your .htaccess or httpd.conf Apache configuration file:

AddHandler application/x-python .py
Action application/x-python /full/virtual/path/to/colorize.cgi

Also, make sure that mod_actions, the module that provides the Action directive, is enabled in your httpd.conf Apache configuration file.
