Appendix D. A State Machine for Adding Markup to Text

This book was written entirely in plaintext editors, using a set of conventions I call "smart ASCII." In spirit and appearance, smart ASCII resembles the informal markup that has developed on email and Usenet. In fact, I have used an evolving version of the format for a number of years to produce articles, tutorials, and other documents. The book required a few additional conventions in the earlier smart ASCII format, but only a few. It was a toolchain that made almost all the individual typographic and layout decisions. Of course, that toolchain only came to exist through many hours of programming and debugging by me and by other developers.

The printed version of this book used tools I wrote in Python to assemble the chapters, frontmatter, and endmatter, and then to add $graphics/latex.gif$ markup codes to the text. A moderate number of custom $graphics/latex.gif$ macros are included in that markup. From there, the work of other people lets me convert $graphics/latex.gif$ source into the PDF format Addison-Wesley can convert into printed copies.

For information on the smart ASCII format, see the discussions of it in several places in this book, chiefly in Chapter 4. You may also download the ASCII text of this book from its Web site at <http://gnosis.cx/TPiP/>, along with a semiformal documentation of the conventions used. Readers might also be interested in a format called "reStructuredText," which is similar in spirit, but both somewhat "heavier" and more formally specified. reStructuredText has a semiofficial status in the Python community since it is now included in the DocUtils package; for information see:

<http://docutils.sourceforge.net/rst.html>

In this appendix, I include the full source code for an application that can convert the original text of this book into an HTML document. I believe that this application is a good demonstration of the design and structure of a realistic text processing tool. In general structure, book2html.py uses a line-oriented state machine to categorize lines into appropriate document elements. Under this approach, the "meaning" of a particular line is, in part, determined by the context of the lines that came immediately before it. After making decisions on how to categorize each line with a combination of a state machine and a collection of regular expression patterns, the blocks of document elements are processed into HTML output. In principle, it would not be difficult to substitute a different output format; the steps involved are modular.

The Web site for this book has a collection of utilities similar to the one presented. Over time, I have adapted the skeleton to deal with variations in input and output formats, but there is overlap between all of them. Using this utility is simply a matter of typing something like:

 % book2html.py "Text Processing in Python" < TPiP.txt > TPiP.html

The title is optional, and you may pipe STDIN and STDOUT as usual. Since the target is HTML, I decided it would be nice to colorize source code samples. That capability is in a support module:

colorize.py

 #!/usr/bin/python import keyword, token, tokenize, sys from cStringIO import StringIO PLAIN = '%s' BOLD  = '<b>%s</b>' CBOLD = '<font color="%s"><b>%s</b></font>' _KEYWORD = token.NT_OFFSET+1 _TEXT    = token.NT_OFFSET+2 COLORS   = { token.NUMBER:     'black',              token.OP:         'darkblue',              token.STRING:     'green',              tokenize.COMMENT: 'darkred',              token.NAME:       None,              token.ERRORTOKEN: 'red',              _KEYWORD:         'blue',              _TEXT:            'black'  } class ParsePython:     "Colorize python source"     def __init__(self, raw):         self.inp  = StringIO(raw.expandtabs(4).strip())     def toHTML(self):         "Parse and send the colored source"         raw = self.inp.getvalue()         self.out = StringIO()         self.lines = [0,0]      # store line offsets in self.lines         self.lines += [i+1 for i in range(len(raw)) if raw[i]=='\n']         self.lines += [len(raw)]         self.pos = 0         try:             tokenize.tokenize(self.inp.readline, self)             return self.out.getvalue()         except tokenize.TokenError, ex:             msg,ln = ex [0],ex [1] [0]             sys.stderr.write("ERROR: %s %s\n" %                              (msg, raw[self.lines[ln]:]))             return raw     def __call__(self,toktype,toktext,(srow,scol),(erow,ecol),line):         "Token handler"         # calculate new positions         oldpos = self.pos         newpos = self.lines[srow] + scol         self.pos = newpos + len(toktext)         if toktype in [token.NEWLINE, tokenize.NL]:  # handle newlns             self.out.write('\n')             return         if newpos > oldpos:     # send the orig whitspce, if needed             self.out.write(self.inp.getvalue()[oldpos:newpos])         if toktype in [token.INDENT, token.DEDENT]:             self.pos = newpos   # skip indenting tokens             return         if token.LPAR <= toktype and toktype <= token.OP:             toktype = token.OP  # map token type to a color group         elif toktype == token.NAME and keyword.iskeyword(toktext):             toktype = _KEYWORD         color = COLORS.get(toktype, COLORS [_TEXT])         if toktext:             # send text             txt = Detag(toktext)             if color is None:    txt = PLAIN % txt             elif color=='black': txt = BOLD % txt             else:                txt = CBOLD % (color,txt)             self.out.write(txt) Detag = lambda s: \     s.replace('&','&amp;').replace('<','&lt;').replace('>','&gt;') if __name__=='__main__':     parsed = ParsePython(sys.stdin.read())     print '<pre>'     print parsed.toHTML()     print '</pre>'

The module colorize contains its own self-test code and is perfectly usable as a utility on its own. The main module consists of:

book2html.py

 #!/usr/bin/python """Convert ASCII book source files for HTML presentation" Usage: python book2html.py [title] < source.txt > target.html """ __author__=["David Mertz (mertz@gnosis.cx)",] __version__="November 2002" from __future__ import generators import sys, re, string, time from colorize import ParsePython from cgi import escape #-- Define some HTML boilerplate html_open =\ """<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> <html> <head> <title>%s</title> <style>   .code-sample {background-color:#EEEEEE; text-align:left;                 width:90%%; margin-left:auto; margin-right:auto;}   .module      {color : darkblue}   .libfunc     {color : darkgreen} </style> </head> <body> """ html_title = "Automatically Generated HTML" html_close = "</body></html>" code_block = \ """<table ><tr><td><h4>%s</h4></td></tr> <tr><td><pre>%s</pre></td></tr> </table>""" #-- End of boilerplate #-- State constants for s in ("BLANK CHAPTER SECTION SUBSECT SUBSUB MODLINE "           "MODNAME PYSHELL CODESAMP NUMLIST BODY QUOTE "           "SUBBODY TERM DEF RULE VERTSPC").split():     exec "%s = '%s'" % (s,s) markup = {CHAPTER:'h1', SECTION:'h2', SUBSECT:'h3', SUBSUB:'h4',           BODY:'p', QUOTE:'blockquote', NUMLIST:'blockquote',           DEF:'blockquote'} divs = {RULE:'hr', VERTSPC:'br'} class Regexen:     def __init__(self):         # blank line is empty, spaces/dashes only, or proc instruct         self.blank    = re.compile("^[ -]*$|^  THIS IS [A-Z]+$")         self.chapter  = re.compile("^(CHAPTER|APPENDIX|FRONTMATTER)")         self.section  = re.compile("^SECTION")         self.subsect  = re.compile("^  (TOPIC|PROBLEM|EXERCISE)")         self.subsub   = re.compile("^  [A-Z 0-9-]+:$") # chk befr body         self.modline  = re.compile("^  =+$")         self.pyshell  = re.compile("^ +>>>")         self.codesamp = re.compile("^ +#[*]?[-=]+ .+ [-=]+#")         self.numlist  = re.compile("^  \d+[.] ")       # chk befr body         self.body     = re.compile("^  \S")            # 2 spc indent         self.quote    = re.compile("^     ?\S")        # 4-5 spc indnt         self.subbody  = re.compile("^      +")         # 6+ spc indent         self.rule     = re.compile("^  (-\*-|!!!)$")         self.vertspc  = re.compile("^  \+\+\+$") def Make_Blocks(fpin=sys.stdin, r=Regexen()):     #-- Initialize the globals     global state, blocks, laststate     state, laststate = BLANK, BLANK     blocks = [[BLANK]]     #-- Break the file into relevant chunks     for line in fpin.xreadlines():         line = line.rstrip()            # Normalize line endings         #-- for "one-line states" just act (no accumulation)         if r.blank.match(line):             if inState(PYSHELL):        newState(laststate)             else:                       blocks[-1].append("")         elif r.rule.match(line):        newState(RULE)         elif r.vertspc.match(line):     newState(VERTSPC)         elif r.chapter.match(line):     newState(CHAPTER)         elif r.section.match(line):     newState(SECTION)         elif r.subsect.match(line):     newState(SUBSECT)         elif r.subsub.match(line):      newState(SUBSUB)         elif r.modline.match(line):     newState(MODLINE)         elif r.numlist.match(line):     newState(NUMLIST)         elif r.pyshell.match(line):             if not inState(PYSHELL):    newState(PYSHELL)         elif r.codesamp.match(line):    newState(CODESAMP)         #-- now the multi-line states that are self-defining         elif r.body.match(line):             if not inState(BODY):       newState(BODY)         elif r.quote.match(line):             if inState(MODLINE):        newState(MODNAME)             elif r.blank.match(line):   newState(BLANK)             elif not inState(QUOTE):    newState(QUOTE)         #-- now the "multi-line states" which eat further lines         elif inState(MODLINE, PYSHELL, CODESAMP, NUMLIST, DEF):             "stay in this state until we get a blank line"             "...or other one-line prior type, but shouldn't happen"         elif r.subbody.match(line):             "Sub-body is tricky: it might belong with several states:"             "PYSHELL, CODESAMP, NUMLIST, or as a def after BODY"             if inState(BODY):           newState(DEF)             elif inState(BLANK):                 if laststate==DEF:      pass             elif inState(DEF, CODESAMP, PYSHELL, NUMLIST, MODNAME):                 pass         else:             raise ValueError, \                   "unexpected input block state: %s\n%s" %(state,line)         if inState(MODLINE, RULE, VERTSPC): pass         elif r.blank.match(line): pass         else: blocks[-1].append(line)     return LookBack(blocks) def LookBack(blocks):     types = [f [0] for f in blocks]     for i in range(len(types)-1):         this, next = types[i:i+2]         if (this,next)==(BODY,DEF):             blocks[i][0] = TERM     return blocks def newState(name):     global state, laststate, blocks     if name not in (BLANK, MODLINE):         blocks.append([name])     laststate = state     state = name def instate(*names) :     return state in names def Process_Blocks(blocks, fpout=sys.stdout, title=html_title):     fpout.write(html_open % title)     for block in blocks:        # Massage each block as needed         typ, lines = block[0], block[1:]         tag = markup.get(typ, None)         div = divs.get(typ, None)         if tag is not None:             map(fpout.write, wrap_html(lines, tag))         elif div is not None:             fpout.write('<%s />\n' % div)         elif typ in (PYSHELL, CODESAMP):             fpout.write(fixcode('\n'.join(lines),style=typ))         elif typ in (MODNAME,):             mod = '<hr/><h3 >%s</h3>'%'\n'.join(lines)             fpout.write(mod)         elif typ in (TERM,):             terms = '<br />\n'.join(lines)             fpout.write('<h4 >%s</h4>\n' % terms)         else:             sys.stderr.write(typ+'\n')     fpout.write(html_close) #-- Functions for start of block-type state def wrap_html(lines, tag):     txt = '\n'.join(lines)     for para in txt.split('\n\n'):         if para: yield '<%s>%s</%s>\n' %\                         (tag,URLify(Typography(escape(para))),tag) def fixcode(block, style=CODESAMP):     block = LeftMargin(block)           # Move to left     # Pull out title if available     title = 'Code Sample'     if style==CODESAMP:         re_title = re.compile('^#\*?\-+ (.+) \-+#$', re.M)         if_title = re_title.match(block)         if if_title:             title = if_title.group(1)             block = re_title.sub(", block)  # take title out of code     # Decide if it is Python code     firstline = block[:block.find('\n')]     if re.search(r'\.py_?|[Pp]ython|>>>', title+firstline):         # Has .py, py_, Python/python, or >>> on first line/title         block = ParsePython(block.rstrip()).toHTML()         return code_block % (Typography(title), block)     # elif the-will-and-the-way-is-there-to-format-language-X: ...     else:         return code_block % (Typography(title), escape(block).strip()) def LeftMargin(txt):     "Remove as many leading spaces as possible from whole block"     for 1 in range(12,-1,-1):         re_lead = '(?sm)'+' '*1+'\S'         if re.match(re_lead, txt): break     txt = re.sub('(?sm)^'+' '*1, ", txt)     return txt def URLify(txt):     # Conv special IMG URL's: Alt Text: http://site.org/img.png}     # (don't actually try quite as hard to validate URL though)     txt = re.sub('(?sm){(.*?):\s*(http://.*)}',                  '<img src="/books/2/876/1/html/2/\\2" alt="\\1">', txt)     # Convert regular URL's     txt = re.sub('(?:[^="])((?:http|ftp|file)://(?:[^ \n\r<\)]+))(\s)',                  '<a href="\\1">\\1</a>\\2', txt)     return txt def Typography(txt):     rc = re.compile     # cut down line length     MS = re.M | re.S     # [module] names     r = rc(r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;"'?!/-])""", MS)     txt = r.sub('\\1<i >\\2</i>\\3',txt)     # *strongly emphasize* words     r = rc(r"""([\(\s'/"]|^)\*(.*?)\*( [\s\.\),:;'"?!/-])""", MS)     txt = r.sub('\\1<strong>\\2</strong>\\3', txt)     # -emphasize- words     r = rc(r"""([\(\s'/"]|^)-(.+?)-( [\s\.\),:;"'?!/])""", MS)     txt = r.sub('\\1<em>\\2</em>\\3', txt)     # _Book Title_ citations     r = rc(r"""([\(\s'/"]|^)_(.*?)_( [\s\.\),:;'"?!/-])""", MS)     txt = r.sub('\\1<cite>\\2</cite>\\3', txt)     # 'Function()' names     r = rc(r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])""", MS)     txt = r.sub("\\1<code>\\2</code>\\3", txt)     # 'library. func() ' names     r = rc(r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])""", MS)     txt = r.sub('\\1<i clas    s="libfunc">\\2</i>\\3', txt)     return txt if __name__ == '__main__':     blocks = Make_Blocks()     if len(sys.argv) > 1:     Process_Blocks(blocks, title=sys.argv[1]) else:     Process_Blocks(blocks)