Recipe 1.25. Converting HTML Documents to Text on a Unix Terminal

Credit: Brent Burley, Mark Moraes

Problem

You need to visualize HTML documents as text, with support for bold and underlined display on your Unix terminal.

Solution

The simplest approach is to code a filter script, taking HTML on standard input and emitting text and terminal control sequences on standard output. Since this recipe only targets Unix, we can get the needed terminal control sequences from the "Unix" command tput, via the function popen of the Python Standard Library module os:

#!/usr/bin/env python
import sys, os, htmllib, formatter

# use Unix tput to get the escape sequences for bold, underline, reset
set_bold = os.popen('tput bold').read()
set_underline = os.popen('tput smul').read()
perform_reset = os.popen('tput sgr0').read()

class TtyFormatter(formatter.AbstractFormatter):
    ''' a formatter that keeps track of bold and italic font states, and
        emits terminal control sequences accordingly.
    '''
    def __init__(self, writer):
        # first, as usual, initialize the superclass
        formatter.AbstractFormatter.__init__(self, writer)
        # start with neither bold nor italic, and no saved font state
        self.fontState = False, False
        self.fontStack = []

    def push_font(self, font):
        # the `font' tuple has four items, we only track the two flags
        # about whether italic and bold are active or not
        size, is_italic, is_bold, is_tt = font
        self.fontStack.append((is_italic, is_bold))
        self._updateFontState()

    def pop_font(self, *args):
        # go back to previous font state
        try:
            self.fontStack.pop()
        except IndexError:
            pass
        self._updateFontState()

    def _updateFontState(self):
        # emit appropriate terminal control sequences if the state of
        # bold and/or italic(==underline) has just changed
        try:
            newState = self.fontStack[-1]
        except IndexError:
            newState = False, False
        if self.fontState != newState:
            # relevant state change: reset terminal
            print perform_reset,
            # set underline and/or bold if needed
            if newState[0]:
                print set_underline,
            if newState[1]:
                print set_bold,
            # remember the two flags as our current font-state
            self.fontState = newState

# make writer, formatter and parser objects, connecting them as needed
myWriter = formatter.DumbWriter()
if sys.stdout.isatty():
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)

# feed all of standard input to the parser, then terminate operations
myParser.feed(sys.stdin.read())
myParser.close()
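The htmllib and formatter modules used in the Solution exist only in older Python versions (htmllib is gone from Python 3, and formatter was removed in Python 3.10). As a rough sketch of the same font-stack idea for Python 3, the following uses html.parser from the standard library. The class name TtyHTMLToText, the helper html_to_tty_text, and the choice of hard-coded ANSI sequences in place of the tput lookups are all assumptions made for illustration, not part of the original recipe. Unlike DumbWriter, this sketch does no line wrapping or whitespace normalization.

```python
from html.parser import HTMLParser

# Assumption: hard-coded ANSI sequences stand in for the recipe's tput lookups.
BOLD, UNDERLINE, RESET = "\033[1m", "\033[4m", "\033[0m"


class TtyHTMLToText(HTMLParser):
    """Hypothetical Python 3 analogue of TtyFormatter: keeps a stack of
    (is_italic, is_bold) states and emits control sequences on each change."""

    ITALIC_TAGS = {"i", "em"}
    BOLD_TAGS = {"b", "strong"}

    def __init__(self):
        super().__init__()
        self.font_stack = []
        self.font_state = (False, False)
        self.out = []          # collected output fragments

    def _current(self):
        # the active state is the top of the stack, or "plain" if empty
        return self.font_stack[-1] if self.font_stack else (False, False)

    def _update_font_state(self):
        new_state = self._current()
        if new_state != self.font_state:
            # state changed: reset, then re-enable whatever is active
            self.out.append(RESET)
            if new_state[0]:
                self.out.append(UNDERLINE)   # italic is shown as underline
            if new_state[1]:
                self.out.append(BOLD)
            self.font_state = new_state

    def handle_starttag(self, tag, attrs):
        if tag in self.ITALIC_TAGS or tag in self.BOLD_TAGS:
            italic, bold = self._current()
            self.font_stack.append((italic or tag in self.ITALIC_TAGS,
                                    bold or tag in self.BOLD_TAGS))
            self._update_font_state()

    def handle_endtag(self, tag):
        if tag in self.ITALIC_TAGS or tag in self.BOLD_TAGS:
            if self.font_stack:
                self.font_stack.pop()
            self._update_font_state()

    def handle_data(self, data):
        self.out.append(data)


def html_to_tty_text(html):
    """Convert an HTML string to text with embedded control sequences."""
    parser = TtyHTMLToText()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

To use it as a filter in the spirit of the recipe, you could write sys.stdout.write(html_to_tty_text(sys.stdin.read())).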

Discussion

The basic formatter.AbstractFormatter class, offered by the Python Standard Library, should work just about anywhere. On the other hand, the refinements in the TtyFormatter subclass that's the focus of this recipe depend on using a Unix-like terminal, and more specifically on the availability of the tput Unix command to obtain information on the escape sequences used to get bold or underlined output and to reset the terminal to its base state.
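Because everything hinges on tput being available, it can help to query it defensively. The following is a minimal sketch, assuming Python 3 and the standard subprocess and shutil modules; the function name tput_sequence and its default parameter are hypothetical. It returns an empty string (or whatever default you pass) when tput is missing, when the terminal type is unknown, or when the capability is not defined, so the rest of the script degrades to plain text.

```python
import shutil
import subprocess


def tput_sequence(capname, default=""):
    """Return the control sequence tput reports for `capname`,
    or `default` if tput is unavailable or exits with an error."""
    if shutil.which("tput") is None:
        return default                      # no tput on this system
    try:
        result = subprocess.run(
            ["tput", capname],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,      # keep tput's complaints off the terminal
            check=True,                     # nonzero exit raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError):
        return default
    return result.stdout.decode("ascii", "replace")


# the same three capabilities the recipe asks for
set_bold = tput_sequence("bold")
set_underline = tput_sequence("smul")
perform_reset = tput_sequence("sgr0")
```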

Many systems that do not have Unix certification, such as Linux and Mac OS X, do have a perfectly workable tput command and therefore can use this recipe's TtyFormatter subclass just fine. In other words, you can take the use of the word "Unix" in this recipe just as loosely as you can take it in just about every normal discussion: take it as meaning "*ix," if you will.

If your "terminal" emulator supports other escape sequences for controlling output appearance, you should be able to adapt this TtyFormatter class accordingly. For example, on Windows, a cmd.exe command window should, I'm told, support standard ANSI escape sequences, so you could choose to hard-code those sequences if Windows is the platform on which you want to run your version of this script.
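As an illustration of hard-coding, here is a small sketch that replaces the three tput lookups with the standard ANSI SGR sequences for bold, underline, and reset. Whether a given terminal or console honors them is something you need to check on your own platform. The helper emphasize is hypothetical and only demonstrates how the sequences bracket a piece of text.

```python
# Hard-coded ANSI SGR sequences, standing in for the tput lookups.
set_bold = "\033[1m"        # SGR 1: bold
set_underline = "\033[4m"   # SGR 4: underline
perform_reset = "\033[0m"   # SGR 0: all attributes off


def emphasize(text, bold=False, underline=False):
    """Wrap `text` in the requested attributes, resetting afterwards.
    Returns `text` unchanged when no attribute is requested."""
    prefix = ""
    if underline:
        prefix += set_underline
    if bold:
        prefix += set_bold
    if not prefix:
        return text
    return prefix + text + perform_reset
```

With these three names defined at module level, the TtyFormatter class itself needs no changes.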

In many cases, you may prefer to use other existing Unix commands, such as lynx -dump -, to get richer formatting than this recipe provides. However, this recipe comes in quite handy when you find yourself on a system that has a Python installation but lacks such other helpful commands as lynx.

See Also

Library Reference and Python in a Nutshell docs on the formatter and htmllib modules; man tput on a Unix or Unix-like system for more information about the tput command.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
