Introduction


Credit: Paul F. Dubois, Ph.D., Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory

This chapter was originally meant to cover mainly topics such as lexing, parsing, and code generation: the classic issues of programs that are about programs. It turns out, however, that Pythonistas did not post many recipes about such tasks, focusing more on highly Python-specific topics such as program introspection, dynamic importing, and generation of functions by closure. Many of those recipes, we decided, were more properly located in various other chapters: on shortcuts, debugging, object-oriented programming, algorithms, metaprogramming, and specific areas such as the handling of text, files, and persistence. Therefore, you will find those topics covered in other chapters. In this chapter, we included only those recipes that are still best described as programs about programs. Of these, probably the most important is the one about currying, the creation of new functions by predetermining some arguments of other functions.

This arrangement doesn't mean that the classic issues aren't important! Python has extensive facilities related to lexing and parsing, as well as a large number of user-contributed modules related to parsing standard languages, which reduces the need for doing your own programming. If Pythonistas are not using these tools, then, in this one area, they are doing more work than they need to. Lexing and parsing are among the most common of programming tasks, and as a result, both are the subject of much theory and much prior development. Therefore, in these areas more than most, you will often profit if you take the time to search for solutions before resorting to writing your own. This Introduction contains a general guide to solving some common problems in these categories to encourage reusing the wide base of excellent, solid code and theory in these fields.

Lexing

Lexing is the process of dividing an input stream into meaningful units, known as tokens, which are then processed. Lexing occurs in tasks such as data processing and in tools for inspecting and modifying text.

The regular expression facilities in Python are extensive and highly evolved, so your first consideration for a lexing task is often to determine whether it can be formulated using regular expressions. Also, see the next section about parsers for common languages and how to lex those languages.
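For instance, here is a minimal sketch (the token pattern and sample string are purely illustrative, not from this chapter's recipes) of simple lexing with the standard re module, listing each kind of token in one alternation:

import re

# one alternation listing each kind of token: floats, integers, names, operators
token_pattern = r'\d+\.\d+|\d+|[A-Za-z_]\w*|[-+*/=()]'
print re.findall(token_pattern, "rate = base + 0.25 * hours")
# prints: ['rate', '=', 'base', '+', '0.25', '*', 'hours']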

The Python Standard Library tokenize module splits an input stream into Python-language tokens. Since Python's tokenization rules are similar to those of many other languages, this module may often be suitable for other tasks, perhaps with a modest amount of pre- and/or post-processing around tokenize's own operations. For more complex tokenization tasks, Plex, http://nz.cosc.canterbury.ac.nz/~greg/python/Plex/, can ease your efforts considerably.
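For example, here is a small sketch (the source string is made up for illustration) that feeds one line of Python-like source to tokenize and prints each token's type and text:

import tokenize, token
from StringIO import StringIO

source = "total = price * (1 + tax_rate)\n"
for tok_type, tok_string, start, end, line in tokenize.generate_tokens(
        StringIO(source).readline):
    print token.tok_name[tok_type], repr(tok_string)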

At the other end of the lexing complexity spectrum, the built-in string method split can also be used for many simple cases. For example, consider a file consisting of colon-separated text fields, with one record per line. You can read a line from the file as follows:

fields = line.split(':')

This produces a list of the fields. At this point, if you want to eliminate spurious whitespace at the beginning and end of each field, you can remove it:

fields = [f.strip() for f in fields]

For example:

>>> x = "abc :def:ghi    : klm\n"
>>> fields = x.split(':')
>>> print fields
['abc ', 'def', 'ghi    ', ' klm\n']
>>> print [f.strip() for f in fields]
['abc', 'def', 'ghi', 'klm']

Do not elaborate on this example: resist the temptation to enrich such simple code into performing lexing and parsing tasks that are in fact quite hard to handle with generality, solidity, and good performance, and for which much excellent, reusable code already exists. For parsing typical comma-separated values files, or files using other delimiters, study the standard Python library module csv. The ScientificPython package, http://starship.python.net/~hinsen/ScientificPython/, includes a module for reading and writing with Fortran-like formats, and other such precious I/O modules, in the Scientific.IO sub-package.
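For instance, a few lines with csv handle the colon-separated records of the previous example (the filename records.txt is invented for illustration):

import csv

# csv accepts any single-character delimiter, not just commas
for fields in csv.reader(open('records.txt', 'rb'), delimiter=':'):
    print [f.strip() for f in fields]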

A common "gotcha" for beginners is that, while lexing and other text-parsing techniques can be used to read numerical data from a file, at the end of this stage, the entries are text strings, not numbers. The int and float built-in functions are frequently needed here, to turn each field from a string into a number:

>>> x = "1.2, 2.3, 4, 5.6"
>>> print [float(y.strip()) for y in x.split(',')]
[1.2, 2.2999999999999998, 4.0, 5.5999999999999996]

Parsing

Parsing refers to discovering semantic meaning from a series of tokens according to the rules of a grammar. Parsing tasks are quite ubiquitous. Programming tools may attempt to discover information about program texts or to modify such texts to fit a task. (Python's introspection capabilities come into play here, as we will discuss later.) "Little languages" is the generic name given to application-specific languages that serve as human-readable forms of computer input. Such languages can vary from simple lists of commands and arguments to full-blown languages.

The grammar in the previous lexing example was implicit: the data you need is organized as one line per record with the fields separated by a special character. The "parser" in that case was supplied by the programmer reading the lines from the file and applying the simple split method to obtain the information. This sort of input file can easily grow, leading to requests for a more elaborate form. For example, users may wish to use comments, blank lines, conditional statements, or alternate forms. While most such parsing can be handled with simple logic, at some point, it becomes so complicated that it is much more reliable to use a real grammar.

There is no hard-and-fast way to decide which part of the job is a lexing task and which belongs to the grammar. For example, comments can often be discarded in the lexing, but doing so is not wise in a program-transformation tool that must produce output containing the original comments.

Your strategy for parsing tasks can include:

  • Using a parser for that language from the Python Standard Library.

  • Using a parser from the user community. You can often find one by visiting the Vaults of Parnassus site, http://www.vex.net/parnassus/, or by searching the Python site, http://www.python.org.

  • Generating a parser using a parser generator.

  • Using Python itself as your input language.

A combination of approaches is often fruitful. For example, a simple parser can turn input into Python-language statements, which Python then executes in concert with a supporting package that you supply.

A number of parsers for specific languages exist in the standard library, and more are out there on the Web, supplied by the user community. In particular, the standard library includes parsing packages for XML, HTML, SGML, command-line arguments, configuration files, and for Python itself. For the now-ubiquitous task of parsing XML specifically, this cookbook includes a chapter, Chapter 14, specifically dedicated to XML.
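As a small taste of that reuse, here is a sketch (the file settings.ini and its section and option names are invented) of reading an INI-style configuration file with the standard library's ConfigParser module rather than parsing it by hand:

import ConfigParser

config = ConfigParser.ConfigParser()
config.read('settings.ini')
# fetch options from the [server] section, with type conversion
print config.get('server', 'host')
print config.getint('server', 'port')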

You do not have to parse C to connect C routines to Python. Use SWIG (http://www.swig.org). Likewise, you do not need a Fortran parser to connect Fortran and Python. See the Numerical Python web page at http://www.pfdubois.com/numpy/ for further information. Again, this cookbook includes a chapter, Chapter 17, which is dedicated to these kinds of tasks.

PLY, SPARK, and Other Python Parser Generators

PLY and SPARK are two rich, solid, and mature Python-based parser generators. That is, they take as their input some statements that describe the grammar to be parsed and generate the parser for you. To make a useful tool, you must add the semantic actions to be taken when a certain construct in the grammar is recognized.

PLY (http://systems.cs.uchicago.edu/ply) is a Python implementation of the popular Unix tool yacc. SPARK (http://pages.cpsc.ucalgary.ca/~aycock/spark/content.html) parses a more general set of grammars than yacc. Both tools use Python introspection, including the idea of placing grammar rules in functions' docstrings.
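To give the flavor of that docstring-based style, here is a minimal sketch (the token and rule names are my own, and PLY is assumed to be installed as the ply package) of a PLY grammar that sums integers:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS')
t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    # the docstring above is the token's regular expression
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

def p_expr_plus(p):
    'expr : expr PLUS NUMBER'
    # the docstring above is the grammar rule; the body is the semantic action
    p[0] = p[1] + p[3]

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    print "syntax error"

lexer = lex.lex()
parser = yacc.yacc()
print parser.parse('1+2+3')       # prints 6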

Parser generators are one of the many application areas that may have even too many excellent tools, so that you may end up frustrated by having to pick just one. Besides SPARK and PLY, other Python tools in this field include TPG (Toy Parser Generator), DParser, PyParsing, kwParsing (or kyParsing), PyLR, Yapps, PyGgy, mx.TextTools and its SimpleParse frontend: too many to provide more than a bare mention of each, so, happy googling!

The chief problem in using any of these tools is that you need to educate yourself about grammars and learn to write them. A novice without any computer science background will encounter some difficulty except with very simple grammars. A lot of literature is available to teach you how to use yacc, and most of this knowledge will help you use SPARK and most of the others just as well.

If you are interested in this area, the penultimate reference is Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers (Addison-Wesley), affectionately known as "the Dragon Book" to generations of computer science majors.[1]

[1] I'd even call this book the ultimate reference, were it not for the fact that Donald Knuth continues to promise that the fifth volume (current ETA, the year 2010) of his epoch-making The Art of Computer Programming will be about this very subject.

Using Python Itself as a Little Language

Python itself can be used to create many application-specific languages. By writing suitable classes, you can rapidly create a program that is easy to get running, yet is extensible later. Suppose I want a language to describe graphs. Nodes have names, and edges connect the nodes. I want a way to input such graphs, so that after reading the input I will have the data structures in Python that I need for any further processing. So, for example:

nodes = {}

def getnode(name):
    " Return the node with the given name, creating it if necessary. "
    if name in nodes:
        n = nodes[name]
    else:
        n = nodes[name] = node(name)
    return n

class node(object):
    " A node has a name and a list of edges emanating from it. "
    def __init__(self, name):
        self.name = name
        self.edgelist = []

class edge(object):
    " An edge connects two nodes. "
    def __init__(self, name1, name2):
        self.nodes = getnode(name1), getnode(name2)
        for n in self.nodes:
            n.edgelist.append(self)
    def __repr__(self):
        return self.nodes[0].name + self.nodes[1].name

Using just these simple statements, I can now parse a list of edges that describes a graph, and afterwards I will have data structures containing all my information. Here, I enter a graph with four edges and print the list of edges emanating from node 'A':

>>> edge('A', 'B')
>>> edge('B', 'C')
>>> edge('C', 'D')
>>> edge('C', 'A')
>>> print getnode('A').edgelist
[AB, CA]

Suppose that I now want a weighted graph. I could easily add a weight=1.0 default argument to the edge constructor, and the old input would still work. Also, I could easily add error-checking logic to ensure that edge lists have no duplicates. Furthermore, I already have my node class and can start adding logic to it for any needed processing purposes, be it directly or by subclassing. I can easily turn the entries in the dictionary nodes into similarly named variables that are bound to the node objects. After adding a few more classes corresponding to other input I need, I am well on my way.
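For instance, the weighted variant might look like the following sketch (the weight attribute is my own addition; the original class doesn't define it):

class edge(object):
    " An edge connects two nodes and carries an optional weight. "
    def __init__(self, name1, name2, weight=1.0):
        self.nodes = getnode(name1), getnode(name2)
        self.weight = weight
        for n in self.nodes:
            n.edgelist.append(self)
    def __repr__(self):
        return self.nodes[0].name + self.nodes[1].name

edge('A', 'B')                # old two-argument calls still work
edge('A', 'C', weight=2.5)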

The advantage to this approach is clear. For example, the following is already handled correctly:

edge('A', 'B')
if 'X' in nodes:
    edge('X', 'A')
def triangle(n1, n2, n3):
    edge(n1, n2)
    edge(n2, n3)
    edge(n3, n1)
triangle('A', 'W', 'K')
execfile('mygraph.txt')     # Read graph from a datafile

So I already have syntactic sugar, user-defined language extensions, and input from other files. The definitions usually go into a module, and users simply import them. Had I written my own language, instead of reusing Python in this little language role, such accomplishments might be months away.

Introspection

Python programs have the ability to examine themselves; this set of facilities comes under the general title of introspection. For example, a Python function object knows a lot about itself, including the names of its arguments, and the docstring that was given when it was defined:

>>> def f(a, b):
...     " Return the difference of a and b "
...     return a-b
...
>>> dir(f)
['__call__', '__class__', '__delattr__', '__dict__', '__doc__', '__get__',
'__getattribute__', '__hash__', '__init__', '__module__', '__name__',
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
'__str__', 'func_closure', 'func_code', 'func_defaults', 'func_dict',
'func_doc', 'func_globals', 'func_name']
>>> f.func_name
'f'
>>> f.func_doc
'Return the difference of a and b'
>>> f.func_code
<code object f at 0175DDF0, file "<pyshell#18>", line 1>
>>> dir(f.func_code)
['__class__', '__cmp__', '__delattr__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__str__', 'co_argcount', 'co_cellvars',
'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags',
'co_freevars', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals',
'co_stacksize', 'co_varnames']
>>> f.func_code.co_varnames
('a', 'b')

SPARK and PLY make an interesting use of introspection. The grammar is entered as docstrings in the routines that take the semantic actions when those grammar constructs are recognized. (Hey, don't turn your head all the way around like that! Introspection has its limits.)

Introspection is very popular in the Python community, and you will find many examples of it in recipes in this book, both in this chapter and elsewhere. Even in this field, though, always remember the possibility of reuse! Standard library module inspect has a lot of solid, reusable inspection-related code. It's all pure Python code, and you can (and should) study the inspect.py source file in your Python library to see what "raw" facilities underlie inspect's elegant high-level functions. Indeed, this suggestion generalizes: studying the standard library's sources is among the best things you can do to increment your Python knowledge and skill. But reusing the standard library's wealth of modules and packages is still best: any code you don't write is code you don't have to maintain, and solid, heavily tested code such as the code that you find in the standard library is very likely to have far fewer bugs than any newly developed code you might write yourself.
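As a taste of those high-level functions, the following sketch applies a few of them to the function f defined earlier (the tuple layout of getargspec's result is Python 2 behavior):

import inspect

def f(a, b):
    " Return the difference of a and b "
    return a-b

print inspect.getargspec(f)   # (['a', 'b'], None, None, None)
print inspect.getdoc(f)       # the cleaned-up docstring
print inspect.getmodule(f)    # the module object in which f was defined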

Python is the most powerful language that you can still read. The kinds of tasks discussed in this chapter help to show just how versatile and powerful it really is.


