1.1 Techniques and Patterns

1.1.1 Utilizing Higher-Order Functions in Text Processing

This first topic merits a warning. It jumps feet-first into higher-order functions (HOFs) at a fairly sophisticated level, and may be unfamiliar even to experienced Python programmers. Do not be too frightened by this first topic; you can understand the rest of the book without it. If the functional programming (FP) concepts in this topic seem unfamiliar to you, I recommend you jump ahead to Appendix A, especially its final section on FP concepts.

In text processing, one frequently acts upon a series of chunks of text that are, in a sense, homogeneous. Most often, these chunks are lines, delimited by newline characters, but sometimes other sorts of fields and blocks are relevant. Moreover, Python has standard functions and syntax for reading in lines from a file (sensitive to platform differences). Obviously, these chunks are not entirely homogeneous; they can contain varying data. But at the level we worry about during processing, each chunk contains a natural parcel of instruction or information.
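For instance, the same source text can be chunked as lines or as blank-line-delimited blocks; a minimal sketch (the sample text is invented):

```python
# Invented sample text containing both line and block structure
text = "Title line\n\nFirst para line one.\nFirst para line two.\n\nSecond para.\n"

lines = text.splitlines()                        # chunks are lines
blocks = list(filter(None, text.split("\n\n")))  # chunks are blocks

assert len(lines) == 6   # blank lines count as (empty) line chunks
assert len(blocks) == 3  # blocks group the lines between blank lines
```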

As an example, consider an imperative-style code fragment that selects only those lines of text that match a criterion isCond():

selected = []                 # temp list to hold matches
fp = open(filename)
for line in fp.readlines():   # Py2.2 -> "for line in fp:"
    if isCond(line):          # (2.2 version reads lazily)
        selected.append(line)
del line                      # Cleanup transient variable

There is nothing wrong with these few lines (see xreadlines on efficiency issues). But it does take a few seconds to read through them. In my opinion, even this small block of lines does not parse as a single thought, even though its operation really is such. Also, the variable line is slightly superfluous (it retains a value as a side effect after the loop, and could conceivably step on a previously defined value). In FP style, we could write the simpler:

selected = filter(isCond, open(filename).readlines())
# Py2.2 -> filter(isCond, open(filename))

In the concrete, a textual source that one frequently wants to process as a list of lines is a log file. All sorts of applications produce log files, most typically either ones that cause system changes that might need to be examined or long-running applications that perform actions intermittently. For example, the PythonLabs Windows installer for Python 2.2 produces a file called INSTALL.LOG that contains a list of actions taken during the install. Below is a highly abridged copy of this file from one of my computers:

INSTALL.LOG sample data file
Title: Python 2.2
Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
Made Dir: D:\Python22
File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | | ...
RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
RegDB Val: Python 2.2
File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | | ...
Made Dir: D:\PYTHON22\DLLs
File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 295000 | 770c8856
RegDB Root: 2
RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
RegDB Val: D:\PYTHON22\Python.exe
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 |  | 0 | 1 | 0 |
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22\...

You can see that each action recorded belongs to one of several types. A processing application would presumably handle each type of action differently (especially since each action has different data fields associated with it). It is easy enough to write Boolean functions that identify line types, for example:

def isFileCopy(line):
    return line[:10]=='File Copy:' # or line.startswith(...)
def isFileOverwrite(line):
    return line[:15]=='File Overwrite:'

For recent Python versions, the string method "".startswith() is less error prone than an initial slice, but these examples are compatible with Python 1.5. In a slightly more compact functional programming style, you can also write these as:

isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
isRegDBKey = lambda line: line[:10]=='RegDB Key:'
isRegDBVal = lambda line: line[:10]=='RegDB Val:'

Selecting lines of a certain type is done exactly as above:

lines = open(r'd:\python22\install.log').readlines()
regroot_lines = filter(isRegDBRoot, lines)

But if you want to select upon multiple criteria, an FP style can initially become cumbersome. For example, suppose you are interested in all the "RegDB" lines; you could write a new custom function for this filter:

def isAnyRegDB(line):
    if   line[:11]=='RegDB Root:': return 1
    elif line[:10]=='RegDB Key:':  return 1
    elif line[:10]=='RegDB Val:':  return 1
    else:                          return 0
# For recent Pythons, line.startswith(...) is better

Programming a custom function for each combined condition can produce a glut of named functions. More importantly, each such custom function requires a modicum of work to write and has a nonzero chance of introducing a bug. For conditions that should be jointly satisfied, you can either write custom functions or nest several filters within each other. For example:

shortline = lambda line: len(line) < 25
short_regvals = filter(shortline, filter(isRegDBVal, lines))

In this example, we rely on previously defined functions for the filter. Any error in the filters will be in either shortline() or isRegDBVal(), but not independently in some third function isShortRegVal(). Such nested filters, however, are difficult to read, especially if more than two are involved.

Calls to map() are sometimes similarly nested if several operations are to be performed on the same string. For a fairly trivial example, suppose you wished to reverse, capitalize, and normalize whitespace in lines of text. Creating the support functions is straightforward, and they could be nested in map() calls:

from string import upper, join, split
def flip(s):
    a = list(s)
    a.reverse()
    return join(a,'')
normalize = lambda s: join(split(s),' ')
cap_flip_norms = map(upper, map(flip, map(normalize, lines)))

This type of map() or filter() nest is difficult to read, and should be avoided. Moreover, one can sometimes be drawn into nesting alternating map() and filter() calls, making matters still worse. For example, suppose you want to perform several operations on each of the lines that meet several criteria. To avoid this trap, many programmers fall back to a more verbose imperative coding style that simply wraps the lists in a few loops and creates some temporary variables for intermediate results.
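That imperative fallback, applied to a single criterion and a single transformation, might look like the sketch below (the predicate, transform, and data are all invented; the nested functional spelling is shown only for comparison):

```python
def isShort(line):        # invented criterion
    return len(line) < 25

def emphasize(line):      # invented transformation
    return line.upper()

lines = ["short one", "a very much longer line of text here"]

# Imperative style: explicit loop, temporary list, transient variable
results = []
for line in lines:
    if isShort(line):
        results.append(emphasize(line))

# The nested functional spelling the text warns about
nested = list(map(emphasize, filter(isShort, lines)))

assert results == nested == ["SHORT ONE"]
```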

Within a functional programming style, it is nonetheless possible to avoid the pitfall of excessive call nesting. The key to doing this is an intelligent selection of a few combinatorial higher-order functions. In general, a higher-order function is one that takes a function object as an argument, or returns one as a result. First-order functions just take some data as arguments and produce a datum as an answer (perhaps a data structure like a list or dictionary). In contrast, the "inputs" and "outputs" of a HOF are themselves function objects, ones generally intended to be called somewhere later in the program flow.

One example of a higher-order function is a function factory: a function (or class) that returns a function, or collection of functions, that are somehow "configured" at the time of their creation. The "Hello World" of function factories is an "adder" factory. Like "Hello World," an adder factory exists just to show what can be done; it doesn't really do anything useful by itself. Pretty much every explanation of function factories uses an example such as:

>>> def adder_factory(n):
...    return lambda m, n=n: m+n
...
>>> add10 = adder_factory(10)
>>> add10
<function <lambda> at 0x00FB0020>
>>> add10(4)
14
>>> add10(20)
30
>>> add5 = adder_factory(5)
>>> add5(4)
9

For text processing tasks, simple function factories are of less interest than are combinatorial HOFs. The idea of a combinatorial higher-order function is to take several (usually first-order) functions as arguments and return a new function that somehow synthesizes the operations of the argument functions. Below is a simple library of combinatorial higher-order functions that achieve surprisingly much in a small number of lines:

combinatorial.py
from operator import mul, add, truth
apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
bools = lambda lst: map(truth, lst)
bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
both = lambda f,g: all((f,g))
all3 = lambda f,g,h: all((f,g,h))
and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))
either = lambda f,g: some((f,g))
anyof3 = lambda f,g,h: some((f,g,h))
compose = lambda f,g: lambda x, f=f, g=g: f(g(x))
compose3 = lambda f,g,h: lambda x, f=f, g=g, h=h: f(g(h(x)))
ident = lambda x: x

Even though the library runs to just over a dozen lines, many of these combinatorial functions are merely convenience functions that wrap other, more general ones. Let us take a look at how we can use these HOFs to simplify some of the earlier examples. The same names are used for results, so look above for comparisons:

Some examples using higher-order functions
# Don't nest filters, just produce func that does both
short_regvals = filter(both(shortline, isRegDBVal), lines)
# Don't multiply ad hoc functions, just describe need
regroot_lines = \
    filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)
# Don't nest transformations, make one combined transform
capFlipNorm = compose3(upper, flip, normalize)
cap_flip_norms = map(capFlipNorm, lines)

In the example, we bind the composed function capFlipNorm for readability. The corresponding map() line expresses just the single thought of applying a common operation to all the lines. But the binding also illustrates some of the flexibility of combinatorial functions. By condensing the several operations previously nested in several map() calls, we can save the combined operation for reuse elsewhere in the program.

As a rule of thumb, I recommend not using more than one filter() and one map() in any given line of code. If these "list application" functions need to nest more deeply than this, readability is preserved by saving results to intermediate names. Successive lines of such functional programming style calls themselves revert to a more imperative style, but a wonderful thing about Python is the degree to which it allows seamless combinations of different programming styles. For example:

intermed = filter(niceProperty, map(someTransform, lines))
final = map(otherTransform, intermed)

Any nesting of successive filter() or map() calls, however, can be reduced to single functions using the proper combinatorial HOFs. Therefore, the number of procedural steps needed is pretty much always quite small. However, the reduction in total lines-of-code is offset by the lines used for giving names to combinatorial functions. Overall, FP style code is usually about one-half the length of imperative style equivalents (and fewer lines generally mean correspondingly fewer bugs).

A nice feature of combinatorial functions is that they can provide a complete Boolean algebra for functions that have not been called yet (the use of operator.add and operator.mul in combinatorial.py is more than accidental, in that sense). For example, with a collection of simple values, you might express a (complex) relation of multiple truth values as:

 satisfied = (this or that) and (foo or bar) 

In the case of text processing on chunks of text, these truth values are often the results of predicative functions applied to a chunk:

 satisfied = (thisP(s) or thatP(s)) and (fooP(s) or barP(s)) 

In an expression like the above one, several predicative functions are applied to the same string (or other object), and a set of logical relations on the results are evaluated. But this expression is itself a logical predicate of the string. For naming clarity, and especially if you wish to evaluate the same predicate more than once, it is convenient to create an actual function expressing the predicate:

 satisfiedP = both(either(thisP,thatP), either(fooP,barP)) 

Using a predicative function created with combinatorial techniques is the same as using any other function:

 selected = filter(satisfiedP, lines) 

1.1.2 Exercise: More on combinatorial functions

The module combinatorial.py presented above provides some of the most commonly useful combinatorial higher-order functions. But there is room for enhancement in the brief example. Creating a personal or organization library of useful HOFs is a way to improve the reusability of your current text processing libraries.

QUESTIONS

1:

Some of the functions defined in combinatorial.py are not, strictly speaking, combinatorial. In a precise sense, a combinatorial function should take one or several functions as arguments and return one or more function objects that "combine" the input arguments. Identify which functions are not "strictly" combinatorial, and determine exactly what type of thing each one does return.

2:

The functions both() and and_() do almost the same thing. But they differ in an important, albeit subtle, way. and_(), like the Python operator and, uses shortcutting in its evaluation. Consider these lines:

>>> f = lambda n: n**2 > 10
>>> g = lambda n: 100/n > 10
>>> and_(f,g)(5)
1
>>> both(f,g)(5)
1
>>> and_(f,g)(0)
0
>>> both(f,g)(0)
Traceback (most recent call last):
...

The shortcutting and_() can potentially allow the first function to act as a "guard" for the second one. The second function never gets called if the first function returns a false value on a given argument.

  1. Create a similarly shortcutting combinatorial or_() function for your library.

  2. Create general shortcutting functions shortcut_all() and shortcut_some() that behave similarly to the functions all() and some(), respectively.

  3. Describe some situations where nonshortcutting combinatorial functions like both(), all(), or anyof3() are more desirable than similar shortcutting functions.

3:

The function ident() would appear to be pointless, since it simply returns whatever value is passed to it. In truth, ident() is an almost indispensable function for a combinatorial collection. Explain the significance of ident().

Hint: Suppose you have a list of lines of text, where some of the lines may be empty strings. What filter can you apply to find all the lines that start with a #?

4:

The function not_() might make a nice addition to a combinatorial library. We could define this function as:

 >>> not_ = lambda f: lambda x, f=f: not f(x) 

Explore some situations where a not_() function would aid combinatoric programming.

5:

The function apply_each() is used in combinatorial.py to build some other functions. But the utility of apply_each() is more general than its supporting role might suggest. A trivial usage of apply_each() might look something like:

>>> apply_each(map(adder_factory, range(5)),(10,))
[10, 11, 12, 13, 14]

Explore some situations where apply_each() simplifies applying multiple operations to a chunk of text.

6:

Unlike the functions all() and some(), the functions compose() and compose3() take a fixed number of input functions as arguments. Create a generalized composition function that takes a list of input functions, of any length, as an argument.

7:

What other combinatorial higher-order functions that have not been discussed here are likely to prove useful in text processing? Consider other ways of combining first-order functions into useful operations, and add these to your library. What are good names for these enhanced HOFs?

1.1.3 Specializing Python Datatypes

Python comes with an excellent collection of standard datatypes; Appendix A discusses each built-in type. At the same time, an important principle of Python programming makes types less important than programmers coming from other languages tend to expect. According to Python's "principle of pervasive polymorphism" (my own coinage), it is more important what an object does than what it is. Another common way of putting the principle is: if it walks like a duck and quacks like a duck, treat it like a duck.

Broadly, the idea behind polymorphism is letting the same function or operator work on things of different types. In C++ or Java, for example, you might use signature-based method overloading to let an operation apply to several types of things (acting differently as needed). For example:

C++ signature-based polymorphism
#include <stdio.h>
class Print {
public:
  void print(int i)    { printf("int %d\n", i); }
  void print(double d) { printf("double %f\n", d); }
  void print(float f)  { printf("float %f\n", f); }
};
int main() {
  Print *p = new Print();
  p->print(37);      /* --> "int 37" */
  p->print(37.0);    /* --> "double 37.000000" */
}

The most direct Python translation of signature-based overloading is a function that performs type checks on its argument(s). It is simple to write such functions:

Python "signature-based" polymorphism
def Print(x):
    from types import *
    if type(x) is FloatType:  print "float", x
    elif type(x) is IntType:  print "int", x
    elif type(x) is LongType: print "long", x

Writing signature-based functions, however, is extremely un-Pythonic. If you find yourself performing these sorts of explicit type checks, you have probably not understood the problem you want to solve correctly! What you should (usually) be interested in is not what type x is, but rather whether x can perform the action you need it to perform (regardless of what type of thing it is strictly).

PYTHONIC POLYMORPHISM

Probably the single most common case where pervasive polymorphism is useful is in identifying "file-like" objects. There are many objects that can do things that files can do, such as those created with urllib, cStringIO, zipfile, and by other means. Various objects can perform only subsets of what actual files can: some can read, others can write, still others can seek, and so on. But for many purposes, you have no need to exercise every "file-like" capability; it is good enough to make sure that a specified object has those capabilities you actually need.

Here is a typical example. I have a module that uses DOM to work with XML documents; I would like users to be able to specify an XML source in any of several ways: using the name of an XML file, passing a file-like object that contains XML, or indicating an already-built DOM object to work with (built with any of several XML libraries). Moreover, future users of my module may get their XML from novel places I have not even thought of (an RDBMS, over sockets, etc.). By looking at what a candidate object can do, I can just utilize whichever capabilities that object has:

Python capability-based polymorphism
def toDOM(xml_src=None):
    from xml.dom import minidom
    from types import StringType, UnicodeType
    if hasattr(xml_src, 'documentElement'):
        return xml_src    # it is already a DOM object
    elif hasattr(xml_src, 'read'):
        # it is something that knows how to read data
        return minidom.parseString(xml_src.read())
    elif type(xml_src) in (StringType, UnicodeType):
        # it is a filename of an XML document
        xml = open(xml_src).read()
        return minidom.parseString(xml)
    else:
        raise ValueError, "Must be initialized with " +\
              "filename, file-like object, or DOM object"

Even simple-seeming numeric types have varying capabilities. As with other objects, you should not usually care about the internal representation of an object, but rather about what it can do. Of course, as one way to assure that an object has a capability, it is often appropriate to coerce it to a type using the built-in functions complex(), dict(), float(), int(), list(), long(), str(), tuple(), and unicode(). All of these functions make a good effort to transform anything that looks a little bit like the type of thing they name into a true instance of it. It is usually not necessary, however, actually to transform values to prescribed types; again we can just check capabilities.

For example, suppose that you want to remove the "least significant" portion of any number, perhaps because it represents a measurement of limited accuracy. For whole numbers (ints or longs) you might mask out some low-order bits; for fractional values you might round to a given precision. Rather than testing value types explicitly, you can look for numeric capabilities. One common way to test a capability in Python is to try to do something, and catch any exceptions that occur (then try something else). Below is a simple example:

Checking what numbers can do
def approx(x):                # int attributes require 2.2+
    if hasattr(x, '__and__'): # supports bitwise-and
        return x & ~0x0FL
    try:                      # supports real/imag
        return (round(x.real,2)+round(x.imag,2)*1j)
    except AttributeError:
        return round(x,2)
ENHANCED OBJECTS

The reason that the principle of pervasive polymorphism matters is that Python makes it easy to create new objects that behave mostly, but not exactly, like basic datatypes. File-like objects were already mentioned as examples; you may or may not think of a file object as a datatype precisely. But even basic datatypes like numbers, strings, lists, and dictionaries can be easily specialized and/or emulated.

There are two details to pay attention to when emulating basic datatypes. The most important matter to understand is that the capabilities of an object, even those utilized with syntactic constructs, are generally implemented by its "magic" methods, each named with leading and trailing double underscores. Any object that has the right magic methods can act like a basic datatype in those contexts that use the supplied methods. At heart, a basic datatype is just an object with some well-optimized versions of the right collection of magic methods.
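To sketch the point, a minimal invented class with just .__len__() and .__getitem__() already behaves like a sequence in many contexts:

```python
class Evens:
    "Acts sequence-like solely by defining the right magic methods"
    def __init__(self, n):
        self.n = n                  # how many even numbers we 'hold'
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return 2 * i

evens = Evens(5)
assert len(evens) == 5                  # len() uses .__len__()
assert evens[3] == 6                    # indexing uses .__getitem__()
assert list(evens) == [0, 2, 4, 6, 8]   # iteration falls back on .__getitem__()
```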

The second detail concerns exactly how you get at the magic methods, or rather, how best to make use of existing implementations. There is nothing stopping you from writing your own version of any basic datatype, except for the piddling details of doing so. However, there are quite a few such details, and the easiest way to get the functionality you want is to specialize an existing class. Under all non-ancient versions of Python, the standard library provides the pure-Python modules UserDict, UserList, and UserString as starting points for custom datatypes. You can inherit from an appropriate parent class and specialize (magic) methods as needed. No sample parents are provided for tuples, ints, floats, and the rest, however.
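A sketch of such a specialization using UserList (the guarded import is an accommodation to later Pythons, which moved the class into the collections module; the stack-style methods are invented):

```python
try:
    from UserList import UserList        # classic location (Python 1.5/2.x)
except ImportError:
    from collections import UserList     # Python 3 location

class Stack(UserList):
    "A list specialized with (invented) stack-style convenience methods"
    def push(self, item):
        self.data.append(item)           # .data holds the underlying list
    def pop_top(self):
        return self.data.pop()

s = Stack([1, 2])
s.push(3)
assert s == [1, 2, 3]        # still compares like a list
assert s.pop_top() == 3
```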

Under Python 2.2 and above, a better option is available. "New-style" Python classes let you inherit from the underlying C implementations of all the Python basic datatypes. Moreover, these parent classes have become the self-same callable objects that are used to coerce types and construct objects: int(), list(), unicode(), and so on. There are a lot of arcana and subtle profundities that accompany new-style classes, but you generally do not need to worry about these. All you need to know is that a class that inherits from str is faster than one that inherits from UserString; likewise for list versus UserList and dict versus UserDict (assuming your scripts all run on a recent enough version of Python).
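For example, a new-style class inheriting from the built-in str keeps all string behavior while adding behavior of its own (the class and its method are invented for illustration):

```python
class Tagged(str):
    "A str subclass (invented) adding one convenience method"
    def tagged(self, tag):
        # Wrap the string's own text in an XML-ish tag
        return "<%s>%s</%s>" % (tag, self, tag)

s = Tagged("spam")
assert s.upper() == "SPAM"             # all str behavior is inherited
assert s.tagged("b") == "<b>spam</b>"  # plus the new method
```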

Custom datatypes, however, need not specialize full-fledged implementations. You are free to create classes that implement "just enough" of the interface of a basic datatype to be used for a given purpose. Of course, in practice, the reason you would create such custom datatypes is either because you want them to contain non-magic methods of their own or because you want them to implement the magic methods associated with multiple basic datatypes. For example, below is a custom datatype that can be passed to the prior approx() function, and that also provides a (slightly) useful custom method:

>>> class I:  # "Fuzzy" integer datatype
...     def __init__(self, i):  self.i = i
...     def __and__(self, i):   return self.i & i
...     def err_range(self):
...         lbound = approx(self.i)
...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
...
>>> i1, i2 = I(29), I(20)
>>> approx(i1), approx(i2)
(16L, 16L)
>>> i2.err_range()
'Value: [16, 31)'

Despite supporting an extra method and being able to get passed into the approx() function, I is not a very versatile datatype. If you try to add, or divide, or multiply using "fuzzy integers," you will raise a TypeError. Since there is no module called UserInt, under an older Python version you would need to implement every needed magic method yourself.

Using new-style classes in Python 2.2+, you could derive a "fuzzy integer" from the underlying int datatype. A partial implementation could look like:

>>> class I2(int):    # New-style fuzzy integer
...     def __add__(self, j):
...         vals = map(int, [approx(self), approx(j)])
...         k = int.__add__(*vals)
...         return I2(int.__add__(k, 0x0F))
...     def err_range(self):
...         lbound = approx(self)
...         return "Value: [%d, %d)" %(lbound,lbound+0x0F)
...
>>> i1, i2 = I2(29), I2(20)
>>> print "i1 =", i1.err_range(),": i2 =", i2.err_range()
i1 = Value: [16, 31) : i2 = Value: [16, 31)
>>> i3 = i1 + i2
>>> print i3, type(i3)
47 <class '__main__.I2'>

Since the new-style class int already supports bitwise-and, there is no need to implement it again. With new-style classes, you refer to data values directly with self, rather than as an attribute that holds the data (e.g., self.i in class I). As well, it is generally unsafe to use syntactic operators within magic methods that define their operation; for example, I utilize the .__add__() method of the parent int rather than the + operator in the I2.__add__() method.

In practice, you are less likely to want to create number-like datatypes than you are to emulate container types. But it is worth understanding just how and why even plain integers are a fuzzy concept in Python (the fuzziness of the concepts is of a different sort than the fuzziness of I2 integers, though). Even a function that operates on whole numbers need not operate on objects of IntType or LongType just on an object that satisfies the desired protocols.

1.1.4 Base Classes for Datatypes

There are several magic methods that are often useful to define for any custom datatype. In fact, these methods are useful even for classes that do not really define datatypes (in some sense, every object is a datatype since it can contain attribute values, but not every object supports special syntax such as arithmetic operators and indexing). Not quite every magic method that you can define is documented in this book, but most are, each under the parent datatype to which it is most relevant. Moreover, each new version of Python has introduced a few additional magic methods; those covered either have been around for a few versions or are particularly important.

In documenting class methods of base classes, the same general conventions are used as for documenting module functions. The one special convention for these base class methods is the use of self as the first argument to all methods. Since the name self is purely arbitrary, this convention is less special than it might appear. For example, both of the following uses of self are equally legal:

>>> import string
>>> self = 'spam'
>>> object.__repr__(self)
'<str object at 0x12c0a0>'
>>> string.upper(self)
'SPAM'

However, there is usually little reason to use class methods in place of perfectly good built-in and module functions with the same purpose. Normally, these methods of datatype classes are used only in child classes that override the base classes, as in:

>>> class UpperObject(object):
...     def __repr__(self):
...         return object.__repr__(self).upper()
...
>>> uo = UpperObject()
>>> print uo
<__MAIN__.UPPEROBJECT OBJECT AT 0X1C2C6C>

object Ancestor class for new-style datatypes

Under Python 2.2+, object has become a base for new-style classes. Inheriting from object enables a custom class to use a few new capabilities, such as slots and properties. But usually if you are interested in creating a custom datatype, it is better to inherit from a child of object, such as list, float, or dict.

METHODS
object.__eq__(self, other)

Return a Boolean comparison between self and other. Determines how a datatype responds to the == operator. The parent class object does not implement .__eq__() since by default object equality means the same thing as identity (the is operator). A child is free to implement this in order to affect comparisons.

object.__ne__(self, other)

Return a Boolean comparison between self and other. Determines how a datatype responds to the != and <> operators. The parent class object does not implement .__ne__() since by default object inequality means the same thing as nonidentity (the is not operator). Although it might seem that equality and inequality always return opposite values, the methods are not explicitly defined in terms of each other. You could force the relationship with:

>>> class EQ(object):
...     # Abstract parent class for equality classes
...     def __eq__(self, o): return not self <> o
...     def __ne__(self, o): return not self == o
...
>>> class Comparable(EQ):
...     # By def'ing inequality, get equality (or vice versa)
...     def __ne__(self, other):
...         return someComplexComparison(self, other)
object.__nonzero__(self)

Return a Boolean value for an object. Determines how a datatype responds to the Boolean comparisons or, and, and not, and to if and filter(None,...) tests. An object whose .__nonzero__() method returns a true value is itself treated as a true value.
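A sketch with an invented class (Python 3 later renamed this method .__bool__(), so the alias below keeps the example runnable on modern versions):

```python
class Account:
    "Invented example: an account is 'true' only if it has funds"
    def __init__(self, balance):
        self.balance = balance
    def __nonzero__(self):              # Python 2 name
        return self.balance > 0
    __bool__ = __nonzero__              # Python 3 name for the same hook

assert bool(Account(10)) is True
assert not Account(0)                   # falls through .__nonzero__()/__bool__()
funded = list(filter(None, [Account(5), Account(0), Account(2)]))
assert len(funded) == 2                 # filter(None, ...) keeps 'true' objects
```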

object.__len__(self)
len(object)

Return an integer representing the "length" of the object. For collection types, this is fairly straightforward: how many objects are in the collection? Custom types may change the behavior to some other meaningful value.

object.__repr__(self)
repr(object)
object.__str__(self)
str(object)

Return a string representation of the object self. Determines how a datatype responds to the repr() and str() built-in functions, to the print keyword, and to the back-tick operator.

Where feasible, it is desirable to have the .__repr__() method return a representation with sufficient information in it to reconstruct an identical object. The goal here is to fulfill the equality obj==eval(repr(obj)). In many cases, however, you cannot encode sufficient information in a string, and the repr() of an object is either identical to, or slightly more detailed than, the str() representation of the same object.
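A sketch of meeting the obj==eval(repr(obj)) goal with an invented class:

```python
class Point:
    "Invented example whose repr() can reconstruct an equal object"
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        # Enough information to rebuild an identical Point
        return "Point(%r, %r)" % (self.x, self.y)
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(3, 4)
assert repr(p) == "Point(3, 4)"
assert eval(repr(p)) == p       # the round-trip goal is satisfied
```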

SEE ALSO: repr 96; operator 47;

file New-style base class for file objects

Under Python 2.2+, it is possible to create a custom file-like object by inheriting from the built-in class file. In older Python versions you may only create file-like objects by defining the methods that make an object "file-like." However, even in recent versions of Python, inheritance from file buys you little; if the data contents come from somewhere other than a native filesystem, you will have to reimplement every method you wish to support.

Even more than for other object types, what makes an object file-like is a fuzzy concept. Depending on your purpose you may be happy with an object that can only read, or one that can only write. You may need to seek within the object, or you may be happy with a linear stream. In general, however, file-like objects are expected to read and write strings. Custom classes only need implement those methods that are meaningful to them and should only be used in contexts where their capabilities are sufficient.
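As a sketch, here is a minimal read-only file-like class (the name and details are invented); it implements only .read() and .readline(), which suffices for consumers that merely read:

```python
class StringSource:
    "Minimal read-only file-like wrapper over a string (invented)"
    def __init__(self, text):
        self.text = text
        self.pos = 0
    def read(self, size=-1):
        # Negative size means 'read everything remaining'
        if size < 0:
            size = len(self.text) - self.pos
        chunk = self.text[self.pos:self.pos+size]
        self.pos += len(chunk)
        return chunk
    def readline(self):
        # Return up to and including the next newline
        end = self.text.find('\n', self.pos)
        if end == -1:
            return self.read()
        end += 1
        chunk = self.text[self.pos:end]
        self.pos = end
        return chunk

fp = StringSource('spam\neggs\n')
assert fp.readline() == 'spam\n'
assert fp.read() == 'eggs\n'
```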

In documenting the methods of file-like objects, I adopt a slightly different convention than for other built-in types. Since actually inheriting from file is unusual, I use the capitalized name FILE to indicate a general file-like object. Instances of the actual file class are examples (and implement all the methods named), but other types of objects can be equally good FILE instances.
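As a sketch of how little is strictly required, the following illustrative class is a read-only FILE over an in-memory string; the StringReader name and its deliberate limits are invented for the example:

```python
class StringReader:
    """Minimal read-only file-like object over an in-memory string."""
    def __init__(self, data):
        self.data = data
        self.pos = 0
    def read(self, size=-1):
        # A negative size means "read to the end," as with real files
        if size < 0:
            size = len(self.data) - self.pos
        chunk = self.data[self.pos:self.pos + size]
        self.pos += len(chunk)
        return chunk
    def readline(self):
        # Return one line, including its trailing newline if present
        end = self.data.find('\n', self.pos)
        if end == -1:
            return self.read()
        line = self.data[self.pos:end + 1]
        self.pos = end + 1
        return line

fp = StringReader("spam\neggs\n")
first = fp.readline()
rest = fp.read()
```

Such an object is sufficient for callers that only ever call .read() and .readline(); a caller needing .seek() or .write() would rightly reject it.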

BUILT-IN FUNCTIONS
open(fname [,mode [,buffering]])
file(fname [,mode [,buffering]])

Return a file object that attaches to the filename fname. The optional argument mode describes the capabilities and access style of the object. An r mode is for reading; w for writing (truncating any existing content); a for appending (writing to the end). Each of these modes may also have the binary flag b for platforms like Windows that distinguish text and binary files. The flag + may be used to allow both reading and writing. The argument buffering may be 0 for none, 1 for line-oriented, a larger integer for number of bytes.

>>> open('tmp','w').write('spam and eggs\n')
>>> print open('tmp','r').read(),
spam and eggs
>>> open('tmp','w').write('this and that\n')
>>> print open('tmp','r').read(),
this and that
>>> open('tmp','a').write('something else\n')
>>> print open('tmp','r').read(),
this and that
something else
METHODS AND ATTRIBUTES
FILE.close()

Close a file object. Reading and writing are disallowed after a file is closed.

FILE.closed

Return a Boolean value indicating whether the file has been closed.

FILE.fileno()

Return a file descriptor number for the file. File-like objects that do not attach to actual files should not implement this method.

FILE.flush()

Write any pending data to the underlying file. File-like objects that do not cache data can still implement this method as pass.

FILE.isatty()

Return a Boolean value indicating whether the file is a TTY-like device. The standard documentation says that file-like objects that do not attach to actual files should not implement this method, but implementing it to always return 0 is probably a better approach.

FILE.mode

Attribute containing the mode of the file, normally identical to the mode argument passed to the object's initializer.

FILE.name

The name of the file. For file-like objects without a filesystem name, some string identifying the object should be put into this attribute.

FILE.read ([size=sys.maxint])

Return a string containing up to size bytes of content from the file. Stop the read if an EOF is encountered or upon another condition that makes sense for the object type. Move the file position forward past the bytes read. A negative size argument is treated as the default value.

FILE.readline([size=sys.maxint])

Return a string containing one line from the file, including the trailing newline, if any. A maximum of size bytes are read. The file position is moved forward past the read. A negative size argument is treated as the default value.

FILE.readlines([size=sys.maxint])

Return a list of lines from the file, each line including its trailing newline. If the argument size is given, limit the read to approximately size bytes worth of lines. The file position is moved forward past the bytes read. A negative size argument is treated as the default value.

FILE.seek(offset [,whence=0])

Move the file position by offset bytes (positive or negative). The argument whence specifies where the initial file position is prior to the move: 0 for BOF; 1 for current position; 2 for EOF.
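A short sketch of the three whence values, using the in-memory io.BytesIO class of modern Python (an ordinary file object behaves the same way):

```python
import io

fp = io.BytesIO(b"spam and eggs")
fp.seek(5)        # whence=0 (default): 5 bytes from the beginning (BOF)
assert fp.tell() == 5
fp.seek(4, 1)     # whence=1: 4 bytes forward from the current position
assert fp.tell() == 9
fp.seek(-4, 2)    # whence=2: 4 bytes back from the end (EOF)
data = fp.read()  # the last four bytes
```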

FILE.tell()

Return the current file position.

FILE.truncate([size=0])

Truncate the file contents so that the file is at most size bytes long.

FILE.write(s)

Write the string s to the file, starting at the current file position. The file position is moved forward past the written bytes.

FILE.writelines(lines)

Write the lines in the sequence lines to the file. No newlines are added during the write. The file position is moved forward past the written bytes.

FILE.xreadlines()

Memory-efficient iterator over lines in a file. In Python 2.2+, you might implement this as a generator that returns one line per yield.

SEE ALSO: xreadlines 72;

int New-style base class for integer objects

long New-style base class for long integers

In Python, there are two standard datatypes for representing integers. Objects of type IntType have a fixed range that depends on the underlying platform, usually between plus and minus 2**31. Objects of type LongType are unbounded in size. In Python 2.2+, operations on integers that exceed the range of an int object result in automatic promotion to long objects. However, no operation on a long will demote the result back to an int object (even if the result is of small magnitude), with the exception of the int() function, of course.

From a user point of view, ints and longs provide exactly the same interface. The difference between them is only in underlying implementation, with ints typically being significantly faster to operate on (since they use raw CPU instructions fairly directly). Most of the magic methods integers have are shared by floating point numbers as well and are discussed below. For example, consult the discussion of float.__mul__() for information on the corresponding int.__mul__() method. The special capability that integers have over floating point numbers is their ability to perform bitwise operations.

Under Python 2.2+, you may create a custom datatype that inherits from int or long; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary bit operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic int and a custom object, the custom right-associative method will be chosen over the basic operation. For example:

>>> class I(int):
...     def __xor__(self, other):
...         return "XOR"
...     def __rxor__(self, other):
...         return "RXOR"
...
>>> 0xFF ^ 0xFF
0
>>> 0xFF ^ I(0xFF)
'RXOR'
>>> I(0xFF) ^ 0xFF
'XOR'
>>> I(0xFF) ^ I(0xFF)
'XOR'
METHODS
int.__and__(self, other)
int.__rand__(self, other)

Return a bitwise-and between self and other. Determines how a datatype responds to the & operator.

int.__hex__(self)

Return a hex string representing self. Determines how a datatype responds to the built-in hex() function.

int.__invert__(self)

Return a bitwise inversion of self. Determines how a datatype responds to the ~ operator.

int.__lshift__(self, other)
int.__rlshift__(self, other)

Return the result of bit-shifting self to the left by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the << operator.

int.__oct__(self)

Return an octal string representing self. Determines how a datatype responds to the built-in oct() function.

int.__or__(self, other)
int.__ror__(self, other)

Return a bitwise-or between self and other. Determines how a datatype responds to the | operator.

int.__rshift__(self, other)
int.__rrshift__(self, other)

Return the result of bit-shifting self to the right by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the >> operator.

int.__xor__(self, other)
int.__rxor__(self, other)

Return a bitwise-xor between self and other. Determines how a datatype responds to the ^ operator.

SEE ALSO: float 19; int 421; long 422; sys.maxint 50; operator 47;

float New-style base class for floating point numbers

Python floating point numbers are mostly implemented using the underlying C floating point library of your platform; that is, to a greater or lesser degree based on the IEEE 754 standard. A complex number is just a Python object that wraps a pair of floats with a few extra operations on these pairs.

DIGRESSION

Although the details are far outside the scope of this book, a general warning is in order. Floating point math is harder than you think! If you think you understand just how complex IEEE 754 math is, you are not yet aware of all of the subtleties. By way of indication, Python luminary and erstwhile professor of numeric computing Alex Martelli commented in 2001 (on <comp.lang.python>):

Anybody who thinks he knows what he's doing when floating point is involved IS either naive, or Tim Peters (well, it COULD be W. Kahan I guess, but I don't think he writes here).

Fellow Python guru Tim Peters observed:

I find it's possible to be both (wink). But nothing about fp comes easily to anyone, and even Kahan works his butt off to come up with the amazing things that he does.

Peters illustrated further by way of Donald Knuth (The Art of Computer Programming, Third Edition, Addison-Wesley, 1997; ISBN: 0201896842, vol. 2, p. 229):

Many serious mathematicians have attempted to analyze a sequence of floating point operations rigorously, but found the task so formidable that they have tried to be content with plausibility arguments instead.

The trick about floating point numbers is that although they are extremely useful for representing real-life (fractional) quantities, operations on them do not obey the arithmetic rules we learned in middle school: associativity, transitivity, commutativity; moreover, many very ordinary-seeming numbers can be represented only approximately with floating point numbers. For example:

>>> 1./3
0.33333333333333331
>>> .3
0.29999999999999999
>>> 7 == 7./25 * 25
0
>>> 7 == 7./24 * 24
1
CAPABILITIES

In the hierarchy of Python numeric types, floating point numbers are higher up the scale than integers, and complex numbers higher than floats. That is, operations on mixed types get promoted upwards. However, the magic methods that make a datatype "float-like" are strictly a subset of those associated with integers. All of the magic methods listed below for floats apply equally to ints and longs (or integer-like custom datatypes). Complex numbers support a few additional methods.

Under Python 2.2+, you may create a custom datatype that inherits from float or complex; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic datatype and a custom object, the custom right-associative method will be chosen over the basic operation. See the example under int.

METHODS
float.__abs__(self)

Return the absolute value of self. Determines how a datatype responds to the built-in function abs().

float.__add__(self, other)
float.__radd__(self, other)

Return the sum of self and other. Determines how a datatype responds to the + operator.

float.__cmp__(self, other)

Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other. If other comparison methods are defined, they take precedence over .__cmp__(): .__ge__(), .__gt__(), .__le__(), and .__lt__().

float.__div__(self, other)
float.__rdiv__(self, other)

Return the ratio of self and other. Determines how a datatype responds to the classic division operator /. Once true division becomes the default (and earlier, under a from __future__ import division statement), the / operator will instead invoke .__truediv__().

float.__divmod__(self, other)
float.__rdivmod__(self, other)

Return the pair (div, remainder). Determines how a datatype responds to the built-in divmod() function.
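A quick sketch of the contract these methods should honor: for the built-in numeric types, divmod(a, b) returns the same pair as (a//b, a%b):

```python
# Floor quotient and remainder, returned together
q, r = divmod(17, 5)
assert (q, r) == (17 // 5, 17 % 5)
assert (q, r) == (3, 2)

# The same identity holds for floats (these values are exact in binary)
fq, fr = divmod(7.5, 2.0)
assert (fq, fr) == (3.0, 1.5)
```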

float.__floordiv__(self, other)
float.__rfloordiv__(self, other)

Return the whole number of times other goes into self. Determines how a datatype responds to the Python 2.2+ floor division operator //.
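A brief sketch of floor division against true division; note that the floor rounds toward negative infinity, not toward zero:

```python
assert 7.0 // 2 == 3.0     # 2 goes into 7 three whole times
assert 7.0 / 2 == 3.5      # true division keeps the fractional part
assert -7.0 // 2 == -4.0   # floor(-3.5) is -4, not -3
```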

float.__mod__(self, other)
float.__rmod__(self, other)

Return the remainder of dividing self by other. Determines how a datatype responds to the % operator.

float.__mul__(self, other)
float.__rmul__(self, other)

Return the product of self and other. Determines how a datatype responds to the * operator.

float.__neg__(self)

Return the negative of self. Determines how a datatype responds to the unary - operator.

float.__pow__(self, other)
float.__rpow__(self, other)

Return self raised to the other power. Determines how a datatype responds to the ** operator and the built-in pow() function.

float.__sub__(self, other)
float.__rsub__(self, other)

Return the difference between self and other. Determines how a datatype responds to the binary - operator.

float.__truediv__(self, other)
float.__rtruediv__(self, other)

Return the ratio of self and other. Determines how a datatype responds to the Python 2.3+ true division operator /.

SEE ALSO: complex 22; int 18; float 422; operator 47;

complex New-style base class for complex numbers

Complex numbers implement all the above documented methods of floating point numbers, and a few additional ones.

Inequality operations on complex numbers are not supported in recent versions of Python, even though they were previously. In Python 2.1+, the methods complex.__ge__(), complex.__gt__(), complex.__le__(), and complex.__lt__() all raise a TypeError rather than return Boolean values indicating the order. There is a certain logic to this change, inasmuch as complex numbers do not have a "natural" ordering. But there is also significant breakage with this change; it is one of the few changes in Python, since version 1.4 when I started using it, that I feel was a real mistake. The important breakage comes when you want to sort a list of various things, some of which might be complex numbers:

>>> lst = ["string", 1.0, 1, 1L, ('t','u','p')]
>>> lst.sort()
>>> lst
[1.0, 1, 1L, 'string', ('t', 'u', 'p')]
>>> lst.append(1j)
>>> lst.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot compare complex numbers using <, <=, >, >=

It is true that there is no obvious correct ordering between a complex number and another number (complex or otherwise), but there is also no natural ordering between a string, a tuple, and a number. Nonetheless, it is frequently useful to sort a heterogeneous list in order to create a canonical (even if meaningless) order. In Python 2.2+, you can remedy this shortcoming of recent Python versions in the style below (under 2.1 you are largely out of luck):

>>> class C(complex):
...   def __lt__(self, o):
...     if hasattr(o, 'imag'):
...       return (self.real,self.imag) < (o.real,o.imag)
...     else:
...       return self.real < o
...   def __le__(self, o): return self < o or self==o
...   def __gt__(self, o): return not (self==o or self < o)
...   def __ge__(self, o): return self > o or self==o
...
>>> lst = ["str", 1.0, 1, 1L, (1,2,3), C(1+1j), C(2-2j)]
>>> lst.sort()
>>> lst
[1.0, 1, 1L, (1+1j), (2-2j), 'str', (1, 2, 3)]

Of course, if you adopt this strategy, you have to create all of your complex values using the custom datatype C. And unfortunately, unless you override arithmetic operations also, a binary operation between a C object and another number reverts to a basic complex datatype. The reader can work out the details of this solution if she needs it.

METHODS
complex.conjugate(self)

Return the complex conjugate of self. A quick refresher: if self is n+mj, its conjugate is n-mj.

complex.imag

Imaginary component of a complex number.

complex.real

Real component of a complex number.

SEE ALSO: float 19; complex 422;

UserDict Custom wrapper around dictionary objects

dictNew-style base class for dictionary objects

Dictionaries in Python provide a well-optimized mapping between immutable objects and other Python objects (see Glossary entry on "immutable"). You may create custom datatypes that respond to various dictionary operations. There are a few syntactic operations associated with dictionaries, all involving indexing with square braces. But unlike with numeric datatypes, there are several regular methods that are reasonable to consider as part of the general interface for dictionary-like objects.

If you create a dictionary-like datatype by subclassing from UserDict.UserDict, all the special methods defined by the parent are proxies to the true dictionary stored in the object's .data member. If, under Python 2.2+, you subclass from dict itself, the object itself inherits dictionary behaviors. In either case, you may customize whichever methods you wish. Below is an example of the two styles for subclassing a dictionary-like datatype:

>>> from sys import stderr
>>> from UserDict import UserDict
>>> class LogDictOld(UserDict):
...    def __setitem__(self, key, val):
...       stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
...       self.data[key] = val
...
>>> ldo = LogDictOld()
>>> ldo['this'] = 'that'
Set: this->that
>>> class LogDictNew(dict):
...    def __setitem__(self, key, val):
...       stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
...       dict.__setitem__(self, key, val)
...
>>> ldn = LogDictNew()
>>> ldn['this'] = 'that'
Set: this->that
METHODS
dict.__cmp__(self, other)
UserDict.UserDict.__cmp__(self, other)

Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other. If other comparison methods are defined, they take precedence over .__cmp__(): .__ge__(), .__gt__(), .__le__(), and .__lt__().

dict.__contains__(self, x)
UserDict.UserDict.__contains__(self, x)

Return a Boolean value indicating whether self "contains" the value x. By default, being contained in a dictionary means matching one of its keys, but you can change this behavior by overriding it (e.g., check whether x is in a value rather than a key). Determines how a datatype responds to the in operator.
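For instance, a hedged sketch of the value-checking variant suggested above; the ValueDict name is invented for the example:

```python
class ValueDict(dict):
    """Dictionary where the `in` operator tests values, not keys."""
    def __contains__(self, x):
        return x in self.values()

d = ValueDict({'spam': 'eggs'})
has_value = 'eggs' in d     # True: 'eggs' is a value
has_key = 'spam' in d       # False: 'spam' is only a key here
```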

dict.__delitem__(self, x)
UserDict.UserDict.__delitem__(self, x)

Remove an item from a dictionary-like datatype. By default, removing an item means removing the pair whose key equals x. Determines how a datatype responds to the del statement, as in del self[x].

dict.__getitem__(self, x)
UserDict.UserDict.__getitem__(self, x)

By default, return the value associated with the key x. Determines how a datatype responds to indexing with square braces. You may override this method to either search differently or return special values. For example:

>>> class BagOfPairs(dict):
...     def __getitem__(self, x):
...         if self.has_key(x):
...             return (x, dict.__getitem__(self, x))
...         else:
...             tmp = dict([(v,k) for k,v in self.items()])
...             return (dict.__getitem__(tmp, x), x)
...
>>> bop = BagOfPairs({'this':'that', 'spam':'eggs'})
>>> bop['this']
('this', 'that')
>>> bop['eggs']
('spam', 'eggs')
>>> bop['bacon'] = 'sausage'
>>> bop
{'this': 'that', 'bacon': 'sausage', 'spam': 'eggs'}
>>> bop['nowhere']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 7, in __getitem__
KeyError: nowhere
dict.__len__(self)
UserDict.UserDict.__len__(self)

Return the length of the dictionary. By default this is simply a count of the key/val pairs, but you could perform a different calculation if you wished (e.g., perhaps you would cache the size of a record set returned from a database query that emulated a dictionary). Determines how a datatype responds to the built-in len() function.

dict.__setitem__(self, key, val)
UserDict.UserDict.__setitem__(self, key, val)

Set the dictionary key key to value val. Determines how a datatype responds to indexed assignment; that is, self[key]=val. A custom version might actually perform some calculation based on val and/or key before adding an item.

dict.clear(self)
UserDict.UserDict.clear(self)

Remove all items from self.

dict.copy(self)
UserDict.UserDict.copy(self)

Return a copy of the dictionary self (i.e., a distinct object with the same items).

dict.get(self, key [,default=None])
UserDict.UserDict.get(self, key [,default=None])

Return the value associated with the key key. If no item with the key exists, return default instead of raising a KeyError.

dict.has_key(self, key)
UserDict.UserDict.has_key(self, key)

Return a Boolean value indicating whether self has the key key.

dict.items(self)
UserDict.UserDict.items(self)
dict.iteritems(self)
UserDict.UserDict.iteritems(self)

Return the items in a dictionary, in an unspecified order. The .items() method returns a true list of (key,val) pairs, while the .iteritems() method (in Python 2.2+) returns a generator object that successively yields items. The latter method is useful if your dictionary is not a true in-memory structure, but rather some sort of incremental query or calculation. Either method responds externally similarly to a for loop:

>>> d = {1:2, 3:4}
>>> for k,v in d.iteritems(): print k,v,':',
...
1 2 : 3 4 :
>>> for k,v in d.items(): print k,v,':',
...
1 2 : 3 4 :
dict.keys(self)
UserDict.UserDict.keys(self)
dict.iterkeys(self)
UserDict.UserDict.iterkeys(self)

Return the keys in a dictionary, in an unspecified order. The .keys() method returns a true list of keys, while the .iterkeys() method (in Python 2.2+) returns a generator object.

SEE ALSO: dict.items() 26;

dict.popitem(self)
UserDict.UserDict.popitem(self)

Return a (key,val) pair from the dictionary, or raise a KeyError if the dictionary is empty. Removes the returned item from the dictionary. As with other dictionary methods, the order in which items are popped is unspecified (and can vary between versions and platforms).

dict.setdefault(self, key [,default=None])
UserDict.UserDict.setdefault(self, key [,default=None])

If key is currently in the dictionary, return the corresponding value. If key is not currently in the dictionary, set self[key]=default, then return default.
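A common idiom built on this method groups items without first testing whether the key exists; a small sketch:

```python
words = ["spam", "eggs", "sausage", "spam"]
by_letter = {}
for w in words:
    # On first sight of a key, insert an empty list; either way,
    # setdefault() returns the stored list so we can append to it
    by_letter.setdefault(w[0], []).append(w)
```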

SEE ALSO: dict.get() 26;

dict.update(self, other)
UserDict.UserDict.update(self, other)

Update the dictionary self using the dictionary other. If a key in other already exists in self, the corresponding value from other is used in self. If a (key,val) pair in other is not in self, it is added.

dict.values(self)
UserDict.UserDict.values(self)
dict.itervalues(self)
UserDict.UserDict.itervalues(self)

Return the values in a dictionary, in an unspecified order. The .values() method returns a true list of values, while the .itervalues() method (in Python 2.2+) returns a generator object.

SEE ALSO: dict.items() 26;

SEE ALSO: dict 428; list 28; operator 47;

UserList Custom wrapper around list objects

list New-style base class for list objects

tuple New-style base class for tuple objects

A Python list is a (possibly) heterogeneous mutable sequence of Python objects. A tuple is a similar immutable sequence (see Glossary entry on "immutable"). Most of the magic methods of lists and tuples are the same, but a tuple does not have those methods associated with internal transformation.

If you create a list-like datatype by subclassing from UserList.UserList, all the special methods defined by the parent are proxies to the true list stored in the object's .data member. If, under Python 2.2+, you subclass from list (or tuple) itself, the object itself inherits list (tuple) behaviors. In either case, you may customize whichever methods you wish. The discussion of dict and UserDict shows an example of the different styles of specialization.

The difference between a list-like object and a tuple-like object runs less deep than you might think. Mutability is only really important for using objects as dictionary keys, but dictionaries only check the mutability of an object by examining the return value of an object's .__hash__() method. If this method fails to return an integer, an object is considered mutable (and ineligible to serve as a dictionary key). The reason that tuples are useful as keys is because every tuple composed of the same items has the same hash; two lists (or dictionaries), by contrast, may also have the same items, but only as a passing matter (since either can be changed).

You can easily give a hash value to a list-like datatype. However, there is an obvious and wrong way to do so:

>>> class L(list):
...     __hash__ = lambda self: hash(tuple(self))
...
>>> lst = L([1,2,3])
>>> dct = {lst:33, 7:8}
>>> print dct
{[1, 2, 3]: 33, 7: 8}
>>> dct[lst]
33
>>> lst.append(4)
>>> print dct
{[1, 2, 3, 4]: 33, 7: 8}
>>> dct[lst]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: [1, 2, 3, 4]

As soon as lst changes, its hash changes, and you cannot reach the dictionary item keyed to it. What you need is something that does not change as the object changes:

>>> class L(list):
...     __hash__ = lambda self: id(self)
...
>>> lst = L([1,2,3])
>>> dct = {lst:33, 7:8}
>>> dct[lst]
33
>>> lst.append(4)
>>> dct
{[1, 2, 3, 4]: 33, 7: 8}
>>> dct[lst]
33

As with most everything about Python datatypes and operations, mutability is merely a protocol that you can choose to support or not support in your custom datatypes.

Sequence datatypes may choose to support order comparisons; in fact, they probably should. The methods .__cmp__(), .__ge__(), .__gt__(), .__le__(), and .__lt__() have the same meanings for sequences that they do for other datatypes; see operator, float, and dict for details.

METHODS
list.__add__(self, other)
UserList.UserList.__add__(self, other)
tuple.__add__(self, other)
list.__iadd__(self, other)
UserList.UserList.__iadd__(self, other)

Determine how a datatype responds to the + and += operators. Augmented assignments ("in-place add") are supported in Python 2.0+. For list-like datatypes, normally the statements lst+=other and lst=lst+other have the same effect, but the augmented version might be more efficient.

Under standard meaning, addition of the two sequence objects produces a new (distinct) sequence object with all the items in both self and other. An in-place add (.__iadd__) mutates the left-hand object without creating a new object. A custom datatype might choose to give a special meaning to addition, perhaps depending on the datatype of the object added in. For example:

>>> class XList(list):
...     def __iadd__(self, other):
...         if issubclass(other.__class__, list):
...             return list.__iadd__(self, other)
...         else:
...             from operator import add
...             return map(add, self, [other]*len(self))
...
>>> xl = XList([1,2,3])
>>> xl += [4,5,6]
>>> xl
[1, 2, 3, 4, 5, 6]
>>> xl += 10
>>> xl
[11, 12, 13, 14, 15, 16]
list.__contains__(self, x)
UserList.UserList.__contains__(self, x)
tuple.__contains__(self, x)

Return a Boolean value indicating whether self contains the value x. Determines how a datatype responds to the in operator.

list.__delitem__(self, x)
UserList.UserList.__delitem__(self, x)

Remove an item from a list-like datatype. Determines how a datatype responds to the del statement, as in del self[x].

list.__delslice__(self, start, end)
UserList.UserList.__delslice__(self, start, end)

Remove a range of items from a list-like datatype. Determines how a datatype responds to the del statement applied to a slice, as in del self[start:end].

list.__getitem__(self, pos)
UserList.UserList.__getitem__(self, pos)
tuple.__getitem__(self, pos)

Return the value at offset pos in the list. Determines how a datatype responds to indexing with square braces. The default behavior on list indices is to raise an IndexError for nonexistent offsets.

list.__getslice__(self, start, end)
UserList.UserList.__getslice__(self, start, end)
tuple.__getslice__(self, start, end)

Return a subsequence of the sequence self. Determines how a datatype responds to indexing with a slice parameter, as in self[start:end].

list.__hash__(self)
UserList.UserList.__hash__(self)
tuple.__hash__(self)

Return an integer that distinctly identifies an object. Determines how a datatype responds to the built-in hash() function; probably more importantly, the hash is used internally by dictionaries. By default, tuples (and other immutable types) will return hash values, but lists will raise a TypeError. Dictionaries will handle hash collisions gracefully, but it is best to try to make hashes unique per object.

>>> hash(219750523), hash((1,2))
(219750523, 219750523)
>>> dct = {219750523:1, (1,2):2}
>>> dct[219750523]
1
list.__len__(self)
UserList.UserList.__len__(self)
tuple.__len__(self)

Return the length of a sequence. Determines how a datatype responds to the built-in len() function.

list.__mul__(self, num)
UserList.UserList.__mul__(self, num)
tuple.__mul__(self, num)
list.__rmul__(self, num)
UserList.UserList.__rmul__(self, num)
tuple.__rmul__(self, num)
list.__imul__(self, num)
UserList.UserList.__imul__(self, num)

Determine how a datatype responds to the * and *= operators. Augmented assignments ("in-place multiply") are supported in Python 2.0+. For list-like datatypes, normally the statements lst*=other and lst=lst*other have the same effect, but the augmented version might be more efficient.

The right-associative version .__rmul__() determines the value of num*self, the left-associative .__mul__() determines the value of self*num. Under standard meaning, the product of a sequence and a number produces a new (distinct) sequence object with the items in self duplicated num times:

>>> [1,2,3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
list.__setitem__(self, pos, val)
UserList.UserList.__setitem__(self, pos, val)

Set the value at offset pos to value val. Determines how a datatype responds to indexed assignment; that is, self[pos]=val. A custom version might actually perform some calculation based on val and/or pos before adding an item.

list.__setslice__(self, start, end, other)
UserList.UserList.__setslice__(self, start, end, other)

Replace the subsequence self[start:end] with the sequence other. The replaced and new sequences are not necessarily the same length, and the resulting sequence might be longer or shorter than self. Determines how a datatype responds to assignment to a slice, as in self[start:end]=other.
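A brief sketch showing a list growing and shrinking through slice assignment:

```python
lst = [0, 1, 2, 3, 4]
lst[1:3] = ['a', 'b', 'c', 'd']   # replace two items with four
assert lst == [0, 'a', 'b', 'c', 'd', 3, 4]
lst[1:5] = []                     # an empty replacement deletes the range
assert lst == [0, 3, 4]
```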

list.append(self, item)
UserList.UserList.append(self, item)

Add the object item to the end of the sequence self. Increases the length of self by one.

list.count(self, item)
UserList.UserList.count(self, item)

Return the integer number of occurrences of item in self.

list.extend(self, seq)
UserList.UserList.extend(self, seq)

Add each item in seq to the end of the sequence self. Increases the length of self by len(seq).

list.index(self, item)
UserList.UserList.index(self, item)

Return the offset index of the first occurrence of item in self.

list.insert(self, pos, item)
UserList.UserList.insert(self, pos, item)

Add the object item to the sequence self before the offset pos. Increases the length of self by one.

list.pop(self [,pos=-1])
UserList.UserList.pop(self [,pos=-1])

Return the item at offset pos of the sequence self, and remove the returned item from the sequence. By default, remove the last item, which lets a list act like a stack using the .pop() and .append() operations.
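A small sketch of the stack idiom mentioned above:

```python
stack = []
stack.append('spam')   # push
stack.append('eggs')   # push
top = stack.pop()      # pop: last in, first out
assert top == 'eggs'
assert stack == ['spam']
```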

list.remove(self, item)
UserList.UserList.remove(self, item)

Remove the first occurrence of item in self. Decreases the length of self by one.

list.reverse(self)
UserList.UserList.reverse(self)

Reverse the list self in place.

list.sort(self [,cmpfunc])
UserList.UserList.sort(self [,cmpfunc])

Sort the list self in place. If a comparison function cmpfunc is given, perform comparisons using that function.

SEE ALSO: list 427; tuple 427; dict 24; operator 47;

UserString Custom wrapper around string objects

str New-style base class for string objects

A string in Python is an immutable sequence of characters (see Glossary entry on "immutable"). There is special syntax for creating strings (single and triple quoting, character escaping, and so on), but in terms of object behaviors and magic methods, most of what a string does a tuple does, too. Both may be sliced and indexed, and both respond to the pseudo-arithmetic operators + and *.

For the str and UserString magic methods that are strictly a matter of the sequence quality of strings, see the corresponding tuple documentation. These include str.__add__(), str.__getitem__(), str.__getslice__(), str.__hash__(), str.__len__(), str.__mul__(), and str.__rmul__(). Each of these methods is also defined in UserString. The UserString module also includes a few explicit definitions of magic methods that are not in the new-style str class: UserString.__iadd__(), UserString.__imul__(), and UserString.__radd__(). However, you may define your own implementations of these methods, even if you inherit from str (in Python 2.2+). In any case, since strings are immutable, these "in-place" operations internally create new string objects rather than modifying an existing one.

Strings have quite a number of nonmagic methods as well. If you wish to create a custom datatype that can be utilized in the same functions that expect strings, you may want to specialize some of these common string methods. The behavior of string methods is documented in the discussion of the string module, even for the few string methods that are not also defined in the string module. However, inheriting from either str or UserString provides very reasonable default behaviors for all these methods.

SEE ALSO: "".capitalize() 132; "".title() 133; "".center() 133; "".count() 134; "".endswith() 134; "".expandtabs() 134; "".find() 135; "".index() 135; "".isalpha() 136; "".isalnum() 136; "".isdigit() 136; "".islower() 136; "".isspace() 136; "".istitle() 136; "".isupper() 136; "".join() 137; "".ljust() 138; "".lower() 138; "".lstrip() 139; "".replace() 139; "".rfind() 140; "".rindex() 141; "".rjust() 141; "".rstrip() 142; "".split() 142; "".splitlines() 144; "".startswith() 144; "".strip() 144; "".swapcase() 145; "".translate() 145; "".upper() 146; "".encode() 188;

METHODS
str.__contains__(self, x)
UserString.UserString.__contains__(self, x)

Return a Boolean value indicating whether self contains the character x. Determines how a datatype responds to the in operator.

In Python versions through 2.2, the in operator applied to strings has a semantics that tends to trip me up. Fortunately, Python 2.3+ has the behavior that I expect. In older Python versions, in can only be used to determine the presence of a single character in a string; this makes sense if you think of a string as a sequence of characters, but I nonetheless intuitively want something like the code below to work:

>>> s = "The cat in the hat"
>>> if "the" in s: print "Has definite article"
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'in <string>' requires character as left operand

It is easy to get the "expected" behavior in a custom string-like datatype (while still always producing the same result whenever x is indeed a character):

>>> class S(str):
...     def __contains__(self, x):
...         for i in range(len(self)):
...             if self.startswith(x,i): return 1
...
>>> s = S("The cat in the hat")
>>> "the" in s
1
>>> "an" in s
0

Python 2.3 strings behave the same way as my datatype S.

SEE ALSO: string 422; string 129; operator 47; tuple 28;

1.1.5 Exercise: Filling out the forms (or deciding not to)

DISCUSSION

A particular little task that was quite frequent and general before the advent of Web servers has become absolutely ubiquitous for slightly dynamic Web pages. The pattern one encounters is that one has a certain general format that is desired for a document or file, but miscellaneous little details differ from instance to instance. Form letters are another common case where one comes across this pattern, but thematically related collections of Web pages rule the roost of templating techniques.

It turns out that everyone and her sister has developed her own little templating system. Creating a templating system is a very appealing task for users of most scripting languages, just a little while after they have gotten a firm grasp of "Hello World!" Some of these are discussed in Chapter 5, but many others are not addressed. Often, these templating systems are HTML/CGI oriented and include some degree of dynamic calculation of fill-in values; the inspiration in these cases comes from systems like Allaire's ColdFusion, Java Server Pages, Active Server Pages, and PHP, in which some program code gets sprinkled around in documents that are primarily made of HTML.

At the very simplest, Python provides interpolation of special characters in strings, in a style similar to the C sprintf() function. So a simple example might appear like:

>>> form_letter="""Dear %s %s,
...
... You owe us $%s for account (#%s). Please Pay.
...
... The Company"""
>>> fname = 'David'
>>> lname = 'Mertz'
>>> due = 500
>>> acct = '123-T745'
>>> print form_letter % (fname,lname,due,acct)
Dear David Mertz,

You owe us $500 for account (#123-T745). Please Pay.

The Company

This approach does the basic templating, but it would be easy to make an error in composing the tuple of insertion values. Moreover, a slight change to the form_letter template, such as the addition or subtraction of a field, would produce wrong results.

A bit more robust approach is to use Python's dictionary-based string interpolation. For example:

>>> form_letter="""Dear %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... The Company"""
>>> fields = {'lname':'Mertz', 'fname':'David'}
>>> fields['acct'] = '123-T745'
>>> fields['due'] = 500
>>> fields['last_letter'] = '01/02/2001'
>>> print form_letter % fields
Dear David Mertz,

You owe us $500 for account (#123-T745). Please Pay.

The Company

With this approach, the fields need not be listed in a particular order for the insertion. Furthermore, if the order of fields is rearranged in the template, or if the same fields are used for a different template, the fields dictionary may still be used for insertion values. If fields has unused dictionary keys, it doesn't hurt the interpolation, either.

The dictionary interpolation approach is still subject to failure if dictionary keys are missing. Two approaches using the UserDict module can improve matters, in two different (and incompatible) ways. In Python 2.2+ the built-in dict type can be a parent for a "new-style class"; if available everywhere you need your code to run, dict is a better parent than is UserDict.UserDict. One approach is to avoid all key misses during dictionary interpolation:

>>> form_letter="""%(salutation)s %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... %(closing)s
... The Company"""
>>> from UserDict import UserDict
>>> class AutoFillingDict(UserDict):
...     def __init__(self,dict={}): UserDict.__init__(self,dict)
...     def __getitem__(self,key):
...         return UserDict.get(self, key, '')
...
>>> fields = AutoFillingDict()
>>> fields['salutation'] = 'Dear'
>>> fields
{'salutation': 'Dear'}
>>> fields['fname'] = 'David'
>>> fields['due'] = 500
>>> fields['closing'] = 'Sincerely,'
>>> print form_letter % fields
Dear David ,

You owe us $500 for account (#). Please Pay.

Sincerely,
The Company

Even though the fields lname and acct are not specified, the interpolation has managed to produce a basically sensible letter (instead of crashing with a KeyError).
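In later Python versions (2.5+), the same auto-filling behavior can be had by inheriting from the built-in dict and defining __missing__, which is consulted for absent keys during % interpolation. A minimal sketch:

```python
class AutoFillingDict(dict):
    # __missing__ is called for any absent key during lookup
    def __missing__(self, key):
        return ''

fields = AutoFillingDict(fname='David', due=500)
result = "Dear %(salutation)s %(fname)s, you owe $%(due)s" % fields
# 'salutation' is absent, so it interpolates as the empty string
```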

Another approach is to create a custom dictionary-like object that will allow for "partial interpolation." This approach is particularly useful to gather bits of the information needed for the final string over the course of the program run (rather than all at once):

>>> form_letter="""%(salutation)s %(fname)s %(lname)s,
...
... You owe us $%(due)s for account (#%(acct)s). Please Pay.
...
... %(closing)s
... The Company"""
>>> from UserDict import UserDict
>>> class ClosureDict(UserDict):
...     def __init__(self,dict={}): UserDict.__init__(self,dict)
...     def __getitem__(self,key):
...         return UserDict.get(self, key, '%('+key+')s')
...
>>> name_dict = ClosureDict({'fname':'David','lname':'Mertz'})
>>> print form_letter % name_dict
%(salutation)s David Mertz,

You owe us $%(due)s for account (#%(acct)s). Please Pay.

%(closing)s
The Company

Interpolating using a ClosureDict simply fills in whatever portion of the information it knows, then returns a new string that is closer to being filled in.
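Because the unknown fields survive as %(key)s patterns, the partially interpolated string can be run through interpolation again later. A sketch of that chaining, using a dict subclass with __missing__ in place of UserDict (note that any literal % in a template would need escaping as %% for a second pass to be safe):

```python
class ClosureDict(dict):
    # absent keys are re-emitted as interpolation targets
    def __missing__(self, key):
        return '%(' + key + ')s'

template = "%(salutation)s %(fname)s, you owe $%(due)s."
partial = template % ClosureDict(fname='David')
# partial still contains %(salutation)s and %(due)s patterns
final = partial % {'salutation': 'Dear', 'due': 500}
```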

SEE ALSO: dict 24; UserDict 24; UserList 28; UserString 33;

QUESTIONS

1:

What are some other ways to provide "smart" string interpolation? Can you think of ways that the UserList or UserString modules might be used to implement a similar enhanced interpolation?

2:

Consider other "magic" methods that you might add to classes inheriting from UserDict.UserDict. How might these additional behaviors make templating techniques more powerful?

3:

How far do you think you can go in using Python's string interpolation as a templating technique? At what point would you decide you had to apply other techniques, such as regular expression substitutions or a parser? Why?

4:

What sorts of error checking might you implement for customized interpolation? The simple list or dictionary interpolation could fail fairly easily, but at least those were trappable errors (they let the application know something is amiss). How would you create a system with both flexible interpolation and good guards on the quality and completeness of the final result?

1.1.6 Problem: Working with lines from a large file

At its simplest, reading a file in a line-oriented style is just a matter of using the .readline(), .readlines(), and .xreadlines() methods of a file object. Python 2.2+ provides a simplified syntax for this frequent operation by letting the file object itself efficiently iterate over lines (strictly in forward sequence). To read in an entire file, you may use the .read() method and possibly split it into lines or other chunks using the string.split() function. Some examples:

>>> for line in open('chap1.txt'):    # Python 2.2+
...     # process each line in some manner
...     pass
...
>>> linelist = open('chap1.txt').readlines()
>>> print linelist[1849],
  EXERCISE: Working with lines from a large file
>>> txt = open('chap1.txt').read()
>>> from os import linesep
>>> linelist2 = txt.split(linesep)

For moderately sized files, reading the entire contents is not a big issue. But large files make time and memory issues more important. Complex documents or active log files, for example, might be multiple megabytes, or even gigabytes, in size; even if the contents of such files do not strictly exceed the size of available memory, reading them can still be time consuming. A related technique is discussed in the "Problem: Reading a file backwards by record, line, or paragraph" section of Chapter 2.

Obviously, if you need to process every line in a file, you have to read the whole file; xreadlines does so in a memory-friendly way, assuming you are able to process the lines sequentially. But for applications that only need a subset of lines in a large file, it is not hard to make improvements. The most important module to look to for support here is linecache.

A CACHED LINE LIST

It is straightforward to read a particular line from a file using linecache:

>>> import linecache
>>> print linecache.getline('chap1.txt',1850),
  PROBLEM: Working with lines from a large file

Notice that linecache.getline() uses one-based counting, in contrast to the zero-based list indexing in the prior example. While there is not much to this, it would be even nicer to have an object that combined the efficiency of linecache with the interfaces we expect of lists. Code might already exist to process lists of lines, or you might want to write a function that is agnostic about the source of a list of lines. In addition to being able to enumerate and index, it would be useful to be able to slice linecache-based objects, just as we might slice real lists (including with extended slices, which were added to lists in Python 2.3).

cachedlinelist.py
import linecache, types

class CachedLineList:
    # Note: in Python 2.2+, it is probably worth including:
    # __slots__ = ('_fname')
    # ...and inheriting from 'object'
    def __init__(self, fname):
        self._fname = fname
    def __getitem__(self, x):
        if type(x) is types.SliceType:
            return [linecache.getline(self._fname, n+1)
                    for n in range(x.start, x.stop, x.step)]
        else:
            return linecache.getline(self._fname, x+1)
    def __getslice__(self, beg, end):
        # pass to __getitem__ which does extended slices also
        return self[beg:end:1]

Using these new objects is almost identical to using a list created by open(fname).readlines(), but more efficient (especially in memory usage):

>>> from cachedlinelist import CachedLineList
>>> cll = CachedLineList('../chap1.txt')
>>> cll[1849]
'  PROBLEM: Working with lines from a large file\r\n'
>>> for line in cll[1849:1851]: print line,
...
  PROBLEM: Working with lines from a large file
  ---------------------------------------------------------
>>> for line in cll[1853:1857:2]: print line,
...
  a matter of using the '.readline()', '.readlines()' and
  simplified syntax for this frequent operation by letting the
A RANDOM LINE

Occasionally, especially for testing purposes, you might want to check "typical" lines in a line-oriented file. It is easy to fall into the trap of making sure that a process works for the first few lines of a file, and maybe for the last few, then assuming it works everywhere. Unfortunately, the first and last few lines of many files tend to be atypical: sometimes headers or footers are used; sometimes a log file's first lines were logged during development rather than usage; and so on. Then again, exhaustive testing of entire files might provide more data than you want to worry about. Depending on the nature of the processing, complete testing could be time consuming as well.

On most systems, seeking to a particular position in a file is far quicker than reading all the bytes up to that position. Even using linecache, you need to read a file byte-by-byte up to the point of a cached line. A fast approach to finding random lines from a large file is to seek to a random position within a file, then read comparatively few bytes before and after that position, identifying a line within that chunk.

randline.py
#!/usr/bin/python
"""Iterate over random lines in a file (req Python 2.2+)

From command-line use:
% randline.py <fname> <numlines>
"""
import sys
from os import stat, linesep
from stat import ST_SIZE
from random import randrange

MAX_LINE_LEN = 4096

#-- Iterable class
class randline(object):
    __slots__ = ('_fp','_size','_limit')
    def __init__(self, fname, limit=sys.maxint):
        self._size = stat(fname)[ST_SIZE]
        self._fp = open(fname,'rb')
        self._limit = limit
    def __iter__(self):
        return self
    def next(self):
        if self._limit <= 0:
            raise StopIteration
        self._limit -= 1
        pos = randrange(self._size)
        priorlen = min(pos, MAX_LINE_LEN)   # maybe near start
        self._fp.seek(pos-priorlen)
        # Add extra linesep at beg/end in case pos at beg/end
        prior = linesep + self._fp.read(priorlen)
        post = self._fp.read(MAX_LINE_LEN) + linesep
        begln = prior.rfind(linesep) + len(linesep)
        endln = post.find(linesep)
        return prior[begln:]+post[:endln]

#-- Use as command-line tool
if __name__=='__main__':
    fname, numlines = sys.argv[1], int(sys.argv[2])
    for line in randline(fname, numlines):
        print line

The presented randline module may be used either imported into another application or as a command-line tool. In the latter case, you could pipe a collection of random lines to another application, as in:

 % randline.py reallybig.log 1000 | testapp 

A couple details should be noted in my implementation. (1) The same line can be chosen more than once in a line iteration. If you choose a small number of lines from a large file, this probably will not happen (but the so-called "birthday paradox" makes an occasional collision more likely than you might expect; see the Glossary). (2) What is selected is "the line that contains a random position in the file," which means that short lines are less likely to be chosen than long lines. That distribution could be a bug or feature, depending on your needs. In practical terms, for testing "enough" typical cases, the precise distribution is not all that important.
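If the length bias (or the chance of repeated selection) is unacceptable, and one sequential pass over the file is affordable, reservoir sampling chooses each line with equal probability. A sketch under those assumptions (the helper name is my own):

```python
import random

def reservoir_sample(lines, k):
    """Choose k items uniformly at random from an iterable of unknown length."""
    sample = []
    for n, item in enumerate(lines):
        if n < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randrange(n + 1)  # 0 <= j <= n
            if j < k:
                sample[j] = item         # replace with probability k/(n+1)
    return sample

# any iterable of lines works, including an open file object
lines = ['line %d\n' % i for i in range(1000)]
picks = reservoir_sample(lines, 5)
```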

SEE ALSO: xreadlines 72; linecache 64; random 82;



Text Processing in Python, David Mertz. ISBN: 0321112547. Year: 2005.