A.3 Datatypes | Text Processing in Python

Python has a rich collection of basic datatypes. All of Python's collection types allow you to hold heterogeneous elements inside them, including other collection types (with minor limitations). It is straightforward, therefore, to build complex data structures in Python.

Unlike many languages, Python datatypes come in two varieties: mutable and immutable. All of the atomic datatypes are immutable, as is the collection type tuple. The collections list and dict are mutable, as are class instances. The mutability of a datatype is simply a question of whether objects of that type can be changed "in place"; an immutable object can only be created and destroyed, but never altered during its existence. One upshot of this distinction is that immutable objects may act as dictionary keys, but mutable objects may not. Another upshot is that when you want a data structure (especially a large one) that will be modified frequently during program operation, you should choose a mutable datatype (usually a list).
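The distinction can be seen in a few lines. The following is a minimal sketch, with illustrative variable names that do not come from the text; it is written so that it should also run on current Python interpreters.

```python
# A list is mutable: it can be changed in place and remains the same object.
lst = [1, 2, 3]
lst_id = id(lst)
lst.append(4)
same_object = (id(lst) == lst_id)

# A tuple is immutable: item assignment raises TypeError.
tup = (1, 2, 3)
try:
    tup[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False

# An immutable tuple may act as a dictionary key; a list may not.
points = {(0, 0): 'origin', (1, 0): 'unit x'}
try:
    points[[2, 0]] = 'bad key'
    list_key_ok = True
except TypeError:
    list_key_ok = False
```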

Most of the time, if you want to convert values between different Python datatypes, an explicit conversion/encoding call is required, but numeric types contain promotion rules to allow numeric expressions over a mixture of types. The built-in datatypes are listed below with discussions of each. The built-in function type() can be used to check the datatype of an object.
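As a brief illustration of the difference between automatic numeric promotion and explicit conversion, consider the sketch below. The variable names are illustrative only, and the code assumes nothing beyond the built-in functions.

```python
# Numeric promotion happens automatically in mixed expressions.
mixed = 2 + 1.5                # the int is promoted to a float
as_complex = mixed + 1j        # the float is promoted to a complex

# Other conversions must be requested explicitly.
n = int("42")                  # string -> int
f = float("3.5")               # string -> float
s = str(42) + " apples"        # int -> string before concatenation

# There is no implicit conversion between strings and numbers.
try:
    bad = "42" + 1
    implicit_ok = True
except TypeError:
    implicit_ok = False

# type() reports the datatype of each result.
kinds = [type(mixed) is float, type(as_complex) is complex, type(n) is int]
```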

A.3.1 Simple Types

bool

Python 2.3+ supports a Boolean datatype with the possible values True and False. In earlier versions of Python, these values are typically called 1 and 0; even in Python 2.3+, the Boolean values behave like numbers in numeric contexts. Some earlier micro-releases of Python (e.g., 2.2.1) include the names True and False, but not the Boolean datatype.
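The numeric behavior of Boolean values is sometimes put to practical use. The following small sketch (names are illustrative) assumes an interpreter that has the Boolean datatype:

```python
flags = [True, False, True, True]

count_true = sum(flags)          # True counts as 1, False as 0
total = True + True              # arithmetic on Booleans yields an integer
picked = ['no', 'yes'][3 > 2]    # a Boolean used as a sequence index
```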

int

A signed integer in the range indicated by the register size of the interpreter's CPU/OS platform. For most current platforms, integers range from -(2**31) to (2**31)-1. You can find the size on your platform by examining sys.maxint. Integers are the bottom numeric type in terms of promotions; nothing gets promoted to an integer, but integers are sometimes promoted to other numeric types. A float, long, or string may be explicitly converted to an int using the int() function.

long

An (almost) unlimited size integral number. A long literal is indicated by an integer followed by an l or L (e.g., 34L, 9876543210l). In Python 2.2+, operations on ints that overflow sys.maxint are automatically promoted to longs. An int, float, or string may be explicitly converted to a long using the long() function.

float

An IEEE754 floating point number. A literal floating point number is distinguished from an int or long by containing a decimal point and/or exponent notation (e.g., 1.0, 1e3, 37., .453e-12). A numeric expression that involves both int/long types and float types promotes all component types to floats before performing the computation. An int, long, or string may be explicitly converted to a float using the float() function.

complex

An object containing two floats, representing real and imaginary components of a number. A numeric expression that involves both int/long/float types and complex types promotes all component types to complex before performing the computation. There is no way to spell a literal complex in Python, but an addition such as 1.1+2j is the usual way of computing a complex value. A j or J following a float or int literal indicates an imaginary number. An int, long, or string may be explicitly converted to a complex using the complex() function. If two float/int arguments are passed to complex(), the second is the imaginary component of the constructed number (e.g., complex(1.1,2)).
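The several ways of arriving at a complex value can be compared directly. This is only a sketch with illustrative names; the string form passed to complex() is assumed to contain no embedded spaces.

```python
z1 = 1.1 + 2j                  # float plus an imaginary literal
z2 = complex(1.1, 2)           # the same value via the constructor
z3 = complex("1.1+2j")         # explicit conversion from a string

promoted = 3 + z1              # the int is promoted to complex first
parts = (z1.real, z1.imag)     # the two float components
```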

string

An immutable sequence of 8-bit character values. Unlike in many programming languages, there is no "character" type in Python, merely strings that happen to have length one. String objects have a variety of methods to modify strings, but such methods always return a new string object rather than modify the initial object itself. The built-in chr() function will return a length-one string whose ordinal value is the passed integer; ord() performs the inverse operation. The str() function will return a string representation of a passed-in object. For example:

 >>> ord('a')
 97
 >>> chr(97)
 'a'
 >>> str(97)
 '97'
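The point that string methods return new objects, rather than modifying the original, can be sketched as follows (the variable names are illustrative):

```python
s = 'text processing'

upper = s.upper()                     # a new string; s is untouched
swapped = s.replace('text', 'word')   # likewise a new string

# Strings do not support in-place modification.
try:
    s[0] = 'T'
    assigned = True
except TypeError:
    assigned = False

first = s[0]                          # a "character" is a length-one string
round_trip = chr(ord(first))          # ord() and chr() are inverses here
```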

unicode

An immutable sequence of Unicode characters. There is no datatype for a single Unicode character, but Unicode strings of length one contain a single character. Unicode strings contain a similar collection of methods to string objects, and like the latter, Unicode methods return new Unicode objects rather than modify the initial object. See Chapter 2 and Appendix C for additional discussion of Unicode.

A.3.2 String Interpolation

Literal strings and Unicode strings may contain embedded format codes. When a string contains format codes, values may be interpolated into the string using the % operator and a tuple or dictionary giving the values to substitute in.

Strings that contain format codes may follow either of two patterns. The simpler pattern uses format codes with the syntax %[flags][len[.precision]]<type>. Interpolating a string with format codes on this pattern requires % combination with a tuple of matching length and content datatypes. If only one value is being interpolated, you may give the bare item rather than a tuple of length one. For example:

 >>> "float %3.1f, int %+d, hex %06x" % (1.234, 1234, 1234)
 'float 1.2, int +1234, hex 0004d2'
 >>> '%e' % 1234
 '1.234000e+03'
 >>> '%e' % (1234,)
 '1.234000e+03'

The (slightly) more complex pattern for format codes embeds a name within the format code, which is then used as a string key to an interpolation dictionary. The syntax of this pattern is %(key)[flags][len[.precision]]<type>. Interpolating a string with this style of format codes requires % combination with a dictionary that contains all the named keys, and whose corresponding values contain acceptable datatypes. For example:

 >>> dct = {'ratio':1.234, 'count':1234, 'offset':1234}
 >>> "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x" % dct
 'float 1.2, int +1234, hex 0004d2'

You may not mix tuple interpolation and dictionary interpolation within the same string.

I mentioned that datatypes must match format codes. Different format codes accept a different range of datatypes, but the rules are almost always what you would expect. Generally, numeric data will be promoted or demoted as necessary, but strings and complex types cannot be used for numbers.
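A short sketch may make the matching rules concrete. The helper try_format() is hypothetical, introduced only to capture the TypeError that a mismatched datatype raises.

```python
as_int = '%d' % 3.7                     # a float is demoted (truncated) for %d
as_float = '%.2f' % 3                   # an int is promoted for %f
as_str = '%s and %r' % (3.7, 'spam')    # %s and %r accept any object

def try_format(fmt, value):
    """Return the interpolated string, or None if the datatype does not fit."""
    try:
        return fmt % (value,)
    except TypeError:
        return None

string_for_number = try_format('%d', '37')     # a string is not a number
complex_for_number = try_format('%f', 1+2j)    # a complex cannot be demoted
```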

One useful style of using dictionary interpolation is against the global and/or local namespace dictionary. Regular bound names defined in scope can be interpolated into strings.

 >>> s = "float %(ratio)3.1f, int %(count)+d, hex %(offset)06x"
 >>> ratio = 1.234
 >>> count = 1234
 >>> offset = 1234
 >>> s % globals()
 'float 1.2, int +1234, hex 0004d2'

If you want to look for names across scope, you can create an ad hoc dictionary with both local and global names:

 >>> vardct = {}
 >>> vardct.update(globals())
 >>> vardct.update(locals())
 >>> interpolated = somestring % vardct

The flags for format codes consist of the following:

 0        Pad to length with leading zeros
 -        Align the value to the left within its length
 (space)  Leave a blank before positive values, where a sign would go
 +        Explicitly indicate the sign of positive values
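Each flag can be tried on a single value. The following sketch uses the integer 42 throughout; the variable names are illustrative.

```python
padded = '[%06d]' % 42         # 0: pad to length with leading zeros
left = '[%-6d]' % 42           # -: align left within the length
spaced = '[% d]' % 42          # space: a blank where a sign would go
signed = '[%+d]' % 42          # +: show the sign of a positive value
negative = '[% d]' % -42       # negative values keep their minus sign
```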

When a length is included, it specifies the minimum length of the interpolated formatting. Numbers that will not fit within a length simply occupy more bytes than specified. When a precision is included, the digits to the right of the decimal point count toward the total length:

 >>> '[%f]' % 1.234
 '[1.234000]'
 >>> '[%5f]' % 1.234
 '[1.234000]'
 >>> '[%.1f]' % 1.234
 '[1.2]'
 >>> '[%5.1f]' % 1.234
 '[  1.2]'
 >>> '[%05.1f]' % 1.234
 '[001.2]'

The formatting types consist of the following:

 d   Signed integer decimal
 i   Signed integer decimal
 o   Unsigned octal
 u   Unsigned decimal
 x   Lowercase unsigned hexadecimal
 X   Uppercase unsigned hexadecimal
 e   Lowercase exponential format floating point
 E   Uppercase exponential format floating point
 f   Floating point decimal format
 g   Floating point: exponential format if exp < -4 or exp >= precision, decimal format otherwise
 G   Uppercase version of 'g'
 c   Single character: integer for chr(i) or length-one string
 r   Converts any Python object using repr()
 s   Converts any Python object using str()
 %   The '%' character, e.g.: '%%%d' % (1) --> '%1'
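Most of the formatting types can be exercised in one pass. The sketch below pairs each format code with a sample value and the result one would expect; the sample values are illustrative, and the exponent width shown for e and E is assumed to be the usual two digits.

```python
# (format code, value, expected result)
examples = [
    ('%d', 255, '255'),
    ('%o', 255, '377'),
    ('%x', 255, 'ff'),
    ('%X', 255, 'FF'),
    ('%e', 1234.5, '1.234500e+03'),
    ('%E', 1234.5, '1.234500E+03'),
    ('%f', 1234.5, '1234.500000'),
    ('%g', 1234.5, '1234.5'),            # small exponent: decimal format
    ('%g', 0.000012345, '1.2345e-05'),   # exponent below -4: exponential
    ('%c', 97, 'a'),
    ('%c', 'a', 'a'),
    ('%r', 'spam', "'spam'"),
    ('%s', 'spam', 'spam'),
]

# Wrap each value in a one-tuple so that every case is handled uniformly.
results = [fmt % (value,) for fmt, value, expected in examples]
literal_percent = '%%%d' % 1
```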

One more special format code style allows the use of a * in place of a length. In this case, the interpolated tuple must contain an extra element for the formatted length of each format code, preceding the value to format. For example:

 >>> "%0*d # %0*.2f" % (4, 123, 4, 1.23)
 '0123 # 1.23'
 >>> "%0*d # %0*.2f" % (6, 123, 6, 1.23)
 '000123 # 001.23'

A.3.3 Printing

The least-sophisticated form of textual output in Python is writing to open files. In particular, the STDOUT and STDERR streams can be accessed using the pseudo-files sys.stdout and sys.stderr. Writing to these is just like writing to any other file; for example:

 >>> import sys
 >>> try:
 ...    # some fragile action
 ...    sys.stdout.write('result of action\n')
 ... except:
 ...    sys.stderr.write('could not complete action\n')
 ...
 result of action

You cannot seek within STDOUT or STDERR; generally you should consider these as pure sequential outputs.

Writing to STDOUT and STDERR is fairly inflexible, and most of the time the print statement accomplishes the same purpose more flexibly. In particular, methods like sys.stdout.write() only accept a single string as an argument, while print can handle any number of arguments of any type. Each argument is coerced to a string using the equivalent of str(obj). For example:

 >>> print "Pi: %.3f" % 3.1415, 27+11, {3:4,1:2}, (1,2,3)
 Pi: 3.142 38 {1: 2, 3: 4} (1, 2, 3)

Each argument to the print statement is evaluated before it is printed, just as when an argument is passed to a function. As a consequence, the canonical representation of an object is printed, rather than the exact form passed as an argument. In my example, the dictionary prints in a different order than it was defined in, and the spacing of the tuple and dictionary is slightly different. String interpolation is also performed and is a very common means of defining an output format precisely.

There are a few things to watch for with the print statement. A space is printed between each argument to the statement. If you want to print several objects without a separating space, you will need to use string concatenation or string interpolation to get the right result. For example:

 >>> numerator, denominator = 3, 7
 >>> print repr(numerator)+"/"+repr(denominator)
 3/7
 >>> print "%d/%d" % (numerator, denominator)
 3/7

By default, a print statement adds a linefeed to the end of its output. You may eliminate the linefeed by adding a trailing comma to the statement, but you still wind up with a space added to the end:

 >>> letlist = ('a','B','Z','r','w')
 >>> for c in letlist: print c,   # inserts spaces
 ...
 a B Z r w

Assuming these spaces are unwanted, you must either use sys.stdout.write() or otherwise calculate the space-free string you want:

 >>> for c in letlist+('\n',): # no spaces
 ...     sys.stdout.write(c)
 ...
 aBZrw
 >>> print ''.join(letlist)
 aBZrw

There is a special form of the print statement that redirects its output somewhere other than STDOUT. The print statement itself can be followed by two greater-than signs, then a writable file-like object, then a comma, then the remainder of the (printed) arguments. For example:

 >>> print >> open('test','w'), "Pi: %.3f" % 3.1415, 27+11
 >>> open('test').read()
 'Pi: 3.142 38\n'

Some Python programmers (including your author) consider this special form overly "noisy," but it is occasionally useful for quick configuration of output destinations.

If you want a function that would do the same thing as a print statement, the following one does so, but without any facility to eliminate the trailing linefeed or redirect output:

 def print_func(*args):
     import sys
     # print coerces each argument with str(), so do the same here
     sys.stdout.write(' '.join(map(str,args))+'\n')

Readers could enhance this to add the missing capabilities, but using print as a statement is the clearest approach, generally.
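One possible enhancement is sketched below. The keyword names end and file are hypothetical choices made for this sketch, as is the small Collector class, which stands in for any writable file-like object. Arguments are coerced with str(), as the print statement does.

```python
import sys

def print_func(*args, **kw):
    # 'end' and 'file' are hypothetical option names chosen for this sketch
    end = kw.get('end', '\n')          # text written after the arguments
    out = kw.get('file', sys.stdout)   # any object with a .write() method
    out.write(' '.join(map(str, args)) + end)

class Collector:
    """A minimal writable file-like object that remembers what it is given."""
    def __init__(self):
        self.parts = []
    def write(self, text):
        self.parts.append(text)

out = Collector()
print_func('Pi:', 3.142, (1, 2), file=out)      # redirected, with linefeed
print_func('no linefeed', end='', file=out)     # trailing linefeed suppressed
captured = ''.join(out.parts)
```

Passing the options through a keyword dictionary keeps the positional arguments free for the values to be printed.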

A.3.4 Container Types

tuple

An immutable sequence of (heterogeneous) objects. Being immutable, the membership and length of a tuple cannot be modified after creation. However, tuple elements and subsequences can be accessed by subscripting and slicing, and new tuples can be constructed from such elements and slices. Tuples are similar to "records" in some other programming languages.

The constructor syntax for a tuple is commas between listed items; in many contexts, parentheses around the listed items are required to disambiguate a tuple from other constructs such as function arguments, but it is the commas, not the parentheses, that construct a tuple. Some examples:

 >>> tup = 'spam','eggs','bacon','sausage'
 >>> newtup = tup[1:3] + (1,2,3) + (tup[3],)
 >>> newtup
 ('eggs', 'bacon', 1, 2, 3, 'sausage')

The function tuple() may also be used to construct a tuple from another sequence type (either a list or custom sequence type).
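The record-like use of tuples, and the role of the comma in constructing them, might be sketched like this (the field names and values are illustrative):

```python
record = ('eggs', 12, 2.5)             # a heterogeneous, record-like tuple
name, count, price = record            # unpack the fields by position

single = ('spam',)                     # the comma makes a length-one tuple
not_tuple = ('spam')                   # parentheses alone yield a string

from_list = tuple(['eggs', 'bacon'])   # construct a tuple from a list
longer = record[:2] + (3.0,)           # a new tuple; record is unchanged
```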

list

A mutable sequence of objects. Like a tuple, list elements can be accessed by subscripting and slicing; unlike a tuple, list methods and index and slice assignments can modify the length and membership of a list object.

The constructor syntax for a list is surrounding square braces. An empty list may be constructed with no objects between the braces; a length-one list can contain simply an object name; longer lists separate each element object with commas. Indexing and slices, of course, also use square braces, but the syntactic contexts are different in the Python grammar (and common sense usually points out the difference). Some examples:

 >>> lst = ['spam', (1,2,3), 'eggs', 3.1415]
 >>> lst[:2]
 ['spam', (1, 2, 3)]

The function list() may also be used to construct a list from another sequence type (either a tuple or custom sequence type).
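The ways in which a list can be modified in place are illustrated in the following sketch (the element values are illustrative):

```python
lst = ['spam', 'eggs', 'bacon']

lst[0] = 'toast'               # index assignment replaces one element
lst[1:2] = [1, 2, 3]           # slice assignment can change the length
lst.append('sausage')          # list methods modify the object in place
del lst[-2]                    # removes 'bacon'

from_tuple = list(('a', 'b'))  # list() builds a list from a tuple
```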

dict

A mutable mapping between immutable keys and object values. At most one entry in a dict exists for a given key; adding the same key to a dictionary a second time overrides the previous entry (much as with binding a name in a namespace). Dicts are unordered, and entries are accessed by key as index; by creating lists of contained objects using the methods .keys(), .values(), and .items(); or, in recent Python versions, with the .popitem() method. All the dict methods generate contained objects in an unspecified order.

The constructor syntax for a dict is surrounding curly brackets. An empty dict may be constructed with no objects between the brackets. Each key/value pair entered into a dict is separated by a colon, and successive pairs are separated by commas. For example:

 >>> dct = {1:2, 3.14:(1+2j), 'spam':'eggs'}
 >>> dct['spam']
 'eggs'
 >>> dct['a'] = 'b'    # add item to dict
 >>> dct.items()
 [('a', 'b'), (1, 2), ('spam', 'eggs'), (3.14, (1+2j))]
 >>> dct.popitem()
 ('a', 'b')
 >>> dct
 {1: 2, 'spam': 'eggs', 3.14: (1+2j)}

In Python 2.2+, the function dict() may also be used to construct a dict from a sequence of pairs or from a custom mapping type. For example:

 >>> d1 = dict([('a','b'), (1,2), ('spam','eggs')])
 >>> d1
 {'a': 'b', 1: 2, 'spam': 'eggs'}
 >>> d2 = dict(zip([1,2,3],['a','b','c']))
 >>> d2
 {1: 'a', 2: 'b', 3: 'c'}
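The rule that a key occurs at most once, and the need to avoid relying on any particular ordering, can be sketched as follows (keys and values are illustrative):

```python
dct = {'spam': 'eggs', 'count': 1}
dct['spam'] = 'bacon'          # same key again: the old value is replaced
size = len(dct)                # still two entries
current = dct['spam']

# Sort the keys explicitly, since a dict makes no promise about order.
keys = list(dct.keys())
keys.sort()

# Two equivalent ways to construct a dict from pairs.
from_pairs = dict([('a', 1), ('b', 2)])
from_zip = dict(zip('ab', [1, 2]))

key, value = dct.popitem()     # removes and returns some key/value pair
```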

sets.Set

Python 2.3+ includes a standard module that implements a set datatype. For earlier Python versions, a number of developers have created third-party implementations of sets. If you have at least Python 2.2, you can download and use the sets module from <http://tinyurl.com/2d31> (or browse the Python CVS); you will need to add the definition True, False = 1, 0 to your local version, though.

A set is an unordered collection of hashable objects. Unlike a list, no object can occur in a set more than once; a set resembles a dict that has only keys but no values. Sets utilize bitwise and Boolean syntax to perform basic set-theoretic operations; a subset test does not have a special syntactic form, instead using the .issubset() and .issuperset() methods. You may also loop through set members in an unspecified order. Some examples illustrate the type:

 >>> from sets import Set
 >>> x = Set([1,2,3])
 >>> y = Set((3,4,4,6,6,2)) # init with any seq
 >>> print x, '//', y       # make sure dups removed
 Set([1, 2, 3]) // Set([2, 3, 4, 6])
 >>> print x | y            # union of sets
 Set([1, 2, 3, 4, 6])
 >>> print x & y            # intersection of sets
 Set([2, 3])
 >>> print y-x              # difference of sets
 Set([4, 6])
 >>> print x ^ y            # symmetric difference
 Set([1, 4, 6])

You can also check membership and iterate over set members:

 >>> 4 in y                # membership check
 1
 >>> x.issubset(y)         # subset check
 0
 >>> for i in y:
 ...     print i+10,
 ...
 12 13 14 16
 >>> from operator import add
 >>> plus_ten = Set(map(add, y, [10]*len(y)))
 >>> plus_ten
 Set([16, 12, 13, 14])

sets.Set also supports in-place modification of sets; sets.ImmutableSet, naturally, does not allow modification.

 >>> x = Set([1,2,3])
 >>> x |= Set([4,5,6])
 >>> x
 Set([1, 2, 3, 4, 5, 6])
 >>> x &= Set([4,5,6])
 >>> x
 Set([4, 5, 6])
 >>> x ^= Set([4, 5])
 >>> x
 Set([6])

A.3.5 Compound Types

class instance

A class instance defines a namespace, but this namespace's main purpose is usually to act as a data container (but a container that also knows how to perform actions; i.e., has methods). A class instance (or any namespace) acts very much like a dict in terms of creating a mapping between names and values. Attributes of a class instance may be set or modified using standard qualified names and may also be set within class methods by qualifying with the namespace of the first (implicit) method argument, conventionally called self. For example:

 >>> class Klass:
 ...     def setfoo(self, val):
 ...         self.foo = val
 ...
 >>> obj = Klass()
 >>> obj.bar = 'BAR'
 >>> obj.setfoo(['this','that','other'])
 >>> obj.bar, obj.foo
 ('BAR', ['this', 'that', 'other'])
 >>> obj.__dict__
 {'foo': ['this', 'that', 'other'], 'bar': 'BAR'}

Instance attributes often dereference to other class instances, thereby allowing hierarchically organized namespace qualification to indicate a data structure. Moreover, a number of "magic" methods named with leading and trailing double-underscores provide optional syntactic conveniences for working with instance data. The most common of these magic methods is .__init__(), which initializes an instance (often utilizing arguments). For example:

 >>> class Klass2:
 ...     def __init__(self, *args, **kw):
 ...         self.listargs = args
 ...         for key, val in kw.items():
 ...             setattr(self, key, val)
 ...
 >>> obj = Klass2(1, 2, 3, foo='FOO', bar=Klass2(baz='BAZ'))
 >>> obj.bar.blam = 'BLAM'
 >>> obj.listargs, obj.foo, obj.bar.baz, obj.bar.blam
 ((1, 2, 3), 'FOO', 'BAZ', 'BLAM')

There are quite a few additional "magic" methods that Python classes may define. Many of these methods let class instances behave more like basic datatypes (while still maintaining special class behaviors). For example, the .__str__() and .__repr__() methods control the string representation of an instance; the .__getitem__() and .__setitem__() methods allow indexed access to instance data (either dict-like named indices, or list-like numbered indices); methods like .__add__(), .__mul__(), .__pow__(), and .__abs__() allow instances to behave in number-like ways. The Python Reference Manual discusses magic methods in detail.
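A minimal sketch of a class that defines several of these magic methods follows. The class name Vector and its behavior are hypothetical, chosen only to show which syntax triggers which method.

```python
import math

class Vector:
    """A small numeric container defining a few magic methods."""
    def __init__(self, *values):
        self.values = list(values)
    def __repr__(self):                  # used by repr() and the prompt
        return 'Vector(%s)' % ', '.join(map(repr, self.values))
    def __str__(self):                   # used by str() and print
        return '<%s>' % ' '.join(map(str, self.values))
    def __getitem__(self, index):        # obj[index]
        return self.values[index]
    def __setitem__(self, index, val):   # obj[index] = val
        self.values[index] = val
    def __add__(self, other):            # obj + other
        return Vector(*[a + b for a, b in zip(self.values, other.values)])
    def __abs__(self):                   # abs(obj)
        return math.sqrt(sum([v * v for v in self.values]))

v = Vector(3, 4)
v[0] = 6                       # calls .__setitem__()
w = v + Vector(0, 4)           # calls .__add__()
length = abs(w)                # calls .__abs__()
```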

In Python 2.2 and above, you can also let instances behave more like basic datatypes by inheriting classes from these built-in types. For example, suppose you need a datatype whose "shape" contains both a mutable sequence of elements and a .foo attribute. Two ways to define this datatype are:

 >>> class FooList(list):        # works only in Python 2.2+
 ...     def __init__(self, lst=[], foo=None):
 ...         list.__init__(self, lst)
 ...         self.foo = foo
 ...
 >>> foolist = FooList([1,2,3], 'FOO')
 >>> foolist[1], foolist.foo
 (2, 'FOO')
 >>> class oldFooList:           # works in older Pythons
 ...     def __init__(self, lst=[], foo=None):
 ...         self._lst, self.foo = lst, foo
 ...     def append(self, item):
 ...         self._lst.append(item)
 ...     def __getitem__(self, item):
 ...         return self._lst[item]
 ...     def __setitem__(self, item, val):
 ...         self._lst[item] = val
 ...     def __delitem__(self, item):
 ...         del self._lst[item]
 ...
 >>> foolst2 = oldFooList([1,2,3], 'FOO')
 >>> foolst2[1], foolst2.foo
 (2, 'FOO')

If you need more complex datatypes than the basic types, or even than an instance whose class has magic methods, often these can be constructed by using instances whose attributes are bound in link-like fashion to other instances. Such bindings can be constructed according to various topologies, including circular ones (such as for modeling graphs). As a simple example, you can construct a binary tree in Python using the following node class:

 >>> class Node:
 ...     def __init__(self, left=None, value=None, right=None):
 ...         self.left, self.value, self.right = left, value, right
 ...     def __repr__(self):
 ...         return self.value
 ...
 >>> tree = Node(Node(value="Left Leaf"),
 ...             "Tree Root",
 ...             Node(left=Node(value="RightLeft Leaf"),
 ...                  right=Node(value="RightRight Leaf") ))
 >>> tree,tree.left,tree.left.left,tree.right.left,tree.right.right
 (Tree Root, Left Leaf, None, RightLeft Leaf, RightRight Leaf)

In practice, you would probably bind intermediate nodes to names, in order to allow easy pruning and rearrangement.
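A sketch of what such naming might look like follows. The helper walk() and the value 'Right Branch' are hypothetical additions, and .__repr__() here wraps the value in str() so that a node with no value still has a printable representation.

```python
class Node:
    def __init__(self, left=None, value=None, right=None):
        self.left, self.value, self.right = left, value, right
    def __repr__(self):
        return str(self.value)

def walk(node):
    """Return the values of a subtree in left-to-right (in-order) order."""
    if node is None:
        return []
    return walk(node.left) + [node.value] + walk(node.right)

# Bind the intermediate nodes to names while building the tree.
left_leaf = Node(value='Left Leaf')
right_branch = Node(left=Node(value='RightLeft Leaf'),
                    value='Right Branch',
                    right=Node(value='RightRight Leaf'))
tree = Node(left_leaf, 'Tree Root', right_branch)
before = walk(tree)

# The names make pruning and rearrangement direct.
right_branch.right = None                        # prune one leaf
tree.left, tree.right = tree.right, tree.left    # swap the two subtrees
after = walk(tree)
```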

SEE ALSO: int; float; list; string; tuple; UserDict; UserList; UserString