2.2 Standard Modules

2.2.1 Basic String Transformations

The string module forms the core of Python's text manipulation libraries, and it is certainly the place to look before turning to other modules. Note that most of the functions in the string module have also been available as methods of string objects since Python 1.6. Moreover, string object methods are slightly faster than the corresponding module functions. A few newer string object methods have no equivalents in the string module, but they are still documented here.

SEE ALSO: str 33; UserString 33;

string • A collection of string operations

There are a number of general things to notice about the functions in the string module (which is composed entirely of functions and constants; no classes).

  1. Strings are immutable (as discussed in Chapter 1). This means that there is no such thing as changing a string "in place" (as we might do in many other languages, such as C, by changing the bytes at certain offsets within the string). Whenever a string module function takes a string object as an argument, it returns a brand-new string object and leaves the original one as is. However, the very common pattern of rebinding the name on the left of an assignment to the result of a string module function applied to that same name somewhat conceals this fact. For example:

 >>> import string
 >>> str = "Mary had a little lamb"
 >>> str = string.replace(str, 'had', 'ate')
 >>> str
 'Mary ate a little lamb'

    The first string object never gets modified per se; but since the first string object is no longer bound to any name after the example runs, the object is subject to garbage collection and will disappear from memory. In short, calling a string module function will not change any existing strings, but rebinding a name can make it look like they changed.

  2. Many string module functions are now also available as string object methods. To use these string object methods, there is no need to import the string module, and the expression is usually slightly more concise. Moreover, using a string object method is usually slightly faster than the corresponding string module function. However, the most thorough documentation of each function/method that exists as both a string module function and a string object method is contained in this reference to the string module.

  3. The form string.join(string.split (...)) is a frequent Python idiom. A more thorough discussion is contained in the reference items for string.join() and string.split(), but in general, combining these two functions is very often a useful way of breaking down a text, processing the parts, then putting together the pieces.

  4. Think about clever string.replace() patterns. By combining multiple string.replace() calls with use of "place holder" string patterns, a surprising range of results can be achieved (especially when also manipulating the intermediate strings with other techniques). See the reference item for string.replace() for some discussion and examples.

  5. A mutable string of sorts can be obtained by using built-in lists, or the array module. Lists can contain a collection of substrings, each one of which may be replaced or modified individually. The array module can define arrays of individual characters, with each position modifiable, including via slice notation. The function string.join() or the method "".join() may be used to re-create true strings; for example:

 >>> lst = ['spam','and','eggs']
 >>> lst[2] = 'toast'
 >>> print ''.join(lst)
 spamandtoast
 >>> print ' '.join(lst)
 spam and toast

    Or:

 >>> import array
 >>> a = array.array('c','spam and eggs')
 >>> print ''.join(a)
 spam and eggs
 >>> a[0] = 'S'
 >>> print ''.join(a)
 Spam and eggs
 >>> a[-4:] = array.array('c','toast')
 >>> print ''.join(a)
 Spam and toast
CONSTANTS

The string module contains constants for a number of frequently used collections of characters. Each of these constants is itself simply a string (rather than a list, tuple, or other collection). As such, it is easy to define constants alongside those provided by the string module, should you need them. For example:

 >>> import string
 >>> string.brackets = "[]{}()<>"
 >>> print string.brackets
 []{}()<>
string.digits

The decimal numerals ("0123456789").

string.hexdigits

The hexadecimal numerals ("0123456789abcdefABCDEF").

string.octdigits

The octal numerals ("01234567").

string.lowercase

The lowercase letters; can vary by language. In English versions of Python (most systems):

 >>> import string
 >>> string.lowercase
 'abcdefghijklmnopqrstuvwxyz'

You should not modify string.lowercase to handle a different source text language, but rather define a new attribute, such as string.spanish_lowercase, with an appropriate string (some functions depend on this constant).

string.uppercase

The uppercase letters; can vary by language. In English versions of Python (most systems):

 >>> import string
 >>> string.uppercase
 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You should not modify string.uppercase to handle a different source text language, but rather define a new attribute, such as string.spanish_uppercase, with an appropriate string (some functions depend on this constant).

string.letters

All the letters (string.lowercase+string.uppercase).

string.punctuation

The characters normally considered as punctuation; can vary by language. In English versions of Python (most systems):

 >>> import string
 >>> string.punctuation
 '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
string.whitespace

The "empty" characters. Normally these consist of tab, linefeed, vertical tab, formfeed, carriage return, and space (in that order):

 >>> import string
 >>> string.whitespace
 '\011\012\013\014\015 '

You should not modify string.whitespace (some methods depend on this constant).

string.printable

All the characters that can be printed to any device; can vary by language (string.digits+string.letters+string.punctuation+string.whitespace).

FUNCTIONS
string.atof(s=...)

Deprecated. Use float().

Converts a string to a floating point value.

SEE ALSO: eval() 445; float() 422;

string.atoi(s=...[,base=10])

Deprecated. Use int() instead; in Python 2.0+, int() also accepts an optional base argument.

Converts a string to an integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).

SEE ALSO: eval() 445; int() 421; long() 422;

string.atol(s=...[,base=10])

Deprecated. Use long() instead; in Python 2.0+, long() also accepts an optional base argument.

Converts a string to an unlimited length integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).
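
For instance, under Python 2.0+ the built-ins cover the same ground directly (a brief illustrative session, not part of the module itself):

 >>> int("42"), float("42.5")
 (42, 42.5)
 >>> int("1010", 2), long("ff", 16)
 (10, 255L)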

SEE ALSO: eval() 445; long() 422; int() 421;

string.capitalize(s=...)
"".capitalize()

Return a string consisting of the initial character converted to uppercase (if applicable), and all other characters converted to lowercase (if applicable):

 >>> import string
 >>> string.capitalize("mary had a little lamb!")
 'Mary had a little lamb!'
 >>> string.capitalize("Mary had a Little Lamb!")
 'Mary had a little lamb!'
 >>> string.capitalize("2 Lambs had Mary!")
 '2 lambs had mary!'

For Python 1.6+, use of a string object method is marginally faster and is stylistically preferred in most cases:

 >>> "mary had a little lamb".capitalize()
 'Mary had a little lamb'

SEE ALSO: string.capwords() 133; string.lower() 138;

string.capwords(s=...)
"".title()

Return a string consisting of the capitalized words. An equivalent expression is:

 string.join(map(string.capitalize, string.split(s)))

But string.capwords() is a clearer way of writing it. An effect of this implementation is that whitespace is "normalized" by the process:

 >>> import string
 >>> string.capwords("mary HAD a little lamb!")
 'Mary Had A Little Lamb!'
 >>> string.capwords("Mary     had a      Little Lamb!")
 'Mary Had A Little Lamb!'

With the creation of string methods in Python 1.6, the functionality of the module function string.capwords() became available as the string object method "".title().

SEE ALSO: string.capitalize() 132; string.lower() 138; "".istitle() 136;

string.center(s=..., width=...)
"".center(width)

Return a string with s padded with symmetrical leading and trailing spaces (but not truncated) to occupy length width (or more).

 >>> import string
 >>> string.center(width=30,s="Mary had a little lamb")
 '    Mary had a little lamb    '
 >>> string.center("Mary had a little lamb", 5)
 'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a little lamb".center(25)
 '  Mary had a little lamb '

SEE ALSO: string.ljust() 138; string.rjust() 141;

string.count(s, sub [,start [,end]])
"".count(sub [,start [,end]])

Return the number of nonoverlapping occurrences of sub in s. If the optional third or fourth arguments are specified, only the corresponding slice of s is examined.

 >>> import string
 >>> string.count("mary had a little lamb", "a")
 4
 >>> string.count("mary had a little lamb", "a", 3, 10)
 2

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary had a little lamb'.count("a")
 4
"".endswith(suffix [,start [,end]])

This string method does not have an equivalent in the string module. Return a Boolean value indicating whether the string ends with the suffix suffix. If the optional second argument start is specified, only consider the terminal substring after offset start. If the optional third argument end is given, only consider the slice [start:end].
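
For example (a quick illustration; Boolean results display as 1/0 before Python 2.3, True/False thereafter):

 >>> "mary had a little lamb".endswith("lamb")
 1
 >>> "mary had a little lamb".endswith("lamb", 0, 10)
 0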

SEE ALSO: "".startswith() 144; string.find() 135;

string.expandtabs(s=...[,tabsize=8])
"".expandtabs([tabsize=8])

Return a string with tabs replaced by a variable number of spaces. The replacement causes text blocks to line up at "tab stops." If no second argument is given, the new string will line up at multiples of 8 spaces. A newline implies a new set of tab stops.

 >>> import string
 >>> s = 'mary\011had a little lamb'
 >>> print s
 mary    had a little lamb
 >>> string.expandtabs(s, 16)
 'mary            had a little lamb'
 >>> string.expandtabs(tabsize=1, s=s)
 'mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary\011had a little lamb'.expandtabs(25)
 'mary                     had a little lamb'
string.find(s, sub [,start [,end]])
"".find(sub [,start [,end]])

Return the index position of the first occurrence of sub in s. If the optional third or fourth arguments are specified, only the corresponding slice of s is examined (but result is position in s as a whole). Return -1 if no occurrence is found. Position is zero-based, as with Python list indexing:

 >>> import string
 >>> string.find("mary had a little lamb", "a")
 1
 >>> string.find("mary had a little lamb", "a", 3, 10)
 6
 >>> string.find("mary had a little lamb", "b")
 21
 >>> string.find("mary had a little lamb", "b", 3, 10)
 -1

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary had a little lamb'.find("ad")
 6

SEE ALSO: string.index() 135; string.rfind() 140;

string.index(s, sub [,start [,end]])
"".index(sub [,start [,end]])

Return the same value as does string.find() with same arguments, except raise ValueError instead of returning -1 when sub does not occur in s.

 >>> import string
 >>> string.index("mary had a little lamb", "b")
 21
 >>> string.index("mary had a little lamb", "b", 3, 10)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "d:/py20sl/lib/string.py", line 139, in index
     return s.index(*args)
 ValueError: substring not found in string.index

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary had a little lamb'.index("ad")
 6

SEE ALSO: string.find() 135; string.rindex() 141;

Several string methods return Boolean values indicating whether a string has a certain property. None of the .is*() methods, however, have equivalents in the string module:

"".isalpha()

Return a true value if all the characters are alphabetic.

"".isalnum()

Return a true value if all the characters are alphanumeric.

"".isdigit()

Return a true value if all the characters are digits.

"".islower()

Return a true value if all the characters are lowercase and there is at least one cased character:

 >>> "ab123".islower(), '123'.islower(), 'Ab123'.islower()
 (1, 0, 0)

SEE ALSO: "".lower() 138;

"".isspace()

Return a true value if all the characters are whitespace.

"".istitle()

Return a true value if the string has title casing (each word capitalized).

SEE ALSO: "".title() 133;

"".isupper()

Return a true value if all the characters are uppercase and there is at least one cased character.

SEE ALSO: "".upper() 146;
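
A few quick illustrations of these predicates (truth values display as 1/0 before Python 2.3):

 >>> "Lamb123".isalnum(), "Lamb123".isalpha(), "123".isdigit()
 (1, 0, 1)
 >>> " \011\012".isspace(), "Mary Had A Lamb".istitle(), "LAMB".isupper()
 (1, 1, 1)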

string.join(words=...[,sep=" "])
"".join (words)

Return a string that results from concatenating the elements of the list words together, with sep between each. The function string.join() differs from all other string module functions in that it takes a list (of strings) as a primary argument, rather than a string.

It is worth noting string.join() and string.split() are inverse functions if sep is specified to both; in other words, string.join(string.split(s,sep),sep)==s for all s and sep.
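
For instance, a quick check of this round-trip property (an illustrative session):

 >>> import string
 >>> s = 'mary+had+a+little+lamb'
 >>> string.join(string.split(s, '+'), '+') == s
 1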

Typically, string.join() is used in contexts where it is natural to generate lists of strings. For example, here is a small program to output the list of all-capital words from STDIN to STDOUT, one per line:

list_capwords.py
 import string, sys
 capwords = []
 for line in sys.stdin.readlines():
     for word in line.split():
         if word == word.upper() and word.isalpha():
             capwords.append(word)
 print string.join(capwords, '\n')

The technique in the sample list_capwords.py script can be considerably more efficient than building up a string by direct concatenation. However, Python 2.0's augmented assignment reduces the performance difference:

 >>> import string
 >>> s = "Mary had a little lamb"
 >>> t = "its fleece was white as snow"
 >>> s = s +" "+ t    # relatively "expensive" for big strings
 >>> s += " " + t     # "cheaper" than Python 1.x style
 >>> lst = [s]
 >>> lst.append(t)    # "cheapest" way of building long string
 >>> s = string.join(lst)

For Python 1.6+, use of a string object method is stylistically preferred in some cases. However, just as string.join() is special in taking a list as a first argument, the string object method "".join() is unusual in being an operation on the (optional) sep string, not on the (required) words list (this surprises many new Python programmers).

SEE ALSO: string.split() 142;

string.joinfields(...)

Identical to string.join().

string.ljust(s=..., width=...)
"".ljust(width)

Return a string with s padded with trailing spaces (but not truncated) to occupy length width (or more).

 >>> import string
 >>> string.ljust(width=30,s="Mary had a little lamb")
 'Mary had a little lamb        '
 >>> string.ljust("Mary had a little lamb", 5)
 'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a little lamb".ljust(25)
 'Mary had a little lamb   '

SEE ALSO: string.rjust() 141; string.center() 133;

string.lower(s=...)
"".lower()

Return a string with any uppercase letters converted to lowercase.

 >>> import string
 >>> string.lower("mary HAD a little lamb!")
 'mary had a little lamb!'
 >>> string.lower("Mary had a Little Lamb!")
 'mary had a little lamb!'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a Little Lamb!".lower()
 'mary had a little lamb!'

SEE ALSO: string.upper() 146;

string.lstrip(s=...)
"".lstrip([chars=string.whitespace])

Return a string with leading whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> import string
 >>> s = """
 ...     Mary had a little lamb      \011"""
 >>> string.lstrip(s)
 'Mary had a little lamb      \011'
 >>> s.lstrip()
 'Mary had a little lamb      \011'

Python 2.3+ accepts the optional argument chars to the string object method; all leading characters contained in the string chars will be removed.

SEE ALSO: string.rstrip() 142; string.strip() 144;

string.maketrans(from, to)

Return a translation table string for use with string.translate(). The strings from and to must be the same length. A translation table is a string of 256 successive byte values, where each position defines a translation from the chr() value of the index to the character contained at that index position.

 >>> import string
 >>> ord('A')
 65
 >>> ord('z')
 122
 >>> string.maketrans('ABC','abc')[65:123]
 'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
 >>> string.maketrans('ABCxyz','abcXYZ')[65:123]
 'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwXYZ'

SEE ALSO: string.translate() 145;

string.replace(s=..., old=..., new=...[,maxsplit=...])
"".replace(old, new [,maxsplit])

Return a string based on s with occurrences of old replaced by new. If the fourth argument maxsplit is specified, only replace maxsplit initial occurrences.

 >>> import string
 >>> string.replace("Mary had a little lamb", "a little", "some")
 'Mary had some lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a little lamb".replace("a little", "some")
 'Mary had some lamb'

A common "trick" involving string.replace() is to use it multiple times to achieve a goal. Obviously, simply to replace several different substrings in a string, multiple string.replace() operations are almost inevitable. But there is another class of cases where string.replace() can be used to create an intermediate string with "placeholders" for an original substring in a particular context. The same goal can always be achieved with regular expressions, but sometimes staged string.replace() operations are both faster and easier to program:

 >>> import string
 >>> line = 'variable = val      # see comments #3 and #4'
 >>> # we'd like '#3' and '#4' spelled out within comment
 >>> string.replace(line,'#','number ')                 # doesn't work
 'variable = val      number  see comments number 3 and number 4'
 >>> place_holder = string.replace(line,' # ',' !!! ')  # insert placeholder
 >>> place_holder
 'variable = val      !!! see comments #3 and #4'
 >>> place_holder = place_holder.replace('#','number ') # almost there
 >>> place_holder
 'variable = val      !!! see comments number 3 and number 4'
 >>> line = string.replace(place_holder,'!!!','#')      # restore original
 >>> line
 'variable = val      # see comments number 3 and number 4'

Obviously, for jobs like this, a placeholder must be chosen so that it never occurs within the strings undergoing the "staged transformation"; but that is generally possible, since placeholders may be as long as needed.

SEE ALSO: string.translate() 145; mx.TextTools.replace() 314;

string.rfind(s, sub [,start [,end]])
"".rfind(sub [,start [,end]])

Return the index position of the last occurrence of sub in s. If the optional third or fourth arguments are specified, only the corresponding slice of s is examined (but result is position in s as a whole). Return -1 if no occurrence is found. Position is zero-based, as with Python list indexing:

 >>> import string
 >>> string.rfind("mary had a little lamb", "a")
 19
 >>> string.rfind("mary had a little lamb", "a", 3, 10)
 9
 >>> string.rfind("mary had a little lamb", "b")
 21
 >>> string.rfind("mary had a little lamb", "b", 3, 10)
 -1

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary had a little lamb'.rfind("ad")
 6

SEE ALSO: string.rindex() 141; string.find() 135;

string.rindex(s, sub [,start [,end]])
"".rindex(sub [,start [,end]])

Return the same value as does string.rfind() with same arguments, except raise ValueError instead of returning -1 when sub does not occur in s.

 >>> import string
 >>> string.rindex("mary had a little lamb", "b")
 21
 >>> string.rindex("mary had a little lamb", "b", 3, 10)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "d:/py20sl/lib/string.py", line 148, in rindex
     return s.rindex(*args)
 ValueError: substring not found in string.rindex

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> 'mary had a little lamb'.rindex("ad")
 6

SEE ALSO: string.rfind() 140; string.index() 135;

string.rjust(s=..., width=...)
"".rjust(width)

Return a string with s padded with leading spaces (but not truncated) to occupy length width (or more).

 >>> import string
 >>> string.rjust(width=30,s="Mary had a little lamb")
 '        Mary had a little lamb'
 >>> string.rjust("Mary had a little lamb", 5)
 'Mary had a little lamb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a little lamb".rjust(25)
 '   Mary had a little lamb'

SEE ALSO: string.ljust() 138; string.center() 133;

string.rstrip(s=...)
"".rstrip([chars=string.whitespace])

Return a string with trailing whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> import string
 >>> s = """
 ...     Mary had a little lamb       \011"""
 >>> string.rstrip(s)
 '\012    Mary had a little lamb'
 >>> s.rstrip()
 '\012    Mary had a little lamb'

Python 2.3+ accepts the optional argument chars to the string object method; all trailing characters contained in the string chars will be removed.

SEE ALSO: string.lstrip() 139; string.strip() 144;

string.split(s=...[,sep=...[,maxsplit=...]])
"".split([sep [,maxsplit]])

Return a list of nonoverlapping substrings of s. If the second argument sep is specified, the substrings are divided around the occurrences of sep. If sep is not specified, the substrings are divided around any whitespace characters. The dividing strings do not appear in the resultant list. If the third argument maxsplit is specified, everything "left over" after splitting maxsplit parts is appended to the list, giving the list at most maxsplit+1 elements.

 >>> import string
 >>> s = 'mary had a little lamb    ...with a glass of sherry'
 >>> string.split(s, ' a ')
 ['mary had', 'little lamb    ...with', 'glass of sherry']
 >>> string.split(s)
 ['mary', 'had', 'a', 'little', 'lamb', '...with', 'a', 'glass', 'of', 'sherry']
 >>> string.split(s,maxsplit=5)
 ['mary', 'had', 'a', 'little', 'lamb', '...with a glass of sherry']

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a Little Lamb!".split()
 ['Mary', 'had', 'a', 'Little', 'Lamb!']

The string.split() function (and corresponding string object method) is surprisingly versatile for working with texts, especially ones that resemble prose. Its default behavior of treating all whitespace as a single divider allows string.split() to act as a quick-and-dirty word parser:

 >>> wc = lambda s: len(s.split())
 >>> wc("Mary had a Little Lamb")
 5
 >>> s = """Mary had a Little Lamb
 ... its fleece as white as snow.
 ... And everywhere that Mary went ...
 ...  the lamb was sure to go."""
 >>> print s
 Mary had a Little Lamb
 its fleece as white as snow.
 And everywhere that Mary went ...
  the lamb was sure to go.
 >>> wc(s)
 23

The function string.split() is very often used in conjunction with string.join(). The pattern involved is "pull the string apart, modify the parts, put it back together." Often the parts will be words, but this also works with lines (dividing on \n) or other chunks. For example:

 >>> import string
 >>> s = """Mary had a Little Lamb
 ... its fleece as white as snow.
 ... And everywhere that Mary went ...
 ...  the lamb was sure to go."""
 >>> string.join(string.split(s))
 'Mary had a Little Lamb its fleece as white as snow. And everywhere that Mary went ... the lamb was sure to go.'

A Python 1.6+ idiom for string object methods expresses this technique compactly:

 >>> "-".join(s.split())
 'Mary-had-a-Little-Lamb-its-fleece-as-white-as-snow.-And-everywhere-that-Mary-went-...-the-lamb-was-sure-to-go.'

SEE ALSO: string.join() 137; mx.TextTools.setsplit() 314; mx.TextTools.charsplit() 311; mx.TextTools.splitat() 315; mx.TextTools.splitlines() 315;

string.splitfields(...)

Identical to string.split().

"".splitlines([keepends=0])

This string method does not have an equivalent in the string module. Return a list of lines in the string. The optional argument keepends determines whether line break character(s) are included in the line strings.
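
For example (an illustrative session, with string escapes shown in Python 2.1+ style):

 >>> "mary\nhad a\r\nlittle lamb".splitlines()
 ['mary', 'had a', 'little lamb']
 >>> "mary\nhad a\r\nlittle lamb".splitlines(1)
 ['mary\n', 'had a\r\n', 'little lamb']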

"".startswith(prefix [,start [,end]])

This string method does not have an equivalent in the string module. Return a Boolean value indicating whether the string begins with the prefix prefix. If the optional second argument start is specified, only consider the terminal substring after the offset start. If the optional third argument end is given, only consider the slice [start: end].
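
For example (a quick illustration; Boolean results display as 1/0 before Python 2.3):

 >>> "mary had a little lamb".startswith("mary")
 1
 >>> "mary had a little lamb".startswith("lamb", 18)
 1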

SEE ALSO: "".endswith() 134; string.find() 135;

string.strip(s=...)
"".strip([chars=string.whitespace])

Return a string with leading and trailing whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> import string
 >>> s = """
 ...     Mary had a little lamb     \011"""
 >>> string.strip(s)
 'Mary had a little lamb'
 >>> s.strip()
 'Mary had a little lamb'

Python 2.3+ accepts the optional argument chars to the string object method; all leading and trailing characters contained in the string chars will be removed.

 >>> s = "MARY had a LITTLE lamb STEW"
 >>> s.strip("ABCDEFGHIJKLMNOPQRSTUVWXYZ") # strip caps
 ' had a LITTLE lamb '

SEE ALSO: string.rstrip() 142; string.lstrip() 139;

string.swapcase(s=...)
"".swapcase()

Return a string with any uppercase letters converted to lowercase and any lowercase letters converted to uppercase.

 >>> import string
 >>> string.swapcase("mary HAD a little lamb!")
 'MARY had A LITTLE LAMB!'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a Little Lamb!".swapcase()
 'mARY HAD A lITTLE lAMB!'

SEE ALSO: string.upper() 146; string.lower() 138;

string.translate(s=..., table=...[,deletechars=""])
"".translate(table [,deletechars=""])

Return a string, based on s, with deletechars deleted (if the third argument is specified) and with any remaining characters translated according to the translation table.

 >>> import string
 >>> tab = string.maketrans('ABC','abc')
 >>> string.translate('MARY HAD a little LAMB', tab, 'Atl')
 'MRY HD a ie LMb'

For Python 1.6+, use of a string object method is stylistically preferred in many cases. However, if string.maketrans() is used to create the translation table, one will need to import the string module anyway:

 >>> 'MARY HAD a little LAMB'.translate(tab, 'Atl')
 'MRY HD a ie LMb'

The string.translate() function is a very fast way to modify a string. Setting up the translation table takes some getting used to, but the resultant transformation is much faster than a procedural technique such as:

 >>> (new,frm,to,dlt) = ("",'ABC','abc','Alt')
 >>> for c in 'MARY HAD a little LAMB':
 ...     if c not in dlt:
 ...         pos = frm.find(c)
 ...         if pos == -1: new += c
 ...         else:         new += to[pos]
 ...
 >>> new
 'MRY HD a ie LMb'

SEE ALSO: string.maketrans() 139;

string.upper(s=...)
"".upper()

Return a string with any lowercase letters converted to uppercase.

 >>> import string
 >>> string.upper("mary HAD a little lamb!")
 'MARY HAD A LITTLE LAMB!'
 >>> string.upper("Mary had a Little Lamb!")
 'MARY HAD A LITTLE LAMB!'

For Python 1.6+, use of a string object method is stylistically preferred in many cases:

 >>> "Mary had a Little Lamb!".upper()
 'MARY HAD A LITTLE LAMB!'

SEE ALSO: string.lower() 138;

string.zfill(s=..., width=...)

Return a string with s padded with leading zeros (but not truncated) to occupy length width (or more). If a leading sign is present, it "floats" to the beginning of the return value. In general, string.zfill() is designed for alignment of numeric values, but no checking is done to see if a string looks number-like.

 >>> import string
 >>> string.zfill("this", 20)
 '0000000000000000this'
 >>> string.zfill("-37", 20)
 '-0000000000000000037'
 >>> string.zfill("+3.7", 20)
 '+00000000000000003.7'

Based on the example of string.rjust(), one might expect a string object method "".zfill() ; however, no such method exists.

SEE ALSO: string.rjust() 141;

2.2.2 Strings as Files, and Files as Strings

In many ways, strings and files do a similar job. Both provide a storage container for an unlimited amount of (textual) information that is directly structured only by linear position of the bytes. A first inclination is to suppose that the difference between files and strings is one of persistence: files hang around when the current program is no longer running. But that distinction is not really tenable. On the one hand, standard Python modules like shelve, pickle, and marshal and third-party modules like xml_pickle and ZODB provide simple ways of making strings persist (but not thereby correspond in any direct way to a filesystem). On the other hand, many files are not particularly persistent: Special files like STDIN and STDOUT under Unix-like systems exist only for program life; other peculiar files like /dev/cua0 and similar "device files" are really just streams; and even files that live on transient memory disks, or get deleted with program cleanup, are not very persistent.

The real difference between files and strings in Python is no more or less than the set of techniques available to operate on them. File objects can do things like .read() and .seek() on themselves. Notably, file objects have a concept of a "current position" that emulates an imaginary "read-head" passing over the physical storage media. Strings, on the other hand, can be sliced and indexed (for example, str[4:10] or for c in str:) and can be processed with string object methods and with functions of modules like string and re. Moreover, a number of special-purpose Python objects act "file-like" without quite being files; for example, the objects returned by gzip.open() and urllib.urlopen(). Of course, Python itself does not impose any strict condition for just how "file-like" something has to be to work in a file-like context. A programmer has to figure that out for each type of object she wishes to apply techniques to (but most of the time things "just work" right).

Happily, Python provides some standard modules to make files and strings easily interoperable.

mmap • Memory-mapped file support

The mmap module allows a programmer to create "memory-mapped" file objects. These special mmap objects enable most of the techniques you might apply to "true" file objects and simultaneously most of the techniques you might apply to "true" strings. Keep in mind the hinted caveat about "most," however: Many string module functions are implemented using the corresponding string object methods. Since a mmap object is only somewhat "string-like," it basically only implements the .find() method and those "magic" methods associated with slicing and indexing. This is enough to support most string object idioms.

When a string-like change is made to a mmap object, that change is propagated to the underlying file, and the change is persistent (assuming the underlying file is persistent, and that the object called .flush() before destruction). mmap thereby provides an efficient route to "persistent strings."

Some examples of working with memory-mapped file objects are worth looking at:

 >>> # Create a file with some test data
 >>> open('test','w').write(' #'.join(map(str, range(1000))))
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(),1000)
 >>> len(mm)
 1000
 >>> mm[-20:]
 '218 #219 #220 #221 #'
 >>> import string   # apply a string module function
 >>> mm.seek(string.find(mm, '21'))
 >>> mm.read(10)
 '21 #22 #23'
 >>> mm.read(10)     # next ten bytes
 ' #24 #25 #'
 >>> mm.find('21')   # object method to find next occurrence
 402
 >>> try: string.rfind(mm, '21')
 ... except AttributeError: print "Unsupported string function"
 ...
 Unsupported string function
 >>> import re       # needed for the regular expression below
 >>> '/'.join(re.findall('..21..',mm))   # regex's work nicely
 ' #21 #/#121 #/ #210 / #212 / #214 / #216 / #218 /#221 #'

It is worth emphasizing that the bytes in a file on disk are in fixed positions. You may use the mmap.mmap.resize() method to write into different portions of a file, but you cannot expand the file from the middle, only by adding to the end.

CLASSES
mmap.mmap(fileno, length [,tagname]) (Windows)
mmap.mmap(fileno, length [,flags=MAP_SHARED, prot=PROT_READ|PROT_WRITE])

Create a new memory-mapped file object. fileno is the numeric file handle to base the mapping on. Generally this number should be obtained using the .fileno() method of a file object. length specifies the length of the mapping. Under Windows, the value 0 may be given for length to specify the current length of the file. If a length smaller than the current file is specified, only the initial portion of the file will be mapped. If a length larger than the current file is specified, the file can be extended with additional string content.

The underlying file for a memory-mapped file object must be opened for updating, using the "+" mode modifier.

According to the official Python documentation for Python 2.1, a third argument tagname may be specified. If it is, multiple memory-maps against the same file are created. In practice, however, each instance of mmap.mmap() creates a new memory-map whether or not a tagname is specified. In any case, this allows multiple file-like updates to the same underlying file, generally at different positions in the file.

 >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm1 = mmap.mmap(fp.fileno(),1000)
 >>> mm2 = mmap.mmap(fp.fileno(),1000)
 >>> mm1.seek(500)
 >>> mm1.read(10)
 '122 #123 #'
 >>> mm2.read(10)
 '0 #1 #2 #3'

Under Unix, the third argument flags may be MAP_PRIVATE or MAP_SHARED. If MAP_SHARED is specified for flags, all processes mapping the file will see the changes made to a mmap object. Otherwise, the changes are restricted to the current process. The fourth argument, prot, may be used to disallow certain types of access by other processes to the mapped file regions.

METHODS
mmap.mmap.close()

Close the memory-mapped file object. Subsequent calls to the other methods of the mmap object will raise an exception. Under Windows, the behavior of a mmap object after .close() is somewhat erratic, however. Note that closing the memory-mapped file object is not the same as closing the underlying file object. Closing the underlying file will make the contents inaccessible, but closing the memory-mapped file object will not affect the underlying file object.

SEE ALSO: FILE.close() 16;

mmap.mmap.find(sub [,pos])

Similar to string.find() . Return the index position of the first occurrence of sub in the mmap object. If the optional second argument pos is specified, the result is the offset returned relative to pos. Return -1 if no occurrence is found:

 >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(), 0)
 >>> mm.find('21')
 74
 >>> mm.find('21',100)
 -26
 >>> mm.tell()
 0

SEE ALSO: mmap.mmap.seek() 152; string.find() 135;

mmap.mmap.flush([offset, size])

Write changes made in memory to the mmap object back to disk. The first argument offset and the second argument size must either both be specified or both be omitted. If offset and size are specified, only the region of length size starting at position offset will be written back to disk.

mmap.mmap.flush() is necessary to guarantee that changes are written to disk; however, no guarantee is given that changes will not be written to disk as part of normal Python interpreter housekeeping. mmap should not be used for systems with "cancelable" changes (since changes may not be cancelable).

SEE ALSO: FILE.flush() 16;

mmap.mmap.move(target, source, length)

Copy a substring within a memory-mapped file object. The length of the substring is the third argument length. The target location is the first argument target. The substring is copied from the position source. It is allowable to have the substring's original position overlap its target range, but it must not go past the last position of the mmap object.

 >>> open('test','w').write(''.join([c*10 for c in 'ABCDE']))
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(),0)
 >>> mm[:]
 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEE'
 >>> mm.move(40,0,5)
 >>> mm[:]
 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDAAAAAEEEEE'
mmap.mmap.read(num)

Return a string containing num bytes, starting at the current file position. The file position is moved to the end of the read string. In contrast to the .read() method of file objects, mmap.mmap.read() always requires that a byte count be specified, which makes a memory-map file object not fully substitutable for a file object when data is read. However, the following is safe for both true file objects and memory-mapped file objects:

 >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(),0)
 >>> def safe_readall(file):
 ...     try:
 ...         length = len(file)
 ...         return file.read(length)
 ...     except TypeError:
 ...         return file.read()
 ...
 >>> s1 = safe_readall(fp)
 >>> s2 = safe_readall(mm)
 >>> s1 == s2
 1

SEE ALSO: mmap.mmap.read_byte() 151; mmap.mmap.readline() 151; mmap.mmap.write() 153; FILE.read() 17;

mmap.mmap.read_byte()

Return a one-byte string from the current file position and advance the current position by one. Same as mmap.mmap.read(1).

SEE ALSO: mmap.mmap.read() 150; mmap.mmap.readline() 151;

mmap.mmap.readline()

Return a string from the memory-mapped file object, starting from the current file position and going to the next newline character. Advance the current file position by the amount read.
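
A short session illustrating both mmap.mmap.read_byte() and mmap.mmap.readline() (an illustrative sketch; the repr of the trailing newline varies by Python version):

 >>> open('test','w').write('mary had\na little lamb\n')
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(), 0)
 >>> mm.read_byte()
 'm'
 >>> mm.readline()
 'ary had\012'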

SEE ALSO: mmap.mmap.read() 150; mmap.mmap.read_byte() 151; FILE.readline() 17;

mmap.mmap.resize(newsize)

Change the size of a memory-mapped file object. This may be used to expand the size of an underlying file or merely to expand the area of a file that is memory-mapped. An expanded file is padded with null bytes (\000) unless otherwise filled with content. As with other operations on mmap objects, changes to the underlying file system may not occur until a .flush() is performed.

SEE ALSO: mmap.mmap.flush() 150;

mmap.mmap.seek(offset [,mode])

Change the current file position. If a second argument mode is given, a different seek mode can be selected. The default is 0, absolute file positioning. Mode 1 seeks relative to the current file position. Mode 2 is relative to the end of the memory-mapped file (which may be smaller than the whole size of the underlying file). The first argument offset specifies the distance to move the current file position: in mode 0 it should be positive, in mode 2 it should be negative, and in mode 1 the current position can be moved either forward or backward.

SEE ALSO: FILE.seek() 17;

mmap.mmap.size()

Return the length of the underlying file. The size of the actual memory-map may be smaller if less than the whole file is mapped:

 >>> open('test','w').write('X'*100)
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(),50)
 >>> mm.size()
 100
 >>> len(mm)
 50

SEE ALSO: len() 14; mmap.mmap.seek() 152; mmap.mmap.tell() 152;

mmap.mmap.tell()

Return the current file position.

 >>> open('test','w').write('X'*100)
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(), 0)
 >>> mm.tell()
 0
 >>> mm.seek(20)
 >>> mm.tell()
 20
 >>> mm.read(20)
 'XXXXXXXXXXXXXXXXXXXX'
 >>> mm.tell()
 40

SEE ALSO: FILE.tell() 17; mmap.mmap.seek() 152;

mmap.mmap.write(s)

Write s into the memory-mapped file object at the current file position. The current file position is updated to the position following the write. The method mmap.mmap.write() is useful for functions that expect to be passed a file-like object with a .write() method. However, for new code, it is generally more natural to use the string-like index and slice operations to write contents. For example:

 >>> open('test','w').write('X'*50)
 >>> fp = open('test','r+')
 >>> import mmap
 >>> mm = mmap.mmap(fp.fileno(), 0)
 >>> mm.write('AAAAA')
 >>> mm.seek(10)
 >>> mm.write('BBBBB')
 >>> mm[30:35] = 'SSSSS'
 >>> mm[:]
 'AAAAAXXXXXBBBBBXXXXXXXXXXXXXXXSSSSSXXXXXXXXXXXXXXX'
 >>> mm.tell()
 15

SEE ALSO: FILE.write() 17; mmap.mmap.read() 150;

mmap.mmap.write_byte(c)

Write a one-byte string to the current file position, and advance the current position by one. Same as mmap.mmap.write(c) where c is a one-byte string.

SEE ALSO: mmap.mmap.write() 153;

StringIO • File-like objects that read from or write to a string buffer

cStringIO • Fast, but incomplete, StringIO replacement

The StringIO and cStringIO modules allow a programmer to create "memory files," that is, "string buffers." These special StringIO objects enable most of the techniques you might apply to "true" file objects, but without any connection to a filesystem.

The most common use of string buffer objects is when some existing techniques for working with byte-streams in files are to be applied to strings that do not come from files. A string buffer object behaves in a file-like manner and can "drop in" to most functions that want file objects.

cStringIO is much faster than StringIO and should be used in most cases. Both modules provide a StringIO class whose instances are the string buffer objects. cStringIO.StringIO cannot be subclassed (and therefore cannot provide additional methods), and it cannot handle Unicode strings. One rarely needs to subclass StringIO, but the absence of Unicode support in cStringIO could be a problem for many developers. As well, a cStringIO.StringIO buffer that is initialized with a string does not support write operations (see the class description below), which makes such string buffers somewhat less general (the effect of a write against an in-memory file can be accomplished by normal string operations).

A string buffer object may be initialized with a string (or Unicode for StringIO) argument. If so, that is the initial content of the buffer. Below are examples of usage (including Unicode handling):

 >>> from cStringIO import StringIO as CSIO
 >>> from StringIO import StringIO as SIO
 >>> alef, omega = unichr(1488), unichr(969)
 >>> sentence = "In set theory, the Greek "+omega+" represents the \n"+\
 ...            "ordinal limit of the integers, while the Hebrew \n"+\
 ...            alef+" represents their cardinality."
 >>> sio = SIO(sentence)
 >>> try:
 ...     csio = CSIO(sentence)
 ...     print "New string buffer from raw string"
 ... except TypeError:
 ...     csio = CSIO(sentence.encode('utf-8'))
 ...     print "New string buffer from ENCODED string"
 ...
 New string buffer from ENCODED string
 >>> sio.getvalue() == unicode(csio.getvalue(),'utf-8')
 1
 >>> try:
 ...     sio.getvalue() == csio.getvalue()
 ... except UnicodeError:
 ...     print "Cannot even compare Unicode with string, in general"
 ...
 Cannot even compare Unicode with string, in general
 >>> lines = csio.readlines()
 >>> len(lines)
 3
 >>> sio.seek(0)
 >>> print sio.readline().encode('utf-8'),
 In set theory, the Greek ω represents the
 >>> sio.tell(), csio.tell()
 (51, 124)
CONSTANTS
cStringIO.InputType

The type of a cStringIO.StringIO instance that has been opened in "read" mode. All StringIO.StringIO instances are simply InstanceType.

SEE ALSO: cStringIO.StringIO 155;

cStringIO.OutputType

The type of cStringIO.StringIO instance that has been opened in "write" mode (actually read/write). All StringIO.StringIO instances are simply InstanceType.

SEE ALSO: cStringIO.StringIO 155;

CLASSES
StringIO.StringIO([buf=...])
cStringIO.StringIO([buf])

Create a new string buffer. If the first argument buf is specified, the buffer is initialized with a string content. If the cStringIO module is used, the presence of the buf argument determines whether write access to the buffer is enabled. A cStringIO.StringIO buffer with write access must be initialized with no argument, otherwise it becomes read-only. A StringIO.StringIO buffer, however, is always read/write.
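
For instance, the difference in write access can be seen in a session along these lines (an illustrative sketch; truth values display as 1/0 before Python 2.3):

 >>> import cStringIO
 >>> rw = cStringIO.StringIO()          # no argument: read/write buffer
 >>> rw.write('spam and eggs')
 >>> rw.getvalue()
 'spam and eggs'
 >>> ro = cStringIO.StringIO('spam')    # initialized with a string: read-only
 >>> hasattr(ro, 'write')
 0
 >>> ro.read()
 'spam'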

METHODS
StringIO.StringIO.close()
cStringIO.StringIO.close()

Close the string buffer. No access is permitted after close.

SEE ALSO: FILE.close() 16;

StringIO.StringIO.flush()
cStringIO.StringIO.flush()

Compatibility method for file-like behavior. Data in a string buffer is already in memory, so there is no need to finalize a write to disk.

SEE ALSO: FILE.close() 16;

StringIO.StringIO.getvalue()
cStringIO.StringIO.getvalue()

Return the entire string held by the string buffer. Does not affect the current file position. Basically, this is the way you convert back from a string buffer to a string.
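
For example (an illustrative session):

 >>> from StringIO import StringIO
 >>> buf = StringIO()
 >>> buf.write("mary had ")
 >>> buf.write("a little lamb")
 >>> buf.getvalue()
 'mary had a little lamb'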

StringIO.StringIO.isatty()
cStringIO.StringIO.isatty()

Return 0. Compatibility method for file-like behavior.

SEE ALSO: FILE.isatty() 16;

StringIO.StringIO.read([num])
cStringIO.StringIO.read ([num])

If the first argument num is specified, return a string containing the next num characters. If num characters are not available, return as many as possible. If num is not specified, return all the characters from current file position to end of string buffer. Advance the current file position by the amount read.

SEE ALSO: FILE.read() 17; mmap.mmap.read() 150; StringIO.StringIO.readline() 156;

StringIO.StringIO.readline([length=...])
cStringIO.StringIO.readline([length])

Return a string from the string buffer, starting from the current file position and going to the next newline character. Advance the current file position by the amount read.

SEE ALSO: mmap.mmap.readline() 151; StringIO.StringIO.read() 156; StringIO.StringIO.readlines() 156; FILE.readline() 17;

StringIO.StringIO.readlines([sizehint=...])
cStringIO.StringIO.readlines([sizehint])

Return a list of strings from the string buffer. Each list element consists of a single line, including the trailing newline character(s). If an argument sizehint is specified, only read approximately sizehint characters worth of lines (full lines will always be read).

SEE ALSO: StringIO.StringIO.readline() 156; FILE.readlines() 17;

cStringIO.StringIO.reset()

Set the current file position to the beginning of the string buffer. Same as cStringIO.StringIO.seek(0).

SEE ALSO: StringIO.StringIO.seek() 156;

StringIO.StringIO.seek(offset [,mode=0])
cStringIO.StringIO.seek(offset [,mode])

Change the current file position. If the second argument mode is given, a different seek mode can be selected. The default is 0, absolute file positioning. Mode 1 seeks relative to the current file position. Mode 2 is relative to the end of the string buffer. The first argument offset specifies the distance to move the current file position: in mode 0 it should be positive, in mode 2 it should be negative, and in mode 1 the current position can be moved either forward or backward.

SEE ALSO: FILE.seek() 17; mmap.mmap.seek() 152;

StringIO.StringIO.tell()
cStringIO.StringIO.tell()

Return the current file position in the string buffer.

SEE ALSO: StringIO.StringIO.seek() 156;

StringIO.StringIO.truncate([len=0])
cStringIO.StringIO.truncate ([len])

Reduce the length of the string buffer to the first argument len characters. Truncation can only reduce characters later than the current file position (an initial cStringIO.StringIO.reset() can be used to assure truncation from the beginning).
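
For example (an illustrative session using the pure-Python StringIO):

 >>> from StringIO import StringIO
 >>> buf = StringIO("mary had a little lamb")
 >>> buf.truncate(8)
 >>> buf.getvalue()
 'mary had'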

SEE ALSO: StringIO.StringIO.seek() 156; cStringIO.StringIO.reset() 156; StringIO.StringIO.close() 155;

StringIO.StringIO.write(s=...)
cStringIO.StringIO.write(s)

Write the first argument s into the string buffer at the current file position. The current file position is updated to the position following the write.

SEE ALSO: StringIO.StringIO.writelines() 157; mmap.mmap.write() 153; StringIO.StringIO.read() 156; FILE.write() 17;

StringIO.StringIO.writelines(list=...)
cStringIO.StringIO.writelines(list)

Write each element of list into the string buffer at the current file position. The current file position is updated to the position following the write. For the cStringIO method, list must be an actual list. For the StringIO method, other sequence types are allowed. To be safe, it is best to coerce an argument into an actual list first. In either case, list must contain only strings, or a TypeError will occur.

Contrary to what might be expected from the method name, StringIO.StringIO.writelines() never inserts newline characters. For the list elements actually to occupy separate lines in the string buffer, each element string must already have a newline terminator. Consider the following variants on writing a list to a string buffer:

 >>> from StringIO import StringIO
 >>> sio = StringIO()
 >>> lst = [c*5 for c in 'ABC']
 >>> sio.writelines(lst)
 >>> sio.write(''.join(lst))
 >>> sio.write('\n'.join(lst))
 >>> print sio.getvalue()
 AAAAABBBBBCCCCCAAAAABBBBBCCCCCAAAAA
 BBBBB
 CCCCC

SEE ALSO: FILE.writelines() 17; StringIO.StringIO.write() 157;

2.2.3 Converting Between Binary and ASCII

The Python standard library provides several modules for converting between binary data and 7-bit ASCII. At the low level, binascii is a C extension to produce fast string conversions. At a high level, base64, binhex, quopri, and uu provide file-oriented wrappers to the facilities in binascii.

base64 • Convert to/from base64 encoding (RFC1521)

The base64 module is a wrapper around the functions binascii.a2b_base64() and binascii.b2a_base64(). As well as providing a file-based interface on top of the underlying string conversions, base64 handles the chunking of binary files into base64 line blocks and provides for the direct encoding of arbitrary input strings. Unlike uu, base64 adds no content headers to encoded data; MIME standards for headers and message-wrapping are handled by other modules that utilize base64. Base64 encoding is specified in RFC 1521.

FUNCTIONS
base64.encode(input=..., output=...)

Encode the contents of the first argument input to the second argument output. Arguments input and output should be file-like objects; input must be readable and output must be writable.

base64.encodestring(s=...)

Return the base64 encoding of the string passed in the first argument s.

base64.decode(input=..., output=...)

Decode the contents of the first argument input to the second argument output. Arguments input and output should be file-like objects; input must be readable and output must be writable.

base64.decodestring(s=...)

Return the decoding of the base64-encoded string passed in the first argument s.
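
As a quick round-trip illustration (the trailing newline of the encoded form may display as \n or \012, depending on Python version):

 >>> import base64
 >>> coded = base64.encodestring('Mary had a little lamb')
 >>> coded
 'TWFyeSBoYWQgYSBsaXR0bGUgbGFtYg==\n'
 >>> base64.decodestring(coded)
 'Mary had a little lamb'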

SEE ALSO: email 345; rfc822 397; mimetools 396; mimetypes 374; MimeWriter 396; mimify 396; binascii 159; quopri 162;

binascii • Convert between binary data and ASCII

The binascii module is a C implementation of a number of styles of ASCII encoding of binary data. Each function in the binascii module takes either encoded ASCII or raw binary strings as an argument, and returns the string result of converting back or forth. Some restrictions apply to the length of strings passed to some functions in the module (for encodings that operate on specific block sizes).

FUNCTIONS
binascii.a2b_base64(s)

Return the decoded version of a base64-encoded string. A string consisting of one or more encoding blocks should be passed as the argument s.

binascii.a2b_hex(s)

Return the decoded version of a hexadecimal-encoded string. A string consisting of an even number of hexadecimal digits should be passed as the argument s.

binascii.a2b_hqx(s)

Return the decoded version of a binhex-encoded string. A string containing a complete number of encoded binary bytes should be passed as the argument s.

binascii.a2b_qp(s [,header=0])

Return the decoded version of a quoted printable string. A string containing a complete number of encoded binary bytes should be passed as the argument s. If the optional argument header is specified, underscores will be decoded as spaces. New to Python 2.2.

binascii.a2b_uu(s)

Return the decoded version of a UUencoded string. A string consisting of exactly one encoding block should be passed as the argument s (for a full block, 62 bytes input, 45 bytes returned).

binascii.b2a_base64(s)

Return the base64 encoding of a binary string (including the newline after the block). A binary string no longer than 57 bytes should be passed as the argument s.

binascii.b2a_hex(s)

Return the hexadecimal encoding of a binary string. A binary string of any length should be passed as the argument s.
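
For example, a small round trip with binascii.a2b_hex() (an illustrative session):

 >>> import binascii
 >>> binascii.b2a_hex('Mary')
 '4d617279'
 >>> binascii.a2b_hex('4d617279')
 'Mary'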

binascii.b2a_hqx(s)

Return the binhex4 encoding of a binary string. A binary string of any length should be passed as the argument s. Run-length compression of s is not performed by this function (use binascii.rlecode_hqx() first, if needed).

binascii.b2a_qp(s [,quotetabs=0 [,istext=1 [,header=0]]])

Return the quoted printable encoding of a binary string. A binary string of any length should be passed as the argument s. The optional argument quotetabs specifies whether to escape spaces and tabs; istext specifies that newlines are not to be encoded; header specifies whether to encode spaces as underscores (and to escape underscores). New to Python 2.2.

binascii.b2a_uu(s)

Return the UUencoding of a binary string (including the initial block specifier "M" for full blocks and newline after block). A binary string no longer than 45 bytes should be passed as the argument s.

binascii.crc32(s [,crc])

Return the CRC32 checksum of the first argument s. If the second argument crc is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. For example:

 >>> import binascii
 >>> crc = binascii.crc32('spam')
 >>> binascii.crc32(' and eggs', crc)
 739139840
 >>> binascii.crc32('spam and eggs')
 739139840
binascii.crc_hqx(s, crc)

Return the binhex4 checksum of the first argument s, using initial checksum value in second argument. This allows partial computation of a checksum and continuation. For example:

 >>> import binascii
 >>> binascii.crc_hqx('spam and eggs', 0)
 17918
 >>> crc = binascii.crc_hqx('spam', 0)
 >>> binascii.crc_hqx(' and eggs', crc)
 17918

SEE ALSO: binascii.crc32 160;

binascii.hexlify(s)

Identical to binascii.b2a_hex().

binascii.rlecode_hqx(s)

Return the binhex4 run-length encoding (RLE) of first argument s. Under this RLE technique, 0x90 is used as an indicator byte. Independent of the binhex4 standard, this is a poor choice of precompression for encoded strings.

SEE ALSO: zlib.compress() 182;

binascii.rledecode_hqx(s)

Return the expansion of a binhex4 run-length encoded string.

binascii.unhexlify(s)

Identical to binascii.a2b_hex().

EXCEPTIONS
binascii.Error

Generic exception that should only result from programming errors.

binascii.Incomplete

Exception raised when a data block is incomplete. Usually this results from programming errors in reading blocks, but it could indicate data or channel corruption.

SEE ALSO: base64 158; binhex 161; uu 163;

binhex • Encode and decode binhex4 files

The binhex module is a wrapper around the functions binascii.a2b_hqx(), binascii.b2a_hqx(), binascii.rlecode_hqx(), binascii.rledecode_hqx(), and binascii.crc_hqx(). As well as providing a file-based interface on top of the underlying string conversions, binhex handles run-length encoding of encoded files and attaches the needed header and footer information. Under MacOS, the resource fork of a file is encoded along with the data fork (not applicable under other platforms).

FUNCTIONS
binhex.binhex(inp=..., out=...)

Encode the contents of the first argument inp to the second argument out. Argument inp is a filename; out may be either a filename or a file-like object. However, a cStringIO.StringIO object is not "file-like" enough, since it will be closed after the conversion and therefore its value lost. You could override the .close() method in a subclass of StringIO.StringIO to work around this limitation.

binhex.hexbin(inp=...[,out=...])

Decode the contents of the first argument to an output file. If the second argument out is specified, it will be used as the output filename, otherwise the filename will be taken from the binhex header. The argument inp may be either a filename or a file-like object.

CLASSES

A number of internal classes are used by binhex. They are not documented here, but can be examined in $PYTHONHOME/lib/binhex.py if desired (it is unlikely readers will need to do this).

SEE ALSO: binascii 159;

quopri • Convert to/from quoted printable encoding (RFC1521)

The quopri module is a wrapper around the functions binascii.a2b_qp() and binascii.b2a_qp(). The module quopri has the same functions as base64. Unlike uu, quopri adds no content headers to encoded data; MIME standards for headers and message wrapping are handled by other modules that utilize quopri. Quoted printable encoding is specified in RFC 1521.

FUNCTIONS
quopri.encode(input, output, quotetabs)

Encode the contents of the first argument input to the second argument output. Arguments input and output should be file-like objects; input must be readable and output must be writable. If quotetabs is a true value, escape tabs and spaces.

quopri.encodestring(s [,quotetabs=0])

Return the quoted printable encoding of the string passed in the first argument s. If quotetabs is a true value, escape tabs and spaces.

quopri.decode(input=..., output=...[,header=0])

Decode the contents of the first argument input to the second argument output. Arguments input and output should be file-like objects; input must be readable and output must be writable. If header is a true value, decode underscores as spaces (as used in quoted printable headers).

quopri.decodestring(s [,header=0])

Return the decoding of the quoted printable string passed in the first argument s. If header is a true value, decode underscores as spaces.
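A quick round trip shows the string-based functions as inverses; the sample string below is merely an illustration:

 # Sketch: encodestring()/decodestring() round trip on an 8-bit string
 import quopri
 original = 'v\xe9rit\xe9 = truth\n'
 encoded = quopri.encodestring(original)
 print encoded    # '=' and the high-bit bytes are escaped, e.g. 'v=E9rit=E9 =3D truth'
 assert quopri.decodestring(encoded) == original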

SEE ALSO: email 345; rfc822 397; mimetools 396; mimetypes 374; MimeWriter 396; mimify 396; binascii 159; base64 158;

uu • UUencode and UUdecode files

The uu module is a wrapper around the functions binascii.a2b_uu() and binascii.b2a_uu(). As well as providing a file-based interface on top of the underlying string conversions, uu handles the chunking of binary files into UUencoded line blocks and attaches the needed header and footer.

FUNCTIONS
uu.encode(in, out [,name=...[,mode=0666]])

Encode the contents of the first argument in to the second argument out. Arguments in and out should be file objects, but filenames are also accepted (the latter is deprecated). The special filename "-" can be used to specify STDIN or STDOUT, as appropriate. When file objects are passed as arguments, in must be readable and out must be writable. The third argument name can be used to specify the filename that appears in the UUencoding header; by default it is the name of in. The fourth argument mode is the octal filemode to store in the UUencoding header.

uu.decode(in [,out_file=...[,mode=...]])

Decode the contents of the first argument in to an output file. If the second argument out_file is specified, it will be used as the output file; otherwise, the filename will be taken from the UUencoding header. Arguments in and out_file should be file objects, but filenames are also accepted (the latter is deprecated). If the third argument mode is specified (and if out_file is either unspecified or is a filename), open the created file in mode mode.
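A minimal sketch of a file round trip, assuming hypothetical filenames:

 # Sketch: UUencode a binary file, then decode it back under a new name
 import uu
 inp = open('somefile.bin', 'rb')              # hypothetical input file
 out = open('somefile.uu', 'w')
 uu.encode(inp, out, name='somefile.bin')
 inp.close(); out.close()
 dec_out = open('restored.bin', 'wb')
 uu.decode(open('somefile.uu'), dec_out)
 dec_out.close()
 assert open('somefile.bin', 'rb').read() == open('restored.bin', 'rb').read()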

SEE ALSO: binascii 159;

2.2.4 Cryptography

Python does not come with any standard and general cryptography modules. The few included capabilities are fairly narrow in purpose and limited in scope. The capabilities in the standard library consist of several cryptographic hashes and one weak symmetrical encryption algorithm. A quick survey of cryptographic techniques shows what capabilities are absent from the standard library:

Symmetrical Encryption: Any technique by which a plaintext message M is "encrypted" with a key K to produce a cyphertext C. Application of K or some K' easily derivable from K to C is called "decryption" and produces as output M. The standard module rotor provides a form of symmetrical encryption.

Cryptographic Hash: Any technique by which a short "hash" H is produced from a plaintext message M that has several additional properties: (1) Given only H, it is difficult to obtain any M' such that the cryptographic hash of M' is H; (2) Given two plaintext messages M and M', there is a very low probability that the cryptographic hashes of M and M' are the same. Sometimes a third property is included: (3) Given M, its cryptographic hash H, and another hash H', examining the relationship between H and H' does not make it easier to find an M' whose hash is H'. The standard modules crypt, md5, and sha provide forms of cryptographic hashes.

Asymmetrical Encryption: Also called "public-key cryptography." Any technique by which a pair of keys Kpub and Kpriv can be generated that have several properties. The algorithm for an asymmetrical encryption technique will be called "P(M,K)" in the following. (1) For any plaintext message M, M equals P(Kpriv,P(M,Kpub)). (2) Given only a public-key Kpub, it is difficult to obtain a private-key Kpriv that assures the equality in (1). (3) Given only P(M,Kpub), it is difficult to obtain M. In general, in an asymmetrical encryption system, a user generates Kpub and Kpriv, then releases Kpub to other users but retains Kpriv as a secret. There is no support for asymmetrical encryption in the standard library.

Digital Signatures: Digital signatures are really just "public-keys in reverse." In many cases, the same underlying algorithm is used for each. A digital signature is any technique by which a pair of keys Kver and Ksig can be generated that have several properties. The algorithm for a digital signature will be called S(M,K) in the following. (1) For any message M, M equals S(Kver,S(M,Ksig)). (2) Given only a verification key Kver, it is difficult to obtain a signature key Ksig that assures the equality in (1). (3) Given only S(M,Ksig), it is difficult to find any C' such that S(Kver,C') is a plausible message (in other words, the signature shows it is not a forgery). In general, in a digital signature system, a user generates Kver and Ksig, then releases Kver to other users but retains Ksig as a secret. There is no support for digital signatures in the standard library.

Those outlined are the most important cryptographic techniques. More detailed general introductions to cryptology and cryptography can be found at the author's Web site. A first tutorial is Introduction to Cryptology Concepts I:

<http://gnosis.cx/publish/programming/cryptology1.pdf>

Further material is in Introduction to Cryptology Concepts II:

<http://gnosis.cx/publish/programming/cryptology2.pdf>

And more advanced material is in Intermediate Cryptology: Specialized Protocols:

<http://gnosis.cx/publish/programming/cryptology3.pdf>

A number of third-party modules have been created to handle cryptographic tasks; a good guide to these third-party tools is the Vaults of Parnassus Encryption/Encoding index at <http://www.vex.net/parnassus/apyllo.py?i=94738404>. Only the tools in the standard library will be covered here specifically, since all the third-party tools are somewhat far afield of the topic of text processing as such. Moreover, third-party tools often rely on additional non-Python libraries, which will not be present on most platforms, and these tools will not necessarily be maintained as new Python versions introduce changes.

The most important third-party modules are listed below. These are modules that the author believes are likely to be maintained and that provide access to a wide range of cryptographic algorithms.

mxCrypto
amkCrypto

Marc-Andre Lemburg and Andrew Kuchling, both valuable contributors of many Python modules, have played a game of leapfrog with each other by releasing mxCrypto and amkCrypto, respectively. Each release of either module builds on the work of the other, providing compatible interfaces and overlapping source code. Whichever is newest at the time you read this is the best bet. Current information on both should be obtainable from:

<http://www.amk.ca/python/code/crypto.html>

Python Cryptography

Andrew Kuchling, who has provided a great deal of excellent Python documentation, documents these cryptography modules at:

<http://www.amk.ca/python/writing/pycrypt/>

M2Crypto

The mxCrypto and amkCrypto modules are most readily available for Unix-like platforms. A similar range of cryptographic capabilities for a Windows platform is available in Ng Pheng Siong's M2Crypto. Information and documentation can be found at:

<http://www.post1.com/home/ngps/m2/>

fcrypt

Carey Evans has created fcrypt, which is a pure-Python, single-module replacement for the standard library's crypt module. While probably orders-of-magnitude slower than a C implementation, fcrypt will run anywhere that Python does (and speed is rarely an issue for this functionality). fcrypt may be obtained at:

<http://home.clear.net.nz/pages/c.evans/sw/>

crypt • Create and verify Unix-style passwords

The crypt() function is a frequently used, but somewhat antiquated, password creation/verification tool. Under Unix-like systems, crypt() is contained in system libraries and may be called from wrapper functions in languages like Python. crypt() is a form of cryptographic hash based on the Data Encryption Standard (DES). The hash produced by crypt() is based on an 8-byte key and a 2-byte "salt." The output of crypt() is produced by repeated encryption of a constant string, using the user key as a DES key and the salt to perturb the encryption in one of 4,096 ways. Both the key and the salt are restricted to alphanumerics plus dot and slash.

By using a cryptographic hash, passwords may be stored in a relatively insecure location. An imposter cannot easily produce a false password that will hash to the same value as the one stored in the password file, even given access to the password file. The salt is used to make "dictionary attacks" more difficult. If an imposter has access to the password file, she might try applying crypt() to a candidate password and compare the result to every entry in the password file. Without a salt, the chances of matching some encrypted password would be higher. The salt (a random value should be used) decreases the chance of such a random guess by 4,096 times.

The crypt module is only installed on some Python systems (even only some Unix systems). Moreover, the module, if installed, relies on an underlying system library. For a portable approach to password creation, the third-party fcrypt module provides a pure-Python reimplementation.

FUNCTIONS
crypt.crypt(passwd, salt)

Return an ASCII 13-byte encrypted password. The first argument passwd must be a string up to eight characters in length (extra characters are truncated and do not affect the result). The second argument salt must be a string up to two characters in length (extra characters are truncated). The value of salt forms the first two characters of the result.

 >>> from crypt import crypt
 >>> crypt('mypassword','XY')
 'XY5XuULXk4pcs'
 >>> crypt('mypasswo','XY')
 'XY5XuULXk4pcs'
 >>> crypt('mypassword...more.characters','XY')
 'XY5XuULXk4pcs'
 >>> crypt('mypasswo','AB')
 'AB061nfYxWIKg'
 >>> crypt('diffpass','AB')
 'AB105BopaFYNs'
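The usual verification pattern recovers the salt from the first two characters of the stored hash, then re-hashes the candidate password; a minimal sketch:

 # Sketch: verify a candidate password against a stored crypt() hash
 import crypt
 def check_password(candidate, stored):
     return crypt.crypt(candidate, stored[:2]) == stored
 stored = crypt.crypt('mypassword', 'XY')      # 'XY5XuULXk4pcs', as above
 print check_password('mypassword', stored)    # a true value
 print check_password('wrong-guess', stored)   # a false value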

SEE ALSO: fcrypt 165; md5 167; sha 170;

md5 • Create MD5 message digests

RSA Data Security, Inc.'s MD5 cryptographic hash is a popular algorithm that is codified by RFC 1321. Like sha, and unlike crypt, md5 allows one to find the cryptographic hash of arbitrary strings (Unicode strings may not be hashed, however). Absent any other considerations, such as compatibility with other programs, the Secure Hash Algorithm (SHA) is currently considered a better algorithm than MD5, and the sha module should be used for cryptographic hashes. The operation of md5 objects is similar to binascii.crc32() hashes in that the final hash value may be built progressively from partial concatenated strings. The MD5 algorithm produces a 128-bit hash.

CONSTANTS
md5.MD5Type

The type of an md5.new instance.

CLASSES
md5.new([s])

Create an md5 object. If the first argument s is specified, initialize the MD5 digest buffer with the initial string s. An MD5 hash can be computed in a single line with:

 >>> import md5 >>> md5.new('Mary had a little lamb').hexdigest() 'e946adb45d4299def2071880d30136d4' 
md5.md5([s])

Identical to md5.new.

METHODS
md5.copy()

Return a new md5 object that is identical to the current state of the current object. Different terminal strings can be concatenated to the clone objects after they are copied. For example:

 >>> import md5
 >>> m = md5.new('spam and eggs')
 >>> m.digest()
 '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
 >>> m2 = m.copy()
 >>> m2.digest()
 '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
 >>> m.update(' are tasty')
 >>> m2.update(' are wretched')
 >>> m.digest()
 '*\x94\xa2\xc5\xceq\x96\xef&\x1a\xc9#\xac98\x16'
 >>> m2.digest()
 'h\x8c\xfam\xe3\xb0\x90\xe8\x0e\xcb\xbf\xb3\xa7N\xe6\xbc'
md5.digest()

Return the 128-bit digest of the current state of the md5 object as a 16-byte string. Each byte will contain a full 8-bit range of possible values.

 >>> import md5          # Python 2.1+
 >>> m = md5.new('spam and eggs')
 >>> m.digest()
 '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
 >>> import md5          # Python <= 2.0
 >>> m = md5.new('spam and eggs')
 >>> m.digest()
 '\265\201f\014\377\027\347\214\204\303\250J\320.g\205'
md5.hexdigest()

Return the 128-bit digest of the current state of the md5 object as a 32-byte hexadecimal-encoded string. Each byte will contain only values in string.hexdigits. Each pair of bytes represents 8 bits of hash, and this format may be transmitted over 7-bit ASCII channels like email.

 >>> import md5 >>> m = md5.new('spam and eggs') >>> m.hexdigest() 'b581660cff17e78c84c3a84ad02e6785' 
md5.update(s)

Concatenate additional strings to the md5 object. Current hash state is adjusted accordingly. The number of concatenation steps that go into an MD5 hash does not affect the final hash, only the actual string that would result from concatenating each part in a single string. However, for large strings that are determined incrementally, it may be more practical to call md5.update() numerous times. For example:

 >>> import md5
 >>> m1 = md5.new('spam and eggs')
 >>> m2 = md5.new('spam')
 >>> m2.update(' and eggs')
 >>> m3 = md5.new('spam')
 >>> m3.update(' and ')
 >>> m3.update('eggs')
 >>> m1.hexdigest()
 'b581660cff17e78c84c3a84ad02e6785'
 >>> m2.hexdigest()
 'b581660cff17e78c84c3a84ad02e6785'
 >>> m3.hexdigest()
 'b581660cff17e78c84c3a84ad02e6785'
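One common incremental case is digesting a large file without reading it into memory at once; a minimal sketch, assuming a hypothetical filename:

 # Sketch: hash a file in blocks with repeated .update() calls
 import md5
 def md5_of_file(fname, blocksize=8192):
     m = md5.new()
     fp = open(fname, 'rb')
     while 1:
         block = fp.read(blocksize)
         if not block:
             break
         m.update(block)
     fp.close()
     return m.hexdigest()
 # print md5_of_file('somebigfile.iso')     # hypothetical filename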

SEE ALSO: sha 170; crypt 166; binascii.crc32() 160;

rotor • Perform Enigma-like encryption and decryption

The rotor module is a bit of a curiosity in the Python standard library. The symmetric encryption performed by rotor is similar to that performed by the extremely historically interesting and important Enigma algorithm. Given Alan Turing's famous role not just in inventing the theory of computability, but also in cracking German encryption during WWII, there is a nice literary quality to the inclusion of rotor in Python. However, rotor should not be mistaken for a robust modern encryption algorithm. Bruce Schneier has commented that there are two types of encryption algorithms: those that will stop your little sister from reading your messages, and those that will stop major governments and powerful organizations from reading your messages. rotor is in the first category, albeit allowing for rather bright little sisters. But rotor will not help much against TLAs (three-letter agencies). On the other hand, there is nothing else in the Python standard library that performs actual military-grade encryption, either.

CLASSES
rotor.newrotor(key [,numrotors])

Return a rotor object with rotor permutations and positions based on the first argument key. If the second argument numrotors is specified, a number of rotors other than the default of 6 can be used (more is stronger). A rotor encryption can be computed in a single line with:

 >>> import rotor
 >>> rotor.newrotor('mypassword').encrypt('Mary had a lamb')
 '\x10\xef\xf1\x1e\xeaor\xe9\xf7\xe5\xad,r\xc6\x9f'

Object style encryption and decryption is performed like the following:

 >>> import rotor
 >>> C = rotor.newrotor('pass2').encrypt('Mary had a little lamb')
 >>> r1 = rotor.newrotor('mypassword')
 >>> C2 = r1.encrypt('Mary had a little lamb')
 >>> r1.decrypt(C2)
 'Mary had a little lamb'
 >>> r1.decrypt(C)   # Let's try it
 '\217R$\217/sE\311\330~#\310\342\200\025F\221\245\263\036\2200'
 >>> r1.setkey('pass2')
 >>> r1.decrypt(C)   # Let's try it
 'Mary had a little lamb'
METHODS
rotor.decrypt(s)

Return a decrypted version of cyphertext string s. Prior to decryption, rotors are set to their initial positions.

rotor.decryptmore(s)

Return a decrypted version of cyphertext string s. Prior to decryption, rotors are left in their current positions.

rotor.encrypt(s)

Return an encrypted version of plaintext string s. Prior to encryption, rotors are set to their initial positions.

rotor.encryptmore(s)

Return an encrypted version of plaintext string s. Prior to encryption, rotors are left in their current positions.
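Because the rotors advance one byte at a time, a byte-stream can be processed in chunks with the ...more() variants, as long as encryption and decryption consume the same total stream. A minimal sketch:

 # Sketch: chunked encryption/decryption with the ...more() variants;
 # rotor positions stay synchronized as long as both sides see the same stream
 import rotor
 enc = rotor.newrotor('mypassword')
 dec = rotor.newrotor('mypassword')
 cypher = enc.encrypt('Mary had ')           # resets rotors, then advances
 cypher += enc.encryptmore('a little ')      # continues from current positions
 cypher += enc.encryptmore('lamb')
 plain = dec.decrypt(cypher[:7]) + dec.decryptmore(cypher[7:])
 print plain                                 # Mary had a little lamb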

rotor.setkey(key)

Set a new key for a rotor object.

sha • Create SHA message digests

The National Institute of Standards and Technology's (NIST's) Secure Hash Algorithm is the best-known cryptographic hash for most purposes. Like md5, and unlike crypt, sha allows one to find the cryptographic hash of arbitrary strings (Unicode strings may not be hashed, however). Absent any other considerations, such as compatibility with other programs, SHA is currently considered a better algorithm than MD5, and the sha module should be used for cryptographic hashes. The operation of sha objects is similar to binascii.crc32() hashes in that the final hash value may be built progressively from partial concatenated strings. The SHA algorithm produces a 160-bit hash.

CLASSES
sha.new([s])

Create an sha object. If the first argument s is specified, initialize the SHA digest buffer with the initial string s. An SHA hash can be computed in a single line with:

 >>> import sha >>> sha.new('Mary had a little lamb').hexdigest() 'bac9388d0498fb378e528d35abd05792291af182' 
sha.sha([s])

Identical to sha.new.

METHODS
sha.copy()

Return a new sha object that is identical to the current state of the current object. Different terminal strings can be concatenated to the clone objects after they are copied. For example:

 >>> import sha
 >>> s = sha.new('spam and eggs')
 >>> s.digest()
 '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
 >>> s2 = s.copy()
 >>> s2.digest()
 '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
 >>> s.update(' are tasty')
 >>> s2.update(' are wretched')
 >>> s.digest()
 '\013^C\366\253?I\323\206nt\2443\251\227\204-kr6'
 >>> s2.digest()
 '\013\210\237\216\014\3337X\333\221h&+c\345\007\367\326\274\321'
sha.digest()

Return the 160-bit digest of the current state of the sha object as a 20-byte string. Each byte will contain a full 8-bit range of possible values.

 >>> import sha          # Python 2.1+
 >>> s = sha.new('spam and eggs')
 >>> s.digest()
 '\xbe\x87\x94\x8b\xad\xfdx\x14\xa5b\x1eC\xd2\x0f\xaa8 @\x0f\xa6'
 >>> import sha          # Python <= 2.0
 >>> s = sha.new('spam and eggs')
 >>> s.digest()
 '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
sha.hexdigest()

Return the 160-bit digest of the current state of the sha object as a 40-byte hexadecimal-encoded string. Each byte will contain only values in string.hexdigits. Each pair of bytes represents 8 bits of hash, and this format may be transmitted over 7-bit ASCII channels like email.

 >>> import sha >>> s = sha.new('spam and eggs') >>> s.hexdigest() 'be87948badfd7814a5621e43d20faa3820400fa6' 
sha.update(s)

Concatenate additional strings to the sha object. Current hash state is adjusted accordingly. The number of concatenation steps that go into an SHA hash does not affect the final hash, only the actual string that would result from concatenating each part in a single string. However, for large strings that are determined incrementally, it may be more practical to call sha.update() numerous times. For example:

 >>> import sha
 >>> s1 = sha.sha('spam and eggs')
 >>> s2 = sha.sha('spam')
 >>> s2.update(' and eggs')
 >>> s3 = sha.sha('spam')
 >>> s3.update(' and ')
 >>> s3.update('eggs')
 >>> s1.hexdigest()
 'be87948badfd7814a5621e43d20faa3820400fa6'
 >>> s2.hexdigest()
 'be87948badfd7814a5621e43d20faa3820400fa6'
 >>> s3.hexdigest()
 'be87948badfd7814a5621e43d20faa3820400fa6'

SEE ALSO: md5 167; crypt 166; binascii.crc32() 160;

2.2.5 Compression

Over the history of computers, a large number of data compression formats have been invented, mostly as variants on Lempel-Ziv and Huffman techniques. Compression is useful for all sorts of data streams, but file-level archive formats have been the most widely used and known application. Under MS-DOS and Windows we have seen ARC, PAK, ZOO, LHA, ARJ, CAB, RAR, and other formats, but the ZIP format has become the most widespread variant. Under Unix-like systems, compress (.Z) mostly gave way to gzip (GZ); gzip is still the most popular format on these systems, but bzip2 (BZ2) generally obtains better compression rates. Under MacOS, the most popular format is SIT. Other platforms have additional variants on archive formats, but ZIP and to a lesser extent GZ are widely supported on a number of platforms.

The Python standard library includes support for several styles of compression. The zlib module performs low-level compression of raw string data and has no concept of a file. zlib is itself called by the high-level modules below for its compression services.

The modules gzip and zipfile provide file-level interfaces to compressed archives. However, a notable difference in the operation of gzip and zipfile arises out of a difference in the underlying GZ and ZIP formats. gzip (GZ) operates exclusively on single files, leaving the work of concatenating collections of files to tools like tar. One frequently encounters (especially on Unix-like systems) files like foo.tar.gz or foo.tgz that are produced by first applying tar to a collection of files, then applying gzip to the result. ZIP, however, handles both the compression and archiving aspects in a single tool and format. As a consequence, gzip is able to create file-like objects based directly on the compressed contents of a GZ file. zipfile needs to provide more specialized methods for navigating archive contents and for working with individual compressed file images therein.

Also see Appendix B (A Data Compression Primer).

gzip • Functions that read and write gzipped files

The gzip module allows the treatment of the compressed data inside gzip compressed files directly in a file-like manner. Uncompressed data can be read out, and compressed data written back in, all without a caller knowing or caring that the file is a GZ-compressed file. A simple example illustrates this:

gzip_file.py
 # Treat a GZ as "just another file"
 import gzip, glob
 print "Size of data in files:"
 for fname in glob.glob('*'):
     try:
         if fname[-3:] == '.gz':
             s = gzip.open(fname).read()
         else:
             s = open(fname).read()
         print ' ',fname,'-',len(s),'bytes'
     except IOError:
         print 'Skipping',fname
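Writing is just as transparent; a minimal sketch, assuming a hypothetical output filename:

 # Sketch: write compressed data, then read it back as if it were plain text
 import gzip
 gz = gzip.GzipFile('greeting.gz', 'wb')
 gz.write('Hello, compressed world!\n' * 100)
 gz.close()
 print len(gzip.open('greeting.gz').read())   # 2500 uncompressed characters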

The module gzip is a wrapper around zlib, with the latter performing the actual compression and decompression tasks. In many respects, gzip is similar to mmap and StringIO in emulating and/or wrapping a file object.

SEE ALSO: mmap 147; StringIO 153; cStringIO 153;

CLASSES
gzip.GzipFile([filename=...[,mode="rb" [,compresslevel=9 [,fileobj=...]]]])

Create a gzip file-like object. Such an object supports most file object operations, with the exception of .seek() and .tell(). Either the first argument filename or the fourth argument fileobj should be specified (most likely by keyword argument, especially in the case of fileobj).

The second argument mode takes the mode of fileobj if specified; otherwise it defaults to rb (r, rb, a, ab, w, or wb may be specified with the same meaning as for file objects created by the built-in open() function). The third argument compresslevel specifies the level of compression. The default is the highest level, 9; an integer down to 1 may be selected for less compression but faster operation (the compression level of a file opened for reading comes from the file itself, however).

gzip.open(filename=...[,mode='rb' [,compresslevel=9]])

Same as gzip.GzipFile but with extra arguments omitted. A GZ file object opened with gzip.open is always opened by name, not by underlying file object.
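Because gzip.GzipFile() also accepts an already-open file-like object through fileobj, the same interface can compress into an in-memory buffer; a minimal sketch using StringIO:

 # Sketch: GZ-compress into a StringIO buffer rather than a disk file
 import gzip, StringIO
 buf = StringIO.StringIO()
 gz = gzip.GzipFile(fileobj=buf, mode='wb')
 gz.write('spam and eggs\n' * 50)
 gz.close()                      # closes the gzip layer, not the StringIO
 compressed = buf.getvalue()     # the raw GZ byte-stream
 print len(compressed), 'compressed bytes'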

METHODS AND ATTRIBUTES
gzip.close()

Close the gzip object. No access is permitted after close. If the object was opened by file object, the underlying file object is not closed, only the gzip interface to the file.

SEE ALSO: FILE.close() 16;

gzip.flush()

Write outstanding data from memory to disk.

SEE ALSO: FILE.close() 16;

gzip.isatty()

Return 0. Compatibility method for file-like behavior.

SEE ALSO: FILE.isatty() 16;

gzip.myfileobj

Attribute holding the underlying file object.

gzip.read([num])

If the first argument num is specified, return a string containing the next num characters. If num characters are not available, return as many as possible. If num is not specified, return all the characters from current file position to end of string buffer. Advance the current file position by the amount read.

SEE ALSO: FILE.read() 17;

gzip.readline([length])

Return a string from the gzip object, starting from the current file position and going to the next newline character. The argument length limits the read if specified. Advance the current file position by the amount read.

SEE ALSO: FILE.readline() 17;

gzip.readlines([sizehint=...])

Return a list of strings from the gzip object. Each list element consists of a single line, including the trailing newline character(s). If an argument sizehint is specified, read only approximately sizehint characters worth of lines (full lines will always be read).

SEE ALSO: FILE.readlines() 17;

gzip.write(s)

Write the first argument s into the gzip object at the current file position. The current file position is updated to the position following the write.

SEE ALSO: FILE.write() 17;

gzip.writelines(list)

Write each element of list into the gzip object at the current file position. The current file position is updated to the position following the write. Most sequence types are allowed, but list must contain only strings, or a TypeError will occur.

Contrary to what might be expected from the method name, gzip.writelines() never inserts newline characters. For the list elements actually to occupy separate lines in the string buffer, each element string must already have a newline terminator. See StringIO.StringIO.writelines() for an example.

SEE ALSO: FILE.writelines() 17; StringIO.StringIO.writelines() 157;

SEE ALSO: zlib 181; zipfile 176;

zipfile • Read and write ZIP files

The zipfile module enables a variety of operations on ZIP files and is compatible with archives created by applications such as PKZip, Info-Zip, and WinZip. Since the ZIP format allows inclusion of multiple file images within a single archive, the zipfile module does not behave in a directly file-like manner, as gzip does. Nonetheless, it is possible to view the contents of an archive, add new file images to one, create a new ZIP archive, or manipulate the contents and directory information of a ZIP file.

An initial example of working with the zipfile module gives a feel for its usage.

 >>> for name in 'ABC':
 ...     open(name,'w').write(name*1000)
 ...
 >>> import zipfile
 >>> z = zipfile.ZipFile('new.zip','w',zipfile.ZIP_DEFLATED) # new archv
 >>> z.write('A')                        # write files to archive
 >>> z.write('B','B.newname',zipfile.ZIP_STORED)
 >>> z.write('C','C.newname')
 >>> z.close()                           # close the written archive
 >>> z = zipfile.ZipFile('new.zip')      # reopen archive in read mode
 >>> z.testzip()                         # 'None' returned means OK
 >>> z.namelist()                        # What's in it?
 ['A', 'B.newname', 'C.newname']
 >>> z.printdir()                        # details
 File Name                                   Modified             Size
 A                                    2001-07-18 21:39:36         1000
 B.newname                            2001-07-18 21:39:36         1000
 C.newname                            2001-07-18 21:39:36         1000
 >>> A = z.getinfo('A')                  # bind ZipInfo object
 >>> B = z.getinfo('B.newname')          # bind ZipInfo object
 >>> A.compress_size
 11
 >>> B.compress_size
 1000
 >>> z.read(A.filename)[:40]             # Check what's in A
 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
 >>> z.read(B.filename)[:40]             # Check what's in B
 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
 >>> # For comparison, see what Info-Zip reports on created archive
 >>> import os
 >>> print os.popen('unzip -v new.zip').read()
 Archive: new.zip
  Length  Method   Size  Ratio   Date    Time   CRC-32     Name
  ------  ------   ----  -----   ----    ----   ------     ----
    1000  Defl:N      11  99%  07-18-01  21:39  51a02e01   A
    1000  Stored    1000   0%  07-18-01  21:39  7d9c564d   B.newname
    1000  Defl:N      11  99%  07-18-01  21:39  66778189   C.newname
  ------          ------  ---                              ------
    3000            1022  66%                              3 files

Like gzip, the zipfile module relies on zlib, with the latter performing the actual compression and decompression tasks (when the ZIP_DEFLATED method is used).

CONSTANTS

Several string constants (struct formats) are used to recognize signature identifiers in the ZIP format. These constants are not normally used directly by end-users of zipfile.

 zipfile.stringCentralDir = 'PK\x01\x02'
 zipfile.stringEndArchive = 'PK\x05\x06'
 zipfile.stringFileHeader = 'PK\x03\x04'
 zipfile.structCentralDir = '<4s4B4H3l5H2l'
 zipfile.structEndArchive = '<4s4H2lH'
 zipfile.structFileHeader = '<4s2B4H3l2H'

Symbolic names for the two supported compression methods are also defined.

 zipfile.ZIP_STORED = 0
 zipfile.ZIP_DEFLATED = 8
FUNCTIONS
zipfile.is_zipfile(filename=...)

Check if the argument filename is a valid ZIP archive. Archives with appended comments are not recognized as valid archives. Return 1 if valid, None otherwise. This function does not guarantee archive is fully intact, but it does provide a sanity check on the file type.

CLASSES
zipfile.PyZipFile(pathname)

Create a zipfile.ZipFile object that has the extra method zipfile.ZipFile.writepy(). This extra method allows you to recursively add all *.py[oc] files to an archive. This class is not general purpose, but a special feature to aid distutils.
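A minimal sketch, assuming a hypothetical archive name and package directory:

 # Sketch: bundle a package's modules (compiled as needed) into a ZIP archive
 import zipfile
 pz = zipfile.PyZipFile('modules.zip', 'w')   # hypothetical archive name
 pz.writepy('mypackage')                      # hypothetical package directory
 pz.close()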

zipfile.ZipFile(file=...[,mode='r' [,compression=ZIP_STORED]])

Create a new zipfile.ZipFile object. This object is used for management of a ZIP archive. The first argument file must be specified and is simply the filename of the archive to be manipulated. The second argument mode may have one of three string values: r to open the archive in read-only mode; w to truncate the filename and create a new archive; a to read an existing archive and add to it. The third argument compression indicates the compression method; ZIP_DEFLATED requires that the zlib module and the zlib system library be present.

zipfile.ZipInfo()

Create a new zipfile.ZipInfo object. This object contains information about an individual archived filename and its file image. Normally, one will not directly instantiate zipfile.ZipInfo but only look at the zipfile.ZipInfo objects that are returned by methods like zipfile.ZipFile.infolist(), zipfile.ZipFile.getinfo(), and zipfile.ZipFile.NameToInfo. However, in special cases like zipfile.ZipFile.writestr(), it is useful to create a zipfile.ZipInfo directly.

METHODS AND ATTRIBUTES
zipfile.ZipFile.close()

Close the zipfile.ZipFile object, and flush any changes made to it. An object must be explicitly closed to perform updates.

zipfile.ZipFile.getinfo(name=...)

Return the zipfile.ZipInfo object corresponding to the filename name. If name is not in the ZIP archive, a KeyError is raised.

zipfile.ZipFile.infolist()

Return a list of zipfile.ZipInfo objects contained in the zipfile.ZipFile object. The return value is simply a list of instances of the same type. If the filename within the archive is known, zipfile.ZipFile.getinfo() is a better method to use. For enumerating over all archived files, however, zipfile.ZipFile.infolist() provides a nice sequence.

zipfile.ZipFile.namelist()

Return a list of the filenames of all the archived files (including nested relative directories).

zipfile.ZipFile.printdir()

Print to STDOUT a pretty summary of archived files and information about them. The results are similar to running Info-Zip's unzip with the -l option.

zipfile.ZipFile.read(name=...)

Return the contents of the archived file with filename name.

zipfile.ZipFile.testzip()

Test the integrity of the current archive. Return the filename of the first zipfile.ZipInfo object with corruption. If everything is valid, return None.

zipfile.ZipFile.write(filename=...[,arcname=...[,compress_type=...]])

Add the file filename to the zipfile.ZipFile object. If the second argument arcname is specified, use arcname as the stored filename (otherwise, use filename itself). If the third argument compress_type is specified, use the indicated compression method. The current archive must be opened in w or a mode.

zipfile.ZipFile.writestr(zinfo=..., bytes=...)

Write the data contained in the second argument bytes to the zipfile.ZipFile object. Directory meta-information must be contained in attributes of the first argument zinfo (a filename, date, and time should be included; other information is optional). The current archive must be opened in w or a mode.
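A minimal sketch of this usage, assuming hypothetical archive and member names:

 # Sketch: archive an in-memory string by building a ZipInfo by hand
 import zipfile, time
 z = zipfile.ZipFile('strdata.zip', 'w')                # hypothetical archive
 zinfo = zipfile.ZipInfo('generated.txt', date_time=time.localtime()[:6])
 zinfo.compress_type = zipfile.ZIP_DEFLATED             # requires zlib
 z.writestr(zinfo, 'data produced in memory, never written to its own file')
 z.close()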

zipfile.ZipFile.NameToInfo

Dictionary that maps filenames in archive to corresponding zipfile.ZipInfo objects. The method zipfile.ZipFile.getinfo() is simply a wrapper for a dictionary lookup in this attribute.

zipfile.ZipFile.compression

Compression type currently in effect for new zipfile.ZipFile.write() operations. Modify with due caution (most likely not at all after initialization).

zipfile.ZipFile.debug = 0

Attribute for level of debugging information sent to STDOUT. Values range from the default 0 (no output) to 3 (verbose). May be modified.

zipfile.ZipFile.filelist

List of zipfile.ZipInfo objects contained in the zipfile.ZipFile object. The method zipfile.ZipFile.infolist() is simply a wrapper to retrieve this attribute. Modify with due caution (most likely not at all).

zipfile.ZipFile.filename

Filename of the zipfile.ZipFile object. DO NOT modify!

zipfile.ZipFile.fp

Underlying file object for the zipfile.ZipFile object. DO NOT modify!

zipfile.ZipFile.mode

Access mode of current zipfile.ZipFile object. DO NOT modify!

zipfile.ZipFile.start_dir

Position of start of central directory. DO NOT modify!

zipfile.ZipInfo.CRC

Hash value of this archived file. DO NOT modify!

zipfile.ZipInfo.comment

Comment attached to this archived file. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.compress_size

Size of the compressed data of this archived file. DO NOT modify!

zipfile.ZipInfo.compress_type

Compression type used with this archived file. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.create_system

System that created this archived file. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.create_version

PKZip version that created the archive. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.date_time

Timestamp of this archived file. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.external_attr

File attribute of archived file when extracted.

zipfile.ZipInfo.extract_version

PKZip version needed to extract the archive. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.file_offset

Byte offset to start of file data. DO NOT modify!

zipfile.ZipInfo.file_size

Size of the uncompressed data in the archived file. DO NOT modify!

zipfile.ZipInfo.filename

Filename of archived file. Modify with due caution (e.g., for use with zipfile.ZipFile.writestr()).

zipfile.ZipInfo.header_offset

Byte offset to file header of the archived file. DO NOT modify!

zipfile.ZipInfo.volume

Volume number of the archived file. DO NOT modify!

EXCEPTIONS
zipfile.error

Exception that is raised when a corrupt ZIP file is processed.

zipfile.BadZipfile

Alias for zipfile.error.

SEE ALSO: zlib 181; gzip 173;

zlib • Compress and decompress with zlib library

zlib is the underlying compression engine for all Python standard library compression modules. Moreover, zlib is extremely useful in itself for compression and decompression of data that does not necessarily live in files (or where data does not map directly to files, even if it winds up in them indirectly). The Python zlib module relies on the availability of the zlib system library.

There are two basic modes of operation for zlib. In the simplest mode, one can simply pass an uncompressed string to zlib.compress() and have the compressed version returned. Using zlib.decompress() is symmetrical. In a more complicated mode, one can create compression or decompression objects that are able to receive incremental raw or compressed byte-streams, and return partial results based on what they have seen so far. This mode of operation is similar to the way one uses sha.sha.update(), md5.md5.update(), rotor.encryptmore(), or binascii.crc32() (albeit for a different purpose from each of those). For large byte-streams that are determined incrementally, it may be more practical to utilize compression/decompression objects than it would be to compress/decompress an entire string at once (for example, if the input or result is bound to a slow channel).

CONSTANTS
zlib.ZLIB_VERSION

The installed zlib system library version.

zlib.Z_BEST_COMPRESSION = 9

Highest compression level.

zlib.Z_BEST_SPEED = 1

Fastest compression level.

zlib.Z_HUFFMAN_ONLY = 2

Compression strategy that uses only Huffman codes, not Lempel-Ziv string matching (despite its numeric value, this is a strategy constant rather than a compression level).

FUNCTIONS
zlib.adler32(s [,crc])

Return the Adler-32 checksum of the first argument s. If the second argument crc is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. An Adler-32 checksum can be computed much more quickly than a CRC32 checksum. Unlike md5 or sha, an Adler-32 checksum is not sufficient for cryptographic hashes, but merely for detection of accidental corruption of data.

SEE ALSO: zlib.crc32() 182; md5 167; sha 170;

zlib.compress(s [,level])

Return the zlib compressed version of the string in the first argument s. If the second argument level is specified, the compression technique can be fine-tuned. The compression level ranges from 1 to 9 and may also be specified using symbolic constants such as Z_BEST_COMPRESSION and Z_BEST_SPEED. The default value for level is 6 and is usually the desired compression level (usually within a few percent of the speed of Z_BEST_SPEED and within a few percent of the size of Z_BEST_COMPRESSION).
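A quick way to see the trade-off is to compress the same repetitive string at several levels; a minimal sketch:

 # Sketch: size versus level for a (highly compressible) repetitive string
 import zlib
 s = 'spam and eggs ' * 200
 for level in (zlib.Z_BEST_SPEED, 6, zlib.Z_BEST_COMPRESSION):
     c = zlib.compress(s, level)
     print level, len(c), zlib.decompress(c) == s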

SEE ALSO: zlib.decompress() 182; zlib.compressobj 183;

zlib.crc32(s [,crc])

Return the CRC32 checksum of the first argument s. If the second argument crc is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. Unlike md5 or sha, a CRC32 checksum is not sufficient for cryptographic hashes, but merely for detection of accidental corruption of data.

Identical to binascii.crc32() (example appears there).

SEE ALSO: binascii.crc32() 160; zlib.adler32() 182; md5 167; sha 170;

zlib.decompress(s [,winsize [,buffsize]])

Return the decompressed version of the zlib compressed string in the first argument s. If the second argument winsize is specified, it determines the base 2 logarithm of the history buffer size. The default winsize is 15. If the third argument buffsize is specified, it determines the size of the decompression buffer. The default buffsize is 16384, but more is dynamically allocated if needed. One rarely needs to use winsize and buffsize values other than the defaults.

SEE ALSO: zlib.compress() 182; zlib.decompressobj 183;

CLASS FACTORIES

zlib does not define true classes that can be specialized. zlib.compressobj() and zlib.decompressobj() are actually factory-functions rather than classes. That is, they return instance objects, just as classes do, but they do not have unbound data and methods. For most users, the difference is not important: To get a zlib.compressobj or zlib.decompressobj object, you just call that factory-function in the same manner you would a class object.

zlib.compressobj([level])

Create a compression object. A compression object is able to incrementally compress new strings that are fed to it while maintaining the seeded symbol table from previously compressed byte-streams. If argument level is specified, the compression technique can be fine-tuned. The compression level ranges from 1 to 9. The default value for level is 6 and is usually the desired compression level.

SEE ALSO: zlib.compress() 182; zlib.decompressobj() 183;

zlib.decompressobj([winsize])

Create a decompression object. A decompression object is able to incrementally decompress new strings that are fed to it while maintaining the seeded symbol table from previously decompressed byte-streams. If the argument winsize is specified, it determines the base 2 logarithm of the history buffer size. The default winsize is 15.

SEE ALSO: zlib.decompress() 182; zlib.compressobj() 183;

METHODS AND ATTRIBUTES
zlib.compressobj.compress(s)

Add more data to the compression object. If the symbol table becomes full, compressed data is returned, otherwise an empty string. All returned output from each repeated call to zlib.compressobj.compress() should be concatenated to a decompression byte-stream (either a string or a decompression object). The example below, if run in a directory with some files, lets one examine the buffering behavior of compression objects:

zlib_objs.py
 # Demonstrate compression object streams
 import zlib, glob
 decom = zlib.decompressobj()
 com = zlib.compressobj()
 for file in glob.glob('*'):
     s = open(file).read()
     c = com.compress(s)
     print 'COMPRESSED:', len(c), 'bytes out'
     d = decom.decompress(c)
     print 'DECOMPRESS:', len(d), 'bytes out'
     print 'UNUSED DATA:', len(decom.unused_data), 'bytes'
     raw_input('-- %s (%s bytes) --' % (file, len(s)))
 f = com.flush()
 m = decom.decompress(f)
 print 'DECOMPRESS:', len(m), 'bytes out'
 print 'UNUSED DATA:', len(decom.unused_data), 'bytes'

SEE ALSO: zlib.compressobj.flush() 184; zlib.decompressobj.decompress() 185; zlib.compress() 182;

zlib.compressobj.flush([mode])

Flush any buffered data from the compression object. As in the example in zlib.compressobj.compress(), the output of a zlib.compressobj.flush() should be concatenated to the same decompression byte-stream as zlib.compressobj.compress() calls are. If the first argument mode is left empty, or the default Z_FINISH is specified, the compression object cannot be used further, and one should delete it. Otherwise, if Z_SYNC_FLUSH or Z_FULL_FLUSH are specified, the compression object can still be used, but some uncompressed data may not be recovered by the decompression object.
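A minimal sketch of the Z_SYNC_FLUSH style, in which every flushed chunk can be decompressed as soon as it arrives while the compression object remains usable for later chunks:

 # Sketch: per-chunk Z_SYNC_FLUSH keeps compressor and decompressor in step
 import zlib
 com = zlib.compressobj()
 decom = zlib.decompressobj()
 result = ''
 for chunk in ('spam and eggs ' * 50, 'toast and tea ' * 50):
     data = com.compress(chunk) + com.flush(zlib.Z_SYNC_FLUSH)
     result += decom.decompress(data)    # everything sent so far is recovered
 print result == 'spam and eggs ' * 50 + 'toast and tea ' * 50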

SEE ALSO: zlib.compress() 182; zlib.compressobj.compress() 183;

zlib.decompressobj.unused_data

As indicated, zlib.decompressobj.unused_data is an instance attribute rather than a method. If any partial compressed stream cannot be decompressed immediately based on the byte-stream received, the remainder is buffered in this instance attribute. Normally, any output of a compression object forms a complete decompression block, and nothing is left in this instance attribute. However, if data is received in bits over a channel, only partial decompression may be possible on a particular zlib.decompressobj.decompress() call.

SEE ALSO: zlib.decompress() 182; zlib.decompressobj.decompress() 185;

zlib.decompressobj.decompress(s)

Return the decompressed data that may be derived from the current decompression object state and the argument s data passed in. If all of s cannot be decompressed in this pass, the remainder is left in zlib.decompressobj.unused_data.

zlib.decompressobj.flush()

Return the decompressed data from any bytes buffered by the decompression object. After this call, the decompression object cannot be used further, and you should del it.

EXCEPTIONS
zlib.error

Exception that is raised by compression or decompression errors.

SEE ALSO: gzip 173; zipfile 176;

2.2.6 Unicode

Note that Appendix C (Understanding Unicode) also discusses Unicode issues.

Unicode is an enhanced set of character entities, well beyond the basic 128 characters defined in ASCII encoding and the codepage-specific national language sets that contain 128 characters each. The full Unicode character set, evolving continuously but with a large number of codepoints already fixed, can contain literally millions of distinct characters. This allows the representation of a large number of national character sets within a unified encoding space, even the large character sets of the Chinese-Japanese-Korean (CJK) alphabets.

Although Unicode defines a unique codepoint for each distinct character in its range, there are numerous encodings that correspond to each character. The encoding called UTF-8 defines ASCII characters as single bytes with standard ASCII values. However, for non-ASCII characters, a variable number of bytes (up to 6) are used to encode characters, with the "escape" to Unicode being indicated by high-bit values in initial bytes of multibyte sequences. UTF-16 is similar, but uses either 2 or 4 bytes to encode each character (but never just 1). UTF-32 is a format that uses a fixed 4-byte value for each Unicode character. UTF-32, however, is not currently supported by Python.

Native Unicode support was added to Python 2.0. On the face of it, it is a happy situation that Python supports Unicode; it brings the world closer to multinational language support in computer applications. But in practice, you have to be careful when working with Unicode, because it is all too easy to encounter glitches like the one below:

 >>> import unicodedata
 >>> alef, omega = unichr(1488), unichr(969)
 >>> unicodedata.name(alef)
 'HEBREW LETTER ALEF'
 >>> print alef
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeError: ASCII encoding error: ordinal not in range(128)
 >>> print chr(170)
 >>> if alef == chr(170): print "Hebrew is Roman diacritic"
 ...
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeError: ASCII decoding error: ordinal not in range(128)

A Unicode string that is composed of only ASCII characters, however, is considered equal (but not identical) to a Python string of the same characters.

 >>> u"spam" == "spam" 1 >>> u"spam" is "spam" 0 >>> "spam" is "spam"   # string interning is not guaranteed 1 >>> u"spam" is u"spam" # unicode interning not guaranteed 1 

Still, the care you take should not discourage you from working with multilanguage strings, as Unicode enables. It is really amazingly powerful to be able to do so. As one says of a talking dog: It is not that he speaks so well, but that he speaks at all.

Built-In Unicode Functions/Methods

The Unicode string method u"".encode() and the built-in function unicode() are inverse operations. The Unicode string method returns a plain string with the 8-bit bytes needed to represent it (using the specified or default encoding). The built-in unicode() takes one of these encoded strings and produces the Unicode object represented by the encoding. Specifically, suppose we define the function:

 >>> chk_eq = lambda u,enc: u == unicode(u.encode(enc),enc) 

The call chk_eq(u, enc) should return 1 for every value of u and enc as long as enc is a valid encoding name and u is capable of being represented in that encoding.
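For instance, continuing the session above (the sample string is chosen merely for illustration):

 >>> u = u'Ni\xf1o'       # u'\xf1' is LATIN SMALL LETTER N WITH TILDE
 >>> [chk_eq(u, enc) for enc in ('utf-8', 'utf-16', 'latin-1', 'unicode-escape')]
 [1, 1, 1, 1]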

The set of encodings supported for both built-ins is listed below. Additional encodings may be registered using the codecs module. Each encoding is indicated by the string that names it, and the case of the string is normalized before comparison (case-insensitive naming of encodings):

ascii, us-ascii

Encode using 7-bit ASCII.

base64

Encode Unicode strings using the base64 3-to-4 encoding format.

latin-1, iso-8859-1

Encode using common European accent characters in high-bit values of 8-bit bytes. Latin-1 characters' ord() values are identical to their Unicode codepoints.

quopri

Encode in quoted printable format.

rot13

Not really a Unicode encoding, but "rotate 13 chars" is included with Python 2.2+ as an example and convenience.

utf-7

Encode using variable byte-length encoding that is restricted to 7-bit ASCII octets. As with utf-8, ASCII characters encode themselves.

utf-8

Encode using variable byte-length encoding that preserves ASCII value bytes.

utf-16

Encode using 2/4 byte encoding. Include "endian" lead bytes (platform-specific selection).

utf-16-le

Encoding using 2/4 byte encoding. Assume "little endian," and do not prepend "endian" indicator bytes.

utf-16-be

Encoding using 2/4 byte encoding. Assume "big endian," and do not prepend "endian" indicator bytes.

unicode-escape

Encode using Python-style Unicode string constants (u"\uXXXX").

raw-unicode-escape

Encode using Python-style Unicode raw string constants (ur"\uXXXX").

The error modes for both built-ins are listed below. Errors in encoding transformations may be handled in any of several ways:

strict

Raise UnicodeError for all decoding errors. Default handling.

ignore

Skip all invalid characters.

replace

Replace invalid characters with ? (string target) or u"\xfffd" (Unicode target).

u"".encode([enc [,errmode]])
"".encode([enc [,errmode]])

Return an encoded string representation of a Unicode string (or of a plain string). The representation is in the style of encoding enc (or system default). This string is suitable for writing to a file or stream that other applications will treat as Unicode data. Examples show several encodings:

 >>> alef = unichr(1488)
 >>> s = 'A'+alef
 >>> s
 u'A\u05d0'
 >>> s.encode('unicode-escape')
 'A\\u05d0'
 >>> s.encode('utf-8')
 'A\xd7\x90'
 >>> s.encode('utf-16')
 '\xff\xfeA\x00\xd0\x05'
 >>> s.encode('utf-16-le')
 'A\x00\xd0\x05'
 >>> s.encode('ascii')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeError: ASCII encoding error: ordinal not in range(128)
 >>> s.encode('ascii','ignore')
 'A'
unicode(s [,enc [,errmode]])

Return a Unicode string object corresponding to the encoded string passed in the first argument s. The string s might be a string that is read from another Unicode-aware application. The representation is treated as conforming to the style of the encoding enc if the second argument is specified, or the system default otherwise (usually ASCII). Errors can be handled in the default strict style or in a style specified in the third argument errmode.

unichr(cp)

Return a Unicode string object containing the single Unicode character whose integer codepoint is passed in the argument cp.

codecs • Python Codec Registry, API, and helpers

The codecs module contains a lot of sophisticated functionality to get at the internals of Python's Unicode handling. Most of those capabilities are at a lower level than programmers who are just interested in text processing need to worry about. The documentation of this module, therefore, will break slightly with the style of most of the documentation and present only two very useful wrapper functions within the codecs module.

codecs.open(filename=...[,mode='rb' [,encoding=...[,errors='strict' [,buffering=1]]]])

This wrapper function provides a simple and direct means of opening a Unicode file, and treating its contents directly as Unicode. In contrast, the contents of a file opened with the built-in open() function are written and read as strings; to read/write Unicode data to such a file involves multiple passes through u"".encode() and unicode().

The first argument filename specifies the name of the file to access. If the second argument mode is specified, the read/write mode can be selected. These arguments work identically to those used by open(). If the third argument encoding is specified, this encoding will be used to interpret the file (an incorrect encoding will probably result in a UnicodeError). Error handling may be modified by specifying the fourth argument errors (the options are the same as with the built-in unicode() function). A fifth argument buffering may be specified to use a specific buffer size (on platforms that support this).

An example of usage clarifies the difference between codecs.open() and the built-in open():

 >>> import codecs
 >>> alef = unichr(1488)
 >>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
 >>> open('unicode_test').read()   # Read as plain string
 'A\xd7\x90'
 >>> # Now read directly as Unicode
 >>> codecs.open('unicode_test', encoding='utf-8').read()
 u'A\u05d0'

Data written back to a file opened with codecs.open() should likewise be Unicode data.

SEE ALSO: open() 15;

codecs.EncodedFile(file=..., data_encoding=...[,file_encoding=...[,errors='strict']])

This function allows an already opened file to be wrapped inside an "encoding translation" layer. The mode and buffering are taken from the underlying file. By specifying a second argument data_encoding and a third argument file_encoding, it is possible to generate strings in one encoding within an application, then write them directly into the appropriate file encoding. As with codecs.open() and unicode(), an error handling style may be specified with the fourth argument errors.

The most likely purpose for codecs.EncodedFile() is where an application is likely to receive byte-streams from multiple sources, encoded according to multiple Unicode encodings. By wrapping file objects (or file-like objects) in an encoding translation layer, the strings coming in one encoding can be transparently written to an output in the format the output expects. An example clarifies:

 >>> import codecs
 >>> alef = unichr(1488)
 >>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
 >>> fp = open('unicode_test','rb+')
 >>> fp.read()     # Plain string w/ two-byte UTF-8 char in it
 'A\xd7\x90'
 >>> utf16_writer = codecs.EncodedFile(fp,'utf-16','utf-8')
 >>> ascii_writer = codecs.EncodedFile(fp,'ascii','utf-8')
 >>> utf16_writer.tell()   # Wrapper keeps same current position
 3
 >>> s = alef.encode('utf-16')
 >>> s             # Plain string as UTF-16 encoding
 '\xff\xfe\xd0\x05'
 >>> utf16_writer.write(s)
 >>> ascii_writer.write('XYZ')
 >>> fp.close()             # File should be UTF-8 encoded
 >>> open('unicode_test').read()
 'A\xd7\x90\xd7\x90XYZ'

SEE ALSO: codecs.open() 189;

unicodedata • Database of Unicode characters

The module unicodedata is a database of Unicode character entities. Most of the functions in unicodedata take as an argument one Unicode character and return some information about the character contained in a plain (non-Unicode) string. The function of unicodedata is essentially informational, rather than transformational. Of course, an application might make decisions about the transformations performed based on the information returned by unicodedata. The short utility below provides all the information available for any Unicode codepoint:

unichr_info.py
 # Return all the information [unicodedata] has
 # about the single unicode character whose codepoint
 # is specified as a command-line argument.
 # Arg may be any expression evaluating to an integer
 from unicodedata import *
 import sys
 char = unichr(eval(sys.argv[1]))
 print 'bidirectional', bidirectional(char)
 print 'category     ', category(char)
 print 'combining    ', combining(char)
 print 'decimal      ', decimal(char,0)
 print 'decomposition', decomposition(char)
 print 'digit        ', digit(char,0)
 print 'mirrored     ', mirrored(char)
 print 'name         ', name(char,'NOT DEFINED')
 print 'numeric      ', numeric(char,0)
 try: print 'lookup       ', repr(lookup(name(char)))
 except: print "Cannot lookup"

The usage of unichr_info.py is illustrated below by the runs with two possible arguments:

 % python unichr_info.py 1488
 bidirectional R
 category      Lo
 combining     0
 decimal       0
 decomposition
 digit         0
 mirrored      0
 name          HEBREW LETTER ALEF
 numeric       0
 lookup        u'\u05d0'
 % python unichr_info.py ord('1')
 bidirectional EN
 category      Nd
 combining     0
 decimal       1
 decomposition
 digit         1
 mirrored      0
 name          DIGIT ONE
 numeric       1.0
 lookup        u'1'

For additional information on current Unicode character codepoints and attributes, consult:

<http://www.unicode.org/Public/UNIDATA/UnicodeData.html>

FUNCTIONS
unicodedata.bidirectional(unichr)

Return the bidirectional characteristic of the character specified in the argument unichr. Possible values are AL, AN, B, BN, CS, EN, ES, ET, L, LRE, LRO, NSM, ON, PDF, R, RLE, RLO, S, and WS. Consult the URL above for details on these. Particularly notable values are L (left-to-right), R (right-to-left), and WS (whitespace).

unicodedata.category(unichr)

Return the category of the character specified in the argument unichr. Possible values are Cc, Cf, Cn, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, and Zs. The first (capital) letter indicates L (letter), M (mark), N (number), P (punctuation), S (symbol), Z (separator), or C (other). The second letter is generally mnemonic within the major category of the first letter. Consult the URL above for details.

unicodedata.combining(unichr)

Return the numeric combining class of the character specified in the argument unichr. These include values such as 218 (below left) or 210 (right attached). Consult the URL above for details.

unicodedata.decimal(unichr [,default])

Return the numeric decimal value assigned to the character specified in the argument unichr. If the second argument default is specified, return that if no value is assigned (otherwise raise ValueError).

unicodedata.decomposition(unichr)

Return the decomposition mapping of the character specified in the argument unichr, or empty string if none exists. Consult the URL above for details. An example shows that some characters may be broken into component characters:

 >>> from unicodedata import * >>> name(unichr(190)) 'VULGAR FRACTION THREE QUARTERS' >>> decomposition(unichr(190)) '<fraction> 0033 2044 0034' >>> name(unichr(0x33)), name(unichr(0x2044)), name(unichr(0x34)) ('DIGIT THREE', 'FRACTION SLASH', 'DIGIT FOUR') 
unicodedata.digit(unichr [,default])

Return the numeric digit value assigned to the character specified in the argument unichr. If the second argument default is specified, return that if no value is assigned (otherwise raise ValueError).

unicodedata.lookup(name)

Return the Unicode character with the name specified in the first argument name. Matches must be exact, and KeyError is raised if no match is found. For example:

 >>> from unicodedata import * >>> lookup('GREEK SMALL LETTER ETA') u'\u03b7' >>> lookup('ETA') Traceback (most recent call last):   File "<stdin>", line 1, in ? KeyError: undefined character name 

SEE ALSO: unicodedata.name() 193;

unicodedata.mirrored(unichr)

Return 1 if the character specified in the argument unichr is a mirrored character in bidirectional text. Return 0 otherwise.

unicodedata.name(unichr)

Return the name of the character specified in the argument unichr. Names are in all caps and have a regular form by descending category importance. Consult the URL above for details.

SEE ALSO: unicodedata.lookup() 193;

unicodedata.numeric(unichr [,default])

Return the floating point numeric value assigned to the character specified in the argument unichr. If the second argument default is specified, return that if no value is assigned (otherwise raise ValueError).
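A short session shows how the three numeric lookups differ (the sample characters are chosen merely for illustration):

 >>> from unicodedata import decimal, digit, numeric, name
 >>> name(unichr(0xbd))
 'VULGAR FRACTION ONE HALF'
 >>> numeric(unichr(0xbd))
 0.5
 >>> digit(unichr(0xbd), None)       # not a digit; the default None echoes nothing
 >>> decimal(u'7'), digit(u'7'), numeric(u'7')
 (7, 7, 7.0)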


