Section 11.1. Serialization | Python in a Nutshell, Second Edition (In a Nutshell)

11.1. Serialization

Python supplies a number of modules dealing with I/O operations that serialize (save) entire Python objects to various kinds of byte streams and deserialize (load and recreate) Python objects back from such streams. Serialization is also known as marshaling.

11.1.1. The marshal Module

The marshal module supports the specific serialization tasks needed to save and reload compiled Python files (.pyc and .pyo). marshal handles only fundamental built-in data types: None, numbers (int, long, float, complex), strings (plain and Unicode), code objects, and built-in containers (tuple, list, dict) whose items are instances of elementary types. marshal handles neither sets nor user-defined types and classes. marshal is faster than other serialization modules, and is the only one that supports code objects. Module marshal supplies the following functions.

dump, dumps

dump(value,fileobj)

dumps(value)

dumps returns a string representing object value. dump writes the same string to file object fileobj, which must be opened for writing in binary mode. dump(v,f) is just like f.write(dumps(v)). fileobj cannot be an arbitrary file-like object: it must specifically be an instance of type file.

load, loads

load(fileobj)

loads(str)

loads creates and returns the object v previously dumped to string str so that, for any object v of a supported type, v==loads(dumps(v)). If str is longer than dumps(v), loads ignores the extra bytes. load reads the right number of bytes from file object fileobj, which must be opened for reading in binary mode, and creates and returns the object v represented by those bytes. fileobj cannot be an arbitrary file-like object: it must specifically be an instance of type file.

Functions load and dump are complementary. In other words, a sequence of calls to load(f) deserializes the same values previously serialized when f's contents were created by a sequence of calls to dump(v,f).
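The behavior of these four functions can be illustrated with a minimal sketch. The sample values, the scratch file made with tempfile, and the class Point are assumptions introduced only for this illustration; the sketch checks the round-trip property, the pairing of successive dump and load calls, support for code objects, and the rejection of a user-defined instance.

```python
import marshal
import os
import tempfile

# Values built only from types that marshal supports.
values = [None, 42, 3.5, "text", (1, 2), [1, [2, 3]], {"k": (1, 2)}]

# dumps/loads round trip: v == loads(dumps(v)) for each supported value.
round_trip_ok = all(marshal.loads(marshal.dumps(v)) == v for v in values)

# dump/load are complementary: successive load calls return the values
# in the order in which successive dump calls wrote them.
fd, path = tempfile.mkstemp()          # hypothetical scratch file
os.close(fd)
try:
    with open(path, "wb") as f:        # binary mode for writing
        for v in values:
            marshal.dump(v, f)
    with open(path, "rb") as f:        # binary mode for reading
        reloaded = [marshal.load(f) for _ in values]
finally:
    os.remove(path)

# Code objects are within marshal's abilities.
code = compile("2+2", "<demo>", "eval")
code_result = eval(marshal.loads(marshal.dumps(code)))

# An instance of a user-defined class is not.
class Point(object):
    pass

try:
    marshal.dumps(Point())
    rejected = False
except ValueError:
    rejected = True
```

Here round_trip_ok and rejected should both end up true, reloaded should equal values, and code_result should be 4.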

11.1.1.1. A marshaling example

Say you need to read several text files, whose names are given as your program's arguments, recording where each word appears in the files. What you need to record for each word is a list of (filename, line-number) pairs. The following example uses marshal to encode lists of (filename, line-number) pairs as strings and store them in a DBM-like file (as covered in "DBM Modules" on page 285). Since these lists contain tuples, each containing a string and a number, they are within marshal's abilities to serialize.

    import fileinput, marshal, anydbm

    # Map each word to a list of (filename, line-number) pairs.
    wordPos = {}
    for line in fileinput.input():
        pos = fileinput.filename(), fileinput.filelineno()
        for word in line.split():
            wordPos.setdefault(word, []).append(pos)

    # Store each list, marshaled to a string, in a new DBM-like file.
    dbmOut = anydbm.open('indexfilem', 'n')
    for word in wordPos:
        dbmOut[word] = marshal.dumps(wordPos[word])
    dbmOut.close()

We also need marshal to read back the data stored in the DBM-like file indexfilem, as shown in the following example:

    import sys, marshal, anydbm, linecache

    dbmIn = anydbm.open('indexfilem')
    for word in sys.argv[1:]:
        if not dbmIn.has_key(word):
            sys.stderr.write('Word %r not found in index file\n' % word)
            continue
        # Recreate the list of (filename, line-number) pairs for this word.
        places = marshal.loads(dbmIn[word])
        for fname, lineno in places:
            print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
            print linecache.getline(fname, lineno),

11.1.2. The pickle and cPickle Modules

The pickle and cPickle modules supply factory functions, named Pickler and Unpickler, to generate objects that wrap file-like objects and supply serialization mechanisms. Serializing and deserializing via these modules is also known as pickling and unpickling. The difference between the modules is that, in pickle, Pickler and Unpickler are classes, so you can inherit from these classes to create customized serializer objects, overriding methods as needed. In cPickle, on the other hand, Pickler and Unpickler are factory functions that generate instances of non-subclassable types, not classes. Performance is much better with cPickle, but inheritance is not feasible. In the rest of this section, I'll be talking about module pickle, but everything applies to cPickle too.

Note that unpickling from an untrusted data source is a security risk; an attacker could exploit this to execute arbitrary code. Don't unpickle untrusted data!

Serialization shares some of the issues of deep copying, covered in deepcopy on page 172. Module pickle deals with these issues in much the same way as module copy does. Serialization, like deep copying, implies a recursive walk over a directed graph of references. pickle preserves the graph's shape when the same object is encountered more than once: the object is serialized only the first time, and other occurrences of the same object serialize references to a single copy. pickle also correctly serializes graphs with reference cycles. However, this means that if a mutable object o is serialized more than once to the same Pickler instance p, any changes to o after the first serialization of o to p are not saved. For clarity and simplicity, avoid altering objects that are being serialized while serialization to a Pickler instance is in progress.
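A minimal sketch of both points follows: graph shape is preserved, and a Pickler does not notice changes made to an object it has already saved. The names shared, pair, cycle, and o are hypothetical, and io.BytesIO is an assumption standing in for a file opened in binary mode.

```python
import io
import pickle

# Two references to one list, plus a list that contains itself.
shared = [1, 2]
pair = [shared, shared]
cycle = []
cycle.append(cycle)

pair2 = pickle.loads(pickle.dumps(pair, 2))
cycle2 = pickle.loads(pickle.dumps(cycle, 2))
same_after = pair2[0] is pair2[1]     # the sharing survives the round trip
cycle_after = cycle2[0] is cycle2     # so does the reference cycle

# The trap: one Pickler instance remembers objects it has already saved.
buf = io.BytesIO()
p = pickle.Pickler(buf, 2)
o = ["first"]
p.dump(o)
o.append("second")    # change made after the first serialization of o
p.dump(o)             # saved only as a reference to the first copy

buf.seek(0)
u = pickle.Unpickler(buf)
first = u.load()
second = u.load()     # the appended item is not part of what was saved
```

After the two load calls, second should be the very same object as first and should hold only the item "first".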

pickle can serialize in an ASCII format or in either of two compact binary ones. The ASCII format is the default, for backward compatibility. When you reload objects, pickle transparently recognizes and uses any format. I recommend you always specify binary format 2: the size and speed savings can be substantial, and the binary format has basically no downside except loss of compatibility with very old versions of Python.
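As a rough illustration of the size difference, the following sketch pickles one hypothetical dictionary (data, an assumption made up for this example) with protocol 0 and with protocol 2, then reloads both without telling pickle which format was used.

```python
import pickle

data = {"numbers": list(range(200)), "name": "example"}

ascii_form = pickle.dumps(data, 0)     # protocol 0: ASCII format
binary_form = pickle.dumps(data, 2)    # protocol 2: compact binary format

# loads takes no protocol argument: it recognizes either format.
from_ascii = pickle.loads(ascii_form)
from_binary = pickle.loads(binary_form)

sizes = (len(ascii_form), len(binary_form))
```

The exact byte counts in sizes depend on the data and on the Python version, but for data like this the binary pickle should be clearly shorter than the ASCII one.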

pickle serializes classes and functions by name, not by value. pickle can therefore deserialize a class or function only by importing it from the same module where the class or function was found when pickle serialized it. In particular, pickle can serialize and deserialize classes and functions only if they are top-level names for their module (i.e., attributes of their module). For example, consider the following:

    def adder(augend):
        def inner(addend, augend=augend): return addend+augend
        return inner
    plus5 = adder(5)

This code binds a closure to name plus5 (as covered in "Nested functions and nested scopes" on page 77): a nested function inner plus an appropriate nested scope. Therefore, trying to pickle plus5 raises a pickle.PicklingError exception: a function can be pickled only when it is top-level, and function inner, whose closure is bound to name plus5 in this code, is not top-level but rather nested inside function adder. Similar issues apply to all pickling of nested functions and nested classes (i.e., classes that are not top-level).
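The contrast between top-level and nested functions can be checked with a short sketch. The helper can_pickle is hypothetical, and textwrap.dedent is used merely as a convenient example of a function that is a top-level name of its module. The exact exception class raised for the nested function differs across Python versions, so the helper accepts several.

```python
import pickle
import textwrap

def adder(augend):
    def inner(addend, augend=augend):
        return addend + augend
    return inner

plus5 = adder(5)

def can_pickle(obj):
    """Return True if pickle.dumps(obj, 2) succeeds, False otherwise."""
    try:
        pickle.dumps(obj, 2)
        return True
    # Assumption: depending on the Python version, the failure surfaces
    # as PicklingError, AttributeError, or TypeError.
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

top_level_ok = can_pickle(textwrap.dedent)   # top-level name in its module
nested_ok = can_pickle(plus5)                # nested inside adder
```

plus5 still works as an ordinary callable (plus5(3) is 8); it just cannot be found by name at unpickling time, which is why nested_ok should come out false while top_level_ok comes out true.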

11.1.2.1. Functions of pickle and cPickle

Modules pickle and cPickle expose the following functions.

dump, dumps

dump(value,fileobj,protocol=None,bin=None)

dumps(value,protocol=None,bin=None)

dumps returns a string representing object value. dump writes the same string to file-like object fileobj, which must be opened for writing. dump(v,f) is like f.write(dumps(v)). Do not pass the bin parameter, which exists only for compatibility with old versions of Python. The protocol parameter can be 0 (the default, for compatibility reasons; ASCII output, slowest and bulkiest), 1 (binary output, compatible with old versions of Python), or 2 (fastest and leanest). I suggest you always pass the value 2. Unless protocol is 0 or absent, implying ASCII output, the fileobj parameter to dump must be open for binary writing.

load, loads

load(fileobj)

loads(str)

loads creates and returns the object v represented by string str so that for any object v of a supported type, v==loads(dumps(v)). If str is longer than dumps(v), loads ignores the extra bytes. load reads the right number of bytes from file-like object fileobj and creates and returns the object v represented by those bytes. load and loads transparently support pickles performed in any binary or ASCII mode. If data is pickled in either binary format, the file must be open as binary for both dump and load. load(f) is like Unpickler(f).load().

Functions load and dump are complementary. In other words, a sequence of calls to load(f) deserializes the same values previously serialized when f's contents were created by a sequence of calls to dump(v,f).

Pickler

Pickler(fileobj,protocol=None,bin=None)

Creates and returns an object p such that calling p.dump is equivalent to calling function dump with the fileobj, protocol, and bin arguments passed to Pickler. To serialize many objects to a file, Pickler is more convenient and faster than repeated calls to dump. You can subclass pickle.Pickler to override Pickler methods (particularly method persistent_id) and create a persistence framework. However, this is an advanced issue and is not covered further in this book.

Unpickler

Unpickler(fileobj)

Creates and returns an object u such that calling u.load is equivalent to calling function load with the fileobj argument passed to Unpickler. To deserialize many objects from a file, Unpickler is more convenient and faster than repeated calls to function load. You can subclass pickle.Unpickler to override Unpickler methods (particularly the method persistent_load) and create your own persistence framework. However, this is an advanced issue and is not covered further in this book.
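To make the pairing of Pickler and Unpickler concrete, here is a small sketch that writes several objects through one Pickler and reads them back, in the same order, through one Unpickler. The records list is made-up sample data, and io.BytesIO is an assumption standing in for a file opened in binary mode.

```python
import io
import pickle

records = [("a.txt", 1), ("a.txt", 7), ("b.txt", 3)]

stream = io.BytesIO()                  # stands in for a binary-mode file
pickler = pickle.Pickler(stream, 2)
for record in records:
    pickler.dump(record)               # one pickle per call, back to back

stream.seek(0)
unpickler = pickle.Unpickler(stream)
reloaded = [unpickler.load() for _ in records]

# Asking for one more object than was dumped raises EOFError.
try:
    unpickler.load()
    hit_eof = False
except EOFError:
    hit_eof = True
```

Each call to load consumes exactly one pickle, so the reader must know how many objects to expect or be prepared to stop on EOFError.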

11.1.2.2. A pickling example

The following example handles the same task as the marshal example shown earlier but uses cPickle instead of marshal to encode lists of (filename, line-number) pairs as strings:

    import fileinput, cPickle, anydbm

    # Map each word to a list of (filename, line-number) pairs.
    wordPos = {}
    for line in fileinput.input():
        pos = fileinput.filename(), fileinput.filelineno()
        for word in line.split():
            wordPos.setdefault(word, []).append(pos)

    # Store each list, pickled with binary protocol 1, in a new DBM-like file.
    dbmOut = anydbm.open('indexfilep', 'n')
    for word in wordPos:
        dbmOut[word] = cPickle.dumps(wordPos[word], 1)
    dbmOut.close()

We can use either cPickle or pickle to read back the data stored to the DBM-like file indexfilep, as shown in the following example:

    import sys, cPickle, anydbm, linecache

    dbmIn = anydbm.open('indexfilep')
    for word in sys.argv[1:]:
        if not dbmIn.has_key(word):
            sys.stderr.write('Word %r not found in index file\n' % word)
            continue
        places = cPickle.loads(dbmIn[word])
        for fname, lineno in places:
            print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
            print linecache.getline(fname, lineno),

11.1.2.3. Pickling instances

In order for pickle to reload an instance x, pickle must be able to import x's class from the same module in which the class was defined when pickle saved the instance. Here is how pickle saves the state of instance object x of class T and later reloads the saved state into a new instance y of T (the first step of the reloading is always to make a new empty instance y of T, except where I explicitly say otherwise in the following):

- When T supplies method __getstate__, pickle saves the result d of calling T.__getstate__(x).
  - When T supplies method __setstate__, d can be of any type, and pickle reloads the saved state by calling T.__setstate__(y, d).
  - Otherwise, d must be a dictionary, and pickle just sets y.__dict__ = d.
- Otherwise, when T is new-style and supplies method __getnewargs__, and pickle is pickling with protocol 2, pickle saves the result t of calling T.__getnewargs__(x); t must be a tuple.
  - pickle, in this one case, does not start with an empty y but rather creates y by executing y = T.__new__(T, *t), which concludes the reloading.
- Otherwise, when T is old-style and supplies method __getinitargs__, pickle saves the result t of calling T.__getinitargs__(x) (t must be a tuple) and then, as d, the dictionary x.__dict__.
  - pickle reloads the saved state by first calling T.__init__(y, *t) and then calling y.__dict__.update(d).
- Otherwise, by default, pickle saves as d the dictionary x.__dict__.
  - When T supplies method __setstate__, pickle reloads the saved state by calling T.__setstate__(y, d).
  - Otherwise, pickle just sets y.__dict__ = d.

All the items in the d or t object that pickle saves and reloads (normally a dictionary or tuple) must in turn be instances of types suitable for pickling and unpickling (i.e., pickleable objects), and the procedure just outlined may be repeated recursively, if necessary, until pickle reaches primitive pickleable built-in types (dictionaries, tuples, lists, sets, numbers, strings, etc.).
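The first case in the list above, a class that supplies both __getstate__ and __setstate__, can be sketched as follows. Class Counter and its attributes are hypothetical, invented only for this illustration; the idea is that derived data is left out of the saved state and rebuilt on reload.

```python
import pickle

class Counter(object):
    """Hypothetical class: 'cache' is derived data that need not be saved."""

    def __init__(self, start):
        self.count = start
        self.cache = {"doubled": start * 2}

    def __getstate__(self):
        # Save everything except the derived cache.
        state = self.__dict__.copy()
        del state["cache"]
        return state

    def __setstate__(self, state):
        # Called on a new, empty instance; __init__ does not run on reload.
        self.__dict__.update(state)
        self.cache = {"doubled": self.count * 2}

x = Counter(21)
y = pickle.loads(pickle.dumps(x, 2))
```

After the round trip, y is a distinct Counter whose count is 21 and whose cache was rebuilt by __setstate__ rather than read from the pickle.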

As mentioned in "The copy Module" on page 172, special methods __getinitargs__, __getnewargs__, __getstate__, and __setstate__ also control the way instance objects are copied and deep-copied. If a new-style class defines __slots__, and therefore its instances do not have a __dict__, pickle does its best to save and restore a dictionary equivalent to the names and values of the slots. However, such a new-style class should define __getstate__ and __setstate__; otherwise, its instances may not be correctly pickleable and copyable through such best-effort endeavors.
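A minimal sketch of that advice for a slotted class follows. Class Point is hypothetical; saving the state as a tuple is just one reasonable choice, since __setstate__ accepts whatever __getstate__ returned.

```python
import pickle

class Point(object):
    """Hypothetical slotted class; its instances have no __dict__."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getstate__(self):
        # Any pickleable object can serve as the state; a tuple is compact.
        return (self.x, self.y)

    def __setstate__(self, state):
        self.x, self.y = state

p = Point(3, 4)
q = pickle.loads(pickle.dumps(p, 2))
```

Defining the two methods explicitly means the class does not rely on pickle's best-effort handling of slots.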

11.1.2.4. Pickling customization with the copy_reg module

You can control how pickle serializes and deserializes objects of an arbitrary type (or new-style class) by registering factory and reduction functions with module copy_reg. This is particularly, though not exclusively, useful when you define a type in a C-coded Python extension. Module copy_reg supplies the following functions.

constructor

constructor(fcon)

Adds fcon to the table of constructors, which lists all factory functions that pickle may call. fcon must be callable and is normally a function.

pickle

pickle(type,fred,fcon=None)

Registers function fred as the reduction function for type type, where type must be a type object (not an old-style class). To save any object o of type type, module pickle calls fred(o) and saves the result. fred(o) must return a pair (fcon,t) or a tuple (fcon,t,d), where fcon is a constructor and t is a tuple. To reload o, pickle calls o=fcon(*t). Then, if fred returned a d, pickle uses d to restore o's state (o.__setstate__(d) if o supplies __setstate__; otherwise, o.__dict__.update(d)), as in "Pickling instances" on page 282. If fcon is not None, pickle also calls constructor(fcon) to register fcon as a constructor.

pickle does not support pickling of code objects, but marshal does. Here is how you can customize pickling to support code objects by delegating the work to marshal thanks to copy_reg:

    >>> import pickle, copy_reg, marshal
    >>> def viaMarshal(x): return marshal.loads, (marshal.dumps(x),)
    ...
    >>> c=compile('2+2','','eval')
    >>> copy_reg.pickle(type(c), viaMarshal)
    >>> s=pickle.dumps(c, 2)
    >>> cc=pickle.loads(s)
    >>> print eval(cc)
    4
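The three-item form (fcon,t,d) that a reduction function may return can be sketched in the same spirit. Class Temperature and functions make_temperature and reduce_temperature are hypothetical names invented for this illustration. The fallback import is an assumption added so that the sketch also runs on Python 3, where the module is named copyreg.

```python
import pickle
try:
    import copy_reg                 # the module name used in this section
except ImportError:
    import copyreg as copy_reg      # Python 3 name for the same module

class Temperature(object):
    """Hypothetical new-style class used only for this illustration."""
    def __init__(self, degrees):
        self.degrees = degrees
        self.note = ""

def make_temperature(degrees):
    # fcon: pickle calls fcon(*t) to recreate the object on reload.
    return Temperature(degrees)

def reduce_temperature(o):
    # fred: returns (fcon, t, d); d restores extra state after fcon(*t).
    return make_temperature, (o.degrees,), {"note": o.note}

# Register fred for the type, and fcon as a constructor.
copy_reg.pickle(Temperature, reduce_temperature, make_temperature)

t = Temperature(20)
t.note = "measured indoors"
t2 = pickle.loads(pickle.dumps(t, 2))
```

Because Temperature supplies no __setstate__, the dictionary d is applied to the reloaded object's __dict__, so t2.note carries the value set on t.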

11.1.3. The shelve Module

The shelve module orchestrates modules cPickle (or pickle, when cPickle is not available in the current Python installation), cStringIO (or StringIO, when cStringIO is not available in the current Python installation), and anydbm (and its underlying modules for access to DBM-like archive files, as discussed in "DBM Modules" on page 285) in order to provide a simple, lightweight persistence mechanism.

shelve supplies a function open that is polymorphic to anydbm.open. The mapping object s returned by shelve.open is less limited than the mapping object a returned by anydbm.open. a's keys and values must be strings. s's keys must also be strings, but s's values may be of any pickleable types or classes. pickle customizations (e.g., copy_reg, __getinitargs__, __getstate__, and __setstate__) also apply to shelve, since shelve delegates serialization to pickle.

Beware of a subtle trap when you use shelve and mutable objects. When you operate on a mutable object held in a shelf, the changes don't "take" unless you assign the changed object back to the same index. For example:

    import shelve
    s = shelve.open('data')
    s['akey'] = range(4)
    print s['akey']                    # prints: [0, 1, 2, 3]
    s['akey'].append('moreover')       # trying direct mutation
    print s['akey']                    # doesn't take; prints: [0, 1, 2, 3]
    x = s['akey']                      # fetch the object
    x.append('moreover')               # perform mutation
    s['akey'] = x                      # store the object back
    print s['akey']                    # now it takes, prints: [0, 1, 2, 3, 'moreover']

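You can finesse this issue by passing named argument writeback=True when you call shelve.open, but beware: if you do pass that argument, you may seriously impair the performance of your program, since the shelf then keeps every entry you access cached in memory and writes all cached entries back when you call sync or close.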
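The effect of writeback=True can be seen in a short sketch. The scratch directory made with tempfile is an assumption added so that the example cleans up after itself, and a literal list replaces range(4) so the stored value is a list on any Python version.

```python
import os
import shelve
import shutil
import tempfile

workdir = tempfile.mkdtemp()               # hypothetical scratch directory
path = os.path.join(workdir, "data")

try:
    # Default behavior: mutating a fetched value changes only a temporary copy.
    s = shelve.open(path)
    s["akey"] = [0, 1, 2, 3]
    s["akey"].append("moreover")
    without_writeback = s["akey"]
    s.close()

    # writeback=True: fetched values are cached and written back on close.
    s = shelve.open(path, writeback=True)
    s["akey"].append("moreover")
    s.close()

    # Reopen normally to check what was actually stored.
    s = shelve.open(path)
    with_writeback = s["akey"]
    s.close()
finally:
    shutil.rmtree(workdir)
```

Only the mutation made while the shelf was open with writeback=True should reach the file: without_writeback stays [0, 1, 2, 3], while with_writeback ends with 'moreover'.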

11.1.3.1. A shelving example

The following example handles the same task as the earlier pickling and marshaling examples, but uses shelve to persist lists of (filename, line-number) pairs:

    import fileinput, shelve

    # Map each word to a list of (filename, line-number) pairs.
    wordPos = {}
    for line in fileinput.input():
        pos = fileinput.filename(), fileinput.filelineno()
        for word in line.split():
            wordPos.setdefault(word, []).append(pos)

    # Store each list directly; shelve takes care of the pickling.
    shOut = shelve.open('indexfiles', 'n')
    for word in wordPos:
        shOut[word] = wordPos[word]
    shOut.close()

We must use shelve to read back the data stored to the DBM-like file indexfiles, as shown in the following example:

    import sys, shelve, linecache

    shIn = shelve.open('indexfiles')
    for word in sys.argv[1:]:
        if not shIn.has_key(word):
            sys.stderr.write('Word %r not found in index file\n' % word)
            continue
        places = shIn[word]
        for fname, lineno in places:
            print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
            print linecache.getline(fname, lineno),

These two examples are the simplest and most direct of the various equivalent pairs of examples shown throughout this section. This reflects the fact that module shelve is higher-level than the modules used in previous examples.