Recipe 7.2. Serializing Data Using the pickle and cPickle Modules

Credit: Luther Blissett

Problem

You want to serialize and reconstruct, at a reasonable speed, a Python data structure, which may include fundamental Python objects as well as classes and instances.

Solution

If you don't want to assume that your data is composed only of fundamental Python objects, or you need portability across versions of Python, or you need to transmit the serialized form as text, the best way of serializing your data is with the cPickle module. (The pickle module is a pure-Python equivalent and totally interchangeable, but it's slower and not worth using except if you're missing cPickle.) For example, say you have:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a text string:

import cPickle
text = cPickle.dumps(data)

or to a binary string, a choice that is faster and takes up less space:

bytes = cPickle.dumps(data, 2)

You can now sling text or bytes around as you wish (e.g., send across a network, include as a BLOB in a database; see Recipe 7.10, Recipe 7.11, and Recipe 7.12) as long as you keep text or bytes intact. In the case of bytes, it means keeping the arbitrary binary bytes intact. In the case of text, it means keeping its textual structure intact, including newline characters. Then you can reconstruct the data at any time, regardless of machine architecture or Python release:

redata1 = cPickle.loads(text)
redata2 = cPickle.loads(bytes)

Either call reconstructs a data structure that compares equal to data. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful, and thus it is preserved. You don't need to tell cPickle.loads whether the original dumps used text mode (the default, also readable by some very old versions of Python) or binary (faster and more compact): loads figures it out by examining its argument's contents.
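The round trip described above can be sketched as follows. This is a minimal illustration, not the recipe's own code: it assumes a Python where the module is imported as pickle (on Python 2, cPickle offers the same calls), and it requests protocols 0 and 2 explicitly because the default protocol is not the same in every Python release. The names as_text and as_binary are chosen here for illustration.

```python
import pickle  # on Python 2, "import cPickle as pickle" offers the same calls

data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}

# Protocol 0 is the text format, protocol 2 the binary one. Both are
# requested explicitly so the sketch does not depend on the default.
as_text = pickle.dumps(data, 0)
as_binary = pickle.dumps(data, 2)

# loads needs no hint about which protocol produced its argument.
redata1 = pickle.loads(as_text)
redata2 = pickle.loads(as_binary)

# The reconstructed structures are new objects that compare equal to data.
print(redata1 == data, redata2 == data)
print(redata1 is data)
```

Note that equality, not identity, is what survives the round trip: each loads call builds a fresh copy.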

When you specifically want to write the data to a file, you can also use the dump function of the cPickle module, which lets you dump several data structures to the same file one after the other:

ouf = open('datafile.txt', 'w')
cPickle.dump(data, ouf)
cPickle.dump('some string', ouf)
cPickle.dump(range(19), ouf)
ouf.close()

Once you have done this, you can recover from datafile.txt the same data structures you dumped into it, one after the other, in the same order:

inf = open('datafile.txt')
a = cPickle.load(inf)
b = cPickle.load(inf)
c = cPickle.load(inf)
inf.close()

You can also pass cPickle.dump a third argument with a value of 2 to tell cPickle.dump to serialize the data in binary form (faster and more compact), but the data file must then be opened for binary I/O, not in the default text mode, both when you originally dump to the file and when you later load from the file.
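A binary-mode variant of the file example might look like the sketch below. The temporary directory and the file name datafile.bin are assumptions made for illustration; the essential points are the 'wb' and 'rb' modes and the third argument of 2. The module is imported as pickle here; on Python 2, cPickle accepts the same arguments.

```python
import os
import pickle  # on Python 2, "import cPickle as pickle" offers the same calls
import shutil
import tempfile

data = {12: 'twelve', 'feep': list('ciao'), 1.23: 4+5j, (1, 2, 3): u'wer'}

# Hypothetical location: a scratch directory created just for this sketch.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'datafile.bin')

# Binary pickles require a file opened for binary writing.
with open(path, 'wb') as ouf:
    pickle.dump(data, ouf, 2)
    pickle.dump('some string', ouf, 2)
    pickle.dump(list(range(19)), ouf, 2)

# Read the structures back, in the same order, from a binary-mode file.
with open(path, 'rb') as inf:
    a = pickle.load(inf)
    b = pickle.load(inf)
    c = pickle.load(inf)

shutil.rmtree(workdir)  # remove the scratch directory
```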

Discussion

Python offers several ways to serialize data (i.e., make the data into a string of bytes that you can save on disk, save in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. Typically, the best approach is to use the cPickle module. A pure-Python equivalent, called pickle, is substantially slower (the cPickle module is coded in C as a Python extension), and the only reason to use it is if you don't have cPickle (e.g., with a Python port onto a mobile phone with tiny storage space, where you saved every byte you possibly could by installing only an indispensable subset of Python's large standard library). However, in cases where you do need to use pickle, rest assured that it is completely interchangeable with cPickle: you can pickle with either module and unpickle with the other one, without any problems whatsoever.
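Because the two modules are interchangeable, a common way to cope with a missing cPickle is to fall back to pickle at import time. The sketch below shows that idiom; the sample payload is made up for illustration. On Python 3 there is no separate cPickle module (pickle itself uses a C implementation when one is available), so the fallback branch is the one that runs there.

```python
try:
    import cPickle as pickle  # C-coded module, present on Python 2
except ImportError:
    import pickle             # fallback: same interface under the plain name

# The rest of the program uses the name "pickle" either way.
original = [1, 'two', (3.0, None)]
payload = pickle.dumps(original, 2)
restored = pickle.loads(payload)

print(restored == original)
```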

cPickle supports most elementary data types (e.g., dictionaries, lists, tuples, numbers, strings) and combinations thereof, as well as classes and instances. Pickling classes and instances saves only the data involved, not the code. (Code objects are not even among the types that cPickle knows how to serialize, basically because there would be no way to guarantee their portability across disparate versions of Python. See Recipe 7.6 for a way to serialize code objects, as long as you don't need the cross-version guarantee.) See Recipe 7.4 for more about pickling classes and instances.
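The remark about code objects is easy to check. The following sketch, which assumes a Python recent enough to expose a function's code object as __code__, tries to pickle one and records the refusal; add_one and code_pickled are names invented for this illustration.

```python
import pickle  # on Python 2, "import cPickle as pickle" offers the same calls

def add_one(x):
    return x + 1

try:
    # A code object is not among the types the pickler knows how to serialize.
    pickle.dumps(add_one.__code__)
    code_pickled = True
except (TypeError, pickle.PicklingError) as exc:
    code_pickled = False
    print('refused: %s' % exc)
```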

cPickle guarantees compatibility from one Python release to another, as well as independence from a specific machine's architecture. Data serialized with cPickle will still be readable if you upgrade your Python release, and pickling is also guaranteed to work if you're sending serialized data between different machines.

The dumps function of cPickle accepts any Python data structure and returns a text string representing it. If you call dumps with a second argument of 2, dumps returns an arbitrary bytestring instead: the operation is faster, and the resulting string takes up less space. You can pass either the text or the bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In between the dumps and loads calls, you can subject the text or bytestring to any procedure you wish, such as sending it over the network, storing it in a database and retrieving it, or encrypting and decrypting it. As long as the string's textual or binary structure is correctly restored, loads will work fine on it (even across platforms and in future releases). If you need to produce data readable by old (pre-2.3) versions of Python, consider using 1 as the second argument: the operation will be slower, and the resulting strings will not be as compact as those obtained by using 2, but old Python versions, as well as current and future ones, will be able to unpickle the strings.
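To get a feel for the space trade-off between the text format and the binary ones, you can pickle the same structure with each protocol number and compare lengths. This is a rough sketch, assuming a list of small integers as sample data; the actual savings depend heavily on what you pickle, so treat the numbers it prints as indicative only.

```python
import pickle  # on Python 2, "import cPickle as pickle" offers the same calls

numbers = list(range(1000))  # sample data chosen for illustration

sizes = {}
for protocol in (0, 1, 2):
    blob = pickle.dumps(numbers, protocol)
    # Whatever the protocol, loads reconstructs an equal structure.
    assert pickle.loads(blob) == numbers
    sizes[protocol] = len(blob)

for protocol in sorted(sizes):
    print('protocol %d: %d bytes' % (protocol, sizes[protocol]))
```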

When you specifically need to save the data into a file, you can also use cPickle's dump function, which takes two arguments: the data structure you're dumping and the open file or file-like object. If the file is opened for binary I/O, rather than the default (text I/O), then by giving dump a third argument of 2, you can ask for binary format, which is faster and takes up less space (again, you can also use 1 in this position to get a binary format that's neither as compact nor as fast, but is understood by old, pre-2.3 Python versions too). The advantage of dump over dumps is that, with dump, you can perform several calls, one after the other, with various data structures and the same open file object. Each dumped data structure is self-delimiting: its pickle ends with a stop marker, so load knows where one structure ends and the next begins. Consequently, when you later open the file for reading (binary reading, if you asked for binary format) and then repeatedly call cPickle.load, passing the file as the argument, each data structure previously dumped is reloaded sequentially, one after the other. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped.
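When you don't know in advance how many structures a file holds, one approach is to keep calling load until it raises EOFError. The helper below, load_all, is a hypothetical name introduced for this sketch, not part of the pickle API; an in-memory io.BytesIO stands in for the file to show that any file-like object will do.

```python
import io
import pickle  # on Python 2, "import cPickle as pickle" offers the same calls

def load_all(fileobj):
    """Yield each pickled object in fileobj until the stream is exhausted."""
    while True:
        try:
            obj = pickle.load(fileobj)
        except EOFError:
            return  # nothing left to read
        yield obj

# Dump several structures, one after the other, into the same file-like object.
dumped = [{'a': 1}, 'some string', list(range(19))]
buf = io.BytesIO()
for item in dumped:
    pickle.dump(item, buf, 2)

buf.seek(0)  # rewind before reading back
items = list(load_all(buf))
print(len(items))
```

A generator keeps memory use modest, since each structure is reconstructed only when the caller asks for it.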

Those accustomed to other languages and libraries offering "serialization" facilities may be wondering whether pickle imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine's memory might, but as long as everything fits comfortably in memory, pickle practically imposes no further limit.
