Recipe7.1.Serializing Data Using the marshal Module

Recipe 7.1. Serializing Data Using the marshal Module

Credit: Luther Blissett

Problem

You want to serialize and reconstruct a Python data structure whose items are fundamental Python objects (e.g., lists, tuples, numbers, and strings but no classes, instances, etc.) as fast as possible.

Solution

If you know that your data is composed entirely of fundamental Python objects (and you only need to support one version of Python, though possibly on several different platforms), the lowest-level, fastest approach to serializing your data (i.e., turning it into a string of bytes, and later reconstructing it from such a string) is via the marshal module. Suppose that data has only elementary Python data types as items, for example:

data = {12:'twelve', 'feep':list('ciao'), 1.23:4+5j, (1,2,3):u'wer'}

You can serialize data to a bytestring at top speed as follows:

import marshal bytes = marshal.dumps(data)

You can now sling bytes around as you wish (e.g., send it across a network, put it as a BLOB in a database, etc.), as long as you keep its arbitrary binary bytes intact. Then you can reconstruct the data structure from the bytestring at any time:

redata = marshal.loads(bytes)

When you specifically want to write the data to a disk file (as long as the latter is open for binarynot the default text modeinput/output), you can also use the dump function of the marshal module, which lets you dump several data structures to the same file one after the other:

ouf = open('datafile.dat', 'wb') marshal.dump(data, ouf) marshal.dump('some string', ouf) marshal.dump(range(19), ouf) ouf.close( )

You can later recover from datafile.dat the same data structures you dumped into it, in the same sequence:

inf = open('datafile.dat', 'rb') a = marshal.load(inf) b = marshal.load(inf) c = marshal.load(inf) inf.close( )

Discussion

Python offers several ways to serialize data (meaning to turn the data into a string of bytes that you can save on disk, put in a database, send across the network, etc.) and corresponding ways to reconstruct the data from such serialized forms. The lowest-level approach is to use the marshal module, which Python uses to write its bytecode files. marshal supports only elementary data types (e.g., dictionaries, lists, tuples, numbers, and strings) and combinations thereof. marshal does not guarantee compatibility from one Python release to another, so data serialized with marshal may not be readable if you upgrade your Python release. However, marshal does guarantee independence from a specific machine's architecture, so it is guaranteed to work if you're sending serialized data between different machines, as long as they are all running the same version of Pythonsimilar to how you can share compiled Python bytecode files in such a distributed setting.

marshal's dumps function accepts any suitable Python data structure and returns a bytestring representing it. You can pass that bytestring to the loads function, which will return another Python data structure that compares equal (==) to the one you originally dumped. In particular, the order of keys in dictionaries is arbitrary in both the original and reconstructed data structures, but order in any kind of sequence is meaningful and is thus preserved. In between the dumps and loads calls, you can subject the bytestring to any procedure you wish, such as sending it over the network, storing it into a database and retrieving it, or encrypting and decrypting it. As long as the string's binary structure is correctly restored, loads will work fine on it (as stated previously, this is guaranteed only if you use loads under the same Python release with which you originally executed dumps).

When you specifically need to save the data to a file, you can also use marshal's dump function, which takes two arguments: the data structure you're dumping and the open file object. Note that the file must be opened for binary I/O (not the default, which is text I/O) and can't be a file-like object, as marshal is quite picky about it being a true file. The advantage of dump is that you can perform several calls to dump with various data structures and the same open file object: each data structure is then dumped together with information about how long the dumped bytestring is. As a consequence, when you later open the file for binary reading and then call marshal.load, passing the file as the argument, you can reload each previously dumped data structure sequentially, one after the other, at each call to load. The return value of load, like that of loads, is a new data structure that compares equal to the one you originally dumped. (Again, dump and load work within one Python releaseno guarantee across releases.)

Those accustomed to other languages and libraries offering "serialization" facilities may be wondering if marshal imposes substantial practical limits on the size of objects you can serialize or deserialize. Answer: Nope. Your machine's memory might, but as long as everything fits comfortably in memory, marshal imposes practically no further limit.

Recipe7.1.Serializing Data Using the marshal Module