Recipe 7.3. Using Compression with Pickling

Credit: Bill McNeill, Andrew Dalke

Problem

You want to pickle generic Python objects to and from disk in a compressed form.

Solution

Standard library modules cPickle and gzip offer the needed functionality; you just need to glue them together appropriately:

import cPickle, gzip

def save(filename, *objects):
    ''' save objects into a compressed diskfile '''
    fil = gzip.open(filename, 'wb')
    for obj in objects:
        cPickle.dump(obj, fil, 2)    # third argument: pickle protocol 2
    fil.close()

def load(filename):
    ''' reload objects from a compressed diskfile '''
    fil = gzip.open(filename, 'rb')
    while True:
        try:
            yield cPickle.load(fil)
        except EOFError:
            break
    fil.close()
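For readers on Python 3, where cPickle no longer exists as a separate module and pickle uses its C implementation automatically, the same idea might be sketched as follows. This is an adaptation under that assumption, not the recipe's original code: the names save and load are kept from the recipe, while the with statements are a substitution that closes the file once the generator finishes or is closed.

```python
import gzip
import pickle  # Python 3: replaces cPickle; the C accelerator is used when available


def save(filename, *objects):
    """Save objects into a gzip-compressed file, one pickle after another."""
    with gzip.open(filename, 'wb') as fil:
        for obj in objects:
            pickle.dump(obj, fil, protocol=2)


def load(filename):
    """Yield the objects stored in a gzip-compressed file written by save()."""
    with gzip.open(filename, 'rb') as fil:
        while True:
            try:
                yield pickle.load(fil)
            except EOFError:
                # Raised once every pickle in the file has been read.
                break
```

With these definitions, saving a few objects to a hypothetical file such as 'data.pkl.gz' and then calling list(load('data.pkl.gz')) should give back equal objects in the order they were saved.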

Discussion

Persistence and compression, as a general rule, go well together. cPickle protocol 2 saves Python objects quite compactly, but the resulting files can still compress quite well. For example, on my Linux box, open('/usr/dict/share/words').readlines() produces a list of over 45,000 strings. Pickling that list with the default protocol 0 makes a disk file of 972 KB, while protocol 2 takes only 716 KB. However, using both gzip and protocol 2, as shown in this recipe, requires only 268 KB, well under half the size of either uncompressed file. As it happens, protocol 0 produces a more compressible file in this case, so using gzip and protocol 0 would save even more space, taking only 252 KB on disk. However, the difference between 268 KB and 252 KB isn't all that meaningful, and protocol 2 has other advantages, particularly when used on instances of new-style classes, so I recommend the mix I use in the functions shown in this recipe.
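A comparison of this kind can be reproduced with a short script. The sketch below is written for Python 3 (pickle instead of cPickle) and uses a synthetic list of strings as a stand-in for the dictionary file, so both the data and the helper name sizes are assumptions; the byte counts it prints will not match the figures quoted above.

```python
import gzip
import pickle

# Synthetic stand-in for a word list (an assumption; the recipe reads a
# system dictionary file instead).
words = ['word%05d\n' % i for i in range(45000)]


def sizes(obj):
    """Return {(protocol, compressed): byte count} for protocols 0 and 2."""
    result = {}
    for proto in (0, 2):
        raw = pickle.dumps(obj, protocol=proto)
        result[proto, False] = len(raw)                 # plain pickle
        result[proto, True] = len(gzip.compress(raw))   # pickle, then gzip
    return result


if __name__ == '__main__':
    for (proto, zipped), n in sorted(sizes(words).items()):
        print('protocol %d, gzip=%s: %d bytes' % (proto, zipped, n))
```

Running it on your own data is the most reliable way to decide whether the choice of protocol matters once compression is applied.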

Whatever protocol you choose to save your data, you don't need to worry about it when you're reloading the data. The protocol is recorded in the file together with the data, so cPickle.load can work out everything it needs by itself. Just pass it a file or file-like object with read and readline methods, and each call to cPickle.load returns the next object that was pickled to the file, raising EOFError once the file is exhausted. In this recipe, we wrap a generator around cPickle.load, so you can simply loop over all recovered objects with a for statement or, depending on what you need, use a call such as list(load('somefile.gz')) to get a list with all recovered objects as its items.
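The point about not needing to know the protocol at load time can be illustrated with a small in-memory experiment. This is a sketch for Python 3 (pickle rather than cPickle) that leaves out gzip to keep the focus on protocol detection; the io.BytesIO buffer and the sample values are assumptions chosen for illustration.

```python
import io
import pickle

buf = io.BytesIO()

# Write three pickles back to back, each with a different protocol.
pickle.dump('first', buf, protocol=0)
pickle.dump(['second'], buf, protocol=1)
pickle.dump({'third': 3}, buf, protocol=2)

buf.seek(0)
recovered = []
while True:
    try:
        # No protocol argument here: each pickle stream is self-describing.
        recovered.append(pickle.load(buf))
    except EOFError:
        # No more pickles in the buffer.
        break
```

After the loop, recovered holds the three objects in the order they were written, even though each was stored with a different protocol.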

See Also

Modules gzip and cPickle in the Library Reference.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
