2.4. Step 2: Storing Records Persistently

So far, we've settled on a dictionary-based representation for our database of records, and we've reviewed some Python data structure concepts along the way. As mentioned, though, the objects we've seen so far are temporary: they live in memory and they go away as soon as we exit Python or the Python program that created them. To make our people persistent, they need to be stored in a file of some sort.

2.4.1. Using Formatted Files

One way to keep our data around between program runs is to write all the data out to a simple text file, in a formatted way. Provided the saving and loading tools agree on the format selected, we're free to use any custom scheme we like.

2.4.1.1. Test data script

So that we don't have to keep working interactively, let's first write a script that initializes the data we are going to store (if you've done any Python work in the past, you know that the interactive prompt tends to become tedious once you leave the realm of simple one-liners). Example 2-1 creates the sort of records and database dictionary we've been working with so far, but because it is a module, we can import it repeatedly without having to retype the code each time. In a sense, this module is a database itself, but its program code format doesn't support automatic or end-user updates as is.

Other Uses for Dictionaries

Besides allowing us to associate meaningful labels with data rather than numeric positions, dictionaries are often more flexible than lists, especially when there isn't a fixed size to our problem. For instance, suppose you need to sum up columns of data stored in a text file where the number of columns is not known or fixed:

 >>> print open('data.txt').read( )
 001.1 002.2 003.3
 010.1 020.2 030.3 040.4
 100.1 200.2 300.3

Here, we cannot preallocate a fixed-length list of sums because the number of columns may vary. Splitting on whitespace extracts the columns, and float converts to numbers, but a fixed-size list won't easily accommodate a set of sums (at least, not without extra code to manage its size). Dictionaries are more convenient here because we can use column positions as keys instead of using absolute offsets. Most of this code uses tools added to Python in the last five years; see Chapter 4 for more on file iterators, Chapter 21 for text processing and alternative summers, and the library manual for the 2.3 enumerate and 2.4 sorted functions this code uses:

 >>> sums = {}
 >>> for line in open('data.txt'):
         cols = [float(col) for col in line.split( )]
         for pos, val in enumerate(cols):
             sums[pos] = sums.get(pos, 0.0) + val

 >>> for key in sorted(sums):
         print key, '=', sums[key]

 0 = 111.3
 1 = 222.6
 2 = 333.9
 3 = 40.4

 >>> sums
 {0: 111.3, 1: 222.59999999999999, 2: 333.90000000000003, 3: 40.399999999999999}

Dictionaries are often also a handy way to represent matrixes, especially when they are mostly empty. The following two-entry dictionary, for example, suffices to represent a potentially very large three-dimensional matrix containing two nonempty values: the keys are coordinates and their values are data at the coordinates. You can use a similar structure to index people by their birthdays (use month, day, and year for the key), servers by their Internet Protocol (IP) numbers, and so on.

 >>> D = {}
 >>> D[(2, 4, 6)] = 43            # 43 at position (2, 4, 6)
 >>> D[(5, 6, 7)] = 46
 >>> X, Y, Z = (5, 6, 7)
 >>> D.get((X, Y, Z), 'Missing')
 46
 >>> D.get((0, Y, Z), 'Missing')
 'Missing'
 >>> D
 {(2, 4, 6): 43, (5, 6, 7): 46}
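The birthday index mentioned above might be sketched along the following lines. The names and dates here are made up for illustration, who_was_born is a hypothetical helper, and the code assumes Python 3 (where print is a function) rather than the Python 2 syntax used in this book's listings.

```python
# Hypothetical sample data: index people by (month, day, year) tuples.
birthdays = {}
birthdays[(3, 14, 1970)] = 'Bob Smith'
birthdays[(7, 4, 1965)] = 'Sue Jones'

def who_was_born(month, day, year):
    "return the name stored for a date, or 'Missing' if there is none"
    return birthdays.get((month, day, year), 'Missing')

print(who_was_born(7, 4, 1965))     # a date that was stored
print(who_was_born(1, 1, 2000))     # a date that was never stored
```

As with the matrix, only the dates that actually occur take up space; every other date simply falls through to the default.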

Example 2-1. PP3E\Preview\initdata.py

 # initialize data to be stored in files, pickles, shelves

 # records
 bob = {'name': 'Bob Smith', 'age': 42, 'pay': 30000, 'job': 'dev'}
 sue = {'name': 'Sue Jones', 'age': 45, 'pay': 40000, 'job': 'mus'}
 tom = {'name': 'Tom',       'age': 50, 'pay': 0,     'job': None}

 # database
 db = {}
 db['bob'] = bob
 db['sue'] = sue
 db['tom'] = tom

 if __name__ == '__main__':       # when run as a script
     for key in db:
         print key, '=>\n  ', db[key]

As usual, the __name__ test at the bottom of Example 2-1 is true only when this file is run, not when it is imported. When run as a top-level script (e.g., from a command line, via an icon click, or within the IDLE GUI), the file's self-test code under this test dumps the database's contents to the standard output stream (remember, that's what print statements do by default).

Here is the script in action being run from a system command line on Windows. Type the following command in a Command Prompt window after a cd to the directory where the file is stored, and use a similar console window on other types of computers:

 ...\PP3E\Preview> python initdata.py
 bob =>
    {'job': 'dev', 'pay': 30000, 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'job': 'mus', 'pay': 40000, 'age': 45, 'name': 'Sue Jones'}
 tom =>
    {'job': None, 'pay': 0, 'age': 50, 'name': 'Tom'}

Now that we've started running script files, here are a few quick startup hints:

On some platforms, you may need to type the full directory path to the Python program on your machine, and on recent Windows systems you don't need python on the command line at all (just type the file's name to run it).
You can also run this file inside Python's standard IDLE GUI (open the file and use the Run menu in the text edit window), and in similar ways from any of the available third-party Python IDEs (e.g., Komodo, Eclipse, and the Wing IDE).
If you click the program's file icon to launch it on Windows, be sure to add a raw_input( ) call to the bottom of the script to keep the output window up. On other systems, icon clicks may require a #! line at the top and executable permission via a chmod command.

I'll assume here that you're able to run Python code one way or another. Again, if you're stuck, see other books such as Learning Python for the full story on launching Python programs.

2.4.1.2. Data format script

Now, all we have to do is store all of this in-memory data on a file. There are a variety of ways to accomplish this; one of the most basic is to write one piece of data at a time, with separators between each that we can use to break the data apart when we reload. Example 2-2 shows one way to code this idea.

Example 2-2. PP3E\Preview\make_db_file.py

 ####################################################################
 # save in-memory database object to a file with custom formatting;
 # assume 'endrec.', 'enddb.', and '=>' are not used in the data;
 # assume db is dict of dict;  warning: eval can be dangerous - it
 # runs strings as code;  could also eval( ) record dict all at once
 ####################################################################

 dbfilename = 'people-file'
 ENDDB  = 'enddb.'
 ENDREC = 'endrec.'
 RECSEP = '=>'

 def storeDbase(db, dbfilename=dbfilename):
     "formatted dump of database to flat file"
     dbfile = open(dbfilename, 'w')
     for key in db:
         print >> dbfile, key
         for (name, value) in db[key].items( ):
             print >> dbfile, name + RECSEP + repr(value)
         print >> dbfile, ENDREC
     print >> dbfile, ENDDB
     dbfile.close( )

 def loadDbase(dbfilename=dbfilename):
     "parse data to reconstruct database"
     dbfile = open(dbfilename)
     import sys
     sys.stdin = dbfile
     db = {}
     key = raw_input( )
     while key != ENDDB:
         rec = {}
         field = raw_input( )
         while field != ENDREC:
             name, value = field.split(RECSEP)
             rec[name] = eval(value)
             field = raw_input( )
         db[key] = rec
         key = raw_input( )
     return db

 if __name__ == '__main__':
     from initdata import db
     storeDbase(db)

This is a somewhat complex program, partly because it has both saving and loading logic and partly because it does its job the hard way; as we'll see in a moment, there are better ways to get objects into files than by manually formatting and parsing them. For simple tasks, though, this does work; running Example 2-2 as a script writes the database out to a flat file. It has no printed output, but we can inspect the database file interactively after this script is run, either within IDLE or from a console window where you're running these examples (as is, the database file shows up in the current working directory):

 ...\PP3E\Preview> python make_db_file.py
 ...\PP3E\Preview> python
 >>> for line in open('people-file'):
 ...     print line,
 ...
 bob
 job=>'dev'
 pay=>30000
 age=>42
 name=>'Bob Smith'
 endrec.
 sue
 job=>'mus'
 pay=>40000
 age=>45
 name=>'Sue Jones'
 endrec.
 tom
 job=>None
 pay=>0
 age=>50
 name=>'Tom'
 endrec.
 enddb.

This file is simply our database's content with added formatting. Its data originates from the test data initialization module we wrote in Example 2-1 because that is the module from which Example 2-2's self-test code imports its data. In practice, Example 2-2 itself could be imported and used to store a variety of databases and files.
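For comparison, here is a rough sketch of the same file format that reads lines straight from the file object instead of redirecting sys.stdin. It is not the book's listing: it assumes Python 3, the function names store_dbase and load_dbase are my own, and the file path is a scratch location chosen for the demonstration.

```python
import os
import tempfile

ENDDB, ENDREC, RECSEP = 'enddb.', 'endrec.', '=>'   # markers from Example 2-2

def store_dbase(db, filename):
    "write each key, its name=>repr(value) fields, and the end markers"
    with open(filename, 'w') as dbfile:
        for key, rec in db.items():
            dbfile.write(key + '\n')
            for name, value in rec.items():
                dbfile.write(name + RECSEP + repr(value) + '\n')
            dbfile.write(ENDREC + '\n')
        dbfile.write(ENDDB + '\n')

def load_dbase(filename):
    "read lines straight from the file instead of redirecting sys.stdin"
    db = {}
    with open(filename) as dbfile:
        lines = (line.rstrip('\n') for line in dbfile)   # one shared iterator
        for key in lines:
            if key == ENDDB:
                break
            rec = {}
            for field in lines:                  # continues where the outer loop left off
                if field == ENDREC:
                    break
                name, value = field.split(RECSEP)   # assumes no '=>' in the data
                rec[name] = eval(value)             # same caveat: runs strings as code
            db[key] = rec
    return db

# Hypothetical scratch file and sample data for a round trip.
path = os.path.join(tempfile.mkdtemp(), 'people-file')
sample = {'bob': {'name': 'Bob Smith', 'pay': 30000, 'job': None}}
store_dbase(sample, path)
print(load_dbase(path) == sample)
```

Reading from the file object directly leaves sys.stdin untouched, which matters if the same program later needs to read console input.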

Notice how data to be written is formatted with the as-code repr( ) call and is re-created with the eval( ) call, which treats strings as Python code. That allows us to store and re-create things like the None object, but it is potentially unsafe; you shouldn't use eval( ) if you can't be sure that the database won't contain malicious code. For our purposes, however, there's probably no cause for alarm.
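If the stored values are limited to simple literals (strings, numbers, None, and containers of these), one hedged alternative to eval is the standard library's ast.literal_eval, which is available in Python versions newer than the one used for this book's listings. The sketch below assumes Python 3; parse_field is a hypothetical helper, not part of Example 2-2.

```python
import ast

RECSEP = '=>'   # same separator as Example 2-2

def parse_field(field):
    "split one 'name=>value' line and rebuild the value from its repr text"
    name, text = field.split(RECSEP, 1)     # split only at the first separator
    return name, ast.literal_eval(text)     # accepts literals only, not arbitrary code

print(parse_field("name=>'Bob Smith'"))
print(parse_field("job=>None"))

try:
    parse_field("pay=>__import__('os').getcwd()")    # a function call, not a literal
except ValueError as exc:
    print('rejected: ' + type(exc).__name__)
```

Splitting at only the first separator also lets a stored value contain the characters =>, though field names still cannot.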

2.4.1.3. Utility scripts

To test further, Example 2-3 reloads the database from a file each time it is run.

Example 2-3. PP3E\Preview\dump_db_file.py

 from make_db_file import loadDbase
 db = loadDbase( )
 for key in db:
     print key, '=>\n  ', db[key]
 print db['sue']['name']

And Example 2-4 makes changes by loading, updating, and storing again.

Example 2-4. PP3E\Preview\update_db_file.py

 from make_db_file import loadDbase, storeDbase
 db = loadDbase( )
 db['sue']['pay'] *= 1.10
 db['tom']['name'] = 'Tom Tom'
 storeDbase(db)

Here are the dump script and the update script in action at a system command line; both Sue's pay and Tom's name change between script runs. The main point to notice is that the data stays around after each script exits: our objects have become persistent simply because they are mapped to and from text files.

 ...\PP3E\Preview> python dump_db_file.py
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 40000, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 tom =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
 Sue Jones

 ...\PP3E\Preview> python update_db_file.py

 ...\PP3E\Preview> python dump_db_file.py
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 44000.0, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 tom =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom Tom'}
 Sue Jones

As is, we'll have to write Python code in scripts or at the interactive command line for each specific database update we need to perform (later in this chapter, we'll do better by providing generalized console, GUI, and web-based interfaces instead). But at a basic level, our text file is a database of records. As we'll learn in the next section, though, it turns out that we've just done a lot of pointless work.

2.4.2. Using Pickle Files

The formatted file scheme of the prior section works, but it has some major limitations. For one thing, it has to read the entire database from the file just to fetch one record, and it must write the entire database back to the file after each set of updates. For another, it assumes that the data separators it writes out to the file will not appear in the data to be stored: if the characters => happen to appear in the data, for example, the scheme will fail. Perhaps worse, the formatter is already complex without being general: it is tied to the dictionary-of-dictionaries structure, and it can't handle anything else without being greatly expanded. It would be nice if a general tool existed that could translate any sort of Python data to a format that could be saved on a file in a single step.

That is exactly what the Python pickle module is designed to do. The pickle module translates an in-memory Python object into a serialized byte stream, that is, a string of bytes that can be written to any file-like object. The pickle module also knows how to reconstruct the original object in memory, given the serialized byte stream: we get back an object with the same value and structure as the original. In a sense, the pickle module replaces proprietary data formats; its serialized format is general and efficient enough for any program. With pickle, there is no need to manually translate objects to data when storing them persistently.
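To make the round trip concrete, here is a minimal sketch that serializes a record to a byte string in memory and rebuilds it, using the module's dumps and loads calls. It assumes Python 3, and the record is just sample data in the style of Example 2-1.

```python
import pickle

record = {'name': 'Bob Smith', 'age': 42, 'pay': 30000, 'job': 'dev'}

data = pickle.dumps(record)        # object -> serialized byte string
copy = pickle.loads(data)          # byte string -> newly built object

print(copy == record)              # compares value and structure
print(copy is record)              # compares identity in memory
```

The rebuilt object compares equal to the original but is a separate object in memory, which is what we want from a persistence tool: the value survives, not the particular memory address.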

The net effect is that pickling allows us to store and fetch native Python objects as they are and in a single step; we use normal Python syntax to process pickled records. Despite what it does, the pickle module is remarkably easy to use. Example 2-5 shows how to store our records in a flat file, using pickle.

Example 2-5. PP3E\Preview\make_db_pickle.py

 from initdata import db
 import pickle
 dbfile = open('people-pickle', 'w')
 pickle.dump(db, dbfile)
 dbfile.close( )

When run, this script stores the entire database (the dictionary of dictionaries defined in Example 2-1) to a flat file named people-pickle in the current working directory. The pickle module handles the work of converting the object to a string. Example 2-6 shows how to access the pickled database after it has been created; we simply open the file and pass its content back to pickle to remake the object from its serialized string.

Example 2-6. PP3E\Preview\dump_db_pickle.py

 import pickle
 dbfile = open('people-pickle')
 db = pickle.load(dbfile)
 for key in db:
     print key, '=>\n  ', db[key]
 print db['sue']['name']

Here are these two scripts at work, at the system command line again; naturally, they can also be run in IDLE, and you can open and inspect the pickle file by running the same sort of code interactively as well:

 ...\PP3E\Preview> python make_db_pickle.py
 ...\PP3E\Preview> python dump_db_pickle.py
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 40000, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 tom =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
 Sue Jones

Updating with a pickle file is similar to a manually formatted file, except that Python is doing all of the formatting work for us. Example 2-7 shows how.

Example 2-7. PP3E\Preview\update_db_pickle.py

 import pickle
 dbfile = open('people-pickle')
 db = pickle.load(dbfile)
 dbfile.close( )

 db['sue']['pay'] *= 1.10
 db['tom']['name'] = 'Tom Tom'

 dbfile = open('people-pickle', 'w')
 pickle.dump(db, dbfile)
 dbfile.close( )

Notice how the entire database is written back to the file after the records are changed in memory, just as for the manually formatted approach; this might become slow for very large databases, but we'll ignore this for the moment. Here are our update and dump scripts in action; as in the prior section, Sue's pay and Tom's name change between scripts because they are written back to a file (this time, a pickle file):

 ...\PP3E\Preview> python update_db_pickle.py
 ...\PP3E\Preview> python dump_db_pickle.py
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 44000.0, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 tom =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom Tom'}
 Sue Jones

As we'll learn in Chapter 19, the Python pickling system supports nearly arbitrary object types: lists, dictionaries, class instances, nested structures, and more. There, we'll also explore the faster cPickle module, as well as the pickler's binary storage protocols, which require files to be opened in binary mode; the default text protocol used in the preceding examples is slightly slower, but it generates readable ASCII data. As we'll see later in this chapter, the pickler also underlies shelves and ZODB databases, and pickled class instances provide both data and behavior for objects stored.
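As a preview of the binary-mode requirement, the following sketch writes and reads a pickle file with a binary protocol. It assumes Python 3, where pickle files are always opened in binary mode whatever the protocol; the file path is a scratch location invented for the demonstration, and the database is a cut-down sample.

```python
import os
import pickle
import tempfile

db = {'bob': {'name': 'Bob Smith', 'pay': 30000}}

# Hypothetical scratch location so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'people-pickle-bin')

with open(path, 'wb') as dbfile:                      # 'wb': binary write mode
    pickle.dump(db, dbfile, pickle.HIGHEST_PROTOCOL)  # newest binary protocol

with open(path, 'rb') as dbfile:                      # 'rb': binary read mode
    loaded = pickle.load(dbfile)                      # protocol is detected on load

print(loaded == db)
```

The loader does not need to be told which protocol was used when the file was written; only the file modes have to agree with the data being binary.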

In fact, pickling is more general than these examples may imply. Because they accept any object that provides an interface compatible with files, pickling and unpickling may be used to transfer native Python objects to a variety of media. Using a wrapped network socket, for instance, allows us to ship pickled Python objects across a network and provides an alternative to larger protocols such as SOAP and XML-RPC.
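To illustrate the file-like interface idea without setting up a network connection, this sketch pickles to an in-memory io.BytesIO stream, which stands in here for a wrapped socket. It assumes Python 3, and the stand-in is my own choice for illustration rather than something the examples in this chapter use.

```python
import io
import pickle

record = {'name': 'Sue Jones', 'age': 45, 'pay': 40000, 'job': 'mus'}

stream = io.BytesIO()              # in-memory object with a file-like interface
pickle.dump(record, stream)        # the pickler only needs a write method

stream.seek(0)                     # rewind, as if a receiver started reading
received = pickle.load(stream)     # the unpickler needs read and readline

print(received['name'])
```

Keep in mind that unpickling can run code embedded in the data, so shipping pickles over a network is appropriate only between parties that trust each other.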

2.4.3. Using Per-Record Pickle Files

As mentioned earlier, one potential disadvantage of this section's examples so far is that they may become slow for very large databases: because the entire database must be loaded and rewritten to update a single record, this approach can waste time. We could improve on this by storing each record in the database in a separate flat file. The next three examples show one way to do so; Example 2-8 stores each record in its own flat file, using each record's original key as its filename with a .pkl extension appended (it creates the files bob.pkl, sue.pkl, and tom.pkl in the current working directory).

Example 2-8. PP3E\Preview\make_db_pickle_recs.py

 from initdata import bob, sue, tom
 import pickle
 for (key, record) in [('bob', bob), ('tom', tom), ('sue', sue)]:
     recfile = open(key+'.pkl', 'w')
     pickle.dump(record, recfile)
     recfile.close( )

Next, Example 2-9 dumps the entire database by using the standard library's glob module to do filename expansion and thus collect all the files in this directory with a .pkl extension. To load a single record, we open its file and deserialize with pickle; we must load only one record file, though, not the entire database, to fetch one record.

Example 2-9. PP3E\Preview\dump_db_pickle_recs.py

 import pickle, glob
 for filename in glob.glob('*.pkl'):         # for 'bob','sue','tom'
     recfile = open(filename)
     record  = pickle.load(recfile)
     print filename, '=>\n  ', record

 suefile = open('sue.pkl')
 print pickle.load(suefile)['name']          # fetch sue's name

Finally, Example 2-10 updates the database by fetching a record from its file, changing it in memory, and then writing it back to its pickle file. This time, we have to fetch and rewrite only a single record file, not the full database, to update.

Example 2-10. PP3E\Preview\update_db_pickle_recs.py

 import pickle
 suefile = open('sue.pkl')
 sue = pickle.load(suefile)
 suefile.close( )

 sue['pay'] *= 1.10
 suefile = open('sue.pkl', 'w')
 pickle.dump(sue, suefile)
 suefile.close( )

Here are our file-per-record scripts in action; the results are about the same as in the prior section, but database keys become real filenames now. In a sense, the filesystem becomes our top-level dictionary: filenames provide direct access to each record.

 ...\PP3E\Preview> python make_db_pickle_recs.py
 ...\PP3E\Preview> python dump_db_pickle_recs.py
 bob.pkl =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 tom.pkl =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
 sue.pkl =>
    {'pay': 40000, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 Sue Jones

 ...\PP3E\Preview> python update_db_pickle_recs.py
 ...\PP3E\Preview> python dump_db_pickle_recs.py
 bob.pkl =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 tom.pkl =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
 sue.pkl =>
    {'pay': 44000.0, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 Sue Jones
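The file-per-record idea can be wrapped in a pair of small helpers so that the filesystem really is used like a dictionary. This is only a sketch under a few assumptions: it targets Python 3, store_record and load_record are hypothetical names, and the records live in a scratch directory rather than the current working directory.

```python
import os
import pickle
import tempfile

def store_record(dirname, key, record):
    "pickle one record to its own file, named after its key"
    with open(os.path.join(dirname, key + '.pkl'), 'wb') as recfile:
        pickle.dump(record, recfile)

def load_record(dirname, key):
    "unpickle and return the single record stored under key"
    with open(os.path.join(dirname, key + '.pkl'), 'rb') as recfile:
        return pickle.load(recfile)

dbdir = tempfile.mkdtemp()          # hypothetical scratch directory
store_record(dbdir, 'sue', {'name': 'Sue Jones', 'pay': 40000})

sue = load_record(dbdir, 'sue')     # only sue.pkl is read
sue['pay'] *= 1.10
store_record(dbdir, 'sue', sue)     # only sue.pkl is rewritten

print(sorted(os.listdir(dbdir)))
```

Because each call touches exactly one file, the cost of an update no longer grows with the number of records in the database.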

2.4.4. Using Shelves

Pickling objects to files, as shown in the preceding section, is an optimal scheme in many applications. In fact, some applications use pickling of Python objects across network sockets as a simpler alternative to network protocols such as the SOAP and XML-RPC web services architectures (also supported by Python, but much heavier than pickle).

Moreover, assuming your filesystem can handle as many files as you'll need, pickling one record per file also obviates the need to load and store the entire database for each update. If we really want keyed access to records, though, the Python standard library offers an even higher-level tool: shelves.

Shelves automatically pickle objects to and from a keyed-access filesystem. They behave much like dictionaries that must be opened, and they persist after each program exits. Because they give us key-based access to stored records, there is no need to manually manage one flat file per record; the shelve system automatically splits up stored records and fetches and updates only those records that are accessed and changed. In this way, shelves provide utility similar to per-record pickle files, but are usually easier to code.

The shelve interface is just as simple as pickle: it is identical to dictionaries, with extra open and close calls. In fact, to your code, a shelve really does appear to be a persistent dictionary of persistent objects; Python does all the work of mapping its content to and from a file. For instance, Example 2-11 shows how to store our in-memory dictionary objects in a shelve for permanent keeping.

Example 2-11. make_db_shelve.py

 from initdata import bob, sue
 import shelve
 db = shelve.open('people-shelve')
 db['bob'] = bob
 db['sue'] = sue
 db.close( )

This script creates one or more files in the current directory with the name people-shelve as a prefix; you shouldn't delete these files (they are your database!), and you should be sure to use the same name in other scripts that access the shelve. Example 2-12, for instance, reopens the shelve and indexes it by key to fetch its stored records.

Example 2-12. dump_db_shelve.py

 import shelve
 db = shelve.open('people-shelve')
 for key in db:
     print key, '=>\n  ', db[key]
 print db['sue']['name']
 db.close( )

We still have a dictionary of dictionaries here, but the top-level dictionary is really a shelve mapped onto a file. Much happens when you access a shelve's keys: it uses pickle to serialize and deserialize, and it interfaces with a keyed-access filesystem. From your perspective, though, it's just a persistent dictionary. Example 2-13 shows how to code shelve updates.

Example 2-13. update_db_shelve.py

 from initdata import tom
 import shelve
 db = shelve.open('people-shelve')
 sue = db['sue']                       # fetch sue
 sue['pay'] *= 1.50
 db['sue'] = sue                       # update sue
 db['tom'] = tom                       # add a new record
 db.close( )

Notice how this code fetches sue by key, updates in memory, and then reassigns to the key to update the shelve; this is a requirement of shelves, but not always of more advanced shelve-like systems such as ZODB (covered in Chapter 19). Also note how shelve files are explicitly closed; some underlying keyed-access filesystems may require this in order to flush output buffers after changes.
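The reason for the fetch, change, and reassign pattern is easier to see in a small experiment. The sketch below assumes Python 3 and uses a scratch shelve file and a single sample record; the variable names are my own.

```python
import os
import shelve
import tempfile

# Hypothetical scratch location for the shelve's files.
name = os.path.join(tempfile.mkdtemp(), 'people-shelve')

db = shelve.open(name)
db['sue'] = {'name': 'Sue Jones', 'pay': 40000}

db['sue']['pay'] = 0               # changes a temporary copy only; not saved
unchanged = db['sue']['pay']       # a fresh fetch still sees the stored value

sue = db['sue']                    # fetch
sue['pay'] = 60000                 # update in memory
db['sue'] = sue                    # reassign to the key to store the change
changed = db['sue']['pay']
db.close()

print(unchanged)
print(changed)
```

Each fetch unpickles a new copy of the record, so a change made to that copy is lost unless it is assigned back to its key. The shelve.open call also accepts a writeback=True argument, which caches fetched entries and writes them back when the shelve is synced or closed, at some cost in memory.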

Finally, here are the shelve-based scripts on the job, creating, changing, and fetching records. The records are still dictionaries, but the database is now a dictionary-like shelve which automatically retains its state in a file between program runs:

 ...\PP3E\Preview> python make_db_shelve.py
 ...\PP3E\Preview> python dump_db_shelve.py
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 40000, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 Sue Jones

 ...\PP3E\Preview> python update_db_shelve.py
 ...\PP3E\Preview> python dump_db_shelve.py
 tom =>
    {'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
 bob =>
    {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
 sue =>
    {'pay': 60000.0, 'job': 'mus', 'age': 45, 'name': 'Sue Jones'}
 Sue Jones

When we ran the update and dump scripts here, we added a new record for key tom and increased Sue's pay field by 50 percent. These changes are permanent because the record dictionaries are mapped to an external file by shelve. (In fact, this is a particularly good script for Sue; it is something she might consider scheduling to run often, using a cron job on Unix, or a Startup folder or msconfig entry on Windows.)