Recipe 7.8. Using the Berkeley DB Database

Credit: Farhad Fouladi

Problem

You want to persist some data, exploiting the simplicity and good performance of the Berkeley DB database library.

Solution

If you have previously installed Berkeley DB on your machine, the Python Standard Library comes with package bsddb (and optionally bsddb3, to access Berkeley DB release 3.2 databases) to interface your Python code with Berkeley DB. To get either bsddb or, lacking it, bsddb3, use a try/except on import:

try:
    from bsddb import db                  # first try release 4
except ImportError:
    from bsddb3 import db                 # not there, try release 3 instead
print db.DB_VERSION_STRING
# emits, e.g.:
# Sleepycat Software: Berkeley DB 4.1.25: (December 19, 2002)


To create a database, instantiate a db.DB object, then call its method open with appropriate parameters, such as:

adb = db.DB( )
adb.open('db_filename', dbtype=db.DB_HASH, flags=db.DB_CREATE)

db.DB_HASH is just one of several access methods you may choose when you create a database; a popular alternative is db.DB_BTREE, to use B+tree access (handy if you need to get records in sorted order). You may make an in-memory database, without an underlying file for persistence, by passing None instead of a filename as the first argument to the open method.

Once you have an open instance of db.DB, you can add records, each composed of two strings, key and data:

for i, w in enumerate('some words for example'.split( )):
    adb.put(w, str(i))

You can access records via a cursor on the database:

def irecords(curs):
    record = curs.first( )
    while record:
        yield record
        record = curs.next( )
for key, data in irecords(adb.cursor( )):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='1'
# key='for', data='2'
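The irecords generator relies on only two cursor methods, first and next, each returning a (key, data) pair, or None once the records are exhausted. As a minimal sketch of that control flow, the following uses ListCursor, a hypothetical pure-Python stand-in (not part of bsddb), so the loop can be tried without Berkeley DB installed:

```python
class ListCursor(object):
    """Hypothetical stand-in for a database cursor over (key, data) pairs."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = -1

    def _current(self):
        # Return the record at the current position, or None past the end.
        if 0 <= self._pos < len(self._records):
            return self._records[self._pos]
        return None

    def first(self):
        self._pos = 0
        return self._current()

    def next(self):
        self._pos += 1
        return self._current()


def irecords(curs):
    # Same loop as in the recipe: stop as soon as the cursor returns None.
    record = curs.first()
    while record:
        yield record
        record = curs.next()


pairs = [('some', '0'), ('words', '1'), ('for', '2'), ('example', '3')]
listed = list(irecords(ListCursor(pairs)))   # every pair, in cursor order
empty = list(irecords(ListCursor([])))       # an empty database yields nothing
```

Because the generator tests the record itself, an empty database produces no iterations at all rather than an error.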


When you're done, you close the database:

adb.close( )

At any future time, in the same or another Python program, you can reopen the database by giving just its filename as the argument to the open method of a newly created db.DB instance:

the_same_db = db.DB( )
the_same_db.open('db_filename')

and work on it again in the same ways:

the_same_db.put('skidoo', '23')          # add a record
the_same_db.put('words', 'sweet')        # replace a record
for key, data in irecords(the_same_db.cursor( )):
    print 'key=%r, data=%r' % (key, data)
# emits (the order may vary):
# key='some', data='0'
# key='example', data='3'
# key='words', data='sweet'
# key='for', data='2'
# key='skidoo', data='23'


Again, remember to close the database when you're done:

the_same_db.close( )

Discussion

The Berkeley DB is a popular open source database. It does not support SQL, but it's simple to use, offers excellent performance, and gives you a lot of control over exactly what happens, if you care to exert it, through a huge array of options, flags, and methods. Berkeley DB is just as accessible from many other languages as from Python: for example, you can perform some changes or queries with a Python program, and others with a separate C program, on the same database file, using the same underlying open source library that you can freely download from Sleepycat.

The Python Standard Library shelve module can use the Berkeley DB as its underlying database engine, just as it uses cPickle for serialization. However, shelve does not let you take advantage of the ability to access a Berkeley DB database file from several different languages, exactly because the records are strings produced by pickle.dumps, and languages other than Python can't easily deal with them. Accessing the Berkeley DB directly with bsddb also gives you access to many advanced functionalities of the database engine that shelve simply doesn't support.

A Database, or pickle . . . or Both?

The use cases for pickle or marshal, and those for databases such as Berkeley DB or relational databases, are rather different, though they do overlap somewhat.

pickle (and marshal even more so) is essentially about serialization: you turn Python objects into BLOBs that you may transmit or store, and later receive or retrieve. Data thus serialized is meant to be reloaded into Python objects, basically only by Python applications. pickle has nothing to say about searching or selecting specific objects or parts of them.

Databases (Berkeley DB, relational DBs, and other kinds yet) are essentially about data: you save and retrieve groupings of elementary data (strings and numbers, mostly), with a lot of support for selecting and searching (a huge lot, for relational databases) and cross-language support. Databases have nothing to say about serializing Python objects into data, nor about deserializing Python objects back from data.

The two approaches, databases and serialization, can even be used together. You can serialize Python objects into strings of bytes with pickle, and store those bytes using a database, and vice versa at retrieval time. At a very elementary level, that's what the standard Python library shelve module does, for example, with pickle to serialize and deserialize and generally bsddb as the underlying simple database engine. So, don't think of the two approaches as being "in competition" with each other; rather, think of them as completing and complementing each other!
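To make the combination concrete, here is a minimal sketch that pickles an object and stores the resulting bytes in a simple key/data database. It is an illustration under assumptions, not the way shelve is actually implemented: it uses the pure-Python dumbdbm module (named dbm.dumb in Python 3) instead of bsddb so that it runs without Berkeley DB, and the helper names save_object and load_object and the temporary file path are hypothetical.

```python
import os
import pickle
import tempfile

try:
    import dbm.dumb as simple_db      # Python 3 name
except ImportError:
    import dumbdbm as simple_db       # Python 2 name


def save_object(database, key, obj):
    """Serialize obj with pickle and store the bytes under key."""
    database[key] = pickle.dumps(obj, 2)


def load_object(database, key):
    """Fetch the bytes stored under key and rebuild the object."""
    return pickle.loads(database[key])


# Hypothetical location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'objects_db')

database = simple_db.open(path, 'c')
save_object(database, 'point', {'x': 1, 'y': [2, 3]})
database.close()

# Reopen the file, as a later run or another program might.
database = simple_db.open(path, 'c')
restored = load_object(database, 'point')
database.close()
```

The database sees only opaque bytes here: it handles storage and lookup by key, while pickle alone knows how to turn those bytes back into a Python object.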


For example, creating a database with an access method of db.DB_HASH, as shown in the recipe, may give maximum performance, but, as you'll have noticed when listing all records with the generator irecords that is also presented in the recipe, hashing puts records in apparently random, unpredictable order. If you need to access records in sorted order, you can use an access method of db.DB_BTREE instead. Berkeley DB also supports more advanced functionality, such as transactions, which you can enable through direct access but not via anydbm or shelve.
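The ordering difference can be checked with a short sketch. It assumes that one of the two bindings named in the recipe is importable, and sets db to None otherwise so that the script still runs. It uses byte-string keys, which bsddb3 is assumed to require under Python 3, and an in-memory database (None as the filename). The helper keys_in_cursor_order is a hypothetical name, not part of bsddb.

```python
try:
    from bsddb import db
except ImportError:
    try:
        from bsddb3 import db
    except ImportError:
        db = None                     # no Berkeley DB binding installed


def keys_in_cursor_order(words, dbtype):
    """Store words in an in-memory database; return keys as a cursor sees them."""
    adb = db.DB()
    adb.open(None, dbtype=dbtype, flags=db.DB_CREATE)
    try:
        for i, w in enumerate(words):
            adb.put(w, str(i).encode('ascii'))
        keys = []
        curs = adb.cursor()
        record = curs.first()
        while record:
            keys.append(record[0])
            record = curs.next()
        curs.close()
        return keys
    finally:
        adb.close()


words = [b'some', b'words', b'for', b'example']
if db is not None:
    # With B+tree access, the cursor is expected to walk keys in sorted order.
    btree_keys = keys_in_cursor_order(words, db.DB_BTREE)
else:
    btree_keys = None
```

Passing db.DB_HASH as the second argument instead lets you compare the two orderings on the same data.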

For detailed documentation about all functionality of the Python Standard Library bsddb package, see http://pybsddb.sourceforge.net/bsddb3.html. For documentation, downloads, and more of the Berkeley DB itself, see http://www.sleepycat.com/.

See Also

Library Reference and Python in a Nutshell docs for modules anydbm, shelve, and bsddb; http://pybsddb.sourceforge.net/bsddb3.html for many more details about bsddb and bsddb3; http://www.sleepycat.com/ for downloads of, and very detailed documentation on, the Berkeley DB itself.