Recipe 2.8. Updating a Random-Access File

Credit: Luther Blissett

Problem

You want to read a binary record from somewhere inside a large file of fixed-length records, change some or all of the values of the record's fields, and write the record back.

Solution

Read the record, unpack it, perform whatever computations you need for the update, pack the fields back into the record, seek to the start of the record again, and write it back. Phew. Faster to code than to say:

import struct
format_string = '8l'                # e.g., say a record is 8 4-byte integers
thefile = open('somebinfile', 'r+b')
record_size = struct.calcsize(format_string)
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)
fields = list(struct.unpack(format_string, buffer))
# Perform computations, suitably modifying fields, then:
buffer = struct.pack(format_string, *fields)
thefile.seek(record_size * record_number)
thefile.write(buffer)
thefile.close()

Discussion

This approach works only on files (generally binary ones) whose records all have the same, fixed size; it doesn't work on normal text files, nor on files whose records vary in length. Furthermore, the size of each record must be the one defined by a struct format string, as shown in the recipe's code. A typical format string might be '8l', which specifies that each record is made up of eight signed long integers, each unpacked into a Python int. Be aware that a format string without a byte-order prefix uses the platform's native sizes and alignment, so 'l' is four bytes on some platforms and eight on others (for example, most 64-bit Unix-like systems); a prefix such as '=' or '<' (e.g., '=8l') selects the standard size of four bytes on every platform. In this case, the fields variable in the recipe is bound to a list of eight ints.

Note that struct.unpack returns a tuple. Because tuples are immutable, the computation would have to rebind the entire fields variable. A list is mutable, so each item can be rebound as needed; thus, for convenience, we explicitly ask for a list when we bind fields. Make sure, however, not to alter the length of the list. Here it must remain composed of exactly eight integers, or the struct.pack call will raise an exception when we call it with a format_string of '8l'.
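The round trip described above can be sketched without touching any file. This is a minimal illustration, not part of the recipe itself: it assumes the '=8l' variant of the format string (so that the record is 32 bytes on every platform), and the name pack_error is introduced here only to capture the exception for inspection.

```python
import struct

# '=8l' forces standard sizes: eight 4-byte signed integers, no padding.
# (Plain '8l' uses the platform's native C long, which may be 8 bytes.)
format_string = '=8l'
record_size = struct.calcsize(format_string)

buffer = struct.pack(format_string, *range(8))
fields = list(struct.unpack(format_string, buffer))  # tuple -> mutable list
fields[3] += 100                                     # rebind one item in place
new_buffer = struct.pack(format_string, *fields)

# Changing the number of fields breaks the round trip.
fields.append(99)
pack_error = None
try:
    struct.pack(format_string, *fields)
except struct.error as exc:
    pack_error = exc   # struct.pack refuses nine values for an '=8l' format
```

The packed buffer keeps the same length before and after the update, which is what makes it safe to write it back over the original record.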

To seek back to the start of the record, instead of using the record_size*record_number offset again, you may choose to do a relative seek:

thefile.seek(-record_size, 1)

The second argument to the seek method (1) tells the file object to seek relative to the current position (here, so many bytes back, because we used a negative number as the first argument). seek's default is to seek to an absolute offset within the file (i.e., from the start of the file). You can also explicitly request this default behavior by calling seek with a second argument of 0.
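A small sketch of the two seek styles follows. It uses an in-memory io.BytesIO object as a stand-in for a real binary file, and the '=8l' format (32-byte records) is an assumption made so the offsets are the same on every platform. The os module's SEEK_SET and SEEK_CUR constants are simply readable names for the values 0 and 1.

```python
import io
import os
import struct

format_string = '=8l'
record_size = struct.calcsize(format_string)

# An in-memory stand-in for a binary file holding three records.
thefile = io.BytesIO(b''.join(
    struct.pack(format_string, *([n] * 8)) for n in range(3)))

record_number = 1
thefile.seek(record_size * record_number)    # absolute seek (whence 0)
buffer = thefile.read(record_size)           # reading moves past the record
position_after_read = thefile.tell()

thefile.seek(-record_size, os.SEEK_CUR)      # same as seek(-record_size, 1)
position_after_relative_seek = thefile.tell()
```

After the relative seek, the file position is back at the start of the record that was just read, ready for the write.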

You don't need to open the file just before you do the first seek, nor do you need to close it right after the write. Once you have a file object that is correctly opened (i.e., for updating and as a binary rather than a text file), you can perform as many updates on the file as you want before closing the file again. These calls are shown here to emphasize the proper technique for opening a file for random-access updates and the importance of closing a file when you are done with it.
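As an illustration of several updates sharing one open file, here is a hedged sketch. The helper names update_record and bump_first_field are hypothetical, the sample file is created in a temporary directory purely for the demonstration, and the '=8l' format is again an assumption that fixes the record size at 32 bytes.

```python
import os
import struct
import tempfile

format_string = '=8l'
record_size = struct.calcsize(format_string)

def update_record(thefile, record_number, update):
    """Read one record, let update() modify its field list, write it back."""
    thefile.seek(record_size * record_number)
    fields = list(struct.unpack(format_string, thefile.read(record_size)))
    update(fields)
    thefile.seek(record_size * record_number)
    thefile.write(struct.pack(format_string, *fields))

def bump_first_field(fields):
    fields[0] += 1000

# Build a small sample file of five zero-filled records.
path = os.path.join(tempfile.mkdtemp(), 'somebinfile')
with open(path, 'wb') as f:
    f.write(bytes(record_size * 5))

# One open, several updates, one close (handled by the with statement).
with open(path, 'r+b') as thefile:
    for record_number in (0, 2, 4):
        update_record(thefile, record_number, bump_first_field)

# Read back the first field of every record to check the result.
with open(path, 'rb') as f:
    first_fields = [struct.unpack(format_string, f.read(record_size))[0]
                    for _ in range(5)]
```

The with statement closes the file even if an update raises an exception, which serves the same purpose as the explicit close call in the recipe.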

The file needs to be opened for updating (i.e., to allow both reading and writing). That's what the 'r+b' argument to open means: open for reading and writing, but do not implicitly perform any transformations on the file's contents because the file is a binary one. (On Python 2, the 'b' part is unnecessary but still recommended for clarity on Unix and Unix-like systems, and absolutely crucial on other platforms, such as Windows. On Python 3, it is required on every platform, because a file opened in text mode yields str objects rather than the bytes that struct needs.) If you're creating the binary file from scratch, but you still want to be able to go back, reread, and update some records without closing and reopening the file, you can use a second argument of 'w+b' instead. However, I have never witnessed this strange combination of requirements; binary files are normally first created (by opening them with 'wb', writing data, and closing the file) and later reopened for updating with 'r+b'.
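The difference between the two update modes can be seen in a short sketch. The file name newbinfile and the temporary directory are placeholders for the demonstration, and the '=8l' format is an assumption that makes each record 32 bytes.

```python
import os
import struct
import tempfile

format_string = '=8l'
record_size = struct.calcsize(format_string)
path = os.path.join(tempfile.mkdtemp(), 'newbinfile')

# 'w+b' creates (or truncates) the file and also allows reading it back.
with open(path, 'w+b') as thefile:
    for n in range(3):
        thefile.write(struct.pack(format_string, *([n] * 8)))
    thefile.seek(record_size * 2)             # go back to the last record
    reread = struct.unpack(format_string, thefile.read(record_size))

# The more usual pattern: reopen later with 'r+b', which keeps the contents.
with open(path, 'r+b') as thefile:
    size_after_reopen = os.fstat(thefile.fileno()).st_size
```

Note that reopening an existing file with 'w+b' would truncate it to zero length, which is why 'r+b' is the mode to use for updating data that already exists.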

While this approach is normally useful only on a file whose records are all the same size, another, more advanced possibility exists: a separate "index file" that provides the offset and length of each record inside the "data file". Such indexed sequential access approaches aren't much in fashion any more, but they used to be very important. Nowadays, one mostly meets text files (of many kinds, more and more often XML ones), databases, and occasional binary files with fixed-length records. Still, if you do need to access an indexed sequential binary file, the code is quite similar to that shown in this recipe, except that you must obtain the record_size and the offset argument to pass to thefile.seek by reading them from the index file, rather than computing them yourself as shown in this recipe's Solution.
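To make the index-file idea concrete, here is one possible sketch. The index layout is purely an assumption for illustration: each index entry is taken to be an 8-byte offset followed by a 4-byte length, packed as '<QI'. Real indexed files define their own layouts. The helper names locate, read_record, and rewrite_record are hypothetical, and in-memory io.BytesIO objects stand in for the two files.

```python
import io
import struct

index_format = '<QI'     # hypothetical entry: 8-byte offset, 4-byte length
index_entry_size = struct.calcsize(index_format)

# In-memory stand-ins for the data file and its index file.
records = [b'alpha', b'be', b'gamma!!']
datafile = io.BytesIO()
indexfile = io.BytesIO()
for record in records:
    indexfile.write(struct.pack(index_format, datafile.tell(), len(record)))
    datafile.write(record)

def locate(indexfile, record_number):
    """Return (offset, record_size) for a record by reading the index."""
    indexfile.seek(index_entry_size * record_number)
    return struct.unpack(index_format, indexfile.read(index_entry_size))

def read_record(datafile, indexfile, record_number):
    offset, record_size = locate(indexfile, record_number)
    datafile.seek(offset)
    return datafile.read(record_size)

def rewrite_record(datafile, indexfile, record_number, new_bytes):
    offset, record_size = locate(indexfile, record_number)
    if len(new_bytes) != record_size:
        raise ValueError('an in-place update must keep the record length')
    datafile.seek(offset)
    datafile.write(new_bytes)

rewrite_record(datafile, indexfile, 1, b'BE')
```

The length check in rewrite_record reflects a constraint this sketch imposes on itself: writing a longer record in place would overwrite the start of the next one, so an update that changes a record's length needs a different strategy.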
