Recipe 1.13. Accessing Substrings

Credit: Alex Martelli

Problem

You want to access portions of a string. For example, you've read a fixed-width record and want to extract the record's fields.

Solution

Slicing is great, but it only does one field at a time:

afield = theline[3:8]

If you need to think in terms of field lengths, struct.unpack may be appropriate. For example:

import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
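
To make the format arithmetic concrete, here is a minimal sketch run against a made-up 30-byte record (the sample data and the name fmt are illustrative assumptions, not part of the recipe). It is written for Python 3, where struct.unpack needs a bytes object; under Python 2 a plain string would serve the same purpose.

```python
import struct

# Hypothetical record: a 5-byte field, 3 filler bytes, two 8-byte fields,
# then 6 trailing bytes that form "all the rest".
theline = b"AAAAAxxxBBBBBBBBCCCCCCCCrest!!"

baseformat = "5s 3x 8s 8s"
# struct.calcsize reports how many bytes the base format consumes (24 here)
numremain = len(theline) - struct.calcsize(baseformat)
# append an 's' field sized to swallow whatever is left over
fmt = "%s %ds" % (baseformat, numremain)

l, s1, s2, t = struct.unpack(fmt, theline)
print(l, s1, s2, t)
```

Note that the three filler bytes covered by 3x never appear in the result, and that a line shorter than the base format would make numremain negative and cause struct to raise an error.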

If you want to skip rather than get "all the rest", then just unpack the initial part of theline with the right length:

l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])

If you need to split at five-byte boundaries, you can easily code a list comprehension (LC) of slices:

fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]
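
As a quick illustration of what the list comprehension produces when the line length is not a multiple of five, here is a small sketch with a hypothetical 12-character line. It uses range so that it also runs under Python 3, where xrange no longer exists.

```python
theline = "abcdefghijkl"  # hypothetical sample line, 12 characters

# one slice per five-character step; the final slice is simply shorter
fivers = [theline[k:k+5] for k in range(0, len(theline), 5)]
print(fivers)  # ['abcde', 'fghij', 'kl']
```

Slicing never raises an error when the end index runs past the string, which is why the trailing short piece comes out without any special handling.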

Chopping a string into individual characters is of course easier:

chars = list(theline)

If you prefer to think of your data as being cut up at specific columns, slicing with LCs is generally handier:

cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]

The call to zip in this LC returns a list of pairs of the form (cuts[k], cuts[k+1]), except that the first pair is (0, cuts[0]), and the last one is (cuts[len(cuts)-1], None). In other words, each pair gives the right (i, j) for slicing between each cut and the next, except that the first one is for the slice before the first cut, and the last one is for the slice from the last cut to the end of the string. The rest of the LC just uses these pairs to cut up the appropriate slices of theline.
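
The pairing described above is easier to see when the pairs are printed. The following sketch uses a hypothetical 40-character line and an extra name, pairs, introduced only for illustration; list() is wrapped around zip because in Python 3 zip returns an iterator rather than a list.

```python
cuts = [8, 14, 20, 26, 30]

# pair each cut with the next one; 0 starts the first slice,
# and None as an end index means "up to the end of the string"
pairs = list(zip([0] + cuts, cuts + [None]))
print(pairs)
# [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]

theline = "0123456789" * 4  # hypothetical 40-character line
pieces = [theline[i:j] for i, j in pairs]
print(pieces)
# ['01234567', '890123', '456789', '012345', '6789', '0123456789']
```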

Discussion

This recipe was inspired by recipe 1.1 in the Perl Cookbook. Python's slicing takes the place of Perl's substr. Perl's built-in unpack and Python's struct.unpack are similar. Perl's is slightly richer, since it accepts a field length of * for the last field to mean all the rest. In Python, we have to compute and insert the exact length for either extraction or skipping. This isn't a major issue because such extraction tasks will usually be encapsulated into small functions. Memoizing, also known as automatic caching, may help with performance if the function is called repeatedly, since it allows you to avoid redoing the preparation of the format for the struct unpacking. See Recipe 18.5 for details about memoizing.

In a purely Python context, the point of this recipe is to remind you that struct.unpack is often viable, and sometimes preferable, as an alternative to string slicing (not quite as often as unpack versus substr in Perl, given the lack of a *-valued field length, but often enough to be worth keeping in mind).

Each of these snippets is, of course, best encapsulated in a function. Among other advantages, encapsulation ensures we don't have to work out the computation of the last field's length on each and every use. This function is the equivalent of the first snippet using struct.unpack in the "Solution":

def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

A design decision worth noticing (and, perhaps, worth criticizing) is that of having a lastfield=False optional parameter. This reflects the observation that, while we often want to skip the last, unknown-length subfield, sometimes we want to retain it instead. The use of lastfield in the expression lastfield and "s" or "x" (equivalent to C's ternary operator lastfield?"s":"x") saves an if/else, but it's unclear whether the saving is worth the obscurity. See Recipe 18.9 for more about simulating ternary operators in Python.
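
For readers on Python 2.5 or later, a conditional expression is one possible alternative to the and/or idiom. The sketch below is a variant of fields under that assumption, not the recipe's own code; the names lastcode and fmt are introduced here for illustration, and the sample record is made up. It passes bytes to struct.unpack so that it runs under Python 3.

```python
import struct

def fields(baseformat, theline, lastfield=False):
    # bytes left over after the fixed-size part of the record
    numremain = len(theline) - struct.calcsize(baseformat)
    # keep the remainder as a string field ('s') or skip it as padding ('x')
    lastcode = "s" if lastfield else "x"
    fmt = "%s %d%s" % (baseformat, numremain, lastcode)
    return struct.unpack(fmt, theline)

record = b"AAAAAxxxBBBBBBBBCCCCCCCCrest!!"  # hypothetical 30-byte record
print(fields("5s 3x 8s 8s", record))                  # three fields
print(fields("5s 3x 8s 8s", record, lastfield=True))  # four fields

# The and/or idiom is only safe because "s" is a true value:
# with a false middle operand it silently picks the wrong branch.
print(True and "" or "x")  # prints x, not the empty string
```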

If function fields is called in a loop, memoizing (caching) with a key that is the tuple (baseformat, len(theline), lastfield) may offer faster performance. Here's a version of fields with memoizing:

def fields(baseformat, theline, lastfield=False, _cache={}):
    # build the key and try getting the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # no format string was cached, build and cache it
        numremain = len(theline)-struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

The idea behind this memoizing is to perform the somewhat costly preparation of format only once for each set of arguments requiring that preparation, thereafter storing it in the _cache dictionary. Of course, like all optimizations, memoizing needs to be validated by measuring performance to check that each given optimization does actually speed things up. In this case, I measure an increase in speed of approximately 30% to 40% for the memoized version, meaning that the optimization is probably not worth the bother unless the function is part of a performance bottleneck for your program.
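
One way to run such a measurement yourself is with the standard timeit module. The sketch below is only an outline: the names fields_plain and fields_memo, the sample record, and the iteration count are all assumptions made for illustration, and the timings will vary with machine and Python version, so no figures are shown.

```python
import struct
import timeit

def fields_plain(baseformat, theline, lastfield=False):
    # rebuilds the format string on every call
    numremain = len(theline) - struct.calcsize(baseformat)
    fmt = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(fmt, theline)

def fields_memo(baseformat, theline, lastfield=False, _cache={}):
    # reuses a cached format string when the same key was seen before
    key = baseformat, len(theline), lastfield
    fmt = _cache.get(key)
    if fmt is None:
        numremain = len(theline) - struct.calcsize(baseformat)
        _cache[key] = fmt = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(fmt, theline)

line = b"AAAAAxxxBBBBBBBBCCCCCCCCrest!!"  # hypothetical sample record
base = "5s 3x 8s 8s"

# time the same call pattern against both versions
for func in (fields_plain, fields_memo):
    seconds = timeit.timeit(lambda: func(base, line), number=100000)
    print("%s: %.3f s" % (func.__name__, seconds))
```

Timing both versions on data that resembles your real records matters, because the cache only pays off when the same (format, length, lastfield) combinations recur.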

The function equivalent of the next LC snippet in the solution is:

def split_by(theline, n, lastfield=False):
    # cut up all the needed pieces
    pieces = [theline[k:k+n] for k in xrange(0, len(theline), n)]
    # drop the last piece if too short and not required
    # (the "pieces and" guard avoids an IndexError on an empty line)
    if not lastfield and pieces and len(pieces[-1]) < n:
        pieces.pop()
    return pieces

And for the last snippet:

def split_at(theline, cuts, lastfield=False):
    # cut up all the needed pieces
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
    # drop the last piece if not required
    if not lastfield:
        pieces.pop()
    return pieces

In both of these cases, a list comprehension doing slicing turns out to be slightly preferable to the use of struct.unpack.

A completely different approach is to use generators, such as:

def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    # note: with lastfield=False the trailing piece is always dropped,
    # even when it is exactly n characters long
    return split_at(the_line, xrange(n, len(the_line), n), lastfield)

Generator-based approaches are particularly appropriate when all you need to do with the sequence of resulting fields is loop over it, either explicitly, or implicitly by passing it to some "accumulator" callable such as ''.join. If you do need to materialize a list of the fields, and what you have available is a generator instead, you only need to call the built-in list on the generator, as in:

list_of_fields = list(split_by(the_line, 5))
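
Both styles of consumption can be sketched in a few lines. The example below repeats the generator version of split_at so that it stands alone, and applies it to a made-up log record; the record and its cut positions are assumptions chosen for illustration.

```python
def split_at(the_line, cuts, lastfield=False):
    # yield the slice between each cut and the previous one
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        # optionally yield whatever follows the final cut
        yield the_line[last:]

record = "2024-01-15ERRORdisk full"  # hypothetical fixed-width record

# explicit loop: fields are produced one at a time, with no list built
for field in split_at(record, [10, 15], lastfield=True):
    print(repr(field))

# implicit loop: the generator is consumed by an accumulator such as join
joined = "|".join(split_at(record, [10, 15], lastfield=True))
print(joined)  # 2024-01-15|ERROR|disk full
```

A generator can be consumed only once, so call split_at again (or materialize a list) if the fields are needed a second time.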
