Recipe4.6.Flattening a Nested Sequence


Recipe 4.6. Flattening a Nested Sequence

Credit: Luther Blissett, Holger Krekel, Hemanth Sethuram, ParzAspen Aspen

Problem

Some of the items in a sequence may in turn be sub-sequences, and so on, to arbitrary depth of "nesting". You need to loop over a "flattened" sequence, "expanding" each sub-sequence into a single, flat sequence of scalar items. (A scalar, or atom, is anything that is not a sequencei.e., a leaf, if you think of the nested sequence as a tree.)

Solution

We need to be able to tell which of the elements we're handling are "subsequences" to be "expanded" and which are "scalars" to be yielded as is. For generality, we can take an argument that's a predicate to tell us what items we are to expand. (A predicate is a function that we can call on any element and that returns a truth value: in this case, true if the element is a subsequence we are to expand, False otherwise.) By default, we can arbitrarily say that every list or tuple is to be "expanded", and nothing else. Then, a recursive generator offers the simplest solution:

def list_or_tuple(x):     return isinstance(x, (list, tuple)) def flatten(sequence, to_expand=list_or_tuple):     for item in sequence:         if to_expand(item):             for subitem in flatten(item, to_expand):                 yield subitem         else:             yield item

Discussion

Flattening a nested sequence, or, equivalently, "walking" sequentially over all the leaves of a "tree", is a common task in many kinds of applications. You start with a nested structure, with items grouped into sequences and subsequences, and, for some purposes, you don't care about the structure at all. You just want to deal with the items, one after the other. For example,

for x in flatten([1, 2, [3, [  ], 4, [5, 6], 7, [8,], ], 9]):     print x,

emits 1 2 3 4 5 6 7 8 9.

The only problem with this common task is that, in the general case, determining what is to be "expanded", and what is to be yielded as a scalar, is not as obvious as it might seem. So, I ducked that decision, delegating it to a callable predicate argument that the caller can pass to flatten, unless the caller accepts flatten's somewhat simplistic default behavior of expanding just tuples and lists.

In the same module as flatten, we should also supply another predicate that a caller might well want to usea predicate that will expand just about any iterable except strings (plain and Unicode). Strings are iterable, but almost invariably applications want to treat them as scalars, not as subsequences.

To identify whether an object is iterable, we just need to try calling the built-in iter on that object: the call raises TypeError if the object is not iterable. To identify whether an object is string-like, we simply check whether the object is an instance of basestring, since isinstance(obj, basestring) is true when obj is an instance of any subclass of basestringthat is, any string-like type. So, the alternative predicate is not hard to code:

def nonstring_iterable(obj):     try: iter(obj)     except TypeError: return False     else: return not isinstance(obj, basestring)

Now the caller may choose to call flatten(seq, nonstring_iterable) when the need is to expand any iterable that is not a string. It is surely better not to make the nonstring_iterable predicate the default for flatten, though: in a simple case, such as the example snippet we showed previously, flatten can be up to three times slower when the predicate is nonstring_iterable rather than list_or_tuple.

We can also write a nonrecursive version of generator flatten. Such a version lets you flatten nested sequences with nesting levels higher than Python's recursion limit, which normally allows no more than a few thousand levels of recursion depth. The main technique for recursion removal is to keep an explicit last-in, first-out (LIFO) stack, which, in this case, we can implement with a list of iterators:

def flatten(sequence, to_expand=list_or_tuple):     iterators = [ iter(sequence) ]     while iterators:         # loop on the currently most-nested (last) iterator         for item in iterators[-1]:             if to_expand(item):                 # subsequence found, go loop on iterator on subsequence                 iterators.append(iter(item))                 break             else:                 yield item         else:             # most-nested iterator exhausted, go back, loop on its parent             iterators.pop( )

The if clause of the if statement executes for any item we are to expandthat is, any subsequence on which we must loop; so in that clause, we push an iterator for the subsequence to the end of the stack, then execute a break to terminate the for, and go back to the outer while, which will in turn execute a new for statement on the iterator we just appended to the stack. The else clause of the if statement executes for any item we don't expand, and it just yields the item.

The else clause of the for statement executes if no break statement interrupts the for loopin other words, when the for loop runs to completion, exhausting the currently most-nested iterator. So, in that else clause, we remove the now-exhausted most-nested (last) iterator, and the outer while loop proceeds, either terminating if no iterators are left on the stack, or executing a new for statement that continues the loop on the iterator that's back at the top of the stackfrom wherever that iterator had last left off, intrinsically, because an iterator's job is exactly to remember iteration state.

The results of this nonrecursive implementation of flatten are identical to those of the simpler recursive version given in this recipe's Solution. If you think non-recursive implementations are faster than recursive ones, though, you may be disappointed: according to my measurements, the nonrecursive version is about 10% slower than the recursive one, across a range of cases.

See Also

Library Reference and Python in a Nutshell sections on sequence types and built-ins iter, isinstance, and basestring.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net