Recipe19.17.Duplicating an Iterator


Recipe 19.17. Duplicating an Iterator

Credit: Heiko Wundram, Raymond Hettinger

Problem

You have an iterator (or other iterable) object x, and need to iterate twice over x's sequence of values.

Solution

In Python 2.4, solving this problem is the job of function tee in the standard library module itertools:

import itertools x1, x2 = itertools.tee(x) # you can now iterate on x1 and x2 separately

In Python 2.3, you can code tee yourself:

import itertools def tee(iterable):     def yield_with_cache(next, cache={  }):         pop = cache.pop         for i in itertools.count( ):             try:                 yield pop(i)             except KeyError:                 cache[i] = next( )                 yield cache[i]     it = iter(iterable)     return yield_with_cache(it.next), yield_with_cache(it.next)

Discussion

The need to iterate repeatedly over the same sequence of values is a reasonably common one. If you know that the sequence comes from a list, or some other container that intrinsically lets you iterate over its items repeatedly, then you simply perform the iteration twice. However, sometimes your sequence may come from a generator, a sequential file (which might, e.g., wrap a stream of data coming from a network socketdata that you can read only once), or some other iterator that is not intrinsically re-iterable. Even then, in some cases, the best approach is the simplest onefirst save the data into a list in memory, then repeatedly iterate over that list:

saved_x = list(x) for item in saved_x: do_something(item) for item in saved_x: do_something_else(item)

The simple approach of first saving all data from the iterator into a list is not feasible for an infinite sequence x, and may not be optimal if x is very large and your separate iterations over it never get far out-of-step from each other. In these cases, the tee function shown in this recipe can help. For example, say that the items of x are either numbers or operators (the latter being represented as strings such as '+', '*', etc.). Whenever you encounter an operator, you must output the result of applying that operator to all numbers immediately preceding it (since the last operator). Using tee, you could code:

def is_operator(item):     return isinstance(item, str) def operate(x):     x1, x2 = tee(iter(x))     while True:         for item in x1:             if is_operator(item): break         else:             # we get here when there are no more operators in the input             # stream, thus the operate function is entirely done             return         if item == '+':             total = 0             for item in x2:                 if is_operator(item): break                 total += item             yield total         elif item == '*':             total = 1             for item in x2:                 if is_operator(item): break                 total *= item             yield total

This kind of "look-ahead" usage is pretty typical of many of the common use cases of tee. Even in this case, you might choose the alternative approach of accumulating items in a list:

def operate_with_auxiliary_list(x):     aux = [  ]     for item in x:         if is_operator(item):             if item == '+':                 yield sum(aux)             elif item == '*':                 total = 1                 for item in aux:                     total *= item                 yield total             aux = [  ]         else:             aux.append(item)

Having tee available lets you freely choose between these different styles of look-ahead processing.

Function itertools.tee as implemented in Python 2.4 is faster and more general than the pure Python version given in this recipe for version 2.3 usage. However, the pure Python version is quite instructive and deserves study for the sake of the techniques it demonstrates, even if you're lucky enough to be using Python 2.4 and therefore don't need to use this pure Python version of tee.

In the pure Python version of tee, the nested generator yield_with_cache makes use of the fact (which some consider a "wart" in Python but is actually quite useful) that the default values of arguments get computed just once, at the time the def statement executes. Thus, both calls to the nested generator in the return statement of tee implicitly share the same initially empty dict as the value of the cache argument.

itertools.count returns non-negative integers, 0 and up, one at a time. yield_with_cache uses each of these integers as a key into the cache dictionary. The call to pop(i) (the argument of the yield statement in the try clause) simultaneously returns and removes the entry corresponding to key i, if that entry was presentthat is, in this case, if the "other" instance of the generator had already reached that point in the iteration (and cached the item for our benefit). Otherwise, the except clause executes, computes the item (by calling the object bound to name next, which in this case is the next bound method of an iterator over the iterable object, which tee is duplicating), and caches the item (for the "other" instance's future benefit) before yielding it.

So, in practice, cache is being used as a FIFO queue. Indeed, were it not for the fact that we don't need a pure-Python tee in Python 2.4, we could code an equivalent implementation of it in Python 2.4 using the new type deque in standard library module collections:

import collections def tee_just_an_example(iterable):     def yield_with_cache(it, cache=collections.deque):         while True:             if cache:                 yield cache.popleft( )             else:                 result = it.next( )                 cache.append(result)                 yield result     it = iter(iterable)     return yield_with_cache(it), yield_with_cache(it)

This latest version is meant purely as an illustrative example, and therefore, it's simplified by not using any of the bound-method extraction idioms shown in the version in the "Solution" (which is intended for "production" use in Python 2.3).

Once you've called tee on an iterator, you should no longer use the original iterator anywhere else; otherwise, the iterator could advance without the knowledge of the tee-generated objects, and those objects would then "get out of sync" with the original. Be warned that tee requires auxiliary storage that is proportional to how much the two tee-generated objects get "apart" from each other in their separate iterations. In general, if one iterator is going to walk over most or all of the data from the original before the "other" one starts advancing, you should consider using list instead of tee. Both of these caveats apply to the itertools.tee function of Python 2.4 just as well as they apply to the pure Python versions of tee presented in this recipe. One more caveat: again both for the versions in this recipe, and the itertools.tee function in Python 2.4, there is no guarantee of thread safety: to access the tee'd iterators from different threads, you need to guard those iterators with a single lock!

See Also

The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python's online documentation; Recipe 19.2 shows how to turn an iterator into a list.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net