Recipe 19.8. Looping Through Multiple Iterables in Parallel

Credit: Andy McKay, Hamish Lawson, Corey Coughlin

Problem

You need to loop through every item of multiple iterables in parallel: first you want a tuple with the first item of each iterable, then a tuple with the second items, and so forth.

Solution

Say you have two iterables (lists, in this case) such as:

a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']

If you want to loop "in parallel" over them, the most general and effective approach is:

import itertools
for x, y in itertools.izip(a, b):
    print x, y

This snippet outputs two lines:

a1 b1
a2 b2

Discussion

The most general and effective way to loop "in parallel" over multiple iterables is to use function izip of standard library module itertools, as shown in the "Solution". The built-in function zip is an alternative that is almost as good:

for x, y in zip(a, b):
    print x, y

However, zip has one downside that can hurt your performance if you're dealing with long sequences: it builds the list of tuples in memory all at once (preparing and returning a list), while you need only one tuple at a time for pure looping purposes.

Both zip and itertools.izip, when you iterate in parallel over iterables of different lengths, stop as soon as the "shortest" such iterable is exhausted. This approach to termination is normally what you want. For example, it lets you have one or more non-terminating iterables in the zipping, as long as at least one of the iterables does terminate, or (in the case of izip only) as long as you use some control structure, such as a conditional break within a for statement, to ensure you always require a finite number of items and do not loop endlessly.
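As a small sketch of this termination rule, the snippet below zips the non-terminating itertools.count() with a short list. The try/except shim and the names izip and numbered are assumptions added for illustration, so that the snippet also runs where itertools.izip does not exist (in Python 3 the built-in zip is already lazy).

```python
import itertools

try:
    izip = itertools.izip   # Python 2
except AttributeError:
    izip = zip              # Python 3: the built-in zip is already lazy

b = ['b1', 'b2']

# itertools.count() never terminates, but b does, so the zipping stops
# as soon as b is exhausted.
numbered = list(izip(itertools.count(), b))
# numbered == [(0, 'b1'), (1, 'b2')]
```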

In some cases, when iterating in parallel over iterables of different lengths, you may want shorter iterables to be conceptually "padded" with None up to the length of the longest iterable in the zipping. For this special need, you can use the built-in function map with a first argument of None:

for x, y in map(None, a, b):
    print x, y

map, like zip, builds and returns a whole list. If that is a problem, you can reproduce map's pad-with-None behavior by coding your own generator. Coding your own generator is also a good approach when you need to pad shorter iterables with some value that is different from None.
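If your Python version is recent enough, the standard library may already cover this need: itertools.izip_longest (Python 2.6 and 2.7) and itertools.zip_longest (Python 3) pad lazily and accept a fillvalue argument. This is an aside under the assumption that one of those versions is available, not part of the recipe's own approach; the import shim and the name padded are illustrative.

```python
try:
    from itertools import izip_longest                   # Python 2.6 and 2.7
except ImportError:
    from itertools import zip_longest as izip_longest    # Python 3

a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']

# Shorter iterables are padded with fillvalue (None if you omit it).
padded = list(izip_longest(a, b, fillvalue='-'))
# padded == [('a1', 'b1'), ('a2', 'b2'), ('a3', '-')]
```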

If you need to deal with exactly two sequences, your generator's code can be quite straightforward:

def par_two(a, b, padding_item=None):
    a, b = iter(a), iter(b)
    # advance both iterators by hand: itertools.izip would silently drop
    # one item of a whenever b is the first to be exhausted
    for x in a:
        try:
            y = b.next()
        except StopIteration:
            # b is exhausted: pad the rest of a, including the current x
            yield x, padding_item
            for x in a:
                yield x, padding_item
            return
        yield x, y
    # a is exhausted: pad whatever is left of b (possibly nothing)
    for y in b:
        yield padding_item, y
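One detail is worth checking whenever izip shares its iterators with later loops: izip pulls from its arguments left to right, so when a later argument runs out first, the item it has already taken from an earlier argument is discarded. A minimal sketch (the shim and the names pairs and leftover are illustrative assumptions):

```python
try:
    from itertools import izip   # Python 2
except ImportError:
    izip = zip                   # Python 3: the built-in zip is already lazy

a = iter(['a1', 'a2', 'a3'])
b = iter(['b1'])

pairs = list(izip(a, b))   # [('a1', 'b1')]
leftover = list(a)         # ['a3']: 'a2' was taken from a, then dropped
```

Putting the iterable you expect to be shortest first avoids the loss, since nothing has yet been taken from the others when it runs out.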

Alternatively, you can code a more general function, one that is able to deal with any number of sequences:

import itertools
def par_loop(padding_item, *sequences):
    iterators = map(iter, sequences)
    num_remaining = len(iterators)
    result = [padding_item] * num_remaining
    while num_remaining:
        for i, it in enumerate(iterators):
            try:
                result[i] = it.next()
            except StopIteration:
                iterators[i] = itertools.repeat(padding_item)
                num_remaining -= 1
                result[i] = padding_item
        if num_remaining:
            yield tuple(result)

Here's an example of use for generator par_loop:

print map(''.join, par_loop('x', 'foo', 'zapper', 'ui'))
# emits: ['fzu', 'oai', 'opx', 'xpx', 'xex', 'xrx']

Both par_two and par_loop start by calling the built-in function iter on all of their arguments and thereafter use the resulting iterators. This is important, because the functions rely on the state that these iterators maintain. The key idea in par_loop is to keep count of the number of iterators as yet unexhausted, and replace each exhausted iterator with a nonterminating iterator that yields the padding_item ceaselessly; num_remaining counts unexhausted iterators, and both the yield statement and the continuation of the while loop are conditional on some iterators being as yet unexhausted.
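To see why the explicit call to iter matters, here is a minimal sketch of the state an iterator keeps; the names seq, head, tail, and again are illustrative, and the built-in next used here assumes Python 2.6 or later.

```python
seq = ['a', 'b', 'c', 'd']

# An iterator remembers how far it has advanced...
it = iter(seq)
head = [next(it), next(it)]   # ['a', 'b']
tail = list(it)               # ['c', 'd']: resumes where head stopped

# ...whereas iterating over the list itself always restarts from the top.
again = list(seq)             # ['a', 'b', 'c', 'd']
```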

Alternatively, if you know in advance which iterable is the longest one, you can wrap every other iterable x as itertools.chain(iter(x), itertools.repeat(padding)) and then call itertools.izip. You can't do this wrapping on all iterables because the resulting iterators are nonterminating: if you izip iterators that are all nonterminating, izip itself cannot terminate! Here, for example, is a version that works as intended only when the longest (but terminating!) iterable is the very first one:

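import itertools
def par_longest_first(padding_item, *sequences):
    iterators = map(iter, sequences)
    for i, it in enumerate(iterators):
        if not i: continue
        # pad every iterator except the first (longest) one
        iterators[i] = itertools.chain(it, itertools.repeat(padding_item))
    # unpack the list so that izip receives each iterator as a separate argument
    return itertools.izip(*iterators)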
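The wrapping itself can also be tried directly, without a helper function. The sketch below assumes 'zapper' is known to be the longest iterable; the compatibility shim and the names longest, others, wrapped, and rows are illustrative.

```python
import itertools

try:
    izip = itertools.izip   # Python 2
except AttributeError:
    izip = zip              # Python 3: the built-in zip is already lazy

longest = 'zapper'
others = ['foo', 'ui']

# Wrap every iterable except the longest so that it pads with 'x' forever.
wrapped = [itertools.chain(iter(s), itertools.repeat('x')) for s in others]

# The zipping terminates because the unwrapped longest iterable does.
rows = [''.join(t) for t in izip(longest, *wrapped)]
# rows == ['zfu', 'aoi', 'pox', 'pxx', 'exx', 'rxx']
```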
