Recipe19.7.Looping on a Sequence by Overlapping Windows


Recipe 19.7. Looping on a Sequence by Overlapping Windows

Credit: Peter Cogolo, Steven Bethard, Ian Bicking

Problem

You have an iterable s and need to make another iterable whose items are sublists (i.e., sliding windows), each of the same given length, over s' items, with successive windows overlapping by a specified amount.

Solution

We can combine built-in function iter and function islice from the standard library module itertools to code a generator to solve our problem:

import itertools def windows(iterable, length=2, overlap=0):     it = iter(iterable)     results = list(itertools.islice(it, length))     while len(results) == length:         yield results         results = results[length-overlap:]         results.extend(itertools.islice(it, length-overlap))     if results:         yield results if _ _name_ _ == '_ _main_ _':     seq = 'foobarbazer'     for length in (3, 4):         for overlap in (0, 1):             print '%d %d: %s' % (length, overlap,                     map(''.join, windows(seq, length, overlap)))

This module, when run as a main script, emits:

3 0: ['foo', 'bar', 'baz', 'er'] 3 1: ['foo', 'oba', 'arb', 'baz', 'zer', 'r'] 4 0: ['foob', 'arba', 'zer'] 4 1: ['foob', 'barb', 'baze', 'er']

When you know you don't need any overlap, a fast and concise alternative is available:

def chop(iterable, length=2):     return itertools.izip(*(iter(iterable),)*length)

The body of this concise alternative may be a bit confusing until you realize that the two occurrences of the asterisk ( * ) there play different roles: the first one is part of a *args syntax form (passing the elements of a sequence as separate positional arguments), the second one indicates that a sequence (the Singleton tuple (iter(iterable),) must be repeated length times.

Discussion

In many cases, we need a sequence of sub-sequences of a given length, and we have to start with a "flat" iterable. For example, we can build a dictionary with given keys and values by calling dict with a sequence of two-item sequencesbut what if we start with a "flat" sequence where keys and values just alternate? The function windows in this recipe meets this need:

the_dict = dict(windows(flat_alternating_keys_and_values))

Or, say we have an iterable whose items are the amounts of sales on each day. To turn it into an iterable whose items are the amounts of sales in each week (seven days):

weekly_sales = itertools.imap(sum, windows(daily_sales, 7))

The two use cases just presented are examples of how windows can be useful when called without overlap (in other words, with an overlap argument of 0, its default value), so the alternative chop function also presented in the recipe would be just as good (and faster). However, overlap is often useful when you deal with iterables that are signals, or time series. For example, if you have a function average such as:

def average(sequence):     return sum(sequence)/float(len(sequence))

then you can apply a simple low-pass filter to a signal:

filtered = itertools.imap(average, windows(raw_signal, 5, 2))

or get the moving average daily sales from the iterable of daily sales:

mvavg_daily_sales = itertools.imap(average, windows(daily_sales, 7, 6))

The implementation of the windows generator in this recipe is quite straightforward, if you're familiar with itertools.islice (and you should be, if you deal with iterables!). For the first "window", we must clearly fill list results with the appropriate number of items (islice does that for us). At each subsequent step, we must throw away a certain number of items from the "front" of results (we do that conveniently by list slicing, since results is, indeed, a list) and replenish the same number at the back (we do that by calling the extend method of results, with islice providing the needed "new" items). That number, as a little reasoning shows, is exactly that given by the expression length-overlap. The loop terminates, if ever, only when results cannot be entirely replenished. (The loop never executes if results cannot even be filled entirely in the first place.)

When the loop terminates, we may be left with a "tail" in results, a "last window" that is shorter than length. What to do in that case depends on your application's exact needs. The recipe, as given above, just yields the shorter window as the last item of the generator, which is what we'd probably want in all of the previously mentioned use cases. In other cases, we might want to drop the last, too-short window (just omit the last two statements in function windows as given in the recipe), raise an exception (when we know that such a situation should never occur), or pad the last window to the required length with a pad value such as None, by changing the last two statements in function windows to something like:

    if result:         result.extend(itertools.repeat(None, length-len(result)))         yield result

One important implementation detail: function windows, as coded in the recipe, yields a new list object at each step. It takes some time to generate all of these objects. In some cases, it may be convenient to the caller to know that each object it gets is a separate and independent list. Such knowledge enables the caller to store or modify the objects it gets, without having to make explicit copies. However, none of the use cases we discussed gets any benefit from this feature. So, you could optimize, by yielding the same list object every time. If you want that optimization, just change the statement:

        results = results[length-overlap:]

into:

        del results[:length-overlap]

If you're applying this optimization, and you're using Python 2.4, you should also consider using the new type collections.deque instead of list. In order to do so, you need to add the statement:

import collections

at the start of your code, change the only occurrence of list in the recipe into collections.queue, and further change the updating of results to avoid slicing, using, instead:

        for i in xrange(length-overlap): results.popleft( )

If your windows are long, and if they overlap substantially, using deque instead of list might give you better performance, since deque is optimized to support adding and removing items at either end, while lists, being compact arrays in memory, are inherently slow (specifically, O(n) for a list of length n) when you add or remove items at the beginning.

When you want windows of some length n that overlap specifically by n-1 items, function itertools.tee, new in Python 2.4, offers an elegant alternative approach. Say that you want to look at each item of the iterable, with access to a few neighboring items and some padding at both ends, so that you get just as many windows as there are items in the iterable. In Python 2.4, you could then code:

import itertools as IT def windowed(iterable, pre=1, post=1, padding=None):     # tee-off one iterator for each index in the window     copies = IT.tee(iterable, pre + 1 + post)     pre_copies, copy, post_copies = copies[:pre], copies[pre], copies[pre+1:]     # iterators before the element have their start filled in with the     # padding value.  no need to slice off the ends, izip will do that.     pre_copies = [IT.chain(IT.repeat(padding, pre - i), itr)                   for i, itr in enumerate(pre_copies)]     # iterators after the element have their starts sliced off and their     # end filled in with the padding value, endlessly repeated.     post_copies = [IT.chain(IT.islice(itr, i + 1, None), IT.repeat(padding))                    for i, itr in enumerate(post_copies)]     # zip the elements with their preceding and following elements     return IT.izip(*(pre_copies + [copy] + post_copies))

For example:

>>> print list(windowed(xrange(4), 1, 2, 'x')) [('x', 0, 1, 2), (0, 1, 2, 3), (1, 2, 3, 'x'), (2, 3, 'x', 'x')]

If you use Python 2.4 and want this flavor of "sliding windows" over the iterable, with specified "padding" at both ends, you might prefer this windowed function to the recipe's windows generator.

See Also

Library Reference documentation on built-in iter and module itertools.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net