Recipe19.14.Merging Sorted Sequences

Recipe 19.14. Merging Sorted Sequences

Credit: Sébastien Keim, Raymond Hettinger, Danny Yoo

Problem

You have several sorted sequences (iterables) and need to iterate on the overall sorted sequence that results from "merging" these sequences.

Solution

A generator is clearly the right tool for the job, in the general case (i.e., when you might not have enough memory to comfortably hold all the sequences). Implementing the generator is made easy by the standard library module heapq, which supplies functions to implement the "heap" approach to priority queues:

import heapq def merge(*subsequences):     # prepare a priority queue whose items are pairs of the form     # (current-value, iterator), one each per (non-empty) subsequence     heap = [  ]     for subseq in subsequences:         iterator = iter(subseq)         for current_value in iterator:             # subseq is not empty, therefore add this subseq's pair             # (current-value, iterator) to the list             heap.append((current_value, iterator))             break     # make the priority queue into a heap     heapq.heapify(heap)     while heap:         # get and yield lowest current value (and corresponding iterator)         current_value, iterator = heap[0]         yield current_value         for current_value in iterator:             # subseq is not finished, therefore add this subseq's pair             # (current-value, iterator) back into the priority queue             heapq.heapreplace(heap, (current_value, iterator))             break         else:             # subseq has been exhausted, therefore remove it from the queue             heapq.heappop(heap)

Discussion

The need for "merging" sorted subsequences into a larger sorted sequence is reasonably frequent. If the amount of data is small enough to fit entirely in memory without problems, then the best approach is to build a list by concatenating all subsequences, then sort the list:

def smallmerge(*subsequences):     result = [  ]     for subseq in subsequences: result.extend(subseq)     result.sort( )     return result

The sort method of list objects is based on a sophisticated natural merge algorithm, able to take advantage of existing sorted subsequences in the list you're sorting; therefore, this approach is quite fast, as well as simple (and general, since this approach's correctness does not depend on all subsequences being already sorted). If you can choose this approach, it has many other advantages. For example, smallmerge works fine even if one of the subsequences isn't perfectly sorted to start with; and in Python 2.4, you may add a generic keywords argument **kwds to smallmerge and pass it right along to the result.sort( ) step, to achieve the flexibility afforded in that version by the cmp=, key=, and reverse= arguments to list's sort method.

However, you sometimes deal with large sequences, which might not comfortably fit in memory all at the same time (e.g., your sequences might come from files on disk, or be computed on the fly, item by item, by other generators). When this happens, this recipe's generator will enable you to perform your sequence merging while consuming a very moderate amount of extra memory (dependent only on the number of subsequences, not on the number of items in the subsequences).

The recipe's implementation uses a classic sequence-merging algorithm based on a priority queue, which, in turn, lets it take advantage of the useful heapq module in the Python Standard Library. heapq offers functions to implement a priority queue through the data structure known as a heap.

A heap is any list H such that, for any valid index 0<=i<len(H), H[i]<=H[2*i+1], and H[i]<=H[2*i+2] (if 2*i+1 and 2*i+2 are also valid indices into H). This heap property is fast to establish on an arbitrary list (function heapify does that) and very fast to re-establish after altering or removing the smallest item (and functions heapreplace and heappop do that). The smallest item is always H[0] (it's easy to see that the "heap" property implies this), and being able to find the smallest item instantly makes heaps an excellent implementation of priority queues.

In this recipe, we use as items in the "heap" a "pair" (i.e., two-items tuple) for each subsequence that is not yet exhausted (i.e., each subsequence through which we have not yet fully iterated). As its first item, each pair has the "current item" in the corresponding subsequence and, as its second item, an iterator over that subsequence. At each iteration step, we yield the smallest "current item", then we advance the corresponding iterator and re-establish the "heap" property; when an iterator is exhausted, we remove the corresponding pair from the "heap" (so that, clearly, we're finished when the "heap" is emptied). Note the idiom that we use to advance an iterator by one step, dealing with the possibility that the iterator is exhausted:

for current_value in iterator:     # if we get here the iterator was not empty, current_value was     # its first value, and the iterator has been advanced one step     ...use pair (current_value, iterator)...     # we break at once as we only wanted the first item of iterator     break else:     # if we get here the break did not execute, so the iterator     # was empty (exhausted)     # deal with the case of iterator being exhausted...

We use this idiom twice in the recipe, although in the first of the two uses we do not need the else clause since we can simply ignore iterators that are immediately exhausted (they correspond to empty subsequences, which can be ignored for merging purposes).

If you find this idiom confusing or tricky (because it uses a for statement whose body immediately breaksi.e., a statement that looks like a loop but is not really a loop because it never executes more than once!), you may prefer a different approach:

try:     current_value = iterator.next( ) except StopIteration:     # if we get here the iterator was empty (exhausted)     #   deal with the case of iterator being exhausted... else:     # if we get here the iterator was not empty, current_value was     # its first value, and the iterator has been advanced one step     #  use pair (current_value, iterator)...

I slightly prefer the idiom using for; in my view, it gains in clarity by putting the normal case (i.e., an unexhausted iterator) first and the rare case (an exhausted iterator) later. A variant of the try/except idiom that has the same property is:

try:     current_value = iterator.next( )     # if we get here the iterator was not empty, current_value was     # its first value, and the iterator has been advanced one step     #  use pair (current_value, iterator)... except StopIteration:     # if we get here the iterator was empty (exhausted)     #  deal with the case of iterator being exhausted...

However, I somewhat dislike this variant (even though it's quite OK for the two specific uses of this recipe) because it crucially depends on the code indicated as "use pair" never raising a StopIteration exception. As a general principle, it's best to use a TRy clause's body that is as small as possiblejust the smallest fragment that you do expect to possibly raise the exception you're catching in the following handlers (except clauses), not the follow-on code that must execute only if the exception was not raised. The follow-on code goes in the else clause of the try statement, in properly defensive Pythonic coding style. In any case, as long as you are fully aware of the tradeoffs in clarity and defensiveness between these three roughly equivalent idioms, you're welcome to develop your own distinctive Pythonic style and, in particular, to choose freely among them!

If you do choose either of the idioms that explicitly call iterator.next( ), a further "refinement" (i.e., a tiny optimization) is to keep as the second item of each pair, rather than the iterator object, the bound-method iterator.next directly, ready for you to call. This optimization is not really tricky at all (it is quite common in Python to stash away bound methods and other such callables), but it may nevertheless result in code of somewhat lower readability. Once again, the choice is up to you!

Recipe19.14.Merging Sorted Sequences