Recipe5.9.Looking for Items in a Sorted Sequence

Recipe 5.9. Looking for Items in a Sorted Sequence

Credit: Noah Spurrier

Problem

You need to look for a lot of items in a sequence.

Solution

If list L is sorted, module bisect from the Python Standard Library makes it easy to check if some item x is present in L:

import bisect x_insert_point = bisect.bisect_right(L, x) x_is_present = L[x_insert_point-1:x_insert_point] == [x]

Discussion

Looking for an item x in a list L is very easy in Python: to check whether the item is there at all, if x in L; to find out where exactly it is, L.index(x). However, if L has length n, these operations take time proportional to nessentially, they just loop over the list's items, checking each for equality to x. If L is sorted, we can do better.

The classic algorithm to look for an item in a sorted sequence is known as binary search, because at each step it roughly halves the range it's still searching onit generally takes about log₂n steps. It's worth considering when you're going to look for items many times, so you can amortize the cost of sorting over many searches. Once you've decided to use binary search for x in L, after calling L.sort( ), module bisect from the Python Standard Library makes the job easy.

Specifically, we need function bisect.bisect_right, which returns the index where an item should be inserted, to keep the sorted list sorted, but doesn't alter the list; moreover, if the item already appears in the list, bisect_right returns an index that's just to the right of any items with the same value. So, after getting this "insert point" by calling bisect.bisect_right(L, x), we need only to check the list immediately before the insert point, to see if an item equal to x is already there.

The way we compute x_is_present in the "Solution" may not be immediately obvious. If we know that L is not empty, we can use a simpler and more obvious approach:

x_is_present = L[x_insert_point-1] == x

However, the indexing in this simpler approach raises an exception when L is empty. When the slice boundaries are invalid, slicing is less "strict" than indexing, since it just produces an empty slice without raising any exception. In general, somelist[i:i+1] is the same one-item list as [somelist[i]] when i is a valid index in somelist: it's an empty list [ ] when the indexing would raise IndexError. The computation of x_is_present in the recipe exploits this important property to avoid having to deal with exceptions and handle empty and nonempty cases for L in one uniform way. An alternative approach is:

x_is_present = L and L[x_insert_point-1] == x

This alternative approach exploits and's short-circuiting behavior to guard the indexing, instead of using slicing.

An auxiliary dict, as shown in Recipe 5.12, is also a possibility as long as items are hashable (meaning that items can be used as keys into a dict). However, the approach in this recipe, based on a sorted list, may be the only useful one when the items are comparable (otherwise, the list could not be sorted) but not hashable (so a dict can't have those items as its keys).

When the list is already sorted, and the number of items you need to look up in it is not extremely large, it may in any case be faster to use bisect than to build an auxiliary dictionary, since the investment of time in the latter operation might not be fully amortized. This is particularly likely in Python 2.4, since bisect has been optimized very effectively and is much faster than it was in Python 2.3. On my machine, for example, bisect.bisect_right for an item in the middle of a list of 10,000 integers is about four times faster in Python 2.4 than it was in Python 2.3.

Recipe5.9.Looking for Items in a Sorted Sequence