Recipe5.8.Getting the First Few Smallest Items of a Sequence


Recipe 5.8. Getting the First Few Smallest Items of a Sequence

Credit: Matteo Dell'Amico, Raymond Hettinger, George Yoshida, Daniel Harding

Problem

You need to get just a few of the smallest items from a sequence. You could sort the sequence and just use seq[:n], but is there any way you can do better?

Solution

Perhaps you can do better, if n, the number of items you need, is small compared to n, the sequence's length. sort is very fast, but it still takes O(n log n) time, while we can get the first n smallest elements in time O(n) if n is small. Here is a simple and practical generator for this purpose, which works equally well in Python 2.3 and 2.4:

import heapq def isorted(data):     data = list(data)     heapq.heapify(data)     while data:         yield heapq.heappop(data)

In Python 2.4 only, you can use an even simpler and faster way to get the smallest n items of data when you know n in advance:

import heapq def smallest(n, data):     return heapq.nsmallest(n, data)

Discussion

data can be any bounded iterable; the recipe's function isorted starts by calling list on it to ensure that. You can remove the statement data = list(data) if all the following conditions hold: you know that data is a list to start with, you don't mind the fact that the generator reorders data's items, and you want to remove items from data as you fetch them.

As shown previously in Recipe 5.7, the Python Standard Library contains module heapq, which supports the data structures known as heaps. Generator isorted in this recipe's Solution relies on making a heap at the start (via heap.heapify) and then yielding and removing the heap's smallest remaining item at each step (via heap.heappop).

In Python 2.4, module heapq has also grown two new functions. heapq.nlargest(n, data) returns a list of the n largest items of data; heapq.nsmallest(n, data) returns a list of the n smallest items. These functions do not require that data satisfy the heap condition; indeed, they do not even require data to be a listany bounded iterable whose items are comparable will do. Function smallest in this recipe's Solution just lets heapq.smallest do all the work.

To judge speed, we must always measure itguessing about relative speeds of different pieces of code is a mug's game. So, how does isorted's performance compare with Python 2.4's built-in function sorted, when we're only looping on the first few (smallest) items? To help measure timing, I wrote a top10 function that can use either approach, and I also made sure I had a sorted function even in Python 2.3, where it's not built in:

try:     sorted except:     def sorted(data):         data = list(data)         data.sort( )         return data import itertools def top10(data, howtosort):     return list(itertools.islice(howtosort(data), 10))

On my machine running Python 2.4 on thoroughly shuffled lists of 1,000 integers, top10 takes about 260 microseconds with isorted, while it takes about 850 microseconds with the built-in sorted. However, Python 2.3 is much slower and gives vastly different results: about 12 milliseconds with isorted, about 2.7 milliseconds with sorted. In other words, Python 2.3 is 3 times slower than Python 2.4 for sorted, but it's 50 times slower for isorted. Lesson to retain: whenever you optimize, measure. You shouldn't choose optimizations based on first principles, since the performance numbers can vary so widely, even between vastly compatible "point releases". A secondary point can be made: if you care about performance, move to Python 2.4 as soon as you can. Python 2.4 has been vastly optimized and accelerated over Python 2.3, particularly in areas related to searching and sorting.

If you know that your code need only support Python 2.4, then, as this recipe's Solution indicates, using heapq's new function nsmallest is faster, as well as simpler, than doing your own coding. To implement top10 in Python 2.4, for example, you just need:

import heapq def top10(data):     return heapq.nsmallest(10, data)

This version takes about half the time of the previously shown isorted-based top10, when called on the same thoroughly shuffled lists of 1,000 integers.

See Also

Library Reference and Python in a Nutshell docs about method sort of type list, and about modules heapq and timeit; Chapter 19 for more about iteration in Python; Python in a Nutshell's chapter on optimization; heapq.py in the Python sources contains a very interesting discussion of heaps; Recipe 5.7 for more information about heapq.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net