Recipe5.3.Sorting a List of Objects by an Attribute of the Objects

Recipe 5.3. Sorting a List of Objects by an Attribute of the Objects

Credit: Yakov Markovitch, Nick Perkins

Problem

You need to sort a list of objects according to one attribute of each object.

Solution

The DSU idiom shines, as usual:

def sort_by_attr(seq, attr):     intermed = [ (getattr(x, attr), i, x) for i, x in enumerate(seq) ]     intermed.sort( )     return [ x[-1] for x in intermed ] def sort_by_attr_inplace(lst, attr):     lst[:] = sort_by_attr(lst, attr)

In Python 2.4, DSU is natively supported, so your code can be even shorter and faster:

import operator def sort_by_attr(seq, attr):     return sorted(seq, key=operator.attrgetter(attr)) def sort_by_attr_inplace(lst, attr):     lst.sort(key=operator.attrgetter(attr))

Discussion

Sorting a list of objects by an attribute of each object is best done using the DSU idiom introduced previously in Recipe 5.2. In Python 2.3 and 2.4, DSU is no longer needed, as it used to be, to ensure that a sort is stable (sorting is always stable in Python 2.3 and later), but DSU's speed advantages still shine.

Sorting, in the general case and with the best algorithms, is O(n log n) (as is often the case in mathematical formulas, the juxtaposition of terms, here n and log n, indicates that the terms are multiplied). DSU's speed comes from maximally accelerating the O(n log n) part, which dominates sorting time for sequences of substantial length n, by using only Python's native (and maximally fast) comparison. The preliminary decoration step, which prepares an intermediate auxiliary list of tuples, and the successive undecoration step, which extracts the important item from each tuple after the intermediate list is sorted, are only O(n). Therefore any minor inefficiencies in these steps contribute negligible overhead if n is large enough, and reasonably little even for many practical values of n.

The O( )-Notation

The most useful way to reason about many performance issues is in terms of what is popularly known as big-O analysis and notation (the O stands for "order"). You can find detailed explanations, for example, at http://en.wikipedia.org/wiki/Big_O_notation, but here's a summary.

If we consider an algorithm applied to input data of some size N, running time can be described, for large enough values of N (and big inputs are often those for which performance is most critical), as being proportional to some function of N. This is indicated with notations such as O(N) (running time proportional to N: processing twice as much data takes about twice as much time, 10 times as much data, 10 times as much time, and so on; also known as linear time), O(N squared) (running time proportional to the square of N: processing twice as much data takes about four times as much time, 10 times as much data, 100 times as much time; also known as quadratic time), and so on. Another case you will see often is O(N log N), which is faster than O(N squared) but not as fast as O(N).

The constant of proportionality is often ignored (at least in theoretical analysis) because it depends on such issues as the clock rate of your computer, not just on the algorithm. If you buy a machine that's twice as fast as your old one, everything will run in half the time, but that will not change any of the comparisons between alternative algorithms.

This recipe puts index i, in each tuple that is an item of list intermed, ahead of the corresponding x (where x is the i-th item in seq). This placement ensures that two items of seq will never be compared directly, even if they have the same value for the attribute named attr. Even in that case, their indices will still differ, and thus Python's lexicographic comparison of the tuples will never get all the way to comparing the tuples' last items (the original items from seq). Avoiding object comparisons may save us from performing extremely slow operations, or even from attempting forbidden ones. For example, we could sort a list of complex numbers by their real attribute: we would get an exception if we ever tried to compare two complex numbers directly, because no ordering is defined on complex numbers. But thanks to the precaution described in this paragraph, such an event can never occur, and the sorting will therefore proceed correctly.

As mentioned earlier in Recipe 5.2, Python 2.4 supports DSU directly. You can pass an optional keyword-argument key, to sort, which is the callable to use on each item to get the sort key. Standard library module operator has two new functions, attrgetter and itemgetter, that exist specifically to return callables suitable for this purpose. In Python 2.4, the ideal solution to this problem therefore becomes:

import operator seq.sort(key=operator.attrgetter(attr))

This snippet performs the sort in-place, which helps make it blazingly faston my machine, three times faster than the Python 2.3 function shown first in this recipe. If you need a sorted copy, without disturbing seq, you can get it using Python 2.4's new built-in function sorted:

sorted_copy = sorted(seq, key=operator.attrgetter(attr))

While not quite as fast as an in-place sort, this latest snippet is still over 2.5 times faster than the function shown first in this recipe. Python 2.4 also guarantees that, when you pass the optional key named argument, list items will never be accidentally compared directly, so you need not take any special safeguards. Moreover, stability is also guaranteed.

Recipe5.3.Sorting a List of Objects by an Attribute of the Objects