Credit: Tim Peters, PythonLabs
Donald Knuth, The Art of Computer Programming, vol. 3, Sorting and Searching, page 3

Professor Knuth's masterful work on the topics of sorting and searching spans nearly 800 pages of sophisticated technical text. In Python practice, we reduce it to two imperatives (we read Knuth so you don't have to):

- When you need to sort, find a way to use the built-in sort method of Python lists.
- When you need to search, find a way to use built-in dictionaries.
Many recipes in this chapter illustrate these principles. The most common theme is the decorate-sort-undecorate (DSU) pattern, a general approach that transforms a sorting problem by creating an auxiliary list, which we can then sort with the default, speedy sort method. This technique is the single most useful one to take from this chapter. In fact, DSU is so useful that Python 2.4 introduced new features to make it easier to apply. Many recipes can be made simpler in 2.4 as a result, and the discussions of older recipes have been updated to show how.

DSU relies on an unusual feature of Python's built-in comparisons: sequences are compared lexicographically. Lexicographical order is a generalization to tuples and lists of the everyday rules used to compare strings (e.g., alphabetical order). The built-in cmp(s1, s2), when s1 and s2 are sequences, is equivalent to this Python code:

```python
def lexcmp(s1, s2):
    # Find leftmost nonequal pair.
    i = 0
    while i < len(s1) and i < len(s2):
        outcome = cmp(s1[i], s2[i])
        if outcome:
            return outcome
        i += 1
    # All equal, until at least one sequence was exhausted.
    return cmp(len(s1), len(s2))
```

This code looks for the first unequal corresponding elements. If such an unequal pair is found, that pair determines the outcome. Otherwise, if one sequence is a proper prefix of the other, the prefix is considered the smaller sequence. Finally, if neither case applies, the sequences are identical and are considered equal.
Here are some examples:

```python
>>> cmp((1, 2, 3), (1, 2, 3))   # identical
0
>>> cmp((1, 2, 3), (1, 2))      # first larger, because second is a prefix
1
>>> cmp((1, 100), (2, 1))       # first smaller, because 1<2
-1
>>> cmp((1, 2), (1, 3))         # first smaller, because 1==1, then 2<3
-1
```

An immediate consequence of lexicographical comparison is that if you want to sort a list of objects by a primary key, breaking ties by comparing a secondary key, you can simply build a list of tuples, in which each tuple contains the primary key, secondary key, and original object, in that order. Because tuples are compared lexicographically, this automatically does the right thing: the primary keys are compared first, and if (and only if) the primary keys are equal, the secondary keys are compared. The examples of the DSU pattern in this chapter show many applications of this idea. The DSU technique applies to any number of keys: you can add to the tuples as many keys as you like, in the order in which you want the keys compared.

In Python 2.4, you can get the same effect with the new key= optional argument to sort, as several recipes point out. Using the sort method's key= argument is easier, more memory-efficient, and runs faster than building an auxiliary list of tuples by hand.

The other sorting innovation introduced in 2.4 is a convenient shortcut: a sorted built-in function that sorts any iterable, not in place, but by first copying it into a new list. In Python 2.3 (apart from the new optional keyword arguments, which apply to the sorted built-in function as well as to list.sort), you can code the same functionality quite easily:

```python
def sorted_2_3(iterable):
    alist = list(iterable)
    alist.sort()
    return alist
```

Because copying a list and sorting it are both nontrivial operations, and the built-in sorted needs to perform those operations too, no speed advantage is gained by making sorted a built-in. Its advantage is just the convenience.
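To make the two approaches just described concrete, here is a small sketch (the records and key choices are invented for this example) that sorts by a primary key, breaking ties on a secondary key, once with a hand-built auxiliary list of tuples and once with the 2.4 key= argument:

```python
# Hypothetical records: (name, grade) pairs, to be sorted by grade
# (primary key), breaking ties by name (secondary key).
records = [('dave', 'B'), ('alice', 'A'), ('carol', 'B'), ('bob', 'A')]

# Decorate-sort-undecorate: lead each auxiliary tuple with the keys,
# in the order they should be compared; lexicographic tuple
# comparison then does the right thing automatically.
decorated = [(grade, name, (name, grade)) for name, grade in records]
decorated.sort()
by_grade = [record for grade, name, record in decorated]

# The Python 2.4 key= argument produces the same ordering directly.
same = sorted(records, key=lambda r: (r[1], r[0]))
assert same == by_grade
```

Both spellings yield [('alice', 'A'), ('bob', 'A'), ('carol', 'B'), ('dave', 'B')]: the 'A' grades sort before the 'B' grades, and within each grade the names break the tie.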
Having something always around and available, rather than having to code even just four simple lines over and over, does make a difference in practice. On the other hand, few tiny functions are used commonly enough to justify expanding the set of built-ins. Python 2.4 added sorted and reversed because those two functions were requested very frequently over the years.

The biggest change in Python sorting since the first edition of this book is that Python 2.3 moved to a new implementation of sorting. The primary visible consequences are increased speed in many common cases, and the fact that the new sort is stable (meaning that when two elements compare equal in the original list, they retain their relative order in the sorted list). The new implementation was so successful, and the chances of improving on it appeared so slim, that Guido was persuaded to proclaim that Python's list.sort method will always be stable. This guarantee started with Python 2.4 but was actually realized in Python 2.3. Still, the history of sorting cautions us that better methods may yet be discovered. A brief account of Python's sorting history may be instructive in this regard.

A Short History of Python Sorting

In early releases of Python, list.sort used the qsort routine from the underlying platform's C library. This didn't work out for several reasons, primarily because the quality of qsort varied widely across machines. Some versions were extremely slow when given a list with many equal values or in reverse-sorted order. Some versions even dumped core because they weren't reentrant. A user-defined __cmp__ function can also invoke list.sort, so that one list.sort can invoke others as a side effect of comparing; some platform qsort routines couldn't handle that. A user-defined __cmp__ function can also (if it's insane or malicious) mutate the list while it's being sorted, and many platform qsort routines dumped core when that happened.
Python then grew its own implementation of the quicksort algorithm. This was rewritten with every release, as real-life cases of unacceptable slowness were discovered. Quicksort is a delicate algorithm indeed!

In Python 1.5.2 the quicksort algorithm was replaced by a hybrid of samplesort and binary insertion sort, and that implementation remained unchanged for more than four years, until Python 2.3. Samplesort can be viewed as a variant of quicksort that uses a very large sample size to pick the partitioning element, also known as the pivot (it recursively samplesorts a large random subset of the elements and picks the median of those). This variant makes quadratic-time behavior almost impossible and brings the number of comparisons in the average case much closer to the theoretical minimum. However, because samplesort is a complicated algorithm, it has too much administrative overhead for small lists. Therefore, small lists (and small slices resulting from samplesort partitioning) were handled by a separate binary insertion sort: an ordinary insertion sort, except that it uses binary search to determine where each new element belongs. Most sorting texts say this isn't worth the bother, but that's because most texts assume that comparing two elements is as cheap as, or cheaper than, swapping them in memory, which isn't true for Python's sort! Moving an object is very cheap, since what is copied is just a reference to the object. Comparing two objects is expensive, though, because all of the object-oriented machinery for finding the appropriate code to compare two objects, and for coercion, gets reinvoked each time. This made binary search a major win for Python's sort.

On top of this hybrid approach, a few common special cases were exploited for speed. First, already-sorted or reverse-sorted lists were detected and handled in linear time. For some applications, these kinds of lists are very common.
Second, if an array was mostly sorted, with just a few out-of-place elements at the end, the binary insertion sort handled the whole job. This was much faster than letting samplesort have at it, and occurred often in applications that repeatedly sort a list, append a few new elements, then sort it again. Finally, special code in the samplesort looked for stretches of equal elements, so that the slice they occupy could be marked as done early.

In the end, all of this yielded an in-place sort with excellent performance in all known real cases and supernaturally good performance in some common special cases. It spanned about 500 lines of complicated C code, which gives special poignancy to Recipe 5.11. Over the years samplesort was in use, I made a standing offer to buy dinner for anyone who could code a faster Python sort. Alas, I ate alone. Still, I kept my eye on the literature because several aspects of the samplesort hybrid were irritating:
Current Sorting

It was always clear that a mergesort would be better on several counts, including guaranteed worst-case n log n time, and the fact that mergesort is easy to make stable. The problem was that half a dozen attempts to code a mergesort for Python yielded a sort that ran slower (mergesort does much more data movement than samplesort) and consumed more memory.

A large and growing literature concerns adaptive sorting algorithms, which attempt to detect order of various kinds in the input. I coded a dozen of them, but they were all much slower than Python's samplesort except on the cases they were designed to exploit. The theoretical bases for these algorithms were simply too complex to yield effective practical algorithms. Then I read an article pointing out that list merging naturally reveals many kinds of partial order, simply by paying attention to how often each input list "wins" in a row. This information was simple and general. When I realized how it could be applied to a natural mergesort, which would obviously exploit all the kinds of partial order I knew and cared about, I got obsessed enough to solve the speed problem for random data and to minimize the memory burden.

The resulting "adaptive, natural, stable" mergesort implemented for Python 2.3 was a major success, but also a major engineering effort: the devil is in the details. There are about 1,200 lines of C code, but unlike the code in the samplesort hybrid, none of these lines are coding for special cases, and about half implement a technical trick allowing the worst-case memory burden to be cut in half. I'm quite proud of it, but the margins of this introduction lack the space for me to explain the details. If you're curious, I wrote a long technical description that you can find in a Python source distribution: Objects/listsort.txt under the main directory (say, Python-2.3.5 or Python-2.4) where you unpacked Python's source distribution archive.
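To see what "wins in a row" means, here is a toy sketch (invented for illustration, and far simpler than the run detection and galloping machinery in the real implementation) of a merge that records how long each input list wins consecutively:

```python
def merge_count_runs(a, b):
    # Merge two sorted lists, recording (side, run_length) pairs:
    # how many times in a row each input supplied the next element.
    # Long runs signal partial order a merge-based sort can exploit.
    result, wins = [], []
    i = j = 0
    side, run = None, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i]); i += 1; winner = 'a'
        else:
            result.append(b[j]); j += 1; winner = 'b'
        if winner == side:
            run += 1
        else:
            if side is not None:
                wins.append((side, run))
            side, run = winner, 1
    if side is not None:
        wins.append((side, run))
    result.extend(a[i:])   # one side is exhausted; append the rest
    result.extend(b[j:])
    return result, wins
```

Merging [1, 2, 3, 4, 10] with [5, 6, 7] gives win runs [('a', 4), ('b', 3)]; runs that long are exactly the signal that the inputs carry structure worth exploiting, rather than being randomly interleaved.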
In the following list, I provide examples of the partial order Python 2.3's mergesort naturally exploits, where "sorted" means in either forward-sorted or reverse-sorted order:
In short, Python 2.3's timsort (well, it has to have some brief name) is stable, robust, and preternaturally fast in many real-life cases: use it any time you can!
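One practical payoff of the stability guarantee is worth a final sketch (with invented data): because equal elements keep their relative order, you can sort on a secondary key first and on a primary key second, and ties on the primary key then preserve the secondary ordering:

```python
# Hypothetical (name, age) records.
people = [('bob', 30), ('alice', 25), ('carol', 30), ('dave', 25)]

# Sort on the secondary key first...
people.sort(key=lambda p: p[0])   # by name
# ...then on the primary key; because the sort is stable, records
# with equal ages stay in name order from the previous pass.
people.sort(key=lambda p: p[1])   # by age
```

(key= is the 2.4 spelling; the same multi-pass idiom works with any stable sort.) The result groups the 25-year-olds before the 30-year-olds, alphabetized within each age.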