Introduction


Credit: Tim Peters, PythonLabs

Computer manufacturers of the 1960s estimated that more than 25 percent of the running time on their computers was spent on sorting, when all their customers were taken into account. In fact, there were many installations in which the task of sorting was responsible for more than half of the computing time. From these statistics we may conclude that either (i) there are many important applications of sorting, or (ii) many people sort when they shouldn't, or (iii) inefficient sorting algorithms have been in common use.

Donald Knuth

The Art of Computer Programming, vol. 3, Sorting and Searching, page 3

Professor Knuth's masterful work on the topics of sorting and searching spans nearly 800 pages of sophisticated technical text. In Python practice, we reduce it to two imperatives (we read Knuth so you don't have to):

  • When you need to sort, find a way to use the built-in sort method of Python lists.

  • When you need to search, find a way to use built-in dictionaries.

Many recipes in this chapter illustrate these principles. The most common theme is using the decorate-sort-undecorate (DSU) pattern, a general approach to transforming a sorting problem by creating an auxiliary list that we can then sort with the default, speedy sort method. This technique is the single most useful one to take from this chapter. In fact, DSU is so useful that Python 2.4 introduced new features to make it easier to apply. Many recipes can be made simpler in 2.4 as a result, and the discussions of older recipes have been updated to show how.

DSU relies on an unusual feature of Python's built-in comparisons: sequences are compared lexicographically. Lexicographical order is a generalization to tuples and lists of the everyday rules used to compare strings (e.g., alphabetical order). The built-in cmp(s1, s2), when s1 and s2 are sequences, is equivalent to this Python code:

def lexcmp(s1, s2):
    # Find leftmost nonequal pair.
    i = 0
    while i < len(s1) and i < len(s2):
        outcome = cmp(s1[i], s2[i])
        if outcome:
            return outcome
        i += 1
    # All equal, until at least one sequence was exhausted.
    return cmp(len(s1), len(s2))

This code looks for the first unequal corresponding elements. If such an unequal pair is found, that pair determines the outcome. Otherwise, if one sequence is a proper prefix of the other, the prefix is considered to be the smaller sequence. Finally, if these cases don't apply, the sequences are identical and are considered equal. Here are some examples:

>>> cmp((1, 2, 3), (1, 2, 3))  # identical
0
>>> cmp((1, 2, 3), (1, 2))     # first larger because second is a prefix
1
>>> cmp((1, 100), (2, 1))      # first smaller because 1<2
-1
>>> cmp((1, 2), (1, 3))        # first smaller because 1==1, then 2<3
-1

An immediate consequence of lexicographical comparison is that if you want to sort a list of objects by a primary key, breaking ties by comparing a secondary key, you can simply build a list of tuples, in which each tuple contains the primary key, secondary key, and original object, in that order. Because tuples are compared lexicographically, this automatically does the right thing. When comparing tuples, the primary keys are compared first, and if (and only if) the primary keys are equal, the secondary keys are compared.
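For instance, here is a minimal sketch of the pattern with made-up records: (name, age) pairs sorted by age first, with name breaking ties.

people = [('bob', 30), ('alice', 25), ('carol', 30), ('dave', 25)]

# Decorate: one tuple per record, with the keys first and the record last.
decorated = [(age, name, (name, age)) for (name, age) in people]

# Sort: lexicographic tuple comparison looks at age first, and at name
# only when two ages are equal.
decorated.sort()

# Undecorate: discard the keys, keeping the records in their new order.
by_age_then_name = [record for (age, name, record) in decorated]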

The examples of the DSU pattern in this chapter show many applications of this idea. The DSU technique applies to any number of keys. You can add to the tuples as many keys as you like, in the order in which you want the keys compared. In Python 2.4, you can get the same effect with the new key= optional argument to sort, as several recipes point out. Using the sort method's key= argument is easier and more memory-efficient than building an auxiliary list of tuples by hand, and it also runs faster.
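As a quick illustration (reusing the made-up people list from the sketch above), the key= call below produces exactly the ordering that the hand-built decoration did:

# Python 2.4 and later: the key function returns the (age, name) tuple
# that the DSU sketch built by hand; sort builds and discards it internally.
people.sort(key=lambda record: (record[1], record[0]))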

The other 2.4-introduced innovation in sorting is a convenient shortcut: a sorted built-in function that sorts any iterable, not in-place, but by first copying it into a new list. In Python 2.3 (apart from the new optional keyword arguments, which apply to the sorted built-in function as well as to list.sort), you can code the same functionality quite easily:

def sorted_2_3(iterable):
    alist = list(iterable)
    alist.sort()
    return alist

Because copying a list and sorting it are both nontrivial operations, and the built-in sorted needs to perform those operations too, no speed advantage is gained in making sorted a built-in. Its advantage is just the convenience. Having something always around and available, rather than having to code even just four simple lines over and over, does make a difference in practice. On the other hand, few tiny functions are used commonly enough to justify expanding the set of built-ins. Python 2.4 added sorted and reversed because those two functions were requested very frequently over the years.
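For example, sorted accepts any iterable and always returns a new list, leaving the original object untouched (a tiny illustrative session):

>>> sorted('Python')                  # any iterable, not just lists
['P', 'h', 'n', 'o', 't', 'y']
>>> sorted({3: 'c', 1: 'a', 2: 'b'})  # iterating a dict yields its keys
[1, 2, 3]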

The biggest change in Python sorting since the first edition of this book is that Python 2.3 moved to a new implementation of sorting. The primary visible consequences are increased speed in many common cases, and the fact that the new sort is stable (meaning that when two elements compare equal in the original list, they retain their relative order in the sorted list). The new implementation was so successful, and the chances of improving on it appeared so slim, that Guido was persuaded to proclaim that Python's list.sort method will always be stable. This guarantee started with Python 2.4 but was actually realized in Python 2.3. Still, the history of sorting cautions us that better methods may yet be discovered. A brief account of Python's sorting history may be instructive in this regard.
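Before that history, here is a tiny, made-up demonstration of what stability buys you:

# Records with equal keys keep their original relative order.
data = [('NASDAQ', 'MSFT'), ('NYSE', 'GE'), ('NASDAQ', 'AAPL'), ('NYSE', 'IBM')]
data.sort(key=lambda record: record[0])    # key= needs Python 2.4 or later
# -> [('NASDAQ', 'MSFT'), ('NASDAQ', 'AAPL'), ('NYSE', 'GE'), ('NYSE', 'IBM')]
# MSFT still precedes AAPL, and GE still precedes IBM, just as in the input.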

A Short History of Python Sorting

In early releases of Python, list.sort used the qsort routine from the underlying platform's C library. This didn't work out for several reasons, primarily because the quality of qsort varied widely across machines. Some versions were extremely slow when given a list with many equal values or in reverse-sorted order. Some versions even dumped core because they weren't reentrant. A user-defined __cmp__ function can also invoke list.sort, so that one list.sort can invoke others as a side effect of comparing. Some platform qsort routines couldn't handle that. A user-defined __cmp__ function can also (if it's insane or malicious) mutate the list while it's being sorted, and many platform qsort routines dumped core when that happened.

Python then grew its own implementation of the quicksort algorithm. This was rewritten with every release, as real-life cases of unacceptable slowness were discovered. Quicksort is a delicate algorithm indeed!

In Python 1.5.2 the quicksort algorithm was replaced by a hybrid of samplesort and binary insertion sort, and that implementation remained unchanged for more than four years, until Python 2.3. Samplesort can be viewed as a variant of quicksort that uses a very large sample size to pick the partitioning element, also known as the pivot (it recursively samplesorts a large random subset of the elements and picks the median of those). This variant makes quadratic-time behavior almost impossible and brings the number of comparisons in the average case much closer to the theoretical minimum.

However, because samplesort is a complicated algorithm, it has too much administrative overhead for small lists. Therefore, small lists (and small slices resulting from samplesort partitioning) were handled by a separate binary insertion sort, which is an ordinary insertion sort, except that it uses binary search to determine where each new element belongs. Most sorting texts say this isn't worth the bother, but that's because most texts assume that comparing two elements is as cheap as or cheaper than swapping them in memory, which isn't true for Python's sort! Moving an object is very cheap, since what is copied is just a reference to the object. Comparing two objects is expensive, though, because all of the object-oriented machinery for finding the appropriate code to compare two objects and for coercion gets reinvoked each time. This made binary search a major win for Python's sort.
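A rough sketch of the binary insertion idea, using the standard bisect module and building a new list rather than working in place as the real C code did:

import bisect

def binary_insertion_sort(iterable):
    # Binary search places each element with few comparisons; the insertion
    # itself only shifts object references, which is cheap in Python.
    result = []
    for item in iterable:
        bisect.insort_right(result, item)
    return result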

On top of this hybrid approach, a few common special cases were exploited for speed. First, already-sorted or reverse-sorted lists were detected and handled in linear time. For some applications, these kinds of lists are very common. Second, if an array was mostly sorted, with just a few out-of-place elements at the end, the binary insertion sort handled the whole job. This was much faster than letting samplesort have at it and occurred often in applications that repeatedly sort a list, append a few new elements, then sort it again. Finally, special code in the samplesort looked for stretches of equal elements, so that the slice they occupy could be marked as done early.

In the end, all of this yielded an in-place sort with excellent performance in all known real cases and supernaturally good performance in some common special cases. It spanned about 500 lines of complicated C code, which gives special poignancy to Recipe 5.11.

Over the years samplesort was in use, I made a standing offer to buy dinner for anyone who could code a faster Python sort. Alas, I ate alone. Still, I kept my eye on the literature because several aspects of the samplesort hybrid were irritating:

  • While no case of quadratic-time behavior appeared in real life, I knew such cases could be contrived, and it was easy to contrive cases two or three times slower than average ones.

  • The special cases to speed sorting in the presence of extreme partial order were valuable in practice, but my real data often had many other kinds of partial order that should be exploitable. In fact, I came to believe that random ordering in input lists almost never exists in real life (i.e., not outside of timing harnesses for testing sorting algorithms!).

  • There is no practical way to make samplesort stable without grossly increasing memory use.

  • The code was very complex and complicated in ugly ways by the special cases.

Current Sorting

It was always clear that a mergesort would be better on several counts, including guaranteed worst-case n log n time, and that mergesort is easy to make stable. The problem was that half a dozen attempts to code a mergesort for Python yielded a sort that ran slower (mergesort does much more data movement than samplesort) and consumed more memory.

A large and growing literature concerns adaptive sorting algorithms, which attempt to detect order of various kinds in the input. I coded a dozen of them, but they were all much slower than Python's samplesort except on the cases they were designed to exploit. The theoretical bases for these algorithms were simply too complex to yield effective practical algorithms. Then I read an article pointing out that list merging naturally reveals many kinds of partial order, simply by paying attention to how often each input list "wins" in a row. This information was simple and general. When I realized how it could be applied to a natural mergesort, which would obviously exploit all the kinds of partial order I knew and cared about, I got obsessed enough to solve the speed problem for random data and to minimize the memory burden.

The resulting "adaptive, natural, stable" mergesort implemented for Python 2.3 was a major success, but also a major engineering effort: the devil is in the details. There are about 1,200 lines of C code, but unlike the code in the samplesort hybrid, none of these lines are coding for special cases, and about half implement a technical trick allowing the worst-case memory burden to be cut in half. I'm quite proud of it, but the margins of this introduction lack the space for me to explain the details. If you're curious, I wrote a long technical description that you can find in a Python source distribution: Objects/listsort.txt under the main directory (say, Python-2.3.5 or Python-2.4) where you unpacked Python's source distribution archive. In the following list, I provide examples of the partial order Python 2.3's mergesort naturally exploits, where "sorted" means in either forward-sorted or reverse-sorted order:

  • The input is already sorted.

  • The input is mostly sorted but has random elements appended at either end, or both, or inserted in the middle.

  • The input is the concatenation of two or more sorted lists. In fact, the fastest way to merge multiple sorted lists in Python now is to join them into one long list and run list.sort on that (see the sketch after this list).

  • The input is mostly sorted but has some scattered elements that are out of order. This is common, for example, when people manually add new records to a database sorted by name: people aren't good at maintaining strict alphabetic order but are good at getting close.

  • The input has many keys with the same value. For example, when sorting a database of American companies by the stock exchange they're listed on, most will be associated with the NYSE or NASDAQ exchanges. This is exploitable for a curious reason: records with equal keys are already in sorted order, by the definition of "stable"! The algorithm detects that naturally, without code especially looking for equal keys.

  • The input was in sorted order but got dropped on the floor in chunks; the chunks were reassembled in random order, and to fight boredom, some of the chunks were riffle-shuffled together. While that's a silly example, it still results in exploitable partial order and suggests how general the method is.
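
As promised in the bullet about concatenated sorted lists, here is a minimal sketch (with made-up data) of merging several already-sorted lists by joining them and letting the sort detect the runs:

runs = [[1, 4, 9], [2, 3, 11], [5, 6, 7, 8]]   # each list already sorted

merged = []
for run in runs:
    merged.extend(run)   # join into one long list
merged.sort()            # the natural mergesort exploits the sorted runs
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]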

In short, Python 2.3's timsort (well, it has to have some brief name) is stable, robust, and preternaturally fast in many real-life cases: use it any time you can!


