Recipe 5.6. Processing All of a List's Items in Random Order

Credit: Iuri Wickert, Duncan Grisby, T. Warner, Steve Holden, Alex Martelli

Problem

You need to process, in random order, all of the items of a long list.

Solution

As usual in Python, the best approach is the simplest one. If we are allowed to change the order of items in the input list, then the following function is simplest and fastest:

import random

def process_all_in_random_order(data, process):
    # first, put the whole list into random order
    random.shuffle(data)
    # next, just walk over the list linearly
    for elem in data: process(elem)

If we must preserve the input list intact, or if the input data may be some iterable that is not a list, just insert the assignment data = list(data) as the first statement of the function.
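For instance, a non-destructive variant along those lines might look like the following (a minimal sketch; the function name is illustrative, and the copy costs O(n) extra memory):

import random

def process_all_in_random_order_copy(data, process):
    # work on a shallow copy, so the caller's sequence is untouched
    # and any iterable, not just a list, is accepted
    data = list(data)
    random.shuffle(data)
    for elem in data: process(elem)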

Discussion

While it's a common mistake to be overly concerned with speed, don't make the opposite mistake of ignoring the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition (assume that we're allowed to mangle or destroy the input list). The first idea to suggest itself might be to repeatedly pick an item at random (with function random.choice), removing each picked item from the list to avoid future repetitions:

import random

def process_random_removing(data, process):
    while data:
        elem = random.choice(data)
        data.remove(elem)
        process(elem)

However, this function is painfully slow, even for input lists of just a few hundred elements. Each call to data.remove must linearly search through the list to find the element to delete. Since the cost of each of n steps is O(n), the whole process takes O(n²) time, that is, time proportional to the square of the length of the list (and with a large multiplicative constant, too).

Minor improvements to this first idea could focus on obtaining random indices, using the pop method of the list to get and remove an item at the same time, low-level fiddling with indices to avoid the costly removal in favor of swapping each picked item with the last yet-unpicked one towards the end (sketched after the popitem example below), or using dictionaries or sets instead of lists. This last idea might be based on a hope of using a dict's popitem method (or the equivalent pop method of class sets.Set and of Python 2.4's built-in type set), which may look like it's designed exactly to pick and remove a random item. But beware! dict.popitem is documented to return and remove an arbitrary item of the dictionary, and that's a far cry from a random item. Check it out:

>>> d = dict(enumerate('ciao'))
>>> while d: print d.popitem()

It may surprise you, but in most Python implementations this snippet will print d's items in a far from random order, typically (0, 'c') then (1, 'i') and so forth. In short, if you need pseudo-random behavior in Python, you need standard library module random; popitem is not an alternative.
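Incidentally, the index-swapping improvement mentioned earlier amounts to hand-rolling the Fisher-Yates shuffle, the very algorithm that random.shuffle itself uses. Here is a minimal sketch, for illustration only (the function name is hypothetical):

import random

def process_swapping(data, process):
    # keep the yet-unprocessed items in data[:n]: pick one at random,
    # swap it into slot n-1, shrink the unprocessed region, process it
    n = len(data)
    while n:
        i = random.randrange(n)
        n -= 1
        data[i], data[n] = data[n], data[i]
        process(data[n])

This runs in O(n) time overall, but there is no point in hand-coding it when the standard library already offers random.shuffle.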

If you thought about using a dictionary rather than a list, you are definitely on your way to "thinking Pythonically", even though it turns out that dictionaries wouldn't provide a substantial performance boost for this specific problem. However, an approach that is even more Pythonic than choosing the right data structure is best summarized as: let the standard library do it! The Python Standard Library is large, rich, and chock-full of useful, robust, fast functions and classes for a wide variety of tasks. In this case, the key intuition is realizing that, to walk over a sequence in random order, the simplest approach is to first put that sequence into random order (known as shuffling the sequence, by analogy with shuffling a deck of cards) and then walk over the shuffled sequence linearly. Function random.shuffle performs the shuffling, and the function shown in this recipe's Solution just uses it.
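One small pitfall is worth noting: random.shuffle shuffles its argument in place and returns None, so never write data = random.shuffle(data). For example (the resulting order will, of course, vary from run to run):

>>> import random
>>> cards = list('AKQJ')
>>> random.shuffle(cards)    # shuffles in place, returns None
>>> cards
['Q', 'A', 'J', 'K']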

Performance should always be measured, never guessed at, and that's what standard library module timeit is for. Using a null process function and a list of length 1,000 as data, process_all_in_random_order is almost 10 times faster than process_random_removing; with a list of length 2,000, the performance ratio grows to almost 20. While an improvement of, say, 25%, or even a constant factor of 2, usually can be neglected without really affecting the performance of your program as a whole, the same does not apply to an algorithm that is 10 or 20 times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself. Moreover, this risk increases when we're talking about O(n²) versus O(n) behavior: with such differences in big-O behavior, the performance ratio between bad and good algorithms keeps increasing without bounds as the size of the input data grows.
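A measurement along these lines can be set up as follows (a minimal sketch: the module name shuffling and the null_process helper are illustrative assumptions, not part of the recipe):

import timeit

setup = '''
import random
from shuffling import process_all_in_random_order, process_random_removing
def null_process(elem): pass
data = range(1000)
'''
# pass a fresh copy to each call, since both functions mangle their argument
fast = timeit.Timer('process_all_in_random_order(list(data), null_process)', setup)
slow = timeit.Timer('process_random_removing(list(data), null_process)', setup)
print fast.timeit(number=10), slow.timeit(number=10)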

See Also

The documentation for the random and timeit modules in the Library Reference and Python in a Nutshell.


