Recipe18.4.Generating Random Samples Without Replacement


Recipe 18.4. Generating Random Samples Without Replacement

Credit: Raymond Hettinger

Problem

You need to generate random samples without replacement out of a "population" (the integers between included and some n excluded), and you want better memory consumption characteristics than random.sample provides.

Solution

A generator for this purpose requires only constant memory and makes a small number of calls to random.random:

import random def sample(n, r):     " Generate r randomly chosen, sorted integers from [0,n) "     rand = random.random     pop = n     for samp in xrange(r, 0, -1):         cumprob = 1.0         x = rand( )         while x < cumprob:             cumprob -= cumprob * samp / pop             pop -= 1         yield n-pop-1

Discussion

random.sample(xrange(10), 3) produces output statistically equal to list(sample(10, 3)) using this recipe's sample. Differently from random.sample(xrange(n), r), this recipe's sample(n, r) requires a bounded amount of memory (which does not grow with either r or n) and is guaranteed to make only r calls to random.random. Moreover, this recipe's sample yields the r numbers of the sample in sorted order, while random.sample returns them in random orderwhich may be insignificant or a crucial advantage one way or the other, depending on your application's needs. A definite advantage of random.sample is that its running time is O(r), while this recipe's sample function's running time is O(n).

This recipe was inspired by Knuth's Algorithm S in Donald E. Knuth, The Art of Computer Programming, Volume 3, Seminumerical Algorithms, in section 3.4.2. However, this recipe has one improvement over Knuth's algorithm: by tracking a cumulative probability for each selection, this recipe eliminates the need to make n calls to random.random.

A potential major improvement would be to find a direct formula giving the same result as the inner loop: given x, samp, and pop, compute the index of the first sample. Finding this formula would reduce the running time to O(r).

See Also

Library Reference and Python in a Nutshell docs about random.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net