Recipe18.1.Removing Duplicates from a Sequence


Recipe 18.1. Removing Duplicates from a Sequence

Credit: Tim Peters

Problem

You have a sequence that may include duplicates, and you want to remove the duplicates in the fastest possible way, without knowing much about the properties of the items in the sequence. You do not care about the "or"der of items in the resulting sequence.

Solution

The key is to try several approaches, fastest first, and use try/except to handle the failing cases of the faster approaches by falling back to slower approaches. Here's a function that implements exactly this strategy:

# support 2.3 as well as 2.4 try: set except NameError: from sets import Set as set def unique(s):     """ Return a list of the elements in s in arbitrary order, but without         duplicates. """     # Try using a set first, because it's the fastest and will usually work     try:         return list(set(s))      except TypeError:         pass  # Move on to the next method     # Since you can't hash all elements, try sorting, to bring equal items     # together and then weed them out in a single pass     t = list(s)     try:         t.sort( )     except TypeError:         del t  # Move on to the next method     else:         # the sort worked, so we're fine -- do the weeding         return [x for i, x in enumerate(t) if not i or x != t[i-1]]     # Brute force is all that's left     u = [  ]     for x in s:         if x not in u:             u.append(x)     return u

Discussion

The purpose of this recipe's unique function is to take a sequence s as an argument and return a list of the items in s in arbitrary order, but without duplicates. For example, calling unique([1, 2, 3, 1, 2, 3]) returns an arbitrary permutation of [1, 2, 3], calling unique('abcabc') returns an arbitrary permutation of ['a', 'b', 'c'], and calling unique(([1, 2], [2, 3], [1, 2])) returns an arbitrary permutation of [[2, 3], [1, 2]].

The fastest way to remove duplicates from a sequence depends on fairly subtle properties of the sequence elements, such as whether they're hashable and whether they support full comparisons. The unique function shown in this recipe tries three methods, from fastest to slowest, letting runtime exceptions pick the best method for the sequence at hand.

For fastest speed, all sequence elements must be hashable. When they are, the unique function will usually work in linear time (i.e., O(n), or directly proportional to the number of elements in the input, which is good and highly scalable performance behavior).

If it turns out that hashing the elements (e.g., using them as dictionary keys, or, as in this case, set elements) is not possible, the next best situation is when the elements enjoy a total ordering, meaning that each element can be compared to each other element with the < operator. If list(s).sort() doesn't raise a TypeError, we can assume that s' elements can be sorted and therefore enjoy a total ordering. Then unique will usually work in O(n log(n)) time. Python lists' sort method is particularly efficient in the presence of partially ordered data (including, e.g., data with many duplicates), so the sorting approach may be more effective in Python than elsewhere.

If sorting also turns out to be impossible, the sequence items must at least support equality testing, or else the very concept of duplicates can't really be meaningful for them. In this case, unique works in quadratic timethat is, O(n2), meaning time proportional to the square of the number of elements in the input: not very scalable, but the least of all evils, given the sequence items' obviously peculiar nature (assuming we get all the way to this subcase).

This recipe is a pure example of how algorithm efficiency depends on the strength of the assumptions you can make about the data. You could split this recipe's function into three distinct functions and directly call the one that best meets your needs. In practice, however, the brute-force method is so slow for large sequences that nothing measurable is lost by simply letting the function as written try the faster methods first.

If you need to preserve the same order of items in the output sequence as in the input sequence, see Recipe 18.2.

See Also

Recipe 18.2.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net