Recipe 18.2. Removing Duplicates from a Sequence While Maintaining Sequence Order

Credit: Alex Martelli

Problem

You have a sequence that may include duplicates, and you want to remove the duplicates in the fastest possible way. Moreover, the output sequence must respect the item ordering of the input sequence.

Solution

The need to respect the item ordering of the input sequence means that picking unique items becomes a problem quite different from that explored previously in Recipe 18.1. This requirement often arises in conjunction with a function f that defines an equivalence relation among items: x is equivalent to y if and only if f(x)==f(y). In this case, the task of removing duplicates may often be better described as picking the first representative of each resulting equivalence class. Here is a function to perform this task:

# support 2.3 as well as 2.4
try:
    set
except NameError:
    from sets import Set as set
# f defines an equivalence relation among items of sequence seq, and
# f(x) must be hashable for each item x of seq
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:    # f's default is the identity function f(x) -> x
        def f(x): return x
    already_seen = set()
    result = []
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen.add(marker)
            result.append(item)
    return result

Discussion

The previous Recipe 18.1 is applicable only if you are not concerned about item ordering or, in other words, if the sequences involved are meaningful only as the sets of their items, which is often the case. When sequential order is significant, a different approach is needed.

If the items are hashable, it's not hard to maintain sequence order, keeping only the first occurrence of each value. More generally, we may want uniqueness within equivalence classes, as shown in this recipe's Solution: the uniquer function accepts as an argument a function f that must return hashable objects, such that f(x)==f(y) if and only if items x and y are equivalent. Identity (in the mathematical sense, not in the Python sense) is used as the default when no argument f is supplied. The added generality of allowing an f different from the identity function costs no added complication whatsoever.
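A quick usage sketch may help. Here uniquer is restated in modern Python syntax, with str.lower as one possible equivalence function (the sample inputs are illustrative, not from the recipe):

```python
def uniquer(seq, f=None):
    """Keep the earliest occurring item of each f-defined equivalence class."""
    if f is None:
        f = lambda x: x          # identity: plain duplicate removal
    already_seen = set()
    result = []
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen.add(marker)
            result.append(item)
    return result

# plain duplicates: the first occurrence of each value wins
print(uniquer([1, 2, 1, 3, 2]))                          # [1, 2, 3]
# equivalence by case-insensitive comparison
print(uniquer(['Ada', 'ada', 'Bob', 'ADA'], str.lower))  # ['Ada', 'Bob']
```

Passing str.lower makes 'Ada', 'ada', and 'ADA' one equivalence class, of which the earliest member is kept.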

If you need to keep the last occurrence, rather than the earliest occurrence, of an item in each equivalence class, the simplest approach is to reverse the input sequence (or, rather, a copy thereof into a local list, since the input might be immutable or at any rate not support reversing), then, after processing with uniquer, reverse the resulting list:

def uniquer_last(seq, f=None):
    seq = list(seq)
    seq.reverse()
    result = uniquer(seq, f)
    result.reverse()
    return result

In Python 2.4, instead of the first three statements of this version of uniquer_last, you could use the single statement:

    result = uniquer(reversed(seq), f)

exploiting the new built-in reversed. However, this Python 2.4 version, while marginally faster, is less general, because it does require seq to be really a sequence, while the previously shown version (and the uniquer function in the "Solution") work with any iterable seq. For example:
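To make the generality difference concrete, here is a sketch (in modern Python syntax, with uniquer restated so the snippet is self-contained) showing that the list-and-reverse version accepts a generator, while the built-in reversed rejects one:

```python
def uniquer(seq, f=None):
    """Keep the earliest item of each f-defined equivalence class."""
    if f is None:
        f = lambda x: x
    seen, result = set(), []
    for item in seq:
        marker = f(item)
        if marker not in seen:
            seen.add(marker)
            result.append(item)
    return result

def uniquer_last(seq, f=None):
    """Keep the LAST occurrence instead, accepting any iterable."""
    seq = list(seq)       # copy into a list: works even for generators
    seq.reverse()
    result = uniquer(seq, f)
    result.reverse()
    return result

gen = (n % 4 for n in range(10))   # a generator, not a sequence
print(uniquer_last(gen))           # [2, 3, 0, 1]: last occurrences, in order

try:
    reversed(n % 4 for n in range(10))
except TypeError:
    print('reversed() rejects a generator')
```

reversed needs a real sequence (something indexable with a known length), which a generator is not, so only the copy-into-a-list approach handles arbitrary iterables.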

    somelines = uniquer_last(open('my.txt'), str.lower)

binds name somelines to the list of unique lines from text file my.txt, considering two lines equivalent if they're equal aside from uppercase and lowercase distinctions, picking the last occurring one from each set of equivalent lines, and preserving the order of the lines in the file (phew). If you used Python 2.4's built-in reversed, this latest snippet would not work, due to reversed's prerequisites.

If you must deal with nonhashable items, the simplest fallback approach is to use a set-like container that supports the add method and membership testing without requiring items to be hashable. Unfortunately, performance will be much worse than with a real set. Here's the simplest fallback implementation, which demands of items nothing but support for equality testing:

def uniquer_with_simplest_fallback(seq, f=None):
    if f is None:
        def f(x): return x
    already_seen = set()
    result = []
    for item in seq:
        marker = f(item)
        try:
            new_marker = marker not in already_seen
        except TypeError:
            class TotallyFakeSet(list):
                add = list.append
            already_seen = TotallyFakeSet(already_seen)
            new_marker = marker not in already_seen
        if new_marker:
            already_seen.add(marker)
            result.append(item)
    return result

A more refined approach would be to use two levels of fallback, the intermediate one based on sorting, as shown previously in Recipe 18.1: membership testing in a sorted list can be performed efficiently by using the Python Standard Library module bisect.
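The recipe does not spell out that intermediate level, but a sketch of it might look like this (assuming the markers support ordering comparisons even though they are not hashable):

```python
import bisect

def uniquer_sorted_fallback(seq, f=None):
    """Keep the earliest item of each f-defined equivalence class,
    assuming markers are orderable (comparable) but not hashable."""
    if f is None:
        f = lambda x: x
    sorted_markers = []          # kept sorted at all times
    result = []
    for item in seq:
        marker = f(item)
        # O(log n) search for the insertion point of marker
        i = bisect.bisect_left(sorted_markers, marker)
        if i == len(sorted_markers) or sorted_markers[i] != marker:
            sorted_markers.insert(i, marker)   # O(n) insert, alas
            result.append(item)
    return result

# lists are not hashable, but they do support ordering comparisons
lol = [[1, 2], [], [1, 2], [3], [], [3, 4]]
print(uniquer_sorted_fallback(lol))   # [[1, 2], [], [3], [3, 4]]
```

Each membership test is O(log n), much better than the O(n) scans of the TotallyFakeSet fallback, though the insert into the sorted list is still linear in the worst case.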

However, remember that you can often use an f that gives you hashable markers for nonhashable items. The built-in function repr can often be useful for this purpose. For example:

lol = [[1, 2], [], [1, 2], [3], [], [3, 4], [1, 2], [], [2, 1]]
print uniquer(lol, repr)
# emits: [[1, 2], [], [3], [3, 4], [2, 1]]

While the items of lol are lists, and thus are not hashable, the built-in function repr produces a representation of each item as a string, which is hashable. This enables use of the fast function uniquer. Unfortunately, repr is not useful for nonhashable items of other types, including dict and set. Because of the workings of hash-collision resolution, it's quite possible to have d1==d2 and yet repr(d1)!=repr(d2) for two dictionaries d1 and d2, depending on the exact sequences of adds that built each dict. Still, you may be able to build your own repr-like function to work around these issues, depending on the exact nature of your data. Whether repr can help for instances of a certain user-defined type depends on how accurately and usefully that specific type defines the special method __repr__, which repr calls.
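For dictionaries specifically, one such repr-like workaround (an illustration, not part of the recipe, and assuming the keys support ordering comparisons) is to serialize the items in sorted-key order, so that equal dicts always yield equal markers:

```python
def dict_marker(d):
    """Canonical, hashable marker for a dict: repr of its items
    sorted by key. Assumes the keys are orderable."""
    return repr(sorted(d.items()))

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}     # equal dict, built in a different order
print(dict_marker(d1) == dict_marker(d2))   # True
```

Such a function can then be passed as the f argument of uniquer to remove duplicate dictionaries from a sequence while preserving order.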

The task of picking one representative item, out of all of those belonging to each equivalence class, can be generalized. Instead of the simple ideas of implicitly picking the first such item, or the last such item, we can choose among multiple items in the same equivalence class via an arbitrary picking function p that considers both the actual items and their indexes of occurrence. As long as function p can operate pairwise, the key idea is just to keep a dictionary that maps the marker of each equivalence class to the index and item of the representative so far picked for that class. At the end, we reconstruct sequence order by sorting on the indices:

def fancy_unique(seq, f, p):
    """ Keeps "best" item of each f-defined equivalence class, with
        picking function p doing pairwise choice of (index, item) """
    representative = {}
    for index, item in enumerate(seq):
        marker = f(item)
        if marker in representative:
            # It's NOT a problem to rebind index and item within the
            # for loop: the next leg of the loop does not use their binding
            index, item = p((index, item), representative[marker])
        representative[marker] = index, item
    # reconstruct sequence order by sorting on indices
    auxlist = representative.values()
    auxlist.sort()
    return [item for index, item in auxlist]

It's possible that the picking function cannot operate pairwise, but rather must be presented with the whole bunch of (index, item) pairs for each equivalence class in order to pick the best representative of that class (e.g., it may have to get the median of the items in each class as being the best representative of that class). Then we need one pass over the sequence to collect the bunches, followed by a pass over the bunches, to pick the representatives:

def fancier_uniquer(seq, f, p):
    """ Keeps "best" item of each f-defined equivalence class, with
        picking function p choosing appropriate (index, item) for each
        equivalence class from the list of all (index, item) pairs in
        that class """
    bunches = {}
    for index, item in enumerate(seq):
        marker = f(item)
        bunches.setdefault(marker, []).append((index, item))
    auxlist = [p(candidates) for candidates in bunches.values()]
    auxlist.sort()
    return [item for index, item in auxlist]
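Here is one way the median case might look (a sketch, with fancier_uniquer restated in modern Python syntax, where values() is a view rather than a list; the parity grouping and sample numbers are illustrative assumptions):

```python
def fancier_uniquer(seq, f, p):
    """Keep the "best" item of each f-defined equivalence class,
    with p picking one (index, item) pair from each class's list."""
    bunches = {}
    for index, item in enumerate(seq):
        bunches.setdefault(f(item), []).append((index, item))
    # in modern Python, sorted() replaces the list-then-sort idiom
    auxlist = sorted(p(candidates) for candidates in bunches.values())
    return [item for index, item in auxlist]

def median_pair(candidates):
    """Pick the (index, item) pair whose item is the (upper) median."""
    by_value = sorted(candidates, key=lambda pair: pair[1])
    return by_value[len(by_value) // 2]

# group numbers by parity, keep the median of each class
nums = [7, 2, 3, 8, 5, 4, 1]
print(fancier_uniquer(nums, lambda n: n % 2, median_pair))   # [5, 4]
```

The odd class {7, 3, 5, 1} yields median 5 and the even class {2, 8, 4} yields median 4; 5 first occurred before 4 in nums, so sorting the surviving pairs by index emits [5, 4].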

These fancy approaches that rely on a picking function are useful only for substantial equivalence functions, not for identity, so I removed f's default value from these versions.

An example of use for fancy_unique may help. Say we're given a list of words, and we need to get a sublist from it, respecting order, such that no two words on the sublist begin with the same letter. Out of all the words in the original list that begin with each given letter, we need to keep the longest word and, in case of equal lengths, the word appearing later on the list. This sounds complicated, but with fancy_unique to help us, it's really not that bad:

def complicated_choice(words):
    def first_letter(aword):
        return aword[0].lower()
    def prefer((indx1, word1), (indx2, word2)):
        if len(word2) > len(word1):
            return indx2, word2
        return indx1, word1
    return fancy_unique(words, first_letter, prefer)

The prefer nested function within complicated_choice is simplified because it knows fancy_unique always calls it with indx2<indx1. So, the older indx2, word2 pair must be returned only when word2 is longer than word1; otherwise, indx1, word1 is always the proper result. The automatic tuple unpacking in prefer's signature is debatable, stylewise, but I like it (it reminds me of SML or Haskell).
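A self-contained run of complicated_choice may make this concrete. Since tuple-unpacking parameters are Python 2 only, prefer is restated here with explicit unpacking, and fancy_unique with sorted(); the word list is an illustrative assumption:

```python
def fancy_unique(seq, f, p):
    """Keep the "best" item of each f-defined equivalence class,
    with p doing pairwise choice of (index, item) pairs."""
    representative = {}
    for index, item in enumerate(seq):
        marker = f(item)
        if marker in representative:
            index, item = p((index, item), representative[marker])
        representative[marker] = index, item
    # reconstruct sequence order by sorting on indices
    return [item for index, item in sorted(representative.values())]

def complicated_choice(words):
    def first_letter(aword):
        return aword[0].lower()
    def prefer(pair1, pair2):
        # pair2 is always the earlier-seen (index, word) pair
        (indx1, word1), (indx2, word2) = pair1, pair2
        if len(word2) > len(word1):
            return indx2, word2
        return indx1, word1
    return fancy_unique(words, first_letter, prefer)

words = ['and', 'aardvark', 'ant', 'bee', 'bison', 'cat']
print(complicated_choice(words))   # ['aardvark', 'bison', 'cat']
```

'aardvark' beats the shorter a-words, 'bison' beats 'bee', and 'cat' stands alone; the survivors come out in the order their first-letter classes' winners appeared.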

Out of all the general programming techniques presented in the various functions of this recipe, the idea of writing higher-order functions, which organize a computation and appropriately call back to functions that they receive as arguments, is easily the most precious and widely applicable concept. This idea is well worth keeping in mind in several circumstances, not just for old Haskell-heads, because it works just as well in Python.

See Also

Recipe 18.1.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
