Recipe5.4.Sorting Keys or Indices Basedon the Corresponding Values

Recipe 5.4. Sorting Keys or Indices Basedon the Corresponding Values

Credit: John Jensen, Fred Bremmer, Nick Coghlan

Problem

You need to count the occurrences of various items and present the items in order of their number of occurrencesfor example, to produce a histogram.

Solution

A histogram, apart from graphical issues, is based on counting the occurrences of items (easy to do with a Python list or dictionary) and then sorting the keys or indices in an order based on corresponding values. Here is a subclass of dict that adds two methods for the purpose:

class hist(dict):     def add(self, item, increment=1):         ''' add 'increment' to the entry for 'item' '''         self[item] = increment + self.get(item, 0)     def counts(self, reverse=False):         ''' return list of keys sorted by corresponding values '''         aux = [ (self[k], k) for k in self ]         aux.sort( )         if reverse: aux.reverse( )         return [k for v, k in aux]

If the items you're counting are best modeled by small integers in a compact range, so that you want to keep item counts in a list, the solution is quite similar:

class hist1(list):     def _ _init_ _(self, n):         ''' initialize this list to count occurrences of n distinct items '''         list._ _init_ _(self, n*[0])     def add(self, item, increment=1):         ''' add 'increment' to the entry for 'item' '''         self[item] += increment     def counts(self, reverse=False):         ''' return list of indices sorted by corresponding values '''         aux = [ (v, k) for k, v in enumerate(self) ]         aux.sort( )         if reverse: aux.reverse( )         return [k for v, k in aux]

Discussion

The add method of hist embodies the normal Python idiom for counting occurrences of arbitrary (but hashable) items, using a dict to hold the counts. In class hist1, based on a list, we take a different approach, initializing all counts to 0 in _ _init_ _, so the add method is even simpler.

The counts methods produce the lists of keys, or indices, sorted in the order given by the corresponding values. The problem is very similar in both classes, hist and hist1; therefore, the solutions are also almost identical, using in each case the DSU approach already shown in Recipe 5.2 and Recipe 5.3. If we need both classes in our program, the similarity is so close that we should surely factor out the commonalities into a single auxiliary function _sorted_keys:

def _sorted_keys(container, keys, reverse):     ''' return list of 'keys' sorted by corresponding values in 'container' '''     aux = [ (container[k], k) for k in keys ]     aux.sort( )     if reverse: aux.reverse( )     return [k for v, k in aux]

and then implement the counts methods of each class as thin wrappers over this _sorted_keys function:

class hist(dict):     ...     def counts(self, reverse=False):         return _sorted_keys(self, self, reverse) class hist1(list):     ...     def counts(self, reverse=False):         return _sorted_keys(self, xrange(len(self)), reverse)

DSU is so important that in Python 2.4, as shown previously in Recipe 5.2 and Recipe 5.3, the sort method of lists and the new built-in function sorted offer a fast, intrinsic implementation of it. Therefore, in Python 2.4, function _sorted_keys can become much simpler and faster:

def _sorted_keys(container, keys, reverse):     return sorted(keys, key=container._ _getitem_ _, reverse=reverse)`

The bound-method container._ _getitem_ _ performs exactly the same operation as the indexing container[k] in the Python 2.3 implementation, but it's a callable to call on each k of the sequence that we're sorting, namely keysexactly the right kind of value to pass as the key keyword argument to the sorted built-in function. Python 2.4 also affords an easy, direct way to get a list of a dictionary's items sorted by value:

from operator import itemgetter def dict_items_sorted_by_value(d, reverse=False):     return sorted(d.iteritems( ), key=itemgetter(1), reverse=reverse)

The operator.itemgetter higher-order function, also new in Python 2.4, is a handy way to supply the key argument when you want to sort a container whose items are subcontainers, keying on a certain item of each subcontainer. This is exactly the case here, since a dictionary's items are a sequence of pairs (two-item tuples), and we want to sort the sequence keying on the second item of each tuple.

Getting back to this recipe's main theme, here is a usage example for the class hist shown in this recipe's Solution:

sentence = ''' Hello there this is a test.  Hello there this was a test,            but now it is not. ''' words = sentence.split( ) c = hist( ) for word in words: c.add(word) print "Ascending count:" print c.counts( ) print "Descending count:" print c.counts(reverse=True)

This code snippet produces the following output:

Ascending count: [(1, 'but'), (1, 'it'), (1, 'not.'), (1, 'now'), (1, 'test,'), (1, 'test.'), (1, 'was'), (2, 'Hello'), (2, 'a'), (2, 'is'), (2, 'there'), (2, 'this')] Descending count: [(2, 'this'), (2, 'there'), (2, 'is'), (2, 'a'), (2, 'Hello'), (1, 'was'), (1, 'test.'), (1, 'test,'), (1, 'now'), (1, 'not.'), (1, 'it'), (1, 'but')]

Recipe5.4.Sorting Keys or Indices Basedon the Corresponding Values