Recipe18.8.Implementing a Bag (Multiset) Collection Type


Recipe 18.8. Implementing a Bag (Multiset) Collection Type

Credit: Raymond Hettinger, Alex Martelli, Matt R

Problem

You need a set-like collection type that lets each element be in the set a number of times. In other words, you need a collection type of the kind that is known as multiset in C++ and SQL, and bag in Smalltalk, Objective C, and Haskell's Edison module.

Solution

We can implement bag as a class. We could restrict the implementation to language constructs that are present in Python 2.3 or are easy to emulate; however, such restrictions would give substantial inefficiencies or complications with comparison to a pure Python 2.4 implementation. So, here is a Python 2.4 implementation, with no attempt to support Python 2.3:

from operator import itemgetter from heapq import nlargest class bag(object):     def _ _init_ _(self, iterable=( )):         # start empty, then add the `iterable' if any         self._data = {  }         self.update(iterable)     def update(self, iterable):         # update from an element->count mapping, or from any iterable         if isinstance(iterable, dict):             for elem, n in iterable.iteritems( ):                 self[elem] += n         else:             for elem in iterable:                 self[elem] += 1      def _ _contains_ _(self, elem):         # delegate membership test         return elem in self._data     def _ _getitem_ _(self, elem):         # default all missing items to a count of 0         return self._data.get(elem, 0)     def _ _setitem_ _(self, elem, n):         # setting an item to a count of 0 removes the item         self._data[elem] = n         if n == 0:             del self._data[elem]     def _ _delitem_ _(self, elem):         # delegate to _ _setitem_ _ to allow deleting missing items         self[elem] = 0     def _ _len_ _(self):         # length is computed on-the-fly         return sum(self._data.itervalues( ))     def _ _nonzero_ _(self):         # avoid truth tests using _ _len_ _, as it's relatively slow         return bool(self._data)     def _ _eq_ _(self, other):         # a bag can only equal another bag         if not isinstance(other, bag):             return False         return self._data == other._data     def _ _ne_ _(self, other):         # a bag always differs from any non-bag         return not (self == other)     def _ _hash_ _(self):         # a bag can't be a dict key nor an element in a set         raise TypeError     def _ _repr_ _(self):         # typical string-representation         return '%s(%r)' % (self._ _class_ _._ _name_ _, self._data)     def copy(self):         # make and return a shallow copy         return self._ _class_ _(self._data)     _ _copy_ _ = copy # For the copy module     def clear(self):         # remove all items         self._data.clear( )     def _ _iter_ _(self):         # yield each element the # of times given by its count         for elem, cnt in self._data.iteritems( ):             for i in xrange(cnt):                 yield elem     def iterunique(self):         # yield each element only once         return self._data.iterkeys( )     def itercounts(self):         # yield element-count pairs         return self._data.iteritems( )          def mostcommon(self, n=None):         # return the n (default: all) most common elements, each as an         # element-count pair, as a list sorted by descending counts         if n is None:             return sorted(self.itercounts( ), key=itemgetter(1), reverse=True)         it = enumerate(self.itercounts( ))         nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))         return [(elem, cnt) for cnt, i, elem in nl]

Discussion

Python offers several useful container classes, including built-in tuples, lists and dicts, sets (in Python 2.4, sets are built-in; in Python 2.3, they're in module sets)which, unlike bags, can be seen as "holding only one instance" of each of their elementsand double-ended queues, deques (in Python 2.4 only, in module collections). This abundance of container classes doesn't mean there is no use for yet more. The bag, (i.e., multiset), presented in this recipe, is widely useful, since counting the numbers of occurrences of different objects is a frequent task useful in many applications. Rather than coding a bag each time you need such a container (generally based on a dictionary mapping items to counts), it's better to design and code it once, put it among one's utilities, and lobby for it to become part of the standard library for a future Python, such as 2.5 (which can be expected sometime in 2006 and will focus on enhancements to the standard library rather than to the core language).

The API offered by the bag class presented in this recipe is largely based on indexing, due to the strong analogy between a bag and a mapping of items to counts. For example:

>>> b = bag('banana') >>> b['a'] 3 >>> b['a'] += 1 >>> b['a'] 4 >>> del b['a']          # removes all 'a's from the bag >>> b['a'] 0

Items that are not in the bag can also be used as indices, giving a value (i.e., count) of 0; a lot of bag's convenience comes from this default. A bag also offers several ways to iterate on it (by unique elements; by elements, each repeated as many times as its count; by (element, count) pairs); and also supplies a handy method mostcommon to return (element, count) pairs sorted by descending count (all such pairs, or just the top n). An example use of mostcommon:

>>> bag(word for line in open('somefile.txt') ...     for word in line.split( )).mostcommon(5) [('to', 83), ('for', 71), ('the', 61), ('of', 53), ('and', 52)]

All design choices are tradeoffs. For some applications, it might be more convenient to have bag's API closer to set's rather than to dict's, with an add method, and binary operators, for example, to join two bags returning a new one (as list does with operator + and set does with the "or", vertical-bar operator |). In most cases, this would be overkill. After all, "a designer knows he has achieved perfection, not when there is nothing left to add, but when there is nothing left to take away" (Antoine de Saint-Exupéry). So, for example, to join two bags, getting a new one, without altering either input bag, code a little function using roughly the same update-based approach you would use with dicts, as follows:

def bagjoin(b1, b2):     b = bag(b1)     b.update(b2)     return b

Just as would be the case for an analogous function joining dicts, this works, not only when b1 and b2 are bags, but also when they are other kinds of objects that can be passed to bag and bag.updateobjects such as arbitrary iterables or mappings (generally dictionaries) from items to counts. Such polymorphism comes at negligible cost, and it's well worth preserving.

Although the crucial design choices in this recipe are those about bag's API, some implementation choices must be made as well. In this recipe's code, implementation choices are oriented towards simplicity. In particular, there is no attempt to allow this code to run on Python 2.3. This recipe is optimized for Python 2.4 because it is Python's current release and is likely to be used soon in lieu of Python 2.3, particularly among Pythonistas who are sensitive to performance issues, given the amount of highly successful effort that was devoted to optimizing version 2.4's performance. If Python 2.3 support was deemed necessary, it would be best implemented separately, rather than hobbling the primary 2.4 implementation with inefficiencies or complications.

See Also

Smalltalk's Bag class at http://www.gnu.org/software/smalltalk/gst-manual/gst_49.html; C++'s std::multiset template class at http://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/classstd_1_1multiset.html.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net