20.3. Implementing Sets

Another commonly used data structure is the set, a collection of objects that support operations such as:


Intersection

Make a new set with all items in common.


Union

Make a new set with all items in either operand.


Membership

Test whether an item exists in a set.

And there are others, depending on the intended use. Sets come in handy for dealing with more abstract group combinations. For instance, given a set of engineers and a set of writers, you can pick out individuals who do both activities by intersecting the two sets. A union of such sets would contain either type of individual, but would include any given individual only once.

Python lists, tuples, and strings come close to the notion of a set: the in operator tests membership, for loops iterate over items, and so on. Here, we add operations not directly supported by Python sequences. The idea is that we're extending built-in types to meet more specialized requirements.
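For example, membership and iteration already work on any built-in sequence, but there is no intersection or union operator for lists; the quick interactive check below (my own illustration, not one of the book's examples) shows the gap the following tools fill:

 >>> 'A' in 'SPAM'                      # membership is built in
 True
 >>> [c for c in 'SPAM']                # so is iteration
 ['S', 'P', 'A', 'M']
 >>> [1, 2, 3] & [2, 3, 4]              # but there is no sequence intersection
 Traceback (most recent call last):
   ...
 TypeError: unsupported operand type(s) for &: 'list' and 'list'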

20.3.1. Set Functions

As before, let's first start out with a function-based set manager. But this time, instead of managing a shared set object in a module, let's define functions to implement set operations on passed-in Python sequences (see Example 20-8).

Example 20-8. PP3E\Dstruct\Basic\inter.py

def intersect(seq1, seq2):
    res = []                          # start with an empty list
    for x in seq1:                    # scan the first sequence
        if x in seq2:
            res.append(x)             # add common items to the end
    return res

def union(seq1, seq2):
    res = list(seq1)                  # make a copy of seq1
    for x in seq2:                    # add new items in seq2
        if not x in res:
            res.append(x)
    return res

These functions work on any type of sequence: lists, strings, tuples, and other objects that conform to the sequence protocols expected by these functions (for loops, in membership tests). In fact, we can even use them on mixed object types: the last two commands in the following code compute the intersection and union of a list and a tuple. As usual in Python, the object interface is what matters, not the specific types:

C:\...\PP3E\Dstruct\Basic>python
>>> from inter import *
>>> s1 = "SPAM"
>>> s2 = "SCAM"
>>> intersect(s1, s2), union(s1, s2)
(['S', 'A', 'M'], ['S', 'P', 'A', 'M', 'C'])
>>> intersect([1,2,3], (1,4))
[1]
>>> union([1,2,3], (1,4))
[1, 2, 3, 4]

Notice that the result is always a list here, regardless of the type of sequences passed in. We could work around this by converting types or by using a class to sidestep this issue (and we will in a moment). But type conversions aren't clear-cut if the operands are mixed-type sequences. Which type do we convert to?

20.3.1.1. Supporting multiple operands

If we're going to use the intersect and union functions as general tools, one useful extension is support for multiple arguments (i.e., more than two). The functions in Example 20-9 use Python's variable-length argument lists feature to compute the intersection and union of arbitrarily many operands.

Example 20-9. PP3E\Dstruct\Basic\inter2.py

def intersect(*args):
    res = []
    for x in args[0]:                  # scan the first list
        for other in args[1:]:         # for all other arguments
            if x not in other: break   # this item in each one?
        else:
            res.append(x)              # add common items to the end
    return res

def union(*args):
    res = []
    for seq in args:                   # for all sequence-arguments
        for x in seq:                  # for all nodes in argument
            if not x in res:
                res.append(x)          # add new items to result
    return res
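Notice the else clause on the inner for loop in intersect: in Python, a loop's else block runs only if the loop completes without hitting a break, so an item is appended only when it was found in every other operand. The small interactive sketch below (not part of the book's examples) isolates just that pattern:

 >>> def found_in_all(item, others):
 ...     for other in others:              # same shape as intersect's inner loop
 ...         if item not in other:
 ...             break                     # missing somewhere: skip the else
 ...     else:
 ...         return True                   # loop never broke: item is common
 ...     return False
 ...
 >>> found_in_all('A', ['SPAM', 'SLAM', 'HAM'])
 True
 >>> found_in_all('P', ['SPAM', 'SLAM', 'HAM'])
 False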

The multioperand functions work on sequences in the same way as the original functions, but they support three or more operands. Notice that the last two examples in the following session work on lists with embedded compound objects: the in tests used by the intersect and union functions apply equality testing to sequence nodes recursively, as deep as necessary to determine collection comparison results:

C:\...\PP3E\Dstruct\Basic>python
>>> from inter2 import *
>>> s1, s2, s3 = 'SPAM', 'SLAM', 'SCAM'
>>> intersect(s1, s2)
['S', 'A', 'M']
>>> intersect(s1, s2, s3)
['S', 'A', 'M']
>>> intersect(s1, s2, s3, 'HAM')
['A', 'M']
>>> union(s1, s2), union(s1, s2, s3)
(['S', 'P', 'A', 'M', 'L'], ['S', 'P', 'A', 'M', 'L', 'C'])
>>> intersect([1,2,3], (1,4), range(5))
[1]
>>> s1 = (9, (3.14, 1), "bye", [1,2], "mello")
>>> s2 = [[1,2], "hello", (3.14, 0), 9]
>>> intersect(s1, s2)
[9, [1, 2]]
>>> union(s1, s2)
[9, (3.14, 1), 'bye', [1, 2], 'mello', 'hello', (3.14, 0)]

20.3.2. Set Classes

The set functions can operate on a variety of sequences, but they aren't as friendly as true objects. Among other things, your scripts need to keep track of the sequences passed into these functions manually. Classes can do better: the class in Example 20-10 implements a set object that can hold any type of object. Like the stack classes, it's essentially a wrapper around a Python list with extra set operations.

Example 20-10. PP3E\Dstruct\Basic\set.py

class Set:
    def __init__(self, value = []):     # on object creation
        self.data = []                  # manages a local list
        self.concat(value)

    def intersect(self, other):         # other is any sequence type
        res = []                        # self is the instance subject
        for x in self.data:
            if x in other:
                res.append(x)
        return Set(res)                 # return a new Set

    def union(self, other):
        res = self.data[:]              # make a copy of my list
        for x in other:
            if not x in res:
                res.append(x)
        return Set(res)

    def concat(self, value):            # value: a list, string, Set...
        for x in value:                 # filters out duplicates
            if not x in self.data:
                self.data.append(x)

    def __len__(self):          return len(self.data)
    def __getitem__(self, key): return self.data[key]
    def __and__(self, other):   return self.intersect(other)
    def __or__(self, other):    return self.union(other)
    def __repr__(self):         return '<Set:' + repr(self.data) + '>'

The Set class is used like the Stack class we saw earlier in this chapter: we make instances and apply sequence operators plus unique set operations to them. Intersection and union can be called as methods, or by using the & and | operators normally used for built-in integer objects. Because we can string operators together in expressions now (e.g., x & y & z), there is no obvious need to support multiple operands in the intersect and union methods here. Note that & binds more tightly than | in Python expressions, which is why the third and fourth results in the following listing differ. As with all objects, we can either use the Set class within a program or test it interactively as follows:

>>> from set import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Jerry', 'Carol']>
>>> users1 | users2 & users3
<Set:['Bob', 'Emily', 'Howard', 'Peeper', 'Carol']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
['Bob', 'Emily', 'Howard', 'Peeper']

20.3.3. Optimization: Moving Sets to Dictionaries

Once you start using the Set class, the first problem you might encounter is its performance: its nested for loops and in scans make operations such as intersection slow down quadratically as sets grow larger. That slowness may or may not be significant in your applications, but library classes should generally be coded as efficiently as possible.

One way to optimize set performance is by changing the implementation to use dictionaries rather than lists for storing sets internally: items may be stored as the keys of a dictionary whose values are all None. Because lookup time is constant and short for dictionaries, the in list scans of the original set may be replaced with direct dictionary fetches in this scheme. In traditional terms, moving sets to dictionaries replaces slow linear searches with fast hash table fetches. A computer scientist would explain this by saying that the repeated nested scanning of the list-based intersection is a quadratic algorithm in terms of its complexity, but the dictionary-based version can be linear.
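The payoff comes from the membership test itself; the brief interactive sketch below (my own illustration, with arbitrary sizes) contrasts the two idioms. A list membership test scans the list item by item, while a dictionary membership test hashes the key and probes the table directly:

 >>> items = range(1000)
 >>> aslist = list(items)                  # 'in' scans this list node by node
 >>> asdict = dict.fromkeys(items)         # keys hold the items, values are None
 >>> 999 in aslist, 999 in asdict          # same answer, very different work
 (True, True)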

The module in Example 20-11 implements this idea. Its class is a subclass of the original set, and it redefines the methods that deal with the internal representation but inherits others. The inherited & and | methods trigger the new intersect and union methods here, and the inherited len method works on dictionaries as is. As long as Set clients are not dependent on the order of items in a set, they can switch to this version directly by just changing the name of the module from which Set is imported; the class name is the same.

Example 20-11. PP3E\Dstruct\Basic\fastset.py

import set                                 # fastset.Set extends set.Set

class Set(set.Set):
    def __init__(self, value = []):
        self.data = {}                     # manages a local dictionary
        self.concat(value)                 # hashing: linear search times

    def intersect(self, other):
        res = {}
        for x in other:                    # other: a sequence or Set
            if self.data.has_key(x):       # use hash-table lookup (or "in")
                res[x] = None
        return Set(res.keys())             # a new dictionary-based Set

    def union(self, other):
        res = {}                           # other: a sequence or Set
        for x in other:                    # scan each set just once
            res[x] = None
        for x in self.data.keys():         # '&' and '|' come back here
            res[x] = None                  # so they make new fastset's
        return Set(res.keys())

    def concat(self, value):
        for x in value: self.data[x] = None

    # inherit and, or, len
    def __getitem__(self, key):  return self.data.keys()[key]
    def __repr__(self):          return '<Set:' + repr(self.data.keys()) + '>'

This works about the same as the previous version:

>>> from fastset import Set
>>> users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>> users2 = Set(['Jerry', 'Howard', 'Carol'])
>>> users3 = Set(['Emily', 'Carol'])
>>> users1 & users2
<Set:['Howard']>
>>> users1 | users2
<Set:['Emily', 'Howard', 'Jerry', 'Carol', 'Peeper', 'Bob']>
>>> users1 | users2 & users3
<Set:['Emily', 'Howard', 'Carol', 'Peeper', 'Bob']>
>>> (users1 | users2) & users3
<Set:['Emily', 'Carol']>
>>> users1.data
{'Emily': None, 'Bob': None, 'Peeper': None, 'Howard': None}

The main functional difference in this version is the order of items in the set: because dictionaries do not keep their keys in any predictable order, this set's item order will differ from the original's. For instance, you can still store compound objects in sets, but the order of items varies in this version:

>>> import set, fastset
>>> a = set.Set([(1,2), (3,4), (5,6)])
>>> b = set.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(1, 2), (3, 4), (5, 6), (7, 8)]>
>>> a = fastset.Set([(1,2), (3,4), (5,6)])
>>> b = fastset.Set([(3,4), (7,8)])
>>> a & b
<Set:[(3, 4)]>
>>> a | b
<Set:[(3, 4), (1, 2), (7, 8), (5, 6)]>
>>> b | a
<Set:[(3, 4), (5, 6), (1, 2), (7, 8)]>

Sets aren't supposed to be ordered anyhow, so this isn't a showstopper. A deviation that might matter, though, is that this version cannot be used to store unhashable (that is, mutable) objects. This stems from the fact that dictionary keys must be hashable, which for built-in types means immutable. Because values are stored in keys, dictionary sets can contain only things such as tuples, strings, numbers, and class instances that support hashing. Mutable objects such as lists and dictionaries won't work directly. For example, the following call:

 fastset.Set([[1,2],[3,4]]) 

raises an exception with this dictionary-based set, but it works with the original set class. Tuples do work here as set items because they are immutable; Python computes hash values and tests key equality as expected.
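The underlying rule is simply that dictionary keys must be hashable. A short interactive test (not from the book's listings) makes the distinction concrete:

 >>> for item in (1, 'abc', (1, 2), [1, 2], {'a': 1}):
 ...     try:
 ...         hash(item)                    # can Python compute a hash value?
 ...         print item, 'is hashable'
 ...     except TypeError:
 ...         print item, 'is not hashable'
 ...
 1 is hashable
 abc is hashable
 (1, 2) is hashable
 [1, 2] is not hashable
 {'a': 1} is not hashable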

20.3.3.1. Timing the results under Python 2.4

So how did we do on the optimization front? Example 20-12 contains a script to compare set class performance. It reuses the timer module of Example 20-6 used earlier to test stacks.
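The timer module is not relisted in this section; a minimal stand-in consistent with the way timer.test is called below would look something like this (a sketch only, not necessarily identical to the book's Example 20-6):

 # timer.py: minimal stand-in -- call func(*args) reps times, return elapsed seconds
 import time

 def test(reps, func, *args):
     start = time.clock()              # the clock timer also shows up in the profiles below
     for i in range(reps):
         func(*args)                   # results are discarded; only the time matters
     return time.clock() - start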

Example 20-12. PP3E\Dstruct\Basic\settime.py

import timer, sys
import set, fastset

def setops(Class):
    a = Class(range(50))                 # a 50-integer set
    b = Class(range(20))                 # a 20-integer set
    c = Class(range(10))
    d = Class(range(5))
    for i in range(5):
        t = a & b & c & d                # 3 intersections
        t = a | b | c | d                # 3 unions

if __name__ == '__main__':
    rept = int(sys.argv[1])
    print 'set =>    ', timer.test(rept, setops, set.Set)
    print 'fastset =>', timer.test(rept, setops, fastset.Set)

The setops function makes four sets and combines them with intersection and union operators five times. A command-line argument controls the number of times this whole process is repeated. More accurately, each call to setops makes 34 Set instances (4 + [5 x (3 + 3)]) and runs the intersect and union methods 15 times each (5 x 3) in the for loop's body. The performance improvement is equally dramatic this time around, on a 1.2 GHz machine:

C:\...\PP3E\Dstruct\Basic>python settime.py 50
set =>     0.605568584834
fastset => 0.10293794323

C:\...\PP3E\Dstruct\Basic>python settime.py 100
set =>     1.21189676342
fastset => 0.207752661302

C:\...\PP3E\Dstruct\Basic>python settime.py 200
set =>     2.47468966028
fastset => 0.415944763929

These results will vary per machine, and they may vary per Python release. But at least for this test case, the dictionary-based set implementation (fastset) is roughly six times faster than the simple list-based set (set). In fact, this sixfold speedup is probably sufficient. Python dictionaries are already optimized hash tables that you might be hard-pressed to improve on. Unless there is evidence that dictionary-based sets are still too slow, our work here is probably done.

20.3.3.2. Timing results under Python 1.5.2: version skew

For detail-minded readers, the prior section's sixfold speedup results listed in this edition of the book were timed with Python 2.4 on a 1.2 GHz machine. Surprisingly, in the second edition, under an older Python (1.5.2) and a slower machine (650 MHz), the list-based set results were roughly twice as slow as today, but the dictionary-based set was roughly four times slower (e.g., 1.54 and 0.44 seconds for lists and dictionaries, respectively, at 50 iterations). In relative terms, the net effect is that dictionary sets went from being approximately three times faster than list sets to being six times faster today.

That is, the faster machine roughly doubled the speed of the list-based sets, but the dictionary-based sets gained an additional twofold speedup on top of that, relative to lists. This larger jump reflects optimizations to dictionaries in Python itself. As you can see, version skew is an important consideration when analyzing performance; in this case, dictionaries provide twice the relative performance boost they did a few years earlier.

20.3.3.3. Using the Python profiler

Timing code sections helps, but the ultimate way to isolate bottlenecks is profiling. The Python profiler provides another way to gather performance data besides timing sections of code as done in this chapter so far. Because the profiler tracks all function calls, it provides much more information in a single blow. As such, it's a more powerful way to isolate bottlenecks in slow programs; after profiling, you should have a good idea of where to focus your optimization efforts.

The profiler ships with Python as the standard library module called profile, and it provides a variety of interfaces for measuring code performance. It is structured and used much like the pdb command-line debugger: import the profile module and call its functions with a code string to measure performance. The simplest profiling interface is its profile.run(statementstring) function. When invoked, the profiler runs the code string, collects statistics during the run, and issues a report on the screen when the statement completes. See it in action profiling a 100-iteration call to the set test function of the previous section's Example 20-12, for the list-based sets of Example 20-10 (hint: profile an import statement to profile an entire file):

>>> import settime, timer, set
>>> import profile
>>> profile.run('timer.test(100, settime.setops, set.Set)')
         675906 function calls in 6.328 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   118500    0.434    0.000    0.434    0.000 :0(append)
        2    0.000    0.000    0.000    0.000 :0(clock)
      500    0.003    0.000    0.003    0.000 :0(range)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
        1    0.000    0.000    6.323    6.323 <string>:1(?)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    6.328    6.328 profile:0(timer.test(100,...more...))
     1500    0.133    0.000    0.951    0.001 set.py:13(union)
     3400    0.029    0.000    1.011    0.000 set.py:2(__init__)
     3400    0.621    0.000    0.982    0.000 set.py:20(concat)
   544000    2.032    0.000    2.032    0.000 set.py:26(__getitem__)
     1500    0.014    0.000    5.226    0.003 set.py:27(__and__)
     1500    0.013    0.000    0.964    0.001 set.py:28(__or__)
     1500    3.009    0.002    5.212    0.003 set.py:6(intersect)
      100    0.033    0.000    6.321    0.063 settime.py:4(setops)
        1    0.002    0.002    6.323    6.323 timer.py:1(test)

The report's format is straightforward and well documented in the Python library manual. By default, it lists the number of calls and times spent in each function invoked during the run. When the profiler is enabled, each interpreter event is routed to a Python handler. This gives an accurate picture of performance, but it tends to make the program being profiled run much slower than normal. In fact, the call profiled runs five times slower in this case under Python 2.4 and the 1.2 GHz test machine (6 seconds versus 1.2 seconds when not profiled).

On the other hand, the profiler's report helps you isolate which functions to recode and possibly migrate to the C language. In the preceding listing, for instance, we can see that the intersect function (and the corresponding __and__ operator method) is the main drag on performance; it takes roughly five-sixths of the total execution time. Indexing (__getitem__) is a close second, most likely because it occurs so often with the repeated scans used by intersection. Union, on the other hand, is fairly quick from a relative perspective.

Here is a profile of the dictionary-based set implementation of Example 20-11 for comparison; the code runs five times slower under the profiler again (1 second versus 0.2 seconds), though the relative speed of the list and dictionary-based set code is the same when both are profiled (6 seconds versus 1 second, the same sixfold difference we noticed before):

>>> import settime, timer, fastset
>>> import profile
>>> profile.run('timer.test(100, settime.setops, fastset.Set)')
         111406 function calls in 1.049 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 :0(clock)
    17500    0.065    0.000    0.065    0.000 :0(has_key)
    42500    0.166    0.000    0.166    0.000 :0(keys)
      500    0.003    0.000    0.003    0.000 :0(range)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    1.049    1.049 <string>:1(?)
     1500    0.159    0.000    0.439    0.000 fastset.py:15(union)
     3400    0.052    0.000    0.052    0.000 fastset.py:23(concat)
    38000    0.299    0.000    0.445    0.000 fastset.py:27(__getitem__)
     3400    0.030    0.000    0.082    0.000 fastset.py:4(__init__)
     1500    0.208    0.000    0.533    0.000 fastset.py:8(intersect)
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    1.049    1.049 profile:0(timer.test(100, ...more...))
     1500    0.014    0.000    0.548    0.000 set.py:27(__and__)
     1500    0.015    0.000    0.454    0.000 set.py:28(__or__)
      100    0.035    0.000    1.048    0.010 settime.py:4(setops)
        1    0.001    0.001    1.049    1.049 timer.py:1(test)

This time, there is no obvious culprit behind the total execution time: intersection and union take roughly the same amount of time, and indexing is not much of a factor. Ultimately, the real difference is the quadratic algorithm of list-based set intersection versus the linear nature of the dictionary-based algorithms, and this factor outweighs the choice of programming language used to implement the objects. Fixing the algorithm in Python first is the best bet, since a quadratic algorithm remains quadratic no matter what language it is coded in.
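For bigger programs, it can also help to save the profiler's statistics to a file and sort them afterward with the standard library's pstats module, instead of reading the default report top to bottom. Here is a brief sketch of that mode; the statistics file name used here is arbitrary:

 >>> import profile, pstats
 >>> import settime, timer, fastset
 >>> profile.run('timer.test(100, settime.setops, fastset.Set)', 'prof.stats')
 >>> stats = pstats.Stats('prof.stats')
 >>> stats.sort_stats('cumulative').print_stats(5)     # five biggest cumulative times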

20.3.4. Optimizing fastset by Coding Techniques (or Not)

As coded, there seems to be a bottleneck in the fastset class: each time we call a dictionary's keys method, Python makes a new list to hold the result. This can happen repeatedly during intersections and unions. If you are interested in trying to optimize this further, see the following files in the book's examples distribution:

  • PP3E\Dstruct\Basic\fastset2.py

  • PP3E\Dstruct\Basic\fastset3.py

I wrote these to try to speed up sets further, but failed miserably. It turns out that adding extra code to try to shave operations usually negates the speedup you obtain. There may be faster codings, but the biggest win here was likely in changing the underlying data structure to dictionaries, not in minor code tweaks.
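For illustration only, the following is the sort of micro-tweak such files experiment with: a hypothetical intersect variant that scans the other operand's dictionary directly when possible, instead of building an intermediate keys() list. This is my own sketch of the idea, not the actual code in fastset2.py or fastset3.py:

 import fastset

 class Set2(fastset.Set):                       # hypothetical tweaked subclass
     def intersect(self, other):
         res = {}
         if isinstance(other, Set2):            # other is dictionary-based too?
             other = other.data                 # then iterate its dict directly,
         for x in other:                        # skipping a keys() list copy
             if x in self.data:
                 res[x] = None
         return Set2(res)                       # a dict iterates over its keys

As the preceding paragraph suggests, changes of this kind tend to buy little; the big win was already captured by switching the underlying structure from lists to dictionaries.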

As a rule of thumb, your intuition about performance is almost always wrong in a dynamic language such as Python: the algorithm is usually the real culprit behind performance problems, not the coding style or even the implementation language. By removing the combinatorial list scanning algorithm of the original set class, the Python implementation became dramatically faster.

In fact, moving the original set class to C without fixing the algorithm would not have addressed the real performance problem. Coding tricks don't usually help as much either, and they make your programs difficult to understand. In Python, it's almost always best to code for readability first and optimize later if needed. Despite its simplicity, fastset is fast indeed.

20.3.5. Adding Relational Algebra to Sets (External)

If you are interested in studying additional set-like operations coded in Python, see the following files in this book's examples distribution:


PP3E\Dstruct\Basic\rset.py

RSet implementation


PP3E\Dstruct\Basic\reltest.py

Test script for RSet

The RSet subclass defined in rset.py adds basic relational algebra operations for sets of dictionaries. It assumes the items in sets are mappings (rows), with one entry per column (field). RSet inherits all the original Set operations (iteration, intersection, union, & and | operators, uniqueness filtering, and so on), and adds new operations as methods:


Select

Return a set of nodes that have a field equal to a given value.


Bagof

Collect set nodes that satisfy an expression string.


Find

Select tuples according to a comparison, field, and value.


Match

Find nodes in two sets with the same values for common fields.


Product

Compute a Cartesian product: concatenate tuples from two sets.


Join

Combine tuples from two sets that have the same value for a field.


Project

Extract named fields from the tuples in a table.


Difference

Remove one set's tuples from another.

Alternative implementations of set difference operations can also be found in the diff.py file in the same examples distribution directory.
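To make the operation descriptions above more concrete, here is a rough sketch of how a select-style operation might be coded on top of this section's list-based Set class, treating each set item as a dictionary of field values. This is an illustration only; it does not reproduce the actual RSet code in rset.py:

 import set                                     # the Set class of Example 20-10

 class TinyRSet(set.Set):                       # hypothetical, illustrative subclass
     def select(self, field, value):            # rows whose field equals value
         return TinyRSet([row for row in self.data if row.get(field) == value])

 people = TinyRSet([{'name': 'Bob',   'job': 'dev'},
                    {'name': 'Emily', 'job': 'mgr'}])
 print people.select('job', 'dev').data         # prints just Bob's row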



