Recipe 1.10. Filtering a String for a Set of Characters

Credit: Jürgen Hermann, Nick Perkins, Peter Cogolo


Given a set of characters to keep, you need to build a filtering function that, applied to any string s, returns a copy of s that contains only characters in the set.


The translate method of string objects is fast and handy for all tasks of this ilk. However, to call translate effectively to solve this recipe's task, we must do some advance preparation. The first argument to translate is a translation table: in this recipe, we do not want to do any translation, so we must prepare a first argument that specifies "no translation". The second argument to translate specifies which characters we want to delete: since the task here says that we're given, instead, a set of characters to keep (i.e., to not delete), we must prepare a second argument that gives the set complement, deleting all characters we must not keep. A closure is the best way to do this advance preparation just once, obtaining a fast filtering function tailored to our exact needs:

import string

# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')

def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that `keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter

if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
    # emits: ieieuii


The key to understanding this recipe lies in the definitions of the maketrans function in the string module of the Python Standard Library and in the translate method of string objects. The translate method returns a copy of the string you call it on, replacing each character in it with the corresponding character in the translation table passed in as the first argument and deleting the characters specified in the second argument. maketrans is a utility function to create translation tables. (A translation table is a string t of exactly 256 characters: when you pass t as the first argument of a translate method, each character c of the string on which you call the method is translated in the resulting string into the character t[ord(c)].)
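The table-plus-deletions behavior described above can be tried out with a small sketch. It assumes Python 3, where the two-argument form of translate lives on bytes objects (bytes.maketrans and bytes.translate) rather than on plain strings as in this recipe; the names identity, swap, and result are only illustrative.

```python
# Python 3 sketch (assumption: the recipe itself targets Python 2 plain strings).
# bytes.maketrans('', '') yields the 256-byte identity table: byte i maps to i.
identity = bytes.maketrans(b'', b'')
print(len(identity))                        # 256
print(identity[ord('a')] == ord('a'))       # True

# A table that maps a->x, b->y, c->z and leaves every other byte unchanged.
swap = bytes.maketrans(b'abc', b'xyz')
print(swap[ord('a')] == ord('x'))           # True: this is t[ord(c)]

# The second argument lists bytes to delete; the survivors go through the table.
result = b'aabbcc'.translate(swap, b'b')
print(result)                               # b'xxzz'
```

Here the b bytes are removed and the remaining a and c bytes are mapped through the table.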

In this recipe, efficiency is maximized by splitting the filtering task into preparation and execution phases. The string of all characters is clearly reusable, so we build it once and for all as a global variable when this module is imported. That way, we ensure that each filtering function uses the same string-of-all-characters object, not wasting any memory. The string of characters to delete, which we need to pass as the second argument to the translate method, depends on the set of characters to keep, because it must be built as the "set complement" of the latter: we must tell translate to delete every character that we do not want to keep. So, we build the delete-these-characters string in the makefilter factory function. This building is done quite rapidly by using the translate method to delete the "characters to keep" from the string of all characters. The translate method is very fast, as are the construction and execution of these useful little resulting functions. The test code that executes when this recipe runs as a main script shows how to build a filtering function by calling makefilter, bind a name to the filtering function (by simply assigning the result of calling makefilter to a name), then call the filtering function on some strings and print the results.
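For readers on Python 3, where string.maketrans no longer exists, the same preparation and execution split can be sketched on bytes objects. This is an adaptation under that assumption, not the recipe's own code; ALLBYTES is a hypothetical name standing in for allchars.

```python
# Python 3 sketch of the two-phase idea, using bytes instead of Python 2 str.
# All 256 byte values in order; built once at import time and shared.
ALLBYTES = bytes.maketrans(b'', b'')

def makefilter(keep):
    """Return a function that keeps only the bytes found in `keep` (a bytes object)."""
    # Preparation phase, done once per filter: delete the keepers from the
    # full set, which leaves the complement, i.e. the bytes to delete.
    # Passing None as the table means "no translation".
    delchars = ALLBYTES.translate(None, keep)

    # Execution phase: a single translate call per filtered string.
    def thefilter(s):
        return s.translate(None, delchars)
    return thefilter

just_vowels = makefilter(b'aeiouy')
print(just_vowels(b'four score and seven years ago'))   # b'ouoeaeeyeaao'
print(just_vowels(b'tiger, tiger burning bright'))      # b'ieieuii'
```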

Incidentally, calling a filtering function with allchars as the argument puts the set of characters being kept into a canonic string form, alphabetically sorted and without duplicates. You can use this idea to code a very simple function to return the canonic form of any set of characters presented as an arbitrary string:

def canonicform(s):
    """ Given a string s, return s's characters as a canonic-form string:
        alphabetized and without duplicates. """
    return makefilter(s)(allchars)
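As a cross-check on what "canonic form" means, the same result can be computed a different way, with set and sorted instead of translate. This is a sketch assuming Python 3 strings, and it is not the technique the recipe uses; it is only handy for verifying the output of canonicform on small inputs.

```python
def canonicform_check(s):
    """Sorted, duplicate-free form of the characters in s, built with set/sorted."""
    # set(s) drops duplicates; sorted orders the survivors by code point.
    return ''.join(sorted(set(s)))

print(canonicform_check('mississippi'))   # imps
```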

The Solution uses a def statement to make the nested function (closure) it returns, because def is the most normal, general, and clear way to make functions. If you prefer, you could use lambda instead, changing the def and return statements in function makefilter into just one return lambda statement:

    return lambda s: s.translate(allchars, delchars)

Most Pythonistas, but not all, consider using def clearer and more readable than using lambda.

Since this recipe deals with strings seen as sets of characters, you could alternatively use the sets.Set type (or, in Python 2.4, the new built-in set type) to perform the same tasks. Thanks to the translate method's power and speed, it's often faster to work directly on strings, rather than go through sets, for tasks of this ilk. However, just as noted in Recipe 1.8, the functions in this recipe only work for normal strings, not for Unicode strings.
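The set-based alternative mentioned above might look like the following sketch. It assumes Python 3 and the built-in frozenset type; makefilter_with_set is a hypothetical name chosen to keep it distinct from the recipe's makefilter. It is typically slower than translate on long plain strings, but it makes no distinction between plain and Unicode text.

```python
def makefilter_with_set(keep):
    """Return a function that keeps only the characters of s found in `keep`."""
    # Preparation phase: build the set of keepers once.
    keepers = frozenset(keep)

    # Execution phase: test each character for membership in the set.
    def thefilter(s):
        return ''.join(c for c in s if c in keepers)
    return thefilter

just_vowels = makefilter_with_set('aeiouy')
print(just_vowels('four score and seven years ago'))   # ouoeaeeyeaao
```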

To solve this recipe's task for Unicode strings, we must do some very different preparation. A Unicode string's translate method takes only one argument: a mapping or sequence, which is indexed with the code number of each character in the string. Characters whose codes are not keys in the mapping (or indices in the sequence) are just copied over to the output string. Otherwise, the value corresponding to each character's code must be either a Unicode string (which is substituted for the character) or None (in which case the character is deleted). A very nice and powerful arrangement, but unfortunately not one that's identical to the way plain strings work, so we must recode.
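The one-argument form can be illustrated with a small dict. The sketch assumes Python 3, whose str type plays the role of the Unicode strings discussed here; the name table is illustrative only.

```python
# Keys are character codes; values are replacement strings or None (delete).
table = {
    ord('h'): 'J',     # substitute 'J' for 'h'
    ord('l'): None,    # delete every 'l'
}

# 'e' and 'o' are not keys in the mapping, so they are copied over unchanged.
print('hello'.translate(table))   # Jeo
```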

Normally, we use either a dict or a list as the argument to a Unicode string's translate method to translate some characters and/or delete some. But for the specific task of this recipe (i.e., keep just some characters, delete all others), we might need an inordinately large dict or list, just mapping all other characters to None. It's better to code, instead, a little class that appropriately implements a __getitem__ method (the special method that gets called in indexing operations). Once we're going to the (slight) trouble of coding a little class, we might as well make its instances callable and have makefilter be just a synonym for the class itself:

import sets

class Keeper(object):
    def __init__(self, keep):
        self.keep = sets.Set(map(ord, keep))
    def __getitem__(self, n):
        if n not in self.keep:
            return None
        return unichr(n)
    def __call__(self, s):
        return unicode(s).translate(self)

makefilter = Keeper

if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels(u'four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels(u'tiger, tiger burning bright')
    # emits: ieieuii

We might name the class itself makefilter, but, by convention, one normally names classes with an uppercase initial; there is essentially no cost in following that convention here, too, so we did.
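A Python 3 adaptation of the Keeper class could look like the sketch below. It assumes Python 3, where every str is Unicode, chr replaces unichr, and the built-in frozenset replaces sets.Set; otherwise it follows the same design of a callable object whose __getitem__ answers the lookups that translate makes.

```python
class Keeper(object):
    def __init__(self, keep):
        # Store the code numbers of the characters to keep.
        self.keep = frozenset(map(ord, keep))

    def __getitem__(self, n):
        # translate indexes the object with each character's code number:
        # None means "delete this character", a string means "substitute it".
        if n not in self.keep:
            return None
        return chr(n)

    def __call__(self, s):
        return s.translate(self)

makefilter = Keeper

just_vowels = makefilter('aeiouy')
print(just_vowels('four score and seven years ago'))   # ouoeaeeyeaao
print(just_vowels('tiger, tiger burning bright'))      # ieieuii
```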

See Also

Recipe 1.8; documentation for the translate method of strings and Unicode objects, and the maketrans function in the string module, in the Library Reference and Python in a Nutshell.

Python Cookbook
ISBN: 0596007973
Year: 2004