Recipe1.8.Checking Whether a String Contains a Set of Characters


Recipe 1.8. Checking Whether a String Contains a Set of Characters

Credit: Jürgen Hermann, Horst Hansen

Problem

You need to check for the occurrence of any of a set of characters in a string.

Solution

The simplest approach is clear, fast, and general (it works for any sequence, not just strings, and for any container on which you can test for membership, not just sets):

def containsAny(seq, aset):     """ Check whether sequence seq contains ANY of the items in aset. """     for c in seq:         if c in aset: return True     return False

You can gain a little speed by moving to a higher-level, more sophisticated approach, based on the itertools standard library module, essentially expressing the same approach in a different way:

import itertools def containsAny(seq, aset):     for item in itertools.ifilter(aset._ _contains_ _, seq):         return True     return False

Discussion

Most problems related to sets are best handled by using the set built-in type introduced in Python 2.4 (if you're using Python 2.3, you can use the equivalent sets.Set type from the Python Standard Library). However, there are exceptions. Here, for example, a pure set-based approach would be something like:

def containsAny(seq, aset):     return bool(set(aset).intersection(seq))

However, with this approach, every item in seq inevitably has to be examined. The functions in this recipe's Solution, on the other hand, "short-circuit": they return as soon as they know the answer. They must still check every item in seq when the answer is Falsewe could never affirm that no item in seq is a member of aset without examining all the items, of course. But when the answer is true, we often learn about that very soon, namely as soon as we examine one item that is a member of aset. Whether this matters at all is very data-dependent, of course. It will make no practical difference when seq is short, or when the answer is typically False, but it may be extremely important for a very long seq (when the answer can typically be soon determined to be true).

The first version of containsAny presented in the recipe has the advantage of simplicity and clarity: it expresses the fundamental idea with total transparency. The second version may appear to be "clever", and that is not a complimentary adjective in the Python world, where simplicity and clarity are core values. However, the second version is well worth considering, because it shows a higher-level approach, based on the itertools module of the standard library. Higher-level approaches are most often preferable to lower-level ones (although the issue is moot in this particular case). itertools.ifilter takes a predicate and an iterable, and yields the items in that iterable that satisfy the "predicate". Here, as the "predicate", we use anyset._ _contains_ _, the bound method that is internally called when we code in anyset for membership testing. So, if ifilter yields anything at all, it yields an item of seq that is also a member of anyset, so we can return True as soon as this happens. If we get to the statement following the for, it must mean the return True never executed, because no items of seq are members of anyset, so we can return False.

What Is "a Predicate?"

A term you can see often in discussions about programming is predicate: it just means a function (or other callable object) that returns TRue or False as its result. A predicate is said to be satisfied when it returns true.


If your application needs some function such as containsAny to check whether a string (or other sequence) contains any members of a set, you may also need such variants as:

def containsOnly(seq, aset):     """ Check whether sequence seq contains ONLY items in aset. """     for c in seq:         if c not in aset: return False     return True

containsOnly is the same function as containsAny, but with the logic turned upside-down. Other apparently similar tasks don't lend themselves to short-circuiting (they intrinsically need to examine all items) and so are best tackled by using the built-in type set (in Python 2.4; in 2.3, you can use sets.Set in the same way):

def containsAll(seq, aset):     """ Check whether sequence seq contains ALL the items in aset. """     return not set(aset).difference(seq)

If you're not accustomed to using the set (or sets.Set) method difference, be aware of its semantics: for any set a, a.difference(b) (just like a-set(b)) returns the set of all elements of a that are not in b. For example:

>>> L1 = [1, 2, 3, 3] >>> L2 = [1, 2, 3, 4] >>> set(L1).difference(L2) set([  ]) >>> set(L2).difference(L1) set([4])

which hopefully helps explain why:

>>> containsAll(L1, L2) False >>> containsAll(L2, L1) True

(In other words, don't confuse difference with another method of set, symmetric_difference, which returns the set of all items that are in either argument and not in the other.)

When you're dealing specifically with (plain, not Unicode) strings for both seq and aset, you may not need the full generality of the functions presented in this recipe, and may want to try the more specialized approach explained in Recipe 1.10 based on strings' method TRanslate and the string.maketrans function from the Python Standard Library. For example:

import string notrans = string.maketrans('', '')           # identity "translation" def containsAny(astr, strset):     return len(strset) != len(strset.translate(notrans, astr)) def containsAll(astr, strset):     return not strset.translate(notrans, astr)

This somewhat tricky approach relies on strset.translate(notrans, astr) being the subsequence of strset that is made of characters not in astr. When that subsequence has the same length as strset, no characters have been removed by strset.translate, therefore no characters of strset are in astr. Conversely, when the subsequence is empty, all characters have been removed, so all characters of strset are in astr. The translate method keeps coming up naturally when one wants to treat strings as sets of characters, because it's speedy as well as handy and flexible; see Recipe 1.10 for more details.

These two sets of approaches to the recipe's tasks have very different levels of generality. The earlier approaches are very general: not at all limited to string processing, they make rather minimal demands on the objects you apply them to. The approach based on the translate method, on the other hand, works only when both astr and strset are strings, or very closely mimic plain strings' functionality. Not even Unicode strings suffice, because the TRanslate method of Unicode strings has a signature that is different from that of plain stringsa single argument (a dict mapping code numbers to Unicode strings or None) instead of two (both strings).

See Also

Recipe 1.10; documentation for the translate method of strings and Unicode objects, and maketrans function in the string module, in the Library Reference and Python in a Nutshell; ditto for documentation of built-in set (Python 2.4 only), modules sets and itertools, and the special method _ _contains_ _.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net