Recipe1.18.Replacing Multiple Patterns in a Single Pass

Recipe 1.18. Replacing Multiple Patterns in a Single Pass

Credit: Xavier Defrang, Alex Martelli

Problem

You need to perform several string substitutions on a string.

Solution

Sometimes regular expressions afford the fastest solution even in cases where their applicability is not obvious. The powerful sub method of re objects (from the re module in the standard library) makes regular expressions particularly good at performing string substitutions. Here is a function returning a modified copy of an input string, where each occurrence of any string that's a key in a given dictionary is replaced by the corresponding value in the dictionary:

import re def multiple_replace(text, adict):     rx = re.compile('|'.join(map(re.escape, adict)))     def one_xlat(match):         return adict[match.group(0)]     return rx.sub(one_xlat, text)

Discussion

This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let's say you have a dictionary-based mapping between strings. The keys are the set of strings you want to replace, and the corresponding values are the strings with which to replace them. You could perform the substitution by calling the string method replace for each key/value pair in the dictionary, thus processing and creating a new copy of the entire text several times, but it is clearly better and faster to do all the changes in a single pass, processing and creating a copy of the text only once. re.sub's callback facility makes this better approach quite easy.

First, we have to build a regular expression from the set of keys we want to match. Such a regular expression has a pattern of the form a1|a2|...|aN, made up of the N strings to be substituted, joined by vertical bars, and it can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we pass it a callback argument. re.sub then calls this object for each match, with a re.MatchObject instance as its only argument, and it expects the replacement string for that match as the call's result. In our case, the callback just has to look up the matched text in the dictionary and return the corresponding value.

The function multiple_replace presented in the recipe recomputes the regular expression and redefines the one_xlat auxiliary function each time you call it. Often, you must perform substitutions on multiple strings based on the same, unchanging translation dictionary and would prefer to pay these setup prices only once. For such needs, you may prefer the following closure-based approach:

import re def make_xlat(*args, **kwds):     adict = dict(*args, **kwds)     rx = re.compile('|'.join(map(re.escape, adict)))     def one_xlat(match):         return adict[match.group(0)]     def xlat(text):         return rx.sub(one_xlat, text)     return xlat

You can call make_xlat, passing as its argument a dictionary, or any other combination of arguments you could pass to built-in dict in order to construct a dictionary; make_xlat returns a xlat closure that takes as its only argument text the string on which the substitutions are desired and returns a copy of text with all the substitutions performed.

Here's a usage example for each half of this recipe. We would normally have such an example as a part of the same .py source file as the functions in the recipe, so it is guarded by the traditional Python idiom that runs it if and only if the module is called as a main script:

if _ _name_ _ == "_ _main_ _":     text = "Larry Wall is the creator of Perl"     adict = {       "Larry Wall" : "Guido van Rossum",       "creator" : "Benevolent Dictator for Life",       "Perl" : "Python",     }     print multiple_replace(text, adict)     translate = make_xlat(adict)     print translate(text)

Substitutions such as those performed by this recipe are often intended to operate on entire words, rather than on arbitrary substrings. Regular expressions are good at picking up the beginnings and endings of words, thanks to the special sequence r'\b'. We can easily make customized versions of either multiple_replace or make_xlat by simply changing the one line in which each of them builds and assigns the regular expression object rx into a slightly different form:

  rx = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, adict)))

The rest of the code is just the same as shown earlier in this recipe. However, this sameness is not necessarily good news: it suggests that if we need many similarly customized versions, each building the regular expression in slightly different ways, we'll end up doing a lot of copy-and-paste coding, which is the worst form of code reuse, likely to lead to high maintenance costs in the future.

A key rule of good coding is: "once, and only once!" When we notice that we are duplicating code, we should notice this symptom as a "code smell," and refactor our code for better reuse. In this case, for ease of customization, we need a class rather than a function or closure. For example, here's how to write a class that works very similarly to make_xlat but can be customized by subclassing and overriding:

class make_xlat:     def _ _init_ _(self, *args, **kwds):         self.adict = dict(*args, **kwds)         self.rx = self.make_rx( )     def make_rx(self):         return re.compile('|'.join(map(re.escape, self.adict)))     def one_xlat(self, match):         return self.adict[match.group(0)]     def _ _call_ _(self, text):         return self.rx.sub(self.one_xlat, text)

This is a "drop-in replacement" for the function of the same name: in other words, a snippet such as the one we showed, with the if _ _name_ _ == '_ _main_ _' guard, works identically when make_xlat is this class rather than the previously shown function. The function is simpler and faster, but the class' important advantage is that it can easily be customized in the usual object-oriented waysubclassing it, and overriding some method. To translate by whole words, for example, all we need to code is:

class make_xlat_by_whole_words(make_xlat):     def make_rx(self):         return re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, self.adict)))

Ease of customization by subclassing and overriding helps you avoid copy-and-paste coding, and this is sometimes an excellent reason to prefer object-oriented structures over simpler functional structures, such as closures. Of course, just because some functionality is packaged as a class doesn't magically make it customizable in just the way you want. Customizability also requires some foresight in dividing the functionality into separately overridable methods that correspond to the right pieces of overall functionality. Fortunately, you don't have to get it right the first time; when code does not have the optimal internal structure for the task at hand (in this specific example, for reuse by subclassing and selective overriding), you can and should refactor the code so that its internal structure serves your needs. Just make sure you have a suitable battery of tests ready to run to ensure that your refactoring hasn't broken anything, and then you can refactor to your heart's content. See http://www.refactoring.com for more information on the important art and practice of refactoring.

Recipe1.18.Replacing Multiple Patterns in a Single Pass