Recipe 22.10. Using a PHP Function in a Regular Expression


22.10.1. Problem

You want to process matched text with a PHP function. For example, you want to decode all HTML entities in captured subpatterns.

22.10.2. Solution

Use preg_replace_callback( ). Instead of a replacement pattern, give it a callback function. This callback function is passed an array of matched subpatterns and should return an appropriate replacement string. Example 22-15 decodes entities between <code></code> tags.

Example 22-15. Generating replacement strings with a callback function

<?php
$html = 'The &lt;b&gt; tag makes text bold: <code>&lt;b&gt;bold&lt;/b&gt;</code>';

print preg_replace_callback('@<code>(.*?)</code>@','decode', $html);

// $matches[0] is the entire matched string
// $matches[1] is the first captured subpattern
function decode($matches) {
    return html_entity_decode($matches[1]);
}
?>

Example 22-15 prints:

The &lt;b&gt; tag makes text bold: <b>bold</b>

22.10.3. Discussion

The second argument to preg_replace_callback( ) specifies the function that is to be called to calculate replacement strings. Like everywhere the PHP "callback" pseudotype is used, this argument can be a string or an array. Use a string to specify a function name. To use an object instance method as a callback, pass an array whose first element is the object and whose second element is a string containing the method name. To use a static class method as a callback, pass an array of two strings: the class name and the method name.
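As a rough illustration of the three callback forms described above, the following sketch passes each one to preg_replace_callback( ) in turn. The EntityDecoder class and its method names are hypothetical, invented only for this example.

```php
<?php
// Hypothetical helper class, used only to illustrate the callback forms.
class EntityDecoder {
    // Called as an instance method
    public function decode($matches) {
        return html_entity_decode($matches[1]);
    }
    // Called as a static class method
    public static function decodeStatic($matches) {
        return html_entity_decode($matches[1]);
    }
}

// Called as a plain function
function decode($matches) {
    return html_entity_decode($matches[1]);
}

$html    = '<code>&lt;b&gt;bold&lt;/b&gt;</code>';
$pattern = '@<code>(.*?)</code>@';

// 1. A string naming a function
$fromFunction = preg_replace_callback($pattern, 'decode', $html);

// 2. An array holding an object and a method name
$fromObject = preg_replace_callback($pattern, array(new EntityDecoder(), 'decode'), $html);

// 3. An array holding a class name and a static method name
$fromStatic = preg_replace_callback($pattern, array('EntityDecoder', 'decodeStatic'), $html);

print "$fromFunction\n$fromObject\n$fromStatic\n";
```

All three calls run the same decoding logic; only the way the callback is named differs.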

The callback function is passed one argument: an array of matches. Element 0 of this array is always the text that matched the entire pattern. If the pattern given to preg_replace_callback( ) has any parenthesized subpatterns, these are present in subsequent elements of the matches array. Subpatterns always appear under numeric keys; in current PHP versions, a named subpattern also appears under its name.
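To make the layout of the matches array concrete, here is a small sketch that records what the callback receives. The pattern and sample string are made up for illustration, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Collect what the callback receives so it can be inspected afterward.
$seen = array();

$result = preg_replace_callback(
    '/(\w+)@(\w+)/',
    function ($matches) use (&$seen) {
        $seen = $matches;
        // $matches[0] is the whole match; [1] and [2] are the subpatterns
        return $matches[2] . ' at ' . $matches[1];
    },
    'mail: alice@example'
);

print $result . "\n";
print_r($seen);
```

Here $seen[0] holds alice@example, while $seen[1] and $seen[2] hold the two parenthesized subpatterns in order.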

The PHP manual page for preg_replace_callback( ) has suggested using create_function( ) to create an anonymous function for use as a callback. Although this can be convenient, it is memory intensive if the call to create_function( ) is inline with the call to preg_replace_callback( ) and inside a loop, because each call defines a new function. If you want to use an anonymous function with preg_replace_callback( ), call create_function( ) once, storing the anonymous function callback in a variable. Then, provide the variable to preg_replace_callback( ) as the callback function. Note that create_function( ) was deprecated in PHP 7.2 and removed in PHP 8.0, so this approach runs only on older PHP versions. Example 22-16 uses an anonymous function to apply the transformation in Example 22-15 to every line in a file.
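On PHP 5.3 and later, a closure stored in a variable can play the same role as the stored create_function( ) callback. The sketch below follows the same define-once, reuse-in-a-loop advice; as an assumption made to keep it self-contained, it loops over an in-memory array of sample lines rather than reading a file.

```php
<?php
// Define the callback once, outside the loop, and reuse it.
$callbackFunction = function ($matches) {
    return html_entity_decode($matches[1]);
};

// Sample lines standing in for the contents of a file (illustrative data).
$lines = array(
    "Bold: <code>&lt;b&gt;bold&lt;/b&gt;</code>\n",
    "Italic: <code>&lt;i&gt;italic&lt;/i&gt;</code>\n",
);

$decoded = array();
foreach ($lines as $line) {
    $decoded[] = preg_replace_callback('@<code>(.*?)</code>@', $callbackFunction, $line);
}

print implode('', $decoded);
```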

Example 22-16. Generating replacement strings with an anonymous function

<?php
$callbackFunction = create_function('$matches',
                         'return html_entity_decode($matches[1]);');

$fp = fopen('html-to-decode.html','r');
while (! feof($fp)) {
    $line = fgets($fp);
    print preg_replace_callback('@<code>(.*?)</code>@',$callbackFunction, $line);
}
fclose($fp);
?>

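An alternative to preg_replace_callback( ) is to use the e pattern modifier with preg_replace( ). This causes the replacement string to be evaluated as PHP code. We recommend you use preg_replace_callback( ) instead, though, for the backreference-related reasons explained below. (The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the examples that use it run only on older PHP versions.)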

Example 22-17 uses the e pattern modifier to do the same entity decoding in Example 22-15.

Example 22-17. Entity decoding matched text

<?php
$html = 'The &lt;b&gt; tag makes text bold: <code>&lt;b&gt;bold&lt;/b&gt;</code>';

print preg_replace('@<code>(.*?)</code>@e',"html_entity_decode('$1')", $html);
?>

Things can get a bit tricky when you use the e modifier and include backreferences in the replacement string. There are multiple levels of escaping to be aware of (and, in some cases, to work around).

The first level of escaping is PHP's regular behavior that's at work whenever you construct a string. (Note that in Example 22-17, however, since $1 isn't a valid regular variable name, the $ doesn't need to be escaped even though the entire replacement string is delimited by double quotes.)

The second level of escaping is how backreference replacements are delimited inside the replacement string. In Example 22-17, $1 refers to the text captured between the <code></code> tags, so html_entity_decode('$1') becomes html_entity_decode('&lt;b&gt;bold&lt;/b&gt;'). This causes html_entity_decode( ) to be called with one argument, a single-quoted string.

Both single and double quotes in the captured match are backslash escaped. The backreference replacements look a little different when they themselves contain single or double quotes. For instance, examine Example 22-18.

Example 22-18. Quote escaping in backreference replacements

<?php
$html = "<code>&lt;b&gt; It's bold &lt;/b&gt;</code>";
print preg_replace('@<code>(.*?)</code>@e',"html_entity_decode('$1')", $html);
print "\n";

$html = '<code>&lt;i&gt; "This" is italic. &lt;/i&gt;</code>';
print preg_replace('@<code>(.*?)</code>@e',"html_entity_decode('$1')", $html);
print "\n";
?>

Example 22-18 prints:

<b> It's bold </b>
<i> \"This\" is italic. </i>

Somehow, backslashes have crept into the second line. This is a consequence of how the e modifier works. As mentioned above, both single and double quotes in the captured match are backslash escaped. This means that, in the first call to preg_replace( ) in Example 22-18, what's executed to calculate the replacement is:

html_entity_decode('&lt;b&gt; It\'s bold &lt;/b&gt;')

html_entity_decode( ) is passed a single-quoted string with a backslash-escaped single quote in it. All is well: 'It\'s' is really just It's. The second preg_replace( ), however, is problematic. What's executed to calculate the replacement is html_entity_decode('&lt;i&gt; \"This\" is italic. &lt;/i&gt;'). In a single-quoted string, a backslash before a double quote is not an escape sequence: it represents not a lone double quote, but the two-character sequence \".

To work around this problem, use str_replace( ) to replace \" with " in the code that's executed to calculate the replacement. (Don't use stripslashes( ); it also removes backslashes before other characters, which we don't want here.) Example 22-19 wraps html_entity_decode( ) with a function that does just that.

Example 22-19. Fixing quote escaping in backreference replacements

<?php
$html = "<code>&lt;b&gt; It's bold &lt;/b&gt;</code>";
print preg_replace('@<code>(.*?)</code>@e',"preg_html_entity_decode('$1')", $html);
print "\n";

$html = '<code>&lt;i&gt; "This" is italic. &lt;/i&gt;</code>';
print preg_replace('@<code>(.*?)</code>@e',"preg_html_entity_decode('$1')", $html);
print "\n";

function preg_html_entity_decode($s) {
    $s = str_replace('\\"','"', $s);
    return html_entity_decode($s);
}
?>

The use of the preg_html_entity_decode( ) function in Example 22-19 ensures that it prints correct results:

<b> It's bold </b>
<i> "This" is italic. </i>
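The difference between the targeted str_replace( ) fix and stripslashes( ) can be seen without the e modifier at all. The sketch below uses a hypothetical sample string of the kind the wrapper function might receive: double quotes arrive backslash-escaped, but the backslash in C:\temp is real data that should survive.

```php
<?php
// Hypothetical input: escaped double quotes plus one backslash that is
// genuinely part of the text (in C:\temp).
$captured = 'Save \"draft\" to C:\temp';

// In a single-quoted literal, \" is two characters: a backslash and a quote.
$twoChars = strlen('\"');

// Targeted fix: only the backslash-quote pairs are rewritten.
$fixed = str_replace('\\"', '"', $captured);

// stripslashes() also drops the backslash in C:\temp.
$stripped = stripslashes($captured);

print "$twoChars\n$fixed\n$stripped\n";
```

The str_replace( ) result keeps C:\temp intact, while the stripslashes( ) result turns it into C:temp.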

One final note on escaping and the e pattern modifier: inside your replacement-calculating expression, make sure to use single quotes (not double quotes) to delimit any strings that include backreference values. That is, use preg_html_entity_decode('$1'), not preg_html_entity_decode("$1"). Double quotes cause problems if the backreference value contains what looks like a valid variable name. Example 22-20 illustrates this problem.

Example 22-20. Variable names and double-quoted strings

<?php
$text = '<code>if ($temperature &lt; 12) { fever(); }</code>';

print "Good: \n";
print preg_replace('@<code>(.*?)</code>@e',"preg_html_entity_decode('$1')", $text);

print "\n Bad: \n";
print preg_replace('@<code>(.*?)</code>@e','preg_html_entity_decode("$1")', $text);

function preg_html_entity_decode($s) {
    $s = str_replace('\\"','"', $s);
    return html_entity_decode($s);
}
?>

Example 22-20 prints:

Good:
if ($temperature < 12) { fever(); }
 Bad:
Notice: Undefined variable: temperature in example.php(6) : regexp code on line 1
if ( < 12) { fever(); }

With appropriate quoting, the first preg_replace( ) works as expected: the only modification to $text is that &lt; is replaced by <. The second preg_replace( ), with double quotes around $1, is broken. The PHP interpreter thinks that the string to be passed to preg_html_entity_decode( ) is "if ($temperature &lt; 12) { fever(); }". Since that's a double-quoted string, the PHP interpreter attempts to replace $temperature with the value of the corresponding variable, which, of course, doesn't exist.

So the moral of the "using the e modifier with preg_replace( )" story is twofold: correct for backslash-escaped double-quote characters and use single quotes to delimit strings inside your code expression to avoid accidental variable interpolation. This tricky quoting and interpolation behavior makes preg_replace_callback( ) a friendlier option.
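For comparison, here is a minimal sketch that feeds the three troublesome inputs from the examples above through preg_replace_callback( ). The captured text reaches the callback as an ordinary string, so no quote fix-up or interpolation workaround is involved. The closure syntax assumes PHP 5.3 or later, and the variable names are illustrative.

```php
<?php
// The callback sees the captured text exactly as it appeared in the input.
$decode = function ($matches) {
    return html_entity_decode($matches[1]);
};

$samples = array(
    "<code>&lt;b&gt; It's bold &lt;/b&gt;</code>",
    '<code>&lt;i&gt; "This" is italic. &lt;/i&gt;</code>',
    '<code>if ($temperature &lt; 12) { fever(); }</code>',
);

$results = array();
foreach ($samples as $sample) {
    $results[] = preg_replace_callback('@<code>(.*?)</code>@', $decode, $sample);
}

print implode("\n", $results) . "\n";
```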

22.10.4. See Also

Documentation on preg_replace_callback( ) at http://www.php.net/preg_replace_callback, on preg_replace( ) at http://www.php.net/preg_replace, on create_function( ) at http://www.php.net/create_function, and on the callback pseudo-type at http://www.php.net/language.pseudo-types#language.types.callback .




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
EAN: 2147483647
Year: 2006
Pages: 445
