5.3 Incompatibilities Between the PCRE Library and Perl Regular Expressions


I'm using one of my Perl regular expressions in PHP, but it doesn't work as expected, or it generates an error.

Technique

The PCRE library is mostly compatible with Perl, but there are a few exceptions that I have listed in the discussion. If you have Perl installed on your system, you can do the following:

 <?php
 // Quote the pattern and the subject so the shell does not interpret characters such as $, | or *.
 exec("$perlscript " . escapeshellarg($regex) . " " . escapeshellarg($variable), $line);
 $line = implode("\0", $line);
 ?>

Comments

The PCRE library, while extensive, has certain incompatibilities with Perl (a full list follows). So, we can either use a PHP workaround or make an exec call to a Perl script that we write to parse the data. (In fact, this is one of the areas in which Perl excels.) exec itself returns only the last line of the script's output, so we ignore its return value; all of the output is stored, one line per element, in a user-specified array ($line). We then implode that array around the null character to get the output as a single string.
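As a minimal sketch of how exec fills the array and what it returns, the following replaces the Perl script with a PHP one-liner run through PHP_BINARY. That substitution is an assumption made only so the example runs without Perl, and the helper name runAndCollect is hypothetical. It assumes a Unix-like shell and that exec is enabled.

```php
<?php
// Run a command and report what exec() returned and what it collected.
function runAndCollect(string $command): array
{
    $lines = [];   // exec() appends to this array, so start with an empty one
    $status = 0;   // receives the exit status of the command
    $last = exec($command, $lines, $status);

    return [
        'last'   => $last,                  // return value: last line of output
        'lines'  => $lines,                 // every output line, one per element
        'joined' => implode("\0", $lines),  // lines joined around the null character
        'status' => $status,
    ];
}

// Stand-in for "$perlscript $regex $variable": a child process that prints two lines.
$child = 'echo "first\nsecond\n";';
$command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($child);

$result = runAndCollect($command);
```

With a real Perl script, only the command string would change; the array handling and the implode call stay the same.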

Finally, to finish off my coverage of Perl-compatible regular expressions, here is a list of all the incompatibilities between Perl regular expressions (version 5.005) and the PCRE (from the PHP documentation):

  1. As a default, a whitespace character is any character that the C function isspace() recognizes, although it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. In Perl 5, the vertical tab is no longer included in the set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never actually recognized. However, the character itself was treated as whitespace at least up until 5.002. In 5.004 and 5.005, it does not match \s.

  2. The PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they might not mean what you expect. For example, (?!z){3} does not say that the next three characters are not "z". It just asserts three times that the next character is not "z".

  3. Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match something (thereby succeeding), but only if the negative lookahead assertion has just one branch.

  4. Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by zero. The escape sequence "\0" can be used in the pattern to represent a binary zero.

  5. The following Perl escape sequences are not supported: \l, \u, \L, \U, \E, \Q. This is because they are part of Perl's general string handling, not part of the regular expression library.

  6. The Perl \G assertion is not supported at all.

  7. Obviously, the PCRE does not support the (?{code}) construct.

  8. At the time of writing, there are some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/, then $2 (and $3) get set. In Perl 5.004, $2 is set in both cases, and that is also true of PCRE. If in the future Perl changes to a consistent state that is different, PCRE may change to follow.

  9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.

The PCRE also extends the Perl regular expression facilities in the following ways:

    (a) Although lookbehind assertions must match fixed-length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to be of the same length.

    (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ matches only at the very end of the string.

    (c) If PCRE_EXTRA is set, a backslash followed by a letter that has no special meaning causes an error (for example, \j would not be valid, but \w, which matches [a-zA-Z0-9_], would be valid).

    (d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is reversed: they are ungreedy by default and become greedy when followed by ? (for example, .*? is greedy when PCRE_UNGREEDY is set).
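A few of the points in the list above can be sketched in PHP. This is an illustration under stated assumptions rather than a definitive set of workarounds: all function names are hypothetical, the first three functions show one possible PHP-side substitute for items 2 and 5, and the last two use PHP's D and U pattern modifiers, which correspond to PCRE_DOLLAR_ENDONLY and PCRE_UNGREEDY.

```php
<?php
// Item 2: to require that the next three characters are not "z",
// quantify a character class inside a single lookahead.
function nextThreeAreNotZ(string $subject): bool
{
    return preg_match('/^(?=[^z]{3})/', $subject) === 1;
}

// Item 5: preg_quote() can stand in for Perl's \Q...\E when literal
// text is placed into a pattern.
function containsLiteral(string $needle, string $subject): bool
{
    return preg_match('/' . preg_quote($needle, '/') . '/', $subject) === 1;
}

// Item 5: case changes that Perl writes with \u can be done in a callback.
function capitalizeWords(string $subject): string
{
    return (string) preg_replace_callback(
        '/\b[a-z]/',
        function (array $m): string {
            return strtoupper($m[0]);
        },
        $subject
    );
}

// (b): with the D modifier, $ no longer matches before a trailing newline.
function matchesAtVeryEnd(string $subject, bool $dollarEndOnly): bool
{
    $pattern = $dollarEndOnly ? '/abc$/D' : '/abc$/';
    return preg_match($pattern, $subject) === 1;
}

// (d): with the U modifier, .+ stops at the first possible ">".
function firstTag(string $html, string $modifiers = ''): string
{
    return preg_match('/<.+>/' . $modifiers, $html, $m) === 1 ? $m[0] : '';
}
```

For instance, firstTag('<b>x</b>') returns the whole string, while firstTag('<b>x</b>', 'U') returns only '<b>'; matchesAtVeryEnd("abc\n", false) succeeds, while matchesAtVeryEnd("abc\n", true) does not.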



PHP Developer's Cookbook
PHP Developers Cookbook (2nd Edition)
ISBN: 0672323257
EAN: 2147483647
Year: 2000
Pages: 351
