10.2. The Preg Function Interface

PHP's interface to its regex engine is purely procedural (page 95), provided by the seven functions shown at the top of Table 10-2. For reference, the table also shows four useful functions that are presented later in this chapter.

Table 10-2. Overview of PHP's Preg Functions

Function                 Page   Use
-----------------------  -----  ---------------------------------------------
preg_match               449    Check whether regex can match in a string, and pluck data from a string
preg_match_all           453    Pluck data from a string
preg_replace             458    Replace matched text within a copy of a string
preg_replace_callback    463    Call a function for each match of regex within a string
preg_split               465    Partition a string into an array of substrings
preg_grep                469    Cull elements from an array that do/don't match regex
preg_quote               470    Escape regex metacharacters in a string

The following functions, developed in this chapter, are included here for easy reference.

reg_match                454    Non-participatory-parens aware version of preg_match
preg_regex_to_pattern    472    Create a preg pattern string from a regex string
preg_pattern_error       474    Check a preg pattern string for syntax errors
preg_regex_error         475    Check a regex string for syntax errors


What each function actually does is greatly influenced by the type and number of arguments provided, the function flags, and the pattern modifiers used with the regex. Before looking at all the details, let's see a few examples to get a feel for how regexes look and how they are handled in PHP:

    /* Check whether HTML tag is a <table> tag */
    if (preg_match('/^<table\b/i', $tag))
      print "tag is a table tag\n";
    -----------------------------------
    /* Check whether text is an integer */
    if (preg_match('/^-?\d+$/', $user_input))
      print "user input is an integer\n";
    -----------------------------------
    /* Pluck HTML title from a string */
    if (preg_match('{<title>(.*?)</title>}si', $html, $matches))
      print "page title: $matches[1]\n";
    -----------------------------------
    /* Treat numbers in string as Fahrenheit values and replace with Celsius values */
    $metric = preg_replace('/(-?\d+(?:\.\d+)?)/e',        /* pattern */
                           'floor(($1-32)*5/9 + 0.5)',    /* replacement code */
                           $string);
    -----------------------------------
    /* Create an array of values from a string filled with simple comma-separated values */
    $values_array = preg_split('!\s*,\s*!', $comma_separated_values);

The last example, when given a string such as 'Larry, Curly, Moe', produces an array with three elements: the strings 'Larry', 'Curly', and 'Moe'.
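For readers on current PHP, here is a runnable sketch of the last two examples. The e pattern modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so this version swaps in preg_replace_callback (listed in Table 10-2) for the Fahrenheit conversion. The sample strings are made up for illustration.

```php
<?php
// Fahrenheit-to-Celsius without the e modifier: the callback receives the
// match array, so $m[1] plays the role that $1 plays in replacement code.
$string = 'High of 212, low of 32';
$metric = preg_replace_callback(
    '/(-?\d+(?:\.\d+)?)/',
    function ($m) {
        return floor(($m[1] - 32) * 5 / 9 + 0.5);
    },
    $string
);
print "$metric\n";

// Split simple comma-separated values, discarding whitespace around commas.
$values_array = preg_split('!\s*,\s*!', 'Larry, Curly, Moe');
print_r($values_array);
```

The callback form keeps the conversion logic in ordinary PHP code rather than in a string that is evaluated at match time.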

10.2.1. "Pattern" Arguments

The first argument to any of the preg functions is a pattern argument, which is the regex wrapped by a pair of delimiters, possibly followed by pattern modifiers. In the first example above, the pattern argument is '/^<table\b/i', which is the regex ^<table\b wrapped by a pair of slashes (the delimiters), followed by the i (case-insensitive match) pattern modifier.

10.2.1.1. PHP single-quoted strings

Because of a regex's propensity to include backslashes, it's most convenient to use PHP's single-quoted strings when providing these pattern arguments as string literals. PHP's string literals are also covered in Chapter 3 (page 103), but in short, you don't need to add many extra escapes to a regular expression when rendering it within a single-quoted string literal. PHP single-quoted strings have only two string metasequences, '\'' and '\\', which include a single quote and a backslash into the string's value, respectively.

One common exception requiring extra escapes is when you want \\ within the regex, which matches a single backslash character. Within a single-quoted string literal, each \ requires \\ , so \\ requires \\\\ . All this to match one backslash. Phew!

(You can see an extreme example of this kind of backslash-itis on page 473.)

As a concrete example, consider a regex to match a Windows disk's root path, such as 'C:\'. An expression for that is ^[A-Z]:\\$ , which when included within a single-quoted string literal appears as '^[A-Z]:\\\\$'.
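A minimal sketch of that root-path check follows. It assumes slash delimiters around the regex, and the two test paths are hypothetical.

```php
<?php
// The regex ^[A-Z]:\\$ needs four backslashes in a single-quoted literal,
// plus delimiters to make it a pattern argument.
$pattern = '/^[A-Z]:\\\\$/';
print "$pattern\n";   // the string's value holds only two backslashes

var_dump(preg_match($pattern, 'C:\\'));         // a root path
var_dump(preg_match($pattern, 'C:\\Windows'));  // not a root path
```

Printing the pattern before using it is a cheap way to confirm that the string's value is the regex you meant to write.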

In a Chapter 5 example on page 190, we saw that ^.*\\ required a pattern argument string of '/^.*\\\/', with three backslashes. With that in mind, I find the following examples to be illustrative:

    print '/^.*\/';       prints  /^.*\/
    print '/^.*\\/';      prints  /^.*\/
    print '/^.*\\\/';     prints  /^.*\\/
    print '/^.*\\\\/';    prints  /^.*\\/

The first two examples yield the same result through different means. In the first, the '\/' sequence at the end is not special to a single-quoted string literal, so the sequence appears verbatim in the string's value. In the second example, the '\\' sequence is special to the string literal, yielding a single '\' in the string's value. This, when combined with the character that follows (the slash), yields the same '\/' in the value as in the first example. The same logic applies to why the third and fourth examples yield the same result.
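The equivalence can be checked directly. This sketch compares four literals that carry one, two, three, and four backslashes before the closing slash; the variable names are only for illustration.

```php
<?php
// Four literals, from one to four backslashes before the closing slash.
$one   = '/^.*\/';
$two   = '/^.*\\/';
$three = '/^.*\\\/';
$four  = '/^.*\\\\/';

var_dump($one === $two);      // both values hold a single backslash
var_dump($three === $four);   // both values hold two backslashes
var_dump(strlen($one), strlen($three));
```

Comparing the string values, rather than eyeballing the literals, removes any doubt about how many backslashes the regex engine will actually see.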

You may use PHP double-quoted string literals, of course, but they're much less convenient. They support a fair number of string metasequences, all of which must be coded around when trying to render a regex as a string literal.

10.2.1.2. Delimiters

The preg engine requires delimiters around the regex because the designers wanted to provide a more Perl-like appearance, especially with respect to pattern modifiers. Some programmers may find it hard to justify the hassle of required delimiters compared to providing pattern modifiers in other ways, but for better or worse, this is the way it is. (For one example of "worse," see the sidebar on page 448.)

Slashes are the most common delimiters, but you may use any non-alphanumeric, non-whitespace ASCII character except a backslash. Pairs of '!' and '#' are used fairly often as well.

If the first delimiter is one of the "opening" punctuations:

    { ( < [

the closing delimiter becomes the appropriate matching closing punctuation:

    } ) > ]

When using one of these "paired" delimiters, the delimiters may be nested, so it's actually possible to use something like '((\d+))' as the pattern-argument string. In this example, the outer parentheses are the pattern-argument delimiters, and the inner parentheses are part of the regular expression those delimiters enclose. In the interest of clarity, though, I'd avoid relying on this and use the plain and simple '/(\d+)/' instead.

Delimiters may be escaped within the regex part of the pattern-argument string, so something like '/<B>(.*?)<\/B>/i' is allowed, although again, a different delimiter may appear less cluttered, as with '!<B>(.*?)</B>!i', which uses '!...!' as the delimiters, or '{<B>(.*?)</B>}i', which uses '{...}'.
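As a quick check of the delimiter choices discussed above, this sketch wraps the same regex four ways and confirms that each captures the same text. The sample $html string is made up for illustration.

```php
<?php
$html = 'Some <b>bold</b> text';

// The same regex, <B>(.*?)</B>, wrapped four different ways.
$patterns = array(
    '/<B>(.*?)<\/B>/i',   // slash delimiters: the slash in </B> must be escaped
    '!<B>(.*?)</B>!i',    // '!' delimiters: no escaping needed
    '{<B>(.*?)</B>}i',    // paired { } delimiters
    '(<B>(.*?)</B>)i',    // paired ( ) delimiters, nested around the capturing group
);

foreach ($patterns as $pattern) {
    preg_match($pattern, $html, $m);
    print "$pattern captures: $m[1]\n";
}
```

The last form works, but it shows why nested paired delimiters are best avoided: it is hard to tell at a glance which parentheses capture.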

10.2.1.3. Pattern modifiers

A variety of mode modifiers (called pattern modifiers in the PHP vernacular) may be placed after the closing delimiter, or in some cases, within the regex itself, to modify certain aspects of a pattern's use. We've seen the case-insensitive i pattern modifier in some of the examples so far. Here's a summary of all pattern modifiers allowed:

Modifier   Inline   Page   Description
---------  -------  -----  ---------------------------------------------
i          (?i)     110    Ignore letter case during match
m          (?m)     112    Enhanced line anchor match mode
s          (?s)     111    Dot-matches-all match mode
x          (?x)     111    Free-spacing and comments regex mode
u                   447    Consider regex and target strings as encoded in UTF-8
X          (?X)     447    Enable PCRE "extra stuff"
e                   459    Execute replacement as PHP code (preg_replace only)
S                   478    Invoke PCRE's "study" optimization attempt

The following are rarely used:

U          (?U)     447    Swap greediness of * and *?, etc.
A                   447    Anchor entire match to the attempt's initial starting position
D                   447    $ matches only at EOS, not at newline before EOS (ignored if the m pattern modifier is used)


Pattern modifiers within the regex. When embedded within a regex, pattern modifiers can appear standalone to turn a feature on or off (such as (?i) to turn on case-insensitive matching, and (?-i) to turn it off; page 135). Used this way, they remain in effect until the end of the enclosing set of parentheses, if any, or otherwise, until the end of the regular expression.

They can also be used as mode-modified spans (page 135), such as (?i:...) to turn on case-insensitive matching for the duration of the span, or (?-sm:...) to turn off s and m modes for the duration of the span.

Mode modifiers outside the regex. Modifiers can be combined, in any order, after the final delimiter, as with this example's 'si', which enables both case-insensitive and dot-matches-all modes:

    if (preg_match('{<title>(.*?)</title>}si', $html, $captures))

PHP-specific modifiers. The first four pattern modifiers listed in the table are fairly standard and are discussed in Chapter 3 (page 110). The e pattern modifier is used only with preg_replace, and it is covered in that section (page 459).
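To compare the placements, this sketch applies the same title regex with trailing modifiers, with standalone inline modifiers, and with a mode-modified span. The sample $html value, with an uppercase tag and an embedded newline, is an assumption chosen to exercise both modes.

```php
<?php
$html = "<TITLE>Preg\nFunctions</TITLE>";

// Trailing modifiers: s lets dot match the newline, i ignores case.
var_dump(preg_match('{<title>(.*?)</title>}si', $html, $captures));

// The same modes requested inline, at the start of the regex.
var_dump(preg_match('{(?si)<title>(.*?)</title>}', $html));

// A mode-modified span: only the tag name is matched case-insensitively,
// and without s the dot cannot cross the newline, so this attempt fails.
var_dump(preg_match('{<(?i:title)>(.*?)</(?i:title)>}', $html));
```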

The u pattern modifier tells the preg regex engine to consider the bytes of the regular expression and subject string to be encoded in UTF-8. The use of this modifier doesn't change the bytes, but merely how the regex engine considers them. By default (that is, without the u pattern modifier), the preg engine considers data passed to it as being in the current 8-bit locale (page 87). If you know the data is encoded in UTF-8, use this modifier; otherwise, do not. Non-ASCII characters in UTF-8-encoded text are encoded with multiple bytes, and using this u modifier ensures that those multiple bytes are indeed taken as single characters.

The X pattern modifier turns on PCRE "extra stuff," which currently has only one effect: to generate an error when a backslash is used in a situation other than as part of a known metasequence. For example, by default, \k has no special meaning to PCRE, and it's treated as k (the backslash, not being part of a known metasequence, is ignored). Using the X modifier causes this situation to result in an "unrecognized character follows \" fatal error.
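A small sketch of the difference the u modifier makes: the subject below is a hypothetical four-character word whose last character takes two bytes in UTF-8.

```php
<?php
// "e with acute accent" written as its two UTF-8 bytes, so the example does
// not depend on how this source file itself is encoded.
$subject = "caf\xC3\xA9";

// Without u, each byte is treated as a character, so dot matches once per byte.
print preg_match_all('/./', $subject, $bytes) . "\n";

// With u, the two bytes of the accented letter are taken as one character.
print preg_match_all('/./u', $subject, $chars) . "\n";
```

The subject's bytes are identical in both calls; only the engine's interpretation of them changes.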

Future versions of PHP may include versions of PCRE that ascribe special meaning to currently unspecial backslash-letter combinations, so in the interest of future compatibility (and general readability), it's best not to escape letters unless they currently have special meaning. In this regard, the use of the X pattern modifier makes a lot of sense, because it can point out typos or similar mistakes.

The S pattern modifier invokes PCRE's "study" feature, which pre-analyzes the regular expression, and in some well-defined cases, can result in a substantially faster match attempt. It's covered in this chapter's section on efficiency, starting on page 478.

The remaining pattern modifiers are esoteric and rarely used:

  • The A pattern modifier anchors the match to where the match attempt is first started, as if the entire regex leads off with \G . Using the car analogy from Chapter 4, this is akin to turning off the "bump-along" by the transmission (page 148).

  • The D pattern modifier effectively turns each $ into \z (page 112), which means that $ matches right at the end of the string as always, but not before a string-ending newline.

  • The U pattern modifier swaps metacharacter greediness: * is treated as *? and vice versa, + is treated as +? and vice versa, etc. I would guess that the primary effect of this pattern modifier is to create confusion, so I certainly don't recommend it.
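The three rarely used modifiers can be seen side by side in this sketch, which runs each regex with and without the modifier. The subject strings are made up for illustration.

```php
<?php
// A: the match must begin where the attempt starts (no bump-along).
var_dump(preg_match('/\d+/',  'abc 123'));   // finds the digits later in the string
var_dump(preg_match('/\d+/A', 'abc 123'));   // fails: no digit at the start

// D: $ no longer matches before a string-ending newline.
var_dump(preg_match('/^\d+$/',  "123\n"));   // matches before the newline
var_dump(preg_match('/^\d+$/D', "123\n"));   // fails

// U: greediness is swapped, so .* behaves as .*? would.
preg_match('/<.*>/',  '<b>bold</b>', $greedy);
preg_match('/<.*>/U', '<b>bold</b>', $lazy);
print "$greedy[0] vs. $lazy[0]\n";
```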

"Unknown Modifier" Errors

On more than a few occasions, a program I'm working on suddenly generates an "Unknown Modifier" error. I scratch my head for a bit trying to figure out what is causing such an error, when it finally dawns on me that I've forgotten to add delimiters to a regular expression when creating a pattern argument.

For example, I might have intended to match an HTML tag:

 preg_match('<(\w+)([^>]*)>', $html) 

Despite my intention for the leading '<' to be part of the regex, preg_match considers it to be the opening delimiter (and really, who can blame it, with my having forgotten to supply one, after all?). So, it takes the first '>' as the closing delimiter and interprets the argument as the regex (\w+)([^ wrapped in '<...>', followed by ']*)>' as the pattern modifiers.

As a regex, (\w+)([^ is not valid, but before getting so far as to notice and report that error, it tries to interpret ']*)>' as a list of pattern modifiers. None of them are valid pattern modifiers, of course, so it generates an error with the first one it sees:

 Warning: Unknown modifier ']' 

In hindsight, it's clear that I need to wrap delimiters around the regex:

    preg_match('/<(\w+)(.*?)>/', $html)

Unless I'm actively thinking about PHP pattern modifiers, the kind of modifier the error refers to doesn't necessarily "click," so sometimes it takes a few moments for me to figure it out. I feel quite silly every time this happens, but luckily, no one knows I make such silly mistakes.

Thankfully, recent versions of PHP 5 report the function name as well:

 Warning: preg_match(): Unknown modifier ']' 

The function name puts me in the proper frame of mind to understand the problem immediately. Still, forgetting the delimiters can cost time in other ways, as there are cases where no error is reported. Consider this version of the previous example:

 preg_match('<(\w+)(.*?)>', $html) 

Although I've forgotten the delimiters, the remaining code makes for (\w+)(.*?), a perfectly valid regular expression. The only indication that anything is wrong is that it won't match as I expect. These kinds of silent errors are the most insidious.
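The silent failure is easy to reproduce. In this sketch (the sample $html string is hypothetical), the pattern without delimiters runs without complaint but matches a bare word instead of a tag:

```php
<?php
$html = 'see <b>bold</b>';

// Delimiters forgotten: '<' and '>' are taken as the delimiters, leaving the
// regex (\w+)(.*?), which happily matches the first word in the string.
preg_match('<(\w+)(.*?)>', $html, $wrong);

// Delimiters supplied: the angle brackets are now part of the regex.
preg_match('/<(\w+)(.*?)>/', $html, $right);

print "without delimiters: $wrong[0]\n";
print "with delimiters:    $right[0]\n";
```

Printing the overall match while developing a pattern is one way to catch this kind of mistake early, since the function's return value alone reports success in both cases.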
