10.1. PHP's Regex Flavor

Table 10-1. Overview of PHP's preg Regular-Expression Flavor

Character Shorthands [1]

  ☞ 115  (c)     \a [\b] \e \f \n \r \t \octal \xhex \x{hex} \cchar

Character Classes and Class-Like Constructs

  ☞ 118          Classes: [...] [^...] (may contain POSIX-like [:alpha:] notation; ☞ 127)
  ☞ 119          Any character except newline: dot (with s pattern modifier, any character at all)
  ☞ 120  (u)     Unicode combining sequence: \X
  ☞ 120  (c)     Class shorthands: \w \d \s \W \D \S (8-bit only) [2]
  ☞ 121  (c)(u)  Unicode properties and scripts: [3] \p{Prop} \P{Prop}
  ☞ 120          Exactly one byte (can be dangerous): [4] \C

Anchors and Other Zero-Width Tests

  ☞ 129          Start of line/string: ^ \A
  ☞ 129          End of line/string: [5] $ \z \Z
  ☞ 130          Start of current match: \G
  ☞ 133          Word boundary: \b \B (8-bit only) [2]
  ☞ 133          Lookaround: [6] (?=...) (?!...) (?<=...) (?<!...)

Comments and Mode Modifiers

  ☞ 446          Mode modifiers: (?mods-mods) Modifiers allowed: x [7] s m i X U
  ☞ 446          Mode-modified spans: (?mods-mods:...)
  ☞ 136          Comments: (?#...) (with x pattern modifier, also from '#' until newline or end of regex)

Grouping, Capturing, Conditional, and Control

  ☞ 137          Capturing parentheses: (...) \1 \2 ...
  ☞ 138          Named capture: (?P<name>...) (?P=name)
  ☞ 137          Grouping-only parentheses: (?:...)
  ☞ 139          Atomic grouping: (?>...)
  ☞ 139          Alternation: |
  ☞ 475          Recursion: (?R) (?num) (?P>name)
  ☞ 140          Conditional: (?if then|else) - "if" can be lookaround, (R), or (num)
  ☞ 141          Greedy quantifiers: * + ? {n} {n,} {x,y}
  ☞ 141          Lazy quantifiers: *? +? ?? {n}? {n,}? {x,y}?
  ☞ 142          Possessive quantifiers: *+ ++ ?+ {n}+ {n,}+ {x,y}+
  ☞ 136  (c)     Literal (non-metacharacter) span: \Q...\E

(This table also serves to describe PCRE, the regex library behind PHP's preg functions; ☞ 91)

  (c) - may also be used within a character class
  (u) - only in conjunction with the u pattern modifier (☞ 447)
  [1] ... [7] - see the notes in the text
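
A few of the grouping and quantifier constructs listed in the table can be tried out directly with preg_match. The following is only a minimal sketch: the subject strings and variable names are made up for illustration, and the results in the comments assume a reasonably recent PHP build with its bundled PCRE library.

```php
<?php
// Named capture (?P<name>...) with a named backreference (?P=name):
// match a quoted string whose closing quote equals its opening quote.
preg_match('/(?P<quote>[\'"])(?P<body>.*?)(?P=quote)/', 'say "hi" now', $m);
$body = $m['body'];                                   // "hi"

// Recursion (?R): match a balanced, possibly nested, parenthesized group.
preg_match('/\((?:[^()]++|(?R))*\)/', 'f(a(b)c) d', $r);
$balanced = $r[0];                                    // "(a(b)c)"

// Greedy vs. possessive: \d+ gives back the final 5, \d++ never does.
$greedy     = preg_match('/\d+5/',  '12345');         // 1
$possessive = preg_match('/\d++5/', '12345');         // 0

// Literal span \Q...\E: the + is taken literally inside the span.
$literal = preg_match('/\Q1+1=2\E/', 'so 1+1=2');     // 1
$meta    = preg_match('/1+1=2/',     'so 1+1=2');     // 0

var_dump($body, $balanced, $greedy, $possessive, $literal, $meta);
```

The possessive case fails because \d++ consumes every digit and refuses to backtrack, so there is never a 5 left for the rest of the regex to match.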

Table 10-1 summarizes the preg engine's regex flavor. The following notes supplement the table:

  • [1] \b is a character shorthand for backspace only within a character class. Outside of a character class, \b matches a word boundary (☞ 133).

    Octal escapes are limited to two- and three-digit 8-bit values. The special one-digit \0 sequence matches a NUL byte.

    \xhex allows one- and two-digit hexadecimal values, while \x{hex} allows any number of digits. Note, however, that values greater than \x{FF} are valid only with the u pattern modifier (☞ 447). Without the u pattern modifier, values larger than \x{FF} result in an invalid-regex error.

  • [2] Even in UTF-8 mode (via the u pattern modifier), word boundaries and class shorthands such as \w work only with ASCII characters. If you need to consider the full breadth of Unicode characters, consider using \pL (☞ 121) instead of \w, \pN instead of \d, and \pZ instead of \s.

  • [3] Unicode support is as of Unicode Version 4.1.0.

    Unicode scripts (☞ 122) are supported without any kind of 'Is' or 'In' prefix, as with \p{Cyrillic}.

    One- and two-letter Unicode properties are supported, such as \p{Lu}, \p{L}, and the \pL shorthand for one-letter property names (☞ 121). Long names such as \p{Letter} are not supported.

    The special \p{L&} (☞ 121) is also supported, as is \p{Any} (which matches any character).

  • [4] By default, preg-suite regular expressions are byte oriented, and as such, \C defaults to being the same as (?s:.), an s-modified dot. However, with the u modifier, preg-suite regular expressions become UTF-8 oriented, which means that a character can be composed of up to six bytes. Even so, \C still matches only a single byte. See the caution on page 120.

  • [5] \z and \Z can both match at the very end of the subject string, while \Z can also match at a final-character newline.

    The meaning of $ depends on the m and D pattern modifiers (☞ 446) as follows: with neither pattern modifier, $ matches as \Z (before a string-ending newline, or at the end of the string); with the m pattern modifier, it can also match before an embedded newline; with the D pattern modifier, it matches as \z (only at the end of the string). If both the m and D pattern modifiers are used, D is ignored.

  • [6] Lookbehind is limited to subexpressions that match a fixed length of text, except that top-level alternatives of different fixed lengths are allowed (☞ 133).

  • [7] The x pattern modifier (free spacing and comments) recognizes only ASCII whitespace, and does not recognize other whitespace found in Unicode.
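
The notes on $, \z, \Z, and lookbehind can be checked with a short script. This is a sketch under stated assumptions rather than a definitive test: the subject strings and variable names are invented for illustration, and the results in the comments are what the rules described above predict.

```php
<?php
$trailing = "foo\n";     // subject ending in a newline
$embedded = "foo\nbar";  // subject with a newline in the middle

// With neither m nor D, $ behaves like \Z: it matches before a final newline.
$dollar  = preg_match('/foo$/',  $trailing);   // 1
$bigZ    = preg_match('/foo\Z/', $trailing);   // 1

// With D, $ behaves like \z: only the very end of the string qualifies.
$dollarD = preg_match('/foo$/D', $trailing);   // 0
$smallZ  = preg_match('/foo\z/', $trailing);   // 0

// With m, $ can also match before an embedded newline; D is then ignored.
$noM     = preg_match('/foo$/',   $embedded);  // 0
$withM   = preg_match('/foo$/m',  $embedded);  // 1
$withMD  = preg_match('/foo$/mD', $embedded);  // 1

// Lookbehind: top-level alternatives may have different fixed lengths.
$lookbehind = preg_match('/(?<=a|bc)d/', 'bcd');  // 1

var_dump($dollar, $bigZ, $dollarD, $smallZ, $noM, $withM, $withMD, $lookbehind);
```

When a pattern must match only at the true end of the string, \z avoids any dependence on which pattern modifiers happen to be in effect.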

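The difference between byte-oriented and UTF-8-oriented matching, and the two meanings of \b, can be illustrated as follows. This is a hedged sketch: the variable names are hypothetical, the subjects are written as explicit byte escapes so the file's own encoding does not matter, and the error case is suppressed with @ because PHP reports an invalid regex as a warning and returns false.

```php
<?php
$eAcute = "\xC3\xA9";      // "é" encoded as two UTF-8 bytes
$smiley = "\xE2\x98\xBA";  // U+263A encoded as three UTF-8 bytes

// Without u, an s-modified dot matches one byte; with u, one character.
$byteCount = preg_match_all('/./s',  $eAcute);   // 2
$charCount = preg_match_all('/./su', $eAcute);   // 1

// \x{...} above FF is valid only with the u pattern modifier.
$withU    = preg_match('/\x{263A}/u', $smiley);  // 1
$withoutU = @preg_match('/\x{263A}/', $smiley);  // false (invalid regex)

// \b is a backspace inside a class and a word boundary outside one.
$backspace = preg_match('/[\b]/', "\x08");       // 1
$boundary  = preg_match('/\bcat\b/', 'a cat');   // 1
$inside    = preg_match('/\bcat\b/', 'concat');  // 0

var_dump($byteCount, $charCount, $withU, $withoutU, $backspace, $boundary, $inside);
```

Checking the return value with === false distinguishes an invalid regex from a valid regex that simply failed to match, which returns 0.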

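The Unicode properties and scripts described in the notes can be exercised with the u pattern modifier. The sketch below uses made-up subject strings, written as UTF-8 byte escapes, and hypothetical variable names. It deliberately does not assert how \w, \d, or \s behave under u, since that has differed between PHP and PCRE versions.

```php
<?php
$naive    = "na\xC3\xAFve";              // "naïve" in UTF-8
$cyrillic = "\xD0\x9F\xD1\x80\xD0\xB8";  // three Cyrillic letters (U+041F U+0440 U+0438)
$digit    = "\xD9\xA3";                  // ARABIC-INDIC DIGIT THREE (U+0663)
$nbsp     = "\xC2\xA0";                  // NO-BREAK SPACE (U+00A0)

// \pL is the shorthand for \p{L}: any Unicode letter.
$letters = preg_match('/^\pL+$/u', $naive);                // 1

// Scripts take no 'Is' or 'In' prefix.
$script  = preg_match('/^\p{Cyrillic}+$/u', $cyrillic);    // 1

// Two-letter properties such as \p{Lu} (uppercase letter).
$upper   = preg_match_all('/\p{Lu}/u', 'PhP RegEx');       // 4

// \p{L&} covers cased letters (Lu, Ll, Lt); the digit is skipped.
$cased   = preg_match_all('/\p{L&}/u', 'aB1');             // 2

// \pN and \pZ as Unicode-aware stand-ins for \d and \s.
$number  = preg_match('/^\pN$/u', $digit);                 // 1
$space   = preg_match('/^\pZ$/u', $nbsp);                  // 1

var_dump($letters, $script, $upper, $cased, $number, $space);
```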

Mastering Regular Expressions
ISBN: 0596528124
Year: 2004
Pages: 113