Section 10.1. PHP's Regex Flavor | Mastering Regular Expressions

10.1. PHP's Regex Flavor

Table 10-1. Overview of PHP's preg Regular-Expression Flavor

Character Shorthands ①
☞ 115	(c)	`\a [\b] \e \f \n \r \t \octal \xhex \x{hex} \cchar`
Character Classes and Class-Like Constructs
☞ 118		Classes: `[…] [^…]` (may contain POSIX-like `[:alpha:]` notation; ☞ 127)
☞ 119		Any character except newline: dot (with s pattern modifier, any character at all)
☞ 120	(u)	Unicode combining sequence: `\X`
☞ 120	(c)	Class shorthands: `\w \d \s \W \D \S` (8-bit only) ②
☞ 121	(c) (u)	Unicode properties and scripts ③: `\p{Prop} \P{Prop}`
☞ 120		Exactly one byte (can be dangerous) ④: `\C`
Anchors and Other Zero-Width Tests
☞ 129		Start of line/string: `^ \A`
☞ 129		End of line/string ⑤: `$ \z \Z`
☞ 130		Start of current match: `\G`
☞ 133		Word boundary: `\b \B` (8-bit only) ②
☞ 133		Lookaround ⑥: `(?=…) (?!…) (?<=…) (?<!…)`
Comments and Mode Modifiers
☞ 446		Mode modifiers: `(?mods-mods)` Modifiers allowed: `x` ⑦ `s m i X U`
☞ 446		Mode-modified spans: `(?mods-mods:…)`
☞ 136		Comments: `(?#…)` (with x pattern modifier, also from '#' until newline or end of regex)
Grouping, Capturing, Conditional, and Control
☞ 137		Capturing parentheses: `(…) \1 \2` ...
☞ 138		Named capture: `(?P<name>…) (?P=name)`
☞ 137		Grouping-only parentheses: `(?:…)`
☞ 139		Atomic grouping: `(?>…)`
☞ 139		Alternation: `|`
☞ 475		Recursion: `(?R) (?num) (?P>name)`
☞ 140		Conditional: `(?if then|else)` - "if" can be lookaround, (R), or (num)
☞ 141		Greedy quantifiers: `* + ? {n} {n,} {x,y}`
☞ 141		Lazy quantifiers: `*? +? ?? {n}? {n,}? {x,y}?`
☞ 142		Possessive quantifiers: `*+ ++ ?+ {n}+ {n,}+ {x,y}+`
☞ 136	(c)	Literal (non-metacharacter) span: `\Q ... \E`
(This table also serves to describe PCRE, the regex library behind PHP's preg functions ☞ 91)

(c) - may also be used within a character class; ① through ⑦ - see text

(u) - only in conjunction with the u pattern modifier (☞ 447)

Table 10-1 on the previous page summarizes the preg engine's regex flavor. The following notes supplement the table:


① \b is a character shorthand for backspace only within a character class. Outside of a character class, \b matches a word boundary (☞ 133).

Octal escapes are limited to two- and three-digit 8-bit values. The special one-digit \0 sequence matches a NUL byte.

\xhex allows one- and two-digit hexadecimal values, while \x{hex} allows any number of digits. Note, however, that values greater than \x{FF} are valid only with the u pattern modifier (☞ 447). Without the u pattern modifier, values larger than \x{FF} result in an invalid-regex error.
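The two meanings of \b and the \x{FF} limit can be checked with a short sketch along these lines. The variable names are illustrative only, and the `@` operator is used simply to silence the warning that preg_match raises when a pattern fails to compile.

```php
<?php
// Inside a character class, \b is a backspace (byte 0x08).
$inClass = preg_match('/[\b]/', "a\x08b");             // subject contains a backspace

// Outside a class, \b is a zero-width word boundary.
$boundary   = preg_match('/\bcat\b/', 'a cat sat');    // "cat" stands alone
$noBoundary = preg_match('/\bcat\b/', 'concatenate');  // "cat" is buried in a word

// The one-digit \0 escape matches a NUL byte.
$nul = preg_match('/\0/', "a\0b");

// \x{...} values above FF need the u modifier.
// "\xE2\x98\xBA" is the UTF-8 encoding of U+263A.
$withU = preg_match('/\x{263A}/u', "\xE2\x98\xBA");

// Without u, the same pattern is rejected as invalid:
// preg_match() returns false instead of 0 or 1.
$withoutU = @preg_match('/\x{263A}/', "\xE2\x98\xBA");

var_dump($inClass, $boundary, $noBoundary, $nul, $withU, $withoutU);
```

Note that the patterns are written in single-quoted PHP strings, so the backslash sequences reach the regex engine untouched.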

② Even in UTF-8 mode (via the u pattern modifier), word boundaries and class shorthands such as \w work only with ASCII characters. If you need the full breadth of Unicode characters, consider using \pL (☞ 121) instead of \w, \pN instead of \d, and \pZ instead of \s.
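As a minimal sketch of the Unicode-property alternatives, the following matches non-ASCII letters, digits, and separators with the u modifier. The sample strings are assumptions chosen for illustration. The sketch deliberately makes no claim about what \w matches under u, since that can depend on the PCRE version and build behind a given PHP installation.

```php
<?php
// "café" in UTF-8; the last letter is U+00E9.
$word = "caf\xC3\xA9";
$letters = preg_match('/^\pL+$/u', $word);

// U+0663 ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
$digit = preg_match('/^\pN+$/u', "\xD9\xA3");

// U+00A0 NO-BREAK SPACE, a non-ASCII space separator.
$space = preg_match('/a\pZb/u', "a\xC2\xA0b");

// \pZ covers separators only: a tab is a control character, not a separator.
$tab = preg_match('/a\pZb/u', "a\tb");

var_dump($letters, $digit, $space, $tab);
```

The last line is worth noting when swapping \pZ in for \s: tabs and newlines are control characters, so a pattern that must accept them needs to list them explicitly.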

③ Unicode support is as of Unicode Version 4.1.0.

Unicode scripts (☞ 122) are supported without any kind of 'Is' or 'In' prefix, as with \p{Cyrillic}.

One- and two-letter Unicode properties are supported, such as \p{Lu}, \p{L}, and the \pL shorthand for one-letter property names (☞ 121). Long names such as \p{Letter} are not supported.

The special \p{L&} (☞ 121) is also supported, as is \p{Any} (which matches any character).

④ By default, preg-suite regular expressions are byte oriented, and as such, \C defaults to being the same as (?s:.), an s-modified dot. However, with the u modifier, preg-suite regular expressions become UTF-8 oriented, which means that a character can be composed of up to six bytes. Even so, \C still matches only a single byte. See the caution on page 120.

⑤ \z and \Z can both match at the very end of the subject string, while \Z can also match at a final-character newline.

The meaning of $ depends on the m and D pattern modifiers (☞ 446) as follows: with neither pattern modifier, $ matches as \Z (before string-ending newline, or at the end of the string); with the m pattern modifier, it can also match before an embedded newline; with the D pattern modifier, it matches as \z (only at the end of the string). If both the m and D pattern modifiers are used, D is ignored.
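The interaction of $, \Z, and \z with the m and D modifiers is easy to get wrong, so here is a small sketch that exercises each case described above. The subject strings and the `$results` array keys are illustrative names, not anything defined by the preg functions.

```php
<?php
$trailing = "abc\n";     // subject ending in a newline
$embedded = "abc\ndef";  // subject with a newline in the middle

$results = [
    // Neither modifier: $ acts like \Z and matches before the final newline.
    'plain'       => preg_match('/abc$/',   $trailing),
    // D: $ acts like \z and matches only at the very end of the string.
    'D'           => preg_match('/abc$/D',  $trailing),
    // m and D together: D is ignored.
    'mD'          => preg_match('/abc$/mD', $trailing),
    // The explicit anchors, for comparison.
    'Z'           => preg_match('/abc\Z/',  $trailing),
    'z'           => preg_match('/abc\z/',  $trailing),
    // An embedded newline is an end of line only with m.
    'embedded'    => preg_match('/abc$/',   $embedded),
    'embedded_m'  => preg_match('/abc$/m',  $embedded),
];

print_r($results);
```

When validating input that must not end in a stray newline, \z or the D modifier is the safer choice than a bare $.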

⑥ Lookbehind is limited to subexpressions that match a fixed length of text, except that top-level alternatives of different fixed lengths are allowed (☞ 133).
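A brief sketch of the lookbehind restriction follows. The patterns are made up for illustration. The rejected case uses an unbounded quantifier, and `@` merely hides the compile warning so that the false return value can be inspected.

```php
<?php
// Top-level alternatives of different fixed lengths (2 and 3) are accepted.
$twoLengths = preg_match('/(?<=ab|cde)x/', 'cdex');

// The lookbehind must actually be satisfied for a match.
$noContext = preg_match('/(?<=ab|cde)x/', 'zzx');

// An unbounded quantifier has no fixed length, so the pattern
// fails to compile and preg_match() returns false.
$unbounded = @preg_match('/(?<=a+)x/', 'aax');

var_dump($twoLengths, $noContext, $unbounded);
```

Checking for a `false` return (as distinct from `0`) is how a script can tell an invalid pattern apart from a pattern that simply did not match.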

⑦ The x pattern modifier (free spacing and comments) recognizes only ASCII whitespace, and does not recognize other whitespace found in Unicode.