10.2. The Preg Function Interface

PHP's interface to its regex engine is purely procedural (☞ 95), provided by the six functions shown at the top of Table 10-2. For reference, the table also shows four useful functions that are presented later in this chapter.

Table 10-2. Overview of PHP's Preg Functions
What each function actually does is greatly influenced by the type and number of arguments provided, the function flags, and the pattern modifiers used with the regex. Before looking at all the details, let's see a few examples to get a feel for how regexes look and how they are handled in PHP:

    /* Check whether HTML tag is a <table> tag */
    if (preg_match('/^<table\b/i', $tag))
        print "tag is a table tag\n";
    -----------------------------------
    /* Check whether text is an integer */
    if (preg_match('/^-?\d+$/', $user_input))
        print "user input is an integer\n";
    -----------------------------------
    /* Pluck HTML title from a string */
    if (preg_match('{<title>(.*?)</title>}si', $html, $matches))
        print "page title: $matches[1]\n";
    -----------------------------------
    /* Treat numbers in string as Fahrenheit values and replace with Celsius values */
    $metric = preg_replace('/(-?\d+(?:\.\d+)?)/e',      /* pattern */
                           'floor(($1-32)*5/9 + 0.5)', /* replacement code */
                           $string);
    -----------------------------------
    /* Create an array of values from a string filled with simple comma-separated values */
    $values_array = preg_split('!\s*,\s*!', $comma_separated_values);

The last example, when given a string such as 'Larry, Curly, Moe', produces an array with three elements: the strings 'Larry', 'Curly', and 'Moe'.

10.2.1. "Pattern" Arguments

The first argument to any of the preg functions is a pattern argument, which is the regex wrapped by a pair of delimiters, possibly followed by pattern modifiers. In the first example above, the pattern argument is '/^<table\b/i', which is the regex ^<table\b wrapped by a pair of slashes (the delimiters), followed by the i (case-insensitive match) pattern modifier.

10.2.1.1. PHP single-quoted strings

Because of a regex's propensity to include backslashes, it's most convenient to use PHP's single-quoted strings when providing these pattern arguments as string literals.
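As a runnable sketch of the introductory examples above, with every pattern argument written as a single-quoted string literal: the sample data and variable values below are made up for illustration. The sketch assumes PHP 7 or later, where the e modifier is no longer available (it was deprecated in PHP 5.5 and removed in 7.0), so the temperature conversion is expressed with preg_replace_callback instead of replacement code.

```php
<?php
// Sample inputs (hypothetical, for illustration only).
$tag        = '<TABLE border="1">';
$user_input = '-42';
$html       = "<html><head><title>Preg\nDemo</title></head></html>";
$string     = 'Highs of 212 and lows of -40 or 98.6';

// The i modifier lets <table match <TABLE.
$is_table = preg_match('/^<table\b/i', $tag);

// Anchored at both ends, so the whole input must be an integer.
$is_int = preg_match('/^-?\d+$/', $user_input);

// The s modifier lets the dot match the newline inside the title.
$title = preg_match('{<title>(.*?)</title>}si', $html, $matches)
    ? $matches[1]
    : null;

// Callback form of the Fahrenheit-to-Celsius replacement:
// $m[1] holds the text captured by the first set of parentheses.
$metric = preg_replace_callback(
    '/(-?\d+(?:\.\d+)?)/',
    function (array $m) {
        return (string) floor(((float) $m[1] - 32) * 5 / 9 + 0.5);
    },
    $string
);

// Split on commas, discarding whitespace around each comma.
$values_array = preg_split('!\s*,\s*!', 'Larry, Curly, Moe');

print "table: $is_table, integer: $is_int\n";
print "metric: $metric\n";
print implode('|', $values_array) . "\n";
```

None of these patterns needs a doubled backslash, because \b, \d, \s, and \. are not among the sequences a single-quoted literal treats specially.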
PHP's string literals are also covered in Chapter 3 (☞ 103), but in short, you don't need to add many extra escapes to a regular expression when rendering it within a single-quoted string literal. PHP single-quoted strings have only two string metasequences, '\'' and '\\', which include a single quote and a backslash into the string's value, respectively.

One common exception requiring extra escapes is when you want \\ within the regex, which matches a single backslash character. Within a single-quoted string literal, each \ requires \\, so \\ requires \\\\. All this to match one backslash. Phew! (You can see an extreme example of this kind of backslash-itis on page 473.)

As a concrete example, consider a regex to match a Windows disk's root path, such as 'C:\'. An expression for that is ^[A-Z]:\\$, which when included within a single-quoted string literal appears as '^[A-Z]:\\\\$'.

In a Chapter 5 example on page 190, we saw that ^.*\\ required a pattern argument string of '/^.*\\\/', with three backslashes. With that in mind, I find the following examples to be illustrative:

    print '/^.*\/';       prints /^.*\/
    print '/^.*\\/';      prints /^.*\/
    print '/^.*\\\/';     prints /^.*\\/
    print '/^.*\\\\/';    prints /^.*\\/

The first two examples yield the same result through different means. In the first, the '\/' sequence at the end is not special to a single-quoted string literal, so the sequence appears verbatim in the string's value. In the second example, the '\\' sequence is special to the string literal, yielding a single '\' in the string's value. This, when combined with the character that follows (the slash), yields the same '\/' in the value as in the first example. The same logic applies to why the third and fourth examples yield the same result.

You may use PHP double-quoted string literals, of course, but they're much less convenient.
They support a fair number of string metasequences, all of which must be coded around when trying to render a regex as a string literal.

10.2.1.2. Delimiters

The preg engine requires delimiters around the regex because the designers wanted to provide a more Perl-like appearance, especially with respect to pattern modifiers. Some programmers may find it hard to justify the hassle of required delimiters compared to providing pattern modifiers in other ways, but for better or worse, this is the way it is. (For one example of "worse," see the sidebar on page 448.)

You may use any non-alphanumeric, non-whitespace ASCII character except a backslash as the delimiter. A pair of slashes is most common, but pairs of '!' and '#' are used fairly often as well.

If the first delimiter is one of the "opening" punctuations:

    { ( < [

the closing delimiter becomes the appropriate matching closing punctuation:

    } ) > ]

When using one of these "paired" delimiters, the delimiters may be nested, so it's actually possible to use something like '((\d+))' as the pattern-argument string. In this example, the outer parentheses are the pattern-argument delimiters, and the inner parentheses are part of the regular expression those delimiters enclose. In the interest of clarity, though, I'd avoid relying on this and use the plain and simple '/(\d+)/' instead.

Delimiters may be escaped within the regex part of the pattern-argument string, so something like '/<B>(.*?)<\/B>/i' is allowed, although again, a different delimiter may appear less cluttered, as with '!<B>(.*?)</B>!i', which uses '!…!' as the delimiters, or '{<B>(.*?)</B>}i', which uses '{…}'.

10.2.1.3. Pattern modifiers

A variety of mode modifiers (called pattern modifiers in the PHP vernacular) may be placed after the closing delimiter, or in some cases, within the regex itself, to modify certain aspects of a pattern's use.
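A minimal sketch of the two placements just described, using the i modifier: once after the closing delimiter and once inside the regex as (?i). The subject string and variable names are made up for illustration, and the calls also exercise several of the delimiter styles from the previous section.

```php
<?php
$subject = 'Price: <b>42</b>';   // hypothetical sample text

// i placed after the closing delimiter; the slash in </B> must be
// escaped because the slash is also the delimiter.
$after_delimiter = preg_match('/<B>(.*?)<\/B>/i', $subject, $m1);

// The same modifier placed inside the regex; with ! as the delimiter,
// the slash needs no escape.
$inline = preg_match('!(?i)<B>(.*?)</B>!', $subject, $m2);

// With no modifier at all, <B> does not match the lowercase <b>.
$case_sensitive = preg_match('#<B>(.*?)</B>#', $subject);

// Paired delimiters may nest: the outer parentheses are the delimiters,
// the inner parentheses are a capturing group in the regex.
preg_match('((\d+))', $subject, $m3);

print "$after_delimiter $inline $case_sensitive $m3[1]\n";
```

Both modifier placements find the same match here; the choice between them matters mainly when only part of a regex should be affected.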
We've seen the case-insensitive i pattern modifier in some of the examples so far. Here's a summary of all pattern modifiers allowed:
Pattern modifiers within the regex

When embedded within a regex, pattern modifiers can appear standalone to turn a feature on or off (such as (?i) to turn on case-insensitive matching, and (?-i) to turn it off ☞ 135). Used this way, they remain in effect until the end of the enclosing set of parentheses, if any, or otherwise, until the end of the regular expression.

They can also be used as mode-modified spans (☞ 135), such as (?i:…) to turn on case-insensitive matching for the duration of the span, or (?-sm:…) to turn off s and m modes for the duration of the span.

Mode modifiers outside the regex

Modifiers can be combined, in any order, after the final delimiter, as with this example's 'si', which enables both case-insensitive and dot-matches-all modes:

    if (preg_match('{<title>(.*?)</title>}si', $html, $captures))

PHP-specific modifiers

The first four pattern modifiers listed in the table are fairly standard and are discussed in Chapter 3 (☞ 110). The e pattern modifier is used only with preg_replace, and it is covered in that section (☞ 459).

The u pattern modifier tells the preg regex engine to consider the bytes of the regular expression and subject string to be encoded in UTF-8. The use of this modifier doesn't change the bytes, but merely how the regex engine considers them. By default (that is, without the u pattern modifier), the preg engine considers data passed to it as being in the current 8-bit locale (☞ 87). If you know the data is encoded in UTF-8, use this modifier; otherwise, do not. Non-ASCII characters with UTF-8-encoded text are encoded with multiple bytes, and using this u modifier ensures that those multiple bytes are indeed taken as single characters.

The X pattern modifier turns on PCRE "extra stuff," which currently has only one effect: to generate an error when a backslash is used in a situation other than as part of a known metasequence.
For example, by default, \k has no special meaning to PCRE, and it's treated as k (the backslash, not being part of a known metasequence, is ignored). Using the X modifier causes this situation to result in an "unrecognized character follows \" fatal error.

Future versions of PHP may include versions of PCRE that ascribe special meaning to currently unspecial backslash-letter combinations, so in the interest of future compatibility (and general readability), it's best not to escape letters unless they currently have special meaning. In this regard, the use of the X pattern modifier makes a lot of sense, because it can point out typos or similar mistakes.

The S pattern modifier invokes PCRE's "study" feature, which pre-analyzes the regular expression, and in some well-defined cases, can result in a substantially faster match attempt. It's covered in this chapter's section on efficiency, starting on page 478.

The remaining pattern modifiers are esoteric and rarely used:
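Stepping back to two behaviors described earlier in this section, the following sketch checks them directly: how the u modifier changes what the dot considers a character, and how far an inline modifier reaches. The test strings are made up for illustration, and the é is spelled out as explicit bytes so the result does not depend on how the script file itself is encoded.

```php
<?php
// "\xC3\xA9" is the two-byte UTF-8 encoding of é (U+00E9).
$e_acute = "\xC3\xA9";

// Without u, the dot matches a single byte, so ^.$ cannot cover both bytes.
$as_bytes = preg_match('/^.$/', $e_acute);

// With u, the two bytes are taken as one character, so ^.$ matches.
$as_utf8 = preg_match('/^.$/u', $e_acute);

// The bytes themselves are the same either way.
$length = strlen($e_acute);

// An inline (?i) lasts only to the end of its enclosing parentheses,
// so the b outside the group is still case-sensitive.
$scoped = preg_match('/^((?i)a)b$/', 'AB');

// A mode-modified span limits the mode to the span itself.
$span = preg_match('/^(?i:a)b$/', 'Ab');

print "$as_bytes $as_utf8 $length $scoped $span\n";
```

The first two calls differ only in the modifier, which shows that u changes how the engine interprets the bytes rather than the bytes themselves.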