9.3. Regular Expressions

Although regular expressions are very powerful, they are difficult to use, especially if you're new to them. So, instead of jumping straight to the functions that PHP supports for dealing with regular expressions, we cover the pattern matching syntax first. If PCRE is enabled, the section shown in Figure 9.3 appears in the phpinfo() output.

Figure 9.3. PCRE phpinfo() output.

9.3.1. Syntax

PCRE functions check whether a text string matches a pattern. The syntax of a pattern always has the following format:

    <delimiter> <pattern> <delimiter> [<modifiers>]

The modifiers are optional. The delimiter separates the pattern from the modifiers. PCRE uses the first character of the expression as the delimiter. You should use a character that does not exist in the pattern itself. Alternatively, you can use a character that exists in your expression, but then you must escape it with the \ character. Traditionally, the / is used as the delimiter, but other common delimiters are | and @. It's your choice. Personally, in most cases, we would pick the @, unless we need to match an email address or a similar pattern that contains the @, in which case we would use the /.

The PHP function preg_match() is used to match regular expressions. The first parameter passed to the function is the pattern. The second parameter is the string to be matched against the pattern and is also called the subject. The function returns 1 if the pattern matches and 0 if it does not; it returns FALSE only if an error occurs. You can also pass a third parameter: a variable name. The text that matches is stored by reference in an array with this name. If you don't need the matching text but just want to know whether there is a match, you can leave out the third parameter. In short, the format is as follows, with $matches being optional:

    $result = preg_match($pattern, $subject, $matches);

Note: Some examples in this section omit the <?php and ?> tags, but of course, they are required.
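The delimiter rules and the return value of preg_match() can be tried out with a minimal sketch like the following. The subject string and the variable names are made up for illustration; only preg_match() itself comes from the text above.

```php
<?php
// Hypothetical subject string, chosen only for illustration.
$subject = 'Contact: derick@php.net';

// The same pattern written with two different delimiters.
// With "/" as the delimiter, the "@" inside the pattern needs no escaping.
$with_slash = '/(\w+)@(\w+\.\w+)/';
// With "@" as the delimiter, the literal "@" in the pattern must be escaped.
$with_at    = '@(\w+)\@(\w+\.\w+)@';

$result_slash = preg_match($with_slash, $subject, $matches_slash);
$result_at    = preg_match($with_at, $subject, $matches_at);

var_dump($result_slash);   // int(1): the pattern matched
print_r($matches_slash);   // full match first, then the two sub-patterns

// The choice of delimiter does not change what is matched.
var_dump($matches_slash === $matches_at);   // bool(true)

// Without the third parameter, only the match status is returned.
$has_digits = preg_match('/\d+/', $subject);
var_dump($has_digits);     // int(0): the subject contains no digits
```

Picking a delimiter that does not occur in the pattern, as in the first variant, keeps the pattern easier to read.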
9.3.1.1 Pattern Syntax

PCRE's matching syntax is very complex. A full discussion of all its details would exceed the scope of this book. We cover just the basics here, which is enough to be very useful. On most UNIX systems with the PCRE library installed, you can use man pcrepattern to read about the whole pattern matching language, or have a look at the (somewhat outdated) PHP Manual page at http://www.php.net/manual/en/pcre.pattern.syntax.php. But here we start with the simple things.

9.3.1.2 Metacharacters

The characters in Table 9.1 are special: they are the building blocks used to construct patterns.
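As a quick illustration of how a few common metacharacters behave, the following sketch runs several small patterns against sample subjects. The patterns and subjects are hypothetical and are not taken from Table 9.1; they only use standard PCRE metacharacters.

```php
<?php
// Each entry maps a pattern to a sample subject (both made up for illustration).
$tests = array(
    '/^php/'     => 'php is fun',   // ^ anchors the match at the start
    '/^fun/'     => 'php is fun',   // "fun" is not at the start, so no match
    '/fun$/'     => 'php is fun',   // $ anchors the match at the end
    '/colou?r/'  => 'color',        // ? makes the preceding "u" optional
    '/ab+c/'     => 'abbbc',        // + means one or more of the preceding "b"
    '/[0-9]{3}/' => 'room 404',     // character class with an exact repeat count
    '/cat|dog/'  => 'hot dog',      // | separates alternatives
    '/h.t/'      => 'hot dog',      // . matches any single character
);

$results = array();
foreach ($tests as $pattern => $subject) {
    $results[$pattern] = preg_match($pattern, $subject);
    printf("%-12s %-12s %d\n", $pattern, $subject, $results[$pattern]);
}
```

Every pattern except /^fun/ reports 1 here; that one reports 0 because the ^ anchor requires the match to begin at the start of the subject.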
9.3.1.3 Example 1

Let's dissect some useful complex regular expressions that we can create with the metacharacters from Table 9.1:

    $pattern = "/^([0-9a-f][0-9a-f]:){5}[0-9a-f][0-9a-f]$/";

This pattern matches a MAC address (a unique number bound to a network card) with the format 00:04:23:7c:5d:01. The pattern is bound to the start and end of our subject string with ^ and $, and it contains two parts:

    ([0-9a-f][0-9a-f]:){5} matches five groups, each made up of two lowercase hexadecimal digits followed by a colon.
    [0-9a-f][0-9a-f] matches the final two lowercase hexadecimal digits.
This regexp could also have been written as /^([0-9a-f]{2}:){5}[0-9a-f]{2}$/, which is a bit shorter. To test the text against the pattern, use the following code:

    preg_match($pattern, '00:04:23:7c:5d:01', $matches);
    print_r($matches);

With either pattern, the output is the same:

    Array
    (
        [0] => 00:04:23:7c:5d:01
        [1] => 5d:
    )

9.3.1.4 Example 2

    "/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/"

This pattern is used to match email addresses in the following format:

    'Derick Rethans <derick@php.net>'

This pattern is not good enough to match all email addresses, and it also accepts some addresses that are not valid. It only serves as a simple example. The first part is ([^<]+)<, as follows:

    [^<]+ matches one or more characters that are not <; the parentheses capture them as the name.
    < matches the literal opening angle bracket.
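The second part is ([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+), which is used to match the email address itself:

    [a-zA-Z0-9_-]+ matches the part before the @, made up of letters, digits, underscores, and hyphens.
    @ matches the literal @ sign.
    ([a-zA-Z0-9_-]+\\.)+ matches one or more (sub)domain parts, each followed by a dot.
    [a-zA-Z0-9_-]+ matches the last part of the domain name.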
Then there is the trailing > and the delimiter. The following example shows the contents of the $matches array after running the preg_match() function:

    <?php
    $string = 'Derick Rethans <derick@php.net>';
    preg_match(
        "/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/",
        $string,
        $matches
    );
    print_r($matches);
    ?>

The output is

    Array
    (
        [0] => Derick Rethans <derick@php.net>
        [1] => Derick Rethans
        [2] => derick@php.net
        [3] => php.
    )

The fourth element appears because a capturing subpattern was used for the (sub)domain part of the pattern. A non-capturing group, written (?:...), would avoid it, but of course, it doesn't hurt to have it.

9.3.1.5 Escape Sequences

As shown in the previous table, the \ character is the general escape character. In combination with the character that follows it, the \ stands for a special group of characters. Table 9.2 shows the different cases.
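The effect of the escape character can be seen in a small sketch such as the one below. The subject string is hypothetical; the sketch only relies on the standard meaning of the dot metacharacter, the \. escape, and the \d shorthand for a decimal digit.

```php
<?php
// Hypothetical subject string, for illustration only.
$subject = 'Total: 3x50 items, price 3.50';

// Unescaped, the dot is a metacharacter and matches any character,
// so the first match found is "3x50".
preg_match('/3.50/', $subject, $loose);

// Escaped with a backslash, the dot matches only a literal dot.
preg_match('/3\.50/', $subject, $strict);

// \d stands for any decimal digit; here it describes a price-like number.
preg_match('/\d+\.\d{2}/', $subject, $price);

echo $loose[0], "\n";    // 3x50
echo $strict[0], "\n";   // 3.50
echo $price[0], "\n";    // 3.50
```

When a pattern is built from arbitrary text, PHP's preg_quote() function can add these backslashes automatically.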
9.3.1.6 Examples

    '/\w+\s+\w+/'

Matches two words separated by whitespace.

    '/(\d{1,3}\.){3}\d{1,3}/'

Matches (but does not validate) an IP address. The IP address may appear anywhere in the string.

    <?php
    $str = "My IP address is 212.187.38.47.";
    preg_match('/(\d{1,3}\.){3}\d{1,3}/', $str, $matches);
    print_r($matches);
    ?>

outputs

    Array
    (
        [0] => 212.187.38.47
        [1] => 38.
    )

It is interesting to notice that the second element contains only the last of the three matched subpatterns.

9.3.1.7 Lazy Matching

Suppose you have the following string and you want to match the text inside the first <a /> tag:

    <a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a> manual.

The following pattern looks like it will work:

    '@<a.*>(.*)</a>@'

However, when you run the following example, you see that it outputs the wrong result:

    <?php
    $str = '<a href="http://php.net/">PHP</a> has an '.
        '<a href="http://php.net/manual">excellent</a> manual.';
    $pattern = '@<a.*>(.*)</a>@';
    preg_match($pattern, $str, $matches);
    print_r($matches);
    ?>

outputs

    Array
    (
        [0] => <a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a>
        [1] => excellent
    )

The example fails because the * and the + are greedy operators. They try to match as many characters as possible. In this case, <a.*> matches everything up to manual">. You can tell the PCRE engine not to do this by appending a ? to the quantifier. If the ? is added, the PCRE engine tries to match as few characters or sub-patterns as possible, which is what we want here. When the pattern @<a.*?>(.*?)</a>@ is used, the output is correct:

    Array
    (
        [0] => <a href="http://php.net/">PHP</a>
        [1] => PHP
    )

However, this is not the most efficient way. It's usually better to use the pattern @<a[^>]+>([^<]+)</a>@, which requires less processing by the PCRE engine.

9.3.1.8 Modifiers

The modifiers "modify" the behavior of the pattern matching engine. Table 9.3 lists them all with descriptions and examples.
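A few of the standard PCRE modifiers can be tried out with the sketch below. It reuses the link string from the lazy matching discussion; the other subject strings and all variable names are made up for illustration, and the selection of modifiers (U, i, and m) is only a sample.

```php
<?php
$str = '<a href="http://php.net/">PHP</a> has an '
     . '<a href="http://php.net/manual">excellent</a> manual.';

// Default behavior: quantifiers are greedy.
preg_match('@<a.*>(.*)</a>@', $str, $greedy);

// U modifier: quantifiers become lazy by default,
// much like appending ? to each of them.
preg_match('@<a.*>(.*)</a>@U', $str, $ungreedy);

echo $greedy[1], "\n";     // excellent
echo $ungreedy[1], "\n";   // PHP

// i modifier: letters match regardless of case.
$without_i = preg_match('/EXCELLENT/', $str);    // 0
$with_i    = preg_match('/EXCELLENT/i', $str);   // 1

// m modifier: ^ and $ also match at line breaks inside the subject.
preg_match_all('/^\w+/m', "first line\nsecond line", $line_starts);
print_r($line_starts[0]);  // "first" and "second"
```

Modifiers can be combined by writing them one after the other following the closing delimiter, for example /Ui.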
9.3.2. Functions

Three groups of PCRE-related functions are available: matching functions, replacement functions, and splitting functions. preg_match(), discussed previously, belongs to the first group. The second group contains functions that replace substrings that match a specific pattern. The last group of functions splits strings based on regular expression matches.

9.3.2.1 Matching Functions

preg_match() is the function that matches one pattern against the subject string and returns 1 or 0 depending on whether the subject matched the pattern. It can also fill an array with the contents of the different sub-pattern matches. The function preg_match_all() is similar, except that it matches the pattern against the subject repeatedly. Finding all the matches is useful when extracting information from documents. Take, for example, the situation in which you want to extract email addresses from a web site:

    <?php
    $raw_document = file_get_contents('http://www.w3.org/TR/CSS21');
    $doc = html_entity_decode($raw_document);
    $count = preg_match_all(
        '/<(?P<email>([a-z.]+).?@[a-z0-9]+\.[a-z]{1,6})>/Ui',
        $doc,
        $matches
    );
    print_r($matches);
    ?>

outputs

    Array
    (
        [0] => Array
            (
                [0] => <bert @w3.org>
                [1] => <tantekc @microsoft.com>
                [2] => <ian @hixie.ch>
                [3] => <howcome @opera.com>
            )

        [email] => Array
            (
                [0] => bert @w3.org
                [1] => tantekc @microsoft.com
                [2] => ian @hixie.ch
                [3] => howcome @opera.com
            )

        [1] => Array
            (
                [0] => bert @w3.org
                [1] => tantekc @microsoft.com
                [2] => ian @hixie.ch
                [3] => howcome @opera.com
            )

        [2] => Array
            (
                [0] => bert
                [1] => tantekc
                [2] => ian
                [3] => howcome
            )

    )

This example reads the contents of the CSS 2.1 specification into a string and decodes the HTML entities in it. The script then runs preg_match_all() on the document, using a pattern that matches < + an email address + >, and stores the email addresses in the $matches array.
The output shows that preg_match_all() doesn't store all sub-patterns belonging to one match in one element of the $matches array. Instead, each element of $matches collects the results for one sub-pattern across all the matches.

preg_grep() works much like the UNIX egrep command. It compares a pattern against the elements of an array containing the subjects, and it returns an array containing the elements that were successfully matched against the pattern. See the next example, which returns all valid IP addresses from the array $addresses:

    <?php
    $addresses = array('212.187.38.47', '188.141.21.91', '2.9.256.7', '<<empty>>');
    $pattern = '@^((\d?\d|1\d\d|2[0-4]\d|25[0-5])\.){3}'.
        '(\d?\d|1\d\d|2[0-4]\d|25[0-5])$@';
    $addresses = preg_grep($pattern, $addresses);
    print_r($addresses);
    ?>

9.3.2.2 Replacement Functions

In addition to the matching described in the previous section, PHP's regular expression functions can also replace text based on pattern matching. The replacement functions replace a substring that matches a pattern with different text. In the replacement, you can refer to the sub-pattern matches using back references. Here is an example that explains the replacement functions. In this example, we use preg_replace() to replace a pseudo-link, such as [link url="www.php.net"]PHP[/link], with a real HTML link:

    <?php
    $str = '[link url="http://php.net"]PHP[/link] is cool.';
    $pattern = '@\[link\ url="([^"]+)"\](.*?)\[/link\]@';
    $replacement = '<a href="\\1">\\2</a>';
    $str = preg_replace($pattern, $replacement, $str);
    echo $str;
    ?>

The script outputs

    <a href="http://php.net">PHP</a> is cool.

The pattern consists of two sub-patterns: ([^"]+) for the URL and (.*?) for the link text. Instead of returning the substrings of the subject that match the two sub-patterns, the PCRE engine assigns them to back references, which you can access by using \\1 and \\2 in the replacement string. If you don't want to use \\1, you may use $1.
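The two back reference notations can be compared with a short sketch like this one. It reuses the pseudo-link example from above; the variable names and the final one-line example are made up for illustration.

```php
<?php
$str     = '[link url="http://php.net"]PHP[/link] is cool.';
$pattern = '@\[link\ url="([^"]+)"\](.*?)\[/link\]@';

// Back references written with a backslash (single-quoted string).
$with_backslash = preg_replace($pattern, '<a href="\\1">\\2</a>', $str);

// The same replacement written with the dollar notation.
$with_dollar = preg_replace($pattern, '<a href="$1">$2</a>', $str);

echo $with_backslash, "\n";   // <a href="http://php.net">PHP</a> is cool.
var_dump($with_backslash === $with_dollar);   // bool(true)

// The braces in ${1} separate the back reference from a digit that follows it:
// here the first back reference is followed by a literal "1".
$suffixed = preg_replace('@(\d)@', '${1}1', 'a5');
echo $suffixed, "\n";         // a51
```

Both notations produce the same result, so the choice between them is mostly a matter of readability.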
Be careful when putting the replacement string into double quotes, because PHP then processes backslashes and dollar signs itself before preg_replace() sees the string. You will have to escape either the backslash (so that a back reference looks like \\1, because \1 on its own is read as an octal character escape) or the dollar sign (so that a back reference looks like \$1). It is safest to always put the replacement string in single quotes. The full pattern match is assigned to back reference 0, just like the element with key 0 in the matches array of the preg_match() function.

Tip: If a back reference needs to be followed directly by a digit, you can use ${1}1 for the first back reference followed by the number 1.

preg_replace() can replace more than one subject at the same time by using an array of subjects. For instance, the following example script changes the format of the names in the array $names:

    <?php
    $names = array(
        'rethans, derick',
        'sæther bakken, stig',
        'gutmans, andi'
    );
    $names = preg_replace('@([^,]+).\ (.*)@', '\\2 \\1', $names);
    ?>

The names array is changed to

    array('derick rethans', 'stig sæther bakken', 'andi gutmans');

However, names usually start with an uppercase letter. You can uppercase the first letter by using either the /e modifier or preg_replace_callback(). The /e modifier causes the replacement string to be evaluated as PHP code, and the return value of that code is used as the replacement. (The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so it only works on older versions.)

    <?php
    $names = array(
        'rethans, derick',
        'sæther bakken, stig',
        'gutmans, andi'
    );
    $names = preg_replace('@([^,]+).\ (.*)@e', 'ucwords("\\2 \\1")', $names);
    ?>

If you need to do more complex manipulation with the matched patterns, evaluating replacement strings becomes complicated.
You can use the preg_replace_callback() function instead:

    <?php
    function format_string($matches)
    {
        return ucwords("{$matches[2]} {$matches[1]}");
    }

    $names = array(
        'rethans, derick',
        'sæther bakken, stig',
        'gutmans, andi'
    );
    $names = preg_replace_callback(
        '@([^,]+).\ (.*)@',   // pattern
        'format_string',      // callback function
        $names                // array with 'subjects'
    );
    print_r($names);
    ?>

Here's one more useful example:

    <?php
    $show_with_vat = true;
    $format = '€ %.2f';
    $exchange_rate = 1.2444;

    function currency_output_vat($data)
    {
        $price = $data[1];         // first sub-pattern: the amount
        $vat_percent = $data[2];   // second sub-pattern: the VAT percentage
        $show_vat = isset($GLOBALS['show_with_vat']) &&
            $GLOBALS['show_with_vat'];
        $amount = ($show_vat)
            ? $price * (1 + $vat_percent / 100)
            : $price;
        return sprintf(
            $GLOBALS['format'],
            $amount / $GLOBALS['exchange_rate']
        );
    }

    $data = "This item costs {amount: 27.95 %19%} ".
        "and the other one costs {amount: 29.95 %0%}.\n";
    echo preg_replace_callback(
        '/\{amount\:\ ([0-9.]+)\ \%([0-9.]+)\%\}/',
        'currency_output_vat',
        $data
    );
    ?>

This example originates from a webshop where the format and exchange rate are decoupled from the text, which is stored in a cache file. With this solution, it is possible to use caching techniques and still have a dynamic exchange rate.

preg_replace() and preg_replace_callback() allow the pattern to be an array of patterns. When an array is passed as the first parameter, every pattern is matched against the subject. preg_replace() also enables you to pass an array for the replacement string when the first parameter is an array of patterns:

    <?php
    $text = "This is a nice text; with punctuation AND capitals";
    $patterns = array('@[A-Z]@e', '@[\W]@', '@_+@');
    $replacements = array('strtolower("\\0")', '_', '_');
    $text = preg_replace($patterns, $replacements, $text);
    echo $text."\n";
    ?>

The first pattern, @[A-Z]@e, matches any uppercase character and, because the e modifier is used, the accompanying replacement string strtolower("\\0") is evaluated as PHP code.
The second pattern, @[\W]@, matches all non-word characters and, because the second replacement string is simply _, all non-word characters are replaced by the underscore (_). Because the replacements are done in order, the third pattern matches the already modified subject, replacing every run of multiple _ characters with a single one. Table 9.4 shows the contents of the subject string after each pattern/replacement step.
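The intermediate results of the three steps can also be printed with a sketch like the following. It is a variation on the example above, not the original code: the three patterns are applied one at a time, and the first step uses preg_replace_callback() with an anonymous function in place of the e modifier, so that it also runs on PHP versions where the e modifier is no longer available. The $step1, $step2, and $step3 variable names are made up for illustration.

```php
<?php
$text = "This is a nice text; with punctuation AND capitals";

// Step 1: lowercase every uppercase letter
// (a callback takes the place of the e modifier here).
$step1 = preg_replace_callback(
    '@[A-Z]@',
    function ($matches) {
        return strtolower($matches[0]);
    },
    $text
);

// Step 2: replace every non-word character with an underscore.
$step2 = preg_replace('@[\W]@', '_', $step1);

// Step 3: collapse runs of underscores into a single underscore.
$step3 = preg_replace('@_+@', '_', $step2);

echo $step1, "\n";   // this is a nice text; with punctuation and capitals
echo $step2, "\n";   // this_is_a_nice_text__with_punctuation_and_capitals
echo $step3, "\n";   // this_is_a_nice_text_with_punctuation_and_capitals
```

The double underscore after "text" in the second line comes from the semicolon and the space that follows it, which is why the third pattern is needed.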
9.3.2.3 Splitting Strings

The last group of functions includes only preg_split(), which can be used to split a string into substrings by using a regular expression match for the delimiters. PHP provides an explode() function that also splits strings, but explode() can only use a simple string as the delimiter. explode() is much faster than using a regular expression, so you are better off using explode() when possible. A simple example of preg_split()'s usage might be to split a string into the words it contains. See the following example:

    <?php
    $str = 'This is an example for preg_split().';
    $words = preg_split('@[\W]+@', $str);
    print_r($words);
    ?>

The script outputs

    Array
    (
        [0] => This
        [1] => is
        [2] => an
        [3] => example
        [4] => for
        [5] => preg_split
        [6] =>
    )

As you can see, the last element is empty. By default, the function returns empty elements, too. The characters just before the end of the string are non-word characters, so they act as a delimiter, resulting in an empty element. You can pass two more parameters to the preg_split() function: a limit and a flag. The "limit" parameter controls how many elements are returned before the splitting stops. In the following preg_split() example, two elements are returned:

    <?php
    $str = 'This is an example for preg_split().';
    $words = preg_split('@[\W]+@', $str, 2);
    print_r($words);
    ?>

The output is

    Array
    (
        [0] => This
        [1] => is an example for preg_split().
    )

In the next example, we use -1 as the limit. -1 means that there is no limit at all, and allows us to pass flags without shortening our output array. Three flags specify what is returned:
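The flags can be tried out with a minimal sketch such as the one below. It assumes the three flag constants defined by PHP for preg_split(): PREG_SPLIT_NO_EMPTY, PREG_SPLIT_DELIM_CAPTURE, and PREG_SPLIT_OFFSET_CAPTURE. The short subject strings 'a-b' and 'ab cd' are made up for illustration.

```php
<?php
$str = 'This is an example for preg_split().';

// PREG_SPLIT_NO_EMPTY: empty elements, such as the trailing one, are dropped.
$words = preg_split('@[\W]+@', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);   // This, is, an, example, for, preg_split

// PREG_SPLIT_DELIM_CAPTURE: text matched by a parenthesized part of the
// delimiter pattern is returned as well.
$parts = preg_split('@(-)@', 'a-b', -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);   // a, -, b

// PREG_SPLIT_OFFSET_CAPTURE: every element becomes an array holding the
// substring and its offset in the subject.
$pairs = preg_split('@ @', 'ab cd', -1, PREG_SPLIT_OFFSET_CAPTURE);
print_r($pairs);   // (ab, 0) and (cd, 3)
```

The flags are bit masks, so they can be combined with the | operator when more than one behavior is needed.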