| only for RuBoard - do not distribute or recompile |
In this section we show how regular expressions can achieve more sophisticated pattern matching to find, extract, and even replace complex substrings within a string.
While regular expressions provide capabilities beyond those described in the last section, complex pattern matching isn't as efficient as simple string comparisons. The functions described in the last section are more efficient than those that use regular expressions and should be used if complex pattern searches aren't required.
This section starts with a brief description of the POSIX regular expression syntax. This isn't a complete description of all the capabilities, but we do provide enough details to create quite powerful regular expressions. The second half of the section describes the functions that use POSIX regular expressions. Examples of regular expressions can be found in this section and in Chapter 7.
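A regular expression describes a pattern of characters to look for in a subject string. To show how patterns behave, the examples in this section use the ereg( ) function: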
boolean ereg(string pattern, string subject [, array var])
ereg( ) returns true if the regular expression pattern is found in the subject string. We discuss how the ereg( ) function can extract values into the optional array variable var later in this section.
The following trivial example shows how ereg() is called to find the literal pattern "cat" in the subject string "raining cats and dogs" :
// prints "Found 'cat'"
if (ereg("cat", "raining cats and dogs"))
    echo "Found 'cat'";
The regular expression "cat" matches the subject string, and the fragment prints "Found 'cat'".
To represent any character in a pattern, a period is used as a wildcard. The pattern "c.." matches any three-letter string that begins with a lowercase "c" ; for example, "cat" , "cow" , "cop" , etc. To express a pattern that actually matches a period, use the backslash character \ —for example, "\.com" matches ".com" but not "xcom" .
The use of the backslash in a regular expression can cause confusion. To include a backslash in a double-quoted string, you need to escape the meaning of the backslash with a backslash. The following example shows how the regular expression pattern "\.com" is represented:
// Sets $found to true
$found = ereg("\\.com", "www.ora.com");
It's better to avoid the confusion and use single quotes when passing a string as a regular expression:
$found = ereg('\.com', "www.ora.com");
Rather than using a wildcard that matches any character, a list of characters enclosed in brackets can be specified. For example, the expression:
ereg("p[aeiou]p", $var)
returns true for any string that contains "pap", "pep", "pip", "pop", or "pup". A range of characters can also be specified; for example, "[0-9]" specifies the digits 0 through 9:
// Matches "A1", "A2", "A3", "B1", ...
$found = ereg("[ABC][123]", "A1 Quality"); // true
// Matches "00" to "39"
$found = ereg("[0-3][0-9]", "27"); // true
A list can specify characters that must not match by using the not operator ^ as the first character in the brackets. The pattern "[^123]" matches any character other than 1, 2, or 3. The following examples show more regular expressions that make use of the not operator in lists:
// true for "pap", "pbp", "pcp", etc. but not "php"
$found = ereg("p[^h]p", $val);
// true if $val contains at least one
// non-alphanumeric character
$found = ereg("[^0-9a-zA-Z]", $val);
The ^ character can be treated as normal by placing it in a position other than the start of the characters enclosed in the brackets. For example, "[0-9^]" matches the characters 0 to 9 and the ^ character. The - character can be matched by placing it at the start or the end of the list; for example, "[-123]" matches characters - , 1 , 2 , or 3 .
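The placement rules for ^ and - inside a list can be checked with a short sketch. It assumes PHP 7 or later, where the ereg( ) family is no longer available, so it substitutes preg_match( ) with each pattern wrapped in / delimiters. That substitution and the variable names are illustrative choices, not part of the original examples.

```php
<?php
// Assumption: preg_match( ) stands in for ereg( ), which was removed in PHP 7.0.
// The anchors force the whole subject to consist of characters from the list.

// ^ is an ordinary character when it is not first in the list
$caretInList = preg_match('/^[0-9^]+$/', '2^8') === 1;

// - is an ordinary character when it is first or last in the list
$hyphenFirst = preg_match('/^[-123]+$/', '1-2-3') === 1;
$hyphenLast  = preg_match('/^[123-]+$/', '3-1') === 1;

// 4 is not in the list, so the anchored pattern fails
$outsideList = preg_match('/^[-123]+$/', '1-4') === 1;

var_dump($caretInList, $hyphenFirst, $hyphenLast, $outsideList);
```

The first three checks succeed and the last one fails, because "4" is the only character that falls outside its list.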
A regular expression can specify that a pattern occur at the start or end of a subject string using anchors . The ^ anchors a pattern to the start, and the $ character anchors a pattern to the end of a string. For example, the expression:
ereg("^php", $var)
matches strings that start with "php" but not others. The following code shows the operation of both:
$var = "to be or not to be";
$match = ereg("^to", $var); // true
$match = ereg('be$', $var); // true
$match = ereg("^or", $var); // false
Both anchors can be used in one regular expression to match a whole string. The following example illustrates this:
// Must match "Yes" exactly
$match = ereg('^Yes$', "Yes"); // true
$match = ereg('^Yes$', "Yes sir"); // false
By following a character in a regular expression with a ? , * , or + operator, the pattern matches zero or one, zero to many, or one to many occurrences of the character, respectively.
The ? operator allows zero or one occurrence of a character, so the expression:
ereg("pe?p", $var)
matches either "pep" or "pp" , but not the string "peep" . The * operator allows zero or many occurrences of the "o" in the expression:
ereg("po*p", $var)
and matches "pp" , "pop" , "poop" , "pooop" , and so on. Finally, the + operator allows one to many occurrences of "b" in the expression:
ereg("ab+a", $var)
so while strings such as "aba" , "abba" , and "abbba" match, "aa" doesn't.
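The three operators can be compared side by side by running each pattern against a few subjects. This is a minimal sketch that assumes PHP 7 or later and therefore uses preg_match( ) in place of ereg( ); the $checks and $results arrays are hypothetical names introduced for the illustration.

```php
<?php
// Assumption: preg_match( ) stands in for ereg( ), which was removed in PHP 7.0.
$checks = [
    'pe?p' => ['pp', 'pep', 'peep'],   // ? allows zero or one "e"
    'po*p' => ['pp', 'pop', 'poop'],   // * allows zero or more "o"
    'ab+a' => ['aa', 'aba', 'abba'],   // + requires at least one "b"
];

// Record, for every pattern, which subjects it matches
$results = [];
foreach ($checks as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        $results[$pattern][$subject] = preg_match('/' . $pattern . '/', $subject) === 1;
    }
}

foreach ($results as $pattern => $bySubject) {
    foreach ($bySubject as $subject => $matched) {
        echo $pattern, ' vs ', $subject, ': ', $matched ? 'match' : 'no match', "\n";
    }
}
```

Only "peep" against "pe?p" and "aa" against "ab+a" report no match, which agrees with the descriptions above.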
The operators ? , * , and + can also be used with a wildcard or a list of characters. The following examples show how:
$var = "www.rmit.edu.au";
// True for strings that start with "www"
// and end with "au"
$matches = ereg('^www.*au$', $var); // true
$hexString = "x01ff";
// True for strings that start with 'x'
// followed by at least one hexadecimal digit
$matches = ereg('^x[0-9a-fA-F]+$', $hexString); // true
The first example matches any string that starts with "www" and ends with "au" ; the pattern ".*" matches a sequence of any characters, including a blank string. The second example matches any sequence that starts with the character "x" followed by one or more characters from the list [0-9a-fA-F] .
A fixed number of occurrences can be specified in braces. For example, the pattern "[0-7]{3}" matches three-character numbers that contain the digits 0 through 7:
$valid = ereg("[0-7]{3}", "075"); // true
$valid = ereg("[0-7]{3}", "75"); // false
The braces syntax also allows the minimum and maximum occurrences of a pattern to be specified as demonstrated in the following examples:
$val = "58273";
// true if $val contains numerals from start to end
// and is between 4 and 6 characters in length
$valid = ereg('^[0-9]{4,6}$', $val); // true
$val = "5827003";
$valid = ereg('^[0-9]{4,6}$', $val); // false
// Without the anchors at the start and end, the
// matching pattern "582768" is found
$val = "582768986456245003";
$valid = ereg("[0-9]{4,6}", $val); // true
Subpatterns in a regular expression can be grouped by placing parentheses around them. This allows the optional and repeating operators to be applied to groups rather than just a single character. For example, the expression:
ereg("(123)+", $var)
matches "123", "123123", "123123123", etc. Grouping characters allows complex patterns to be expressed, as in the following example:
// A simple, incomplete, HTTP URL regular expression that doesn't allow numbers
$pattern = '^(http://)?[a-zA-Z]+(\.[a-zA-Z]+)+$';
$found = ereg($pattern, "www.ora.com"); // true
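The regular expression assigned to $pattern includes both the start and end anchors, ^ and $, so the whole subject string, "www.ora.com", must match the pattern. The start of the pattern is the optional group "(http://)?", which is absent from the subject string in this example. The list that follows matches one or more letters, here "www", and the repeated group that begins with an escaped period matches ".ora" and then ".com".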
Groups can also define subpatterns when ereg( ) is used to extract values into an array, as discussed later in this section.
Alternatives in a pattern are specified with the | operator; for example, the pattern "cat|bat|rat" matches "cat", "bat", or "rat". The | operator has the lowest precedence of the regular expression operators, so parentheses are needed to limit its scope. The following expression also matches "cat", "bat", or "rat":
$var = "bat";
$found = ereg("(c|b|r)at", $var); // true
Another example shows alternative beginnings to a pattern:
// match some URLs
$pattern = '(^ftp|^http|^gopher)://';
$found = ereg($pattern, "http://www.ora.com"); // true
We've already discussed the need to escape the special meaning of characters used as operators in a regular expression. However, when to escape the meaning depends on how the character is used. Escaping the special meaning of a character is done with the backslash character, as with the expression "2\+3", which matches the string "2+3". If the + isn't escaped, the pattern matches one or many occurrences of the character 2 followed by the character 3. Another way to write this expression is to express the + in a list of characters as "2[+]3". Because + doesn't have the same meaning in a list, it doesn't need to be escaped in that context. Using character lists in this way can improve readability. The following examples show how escaping is used and avoided:
// Need to escape ( and )
$phone = "(03) 9429 5555";
$found = ereg("^\([0-9]{2,3}\)", $phone); // true
// No need to escape (, *, ., +, ?, or ) within brackets
$special = "Special Characters are (, ), *, +, ?, ";
$found = ereg("[(*.+?)]", $special); // true
// The backslash always needs to be quoted to match
$backSlash = 'The backslash \ character';
$found = ereg('^[a-zA-Z \\]*$', $backSlash); // true
// No need to escape the dot within brackets
$domain = "www.ora.com";
$found = ereg("[.]com", $domain); // true
Another complication arises due to the fact that a regular expression is passed as a string to the regular expression functions. Strings in PHP can also use the backslash character to escape quotes and to encode tabs, newlines, etc. Consider the following example, which matches a backslash character:
// single-quoted string containing a backslash
$backSlash = '\ backslash';
// Evaluates to true
$found = ereg("^\\\\ backslash\$", $backSlash);
The regular expression looks quite odd: to match a backslash, the regular expression function needs to escape the meaning of backslash, but because we are using a double-quoted string, each of the two backslashes needs to be escaped. The last complication is that PHP interprets the $ character as the beginning of a variable name, so it is also escaped with a backslash.
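The two layers of escaping are easier to follow when the string that PHP actually hands to the regular expression function is inspected directly. The sketch below uses no regular expression function at all; it only compares string literals, and the variable names are made up for the illustration.

```php
<?php
// Each pair holds the same characters once PHP has parsed the string literal.
$dotDouble = "\\.com";   // double quotes: \\ becomes a single backslash
$dotSingle = '\.com';    // single quotes: \. is left untouched

$slashDouble = "^\\\\ backslash\$";  // \\\\ becomes \\ and \$ becomes $
$slashSingle = '^\\\\ backslash$';   // \\\\ becomes \\ ; $ needs no escape here

var_dump($dotDouble === $dotSingle);      // bool(true)
var_dump($slashDouble === $slashSingle);  // bool(true)

// The pattern the regular expression function receives: ^\\ backslash$
echo $slashDouble, "\n";
```

In both pairs the single-quoted form needs fewer escapes than the double-quoted form, which is the reason single quotes are recommended for patterns.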
Metacharacters can also be used in regular expressions. For example, the tab character is represented as \t. The following example applies the \s metacharacter to a string that contains a tab:
$source = "fast\tfood";
$result = ereg('\s', $source); // true
PHP has several functions that use POSIX regular expressions to find and extract substrings, replace substrings, and split a string into an array. The functions to perform these tasks come in case-sensitive and case-insensitive versions.
The ereg( ) function, and the case-insensitive version eregi( ), are defined as:
boolean ereg(string pattern, string subject [, array var])
boolean eregi(string pattern, string subject [, array var])
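Both functions return true if the regular expression pattern is found in the subject string. An optional array variable var can be passed as the third argument; it is populated with the portions of the subject string matched by the whole pattern and by each group within it.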
To extract values from a string into an array, patterns can be arranged in groups contained by parentheses in the regular expression. The following example shows how the year, month, and day components of a date can be extracted into an array:
$parts = array( );
$value = "2001-09-07";
$pattern = '^([0-9]{4})-([0-9]{2})-([0-9]{2})$';
ereg($pattern, $value, $parts);
// Array ([0]=> 2001-09-07 [1]=>2001 [2]=>09 [3]=>07)
print_r($parts);
The expression:
'^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
matches dates in the format YYYY-MM-DD . After calling ereg( ) , $parts[0] is assigned the portion of the string that matches the whole regular expression—in this case, the whole string 2001-09-07 . The portion of the date that matches each group in the expression is assigned to the following array elements: $parts[1] contains the year matched by ([0-9]{4}) , $parts[2] contains the month matched by ([0-9]{2}) , and $parts[3] contains the day matched by "([0-9]{2})" .
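Extracted groups are often handed on under more descriptive names. The sketch below wraps the same date pattern in a small function; it assumes PHP 7.1 or later, so preg_match( ) replaces ereg( ) (it fills the array in the same order), and parseDate( ) is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Assumption: preg_match( ) stands in for ereg( ), which was removed in PHP 7.0.
// Index 0 of $parts is the whole match; indexes 1 to 3 are the three groups.
// parseDate( ) is a hypothetical helper used only for this sketch.
function parseDate(string $value): ?array
{
    $parts = [];
    if (preg_match('/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/', $value, $parts) !== 1) {
        return null;  // not in YYYY-MM-DD form
    }
    return ['year' => $parts[1], 'month' => $parts[2], 'day' => $parts[3]];
}

print_r(parseDate("2001-09-07"));
var_dump(parseDate("7 Sep 2001"));
```

The pattern checks only the shape of the string, so a value such as "2001-13-45" is still accepted; range checks on the extracted parts would have to be added separately.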
The following functions create new strings by replacing substrings:
string ereg_replace(string pattern, string replacement, string source)
string eregi_replace(string pattern, string replacement, string source)
They create a new string by replacing substrings of the source string that match the regular expression pattern with a replacement string. These functions are similar to the str_replace( ) function described earlier in Section 2.6, except that the substrings to replace are identified by a regular expression rather than a fixed string. For example:
$source = "The quick\tbrown\n\tfox jumps";
// prints "The quick brown fox jumps"
echo ereg_replace("[ \t\n]+", " ", $source);
$source = "\xf6 The quick\tbrown\n\tfox jumps\x88";
// replace all non-printable characters with a space
echo ereg_replace("[^ -~]+", " ", $source);
The second example uses the regular expression "[^ -~]+" to match all characters except those that fall between the space character and the tilde character in the ASCII table. This represents almost all the printable 7-bit characters.
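The range from space to tilde can be confirmed with ord( ), and the effect of the replacement is easier to see once the leftover spaces are trimmed. This is a sketch under the assumption that PHP 7 or later is in use, so preg_replace( ) takes the place of ereg_replace( ); the $clean variable is an illustrative name.

```php
<?php
// Assumption: preg_replace( ) stands in for ereg_replace( ) (removed in PHP 7.0).
$source = "\xf6 The quick\tbrown\n\tfox jumps\x88";

// The list [ -~] runs from the space (ASCII 32) to the tilde (ASCII 126)
echo ord(' '), ' to ', ord('~'), "\n";

// Replace each run of characters outside that range with one space,
// then trim the spaces left at both ends
$clean = trim(preg_replace('/[^ -~]+/', ' ', $source));
echo $clean, "\n";   // The quick brown fox jumps
```

Without the call to trim( ), the result keeps a space where each of the bytes \xf6 and \x88 used to be.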
The following two functions split strings:
array split(string pattern, string source [, integer limit])
array spliti(string pattern, string source [, integer limit])
They split the source string into an array, breaking the string where the matching pattern is found. These functions perform a similar task to the explode( ) function described earlier and as with explode( ) , a limit can be specified to determine the maximum number of elements in the array.
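The effect of the limit argument is easiest to see by splitting the same string with and without it. The sketch assumes PHP 7 or later and uses preg_split( ) as a stand-in for split( ); the sample string and variable names are invented for the illustration.

```php
<?php
// Assumption: preg_split( ) stands in for split( ), which was removed in PHP 7.0.
$list = "apples, pears;oranges,  plums";

// No limit: break at every run of commas, semicolons, or spaces
$all = preg_split('/[,; ]+/', $list);

// Limit of 2: at most two elements; the remainder stays in the last one
$firstAndRest = preg_split('/[,; ]+/', $list, 2);

print_r($all);           // apples, pears, oranges, plums
print_r($firstAndRest);  // apples, then "pears;oranges,  plums"
```

With a limit, the final element is left unsplit, delimiters included.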
The following simple example shows how split( ) can break a sentence into an array of words:
$sentence = "I wonder why he does\nBuzz, buzz, buzz!";
$words = split("[^a-zA-Z]+", $sentence);
When complex patterns aren't needed to break a string into an array, the explode( ) function makes a better choice.
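The choice between the two approaches can be sketched as follows. A fixed delimiter is handled by explode( ), while mixed separators need a pattern. The sketch assumes PHP 7 or later, so preg_split( ) replaces split( ), and the PREG_SPLIT_NO_EMPTY flag is an addition specific to preg_split( ) that the original functions do not take.

```php
<?php
// Assumption: preg_split( ) stands in for split( ), which was removed in PHP 7.0.

// A fixed delimiter needs no pattern, so explode( ) is enough
$fields = explode(",", "2001,09,07");

// Words separated by mixed spaces, newlines, and punctuation need a pattern;
// PREG_SPLIT_NO_EMPTY drops the empty string left after the final "!"
$sentence = "I wonder why he does\nBuzz, buzz, buzz!";
$words = preg_split('/[^a-zA-Z]+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($fields);
print_r($words);
```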