Using Regular Expressions

Chapter 6, "Strings and Characters of the World," showed some options for finding information within strings, such as the strpos and substr functions. Although these functions are useful and efficient, they do have some limitations. Suppose, for example, that we have a form that requires a user to enter a U.S. or Canadian telephone number in the format (###)###-####. We would have to pick apart this string one character at a time to make sure that the correct type of character was in the correct location. If we wanted to be flexible and allow input in formats such as 123.456.7890 or (123)456 7890, verifying these formats using only strpos and substr would take a frustrating number of function calls.

We would be much happier programmers if there were some advanced pattern-matching engine available to us in PHP. Of course, we would want it to be compatible with a multi-byte character set to help us with UTF-8 and other localized strings.

Fortunately, such functionality does exist in the form of a feature called regular expressions.

What Are Regular Expressions?

A regular expression is just a description of a pattern typically specified in a string. When you compare a string against the regular expression, the processing engine determines whether the string matches the expression (and if so, in what way). Therefore, instead of having to look through a phone number searching for individual characters, we can create a regular expression that says something more like "look for a series of 10 digits, possibly with some parentheses around the first 3 characters and a dash between the sixth and seventh characters."

The syntax used to describe these regular expressions is powerful, flexible, and unfortunately somewhat dialectal. The major implementations of regular expressions differ slightly in their details. Fortunately, these differences are not major, and we can typically move from one system to another without too much trouble.

PHP provides programmers with two regular expression processing engines. The first is called the Perl Compatible Regular Expressions (PCRE) extension and is modeled on the processor used in Perl, an extremely powerful language that has regular expressions tightly integrated into its programming model. The second flavor is called POSIX Extended Regular Expressions and is based on the regular expression syntax defined by the POSIX 1003.2 standard.

Both extensions are enabled by default in PHP, and you can use them with a number of functions. However, this book focuses entirely on the POSIX regular expressions for the following reasons:

PCRE is already extremely well documented in numerous places and has a remarkable amount of user support through the Perl community.
The POSIX regular expressions are multi-byte character set enabled in PHP, whereas the PCRE extension is not. Given that we are focusing our efforts largely on writing globalizable applications, we want to be sure foreign language characters can be properly processed.

This is not meant to be a judgment in favor of one regular expression engine over the other. There are a number of features in the PCRE engine that are not available in the POSIX one that many programmers find invaluable, and it can be faster in a number of situations. If your application does not require multi-byte character set support and you can be sure that you are dealing with input data that is in a certain code page, the Perl regular expressions might be appropriate for you.

Setup

The POSIX regular expression extension is enabled in PHP by default, unless explicitly disabled by specifying the --disable-regex switch to the configuration program before compiling it. The one trick to this is that the extension only supports multi-byte strings if you also enable the mbstring extension, as discussed in Chapter 6. Microsoft Windows users merely need to make sure the following line in php.ini has no semicolon ( ; ) character in front of it:

 extension=php_mbstring.dll

You have no configuration options for this extension in php.ini, so after it is compiled or enabled, it is ready for use.

Testing Your Expressions

The next few sections cover the specifics of regular expression syntax, but first you learn how to test and play with the functionality in PHP5. Although we could use POSIX regular expressions with a number of functions, we limit ourselves initially to the ereg function, which takes a string and a regular expression and tells whether the string matches the pattern and, if so, what exactly the match was:

 $success = ereg($pattern, $string, $match);

The third parameter is optional and can be omitted when you are interested strictly in whether a match occurred. When it is specified and a match is found, an array with those match(es) is placed into this variable.
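The calling convention can be tried out with a short script. The following is only a sketch: it assumes preg_match as a stand-in for ereg, because ereg was removed in PHP 7, and the sample string is invented for illustration. Apart from the delimiters that preg_match requires around the pattern, the three parameters play the same roles.

```php
<?php
// Assumption: preg_match() stands in for ereg(), which was removed in PHP 7.
// Unlike ereg(), it expects delimiters (here '/') around the pattern.
$string = 'size: large';

// With the third argument, the matched text is placed in $match.
$success = preg_match('/large/', $string, $match);
if ($success === 1) {
    echo "Matched: {$match[0]}\n";
}

// Without it, we learn only whether a match occurred.
$has_small = preg_match('/small/', $string) === 1;
echo $has_small ? "small found\n" : "small not found\n";
```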

Most Unix-like systems (including Mac OS X) also ship with a program called egrep, which applies the regular expression to each line in an input file individually, indicating which lines match and which ones do not:

 # egrep [options] pattern  files

To have the ability to process a number of input lines in PHP and indicate which ones match, we can write our own function, which we will call regex_play and use while we explore the functionality:

 function regex_play($in_strings, $in_regex)
 {
   if (!is_array($in_strings) || !is_string($in_regex))
     die('Bad Parameters (array + string)<br/>');

   echo <<<EOM
 <b>regex_play</b> called to match <b>'{$in_regex}'</b>:<br/>
 EOM;

   foreach ($in_strings as $x => $strval)
   {
     $found = ereg($in_regex, $strval, $matches);
     if ($found)
     {
       echo "Array Index <b>$x</b> matches: ";
       var_export($matches); echo " \"$strval\"<br/>\n";
     }
   }
   echo "<br/>\n";
 }

This function merely takes an array of strings and the regular expression to match and indicates which of the strings in the array match the pattern. As we will see, regular expressions use many of the same characters that PHP uses for special string processing. Therefore, it is almost always in your best interest to enclose regular expression patterns in single quotes (') rather than double quotes (").
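For readers on a PHP version that no longer includes ereg, a variation on the same idea might look like the sketch below. The name regex_collect is hypothetical, and preg_match is assumed as the matching function. Rather than printing HTML, it returns the matched text keyed by array index, which makes the results easier to inspect.

```php
<?php
/**
 * Hypothetical helper, not part of PHP: for every string that matches,
 * returns the text that satisfied the pattern, keyed by array index.
 * Assumption: preg_match() replaces ereg(), so the pattern is wrapped in
 * '/' delimiters; a pattern containing '/' would need that character escaped.
 */
function regex_collect(array $in_strings, string $in_regex): array
{
    $results = [];
    foreach ($in_strings as $x => $strval) {
        if (preg_match('/' . $in_regex . '/', $strval, $matches) === 1) {
            $results[$x] = $matches[0];
        }
    }
    return $results;
}

$sample = ['shoes', 'pants', 'socks', 'underpants'];
print_r(regex_collect($sample, 'pants'));
```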

Basic Searches

In its most basic usage, a regular expression contains a character or set of characters to match in the input string. If we have the following array in our PHP scripts

 $clothes = array("shoes", "pants", "socks", "jacket", "cardigan",
   "scarf", "t-shirt", "blouse", "underpants", "belt",
   "handbag");

then to see which strings contain the letter a, we could write the following:

 regex_play($clothes, 'a');

This function outputs the following:

 regex_play called to match 'a':
 Array Index 1 matches: array ( 0 => 'a', ) "pants"
 Array Index 3 matches: array ( 0 => 'a', ) "jacket"
 Array Index 4 matches: array ( 0 => 'a', ) "cardigan"
 Array Index 5 matches: array ( 0 => 'a', ) "scarf"
 Array Index 8 matches: array ( 0 => 'a', ) "underpants"
 Array Index 10 matches: array ( 0 => 'a', ) "handbag"

It is interesting to look at the last item in the array: handbag. We might intuitively ask why the results array does not contain two instances of the letter a, because there are two in the input string. The answer lies in how the POSIX regular expression processor works: as soon as it satisfies a condition (in this case, finding a single letter a), it stops processing and reports only that first match.

To find all those entries that contain pants, we could write the following:

 regex_play($clothes, 'pants');

The output would be as follows:

 regex_play called to match 'pants':
 Array Index 1 matches: array ( 0 => 'pants', ) "pants"
 Array Index 8 matches: array ( 0 => 'pants', ) "underpants"

Given that the pants in underpants also matched against our regular expression, we see further evidence that the regular expression is just matching characters. It normally does not care about word boundaries or whether that which it seeks is buried among other characters.

We can also search for multi-byte characters, assuming we have correctly enabled the mbstring extensions:

 $mb_strings = array("",                     "",                     ""); regex_play($mb_strings, "");

The output from the preceding would be as follows:

 regex_play called to match '': Array Index 0 matches: array ( 0 => '', ) "" Array Index 2 matches: array ( 0 => '', ) ""

Character Classes

When we want to search for more than just individual characters or strings, we can use square brackets ( [ and ] ) to define what are called character classes. These are used in positions where you want to allow one of a number of characters to appear. For example, to find any clothing that has the letter o followed by either a u or an e, you can use the following:

 regex_play($clothes, "o[ue]");

This output results:

 regex_play called to match 'o[ue]':
 Array Index 0 matches: array ( 0 => 'oe', ) "shoes"
 Array Index 7 matches: array ( 0 => 'ou', ) "blouse"

To find any string containing a vowel (any of a, e, i, o, or u), we could use the character class [aeiou]. Similarly, to match against any digit, we could use [0123456789], and to match against any lowercase letter, we could write [abcdefghijklmnopqrstuvwxyz]. These last two classes, however, are somewhat annoying to type in all the time, and they're prone to input errors.

To solve this problem, you can specify ranges of characters using the hyphen (-) character: [a-z], [A-Z], or [0-9]. You can include multiple ranges within one character class, such as [A-Za-z0-9], which instructs the processor to match any single uppercase letter, lowercase letter, or digit.

However, a note of caution is warranted against expressions such as [A-z] because regular expression ranges actually just operate on character codes. All the uppercase letters happen to lie consecutively in the character tables in most character sets, as do the lowercase ones, but between the two ranges, there are a number of characters. Therefore, the range [A-z] would also include characters such as [, ], ^, and _. The character class [a-Z], on the other hand, just generates an error from the regex or mbregex compiler in PHP. The character code for a comes after that of Z, which translates into an invalid range.
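The effect of an overly wide range is easy to observe. The sketch below assumes preg_match in place of ereg (which is not available in PHP 7 and later), and the list of test characters is our own. It checks which characters [A-z] accepts and then shows that a reversed range is rejected when the pattern is compiled.

```php
<?php
// Assumption: preg_match() stands in for ereg() on current PHP versions.
$candidates = ['A', 'z', '[', '^', '_', '5'];

$accepted = [];
foreach ($candidates as $ch) {
    if (preg_match('/[A-z]/', $ch) === 1) {
        $accepted[] = $ch;
    }
}
// The punctuation between 'Z' and 'a' is accepted along with the letters.
echo implode(' ', $accepted), "\n";

// A reversed range fails to compile: preg_match() returns false and raises
// a warning, silenced here with @ so that we can inspect the return value.
$reversed = @preg_match('/[a-Z]/', 'x');
var_dump($reversed);
```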

To specify nonprintable characters in character classes, you can use many of the same escape sequences that you would use in PHP, including those for tabs (\t), newlines (\n), carriage returns (\r), and hexadecimal representations of unprintable characters (\x0b). Of course, this means that if you want to search for the backslash character (\), you must escape it: [\\].

Ranges in character classes work on any character set with contiguous character values. Therefore, in UTF-8 character sets, [-] represents all possible Japanese hiragana characters, and [09] represents the double-width digits found in most Asian fonts. (These digits differ from the regular single-width digits found in ASCII.)

In addition to putting individual digits, letters, or ranges within character classes, you can specify a number of special named classes available in POSIX regular expressions, as shown in Table 22-1.

Table 22-1. Named Character Classes in POSIX Regular Expressions
Named Class	Description
`[:alnum:]`	Matches all ASCII letters and numbers. Equivalent to `[a-zA-Z0-9]`.
`[:alpha:]`	Matches all ASCII letters. Equivalent to `[a-zA-Z]`.
`[:blank:]`	Matches spaces and tab characters. Equivalent to `[ \t]`.
`[:space:]`	Matches any whitespace characters, including space, tab, newlines, and vertical tabs. Equivalent to `[\n\r\t \x0b]`.
`[:cntrl:]`	Matches unprintable control characters. Equivalent to `[\x01-\x1f]`.
`[:digit:]`	Matches ASCII digits. Equivalent to `[0-9]`.
`[:lower:]`	Matches lowercase letters. Equivalent to `[a-z]`.
`[:upper:]`	Matches uppercase letters. Equivalent to `[A-Z]`.

You cannot use these named character classes outside of character classes or as part of ranges. Thus, we could choose to write [0-9], [[:digit:]], or [[:alpha:][:digit:]], but not [A-[:lower:]].
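As a quick check that a named class and its range equivalent behave alike, consider the following sketch. It assumes preg_match as a substitute for ereg, relying on the fact that PCRE also accepts POSIX named classes inside square brackets; the sample strings are made up.

```php
<?php
// Assumption: preg_match() stands in for ereg(); PCRE accepts the same
// POSIX named classes when they appear inside square brackets.
$samples = ['abc', 'a1c', '2024', 'x y'];

foreach ($samples as $s) {
    $by_range = preg_match('/[0-9]/', $s);
    $by_name  = preg_match('/[[:digit:]]/', $s);
    printf("%-5s range=%d named=%d\n", $s, $by_range, $by_name);
}

// Named classes can be combined inside a single character class.
$letters_or_digits = preg_match('/[[:alpha:][:digit:]]/', '?!');
```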

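One other important aspect of using character classes is the ^ character. When it appears as the first character inside the brackets, it negates the class, so that the class matches any character except those listed. Therefore, the character class [^aeiou] matches any single character that is not a lowercase English vowel; a string fails to match only if it contains nothing but those vowels.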

Finally, to include carets (^) or square brackets within the list of characters against which to match, you just escape them with backslashes: [\^\[\]].
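Negated classes and escaped brackets can both be exercised in a few lines. This is a sketch under the assumption that preg_match replaces ereg on PHP 7 and later; the input strings are arbitrary.

```php
<?php
// Assumption: preg_match() stands in for ereg() on current PHP versions.

// A negated class needs one character that is NOT in the list.
$only_vowels = preg_match('/[^aeiou]/', 'aeiou');     // nothing qualifies
$mixed       = preg_match('/[^aeiou]/', 'audio', $m); // the 'd' qualifies

// Escaped caret and square brackets are ordinary members of the class.
$has_bracket = preg_match('/[\^\[\]]/', 'a[0]', $b);
$no_bracket  = preg_match('/[\^\[\]]/', 'plain');

echo "First non-vowel: {$m[0]}\n";
echo "First bracket or caret: {$b[0]}\n";
```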

Boundaries

One of the things shown previously was that searching for pants matched both pants and underpants. When we want to match only the word pants, we need a way to mark word boundaries. This is done in POSIX regular expressions by using the [:<:] and [:>:] anchors, for a word's left and right boundaries, respectively. As with the named classes listed in Table 22-1, these must be wrapped in square brackets, as follows:

 regex_play($clothes, '[[:<:]]pants[[:>:]]');

The beginning of a string and the end of a string count as a left and right word boundary, respectively, so the preceding code would generate the following results:

 regex_play called to match '[[:<:]]pants[[:>:]]':
 Array Index 1 matches: array ( 0 => 'pants', ) "pants"
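On PHP versions without ereg, the same idea can be approximated with the PCRE functions. The sketch below is an assumption-laden substitute, not the POSIX syntax described above: it uses preg_grep and PCRE's \b word-boundary notation in place of [[:<:]] and [[:>:]], and the short word list is our own.

```php
<?php
// Assumption: preg_grep() and PCRE's \b stand in for ereg() and the POSIX
// [[:<:]] / [[:>:]] anchors, which this sketch does not use.
$items = ['pants', 'underpants', 'pants suit', 'pantsuit'];

// preg_grep() keeps the original array keys of the entries that match.
$whole_word = preg_grep('/\bpants\b/', $items);
print_r($whole_word);
```

The start and end of the string count as boundaries here too, so the bare word pants still matches.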

WARNING FOR WINDOWS USERS

Versions of PHP as recent as 5.0.4 have an issue when compiled for the Microsoft Windows platform. On these, if you have php_mbstring.dll enabled, you cannot use the word boundary anchors [:<:] and [:>:] in regular expressions; they generate a regular expression compiler error. Unix versions of PHP do not have this problem, and neither do Windows versions without mbstring enabled.

Two other important anchors exist for matching the beginning of a string (the caret, or ^) and the end of a string (dollar sign, or $). These are used on their own, outside of character classes wrapped in [ and ], and are known as metacharacters.

To match any string beginning with the word the, we could use the regular expression "^the". If we want to allow either a lower- or uppercase t character, we rewrite the expression as "^[tT]he". In the clothing example, to find words starting with the letter s, we would write the following:

 regex_play($clothes, '^s');

The output would be as follows:

 regex_play called to match '^s':
 Array Index 0 matches: array ( 0 => 's', ) "shoes"
 Array Index 2 matches: array ( 0 => 's', ) "socks"
 Array Index 5 matches: array ( 0 => 's', ) "scarf"

Similarly, to find an article of clothing ending with the letter s, we would write this:

 regex_play($clothes, 's$');

And our output would be this:

 regex_play called to match 's$':
 Array Index 0 matches: array ( 0 => 's', ) "shoes"
 Array Index 1 matches: array ( 0 => 's', ) "pants"
 Array Index 2 matches: array ( 0 => 's', ) "socks"
 Array Index 8 matches: array ( 0 => 's', ) "underpants"

The regular expression "^s$" matches strings containing only the letter s.
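The three anchored patterns can be compared side by side. The following sketch assumes preg_grep (from the PCRE extension) as a stand-in for looping over the array with ereg; the clothing list is the one used throughout this section.

```php
<?php
// Assumption: preg_grep() stands in for calling ereg() on each element.
$clothes = ['shoes', 'pants', 'socks', 'jacket', 'cardigan',
            'scarf', 't-shirt', 'blouse', 'underpants', 'belt', 'handbag'];

$starts_with_s = array_keys(preg_grep('/^s/', $clothes));
$ends_with_s   = array_keys(preg_grep('/s$/', $clothes));

// With both anchors, the whole string must be the single letter s.
$only_s = array_keys(preg_grep('/^s$/', array_merge($clothes, ['s'])));

echo 'Starts with s: ', implode(', ', $starts_with_s), "\n";
echo 'Ends with s:   ', implode(', ', $ends_with_s), "\n";
echo 'Exactly s:     ', implode(', ', $only_s), "\n";
```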

The Dot

One special character, the period or dot character (.), is used in regular expressions to match any single character. Therefore, the regular expression "s.n" matches sun, sin, son, s!n, s%n, and sSn. The dot must, however, match one character. Thus, sn would not match the previous regular expression. To actually match a period character in your string, you must escape the dot with a backslash (\.).

A dot character inside of a character class has no special meaning and is not treated as a metacharacter: It is just used to match a period. For example, to match some common characters seen at the end of a word, we might write [.,:;'">?!\]-]. (The hyphen is placed last so that it is read as a literal hyphen rather than as part of a range.)
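The different roles of the dot can be seen in a small experiment. As before, this is a sketch that assumes preg_match in place of ereg, with inputs chosen only for illustration.

```php
<?php
// Assumption: preg_match() stands in for ereg() on current PHP versions.

// A bare dot must consume exactly one character of any kind.
$sun = preg_match('/s.n/', 'sun');
$sSn = preg_match('/s.n/', 'sSn');
$sn  = preg_match('/s.n/', 'sn');     // nothing between s and n

// An escaped dot, or a dot inside a class, matches only a period.
$loose   = preg_match('/3.14/', '3x14');
$escaped = preg_match('/3\.14/', '3x14');
$literal = preg_match('/3\.14/', '3.14');
$classed = preg_match('/3[.]14/', '3x14');

printf("s.n: sun=%d sSn=%d sn=%d\n", $sun, $sSn, $sn);
printf("3x14: bare=%d escaped=%d class=%d\n", $loose, $escaped, $classed);
```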

Repeating Patterns

When we want to match a character or character class occurring more than once, we can use quantifiers, which enable us to specify a minimum and maximum number of times the preceding entity can occur. Quantifiers are specified by including the minimum and maximum number in braces: {min,max}.

One common misspelling seen these days is the word lose spelled as loose. To match either of these, you could use the following expression: "lo{1,2}se", which would match lose and loose, but neither lse nor looose:

 regex_play(array("loser", "looser", "lser", "looooser"),
            'lo{1,2}se');

 regex_play called to match 'lo{1,2}se':
 Array Index 0 matches: array ( 0 => 'lose', ) "loser"
 Array Index 1 matches: array ( 0 => 'loose', ) "looser"

You can, if you so desire, omit the upper bound, in which case any number greater than or equal to the minimum bound matches:

 regex_play(array("loser", "looser", "lser", "looooser"),
            'lo{1,}se');

 regex_play called to match 'lo{1,}se':
 Array Index 0 matches: array ( 0 => 'lose', ) "loser"
 Array Index 1 matches: array ( 0 => 'loose', ) "looser"
 Array Index 3 matches: array ( 0 => 'loooose', ) "looooser"

Three extremely common repeating patterns get their own special quantifiers:

{0,} This is represented by the special quantifier *, which means match zero or more of the preceding entity.
{1,} This is represented by the special quantifier +, which means match one or more of the preceding entity.
{0,1} This sequence denotes that something can optionally exist, but only once if it does. It is represented by the special quantifier ?.

For example, to match any sequence of digits ending in 99, we can use the regular expression "[0-9]*99".
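To see how the shorthand quantifiers relate to their brace forms, we can run them against the same word list. This sketch assumes preg_grep and preg_match as stand-ins for ereg; the string 'item 1299' is an invented example.

```php
<?php
// Assumption: preg_grep()/preg_match() stand in for ereg().
$words = ['loser', 'looser', 'lser', 'looooser'];

$one_or_two  = array_keys(preg_grep('/lo{1,2}se/', $words));
$one_or_more = array_keys(preg_grep('/lo+se/', $words));   // same as {1,}
$zero_or_more = array_keys(preg_grep('/lo*se/', $words));  // same as {0,}
$optional    = array_keys(preg_grep('/lo?se/', $words));   // same as {0,1}

// Any run of digits (possibly empty) followed by 99.
preg_match('/[0-9]*99/', 'item 1299', $digits);

echo '{1,2}: ', implode(', ', $one_or_two), "\n";
echo '+:     ', implode(', ', $one_or_more), "\n";
echo '*:     ', implode(', ', $zero_or_more), "\n";
echo '?:     ', implode(', ', $optional), "\n";
echo "Digits ending in 99: {$digits[0]}\n";
```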

Grouping and Choice

Regular expressions also enable us to group characters or character classes together, via parentheses: ( and ). Using these in combination with other operators enables us to form even more powerful regular expressions, such as "(very){1,}", which would find a match in all of very good, very very good, and very very very very very smokin' good.
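The difference that parentheses make to a quantifier is worth checking by hand. The sketch below again assumes preg_match in place of ereg. Note that the grouped pattern here includes a trailing space inside the parentheses, an addition of ours so that consecutive words are consumed as one run.

```php
<?php
// Assumption: preg_match() stands in for ereg() on current PHP versions.

// Without parentheses, the quantifier applies only to the letter y.
$ungrouped = preg_match('/very{2,}/', 'veryvery');
// With parentheses, it applies to the whole word.
$grouped = preg_match('/(very){2,}/', 'veryvery');

// Our variation: a space inside the group lets the run span several words.
preg_match('/(very ){1,}/', 'very very very good', $m);

printf("ungrouped=%d grouped=%d\n", $ungrouped, $grouped);
echo "Whole match: '{$m[0]}'\n";
```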

Tricks and Traps

POSIX regular expressions have a couple of interesting properties that can cause some unexpected results (when matching against strings) and some potential performance problems. We cover a couple of the more common sources of confusion here, and leave a more thorough treatment of advanced regular expressions to another book. (See Appendix C, "Recommended Reading.")

First and foremost, POSIX regular expressions work in a fashion that leads them to be called greedy. Effectively, when given free rein to start matching characters, as with a sequence such as ".*" (which says match any number of characters), a POSIX regular expression immediately starts gobbling up characters until it reaches the end of the string.

This behavior can cause problems if the regular expression is in fact something like ".*fish". If given the string I like to eat raw fish with that pattern, the processor matches all characters until it gets to the end of the string. It then realizes that it still has four more characters left to match, namely those in fish. It then starts working its way backward through the string, seeing whether it can make a match happen that way. It finally makes that match, but in a somewhat inefficient manner.
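Greedy matching also determines which text is reported when the sought word occurs more than once. The following sketch assumes preg_match as a replacement for ereg (its default quantifiers are likewise greedy), and the sentence is made up for the demonstration.

```php
<?php
// Assumption: preg_match() stands in for ereg(); its quantifiers are
// greedy by default as well.
$text = 'one fish, two fish, red';

preg_match('/.*fish/', $text, $m);

// The .* runs to the end of the string, then backs up only as far as the
// LAST occurrence of fish, so the match spans both occurrences.
echo "Matched: '{$m[0]}'\n";
```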

This greedy processing can cause some unexpected results if our patterns are not as specific as they need to be. Consider the following expression, intended to match an IP address in the format xxx.yyy.zzz.www:

 [0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}

We have written this to try to match between one and three digits four times, each time separated by a period character. We have, however, forgotten that the dot character, when specified by itself in a regular expression, means "match any character." What we really wanted was to escape each of the periods with a backslash.

The preceding pattern correctly matches (as expected) against the following IP addresses:

 1.2.3.4
 192.168.0.1
 255.255.255.255

What is unexpected, however, is that it successfully matches against the following:

 192.168.255

Why? Because the regular expression processor works very hard to make patterns match. The preceding string matches the regular expression along the following lines:

The first two [0-9]{1,3}. sequences match the 192. and 168. respectively. The processor then uses the 255 to match the third one of these before realizing that there is still more in the regular expression to match.
After backtracking, however, it discovers that it can satisfy the regular expression with the remaining 255 by matching the 2 against the third [0-9]{1,3}. sequence, the first 5 against the dot character, and the second 5 against the fourth digit sequence [0-9]{1,3}.

We are thus given a match, even though that is not what we intended! To fix this problem, we should correctly escape the dot characters to indicate that we will only accept periods:

 [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

This new regular expression still correctly matches valid IP addresses, but it no longer matches the invalid one.
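Both versions of the pattern can be run against the same inputs to confirm the difference. This is a sketch that assumes preg_grep in place of ereg; the variable names $loose and $strict are ours.

```php
<?php
// Assumption: preg_grep() stands in for calling ereg() on each input.
$loose  = '/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}/';     // bare dots
$strict = '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/';  // escaped dots

$inputs = ['1.2.3.4', '192.168.0.1', '255.255.255.255', '192.168.255'];

$loose_hits  = preg_grep($loose, $inputs);
$strict_hits = preg_grep($strict, $inputs);

foreach ($inputs as $i => $ip) {
    printf("%-16s loose=%s strict=%s\n", $ip,
        isset($loose_hits[$i]) ? 'yes' : 'no',
        isset($strict_hits[$i]) ? 'yes' : 'no');
}
```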

If you are getting strange or unexpected results with your regular expression, do not fixate on one particular part of the expression, but instead look at the whole sequence of patterns and try to see how it could be producing the results. Trying different input values to isolate how it is behaving will also help.