POSIX Regular Expressions

The regular expression flavor defined by the POSIX standard is perhaps the simplest form of regex available to PHP programmers. As such, it makes a great learning tool, because the functions that implement it provide no particularly "advanced" features.

In addition to the standard rules that we have already discussed, the POSIX regex standard defines the concept of character classes as a way to make it even easier to specify character ranges. A character class name is always enclosed in a pair of colon characters (:) and square brackets, as in [:alpha:], and it can appear only inside a bracket expression, which is why you will see it written as [[:alpha:]]. There are 12 character classes:

alpha represents a letter of the alphabet (either upper- or lowercase). This is equivalent to [A-Za-z].
digit represents a digit between 0 and 9 (equivalent to [0-9]).
alnum represents an alphanumeric character, just like [0-9A-Za-z].
blank represents "blank" characters, normally space and Tab.
cntrl represents "control" characters, such as NUL, ESC, and DEL.
graph represents all the printable characters except the space.
lower represents lowercase letters of the alphabet only.
upper represents uppercase letters of the alphabet only.
print represents all printable characters.
punct represents punctuation characters such as "." or ",".
space represents whitespace characters, such as space, Tab, and newline.
xdigit represents hexadecimal digits (equivalent to [0-9A-Fa-f]).

This makes it possible, for example, to rewrite our email validation regex as follows:

 [[:alnum:]_]+@[[:alnum:]_]+\.[[:alnum:]_]{2,4}

This notation is much simpler, and it makes mistakes a little easier to spot.
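The following is a minimal sketch of the character classes at work. It rests on two assumptions that are not part of the text above: it uses preg_match() instead of the ereg* functions described later in this section (those functions were removed in PHP 7, and PCRE accepts the same bracketed class names), and the helper name looks_like_email() is hypothetical. The pattern is also anchored with ^ and $ so that the whole string has to match.

```php
<?php
// Hypothetical helper: checks a string against the chapter's email pattern.
// Assumption: preg_match() stands in for ereg(), which PHP 7 removed.
function looks_like_email($s)
{
    // ^ and $ anchor the pattern so the whole string must match.
    $pattern = '/^[[:alnum:]_]+@[[:alnum:]_]+\.[[:alnum:]_]{2,4}$/';
    return preg_match($pattern, $s) === 1;
}

// A POSIX class and its explicit range accept the same ASCII input.
var_dump(preg_match('/^[[:digit:]]+$/', '2024') === preg_match('/^[0-9]+$/', '2024'));

var_dump(looks_like_email('marcot@tabini.ca'));   // bool(true)
var_dump(looks_like_email('not an address'));     // bool(false)
```

Keep in mind that this pattern is deliberately loose: it rejects perfectly valid addresses that contain dots or hyphens, so treat it as an illustration of the notation and not as a production validator.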

Another important concept introduced by the POSIX extension is the reference. Earlier in the chapter, we saw how parentheses can be used to group regular expressions. When you do so in a POSIX regex, the interpreter assigns a numeric identifier to each grouped expression that is matched when the expression is executed. This identifier can later be used in various operations, such as finding and replacing.

For example, consider the following string and regular expression:

 marcot@tabini.ca
 ([[:alpha:]]+)@([[:alpha:]]+)\.([[:alpha:]]{2,4})

The regex should match the preceding email address. Because we have grouped the username, the domain name, and the domain extension, each of them will become a reference, as shown in Table 3.1.

Table 3.1. Regex References
Reference Number	Value
0	marcot@tabini.ca (the string matched by the entire regex)
1	marcot
2	tabini
3	ca

PHP provides support for POSIX regular expressions through the ereg* family of functions. The simplest form of regex matching is performed through the ereg() function:

 ereg (pattern, string[, matches])

The ereg function works by compiling the regular expression stored in pattern and then comparing it against string. If the regex matches string, the function returns a value that evaluates to TRUE; otherwise, it returns FALSE. If the matches parameter is specified, it is filled with an array containing all the references specified by pattern that were found in string (see Listing 3.1).

Listing 3.1. Filling Patterns with `ereg`

 <?php
     $s = 'marcot@tabini.ca';

     if (ereg ('([[:alpha:]]+)@([[:alpha:]]+)\.([[:alpha:]]{2,4})', $s, $matches))
     {
       echo "Regular expression successful. Dumping matches\n";
       var_dump ($matches);
     }
     else
     {
       echo "Regular expression unsuccessful.\n";
     }
 ?>

If you execute the preceding script, you should see this result:

 Regular expression successful. Dumping matches
 array(4) {
   [0]=>
   string(16) "marcot@tabini.ca"
   [1]=>
   string(6) "marcot"
   [2]=>
   string(6) "tabini"
   [3]=>
   string(2) "ca"
 }

This indicates that the regular expression was successfully matched against the string stored in $s and returned the various references in the $matches array.
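As a hedged sketch of how the references might be put to use, the helper below gives each one a name. Two things here are assumptions and not part of the text: the function name split_email() and its array keys are hypothetical, and preg_match() is used in place of ereg() because the ereg* functions were removed in PHP 7. The pattern itself is the one discussed above, and the reference numbering is the same.

```php
<?php
// Hypothetical helper: returns the three references as a named array,
// or null when the pattern does not match.
// Assumption: preg_match() is used in place of ereg(), removed in PHP 7.
function split_email($s)
{
    $pattern = '/([[:alpha:]]+)@([[:alpha:]]+)\.([[:alpha:]]{2,4})/';
    if (preg_match($pattern, $s, $matches) !== 1) {
        return null;
    }
    // $matches[0] holds the whole match; 1 to 3 hold the groups.
    return array(
        'user'      => $matches[1],
        'domain'    => $matches[2],
        'extension' => $matches[3],
    );
}

print_r(split_email('marcot@tabini.ca'));
var_dump(split_email('no at sign here'));   // NULL
```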

If you're not interested in case-sensitive matching (and you don't want to have to specify all characters twice when creating a regular expression), you can use the eregi function instead. It accepts the same parameters and behaves the same way as ereg(), except that it ignores case when matching a regular expression against a string (see Listing 3.2):

Listing 3.2. Case-insensitive Pattern Matching

 <?php
     $a = "UPPERCASE";

     echo (int) ereg ('uppercase', $a);
     echo "\n";
     echo (int) eregi ('uppercase', $a);
     echo "\n";
 ?>

The first regex will fail because ereg() performs a case-sensitive match against the contents of $a. The second regex, however, will be successful, because the eregi function performs its matches using an algorithm that is not case sensitive.
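For comparison, here is a short sketch of the same test written for the PCRE functions, on the assumption that ereg() and eregi() are unavailable (both were removed in PHP 7). In PCRE the trailing i modifier plays the role of eregi(); the variable names $sensitive and $insensitive are only illustrative.

```php
<?php
// Assumption: preg_match() replaces ereg()/eregi(), which PHP 7 removed.
// The trailing "i" modifier plays the role of eregi().
$a = "UPPERCASE";

$sensitive   = preg_match('/uppercase/', $a);    // 0: case differs
$insensitive = preg_match('/uppercase/i', $a);   // 1: case ignored

echo $sensitive, "\n", $insensitive, "\n";
```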

References make regular expressions an even more effective tool for handling search-and-replace operations. For this purpose, PHP provides the ereg_replace function, and its cousin eregi_replace(), which is not case sensitive:

 ereg_replace (pattern, replacement, string);

The ereg_replace() function first matches the regular expression pattern against string. Then, it replaces each match with replacement, in which placeholders such as \1 and \2 are substituted with the corresponding references, and returns the resulting string. Here's an example (see Listing 3.3):

Listing 3.3. Using `ereg_replace`

 <?php
     $s = 'marcot@tabini.ca';

     echo ereg_replace ('([[:alpha:]]+)@([[:alpha:]]+)\.([[:alpha:]]{2,4})',
       '\1 at \2 dot \3', $s);
 ?>

If you execute this script, it will output the following string:

 marcot at tabini dot ca

As you can see, the three references are extracted from the contents of $s by the regex compiler and used to substitute the placeholders in the replacement string.
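The same substitution can be sketched as a small reusable function. This is an illustration under stated assumptions: obfuscate_email() is a hypothetical name, and preg_replace() is swapped in for ereg_replace() because the ereg* functions were removed in PHP 7. The \1, \2, and \3 placeholders are written exactly as in Listing 3.3.

```php
<?php
// Hypothetical helper: rewrites an address in "user at domain dot ext" form.
// Assumption: preg_replace() is used in place of ereg_replace().
function obfuscate_email($s)
{
    $pattern = '/([[:alpha:]]+)@([[:alpha:]]+)\.([[:alpha:]]{2,4})/';
    // \1, \2 and \3 are replaced by the three references.
    return preg_replace($pattern, '\1 at \2 dot \3', $s);
}

echo obfuscate_email('marcot@tabini.ca'), "\n";
echo obfuscate_email('Write to marcot@tabini.ca today'), "\n";
```

Because the pattern is not anchored, only the part of the string that matches is rewritten, and a string with no match is returned unchanged.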
