Using ereg

The ereg function in PHP is used to test a string against a regular expression. Using a very simple regex, the following example checks whether $phrase contains the substring PHP:

 $phrase = "I love PHP";
 if (ereg("PHP", $phrase)) {
   echo "The expression matches";
 }

If you run this script through your web browser, you will see that the expression does indeed match $phrase.

Regular expressions are case-sensitive, so if the expression were in lowercase, this example would not find a match. To perform a case-insensitive regex comparison, you can use eregi:

 if (eregi("php", $phrase)) {
   echo "The expression matches";
 }

Performance: The regular expressions you have seen so far perform basic string matching that can also be performed by the functions you learned about in Lesson 6, "Working with Strings," such as strstr. In general, a script will perform better if you use string functions in place of ereg for simple string comparisons.
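As a minimal sketch of the comparisons above, the following assumes a current PHP release, where the ereg family is no longer available (it was deprecated in PHP 5.3 and removed in PHP 7), and so uses preg_match as a stand-in, with the i flag playing the role of eregi. The function names are hypothetical and chosen only for illustration; the last one shows the plain string-function alternative.

```php
<?php
// Hypothetical helpers; preg_match stands in for ereg/eregi (an assumption,
// since ereg is not available in current PHP releases).

// Case-sensitive test, comparable to ereg("PHP", $phrase).
function containsExact(string $phrase): bool
{
    return preg_match('/PHP/', $phrase) === 1;
}

// Case-insensitive test, comparable to eregi("php", $phrase).
function containsAnyCase(string $phrase): bool
{
    return preg_match('/php/i', $phrase) === 1;
}

// The same simple check with a string function and no regex at all.
function containsPlain(string $phrase): bool
{
    return strstr($phrase, 'PHP') !== false;
}

$phrase = "I love PHP";
if (containsExact($phrase)) {
    echo "The expression matches\n";
}
```

For a fixed substring such as this one, the strstr version avoids the regex engine entirely, which is the point of the performance note.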

Testing Sets of Characters

As well as checking that a sequence of characters appears in a string, you can test for a set of characters by enclosing them in square brackets. You simply list all the characters you want to test, and the expression matches if any one of them occurs.

The following example is actually equivalent to the use of eregi shown earlier in this lesson:

 if (ereg("[Pp][Hh][Pp]", $phrase)) {
   echo "The expression matches";
 }

This expression checks for either an uppercase or lowercase P, followed by an uppercase or lowercase H, followed by an uppercase or lowercase P.

You can also specify a range of characters by using a hyphen between two letters or numbers. For example, [A-Z] would match any uppercase letter, and [0-4] would match any single digit from zero to four.

The following condition is true only if $phrase contains at least one uppercase letter:

 if (ereg("[A-Z]", $phrase)) ...

The ^ symbol, when it is the first character inside the square brackets, negates a set so that it matches any character that is not in the set. The following condition is true only if $phrase contains at least one non-numeric character:

 if (ereg("[^0-9]", $phrase)) ...
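The sets, ranges, and negated sets described above can be tried out with a sketch like the following. It assumes a current PHP release, in which ereg has been removed, and uses preg_match instead; the bracket syntax is the same. The function names are hypothetical.

```php
<?php
// Hypothetical helpers; preg_match is used in place of ereg (an assumption).

// True if the string contains "php" in any mix of upper and lower case.
function hasPhpAnyCase(string $s): bool
{
    return preg_match('/[Pp][Hh][Pp]/', $s) === 1;
}

// True if the string contains at least one uppercase letter.
function hasUppercase(string $s): bool
{
    return preg_match('/[A-Z]/', $s) === 1;
}

// True if the string contains at least one character that is not a digit.
function hasNonDigit(string $s): bool
{
    return preg_match('/[^0-9]/', $s) === 1;
}

var_dump(hasNonDigit("12345"));
var_dump(hasNonDigit("123a5"));
```

Note that a negated set still has to match one character, so an empty string does not satisfy hasNonDigit.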

Common Character Classes

You can use a number of sets of characters when using regex. To test for all alphanumeric characters, you would need a regular expression that looks like this:

 [A-Za-z0-9]

The same set of characters can be written much more clearly as a character class:

 [[:alnum:]]

The [: and :] characters indicate that the expression contains the name of a character class. The available classes are shown in Table 8.1.

Table 8.1. Character Classes for Use in Regular Expressions
Class Name	Description
`alnum`	All alphanumeric characters, A-Z, a-z, and 0-9
`alpha`	All letters, A-Z and a-z
`digit`	All digits, 0-9
`lower`	All lowercase characters, a-z
`print`	All printable characters, including space
`punct`	All punctuation characters: any printable character that is not a space or `alnum`
`space`	All whitespace characters, including tabs and newlines
`upper`	All uppercase letters, A-Z
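The classes in Table 8.1 can be explored with a small sketch such as the one below. It assumes a current PHP release and uses preg_match, which accepts the same [[:name:]] syntax, because ereg is no longer available there. The helper name containsClass is hypothetical.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// Returns true if $s contains at least one character from the named class,
// for example "digit" or "punct".
function containsClass(string $class, string $s): bool
{
    // Builds a pattern such as /[[:digit:]]/ from the class name.
    return preg_match('/[[:' . $class . ':]]/', $s) === 1;
}

var_dump(containsClass('digit', 'abc1'));
var_dump(containsClass('upper', 'abc'));
```

The helper expects one of the class names from the table; passing any other name produces an invalid pattern.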

Testing for Position

All the expressions you have seen so far find a match if that expression appears anywhere within the compared string. You can also test for position within a string in a regular expression.

The ^ character, when not part of a character class, indicates the start of the string, and $ indicates the end of the string. You could use the following conditions to check whether $phrase begins or ends with a lowercase letter, respectively:

 if (ereg("^[a-z]", $phrase)) ...
 if (ereg("[a-z]$", $phrase)) ...

If you want to check that a string contains only a particular pattern, you can sandwich that pattern between ^ and $. For example, the following condition checks that $number contains only a single numeric digit:

 if (ereg("^[[:digit:]]$", $number)) ...

The Dollar Sign: If you want to look for a literal $ character in a regular expression, you must escape the character as \$ so that it is not treated as the end-of-string indicator.

When your expression is in double quotes, you must use \\$ to double-escape the character; otherwise, PHP treats \$ as its own escape sequence and passes a bare $ to the regular expression, and an unescaped $ sign may be interpreted as the start of a variable identifier.
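The position tests and the escaped dollar sign can be sketched as follows. This assumes a current PHP release and uses preg_match, since ereg is no longer available there; the function names are hypothetical. The patterns are written in single quotes, which sidesteps the double-escaping issue because PHP does not interpolate variables in single-quoted strings.

```php
<?php
// Hypothetical helpers; preg_match is used in place of ereg (an assumption).

// True if the string begins with a lowercase letter.
function startsWithLower(string $s): bool
{
    return preg_match('/^[a-z]/', $s) === 1;
}

// True if the string ends with a lowercase letter.
function endsWithLower(string $s): bool
{
    return preg_match('/[a-z]$/', $s) === 1;
}

// True if the string is exactly one digit. The D flag makes $ match only at
// the very end; without it, preg_match also accepts a trailing newline.
function isSingleDigit(string $s): bool
{
    return preg_match('/^[[:digit:]]$/D', $s) === 1;
}

// True if the string contains a literal dollar sign followed by digits.
function hasDollarAmount(string $s): bool
{
    return preg_match('/\$[0-9]+/', $s) === 1;
}

var_dump(isSingleDigit("5"));
var_dump(hasDollarAmount('It costs $25'));
```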

Wildcard Matching

The dot or period (.) character in a regular expression is a wildcard; it matches any character at all. For example, the following condition matches any four-character string that has a double o in the middle:

 if (ereg("^.oo.$", $word)) ...

The ^ and $ characters indicate the start and end of the string, and each dot can be any character. This expression would match the words book and tool, but not buck or stool.

Wildcards: A regular expression that simply contains a dot matches any string that contains at least one character. You must use the ^ and $ characters to indicate length limits on the expression.
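A short sketch of the wildcard pattern above, assuming a current PHP release and using preg_match in place of ereg, which is no longer available there. The function name is hypothetical.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// True if the string is exactly four characters long with "oo" in the middle.
// The D flag makes $ match only at the very end of the string.
function isFourCharDoubleO(string $word): bool
{
    return preg_match('/^.oo.$/D', $word) === 1;
}

foreach (['book', 'tool', 'buck', 'stool'] as $word) {
    echo $word . ': ' . (isFourCharDoubleO($word) ? 'match' : 'no match') . "\n";
}
```

Because the dot accepts any character, a string such as 1oo! also matches; the pattern checks shape, not whether the string is a word.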

Repeating Patterns

You have now seen how to test for a particular character or for a set or class of characters within a string, as well as how to use the wildcard character to define a wide range of patterns in a regular expression. Along with these, you can use another set of characters to indicate where a pattern can or must be repeated a number of times within a string.

You can use an asterisk (*) to indicate that the preceding item can appear zero or more times in the string, and you can use a plus (+) symbol to ensure that the item appears at least once.

The following examples, which use the * and + characters, are very similar to one another. They both match a string of any length that contains only alphanumeric characters. However, the first condition also matches an empty string because the asterisk denotes zero or more occurrences of [[:alnum:]]:

 if (ereg("^[[:alnum:]]*$", $phrase)) ...
 if (ereg("^[[:alnum:]]+$", $phrase)) ...

To denote a group of matching characters that should repeat, you use parentheses around them. For example, the following condition matches a string of any even length that contains alternating letters and numbers:

 if (ereg("^([[:alpha:]][[:digit:]])+$", $string)) ...

This example uses the plus symbol to indicate that the letter/number sequence could repeat one or more times. To specify a fixed number of times to repeat, the number can be given in braces. A single number or a comma-separated range can be given, as in the following example:

 if (ereg("^([[:alpha:]][[:digit:]]){2,3}$", $string)) ...

This expression would match four- or six-character strings that contain alternating letters and numbers. However, a single letter and number or a longer combination would not match.

The question mark (?) character indicates that the preceding item may appear either once or not at all. The same behavior could be achieved by using {0,1} to specify the number of times to repeat a pattern.
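The repetition operators can be compared side by side with a sketch like this one. It assumes a current PHP release and uses preg_match rather than ereg, which is no longer available there; the function names are hypothetical.

```php
<?php
// Hypothetical helpers; preg_match is used in place of ereg (an assumption).
// The D flag makes $ match only at the very end of the string.

// * allows zero or more alphanumeric characters, so "" is accepted.
function isAlnumOrEmpty(string $s): bool
{
    return preg_match('/^[[:alnum:]]*$/D', $s) === 1;
}

// + requires at least one alphanumeric character.
function isAlnumNonEmpty(string $s): bool
{
    return preg_match('/^[[:alnum:]]+$/D', $s) === 1;
}

// {2,3} requires the letter/digit pair to appear two or three times.
function isTwoOrThreePairs(string $s): bool
{
    return preg_match('/^([[:alpha:]][[:digit:]]){2,3}$/D', $s) === 1;
}

// ? makes the leading minus sign optional, the same as {0,1}.
function isInteger(string $s): bool
{
    return preg_match('/^-?[[:digit:]]+$/D', $s) === 1;
}

var_dump(isAlnumOrEmpty(""));
var_dump(isAlnumNonEmpty(""));
```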

Some Practical Examples

You use regex mostly to validate user input in scripts, to make sure that a value entered is acceptable. The following are some practical examples of using regular expressions.

Zip Codes

If you have a customer's zip code stored in $zip, you might want to check that it has a valid format. A U.S. zip code always consists of five numeric digits, and it can optionally be followed by a hyphen and four more digits. The following condition validates a zip code in this format:

 if (ereg("^[[:digit:]]{5}(-[[:digit:]]{4})?$", $zip)) ...

The first part of this regular expression ensures that $zip begins with five numeric digits. The second part is in parentheses and followed by a question mark, indicating that this part is optional. The second part is defined as a hyphen character followed by four digits.

Regardless of whether the second part appears, the $ symbol indicates the end of the string, so there can be no characters other than those allowed by the expression if this condition is to be satisfied. Therefore, this condition matches a zip code that looks like either 90210 or 90210-1234.
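The zip code rule can be wrapped in a small function, as in the sketch below. It assumes a current PHP release and uses preg_match in place of ereg, which is no longer available there. The name isValidZip is hypothetical.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// Accepts five digits, optionally followed by a hyphen and four more digits.
// The D flag makes $ match only at the very end, so a trailing newline
// is rejected as well.
function isValidZip(string $zip): bool
{
    return preg_match('/^[[:digit:]]{5}(-[[:digit:]]{4})?$/D', $zip) === 1;
}

foreach (['90210', '90210-1234', '9021', '90210-12'] as $zip) {
    echo $zip . ': ' . (isValidZip($zip) ? 'valid' : 'invalid') . "\n";
}
```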

Telephone Numbers

You might want to enforce the format of a telephone number to ensure that it looks like (555)555-5555. There are no optional parts to this format. However, because the parentheses characters have a special meaning for regex, they have to be escaped with a backslash.

The following condition validates a telephone number in this format:

 if (ereg("^\([[:digit:]]{3}\)[[:digit:]]{3}-[[:digit:]]{4}$",
          $telephone)) ...
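A sketch of the same telephone check as a reusable function follows. It assumes a current PHP release and uses preg_match instead of ereg, which is no longer available there. The name isValidTelephone is hypothetical.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// Accepts only the exact layout (555)555-5555. The parentheses are escaped
// with backslashes so that they are matched literally instead of grouping.
function isValidTelephone(string $telephone): bool
{
    $pattern = '/^\([[:digit:]]{3}\)[[:digit:]]{3}-[[:digit:]]{4}$/D';
    return preg_match($pattern, $telephone) === 1;
}

var_dump(isValidTelephone('(555)555-5555'));
var_dump(isValidTelephone('555-555-5555'));
```

Because no part of the pattern is optional, even a space after the closing parenthesis causes the check to fail.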

Email Addresses

You need to consider many variables when validating an email address. At the very simplest level, an email address for a .com domain name looks like somename@somedomain.com.

However, there are many variations, including top-level domain names that are two characters, such as .ca, or four characters, such as .info.

Some country-specific domains have a two-part extension, such as .co.uk or .com.au.

As you can see, a regular expression rule to validate an email address needs to be quite forgiving. However, by making some general assumptions about the format of an email address, you can still create a rule that rejects many badly formed addresses.

There are two main parts to an email address, and they are separated by an @ symbol. The characters that can appear to the left of the @ symbol (usually the recipient's mailbox name) can be alphanumeric and can contain certain symbols.

Let's assume that the mailbox part of an email address can consist of any characters except for the @ symbol itself and can be any length. Rather than try to list all the acceptable characters you can think of (for instance, should you allow an apostrophe in an email address?), it is usually good enough to enforce that an email address can contain only one @ character and that anything up to that character is a valid mailbox name.

For the regex rule, you can define that the domain part of an email address consists of two or more parts, separated by dots. You can also assume that the last part is between two and four characters in length, which covers the most common top-level domain names; longer ones, such as .museum, would be rejected.

The set of characters that can be used in parts of the domain is more restrictive than the mailbox name: only lowercase alphanumeric characters and a hyphen can be used.

Taking these assumptions into consideration, you can come up with the following condition to test the validity of an email address:

 if (ereg("^[^@]+@([a-z0-9\-]+\.)+[a-z]{2,4}$", $email)) ...

This regular expression breaks down as follows: one or more characters other than @, followed by an @ symbol, followed by one or more parts consisting of only lowercase letters, numbers, or a hyphen. Each of those parts ends with a dot, and the final part must be between two and four letters in length.

How Far to Go: This expression could be even further refined. For instance, each part of a domain name cannot begin with a hyphen and has a maximum length of 63 characters. However, for the purpose of catching mistyped email addresses, this expression is more than sufficient.
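The email rule, with the assumptions listed above, can be sketched as a function like the one below. It assumes a current PHP release and uses preg_match in place of ereg, which is no longer available there. The name looksLikeEmail is hypothetical, and the check is deliberately loose rather than a complete validator.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// Mailbox: one or more characters other than @.
// Domain: one or more dot-terminated parts of lowercase letters, digits,
// or hyphens, then a final part of two to four lowercase letters.
function looksLikeEmail(string $email): bool
{
    $pattern = '/^[^@]+@([a-z0-9\-]+\.)+[a-z]{2,4}$/D';
    return preg_match($pattern, $email) === 1;
}

$samples = ['chris@lightwood.net', 'someone@example.co.uk', 'a@b@example.com'];
foreach ($samples as $email) {
    echo $email . ': ' . (looksLikeEmail($email) ? 'ok' : 'rejected') . "\n";
}
```

The rule is strict in a few ways that follow from its assumptions: an uppercase domain or a final part longer than four letters is rejected, so lowercasing the input first may be worth considering.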

Breaking a String into Components

You have used parentheses to group together parts of a regular expression to indicate a repeating pattern. You can also use parentheses to indicate subparts of an expression, and ereg allows you to break a pattern into components based on the parentheses.

When an optional third argument is passed to ereg, that variable is assigned an array of values that correspond to the parts of the pattern identified by the parentheses in the regular expression.

Let's use the email address regular expression as an example. The following code includes three sets of parentheses to isolate the mailbox name, domain name (apart from the extension), and domain extension:

 $email = "chris@lightwood.net";
 if (ereg("^([^@]+)@([a-z0-9\-]+\.)+([a-z]{2,4})$",
          $email, $match)) {
   // $match[1], $match[2], and $match[3] hold the three parenthesized parts
   echo "Mailbox: " . $match[1] . "<br>";
   echo "Domain name: " . $match[2] . "<br>";
   echo "Domain type: " . $match[3] . "<br>";
 } else {
   echo "Email address is invalid";
 }

If you run this script in a web browser, you get output similar to the following:

 Mailbox: chris
 Domain name: lightwood.
 Domain type: net

Note that $match[1] refers to the first parenthesized part of the pattern. The array keys are numbered from zero, as usual; however, $match[0] contains the entire matched string.
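The way the parentheses map onto array keys can be seen in the following sketch. It assumes a current PHP release and uses preg_match, which fills its third argument in the same way, because ereg is no longer available there. The function name splitEmail and the array keys it returns are hypothetical.

```php
<?php
// Hypothetical helper; preg_match is used in place of ereg (an assumption).
// Returns the captured parts of the address, or null if it does not match.
function splitEmail(string $email): ?array
{
    $pattern = '/^([^@]+)@([a-z0-9\-]+\.)+([a-z]{2,4})$/D';
    if (preg_match($pattern, $email, $match) !== 1) {
        return null;
    }
    return [
        'whole'   => $match[0], // the entire matched string
        'mailbox' => $match[1], // first set of parentheses
        'domain'  => $match[2], // second set; keeps its trailing dot
        'type'    => $match[3], // third set of parentheses
    ];
}

print_r(splitEmail("chris@lightwood.net"));
```

When the second group repeats, as in someone@example.co.uk, only its last repetition (co.) is kept in $match[2].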

Searching and Replacing

You can use regular expressions to perform search and replace operations on a string with the ereg_replace function. Its three arguments are a regex search pattern, the replacement string, and the string to replace into. The modified string is returned.

str_replace: If you want to perform a simple string replace operation that does not require a regular expression, you can use str_replace instead of ereg_replace. str_replace is more efficient because PHP does not even have to consider that you might be looking for a regular expression.

For example, to blank out a telephone number before displaying a string, you could use the following:

 echo ereg_replace(
        "\([[:digit:]]{3}\)[[:digit:]]{3}-[[:digit:]]{4}",
        "(XXX)XXX-XXXX", $string);

Just as you can use eregi in place of ereg, you can use eregi_replace to perform a case-insensitive search and replace using regex.
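Search and replace can be sketched in the same way. The following assumes a current PHP release and uses preg_replace, with the i flag for the case-insensitive variant, because ereg_replace and eregi_replace are no longer available there. The function names are hypothetical, and the telephone pattern is left unanchored so that a number anywhere in the string is replaced.

```php
<?php
// Hypothetical helpers; preg_replace is used in place of ereg_replace
// and eregi_replace (an assumption).

// Blanks out every telephone number of the form (555)555-5555.
function maskTelephoneNumbers(string $text): string
{
    $pattern = '/\([[:digit:]]{3}\)[[:digit:]]{3}-[[:digit:]]{4}/';
    return preg_replace($pattern, '(XXX)XXX-XXXX', $text);
}

// Case-insensitive replace: rewrites any capitalization of "php" as "PHP".
function fixPhpCase(string $text): string
{
    return preg_replace('/php/i', 'PHP', $text);
}

echo maskTelephoneNumbers("Call (555)555-5555 today") . "\n";
echo fixPhpCase("I love php and Php") . "\n";
```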