Limitations of the Basic Syntax


Even though the original rules make regular expressions quite powerful, inherent limitations make them impractical to use on their own. For example, there is no regular expression that can be used to specify the concept of "any character." In addition, if you happen to have to specify a parenthesis or star as a regular expression (rather than as a special character), you're pretty much out of luck.

As a result of these limitations, the practical implementations of regular expressions have grown to include a number of other rules:

  • The special character "^" is used to identify the beginning of the string.

  • The special character "$" is used to identify the end of the string.

  • The special character "." is used to identify the expression "any character."

  • Any nonnumeric character following the character "\" is interpreted literally (instead of being interpreted according to its regex meaning). Note that this escaping technique is relative to the regex compiler, and not to PHP itself. This means that you must ensure that an actual backslash character reaches the regex functions by escaping it as needed (that is, if you're using double quotes, you will need to input \\).

  • Any regular expression followed by a "+" character is a regular expression composed of one or more instances of that regular expression.

  • Any regular expression followed by a "?" character is a regular expression composed of either zero or one instances of that regular expression.

  • Any regular expression followed by an expression of the type {min[,|,max]} is a regular expression composed of a variable number of instances of that regular expression. The min parameter indicates the minimum acceptable number of instances, whereas the max parameter, if present, indicates the maximum acceptable number of instances. If the comma is present but max is omitted, no upper limit exists to the number of instances that can be found in the string. Finally, if only min is given (with no comma), it indicates the only acceptable number of instances.

  • Square brackets can be used to identify groups of characters acceptable for a given character position.
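The escaping rule in the list above is easiest to see in code. The following is a minimal sketch that assumes the PCRE function preg_match() (the text does not name a specific regex function) and uses "/" as the pattern delimiter that PCRE requires. It shows the same pattern written in single and in double quotes.

```php
<?php
// Assumption: preg_match() (PCRE) is used for illustration only.
// The pattern looks for the literal three-character sequence "(*)".

$single = '/\(\*\)/';      // single quotes: backslashes reach the regex compiler as typed
$double = "/\\(\\*\\)/";   // double quotes: each backslash is doubled for PHP first

$samePattern = ($single === $double);        // true: both strings are /\(\*\)/
$found       = preg_match($single, 'f(*)');  // 1: the literal "(*)" is present
$notFound    = preg_match($single, 'f()');   // 0: there is no literal "*"

var_dump($samePattern, $found, $notFound);
```

Single-quoted strings are usually the simpler choice for patterns, because PHP leaves nearly all backslashes in them untouched.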

Let's start from the beginning. It's sometimes useful to be able to recognize whether a portion of a regular expression should appear at the beginning or at the end of a string. For example, suppose you're trying to determine whether a string represents a valid HTTP URL. The regex http:// would match both http://www.phparch.com, which is a valid URL, and nhttp://www.phparch.com, which is not (and could easily represent a typo on the user's part).

By using the "^" special character, you can indicate that the following regular expression should be matched only at the beginning of the string. Thus, the regex ^http:// will create a match only with the first of the two strings.
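As a quick illustration of the anchor, here is a small sketch that assumes preg_match() (PCRE) as the regex function. The "#" delimiters are a choice made here so that the slashes in http:// need no escaping.

```php
<?php
// Assumption: preg_match() (PCRE) stands in for whichever regex function you use.

$good = 'http://www.phparch.com';
$typo = 'nhttp://www.phparch.com';

// Without the anchor, the pattern is found anywhere in the string.
$unanchoredGood = preg_match('#http://#', $good);   // 1
$unanchoredTypo = preg_match('#http://#', $typo);   // 1

// With "^", the match must begin at the first character.
$anchoredGood = preg_match('#^http://#', $good);    // 1
$anchoredTypo = preg_match('#^http://#', $typo);    // 0

var_dump($unanchoredGood, $unanchoredTypo, $anchoredGood, $anchoredTypo);
```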

The same concept, although in reverse, applies to the end-of-string marker "$", which indicates that the regular expression preceding it must end exactly at the end of the string. For example, com$ will match "sams.com" but not "communication."
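The end-of-string marker can be tried out the same way. This sketch again assumes preg_match() (PCRE) with "/" delimiters, neither of which is prescribed by the text.

```php
<?php
// Assumption: preg_match() (PCRE) is used for illustration only.

$atEnd      = preg_match('/com$/', 'sams.com');        // 1: "com" ends the string
$notAtEnd   = preg_match('/com$/', 'communication');   // 0: "com" is only at the start
$unanchored = preg_match('/com/',  'communication');   // 1: without "$", any position will do

var_dump($atEnd, $notAtEnd, $unanchored);
```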

The special characters "+" and "?" work similarly to the Kleene Star, with the exception that they represent "at least one instance" and "either zero or one instances" of the regex they are attached to, respectively.
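To compare the three modifiers side by side, consider this sketch. It assumes preg_match() (PCRE), and the sample patterns and strings are made up for the purpose. Each pattern is anchored with "^" and "$" so that the whole string has to match.

```php
<?php
// Assumption: preg_match() (PCRE); the patterns and strings are illustrative only.

// Kleene Star: zero or more "b" characters between "a" and "c".
$starZero = preg_match('/^ab*c$/', 'ac');        // 1

// "+": at least one "b" is required.
$plusZero = preg_match('/^ab+c$/', 'ac');        // 0
$plusMany = preg_match('/^ab+c$/', 'abbbc');     // 1

// "?": the "u" may appear once or not at all.
$optNone  = preg_match('/^colou?r$/', 'color');    // 1
$optOne   = preg_match('/^colou?r$/', 'colour');   // 1
$optTwo   = preg_match('/^colou?r$/', 'colouur');  // 0

var_dump($starZero, $plusZero, $plusMany, $optNone, $optOne, $optTwo);
```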

As I briefly mentioned earlier, having a "wildcard" that can be used to match any character is extremely useful in a wide range of scenarios, particularly considering that the "." character is considered a regular expression in its own right, so that it can be combined with the Kleene Star and any of the other modifiers. For example, the expression

 .+@.+\..+ 

can be used to indicate:

At least one instance of any character, followed by

The "@" character, followed by

At least one instance of any character, followed by

The "." character, followed by

At least one instance of any character.

As you might have guessed, this expression is a very rough form of email address validation. Note how I have used the backslash character (\) to force the regex compiler to interpret the penultimate "." as a literal character, rather than as another instance of the "any character" regular expression.
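A short sketch shows both what this expression catches and how rough it is. It assumes preg_match() (PCRE) with "/" delimiters, and the sample addresses are invented for illustration.

```php
<?php
// Assumption: preg_match() (PCRE); sample strings are hypothetical.

$pattern = '/.+@.+\..+/';

$plausible = preg_match($pattern, 'user@example.com');  // 1
$noDot     = preg_match($pattern, 'user@example');      // 0: no literal "." after the "@"
$nonsense  = preg_match($pattern, '!!@??.##');          // 1: "any character" accepts anything

var_dump($plausible, $noDot, $nonsense);
```

The last result is the reason a wildcard-only pattern is not enough: it checks the shape of the string but not which characters appear in it.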

However, that is a rather primitive way of checking for the validity of an email address. After all, only letters of the alphabet, the underscore character (_), the minus character (-), and digits are allowed in the name, domain, and extension portions of an email address. This is where the range denominators come into play.

As mentioned previously, anything within nonescaped square brackets represents a set of alternatives for a particular character position. For example, [abc] indicates either an "a", a "b", or a "c". However, representing something like "any character" by including every possible symbol in the square brackets would give birth to some ridiculously long regular expressions, and regexes are complex enough as it is.

Luckily, it's possible to specify a "range" of characters by separating them with a dash. For example, [a-z] means "any lowercase letter." You can also specify more than one range and combine them with individual characters by placing them side-by-side. For example, our email validation requirements can be largely satisfied by the expression [A-Za-z0-9_] (to accept the minus character as well, place it last inside the brackets, as in [A-Za-z0-9_-]), which turns the overall regex into

 [A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]+ 
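The following sketch tries this expression with preg_match() (PCRE), which is an assumption since the text does not name a function. It also adds the "^" and "$" anchors described earlier. They are not part of the expression as printed, but without them the pattern only needs to match somewhere inside the string.

```php
<?php
// Assumption: preg_match() (PCRE); the anchored variant is an addition for illustration.

$unanchored = '/[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]+/';
$anchored   = '/^[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]+$/';

$valid      = preg_match($anchored,   'john_doe@example.com');  // 1
$nonsense   = preg_match($unanchored, '!!@??.##');              // 0: symbols are not in the class
$wrappedAny = preg_match($unanchored, '<john@example.com>');    // 1: matches the inner part
$wrappedAll = preg_match($anchored,   '<john@example.com>');    // 0: "<" and ">" are rejected

var_dump($valid, $nonsense, $wrappedAny, $wrappedAll);
```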

The range specifications that we have seen so far are all inclusive; that is, they tell the regex compiler which characters can be in the string. Sometimes, it's more convenient to use exclusive specifications, dictating that any character except the characters you specify is valid. This can be done by prepending a caret character (^) to the character specifications inside the square bracket. For example, [^A-Z] means "any character except any uppercase letter of the alphabet."
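As a small sketch of an exclusive specification, the code below assumes preg_match() (PCRE) and uses made-up sample strings. Note that the caret means "not" only as the first character inside the brackets. Outside them it is still the beginning-of-string marker.

```php
<?php
// Assumption: preg_match() (PCRE); sample strings are illustrative only.

// The whole string must consist of characters other than uppercase letters.
$noUpper  = preg_match('/^[^A-Z]+$/', 'hello 123');  // 1
$hasUpper = preg_match('/^[^A-Z]+$/', 'Hello');      // 0: "H" is excluded

// Unanchored: is there at least one character that is not an uppercase letter?
$allUpper = preg_match('/[^A-Z]/', 'ABC');           // 0
$oneLower = preg_match('/[^A-Z]/', 'ABc');           // 1: "c" qualifies

var_dump($noUpper, $hasUpper, $allUpper, $oneLower);
```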

Going back to the email validation regex, it's still not as good as it could be. For example, we know for sure that a domain extension (for example, .ca or .com) must have a minimum of two characters (as in .ca) and a maximum of four (as in .info). We can therefore use the minimum-maximum length specifier that I introduced earlier to specify this additional requirement:

 [A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{2,4} 
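One caveat is worth sketching here: a maximum only has an effect if the regex is also anchored at the end of the string, because otherwise the first four characters of a longer extension still produce a match. The code below assumes preg_match() (PCRE), and the sample addresses and the added "^" and "$" anchors are illustrative additions.

```php
<?php
// Assumption: preg_match() (PCRE); anchors and sample addresses are added for illustration.

$unanchored = '/[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{2,4}/';
$anchored   = '/^[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{2,4}$/';

$two      = preg_match($anchored, 'user@example.ca');        // 1
$four     = preg_match($anchored, 'user@example.info');      // 1
$one      = preg_match($anchored, 'user@example.c');         // 0: below the minimum
$six      = preg_match($anchored, 'user@example.museum');    // 0: above the maximum
$sixLoose = preg_match($unanchored, 'user@example.museum');  // 1: stops after ".muse"

var_dump($two, $four, $one, $six, $sixLoose);
```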

Naturally, you may want to allow only email addresses that have a three-letter domain extension (such as .com). This can be accomplished by omitting the comma and the max parameter from the length specifier:

 [A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{3} 

If, on the other hand, you would like to leave the maximum number of characters open in anticipation of the fact that longer domain extensions may be introduced in the future, you could use the following regex:

 [A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{3,} 

This indicates that the last regex in the expression should be repeated at least three times, with no fixed upper limit.
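The difference between the {3} and {3,} forms can be sketched as follows. The code assumes preg_match() (PCRE). The "^" and "$" anchors and the sample addresses are additions made here so that the length of the extension is actually enforced.

```php
<?php
// Assumption: preg_match() (PCRE); anchors and sample addresses are added for illustration.

$exactlyThree = '/^[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{3}$/';
$threeOrMore  = '/^[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9_]{3,}$/';

$exactCom  = preg_match($exactlyThree, 'user@example.com');    // 1
$exactCa   = preg_match($exactlyThree, 'user@example.ca');     // 0: too short
$exactInfo = preg_match($exactlyThree, 'user@example.info');   // 0: too long

$openCom   = preg_match($threeOrMore, 'user@example.com');     // 1
$openLong  = preg_match($threeOrMore, 'user@example.museum');  // 1: no upper limit
$openCa    = preg_match($threeOrMore, 'user@example.ca');      // 0: still below the minimum

var_dump($exactCom, $exactCa, $exactInfo, $openCom, $openLong, $openCa);
```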



PHP 5 Unleashed
ISBN: 067232511X
Year: 2004
Pages: 257