Defining More Complicated Patterns


Once you understand how to use literals and metacharacters to create a pattern, you can learn about groupings and classes, which will allow you to define more complex patterns.

Groupings

Using the basic symbols established so far, you can begin to incorporate parentheses to group characters into more involved patterns. Grouping works as you might expect: "(abc)" will only match abc, "(trout)" will only match trout. These examples are moot points, though, as "abc" will also only match abc. It is when you begin to use metacharacters with parentheses that you will see how groupings affect your patterns.

Essentially, think of parentheses as being used to establish a new literal. Logically, whereas "a" is a literal that matches only a, "(abc)" is a literal that matches only abc. From this notion, quantifiers, instead of applying solely to the immediately preceding literal, will apply to the whole group. Hence, "a{3}" matches aaa, but "(abc){3}" matches abcabcabc. As a better example, "bon+" will only match a string beginning with bo followed by one or more n's (say, bonnet), but "(bon)+" will match a string beginning with bon, followed by zero or more additional bon's (bonbon). The parentheses restrict and control our pattern.
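The difference between quantifying a single literal and quantifying a group can be checked with a short script. This is only a sketch: it assumes PHP 7 or later, where the ereg() family is no longer available, so it uses preg_match() with slash-delimited patterns instead, and the helper name groupMatches() is hypothetical.

```php
<?php
// Hypothetical helper: true if $pattern matches somewhere in $subject.
// preg_match() returns 1 on a match and 0 when there is none.
function groupMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// A quantifier applies to the single preceding literal...
var_dump(groupMatches('/^a{3}$/', 'aaa'));            // bool(true)
// ...or to an entire parenthesized group.
var_dump(groupMatches('/^(abc){3}$/', 'abcabcabc'));  // bool(true)
var_dump(groupMatches('/^(abc){3}$/', 'abccc'));      // bool(false)

// "bon+" repeats only the n; "(bon)+" repeats the whole group.
var_dump(groupMatches('/^bon+/', 'bonnet'));          // bool(true)
var_dump(groupMatches('/^(bon)+$/', 'bonbon'));       // bool(true)
var_dump(groupMatches('/^(bon)+$/', 'bonn'));         // bool(false)
```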

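Similarly, parentheses control how far the pipe (|) reaches. While "ye(s|n)o" matches either yeso or yeno (ye plus either s or n plus o), "(yes)|(no)" accepts either of those two words in their entirety, which is certainly what you would rather look for. Because the pipe has the lowest precedence of any metacharacter, "yes|no" without parentheses behaves the same way as "(yes)|(no)"; the parentheses simply make the intent explicit.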
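How parentheses scope the pipe can be tried directly. The sketch below assumes PHP 7 or later and uses preg_match() rather than ereg(); the helper matchesWhole() is a hypothetical name, and it wraps each pattern in a non-capturing group so that the anchors apply to the whole alternation.

```php
<?php
// Hypothetical helper: true only if all of $subject matches $pattern.
// The (?: ... ) wrapper keeps ^ and $ from binding to one alternative only.
function matchesWhole(string $pattern, string $subject): bool
{
    return preg_match('/^(?:' . $pattern . ')$/', $subject) === 1;
}

// The pipe inside a group chooses between single letters.
var_dump(matchesWhole('ye(s|n)o', 'yeso'));   // bool(true)
var_dump(matchesWhole('ye(s|n)o', 'yeno'));   // bool(true)
var_dump(matchesWhole('ye(s|n)o', 'yes'));    // bool(false)

// The pipe between groups chooses between whole words.
var_dump(matchesWhole('(yes)|(no)', 'yes'));  // bool(true)
var_dump(matchesWhole('(yes)|(no)', 'no'));   // bool(true)
var_dump(matchesWhole('(yes)|(no)', 'yeso')); // bool(false)
```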

Classes

Regardless of how you combine your letters into various groups, they will only ever be useful for matching specific words. But what if you wanted to match any four-letter lowercase word or any number sequence? For this, you define and utilize classes (more formally referred to as character classes).

Table 8.2. These are the most common predefined classes built into PHP; using them will save you the time of defining your own classes.

Predefined Classes for Regular Expressions
Class          Matches
[[:alpha:]]    any letter
[[:digit:]]    any digit
[[:alnum:]]    any letter or digit
[[:space:]]    any white space
[[:upper:]]    any uppercase letter
[[:lower:]]    any lowercase letter
[[:punct:]]    any punctuation mark

Classes are created by placing characters within square brackets ([]). For example, you can match any one vowel with "[aeiou]". Or you can use the hyphen to indicate a range of characters: "[a-z]" is any single lowercase letter, "[A-Z]" is any single uppercase letter, "[A-Za-z]" is any letter in general, and "[0-9]" matches any digit. You should note that each of these patterns will only ever match one character on its own, but with a quantifier, "[a-z]{3}" would match abc, def, etc.

Within the square brackets, the caret symbol, which normally indicates the beginning of a string, instead excludes characters when it appears first. So "[^a]" will match any single character that isn't a.
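Custom classes, ranges, and the excluding caret can be exercised as follows. This is a sketch under the assumption of PHP 7 or later, using preg_match() in place of ereg(); classMatches() is a hypothetical helper that anchors the class to the whole string.

```php
<?php
// Hypothetical helper: true if all of $subject is matched by $class.
function classMatches(string $class, string $subject): bool
{
    return preg_match('/^' . $class . '$/', $subject) === 1;
}

var_dump(classMatches('[aeiou]', 'e'));    // bool(true): one vowel
var_dump(classMatches('[aeiou]', 'ea'));   // bool(false): a class is one character
var_dump(classMatches('[a-z]{3}', 'abc')); // bool(true): three lowercase letters
var_dump(classMatches('[A-Z]{3}', 'abc')); // bool(false): the range is case-sensitive
var_dump(classMatches('[A-Z]{3}', 'ABC')); // bool(true)
var_dump(classMatches('[^a]', 'b'));       // bool(true): anything except a
var_dump(classMatches('[^a]', 'a'));       // bool(false)
```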

PHP has already defined some classes that will be most useful to you in your programming. There is "[[:alpha:]]" for matching any letter (the equivalent of "[A-Za-z]"), "[[:digit:]]" for any number (or "[0-9]"), and "[[:alnum:]]" for any letter or number (otherwise written as "[A-Za-z0-9]").
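The predefined classes can be used just like hand-written ones. The sketch below assumes PHP 7 or later and relies on preg_match(), which accepts the same [[:name:]] notation inside its patterns; the helper name isAllOfClass() is hypothetical, and the test strings are plain ASCII.

```php
<?php
// Hypothetical helper: true if $subject consists of one or more
// characters that all belong to the given class.
function isAllOfClass(string $class, string $subject): bool
{
    return preg_match('/^' . $class . '+$/', $subject) === 1;
}

var_dump(isAllOfClass('[[:alpha:]]', 'Trout'));   // bool(true)
var_dump(isAllOfClass('[[:alpha:]]', 'Trout8'));  // bool(false)
var_dump(isAllOfClass('[[:digit:]]', '94710'));   // bool(true)
var_dump(isAllOfClass('[[:alnum:]]', 'abc123'));  // bool(true)
var_dump(isAllOfClass('[[:alnum:]]', 'abc 123')); // bool(false): space is not alphanumeric
var_dump(isAllOfClass('[[:upper:]]', 'PHP'));     // bool(true)
```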

By defining your own classes and using those built into PHP (see Table 8.2), you can make better patterns for regular expressions.

Examples of patterns

Using the information introduced above, you can now create some useful patterns, and I'll give some detailed explanations of how I arrived there.

"^([0-9]{5})(-[0-9]{4})?$"

The pattern above matches a zip code, which begins with precisely five digits, possibly followed by a dash and four more digits. The caret indicates that the match must start at the beginning of the string, and the first parenthetical states that you need precisely five digits (the class followed by the curly braces). Then you make a second parenthetical which starts with a dash, followed by precisely four digits. You utilize the question mark to state that this section is optional (i.e., there can be zero or one of these), and you use the dollar sign to mandate that the string must end there, whether or not the optional section is present.
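The zip code pattern can be tried against a few sample values. This is a minimal sketch, assuming PHP 7 or later with preg_match() standing in for ereg(); the function name isZipCode() is hypothetical.

```php
<?php
// Hypothetical helper wrapping the zip code pattern in PCRE delimiters.
function isZipCode(string $value): bool
{
    return preg_match('/^([0-9]{5})(-[0-9]{4})?$/', $value) === 1;
}

var_dump(isZipCode('94710'));       // bool(true): five digits
var_dump(isZipCode('94710-0001'));  // bool(true): five digits, dash, four digits
var_dump(isZipCode('9471'));        // bool(false): too few digits
var_dump(isZipCode('94710-001'));   // bool(false): optional part needs four digits
var_dump(isZipCode('94710-00012')); // bool(false): nothing may follow the match
```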

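Here is a more involved pattern for matching an email address: "^([0-9a-z]+)([0-9a-z._-]+)@([0-9a-z._-]+)\.([0-9a-z]+)". Within each class the hyphen is written last so that it is treated as a literal dash; placed between the period and the underscore, it would instead define a range of characters.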

Since you will use eregi() for matching, you need not concern yourself with alphabetic case, so the character classes only include "a-z" and not "A-Z" as well.

The first step in the pattern says that an email address must begin with at least one letter or number, followed by one or more letters, numbers, underscores, periods, and dashes.

Second, there is the at symbol, which is required in any email address.

Third, the pattern insists upon at least one letter, number, dash, period, or underscore.

Fourth, an email address must include a period.

Lastly, there must be at least one letter or number concluding the string (it cannot end with a period).
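The steps above can be sketched as a small check. Several things here are assumptions rather than the book's own code: the sketch targets PHP 7 or later and so uses preg_match() with the i (case-insensitive) modifier in place of eregi(); it writes each class as [0-9a-z._-], with the hyphen last so that it is read as a literal dash; it appends a dollar sign so that the final step really does conclude the string; and looksLikeEmail() is a hypothetical name.

```php
<?php
// Hypothetical helper. The trailing $ and the hyphen-last classes are
// adjustments made for this sketch, not part of the original pattern.
function looksLikeEmail(string $value): bool
{
    $pattern = '/^([0-9a-z]+)([0-9a-z._-]+)@([0-9a-z._-]+)\.([0-9a-z]+)$/i';
    return preg_match($pattern, $value) === 1;
}

var_dump(looksLikeEmail('jane.doe@example.com'));  // bool(true)
var_dump(looksLikeEmail('JANE_DOE@EXAMPLE.COM'));  // bool(true): case is ignored
var_dump(looksLikeEmail('jane.doe.example.com'));  // bool(false): no at symbol
var_dump(looksLikeEmail('jane@example'));          // bool(false): no period after the @
var_dump(looksLikeEmail('jane@example.com.'));     // bool(false): ends with a period
var_dump(looksLikeEmail('j@example.com'));         // bool(false): see note below
```

The last case fails because the first two groupings each demand at least one character, so the part before the at symbol must be two or more characters long.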

The space, a literal character, will mark the end of the string (so that, if the email address is found within other text, the words following the email address will not be included in the pattern). Script 8.3 is a modification of HandleForm.php using this more specific pattern.

Script 8.3. The more complex email validation pattern has been inserted into the HandleForm.php script to provide a more specific level of matching.


Back Referencing

Finally, there is one more concept to discuss with regard to establishing patterns: back referencing. You'll learn more about this process when you are matching and replacing text later in the chapter.

In the zip code matching pattern, "^([0-9]{5})(-[0-9]{4})?$", notice that there are two groupings within parentheses: "([0-9]{5})" and "(-[0-9]{4})". Within a regular expression pattern, PHP will automatically number parenthetical groupings beginning at 1. Back referencing allows you to refer to each individual section by using a double backslash (\\) in front of the corresponding number. For example, if you match the zip code 94710-0001 with this pattern, referring back to \\1 will give you "94710". You'll see an example of this in action in the next section.
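The numbering of groupings can be seen in a short sketch. It assumes PHP 7 or later and uses preg_match() and preg_replace() rather than the ereg() functions; the variable names are illustrative only.

```php
<?php
// The zip code pattern, wrapped in PCRE delimiters.
$pattern = '/^([0-9]{5})(-[0-9]{4})?$/';

// preg_match() fills $groups: index 0 holds the whole match,
// and indexes 1 and 2 hold the two parenthetical groupings.
preg_match($pattern, '94710-0001', $groups);
echo $groups[0], "\n"; // 94710-0001
echo $groups[1], "\n"; // 94710
echo $groups[2], "\n"; // -0001

// In a replacement string, \\1 refers back to the first grouping,
// so the whole zip code is replaced by its five-digit part.
$fiveDigit = preg_replace($pattern, '\\1', '94710-0001');
echo $fiveDigit, "\n"; // 94710
```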



PHP for the World Wide Web (Visual QuickStart Guide)
ISBN: 0201727870
Year: 2001
Pages: 116
Authors: Larry Ullman
