15.2. Pattern Matching

15.2.1.

Pattern matching allows you to build expressions that match strings using a specific matching syntax called a regular expression. Regular expressions allow you to perform searching tasks such as extracting a certain tag from an incoming text file, or validating user input such as email addresses.

The easiest way to use regular expressions in PHP is through the PCRE (Perl-compatible regular expressions) extension. This extension is enabled by default, so it should be part of your PHP environment. PHP also had an older style of regular expression functions called ereg, which are less compatible than the PCRE functions; they were deprecated in PHP 5.3 and removed in PHP 7.

A regular expression is really just a string. The string uses a combination of special characters and literals to allow matching of other strings. For example, the following string describes an email address:

 \b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b 

It does this by searching for:

  1. Sequential alphanumeric and punctuation characters, which form the username

  2. The at symbol (@)

  3. A group of alphanumeric and punctuation characters, which forms the first part of the domain name

  4. A period, which separates the domain name from the extension

  5. A two- to four-character alpha string, which signifies the top-level domain, for example, com and net

The descriptors used in the regular expression are:

  \b        A word boundary
  [aAbB]    Any one of the characters inside the brackets: a, A, b, or B
  {2,4}     Between two and four repetitions of whatever precedes the braces
  A-Z       Inside brackets, any letter from A to Z, such as A, B, or C
  \.        A literal period
  +         One or more repetitions of the preceding block

There are two types of characters in the regular expression string. Those that match themselves, such as the at (@) symbol, are called literals, meaning they literally match. The other type is called metacharacters, which describe matching by specifying repetition, ranges, and combinations within the expression.
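As a minimal sketch, the email pattern above can be tried out with preg_match, which is covered later in this section. The helper name looksLikeEmail and the sample strings are hypothetical. The trailing i is an assumption added here: the pattern as written lists only uppercase ranges, so without it lowercase addresses would not match.

```php
<?php
// Hypothetical helper: reports whether $text contains something shaped like an email address.
function looksLikeEmail(string $text): bool
{
    // The pattern from the text, wrapped in / delimiters. The trailing i
    // (an assumption, not part of the original pattern) makes it case insensitive.
    $pattern = '/\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b/i';

    // preg_match returns 1 when the pattern is found.
    return preg_match($pattern, $text) === 1;
}

var_dump(looksLikeEmail('jane.doe@example.com'));
var_dump(looksLikeEmail('jane.doe at example'));
```

The first call reports a match and the second does not, because the second string has no at symbol. A pattern this short only checks the general shape of an address; it does not prove that the address exists.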

15.2.1.1. Quantifiers

Quantifiers are metacharacters that specify how many times you wish to match the preceding pattern in a string.

Quantifiers include:

  *            Zero or more
  +            One or more
  ?            Zero or one
  {num}        Exactly num times
  {num,}       At least num times
  {min,max}    At least min but not more than max times

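For example, the regular expression [a-f]?ex matches dex and ex. It also finds ex within alex (the optional [a-f] matches nothing there, because l is outside the range), but it does not match ax.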
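The quantifiers listed above can be compared side by side with a small sketch. The patterns and subjects are made-up samples. Each pattern is wrapped in ^ and $ (anchors, described in the next section) so that the whole subject has to fit the pattern rather than just part of it.

```php
<?php
// Each entry pairs a pattern with a subject to test it against.
$cases = [
    ['/^ab*c$/',     'ac'],      // * allows zero b characters
    ['/^ab+c$/',     'ac'],      // + requires at least one b
    ['/^ab?c$/',     'abc'],     // ? allows zero or one b
    ['/^ab{2}c$/',   'abbc'],    // exactly two b characters
    ['/^ab{2,}c$/',  'abc'],     // at least two b characters are required
    ['/^ab{1,3}c$/', 'abbbbc'],  // between one and three b characters
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    // preg_match returns 1 on a match and 0 when there is none.
    $results[] = preg_match($pattern, $subject);
}

print_r($results);
```

The first, third, and fourth cases match. The others fail because the subject has too few or too many b characters for the quantifier.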

15.2.1.2. Anchors

Anchors define a specific location for a match to take place. The caret character (^) matches the start of the subject string, and the dollar character ($) matches the end; with the m option listed in Table 15-2 they also match at the start and end of each line. For example, to match a string that begins with I, use the regular expression ^I.
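A short sketch of how the two anchors restrict where a match may occur. The sample sentences are hypothetical, and preg_match is described later in this section.

```php
<?php
// ^I matches only when I is the first character of the subject.
$startsWithI = preg_match('/^I/', 'I like PHP');
$notAtStart  = preg_match('/^I/', 'So do I');

// PHP$ matches only when PHP sits at the very end of the subject.
$endsWithPhp = preg_match('/PHP$/', 'I like PHP');

// Using both anchors means the entire subject must be exactly PHP.
$wholeString = preg_match('/^PHP$/', 'I like PHP');

var_dump($startsWithI, $notAtStart, $endsWithPhp, $wholeString);
```

Only the first and third calls return 1. The I in the second subject is at the end rather than the start, and the last subject contains more than just PHP.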

Other anchors deal with word boundaries. Words are made up of consecutive letters, digits, and underscores. A word boundary is the position between a word character and any other character, such as a space, punctuation mark, or newline, or the position at the start or end of the string next to a word character. To match a word boundary, the backslash b (\b) character is used. To match everywhere that isn't a word boundary, the backslash capital B (\B) character is used. Table 15-1 lists these anchors along with other commonly used escaped characters.

Table 15-1. Escaped word boundaries

  Character   Meaning
  \b          A word boundary
  \B          A nonword boundary
  \d          A single digit character
  \D          A single nondigit character
  \n          The newline character
  \r          The carriage return character
  \s          A single whitespace character
  \S          A single nonwhitespace character
  \t          The tab character
  \w          A single word character, alphanumeric and underscore
  \W          A single nonword character
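The following sketch exercises a few entries from Table 15-1. It relies on preg_match_all, a real PCRE function that this section does not cover; it returns the number of matches and can collect them into an array. The sample text is hypothetical.

```php
<?php
$subject = 'cat concat cat_2 42';

// \bcat\b matches cat only as a whole word, not inside concat or cat_2
// (the underscore counts as a word character, so no boundary follows the t).
$wholeWords = preg_match_all('/\bcat\b/', $subject);

// \Bcat matches cat only where it does not begin a word, as in concat.
$insideWords = preg_match_all('/\Bcat/', $subject);

// \d+ collects runs of digits; \w+ collects runs of word characters.
preg_match_all('/\d+/', $subject, $digits);
preg_match_all('/\w+/', $subject, $words);

print_r($digits[0]);
print_r($words[0]);
```

Here $digits[0] holds 2 and 42, and $words[0] holds the four space-separated words, since \w treats the underscore in cat_2 as part of the word.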


15.2.1.3. Character classes

A character class allows you to group several characters together and work with them in a regular expression as though they were one character. Use the square brackets ([]) to group the characters together. For example, to match two consecutive alphabetic characters:

 [a-zA-Z]{2} 

You can also use a negated character class, which selects the opposite of the character class, by adding a caret (^) character immediately after the opening square bracket. Note that this is the only place where an unescaped caret doesn't represent an anchor. The following matches any single nonalpha character:

 [^a-zA-Z] 
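Both character classes above can be tried in a minimal sketch. The sample strings are hypothetical, and preg_replace is a real PCRE function that this section does not cover; it is used here only to show what the negated class selects.

```php
<?php
// [a-zA-Z]{2} needs two letters in a row somewhere in the subject.
$lettersApart    = preg_match('/[a-zA-Z]{2}/', 'a1b2'); // letters are never adjacent
$lettersTogether = preg_match('/[a-zA-Z]{2}/', '12ab');

// [^a-zA-Z] matches any single character that is not a letter;
// replacing each such character with an empty string leaves only the letters.
$lettersOnly = preg_replace('/[^a-zA-Z]/', '', 'PHP 5 & MySQL!');

var_dump($lettersApart, $lettersTogether, $lettersOnly);
```

The first call returns 0 and the second returns 1, while $lettersOnly ends up as PHPMySQL.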

15.2.1.4. Executing pattern matches in PHP

PHP uses a set of functions that start with preg_ to perform regular expression operations on strings. These functions take a regular expression as a parameter in a string format. There are functions for doing a variety of operations on strings, including splitting them up and returning matching portions.

The regular expression must be wrapped in delimiters and passed as a PHP string. By convention the delimiters are forward slashes and the string is single quoted, as in '/regular expression/'. Forward slashes inside the expression must then be escaped with a backslash. For example, /home/example becomes '/\/home\/example/'.
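The escaping rule can be checked with a short sketch. Two things here are not mentioned in the text: PCRE also accepts other delimiter characters, such as #, and the preg_quote function can do the escaping for you. The path is a hypothetical sample.

```php
<?php
$path = '/home/example/index.php';

// Slash delimiters: every / inside the pattern has to be escaped.
$withSlashes = preg_match('/\/home\/example/', $path);

// With # as the delimiter, the slashes need no escaping.
$withHashes = preg_match('#/home/example#', $path);

// preg_quote escapes regex metacharacters in literal text; the second
// argument names the delimiter that should be escaped as well.
$quoted = preg_quote('/home/example', '/');

var_dump($withSlashes, $withHashes, $quoted);
```

Both calls find the path, and $quoted contains the same escaped text that was written by hand in the first pattern.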

To specify regular expression options such as case insensitivity, add the option character to the end of the regex string, after the last slash. The most common options are listed in Table 15-2.

Table 15-2. Regular expression characters

  Regex character   Meaning
  s                 Dot matches all characters
  i                 Case insensitive
  m                 Match start and end of line anchors at embedded new lines in the search string


For example, use '/abc/i' to do a case-insensitive search of abc.
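As a rough illustration, each option in Table 15-2 can be compared with the same pattern run without it. The sample strings are hypothetical.

```php
<?php
// i: ignore case.
$caseSensitive   = preg_match('/abc/',  'ABC');
$caseInsensitive = preg_match('/abc/i', 'ABC');

$text = "first\nsecond";

// s: let the dot match a newline as well as every other character.
$dotDefault = preg_match('/first.second/',  $text);
$dotAll     = preg_match('/first.second/s', $text);

// m: let ^ and $ match at line breaks inside the subject.
$lineStart      = preg_match('/^second/',  $text);
$lineStartMulti = preg_match('/^second/m', $text);

var_dump($caseInsensitive, $dotAll, $lineStartMulti);
```

In each pair, the call without the option returns 0 and the call with it returns 1. Options can also be combined, as in '/abc/im'.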

15.2.1.5. preg_match

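The function preg_match searches a string for the first match of the supplied regular expression. It returns 1 if a match is found, 0 if not, and false if an error occurs. If the optional third argument is supplied, it is filled with the matched text and the text of any parenthesized groups. Its syntax is: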

 preg_match (string pattern, string subject [, array groups]) 

In Example 15-3, we test whether the string example starts with ple. Since the string doesn't start with ple, no match is found and the $matches array is left empty.

Example 15-3. Using preg_match to return an array of matches that start with ple

 <?php
 $subject = "example";
 $pattern = '/^ple/';
 preg_match($pattern, $subject, $matches);
 print_r($matches);
 ?>

Example 15-3 displays:

 Array
 (
 )
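For contrast, here is a sketch of a search on the same subject that does succeed. The second pattern uses parentheses to capture parts of the match; this is an assumption about how the groups argument in the syntax line is filled, since the text above does not show it.

```php
<?php
$subject = "example";

// Without the ^ anchor, ple may match anywhere in the subject.
$found = preg_match('/ple/', $subject, $matches);

// Parentheses capture groups: $groups[0] holds the whole match,
// $groups[1] and $groups[2] hold the parenthesized parts.
preg_match('/^(ex)(am)/', $subject, $groups);

print_r($matches);
print_r($groups);
```

Here $found is 1 and $matches contains the single element ple, while $groups contains exam, ex, and am in that order.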



Learning PHP and MySQL
ISBN: 0596101104