14.1 Perl Style


PHP provides two major interfaces for working with regular expressions. One of these interfaces supports Perl-style regular expressions. For many years Perl has been known as the big player in the regular expression business. Because many people like Perl's way of dealing with pattern matching, a set of Perl-compatible functions has been built into PHP as well.

14.1.1 preg_match and Regular Expression Fundamentals

In this section, the preg_match function and regular expression fundamentals will be discussed, and you will take a closer look at the most important pattern modifiers.

14.1.1.1 Pattern Modifiers

The first important function is called preg_match and it can be used to see whether a certain pattern can be found in a string. Let's take a look at an example:

<?php
    $string = "Welcome to the chapter about regular expressions";
    if (preg_match("/chapter/i", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

First the string is defined. In the next line the script checks whether the string chapter can be found in $string. If so, a message is displayed. As you can see, preg_match accepts two parameters here. The first parameter contains the pattern you are looking for. The second parameter contains the string you want to check. Let's take a closer look at the first parameter: The pattern has to be enclosed in delimiters. A slash is the most common choice, but any character that is not alphanumeric, a backslash, or whitespace can be used. The opening delimiter is followed by the pattern itself, and the closing delimiter tells PHP where the pattern ends. After the closing delimiter, pattern modifiers can be listed. In this example the modifier i is used, which means that PHP will not distinguish between uppercase and lowercase letters. If chapter is spelled with uppercase letters, the script will still find the pattern:

<?php
    $string = "Welcome To The Chapter About Regular Expressions";
    if (preg_match("/chapter/i", $string))
    {
        echo "substring can be found<br>\n";
    }
?>
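The return value of preg_match can also be inspected directly. The following is a minimal sketch; the variable names and the use of # as an alternative delimiter are illustrative choices, not part of the original listings:

```php
<?php
// Sketch: inspect the value preg_match() returns instead of only testing it.
$string = "Welcome To The Chapter About Regular Expressions";

$withModifier    = preg_match("/chapter/i", $string); // 1: case is ignored
$withoutModifier = preg_match("/chapter/", $string);  // 0: "Chapter" is not "chapter"

// "#" as delimiter is an illustrative choice; it behaves exactly like "/".
$otherDelimiter  = preg_match("#chapter#i", $string); // 1

var_dump($withModifier, $withoutModifier, $otherDelimiter);
```

preg_match returns 1 if the pattern was found, 0 if it was not, and false if an error occurred, which is why the result can be used directly in an if statement.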

Another important pattern modifier supported by Perl-style regular expressions is the m modifier. Normally ^ matches only at the beginning of the string, even if the string contains newline characters. If the m pattern modifier is used, ^ also matches at the beginning of every line, that is, directly after each newline character. m affects $ in the same way: without m, $ matches at the end of the string (or just before a trailing newline); with m, it also matches at the end of every line. The following example shows the difference:

<?php
    $string = "Welcome To\nThe Chapter About\nRegular Expressions";
    if (preg_match("/^Th/m", $string))
    {
        echo "1: substring can be found<br>\n";
    }
    if (preg_match("/^Th/", $string))
    {
        echo "2: substring can be found<br>\n";
    }
?>

The first regular expression matches because m lets ^ match at the beginning of every line, and the second line starts with Th. The second regular expression does not match because the string itself does not begin with Th. Therefore the output of the script contains just one line:

 1: substring can be found 

Another important modifier that is also supported by Perl is s (PCRE_DOTALL). If s is used, dots also match newline characters. If s is not enabled, dots do not match newlines. Let's take a look at an example:

<?php
    $string = "Welcome To\nThe Chapter About\nRegular Expressions";
    if (preg_match("/To.The/s", $string))
    {
        echo "1: substring can be found<br>\n";
    }
    if (preg_match("/To.The/", $string))
    {
        echo "2: substring can be found<br>\n";
    }
?>

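The first regular expression matches because s is used and therefore the dot matches the newline character between To and The. The second regular expression does not match, so only the first message is displayed.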
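The effect of the m and s modifiers can be compared side by side. This is a small sketch; the array keys and the /About$/ pattern are made up for illustration:

```php
<?php
$string = "Welcome To\nThe Chapter About\nRegular Expressions";

// Each entry holds the return value of preg_match(): 1 for a match, 0 otherwise.
$results = [
    'dot without s' => preg_match('/To.The/', $string),   // 0: dot does not match "\n"
    'dot with s'    => preg_match('/To.The/s', $string),  // 1: dot matches "\n"
    'end without m' => preg_match('/About$/', $string),   // 0: "About" is not at the end of the string
    'end with m'    => preg_match('/About$/m', $string),  // 1: "About" ends the second line
];

foreach ($results as $label => $result) {
    print "$label: $result\n";
}
```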

14.1.1.2 Metacharacters

Regular expressions support a set of metacharacters. Every metacharacter has a special meaning. Table 14.1 contains an overview of all important metacharacters.

Table 14.1. An Overview of Metacharacters
Symbol   Meaning
\        Escapes the following character
^        Matches the beginning of the string (or of each line if m is set)
$        Matches the end of the string (or of each line if m is set)
.        Matches any single character except a newline (any character if s is set)
[        Starts a character class definition
]        Ends a character class definition
|        Starts an alternative branch
(        Starts a subpattern
)        Ends a subpattern
?        Matches the preceding item zero or one time
*        Matches the preceding item zero or more times
+        Matches the preceding item one or more times
{        Starts a min/max quantifier
}        Ends a min/max quantifier

After you have seen which metacharacters are defined, it's time to take a closer look at some practical examples where you can see how these metacharacters can be used.

Escaping special characters is one of the most important things when working with regular expressions and PHP in general. To escape a character, put a backslash in front of it. The next example shows how a closing parenthesis can be escaped:

<?php
    $string = 'this is a bracket: ) ';
    if (preg_match("/\)/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The regular expression starts with a slash. Next comes a backslash, which escapes the parenthesis that follows it so that it is treated as a literal character. Finally the regular expression is terminated by a slash. If the backslash is removed from the regular expression, a warning similar to the following will be displayed:

 Warning: Compilation failed: unmatched parentheses at offset 0 in /var/www/html/regexp.php on line 3 

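One question that is often asked is how a backslash can be matched when working with regular expressions. A backslash is escaped like any other metacharacter, by putting a backslash in front of it, but PHP's own string escaping adds a second layer. The next example shows how to match a literal backslash:

<?php
    // In a single-quoted string, \\ produces one backslash character.
    $string = 'this is a backslash: \\ ';
    // PHP turns "\\\\" into \\, which the regular expression engine
    // reads as one literal backslash.
    if (preg_match("/\\\\/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

Four backslashes are needed in the PHP source to match a single backslash. The reason is that escaping happens twice: PHP's string parser reduces each pair of backslashes to one, so the pattern that reaches the regular expression engine is /\\/, and the engine in turn treats those two backslashes as one escaped, literal backslash. Note that the subject string contains only one backslash as well, because \\ inside a quoted PHP string stands for a single backslash.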
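The two layers of escaping can be checked by counting characters. The following sketch uses made-up variable names and a short sample string:

```php
<?php
// PHP layer: inside quotes, \\ stands for ONE backslash character.
$string  = 'a\\b';      // three characters: a, backslash, b
$pattern = '/\\\\/';    // four characters: slash, backslash, backslash, slash

$stringLength  = strlen($string);   // 3
$patternLength = strlen($pattern);  // 4

// Regex layer: the two backslashes in the pattern match one literal backslash.
$count = preg_match_all($pattern, $string); // number of backslashes found

print "string: $stringLength chars, pattern: $patternLength chars, matches: $count\n";
```

The pattern contains only two backslashes once PHP has parsed the literal, and the subject contains only one, which the pattern finds exactly once.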

^ matches the beginning of a string. The counterpart of ^ is the $ metacharacter, which matches the end of a string. Used together, they force the pattern to cover the entire string. The next example checks whether a string consists of exactly the word hello:

<?php
    $string = 'hello';
    if (preg_match("/^hello$/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

In many cases, regular expressions are used to search for data you do not know precisely. Imagine a situation where you are looking for the string hello but you don't know how it is spelled. This example might seem ridiculous, but in real-world applications it is not. Most databases contain records that have been inserted by users, and as you probably know, users make mistakes. Therefore, the data in the database might not be what you have been looking for. Typos will lower the quality of your data, but with the help of regular expressions it is possible to retrieve at least some of the words that are not spelled correctly.

Let's take a look at the next example:

<?php
    $string = "hello";
    if (preg_match("/hel.o/", $string))
    {
        echo "1: substring can be found<br>\n";
    }
    $string = "helllllo";
    if (preg_match("/hel.*o/", $string))
    {
        echo "2: substring can be found<br>\n";
    }
?>

The first regular expression matches the string hello. The dot matches exactly one arbitrary character, in this case the second l; the pattern would match helxo or hel-o just as well. If you don't know how many characters appear at a certain position, the * metacharacter is useful: it allows the preceding item to occur zero or more times. The combination .* in the second regular expression therefore means that any sequence of characters, including none at all, may follow the substring hel, as long as an o comes somewhere after it. In contrast, + requires the preceding item to occur at least once.

? is an additional quantifier. It allows the preceding item to occur zero times or one time, which makes that item optional.
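The three quantifiers can be compared on the same list of words. This is a sketch only; matchingWords is a hypothetical helper written for this illustration, and the word list is made up:

```php
<?php
// Hypothetical helper: returns the words that match $pattern.
function matchingWords($pattern, array $words)
{
    $hits = [];
    foreach ($words as $word) {
        if (preg_match($pattern, $word)) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$words = ['heo', 'helo', 'hello', 'helllo'];

$zeroOrOne  = matchingWords('/^hel?o$/', $words); // heo, helo
$zeroOrMore = matchingWords('/^hel*o$/', $words); // heo, helo, hello, helllo
$oneOrMore  = matchingWords('/^hel+o$/', $words); // helo, hello, helllo

print implode(', ', $oneOrMore) . "\n";
```

In all three patterns the quantifier applies only to the l directly in front of it, not to the whole substring hel.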

Sometimes it is necessary to check whether an entire sequence of characters occurs a certain number of times. Parentheses can be used to combine a list of characters into one block. This block is then treated just like one single character. In the next example you can see how this works:

<?php
    $string = "amadeus, amadeus, rock me amadeus";
    if (preg_match("/(amadeus, )+/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The string amadeus followed by a comma and a blank matches if it occurs once or more often in a row. The parentheses tell PHP to treat the entire string as one symbol, and the + applies to that symbol as a whole.

In the preceding scenario, you saw how + can be used. If a string should match two to three times, it is necessary to work with curly brackets as shown in the following listing:

<?php
    $string = "amadeus, amadeus, rock me amadeus";
    if (preg_match("/(amadeus, ){2,3}/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The string is treated as one symbol again, but this time curly brackets are used instead of ?, +, or *.

If a block of characters has to match at least a certain number of times, the upper bound of the interval has to be omitted as shown in the next listing:

<?php
    $string = "amadeus, amadeus, rock me amadeus";
    if (preg_match("/(amadeus, ){3,}/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

In this example the regular expression does not match because the string in parentheses occurs only twice in a row, not three times.
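To see what a quantified block actually matched, preg_match accepts an optional third argument that receives the matched text. The following sketch reuses the string from the listings; the variable names are illustrative:

```php
<?php
$string = "amadeus, amadeus, rock me amadeus";

// The optional third argument is filled with the matched text.
$twoOrThree = preg_match('/(amadeus, ){2,3}/', $string, $matches);
// $matches[0] is the whole match, $matches[1] the last repetition of the block.

$exactlyTwo   = preg_match('/(amadeus, ){2}/', $string);  // 1
$atLeastThree = preg_match('/(amadeus, ){3,}/', $string); // 0

var_dump($matches[0]); // string(18) "amadeus, amadeus, "
```

A single number in curly brackets, as in {2}, requires exactly that many repetitions.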

Sometimes you are looking for words that are spelled slightly differently. An example is the words back and pack. The meaning of the two words is totally different, but the difference between the two words is only one letter. Both words can be retrieved by using just one regular expression:

<?php
    $string = ' "back" does not mean "pack" ';
    if (preg_match("/[bp]/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

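With the help of square brackets, a list of characters can be defined. If one of the characters matches, the condition will be fulfilled. In this example the regular expression matches if either b or p can be found anywhere in the string. To match only the words back and pack, the character class has to be followed by the rest of the word, as in /[bp]ack/.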
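The difference between a bare character class and a class followed by further characters can be sketched as follows; the word list and variable names are made up for illustration:

```php
<?php
$words = ['back', 'pack', 'rack', 'bus'];

$anyBorP    = []; // words containing a b or a p anywhere
$backOrPack = []; // words containing back or pack

foreach ($words as $word) {
    if (preg_match('/[bp]/', $word)) {
        $anyBorP[] = $word;
    }
    if (preg_match('/[bp]ack/', $word)) {
        $backOrPack[] = $word;
    }
}

print implode(', ', $anyBorP) . "\n";    // back, pack, bus
print implode(', ', $backOrPack) . "\n"; // back, pack
```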

If you are looking for a set of characters such as the entire alphabet, it is not recommended to list every character. With regular expressions, it is possible to define a range of characters:

<?php
    $string = 'Hello Pinky';
    if (preg_match("/^[A-Za-z]/", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The regular expression in the listing matches if the string starts with a letter, whether uppercase or lowercase. For case-insensitive matching, it is also possible to use the i pattern modifier instead of working with [A-Za-z]:

<?php
    $string = 'Hello Pinky';
    if (preg_match("/^[a-z]/i", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The script produces the same output as the script you have seen before, but the regular expression is shorter.

Character classes choose between single characters. To choose between entire strings, alternative branches separated by the | symbol can be used:

<?php
    $string = 'Hello Pinky';
    if (preg_match("/(Pinky)|(Brain)/i", $string))
    {
        echo "substring can be found<br>\n";
    }
?>

The regular expression in the script matches if either Pinky or Brain can be found in the string. The | (pipe) symbol separates the alternatives, and PHP tries them from left to right until one of them matches.
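When alternatives are combined with ^ and $, the position of the parentheses matters, because | separates everything to its left from everything to its right. The following sketch uses a made-up sample sentence to show the difference:

```php
<?php
$sentence = 'Pinky and the Brain are here';

// Without grouping, the branches are "^Pinky" and "Brain$".
$ungrouped = preg_match('/^Pinky|Brain$/', $sentence);   // 1: the string starts with Pinky

// With grouping, the whole string must be either Pinky or Brain.
$grouped   = preg_match('/^(Pinky|Brain)$/', $sentence); // 0
$exact     = preg_match('/^(Pinky|Brain)$/', 'Brain');   // 1

var_dump($ungrouped, $grouped, $exact);
```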

14.1.1.3 Special Symbols

Special data like line feeds, tabs, or hex values is written using a backslash followed by a letter or number. You have already seen that metacharacters such as parentheses and backslashes must be escaped with a backslash. In this section you will see that the backslash is also used to give many other characters a special meaning. Table 14.2 contains an overview of all important backslash sequences.

Table 14.2. Special Symbols
Symbol   Meaning
\a       System bell (hex 07)
\cx      "Control + character", where x is any character
\e       Escape (hex 1B)
\f       Formfeed (hex 0C)
\n       Newline (hex 0A)
\r       Carriage return (hex 0D)
\t       Tab (hex 09)
\xhh     Character with hex code hh
\ddd     Character with octal code ddd, or backreference
\040     A space character
\40      Is the same, provided there are fewer than 40 previous capturing subpatterns
\7       Back reference
\11      Might be a back reference, or another way of writing a tab
\011     Tab
\0113    A tab followed by 3
\113     Is the character with octal code 113
\377     Is a byte consisting entirely of 1 bits
\81      Either a back reference, or a binary zero followed by the two characters "8" and "1"
\d       Any decimal digit
\D       Any character that is not a digit
\s       Any whitespace character
\S       Any character that is not a whitespace character
\w       Any word character
\W       Any non-word character
\b       Word boundary
\B       Not a word boundary
\A       Start of subject
\Z       End of subject or newline at end
\z       End of subject independent of multiline mode
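A few of the sequences from Table 14.2 in action. This is a minimal sketch; the sample strings and variable names are made up for illustration:

```php
<?php
// \d: decimal digits. Look for a date in the form YYYY-MM-DD.
$text = 'Invoice 4711 was paid on 2002-05-17';
preg_match('/\d{4}-\d{2}-\d{2}/', $text, $date);        // $date[0] is "2002-05-17"

// \b: word boundary. Find "cat" as a whole word only.
$inSentence = preg_match('/\bcat\b/', 'the cat sat');   // 1
$inLongWord = preg_match('/\bcat\b/', 'concatenate');   // 0

// \S and \s: runs of non-whitespace separated by any whitespace.
preg_match('/^(\S+)\s+(\S+)$/', "key \t value", $pair); // $pair[1] "key", $pair[2] "value"

print $date[0] . "\n";
```

The patterns are written in single quotes so that PHP passes the backslashes on to the regular expression engine unchanged.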

14.1.2 preg_match_all

To perform a global pattern match, PHP provides a function called preg_match_all. Let's take a look at the next example and see what the function can be used for:

<?php
    $string = 'Hello Frank Drebin, Hello Frank Johnston';
    preg_match_all("/Hello Frank [DJ][a-z]*/i", $string, $out);
    print "pattern: ".$out[0][0]." --- ".$out[0][1]."<br>\n";
?>

The pattern matches two parts of the string passed to preg_match_all. The substrings found in $string will be stored in the array called $out. Let's execute the script and see what comes out:

 pattern: Hello Frank Drebin --- Hello Frank Johnston 

Each element of $out[0] contains exactly one matching substring.

You might already have wondered why $out is a two-dimensional and not a one-dimensional array. preg_match_all supports subpatterns, which are defined using parentheses. The text matched by each subpattern can be retrieved as well. Let's take a look at an example:

<?php
    $string = 'Hello Frank Drebin, Hello Frank Johnston';
    preg_match_all("/(Hello) Frank ([DJ][a-z]*)/i", $string, $out);
    print "zero: ".$out[0][0]." --- ".$out[0][1]."<br>\n";
    print "one: ".$out[1][0]." --- ".$out[1][1]."<br>\n";
    print "two: ".$out[2][0]." --- ".$out[2][1]."<br>\n";
?>

The first subpattern in the regular expression is Hello. The second subpattern is ([DJ][a-z]*). The second element on the first axis of the array will contain a substring matching the first subpattern. The third element on the first axis of the array will contain the substring matching the second subpattern, and so on. To make PHP's behavior clearer, it is worth looking at the output of the script:

zero: Hello Frank Drebin --- Hello Frank Johnston
one: Hello --- Hello
two: Drebin --- Johnston

$out[0] contains the two substrings ($out[0][0] and $out[0][1]) matching the entire regular expression. $out[1] contains the two substrings matching the first subpattern. The same applies to $out[2]. With the help of subpatterns, it is possible to build brief yet powerful and flexible applications. The advantage of subpatterns is that you need just one regular expression to retrieve multiple patterns, which is more efficient than working with a set of regular expressions.
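The result array can also be grouped by match rather than by subpattern. A minimal sketch, assuming the PREG_SET_ORDER flag is passed as the fourth argument; the variable names $sets and $surnames are illustrative:

```php
<?php
$string = 'Hello Frank Drebin, Hello Frank Johnston';

// With PREG_SET_ORDER every element of $sets describes one complete match:
// index 0 is the whole match, 1 and 2 are the subpatterns.
preg_match_all('/(Hello) Frank ([DJ][a-z]*)/i', $string, $sets, PREG_SET_ORDER);

$surnames = [];
foreach ($sets as $set) {
    $surnames[] = $set[2];
}

print implode(', ', $surnames) . "\n"; // Drebin, Johnston
```

This ordering is often more convenient when each match is processed on its own, because all parts of one match sit in the same inner array.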

14.1.3 preg_replace

Up to now you have seen how to retrieve data from a string. In the next step you will learn how strings can be modified to your needs using regular expressions. The function for replacing patterns is called preg_replace and works as shown in the next example:

<?php
    $string = 'Hello Frank Drebin, Hello Frank Johnston';
    $string = preg_replace("/Hello/i", "Welcome", $string);
    print $string;
?>

The string Hello should be replaced by Welcome. To perform the task, three parameters have to be passed to preg_replace. The first parameter defines the pattern you are looking for, the second parameter contains the new string, and the third parameter contains the string you want to process. If the script is executed, the desired result will be displayed:

 Welcome Frank Drebin, Welcome Frank Johnston 

Sometimes it is useful to perform more than one substitution at once. Passing arrays to preg_replace makes your code shorter because a single call handles all patterns. The next example shows how two patterns are replaced by new values:

<?php
    $string = 'Jeanny, quit living on dreams';
    $pattern = array("/Jeanny/", "/dreams/");
    $newvals = array("Carlos", "a prayer");
    $string = preg_replace($pattern, $newvals, $string);
    print $string;
?>

Jeanny is replaced by Carlos, and dreams is replaced by the string a prayer:

 Carlos, quit living on a prayer 

Up to now, PHP has always replaced every occurrence of the pattern found in the string. This is not always what you want. Therefore, preg_replace accepts a fourth parameter that limits how many occurrences are replaced. The next script shows how only the first occurrence of the pattern is replaced by a new value:

<?php
    $string = 'Jeanny oh Jeanny quit living on dreams';
    $string = preg_replace("/Jeanny/", "Carlos", $string, 1);
    print $string;
?>

Just pass the maximum number of substitutions you want PHP to perform as the fourth parameter. In this scenario only one change has been made:

 Carlos oh Jeanny quit living on dreams 

As you can see, it is an easy task to perform multiple changes, and it is easy to write complex applications with just a few lines.
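The replacement string can also refer to subpatterns of the match, and a fifth argument reports how many replacements were made. The following is a sketch under those assumptions; the sample string and variable names are illustrative:

```php
<?php
$string = 'Frank Drebin, Frank Johnston';

// $1 and $2 in the replacement stand for the first and second subpattern.
// Single quotes keep PHP from treating them as variables.
// -1 means "no limit"; $count receives the number of replacements made.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', $string, -1, $count);

print "$swapped ($count replacements)\n";
// Drebin Frank, Johnston Frank (2 replacements)
```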

14.1.4 preg_split

Splitting a string with the help of regular expressions is one of the most important tasks. You have already seen in Chapter 3, "PHP Basics," how strings can be split using other strings. In this section you will learn to work with regular expressions.

In the next example a list of words should be extracted and transformed to an array. The string called $string contains four names that are separated by commas and a variable number of whitespace characters:

<?php
    $string = 'Jeanny, Carlos,Amadeus,   Falco';
    $result = preg_split("/,\s*/", $string);
    foreach ($result as $x)
    {
        print "x: $x<br>";
    }
?>

In this example it does not matter if two words are divided by a blank or any other whitespace character because \s has been used. If the script is executed, all four names will be retrieved and displayed utilizing a loop:

x: Jeanny
x: Carlos
x: Amadeus
x: Falco

In some cases it might be useful to extract only the first few words of a string. Therefore, preg_split provides a parameter that defines the number of components you want in the result:

<?php
    $string = 'Jeanny, Carlos,Amadeus,   Falco';
    $result = preg_split("/,\s*/", $string, 2);
    foreach ($result as $x)
    {
        print "x: $x<br>";
    }
?>

This time the result will only be two lines long because the input string has been divided into two parts:

x: Jeanny
x: Carlos,Amadeus, Falco

If all components should be retrieved, it is either possible to omit the third parameter or to use -1 instead:

<?php
    $string = 'Jeanny, Carlos,,Amadeus,   Falco';
    $result = preg_split("/,\s*/", $string, -1);
    foreach ($result as $x)
    {
        print "x: $x<br>";
    }
?>

The result will be the same as if you had not used a parameter. Five lines will be displayed as shown in the next listing:

x: Jeanny
x: Carlos
x:
x: Amadeus
x: Falco

As you can see in the third line of the result, PHP returns empty fields as well. In many cases this can be a problem because you might only want the fields that are not empty. For this purpose, preg_split accepts a flag as its fourth parameter:

<?php
    $string = 'Jeanny, Carlos,,Amadeus,   Falco';
    $result = preg_split("/,\s*/", $string, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($result as $x)
    {
        print "x: $x<br>";
    }
?>

PREG_SPLIT_NO_EMPTY makes PHP return all nonempty fields:

x: Jeanny
x: Carlos
x: Amadeus
x: Falco

As you can see, the empty field is neglected and will not be displayed in the output.
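A character class in the pattern allows several different separators to be handled in one call. The following sketch assumes a made-up input string in which names are separated by commas, semicolons, or whitespace:

```php
<?php
// Illustrative input: mixed separators and a trailing comma.
$string = 'Jeanny, Carlos;Amadeus   Falco,';

// Split on any run of whitespace, commas, or semicolons.
// PREG_SPLIT_NO_EMPTY drops the empty field after the trailing comma.
$names = preg_split('/[\s,;]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

foreach ($names as $name) {
    print "x: $name\n";
}
```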

14.1.5 preg_quote

In this chapter you already saw that special characters must be escaped in order not to be mixed up with other characters. With complex strings this can be tedious and error-prone. To get around the problem, you can use a function called preg_quote. Just pass the string to the function and the escaped result will be returned:

<?php
    $result = preg_quote("$ + $ = $");
    print $result;
?>

In this example the symbols $, +, and = have to be escaped. Therefore the result contains a lot of backslashes:

 \$ \+ \$ \= \$ 

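By default, preg_quote escapes every character that has a special meaning in regular expression syntax. The function accepts an optional second parameter naming an additional character to escape. It is intended for the pattern delimiter (most often /), which is not a metacharacter and is therefore not escaped by default. In the next example the second parameter is used to escape the a characters, purely to show the effect. Keep in mind that \a has a meaning of its own in a pattern (see Table 14.2), so this example only illustrates the mechanism: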

<?php
    $result = preg_quote("Heaven is a place on earth", "a");
    print $result;
?>

In the result, every a has been escaped using a backslash:

 He\aven is \a pl\ace on e\arth 
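A more typical use of the second parameter is to pass the delimiter when a pattern is built from text that is not known in advance. The following is a sketch only; $userInput is a hypothetical search term and the subject string is made up:

```php
<?php
// Hypothetical search term containing metacharacters and the delimiter.
$userInput = 'price (in $) / unit';

// Escape metacharacters and the "/" delimiter, then wrap in delimiters.
$quoted  = preg_quote($userInput, '/');
$pattern = '/' . $quoted . '/';

$found = preg_match($pattern, 'The price (in $) / unit is listed below');

print "$pattern\n"; // /price \(in \$\) \/ unit/
```

Without preg_quote, the parentheses would be read as a subpattern and the slash would end the pattern too early.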

14.1.6 preg_grep

To scan entire arrays for patterns, PHP provides a function called preg_grep. The result of the function is an array containing all matches in the array:

<?php
    $string[0] = "Hello Austria";
    $string[1] = "Hello Vienna";
    $string[2] = "Welcome to Paris";
    $result = preg_grep("/Hello [AV][a-z]+/", $string);
    foreach ($result as $str)
    {
        echo "$str<br>";
    }
?>

First, an array is defined. In the next step, preg_grep is called and all fields matching the regular expression are stored in $result. When you execute the script, two lines will be returned:

Hello Austria
Hello Vienna
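Two details of preg_grep are worth a quick sketch: the returned array keeps the keys of the input array, and the PREG_GREP_INVERT flag returns the elements that do not match. The array contents and variable names below are illustrative:

```php
<?php
$strings = ["Hello Austria", "Welcome to Paris", "Hello Vienna"];

// Keys of the input array are preserved: the hits have keys 0 and 2.
$hits = preg_grep('/Hello [AV][a-z]+/', $strings);

// PREG_GREP_INVERT returns the elements that do NOT match.
$misses = preg_grep('/Hello [AV][a-z]+/', $strings, PREG_GREP_INVERT);

print_r(array_keys($hits));
print_r($misses);
```

If consecutive keys are needed, the result can be passed through array_values.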


PHP and PostgreSQL. Advanced Web Programming2002