Recipe 22.0. Introduction


Regular expressions are an intricate and powerful tool for matching patterns and manipulating text. While not as fast as plain vanilla string matching, regular expressions are extremely flexible. They allow you to construct patterns to match almost any conceivable combination of characters with a simple (albeit terse and punctuation-studded) grammar. If your web site relies on data feeds that come in text files (data feeds like sports scores, news articles, or frequently updated headlines), regular expressions can help you make sense of those data feeds.

This chapter gives a brief overview of basic regular expression syntax and then focuses on the functions that PHP provides for working with regular expressions. For more detailed information about the ins and outs of regular expressions, check out the PCRE section of the PHP online manual (http://www.php.net/pcre) and Appendix B of Learning PHP 5 by David Sklar (O'Reilly). To start on the path to regular expression wizardry, read the comprehensive Mastering Regular Expressions by Jeffrey E.F. Friedl (O'Reilly).

Regular expressions are handy when transforming plain text into HTML and vice versa. Luckily, because these are such common tasks, PHP has many built-in functions to handle them. Recipe 9.10 tells how to escape HTML entities; Recipe 13.14 covers stripping HTML tags; and Recipes 13.12 and 13.13 show how to convert plain text to HTML and HTML to plain text, respectively. For information on matching and validating email addresses, see Recipe 9.4.

Over the years, the functionality of regular expressions has grown from its basic roots to incorporate increasingly useful features. As a result, PHP offers two different sets of regular expression functions. The first set includes the traditional (or POSIX) functions, whose names each begin with ereg (for "extended" regular expressions; the ereg functions themselves are already an extension of the original feature set). The other set includes the Perl-compatible family of functions, whose names begin with preg (for Perl-compatible regular expressions).

The preg functions use a library that mimics the regular expression functionality of the Perl programming language. This is a good thing because Perl allows you to do a variety of handy things with regular expressions, including nongreedy matching, forward and backward assertions, and even recursive patterns.

In general, there's no longer any reason to use the ereg functions. They offer fewer features, and they're slower than preg functions. However, the ereg functions existed in PHP for many years prior to the introduction of the preg functions, so many programmers still use them because of legacy code or out of habit. Thankfully, the prototypes for the two sets of functions are identical, so it's easy to switch back and forth from one to another without too much confusion. (We list how to do this while avoiding the major gotchas in Recipe 22.1.)

Think of a regular expression as a program in a very restrictive programming language. The only task of a regular expression program is to match a pattern in text. In regular expression patterns, most characters just match themselves. That is, the regular expression rhino matches strings that contain the five-character sequence rhino. The fancy business in regular expressions is due to a handful of punctuation and symbols called metacharacters. These symbols don't literally match themselves, but instead give commands to the regular expression matcher.

The most frequently used metacharacters include the period (.), asterisk (*), plus sign (+), and question mark (?). (To match a literal metacharacter in a pattern, precede the character with a backslash.)

  • The period means "match any character," so the pattern .at matches bat, cat, and even rat.

  • The asterisk means "match 0 or more of the preceding object." (So far, the only objects we know about are characters.)

  • The plus is similar to asterisk, but means "match one or more of the preceding object." So .+at matches brat, sprat, and even the cat inside of catastrophe, but not plain at. To match at, replace the + with a *.

  • The question mark means "the preceding object is optional." That is, it matches 0 or 1 of the object that precedes it. colou?r matches both color and colour.
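The four metacharacters in the list above, plus the backslash escape, can be tried out with a short script. This is only a sketch: the sample words and the $tests and $results names are made up for illustration, and each pattern is wrapped in / delimiters as the preg functions require.

```php
<?php
// Hypothetical pattern/subject pairs; the patterns follow the list above.
$tests = array(
    array('/.at/',     'cat'),     // period: any one character, then "at"
    array('/.+at/',    'brat'),    // plus: one or more characters before "at"
    array('/.+at/',    'at'),      // no match: + needs at least one character
    array('/.*at/',    'at'),      // asterisk: zero characters before "at" is fine
    array('/colou?r/', 'color'),   // question mark: the "u" is optional
    array('/colou?r/', 'colour'),
    array('/3\.14/',   '3.14'),    // backslash makes the period literal
    array('/3\.14/',   '3514'),    // no match: "5" is not a literal period
);

$results = array();
foreach ($tests as $test) {
    list($pattern, $subject) = $test;
    // preg_match() returns 1 when the pattern matches and 0 when it doesn't
    $results[] = preg_match($pattern, $subject);
    print "$pattern against $subject: " . end($results) . "\n";
}
```

Only the third and last pairs report 0; every other pair reports 1.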

To apply * and + to objects greater than one character, place the sequence of characters that make up the object inside parentheses. Parentheses allow you to group characters for more complicated matching and also capture the part of the pattern that falls inside them. A captured sequence can be referenced by preg_replace( ) to alter a string, and all captured matches can be stored in an array that's passed as a third parameter to preg_match( ) and preg_match_all( ). The preg_match_all( ) function is similar to preg_match( ), but it finds all possible matches inside a string, instead of stopping at the first match. Example 22-1 shows a few examples of preg_match( ), preg_match_all( ), and preg_replace( ) at work.

Example 22-1. Using preg functions

<?php
if (preg_match('{<title>.+</title>}', $html)) {
    // page has a title
}
if (preg_match_all('/<li>/', $html, $matches)) {
    print 'Page has ' . count($matches[0]) . " list items\n";
}
// turn bold into italic
$italics = preg_replace('/(<\/?)b(>)/', '$1i$2', $bold);
?>
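To see what capturing parentheses actually store, it can help to print the match arrays. The following sketch assumes some made-up input (the $html value, the sentence about the cat, and the variable names $letters and $swapped are all hypothetical) and sticks to the metacharacters covered so far.

```php
<?php
// Hypothetical input; any HTML with a single <title> element would do.
$html = '<html><head><title>Rhino Facts</title></head><body></body></html>';

// $matches[0] holds the whole match; $matches[1] holds the text
// captured by the first set of parentheses.
if (preg_match('{<title>(.+)</title>}', $html, $matches)) {
    print "Title: {$matches[1]}\n";
}

// preg_match_all() keeps going after the first match. $letters[0] lists
// the full matches and $letters[1] lists what each (.) captured.
$count = preg_match_all('/(.)at/', 'The cat sat on a mat', $letters);
print "$count matches: " . implode(', ', $letters[0]) . "\n";
print 'Captured letters: ' . implode(', ', $letters[1]) . "\n";

// In a preg_replace() replacement string, $1 and $2 refer to the
// first and second captured groups.
$swapped = preg_replace('/(.+) (.+)/', '$2 $1', 'Rhino Facts');
print "$swapped\n";
```

Because the replacement string is in single quotes, PHP leaves $1 and $2 alone and passes them through to preg_replace( ).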

If you want to match strings with a specific set of characters, create a character class by putting the characters you want inside square brackets. The character class [aeiou] matches any one of the characters a, e, i, o, and u. You can also put ranges inside of square brackets to form a character class. The class [a-z] matches all lowercase English letters. The class [a-zA-Z0-9] matches digits and English letters. The class [a-zA-Z0-9_] matches digits, English letters, and the underscore.
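A quick way to get a feel for these classes is to run them against a few strings. In this sketch the sample strings and the variable names ($vowels, $lower, $word) are invented for illustration.

```php
<?php
// [aeiou] matches one vowel at a time, so preg_match_all() counts them.
$vowelCount = preg_match_all('/[aeiou]/', 'rhinoceros', $vowels);
print "Vowels found: $vowelCount\n";

// [a-z]+ matches a run of lowercase letters; the run ends at the digit.
preg_match('/[a-z]+/', 'c3p0', $lower);
print "First lowercase run: {$lower[0]}\n";

// Adding digits, uppercase letters, and the underscore lets the match
// run until the exclamation point.
preg_match('/[a-zA-Z0-9_]+/', 'user_name42!', $word);
print "Identifier: {$word[0]}\n";
```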

So far, all the patterns we've seen match anything that contains text that corresponds to the pattern. That is, [a-z0-9]+ matches grapefruit and c3p0, but it also matches grr!!! and *******p. All four of those strings meet the condition that [a-z0-9]+ sets out: "one or more of a digit or lowercase English letter."

Anchoring your pattern enables matching against strings that only contain characters that the pattern describes. The caret (^) and the dollar sign ($) anchor the pattern at the beginning and the end of the string, respectively. Without them, a match can occur anywhere in the string. So while [a-z0-9]+ means "one or more of a digit or lowercase English letter," ^[a-z0-9]+ means "begins with one or more of a digit or lowercase English letter," [a-z0-9]+$ means "ends with one or more of a digit or lowercase English letter," and ^[a-z0-9]+$ means "contains only one or more of a digit or lowercase English letter." Example 22-2 shows a few character classes at work.

Example 22-2. Matching with character classes and anchors

<?php
$thisFileContents = file_get_contents(__FILE__);
// http://php.net/language.variables gives a regular expression for
// valid variable names in php. Beginning the pattern with \$ matches
// a literal $
$matchCount = preg_match_all('/\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/',
                             $thisFileContents, $matches);
print "Matches: $matchCount\n";
foreach ($matches[0] as $variableName) {
    print "$variableName\n";
}
?>

Example 22-2 prints each variable name it uses:

Matches: 8
$thisFileContents
$matchCount
$thisFileContents
$matches
$matchCount
$matches
$variableName
$variableName
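The effect of the anchors on their own is easiest to see by running the four variants of [a-z0-9]+ against the same strings. This sketch reuses the sample strings grapefruit, c3p0, grr!!!, and *******p from the text; the labels and the $patterns and $results arrays are hypothetical names chosen for the illustration.

```php
<?php
$subjects = array('grapefruit', 'c3p0', 'grr!!!', '*******p');
$patterns = array(
    'anywhere'  => '/[a-z0-9]+/',    // unanchored: match may occur anywhere
    'beginning' => '/^[a-z0-9]+/',   // must start at the beginning
    'end'       => '/[a-z0-9]+$/',   // must run to the end
    'whole'     => '/^[a-z0-9]+$/',  // must cover the entire string
);

$results = array();
foreach ($patterns as $label => $pattern) {
    foreach ($subjects as $subject) {
        // 1 means the pattern matched, 0 means it did not
        $results[$label][$subject] = preg_match($pattern, $subject);
    }
    print $label . ': ' . implode(' ', $results[$label]) . "\n";
}
```

All four strings match the unanchored pattern, but only grapefruit and c3p0 survive the fully anchored one. One caveat: by default, $ in the preg functions also matches just before a newline that ends the string; the D pattern modifier turns that off.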

If it's easier to define what you're looking for by its complement, use that. To make a character class match the complement of what's inside it, begin the class with a caret. A caret outside a character class anchors a pattern at the beginning of a string; a caret inside a character class means "match everything except what's listed in the square brackets." For example, the character class [^aeiou] matches everything but lowercase English vowels.

Note that the opposite of [aeiou] isn't [bcdfghjklmnpqrstvwxyz]. The character class [^aeiou] also matches uppercase vowels such as AEIOU, numbers such as 123, URLs such as http://www.cnpq.br/, and even emoticons such as :).
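Since a character class matches one character at a time, one way to check this claim is to count how many characters of each string [^aeiou] matches. The sketch below uses the samples from the text plus aeiou as a counterexample; the $samples and $counts names are made up for illustration.

```php
<?php
$samples = array('AEIOU', '123', 'http://www.cnpq.br/', ':)', 'aeiou');
$counts = array();
foreach ($samples as $sample) {
    // preg_match_all() returns the number of matches, which here is the
    // number of characters that are not lowercase vowels
    $matched = preg_match_all('/[^aeiou]/', $sample, $m);
    $counts[] = $matched;
    print "$sample: $matched of " . strlen($sample) . " characters match\n";
}
```

Every character of the first four samples matches; none of the characters in aeiou does.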

The vertical bar (|), also known as the pipe, specifies alternatives. Example 22-3 uses the pipe to find various possibilities for image filenames in a block of text.

Example 22-3. Matching with |

<?php
$text = "The files are cuddly.gif, report.pdf, and cute.jpg.";
if (preg_match_all('/[a-zA-Z0-9]+\.(gif|jpe?g)/',$text,$matches)) {
    print "The image files are: " . implode(',',$matches[0]);
}
?>

Example 22-3 prints:

The image files are: cuddly.gif,cute.jpg 

We've covered just a small subset of the world of regular expressions. We provide some additional details in later recipes, but the PHP web site also has some very useful information on Perl-compatible regular expressions at http://www.php.net/pcre. The links from this last page to "Pattern Modifiers" and "Pattern Syntax" are especially detailed and informative.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
