Regular Expression Basics

In some languages, such as Perl, regular expressions are part of the language core. That means, basically, that using them is no different from using a built-in operator such as or. Here is an example of how we might match a link in Perl:

 my $url = "<a href=\"http://www.designmultimedia.com/\">DesignMultimedia</a>";
 print "the link matches our stringent standards"
     if $url =~ m{<a.*href=\"(.+)\".*>.*</a>}i;

However, in PHP, regular expressions are accessed through functions. The most basic of these functions are the ereg() and eregi() functions, which do the same type of match that Perl does in the preceding code (with a different pattern matching syntax, of course). Here is the PHP equivalent of the Perl code:

 <?php
 if (eregi('<a.*href=\"(.+)\".*>.*<\/a>', $url)) {
     print "the link matches our stringent standards";
 }
 ?>
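The ereg family was deprecated in PHP 5.3 and removed in PHP 7.0, so the listing above does not run on current PHP versions. The following is a minimal sketch of the same test using preg_match() from the PCRE extension. The # delimiters and the variable name $matched are choices made for this sketch, not part of the original listing.

```php
<?php
// Same link test as above, written with preg_match() (PCRE) instead of eregi().
$url = '<a href="http://www.designmultimedia.com/">DesignMultimedia</a>';

// '#' delimits the pattern, so '/' needs no escaping; the trailing 'i' makes
// the match case-insensitive, as eregi() did.
$matched = preg_match('#<a.*href="(.+)".*>.*</a>#i', $url);

if ($matched === 1) {
    print "the link matches our stringent standards\n";
}
```

preg_match() returns 1 when the pattern matches and 0 when it does not, so the result can be used directly in an if statement just like the return value of eregi().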

So, now we know that, in PHP, regular expressions are enclosed in functions. What do we do from here? Let's go over some basic syntax and problem-solving methods used with regular expressions.

Regular expressions, or pattern matching (as I will call it), work by applying the pattern you specify against a specified string, instead of comparing the string character-by-character to another string.

Matching Strings

Let's start with some basic regular expressions. Examine the following regular expression:

 Lamb$

This matches any string that ends with the string Lamb. So, for example, it would match all the following:

 Mary had a little Lamb
 Donny stole Mary's Lamb
 Donny got arrested for stealing Mary's little Lamb

To tell the previous regular expression to match only at the end of the string, we use the $ (dollar) sign. The $ sign is a meta-character that tells PHP to match a pattern at the end of a string. Just as the dollar sign anchors a pattern to the end of a string, the ^ (circumflex) sign anchors a pattern to the beginning of a string:

 ^Donny

So, the previous expression would match all the following:

 Donny noticed that Mary had a little lamb
 Donny disliked Mary so he wanted to steal from her
 Donny got in trouble with the law for being a bad boy

The ^ character matches the beginning of a string and the $ character matches the end of the string. If we put these two meta-characters together, we get a regular expression that matches only a string consisting of exactly the specified pattern, that is, a pattern that sits at the beginning and the end of the string at the same time.

 ^Hello World$

This would match only Hello World, not Hello Beautiful World or Hello World I'm Sterling.
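The anchors can be tried out directly. The following is a minimal sketch that uses preg_match() from PHP's PCRE extension rather than ereg(), an assumption made so that the example runs on current PHP versions. The variable names are illustrative only.

```php
<?php
// Each call returns 1 on a match and 0 otherwise.

// $ anchors the pattern to the end of the string.
$endsWithLamb = preg_match('/Lamb$/', "Donny stole Mary's Lamb");

// ^ anchors the pattern to the beginning; this subject starts with "Mary".
$startsWithDonny = preg_match('/^Donny/', 'Mary had a little Lamb');

// Both anchors together: the whole string must be exactly the pattern.
$exact    = preg_match('/^Hello World$/', 'Hello World');
$notExact = preg_match('/^Hello World$/', 'Hello Beautiful World');
```

Here $endsWithLamb and $exact are 1, while $startsWithDonny and $notExact are 0.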

To search a string to see whether it contains a certain pattern anywhere, simply supply the pattern itself, with no anchors. This simple regular expression

 http://www\.designmultimedia\.com

would match the following:

 Mary went to http://www.designmultimedia.com
 <a href="http://www.designmultimedia.com">DesignMultimedia</a>
 http://www.designmultimedia.com - Creating order out of Chaos

Matching Escape Sequences

In PHP, you can also search for escape sequences (such as \n, \t, and so on). For example, the following will match two specific words separated by a line break (\n):

 Mary\nLamb

This means that the following strings match the criteria (Mary, a line break (\n), Lamb):

 I love Mary
 Lamb am I

 I hate Mary
 Lamb be not I
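A short sketch of the line-break match, again assuming preg_match() (PCRE) in place of ereg(). Note where the \n is interpreted: in the double-quoted subject PHP itself turns \n into a line break, while in the single-quoted pattern the two characters \n are passed through and the regular expression engine reads them as a line break.

```php
<?php
// Double quotes: PHP converts \n into a real line break in the subject.
$subject = "I love Mary\nLamb am I";

// Single quotes: the regex engine receives \n and treats it as a line break.
$withBreak = preg_match('/Mary\nLamb/', $subject);

// A space between the words is not a line break, so this does not match.
$withSpace = preg_match('/Mary\nLamb/', 'I love Mary Lamb');
```

$withBreak is 1 and $withSpace is 0.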

Character Classes

Pattern matching is the process of applying a pattern to a string for the purpose of finding a match. So far, we have searched only for literal characters (such as D, o, n, n, and y) or escape sequences (such as \n). However, PHP also enables you to match a series of characters or escape sequences in what are known as character classes. Take the following pattern:

 [a-z]

This will match any lowercase character from a to z, meaning that any of the following strings will be matched by the preceding pattern:

 hedk fd arpp
 Here fido
 BARKITY BaRK XYZ

The expression will match as long as the string contains at least one lowercase character from a-z (a to z). To denote a character class, make sure that you place it within brackets (that is, [a-z] not a-z).

To negate your character class, you can use the ^ meta-character. When enclosed within brackets, the circumflex tells PHP to match any character except what is in the character class. (This works only if the circumflex is the very first character inside the brackets.) Therefore, the following will match any string that contains at least one character other than a lowercase letter:

 [^a-z]

Some sample strings this pattern would match are

 FACSIMILe
 Rtdg
 9034
 3FH7*.

Finishing off this mini-section on matching character classes, here is a set of commonly used regular expressions:

 [a-zA-Z]       // Match any letter (e.g. W, w, t, Z, x)
 [a-zA-Z0-9]    // Match any letter or number (e.g. howdy, 9)
 [ \f\r\t\n\v]  // Match any whitespace character
 [0-9\.\-]      // Match any number, minus sign, or period
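The difference between a class and its negation is easy to check. This is a minimal sketch using preg_match() (PCRE) rather than ereg(), which is an assumption made for the sake of a runnable example. The variable names are illustrative.

```php
<?php
// [a-z] needs at least one lowercase letter somewhere in the subject.
$hasLower = preg_match('/[a-z]/', 'Here fido');   // 'e' is lowercase
$noLower  = preg_match('/[a-z]/', '9034');        // digits only

// [^a-z] needs at least one character that is NOT a lowercase letter.
$hasOther = preg_match('/[^a-z]/', 'FACSIMILe');  // 'F' is outside a-z
$noOther  = preg_match('/[^a-z]/', 'hedk');       // every character is a-z
```

$hasLower and $hasOther are 1; $noLower and $noOther are 0.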

Matching More Than One Occurrence of a Character

To match more than one occurrence of a character (or character class) in a string, use the {} meta-characters. For example, the following will match a string that contains three lowercase letters in a row:

 [a-z]{3}

The 3 inside the {} is what tells PHP to match a string that contains three lowercase letters in a row. If you want to match a range of character occurrences, separate the lower limit and the upper limit by a comma:

 [a-zA-Z]{1,8}

This will match any occurrence of a set of letters between one and eight characters long, so the following would be valid:

 W
 WaREzkiD
 ARMDe

If you want to match zero or more, or one or more occurrences of a character, use the * and + quantifiers. The * quantifier will match zero or more occurrences of a character and the + quantifier will match one or more occurrences of a character. Consider the following:

 [a-z]*

This will match zero or more occurrences of lowercase letters in a string, so the following would match:

 ArfDoggy
 67494304         // Remember 0 or more occurrences
 Lepidus
 WAR AND PEACE    // Remember 0 or more occurrences
 The Great Santini
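To see what each quantifier actually consumes, it helps to look at the matched text rather than just the yes/no result. The sketch below assumes preg_match() (PCRE) instead of ereg(); its third argument receives the matched text in element 0. The subject 'WaREzkiD!!' and the variable names are made up for this illustration.

```php
<?php
// {3}: the first place where three lowercase letters occur in a row.
preg_match('/[a-z]{3}/', 'ArfDoggy', $three);

// {1,8}: between one and eight letters; the engine takes as many as it can.
preg_match('/[a-zA-Z]{1,8}/', 'WaREzkiD!!', $word);

// *: zero occurrences is still a match, so the result is an empty string.
$matched = preg_match('/[a-z]*/', '67494304', $empty);
```

$three[0] is 'ogg' (the run 'rf' is only two letters long), $word[0] is 'WaREzkiD', and $matched is 1 even though $empty[0] is the empty string.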

Now you might be wondering whether there is a practical use for a regular expression that matches zero or more. Let us revisit the first example I gave you of a regular expression:

 <a.*href=\"(.+)\".*>.*<\/a>

The previous regular expression matches '<a' followed by zero or more occurrences of anything (the '.' means any character). Then it matches 'href="' plus one or more occurrences of anything (.+). Following that is a closing '"', and then it matches zero or more occurrences of anything before a > sign. It then matches another set of zero or more occurrences of anything before meeting a closing '</a>' tag. The * quantifier is useful if you're not sure whether anything is going to be there, but you want to account for it just in case.
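The walk-through above can be checked by capturing what (.+) matched. This is a sketch under the assumption that preg_match() (PCRE) stands in for eregi(); the second input string is hypothetical and is included only to show that .+ takes as much text as it can.

```php
<?php
$pattern = '#<a.*href="(.+)".*>.*</a>#i';

// One quoted attribute: the parenthesized (.+) captures the URL.
preg_match($pattern,
    '<a href="http://www.designmultimedia.com/">DesignMultimedia</a>',
    $simple);

// Two quoted attributes (hypothetical input): .+ runs on to the last
// quotation mark in the tag, so more than the URL is captured.
preg_match($pattern, '<a href="x" title="y">T</a>', $greedy);
```

$simple[1] holds http://www.designmultimedia.com/, while $greedy[1] holds x" title="y. A pattern built from .* and .+ is therefore a loose check that something link-like is present, not a precise extractor.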

Optional Matches Using ?

As shown previously, the * quantifier matches zero or more occurrences of an item. It basically makes that item optional to the regular expression. But what if you want to get more specific; for instance, what if you want to have an optional three-letter string? You would use the ? specifier, which basically says, "It is optional that this pattern exists." Because a quantifier placed directly after another quantifier is not portable between regular expression engines, group the item in parentheses so that the ? applies to the whole three-letter string. Here is an example to help illuminate this point:

 ([a-z]{3})?

This example would optionally match three occurrences of any lowercase character; therefore, each of the following would be valid:

 34d3des
 HDED
 &(#LDJIIED
 MR. KELLY
 HI

In fact, pretty much anything at all would be valid because the match is optional. Why not just use .* to achieve the same effect? Read on.
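One way to see what the optional item buys you is to place it inside a larger, anchored pattern. The pattern below is hypothetical and is written for preg_match() (PCRE) rather than ereg(); the parentheses group the three letters so that ? applies to the whole item.

```php
<?php
// Hypothetical pattern: digits, optionally followed by exactly three
// lowercase letters, and nothing else.
$pattern = '/^[0-9]+([a-z]{3})?$/';

$plain   = preg_match($pattern, '34');     // optional part absent
$suffix  = preg_match($pattern, '34des');  // optional part present
$partial = preg_match($pattern, '34d');    // one letter is neither zero nor three
```

$plain and $suffix are 1, and $partial is 0. A trailing .* would have accepted all three subjects, which is the practical difference between "optional" and "anything".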

PHP enables you not only to match certain types of text, it also enables you to save your matches in an array. To do this, you must enclose what you want to be saved in parentheses, like so:

 ([a-z]{4})?[0-9]

If PHP found a four-character string of lowercase letters followed by a digit, the four letters would be saved in either an array that you specify or the \\1, \\2, \\3, ... variables. (These are accessible only in the replacement strings of ereg_replace(), eregi_replace(), and preg_replace().) Examples of this will be shown in most of the recipes, so I am leaving out an explanation here.
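As a brief preview, and only as a sketch, here is how both forms look with the PCRE functions preg_match() and preg_replace(). The subject 'abcd7' and the second pattern are made up for this illustration.

```php
<?php
// Saving matches in an array: element 0 is the whole match,
// element 1 is the text matched by the first pair of parentheses.
preg_match('/([a-z]{4})?[0-9]/', 'abcd7', $match);

// Using saved matches in a replacement: \\1 is the letters and \\2 is
// the digit, so this swaps their order.
$swapped = preg_replace('/([a-z]{4})([0-9])/', '\\2\\1', 'abcd7');
```

$match[0] is 'abcd7', $match[1] is 'abcd', and $swapped is '7abcd'.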

The "or" of Regular Expressions

In PHP (and Perl), the pipe delimiter ( | ) enables you to say "This or that or this or that" (note that the keyword here is or). Here is an example to illustrate that concept:

 (Bread|Milk|Eggs)

This example means that PHP will match either Bread or Milk or Eggs. Another place this is useful is in matching single characters, like so:

 (a|b|c)

This would match either a, b, or c. Please note that matching single characters with an or statement is the same as

 [abc]

except that using character classes whenever you can is a better programming practice because they generally execute faster than the cases separated by |.
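The equivalence between single-character alternation and a character class can be checked with a few subjects. This sketch assumes preg_match() (PCRE) in place of ereg(); the subject strings are arbitrary.

```php
<?php
// Alternation between whole words; the parentheses also record which
// alternative matched.
preg_match('/(Bread|Milk|Eggs)/', 'Buy Milk today', $item);

// For single characters, alternation and a character class accept
// exactly the same subjects.
$sameResult = true;
foreach (['cab', 'xyz', 'b'] as $subject) {
    if (preg_match('/(a|b|c)/', $subject) !== preg_match('/[abc]/', $subject)) {
        $sameResult = false;
    }
}
```

$item[1] is 'Milk', and $sameResult stays true for all three subjects.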

Well, that's it! You have now been sufficiently introduced to the concept of regular expressions, and I have just finished the first part of the introduction. Knowing the syntax of regular expressions is half the battle, but you also must have a method for problem solving to create regular expressions.

To start explaining this methodology, let us examine a sample scenario in the next section.

Mini-Problem

You want to write a regular expression that extracts information out of the <meta> tags (specifically the description and keyword tags).

Mini-Technique

First, let's write down two sample meta tags that we will match so we can see them in front of us as we build the regular expression. (This can more easily be done with the get_meta_tags() function, but the purpose of this exercise is to go through writing a regular expression.)

 <meta name="description" content="This is a description of site xyz">
 <meta name="keywords" content="keyword1, keyword2, keyword3">

Now, let's look at the first thing we want to match in each tag. In this case, it happens to be the content between two quotation marks after an equal sign. We know that the tag has to start with <meta, so we can write the first part of our regular expression:

 <meta.*name=[\'"]?(description|keywords)[\'"]?.*

The next item we want to match is the stuff after the = sign that comes after content. Our only concern is that it is something other than a single or double quotation mark, so here is the second and final part of the regular expression:

 <meta.*name=[\'"]?(description|keywords)[\'"]?.*content=[\'"]?([^\'"]*)[\'"]?.*>

Now that we have our regular expression, we need to find a function to put it in so that the regular expression can actually do its work. Because we want a case-insensitive match, we will choose the eregi() function:

 <?php
 eregi('<meta.*name=[\'"]?(description|keywords)[\'"]?.*content=[\'"]?([^\'"]*)[\'"]?.*>',
       $line, $match);
 ?>

Subsequently, $match[0] will contain all the matching text, $match[1] will contain the name alone, and $match[2] will contain the content alone.
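For readers on PHP 7 or later, where eregi() is no longer available, the same extraction can be sketched with preg_match(). The ~ delimiters, the i flag for case-insensitivity, and the $lines and $meta variables are assumptions of this sketch and not part of the original listing.

```php
<?php
// The two sample tags from above, one per line.
$lines = [
    '<meta name="description" content="This is a description of site xyz">',
    '<meta name="keywords" content="keyword1, keyword2, keyword3">',
];

// Same regular expression as above, wrapped in ~ delimiters with the i flag.
$pattern = '~<meta.*name=[\'"]?(description|keywords)[\'"]?'
         . '.*content=[\'"]?([^\'"]*)[\'"]?.*>~i';

// Collect name => content pairs from every line that matches.
$meta = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $match)) {
        $meta[strtolower($match[1])] = $match[2];
    }
}
```

After the loop, $meta['description'] holds "This is a description of site xyz" and $meta['keywords'] holds "keyword1, keyword2, keyword3".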

So, let's recap. When writing a regular expression, remember to

Assess the problem.
Write an example of the text you want to match.
Divide the problem into separate parts and solve those parts.
Put the problem back together for your final regular expression.
Find the right function for your needs.
Test and retest.

That's it. This is probably the longest introduction in the book, but I feel it was worth the space because regular expressions can be the most confusing and, at the same time, most powerful part of the language. The rest of this chapter is devoted to showing you different ways to solve common problems and some other neat features not covered in this introduction (for example, all the regular expression functions or look-ahead assertions). For a more advanced look at regular expressions, check out Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly.