Section 10.7. Extended Examples | Mastering Regular Expressions

10.7. Extended Examples

Here are two more examples to close out the chapter.

10.7.1. CSV Parsing with PHP

Here's the PHP version of the CSV (comma-separated values) example from Chapter 6 (☞ 271). The regex has been updated to use possessive quantifiers (☞ 142) instead of atomic parens, for their cleaner presentation.

First, here is the regex we'll use:

    $csv_regex = '{
        \G(?:^|,)
        (?:
           #  Either a double-quoted field ...
           " #  field opening quote
            (  [^"]*+ (?: "" [^"]*+ )*+ )
           " #  field closing quote
         | #  ... or ...
           #  ... some non-quote/non-comma text ...
           ( [^",]*+ )
        )
    }x';

And then, we use it to parse a $line of CSV text:

    /*  Apply the regex, filling $all_matches with all kinds of data  */
    preg_match_all($csv_regex, $line, $all_matches);

    /*  $Result will hold the array of fields we'll glean from $all_matches  */
    $Result = array();

    /*  Run through each successful match ...  */
    for ($i = 0; $i < count($all_matches[0]); $i++)
    {
        /*  If the 2nd set of capturing parentheses captured, use that directly  */
        if (strlen($all_matches[2][$i]) > 0)
            array_push($Result, $all_matches[2][$i]);
        else
        {
            /*  It was a quoted value, so take care of an embedded
                double double-quote before using  */
            array_push($Result, preg_replace('/""/', '"', $all_matches[1][$i]));
        }
    }

    /*  The array $Result is now populated and available for use  */
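As a quick check of the code above, here is a minimal sketch that wraps it in a function and runs it on one line of CSV. The function name parse_csv_line and the sample line are assumptions made for this illustration; they are not part of the original example.

```php
<?php
// Hypothetical wrapper around the snippet above (the name is an assumption).
function parse_csv_line(string $line): array
{
    $csv_regex = '{
        \G(?:^|,)
        (?:
            " ( [^"]*+ (?: "" [^"]*+ )*+ ) "   # double-quoted field
          |
            ( [^",]*+ )                        # non-quote/non-comma text
        )
    }x';

    preg_match_all($csv_regex, $line, $all_matches);

    $result = [];
    for ($i = 0; $i < count($all_matches[0]); $i++) {
        if (strlen($all_matches[2][$i]) > 0) {
            // The unquoted alternative captured something: use it directly.
            $result[] = $all_matches[2][$i];
        } else {
            // Quoted (or empty) field: collapse each doubled quote to one.
            $result[] = preg_replace('/""/', '"', $all_matches[1][$i]);
        }
    }
    return $result;
}

// Invented sample line with quoted, unquoted, and empty fields.
print_r(parse_csv_line('Ten Thousand,10000,,"10,000","It\'s ""10 Grand"", baby",10K'));
```

The sample line should come back as six fields: Ten Thousand, 10000, an empty string, 10,000, It's "10 Grand", baby, and 10K.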

10.7.2. Checking Tagged Data for Proper Nesting

Here's a somewhat complex undertaking that covers many interesting points: checking that XML (or XHTML, or any similar tagged data) contains no orphan or mismatched tags. The approach I'll take is to look for properly matched tags, nontag text, and self-closing tags (e.g., <br/>, an "empty-element tag" in XML parlance), and hope that I find the entire string.

Here's the full regex:

    ^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*/>)*+)$

A string that matches this has no mismatched tags (with a few caveats we'll look at a bit later).

This may appear to be quite complex, but it's manageable when broken down into its component parts. The expression's outer ^(…)$ wraps the main body of the regex to ensure that the entire subject string is matched before success is returned. That main body is also wrapped with an additional set of capturing parentheses, which, as we'll soon see, allows a later recursive reference to "the main body."

10.7.2.1. The main body of this expression

The main body of the regex, then, is three alternatives wrapped in (?:…)*+ to allow any mix of them to match. The three alternatives attempt to match, respectively: matched tags, non-tag text, and self-closing tags.

Because what each alternative can match is unique to that alternative (that is, where one alternative has matched, neither of the others may match), I know that later backtracking will never allow another alternative to match the same text. I can take advantage of that knowledge to make the process more efficient by using a possessive *+ on the "allow any mix to match" parentheses. This tells the engine not to bother trying to backtrack, thereby hastening a result when a match can't be found.

For the same reason, the three alternatives may be placed in any order, so I put first the alternatives I felt were most likely to match most often (☞ 260).

Now let's look at the alternatives one at a time...

The second alternative: non-tag text

I'll start with the middle alternative, because it's the simplest: [^<>]++. This alternative matches non-tag spans of text. The use of the possessive quantifier here may be overkill considering that the wrapping (?:…)*+ is also possessive, but to be safe, I like to use a possessive quantifier when I know it can't hurt. (A possessive quantifier is often used for its efficiency, but it can also change the semantics of a match. The change can be useful, but make sure you understand its ramifications ☞ 259.)

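The third alternative: self-closing tags

The third alternative, <\w[^>]*/>, matches self-closing tags such as <br/> and <img …/> (self-closing tags are characterized by the '/' immediately before the closing bracket). Unlike in the other alternatives, the quantifier here must not be possessive: [^>] also matches '/', so a possessive [^>]*+ would consume the final slash and never give it back, and the /> that follows could never match. The ordinary greedy [^>]* backtracks one character to release the slash.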

The first alternative: a matched set of tags

Finally, let's look at the first alternative: <(\w++)[^>]*+(?<!/)>(?1)</\2>

The first part of this subexpression, <(\w++)[^>]*+(?<!/)>, matches an opening tag, with its (\w++) capturing the tag name within what turns out to be the overall regex's second set of capturing parentheses. (The use of a possessive quantifier in \w++ is an important point that we'll look at in a bit.)

(?<!/) is negative lookbehind (☞ 133) ensuring that we haven't just matched a slash. We put it right before the > at the end of the match-an-opening-tag section to be sure that we're not matching a self-closing tag such as <hr/>. (As we've seen, self-closing tags are handled by the third alternative.)

After the opening tag has been matched, (?1) recursively applies the subexpression within the first set of capturing parentheses. That's the aforementioned "main body," which is, in effect, a span of text with no unbalanced tags. Once that's matched, we'd better find ourselves at the tag that closes the opening tag we found in the first part of this alternative (whose name was captured within the second overall set of capturing parentheses). The leading </ of </\2> ensures that it's a closing tag; the backreference \2 ensures that it's the correct closing tag.

(If you're checking HTML or other data where tag names are case insensitive, be sure to prepend (?i) to the regex, or apply it with the i pattern modifier.)
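To see the three alternatives working together, here is a minimal sketch that applies the regex to a few strings. The sample inputs are invented for illustration, and the self-closing alternative is written with a plain greedy [^>]* so that it can give back the final '/'.

```php
<?php
// The three-alternative regex, laid out with the x pattern modifier.
$nesting_regex = '{
    ^(
      (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>   # matched pair of tags
        | [^<>]++                             # non-tag text
        | <\w[^>]*/>                          # self-closing tag
      )*+
    )$
}x';

// Hypothetical sample inputs, chosen for illustration only.
$samples = [
    '<p>some <b>bold</b> text<br/></p>',   // properly nested
    '<p>some <b>bold</p></b>',             // closing tags in the wrong order
    '<ul><li>item</ul>',                   // <li> is never closed
];

foreach ($samples as $s) {
    $verdict = preg_match($nesting_regex, $s) ? 'balanced' : 'NOT balanced';
    printf("%-40s %s\n", $s, $verdict);
}
```

Only the first sample should be reported as balanced. In the other two, the </\2> that follows the recursive (?1) finds a closing tag whose name differs from the one captured by (\w++), so the overall match fails.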

Phew!

10.7.2.2. Possessive quantifiers

I'd like to comment on the use of a possessive \w++ in the first alternative, <(\w++)[^>]*+(?<!/)>. If I were using a less-expressive regex flavor that didn't have possessives or atomic grouping (☞ 139), I'd write this alternative with \b after the (\w+) that matches the tag name: <(\w+)\b[^>]*(?<!/)>.

The \b is important to stop the (\w+) from matching, for example, the first 'li' of a '<link>…</li>' sequence. This would leave the 'nk' to be matched outside the capturing parentheses, resulting in a truncated tag name for the \2 backreference that follows.

None of this would normally happen, because the \w+ is greedy and wants to match the entire tag name. However, if this regex were applied to badly nested text for which it should fail, then backtracking in search of a match could force the \w+ to match less than the full tag name, as it certainly would in the '<link>…</li>' example. The \b prevents this.

PHP's powerful preg engine, thankfully, does support possessive quantifiers, and using one in (\w++) has the same "don't allow backtracking to split up the tag name" effect that appending \b provides, but it is more efficient.
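The effect is easy to observe. The sketch below builds the regex around three versions of the opening-tag subexpression and applies each to a badly nested '<link>…</li>' string. The helper name nesting_regex and the sample string are assumptions made for this illustration.

```php
<?php
// Hypothetical helper: builds the three-alternative regex around the given
// opening-tag subexpression, so that only the tag-name part varies.
function nesting_regex(string $open_tag): string
{
    return '{^((?:' . $open_tag . '(?1)</\2>|[^<>]++|<\w[^>]*/>)*+)$}';
}

$variants = [
    'plain greedy \w+'   => '<(\w+)[^>]*(?<!/)>',    // tag name can be split
    '\w+ followed by \b' => '<(\w+)\b[^>]*(?<!/)>',  // \b keeps the name whole
    'possessive \w++'    => '<(\w++)[^>]*+(?<!/)>',  // no backtracking into name
];

$bad = '<link>text</li>';   // badly nested: the regex should fail on this

foreach ($variants as $label => $open_tag) {
    $matched = preg_match(nesting_regex($open_tag), $bad);
    printf("%-22s %s\n", $label, $matched ? 'matches (wrongly)' : 'fails (correctly)');
}
```

With the plain greedy version, backtracking lets (\w+) settle for 'li' while [^>]* absorbs the 'nk', so the bad string matches. The \b and possessive versions both fail, as they should.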

10.7.2.3. Real-world XML

The format of real-world XML is more complex than simply matching tags. We must also consider XML comments, CDATA sections, and processing instructions, among others.

Adding support for XML comments is as easy as adding a fourth alternative, <!--.*?-->, and making sure to use (?s) or the s pattern modifier so that its dot can match a newline.

Similarly, CDATA sections, which are of the form <![CDATA[…]]>, can be handled with a new <!\[CDATA\[.*?]]> alternative, and XML processing instructions such as '<?xml version="1.0"?>' can be handled by adding <\?.*?\?> as an alternative.

Entity declarations are of the form <!ENTITY…> and can be handled with <!ENTITY\b.*?>. There are a number of similar structures in XML, and for the most part they can all be handled as a group by changing <!ENTITY\b.*?> to <![A-Z].*?>.

A few issues remain, but what we have discussed so far should cover most XML. Here it is all put together into a PHP snippet:

    $xml_regex = '{
        ^(
          (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>  #  matched pair of tags
            | [^<>]++                            #  non-tag stuff
            | <\w[^>]*/>                         #  self-closing tag
            | <!--.*?-->                         #  comment
            | <!\[CDATA\[.*?]]>                  #  cdata block
            | <\?.*?\?>                          #  processing instruction
            | <![A-Z].*?>                        #  Entity declaration, etc.
          )*+
        )$
    }sx';

    if (preg_match($xml_regex, $xml_string))
        echo "block structure seems valid\n";
    else
        echo "block structure seems invalid\n";
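As a rough check of the added alternatives, the following sketch applies the regex to a small document that contains a processing instruction, a declaration, a comment, a CDATA section, and a self-closing tag. Both sample documents are invented for illustration. The regex is repeated so that the example is self-contained, with the self-closing alternative written using a plain greedy [^>]* so that it can give back the final '/'.

```php
<?php
$xml_regex = '{
    ^(
      (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>  # matched pair of tags
        | [^<>]++                            # non-tag stuff
        | <\w[^>]*/>                         # self-closing tag
        | <!--.*?-->                         # comment
        | <!\[CDATA\[.*?]]>                  # cdata block
        | <\?.*?\?>                          # processing instruction
        | <![A-Z].*?>                        # entity declaration, etc.
      )*+
    )$
}sx';

// Hypothetical well-nested document. The comment and the CDATA section
// contain raw "<" and ">" that the tag alternatives must not see.
$good = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE note>
<note>
  <!-- a comment mentioning <b> and </i> -->
  <body><![CDATA[ 1 < 2 and 3 > 2 ]]></body>
  <br/>
</note>
XML;

// Hypothetical badly nested document: closing tags in the wrong order.
$bad = '<note><body>text</note></body>';

foreach (['good' => $good, 'bad' => $bad] as $label => $xml_string) {
    if (preg_match($xml_regex, $xml_string))
        echo "$label: block structure seems valid\n";
    else
        echo "$label: block structure seems invalid\n";
}
```

The first document should be reported as valid and the second as invalid.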

10.7.2.4. HTML?

More often than not, real-world HTML has all kinds of issues that make a check like this impractical, such as orphan and mismatched tags, and invalid raw '<' and '>' characters. However, even properly balanced HTML has some special cases that we need to allow for: comments and <script> tags.

HTML comments are handled in the same way as XML comments: <!--.*?--> with the s pattern modifier.

A <script> section is important because it may have raw '<' and '>' within it, so we want to simply allow anything from the opening <script…> to the closing </script>. We can handle this with <script\b[^>]*>.*?</script>. It's interesting that script sequences that don't contain the forbidden raw '<' and '>' characters are caught by the first alternative, because they conform to the "matched set of tags" pattern. If a <script> does contain such raw characters, the first alternative fails, leaving the sequence to be matched by this alternative.

Here's the HTML version of our PHP snippet:

    $html_regex = '{
        ^(
          (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>  #  matched pair of tags
            | [^<>]++                            #  non-tag stuff
            | <\w[^>]*/>                         #  self-closing tag
            | <!--.*?-->                         #  comment
            | <script\b[^>]*>.*?</script>        #  script block
          )*+
        )$
    }isx';

    if (preg_match($html_regex, $html_string))
        echo "block structure seems valid\n";
    else
        echo "block structure seems invalid\n";
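The script fallback and the case-insensitive backreference can be seen in a short sketch. The sample strings are invented for illustration. The regex is repeated so that the example is self-contained, with the self-closing alternative written using a plain greedy [^>]* so that it can give back the final '/'.

```php
<?php
$html_regex = '{
    ^(
      (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>  # matched pair of tags
        | [^<>]++                            # non-tag stuff
        | <\w[^>]*/>                         # self-closing tag
        | <!--.*?-->                         # comment
        | <script\b[^>]*>.*?</script>        # script block
      )*+
    )$
}isx';

// Hypothetical inputs, for illustration only.
$samples = [
    // Mixed-case tag names, plus a script containing raw "<" and ">".
    '<div><P>Hi</p><script>if (a < b && c > d) go();</script><br/></div>',
    // Closing tags in the wrong order.
    '<div><p>Hi</div></p>',
];

foreach ($samples as $html_string) {
    if (preg_match($html_regex, $html_string))
        echo "block structure seems valid\n";
    else
        echo "block structure seems invalid\n";
}
```

The first sample should be reported as valid: the i modifier lets </p> close <P>, and the script block falls through to the last alternative once the first one fails on the raw '<'. The second sample should be reported as invalid.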