10.7. Extended Examples

Here are two more examples to close out the chapter.

10.7.1. CSV Parsing with PHP

Here's the PHP version of the CSV (comma-separated values) example from Chapter 6 (☞ 271). The regex has been updated to use possessive quantifiers (☞ 142) instead of atomic parens, for their cleaner presentation.

First, here is the regex we'll use:

    $csv_regex = '{
        \G(?:^|,)
        (?:
           # Either a double-quoted field ...
           "                                # field opening quote
            ( [^"]*+ (?: "" [^"]*+ )*+ )
           "                                # field closing quote
         # ... or ...
         |
           # ... some non-quote/non-comma text ...
           ( [^",]*+ )
        )
    }x';

And then, we use it to parse a $line of CSV text:

    /* Apply the regex, filling $all_matches with all kinds of data */
    preg_match_all($csv_regex, $line, $all_matches);

    /* $Result will hold the array of fields we'll glean from $all_matches */
    $Result = array();

    /* Run through each successful match ... */
    for ($i = 0; $i < count($all_matches[0]); $i++)
    {
        /* If the 2nd set of capturing parentheses captured, use that directly */
        if (strlen($all_matches[2][$i]) > 0)
            array_push($Result, $all_matches[2][$i]);
        else
        {
            /* It was a quoted value, so take care of an embedded double double-quote before using */
            array_push($Result, preg_replace('/""/', '"', $all_matches[1][$i]));
        }
    }

    /* The array $Result is now populated and available for use */

10.7.2. Checking Tagged Data for Proper Nesting

Here's a somewhat complex undertaking that covers many interesting points: checking that XML (or XHTML, or any similar tagged data) contains no orphan or mismatched tags. The approach I'll take is to look for properly matched tags, non-tag text, and self-closing tags (e.g., <br/>, an "empty-element tag" in XML parlance), and hope that I find the entire string.

Here's the full regex:

    ^((?:<(\w++)[^>]*+(?<!/)>(?1)</\2>|[^<>]++|<\w[^>]*+(?<=/)>)*+)$

A string that matches this has no mismatched tags (with a few caveats we'll look at a bit later).

This may appear to be quite complex, but it's manageable when broken down into its component parts. The expression's outer ^(…)$ wraps the main body of the regex, so a match succeeds only if the entire string is matched. Those parentheses are also capturing, which is what lets (?1) refer back to the main body later.

10.7.2.1. The main body of this expression

The main body of the regex, then, is three alternatives wrapped in (?:…)*+ to allow any mix of them to match.

Because what each alternative can match is unique to that alternative (that is, where one alternative has matched, neither of the others may match), I know that later backtracking will never allow another alternative to match the same text. I can take advantage of that knowledge to make the process more efficient by using a possessive * on the "allow any mix to match" parentheses. This tells the engine to not even bother trying to backtrack, thereby hastening a result when a match can't be found.

For the same reason, the three alternatives may be placed in any order, so I put first the alternatives I felt were most likely to match most often (☞ 260).

Now let's look at the alternatives one at a time...

The second alternative: non-tag text

I'll start with the middle alternative, because it's the simplest: [^<>]++ matches a run of non-tag text, that is, anything other than '<' and '>'.

The third alternative: self-closing tags

The third alternative, <\w[^>]*+(?<=/)>, matches a self-closing tag such as <br/>. The lookbehind requires a '/' immediately before the closing '>'.

The first alternative: a matched set of tags

Finally, let's look at the first alternative:

    <(\w++)[^>]*+(?<!/)>(?1)</\2>

The first part of this subexpression, <(\w++)[^>]*+(?<!/)>, matches an opening tag, with its (\w++) capturing the tag name in the regex's second set of capturing parentheses. The negative lookbehind (?<!/) ensures that the tag is not a self-closing one.

After the opening tag has been matched, (?1) recursively applies the main body (the first set of capturing parentheses) to the tag's content, and </\2> then matches the closing tag, whose name must be the one captured from the opening tag.

(If you're checking HTML or other data where tag names are case insensitive, be sure to prepend (?i) to the regex, or use the i pattern modifier.)

Phew!

10.7.2.2. Possessive quantifiers

I'd like to comment on the use of a possessive \w++ to match the tag name in the first alternative. In a flavor without possessive quantifiers, I'd write that part as (\w+)\b instead. The \b is important to stop the (\w+) from matching, for example, the first 'li' of a '<link>…</li>' sequence. This would leave the 'nk' to be matched outside the capturing parentheses, resulting in a truncated tag name for the \2 backreference that follows.

None of this would normally happen, because the \w+ is greedy and wants to match the entire tag name.
However, if this regex were applied to badly nested text for which it should fail, then backtracking in search of a match could force the \w+ to match less than the full tag name, as it would with the '<link>…</li>' example. The \b prevents this.

PHP's powerful preg engine, thankfully, does support possessive quantifiers, and using one in (\w++) keeps backtracking from splitting up the tag name, just as the \b would.

10.7.2.3. Real-world XML

The format of real-world XML is more complex than simply matching tags. We must also consider XML comments, CDATA sections, and processing instructions, among others.

Adding support for XML comments is as easy as adding a fourth alternative, <!--.*?-->, together with the s pattern modifier so that dot can match a newline.

Similarly, CDATA sections, which are of the form <![CDATA[…]]>, can be handled with a new <!\[CDATA\[.*?]]> alternative, and processing instructions with <\?.*?\?>.

Entity declarations are of the form <!ENTITY…> and can be handled with <![A-Z].*?>, which also covers similar declarations.

A few issues remain, but what we have discussed so far should cover most XML. Here it is all put together into a PHP snippet:

    $xml_regex = '{
        ^(
          (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>   # matched pair of tags
            | [^<>]++                             # non-tag stuff
            | <\w[^>]*+(?<=/)>                    # self-closing tag
            | <!--.*?-->                          # comment
            | <!\[CDATA\[.*?]]>                   # cdata block
            | <\?.*?\?>                           # processing instruction
            | <![A-Z].*?>                         # Entity declaration, etc.
          )*+
        )$
    }sx';

    if (preg_match($xml_regex, $xml_string))
        echo "block structure seems valid\n";
    else
        echo "block structure seems invalid\n";

10.7.2.4. HTML?

More often than not, real-world HTML has all kinds of issues that make a check like this impractical, such as orphan and mismatched tags, and invalid raw '<' and '>' characters. However, even properly balanced HTML has some special cases that we need to allow for: comments and <script> tags.

HTML comments are handled in the same way as XML comments: <!--.*?--> with the s pattern modifier.

A <script> section is important because it may have raw '<' and '>' within it, so we want to simply allow anything from the opening <script…> to the closing </script>.
We can handle this with <script\b[^>]*>.*?</script>.

Here's the HTML version of our PHP snippet:

    $html_regex = '{
        ^(
          (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>   # matched pair of tags
            | [^<>]++                             # non-tag stuff
            | <\w[^>]*+(?<=/)>                    # self-closing tag
            | <!--.*?-->                          # comment
            | <script\b[^>]*>.*?</script>         # script block
          )*+
        )$
    }isx';

    if (preg_match($html_regex, $html_string))
        echo "block structure seems valid\n";
    else
        echo "block structure seems invalid\n";
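As a usage sketch for the CSV code of Section 10.7.1, the regex and loop can be wrapped in a function and run on a sample line. The function name parse_csv_line and the sample data are assumptions made for illustration and are not part of the original example. str_replace stands in for the preg_replace call, since the text being replaced is a fixed string.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the text):
// wraps the Section 10.7.1 regex and loop in a reusable function.
function parse_csv_line(string $line): array
{
    $csv_regex = '{
        \G(?:^|,)
        (?:
            # Either a double-quoted field ...
            "                              # field opening quote
            ( [^"]*+ (?: "" [^"]*+ )*+ )
            "                              # field closing quote
          |
            # ... or some non-quote/non-comma text
            ( [^",]*+ )
        )
    }x';

    preg_match_all($csv_regex, $line, $all_matches);

    $result = [];
    for ($i = 0; $i < count($all_matches[0]); $i++) {
        if (strlen($all_matches[2][$i]) > 0) {
            // Unquoted field: use it directly.
            $result[] = $all_matches[2][$i];
        } else {
            // Quoted (or empty) field: collapse each "" into a single ".
            $result[] = str_replace('""', '"', $all_matches[1][$i]);
        }
    }
    return $result;
}

// Sample input, made up for this sketch.
print_r(parse_csv_line('a,"b ""c""",,d'));
```

For the sample line, the four fields come back as `a`, `b "c"`, an empty string, and `d`. Note that the \G anchor stops matching at the first malformed field, so a caller that needs strict validation would have to compare the total matched length against the length of the line.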
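The XML check of Section 10.7.2.3 can be exercised in the same way. The sketch below assumes a wrapper function named has_balanced_xml and two made-up sample documents; neither comes from the original text. It is meant only to show the regex accepting raw '<' and '>' inside a comment and a CDATA section while rejecting crossed tags.

```php
<?php
// Hypothetical wrapper around the Section 10.7.2.3 regex; the function
// name and the sample strings below are illustrative assumptions.
function has_balanced_xml(string $xml_string): bool
{
    $xml_regex = '{
        ^(
          (?: <(\w++) [^>]*+ (?<!/)> (?1) </\2>   # matched pair of tags
            | [^<>]++                             # non-tag stuff
            | <\w[^>]*+(?<=/)>                    # self-closing tag
            | <!--.*?-->                          # comment
            | <!\[CDATA\[.*?]]>                   # cdata block
            | <\?.*?\?>                           # processing instruction
            | <![A-Z].*?>                         # entity declaration, etc.
          )*+
        )$
    }sx';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($xml_regex, $xml_string) === 1;
}

$good = '<?xml version="1.0"?><doc><!-- a > b --><![CDATA[ 1 < 2 ]]><br/></doc>';
$bad  = '<doc><item>text</doc></item>';

var_dump(has_balanced_xml($good));   // well nested
var_dump(has_balanced_xml($bad));    // closing tags are crossed
```

The first call reports true and the second false. Comparing against 1 rather than relying on truthiness keeps a preg_match error (false) from being mistaken for "no match" elsewhere in calling code.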
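The point made in Section 10.7.2.2 about the tag name can also be checked directly. The following sketch builds the three-alternative regex around a caller-supplied tag-name subpattern and applies each variant to the mismatched '<link>text</li>' sequence. The helper name nesting_regex, its parameter, and the variant labels are assumptions introduced for this illustration.

```php
<?php
// Hypothetical helper: returns the three-alternative nesting regex with
// the given subpattern used to capture the opening tag name.
function nesting_regex(string $tag_name): string
{
    return '{^((?:<' . $tag_name . '[^>]*+(?<!/)>(?1)</\2>'
         . '|[^<>]++'
         . '|<\w[^>]*+(?<=/)>)*+)$}';
}

$bad = '<link>text</li>';   // mismatched tags: should be rejected

$variants = [
    'greedy (\w+)'      => '(\w+)',     // may backtrack to capture just "li"
    'greedy with \b'    => '(\w+)\b',   // word boundary blocks a partial name
    'possessive (\w++)' => '(\w++)',    // never gives back part of the name
];

foreach ($variants as $label => $tag_name) {
    $matched = preg_match(nesting_regex($tag_name), $bad) === 1;
    echo $label, ': ', $matched ? "accepted\n" : "rejected\n";
}
```

Only the plain greedy variant accepts the bad input, because backtracking lets (\w+) settle for 'li' and leaves 'nk' to [^>]*+. The \b and possessive variants both reject it.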