8.9. Validating HTML with Multiple Patterns Per Matcher

Here's a Java version of the Perl program to validate a subset of HTML (see page 132). This snippet employs the usePattern method to change a matcher's pattern on the fly. This allows multiple patterns, each beginning with \G, to "tag team" their way through a string. See the text on page 132 for more details on the approach.

    // Assumes java.util.regex.Matcher and java.util.regex.Pattern are imported,
    // and that html holds the text to check
    Pattern pAtEnd   = Pattern.compile("\\G\\z");
    Pattern pWord    = Pattern.compile("\\G\\w+");
    Pattern pNonHtml = Pattern.compile("\\G[^\\w<>&]+");
    Pattern pImgTag  = Pattern.compile("\\G(?i)<img\\s+([^>]+)>");
    Pattern pLink    = Pattern.compile("\\G(?i)<A\\s+([^>]+)>");
    Pattern pLinkX   = Pattern.compile("\\G(?i)</A>");
    Pattern pEntity  = Pattern.compile("\\G&(#\\d+|\\w+);");

    boolean needClose = false;
    Matcher m = pAtEnd.matcher(html); // Any Pattern object can create our Matcher object

    while (!m.usePattern(pAtEnd).find()) {
        if (m.usePattern(pWord).find()) {
            // ... have a word or number in m.group(); can now check for profanity, etc. ...
        } else if (m.usePattern(pImgTag).find()) {
            // ... have an image tag; can check that it's appropriate ...
        } else if (!needClose && m.usePattern(pLink).find()) {
            // ... have a link anchor; can validate it ...
            needClose = true;
        } else if (needClose && m.usePattern(pLinkX).find()) {
            System.out.println("/LINK [" + m.group() + "]");
            needClose = false;
        } else if (m.usePattern(pEntity).find()) {
            // Allow entities like &gt; and &#123;
        } else if (m.usePattern(pNonHtml).find()) {
            // Other (non-word) non-HTML stuff; simply allow it
        } else {
            // Nothing matched at this point, so it must be an error. Grab a dozen or so characters
            // at our current location so that we can issue an informative error message
            m.usePattern(Pattern.compile("\\G(?s).{1,12}")).find();
            System.out.println("Bad char before '" + m.group() + "'");
            System.exit(1);
        }
    }

    if (needClose) {
        System.out.println("Missing Final </A>");
        System.exit(1);
    }

Because of a java.util.regex bug that causes the "non-HTML" match attempt to "consume" a character of the target text even when it doesn't match, I moved the non-HTML check to the end. The bug is still there, but now manifests itself only in the error message, which is updated to indicate that the first character is missing in the text reported. I've reported this bug to Sun.
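The tag-team loop can be tried in isolation on a much smaller grammar. The following is a minimal sketch, not the validator itself: the class name TagTeamSketch, the tokenize method, and the WORD and SPACE patterns are hypothetical and exist only for illustration. It assumes a runtime in which a failed find leaves the matcher's position where it was.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTeamSketch {
    // Hypothetical token patterns; each begins with \G so it can match
    // only where the previous match left off.
    private static final Pattern AT_END = Pattern.compile("\\G\\z");
    private static final Pattern WORD   = Pattern.compile("\\G\\w+");
    private static final Pattern SPACE  = Pattern.compile("\\G\\s+");

    /** Splits text into WORD and SPACE tokens, rejecting anything else. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = AT_END.matcher(text); // any Pattern can create the Matcher

        // Keep swapping patterns on the one matcher until the end is reached
        while (!m.usePattern(AT_END).find()) {
            if (m.usePattern(WORD).find()) {
                tokens.add("WORD:" + m.group());
            } else if (m.usePattern(SPACE).find()) {
                tokens.add("SPACE");
            } else {
                // None of the patterns matched at the current position
                throw new IllegalArgumentException("unrecognized input");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab cd"));
    }
}
```

Under that assumption, the input "ab cd" yields the tokens WORD:ab, SPACE, and WORD:cd, while an input such as "ab!" is rejected when the loop reaches the exclamation point.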

Until the bug is fixed, how might we use the one-argument version of the find method to solve this problem? Turn the page for the answer.

Multiple Patterns and the One-Argument find()

Answer to the question on page 399.

The java.util.regex bug described on page 399 incorrectly moves the matcher's idea of the "current location," so the next find starts at the wrong location. We can get around the bug by explicitly keeping track of the "current location" ourselves and using the one-argument form of find to explicitly begin the match at the proper spot.
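The behavior this workaround depends on can be seen in a few lines. This is a sketch with hypothetical names (FindFromSketch, matchAt, and the sample patterns); it relies on the one-argument find resetting the matcher and beginning its search at the given index, and on \G then pinning the match to that index, which is what the workaround itself requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromSketch {
    /**
     * Tries a \G-anchored pattern at exactly loc.
     * Returns the matched text, or null if the pattern does not match there.
     */
    static String matchAt(Matcher m, Pattern p, int loc) {
        // find(int) resets the matcher and searches from loc;
        // the leading \G keeps the match from sliding further right
        return m.usePattern(p).find(loc) ? m.group() : null;
    }

    public static void main(String[] args) {
        Pattern digits  = Pattern.compile("\\G\\d+");
        Pattern letters = Pattern.compile("\\G[a-z]+");
        Matcher m = digits.matcher("ab12cd");

        System.out.println(matchAt(m, digits, 2));  // digits start at index 2
        System.out.println(m.end());                // where that match ended
        System.out.println(matchAt(m, digits, 0));  // no digits at index 0
        System.out.println(matchAt(m, letters, 4)); // letters resume at index 4
    }
}
```

Because the caller supplies the position every time, nothing the matcher remembers from a failed attempt can affect the next one.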

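The changes from the version on page 399 are the currentLoc variable and its update after each match, the one-argument find calls, the new loop condition (pAtEnd is no longer needed), the non-HTML check moved up to follow the word check, and the reworded error message: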

    Pattern pWord    = Pattern.compile("\\G\\w+");
    Pattern pNonHtml = Pattern.compile("\\G[^\\w<>&]+");
    Pattern pImgTag  = Pattern.compile("\\G(?i)<img\\s+([^>]+)>");
    Pattern pLink    = Pattern.compile("\\G(?i)<A\\s+([^>]+)>");
    Pattern pLinkX   = Pattern.compile("\\G(?i)</A>");
    Pattern pEntity  = Pattern.compile("\\G&(#\\d+|\\w+);");

    boolean needClose = false;
    Matcher m = pWord.matcher(html); // Any Pattern object can create our Matcher object
    int currentLoc = 0;              // Begin at the start of the string

    while (currentLoc < html.length()) {
        if (m.usePattern(pWord).find(currentLoc)) {
            // ... have a word or number in m.group(); can now check for profanity, etc. ...
        } else if (m.usePattern(pNonHtml).find(currentLoc)) {
            // Other (non-word) non-HTML stuff; simply allow it
        } else if (m.usePattern(pImgTag).find(currentLoc)) {
            // ... have an image tag; can check that it's appropriate ...
        } else if (!needClose && m.usePattern(pLink).find(currentLoc)) {
            // ... have a link anchor; can validate it ...
            needClose = true;
        } else if (needClose && m.usePattern(pLinkX).find(currentLoc)) {
            System.out.println("/LINK [" + m.group() + "]");
            needClose = false;
        } else if (m.usePattern(pEntity).find(currentLoc)) {
            // Allow entities like &gt; and &#123;
        } else {
            // Nothing matched at this point, so it must be an error. Grab a dozen or so characters
            // at our current location so that we can issue an informative error message
            m.usePattern(Pattern.compile("\\G(?s).{1,12}")).find(currentLoc);
            System.out.println("Bad char at '" + m.group() + "'");
            System.exit(1);
        }
        currentLoc = m.end(); // The 'current location' is now where the previous match ended
    }

    if (needClose) {
        System.out.println("Missing Final </A>");
        System.exit(1);
    }

Unlike the previous approach, this one uses the matcher-resetting version of find, so it wouldn't translate directly to a situation where a region must be respected. You can, however, maintain the region yourself by inserting appropriate region calls before each find, such as:

    m.usePattern(pWord)
     .find(currentLoc)
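The Matcher documentation describes the one-argument find as resetting the matcher, and reset as restoring the default region, so one cautious way to honor a region is to re-establish it from the current location on every attempt and then call the zero-argument find. The following is a sketch under that reading, not the book's method: RegionSketch, wordsInRegion, the start and end parameters, and the two patterns are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionSketch {
    private static final Pattern WORD = Pattern.compile("\\G\\w+");
    private static final Pattern GAP  = Pattern.compile("\\G\\W+");

    /** Collects the words found between start (inclusive) and end (exclusive). */
    static List<String> wordsInRegion(String text, int start, int end) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        int currentLoc = start;

        while (currentLoc < end) {
            // region(...) resets the matcher and limits it to [currentLoc, end);
            // the following find() then begins at the region's start
            if (m.usePattern(WORD).region(currentLoc, end).find()) {
                words.add(m.group());
            } else if (m.usePattern(GAP).region(currentLoc, end).find()) {
                // separator text between words; nothing to record
            } else {
                throw new IllegalStateException("no pattern matched at " + currentLoc);
            }
            currentLoc = m.end(); // continue where this match ended
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsInRegion("one two three", 4, 10));
    }
}
```

With the region set to positions 4 through 10 of "one two three", the last word comes back cut off as "th", which shows that matching stops at the region's end rather than at the end of the string.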

8.9.1. Parsing Comma-Separated Values (CSV) Text

Here's the java.util.regex version of the CSV example from Chapter 6 (see page 271). It's been updated to use possessive quantifiers (see page 142) instead of atomic parentheses, for their cleaner presentation.

    String regex = // Puts a double-quoted field into group(1), an unquoted field into group(2).
        " \\G(?:^|,) \n" +
        " (?: \n" +
        "    # Either a double-quoted field ... \n" +
        "    \" # field's opening quote \n" +
        "     ( [^\"]*+ (?: \"\" [^\"]*+ )*+ ) \n" +
        "    \" # field's closing quote \n" +
        "  | # ... or ... \n" +
        "    # some non-quote/non-comma text ... \n" +
        "     ( [^\",]*+ ) \n" +
        " ) \n";

    // Create a matcher for the CSV line of text, using the regex above.
    Matcher mMain = Pattern.compile(regex, Pattern.COMMENTS).matcher(line);

    // Create a matcher for "", with dummy text for the time being.
    Matcher mQuote = Pattern.compile("\"\"").matcher("");

    while (mMain.find()) {
        String field;
        if (mMain.start(2) >= 0)
            field = mMain.group(2); // The field is unquoted, so we can use it as is.
        else
            // The field is quoted, so we must replace paired double quotes with one double quote.
            field = mQuote.reset(mMain.group(1)).replaceAll("\"");

        // We can now work with field ...
        System.out.println("Field [" + field + "]");
    }

This is more efficient than the original Java version shown on page 217 for two reasons: the regex is more efficient as per the Chapter 6 discussion on page 271, and a single matcher is used and reused (via the one-argument form of the reset method), rather than creating and disposing of new matchers each time.
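To try the CSV loop on its own, it can be wrapped in a small class. This is a sketch: CsvSketch and parseLine are hypothetical names, the regex is the one from this section written more compactly, and the code has been exercised here only on a line whose first field is non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvSketch {
    // A double-quoted field lands in group(1), an unquoted field in group(2)
    private static final Pattern FIELD = Pattern.compile(
        " \\G(?:^|,)                               \n" +
        " (?:                                      \n" +
        "    \" ( [^\"]*+ (?: \"\" [^\"]*+ )*+ ) \" # double-quoted field \n" +
        "  | ( [^\",]*+ )                           # unquoted field      \n" +
        " )                                        \n",
        Pattern.COMMENTS);

    private static final Pattern DOUBLED_QUOTE = Pattern.compile("\"\"");

    /** Splits one line of CSV text into its fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        Matcher mMain  = FIELD.matcher(line);
        Matcher mQuote = DOUBLED_QUOTE.matcher(""); // dummy text; reset per field

        while (mMain.find()) {
            if (mMain.start(2) >= 0) {
                fields.add(mMain.group(2)); // unquoted: use as is
            } else {
                // quoted: collapse each pair of double quotes into one
                fields.add(mQuote.reset(mMain.group(1)).replaceAll("\""));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String field : parseLine("a,\"b \"\"c\"\" d\",,e")) {
            System.out.println("Field [" + field + "]");
        }
    }
}
```

Returning a list instead of printing inside the loop is a choice made for this sketch so the result can be inspected; the single reused mQuote matcher is kept as described above.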