8.4. The Matcher Object

Once you've associated a regular expression with a target string by creating a matcher, you can instruct it to apply the regex to the target in various ways, and query the results of that application. For example, given a matcher m, the call m.find() actually applies m's regex to its string, returning a Boolean indicating whether a match is found. If a match is found, the call m.group() returns a string representing the text actually matched.

Before looking in detail at a matcher's various methods, it's useful to have an overview of what information it maintains. To serve as a better reference, the following lists are sprinkled with page references to details about each item. Items in the first list are those that the programmer can set or modify, while items in the second list are read-only.

Items that the programmer can set or update:
The following are the read-only data maintained by the matcher:
These lists are a lot to absorb, but they are easier to grasp when discussing the methods grouped by functionality. The next few sections do so. Also, the list of methods at the start of the chapter (see page 366) will help you to find your way around when using this chapter as a reference.

8.4.1. Applying the Regex

Here are the main Matcher methods for actually applying the matcher's regex to its target text:
8.4.2. Querying Match Results

The matcher methods in the following list return information about a successful match. They throw IllegalStateException if the matcher's regex hasn't yet been applied to its target text, or if the previous application was not successful. The methods that accept a num argument (referring to a set of capturing parentheses) throw IndexOutOfBoundsException when an invalid num is given.

Note that the start and end methods, which return character offsets, do so without regard to the region: their return values are offsets from the start of the text, not necessarily from the start of the region.

Following this list of methods is an example illustrating many of them in action.
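To make the exception behavior and the "offsets are absolute" point concrete, here is a minimal sketch. The class name MatchQueryDemo and its method names are invented for this illustration; only the Matcher and Pattern calls come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchQueryDemo {
    // Querying a matcher before any match attempt should throw IllegalStateException.
    static String queryBeforeFind() {
        Matcher m = Pattern.compile("(\\d+)").matcher("abc 123");
        try {
            m.group();
            return "none";
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    // Asking for a group number the regex doesn't have should throw IndexOutOfBoundsException.
    static String queryInvalidGroup() {
        Matcher m = Pattern.compile("(\\d+)").matcher("abc 123");
        m.find();
        try {
            m.group(2); // the regex has only one set of capturing parentheses
            return "none";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    // Even with a region set, start() is an offset from the start of the whole text.
    static int startWithRegion() {
        Matcher m = Pattern.compile("\\d+").matcher("abc 123");
        m.region(4, 7); // the region covers just "123"
        m.find();
        return m.start();
    }

    public static void main(String[] args) {
        System.out.println(queryBeforeFind());
        System.out.println(queryInvalidGroup());
        System.out.println(startWithRegion());
    }
}
```

Here startWithRegion returns 4 rather than 0, because the offset is measured from the start of the target text, not from the start of the region.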
8.4.2.1. Match-result example

Here's an example that demonstrates many of these match-result methods. Given a URL in a string, the code identifies and reports on the URL's protocol ('http' or 'https'), hostname, and optional port number:

    String url = "http://regex.info/blog";
    String regex = "(?x) ^(https?):// ([^/:]+) (?::(\\d+))?";
    Matcher m = Pattern.compile(regex).matcher(url);

    if (m.find()) {
        System.out.print(
            "Overall [" + m.group() + "]" +
            " (from " + m.start() + " to " + m.end() + ")\n" +
            "Protocol [" + m.group(1) + "]" +
            " (from " + m.start(1) + " to " + m.end(1) + ")\n" +
            "Hostname [" + m.group(2) + "]" +
            " (from " + m.start(2) + " to " + m.end(2) + ")\n");

        // Group #3 might not have participated, so we must be careful here
        if (m.group(3) == null)
            System.out.println("No port; default of '80' is assumed");
        else {
            System.out.print("Port is [" + m.group(3) + "] " +
                             "(from " + m.start(3) + " to " + m.end(3) + ")\n");
        }
    }

When executed, it produces:

    Overall [http://regex.info] (from 0 to 17)
    Protocol [http] (from 0 to 4)
    Hostname [regex.info] (from 7 to 17)
    No port; default of '80' is assumed

8.4.3. Simple Search and Replace

You can implement search-and-replace operations using just the methods mentioned so far, if you don't mind doing a lot of housekeeping work yourself, but a matcher offers convenient methods to do simple search and replace for you:
8.4.3.1. Simple search and replace examples

This simple example replaces every occurrence of "Java 1.5" with "Java 5.0," to convert the nomenclature from engineering to marketing:

    String text = "Before Java 1.5 was Java 1.4.2. After Java 1.5 is Java 1.6";
    String regex = "\\bJava\\s*1\\.5\\b";
    Matcher m = Pattern.compile(regex).matcher(text);
    String result = m.replaceAll("Java 5.0");
    System.out.println(result);

It produces:

    Before Java 5.0 was Java 1.4.2. After Java 5.0 is Java 1.6

If you won't need the pattern and matcher further, you can chain everything together, setting result to:

    Pattern.compile("\\bJava\\s*1\\.5\\b").matcher(text).replaceAll("Java 5.0")

(If the regex is used many times within the same thread, it's most efficient to precompile the Pattern object; see page 372.)

You can convert "Java 1.6" to "Java 6.0" as well, by making a small adjustment to the regex (and a corresponding adjustment to the replacement string, as per the discussion of the replacement argument that follows):

    Pattern.compile("\\bJava\\s*1\\.([56])\\b").matcher(text).replaceAll("Java $1.0")

which, when given the same text as earlier, produces:

    Before Java 5.0 was Java 1.4.2. After Java 5.0 is Java 6.0

You can use replaceFirst instead of replaceAll with any of these examples to replace only the first match. You should use replaceFirst when you want to forcibly limit the matcher to only one replacement, of course, but it also makes sense from an efficiency standpoint to use it when you know that only one match is possible. (You might know this because of your knowledge of the regex or the data, for example.)

8.4.3.2. The replacement argument

The replacement argument to the replaceAll and replaceFirst methods (and to the next section's appendReplacement method, for that matter) receives special treatment prior to being inserted in place of a match, on a per-match basis:
If you have a string of unknown content that you intend to use as the replacement text, it's best to use Matcher.quoteReplacement to ensure that any replacement metacharacters it might contain are rendered inert. Given a user's regex in uRegex and replacement text in uRepl, this snippet ensures that the replaced text is exactly that given:

    Pattern.compile(uRegex).matcher(text).replaceAll(Matcher.quoteReplacement(uRepl))

8.4.4. Advanced Search and Replace

Two methods provide raw access to a matcher's search-and-replace mechanics. Together, they build a result in a StringBuffer that you provide. The first, called after each match, fills the result with the replacement string and the text between the matches. The second, called after all matches have been found, tacks on whatever text remains after the final match.
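To make the earlier point about Matcher.quoteReplacement concrete, here is a small sketch contrasting a raw replacement string with a quoted one. The class name, method names, and sample data are made up for the illustration; uRegex and uRepl follow the naming used in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Replacement used as-is: "$1" is treated as a reference to capturing group 1.
    static String replaceRaw(String text, String uRegex, String uRepl) {
        return Pattern.compile(uRegex).matcher(text).replaceAll(uRepl);
    }

    // Replacement quoted first: every character of uRepl is inserted literally.
    static String replaceLiteral(String text, String uRegex, String uRepl) {
        return Pattern.compile(uRegex).matcher(text)
                      .replaceAll(Matcher.quoteReplacement(uRepl));
    }

    public static void main(String[] args) {
        String text = "cost: 5 USD";
        String uRegex = "(\\d+) USD";
        String uRepl = "$1.00";
        System.out.println(replaceRaw(text, uRegex, uRepl));     // cost: 5.00
        System.out.println(replaceLiteral(text, uRegex, uRepl)); // cost: $1.00
    }
}
```

Which of the two results is "right" depends on what the user meant, which is exactly why text of unknown origin is safer when quoted.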
8.4.4.1. Search-and-replace examples

Here's an example showing how you might implement your own version of replaceAll. (Not that you'd want to, but it's illustrative.)

    public static String replaceAll(Matcher m, String replacement)
    {
        m.reset(); // Be sure to start with a fresh Matcher object
        StringBuffer result = new StringBuffer(); // We'll build the updated copy here

        while (m.find())
            m.appendReplacement(result, replacement);

        m.appendTail(result);
        return result.toString(); // Convert result to a string and return
    }

As with the real replaceAll method, this code does not respect the region (see page 384), but rather resets it prior to the search-and-replace operation.

To remedy that deficiency, here is a version of replaceAll that does respect the region. The added lines save the region bounds and restore them after the reset:

    public static String replaceAllRegion(Matcher m, String replacement)
    {
        int start = m.regionStart();
        int end = m.regionEnd();
        m.reset().region(start, end); // Reset the matcher, but then restore the region

        StringBuffer result = new StringBuffer(); // We'll build the updated copy here

        while (m.find())
            m.appendReplacement(result, replacement);

        m.appendTail(result);
        return result.toString(); // Convert to a String and return
    }

The combination of the reset and region methods in one expression is an example of method chaining, which is discussed starting on page 389.
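As a quick check of the difference, the following sketch wraps the region-respecting version in a hypothetical class, RegionReplaceDemo, and compares it with the built-in replaceAll on the same matcher setup. The sample text and helper names are mine, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionReplaceDemo {
    // The region-respecting replaceAll from the text.
    static String replaceAllRegion(Matcher m, String replacement) {
        int start = m.regionStart();
        int end = m.regionEnd();
        m.reset().region(start, end); // Reset the matcher, but then restore the region

        StringBuffer result = new StringBuffer();
        while (m.find())
            m.appendReplacement(result, replacement);
        m.appendTail(result);
        return result.toString();
    }

    // Built-in replaceAll: the region set here is discarded by the internal reset.
    static String builtIn(String text) {
        Matcher m = Pattern.compile("a").matcher(text);
        m.region(4, text.length());
        return m.replaceAll("X");
    }

    // Region-respecting version: only matches inside the region are replaced.
    static String regionAware(String text) {
        Matcher m = Pattern.compile("a").matcher(text);
        m.region(4, text.length());
        return replaceAllRegion(m, "X");
    }

    public static void main(String[] args) {
        System.out.println(builtIn("aaa bbb aaa"));     // XXX bbb XXX
        System.out.println(regionAware("aaa bbb aaa")); // aaa bbb XXX
    }
}
```

Note that the region-respecting version still returns the whole text: the part before the region is copied by the first appendReplacement call, and anything after the region is copied by appendTail.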
This next example is slightly more involved; it prints a version of the string in the variable metric, with Celsius temperatures converted to Fahrenheit:

    // Build a matcher to find numbers followed by "C" within the variable "metric"
    // The following regex is: (\d+(?:\.\d*)?)C\b
    Matcher m = Pattern.compile("(\\d+(?:\\.\\d*)?)C\\b").matcher(metric);
    StringBuffer result = new StringBuffer(); // We'll build the updated copy here

    while (m.find()) {
        float celsius = Float.parseFloat(m.group(1));  // Get the number, as a number
        int fahrenheit = (int) (celsius * 9/5 + 32);   // Convert to a Fahrenheit value
        m.appendReplacement(result, fahrenheit + "F"); // Insert it
    }

    m.appendTail(result);
    System.out.println(result.toString()); // Display the result

For example, if the variable metric contains 'from 36.3C to 40.1C.', it displays 'from 97F to 104F.'.

8.4.5. In-Place Search and Replace

So far, we've applied java.util.regex to only String objects, but since a matcher can be created with any object that implements the CharSequence interface, we can actually work with text that we can modify directly, in place, and on the fly.

StringBuffer and StringBuilder are commonly used classes that implement CharSequence, the former being multithread safe but less efficient. Both can be used like a String object, but, unlike a String, they can also be modified. Examples in this book use StringBuilder, but feel free to use StringBuffer if coding for a multithreaded environment.

Here's a simple example illustrating a search through a StringBuilder object, with all-uppercase words being replaced by their lowercase counterparts:
    StringBuilder text = new StringBuilder("It's SO very RUDE to shout!");
    Matcher m = Pattern.compile("\\b[\\p{Lu}\\p{Lt}]+\\b").matcher(text);

    while (m.find())
        text.replace(m.start(), m.end(), m.group().toLowerCase());

    System.out.println(text);

This produces:

    It's so very rude to shout!
Two matches result in two calls to text.replace. The first two arguments indicate the span of characters to be replaced (we pass the span that the regex matched), followed by the text to use as the replacement (the lowercase version of what was matched).

As long as the replacement text is the same length as the text being replaced, as is the case here, an in-place search and replace is this simple. Also, the approach remains this simple if only one search and replace is done, rather than the iterative application shown in this example.

8.4.5.1. Using a different-sized replacement

Processing gets more complicated if the replacement text is a different length than what it replaces. The changes we make to the target are done "behind the back" of the matcher, so its idea of the match pointer (where in the target to begin the next find) can become incorrect.

We can get around this by maintaining the match pointer ourselves, passing it to the find method to have it explicitly begin the search where we know it should. That's what we do in the following modification of the previous example, where we add <b>…</b> tags around the newly lowercased text:

    StringBuilder text = new StringBuilder("It's SO very RUDE to shout!");
    Matcher m = Pattern.compile("\\b[\\p{Lu}\\p{Lt}]+\\b").matcher(text);
    int matchPointer = 0; // First search begins at the start of the string

    while (m.find(matchPointer)) {
        matchPointer = m.end(); // Next search starts from where this one ended
        text.replace(m.start(), m.end(), "<b>" + m.group().toLowerCase() + "</b>");
        matchPointer += 7; // Account for having added '<b>' and '</b>'
    }

    System.out.println(text);

This produces:

    It's <b>so</b> very <b>rude</b> to shout!
8.4.6. The Matcher's Region

Since Java 1.5, a matcher supports the concept of a changeable region with which you can restrict match attempts to some subset of the target string. Normally, a matcher's region encompasses the entire target text, but it can be changed on the fly with the region method.

The next example inspects a string of HTML, reporting image tags without an ALT attribute. It uses two matchers that work on the same text (the HTML), but with different regular expressions: one finds image tags, while the other finds ALT attributes.

Although the two matchers are applied to the same string, they're independent objects that are related only because we use the result of each image-tag match to explicitly isolate the ALT search to the image tag's body. We do this by using the start- and end-method data from the just-completed image-tag match to set the ALT-matcher's region, prior to invoking the ALT-matcher's find.

By isolating the image tag's body in this way, the ALT search tells us whether the image tag we just found, and not the whole HTML string in general, contains an ALT attribute.

    // Matcher to find an image tag. The 'html' variable contains the HTML in question
    Matcher mImg = Pattern.compile("(?id)<IMG\\s+(.*?)/?>").matcher(html);

    // Matcher to find an ALT attribute (to be applied to an IMG tag's body
    // within the same 'html' variable)
    Matcher mAlt = Pattern.compile("(?ix)\\b ALT \\s* =").matcher(html);

    // For each image tag within the html ...
    while (mImg.find()) {
        // Restrict the next ALT search to the body of the just-found image tag
        mAlt.region(mImg.start(1), mImg.end(1));

        // Report an error if no ALT found, showing the whole image tag found above
        if (! mAlt.find())
            System.out.println("Missing ALT attribute in: " + mImg.group());
    }

It may feel odd to indicate the target text in one location (when the mAlt matcher is created) and its range in another (the mAlt.region call).
If so, another option is to create mAlt with a dummy target (an empty string, rather than html) and change the region call to mAlt.reset(html).region(…). The extra call to reset is slightly less efficient, but keeping the setting of the target text in the same spot as the setting of the region may be clearer.

In any case, let me reiterate that if we hadn't restricted the ALT matcher by setting its region, its find would end up searching the entire string, reporting the useless fact of whether the HTML contained 'ALT=' anywhere at all.

Let's extend this example so that it reports the line number within the HTML where the offending image tag starts. We'll do so by isolating our view of the HTML to the text before the image tag, then count how many newlines we can find. The additions are the newline matcher and the line-counting block:

    // Matcher to find an image tag. The 'html' variable contains the HTML in question
    Matcher mImg = Pattern.compile("(?id)<IMG\\s+(.*?)/?>").matcher(html);

    // Matcher to find an ALT attribute (to be applied to an IMG tag's body
    // within the same 'html' variable)
    Matcher mAlt = Pattern.compile("(?ix)\\b ALT \\s* =").matcher(html);

    // Matcher to find a newline
    Matcher mLine = Pattern.compile("\\n").matcher(html);

    // For each image tag within the html ...
    while (mImg.find()) {
        // Restrict the next ALT search to the body of the just-found image tag
        mAlt.region(mImg.start(1), mImg.end(1));

        // Report an error if no ALT found, showing the whole image tag found above
        if (! mAlt.find()) {
            // Restrict counting of newlines to the text before the start of the image tag
            mLine.region(0, mImg.start());
            int lineNum = 1; // The first line is numbered 1
            while (mLine.find())
                lineNum++; // Each newline bumps up the line number
            System.out.println("Missing ALT attribute on line " + lineNum);
        }
    }

As before, when setting the region for the ALT matcher, we use the start(1) method of the image matcher to identify where in the HTML string the body of the image tag starts.
Conversely, when setting the end of the newline-matching region, we use start() because it identifies where the whole image tag begins (which is where we want the newline counting to end).

8.4.6.1. Points to keep in mind

It's important to remember that not only do some search-related methods ignore the region, they actually invoke the reset method internally and so revert the region to its "entire text" default.
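As a rough illustration of that point, the sketch below sets a region and then inspects the bounds after three different calls. It relies on the java.util.regex documentation stating that find(int) and replaceAll reset the matcher, while a plain find() does not; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionResetDemo {
    // A matcher over "banana" whose region is restricted to characters 2 through 3.
    private static Matcher regionMatcher() {
        Matcher m = Pattern.compile("a").matcher("banana");
        m.region(2, 4);
        return m;
    }

    private static String bounds(Matcher m) {
        return m.regionStart() + ".." + m.regionEnd();
    }

    // A plain find() works within the region and leaves it alone.
    static String afterPlainFind() {
        Matcher m = regionMatcher();
        m.find();
        return bounds(m);
    }

    // find(int) resets the matcher first, so the region reverts to the whole text.
    static String afterFindFrom() {
        Matcher m = regionMatcher();
        m.find(0);
        return bounds(m);
    }

    // replaceAll also resets the matcher before doing its work.
    static String afterReplaceAll() {
        Matcher m = regionMatcher();
        m.replaceAll("A");
        return bounds(m);
    }

    public static void main(String[] args) {
        System.out.println(afterPlainFind());  // 2..4
        System.out.println(afterFindFrom());   // 0..6
        System.out.println(afterReplaceAll()); // 0..6
    }
}
```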
Also important to remember is that character offsets in the match-result data (that is, the values reported by start and end methods) are not region-relative values, but always with respect to the start of the entire target.

8.4.6.2. Setting and inspecting region bounds

Three matcher methods relate to setting and inspecting the region bounds:
Because the region method requires both start and end to be explicitly provided, it can be a bit inconvenient when you want to set only one. Table 8-4 offers ways to do so.

Table 8-4. Setting Only One Edge of the Region
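One way to set just one edge, sketched here with hypothetical helper names setRegionStart and setRegionEnd, is to hand the matcher its own current value for the other edge. This uses only the region, regionStart, and regionEnd methods, and is meant as an illustration rather than the only approach.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionEdgeDemo {
    // Move only the start of the region, keeping the current end.
    static Matcher setRegionStart(Matcher m, int start) {
        return m.region(start, m.regionEnd());
    }

    // Move only the end of the region, keeping the current start.
    static Matcher setRegionEnd(Matcher m, int end) {
        return m.region(m.regionStart(), end);
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\d").matcher("0123456789");
        m.region(2, 8);

        setRegionStart(m, 4);
        System.out.println(m.regionStart() + ".." + m.regionEnd()); // 4..8

        setRegionEnd(m, 6);
        System.out.println(m.regionStart() + ".." + m.regionEnd()); // 4..6
    }
}
```

Keep in mind that region itself resets the matcher, so any match data from a previous find is gone after either helper is called.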
8.4.6.3. Looking outside the current region

Setting a region to something other than the all-text default normally hides, in every respect, the excluded text from the regex engine. This means, for example, that the start of the region is matched by ^ even though it may not be the start of the target text.

However, it's possible to open up the areas outside the region to limited inspection. Turning on transparent bounds opens up the excluded text to "looking" constructs (lookahead, lookbehind, and word boundaries), and by turning off anchoring bounds, you can configure the edges of the region to not be considered the start and/or end of the input (unless they truly are).

The reason one might want to change either of these flags is strongly related to why the region was adjusted from the default in the first place. We had no need to do this in the earlier region examples because of their nature: the region-related searches used neither anchoring nor looking constructs, for example.

But imagine again using a CharBuffer to hold text being edited by the user within your application. If the user does a search or search-and-replace operation, it's natural to limit the operation to the text after the cursor, so you'd set the region to start at the current cursor position. Imagine further that the user's cursor is at the marked point in this text:
and requests that matches of \bcar\b be changed to "automobile." After setting the region appropriately (to isolate the text to the right of the cursor), you'll launch the search and perhaps be surprised to find that it matches right there at the start of the region. It matches there because the transparent-bounds flag defaults to false, and as such, the \b believes that the start of the region is the start of the text. It can't "see" what comes before the start of the region. Were the transparent-bounds flag set to true, \b would see the 's' before the region-starting 'c' and know that \b can't match there.

8.4.6.4. Transparent bounds

These methods relate to the transparent-bounds flag:
The default state of a matcher's transparent-bounds flag is false, meaning that the region bounds are not transparent to "looking" constructs such as lookahead, lookbehind, and word boundaries. As such, characters that might exist beyond the edges of the region are not seen by the regex engine. This means that even though the region start might be placed within a word, \b can match at the start of the region; it does not see that a letter exists just before the region-starting letter.
This example illustrates a false (default) transparent-bounds flag:

    String regex = "\\bcar\\b"; // \bcar\b
    String text = "Madagascar is best seen by car or bike.";
    Matcher m = Pattern.compile(regex).matcher(text);
    m.region(7, text.length());
    m.find();
    System.out.println("Matches starting at character " + m.start());

It produces:

    Matches starting at character 7

indicating that a word boundary indeed matched at the start of the region, in the middle of Madagascar, despite not being a word boundary at all. The non-transparent edge of the region "spoofed" a word boundary.

However, adding:

    m.useTransparentBounds(true);

before the find call causes the example to produce:

    Matches starting at character 27

Because the bounds are now transparent, the engine can see that the character just before the start of the region, 's', is a letter, thereby forbidding \b to match there. That's why a match isn't found until later, at the word 'car' in '…by car or bike.'

Again, the transparent-bounds flag is relevant only when the region has been changed from its "all text" default. Note also that the reset method does not reset this flag.

8.4.6.5. Anchoring bounds

These methods relate to the anchoring-bounds flag:
The default state of a matcher's anchoring-bounds flag is true, meaning that the line anchors (^ \A $ \z \Z) match at the region boundaries, even if those boundaries have been moved from the start and end of the target string. Setting the flag to false means that the line anchors match only at the true ends of the target string, should the region include them.

One might turn off anchoring bounds for the same kind of reasons that transparent bounds might be turned on, such as to keep the semantics of the region in line with a user's "the cursor is not at the start of the text" expectations.

As with the transparent-bounds flag, the anchoring-bounds flag is relevant only when the region has been changed from its "all text" default. Note also that the reset method does not reset this flag.

8.4.7. Method Chaining

Consider this sequence, which prepares a matcher and sets some of its options:

    Pattern p = Pattern.compile(regex); // Compile regex.
    Matcher m = p.matcher(text);        // Associate regex with text, creating a Matcher.
    m.region(5, text.length());         // Bump start of region five characters forward.
    m.useAnchoringBounds(false);        // Don't let ^ et al. match at the region start.
    m.useTransparentBounds(true);       // Let looking constructs see across region edges.

We've seen in earlier examples that if we don't need the pattern beyond the creation of the matcher (which is often the case), we can combine the first two lines:

    Matcher m = Pattern.compile(regex).matcher(text);
    m.region(5, text.length());         // Bump start of region five characters forward.
    m.useAnchoringBounds(false);        // Don't let ^ et al. match at the region start.
    m.useTransparentBounds(true);       // Let looking constructs see across region edges.
However, because the matcher methods invoked after that are from among those that return the matcher itself, we can combine everything into one line (although presented here on two lines to fit the page):

    Matcher m = Pattern.compile(regex).matcher(text).region(5, text.length())
                       .useAnchoringBounds(false).useTransparentBounds(true);

This doesn't buy any extra functionality, but it can be quite convenient. This kind of "method chaining" can make action-by-action documentation more difficult to fit in and format neatly, but then, good documentation tends to focus on the why rather than the what, so perhaps this is not such a concern. Method chaining is used to great effect in keeping the code on page 399 clear and concise.

8.4.8. Methods for Building a Scanner

New in Java 1.5 are hitEnd and requireEnd, two matcher methods used primarily in building scanners. A scanner parses a stream of characters into a stream of tokens. For example, a scanner that's part of a compiler might accept 'var < 34' and produce the three tokens IDENTIFIER · LESS_THAN · INTEGER.

These methods help a scanner decide whether the results from the just-completed match attempt should be used to decide the proper interpretation of the current input. Generally speaking, a return value of true from either method means that more input is required before a definite decision can be made. For example, if the current input (say, characters being typed by the user in an interactive debugger) is the single character '<', it's best to wait to see whether the next character is '=' so you can properly decide whether the next token should be LESS_THAN or LESS_THAN_OR_EQUAL.

These methods will likely be of little use to the vast majority of regex-related projects, but when they're at all useful, they're invaluable. This occasional invaluableness makes it all the more lamentable that hitEnd has a bug that renders it unreliable in Java 1.5.
Luckily, it appears to have been fixed in Java 1.6, and for Java 1.5, there's an easy workaround described at the end of this section. The subject of building a scanner is quite beyond the scope of this book, so I'll limit the coverage of these specialized methods to their definitions and some illustrative examples. (By the way, if you're in need of a scanner, you might be interested in java.util.Scanner as well.)
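As an informal illustration of what these two methods report, the sketch below runs lookingAt and then reads both flags. The class name, the helper flagsAfterLookingAt, and the sample regexes are my own choices for this sketch; the results noted in the comments are what I'd expect from a Java version without the hitEnd bug.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScannerFlagsDemo {
    // Returns { matched, hitEnd, requireEnd } after a lookingAt attempt.
    static boolean[] flagsAfterLookingAt(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean matched = m.lookingAt();
        return new boolean[] { matched, m.hitEnd(), m.requireEnd() };
    }

    private static void report(String regex, String input) {
        boolean[] f = flagsAfterLookingAt(regex, input);
        System.out.println(regex + " on \"" + input + "\": matched=" + f[0]
                           + " hitEnd=" + f[1] + " requireEnd=" + f[2]);
    }

    public static void main(String[] args) {
        // The engine looks for an optional '=' and runs into the end of input,
        // so hitEnd is true: more input might lead to a longer match.
        report(">=?", ">");

        // Both characters match and the engine never looks past them,
        // so hitEnd is false.
        report(">=?", ">=");

        // The match depends on '$' matching at the end of input, so requireEnd
        // is true: more input could turn this match into a failure.
        report("\\d+$", "12");
    }
}
```

A scanner reading input incrementally could treat a true value from either flag as a cue to fetch more characters before committing to a token.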
Both hitEnd and requireEnd respect the region.

8.4.8.1. Examples illustrating hitEnd and requireEnd

Table 8-5 shows examples of hitEnd and requireEnd after a lookingAt search. Two expressions are used that, although unrealistically simple on their own, are useful in illustrating these methods.

Table 8-5. hitEnd and requireEnd after a lookingAt search
The regex in the top half of Table 8-5 looks for a non-negative integer and four comparison operators: greater than, less than, greater-than-or-equal, and less-than-or-equal. The bottom-half regex is even simpler, looking for the words set and setup. Again, these are simple examples, but illustrative.

For example, notice in test 5 that even though the entire target was matched, hitEnd remains false. The reason is that, although the last character in the target was matched, the engine never had to inspect beyond that character (to check for another character or for a boundary).

8.4.8.2. The hitEnd bug and its workaround

The "hitEnd bug" in Java 1.5 (fixed in Java 1.6) causes unreliable results from the hitEnd method in one very specific situation: when an optional, single-character regex component is attempted in case-insensitive mode (specifically, when such an attempt fails).
For example, the expression >=? in case-insensitive mode (by itself, or as part of a larger expression) tickles the bug because '=' is an optional, single-character component. Another example, a|an|the in case-insensitive mode (again, alone or as part of a larger expression) tickles the bug because the a alternative is a single character, and being one of several alternatives, is optional.

Other examples include values? and \r?\n\r?\n.

The workaround

The workaround is to remove the offending condition, either by turning off case-insensitive mode (at least for the offending subexpression), or by replacing the single character with something else, such as a character class.

Using the first approach, >=? might become (?-i:>=?), which uses a mode-modified span (see page 110) to ensure that insensitivity does not apply to the subexpression (which doesn't benefit from case insensitivity to begin with, so the workaround is "free" in this case).

Using the second approach, a|an|the becomes [aA]|an|the, which preserves any case insensitivity applied via the Pattern.CASE_INSENSITIVE flag.
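As a sanity check that the two rewritten expressions still accept what the originals were meant to accept, here is a small sketch. It does not reproduce the Java 1.5 bug itself; it only confirms matching behavior, and the class, constant, and method names are hypothetical.

```java
import java.util.regex.Pattern;

class HitEndWorkaroundDemo {
    // Second approach: a character class stands in for the lone 'a',
    // while the flag still makes 'an' and 'the' case-insensitive.
    static final Pattern ARTICLE =
        Pattern.compile("[aA]|an|the", Pattern.CASE_INSENSITIVE);

    // First approach: case insensitivity is switched off just for the '>=?' span.
    static final Pattern COMPARE =
        Pattern.compile("(?-i:>=?)", Pattern.CASE_INSENSITIVE);

    // True only if the pattern matches every input in its entirety.
    static boolean matchesAll(Pattern p, String... inputs) {
        for (String s : inputs) {
            if (!p.matcher(s).matches())
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesAll(ARTICLE, "a", "A", "AN", "The")); // true
        System.out.println(matchesAll(COMPARE, ">", ">="));             // true
    }
}
```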