Recipe 4.3 Finding the Matching Text


Problem

You need to find the text that the regex matched.

Solution

Sometimes you need to know more than just whether a regex matched a string. In editors and many other tools, you want to know exactly what characters were matched. Remember that with multipliers such as * , the length of the text that was matched may have no relationship to the length of the pattern that matched it. Do not underestimate the mighty .* , which happily matches thousands or millions of characters if allowed to. As you saw in the previous recipe, you can find out whether a given match succeeds just by using find( ) or matches( ). But in other applications, you will want to get the characters that the pattern matched.
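To make the point about multipliers concrete, here is a minimal sketch showing a three-character pattern matching a much longer run of text. The class name, helper method, and sample string are illustrative assumptions, not part of the book's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are assumptions for illustration.
class GreedyLengthDemo {

    /** Returns the text of the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = "q.*";                 // the pattern is only three characters long
        String input = "The quick brown fox";
        String matched = firstMatch(regex, input);
        System.out.println("Pattern length: " + regex.length());
        System.out.println("Match length:   " + matched.length());
        System.out.println("Matched text:   \"" + matched + "\"");
    }
}
```

The greedy .* consumes everything from the q to the end of the line, so the match is far longer than the pattern that produced it.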

After a successful call to one of the above methods, you can use these "information" methods to get information on the match:


start( ), end( )

Return offsets into the string: start( ) is the index of the first character that matched, and end( ) is the index just past the last character that matched.


groupCount( )

Returns the number of parenthesized capture groups if any; returns 0 if no groups were used.


group(int i)

Returns the characters matched by group i of the current match, if i is less than or equal to the return value of groupCount( ). Group 0 is the entire match, so group(0) (or just group( )) returns the entire portion of the string that matched.
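The information methods listed above can be exercised together in a short sketch like the following. The class name, the describe helper, and the phone-number input are hypothetical; only the Matcher calls themselves come from the standard API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the helper and sample input are assumptions for illustration.
class MatchInfoDemo {

    /** Summarizes the first match: its offsets, the group count, and every group's text. */
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            // The information methods throw IllegalStateException without a successful match
            return "NO MATCH";
        }
        StringBuilder sb = new StringBuilder();
        sb.append("start=").append(m.start());
        sb.append(" end=").append(m.end());          // offset just past the last matched char
        sb.append(" groupCount=").append(m.groupCount());
        // Group 0 is the whole match; groups 1..groupCount() are the parenthesized ones
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(" group(").append(i).append(")=").append(m.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("(\\d+)-(\\d+)", "Call 555-1212 now"));
    }
}
```

Note that groupCount( ) does not count group 0, which is why the loop runs from 0 through groupCount( ) inclusive.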

The notion of parentheses or "capture groups" is central to regex processing. Groups may be nested to any depth. The group(int) method lets you retrieve the characters that matched a given parenthesis group. If you haven't used any explicit parens, you can just treat whatever matched as "level zero." For example:

// Part of REmatch.java
String patt = "Q[^u]\\d+\\.";
Pattern r = Pattern.compile(patt);
String line = "Order QT300. Now!";
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println(patt + " matches \"" +
        m.group(0) +
        "\" in \"" + line + "\"");
} else {
    System.out.println("NO MATCH");
}

When run, this prints:

Q[^u]\d+\. matches "QT300." in "Order QT300. Now!"
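When groups are nested, they are numbered by the position of their opening parenthesis, counting from the left. The sketch below adds parentheses to the same pattern to show this; the extra grouping, the class name, and the groups helper are assumptions added for illustration and are not part of REmatch.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the parenthesized variant of the pattern is an assumption.
class NestedGroupsDemo {

    /** Returns group(0)..group(groupCount()) of the first match, or an empty array. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 1 wraps groups 2 and 3; the final \. stays outside every explicit group
        String patt = "(Q([^u])(\\d+))\\.";
        String[] g = groups(patt, "Order QT300. Now!");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + ") = \"" + g[i] + "\"");
        }
    }
}
```

Here group 0 is still the entire match, including the trailing period, while groups 1 through 3 are the outer group and the two groups nested inside it.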

An extended version of the REDemo program presented in Recipe 4.2, called REDemo2, provides a display of all the capture groups in a given regex; one example is shown in Figure 4-3.

Figure 4-3. REDemo2 in action


It is also possible to get the starting and ending indexes and the length of the text that the pattern matched (remember that terms with multipliers, such as the \d+ in this example, can match an arbitrary number of characters in the string). You can use these in conjunction with the String.substring( ) methods as follows:

// Part of regexsubstr.java -- Prints exactly the same as REmatch.java
String patt = "Q[^u]\\d+\\.";    // same pattern as in REmatch.java
Pattern r = Pattern.compile(patt);
String line = "Order QT300. Now!";
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println(patt + " matches \"" +
        line.substring(m.start(0), m.end(0)) +
        "\" in \"" + line + "\"");
} else {
    System.out.println("NO MATCH");
}
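The same offsets can be collected for every match in a line by calling find( ) repeatedly. The following is a minimal sketch under assumed names: the class, the findAll helper, and the two-order input line are hypothetical, while the pattern is the one used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the helper and sample input are assumptions for illustration.
class MatchOffsetsDemo {

    /** Lists every match as "start-end len=N text", using substring() on the offsets. */
    static List<String> findAll(String regex, String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {                       // each call resumes after the previous match
            int start = m.start();
            int end = m.end();                   // exclusive, so it suits substring() directly
            String text = line.substring(start, end);   // same characters as m.group()
            found.add(start + "-" + end + " len=" + (end - start) + " " + text);
        }
        return found;
    }

    public static void main(String[] args) {
        String patt = "Q[^u]\\d+\\.";
        for (String s : findAll(patt, "Order QT300. Ship QX7. Now!")) {
            System.out.println(s);
        }
    }
}
```

Because end( ) is the index just past the match, end( ) minus start( ) gives the match length without any off-by-one adjustment.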

Suppose you need to extract several items from a string. If the input is:

Smith, John
Adams, John Quincy

and you want to get out:

John Smith
John Quincy Adams

just use:

// from REmatchTwoFields.java
// Construct a regex with parens to "grab" both field1 and field2
Pattern r = Pattern.compile("(.*), (.*)");
Matcher m = r.matcher(inputLine);
if (!m.matches( ))
    throw new IllegalArgumentException("Bad input: " + inputLine);
System.out.println(m.group(2) + ' ' + m.group(1));
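For readers who want to run the two-field extraction end to end, here is one possible self-contained wrapper. The class name, the swapNames method, and the hardcoded input array are assumptions; the book's fragment reads inputLine from elsewhere.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the two-field fragment; names are assumptions.
class TwoFieldsDemo {

    // Compiled once and reused: group 1 is the text before ", ", group 2 the text after
    private static final Pattern NAME = Pattern.compile("(.*), (.*)");

    /** Turns "Last, First" into "First Last"; rejects lines without a ", " separator. */
    static String swapNames(String inputLine) {
        Matcher m = NAME.matcher(inputLine);
        if (!m.matches()) {                      // matches() requires the whole line to fit
            throw new IllegalArgumentException("Bad input: " + inputLine);
        }
        return m.group(2) + ' ' + m.group(1);
    }

    public static void main(String[] args) {
        String[] lines = { "Smith, John", "Adams, John Quincy" };
        for (String line : lines) {
            System.out.println(swapNames(line));
        }
    }
}
```

One caveat with this pattern: the first .* is greedy, so a line containing more than one ", " is split at the last one, which may or may not be what a given data format needs.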



Java Cookbook
Java Cookbook, Second Edition
ISBN: 0596007019
EAN: 2147483647
Year: 2003
Pages: 409
Authors: Ian F Darwin
