Regular Expressions

   


Regular expressions are used to specify string patterns. You can use regular expressions whenever you need to locate strings that match a particular pattern. For example, one of our sample programs locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">.

Of course, for specifying a pattern, the ... notation is not precise enough. You need to specify precisely what sequence of characters is a legal match. You need to use a special syntax whenever you describe a pattern.

Here is a simple example. The regular expression

 [Jj]ava.+ 

matches any string of the following form:

  • The first letter is a J or j.

  • The next three letters are ava.

  • The remainder of the string consists of one or more arbitrary characters.

For example, the string "javanese" matches this regular expression, but the string "Core Java" does not.

As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs are sufficient.

  • A character class is a set of character alternatives, enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Here the - denotes a range (all characters whose Unicode value falls between the two bounds), and ^ denotes the complement (all characters except the ones specified).

  • There are many predefined character classes such as \d (digits) or \p{Sc} (Unicode currency symbol). See Tables 12-8 and 12-9.

  • Most characters match themselves, such as the ava characters in the example above.

  • The . symbol matches any character (except possibly line terminators, depending on flag settings).

  • Use \ as an escape character, for example \. matches a period and \\ matches a backslash.

  • ^ and $ match the beginning and end of the input, respectively (or the beginning and end of each line in multiline mode).

  • If X and Y are regular expressions, then XY means "any match for X followed by a match for Y". X | Y means "any match for X or Y".

  • You can apply quantifiers X+ (1 or more), X* (0 or more), and X? (0 or 1) to an expression X.

  • By default, a quantifier matches the largest possible repetition that makes the overall match succeed (a greedy match). You can modify that behavior with the suffixes ? (reluctant or stingy match: match the smallest repetition count) and + (possessive match: match the largest count even if that makes the overall match fail).

    For example, the string cab matches [a-z]*ab but not [a-z]*+ab. In the first case, the expression [a-z]* gives back characters until it matches only the character c, so that the characters ab match the remainder of the pattern. But the possessive version [a-z]*+ matches the characters cab and gives nothing back, leaving the remainder of the pattern unmatched.

  • You can use groups to define subexpressions. Enclose the groups in ( ), for example ([+-]?)([0-9]+). You can then ask the pattern matcher to return the match of each group or to refer back to a group with \n, where n is the group number (starting with \1).
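The difference between the default, reluctant, and possessive quantifiers is easiest to see by running them. The following is a minimal sketch, not part of the book's sample programs: the class name QuantifierDemo, its helper methods, and the sample input <b>bold</b> are made up for illustration. It uses only the Pattern and Matcher classes of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo
{
   // Returns true if the entire input matches the regular expression.
   public static boolean matchesWhole(String regex, String input)
   {
      return Pattern.matches(regex, input);
   }

   // Returns the first substring that the regular expression finds, or null if there is none.
   public static String firstMatch(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      return m.find() ? m.group() : null;
   }

   public static void main(String[] args)
   {
      // Default (greedy): [a-z]* first takes "cab", then backs off so that "ab" can match.
      System.out.println(matchesWhole("[a-z]*ab", "cab"));      // true
      // Possessive: [a-z]*+ takes "cab" and never gives characters back.
      System.out.println(matchesWhole("[a-z]*+ab", "cab"));     // false
      // Greedy versus reluctant on a made-up sample input
      System.out.println(firstMatch("<.+>", "<b>bold</b>"));    // <b>bold</b>
      System.out.println(firstMatch("<.+?>", "<b>bold</b>"));   // <b>
   }
}
```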

For example, here is a somewhat complex but potentially useful regular expression. It describes decimal or hexadecimal integers:

 [+-]?[0-9]+|0[Xx][0-9A-Fa-f]+ 
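To get a feeling for what this expression accepts, you can try it on a few strings. The sketch below is only an illustration: the class name IntegerPatternDemo, the method isInteger, and the sample strings are assumptions, not part of the book's code. Note that the sign is part of the decimal alternative only, so a signed hexadecimal string is rejected.

```java
import java.util.regex.Pattern;

public class IntegerPatternDemo
{
   // The pattern from the text: a signed decimal integer or an unsigned hexadecimal integer
   private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

   // Returns true if the entire string is a decimal or hexadecimal integer.
   public static boolean isInteger(String s)
   {
      return INTEGER.matcher(s).matches();
   }

   public static void main(String[] args)
   {
      // Made-up sample strings
      String[] samples = { "42", "-17", "0x1F", "0XCAFE", "12ab", "+0x1F", "" };
      for (String s : samples)
         System.out.println("\"" + s + "\": " + isInteger(s));
      // The first four strings are accepted; "12ab", "+0x1F", and "" are rejected.
   }
}
```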

Unfortunately, the expression syntax is not completely standardized between the various programs and libraries that use regular expressions. While there is consensus on the basic constructs, there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-8 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly and Associates, 1997).

Table 12-8. Regular Expression Syntax

Syntax                              Explanation

Characters
  c                                 The character c
  \unnnn, \xnn, \0n, \0nn, \0nnn    The code unit with the given hex or octal value
  \t, \n, \r, \f, \a, \e            The control characters tab, newline, return, form feed, alert, and escape
  \cc                               The control character corresponding to the character c

Character Classes
  [C1C2. . .]                       Any of the characters represented by C1, C2, . . . The Ci are characters, character ranges (c1-c2), or character classes
  [^. . .]                          Complement of character class
  [ . . . && . . .]                 Intersection of two character classes

Predefined Character Classes
  .                                 Any character except line terminators (or any character if the DOTALL flag is set)
  \d                                A digit [0-9]
  \D                                A nondigit [^0-9]
  \s                                A whitespace character [ \t\n\r\f\x0B]
  \S                                A non-whitespace character
  \w                                A word character [a-zA-Z0-9_]
  \W                                A nonword character
  \p{name}                          A named character class; see Table 12-9
  \P{name}                          The complement of a named character class

Boundary Matchers
  ^ $                               Beginning, end of input (or beginning, end of line in multiline mode)
  \b                                A word boundary
  \B                                A nonword boundary
  \A                                Beginning of input
  \z                                End of input
  \Z                                End of input except final line terminator
  \G                                End of previous match

Quantifiers
  X?                                Optional X
  X*                                X, 0 or more times
  X+                                X, 1 or more times
  X{n} X{n,} X{n,m}                 X n times, at least n times, between n and m times

Quantifier Suffixes
  ?                                 Turn default (greedy) match into reluctant match
  +                                 Turn default (greedy) match into possessive match

Set Operations
  XY                                Any string from X, followed by any string from Y
  X|Y                               Any string from X or Y

Grouping
  (X)                               Capture the string matching X as a group
  \n                                The match of the nth group

Escapes
  \c                                The character c (must not be an alphabetic character)
  \Q . . . \E                       Quote . . . verbatim
  (? . . . )                        Special construct; see API notes of Pattern class
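
A few of the less familiar constructs in Table 12-8 can be checked with the static Pattern.matches method of java.util.regex.Pattern, which tests whether an entire string matches an expression. This is a hedged sketch: the class name SyntaxDemo and all sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

public class SyntaxDemo
{
   // Returns true if the entire input matches the regular expression.
   public static boolean matches(String regex, String input)
   {
      return Pattern.matches(regex, input);
   }

   public static void main(String[] args)
   {
      // Intersection: lowercase letters that are not vowels
      System.out.println(matches("[a-z&&[^aeiou]]+", "rhythm"));   // true
      System.out.println(matches("[a-z&&[^aeiou]]+", "rhyme"));    // false
      // Quoting: between \Q and \E the + loses its special meaning
      System.out.println(matches("\\Q1+1\\E=2", "1+1=2"));         // true
      // Bounded quantifier: two to four digits
      System.out.println(matches("\\d{2,4}", "12345"));            // false
      // Back reference: \1 must repeat whatever group 1 matched
      System.out.println(matches("(\\w+) \\1", "hello hello"));    // true
      System.out.println(matches("(\\w+) \\1", "hello world"));    // false
   }
}
```

Because the expressions are written as Java string literals, every backslash of the regular expression syntax is doubled.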


The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string denoting the regular expression. Then get a Matcher object from the pattern, and call its matches method:

 Pattern pattern = Pattern.compile(patternString);
 Matcher matcher = pattern.matcher(input);
 if (matcher.matches()) . . .
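
As a complete, runnable version of this test, the following sketch applies the [Jj]ava.+ expression from the beginning of this section to a few strings. The class name MatchesDemo and the method matchesWhole are made up for this illustration. The input parameter is declared as a CharSequence, so a StringBuilder works as well as a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesDemo
{
   // Compiles the pattern and tests whether the entire input matches it.
   public static boolean matchesWhole(String patternString, CharSequence input)
   {
      Pattern pattern = Pattern.compile(patternString);
      Matcher matcher = pattern.matcher(input);
      return matcher.matches();
   }

   public static void main(String[] args)
   {
      String patternString = "[Jj]ava.+";
      System.out.println(matchesWhole(patternString, "javanese"));    // true
      System.out.println(matchesWhole(patternString, "Core Java"));   // false
      // .+ requires at least one character after "Java"
      System.out.println(matchesWhole(patternString, "Java"));        // false
      System.out.println(matchesWhole(patternString, new StringBuilder("JavaScript")));   // true
   }
}
```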

Table 12-9. Predefined Character Class Names

Name                      Explanation

  Lower                   ASCII lower case [a-z]
  Upper                   ASCII upper case [A-Z]
  Alpha                   ASCII alphabetic [A-Za-z]
  Digit                   ASCII digits [0-9]
  Alnum                   ASCII alphabetic or digit [A-Za-z0-9]
  XDigit                  Hex digits [0-9A-Fa-f]
  Graph                   Visible ASCII character [\x21-\x7E]
  Print                   Printable ASCII character (visible character or space) [\x20-\x7E]
  Punct                   ASCII non-alpha or digit [\p{Graph}&&\P{Alnum}]
  ASCII                   All ASCII [\x00-\x7F]
  Cntrl                   ASCII control character [\x00-\x1F\x7F]
  Blank                   Space or tab [ \t]
  Space                   Whitespace [ \t\n\r\f\x0B]
  javaLowerCase           Lower case, as determined by Character.isLowerCase()
  javaUpperCase           Upper case, as determined by Character.isUpperCase()
  javaWhitespace          Whitespace, as determined by Character.isWhitespace()
  javaMirrored            Mirrored, as determined by Character.isMirrored()
  InBlock                 Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. See http://www.unicode.org for a list of block names.
  Category or IsCategory  Category is the name of a Unicode character category such as L (letter) or Sc (currency symbol). See http://www.unicode.org for a list of category names.
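
To see how the names in Table 12-9 behave, you can test single characters against \p{name} and \P{name}. The following sketch is an illustration only: the class name NamedClassDemo is made up, and the sample characters (the euro sign, a Greek alpha, and an a with diaeresis, all written as Unicode escapes) are assumptions chosen for the demonstration.

```java
import java.util.regex.Pattern;

public class NamedClassDemo
{
   // Returns true if the entire input matches the regular expression.
   public static boolean matches(String regex, String input)
   {
      return Pattern.matches(regex, input);
   }

   public static void main(String[] args)
   {
      String euro = "\u20AC";         // euro sign
      String alpha = "\u03B1";        // Greek small letter alpha
      String aDiaeresis = "\u00E4";   // small letter a with diaeresis

      System.out.println(matches("\\p{Sc}", euro));                   // true: a currency symbol
      System.out.println(matches("\\p{InGreek}", alpha));             // true: in the Greek block
      System.out.println(matches("\\p{Lower}", aDiaeresis));          // false: Lower is ASCII only
      System.out.println(matches("\\p{javaLowerCase}", aDiaeresis));  // true
      System.out.println(matches("\\p{Punct}+", "?!"));               // true
      System.out.println(matches("\\P{Alpha}", "7"));                 // true: 7 is not alphabetic
   }
}
```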


The input of the matcher is an object of any class that implements the CharSequence interface, such as a String, StringBuilder, or CharBuffer.

When compiling the pattern, you can set one or more flags, for example,

 Pattern pattern = Pattern.compile(patternString,
    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The following six flags are supported:

  • CASE_INSENSITIVE: Match characters independently of the letter case. By default, this flag takes only US ASCII characters into account.

  • UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.

  • MULTILINE: ^ and $ match the beginning and end of a line, not the entire input.

  • UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.

  • DOTALL: When using this flag, the . symbol matches all characters, including line terminators.

  • CANON_EQ: Takes canonical equivalence of Unicode characters into account. For example, u followed by ¨ (diaeresis) matches ü.
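
The effect of several of these flags can be observed with a small experiment. This is a minimal sketch under stated assumptions: the class name FlagDemo, the helper method found, and the sample inputs are invented for illustration. A flags value of 0 means that no flag is set.

```java
import java.util.regex.Pattern;

public class FlagDemo
{
   // Returns true if the regular expression, compiled with the given flags, occurs in the input.
   public static boolean found(String regex, int flags, String input)
   {
      return Pattern.compile(regex, flags).matcher(input).find();
   }

   public static void main(String[] args)
   {
      String lower = "\u00E4";   // small letter a with diaeresis
      String upper = "\u00C4";   // capital letter A with diaeresis

      // CASE_INSENSITIVE alone only folds US ASCII letters
      System.out.println(found(lower, Pattern.CASE_INSENSITIVE, upper));                          // false
      System.out.println(found(lower, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));   // true

      // MULTILINE lets ^ and $ match at line boundaries inside the input
      System.out.println(found("^b$", 0, "a\nb\nc"));                   // false
      System.out.println(found("^b$", Pattern.MULTILINE, "a\nb\nc"));   // true

      // DOTALL lets . match a line terminator
      System.out.println(found("a.c", 0, "a\nc"));                // false
      System.out.println(found("a.c", Pattern.DOTALL, "a\nc"));   // true
   }
}
```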

If the regular expression contains groups, then the Matcher object can reveal the group boundaries. The methods

 int start(int groupIndex)
 int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group.

You can simply extract the matched string by calling

 String group(int groupIndex) 

Group 0 is the entire match; the group index for the first actual group is 1. Call the groupCount method to get the total group count.

Nested groups are ordered by the opening parentheses. For example, given the pattern

 ((1?[0-9]):([0-5][0-9]))[ap]m 

and the input

 11:59am 

the matcher reports the following groups:

Group Index   Start   End   String

  0           0       7     11:59am
  1           0       5     11:59
  2           0       2     11
  3           3       5     59
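
You can reproduce these values with a few lines of code. The sketch below is an illustration, not one of the book's sample programs; the class name GroupDemo and the method describeGroups are made up. It calls matches first, because the group methods can only be used after a successful match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo
{
   // Describes every group of a successful whole-input match as "index: start-end string".
   // Returns an empty array if the input does not match.
   public static String[] describeGroups(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      if (!matcher.matches()) return new String[0];
      String[] result = new String[matcher.groupCount() + 1];
      for (int i = 0; i <= matcher.groupCount(); i++)
         result[i] = i + ": " + matcher.start(i) + "-" + matcher.end(i) + " " + matcher.group(i);
      return result;
   }

   public static void main(String[] args)
   {
      for (String line : describeGroups("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"))
         System.out.println(line);
      // prints
      // 0: 0-7 11:59am
      // 1: 0-5 11:59
      // 2: 0-2 11
      // 3: 3-5 59
   }
}
```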


Example 12-9 prompts for a pattern, then for strings to match. It prints out whether or not the input matches the pattern. If the input matches and the pattern contains groups, then the program prints the group boundaries as parentheses, such as

 ((11):(59))am 

Example 12-9. RegExTest.java

 import java.util.*;
 import java.util.regex.*;

 /**
    This program tests regular expression matching.
    Enter a pattern and strings to match, or enter an empty
    line to exit. If the pattern contains groups, the group
    boundaries are displayed in the match.
 */
 public class RegExTest
 {
    public static void main(String[] args)
    {
       Scanner in = new Scanner(System.in);
       System.out.println("Enter pattern: ");
       String patternString = in.nextLine();

       Pattern pattern = null;
       try
       {
          pattern = Pattern.compile(patternString);
       }
       catch (PatternSyntaxException e)
       {
          System.out.println("Pattern syntax error");
          System.exit(1);
       }

       while (true)
       {
          System.out.println("Enter string to match: ");
          // stop at the end of the input or at an empty line
          if (!in.hasNextLine()) return;
          String input = in.nextLine();
          if (input.equals("")) return;
          Matcher matcher = pattern.matcher(input);
          if (matcher.matches())
          {
             System.out.println("Match");
             int g = matcher.groupCount();
             if (g > 0)
             {
                for (int i = 0; i < input.length(); i++)
                {
                   // print ( for each group that starts at position i
                   for (int j = 1; j <= g; j++)
                      if (i == matcher.start(j))
                         System.out.print('(');
                   System.out.print(input.charAt(i));
                   // print ) for each group that ends after position i
                   for (int j = 1; j <= g; j++)
                      if (i + 1 == matcher.end(j))
                         System.out.print(')');
                }
                System.out.println();
             }
          }
          else
             System.out.println("No match");
       }
    }
 }

Usually, you don't want to match the entire input against a regular expression. Instead, you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.

 while (matcher.find())
 {
    int start = matcher.start();
    int end = matcher.end();
    String match = input.substring(start, end);
    . . .
 }
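
Here is a small self-contained version of this loop that collects all matches in a list. It is a sketch for illustration: the class name FindDemo, the method findAll, and the sample sentence are assumptions. Since the input is declared as a CharSequence, the sketch uses subSequence instead of substring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo
{
   // Returns all non-overlapping substrings of the input that match the regular expression.
   public static List<String> findAll(String regex, CharSequence input)
   {
      List<String> matches = new ArrayList<>();
      Matcher matcher = Pattern.compile(regex).matcher(input);
      while (matcher.find())
      {
         int start = matcher.start();
         int end = matcher.end();
         matches.add(input.subSequence(start, end).toString());
      }
      return matches;
   }

   public static void main(String[] args)
   {
      // Made-up sample input
      System.out.println(findAll("[0-9]+", "Room 101, floor 3, building 42"));
      // prints [101, 3, 42]
   }
}
```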

Example 12-10 puts this mechanism to work. It locates all hypertext references in a web page and prints them. To run the program, supply a URL on the command line, such as

 java HrefMatch http://www.horstmann.com 

Example 12-10. HrefMatch.java

 import java.io.*;
 import java.net.*;
 import java.util.regex.*;

 /**
    This program displays all URLs in a web page by
    matching a regular expression that describes the
    <a href=...> HTML tag. Start the program as
    java HrefMatch URL
 */
 public class HrefMatch
 {
    public static void main(String[] args)
    {
       try
       {
          // get URL string from command line or use default
          String urlString;
          if (args.length > 0) urlString = args[0];
          else urlString = "http://java.sun.com";

          // open reader for URL
          InputStreamReader in = new InputStreamReader(new URL(urlString).openStream());

          // read contents into string builder
          StringBuilder input = new StringBuilder();
          int ch;
          while ((ch = in.read()) != -1) input.append((char) ch);
          in.close();

          // search for all occurrences of pattern; the href value is either
          // a quoted string or a run of characters other than whitespace and >
          String patternString = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
          Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
          Matcher matcher = pattern.matcher(input);

          while (matcher.find())
          {
             int start = matcher.start();
             int end = matcher.end();
             String match = input.substring(start, end);
             System.out.println(match);
          }
       }
       catch (IOException e)
       {
          e.printStackTrace();
       }
       catch (PatternSyntaxException e)
       {
          e.printStackTrace();
       }
    }
 }

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.

 Pattern pattern = Pattern.compile("[0-9]+");
 Matcher matcher = pattern.matcher(input);
 String output = matcher.replaceAll("#");

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a $ character in the replacement text.

The replaceFirst method replaces only the first occurrence of the pattern.
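
The following sketch shows both replacement methods together with a group reference and an escaped $ in the replacement string. The class name ReplaceDemo, its two helper methods, and the sample strings are made up for illustration. In a Java string literal, the escape \$ is written as "\\$".

```java
import java.util.regex.Pattern;

public class ReplaceDemo
{
   // Replaces every match of the regular expression in the input.
   public static String replaceAll(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceAll(replacement);
   }

   // Replaces only the first match of the regular expression in the input.
   public static String replaceFirst(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
   }

   public static void main(String[] args)
   {
      System.out.println(replaceAll("[0-9]+", "a1b22c333", "#"));     // a#b#c#
      System.out.println(replaceFirst("[0-9]+", "a1b22c333", "#"));   // a#b22c333
      // $1 and $2 refer to the first and second group
      System.out.println(replaceAll("(\\w+), (\\w+)", "Horstmann, Cay", "$2 $1"));   // Cay Horstmann
      // \$ inserts a literal $ sign; $0 is the entire match
      System.out.println(replaceAll("[0-9]+", "cost 5", "\\$$0"));    // cost $5
   }
}
```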

Finally, the Pattern class has a split method that works like a string tokenizer on steroids. It splits an input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional whitespace.

 Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
 String[] tokens = pattern.split(input);
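
To try this out, you might run a sketch such as the following. The class name SplitDemo, the helper method split, and the sample inputs are assumptions made for illustration. The helper always passes a limit argument to the two-argument split method of the Pattern class; a limit of 0 behaves like the one-argument version.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo
{
   // Splits the input around matches of the regular expression, using the given limit.
   public static String[] split(String regex, String input, int limit)
   {
      return Pattern.compile(regex).split(input, limit);
   }

   public static void main(String[] args)
   {
      String delimiters = "\\s*\\p{Punct}\\s*";
      String input = "one, two ;three.four";   // made-up sample input

      System.out.println(Arrays.toString(split(delimiters, input, 0)));   // [one, two, three, four]
      // With limit 2, the second entry holds the remaining unsplit input
      System.out.println(Arrays.toString(split(delimiters, input, 2)));   // [one, two ;three.four]
      // Limit 0 drops trailing empty strings, a negative limit keeps them
      System.out.println(split(",", "a,b,,", 0).length);    // 2
      System.out.println(split(",", "a,b,,", -1).length);   // 4
   }
}
```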


 java.util.regex.Pattern 1.4 

  • static Pattern compile(String expression)

  • static Pattern compile(String expression, int flags)

    compile the regular expression string into a pattern object for fast processing of matches.

    Parameters:
      expression   The regular expression
      flags        One or more of the flags CASE_INSENSITIVE, UNICODE_CASE, MULTILINE, UNIX_LINES, DOTALL, and CANON_EQ


  • Matcher matcher(CharSequence input)

    returns a matcher object that you can use to locate the matches of the pattern in the input.

  • String[] split(CharSequence input)

  • String[] split(CharSequence input, int limit)

    split the input string into tokens, where the pattern specifies the form of the delimiters, and return an array of tokens. The delimiters are not part of the tokens.

    Parameters:
      input   The string to be split into tokens
      limit   The maximum number of strings to produce. If limit - 1 matching delimiters have been found, then the last entry of the returned array contains the remaining unsplit input. If limit is ≤ 0, then the entire input is split. If limit is 0, then trailing empty strings are not placed in the returned array.



 java.util.regex.Matcher 1.4 

  • boolean matches()

    returns true if the input matches the pattern.

  • boolean lookingAt()

    returns true if the beginning of the input matches the pattern.

  • boolean find()

  • boolean find(int start)

    attempt to find the next match and return true if another match is found.

    Parameters:
      start   The index at which to start searching


  • int start()

  • int end()

    return the start and past-the-end position of the current match.

  • String group()

    returns the current match.

  • int groupCount()

    returns the number of groups in the input pattern.

  • int start(int groupIndex)

  • int end(int groupIndex)

    return the start and past-the-end position of a given group in the current match.

    Parameters:
      groupIndex   The group index (starting with 1), or 0 to indicate the entire match


  • String group(int groupIndex)

    returns the string matching a given group.

    Parameters:
      groupIndex   The group index (starting with 1), or 0 to indicate the entire match


  • String replaceAll(String replacement)

  • String replaceFirst(String replacement)

    return a string obtained from the matcher input by replacing all matches, or the first match, with the replacement string.

    Parameters:
      replacement   The replacement string. It can contain references to a pattern group as $n. Use \$ to include a $ symbol.


  • Matcher reset()

  • Matcher reset(CharSequence input)

    reset the matcher state. The second method makes the matcher work on a different input. Both methods return this.
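
The difference between matches, lookingAt, and find, and the use of reset, can be summarized in a short experiment. This is a sketch under stated assumptions: the class name MatcherMethodsDemo, the methods probe and firstMatches, and the sample inputs are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo
{
   // Returns the results of matches, lookingAt, and find (in that order) for the given input.
   public static boolean[] probe(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      boolean whole = matcher.matches();      // the entire input must match
      boolean prefix = matcher.lookingAt();   // only the beginning must match
      matcher.reset();                        // start the search from the beginning again
      boolean anywhere = matcher.find();      // a match anywhere is enough
      return new boolean[] { whole, prefix, anywhere };
   }

   // Reuses a single matcher for several inputs and collects the first match in each input.
   public static List<String> firstMatches(String regex, String... inputs)
   {
      List<String> result = new ArrayList<>();
      Matcher matcher = Pattern.compile(regex).matcher("");
      for (String input : inputs)
      {
         matcher.reset(input);   // make the matcher work on a different input
         result.add(matcher.find() ? matcher.group() : "(none)");
      }
      return result;
   }

   public static void main(String[] args)
   {
      System.out.println(Arrays.toString(probe("[0-9]+", "123")));      // [true, true, true]
      System.out.println(Arrays.toString(probe("[0-9]+", "123abc")));   // [false, true, true]
      System.out.println(Arrays.toString(probe("[0-9]+", "abc123")));   // [false, false, true]
      System.out.println(firstMatches("[0-9]+", "room 101", "no digits"));   // [101, (none)]
   }
}
```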


       



    Core Java(TM) 2, Volume I--Fundamentals (7th Edition) (Core Series)
    ISBN: 0131482025
    Year: 2003
    Pages: 132
