Regular expressions are used to specify string patterns. You can use regular expressions whenever you need to locate strings that match a particular pattern. For example, one of our sample programs locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">.

Of course, for specifying a pattern, the ... notation is not precise enough. You need to specify precisely what sequence of characters is a legal match, and you need a special syntax to describe the pattern.

Here is a simple example. The regular expression [Jj]ava.+ matches any string of the following form: the first letter is a J or j, the next three letters are ava, and the remainder of the string consists of one or more arbitrary characters.
For example, the string "javanese" matches this regular expression, but the string "Core Java" does not. As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs is sufficient.
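The two sample strings can be checked with a few lines of code. The following is only a minimal sketch: the class name JavaPatternDemo and the helper matchesWhole are made up for illustration, and it relies on the Pattern and Matcher classes that are discussed later in this section.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern and the sample strings
// "javanese" and "Core Java" come from the text.
class JavaPatternDemo
{
   // [Jj]ava.+ : a J or j, then "ava", then one or more arbitrary characters
   static final Pattern JAVA_PATTERN = Pattern.compile("[Jj]ava.+");

   // Returns true if the entire string has the form described by the pattern.
   static boolean matchesWhole(String s)
   {
      return JAVA_PATTERN.matcher(s).matches();
   }

   public static void main(String[] args)
   {
      System.out.println(matchesWhole("javanese"));  // true
      System.out.println(matchesWhole("Core Java")); // false: does not start with J or j
      System.out.println(matchesWhole("Java"));      // false: .+ needs at least one more character
   }
}
```

The third call shows a detail that is easy to miss: because .+ requires at least one character, the bare word "Java" is not a match.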
For example, here is a somewhat complex but potentially useful regular expression that describes decimal or hexadecimal integers:

    [+-]?[0-9]+|0[Xx][0-9A-Fa-f]+

Unfortunately, the expression syntax is not completely standardized between the various programs and libraries that use regular expressions. While there is consensus on the basic constructs, there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-8 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly and Associates, 1997).
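To make the integer pattern above more concrete, here is a small sketch that tries it against a handful of strings. The class name IntegerPatternDemo, the helper isInteger, and the sample strings are illustrative assumptions, not part of the original examples. Note that | has the lowest precedence, so the optional sign belongs only to the decimal alternative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the text.
class IntegerPatternDemo
{
   // An optionally signed decimal integer,
   // or a hexadecimal integer with a 0x or 0X prefix
   static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

   // Returns true if the entire string is a decimal or hexadecimal integer.
   static boolean isInteger(String s)
   {
      return INTEGER.matcher(s).matches();
   }

   public static void main(String[] args)
   {
      String[] samples = { "42", "-17", "0x1F", "0XCAFE", "1F", "-0x1F", "0x" };
      for (String s : samples)
         System.out.println(s + " -> " + isInteger(s));
      // 42, -17, 0x1F, and 0XCAFE match.
      // 1F, -0x1F, and 0x do not: the sign is only part of the decimal
      // alternative, and a hexadecimal prefix needs at least one digit.
   }
}
```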
The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string denoting the regular expression. Then get a Matcher object from the pattern, and call its matches method:

    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(input);
    if (matcher.matches()) . . .
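The three steps above can be wrapped into a complete program. This is a sketch under stated assumptions: the class name MatchTestDemo, the helper matchesEntirely, and the digit pattern are invented for illustration. It also shows that a compiled Pattern can be reused for several inputs, and that matches succeeds only when the whole input fits the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the compile / matcher / matches sequence.
class MatchTestDemo
{
   // Returns true only if the pattern matches the entire input.
   static boolean matchesEntirely(Pattern pattern, String input)
   {
      Matcher matcher = pattern.matcher(input);
      return matcher.matches();
   }

   public static void main(String[] args)
   {
      // Compile once, then reuse the pattern for several inputs
      Pattern digits = Pattern.compile("[0-9]+");
      System.out.println(matchesEntirely(digits, "2024"));    // true
      System.out.println(matchesEntirely(digits, "2024-01")); // false: "-01" is left over
      System.out.println(matchesEntirely(digits, ""));        // false: + needs at least one digit
   }
}
```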
The input of the matcher is an object of any class that implements the CharSequence interface, such as a String, StringBuilder, or CharBuffer.

When compiling the pattern, you can set one or more flags, for example,

    Pattern pattern = Pattern.compile(patternString,
       Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The following six flags are supported:
If the regular expression contains groups, then the Matcher object can reveal the group boundaries. The methods

    int start(int groupIndex)
    int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group.

You can simply extract the matched string by calling

    String group(int groupIndex)

Group 0 is the entire match (which is the entire input when the matches method succeeds); the group index for the first actual group is 1. Call the groupCount method to get the number of groups in the pattern; group 0 is not included in that count.

Nested groups are ordered by the opening parentheses. For example, given the pattern

    ((1?[0-9]):([0-5][0-9]))[ap]m

and the input

    11:59am

the matcher reports the following groups:

    Group Index   Start   End   String
    0             0       7     11:59am
    1             0       5     11:59
    2             0       2     11
    3             3       5     59
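The group boundaries for this pattern and input can be verified with a short program. The following is a minimal sketch; the class name GroupDemo and the helper describeGroups are hypothetical, while start, end, group, and groupCount are the Matcher methods named above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern and input are the ones from the text.
class GroupDemo
{
   // Returns one line per group in the form "index: start-end text",
   // or "No match" if the pattern does not match the entire input.
   static String describeGroups(String patternString, String input)
   {
      Matcher matcher = Pattern.compile(patternString).matcher(input);
      if (!matcher.matches()) return "No match";

      StringBuilder result = new StringBuilder();
      // groupCount does not include group 0, hence the <= in the loop
      for (int i = 0; i <= matcher.groupCount(); i++)
      {
         result.append(i).append(": ")
               .append(matcher.start(i)).append("-").append(matcher.end(i))
               .append(" ").append(matcher.group(i)).append("\n");
      }
      return result.toString();
   }

   public static void main(String[] args)
   {
      System.out.print(describeGroups("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"));
      // 0: 0-7 11:59am
      // 1: 0-5 11:59
      // 2: 0-2 11
      // 3: 3-5 59
   }
}
```

In this example every group takes part in the match. For a group that does not participate, start and end return -1 and group returns null, so production code should be prepared for those values.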
Example 12-9 prompts for a pattern, then for strings to match. It prints out whether or not the input matches the pattern. If the input matches and the pattern contains groups, then the program prints the group boundaries as parentheses, such as

    ((11):(59))am

Example 12-9. RegexTest.java

    import java.util.*;
    import java.util.regex.*;

    /**
       This program tests regular expression matching.
       Enter a pattern and strings to match, or enter an
       empty line to exit. If the pattern contains groups,
       the group boundaries are displayed in the match.
    */
    public class RegexTest
    {
       public static void main(String[] args)
       {
          Scanner in = new Scanner(System.in);
          System.out.println("Enter pattern: ");
          String patternString = in.nextLine();

          Pattern pattern = null;
          try
          {
             pattern = Pattern.compile(patternString);
          }
          catch (PatternSyntaxException e)
          {
             System.out.println("Pattern syntax error");
             System.exit(1);
          }

          while (true)
          {
             System.out.println("Enter string to match: ");
             // nextLine throws an exception at the end of input
             // instead of returning null, so check first
             if (!in.hasNextLine()) return;
             String input = in.nextLine();
             if (input.equals("")) return;
             Matcher matcher = pattern.matcher(input);
             if (matcher.matches())
             {
                System.out.println("Match");
                int g = matcher.groupCount();
                if (g > 0)
                {
                   for (int i = 0; i < input.length(); i++)
                   {
                      // print an opening parenthesis for each group starting here
                      for (int j = 1; j <= g; j++)
                         if (i == matcher.start(j))
                            System.out.print('(');
                      System.out.print(input.charAt(i));
                      // print a closing parenthesis for each group ending here
                      for (int j = 1; j <= g; j++)
                         if (i + 1 == matcher.end(j))
                            System.out.print(')');
                   }
                   System.out.println();
                }
             }
             else
                System.out.println("No match");
          }
       }
    }

Usually, you don't want to match the entire input against a regular expression, but you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.
    while (matcher.find())
    {
       int start = matcher.start();
       int end = matcher.end();
       String match = input.substring(start, end);
       . . .
    }

Example 12-10 puts this mechanism to work. It locates all hypertext references in a web page and prints them. To run the program, supply a URL on the command line, such as

    java HrefMatch http://www.horstmann.com

Example 12-10. HrefMatch.java

    import java.io.*;
    import java.net.*;
    import java.util.regex.*;

    /**
       This program displays all URLs in a web page by
       matching a regular expression that describes the
       <a href=...> HTML tag. Start the program as
       java HrefMatch URL
    */
    public class HrefMatch
    {
       public static void main(String[] args)
       {
          try
          {
             // get URL string from command line or use default
             String urlString;
             if (args.length > 0) urlString = args[0];
             else urlString = "http://java.sun.com";

             // open reader for URL
             InputStreamReader in = new InputStreamReader(new URL(urlString).openStream());

             // read contents into string builder
             StringBuilder input = new StringBuilder();
             int ch;
             while ((ch = in.read()) != -1) input.append((char) ch);

             // search for all occurrences of pattern; the href value is either
             // a quoted string or a run of characters other than whitespace and >
             String patternString = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
             Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
             Matcher matcher = pattern.matcher(input);

             while (matcher.find())
             {
                int start = matcher.start();
                int end = matcher.end();
                String match = input.substring(start, end);
                System.out.println(match);
             }
          }
          catch (IOException e)
          {
             e.printStackTrace();
          }
          catch (PatternSyntaxException e)
          {
             e.printStackTrace();
          }
       }
    }

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.
    Pattern pattern = Pattern.compile("[0-9]+");
    Matcher matcher = pattern.matcher(input);
    String output = matcher.replaceAll("#");

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a $ character in the replacement text (in a Java string literal, this is written as "\\$").

The replaceFirst method replaces only the first occurrence of the pattern.

Finally, the Pattern class has a split method that works like a string tokenizer on steroids. It splits an input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional whitespace.

    Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
    String[] tokens = pattern.split(input);

java.util.regex.Pattern 1.4
java.util.regex.Matcher 1.4