[Page 279 ( continued )]

8.6. (Optional) Regular Expressions

A regular expression (abbreviated regex) is a string that describes a pattern for matching a set of strings. Regular expressions are a powerful tool for string manipulation. You can use them for matching, replacing, and splitting strings.


[Page 280]

Pedagogical Note

The previous edition of this book introduced the StringTokenizer class for extracting tokens from a string. Using regular expressions is more powerful and flexible than StringTokenizer for splitting strings. Therefore, StringTokenizer is obsolete.


8.6.1. Matching Strings

Let us begin with the matches method in the String class. At first glance, the matches method is very similar to the equals method. For example, the following two statements both evaluate to true:

   "Java".matches("Java");
   "Java".equals("Java");

However, the matches method is more powerful. It can match not only a fixed string, but also a set of strings that follow a pattern. For example, the following statements all evaluate to true:

   "Java is fun".matches("Java.*")
   "Java is cool".matches("Java.*")
   "Java is powerful".matches("Java.*")

"Java.*" in the preceding statements is a regular expression. It describes a string pattern that begins with Java followed by any zero or more characters. Here, the substring .* matches any zero or more characters.

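One point worth making concrete is that matches tests the entire string, not a substring. The following is a minimal sketch of that behavior; the class name MatchesDemo and the helper startsWithJava are hypothetical names introduced only for illustration.

```java
// Sketch: String.matches succeeds only if the whole string fits the pattern.
class MatchesDemo {
    // Hypothetical helper: true if s begins with "Java".
    static boolean startsWithJava(String s) {
        return s.matches("Java.*");
    }

    public static void main(String[] args) {
        System.out.println("Java".equals("Java"));          // true
        System.out.println("Java".matches("Java"));         // true
        System.out.println(startsWithJava("Java is fun"));  // true

        // The pattern must cover the whole string, so a bare "Java" fails here.
        System.out.println("Java is fun".matches("Java"));  // false

        // "Java.*" requires the string to begin with Java.
        System.out.println(startsWithJava("I like Java"));  // false
    }
}
```

To test whether a string merely contains Java, the pattern would need .* on both sides, as in ".*Java.*".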
8.6.2. Regular Expression Syntax

A regular expression consists of literal characters and special symbols. Table 8.1 lists some frequently used syntax for regular expressions.

Table 8.1. Frequently Used Regular Expressions
(This item is displayed on page 281 in the print version)
Regular Expression   Matches                                  Example
x                    a specified character x                  Java matches Java
.                    any single character                     Java matches J..a
(ab|cd)              ab or cd                                 ten matches t(en|im)
[abc]                a, b, or c                               Java matches Ja[uvwx]a
[^abc]               any character except a, b, or c          Java matches Ja[^ars]a
[a-z]                a through z                              Java matches [A-M]av[a-d]
[^a-z]               any character except a through z         Java matches Jav[^b-d]
[a-e[m-p]]           a through e or m through p               Java matches [A-G[I-M]]av[a-d]
[a-e&&[c-p]]         intersection of a-e with c-p             Java matches [A-P&&[I-M]]av[a-d]
\d                   a digit, same as [0-9]                   Java2 matches "Java[\\d]"
\D                   a non-digit                              $Java matches "[\\D][\\D]ava"
\w                   a word character                         Java matches "[\\w]ava"
\W                   a non-word character                     $Java matches "[\\W][\\w]ava"
\s                   a whitespace character                   "Java 2" matches "Java\\s2"
\S                   a non-whitespace character               Java matches "[\\S]ava"
p*                   zero or more occurrences of pattern p    Java matches "[\\w]*"
p+                   one or more occurrences of pattern p     Java matches "[\\w]+"
p?                   zero or one occurrence of pattern p      Java matches "[\\w]?ava"
p{n}                 exactly n occurrences of pattern p       Java matches "[\\w]{4}"
p{n,}                at least n occurrences of pattern p      Java matches "[\\w]{3,}"
p{n,m}               between n and m occurrences (inclusive)  Java matches "[\\w]{1,9}"

Note

Backslash is a special character that starts an escape sequence in a Java string literal. So you need to write "\\d" in Java source code to represent the regular expression \d.
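The effect of the doubled backslash can be checked directly. This is a small sketch, assuming a hypothetical class EscapeDemo and constant JAVA_DIGIT: the literal "Java\\d" in source code produces a six-character string whose last two characters are a backslash and d.

```java
// Sketch: how a doubled backslash in source code becomes a single one in the regex.
class EscapeDemo {
    // Hypothetical constant: the regex Java\d written as a Java string literal.
    static final String JAVA_DIGIT = "Java\\d";

    public static void main(String[] args) {
        System.out.println(JAVA_DIGIT);                    // prints Java\d
        System.out.println(JAVA_DIGIT.length());           // 6
        System.out.println("Java2".matches(JAVA_DIGIT));   // true
        System.out.println("Javax".matches(JAVA_DIGIT));   // false
    }
}
```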


Note

Recall that a whitespace (or a whitespace character) is any character that does not display itself but does take up space. The characters ' ', '\t', '\n', '\r', '\f', and the vertical tab (written \x0B in a regular expression) are whitespace characters. So \s is the same as [ \t\n\x0B\f\r], and \S is the same as [^ \t\n\x0B\f\r].


Note

A word character is any letter, digit, or the underscore character. So \w is the same as [a-z[A-Z][0-9]_] or simply [a-zA-Z0-9_] , and \W is the same as [^a-zA-Z0-9_] .
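The whitespace and word-character classes described in the notes above can be tried out as follows. This is only an illustrative sketch; the class name CharClassDemo is hypothetical.

```java
// Sketch: \s, \w, and \W in action (each backslash is doubled in the string literal).
class CharClassDemo {
    public static void main(String[] args) {
        // \s accepts a space or a tab, but not an underscore.
        System.out.println("Java 2".matches("Java\\s2"));   // true
        System.out.println("Java\t2".matches("Java\\s2"));  // true
        System.out.println("Java_2".matches("Java\\s2"));   // false

        // \w accepts letters, digits, and the underscore, but not a space.
        System.out.println("Java_2".matches("\\w+"));       // true
        System.out.println("Java 2".matches("\\w+"));       // false

        // \W accepts a non-word character such as $.
        System.out.println("$Java".matches("\\W\\w+"));     // true
    }
}
```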


Note

The last six entries *, +, ?, {n}, {n,}, and {n,m} in Table 8.1 are called quantifiers; they specify how many times the pattern before a quantifier may repeat. For example, A* matches zero or more A's, A+ matches one or more A's, A? matches zero or one A, A{3} matches exactly AAA, A{3,} matches at least three A's, and A{3,6} matches between 3 and 6 A's. * is the same as {0,}, + is the same as {1,}, and ? is the same as {0,1}.


Caution

Do not use spaces in the repeat quantifiers. For example, A{3,6} cannot be written as A{3, 6} with a space after the comma.


Note

You may use parentheses to group patterns. For example, (ab){3} matches ababab , but ab{3} matches abbb .
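The quantifier and grouping rules in the notes above can be verified with a few calls to matches. The sketch below uses a hypothetical class name, QuantifierDemo, and only the patterns already mentioned in the notes.

```java
// Sketch: quantifiers apply to the pattern immediately before them.
class QuantifierDemo {
    public static void main(String[] args) {
        System.out.println("".matches("A*"));             // true: zero A's are allowed
        System.out.println("".matches("A+"));             // false: at least one A is required
        System.out.println("AAA".matches("A{3}"));        // true
        System.out.println("AAAA".matches("A{3}"));       // false: exactly three required
        System.out.println("AAAA".matches("A{3,}"));      // true: three or more
        System.out.println("AAAAAAA".matches("A{3,6}"));  // false: seven A's exceed the upper bound

        // With parentheses the quantifier repeats the group; without them, only b.
        System.out.println("ababab".matches("(ab){3}"));  // true
        System.out.println("abbb".matches("ab{3}"));      // true
        System.out.println("ababab".matches("ab{3}"));    // false
    }
}
```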



[Page 281]

Let us use several examples to demonstrate how to construct regular expressions.

Example 1

The pattern for social security numbers is xxx-xx-xxxx , where x is a digit. A regular expression for social security numbers can be described as

 [\d]{3}-[\d]{2}-[\d]{4}

Example 2

An even number ends with the digit 0, 2, 4, 6, or 8. The pattern for even numbers can be described as

 [\d]*[02468]


[Page 282]
Example 3

The pattern for telephone numbers is (xxx) xxx-xxxx , where x is a digit and the first digit cannot be zero. A regular expression for telephone numbers can be described as

 \([1-9][\d]{2}\) [\d]{3}-[\d]{4}

Note that the parentheses symbols ( and ) are special characters in a regular expression for grouping patterns. To represent a literal ( or ) in a regular expression, you have to use \( and \), which are written as "\\(" and "\\)" in a Java string literal.

Example 4

Suppose the last name consists of at most twenty-five letters and the first letter is in uppercase. The pattern for a last name can be described as

 [A-Z][a-zA-Z]{1,24}

Note that you cannot have arbitrary whitespace in a regular expression. For example, [A-Z][a-zA-Z]{1, 24} would be wrong.

Example 5

Java identifiers are defined in §2.3, "Identifiers."

  • An identifier must start with a letter, an underscore ( _ ), or a dollar sign ( $ ). It cannot start with a digit.

  • An identifier is a sequence of characters that consists of letters, digits, underscores ( _ ), and dollar signs ( $ ).

The pattern for identifiers can be described as

 [a-zA-Z_$][\w$]* 

Example 6

What strings are matched by the regular expression "Welcome to (Java|HTML)"? The answer is Welcome to Java or Welcome to HTML.

Example 7

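What strings are matched by the regular expression ".*"? The answer is any string that contains no line terminator, because by default . does not match a line terminator such as '\n'.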
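The patterns from Examples 1 through 5 can be exercised with String.matches once each backslash is doubled for the Java string literal. The following is a sketch only; the class name PatternExamples, the constant names, and the sample inputs are made up for illustration.

```java
// Sketch: the patterns from Examples 1-5 written as Java string literals.
class PatternExamples {
    static final String SSN        = "[\\d]{3}-[\\d]{2}-[\\d]{4}";
    static final String EVEN       = "[\\d]*[02468]";
    static final String PHONE      = "\\([1-9][\\d]{2}\\) [\\d]{3}-[\\d]{4}";
    static final String LAST_NAME  = "[A-Z][a-zA-Z]{1,24}";
    static final String IDENTIFIER = "[a-zA-Z_$][\\w$]*";

    public static void main(String[] args) {
        System.out.println("123-45-6789".matches(SSN));         // true
        System.out.println("123-456-789".matches(SSN));         // false: wrong grouping

        System.out.println("1024".matches(EVEN));               // true
        System.out.println("1023".matches(EVEN));               // false: ends with 3

        System.out.println("(912) 555-0123".matches(PHONE));    // true
        System.out.println("(012) 555-0123".matches(PHONE));    // false: first digit is zero

        System.out.println("Smith".matches(LAST_NAME));         // true
        System.out.println("smith".matches(LAST_NAME));         // false: lowercase first letter

        System.out.println("$total_2".matches(IDENTIFIER));     // true
        System.out.println("2ndValue".matches(IDENTIFIER));     // false: starts with a digit
    }
}
```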

8.6.3. Replacing and Splitting Strings

The matches method in the String class returns true if the string matches the regular expression. The String class also contains the replaceAll , replaceFirst , and split methods for replacing and splitting strings, as shown in Figure 8.11.

Figure 8.11. The String class contains the methods for matching, replacing, and splitting strings using regular expressions.
(This item is displayed on page 283 in the print version)

The replaceAll method replaces all matching substrings and the replaceFirst method replaces the first matching substring. For example, the following code

 System.out.println("Java Java Java".replaceAll("v\\w", "wi"));

displays

 Jawi Jawi Jawi 


[Page 283]

The following code

 System.out.println("Java Java Java".replaceFirst("v\\w", "wi"));

displays

 Jawi Java Java 

There are two overloaded split methods. The split(regex) method splits a string into substrings delimited by the matches. For example, the following statement

 String[] tokens = "Java1HTML2Perl".split("\\d");

splits the string "Java1HTML2Perl" into Java, HTML, and Perl and saves them in tokens[0], tokens[1], and tokens[2].

In the split(regex, limit) method, the limit parameter determines how many times the pattern is matched. If limit is 0, split(regex, limit) is the same as split(regex): the pattern is matched as many times as possible and trailing empty strings are discarded. If limit < 0, the pattern is also matched as many times as possible, but trailing empty strings are kept. If limit > 0, the pattern is matched at most limit - 1 times. Here are some examples:

   "Java1HTML2Perl".split("\\d", 0); splits into Java, HTML, Perl
   "Java1HTML2Perl".split("\\d", 1); splits into Java1HTML2Perl
   "Java1HTML2Perl".split("\\d", 2); splits into Java, HTML2Perl
   "Java1HTML2Perl".split("\\d", 3); splits into Java, HTML, Perl
   "Java1HTML2Perl".split("\\d", 4); splits into Java, HTML, Perl
   "Java1HTML2Perl".split("\\d", 5); splits into Java, HTML, Perl

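The effect of a positive limit can be seen by printing the resulting arrays. The sketch below assumes a hypothetical class SplitDemo with a helper splitOnDigits; it uses Arrays.toString from java.util only to display the tokens.

```java
import java.util.Arrays;

// Sketch: how the limit argument caps the number of pieces split produces.
class SplitDemo {
    // Hypothetical helper: splits s at each digit, applying the given limit.
    static String[] splitOnDigits(String s, int limit) {
        return s.split("\\d", limit);
    }

    public static void main(String[] args) {
        String s = "Java1HTML2Perl";
        System.out.println(Arrays.toString(s.split("\\d")));        // [Java, HTML, Perl]
        System.out.println(Arrays.toString(splitOnDigits(s, 1)));   // [Java1HTML2Perl]
        System.out.println(Arrays.toString(splitOnDigits(s, 2)));   // [Java, HTML2Perl]
        System.out.println(Arrays.toString(splitOnDigits(s, 3)));   // [Java, HTML, Perl]
        System.out.println(Arrays.toString(splitOnDigits(s, 5)));   // [Java, HTML, Perl]
    }
}
```

With a limit of 2 the pattern is applied only once, so the second delimiter 2 stays inside the last token.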
Note

By default, all the quantifiers are greedy . This means that they will match as many occurrences as possible. For example, the following statement displays JRvaa , since the first match is aaa :

 System.out.println(    "Jaaavaa"   .replaceFirst(   "a+"   ,   "R"   )  ); 

You can change a quantifier's default behavior by appending a question mark ( ? ) after it. The quantifier becomes reluctant , which means that it will match as few occurrences as possible. For example, the following statement displays JRaavaa , since the first match is a :

 System.out.println(   "Jaaavaa"   .replaceFirst(    "a+?"    ,   "R"   )); 
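To see exactly which substring a greedy or reluctant quantifier picks, the match itself can be extracted with the Pattern and Matcher classes from java.util.regex. This is a sketch under that assumption; the class GreedyDemo and the helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: comparing the first match of a greedy and a reluctant quantifier.
class GreedyDemo {
    // Hypothetical helper: returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a+", "Jaaavaa"));            // aaa (greedy)
        System.out.println(firstMatch("a+?", "Jaaavaa"));           // a   (reluctant)

        // The same difference shows up when every match is replaced.
        System.out.println("Jaaavaa".replaceAll("a+", "R"));        // JRvR
        System.out.println("Jaaavaa".replaceAll("a+?", "R"));       // JRRRvRR
    }
}
```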


 


Introduction to Java Programming-Comprehensive Version
Introduction to Java Programming-Comprehensive Version (6th Edition)
ISBN: B000ONFLUM
EAN: N/A
Year: 2004
Pages: 503
