Regular Expressions


A regular expression, also known as a regex or regexp, is a string containing a search pattern. The regular expression language is a fairly standardized language for pattern-matching specifications. Other languages such as Perl and Ruby provide direct support for regular expressions. Programmers' editors such as TextPad and UltraEdit allow you to search file text using regular expressions. Java supplies a set of class libraries to allow you to take advantage of regular expressions.

You might have used a wildcard character ('*') to list files in a directory. Using the wildcard tells the ls or dir command to find all matching files, assuming that the * can expand into any character or sequence of characters. For example, the DOS command dir *.java lists every file with a .java extension.

The regular expression language is similar in concept but far more powerful. In this section, I'll introduce you to regular expressions using a few simple examples. I'll show you how to take advantage of regular expressions in your Java code. I'll suggest how you might approach testing Java code that uses regular expressions. But a complete discussion of regular expressions is way out of the scope of this book. Refer to the end of this section for a few sites that provide regular expression tutorials and information.

Splitting Strings

Back in Lesson 7, you used the String method split to break a full name into discrete name parts. You passed a String containing a single blank character to split:

 for (String name: fullName.split(" ")) 

The split method takes a regular expression as its sole argument.[1] The split method breaks the receiving string up when it encounters a match for the regular expression.

[1] An overloaded version of split takes a limit on the number of times the pattern is applied.
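As a minimal sketch of the overloaded form mentioned in the footnote, passing a limit of 2 applies the pattern at most once, so everything after the first match stays in the second element. The class name SplitLimitExample and the method splitOnce are made up for illustration and are not part of the book's code.

```java
import java.util.Arrays;

class SplitLimitExample {
   // Splits around the first blank only; a limit of 2 caps the array length at 2.
   static String[] splitOnce(String source) {
      return source.split(" ", 2);
   }

   public static void main(String[] args) {
      // Three blanks separate the names; only the first one is consumed,
      // so the second element keeps the two remaining blanks.
      System.out.println(Arrays.toString(splitOnce("Jeffrey   Hyman")));
   }
}
```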

Suppose you try to split the string "Jeffrey Hyman"[2] with three spaces separating the first and last name. You want the results to be the strings "Jeffrey" and "Hyman", but a passing language test represents the actual results:

[2] Aka Joey Ramone. Gabba gabba hey.

 public void testSplit() {
    String source = "Jeffrey   Hyman";
    String[] expectedSplit = {"Jeffrey", "", "", "Hyman" };
    assertTrue(
       java.util.Arrays.equals(expectedSplit, source.split(" ")));
 }

In other words, there are four substrings separated by spaces. Two of those substrings are the empty string (""): The first blank character separates "Jeffrey" from an empty string, the second blank character separates the first empty string from a second empty string, and the third blank character separates the second empty string from "Hyman".

You instead want to split the names around any groupings of whitespace characters. Using the Java API documentation for the class java.util.regex.Pattern as your guide, you should be able to figure out that the construct \s matches a single whitespace character (space, tab, new line, vertical tab, form feed, or carriage return). Under the section Greedy Quantifiers in the API doc, you'll note that X+ means that a match is found when the construct X occurs at least one time.

Combining the two ideas, the regular expression string \s+ will match on sequences with one or more whitespace characters.

I've moved the name-splitting code out of the Student class and into a class called sis.studentinfo.Name. You can find the complete code for NameTest and Name at http://www.LangrSoft.com/agileJava/code. The listings below of NameTest and Name show only relevant or new material.

 // NameTest.java
 public void testExtraneousSpaces() {
    final String fullName = "Jeffrey   Hyman";
    Name name = createName(fullName);
    assertEquals("Jeffrey", name.getFirstName());
    assertEquals("Hyman", name.getLastName());
 }

 private Name createName(String fullName) {
    Name name = new Name(fullName);
    assertEquals(fullName, name.getFullName());
    return name;
 }

 // Name.java
 private List<String> split(String fullName) {
    List<String> results = new ArrayList<String>();
    for (String name: fullName.split(" "))
       results.add(name);
    return results;
 }

The test method testExtraneousSpaces will not pass given the current implementation in Name.split. Using the new pattern you learned about, you can update the split method to make the tests pass:

 private List<String> split(String fullName) {
    List<String> results = new ArrayList<String>();
    for (String name: fullName.split("\\s+"))
       results.add(name);
    return results;
 }

The backslash character ('\') represents the start of an escape sequence in Java strings. Thus, the regular expression \s+ appears as \\s+ within the context of a String.
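The difference between what you type and what the regular expression engine sees can be checked directly. The following is a small sketch; the class name EscapeExample and its members are hypothetical names chosen for illustration.

```java
class EscapeExample {
   // The Java source literal "\\s+" denotes the three characters \ s +,
   // which is the text the regular expression engine actually receives.
   static final String WHITESPACE_RUN = "\\s+";

   static int patternLength() {
      return WHITESPACE_RUN.length();
   }

   public static void main(String[] args) {
      System.out.println(WHITESPACE_RUN);   // prints \s+
      System.out.println(patternLength());  // prints 3
   }
}
```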

Since \s matches any whitespace character, change the test to demonstrate that additional whitespace characters are ignored.

 public void testExtraneousWhitespace() {
    final String fullName = "Jeffrey   \t\t\n \r\fHyman";
    Name name = createName(fullName);
    assertEquals("Jeffrey", name.getFirstName());
    assertEquals("Hyman", name.getLastName());
 }
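One case these tests do not cover is whitespace before the first name. When the input starts with a match, split keeps an empty string as the first element, so trimming the input first is one way to avoid it. This is only a sketch: the class LeadingWhitespaceExample and its two methods are hypothetical and are not part of the Name class.

```java
import java.util.Arrays;
import java.util.List;

class LeadingWhitespaceExample {
   // Splits the input as-is: a match at the very start yields an empty first element.
   static List<String> splitRaw(String fullName) {
      return Arrays.asList(fullName.split("\\s+"));
   }

   // Trims before splitting so that no empty first element appears.
   static List<String> splitTrimmed(String fullName) {
      return Arrays.asList(fullName.trim().split("\\s+"));
   }

   public static void main(String[] args) {
      System.out.println(splitRaw("  Jeffrey Hyman").size());      // prints 3
      System.out.println(splitTrimmed("  Jeffrey Hyman").size());  // prints 2
   }
}
```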

Replacing Expressions in Strings

Many applications must capture a phone number from the user. People enter phone numbers in myriad ways. The best place to start is to strip all nondigits (parentheses, hyphens, letters, whitespace, and so on) from the phone number. This is very easy to do using regular expressions. Using a U.S. 10-digit phone number as an example:

 public void testStripPhoneNumber() {
    String input = "(719) 555-9353 (home)";
    assertEquals("7195559353", StringUtil.stripToDigits(input));
 }

The corresponding production code:

 public static String stripToDigits(String input) {
    return input.replaceAll("\\D+", "");
 }

The replaceAll method takes two arguments: a regular expression to match on and a replacement string. The expression \D matches any nondigit character. The convention in the regular expression language is that the lowercase version of a construct character is used for positive matches and the uppercase version is used for negative matches. As another example, \w matches a "word" character: any letter, digit, or the underscore ('_') character. The uppercase construct, \W, matches any character that is not a word character.
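The lowercase/uppercase convention can be tried out with replaceAll itself. The sketch below uses \W and \d; the class CharacterClassExample and the methods toIdentifier and maskDigits are invented for illustration and do not appear in StringUtil.

```java
class CharacterClassExample {
   // \W+ matches each run of non-word characters; every run becomes one underscore.
   static String toIdentifier(String input) {
      return input.replaceAll("\\W+", "_");
   }

   // \d matches a single digit; every digit becomes '#'.
   static String maskDigits(String input) {
      return input.replaceAll("\\d", "#");
   }

   public static void main(String[] args) {
      System.out.println(toIdentifier("first name (legal)"));  // prints first_name_legal_
      System.out.println(maskDigits("(719) 555-9353"));        // prints (###) ###-####
   }
}
```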

The Pattern and Matcher Classes

You can use a JTextPane as the basis for a Swing-based text editor application. The JTextPane allows you to apply styles to sections of text. Many text editors provide the ability to search and highlight all strings matching a pattern. You'll want to build a class that can manage text searches against the underlying text.

Here's a test that demonstrates a small bit more of the regular expressions language.

 package sis.util;

 import junit.framework.TestCase;
 import java.util.regex.*;

 public class RegexTest extends TestCase {
    public void testSearch() {
       String[] textLines =
          {"public class Test {",
           "public void testMethod() {}",
           "public void testNotReally(int x) {}",
           "public void test() {}",
           "public String testNotReally() {}",
           "}" };
       String text = join(textLines);
       String testMethodRegex =
          "public\\s+void\\s+test\\w*\\s*\\(\\s*\\)\\s*\\{";
       Pattern pattern = Pattern.compile(testMethodRegex);
       Matcher matcher = pattern.matcher(text);
       assertTrue(matcher.find());
       assertEquals(text.indexOf(textLines[1]), matcher.start());
       assertTrue(matcher.find());
       assertEquals(text.indexOf(textLines[3]), matcher.start());
       assertFalse(matcher.find());
    }

    private String join(String[] textLines) {
       StringBuilder builder = new StringBuilder();
       for (String line: textLines) {
          if (builder.length() > 0)
             builder.append("\n");
          builder.append(line);
       }
       return builder.toString();
    }
 }

The regular expression in testMethodRegex looks pretty tough. Regular expressions can get much worse! The one above doesn't handle the possibility of a static or abstract modifier.[3] The double backslashes in the regular expression don't help matters.

[3] Perhaps regex isn't the best tool for this job.
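If you did want to tolerate a static modifier, one possible approach is an optional non-capturing group, written (?:...)?. The sketch below is an illustration only, not the book's code: the class ModifierExample and the method declaresTestMethod are hypothetical. It also makes no attempt to handle abstract methods, which have no opening brace to match.

```java
import java.util.regex.Pattern;

class ModifierExample {
   // (?:static\s+)? is an optional, non-capturing group for the modifier.
   // The rest of the pattern is the same as testMethodRegex.
   static final Pattern TEST_METHOD = Pattern.compile(
      "public\\s+(?:static\\s+)?void\\s+test\\w*\\s*\\(\\s*\\)\\s*\\{");

   // Returns true if the line contains a no-argument public void test method.
   static boolean declaresTestMethod(String line) {
      return TEST_METHOD.matcher(line).find();
   }

   public static void main(String[] args) {
      System.out.println(declaresTestMethod("public static void testCreate() {"));  // prints true
      System.out.println(declaresTestMethod("public String testNotReally() {}"));   // prints false
   }
}
```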

Breaking the pattern down into its constructs makes it pretty straightforward. I'll step through each construct, left to right in the expression:

 public\\s+void\\s+test\\w*\\s*\\(\\s*\\)\\s*\\{ 

 Construct   Meaning
 public      match the text "public"
 \\s+        match one or more whitespace characters
 void        match the text "void"
 \\s+        match one or more whitespace characters
 test        match the text "test"
 \\w*        match zero or more word characters. An asterisk indicates that there may be any number of the preceding construct (including 0).
 \\s*        match zero or more whitespace characters
 \\(         match a left parenthesis. You must escape parenthesis characters and brace characters.
 \\s*        match zero or more whitespace characters
 \\)         match a right parenthesis
 \\s*        match zero or more whitespace characters
 \\{         match a left brace


A regular expression string is free-format, unproven text. You must compile it first using the Pattern class method compile. A successful compile returns a Pattern object. From a Pattern object, you can obtain a Matcher for a given input String. Once you have a Matcher object, you can send it the find message to get it to locate the next matching subsequence (the substring of the input String that matches the regular expression). The find method returns true if a match was found, false otherwise.
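When the text is not a valid regular expression, Pattern.compile throws the unchecked exception java.util.regex.PatternSyntaxException. A small sketch of checking a string before using it follows; the class CompileExample and the method isValidRegex are hypothetical names used only here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileExample {
   // Returns true if the regular expression string compiles into a Pattern.
   static boolean isValidRegex(String regex) {
      try {
         Pattern.compile(regex);
         return true;
      } catch (PatternSyntaxException e) {
         return false;
      }
   }

   public static void main(String[] args) {
      System.out.println(isValidRegex("test\\w*\\("));  // escaped parenthesis: prints true
      System.out.println(isValidRegex("test\\w*("));    // unclosed group: prints false
   }
}
```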

The Matcher instance stores information about the last found subsequence. You can send the Matcher object the messages start and end to obtain the indexes that delineate the matched subsequence. Sending the message group to the Matcher returns the matching subsequence text.
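To see start, end, and group together, you can loop on find until it returns false. In the sketch below, the class MatcherExample and the method describeMatches are made-up names. Note that end returns the index just past the last matched character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherExample {
   // Collects "start-end:text" for every match of regex within input.
   static List<String> describeMatches(String regex, String input) {
      List<String> results = new ArrayList<String>();
      Matcher matcher = Pattern.compile(regex).matcher(input);
      while (matcher.find())
         results.add(
            matcher.start() + "-" + matcher.end() + ":" + matcher.group());
      return results;
   }

   public static void main(String[] args) {
      // prints [5-12:testAdd, 23-27:test]
      System.out.println(
         describeMatches("test\\w*", "void testAdd() {} void test() {}"));
   }
}
```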

In addition to triggering a find, you can send the message matches to the Matcher. This returns true if and only if the entire input string matches the regular expression. You can also send lookingAt, which returns true if the start of the input string (or the entire string) matches the regular expression.
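The three messages differ only in how much of the input must take part in the match. The following sketch compares them using the pattern \d+; the class MatchKindsExample and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class MatchKindsExample {
   static final Pattern DIGITS = Pattern.compile("\\d+");

   // True only if the whole input is a run of digits.
   static boolean matchesAll(String input) {
      return DIGITS.matcher(input).matches();
   }

   // True if the input begins with a run of digits.
   static boolean matchesStart(String input) {
      return DIGITS.matcher(input).lookingAt();
   }

   // True if a run of digits appears anywhere in the input.
   static boolean matchesAnywhere(String input) {
      return DIGITS.matcher(input).find();
   }

   public static void main(String[] args) {
      System.out.println(matchesAll("719-555"));       // prints false
      System.out.println(matchesStart("719-555"));     // prints true
      System.out.println(matchesStart("home 719"));    // prints false
      System.out.println(matchesAnywhere("home 719")); // prints true
   }
}
```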

For More Information

As you've seen, testing regular expressions is extremely simple: the assertions compare only String objects. This brief section should have given you enough information about regular expressions to be able to dig further into Java's support for them. The API documentation for the class java.util.regex.Pattern is the best place to start.

The regular expression language is a bit more involved. Knowing how patterns match against text is necessary to understanding how to write an appropriate regular expression.

The Sun tutorial on regular expressions at the web page http://java.sun.com/docs/books/tutorial/extra/regex/ supplies information on a few additional regular expression topics. You may also want to visit

http://www.regular-expressions.info/

http://www.javaregex.com/



Agile Java. Crafting Code with Test-Driven Development
ISBN: 0131482394
EAN: 2147483647
Year: 2003
Pages: 391
Authors: Jeff Langr
