Recipe 4.1 Regular Expression Syntax

Problem

You need to learn the syntax of JDK 1.4 regular expressions.

Solution

Consult Table 4-2 for a list of the regular expression characters.

Discussion

These pattern characters let you specify regexes of considerable power. In building patterns, you can use ordinary text and the metacharacters, or special characters, in Table 4-2, in any combination that makes sense. For example, a+ means any number of occurrences of the letter a, from one up to a million or a gazillion. The pattern Mrs?\. matches either Mr. or Mrs. (with the trailing period). And .* means "any character, any number of times," and is similar in meaning to most command-line interpreters' meaning of the * alone. The pattern \d+ means one or more numeric digits, and \d{2,3} means a two- or three-digit number.
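To make those examples concrete, here is a minimal sketch that tries them against a few strings. The class name QuantifierBasics and the helper matchesAll are made up for illustration; the only library call is Pattern.matches( ), which requires the whole input to match. Note the doubled backslashes needed inside Java string literals.

```java
import java.util.regex.Pattern;

class QuantifierBasics {
    // Hypothetical helper: true only if the ENTIRE input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("a+", "aaaa"));        // true: one or more a's
        System.out.println(matchesAll("a+", ""));            // false: at least one a is required
        System.out.println(matchesAll("Mrs?\\.", "Mr."));    // true: the s is optional
        System.out.println(matchesAll("Mrs?\\.", "Mrs."));   // true
        System.out.println(matchesAll("\\d{2,3}", "42"));    // true: two digits
        System.out.println(matchesAll("\\d{2,3}", "4242"));  // false: four digits is too many
    }
}
```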

Table 4-2. Regular expression metacharacter syntax
Subexpression	Matches	Notes
General
`^`	Start of line/string
`$`	End of line/string
`\b`	Word boundary
`\B`	Not a word boundary
`\A`	Beginning of entire string
`\z`	End of entire string
`\Z`	End of entire string (except allowable final line terminator)	See Recipe 4.9
`.`	Any one character (except line terminator)
`[...]`	"Character class"; any one character from those listed
`[^...]`	Any one character not from those listed	See Recipe 4.2
Alternation and grouping
`(...)`	Grouping (capture groups)	See Recipe 4.3
`|`	Alternation
`(?:re)`	Noncapturing parentheses
`\G`	End of the previous match
`\n`	Back-reference to capture group number "`n`"
Normal (greedy) multipliers
`{m,n}`	Multiplier for "from `m` to `n` repetitions"	See Recipe 4.4
`{m,}`	Multiplier for "`m` or more repetitions"
`{m}`	Multiplier for "exactly `m` repetitions"	See Recipe 4.10
`{0,n}`	Multiplier for 0 up to `n` repetitions	The lower bound is required; `java.util.regex` rejects `{,n}`
`*`	Multiplier for 0 or more repetitions	Short for `{0,}`
`+`	Multiplier for 1 or more repetitions	Short for `{1,}`; see Recipe 4.2
`?`	Multiplier for 0 or 1 repetitions (i.e., present exactly once, or not at all)	Short for `{0,1}`
Reluctant (non-greedy) multipliers
`{m,n}?`	Reluctant multiplier for "from `m` to `n` repetitions"
`{m,}?`	Reluctant multiplier for "`m` or more repetitions"
`{0,n}?`	Reluctant multiplier for 0 up to `n` repetitions	The lower bound is required
`*?`	Reluctant multiplier: 0 or more
`+?`	Reluctant multiplier: 1 or more	See Recipe 4.10
`??`	Reluctant multiplier: 0 or 1 times
Possessive (very greedy) multipliers
`{m,n}+`	Possessive multiplier for "from `m` to `n` repetitions"
`{m,}+`	Possessive multiplier for "`m` or more repetitions"
`{0,n}+`	Possessive multiplier for 0 up to `n` repetitions	The lower bound is required
`*+`	Possessive multiplier: 0 or more
`++`	Possessive multiplier: 1 or more
`?+`	Possessive multiplier: 0 or 1 times
Escapes and shorthands
`\`	Escape (quote) character: turns most metacharacters off; turns subsequent alphabetic into metacharacters
`\Q`	Escape (quote) all characters up to `\E`
`\E`	Ends quoting begun with `\Q`
`\t`	Tab character
`\r`	Return (carriage return) character
`\n`	Newline character	See Recipe 4.9
`\f`	Form feed
`\w`	Character in a word	Use `\w+` for a word; see Recipe 4.10
`\W`	A non-word character
`\d`	Numeric digit	Use `\d+` for an integer; see Recipe 4.2
`\D`	A non-digit character
`\s`	Whitespace	Space, tab, etc.; by default `[ \t\n\x0B\f\r]`
`\S`	A nonwhitespace character	See Recipe 4.10
Unicode blocks (representative samples)
`\p{InGreek}`	A character in the Greek block	(simple block)
`\P{InGreek}`	Any character not in the Greek block
`\p{Lu}`	An uppercase letter	(simple category)
`\p{Sc}`	A currency symbol
POSIX-style character classes (defined only for US-ASCII)
`\p{Alnum}`	Alphanumeric characters	`[A-Za-z0-9]`
`\p{Alpha}`	Alphabetic characters	`[A-Za-z]`
`\p{ASCII}`	Any ASCII character	`[\x00-\x7F]`
`\p{Blank}`	Space and tab characters
`\p{Space}`	Space characters	`[ \t\n\x0B\f\r]`
`\p{Cntrl}`	Control characters	`[\x00-\x1F\x7F]`
`\p{Digit}`	Numeric digit characters	`[0-9]`
`\p{Graph}`	Printable and visible characters (not spaces or control characters)
`\p{Print}`	Printable characters	Same as `\p{Graph}`
`\p{Punct}`	Punctuation characters	One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
`\p{Lower}`	Lowercase characters	`[a-z]`
`\p{Upper}`	Uppercase characters	`[A-Z]`
`\p{XDigit}`	Hexadecimal digit characters	`[0-9a-fA-F]`
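A few rows of the table are easier to appreciate in action. The following is only a sketch: the class TableSamples, the helper firstMatch, and the sample strings are invented for illustration. It exercises word boundaries with a back-reference, \Q...\E quoting, two POSIX-style classes, and alternation inside a noncapturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableSamples {
    // Hypothetical helper: the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Word boundaries plus a back-reference to group 1: finds a doubled word.
        System.out.println(firstMatch("\\b(\\w+) \\1\\b", "Paris in the the spring")); // the the
        // Quoting: the + between \Q and \E is taken literally.
        System.out.println(firstMatch("\\Q1+1\\E", "1+1=2"));                          // 1+1
        // POSIX-style classes: one uppercase letter, then lowercase letters.
        System.out.println(firstMatch("\\p{Upper}\\p{Lower}+", "the Java language"));  // Java
        // Alternation inside a noncapturing group, followed by an optional s.
        System.out.println(firstMatch("(?:cat|dog)s?", "hot dogs"));                   // dogs
    }
}
```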

Regexes match anyplace possible in the string. Patterns followed by a greedy multiplier (the only type that existed in traditional Unix regexes) consume (match) as much as possible without compromising any subexpressions which follow; patterns followed by a possessive multiplier match as much as possible without regard to following subexpressions; patterns followed by a reluctant multiplier consume as few characters as possible to still get a match.
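The difference among the three kinds of multiplier shows up clearly when the same pattern is tried all three ways. This sketch uses a made-up class (GreedyVsReluctant), helper (firstMatch), and sample string; only the Pattern and Matcher calls come from the standard API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b>";
        // Greedy: .+ grabs as much as it can, then gives back just enough for the final >.
        System.out.println(firstMatch("<.+>", text));   // <b>bold</b>
        // Reluctant: .+? grabs as little as it can, stopping at the first >.
        System.out.println(firstMatch("<.+?>", text));  // <b>
        // Possessive: .++ grabs everything and never gives any back, so > can never match.
        System.out.println(firstMatch("<.++>", text));  // null
    }
}
```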

Also, unlike regex packages in some other languages, the JDK 1.4 package was designed to handle Unicode characters from the beginning. And the standard Java escape sequence \unnnn is used to specify a Unicode character in the pattern. We use methods of java.lang.Character to determine Unicode character properties, such as whether a given character is a space.
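Here is a small sketch of Unicode handling, using an invented class name (UnicodeSamples) and sample characters of my own choosing. It shows that the Unicode escape works whether the Java compiler expands it (single backslash) or the regex package does (doubled backslash), and that the Unicode categories and blocks see characters that the US-ASCII POSIX-style classes do not.

```java
import java.util.regex.Pattern;

class UnicodeSamples {
    // Hypothetical helper: true only if the ENTIRE input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9";    // javac expands the escape to e-acute
        String omega = "\u03a9";      // Greek capital letter omega

        // Doubled backslash: the escape reaches the regex package, which expands it.
        System.out.println(matchesAll("caf\\u00e9", cafe));    // true
        // Single backslash: javac expands it, so the pattern holds the literal character.
        System.out.println(matchesAll("caf\u00e9", cafe));     // true
        // Unicode category and block from Table 4-2.
        System.out.println(matchesAll("\\p{Lu}", omega));      // true: an uppercase letter
        System.out.println(matchesAll("\\p{InGreek}", omega)); // true: in the Greek block
        // The POSIX-style class is defined only for US-ASCII.
        System.out.println(matchesAll("\\p{Upper}", omega));   // false
    }
}
```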

To help you learn how regexes work, I provide a little program called REDemo.^[2] In the online directory javasrc/RE, you should be able to type either ant REDemo, or javac REDemo.java followed by java REDemo, to get the program running.

^[2] REDemo was inspired by (but does not use any code from) a similar program provided with the Jakarta Regular Expressions package mentioned in the Introduction to Chapter 4.

In the uppermost text box (see Figure 4-1), type the regex pattern you want to test. Note that as you type each character, the regex is checked for syntax; if the syntax is OK, you see a checkmark beside it. You can then select Match, Find, or Find All. Match means that the entire string must match the regex, while Find means the regex must be found somewhere in the string (Find All counts the number of occurrences that are found). Below that, you type a string that the regex is to match against. Experiment to your heart's content. When you have the regex the way you want it, you can paste it into your Java program. You'll need to escape (backslash) any characters that are treated specially by both the Java compiler and the JDK 1.4 regex package, such as the backslash itself, double quotes, and others (see the sidebar Remember This!).

Remember This!

Remember that because a regex is written as a string that is first compiled by javac and then compiled again by the regex package, you usually need two levels of escaping for any special characters, including backslash, double quotes, and so on. For example, the regex:

"You said it\."

has to be typed like this to be a Java language String:

"\"You said it\\.\""

I can't tell you how many times I've made the mistake of forgetting the extra backslash in \d+, \w+, and their kin!
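A quick way to convince yourself is to print the string and see what the regex package actually receives. This is a minimal sketch; the class name DoubleEscape and the constant names are made up for illustration.

```java
import java.util.regex.Pattern;

class DoubleEscape {
    // The regex from the sidebar (quotes included), written as a Java string literal.
    static final String QUOTED = "\"You said it\\.\"";
    // The regex for one or more digits; javac needs the backslash doubled.
    static final String DIGITS = "\\d+";

    public static void main(String[] args) {
        // Prints the regex as the regex package sees it: "You said it\."
        System.out.println(QUOTED);
        System.out.println(Pattern.matches(QUOTED, "\"You said it.\""));  // true
        System.out.println(Pattern.matches(QUOTED, "\"You said it!\""));  // false: escaped dot is literal
        System.out.println(Pattern.matches(DIGITS, "2004"));              // true
    }
}
```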

In Figure 4-1, I typed qu into the REDemo program's Pattern box, which is a syntactically valid regex pattern: any ordinary characters stand as regexes for themselves, so this looks for the letter q followed by u. In the top version, I typed only a q into the string, which is not matched. In the second, I have typed quack and the q of a second quack. Since I have selected Find All, the count shows one match. As soon as I type the second u, the count is updated to two, as shown in the third version.

Figure 4-1. REDemo with simple examples
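If you'd rather reproduce this experiment in code than in a GUI, the three modes map naturally onto the Matcher API. The sketch below is not REDemo's source; the class MatchFindCount and its three methods are my own stand-ins for the Match, Find, and Find All choices, built on Matcher.matches( ) and Matcher.find( ).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFindCount {
    // "Match": the entire text must match the regex.
    static boolean match(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // "Find": the regex must match somewhere in the text.
    static boolean find(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // "Find All": count the non-overlapping occurrences.
    static int findAll(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(findAll("qu", "q"));         // 0: a lone q is not matched
        System.out.println(findAll("qu", "quack q"));   // 1
        System.out.println(findAll("qu", "quack qu"));  // 2
        System.out.println(match("qu", "quack"));       // false: the whole string must match
        System.out.println(find("qu", "quack"));        // true
    }
}
```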

Regexes can do far more than just character matching. For example, the two-character regex ^T would match beginning of line (^) immediately followed by a capital T; that is, any line beginning with a capital T. It doesn't matter whether the line begins with Tiny trumpets, Titanic tubas, or Triumphant slide trombones, as long as the capital T is present in the first position.

But here we're not very far ahead. Have we really invested all this effort in regex technology just to be able to do what we could already do with the java.lang.String method startsWith( ) ? Hmmm, I can hear some of you getting a bit restless. Stay in your seats! What if you wanted to match not only a letter T in the first position, but also a vowel (a, e, i, o, or u) immediately after it, followed by any number of letters in a word, followed by an exclamation point? Surely you could do this in Java by checking startsWith("T") and charAt(1) == 'a' || charAt(1) == 'e', and so on? Yes, but by the time you did that, you'd have written a lot of very highly specialized code that you couldn't use in any other application. With regular expressions, you can just give the pattern ^T[aeiou]\w*!. That is, ^ and T as before, followed by a character class listing the vowels, followed by any number of word characters (\w*), followed by the exclamation point.
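To see the contrast side by side, here is a sketch that does the job both ways. The class VowelAfterT and its methods are hypothetical, and the hand-written version is just one way you might code it; it treats "word character" as the ASCII letters, digits, and underscore, which is what \w means by default.

```java
import java.util.regex.Pattern;

class VowelAfterT {
    private static final Pattern REGEX = Pattern.compile("^T[aeiou]\\w*!");

    // The regex way: one line.
    static boolean byRegex(String s) {
        return REGEX.matcher(s).find();
    }

    // ASCII word character, the default meaning of \w.
    private static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }

    // The hand-written way: specialized code that is no use for any other pattern.
    static boolean byHand(String s) {
        if (s.length() < 3 || s.charAt(0) != 'T') {
            return false;
        }
        if ("aeiou".indexOf(s.charAt(1)) < 0) {
            return false;
        }
        int i = 2;
        while (i < s.length() && isWordChar(s.charAt(i))) {
            i++;
        }
        return i < s.length() && s.charAt(i) == '!';
    }

    public static void main(String[] args) {
        String[] samples = { "Tada! It works", "Trumpets!", "Tiny trumpets", "Tu!" };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + byRegex(samples[i])
                + " / " + byHand(samples[i]));
        }
    }
}
```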

"But wait, there's more!" as my late, great boss Yuri Rubinsky used to say. What if you want to be able to change the pattern you're looking for at runtime? Remember all that Java code you just wrote to match T in column 1, plus a vowel, some word characters, and an exclamation point? Well, it's time to throw it out. Because this morning we need to match Q, followed by a letter other than u, followed by a number of digits, followed by a period. While some of you start writing a new function to do that, the rest of us will just saunter over to the RegEx Bar & Grille, order a ^Q[^u]\d+\. from the bartender, and be on our way.

OK, the [^u] means "match any one character that is not the character u." The \d+ means one or more numeric digits: the + is a multiplier or quantifier meaning one or more occurrences of what it follows, and \d is any one numeric digit. So \d+ means a number with one, two, or more digits. Finally, the \.? Well, . by itself is a metacharacter. Most single metacharacters are switched off by preceding them with an escape character. Not the ESC key on your keyboard, of course. The regex "escape" character is the backslash. Preceding a metacharacter like . with escape turns off its special meaning. Preceding a few selected alphabetic characters (e.g., n, r, t, s, w) with escape turns them into metacharacters.

Figure 4-2 shows the ^Q[^u]\d+\. regex in action. In the first frame, I have typed part of the regex as ^Q[^u and, since there is an unclosed square bracket, the Syntax OK flag is turned off; when I complete the regex, it will be turned back on. In the second frame, I have finished the regex and typed the string as QA577 (which you should expect to match the ^Q[^u]\d+ part, but not the period, since I haven't typed it). In the third frame, I've typed the period so the Matches flag is set to Yes.

Figure 4-2. REDemo with ^Q[^u]\d+\. example
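The same three frames can be approximated in a few lines of code. This is a sketch, not REDemo itself: the class PatternAtRuntime and its methods are invented, with isSyntaxOk standing in for the Syntax OK flag and matches for the Matches flag. Because the regex is just a string, it can arrive at runtime; here it is taken from the command line if one is given.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternAtRuntime {
    // Stand-in for the "Syntax OK" flag: does the regex compile?
    static boolean isSyntaxOk(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Stand-in for the "Matches" flag: the entire text must match.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // The pattern is ordinary data, so it can be supplied when the program runs.
        String regex = args.length > 0 ? args[0] : "^Q[^u]\\d+\\.";

        System.out.println(isSyntaxOk("^Q[^u"));      // false: unclosed square bracket
        System.out.println(isSyntaxOk(regex));
        System.out.println(matches(regex, "QA577"));  // with the default regex: false, no period yet
        System.out.println(matches(regex, "QA577.")); // with the default regex: true
    }
}
```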

One good way to think of regular expressions is as a "little language" for matching patterns of characters in text contained in strings. Give yourself extra points if you've already recognized this as the design pattern known as Interpreter. A regular expression API is an interpreter for matching regular expressions.

So now you should have at least a basic grasp of how regexes work in practice. The rest of this chapter gives more examples and explains some of the more powerful topics, such as capture groups. As for how regexes work in theory (and there are a lot of theoretical details and differences among regex flavors), the interested reader is referred to the book Mastering Regular Expressions. Meanwhile, let's start learning how to write Java programs that use regular expressions.