Pattern | Java In A Nutshell, 5th Edition

Pattern

java.util.regex

Java 1.4

serializable

This class represents a regular expression. It has no public constructor: obtain a Pattern by calling one of the static compile( ) methods, passing the string representation of the regular expression and an optional bitmask of flags that modify the behavior of the regex. pattern( ) and flags( ) return the string form of the regular expression and the bitmask that were passed to compile( ).
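
As a minimal sketch of the calls described above, the following compiles a pattern with two flags and reads both values back. The class name CompileExample and the method wordPattern( ) are hypothetical names chosen for illustration; only compile( ), pattern( ), and flags( ) come from the Pattern API.

```java
import java.util.regex.Pattern;

// CompileExample and wordPattern() are hypothetical names used only for illustration.
final class CompileExample {
    // Combine flags with bitwise OR and pass them as the second argument to compile().
    static Pattern wordPattern() {
        return Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    }

    public static void main(String[] args) {
        Pattern p = wordPattern();
        System.out.println(p.pattern()); // prints [a-z]+
        System.out.println(p.flags());   // prints 10 (CASE_INSENSITIVE = 2, MULTILINE = 8)
    }
}
```

Because a Pattern is immutable once compiled, it is usually reasonable to compile it once and reuse it rather than recompiling for every match.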

If you want to perform only a single match operation with a regular expression, and don't need to use any of the flags, you don't have to create a Pattern object. Simply pass the string representation of the pattern and the CharSequence to be matched to the static matches( ) method, which returns true if the specified pattern matches the complete specified text and false otherwise.
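
A short sketch of this one-shot form follows. MatchesExample and isAllDigits( ) are hypothetical names; the point is that matches( ) succeeds only when the pattern covers the entire input.

```java
import java.util.regex.Pattern;

// MatchesExample and isAllDigits() are hypothetical names used only for illustration.
final class MatchesExample {
    // Returns true only if the regex matches the complete text, not just part of it.
    static boolean isAllDigits(CharSequence text) {
        return Pattern.matches("\\d+", text);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));  // prints true
        System.out.println(isAllDigits("123abc")); // prints false: "abc" is not covered
    }
}
```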

Pattern represents a regular expression, but does not actually define any primitive methods for matching regular expressions to text. To do that, you must create a Matcher object that encapsulates a pattern and the text it is to be compared with. Do this by calling the matcher( ) method and specifying the CharSequence you want to match against. See Matcher for a description of what you can do with it.
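
The following sketch shows one common way to use the Matcher returned by matcher( ): looping over find( ) to collect every match. MatcherExample and findNumbers( ) are hypothetical names, and the find( ) and group( ) calls belong to Matcher rather than Pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// MatcherExample and findNumbers() are hypothetical names used only for illustration.
final class MatcherExample {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Collect every run of digits found anywhere in the input.
    static List<String> findNumbers(CharSequence input) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(input); // binds the pattern to this particular text
        while (m.find()) {                 // advance to the next match, if any
            found.add(m.group());          // the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("10 apples, 25 pears")); // prints [10, 25]
    }
}
```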

The split( ) methods are the exception to the rule that you must obtain a Matcher in order to be able to do anything with a Pattern (although they create and use a Matcher internally). They take a CharSequence as input and split it into substrings, using text that matches the regular expression as the delimiter, and return the substrings as a String[ ]. The two-argument version of split( ) takes an integer argument that specifies the maximum number of substrings to break the input into.
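
As an illustration of both forms of split( ), the sketch below splits on a comma surrounded by optional whitespace. SplitExample, splitAll( ), and splitAtMost( ) are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// SplitExample and its methods are hypothetical names used only for illustration.
final class SplitExample {
    // Delimiter: a comma with optional whitespace on either side.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] splitAll(CharSequence input) {
        return COMMA.split(input);
    }

    // With a limit, the last element holds the unsplit remainder of the input.
    static String[] splitAtMost(CharSequence input, int limit) {
        return COMMA.split(input, limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a, b,c ,d")));       // prints [a, b, c, d]
        System.out.println(Arrays.toString(splitAtMost("a, b,c ,d", 2))); // prints [a, b,c ,d]
    }
}
```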

Pattern defines the following flags, which control various aspects of how regular expression matching is performed:

CANON_EQ: The Unicode standard sometimes allows more than one way to specify the same character. If this flag is set, characters are compared by comparing their full canonical decompositions, so that characters will match even if expressed in different ways. Enabling this flag typically slows down performance. Unlike all the other flags, there is no way to temporarily enable this flag within a pattern.
CASE_INSENSITIVE: Match letters without regard to case. By default, this flag affects only comparisons of ASCII letters. Also set the UNICODE_CASE flag if you want to ignore the case of all Unicode characters. You can enable this flag within a pattern with (?i).
COMMENTS: If this flag is set, then whitespace and comments within a pattern are ignored. Comments are all characters between a # and end of line. You can enable this flag within a pattern with (?x).
DOTALL: If this flag is set, then the . expression matches any character. If it is not set, then it does not match line terminator characters. This is also known as "single-line mode" and you can enable it within a pattern with (?s).
MULTILINE: If this flag is set, then the ^ and $ anchors match not only at the beginning and end of the input string, but also at the beginning and end of any lines within that string. Within a pattern you can enable this flag with (?m).
UNICODE_CASE: If this flag is set along with the CASE_INSENSITIVE flag, then case-insensitive comparison is done for all Unicode letters, rather than just for ASCII letters. You can enable both flags within a pattern with (?iu).
UNIX_LINES: If this flag is set, then only the newline character is considered a line terminator for the purposes of ., ^, and $. If the flag is not set, then newlines (\n), carriage returns (\r), and carriage return newline sequences (\r\n) are all considered line terminators, as are the Unicode characters \u0085 ("next line"), \u2028 ("line separator"), and \u2029 ("paragraph separator"). You can turn this flag on within a pattern with (?d).
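
The sketch below exercises three of the flags listed above: CASE_INSENSITIVE (both as a compile( ) argument and as the embedded (?i) form), MULTILINE, and DOTALL. FlagsExample and its method names are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// FlagsExample and its methods are hypothetical names used only for illustration.
final class FlagsExample {
    // Case-insensitive match requested through the flags bitmask.
    static boolean matchesWithFlag(String input) {
        return Pattern.compile("hello", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // The same behavior requested with the embedded (?i) flag expression.
    static boolean matchesWithEmbeddedFlag(String input) {
        return Pattern.compile("(?i)hello").matcher(input).matches();
    }

    // MULTILINE lets ^ match at the start of every line, not just the start of the input.
    static int countHashLines(String text) {
        Matcher m = Pattern.compile("^#", Pattern.MULTILINE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Without DOTALL, . does not match the newline between 'a' and 'b'.
    static boolean dotMatchesNewline(boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWithFlag("HeLLo"));               // prints true
        System.out.println(matchesWithEmbeddedFlag("HELLO"));       // prints true
        System.out.println(countHashLines("# one\ncode\n# two"));   // prints 2
        System.out.println(dotMatchesNewline(false));               // prints false
        System.out.println(dotMatchesNewline(true));                // prints true
    }
}
```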

Although the API for the Pattern class is quite simple, the syntax for the text representation of regular expressions is fairly complex. A complete tutorial on regular expressions is beyond the scope of this book. Table 16-3 is a quick reference for regular expression syntax, which is very similar to the syntax used in Perl. Note that many of the syntax elements of a regular expression include a backslash character, such as \d to match one of the digits 0-9. Because Java strings also use the backslash character as an escape, you must double the backslashes when expressing a regular expression as a string literal: "\\d". In Java 5.0 and later, the static quote( ) method quotes all special characters in a string so that you can match arbitrary text literally without worrying that punctuation in that text will be interpreted specially. For complete details on regular expressions, see a book like Programming Perl by Larry Wall et al., or Mastering Regular Expressions by Jeffrey E. F. Friedl.
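
To make the backslash doubling and the quote( ) method concrete, here is a small sketch. QuoteExample, isPrice( ), containsLiteral( ), and containsAsRegex( ) are hypothetical names; the sample strings are assumptions chosen so that quoting visibly changes the result.

```java
import java.util.regex.Pattern;

// QuoteExample and its methods are hypothetical names used only for illustration.
final class QuoteExample {
    // In Java source, "\\$\\d+\\.\\d\\d" is the regex \$\d+\.\d\d
    static boolean isPrice(String s) {
        return Pattern.matches("\\$\\d+\\.\\d\\d", s);
    }

    // quote() makes every character of 'literal' match itself.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    // For contrast: the same string interpreted as a regular expression.
    static boolean containsAsRegex(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isPrice("$19.99"));                  // prints true
        System.out.println(containsLiteral("1+1=2", "1+1=2"));  // prints true
        // Unquoted, "1+" means "one or more 1s", so the text no longer matches.
        System.out.println(containsAsRegex("1+1=2", "1+1=2"));  // prints false
    }
}
```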

Table 16-3. Java regular expression quick reference

Syntax	Matches
Single characters
`x`	The character `x` , as long as `x` is not a punctuation character with special meaning in the regular expression syntax.
`\` `p`	The punctuation character `p` .
`\\`	The backslash character
`\n`	Newline character `\u000A` .
`\t`	Tab character `\u0009` .
`\r`	Carriage return character `\u000D` .
`\f`	Form feed character `\u000C` .
`\e`	Escape character `\u001B` .
`\a`	Bell (alert) character `\u0007` .
`\u` `xxxx`	Unicode character with hexadecimal code `xxxx` .
`\x` `xx`	Character with hexadecimal code `xx` .
`\0` `n`	Character with octal code `n` .
`\0` `nn`	Character with octal code `nn` .
`\0` `nnn`	Character with octal code `nnn` , where `nnn` <= 377.
`\c` `x`	The control character `^` `x` .
Character classes
`[...]`	One of the characters between the brackets. Characters may be specified literally, and the syntax also allows the specification of character ranges, with intersection, union, and subtraction operators. See specific examples below.
`[^...]`	Any one character not between the brackets.
`[a-z0-9]`	Character range: a character between `a` and `z` or between `0` and `9`, inclusive.
`[0-9[a-fA-F]]`	Union of classes: same as `[0-9a-fA-F]`
`[a-z&&[aeiou]]`	Intersection of classes: same as `[aeiou]` .
`[a-z&&[^aeiou]]`	Subtraction: the characters `a` through `z` except for the vowels .
`.`	Any character except a line terminator. If the `DOTALL` flag is set, then it matches any character including line terminators.
`\d`	ASCII digit: `[0-9]` .
`\D`	Anything but an ASCII digit: `[^\d]` .
`\s`	ASCII whitespace: `[ \t\n\f\r\x0B]` .
`\S`	Anything but ASCII whitespace: `[^\s]` .
`\w`	ASCII word character: `[a-zA-Z0-9_]` .
`\W`	Anything but ASCII word characters: `[^\w]` .
`\p{` `group` `}`	Any character in the named group. See group names below. Many of the group names are from POSIX, which is why p is used for this character class.
`\P{` `group` `}`	Any character not in the named group.
`\p{Lower}`	ASCII lowercase letter: `[a-z]` .
`\p{Upper}`	ASCII uppercase: `[A-Z]` .
`\p{ASCII}`	Any ASCII character: `[\x00-\x7f]` .
`\p{Alpha}`	ASCII letter: `[a-zA-Z]` .
`\p{Digit}`	ASCII digit: `[0-9]` .
`\p{XDigit}`	Hexadecimal digit: `[0-9a-fA-F]` .
`\p{Alnum}`	ASCII letter or digit: `[\p{Alpha}\p{Digit}]` .
`\p{Punct}`	ASCII punctuation: one of `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ `` .
`\p{Graph}`	visible ASCII character: `[\p{Alnum}\p{Punct}]` .
`\p{Print}`	visible ASCII character: same as `\p{Graph}` .
`\p{Blank}`	ASCII space or tab: `[ \t]` .
`\p{Space}`	ASCII whitespace: `[ \t\n\f\r\x0b]` .
`\p{Cntrl}`	ASCII control character: `[\x00-\x1f\x7f]` .
`\p{` `category` `}`	Any character in the named Unicode category. Category names are one- or two-letter codes defined by the Unicode standard. One-letter codes include `L` for letter, `N` for number, `S` for symbol, `Z` for separator, and `P` for punctuation. Two-letter codes represent subcategories, such as `Lu` for uppercase letter, `Nd` for decimal digit, `Sc` for currency symbol, `Sm` for math symbol, and `Zs` for space separator. See `java.lang.Character` for a set of constants that correspond to these subcategories; however, note that the full set of one- and two-letter codes is not documented in this book.
`\p{` `block` `}`	Any character in the named Unicode block. In Java regular expressions, block names begin with "In", followed by mixed-case capitalization of the Unicode block name, without spaces or underscores. For example: `\p{InOgham}` or `\p{InMathematicalOperators}` . See `java.lang.Character.UnicodeBlock` for a list of Unicode block names.
Sequences, alternatives, groups, and references
`xy`	Match `x` followed by `y` .
`x` `|` `y`	Match `x` or `y` .
`(...)`	Grouping. Group subexpression within parentheses into a single unit that can be used with `*` , `+` , `?` , `|` , and so on. Also "capture" the characters that match this group for use later.
`(?:...)`	Grouping only. Group subexpression as with `( )` , but do not capture the text that matched.
`\` `n`	Match the same characters that were matched when capturing group number `n` was first matched. Be careful when `n` is followed by another digit: the largest number that is a valid group number will be used.
Repetition ^[1]
`x` `?`	zero or one occurrence of `x` ; i.e., `x` is optional.
`x` `*`	zero or more occurrences of `x` .
`x` `+`	one or more occurrences of `x` .
`x` `{` `n` `}`	exactly `n` occurrences of `x` .
`x` `{` `n` `,}`	`n` or more occurrences of `x` .
`x` `{` `n` , `m` `}`	at least `n` , and at most `m` occurrences of `x` .
Anchors ^[2]
`^`	The beginning of the input string, or if the `MULTILINE` flag is specified, the beginning of the string or of any new line.
`$`	The end of the input string, or if the `MULTILINE` flag is specified, the end of the string or of any line within the string.
`\b`	A word boundary: a position in the string between a word and a nonword character.
`\B`	A position in the string that is not a word boundary.
`\A`	The beginning of the input string. Like `^` , but never matches the beginning of a new line, regardless of what flags are set.
`\Z`	The end of the input string, ignoring any trailing line terminator.
`\z`	The end of the input string, including any line terminator.
`\G`	The end of the previous match.
`(?=` `x` `)`	A positive look-ahead assertion. Require that the following characters match `x` , but do not include those characters in the match.
`(?!` `x` `)`	A negative look-ahead assertion. Require that the following characters do not match the pattern `x` .
`(?<=` `x` `)`	A positive look-behind assertion. Require that the characters immediately before the position match `x` , but do not include those characters in the match. `x` must be a pattern with a fixed number of characters.
`(?<!` `x` `)`	A negative look-behind assertion. Require that the characters immediately before the position do not match `x` . `x` must be a pattern with a fixed number of characters.
Miscellaneous
`(?>` `x` `)`	Match `x` independently of the rest of the expression, without considering whether the match causes the rest of the expression to fail to match. Useful to optimize certain complex regular expressions. A group of this form does not capture the matched text.
`(?` `onflags` `-` `offflags` `)`	Don't match anything, but turn on the flags specified by `onflags` , and turn off the flags specified by `offflags` . These two strings are combinations in any order of the following letters and correspond to the following `Pattern` constants: `i` ( `CASE_INSENSITIVE` ), `d` ( `UNIX_LINES` ), `m` ( `MULTILINE` ), `s` ( `DOTALL` ), `u` ( `UNICODE_CASE` ), and `x` ( `COMMENTS` ). Flag settings specified in this way take effect at the point that they appear in the expression and persist until the end of the expression, or until the end of the parenthesized group of which they are a part, or until overridden by another flag setting expression.
`(?` `onflags` `-` `offflags` : `x` `)`	Match `x` , applying the specified flags to this subexpression only. This is a noncapturing group, like `(?:...)` , with the addition of flags.
`\Q`	Don't match anything, but quote all subsequent pattern text until `\E` . All characters within such a quoted section are interpreted as literal characters to match, and none (except `\E` ) have special meanings.
`\E`	Don't match anything; terminate a quote started with `\Q` .
`#` `comment`	If the `COMMENTS` flag is set, pattern text between a `#` and the end of the line is considered a comment and is ignored.
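
Two entries from the table, character class subtraction and backreferences, can be sketched as follows. SyntaxExample, isAllConsonants( ), and hasRepeatedWord( ) are hypothetical names, and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// SyntaxExample and its methods are hypothetical names used only for illustration.
final class SyntaxExample {
    // [a-z&&[^aeiou]] is the lowercase ASCII letters with the five vowels subtracted.
    static boolean isAllConsonants(String s) {
        return Pattern.matches("[a-z&&[^aeiou]]+", s);
    }

    // \1 must match the same text that capturing group 1 matched.
    // The \b anchors keep the match from starting or ending in the middle of a word.
    static boolean hasRepeatedWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllConsonants("rhythm"));        // prints true
        System.out.println(isAllConsonants("regex"));         // prints false
        System.out.println(hasRepeatedWord("it is is fine")); // prints true
        System.out.println(hasRepeatedWord("this is fine"));  // prints false
    }
}
```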

^[1] These repetition characters are known as "greedy quantifiers," because they match as many occurrences of x as possible while still allowing the rest of the regular expression to match. If you want a "reluctant quantifier" which matches as few occurrences as possible while still allowing the rest of the regular expression to match, follow the quantifiers above with a question mark. For example, use *? instead of *, and use {2,}? instead of {2,}. Or, if you follow a quantifier with a plus sign instead of a question mark, then you specify a "possessive quantifier" which matches as many occurrences as possible, even if it means that the rest of the regular expression will not match. Possessive quantifiers can be useful when you are sure that they will not adversely affect the rest of the match, because they can be implemented more efficiently than regular "greedy quantifiers."
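
The difference between the three kinds of quantifier is easiest to see by running the same input through each. In this sketch, QuantifierExample and firstMatch( ) are hypothetical names, and the input string is an assumption chosen so that each quantifier gives a different result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// QuantifierExample and firstMatch() are hypothetical names used only for illustration.
final class QuantifierExample {
    // Returns the text of the first match, or null if the regex matches nowhere.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes as much as it can while still leaving a final '>'.
        System.out.println(firstMatch("<.+>", input));  // prints <a><b>
        // Reluctant: .+? takes as little as it can before the first '>'.
        System.out.println(firstMatch("<.+?>", input)); // prints <a>
        // Possessive: .++ consumes the final '>' and never gives it back, so no match.
        System.out.println(firstMatch("<.++>", input)); // prints null
    }
}
```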

^[2] Anchors do not match characters but instead match the zero-width positions between characters, "anchoring" the match to a position at which a specific condition holds.
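
Because anchors and look-around assertions match positions rather than characters, a pattern built only from them can be used to insert text without consuming any. The sketch below is one possible illustration: AnchorExample, groupDigits( ), and containsWord( ) are hypothetical names, and replaceAll( ) is a method of Matcher, not of Pattern.

```java
import java.util.regex.Pattern;

// AnchorExample and its methods are hypothetical names used only for illustration.
final class AnchorExample {
    // A zero-width position that is preceded by a digit and followed by
    // one or more complete groups of three digits up to the end of the input.
    private static final Pattern THOUSANDS = Pattern.compile("(?<=\\d)(?=(\\d{3})+$)");

    // Insert a comma at each such position; no digits are removed or replaced.
    static String groupDigits(String digits) {
        return THOUSANDS.matcher(digits).replaceAll(",");
    }

    // \b on both sides requires the word to stand alone rather than inside a longer word.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(groupDigits("1234567"));             // prints 1,234,567
        System.out.println(containsWord("the cat sat", "cat")); // prints true
        System.out.println(containsWord("concatenate", "cat")); // prints false
    }
}
```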

Figure 16-132. java.util.regex.Pattern

public final class Pattern implements Serializable {
// No Constructor
// Public Constants
     public static final int CANON_EQ;              =128
     public static final int CASE_INSENSITIVE;      =2
     public static final int COMMENTS;              =4
     public static final int DOTALL;                =32
5.0  public static final int LITERAL;               =16
     public static final int MULTILINE;             =8
     public static final int UNICODE_CASE;          =64
     public static final int UNIX_LINES;            =1
// Public Class Methods
     public static Pattern compile(String regex);
     public static Pattern compile(String regex, int flags);
     public static boolean matches(String regex, CharSequence input);
5.0  public static String quote(String s);
// Public Instance Methods
     public int flags( );
     public Matcher matcher(CharSequence input);
     public String pattern( );
     public String[ ] split(CharSequence input);
     public String[ ] split(CharSequence input, int limit);
// Public Methods Overriding Object
5.0  public String toString( );
}

Passed To

java.util.Scanner.{findInLine( ), findWithinHorizon( ), hasNext( ), next( ), skip( ), useDelimiter( )}, Matcher.usePattern( )

Returned By

java.util.Scanner.delimiter( ), Matcher.pattern( )