8.1. Java's Regex Flavor

java.util.regex is powered by a Traditional NFA, so the rich set of lessons from Chapters 4, 5, and 6 applies. Table 8-2 on the facing page summarizes its metacharacters. Certain aspects of the flavor are modified by a variety of match modes, turned on via flags to the various methods and factories, or turned on and off via (?mods-mods) and (?mods-mods:…) modifiers embedded within the regular expression itself. The modes are listed in Table 8-3 on page 368.

Table 8-2. Overview of Sun's java.util.regex Flavor

`Character Shorthands` ^[1]
☞ 115	(c)	`\a [\b] \e \f \n \r \t \0octal \x## \u#### \cchar`
`Character Classes and Class-Like Constructs`
☞ 118	(c)	Classes: `[…]` `[^…]` (may contain class set operators ☞ 125)
☞ 119		Almost any character: dot (various meanings, changes with modes)
☞ 120	(c)	Class shorthands: ^[2] `\w \d \s \W \D \S`
☞ 121	(c)	Unicode properties and blocks: ^[3] `\p{Prop}` `\P{Prop}`
`Anchors and Other Zero-Width Tests`
☞ 370		Start of line/string: `^ \A`
☞ 370		End of line/string: `$ \z \Z`
☞ 130		Start of current match: `\G`
☞ 133		Word boundary: ^[4] `\b \B`
☞ 133		Lookaround: ^[5] `(?=…) (?!…) (?<=…) (?<!…)`
`Comments and Mode Modifiers`
☞ 135		Mode modifiers: `(?mods-mods)` Modifiers allowed: `x d s m i u`
☞ 135		Mode-modified spans: `(?mods-mods:…)`
☞ 368	(c)	Comments: From `#` until newline (only when enabled) ^[6]
☞ 113	(c)	Literal-text mode: ^[7] `\Q…\E`
`Grouping and Capturing`
☞ 137		Capturing parentheses: `(…) \1 \2` ...
☞ 137		Grouping-only parentheses: `(?:…)`
☞ 139		Atomic grouping: `(?>…)`
☞ 139		Alternation: `|`
☞ 141		Greedy quantifiers: `* + ? {n} {n,} {x,y}`
☞ 141		Lazy quantifiers: `*? +? ?? {n}? {n,}? {x,y}?`
☞ 142		Possessive quantifiers: `*+ ++ ?+ {n}+ {n,}+ {x,y}+`

(c) may also be used within a character class. ^[1] … ^[7] see text

These notes augment Table 8-2:

^[1] \b is a character shorthand for backspace only within a character class. Outside of a character class, \b matches a word boundary (☞ 133).

The table shows "raw" backslashes, not the doubled backslashes required when regular expressions are provided as Java string literals. For example, \n in the table must be written as "\\n" as a Java string. See "Strings as Regular Expressions" (☞ 101).

\x## allows exactly two hexadecimal digits, e.g., \xFCber matches 'über'.

\u#### allows exactly four hexadecimal digits, e.g., \u00FCber matches 'über', and \u20AC matches '€'.

\0octal requires a leading zero, followed by one to three octal digits.

\cchar is case sensitive, blindly xoring the ordinal value of the following character with 0x40. This bizarre behavior means that, unlike any other flavor I've ever seen, \cA and \ca are different. Use uppercase letters to get the traditional meaning (\cA for \x01). As it happens, \ca is the same as \x21, matching '!'.
^[2] \w, \d, and \s (and their uppercase counterparts) match only ASCII characters, and don't include the other alphanumerics, digits, or whitespace in Unicode. That is, \d is exactly the same as [0-9], \w is the same as [0-9a-zA-Z_], and \s is the same as [ \t\n\f\r\x0B] (\x0B is the little-used ASCII VT character). For full Unicode coverage, you can use Unicode properties (☞ 121): use \p{L} for \w, use \p{Nd} for \d, and use \p{Z} for \s. (Use the \P{…} version of each for \W, \D, and \S.)
^[3] \p{…} and \P{…} support Unicode properties and blocks, and some additional "Java properties." Unicode scripts are not supported. Details follow on the facing page.
^[4] The \b and \B word boundary metacharacters' idea of a "word character" is not the same as that of \w and \W. The word boundaries understand the properties of Unicode characters, while \w and \W match only ASCII characters.
^[5] Lookahead constructs can employ arbitrary regular expressions, but lookbehind is restricted to subexpressions whose possible matches are finite in length. This means, for example, that ? is allowed within lookbehind, but * and + are not. See the description in Chapter 3, starting on page 133.
^[6] #… sequences are taken as comments only under the influence of the x modifier, or when the Pattern.COMMENTS option is used (☞ 368). (Don't forget to add newlines to multiline string literals, as in the example on page 401.) Unescaped ASCII whitespace is ignored. Note: unlike most regex engines that support this type of mode, comments and free whitespace are recognized within character classes.
^[7] \Q…\E has always been supported, but its use entirely within a character class was buggy and unreliable until Java 1.6.
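Several of the notes above can be spot-checked with a few lines of Java. The following is a minimal sketch rather than anything from the original text: the class name `FlavorNotesDemo` and the helper `full` are hypothetical, and the comments on expected results simply restate what the notes on character shorthands, class shorthands, comments, and literal-text mode describe.

```java
import java.util.regex.Pattern;

class FlavorNotesDemo {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Same check, with compile-time flags such as Pattern.COMMENTS.
    static boolean full(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Regex backslashes are doubled inside Java string literals.
        System.out.println(full("\\xFCber", "\u00FCber"));  // two hex digits: u-umlaut
        System.out.println(full("\\u20AC", "\u20AC"));      // four hex digits: euro sign
        System.out.println(full("\\0101", "A"));            // octal 101 is 'A'
        System.out.println(full("\\cA", "\u0001"));         // uppercase: control-A
        System.out.println(full("\\ca", "!"));              // 0x61 xor 0x40 is 0x21

        // Class shorthands are ASCII-only; Unicode properties are broader.
        String arabicIndicThree = "\u0663";
        System.out.println(full("\\d", arabicIndicThree));      // false
        System.out.println(full("\\p{Nd}", arabicIndicThree));  // true
        System.out.println(full("\\w", "\u00E9"));              // false
        System.out.println(full("\\p{L}", "\u00E9"));           // true

        // COMMENTS mode: each line of the literal needs its own newline.
        String yearMonth = "(\\d{4})  # year\n"
                         + "-         # separator\n"
                         + "(\\d{2})  # month\n";
        System.out.println(full(yearMonth, Pattern.COMMENTS, "2024-05"));  // true

        // Free whitespace is dropped even inside a character class.
        System.out.println(full("[a b]", Pattern.COMMENTS, " "));  // false

        // Literal-text span: + is not a quantifier here.
        System.out.println(full("\\Q1+1=2\\E", "1+1=2"));  // true
    }
}
```

Non-ASCII characters are written as Unicode escapes in the Java source so that the example does not depend on the file's encoding.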

Table 8-3. The java.util.regex Match and Regex Modes

Compile-Time Option	`(?mode)`	Description
`Pattern.UNIX_LINES`	`d`	Changes how dot and `^` match (☞ 370)
`Pattern.DOTALL`	`s`	Causes dot to match any character (☞ 111)
`Pattern.MULTILINE`	`m`	Expands where `^` and `$` can match (☞ 370)
`Pattern.COMMENTS`	`x`	Free-spacing and comment mode (☞ 72) (Applies even inside character classes)
`Pattern.CASE_INSENSITIVE`	`i`	Case-insensitive matching for ASCII characters
`Pattern.UNICODE_CASE`	`u`	Case-insensitive matching for non-ASCII characters
`Pattern.CANON_EQ`		Unicode "canonical equivalence" match mode (different encodings of the same character match as identical; ☞ 108)
`Pattern.LITERAL`		Treat the regex argument as plain, literal text instead of as a regular expression
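The two ways of selecting a mode, a flag passed to the factory or a modifier embedded in the regex, can be compared directly. The sketch below is illustrative only; the class name `MatchModesDemo` and the helper `full` are hypothetical, and the sample strings are arbitrary.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // Hypothetical helper: true if the regex, compiled with the flags, matches the whole input.
    static boolean full(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // A compile-time flag and the embedded modifier select the same mode.
        System.out.println(full("java", Pattern.CASE_INSENSITIVE, "JAVA"));  // true
        System.out.println(full("(?i)java", 0, "JAVA"));                     // true

        // A mode-modified span limits the mode to part of the regex.
        System.out.println(full("(?i:j)ava", 0, "Java"));  // true
        System.out.println(full("(?i:j)ava", 0, "JAVA"));  // false

        // The i mode alone covers ASCII letters only; adding u extends it.
        String lowerEAcute = "\u00E9";
        String upperEAcute = "\u00C9";
        System.out.println(full("(?i)" + lowerEAcute, 0, upperEAcute));   // false
        System.out.println(full("(?iu)" + lowerEAcute, 0, upperEAcute));  // true

        // LITERAL has no embedded form: the dot is an ordinary character.
        System.out.println(full("a.c", Pattern.LITERAL, "a.c"));  // true
        System.out.println(full("a.c", Pattern.LITERAL, "abc"));  // false
    }
}
```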

8.1.1. Java Support for `\p{…}` and `\P{…}`

The \p{…} and \P{…} constructs support Unicode properties and blocks, as well as special "Java" character properties. Unicode support is as of Unicode Version 4.0.0. (Java 1.4.2's support is only as of Unicode Version 3.0.0.)

8.1.1.1. Unicode properties

Unicode properties are referenced via short names such as \p{Lu}. (See the list on page 122.) One-letter property names may omit the braces: \pL is the same as \p{L}. The long names such as \p{Lowercase_Letter} are not supported.

In Java 1.5 and earlier, the Pi and Pf properties are not supported, and as such, characters with those properties are not matched by \p{P}. (Java 1.6 supports them.)

The "other stuff" property \p{C} doesn't match code points matched by the "unassigned code points" property \p{Cn}.

The \p{L&} composite property is not supported.

The pseudo-property \p{all} is supported and is equivalent to (?s:.). The \p{assigned} and \p{unassigned} pseudo-properties are not supported, but you can use \P{Cn} and \p{Cn} instead.
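As a quick illustration of the property syntax, the following sketch exercises the short names, the brace-less one-letter form, and the \p{all} pseudo-property. The class name `UnicodePropertyDemo` and the helper `full` are hypothetical, and the sample characters are arbitrary choices.

```java
import java.util.regex.Pattern;

class UnicodePropertyDemo {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";  // a lowercase letter outside ASCII

        // One-letter names may drop the braces.
        System.out.println(full("\\pL", eAcute));     // true
        System.out.println(full("\\p{L}", eAcute));   // true

        // Two-letter short names, and the negated form.
        System.out.println(full("\\p{Lu}", "A"));     // true
        System.out.println(full("\\p{Lu}", "a"));     // false
        System.out.println(full("\\P{Lu}", "a"));     // true

        // The all pseudo-property matches anything, including a newline.
        System.out.println(full("\\p{all}", "\n"));   // true

        // "Assigned" is spelled as the negation of Cn.
        System.out.println(full("\\P{Cn}", "A"));     // true
    }
}
```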

8.1.1.2. Unicode blocks

Unicode blocks are supported, requiring an 'In' prefix. See page 402 for version-specific details on how block names can appear within \p{…} and \P{…}.

For backward compatibility, two Unicode blocks whose names changed between Unicode Versions 3.0 and 4.0 are accessible by either name as of Java 1.5. The extra non-Unicode-4.0 names "Combining Marks for Symbols" and "Greek" can now be used in addition to the Unicode 4.0 standard names "Combining Diacritical Marks for Symbols" and "Greek and Coptic".

A Java 1.4.2 bug involving the Arabic Presentation Forms-B and Latin Extended-B block names has been fixed as of Java 1.5 (☞ 403).
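The 'In' prefix and the two names for the Greek block can be tried out as follows. This is a sketch assuming Java 1.5 or later; the class name `UnicodeBlockDemo` and the helper `full` are hypothetical, and the block names are written with their spaces removed, which is one of the accepted spellings.

```java
import java.util.regex.Pattern;

class UnicodeBlockDemo {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String alpha = "\u03B1";  // GREEK SMALL LETTER ALPHA

        // Older block name and the Unicode 4.0 name refer to the same block.
        System.out.println(full("\\p{InGreek}", alpha));           // true
        System.out.println(full("\\p{InGreekandCoptic}", alpha));  // true

        // The uppercase-P form matches characters outside the block.
        System.out.println(full("\\P{InGreek}", "a"));             // true
        System.out.println(full("\\p{InGreek}", "a"));             // false
    }
}
```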

8.1.1.3. Special Java character properties

Starting in Java 1.5.0, the \p{…} and \P{…} constructs include support for the non-deprecated isSomething methods in java.lang.Character. To access the method functionality within a regex, replace the method name's leading 'is' with 'java', and use that within \p{…} or \P{…}. For example, characters matched by java.lang.Character.is can be matched from within a regex by \p{java } . (See the java.lang.Character class documentation for a complete list of applicable methods.)
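To make the naming rule concrete, the sketch below pairs one such method with its regex counterpart. The choice of `Character.isLowerCase` and `\p{javaLowerCase}` is this example's own, picked as a representative pair, and the class name `JavaPropertyDemo` and helper `regexSaysLower` are hypothetical.

```java
import java.util.regex.Pattern;

class JavaPropertyDemo {
    // isLowerCase -> javaLowerCase: the leading "is" is replaced by "java".
    static final Pattern LOWER = Pattern.compile("\\p{javaLowerCase}");

    // Hypothetical helper: asks the regex the same question as Character.isLowerCase.
    static boolean regexSaysLower(char c) {
        return LOWER.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        char[] samples = { 'a', 'Z', '9', ' ', '\u00E9', '\u00C9' };
        for (char c : samples) {
            // Print the code point in hex so the output is encoding-independent.
            System.out.printf("U+%04X method=%b regex=%b%n",
                    (int) c, Character.isLowerCase(c), regexSaysLower(c));
        }
    }
}
```

For each sample character, the two columns should agree, since the regex property is defined in terms of the method.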

8.1.2. Unicode Line Terminators

In traditional pre-Unicode regex flavors, a newline (ASCII LF character) is treated specially by dot, ^, $, and \Z. In Java, most Unicode line terminators (☞ 109) also receive this special treatment.

Java normally considers the following as line terminators:

Character Codes	Nicknames	Description
`U+000A`	LF `\n`	ASCII Line Feed ("newline")
`U+000D`	CR `\r`	ASCII Carriage Return
`U+000D U+000A`	CR/LF `\r\n`	ASCII Carriage Return / Line Feed sequence
`U+0085`	NEL	Unicode NEXT LINE
`U+2028`	LS	Unicode LINE SEPARATOR
`U+2029`	PS	Unicode PARAGRAPH SEPARATOR

The characters and situations that are treated specially by dot, ^, $, and \Z change depending on which match modes (☞ 368) are in effect:

Match Mode	Affects	Description
UNIX_LINES	^ . $ \Z	Revert to traditional newline-only line-terminator semantics.
MULTILINE	^ $	Add embedded line terminators to the list of locations after which `^` and before which `$` can match.
DOTALL	.	Line terminators are no longer special to dot; it matches any character.
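The effect of these modes can be observed with a lone dot and with `^` in MULTILINE mode. The following is a minimal sketch; the class name `LineTerminatorModesDemo` and its two helpers are hypothetical, and NEL and LS are used only as representative non-LF line terminators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorModesDemo {
    // True if a lone dot matches the whole one-character input under the given flags.
    static boolean dotMatches(int flags, String ch) {
        return Pattern.compile(".", flags).matcher(ch).matches();
    }

    // Counts the positions where ^ matches in MULTILINE mode plus any extra flags.
    static int lineStarts(int extraFlags, String text) {
        Matcher m = Pattern.compile("^", Pattern.MULTILINE | extraFlags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String nel = "\u0085";  // Unicode NEXT LINE
        System.out.println(dotMatches(0, nel));                    // false: NEL is a line terminator
        System.out.println(dotMatches(Pattern.UNIX_LINES, nel));   // true: only LF stays special
        System.out.println(dotMatches(Pattern.UNIX_LINES, "\n"));  // false
        System.out.println(dotMatches(Pattern.DOTALL, "\n"));      // true: dot matches anything

        String text = "one\u2028two";  // two lines split by LINE SEPARATOR
        System.out.println(lineStarts(0, text));                   // 2
        System.out.println(lineStarts(Pattern.UNIX_LINES, text));  // 1: LS no longer ends a line
    }
}
```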

The two-character CR/LF line-terminator sequence deserves special mention. By default, when the full complement of line terminators is recognized (that is, when UNIX_LINES is not used), a CR/LF sequence is treated as an atomic unit by the line-boundary metacharacters, and they can't match between the sequence's two characters.

For example, $ and \Z can normally match just before a line terminator. LF is a line terminator, but $ and \Z can match before a string-ending LF only when it is not part of a CR/LF sequence (that is, when the LF is not preceded by a CR).

This extends to $ and ^ in MULTILINE mode, where ^ can match after an embedded CR only when that CR is not followed by a LF, and $ can match before an embedded LF only when that LF is not preceded by a CR.

To be clear, DOTALL has no effect on how CR/LF sequences are treated (DOTALL affects only dot, which always considers characters individually), and UNIX_LINES removes the issue altogether (it renders CR and all the other non-newline line terminators unspecial).
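The CR/LF rules above can be checked with a string that ends in a CR/LF pair. This is a sketch under the behavior just described; the class name `CrLfDemo` and the helpers `found` and `lineStartOffsets` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrLfDemo {
    // True if the regex, compiled with the flags, matches somewhere in the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    // Offsets at which ^ matches in MULTILINE mode.
    static List<Integer> lineStartOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("^", Pattern.MULTILINE).matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "abc\r\n";

        // $ can match before the string-ending CR/LF pair as a whole...
        System.out.println(found("abc$", 0, text));                       // true
        // ...but not between the CR and the LF.
        System.out.println(found("abc\\r$", 0, text));                    // false
        // With UNIX_LINES, CR is ordinary and $ matches before the final LF.
        System.out.println(found("abc\\r$", Pattern.UNIX_LINES, text));   // true

        // In MULTILINE mode, ^ matches after the pair, not inside it.
        System.out.println(lineStartOffsets("a\r\nb"));  // [0, 3]
        // A CR that is not followed by LF ends a line by itself.
        System.out.println(lineStartOffsets("a\rb"));    // [0, 2]
    }
}
```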