Recipe 4.9 Matching Newlines in Text


You need to match newlines in text.


Use \n or \r.

See also the flag constant Pattern.MULTILINE, which makes ^ and $ match at the beginning and end of each line, not just at the beginning and end of the whole string.
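As a rough sketch of both suggestions, the following class counts explicit line endings with \n and \r in the pattern, and then uses Pattern.MULTILINE to find lines that start with a given word. The class name NewlineFinder and its two helper methods are made up for this illustration; only Pattern, Matcher, and their standard methods come from the API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the recipe's own code.
class NewlineFinder {
    // Try \r\n first so a Windows line ending is counted once, not twice.
    private static final Pattern LINE_END = Pattern.compile("\\r\\n|\\n|\\r");

    /** Count the line endings (\r\n, \n, or \r) in the text. */
    static int countLineEndings(String text) {
        Matcher m = LINE_END.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Count the lines that begin with the given literal word. */
    static int countLinesStartingWith(String word, String text) {
        // MULTILINE lets ^ match after each line terminator,
        // not only at the start of the whole input.
        Pattern p = Pattern.compile("^" + Pattern.quote(word), Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\r\nthree\rtwo more";
        System.out.println("line endings: " + countLineEndings(text));
        System.out.println("lines starting with \"two\": "
            + countLinesStartingWith("two", text));
    }
}
```

Listing \r\n before the single-character alternatives is a design choice: it keeps a Windows-style ending from being counted as two separate terminators.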


While line-oriented tools from Unix such as sed and grep match regular expressions one line at a time, not all tools do. The sam text editor from Bell Laboratories was the first interactive tool I know of to allow multiline regular expressions; the Perl scripting language followed shortly. In the Java API, the newline character by default has no special significance to ^ and $. The BufferedReader method readLine( ) normally strips out whichever newline characters it finds. If you read in gobs of characters using some method other than readLine( ), you may have some number of \n, \r, or \r\n sequences in your text string.[4] Normally all of these are recognized as line terminators. If you want only \n to be recognized, use the UNIX_LINES flag to the Pattern.compile( ) method.

[4] Or a few related Unicode characters, including the next-line (\u0085), line-separator (\u2028), and paragraph-separator (\u2029) characters.
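To make the effect of UNIX_LINES concrete, here is a minimal sketch that looks for a line beginning with "beta" in two strings, one using \n and one using a bare \r as the line ending. The class name UnixLinesSketch, the method lineStartsWithBeta, and the sample strings are assumptions made for the illustration. MULTILINE is always set so that ^ can match after a line terminator at all.

```java
import java.util.regex.Pattern;

// Hypothetical class for illustration only.
class UnixLinesSketch {
    /**
     * True if some line of the text starts with "beta".
     * MULTILINE is always on; extraFlags is 0 or Pattern.UNIX_LINES.
     */
    static boolean lineStartsWithBeta(String text, int extraFlags) {
        Pattern p = Pattern.compile("^beta", Pattern.MULTILINE | extraFlags);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String unixText = "alpha\nbeta";   // line ends with \n
        String crText = "alpha\rbeta";     // line ends with a bare \r

        System.out.println("\\n text, default:    "
            + lineStartsWithBeta(unixText, 0));
        System.out.println("\\n text, UNIX_LINES: "
            + lineStartsWithBeta(unixText, Pattern.UNIX_LINES));
        System.out.println("\\r text, default:    "
            + lineStartsWithBeta(crText, 0));
        // With UNIX_LINES, \r no longer counts as a line terminator,
        // so "beta" is not at the start of a line here.
        System.out.println("\\r text, UNIX_LINES: "
            + lineStartsWithBeta(crText, Pattern.UNIX_LINES));
    }
}
```

Only the last of the four cases should fail to match, since it is the only one where the line ending is not \n and UNIX_LINES is in effect.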

In Unix, ^ and $ are commonly used to match the beginning or end of a line, respectively. In this API, the regex metacharacters ^ and $ by default ignore line terminators inside the text and match only at the beginning and the end, respectively, of the entire string (or, in the case of $, just before a final line terminator). However, if you pass the MULTILINE flag into Pattern.compile( ), these expressions match just after or just before, respectively, a line terminator; $ also matches the very end of the string. Apart from that, the line ending is an ordinary character: if you want to know exactly where it is, \n or \r in the pattern matches it, as do expressions such as \s. The one exception is the wildcard ., which does not match a line terminator unless you also pass the DOTALL flag. See the sidebar Pattern.compile( ) Flags. An example of newline matching is shown in Example 4-6.

Example 4-6.
import java.util.regex.*;

/**
 * Show line ending matching using regex class.
 * @author Ian F. Darwin,
 * @version $Id: ch04.xml,v 1.4 2004/05/04 20:11:27 ian Exp $
 */
public class NLMatch {
    public static void main(String[] argv) {

        String input = "I dream of engines\nmore engines, all day long";
        System.out.println("INPUT: " + input);
        System.out.println( );

        String[] patt = {
            "engines.more engines",
            "engines$"
        };

        for (int i = 0; i < patt.length; i++) {
            System.out.println("PATTERN " + patt[i]);

            boolean found;
            Pattern p1l = Pattern.compile(patt[i]);
            found = p1l.matcher(input).find( );
            System.out.println("DEFAULT match " + found);

            Pattern pml = Pattern.compile(patt[i],
                Pattern.DOTALL|Pattern.MULTILINE);
            found = pml.matcher(input).find( );
            System.out.println("MultiLine match " + found);
            System.out.println( );
        }
    }
}

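If you run this code, the first pattern (with the wildcard character .) matches only when Pattern.DOTALL is set, because by default . does not match a line terminator. The second pattern (with $) matches only when Pattern.MULTILINE is set. Since the program sets both flags together, both patterns fail under the defaults and succeed with the flags.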

> java NLMatch
INPUT: I dream of engines
more engines, all day long

PATTERN engines.more engines
DEFAULT match false
MultiLine match true

PATTERN engines$
DEFAULT match false
MultiLine match true
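Because Example 4-6 sets DOTALL and MULTILINE together, it does not show which flag each pattern depends on. The following sketch, which assumes the same input string and patterns, tries each flag on its own. The class name FlagBreakdown and the helper method found are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class for illustration; reuses the input from Example 4-6.
class FlagBreakdown {
    static final String INPUT =
        "I dream of engines\nmore engines, all day long";

    /** True if the regex, compiled with the given flags, is found in INPUT. */
    static boolean found(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(INPUT).find();
    }

    public static void main(String[] args) {
        String[] patterns = { "engines.more engines", "engines$" };
        for (String regex : patterns) {
            System.out.println("PATTERN " + regex);
            System.out.println("  no flags:  " + found(regex, 0));
            // DOTALL lets . match a line terminator.
            System.out.println("  DOTALL:    " + found(regex, Pattern.DOTALL));
            // MULTILINE lets ^ and $ match at line terminators.
            System.out.println("  MULTILINE: " + found(regex, Pattern.MULTILINE));
        }
    }
}
```

The pattern containing . should succeed only under DOTALL, and the pattern containing $ only under MULTILINE, which is why the combined flags in the example make both match.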

