Recipe 4.8 Matching "Accented" or Composite Characters

Problem

You want characters to match regardless of the form in which they are entered.

Solution

Compile the Pattern with Pattern.CANON_EQ as the flags argument, for "canonical equivalence."
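As a minimal sketch of the difference the flag makes, the following compares the same pattern compiled with and without Pattern.CANON_EQ. The class name CanonEqSketch and the helper method matches are hypothetical names chosen for illustration; only Pattern.compile, Pattern.CANON_EQ, and Matcher.matches come from the java.util.regex API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the class and helper names are illustrative only.
class CanonEqSketch {

    // Compile the regex with the given flags and test the whole input against it.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9"; // e-acute as a single code point
        String decomposed = "e\u0301"; // e followed by the combining acute accent

        // Without the flag, the two forms are compared code point by code point.
        System.out.println("without CANON_EQ: "
                + matches(precomposed, 0, decomposed));

        // With the flag, canonically equivalent forms are treated as the same.
        System.out.println("with CANON_EQ:    "
                + matches(precomposed, Pattern.CANON_EQ, decomposed));
    }
}
```

The flag is passed per pattern, so it can be combined with other flags using the bitwise OR operator, for example Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE.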

Discussion

Composite characters can be entered in various forms. Consider, as a single example, the letter e with an acute accent. This character may appear in Unicode text either as the single character é (Unicode character \u00e9) or as a two-character sequence (e followed by the Unicode combining acute accent, \u0301). To let you match such characters regardless of which of the possible precomposed or "fully decomposed" forms was used to enter them, the regex package has an option for "canonical matching," which treats all of these forms as equivalent. This option is enabled by passing CANON_EQ as (one of) the flags in the second argument to Pattern.compile(). This program shows CANON_EQ being used to match several forms:

import java.util.regex.*;

/**
 * CanonEqDemo - show use of Pattern.CANON_EQ, by comparing various ways of
 * entering the French word for "equal" and seeing if they are considered equal
 * by the regex-matching engine.
 */
public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // égal
        String[] input = {
                "\u00e9gal", // égal - this one had better match :-)
                "e\u0301gal", // e + "Combining acute accent"
                "e\u02cagal", // e + "modifier letter acute accent"
                "e'gal", // e + single quote
                "e\u00b4gal", // e + Latin-1 "acute"
        };
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches()) {
                System.out.println(pattStr + " matches input " + input[i]);
            } else {
                System.out.println(pattStr + " does not match input " + input[i]);
            }
        }
    }
}

When you run this program on JDK 1.4 or later, it correctly matches the "combining accent" form and rejects the others. Some of the rejected characters, unfortunately, look like the accent when printed, but they are not considered "combining accent" characters.

égal matches input égal
égal matches input e?gal
égal does not match input e?gal
égal does not match input e'gal
égal does not match input e´gal
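One way to see why only the combining accent is accepted is to normalize each input and compare the results. The sketch below is not part of the original recipe: it assumes java.text.Normalizer (available from Java 6, so later than the JDK 1.4 mentioned above) and Java 8 for String.codePoints, and the class name AccentFormsSketch and its helper methods are hypothetical. It composes each string to NFC form and checks whether the result equals the precomposed word.

```java
import java.text.Normalizer;

// Hypothetical demo class; assumes Java 8 or later for String.codePoints().
class AccentFormsSketch {

    // Two strings are canonically equivalent when their NFC forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    // Render a string as a list of code points, e.g. "U+0065 U+0301".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String target = "\u00e9gal"; // precomposed form used as the pattern
        String[] inputs = {
                "e\u0301gal", // e + combining acute accent
                "e\u02cagal", // e + modifier letter acute accent
                "e'gal",      // e + single quote
                "e\u00b4gal", // e + Latin-1 acute accent
        };
        for (String input : inputs) {
            System.out.println(codePoints(input) + " -> "
                    + canonicallyEqual(target, input));
        }
    }
}
```

Printing the code points rather than the characters themselves avoids the question marks that appear when the console cannot display a character.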

For more details, see the character charts at http://www.unicode.org/.



Java Cookbook, Second Edition
ISBN: 0596007019
Year: 2003
Pages: 409
Authors: Ian F Darwin
