Regular Expressions and Unicode

Historically, regular expressions dealt only with 8-bit characters, which is fine for single-byte alphabets but not so great for everyone else! So how should your input-restricting code handle Unicode characters? If you must restrict your application to accept only what is valid, how do you do that when your application has Japanese or German users? The answer is not straightforward, and at best support is inconsistent across regular expression engines.

More Info
An excellent reference regarding Unicode regular expressions is Unicode Regular Expression Guidelines at http://www.unicode.org/reports/tr18, which should be your first stop on the Web after reading this chapter.

Three aspects of Unicode make it complex to build good Unicode regular expressions:

  • As we've already discussed, few engines support Unicode.

  • Unicode is a very large character set. Windows uses little endian UTF-16 to represent Unicode. In fact, because of surrogate characters, Windows supports over 1,000,000 characters; that's a lot of characters to check!

  • Unicode accommodates many scripts that have different characteristics than English. (The word script is used rather than language because one script can cover many languages.)

Now here's the good news: more engines are adding support for Unicode expressions as vendors realize the world is a very small place. A good example of this change is the introduction of Perl 5.8.0, which had just been released at the time of this writing. Another example is Microsoft's .NET Framework, which has both excellent regular expression support and exemplary globalization support. In addition, all strings in managed code are natively Unicode.

At first, you might think you can use hexadecimal ranges for languages, and you can, but doing so is crude and not recommended because

  • Spoken languages are living entities that evolve with time; a character that might seem invalid today in one language can become valid tomorrow.

  • It is really hard, if not impossible, to tell which ranges are valid for a language, even for English. Are accent marks valid? What about the word café? You get the picture.

The following regular expression will find all Japanese Katakana letters from small letter a (U+30A1) to letter vo (U+30FA), but not the conjunction and length marks and other special characters at \u30FB and above:

Regex r = new Regex(@"^[\u30A1-\u30FA]+$");
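Java's java.util.regex engine understands the same \uXXXX escapes, so the check can be sketched there as well; the class and method names below are illustrative, not from the book's samples:

```java
import java.util.regex.Pattern;

public class KatakanaCheck {
    // Katakana letters from small A (U+30A1) through VO (U+30FA);
    // the middle dot (U+30FB) and prolonged sound mark (U+30FC) fall
    // outside this range and are rejected.
    private static final Pattern KATAKANA = Pattern.compile("^[\u30A1-\u30FA]+$");

    public static boolean isKatakana(String s) {
        return KATAKANA.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKatakana("\u30AB\u30BF\u30AB\u30CA")); // true: katakana only
        System.out.println(isKatakana("\u30AB\u30FC"));             // false: U+30FC excluded
    }
}
```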

The secret to making Unicode regular expressions manageable lies in the \p{category} construct, which matches any character in the named Unicode character category. The .NET Framework and Perl 5.8.0 support Unicode categories, and this makes dealing with international characters easier. The high-level Unicode categories are Letters (L), Marks (M), Numbers (N), Punctuation (P), Symbols (S), Separators (Z), and Others (C), as follows:

  • L (All Letters)

    • Lu (Uppercase letter)

    • Ll (Lowercase letter)

    • Lt (Titlecase letters. Some letters, called digraphs, are composed of two characters. For example, among the Croatian digraphs that match Cyrillic characters in Latin Extended-B, Lj (U+01C8) is the titlecase version of uppercase LJ (U+01C7) and lowercase lj (U+01C9).)

    • Lm (Modifier, letter-like symbols)

    • Lo (Other letters that have no case, such as Hebrew, Arabic, and Tibetan)

  • M (All marks)

    • Mn (Nonspacing marks including accents and umlauts)

    • Mc (Spacing combining marks, used as vowel signs in languages such as Tamil)

    • Me (Enclosing marks, shapes enclosing other characters such as a circle)

  • N (All numbers)

    • Nd (Decimal digits, zero to nine; does not cover some numerals used in Asian languages such as Chinese, Japanese, and Korean. For example, Hangzhou-style numerals are treated like Roman numerals and classified as Nl (Number, Letter) instead of Nd.)

    • Nl (Numeric letter, Roman numerals from U+2160 to U+2182)

    • No (Other numbers represented as fractions, and superscripts and subscripts)

  • P (All punctuation)

    • Pc (Connector characters, such as the underscore, that join other characters)

    • Pd (Dash, all dashes and hyphens)

    • Ps (Open, characters like {, ( and [)

    • Pe (Close, characters like }, ) and ])

    • Pi (Initial quote characters, including the opening single quotation mark, ‘, and opening double quotation mark, “)

    • Pf (Final quote characters, including the closing single quotation mark, ’, and closing double quotation mark, ”)

    • Po (Other punctuation characters, including ?, !, and so on)

  • S (All symbols)

    • Sm (Math)

    • Sc (Currency)

    • Sk (Modifier symbols, such as the circumflex and grave accents)

    • So (Other symbols, including box-drawing symbols and letter-like symbols such as degree Celsius and copyright)

  • Z (All separators)

    • Zs (Space separator characters, including the normal space)

    • Zl (Line separator, which is only U+2028; note that U+00A6, the broken bar, is treated as a symbol)

    • Zp (Paragraph separator, which is only U+2029)

  • C (Others)

    • Cc (Control includes all the well-known control codes such as carriage return, line feed, and bell)

    • Cf (Format characters, invisible characters such as Arabic end-of-Ayah)

    • Co (Private use characters, including proprietary logos and symbols)

    • Cn (Unassigned)

    • Cs (High and Low Surrogate characters)
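Java's regex engine supports these general categories with the same \p{} syntax as .NET, so the behavior above can be sketched there; the sample inputs are illustrative:

```java
import java.util.regex.Pattern;

public class CategoryDemo {
    // Thin wrapper over java.util.regex for whole-string matching.
    public static boolean matches(String pattern, String input) {
        return Pattern.matches(pattern, input);
    }

    public static void main(String[] args) {
        // Lu: uppercase letters only
        System.out.println(matches("^\\p{Lu}+$", "ABC"));      // true
        // Nd: decimal digits, including non-Latin digits such as
        // Arabic-Indic digit four (U+0664)
        System.out.println(matches("^\\p{Nd}+$", "42\u0664")); // true
        // Mn: a combining acute accent is a nonspacing mark
        System.out.println(matches("^\\p{Mn}$", "\u0301"));    // true
    }
}
```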

    More Info
    There's a nice Unicode browser at http://oss.software.ibm.com/developerworks/opensource/icu/ubrowse that shows these categories.

Let's put the character classes to good use. Imagine a field in your Web application must include only a currency symbol, such as that for a dollar or a euro. You can verify that the field contains such a character and nothing else with this code:

Regex r = new Regex(@"^\p{Sc}{1}$");
if (r.Match(strInput).Success) {
    // cool!
} else {
    // try again
}

The good news is that this works for all currency symbols defined in Unicode, including dollar ($), pound sterling (£), yen (¥), franc (₣), euro (€), new sheqel (₪), and others!
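The same single-currency-symbol check, sketched in Java (which also supports \p{Sc}); the class name is illustrative:

```java
import java.util.regex.Pattern;

public class CurrencyCheck {
    // Sc: exactly one Unicode currency symbol and nothing else
    private static final Pattern CURRENCY = Pattern.compile("^\\p{Sc}$");

    public static boolean isCurrencySymbol(String s) {
        return CURRENCY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCurrencySymbol("$"));      // true
        System.out.println(isCurrencySymbol("\u20AC")); // true: euro sign
        System.out.println(isCurrencySymbol("5"));      // false: a digit, not a symbol
    }
}
```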

The following regular expression will match all letters, nonspacing marks, and spaces:

Regex r = new Regex(@"^[\p{L}\p{Mn}\p{Zs}]+$");

The reason for \p{Mn} is that many languages use diacritics and vowel marks, which are often called nonspacing marks.
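A Java sketch of the same letters-marks-spaces pattern shows why \p{Mn} matters: an accented character may arrive either precomposed (a single letter) or as a base letter plus a combining mark. The class name and inputs are illustrative:

```java
import java.util.regex.Pattern;

public class NameCheck {
    // Letters, nonspacing marks (diacritics, vowel signs), and space separators
    private static final Pattern NAME = Pattern.compile("^[\\p{L}\\p{Mn}\\p{Zs}]+$");

    public static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("caf\u00E9"));        // true: precomposed é is a letter
        System.out.println(isValid("cafe\u0301"));       // true: combining acute is Mn
        System.out.println(isValid("caf\u00E9 au lait")); // true: spaces are Zs
        System.out.println(isValid("cafe1"));             // false: digits excluded
    }
}
```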

The .NET Framework also provides language specifiers, such as \p{IsHebrew}, \p{IsArabic}, and \p{IsKatakana}. I have included some sample code, named Ch10\Lang, that demonstrates this.
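Java offers the same per-block matching, although its block syntax uses the prefix In rather than .NET's Is; a minimal sketch with an illustrative class name:

```java
import java.util.regex.Pattern;

public class BlockCheck {
    // .NET writes Unicode blocks as \p{IsHebrew}; java.util.regex
    // uses the prefix "In" for block names instead.
    private static final Pattern HEBREW = Pattern.compile("^\\p{InHebrew}+$");

    public static boolean isHebrewBlock(String s) {
        return HEBREW.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHebrewBlock("\u05E9\u05DC\u05D5\u05DD")); // true: "shalom"
        System.out.println(isHebrewBlock("hello"));                     // false: Latin letters
    }
}
```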

When you're experimenting with other languages, I recommend you use Windows 2000, Windows XP, or Microsoft Windows .NET Server 2003 with a Unicode font installed (such as Arial Unicode MS) and use the Character Map application, as shown in Figure 10-3, to determine which characters are valid. Note, however, that a font that claims to support Unicode is not required to have glyphs for every valid Unicode code point. You can look at the Unicode code charts at http://www.unicode.org/charts.

Figure 10-3. Using the Character Map application to view non-ASCII fonts.

More Info
I mentioned earlier that Perl 5.8.0 adds greater support for Unicode and supports the \p{ } syntax. You can read more about this at http://dev.perl.org/perl5/news/2002/07/18/580ann/perldelta.html#new%20unicode%20properties.

IMPORTANT
Be wary of code that performs a regular expression operation and then a decode operation; the data might be valid and pass the regular expression check, until it's decoded! You should perform the decode first and then apply the regular expression.



Writing Secure Code, Second Edition
ISBN: 0735617228
Year: 2001
Pages: 286
