Unicode Character Properties

Because Unicode contains so many characters, it can be dangerous to assume that only a limited range of code points has a particular property. For example, do not assume that the only digits are U+0030 (0) through U+0039 (9). Unicode 3.1 has many digit ranges. Depending on subsequent processing of the string, characters with undetected properties can cause security problems. The best way to handle this problem is to check the Unicode category. The .NET Framework method GetUnicodeCategory provides this information for managed code. Unfortunately, no interface to this data is included in NLS yet. The latest approved version of the Unicode character properties is always available at http://www.unicode.org/unicode/reports/tr23.
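The difference between a range check and a category check can be illustrated with a short sketch. It is written in Python and uses the standard unicodedata module as a stand-in for a category lookup such as GetUnicodeCategory; the function names and sample characters are assumptions chosen for illustration, not part of any Windows or .NET API.

```python
import unicodedata


def is_ascii_digit_naive(ch: str) -> bool:
    """Range check that assumes the only digits are U+0030 through U+0039."""
    return "\u0030" <= ch <= "\u0039"


def is_decimal_digit(ch: str) -> bool:
    """Category check: 'Nd' is the Unicode general category for decimal digits."""
    return unicodedata.category(ch) == "Nd"


# The digit seven in four scripts: ASCII, Arabic-Indic, Devanagari, fullwidth.
samples = ["7", "\u0667", "\u096D", "\uFF17"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        f"naive={is_ascii_digit_naive(ch)}",
        f"category={unicodedata.category(ch)}",
        f"value={unicodedata.digit(ch)}",
    )
```

In this sketch the range check accepts only the first sample, while the category check accepts all four. Code that validates input with the range check but later passes the string to a component that understands the other digits would treat the same text inconsistently.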

Use GetStringTypeEx for the same purpose, with caution. The GetStringTypeEx properties predate Unicode by several years, and some of the properties assigned to characters are surprising. Nevertheless, many components of Windows use these properties, and it's reasonable to use GetStringTypeEx if you will be interacting with such components.

Table 14-1 shows the GetStringTypeEx properties and the corresponding Unicode properties for code points greater than U+0080. For code points less than U+0080, the GetStringTypeEx properties do not correspond to the Unicode properties.

Table 14-1. Unicode Properties
GetStringTypeEx Property	Unicode Property
C1_ALPHA	Alphabetic or Ideographic
C1_UPPER	Upper or title case
C1_LOWER	Lower or title case
C1_DIGIT	Decimal digit
C1_SPACE	White space
C1_PUNCT	Punctuation
C1_CNTRL	ISO control, bidirectional control, join control, format control or ignorable control
C1_XDIGIT	Hex digit
C3_NONSPACING	Nonspacing
C3_SYMBOL	Symbol
C3_KATAKANA	The character name contains the word KATAKANA
C3_HIRAGANA	The character name contains the word HIRAGANA
C3_HALFWIDTH	Half width or narrow
C3_IDEOGRAPH	Ideographic
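To make the right-hand column of Table 14-1 more concrete, the following sketch derives a few of those Unicode properties from Python's unicodedata module and labels them with the matching GetStringTypeEx names. It does not call GetStringTypeEx, and its results might differ from what Windows reports. The function name table_labels is hypothetical, and only the rows that map cleanly onto a general category or a character name are covered, on the assumption that decimal digit corresponds to Nd, punctuation to the P categories, nonspacing to Mn, and symbol to the S categories.

```python
import unicodedata


def table_labels(ch: str) -> set:
    """Return Table 14-1 labels suggested by Unicode data for one character.

    Hypothetical helper: approximates the Unicode side of the table only.
    """
    category = unicodedata.category(ch)   # for example 'Nd', 'Po', 'Mn', 'Sc'
    name = unicodedata.name(ch, "")       # empty string if the character is unnamed
    labels = set()
    if category == "Nd":
        labels.add("C1_DIGIT")            # decimal digit
    if category.startswith("P"):
        labels.add("C1_PUNCT")            # punctuation
    if category == "Mn":
        labels.add("C3_NONSPACING")       # nonspacing mark
    if category.startswith("S"):
        labels.add("C3_SYMBOL")           # symbol
    if "KATAKANA" in name:
        labels.add("C3_KATAKANA")         # character name contains KATAKANA
    if "HIRAGANA" in name:
        labels.add("C3_HIRAGANA")         # character name contains HIRAGANA
    return labels


# All samples are above U+0080, the range the table describes.
for ch in ["\u0667", "\u3001", "\u0301", "\u20AC", "\u30AB", "\u3042"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), sorted(table_labels(ch)))
```

A comparison of this kind can help when reviewing code that relies on GetStringTypeEx: any character for which the two sources disagree is a candidate for closer inspection.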