Unicode Character Properties
Because Unicode contains so many characters, it can be dangerous to assume that only a limited range of characters holds a particular property. For example, do not assume that the only digits are U+0030 (0) through U+0039 (9). Unicode 3.1 has many digit ranges. Depending on how the string is processed afterward, characters whose properties go undetected can cause security problems. The best way to handle this problem is to check the Unicode category of each character. The .NET Framework method GetUnicodeCategory provides this information for managed code. Unfortunately, no interface to this data is included in NLS yet. The latest approved version of the Unicode character properties is always available at http://www.unicode.org/unicode/reports/tr23.
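The difference between a range check and a category check can be sketched in a few lines. The following is a minimal illustration in Python rather than in managed code, assuming the standard unicodedata module as a stand-in for GetUnicodeCategory. The helper names (is_ascii_digit, is_decimal_digit, find_unexpected_digits) are hypothetical and exist only for this example.

```python
import unicodedata

def is_ascii_digit(ch):
    # The risky assumption: only U+0030 through U+0039 are digits.
    return "0" <= ch <= "9"

def is_decimal_digit(ch):
    # Ask the Unicode database instead: general category "Nd" is
    # "Number, decimal digit".
    return unicodedata.category(ch) == "Nd"

def find_unexpected_digits(text):
    # Hypothetical helper: list the characters that are decimal digits
    # according to Unicode but that a range check would miss.
    return [ch for ch in text
            if is_decimal_digit(ch) and not is_ascii_digit(ch)]

# "4" is ASCII; U+0662 is ARABIC-INDIC DIGIT TWO.
suspicious = find_unexpected_digits("id=4\u0662")
for ch in suspicious:
    print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.decimal(ch))
```

A filter that relies on is_ascii_digit alone would let U+0662 through as a nondigit, even though a later component might well interpret it as the number 2.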
You can use GetStringTypeEx for the same purpose, but do so with caution. The GetStringTypeEx properties predate Unicode by several years, and some of the properties assigned to characters are surprising. Nevertheless, many components of Windows use these properties, and it's reasonable to use GetStringTypeEx if you will be interacting with such components.
Table 14-1 shows each GetStringTypeEx property and the corresponding Unicode property for code points greater than U+0080. For code points less than U+0080, the GetStringTypeEx properties do not correspond to the Unicode properties.
GetStringTypeEx | Unicode Property |
C1_ALPHA | Alphabetic or Ideographic |
C1_UPPER | Uppercase or title case |
C1_LOWER | Lowercase or title case |
C1_DIGIT | Decimal digit |
C1_SPACE | White space |
C1_PUNCT | Punctuation |
C1_CNTRL | ISO control, bidirectional control, join control, format control or ignorable control |
C1_XDIGIT | Hex digit |
C3_NONSPACING | Nonspacing |
C3_SYMBOL | Symbol |
C3_KATAKANA | The character name contains the word KATAKANA |
C3_HIRAGANA | The character name contains the word HIRAGANA |
C3_HALFWIDTH | Half width or narrow |
C3_IDEOGRAPH | Ideographic |
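To get a feel for the right-hand column of Table 14-1, the Unicode side of a few rows can be approximated from a Unicode character database. The sketch below uses Python's unicodedata module and does not call GetStringTypeEx at all. The function name unicode_side_properties and the choice of general categories and East Asian width values for each row are assumptions made for illustration, not the definitions Windows uses, and the results on a real system might differ.

```python
import unicodedata

def unicode_side_properties(ch):
    # Hypothetical helper: return the names of the Table 14-1 rows whose
    # Unicode-side description appears to apply to ch. Only a subset of
    # the rows is covered, and each test is an approximation.
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "")
    props = set()
    if category == "Nd":                  # Decimal digit
        props.add("C1_DIGIT")
    if category.startswith("P"):          # Punctuation
        props.add("C1_PUNCT")
    if category == "Mn":                  # Nonspacing (assumed: nonspacing marks)
        props.add("C3_NONSPACING")
    if category.startswith("S"):          # Symbol
        props.add("C3_SYMBOL")
    if "KATAKANA" in name:                # Name contains the word KATAKANA
        props.add("C3_KATAKANA")
    if "HIRAGANA" in name:                # Name contains the word HIRAGANA
        props.add("C3_HIRAGANA")
    if unicodedata.east_asian_width(ch) in ("H", "Na"):  # Half width or narrow
        props.add("C3_HALFWIDTH")
    return props

# All samples are above U+0080, where the table says the two schemes line up.
for ch in ["\u0667", "\u00BF", "\u0301", "\u20AC", "\u3042", "\uFF76"]:
    print("U+%04X" % ord(ch), sorted(unicode_side_properties(ch)))
```

Comparing output like this with what GetStringTypeEx actually reports for the same characters is one way to find the surprising assignments before they turn into validation bugs.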