5.1. Character Classification

We will first consider one important property of characters in Unicode, namely General Category (or gc, for short). This will illustrate the definition and usefulness of properties, as well as some problems in defining them.

5.1.1. The Purposes of Classification

Characters can be classified in several ways, for different purposes. The Unicode standard defines a basic classification by assigning the General Category property to each character. Other properties imply more specific classifications, such as by the "age" of a character, i.e., by the Unicode version in which it was encoded.

The General Category property, defined for all characters, constitutes a fundamental classification into letters, numbers, punctuation, mathematical symbols, etc. For several frequently used characters, this classification is not very natural, since they have multiple uses. For example, the hyphen-minus "-" can be used as punctuation, as a minus sign, or as a special symbol. The reason behind this is the history and design of Unicode: it contains many "legacy characters," which have ambiguous semantics and mixed usage.

The classification is generally useful, though. For example, when writing pattern-matching routines, you often need to work with concepts like "letter" or "digit." Instead of dealing with a huge number of letters individually, you work with the classification.

The definition of a computer language (e.g., programming, markup, or data description language) typically involves a "name" or "identifier" concept. The rules typically allow an identifier to start with a letter and otherwise contain both letters and digits, and perhaps some special characters like _. Such a rule can be written easily, if we restrict ourselves to ASCII. That means, however, that most people in the world cannot use words of their native language in identifiers. To define a generalized concept of identifier, it is simplest to use the General Category and other properties, rather than list a huge number of characters. We return to this topic in Chapter 11.
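As a rough illustration of such a generalized rule, here is a minimal Python sketch based on the standard unicodedata module. The function name is_identifier and the exact rule (start with any letter, continue with letters, decimal digits, or connector punctuation such as _) are assumptions made for this example only; real language definitions usually rely on additional properties.

```python
import unicodedata


def is_identifier(s):
    """Hypothetical identifier rule, for illustration only.

    First character: any letter (General Category starting with "L").
    Remaining characters: letters, decimal digits (Nd),
    or connector punctuation (Pc), which includes "_".
    """
    if not s:
        return False
    if not unicodedata.category(s[0]).startswith("L"):
        return False
    for ch in s[1:]:
        cat = unicodedata.category(ch)
        if not (cat.startswith("L") or cat in ("Nd", "Pc")):
            return False
    return True


print(is_identifier("na\u00efve_1"))  # Latin letters with a diacritic, "_" and a digit
print(is_identifier("1abc"))          # starts with a digit
```

Note that the rule needs no list of individual letters: a Cyrillic or Greek word passes the same test as an ASCII one.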

If you define things like identifier syntax using the Unicode properties and specify that the newest Unicode version be used, the syntax is automatically updated when Unicode is. This means flexibility, but it also means instability in the sense that strings that were previously not identifiers by the syntax become identifiers later. The opposite is improbable but possible; most Unicode properties are not guaranteed to remain the same, once defined for a character. For such reasons, definitions of computer languages may fix identifier syntax in a manner that does not depend on Unicode versions, at the cost of making it impossible to use newly added characters in them. For example, in XML, identifier syntax has been fixed to use the properties of characters as defined in Unicode 2.0. Technically, the XML specification does not refer to the properties but explicitly lists its own definitions of character classes (see http://www.w3.org/TR/REC-xml/#CharClasses), but they are based on Unicode 2.0.
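The version dependence can be observed directly in Python, whose unicodedata module carries a frozen copy of the Unicode 3.2.0 data (unicodedata.ucd_3_2_0) next to the current data. The sketch below is only an illustration; the helper name category_then_and_now is hypothetical, and the sample character U+2C60 was chosen because it was added to Unicode after version 3.2.

```python
import unicodedata

# Frozen Unicode 3.2.0 database shipped with Python's standard library.
old_db = unicodedata.ucd_3_2_0


def category_then_and_now(ch):
    """Return (General Category in Unicode 3.2.0, General Category in current data)."""
    return old_db.category(ch), unicodedata.category(ch)


ch = "\u2C60"  # LATIN CAPITAL LETTER L WITH DOUBLE BAR
print(old_db.unidata_version, unicodedata.unidata_version)
print(category_then_and_now(ch))  # unassigned (Cn) in 3.2.0, an uppercase letter (Lu) now
```

A syntax tied to "the newest Unicode version" would thus start accepting this character in identifiers once the underlying data is updated.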

5.1.2. General Category Values

The classification is hierarchical: the General Category property indicates both a major class of a character and a subclass. The property is expressed with a two-letter code such as Lu so that:

The first character is an uppercase letter indicating the major class, which is Letter, Mark, Number, Separator, Other, Punctuation, or Symbol.
The second character is a lowercase letter that specifies the subclass.
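The two-level structure of the code makes it easy to test the major class alone. The following Python sketch assumes the standard unicodedata module; the names MAJOR_CLASSES and major_class are invented for this illustration.

```python
import unicodedata

# First letter of the General Category code mapped to the major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "Z": "Separator",
    "C": "Other",
    "P": "Punctuation",
    "S": "Symbol",
}


def major_class(ch):
    """Return the major class name for a single character."""
    code = unicodedata.category(ch)  # two-letter code such as "Lu"
    return MAJOR_CLASSES[code[0]]


print(major_class("A"))       # Letter
print(major_class("\u0300"))  # Mark
```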

The General Category values are shown in Table 5-1, together with sample characters or code points. Characters in class Mn are nonspacing and combining, and the sample character is shown as combined with a space (see "Diacritic marks" in Chapter 8).

Table 5-1. General Category values
Code	Description	Sample character
Lu	Letter, uppercase	A
Ll	Letter, lowercase	a
Lt	Letter, titlecase	ǅ (U+01C5)
Lm	Letter, modifier	ʰ (U+02B0)
Lo	Letter, other (including ideographs)	א (alef, U+05D0)
Mn	Mark, nonspacing	̀ (U+0300)
Mc	Mark, spacing combining	ः (U+0903)
Me	Mark, enclosing	 ⃝ (U+20DD)
Nd	Number, decimal digit	1
Nl	Number, letter	Ⅳ (U+2163)
No	Number, other	½ (U+00BD)
Zs	Separator, space	(space, U+0020)
Zl	Separator, line	(line separator, U+2028)
Zp	Separator, paragraph	(paragraph separator, U+2029)
Cc	Other, control	(carriage return, U+000D)
Cf	Other, format	(soft hyphen, U+00AD)
Cs	Other, surrogate	(surrogate code points)
Co	Other, private use	(U+E000)
Cn	Other, not assigned (including noncharacters)	(U+FFFF, not a character)
Pc	Punctuation, connector	_ (low line, U+005F)
Pd	Punctuation, dash	- (hyphen-minus, U+002D)
Ps	Punctuation, open	(
Pe	Punctuation, close	)
Pi	Punctuation, initial quote	“ (U+201C)
Pf	Punctuation, final quote	” (U+201D)
Po	Punctuation, other	!
Sm	Symbol, math	+
Sc	Symbol, currency	$
Sk	Symbol, modifier	^ (circumflex accent, U+005E)
So	Symbol, other	©

The names "Punctuation, initial quote" and "Punctuation, final quote" are misleading, since characters in both categories may act as an opening or closing quotation mark, depending on the language. For example, in Swedish, a quotation starts and ends with U+201D (e.g., ”Stockholm”).

Characters with ambiguous semantics have General Category values that are meant to reflect their typical use in normal text. Thus, for example, hyphen-minus is classified as "Punctuation, dash," although it is often used as a mathematical symbol.
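You can check such assignments yourself. The short Python sketch below (the helper name categories is an assumption of this example) compares the hyphen-minus with the unambiguous minus sign and hyphen characters, and looks up the two quotation marks discussed above.

```python
import unicodedata


def categories(text):
    """Return a list of (code point, General Category) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# hyphen-minus, minus sign, hyphen
print(categories("-\u2212\u2010"))
# [('U+002D', 'Pd'), ('U+2212', 'Sm'), ('U+2010', 'Pd')]

# left and right double quotation marks
print(categories("\u201C\u201D"))
# [('U+201C', 'Pi'), ('U+201D', 'Pf')]
```

The property value tells you nothing about how a particular hyphen-minus or quotation mark is actually used in a given text; it only records the classification assigned in the Unicode data.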

5.1.3. Use of General Category in Programming

To illustrate the use of this property in programming, let us consider the following simple task: read a text file and print all lines that contain an uppercase (capital) letter. Using a modern version of the Perl programming language, with Unicode support, you can do this with a three-liner (which could be written as a one-liner if you like):

 while(<>) {
     if (m/\p{Lu}/) {
         print; }}

This program contains a loop that reads an input line and prints it if the condition m/.../ is true, i.e., if a substring of the input line matches the expression between the slashes. The Unicode-specific part here is the expression \p{Lu}, which by definition matches any character whose General Category value is Lu. This covers Latin uppercase letters with or without diacritic marks (A, Â, etc.) as well as Greek, Cyrillic, and other uppercase letters. An approach that uses the character properties is of course much simpler than writing program code that tests all the different possibilities separately. Whether the broad concept of "uppercase letter" corresponding to the General Category value Lu is really adequate in a particular situation depends on the context and application.
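For comparison, the same task can be sketched in Python. The standard re module does not support the \p{...} notation, so this version tests the property of each character through unicodedata.category instead. The function names has_uppercase and lines_with_uppercase are hypothetical and chosen for this example.

```python
import unicodedata


def has_uppercase(line):
    """True if the line contains a character whose General Category is Lu."""
    return any(unicodedata.category(ch) == "Lu" for ch in line)


def lines_with_uppercase(lines):
    """Yield only those lines that contain an uppercase letter."""
    for line in lines:
        if has_uppercase(line):
            yield line


sample = ["all lowercase\n", "one Capital\n", "\u03a9mega\n", "\u01c5\n"]
for line in lines_with_uppercase(sample):
    print(line, end="")
```

To filter a file, you could pass sys.stdin (or an open file object) in place of the sample list. The last sample line contains only the titlecase letter U+01C5, whose General Category is Lt rather than Lu, so it is not printed; this is one case where you need to decide whether Lu alone matches what your application means by an uppercase letter.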
