5.2. An Overview of Properties

For overview and quick-reference purposes, this section presents an alphabetic list of properties, with a brief explanation of the meaning of each. Many of the concepts used there are explained later in this book; issues that are too specialized to be discussed here need to be looked up in the Unicode documentation.

The word "property" can have several meanings. For example, the shape of a character can be regarded as its property, and so can a statement about its use. However, in Unicode, the word "property" normally refers to formally defined properties. Often the definition is given as a table that lists characters and values of the property for each character.

The overall structure is described in the document "Unicode Character Database," http://www.unicode.org/Public/UNIDATA/UCD.html. The Unicode Character Database (UCD) itself is a collection of plain text files in fixed, well-defined formats, which are suitable for automated processing. These files are available at addresses that begin with http://www.unicode.org/Public/UNIDATA/, and they specify the values of properties for each character, either by explicitly assigning a value or by implying a default value.

We have previously mentioned the database file UnicodeData.txt, which is important and, in a sense, the basic file. However, contrary to what its name may suggest, it does not contain data for all properties. The tendency in the development of the standard has been to divide property definitions into separate files, so that UnicodeData.txt contains just some fundamental properties that can be described compactly.

Some properties are derived properties, which means that their values have been algorithmically deduced from other properties. Thus, derived properties are logically redundant: anything that you can express with them can be expressed using other properties. Derived properties have been included for convenience, to make some tests, definitions, and operations easier to write. For example, the property Alphabetic is derived, but it corresponds to an intuitive and important concept. It is more natural to say "if a character is alphabetic" than to say the same in terms of more primitive Unicode concepts (different categories of letters and characters comparable to letters).

Each property has a type, that is, a set of possible values; the types that appear in the summary below are yes/no, enumeration, catalog, number, string, and miscellaneous. In addition, each property has the following:

A property name, which may contain spaces; often (especially in programming) the name is written with spaces replaced by low lines (underscores), e.g., Bidi_Class instead of Bidi Class
An abbreviation (code), defined in the PropertyAliases.txt file in the Unicode database
A description of the meaning, given in prose, and further refined by rules that refer to the property (e.g., line-breaking rules define what line-breaking properties really mean)
A status as normative or informative (descriptive)

The enumeration values and the catalog values are short, somewhat mnemonic strings like AL. The same value may have different meanings for different properties, so a value as such is not unique. There are longer, more mnemonic names defined for the values in the PropertyValueAliases.txt file. For example, AL has the longer name Arabic_Letter when used as a value of the property Bidi Class and the longer name Alphabetic when used as a value of the property Line Break.
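Because a short value such as AL is meaningful only together with a property, a lookup table for long names is naturally keyed by the pair (property, value). The following is a minimal sketch of that idea in Python. The two sample lines are written in the semicolon-separated style of PropertyValueAliases.txt but are an assumption made for illustration, not text fetched from the database, and the function name is hypothetical.

```python
# Illustrative records in the style of PropertyValueAliases.txt
# (assumed sample data, not downloaded from the Unicode site).
SAMPLE = """\
bc ; AL ; Arabic_Letter
lb ; AL ; Alphabetic
"""

def load_value_aliases(text):
    """Map (property abbreviation, short value) to the long value name."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        prop, short_value, long_value = fields[0], fields[1], fields[2]
        aliases[(prop, short_value)] = long_value
    return aliases

ALIASES = load_value_aliases(SAMPLE)
print(ALIASES[("bc", "AL")])
print(ALIASES[("lb", "AL")])
```

Keying on the short value alone would lose one of the two entries, which is exactly the ambiguity described above.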

In addition to the properties discussed here, there are many properties defined for Han (Chinese-Japanese-Korean) characters. They are regarded as provisional, which means a property "whose values are unapproved and tentative, and which may be incomplete or otherwise not in a usable state." The properties are described in the document "Unihan Database," http://www.unicode.org/Public/UNIDATA/Unihan.html.

5.2.1. Summary of Properties

The following list describes briefly all the 88 properties defined in Unicode 4.1.0. For each property, the list specifies the following:

The abbreviation (short name)
The long name, as defined in the PropertyAliases.txt file; for some properties, this is the same as the abbreviation
The type of the values of the property (yes/no, enumeration, etc.)
The status as normative or informative; for some properties, the status is "normative or informative," which means that the property is normative for some values, informative for others
The database file where the property values are specified; to access the file on the Web, prefix the name with http://www.unicode.org/Public/UNIDATA/

The list is in alphabetic order by the abbreviation of the property, since the abbreviation is what you normally see in program code, regular expressions, and other compact notations.

age = Age, catalog, normative or informative, DerivedAge.txt: The number of the Unicode version in which the character was added to Unicode, such as "1.1" or "4.0."
AHex = ASCII Hex Digit, yes/no, normative, PropList.txt: Indicates whether the character is an ASCII character used in hexadecimal numbers. This means letters "A" to "F" and "a" to "f" and digits "0" to "9."
Alpha = Alphabetic, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is alphabetic, i.e., a letter or comparable to a letter in usage. True for characters with gc value of Lu, Ll, Lt, Lm, Lo, or Nl and additionally for characters with the OAlpha property.
bc = Bidi Class, enumeration, normative, UnicodeData.txt: The category of the character in the Bidirectional Algorithm.
Bidi C = Bidi Control, yes/no, normative, PropList.txt: Indicates whether the character has a special function in the Bidirectional Algorithm.
Bidi M = Bidi Mirrored, yes/no, normative, UnicodeData.txt: Specifies whether the character shall be represented using a mirrored glyph when it appears in right-to-left text.
blk = Block, catalog, normative, Blocks.txt: Name of the block to which the character belongs.
bmg = Bidi Mirroring Glyph, string, informative, BidiMirroring.txt: Suggests a character that can be used to supply a mirrored glyph for this character; see property Bidi M. For example, "(" mirrors ")" and vice versa.
ccc = Canonical Combining Class, number, normative, UnicodeData.txt: Specifies, with a numeric code, how a diacritic mark is positioned with respect to the base character. This is used in the Canonical Ordering Algorithm and in normalization. The order of the numbers is significant, but not the absolute values.
CE = Composition Exclusion, yes/no, normative, CompositionExclusions.txt: Specifies whether the character is explicitly excluded from composition when performing Unicode normalization.
cf = Case Folding, string, normative, CaseFolding.txt: The case-folded (lowercase) form of the character. This is a derived property.
Comp Ex = Full Composition Exclusion, yes/no, normative, DerivedNormalizationProps.txt: Indicates whether the character is excluded from composition when performing Unicode normalization.
Dash = Dash, yes/no, informative, PropList.txt: Indicates whether the character is classified as a dash. This includes characters explicitly designated as dashes and their compatibility equivalents.
Dep = Deprecated, yes/no, normative, PropList.txt: Indicates whether the character is deprecated. Deprecated characters will remain in the standard, but their use is strongly discouraged.
DI = Default Ignorable Code Point, yes/no, normative, DerivedCoreProperties.txt: Indicates whether the code point should be ignored in automatic processing by default.
Dia = Diacritic, yes/no, informative, PropList.txt: Indicates whether the character is diacritic, i.e., linguistically modifies another character to which it applies. A diacritic is usually, but not necessarily, a combining character.
dm = Decomposition Mapping, string, normative, UnicodeData.txt and NormalizationCorrections.txt: The decomposition of the character. The property dt indicates the type of decomposition.
dt = Decomposition Type, enumeration, normative, UnicodeData.txt: The type of the decomposition (canonical or compatibility) specified by the property dm. The possible values are listed in Table 5-3, later in the chapter.
ea = East Asian Width, enumeration, informative, EastAsianWidth.txt: The width of the character, in terms of East Asian writing systems that distinguish between full width, half width, and narrow. See UAX #11, "East Asian Width."
Ext = Extender, yes/no, informative, PropList.txt: Indicates whether the principal function of the character is to extend the value or shape of a preceding alphabetic character.
FC NFKC = FC NFKC Closure, string, normative, DerivedNormalizationProps.txt: Indicates whether the character requires extra mappings for closure under Case Folding plus Normalization Form KC.
gc = General Category, enumeration, normative, UnicodeData.txt: The type of the character according to a specific classification, as described in section "Character Classification" later in this chapter.
GCB = Grapheme Cluster Break, enumeration, informative, auxiliary/GraphemeBreakProperty.txt: Indicates the category of the character for determining grapheme cluster breaks.
Gr Base = Grapheme Base, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is regarded as a base grapheme, for the purposes of determining grapheme cluster boundaries.
Gr Ext = Grapheme Extend, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is regarded as an extending grapheme, for the purposes of determining grapheme cluster boundaries.
Gr Link = Grapheme Link, yes/no, normative, PropList.txt: Indicates whether the character is regarded as a grapheme link, for the purposes of determining grapheme cluster boundaries.
Hex = Hex Digit, yes/no, informative, PropList.txt: Indicates whether the character is used in hexadecimal numbers. This is true for ASCII hexadecimal digits and their fullwidth versions.
hst = Hangul Syllable Type, enumeration, normative, HangulSyllableType.txt: Type of syllable, for characters that are Hangul (Korean) syllabic characters.
Hyphen = Hyphen, yes/no, informative, PropList.txt: Indicates whether the character is regarded as a hyphen. This refers to those dashes that are used to mark connections between parts of a word and to the Katakana middle dot.
IDC = ID Continue, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character can appear as the second or subsequent character of an identifier.
IDS = ID Start, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character can appear as the first character of an identifier. See "Identifier and Pattern Syntax," available at http://www.unicode.org/reports/tr31/, and Chapter 11.
IDSB = IDS Binary Operator, yes/no, normative, PropList.txt: Indicates whether the character is a binary operator in Ideographic Description Sequences.
IDST = IDS Trinary Operator, yes/no, normative, PropList.txt: Indicates whether the character is a trinary (ternary) operator in Ideographic Description Sequences.
Ideo = Ideographic, yes/no, informative, PropList.txt: Indicates whether the character is an ideographic CJK (Chinese-Japanese-Korean) character.
isc = ISO Comment, miscellaneous, informative, UnicodeData.txt: The content of the comment field for the character in the ISO 10646 standard.
jg = Joining Group, enumeration, normative, ArabicShaping.txt: The group of characters that the character belongs to in cursive joining behavior. For Arabic and Syriac characters.
Join C = Join Control, yes/no, normative, PropList.txt: Indicates whether the character has specific functions for control of cursive joining and ligation.
jt = Joining Type, enumeration, normative, ArabicShaping.txt: Type of joining of glyphs: R (right), L (left), D (dual), C (join causing), U (non-joining), or T (transparent). For Arabic and Syriac characters.
lb = Line Break, enumeration, normative or informative, LineBreak.txt: Line-breaking class of the character. Affects whether a line break must, may, or must not appear before or after the character.
lc = Lowercase Mapping, string, informative, UnicodeData.txt and SpecialCasing.txt: The lowercase form of the character.
LOE = Logical Order Exception, yes/no, normative, PropList.txt: Indicates whether the character belongs to the small set of characters that do not use logical order and hence require special handling in most processing.
Lower = Lowercase, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is a lowercase letter.
Math = Math, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is mathematical. This includes characters with Sm (Symbol, math) as the General Category value, and some other characters.
na = Name, miscellaneous, normative, UnicodeData.txt and Jamo.txt: The Unicode name of the character. Guaranteed to remain stable.
na1 = Unicode 1 Name, miscellaneous, informative, UnicodeData.txt: The old name of the character in Unicode version 1.0, if significantly different from the Unicode name (value of the Name property).
NChar = Noncharacter Code Point, yes/no, normative, PropList.txt: Indicates whether the code point is a noncharacter, i.e., guaranteed to never denote a character.
NFC QC = NFC Quick Check, enumeration, normative, DerivedNormalizationProps.txt: Indicates whether the character can occur in Normalization Form C. Values: N = No, M = Maybe, Y = Yes.
NFD QC = NFD Quick Check, enumeration, normative, DerivedNormalizationProps.txt: Indicates whether the character can occur in Normalization Form D. Values: N = No, Y = Yes.
NFKC QC = NFKC Quick Check, enumeration, normative, DerivedNormalizationProps.txt: Indicates whether the character can occur in Normalization Form KC. Values: N = No, M = Maybe, Y = Yes.
NFKD QC = NFKD Quick Check, enumeration, normative, DerivedNormalizationProps.txt: Indicates whether the character can occur in Normalization Form KD. Values: N = No, Y = Yes.
nt = Numeric Type, enumeration, normative, UnicodeData.txt and Unihan.txt: This property has the value Decimal = De for decimal digits, Digit = Di for other digits, Numeric = Nu for other number denotations (e.g., fractions), and None = None for everything else.
nv = Numeric Value, number, normative, UnicodeData.txt and Unihan.txt: The numeric value corresponding to the character. This is defined for different digit characters but also characters such as Greek letters, which are used to denote numbers according to a non-positional system. If this field is empty for a character in the database, the value defaults to "Not a Number" (NaN).
OAlpha = Other Alphabetic, yes/no, informative, PropList.txt: Indicates whether the character is alphabetic but with a General Category value other than Lu, Ll, Lt, Lm, Lo, or Nl. Used to derive the Alphabetic property.
ODI = Other Default Ignorable Code Point, yes/no, normative, PropList.txt: This property is used to derive the property DI.
OGr Ext = Other Grapheme Extend, yes/no, normative, PropList.txt: This property is used to derive the property Gr Ext.
OIDC = Other ID Continue, yes/no, normative, PropList.txt: This property is used to derive the property IDC.
OIDS = Other ID Start, yes/no, normative, PropList.txt: This property is used to derive the property IDS.
OLower = Other Lowercase, yes/no, informative, PropList.txt: This property is used to derive the property Lower.
OMath = Other Math, yes/no, informative, PropList.txt: This property is used to derive the property Math.
OUpper = Other Uppercase, yes/no, informative, PropList.txt: This property is used to derive the property Upper.
Pat Syn = Pattern Syntax, yes/no, normative, PropList.txt: Indicates whether the character is or might be used in the pattern syntax for pattern matching as defined in "Identifier and Pattern Syntax," available at http://www.unicode.org/reports/tr31/. See the section "Identifier and Pattern Syntax" in Chapter 11.
Pat WS = Pattern White Space, yes/no, normative, PropList.txt: Indicates whether the character is treated as whitespace in patterns.
QMark = Quotation Mark, yes/no, informative, PropList.txt: Indicates whether the character is used as a quotation mark in some language(s).
Radical = Radical, yes/no, normative, PropList.txt: Indicates whether the character is a radical (in ideographic writing).
SB = Sentence Break, enumeration, informative, auxiliary/SentenceBreakProperty.txt: Indicates the category of the character for determining sentence breaks.
sc = Script, catalog, informative, Scripts.txt: The script (writing system) to which the character primarily belongs, such as "Latin" or "Greek," or "Common," which indicates a character that is used in different scripts.
scc = Special Case Condition, string, informative, SpecialCasing.txt: The condition under which a special case-mapping rule is applied. The condition is expressed as a space-separated list of locale IDs or contexts. For example, a value of tr means that the rule is applied for Turkish-language texts only.
SD = Soft Dotted, yes/no, normative, PropList.txt: Indicates whether the character contains a dot that disappears when a diacritic is placed above the character (e.g., "i" and "j" are soft dotted).
sfc = Simple Case Folding, string, normative, CaseFolding.txt: The case-folded (lowercase) form of the character when applying simple folding, which does not change the length of a string (and may thus fail to fold some characters correctly). This is a derived property.
slc = Simple Lowercase Mapping, string, normative, UnicodeData.txt: The lowercase form of the character, if expressible as a single character.
stc = Simple Titlecase Mapping, string, normative, UnicodeData.txt: The titlecase form of the character, if expressible as a single character.
STerm = STerm, yes/no, informative, PropList.txt: Indicates whether the character is used to terminate a sentence.
suc = Simple Uppercase Mapping, string, normative, UnicodeData.txt: The uppercase form of the character, if expressible as a single character.
tc = Titlecase Mapping, string, informative, UnicodeData.txt and SpecialCasing.txt: The titlecase form of the character.
Term = Terminal Punctuation, yes/no, informative, PropList.txt: Indicates whether the character is a punctuation mark that generally marks the end of a textual unit.
uc = Uppercase Mapping, string, informative, UnicodeData.txt and SpecialCasing.txt: The uppercase form of the character.
UIdeo = Unified Ideograph, yes/no, normative, PropList.txt: Indicates whether the character is a unified CJK ideograph. Used in Ideographic Description Sequences.
URS = Unicode Radical Stroke Count, miscellaneous, informative, Unihan.txt: A radical/stroke count quantity describing a Han (CJK) ideograph.
Upper = Uppercase, yes/no, informative, DerivedCoreProperties.txt: Indicates whether the character is an uppercase letter.
VS = Variation Selector, yes/no, normative, PropList.txt: Indicates whether the character qualifies as a Variation Selector used to specify the glyph variant of a graphic character.
WB = Word Break, enumeration, informative, auxiliary/WordBreakProperty.txt: Indicates the category of the character for determining word breaks.
WSpace = White Space, yes/no, normative, PropList.txt: Indicates whether the character should be treated by programming languages as a whitespace character when parsing elements. This concept does not match the more restricted whitespace concept in many programming languages, but it is a generalization of that concept to the "Unicode world."
XIDC = XID Continue, yes/no, informative, DerivedCoreProperties.txt: As IDC, but for a somewhat different definition for "identifier." See Chapter 11.
XIDS = XID Start, yes/no, informative, DerivedCoreProperties.txt: As IDS, but for a somewhat different definition for "identifier." See Chapter 11.
XO NFC = Expands On NFC, yes/no, normative, DerivedNormalizationProps.txt: Indicates whether the character expands to more than one character in normalization to C form.
XO NFD = Expands On NFD, yes/no, normative, DerivedNormalizationProps.txt: Indicates whether the character expands to more than one character in normalization to D form.
XO NFKC = Expands On NFKC, yes/no, normative, DerivedNormalizationProps.txt: Indicates whether the character expands to more than one character in normalization to KC form.
XO NFKD = Expands On NFKD, yes/no, normative, DerivedNormalizationProps.txt: Indicates whether the character expands to more than one character in normalization to KD form.
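Several of the properties listed above can be inspected without reading the database files directly. The sketch below uses Python's standard unicodedata module and labels each result with the abbreviation used in the list. Note the assumptions: the module reflects whatever Unicode version your Python installation ships with (not necessarily 4.1.0), the helper names describe and roughly_alphabetic are invented for this illustration, and the second helper only approximates Alpha because the standard library does not expose OAlpha.

```python
import unicodedata

def describe(ch):
    """Collect a few property values for one character, keyed by abbreviation."""
    return {
        "na": unicodedata.name(ch, ""),             # Name (empty if unnamed)
        "gc": unicodedata.category(ch),             # General Category
        "bc": unicodedata.bidirectional(ch),        # Bidi Class
        "ccc": unicodedata.combining(ch),           # Canonical Combining Class
        "Bidi_M": bool(unicodedata.mirrored(ch)),   # Bidi Mirrored
        "ea": unicodedata.east_asian_width(ch),     # East Asian Width
        "dm": unicodedata.decomposition(ch),        # Decomposition Mapping (with type tag, if any)
    }

ALPHA_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def roughly_alphabetic(ch):
    """Approximate Alpha from gc alone; characters that are alphabetic
    only through OAlpha are not detected by this sketch."""
    return unicodedata.category(ch) in ALPHA_CATEGORIES

for ch in ["A", "(", "\u0301", "\u00e9"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

The dictionary form is just a convenience for printing; in real code you would normally call the individual functions where needed.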

5.2.2. Normative and Informative Properties

The Unicode standard defines somewhat vaguely what it means to designate a property as normative. It does not mean that an implementation must know about the property and use it. But if it does, it must use it as specified in the standard. Thus, an implementation may not interpret the property values as it likes. A non-normative, i.e., informative, property is provided for use on an "as you like" basis: the property and its values have defined meanings and they stay at your disposal, but you may use them for your own purposes as you like.

For example, an implementation may be ignorant of Hebrew and Arabic letters and all directionality problems. But if it processes Hebrew or Arabic in a manner that involves visual presentation, it must apply the directionality principles of Unicode, and this means using the Bidi Class property according to the standard.

Some properties are partly normative, partly informative. The Line Break property is normative for values that indicate a forced line break, for example, but informative for many other values.

Being normative does not imply a guarantee that the property value will not change in future versions of Unicode. Such changes are expected to be rare, though.

Generally, even a normative property can be overridden by a so-called higher-level protocol (see Chapter 9). For example, the visual rendering of a document must normally obey the normative values of the Line Break property, but line breaks can be prohibited or forced by means external to plain text, such as stylesheets or explicit formatting instructions. Similarly, you can use informative properties to map lowercase letters to uppercase, yet override the mapping for some characters because of language-related or even application-specific conventions. Of course, you are supposed to override the properties only if you know what you are doing, i.e., when there is a well-defined reason.
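As an illustration of overriding a case mapping for a language-related reason, consider Turkish, where the uppercase form of "i" is the dotted capital U+0130 rather than "I". The following Python sketch applies a small override table before falling back to the default mapping. The table and the function name are hypothetical and cover only this one character; a real application would more likely rely on a locale-aware library.

```python
# Hypothetical override table for Turkish; illustrative only.
TURKISH_UPPER_OVERRIDES = {"i": "\u0130"}  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE

def upper_with_overrides(text, overrides):
    """Uppercase text, letting per-character overrides win over the default mapping."""
    return "".join(overrides.get(ch, ch.upper()) for ch in text)

default_result = "istanbul".upper()
turkish_result = upper_with_overrides("istanbul", TURKISH_UPPER_OVERRIDES)

print(default_result)
print(turkish_result)
```

The default mappings stay untouched; the override is confined to the one place where the application knows the language of the text.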

A normative property can be designated as non-overridable. This means that no modification is allowed at any level. The reason for this is to guarantee that some basic operations are carried out in a well-defined manner that other software may rely on. In particular, the decomposition properties are non-overridable. When canonical or compatibility decomposition is applied, the program doing so is not allowed to throw in its own decomposition rules or to ignore or modify the rules specified in the standard. This means that if your program purports to deliver data in normalized form, you are guaranteeing that the Unicode normalization rules, and no others, have been applied.
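In practice, a program gets this guarantee by delegating normalization to a library that implements the standard rules rather than maintaining its own decomposition tables. A minimal sketch with Python's standard unicodedata module follows; the variable names are chosen for this illustration.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Canonical decomposition and composition, applied with the standard rules.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compatibility decomposition: the ligature U+FB01 becomes two letters.
nfkd_ligature = unicodedata.normalize("NFKD", "\ufb01")

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])
print(nfkd_ligature)
```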

5.2.3. Structure of Database Files

As mentioned earlier in this chapter, the Unicode Character Database consists of plain text files, so it does not correspond to how many people understand the word "database." On the other hand, the files can be used to construct a database that can be used with suitable database software for searches, extracts, reports, etc. The files can also be used to generate mapping tables and other data structures needed for creating general purpose subroutines that can be used in programming, so that a programmer can work at a reasonable level of abstraction.

Largely for such purposes, the structure of the files follows some general principles, in addition to specific rules described in each file (in comments) or in the Unicode standard. The principles are:

The files are in UTF-8 encoding, except NamesList.txt, which is ISO-8859-1. However, characters outside the ASCII range (Basic Latin block) appear in comments only, except when noted otherwise in the description of the file. Thus, in most cases, you can view and process the files as if they were ASCII encoded, at least if you ignore the comments.
A comment starts with # and ends at the end of line. A comment does not belong to the data itself but describes it.
One line corresponds to one logical record, typically specifying the value of a property for one character.
Fields of a record are separated by semicolons. In some files, there is a semicolon after the last field, too. When the fields are referred to in text, they are considered as numbered starting from zero, as is common in programming, since programming language designers think in terms of displacements from a base address.
Leading and trailing spaces in a field are not significant.
The first field of a record usually indicates a code point or code point range. The other fields specify property values for the code point(s).
Code points are expressed in the usual hexadecimal notation but without the "U+" prefix, using at least four digits for a code number, with leading zeros as necessary.
A code point range is described by writing two periods (..) between code points, e.g., 0000..007F.
However, in the UnicodeData.txt file, a different method is used to specify values for a range of code points. A notation involving the words First and Last is used so that one line specifies the start and the next line specifies the end of the range. For example, the following two lines there specify that all code points from U+AC00 to U+D7A3 denote Hangul syllable characters, with the same properties as the first and last character of the range:

 AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
 D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

(In such situations, the Unicode names of characters are algorithmically derivable; in this case, the names can be derived from an algorithmic decomposition into Unicode characters with known names.)

A sequence of consecutive code points is expressed by writing them separated by spaces. Thus, 0066 0069 means U+0066 "f" followed by U+0069 "i", i.e., "fi" without any space.
A property value may be omitted (still preserving semicolons between fields), thereby implying a default value. If the value is of string type, the default value is the character itself; for example, for case mappings, the default is that a character does not change in the mapping. For other types of values, the default is specified in a comment in the database file.
Abbreviations and names of properties are written using underline (underscore) instead of a space, e.g., Bidi_Control instead of Bidi Control.
In a file that may specify different properties for characters, the abbreviation of a property is given in one field, its value in another. For example, the following line (from DerivedNormalizationProps.txt) says that for character U+037A, the value of the property FC_NFKC is the two-character sequence U+0020 U+03B9:

 037A  ; FC_NFKC; 0020 03B9      # Lm  GREEK YPOGEGRAMMENI

In a file that specifies binary (yes/no) properties, the name of a property is given in one field, without a value, implying a "yes" value (True) for the character. For such properties, the value "no" (False) is implied for all characters that are not mentioned. For example, in the PropList.txt file, there are only the two lines quoted below that mention the Bidi_Control property (comments omitted from this quotation). This implies that for the two characters U+200E and U+200F and for the five characters U+202A to U+202E, the value of the Bidi Control property is "yes" (True), and for all other characters, it is "no" (False):

 200E..200F    ; Bidi_Control
 202A..202E    ; Bidi_Control
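The principles above translate fairly directly into code. The following Python sketch strips comments, splits records into trimmed fields, handles both the ".." range notation and the First/Last convention of UnicodeData.txt, and collects the members of a binary property. All function names are invented for this illustration, the input lines are the examples quoted above rather than complete files, and default values for omitted fields are not handled.

```python
def parse_record(line):
    """Return the list of trimmed fields, or None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()   # a comment runs from # to end of line
    if not data:
        return None
    return [field.strip() for field in data.split(";")]

def parse_code_points(field):
    """Return (first, last) for 'XXXX' or 'XXXX..YYYY' (hexadecimal, no U+ prefix)."""
    first, _, last = field.partition("..")
    return int(first, 16), int(last or first, 16)

def parse_sequence(field):
    """Turn a space-separated code point sequence such as '0020 03B9' into a string."""
    return "".join(chr(int(code, 16)) for code in field.split())

def binary_property_members(lines, prop):
    """Code points for which a yes/no property is True; all others are False."""
    members = set()
    for line in lines:
        fields = parse_record(line)
        if fields and len(fields) > 1 and fields[1] == prop:
            first, last = parse_code_points(fields[0])
            members.update(range(first, last + 1))
    return members

def unicode_data_ranges(lines):
    """Yield (first, last, gc), pairing <..., First> and <..., Last> records."""
    start = None
    for line in lines:
        fields = parse_record(line)
        if fields is None:
            continue
        code, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            start = code
        elif name.endswith(", Last>"):
            yield start, code, gc
            start = None
        else:
            yield code, code, gc

PROP_LIST = ["200E..200F    ; Bidi_Control", "202A..202E    ; Bidi_Control"]
UNICODE_DATA = ["AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
                "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;"]

bidi_controls = binary_property_members(PROP_LIST, "Bidi_Control")
hangul = list(unicode_data_ranges(UNICODE_DATA))
fc_nfkc = parse_record(" 037A  ; FC_NFKC; 0020 03B9      # Lm  GREEK YPOGEGRAMMENI")
```

Storing ranges as (first, last) pairs, as unicode_data_ranges does, avoids materializing thousands of identical entries for large ranges such as the Hangul syllables.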