5.4. Normalization

In data processing, normalization generally means conversion of data to a form that has been defined as the normal form, chosen from among different possibilities. This does not mean that the other forms would be incorrect or nonstandard. On the contrary, the normal form is usually just one of the correct forms. In some contexts, there is a difference between "normal form" and "normalized form," but we will treat them as synonyms here.

Consider the Latin small letter "e" with acute accent, é. This character can be represented in Unicode as a separate character with a code point of its own, U+00E9. Equivalently, in the sense of canonical equivalence, it can be represented as a two-character sequence: Latin small letter "e" (U+0065) followed by combining acute accent (U+0301). The rendering should be exactly the same, and we might say that well-designed software should handle them both identically. Due to its design goals, Unicode contains many canonically equivalent ways of representing things.

However, processing of data becomes easier if the variation is reduced. For some purpose and context, we might decide that one of the forms is the normal form. We might then use preprocessing software that converts the data to the normal form. This would make the coding of the actual processing easier. For example, text searching is easier if we can assume that all data has been normalized, so that when a search for é is performed, we know exactly what to look for.
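For instance, in Python (shown here as a minimal sketch; the same logic applies in any language with a Unicode normalization library), the two representations of é compare unequal as raw strings but equal after both are normalized:

    import unicodedata

    precomposed = "\u00E9"   # é as a single code point
    decomposed = "e\u0301"   # "e" followed by combining acute accent

    print(precomposed == decomposed)                  # False: different code points
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))   # True: same normal form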

Normalization operates at the level of code points, not encodings. Different encodings, such as UTF-8 and UTF-16, will be discussed in Chapter 6. Encoding issues are independent of normalization.

5.4.1. Normalization Versus Folding

In the Unicode context, the term "normalization" is used only to denote normalization forms that deal with canonical and compatibility decompositions and compositions. For example, the representation of é as U+00E9 or as U+0065 U+0301 is a normalization issue in this narrow sense. Mapping, for example, É to é for case-insensitive comparisons, or ignoring diacritic marks (mapping é, è, etc., all to "e"), is not called normalization but folding, although the goals are often the same as for normalization.

Folding issues are discussed in the Draft Unicode Technical Report (UTR) #30, "Character Foldings," http://www.unicode.org/reports/tr30/. Typically, the foldings described there are mappings that perform some of the canonical or compatibility mappings, such as removal of canonical duplicates or subscript folding, which turns subscript characters into the corresponding normal characters. However, they also include quite different mappings, such as accent removal.

According to UTR #30, all folding operations involve canonical decomposition, and they may involve composition as the last step. The general idea is to apply folding rules, then canonical decomposition, and then to repeat these steps until the data is stable, i.e., until it no longer changes in these steps. Thus, folding resembles normalization but contains additional operations.
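As a rough illustration (a simplified sketch, not the exact UTR #30 algorithm), a common folding pipeline in Python might combine case folding with accent removal via canonical decomposition:

    import unicodedata

    def fold(text):
        """Case folding plus accent removal; a simplified folding sketch."""
        text = text.casefold()                             # case folding
        decomposed = unicodedata.normalize("NFD", text)    # canonical decomposition
        stripped = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))  # drop combining marks
        return unicodedata.normalize("NFC", stripped)      # recompose the rest

    print(fold("Éléphant"))   # prints "elephant"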

5.4.2. Overview of Normalization Forms

Unicode defines several normalization forms, which can be used for different purposes. They are summarized in Table 5-3. The principles are simple: first, decomposable characters are decomposed, using either canonical or compatibility decomposition. This may be followed by canonical composition, as described later in the detailed descriptions of Normalization Forms C and KC.

Table 5-3. Unicode normalization forms

Code   Name                    Meaning
NFD    Normalization Form D    Canonical decomposition
NFC    Normalization Form C    Canonical decomposition, canonical composition
NFKD   Normalization Form KD   Compatibility decomposition
NFKC   Normalization Form KC   Compatibility decomposition, canonical composition


In the codes, "D" stands for decomposition, "C" for composition, and "K" for compatibility. Composition implies prior decomposition.


For example, consider the word "ﬁancé" written so that it starts with the ligature ﬁ (U+FB01) and ends with the composite character é, "e" with acute accent (U+00E9). These characters have compatibility and canonical mappings, respectively. The normalization forms of the word are presented in Table 5-4, denoting the combining acute accent U+0301 by the acute accent ´ for clarity.

Table 5-4. Normalization forms of the sample word "ﬁancé"

Form   "ﬁancé" normalized   Explanation
NFD    ﬁ a n c e´           é has been decomposed (canonical)
NFC    ﬁ a n c é            é was decomposed but then composed back
NFKD   f i a n c e´         Both ﬁ and é have been decomposed
NFKC   f i a n c é          Only ﬁ has been decomposed (compatibility)


In the example, the NFC form is the same as the original string. This is typical, since NFC deals with canonical mappings only, and it first decomposes and then composes. The NFKD form is fully decomposed, whereas in the NFKC form, the character é was first decomposed and then composed back (canonical composition).
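Table 5-4 can be reproduced with Python's unicodedata module; this minimal sketch prints the code points of each normalization form:

    import unicodedata

    word = "\uFB01anc\u00E9"   # "fiancé" with the fi ligature and precomposed é
    for form in ("NFD", "NFC", "NFKD", "NFKC"):
        result = unicodedata.normalize(form, word)
        print(form, [hex(ord(ch)) for ch in result])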

Unicode data may contain characters such as é in both precomposed and decomposed form. Normalization to NFD (or NFKD) ensures that they will all be in (completely) decomposed form. Normalization to NFC (or NFKC) ensures that they will all be in precomposed form if possible. The "if possible" part comes from the fact that not all characters with diacritic marks have precomposed forms in Unicode.

No normalization form performs any "compatibility composition." For example, normalization never composes the letters "f" and "i" into the ligature ﬁ.

For quick checks on the normalization forms of individual characters, you can use the Normalization Charts at http://www.unicode.org/charts/normalization/. They show the four normalization forms for each character, except for those that are invariant under all normalizations. The charts are illustrated in Figure 5-4. Note that the glyphs are usually the same although the normalization forms (code number sequences) differ, so the most relevant information is in the code numbers below the glyphs. Generally, normalization to the C or D form should not change the rendering of a character, whereas normalization to KC or KD form may change it, since they involve compatibility mappings.

There is also an offline tool, Charlint, for checking the normalization forms of a string. It can be downloaded from http://www.w3.org/International/charlint. It corresponds to Unicode Version 3.2, so newer characters cannot be checked.

Figure 5-4. The Normalization Chart for some Cyrillic characters that have canonical decompositions


5.4.2.1. Use of normalization forms

In practice, the different normalization forms have rather different uses:

  • Normalization Form C (NFC) is favored as the basic form, for example, by the World Wide Web Consortium (W3C) in the Character Model for the Web, for use in XML and related formats; see http://www.w3.org/TR/charmod/. Some general-purpose subroutine libraries and utilities require that their input be in NFC.

  • Normalization Form D (NFD) can be useful in situations where you prefer to process all characters with diacritics as decomposed, e.g., because you wish to simply ignore all diacritics.

  • Normalization Forms KD and KC (NFKD, NFKC) should be used with caution, after a careful analysis of the possible effects, since these normalizations may lose essential information (e.g., by normalizing 4² to 42, as the example after this list illustrates). They can be useful in applications where you intentionally want to simplify character data.
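The following minimal sketch shows the information loss: NFC leaves the superscript intact, whereas NFKC flattens it (unicodedata.is_normalized requires Python 3.8 or later):

    import unicodedata

    s = "4\u00B2"                                  # "4" followed by superscript two
    print(unicodedata.normalize("NFC", s))         # 4² (unchanged)
    print(unicodedata.normalize("NFKC", s))        # 42 (superscript distinction lost)
    print(unicodedata.is_normalized("NFC", s))     # True: already in NFC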

5.4.2.2. Invariance of Basic Latin characters

The Basic Latin block in Unicode, corresponding to ASCII, has been designed so that strings consisting of Basic Latin characters only are not changed in any normalization. That is, they have no decomposition mappings, and there are no compositions that operate on sequences of Basic Latin characters. Therefore, the basic syntactic constructs in programming and markup languages remain invariant under normalization, as long as they use Basic Latin characters only. This goal explains why, for example, the grave accent ` (U+0060) has no compatibility mapping to space followed by the combining grave accent, although there is such a mapping for the acute accent ´ (which is outside Basic Latin, in the Latin-1 Supplement).
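This invariance is easy to verify; the following sketch checks that a pure-ASCII string survives all four normalization forms unchanged:

    import unicodedata

    source = 'if (x <= 42) { printf("%d", x); }'   # Basic Latin characters only
    assert all(unicodedata.normalize(form, source) == source
               for form in ("NFD", "NFC", "NFKD", "NFKC"))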

5.4.3. Normalization Form C

As mentioned, NFC means canonical decomposition followed by canonical composition. This may sound odd: why decompose something that will be composed back again? The explanation is that decomposition ensures, among other things, that multiple diacritic marks will be handled in a uniform manner.

The exact definition of canonical composition requires some auxiliary concepts:


Starter

A character is called a starter if its combining class is 0, i.e., the value of the property ccc = Canonical Combining Class for the character is zero. This includes all characters that are not combining characters, as well as some combining characters.


Blocked

In a string that begins with a starter, a character C is said to be blocked from the starter S if there is a character B between them such that either B is a starter or B has a combining class value at least as high as C's.


Primary composite

A character is said to be a primary composite if it has a canonical mapping and has not been explicitly excluded from composition by assigning the value yes (True) to the property CE = Composition Exclusion for the character. See the subsection "Composition Exclusions" later in this chapter.

We can now define the construction of the NFC of a string as consisting of the following steps:

  1. Construct the canonical decomposition of the string. (Note that this includes reordering of consecutive nonspacing marks.)

  2. Process the result by successively composing each character with the nearest preceding starter, if it is not blocked from it. Composing character C with a starter S means that if there is a primary composite Z that is canonically equivalent to the string consisting of S followed by C, then S is replaced by Z, and C is removed.

This is a bit complicated, so let us consider a simple example. Assume that the initial string is U+00EA U+0323, i.e., ê followed by combining dot below. The process of converting it to NFC is presented stepwise in Table 5-5. For clarity, the combining diacritic marks are visualized as ^ (denoting circumflex above) and . (denoting dot below). The operations in the composition phase are based on the canonical mappings defined for U+1EB9 and U+1EC7.

Table 5-5. Normalization Form C step by step

Phase           Representation of e with circumflex above and dot below   Comments
Original data   U+00EA ê  U+0323 .                                        Partly composed
Decomposition   U+0065 "e"  U+0302 ^  U+0323 .                            Canonically decompose ê
Decomposition   U+0065 "e"  U+0323 .  U+0302 ^                            Reorder nonspacing marks
Composition     U+1EB9 ẹ  U+0302 ^                                        Compose mark with starter "e"
Composition     U+1EC7 ệ                                                  Compose second mark
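The steps in Table 5-5 can be checked with Python's unicodedata module; the combining class values (230 for the circumflex, 220 for the dot below) explain the reordering in the decomposition phase:

    import unicodedata

    s = "\u00EA\u0323"                        # ê followed by combining dot below
    nfd = unicodedata.normalize("NFD", s)
    print([hex(ord(c)) for c in nfd])         # ['0x65', '0x323', '0x302']
    print(unicodedata.combining("\u0302"))    # 230: combining class of circumflex
    print(unicodedata.combining("\u0323"))    # 220: lower class, so it sorts first
    nfc = unicodedata.normalize("NFC", s)
    print(hex(ord(nfc)))                      # 0x1ec7, i.e., ệ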


5.4.4. Normalization Form KC

NFKC is defined in a manner very similar to the definition of NFC. The only difference is in step 1, which involves compatibility decomposition instead of canonical decomposition. The construction of NFKC for a string consists of the following steps (a short demonstration follows the list):

  1. Construct the compatibility decomposition of the string. (Note that this includes applying both canonical and compatibility mappings and then reordering of consecutive nonspacing marks.)

  2. Process the result by successively composing each character with the nearest preceding starter, if it is not blocked from it. Composing character C with a starter S means that if there is a primary composite Z that is canonically equivalent to the string consisting of S followed by C, then S is replaced by Z, and C is removed.
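Both steps can be observed in a single call; in this minimal sketch, NFKC splits the ﬁ ligature (compatibility decomposition) and builds the precomposed é (canonical composition):

    import unicodedata

    s = "\uFB01e\u0301"                       # fi ligature, then e + combining acute
    result = unicodedata.normalize("NFKC", s)
    print([hex(ord(c)) for c in result])      # ['0x66', '0x69', '0xe9']: f, i, é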

5.4.5. Composition Exclusions

As defined in the previous section, characters with a "yes" value for the CE = Composition Exclusion property are excluded from composition in normalization, because they are by definition not primary composites. These characters are listed, with comments, in the Unicode database file CompositionExclusions.txt. They are divided into the following groups:

  • Script-specific precomposed characters that are generally not the preferred form for particular scripts and have therefore been declared to be excluded from composition. Currently these include some Devanagari, Bengali, Gurmukhi, Oriya, Tibetan, and Hebrew characters.

  • Post Composition Version precomposed characters, which means precomposed characters added after Unicode Version 3.0. By Unicode policy, such characters are always excluded from composition. There are just a few symbols in this group.

  • Singleton Decompositions, i.e., characters whose canonical decomposition consists of a single character, e.g., the ohm sign (with the capital omega as its decomposition; see the sketch after this list).

  • Non-Starter Decompositions, i.e., characters whose canonical decomposition starts with a character with a nonzero combining class. There are just a few such characters, e.g., combining Greek dialytika tonos U+0344, which represents two combining diacritic marks (dialytika and tonos).
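The effect of a composition exclusion is visible in practice: because the ohm sign has a singleton decomposition, NFC maps it to the capital omega and never composes it back, as this sketch shows:

    import unicodedata

    ohm = "\u2126"                                 # OHM SIGN
    nfc = unicodedata.normalize("NFC", ohm)
    print(hex(ord(nfc)), unicodedata.name(nfc))    # 0x3a9 GREEK CAPITAL LETTER OMEGA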

5.4.6. Definition of Compatibility Decomposable Character

We can now formally define what it means to be compatibility decomposable: it means that a character's compatibility decomposition differs from its canonical decomposition, i.e., its Normalization Form KD is different from its Normalization Form D. That is, a character c is compatibility decomposable if NFKD(c) ≠ NFD(c).

For example, the micro sign is compatibility decomposable, since it has a compatibility mapping to the Greek small letter mu, which is thus its NFKD, whereas its NFD is the micro sign itself (since it has no canonical mapping to anything). On the other hand, the ohm sign is not compatibility decomposable, since it has a canonical mapping to the Greek capital letter omega, thus having that character as both its NFKD and its NFD.

Not all compatibility characters are compatibility decomposable. Many of them have decompositions that are canonical.
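The definition translates directly into code; this minimal sketch tests the two examples above:

    import unicodedata

    def is_compat_decomposable(c):
        """True if NFKD(c) differs from NFD(c), per the definition above."""
        return (unicodedata.normalize("NFKD", c) !=
                unicodedata.normalize("NFD", c))

    print(is_compat_decomposable("\u00B5"))   # True: micro sign folds to small mu
    print(is_compat_decomposable("\u2126"))   # False: ohm sign decomposes canonically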


5.4.7. W3C Normalization

The World Wide Web Consortium (W3C) favors Normalization Form C on the Web, and it additionally suggests stronger normalization rules for HTML and XML documents. The stronger rules are external to Unicode, since they relate to markup, not plain text. They are briefly described here due to their practical impact, and in more detail in the document "Character Model for the World Wide Web 1.0: Normalization," http://www.w3.org/TR/charmod-norm/. Note, however, that this document is officially only a Working Draft (work in progress).

The W3C normalization rules require that text be in NFC and additionally forbid the occurrence of character references and entity references that would make the text non-normalized if replaced by the characters that they denote. For example, by Unicode rules, NFC does not allow the appearance of "e" followed by a combining acute accent, since this combination must be replaced by the precomposed character é. The W3C normalization rules also forbid the indirect appearance of the combination, for example, as in e&#x301; (where &#x301; is a character reference that denotes the combining acute accent U+0301).

On the Web, expressions like e&#x301; are rarely used in practice, since the corresponding precomposed character (either written as such, as a character reference like &#233; or &#xE9;, or as an entity reference like &eacute;) works much better. However, suppose that you have a database that contains characters in decomposed form. Unless you are careful, software that presents data extracted from it in HTML or XML format might treat data like U+0065 U+0301 so that U+0065 is represented directly as "e" (which should cause no problems), whereas U+0301 is converted to &#x301; for safety. This would result in data that is not W3C normalized, which involves unnecessary risks. A simple way to avoid this is to normalize the character data extracted from the database (to NFC) before making any decisions about using character references to represent some characters.
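The pitfall and its fix can be sketched as follows: expanding the character reference produces text that is not in NFC, and normalizing after expansion repairs it (unicodedata.is_normalized requires Python 3.8 or later):

    import html
    import unicodedata

    markup = "e&#x301;"                                 # "e" plus a reference to U+0301
    expanded = html.unescape(markup)                    # expand the character reference
    print(unicodedata.is_normalized("NFC", expanded))   # False: decomposed é
    print(unicodedata.normalize("NFC", expanded))       # é as U+00E9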


