5.3. Compositions and Decompositions

The 10 design principles of Unicode, presented in Chapter 4, include one principle on dynamic composition and another on equivalent sequences. For example, the letter é can be represented as a single Unicode character, or dynamically composed as a two-character string (letter "e" followed by a combining acute accent). The single character é is said to have a canonical decomposition consisting of two characters, and this relationship implies canonical equivalence.

Unicode lets you combine a base character with an unlimited number of combining diacritic marks. In practice, there is most often just one diacritic, sometimes two, but there is no limit. For example, phonetic or mathematical notations may deploy several diacritic marks on one character. As a base character, you can use any character that does not itself combine with preceding characters and that is neither a control nor a format character.

Unicode would be simpler if all letters with diacritic marks were represented using dynamic composition. For various practical reasons, another approach was taken, and this implies that we need to deal with precomposed forms and with conversions between them and decomposed forms.

Characters may have decompositions in a different sense, too. Many characters have compatibility decompositions. For example, the Latin small ligature "fi," ﬁ (U+FB01), has a compatibility decomposition that consists of the two characters "f" and "i."

5.3.1. The Impact of Diacritic Marks

A diacritic mark is an additional graphic such as an accent (as in è or é) or cedilla (as in ç) attached to a character. It may affect the pronunciation of a character, or the meaning of a word, or both. It appears visually close to the base character, often above or below it, possibly crossing over its line, but it is treated as a logically separable part. A diacritic mark can be treated in different ways when defining a character repertoire.
You could define a character like é (letter "e" with acute accent) as a separate character, or you could define the base character "e" and the diacritic ´ as two distinct characters. In the latter approach, you would need to define the diacritic as combining (nonspacing), or otherwise indicate that it is to be rendered as attached to the character, not as a separate character after it. For example, the ISO-8859-1 character code contains a collection of letters with diacritic marks, such as é, but no combining marks. It contains the acute accent ´, but as a normal (spacing) character, which is not combined with any other character in any way.

The Unicode standard uses nonspacing mark as a term that covers diacritic marks but can be seen as somewhat more general in nature. The term "diacritic mark" is often used to denote accents and other marks attached to Latin, Greek, Cyrillic, and other letters, whereas "nonspacing mark" also covers Hebrew points, Arabic vowel marks, etc. In this book, "diacritic mark" is used in a broad sense, as a synonym for "nonspacing mark."

5.3.1.1. Precomposed and decomposed form

In Unicode, a character with a diacritic mark can often be represented in two ways. You can express é as a precomposed character or as decomposed, i.e., as a character pair consisting of "e" and a combining acute accent. Both representations are possible for a large number of commonly used characters, though not for all characters with diacritics.

This means flexibility, but it also creates a pile of problems. What happens if a database contains é as decomposed but a search string typed by the user contains it as precomposed? This is just the beginning of the problems. For example, a character with several diacritic marks can be represented as several different decompositions.

Unicode contains separate characters called combining diacritical marks.
The general idea is that you can express a vast set of characters with diacritics by representing them so that a base character is followed by one or more combining (nonspacing) diacritic marks. A program that displays such a construct is expected to do rather clever things in formatting, e.g., selecting a particular shape for the diacritic according to the shape of the base character.

In Unicode, a combining diacritic mark always follows the base character in data. It may visually appear above, below, or on either side of the base character. The logical order differs from the order in many methods of typing characters with diacritic marks. For example, on many keyboards, you could first press a key labeled ´, and then the "e" key, to produce é. However, if this letter is represented in data as decomposed, it has the combining diacritic mark after the base letter "e."

The order in typing mechanisms reflects the methods used on mechanical typewriters. They may contain a ´ key that is non-advancing, i.e., the writing position is not moved forward. Therefore, the next character will overprint the symbol, resulting in a coarsely constructed accented letter. In Unicode, combining diacritic marks are supposed to be rendered as combined with the preceding character in a more elaborate way.

5.3.1.2. Combining marks: powerful, but still poorly supported

Many programs currently in use are totally incapable of doing anything meaningful with combining diacritic marks. However, there is at least some simple support for them in word processors and web browsers, for example. Regarding advanced implementation of the rendering of characters with diacritic marks, consult Unicode Technical Note #2, "A General Method for Rendering Combining Marks," http://www.unicode.org/notes/tn2/.

Using combining diacritic marks, we have a wide range of possibilities.
We can put, say, a dieresis on a gamma, although "Greek small letter gamma with dieresis" does not exist in Unicode as a character with a code position of its own. The combination U+03B3 U+0308 consists of two characters, although its visual presentation looks like a single character in the same sense as ä looks like a single character. A word processor may display it as γ̈, which might be of poor quality (the dieresis is not correctly placed with respect to the base character), but probably legible. Many programs fail to display it at all.

For practical reasons, in order to use a character with a diacritic mark, you should primarily try to find it as a precomposed character. A precomposed character, also called a composite character or a decomposable character, is one that has a code position (and thereby identity) of its own but is in some sense equivalent to a sequence of other characters. There are lots of them in Unicode, and they cover most of the needs of the languages of the world, but not all. Special notations, such as the International Phonetic Alphabet (IPA), may require several different diacritic marks that can be combined with characters, in a manner that makes it quite infeasible to try to define all the combinations as precomposed characters.

For example, the Latin small letter "a" with dieresis ä (U+00E4) is, by Unicode definition, decomposable to the sequence of two characters: Latin small letter "a" (U+0061) and combining dieresis (U+0308). Almost always, however, the letter ä is entered in its precomposed form, though it might then internally be decomposed. Generally, by decomposing all decomposable characters, you could in many cases simplify the processing of textual data, and the resulting data might be converted back to a format using precomposed characters.

5.3.1.3. Features that are not diacritic marks

Many letters that do not contain a diacritic mark in the Unicode sense have historically been formed from a base letter by adding some mark to it. For example, the Norwegian and Danish ø is originally an "o" with a slanted line over it. Its name, "Latin small letter o with stroke," is a reminder of this and could even be read as suggesting that it is a combination of an "o" and a diacritic mark called "stroke." Similarly, the letter Ł, "Latin capital letter L with stroke," used in Polish, would seem to be an "L" with the same diacritic, though with a different visual shape.

Although such letters are often understood as letters with diacritic marks, they are classified as independent letters in Unicode. The characters ø and Ł are not decomposable in any way. They have no defined relationships with "o" and "L" in Unicode, except in the sense that in the default collating order (see the section "Collation and Sorting" later in this chapter), ø is sorted in the same primary position as "o," and Ł is sorted in the same primary position as "L."

This approach does not exclude the possibility of treating such characters in some special way in application programming or in language-dependent general rules. Since they are intuitively understood as variants of some base characters, it would be natural to define input methods that relate to such intuition. For example, in MS Word, you can produce ø by using the sequence Ctrl-Shift-7 o. This is relatively easy to remember if your keyboard has the solidus / as Shift-7, so that you can think you are using Ctrl-/ o.

5.3.2. Compatibility Mappings and Canonical Mappings

The Unicode character database defines a decomposition mapping for each character. This mapping associates another character or a sequence of characters with the given character, and this association is indicated as a canonical mapping or as a compatibility mapping, also called decomposition. Typical cases include the following:
5.3.2.1. Difference between canonical and compatibility mappings

Canonical and compatibility mappings are rather fundamental in Unicode, and they are commonly confused with each other. One reason for this is that in many cases, the choice of the mapping type was debatable, if not arbitrary. For example, the micro sign µ has compatibility mapping to the Greek small letter mu, but the ohm sign has canonical mapping to the Greek capital letter omega.

In some notations, e.g., in the Unicode code charts, the character ≡ (identical to, U+2261) is used to indicate canonical mapping, e.g., U+2126 Ω ≡ U+03A9. Handy as this may be, it can be misleading, since the two characters are not identical, though they may be treated as essentially similar by programs. The relation expressed by the symbol here isn't even symmetric, contrary to its normal use in mathematics. The symbol is best read as "has canonical mapping to."

Similarly, the character ≈ (almost equal to, U+2248) is often used to indicate compatibility mapping, e.g., µ U+00B5 ≈ μ U+03BC. This symbol is best read as "has compatibility mapping to."

The short characterizations are:
The Unicode Normalization Form C (discussed in the section "Normalization") is often applied to Unicode data. It applies all canonical mappings (e.g., it loses the distinction between ohm sign and capital omega), but not compatibility mappings (e.g., it keeps micro sign and small mu as distinct).

Although compatibility mapping is not meant to imply semantic difference, the Unicode standard admits (in UAX #15): "However, some characters with compatibility decompositions are used in mathematical notation to represent distinction of a semantic nature; replacing the use of distinct character codes by formatting may cause problems." A simple example of this is the superscript two ², which has compatibility mapping to the digit two, 2. Applying this compatibility mapping in, for example, the expression 5² yields 52 and therefore distorts the meaning. In some cases, this can be fixed by using markup or formatting instructions, but in plain text, that's not possible.

5.3.2.2. Canonical and compatibility equivalence

Although canonical and compatibility mappings are one-directional and do not mean equivalence, we can define equivalence relations based on them. Canonical and compatibility equivalence are defined for sequences of characters (i.e., strings), naturally regarding a single character as a special case. The exact definitions will be given later in this chapter, but the basic idea is the following.

Strings are canonical equivalent if their canonical decompositions, obtained by applying all canonical mappings, are the same. Thus, in particular, if A has a canonical mapping to B, then A and B are canonical equivalent. Compatibility equivalence is defined in a similar way, except that both compatibility and canonical mappings are applied.
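The basic idea of the two equivalence relations can be tried out with Python's standard unicodedata module. The following is only a minimal sketch: the helper names canonically_equivalent and compatibility_equivalent are invented for this illustration, and the sketch assumes that comparing the NFD (canonical) or NFKD (compatibility) forms of two strings is an adequate stand-in for the definitions given later in this chapter.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare the canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare the compatibility decompositions (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


OHM_SIGN, CAPITAL_OMEGA = "\u2126", "\u03a9"
MICRO_SIGN, SMALL_MU = "\u00b5", "\u03bc"

# Precomposed é versus "e" followed by a combining acute accent.
print(canonically_equivalent("\u00e9", "e\u0301"))

# The ohm sign has canonical mapping to capital omega.
print(canonically_equivalent(OHM_SIGN, CAPITAL_OMEGA))

# The micro sign has only compatibility mapping to small mu.
print(canonically_equivalent(MICRO_SIGN, SMALL_MU))
print(compatibility_equivalent(MICRO_SIGN, SMALL_MU))

# Normalization Form C applies canonical mappings only; the compatibility
# variant (NFKC) also turns superscript two into an ordinary digit.
print(unicodedata.normalize("NFC", OHM_SIGN) == CAPITAL_OMEGA)
print(unicodedata.normalize("NFC", MICRO_SIGN) == MICRO_SIGN)
print(unicodedata.normalize("NFKC", "5\u00b2"))
```

The last line shows the distortion described above: after the compatibility mapping has been applied, 5² can no longer be told apart from the number 52.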
5.3.2.3. The meaning of canonical mapping

We already mentioned that canonical mapping does not mean identity, despite the symbol ≡ commonly used to denote it. A relationship like U+2126 Ω ≡ U+03A9 is a relation between two distinct characters. We should expect that programs often make no distinction between them, but a distinction can be made.

For example, a program might recognize U+2126 but not U+03A9, or vice versa. It would then behave differently for them, of course. If it recognizes both, it need not treat them the same way, but any program conforming to the Unicode standard may do so. Thus, if a program sends another program the character U+2126 and the latter acknowledges having received U+03A9, it is accepted behavior, and the sender should be prepared for this.

5.3.2.4. Differences in glyphs for equivalent characters

A character may be visually distinct from its compatibility mapping. For example, many fonts have different glyphs for µ U+00B5 and μ U+03BC. The Unicode standard explicitly says that replacing a character with its compatibility mapping may lose formatting information.

In practice, a character may visually differ from its canonical mapping, too, although the general idea is that this shouldn't happen. For example, a font that contains both U+2126 and U+03A9 may have different glyphs for them, although we would expect them to have the same basic shape.

In some cases, there is no difference in any font, but the appearances may still differ! For example, if a font contains the Kelvin sign K (U+212A), it looks just the same as the Latin capital letter "K" in that font. But if you create, for example, a web page containing the Kelvin sign, it will often look different from the letter "K," since a browser uses its default font for the letter "K" and picks up the Kelvin sign from a different font.

5.3.2.5. How the mappings are defined

When you need to know about the canonical or compatibility mapping of a particular character, you can consult some of the resources mentioned in Chapter 4, which also described the overall structure of the Unicode database.

The UnicodeData.txt file in the Unicode database contains, for each character, a field (the sixth one) that specifies whether the character has a decomposition mapping, as well as the specific decomposition and its nature (canonical or compatibility). Let us consider the following line at http://www.unicode.org/Public/UNIDATA/UnicodeData.txt:

00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C

Here, the notation <compat> 03BC means that the character has compatibility mapping to U+03BC. Instead of <compat>, the field could also contain a more specific notation, such as <super>, which also indicates the nature of the presentational difference. For example:

00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;

Superscript two (²) is an ISO Latin 1 character with its own code position in that standard. In the Unicode way of thinking, it would have been treated as a superscript variant of digit two (2), if there had not been a particular reason to do otherwise. This does not mean that in the Unicode philosophy superscripting (or subscripting, italics, bolding, etc.) would be irrelevant; rather, it is to be handled at another level of data presentation, such as some special markup or styling.

Since the superscript two character is contained in an important standard, it was included in Unicode, though only as a compatibility character, with <super> 0032 in the sixth field in its entry in the database. The practical reason is that now one can convert from ISO Latin 1 to Unicode and back and get the original data unchanged.

The sixth field might also contain just the number of a character, or numbers of characters, without any indication of compatibility.
For example:

212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;

The field 00C5 means that the angstrom sign U+212B has canonical mapping to the Latin capital letter "A" with ring above Å (U+00C5). Since no notation like <compat> or <super> is present in the field, it indicates canonical mapping and not compatibility mapping.

You can also find information on decomposition mappings in the Unicode code charts, where they appear more legibly, as illustrated in Figure 5-2, though divided into a large number of PDF files. In the charts, characters at the start of an item under a character's name have meanings as follows:

Figure 5-2. Descriptions of four characters in a code chart
The last example in Figure 5-2 illustrates that a character does not always have a decomposition even if it greatly resembles another character. The estimated symbol ℮ (U+212E) is surely derived from the letter "e," but it is treated as an independent character in Unicode.

The compatibility formatting tag <super> looks like an HTML or XML tag, but it is just a notation used in the Unicode database to indicate the value of the property dt = Decomposition Type. The "tags" do not appear in actual data, of course. On the other hand, characters with such mappings can often be replaced by markup elements that contain the non-compatibility character. For example, modifier letter small "h" (U+02B0), with compatibility mapping "<super> 0068," might be replaced by the markup <sup>h</sup> in HTML, though this is often debatable (see Chapter 9).

The meanings of compatibility formatting tags used in the compatibility mappings are given in Table 5-2. The words "narrow" and "wide" refer specifically to presentation forms used in East Asian writing systems.
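The sixth field of UnicodeData.txt can also be inspected programmatically. The sketch below relies on Python's standard unicodedata.decomposition function, which returns the field as a string (empty when there is no mapping). The helper decomposition_mapping and the label "canonical" for an untagged field are conventions invented for this illustration, and the values returned depend on the Unicode version bundled with the Python installation.

```python
import unicodedata


def decomposition_mapping(ch: str):
    """Hypothetical helper: split the decomposition field of one character.

    Returns (kind, code_points), where kind is None if the character has no
    mapping, "canonical" if the field carries no tag, and otherwise the
    compatibility formatting tag without angle brackets (e.g., "super").
    """
    field = unicodedata.decomposition(ch)  # e.g., "<compat> 03BC" or "00C5"
    if not field:
        return None, []
    parts = field.split()
    kind = "canonical"
    if parts[0].startswith("<"):
        kind = parts[0].strip("<>")
        parts = parts[1:]
    return kind, [int(p, 16) for p in parts]


for ch in ("\u00b5", "\u00b2", "\u212b", "\u02b0", "\u212e"):
    kind, code_points = decomposition_mapping(ch)
    targets = " ".join(f"U+{cp:04X}" for cp in code_points)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {kind} {targets}")
```

For the characters discussed above, this reports a compat mapping for the micro sign, a super mapping for superscript two and for modifier letter small "h," a canonical mapping for the angstrom sign, and no mapping at all for the estimated symbol.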
5.3.3. Canonical Decomposition and Compatibility Decomposition

Canonical and compatibility decomposition are based on the canonical and compatibility mappings discussed earlier, but decompositions may consist of successive application of the mappings. For example, the angstrom sign Å (U+212B) has canonical mapping to Latin capital letter "A" with ring Å (U+00C5), which in turn has canonical mapping to letter "A" followed by a combining diacritic. Successive application of mappings is often called "recursive," but it is really iteration rather than recursion.

Decomposition replaces a character by a sequence of characters that are in some sense more basic. From the perspective of the Unicode standard, decomposition is something that you may or may not perform, just as you find suitable for your purposes. Other standards and rules may make decomposition compulsory in some contexts.

There are two kinds of decomposition defined in the Unicode standard: canonical and compatibility. They relate to the two kinds of mappings, although in a somewhat more complex way than you might expect.

5.3.3.1. Canonical decomposition

Canonical decomposition of a character means the following: if the character has a canonical mapping, you replace it with the character or string in the mapping. Then you check whether any character in the result has a canonical mapping, and you proceed until no further mapping exists. The mappings have of course been defined so that the process ends after a finite number of steps, without entering a loop.

For example, the canonical decomposition of the angstrom sign Å (U+212B) is the two-character sequence U+0041 U+030A (letter "A" and combining ring above). As explained previously, two mapping steps are taken in this case.

In fact, canonical decomposition involves two additional algorithms. By definition, canonical decomposition consists of the following:
5.3.3.2. Canonical Ordering Behavior

Canonical Ordering Behavior is based on the ccc = Canonical Combining Class property, which assigns an integer to each character. For nonspacing marks, this value describes the position of the mark with respect to the base character, and it is also used for ordering the marks. For characters other than nonspacing marks, this value is zero.

The Canonical Ordering Behavior, described in detail in section 3.11 of the Unicode standard, reorders consecutive nonspacing marks in increasing order by their Canonical Combining Class property. This removes some variation. For example, the letter "e" with a circumflex above and a dot below can be represented in five ways in Unicode:

U+1EC7 (ệ, precomposed letter "e" with circumflex and dot below)
U+00EA U+0323 (ê followed by combining dot below)
U+1EB9 U+0302 (ẹ followed by combining circumflex accent)
U+0065 U+0302 U+0323 (letter "e," combining circumflex accent, combining dot below)
U+0065 U+0323 U+0302 (letter "e," combining dot below, combining circumflex accent)
In canonical decomposition, canonical mappings remove part of the variation: the result is fully decomposed. However, the combining marks may appear in two different orders, depending on the initial data. Canonical Ordering Behavior removes this variation, if the combining marks belong to different combining classes. In our example, combining circumflex accent (U+0302) has combining class 230, whereas combining dot below (U+0323) has combining class 220. The one with the lower class comes first, so canonical decomposition changes the five ways in the above list to a single representation: U+0065 U+0323 U+0302 (letter "e," combining dot below, combining circumflex accent).

Canonical decomposition does not remove all variation in the order of combining marks. If two marks belong to the same combining class, their mutual order is not changed. The reason is that the order can be significant: being in the same class, the marks may interact typographically, and this interaction may depend on their mutual order.

For example, U+0065 U+0306 U+0302 and U+0065 U+0302 U+0306 (letter "e" followed by combining breve and combining circumflex accent in either order) remain different after decomposition. The combining breve and the combining circumflex accent both have combining class 230, because they are in essentially the same position with respect to the base character. Thus, an adequate rendering process will produce different visual results: "e" with a breve above it and a circumflex above the breve, or "e" with a circumflex above it and a breve above the circumflex. (A poor implementation produces an "e" with a breve and circumflex overprinting each other.)

5.3.3.3. Canonical equivalence

The Unicode standard defines canonical equivalence of strings, and it is an equivalence relation in the mathematical sense.
It is reflexive (i.e., any string is equivalent to itself); it is symmetric (i.e., if A is equivalent to B, then B is equivalent to A); and it is transitive (i.e., if A is equivalent to B and B is equivalent to C, then A is equivalent to C).

Strings are by definition canonical equivalent if their canonical decompositions are identical. For example, the five ways of representing "e" with dot below and circumflex discussed in the previous section are all canonical equivalent.

5.3.3.4. Compatibility decomposition and equivalence

Compatibility decomposition is defined the same way as canonical decomposition, except that compatibility mappings are applied in addition to canonical mappings, so compatibility decomposition includes canonical decomposition. Compatibility decomposition of a string consists of the following:
For example, the compatibility decomposition of the (rather artificial) string "½ µé," where µ is the micro sign, is the string "1⁄2 μe´," where ⁄ is the fraction slash (U+2044), μ is the Greek letter mu, and ´ denotes the combining acute accent.

Compatibility equivalence of strings is defined in the obvious way: strings are compatibility equivalent if their compatibility decompositions are identical. It follows from the definitions that canonical equivalent strings are compatibility equivalent, too.

5.3.3.5. Canonical and compatibility decomposable characters

The Unicode standard uses a large number of rather redundant terms. We need to mention them, since you may encounter them when reading about Unicode. A character that is canonical equivalent to something other than itself is said to be canonical decomposable. Similarly, if a character is compatibility equivalent to something other than itself, it is compatibility decomposable.

Often such decomposability really means that a character can be decomposed into constituents, e.g., ä can be decomposed into "a" and a combining dieresis. However, many of the "decompositions" just map one character to another character, as in the case of U+2126 Ω ≡ U+03A9, mentioned earlier in the chapter.

5.3.4. Compatibility Characters

Unicode contains a large number of characters described as "compatibility characters." Many of them are variants of other characters. The overall tone of the standard is that compatibility characters should be avoided, except in legacy data. However, it does not explicitly deprecate them; on the contrary, it says: "The status of a character as a compatibility character does not mean that the character is deprecated in the standard." There is a separate concept of deprecation, for characters that really should not be used at all but have been preserved in Unicode according to its design principles.
Compatibility characters were included in Unicode for compatibility with other character codes, i.e., just because the characters exist in one or more other character codes. One reason for this is that data presented using some other code can be converted to Unicode and back, or from one character code to another using Unicode as an intermediate code, without losing information. The Unicode standard says:
Many, but not all, compatibility characters have compatibility decompositions, which specify the character's relationship to other characters. There has been some confusion around this, since not all compatibility characters have such decompositions. The Unicode standard itself mentions that the phrase "compatibility character" is also used in a narrower sense, which refers to compatibility decomposable characters, i.e., those characters that have compatibility decompositions. The phrase "compatibility composite (character)" is also mentioned as a synonym, but that sounds quite redundant and confusing.

For example, the micro sign µ (U+00B5) is a compatibility decomposable character. It has compatibility mapping to the Greek small letter mu μ (U+03BC).

The Unicode character database rigorously defines for each character whether it is a compatibility decomposable character, as well as its decomposition, if any. The same information is presented in a more readable form in the code charts, where the character ≈ indicates compatibility decomposition, as illustrated in Figure 5-3. The question exclamation mark (U+2048) is defined as a separate character, but with compatibility mapping to the character pair U+003F U+0021, i.e., ? followed by !.

Figure 5-3. Sample definition of a compatibility decomposable character

The more general concept of compatibility character is defined in prose only, and it includes, for example, deprecated alternate format characters, which have no decomposition, as well as CJK compatibility ideographs, which have canonical decompositions, not compatibility decompositions. This wider concept of compatibility character is basically just descriptive; rules and algorithms operate on the decompositions.
For example, as discussed previously, the angstrom sign Å (U+212B) is defined so that it has a canonical decomposition, not a compatibility decomposition. Yet, it is a compatibility character, because it has been declared as such in the prose of the Unicode standard. Generally, when a character is defined to have canonical mapping to a single character, the explanation is that it has been included in Unicode for compatibility only and it is regarded as so similar to the other character that their renderings are expected to be the same.

Thus, canonical mapping means different things in different cases, depending on whether a character has canonical mapping to one character or to a sequence of characters. For example, Latin capital letter "A" with ring above Å (U+00C5) has canonical mapping to U+0041 U+030A (letter "A" followed by combining ring above), but it is not a compatibility character. It is simply a "normal" character that is decomposable into two "normal" characters.

5.3.5. Compatibility Decomposable Characters

Replacing a compatibility decomposable character by the corresponding normal character or sequence of characters does not, by Unicode definition, change the meaning of text, but it may change formatting and layout. For example, the micro sign and the small mu are expected to look similar, but not necessarily identical.

This definition is subject to some criticism, though. It can be argued that the micro sign is quite different in meaning from the small mu. The micro sign unambiguously denotes a multiplier of a unit. The small mu is a letter of the Greek alphabet, and it is normally used when writing Greek words, although it could also appear in a variety of special meanings. The Unicode standard does not recommend that such distinctions should be made, or that they should not be made. Rather, the micro sign is included for compatibility with old character codes, and its inclusion in fact implies that the distinction can be made, if desired.
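An application that wants to decide for itself whether to keep such distinctions first needs to find the compatibility decomposable characters in its data. The following minimal sketch uses Python's standard unicodedata module. The function names are invented for this illustration, and the test used (the compatibility decomposition, NFKD, differs from the canonical decomposition, NFD) is one reasonable reading of "compatibility decomposable," not a quotation of the standard's wording.

```python
import unicodedata


def is_compatibility_decomposable(ch: str) -> bool:
    """Hypothetical helper: true if the compatibility decomposition of the
    character differs from its canonical decomposition."""
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)


def find_compatibility_decomposables(text: str):
    """Return (index, character) pairs for compatibility decomposable
    characters, so that the caller can decide what to do with each one."""
    return [(i, ch) for i, ch in enumerate(text)
            if is_compatibility_decomposable(ch)]


sample = "5 \u00b5m, 3 \u212b"  # micro sign and angstrom sign
for index, ch in find_compatibility_decomposables(sample):
    print(index, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

In this sample only the micro sign is reported. The angstrom sign is a compatibility character in the prose sense, but its decomposition is canonical, so a test based on decompositions does not flag it.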
Many compatibility characters are in the Compatibility Area but others are scattered around the Unicode coding space. They belong to different types, such as the following: