

5.3. Compositions and Decompositions

The 10 design principles of Unicode, presented in Chapter 4, contain one principle on dynamic composition and another principle on equivalent sequences. For example, the letter é can be represented as a single Unicode character, or dynamically composed as a two-character string (letter "e" followed by a combining acute accent). The single character é is said to have a canonical decomposition consisting of two characters, and this relationship implies canonical equivalence.
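This relationship can be observed with Python's standard unicodedata module; a minimal sketch:

```python
import unicodedata

precomposed = "\u00E9"  # é as a single character (U+00E9)

# NFD (canonical decomposition) yields "e" followed by
# U+0301 COMBINING ACUTE ACCENT
decomposed = unicodedata.normalize("NFD", precomposed)
assert decomposed == "e\u0301"

# NFC recomposes the pair back into the single character
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Both strings denote the same letter; only the encoding as code points differs.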

Unicode lets you combine a base character with an unlimited number of combining diacritic marks. In practice, there's most often just one diacritic, sometimes two, but there is no limit. For example, phonetic or mathematical notations may deploy several diacritic marks on one character. As a base character, you can use any character that does not itself combine with preceding characters and that is neither a control nor a format character.

Unicode would be simpler if all letters with diacritic marks were represented using dynamic composition. For various practical reasons, another approach was taken, and this implies that we need to deal with precomposed forms and with conversions between them and decomposed forms.

Characters may have decompositions in a different sense, too. Many characters have compatibility decompositions. For example, the Latin small ligature "fi" (ﬁ, U+FB01) has a compatibility decomposition that consists of the two characters "f" and "i."

5.3.1. The Impact of Diacritic Marks

A diacritic mark is an additional graphic such as an accent (as in è or é) or cedilla (as in ç) attached to a character. It may affect the pronunciation of a character, or the meaning of a word, or both. It appears visually close to the base character, often above or below it, possibly crossing over its line, but it is treated as a logically separable part.

A diacritic mark can be treated in different ways when defining a character repertoire. You could define a character like é (letter "e" with acute accent) as a separate character, or you could define the base character "e" and the diacritic ´ as two distinct characters. In the latter approach, you would need to define the diacritic as combining (nonspacing), or otherwise indicate that it be rendered as attached to the character, not as a separate character after it.

For example, the ISO-8859-1 character code contains a collection of letters with diacritic marks, such as é, but no combining marks. It contains the acute accent ´, but as a normal (spacing) character, which is not combined with any other character in any way.

The Unicode standard uses nonspacing mark as a term that covers diacritic marks but can be seen as somewhat more general in nature. The term "diacritic mark" is often used to denote accents and other marks attached to Latin, Greek, Cyrillic, and other letters, whereas "nonspacing mark" also covers Hebrew points, Arabic vowel marks, etc. In this book, "diacritic mark" is used in a broad sense, as a synonym for "nonspacing mark."

5.3.1.1. Precomposed and decomposed form

In Unicode, a character with a diacritic mark can often be represented in two ways. You can express é as a precomposed character or as decomposed, i.e., as a character pair consisting of "e" and a combining acute accent. Both representations are possible for a large number of commonly used characters, though not for all characters with diacritics.

This means flexibility, but it also creates a pile of problems. What happens if a database contains é as decomposed but a search string typed by the user contains it as precomposed? This is just the beginning of the problems. For example, a character with several diacritic marks can be represented as several different decompositions.
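The search problem just described can be sketched in a few lines of Python; the sample strings are invented for illustration:

```python
import unicodedata

stored = "caf\u00E9"   # the database holds the precomposed form of "café"
typed = "cafe\u0301"   # the user typed the decomposed form

# A naive byte-by-byte (code point) comparison fails:
naive_match = stored == typed                      # False

# Normalizing both sides to the same form makes the match succeed:
normalized_match = (unicodedata.normalize("NFC", stored)
                    == unicodedata.normalize("NFC", typed))  # True
```

This is why search and comparison routines should normalize their inputs before comparing.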

Unicode contains separate characters called combining diacritical marks. The general idea is that you can express a vast set of characters with diacritics by representing them so that a base character is followed by one or more combining (nonspacing) diacritic mark(s). A program that displays such a construct is expected to do rather clever things in formatting, e.g., selecting a particular shape for the diacritic according to the shape of the base character.

In Unicode, a combining diacritic mark always follows the base character in data. It may visually appear above, below, or on either side of the base character. The logical order differs from the order in many methods of typing characters with diacritic marks. For example, on many keyboards, you could first press a key labeled ´, and then the "e" key, to produce é. However, if this letter is represented in data as decomposed, it has the combining diacritic mark after the base letter "e."

The order in typing mechanisms reflects the methods used on mechanical typewriters. They may contain a ´ key, which is non-advancing, i.e., the writing position is not moved forward. Therefore, the next character will overprint the symbol, resulting in a coarsely constructed accented letter. In Unicode, combining diacritic marks are supposed to be rendered as combined with the preceding character in a more elaborate way.

5.3.1.2. Combining marks: powerful, but still poorly supported

Many programs currently in use are totally incapable of doing anything meaningful with combining diacritic marks. However, there is at least some simple support for them in word processors and web browsers, for example. Regarding advanced implementation of the rendering of characters with diacritic marks, consult Unicode Technical Note #2, "A General Method for Rendering Combining Marks," http://www.unicode.org/notes/tn2/.

Using combining diacritic marks, we have a wide range of possibilities. We can put, say, a dieresis on a gamma, although "Greek small letter gamma with dieresis" does not exist in Unicode as a character with a code position of its own. The combination U+03B3 U+0308 consists of two characters, although its visual presentation looks like a single character in the same sense as ä looks like a single character. A word processor may display it as γ̈, which might be of poor quality (the dieresis is not correctly placed with respect to the base character), but probably legible. Many programs fail to display it at all. For practical reasons, in order to use a character with a diacritic mark, you should primarily try to find it as a precomposed character.
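The gamma-with-dieresis case can be checked programmatically. Since no precomposed character exists for this combination, normalization cannot collapse it into one code point:

```python
import unicodedata

gamma_dieresis = "\u03B3\u0308"  # gamma + COMBINING DIAERESIS

# There is no precomposed "gamma with dieresis" in Unicode, so even
# composition (NFC) leaves the two-character sequence unchanged.
assert unicodedata.normalize("NFC", gamma_dieresis) == gamma_dieresis
assert len(gamma_dieresis) == 2
```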

A precomposed character, also called a composite character or a decomposable character, is one that has a code position (and thereby identity) of its own but is in some sense equivalent to a sequence of other characters. There are lots of them in Unicode, and they cover most of the needs of the languages of the world, but not all. Special notations, such as the International Phonetic Alphabet (IPA), may require several different diacritic marks that can be combined with characters, in a manner that makes it quite infeasible to try to define all the combinations as precomposed characters.

For example, the Latin small letter "a" with dieresis ä (U+00E4) is, by Unicode definition, decomposable to the sequence of the two characters: Latin small letter "a" (U+0061) and combining dieresis (U+0308). Almost always, however, the letter ä is entered in its precomposed form, though it might then internally be decomposed. Generally, by decomposing all decomposable characters, you could in many cases simplify the processing of textual data, and the resulting data might be converted back to a format using precomposed characters.

5.3.1.3. Features that are not diacritic marks

Many letters that do not contain a diacritic mark in the Unicode sense have historically been formed from a base letter by adding some mark to it. For example, the Norwegian and Danish ø is originally an "o" with a slanted line over it. Its name, "Latin small letter o with stroke," reminds us of this and could even be read as suggesting that it is a combination of an "o" and a diacritic mark called "stroke." Similarly, the letter Ł, "Latin capital letter L with stroke," used in Polish, would seem to be an "L" with the same diacritic, though with a different visual shape.

Although such letters are often understood as letters with diacritic marks, they are classified as independent letters in Unicode. The characters ø and Ł are not decomposable in any way. They have no defined relationships with "o" and "L" in Unicode, except in the sense that in the default collating order (see the section "Collation and Sorting" later in this chapter), ø is sorted in the same primary position as "o," and Ł is sorted in the same primary position as "L."

This approach does not exclude the possibility of treating such characters in some special way in application programming or in language-dependent general rules. Since they are intuitively understood as variants of some base characters, it would be natural to define input methods that relate to such intuition. For example, in MS Word, you can produce ø by using the sequence Ctrl-Shift-7 o. This is relatively easy to remember if your keyboard has the solidus / as Shift-7, so that you can think you are using Ctrl-/ o.

5.3.2. Compatibility Mappings and Canonical Mappings

The Unicode character database defines a decomposition mapping for each character. This mapping associates another character or a sequence of characters with the given character, and this association is indicated as a canonical mapping or as a compatibility mapping, also called a decomposition. Typical cases include the following:

  • A character with a diacritic mark has a canonical mapping to a sequence of a base character and a combining diacritic mark. For example, é has a canonical mapping to "e" followed by a combining acute accent.

  • A ligature has a compatibility mapping to a sequence consisting of the constituent letters. For example, the ligature ﬁ has a compatibility mapping to "f" followed by "i."

  • A character that is treated as a variant of another character often has a compatibility mapping to it, although sometimes the mapping is defined as being canonical. For example, many characters have so-called fullwidth forms for use in East Asian texts, where normal forms of symbols like $ might look odd when other characters are "wide" (basically, designed to fit into a square). These forms, such as the fullwidth dollar sign ＄ (U+FF04), have compatibility mappings to the normal characters.

5.3.2.1. Difference between canonical and compatibility mappings

Canonical and compatibility mappings are rather fundamental in Unicode, and they are commonly confused with each other. One reason for this is that in many cases, the choice of the mapping type was debatable, if not arbitrary. For example, the micro sign µ has a compatibility mapping to the Greek small letter mu, but the ohm sign has a canonical mapping to the Greek capital letter omega. In some notations, e.g., in the Unicode code charts, the character ≡ (identical to, U+2261) is used to indicate canonical mapping, e.g., U+2126 ≡ U+03A9. Handy as this may be, it can be misleading, since the two characters are not identical, though they may be treated as essentially similar by programs. The relation expressed by the symbol here isn't even symmetric, contrary to its normal use in mathematics. The symbol ≡ is best read as "has canonical mapping to."

Similarly, the character ≈ (almost equal to, U+2248) is often used to indicate compatibility mapping, e.g., µ U+00B5 ≈ μ U+03BC. This symbol is best read as "has compatibility mapping to."

The short characterizations are:

  • If A has canonical mapping to B, then A and B are really two different ways of encoding the same symbol in Unicode. As codes or sequences of codes, they are different, but they have the same ultimate meaning and normally the same rendering.

  • If A has compatibility mapping to B, then A and B denote fundamentally similar characters, which may differ in rendering, as well as in scope of usage. In practice, they may differ in meaning, too.

The Unicode Normalization Form C (discussed in the section "Normalization") is often applied to Unicode data. It applies all canonical mappings (e.g., loses the distinction between ohm sign and capital omega), but not compatibility mappings (e.g., it keeps micro sign and small mu as distinct).
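The different treatment of the two mapping kinds under NFC can be demonstrated with Python's unicodedata module:

```python
import unicodedata

ohm, omega = "\u2126", "\u03A9"   # OHM SIGN, GREEK CAPITAL LETTER OMEGA
micro, mu = "\u00B5", "\u03BC"    # MICRO SIGN, GREEK SMALL LETTER MU

# NFC applies canonical mappings: the ohm sign becomes capital omega...
assert unicodedata.normalize("NFC", ohm) == omega
# ...but it does not apply compatibility mappings: the micro sign survives
assert unicodedata.normalize("NFC", micro) == micro
# NFKC applies both kinds, so the micro sign becomes small mu
assert unicodedata.normalize("NFKC", micro) == mu
```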

Although compatibility mapping is not meant to imply semantic difference, the Unicode standard admits (in UAX #15): "However, some characters with compatibility decompositions are used in mathematical notation to represent distinction of a semantic nature; replacing the use of distinct character codes by formatting may cause problems." A simple example of this is the superscript two ², which has a compatibility mapping to the digit two, 2. Applying this compatibility mapping in, for example, the expression 5² yields 52 and therefore distorts the meaning. In some cases, this can be fixed by using markup or formatting instructions, but in plain text, that's not possible.
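The distortion is easy to reproduce:

```python
import unicodedata

expr = "5\u00B2"  # the expression 5², using U+00B2 SUPERSCRIPT TWO

# NFKC applies the compatibility mapping, turning "5²" into "52"
assert unicodedata.normalize("NFKC", expr) == "52"
# NFC applies only canonical mappings and keeps the distinction
assert unicodedata.normalize("NFC", expr) == expr
```

This is why NFKC should be applied with care to data where such distinctions carry meaning.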

5.3.2.2. Canonical and compatibility equivalence

Although canonical and compatibility mappings are one-directional and do not mean equivalence, we can define equivalence relations based on them. Canonical and compatibility equivalence are defined for sequences of characters (i.e., strings), naturally regarding a single character as a special case. The exact definitions will be given later in this chapter, but the basic idea is the following. Strings are canonical equivalent, if their canonical decompositions, obtained by applying all canonical mappings, are the same. Thus, in particular, if A has a canonical mapping to B, then A and B are canonical equivalent. Compatibility equivalence is defined in a similar way, except that both compatibility and canonical mappings are applied.

The term "canonical equivalent" is from the Unicode standard, so we use it in this book, instead of the grammatically more correct expression "canonically equivalent."


5.3.2.3. The meaning of canonical mapping

We already mentioned that canonical mapping does not mean identity, despite the ≡ symbol commonly used to denote it. A relationship like U+2126 ≡ U+03A9 is a relation between two distinct characters. We should expect that programs often make no distinction between them, but a distinction is still possible. For example, a program might recognize U+2126 but not U+03A9, or vice versa. It would then behave differently for them, of course. If it recognizes both, it need not treat them the same way, but any program conforming to the Unicode standard may do so. Thus, if a program sends another program the character U+2126 and the latter acknowledges having received U+03A9, this is accepted behavior, and the sender should be prepared for it.

5.3.2.4. Differences in glyphs for equivalent characters

A character may be visually distinct from its compatibility mapping. For example, a font that contains both U+2126 and U+03A9 may have different glyphs for them, although we would expect them to have the same basic shape. The Unicode standard explicitly says that replacing a character with its compatibility mapping may lose formatting information.

In practice, a character may visually differ from its canonical mapping, too, although the general idea is that this shouldn't happen. For example, many fonts have different glyphs for µ U+00B5 and μ U+03BC. In some cases, there is no difference in any font, but the appearances may still differ! For example, if a font contains the Kelvin sign K (U+212A), it looks just the same as the Latin capital letter "K," K, in that font. But if you create, for example, a web page containing the Kelvin sign, it will often look different from the letter "K," since a browser uses its default font for the letter "K" and picks up the Kelvin sign from a different font.
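The Kelvin sign case can be verified directly: it has a singleton canonical mapping to the ordinary letter, so even canonical normalization replaces it.

```python
import unicodedata

kelvin = "\u212A"  # KELVIN SIGN

# The Kelvin sign canonically maps to LATIN CAPITAL LETTER K,
# so NFC (and NFD) replace it with the plain letter "K".
assert unicodedata.normalize("NFC", kelvin) == "K"
assert kelvin != "K"   # but as code points they remain distinct
```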

5.3.2.5. How the mappings are defined

When you need to know about the canonical or compatibility mapping of a particular character, you can consult some of the resources mentioned in Chapter 4, which also described the overall structure of the Unicode database.

The UnicodeData.txt file in the Unicode database contains, for each character, a field (the sixth one) that specifies whether the character has a decomposition mapping, as well as the specific decomposition and its nature (canonical or compatibility). Let us consider the following line at http://www.unicode.org/Public/UNIDATA/UnicodeData.txt:

 00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C

Here, the notation <compat> 03BC means that the character has compatibility mapping to U+03BC. Instead of <compat>, the field could also contain a more specific notation, such as <super>, which also indicates the nature of the presentational difference. For example:

 00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;

Superscript two (²) is an ISO Latin 1 character with its own code position in that standard. In the Unicode way of thinking, it would have been treated as a superscript variant of digit two (2), if there had not been a particular reason to do otherwise. This does not mean that in the Unicode philosophy superscripting (or subscripting, italics, bolding, etc.) would be irrelevant; rather, it is to be handled at another level of data presentation, such as some special markup or styling. Since the superscript two character is contained in an important standard, it was included into Unicode, though only as a compatibility character, with <super> 0032 in the sixth field in its entry in the database. The practical reason is that now one can convert from ISO Latin 1 to Unicode and back and get the original data unchanged.

The sixth field might also contain just the number of a character, or numbers of characters, without any indication of compatibility. For example:

 212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;

The field 00C5 means that the angstrom sign U+212B has canonical mapping to the Latin capital letter "A" with ring above Å (U+00C5). Since no notation like <compat> or <super> is present in the field, it indicates canonical mapping and not compatibility mapping.
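The sixth field can be extracted with a few lines of Python; the helper name below is ours, invented for illustration, not part of any standard API:

```python
def decomposition_field(line):
    """Return (tag, code_points) from a UnicodeData.txt line.

    tag is a compatibility formatting tag such as "<compat>" or "<super>",
    or None for a canonical mapping (or no mapping at all).
    """
    fields = line.split(";")
    decomp = fields[5]                 # the sixth field, zero-indexed as 5
    if not decomp:
        return None, []                # no decomposition mapping
    parts = decomp.split()
    if parts[0].startswith("<"):       # e.g., <compat> 03BC
        return parts[0], parts[1:]     # compatibility mapping
    return None, parts                 # canonical mapping, e.g., 00C5

# The two lines quoted above:
assert decomposition_field(
    "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C"
) == ("<compat>", ["03BC"])
assert decomposition_field(
    "212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;"
) == (None, ["00C5"])
```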

You can also find information on decomposition mappings in the Unicode code charts, where they appear more legibly, as illustrated in Figure 5-2, though divided into a large number of PDF files. In the charts, characters at the start of an item under a character's name have meanings as follows:

Figure 5-2. Descriptions of four characters in a code chart


  • ≡ indicates canonical mapping.

  • ≈ indicates compatibility mapping.

  • • indicates an informal note (not any mapping).

  • → is a cross reference, which can be read as "compare with"; it does not mean any mapping, and it explicitly warns against confusing the character with another one.

The last example in Figure 5-2 illustrates that a character does not always have a decomposition even if it greatly resembles another character. The estimated symbol ℮ (U+212E) is surely derived from the letter "e," but it is treated as an independent character in Unicode.

The compatibility formatting tag <super> looks like an HTML or XML tag, but it is just a notation used in the Unicode database to indicate the value of the property dt = Decomposition Type. The "tags" do not appear in actual data, of course. On the other hand, characters with such mappings can often be replaced by markup elements that contain the non-compatibility character. For example, modifier letter small "h" (U+02B0) with compatibility mapping "<super> 0068," might be replaced by the markup <sup>h</sup> in HTML, though this is often debatable (see Chapter 9).

The meanings of compatibility formatting tags used in the compatibility mappings are given in Table 5-2. The words "narrow" and "wide" refer specifically to presentation forms used in East Asian writing systems.

Table 5-2. Compatibility formatting tags

Tag          Meaning

<circle>     An encircled form
<compat>     Otherwise unspecified compatibility character
<final>      A final presentation form (Arabic)
<font>       A font variant (e.g., a blackletter or italics form)
<fraction>   A vulgar fraction form, such as ½
<initial>    An initial presentation form (Arabic)
<isolated>   An isolated presentation form (Arabic)
<medial>     A medial presentation form (Arabic)
<narrow>     A narrow (hankaku) compatibility character
<noBreak>    A no-break version of a space, hyphen, or other punctuation
<small>      A small variant form (CNS compatibility)
<square>     A CJK squared font variant
<sub>        A subscript form
<super>      A superscript form
<vertical>   A vertical layout presentation form
<wide>       A wide (zenkaku) compatibility character


5.3.3. Canonical Decomposition and Compatibility Decomposition

Canonical and compatibility decomposition are based on the canonical and compatibility mappings discussed earlier, but decompositions may consist of successive application of the mappings. For example, the angstrom sign Å (U+212B) has a canonical mapping to Latin capital letter "A" with ring above Å (U+00C5), which in turn has a canonical mapping to letter "A" followed by a combining diacritic. Successive application of mappings is often called "recursive," but it is really iteration rather than recursion.

Decomposition replaces a character by a sequence of characters that are in some sense more basic. From the perspective of the Unicode standard, decomposition is something that you may or may not perform, just as you find suitable for your purposes. Other standards and rules may make decomposition compulsory in some contexts.

There are two kinds of decomposition defined in the Unicode standard: canonical and compatibility. They relate to the two kinds of mappings, although in a somewhat more complex way than you might expect.

5.3.3.1. Canonical decomposition

Canonical decomposition of a character means the following: if the character has a canonical mapping, you replace it with the character or string in the mapping. Then you check whether any character in the result has a canonical mapping, and you proceed until no further mapping exists. The mappings have of course been defined so that the process ends after a finite number of steps, without going into a loop.

For example, the canonical decomposition of the angstrom sign Å (U+212B) is the two-character sequence U+0041 U+030A (letter "A" and combining ring above). As explained previously, two mapping steps are taken in this case.
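The two-step decomposition is observable in Python:

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN

# Step 1 maps U+212B to U+00C5; step 2 maps U+00C5 to "A" followed by
# U+030A COMBINING RING ABOVE. NFD applies mappings until none is left.
assert unicodedata.normalize("NFD", angstrom) == "A\u030A"

# Recomposition (NFC) stops at the precomposed letter Å (U+00C5);
# it never reintroduces the compatibility-motivated U+212B.
assert unicodedata.normalize("NFC", angstrom) == "\u00C5"
```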

In fact, canonical decomposition involves two additional algorithms. By definition, canonical decomposition consists of the following:

  1. Successively apply all the canonical mappings defined in the UnicodeData.txt file and by the Conjoining Jamo Behavior, until no such mapping can be applied. The Conjoining Jamo Behavior, defined in section 3.12 of the Unicode standard, deals with Hangul (Korean) characters and describes an algorithm for decomposing a Hangul syllable character.

  2. Then reorder nonspacing marks according to Canonical Ordering Behavior. This deals with situations where two or more nonspacing marks appear in succession.

5.3.3.2. Canonical Ordering Behavior

Canonical Ordering Behavior is based on the ccc = Canonical Combining Class property, which assigns an integer to each character. For nonspacing marks, this value describes the position of the mark with respect to the base character, and it is also used for ordering the marks. For characters other than nonspacing marks, this value is zero.

The Canonical Ordering Behavior, described in detail in section 3.11 of the Unicode standard, reorders consecutive nonspacing marks in increasing order by their Canonical Combining Class property. This removes some variation. For example, the letter "e" with a circumflex above and a dot below can be represented in five ways in Unicode:

  • As a fully composed character: Latin small letter "e" with circumflex and dot below ệ (U+1EC7)

  • As fully decomposed in two ways, using two different orders for the combining marks

  • As partly composed in two ways: ê followed by combining dot below, or ẹ followed by combining circumflex accent

In canonical decomposition, canonical mappings remove part of the variation: the result is fully decomposed. However, the combining marks may appear in two different orders, depending on the initial data. Canonical Ordering Behavior removes this variation, if the combining marks belong to different combining classes. In our example, combining circumflex accent (U+0302) has combining class 230, whereas combining dot below (U+0323) has combining class 220. The one with lower class comes first, so canonical decomposition changes the five ways in the above list to a single representation: U+0065 U+0323 U+0302 (letter "e," combining dot below, combining circumflex accent).
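All five representations, and the combining classes that decide the canonical order, can be checked with unicodedata:

```python
import unicodedata

forms = [
    "\u1EC7",          # fully composed ệ (U+1EC7)
    "e\u0323\u0302",   # e + dot below + circumflex
    "e\u0302\u0323",   # e + circumflex + dot below
    "\u00EA\u0323",    # ê + combining dot below
    "\u1EB9\u0302",    # ẹ + combining circumflex accent
]

# Combining classes: dot below is 220, circumflex is 230,
# so the dot below comes first in canonical order.
assert unicodedata.combining("\u0323") == 220
assert unicodedata.combining("\u0302") == 230

# All five variants share one canonical decomposition:
decomposed = {unicodedata.normalize("NFD", f) for f in forms}
assert decomposed == {"e\u0323\u0302"}
```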

Canonical decomposition does not remove all variation in the order of combining marks. If two marks belong to the same combining class, their mutual order is not changed. The reason is that the order can be significant, since being in the same class, the marks may interact typographically, and this interaction may depend on the mutual order. For example, U+0065 U+0306 U+0302 and U+0065 U+0302 U+0306 (letter "e" followed by combining breve and combining circumflex accent in either order) remain as different after decomposition. The combining breve and the combining circumflex accent both have combining class 230, because they are in essentially the same position with respect to the base character. Thus, an adequate rendering process will produce different visual results: "e" with a breve above it and with a circumflex above the breve, or "e" with a circumflex above it and a breve above the circumflex. (A poor implementation produces an "e" with a breve and circumflex overprinting each other.)
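The preservation of same-class order is easy to confirm:

```python
import unicodedata

breve_first = "e\u0306\u0302"       # e + breve + circumflex
circumflex_first = "e\u0302\u0306"  # e + circumflex + breve

# Both marks have combining class 230, so canonical ordering leaves
# their mutual order alone and the two strings stay distinct.
assert unicodedata.combining("\u0306") == unicodedata.combining("\u0302")
assert unicodedata.normalize("NFD", breve_first) == breve_first
assert unicodedata.normalize("NFD", circumflex_first) == circumflex_first
assert breve_first != circumflex_first
```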

5.3.3.3. Canonical equivalence

The Unicode standard defines canonical equivalence of strings, and it is an equivalence relation in the mathematical sense. It is reflexive (i.e., any string is equivalent to itself); it is symmetric (i.e., if A is equivalent to B, then B is equivalent to A); and it is transitive (i.e., if A is equivalent to B and B is equivalent to C, then A is equivalent to C).

Strings are by definition canonical equivalent, if their canonical decompositions are identical. For example, the five ways of representing "e" with dot below and circumflex discussed in the previous section are all canonical equivalent.

5.3.3.4. Compatibility decomposition and equivalence

Compatibility decomposition is defined the same way as canonical decomposition, except that compatibility decomposition includes canonical decomposition. Compatibility decomposition of a string consists of the following:

  1. Successively apply all the compatibility mappings and canonical mappings defined in the UnicodeData.txt file and by the Conjoining Jamo Behavior, until no such mapping can be applied.

  2. Then reorder nonspacing marks according to Canonical Ordering Behavior.

For example, the compatibility decomposition of the (rather artificial) string "½ µé," where µ is the micro sign, is the string "1⁄2 μe´," where ⁄ is the fraction slash (U+2044), μ is the Greek letter mu, and ´ denotes the combining acute accent.
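The same example, run through NFKD:

```python
import unicodedata

text = "\u00BD \u00B5\u00E9"  # the string "½ µé"

result = unicodedata.normalize("NFKD", text)
# ½ becomes 1 + U+2044 FRACTION SLASH + 2, the micro sign becomes
# Greek small mu, and é decomposes into "e" + combining acute accent.
assert result == "1\u20442 \u03BCe\u0301"
```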

Compatibility equivalence of strings is defined in the obvious way: strings are compatibility equivalent, if their compatibility decompositions are identical. It follows from the definitions that canonical equivalent strings are compatibility equivalent, too.

5.3.3.5. Canonical and compatibility decomposable characters

The Unicode standard uses a large number of rather redundant terms. We need to mention them, since you may encounter them when reading about Unicode: A character that is canonical equivalent to something other than itself is said to be canonical decomposable. Similarly, if a character is compatibility equivalent to something other than itself, it is compatibility decomposable. Often such decomposability really means that a character can be decomposed into constituents, e.g., ä can be decomposed into "a" and a combining dieresis. However, many of the "decompositions" just map one character to another character, as in the case of U+2126 ≡ U+03A9, mentioned earlier in the chapter.

5.3.4. Compatibility Characters

Unicode contains a large number of characters described as "compatibility characters." Many of them are variants of other characters. The overall tone of the standard is that compatibility characters should be avoided, except in legacy data. However, it does not explicitly deprecate them; on the contrary, it says: "The status of a character as a compatibility character does not mean that the character is deprecated in the standard." There is a separate concept of deprecation, for characters that really should not be used at all but have been preserved in Unicode according to its design principles.

Compatibility characters were included into Unicode for compatibility with other character codes, i.e., just because the characters exist in one or more other character codes. One reason for this is that data presented using some other code can be converted to Unicode and back, or from one character code to another using Unicode as an intermediate code, without losing information. The Unicode standard says:

Compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. They are variants of characters that already have encodings as normal (that is, non-compatibility) characters in the Unicode Standard.

Many, but not all, compatibility characters have compatibility decompositions, which specify the character's relationship to other characters. There has been some confusion around this, since not all compatibility characters have such decompositions. The Unicode standard itself mentions that the phrase "compatibility character" is also used in a narrower sense, which refers to compatibility decomposable characters, i.e., those characters that have compatibility decompositions. The phrase "compatibility composite (character)" is also mentioned as a synonym, but that sounds quite redundant and confusing.

For example, the micro sign µ (U+00B5) is a compatibility decomposable character. It has compatibility mapping to the Greek small letter mu μ (U+03BC).

The Unicode character database rigorously defines for each character whether it is a compatibility decomposable character, as well as the eventual decomposition. The same information is presented in a more readable form in the code charts, where the character ≈ indicates compatibility mapping, as illustrated in Figure 5-3. The question exclamation mark (U+2048) is defined as a separate character, but with a compatibility mapping to the character pair U+003F U+0021, i.e., "?" followed by "!".

Figure 5-3. Sample definition of a compatibility decomposable character

The more general concept of compatibility character is defined in prose only, and it includes, for example, deprecated alternate format characters, which have no decomposition, as well as CJK compatibility ideographs, which have canonical decompositions, not compatibility decompositions. This wider concept of compatibility character is basically just descriptive; rules and algorithms operate on the decompositions.

The concept "compatibility decomposable character" has been defined formally, whereas the concept "compatibility character" is informal but sometimes important. If a character is a compatibility decomposable character, it is a compatibility character; the reverse is not true.


For example, as discussed previously, the angstrom sign Å (U+212B) is defined so that it has a canonical decomposition, not a compatibility decomposition. Yet, it is a compatibility character, because it has been declared as such in the prose of the Unicode standard. Generally, when a character is defined to have canonical mapping to a single character, the explanation is that it has been included into Unicode for compatibility only and it is regarded as so similar to the other character that their renderings are expected to be the same.

Thus, canonical mapping means different things in different cases, depending on whether a character has canonical mapping to one character or to a sequence of characters. For example, Latin capital letter "A" with ring above Å (U+00C5) has canonical mapping to U+0041 U+030A (letter "A" followed by combining ring above), but it is not a compatibility character. It is simply a "normal" character that is decomposable into two "normal" characters.
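The two kinds of canonical mapping can be seen in the character database. A sketch using Python's `unicodedata` module, with the angstrom sign and the letter Å as examples:

```python
import unicodedata

# Angstrom sign: canonical mapping to a single character (no <compat> tag).
print(unicodedata.decomposition("\u212B"))   # 00C5
# Under NFC it simply becomes the letter Å (U+00C5).
print(unicodedata.normalize("NFC", "\u212B")) # Å

# Latin capital letter A with ring above: canonical mapping to a base
# letter plus a combining mark.
print(unicodedata.decomposition("\u00C5"))   # 0041 030A

# Both characters fully decompose to the same NFD sequence, so they are
# canonically equivalent.
print(unicodedata.normalize("NFD", "\u212B")
      == unicodedata.normalize("NFD", "\u00C5"))  # True
```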

5.3.5. Compatibility Decomposable Characters

Replacing a compatibility decomposable character by the corresponding normal character or sequence of characters does not, by Unicode definition, change the meaning of text, but it may change formatting and layout. For example, the micro sign and the small mu are expected to look similar, but not necessarily identical.

This definition is subject to some criticism, though. It can be argued that the micro sign is quite different in meaning from the small mu. The micro sign unambiguously denotes a multiplier of a unit, whereas the small mu is a letter of the Greek alphabet, normally used when writing Greek words, although it can also appear in a variety of special meanings. The Unicode standard does not recommend that such distinctions should be made, or that they should not be made. Rather, the micro sign is included for compatibility with old character codes, and its existence in fact implies that the distinction can be made, if desired.

Many compatibility characters are in the Compatibility Area, but others are scattered around the Unicode coding space. They belong to different types, such as the following:

  • Variants of letters used in specialized meanings, such as the micro sign

  • Variants such as superscripts (e.g., ² as a variant of the digit "2")

  • Ligatures, such as the "fi" ligature ﬁ (U+FB01)

  • Fullwidth forms of ASCII characters, for use in East Asian writing systems

  • Special-purpose combinations of characters, such as the care of sign ℅ (which has a compatibility mapping to c/o)
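The compatibility mappings of all these types can be observed by applying NFKC normalization; a short sketch (the sample characters are illustrative picks from each category):

```python
import unicodedata

samples = {
    "\u00B5": "micro sign",               # µ  -> μ
    "\u00B2": "superscript two",          # ²  -> 2
    "\uFB01": "latin small ligature fi",  # ﬁ  -> fi
    "\uFF21": "fullwidth capital A",      # Ａ -> A
    "\u2105": "care of",                  # ℅  -> c/o
}
for ch, name in samples.items():
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {name:26} -> {folded!r}")
```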

5.3.6. Avoiding Compatibility Characters

The general idea in the Unicode standard is that compatibility characters should be avoided in new data, but it expresses this somewhat indirectly. However, in subsection 3.7, "Decomposition," the standard is rather explicit about compatibility decomposable characters:

    Compatibility decomposable characters ... support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.

In practice, it is not always feasible to avoid compatibility characters in plain text. If plain text contains the string 3², the normal interpretation is that it means 3 to the power 2. Replacing the superscript two with the corresponding non-compatibility character, the digit two, would turn the data into 32, which means something completely different.
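This loss of information is easy to demonstrate: compatibility normalization changes the meaning of the string, while canonical normalization does not. A minimal sketch:

```python
import unicodedata

power = "3\u00B2"                           # "3²", i.e., 3 to the power 2

# Canonical normalization keeps the superscript two intact.
print(unicodedata.normalize("NFC", power))  # 3²

# Compatibility normalization folds it to the digit 2.
print(unicodedata.normalize("NFKC", power)) # 32
```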

In formats other than plain text, it is often possible and suitable to avoid compatibility characters by using markup or other tools. There is a document titled "Unicode in XML and other Markup Languages," at http://www.w3.org/TR/unicode-xml/, produced jointly by the World Wide Web Consortium (W3C) and the Unicode Consortium. It discusses characters with compatibility mappings: should they be used, or should the corresponding non-compatibility characters be used, perhaps with some markup and/or stylesheet that corresponds to the difference between them? The answers depend on the nature of the characters and the available markup and styling techniques. For example, for superscripts, the use of sup markup (as in HTML) is recommended, i.e., <sup>2</sup> is preferred over the superscript two character ² (and its representation as an entity, &sup2;). This is a debatable issue, partly because superscripting has two essentially different uses: semantic, as in mathematics, or stylistic, as in abbreviations like 1st for "first" or French Mlle for "mademoiselle." This will be discussed in more detail in Chapter 9.

In practice, compatibility characters are widely used in new Unicode data, too. Many of them work more reliably than the corresponding "normal" characters. For example, the micro sign belongs to ISO Latin 1 and therefore appears in almost any font used in the Western world, whereas the letter mu has less support. Existing software for processing measurement data may well recognize "µm" as denoting micrometer but fail to recognize "μm" (where the letter mu is used).

In using characters, it's often best to do what everyone else does. Suppose, for example, that you decide to use the letter mu instead of the micro sign as a unit prefix. If people open your document in a program and use the program's search function, the odds are that they type "µm" using the micro sign. (After all, it's often easier to write than the letter mu.) They would not find anything, unless the search function uses advanced techniques that handle compatibility mappings somehow.
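One such technique is to apply compatibility normalization to both the text and the search string before comparing them. A minimal sketch (the function name `nfkc_find` is just an illustrative choice, and plain NFKC folding ignores other distinctions, such as case, that a real search function would also need to consider):

```python
import unicodedata

def nfkc_find(haystack: str, needle: str) -> bool:
    """Substring search that ignores compatibility distinctions."""
    norm = lambda s: unicodedata.normalize("NFKC", s)
    return norm(needle) in norm(haystack)

text = "wavelength: 0.5 \u03BCm"   # letter mu in the document
query = "\u00B5m"                  # micro sign typed by the user

print(query in text)               # False - a naive search fails
print(nfkc_find(text, query))      # True  - NFKC folds both to the mu form
```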

5.3.7. Compatibility Characters for Ligatures

Some compatibility characters have compatibility decompositions consisting of two or more characters, so that it can be said that they represent a ligature of those characters. For example, Latin small ligature "fi" (U+FB01) has the obvious decomposition consisting of the letters "f" and "i". It is a distinct character in Unicode, but in the spirit of Unicode, we should not use it except for storing and transmitting existing data that contains the character. One practical reason for this is that most programs do not treat a ligature as matching the corresponding sequence of characters in comparisons, searches, etc.
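The matching problem is easy to reproduce: to a program, the ligature is a single character, unrelated to the two-letter sequence unless compatibility mappings are applied. A sketch:

```python
import unicodedata

word = "\uFB01le"                   # "ﬁle" written with the fi ligature

# A naive substring test does not see "fi" inside the ligature.
print("fi" in word)                 # False

# After compatibility normalization the ligature becomes "f" + "i".
print(unicodedata.normalize("NFKC", word))         # file
print("fi" in unicodedata.normalize("NFKC", word)) # True
```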

As mentioned in Chapter 4, Unicode has two control characters for affecting ligature behavior: zero-width joiner ZWJ (U+200D), which requests a ligature or cursive connection, and zero-width non-joiner ZWNJ (U+200C), which prevents one. Formally, ligature characters such as U+FB01 are not defined in a manner that involves a zero-width joiner. Instead, U+FB01 has a compatibility mapping to U+0066 U+0069 (i.e., "f" followed by "i"), although it might conceivably have been declared as having a mapping to U+0066 U+200D U+0069.
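This can be verified directly from the character database: the decomposition of U+FB01 contains only the two letters, with no joiner between them. A brief sketch:

```python
import unicodedata

# The fi ligature's decomposition: tagged <compat>, plain f + i, and no
# U+200D zero-width joiner anywhere in the mapping.
decomp = unicodedata.decomposition("\uFB01")
print(decomp)           # <compat> 0066 0069
print("200D" in decomp) # False
```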



Unicode Explained
ISBN: 059610121X
Year: 2006