4.6. Unicode and Fonts

One of the 10 design principles is that Unicode encodes characters, not glyphs. Thus, Unicode is not about fonts. Although proper presentation of some Unicode text requires a font that contains the characters actually used in the text, any such font will do. You can even use a mixture of fonts. The font selections have to be made outside Unicode.

4.6.1. Unicode as Plain Text

Unicode is basically for plain text, or text as such, without formatting features, structural indicators, or processing commands. Plain text can be characterized as a universal, simple, and portable data format. You can save text data as plain text and expect people to be able to read it after a hundred years, as long as the text is physically preserved. Would you bet on the format used by your favorite word processor to do the same, given the past experience with incompatible data formats?

This doesn't mean that Unicode wants everyone to use plain text. On the contrary, much of the work with Unicode has been pragmatically motivated by the advance of markup like XML as well as databases that store text in complex formats. Unicode is used more and more in data formats where characters and strings appear as constituents of higher-level constructs.

Unicode deals (in principle) exclusively with the plain text level of data representation, because it was designed to do just that. Some specification must do that, and it would be impractical to let each data format define its own idea of characters. It has turned out to be easier to manage complex things by dividing them into simpler parts, such as levels of data representation.

Plain text is not always quite plain. First, it usually has a division into lines. It may contain spaces, which are not always just plain separators between words but may serve formatting purposes, especially when consecutive spaces or fixed-width spaces are used. There are other deviations from the plain text principle, such as characters for tabbing or affecting ligature behavior. Moreover, some typographic variation can be encoded into the choice of a character.

4.6.2. Font Variants as Characters

Despite the "characters, not glyphs" principle, some Unicode characters are effectively variants of other characters, in the sense of font variation. For example, the character script capital "H" ℋ (U+210B) is equivalent compatibility-wise to Latin capital letter "H" (U+0048). The equivalence is defined using the notation <font> 0048. This means that it is a font variant of "H" in a sense, but in a rather abstract sense: no specific font is implied, just the general idea of using a script (handwriting) style.
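The <font> 0048 notation can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the variable names are only illustrative. It reads the decomposition field of U+210B and shows that compatibility normalization replaces the script variant with plain "H," whereas canonical normalization leaves it untouched.

```python
import unicodedata

script_h = "\u210B"  # SCRIPT CAPITAL H

# The decomposition field holds the <font> tag and the target code point.
decomposition = unicodedata.decomposition(script_h)
print(unicodedata.name(script_h), "->", decomposition)

# Compatibility normalization (NFKC) maps the font variant to plain "H".
compat = unicodedata.normalize("NFKC", script_h)

# Canonical normalization (NFC) does not touch compatibility characters.
canon = unicodedata.normalize("NFC", script_h)

print(repr(compat), repr(canon))
```

Note that compatibility normalization discards the "script style" information, so it is suitable for searching and comparison rather than for storing text whose appearance matters.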

In practice, programs do not select a glyph for the script capital "H" by picking up the glyph for capital "H" from some special handwriting-style font. Instead, they pick up a glyph for U+210B from a font that has a glyph in that position (such as Arial Unicode MS or some other large font).

Apart from such compatibility characters, there is no way to give any font information in Unicode. This is not a flaw but a conscious decision to handle different issues at different levels and in different standards. Plain text can be presented in any suitable font, but if you wish to change font in the midst of text, you are not using just plain text, and you need additional tools (see Chapter 9).

4.6.3. Variation Selectors

A relatively recent addition to the Unicode standard introduced the concept of a Variation Selector, which is an invisible character that is meant to affect the choice of glyph for the preceding character. Thus, a Variation Selector is comparable to font markup or to the choice of font in a word processor, though its effect is generic. It does not specify any particular font but rather the general characteristics of a glyph.

For example, the intersection (U+2229) is usually presented with a glyph without serifs (i.e., without short lines perpendicular to the two ends below), even if the glyph is from a serif font. You can explicitly request that a glyph variant with such serifs be used by inserting Variation Selector 1 (VS1) U+FE00 after the character U+2229.
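At the data level, requesting the variant just means appending one more character. The sketch below, which uses Python's standard unicodedata module with illustrative variable names, builds the two-character sequence and lists its parts. Whether the serifed glyph actually appears depends entirely on the rendering software and font.

```python
import unicodedata

intersection = "\u2229"  # INTERSECTION
vs1 = "\uFE00"           # VARIATION SELECTOR-1

# The variation selector follows the character whose glyph it modifies.
with_serifs = intersection + vs1

names = [unicodedata.name(ch) for ch in with_serifs]
for ch, name in zip(with_serifs, names):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")

# The sequence is two code points, even though at most one glyph is shown.
length = len(with_serifs)
```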

The currently available standard variants that can be requested for some characters are listed in http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html. They consist of variants of some mathematical operators and some Mongolian characters.

Support for Variation Selectors in current software is very limited. Programs might just treat them as unknown graphic characters, displaying some generic symbol.

4.6.4. Affecting Font Usage

The use of Unicode characters may indirectly affect font choices made by programs:

  • If the font chosen for text does not contain all the characters in the text, the program may decide to use some other font(s) as fallback. Therefore, some characters may appear in a font different from the surrounding text. A choice between characters that are compatibility equivalent, or even canonically equivalent, can be relevant in this sense. For example, Latin capital letter "A" with ring Å (U+00C5) is probably available in most fonts you might use. If you use the angstrom sign (U+212B), which is canonically equivalent to U+00C5, the appearance can be different, since this character appears in some fonts only. Consider this as a problem, not as a formatting tool!
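One way to avoid this problem is to normalize text before storing or displaying it. The following sketch uses Python's standard unicodedata module; the helper name canonically_equal is hypothetical. It shows that the two characters differ as code points but become identical under canonical normalization, which replaces the angstrom sign with the more widely supported U+00C5.

```python
import unicodedata

a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"  # ANGSTROM SIGN

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# As raw code points the two characters are different...
same_raw = a_ring == angstrom

# ...but NFC maps the angstrom sign to U+00C5.
normalized = unicodedata.normalize("NFC", angstrom)
same_after_nfc = canonically_equal(a_ring, angstrom)

print(same_raw, same_after_nfc, f"U+{ord(normalized):04X}")
```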

4.6.5. Ligatures

A ligature is a visible presentation of two or more characters as a unit. The origin of ligatures is in cursive handwriting, where characters are generally joined together. For printing, ligature types were produced to solve problems in both the visual appearance and the mechanics of printing. Text does not look good if, for example, the upper part of the "f" is close to the dot of the "i" as in "fi," so it can be better to create a type that fuses the two characters together.

In English, ligatures are normally used only for "fi," "fl," and "ffl," and usually in print only. Many other scripts use ligatures far more often.

Ligatures as discussed here should not be confused with characters that originate from ligatures. For example, Latin small letter "ae" æ (U+00E6) is an independent letter in Norwegian and Danish, although it is originally a ligature of "a" and "e" and is sometimes used just as a typographic variant of "ae" in English when writing Latin words. Unicode recognizes this character as a letter that is not decomposable into anything, although its old name used the phrase "ligature ae." In general, the word "ligature" in a character's name can be misleading.
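The claim that æ is not decomposable can be checked against the character database. This is a small sketch using Python's standard unicodedata module (variable names are illustrative): the decomposition field of U+00E6 is empty, and even compatibility normalization leaves the character as it is.

```python
import unicodedata

ae = "\u00E6"

name = unicodedata.name(ae)
decomposition = unicodedata.decomposition(ae)  # empty: no mapping defined
after_nfkd = unicodedata.normalize("NFKD", ae)

print(name, repr(decomposition), after_nfkd == ae)
```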

For most purposes, Unicode treats ligatures as belonging to typographic issues that are not addressed by the Unicode standard. A word containing "fi" may or may not be rendered using a ligature for this character pair, and this does not affect the way in which the word is represented as a sequence of Unicode characters. The ligature behavior can often be affected by the commands and tools of a page layout program. For example, a layout program may present some character combinations as ligatures by default. This is not always desirable: in computer code represented in a monospace font, a ligature "ﬁ" probably looks odd (compare ﬁcora.ﬁ with ficora.fi). You would need to use some program-specific command to prevent such behavior in general or for some selected text.

However, there are some constructs specifically related to ligatures in Unicode:

  • For compatibility with other standards, Unicode contains a few ligatures coded as characters. For example, there is Latin small ligature "fi" (U+FB01), which is defined as equivalent compatibility-wise to "f" followed by "i." Characters in the Alphabetic Presentation Forms block (U+FB00..U+FB4F) include ligatures for "ff," "fi," "fl," "ffi," and "ffl."

  • The character zero width non-joiner U+200C, abbreviated ZWNJ, specifically instructs that characters before and after it shall not be joined as a ligature or in a cursive (handwriting-style) connection.

  • The character zero width joiner U+200D, abbreviated ZWJ, specifically instructs that characters before and after it should be joined as a ligature or in a cursive manner.

The characters ZWNJ and ZWJ are effectively invisible control characters. They are meant to be used for exceptional overrides only. Do not confuse these characters with the word joiner (WJ) character, which relates to line breaking issues; see Chapter 5.
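The constructs listed above can be examined with Python's standard unicodedata module. The sketch below is illustrative only, and the helper strip_joiners is a hypothetical name. It takes the "fi" ligature character at U+FB01, which the character database names LATIN SMALL LIGATURE FI, and shows how compatibility normalization turns it into "f" and "i." It also shows that ZWNJ and ZWJ are format characters that add to the length of a string without being visible.

```python
import unicodedata

fi_ligature = "\uFB01"  # LATIN SMALL LIGATURE FI
ZWNJ = "\u200C"         # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"          # ZERO WIDTH JOINER

# The ligature character decomposes compatibility-wise into "f" + "i".
lig_decomposition = unicodedata.decomposition(fi_ligature)
expanded = unicodedata.normalize("NFKC", fi_ligature)

# A request not to form a ligature is an extra, invisible character.
no_ligature = "f" + ZWNJ + "i"
joiner_categories = [unicodedata.category(ch) for ch in (ZWNJ, ZWJ)]

def strip_joiners(text: str) -> str:
    """Remove ZWNJ and ZWJ, e.g., before a plain string comparison."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

print(lig_decomposition, expanded, len(no_ligature), joiner_categories)
print(strip_joiners(no_ligature) == "fi")
```

Stripping the joiners is reasonable for comparing Latin text, but in scripts where these characters are really needed, removing them may change how a word is displayed, so it should not be applied blindly.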

Support for ZWNJ and ZWJ is not common for most scripts. You should not expect to be able to produce, say, an "fi" ligature that way, though the ZWJ and ZWNJ characters may be effective in scripts where they are really needed, such as the Arabic script. In text formatting, ligatures should normally be generated on some other basis, such as program-specific commands or information on the language of the text and typographic conventions for the language.

In MS Word, you can use the Insert Symbol function to add special characters that allow or prevent a line break. However, Word uses ZWNJ and ZWJ for this, contrary to their meanings. If text written that way is processed with another application, you should check what happens to these characters.


4.6.6. Vowels as Marks

Several writing systems indicate vowels as marks attached to consonants rather than as separate letters. In Hebrew and Arabic writing, short vowels may be indicated that way, though they are mostly just omitted, to be inferred by the reader.

A different method is applied in writing systems called abugida (or sometimes alphasyllabary), such as the Devanagari (Devanāgarī) script used for Hindi. The idea is that a basic character alone denotes the consonant sound followed by an implied vowel, namely "a." Other combinations of a consonant and a vowel are indicated by attaching a special mark to the basic character.

For example, the Devanagari letter pa प (U+092A) and the Devanagari vowel sign uu ू (U+0942), when appearing in that order, combine into the appearance पू (read as "puu" or "pū"). If the next character is the Devanagari letter ra र (U+0930), it joins without any break: पूर. It depends on the implementation whether the rendering is achieved by using glyphs that join suitably or by mapping a sequence of characters into a single glyph that represents them as "melted together."
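Whatever the rendering looks like, the underlying text remains a sequence of separate characters. The following sketch, using Python's standard unicodedata module with illustrative variable names, builds the three-character string and lists each code point with its general category. The vowel sign is a nonspacing mark (Mn), whereas the consonants are letters (Lo).

```python
import unicodedata

pa = "\u092A"  # DEVANAGARI LETTER PA
uu = "\u0942"  # DEVANAGARI VOWEL SIGN UU
ra = "\u0930"  # DEVANAGARI LETTER RA

text = pa + uu + ra  # rendered joined together, stored as three characters

categories = [unicodedata.category(ch) for ch in text]
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```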

4.6.7. Operations on Glyphs

In Chapter 1, we described the simple idea of characters and glyphs: a character is an abstract entity, though with a general idea of what it looks like, whereas a glyph is a particular appearance of a character. This mental model needs to be broadened, since visible presentation of text may involve much more than just mapping each character to a glyph.

The use of ligatures in presenting texts can be described as glyph mapping. If some text contains the abstract characters "f" and "i" in succession, they are first mapped to glyphs for them, and then the adjacent glyphs might be mapped to a ligature glyph.

In some cases, the mapping could be performed at the character level. For example, software for printing text might first map the sequence of "f" and "i" to Latin small ligature "fi" (U+FB01), and then map this to a glyph. However, since combined glyphs do not generally have character equivalents, it is best to operate uniformly at the glyph level.
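The two-step idea (characters to glyphs, then adjacent glyphs to a ligature glyph) can be illustrated with a toy model. This is only a sketch under strong assumptions: the glyph names such as "f_i" and the LIGATURES table are hypothetical, each character is assumed to map to a glyph named after itself, and real font technologies use far richer substitution rules than this.

```python
# Hypothetical ligature table: sequences of glyph names -> ligature glyph name.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("f", "l"): "f_l",
}

def chars_to_glyphs(text: str) -> list[str]:
    """Step 1: map each character to a glyph (here, named after itself)."""
    return list(text)

def apply_ligatures(glyphs: list[str]) -> list[str]:
    """Step 2: replace adjacent glyphs by a ligature glyph, longest match first."""
    out = []
    i = 0
    while i < len(glyphs):
        for length in (3, 2):
            key = tuple(glyphs[i:i + length])
            if key in LIGATURES:
                out.append(LIGATURES[key])
                i += length
                break
        else:
            # No ligature starts at this position; keep the glyph as is.
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(chars_to_glyphs("find")))
print(apply_ligatures(chars_to_glyphs("office")))
```

The character string itself is never modified: "find" stays the four characters "f," "i," "n," "d," and only the glyph list changes.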

Glyph mapping may involve many other operations, such as:

  • Selecting between stylistically different glyphs ("aesthetic variants") for a character, using information expressed outside the plain text

  • Selecting an appropriate contextual variant of a glyph, e.g., for Arabic letters, which need to be shown in different glyphs depending on their position in a word

Operations on glyphs are beyond the scope of this book. To get a somewhat more detailed idea of them, in the context of OpenType fonts, see the web page "GSUB - The Glyph Substitution Table" at http://www.microsoft.com/typography/otspec/gsub.htm.

4.6.8. Unicode Versus Font Tricks

When you write text in Unicode, you can normally use any font available in the system. Some fonts are "Unicode fonts," some are not, but this refers to technicalities. Even if a font has been designed for rendering data that is represented in an 8-bit encoding, the software you use can probably handle the mappings internally, so that the font can be used for Unicode text as well.

Whether you can vary the font in your text depends on the tools and data formats you use. In plain text, there is no font variation, but word processors work with other formats. They usually have some simple tool that lets you, for example, select some words and set their font to something different from the surrounding text.

However, some special tricks have often been used in an attempt to extend character repertoire by font settings. In Chapter 3, we noted that you could type, on your word processor, the letters "abc" and then select them and use the font-changing command to set the font to Symbol to get "αβχ" (i.e., three Greek lowercase letters). We analyzed this from the viewpoint of character encoding, but here the emphasis is on comparing such tricks with the Unicode approach.

Logically, the Symbol font is a collection of mostly wrong glyphs for characters (e.g., an α glyph for "a"). Of course, the same trick works for Unicode text, too, unless the software you use refuses to perform the illogical move. After all, the Symbol font does not contain the letters "abc," so any request to use it for them should be ignored.

Anyway, using Unicode, such tricks are completely unnecessary and pointlessly risky. A change of font never changes the identity of characters, in the logical sense, so even if you see "αβχ," it's still "abc." This can be checked by changing the font to something else. There's no reason to take the slightest risk of having your data passed through some process that changes the font and distorts what you meant. In Unicode, you simply use the right characters, using some suitable input method. To help you in such a conversion, Appendix A contains a table of Unicode equivalents of Symbol font glyphs.
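Such a conversion amounts to a simple character-by-character replacement. The sketch below is deliberately incomplete: the table contains only the three pairs mentioned in this section ("a," "b," "c" to α, β, χ), and the names SYMBOL_TO_UNICODE and symbol_text_to_unicode are hypothetical. A real conversion would need the full table and should be applied only to text known to have been set in the Symbol font.

```python
# Partial, illustrative table: character typed -> character that the
# Symbol font glyph actually represents.
SYMBOL_TO_UNICODE = {
    "a": "\u03B1",  # GREEK SMALL LETTER ALPHA
    "b": "\u03B2",  # GREEK SMALL LETTER BETA
    "c": "\u03C7",  # GREEK SMALL LETTER CHI
}

_TABLE = str.maketrans(SYMBOL_TO_UNICODE)

def symbol_text_to_unicode(text: str) -> str:
    """Replace characters meant to be shown in Symbol by the right characters."""
    return text.translate(_TABLE)

converted = symbol_text_to_unicode("abc")
print(converted)
```

After the conversion, the text keeps its meaning in any font that contains the Greek letters, with no dependence on a font setting.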

This should not be confused with font changes needed to make some correctly entered characters visible. For example, if you use any of the methods described in Chapter 2 to enter the Greek letter alpha α, it might still fail to display properly. If the current font does not contain a glyph for alpha, you need to change the font locally (or globally) to something else, such as Arial Unicode MS, but any font containing the alpha will do.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139