8.11. Other Blocks

Some Unicode blocks of general interest are described here. For information on blocks that relate to a particular writing system or a specialized application area, please refer to the appropriate section in the Unicode standard. The overall effect of writing systems on character usage was discussed in Chapter 7.

8.11.1. Spacing Modifier Letters

Some characters in this block are "spacing clones" of diacritic marks. That is, they are defined as being compatibility equivalent to space U+0020 followed by a combining diacritic mark. However, this block includes quite a few other characters as well. They are mostly written after a letter, though some of them are actually used as independent letters, e.g., the different apostrophe-like characters that are used to transliterate the Arabic character hamza.

For example, the first of the characters in this block, modifier letter small "h" (U+02B0), is used to indicate aspiration of the preceding consonant in phonetic notations (e.g., in pronunciation instructions in encyclopedias). This character is a compatibility character, defined to be compatibility equivalent to the letter "h" in superscript style.

The results of using U+02B0 (from a font where it exists) and using "h" formatted in superscript style may differ, of course, especially since programs often implement superscripting simply by decreasing the size of a glyph and putting it in a higher position. Good font design tries to make the appearance better, perhaps modifying the shape to suit the needs of small-size rendering. Compare the following ways of denoting an aspirated pronunciation of "k," using first U+02B0, and then "h" as a superscript: kʰ kh (the second "h" is an ordinary "h" formatted as a superscript).

8.11.2. Currency Symbols

Currencies can be denoted in several ways: words, currency symbol characters, or various abbreviations or codes. The optimal choice depends on the context and intentions.
When uniqueness, definiteness, and internationality (in the sense of neutrality with respect to national languages) are essential, the three-letter codes defined in ISO 4217 should be used, e.g., "GBP 42." In localized notations, the formats vary (e.g., "£42" versus "42 £"), and so do currency names, of course. The Common Locale Data Repository, described in Chapter 11, contains extensive information on such localized formats.

Currency symbol characters (general category Sc) appear in different blocks in Unicode:
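One way to see how widely the currency symbols are scattered is to ask a character database for every character whose general category is Sc. The following is a minimal sketch that uses Python's standard unicodedata module; the function name currency_symbols is made up for this illustration, and the exact result depends on the Unicode version built into the Python interpreter in use.

```python
import sys
import unicodedata


def currency_symbols():
    """Return (code point, name) pairs for all characters in category Sc."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Sc":
            # Every assigned Sc character has a name; the default is a safeguard.
            result.append((cp, unicodedata.name(ch, "<unnamed>")))
    return result


if __name__ == "__main__":
    # Print code points and names only, so the output is plain ASCII.
    for cp, name in currency_symbols():
        print(f"U+{cp:04X}  {name}")
```

The code points in the output fall into several ranges, which reflects the fact that the symbols live in different blocks rather than in the Currency Symbols block alone.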
8.11.3. Phonetic Characters

Phonetic characters are used in writing systems that indicate the pronunciation. The most widely known and used phonetic alphabet is the International Phonetic Alphabet (IPA). Originally designed for use in linguistics, IPA is also used in language teaching and in pronunciation instructions in dictionaries and encyclopedias, though in English material, other pronunciation notations are more common. In developing writing systems for languages that previously existed in spoken form only, some IPA characters are often used along with normal Latin letters.

IPA is a fairly old alphabet and was originally defined by indicating the visible shapes of characters only. For computer applications, the characters had to be defined more exactly. Some characters were identified with normal Latin (lowercase) letters, such as "b." Some were identified with other characters that are used in normal writing too, such as æ (which belongs to the Latin-1 Supplement). But most IPA characters were separately coded in the IPA Extensions block.

No writing system can accurately describe all details of spoken language. Even IPA notations are just approximations. Moreover, they are approximations of different degrees. Simple IPA writing can be used, for example, in dictionaries, whereas transcription of speech in linguistics uses more exact descriptions, using diacritic marks to indicate nuances.

The needs of IPA transcription differ from the conventions of general purpose typesetting. This is not surprising, since IPA attempts a precision of phonetic representation that is well beyond that of the normal alphabet of any Latin script language. For this purpose, IPA uses diacritic marks, but it also assigns a distinctive meaning to forms that in general purpose typography are considered purely stylistic variants of the same letter.
The most obvious case is that IPA includes both the common (ASCII) letter "a" and a variant of the letter "a" that denotes a vowel of different quality. The latter letter is oddly named: Latin small letter alpha ɑ (U+0251). For such reasons, IPA characters do not follow typical typographic conventions in the distinction between roman and italic styles. In simple terms, an italic IPA font needs to be something akin to an oblique version of roman, rather than a distinct style of lettering. Thus, IPA involves a specialized technical kind of typesetting, not very different from, for example, mathematical typesetting in the way that it assigns distinct meaning to stylistic variants of letterforms.

Many characters in the IPA Extensions block are turned or otherwise modified versions of Latin letters. For example, the Latin small letter schwa ə (U+0259) originated as a rotated image of "e." It denotes the neutral and often reduced vowel that is so common in English, and it is present in some fonts that do not otherwise contain IPA Extensions.

For example, the standard British pronunciation of the English word "international" can be written in IPA as ɪntəˈnæʃənəl. The character "n" there, for example, is the common Latin letter, whereas the character æ is the same Latin small letter "ae" (U+00E6) as used, for example, in Danish and Old English. Some other characters, such as the schwa, are from the IPA Extensions block. The IPA stress mark, modifier letter vertical line ˈ (U+02C8), is not common in fonts, and often other characters such as the (ASCII) apostrophe ' (U+0027) are used instead.

The official description of IPA is available at the site http://www.arts.gla.ac.uk/ipa/ipachart.html. Since characters used in IPA appear in different blocks in Unicode, you may find the following document useful: "The International Phonetic Alphabet in Unicode," http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.
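The spread of IPA characters across blocks can be made concrete with the transcription of "international" given above. The following sketch looks up the name of each character with Python's standard unicodedata module and maps it to a block. Since that module does not expose block names, the four block ranges are typed in by hand here; the names BLOCKS, block_of, and TRANSCRIPTION are chosen just for this illustration.

```python
import unicodedata

# Ranges of the four blocks that the sample transcription draws from.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
]

# The IPA transcription of "international", spelled with escapes so that
# the code points are visible: small capital I, n, t, schwa, stress mark,
# n, ae, esh, schwa, n, schwa, l.
TRANSCRIPTION = (
    "\u026A" "nt" "\u0259" "\u02C8" "n" "\u00E6" "\u0283" "\u0259" "n" "\u0259" "l"
)


def block_of(ch):
    """Return the name of the block containing ch, if it is one of BLOCKS."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(other block)"


if __name__ == "__main__":
    for ch in TRANSCRIPTION:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch):32}  {block_of(ch)}")
```

A single short word thus mixes characters from four blocks, which is why a font that covers only the IPA Extensions block is not enough for rendering IPA.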
Due to the heavy use of diacritic marks, IPA transcription often requires implementations that support combining diacritic marks, since most of the combinations needed do not appear as precomposed characters in Unicode. However, for simple usage of IPA, relatively simple implementations of such marks are tolerable. For simple IPA, the Arial Unicode MS font is sufficient and suitable. For more advanced purposes, you may wish to use the Doulos SIL font, available from http://scripts.sil.org.

In addition to IPA, there are other phonetic writing systems. One of them, the Uralic Phonetic Alphabet (UPA), has been included in Unicode. The added characters are in the Phonetic Extensions block.

8.11.4. Specials

The Specials block contains just a few code positions, U+FFF0 through U+FFFF, and they are indeed special:
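The contents of the sixteen code positions can be inspected with a character database. The following is a small sketch using Python's standard unicodedata module; the function name describe_specials is invented for this example, and "<no name>" is simply a placeholder for code positions that have no character name (unassigned positions and noncharacters, general category Cn).

```python
import unicodedata


def describe_specials():
    """Return (code point, general category, name) for U+FFF0..U+FFFF."""
    rows = []
    for cp in range(0xFFF0, 0x10000):
        ch = chr(cp)
        rows.append(
            (cp, unicodedata.category(ch), unicodedata.name(ch, "<no name>"))
        )
    return rows


if __name__ == "__main__":
    for cp, category, name in describe_specials():
        print(f"U+{cp:04X}  {category}  {name}")
```

The output shows the interlinear annotation characters (category Cf), the object replacement character and the replacement character (category So), and code positions without any character assigned.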
Interlinear annotation characters are invisible indicators (control characters, in a sense) that separate interlinear annotations from normal text. "Interlinear" means "between the lines" and refers to information presented between normal lines in a small font. Interlinear annotations, called ruby or furigana, are typically used in Japanese books for children or for foreigners studying Japanese, and they usually show the pronunciation of words. The name "ruby" is originally the name of a font size.

Although interlinear annotations primarily relate to East Asian languages, they might conceivably be used for other purposes as well. They could be used to indicate the pronunciation of foreign words in English text, or to add an editor's or translator's short notes, or even to create documents with lyrics and guitar chords so that the chords are displayed above the respective text. However, software that supports interlinear annotations may do so in a manner designed for annotations of East Asian texts, e.g., using a very small font by default.

Figure 8-3. Display of interlinear annotations on IE 6; the first alternative uses Ruby markup, the second tries to use interlinear annotation characters in Unicode

Interlinear annotations are best described at higher protocol levels, such as the markup elements in the Ruby module of XHTML. The Ruby module belongs to XHTML 1.1 and has had some limited support in Internet Explorer (IE) since Version 6.

The interlinear annotation characters in Unicode are of rather limited usefulness. Very few programs support them. When they are not supported, something odd may appear in their place, and the annotations appear inline as normal text. However, the characters might conceivably be used if you need to represent the annotations in plain text format and you have (or you can create) software that supports them. The characters are:
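As a sketch of what such software might do, the following Python code looks up the three interlinear annotation characters by their Unicode names, wraps a base text and its annotation in them, and strips the annotations again for programs that cannot display them. The function names annotate and strip_annotations are hypothetical, and the stripping policy (keep the base text, drop the annotation) is just one possible choice.

```python
import unicodedata

# The three characters, looked up by their formal Unicode names.
ANCHOR = unicodedata.lookup("INTERLINEAR ANNOTATION ANCHOR")          # U+FFF9
SEPARATOR = unicodedata.lookup("INTERLINEAR ANNOTATION SEPARATOR")    # U+FFFA
TERMINATOR = unicodedata.lookup("INTERLINEAR ANNOTATION TERMINATOR")  # U+FFFB


def annotate(base, annotation):
    """Return base text with an interlinear annotation, as plain text."""
    return f"{ANCHOR}{base}{SEPARATOR}{annotation}{TERMINATOR}"


def strip_annotations(text):
    """Remove the annotations and the control characters, keeping base text."""
    out = []
    in_annotation = False
    for ch in text:
        if ch == ANCHOR:
            continue                 # start of annotated text: drop the marker
        elif ch == SEPARATOR:
            in_annotation = True     # what follows is annotation text
        elif ch == TERMINATOR:
            in_annotation = False    # back to normal text
        elif not in_annotation:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "My first name is " + annotate("Jukka", "Yook-kah") + "."
    print(ascii(text))               # shows the control characters as escapes
    print(strip_annotations(text))
```

Stripping before display avoids the odd rendering described above, at the cost of losing the annotation itself.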
The following piece of XHTML markup uses first Ruby markup, then interlinear annotation characters (via character references) to add information about the pronunciation of a name. The markup method is the one that has the best chance of working. This is illustrated in Figure 8-3:

    <p>My first name is <ruby><rb>Jukka</rb><rt>Yook-kah</rt></ruby>.</p>
    <p>My first name is &#xFFF9;Jukka&#xFFFA;Yook-kah&#xFFFB;.</p>

8.11.5. Dingbats

Dingbats are essentially graphics coded as characters. One might say that the meaning of a dingbat is its graphic appearance. This makes dingbats rather special. On the other hand, in practice, some of the dingbats have a fairly well-defined logical meaning, and putting them into this block has been a rather arbitrary decision.

Dingbats have traditionally been used by switching to a special font. This means that the data is typically in an 8-bit encoding, but through a font change, characters are visually turned into something quite different. Thus, you could type the letter "a," then change the font to a special one, and get a check mark (U+2713). However, this is not the Unicode way.

Neither this block nor Unicode in general contains all the graphics that have been implemented in different specialized fonts. For example, corporate logos are excluded. Many of the symbols in the Wingdings fonts commonly available on computers have not been coded as characters in Unicode at all.

8.11.6. Summary of Blocks

Table 8-14 lists all blocks as defined in Unicode Version 4.1 and planned for Version 5.0. The up-to-date summary information on blocks is in the file Blocks.txt in the Unicode character database, available online at http://www.unicode.org. Many blocks correspond more or less directly to some specific scripts (writing systems) discussed in Chapter 7.
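The Blocks.txt file is plain text in which each data line has the form "start..end; block name" and comments start with "#". The following is a minimal sketch of reading that format in Python. The SAMPLE text is a short hand-typed excerpt in that format, not the actual file, and the names parse_blocks and block_name are invented for this example; for real use, the complete file would be read instead.

```python
import re

# A few lines in the Blocks.txt format, typed by hand for illustration.
SAMPLE = """\
# start..end; block name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
20A0..20CF; Currency Symbols
2700..27BF; Dingbats
FFF0..FFFF; Specials
"""

LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")


def parse_blocks(text):
    """Parse Blocks.txt-style text into (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        match = LINE.match(line)
        if match:
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name))
    return blocks


def block_name(cp, blocks):
    """Return the block containing code point cp, or "No_Block" if none."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return "No_Block"


if __name__ == "__main__":
    blocks = parse_blocks(SAMPLE)
    for cp in (0x0041, 0x0259, 0x20AC, 0x2713, 0xFFFD):
        print(f"U+{cp:04X}  {block_name(cp, blocks)}")
```

Because blocks are contiguous, non-overlapping ranges, a linear scan suffices for a sketch like this; with the full file, a binary search over the sorted start values would be a natural refinement.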