8.11. Other Blocks

Some Unicode blocks of general interest are described here. For information on blocks that relate to a particular writing system or a specialized application area, please refer to the appropriate section in the Unicode standard. The overall effect of writing systems on character usage was discussed in Chapter 7.

8.11.1. Spacing Modifier Letters

Some characters in this block are "spacing clones" of diacritic marks. That is, they are defined as compatibility equivalent to the space character U+0020 followed by a combining diacritic mark. However, this block includes quite a few other characters as well. Most of them are written after a letter, though some are actually used as independent letters, e.g., the different apostrophe-like characters that are used to transliterate the Arabic character hamza.

For example, the first of the characters in this block, modifier letter small "h" (U+02B0), is used to indicate aspiration of the preceding consonant in phonetic notations (e.g., in pronunciation instructions in encyclopedias). This character is a compatibility character, which is defined to be compatibility equivalent to letter "h" in superscript style. The results of using U+02B0 (from a font where it exists) and using "h" formatted in superscript style may differ, of course, especially since programs often implement superscripting simply by decreasing the size of a glyph and putting it in a higher position. Good font design tries to make the appearance better, perhaps modifying the shape to suit the needs of small-size rendering. Compare the following ways of denoting an aspirated pronunciation of "k," using first U+02B0, and then "h" as a superscript: kʰ k^h.
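
The compatibility mappings described above can be inspected programmatically. The following is a minimal sketch that assumes Python and its standard unicodedata module; the variable names are illustrative only. It shows the mapping of U+02B0 to a superscript "h," the mapping of one spacing clone (breve, U+02D8) to a space followed by a combining mark, and the effect of compatibility normalization.

```python
import unicodedata

# U+02B0 MODIFIER LETTER SMALL H has a compatibility mapping to a superscript "h".
aspiration = "\u02b0"
h_decomposition = unicodedata.decomposition(aspiration)      # "<super> 0068"

# U+02D8 BREVE is a spacing clone: SPACE followed by COMBINING BREVE.
breve_decomposition = unicodedata.decomposition("\u02d8")    # "<compat> 0020 0306"

# Compatibility normalization (NFKC) replaces the modifier letter with a plain "h",
# so the superscript information is lost. Canonical normalization (NFC) keeps it.
aspirated_k = "k" + aspiration
nfkc_form = unicodedata.normalize("NFKC", aspirated_k)       # "kh"
nfc_form = unicodedata.normalize("NFC", aspirated_k)         # unchanged

print(h_decomposition)
print(breve_decomposition)
print(nfkc_form, nfc_form == aspirated_k)
```

The loss of the distinction under NFKC is one reason to apply compatibility normalization to phonetic text only with care.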

8.11.2. Currency Symbols

Currencies can be denoted in several ways: words, currency symbol characters, or various abbreviations or codes. The optimal choice depends on the context and intentions. When uniqueness, definiteness, and internationality (as neutrality with respect to national languages) are essential, the three-letter codes defined in ISO 4217 should be used, e.g., "GBP 42." In localized notations, the formats vary (e.g., "£42" versus "42 £"), and so do currency names, of course. The Common Locale Data Repository, described in Chapter 11, contains extensive information on such localized formats. Currency symbol characters (general category Sc) appear in different blocks in Unicode:

The dollar sign $ is in the Basic Latin block.
The cent sign ¢, the pound sign £, the currency sign ¤, and the yen sign ¥ are in the Latin-1 Supplement block.
There are several currency symbols in script-specific blocks, such as the Thai currency symbol baht ฿ (U+0E3F) in the Thai block.
Other currency symbols are in the Currency Symbols block, U+20A0..U+20CF. It includes important symbols such as the euro sign € (U+20AC) but also some characters that are historical only, such as the French franc sign ₣ (U+20A3). The euro-currency sign ₠ (U+20A0) is not even historical but only a symbol that was once planned and allocated, and it has not been removed, due to Unicode principles.
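
The list above can be checked against the character database. The following is a small sketch, assuming Python and its unicodedata module; the helper names is_currency_symbol and in_currency_symbols_block are made up for this illustration. It tests the general category of each symbol mentioned and whether the symbol lies in the Currency Symbols block.

```python
import unicodedata

# The currency symbols mentioned in the list above, by code point.
symbols = {
    "dollar sign": "\u0024",
    "cent sign": "\u00a2",
    "pound sign": "\u00a3",
    "currency sign": "\u00a4",
    "yen sign": "\u00a5",
    "Thai currency symbol baht": "\u0e3f",
    "euro-currency sign": "\u20a0",
    "French franc sign": "\u20a3",
    "euro sign": "\u20ac",
}

def is_currency_symbol(ch):
    """True if the general category of ch is Sc (Symbol, currency)."""
    return unicodedata.category(ch) == "Sc"

def in_currency_symbols_block(ch):
    """True if ch lies in the Currency Symbols block, U+20A0..U+20CF."""
    return 0x20A0 <= ord(ch) <= 0x20CF

for label, ch in symbols.items():
    print(f"U+{ord(ch):04X} {label}: "
          f"Sc={is_currency_symbol(ch)}, in block={in_currency_symbols_block(ch)}")
```

As the output suggests, testing the general category is a more reliable way to recognize a currency symbol than testing block membership.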

8.11.3. Phonetic Characters

Phonetic characters are used in writing systems that indicate the pronunciation. The most widely known and used phonetic alphabet is the International Phonetic Alphabet (IPA). Originally designed for use in linguistics, IPA is also used in language teaching and in pronunciation instructions in dictionaries and encyclopedias, though in English material, other pronunciation notations are more common. In developing writing systems for languages that previously existed in spoken form only, some IPA characters are often used along with normal Latin letters.

IPA is a fairly old alphabet and was originally defined by indicating the visible shapes of characters only. For computer applications, the characters had to be defined more exactly. Some characters were identified with normal Latin (lowercase) letters, such as "b." Some were identified with other characters that are used in normal writing too, such as æ (which belongs to the Latin-1 Supplement). But most IPA characters were separately coded in the IPA Extensions block.

No writing system can accurately describe all details of spoken language. Even IPA notations are just approximations. Moreover, they are approximations of different degrees. Simple IPA writing can be used, for example, in dictionaries, whereas transcription of speech in linguistics uses more exact descriptions, using diacritic marks to indicate nuances.

The needs of IPA transcription differ from conventions of general purpose typesetting. This is not surprising, since IPA attempts a precision of phonetic representation that is well beyond that of the normal alphabet of any Latin script language. For this purpose, IPA uses diacritic marks, but it also assigns a distinctive meaning to forms that in general purpose typography are considered purely stylistic variants of the same letter. The most obvious case is that IPA includes both the common (ASCII) letter "a" and a variant of the letter "a" that denotes a vowel of different quality. The latter letter is oddly named: Latin small letter alpha ɑ (U+0251).

For such reasons, IPA characters do not follow typical typographic conventions in the distinction between roman and italic styles. In simple terms, an italic IPA font needs to be something akin to an oblique version of roman, rather than a distinct style of lettering. Thus, IPA involves a specialized technical kind of typesetting, not very different from, for example, mathematical typesetting in the way that it assigns distinct meaning to stylistic variants of letterforms.

Many characters in the IPA Extensions block are turned or otherwise modified versions of Latin letters. For example, the Latin small letter schwa ə (U+0259) is originally a rotated image of "e." It denotes the neutral and often reduced vowel that is so common in English, and it is present in some fonts that do not otherwise contain IPA Extensions.

For example, the standard British pronunciation of the English word "international" can be written in IPA as ɪntəˈnæʃənəl. The character "n" there is the common Latin letter, whereas the character æ is the same Latin small letter "ae" (U+00E6) as used in Danish and Old English, among others. Some other characters, such as the schwa, are from the IPA Extensions block. The IPA stress mark, modifier letter vertical line ˈ (U+02C8), is not common in fonts, and often other characters such as the (ASCII) apostrophe ' (U+0027) are used instead.
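
To see how a single transcription draws on several blocks, the characters of the example word can be listed with their names and blocks. This is only a sketch, assuming Python; the BLOCKS list is a hand-copied excerpt of the block ranges given later in this section, and block_of is a hypothetical helper, not a library function.

```python
import unicodedata

# Hand-copied excerpt of block ranges; only the blocks needed for this word.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
]

def block_of(ch):
    """Return the block name for ch, using the small excerpt above."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "(not in this excerpt)"

# The IPA transcription of "international" from the text, written with
# escapes so that the code points are unambiguous.
word = ("\u026a" "nt" "\u0259" "\u02c8" "n" "\u00e6"
        "\u0283" "\u0259" "n" "\u0259" "l")

for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} [{block_of(ch)}]")

blocks_used = {block_of(ch) for ch in word}
```

For this one word, blocks_used contains four different blocks, which illustrates why IPA support cannot be judged by looking at the IPA Extensions block alone.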

The official description of IPA is available at the site http://www.arts.gla.ac.uk/ipa/ipachart.html. Since characters used in IPA appear in different blocks in Unicode, you may find the following document useful: "The International Phonetic Alphabet in Unicode," http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.

Due to the heavy use of diacritic marks, IPA transcription often requires implementations that support combining diacritic marks, since most of the combinations needed do not appear as precomposed characters in Unicode. However, for simple usage of IPA, relatively simple implementations of such marks are tolerable.
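
The lack of precomposed characters can be demonstrated with normalization. The following sketch assumes Python and its unicodedata module; compose is a hypothetical helper. It applies the combining tilde, which IPA uses for nasalization, to the ordinary letter "a" and to the IPA letter open o (U+0254).

```python
import unicodedata

COMBINING_TILDE = "\u0303"

def compose(base, mark):
    """Apply canonical composition (NFC) to a base letter plus a combining mark."""
    return unicodedata.normalize("NFC", base + mark)

# "a" with tilde exists as the precomposed character U+00E3.
nasal_a = compose("a", COMBINING_TILDE)

# Open o with tilde has no precomposed form, so two code points remain.
nasal_open_o = compose("\u0254", COMBINING_TILDE)

for text in (nasal_a, nasal_open_o):
    print(len(text), [f"U+{ord(ch):04X}" for ch in text])
```

In the second case, correct display depends entirely on the rendering software placing the combining mark properly.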

For simple IPA, the Arial Unicode MS font is sufficient and suitable. For more advanced purposes, you may wish to use the Doulos SIL font, available from http://scripts.sil.org.

In addition to IPA, there are other phonetic writing systems. One of them, the Uralic Phonetic Alphabet (UPA), has been included in Unicode. The added characters are in the Phonetic Extensions block.

8.11.4. Specials

The Specials block contains just a few code positions, U+FFF0 through U+FFFF, and they are indeed special:

U+FFF0 through U+FFF8 are unassigned (reserved for eventual future use).
U+FFF9 through U+FFFB are interlinear annotation characters, explained below.
U+FFFC is an object replacement character, which is an invisible placeholder for a nontextual object, such as an image, to be inserted (by some external tools). In code charts, a dashed box containing the letters OBJ appears in place of this character.
U+FFFD is a replacement character, for use in data converted from a code other than Unicode, to indicate a character that has no Unicode counterpart. This is somewhat similar to U+001A (substitute, Control-Z) in the ASCII range. However, U+FFFD has a visible shape � (although it appears in a few fonts only). In the Java programming language, U+FFFD has traditionally been used to indicate Not a Number (NaN), i.e., the undefined result of a mathematical operation; this does not comply with the meaning of U+FFFD in Unicode.
U+FFFE and U+FFFF are noncharacters, i.e., code positions that do not and will not ever represent any characters. They can be used as sentinels or for checking purposes. Any occurrence of these code points in character data (i.e., in data being interpreted as characters) indicates an error of some kind.
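
The division of the Specials block described in the list above can be expressed as a simple classification. This is a sketch only, assuming Python; classify_special and contains_noncharacter are hypothetical helper names, and the categories follow the list above rather than any library. The last lines show a related use of U+FFFD: Python's decoders insert it when asked to replace bytes that they cannot convert.

```python
def classify_special(cp):
    """Describe a code point in the Specials block, U+FFF0..U+FFFF."""
    if not 0xFFF0 <= cp <= 0xFFFF:
        raise ValueError("code point is not in the Specials block")
    if cp <= 0xFFF8:
        return "unassigned"
    if cp <= 0xFFFB:
        return "interlinear annotation character"
    if cp == 0xFFFC:
        return "object replacement character"
    if cp == 0xFFFD:
        return "replacement character"
    return "noncharacter"  # U+FFFE and U+FFFF

def contains_noncharacter(text):
    """True if text contains U+FFFE or U+FFFF, which signals an error."""
    return any(ord(ch) in (0xFFFE, 0xFFFF) for ch in text)

# Byte 0xE9 is not valid ASCII; with errors="replace" the decoder
# substitutes U+FFFD for it instead of raising an exception.
decoded = b"caf\xe9".decode("ascii", errors="replace")
print(decoded == "caf\ufffd")
```

A check such as contains_noncharacter would typically be applied to incoming data before further processing.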

Interlinear annotation characters are invisible indicators (control characters, in a sense) that separate interlinear annotations from normal text. "Interlinear" means "between the lines" and refers to information presented between normal lines in a small font. Interlinear annotations, called ruby or furigana, are typically used in Japanese books for children or for foreigners studying Japanese, and they usually show the pronunciation of words. The name "ruby" is originally the name of a font size.

Although interlinear annotations primarily relate to East Asian languages, they might conceivably be used for other purposes as well. They could be used to indicate the pronunciation of foreign words in English text, or to add an editor's or translator's short notes, or even to create documents with lyrics and guitar chords so that the chords will be displayed above the respective text. However, software that supports interlinear annotations may do so in a manner designed for annotations of East Asian texts, e.g., using a very small font by default.

Figure 8-3. Display of interlinear annotations on IE 6; the first alternative uses Ruby markup, the second tries to use interlinear annotation characters in Unicode

Interlinear annotations are best described at higher protocol levels, such as the markup elements in the Ruby module of XHTML. The Ruby module belongs to XHTML 1.1 and has some limited support in Internet Explorer (IE) since Version 6.

The interlinear annotation characters in Unicode are of rather limited usefulness. Very few programs support them. When they are not supported, something odd may appear in their place, and the annotations would appear in normal text. However, the characters might conceivably be used if you need to represent the annotations in plain text format and you have (or you can create) software that supports them. The characters are:

U+FFF9 interlinear annotation anchor indicates the start of normal text that has an annotation attached to it; corresponds to markup <rb> in Ruby in XHTML.
U+FFFA interlinear annotation separator indicates the end of the text being annotated and the start of the annotation; corresponds to </rb><rt> in Ruby.
U+FFFB interlinear annotation terminator ends the annotation, so that subsequent characters will be taken as normal text; corresponds to </rt> in Ruby.

The following piece of XHTML markup uses first Ruby markup, then interlinear annotation characters (via character references) to add information about the pronunciation of a name. The markup method is the one that has the best chance of working. This is illustrated in Figure 8-3:

 <p>My first name is <ruby><rb>Jukka</rb><rt>Yook-kah</rt></ruby>.</p>
 <p>My first name is &#xfff9;Jukka&#xfffa;Yook-kah&#xfffb;.</p>
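
Since the three annotation characters correspond to Ruby markup as listed above, plain text that contains them can be converted mechanically. The following is a minimal sketch, assuming Python; the function annotations_to_ruby is hypothetical, and it handles only simple, non-nested annotations with a single annotation per base text.

```python
import re
from html import escape

ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

# anchor, base text, separator, annotation text, terminator
_ANNOTATION = re.compile(
    f"{ANCHOR}(.*?){SEPARATOR}(.*?){TERMINATOR}", re.DOTALL
)

def annotations_to_ruby(text):
    """Convert interlinear annotation characters to Ruby markup.

    Text outside and inside annotations is escaped for use in XHTML.
    """
    parts = []
    pos = 0
    for match in _ANNOTATION.finditer(text):
        base, annotation = match.groups()
        parts.append(escape(text[pos:match.start()], quote=False))
        parts.append(
            f"<ruby><rb>{escape(base, quote=False)}</rb>"
            f"<rt>{escape(annotation, quote=False)}</rt></ruby>"
        )
        pos = match.end()
    parts.append(escape(text[pos:], quote=False))
    return "".join(parts)

plain = f"My first name is {ANCHOR}Jukka{SEPARATOR}Yook-kah{TERMINATOR}."
print(annotations_to_ruby(plain))
```

A conversion of this kind lets the annotations be stored in plain text and still be presented through markup, which has the better chance of working.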

8.11.5. Dingbats

Dingbats are essentially graphics coded as characters. One might say that the meaning of a dingbat is its graphic appearance. This makes dingbats rather special. On the other hand, in practice, some of the dingbats have a fairly well-defined logical meaning, and putting them into this block has been a rather arbitrary decision.

Traditionally, dingbats are used by switching to a special font. This means that the data is typically in an 8-bit encoding, but through a font change, characters are visually turned into something quite different. Thus, you could type the letter "a," then change the font to a special one, and get a check mark (U+2713). However, this is not the Unicode way.
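
The difference between the two approaches is visible in the character data itself. The following short sketch, which assumes Python and its unicodedata module, compares what is stored when a dingbat font is merely applied to the letter "a" with what is stored when the check mark is entered as a coded character.

```python
import unicodedata

typed_with_font_trick = "a"   # the data still contains the letter "a"
coded_dingbat = "\u2713"      # the data contains the check mark itself

for ch in (typed_with_font_trick, coded_dingbat):
    in_dingbats_block = 0x2700 <= ord(ch) <= 0x27BF
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"category={unicodedata.category(ch)} dingbat={in_dingbats_block}")
```

With the font trick, any program that ignores the font information, such as a search tool, sees only the letter "a."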

In general, this block does not contain all the graphics that have been implemented in different specialized fonts. For example, corporate logos are excluded. Many of the symbols in the Wingdings fonts commonly available on computers have not been coded as characters in Unicode at all.

8.11.6. Summary of Blocks

Table 8-14 lists all blocks as defined in Unicode Version 4.1 and planned for Version 5.0. The up-to-date summary information on blocks is in the file Blocks.txt in the Unicode character database, available online at http://www.unicode.org. Many blocks correspond more or less directly to some specific scripts (writing systems) discussed in Chapter 7.

Table 8-14. Unicode 4.1 blocks
Code range	Name of block	Notes
0000..007F	Basic Latin	ASCII
0080..00FF	Latin-1 Supplement	Upper half of Latin 1
0100..017F	Latin Extended-A
0180..024F	Latin Extended-B
0250..02AF	IPA Extensions	Phonetic symbols
02B0..02FF	Spacing Modifier Letters
0300..036F	Combining Diacritical Marks
0370..03FF	Greek and Coptic
0400..04FF	Cyrillic
0500..052F	Cyrillic Supplement
0530..058F	Armenian
0590..05FF	Hebrew
0600..06FF	Arabic
0700..074F	Syriac
0750..077F	Arabic Supplement
0780..07BF	Thaana
07C0..07FF	NKo	Proposed (Unicode 5.0)
0900..097F	Devanagari	For Indic languages
0980..09FF	Bengali
0A00..0A7F	Gurmukhi
0A80..0AFF	Gujarati
0B00..0B7F	Oriya
0B80..0BFF	Tamil
0C00..0C7F	Telugu
0C80..0CFF	Kannada
0D00..0D7F	Malayalam
0D80..0DFF	Sinhala
0E00..0E7F	Thai
0E80..0EFF	Lao
0F00..0FFF	Tibetan
1000..109F	Myanmar
10A0..10FF	Georgian
1100..11FF	Hangul Jamo
1200..137F	Ethiopic
1380..139F	Ethiopic Supplement
13A0..13FF	Cherokee
1400..167F	Unified Canadian Aboriginal Syllabics
1680..169F	Ogham
16A0..16FF	Runic
1700..171F	Tagalog
1720..173F	Hanunoo
1740..175F	Buhid
1760..177F	Tagbanwa
1780..17FF	Khmer
1800..18AF	Mongolian
1900..194F	Limbu
1950..197F	Tai Le
1980..19DF	New Tai Lue
19E0..19FF	Khmer Symbols
1A00..1A1F	Buginese
1B00..1B7F	Balinese	Proposed (Unicode 5.0)
1D00..1D7F	Phonetic Extensions	Mostly for UPA
1D80..1DBF	Phonetic Extensions Supplement
1DC0..1DFF	Combining Diacritical Marks Supplement
1E00..1EFF	Latin Extended Additional
1F00..1FFF	Greek Extended
2000..206F	General Punctuation
2070..209F	Superscripts and Subscripts
20A0..20CF	Currency Symbols
20D0..20FF	Combining Diacritical Marks for Symbols
2100..214F	Letterlike Symbols
2150..218F	Number Forms
2190..21FF	Arrows
2200..22FF	Mathematical Operators
2300..23FF	Miscellaneous Technical
2400..243F	Control Pictures
2440..245F	Optical Character Recognition
2460..24FF	Enclosed Alphanumerics
2500..257F	Box Drawing
2580..259F	Block Elements
25A0..25FF	Geometric Shapes
2600..26FF	Miscellaneous Symbols
2700..27BF	Dingbats
27C0..27EF	Miscellaneous Mathematical Symbols-A
27F0..27FF	Supplemental Arrows-A
2800..28FF	Braille Patterns
2900..297F	Supplemental Arrows-B
2980..29FF	Miscellaneous Mathematical Symbols-B
2A00..2AFF	Supplemental Mathematical Operators
2B00..2BFF	Miscellaneous Symbols and Arrows
2C00..2C5F	Glagolitic
2C60..2C7F	Latin Extended-C	Proposed (Unicode 5.0)
2C80..2CFF	Coptic
2D00..2D2F	Georgian Supplement
2D30..2D7F	Tifinagh
2D80..2DDF	Ethiopic Extended
2E00..2E7F	Supplemental Punctuation
2E80..2EFF	CJK Radicals Supplement
2F00..2FDF	Kangxi Radicals
2FF0..2FFF	Ideographic Description Characters
3000..303F	CJK Symbols and Punctuation
3040..309F	Hiragana
30A0..30FF	Katakana
3100..312F	Bopomofo
3130..318F	Hangul Compatibility Jamo
3190..319F	Kanbun
31A0..31BF	Bopomofo Extended
31C0..31EF	CJK Strokes
31F0..31FF	Katakana Phonetic Extensions
3200..32FF	Enclosed CJK Letters and Months
3300..33FF	CJK Compatibility
3400..4DBF	CJK Unified Ideographs Extension A
4DC0..4DFF	Yijing Hexagram Symbols
4E00..9FFF	CJK Unified Ideographs	Main block of CJK
A000..A48F	Yi Syllables
A490..A4CF	Yi Radicals
A700..A71F	Modifier Tone Letters
A720..A7FF	Latin Extended-D	Proposed (Unicode 5.0)
A800..A82F	Syloti Nagri
A840..A87F	Phags-pa	Proposed (Unicode 5.0)
AC00..D7AF	Hangul Syllables
D800..DB7F	High Surrogates
DB80..DBFF	High Private Use Surrogates
DC00..DFFF	Low Surrogates
E000..F8FF	Private Use Area
F900..FAFF	CJK Compatibility Ideographs
FB00..FB4F	Alphabetic Presentation Forms
FB50..FDFF	Arabic Presentation Forms-A
FE00..FE0F	Variation Selectors
FE10..FE1F	Vertical Forms
FE20..FE2F	Combining Half Marks
FE30..FE4F	CJK Compatibility Forms
FE50..FE6F	Small Form Variants
FE70..FEFF	Arabic Presentation Forms-B
FF00..FFEF	Halfwidth and Fullwidth Forms
FFF0..FFFF	Specials
10000..1007F	Linear B Syllabary
10080..100FF	Linear B Ideograms
10100..1013F	Aegean Numbers
10140..1018F	Ancient Greek Numbers
10300..1032F	Old Italic
10330..1034F	Gothic
10380..1039F	Ugaritic
103A0..103DF	Old Persian
10400..1044F	Deseret
10450..1047F	Shavian
10480..104AF	Osmanya
10800..1083F	Cypriot Syllabary
10900..1091F	Phoenician	Proposed (Unicode 5.0)
10A00..10A5F	Kharoshthi
12000..123FF	Cuneiform	Proposed (Unicode 5.0)
12400..1247F	Cuneiform Numbers and Punctuation	Proposed (Unicode 5.0)
1D000..1D0FF	Byzantine Musical Symbols
1D100..1D1FF	Musical Symbols
1D200..1D24F	Ancient Greek Musical Notation
1D300..1D35F	Tai Xuan Jing Symbols
1D360..1D37F	Chinese Counting Rod Numerals	Proposed (Unicode 5.0)
1D400..1D7FF	Mathematical Alphanumeric Symbols
20000..2A6DF	CJK Unified Ideographs Extension B
2F800..2FA1F	CJK Compatibility Ideographs Supplement
E0000..E007F	Tags	Language tagging
E0100..E01EF	Variation Selectors Supplement
F0000..FFFFF	Supplementary Private Use Area-A
100000..10FFFF	Supplementary Private Use Area-B
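
The Blocks.txt file mentioned above uses a simple line format, "start..end; block name," with comment lines that begin with "#". The following sketch, assuming Python, parses data in that format and looks up the block of a character. The SAMPLE string is a short hand-written excerpt in that layout, not the real file, and parse_blocks and block_name are hypothetical helpers.

```python
from bisect import bisect_right

# Hand-written excerpt in the Blocks.txt layout; the real file is much longer.
SAMPLE = """\
# start..end; block name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
2700..27BF; Dingbats
FFF0..FFFF; Specials
1D100..1D1FF; Musical Symbols
"""

def parse_blocks(text):
    """Return a sorted list of (first, last, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code_range, name = line.split(";", 1)
        first, last = code_range.split("..")
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    blocks.sort()
    return blocks

def block_name(blocks, ch):
    """Return the block name for ch, or None if no listed block contains it."""
    cp = ord(ch)
    starts = [first for first, _, _ in blocks]
    index = bisect_right(starts, cp) - 1       # last block starting at or before cp
    if index >= 0 and cp <= blocks[index][1]:
        return blocks[index][2]
    return None

blocks = parse_blocks(SAMPLE)
print(block_name(blocks, "A"), block_name(blocks, "\u2713"))
```

With the complete file as input, the same lookup covers all blocks; a result of None then means that the code point does not belong to any block.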