8.11. Other Blocks

Some Unicode blocks of general interest are described here. For information on blocks that relate to a particular writing system or a specialized application area, please refer to the appropriate section in the Unicode standard. The overall effect of writing systems on character usage was discussed in Chapter 7.

8.11.1. Spacing Modifier Letters

Some characters in this block are "spacing clones" of diacritic marks. That is, they are defined as being compatibility equivalent to space U+0020 followed by a combining diacritic mark. However, this block includes quite a few other characters as well. They are mostly written after a letter, though some of them are actually used as independent letters, e.g., the different apostrophe-like characters that are used to transliterate the Arabic character hamza.

For example, the first of the characters in this block, modifier letter small "h" (U+02B0), is used to indicate aspiration of the preceding consonant in phonetic notations (e.g., in pronunciation instructions in encyclopedias). This character is a compatibility character, which is defined to be compatibility equivalent to letter "h" in superscript style. The results of using U+02B0 (from a font where it exists) and using "h" formatted in superscript style may differ, of course, especially since programs often implement superscripting simply by decreasing the size of a glyph and putting it in a higher position. Good font design tries to make the appearance better, perhaps modifying the shape to suit the needs of small-size rendering. Compare the following ways of denoting an aspirated pronunciation of "k," using first U+02B0, and then "h" as a superscript: kʰ kh.
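The compatibility mappings described above can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the choice of breve (U+02D8) as the sample spacing clone is ours, made only for illustration:

```python
import unicodedata

# U+02D8 BREVE is a spacing clone: its compatibility decomposition is
# SPACE (U+0020) followed by COMBINING BREVE (U+0306).
breve = "\u02D8"
print(unicodedata.decomposition(breve))   # <compat> 0020 0306
clone_nfkd = unicodedata.normalize("NFKD", breve)

# U+02B0 MODIFIER LETTER SMALL H maps to the letter "h" tagged as superscript.
mod_h = "\u02B0"
print(unicodedata.decomposition(mod_h))   # <super> 0068

# Compatibility normalization drops the superscript information entirely.
h_nfkc = unicodedata.normalize("NFKC", "k" + mod_h)
print(h_nfkc)                             # kh
```

Note that compatibility normalization turns the aspirated "k" into plain "kh," so it should not be applied to phonetic text where the distinction matters.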

8.11.2. Currency Symbols

Currencies can be denoted in several ways: words, currency symbol characters, or various abbreviations or codes. The optimal choice depends on the context and intentions. When uniqueness, definiteness, and internationality (as neutrality with respect to national languages) are essential, the three-letter codes as defined in ISO 4217 should be used, e.g., "GBP 42." In localized notations, the formats vary (e.g., "£42" versus "42 £"), and so do currency names, of course. The Common Locale Data Repository, described in Chapter 11, contains extensive information on such localized formats. Currency symbol characters (general category Sc) appear in different blocks in Unicode:

  • The dollar sign $ is in the Basic Latin block.

  • The cent sign ¢, the pound sign £, the currency sign ¤, and the yen sign ¥ are in the Latin-1 Supplement block.

  • There are several currency symbols in script-specific blocks, such as the Thai currency symbol baht ฿ (U+0E3F) in the Thai block.

  • Other currency symbols are in the Currency Symbols block, U+20A0..U+20CF. It includes important symbols such as the euro sign € (U+20AC) but also some characters that are historical only, such as the French franc sign ₣ (U+20A3). The euro-currency sign ₠ (U+20A0) is not even historical but only a symbol that was once planned and allocated, and it has not been removed, due to Unicode principles.
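Because the currency symbols are scattered across blocks, the general category is a more reliable way to recognize them than the block. As a small sketch (the dictionary labels are informal descriptions chosen for this example, not official character names), the characters mentioned in the list can be checked with Python's unicodedata module:

```python
import unicodedata

# Currency symbols mentioned in the list above, from four different blocks.
symbols = {
    "dollar sign": "\u0024",         # Basic Latin
    "cent sign": "\u00A2",           # Latin-1 Supplement
    "pound sign": "\u00A3",
    "currency sign": "\u00A4",
    "yen sign": "\u00A5",
    "baht": "\u0E3F",                # Thai
    "euro-currency sign": "\u20A0",  # Currency Symbols
    "French franc sign": "\u20A3",
    "euro sign": "\u20AC",
}

categories = {label: unicodedata.category(ch) for label, ch in symbols.items()}

for label, ch in symbols.items():
    print(f"U+{ord(ch):04X} {label}: {categories[label]}")
```

Every entry reports the category Sc, regardless of the block the character lives in.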

8.11.3. Phonetic Characters

Phonetic characters are used in writing systems that indicate the pronunciation. The most widely known and used phonetic alphabet is the International Phonetic Alphabet (IPA). Originally designed for use in linguistics, IPA is also used in language teaching and in pronunciation instructions in dictionaries and encyclopedias, though in English material, other pronunciation notations are more common. In developing writing systems for languages that previously existed in spoken form only, some IPA characters are often used along with normal Latin letters.

IPA is a fairly old alphabet and was originally defined by indicating the visible shapes of characters only. For computer applications, the characters had to be defined more exactly. Some characters were identified with normal Latin (lowercase) letters, such as "b." Some were identified with other characters that are used in normal writing too, such as æ (which belongs to the Latin-1 Supplement). But most IPA characters were separately coded in the IPA Extensions block.

No writing system can accurately describe all details of spoken language. Even IPA notations are just approximations. Moreover, they are approximations of different degrees. Simple IPA writing can be used, for example, in dictionaries, whereas transcription of speech in linguistics uses more exact descriptions, using diacritic marks to indicate nuances.

The needs of IPA transcription differ from conventions of general purpose typesetting. This is not surprising, since IPA attempts a precision of phonetic representation that is well beyond that of the normal alphabet of any Latin script language. For this purpose, IPA uses diacritic marks, but it also assigns a distinctive meaning to forms that in general purpose typography are considered purely stylistic variants of the same letter. The most obvious case is that IPA includes both the common (ASCII) letter "a" and a variant of the letter "a" that denotes a vowel of different quality. The latter letter is oddly named: Latin small letter alpha ɑ (U+0251).

For such reasons, IPA characters do not follow typical typographic conventions in the distinction between roman and italic styles. In simple terms, an italic IPA font needs to be something akin to an oblique version of roman, rather than a distinct style of lettering. Thus, IPA involves a specialized technical kind of typesetting, not very different from, for example, mathematical typesetting in the way that it assigns distinct meaning to stylistic variants of letterforms.

Many characters in the IPA Extensions block are turned or otherwise modified versions of Latin letters. For example, the Latin small letter schwa ə (U+0259) is originally a rotated image of "e." It denotes the neutral and often reduced vowel that is so common in English, and it is present in some fonts that do not otherwise contain IPA Extensions.

For example, the standard British pronunciation of the English word "international" can be written in IPA as ɪntəˈnæʃənəl. The character "n" there is the common Latin letter, whereas the character æ is the same Latin small letter "ae" (U+00E6) as used, for example, in Danish and Old English. Some other characters, such as the schwa, are from the IPA Extensions block. The IPA stress mark, modifier letter vertical line ˈ (U+02C8), is not common in fonts, and often other characters such as the (ASCII) apostrophe ' (U+0027) are used instead.
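To see how a single IPA transcription draws on several blocks, the sample word can be split into characters and each one looked up by code range. This is only an illustrative sketch: the helper block_of and the short BLOCKS list are hypothetical names introduced here, and the list contains just the four ranges needed for this word.

```python
# Code ranges for the four blocks that occur in the sample word.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
]

def block_of(ch):
    """Return the block name for ch, searching the short list above."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(not in this short list)"

# The IPA transcription of "international", written with escapes so that
# the code points are explicit.
word = "\u026Ant\u0259\u02C8n\u00E6\u0283\u0259n\u0259l"

blocks_used = [block_of(ch) for ch in word]
for ch, name in zip(word, blocks_used):
    print(f"U+{ord(ch):04X}  {name}")
```

The twelve characters of the word come from all four blocks, which is why a font that covers only IPA Extensions is not enough for IPA text.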

The official description of IPA is available at the site http://www.arts.gla.ac.uk/ipa/ipachart.html. Since characters used in IPA appear in different blocks in Unicode, you may find the following document useful: "The International Phonetic Alphabet in Unicode," http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.

Due to the heavy use of diacritic marks, IPA transcription often requires implementations that support combining diacritic marks, since most of the combinations needed do not appear as precomposed characters in Unicode. However, for simple usage of IPA, relatively simple implementations of such marks are tolerable.
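The lack of precomposed forms can be observed through normalization. In the sketch below (the sample pair is our own choice), a combining tilde is attached first to the ordinary letter "a," which has a precomposed form ã (U+00E3), and then to the IPA letter ɑ (U+0251), which has none:

```python
import unicodedata

COMBINING_TILDE = "\u0303"

# NFC composes base + mark into one character when a precomposed form exists.
latin_a = unicodedata.normalize("NFC", "a" + COMBINING_TILDE)
ipa_alpha = unicodedata.normalize("NFC", "\u0251" + COMBINING_TILDE)

print(len(latin_a), [f"U+{ord(c):04X}" for c in latin_a])      # 1 ['U+00E3']
print(len(ipa_alpha), [f"U+{ord(c):04X}" for c in ipa_alpha])  # 2 ['U+0251', 'U+0303']
```

In the second case, the rendering software has to position the mark over the base letter itself, which is exactly the support for combining marks that the text refers to.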

For simple IPA, the Arial Unicode MS font is sufficient and suitable. For more advanced purposes, you may wish to use the Doulos SIL font, available from http://scripts.sil.org.

In addition to IPA, there are other phonetic writing systems. One of them, the Uralic Phonetic Alphabet (UPA), has been included in Unicode. The added characters are in the Phonetic Extensions block.

8.11.4. Specials

The Specials block contains just a few code positions, U+FFF0 through U+FFFF, and they are indeed special:

  • U+FFF0 through U+FFF8 are unassigned (reserved for eventual future use).

  • U+FFF9 through U+FFFB are interlinear annotation characters, explained below.

  • U+FFFC is an object replacement character, which is an invisible placeholder for a nontextual object, such as an image, to be inserted (by some external tools). In code charts, a dashed box containing the letters OBJ appears in place of this character.

  • U+FFFD is a replacement character, for use in data converted from a code other than Unicode, to indicate a character that has no Unicode counterpart. This is somewhat similar to U+001A (substitute, Control-Z) in the ASCII range. However, U+FFFD has a visible shape � (although it appears in a few fonts only). In the Java programming language, U+FFFD is traditionally used to indicate Not a Number (NaN), i.e., an undefined result of a mathematical operation; this does not comply with the meaning of U+FFFD in Unicode.

  • U+FFFE and U+FFFF are noncharacters, i.e., code positions that do not and will not ever represent any characters. They can be used as sentinels or for checking purposes. Any occurrence of these code points in character data (i.e., in data being interpreted as characters) indicates an error of some kind.
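The division of the Specials block can be expressed as a small lookup, and the replacement character can be seen in action when converting data. The following sketch uses hypothetical helper names (classify_special, has_block_noncharacter); the decoding step relies on Python's errors="replace" handler, which inserts U+FFFD for bytes it cannot decode, a use closely related to (though not identical with) the one described above:

```python
def classify_special(cp):
    """Describe a code point in the Specials block, U+FFF0..U+FFFF."""
    if not 0xFFF0 <= cp <= 0xFFFF:
        raise ValueError("not in the Specials block")
    if cp <= 0xFFF8:
        return "unassigned"
    if cp <= 0xFFFB:
        return "interlinear annotation character"
    if cp == 0xFFFC:
        return "object replacement character"
    if cp == 0xFFFD:
        return "replacement character"
    return "noncharacter"

def has_block_noncharacter(text):
    """True if text contains U+FFFE or U+FFFF, which signals an error."""
    return any(ch in "\uFFFE\uFFFF" for ch in text)

# The byte 0xE9 is "é" in ISO 8859-1 but is not valid UTF-8 on its own,
# so the decoder substitutes U+FFFD for it.
decoded = b"caf\xe9".decode("utf-8", errors="replace")
print([f"U+{ord(c):04X}" for c in decoded])
```

A check such as has_block_noncharacter covers only the two noncharacters of this block; Unicode defines other noncharacters elsewhere, which this sketch does not test for.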

Interlinear annotation characters are invisible indicators (control characters, in a sense) that separate interlinear annotations from normal text. "Interlinear" means "between the lines" and refers to information presented between normal lines in small font. Interlinear annotations, called ruby or furigana, are typically used in Japanese books for children or for foreigners studying Japanese, and they usually show the pronunciation of words. The name "ruby" is originally the name of a font size.

Although interlinear annotations primarily relate to East Asian languages, they might conceivably be used for other purposes as well. They could be used to indicate the pronunciation of foreign words in English text, or to add editor's or translator's short notes, or even to create documents with lyrics with guitar chords so that the chords will be displayed above the respective text. However, software that supports interlinear annotations may do so in a manner designed for annotations of East Asian texts, e.g., using a very small font by default.

Figure 8-3. Display of interlinear annotations on IE 6; the first alternative uses Ruby markup, the second tries to use interlinear annotation characters in Unicode


Interlinear annotations are best described at higher protocol levels, such as the markup elements in the Ruby module of XHTML. The Ruby module belongs to XHTML 1.1 and has some limited support in Internet Explorer (IE) since Version 6.

The interlinear annotation characters in Unicode are of rather limited usefulness. Very few programs support them. When they are not supported, something odd may appear in their place, and the annotations would appear in normal text. However, the characters might conceivably be used if you need to represent the annotations in plain text format and you have (or you can create) software that supports them. The characters are:

  • U+FFF9 interlinear annotation anchor indicates the start of normal text that has an annotation attached to it; corresponds to markup <rb> in Ruby in XHTML.

  • U+FFFA interlinear annotation separator indicates the end of the text being annotated and the start of the annotation; corresponds to </rb><rt> in Ruby.

  • U+FFFB interlinear annotation terminator ends the annotation, so that subsequent characters will be taken as normal text; corresponds to </rt> in Ruby.

The following piece of XHTML markup uses first Ruby markup, then interlinear annotation characters (via character references) to add information about the pronunciation of a name. The markup method is the one that has the best chance of working. This is illustrated in Figure 8-3:

 <p>My first name is <ruby><rb>Jukka</rb><rt>Yook-kah</rt></ruby>.</p>
 <p>My first name is &#xfff9;Jukka &#xfffa;Yook-kah&#xfffb;.</p>
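Since the markup method has the best chance of working, plain text that uses the interlinear annotation characters can be converted to Ruby markup before display. The following is a minimal sketch of such a conversion, following the correspondences listed above. The function name annotations_to_ruby is hypothetical, the enclosing <ruby> element is added by us, and the code assumes that the three characters occur in well-formed anchor, separator, terminator order:

```python
import html

ANCHOR = "\uFFF9"      # start of the annotated text
SEPARATOR = "\uFFFA"   # end of annotated text, start of the annotation
TERMINATOR = "\uFFFB"  # end of the annotation

def annotations_to_ruby(text):
    """Replace interlinear annotation characters with Ruby markup."""
    # Escape &, <, and > first so that only our own tags are markup.
    escaped = html.escape(text, quote=False)
    return (escaped
            .replace(ANCHOR, "<ruby><rb>")
            .replace(SEPARATOR, "</rb><rt>")
            .replace(TERMINATOR, "</rt></ruby>"))

plain = "My first name is \uFFF9Jukka\uFFFAYook-kah\uFFFB."
markup = "<p>" + annotations_to_ruby(plain) + "</p>"
print(markup)
# <p>My first name is <ruby><rb>Jukka</rb><rt>Yook-kah</rt></ruby>.</p>
```

Simple string replacement is enough here because each of the three characters maps to a fixed piece of markup; unbalanced input would need real parsing and error handling.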

8.11.5. Dingbats

Dingbats are essentially graphics coded as characters. One might say that the meaning of a dingbat is its graphic appearance. This makes dingbats rather special. On the other hand, in practice, some of the dingbats have a fairly well-defined logical meaning, and putting them into this block has been a rather arbitrary decision.

Traditionally, dingbats have been used by switching to a special font. This means that data is typically in an 8-bit encoding but by font change, characters are visually turned into something quite different. Thus, you could type the letter "a," and then change the font to a special one, and get a check mark ✓ (U+2713). However, this is not the Unicode way.
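In the Unicode approach, the dingbat is a character of its own, so its identity does not depend on the font in use. A brief sketch with Python's unicodedata module illustrates the difference between the check mark and the letter "a" that a font-switching approach would store in the data:

```python
import unicodedata

check = "\u2713"
letter = "a"

check_info = (unicodedata.name(check), unicodedata.category(check))
letter_info = (unicodedata.name(letter), unicodedata.category(letter))

# The Dingbats block covers U+2700..U+27BF.
in_dingbats = 0x2700 <= ord(check) <= 0x27BF

print(check_info)    # ('CHECK MARK', 'So')
print(letter_info)   # ('LATIN SMALL LETTER A', 'Ll')
print(in_dingbats)   # True
```

Text processing such as searching or sorting therefore sees a symbol (category So), not a letter that merely looks like a check mark in one particular font.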

This block of Unicode in general does not contain all the graphics that have been implemented in different specialized fonts. For example, corporate logos are excluded. Many of the symbols in the Wingdings fonts commonly available in computers have not been coded as characters in Unicode at all.

8.11.6. Summary of Blocks

Table 8-14 lists all blocks as defined in Unicode Version 4.1 and planned for Version 5.0. The up-to-date summary information on blocks is in the file Blocks.txt in the Unicode character database, available online at http://www.unicode.org. Many blocks correspond more or less directly to some specific scripts (writing systems) discussed in Chapter 7.
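The data lines of Blocks.txt have the form "start..end; block name," with comments introduced by "#," so the file is easy to process. The following sketch parses a short excerpt embedded as a string; the function names parse_blocks and find_block are hypothetical, and a real program would read the full file from the Unicode character database instead of the SAMPLE text:

```python
SAMPLE = """\
# Excerpt in the format of Blocks.txt, for illustration only
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
2700..27BF; Dingbats
FFF0..FFFF; Specials
"""

def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        code_range, name = line.split(";", 1)
        start, end = code_range.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks

def find_block(blocks, ch):
    """Return the name of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None

blocks = parse_blocks(SAMPLE)
print(find_block(blocks, "\u2713"))   # Dingbats
print(find_block(blocks, "\uFFFD"))   # Specials
```

A linear search is adequate for the small number of blocks; for heavy use, the sorted start values could be searched with the bisect module instead.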

Table 8-14. Unicode 4.1 blocks

Code range      Name of block                            Notes
0000..007F      Basic Latin                              ASCII
0080..00FF      Latin-1 Supplement                       Upper half of Latin 1
0100..017F      Latin Extended-A
0180..024F      Latin Extended-B
0250..02AF      IPA Extensions                           Phonetic symbols
02B0..02FF      Spacing Modifier Letters
0300..036F      Combining Diacritical Marks
0370..03FF      Greek and Coptic
0400..04FF      Cyrillic
0500..052F      Cyrillic Supplement
0530..058F      Armenian
0590..05FF      Hebrew
0600..06FF      Arabic
0700..074F      Syriac
0750..077F      Arabic Supplement
0780..07BF      Thaana
07C0..07FF      NKo                                      Proposed (Unicode 5.0)
0900..097F      Devanagari                               For Indic languages
0980..09FF      Bengali
0A00..0A7F      Gurmukhi
0A80..0AFF      Gujarati
0B00..0B7F      Oriya
0B80..0BFF      Tamil
0C00..0C7F      Telugu
0C80..0CFF      Kannada
0D00..0D7F      Malayalam
0D80..0DFF      Sinhala
0E00..0E7F      Thai
0E80..0EFF      Lao
0F00..0FFF      Tibetan
1000..109F      Myanmar
10A0..10FF      Georgian
1100..11FF      Hangul Jamo
1200..137F      Ethiopic
1380..139F      Ethiopic Supplement
13A0..13FF      Cherokee
1400..167F      Unified Canadian Aboriginal Syllabics
1680..169F      Ogham
16A0..16FF      Runic
1700..171F      Tagalog
1720..173F      Hanunoo
1740..175F      Buhid
1760..177F      Tagbanwa
1780..17FF      Khmer
1800..18AF      Mongolian
1900..194F      Limbu
1950..197F      Tai Le
1980..19DF      New Tai Lue
19E0..19FF      Khmer Symbols
1A00..1A1F      Buginese
1B00..1B7F      Balinese                                 Proposed (Unicode 5.0)
1D00..1D7F      Phonetic Extensions                      Mostly for UPA
1D80..1DBF      Phonetic Extensions Supplement
1DC0..1DFF      Combining Diacritical Marks Supplement
1E00..1EFF      Latin Extended Additional
1F00..1FFF      Greek Extended
2000..206F      General Punctuation
2070..209F      Superscripts and Subscripts
20A0..20CF      Currency Symbols
20D0..20FF      Combining Diacritical Marks for Symbols
2100..214F      Letterlike Symbols
2150..218F      Number Forms
2190..21FF      Arrows
2200..22FF      Mathematical Operators
2300..23FF      Miscellaneous Technical
2400..243F      Control Pictures
2440..245F      Optical Character Recognition
2460..24FF      Enclosed Alphanumerics
2500..257F      Box Drawing
2580..259F      Block Elements
25A0..25FF      Geometric Shapes
2600..26FF      Miscellaneous Symbols
2700..27BF      Dingbats
27C0..27EF      Miscellaneous Mathematical Symbols-A
27F0..27FF      Supplemental Arrows-A
2800..28FF      Braille Patterns
2900..297F      Supplemental Arrows-B
2980..29FF      Miscellaneous Mathematical Symbols-B
2A00..2AFF      Supplemental Mathematical Operators
2B00..2BFF      Miscellaneous Symbols and Arrows
2C00..2C5F      Glagolitic
2C60..2C7F      Latin Extended-C                         Proposed (Unicode 5.0)
2C80..2CFF      Coptic
2D00..2D2F      Georgian Supplement
2D30..2D7F      Tifinagh
2D80..2DDF      Ethiopic Extended
2E00..2E7F      Supplemental Punctuation
2E80..2EFF      CJK Radicals Supplement
2F00..2FDF      Kangxi Radicals
2FF0..2FFF      Ideographic Description Characters
3000..303F      CJK Symbols and Punctuation
3040..309F      Hiragana
30A0..30FF      Katakana
3100..312F      Bopomofo
3130..318F      Hangul Compatibility Jamo
3190..319F      Kanbun
31A0..31BF      Bopomofo Extended
31C0..31EF      CJK Strokes
31F0..31FF      Katakana Phonetic Extensions
3200..32FF      Enclosed CJK Letters and Months
3300..33FF      CJK Compatibility
3400..4DBF      CJK Unified Ideographs Extension A
4DC0..4DFF      Yijing Hexagram Symbols
4E00..9FFF      CJK Unified Ideographs                   Main block of CJK
A000..A48F      Yi Syllables
A490..A4CF      Yi Radicals
A700..A71F      Modifier Tone Letters
A720..A7FF      Latin Extended-D                         Proposed (Unicode 5.0)
A800..A82F      Syloti Nagri
A840..A87F      Phags-pa                                 Proposed (Unicode 5.0)
AC00..D7AF      Hangul Syllables
D800..DB7F      High Surrogates
DB80..DBFF      High Private Use Surrogates
DC00..DFFF      Low Surrogates
E000..F8FF      Private Use Area
F900..FAFF      CJK Compatibility Ideographs
FB00..FB4F      Alphabetic Presentation Forms
FB50..FDFF      Arabic Presentation Forms-A
FE00..FE0F      Variation Selectors
FE10..FE1F      Vertical Forms
FE20..FE2F      Combining Half Marks
FE30..FE4F      CJK Compatibility Forms
FE50..FE6F      Small Form Variants
FE70..FEFF      Arabic Presentation Forms-B
FF00..FFEF      Halfwidth and Fullwidth Forms
FFF0..FFFF      Specials
10000..1007F    Linear B Syllabary
10080..100FF    Linear B Ideograms
10100..1013F    Aegean Numbers
10140..1018F    Ancient Greek Numbers
10300..1032F    Old Italic
10330..1034F    Gothic
10380..1039F    Ugaritic
103A0..103DF    Old Persian
10400..1044F    Deseret
10450..1047F    Shavian
10480..104AF    Osmanya
10800..1083F    Cypriot Syllabary
10900..1091F    Phoenician                               Proposed (Unicode 5.0)
10A00..10A5F    Kharoshthi
12000..123FF    Cuneiform                                Proposed (Unicode 5.0)
12400..1247F    Cuneiform Numbers and Punctuation        Proposed (Unicode 5.0)
1D000..1D0FF    Byzantine Musical Symbols
1D100..1D1FF    Musical Symbols
1D200..1D24F    Ancient Greek Musical Notation
1D300..1D35F    Tai Xuan Jing Symbols
1D360..1D37F    Chinese Counting Rod Numerals            Proposed (Unicode 5.0)
1D400..1D7FF    Mathematical Alphanumeric Symbols
20000..2A6DF    CJK Unified Ideographs Extension B
2F800..2FA1F    CJK Compatibility Ideographs Supplement
E0000..E007F    Tags                                     Language tagging
E0100..E01EF    Variation Selectors Supplement
F0000..FFFFF    Supplementary Private Use Area-A
100000..10FFFF  Supplementary Private Use Area-B




Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
