4.5. Guide to the Unicode Standard

The newest version of the Unicode standard itself should be your ultimate reference in matters of Unicode. It is, however, very large and partly very technical and hard to read, though many parts are enjoyable and smoothly written. Perhaps most frustratingly, it is often difficult to find the place or places where some topic is covered; the information might be scattered across different sections of the standard. To help you find the relevant information and make use of it, here is a brief guide to the standard.

4.5.1. Accessing the Unicode Versions

The Unicode standard is available online (mostly in PDF format), but not necessarily as a simple consolidated version. You may need to combine information from a major base version with later modifications issued as minor versions. At the time of this writing, the current version is 4.1.0, and its content is defined cumulatively by the following documents:

The Unicode standard, Version 4.0: http://www.unicode.org/versions/Unicode4.0.0/

Unicode 4.0.1, an update to the previous: http://www.unicode.org/versions/Unicode4.0.1/

Unicode 4.1.0, another update (minor version): http://www.unicode.org/versions/Unicode4.1.0/

The Unicode database reflects the newest version, but the prose text and code charts may need to be read along with the update documents.

A previous version of the standard, Unicode 3.0, is available online, too, and it might be interesting for comparison: http://www.unicode.org/unicode/uni2book/u2.html. There are also many old database files available via http://www.unicode.org/versions/.

4.5.2. What Material Constitutes the Unicode Standard?

The Unicode standard is available as a book, though there can be a delay between issuing the standard and printing it. The online version contains PDF documents that correspond to the chapters of the book. But these alone are not self-contained presentations of the standard. There are several points to note:

As mentioned earlier in the chapter, there can be incremental updates (minor versions).
On the Unicode web site, there's a page titled "Updates and Errata," which lists official corrections to the standard. As new versions are issued, corrections are incorporated into them, and the "Updates and Errata" page is effectively cleared. The page is http://www.unicode.org/errata/.
There is a series of documents called "Unicode Technical Reports," some of which are called "Unicode Standard Annexes" (UAX) and are regarded as an integral part of the standard but published as separate documents. They are available on the CD-ROM that accompanies the book as well as (possibly in updated versions) on the Unicode web site, at http://www.unicode.org/unicode/reports/.
There is the "Unicode Character Database," which defines many properties for characters, in a manner suitable for automated processing. The database and the description of its structure are available via http://www.unicode.org/ucd/.

4.5.3. Viewing the Standard Online

As mentioned earlier in the chapter, the online standard is mostly in PDF format. Thus, you need some software that can display PDF files, such as Adobe Reader. The online version cannot be printed using normal methods, so you may still have a reason to buy the printed standard. Copying of texts is possible: using Adobe Reader's text select tools, you can copy text onto the Windows clipboard.

The main table of contents of the online version consists of the following parts:

Front Matter: This includes a table of contents as in a book, in PDF format, but also "Unicode 4.0 Web Bookmarks," which is a very handy hypertext table of contents. It is in HTML format, with links pointing to locations in the PDF files.
Chapters: The main text of the standard. See below for an explanation of its structure.
Appendices and Back Matter: Material such as a glossary (in PDF format).
Unicode Standard Annexes: The number of the annexes varies by standard version, since annexes may get incorporated into the main text when creating new versions.
The Unicode Character Database (UCD): Consists of HTML and plain text files.
Related Links: The links point to additional material on the Unicode site, such as Glossary of Unicode Terms (updated and modified, in HTML format).

4.5.4. The Chapters of the Standard

The breakdown of chapters is as follows:

Chapter 1: Introduction

This is a short chapter, and it gives a good overview of some basic ideas.

Chapter 2: General Structure

This gets more detailed and more technical than the Introduction. It presents the fundamental principles of Unicode, but it is rather hard to read. After finishing this book, though, you can probably understand this chapter.

Chapter 3: Conformance

This is a rather technical chapter, which is important to Unicode implementers. For a "normal" reader, there are some useful explanations of basic concepts like character semantics and code values.

Chapter 4: Character Properties

Describes how the standard defines some general properties for characters, such as General Category (letter, number, separator, etc.) or case mappings (e.g., what character, if any, is the uppercase equivalent of a lowercase letter).

Chapter 5: Implementation Guidelines

As the name says, this is mainly for implementers. But reading 5.1, "Transcoding to Other Standards," can be useful to anyone, and browsing through the headings is a good idea, too. Note in particular that this chapter describes some general principles according to which programs might recognize grapheme, word, line, and sentence boundaries (e.g., to implement a command for moving forward one sentence in text processing). It also explains the problems of sorting and searching, which are more language-dependent than you may have thought.

Chapter 6: Punctuation and Writing Systems

This is the first of the chapters (6 through 15) that describe the various sets of characters. They contain quite a lot of practical information about the use of various characters and comparisons between characters (e.g., a comparison of different dash-like characters). Note that the sets do not necessarily correspond to blocks. For example, there are punctuation symbols scattered across various blocks, in addition to the General Punctuation block. This chapter begins with an overview of writing systems, also known as scripts.

Chapter 7: European Alphabetic Scripts

Latin, Greek, Cyrillic, etc.

Chapter 8: Middle Eastern Scripts

Hebrew, Arabic, Syriac, Thaana.

Chapter 9: South Asian Scripts

Devanagari, Bengali, etc.

Chapter 10: Southeast Asian Scripts

Thai, Lao, etc.

Chapter 11: East Asian Scripts

Han (especially Chinese-Japanese-Korean (CJK) unified ideographs), Hiragana, Katakana, Hangul, Bopomofo, Yi.

Chapter 12: Additional Modern Scripts

Ethiopic, Mongolian, Osmanya, Cherokee, Canadian Aboriginal Syllabics, Deseret, Shavian.

Chapter 13: Archaic Scripts

Ogham, Runic, and other historical scripts.

Chapter 14: Symbols

This includes a rich set of characters used as symbols that are relatively language-independent, such as currency symbols, letterlike symbols (which are letters taken into some special use), number forms, and mathematical, technical, geometric, and other symbols.

Chapter 15: Special Areas and Format Characters

This chapter discusses codes used for various control purposes, the "private use" area, the "surrogates" area (based on the idea of using two 16-bit values to represent one character), and the special code points at the end of the Unicode range (e.g., byte order mark).

Chapter 16: Code Charts

This "chapter" presents the characters themselves, and it constitutes about half of the volume. It begins with a short legend and explanations. Then the blocks are presented, in code number order. For most blocks, a chart of (typical) glyphs for the characters in it is given first, followed by a list of the characters, with their code numbers, glyphs, names, and possibly alternate names, references to similar (but distinct) characters, decompositions (compatibility or canonical), and usage notes. These descriptions do not list all the properties of the characters as defined in Unicode; they do not include all the information in the Unicode database.

Figure 4-2. A search from the Zvon database by character name has found the character, and a link to information on it is included

Chapter 17: Han Radical-Stroke Index

For the Chinese-origin ideograms. "To expedite locating specific Han ideographic characters within the Unicode Han ideographic set, this chapter contains a radical-stroke index." The Han Radical-Stroke Index itself is available as a separate document.

Thus, Chapters 1 through 5 form the general part. Their essential content is covered in this book. The relevance of the other parts depends on what kinds of characters you work with.

4.5.5. How Do I Find All the Information About a Character?

If you are looking for the most appropriate Unicode character for some particular use, there is no simple answer. You might browse through the chart for the block where you expect the character to appear; for example, a mathematical symbol is probably in the Mathematical Operators block. You can also use more systematic search methods. A few alternatives are described in the following sections.
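Browsing a block can also be approximated in code. The following is a minimal sketch, assuming Python and its standard unicodedata module, whose data reflect the Unicode version bundled with the Python build rather than necessarily the version discussed here. The function name list_block is hypothetical; the range U+2200..U+22FF is that of the Mathematical Operators block.

```python
import unicodedata

def list_block(first, last):
    """Return (code point, name) pairs for the named characters in first..last."""
    result = []
    for cp in range(first, last + 1):
        # name() returns the given default (None) for characters without a name.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            result.append((cp, name))
    return result

# Print the first few characters of the Mathematical Operators block.
for cp, name in list_block(0x2200, 0x22FF)[:8]:
    print(f"U+{cp:04X} {name}")
```

Such a listing shows only names, so it complements the code charts rather than replacing them: the charts also carry glyphs, cross references, and usage notes.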

4.5.5.1. The Zvon database

If your clue to the character is its name or its Unicode number, you could use the online Zvon character database: http://www.zvon.org/other/charSearch/PHP/search.php. The database, although not authoritative, is based on information at the Unicode site. Beware that the name you have in mind might not be the one under which the character is known in Unicode; the name might have been assigned to a different character there.

The information in the Zvon database (of which an example is shown in Figure 4-3) is the same as in the Unicode code charts, including the annotations (called "Comment" in Zvon), and some additional derived information such as the XML character reference. The information does not include the notes made in the prose text of the standard.

Figure 4-3. Information on a character in the Zvon database
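A lookup by name or by code number, and the derivation of an XML character reference, can be sketched as follows. This is only an illustration using Python's standard unicodedata module, not a description of how the Zvon database works; the helper name xml_char_ref is hypothetical.

```python
import unicodedata

def xml_char_ref(ch):
    """Return the hexadecimal XML character reference for one character."""
    return f"&#x{ord(ch):X};"

by_name = unicodedata.lookup("INCREMENT")   # character from its Unicode name
by_number = chr(0x2206)                     # character from its code number

print(by_name == by_number)     # True: both routes give the same character
print(xml_char_ref(by_name))    # &#x2206;
```

Note that unicodedata.lookup raises KeyError for a name that Unicode does not define, which is a programmatic form of the caveat that the name you have in mind might not be the Unicode name.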

4.5.5.2. Using Unibook

Unibook is software for offline browsing of information about characters, using a graphical user interface, in a Windows environment. It can be downloaded for free from http://www.unicode.org/unibook/, and it has detailed instructions for installation and use. It has no technical support, though. Figure 4-4 is a screenshot of Unibook in use: the user has searched for "lira sign" (using Ctrl-F to invoke a Find dialog), and the character is highlighted in its position in a code chart. Clicking on the character causes information to be displayed in a pop-up window, as shown in Figure 4-5.

4.5.5.3. Using the Unicode standard

Assuming that you know the code number of a character, at least as a tentative answer to the question "Which character should I use?", you can consult the following to see what the Unicode standard says about it:

Its description in the code charts.
Its properties as defined in the database. Note that this means several different properties, defined in different files of the database.
Any additional explanations you might find in the standard, at various places. There is no systematic way to locate such information, but at least you should look at the applicable part in Chapters 6 through 15. They often contain information that is similar to the general descriptions preceding the code chart (in Chapter 16), just placed elsewhere.

Figure 4-4. Using Unibook, the Unicode character browser

Let us take a simple example: suppose we need all the information on the character U+2206. Since it falls into the range U+2200..U+22FF, we find it in the Mathematical Operators block. This suggests that it is a mathematical symbol in some sense. The formal confirmation for this is that the UnicodeData.txt file in the character database contains the following entry for it:

2206;INCREMENT;Sm;0;ON;;;;;N;;;;;

The file consists of lines, each of which gives information about a character, with information fields separated by semicolons. The fields are summarized in Table 4-5 and described at http://www.unicode.org/Public/UNIDATA/UCD.html in more detail. (See also Chapter 5 for a description of the general format of Unicode database fields.) Thus, the example tells us that character U+2206:

Has the name INCREMENT.
Belongs to general category Sm, which is an "informative" (as opposed to "normative") category. The abbreviation stands for "Symbol, math." Chapter 5 explains what categories mean in general; note that the categories are referred to when defining various properties, such as line breaking properties (UAX #14).
Belongs to canonical combining class 0, which roughly means just "base character"; see section 4.3 of the standard.
Belongs to bidirectional category ON, "Other Neutrals."
Has the Bidi mirrored property value of N, which means "not mirrored."
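The same properties can be checked programmatically. The following is a minimal sketch that assumes Python's standard unicodedata module; it reports the values from the Unicode version bundled with Python, which for this character agree with the entry quoted above.

```python
import unicodedata

ch = "\u2206"
print(unicodedata.name(ch))           # INCREMENT
print(unicodedata.category(ch))       # Sm
print(unicodedata.combining(ch))      # 0
print(unicodedata.bidirectional(ch))  # ON
print(unicodedata.mirrored(ch))       # 0, that is, not mirrored
```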

Figure 4-5. Viewing character information in Unibook

Table 4-5. Fields in the UnicodeData.txt file
#	Field name	Default	Meaning of the field
0			Code number of the character (hex.)
1	Name		Unicode name of the character
2	General Category	Cn	Overall classification for the character
3	Canonical Combining Class	0	Used in the Canonical Ordering Algorithm
4	Bidi Class	L,AL,R	Defines the bidirectional behavior
5	Decomposition Mapping	=	Canonical or compatibility decomposition
6	Numeric Value	(none)	Value, if the character is a decimal digit
7	Numeric Value	(none)	Value, if the character is a digit (decimal or other)
8	Numeric Value	(none)	Value, if the character is numeric in any way
9	Bidi Mirrored	N	Y (yes), if mirrored in bidirectional text
10	Unicode 1 Name	(none)	Old name, defined in Unicode Version 1.0
11	ISO Comment	(none)	Comment in the ISO 10646 standard
12	Simple Uppercase Mapping	=	Uppercase version as single character
13	Simple Lowercase Mapping	=	Lowercase version as single character
14	Simple Titlecase Mapping	=	Titlecase version as single character
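The field layout in Table 4-5 maps directly onto a split at semicolons. The following Python sketch parses the entry for U+2206 quoted earlier into named fields; the function name parse_record and the dictionary keys are hypothetical labels chosen for this illustration, not names defined by the Unicode database.

```python
# Hypothetical keys for fields 0 through 14, in the order of Table 4-5.
FIELD_KEYS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "numeric_6", "numeric_7", "numeric_8", "bidi_mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]

def parse_record(line):
    """Split one semicolon-separated record into a dict keyed by FIELD_KEYS."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_KEYS):
        raise ValueError(f"expected {len(FIELD_KEYS)} fields, got {len(fields)}")
    return dict(zip(FIELD_KEYS, fields))

record = parse_record("2206;INCREMENT;Sm;0;ON;;;;;N;;;;;")
print(record["name"], record["general_category"], record["bidi_mirrored"])
# prints: INCREMENT Sm N
```

Empty fields come out as empty strings; interpreting them according to the Default column of the table is left to the caller.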

Figure 4-6. Description of character U+2206 in a code chart

The UnicodeData.txt file is a handy reference to some properties of characters. Using a suitable text editor, you can find information on characters quickly, if you download a copy of the file from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
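Instead of a text editor, a few lines of code can do the searching. The following is a sketch in Python, with a hypothetical function name search_names. The first sample line is the entry quoted earlier; the second is an assumption about what the entry for U+20A4 looks like and should be checked against the actual file. The local file name in the closing comment is likewise an assumption.

```python
def search_names(lines, needle):
    """Yield (code, name) for records whose name field contains needle."""
    needle = needle.upper()  # names in the file are in uppercase
    for line in lines:
        fields = line.split(";")
        if len(fields) > 1 and needle in fields[1]:
            yield fields[0], fields[1]

sample = [
    "2206;INCREMENT;Sm;0;ON;;;;;N;;;;;",
    "20A4;LIRA SIGN;Sc;0;ET;;;;;N;;;;;",  # assumed entry, for illustration
]
print(list(search_names(sample, "lira")))  # [('20A4', 'LIRA SIGN')]

# With a downloaded copy of the file (the local path is an assumption):
# with open("UnicodeData.txt", encoding="utf-8") as f:
#     for code, name in search_names(f, "lira"):
#         print(code, name)
```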

We find additional information on our sample character in the code chart for the Mathematical Operators block, as shown in the extract in Figure 4-6.

The description characterizes some uses of the character by listing "Laplace operator" and "forward difference" as synonyms for it (in some usage). Obviously, the primary name suggests the use as an increment symbol in some sense. Note that this is by no means an exhaustive list of uses for the character, nor is it obligatory to use this character for those purposes even when it is available in the repertoire. The actual usage is a decision made by mathematicians.

The description also clarifies that this is not the same character as Greek letter capital delta or a white up-pointing triangle (in the Geometric Shapes block). Note that an arrow means in principle just "cross reference," but quite often its specific purpose is to make it explicit that two characters are not equal, although they may have identical or similar glyphs.
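That the three characters are distinct can be illustrated in code. The following sketch assumes Python's standard unicodedata module. Since the decomposition field in the entry for U+2206 is empty, normalization leaves the character unchanged, so it is never turned into the Greek letter.

```python
import unicodedata

increment, delta, triangle = "\u2206", "\u0394", "\u25B3"

# Three similar-looking characters with three different names.
for ch in (increment, delta, triangle):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# No decomposition is defined for INCREMENT, so even compatibility
# normalization (NFKC) does not map it to GREEK CAPITAL LETTER DELTA.
print(unicodedata.decomposition(increment) == "")          # True
print(unicodedata.normalize("NFKC", increment) == delta)   # False
```

In practice this means that a search for one of the characters does not find the others unless the software applies its own, application-specific equivalences.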

Then let us check what the corresponding general description in Chapter 14 says. The relevant part in the standard, section 14.4, contains a clarifying note. It says that the INCREMENT character is one of the mathematical operators derived from Greek characters that "have been given separate encodings because they are used differently from the corresponding letters." It adds: "These operators may occasionally occur in context with Greek-letter variables." (In contrast, Unicode 3.0 said that these characters "have been given separate encodings to match usage in existing standards.") In practice, there are borderline cases: when a character with the shape of a capital delta occurs in printed form only, or in an encoding that lacks a code corresponding to U+2206, it can be difficult to say whether it should be interpreted as the Greek letter (U+0394) or as U+2206. For example, what about the delta amplitude function or the symbol for the area of a triangle?

There are also dozens of properties defined for characters beyond those in the UnicodeData.txt file, as explained in Chapter 5. Although not all properties are practically relevant for all characters, many of them form part of the meaning of a character in a broad sense. They affect behavior like line breaking and writing direction. For U+2206, there are really no surprises in the properties. For example, in line breaking, it behaves the same way as letters, which should be suitable. On the other hand, for the en dash "–" U+2013, for example, the Unicode line breaking rules allow a break after it (e.g., an expression like "5–8" could be broken as "5–" on one line and "8" on the next). In borderline cases at least, such things might matter in the choice of character.

Thus, the identification of a symbol as a particular Unicode character is not really an exact science. There are matters of interpretation, and there is no comprehensive index to all information on a character in the standard.

4.5.6. Additional Reference Material

The Unicode standard and its annexes are large and partly rather complex, and they are not always suitable for quick checks or efficient searches. You may therefore wish to consult other references as well, even though they are not authoritative. To a large extent, other references have been constructed automatically from material issued by the Unicode Consortium, but this does not guarantee that they are error-free; programs have bugs.

Some practical references were listed in Chapter 1. The following online material is more technical or more specialized:

Die Unicode-Datenbank by Jürgen Auer: This site (http://www.sql-und-xml.de/unicode-database/) lists Unicode 4.1 characters by block but also by the 30 categories, by additional properties, and by bidirectional value. It also mentions for each character the version of Unicode in which it was introduced, helping to estimate how well it is supported. The explanations are in German, but you can mostly use it without knowing German.
Unicode Charts by Mark Davis: This site (http://www.macchiato.com/unicode/charts.html) lists only Unicode 3.0 characters, but it has some useful features. It lets you search for the code of a character by typing or pasting a character into a text box and hitting Enter. You can also view the GIF images of characters instead of the rendering of your browser using some font on your computer.
Unicode Character Properties Excel Workbook: This site (http://scripts.sil.org/ExcelUnicodeData) presents the contents of several Unicode database files as a single file that you can open in MS Excel or in Excel Viewer.
DecodeUnicode, a wiki activity: This site (http://www.decodeunicode.org) combines data extracted from the Unicode site with additional data contributed by different people using the wiki approach: anyone can write and edit anything. Therefore, the site contains a mixture of descriptive information. The user interface is not intuitive, and much of the material is in German only.