1.5. Definitions of Character Repertoires

The implementation of Unicode support is a long and mostly gradual process. Unicode can be supported by programs on any operating system, although some systems allow much easier implementation than others; this mainly depends on whether the system uses Unicode internally, so that support for Unicode is built in.

Even in circumstances where Unicode is supported in principle, the support usually does not cover all Unicode characters. For example, an available font may cover only the part of Unicode that is practically important in some area. When text data produced in one program is to be processed in another, we should be prepared for difficulties with any unusual characters. For data transfer, it is essential to know which Unicode characters the recipient is able to handle.

Thus, although Unicode contains a huge number of characters, not all of them can be used safely. Among the 100,000 or so characters, usually only a small subset can be used in a particular application and context without a serious risk of distorting information.
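
As a rough illustration of checking text before data transfer, the following Python sketch lists the characters that a recipient could not represent. It assumes that the recipient's capabilities can be described by a named character encoding, and the function name unsupported_characters is hypothetical. The check covers only the character code level; it says nothing about whether a font can render the characters.

```python
def unsupported_characters(text, encoding):
    """Return the distinct characters of text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each problematic character only once, in order of appearance.
            if ch not in missing:
                missing.append(ch)
    return missing


# Sample text with a euro sign, an i with diaeresis, and an en dash.
sample = "Price: 10 \u20ac, na\u00efve \u2013 ok"
print(unsupported_characters(sample, "ascii"))
print(unsupported_characters(sample, "iso-8859-1"))
```

With this sample, all three special characters are reported for ASCII, whereas only the euro sign and the en dash are reported for ISO 8859-1.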

1.5.1. Formally Defined Repertoires

Each character code, by itself, defines a character repertoire: the collection of characters that can be represented in the code. In addition to this, subsets of such collections can be defined.

A character repertoire is any collection of characters, without implying any particular implementation even at the level of code numbers. However, in practice, the simplest way to define a character repertoire is to use Unicode as the basis and simply list the code numbers. Such a definition specifies a closed collection, which does not change if the Unicode standard is enhanced. In contrast, by listing a set of Unicode blocks you define an open collection, which is fixed at any given moment of time but will automatically expand if new characters are added to any of those blocks in a revision of the Unicode standard.
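
The difference between a closed and an open collection can be sketched in Python as follows. The sample code numbers, the choice of blocks, and the names CLOSED, OPEN_BLOCKS, in_closed, and in_open are assumptions made for illustration only. The open collection is evaluated against the Unicode version bundled with the Python build in use, so its membership can grow when that version changes.

```python
import unicodedata

# Closed collection: an explicit list of code numbers (a hypothetical sample).
CLOSED = frozenset([0x0041, 0x0042, 0x0043, 0x00E9, 0x03B1])

# Open collection: a list of blocks, given as inclusive code number ranges.
# Here: Basic Latin (U+0000..U+007F) and Greek and Coptic (U+0370..U+03FF).
OPEN_BLOCKS = [(0x0000, 0x007F), (0x0370, 0x03FF)]


def in_closed(ch):
    """Membership never changes, whatever the Unicode version."""
    return ord(ch) in CLOSED


def in_open(ch):
    """Membership depends on which code numbers in the blocks are assigned."""
    cp = ord(ch)
    in_block = any(lo <= cp <= hi for lo, hi in OPEN_BLOCKS)
    # General category "Cn" means unassigned in this Unicode version.
    return in_block and unicodedata.category(ch) != "Cn"


print("Unicode version used:", unicodedata.unidata_version)
print(in_closed("A"), in_open("\u03b1"))
```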

For example, there are three Multilingual European Subsets (MES-1, MES-2, MES-3), defined in a CEN Workshop Agreement, CWA 13873. Among them, MES-2 is the most important. It is a closed collection, covering Latin, Greek, and Cyrillic scripts. The CWA is available at http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf or via http://www.cenorm.be/cenorm/businessdomains/businessdomains/isss/cwa/.

1.5.2. Practical Repertoires

In addition to international standards, there are company policies that define various subsets of the character repertoire. A practically important one, especially with regard to support in widely used fonts, is Microsoft's "Windows Glyph List 4" (WGL4), also known as the "PanEuropean" character set, listed on the page "Using special characters from Windows Glyph List 4 (WGL4) in HTML" at http://www.alanwood.net/demos/wgl4.html. Contrary to what you might expect, the characters in it have not all been included in MES-2.

In data-processing contexts, a character can be considered "safe" if it is certain or very probable that it will be correctly transmitted and presented to the recipient. In a broader sense, being safe entails more: the sender should be sure of the character intended, and the recipient should understand it correctly. Mostly, however, we consider the technical problems: difficulties in presenting the character in a digital form, in sending it over network connections and possibly to a different program and operating environment, and in rendering it visually. Nowadays, it's usually the last phase that poses the most problems.

From a practical point of view, we can distinguish the following repertoires of characters. Each repertoire listed here contains all the previous repertoires. The list can be useful when you design an application, or instructions on writing things, or a computer language. When selecting which repertoire you use or support, it is advisable to proceed slowly in the list and consider whether the usefulness of extra characters outweighs the risks. The names used for the repertoires here are practical descriptions, not official names. They make liberal use of encoding names, which will be described in more detail in Chapter 3.

ASCII name characters: English letters A–Z and a–z and digits 0–9: These are the safest characters and often the only characters you can use in names or identifiers in a computer language. Often a few extra characters like underline _, hyphen - and full stop "." are allowed, too. Be careful with any extra characters when selecting a name for a file, a username, or a data item name. The naming rules you have learned in some context may not apply in others. For example, Unicode names for characters use just letters A–Z without case distinction, digits 0–9, space, and hyphen (hyphen-minus, to be exact).
The invariant subset of ASCII: the above, plus characters ! " % & ' ( ) * + , - . / : ; < = > ? and the space character: This can be described as the rock-bottom repertoire of characters in data processing. However, in various transfers and transformations, even these characters may get changed somehow. A common example is the ampersand &, which often needs to be written in some special way (e.g., as &amp; in HTML and XML).
The full ASCII repertoire: the above, plus characters # $ @ [ \ ] ^ _ ` { | } and ~: This repertoire, called Basic Latin in Unicode, usually works well across programs, computer platforms, and network connections. The characters listed here work mostly just as well as the other ASCII characters, but some standards allow national variation that may make them unsafe. Moreover, producing some of these characters can be a nontrivial task on a non-U.S. keyboard.
The ISO Latin 1 repertoire consists of the above plus 96 additional characters, such as à, é, Ô, £, §, µ, ©, and ¥: This repertoire is also called ISO 8859-1, and it will be described in more detail in Chapter 3 and Chapter 8. It is sufficient for writing most Western European languages, except for some typographic issues. It is widely available in the Western world, but not necessarily elsewhere. Some characters in it still cause problems for some Mac users.
The Windows Latin 1 repertoire, which adds the dashes "–" and "—" as well as English (curly) quotation marks and apostrophe, and a few other characters: This repertoire is generally available on Windows systems and on most other systems as well. The extra characters usually need to be produced using special key combinations or other tools such as word processor functions. Due to character code differences between systems, the extra characters are generally not safe in email, for example.
The WGL4 repertoire: Although the repertoire has been defined by a private company and not in any standard, the characters in it are standard and rather widely available in environments other than Windows, too. The repertoire has a total of 652 characters. In addition to the characters mentioned above, it contains additional Latin letters, the basic characters used in modern Greek, a repertoire of Cyrillic letters sufficient for several languages, a mixed collection of mathematical and other symbols, and some line drawing characters.
The Unicode 2.0 repertoire: There is quite a jump from the WGL4 repertoire to the Unicode 2.0 repertoire, but there are few intermediate general-purpose repertoires. Since Unicode is an evolving standard, there are considerable differences between its versions. For example, a font that purportedly supports "full Unicode" might actually support just Unicode 2.0. Newer versions are much more extensive. At the time Unicode 4.1 was published (March 2005), no widely used font supported essentially more than Unicode 2.0 (published in July 1996).
The full Unicode repertoire(s): Unicode as currently defined is very large, but anything beyond Unicode 2.0 (except for the euro sign €, defined in 2.1) is rather unsafe. Experimental use, as well as use for well-defined limited applications, can be possible and interesting. When designing such use, select and document clearly the Unicode version you need. In the future, things can be expected to change, as font support for (at least) Unicode 4.1 will be shipped with important operating systems.
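
The first five repertoires in the list above can be tested mechanically. The following Python sketch reports the smallest of them that contains all characters of a string. The names LEVELS, level_of, and smallest_repertoire are hypothetical, and the sketch rests on two assumptions: Python's cp1252 codec is taken as the definition of Windows Latin 1, and control characters are treated as lying outside all the listed repertoires. WGL4 and the Unicode versions are not distinguished; anything beyond Windows Latin 1 is lumped together.

```python
import string

# Character sets taken from the list above.
NAME_CHARS = set(string.ascii_letters + string.digits)
INVARIANT_EXTRA = set("!\"%&'()*+,-./:;<=>? ")
ASCII_EXTRA = set("#$@[\\]^_`{|}~")

LEVELS = [
    "ASCII name characters",
    "invariant subset of ASCII",
    "full ASCII",
    "ISO Latin 1",
    "Windows Latin 1",
    "beyond Windows Latin 1",
]


def level_of(ch):
    """Return the index in LEVELS of the smallest repertoire containing ch."""
    if ch in NAME_CHARS:
        return 0
    if ch in INVARIANT_EXTRA:
        return 1
    if ch in ASCII_EXTRA:
        return 2
    if 0xA0 <= ord(ch) <= 0xFF:
        return 3  # the 96 additional characters of ISO 8859-1
    try:
        ch.encode("cp1252")
    except UnicodeEncodeError:
        return 5
    # Assumption: the codec also accepts control characters (code numbers
    # up to U+00FF at this point); they are treated as outside the list.
    return 4 if ord(ch) > 0xFF else 5


def smallest_repertoire(text):
    """Name the smallest listed repertoire that covers every character."""
    return LEVELS[max((level_of(ch) for ch in text), default=0)]


print(smallest_repertoire("file_name"))
print(smallest_repertoire("caf\u00e9"))
print(smallest_repertoire("pages 1\u20139"))
```

A check like this can help when deciding how far down the list an application needs to go, since it shows which repertoire existing data actually requires.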

To illustrate the repertoire of characters that is reasonably "safe" in many situations, Table 1-2 shows all WGL4 characters. This is just an overview. Many of the characters cannot be identified by their shape only. The classification of the characters used in the table is a practical one, rather than formal.

Table 1-2. WGL4 characters
Classification	Characters
Basic Latin letters	ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Variants of Latin letters	ª º ⁿ
Ligatures
Added Latin letters	ÆæŒœØøßſ
Latin letters with diacritics	ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõ öùúûüÿĔģĬ Ŏ ǺǻǼǽǾǿẀẁẂẃẄẅỲỳ
Greek characters	ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδε ζηθικλμνξοπρστυφχψω·΄΅
Cyrillic letters	Ґґ
Digits	0123456789¹²³
Fractions	½ ¼ ¾ ⅛ ⅜ ⅝ ⅞
Punctuation	, : ; . ! ¡ ? ¿ " « » ‹ › " " " ' _' ‛ ' ' … ' ‗ • ' " ‼ ‾ ⁄
Space characters	space (U+0020), no-break space (U+00A0)
Parentheses	( ) [ ] { }
Multiple-use characters	# % & * - / \ @ ^ _ ` | ~ § ¯ ¶ · °
Spacing modifier letters	´ ¨ ¸ ˆ ˉ
Currency symbols	$ ¢ £ ¤ ¥ ₣ ₤ ₧ €
Letterlike symbols	© ® ™ µ ℮
Arrows	↨
Mathematical operators	+ − ± × ÷ < = > ¬ ∆ ∕ ∙
Miscellaneous technical	⌂ ⌐ ⌠ ⌡
Box drawing
Block elements	▌ ▐ ▲ ► ▼ ◄
Geometric shapes	■ ▫ ▬ ● ◘ ◙ ◦
Miscellaneous symbols	☺ ☻ ☼ ♪ ♫
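
Since many characters in Table 1-2 cannot be identified by their shape only, it can help to look up their code numbers and Unicode names. The following Python sketch does this with the standard unicodedata module; the function name describe is hypothetical. It compares two pairs of similar-looking characters from the table: the micro sign and the Greek small letter mu, and the increment sign and the Greek capital letter delta.

```python
import unicodedata


def describe(ch):
    """Return the code number and Unicode name of a character."""
    # Characters without a name (such as control characters) get a placeholder.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"


# Two pairs of look-alikes: micro sign / mu, increment / capital delta.
for ch in "\u00b5\u03bc\u2206\u0394":
    print(describe(ch))
```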
