2.7. Special Techniques

General techniques that let you type any Unicode character are often impractical when you need to write a large number of characters of some particular kind. More specialized techniques are often more convenient. Moreover, some characters cannot be written just by selecting a character from a map, since they need to be represented as combinations of two (or more) Unicode characters.

2.7.1. Combining Diacritic Marks

Unicode has a special concept of combining diacritic marks, which will be described in detail in Chapter 8. Here, we discuss its relevance to typing characters.

A combining diacritic mark is a Unicode character that is not meant to be shown as such but only in conjunction with another character, a base character. For example, a combining acute accent, U+0301, has really no independent appearance, but when combined with the Latin small letter "u" U+0075 as a base character, it produces ú. By definition, the two-character sequence U+0075 U+0301 is canonically equivalent to Latin small letter "u" with acute accent U+00FA. The latter is an example of a precomposed character, which means that a base character and some diacritic mark(s) have been combined and the combination is defined as a separate Unicode character.

There is not much point in typing ú in a manner based on that equivalence, since there are ways that are more practical. It is possible, though, in programs that have sufficiently good Unicode support, and it can be useful as an exercise. Try this in MS Word, for example:

Press the "u" key.
Type 301 Alt-X. You should now see the "u" change to ú.
You can now type Alt-X to check what you have got; it should show u301, indicating that you really have "u" followed by a combining acute accent. If you had typed ú as a single character, Alt-X would give you 00FA, which is the code of that character.

Canonical equivalence does not mean identity. The character ú (U+00FA) is still distinct from the character sequence U+0075 U+0301, for example, in string matching, unless measures have been taken to deal with the equivalence. Moreover, even the rendering may differ. If you look carefully, you may notice that the accent in ú (U+00FA) is different from the accent in ú (U+0075 U+0301). This is because the former has probably been specifically designed by the typographer who created the font, while the latter is often the result of "mechanical" composition by a program.
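The difference between equivalence and identity can be checked programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `canonically_equal` is made up for this illustration, and normalizing both strings to NFC is just one of several possible "measures" for dealing with the equivalence.

```python
import unicodedata

precomposed = "\u00FA"   # ú as a single precomposed character
combined = "u\u0301"     # "u" followed by a combining acute accent

# A plain comparison treats the two strings as different.
print(precomposed == combined)
print(len(precomposed), len(combined))


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# After normalization, the canonical equivalence becomes visible.
print(canonically_equal(precomposed, combined))
```

A plain `==` compares code points one by one, so it reports the strings as different; only the normalized comparison treats them as the same text.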

This probably sounds confusing, but it has practical applications. Although you don't want to use this method to type ú, for example, what would you do if requested to produce the Cyrillic letter yu, ю, with an acute accent on it? Such a character does not exist in Unicode as precomposed, i.e., in a code position of its own. It exists in Unicode only in the sense that it can be expressed as ю followed by a combining acute accent.

To produce ю́, you would type ю and then use one of the ways discussed to add U+0301, for example, 301 Alt-X. The visual appearance of the combined character might not be ideal, but there is little you can do about it. Many programs use rather simplistic methods to create characters with diacritic marks.
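The same sequence can be built in a program instead of typed. This sketch, again using Python's `unicodedata` module, constructs Cyrillic small letter yu (U+044E) followed by U+0301 and shows that normalization cannot collapse it into one character, since no precomposed character exists. The variable names are illustrative only.

```python
import unicodedata

# Cyrillic small letter yu followed by a combining acute accent.
yu_acute = "\u044E\u0301"

# List the code points and their Unicode names.
for ch in yu_acute:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC composes characters where a precomposed form exists.
# Here none exists, so the sequence stays two code points long.
composed = unicodedata.normalize("NFC", yu_acute)
print(len(composed))
```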

There are many potential combinations of characters with diacritic marks, and only a small percentage of them have been included in Unicode as characters. The rest are mostly very rare characters, such as special symbols used in mathematics. Some human languages use such combinations, though.

For example, the letter ī̀ (that is, "i" with both a macron, a horizontal line above, and a grave accent) does not exist in Unicode as such. It can be expressed in several ways: "i" followed by a combining macron and a combining grave in some order, or as ī followed by a combining grave, or as ì followed by a combining macron. This multitude causes some problems, and there are techniques in Unicode to reduce the variation by so-called normalization. If you need to produce the character only on paper or screen, you can try the different methods (using a large font to see the differences) and use the combination that produces the typographically best result. This often depends heavily on the font.
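As a rough illustration of what normalization does with such spellings, the sketch below normalizes two of them with Python's `unicodedata` module. The helper `code_points` and the dictionary labels are made up for this example.

```python
import unicodedata

MACRON = "\u0304"   # combining macron
GRAVE = "\u0300"    # combining grave accent

spellings = {
    "i + macron + grave": "i" + MACRON + GRAVE,
    "i-macron + grave": "\u012B" + GRAVE,
}


def code_points(s: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for label, text in spellings.items():
    nfc = unicodedata.normalize("NFC", text)   # composed form
    nfd = unicodedata.normalize("NFD", text)   # decomposed form
    print(f"{label}: NFC = {code_points(nfc)}, NFD = {code_points(nfd)}")

# The same two marks in the opposite order, for comparison.
swapped = unicodedata.normalize("NFC", "i" + GRAVE + MACRON)
print("i + grave + macron: NFC =", code_points(swapped))
```

Both spellings in the dictionary normalize to the same sequence. The two marks belong to the same combining class, however, so normalization does not reorder them: the swapped order normalizes to ì followed by a combining macron and remains a distinct sequence.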

In Unicode, a combining diacritic mark always appears after the character that it relates to. This is different from the use of dead keys for typing letters with diacritic marks: you press the dead key before the letter key.

2.7.2. Spacing Between Characters

Spacing between characters is mostly a typographic issue and, as such, is outside the scope of this book. We will, however, consider some Unicode approaches to spacing, emphasizing their limited usefulness as compared with other tools. Basically, you use tools like commands in a layout or publishing program to control character spacing.

In Unicode, there are some fixed-width space characters, which will be discussed in Chapter 8. Contrary to normal spaces, which are usually flexible (can be expanded or shrunk in formatting), fixed-width spaces have a more or less fixed width.

Figure 2-19. Different methods for adding spacing around a dash

Consider the typographic problem with an expression like 4–6 (four, en dash, six, meaning "from four to six"). In most fonts, the en dash will (almost) touch both digits, creating a somewhat unpleasant appearance. We could write "4 – 6" using spaces around the en dash, but this would violate orthographic rules, and it would also create too much spacing, as a rule. There are different approaches to the problem:

Use a font where the problem does not appear. This might mean using a different font for the en dash than for the text around it. Naturally, this is a tricky way, and it does not work if you cannot really control fonts.
Use the tools of a typesetting or other program to adjust character spacing. Even in MS Word, you can do that. Select the characters "4–", and then choose Format → Font → Character Spacing, and set Spacing to Expanded by, say, 1pt or 2pt. The setting affects the spacing of the selected characters.
Insert suitable fixed-width spaces, such as thin spaces (U+2009), before and after the en dash. You could also try a hair space (U+200A), but it is probably too narrow (perhaps just one pixel wide).

The last approach is the only one that operates at the character level only, so it belongs to our topic. However, it is usually not the best way. It gives rather coarse control, at least if the typesetting program does not let you modify the widths of the fixed-width spaces. Moreover, it works for some fonts only. (If you enter fixed-width spaces and the current font does not contain them, your program might insert the space in some other font, often causing odd effects.) On the positive side, it expresses the spacing request at the character level and can thus be used even in plain text.
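Because the last approach works on plain text, it is easy to automate. The following sketch inserts a thin space on both sides of every en dash that stands between two digits. The function name `pad_en_dashes` and the digits-only rule are assumptions made for this illustration, not a recommendation from any style guide.

```python
import re

THIN_SPACE = "\u2009"
HAIR_SPACE = "\u200A"
EN_DASH = "\u2013"


def pad_en_dashes(text: str, space: str = THIN_SPACE) -> str:
    """Put `space` before and after each en dash that sits between two digits."""
    # Lookbehind and lookahead keep the digits themselves out of the match.
    pattern = rf"(?<=\d){EN_DASH}(?=\d)"
    return re.sub(pattern, f"{space}{EN_DASH}{space}", text)


sample = "Open 4\u20136 daily"
padded = pad_en_dashes(sample)

# Show the code points so that the invisible spaces can be inspected.
print(" ".join(f"U+{ord(c):04X}" for c in padded))
```

Passing `HAIR_SPACE` as the second argument tries the narrower alternative. Whether either space renders well still depends on the font, as noted above.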

The three approaches are illustrated in Figure 2-19. The basic font there is Arial Unicode MS, which contains the thin space character.

2.7.3. Inputting East Asian Characters

You may wonder how people type Chinese/Japanese/Korean (CJK) characters on a computer, given the fact that there are thousands of such characters. Using a general character map is rather impractical, since it is very difficult to find CJK characters there.

Some techniques are based on the phonetic values of characters: using Latin letters, you type a string that corresponds to the pronunciation, and a program shows you a menu of alternative characters to select from. Other techniques work on the graphic elements of characters, such as the number of strokes or the radical (root symbol). A program might even recognize characters as drawn using a mouse.
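The phonetic approach can be pictured as a dictionary lookup followed by a menu selection. The sketch below is a toy model only: the table `CANDIDATES` and the functions `candidates_for` and `choose` are hypothetical, and a real input method relies on a large dictionary, usually with frequency ranking and context.

```python
# Hypothetical, tiny table: Latin-letter input -> candidate characters.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "shan": ["山", "善", "闪"],
}


def candidates_for(typed: str) -> list[str]:
    """Return the menu of characters offered for a typed pronunciation."""
    return CANDIDATES.get(typed.lower(), [])


def choose(typed: str, index: int) -> str:
    """Pick one character from the menu by its position (0-based)."""
    options = candidates_for(typed)
    if not 0 <= index < len(options):
        raise IndexError(f"no candidate {index} for {typed!r}")
    return options[index]


# Show the menu as a user would see it, numbered from 1.
for number, char in enumerate(candidates_for("ma"), start=1):
    print(number, char)

print(choose("ma", 2))
```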

There are several Input Method Editors (IMEs) available from different sources. These utilities combine many alternative methods of CJK character input, as illustrated in the document http://www.microsoft.com/globaldev/handson/user/IME_Paper.mspx.

If you use Microsoft products, you can download and install support for one or more of the East Asian writing systems: Chinese Traditional, Chinese Simplified, Japanese, and Korean. Along with the support, you get an IME. Since the choice and installation depend heavily on the particular system (including the version of Windows) and on whether MS Office is used or not, we just refer to information available via http://www.alanwood.net/unicode/utilities_editors.html. Be aware that because of the number of CJK characters, the packages are rather large.