International Best Practices | Developing International Software

Rich Edit 3 and earlier are documented in the Windows Platform SDK. Starting with Windows XP Service Release (SR) 1, Rich Edit 4.1 will be documented in the Windows Platform SDK. Accordingly, the following offers a glimpse at only a few of the most pertinent practices.

Consider Backward Compatibility with Non-Unicode Text

If all text were Unicode and if all display facilities supported Unicode, the Rich Edit code base would be simpler. But a Unicode text engine has to be able to import and export text in other standards, which are defined by their code pages. If you have non-Unicode plaintext, which code page should you use to convert to and from Unicode? On localized systems, the system code page is the most likely answer. However, you can enter multilingual text using keyboards for a variety of languages, and that text is encoded either in Unicode or in multiple code pages. What code page would you use to search for such text? There's no perfect answer, but the best choice seems to be the code page of the keyboard you are working with at that particular moment.
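To see why the choice of code page matters, consider the following minimal Python sketch. It is an illustration only, not Rich Edit code: the helper names to_unicode and from_unicode are hypothetical, and Python's built-in cp1252 (Western European) and cp1251 (Cyrillic) codecs stand in for whichever code page the system or the active keyboard would supply.

```python
# Illustrative sketch: the same byte decodes to different characters
# depending on which code page is assumed. Helper names are hypothetical.

def to_unicode(raw: bytes, code_page: str) -> str:
    """Convert non-Unicode plaintext to Unicode using the given code page."""
    return raw.decode(code_page)


def from_unicode(text: str, code_page: str) -> bytes:
    """Convert Unicode text back to a code page.

    Characters the code page cannot represent are replaced with "?".
    """
    return text.encode(code_page, errors="replace")


raw = b"\xe0"
western = to_unicode(raw, "cp1252")   # U+00E0, Latin small letter a with grave
cyrillic = to_unicode(raw, "cp1251")  # U+0430, Cyrillic small letter a
print(hex(ord(western)), hex(ord(cyrillic)))
```

Converting in the other direction is lossy when the target code page lacks a character, which is one reason a text engine has to keep track of which code page applies to which run of text.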

Even Unicode has a number of coding schemes, such as UTF-16 big-endian, UTF-16 little-endian, UTF-8, and UTF-32. (For more information, see Chapter 3.) If text begins with a Unicode byte-order mark (BOM), use the conversion that the BOM indicates. UTF-8 can also be recognized without its BOM with fairly high accuracy, but the algorithm is fairly complicated, whereas the algorithm for handling a leading UTF-8 BOM is easy. If text begins with a rich-text header such as {\rtf, <HTML> or <!doctype HTML.>, you can route it to the appropriate conversion routine. Rich Edit employs a combination of these schemes to read non-Unicode plaintext.
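The detection order described above might be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the algorithm Rich Edit uses: the function name sniff and the cp1252 fallback are hypothetical, and a strict UTF-8 decode attempt stands in for the more elaborate BOM-less UTF-8 recognition mentioned in the text.

```python
import codecs

# UTF-32 BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _classify(text: str) -> str:
    """Route by leading header: 'rtf', 'html', or 'plain'."""
    head = text.lstrip().lower()
    if head.startswith("{\\rtf"):
        return "rtf"
    if head.startswith("<html") or head.startswith("<!doctype html"):
        return "html"
    return "plain"


def sniff(raw: bytes, fallback_code_page: str = "cp1252"):
    """Return (encoding, kind, text) for a buffer of unknown encoding."""
    # 1. A leading BOM decides the encoding outright.
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(encoding)
            return encoding, _classify(text), text
    # 2. No BOM: accept the buffer as UTF-8 only if it decodes strictly
    #    (a stand-in for a real heuristic; pure ASCII also passes).
    try:
        text = raw.decode("utf-8")
        return "utf-8", _classify(text), text
    except UnicodeDecodeError:
        pass
    # 3. Otherwise fall back to an assumed code page.
    text = raw.decode(fallback_code_page, errors="replace")
    return fallback_code_page, _classify(text), text
```

For example, sniff(b"{\\rtf1 hello}") reports the kind as "rtf", while sniff(b"caf\xe9") fails the strict UTF-8 check and falls back to the assumed code page.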

Take Font Sizing into Account

In dialog boxes, 8-point Latin characters are commonly used. But 8-point Chinese characters are hard to read, so it's better to use 9-point Chinese characters in combination with 8-point Latin characters. Latin characters have bigger descenders than Chinese characters, since the latter only need room for an underline. Combining 8-point Latin characters with 9-point Chinese characters and keeping the same baseline increases line height beyond 9 points, to about 10 points, since an 8-point Latin descender is bigger than a 9-point Chinese descender. This can shift the text too high in a dialog box originally designed to handle only one language. Thai characters pose particular problems, since ordinarily they are displayed as 14-point characters. When mixed with Western characters, Thai characters need correspondingly larger Western fonts. (For more information on font sizes and character height, see Chapter 5.)
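The line-height arithmetic can be illustrated with a short Python sketch. When runs share a baseline, the line must accommodate the tallest ascent and the deepest descent of any run on it. The ascent and descent values below are hypothetical round numbers chosen only to reproduce the "about 10 points" figure; real fonts report their own metrics, and the names Run and line_height are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Vertical metrics of one text run, in points (hypothetical values)."""
    ascent: float   # height above the baseline
    descent: float  # depth below the baseline


def line_height(runs):
    """Height of a line whose runs all sit on a common baseline."""
    return max(r.ascent for r in runs) + max(r.descent for r in runs)


latin_8pt = Run(ascent=6.0, descent=2.0)    # assumed: larger descender
chinese_9pt = Run(ascent=8.0, descent=1.0)  # assumed: room for underline only

print(line_height([latin_8pt]))               # Latin alone
print(line_height([chinese_9pt]))             # Chinese alone
print(line_height([latin_8pt, chinese_9pt]))  # mixed line
```

With these assumed metrics, the mixed line is taller than either font alone because the ascent comes from the Chinese run and the descent from the Latin run.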

Know How to Handle Multicode Sequences

Glossary

Combining-mark sequence: An alphabetic base character followed by one or more combining-mark characters such as acute and grave accents.
Diaeresis: Two dots placed over a vowel to indicate that the vowel is pronounced as a separate syllable (as in the word "naïve"). Typically used when two vowels are adjacent, but should be pronounced separately rather than as a diphthong.
Caret: The blinking line indicating the space into which you insert text.

Unicode surrogate pairs, carriage-return line-feeds (CRLFs), and nonspacing combining-mark sequences are multicode characters (more than one 16-bit quantity) that require special treatment, both for display and for cursor navigation. For basic display purposes, combining-mark sequences can be rendered approximately by standard system calls and fonts that support the combining marks. More elegant display requires an OpenType layout engine, which has tables specifying where to attach specific combining marks. For example, a diaeresis should be placed higher on an "I" than on an "i."
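As a rough illustration of the three multicode cases named above, the following Python sketch groups a stream of 16-bit code units into indivisible sequences. It is a simplified model, not Rich Edit's implementation: the function names are hypothetical, combining marks are identified by the Unicode general categories Mn and Me, and only marks in the Basic Multilingual Plane are considered.

```python
import unicodedata

CR, LF = 0x000D, 0x000A


def utf16_units(text: str):
    """Return the UTF-16 code units (16-bit integers) of a string."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def is_combining_mark(unit: int) -> bool:
    """True for nonspacing/enclosing marks in the BMP (surrogates are 'Cs')."""
    return unicodedata.category(chr(unit)) in ("Mn", "Me")


def clusters(units):
    """Split code units into sequences that must be treated as one unit."""
    out = []
    i, n = 0, len(units)
    while i < n:
        start = i
        if units[i] == CR and i + 1 < n and units[i + 1] == LF:
            i += 2  # CRLF is a single line break
        else:
            if is_high_surrogate(units[i]) and i + 1 < n and is_low_surrogate(units[i + 1]):
                i += 2  # surrogate pair is a single character
            else:
                i += 1
            while i < n and is_combining_mark(units[i]):
                i += 1  # attach trailing combining marks to the base
        out.append(units[start:i])
    return out


# "e" + combining acute, CRLF, U+1D11E (outside the BMP), "x"
sample = utf16_units("e\u0301\r\n\U0001D11Ex")
print([len(c) for c in clusters(sample)])
```

Each inner list is a span that display and navigation code would treat as one unit in this simple model.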

In terms of cursor navigation, simple caret movement across combining-mark sequences should never leave the caret inside such a sequence. So if you press the Left Arrow key whenever the caret immediately follows such a sequence, you should find the caret immediately in front of the sequence. The Backspace key should delete one combining mark at a time. Mouse-cursor hit testing should leave the selection at the beginning or at the end of a combining-mark sequence, never within such a sequence (at least with a simple model). A more elegant model might allow selection and editing of the individual combining marks, but this sort of capability is relatively hard to implement. Rich Edit does let you use the Backspace key to delete one combining mark at a time, as you work backward from the end of the character sequence.
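The simple navigation model described above might look like the following Python sketch. It is an illustration under stated assumptions rather than Rich Edit's code: the function names are hypothetical, positions index Python code points (so surrogate pairs and CRLF are not handled here), and combining marks are identified by the Unicode general categories Mn and Me.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for nonspacing or enclosing combining marks."""
    return unicodedata.category(ch) in ("Mn", "Me")


def caret_left(text: str, pos: int) -> int:
    """Left Arrow: move to the start of the preceding sequence."""
    if pos <= 0:
        return 0
    pos -= 1
    while pos > 0 and is_mark(text[pos]):
        pos -= 1  # skip back over marks to reach the base character
    return pos


def backspace(text: str, pos: int):
    """Backspace: delete only the one code point before the caret."""
    if pos <= 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1


def snap_to_boundary(text: str, pos: int) -> int:
    """Hit testing: move a position inside a sequence to its nearer edge."""
    start = pos
    while 0 < start < len(text) and is_mark(text[start]):
        start -= 1
    end = pos
    while end < len(text) and is_mark(text[end]):
        end += 1
    return start if pos - start <= end - pos else end


# "a" + diaeresis + acute accent, followed by "b"
text = "a\u0308\u0301b"
print(caret_left(text, 3))        # caret jumps in front of the whole sequence
print(backspace(text, 3))         # only the last combining mark is removed
print(snap_to_boundary(text, 2))  # a hit inside the sequence snaps to an edge
```

Note the asymmetry this model produces: Left Arrow crosses the whole sequence in one step, while Backspace peels off one mark at a time from the end.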