International Features | Developing International Software

The fact that Rich Edit is so widely available, combined with its numerous international features, makes it a convenient and efficient technology for developing world-ready applications. Some of the international features associated with Rich Edit, discussed in the sections that follow, include font binding, Unicode surrogate pairs, interfaces for using rich edit controls, and support for IMEs.

Starting with version 2, Rich Edit has been based on Unicode. This gives considerable international capability for Western European and East Asian languages. (For more information on Unicode, see Chapter 3, "Unicode.") Rich Edit 3 added support for BiDi languages, Devanagari, Tamil, Thai, and Vietnamese. Rich Edit 4 added support for most of the remaining Unicode 3.0 scripts as well as for surrogate pairs and TSF. Because Rich Edit has many clients, a major principle has been to provide a single, worldwide binary that runs on all the operating systems. This is a need of international applications in general. (For more information on the advantages of creating a single, worldwide binary, see Chapter 1, "Understanding Internationalization," and Chapter 2, "Designing a World-Ready Program.")

Font Binding

A little rich-text functionality is necessary for displaying Unicode Chinese, Japanese, and Korean (CJK) plaintext unambiguously in order to account for glyph variations that exist among Simplified Chinese, Traditional Chinese, Japanese, and Korean. Rich-text functionality handles font choices and language-dependent glyph variants. For example, in plaintext dialog boxes, East Asian users usually expect to see Chinese characters displayed in only one of the four languages just listed. Heuristics often are adequate to make the right choice, but when a string contains only Chinese characters belonging to two or more of these languages on a computer equipped with these languages' respective fonts, ambiguity results. The only way to resolve this uncertainty is to use language tagging.

When Rich Edit first detects any character belonging to a complex script (such as for Arabic, Thai, Hebrew, or Indic languages), it dynamically binds to the advanced-typography line-layout component along with Uniscribe. Uniscribe contains a number of shaping engines including one for Korean jamos. This particular engine can be used to display all Korean characters, whether modern or ancient. (For more information on Uniscribe, see Chapter 5, "Text Input, Output, and Display.") In addition, a Uniscribe shaping engine for non-spacing combining marks is very useful for mathematical text. This type of engine accesses OpenType font tables to attach combining marks at the appropriate positions on the base character.

There can be disparity between typed text and set text. When a user types in text using a keyboard charset, the edit engine knows the charset and therefore can insert accurate Unicode text, including refinements such as which CJK glyph variant to use. If the client gets the plaintext from the control (as Unicode or non-Unicode), and then sets the text again, the original rich-text clues provided by the keyboard are gone. In such contexts, it would be handy to have writing-system tags like Latin 1, Greek, Russian, Arabic, Japanese, and so on. Language tags (for example, the lower 16 bits of a Microsoft Win32 locale ID, or LCID) are usually more specific and would also work, but are superfluous unless proofing tools are to be supported.
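As a rough illustration of the language tag mentioned above, the following sketch splits an LCID into its bit fields. The function names `split_lcid` and `language_tag` and the small lookup table are hypothetical helpers introduced here for illustration; only the bit layout (language ID in the lower 16 bits, primary language in its low 10 bits, sublanguage in its high 6 bits) follows the Win32 convention.

```python
def split_lcid(lcid: int) -> dict:
    """Split a Win32 locale ID into its bit fields."""
    lang_id = lcid & 0xFFFF            # lower 16 bits: the language identifier
    return {
        "lang_id": lang_id,
        "primary_lang": lang_id & 0x3FF,  # low 10 bits of the language ID
        "sub_lang": lang_id >> 10,        # high 6 bits of the language ID
        "sort_id": (lcid >> 16) & 0xF,    # bits 16-19: the sort identifier
    }


# Illustrative subset only: primary language IDs for the CJK languages.
PRIMARY_LANG_NAMES = {0x04: "Chinese", 0x11: "Japanese", 0x12: "Korean"}


def language_tag(lcid: int) -> str:
    """Map an LCID to a coarse language name (hypothetical helper)."""
    return PRIMARY_LANG_NAMES.get(split_lcid(lcid)["primary_lang"], "unknown")


if __name__ == "__main__":
    for lcid in (0x0411, 0x0412, 0x0804, 0x0404):
        print(hex(lcid), language_tag(lcid), split_lcid(lcid)["sub_lang"])
```

Note that the primary language alone does not separate Simplified from Traditional Chinese; the sublanguage field is needed for that distinction.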

Much of Unicode consists of characters that belong to a set of writing systems such as Latin 1, Greek, Russian, Arabic, Japanese, and so on. Rich Edit effectively associates a font bundle with each position in a document. A font bundle is a set of fonts corresponding to Unicode characters of a particular writing system. A plaintext document has a single font bundle, which can be computed on the fly from other information. As characters are inserted, each is assigned to a script in a context-dependent way. For Chinese character assignments, surrounding characters are checked for kana and hangul in an attempt to use Japanese or Korean fonts instead of Chinese. Context allows you to ascertain the script to which neutral characters (such as blanks) and digits belong. In addition, the keyboard language, especially IMEs, can provide strong binding clues. The inserted characters are formatted with fonts assigned to a particular writing system unless the current font supports the required writing system. Whether a font can support a given writing system can usually be efficiently determined by getting the font signature. (For more information, see the documentation on the GetTextCharsetInfo() function at http://msdn.microsoft.com.) Rich Edit 4 supports approximately 50 writing systems from Unicode 3.1.
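The context check for Chinese characters described above can be sketched as follows. This is a simplified heuristic under stated assumptions, not Rich Edit's actual algorithm: the function names, the search window of 16 characters, the limited set of Unicode ranges, and the default of "Chinese" are all choices made for this illustration.

```python
def char_class(ch: str) -> str:
    """Classify one character by a few Unicode block ranges."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:                 # Hiragana and Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables, jamo
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:                 # CJK Unified Ideographs (basic block)
        return "han"
    return "other"


def script_for_han(text: str, index: int, window: int = 16,
                   default: str = "Chinese") -> str:
    """Pick a script for the Han character at text[index] from its neighbors.

    Scans outward from the character; the nearest kana selects Japanese,
    the nearest hangul selects Korean, and otherwise the default is used.
    """
    for dist in range(1, window + 1):
        for j in (index - dist, index + dist):
            if 0 <= j < len(text):
                cls = char_class(text[j])
                if cls == "kana":
                    return "Japanese"
                if cls == "hangul":
                    return "Korean"
    return default


if __name__ == "__main__":
    print(script_for_han("日本語のテキスト", 0))
    print(script_for_han("한국 漢字", 3))
    print(script_for_han("中文文本", 1))
```

A string containing only Han characters falls through to the default, which is exactly the ambiguous case the text says requires language tagging or a keyboard clue.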

Unicode Surrogate Pairs

Rich Edit uses a pair of UTF-16 Unicode surrogates to represent a single supplementary-plane character. This approach is commonly used in Microsoft software and gives a smaller instance size than using 32-bit characters (UTF-32), but it causes complications beyond the measurement and display of characters. (For more information on surrogate pairs, see Chapter 3.) Arrow-key handlers and other methods that change the caret character position should not end up between the lead and trail surrogates. Input methods need to map to the surrogate pair. Case changes, line-breaking rules, sorting, file formats, and backing-store manipulations, in general, have to recognize and deal with surrogate pairs.
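To make the caret constraint concrete, here is a minimal sketch of an arrow-key step over a UTF-16 backing store. The helper names `utf16_units` and `move_caret` are hypothetical, and the backing store is modeled as a plain list of 16-bit code units rather than anything resembling Rich Edit's internal structures.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_lead(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def move_caret(units: list, pos: int, delta: int) -> int:
    """Move the caret one step left (delta=-1) or right (delta=+1).

    The caret position is an index between code units. If the step would
    land between a lead and a trail surrogate, take one more step in the
    same direction so the pair is treated as a single character.
    """
    new = max(0, min(len(units), pos + delta))
    if 0 < new < len(units) and is_lead(units[new - 1]) and is_trail(units[new]):
        new += delta
    return new


if __name__ == "__main__":
    units = utf16_units("a\U00010400b")   # 'a', one surrogate pair, 'b'
    print([hex(u) for u in units])
    print(move_caret(units, 1, +1), move_caret(units, 3, -1))
```

The same check would apply to any other operation that sets a character position, such as mouse hit-testing or selection extension.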

Luckily, the choice of the surrogate code ranges makes surrogate pairs easy to work with relative to multibyte encoding systems, since you can easily tell if a code value is a lead surrogate, trail surrogate, or neither. Microsoft surrogate-pair support uses fonts with a new 21-bit cmap that allows mixing characters from Unicode's 17 planes. This support is consistent with TrueType's current 16-bit glyph indices, so the TrueType rasterizer doesn't need to be revised; the trade-off is that the total number of glyphs in a single font is limited to 65,536. To handle more glyphs, multiple fonts need to be used. This is a fairly straightforward process because of the font binding methods that Rich Edit uses. As Unicode adds support for new scripts, new fonts will need to be created to represent these scripts' glyphs. These additional fonts can include new tables, including a 21-bit cmap. Fortunately, Rich Edit clients can typically ignore these problems with surrogate pairs, since Rich Edit and its companion DLLs provide automatic resolution.
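The range test and the mapping between a surrogate pair and its supplementary-plane code point follow directly from the UTF-16 definition. The sketch below shows both; the function names are illustrative and not part of any Rich Edit API.

```python
def classify(unit: int) -> str:
    """Classify a 16-bit code value as 'lead', 'trail', or 'neither'."""
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"
    return "neither"


def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (lead, trail)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000             # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(lead: int, trail: int) -> int:
    """Combine a (lead, trail) pair back into a single code point."""
    if classify(lead) != "lead" or classify(trail) != "trail":
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


if __name__ == "__main__":
    lead, trail = to_surrogates(0x1D11E)      # MUSICAL SYMBOL G CLEF
    print(hex(lead), hex(trail), hex(from_surrogates(lead, trail)))
```

Because the lead, trail, and non-surrogate ranges do not overlap, a single code value can be classified without examining its neighbors, which is the property that multibyte encodings generally lack.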

Interfaces

There are four main ways to use a control within Rich Edit 2.x, Rich Edit 3, or Rich Edit 4.x:

Messages
File read and write (plaintext or RTF)
Text Object Model methods
ITextServices methods

These four options are described in the following sections.

Messages

Rich Edit handles system messages, including keyboard messages such as WM_KEYDOWN and WM_CHAR, mouse messages such as WM_MOUSEMOVE and WM_LBUTTONDOWN, and clipboard messages such as WM_COPY, WM_CUT, and WM_PASTE. Rich Edit also supports most of the system edit messages (as defined in Winuser.h) except for EM_GETHANDLE, EM_SETHANDLE, EM_FMTLINES, and WM_GETFONT.

Rich Edit has many of its own messages to yield access to features. For example, EM_INSERTTABLE (defined for Rich Edit 4) allows you to insert nested tables with cells that can contain multiple paragraphs. Rich Edit messages are defined in Richedit.h and have been included in the Microsoft Windows Platform SDK, available from http://msdn.microsoft.com.

File Read and Write

Rich Edit reads and writes text in a variety of plaintext and rich-text formats. Plaintext characters are assigned to scripts and bound to appropriate fonts. Rich-text files typically have appropriate fonts, but Unicode characters that aren't assigned to a particular script are also subject to font binding when Rich Edit reads text that contains such characters.

Text Object Model Methods

TOM comprises a set of six Component Object Model (COM) dual interfaces. A COM dual interface obeys rules that let it be used by a variety of clients, ranging from Microsoft Visual Basic, Microsoft C, Microsoft C++, and Java programs to simple Automation containers (formerly known as "OLE Automation containers"). Clients can run TOM from Visual Basic by creating Visual Basic scripts and from Java by creating Java scripts. This is flexible and relatively easy to do. When efficiency really counts, clients can access TOM directly from C and C++.

The top-level TOM object is defined by the ITextDocument interface, which has methods for creating and retrieving objects lower in the object hierarchy. For simple plaintext processing, you can obtain an ITextRange object from an ITextDocument object and perform most editing tasks with that. If you need to manipulate rich-text formatting, you can obtain ITextFont and ITextPara objects from an ITextRange object. ITextFont provides the programming equivalent of the Font dialog box (on the Format menu) in Word, and ITextPara provides the equivalent of the Paragraph dialog box (on the Format menu) in Word. In addition to these four lower-level objects, TOM has a selection object (ITextSelection), which is just an ITextRange object with selection highlighting and some additional methods oriented toward the user interface (UI). The range and selection objects include screen-oriented methods that enable programs to examine text on the screen or to view text that can be scrolled onto the screen. These capabilities help make text accessible to the blind, for example.

ITextServices Methods

Windowless rich edit controls provide rich-text editing support for windowless objects. Instead of creating a windowed rich edit control, you create a text services object that uses ITextServices/ITextHost interfaces to provide access to the rich edit functionality. The control then works somewhere inside the host window, and many controls can belong to the same host window. This can greatly reduce the instance size of a set of controls as well as centralize the administration of the controls.

Support for Input Method Editors

Glossary

Active Input Method Manager (IMM): An ActiveX control that provides limited IMM service on non-Asian language versions of Windows 95, Windows 98, Windows Me, and Windows NT 4 platforms. It is replaced by the more general Text Services Framework in Windows XP. Active IMM is also known as "Global IME."

Rich Edit has supported IMEs for years. (For more information on IMEs, see Chapter 5.) Rich Edit 4.x integrates the latest Microsoft advances in this area by giving native support to the new Text Services Framework. TSF, which is shipped with Office XP and Windows XP, enables an application to extend speech, handwriting, and East Asian input support across all localized versions of Win32 platforms. (For more information on TSF, see Chapter 23, "Microsoft Windows Text Services Framework [TSF].") Rich Edit also has a Unicode input method based on the Alt+X hot key (described later in this section) and standard built-in (overridable) hot keys for Cut, Paste, Copy, Undo, Redo, and so on.

Internally, Rich Edit's IME capability has been factored out into an independent module that communicates with the Rich Edit engine via the TOM interfaces. It supports both Level 2 and Level 3 IMEs. Level 2 IMEs allow users to enter a candidate in a window or box before sending the final text to applications. Many of the Chinese IMEs are Level 2 IMEs. Level 3 IMEs provide applications with composition characters while the user is typing. Rich Edit's input services also support the older Active Input Method Manager (IMM). IME features include the following:

Reconversion. This feature enables the user to convert the final string back to composition mode, allowing easy selection of a different candidate string. In the past, the user needed to delete the final string first and then type in a new string to get to the correct candidate.
Document feed. This feature provides Microsoft IME 2000 and IME 2002 with the text for the desired paragraph, which allows both IMEs to have more accurate conversion during typing.
Mouse operation. This feature allows the user to have better control over the candidate and UI windows during typing.
Caret position. This feature provides the current caret and line information, which IME 2000 and IME 2002 use to position UI windows (such as a candidate list).

A handy hexadecimal-to-Unicode entry method works with WordPad 2000, Office edit boxes, rich edit controls in general, and Microsoft Word 2002. Basically, you type a character's hexadecimal code in ASCII and then press Alt+X. The hexadecimal code is replaced by the corresponding Unicode character. The Alt+X can be a toggle (as in Microsoft Office XP). That is, press Alt+X once to convert the hexadecimal code to a character, and press Alt+X again to convert the character back to a hexadecimal code. If the hexadecimal code is preceded by one or more hexadecimal digits, you need to select the code so that the preceding hexadecimal characters aren't included. The code can extend up to the value 0x10FFFF, which is the highest code point in the 17 planes of Unicode.
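The Alt+X behavior described above can be approximated with a short sketch. This is a simplified model under several assumptions, not the actual Rich Edit implementation: the function name `alt_x`, the limit of six hexadecimal digits scanned before the caret, the rejection of surrogate code values, and the four-digit minimum width when converting back are all choices made for this illustration. Text is modeled as a Python string, so positions count code points rather than UTF-16 code units.

```python
import string

HEX_DIGITS = set(string.hexdigits)   # ASCII 0-9, a-f, A-F
MAX_CODE_POINT = 0x10FFFF


def alt_x(text: str, sel_start: int, sel_end: int = None) -> tuple:
    """Toggle between a hex code and its character; return (text, caret).

    With a collapsed selection, up to six hex digits before the caret are
    taken as the code. With a non-empty selection, only the selected text
    is taken, which keeps preceding hex digits out of the code.
    """
    if sel_end is None:
        sel_end = sel_start
    collapsed = sel_start == sel_end
    start = sel_start
    if collapsed:
        # Scan backward over hex digits immediately before the caret.
        while start > 0 and sel_end - start < 6 and text[start - 1] in HEX_DIGITS:
            start -= 1
    code = text[start:sel_end]
    if code and all(c in HEX_DIGITS for c in code):
        value = int(code, 16)
        valid = 0 < value <= MAX_CODE_POINT and not 0xD800 <= value <= 0xDFFF
        if not valid:
            return text, sel_end          # leave the text unchanged
        return text[:start] + chr(value) + text[sel_end:], start + 1
    if collapsed and sel_end > 0:
        # No hex code before the caret: convert the character back to hex.
        hex_code = format(ord(text[sel_end - 1]), "04X")
        return (text[:sel_end - 1] + hex_code + text[sel_end:],
                sel_end - 1 + len(hex_code))
    return text, sel_end


if __name__ == "__main__":
    converted, caret = alt_x("3A9", 3)
    print(converted, caret)
    print(alt_x(converted, caret))
    print(alt_x("AB3A9", 2, 5))
```

One limitation of this sketch is that a character that is itself a hexadecimal digit (for example, "A") is always read as a code rather than converted back, so the toggle is only symmetric for characters outside the ASCII hex digits.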