Input Languages | Developing International Software

Glossary

Input method: Any method used to enter text. These methods include different keyboard layouts and IMEs, as well as newer input services such as voice-recognition engines or handwriting-recognition engines.

Before diving into the technical aspects of supporting input languages, this chapter first shows how a typical user interacts with them. Because Windows 2000 and Windows XP allow the user to enter multiple languages using a variety of input methods, the system needs to know which input method should be active for a particular language. These associations are called "installed language and method pairs," or "input languages" (called "input locales" in Windows 2000). During installation, the default input language for the language version of the operating system, along with English, is installed for each user. Each user can then define the list of input languages available for his or her own account. For example, on the same machine, one user can have an English keyboard layout and a Japanese IME installed, and another user can have both French and Arabic keyboard layouts installed. This customization is done by adding or removing input languages from the Regional And Language Options property sheet, after which the user can switch among them on the fly, provided that language support for the target language has already been installed. (See Figure 5-1.)

Figure 5-1 - Each user can add and remove input languages from the Languages tab of the Regional And Language Options property sheet.

The default input language is the input language that is active when a new application thread is started. Switching to a different input language is done on a per-thread basis, so two applications can have two different input languages active at the same time. The taskbar indicates which input language is currently active. For example, in Figure 5-2, English is the input language that is currently active. When the user clicks the language indicator in the taskbar (each language is represented by its two-letter abbreviation), Windows 2000 and Windows XP present a list of alternatives such as Japanese, French (Canada), and so on. (For an extensive list of locales and their associated valid input languages for Windows XP, see Appendix P, "List of Keyboards and IMEs Supported by Microsoft Windows 2000 and Microsoft Windows XP.")

Figure 5-2 - List of available input languages, with English being the one that is currently active for this particular user.

The shortcut keys iterate through the list of installed language and method pairs in the order in which they were added via the Regional And Language Options property sheet. If the user has selected Left Alt+Shift in the Advanced Key Settings dialog box, Left Alt+Shift will allow the user to toggle between different installed input languages. (See Figure 5-3.)

Figure 5-3 - Switching between various input languages in Windows XP.

Having gained an understanding of how the user can customize a list of input languages and switch from one input language to another, you'll now see the most efficient ways to work with input languages from a developer's standpoint. Taking advantage of system support will go a long way toward making your job easier.

Techniques for Handling Input Languages

The Microsoft Developer Network (MSDN) documentation (found at http://msdn.microsoft.com) and programming APIs represent input languages with a variable type called "input locale identifier," formerly known in older documentation as "Handle to the Keyboard Layout" (HKL), a name still used as the type identifier. HKL is an archaic name from a time when the only input was from a keyboard. The input locale identifier is a 32-bit value composed of a language identifier (low WORD) and a device identifier (high WORD), and its name is that value written in hexadecimal. (See Figure 5-4.) For example, U.S. English has a language identifier of 0x0409, so the primary U.S. English layout is named "00000409." Variants of the U.S. English layout (such as the Dvorak layout) are named "00010409," "00020409," and so on. The device identifier is not limited to keyboards and IMEs; data can now be entered by more sophisticated mechanisms such as voice- and text-recognition engines. For instance, Microsoft Windows Text Services Framework (TSF), a system service available on Windows XP, enables advanced, source-independent text input. (For more information on TSF, see Chapter 23, "Microsoft Windows Text Services Framework [TSF].")

Figure 5-4 - The HKL variable, which represents input languages.
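
The layout described above can be sketched in portable C. This is only an illustration: it treats the identifier as a plain 32-bit integer, as the text describes, whereas on Windows an HKL is a handle type and the LOWORD, HIWORD, PRIMARYLANGID, and SUBLANGID macros would normally be used instead. The function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Low WORD of an input locale identifier: the language identifier. */
static uint16_t hkl_language_id(uint32_t hkl)
{
    return (uint16_t)(hkl & 0xFFFFu);
}

/* High WORD: the device (layout or input method) identifier. */
static uint16_t hkl_device_id(uint32_t hkl)
{
    return (uint16_t)(hkl >> 16);
}

/* A language identifier packs a 10-bit primary language
   and a 6-bit sublanguage. */
static uint16_t primary_lang(uint16_t langid)
{
    return (uint16_t)(langid & 0x3FFu);
}

static uint16_t sub_lang(uint16_t langid)
{
    return (uint16_t)(langid >> 10);
}

/* Writes the 8-digit hexadecimal name (for example "00010409").
   The buffer must hold 8 digits plus the terminating NUL. */
static void hkl_name(uint32_t hkl, char out[9])
{
    assert(out != NULL);
    snprintf(out, 9, "%08X", (unsigned)hkl);
}
```

For the value 0x00010409, the language identifier is 0x0409 (primary language 0x09, sublanguage 0x01) and the device identifier is 0x0001.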

The easiest way to handle input languages is to use the standard controls that the operating system provides whenever you are expecting user input. For example, by using Unicode edit controls or rich edit controls, you enable your application to handle multilingual text input. The operating system automatically handles input languages in a way that is transparent to your application. For instance, the GlobalDev application is a property sheet with a tab called "Text APIs." Text APIs uses a standard multiline edit control, which eliminates the hassle of dealing with input languages. (You can find the GlobalDev application and the Text APIs page in the Samples subdirectory on the companion CD.)

Advanced applications (such as a text editor) that need full control over how input languages are handled should monitor, and be able to respond to, the user's changes. When a user selects an input language by clicking the language indicator on the taskbar or by pressing Left Alt+Shift, the input language is not changed automatically; either action generates a request that the active application must accept or reject. In response to the hot-key combination or the mouse click on the language indicator, the system sends a WM_INPUTLANGCHANGEREQUEST message to the window that has the focus, as Figure 5-5 illustrates. If the application accepts the message by passing it to DefWindowProc, the system begins switching the input language. (The process is slightly different when the input method is part of the Text Services Framework, in which case only a WM_INPUTLANGCHANGE message is sent.) When the system successfully completes the change, it sends a WM_INPUTLANGCHANGE message. The lParam variable of the WM_INPUTLANGCHANGE message contains the input locale identifier (that is, the HKL) of the new input language.

Figure 5-5 - WM_INPUTLANGCHANGEREQUEST and WM_INPUTLANGCHANGE message propagation flowchart.

An application that does not support multiple languages will reject the WM_INPUTLANGCHANGEREQUEST message. It might reject every such message, or it might perform a couple of tests first. For example, the wParam variable of this message is a Boolean value, bLangInSystemCharset, that indicates whether the requested input language can be represented in the current system locale. Representing input languages is not a worry when dealing with Unicode applications, but non-Unicode applications should monitor this value, or they will display the wrong characters.
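
The accept-or-reject decision can be sketched as a small, portable helper. This is a minimal illustration, not a complete window procedure: decide_lang_request and LangRequestAction are hypothetical names, and the sketch assumes the system-charset indicator is bit 0 of wParam (the value winuser.h names INPUTLANGCHANGE_SYSCHARSET), which also matches a plain Boolean TRUE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Defined by winuser.h on Windows; repeated here so the sketch
   also compiles on other platforms. */
#ifndef INPUTLANGCHANGE_SYSCHARSET
#define INPUTLANGCHANGE_SYSCHARSET 0x0001
#endif

typedef enum {
    LANG_REQUEST_REJECT = 0, /* return 0 without calling DefWindowProc */
    LANG_REQUEST_ACCEPT = 1  /* pass the message on to DefWindowProc   */
} LangRequestAction;

/* Hypothetical helper: decides how to answer WM_INPUTLANGCHANGEREQUEST.
   A Unicode application can accept any input language; a non-Unicode
   application accepts only languages that the system locale can represent. */
static LangRequestAction decide_lang_request(bool app_is_unicode,
                                             uintptr_t wparam)
{
    bool in_system_charset = (wparam & INPUTLANGCHANGE_SYSCHARSET) != 0;

    if (app_is_unicode)
        return LANG_REQUEST_ACCEPT;
    return in_system_charset ? LANG_REQUEST_ACCEPT : LANG_REQUEST_REJECT;
}
```

In a real window procedure, the WM_INPUTLANGCHANGEREQUEST case would call a helper like this and either return 0 (reject) or fall through to DefWindowProc (accept).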

Similar to the system generating a WM_INPUTLANGCHANGEREQUEST message in response to a user request, applications can also initiate input language changes by calling the ActivateKeyboardLayout API. This allows a user who is editing a document containing Latin and Greek text to automatically activate the Greek input method when moving the insertion point from the Latin text to the Greek text. (See Figure 5-6.) Likewise, when this user moves the insertion point back to the Latin text, the application will activate the default Latin-based input method.

Figure 5-6 - When the cursor is positioned in a Greek text stream, the active keyboard layout should switch to Greek.
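
One way to decide which input language to activate is to look at the character next to the insertion point. The following portable sketch makes several assumptions: only Greek and Latin text is distinguished, U.S. English (0x0409) stands in for "the default Latin-based input method," the helper names are hypothetical, and the layout list is modeled as an array of 32-bit values such as GetKeyboardLayoutList would return. On Windows, the matching entry would then be passed to ActivateKeyboardLayout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANGID_EN_US 0x0409u /* English (United States) */
#define LANGID_EL_GR 0x0408u /* Greek (Greece) */

/* True for the Greek and Coptic block and the Greek Extended block. */
static int is_greek(uint32_t code_point)
{
    return (code_point >= 0x0370u && code_point <= 0x03FFu) ||
           (code_point >= 0x1F00u && code_point <= 0x1FFFu);
}

/* Picks a language identifier for the character at the insertion point.
   Everything that is not Greek falls back to the Latin default. */
static uint16_t langid_for_code_point(uint32_t code_point)
{
    return (uint16_t)(is_greek(code_point) ? LANGID_EL_GR : LANGID_EN_US);
}

/* Returns the index of the first installed layout whose low WORD matches
   the requested language identifier, or -1 if none is installed. */
static int find_layout_for_language(const uint32_t *layouts, size_t count,
                                    uint16_t langid)
{
    assert(layouts != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if ((layouts[i] & 0xFFFFu) == langid)
            return (int)i;
    }
    return -1;
}
```

If no layout for the language is installed, the sketch reports -1 and the application would simply leave the current input language unchanged.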

Other Win32 APIs that handle input methods are shown in Table 5-1.

Table 5-1 Win32 APIs that handle input methods.

Keyboard-Related API Function	Description
GetKeyboardLayout	Returns the active installed language and method pair
GetKeyboardLayoutList	Returns a list of installed language and method pairs
GetKeyboardLayoutName	Returns the name of the active input method
LoadKeyboardLayout	Loads a new input method into the system
UnloadKeyboardLayout	Unloads an input method; cannot unload the system default
ActivateKeyboardLayout	Changes the active installed language and method pair

When you design functionality to allow the user to switch keyboard layouts, keep in mind that because the letters on keyboards vary from layout to layout, the keys used to generate shortcut-key combinations might also vary. For example, the French keyboard defaults to the AZERTY layout, whereas the English layout follows a QWERTY mapping. Therefore, it is suggested that you use numbers and function keys (F4, F5, and so on) instead of letters in shortcut-key combinations.

In addition to enabling your application to handle varying input languages, you will also need to enable it to support IMEs. (Keep in mind that if you use standard APIs for input, your applications will automatically handle IMEs.) By enabling IME support, you allow the user to enter ideographs, for example, from various East Asian writing systems. The following sections explore what an IME does, with practical examples and technical solutions that show the best ways to support IMEs.

Input Method Editors

Glossary

Conversion or composition window: The window of an IME that displays text typed by the user, either just the way it is entered or after it is converted to ideographic form.
Status window: The window of an IME in which the user can change the IME's conversion mode or input mode.
Candidate window: The window of an IME that lists characters the user can choose to replace the text highlighted in the composition window.
Input Method Manager (IMM): The module on Windows 2000 and Windows XP that handles communication between IMEs and applications.
Dead key: A key that does not produce a character by itself, such as the accent key on the international keyboard. However, when the user types in a character after pressing the accent key, an accented character appears.

IMEs are components that allow the user to enter the thousands of different characters used in East Asian languages using a standard 101-key keyboard. The user composes each character in one of several ways: by radical, by phonetic representation, or by typing in the character's numeric code-page index. IMEs are widely available; Windows 2000 and Windows XP ship with standard IMEs that are based on the most popular input methods used in each target country, and a number of third-party vendors sell IME packages.

An IME consists of an engine that converts keystrokes into phonetic and ideographic characters, plus a dictionary of commonly used ideographic words. As the user enters keystrokes, the IME engine attempts to guess which ideographic character or characters the keystrokes should be converted into. Because many ideographs have identical pronunciation, the IME engine's first guess isn't always correct. When the suggestion is incorrect, the user can choose from a list of homophones; for more advanced IMEs, the homophone that the user selects then becomes the IME engine's first guess the next time around. This process is summarized in Figure 5-7.

Figure 5-7 - The process through which an IME engine converts keystrokes into ideographic characters.
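
The "learning" behavior described above, where the user's chosen homophone becomes the first guess next time, can be approximated with a move-to-front update of the candidate list. This is only a sketch of that one idea: real IME engines rely on dictionaries and grammar analysis, and promote_candidate is a hypothetical name. Candidates are stored as Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Moves the candidate the user selected to the front of the list so that
   it becomes the first guess next time. The relative order of the other
   candidates is preserved. */
static void promote_candidate(uint32_t *candidates, size_t count,
                              size_t chosen)
{
    assert(candidates != NULL && chosen < count);

    uint32_t pick = candidates[chosen];
    /* Shift the entries that preceded the choice one slot to the right. */
    memmove(&candidates[1], &candidates[0], chosen * sizeof candidates[0]);
    candidates[0] = pick;
}
```

For example, with three kanji that are all read "ka" (U+52A0, U+53EF, U+706B), selecting the third one reorders the list to U+706B, U+52A0, U+53EF.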

Before examining how the user can enter ideographs using IMEs and how you can add support for IMEs in your application, the next sections will give a quick overview of the linguistic differences among East Asian scripts.

East Asian Writing Systems

Chinese, Japanese, and Korean writing systems all offer some interesting complexities not found in Latin writing systems. To put things in clearer context, it will be useful for you to have an idea of what these complexities entail.

Chinese

Three forms of ideographic characters are commonly used in the world today: Traditional Chinese, Simplified Chinese, and kanji (which is used for Japanese). Traditional Chinese characters, which are thousands of years old and have kept their original shapes, generally contain more strokes than other ideographic forms and are more pictorial. These characters are typically used in Taiwan. Simplified Chinese characters, which are based on Traditional Chinese characters, were developed in mainland China to make reading and writing easier to learn. Although Traditional Chinese and Simplified Chinese share some characters, the simplified characters, of which there are fewer than 7,000, are composed of fewer strokes and in most cases are distinct from their original counterparts. This is why software products developed for the Chinese-speaking market are usually released in two editions: one for the Traditional Chinese script and one for the Simplified Chinese script.

Japanese

The ideographic characters used in Japanese are called "kanji." Japanese mixes kanji characters with characters from two syllabaries, collectively called "kana." The two forms of kana are referred to as "hiragana" and "katakana." Hiragana is a cursive script, commonly used in Japanese text to represent ending inflections for verbs and to write native Japanese words that have no kanji equivalent, such as "and," "of," and "to." Katakana is chiefly used to represent words borrowed from other languages. All kana symbols, except for single-vowel characters and the character "n," represent a consonant followed by one of five vowels. Hiragana and katakana can each represent the entire set of Japanese sounds.

Korean

The Korean written language uses two types of characters: hangul and hanja. A hangul character is a single syllabic character created by combining one or more consonant signs and a vowel sign. There are 24 basic elements (14 consonants and 10 vowels), or phonemes, used to denote these signs; these elements are called "jamos." You can create up to 51 jamos by combining two or more basic elements to form additional vowels or consonants, called "compounds." Compounds and basic elements together comprise 21 vowels (10 basic vowels and 11 compound vowels) and 30 consonants (14 basic consonants and 16 compound consonants). A hangul character (syllabic) consists of an initial consonant, a medial vowel, and sometimes a final consonant. Nineteen of the 30 consonants can be initial consonants. All 21 vowels can be medial vowels, and 27 of the 30 consonants can be final consonants. This means that 11,172 hangul character combinations are possible, though far fewer are actually used. The Korean language also adopted hanja characters from Chinese and uses them for more formal written communication and to represent personal names. Most daily communication is written in hangul.
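
The figure of 11,172 follows from 19 initial consonants, 21 medial vowels, and 28 final positions (27 final consonants plus "no final consonant"): 19 x 21 x 28 = 11,172. The sketch below goes one step beyond the text and assumes the Unicode encoding of precomposed hangul syllables, which starts at U+AC00 and orders syllables by initial, then medial, then final index. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HANGUL_BASE   0xAC00u /* first precomposed syllable in Unicode   */
#define INITIAL_COUNT 19u     /* initial consonants                      */
#define MEDIAL_COUNT  21u     /* medial vowels                           */
#define FINAL_COUNT   28u     /* 27 final consonants + "no final"        */

/* Number of possible hangul syllables. */
static uint32_t hangul_syllable_count(void)
{
    return INITIAL_COUNT * MEDIAL_COUNT * FINAL_COUNT;
}

/* Composes a syllable code point from zero-based jamo indexes.
   A final index of 0 means the syllable has no final consonant. */
static uint32_t compose_hangul(uint32_t initial, uint32_t medial,
                               uint32_t final_index)
{
    assert(initial < INITIAL_COUNT);
    assert(medial < MEDIAL_COUNT);
    assert(final_index < FINAL_COUNT);
    return HANGUL_BASE +
           (initial * MEDIAL_COUNT + medial) * FINAL_COUNT + final_index;
}
```

With these indexes, initial 18, medial 0, and final 4 yield U+D55C, the syllable "han" that begins the word "hangul."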

Ways to Enter Ideographs with an IME

With an IME you don't have to use a localized keyboard to enter ideographic characters. While East Asian keyboards can generate phonetic syllables (such as kana or hangul) directly, the user can represent phonetic syllables using Latin characters. In Japanese, Latin characters that represent kana are called "romaji." Japanese keyboards contain extra keys that allow the user to toggle between entering romaji and entering kana. If you are using a non-Japanese keyboard, you need to type in romaji to generate kana.

The best way to learn how an IME works from the user's perspective is to try using it and to take advantage of the extensive Windows Help files. As a reference, the following sections look at how the Japanese IME that ships with Windows XP works.

The Standard Japanese IME for Windows XP

The Japanese IME for Windows XP, called "Microsoft IME 2002" (see Figure 5-8), has six standard input modes, listed in Table 5-2. Additionally, IME 2002 contains an IME Pad that allows for alternative methods of input, and several other tools for handling both conversion into kanji and voice input. Although you will usually see IME 2002 the way it appears in Figure 5-8, it also has a drop-down menu that lists various input modes. (See Figure 5-9.)

Figure 5-8 - The Japanese IME Language bar.

Table 5-2 The Japanese IME input modes.

Japanese IME Input Mode	IME Toolbar Setting	Key to Convert Text Representation
Full-width hiragana		F6
Full-width katakana		F7
Full-width alphanumeric		F9
Half-width katakana		F8
Half-width alphanumeric		F10
Direct input		Not applicable

Figure 5-9 - IME 2002 on Windows XP. The input modes are listed in the drop-down menu. The last input mode, called "direct input," turns off the IME, and keystrokes are sent to the application directly without being converted into phonetic syllables.

Input of Japanese Characters

To begin entering Japanese characters in an application running on Windows XP, you need to activate the IME by selecting it from the list of input languages. When you activate the IME, the floating Language bar changes to the Japanese IME toolbar, as you saw in Figure 5-8. Table 5-3 shows what happens when you enter Japanese characters into an application running on Windows XP.

Table 5-3 Entering Japanese characters in an application running on Windows XP.

Action	Result
Type the letter "k." The IME conversion (or composition) window is represented by a dotted underline. The window might be displayed anywhere on the screen, but in most applications it is displayed next to the insertion point, as in the example on the right.
Now type the letter "a." The letter "k" is replaced with the hiragana syllable "ka." If you had typed the letter "i" instead of the letter "a," the hiragana syllable "ki" (き) would have appeared.
To convert the syllable "ka" into kanji, press the Spacebar.
Suppose you are looking for a different kanji representation of "ka." If the character you are seeking is not displayed, you can activate a list of alternatives from the IME by pressing the Spacebar a second time. You can scroll down through this list (known as the "candidate window") by pressing the Spacebar a third time, as shown in the example on the right. After you have highlighted the character you want, press Enter to place it in your document. The IME responds by sending the character to the active window. Then the dotted underline representing the IME composition window disappears.
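
The first two steps in Table 5-3 (typing "k," then "a") amount to a lookup from a romaji buffer to a kana syllable. The following is a deliberately tiny sketch of that idea and not how IME 2002 is implemented: the table covers only the vowels and the "k" row, the function name is hypothetical, and results are returned as Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the hiragana code point for a complete romaji syllable, or 0
   when the buffer is still incomplete (for example "k") or unknown. */
static uint32_t romaji_to_hiragana(const char *romaji)
{
    static const struct {
        const char *romaji;
        uint32_t    kana;
    } table[] = {
        { "a",  0x3042u }, { "i",  0x3044u }, { "u",  0x3046u },
        { "e",  0x3048u }, { "o",  0x304Au },
        { "ka", 0x304Bu }, { "ki", 0x304Du }, { "ku", 0x304Fu },
        { "ke", 0x3051u }, { "ko", 0x3053u },
    };

    assert(romaji != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (strcmp(romaji, table[i].romaji) == 0)
            return table[i].kana;
    }
    return 0;
}
```

While the buffer holds only "k," the lookup returns 0 and the composition window keeps showing the Latin letter; once it holds "ka" or "ki," the corresponding hiragana (U+304B or U+304D) replaces it.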

You can form a number of kanji characters before pressing Enter. The IME engine will attempt to convert your keystrokes into a "determined string" based on Japanese grammar rules. There are four different conversion modes that allow you some control as to where the IME gets its data to convert. (See Table 5-4.)

Table 5-4 Four different IME conversion modes in Windows XP.

IME Conversion Mode	Description
General	This mode is configured for optimum conversion accuracy when writing formal Japanese, such as business correspondence, essays, and manuals.
Bias for names	This mode is handy for times when the user has to enter a lot of personal and place names (for instance, when entering data into databases and spreadsheets, or when writing address labels).
Bias for speech	This mode is configured for optimum conversion accuracy when writing documents containing a large proportion of informal (conversational) Japanese, such as novels, plays, or e-mails to close friends. If you are using default settings, this mode also allows the user to automatically enter emoticons (for example, "smileys"), which is handy when chatting on message boards or when writing e-mail.
No conversion	This mode enters data exactly as is, without performing any conversion.

How the IME System Works

The IME module in Windows 2000 and Windows XP fits into a larger mechanism for passing user input to applications. As with other input methods, the easiest and safest way to handle input is to use standard system controls such as edit fields and rich edit controls. Unless you are writing an IME package or customizing your IME user interface (UI), all of the IME complexities are taken care of for you if you use standard input APIs.

Whether an input language uses an IME or a keyboard layout is entirely transparent to the user. The procedure is the same whether the user is switching IMEs or Western keyboard layouts: both actions are accomplished by clicking the language indicator on the taskbar or by entering a shortcut-key combination. Furthermore, it does not matter to an application which input method is used, because switching IMEs generates the same messages as switching keyboard layouts: WM_INPUTLANGCHANGEREQUEST (if the IME is not part of TSF) and WM_INPUTLANGCHANGE. Applications can activate specific IMEs by calling ActivateKeyboardLayout.

The IMM manages communication between IMEs and applications, serving as the go-between. When the user is typing with the IME, each keystroke posts a WM_IME_COMPOSITION message with the GCS_COMPSTR flag to indicate that there is an update to the composition string. The message's wParam value contains the first character of the string, and the rest can be retrieved via the ImmGetCompositionString API with the same GCS_COMPSTR flag. Then, when the user presses Enter or clicks a character to place it in a document, the IME, by default, posts a WM_IME_COMPOSITION message with the GCS_RESULTSTR flag. (You can retrieve the committed string with the same API and the GCS_RESULTSTR flag.)

If the latter WM_IME_COMPOSITION message is passed to DefWindowProc, DefWindowProc posts a WM_IME_CHAR message containing the actual character for each character in the committed string. For a non-Unicode window, if the WM_IME_CHAR message includes a double-byte character and the application passes this message to DefWindowProc, the IME converts this message into two WM_CHAR messages, each containing one byte of the double-byte character. If the application ignores either message, it falls through to the application's DefWindowProc, which in turn notifies the IMM that the message has been ignored. The IME then resends the character or string byte-by-byte via multiple WM_CHAR messages. For Unicode windows, WM_IME_CHAR and WM_CHAR are identical.
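
The conversion of one double-byte WM_IME_CHAR into two WM_CHAR messages can be pictured as splitting a 16-bit value into its bytes. The sketch below is a portable illustration with a hypothetical function name. It assumes that, for a double-byte character in a non-Unicode window, the lead byte sits in the high-order byte of the value and the trail byte in the low-order byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits the character value carried by WM_IME_CHAR (non-Unicode window)
   into the bytes that would arrive as separate WM_CHAR messages.
   Returns the number of bytes written to out: 1 for a single-byte
   character, 2 for a double-byte character. */
static size_t split_ime_char(uint16_t ime_char, uint8_t out[2])
{
    uint8_t lead  = (uint8_t)(ime_char >> 8);
    uint8_t trail = (uint8_t)(ime_char & 0xFFu);

    assert(out != NULL);
    if (lead == 0) {
        /* Single-byte character: one WM_CHAR message. */
        out[0] = trail;
        return 1;
    }
    /* Double-byte character: lead byte first, then trail byte. */
    out[0] = lead;
    out[1] = trail;
    return 2;
}
```

For example, the Shift-JIS encoding of the hiragana "a" is 0x82A0, which would be delivered as the two bytes 0x82 and 0xA0.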

The following sections discuss the three discrete levels of IME support for applications running on Windows: no support, partial support, and fully customized support. Applications can customize IME support in small ways (by repositioning windows, for example), or they can completely change the look of the IME UI.

No IME Support

IME-unaware applications basically ignore all IME-specific Windows messages. Most applications that target single-byte languages are IME-unaware.

Applications that are IME-unaware inherit the default UI of the active IME through a predefined global class, appropriately called "IME." This global class has the same characteristics as any other Windows-based common control. For each thread, Windows 2000 and Windows XP automatically create a window based on the IME Global class; all IME-unaware windows of the thread share this default IME window. When IME-unaware applications pass IME-related messages to the DefWindowProc function, DefWindowProc sends them to the default IME window.

Partial IME Support

IME-aware applications can create their own IME windows. Applications with partial IME support can use this application IME window to control certain IME behavior. For example, by calling the function ImmIsUIMessage, an application can pass messages related to the IME's UI to the application IME window, where the application can process them. The following code (with proper error handling and possibly more messages handled) would appear in the window procedure of the application's IME window:

 HIMC hIMC;
 LPVOID lpBufResult;
 COMPOSITIONFORM cf;
 DWORD dwBufLen;

 if (ImmIsUIMessage(hIMEWnd, uMsg, wParam, lParam) == TRUE)
 {
     switch (uMsg)
     {
     case WM_IME_COMPOSITION:
         if (lParam & GCS_RESULTSTR)
         {
             hIMC = ImmGetContext(hWnd);

             // With a NULL buffer and a length of 0, the call returns
             // the size of the result string in bytes. Add room for a
             // terminating character.
             dwBufLen = ImmGetCompositionString(hIMC, GCS_RESULTSTR,
                                                NULL, 0) + sizeof(TCHAR);
             lpBufResult = malloc(dwBufLen);

             if (lpBufResult != NULL &&
                 ImmGetCompositionString(hIMC, GCS_RESULTSTR,
                                         lpBufResult, dwBufLen) > 0)
             {
                 // ...
                 // process the text in lpBufResult
                 // ...
             }
             else // allocation failed or an error value was returned
             {
                 // ...
                 // handle an error
                 // ...
             }

             free(lpBufResult);
             ImmReleaseContext(hWnd, hIMC);
         }
         break;
     }
 }
 return 0;

The same window procedure could call SendMessage either to reposition the status, composition, or candidate windows, or to open or close the status window.

 SendMessage(hIMEWnd, WM_IME_CONTROL, IMC_SETCOMPOSITIONWINDOW, (LPARAM)&cf);

Other API functions that allow the application to change window positions or properties are ImmSetCandidateWindow, ImmSetCompositionFont, ImmSetCompositionString, ImmSetCompositionWindow, and ImmSetStatusWindowPos. Applications that contain partial support for IMEs can use these functions to set the style and the position of the IME UI windows, but the IME dynamic-link library (DLL) is still responsible for drawing these windows; the general appearance of the IME's UI remains unchanged.

Full IME Support

In contrast, applications that are fully IME-aware take over responsibility for painting the IME windows (the status, composition, and candidate windows) from the IME DLL. Such applications can fully customize the appearance of these windows, including determining their screen position and selecting which fonts and font styles are used to display characters in them. This is especially convenient and effective for word processing and similar programs whose primary function is text manipulation and which, therefore, benefit from smooth interaction with IMEs, creating a "natural" interface with the user. The IME DLL still determines which characters are displayed in IME composition and candidate windows, and it handles algorithms for guessing characters and looking them up in the IME dictionary. FULLIME, which is an example of a customized IME UI, can be found in the Microsoft Windows Platform SDK, available at http://msdn.microsoft.com.

Applications that are fully IME-aware trap IME-related messages in the following manner:

They call GetMessage to retrieve intermediate IME messages.
They process these messages in the application WindowProc.
They call TranslateMessage (part of the IMM) to pass the messages to the IME DLL. The IME needs to remain synchronized in the same way that keyboard drivers need to remain synchronized with dead keys.

Remember that partial IME support is taken care of for you if you use standard input calls like those to Rich Edit.

You've made sure your application can handle different input languages and methods. Another task in ensuring your application can support multilingual input, output, and display is to meet the inherent demands that complex scripts present. In the sections that follow, you will see various linguistic traits that are associated with complex scripts, and you will learn about Windows support for working with complex scripts.