Options for Displaying Text | Developing International Software

Now that you have seen some ways to handle complex scripts, the following sections dig deeper into how to display text effectively. They cover ways to display text in Win32 applications and in Web content.

Text Layout in Win32 Applications

In order to display text in a multilingual context, which also entails the output of complex scripts, there are four possible options:

Calling Win32 text APIs
Instantiating Win32 standard edit controls
Instantiating rich edit controls
Calling Uniscribe

The following sections briefly explain the advantages of each of these possibilities. It's then up to you to decide which is the best option for your application based on the application's complexity and its design features.

Win32 Text APIs

Many applications deal mostly with plaintext: text that is all in the same typeface, weight, color, and so on. Such applications have traditionally displayed text using standard Win32 display entry points (TextOut, ExtTextOut, TabbedTextOut, and DrawText) to write text to a window, and the GetTextExtent family of functions to measure line lengths. In Windows 2000 and Windows XP, the standard entry points have been extended to support display of multilingual Unicode text and complex scripts, to display vertical text, and to handle special rules regarding line breaking and word breaking. In general, this support is transparent to the application itself, so properly designed applications require no changes to support complex scripts through these interfaces.

Figure 5-16 shows how ExtTextOut can be used to lay out multilingual Unicode text including complex scripts. There is no need for you to do anything other than call ExtTextOut; it handles everything for you.

Figure 5-16 - Multilingual text output using the ExtTextOut API.

The code looks like the following:

 HDC   hDC;
 HFONT hFont;
 HFONT hOldFont;

 // Create a font object to display text using Microsoft Sans Serif.
 hFont = CreateFont(14, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE,
     DEFAULT_CHARSET, OUT_CHARACTER_PRECIS, CLIP_DEFAULT_PRECIS,
     PROOF_QUALITY, VARIABLE_PITCH | FF_SWISS,
     TEXT("Microsoft Sans Serif"));

 hDC = GetDC(hDlg);
 hOldFont = (HFONT)SelectObject(hDC, hFont);

 // Output the buffer lpszText into the selected device context.
 ExtTextOut(hDC, 10, 10, ETO_CLIPPED, NULL,
     lpszText, (UINT)_tcslen(lpszText), NULL);

 // Restore the original font before releasing the device context.
 SelectObject(hDC, hOldFont);
 ReleaseDC(hDlg, hDC);
 DeleteObject(hFont);

There are three requirements for displaying complex scripts correctly using the standard Win32 text APIs:

First, applications should save characters in a buffer and display the whole line of text at once rather than, for example, calling ExtTextOut on each character as it is typed in by the user. When characters are written out one by one, the complex-script shaping modules cannot determine the context for correct reordering and glyph shaping. (See Figure 5-17.)
Figure 5-17 - In the first row the word is being written one character at a time, which does not provide enough information to Uniscribe to do the layout and shaping. In the second row, the string is passed as a whole to the Win32 API, and the final result is properly laid out.
Second, applications should use one of the GetTextExtentXXX functions to determine line length rather than computing line lengths from cached character widths. This is because the width of a glyph used to display a character can vary by context. (See Figure 5-18.)
Figure 5-18 - The width of the shaped string might be shorter or longer than the sum of individual character widths. In the first row, the two Arabic letters "Beh" and "Alef", measured individually, are together wider than the same two characters shaped as a joined pair.
Third, bidirectional-aware applications should make sure that Arabic and Hebrew scripts are rendered using RTL alignment and reading order. You can use the GetTextAlign and SetTextAlign APIs to retrieve and set, respectively, the alignment of the text in a given device context. As for the reading order, calls to ExtTextOut and DrawText should specify the appropriate RTL reading-order flags, which are, respectively, ETO_RTLREADING and DT_RTLREADING. (See Figure 5-19.)
Figure 5-19 - Bidirectional text output. On the left side, the reading order for the sentence "123-52 equals 71." is broken. On the right side, the same sentence is being drawn using ETO_RTLREADING to allow the proper reading order.
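
The first requirement above (keep typed characters in a buffer and redraw the whole line) can be sketched in portable C as follows. This is a minimal illustration, not code from the SDK: LineBuffer, DrawLineFn, line_buffer_on_char, and record_draw are hypothetical names, and record_draw is only a stand-in for a real drawing routine. In a Win32 application the callback would pass the complete string to ExtTextOut in a single call so that the shaping modules see the full context.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define LINE_CAPACITY 256

/* Hypothetical callback type: draws one complete line of text. */
typedef void (*DrawLineFn)(const wchar_t *text, size_t length, void *context);

typedef struct {
    wchar_t text[LINE_CAPACITY];  /* characters in the order they were typed */
    size_t  length;               /* number of characters, excluding the terminator */
} LineBuffer;

void line_buffer_init(LineBuffer *line)
{
    assert(line != NULL);
    line->length = 0;
    line->text[0] = L'\0';
}

/* Appends one typed character, then asks the callback to redraw the
   WHOLE line rather than just the new character.
   Returns 1 on success, 0 if the buffer is full. */
int line_buffer_on_char(LineBuffer *line, wchar_t ch,
                        DrawLineFn draw, void *context)
{
    assert(line != NULL && draw != NULL);
    if (line->length + 1 >= LINE_CAPACITY)
        return 0;
    line->text[line->length++] = ch;
    line->text[line->length] = L'\0';
    draw(line->text, line->length, context);
    return 1;
}

/* Stand-in callback for illustration: records how many characters it
   was asked to draw. A Win32 callback would instead make one call such as
   ExtTextOut(hdc, x, y, 0, NULL, text, (UINT)length, NULL). */
void record_draw(const wchar_t *text, size_t length, void *context)
{
    (void)text;
    *(size_t *)context = length;
}
```

Because every keystroke triggers a redraw of the full buffer, the drawing routine always receives the neighbors of each character, which is what contextual shaping and reordering depend on.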

Depending on the purpose of your application, you might also find it necessary to support vertical text. Although text that is read horizontally from left to right is becoming more common in East Asian countries (text in Japanese technical and business journals, for example, is often printed horizontally), many books, magazines, and newspapers still print text vertically.

As Figure 5-20 shows, displaying text vertically does not mean that you simply rotate an entire line of text by 90 degrees. Most characters remain upright, but others, such as those identified by arrows, change orientation.

Fortunately, with Win32 you do not need to write code to rotate characters. To display text vertically on Windows 2000 and Windows XP, enumerate the available fonts as usual, and select a font whose font face name begins with the at sign (@). Then create a LOGFONT structure, setting both the escapement and the orientation to 270 degrees. (The lfEscapement and lfOrientation fields are expressed in tenths of a degree, so the value to store is 2700.) Calls to TextOut are the same as for horizontal text.
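
As a rough sketch of the LOGFONT setup described above, the helper below fills in the fields that matter for vertical text. The function name fill_vertical_logfont is hypothetical, and the sketch assumes that the "@" variant of the requested face is actually installed (an application would normally confirm this by enumerating fonts). The small stand-in structure under #else exists only so that the sketch compiles off Windows; on Windows the real LOGFONTW from windows.h is used.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in for non-Windows builds: only the LOGFONTW fields used below. */
#define LF_FACESIZE 32
typedef struct {
    long    lfHeight;
    long    lfEscapement;
    long    lfOrientation;
    wchar_t lfFaceName[LF_FACESIZE];
} LOGFONTW;
#endif

/* Fills *lf for vertical text using the "@" variant of the given face.
   Returns 1 on success, 0 if '@' plus the face name does not fit. */
int fill_vertical_logfont(LOGFONTW *lf, const wchar_t *face, long height)
{
    size_t face_len;

    assert(lf != NULL && face != NULL);
    face_len = wcslen(face);
    if (face_len + 2 > LF_FACESIZE)      /* '@' + face + terminator */
        return 0;

    memset(lf, 0, sizeof *lf);
    lf->lfHeight      = height;
    lf->lfEscapement  = 2700;            /* tenths of a degree: 270 degrees */
    lf->lfOrientation = 2700;
#ifdef _WIN32
    lf->lfCharSet     = DEFAULT_CHARSET; /* assumption: let the system pick */
#endif
    lf->lfFaceName[0] = L'@';
    wmemcpy(lf->lfFaceName + 1, face, face_len + 1);  /* copies terminator too */
    return 1;
}
```

On Windows, the filled structure would then be passed to CreateFontIndirectW and the resulting font selected into the device context before calling TextOut.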

Figure 5-20 - Text displayed vertically.

The Windows Platform SDK contains a sample application called "TATE" (short for "tategaki," meaning "vertical writing"), which demonstrates how to create fonts and display vertical text. (For more information on vertical writing, see the Windows Platform SDK documentation, available at http://msdn.microsoft.com.)

Finally, Win32 display entry points can be of enormous help for things such as line breaking and word breaking. Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. In Western languages, as stated in "Word Breaking and Line Breaking" earlier in this chapter, line breaking occurs at a hyphen, space, tab, or on word boundaries. Word breaking is generally based on white space (spaces, tabs, the end of a line, punctuation, and so on).

The line-breaking and word-breaking rules for Asian languages, however, are quite different from those for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily indicate the distinction between words by using spaces. Although the Thai language does not use spacing between words, it still requires lines to be broken on word boundaries.

For these languages, world-ready software applications cannot conveniently base line-breaking and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.

Take Japanese, for example. Japanese line breaking is based on the kinsoku rules: you can break lines between any two characters, with several exceptions. The first exception is that a line of text cannot end with any leading characters (such as opening quotation marks, opening parentheses, and currency signs) that shouldn't be separated from succeeding characters. The second exception is that a line of text cannot begin with any following characters (such as closing quotation marks, closing parentheses, and punctuation marks) that shouldn't be separated from preceding characters. The third exception is that certain overflow characters (such as punctuation characters) are allowed to extend beyond the right margin for horizontal text or below the bottom margin for vertical text.
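
To make the first two exceptions concrete, here is a minimal sketch in portable C. It is not how Windows implements kinsoku processing; the function names are hypothetical, the character tables are small illustrative subsets, and the third exception (overflow characters) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Illustrative subsets only; real kinsoku tables are much larger. */
static const wchar_t LEADING_CHARS[] = {
    0x300C,  /* left corner bracket */
    0xFF08,  /* fullwidth left parenthesis */
    0xFFE5,  /* fullwidth yen sign */
    0
};
static const wchar_t FOLLOWING_CHARS[] = {
    0x300D,  /* right corner bracket */
    0xFF09,  /* fullwidth right parenthesis */
    0x3001,  /* ideographic comma */
    0x3002,  /* ideographic full stop */
    0
};

static int in_set(const wchar_t *set, wchar_t ch)
{
    for (; *set != 0; ++set)
        if (*set == ch)
            return 1;
    return 0;
}

/* Returns 1 if a line may break between `before` and `after`. */
int kinsoku_can_break(wchar_t before, wchar_t after)
{
    if (in_set(LEADING_CHARS, before))
        return 0;   /* a line must not end with a leading character */
    if (in_set(FOLLOWING_CHARS, after))
        return 0;   /* a line must not begin with a following character */
    return 1;
}

/* Returns the largest index i <= limit such that the line text[0..i)
   can legally be followed by a new line starting at text[i].
   Returns `length` if the remaining text fits, 0 if no break is legal. */
size_t kinsoku_find_break(const wchar_t *text, size_t length, size_t limit)
{
    size_t i;

    assert(text != NULL);
    i = (limit < length) ? limit : length;
    if (i == length)
        return length;
    while (i > 0 && !kinsoku_can_break(text[i - 1], text[i]))
        --i;
    return i;
}
```

A caller would measure how many characters fit on the line, pass that count as the limit, and break at the returned index instead. When you use the Win32 text functions, this bookkeeping is done for you.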

As you can see, these rules and exceptions can become somewhat complicated. However, by using the appropriate APIs (DrawTextEx, ExtTextOut, TextOut, and so on), you don't need to worry about which rule to use. Win32 functions take care of line breaking and word breaking for you. (For more information on East Asian line and word breaking, go to http://www.microsoft.com/globaldev/dis_v1/html/S24B6_L.asp.)

Standard Edit Control

The second option to display text in a multilingual context is to instantiate the standard edit control. This control has been extended in Windows 2000 and Windows XP to support data containing multilingual text and complex scripts, and includes not only input and display, but also correct cursor movement over character clusters (in Thai and Devanagari script, for example). As with the standard Win32 API functions, a well-written application will receive this support automatically, without modification. Again, you should consider adding support for right-to-left reading order and right alignment. In this case, toggle the extended style flags of the edit control window to manage these attributes, as shown in the following code:

 // ID_EDITCONTROL is the control ID in the resource file.
 HWND hWndEdit = GetDlgItem(hDlg, ID_EDITCONTROL);
 LONG lAlign = GetWindowLong(hWndEdit, GWL_EXSTYLE);

 // To toggle alignment
 lAlign ^= WS_EX_RIGHT;

 // To toggle reading order
 lAlign ^= WS_EX_RTLREADING;

After setting the lAlign value, enable the new display by setting the extended style of the edit control window as follows:

 // This assumes your edit control is in a dialog box. If not,
 // you can get the edit control handle from another source.
 SetWindowLong(hWndEdit, GWL_EXSTYLE, lAlign);
 InvalidateRect(hWndEdit, NULL, FALSE);

One new feature of the standard edit control is a context menu (activated by pressing the right mouse button while the cursor is in the field) that allows the user to toggle the reading order and to insert or display Unicode bidirectional control characters. (See Figure 5-21.)

Figure 5-21 - The edit control's context menu allows the user to insert Unicode control characters and to toggle text reading order.

Rich Edit Control

A third option for multilingual text display is Rich Edit. Rich Edit 3 is a higher-level collection of interfaces that takes advantage of Uniscribe to further insulate text-layout clients from the complexities of certain scripts. Rich Edit provides fast, versatile editing of rich Unicode multilingual text and simple plaintext. It includes extensive message and Component Object Model (COM) interfaces, text editing, formatting, line breaking, simple table layout, vertical text layout, bidirectional-text layout, Indic and Thai support, a Word-like edit UI, and Text Object Model (TOM) interfaces. Rich Edit is the simplest way for a client to support features of complex scripts. Clients use its TextOut function to automatically parse, shape, position, and break lines. (For more information on Rich Edit, see Chapter 21, "Rich Edit.")

Uniscribe

The last of the four options, Uniscribe supports the complex rules found in scripts such as Arabic, Thai, and scripts used for Indic languages. Uniscribe also handles scripts written from right to left, such as Arabic or Hebrew, and supports the mixing of scripts. For plaintext clients, Uniscribe provides a range of ScriptString functions that are similar to TextOut, with additional support for caret placement. The remainder of the Uniscribe interfaces provide finer control to clients.

As stated in "Windows Support for Complex Scripts" earlier in this chapter, Uniscribe uses multiple shaping engines that contain the layout knowledge for particular scripts. It also takes advantage of the OpenType layout shaping engine for handling font-specific script features such as glyph generation, extent measurement, and word-breaking support.

Uniscribe subdivides strings of characters into items, runs, and clusters. The client builds runs based on its own stored formatting attributes and on the item boundaries obtained by calling the Uniscribe ScriptItemize API. The Uniscribe ScriptShape API breaks a run into clusters according to script rules and then generates glyphs. The ScriptPlace API generates advance widths and x and y offsets for those glyphs. The ScriptTextOut API then displays the glyphs using that positioning information.

Uniscribe supports line breaking at word boundaries through ScriptBreak. Hit testing and cursor positioning are supported by ScriptCPtoX and ScriptXtoCP. Character-to-glyph mapping is provided by ScriptGetCMap. Uniscribe manages bidirectional character reordering using the Unicode bidirectional algorithm, and also understands non-OpenType layout font formats for Arabic, Hebrew, and Thai shaping and positioning.

Using Uniscribe, the text-layout client only needs to manage a backing store of Unicode character codes. The client does not need to maintain any other buffer or mapping table to track character order, but rather only needs to store and manage the order in which the characters were entered by the user. This is the same logical order as defined by Unicode. The client's backing store never changes as a result of layout operations. Uniscribe maintains an index from the reordered clusters to the original character boundaries passed by the client.
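
The idea of an unchanging logical backing store plus a separate display index can be illustrated with a deliberately simplified sketch. This is not Uniscribe's data structure and it is not the Unicode bidirectional algorithm: it assumes a left-to-right paragraph, treats only characters in the Hebrew block as right-to-left, ignores clusters, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Simplifying assumption: only the Hebrew block counts as right-to-left. */
static int is_rtl_char(wchar_t ch)
{
    return ch >= 0x0590 && ch <= 0x05FF;
}

/* Fills visual_to_logical[v] with the backing-store index of the character
   displayed at visual position v (counted from the left).
   The backing store is const: layout never modifies it. */
void build_visual_index(const wchar_t *logical, size_t length,
                        size_t *visual_to_logical)
{
    size_t i = 0;

    assert(logical != NULL && visual_to_logical != NULL);
    while (i < length) {
        if (is_rtl_char(logical[i])) {
            size_t run_end = i;
            size_t k;

            /* Find the end of this right-to-left run. */
            while (run_end < length && is_rtl_char(logical[run_end]))
                ++run_end;
            /* Display the run reversed: the first visual slot shows the
               last logical character of the run, and so on. */
            for (k = i; k < run_end; ++k)
                visual_to_logical[k] = run_end - 1 - (k - i);
            i = run_end;
        } else {
            visual_to_logical[i] = i;   /* left-to-right: same position */
            ++i;
        }
    }
}
```

The client keeps only the characters in the order they were typed; everything about display order lives in the separate index, which can be rebuilt at any time without touching the stored text.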

Uniscribe is a single API for Unicode output across Microsoft's operating-system range. (For more information on coding techniques with Uniscribe, see Chapter 24, "Uniscribe," and Cssamp.exe in the Samples subdirectory on the companion CD. Other resources include the MSDN documentation and the article "Multilanguage text support in Windows 2000" at http://microsoft.com/globaldev/articles/multilang.asp.)

Again, remember that you can take advantage of Uniscribe's features through the system's standard support in edit controls without having to interface with Uniscribe directly. The only exception is if you need to use advanced text formatting.

As you can see, this chapter provides more in-depth information about displaying text within Win32 applications than it does about text in Web content and in the .NET Framework. This is because the rendering engine of Internet Explorer hides all implementation details in Web content (both traditional Web files and within the .NET Framework), thereby interfacing with Uniscribe in a transparent manner.

Text Input, Output, and Display in Web Content and in the .NET Framework

Text input, output, and display in Web content has been made a lot easier because HTML rendering in Internet Explorer is handled by the Trident module (Mshtml.dll), which is one of the Uniscribe clients. All support for different input languages and complex scripts is provided to Web-based pages automatically and transparently, as long as Unicode encoding (either UTF-8 or UTF-16) is used. For Web content within the .NET Framework, system support hides all implementation details for Microsoft Windows Forms and for other .NET applications.

Another essential aspect of a globalized application is its ability to display the correct font. Thanks to the evolution of font technology, enabling support for varying fonts has become a more manageable task.