Encodings in Console or Text-Mode Programming | Developing International Software

Programmers can use both Unicode and SBCS or DBCS encodings when programming console, or "text-mode," applications. For legacy reasons, non-Unicode console I/O functions use the console code page, which is an OEM code page by default. All other non-Unicode functions in Windows use the Windows code page. This means that strings returned by the console functions might not be processed correctly by the other functions and vice versa. For example, if FindFirstFileA returns a string that contains certain non-ASCII characters, WriteConsoleA will not display the string properly.

Keeping track of which encoding each function requires, and converting the encodings of textual parameters accordingly, can be hard. This task was simplified by the introduction of the functions SetFileApisToOEM and SetFileApisToANSI, and the helper function AreFileApisANSI. The first two affect non-Unicode functions exported by KERNEL32.dll that accept or return a file name. As the names suggest, SetFileApisToOEM sets those functions to accept or return file names in the OEM character set corresponding to the current system locale, and SetFileApisToANSI restores the default, the Windows ANSI encoding, for those names. The currently selected encoding can be queried with AreFileApisANSI.

With SetFileApisToOEM at hand, the problem with the results of FindFirstFileA (or GetCurrentDirectoryA, or any other file-handling function of the Win32 API) that cannot be passed directly to WriteConsoleA is easily solved: after SetFileApisToOEM is called, FindFirstFileA returns text encoded in the OEM character set, not in the Windows ANSI character set. This solution is not a universal remedy for all Windows ANSI versus OEM incompatibilities, however. Imagine you need to get text from a file-handling function, output it to the console, and then process it with another function that is not affected by SetFileApisToOEM. This entirely realistic scenario requires you to convert the encoding yourself. Alternatively, you can call SetFileApisToOEM to get the data for console output, then call SetFileApisToANSI and fetch the same text again, in the other encoding, for internal processing. Another case in which SetFileApisToOEM does not help is the handling of command-line parameters: when the entry point of your application is main (and not wmain), the arguments are always passed as an array of Windows ANSI strings. All this clearly complicates the life of a programmer who writes non-Unicode console applications.

To make things more complex, 8-bit code written for the console has to deal with two different types of locales. To write your code, you can use either the Win32 API or C run-time library functions. The ANSI console functions of the Win32 API assume the text is encoded in the current console code page, which by default is defined by the system locale. The SetConsoleCP and SetConsoleOutputCP functions change the code page used in these operations. A user can run the chcp or mode con cp select= commands at the command prompt; this changes the code page for the current console. Another way to set a fixed console code page is to create a console shortcut with a default code page set (available only on East Asian localized versions of the operating system). Applications should be able to respond to such user actions, for example by querying the console code page at run time rather than assuming a fixed one.

Locale-sensitive functions of C run-time library (CRT functions) handle text according to the settings defined by a (_w)setlocale call. If (_w)setlocale is not called in the code, CRT functions use the ANSI "C" language invariant locale for those operations, losing language-specific functionality.

The declarations of the functions are:

 char *setlocale( int category, const char *locale );

 wchar_t *_wsetlocale( int category, const wchar_t *locale );

The category argument defines which locale-specific settings are affected (or all of them, if LC_ALL is specified). The locale argument is either an explicit locale name or one of the following:

".OCP" refers to the OEM code page corresponding to the current user locale
".ACP" or "" refers to the Windows code page corresponding to the current user locale

".OCP" and ".ACP" parameters always refer to the settings of the user locale, not the system locale. Hence they should not be used to set LC_CTYPE. This category defines the rules for Unicode to 8-bit conversion and other text-handling procedures, and must follow the settings of the console, accessible with GetConsoleCP and GetConsoleOutputCP.

The best long-term solution for a console application is to use Unicode, since Unicode interfaces are defined for both the Win32 API and the C run-time library. The CRT programming model still requires you to set the locale explicitly, but at least you can be sure the text seen by Win32 and the CRT does not require transcoding.
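Because the CRT locale has to be set explicitly, and LC_CTYPE should follow the console rather than the user locale, one possible approach is to build a ".code_page" locale string from the console's own code page. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names codepage_locale_name and sync_ctype_with_console are hypothetical, and the Windows-only part is guarded so that the string helper also compiles on other platforms.

```c
#include <locale.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: formats a code-page number as a CRT locale
   string of the form ".<code page>", for example ".866".
   Returns 1 on success, 0 if the buffer is too small. */
int codepage_locale_name(unsigned codePage, char *buf, size_t bufSize)
{
    int n = snprintf(buf, bufSize, ".%u", codePage);
    return n > 0 && (size_t)n < bufSize;
}

#ifdef _WIN32
/* Hypothetical helper: makes LC_CTYPE follow the console output code page.
   Returns the locale string reported by setlocale, or NULL on failure. */
char *sync_ctype_with_console(void)
{
    char name[16];
    if (!codepage_locale_name(GetConsoleOutputCP(), name, sizeof name))
        return NULL;
    return setlocale(LC_CTYPE, name);
}
#endif
```

A program would call sync_ctype_with_console once at startup, and again if it changes the console code page itself. Whether a particular code page is accepted depends on the CRT version, so the return value of setlocale should be checked.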

CRT Console I/O

When a Unicode stream I/O routine (file I/O such as fwprintf, fwscanf, fgetwc, fputwc, fgetws, or fputws, or console 'stdio' functions) operates on a file that is open in text mode (the default) or on a console, two kinds of character conversions take place:

Unicode-to-MBCS or MBCS-to-Unicode conversion. When a Unicode stream I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (identical to calling the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (identical to calling the wctomb function).
Carriage return-linefeed (CR-LF) translation. This translation occurs before the MBCS-Unicode conversion (for Unicode stream input functions) and after the Unicode-MBCS conversion (for Unicode stream output functions). During input, each carriage return-linefeed combination is translated into a single linefeed character. Conversely, during output, each linefeed character is translated into a carriage return-linefeed combination.
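The order of the two conversions on output can be sketched in portable C. The function below is only an illustration of the sequence described above, not the CRT's actual implementation; the name text_mode_encode and its buffer-handling conventions are assumptions made for this example.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: imitates what a text-mode Unicode output routine does.
   Step 1 converts each wide character to multibyte form with wctomb,
   using the current LC_CTYPE setting. Step 2 expands every LF to CR-LF.
   Returns the number of bytes stored (excluding the terminating null),
   or (size_t)-1 if a character cannot be converted or the buffer is too small. */
size_t text_mode_encode(const wchar_t *in, char *out, size_t outSize)
{
    size_t used = 0;
    char mb[MB_LEN_MAX];

    for (; *in != L'\0'; ++in) {
        int len = wctomb(mb, *in);      /* Unicode-to-MBCS conversion */
        if (len < 0)
            return (size_t)-1;          /* not representable in this locale */

        for (int i = 0; i < len; ++i) {
            /* On output, CR-LF translation follows the conversion. */
            size_t need = (mb[i] == '\n') ? 2 : 1;
            if (used + need >= outSize)
                return (size_t)-1;      /* keep room for the terminator */
            if (mb[i] == '\n')
                out[used++] = '\r';
            out[used++] = mb[i];
        }
    }
    if (used >= outSize)
        return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

For the input L"a\nb" the function stores the four bytes 'a', CR, LF, 'b'. Input would run the same two steps in the opposite order: CR-LF pairs are collapsed first, and the remaining bytes are then passed to mbtowc.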

However, when a Unicode stream I/O function operates in binary mode, the file is assumed to be Unicode, and no CR-LF translation or character conversion occurs during input or output. This makes CRT I/O complex when the application has to be globalized. Using console I/O routines of the Win32 API is a consistent and controllable way to handle multilingual data, and it should be used whenever possible.

Unicode CRT I/O functions do not produce Unicode output. Rather, they take Unicode input and convert it to the code page defined by a prior call to setlocale, or to the "C" locale (ASCII) if setlocale was not called. In the latter case, only ASCII (basic Latin) text is processed, and localized text is not displayed at all. Figure 3.14 presents the same text displayed twice, first with the Unicode Win32 API and then with the Unicode CRT. The CRT locale is not set.
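The reason localized text disappears can be checked directly: in the "C" locale, the wide-to-multibyte conversion itself fails for characters outside ASCII. The sketch below uses a hypothetical helper, representable_in_locale, to ask the CRT whether a single wide character can be converted under the current LC_CTYPE setting. It is an illustration only, and the result for any locale other than "C" depends on the platform.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: reports whether the current LC_CTYPE setting can
   represent the given wide character as a multibyte sequence.
   Returns 1 if wctomb succeeds, 0 otherwise. */
int representable_in_locale(wchar_t wc)
{
    char mb[MB_LEN_MAX];
    int len;

    wctomb(NULL, 0);        /* reset any conversion state left by earlier calls */
    len = wctomb(mb, wc);
    return len > 0;
}
```

With the locale left at "C", the helper is expected to accept L'A' and to reject the Cyrillic letter U+0416. After a successful setlocale call that selects a Cyrillic code page, the second result would change.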

Figure 3.14 - CRT output example.

Win32 Text-Mode I/O

The Win32 API provides two approaches to console I/O: the high-level approach, which uses the ReadConsole/WriteConsole and ReadFile/WriteFile functions, and the low-level approach, which requires access to console screen and input buffers and to keyboard, mouse, and buffer-resizing events. Only the high-level functions are described here.

Both WriteConsole and WriteFile can take Unicode parameters. The difference between them is:

WriteConsole writes Unicode characters to console, but works on console handles only. It does not work if the output is redirected to a disk file.
WriteFile can take a handle of any file, so that the output can be redirected to a file or a pipe. However, the encoding is assumed to be in the current console-output code page. A Unicode input array will be displayed as garbage.

An application whose messages can never be redirected should stick with WriteConsole. Applications that produce multilingual output that can be redirected should check the handle they get for output. If the output handle belongs to a console, use WriteConsole. Otherwise, call WriteFile with text that matches the current console code page. In the following example, the code writes Unicode output to the console, but sends code page-based text if the output is redirected. The assumption (not shown in the example) is that the output string is loaded from resources that match the console-output code page.

 wchar_t*   szwOut;          // text to display; loading it is skipped below
 DWORD      dwBytesWritten;
 DWORD      fdwMode;
 HANDLE     outHandle = GetStdHandle(STD_OUTPUT_HANDLE);
 //...
 // ThreadLocale adjustment, resource loading, etc. is skipped
 //...
 if( (GetFileType(outHandle) & FILE_TYPE_CHAR) &&
     GetConsoleMode( outHandle, &fdwMode) )
 {
     // The handle is a real console: write the UTF-16 text directly.
     WriteConsoleW( outHandle, szwOut, (DWORD) wcslen(szwOut),
                    &dwBytesWritten, 0);
 }
 else
 {
     // The output is redirected: convert to the console-output code page.
     UINT  nOutputCP = GetConsoleOutputCP();
     int   charCount = WideCharToMultiByte( nOutputCP, 0, szwOut, -1,
                                            0, 0, 0, 0);
     char* szaStr = (charCount > 0) ? (char*) malloc(charCount) : NULL;
     if( szaStr != NULL )
     {
         WideCharToMultiByte( nOutputCP, 0, szwOut, -1, szaStr, charCount,
                              0, 0);
         // charCount includes the terminating null, which is not written.
         WriteFile( outHandle, szaStr, charCount - 1, &dwBytesWritten, 0);
         free(szaStr);
     }
 }