Unicode | The Assembly Programming Master Book

Operating systems of the Windows NT family, starting from Windows 2000, have fully migrated to Unicode. However, most programmers haven't even noticed this event, because all operations with Unicode are internal to Windows. As for input parameters, such as the strings passed to the MessageBox function, Windows continues to accept them in ANSI encoding. When such a function is called, the operating system converts the input ANSI string into a string of two-byte characters and then works with the result. If the function must return a string value, the string is converted back from Unicode to ANSI. Additionally, every function that accepts or returns strings has a "twin" with the same name complemented by a trailing W, such as MessageBoxW or CharToOemW. These functions operate on Unicode strings directly. The resources, which will be covered later, are also stored in the Unicode format. Consequently, all functions that store and retrieve resource information also convert it before accomplishing their tasks.

One of the most interesting functions is IsTextUnicode, which checks the information stored in a given buffer and determines whether or not it is presented in the Unicode format. The function is statistical, which means that it determines whether the given text is in the Unicode format only with a certain level of probability. Consider this function in more detail:

 BOOL IsTextUnicode(
       CONST VOID* pBuffer,
       int cb,
       LPINT lpi
     )

The function returns a nonzero value if it recognizes the Unicode format; otherwise, it returns zero.

The first parameter of this function is the address of the buffer containing text information that has to be checked.
The second parameter contains the length, in bytes, of the buffer being checked.
The third parameter is a pointer to a memory area that specifies which tests must be carried out. On return, the same area holds the results, with a bit left set for each requested test that the text passed. If this pointer is 0 (NULL), the function carries out all tests. For example, the IS_TEXT_UNICODE_ASCII16 constant, equal to 1, tests whether the text is Unicode that contains only ASCII characters extended with a zero high byte.

Other values of constants can be found in the WINDOWS.INC file supplied with the MASM32 product. The most important point here is that all constants contain bits that don't overlap, which means that the memory area, to which the third argument of this function points, can contain combinations of these constants. An example illustrating the use of this function will be provided later.
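As a minimal sketch in the calling style used in this chapter, the following fragment asks IsTextUnicode to run a combination of two tests on a buffer. The names TXT, LTXT, TESTS, and NOT_UNI are hypothetical, and the numeric values of the constants are assumed to match WINDOWS.INC (IS_TEXT_UNICODE_ASCII16 = 1, IS_TEXT_UNICODE_STATISTICS = 2).

```asm
; Hypothetical data: a short two-byte string and its length in bytes
TXT   DW 'T','e','s','t', 0
LTXT  EQU $ - TXT
; Tests to run (assumed values):
; IS_TEXT_UNICODE_ASCII16 (1) OR IS_TEXT_UNICODE_STATISTICS (2)
TESTS DD 1 OR 2
.
.
.
PUSH  OFFSET TESTS  ; Address of the test mask (also receives the results)
PUSH  LTXT          ; Buffer length in bytes
PUSH  OFFSET TXT    ; Buffer address
CALL  IsTextUnicode@12
CMP   EAX, 0
JE    NOT_UNI       ; EAX = 0 - the requested tests did not pass
; The text looks like Unicode; TESTS shows which tests passed
.
.
.
NOT_UNI:
; The text is probably not Unicode
```

Because the constants occupy different bits, the OR operator is enough to request several tests at once.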

Now, consider how conversions from ANSI to Unicode and from Unicode to ANSI are carried out. Two functions are intended for this purpose: MultiByteToWideChar and WideCharToMultiByte. Consider these functions in more detail.

The MultiByteToWideChar function is used for converting ANSI strings to Unicode strings:

 int MultiByteToWideChar(
       UINT CodePage,
       DWORD dwFlags,
       LPCSTR lpMultiByteStr,
       int cbMultiByte,
       LPWSTR lpWideCharStr,
       int cchWideChar
     )

The first parameter specifies the number of the codepage of the source string. For example, the CP_ACP = 0 constant corresponds to the default ANSI codepage of the system.
The second parameter is the flag that regulates the conversion of letters with diacritical signs. Usually this parameter is assumed to be zero.
The third parameter points to the string being converted.
The fourth parameter must be equal to the length, in bytes, of the string being converted. If this parameter is set to -1, the function determines the string length on its own; in this case, the string must end with a zero character.
The fifth parameter points to the buffer intended to store the converted string.
The sixth parameter specifies the maximum size, in two-byte characters, of the buffer that will store the converted string.

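If the function completes successfully, it returns the number of characters written to the buffer. If it fails, it returns 0.

If the sixth parameter is set to 0, the function won't carry out the conversion. Instead, it will return the buffer size, in characters, required to store the converted string.

The WideCharToMultiByte function carries out the reverse conversion, from Unicode strings to ANSI strings: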

 int WideCharToMultiByte(
       UINT CodePage,
       DWORD dwFlags,
       LPCWSTR lpWideCharStr,
       int cchWideChar,
       LPSTR lpMultiByteStr,
       int cbMultiByte,
       LPCSTR lpDefaultChar,
       LPBOOL lpUsedDefaultChar
     )

The first parameter specifies the codepage for the resulting string.
The second parameter is the flag that regulates the conversion of letters with diacritical signs. Usually this parameter is assumed to be zero.
The third parameter is the address of the string being converted.
The fourth parameter is the length of the string being converted (in characters). If this parameter is set to -1, the function determines the string length on its own; in this case, the string must end with a zero character.
The fifth parameter is the address of the buffer that will store the resulting ANSI string.
The sixth parameter determines the maximum size, in bytes, of the buffer that will store the converted string.
The seventh parameter is the address of the default character. If the function encounters a character missing from the specified codepage, it replaces the missing character with the one pointed to by this parameter. Usually, this parameter is assumed to be 0 (or NULL), in which case the system default character is used.
The eighth parameter points to the memory area in which the function places a nonzero value if at least one character of the source string had to be replaced with the default character, or 0 if all characters were converted successfully.
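The fragment below is a minimal sketch of such a call, mirroring the style of the ANSI to Unicode conversion. The names USTR, ABUF, and USED are hypothetical, and the parameters are pushed in reverse order, as for any other API function.

```asm
; The Unicode string to be converted
USTR  DW 'T','e','s','t', 0
; The buffer for the resulting ANSI string
ABUF  DB 100 DUP(0)
; Receives a nonzero value if the default character was used
USED  DD 0
.
.
.
PUSH  OFFSET USED   ; Address of the "default character used" flag
PUSH  0             ; No custom default character (NULL)
PUSH  100           ; Maximum buffer length in bytes
PUSH  OFFSET ABUF   ; Buffer address
PUSH  -1            ; Define the string length automatically
PUSH  OFFSET USTR   ; String address
PUSH  0             ; Flag
PUSH  0             ; CP_ACP - ANSI encoding
CALL  WideCharToMultiByte@32
; EAX contains the number of bytes written to ABUF (0 means failure)
```

After the call, a nonzero value in USED indicates that some characters had no equivalent in the target codepage.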

Listing 6.1 is the code fragment that converts the source string from ANSI to Unicode.

Listing 6.1: The fragment that carries out ANSI to UNICODE conversion

 .
 ; The string to be converted
 STR1 DB "Console application", 0
 ; The buffer for copying the converted string (200 two-byte characters)
 BUF DW 200 DUP(0)
 .
 .
 .
 PUSH  200         ; Maximum buffer length, in characters
 PUSH  OFFSET BUF  ; Buffer address
 PUSH  -1          ; Define the string length automatically
 PUSH  OFFSET STR1 ; String address
 PUSH  0           ; Flag
 PUSH  0           ; CP_ACP - ANSI encoding
 CALL MultiByteToWideChar@24
 ; Now it is possible to work with the Unicode string BUF.
 ...

Regarding the provided fragment, it should be pointed out that it would be better to proceed as follows:

Call the MultiByteToWideChar function with the sixth parameter set to 0.
The function returns the required size of the buffer for the converted string, in two-byte characters; multiply it by 2 to obtain the size in bytes.
Allocate memory for that buffer.
Convert the string by calling the MultiByteToWideChar function and specifying the sixth parameter.
Work with the string.
Release the memory block allocated to the buffer.
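These steps can be sketched as follows. This is only one possible variant: it assumes that memory is allocated with GlobalAlloc using the GMEM_FIXED flag (value 0), and the names LENW, PBUF, and CONV_ERR are hypothetical.

```asm
; The string to be converted
STR1  DB "Console application", 0
; Hypothetical variables: required length and buffer address
LENW  DD 0
PBUF  DD 0
.
.
.
; Step 1: find out the required buffer length (in characters)
PUSH  0             ; Sixth parameter = 0 - only report the size
PUSH  0             ; No buffer yet
PUSH  -1            ; Define the string length automatically
PUSH  OFFSET STR1   ; String address
PUSH  0             ; Flag
PUSH  0             ; CP_ACP - ANSI encoding
CALL  MultiByteToWideChar@24
CMP   EAX, 0
JE    CONV_ERR      ; 0 means the call failed
MOV   LENW, EAX
; Step 2: allocate the buffer, two bytes per character
SHL   EAX, 1
PUSH  EAX           ; Size in bytes
PUSH  0             ; GMEM_FIXED - the function returns a pointer
CALL  GlobalAlloc@8
CMP   EAX, 0
JE    CONV_ERR      ; 0 means the memory was not allocated
MOV   PBUF, EAX
; Step 3: carry out the conversion
PUSH  LENW          ; Buffer length in characters
PUSH  PBUF          ; Buffer address
PUSH  -1
PUSH  OFFSET STR1
PUSH  0
PUSH  0
CALL  MultiByteToWideChar@24
; Step 4: work with the Unicode string at the address PBUF
.
.
.
; Step 5: release the memory block
PUSH  PBUF
CALL  GlobalFree@4
CONV_ERR:
```

Because the string length is passed as -1, the size returned in step 1 already includes the terminating zero character.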

Chapter 12 contains an interesting macro that simplifies the conversion of the string from ASCII to Unicode.

Finally, I'd like to point out that there is a convenient technique for specifying a string that will be interpreted as Unicode directly in the program. Instead of defining the string in the traditional manner, as follows:

 STR1 DB "MASM 7.0", 0

it can be written as a sequence of two-byte values, one per character:

 STR1 DW 'M','A','S','M',' ','7','.','0', 0

After that, you can comfortably use the MessageBoxW function instead of MessageBoxA.
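A minimal sketch of such a call is shown below. The caption string CAP is hypothetical, and the last parameter pushed is the handle of the owner window, which is assumed to be absent here.

```asm
; The text and the caption stored as two-byte characters
STR1 DW 'M','A','S','M',' ','7','.','0', 0
CAP  DW 'U','n','i','c','o','d','e', 0
.
.
.
PUSH  0             ; MB_OK
PUSH  OFFSET CAP    ; Caption address
PUSH  OFFSET STR1   ; Text address
PUSH  0             ; No owner window
CALL  MessageBoxW@16
```

Each DW element occupies two bytes, with the character code in the low byte and zero in the high byte, so no conversion is required before the call.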