Writing Exploits for Use with a Unicode Filter | Hacking Ubuntu: Serious Hacks Mods and Customizations (ExtremeTech)

Chris Anley first documented the feasibility of exploiting Unicode-based vulnerabilities in his excellent paper "Creating Arbitrary Shell Code in Unicode Expanded Strings," published in January 2002 (www.nextgenss.com/papers/unicodebo.pdf).

The paper introduces a method for creating shellcode from machine code that is Unicode in nature; that is, with every second byte being a null. Although Chris's paper is a fantastic introduction to such techniques, the method and code he presents have some limitations. He recognizes these limitations and concludes his paper by stating that refinements can be made. This section introduces Chris's technique, known as the Venetian Method, and his implementation of it. We then detail some refinements and address some of its shortcomings.

What Is Unicode?

Before we continue, let's cover the basics of Unicode. Unicode is a standard for encoding characters using 16 bits per character (rather than 8 bits, or strictly 7 bits, as in ASCII) and thus supports a much larger character set, lending itself to internationalization. By supporting the Unicode standard, an operating system can be used more easily around the world and therefore gain acceptance in the international community. If an operating system uses Unicode, then its code needs to be written only once, and only the language and character set need to change; so even those systems that use the Roman alphabet use Unicode. The ASCII value of each character in the Roman alphabet and number system is padded with a null byte in its Unicode form. For example, the ASCII character A, which has a hex value of 0x41, becomes 0x4100 in Unicode.

 String:         ABCDEF
 Under ASCII:    \x41\x42\x43\x44\x45\x46\x00
 Under Unicode:  \x41\x00\x42\x00\x43\x00\x44\x00\x45\x00\x46\x00\x00\x00

Such Unicode characters are often referred to as wide characters; strings made up of wide characters are terminated with two null bytes. Non-ASCII characters, however, such as those found in the Chinese or Russian alphabets, do not have the null bytes; all 16 bits are used. In the Windows family of operating systems, normal ASCII strings are often converted to their Unicode equivalents when passed to the kernel or when used in protocols such as RPC.
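The byte layouts described above can be reproduced with a short Python sketch. It assumes that the wide-character form corresponds to little-endian UTF-16, as on Windows; the helper names to_wide and hex_bytes are hypothetical and exist only for this illustration.

```python
def to_wide(text: str) -> bytes:
    """Encode text as little-endian UTF-16 and append the two-byte null terminator."""
    return text.encode("utf-16-le") + b"\x00\x00"


def hex_bytes(data: bytes) -> str:
    """Render bytes in the \\xNN notation used in the text."""
    return "".join(f"\\x{b:02x}" for b in data)


if __name__ == "__main__":
    # ASCII form: one byte per character plus a single null terminator.
    print(hex_bytes("ABCDEF".encode("ascii") + b"\x00"))
    # Wide form: every ASCII byte is followed by a null byte.
    print(hex_bytes(to_wide("ABCDEF")))
    # A Cyrillic letter (U+042F) uses both bytes, so no null padding appears.
    print(hex_bytes(to_wide("\u042f")))
```

The first two lines printed match the ASCII and Unicode rows shown earlier; the third shows a character whose wide form contains no null byte before the terminator.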

Converting from ASCII to Unicode

At a high level, most programs and text-based network protocols such as HTTP deal with normal ASCII strings. These strings may then be converted to their Unicode equivalents so that the low-level code underlying programs and servers can deal with them.

Why Do Unicode Vulnerabilities Occur?

Unicode-based vulnerabilities occur for the same reason normal ones do. Just about everyone knows about the dangers of using functions like strcpy() and strcat(), and the same applies to Unicode; there are wide-character equivalents such as wcscpy() and wcscat(). Indeed, even the conversion functions MultiByteToWideChar() and WideCharToMultiByte() are vulnerable to buffer overflow if the lengths of the strings used are miscalculated or misunderstood. You can even have Unicode format-string vulnerabilities.
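One common form of length miscalculation is confusing a character count with a byte count. The sketch below only works through that arithmetic; it assumes 2-byte wide characters, and the names WCHAR_SIZE, wide_bytes_needed, and overflow_by are hypothetical, not part of any real API.

```python
WCHAR_SIZE = 2  # assumed size of one wide character, in bytes


def wide_bytes_needed(text: str) -> int:
    """Bytes required to hold text as a null-terminated wide string."""
    return len(text.encode("utf-16-le")) + WCHAR_SIZE


def overflow_by(text: str, buffer_bytes: int) -> int:
    """Number of bytes written past a buffer of buffer_bytes (0 if it fits)."""
    return max(0, wide_bytes_needed(text) - buffer_bytes)


if __name__ == "__main__":
    buffer_bytes = 16
    text = "A" * 15
    # A check such as len(text) < 16 treats the 16-byte buffer as if it
    # held 16 characters, so this 15-character string passes the check.
    print(len(text) < buffer_bytes)
    # The wide form needs (15 + 1) * 2 bytes, twice what the buffer holds.
    print(wide_bytes_needed(text), overflow_by(text, buffer_bytes))
```

A string that passes a character-based check can therefore need up to twice the buffer's size once widened.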

Under Windows, a normal ASCII string is converted to its wide-character equivalent using the function MultiByteToWideChar(). Conversely, a Unicode string is converted to its ASCII equivalent using the WideCharToMultiByte() function. The first parameter passed to both functions is the code page. A code page describes the variations in the character set to be applied. When MultiByteToWideChar() is called, the same 8-bit value may turn into completely different 16-bit values, depending on the code page it has been passed. For example, when the conversion function is called with the ANSI code page (CP_ACP), the 8-bit value 0x8B is converted to the wide-character value 0x3920. However, if the OEM code page (CP_OEMCP) is used, then 0x8B becomes 0xEF00.
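The effect of the code page can be approximated outside Windows with Python's built-in codecs. This is a sketch under stated assumptions: cp1252 stands in for the ANSI code page and cp437 for the OEM code page, which holds for typical US English systems but depends on the system locale. It does not call MultiByteToWideChar() itself, and the helper name widen is hypothetical.

```python
def widen(byte_value: int, codec: str) -> bytes:
    """Decode one 8-bit value with the given code page, then return its
    little-endian 16-bit form, as the bytes would appear in memory."""
    char = bytes([byte_value]).decode(codec)
    return char.encode("utf-16-le")


if __name__ == "__main__":
    for codec in ("cp1252", "cp437"):  # assumed stand-ins for ANSI and OEM
        print(codec, widen(0x8B, codec).hex(), widen(0x41, codec).hex())
```

Under these assumptions, 0x8B widens to the bytes 39 20 with cp1252 and to EF 00 with cp437, while 0x41 widens to 41 00 in both cases.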

Needless to say, the code page used in the conversion will have a big impact on any exploit code sent to a Unicode-based vulnerability. More often than not, however, ASCII characters such as A (0x41) are converted to their wide-character versions simply by adding a null byte (0x4100). As such, when writing plug-and-play exploit code for Unicode-based buffer overflows, it's better to use code made up entirely of ASCII characters. In this way, you minimize the chance of the code being mangled by conversion routines.
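A quick way to see which byte values are safe is to check whether each one widens to itself followed by a null byte, whatever the code page. The sketch below again uses Python's cp1252 and cp437 codecs as assumed stand-ins for the ANSI and OEM code pages; the function name survives_widening is hypothetical, and the real Windows conversion routines may differ in detail.

```python
CODECS = ("cp1252", "cp437")  # assumed stand-ins for the ANSI and OEM code pages


def survives_widening(data: bytes, codecs=CODECS) -> bool:
    """Return True if every byte widens to itself plus a null byte
    under every code page listed in codecs."""
    expected = b"".join(bytes([b, 0]) for b in data)
    for codec in codecs:
        try:
            widened = data.decode(codec).encode("utf-16-le")
        except UnicodeDecodeError:
            return False  # byte has no mapping in this code page
        if widened != expected:
            return False
    return True


if __name__ == "__main__":
    print(survives_widening(b"ABCDEF"))    # pure ASCII
    print(survives_widening(b"\x8b\x41"))  # contains a code-page-dependent byte
```

Only byte strings made up entirely of ASCII values pass this check for both code pages, which is the reason for the advice above.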