Chapter 1: Introduction to Disassembling


The assembler and the disassembler are two sides of the same coin. The assembler converts the source code of the program written in Assembly language into the binary code, and the disassembler converts the binary module into a sequence of Assembly commands. Thus, for analysis of the disassembled code it is necessary to know machine commands, their binary format, and their Assembly representation. Also, it is important to understand the structure of data representation in computer memory, as well as to know the structure of programs written for the Windows operating system. All of these topics will be covered in this chapter.

1.1. Representing Information in Computer Memory

The main goal of this section is to describe how numeric data are stored in computer memory.

1.1.1. Investigating the Memory

Consider a simple program written in the C programming language (Listing 1.1).

Listing 1.1: A simple program that outputs the memory dump

 #include <stdio.h>
 #include <windows.h>

 int k = 0x1667;
 BYTE *b = (BYTE*)&k;

 void main()
 {
     int j = 0;
     printf("\n%p  ", b);
     for (int i = 0; i < 400; i++)
     {
         printf("%02x ", *(b++));
         if (++j == 16 && i < 398)
         {
             printf("\n");
             j = 0;
             printf("%p  ", b);
         }
     }
     printf("\n");
 }

Note 

All C programs will be compiled using the Microsoft Visual C++ compiler (which is supplied as part of Visual Studio .NET 2003). In my opinion, this is the best C++ compiler available. Special cases will be mentioned individually.

The program in Listing 1.1 outputs the contents of the memory area starting from the block that stores the value of the k variable. A copy of a memory area, output to some device, is called a dump. Thus, the program displays a dump of the memory area that stores the program's variables.

Compile the program, then start a command-line session and run it. The console will display a table made up of hexadecimal (hex) numbers (Fig. 1.1).

Figure 1.1: Memory dump displayed by the program presented in Listing 1.1

Judging by the memory pattern, it contains other data in addition to the value of the k variable, which is 0x1667 (the least significant byte of the word has the smaller address). What are these data? How can these tables of hex numbers be interpreted? I will begin by covering issues that advanced users might consider elementary: the representation of numbers in computer memory. Readers who have already mastered these concepts can skip Sections 1.1.2 and 1.1.3.

1.1.2. Scales of Notation

Decimal Notation

Most individuals have known the decimal scale of notation from childhood. It is natural and traditional. Binary notation is not as natural for humans, but it is natural for computers. Computer memory is made up of elements that can be in one of two possible states. One of the states is conventionally designated as zero, and the alternative state is one. As a result, all information in memory is written as binary numbers, or sequences of ones and zeros. In addition, computer memory is divided into blocks, each block containing eight items. These blocks are called memory cells or bytes. A single digit in binary notation is called a bit (bit stands for binary digit). Thus, each memory cell is made up of eight binary digits, or 8 bits.

Recall that decimal system numbers are base 10 numbers. This means that every decimal system number can be represented as a sum of the powers of ten, where the number positions serve as coefficients. Consider the following example:

  • 4567 = 4×10^3 + 5×10^2 + 6×10^1 + 7×10^0

In other words, each digit's contribution depends on the position it occupies. The position of a digit is its ordinal number, counted from right to left starting from zero. Such numeral systems are called positional numeral systems.

Binary Notation

Binary notation is also a positional numeral system. Thus, any binary number can be represented in the form of a sum of the powers of two, for example:

  • 11101001 = 1×2^7 + 1×2^6 + 1×2^5 + 0×2^4 + 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0

This method of writing binary numbers is actually a method of converting them to another numeral system. For example, if you carry out these operations in decimal notation, you'll obtain 233.

Converting a decimal system number into the binary representation is somewhat more difficult. This can be done according to the following algorithm:

  1. Divide the given number by two and take the remainder as the next binary digit (the remainders are produced starting from the least significant bit).

  2. If the quotient is greater than one, return to step 1.

  3. The binary number is composed of the last quotient (the most significant bit) followed by all remainders, taken in reverse order of their appearance.

For instance, consider conversion of the number 350 to binary notation:

350 / 2 = 175, remainder 0
175 / 2 =  87, remainder 1
 87 / 2 =  43, remainder 1
 43 / 2 =  21, remainder 1
 21 / 2 =  10, remainder 1
 10 / 2 =   5, remainder 0
  5 / 2 =   2, remainder 1
  2 / 2 =   1, remainder 0

The final quotient (1) supplies the most significant bit; the remainders, read from last to first, supply the remaining bits.

As the result of the preceding computations, it is obtained that the binary representation of the decimal system number 350 is 101011110.

To ensure that numbers in different notations can be adequately distinguished in Assembly programs, a single-character B suffix is used for designating binary numbers. For decimal system numbers, the D suffix is used, which can be omitted. For hex numbers, the H suffix is used. For example: 10000B, 345H, 100, etc.

By analogy with decimal fractions, it is possible to consider binary fractions. For example, the binary number 1001.1101 can be represented as follows:

  • 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 + 1×(1/2^1) + 1×(1/2^2) + 0×(1/2^3) + 1×(1/2^4)

A binary fraction can also be converted into decimal notation by simply using arithmetic operations. For example, to convert the number 1001.1101 into a decimal number, it is necessary to carry out all operations specified in the binary number representation. As a result, you'll obtain the following number in decimal notation: 9.8125.

Decimal fractions are also easily converted into binary notation. The integer and fractional parts of the number are converted separately. The algorithm for converting the whole part of the number was already covered. The fractional part is converted as follows:

  1. Multiply the fractional part by two (the system base).

  2. In the resulting number, separate the integer part (this will be either zero or one). This will be the first digit after the decimal point in the binary numeral system.

  3. If the fractional part of the resulting number is not zero, return to step 1; otherwise, terminate computation. It is possible to specify the computation's precision — in other words, the number of digits after the decimal point — and terminate computations when this precision is achieved.

Now, consider a practical example of converting the decimal system number into the binary representation. Assume that it is necessary to convert 105.406 into binary notation. The algorithm of converting the integer part of the number has already been considered. Thus, 105 in binary representation equals 1101001. To convert the fractional part, use the algorithm just considered. The sequence of computations is presented here. Note that in this example it was necessary to stop the computation when a precision of nine characters after the decimal point was reached.

0.406 × 2 = 0.812  →  digit 0  (0×(1/2^1))
0.812 × 2 = 1.624  →  digit 1  (1×(1/2^2))
0.624 × 2 = 1.248  →  digit 1  (1×(1/2^3))
0.248 × 2 = 0.496  →  digit 0  (0×(1/2^4))
0.496 × 2 = 0.992  →  digit 0  (0×(1/2^5))
0.992 × 2 = 1.984  →  digit 1  (1×(1/2^6))
0.984 × 2 = 1.968  →  digit 1  (1×(1/2^7))
0.968 × 2 = 1.936  →  digit 1  (1×(1/2^8))
0.936 × 2 = 1.872  →  digit 1  (1×(1/2^9))

As a result of this computation, you'll find out the following:

  • 105.406 ≈ 1101001.011001111B

Thus, converting decimal numbers into the binary notation in which they are stored in computer memory is an additional source of precision loss.

Hexadecimal Numeral System

The hex numeral system is more compact than the decimal numeral system. Numbers in hex numeral systems are easily converted into the binary system, and vice versa. Finally, the hex numeral system corresponds to the computer memory's architecture considerably better than any other notation. Sixteen hex digits are used for designating numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F. The method of converting numbers from a decimal to a hex system, and vice versa, is similar to the method described in the previous section; the only difference is that in this case the system base is 16 instead of 2. Hopefully, you will easily derive the required algorithms on your own.

Consider the method of converting numbers from hex system into the binary system, and vice versa. The main principle here is exceedingly simple: Four digits of a binary number, a quaternion, correspond to one digit of a hex number, and vice versa. Fig. 1.2 demonstrates the conversion of the 10101101 binary number to a hex number.

Figure 1.2: Converting a binary number to a hex number

Fig. 1.3 illustrates backward conversion of the hex number 14A into a binary format.

Figure 1.3: Converting a hex number to a binary number

As already mentioned, of all numeral systems, the hex numeral system best maps to the computer memory architecture. Computer memory is divided into cells containing 8 bits each, and 8 bits correspond to exactly two hex digits. For example, the number 1345H takes two memory cells: the least significant cell (by convention, the one with the smaller address) contains 45H, and the most significant cell stores 13H.

The conversion of fractions between the hex and binary numeral systems is as easy as for integers. The fractional part, like the integer part, is converted according to the same principle: one hex digit corresponds to four binary digits. Consider the binary number 101.10001 and convert it into hex notation. The integer part, 101B, is padded on the left to the quartet 0101B, which gives the hex digit 5. The fractional part is divided into quartets counted from left to right and padded with zeros on the right: 1000 1000B, which gives the hex digits 8 and 8. As a result, the binary number 101.10001 corresponds to 5.88H. Thus, as with the integer part, conversion of the fractional part is reduced to dividing the binary digits into quartets and padding incomplete quartets with zeros.

1.1.3. Representing Numbers in Computer Memory

Unsigned Integer Numbers

The principle of representing unsigned integer numbers in computer memory is trivial:

  1. The number must be converted to the binary numeral system.

  2. It is necessary to determine the memory size required to store that number. As already mentioned, the most convenient way of doing this is to convert the number into hex notation, after which the amount of memory required for storing this number will be immediately clear. According to convention, memory is allocated by single memory cells (bytes), double cells (words), and quadruple cells (4 bytes, or a double word). Assembly language provides special directives for reserving memory for storing numeric constants and variables:

    • Name1 DB value1 ; Reserve 1 byte

    • Name2 DW value2 ; Reserve 2 bytes

    • Name3 DD value3 ; Reserve 4 bytes

    • Name4 DQ value4 ; Reserve 8 bytes

    • Name5 DT value5 ; Reserve 10 bytes

When dealing with variables, which usually will be the case, it is necessary to determine the range within which the variable's value may change and to reserve memory for the variable on the basis of that information. Because contemporary Intel processors are optimized for operations over 32-bit numbers, the best approach for the moment is to prefer variables of that size.

Consider the fragment of some C program, shown in Listing 1.2.

Listing 1.2: A fragment of a program written in C

 BYTE e = 0xab;
 WORD c = 0x1234;
 DWORD b = 0x34567890;
 __int64 a = 0x6178569812324572;

This fragment defines four variables: the 1-byte variable e, the 2-byte variable c, the 4-byte variable b,[1] and the 8-byte variable a. Using the program presented in Listing 1.1, output the memory area where these variables are stored. You'll obtain the following sequence of bytes:

 ab 00 00 00 34 12 00 00 90 78 56 34 00 00 00 00 72 45 32 12 98 56 78 61 

Consider this sequence of bytes carefully. You should find all of the variables without difficulties. The most important conclusions that can be drawn by studying this sequence of bytes are as follows.

  • As you should recall, in Listing 1.1 the memory contents were displayed from the lower (least significant) to the higher (most significant) address. Thus, the least significant byte of each number (variable) occupies the lowest address. The least significant word in a double word, in turn, takes the smaller address. Finally, in a 64-bit variable, the least significant double word takes the smaller address. This issue is important for analysis of binary code. Later, you'll be able to identify variables at one glance at a memory region.

  • As you can see, all variables require a memory size that is a multiple of a 4-byte value. After each initialized variable, the compiler inserts a special directive for alignment by a 32-bit boundary (Align 4). However, the situation is not that simple, and alignment might be different with a different order of variables. This topic will be covered in more detail in Section 3.1.1.

Examples 

Thus, a 16-bit number, such as A890H, will be stored in memory as the following sequence of bytes: 90 A8. A 32-bit number, such as 67896512H, will be stored as 12 65 89 67. Finally, a 64-bit number, F5C68990D1327650H, for example, will be stored as 50 76 32 D1 90 89 C6 F5.

Signed Numbers

Because the memory contains only binary digits, it would be logical to dedicate a separate bit to storing the number sign. For example, with one memory cell you would be able to carry out arithmetic operations over numbers ranging from -127 to +127 (11111111 to 01111111). This approach wouldn't be too bad; however, it would require separate addition operations for signed and unsigned numbers. There is an alternative method of introducing signed numbers: take the representation of a positive number a and define -a as the number that satisfies the identity a + (-a) = 0.

When working with a set of single-byte numbers, it is natural to consider that 1 equals the 00000001 binary number. By solving the equation 00000001 + x = 00000000, you'll obtain a result that at first glance seems paradoxical: x = 11111111. In other words, using this alternative approach, -1 must be considered equal to 11111111 (255 in decimal and FF in hex). Now, it is time to elaborate on this theory. Obviously, (-1) + (-1) = -2. Therefore, according to this theory, -2 must be equal to 11111110, and 00000010 must represent +2. Check whether these figures correspond to the theory, and you'll see that 11111110 + 00000010 = 00000000 (the carry out of the eighth bit is discarded). Thus, the self-evident identity is true: +2 + (-2) = 0. This means that the chosen approach is consistent and the process can be continued (Table 1.1).

Table 1.1: Signed single-byte numbers

Positive number   Binary representation   Negative number   Binary representation
+0                00000000                -0                00000000
+1                00000001                -1                11111111
+2                00000010                -2                11111110
+3                00000011                -3                11111101
+4                00000100                -4                11111100
+5                00000101                -5                11111011
...               ...                     ...               ...
+120              01111000                -120              10001000
+121              01111001                -121              10000111
+122              01111010                -122              10000110
+123              01111011                -123              10000101
+124              01111100                -124              10000100
+125              01111101                -125              10000011
+126              01111110                -126              10000010
+127              01111111                -127              10000001
+128              Doesn't exist within    -128              10000000
                  the limits of 1 byte

Consider Table 1.1 more carefully. What was the result of elaborating this theory? The signed numbers can range from -128 to +127.

Thus, a single-byte number can be interpreted both as a signed and as an unsigned number. According to the first (signed) interpretation, 11111111 equals -1; interpreted as unsigned, it equals 255. Everything depends on the chosen interpretation. The most interesting fact is that addition and subtraction are carried out in the same way for both signed and unsigned numbers; therefore, the processor has only one command for each operation: ADD and SUB. When executing a specific operation, there might be an overflow or a carry out of the most significant bit;[2] however, this topic deserves separate consideration. The problem can be mitigated by reserving one or more additional memory cells for the number. All of these considerations are easily extended to 2- and 4-byte numbers. Thus, the maximum unsigned 16-bit number equals 65,535, and signed 16-bit numbers belong to the range from -32,768 to +32,767.

Another interesting issue relates to the most significant bit. As you can see, this bit can be used to determine the sign. However, this bit is not entirely isolated and participates with the other bits in forming the number value.

Having the skills to navigate signed and unsigned numbers is important for an investigator of software code. For example, having encountered commands such as cmp eax, 0FFFFFFFEh, it is necessary to bear in mind that this might be the cmp eax, -2 command.

Consider the sequence of variables shown in Listing 1.3.

Listing 1.3: A sequence of different variables

 signed char e = -2;
 short int c = -3;
 int b = -4;
 __int64 a = -5;

As you can see, all variables shown in this listing are signed variables with negative values. When displaying the memory block containing these variables, the following sequence of bytes will be obtained:

 FE 00 00 00 FD FF 00 00 FC FF FF FF 00 00 00 00 FB FF FF FF FF FF FF FF 

Thus, the value of an 8-bit variable set to -2 in computer memory is represented as FEh, the value of a 16-bit variable set to -3 is represented by the FFFDh sequence, and the value of a 32-bit variable set to -4 is represented as FFFFFFFCh. Finally, a negative 64-bit variable set to -5 is represented as follows: FFFFFFFFFFFFFFFBh. Recall that when representing a 64-bit variable, the 4 least significant bytes must be located at an address smaller than the most significant bytes.

Real Numbers

To use real numbers in commands of the Intel processor (the arithmetic coprocessor[3]), they must be represented in computer memory in the normalized form. In general, the normalized form of a number appears as follows:

  • A = (NS) M × N^q

Here, NS designates the number sign; M stands for the mantissa (in the Intel formats, the mantissa is normalized so that 1 ≤ M < 2); N is the base of the numeral system; and q is the exponent, which might be positive or negative. Numbers represented this way are often called floating-point numbers. Consider a practical example of a floating-point number. Try to represent 9.75 in the normalized form. First, it is necessary to convert this number into binary notation. This task is trivial: 9 in binary notation is 1001, and 0.75 equals (1/2) + (1/4). In other words, 9.75 = 1001.11B. Furthermore, 1001.11B = 1.00111 × 2^3. Thus, the normalized number comprises the following components: NS = +1, M = 1.00111, N = 2, and q = 3. Note that when using such a representation, the first digit of the mantissa always equals one; consequently, it is possible to do without storing it. The Intel format is based on this possibility. In addition, it is necessary to bear in mind that the q exponent is stored in memory in the form of a sum with a certain bias, to ensure that the stored value is always positive. The Intel processor can work with the following three types of real numbers:

  • Short real number — For storing a short real number, 32 bits are allocated. Bits 0-22 are reserved for the mantissa. Bits 23–30 are intended for storing the q exponent added to the number 127. The last bit, bit 31, is intended for storing the number sign (if this bit is set to one, then the number is negative; otherwise, the number is positive).

  • Long real number — Here, 64 bits are allocated for storing such a number. Bits 0-51 are reserved for storing the mantissa. Bits 52-62 are intended for storing the q exponent added to 1,023. The last bit, bit 63, determines the number sign (if this bit is set to one, then the number is negative; otherwise, the number is positive).

  • Extended real number— For storing such numbers, 80 bits are allocated. Bits 0-63 are intended for storing the mantissa. Bits 64-78 store the q exponent added to 16,383. The last bit, bit 79, is intended for storing the number sign (if this bit is set to one, then the number is negative; otherwise, the number is positive).

Consider a practical example illustrating representation of a floating-point number in the memory. Assume that the following variable is declared in some program written in C:

                               float a = -80.5; 

The float type corresponds to the short real number. This means that its memory representation will take 32 bits. Now, try to view the memory using the standard approach. Here are 4 bytes that represent the previously mentioned number:

                               00 00 a1 c2 

To make this representation easily understandable, convert it into the binary representation:

               00000000 00000000 10100001 11000010 

To make this representation more understandable, rewrite it starting from the most significant byte to emphasize the mantissa, exponent, and sign:

               11000010 10100001 00000000 00000000 

Now, separate the mantissa. Recall that 23 bits are allocated for storing it. Thus, the following binary number will be obtained: 0100001. Note that mantissa bits are counted starting from the most significant one (in this case, bit 22); the trailing zeros are discarded because the whole mantissa is located to the right of the binary point. However, the obtained number doesn't represent the mantissa exactly. As already mentioned, the first digit of the mantissa is always equal to one; consequently, there is no need to store it, and when using the Intel representation this one should be restored. Therefore, the following number represents the mantissa: 1.0100001B. The sign of the whole number is negative because bit 31 is set to one. As for the exponent, it must be obtained from the 10000101B binary number, which equals 133 in decimal. To obtain the exponent of a short real number, subtract 127 from this value; the result is 6. Thus, to obtain the real number from the mantissa, the binary point must be shifted six positions to the right. The result is 1010000.1B. In hex notation, this is 50.8H; converting this number to decimal notation gives 80.5, and the sign bit makes it -80.5.

To have hands-on practice, consider the following sequence of bytes:

                   00 80 FB 42 

Try to prove that this sequence of bytes corresponds to the representation of 125.75.

On the basis of the material in this section, it is possible to conclude that if real numbers are used in a program, they might become approximate before any operations are carried out on them. This is because real numbers must be normalized and stored with a mantissa of limited length before they can be written into memory.

Binary-Coded Decimals

Binary-coded decimal (BCD) notation is a special method of representing decimal numbers in computer memory. In this case, each digit of an unsigned decimal number is represented by its 4-bit binary equivalent (a nibble). The Intel processor supports two types of such numbers: packed and unpacked.

  • Every digit of a packed number is encoded by a nibble (4 bits, or half a byte). In this case, the 4 most significant bits contain the most significant digit. Thus, a byte can contain a number ranging from 0 to 99. For example, 56 will be represented as 01010110B.

  • Each digit of an unpacked number is encoded by a single byte. In this case, only the 4 least significant bits store digits and the 4 most significant bits must contain zeros. Thus, 1 byte can contain a number from 0 to 9.

BCDs are rarely used in programming nowadays; therefore, I won't consider this topic further.

[1]BYTE is simply unsigned char, WORD is unsigned short int, and DWORD is unsigned int. Definitions of these data types can be found, for example, in the windows.h file.

[2]It can easily be shown that simultaneous representation of signed and unsigned numbers is possible because every number is stored in a fixed size of 1 or more whole bytes, so all arithmetic is carried out modulo a power of two.

[3]Starting with Intel 486, the arithmetic coprocessor is an integral, built-in part of the microprocessor.




Disassembling Code: IDA Pro and SoftICE
ISBN: 1931769516
Year: 2006
Pages: 63
Authors: Vlad Pirogov