Chapter 25: Code Analysis Basics | The Assembly Programming Master Book

Most executable modules are written in high-level programming languages and do not contain debug information. However, even in this case, it is sometimes necessary to analyze their code. To speed up code analysis, the programmer has to know, or at least have reference information about, the standard Assembly language structures that correspond to specific structures of high-level programming languages. Note that this chapter mainly describes 32-bit applications.

Variables and Constants

Contemporary compilers optimize the source code quite efficiently; therefore, it is sometimes difficult to determine which variable is used and where. Mainly, this is because the compiler uses registers for storing variables whenever possible. As a rule, the compiler starts using memory only when there are no available registers.

To illustrate this situation, I have taken a simple console application written in Borland C++. The source code of this application takes about 15 lines. The EXE file, however, is more than 50 KB. The size of the executable file is no surprise, though. Another point is more interesting in this respect: only one disassembler correctly solved the problem of determining the entry point, the _main label. As you have probably guessed, this was IDA Pro. Naturally, all disassemblers correctly disassembled the program section that carries out the job; however, only IDA Pro was able to discover how the jump to that section of code is carried out. Better still, it also correctly recognized the _printf function. Listing 25.1 shows the fragment of the disassembled program corresponding to the main procedure. The source code was written in C, and disassembling was carried out using IDA Pro. A debugger provides no obvious way of quickly finding this fragment. Hence, the usefulness of combining a debugger with a disassembler is obvious.

Listing 25.1: The main function of a console application

 CODE:00401108  _main  proc near ;  DATA XREF:  DATA:0040B044
 CODE:00401108
 CODE:00401108  argc      = DWORD PTR 8
 CODE:00401108  ARGV      = DWORD PTR 0CH
 CODE:00401108  ENVP      = DWORD PTR 10H
 CODE:00401108
 CODE:00401108        PUSH  EBP
 CODE:00401109        MOV   EBP, ESP
 CODE:0040110B        PUSH  EBX
 CODE:0040110C        MOV   EDX, OFFSET UNK_40D42C
 CODE:00401111        XOR   EAX, EAX
 CODE:00401113
 CODE:00401113 LOC_401113: ; CODE XREF: _MAIN+22
 CODE:00401113        MOV   ECX, 1FH
 CODE:00401118        SUB   ECX, EAX
 CODE:0040111A        MOV   EBX, DS:OFF_40B074
 CODE:00401120        MOV   CL, [EBX+ECX]
 CODE:00401123        MOV   [EDX+EAX], CL
 CODE:00401126        INC   EAX
 CODE:00401127        CMP   EAX, 21H
 CODE:0040112A        JL    SHORT LOC_401113
 CODE:0040112C        MOV   BYTE PTR [EDX+20H], 0
 CODE:00401130        PUSH  EDX     ; CHAR
 CODE:00401131        PUSH  OFFSET AS ; __VA_ARGS
 CODE:00401136        CALL  _PRINTF
 CODE:0040113B        ADD   ESP, 8
 CODE:0040113E        POP   EBX
 CODE:0040113F        POP   EBP
 CODE:00401140        RETN
 CODE:00401140 _MAIN     ENDP

Now consider how the programmer might determine which C program was the source of the fragment under consideration. To begin with, consider the standard structures. This fragment contains only one standard structure, namely, the loop. The key command in the loop organization appears as follows:

 CODE:0040112A JL SHORT LOC_401113

Together with the CMP EAX, 21H command that precedes it, this jump returns control to the start of the loop as long as EAX is less than 21H. Obviously, the INC EAX command increments the loop variable. Thus, the EAX register stores some variable that plays the role of the loop parameter. I suggest that you call this variable i. This assumption is confirmed by the presence of the XOR EAX, EAX command before the loop start. Naturally, this command is equivalent to i=0. The INC EAX command stands for i++. Now, try to discover other variables. Pay attention to the following command:

 CODE:0040110C MOV EDX, OFFSET UNK_40D42C

This command deserves special attention because some address is loaded into the EDX register. Trace how the EDX register will be used further. Note the presence of the following command:

 CODE:00401123 MOV [EDX+EAX], CL

No other commands in the loop use the EDX register; combined with the preceding command, this leads the code analyzer to assume that EDX plays the role of a pointer to an array, string, or record. This assumption must be confirmed or refuted at the next step. At this step, pay attention to the following two commands:

 CODE:0040112C       MOV  BYTE PTR  [EDX+20H], 0

and

 CODE:00401130       PUSH  EDX      ; CHAR

The first command assures the programmer that EDX points to a string, because it is a string that is terminated by the 0 character. The second command passes the second parameter to the printf function. Based on this information, and on the comment supplied by IDA Pro (the disassembler interpreted this code correctly and quickly), it is possible to conclude that EDX is a pointer to some string. Note that this conclusion was drawn without reviewing the data block; reviewing it would certainly speed up the investigation. I suggest that you designate this pointer s1. In this relationship, the [EDX+EAX] expression can be interpreted as s1[i] or as *(s1+i).
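The equivalence between the base-plus-index operand and the two C notations can be sketched as follows. The function names are hypothetical and chosen only for illustration; the sketch assumes that s1 points to a buffer of at least 33 bytes, because offset 20H is written.

```c
#include <stddef.h>

/* Array-index form: the C counterpart of MOV [EDX+EAX], CL. */
void store_indexed(char *s1, size_t i, char c)
{
    s1[i] = c;
}

/* Pointer form of the same store; s1[i] and *(s1 + i) are equivalent. */
void store_pointer(char *s1, size_t i, char c)
{
    *(s1 + i) = c;
}

/* Counterpart of MOV BYTE PTR [EDX+20H], 0: terminate at offset 32. */
void terminate_at_20h(char *s1)
{
    s1[0x20] = '\0';
}
```

Both store functions would be expected to compile to the same base-plus-index addressing, which is why the disassembly alone cannot tell which notation the original source used.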

Now consider the following command:

 CODE:0040111A   MOV  EBX, DS:OFF_40B074

It means that the EBX register also points to some string (I'll designate it s2). All further lines of code, which move characters from s2 to s1, confirm this assumption. This deserves more detailed coverage.

What is the meaning of the following sequence of commands?

 CODE:00401113      MOV   ECX,  1FH
 CODE:00401118      SUB   ECX,  EAX

Only one answer is possible: at every loop iteration, ECX contains 1FH minus the current value of EAX, so it starts at 1FH (31) and decreases by one on each pass as EAX, that is, the i variable, increases. Because the MOV ECX, 1FH command reloads the ECX register at the start of every iteration, it is logical to assume that the ECX register, before a character is moved from string to string, always contains the number 1FH-i (or 31-i). The [EBX+ECX] expression is then equivalent to s2[31-i] or *(s2+31-i).

As a result, it is possible to conclude that the following commands:

 CODE:00401120        MOV  CL,  [EBX+ECX]
 CODE:00401123        MOV  [EDX+EAX], CL

can be replaced by the following expression: s1[i] = s2[31-i].
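As a minimal sketch of this step, the address arithmetic can be written out in C. The helper names mirrored_index and copy_mirrored_char are hypothetical, and the sketch assumes that i stays in the range 0 to 31 and that both buffers are large enough.

```c
/* Value left in ECX by MOV ECX, 1FH followed by SUB ECX, EAX. */
int mirrored_index(int i)
{
    return 0x1F - i;
}

/* One pass of MOV CL, [EBX+ECX] / MOV [EDX+EAX], CL: s1[i] = s2[31-i]. */
void copy_mirrored_char(char *s1, const char *s2, int i)
{
    s1[i] = s2[mirrored_index(i)];
}
```

Calling copy_mirrored_char for successive values of i walks s1 forward while s2 is read backward, which is the behavior the two MOV commands implement.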

Now, you are prepared to consider the entire fragment:

 CODE:00401111        XOR  EAX, EAX
 CODE:00401113
 CODE:00401113 LOC_401113: ; CODE XREF: _MAIN+22
 CODE:00401113        MOV  ECX, 1FH
 CODE:00401118        SUB  ECX, EAX
 CODE:0040111A        MOV  EBX, DS:OFF_40B074
 CODE:00401120        MOV  CL,   [EBX+ECX]
 CODE:00401123        MOV  [EDX+EAX], CL
 CODE:00401126        INC  EAX
 CODE:00401127        CMP  EAX, 21H
 CODE:0040112A        JL   SHORT LOC_401113

I hope it would be correct to express this fragment as the following code in C language:

 i=0;
 do {
     s1[i]=s2[31-i];
     i++;
 } while(i<=0x20);

Everything seems OK except the presence of the following instruction in the loop:

 CODE:0040111A   MOV  EBX, DS:OFF_40B074

Why is it in the loop instead of before the loop? This was done at the compiler's discretion. Note that this need not necessarily be a DO loop. In this case, the number of loop iterations is specified explicitly, so the compiler can place a single conditional jump at the end of the loop. In other words, the previously provided structure could also be the result of optimizing a WHILE loop.
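To make the point about loop forms concrete, here is a hedged sketch of the same copy written both ways. The function names and the len parameter are assumptions introduced for illustration (with len equal to 32, the source index becomes 31-i, as in the fragment); len must be at least 1 for the DO form to be valid.

```c
#include <stddef.h>

/* DO form: the body runs before the first test, so len must be >= 1. */
void reverse_do(char *dst, const char *src, size_t len)
{
    size_t i = 0;
    do {
        dst[i] = src[len - 1 - i];
        i++;
    } while (i < len);
    dst[len] = '\0';
}

/* WHILE form: the test comes first in the source text. */
void reverse_while(char *dst, const char *src, size_t len)
{
    size_t i = 0;
    while (i < len) {
        dst[i] = src[len - 1 - i];
        i++;
    }
    dst[len] = '\0';
}
```

When the iteration count is a known nonzero constant, a compiler is free to drop the initial test of the WHILE form, so both functions may end up with a single conditional jump at the bottom of the loop.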

It is also possible to ask another question: Where are the s1 and s2 strings stored? This can be discovered quickly and easily. The main procedure has the standard form; consequently, if the strings in question were local variables, there would be an area in the stack reserved for them using commands such as SUB ESP, N (or ADD ESP, -N). Thus, because there are no such commands, s1 and s2 are global variables. The absolute addresses in OFFSET UNK_40D42C and DS:OFF_40B074 point the same way. The interesting point is that the other variables, which are local, are stored in registers in accordance with the principle that variables must be stored in registers whenever possible. ^[i]
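The distinction between global and local storage can be illustrated with a small sketch. The names g_buffer, fill_global, and fill_local are hypothetical. A typical 32-bit compiler would be expected to address g_buffer through a fixed offset and to reserve room for local_buffer by adjusting ESP in the prologue, although the exact code depends on the compiler and its optimization settings.

```c
#include <stddef.h>

#define BUF_SIZE 32

/* Global: lives at a fixed address, so no stack space is reserved for it. */
char g_buffer[BUF_SIZE];

/* Fills the global buffer and returns the sum of the stored bytes. */
int fill_global(char c)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < BUF_SIZE; i++) {
        g_buffer[i] = c;
        sum += g_buffer[i];
    }
    return sum;
}

/* Same work on a local array. An array of this size does not fit in
   registers, so the prologue typically contains SUB ESP, N (or ADD ESP, -N). */
int fill_local(char c)
{
    char local_buffer[BUF_SIZE];
    int sum = 0;
    size_t i;
    for (i = 0; i < BUF_SIZE; i++) {
        local_buffer[i] = c;
        sum += local_buffer[i];
    }
    return sum;
}
```

The scalar locals sum and i are the kind of variables a compiler would normally keep in registers, as i is kept in EAX in the fragment analyzed here.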

Thus, the result of the code analysis can be presented in Listing 25.2.

Listing 25.2: The final form of the C program reconstructed on the basis of the disassembled code

 #include <stdio.h>

 char s1[33];    /* index 32 receives the terminating 0 */
 char * s2="abcdefghigklmnopqrstuvwxyz";

 void main()
 {
     int i;
     i=0;
     do {
         s1[i]=s2[31-i];
         i++;
     } while(i<=32);
     s1[32]=0;
     printf("%s\n", s1);
 }

To conclude this example, I'll point out that variable s1 is not initialized and variable s2 receives an initial value. Initialized variables are stored in the DATA section, and uninitialized ones are in the BSS section (for the Borland C++ compiler).
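A brief sketch of the two kinds of globals follows. Which section each object lands in is decided by the toolchain, so the section names in the comments are what is reported here for Borland C++, not something the C language guarantees. What the language does guarantee is that an object with static storage and no initializer starts out as all zeros, which the hypothetical helper is_zero_filled checks. The names uninit_buf and init_ptr are also assumptions made for illustration.

```c
#include <stddef.h>

char uninit_buf[33];                      /* no initializer: BSS-style storage */
const char *init_ptr = "initial value";   /* initializer: DATA-style storage   */

/* Returns 1 if the first n bytes of buf are all zero, otherwise 0. */
int is_zero_filled(const char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

Because uninitialized globals need no stored image in the executable file, only their size has to be recorded, which is one reason linkers keep them apart from initialized data.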

Contemporary compilers allow the use of 64-bit integers. The Assembly code in this case becomes more complicated; however, the complications are modest. It is necessary to bear in mind that a 64-bit variable is stored in two adjacent 32-bit blocks. The most significant part of such a variable has the higher address. To manipulate such a variable, a pair of registers is used, as a rule EDX:EAX, holding the most significant and least significant parts, respectively. Therefore, suppose that you encounter the following instructions:

 MOV  DWORD PTR [adr], EAX
 MOV  DWORD PTR [adr+4], EDX

In this case, you can assume that you are dealing with a 64-bit variable.
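The layout can be modelled in C as a rough sketch. The names store64 and load64 are hypothetical; the two-element array stands in for the memory at adr and adr+4, and the lo and hi locals play the roles of EAX and EDX.

```c
#include <stdint.h>

/* Store a 64-bit value as two 32-bit halves: mem[0] models [adr]
   (low part) and mem[1] models [adr+4] (high part). */
void store64(uint32_t mem[2], uint64_t value)
{
    uint32_t lo = (uint32_t)(value & 0xFFFFFFFFu);  /* role of EAX */
    uint32_t hi = (uint32_t)(value >> 32);          /* role of EDX */
    mem[0] = lo;   /* MOV DWORD PTR [adr], EAX   */
    mem[1] = hi;   /* MOV DWORD PTR [adr+4], EDX */
}

/* Reassemble the 64-bit value from the two halves. */
uint64_t load64(const uint32_t mem[2])
{
    return ((uint64_t)mem[1] << 32) | mem[0];
}
```

For example, storing 0x1122334455667788 places 0x55667788 at the lower address and 0x11223344 at the higher one, matching the rule that the most significant part has the higher address.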

^[i] In classical C language, to instruct the compiler to store variables in registers, they should be declared as register.