Most executable modules are written in high-level programming languages and do not contain debug information. However, even in this case, it is sometimes necessary to analyze their code. To speed up code analysis, the programmer has to know, or at least have reference information about, the standard Assembly language structures that correspond to specific structures of high-level programming languages. Note that this chapter mainly describes 32-bit applications.
Contemporary compilers optimize the source code quite efficiently; therefore, it is sometimes difficult to determine which variable is used and where. Mainly, this is because the compiler uses registers for storing variables whenever possible. As a rule, the compiler starts using memory only when no registers are available.
To illustrate this situation, I have taken a simple console application written in Borland C++. Its source code takes about 15 lines, yet the EXE file is larger than 50 KB. The size of the executable file is no surprise, though. Another point is more interesting: only one disassembler correctly determined the entry point, the _main label. As you have probably guessed, this was IDA Pro. All disassemblers correctly disassembled the program section that does the actual work; however, only IDA Pro was able to discover how the jump to that section of code is carried out. Better still, it also correctly recognized the _printf function. Listing 25.1 shows a fragment of the disassembled program corresponding to the main procedure. The source code was written in C, and disassembling was carried out using IDA Pro. The debugger provides no easy way of finding this fragment quickly. Hence, the benefit of using the debugger and disassembler together is obvious.
CODE:00401108 _main proc near ; DATA XREF: DATA:0040B044
CODE:00401108
CODE:00401108 argc = DWORD PTR 8
CODE:00401108 ARGV = DWORD PTR 0CH
CODE:00401108 ENVP = DWORD PTR 10H
CODE:00401108
CODE:00401108 PUSH EBP
CODE:00401109 MOV EBP, ESP
CODE:0040110B PUSH EBX
CODE:0040110C MOV EDX, OFFSET UNK_40D42C
CODE:00401111 XOR EAX, EAX
CODE:00401113
CODE:00401113 LOC_401113: ; CODE XREF: _MAIN+22
CODE:00401113 MOV ECX, 1FH
CODE:00401118 SUB ECX, EAX
CODE:0040111A MOV EBX, DS:OFF_40B074
CODE:00401120 MOV CL, [EBX+ECX]
CODE:00401123 MOV [EDX+EAX], CL
CODE:00401126 INC EAX
CODE:00401127 CMP EAX, 21H
CODE:0040112A JL SHORT LOC_401113
CODE:0040112C MOV BYTE PTR [EDX+20H], 0
CODE:00401130 PUSH EDX ; CHAR
CODE:00401131 PUSH OFFSET AS ; __VA_ARGS
CODE:00401136 CALL _PRINTF
CODE:0040113B ADD ESP, 8
CODE:0040113E POP EBX
CODE:0040113F POP EBP
CODE:00401140 RETN
CODE:00401140 _MAIN ENDP
Now consider how the programmer might determine which C program was the source of the fragment under consideration. To begin with, look for standard structures. This fragment contains only one standard structure, namely, the loop. The key command in the loop organization appears as follows:
CODE:0040112A JL SHORT LOC_401113
Obviously, the INC EAX command increments the loop variable. Thus, the EAX register stores some variable that plays the role of the loop parameter. I suggest that you call this variable i. This assumption is confirmed by the presence of the XOR EAX, EAX command before the loop start. This command is equivalent to i=0, and INC EAX stands for i++. Now, try to discover other variables. Pay attention to the following command:
CODE:0040110C MOV EDX, OFFSET UNK_40D42C
This command deserves special attention because some address is loaded into the EDX register. Trace how the EDX register will be used further. Note the presence of the following command:
CODE:00401123 MOV [EDX+EAX], CL
No other commands in the loop use the EDX register. Together with the preceding command, this suggests that EDX plays the role of a pointer to an array, string, or record. This assumption must be confirmed or refuted at the next step. For this, pay attention to the following two commands:
CODE:0040112C MOV BYTE PTR [EDX+20H], 0
and
CODE:00401130 PUSH EDX ; CHAR
The first command assures the programmer that EDX points to a string, because it is a string that is terminated by the 0 character. The second command passes the second parameter to the printf function. Based on this information, and on the comment supplied by IDA Pro (the disassembler interpreted this code correctly and quickly), it is possible to conclude that EDX is a pointer to some string. Note that this conclusion was drawn without reviewing the data block, which would certainly speed up the investigation. I suggest that you designate this pointer s1. Accordingly, the [EDX+EAX] expression can be interpreted as s1[i] or as *(s1+i).
Now consider the following command:
CODE:0040111A MOV EBX, DS:OFF_40B074
It means that the EBX register also points to some string (I'll designate it s2). The lines of code that follow, which move characters from s2 to s1, confirm this assumption. This deserves more detailed coverage.
What is the meaning of the following sequence of commands?
CODE:00401113 MOV ECX, 1FH
CODE:00401118 SUB ECX, EAX
Only one answer is possible: at every loop iteration, ECX receives 1FH (31) minus the current value of the EAX register, that is, of the i variable, which participates in the SUB ECX, EAX command. ECX therefore starts at 1FH and decreases by one on each pass. Because the MOV ECX, 1FH command reloads the register at the start of every iteration, ECX always contains 1FH-i (or 31-i) just before a character is moved from string to string. The [EBX+ECX] expression is then equivalent to s2[31-i] or *(s2+31-i).
As a result, it is possible to conclude that the following commands:
CODE:00401120 MOV CL, [EBX+ECX]
CODE:00401123 MOV [EDX+EAX], CL
can be replaced by the following expression: s1[i] = s2[31-i].
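To make the mapping between registers and C variables more tangible, here is a minimal sketch that transliterates the loop one instruction at a time. The function name reverse32_register_style is hypothetical, and the local variables are simply named after the registers they stand for. One deliberate deviation, which is an assumption of this sketch and not something taken from the listing: the loop stops at 20H instead of 21H, so that the index 31-i never drops below zero and the sketch stays within the bounds of its source string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies the first 32 characters of src into dst
   in reverse order, mirroring the disassembled loop step by step.
   src must hold at least 32 characters; dst needs room for 33 bytes. */
void reverse32_register_style(char *dst, const char *src)
{
    assert(strlen(src) >= 32);

    char *edx = dst;               /* MOV EDX, OFFSET UNK_40D42C   (s1) */
    int eax = 0;                   /* XOR EAX, EAX                 (i = 0) */
    do {
        int ecx = 0x1F;            /* MOV ECX, 1FH */
        ecx -= eax;                /* SUB ECX, EAX                 (31 - i) */
        const char *ebx = src;     /* MOV EBX, DS:OFF_40B074       (s2) */
        char cl = ebx[ecx];        /* MOV CL, [EBX+ECX]            (s2[31-i]) */
        edx[eax] = cl;             /* MOV [EDX+EAX], CL            (s1[i] = ...) */
        eax++;                     /* INC EAX                      (i++) */
    } while (eax < 0x20);          /* assumption: bound 20H, not 21H as in the listing */
    edx[0x20] = 0;                 /* MOV BYTE PTR [EDX+20H], 0 */
}
```

Calling the function with a 32-character source and a 33-byte destination buffer yields the source reversed and terminated with 0. In real machine code CL is the low byte of ECX, so the index is destroyed by the load; the sketch uses a separate variable for readability.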
Now, you are prepared to consider the entire fragment:
CODE:00401111 XOR EAX, EAX
CODE:00401113
CODE:00401113 LOC_401113: ; CODE XREF: _MAIN+22
CODE:00401113 MOV ECX, 1FH
CODE:00401118 SUB ECX, EAX
CODE:0040111A MOV EBX, DS:OFF_40B074
CODE:00401120 MOV CL, [EBX+ECX]
CODE:00401123 MOV [EDX+EAX], CL
CODE:00401126 INC EAX
CODE:00401127 CMP EAX, 21H
CODE:0040112A JL SHORT LOC_401113
I believe this fragment corresponds to the following code written in C. Note that the JL command repeats the loop while i is less than 21H:
i = 0;
do {
    s1[i] = s2[31-i];
    i++;
} while (i < 0x21);
Everything seems OK except the presence of the following command in the loop:
CODE:0040111A MOV EBX, DS:OFF_40B074
Why is it in the loop instead of before the loop? This was done at the compiler's discretion. Note that the source need not have been a do loop. In this case, the number of loop iterations is known in advance, so the compiler can get by with a single conditional jump at the end of the loop. In other words, the previously provided structure could also be the result of optimization of a while loop.
It is also possible to ask another question: Where are the s1 and s2 strings stored? This can be discovered quickly and easily. The main procedure is a standard one; consequently, if the strings in question were local variables, there would be an area in the stack reserved for them using commands such as SUB ESP, N (or ADD ESP, -N). Thus, if there are no such commands, s1 and s2 are global variables. The interesting point is that the other variables, which are local, are stored in registers in accordance with the principle that variables must be stored in registers whenever possible. [i]
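The point about loop forms can be illustrated with a small sketch. The two functions below, whose names and the constant N are hypothetical and chosen only for this illustration, express the same reversed copy once as a do loop and once as a for loop. Because the iteration count is a compile-time constant greater than zero, a compiler is free to emit the same "body, increment, compare, jump back" pattern for both, although whether a particular compiler actually does so is not guaranteed.

```c
#include <assert.h>
#include <string.h>

enum { N = 32 };  /* hypothetical length, chosen for illustration */

/* Variant 1: the condition is tested after the body, as in the listing. */
void copy_reversed_do(char *dst, const char *src)
{
    assert(strlen(src) >= (size_t)N);
    int i = 0;
    do {
        dst[i] = src[N - 1 - i];
        i++;
    } while (i < N);
    dst[N] = '\0';
}

/* Variant 2: the condition is written before the body. Since 0 < N is
   known at compile time, the initial test can be dropped by the compiler. */
void copy_reversed_for(char *dst, const char *src)
{
    assert(strlen(src) >= (size_t)N);
    for (int i = 0; i < N; i++)
        dst[i] = src[N - 1 - i];
    dst[N] = '\0';
}
```

Both functions expect a source of at least 32 characters and a destination of at least 33 bytes, and they produce identical results. From the disassembly alone, it is therefore not possible to say with certainty which form the original source used.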
Thus, the result of the code analysis can be presented in Listing 25.2.
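#include <stdio.h>

char s1[33];    /* 33 bytes, because s1[32] receives the terminating 0 */
char *s2 = "abcdefghijklmnopqrstuvwxyz";    /* contents are illustrative; the data block was not examined */

void main()
{
    int i;
    i = 0;
    do {
        s1[i] = s2[31-i];
        i++;
    } while (i < 33);    /* CMP EAX, 21H / JL */
    s1[32] = 0;
    printf("%s\n", s1);
}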
To conclude this example, I'll point out that variable s1 is not initialized and variable s2 receives an initial value. Initialized variables are stored in the DATA section, and uninitialized ones are in the BSS section (for the Borland C++ compiler).
Contemporary compilers allow the use of 64-bit integers. The Assembly code in this case becomes more complicated, although not considerably so. It is necessary to bear in mind that a 64-bit variable is stored in two adjacent 32-bit blocks. The most significant part of such a variable has the higher address. To manipulate such a variable, a pair of registers is used, as a rule EDX:EAX, for the most significant and least significant parts, respectively. Therefore, suppose that you encounter the following commands:
MOV DWORD PTR [adr], EAX
MOV DWORD PTR [adr+4], EDX
In this case, you can assume that you are dealing with a 64-bit variable.
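The following minimal sketch models this storage scheme in C. The helper names low_half, high_half, store_qword, and load_qword are hypothetical; the two-element array mem stands in for the memory at [adr] and [adr+4], and the comments show which register each half would correspond to under the EDX:EAX convention described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Least significant 32 bits: the part that would travel in EAX. */
uint32_t low_half(uint64_t v)  { return (uint32_t)(v & 0xFFFFFFFFu); }

/* Most significant 32 bits: the part that would travel in EDX. */
uint32_t high_half(uint64_t v) { return (uint32_t)(v >> 32); }

/* mem[0] models [adr], mem[1] models [adr+4]. */
void store_qword(uint32_t mem[2], uint64_t v)
{
    assert(mem != NULL);
    mem[0] = low_half(v);    /* MOV DWORD PTR [adr],   EAX */
    mem[1] = high_half(v);   /* MOV DWORD PTR [adr+4], EDX */
}

/* Reassembles the 64-bit value from its two 32-bit blocks. */
uint64_t load_qword(const uint32_t mem[2])
{
    assert(mem != NULL);
    return ((uint64_t)mem[1] << 32) | mem[0];
}
```

For the value 0x1122334455667788, store_qword places 0x55667788 in the lower block and 0x11223344 in the higher one, and load_qword restores the original value from the pair.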
[i] In classical C, to instruct the compiler to store variables in registers, they should be declared as register.