3.2 A Good Disassembly | Security Warrior

The output of objdump leaves a little to be desired. In addition to being a "dumb" or sequential disassembler, it provides very little information that can be used to understand the target. For this reason, a great deal of post-disassembly work must be performed in order to make a disassembly useful.

3.2.1 Identifying Functions

As a disassembler, objdump does not attempt to identify functions in the target; it merely creates code labels for symbols found in the ELF symbol table. While it may at first seem appropriate to generate a function for every address that is called, this process has many shortcomings; for example, it fails to identify functions that are called only via pointers, and it fails to detect a "call 0x0" as a function.

On the Intel platform, functions or subroutines compiled from a high-level language usually have the following form:

 55          push %ebp
 89 E5       movl %esp, %ebp
 83 EC ??    subl ??, %esp
 ...
 89 EC       movl %ebp, %esp        ; could also be C9 leave
 C3          ret

The series of instructions at the beginning and end of a function are called the function prologue and epilogue; they are responsible for creating a stack frame in which the function will execute, and are generated by the compiler in accordance with the calling convention of the programming language. Functions can be identified by searching for function prologues within the disassembled target; in addition, an arbitrary series of bytes could be considered code if it contains instances of the 55 89 E5 83 EC byte series.
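The prologue search can be sketched in a few lines. The following is a minimal illustration in Python (not the Perl used by the scripts in this section); the function name find_prologues, the base address, and the sample bytes are assumptions made for the example. Note that a match is only a hint: code compiled without a frame pointer has no such prologue, and data can contain the same bytes by coincidence.

```python
# Hypothetical helper: scan a raw byte buffer for the prologue byte series
# described above (55 89 E5 83 EC). Names and sample data are illustrative.
PROLOGUE = bytes([0x55, 0x89, 0xE5, 0x83, 0xEC])

def find_prologues(code, base=0):
    """Return the address of every occurrence of PROLOGUE in code."""
    hits = []
    pos = code.find(PROLOGUE)
    while pos != -1:
        hits.append(base + pos)
        pos = code.find(PROLOGUE, pos + 1)
    return hits

if __name__ == "__main__":
    # Two fake functions (prologue, leave, ret) separated by a nop (90).
    sample = bytes.fromhex("5589e583ec14c9c3" "90" "5589e583ec08c9c3")
    for addr in find_prologues(sample, base=0x8048000):
        print(f"possible function at {addr:#x}")
```

In practice the buffer would be the contents of an executable section, and base would be that section's load address.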

3.2.2 Intermediate Code Generation

Performing automatic analysis on a disassembled listing can be quite tedious. It is much more convenient to do what more sophisticated disassemblers do: translate each instruction to an intermediate or internal representation and perform all analyses on that representation, converting back to assembly language (or to a higher-level language) before output.

This intermediate representation is often referred to as intermediate code; it can consist of a compiler language such as the GNU RTL, an assembly language for an idealized (usually RISC) machine, or simply a structure that stores additional information about the instruction.

The following Perl script generates an intermediate representation of objdump output and a hex dump; instructions are stored in lines marked "INSN", section definitions are stored in lines marked "SEC", and the hexdump is stored in lines marked "DATA".

 #--------------------------------------------------------------------------
 #!/usr/bin/perl
 # int_code.pl : Intermediate code generation based on objdump output
 # Output Format:
 # Code:
 # INSN|address|name|size|hex|mnemonic|type|src|stype|dest|dtype|aux|atype
 # Data:
 # DATA|address|hex|ascii
 # Section Definition:
 # SEC|name|size|address|file_offset|permissions

 my $file = shift;
 my ($addr, $hex, $mnem, $size);
 my ($s_type, $d_type, $a_type);
 my ($ascii, $pa, $perm);
 my @ops;

 if (! $file ) {
     $file = "-";
 }
 open( A, $file ) || die "unable to open $file\n";

 foreach (<A>) {
     # is this data?
     if ( /^([0-9a-fA-F]{8,})\s+                    # address
         (([0-9a-fA-F]{2,}\s{1,2}){1,16})\s*        # 1-16 hex bytes
         \|([^|]{1,16})\|                           # ASCII chars in '|'
                                     /x) {
         $addr = $1;
         $hex = $2;
         $ascii = $4;
         $hex =~ s/\s+/ /g;
         $ascii =~ s/\|/./g;
         print "DATA|$addr|$hex|$ascii\n";
     # is this an instruction?
     } elsif ( /^\s?(0x0)?([0-9a-f]{3,8}):?\s+       # address
             (([0-9a-f]{2,}\s)+)\s+                  # hex bytes
             ([a-z]{2,6})\s+                         # mnemonic
             ([^\s].+)                               # operands
                                     $/x) {
         $addr = $2;
         $hex = $3;
         $mnem = $5;

         @ops = split_ops($6);

         $src = $ops[0];
         $dest = $ops[1];
         $aux = $ops[2];

         $m_type = insn_type( $mnem );
         if ( $src ) {
             $s_type = op_type( \$src );
         }
         if ( $dest ) {
             $d_type = op_type( \$dest );
         }
         if ( $aux ) {
             $a_type = op_type( \$aux );
         }

         chop $hex;    # remove trailing ' '
         $size = count_bytes( $hex );
         print "INSN|";            # print line type
         print "$addr|$name|$size|$hex|";
         print "$mnem|$m_type|";
         print "$src|$s_type|$dest|$d_type|$aux|$a_type\n";
         $name = "";    # undefine name
         $s_type = $d_type = $a_type = "";
     # is this a section?
     } elsif ( /^\s*[0-9]+\s                # section number
             ([.a-zA-Z_]+)\s+               # name
             ([0-9a-fA-F]{8,})\s+           # size
             ([0-9a-fA-F]{8,})\s+           # VMA
             [0-9a-fA-F]{8,}\s+             # LMA
             ([0-9a-fA-F]{8,})\s+           # File Offset
                                 /x) {
         $name = $1;
         $size = $2;
         $addr = $3;
         $pa = $4;

         if ( /LOAD/ ) {
             $perm = "r";
             if ( /CODE/ ) {
                 $perm .= "x";
             } else {
                 $perm .= "-";
             }
             if ( /READONLY/ ) {
                 $perm .= "-";
             } else {
                 $perm .= "w";
             }
         } else {
             $perm = "---";
         }
         print "SEC|$name|$size|$addr|$pa|$perm\n";
     } elsif ( /^[0-9a-f]+\s+<([a-zA-Z._0-9]+)>:/) {
         # is this a name? if so, use for next addr
         $name = $1;
     } # else ignore line
 }
 close (A);

 # return 1 if $insn starts with any mnemonic in the list, else 0
 sub insn_in_array {
     my ($insn, $insn_list) = @_;
     my $pattern;
     foreach( @{$insn_list} ) {
         $pattern = "^$_";
         if ( $insn =~ /$pattern/ ) {
             return(1);
         }
     }
     return(0);
 }

 # map a mnemonic to a generic instruction type
 sub insn_type {
     my ($insn) = @_;
     my $insn_type = "INSN_UNK";
     my @push_insns = ("push");
     my @pop_insns = ("pop");
     my @add_insns = ("add", "inc");
     my @sub_insns = ("sub", "dec", "sbb");
     my @mul_insns = ("mul", "imul", "shl", "sal");
     my @div_insns = ("div", "idiv", "shr", "sar");
     my @rot_insns = ("ror", "rol");
     my @and_insns = ("and");
     my @xor_insns = ("xor");
     my @or_insns = ("or");
     my @jmp_insns = ("jmp", "ljmp");
     my @jcc_insns = ("ja", "jb", "je", "jn", "jo", "jl", "jg", "js",
                      "jp");
     my @call_insns = ("call");
     my @ret_insns = ("ret");
     my @trap_insns = ("int");
     my @cmp_insns = ("cmp", "cmpl");
     my @test_insns = ("test", "bt");
     my @mov_insns = ("mov", "lea");

     if (insn_in_array($insn, \@jcc_insns) == 1) {
         $insn_type = "INSN_BRANCHCC";
     } elsif ( insn_in_array($insn, \@push_insns) == 1 ) {
         $insn_type = "INSN_PUSH";
     } elsif ( insn_in_array($insn, \@pop_insns) == 1 ) {
         $insn_type = "INSN_POP";
     } elsif ( insn_in_array($insn, \@add_insns) == 1 ) {
         $insn_type = "INSN_ADD";
     } elsif ( insn_in_array($insn, \@sub_insns) == 1 ) {
         $insn_type = "INSN_SUB";
     } elsif ( insn_in_array($insn, \@mul_insns) == 1 ) {
         $insn_type = "INSN_MUL";
     } elsif ( insn_in_array($insn, \@div_insns) == 1 ) {
         $insn_type = "INSN_DIV";
     } elsif ( insn_in_array($insn, \@rot_insns) == 1 ) {
         $insn_type = "INSN_ROT";
     } elsif ( insn_in_array($insn, \@and_insns) == 1 ) {
         $insn_type = "INSN_AND";
     } elsif ( insn_in_array($insn, \@xor_insns) == 1 ) {
         $insn_type = "INSN_XOR";
     } elsif ( insn_in_array($insn, \@or_insns) == 1 ) {
         $insn_type = "INSN_OR";
     } elsif ( insn_in_array($insn, \@jmp_insns) == 1 ) {
         $insn_type = "INSN_BRANCH";
     } elsif ( insn_in_array($insn, \@call_insns) == 1 ) {
         $insn_type = "INSN_CALL";
     } elsif ( insn_in_array($insn, \@ret_insns) == 1 ) {
         $insn_type = "INSN_RET";
     } elsif ( insn_in_array($insn, \@trap_insns) == 1 ) {
         $insn_type = "INSN_TRAP";
     } elsif ( insn_in_array($insn, \@cmp_insns) == 1 ) {
         $insn_type = "INSN_CMP";
     } elsif ( insn_in_array($insn, \@test_insns) == 1 ) {
         $insn_type = "INSN_TEST";
     } elsif ( insn_in_array($insn, \@mov_insns) == 1 ) {
         $insn_type = "INSN_MOV";
     }
     $insn_type;
 }

 # classify an operand; may rewrite the operand in place
 sub op_type {
     my ($op) = @_;    # passed as reference to enable mods
     my $op_type = "";

     # strip dereference operator
     if ($$op =~ /^\*(.+)/ ) {
         $$op = $1;
     }
     if ( $$op =~ /^(\%[a-z]{2,}:)?(0x[a-f0-9]+)?\([a-z\%,0-9]+\)/ ) {
         # Effective Address, e.g., [ebp-8]
         $op_type = "OP_EADDR";
     } elsif ( $$op =~ /^\%[a-z]{2,3}/ ) {
         # Register, e.g., %eax
         $op_type = "OP_REG";
     } elsif ( $$op =~ /^\$[0-9xXa-f]+/ ) {
         # Immediate value, e.g., $0x1F
         $op_type = "OP_IMM";
     } elsif ( $$op =~ /^0x[0-9a-f]+/ ) {
         # Address, e.g., 0x8048000
         $op_type = "OP_ADDR";
     } elsif ( $$op =~ /^([0-9a-f]+)\s+<[^>]+>/ ) {
         # Address followed by a symbol, e.g., 8048360 <main+0x10>
         $op_type = "OP_ADDR";
         $$op = "0x$1";
     } elsif ( $$op ne "" ) {
         # Unknown operand type
         $op_type = "OP_UNK";
     }
     $op_type;
 }

 # split an operand string into up to three operands
 sub split_ops {
     my ($opstr) = @_;
     my @op;

     if ( $opstr =~ /^([^\(]*\([^\)]+\)),\s?        # effective addr
                 (([a-z0-9\%\$_]+)(,\s?             # any operand
                 (.+))?)?                           # any operand
                                     /x ) {
         $op[0] = $1;
         $op[1] = $3;
         $op[2] = $5;
     } elsif ( $opstr =~ /^([a-z0-9\%\$_]+),\s?     # any operand
                     ([^\(]*\([^\)]+\))(,\s?        # effective addr
                     (.+))?                         # any operand
                                     /x ) {
         $op[0] = $1;
         $op[1] = $2;
         $op[2] = $4;
     } else {
         @op = split ',', $opstr;
     }
     @op;
 }

 # count the space-separated hex bytes in a string
 sub count_bytes {
     my @bytes = split ' ', $_[0];
     my $len = $#bytes + 1;
     $len;
 }
 #--------------------------------------------------------------------------

The instruction types in this script are primitive but adequate; they can be expanded as needed to handle unrecognized instructions.

By combining the output of objdump with the output of a hexdump (here the BSD utility hd is simulated with the hexdump command, using the format strings -e ' "%08_ax: " 8/1 "%02x " " - " 8/1 "%02x " " "' -e '"%_p"' -e '"\n" ' mentioned in Section 3.1.5), a complete representation of the target can be passed to this script for processing:

 bash# (objdump -hw -d a.out; hd a.out) | ./int_code.pl

This writes the intermediate code to STDOUT; the intermediate code can be written to a file or piped to other utilities for additional processing. Note that lines for sections, instructions, and data are created:

 SEC|.interp|00000019|080480f4|000000f4|r--
 SEC|.hash|00000054|08048128|00000128|r--
 SEC|.dynsym|00000100|0804817c|0000017c|r--
 ...
 INSN|80484a0|_fini|1|55|push|INSN_PUSH|%ebp|OP_REG||||
 INSN|80484a1||2|89 e5|mov|INSN_MOV|%esp|OP_REG|%ebp|OP_REG||
 INSN|80484a3||3|83 ec 14|sub|INSN_SUB|$0x14|OP_IMM|%esp|OP_REG||
 INSN|80484a6||1|53|push|INSN_PUSH|%ebx|OP_REG||||
 INSN|80484a7||5|e8 00 00 00 00|call|INSN_CALL|0x80484ac|OP_ADDR||||
 INSN|80484ac||1|5b|pop|INSN_POP|%ebx|OP_REG||||
 INSN|80484ad||6|81 c3 54 10 00 00|add|INSN_ADD|$0x1054|OP_IMM|%ebx|OP_REG||
 INSN|80484b4||5|e8 a7 fe ff ff|call|INSN_CALL|0x8048360|OP_ADDR||||
 INSN|80484b9||1|5b|pop|INSN_POP|%ebx|OP_REG||||
 ...
 DATA|00000000|7f 45 4c 46 01 01 01 09 00 00 00 00 00 00 00 00 |.ELF............
 DATA|00000010|02 00 03 00 01 00 00 00 88 83 04 08 34 00 00 00 |............4...

The first field of each line gives the type of information stored in a line. This makes it possible to expand the data file in the future with lines such as TARGET, NAME, LIBRARY, XREF, STRING, and so forth. The scripts in this section will only make use of the INSN information; all other lines are ignored.
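Because the first field names the line type, a consumer can dispatch on it and pass unknown types through untouched. The sketch below illustrates this in Python rather than the Perl used by the scripts in this section. It assumes that fields are separated by '|' characters; the function name parse_line and the key names (which mirror the hash keys used by the Perl scripts here) are illustrative choices.

```python
# Assumed field order for INSN lines; '|' is assumed to be the delimiter.
INSN_FIELDS = ["addr", "name", "size", "bytes", "mnem", "mtype",
               "src", "stype", "dest", "dtype", "arg", "atype"]

def parse_line(line):
    """Split one intermediate-code line into (line type, fields).

    INSN lines yield a dict keyed by INSN_FIELDS; every other line type
    yields the raw list of fields, so new types pass through unchanged.
    """
    parts = line.rstrip("\n").split("|")
    kind, fields = parts[0], parts[1:]
    if kind == "INSN":
        # Pad short lines so that every field name is present.
        fields += [""] * (len(INSN_FIELDS) - len(fields))
        return kind, dict(zip(INSN_FIELDS, fields))
    return kind, fields

if __name__ == "__main__":
    sample = "INSN|80484a3||3|83 ec 14|sub|INSN_SUB|$0x14|OP_IMM|%esp|OP_REG||"
    kind, insn = parse_line(sample)
    print(kind, insn["addr"], insn["mnem"], insn["src"], insn["dest"])
```

Keeping non-INSN lines as plain lists means a filter written this way does not need to change when a new line type is introduced.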

When the intermediate code has been generated, the instructions can be loaded into a linked list for further processing:

 #--------------------------------------------------------------------------
 #!/usr/bin/perl
 # insn_list.pl -- demonstration of instruction linked list creation

 my $file = shift;
 my ($insn, $prev_insn, $head);

 if (! $file ) {
     $file = "-";
 }
 open( A, $file ) || die "unable to open $file\n";

 foreach (<A>) {
     if ( /^INSN/ ) {
         chomp;
         $insn = new_insn( $_ );
         if ( $prev_insn ) {
             $$insn{prev} = $prev_insn;
             $$prev_insn{next} = $insn;
         } else {
             $head = $insn;
         }
         $prev_insn = $insn;
     } else {
         print;    # pass non-INSN lines through unchanged
     }
 }
 close (A);

 $insn = $head;
 while ( $insn ) {
     # insert code to manipulate list here
     print "insn $$insn{addr} : ";
     print "$$insn{mnem}\t$$insn{dest}\t$$insn{src}\n";
     $insn = $$insn{next};
 }

 # generate new instruction struct from line
 sub new_insn {
     my ($line) = @_;
     my (%i, $jnk);
     # change this when input file format changes!
     ( $jnk, $i{addr}, $i{name}, $i{size}, $i{bytes},
       $i{mnem}, $i{mtype}, $i{src}, $i{stype},
       $i{dest}, $i{dtype}, $i{arg}, $i{atype} ) =
         split '\|', $line;
     return \%i;
 }
 #--------------------------------------------------------------------------

The intermediate form of disassembled instructions can now be manipulated by adding code to the while ( $insn ) loop. As an example, the following code creates cross-references:

 #------------------------------------------------------------------------------
 # insn_xref.pl -- generate xrefs for data from int_code.pl
 # NOTE: this changes the file format to
 # INSN|addr|name|size|bytes|mem|mtyp|src|styp|dest|dtype|arg|atyp|xrefs

 my %xrefs;    # add this global variable

 # new version of while (insn) loop
 $insn = $head;
 while ( $insn ) {
     gen_xrefs( $insn, $$insn{src}, $$insn{stype} );
     gen_xrefs( $insn, $$insn{dest}, $$insn{dtype} );
     gen_xrefs( $insn, $$insn{arg}, $$insn{atype} );
     $insn = $$insn{next};
 }

 # output loop
 $insn = $head;
 while ( $insn ) {
     if ( $xrefs{$$insn{addr}} ) {
         chop $xrefs{$$insn{addr}};    # remove trailing colon
     }
     print "INSN|";                    # print line type
     print "$$insn{addr}|$$insn{name}|$$insn{size}|$$insn{bytes}|";
     print "$$insn{mnem}|$$insn{mtype}|$$insn{src}|$$insn{stype}|";
     print "$$insn{dest}|$$insn{dtype}|$$insn{arg}|$$insn{atype}|";
     print "$xrefs{$$insn{addr}}\n";
     $insn = $$insn{next};
 }

 # record the address of insn $i as a reference to the address in $op
 sub gen_xrefs {
     my ($i, $op, $op_type) = @_;
     my $addr;
     if ( $op_type eq "OP_ADDR" && $op =~ /0[xX]([0-9a-fA-F]+)/ ) {
         $addr = $1;
         $xrefs{$addr} .= "$$i{addr}:";
     }
     return;
 }
 #--------------------------------------------------------------------------

Naturally, there is much more that can be done aside from merely tracking cross-references. The executable can be scanned for strings and address references for them created, system and library calls can be replaced with their C names and prototypes, DATA lines can be fixed to use RVAs instead of file offsets using information in the SEC lines, and higher-level language constructs can be generated.
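As one illustration, the file-offset fix-up for DATA lines can be sketched as follows. This is a minimal Python sketch (not the Perl used by the scripts in this section) that assumes SEC lines have the form SEC|name|size|address|file_offset|permissions with hexadecimal fields; the function names are hypothetical. It does not handle sections that occupy no space in the file (such as .bss) or sections that are never loaded, which a real tool would need to filter out using the permissions field.

```python
def load_sections(lines):
    """Collect (name, size, address, file_offset) tuples from SEC lines."""
    sections = []
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if parts[0] == "SEC" and len(parts) >= 5:
            name, size, addr, offset = parts[1:5]
            sections.append((name, int(size, 16), int(addr, 16),
                             int(offset, 16)))
    return sections

def offset_to_address(offset, sections):
    """Map a file offset to a load address, or None if no section holds it."""
    for name, size, addr, file_offset in sections:
        if file_offset <= offset < file_offset + size:
            return addr + (offset - file_offset)
    return None

if __name__ == "__main__":
    sec_lines = [
        "SEC|.interp|00000019|080480f4|000000f4|r--",
        "SEC|.hash|00000054|08048128|00000128|r--",
    ]
    sections = load_sections(sec_lines)
    print(hex(offset_to_address(0x130, sections)))  # offset inside .hash
```

A filter built on this would rewrite the address field of each DATA line whose offset falls inside a section and leave the remaining DATA lines unchanged.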

Such features can be implemented with additional scripts that print to STDOUT a translation of the input (by default, STDIN). When all processing is finished, the intermediate code can be printed using a custom script:

 #------------------------------------------------------------------------------
 #!/usr/bin/perl
 # insn_output.pl -- print disassembled listing
 #                   NOTE: this ignores SEC and DATA lines

 my $file = shift;
 my (%insn, $i);
 my (@xrefs, $xrefstr);

 if (! $file ) {
     $file = "-";
 }
 open( A, $file ) || die "unable to open $file\n";

 foreach (<A>) {
     if ( /^INSN/ ) {
         chomp;
         $i = new_insn( $_ );
         $insn{$$i{addr}} = $i;
     } else {
         ; # ignore other lines
     }
 }
 close (A);

 foreach ( sort keys %insn ) {
     $i = $insn{$_};
     $xrefstr = "";
     @xrefs = ();
     if ($$i{name}) {
         print "\n$$i{name}:\n";
     } elsif ( $$i{xrefs} ) {
         # generate fake name
         print "\nloc_$$i{addr}:\n";
         @xrefs = split ':', $$i{xrefs};
         foreach ( @xrefs ) {
             $xrefstr .= " $_";
         }
     }
     print "\t$$i{mnem}\t";
     if ( $$i{src} ) {
         print_op( $$i{src}, $$i{stype} );
         if ( $$i{dest} ) {
             print ", ";
             print_op( $$i{dest}, $$i{dtype} );
             if ( $$i{arg} ) {
                 print ", ";
                 print_op( $$i{arg}, $$i{atype} );
             }
         }
     }
     print "\t\t(Addr: $$i{addr})";
     if ( $xrefstr ne "" ) {
         print " References:$xrefstr";
     }
     print "\n";
 }

 sub print_op {
     my ($op, $op_type) = @_;
     my ($addr, $i);    # lexical, so the caller's $i is not clobbered
     if ( $op_type eq "OP_ADDR" && $op =~ /0[xX]([0-9a-fA-F]+)/ ) {
         # replace addresses with their names
         $addr = $1;
         $i = $insn{$addr};
         if ( $i && $$i{name} ) {
             print "$$i{name}";
         } else {
             print "loc_$addr";
         }
     } else {
         print "$op";
     }
     return;
 }

 # generate new instruction struct from line
 sub new_insn {
     my ($line) = @_;
     my (%i, $jnk);
     # change this when input file format changes!
     ( $jnk, $i{addr}, $i{name}, $i{size}, $i{bytes},
       $i{mnem}, $i{mtype}, $i{src}, $i{stype},
       $i{dest}, $i{dtype}, $i{arg}, $i{atype}, $i{xrefs} ) =
         split '\|', $line;
     return \%i;
 }
 #--------------------------------------------------------------------------

This can receive the output of the previous scripts from STDIN:

 bash# (objdump -hw -d a.out; hd a.out) | int_code.pl | insn_xref.pl | \
   insn_output.pl

In this way, a disassembly tool chain can be built according to the standard Unix model: many small utilities performing simple transforms on a global set of data.

3.2.3 Program Control Flow

One of the greatest advantages of reverse engineering on Linux is that the compiler and libraries used to build the target are almost guaranteed to be the same as the compiler and libraries that are installed on your system. To be sure, there are version differences as well as different optimization options, but generally speaking all programs will be compiled with gcc and linked with glibc. This is an advantage because it makes it possible to guess what higher-level language constructs caused a particular set of instructions to be generated.

The code generated for a series of source code statements can be determined by compiling those statements in between a set of assembly language markers (uncommon instructions that make the compiled code stand out):

 #define MARKER asm("\tint3\n\tint3\n\tint3\n");

 int main( int argc, char **argv ) {
     int x, y;
     MARKER
     /* insert code to be tested here */
     MARKER
     return(0);
 }

One of the easiest high-level constructs to recognize is the WHILE loop, due to its distinct backward jump. In general, any backward jump that stays within the bounds of a function (i.e., one that does not target an address in memory before the start of the current function) is indicative of a loop.
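The backward-jump heuristic is simple to express in code. The following minimal Python sketch (not the Perl used by the scripts in this section) assumes that instructions are available as dictionaries with addr, mtype, and src keys, where mtype uses the INSN_BRANCH and INSN_BRANCHCC type names and direct jump targets are hexadecimal strings. The function name and the sample data are illustrative.

```python
BRANCH_TYPES = {"INSN_BRANCH", "INSN_BRANCHCC"}

def find_backward_jumps(insns, func_start):
    """Return (jump address, target) pairs for jumps that go backward
    but stay at or after func_start."""
    loops = []
    for insn in insns:
        if insn["mtype"] not in BRANCH_TYPES:
            continue
        try:
            target = int(insn["src"], 16)
        except ValueError:
            continue  # not a direct address, e.g. a jump through a register
        addr = int(insn["addr"], 16)
        if func_start <= target < addr:
            loops.append((addr, target))
    return loops

if __name__ == "__main__":
    # Hypothetical function starting at 0x80483d0 with three jumps.
    insns = [
        {"addr": "80483e7", "mtype": "INSN_BRANCHCC", "src": "0x80483f0"},
        {"addr": "80483e9", "mtype": "INSN_BRANCH", "src": "0x80483f8"},
        {"addr": "80483f6", "mtype": "INSN_BRANCH", "src": "0x80483e0"},
    ]
    for addr, target in find_backward_jumps(insns, func_start=0x80483d0):
        print(f"loop candidate: jump at {addr:#x} back to {target:#x}")
```

Only the last jump in the sample is reported, since the other two jump forward.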

The C statement:

 while ( x < 1024 ) { y += x; }

compiles to the following assembly under gcc:

 80483df:       cc                      int3
 80483e0:       81 7d fc ff 03 00 00    cmpl   $0x3ff,0xfffffffc(%ebp)
 80483e7:       7e 07                   jle    80483f0 <main+0x20>
 80483e9:       eb 0d                   jmp    80483f8 <main+0x28>
 80483eb:       90                      nop
 80483ec:       8d 74 26 00             lea    0x0(%esi,1),%esi
 80483f0:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
 80483f3:       01 45 f8                add    %eax,0xfffffff8(%ebp)
 80483f6:       eb e8                   jmp    80483e0 <main+0x10>

By removing statement-specific operands and instructions, this can be reduced to the more general pattern:

 ; WHILE
 L1:
     cmp    ?, ?
     jcc    L2    ; jump to loop body
     jmp    L3    ; exit from loop
 L2:
     ?    ?, ?    ; body of WHILE loop
     jmp    L1    ; jump to start of loop
 ; ENDWHILE
 L3:

where jcc is one of the Intel conditional branch instructions.
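A matcher for this general pattern can be sketched as follows. This Python sketch (not the Perl used by the scripts in this section) assumes an address-ordered list of instruction records that carry an integer address, one of the INSN_* type names, and a resolved jump target where one exists. The record layout and function names are hypothetical. It matches only this exact shape, so loops that the compiler arranges differently are not found, and other constructs that share the shape are reported as well.

```python
def find_while_loops(insns):
    """Return (L1, L2, L3) address triples matching the shape:
       L1: cmp / jcc L2 / jmp L3 ... L2: body ... jmp L1 ... L3:"""
    found = []
    for i in range(len(insns) - 2):
        cmp_, jcc, jmp = insns[i:i + 3]
        shape = (cmp_["mtype"], jcc["mtype"], jmp["mtype"])
        if shape != ("INSN_CMP", "INSN_BRANCHCC", "INSN_BRANCH"):
            continue
        l1, l2, l3 = cmp_["addr"], jcc["target"], jmp["target"]
        if l2 is None or l3 is None or not (jmp["addr"] < l2 < l3):
            continue
        # Between L2 and L3 there must be an unconditional jump back to L1.
        for tail in insns[i + 3:]:
            if tail["addr"] >= l3:
                break
            if (tail["addr"] >= l2 and tail["mtype"] == "INSN_BRANCH"
                    and tail["target"] == l1):
                found.append((l1, l2, l3))
                break
    return found

def insn(addr, mtype, target=None):
    """Build one illustrative instruction record."""
    return {"addr": addr, "mtype": mtype, "target": target}

if __name__ == "__main__":
    listing = [
        insn(0x80483e0, "INSN_CMP"),
        insn(0x80483e7, "INSN_BRANCHCC", 0x80483f0),
        insn(0x80483e9, "INSN_BRANCH", 0x80483f8),
        insn(0x80483f0, "INSN_MOV"),
        insn(0x80483f3, "INSN_ADD"),
        insn(0x80483f6, "INSN_BRANCH", 0x80483e0),
    ]
    for l1, l2, l3 in find_while_loops(listing):
        print(f"WHILE: test at {l1:#x}, body at {l2:#x}, exit at {l3:#x}")
```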

A related construct is the FOR loop, which is essentially a WHILE loop with a counter. Most C FOR loops can be rewritten as WHILE loops by adding an initialization statement, a termination condition, and a counter increment.

The C FOR statement:

 for ( x = 0; x < 10; x++ ) { y *= 1024; }

is compiled by gcc to:

 80483d9:       8d b4 26 00 00 00 00    lea    0x0(%esi,1),%esi
 80483e0:       83 7d fc 09             cmpl   $0x9,0xfffffffc(%ebp)
 80483e4:       7e 02                   jle    80483e8 <main+0x18>
 80483e6:       eb 18                   jmp    8048400 <main+0x30>
 80483e8:       8b 45 f8                mov    0xfffffff8(%ebp),%eax
 80483eb:       89 c2                   mov    %eax,%edx
 80483ed:       89 d0                   mov    %edx,%eax
 80483ef:       c1 e0 0a                shl    $0xa,%eax
 80483f2:       89 45 f8                mov    %eax,0xfffffff8(%ebp)
 80483f5:       ff 45 fc                incl   0xfffffffc(%ebp)
 80483f8:       eb e6                   jmp    80483e0 <main+0x10>
 80483fa:       8d b6 00 00 00 00       lea    0x0(%esi),%esi

This generalizes to:

 ; FOR
 L1:
     cmp    ?, ?
     jcc    L2
     jmp    L3
 L2:
     ?    ?, ?        ; body of FOR loop
     inc    ?
     jmp    L1
 ; ENDFOR
 L3:

which demonstrates that the FOR statement is really an instance of a WHILE statement, albeit often with an inc or a dec at the tail of L2.

The IF-ELSE statement is generally a series of conditional and unconditional jumps that skip blocks of code. The typical model is to follow a condition test with a conditional jump that skips the next block of code; that block of code then ends with an unconditional jump that exits the IF-ELSE block. This is how gcc handles the IF-ELSE. A simple IF statement in C, such as:

 if ( argc > 4 ) { x++; }

compiles to the following under gcc:

 80483e0:       83 7d 08 04             cmpl   $0x4,0x8(%ebp)
 80483e4:       7e 03                   jle    80483e9 <main+0x19>
 80483e6:       ff 45 fc                incl   0xfffffffc(%ebp)

The generalization of this code is:

 ; IF
     cmp    ?, ?
     jcc    L1    ; jump over instructions
     ?    ?, ?    ; body of IF statement
 ; ENDIF
 L1:

A more complex IF statement with an ELSE clause in C such as:

 if ( argc > 4 ) { x++; } else { y--; }

compiles to the following under gcc:

 80483e0:       83 7d 08 04             cmpl   $0x4,0x8(%ebp)
 80483e4:       7e 0a                   jle    80483f0 <main+0x20>
 80483e6:       ff 45 fc                incl   0xfffffffc(%ebp)
 80483e9:       eb 08                   jmp    80483f3 <main+0x23>
 80483eb:       90                      nop
 80483ec:       8d 74 26 00             lea    0x0(%esi,1),%esi
 80483f0:       ff 4d f8                decl   0xfffffff8(%ebp)

The generalization of the IF-ELSE is therefore:

 ; IF
     cmp    ?, ?
     jcc    L1          ; jump to else condition
     ?      ?, ?        ; body of IF statement
     jmp    L2          ; jump over else
 ; ELSE
 L1:
     ?      ?, ?        ; body of ELSE statement
 ; ENDIF
 L2:

The final form of the IF contains an ELSE-IF clause:

 if (argc > 4) {x++;} else if (argc < 24) {x *= y;} else {y--;}

This compiles to:

 80483e0:       83 7d 08 04             cmpl   $0x4,0x8(%ebp)
 80483e4:       7e 0a                   jle    80483f0 <main+0x20>
 80483e6:       ff 45 fc                incl   0xfffffffc(%ebp)
 80483e9:       eb 1a                   jmp    8048405 <main+0x35>
 80483eb:       90                      nop
 80483ec:       8d 74 26 00             lea    0x0(%esi,1),%esi
 80483f0:       83 7d 08 17             cmpl   $0x17,0x8(%ebp)
 80483f4:       7f 0c                   jg     8048402 <main+0x32>
 80483f6:       8b 45 fc                mov    0xfffffffc(%ebp),%eax
 80483f9:       0f af 45 f8             imul   0xfffffff8(%ebp),%eax
 80483fd:       89 45 fc                mov    %eax,0xfffffffc(%ebp)
 8048400:       eb 03                   jmp    8048405 <main+0x35>
 8048402:       ff 4d f8                decl   0xfffffff8(%ebp)

The generalization of this construct is therefore:

 ; IF
     cmp    ?, ?
     jcc    L1          ; jump to ELSE-IF
     ?      ?, ?        ; body of IF statement
     jmp    L3          ; jump out of IF statement
 ; ELSE IF
 L1:
     cmp    ?, ?
     jcc    L2          ; jump to ELSE
     ?      ?, ?        ; body of ELSE-IF statement
     jmp    L3
 ; ELSE
 L2:
     ?      ?, ?        ; body of ELSE statement
 ; ENDIF
 L3:

An alternative form of the IF will have the conditional jump lead into the code block and be followed immediately by an unconditional jump that skips the code block. This results in more jump statements but causes the condition to be identical with that of the C code (note that in the example above, the condition must be inverted so that the conditional branch will skip the code block associated with the IF).
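The inversion can be made explicit with a small lookup table. The sketch below is written in Python as an illustration only; the table and function names are hypothetical. It assumes AT&T operand order, in which `cmpl $0x4,0x8(%ebp)` followed by `jle` branches when the value at `0x8(%ebp)` is less than or equal to 4, so the second operand of the compare becomes the left-hand side of the C condition.

```python
# Each conditional jump paired with its logical opposite.
INVERSE = {
    "je": "jne", "jne": "je",
    "jl": "jge", "jge": "jl",
    "jle": "jg", "jg": "jle",
    "jb": "jae", "jae": "jb",
    "jbe": "ja", "ja": "jbe",
}

# C operator expressed by each jump. The unsigned forms (jb, jae, jbe, ja)
# map to the same operators as the signed ones, so signedness is lost here.
C_OPERATOR = {
    "je": "==", "jne": "!=",
    "jl": "<", "jge": ">=", "jle": "<=", "jg": ">",
    "jb": "<", "jae": ">=", "jbe": "<=", "ja": ">",
}


def source_condition(jcc, lhs, rhs, skips_body=True):
    """Render the C condition that guards a block of code.

    skips_body=True  : the jump skips the block when taken (the gcc form
                       shown above), so the condition must be inverted.
    skips_body=False : the jump leads into the block (the alternative
                       form), so the condition is used as it stands.
    """
    mnemonic = INVERSE[jcc] if skips_body else jcc
    return f"{lhs} {C_OPERATOR[mnemonic]} {rhs}"
```

For the simple IF above, `source_condition("jle", "argc", "4")` yields `argc > 4`, matching the C source. For the ELSE-IF test, `source_condition("jg", "argc", "23")` yields `argc <= 23`, which is equivalent to the original `argc < 24`; the compiler's choice of constant cannot always be mapped back to the exact expression the programmer wrote.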

Note that most SWITCH statements will look like IF-ELSEIF statements; large SWITCH statements will often be compiled as jump tables.

The generalized forms of the above constructs can be recognized using scripts to analyze the intermediate code produced in the previous section. For example, the IF-ELSE construct:

     cmp    ?, ?
     jcc    L1          ; jump to else condition
     jmp    L2          ; jump over else
 L1:
 L2:

would be recognized by the following code:

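 # The type field holds a string, so compare with eq rather than ==.
 if ( $$insn{type} eq "INSN_CMP" &&
      ${$$insn{next}}{type} eq "INSN_BRANCHCC" ) {
     # The conditional branch targets the first instruction of the ELSE block.
     $else_insn = get_insn_by_addr( ${$$insn{next}}{dest} );
     # Assumes the unconditional jump sits directly before the ELSE block;
     # any alignment padding in between would have to be skipped first.
     if ( ${$$else_insn{prev}}{type} eq "INSN_BRANCH" ) {
         # This is an IF/ELSE
         $endif_insn = get_insn_by_addr( ${$$else_insn{prev}}{dest} );
         insert_before( $insn, "IF" );
         insert_before( ${$$insn{next}}{next}, "{" );
         insert_before( $else_insn, "}" );
         insert_before( $else_insn, "ELSE" );
         insert_before( $else_insn, "{" );
         insert_before( $endif_insn, "}" );
     }
 }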

The insert_before routine adds a pseudoinstruction to the linked list of disassembled instructions, so that the disassembled IF-ELSE in the previous section prints out as:

 IF
 80483e0:       83 7d 08 04             cmpl   $0x4,0x8(%ebp)
 80483e4:       7e 0a                   jle    80483f0 <main+0x20>
 {
 80483e6:       ff 45 fc                incl   0xfffffffc(%ebp)
 80483e9:       eb 08                   jmp    80483f3 <main+0x23>
 80483eb:       90                      nop
 80483ec:       8d 74 26 00             lea    0x0(%esi,1),%esi
 }
 ELSE
 {
 80483f0:       ff 4d f8                decl   0xfffffff8(%ebp)
 }
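The recognition and annotation steps described above can also be tried end to end on a plain list of instructions. The following Python sketch is an illustration under several assumptions: it uses a list of dictionaries in place of the linked list, the field names (`addr`, `type`, `dest`, `text`) and the `annotate_if_else` function are hypothetical, and the decoder is assumed to classify gcc's alignment filler (the `nop` and the no-op `lea`) as type `NOP`. Stepping back over that padding to find the unconditional jump is an addition of this sketch.

```python
PADDING = {"NOP"}  # alignment filler between blocks (assumed classification)


def annotate_if_else(insns):
    """Return the listing as text lines with IF / ELSE markers inserted."""
    index = {insn["addr"]: n for n, insn in enumerate(insns)}
    marks = {}  # instruction index -> pseudo-instructions printed before it
    for n in range(len(insns) - 1):
        cmp_i, jcc = insns[n], insns[n + 1]
        if cmp_i["type"] != "CMP" or jcc["type"] != "BRANCHCC":
            continue
        else_idx = index.get(jcc["dest"])
        if else_idx is None or else_idx <= n + 2:
            continue  # no forward target, or no room for an IF body
        # Step back over padding to the last real instruction of the IF body.
        k = else_idx - 1
        while k > n + 2 and insns[k]["type"] in PADDING:
            k -= 1
        if insns[k]["type"] != "BRANCH":
            continue  # no jump over the ELSE block: not an IF/ELSE
        endif_idx = index.get(insns[k]["dest"])
        if endif_idx is None:
            if insns[k]["dest"] <= insns[-1]["addr"]:
                continue  # target falls inside an instruction; not handled
            endif_idx = len(insns)  # ENDIF lies just past the listing
        marks.setdefault(n, []).append("IF")
        marks.setdefault(n + 2, []).append("{")
        marks.setdefault(else_idx, []).extend(["}", "ELSE", "{"])
        marks.setdefault(endif_idx, []).append("}")
    lines = []
    for n in range(len(insns) + 1):
        lines.extend(marks.get(n, []))
        if n < len(insns):
            lines.append(insns[n]["text"])
    return lines


# The IF-ELSE listing from earlier in this section.
IF_ELSE = [
    {"addr": 0x80483E0, "type": "CMP", "text": "80483e0: cmpl $0x4,0x8(%ebp)"},
    {"addr": 0x80483E4, "type": "BRANCHCC", "dest": 0x80483F0,
     "text": "80483e4: jle 80483f0"},
    {"addr": 0x80483E6, "type": "INC", "text": "80483e6: incl 0xfffffffc(%ebp)"},
    {"addr": 0x80483E9, "type": "BRANCH", "dest": 0x80483F3,
     "text": "80483e9: jmp 80483f3"},
    {"addr": 0x80483EB, "type": "NOP", "text": "80483eb: nop"},
    {"addr": 0x80483EC, "type": "NOP", "text": "80483ec: lea 0x0(%esi,1),%esi"},
    {"addr": 0x80483F0, "type": "DEC", "text": "80483f0: decl 0xfffffff8(%ebp)"},
]
```

Running `print("\n".join(annotate_if_else(IF_ELSE)))` places `IF` before the `cmpl`, `{` before the `incl`, `}`, `ELSE`, and `{` before the `decl`, and a closing `}` at the end. Markers are collected in a side table and merged at print time, so the instruction list is never modified while it is being scanned.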

Scripts that generate such output, perhaps supplemented by an analysis of the conditional expression governing each flow control construct, bring the output of a disassembler closer to the high-level language source code from which the target was compiled.
