Using a Decompiler or Disassembler to Reverse Engineer a Program

Reverse engineering of computer programs generally involves the study and analysis of a program's binary code to determine its inner workings. In the case of a program for which only the binary version is available, an assembly code or high-level source code version can be created for study using a decompiler or a disassembler. A decompiler is a program that converts another program's binary code into a high-level programming language such as Java, C#, or C. A disassembler performs a similar function but returns an assembly language version. Neither decompilation nor disassembly requires the user to actually run the target program, which is helpful when the target binary is potentially malicious software (a worm, Trojan horse, etc.). In this section, we discuss how decompilers and disassemblers can be used to identify security vulnerabilities that might be exposed in a program that is distributed only in binary form. We also explain how some people analyze security patches to uncover details of the original security bugs.

Understanding Differences Between Native Code and Bytecode Binaries

A program's binary typically comes in one of two forms: native code or bytecode. Native code contains operations in machine language that run directly on a computer's processor. Bytecode binaries do not run on the processor directly, but instead contain intermediate code. Bytecode examples include Microsoft Intermediate Language (used for the Microsoft .NET Framework binaries) and Java bytecode (used by the Java Virtual Machine). When a binary containing bytecode is executed, the intermediate code is translated into machine code by an interpreter and is executed by the processor. The Common Language Runtime is the .NET interpreter, and Java's is the Java Virtual Machine.

Bytecode binaries contain more information than native code binaries do. This allows for more direct translation from a bytecode binary to the original source code and makes bytecode decompilers very effective. Several decompilers were created specifically for Java and .NET binaries. Understanding a decompiler's output for a bytecode binary is much easier than understanding the disassembly of a native binary. To illustrate this point, the following shows the differences between the original source code of a simple C# application and the decompiled code. The decompiler used in this example is .NET Reflector (http://www.aisto.com/roeder/dotnet/).

Tip

Decompilers exist for translating native binaries into C. Currently, these decompilers don't yield very reliable results. For this reason, a disassembler should be used for native code binaries.

The following is the original source code from a simple C# application, LaunchBrowser, which is included on this book's companion Web site:

 private bool IsValidURL(string URL)
 {
     return URL.StartsWith("http://") || URL.StartsWith("https://");
 }

A function named IsValidURL takes one parameter named URL and returns true if URL begins with the string http:// or https://. Otherwise, it returns false.

Note

Symbols are files created when a program's binaries are built from source code. These files contain information that can help debug the application later. Symbol files include information about global variables and function names, as well as information that maps code in the binary back to lines in the source code. Symbols created by Microsoft Visual Studio have the extensions .dbg and .pdb. For more information about these symbol files, see http://support.microsoft.com/kb/121366.

Using a decompiler without access to public or private symbols produces the following code from the binary for the same function:

 private bool IsValidURL(string URL)
 {
     if (!URL.StartsWith("http://"))
     {
         return URL.StartsWith("https://");
     }
     return true;
 }

Wow! These are pretty similar, aren't they? The function name and variable are the same. The code is slightly different, but it does exactly the same thing. If you're a programmer, looking at the decompiled binary reinforces what your first computer science professor likely mentioned to you: if you believe one of the conditions joined by an or operator in an if statement is more likely to be true, the code runs more efficiently when the more likely condition is placed first, because the second condition is evaluated only when the first is false.
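The equivalence of the two forms does not depend on .NET. The following is a minimal sketch in C, used here only for illustration; the names starts_with, is_valid_url_original, and is_valid_url_decompiled are hypothetical and are not part of LaunchBrowser. It mirrors the original source and the decompiled output side by side so the short-circuit behavior is easy to compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if s begins with prefix. */
bool starts_with(const char *s, const char *prefix)
{
    assert(s != NULL && prefix != NULL);
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Mirrors the original source: a single short-circuit OR. */
bool is_valid_url_original(const char *url)
{
    return starts_with(url, "http://") || starts_with(url, "https://");
}

/* Mirrors the decompiled form: the second test runs only if the first fails. */
bool is_valid_url_decompiled(const char *url)
{
    if (!starts_with(url, "http://")) {
        return starts_with(url, "https://");
    }
    return true;
}
```

Both functions return the same result for every input; the decompiled form simply spells out the branch that the || operator implies.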

Because the decompiled version of the binary is extremely similar to the original source code, the decompiled version can be reviewed for security bugs in the same manner the original source can be reviewed. The rest of this chapter focuses on basic information about how disassemblers can be used on native binaries to find security bugs.

More Info

For more information about using disassemblers to find security bugs, see the book Exploiting Software: How to Break Code by Greg Hoglund and Gary McGraw.

Modern compilers often optimize and remove unnecessary code. By disassembling a binary, you can see exactly what is executed. An interesting example is the ZeroMemory function. For security reasons, programmers often remove sensitive information from memory after the data is no longer needed. A common way to do this is to call ZeroMemory, which fills the memory with zeros. Although this sounds like a good solution, many compilers treat the ZeroMemory call as unnecessary because the memory is never read again, and they do not include the call in the binary. This means that in the actual program the memory is not filled with zeros and still contains the sensitive information. This problem isn't obvious from the original source code, but it is obvious in the disassembled binary. To address this particular concern, Microsoft created the SecureZeroMemory function, which is not optimized out of the binary.

More Info

For more detailed information about ZeroMemory vs. SecureZeroMemory, see pages 322-326 in Writing Secure Code, 2nd edition, by Michael Howard and David LeBlanc (Microsoft Press).
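The dead-store problem described above can be sketched in portable C. This is an illustration under stated assumptions, not the actual implementation of SecureZeroMemory: secure_zero and use_secret are hypothetical names, and the sketch relies on the common practice of writing through a volatile pointer, which compilers in practice do not remove.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a secure wipe: each store goes through a
   volatile pointer, so the optimizer treats it as observable. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    assert(buf != NULL || len == 0);
    while (len--) {
        *p++ = 0;
    }
}

/* Copies a secret into a local buffer, uses it, then wipes the buffer.
   Returns a simple byte sum so the buffer is actually read. */
unsigned use_secret(const char *secret)
{
    char local[64];
    unsigned sum = 0;
    size_t i;
    size_t n = strlen(secret);

    if (n >= sizeof local) {
        n = sizeof local - 1;
    }
    memcpy(local, secret, n);
    local[n] = '\0';
    for (i = 0; i < n; i++) {
        sum += (unsigned char)local[i];
    }

    /* A plain memset(local, 0, sizeof local) here is a dead store:
       local is never read again, so an optimizer may drop the call. */
    secure_zero(local, sizeof local);
    return sum;
}
```

Whether a given compiler really drops the plain memset depends on the compiler and optimization level, which is exactly why checking the disassembly is the reliable way to find out.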

Important

Obfuscators and packers/protectors are two types of tools that attempt to hinder reverse engineering by making it difficult to disassemble/decompile a binary or by making the results difficult to understand. Obfuscators scramble binaries (both native code and bytecode). Packers and protectors compress the code and at run time uncompress and execute the code. If a binary is packed or protected, the results of decompiling/disassembling it will contain only the unpacking code and the compressed data. Although these tools can slow down the reverse engineering process, they cannot prevent it completely. In both situations, all of the code in the binary is still available, but might be harder to read. For example, the compression routine inside a packed/protected binary can be reverse engineered, which enables you to uncompress the binary back to its unpacked/unprotected state.
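To see why packing only delays analysis, consider a toy model. The sketch below is not how any real packer works; toy_pack and toy_unpack are hypothetical names, and run-length encoding stands in for a real compression scheme. The point is that the unpacking routine must ship with the binary, so anyone who reads it can reproduce the original bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy "packer": encodes the input as (count, byte) pairs.
   Returns the number of bytes written, or 0 if out is too small
   (an empty input also yields 0). */
size_t toy_pack(const unsigned char *in, size_t n,
                unsigned char *out, size_t cap)
{
    size_t i = 0;
    size_t w = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        unsigned char b = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == b && run < 255) {
            run++;
        }
        if (w + 2 > cap) {
            return 0;
        }
        out[w++] = (unsigned char)run;
        out[w++] = b;
        i += run;
    }
    return w;
}

/* Toy "unpacking stub": reverses toy_pack. Returns bytes written,
   or 0 on malformed input or insufficient space. */
size_t toy_unpack(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    size_t i;
    size_t w = 0;
    assert(in != NULL && out != NULL);
    if (n % 2 != 0) {
        return 0;
    }
    for (i = 0; i < n; i += 2) {
        size_t run = in[i];
        if (w + run > cap) {
            return 0;
        }
        while (run--) {
            out[w++] = in[i + 1];
        }
    }
    return w;
}
```

A disassembler pointed at the packed form sees only the pairs and the stub, but running the stub's logic offline recovers the original bytes exactly.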

Spotting Insecure Function Calls Without Source Code

Entire books have been written on how various functions can lead to security vulnerabilities. You've seen many examples of such vulnerabilities referenced throughout this book. If you have access to source code, a common way to find security bugs is to search for commonly misused functions. Many people erroneously believe that keeping source code private makes a product more secure. However, even without source code, you can perform code reviews using an approach similar to the one used when source is available.

If you don't have access to the source code, how do you know whether any of these problematic functions are used? You could use the monitoring tools discussed earlier in this chapter to discover the functions, but some other functions are compiled into the program's binary and won't be picked up by monitoring tools. Because some disassemblers recognize the assembly instructions that make up these functions and flag them as the original function name, you can disassemble the binary to better understand whether and how these functions are used.

Black box testers are only able to test code that they know how to exercise. A white box tester can search through the product's code looking for certain insecure coding practices, for example, by finding all instances of the strcpy function. Once a suspicious piece of code is identified, the code can be traced backward to identify how an attacker could call into that code. By using a disassembler, you can take this approach against a binary without access to the original source code. The following example shows how to identify format string vulnerabilities without access to the original source code.
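As a rough first pass before opening a disassembler, you can scan a binary's raw bytes for the names of commonly misused functions. The following is a minimal sketch, assuming the file contents are already in memory; count_name is a hypothetical helper. It finds names only where they appear as text (for example, in import tables or symbol data). Functions compiled directly into the binary leave no name behind, which is why a disassembler's library recognition is still needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts occurrences of the ASCII string name inside an arbitrary byte
   buffer. Matching is by substring, so searching for "strcpy" also
   counts names such as "strcpy_s". */
size_t count_name(const unsigned char *data, size_t len, const char *name)
{
    size_t n;
    size_t i;
    size_t hits = 0;

    assert(data != NULL && name != NULL);
    n = strlen(name);
    if (n == 0 || n > len) {
        return 0;
    }
    for (i = 0; i + n <= len; i++) {
        if (memcmp(data + i, name, n) == 0) {
            hits++;
        }
    }
    return hits;
}
```

A nonzero count only tells you the name is present somewhere in the file; it says nothing about how, or whether, the function is called.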

Tip

One of the more widely used commercial disassemblers is the Interactive Disassembler (IDA) ( http://www.datarescue.com ). IDA Pro is the professional version of this product. In this text, IDA Pro is referred to simply as IDA.

IDA makes it easy to navigate and comment the disassembled code, and it includes a basic debugger. Another strength of IDA is that it is able to identify common library routines compiled into the binary and can automatically comment these. For example, IDA can recognize the assembly code for strcpy in a binary created by common compilers and can call this out. Features like this are great time-savers. Because of its advanced functionality and ease of use, IDA is used to reverse engineer native binaries in this chapter.

Example: Finding Format String Vulnerabilities Without Source Code

Load the sample program Formatstring.exe (which you can find on the companion Web site). The program takes the path and name of a text file as a command-line parameter. Run the program, and you will quickly see that the program echoes the contents of the specified file to the console. Based on the name of this program and its simplicity, you likely have a good idea of how to use the black box approach to test for security bugs. Working through the following example will help you understand an approach that, in more complex programs, will uncover bugs not obvious to the black box tester.

Step 1: Disassembling the binary

Disassemble Formatstring.exe by opening it in IDA. Part of the IDA autoanalysis feature is the capability to identify the assembly instructions that make up such common functions as strcpy and printf. The analysis happens automatically after the binary is disassembled.

Step 2: Understanding which functions are called

After IDA has finished its autoanalysis, you can see whether Formatstring.exe makes any calls to potentially dangerous functions. One of the potentially dangerous functions is printf. Look in the IDA Functions Window, as shown in Figure 17-12, to see whether the target (Formatstring.exe) calls printf. The printf function is used by this program. Time to investigate how printf is used.

Figure 17-12: The IDA Functions Window

Tip

Please read Chapter 9, Format String Attacks, now if you haven't already because this section won't make a lot of sense unless you understand the dangers of format strings.

Step 3: Finding calls to unsafe functions

To investigate whether printf is called in an unsafe manner, double-click _printf in the Functions Window. This places the Disassembly window on line 0x00401149. You want to know all of the places this program calls printf, which can be determined by scrolling up a few lines. The beginning of the code listing for the printf function is _printf proc near. Click the word _printf to select it, and then press X to display a list of all the places the target program calls printf. There are five calls to printf, as shown in Figure 17-13.

Figure 17-13: The five calls to the printf function

Step 4: Determining whether a function call is made in a potentially unsafe manner

Remember that calls to printf can be written as either printf("%s", szVariable); or printf(szVariable);. If an attacker can control szVariable, only the second form is a format string bug. Time to check whether there are calls like this to printf in Formatstring.exe.
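The difference between the two call forms can be demonstrated safely. The following is a minimal sketch in which format_safe and format_unsafe are hypothetical names and snprintf stands in for printf so the result can be inspected. The demonstration input uses only %%, which consumes no arguments; input containing specifiers such as %s, %x, or %n would make the unsafe form read or write memory it was never given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Safe form: the untrusted text is passed as data and copied verbatim. */
int format_safe(char *out, size_t cap, const char *untrusted)
{
    assert(out != NULL && untrusted != NULL);
    return snprintf(out, cap, "%s", untrusted);
}

/* Unsafe form: the untrusted text IS the format string, so any
   conversion specifiers it contains are interpreted. Many compilers
   warn about this line, which is the point of the example. */
int format_unsafe(char *out, size_t cap, const char *untrusted)
{
    assert(out != NULL && untrusted != NULL);
    return snprintf(out, cap, untrusted);
}
```

Given the input 100%% done, the safe form reproduces all ten characters, while the unsafe form collapses %% to a single % and produces nine, showing that the input was interpreted rather than printed.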

To investigate whether printf is called in an unsafe manner, select the first item in the Xrefs To _Printf window as shown in Figure 17-13, and click OK. IDA shows the first place printf is called; you should see the following code:

Only one PUSH operation comes before making this call, meaning that printf is called with only one parameter. If this parameter is controlled by the attacker, it's a bug. In this case, the constant string beginning with "Could not process the entire file" is being pushed onto the stack. (You can verify that the string is from a read-only part of memory by pressing Shift+F7 to look at the IDA segments table and checking the columns to ensure it is read-only.) An attacker can't control a constant string, so this isn't a bug.

Repeat steps 3 and 4 for each call to the printf function. It isn't necessary to use the Functions Window to select the original printf function. To save time, you can select any appearance of printf and then press X to bring up the Xrefs To _Printf window.

The second call to printf is very similar to the first and uses a constant string. The third and fourth calls are a little different and look like the following:

There are two pushes onto the stack before the call to printf. Because arguments are pushed from right to left, the second PUSH supplies the first parameter to printf. This pushes a constant string containing a format string onto the stack. In this example, the string is "Error opening the file '%s'". The second parameter (referenced by the EDX register) is formatted with the %s in the first parameter. This means the printf call looks like printf("Error opening the file '%s'", szVariable);. This isn't a format string bug.

The fifth and final printf call contains only one parameter, and it doesn't appear to be a constant string. Here is the disassembled fifth call to printf:

This call takes the form printf(variable);. If the ECX register references data the attacker controls, this is a format string bug.
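The push-counting reasoning used for each call can be expressed as a simple heuristic. The following is a rough sketch, assuming the disassembly is available as an array of instruction strings; pushes_before_call is a hypothetical helper, not an IDA feature. It counts PUSH instructions between the previous call and the call of interest, which approximates the argument count for calls that pass arguments on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scans backward from the instruction at call_idx and counts "push"
   instructions until the previous "call" or the start of the listing.
   This is only a heuristic: it can miscount when a function prologue,
   a branch target, or a non-stack calling convention is involved. */
size_t pushes_before_call(const char *const *listing, size_t call_idx)
{
    size_t count = 0;
    size_t i = call_idx;

    assert(listing != NULL);
    while (i > 0) {
        const char *ins = listing[--i];
        if (strncmp(ins, "call", 4) == 0) {
            break;
        }
        if (strncmp(ins, "push", 4) == 0) {
            count++;
        }
    }
    return count;
}
```

A count of one for a call to printf marks a candidate worth inspecting: the single argument is the format string, so the remaining question is whether it is a constant or attacker-controlled data.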

Step 5: Determining whether an attacker can control the data

You've found a printf call that needs investigation. How can you determine whether the ECX register is pointing to attacker-controlled data? There are a few ways: tracing forward through the code from wherever attacker data could enter the application and seeing whether it ever reaches the printf call; tracing backward through the code starting at the suspicious printf call and figuring out how the value in ECX is determined; and setting a breakpoint on the suspicious printf call and checking the contents that ECX references.

Tracing code from the entry points could be very time-consuming. Tracing code backward is faster, but still somewhat time-consuming. Setting a breakpoint and checking the contents of the parameter is fast, but doesn't guarantee you'll hit all code paths. For example, the parameter could contain data that isn't controlled by an attacker (or you in your tests), but under different conditions it could be. Using the breakpoint approach alone will not ensure you hit all code paths for making this printf call.

There is never enough time to completely test an application; good testers are constantly doing cost/benefit analyses. In this situation, it would be most efficient to use the breakpoint approach. If this doesn't yield a bug, you can follow up with the more thorough backward-tracing approach, which in this case shows that ECX is used as lpBuffer in a call to the ReadFile API and therefore receives the contents of the file.

One way to set the breakpoint is to compute the offset address of the assembly instruction when the binary is loaded in memory at run time and set a breakpoint using a debugger, but there's an even easier way. IDA Pro versions 4.5 and later include an integrated debugger. Select the suspicious printf call, and set a breakpoint on it by pressing F2. Because Formatstring.exe takes a filename on the command line, you need to specify that IDA uses command-line parameters when debugging. Select Debugger, and then click Process Options. For the parameters, enter c:\temp\input.txt. Now create a file with that name containing the text Test Input.

Start the debugger by pressing F9. You should hit the breakpoint on the suspicious printf call. Remember, the register storing the value pushed onto the stack is ECX. You want to see the value referenced by the ECX register. Click in the IDA View-ESP window, and press G (to go to a memory address). You are prompted to enter an address; type ECX, and then click OK.

You should see the memory contents referenced by ECX. This data isn't currently displayed as a string. To display it as a string, click the line where ECX is noted, and press A, which displays bytes as a string. The data is shown as a string (see Figure 17-14). Does this data look familiar? It's the contents of the input file. If attackers can control the contents of this input file, they can exploit this bug.

Figure 17-14: Using the IDA debugger to determine there is a format string bug

As you can see, it isn't necessary to have the source code or symbols to perform a review of suspicious functions. Don't be fooled into thinking that not shipping the source code provides security protection. As you'll see in the next section, a program's algorithm can be uncovered using an approach similar to this one.

Reverse Engineering Algorithms to Identify Security Flaws

Auditing commonly misused functions is a good way to find bugs quickly, but is an approach that can miss any design or implementation flaw unrelated to those functions. Going a step deeper than the preceding example, you can use a disassembler to ascertain the entire algorithm and the implementation details of the disassembled binary. This section discusses how a security flaw can be identified and exploited by understanding the disassembled code.

Tip

The Open Reverse Code Engineering Web site (http://www.openrce.org/), started by Pedram Amini, is dedicated to sharing knowledge about reverse engineering and is worth checking out.

Example: Finding an Implementation Flaw in Authentication Code

The RemoteAuth.exe program included on the companion Web site allows you to create accounts and authenticate these accounts by using the correct password.

To test the authentication scheme, you need a test account. Create a user named user1 with the password 2_ManySec3ts with the following command line:

 RemoteAuth.exe NEW user1 2_ManySec3ts

You can successfully authenticate with the following command line:

 RemoteAuth.exe AUTH user1 2_ManySec3ts

Modify the password in the preceding command line and attempt to authenticate again to see what happens when authentication fails. When the correct user name and password are supplied, the text Access Granted is displayed. When authentication fails, Access Denied is displayed.

There's a security hole that allows an attacker to authenticate as a user without knowing the user's password. You'll have a hard time finding this flaw quickly (if at all) using the black box approach. However, the disassembly makes it clear that there is a major problem.

For brevity we don't include detailed step-by-step instructions, but instead focus on the important areas. Feel free to use the information you learned from the previous examples to follow along.

Important

For readability we've added comments to the following disassembly examples. If you are following along in IDA, you won't see many of these comments. If you know a little assembly, you will likely be able to determine what the code is doing without reading these comments.

Step 1: Hashing together the entered user name and password

Lines 0x00401291-0x0040135A of the disassembly include fairly lengthy but easy-to-read code that calls the following:

CryptAcquireContextW to get a cryptographic service provider
CryptCreateHash to specify that a SHA1 hash be created
CryptHashData to add the user name to the hash
CryptHashData to add the password to the hash
CryptGetHashParam to retrieve the newly created hash

This hash, along with the user name, is stored in a file named Secrets.txt.

Step 2: Retrieving the password from Secrets.txt

The following code retrieves the requested user name from Secrets.txt:

The preceding assembly shows how the user names and hashes are retrieved. The user name and hash are both retrieved as part of the fread call. Because the user name is NULL-terminated in Secrets.txt, a string compare compares only up to the NULL and does not include the hash. A simplified pseudocode listing for the preceding disassembly follows. We include the corresponding line numbers from the disassembly in the comments on the right.

 ReadAndCompareUsrName:
     bytesRead = fread(buffer, 1, 384, file);              //Lines 0x401400-401410
     if (bytesRead != 384) goto AccessDenied;              //Lines 0x401418-40141D
     bReturn = strcmp(UserNameEntered, UserNameFromFile);  //Lines 0x401431-401437
     if (bReturn == 0) goto ComparePasswdHashes;           //Lines 0x40143A-0x40143C
     else goto ReadAndCompareUsrName;                      //Line 0x401444

Step 3: Comparing the password hashes

The following disassembly makes the decision whether authentication will succeed or fail. Can you spot the flaw? Hint: The added comments are very telling.

The flaw is on line 0x0040144D. Only the first byte of each hash is compared. If an attacker can enter a password that, when hashed, has the same first character as the hash stored in Secrets.txt, the attacker will be able to log on without using the correct password. For example, the password aaaaalh can be used to successfully authenticate the user1 account.

You can find such a password by reading the assembly code to understand how RemoteAuth.exe creates hashes (discussed in step 1 of this example) and then using the same algorithm to brute force passwords until the resulting hash has the same first character as the target hash.

Even if the attacker cannot determine the first character of the target hash (because he or she might not have Read access to Secrets.txt), the attacker can use the hash generation algorithm to reduce the number of logon attempts required to gain access to an account by never attempting two passwords that result in a hash with the same first character.
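The flaw can be modeled in a few lines. The following is a minimal sketch and not the code of RemoteAuth.exe: hashes_match_flawed and hashes_match_full are hypothetical names, and the 20-byte length is simply the size of a SHA-1 digest. It contrasts a comparison that looks only at the first byte with one that compares the whole digest.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define HASH_LEN 20 /* a SHA-1 digest is 20 bytes */

/* Flawed check, modeled on the behavior described above:
   only byte 0 of each hash is compared. */
bool hashes_match_flawed(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return a[0] == b[0];
}

/* Intended check: every byte of the digest must match. */
bool hashes_match_full(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, HASH_LEN) == 0;
}
```

Because a single byte has only 256 possible values, a guessed password passes the flawed check roughly once in every 256 attempts, assuming the hash output is evenly distributed. The full comparison leaves no such shortcut.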

By examining the disassembled binary and using assembly knowledge, it takes only a few minutes to find this bug. Without examining the disassembly or source code, you would likely miss this bug, or it could take you a long time to find.

Tip

Some binary analysis can be automated by using existing IDA plug-ins and scripts or by creating new ones. The IDA Palace (http://www.backtrace.de/) contains several plug-ins and scripts that are useful for this purpose.