Disassembling IL and Round-Tripping | Advanced .NET Programming

We have spent virtually the whole of the last two chapters examining IL assembly code that we've handwritten. However, it's also possible to examine and edit IL assembly that has been generated by disassembling assemblies using the ildasm utility. For the rest of this chapter we will examine code that has been generated in this way.

We mentioned earlier that one difference between IL and higher-level languages is the ease with which IL can be disassembled into IL assembly. In fact, it is perfectly possible both for ilasm.exe to read ildasm.exe output (provided there was no embedded native code in the original assembly), and for ildasm.exe to read ilasm.exe output. One of the specific aims Microsoft had when designing ilasm and ildasm was the ability to perform round trips. We pointed out one use for this at the end of the last chapter when we discussed debugging IL generated from a high-level language. There are also a couple of other uses for round-tripping:

You can hand-modify a compiled assembly. If you have an assembly (for example, one that was generated by the C# or VB.NET compiler), you can use ildasm to convert its contents to ILAsm text, make whatever changes you wish to make to it, and then use the ilasm tool to convert the results back into a binary assembly.
You'll also need to use this technique if you wish to create a single file that contains IL code generated from multiple languages. Compile the segments written in each language separately, and then use ildasm to disassemble the assemblies. You can then hand-edit the .il files produced to merge them into one file, and use ilasm.exe to convert this back into an assembly.

We won't illustrate these techniques in detail here, since anything we write concerning the details may change in the near future. At present there is no support in any of the .NET tools for automating the process of round-tripping and editing IL code. Nor is there any support in any of the Microsoft compilers for writing embedded IL code in your high-level source code - something that would have made it much easier to use IL when you wish to. At present, it seems unlikely that any such support will be added in the near future, but it is likely that third-party tools on the web will appear to assist in automating the round-tripping process, so if you are thinking of doing that it would be worthwhile checking what's available.

It's also worth pointing out that there are tools available on the Internet that not only disassemble assemblies, but convert the IL into equivalent high-level code, making it a lot easier to read. The best known of these is almost certainly the free tool, anakrino, which is available at http://www.saurik.com. If you do decide to try out anakrino or similar tools, however, do be aware of licensing issues - your software licenses will almost certainly not permit you to run this type of tool on many of the assemblies installed on your machine. In many cases this restriction will apply to running ildasm.exe on assemblies as well.

Comparing IL Emitted by C#, VB and MC++

In this section, we will use ildasm.exe to examine the IL generated by the C#, VB, and MC++ compilers - this will both teach us a little more IL, and start to familiarize us with some of the differences between the compilers.

The samples we look at here were generated with version 1.0 of the .NET Framework. Although the general principles should still hold, you may find that you get slightly different IL code from that presented here if you try out these samples using a later version of the framework, since it's always possible that Microsoft will make improvements to either the compilers or the ildasm.exe utility.

C#

For this test, we'll have VS.NET generate a C# console application called CSharp for us and modify the code so that it looks like this:

 using System;
 namespace CSharp
 {
    class Class1
    {
       [STAThread]
       static void Main(string[] args)
       {
          string helloWorld = "Hello, World!";
          Console.WriteLine(helloWorld);
       }
    }
 }

We then compile this program, making sure we select the Release configuration in VS.NET, since we don't want the IL we examine to be cluttered up with debugging code (or if you are doing this at the command line, specify that optimizations should be turned on with the /o flag).

The full listing file produced by ildasm.exe is too long to show in full. We'll just pick some highlights from it.

Before the Main() method is defined, we encounter an empty class definition:

 //
 // =============== CLASS STRUCTURE DECLARATION ================
 //
 .namespace CSharp
 {
    .class private auto ansi beforefieldinit Class1
           extends [mscorlib]System.Object
    {
    } // end of class Class1
 } // end of namespace CSharp

One aspect of IL assembly I haven't mentioned is that it's possible to close and reopen class definitions within the same source file (although this is purely a feature of IL assembly - this structure isn't persisted to the binary IL). ildasm.exe emits ILAsm code that takes advantage of this feature. Doing so serves the same purpose as forward declarations in C++ - it prevents certain possible issues to do with types being used before they are defined, which could confuse ilasm.exe if it is called on to regenerate the assembly.

Then we come to the actual definition of Class1, the class that contains the Main() method:

 .namespace CSharp
 {
   .class private auto ansi beforefieldinit Class1
          extends [mscorlib]System.Object
   {
     .method private hidebysig static void
             Main(string[] args) cil managed
     {
       .entrypoint
       .custom instance void [mscorlib]System.STAThreadAttribute::.ctor() =
                                                           ( 01 00 00 00 )
       // Code size       13 (0xd)
       .maxstack  1
       .locals init (string V_0)
       IL_0000:  ldstr      "Hello, World!"
       IL_0005:  stloc.0
       IL_0006:  ldloc.0
       IL_0007:  call       void [mscorlib]System.Console::WriteLine(string)
       IL_000c:  ret
     } // end of method Class1::Main
     .method public hidebysig specialname rtspecialname
             instance void  .ctor() cil managed
     {
       // Code size       7 (0x7)
       .maxstack  1
       IL_0000:  ldarg.0
       IL_0001:  call       instance void [mscorlib]System.Object::.ctor()
       IL_0006:  ret
     } // end of method Class1::.ctor
   } // end of class Class1
 } // end of namespace CSharp

There are several points to notice about this code. The actual class definition is marked with several flags that we have already encountered, as well as one that we have not: beforefieldinit. This flag indicates to the CLR that there is no need to call a static constructor for the class before any static methods are invoked (though it should still be called before any static fields are accessed). The flag seems a little pointless in this particular case since Class1 does not have a static constructor anyway, but it can be a useful performance optimization where there is a static constructor but no static method requires any field to be pre-initialized.

There are several comments in the code, indicating the start and end of various blocks. These comments are not of course emitted by the C# compiler or present in the assembly - they are generated by ildasm.exe. The same applies for the name of the local variable, V_0, and to the statement labels that appear by every statement. These give what amounts to a line number - the offset of the opcode for that instruction relative to the start of the method.

The Main() method contains a local variable, which holds the string to be displayed. You or I could easily see from a simple examination of the C# code that this variable is not actually required and could be easily optimized away. The fact that it is present in the IL illustrates a feature of both the C# and the VB compilers: to a large extent, they leave optimization to the JIT compiler, performing very little optimization of the code themselves.

There's another related aspect of the Main() method which is not immediately obvious but which deserves comment. Look at how the ILAsm instructions match up to the C# instructions:

 string helloWorld = "Hello, World!";
       IL_0000:  ldstr      "Hello, World!"
       IL_0005:  stloc.0
 Console.WriteLine(helloWorld);
       IL_0006:  ldloc.0
       IL_0007:  call       void [mscorlib]System.Console::WriteLine(string)
 (Return implied)
       IL_000c:  ret

Do you notice how there is a neat break, and how each set of IL instructions corresponding to a C# statement leaves the stack empty? That's a general feature of how many high-level compilers work. I've not yet observed any code generated by the C# or VB compilers that doesn't follow this practice, although the C++ compiler does seem to use the stack in a slightly more sophisticated way.

There are performance and security implications of this. On the performance side, it means that these compilers won't perform optimizations that cause the stack to contain values that are used between different high-level language statements. The JIT compiler will probably perform these optimizations anyway, but at the cost of a tiny performance penalty the first time each method is executed. On the security side, this feature of high-level compilers makes it particularly easy for decompilers to reverse-engineer IL code - you just look for the points where the stack is empty and you know that's the boundary between high-level statements. We'll explore that issue more in the security chapter, and look at the precautions you can take against it.

However, you will observe that in the IL code that I've written by hand for the samples in this and the previous chapter, I've generally made much more sophisticated use of the evaluation stack, for example loading data onto it which will only be used several operations further on. The potential implications for improved performance and security against reverse engineering if you hand-code methods in IL should be obvious.

The [STAThread] attribute that is applied to the Main() method by the VS.NET wizard when it generates the code for a C# application has made it through to the IL code. STAThreadAttribute (which indicates the threading model that should be used if the method makes any calls to COM) is an example of a custom attribute that is largely ignored by the CLR. It only comes into play if COM Interop services are called on.

One other point to observe is that the IL code contains a constructor for the Class1 class, even though none is present in the C# code. This is a feature of C#: if you don't supply any constructors for a class, then the definition of C# requires that the compiler supply a default constructor which simply calls the base class. This is actually quite important. Although in our code so far we don't instantiate the Class1 class, we might in principle do so if we develop the code further. And as we saw when we examined instantiation of reference types, newobj requires a constructor as an argument. So if the C# compiler didn't supply a default constructor to classes where you don't specify one yourself, then it would be impossible to ever instantiate those classes.

VB

Now let's examine the equivalent code produced by the VB compiler. Once again for this I've used a file generated by VS.NET. That's important, since in the case of VB, VS.NET maintains some information such as namespaces to be referenced and the program entry point as project properties, to be supplied to the VB compiler as parameters, rather than as source code. The VB source code looks like this:

 Module Module1
    Sub Main()
       Dim HelloWorld As String = "Hello, World!"
       Console.WriteLine(HelloWorld)
    End Sub
 End Module

The VB code defines a module - which can be confusing since a VB module is not remotely the same thing as an IL module. The VB Module keyword declares a sealed class that is permitted to contain only static methods (as opposed to a module in an assembly, which is a file that forms part of the assembly). We can see the difference from the ILAsm file. Here's what the Module declaration in VB has become:

 .namespace VB
 {
   .class private auto ansi sealed Module1
          extends [mscorlib]System.Object
   {
   } // end of class Module1
 } // end of namespace VB

Although the sealed keyword is there, there is no way in IL to mark a class as only containing static methods - that's a requirement that is enforced by the VB compiler at source-code level.

Note that the VB namespace comes from the VS.NET project properties - by default VS.NET gives VB projects a namespace name that is the same as the project name. You can change this using the project properties.

On the other hand, here's the 'real' module declaration:

 .module VB.exe
 // MVID: {}
 .imagebase 0x11000000
 .subsystem 0x00000003

As usual, the prime module (in this case the only module) has the same name as the assembly.

Now let's examine the code for the Main() method:

 .method public static void Main() cil managed
 {
   .entrypoint
   .custom instance void [mscorlib]System.STAThreadAttribute::.ctor() =
                                                              ( 01 00 00 00 )
   // Code size       13 (0xd)
   .maxstack  1
   .locals init (string V_0)
   IL_0000:  ldstr       "Hello, World!"
   IL_0005:  stloc.0
   IL_0006:  ldloc.0
   IL_0007:  call        void [mscorlib]System.Console::WriteLine(string)
   IL_000c:  ret
 } // end of method Module1::Main

This code is virtually identical to the C# code, illustrating the fact that in practical terms there is very little difference between VB and C# for those IL features that both languages implement - apart of course from the source-level syntax. There are, however, a couple of points here that illustrate minor differences between the approaches taken by the languages. Recall I said earlier in the chapter that C# uses hidebysig semantics for all methods, whereas VB does so only when overriding. You can see that here - in C# the Main() method is marked as hidebysig, in VB it isn't. There's also evidence of the VB compiler's slightly greater tendency to hide some features from the developer - much less than in the days of VB6, but still present to a small extent: the STAThreadAttribute has made its way into the VB-generated IL, even though it wasn't present in the source code. The C# compiler only emits it if it's there in source code. Which approach you prefer depends on precisely where you prefer to strike the balance between how much work you have to do, and how fine a degree of control over your code you want.

MC++

Now let's bite the bullet and look at the C++ code. I'll warn you here that you can expect a more complex assembly due to the greater power of C++. In fact, I'm actually going to make things a bit more complicated still by putting two Console.WriteLine() calls in the C++ source code. One will take a string as a parameter, and the other one will take a C++ LPCSTR - a C-style string - or as close as we can get to a C-style string when using managed code. The reason I'm doing this is because I want to show you how the C++ compiler copes with LPCSTR declarations. It uses the technique I showed earlier for defining a class to represent binary data that is embedded in the metadata - and as we'll see very soon, this can lead to some very strange looking entries when you examine the file using ildasm.

This is the C++ code we will use. Once again it's generated as a VS.NET project - in this case called CPP. For this I generated a C++ managed console application and then modified the code by hand to look like this:

 #include "stdafx.h"
 #include "Windows.h"
 #using <mscorlib.dll>
 using namespace System;
 int main(void)
 {
    LPCSTR pFirstString = "Hello World";
    Console::WriteLine(pFirstString);
    String *pSecondString = S"Hello Again";
    Console::WriteLine(pSecondString);
    return 0;
 }

Now for the IL. Before I show you the code for the Main() method, let's look at what ildasm shows us if we run it without any parameters:

[Screenshot: the ildasm.exe tree view for the compiled CPP project]

Confusing, huh? I still remember I couldn't believe my eyes the first time I ever tried running ildasm.exe on a large C++ project. But rest assured there is a good reason for all the items here. Let's consider them in turn:

The structure named $ArrayType$0x14f3579d is a type that is defined as a placeholder for the binary data. It's the type that will be used to represent the "Hello World" string.
The global field called ??_C@_0M@KPLPPDAC@Hello?5World?$AA@ is the field that will be used to instantiate $ArrayType$0x14f3579d. The names of this data and class are pretty complicated, but that's because these names are generated internally by the C++ compiler. Obviously, the compiler wants to make sure that its names don't conflict with the names of any types that you might want to declare - and ??_C@_0M@KPLPPDAC@Hello?5World?$AA@ and $ArrayType$0x14f3579d are the names it has chosen for this purpose. If you ever try porting an unmanaged C++ project to managed code and look at the results with ildasm, the chances are you'll see a huge number of types and static fields with these kinds of names, representing all the hard-coded strings in your application.
_mainCRTStartup() is the entry point for the application. However, it is not the main() method that we think we have written. The C++ compiler works by generating a separate entry-point method called _mainCRTStartup(), which contains unmanaged code. This unmanaged function performs several tasks that cannot be done from managed code, and which are only relevant to C++ applications, such as making sure the C runtime library is initialized. Then it calls the method that you thought you'd told the compiler was the entry-point method. The existence of this method illustrates the greater range of resources that can be called on by the C++ compiler, but also demonstrates clearly why the C++ compiler can't generate type-safe code (at least as of .NET version 1.0): the type-safety checks will fail at the very first hurdle - the entry-point method. This is likely to be fixed at some point in a future version of .NET, however.
main() is the method that we wrote in our source code.

Now let's look at the actual code. First, here's the true entry point:

 .method public static pinvokeimpl(/* No map */)
         unsigned int32 _mainCRTStartup() native unmanaged preservesig
 {
   .entrypoint
   .custom instance void [mscorlib]System.Security.SuppressUnmanagedCodeSecurityAttribute::.ctor() = ( 01 00 00 00 )
   // Embedded native code
   // Disassembly of native methods is not supported.
   // Managed TargetRVA = 0x10d5
 } // end of method 'Global Functions'::_mainCRTStartup

Clearly there's not much that we can deduce from this. The native unmanaged flag attached to the method definition tells us that the method contains unmanaged code, not IL. ildasm can't generate IL source code for native code, and has left comments to warn us of that fact instead. The preservesig flag is one that we haven't encountered yet - it tells the runtime to leave the signature of this method exactly as declared, rather than applying the signature transformation that interop would otherwise perform (in COM interop, for example, an HRESULT return value is normally hidden and failures are turned into exceptions).

There is an attribute on the method, SuppressUnmanagedCodeSecurityAttribute. This attribute is there for performance reasons - normally, any call into unmanaged code will trigger a so-called stack walk, in which the CLR examines this assembly and every calling assembly to verify that all these assemblies have permission to call unmanaged code - which is great for security, but not good for performance. This attribute suppresses the stack walk - you will normally apply it to code that you believe has been thoroughly tested and cannot possibly open security loopholes.

Now for the method that contains the interesting code:

 .method public static int32 modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl)
         main() cil managed
 {
   .vtentry 1 : 1
   // Code size        27 (0x1b)
   .maxstack  1
   IL_0000:  ldsflda     valuetype $ArrayType$0x14f3579d
                         ??_C@_0M@KPLPPDAC@Hello?5World?$AA@
   IL_0005:  newobj      instance void [mscorlib]System.String::.ctor(int8*)
   IL_000a:  call        void [mscorlib]System.Console::WriteLine(string)
   IL_000f:  ldstr       "Hello Again"
   IL_0014:  call        void [mscorlib]System.Console::WriteLine(string)
   IL_0019:  ldc.i4.0
   IL_001a:  ret
 } // end of method 'Global Functions'::main

You should be able to follow this code without too much difficulty. One point I will mention is that the C++ compiler does show slightly more intelligent use of the evaluation stack than the C# and VB compilers do: although the original C++ source code had local variables for both strings, the compiler doesn't use a local variable for either of them, but instead confines the strings to the evaluation stack.

There are two new keywords here, .vtentry and modopt. They are both present in this code to smooth the internal operation of the unmanaged/managed transition: .vtentry indicates the entry in a table of what are known as vtable fixups. This is simply a table of method addresses, and is necessary to allow managed and unmanaged code to interoperate in the same assembly. The modopt keyword is used here to indicate the calling convention of a method for the benefit of unmanaged code.

This quick examination of the code generated for simple Hello, World applications in different languages does appear to bear out Microsoft's claims that the C++ compiler is more powerful than the C# or VB ones, and can potentially give you higher-performance managed code. On the other hand, the IL code generated by the C++ compiler will have more work to do at program startup time as it performs various initializations of unmanaged libraries.

Working through these short programs, compiling and disassembling them, also shows how much extra information you can find out about your code when you understand a little IL.