Compiling Managed Code | Understanding .NET (2nd Edition)

When source code written in a CLR-based language is compiled, two things are produced: instructions expressed in Microsoft Intermediate Language (MSIL), and metadata, information about those instructions and the data they manipulate. Whether this code is initially written in C#, VB, or some other CLR-based language, the compiler transforms all of the types it contains (classes, structs, integers, delegates, and all the rest) into MSIL and metadata.

Compiling managed code generates MSIL and metadata

Figure 2-4 illustrates this process. In this example, the code being compiled contains three CTS types, all of them classes. When this code is compiled using whatever compiler is appropriate for the language it's written in, the result is an equivalent set of MSIL code for each class along with metadata describing those classes. Both the MSIL and the metadata are stored in a standard Windows portable executable (PE) file. This file can be either a DLL or an EXE, with the general term module sometimes used to mean either of these. This section takes a closer look at MSIL and metadata.

Figure 2-4. Compiling managed code written in any language produces MSIL and metadata describing that MSIL.

MSIL and metadata are contained in a DLL or EXE

Microsoft Intermediate Language (MSIL)

MSIL defines a virtual instruction set

MSIL is quite similar to a processor's native instruction set. However, no hardware is available (today, at least) that actually executes these instructions. Instead, MSIL code is always translated into native code for whatever processor it is running on before it's executed. It's probably fair to say that a developer working in the .NET Framework environment need never fully understand MSIL. Nevertheless, it's worth knowing at least a little bit about what a compiler produces from code written in C# or VB or any other CLR-based language.

Perspective: Why Use MSIL?

The idea of compiling from a high-level language into a common intermediate language is a staple of modern compiler technology. Even before .NET, for instance, the compilers in Visual Studio transformed various programming languages into the same intermediate language, then used a common back end to compile this into a machine-specific binary. In this pre-.NET world, it was this binary that a user of the application would install.

For .NET Framework applications, however, what gets copied to the disk when an application is installed is not a machine-specific binary. Instead, it's the MSIL for this application, code that's analogous to the intermediate language that formerly remained hidden inside the compiler. Why make this change? What's the benefit of distributing code as MSIL?

The most obvious answer is the potential for portability. As discussed in Chapter 1, at least the core of the .NET Framework is available for systems that use non-Windows operating systems and non-Intel processors. Still, it's fair to view the .NET Framework as primarily a Windows-focused technology, and so portability isn't MSIL's primary purpose.

But portability isn't the only advantage of using an intermediate language. Unlike binary code, which can contain references to arbitrary memory addresses, MSIL code can be verified for type safety when it is loaded into memory. This allows better security and higher reliability, since some kinds of errors and a large set of possible attacks can be made impossible.

One potential drawback to using an intermediate language is that it might lead to slower code. The reality, though, is that since MSIL is always compiled before execution rather than interpreted, as described later, this turns out not to be a problem in most situations. And even if it were, hardware speeds just keep increasing. It's great working in software, isn't it? If things are too slow today, just wait a little while. They'll probably be fast enough in a year or two.

The CLR defines a stack-based virtual machine

As implied earlier in this chapter, the abstract machine defined by the CLR is stack based, which means that many MSIL operations are defined in terms of this stack. Here are a few example MSIL instructions and what they're used for:

add: Adds the top two values on the stack and pushes the result back onto the stack.
box: Converts a value type to a reference type; that is, it boxes the value.
br: Transfers control (branches) to a specified location in memory.
call: Calls a specified method.
ldfld: Loads a specified field of an object onto the stack.
ldobj: Copies the value of a specified value type onto the stack.
newobj: Creates a new object or a new instance of a value type.
stfld: Stores a value from the stack into a specified field of an object.
stobj: Stores a value on the stack into a specified value type.
unbox: Converts a boxed value type back to its ordinary form.

In effect, MSIL is the assembly language of the CLR. One interesting thing to notice about this tiny sample of the MSIL instruction set is how closely it maps to the abstractions of the CLR's CTS. Objects, value types, and even boxing and unboxing all have direct support. Also, some operations, such as the newobj used to create new instances, are analogous to operators more commonly found in high-level languages than they are to typical machine instructions.

MSIL reflects the CTS
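To make the stack-based model concrete, here is a minimal sketch of a toy evaluation stack written in Python. It is not MSIL and not how the CLR works internally: the CLR compiles MSIL to native code rather than interpreting it, and real instructions carry typed operands and metadata tokens. The opcode names add, newobj, stfld, ldfld, and br are borrowed from the list above. `ldc` and `dup` are added for illustration, standing in for the constant-loading and duplicate instructions that MSIL also has. `Obj`, `run`, and the `Point` type name are hypothetical, and this `newobj` calls no constructor.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """A toy heap object: a type name plus named fields."""
    type_name: str
    fields: dict = field(default_factory=dict)


def run(program):
    """Execute a list of (opcode, operand) pairs on an evaluation stack.

    Returns the stack as it stands when the program ends.
    """
    stack = []
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "ldc":        # push a constant
            stack.append(arg)
        elif op == "add":      # pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "newobj":   # create an object, push a reference to it
            stack.append(Obj(arg))
        elif op == "dup":      # duplicate the top of the stack
            stack.append(stack[-1])
        elif op == "stfld":    # pop a value, pop an object, store into its field
            value = stack.pop()
            obj = stack.pop()
            obj.fields[arg] = value
        elif op == "ldfld":    # pop an object, push the value of its field
            obj = stack.pop()
            stack.append(obj.fields[arg])
        elif op == "br":       # unconditional branch to an instruction index
            pc = arg
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


program = [
    ("newobj", "Point"),  # stack: [p]
    ("dup", None),        # stack: [p, p]
    ("ldc", 2),           # stack: [p, p, 2]
    ("ldc", 3),           # stack: [p, p, 2, 3]
    ("add", None),        # stack: [p, p, 5]
    ("stfld", "x"),       # stack: [p], and field x of p now holds 5
    ("ldfld", "x"),       # stack: [5]
]
print(run(program))  # [5]
```

Every instruction takes its inputs from the stack and leaves its result there, which is why the operands in this sketch are limited to constants, names, and branch targets.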

Perspective: MSIL vs. Java Bytecode

The concept of MSIL is similar to what Java calls bytecode. Java fans might point out, with some justification, that Microsoft has copied an approach first made popular by their technology. Microsoft sometimes responds to this claim, again with some justification, by observing that the idea of an intermediate language predates both Java bytecode and MSIL, with antecedents stretching back to UCSD Pascal's p-code and beyond.

In any case, it's interesting to compare the two technologies. The broad outlines are similar, with both the Java virtual machine and the CLR defining a stack-based virtual environment. One obvious difference is that Java bytecode was specifically designed to support the Java language, while MSIL was defined to support multiple languages. Still, a substantial amount of language semantics is embedded in MSIL, so while it is somewhat broader than Java bytecode, MSIL isn't completely general. Concepts defined by the CTS, such as the distinction between reference and value types, are fundamental to MSIL. This distinction is part of both C# and VB, and it will likely appear in most CLR-based languages. Another difference is that the Java virtual machine was designed to allow bytecode to be interpreted as well as compiled, while MSIL's designers explicitly targeted just-in-time (JIT) compilation (which is described later in this chapter). While interpreting MSIL is probably possible, it appears that it would be significantly less efficient than interpreting Java bytecode.

For a more detailed comparison of MSIL and Java bytecode, see K. John Gough's Stacking Them Up: A Comparison of Virtual Machines, available at various places on the Web.

For developers who wish to work directly in this low-level argot, the .NET Framework provides an MSIL assembler called Ilasm. Only the most masochistic developers are likely to use this tool, however, or those who need very low-level control. Why write in MSIL when you can use a simpler, more powerful language such as VB or C# and get the same result?

Metadata

Compiling managed code always produces MSIL. Compiling managed code also always produces metadata describing that code. Metadata is information about the types defined in the managed code it's associated with, and it's stored in the same file as the MSIL generated from those types. Figure 2-5 shows an abstract view of a module produced by a CLR-based compiler. The file contains the MSIL code generated from the types in the original program, which once again are the three classes X, Y, and Z. Along with the code for the methods in each class, the file contains metadata describing these classes and any other types defined in this file. This information is loaded into memory when the file itself is loaded, making the metadata accessible at runtime. Metadata can also be read directly from the file that contains it, making information available even when code isn't loaded into memory. The process of reading metadata is known as reflection, and it's described in a bit more detail in Chapter 4.

Figure 2-5. A module contains metadata for each type in the file.

Compiled managed code always has associated metadata

What Metadata Contains

Metadata describes the types contained in a module. Among the information it stores for a type are the following things:

The type's name
The type's visibility, which can be public or assembly
What type this type inherits from, if any
Any interfaces the type implements
Any methods the type implements
Any properties the type exposes
Any events the type provides

Metadata provides detailed information about each type

More detailed information is also available. For example, the description of each method includes the method's parameters and their types, along with the type of the method's return value.
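As a rough illustration of the kind of information listed above, the following Python sketch models a type's metadata as plain records and shows how a tool might turn it into a list of method signatures. This is only a conceptual model under stated assumptions: the real metadata is stored in a binary table format inside the PE file, and `TypeInfo`, `MethodInfo`, `signature`, `completions`, and the `Account` type are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MethodInfo:
    """Toy description of one method."""
    name: str
    parameters: list      # (parameter name, type name) pairs
    return_type: str


@dataclass
class TypeInfo:
    """Toy description of one type, mirroring the items listed in the text."""
    name: str
    visibility: str                       # "public" or "assembly"
    base_type: Optional[str] = None       # the type this type inherits from, if any
    interfaces: list = field(default_factory=list)
    methods: list = field(default_factory=list)
    properties: list = field(default_factory=list)
    events: list = field(default_factory=list)


def signature(method):
    """Format a method record as a C#-style signature string."""
    params = ", ".join(f"{type_name} {name}" for name, type_name in method.parameters)
    return f"{method.return_type} {method.name}({params})"


def completions(type_info):
    """What a metadata-driven tool could show for a type: its method signatures."""
    return [signature(m) for m in type_info.methods]


# A hypothetical type, described entirely by its metadata record.
account = TypeInfo(
    name="Account",
    visibility="public",
    base_type="System.Object",
    interfaces=["IComparable"],
    methods=[
        MethodInfo("Deposit", [("amount", "decimal")], "void"),
        MethodInfo("GetBalance", [], "decimal"),
    ],
)

print(completions(account))
# ['void Deposit(decimal amount)', 'decimal GetBalance()']
```

Nothing in `completions` runs or even loads the code of `Account`. It works purely from the description, which is the property that makes metadata useful to tools.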

Because metadata is always present, tools can rely on it always being available. Visual Studio, for example, uses metadata to provide IntelliSense, which shows a developer things like what methods are available for the class name she's just typed. A module's metadata can also be examined using the MSIL disassembler tool, commonly referred to as Ildasm. This tool is the reverse of the Ilasm tool mentioned earlier in this chapter (it's a disassembler for MSIL), and it can also provide a detailed display of the metadata contained in a particular module.

Tools can use metadata

Attributes

Metadata also includes attributes. Attributes are values that are stored in the metadata and can be read and used to control various aspects of how this code executes. Attributes can be added to types, such as classes, and to fields, methods, and properties of those types. As described later in this book, the .NET Framework class library relies on attributes for many things, including specifying transaction requirements, indicating which methods should be exposed as SOAP-callable Web services, and describing security requirements. These attributes have standard names and functions defined by the various parts of the .NET Framework class library that use them.

Attributes contain values stored with metadata

Developers can also create custom attributes used to control behavior in an application-specific way. To create a custom attribute, a developer using a CLR-based programming language such as C# or VB can define a class that inherits from System.Attribute. When code that applies this attribute is compiled, the values supplied for that instance of the attribute class are automatically stored in metadata.

Developers can create custom attributes
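The idea of attaching values to code and reading them back later can be sketched by analogy in Python. This is an analogy only, not the .NET mechanism: in .NET the compiler writes attribute values into the module's metadata, whereas this sketch simply records them on the decorated function at definition time. The `Attribute` base class here is a stand-in for System.Attribute, and `Exposed`, `Calculator`, and `exposed_methods` are hypothetical names.

```python
class Attribute:
    """Base class for toy attributes (a stand-in for System.Attribute)."""

    def __call__(self, target):
        # Record this attribute instance on the decorated function or class,
        # loosely mirroring how attribute values end up alongside metadata.
        existing = list(getattr(target, "__attributes__", []))
        existing.append(self)
        target.__attributes__ = existing
        return target


class Exposed(Attribute):
    """Hypothetical custom attribute: marks a method as remotely callable."""

    def __init__(self, description=""):
        self.description = description


class Calculator:
    @Exposed(description="Adds two numbers")
    def add(self, a, b):
        return a + b

    def reset(self):
        return None


def exposed_methods(cls):
    """Return {method name: description} for methods that carry Exposed."""
    found = {}
    for name, member in vars(cls).items():
        for attr in getattr(member, "__attributes__", []):
            if isinstance(attr, Exposed):
                found[name] = attr.description
    return found


print(exposed_methods(Calculator))  # {'add': 'Adds two numbers'}
```

The code that reads the attribute (`exposed_methods`) is separate from the code that carries it (`Calculator`). That separation is what lets a framework change how code is treated, for example by exposing only the marked methods, without the marked code doing anything itself.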