Compiling Managed Code | Understanding .NET: A Tutorial and Analysis (Independent Technology Guides)


When managed code is compiled, two things are produced: instructions expressed in Microsoft Intermediate Language (MSIL), and metadata, information about those instructions and the data they manipulate. Whether the managed code is initially written in C#, Visual Basic.NET, or some other CLR-based language, the compiler transforms all of the types it contains (classes, structs, integers, delegates, and all the rest) into MSIL and metadata.

Compiling managed code generates MSIL and metadata

Figure 3-4 illustrates this process. In this example, the code being compiled contains three CTS types, all of them classes. When this code is compiled using whatever compiler is appropriate for the language it's written in, the result is an equivalent set of MSIL code for each class along with metadata describing those classes. Both the MSIL and the metadata are stored in a standard Windows portable executable (PE) file. This file can be either a DLL or an EXE, with the general term module sometimes used to mean either of these. This section takes a closer look at MSIL and metadata.

Figure 3-4. Compiling managed code written in any language produces MSIL and metadata describing that MSIL.

MSIL and metadata are contained in a DLL or EXE

Microsoft Intermediate Language (MSIL)

MSIL is quite similar to a processor's native instruction set. However, no hardware that actually executes these instructions (today, at least) is available. Instead, MSIL code is always translated into native code for whatever processor this code is running on before it's executed. It's probably fair to say that a developer working in the .NET Framework environment need never fully understand MSIL. Nevertheless, it's worth knowing at least a little bit about what a compiler produces from code written in C# or Visual Basic.NET or any other CLR-based language.

MSIL defines a virtual instruction set

Why Use MSIL?

The idea of compiling from a high-level language into a common intermediate language is a staple of modern compiler technology. The compilers in Visual Studio 6, for example, transform various programming languages into the same intermediate language and then use a common back end to compile this into a machine-specific binary. Prior to the .NET Framework, it was this binary that a user of the application would install.

For .NET Framework applications, however, what gets copied to the disk when an application is installed is not a machine-specific binary. Instead, it's the MSIL for this application, code that's analogous to the intermediate language that formerly remained hidden inside the compiler. Why make this change? What's the benefit of distributing code as MSIL?

The most obvious answer is the potential for portability. As discussed in Chapter 1, at least the core of the .NET Framework is available on systems that use non-Windows operating systems and non-Intel processors. While making the complete .NET Framework a truly multi-operating system technology would be challenging in many ways, Microsoft clearly wants at least the .NET Compact Framework to be available on many devices.

Portability isn't the only advantage of using an intermediate language. Unlike binary code, which can contain references to arbitrary memory addresses, MSIL code can be verified for type safety when it is loaded into memory. This allows better security and higher reliability, since some kinds of errors and a large set of possible attacks can be made impossible. Verification in the CLR is described later in this chapter.

One potential drawback of this approach is that it can lead to slower code. Microsoft actually admits that some .NET Framework applications are likely to run more slowly than those written with the earlier Windows DNA technologies. The right question to ask, however, is not how fast these applications will run, but whether they will be fast enough to meet customer requirements. There is a large body of evidence that suggests they will, evidence provided by the success of Java. Java developers have been using this approach for several years, and while Java code is typically slower than native code, it is fast enough that many organizations use it quite successfully. And unlike Java, which is often interpreted, MSIL is always compiled before execution, as described later in this chapter.

Relying heavily on an intermediate language would have been unworkably slow five years ago. Today, though, it's fine: the hardware has gotten much faster. It's great working in software, isn't it? If things are too slow today, just wait a little while. They'll probably be fast enough in a year or two.

As implied earlier in this chapter, the abstract machine defined by the CLR is stack based, which means that many MSIL operations are defined in terms of this stack. Here are a few example MSIL instructions and what they're used for:

The CLR defines a stack-based virtual machine

add: Adds the top two values on the stack and pushes the result back onto the stack.
box: Converts a value type to a reference type; that is, it boxes the value.
br: Transfers control (branches) to a specified location in memory.
call: Calls a specified method.
ldfld: Loads a specified field of an object onto the stack.
ldobj: Copies the value of a specified value type onto the stack.
newobj: Creates a new object or a new instance of a value type.
stfld: Stores a value from the stack into a specified field of an object.
stobj: Stores a value from the stack into a specified value type.
unbox: Converts a boxed value type back to its ordinary form.

In effect, MSIL is the assembly language of the CLR. One interesting thing to notice about this tiny sample of the MSIL instruction set is how closely it maps to the abstractions of the CLR's Common Type System. Objects, value types, even boxing and unboxing all have direct support. Also, some operations, such as the newobj used to create new instances, are analogous to operators more commonly found in high-level languages than they are to typical machine instructions.

MSIL reflects the CTS
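
To make the stack-based model a little more concrete, here is a minimal sketch of a toy evaluator, written in Python, for a handful of operations in the spirit of the instructions listed above. It is not MSIL, and it is not how the CLR works internally: the CLR compiles MSIL to native code rather than interpreting it. The `run` function, the `Point` class, and the tuple encoding of instructions are all assumptions made for illustration, and `ldc` is only a stand-in for MSIL's family of load-constant instructions.

```python
class Point:
    """Hypothetical object type with two fields, used only for this sketch."""

    def __init__(self):
        self.x = 0
        self.y = 0


def run(program):
    """Evaluate a list of (opcode, operand...) tuples on an evaluation stack.

    Returns the final stack so the caller can inspect the result.
    """
    stack = []
    for opcode, *operand in program:
        if opcode == "ldc":        # push a constant onto the stack
            stack.append(operand[0])
        elif opcode == "add":      # pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "newobj":   # create an instance, push a reference to it
            stack.append(operand[0]())
        elif opcode == "dup":      # duplicate the value on top of the stack
            stack.append(stack[-1])
        elif opcode == "stfld":    # pop value, pop object, store value in field
            value = stack.pop()
            obj = stack.pop()
            setattr(obj, operand[0], value)
        elif opcode == "ldfld":    # pop object, push the value of its field
            obj = stack.pop()
            stack.append(getattr(obj, operand[0]))
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack


program = [
    ("newobj", Point),   # stack: [p]
    ("dup",),            # stack: [p, p]
    ("ldc", 2),          # stack: [p, p, 2]
    ("ldc", 3),          # stack: [p, p, 2, 3]
    ("add",),            # stack: [p, p, 5]
    ("stfld", "x"),      # stack: [p], and p.x is now 5
    ("ldfld", "x"),      # stack: [5]
]
result = run(program)
print(result)
```

Notice that no instruction names a register or a variable for its inputs. Each one takes its operands from the stack and leaves its result there, which is the sense in which the operations are "defined in terms of this stack."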

For developers who wish to work directly in this low-level argot, the .NET Framework provides an MSIL assembler called Ilasm. Only the most masochistic developers are likely to use this tool, however. Why write in MSIL when you can use a simpler, more powerful language such as Visual Basic.NET or C# and get the same result?

Metadata

Compiling managed code always produces MSIL. Compiling managed code also always produces metadata describing that code. Metadata is information about the types defined in the managed code it's associated with, and it's stored in the same file as the MSIL generated from those types. If you're familiar with COM, metadata serves much the same purpose as a COM type library.

Compiled managed code always has associated metadata

Figure 3-5 shows an abstract view of a module produced by a CLR-based compiler. The file contains the MSIL code generated from the types in the original program, which once again are the three classes X, Y, and Z. Along with the code for the methods in each class, the file contains metadata describing these classes and any other types defined in this file. This information is loaded into memory when the file itself is loaded, making the metadata accessible at runtime. Metadata can also be read directly from the file that contains it, making information available even when code isn't loaded into memory. The process of reading metadata is known as reflection, and it's described in more detail in Chapter 5.

Figure 3-5. A module contains metadata for each type in the file.

What Metadata Contains

Metadata describes the types contained in a module. Among the information it stores for a type are the following:

Metadata provides detailed information about each type

The type's name
The type's visibility, which can be public or assembly
What type this type inherits from, if any
Any interfaces the type implements
Any methods the type implements
Any properties the type exposes
Any events the type provides

More detailed information is also available. For example, the description of each method includes the method's parameters and their types, along with the type of the method's return value.
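
As a rough illustration of the kind of information listed above, the following Python sketch models a metadata entry for one type as a plain record. This is only a conceptual model: real CLR metadata is stored in a binary format inside the module, and the `TypeRecord` and `MethodRecord` classes, the `describe` function, and the sample type are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MethodRecord:
    """Hypothetical description of one method: name, parameters, return type."""
    name: str
    parameters: List[Tuple[str, str]]   # (parameter name, parameter type)
    return_type: str


@dataclass
class TypeRecord:
    """Hypothetical description of one type, mirroring the list in the text."""
    name: str
    visibility: str                      # "public" or "assembly"
    base_type: Optional[str] = None      # what the type inherits from, if any
    interfaces: List[str] = field(default_factory=list)
    methods: List[MethodRecord] = field(default_factory=list)
    properties: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)


def describe(record):
    """Render a TypeRecord as readable text, one fact per line."""
    lines = [f"{record.visibility} type {record.name}"]
    if record.base_type is not None:
        lines.append(f"  inherits from {record.base_type}")
    for interface in record.interfaces:
        lines.append(f"  implements {interface}")
    for method in record.methods:
        params = ", ".join(f"{ptype} {pname}" for pname, ptype in method.parameters)
        lines.append(f"  method {method.return_type} {method.name}({params})")
    for prop in record.properties:
        lines.append(f"  property {prop}")
    for event in record.events:
        lines.append(f"  event {event}")
    return "\n".join(lines)


# A made-up class X, described the way a tool might see it.
x = TypeRecord(
    name="X",
    visibility="public",
    base_type="System.Object",
    interfaces=["System.IComparable"],
    methods=[MethodRecord("Add", [("a", "int"), ("b", "int")], "int")],
    properties=["Count"],
    events=["Changed"],
)
print(describe(x))
```

A tool that reads metadata works with information of this shape. It never needs the source code, only the descriptions stored alongside the compiled code.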

MSIL Versus Java Bytecode

The concept of MSIL is similar to what Java calls bytecode. Java fans might point out, with some justice, that Microsoft has copied an approach first made popular by their technology. Microsoft sometimes responds to this claim, again with some justice, by observing that the idea of an intermediate language predates both Java bytecode and MSIL, with antecedents stretching back to UCSD Pascal's p-code and beyond.

In any case, it's interesting to compare the two technologies. The broad outlines are similar, with both the Java virtual machine and the CLR defining a stack-based virtual environment. One obvious difference is that Java bytecode was specifically designed to support the Java language, while MSIL was defined to support multiple languages. Still, a substantial amount of language semantics is embedded in MSIL, so while it is somewhat broader than Java bytecode, MSIL isn't completely general. Concepts defined by the CTS, such as the distinction between reference and value types, are fundamental to MSIL. This distinction is part of both C# and Visual Basic.NET, and it will likely appear in most CLR-based languages. Another difference is that the Java virtual machine was designed to allow bytecode to be interpreted as well as compiled, while MSIL's designers explicitly targeted JIT compilation. While interpreting MSIL is probably possible, it appears that it would be significantly less efficient than interpreting Java bytecode.

As is usually the case when a widely used model exists, Microsoft was probably able to learn from the experiences of the Java world in designing the CLR. As Moore's Law continues to make more processing power available for less money, both the Java and Microsoft camps have decided that the advantages of an intermediate language outweigh the potential performance penalty.

For a more detailed comparison of MSIL and Java bytecode, see K. John Gough's Stacking Them Up: A Comparison of Virtual Machines, available at http://sky.fit.qut.edu.au/~gough/publications.html.

The Evolution of Metadata

The idea of providing information about compiled code (metadata) has been around for quite a while. Even standard Windows DLLs offer a simple form of metadata by providing a list of the functions they export. But most Windows developers first encountered metadata in a serious way with the arrival of COM's type libraries. A COM component's type library allows tools and other software to learn about the COM classes a component supports, discover what interfaces each of those classes implements, and more.

Yet while a type library provides useful information about the COM classes it describes, the information it provides is limited. A developer can't determine what other COM classes this one depends on, for instance, nor can he be certain that the type library accurately describes every externally visible aspect of a COM class. COM's Interface Definition Language (IDL) allows expressing things that can't be represented in a type library, while type libraries can contain information that can't be expressed in IDL. Most developers don't suffer much from this mismatch, since the differences are fairly exotic. Still, while COM provides metadata, it does so in a needlessly idiosyncratic way.

The designers of the .NET Framework's metadata avoided these problems. For one thing, a module's metadata is full fidelity, which means that it completely and accurately describes the code it's associated with. Also, every module always contains metadata, unlike COM's optional type libraries. While the current form of a module's metadata probably isn't the last word on the subject (new ideas always appear), it is clearly a long step forward from what COM provided.

Because metadata is always present, tools can rely on it always being available. Visual Studio.NET, for example, uses metadata to show a developer what methods are available for the class name she's just typed. A module's metadata can also be examined using a tool called Ildasm. This tool is the reverse of the Ilasm tool mentioned earlier in this chapter (it's a disassembler for MSIL), and it can also provide a detailed display of the metadata contained in a particular module.

Tools can use metadata

Attributes

Metadata also includes attributes. Attributes are values that are stored in the metadata and can be read and used to control various aspects of how this code executes. Attributes can be added to types, such as classes, and to fields, methods, and properties of those types. As described later in this book, the .NET Framework class library relies on attributes for many things, including specifying transaction requirements, indicating which methods should be exposed as SOAP-callable Web services, and describing security requirements. These attributes have standard names and functions defined by the various parts of the .NET Framework class library that use them.

Attributes contain values stored with metadata

Developers can also create custom attributes used to control behavior in an application-specific way. To create a custom attribute, a developer using a CLR-based programming language such as C# or Visual Basic.NET can define a class that inherits from System.Attribute. An instance of the resulting class will automatically have its value stored in metadata when it is compiled.

Developers can create custom attributes
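
The general pattern (declare an attribute class, attach an instance of it to a type, and read it back later) can be sketched in Python as a loose analogy. This is not the CLR mechanism: in the .NET Framework the attribute values are stored in the module's metadata at compile time, whereas this sketch simply records them on the class object when the class is defined. The `Attribute` base class here is a hypothetical stand-in for `System.Attribute`, and `AuthorAttribute`, `attach`, `get_attributes`, and `Invoice` are all invented for the example.

```python
class Attribute:
    """Hypothetical base class playing the role the text gives System.Attribute."""


class AuthorAttribute(Attribute):
    """A made-up custom attribute carrying two values."""

    def __init__(self, name, version=1):
        self.name = name
        self.version = version


def attach(attribute):
    """Return a decorator that records `attribute` on the decorated target."""
    def decorator(target):
        # Copy any existing list so a subclass never mutates its parent's list.
        existing = list(getattr(target, "_custom_attributes", []))
        existing.append(attribute)
        target._custom_attributes = existing
        return target
    return decorator


def get_attributes(target, kind=Attribute):
    """Return the attributes of the given kind recorded on `target`."""
    recorded = getattr(target, "_custom_attributes", [])
    return [a for a in recorded if isinstance(a, kind)]


@attach(AuthorAttribute("Pat", version=2))
class Invoice:
    """Ordinary application class annotated with a custom attribute."""


authors = get_attributes(Invoice, AuthorAttribute)
print(authors[0].name, authors[0].version)
```

The annotated class behaves exactly as it would without the attribute. The attribute only matters to code that goes looking for it, which is how attribute-driven behavior stays application specific.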
