Mapping to the CLR | Programming in the .NET Environment

The most important mappings to be decided upon, when implementing a new language on the CLI, are the mapping of compilation units onto the class and assembly structure of the CLI and the mapping of the types of the language onto the types of the Common Type System (CTS).

Mapping the Program Structure

In the CLI, classes and other named entities have their visibility controlled at the boundaries of assemblies. There may be additional rules for controlling accessibility, but the assembly boundary is the one that counts to the CLI. The use of namespaces, seemingly fundamental to programmers, has no representation at runtime. At runtime, classes have only fully qualified dotted names, with everything up to the last "dot" corresponding to what programmers like to think of as the namespace. A compiler may therefore perform arbitrary "name-mangling" transformations to map the hierarchical naming structure of a source language to the dotted names of the CLI.
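The relationship between a dotted name and its apparent namespace can be sketched as follows. This is a minimal illustration written in Java rather than a CLI language; the class `DottedName` and its helper methods are hypothetical names introduced here, not part of any runtime API.

```java
// Illustrative only: splits a fully qualified dotted name at its last dot.
final class DottedName {
    private DottedName() {}

    // Everything before the last dot: what programmers think of as the namespace.
    static String namespaceOf(String qualified) {
        int lastDot = qualified.lastIndexOf('.');
        return lastDot < 0 ? "" : qualified.substring(0, lastDot);
    }

    // Everything after the last dot: the simple class name.
    static String simpleNameOf(String qualified) {
        int lastDot = qualified.lastIndexOf('.');
        return qualified.substring(lastDot + 1);
    }

    public static void main(String[] args) {
        String name = "System.Collections.ArrayList";
        System.out.println(namespaceOf(name));   // System.Collections
        System.out.println(simpleNameOf(name));  // ArrayList
    }
}
```

Because the runtime sees only the whole string, a compiler is free to choose any scheme that yields unique dotted names.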

In Component Pascal, each module is a compilation unit, and modules define the boundaries of visibility control. It thus seemed natural to map modules to assemblies, so that each compilation results in a separate assembly, although other mappings are possible.

The Synthetic Static Class

As well as defining an assembly, each module implicitly defines a synthetic static class. This is a class with only static members and no instance constructors. The class is given a made-up name and provides a mechanism to encapsulate the body of the module. Recall that the body of a module needs to be executed exactly once, when the module is loaded. This behavior is exactly what is provided by the static (class) constructor of a class. When the synthetic static class is demand-loaded, the class constructor performs any initialization specified in the module body. The guarantees of the CLR ensure that the class will be loaded before any access to the static data of the module is required.

Any static data of the module become static data of the synthetic class, and ordinary (that is, non-type-bound) procedures of the module become static methods of the class.
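The shape of such a synthetic class can be sketched as follows. The sketch is written in Java, whose static initializer gives the same run-once guarantee as a CLR class constructor; it is an analogy under that assumption, not the code the compiler emits. The module name `CounterModule`, its variable `count`, and the demonstration counter `bodyRuns` are all hypothetical.

```java
// Hypothetical rendering of a module as a synthetic static class.
final class CounterModule {
    // A module-level variable becomes a static field.
    static int count;
    // Counts executions of the module body (for demonstration only).
    static int bodyRuns;

    // The module body becomes the class initializer, which the runtime
    // executes once, before the first use of any static member.
    static {
        count = 100;
        bodyRuns++;
    }

    // No instances are ever created.
    private CounterModule() {}

    // An ordinary (non-type-bound) procedure becomes a static method.
    static int next() {
        count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(next());     // 101
        System.out.println(next());     // 102
        System.out.println(bodyRuns);   // 1
    }
}
```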

Nested procedures are implemented by static methods of the synthetic class, but in this case the names are mangled so as to ensure uniqueness in the flattened name-scope of the class. In the case of procedures nested inside type-bound procedures, the nested procedure becomes a static method of the class to which the enclosing (dispatched) method is bound.

Controlling Visibility

Classes in the CLI are either private to an assembly or public. These two levels of visibility correspond precisely to the nonexported or exported possibilities of Component Pascal. This provides a natural mapping for defining the visibility of types.

Class members in the CLI have a variety of possible accessibility attributes. As it turns out, only two are used here. Static variables of a module are mapped to static fields of the synthetic static class. If the variable is exported, it has public accessibility. If the variable is not exported, it must be given assembly visibility, because it must be accessible to all code in the module. The use of private in this case would be too restrictive, as the code of type-bound procedures would be unable to access such fields.

The mapping of variables with read-only export is an interesting problem, as there is no direct support for it in the CLI. In the current version of the compiler, such variables are public. The Component Pascal compiler knows that such variables must not be modified and enforces the guarantee. However, a component written in another language would permit modification of the value, as the compiler would be unaware of the (nonstandard) attribute. A more secure implementation would be to map read-only variables to properties that export a getter method but no setter method. This would enforce security at the cost of a small runtime overhead.
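The more secure alternative might look like the following minimal sketch. It is written in Java, with package-private access standing in for CLI assembly accessibility and a getter method standing in for a CLI property; that correspondence is an assumption made for illustration. The names `Limits`, `maxDepth`, `getMaxDepth`, and `deepen` are hypothetical.

```java
// Hypothetical module with one read-only exported variable.
final class Limits {
    // Backing field with package-level access (the analogue of assembly
    // accessibility): visible to all code of the module, hidden from clients.
    static int maxDepth = 32;

    private Limits() {}

    // The only exported access path is a getter; no setter is provided,
    // so client code in any language can read the value but not assign it.
    public static int getMaxDepth() {
        return maxDepth;
    }

    // Code belonging to the module itself may still modify the variable.
    static void deepen() {
        maxDepth *= 2;
    }

    public static void main(String[] args) {
        System.out.println(getMaxDepth());  // 32
        deepen();
        System.out.println(getMaxDepth());  // 64
    }
}
```

The small runtime overhead mentioned above is the method call that replaces a direct field load, which a just-in-time compiler may well inline.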

Static procedures, like static variables, become members of the synthetic static class. They are declared with either public or assembly accessibility, depending on whether they are exported.

Mapping the Type System

The built-in types of Component Pascal map naturally to a subset of the types of the CLI. The character and floating-point types map directly, with the longer signed integers also corresponding directly. Although the CLI provides a full range of unsigned types, the Common Language Specification standardizes use of signed types, except for the 8-bit type, for which, perversely, the unsigned type is standard. This is the only slight problem with the Component Pascal mapping, as in Component Pascal the 8-bit number type is signed. It follows that Component Pascal programs must not export procedures with 8-bit formal parameters, if they are to be CLS-conformant.
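The mismatch is easy to see by reinterpreting the same eight bits under both conventions. The following minimal sketch uses Java, whose `byte` happens to be signed like the Component Pascal 8-bit type; the class `EightBit` and method `asUnsigned` are hypothetical names used only for illustration.

```java
// Shows how one 8-bit pattern reads differently as signed and unsigned.
final class EightBit {
    private EightBit() {}

    // Reinterprets a signed 8-bit value as the unsigned value with the same bits.
    static int asUnsigned(byte signed) {
        return Byte.toUnsignedInt(signed);
    }

    public static void main(String[] args) {
        byte small = 100;     // non-negative: both readings agree
        byte negative = -1;   // bit pattern 0xFF: the readings disagree
        System.out.println(asUnsigned(small));     // 100
        System.out.println(asUnsigned(negative));  // 255
    }
}
```

A client that expects the unsigned convention would therefore misread every negative argument, which is why such parameters cannot be exported as they stand.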

The type constructors of Component Pascal generate records, pointers, and arrays. Somehow these have to be mapped onto the reference classes, value classes, and dynamically allocated arrays of the Common Type System. In the CTS, value classes have value assignment semantics, while both arrays and reference classes have assignment with alias (reference) semantics. One thing becomes immediately apparent: Some types with value semantics in Component Pascal will need to be implemented by reference objects in the Common Type System. Consider just two examples. Arrays in the CTS are always dynamically allocated, and hence are reference types. Value classes in the CTS are necessarily sealed, so any extensible record type must be implemented as a reference class in the CTS.

Reference Surrogates

The mechanism by which structures with value semantics in some source language are implemented by means of reference objects in an implementation framework has been described as reference surrogacy. In effect, a dynamically allocated object in the implementation framework is created to act as a surrogate for an object with value semantics in the source language program. Whenever an operation in the source code requires value copies to be made, the code in the implementation framework does whatever field-by-field or element-by-element copying is required to maintain semantic consistency with the source language view. Similarly, values that are implicitly created in the source code may require explicit invocations of new in the implementation framework, if they are implemented by reference surrogates.

There are two cases that require reference surrogates in Component Pascal. We consider each in turn.

Arrays

All arrays in the CTS are dynamically allocated, so that all Component Pascal arrays must be implemented by reference surrogates. Here we sketch out the main elements of the trickery that achieves the correct semantics.

Suppose that a particular program has a procedure with a local variable of array type. The source code might be written as follows:

 procedure Foo();
   var localA : array 8 of char;
 begin
   ...

As is usual for local variables, the array automatically comes into existence when the procedure is invoked. In conventional settings, this is done by reserving space for the array in the activation record of the procedure. However, in the case of the CLR, the surrogate array must be dynamically allocated from the garbage-collected heap, so the implementation of the procedure will require code in its prolog to create the object. The prolog code will be equivalent to that arising from the following C# statement:

 char[] localA = new char[8];

Note that in this example the localA in the C# fragment is a reference to the 8-long array, while in the Component Pascal source the variable is the array.

When an entire assignment of such arrays takes place, the compiler will generate object code that performs an element-by-element copy. This copy is performed by an inline loop in the current version of the compiler. In the case of multidimensional arrays, because the subarrays will in turn be implemented by reference surrogates, the copy operation will become a nest of loops as deep as the number of array dimensions.
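The copy loops can be sketched as follows. The sketch is in Java, whose arrays are heap-allocated references and whose multidimensional arrays are arrays of arrays, which matches the surrogate layout described here; it illustrates the idea and is not the code the compiler generates. The class `ArrayValueCopy` and its `assign` methods are hypothetical names.

```java
// Element-by-element copies that give reference arrays value semantics.
final class ArrayValueCopy {
    private ArrayValueCopy() {}

    // Translation of "dst := src" for a one-dimensional array: a single loop.
    static void assign(char[] dst, char[] src) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // Translation of "dst := src" for a two-dimensional array: each row is
    // itself a surrogate, so the copy becomes a nest of two loops.
    static void assign(int[][] dst, int[][] src) {
        for (int i = 0; i < src.length; i++) {
            for (int j = 0; j < src[i].length; j++) {
                dst[i][j] = src[i][j];
            }
        }
    }

    public static void main(String[] args) {
        int[][] a = new int[8][4];
        int[][] b = new int[8][4];
        a[2][3] = 7;
        assign(b, a);      // value copy: b gets its own data
        a[2][3] = 9;       // a later change to a must not show through b
        System.out.println(b[2][3]);       // 7
        System.out.println(b[2] != a[2]);  // true
    }
}
```

Copying only the outer array of references would leave the rows shared between the two variables, which is why the loop nest must be as deep as the number of dimensions.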

Figure E.2 is a diagram of the runtime representation of a two-dimensional array. In this case, the diagram is for an array with the following Component Pascal declaration:

 var bar : array 8,4 of integer;

Figure E.2. Runtime representation of a two-dimensional array

A paradoxical aspect of the reference surrogate mechanism is demonstrated for the pointer types. Suppose that some type T is a value type in the source language, but is implemented as a reference surrogate type. The paradox is that the types T and pointer to T share the same runtime representation! The difference between the two cases is not the runtime structure that is used to represent the values, but rather the way in which allocation and assignment are translated into object code. For the pointer type, initialization consists of loading the variable with nil; for the value type, an explicit object must be dynamically created as discussed earlier. To translate assignment, in the pointer case, the reference is simply copied by a single Intermediate Language instruction. For record and array values, to achieve proper value semantics, the destination and source surrogate references must be loaded and a datum-by-datum copy performed.
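The two translations can be placed side by side in a minimal sketch. It is written in Java, where an `int[]` reference plays the part of the shared runtime representation; the class `PointerVersusValue` and its four methods are hypothetical names chosen for illustration.

```java
// One runtime representation (int[]), two translations of the source operations.
final class PointerVersusValue {
    private PointerVersusValue() {}

    // Initialization of a pointer variable: just load nil.
    static int[] initPointer() {
        return null;
    }

    // Initialization of a value (array) variable: allocate the surrogate.
    static int[] initValue(int length) {
        return new int[length];
    }

    // Assignment to a pointer variable: copy the reference only.
    static int[] assignPointer(int[] src) {
        return src;
    }

    // Assignment to a value variable: copy every element into the
    // existing destination surrogate.
    static void assignValue(int[] dst, int[] src) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        int[] p = assignPointer(src);   // p aliases src
        int[] v = initValue(4);
        assignValue(v, src);            // v holds an independent copy
        src[0] = 9;
        System.out.println(p[0]);       // 9
        System.out.println(v[0]);       // 1
    }
}
```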

Extensible Records

The case of extensible records is similar to that of arrays. Corresponding to each extensible record type, a reference class is declared to the CLR. Instances of this class are used as reference surrogates for the value record. As before, the magic is not in the type declaration but rather in the way in which the operations of the source language are translated for the runtime environment.

Value records that are implicitly created in the source code require explicit allocation of a surrogate from the heap at runtime. For assignment, it is necessary to perform a field-by-field copy, including even inherited private fields. Because the code requiring the copy does not in general have access rights for all inherited fields, it is necessary to call the public copy method of each supertype.
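The chain of copy methods can be sketched as follows. This is a minimal illustration in Java, assuming each surrogate class carries a copy method that handles its own fields and delegates upward; the method name `copyFrom` and the record types `Shape` and `Circle` are hypothetical and are not the names the compiler uses.

```java
// Surrogate for an extensible base record with a private field.
class Shape {
    private int id;   // not accessible from subclasses or client code

    void setId(int value) { id = value; }
    int getId() { return id; }

    // Copy method of the base type: copies the fields this type owns.
    void copyFrom(Shape src) {
        this.id = src.id;
    }
}

// Surrogate for a record type that extends Shape.
class Circle extends Shape {
    int radius;

    // Copy method of the extension: delegates the inherited private
    // fields to the supertype, then copies its own.
    void copyFrom(Circle src) {
        super.copyFrom(src);
        this.radius = src.radius;
    }

    public static void main(String[] args) {
        Circle a = new Circle();
        a.setId(5);
        a.radius = 2;
        Circle b = new Circle();   // explicit allocation of the surrogate
        b.copyFrom(a);             // translation of the value assignment b := a
        a.setId(6);
        System.out.println(b.getId());   // 5
        System.out.println(b.radius);    // 2
    }
}
```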

Pointers to extensible records are implemented by the same runtime class as their bound type. As is the case with arrays, the distinction between the pointer and its target lies in the different translation of the code for initialization and assignment.

Static Record Types

Record types that are sealed and do not extend any other type have the opportunity to be implemented as value classes in the CTS. There are advantages in doing so, because the operations for initialization and entire assignment are primitives of the runtime and do not involve object creation. The C# compiler uses such classes to implement structs. The Component Pascal compiler uses such classes for record types that neither are extensible nor extend any other type.

However, a small issue arises for Component Pascal that does not arise for C#. In the case that a particular record type R is implemented as a value class, how is the type pointer to R to be implemented? There are two clear possibilities. First, it is possible to use the built-in box instruction to copy the value records to the heap as needed. Second, it is possible to define a corresponding reference class for each defined value class. The argument against the first method is that every system-boxed value will be of type System.Object and will require a narrowing cast at each unboxing site. The disadvantage of the second method is the cluttering of the type-name scope with extra (possibly unused) class declarations.

The Component Pascal compiler uses the second method. For every suitable record type R, two CTS declarations are made. One is for the value class R and the other is for the reference class boxed_R. As a slight optimization, anonymous record types that occur only as pointer-bound types are implemented as reference classes, as in this case it is never possible to create an unboxed instance of the class.

It follows that the Component Pascal programmer neither knows nor needs to care whether a particular type is really implemented as a value type or as a reference surrogate. The compiler will make a choice depending on the way in which the type is used, and it will generate code so that the proper source semantics are achieved.

A final point has to do with the fact that the reference classes are self-describing while the value classes are not. For the self-describing types, it is possible to find out the runtime type of an object using reflection. However, in a Component Pascal program, the user is not required to know whether a particular type is implemented as a reference type or a value type, so how is the user supposed to know if it is permissible to call a reflection method? The answer is simple. When the program uses the TYPEOF() primitive, the compiler knows whether the parameter is implemented as a value object or a reference surrogate. If the parameter is a value type, then the compiler knows what the exact type is (this is the reason that the compiler knows that using a value type is permitted); otherwise, the compiler generates a runtime call to the reflection library.

Dispatched Methods

Both value and reference classes may have instance and virtual methods bound to them. Because value classes are necessarily sealed, the use of virtual methods on such types is of limited utility. In Component Pascal, procedures may be bound to record types, whether the types are extensible or not. The type-bound procedures of CP thus map in a very natural way to the runtime facilities.

Consider the type-bound procedure (dispatched method) declared as follows:

 procedure (in thisFig : Figure) Serial() : INTEGER, new;
 begin
   <procedure body>
 end Serial;

Serial is a method bound to the object type Figure, takes no other arguments, and returns the INTEGER type. The this object of the method is known as thisFig within the method. Unlike in most object-oriented languages, the this is named by the user just like any other formal parameter. In the preceding example, the method is declared to be new, meaning that it does not override any inherited method, and it is not marked as extensible. The method is thus final. The compiler will recognize that this particular choice of attributes allows the method to be optimally implemented as a (nonvirtual) instance method. As in other aspects, the user specifies the desired semantics, and the compiler determines the mapping to the CLR that best achieves that result.

A small curiosity has to do with the fact that in Component Pascal methods are associated with record types. The receiver value (the this value) in the source may be defined either as a reference parameter of the record type or as a value parameter of the corresponding pointer type. If the Component Pascal record is implemented as a reference surrogate, there is no problem, as both types are represented by the same reference class. However, if the record type is implemented as a value class, the record and the pointer types are represented by different types at runtime. This simply causes confusion for users who cannot resist the urge to read the assembly language.