Primitive Types in the Common Language Runtime
All types have to be defined somewhere. The Microsoft .NET Framework class library defines hundreds of types, and other assemblies build their own types based on the types defined in the class library. Some of the types defined in the class library are recognized by the common language runtime as primitive types and are given special encoding in the signatures. This is done only for the sake of performance—theoretically, the signatures could have been built from type tokens only, given that every type is defined somewhere and hence has a token. But resolving all these tokens simply to find that they reference trivial items such as a 4-byte integer or a Boolean value can hardly be considered a sensible way to work in the runtime.
Primitive Data Types
The term primitive data types refers to the types defined in the .NET Framework class library that are given specific individual type codes to be used in signatures. Because all these types are defined in the assembly Mscorlib and all belong to the namespace System, I have omitted the prefix [mscorlib]System when supplying the class library type name for a type.
The individual type codes are defined in the enumeration CorElementType in the header file CorHdr.h. The names of all these codes begin with ELEMENT_TYPE_, which I have either omitted in this chapter or abbreviated as E_T_.
Table 7-1 describes primitive data types and their respective ILAsm notation.
Code | Constant Name | .NET Framework Type Name | ILAsm Notation | Comments |
0x01 | VOID | Void | void | |
0x02 | BOOLEAN | Boolean | bool | Single-byte value, true = 1, false = 0 |
0x03 | CHAR | Char | char | 2-byte unsigned integer, representing a Unicode character |
0x04 | I1 | SByte | int8 | Signed 1-byte integer, the same as char in C/C++ |
0x05 | U1 | Byte | unsigned int8 | Unsigned 1-byte integer |
0x06 | I2 | Int16 | int16 | Signed 2-byte integer |
0x07 | U2 | UInt16 | unsigned int16 | Unsigned 2-byte integer |
0x08 | I4 | Int32 | int32 | Signed 4-byte integer |
0x09 | U4 | UInt32 | unsigned int32 | Unsigned 4-byte integer |
0x0A | I8 | Int64 | int64 | Signed 8-byte integer |
0x0B | U8 | UInt64 | unsigned int64 | Unsigned 8-byte integer |
0x0C | R4 | Single | float32 | 4-byte floating-point |
0x0D | R8 | Double | float64 | 8-byte floating-point |
0x16 | TYPEDBYREF | TypedReference | typedref | Typed reference, carrying both reference to a type and information identifying the referenced type |
0x18 | I | IntPtr | native int | Pointer-size integer; size dependent on the underlying platform, hence use of the keyword native |
0x19 | U | UIntPtr | native unsigned int | Pointer-size unsigned integer |
Data Pointer Types
Two data pointer types are defined in the common language runtime: the managed pointer, which is a reference, and the unmanaged pointer, which is a pointer in the conventional sense. The difference is that a managed pointer is managed by the runtime’s garbage collection subsystem and stays valid even if the referenced item is moved in memory during the process of garbage collection, whereas an unmanaged pointer can be safely used only in association with “unmovable” items.
Both pointer types have no meaning per se and must be followed by the base types, which are the types to which the pointer types point. As derivatives from base types, the pointer types have no corresponding types defined in the .NET Framework class library and cannot be boxed. Table 7-2 describes the two pointer types and their ILAsm notations. Neither of them has a respective .NET Framework type associated.
Code | Constant Name | ILAsm Notation | Comments |
0x0F | PTR | <type>* | Unmanaged pointer to <type> |
0x10 | BYREF | <type>& | Managed pointer to <type> |
Note that although ILAsm notation places the pointer sign after the pointed type, in signatures E_T_PTR and E_T_BYREF always precede the pointed type.
Pointers of both types are subject to standard pointer arithmetic: an integer can be added to or subtracted from a pointer, resulting in a pointer; and one pointer can be subtracted from another, resulting in an integer value. The difference between pointer arithmetic in, say, C/C++ and in IL (intermediate language) is that in IL—and hence in ILAsm—the increments and decrements of pointers are always specified in bytes, regardless of the size of the item the pointer represents.
C/C++:
long L, *pL = &L;
pL += 4;       // pL is incremented by 4*sizeof(long) = 16 bytes
ILAsm:
.locals init(int32 L, int32& pL)
ldloca L       // Load pointer to L on stack
stloc pL       // pL = &L
ldloc pL       // Load pL on stack
ldc.i4 4       // Load 4 on stack
add
stloc pL       // pL += 4, pL is incremented by 4 bytes
By the same token—now, this is just a common expression. I’m not referring to metadata tokens. (I think I’d better be extra careful with phrases like “by the same token” or “token of appreciation” in this book.) In the same way, the delta of two pointers in IL is always expressed in bytes, not in the items pointed at.
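The difference between the two conventions can be modeled with plain integer addresses. The following Python sketch is only an illustration: the function names and the sample address are hypothetical, and a 4-byte item size is assumed to match the example above.

```python
def c_style_add(address: int, count: int, item_size: int) -> int:
    """C/C++ convention: the increment is scaled by the pointed-at item size."""
    return address + count * item_size


def il_style_add(address: int, byte_count: int) -> int:
    """IL convention: the increment is always taken in bytes."""
    return address + byte_count


def il_style_delta(first: int, second: int) -> int:
    """IL convention: the difference of two pointers is a byte count."""
    return first - second


base = 0x1000      # hypothetical address of L
item_size = 4      # assumed size of the pointed-at item, in bytes

c_result = c_style_add(base, 4, item_size)   # 16 bytes past base
il_result = il_style_add(base, 4)            # 4 bytes past base
```

To advance by four items under the IL convention, the code itself has to multiply by the item size, for example il_style_add(base, 4 * item_size).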
Using unmanaged pointers in IL is not considered nice. Because of the unlimited access that C-style pointer arithmetic gives to anybody for anything, IL code, which has unmanaged pointers dereferenced, is deemed unverifiable and can be run only from a local drive with run-time code verification disabled.
Managed pointers are tamed, domesticated pointers, fully owned by the common language runtime type control and the garbage collection subsystem. These pointers dwell in a safe but not too spacious corral, fenced along the following lines:
Managed pointers are always references to an item in existence—a field, an array element, a local variable, a method argument.
Managed pointer types can be used only for method attributes—local variables, parameters, or a return type.
Array elements and fields cannot have managed pointer types. Local variables and method parameters can, and it is not a simple coincidence that all these items are stack-allocated.
Managed pointers that point to “managed memory” (the garbage collector heap, which contains object instances and arrays) cannot be converted to unmanaged pointers.
Managed pointers that don’t point to the garbage collector heap can be converted to unmanaged pointers, but such conversion renders the IL code unverifiable.
The underlying type of a managed pointer cannot be another pointer, but it can be an object reference.
Managed pointers are different from object references. In Chapter 6, “Namespaces and Classes,” which described boxing and unboxing of the value types, we saw that it takes boxing to create an object reference to a value type. Using a simple reference—that is, a managed pointer—is not enough.
The difference is that an object reference points to the method table of an object, whereas a managed pointer points to the value (data) part of the item. When you take a managed pointer to an instance of a value type, you address the data part. You can have only this much because instances of value types, not being objects, have no method tables.
When you box a value type instance, you create an object, a class instance with its own method table and data part copied from the value type instance. This object is represented by an object reference.
Function Pointer Types
Chapter 6 briefly described the use of managed function pointers and compared them with delegate types. Managed function pointers are represented by type E_T_FNPTR, which is indicated by the value 0x1B and doesn’t have a .NET Framework type associated.
Just like a data pointer type, a function pointer type does not exist by itself and must be followed by the full signature of the function to which it points. (Method signatures are discussed later in this chapter; see “Signatures.”)
The ILAsm notation for a function pointer is as follows:
<call_conv> <return_type> * (<type>[,<type>*])
where <call_conv> is a calling convention, <return_type> is the return type, and the <type> sequence in the parentheses is the argument list. You’ll find more details in the “Signatures” section.
Vectors and Arrays
The common language runtime recognizes two types of arrays: vectors and multidimensional arrays, as described in Table 7-3. Vectors are single-dimensional arrays with a zero lower bound. Multidimensional arrays, which I’ll refer to as arrays, can have more than one dimension and nonzero lower bounds. Neither of these two types of arrays has a respective .NET Framework type associated.
Code | Constant Name | ILAsm Notation | Comments |
0x1D | SZARRAY | <type>[ ] | Vector of <type> |
0x14 | ARRAY | <type>[<bounds> [,<bounds>*] ] | Array of <type> |
All vectors and arrays are objects (class instances) derived from the abstract class [mscorlib]System.Array. This is a very peculiar class; in fact, it is a construct known as a generic.
Vector encoding is very simple: E_T_SZARRAY followed by the encoding of the underlying type, which can be anything except void. The size of the vector is not part of the encoding. Because arrays and vectors are object references, it is not enough to simply declare an array—you must create an instance of it, using the instruction newarr for a vector or calling an array constructor. It is at that point that the size of the vector or array instance is specified.
Array encoding is more sophisticated:
E_T_ARRAY<underlying_type><rank><num_sizes><size1>...<sizeN><num_lower_bounds><lower_bound1>...<lower_boundM>
where the following is true:
<underlying_type> cannot be void
<rank> is the number of array dimensions (K > 0)
<num_sizes> is the number of specified sizes for dimensions (N <= K)
<sizen> is an unsigned integer specifying the size (n = 1,...,N)
<num_lower_bounds> is the number of specified lower bounds (M <= K)
<lower_boundm> is a signed integer specifying the lower bound (m = 1,...,M)
All the above unsigned integer values are compressed according to the length compression formula discussed in Chapter 4, “Metadata Tables Organization.” To save you a trip three chapters back, I will repeat this formula in Table 7-4.
Value Range | Compressed Size | Compressed Value |
0 to 0x7F | 1 byte | <value> |
0x80 to 0x3FFF | 2 bytes | 0x8000 OR <value> |
0x4000 to 0x1FFFFFFF | 4 bytes | 0xC0000000 OR <value> |
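The length compression formula can be sketched in a few lines of Python. The function name is hypothetical, and the byte order is an assumption that the table does not state: the sketch emits the most significant byte first, so that the leading bits of the first byte tell a reader how many bytes follow.

```python
def compress_unsigned(value: int) -> bytes:
    """Compress an unsigned integer according to the ranges in Table 7-4."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return bytes([value])                        # 1 byte: the value itself
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")   # 2 bytes, top bits 10
    return (0xC0000000 | value).to_bytes(4, "big")   # 4 bytes, top bits 110


print(compress_unsigned(0x03).hex(" "))     # 03
print(compress_unsigned(0x80).hex(" "))     # 80 80
print(compress_unsigned(0x4000).hex(" "))   # c0 00 40 00
```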
Signed integer values (lower bound values) are compressed according to a different compression procedure. First the signed integer is encoded as an unsigned integer by taking the absolute value of the original integer, shifting it left by 1 bit, and setting the least significant bit according to the most significant (sign) bit of the original value. Then compression is applied according to the formula shown in Table 7-4.
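The two-step procedure for signed values can be sketched as follows. This Python fragment follows the description above literally; the function names are hypothetical, the most-significant-byte-first order is an assumption, and the fragment should be read as an illustration of the procedure as described here rather than as a reference encoder.

```python
def compress_unsigned(value: int) -> bytes:
    """Compress an unsigned integer according to the ranges in Table 7-4."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return bytes([value])
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")
    return (0xC0000000 | value).to_bytes(4, "big")


def encode_signed(value: int) -> int:
    """Step 1: absolute value shifted left by 1 bit, sign in the lowest bit."""
    sign_bit = 1 if value < 0 else 0
    return (abs(value) << 1) | sign_bit


def compress_signed(value: int) -> bytes:
    """Step 2: apply the unsigned compression to the re-encoded value."""
    return compress_unsigned(encode_signed(value))


print(compress_signed(2).hex(" "))      # 04
print(compress_signed(-3).hex(" "))     # 07
print(compress_signed(-100).hex(" "))   # 80 c9
```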
If size and/or the lower bound for a dimension are not specified, they are not presumed to be 0; rather, they are marked as not specified. The specification of size and lower bound cannot have “holes”—that is, if you have an array of rank 5 and want to specify size (or lower bound) for its third dimension, you must specify size (or lower bound) for the first and second dimensions as well.
An array specification in ILAsm looks like this:
<type> [ <bounds>[, <bounds>*] ]
where
<bounds> ::= [<lower_bound>] ... [<upper_bound>]
The following is an example:
int32[...,...]     // Two-dimensional array with undefined lower bounds
                   // and sizes
int32[2...5]       // One-dimensional array with lower bound 2 and size 4
int32[0...,0...]   // Two-dimensional array with zero lower bounds
                   // and undefined sizes
If neither lower bound nor upper bound is specified for a dimension in a multidimensional array declaration, the ellipsis can be omitted. Thus int32[...,...] and int32[,] mean the same: a two-dimensional array with no lower bounds or sizes specified.
This omission does not work in the case of single-dimensional arrays, however. The notation int32[ ] indicates a vector (<E_T_SZARRAY><E_T_I4>), and int32[...] indicates an array of rank 1 whose lower bound and size are undefined (<E_T_ARRAY><E_T_I4><1><0><0>).
The common language runtime treats multidimensional arrays and vectors of vectors (of vectors, and so on) completely differently. The specifications int32[,] and int32[ ][ ] result in different type encoding, are created differently, and are laid out differently when created:
int32[,] This specification has the encoding <E_T_ARRAY><E_T_I4><2><0><0>, is created by a single call to an array constructor, and is laid out as a contiguous two-dimensional array of int32.
int32[ ][ ] This specification has the encoding <E_T_SZARRAY><E_T_SZARRAY><E_T_I4>, is created by a series of newarr instructions, and is laid out as a vector of vector references, each pointing to a contiguous vector of int32, with no guarantee regarding the location of each vector. Vectors of vectors are useful for describing jagged arrays, when the size of the second dimension varies depending on the first dimension index.
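To make the difference in encodings concrete, the following Python sketch assembles the signature bytes for a few of the declarations discussed in this section. The helper names are hypothetical. The element type codes come from Table 7-1 and Table 7-3, the compression follows Table 7-4 and the signed procedure as described in this chapter, and the most-significant-byte-first order is an assumption.

```python
E_T_I4 = 0x08        # int32
E_T_ARRAY = 0x14     # multidimensional array
E_T_SZARRAY = 0x1D   # vector


def compress_unsigned(value: int) -> bytes:
    """Length compression according to Table 7-4."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return bytes([value])
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")
    return (0xC0000000 | value).to_bytes(4, "big")


def compress_signed(value: int) -> bytes:
    """Signed compression as described in this chapter."""
    return compress_unsigned((abs(value) << 1) | (1 if value < 0 else 0))


def vector_sig(element: bytes) -> bytes:
    """E_T_SZARRAY followed by the encoding of the underlying type."""
    return bytes([E_T_SZARRAY]) + element


def array_sig(element: bytes, rank: int, sizes=(), lower_bounds=()) -> bytes:
    """E_T_ARRAY, underlying type, rank, then the sizes and lower bounds."""
    if rank < 1:
        raise ValueError("rank must be greater than 0")
    if len(sizes) > rank or len(lower_bounds) > rank:
        raise ValueError("more sizes or lower bounds than dimensions")
    out = bytes([E_T_ARRAY]) + element + compress_unsigned(rank)
    out += compress_unsigned(len(sizes))
    out += b"".join(compress_unsigned(s) for s in sizes)
    out += compress_unsigned(len(lower_bounds))
    out += b"".join(compress_signed(b) for b in lower_bounds)
    return out


INT32 = bytes([E_T_I4])
print(vector_sig(INT32).hex(" "))                 # int32[ ]     -> 1d 08
print(vector_sig(vector_sig(INT32)).hex(" "))     # int32[ ][ ]  -> 1d 1d 08
print(array_sig(INT32, 1).hex(" "))               # int32[...]   -> 14 08 01 00 00
print(array_sig(INT32, 2).hex(" "))               # int32[,]     -> 14 08 02 00 00
print(array_sig(INT32, 1, [4], [2]).hex(" "))     # int32[2...5] -> 14 08 01 01 04 01 04
```

Because sizes and lower bounds are passed as leading sequences, the sketch cannot express "holes" in the specification, which matches the restriction described earlier.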
Modifiers
Four built-in common language runtime types, described in Table 7-5, do not denote any specific data or pointer type but rather are used as modifiers of data and pointer types. None of these modifiers have a respective .NET Framework type associated.
Code | Constant Name | ILAsm Notation | Comments |
0x1F | CMOD_REQD | modreq( <class_ref> ) | Required C modifier |
0x20 | CMOD_OPT | modopt( <class_ref> ) | Optional C modifier |
0x41 | SENTINEL | ... | Start of optional arguments in a vararg method call |
0x45 | PINNED | pinned | Marks a local variable as unmovable by the garbage collector |
The modifiers modreq and modopt indicate that the item to which they are attached—an argument, a return type, or a field, for example—must be treated in some special way. These modifiers are followed by TypeDef or TypeRef tokens, and the classes corresponding to these tokens indicate the special way the item is to be handled.
The tokens following modreq and modopt are compressed according to the following algorithm. As you might remember, an uncoded (external) metadata token is a 4-byte unsigned integer, which has the token type in its senior byte and a record index (RID) in its 3 lower bytes. It so happens that the tokens appearing in the signatures and hence requiring compression are of three types only: TypeDef, TypeRef, or TypeSpec. (See “Signatures” later in this chapter for information about TypeSpecs.) Because of that, only 2 bits, rather than a whole byte, are required for the token type: 00 denotes TypeDef, 01 is used for TypeRef, and 10 for TypeSpec. The token compression procedure resembles the procedure used to compress the signed integers: the RID part of the token is shifted left by 2 bits, and the 2-bit type encoding is placed in the least significant bits. The result is compressed just as any unsigned integer would be, according to the formula shown earlier in Table 7-4.
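The token compression algorithm can be sketched in Python as follows. The function names are hypothetical. The token type bytes (0x02 for TypeDef, 0x01 for TypeRef, 0x1B for TypeSpec) are not given in this section and are taken here as an assumption from the metadata token definitions in CorHdr.h; the most-significant-byte-first order of the compressed value is likewise an assumption.

```python
# Assumed token type bytes (the senior byte of an uncoded token).
TYPE_TAGS = {
    0x02: 0b00,   # TypeDef
    0x01: 0b01,   # TypeRef
    0x1B: 0b10,   # TypeSpec
}


def compress_unsigned(value: int) -> bytes:
    """Length compression according to Table 7-4."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return bytes([value])
    if value <= 0x3FFF:
        return (0x8000 | value).to_bytes(2, "big")
    return (0xC0000000 | value).to_bytes(4, "big")


def compress_token(token: int) -> bytes:
    """Shift the RID left by 2 bits, put the 2-bit type tag in the low bits."""
    token_type = (token >> 24) & 0xFF
    rid = token & 0x00FFFFFF
    if token_type not in TYPE_TAGS:
        raise ValueError("only TypeDef, TypeRef, and TypeSpec can be compressed")
    return compress_unsigned((rid << 2) | TYPE_TAGS[token_type])


print(compress_token(0x02000001).hex(" "))   # TypeDef,  RID 1    -> 04
print(compress_token(0x01000012).hex(" "))   # TypeRef,  RID 0x12 -> 49
print(compress_token(0x1B000001).hex(" "))   # TypeSpec, RID 1    -> 06
print(compress_token(0x01000020).hex(" "))   # TypeRef,  RID 0x20 -> 80 81
```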
The modifiers modreq and modopt are used primarily by tools other than the common language runtime, such as compilers or program analyzers. The modreq modifier indicates that the modifier must be taken into account, whereas modopt indicates that the modifier is optional and can be ignored. The ILAsm compiler does not use these modifiers for its internal purposes.
The only use of the modreq and modopt modifiers recognized by the common language runtime is when these modifiers are applied to return types or parameters of methods subject to managed/unmanaged marshaling. For example, to specify that a managed method must have the cdecl calling convention when it is marshaled as unmanaged, we can use the following modifier attached to the method’s return type:
modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl)
When used in the context of managed/unmanaged marshaling, the modreq and modopt modifiers are equivalent.
Although the modreq and modopt modifiers have no effect on the managed types of the items to which they are attached, signatures with and without these modifiers are considered different. The same is true for signatures differing only in classes referenced by these modifiers.
The sentinel modifier (...) was introduced in Chapter 1, “Simple Sample,” when we analyzed the declaration and calling of methods with a variable-length argument list (vararg methods). (See “Method Declaration.”) A sentinel signifies the beginning of optional arguments supplied for a vararg method call. This modifier can appear in only one context: at the call site, because the optional parameters of a vararg method are not specified when such a method is declared. The runtime treats a sentinel appearing in any other context as an error. The method arguments at the call site can contain only one sentinel, and the sentinel is used only if optional arguments are supplied:
// Declaration of vararg method, mandatory parameters only:
.method public static vararg int32 Print(string Format)
{
   ...
}
// Calling vararg method with two optional arguments:
call vararg int32 Print(string, ..., int32, int32)
// Calling vararg method without optional arguments:
call vararg int32 Print(string)
The pinned modifier is applicable to the method’s local variables only. Its use means that the local variable cannot be relocated by the garbage collector and must stay put throughout the method execution. If a local variable is “pinned,” it is safe to convert a managed pointer to this variable to an unmanaged pointer and then to dereference this unmanaged pointer, because the unmanaged pointer is guaranteed to still be valid when it is dereferenced:
.locals init(int32 A, int32 pinned B, int32* pA, int32* pB)
ldloca A
stloc pA       // pA = &A
ldloca B
stloc pB       // pB = &B
ldloc pA
ldc.i4 123
stind.i4       // *pA=123 unsafe, A could have been moved
ldloc pB
ldc.i4 123
stind.i4       // *pB=123 safe, B is pinned and cannot move
Native Types
When managed code calls unmanaged methods or exposes managed fields to unmanaged code, it is sometimes necessary to provide specific information about how the managed types should be marshaled to and from the unmanaged types. The unmanaged types recognizable by the common language runtime are referred to as native, and they are listed in CorHdr.h in the enumeration CorNativeType. All constants in this enumeration have names that begin with NATIVE_TYPE_; for purposes of this discussion, I have omitted this part of the names or abbreviated it as N_T_. The same constants are also listed in the .NET Framework class library in the enumeration System.Runtime.InteropServices.UnmanagedType.
Some of the native types are obsolete and are ignored by the runtime interoperability subsystem. But since these native types are not retired altogether, ILAsm must have ways to denote them—and since ILAsm denotes these types, I cannot help but list obsolete types along with others, all of which you’ll find in Table 7-6.
Code | Constant Name | .NET Framework Type Name | ILAsm Notation | Comments |
0x01 | VOID | | void | Obsolete and thus should not be used; recognized by ILAsm but ignored by the runtime interoperability subsystem |
0x02 | BOOLEAN | Bool | bool | 4-byte Boolean value; true = nonzero, false = 0 |
0x03 | I1 | I1 | int8 | Signed 1-byte integer |
0x04 | U1 | U1 | unsigned int8 | Unsigned 1-byte integer |
0x05 | I2 | I2 | int16 | Signed 2-byte integer |
0x06 | U2 | U2 | unsigned int16 | Unsigned 2-byte integer |
0x07 | I4 | I4 | int32 | Signed 4-byte integer |
0x08 | U4 | U4 | unsigned int32 | Unsigned 4-byte integer |
0x09 | I8 | I8 | int64 | Signed 8-byte integer |
0x0A | U8 | U8 | unsigned int64 | Unsigned 8-byte integer |
0x0B | R4 | R4 | float32 | 4-byte floating-point |
0x0C | R8 | R8 | float64 | 8-byte floating-point |
0x0D | SYSCHAR | | syschar | Obsolete |
0x0E | VARIANT | | variant | Obsolete |
0x0F | CURRENCY | Currency | currency | Currency value |
0x10 | PTR | | * | Obsolete; use native int |
0x11 | DECIMAL | | decimal | Obsolete |
0x12 | DATE | | date | Obsolete |
0x13 | BSTR | BStr | bstr | Unicode Visual Basic style string |
0x14 | LPSTR | LPStr | lpstr | Pointer to a zero-terminated ANSI string |
0x15 | LPWSTR | LPWStr | lpwstr | Pointer to a zero-terminated Unicode string |
0x16 | LPTSTR | LPTStr | lptstr | Pointer to a zero-terminated ANSI or Unicode string, depending on platform |
0x17 | FIXEDSYSSTRING | ByValTStr | fixed sysstring [<size>] | Fixed-system string of size <size> bytes; applicable to field marshaling only |
0x18 | OBJECTREF | | objectref | Obsolete |
0x19 | IUNKNOWN | IUnknown | iunknown | IUnknown interface pointer |
0x1A | IDISPATCH | IDispatch | idispatch | IDispatch interface pointer |
0x1B | STRUCT | Struct | struct | C-style structure, for marshaling the formatted managed types |
0x1C | INTF | Interface | interface | Interface pointer |
0x1D | SAFEARRAY | SafeArray | safearray <variant_type> | Safe array of type <variant_type> |
0x1E | FIXEDARRAY | ByValArray | fixed array [<size>] | Fixed-size array, of size <size> bytes |
0x1F | INT | IntPtr | int | Signed pointer-size integer |
0x20 | UINT | UIntPtr | unsigned int | Unsigned pointer-size integer |
0x21 | NESTEDSTRUCT | | nested struct | Obsolete; use struct |
0x22 | BYVALSTR | VBByRefStr | byvalstr | Visual Basic style string in a fixed-length buffer |
0x23 | ANSIBSTR | AnsiBStr | ansi bstr | ANSI Visual Basic style string |
0x24 | TBSTR | TBStr | tbstr | bstr or ansi bstr, depending on the platform |
0x25 | VARIANTBOOL | VariantBool | variant bool | 2-byte Boolean; true = -1, false = 0 |
0x26 | FUNC | FunctionPtr | method | Function pointer |
0x28 | ASANY | AsAny | as any | Object; type defined at run time |
0x2A | ARRAY | LPArray | <n_type> [<sizes>] | Fixed-size array of a native type <n_type> |
0x2B | LPSTRUCT | LPStruct | lpstruct | Pointer to a C-style structure |
0x2C | CUSTOMMARSHALER | CustomMarshaler | custom (<class_str>, <cookie_str>) | Custom marshaler |
0x2D | ERROR | Error | error | Maps int32 to VT_HRESULT |
The <sizes> parameter in the ILAsm notation for ARRAY, shown in Table 7-6, can be empty or can be formatted as <size> + <size_param_number>:
<sizes> ::= <> | <size> | + <size_param_number> | <size> + <size_param_number>
If <sizes> is empty, the size of the native array is derived from the size of the managed array being marshaled.
The <size> parameter specifies the native array size in array items. The zero-based method parameter number <size_param_number> indicates which of the method parameters specifies the size of the native array. The total size of the native array is <size> plus the additional size specified by the method parameter that is indicated by <size_param_number>.
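The size rule can be restated as a small Python function. This is only a model of the arithmetic described above, not of how the runtime actually evaluates it; the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def native_array_length(managed_length: int,
                        args: Sequence[int],
                        size: Optional[int] = None,
                        size_param_number: Optional[int] = None) -> int:
    """Number of items in the native array, following the <sizes> rules."""
    if size is None and size_param_number is None:
        # Empty <sizes>: derived from the managed array being marshaled.
        return managed_length
    total = size if size is not None else 0
    if size_param_number is not None:
        # Zero-based index of the method parameter carrying the extra size.
        total += args[size_param_number]
    return total


call_args = [7, 3]   # hypothetical values of the method's integer parameters
print(native_array_length(10, call_args))                                # 10
print(native_array_length(10, call_args, size=8))                        # 8
print(native_array_length(10, call_args, size=8, size_param_number=1))   # 11
```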
A custom marshaler declaration (shown in Table 7-6) has two parameters, both of which are quoted strings. The <class_str> parameter is the name of the class representing the custom marshaler, using the string conventions of Reflection.Emit. The <cookie_str> parameter is an argument string (cookie) passed to the custom marshaler at run time. This string identifies the form of the marshaling required, and its notation is specific to the custom marshaler.
Variant Types
Variant types are defined in the enumeration VARENUM in the Wtypes.h file, which is distributed with Microsoft Visual Studio. Not all variant types are applicable as safe array types, according to Wtypes.h, but ILAsm provides notation for all of them nevertheless, as shown in Table 7-7. It might look strange, considering that variant types appear in ILAsm only in the context of safe array specification, but we should not forget that one of ILAsm’s principal applications is the generation of test programs, which contain known, preprogrammed errors.
Code | Constant Name | Applicable to Safe Array? | ILAsm Notation |
0x00 | VT_EMPTY | No | <empty> |
0x01 | VT_NULL | No | null |
0x02 | VT_I2 | Yes | int16 |
0x03 | VT_I4 | Yes | int32 |
0x04 | VT_R4 | Yes | float32 |
0x05 | VT_R8 | Yes | float64 |
0x06 | VT_CY | Yes | currency |
0x07 | VT_DATE | Yes | date |
0x08 | VT_BSTR | Yes | bstr |
0x09 | VT_DISPATCH | Yes | idispatch |
0x0A | VT_ERROR | Yes | error |
0x0B | VT_BOOL | Yes | bool |
0x0C | VT_VARIANT | Yes | variant |
0x0D | VT_UNKNOWN | Yes | iunknown |
0x0E | VT_DECIMAL | Yes | decimal |
0x10 | VT_I1 | Yes | int8 |
0x11 | VT_UI1 | Yes | unsigned int8 |
0x12 | VT_UI2 | Yes | unsigned int16 |
0x13 | VT_UI4 | Yes | unsigned int32 |
0x14 | VT_I8 | No | int64 |
0x15 | VT_UI8 | No | unsigned int64 |
0x16 | VT_INT | Yes | int |
0x17 | VT_UINT | Yes | unsigned int |
0x18 | VT_VOID | No | void |
0x19 | VT_HRESULT | No | hresult |
0x1A | VT_PTR | No | * |
0x1B | VT_SAFEARRAY | No | safearray |
0x1C | VT_CARRAY | No | carray |
0x1D | VT_USERDEFINED | No | userdefined |
0x1E | VT_LPSTR | No | lpstr |
0x1F | VT_LPWSTR | No | lpwstr |
0x24 | VT_RECORD | Yes | record |
0x40 | VT_FILETIME | No | filetime |
0x41 | VT_BLOB | No | blob |
0x42 | VT_STREAM | No | stream |
0x43 | VT_STORAGE | No | storage |
0x44 | VT_STREAMED_OBJECT | No | streamed_object |
0x45 | VT_STORED_OBJECT | No | stored_object |
0x46 | VT_BLOB_OBJECT | No | blob_object |
0x47 | VT_CF | No | cf |
0x48 | VT_CLSID | No | clsid |
0x1000 | VT_VECTOR | Yes | <v_type> vector |
0x2000 | VT_ARRAY | Yes | <v_type> [ ] |
0x4000 | VT_BYREF | Yes | <v_type> & |