Primitive Types in the Common Language Runtime

All types have to be defined somewhere. The Microsoft .NET Framework class library defines hundreds of types, and other assemblies build their own types based on the types defined in the class library. Some of the types defined in the class library are recognized by the common language runtime as primitive types and are given special encoding in the signatures. This is done only for the sake of performance—theoretically, the signatures could have been built from type tokens only, given that every type is defined somewhere and hence has a token. But resolving all these tokens simply to find that they reference trivial items such as a 4-byte integer or a Boolean value can hardly be considered a sensible way to work in the runtime.

Primitive Data Types

The term primitive data types refers to the types defined in the .NET Framework class library that are given specific individual type codes to be used in signatures. Because all these types are defined in the assembly Mscorlib and all belong to the namespace System, I have omitted the prefix [mscorlib]System when supplying the class library type name for a type.

The individual type codes are defined in the enumeration CorElementType in the header file CorHdr.h. The names of all these codes begin with ELEMENT_TYPE_, which I have either omitted in this chapter or abbreviated as E_T_.

Table 7-1 describes primitive data types and their respective ILAsm notation.

Table 7-1 Primitive Data Types Defined in the Runtime
Code	Constant Name	.NET Framework Type Name	ILAsm Notation	Comments
0x01	VOID	Void	void
0x02	BOOLEAN	Boolean	bool	Single-byte value, true = 1, false = 0
0x03	CHAR	Char	char	2-byte unsigned integer, representing a Unicode character
0x04	I1	SByte	int8	Signed 1-byte integer, the same as char in C/C++
0x05	U1	Byte	unsigned int8	Unsigned 1-byte integer
0x06	I2	Int16	int16	Signed 2-byte integer
0x07	U2	UInt16	unsigned int16	Unsigned 2-byte integer
0x08	I4	Int32	int32	Signed 4-byte integer
0x09	U4	UInt32	unsigned int32	Unsigned 4-byte integer
0x0A	I8	Int64	int64	Signed 8-byte integer
0x0B	U8	UInt64	unsigned int64	Unsigned 8-byte integer
0x0C	R4	Single	float32	4-byte floating-point
0x0D	R8	Double	float64	8-byte floating-point
0x16	TYPEDBYREF	TypedReference	typedref	Typed reference, carrying both a reference to an item and information identifying the type of that item
0x18	I	IntPtr	native int	Pointer-size integer; size dependent on the underlying platform, hence use of the keyword native
0x19	U	UIntPtr	native unsigned int	Pointer-size unsigned integer
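As a quick illustration of how a tool might use Table 7-1, here is a minimal Python sketch that maps an element type code to its ILAsm notation. The dictionary is copied from the table; the names PRIMITIVE_TYPES and ilasm_name are hypothetical helpers invented for this example and are not part of any runtime API.

```python
# Element type codes and ILAsm notations, copied from Table 7-1.
PRIMITIVE_TYPES = {
    0x01: "void",
    0x02: "bool",
    0x03: "char",
    0x04: "int8",
    0x05: "unsigned int8",
    0x06: "int16",
    0x07: "unsigned int16",
    0x08: "int32",
    0x09: "unsigned int32",
    0x0A: "int64",
    0x0B: "unsigned int64",
    0x0C: "float32",
    0x0D: "float64",
    0x16: "typedref",
    0x18: "native int",
    0x19: "native unsigned int",
}


def ilasm_name(code: int) -> str:
    """Return the ILAsm notation for a primitive element type code."""
    try:
        return PRIMITIVE_TYPES[code]
    except KeyError:
        # Codes outside Table 7-1 are not primitive data types.
        raise ValueError(f"0x{code:02X} is not a primitive type code") from None


print(ilasm_name(0x08))
print(ilasm_name(0x18))
```

Any code that is absent from the table is rejected rather than guessed at, because such a code is not a primitive data type.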

Data Pointer Types

Two data pointer types are defined in the common language runtime: the managed pointer, which is a reference, and the unmanaged pointer, which is a pointer in the conventional sense. The difference is that a managed pointer is managed by the runtime’s garbage collection subsystem and stays valid even if the referenced item is moved in memory during the process of garbage collection, whereas an unmanaged pointer can be safely used only in association with “unmovable” items.

Neither pointer type has any meaning by itself; each must be followed by its base type, the type to which the pointer points. Being derived from base types, the pointer types have no corresponding types defined in the .NET Framework class library and cannot be boxed. Table 7-2 describes the two pointer types and their ILAsm notations.

Table 7-2 Pointer Types Defined in the Runtime
Code	Constant Name	ILAsm Notation	Comments
0x0F	PTR	<type>*	Unmanaged pointer to <type>
0x10	BYREF	<type>&	Managed pointer to <type>


	Note that although ILAsm notation places the pointer sign after the pointed type, in signatures E_T_PTR and E_T_BYREF always precede the pointed type.
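The prefix ordering described in the note can be sketched in a few lines of Python. The constants come from Tables 7-1 and 7-2; the helper pointer_to is a hypothetical name used only for this illustration.

```python
E_T_I4 = 0x08      # int32, from Table 7-1
E_T_PTR = 0x0F     # unmanaged pointer, from Table 7-2
E_T_BYREF = 0x10   # managed pointer, from Table 7-2


def pointer_to(prefix: int, base_encoding: bytes) -> bytes:
    """Put the pointer code in front of the encoding of the pointed-at type."""
    return bytes([prefix]) + base_encoding


int32 = bytes([E_T_I4])
int32_ptr = pointer_to(E_T_PTR, int32)          # ILAsm: int32*
int32_ref = pointer_to(E_T_BYREF, int32)        # ILAsm: int32&
int32_ptr_ptr = pointer_to(E_T_PTR, int32_ptr)  # ILAsm: int32**

print(int32_ptr.hex(" "))
print(int32_ref.hex(" "))
print(int32_ptr_ptr.hex(" "))
```

The ILAsm text reads left to right from the base type outward, while the encoded bytes read from the outermost pointer inward.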

Pointers of both types are subject to standard pointer arithmetic: an integer can be added to or subtracted from a pointer, resulting in a pointer; and one pointer can be subtracted from another, resulting in an integer value. The difference between pointer arithmetic in, say, C/C++ and in IL (intermediate language) is that in IL—and hence in ILAsm—the increments and decrements of pointers are always specified in bytes, regardless of the size of the item the pointer represents.

C/C++:

      long L, *pL=&L;
      pL += 4; // pL is incremented by 4*sizeof(long) = 16 bytes

ILAsm:

      .locals init(int32 L, int32& pL)
      ldloca L   // Load pointer to L on stack
      stloc pL   // pL = &L
      ldloc pL   // Load pL on stack
      ldc.i4 4   // Load 4 on stack
      add
      stloc pL   // pL += 4, pL is incremented by 4 bytes

By the same token—now, this is just a common expression. I’m not referring to metadata tokens. (I think I’d better be extra careful with phrases like “by the same token” or “token of appreciation” in this book.) In the same way, the delta of two pointers in IL is always expressed in bytes, not in the items pointed at.
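The byte-based arithmetic can be modeled in a small Python sketch. It is only a model: memory is a bytearray, a "pointer" is a plain byte offset into it, and the names memory, load_int32, and advance are hypothetical.

```python
import struct

ITEM_SIZE = struct.calcsize("<i")  # size of one int32 item: 4 bytes

# Four consecutive int32 items laid out in a flat block of "memory".
memory = bytearray(struct.pack("<4i", 10, 20, 30, 40))


def load_int32(address: int) -> int:
    """Read the int32 stored at the given byte offset."""
    return struct.unpack_from("<i", memory, address)[0]


def advance(address: int, items: int, item_size: int = ITEM_SIZE) -> int:
    """IL-style increment: the caller's item count is scaled to bytes explicitly."""
    return address + items * item_size


p = 0
p += 4                 # as in IL: adding 4 moves the pointer by 4 bytes, one int32
second = load_int32(p)

q = advance(0, 3)      # three items forward is 12 bytes forward
delta_bytes = q - p    # the difference of two pointers is a byte count
delta_items = delta_bytes // ITEM_SIZE

print(second, delta_bytes, delta_items)
```

In C/C++ the compiler applies the scaling by the item size implicitly; in IL it is the job of whoever emits the add instruction.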

Using unmanaged pointers in IL is not considered nice. Because C-style pointer arithmetic gives anybody unlimited access to anything, IL code that dereferences unmanaged pointers is deemed unverifiable and can be run only from a local drive with run-time code verification disabled.

Managed pointers are tamed, domesticated pointers, fully owned by the common language runtime type control and the garbage collection subsystem. These pointers dwell in a safe but not too spacious corral, fenced along the following lines:

Managed pointers are always references to an item in existence—a field, an array element, a local variable, a method argument.
Managed pointer types can be used only for method attributes—local variables, parameters, or a return type.
Array elements and fields cannot have managed pointer types. Local variables and method parameters can, and it is not a simple coincidence that all these items are stack-allocated.
Managed pointers that point to “managed memory” (the garbage collector heap, which contains object instances and arrays) cannot be converted to unmanaged pointers.
Managed pointers that don’t point to the garbage collector heap can be converted to unmanaged pointers, but such conversion renders the IL code unverifiable.
The underlying type of a managed pointer cannot be another pointer, but it can be an object reference.

Managed pointers are different from object references. In Chapter 6, “Namespaces and Classes,” which described boxing and unboxing of the value types, we saw that it takes boxing to create an object reference to a value type. Using a simple reference—that is, a managed pointer—is not enough.

The difference is that an object reference points to the method table of an object, whereas a managed pointer points to the value (data) part of the item. When you take a managed pointer to an instance of a value type, you address the data part. You can have only this much because instances of value types, not being objects, have no method tables.

When you box a value type instance, you create an object, a class instance with its own method table and data part copied from the value type instance. This object is represented by an object reference.

Function Pointer Types

Chapter 6 briefly described the use of managed function pointers and compared them with delegate types. Managed function pointers are represented by type E_T_FNPTR, which is indicated by the value 0x1B and doesn’t have a .NET Framework type associated.

Just like a data pointer type, a function pointer type does not exist by itself and must be followed by the full signature of the function to which it points. (Method signatures are discussed later in this chapter; see “Signatures.”)

The ILAsm notation for a function pointer is as follows:

   method <call_conv> <return_type> * (<type>[,<type>*])

where <call_conv> is a calling convention, <return_type> is the return type, and the <type> sequence in the parentheses is the argument list. You’ll find more details in the “Signatures” section.

Vectors and Arrays

The common language runtime recognizes two types of arrays: vectors and multidimensional arrays, as described in Table 7-3. Vectors are single-dimensional arrays with a zero lower bound. Multidimensional arrays, which I’ll refer to as arrays, can have more than one dimension and nonzero lower bounds. Neither of these two types of arrays has a respective .NET Framework type associated.

Table 7-3 Arrays Supported in the Runtime
Code	Constant Name	ILAsm Notation	Comments
0x1D	SZARRAY	<type>[ ]	Vector of <type>
0x14	ARRAY	<type>[<bounds> [,<bounds>*] ]	Array of <type>

All vectors and arrays are objects (class instances) derived from the abstract class [mscorlib]System.Array. This is a very peculiar class; in fact, it is a construct known as a generic.

Vector encoding is very simple: E_T_SZARRAY followed by the encoding of the underlying type, which can be anything except void. The size of the vector is not part of the encoding. Because arrays and vectors are object references, it is not enough to simply declare an array—you must create an instance of it, using the instruction newarr for a vector or calling an array constructor. It is at that point that the size of the vector or array instance is specified.

Array encoding is more sophisticated:

   E_T_ARRAY<underlying_type><rank><num_sizes><size_1>...<size_N>
            <num_lower_bounds><lower_bound_1>...<lower_bound_M>

where the following is true:

   <underlying_type> cannot be void
   <rank> is the number of array dimensions (K > 0)
   <num_sizes> is the number of specified sizes for dimensions (N <= K)
   <size_n> is an unsigned integer specifying the size (n = 1, ..., N)
   <num_lower_bounds> is the number of specified lower bounds (M <= K)
   <lower_bound_m> is a signed integer specifying the lower bound (m = 1, ..., M)

All the above unsigned integer values are compressed according to the length compression formula discussed in Chapter 4, “Metadata Tables Organization.” To save you a trip three chapters back, I will repeat this formula in Table 7-4.

Table 7-4 The Length Compression Formula for Unsigned Integers
Value Range	Compressed Size	Compressed Value
0–0x7F	1 byte	<value>
0x80–0x3FFF	2 bytes	0x8000 + <value>
0x4000–0x1FFFFFFF	4 bytes	0xC0000000 + <value>
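The formula in Table 7-4 translates almost directly into code. The following Python sketch is one possible rendering; the function name compress_unsigned is hypothetical, and the bytes are written most significant byte first, so that the leading bits of the first byte reveal the compressed size.

```python
def compress_unsigned(value: int) -> bytes:
    """Compress an unsigned integer according to Table 7-4."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return value.to_bytes(1, "big")                 # <value>
    if value <= 0x3FFF:
        return (0x8000 + value).to_bytes(2, "big")      # 0x8000 + <value>
    return (0xC0000000 + value).to_bytes(4, "big")      # 0xC0000000 + <value>


for sample in (0x03, 0x7F, 0x80, 0x3FFF, 0x4000, 0x1FFFFFFF):
    print(f"0x{sample:X} -> {compress_unsigned(sample).hex(' ')}")
```

Values above 0x1FFFFFFF cannot be represented in this scheme, so the sketch rejects them instead of truncating.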

Signed integer values (lower bound values) are compressed according to a different compression procedure. First the signed integer is encoded as an unsigned integer by taking the absolute value of the original integer, shifting it left by 1 bit, and setting the least significant bit according to the most significant (sign) bit of the original value. Then compression is applied according to the formula shown in Table 7-4.
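Here is a minimal Python sketch of the two-step procedure exactly as the paragraph above words it: absolute value shifted left by 1 bit, sign in the least significant bit, then the Table 7-4 compression. The function names are hypothetical, and the sketch has not been checked against real metadata, so treat it as a reading of the description rather than a reference implementation.

```python
def compress_unsigned(value: int) -> bytes:
    """Length compression of Table 7-4, most significant byte first."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return value.to_bytes(1, "big")
    if value <= 0x3FFF:
        return (0x8000 + value).to_bytes(2, "big")
    return (0xC0000000 + value).to_bytes(4, "big")


def signed_to_unsigned(value: int) -> int:
    """Step 1 as described in the text: |value| << 1, with the sign in bit 0."""
    sign_bit = 1 if value < 0 else 0
    return (abs(value) << 1) | sign_bit


def compress_signed(value: int) -> bytes:
    """Step 2: compress the re-encoded value like any unsigned integer."""
    return compress_unsigned(signed_to_unsigned(value))


for bound in (0, 1, -1, 2, -64):
    print(bound, "->", compress_signed(bound).hex(" "))
```

Because the sign occupies the least significant bit, small negative lower bounds stay as compact as small positive ones.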

If size and/or the lower bound for a dimension are not specified, they are not presumed to be 0; rather, they are marked as not specified. The specification of size and lower bound cannot have “holes”—that is, if you have an array of rank 5 and want to specify size (or lower bound) for its third dimension, you must specify size (or lower bound) for the first and second dimensions as well.

An array specification in ILAsm looks like this:

   <type> [ <bounds>[, <bounds>*] ]

where

   <bounds> ::= [<lower_bound>] ... [<upper_bound>]

The following is an example:

   int32[...,...]   // Two-dimensional array with undefined lower bounds
                    // and sizes
   int32[2...5]     // One-dimensional array with lower bound 2 and size 4
   int32[0...,0...] // Two-dimensional array with zero lower bounds
                    // and undefined sizes

If neither lower bound nor upper bound is specified for a dimension in a multidimensional array declaration, the ellipsis can be omitted. Thus int32[...,...] and int32[,] mean the same: a two-dimensional array with no lower bounds or sizes specified.

This omission does not work in the case of single-dimensional arrays, however. The notation int32[ ] indicates a vector (<E_T_SZARRAY><E_T_I4>), and int32[...] indicates an array of rank 1 whose lower bound and size are undefined (<E_T_ARRAY><E_T_I4><1><0><0>).

The common language runtime treats multidimensional arrays and vectors of vectors (of vectors, and so on) completely differently. The specifications int32[,] and int32[ ][ ] result in different type encoding, are created differently, and are laid out differently when created:

int32[,] This specification has the encoding <E_T_ARRAY><E_T_I4><2><0><0>, is created by a single call to an array constructor, and is laid out as a contiguous two-dimensional array of int32.
int32[ ][ ] This specification has the encoding <E_T_SZARRAY><E_T_SZARRAY><E_T_I4>, is created by a series of newarr instructions, and is laid out as a vector of vector references, each pointing to a contiguous vector of int32, with no guarantee regarding the location of each vector. Vectors of vectors are useful for describing jagged arrays, when the size of the second dimension varies depending on the first dimension index.
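To make the difference between the encodings concrete, the following Python sketch builds the byte sequences for a vector, a vector of vectors, and several arrays. The helper names vector_of and array_of are hypothetical. Lower bounds are encoded with the signed procedure as this chapter describes it, which is an assumption carried over from the text rather than something verified against real metadata.

```python
E_T_I4 = 0x08       # int32, from Table 7-1
E_T_ARRAY = 0x14    # from Table 7-3
E_T_SZARRAY = 0x1D  # from Table 7-3


def compress_unsigned(value: int) -> bytes:
    """Length compression of Table 7-4, most significant byte first."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return value.to_bytes(1, "big")
    if value <= 0x3FFF:
        return (0x8000 + value).to_bytes(2, "big")
    return (0xC0000000 + value).to_bytes(4, "big")


def compress_signed(value: int) -> bytes:
    """Lower-bound encoding as the text describes it: |value| << 1, sign in bit 0."""
    return compress_unsigned((abs(value) << 1) | (1 if value < 0 else 0))


def vector_of(element: bytes) -> bytes:
    """E_T_SZARRAY followed by the element type; no size is encoded."""
    return bytes([E_T_SZARRAY]) + element


def array_of(element: bytes, rank: int, sizes=(), lower_bounds=()) -> bytes:
    """E_T_ARRAY, element type, rank, then the specified sizes and lower bounds.

    sizes and lower_bounds cover the leading dimensions only, so the
    specification cannot have holes.
    """
    if rank < 1:
        raise ValueError("rank must be positive")
    if len(sizes) > rank or len(lower_bounds) > rank:
        raise ValueError("more sizes or lower bounds than dimensions")
    encoded = bytes([E_T_ARRAY]) + element + compress_unsigned(rank)
    encoded += compress_unsigned(len(sizes))
    for size in sizes:
        encoded += compress_unsigned(size)
    encoded += compress_unsigned(len(lower_bounds))
    for bound in lower_bounds:
        encoded += compress_signed(bound)
    return encoded


int32 = bytes([E_T_I4])
print(vector_of(int32).hex(" "))                               # int32[ ]
print(vector_of(vector_of(int32)).hex(" "))                    # int32[ ][ ]
print(array_of(int32, 1).hex(" "))                             # int32[...]
print(array_of(int32, 2).hex(" "))                             # int32[,]
print(array_of(int32, 1, sizes=[4], lower_bounds=[2]).hex(" "))  # int32[2...5]
```

Note that the rank-1 array and the vector produce different byte sequences even though both describe a single-dimensional collection of int32.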

Modifiers

Four built-in common language runtime types, described in Table 7-5, do not denote any specific data or pointer type but rather are used as modifiers of data and pointer types. None of these modifiers have a respective .NET Framework type associated.

Table 7-5 Custom Modifiers Defined in the Runtime
Code	Constant Name	ILAsm Notation	Comments
0x1F	CMOD_REQD	modreq( <class_ref> )	Required custom modifier
0x20	CMOD_OPT	modopt( <class_ref> )	Optional custom modifier
0x41	SENTINEL	...	Start of optional arguments in a vararg method call
0x45	PINNED	pinned	Marks a local variable as unmovable by the garbage collector

The modifiers modreq and modopt indicate that the item to which they are attached—an argument, a return type, or a field, for example—must be treated in some special way. These modifiers are followed by TypeDef or TypeRef tokens, and the classes corresponding to these tokens indicate the special way the item is to be handled.

The tokens following modreq and modopt are compressed according to the following algorithm. As you might remember, an uncoded (external) metadata token is a 4-byte unsigned integer, which has the token type in its senior byte and a record index (RID) in its 3 lower bytes. It so happens that the tokens appearing in the signatures and hence requiring compression are of three types only: TypeDef, TypeRef, or TypeSpec. (See “Signatures” later in this chapter for information about TypeSpecs.) Because of that, only 2 bits, rather than a whole byte, are required for the token type: 00 denotes TypeDef, 01 is used for TypeRef, and 10 for TypeSpec. The token compression procedure resembles the procedure used to compress the signed integers: the RID part of the token is shifted left by 2 bits, and the 2-bit type encoding is placed in the least significant bits. The result is compressed just as any unsigned integer would be, according to the formula shown earlier in Table 7-4.
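The token compression procedure can be sketched as follows. The 2-bit tags come from the paragraph above; the token type bytes (0x02 for TypeDef, 0x01 for TypeRef, 0x1B for TypeSpec) are the values I understand CorHdr.h to define and are stated here as an assumption, and the function names are hypothetical.

```python
def compress_unsigned(value: int) -> bytes:
    """Length compression of Table 7-4, most significant byte first."""
    if value < 0 or value > 0x1FFFFFFF:
        raise ValueError("value is outside the compressible range")
    if value <= 0x7F:
        return value.to_bytes(1, "big")
    if value <= 0x3FFF:
        return (0x8000 + value).to_bytes(2, "big")
    return (0xC0000000 + value).to_bytes(4, "big")


# Token type byte -> 2-bit tag. The type bytes are assumed from CorHdr.h.
TYPE_TAGS = {
    0x02: 0b00,  # TypeDef
    0x01: 0b01,  # TypeRef
    0x1B: 0b10,  # TypeSpec
}


def compress_token(token: int) -> bytes:
    """Compress a 4-byte TypeDef, TypeRef, or TypeSpec token for a signature."""
    token_type = (token >> 24) & 0xFF   # senior byte: token type
    rid = token & 0x00FFFFFF            # 3 lower bytes: record index
    if token_type not in TYPE_TAGS:
        raise ValueError("only TypeDef, TypeRef, and TypeSpec tokens appear in signatures")
    return compress_unsigned((rid << 2) | TYPE_TAGS[token_type])


print(compress_token(0x02000012).hex(" "))  # TypeDef, RID 0x12
print(compress_token(0x01000012).hex(" "))  # TypeRef, RID 0x12
print(compress_token(0x01000020).hex(" "))  # TypeRef, RID 0x20
```

A token with a small RID shrinks from 4 bytes to 1, which is the whole point of the exercise.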

The modifiers modreq and modopt are used primarily by tools other than the common language runtime, such as compilers or program analyzers. The modreq modifier indicates that the modifier must be taken into account, whereas modopt indicates that the modifier is optional and can be ignored. The ILAsm compiler does not use these modifiers for its internal purposes.

The only use of the modreq and modopt modifiers recognized by the common language runtime is when these modifiers are applied to return types or parameters of methods subject to managed/unmanaged marshaling. For example, to specify that a managed method must have the cdecl calling convention when it is marshaled as unmanaged, we can use the following modifier attached to the method’s return type:

modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl)

When used in the context of managed/unmanaged marshaling, the modreq and modopt modifiers are equivalent.

Although the modreq and modopt modifiers have no effect on the managed types of the items to which they are attached, signatures with and without these modifiers are considered different. The same is true for signatures differing only in classes referenced by these modifiers.

The sentinel modifier (...) was introduced in Chapter 1, “Simple Sample,” when we analyzed the declaration and calling of methods with a variable-length argument list (vararg methods). (See “Method Declaration.”) A sentinel signifies the beginning of optional arguments supplied for a vararg method call. This modifier can appear in only one context: at the call site, because the optional parameters of a vararg method are not specified when such a method is declared. The runtime treats a sentinel appearing in any other context as an error. The method arguments at the call site can contain only one sentinel, and the sentinel is used only if optional arguments are supplied:

   // Declaration of vararg method, mandatory parameters only:
   .method public static vararg int32 Print(string Format)
   {
      ...
   }

   // Calling vararg method with two optional arguments:
   call vararg int32 Print(string, ..., int32, int32)

   // Calling vararg method without optional arguments:
   call vararg int32 Print(string)

The pinned modifier is applicable to the method’s local variables only. Its use means that the local variable cannot be relocated by the garbage collector and must stay put throughout the method execution. If a local variable is “pinned,” it is safe to convert a managed pointer to this variable to an unmanaged pointer and then to dereference this unmanaged pointer, because the unmanaged pointer is guaranteed to still be valid when it is dereferenced:

   .locals init(int32 A, int32 pinned B, int32* pA, int32* pB)
   ldloca A
   stloc pA      // pA = &A
   ldloca B
   stloc pB      // pB = &B

   ldloc pA
   ldc.i4 123
   stind.i4      // *pA=123: unsafe, A could have been moved
   ldloc pB
   ldc.i4 123
   stind.i4      // *pB=123: safe, B is pinned and cannot move

Native Types

When managed code calls unmanaged methods or exposes managed fields to unmanaged code, it is sometimes necessary to provide specific information about how the managed types should be marshaled to and from the unmanaged types. The unmanaged types recognizable by the common language runtime are referred to as native, and they are listed in CorHdr.h in the enumeration CorNativeType. All constants in this enumeration have names that begin with NATIVE_TYPE_; for purposes of this discussion, I have omitted this part of the names or abbreviated it as N_T_. The same constants are also listed in the .NET Framework class library in the enumeration System.Runtime.InteropServices.UnmanagedType.

Some of the native types are obsolete and are ignored by the runtime interoperability subsystem. But since these native types are not retired altogether, ILAsm must have ways to denote them—and since ILAsm denotes these types, I cannot help but list obsolete types along with others, all of which you’ll find in Table 7-6.

Table 7-6 Native Types Defined in the Runtime
Code	Constant Name	.NET Framework Type Name	ILAsm Notation	Comments
0x01	VOID		void	Obsolete and thus should not be used; recognized by ILAsm but ignored by the runtime interoperability subsystem
0x02	BOOLEAN	Bool	bool	4-byte Boolean value; true = nonzero, false = 0
0x03	I1	I1	int8	Signed 1-byte integer
0x04	U1	U1	unsigned int8	Unsigned 1-byte integer
0x05	I2	I2	int16	Signed 2-byte integer
0x06	U2	U2	unsigned int16	Unsigned 2-byte integer
0x07	I4	I4	int32	Signed 4-byte integer
0x08	U4	U4	unsigned int32	Unsigned 4-byte integer
0x09	I8	I8	int64	Signed 8-byte integer
0x0A	U8	U8	unsigned int64	Unsigned 8-byte integer
0x0B	R4	R4	float32	4-byte floating-point
0x0C	R8	R8	float64	8-byte floating-point
0x0D	SYSCHAR		syschar	Obsolete
0x0E	VARIANT		variant	Obsolete
0x0F	CURRENCY	Currency	currency	Currency value
0x10	PTR		*	Obsolete; use native int
0x11	DECIMAL		decimal	Obsolete
0x12	DATE		date	Obsolete
0x13	BSTR	BStr	bstr	Unicode Visual Basic style string
0x14	LPSTR	LPStr	lpstr	Pointer to a zero-terminated ANSI string
0x15	LPWSTR	LPWStr	lpwstr	Pointer to a zero-terminated Unicode string
0x16	LPTSTR	LPTStr	lptstr	Pointer to a zero-terminated ANSI or Unicode string, depending on platform
0x17	FIXEDSYSSTRING	ByValTStr	fixed sysstring [<size>]	Fixed-system string of size <size> bytes; applicable to field marshaling only
0x18	OBJECTREF		objectref	Obsolete
0x19	IUNKNOWN	IUnknown	iunknown	IUnknown interface pointer
0x1A	IDISPATCH	IDispatch	idispatch	IDispatch interface pointer
0x1B	STRUCT	Struct	struct	C-style structure, for marshaling the formatted managed types
0x1C	INTF	Interface	interface	Interface pointer
0x1D	SAFEARRAY	SafeArray	safearray <variant_type>	Safe array of type <variant_type>
0x1E	FIXEDARRAY	ByValArray	fixed array [<size>]	Fixed-size array, of size <size> bytes
0x1F	INT	IntPtr	int	Signed pointer-size integer
0x20	UINT	UIntPtr	unsigned int	Unsigned pointer-size integer
0x21	NESTEDSTRUCT		nested struct	Obsolete; use struct
0x22	BYVALSTR	VBByRefStr	byvalstr	Visual Basic style string in a fixed-length buffer
0x23	ANSIBSTR	AnsiBStr	ansi bstr	ANSI Visual Basic style string
0x24	TBSTR	TBStr	tbstr	bstr or ansi bstr, depending on the platform
0x25	VARIANTBOOL	VariantBool	variant bool	2-byte Boolean; true = -1, false = 0
0x26	FUNC	FunctionPtr	method	Function pointer
0x28	ASANY	AsAny	as any	Object; type defined at run time
0x2A	ARRAY	LPArray	<n_type> [<sizes>]	Fixed-size array of a native type <n_type>
0x2B	LPSTRUCT	LPStruct	lpstruct	Pointer to a C-style structure
0x2C	CUSTOMMARSHALER	CustomMarshaler	custom (<class_str>, <cookie_str>)	Custom marshaler
0x2D	ERROR	Error	error	Maps int32 to VT_HRESULT

The <sizes> parameter in the ILAsm notation for ARRAY, shown in Table 7-6, can be empty or can be formatted as <size> + <size_param_number>:

   <sizes> ::= <empty>
             | <size>
             | + <size_param_number>
             | <size> + <size_param_number>

If <sizes> is empty, the size of the native array is derived from the size of the managed array being marshaled.

The <size> parameter specifies the native array size in array items. The zero-based method parameter number <size_param_number> indicates which of the method parameters specifies the size of the native array. The total size of the native array is <size> plus the additional size specified by the method parameter that is indicated by <size_param_number>.
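The size rule can be expressed as a short Python sketch. The function native_array_length and its arguments are hypothetical; they only model the arithmetic described above, not the marshaler itself.

```python
def native_array_length(size=None, size_param_number=None, args=()):
    """Return the native array size in items, or None if <sizes> is empty.

    size               -- the constant <size>, if specified
    size_param_number  -- zero-based index of the parameter holding extra size
    args               -- the actual argument values of the method call
    """
    if size is None and size_param_number is None:
        # Empty <sizes>: the size is derived from the managed array itself.
        return None
    total = size if size is not None else 0
    if size_param_number is not None:
        total += args[size_param_number]
    return total


# A call whose second argument (parameter number 1) carries the item count.
call_args = ("buffer", 5)
print(native_array_length(size=10, size_param_number=1, args=call_args))
```

With <size> equal to 10 and parameter number 1 holding 5, the native array has 15 items.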

A custom marshaler declaration (shown in Table 7-6) has two parameters, both of which are quoted strings. The <class_str> parameter is the name of the class representing the custom marshaler, using the string conventions of Reflection.Emit. The <cookie_str> parameter is an argument string (cookie) passed to the custom marshaler at run time. This string identifies the form of the marshaling required, and its notation is specific to the custom marshaler.

Variant Types

Variant types are defined in the enumeration VARENUM in the Wtypes.h file, which is distributed with Microsoft Visual Studio. Not all variant types are applicable as safe array types, according to Wtypes.h, but ILAsm provides notation for all of them nevertheless, as shown in Table 7-7. It might look strange, considering that variant types appear in ILAsm only in the context of safe array specification, but we should not forget that one of ILAsm’s principal applications is the generation of test programs, which contain known, preprogrammed errors.

Table 7-7 Variant Types Defined in the Runtime
Code	Constant Name	Applicable to Safe Array?	ILAsm Notation
0x00	VT_EMPTY	No	<empty>
0x01	VT_NULL	No	null
0x02	VT_I2	Yes	int16
0x03	VT_I4	Yes	int32
0x04	VT_R4	Yes	float32
0x05	VT_R8	Yes	float64
0x06	VT_CY	Yes	currency
0x07	VT_DATE	Yes	date
0x08	VT_BSTR	Yes	bstr
0x09	VT_DISPATCH	Yes	idispatch
0x0A	VT_ERROR	Yes	error
0x0B	VT_BOOL	Yes	bool
0x0C	VT_VARIANT	Yes	variant
0x0D	VT_UNKNOWN	Yes	iunknown
0x0E	VT_DECIMAL	Yes	decimal
0x10	VT_I1	Yes	int8
0x11	VT_UI1	Yes	unsigned int8
0x12	VT_UI2	Yes	unsigned int16
0x13	VT_UI4	Yes	unsigned int32
0x14	VT_I8	No	int64
0x15	VT_UI8	No	unsigned int64
0x16	VT_INT	Yes	int
0x17	VT_UINT	Yes	unsigned int
0x18	VT_VOID	No	void
0x19	VT_HRESULT	No	hresult
0x1A	VT_PTR	No	*
0x1B	VT_SAFEARRAY	No	safearray
0x1C	VT_CARRAY	No	carray
0x1D	VT_USERDEFINED	No	userdefined
0x1E	VT_LPSTR	No	lpstr
0x1F	VT_LPWSTR	No	lpwstr
0x24	VT_RECORD	Yes	record
0x40	VT_FILETIME	No	filetime
0x41	VT_BLOB	No	blob
0x42	VT_STREAM	No	stream
0x43	VT_STORAGE	No	storage
0x44	VT_STREAMED_OBJECT	No	streamed_object
0x45	VT_STORED_OBJECT	No	stored_object
0x46	VT_BLOB_OBJECT	No	blob_object
0x47	VT_CF	No	cf
0x48	VT_CLSID	No	clsid
0x1000	VT_VECTOR	Yes	<v_type> vector
0x2000	VT_ARRAY	Yes	<v_type> [ ]
0x4000	VT_BYREF	Yes	<v_type> &
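The last three entries of Table 7-7 differ from the rest: as I understand VARENUM, they are flags that are combined with a base variant type by bitwise OR rather than types in their own right. The following Python sketch illustrates that reading; the constant values come from the table, while the mask and the function name split_variant_type are introduced only for this example.

```python
VT_I4 = 0x03
VT_BSTR = 0x08
VT_VECTOR = 0x1000
VT_ARRAY = 0x2000
VT_BYREF = 0x4000

FLAG_MASK = VT_VECTOR | VT_ARRAY | VT_BYREF  # hypothetical helper constant


def split_variant_type(vt: int):
    """Separate a combined variant type into (base type, flags)."""
    return vt & ~FLAG_MASK, vt & FLAG_MASK


array_of_int32 = VT_ARRAY | VT_I4     # ILAsm notation: int32 [ ]
byref_bstr = VT_BYREF | VT_BSTR       # ILAsm notation: bstr &

print(hex(array_of_int32), split_variant_type(array_of_int32))
print(hex(byref_bstr), split_variant_type(byref_bstr))
```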