Building Managed Code: The Common Type System | Understanding .NET (2nd Edition)

What is a programming language? One way to think about it is as a specific syntax with a set of keywords that can be used to define data and express operations on that data. While language syntaxes differ, the underlying abstractions of most popular languages today are very similar. All of them support various data types such as integers and strings, all allow packaging code into methods, and all provide a way to group data and methods into classes. When a new programming language is defined, the usual approach is to define underlying abstractions such as these (key aspects of the language's semantics) concomitantly with the language's syntax.

A programming language usually defines both syntax and semantics

Yet there are other possibilities. Suppose you choose to define the core abstractions for a programming model without mapping them to any particular syntax. If the abstractions were general enough, they could then be used in many different programming languages. Rather than inextricably mingling syntax and semantics, these two things could be kept separate, allowing different languages to be used with the same set of underlying abstractions. This is exactly what's done in the CLR's Common Type System (CTS). The CTS specifies no particular syntax or keywords, but instead defines a common set of types that can be used with many different language syntaxes. Each language has its own syntax, but if that language is built on the CLR, it will use at least some of the types defined by the CTS.

The Common Type System defines core semantics but not syntax

Types are fundamental to any programming language. One simple but concrete way to think of a type is as a set of rules for interpreting the value stored in some memory location, such as the value of a variable. If that variable has an integer type, for example, the bits stored in it are interpreted as an integer. If the variable has a string type, the bits stored in it are interpreted as characters. To a compiler, of course, a type is more than this. Compilers must also understand the rules that define what kinds of values are allowed for each type and what kinds of operations are legal on these values. Among other things, this knowledge allows a compiler to determine whether a value of a particular type is being used correctly.

Types are an important part of a programming language

The set of types defined by the CTS is at the core of the CLR. Programming languages built on the CLR expose these types in a language-dependent way. (For examples of this, see the descriptions of C# and VB in the next chapter.) While the creator of a CLR-based language is free to implement only a subset of the types defined by the CTS and even to add new types to that language, most languages built on the CLR make extensive use of the CTS-defined types.

CLR-based languages expose CTS types in different ways

Introducing the Common Type System

A substantial subset of the types defined by the CTS is shown in Figure 2-1. The first thing to note is that every type inherits either directly or indirectly from a type called Object. (All of these types are actually contained in the System namespace, as mentioned in Chapter 1, so the complete name for this most fundamental type is System.Object.) The second thing to note is that every type defined by the CTS is either a reference type or a value type. As their names suggest, an instance of a reference type always contains a reference to a value of that type, while an instance of a value type contains the value itself. Reference types inherit from Object, either directly or through other reference types, while all value types inherit from a type called ValueType (enumerations do so through System.Enum), which in turn inherits from Object.

Figure 2-1. The CTS defines reference and value types, all of which inherit from a common Object type.

The CTS defines reference and value types

Value types tend to be simple. The types in this category include Byte, Char, signed and unsigned integers of various lengths, single- and double-precision floating point, Decimal, Boolean, and more. Reference types, by contrast, are typically more complex. As shown in the figure, for instance, Class, Interface, Array, and String are reference types. Yet to understand the difference between value types and reference types, a fundamental distinction in the CTS, you must first understand how memory is allocated for instances of each type. In managed code, values can have their memory allocated in one of two main ways, both managed by the CLR: on the stack or on the heap. Variables allocated on the managed stack are typically created when a method is called or when a running method creates them. In either case, the memory used by stack variables is automatically freed when the method in which they were created returns. Variables allocated on the managed heap, however, don't have their memory freed when the method that created them ends. Instead, the memory used by these variables is freed via a process called garbage collection, a topic that's described in more detail later in this chapter.

Value types are simpler than reference types

A basic difference between value types and reference types is that a standalone instance of a value type is allocated on the stack, while an instance of a reference type has only a reference to its actual value allocated on the stack. The value itself is allocated on the heap. Figure 2-2 shows an abstract picture of how this looks. In the case shown here, three instances of value types (Int16, Char, and Int32) have been created on the managed stack, while one instance of the reference type String exists on the managed heap. Note that even the reference type instance has an entry on the stack (it's a reference to the memory on the heap), but the instance's contents are stored on the heap^[1]. Understanding the distinction between value types and reference types is essential in understanding the CTS type system and, ultimately, the types used by CLR-based languages.

^[1] A reference type might contain a value type, which means that some value type instances exist on the heap.

Figure 2-2. Instances of value types are allocated on the managed stack, while instances of reference types are allocated on the managed heap.

Value types typically live on the stack, while reference types live on the heap
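One visible consequence of this distinction is how assignment behaves. The following is a minimal C# sketch, not taken from the text: PointValue and PointRef are hypothetical types invented for illustration. Assigning a value type instance copies the value itself, while assigning a reference type instance copies only the reference.

```csharp
using System;

// Hypothetical types for illustration; neither appears in the text.
struct PointValue { public int X; }   // value type: the variable holds the value itself
class PointRef { public int X; }      // reference type: the variable holds a reference

static class CopySemantics
{
    // Copies a value type instance, changes the copy, and returns the original's field.
    public static int MutateCopyOfValue()
    {
        PointValue a = new PointValue { X = 1 };
        PointValue b = a;   // the whole value is copied
        b.X = 99;
        return a.X;         // the original is unchanged
    }

    // Copies a reference, changes the object through the copy, and returns the field.
    public static int MutateCopyOfReference()
    {
        PointRef a = new PointRef { X = 1 };
        PointRef b = a;     // only the reference is copied
        b.X = 99;
        return a.X;         // both variables refer to the same heap object
    }

    static void Main()
    {
        Console.WriteLine(MutateCopyOfValue());      // prints 1
        Console.WriteLine(MutateCopyOfReference());  // prints 99
    }
}
```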

A Closer Look at CTS Types

The CTS defines a large set of types. As already described, the most fundamental of these is Object, from which every CTS type inherits directly or indirectly. In the object-oriented world of the CLR, having a common base for all types is useful. For one thing, since everything inherits from the same root type, an instance of this type can potentially contain any value. Object also implements several methods, and since every CTS type inherits from Object, these methods can be called on an instance of any type. Among the methods Object provides are Equals, which determines whether two objects are equal (by default it compares reference types by identity, though a type can override it to compare values), and GetType, which returns the type of the object it's called on.

The root Object type provides methods that are inherited by every other type
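As a rough illustration of what a common root makes possible, the C# sketch below passes values of several unrelated types to one method that accepts Object and calls the inherited GetType and Equals methods. The class and method names (ObjectRoot, TypeNameOf) are assumptions made up for this example.

```csharp
using System;

static class ObjectRoot
{
    // Accepts any value, because every CTS type derives from System.Object.
    public static string TypeNameOf(object value)
    {
        return value.GetType().Name;
    }

    static void Main()
    {
        Console.WriteLine(TypeNameOf(42));          // prints Int32
        Console.WriteLine(TypeNameOf("hello"));     // prints String
        Console.WriteLine(TypeNameOf(new int[3]));  // prints Int32[]
        Console.WriteLine(42.Equals(42));           // prints True
        Console.WriteLine("a".Equals("b"));         // prints False
    }
}
```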

Value Types

All value types inherit from ValueType. Like Object, ValueType provides an Equals method (in fact, it overrides the method defined in Object). Value types cannot act as a parent type for inheritance, however, so it's not possible to, say, define a new type that inherits from Int32. In the jargon of the CLR, value types are said to be sealed.

The ValueType type provides methods that are inherited by every value type

Many of the value types defined by the CTS were shown in Figure 2-1. Defined a bit more completely, those types are as follows:

Byte: An 8-bit unsigned integer.
Char: A 16-bit Unicode character.
Int16, Int32, and Int64: 16-, 32-, and 64-bit signed integers.
UInt16, UInt32, and UInt64: 16-, 32-, and 64-bit unsigned integers.
Single and Double: Single-precision (32-bit) and double-precision (64-bit) floating-point numbers.
Decimal: 128-bit decimal numbers (a 96-bit integer value plus a sign and a scaling factor).
Enum: A way to name a group of values of some integer type. Enumerated types inherit from System.Enum and are used to define types whose values have meaningful names rather than just numbers.
Boolean: True or false.

Value types include Byte, Int32, Boolean, and more
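The sizes in this list can be checked from C#, where `sizeof` on a built-in value type yields its storage size in bytes. The sketch below is illustrative only: the Color enumeration and the helper names are hypothetical. Note that Decimal occupies 128 bits of storage, of which 96 bits hold the integer value.

```csharp
using System;

// Hypothetical enumeration for illustration; it names values of the Byte type.
enum Color : byte { Red = 1, Green = 2, Blue = 3 }

static class ValueTypeSizes
{
    public static int CharBits() { return sizeof(char) * 8; }
    public static int DecimalBits() { return sizeof(decimal) * 8; }

    // True when the boxed argument is an instance of some value type.
    public static bool IsValueType(object boxed) { return boxed is ValueType; }

    static void Main()
    {
        Console.WriteLine(sizeof(byte) * 8);        // prints 8
        Console.WriteLine(CharBits());              // prints 16
        Console.WriteLine(sizeof(long) * 8);        // prints 64
        Console.WriteLine(sizeof(double) * 8);      // prints 64
        Console.WriteLine(DecimalBits());           // prints 128
        Console.WriteLine(byte.MaxValue);           // prints 255
        Console.WriteLine(Color.Green);             // prints Green
        Console.WriteLine((byte)Color.Green);       // prints 2
        Console.WriteLine(IsValueType(Color.Red));  // prints True
        Console.WriteLine(typeof(Color).BaseType);  // prints System.Enum
    }
}
```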

Reference Types

Compared with most value types, the reference types defined by the CTS are relatively complicated. Before describing some of the more important reference types, it's useful to look first at a few elements, officially known as type members, that are common to several types (including both reference and value types). Those elements are as follows:

Methods: Executable code that carries out some kind of operation. Methods can be overloaded, which means that a single type can define two or more methods with the same name. To distinguish among them, each of these identically named methods must differ somehow in its parameter list. Another way to say this is to state that each method must have a unique signature. If a method encounters an error, it can throw an exception, which provides some indication of what has gone wrong.
Fields: A value of some type.
Events: A mechanism for communicating with other types. Each event includes methods for subscribing and unsubscribing and for sending (often referred to as firing) the event to subscribers.
Properties: In effect, a value together with specified methods to read and/or write that value.
Nested types: A type defined inside another type. A common example of this is defining a class that is nested inside another class.

Many CTS types have common type members
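To make these members concrete, here is one way they might look in C#. Everything in this sketch is hypothetical (the Thermostat class, its members, and the MembersDemo helper); it simply puts a field, a property, an event, two overloaded methods, a nested type, and a thrown exception in one place.

```csharp
using System;

// Hypothetical class for illustration only.
class Thermostat
{
    // Nested type: a class defined inside another class.
    public class Reading { public double Degrees; }

    // Field: a value of some type.
    private double current = double.NaN;

    // Event: subscribers are notified each time the property below is set.
    public event EventHandler Changed;

    // Property: a value together with methods to read and write it.
    public double Current
    {
        get { return current; }
        set
        {
            current = value;
            if (Changed != null) Changed(this, EventArgs.Empty);  // fire the event
        }
    }

    // Overloaded methods: same name, different parameter lists (signatures).
    public void Adjust(double delta) { Current = current + delta; }
    public void Adjust(double delta, int times)
    {
        for (int i = 0; i < times; i++) Adjust(delta);
    }

    // A method that throws an exception when it encounters an error.
    public Reading Read()
    {
        if (double.IsNaN(current))
            throw new InvalidOperationException("No reading available.");
        return new Reading { Degrees = current };
    }
}

static class MembersDemo
{
    // Counts how many times the Changed event fires during a few updates.
    public static int CountChanges()
    {
        int changes = 0;
        Thermostat t = new Thermostat();
        t.Changed += (sender, e) => changes++;
        t.Current = 20.0;   // fires once
        t.Adjust(0.5);      // fires once
        t.Adjust(0.5, 3);   // fires three times
        return changes;
    }

    static void Main()
    {
        Console.WriteLine(CountChanges());  // prints 5
    }
}
```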

Type members can be assigned various characteristics. For example, methods, events, and properties can be labeled as abstract, which means that no implementation is supplied; as final, which means that the method, event, or property can't be overridden; or as virtual, which means that exactly which implementation is used can be determined at runtime rather than at compilation. Methods, events, properties, and fields can all be defined as static, which means they are associated with the type itself rather than with any particular instance of that type. (This allows a static method to be invoked on a class without first creating an instance of that class.) Members can also be assigned different accessibilities. For example, a private method can be accessed only from within the type in which it's defined or from another type nested in that type. A method whose accessibility is family, however, can be accessed from within the type in which it's defined and from types that inherit from that type. For even broader use, a method whose accessibility is public can be accessed from any other type.

Type members have characteristics
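These characteristics surface in C# under slightly different keywords: the CTS notion of final corresponds to `sealed` on an overriding member, and family accessibility corresponds to `protected`. The sketch below uses hypothetical Shape and Square classes to show abstract, virtual, final, static, and the three accessibilities together.

```csharp
using System;

// Hypothetical classes for illustration only.
abstract class Shape
{
    private static int created;        // static and private: belongs to the type itself
    protected readonly string name;    // family: visible to this type and its heirs

    protected Shape(string name) { this.name = name; created++; }

    // Static property: readable without creating an instance.
    public static int Created { get { return created; } }

    // Abstract: no implementation is supplied here.
    public abstract double Area();

    // Virtual: the implementation that runs is chosen at runtime.
    public virtual string Describe() { return name + " with area " + Area(); }
}

class Square : Shape
{
    private readonly double side;

    public Square(double side) : base("square") { this.side = side; }

    public override double Area() { return side * side; }

    // Sealed override: the CTS "final" characteristic, so no further overriding.
    public sealed override string Describe()
    {
        return "A " + name + " of side " + side;
    }
}

static class CharacteristicsDemo
{
    public static string DescribeSquare(double side)
    {
        Shape s = new Square(side);   // static type Shape, runtime type Square
        return s.Describe();          // dispatches to Square.Describe
    }

    static void Main()
    {
        Console.WriteLine(DescribeSquare(2));       // prints A square of side 2
        Console.WriteLine(new Square(3).Area());    // prints 9
        Console.WriteLine(Shape.Created >= 2);      // prints True
    }
}
```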

Given this basic understanding of type members, we can now look at reference types themselves. Among the most important are the following:

Class: A CTS class can have methods, events, and properties; it can maintain its state in one or more fields; and it can contain nested types. A class's visibility can be public, which means it's available to any other type, or assembly, which means it's available only to other classes in the same assembly. (Assemblies are described later in this chapter.) Classes have one or more constructors, which are initialization methods that execute when a new instance of this class is created. A class can directly inherit from at most one other class, although it can act as the direct parent for any number of inheriting child classes. In other words, a CTS class supports single but not multiple implementation inheritance. If a class is marked as sealed, however, no other class can inherit from it. A class marked as abstract, by contrast, can't be instantiated and can serve only as the base class (that is, the parent) for another class that inherits from it. A class can also have one or more members marked as abstract, which means the class itself is abstract. If a class inherits from another class, it may override one or more methods, properties, and other type members in its parent by providing an implementation with the same signature. A class can also implement one or more interfaces, described next.
Interface: An interface can include methods, properties, and events. Unlike classes, interfaces do support multiple inheritance, so an interface can inherit from one or more other interfaces simultaneously. An interface doesn't actually implement anything, however. Instead, it provides a way to group type definitions together, leaving the implementation to whatever type supports the interface.
Array: An array is a group of values of the same type. Arrays can have one or more dimensions, and their upper and lower bounds can be set more or less arbitrarily. All arrays inherit from a common System.Array type.
Delegate: A delegate is effectively a pointer to a method. All delegates inherit from a common System.Delegate type, and they're commonly used for event handling and callbacks. Each delegate has a set of associated members called an invocation list. When the delegate is invoked, each member on this list gets called, with each one passed the parameters that the delegate received.

Reference types include Class, Interface, Array, and String
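A few of these reference types can be exercised in a short C# sketch. All of the names here (INamed, IGreeter, Robot, Notify, ReferenceTypesDemo) are hypothetical. The sketch shows an interface inheriting from another interface, a delegate whose invocation list holds two members, and an array created through System.Array with a lower bound other than zero.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interfaces and class for illustration only.
interface INamed { string Name { get; } }
interface IGreeter : INamed { string Greet(); }   // an interface inheriting from an interface

class Robot : IGreeter
{
    public string Name { get { return "R2"; } }
    public string Greet() { return "Hello from " + Name; }
}

// A delegate type: instances refer to methods with this signature.
delegate void Notify(string message);

static class ReferenceTypesDemo
{
    // Builds a delegate with two members on its invocation list and invokes it once.
    public static List<string> InvokeBoth()
    {
        List<string> log = new List<string>();
        Notify notify = m => log.Add("first: " + m);
        notify += m => log.Add("second: " + m);
        notify("ping");   // each member is called with the same argument
        return log;
    }

    // Creates a one-dimensional array of 3 ints whose indexes run from 10 to 12.
    public static int LowerBoundOfShiftedArray()
    {
        Array shifted = Array.CreateInstance(typeof(int), new[] { 3 }, new[] { 10 });
        shifted.SetValue(7, 10);
        return shifted.GetLowerBound(0);
    }

    static void Main()
    {
        IGreeter g = new Robot();
        Console.WriteLine(g.Greet());                         // prints Hello from R2
        Console.WriteLine(string.Join(", ", InvokeBoth()));   // prints first: ping, second: ping
        Console.WriteLine(LowerBoundOfShiftedArray());        // prints 10
    }
}
```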

As the next chapter shows, CLR-based programming languages such as C# and VB construct their own type system on top of the CTS types. Despite their different representations, however, the semantics of these types are essentially the same in C#, VB, and many other CLR-based languages. In fact, providing this foundation of common programming language types is one of the CLR's most important roles.

The core types are the same in CLR-based languages such as C# and VB

Converting Value Types to Reference Types: Boxing

There are cases when an instance of a value type needs to be treated as an instance of a reference type. For example, suppose you'd like to pass an instance of a value type as a parameter to some method, but that parameter is defined to be a reference to a value rather than the value itself. For situations like this, a value type instance can be converted into a reference type instance through a process called boxing.

Boxing transforms an instance of a value type into an instance of a reference type

When a value type instance is boxed, storage is allocated on the heap, and the instance's value is copied into that space. A reference to this storage is placed on the stack, as shown in Figure 2-3. The boxed value is an object, a reference type, that contains the contents of the value type instance. In the figure, the Int32 value 169 shown in Figure 2-2 has been converted to a value of type Object, and its contents have been placed on the heap. A boxed value type instance can also be converted back to its original form, a process called unboxing.

Figure 2-3. Boxing converts a value type instance into an instance of an analogous reference type.

Boxing a value type instance moves its value to the heap

Languages built on the CLR commonly hide the process of boxing, so developers may not need to request this transformation explicitly. Still, boxing has performance implications (doing it takes time, and references to boxed values are a bit slower than references to unboxed values), and boxed values behave somewhat differently than unboxed values. Even though the process usually happens silently, it's worth knowing what's going on.

CLR-based languages can make boxing invisible
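In C#, for instance, boxing happens implicitly whenever a value type instance is assigned to a variable of type object, while unboxing requires an explicit cast. The sketch below, whose class and method names are made up for illustration, shows that the boxed value is an independent copy on the heap, that unboxing must name the original type, and that boxing the same value twice produces two distinct objects.

```csharp
using System;

static class BoxingDemo
{
    // Boxes an Int32, changes the original variable, then unboxes the copy.
    public static int BoxThenChangeOriginal()
    {
        int i = 169;
        object boxed = i;    // boxing: the value is copied to the heap
        i = 0;               // changing the original doesn't affect the boxed copy
        return (int)boxed;   // unboxing: an explicit cast back to Int32
    }

    // Unboxing must use the exact original type; Int32 can't be unboxed as Int64.
    public static bool UnboxToWrongTypeFails()
    {
        object boxed = 169;
        try
        {
            long widened = (long)boxed;   // throws InvalidCastException
            return widened != 169;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    // Each boxing operation allocates a new object, even for equal values.
    public static bool BoxedCopiesAreDistinctObjects()
    {
        int i = 169;
        object a = i;
        object b = i;
        return !object.ReferenceEquals(a, b) && a.Equals(b);
    }

    static void Main()
    {
        Console.WriteLine(BoxThenChangeOriginal());          // prints 169
        Console.WriteLine(UnboxToWrongTypeFails());          // prints True
        Console.WriteLine(BoxedCopiesAreDistinctObjects());  // prints True
    }
}
```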

The Common Language Specification

The CTS defines a large and fairly complex set of types. Not all of them make sense for all languages. Yet one of the key goals of the CLR is to allow creating code in one language, then calling that code from another. Unless both languages support the same types in the same way, doing this is problematic. Still, requiring every language to implement every CTS type would be burdensome to language developers.

The solution to this conundrum is a compromise called the Common Language Specification (CLS). The CLS defines a (large) subset of the CTS that a language must obey if it wishes to interoperate with other CLS-compliant languages. For example, the CLS requires support for most CTS value types, including Boolean, Byte, Char, Decimal, Int16, Int32, Int64, Single, Double, and more. It does not require support, however, for UInt16, UInt32, or UInt64. Similarly, a CTS array is allowed to have its lower bound set at an arbitrary value, while a CLS-compliant array must have a lower bound of zero. There are many more restrictions in the CLS, all of them defined with the same end in mind: allowing effective interoperability among code written in CLR-based languages.

The CLS defines a subset of the CTS to enable cross-language interoperability

One important thing to note about the rules laid down by the CLS is that they apply only to externally visible aspects of a type. A language is free to do anything it wants within its own world, but whatever it exposes to the outside world, and thus potentially to other languages, is constrained by the CLS. Given the goal of cross-language interoperability, this distinction makes perfect sense.
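In C#, a library can opt in to compiler checking of these rules with the System.CLSCompliantAttribute. The following sketch assumes a hypothetical Counter class: it stores its state in a private UInt32 field, which the CLS ignores because the field isn't externally visible, and exposes the count publicly as an Int64. A public member that does use UInt32 is explicitly marked as non-compliant; without that marking, the compiler would be expected to warn about it.

```csharp
using System;

// Ask the compiler to check externally visible members against the CLS.
[assembly: CLSCompliant(true)]

// Hypothetical class for illustration only.
public class Counter
{
    // Private, so not externally visible: the CLS rules don't apply here.
    private uint count;

    public void Increment() { count++; }

    // Externally visible and CLS-compliant: Int64 is part of the CLS.
    public long Count { get { return count; } }

    // Externally visible but typed as UInt32, so it's marked as non-compliant.
    [CLSCompliant(false)]
    public uint RawCount { get { return count; } }
}

static class ClsDemo
{
    static void Main()
    {
        Counter c = new Counter();
        c.Increment();
        c.Increment();
        Console.WriteLine(c.Count);     // prints 2
        Console.WriteLine(c.RawCount);  // prints 2
    }
}
```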