Introduction to Type Systems


This chapter presents most idioms of the CTS using C#, although I do try to identify areas where there's a mismatch between the language's semantics and the CTS's. Since we haven't yet seen Common Intermediate Language (CIL), the language into which all managed programs are compiled (it comes in Chapter 3), using a higher-level programming language such as C# will prove more effective in explaining the concepts, without syntax getting in the way.

As a brief example of the diversity of languages that the CTS supports, consider four, each of which has a publicly available compiler that targets the CLR: C#, C++/CLI, Python, and F#:

  • C# is a (mostly) statically typed, imperative, C-style language. It offers very few features that step outside of the CLR's verifiable type safety, and employs a heavily object-oriented view of the world. C# also offers some interesting functional language features such as first-class functions and their close cousins, closures (see the sketch following this list), and continues to move in this direction with the addition of, for example, type inferencing and lambdas in new versions of the language. This is, at the time of this writing, the most popular programming language on the CLR platform.

  • C++/CLI is an implementation of the C++ language that targets the CLR and compiles to its CIL instruction set. Programmers in this language often step outside of the bounds of verifiable type safety, directly manipulating pointers and memory segments. The compiler does, however, support compilation options to restrict programs to a verifiable subset of the language. The ability to bridge the managed and unmanaged worlds with C++ is amazing, enabling many existing unmanaged programs to be recompiled under the CLR's control, of course with the benefits of garbage collection and (mostly) verifiable IL.

  • Python, like C#, deals with data in an object-oriented fashion. But unlike C#, and much like Visual Basic, it prefers to infer as much as possible and to defer until runtime many decisions that would traditionally be resolved at compile time. Programmers in this language never deal directly with raw memory, and always live inside the safe confines of verifiable type safety. Productivity and ease of programming are often of utmost importance for such dynamic languages, making them amenable to scripting and lightweight program extensions. But they still must produce code that resolves typing and other CLR-related mapping issues somewhere between compile time and runtime. Some say that dynamic languages are the way of the future. Thankfully, the CLR supports them just as well as any other type of language.

  • Lastly, F# is a typed, functional language derived from O'Caml (which is itself derived from Standard ML), offering type inferencing and scripting-like interoperability features. F# certainly exposes a very different syntax to the programmer than, say, C#, VB, or Python; in fact, many programmers with a background in C-style languages might find the syntax quite uncomfortable at first. It offers a mathematical style of type declarations and manipulations, and many other useful features that are more prevalent in functional languages, such as pattern matching. F# is a great language for scientific and mathematical programming.
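To make the first bullet concrete, here is a minimal C# 2.0 sketch of first-class functions and closures using an anonymous method; the delegate type and names are invented purely for illustration:

using System;

class ClosureSketch
{
    // A delegate type gives function values a static type.
    delegate int IntFunc(int x);

    static void Main()
    {
        int amount = 5;

        // The anonymous method captures the local 'amount', forming a
        // closure; the resulting delegate is a first-class value.
        IntFunc addAmount = delegate(int x) { return x + amount; };

        Console.WriteLine(addAmount(10)); // Prints 15
    }
}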

Each of these languages exposes a different view of the type system, sometimes radically and sometimes subtly different, and all compile into abstractions from the same CTS and instructions from the same CIL. Libraries written in one can be consumed from another. A single program can even be composed from multiple parts, each written in whatever language is most appropriate, and combined to form a single managed binary. Also notice that the idea of verification makes it possible to prove type safety, yet to work around entire portions of the CTS when necessary (such as manipulating raw memory pointers in C++). Of course, there are runtime restrictions that may be placed on executing unverifiable code. We'll return to these important topics later in this chapter.

The Importance of Type Safety

Not so long ago, unmanaged assembly, C, and C++ programming were the de facto standard in industry, and types, when present, weren't much more than ways to name memory offsets. For example, a C structure is really just a big sequence of bits with names used to access precise offsets from the base address, that is, fields. References to structures could be made to point at incompatible instances, and data could be indexed into and manipulated freely. C++ was admittedly a huge step in the right direction, but there generally wasn't any runtime system enforcing that memory access followed the type system's rules. In all unmanaged languages, there was a way to get around the illusion of type safety.

This approach to programming has proven to be quite error prone, leading to hard-to-find bugs and a movement toward completely type-safe languages. (To be fair, languages with memory safety were already available before C. LISP, for example, uses a virtual machine and garbage-collected environment similar to the CLR's, but it remains primarily a niche language for AI and academia.) Over time, safe languages and compilers grew in popularity, using static detection to notify developers about operations that could lead to memory errors, for example, unchecked downcasting in C++. Other languages, for example VB and Java, fully employed type safety to increase programmer productivity and the robustness of programs. If things like downcasting were permitted to pass the compiler, the runtime would catch and deal with illegal casts in a controlled manner, for instance by throwing an exception, as the sketch below illustrates. The CLR follows in this spirit.
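Here is a minimal C# sketch of that behavior, with types invented for illustration; the downcast passes the compiler, but the runtime refuses to let it corrupt memory:

using System;

class Animal { }
class Dog : Animal { }
class Cat : Animal { }

class CastSketch
{
    static void Main()
    {
        Animal a = new Cat();

        try
        {
            // Compiles fine, but 'a' refers to a Cat, not a Dog; the CLR
            // checks the cast at runtime and throws rather than proceeding.
            Dog d = (Dog)a;
            Console.WriteLine(d);
        }
        catch (InvalidCastException)
        {
            Console.WriteLine("Illegal cast caught in a controlled manner");
        }
    }
}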

Proving Type Safety

The CLR execution environment takes responsibility for ensuring that type safety is proven prior to executing any code. This safety cannot be subverted by untrusted malicious programs, ensuring that memory corruption is not possible. In particular, it guarantees two things:

  • Memory is only ever accessed in a well-known and controlled manner through typed references. Corrupting memory cannot occur simply through the use of a reference with mismatched offsets into memory, for example, as this would result in an error from the verifier (instead of it blindly trudging forward with the request). Similarly, an instance of one type cannot be accidentally treated as an entirely separate type.

  • All access to memory must go through the type system, meaning that instructions cannot trick the execution engine into executing an operation that results in an improper memory access at runtime. Overflowing a buffer or indexing into an arbitrary location of system memory are just not possible (unless you've uncovered a product bug or intentionally use unsafe, and thus unverifiable, constructs).

Note that these guarantees strictly apply only to verifiable code. By using unverifiable code, you can construct programs that violate these restrictions wholesale. Doing so generally means that your programs won't be able to execute in partial trust without a special policy.
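As a minimal sketch of what stepping outside verifiability looks like, consider C#'s unsafe constructs. The snippet below must be compiled with the /unsafe flag, and the IL it produces fails verification, so it cannot run in partial trust:

using System;

class UnsafeSketch
{
    // 'unsafe' permits raw pointer manipulation; the emitted IL is unverifiable.
    unsafe static void Main()
    {
        int x = 42;
        int* p = &x;          // Take the address of a local directly.
        *p = 43;              // Write through the raw pointer.
        Console.WriteLine(x); // Prints 43
    }
}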

There are also situations where unmanaged interoperability exposed by a trusted library can be tricked into performing incorrect operations. For example, consider a trusted managed API in the Base Class Libraries (BCL) that blindly accepted an integer and passed it to an unmanaged bit of code. If that unmanaged code used the integer to indicate an array bound, a malicious caller could intentionally pass an invalid index to provoke a buffer overflow; a hypothetical sketch follows. It is the responsibility of shared library developers to ensure that such program errors are not present. Verification is discussed throughout this chapter, and partial trust is covered in Chapter 9 on security.
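This sketch is entirely hypothetical; the native library name and entry point are invented, and the point is simply the unvalidated integer flowing across the interop boundary:

using System;
using System.Runtime.InteropServices;

public static class TrustedWrapper
{
    // Hypothetical unmanaged function that writes 'count' bytes into 'buffer'.
    [DllImport("somenative.dll")]
    private static extern void FillBuffer(byte[] buffer, int count);

    // Dangerous: 'count' flows to unmanaged code unchecked, so a malicious
    // caller could pass a count larger than buffer.Length and overflow it.
    public static void Fill(byte[] buffer, int count)
    {
        FillBuffer(buffer, count);
    }

    // Safer: validate the bound before crossing into unmanaged code.
    public static void FillSafely(byte[] buffer, int count)
    {
        if (buffer == null || count < 0 || count > buffer.Length)
            throw new ArgumentOutOfRangeException("count");
        FillBuffer(buffer, count);
    }
}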

An Example of Type-Unsafe Code (in C)

Consider a C program that manipulates some data in an unsafe way, a situation that generally leads either to a memory access violation at runtime or to silent data corruption. An access violation (sometimes just called an AV) happens when protected memory is written to by accident; this is generally more desirable (and more debuggable) than blindly overwriting memory. The following snippet of code clobbers the stack, meaning that the control flow of your program and various bits of data, including the return address for the current function, could be overwritten. It's bad:

#include <stdlib.h>
#include <stdio.h>

void fill_buffer(char*, int, char);

int main()
{
    int x = 10;
    char buffer[16];
    /* ... */
    fill_buffer(buffer, 32, 'a');
    /* ... */
    printf("%d", x);
}

void fill_buffer(char* buffer, int size, char c)
{
    int i;
    for (i = 0; i < size; i++)
    {
        buffer[i] = c;
    }
}

Our main function allocates two items on its stack, an integer x and a 16-character array named buffer. It then passes a pointer to buffer (remember, it's on the stack), and the receiving function fill_buffer proceeds to use the size and character c parameters to fill the buffer with that character. Unfortunately, main passed 32 instead of 16, meaning that we'll write 32 char-sized pieces of data onto the stack, 16 more than we should have. The result can be disastrous. Depending on compiler optimizations, this situation might not be so bad (we could simply overwrite half of x), but it could be horrific if we end up overwriting the return address. It is only possible because we are permitted to access raw memory entirely outside the confines of C's primitive type system.

Static and Dynamic Typing

Type systems are often categorized along a single pivot: static versus dynamic. The reality is that type systems differ quite a bit more than that. Nonetheless, the CTS provides capabilities for both, giving languages the responsibility of choosing how to expose the underlying runtime. There are strong proponents of both styles, although many programmers feel most comfortable somewhere in the middle. Regardless of which style your favorite language adopts, the CLR runs its code in a strongly typed environment. This means that your language can avoid dealing with types at compile time, but it will ultimately end up having to work within the type system constraints of verifiable code. Everything has a type, whether a language designer surfaces this to users or not.

Let's take a brief look at some of the user-visible differences between static and dynamic languages. Much of this particular section isn't strictly CTS-related, but it can be helpful when trying to understand what's going on inside the execution engine. Feel free to skim through it the first time you read this chapter, especially if the CLR is entirely foreign to you.

Key Differences in Typing Strategies

Static typing seeks to prove program safety at compile time, thus eliminating a whole category of runtime failures to do with type mismatches and memory access violations. C# programs are mostly statically typed, although some features, such as runtime-checked downcasts, enable you to relax or avoid static typing in favor of dynamism. In such cases, the runtime ensures that types are compatible at runtime. Other examples of statically typed languages include Java, Haskell, Standard ML, and F#. C++ is very much like C# in that it uses a great deal of static typing, although there are several areas that can cause failures at runtime, notably in the area of type-unsafe memory manipulation, as is the case with old-style C.

Some people feel that static typing forces a more verbose and less explorative programming style onto the programmer. Type declarations are often littered throughout programs, for instance, even in cases where a more intelligent compiler could infer them. The benefit, of course, is finding more errors at compile time, but in some scenarios the restriction of having to play the "beat the compiler" game is simply too great. Dynamic languages defer to runtime many of the correctness checks that static languages perform at compile time. Some take this to the extreme and defer all checks until runtime, while others employ a mixture of static and dynamic checking. Languages like VB, Python, Common LISP, Scheme, Perl, and Ruby fall into this category.

A lot of people refer to strongly and weakly typed programs, and to early- and late-bound programming. Unfortunately, this terminology is seldom used consistently. Generally speaking, strong typing means that programs must interact with the type system in a sound manner when accessing memory. By this definition, we've already established that the CTS is a strongly typed execution environment. Late binding is a form of dynamic programming in which the exact type and target operation are not bound until runtime. Most programs bind to a precise metadata token directly in the IL; dynamic languages perform this binding very late, that is, just before dispatching a method call.

One Platform to Rule Them All

The CLR supports the entire spectrum of languages, from static to dynamic and everywhere in between. The Framework itself in fact provides an entire library for doing late-bound, dynamic programming, called reflection (see Chapter 14 for a detailed discussion). Reflection exposes the entire CTS through a set of APIs in the System.Reflection namespace, offering functionality that helps compiler authors implement dynamic languages and enables everyday developers to exploit some of the power of dynamic programming.
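As a minimal sketch of what late binding looks like through reflection, the following C# resolves and invokes Console.WriteLine at runtime, rather than binding to a metadata token at compile time:

using System;
using System.Reflection;

class LateBoundSketch
{
    static void Main()
    {
        // Resolve the target type and method at runtime.
        Type console = typeof(Console);
        MethodInfo writeLine = console.GetMethod(
            "WriteLine", new Type[] { typeof(string) });

        // Late-bound dispatch: no direct call instruction in the IL.
        writeLine.Invoke(null, new object[] { "Hello, late-bound world" });
    }
}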

Examples from the Language Spectrum

Let's take a brief look at some example languages from the spectrum. You'll find below five small programs, each printing out the 10th element of the Fibonacci series (an interesting, well-known algorithm, the naïve implementation of which is shown). Two of these examples are written in statically typed languages (C# and F#), one in a language in between (VB), and two in dynamically typed languages (Python and Scheme, a dialect of LISP). The primary differences you will notice immediately are stylistic. But one deeply ingrained difference is whether the IL they emit is typed or instead relies on dynamic type checking and binding. We'll examine what this means shortly.

C#

using System;

class Program
{
    static int Fibonacci(int x)
    {
        if (x <= 1)
            return 1;
        return Fibonacci(x - 1) + Fibonacci(x - 2);
    }

    static void Main()
    {
        Console.WriteLine(Fibonacci(10));
    }
}

F#

let rec fibonacci x =
    match x with
        0 -> 1
      | 1 -> 1
      | n -> fibonacci(x - 1) + fibonacci(x - 2);;

fibonacci 10;;

VB

Option Explicit Off

Class Program
    Shared Function Fibonacci(x)
        If (x <= 1) Then
            Return 1
        End If
        Return Fibonacci(x - 1) + Fibonacci(x - 2)
    End Function

    Shared Sub Main()
        Console.WriteLine(Fibonacci(10))
    End Sub
End Class

Python

def fib(i):
    if i <= 1:
        return 1
    return fib(i-1) + fib(i-2)

print fib(10)

Scheme

(letrec ((fib (lambda (x)
                (if (<= x 1)
                    1
                    (+ (fib (- x 1)) (fib (- x 2)))))))
  (fib 10))

Type Names Everywhere!

You'll notice that the C# version is the only one that mentions we're working with 32-bit int values. These are static type annotations, needed for the compiler to prove type soundness at compile time. Many static languages, F# among them, use a technique called type inferencing to avoid the need for annotations wherever types can be deduced, for example from the use of literals. F# actually emits IL that works with ints in this example, although we never specified that in the source code; in other words, it infers the type of a variable by examining its usage. Languages that infer types ordinarily still require annotations where a type can't be inferred solely from its usage.

A type-inferencing language can easily figure out that some variable x refers to a String by examining an assignment statement such as x = "Hello, World". In this overly simplistic case, there would be no need to declare its type, yet the program's type safety would remain intact. The Fibonacci function and F#'s treatment of it is a perfect example of where type inferencing can help out. More complex cases break down quickly, for example when passing data across the boundary of separately compiled units.
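Although it postdates the .NET Framework 2.0 timeframe of this book, later versions of C# (3.0 and beyond) perform exactly this kind of local inference with the var keyword; a minimal sketch:

using System;

class InferenceSketch
{
    static void Main()
    {
        // The compiler infers System.String from the initializer; x remains
        // statically typed, and its members are verified at compile time.
        var x = "Hello, World";
        Console.WriteLine(x.Length); // Prints 12
    }
}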

The other languages emit code that works with Object (as we'll see shortly, this is the root of all type hierarchies) and choose to bind at runtime. They do so by emitting calls into their own runtime libraries. Clearly, the performance of statically typed programs will often win out over dynamic ones, simply because they can emit raw IL instructions instead of relying on additional function calls to, for example, late-binding libraries.

Compiler Availability

You might be wondering whether you can actually run the examples above on the CLR. The good news is that, with the exception of plain C, you can! C#, VB, and C++ all ship as part of the .NET Framework 2.0 release and Visual Studio 2005. F# can be downloaded from Microsoft Research at http://research.microsoft.com/downloads. A shared source implementation of Python on the CLR can be downloaded at http://workspaces.gotdotnet.com/ironpython. And lastly, a Scheme implementation developed and used for university coursework at Northeastern University is available at www.ccs.neu.edu/home/will/Larceny/CommonLarceny.

Full coverage of type systems, how they differ, and the pros and cons of various design choices is well outside the scope of this book. The topic is interesting nonetheless; please refer to the "Further Reading" section at the end of this chapter for more resources.



