13.1 Introduction

< Free Open Study >

Nearly every programming language^[1] provides some form of assignment operation that changes the contents of a previously allocated piece of storage. ^[2] In some languages-notably ML and its relatives-the mechanisms for name-binding and those for assignment are kept separate. We can have a variable x whose value is the number 5, or a variable y whose value is a reference (or pointer) to a mutable cell whose current contents is 5, and the difference is visible to the programmer. We can add x to another number, but not assign to it. We can use y directly to assign a new value to the cell that it points to (by writing y:=84), but we cannot use it directly as an argument to plus. Instead, we must explicitly dereference it, writing !y to obtain its current contents. In most other languages-in particular, in all members of the C family, including Java-every variable name refers to a mutable cell, and the operation of dereferencing a variable to obtain its current contents is implicit.^[3]

Figure 13-1: References

For purposes of formal study, it is useful to keep these mechanisms separate;^[4] our development in this chapter will closely follow ML's model. Applying the lessons learned here to C-like languages is a straightforward matter of collapsing some distinctions and rendering certain operations such as dereferencing implicit instead of explicit.

Basics

The basic operations on references are allocation, dereferencing, and assignment. To allocate a reference, we use the ref operator, providing an initial value for the new cell.

   r = ref 5; ▸ r : Ref Nat

The response from the typechecker indicates that the value of r is a reference to a cell that will always contain a number. To read the current value of this cell, we use the dereferencing operator !.

   !r; ▸ 5 : Nat

To change the value stored in the cell, we use the assignment operator.

   r := 7; ▸ unit : Unit

(The result the assignment is the trivial unit value; see §11.2.) If we dereference r again, we see the updated value.

   !r; ▸ 7 : Nat

Side Effects and Sequencing

The fact that the result of an assignment expression is the trivial value unit fits nicely with the sequencing notation defined in §11.3, allowing us to write

   (r:=succ(!r); !r); ▸ 8 : Nat

instead of the equivalent, but more cumbersome,

   (λ_:Unit. !r) (r := succ(!r)); ▸ 9 : Nat

to evaluate two expressions in order and return the value of the second. Restricting the type of the first expression to Unit helps the typechecker to catch some silly errors by permitting us to throw away the first value only if it is really guaranteed to be trivial.

Notice that, if the second expression is also an assignment, then the type of the whole sequence will be Unit, so we can validly place it to the left of another ; to build longer sequences of assignments:

   (r:=succ(!r); r:=succ(!r); r:=succ(!r); r:=succ(!r); !r); ▸ 13 : Nat

References and Aliasing

It is important to bear in mind the difference between the reference that is bound to r and the cell in the store that is pointed to by this reference.

If we make a copy of r, for example by binding its value to another variable s,

   s = r; ▸ s : Ref Nat

what gets copied is only the reference (the arrow in the diagram), not the cell:

We can verify this by assigning a new value into s

   s := 82; ▸ unit : Unit

and reading it out via r:

   !r; ▸ 82 : Nat

The references r and s are said to be aliases for the same cell.

13.1.1 Exercise [⋆]

Draw a similar diagram showing the effects of evaluating the expressions a = {ref 0, ref 0} and b = (λx:Ref Nat. {x,x}) (ref 0).

Shared State

The possibility of aliasing can make programs with references quite tricky to reason about. For example, the expression (r:=1; r:=!s), which assigns 1 to r and then immediately overwrites it with s's current value, has exactly the same effect as the single assignment r:=!s, unless we write it in a context where r and s are aliases for the same cell.

Of course, aliasing is also a large part of what makes references useful. In particular, it allows us to set up "implicit communication channels"-shared state-between different parts of a program. For example, suppose we define a reference cell and two functions that manipulate its contents:

   c = ref 0; ▸ c : Ref Nat

   incc = λx:Unit. (c := succ (!c); !c); ▸ incc : Unit → Nat   decc = λx:Unit. (c := pred (!c); !c); ▸ decc : Unit → Nat

Calling incc

   incc unit; ▸ 1 : Nat

results in changes to c that can be observed by calling decc:

   decc unit; ▸ 0 : Nat

If we package incc and decc together into a record

   o = {i = incc, d = decc}; ▸ o : {i:Unit→Nat, d:Unit→Nat}

then we can pass this whole structure around as a unit and use its components to perform incrementing and decrementing operations on the shared piece of state in c. In effect, we have constructed a simple kind of object. This idea is developed in detail in Chapter 18.

References to Compound Types

A reference cell need not contain just a number: the primitives above allow us to create references to values of any type, including functions. For example, we can use references to functions to give a (not very efficient) implementation of arrays of numbers, as follows. Write NatArray for the type Ref (Nat→Nat).

   NatArray = Ref (Nat→Nat);

To build a new array, we allocate a reference cell and fill it with a function that, when given an index, always returns 0.

   newarray = λ_:Unit. ref (λn:Nat.0); ▸ newarray : Unit → NatArray

To look up an element of an array, we simply apply the function to the desired index.

   lookup = λa:NatArray. λn:Nat. (!a) n; ▸ lookup : NatArray → Nat → Nat

The interesting part of the encoding is the update function. It takes an array, an index, and a new value to be stored at that index, and does its job by creating (and storing in the reference) a new function that, when it is asked for the value at this very index, returns the new value that was given to update, and on all other indices passes the lookup to the function that was previously stored in the reference.

   update = λa:NatArray. λm:Nat. λv:Nat.              let oldf = !a in              a := (λn:Nat. if equal m n then v else oldf n); ▸ update : NatArray → Nat → Nat → Unit

13.1.2 Exercise [⋆⋆]

If we defined update more compactly like this

   update = λa:NatArray. λm:Nat. λv:Nat.              a := (λn:Nat. if equal m n then v else (!a) n);

would it behave the same?

References to values containing other references can also be very useful, allowing us to define data structures such as mutable lists and trees. (Such structures generally also involve recursive types, which we introduce in Chapter 20.)

Garbage Collection

A last issue that we should mention before we move on formalizing references is storage deallocation. We have not provided any primitives for freeing reference cells when they are no longer needed. Instead, like many modern languages (including ML and Java) we rely on the run-time system to perform garbage collection, collecting and reusing cells that can no longer be reachedby the program. This is not just a question of taste in language design: it is extremely difficult to achieve type safety in the presence of an explicit deallocation operation. The reason for this is the familiar dangling reference problem: we allocate a cell holding a number, save a reference to it in some data structure, use it for a while, then deallocate it and allocate a new cell holding a boolean, possibly reusing the same storage. Now we can have two names for the same storage cell-one with type Ref Nat and the other with type Ref Bool.

13.1.3 Exercise [⋆⋆]

Show how this can lead to a violation of type safety.

^[1]Even "purely functional" languages such as Haskell, via extensions such as monads.

^[2]The system studied in this chapter is the simply typed lambda-calculus with Unit and references (Figure 13-1). The associated OCaml implementation is fullref.

^[3]Strictly speaking, most variables of type T in C or Java should actually be thought of as pointers to cells holding values of type Option(T), reflecting the fact that the contents of a variable can be either a proper value or the special value null.

^[4]There are also good arguments that this separation is desirable from the perspective of language design. Making the use of mutable cells an explicit choice rather than the default encourages a mostly functional programming style where references are used sparingly; this practice tends to make programs significantly easier to write, maintain, and reason about, especially in the presence of features like concurrency.

< Free Open Study >