7.3 Register Stacks

Stacks can be implemented in memory or as a bank of registers. Although a stack composed of processor registers can offer significant performance increases, greater cost and small register counts precluded realistic consideration until RISC designs emerged. The reduction in circuitry required to recognize and execute complex instructions released space on a processor chip for other uses, including more on-chip cache and a larger pool of registers.

7.3.1 SPARC® Register Windows

Research in computer architecture at the University of California, Berkeley led to the creation of the SPARC® architecture by Sun® Microsystems.

Early implementations of the SPARC architecture devoted a significant portion of the processor chip to registers. A large bank of general registers is organized into two groups: a set of eight global registers and n sets of 24 registers arranged as register windows. These windows partially overlap, and any particular implementation may offer 2 <= n <= 32 of them.

Any current program context has a 32-element general register address space, but the first eight global registers are shared with all contexts. The remaining 24 register names enumerate eight ins for inputs that may have been received, eight locals for internal use, and eight outs for staging any outgoing parameters.

Figure 7-4 diagrams SPARC register windows for a situation where X is going to call some Y that will in turn call Z. We do not need to specify the context of X; it can be a main program or itself a function.

Figure 7-4. SPARC register windows for three procedures or functions

The N physical registers comprising the dynamic registers for SPARC register windows are treated as a circular buffer. Each window contributes 16 registers of its own (eight locals plus eight that serve both as its ins and as the outs of its caller), so N = 16n. Thus if a SPARC implementation offered only n = 3 windows, we could imagine that the bank of eight logical registers seen as outs for Z would be the same physical storage elements as the ins for X, and N would be 48. The chip would have 56 registers in all, including the globals.
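The overlap can be made concrete with a small model. The following Python sketch is our own illustration, not part of the SPARC definition: it assumes that windows are numbered upward in call order (X = 0, Y = 1, Z = 2) and that each window's ins sit at the start of its slice of the buffer. The names physical_register and windowed_register_count are hypothetical.

```python
# Hypothetical model of SPARC register-window overlap (illustration only).
GROUP_OFFSET = {"in": 0, "local": 8, "out": 16}   # assumed layout within a window


def windowed_register_count(n_windows):
    """Each window adds 8 locals plus 8 registers shared with its neighbor."""
    return 16 * n_windows


def physical_register(window, group, index, n_windows):
    """Return the circular-buffer slot for one logical register of one window."""
    if not 0 <= index < 8:
        raise ValueError("each group holds exactly eight registers")
    slot = 16 * window + GROUP_OFFSET[group] + index
    return slot % windowed_register_count(n_windows)   # wrap around the buffer


if __name__ == "__main__":
    n = 3
    X, Y, Z = 0, 1, 2
    # The outs of a caller are the ins of the function it calls.
    print(physical_register(X, "out", 0, n) == physical_register(Y, "in", 0, n))
    # With only three windows, the outs of Z wrap around onto the ins of X.
    print(physical_register(Z, "out", 0, n) == physical_register(X, "in", 0, n))
```

Because the slot number is reduced modulo the size of the buffer, the model shows both the ordinary caller-to-callee overlap and the wraparound that eventually forces a window to be saved in memory.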

When deeply nested function calls cause the circular buffer to wrap around, the system software must spill a register window into a memory stack. Conversely, when returning functions cause the buffer to wrap back around to a point where register contents would be logically incorrect, the system software must fill a register window from that memory stack.
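A simplified simulation can show when spills and fills occur. In the sketch below, the class name WindowFile and the parameter resident_limit (the number of frames that may live in registers at once) are assumptions made for illustration; real trap handlers, and the exact point at which an implementation signals overflow, differ in detail.

```python
# Hypothetical model of window spill and fill (illustration only).
class WindowFile:
    def __init__(self, resident_limit):
        self.resident_limit = resident_limit
        self.resident = []       # frames held in registers, oldest first
        self.memory_stack = []   # frames spilled to memory, newest last
        self.spills = 0
        self.fills = 0

    def call(self, name):
        """Enter a function; spill the oldest window if none is free."""
        if len(self.resident) == self.resident_limit:
            self.memory_stack.append(self.resident.pop(0))
            self.spills += 1
        self.resident.append(name)

    def ret(self):
        """Leave a function; fill the caller's window if it was spilled."""
        finished = self.resident.pop()
        if not self.resident and self.memory_stack:
            self.resident.append(self.memory_stack.pop())
            self.fills += 1
        return finished


if __name__ == "__main__":
    wf = WindowFile(resident_limit=3)
    for name in ("A", "B", "C", "D"):   # the fourth call forces one spill
        wf.call(name)
    while wf.resident:
        wf.ret()
    print(wf.spills, wf.fills)
```

In this model a call chain no deeper than resident_limit never touches memory, which is the situation register windows are designed to make common.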

The rationale for this scheme comes from the abrupt discontinuity inherent in function calls. Register windows allow parameters to be passed directly between functions, and they prevent register conflicts. The fixed size of the window partitions is a compromise that does not always match the actual needs of a function.

7.3.2 Itanium Register Stack

The designers of the Itanium architecture had the opportunity to observe many years of industry experience with architectures that do (SPARC) and do not (PA-RISC, Alpha, IA-32) offer register stacks. As processor speeds have advanced quite rapidly, especially when compared to memory and bus speeds, it is not surprising that the Itanium architecture employs dual stacks: traditional memory stacks and a register stack using a large pool of registers. Any Itanium implementation must provide 32 static registers and a register stack of at least 96 registers managed by a semi-autonomous register stack engine (RSE).

Unlike typical CISC architectures that couple memory stack operations to machine instructions, such as call and return, the Itanium ISA decouples the memory stack; indeed, the identification of Gr₁₂ as sp is purely a software convention, not a hardware feature.

Like SPARC register windows, the Itanium register stack operates in a way that makes the outs visible to both the caller and the called procedure; the called procedure views the same physical registers with different logical names. The Itanium register stack offers greater flexibility in both the size of a windowed region and its partitioning.

7.3.3 The alloc Instruction

A new stack frame on the Itanium register stack is requested by using the alloc instruction, which has numerous arguments:

 alloc    r1 = ar.pfs,ins,locals,outs,rots

where, conveniently but remarkably, register r1 can be one of the stack registers made available by this instruction itself. This is possible because the alloc instruction adjusts the register stack early, but actually stores into the target register late in its execution. Therefore, the logical number of the target register r1 must reflect the current function's view of numbering after the allocation is made.

The argument ins indicates that the function anticipates receiving input arguments. The assembler defines convenient symbols in0, in1, … in_ins-1 to refer to inputs, although the appropriate general register numbers can also be used. As explained later, convention dictates that ins will not exceed 8, but this does not necessarily reflect or limit the total number of arguments.

The argument locals indicates the number of local storage elements that should be preserved through deeper function calls. Any ins that are no longer needed may be used as additional locals. The assembler defines symbols loc0, loc1, … loc_locals-1 to refer to those local variables. The indirect limitation on the size of locals is that ins + locals + outs must not exceed 96.

The argument outs indicates that the function anticipates passing arguments through deeper function calls. The assembler defines symbols out0, out1, … out_outs-1 to refer to those outputs. As explained later, outs will not exceed 8, by software convention, but this does not necessarily reflect or limit the total number of arguments.

The argument rots indicates that the function anticipates using some stacked registers as rotating registers. The argument rots must be a multiple of 8 and must not exceed ins + locals + outs. In many practical situations, rots will have to be less than ins + locals, lest preserved information be lost. We defer discussing any applications of rotating registers until a later chapter.

The alloc instruction adjusts the sof, sol, and sor regions in a register known as cfm (the current frame marker):

 sof = ins + locals + outs   // size of frame
 sol = ins + locals          // size of locals
 sor = rots/8                // control parameter for rotation

where we see that the CPU lumps ins and locals together into a pool of registers that are in scope just for exclusive use by this one function. The outs are shared in scope between a function and any other function that it may call.
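The arithmetic and the stated limits can be checked with a short Python sketch. The function names frame_marker and stacked_symbols are hypothetical, and we assume that the stacked registers are numbered from 32 upward because the 32 static registers occupy the lower numbers. The sketch models only the three fields discussed here, not the complete frame marker.

```python
# Hypothetical helper that mirrors the alloc arithmetic (illustration only).
STACK_BASE = 32    # assumed number of the first stacked general register
MAX_FRAME = 96     # limit on ins + locals + outs


def frame_marker(ins, locals_, outs, rots=0):
    """Return the sof, sol, and sor values implied by the alloc arguments."""
    if min(ins, locals_, outs, rots) < 0:
        raise ValueError("counts must be non-negative")
    sof = ins + locals_ + outs
    if sof > MAX_FRAME:
        raise ValueError("ins + locals + outs must not exceed 96")
    if rots % 8 != 0 or rots > sof:
        raise ValueError("rots must be a multiple of 8 no larger than the frame")
    return {"sof": sof, "sol": ins + locals_, "sor": rots // 8}


def stacked_symbols(ins, locals_, outs):
    """Map the assembler symbols in0.., loc0.., out0.. to register numbers."""
    names = {}
    number = STACK_BASE
    for prefix, count in (("in", ins), ("loc", locals_), ("out", outs)):
        for i in range(count):
            names[f"{prefix}{i}"] = f"r{number}"
            number += 1
    return names


if __name__ == "__main__":
    print(frame_marker(2, 3, 4))
    print(stacked_symbols(2, 3, 4))
```

For a function with two inputs, three locals, and four outputs, the sketch gives a frame of nine registers, of which the first five are private to the function.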

The Itanium architecture provides one level of backup storage for cfm (the current frame marker) in a field called pfm (previous frame marker) in the architectural register ar.pfs (previous function state). Appendix D.7 details this register and its fields. The br.call and br.ret instructions copy cfm to pfm or vice versa. This significantly reduces the overhead of calling and returning from leaf procedures. Any nonleaf procedure must save the caller's ar.pfs before making any function calls itself. The alloc instruction, which copies ar.pfs to the specified target register, provides a natural and convenient way of preserving context information related to the register stack.
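The interplay of cfm, pfm, and alloc can be traced with a simplified Python model. The class RegisterStackModel and its method names are hypothetical. The model keeps only the sof and sol fields, ignores rotation, the RSE, and physical wraparound, and assumes that a call leaves the called function with a frame consisting of just the caller's outs until it issues its own alloc.

```python
# Hypothetical model of cfm/pfm handling across calls (illustration only).
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    sof: int = 0   # size of frame
    sol: int = 0   # size of ins + locals


class RegisterStackModel:
    def __init__(self):
        self.cfm = Frame()
        self.pfm = Frame()   # the pfm field of ar.pfs
        self.bof = 0         # physical position of logical register 32

    def alloc(self, ins, locals_, outs):
        """Resize the current frame; return the old ar.pfs for safekeeping."""
        saved_pfs = self.pfm
        self.cfm = Frame(sof=ins + locals_ + outs, sol=ins + locals_)
        return saved_pfs

    def br_call(self):
        self.pfm = self.cfm                   # one level of backup
        self.bof += self.cfm.sol              # caller's outs become register 32 onward
        self.cfm = Frame(sof=self.cfm.sof - self.cfm.sol, sol=0)

    def restore_pfs(self, saved_pfs):
        self.pfm = saved_pfs                  # needed in any nonleaf function

    def br_ret(self):
        self.cfm = self.pfm
        self.bof -= self.cfm.sol

    def physical(self, logical):
        return self.bof + (logical - 32)


if __name__ == "__main__":
    m = RegisterStackModel()
    m.alloc(0, 4, 2)                  # X: 4 locals, 2 outs
    x_out0 = m.physical(36)           # out0 of X is logical register 36
    m.br_call()                       # X calls Y
    print(m.physical(32) == x_out0)   # Y sees the same register as number 32
```

A leaf function can return immediately because pfm still describes its caller. A nonleaf function overwrites pfm when it makes its own call, which is why it must keep the value that alloc handed to it and restore it before returning.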

7.3.4 The Register Stack Engine (RSE)

The Itanium register stack manages itself semi-autonomously with just a few inputs from the operating system. Registers are copied to or restored from a backing store, located in memory, whenever the hardware control determines that a new allocation claim or procedure return requires the respective action. This activity occurs asynchronously, decoupled from instruction execution to the greatest extent possible.

The Itanium architecture defines several modes, from "laid back" to aggressive, for making these memory-intensive moves. Early Itanium processors provide only the "enforced lazy" mode, in which writes to and reads from memory are postponed as long as possible.

The register stack engine (RSE) uses dedicated controller logic within the CPU to accomplish copies and reloads, in contrast to the software overhead required for spills and fills of SPARC register windows. The RSE can be thought of as functioning like an I/O device or even a cache structure.

The register stack can be likened to a device that becomes active sporadically, like a network interface or a disk controller. In all modern systems, such external device controllers operate using direct memory access (DMA). When they need to move data to or from memory, they do so across the appropriate bus without involving CPU load and store operations in software. Any impact on the CPU is indirect, in that it might be waiting for data or a bus to become available. The RSE can stall an Itanium processor for similar reasons.

The register stack can also be thought of as a cache for the region of memory that the operating system established as the backing store. The limited number of physical registers in the stack corresponds to the limited size of a cache. The larger number of information units in the backing store permits calling trees to be of arbitrarily large nesting depths. Just as a cache provides transparent access to a memory larger than itself, so too does the RSE provide transparent access to a larger virtual register stack in memory.
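The cache analogy can be explored with a small counting model. The Python sketch below is an assumption-laden illustration, not a description of actual RSE hardware: the class BackingStoreModel is hypothetical, each frame is treated as a fixed number of registers that do not overlap its neighbors, and memory traffic is postponed until it is unavoidable, in the spirit of the lazy mode described above.

```python
# Hypothetical counting model of a lazy register stack engine (illustration only).
class BackingStoreModel:
    def __init__(self, physical=96):
        self.physical = physical   # stacked registers available on the chip
        self.frames = []           # sizes of active frames, outermost first
        self.spilled = 0           # registers of outer frames now held in memory
        self.stores = 0
        self.loads = 0

    def _resident(self):
        return sum(self.frames) - self.spilled

    def call(self, size):
        """Push a frame; store the oldest registers only if space runs out."""
        if size > self.physical:
            raise ValueError("a single frame cannot exceed the physical stack")
        self.frames.append(size)
        excess = self._resident() - self.physical
        if excess > 0:
            self.spilled += excess
            self.stores += excess

    def ret(self):
        """Pop a frame; reload just enough to make the caller's frame whole."""
        self.frames.pop()
        if self.frames:
            missing = self.frames[-1] - self._resident()
            if missing > 0:
                self.spilled -= missing
                self.loads += missing


if __name__ == "__main__":
    rse = BackingStoreModel()
    for _ in range(3):
        rse.call(40)     # the third frame overflows the 96 physical registers
    for _ in range(3):
        rse.ret()
    print(rse.stores, rse.loads)
```

As with a cache, a shallow calling tree that fits within the physical registers generates no memory traffic at all in this model, while deeper trees pay only for the registers that actually overflow.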

7.3.5 Banked Registers

Operating systems must handle interrupts from sources ranging from external device controllers to page faults, which are generated by software requests for virtual memory addresses whose pages are not currently resident in physical memory. As computer systems have become more complex, operating system routines for handling interrupts have become more extensive and time-consuming. Thus interrupt handlers tend to need more registers than ever before.

An interrupt handler must preserve the contents of all registers used or required by the interrupted process. Either the handler must not disturb a register, or it must copy and restore its contents.

In the Itanium architecture, general registers Gr₁₆ through Gr₃₁ are banked registers. Each of these register names is physically present in two locations, bank 0 and bank 1. Ordinarily bank 1 is in use, but an operating system can ensure that an interrupt handler has immediate access to bank 0. Only one level of interrupt response can use bank 0 at any one time (why?).
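A minimal Python sketch can illustrate how one register name resolves to two storage locations. The class BankedRegisters and its select method are hypothetical; the text does not describe the actual mechanism for switching banks, so the sketch simply assumes that some privileged operation chooses the active bank.

```python
# Hypothetical model of banked general registers 16 through 31 (illustration only).
class BankedRegisters:
    FIRST, LAST = 16, 31

    def __init__(self):
        self.banks = [[0] * 16, [0] * 16]   # bank 0 and bank 1
        self.active = 1                     # bank 1 is ordinarily in use

    def select(self, bank):
        """Assumed stand-in for whatever privileged operation switches banks."""
        if bank not in (0, 1):
            raise ValueError("there are only two banks")
        self.active = bank

    def _slot(self, reg):
        if not self.FIRST <= reg <= self.LAST:
            raise ValueError("only registers 16 through 31 are banked")
        return reg - self.FIRST

    def read(self, reg):
        return self.banks[self.active][self._slot(reg)]

    def write(self, reg, value):
        self.banks[self.active][self._slot(reg)] = value


if __name__ == "__main__":
    gr = BankedRegisters()
    gr.write(16, 111)    # the interrupted program's value, in bank 1
    gr.select(0)         # the handler gains bank 0
    gr.write(16, 222)    # the handler's scratch use does not disturb bank 1
    gr.select(1)
    print(gr.read(16))
```

The handler in this model never copies or restores the sixteen banked registers of the interrupted program; it only switches which bank the names refer to.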