7.1 Memory Stacks | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

Operating systems establish at least one stack in memory to support standard operations such as input and output. Most instruction sets and programming environments permit a programmer to establish other stacks in memory.

7.1.1 Stack Addressing for CISC Architectures

Many CISC architectures support auto-indexed addressing modes that make possible a very easy and fully general method for "pushing" and "popping" individual items onto and then off a stack. Items on a stack are accessed with a stack pointer (SP), which may be either purpose-built or a general register.

Two conventions exist for organizing a stack. In one, autoincrementing of the stack pointer is associated with pushing, and autodecrementing with popping. With the first convention, the stack origin is at a low initial address; the stack grows in the direction of increasing memory addresses and shrinks back toward its origin as cells are released. This resembles the way in which we might manage a stack of paperwork, where the latest item on top is the first item taken off. The opposite convention has the stack origin at a high initial address; the stack grows in the direction of decreasing memory addresses and shrinks back towards its origin.

Classic CISC architectures use the main memory stack to hold the preserved state and return address of the calling routine when a subroutine is invoked. RISC and EPIC architectures may use different methods.

7.1.2 Stack Addressing for Load/Store Architectures

In a load/store architecture, pointer adjustment and data movement usually require separate instructions, creating an "economy of scale" where the SP is modified to reserve or relinquish several stack cells all at once. For example, the Alpha ISA only uses quad word access into the stack. If an Alpha routine needs three quad words of local storage, the following scheme might be appropriate:

 lda     SP,-3*8(SP)       ; Reserve 3 quad words stq     Rx,2*8(SP)        ; Most deeply buried item stq     Ry,1*8(SP)        ; Next-most deeply buried ... ldq     Rq,2*8(SP)        ; Get copy of original Rx ldq     Rt,0(SP)          ; Read surface element on stack ... lda     SP,3*8(SP)        ; Pop 3 quad words

where the Alpha instruction lda changes the value of an address pointer, and the store and load instructions use displacement addressing (Section 4.9.2).

Only the quad words reserved and released by the first and last lda instructions are logically accessible, as shown in Figure 7-1. Values at higher addresses are out of scope: They do not "belong" to this portion of code, but might constitute valuable data belonging to a calling routine. Values at lower addresses are unknown, as they too are out of scope.

Figure 7-1. Addressing quad word values stored on a stack

graphics/07fig01.gif

7.1.3 Stack Addressing for Itanium Architecture

By convention, programs for Itanium systems use register Gr₁₂ as the principal stack pointer; assemblers recognize sp as a synonym for this specific register. When a program is loaded, sp is initialized to point to an allotment of addresses for the memory stack.

The programming conventions require 16-byte alignment for the stack pointer, with the lowest four bits of sp always 0. As a corollary, one individual quad word of stack space (or any odd multiple, such as three, in the Alpha illustration above) is not to be specifically claimed. In addition, a software routine generally modifies sp only as part of the prologue, and resets sp as part of the exit sequence of the procedure.

An Itanium procedure that needs no more than 16 bytes of stack storage may use a 16-byte scratch area that is established by the calling procedure. Any greater need for stack storage requires establishment of a procedure frame by decrementing sp by the frame size in the prologue. Figure 7-2 shows the generic structure of a procedure frame.

Figure 7-2. An Itanium procedure frame

graphics/07fig02.gif

The procedure frame has five regions. The local storage, at the highest addresses, is contiguous with the 16-byte scratch area established by the previous procedure that can be used as part of local storage. Local storage is where a language such as C places automatic variables i.e., variables with local scope that are recreated every time the procedure is called. The procedure can also spill (save) the contents of registers here.

The dynamic allocation region is frequently of zero length, but it represents where a high-level language may create a variable region of storage whose size cannot be determined at compile time. If this region is expanded, the three regions at lower addresses, if not empty, must be relocated.

The frame marker region, while seemingly optional, is used to store context-specific information used for software error recovery, which the Itanium architecture refers to as unwinding.

The outgoing parameters region is frequently of zero length. While the first few parameters for called functions are passed in registers, additional parameters are accommodated in this stack region. Finally, another 16-byte scratch area must be included in the new frame size, maintaining a scratch area when sp is decremented.

The frame size must always be a multiple of 16 bytes. Everything put on the stack by this procedure has an address that can be computed using the appropriate negative offset from the previous stack pointer, or a positive offset from the current sp. Compilers frequently choose negative offsets, as what is already in memory is invariant no matter how much additional space is allocated dynamically.

Notice that the outgoing parameters from the calling procedure are adjacent to the scratch area provided for the called procedure, which can thus access the scratch area and reach upward to the incoming parameters using positive offsets from the previous stack pointer.

The programs in this book, like relatively simple "leaf" procedures that call no other functions, tend to have modest stack requirements sometimes met by the 16-byte scratch area. Procedures that are more complicated can make a single request for stack space in the prologue, calculating the desired frame size, rounding up to 16 bytes, and adding 16 bytes more for the scratch area. When declaring a larger frame size, the programmer must also save the previous stack pointer in the prologue and restore it in the epilogue.

7.1.4 User-Defined Stacks

For some algorithms, the provided stack mechanism may not be the most satisfactory choice. The programmer or compiler can establish any number or type of stacks in the data segment using general registers as stack pointers. These user-defined stacks also lend themselves to academic exploration of stack theory. In these cases, sp is manipulated only to establish the required procedure frame.

Emulating the auto-indexed stack pointers of CISC architectures is relatively easy and quite useful. Suppose we define a region in the data segment starting at address STACK and of adequate capacity:

 init:    movl    r11 = STACK      // r11 -> first to use ... [begin a zone that forbids jumps in or out] push:    st8     [r11] = rt,8;;   // Push contents of rt          st8     [r11] = rn,8;;   // Push next ... 1stpop:  add     r11 = -8,r11;;   // Adjust for 1st pop          ld8     rx = [r11],-8;;  // Pop one item ...          ld8     ry = [r11]       // Last pop [end of zone that forbids jumps in or out] ...

where the adjustment of the user stack pointer r11 makes this stack build upward to higher memory addresses and recede to lower addresses. This method requires a one-time adjustment of the stack pointer before beginning to pop any stored items. If only one generic pop instruction is used by a loop, the postdecrement of the last iteration would leave the stack pointer at STACK-8. A fix-up addition should occur after the loop if this stack region is going to be reused. Alternatively, it may be possible to use an instruction that does not adjust the pointer after the last pop, as shown here.

It is also possible to make the stack operate the other way. Suppose we define a region in the data segment whose last usable quad word address is LAST. Then consider the following code sequence:

 init:    movl    r11 = LAST       // r11 -> first to use ... [begin a zone that forbids jumps in or out] push:    st8     [r11] = rt,-8;;  // Push contents of rt          st8     [r11] = rn,-8;;  // Push next ... 1stpop:  add     r11 = 8,r11;;    // Adjust for 1st pop          ld8     rx = [r11],8;;   // Pop one item ...          ld8     ry = [r11]       // Last pop [end of zone that forbids jumps in or out] ...

where the adjustment of the user stack pointer r11 makes this stack build downward to lower memory addresses and recede to higher addresses. This method requires a one-time adjustment of the stack pointer before beginning to pop any stored items from the stack. If only one generic pop instruction is used by a loop, the post-increment on the last iteration would leave the stack pointer at LAST+8. A fix-up decrement should occur after the loop if this stack region is going to be reused. Alternatively, it may be possible to use an instruction that does not adjust the pointer after the last pop, as shown here.

We must emphasize that push and pop operations must always be matched pairs. It is important to prevent any branches into or out of the block of code that manages a stack. A stack of unknown status can lead to "memory leaks" and other unpredictable results.

This simple stack implementation is not as versatile as those in CISC architectures that automatically increment and decrement the stack pointer according to the varying size of each item stored. Our implementation fixes the size of each element, in this case to 8 bytes (note the use of ld8 and st8), which meets the needs of most algorithmic uses and demonstrations of stack principles.