7.5 Calling Conventions

Software development requires standardization, not only to minimize confusion and maximize understandability, but also to reduce redundant operations when saving and restoring the contents of registers. The use of libraries of subroutines, while increasing programmer efficiency, has the potential to decrease machine performance.

Should a subroutine conservatively save every register that it uses? That may be wasteful if the calling program did not actually require preservation of the contents of those registers for its own functioning. Should a caller conservatively save copies of the contents from all those registers? That may also be wasteful if the called routine did not need all those registers. The negative side effects of subroutine use are partially mitigated by standardized conventions for register use, and for call and return sequences.

7.5.1 Register Contention and Conventions

Itanium architecture specifies an unprecedented number of processor registers, with just a few that are inaccessible to application programs. When one is working at the register level, the compiler and other tools of the programming environment have major roles in preventing contention over register use. Appendix D outlines the categories of Itanium registers and conventions for their use in nonprivileged (i.e., user-level) programs. Only through strict adherence to those conventions can one build segmented programs that utilize system resources properly.

Some of the conventions reflect distinctions at the hardware level. One Itanium general register and two floating-point registers contain constant values. The first 32 general registers are global in scope, while the remainder are stacked registers managed by the register stack engine. The stacked registers, along with the predicate registers and higher-numbered floating-point registers, can be used as rotating registers.

Any preserved registers must be saved and restored if they are also used by a called procedure. Any scratch registers may be freely used by a procedure without concern for their prior contents. Several of the general registers are allocated for contextual information (e.g., the principal stack pointer and the global pointer). Certain other general registers and floating-point registers have standardized uses for communicating arguments or computed function values between calling and called procedures.

Where registers are saved depends on arbitrary conventions defined for each specific architecture, programming environment, and operating system. We describe in this book the Linux and HP-UX environments for assembly language, C, and FORTRAN. The conventions are similar for the Windows environment described by Triebel, Bissell, and Booth. It is possible for a different environment (e.g., a port of OpenVMS) to use different conventions, as long as they are well documented for programmers.

Let us consider two examples already seen in the I/O example at the end of Chapter 6. The caller must save gp (Gr₁, the global pointer), while the called function must save sp (Gr₁₂, the stack pointer). This division of responsibility may seem arbitrary, but most conventions do have a rationale. In this particular instance, the gp may be altered by loader code inserted by the linker to facilitate calls to a function, especially system functions. The convention ensures that the called function will have a valid gp for locating its own data. On the other hand, a simple function not needing stack space may not use sp at all.

At a deeper level, the operating system environment respects not only the contextual information but also all of the temporary storage of a user process. Even information in scratch registers must not be overwritten when the operating system interrupts and swaps out a user process to accommodate other processes. This distinction is important: Procedure calls occur synchronously at "convenient" times and should preserve all agreed-upon items, while system routines that respond to asynchronous events have no way to evaluate the "importance" of information and thus are obliged to respect and save everything, every time.

7.5.2 Call and Return Branch Instructions

The designation of instruction names for an ISA is quite arbitrary. Many architectures have explicit "call" and "return" instructions with associated opcodes. RISC architectures tend to implement that functionality as special variants of the branch class of instructions at the opcode and subcode level, while perhaps retaining the classic mnemonic names.

The Itanium architecture follows RISC principles in building its call and return instructions as subvariants of its generic branch instruction, naming them with completers:

 br.call.bwh.ph.dh b1 = target25    // IP-relative
 br.call.bwh.ph.dh b1 = b2          // indirect
 br.ret.bwh.ph.dh b2                // indirect only

where the three hints bwh (branch whether), ph (prefetch), and dh (cache deallocation) are the same as for the ordinary branch instructions (Section 5.3.1). Call and return target addresses must be aligned with an instruction bundle. Calls and returns may also be predicated.

Calls

Several actions occur with a br.call instruction. A return address, computed as the current IP+16, is placed into branch register b1. The current frame marker, the epilog counter, and the current privilege level are all copied into the ar.pfs register (see Appendix D.7 for details).

The register stack is adjusted so that the called procedure, while initially having the same outs as the caller, has no ins, locals, or rots of its own. A set of outs can be passed along a calling chain without being replicated, until a procedure at some level invokes an alloc instruction to claim those outs as its own ins.

Finally, the IP is set to begin the fetching of instructions for the called procedure: The sign-extended target25 offset within the instruction is added to the current IP, or the value previously established in the indirect branch register b2 is copied into the IP.
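
The address arithmetic of an IP-relative call can be sketched in a few lines of Python. This is only a model under stated assumptions: the function names are hypothetical, the displacement is treated as a signed 25-bit byte offset as described above, and both addresses are assumed to be bundle-aligned.

```python
BUNDLE_BYTES = 16  # an Itanium instruction bundle occupies 16 bytes


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def br_call_relative(ip, target25):
    """Model an IP-relative br.call; return (return_address, new_ip)."""
    if ip % BUNDLE_BYTES or target25 % BUNDLE_BYTES:
        raise ValueError("call addresses must be bundle-aligned")
    return_address = ip + BUNDLE_BYTES        # value placed into b1
    new_ip = ip + sign_extend(target25, 25)   # where instruction fetch resumes
    return return_address, new_ip
```

For instance, `br_call_relative(0x4000000000000520, 0x40)` yields a return address one bundle past the call and a new IP four bundles past it, while an offset of `0x1FFFFF0` sign-extends to -16 and branches back one bundle.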

Returns

For the br.ret instruction, the IP is set to the value in the indirect branch register b2. The current frame marker, the epilog counter, and the current privilege level are restored from ar.pfs. The caller's register stack frame is restored to the state it had just prior to the call.

The caller has the same view of the register stack as before. The called procedure may have altered the contents of the caller's outs, but cannot alter the contents of the caller's ins or locals, as they are physically out of the called procedure's scope.

Combined effect of br.call, alloc, and br.ret instructions

Figure 7-5 depicts the views of the Itanium register stack as seen during a calling chain that involves three procedures: X, Y, and Z. The RSE in the processor manages the register stack using fields in cfm (the current frame marker), which is only indirectly visible through machine instructions (see Appendix D.7 for details).

Figure 7-5. Register stack changes during calls and returns

The sol (size of locals) field records the sum ins+locals from an alloc instruction. The sof (size of frame) field records the sum ins+locals+outs from an alloc instruction. The difference between sof and sol measures the number of stacked registers shared between a caller and a called procedure. The sor (size of rotators) field records how many elements of the register stack, in multiples of 8, will be used as rotating registers.

For the purposes of machine state, ins are a subset of locals and no distinction needs to be recorded at a hardware level. The software presumably knows its way around its own declared storage, being able to distinguish ins from other locals. In a language like C, the names of some variables appear inside parentheses after the procedure name (the ins), while the names of other variables appear below (the other locals).

Figure 7-5 illustrates what has happened to the register stack for a particular scenario, from the vantage point of an all-seeing observer not restricted by the scopes of the procedures involved. The column under X indicates that procedure X uses three stacked registers for locals (sol=3) and three more for outs (sof-sol=3). Since X will make procedure calls, it is obliged to save the previous pfm through the operation of its alloc instruction.

When X calls Y, as indicated in the first column under Y, the br.call instruction ensures that the locals of X are now out of scope for Y, and then copies cfm to pfm. The outs of X (which were locally numbered r35-r37) are now available as outs for Y (renumbered r32-r34). The overall frame shift of stacked registers is shown in this figure as a downward displacement of the reader's focus; the internal registers with which the RSE matches logical registers to physical registers are not shown.

When Y executes its alloc instruction, as indicated in the second column under Y, it specifies three ins, two locals, and two outs. The processor considers this a request to claim the three outs (now ins) and two more registers for locals (sol=5), and to have two new outs (sof=sol+2=5+2=7); its alloc instruction makes a copy of pfm.

When Y calls Z, as indicated in the first column under Z, the br.call instruction ensures that the locals of Y are now out of scope for Z, and copies cfm to pfm. The outs of Y (which were locally numbered r37-r38) are now available as outs for Z (as r32-r33). When Z executes its alloc instruction, it specifies two ins and three locals; as a leaf procedure it does not need to specify any outs. The processor considers this a request to claim the two outs (now ins) and three more registers for use as locals (sol=5). There can now be no new outs (sof=sol+0=sol=5). Even though Z will not make further calls, its alloc instruction will make a copy of pfm to a unique "older" location that is different from that used by either X or Y. This copy of pfm will not be used.

As a leaf procedure, Z does not need to restore pfm as part of its epilogue. When Z returns to Y, as indicated in the last column of Figure 7-5, the br.ret instruction causes the locals of Z to go out of scope and the register stack context for Y to be fully restored. This restoration is accomplished by copying pfm back into cfm. Before Y executes its own br.ret instruction to return to X, it must restore pfm from the copy that its alloc instruction made (though we do not show this return to the initial state in Figure 7-5).
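
The bookkeeping in this scenario can be replayed with a toy Python model. It is a sketch only, not a description of the real RSE: the class and method names are hypothetical, the `base` counter is invented bookkeeping that stands in for the hidden logical-to-physical mapping, and the split of X's three registers into ins and locals is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals: ins + locals
    sof: int  # size of frame: ins + locals + outs


class RegisterStackModel:
    """Toy model of cfm/pfm bookkeeping; not the real RSE."""

    def __init__(self):
        self.cfm = FrameMarker(sol=0, sof=0)
        self.pfm = FrameMarker(sol=0, sof=0)
        self.base = 0  # stacked-register offset of the current r32 (hypothetical)

    def alloc(self, ins, locals_, outs):
        """Resize the current frame; return the pfm copy the procedure keeps."""
        sol = ins + locals_
        self.cfm = FrameMarker(sol=sol, sof=sol + outs)
        return self.pfm

    def br_call(self):
        """Hide the caller's ins and locals; its outs become the new frame."""
        self.pfm = self.cfm
        self.base += self.cfm.sol
        self.cfm = FrameMarker(sol=0, sof=self.pfm.sof - self.pfm.sol)

    def restore_pfm(self, saved):
        """Model a procedure restoring pfm from its saved copy."""
        self.pfm = saved

    def br_ret(self):
        """Restore the caller's frame by copying pfm back into cfm."""
        self.cfm = self.pfm
        self.base -= self.cfm.sol


def trace_xyz():
    """Replay the X -> Y -> Z scenario; return (label, sol, sof, base) rows."""
    rse = RegisterStackModel()
    rows = []

    def note(label):
        rows.append((label, rse.cfm.sol, rse.cfm.sof, rse.base))

    rse.alloc(ins=0, locals_=3, outs=3)   # X (split of its three is assumed)
    note("X after alloc")
    rse.br_call()
    note("Y on entry")
    saved_by_y = rse.alloc(ins=3, locals_=2, outs=2)
    note("Y after alloc")
    rse.br_call()
    note("Z on entry")
    rse.alloc(ins=2, locals_=3, outs=0)   # Z's copy of pfm is never used
    note("Z after alloc")
    rse.br_ret()
    note("Y after Z returns")
    rse.restore_pfm(saved_by_y)           # Y's epilogue
    rse.br_ret()
    note("X after Y returns")
    return rows
```

In the rows returned by `trace_xyz()`, the `base` value of 3 on entry to Y says that Y's r32 is the register X called r35, and the value of 8 on entry to Z says that Z's r32 is the register Y called r37.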

7.5.3 Argument Passing: Locations

Mathematical functions usually need arguments or parameters to be passed to them by the caller. Conventions for argument passing are an important aspect of software development; in the past, conventions differed widely among environments even from one vendor. We outline here the intended uniform conventions for programming environments that are compliant with the Itanium ISA.

An Itanium calling routine passes up to eight arguments in registers and places any additional arguments onto the memory stack. For purposes of this book, we consider only the following data types: integers or address pointers up to 64 bits wide, and either single- or double-precision IEEE floating-point quantities. Different rules apply when an argument is an integer wider than 64 bits, a floating-point quantity in an extended format, or aggregate data (see Itanium Software Conventions and Runtime Architecture Guide).

The locations where arguments may be placed are diagrammed in Figure 7-6. Up to eight general registers allocated as outs, shown with the alternate symbols out0, out1, out2, …, may be used to pass arguments. Additionally, the floating-point registers Fr₈–Fr₁₅ may be used to pass arguments. An arbitrary amount of the memory stack can be claimed by decrementing sp by an appropriate amount, including the mandatory 16 bytes of scratch space, i.e., the two 8-byte blocks marked scr for the called function in Figure 7-6.

Figure 7-6. Argument passing in registers and on the memory stack

Even for the data types that we choose to handle, the rules differ between integers and floating-point quantities. Consider a mythical FORTRAN function call:

 PDQ = WHEW( I, J, X, Y, Z, K, R, M, S )

where I, J, K, and M are integers and X, Y, Z, R, and S are floating-point quantities. The arguments first need to be associated with sequentially numbered argument slots: I with slot0, J with slot1, X with slot2, …, S with slot8, left to right in syntactical order.

Any integer quantity in slot0 through slot7 is placed into the corresponding output register out0 through out7. The other outs are not used. Floating-point quantities in slot0 through slot7 are handled next. Each is placed into the next sequential floating-point register starting with f8 and leaving no gaps, except that any variable-argument formal parameter is placed into the corresponding output register out0 through out7 instead. If the compiler cannot determine the formal characteristics of a parameter, it will copy that parameter into both register locations.

Finally, any remaining quantities in slot8 and beyond are stored into quad word information units at addresses sp+16, sp+24, and so forth, where sp is the decremented value that will prevail across the call into the function.

In our particular illustration, I goes into out0, J goes into out1, X goes into f8, Y goes into f9, Z goes into f10, K goes into out5, R goes into f11, M goes into out7, and S must go onto the stack at sp+16. The remaining outs and floating-point registers would not be used for passing arguments in the example shown in Figure 7-6.
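
The placement rules for this restricted set of data types can be sketched as a short Python function. The function name and the "int"/"float" labels are hypothetical, and the sketch deliberately leaves out the variable-argument case and the case in which the compiler copies a parameter into both register locations.

```python
def assign_argument_locations(kinds):
    """Map each argument ("int" or "float", left to right) to its location."""
    locations = []
    next_fr = 8  # floating-point arguments start at f8 and leave no gaps
    for slot, kind in enumerate(kinds):
        if kind not in ("int", "float"):
            raise ValueError(f"unsupported argument kind: {kind!r}")
        if slot >= 8:
            # slot8 and beyond go onto the memory stack, 8 bytes apiece,
            # just above the 16-byte scratch area
            locations.append(f"sp+{16 + 8 * (slot - 8)}")
        elif kind == "int":
            locations.append(f"out{slot}")      # register matches the slot
        else:
            locations.append(f"f{next_fr}")     # next sequential FP register
            next_fr += 1
    return locations


# WHEW( I, J, X, Y, Z, K, R, M, S )
WHEW_KINDS = ["int", "int", "float", "float", "float",
              "int", "float", "int", "float"]
```

Calling `assign_argument_locations(WHEW_KINDS)` reproduces the assignment listed above, with the integer registers following the slot numbers and the floating-point registers filling in sequence.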

7.5.4 Argument Passing: Methods

Calling standards generally provide that each item in an argument list belongs to one of three classes according to its "immediacy," which can be seen as analogous to addressing modes:

Value. The quad word item in the register (or stack location) is the actual argument itself. A scalar quantity or single array element (i.e., an integer or real floating-point quantity) would be passed by value.
Reference. The quad word item in the register (or stack location) is the address of a data item that may be a scalar, a string, an array, a record, or a procedure. Other arguments may be associated with an item passed by reference, such as the length of a string or dimensionality information for an array.
Descriptor. The quad word item in the register (or stack location) is the address of a descriptor, a standardized data structure that contains not only an actual address for the data but also appropriate information about how the data have been stored and can be accessed. Descriptors contain information such as array bounds and string lengths.

That is, we say that each individual argument is passed by the caller to the called procedure by value, by reference, or by descriptor. Note the strong parallel to immediate, direct, and deferred addressing modes.
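
The three classes differ only in how many levels of indirection separate the quad word from the data. The following Python sketch models memory as a dictionary keyed by address; the addresses, the two-word descriptor layout, and the function names are all hypothetical and do not correspond to any real descriptor format.

```python
# Toy memory: address -> contents (all addresses are made up)
memory = {
    0x1000: 42,        # a scalar stored in memory
    0x2000: 0x3000,    # descriptor word 0: address of the data
    0x2008: 5,         # descriptor word 1: length of the data
    0x3000: "HELLO WORLD",
}


def fetch_by_value(quad):
    """The quad word is the argument itself."""
    return quad


def fetch_by_reference(quad):
    """The quad word is the address of the argument."""
    return memory[quad]


def fetch_by_descriptor(quad):
    """The quad word is the address of a descriptor for the argument."""
    address = memory[quad]       # where the data actually reside
    length = memory[quad + 8]    # how much of the data is valid
    return memory[address][:length]
```

Here `fetch_by_value(42)` and `fetch_by_reference(0x1000)` both yield 42, while `fetch_by_descriptor(0x2000)` follows the descriptor to the string and uses the recorded length to return only its first five characters.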

The default method for passing arguments in a Linux or HP-UX programming environment differs among high-level languages, but the syntax of each language usually provides for overriding the default. Table 7-1 gives a summary of the syntactical specifiers, while details can be found in the appropriate manuals for working with each language.

Table 7-1. High-Level Language Syntax for Passing Arguments
	By Value	By Reference	By Descriptor
C	`argument` (usual default)	`&argument`	define a structure, then: `&struct_name`
FORTRAN	`%VAL(argument)`	`%REF(argument)` (numeric default)	`%DESCR(argument)` (string default)
COBOL	`BY VALUE argument`	`BY REFERENCE argument` (usual default)	`BY DESCRIPTOR argument`

7.5.5 Prologues and Epilogues

This section more systematically discusses the context of procedures, particularly the need to preserve certain registers on stacks. Processor registers must be shared with other procedures in the calling chain.

Prologues, epilogues, and stack unwind tables

Operating systems and runtime routines supporting high-level languages need to consult unwind tables for exception handling, error recovery, debugging, and signal handling. Mosberger and Eranian discuss Itanium stack unwinding in detail as implemented for Linux.

ELF is the name of the binary format of executable Linux programs. The Linux utility program readelf can interpret several aspects of an executable file, and its -u option prints an interpretation of a program's unwind table. One version of the DOTCLOOP program contains the following unwind information:

 L> readelf -u bin/dotcloop
 Unwind section '.IA_64.unwind' at offset 0x740 contains 2 entries:
 <>: [0x4000000000000000-0x4000000000000490), info at +0x710
   v1, flags=0x0 ( ), len=8 bytes
     R1:prologue(rlen=21)
         P7:rp_when(t=20)
         P3:rp_gr(reg=r4)
     R1:body(rlen=6)
     R1:prologue(rlen=0)
     R1:prologue(rlen=0)
 <main>: [0x4000000000000520-0x40000000000005d0), info at +0x728
   v1, flags=0x0 ( ), len=8 bytes
     R1:prologue(rlen=1)
         P7:lc_when(t=0)
         P3:lc_gr(reg=r9)
     R1:prologue(rlen=0)
     R1:prologue(rlen=0)
     R1:prologue(rlen=0)
 L>

where the first block of information is involved in setting up the program's runtime environment.

The more interesting second block records that ar.lc is preserved in register r9 (see Figure 5-2) over the instruction range from 0x4000000000000520 to 0x40000000000005d0. We can verify those addresses using the debugger:

 L> gdb bin/dotcloop
 [messages removed]
 (gdb) x/3i 0x4000000000000520
 0x4000000000000520 <main>:   [MFI]     nop.m 0x0
 0x4000000000000521 <main+1>:           nop.f 0x0
 0x4000000000000522 <main+2>:           mov.i r9=ar.lc;;
 (gdb) x/3i 0x40000000000005d0-0x10
 0x40000000000005c0 <done+16> [MFB]     nop.m 0x0
 0x40000000000005c1 <done+17>:          nop.f 0x0
 0x40000000000005c2 <done+18>           br.ret.sptk.many b0;;
 (gdb) q
 L>

The unwind table stored as part of the DOTCLOOP executable file shows how the assembler directive .save in the dotcloop.s source file (Figure 5-2) flagged the protective intent of the mov r9=ar.lc instruction. The assembler program can build an internal table of such instructions flagged by a directive and determine the range of addresses over which the requested preservation of a resource is valid.
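
A recovery routine uses such address ranges by asking which entry covers a given instruction address. The lookup can be sketched in Python as follows; the table simply restates the two ranges printed by readelf for DOTCLOOP, while the function name and the tuple layout are hypothetical and the real unwind table is a bit-encoded structure rather than a list.

```python
UNWIND_ENTRIES = [
    # (start, end, name); end is exclusive, matching the "[start-end)" notation
    (0x4000000000000000, 0x4000000000000490, "<>"),
    (0x4000000000000520, 0x40000000000005d0, "<main>"),
]


def find_unwind_entry(ip, entries=UNWIND_ENTRIES):
    """Return the name of the entry whose address range contains ip, or None."""
    for start, end, name in entries:
        if start <= ip < end:
            return name
    return None
```

The address of the `mov.i r9=ar.lc` instruction, 0x4000000000000522, falls within the range for `<main>`, whereas 0x40000000000005d0 itself lies just outside it because the upper bound is exclusive.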

Assembly language programs for Itanium programming environments do not need an explicit directive marking the epilogue region.

Assembler directives related to unwind tables

The data structures for unwind tables are richly bit-encoded and rather difficult to inspect or construct directly, but fortunately are well documented in the Itanium Software Conventions and Runtime Architecture Guide. The system software provides some assistance by means of assembler directives for specifying entries that go into unwind tables. Each such directive tends to mark a machine instruction for one table entry. The address range over which the intent of the instruction is valid becomes part of the table entry. Recovery routines can thus work backwards by comparing the address in the IP register against all such address ranges. Table 7-2 gives a selection of assembler directives used in saving and restoring unwind resources.

Table 7-2. Some Itanium Assembler Directives Related to Stack Unwinding
Directive	Operands	Marks instruction that …
`.prologue`	`mask4`, Gr	Starts the prologue region; optionally specifies that preservation of `rp`, `ar.pfs`, `psp`, and/or `pr` will occur in sequentially numbered registers
`.save`	Special, Gr	Preserves any special register on the register stack
`.vframe`	Gr	Preserves `sp` on the register stack
`.fframe`	`size`	Claims `size` bytes of stack space (at least 16; multiple of 16)
`.altrp`	Br	Designates that the procedure was called using a different branch register than `b0`
`.save.b`	`bmask5`, Gr	Preserves any of Br₁–Br₅ on the register stack
`.save.g`	`gmask4`, Gr	Preserves any of Gr₄–Gr₇ on the register stack
`.save.gf`	0, `fmask20`	Preserves any of Fr₂–Fr₅ and Fr₁₆–Fr₃₁ in the spill area
`.restore`	`sp`	Restores `sp` to the value it contained on entry to the procedure

NaT bits

Each general register actually comprises 64 + 1 bits, where the extra bit is the NaT (Not a Thing) bit (see Appendix D.2) whose significance is related to speculative load instructions discussed much later. As long as one general register is used to save another, all 65 bits are handled properly. For that reason, as well as the very convenient automatic operation of the register stack engine (RSE), we have biased our choice toward the register stack to preserve general registers, branch registers, predicates, and application registers. Floating-point registers are considered separately below.

The .prologue directive

This directive begins a prologue segment, and can show how any of four critical shared register resources will be preserved in sequentially numbered registers. The optional 4-bit mask4 parameter encodes which registers will be preserved in subsequent instructions in the prologue area: bit <3> corresponds to rp, bit <2> to ar.pfs, bit <1> to psp, and bit <0> to pr. The destination general register must correspond to that of the resource designated by the highest-order 1 bit specified in mask4. Successively higher-numbered general registers are then used to preserve the resources specified by any other 1 bits to the right in mask4. Each instruction that preserves one register is also preceded by a directive (.save, .vframe, .fframe).
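
The mapping from mask4 to destination registers can be sketched in Python. The function name is hypothetical, and the sketch models only the numbering rule stated above, not the assembler's actual processing of the directive.

```python
# mask4 bit positions, highest-order first
MASK4_BITS = [(3, "rp"), (2, "ar.pfs"), (1, "psp"), (0, "pr")]


def prologue_assignments(mask4, first_gr):
    """Return {resource: general register} for a .prologue mask4 value.

    first_gr is the number of the general register named in the directive;
    it receives the resource for the highest-order 1 bit in mask4.
    """
    if not 0 <= mask4 <= 0xF:
        raise ValueError("mask4 must fit in 4 bits")
    assignments = {}
    gr = first_gr
    for bit, resource in MASK4_BITS:
        if mask4 & (1 << bit):
            assignments[resource] = f"r{gr}"
            gr += 1  # the next 1 bit to the right uses the next register
    return assignments
```

With a mask of 0xC and a first register of 34, for example, the sketch places rp in r34 and ar.pfs in r35; with a mask of 0x5 and a first register of 40, it places ar.pfs in r40 and pr in r41.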

The .save directive

This directive marks a mov instruction that preserves a resource designated as a Special register (Appendix D) in a general register. Special registers include rp, ar.fpsr, ar.bsp, ar.bspstore, ar.rnat, ar.pfs, ar.lc, pr, ar.unat (NaT bits on entry), and the assembler designation @priunat (NaT bits following a spill in the current procedure).

The .vframe and .fframe directives

Similarly, the .vframe (variable frame) directive marks the instruction that saves the caller's sp (now called psp) in a general register. If .vframe is used, the body of the procedure can make additional stack size adjustments after the initial adjustment in the prologue. Alternatively, if the .fframe (fixed frame) directive is used to mark the instruction that adjusts sp, the procedure may not make any further stack adjustments.

The .save.b and .save.g directives

These directives operate in similar ways. A bit mask designates which preserved branch registers (Appendix D.4) or general registers (Appendix D.2) are copied by subsequent mov instructions onto the register stack. The bits in the mask, from lowest to highest, map to the potentially preserved registers. For example, a mask of 0x5 could correspond to saving Br₁ and Br₃ (.save.b) or Gr₄ and Gr₆ (.save.g) in successive stacked registers.
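
Decoding either mask is a matter of testing bits from lowest to highest. A minimal Python sketch, with hypothetical function names, follows.

```python
def registers_from_mask(mask, lowest_register, width, prefix):
    """List the registers selected by mask, lowest bit first."""
    if not 0 <= mask < (1 << width):
        raise ValueError(f"mask must fit in {width} bits")
    return [f"{prefix}{lowest_register + bit}"
            for bit in range(width) if mask & (1 << bit)]


def save_b_registers(bmask5):
    """Branch registers b1-b5 selected by a .save.b mask."""
    return registers_from_mask(bmask5, lowest_register=1, width=5, prefix="b")


def save_g_registers(gmask4):
    """General registers r4-r7 selected by a .save.g mask."""
    return registers_from_mask(gmask4, lowest_register=4, width=4, prefix="r")
```

For the mask 0x5 used in the example, `save_b_registers(0x5)` gives b1 and b3, and `save_g_registers(0x5)` gives r4 and r6.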

The .altrp directive

The br.call and br.ret instructions work with any of the branch registers Br₀–Br₇. If a procedure was not called with Br₀, the unwind table needs an entry produced by the .altrp directive to record which branch register contains the return address.

Saving floating-point registers and other items

The floating-point registers that must be preserved are Fr₂–Fr₅ and Fr₁₆–Fr₃₁. A single .save.gf directive with a 20-bit mask can mark which of these will be selectively copied into a default spill area on the stack by subsequent stf.spill instructions that copy floating-point data in the 82-bit register format. This uses 16 bytes of stack space per register. Note that an 82-bit floating-point register cannot be spilled to the 64-bit elements of the register stack.

The spill area consists of the 16-byte scratch area provided by the caller, and at lower contiguous addresses, additional space as required. The spill area can therefore be defined as the 16 bytes from psp through psp+15 plus some top portion of the local storage region in Figure 7-2.

The 20-bit mask specifies Fr₃₁ in bit <19> … Fr₁₆ in bit <4>, and then Fr₅ in bit <3> … Fr₂ in bit <0>. The highest-numbered of these registers to be spilled goes into the 16-byte scratch area at address psp, while registers with lower numbers are stored at successively lower addresses at 16-byte intervals.
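
The mask decoding and the resulting spill addresses can be sketched in Python. The function name is hypothetical, the sketch covers only the floating-point mask (the general-register operand is taken to be zero), and psp is passed in as a plain integer.

```python
# Preserved floating-point registers in mask order: bit 0 is f2, bit 19 is f31
PRESERVED_FR = [2, 3, 4, 5] + list(range(16, 32))


def fr_spill_layout(fmask20, psp):
    """Return {register: spill address} for a .save.gf floating-point mask."""
    if not 0 <= fmask20 < (1 << 20):
        raise ValueError("fmask20 must fit in 20 bits")
    selected = [fr for bit, fr in enumerate(PRESERVED_FR)
                if fmask20 & (1 << bit)]
    layout = {}
    address = psp  # the highest-numbered register uses the scratch area at psp
    for fr in sorted(selected, reverse=True):
        layout[f"f{fr}"] = address
        address -= 16  # each lower-numbered register sits 16 bytes lower
    return layout
```

A mask of 0x00013 selects f2, f3, and f16; with psp equal to 0x1000, the sketch places f16 at 0x1000, f3 at 0xff0, and f2 at 0xfe0.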

The .save.gf directive can have as the first operand a 4-bit mask that indicates general registers to be stored in the spill area; we generally set that operand to zero and instead elect to save general registers on the register stack.

Other methods

The .save, .vframe, and .spill directives have other forms with names ending in .sp or .psp when the operand is stored on the memory stack rather than the register stack. The location will be relative either to the current procedure's stack or frame pointer. In addition, there are predicated forms of the spill and restore directives. More information can be found in the Intel Itanium Architecture Assembly Language Reference Manual.

7.5.6 The .regstk Directive

Within a source program, the assembler uses the arguments of the alloc instruction to define the appropriate numbers of inx, locx, and outx alternate names for the stacked registers just allocated. A subsequent alloc instruction can change these symbolic associations and resize the current stack frame.

The nonexecutable .regstk directive can also be used to associate a different distribution of roles for the registers allocated to the current stack frame:

 .regstk   ins, locs, outs, rots

where the values of ins, locs, outs, and rots have their usual meanings and mutual constraints. The .regstk directive does not alter the size of the register stack frame. The new register names are valid for all instructions after this directive in the source.
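
The effect of alloc or .regstk on the alternate names can be sketched in Python. The function name is hypothetical, and rotating registers are ignored in this sketch.

```python
def stacked_register_names(ins, locs, outs):
    """Map the alternate names inx, locx, and outx to r32 and upward."""
    names = {}
    reg = 32  # the first stacked general register
    for prefix, count in (("in", ins), ("loc", locs), ("out", outs)):
        for index in range(count):
            names[f"{prefix}{index}"] = f"r{reg}"
            reg += 1
    return names
```

For a seven-register frame, `stacked_register_names(3, 2, 2)` associates loc0 with r35 and out0 with r37. Redistributing the same seven registers as `stacked_register_names(2, 3, 2)` moves loc0 to r34 while out0 stays at r37, which mirrors how .regstk changes the names without changing the size of the frame.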