3.4. Software Conventions

An application binary interface (ABI) defines a system interface for compiled programs, allowing compilers, linkers, debuggers, executables, libraries, other object files, and the operating system to work with each other. In a simplistic sense, an ABI is a low-level, "binary" API. A program conforming to an API should be compilable from source on different systems supporting that API, whereas a binary executable conforming to an ABI should operate on different systems supporting that ABI.^[51]

^[51] ABIs vary in whether they strictly enforce cross-operating-system compatibility or not.

An ABI usually includes a set of rules specifying how hardware and software resources are to be used for a given architecture. Besides interoperability, the conventions laid down by an ABI may have performance-related goals too, such as minimizing average subroutine-call overhead, branch latencies, and memory accesses. The scope of an ABI could be extensive, covering a wide variety of areas such as the following:

Byte ordering (endianness)
Alignment and padding
Register usage
Stack usage
Subroutine parameter passing and value returning
Subroutine prologues and epilogues
System calls
Object files
Dynamic code generation
Program loading and dynamic linking
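Alignment and padding are a good example of decisions that the ABI fixes rather than the language. The following minimal C sketch uses the standard offsetof and alignof to observe the padding a compiler inserts; the struct and function names are hypothetical. On both of the 32-bit ABIs discussed below, int is 4-byte aligned, so this typically reports 3 padding bytes, but the exact value depends on the ABI of the machine you compile for.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical example struct. The ABI decides where each member lands:
 * the compiler inserts padding after 'tag' so that 'value' starts at a
 * multiple of int's alignment requirement. */
struct record {
    char tag;
    int  value;
};

/* Number of padding bytes the compiler placed between 'tag' and 'value'. */
size_t record_padding_after_tag(void)
{
    return offsetof(struct record, value) - sizeof(char);
}
```

Printing alignof(int), offsetof(struct record, value), and sizeof(struct record) on two different targets is a quick way to see two ABIs disagree (or agree).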

The PowerPC version of Mac OS X uses the Darwin PowerPC ABI in its 32-bit and 64-bit versions, whereas the 32-bit x86 version uses the System V IA-32 ABI. The Darwin PowerPC ABI is similar to, but not the same as, the popular IBM AIX ABI for the PowerPC. In this section, we look at some aspects of the Darwin PowerPC ABI without analyzing its differences from the AIX ABI.

3.4.1. Byte Ordering

The PowerPC architecture natively supports 8-bit (byte), 16-bit (half word), 32-bit (word), and 64-bit (double word) data types. It uses a flat-address-space model with byte-addressable storage. Although the PowerPC architecture provides an optional little-endian facility, the 970FX does not implement it; it implements only the big-endian addressing mode. Big-endian refers to storing the "big" end of a multibyte value at the lowest memory address. In the PowerPC architecture, the leftmost bit (bit 0) is defined to be the most significant bit, whereas the rightmost bit is the least significant bit. For example, if a 64-bit register is being used as a 32-bit register in 32-bit computation mode, then bits 32 through 63 of the 64-bit register represent the 32-bit register; bits 0 through 31 are ignored. As a corollary, the leftmost byte (byte 0) is the most significant byte, and so on.
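To make the byte-order and bit-numbering conventions concrete, here is a small, portable C sketch; the helper names store_be32, ppc_bit64, and ppc_low_word are hypothetical. It lays out a 32-bit value in big-endian byte order and reads a register bit using PowerPC numbering, in which bit 0 is the most significant. Because it uses shifts rather than reinterpreting memory, it gives the same result on any host.

```c
#include <stdint.h>

/* Store a 32-bit value the way a big-endian processor such as the 970FX
 * does: most significant byte at the lowest address (byte 0). */
void store_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);   /* byte 0: most significant */
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;           /* byte 3: least significant */
}

/* Read bit n (0..63) of a 64-bit register using PowerPC numbering, where
 * bit 0 is the most significant bit and bit 63 the least significant. */
unsigned ppc_bit64(uint64_t reg, unsigned n)
{
    return (unsigned)((reg >> (63 - n)) & 1u);
}

/* In 32-bit computation mode, the 32-bit register is bits 32..63 of the
 * 64-bit register, i.e., its low-order word; bits 0..31 are ignored. */
uint32_t ppc_low_word(uint64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFu);
}
```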

In PowerPC implementations that support both the big-endian and little-endian^[52] addressing modes, the LE bit of the Machine State Register can be set to 1 to specify little-endian mode. Another bit, the ILE bit, is used to specify the mode for exception handlers. The default value of both bits is 0 (big-endian) on such processors.
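A program can observe which addressing mode it is running in without reading the Machine State Register: store a known multibyte value and look at the byte at its lowest address. The following is a minimal sketch with a hypothetical function name; on a 970FX, it would always report big-endian.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the processor is currently running in big-endian mode,
 * 0 if little-endian. It inspects which byte of a known 32-bit value is
 * stored at the lowest address. */
int host_is_big_endian(void)
{
    uint32_t probe = 0x01020304u;
    uint8_t  first;

    memcpy(&first, &probe, 1);   /* copy the byte at the lowest address */
    return first == 0x01;
}
```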

^[52] The use of little-endian mode on such processors is subject to several caveats as compared to big-endian mode. For example, certain instructions, such as load/store multiple and load/store string, are not supported in little-endian mode.

3.4.2. Register Usage

The Darwin ABI defines a register to be dedicated, volatile, or nonvolatile. A dedicated register has a predefined or standard purpose; it should not be arbitrarily modified by the compiler. A volatile register is available for use at all times, but its contents may change if the context changes, for example, because of calling a subroutine. Since the caller must save any volatile registers whose contents it needs across such a call, these registers are also called caller-save registers. A nonvolatile register is available for use in a local context, but the user of such registers must save their original contents before use and must restore the contents before returning to the calling context. Therefore, it is the callee, and not the caller, who must save nonvolatile registers. Correspondingly, such registers are also called callee-save registers.

In some cases, a register may be available for general use in one runtime environment but may have a special purpose in some other runtime environment. For example, GPR12 has a predefined purpose on Mac OS X when used for indirect function calls.

Table 3-12 lists common PowerPC registers along with their usage conventions as defined by the 32-bit Darwin ABI.

Table 3-12. Register Conventions in the 32-bit Darwin PowerPC ABI
Register(s)	Volatility	Purpose/Comments
GPR0	Volatile	Cannot be a base register.
GPR1	Dedicated	Used as the stack pointer to allow access to parameters and other temporary data.
GPR2	Volatile	Available on Darwin as a local register but used as the Table of Contents (TOC) pointer in the AIX ABI. Darwin does not use the TOC.
GPR3	Volatile	Contains the first argument word when calling a subroutine; contains the first word of a subroutine's return value. Objective-C uses GPR3 to pass a pointer to the object being messaged (i.e., "self") as an implicit parameter.
GPR4	Volatile	Contains the second argument word when calling a subroutine; contains the second word of a subroutine's return value. Objective-C uses GPR4 to pass the method selector as an implicit parameter.
GPR5–GPR10	Volatile	GPRn contains the (n − 2)th argument word when calling a subroutine.
GPR11	Varies	In the case of a nested function, used by the caller to pass its stack frame to the nested function; the register is nonvolatile. In the case of a leaf function, the register is available and is volatile.
GPR12	Volatile	Used in an optimization for dynamic code generation, wherein a routine that branches indirectly to another routine must store the target of the call in GPR12. No special purpose for a routine that has been called directly.
GPR13–GPR29	Nonvolatile	Available for general use. Note that GPR13 is reserved for thread-specific storage in the 64-bit Darwin PowerPC ABI.
GPR30	Nonvolatile	Used as the frame pointer register, i.e., as the base register for access to a subroutine's local variables.
GPR31	Nonvolatile	Used as the PIC-offset table register.
FPR0	Volatile	Scratch register.
FPR1–FPR4	Volatile	FPRn contains the nth floating-point argument when calling a subroutine; FPR1 contains the subroutine's single- or double-precision floating-point return value; a 16-byte long double value is returned in FPR1 and FPR2.
FPR5–FPR13	Volatile	FPRn contains the nth floating-point argument when calling a subroutine.
FPR14–FPR31	Nonvolatile	Available for general use.
CR0	Volatile	Used for holding condition codes during arithmetic operations.
CR1	Volatile	Used for holding condition codes during floating-point operations.
CR2–CR4	Nonvolatile	Various condition codes.
CR5	Volatile	Various condition codes.
CR6	Volatile	Various condition codes; can be used by AltiVec.
CR7	Volatile	Various condition codes.
CTR	Volatile	Contains a branch target address (for the `bcctr` instruction); contains counter value for a loop.
FPSCR	Volatile	Floating-Point Status and Control Register.
LR	Volatile	Contains a branch target address (for the `bclr` instruction); contains subroutine return address.
XER	Volatile	Fixed-point exception register.
VR0, VR1	Volatile	Scratch registers.
VR2	Volatile	Contains the first vector argument when calling a subroutine; contains the vector returned by a subroutine.
VR3–VR19	Volatile	VRn contains the (n − 1)th vector argument when calling a subroutine.
VR20–VR31	Nonvolatile	Available for general use.
VRSAVE	Nonvolatile	If bit n of the VRSAVE is set, then VRn must be saved during any kind of context switch.
VSCR	Volatile	Vector Status and Control Register.
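The GPR rows of the register table can be encoded as a lookup, which can serve as a quick sanity check when reading compiler output. This is a sketch with hypothetical names; it treats GPR11 as a separate "varies" class and maps argument words to GPR3 through GPR10 as the table describes.

```c
/* Classification of GPRs in the 32-bit Darwin PowerPC ABI (sketch). */
enum reg_class { REG_DEDICATED, REG_VOLATILE, REG_NONVOLATILE, REG_VARIES };

/* n must be in 0..31. */
enum reg_class gpr_class(int n)
{
    if (n == 1)
        return REG_DEDICATED;             /* stack pointer */
    if (n == 11)
        return REG_VARIES;                /* nested vs. leaf function */
    if (n == 0 || (n >= 2 && n <= 12))
        return REG_VOLATILE;              /* caller-save */
    return REG_NONVOLATILE;               /* GPR13..GPR31: callee-save */
}

/* GPRn carries the (n - 2)th argument word, so argument word k (1-based)
 * lives in GPR(k + 2) for k = 1..8; later words go to the parameter area. */
int gpr_for_arg_word(int k)
{
    return (k >= 1 && k <= 8) ? k + 2 : -1;   /* -1: passed in memory */
}
```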

3.4.2.1. Indirect Calls

We noted in Table 3-12 that a function that branches indirectly to another function stores the target of the call in GPR12. Indirect calls are, in fact, the default scenario for dynamically compiled Mac OS X user-level code. Since the target address would need to be stored in a register in any case, using a standardized register allows for potential optimizations. Consider the code fragment shown in Figure 3-18.

Figure 3-18. A simple C function that calls another function

void f2(void);

void
f1(void)
{
    f2();
}

By default, the assembly code generated by GCC on Mac OS X for the function shown in Figure 3-18 will be similar to that shown in Figure 3-19, which has been annotated and trimmed down to relevant parts. In particular, note the use of GPR12, which is referred to as r12 in the GNU assembler syntax.

Figure 3-19. Assembly code depicting an indirect function call

...
_f1:
        mflr r0         ; prologue
        stmw r30,-8(r1) ; prologue
        stw r0,8(r1)    ; prologue
        stwu r1,-80(r1) ; prologue
        mr r30,r1       ; prologue
        bl L_f2$stub    ; indirect call
        lwz r1,0(r1)    ; epilogue
        lwz r0,8(r1)    ; epilogue
        mtlr r0         ; epilogue
        lmw r30,-8(r1)  ; epilogue
        blr             ; epilogue
...
L_f2$stub:
        .indirect_symbol _f2
        mflr r0
        bcl 20,31,L0$_f2
L0$_f2:
        mflr r11
        ; lazy pointer contains our desired branch target
        ; copy that value to r12 (the 'addis' and the 'lwzu')
        addis r11,r11,ha16(L_f2$lazy_ptr-L0$_f2)
        mtlr r0
        lwzu r12,lo16(L_f2$lazy_ptr-L0$_f2)(r11)
        ; copy branch target to CTR
        mtctr r12
        ; branch through CTR
        bctr
.data
.lazy_symbol_pointer
L_f2$lazy_ptr:
        .indirect_symbol _f2
        .long dyld_stub_binding_helper
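The stub logic above can be modeled in C. This is a simplified sketch, not dyld's implementation: f2_lazy_ptr plays the role of L_f2$lazy_ptr, binding_helper stands in for dyld_stub_binding_helper, and f2_impl is a hypothetical target (given an int parameter only so the result can be checked). The first call goes through the helper, which patches the pointer; later calls reach the target directly, still through a loaded pointer, as the bctr does.

```c
/* C model of lazy binding through a stub (hypothetical names). */
typedef int (*fn_t)(int);

static int f2_impl(int x) { return x * 2; }      /* the real target */

static int binding_helper(int x);                 /* resolver, defined below */
static fn_t f2_lazy_ptr = binding_helper;         /* like L_f2$lazy_ptr */
static int binds_performed = 0;

/* Stand-in for the binding helper: "resolve" the symbol once, patch the
 * lazy pointer, then forward the original call. */
static int binding_helper(int x)
{
    binds_performed++;
    f2_lazy_ptr = f2_impl;
    return f2_lazy_ptr(x);
}

/* Equivalent of L_f2$stub: load the lazy pointer (r12 in the listing)
 * and branch to whatever it currently holds (via CTR in the listing). */
int f2_stub(int x)
{
    fn_t target = f2_lazy_ptr;
    return target(x);
}

int binding_count(void) { return binds_performed; }
```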

3.4.2.2. Direct Calls

If GCC is instructed to statically compile the code in Figure 3-18, we can verify in the resultant assembly that there is a direct call to f2 from f1, with no use of GPR12. This case is shown in Figure 3-20.

Figure 3-20. Assembly code depicting a direct function call

        .machine ppc
        .text
        .align 2
        .globl _f1
_f1:
        mflr r0
        stmw r30,-8(r1)
        stw r0,8(r1)
        stwu r1,-80(r1)
        mr r30,r1
        bl _f2
        lwz r1,0(r1)
        lwz r0,8(r1)
        mtlr r0
        lmw r30,-8(r1)
        blr

3.4.3. Stack Usage

On most processor architectures, a stack is used to hold automatic variables, temporary variables, and return information for each invocation of a subroutine. The PowerPC architecture does not explicitly define a stack for local storage: There is neither a dedicated stack pointer nor any push or pop instructions. However, it is conventional for operating systems running on the PowerPC, including Mac OS X, to designate (per the ABI) an area of memory as the stack and grow it "upward," that is, from a high memory address toward a low memory address. In this section, the "top" of the stack and "above" therefore refer to the low-address end. GPR1, which is used as the stack pointer, points to the top of the stack.
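The direction of growth can be checked empirically. The following is a diagnostic sketch, assuming GCC or Clang (for the noinline attribute); comparing addresses of unrelated objects is not portable ISO C, so treat the result as an observation about one platform. On Mac OS X for PowerPC, as on most current hosts, it should report that a callee's locals sit at lower addresses than its caller's.

```c
#include <stdint.h>

/* Returns 1 if the callee's local lives at a lower address than the
 * caller's local, i.e., the stack grows from high toward low addresses.
 * noinline keeps the two frames distinct. */
__attribute__((noinline))
static int callee_is_lower(volatile char *caller_local)
{
    volatile char callee_local = 0;
    return (uintptr_t)&callee_local < (uintptr_t)caller_local;
}

__attribute__((noinline))
int stack_grows_toward_lower_addresses(void)
{
    volatile char caller_local = 0;
    return callee_is_lower(&caller_local);
}
```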

Both the stack and the registers play important roles in the working of subroutines. As listed in Table 3-12, registers are used to hold subroutine arguments, up to a certain number.

Functional Subtleties

The terms function, procedure, and subroutine are sometimes used in programming language literature to denote similar but slightly differing entities. For example, a function is a procedure that always returns a result, but a "pure" procedure does not return a result. Subroutine is often used as a general term for either a function or a procedure. The C language does not make such fine distinctions, but some languages do. We use these terms synonymously to represent the fundamental programmer-visible unit of callable execution in a high-level language like C.

Similarly, the terms argument and parameter are used synonymously in informal contexts. In general, when you declare a function that "takes arguments," you use formal parameters in its declaration. These are placeholders for actual parameters, which are what you specify when you call the function. Actual parameters are often called arguments.

The mechanism whereby actual parameters are matched with (or bound to) formal parameters is called parameter passing, which could be performed in various ways, such as call-by-value (actual parameter represents its value), call-by-reference (actual parameter represents its location), call-by-name (actual parameter represents its program text), and variants.
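C itself supports only call-by-value; call-by-reference is emulated by passing a pointer, that is, by passing the variable's location by value. A minimal sketch with hypothetical function names:

```c
/* C passes every argument by value: the callee receives a copy. */
void increment_copy(int n)
{
    n++;                 /* modifies only the callee's copy */
}

/* Call-by-reference is emulated by passing the location (a pointer). */
void increment_in_place(int *n)
{
    (*n)++;              /* modifies the caller's variable */
}
```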

If a function f1 calls another function f2, which calls yet another function f3, and so on in a program, the program's stack grows per the ABI's conventions. Each function in the call chain owns part of the stack. A representative runtime stack for the 32-bit Darwin ABI is shown in Figure 3-21.

Figure 3-21. Darwin 32-bit ABI runtime stack

In Figure 3-21, f1 calls f2, which calls f3. f1's stack frame contains a parameter area and a linkage area.

The parameter area must be large enough to hold the largest parameter list of all functions that f1 calls. f1 typically will pass arguments in registers as long as there are registers available. Once registers are exhausted, f1 will place arguments in its parameter area, from where f2 will pick them up. However, f1 must reserve space for all arguments of f2 in any case, even if it is able to pass all arguments in registers. f2 is free to use f1's parameter area for storing arguments if it wants to free up the corresponding registers for other use. Thus, in a subroutine call, the caller sets up a parameter area in its own stack portion, and the callee can access the caller's parameter area for loading or storing arguments.
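The sizing rule can be written down as a small calculation. This sketch (hypothetical names) takes the number of argument words each callee expects and reserves room for the largest list regardless of how many arguments travel in registers. It also applies the eight-word (32-byte) minimum that this section states for the parameter area, and align16 reflects the 16-byte alignment of 32-bit Darwin stack frames.

```c
#include <stddef.h>

/* Minimum parameter area: eight words (32 bytes), even when the callee
 * takes fewer arguments (or none). */
#define MIN_PARAM_AREA_BYTES 32u

/* Round a byte count up to the 16-byte stack-frame alignment. */
unsigned align16(unsigned bytes)
{
    return (bytes + 15u) & ~15u;
}

/* Size of the parameter area a caller must reserve, given the number of
 * argument words each of its callees takes. Space is reserved for the
 * largest list even if every argument actually travels in registers.
 * Returns 0 for a function that calls nothing (no parameter area). */
unsigned param_area_bytes(const unsigned *callee_arg_words, size_t ncallees)
{
    unsigned largest = 0;

    if (ncallees == 0)
        return 0;
    for (size_t i = 0; i < ncallees; i++)
        if (callee_arg_words[i] > largest)
            largest = callee_arg_words[i];

    unsigned bytes = largest * 4u;             /* one word = 4 bytes */
    return bytes < MIN_PARAM_AREA_BYTES ? MIN_PARAM_AREA_BYTES : bytes;
}
```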

The linkage area begins after the parameter area and is at the top of the stack, adjacent to the stack pointer. The adjacency to the stack pointer is important: The linkage area has a fixed size, and therefore the callee can find the caller's parameter area deterministically. The callee can save the CR and the LR in the caller's linkage area if it needs to. The first word of each function's linkage area always holds the stack pointer of that function's caller, saved there by the function's prologue; these saved pointers form a back chain from each frame to its caller's frame.
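Following the back chain is how debuggers and stack-trace code find each caller's frame. The sketch below models frames in ordinary memory so that it can run on any host. The struct linkage layout is an assumption that mirrors the offsets the listings in this section rely on (back chain at offset 0 and saved LR at offset 8 in the real 32-bit layout, with the saved CR at offset 4); the remaining linkage-area words are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed, simplified model of a 32-bit Darwin linkage area. On a real
 * 32-bit system the fields sit at 0(r1), 4(r1), and 8(r1); on a 64-bit
 * host this model's offsets differ, which does not matter here. */
struct linkage {
    struct linkage *back_chain;   /* caller's stack pointer */
    uint32_t        saved_cr;     /* CR, if a callee saved it here */
    uint32_t        saved_lr;     /* LR, if a callee saved it here */
};

/* Follow back-chain pointers from 'sp', recording up to 'max' saved LR
 * values and stopping at a frame whose back chain is NULL (the outermost
 * frame). Returns the number of entries written. */
size_t walk_back_chain(const struct linkage *sp, uint32_t *lrs, size_t max)
{
    size_t n = 0;

    while (sp != NULL && sp->back_chain != NULL && n < max) {
        lrs[n++] = sp->saved_lr;
        sp = sp->back_chain;
    }
    return n;
}
```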

In Figure 3-21, f2's portion of the stack shows space for saving nonvolatile registers that f2 changes. These must be restored by f2 before it returns to its caller.

Space for each function's local variables is reserved by growing the stack appropriately. This space lies below the parameter area and above the saved registers.

The fact that a called function is responsible for allocating its own stack frame does not mean the programmer has to write code to do so. When you compile a function, the compiler inserts code fragments called the prologue and the epilogue before and after the function body, respectively. The prologue sets up the stack frame for the function. The epilogue undoes the prologue's work, restoring any saved registers (including CR and LR), incrementing the stack pointer to its previous value (that the prologue saved in its linkage area), and finally returning to the caller.

A 32-bit Darwin ABI stack frame is 16-byte aligned.

Consider the trivial function shown in Figure 3-22, along with the corresponding annotated assembly code.

Figure 3-22. Assembly listing for a C function with no arguments and an empty body

$ cat function.c
void
function(void)
{
}
$ gcc -S function.c
$ cat function.s
...
_function:
        stmw r30,-8(r1) ; Prologue: save r30 and r31
        stwu r1,-48(r1) ; Prologue: grow the stack 48 bytes
        mr r30,r1       ; Prologue: copy stack pointer to r30
        lwz r1,0(r1)    ; Epilogue: pop the stack (restore frame)
        lmw r30,-8(r1)  ; Epilogue: restore r30 and r31
        blr             ; Epilogue: return to caller (through LR)

The Red Zone

Just after a function is called, the function's prologue will decrement the stack pointer from its existing location to reserve space for the function's needs. The area above the stack pointer, where the newly called function's stack frame would reside, is called the Red Zone.

In the 32-bit Darwin ABI, the Red Zone has space for 19 GPRs (19 × 4 = 76 bytes) and 18 FPRs (18 × 8 = 144 bytes), matching the nonvolatile GPR13 through GPR31 and FPR14 through FPR31, for a total of 220 bytes. Rounded up to the nearest 16-byte boundary, this becomes 224 bytes, which is the size of the Red Zone.
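The arithmetic, and the leaf-function test described next, can be expressed as compile-time constants. In this sketch (hypothetical names), the 19 GPRs and 18 FPRs are taken to be the nonvolatile GPR13 through GPR31 and FPR14 through FPR31 from the register table, which matches the counts given above.

```c
/* Red Zone size in the 32-bit Darwin ABI: room to save every nonvolatile
 * GPR and FPR, rounded up to the 16-byte frame alignment. */
enum {
    RZ_GPRS      = 31 - 13 + 1,                 /* 19 nonvolatile GPRs */
    RZ_FPRS      = 31 - 14 + 1,                 /* 18 nonvolatile FPRs */
    RZ_RAW_BYTES = RZ_GPRS * 4 + RZ_FPRS * 8,   /* 76 + 144 = 220 */
    RZ_BYTES     = (RZ_RAW_BYTES + 15) & ~15    /* 224 */
};

_Static_assert(RZ_RAW_BYTES == 220, "19 * 4 + 18 * 8");
_Static_assert(RZ_BYTES == 224, "220 rounded up to a multiple of 16");

/* Can a leaf function that keeps its locals in registers, and must save
 * 'ngprs' GPRs and 'nfprs' FPRs, skip allocating a stack frame? */
int leaf_fits_in_red_zone(unsigned ngprs, unsigned nfprs)
{
    return ngprs * 4u + nfprs * 8u <= (unsigned)RZ_BYTES;
}
```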

Normally, the Red Zone is indeed occupied by the callee's stack frame. However, if the callee does not call any other function (that is, if it is a leaf function), then it does not need a parameter area. It may also not need space for local variables on the stack if it can fit all of them in registers. It may need space for saving the nonvolatile registers it uses (recall that if a callee needs to save the CR and LR, it can save them in the caller's linkage area). As long as it can fit the registers to save in the Red Zone, it does not need to allocate a stack frame or decrement the stack pointer. Note that by definition, there is only one leaf function active at a time on a given stack.

3.4.3.1. Stack Usage Examples

Figures 3-23 and 3-24 show examples of how the compiler sets up a function's stack depending on the number of local variables a function has, the number of parameters it has, the number of arguments it passes to a function it calls, and so on.

Figure 3-23. Examples of stack usage in functions

Figure 3-24. Examples of stack usage in functions (continued from Figure 3-23)

f1 is identical to the "null" function that we encountered in Figure 3-22, where we saw that the compiler reserves 48 bytes for the function's stack. The portions shown as shaded in the stacks are present either for alignment padding or for some current or future purpose not necessarily exposed through the ABI. Note that GPR30 and GPR31 are always saved, GPR30 being the designated frame pointer.

f2 uses a single 32-bit local variable. Its stack is 64 bytes.

f3 calls a function that takes no arguments. Nevertheless, this introduces a parameter area on f3's stack. A parameter area is at least eight words (32 bytes) in size. f3's stack is 80 bytes.

f4 takes eight arguments, has no local variables, and calls no functions. Its stack area is the same size as that of the null function because space for its arguments is reserved in the parameter area of its caller.

f5 takes no arguments, has eight word-size local variables, and calls no functions. Its stack is 64 bytes.

3.4.3.2. Printing Stack Frames

GCC provides built-in functions that a function may use to retrieve information about its callers. The current function's return address can be retrieved by calling the __builtin_return_address() function, which takes a single argument, the level: a constant integer specifying the number of stack frames to walk. A level of 0 yields the return address of the current function. Similarly, the __builtin_frame_address() function may be used to retrieve the frame address of a function in the call stack. Both functions return a NULL pointer when the top of the stack has been reached.^[53] Figure 3-25 shows a program that uses these functions to display a stack trace. The program also uses the dladdr() function in the dyld API to find the names of the functions, and of the images containing them, that correspond to the return addresses in the call stack.

^[53] For __builtin_frame_address() to return a NULL pointer upon reaching the top of the stack, the first frame pointer must have been set up correctly.

Figure 3-25. Printing a function call stack trace^[54]

// stacktrace.c

#include <stdio.h>
#include <dlfcn.h>

void
printframeinfo(unsigned int level, void *fp, void *ra)
{
    int     ret;
    Dl_info info;

    // Find the image containing the given address
    ret = dladdr(ra, &info);
    printf("#%u %s%s in %s, fp = %p, pc = %p\n",
           level,
           (ret && info.dli_sname) ? info.dli_sname : "?", // symbol name
           (ret) ? "()" : "",                               // show as a function
           (ret) ? info.dli_fname : "?", fp, ra);           // shared object name
}

void
stacktrace(void)
{
    unsigned int level = 0;
    void    *saved_ra  = __builtin_return_address(0);
    void   **fp        = (void **)__builtin_frame_address(0);
    void    *saved_fp  = __builtin_frame_address(1);

    printframeinfo(level, saved_fp, saved_ra);
    level++;
    fp = saved_fp;
    while (fp) {
        saved_fp = *fp;              // follow the back chain to the caller's frame
        fp = saved_fp;
        if (fp == NULL || *fp == NULL)
            break;
        saved_ra = *(fp + 2);        // saved LR: third word of the linkage area
        printframeinfo(level, saved_fp, saved_ra);
        level++;
    }
}

void f4() { stacktrace(); }
void f3() { f4(); }
void f2() { f3(); }
void f1() { f2(); }

int
main()
{
    f1();
    return 0;
}

$ gcc -Wall -o stacktrace stacktrace.c
$ ./stacktrace
#0 f4() in /private/tmp/./stacktrace, fp = 0xbffff850, pc = 0x2a3c
#1 f3() in /private/tmp/./stacktrace, fp = 0xbffff8a0, pc = 0x2a68
#2 f2() in /private/tmp/./stacktrace, fp = 0xbffff8f0, pc = 0x2a94
#3 f1() in /private/tmp/./stacktrace, fp = 0xbffff940, pc = 0x2ac0
#4 main() in /private/tmp/./stacktrace, fp = 0xbffff990, pc = 0x2aec
#5 tart() in /private/tmp/./stacktrace, fp = 0xbffff9e0, pc = 0x20c8
#6 tart() in /private/tmp/./stacktrace, fp = 0xbffffa40, pc = 0x1f6c

^[54] Note in the program's output that the function name in frames #5 and #6 is tart. The dladdr() function strips leading underscores from the symbols it returns, even if there is no leading underscore (in which case it removes the first character). In this case, the symbol's name is start.

3.4.4. Function Parameters and Return Values

We saw earlier that when a function calls another with arguments, the parameter area in the caller's stack frame is large enough to hold all parameters passed to the called function, regardless of the number of parameters actually passed in registers. Doing so has benefits such as the following.

The called function might want to call further functions that take arguments or might want to use registers containing its arguments for other purposes. Having a dedicated parameter area allows the callee to store an argument from a register to the argument's "home location" on the stack, thus freeing up a register.
It may be useful to have all arguments in the parameter area for debugging purposes.
If a function has a variable-length parameter list, it will typically access its arguments from memory.

3.4.4.1. Passing Parameters

Parameter-passing rules may depend on the type of programming language used, for example, procedural or object-oriented. Let us look at parameter-passing rules for C and C-like languages. Even for such languages, the rules further depend on whether a function has a fixed-length or a variable-length parameter list. The rules for fixed-length parameter lists are as follows.

The first eight parameter words (i.e., the first 32 bytes, not necessarily the first eight arguments) are passed in GPR3 through GPR10, unless a floating-point parameter appears.
Floating-point parameters are passed in FPR1 through FPR13.
If a floating-point parameter appears, but GPRs are still available, then the parameter is placed in an FPR, as expected. However, the next available GPRs that together sum up to the floating-point parameter's size are skipped and not considered for allocation. Therefore, a single-precision floating-point parameter (4 bytes) causes the next available GPR (4 bytes) to be skipped. A double-precision floating-point parameter (8 bytes) causes the next two available GPRs (8 bytes total) to be skipped.
If not all parameters can fit within the available registers in accordance with the skipping rules, the caller passes the excess parameters by storing them in the parameter area of its stack frame.
Vector parameters are passed in VR2 through VR13.
Unlike floating-point parameters, vector parameters do not cause GPRs (or FPRs, for that matter) to be skipped.
Unless there are more vector parameters than can fit in available vector registers, no space is allocated for vector parameters in the caller's stack frame. Only when the registers are exhausted does the caller reserve any vector parameter space.
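The fixed-length rules for word-size and floating-point parameters amount to a small allocation algorithm. The following is a simplified sketch with hypothetical names; it ignores vectors, structures, 64-bit integers, and variable-length lists. It tracks a running parameter-area word offset: GPR3 through GPR10 correspond to words 0 through 7, so a float or double skips GPRs simply by advancing that offset.

```c
/* Kinds of parameters handled by this sketch (fixed-length lists only). */
enum ptype { P_WORD, P_FLOAT, P_DOUBLE };   /* int/pointer, float, double */

/* Where a parameter ends up: a GPR, an FPR, or the caller's parameter area. */
struct slot {
    char where;   /* 'g' = GPR, 'f' = FPR, 'm' = memory (parameter area) */
    int  index;   /* register number, or byte offset within the area */
};

/* Word-size values use GPR3..GPR10 in order; floating-point values use
 * FPR1..FPR13 and cause the GPRs covering their size to be skipped. */
void assign_params(const enum ptype *types, int n, struct slot *out)
{
    int word = 0;      /* next parameter-area word (0-based) */
    int next_fpr = 1;  /* next free FPR */

    for (int i = 0; i < n; i++) {
        int words = (types[i] == P_DOUBLE) ? 2 : 1;

        if (types[i] == P_WORD) {
            if (word < 8) {
                out[i].where = 'g';
                out[i].index = 3 + word;
            } else {
                out[i].where = 'm';
                out[i].index = word * 4;
            }
        } else if (next_fpr <= 13) {
            out[i].where = 'f';
            out[i].index = next_fpr++;
        } else {
            out[i].where = 'm';
            out[i].index = word * 4;
        }
        word += words;   /* also skips the GPRs a float/double covers */
    }
}
```

For example, f(int a, double b, int c) places a in GPR3, b in FPR1 (skipping GPR4 and GPR5), and c in GPR6.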

Let us look at the case of functions with variable-length parameter lists. Note that a function may have some number of required parameters preceding a variable number of parameters.

Parameters in the variable portion of the parameter list are passed in both GPRs and FPRs. Consequently, floating-point parameters are always shadowed in GPRs instead of causing GPRs to be skipped.
If there are vector parameters in the fixed portion of the parameter list, 16-byte-aligned space is reserved for such parameters in the caller's parameter area, even if there are available vector registers.
If there are vector parameters in the variable portion of the parameter list, such parameters are also shadowed in GPRs.
The called routine accesses arguments from the fixed portion of the parameter list similarly to the fixed-length parameter list case.
The called routine accesses arguments from the variable portion of the parameter list by copying GPRs to the callee's parameter area and accessing values from there.
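From the callee's side, a variable-length list is consumed with the standard <stdarg.h> macros; the register shadowing and copying described above happen underneath them, and the C source is the same on any platform. A minimal sketch with a hypothetical function name:

```c
#include <stdarg.h>

/* One required parameter followed by a variable part of doubles. */
double sum_doubles(int count, ...)
{
    va_list ap;
    double  total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);   /* float arguments arrive promoted to double */
    va_end(ap);
    return total;
}
```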

3.4.4.2. Returning Values

Functions return values according to the following rules.

Values less than one word (32 bits) in size are returned in the least significant byte(s) of GPR3, with the remaining byte(s) being undefined.
Values exactly one word in size are returned in GPR3.
64-bit fixed-point values are returned in GPR3 (the 4 high-order bytes) and GPR4 (the 4 low-order bytes).
Structures up to a word in size are returned in GPR3.
Single-precision floating-point values are returned in FPR1.
Double-precision floating-point values are returned in FPR1.
A 16-byte long double value is returned in FPR1 (the 8 high-order bytes) and FPR2 (the 8 low-order bytes).
A composite value (such as an array, a structure, or a union) that is more than one word in size is returned via an implicit pointer that the caller must pass. Such functions require the caller to pass a pointer to a memory location that is large enough to hold the return value. The pointer is passed as an "invisible" argument in GPR3. Actual user-visible arguments, if any, are passed in GPR4 onward.
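The hidden pointer can be made explicit. In this sketch (hypothetical names), make_triple is what the programmer writes, and make_triple_lowered is a hand-written equivalent of what the rules above say the compiler does for a structure larger than one word: the caller supplies the result buffer's address as the first argument (GPR3), and the visible arguments move to GPR4 onward. Other ABIs differ; x86-64, for example, returns a 12-byte structure like this one in registers.

```c
/* A three-word structure: larger than one word. */
struct triple { int a, b, c; };

/* What the programmer writes. */
struct triple make_triple(int a, int b, int c)
{
    struct triple t = { a, b, c };
    return t;
}

/* Hand-written equivalent of the ABI's lowering: the caller passes the
 * address of its result buffer as an extra, invisible first argument. */
void make_triple_lowered(struct triple *ret, int a, int b, int c)
{
    ret->a = a;
    ret->b = b;
    ret->c = c;
}
```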