Earlier we developed Booth's algorithm for integer multiplication (Section 6.5.1) and implemented it in two versions of a program for performing number conversions, DECNUM (Figure 6-5) and DECNUM2 (Figure 7-3). Those were monolithic programs. Having now demonstrated program segmentation, we are prepared to split the DECNUM2 program into two parts: Booth's algorithm, as callable function BOOTH, and an outer shell, DECNUM3. 7.6.1 Defining the InterfaceFunctions are usually specified in two parts: the public interface and the body of code, which is encapsulated or hidden from view. The writer of the function determines in the public interface what the inputs, outputs, and argument-passing methods must be. The operation of multiplication forms a product from a multiplicand (first factor) and a multiplier (second factor). Thus, in terms of the calling conventions (Section 7.5.3), the caller should put the multiplicand onto the register stack as out0 and the multiplier as out1; these will become accessible to the called function as in0 and in1, respectively. An integer function should return its result in register r8 (ret0). More specific to the task at hand, as noted in Appendix D.2, Itanium architecture allows up to four general registers to be used to return integer values. Recall that Booth's algorithm can compute a full, signed 128-bit product. Therefore we can design the function for signed integer multiplication to return the high-order segment of the product in register r9 (ret1). A calling routine written at the register programming level could then obtain either or both segments of the full product. For unsigned integer multiplication, the calling routine merely needs to preserve the multiplicand (first factor) and multiplier (second factor). If and only if the multiplier was negative, the multiplicand is added onto the low segment returned by the function in order to obtain a corrected 64-bit unsigned product (Section 6.5.2). 7.6.2 BOOTH: The Callable FunctionSince the majority of the general registers used in DECNUM2 are classified as scratch registers (see Appendix D.2), they can be freely used by any leaf procedure. This allows the extraction of the central portion of DECNUM2 into a function named BOOTH (Figure 7-7), requiring only minimal editing to allow for signed integer multiplication. Figure 7-7 Function that implements Booth's algorithm// BOOTH 64 x 64 --> 128 bit multiplication // This function multiplies signed in0 times signed in1, // putting the low segment of the full signed product // into ret0 (r8) and the high segment into ret1 (r9). W = 64 // W = number width .text // Section for code .align 32 // Desired alignment .global booth // These three lines .proc booth // mark the mandatory booth: // 'booth' function entry .prologue // Leaf procedure .regstk 2,0,0,0 // simply claims 2 ins .save ar.lc,r31 // Save caller's ar.lc mov r31 = ar.lc // in scratch register .body // Now we really begin... first: add r2 = W-1,r0;; // Number of traversals mov ar.lc = r2 // minus one mov r19 = 0 // Set bit n-1 to zero mov ret0 = in1 // Set R to multiplier mov ret1 = 0;; // Set L to zero cycle: and r22 = 0x1,ret0;; // Isolate lowest bit of R xor r23 = r19,r22;; // r23 = whether to act cmp.ne p6,p0 = 0,r23 // p6 = whether to act mov r19 = r22;; // Bit n-1 for next time (p6) cmp.eq.unc p7,p8 = 0,r22;; // Add, subtract, nop? (p7) add ret1 = ret1,in0 // Add X to L (p8) sub ret1 = ret1,in0;; // Subtract X from L shrp ret0 = ret1,ret0,1 // New R of shifted LR shr ret1 = ret1,1 // New L of shifted LR br.cloop.sptk.few cycle;; // More cycles? done: // Full product in r9,r8 mov ar.lc = r31 // Restore caller's ar.lc br.ret.sptk.many b0;; // Back to the caller .endp booth // Mark end of procedure The BOOTH function uses a .regstk directive to identify its caller's outs as its own ins, but does not need a new stack frame. Since scratch registers are plentiful, compared to the needs of this function, it preserves the caller's ar.lc in register r31. The body of the BOOTH function very closely resembles the corresponding central section of the DECNUM2 program. The two inputs are the multiplicand contained in in0 (r32) and the multiplier (the 0.8 factor) contained in in1 (r33), for which registers r20 and r21, respectively, had been used in the DECNUM2 program. Registers r8 and r9 in the previous program are replaced by the synonyms ret0 and ret1; a function can efficiently compute its results directly into the return argument registers, eliminating a transfer of the values to be returned. Finally, we point out that the BOOTH function illustrates pure register register programming: It has no data section, and it avoids load and store operations with memory. Purely register-oriented programming allows for the fastest possible code in any architecture. 7.6.3 DECNUM3: The Test ProgramThe DECNUM2 program (Figure 7-3) uses many scratch general registers, the contents of which will be undefined upon return from a called function. Some changes are thus required if we are to modify the outer portions of DECNUM2 into a calling program named DECNUM3 (Figure 7-8), which calls our newly created BOOTH function. Figure 7-8 DECNUM3 program that tests the BOOTH function// DECNUM3 Convert integer to ASCII // This program converts a positive number from X3 // into a string of ASCII-encoded decimal digits at A3 // using the BOOTH function for its divide loop LEN = 20 // Allowance for 20 digits DOT8 = 0xcccccccccccccccd // 0.8 (approx.) .global booth // External reference .data // Declare storage .align 8 // Desired alignment X3: data8 0x123456789 // Number to convert A3: .skip 80 // ASCII output area STACK: .skip LEN // User-defined stack .text // Section for code .align 32 // Desired alignment .global main // These three lines .proc main // mark the mandatory main: // 'main' program entry .prologue 12,r32 // Mask for rp, ar.pfs only alloc loc1 = ar.pfs,0,6,2,0 // 6 locals, 2 outs .save rp,loc0 // Must save return address mov loc0 = b0 // to our caller .body // Now we really begin... first: add loc2 = @gprel(STACK),gp // loc2 -> stack movl loc3 = DOT8 // 0.8 reciprocal factor mov loc5 = gp // Save gp new: add r15 = @gprel(X3),gp;; // r15 -> input ld8 r9 = [r15] // Get number to convert st1 [loc2] = r0,1;; // Push 0 as a flag again: mov loc4 = r9 // Save previous quotient mov out0 = r9 // It is the multiplicand mov out1 = loc3 // 0.8 is the multiplier br.call.sptk.many b0=booth // r9,r8 = out0 * out1 mov gp = loc5 // Restore gp nosign: add r9 = r9,loc4;; // Add X to L shr.u r9 = r9,3;; // r9 = quotient = loc4/10 shladd r3 = r9,2,r9;; // r3 = 5*quotient add r3 = r3,r3;; // r3 = 10*quotient now sub r3 = loc4,r3;; // r3 = remainder now or r3 = 0x30,r3;; // Make into ASCII char st1 [loc2] = r3,1 // Store the character cmp.ne p6,p0 = r9,r0 // If quotient not zero, (p6) br.cond.sptk.few again;; // repeat the cycle add loc2 = -2,loc2 // Skip latest char add r16 = @gprel(A3),gp;; // r16 -> output move: st1 [r16] = r3,1 // Move one character ld1 r3 = [loc2],-1;; // Get another character cmp.ne p6,p0 = r3,r0 // If it is not 0 flag (p6) br.cond.sptk.few move // then move it to A3 area st1 [r16] = r0 // Make A3 into stringz add loc2 = 1,loc2 // Fix stack pointer done: mov r8 = 0 // Signal all is normal mov b0 = loc0 // Restore return address mov ar.pfs = loc1 // Restore caller's ar.pfs br.ret.sptk.many b0;; // Back to command line .endp main // Mark end of procedure At first it may seem surprising that the DECNUM3 program is as lengthy as the previous DECNUM2 program, despite having off-loaded the central portion of code to the BOOTH function. This is a direct effect of an environment where the majority of registers are classified as scratch. Calling procedures such as DECNUM3 must perform most of the work of saving and restoring essential data and pointers, while leaf procedures such as BOOTH benefit by being freed of those details. When analyzing the impact of procedure calls, one must carefully enumerate any items, including predicate registers, to be preserved, and then devise a strategy that efficiently performs those saves and restores. Conventions and the requirements of the programming environment further restrict allowable strategies for example, the global pointer should be saved once and then restored after each call. Quantities such as persistent constants can be held on the register stack. The DECNUM3 program uses an alloc instruction to declare its need for six locals and two outs on the register stack. Several register assignments differ between this program and its predecessors: DECNUM2 DECNUM3 Purpose n/a loc0 (r32) preserve rp n/a loc1 (r33) preserve ar.pfs r11 loc2 (r34) user stack pointer r21 loc3 (r35) 0.8 constant factor r20 loc4 (r36) old quotient for remainder calculation n/a loc5 (r37) preserve gp (r1) r9 out0 (r38) multiplicand (dividend in DECNUM3) r8 out1 (r39) multiplier (0.8 factor) It is important to appreciate that DECNUM3 must not depend on the value in any scratch register (see Appendix D) following the return from the function call. On the other hand, it is quite permissible to use a scratch register before and after the call for the same or a related purpose, such as using register r9 to hold a quotient that becomes the dividend for the next traversal of the principal loop in this program. DECNUM3 and BOOTH comprise the parts of an entire program that the compiler driver and linker must produce: L> gcc -Wall -O0 -o bin/decnum3 decnum3.s booth.s L> gdb bin/decnum3 [messages deleted here] (gdb) break done Breakpoint 1 at 0x4000000000000620 (gdb) run Starting program: /house/user/bin/decnum3 Breakpoint 1, 0x4000000000000620 in done () (gdb) x/s &A3 0x6000000000000898 <A3>: "4886718345" (gdb) q The program is running. Exit anyway? (y or n) y L> The expected result is the decimal value 4,886,718,345, corresponding to 0x123456789. This value exceeds the span of a 32-bit unsigned integer (Table 2-1) and thus lends credence to this implementation of Booth's algorithm for 64-bit quantities. 7.6.4 Position-Independent CodeOur sample programs have used two different methods to address data values. Examples of both methods occur near the top of the DECNUM program (Figure 6-5). In one method, an add pseudo-instruction, which becomes an addl machine instruction, forms an address pointer using an immediate value as an offset from the global pointer gp; the assembler syntax for the immediate value is @gprel(symbol). In the other method, a movl instruction encodes the actual full 64-bit address of symbol in the instruction itself. These two methods are not equivalent, though each has its uses. While total program size has ceased to be of any concern, cache size is of crucial importance for programs that should achieve maximum performance. The movl instruction is restricted to occupy the last two instruction slots in an instruction bundle. The addl instruction requires only one slot and is much less restricted in position within an instruction bundle; moreover, either M- or I-type execution units can perform it. Therefore, addl seems preferable to movl, because an Itanium 2 processor can execute up to six addl instructions simultaneously, in contrast to only one movl instruction. More significantly, the address embedded in a movl instruction reflects static binding at link time. In DECNUM, the movl instruction refers to a fixed virtual address selected by the linker for the byte address ZERO-1. That is all right for an essentially internal reference in a program to its own data. Consider a function with internal data of its own, which is to be incorporated into a shared library that the operating system will load dynamically when processes need it. Modern operating systems implement dynamic binding to handle such needs. If a function in a shared library has a valid global pointer to its own data section whenever it is called, then it can be coded using only gp-relative addresses. Such a function is said to be composed of position-independent code because its text and data segments can be loaded at any address. All of this does beg the question of how to call shared functions that are dynamically loaded by the operating system. If a function is not located at the same address, we cannot use an IP-relative fixed offset encoded in a br.call instruction. The assembler or compiler may accommodate this need by constructing a "stub procedure" that exists solely to call the actual function. Besides resolving the entry address, such methods also need to provide a called routine with its appropriate global pointer value. The advantage is that we ourselves do not need to know how to do that. The disadvantage is that each level of indirection has overhead costs that are not apparent when we look at code for either a caller or a called procedure. |