7.6 DECNUM3 and BOOTH: Making a Function | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

Earlier we developed Booth's algorithm for integer multiplication (Section 6.5.1) and implemented it in two versions of a program for performing number conversions, DECNUM (Figure 6-5) and DECNUM2 (Figure 7-3). Those were monolithic programs. Having now demonstrated program segmentation, we are prepared to split the DECNUM2 program into two parts: Booth's algorithm, as callable function BOOTH, and an outer shell, DECNUM3.

7.6.1 Defining the Interface

Functions are usually specified in two parts: the public interface and the body of code, which is encapsulated or hidden from view. The writer of the function determines in the public interface what the inputs, outputs, and argument-passing methods must be.

The operation of multiplication forms a product from a multiplicand (first factor) and a multiplier (second factor). Thus, in terms of the calling conventions (Section 7.5.3), the caller should put the multiplicand onto the register stack as out0 and the multiplier as out1; these will become accessible to the called function as in0 and in1, respectively. An integer function should return its result in register r8 (ret0).

More specific to the task at hand, as noted in Appendix D.2, Itanium architecture allows up to four general registers to be used to return integer values. Recall that Booth's algorithm can compute a full, signed 128-bit product. Therefore we can design the function for signed integer multiplication to return the high-order segment of the product in register r9 (ret1). A calling routine written at the register programming level could then obtain either or both segments of the full product.

For unsigned integer multiplication, the calling routine merely needs to preserve the multiplicand (first factor) and multiplier (second factor). If and only if the multiplier was negative, the multiplicand is added onto the low segment returned by the function in order to obtain a corrected 64-bit unsigned product (Section 6.5.2).

7.6.2 BOOTH: The Callable Function

Since the majority of the general registers used in DECNUM2 are classified as scratch registers (see Appendix D.2), they can be freely used by any leaf procedure. This allows the extraction of the central portion of DECNUM2 into a function named BOOTH (Figure 7-7), requiring only minimal editing to allow for signed integer multiplication.

Figure 7-7 Function that implements Booth's algorithm

 // BOOTH         64 x 64 --> 128 bit multiplication // This function multiplies signed in0 times signed in1, // putting the low segment of the full signed product // into ret0 (r8) and the high segment into ret1 (r9).         W        = 64             // W = number width         .text                     // Section for code         .align  32                // Desired alignment         .global booth             // These three lines         .proc   booth             //  mark the mandatory booth:                            //   'booth' function entry         .prologue                 // Leaf procedure         .regstk 2,0,0,0           //  simply claims 2 ins         .save   ar.lc,r31         //  Save caller's ar.lc         mov     r31 = ar.lc       //   in scratch register         .body                     // Now we really begin... first:  add     r2 = W-1,r0;;     // Number of traversals         mov     ar.lc = r2        //  minus one         mov     r19 = 0           // Set bit n-1 to zero         mov     ret0 = in1        // Set R to multiplier         mov     ret1 = 0;;        // Set L to zero cycle:  and     r22 = 0x1,ret0;;  // Isolate lowest bit of R         xor     r23 = r19,r22;;   // r23 = whether to act         cmp.ne  p6,p0 = 0,r23     // p6 = whether to act         mov     r19 = r22;;       // Bit n-1 for next time    (p6) cmp.eq.unc p7,p8 = 0,r22;; // Add, subtract, nop?    (p7) add     ret1 = ret1,in0   // Add X to L    (p8) sub     ret1 = ret1,in0;;  // Subtract X from L          shrp   ret0 = ret1,ret0,1  // New R of shifted LR          shr     ret1 = ret1,1    // New L of shifted LR          br.cloop.sptk.few cycle;; // More cycles? done:                             // Full product in r9,r8          mov     ar.lc = r31      // Restore caller's ar.lc          br.ret.sptk.many b0;;    // Back to the caller          .endp   booth            // Mark end of procedure

The BOOTH function uses a .regstk directive to identify its caller's outs as its own ins, but does not need a new stack frame. Since scratch registers are plentiful, compared to the needs of this function, it preserves the caller's ar.lc in register r31.

The body of the BOOTH function very closely resembles the corresponding central section of the DECNUM2 program. The two inputs are the multiplicand contained in in0 (r32) and the multiplier (the 0.8 factor) contained in in1 (r33), for which registers r20 and r21, respectively, had been used in the DECNUM2 program. Registers r8 and r9 in the previous program are replaced by the synonyms ret0 and ret1; a function can efficiently compute its results directly into the return argument registers, eliminating a transfer of the values to be returned.

Finally, we point out that the BOOTH function illustrates pure register register programming: It has no data section, and it avoids load and store operations with memory. Purely register-oriented programming allows for the fastest possible code in any architecture.

7.6.3 DECNUM3: The Test Program

The DECNUM2 program (Figure 7-3) uses many scratch general registers, the contents of which will be undefined upon return from a called function. Some changes are thus required if we are to modify the outer portions of DECNUM2 into a calling program named DECNUM3 (Figure 7-8), which calls our newly created BOOTH function.

Figure 7-8 DECNUM3 program that tests the BOOTH function

 // DECNUM3       Convert integer to ASCII // This program converts a positive number from X3 // into a string of ASCII-encoded decimal digits at A3 // using the BOOTH function for its divide loop         LEN      = 20             // Allowance for 20 digits         DOT8     = 0xcccccccccccccccd  // 0.8 (approx.)         .global  booth            // External reference         .data                     // Declare storage         .align   8                // Desired alignment X3:     data8    0x123456789      // Number to convert A3:     .skip    80               // ASCII output area STACK:  .skip    LEN              // User-defined stack         .text                     // Section for code         .align   32               // Desired alignment         .global  main             // These three lines         .proc    main             //  mark the mandatory main:                             //   'main' program entry         .prologue 12,r32          // Mask for rp, ar.pfs only         alloc   loc1 = ar.pfs,0,6,2,0  // 6 locals, 2 outs         .save    rp,loc0          // Must save return address         mov      loc0 = b0        //  to our caller         .body                     // Now we really begin... first:  add      loc2 = @gprel(STACK),gp  // loc2 -> stack         movl     loc3 = DOT8      // 0.8 reciprocal factor         mov      loc5 = gp        // Save gp new:    add      r15 = @gprel(X3),gp;;    // r15 -> input         ld8      r9 = [r15]       // Get number to convert         st1      [loc2] = r0,1;;  // Push 0 as a flag again:  mov      loc4 = r9        // Save previous quotient         mov      out0 = r9        // It is the multiplicand         mov      out1 = loc3      // 0.8 is the multiplier         br.call.sptk.many b0=booth   // r9,r8 = out0 * out1         mov      gp = loc5        // Restore gp nosign: add      r9 = r9,loc4;;   // Add X to L         shr.u    r9 = r9,3;;      // r9 = quotient = loc4/10         shladd   r3 = r9,2,r9;;   // r3 = 5*quotient         add      r3 = r3,r3;;     // r3 = 10*quotient now         sub      r3 = loc4,r3;;   // r3 = remainder now         or       r3 = 0x30,r3;;   // Make into ASCII char         st1      [loc2] = r3,1    // Store the character         cmp.ne   p6,p0 = r9,r0    // If quotient not zero,    (p6) br.cond.sptk.few again;;  //  repeat the cycle         add      loc2 = -2,loc2   // Skip latest char         add      r16 = @gprel(A3),gp;;    // r16 -> output move:   st1      [r16] = r3,1     // Move one character         ld1      r3 = [loc2],-1;;  // Get another character         cmp.ne   p6,p0 = r3,r0    // If it is not 0 flag    (p6) br.cond.sptk.few move     //  then move it to A3 area         st1      [r16] = r0       // Make A3 into stringz         add      loc2 = 1,loc2    // Fix stack pointer done:   mov      r8 = 0           // Signal all is normal         mov      b0 = loc0        // Restore return address         mov      ar.pfs = loc1    // Restore caller's ar.pfs         br.ret.sptk.many b0;;     // Back to command line         .endp    main             // Mark end of procedure

At first it may seem surprising that the DECNUM3 program is as lengthy as the previous DECNUM2 program, despite having off-loaded the central portion of code to the BOOTH function. This is a direct effect of an environment where the majority of registers are classified as scratch. Calling procedures such as DECNUM3 must perform most of the work of saving and restoring essential data and pointers, while leaf procedures such as BOOTH benefit by being freed of those details.

When analyzing the impact of procedure calls, one must carefully enumerate any items, including predicate registers, to be preserved, and then devise a strategy that efficiently performs those saves and restores. Conventions and the requirements of the programming environment further restrict allowable strategies for example, the global pointer should be saved once and then restored after each call.

Quantities such as persistent constants can be held on the register stack. The DECNUM3 program uses an alloc instruction to declare its need for six locals and two outs on the register stack. Several register assignments differ between this program and its predecessors:

 DECNUM2   DECNUM3       Purpose n/a       loc0 (r32)    preserve rp n/a       loc1 (r33)    preserve ar.pfs r11       loc2 (r34)    user stack pointer r21       loc3 (r35)    0.8 constant factor r20       loc4 (r36)    old quotient for remainder calculation n/a       loc5 (r37)    preserve gp (r1) r9        out0 (r38)    multiplicand (dividend in DECNUM3) r8        out1 (r39)    multiplier (0.8 factor)

It is important to appreciate that DECNUM3 must not depend on the value in any scratch register (see Appendix D) following the return from the function call. On the other hand, it is quite permissible to use a scratch register before and after the call for the same or a related purpose, such as using register r9 to hold a quotient that becomes the dividend for the next traversal of the principal loop in this program.

DECNUM3 and BOOTH comprise the parts of an entire program that the compiler driver and linker must produce:

 L> gcc -Wall -O0 -o bin/decnum3 decnum3.s booth.s L> gdb bin/decnum3 [messages deleted here] (gdb) break done Breakpoint 1 at 0x4000000000000620 (gdb) run Starting program: /house/user/bin/decnum3 Breakpoint 1, 0x4000000000000620 in done () (gdb) x/s &A3 0x6000000000000898 <A3>:         "4886718345" (gdb) q The program is running.  Exit anyway? (y or n) y L>

The expected result is the decimal value 4,886,718,345, corresponding to 0x123456789. This value exceeds the span of a 32-bit unsigned integer (Table 2-1) and thus lends credence to this implementation of Booth's algorithm for 64-bit quantities.

7.6.4 Position-Independent Code

Our sample programs have used two different methods to address data values. Examples of both methods occur near the top of the DECNUM program (Figure 6-5). In one method, an add pseudo-instruction, which becomes an addl machine instruction, forms an address pointer using an immediate value as an offset from the global pointer gp; the assembler syntax for the immediate value is @gprel(symbol). In the other method, a movl instruction encodes the actual full 64-bit address of symbol in the instruction itself.

These two methods are not equivalent, though each has its uses. While total program size has ceased to be of any concern, cache size is of crucial importance for programs that should achieve maximum performance. The movl instruction is restricted to occupy the last two instruction slots in an instruction bundle. The addl instruction requires only one slot and is much less restricted in position within an instruction bundle; moreover, either M- or I-type execution units can perform it. Therefore, addl seems preferable to movl, because an Itanium 2 processor can execute up to six addl instructions simultaneously, in contrast to only one movl instruction.

More significantly, the address embedded in a movl instruction reflects static binding at link time. In DECNUM, the movl instruction refers to a fixed virtual address selected by the linker for the byte address ZERO-1. That is all right for an essentially internal reference in a program to its own data.

Consider a function with internal data of its own, which is to be incorporated into a shared library that the operating system will load dynamically when processes need it. Modern operating systems implement dynamic binding to handle such needs. If a function in a shared library has a valid global pointer to its own data section whenever it is called, then it can be coded using only gp-relative addresses. Such a function is said to be composed of position-independent code because its text and data segments can be loaded at any address.

All of this does beg the question of how to call shared functions that are dynamically loaded by the operating system. If a function is not located at the same address, we cannot use an IP-relative fixed offset encoded in a br.call instruction. The assembler or compiler may accommodate this need by constructing a "stub procedure" that exists solely to call the actual function. Besides resolving the entry address, such methods also need to provide a called routine with its appropriate global pointer value. The advantage is that we ourselves do not need to know how to do that. The disadvantage is that each level of indirection has overhead costs that are not apparent when we look at code for either a caller or a called procedure.