In the previous chapter, we presented the DOTPROD program, which computed the dot product of two 3-component vectors without using a loop. A similar but more general program would compute the dot product of two N-component vectors. Such a program, with the dimensionality N as a symbolic parameter at the top of the listing, is presented in Figure 5-1. This more general program uses additional registers: r17 for the dimensionality and loop control, r14 as an address pointer for vector V, r15 as an address pointer for vector W, and r16 as an address pointer for the product P. While there is a little more overhead between first and top to get everything set up, the heart of the algorithm is simplified because the multiply, sign-extend, and add sequence occurs only once (inside the loop). Notice that both vector address pointers must be incremented by two (the size in bytes of a data2 value) in order to advance to the next word values each time through the loop, while the copy of the dimensionality (i.e., the number of components) must be decremented by one (by adding -1 to r17).

Figure 5-1 DOTLOOP: An illustration of a simple down-counted loop

// DOTLOOP      Scalar Product of N-vectors

// This program will compute the scalar product
// of two multielement vectors V and W.

        N       = 3                     // N = dimensionality
        .data                           // Declare storage
        .align  8                       // Desired alignment
P:      .skip   8                       // Space for product
V:      data2   -1,+3,+5                // V1, V2, V3, etc.
W:      data2   -2,-4,+6                // W1, W2, W3, etc.

        .text                           // Section for code
        .align  32                      // Desired alignment
        .global main                    // These three lines
        .proc   main                    //  mark the mandatory
main:                                   //   'main' program entry
        .body                           // Now we really begin...

first:  movl    r14 = V;;               // Pointer for V
        movl    r15 = W;;               // Pointer for W
        movl    r16 = P;;               // Pointer for P
        mov     r17 = N;;               // Number of V components
        mov     r20 = 0;;               // r20 = running sum
top:    ld2     r21 = [r14],2;;         // Get Vi; bump pointer
        ld2     r22 = [r15],2;;         // Get Wi; bump pointer
        pmpy2.r r21 = r21,r22;;         // Compute Vi times Wi
        sxt4    r21 = r21;;             // Extend 32 bits to 64
        add     r20 = r20,r21;;         // Update the sum
        add     r17 = -1,r17;;          // Decrement loop count
        cmp.gt  p6,p0 = r17,r0          // More to do?
   (p6) br.cond.sptk.few top;;          //  Yes
        st8     [r16] = r20;;           //  No, store the product
done:   mov     r8 = 0;;                // Signal all is normal
        br.ret.sptk.many b0             // Back to command line
        .endp   main                    // Mark end of procedure

We can run this program using the debugger, with a breakpoint set at done. Examining the memory location P should reveal the correctly computed value of 20 (0x14), since (-1)(-2) + (3)(-4) + (5)(6) = 2 - 12 + 30 = 20.
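For readers who find it easier to check the control flow in a high-level language, the loop can be modeled roughly as follows. This is only a sketch in C, not part of the original program: the function name dotloop and its parameters are hypothetical, and the mapping of each statement to an instruction is given in the comments. It assumes, as the assembly listing does, that the vectors have at least one component, because the test comes at the bottom of the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the DOTLOOP loop; names are illustrative only. */
int64_t dotloop(const int16_t *v, const int16_t *w, int64_t n)
{
    /* The assembly loop tests at the bottom, so it always runs once. */
    assert(n > 0);

    int64_t sum = 0;                    /* r20: running sum              */
    do {
        int32_t vi = *v++;              /* ld2 r21 = [r14],2             */
        int32_t wi = *w++;              /* ld2 r22 = [r15],2             */
        int32_t prod = vi * wi;         /* pmpy2.r: 16 x 16 -> 32 bits   */
        sum += (int64_t)prod;           /* sxt4, then add to r20         */
        n = n - 1;                      /* add r17 = -1,r17              */
    } while (n > 0);                    /* cmp.gt p6 / (p6) br.cond top  */

    return sum;                         /* value stored to P by st8      */
}
```

Called with the data from Figure 5-1, that is V = {-1, 3, 5}, W = {-2, -4, 6}, and n = 3, this model returns 20, matching the value expected at P. Advancing an int16_t pointer by one element corresponds to the post-increment of 2 in the ld2 instructions.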