5.4 DOTLOOP: Using a Counted Loop | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

In the previous chapter, we presented the DOTPROD program that computed the dot product of two 3-component vectors without using a loop. A similar program, but with greater generality, would compute the dot product of two N-component vectors. Such a program, with the dimensionality N as a symbolic parameter at the top of the listing, is presented in Figure 5-1.

This more general program uses additional registers: r17 for the dimensionality and loop control, r14 as an address pointer for vector V, r15 as an address pointer for vector W, and r16 as an address pointer for the product P. While there is a little more overhead between the labels first and top to get everything set up, the heart of the algorithm is simplified because the multiply, sign-extend, and add sequence occurs only once (inside the loop). Notice that the address pointers for V and W must each be incremented by two bytes in order to advance to the next 2-byte word values each time through the loop, while the copy of the dimensionality (i.e., the number of components) must be decremented by one (by adding -1 to r17).

Figure 5-1 DOTLOOP: An illustration of a simple down-counted loop

// DOTLOOP       Scalar Product of N-vectors
// This program will compute the scalar product
// of two multielement vectors V and W.
         N       = 3              // N = dimensionality
         .data                    // Declare storage
         .align  8                // Desired alignment
P:       .skip   8                // Space for product
V:       data2   -1,+3,+5         // V1, V2, V3, etc.
W:       data2   -2,-4,+6         // W1, W2, W3, etc.
         .text                    // Section for code
         .align  32               // Desired alignment
         .global main             // These three lines
         .proc   main             //  mark the mandatory
main:                             //   'main' program entry
         .body                    // Now we really begin...
first:   movl     r14 = V;;       // Pointer for V
         movl     r15 = W;;       // Pointer for W
         movl     r16 = P;;       // Pointer for P
         mov      r17 = N;;       // Number of V components
         mov      r20 = 0;;       // R20 = running sum
top:     ld2      r21 = [r14],2;; // Get Vi; bump pointer
         ld2      r22 = [r15],2;; // Get Wi; bump pointer
         pmpy2.r  r21 = r21,r22;; // Compute Vi times Wi
         sxt4     r21 = r21;;     // Extend 32 bits to 64
         add      r20 = r20,r21;; // Update the sum
         add      r17 = -1,r17;;  // Decrement loop count
         cmp.gt   p6,p0 = r17,r0  // More to do?
   (p6)  br.cond.sptk.few top;;   // Yes
         st8      [r16] = r20;;   // No, store the product
done:    mov      r8 = 0;;        // Signal all is normal
         br.ret.sptk.many b0      // Back to command line
         .endp    main            // Mark end of procedure

We can run this program using the debugger, with a breakpoint set at done. Examining the memory location P should reveal the correctly computed value of 20 (0x14).
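The expected value can be checked by hand from the data2 declarations for V and W in Figure 5-1:

```latex
\[
P = \sum_{i=1}^{3} V_i W_i
  = (-1)(-2) + (3)(-4) + (5)(6)
  = 2 - 12 + 30
  = 20 = 14_{16}
\]
```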