4.7 DOTPROD: Using Data Access Instructions | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

We shall now illustrate the very common operation of referring to successive entries in a list using vector components. Three-component vectors occur frequently in physics and engineering problems. In vector algebra, the scalar product of two vectors (also called the inner product, or the dot product) is the sum of products of corresponding components:

P = V · W = (v_x × w_x) + (v_y × w_y) + (v_z × w_z)

It makes sense to store the x-, y-, and z-components of each vector in adjacent information units. We will select word-length (2-byte) storage for the components of two vectors, V and W, in our sample program (Figure 4-5), and the resulting scalar product, P, will be stored in a quad word (8 bytes).
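As a point of comparison, the same storage layout and calculation can be sketched in C. This is only an illustrative reference model, not part of the original program: the function name dot3 and the choice of int16_t and int64_t to stand for a word and a quad word are our assumptions.

```c
#include <stdint.h>

/* Hypothetical C mirror of the DOTPROD data section: three word-length
   (16-bit) components per vector, stored in adjacent locations. */
static const int16_t V[3] = { -1, +3, +5 };   /* Vx, Vy, Vz */
static const int16_t W[3] = { -2, -4, +6 };   /* Wx, Wy, Wz */

/* Sum of products of corresponding components, accumulated in a
   64-bit (quad word) result. */
int64_t dot3(const int16_t v[3], const int16_t w[3])
{
    int64_t p = 0;
    for (int i = 0; i < 3; i++)
        p += (int64_t)v[i] * (int64_t)w[i];
    return p;
}
```

Calling dot3(V, W) with the sample data gives a value that can be compared with the quad word the assembly program stores at P.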

Figure 4-5 DOTPROD: An illustration of data access instructions

// DOTPROD       Scalar Product of 3-vectors

// This program will compute the scalar product
// of two three-element vectors V and W.

         .data                    // Declare storage
         .align  8                // Desired alignment
P:       .skip   8                // Space for product
V:       data2   -1,+3,+5         // Vx, Vy, Vz
W:       data2   -2,-4,+6         // Wx, Wy, Wz

         .text                    // Section for code
         .align  32               // Desired alignment
         .global main             // These three lines
         .proc   main             //  mark the mandatory
main:                             //   'main' program entry
         .body                    // Now we really begin...

first:   movl    r14 = V;;        // Pointer for V
         movl    r15 = W;;        // Pointer for W
         movl    r16 = P;;        // Pointer for P
         mov     r20 = 0;;        // R20 = running sum

         ld2     r21 = [r14],2;;  // Get Vx; bump pointer
         ld2     r22 = [r15],2;;  // Get Wx; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vx times Wx
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum

         ld2     r21 = [r14],2;;  // Get Vy; bump pointer
         ld2     r22 = [r15],2;;  // Get Wy; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vy times Wy
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum

         ld2     r21 = [r14],2;;  // Get Vz; bump pointer
         ld2     r22 = [r15],2;;  // Get Wz; bump pointer
         pmpy2.r r21 = r21,r22;;  // Compute Vz times Wz
         sxt4    r21 = r21;;      // Extend 32 bits to 64
         add     r20 = r20,r21;;  // Update the sum

         st8     [r16] = r20;;    // Store computed product
                                  // No more components...
done:    mov     r8 = 0;;         // Signal all is normal
         br.ret.sptk.many b0      // Back to command line
         .endp   main             // Mark end of procedure

Some computer architectures can map data structures using fixed offsets for the component values relative to a fixed base address for each vector, e.g., (V, V+8, V+16) for quad word components. The Itanium ISA, on the other hand, offers only register indirect addressing. We chose registers r14, r15, and r16 to point to V, W, and the result P, respectively.

Each component is expressed as a 2-byte word. We used ld2 instructions, which also zero-extend the loaded value in the destination register. We took advantage of postincrementing with the Itanium load and store instructions, since the x-, y-, and z-components of each vector are stored as successive words. (We did not remove the increment of 2 from the last pair of load instructions. If we were to write a more general scheme utilizing a loop to compute the dot product of two N-component vectors, it would not be convenient to isolate the last component as a special case.)
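The effect of a postincrementing ld2 can be modeled in C. This is a hedged sketch rather than a definitive description: the helper names ld2_postinc, as_signed16, and dot_n are hypothetical, and the model reads memory in the host's native byte order. It shows the zero-extension of the loaded word and the way the same load step serves every component, including the last, in an N-component loop.

```c
#include <stdint.h>
#include <string.h>

/* Model of "ld2 rd = [rp],2": read 2 bytes at *addr, zero-extend the
   value to 64 bits, then advance the pointer by 2 bytes. */
static uint64_t ld2_postinc(const unsigned char **addr)
{
    uint16_t word;
    memcpy(&word, *addr, sizeof word);
    *addr += 2;
    return (uint64_t)word;
}

/* Reinterpret the low 16 bits of a register value as a signed word. */
static int32_t as_signed16(uint64_t x)
{
    return (int32_t)(x & 0x7FFF) - (int32_t)(x & 0x8000);
}

/* N-component dot product: every iteration uses the same
   load-and-bump step, so the last component needs no special case. */
int64_t dot_n(const int16_t *v, const int16_t *w, int n)
{
    const unsigned char *pv = (const unsigned char *)v;
    const unsigned char *pw = (const unsigned char *)w;
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        uint64_t a = ld2_postinc(&pv);   /* like ld2 r21 = [r14],2 */
        uint64_t b = ld2_postinc(&pw);   /* like ld2 r22 = [r15],2 */
        sum += (int64_t)as_signed16(a) * as_signed16(b);
    }
    return sum;
}
```

In this model a stored component of -1 comes back from ld2_postinc as 0xFFFF, not as an all-ones 64-bit value, which is why the multiply step must treat the low 16 bits as a signed quantity.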

Each multiplication of two word-length components using the pmpy2.r instruction yields a product expressed as a double word in the destination register. We sign-extended that intermediate product to 64 bits using the sxt4 instruction so that negative products are added correctly to the 64-bit running sum.
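To see why the sign extension matters, the two instructions can be modeled in C. The sketch below reflects our reading of the instruction definitions (pmpy2.r multiplies the signed 16-bit elements in bits 15:0 and 47:32 of its sources and produces two 32-bit products); the C names pmpy2_r, sxt4, and sext16 are hypothetical helpers, not part of any library.

```c
#include <stdint.h>

/* Interpret the low 16 bits of x as a signed word. */
static int32_t sext16(uint64_t x)
{
    return (int32_t)(x & 0x7FFF) - (int32_t)(x & 0x8000);
}

/* Model of "pmpy2.r rd = ra,rb": two signed 16 x 16 multiplies on the
   elements at bits 15:0 and 47:32, giving two 32-bit products. */
static uint64_t pmpy2_r(uint64_t a, uint64_t b)
{
    int32_t lo = sext16(a) * sext16(b);
    int32_t hi = sext16(a >> 32) * sext16(b >> 32);
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}

/* Model of "sxt4 rd = ra": sign-extend the low 32 bits to 64 bits. */
static int64_t sxt4(uint64_t x)
{
    uint64_t low = x & 0xFFFFFFFFu;
    return (int64_t)(low & 0x7FFFFFFFu) - (int64_t)(low & 0x80000000u);
}
```

For Vy = +3 and Wy = -4, the zero-extended register values are 0x0003 and 0xFFFC. In this model pmpy2_r returns 0x00000000FFFFFFF4: the low double word holds -12, but the upper half is zero, so adding it directly to a 64-bit sum would add 4294967284. Applying sxt4 first yields -12.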

With this background, you should have little difficulty in following the flow of the entire calculation. Using the debugger on a Linux system, we could proceed as follows:

L> gcc -Wall -O0 -o bin/dotprod dotprod.s
L> gdb bin/dotprod
[messages deleted here]
(gdb) break done
Breakpoint 1 at 0x40000000000005e0
(gdb) run
Starting program: /home/user/bin/dotprod

Breakpoint 1, 0x40000000000005e0 in done ()
(gdb) x/g &P
0x6000000000000770 <P>: 0x0000000000000014
(gdb) q
The program is running.  Exit anyway? (y or n) y
L>

The correct answer is (-1 × -2) + (+3 × -4) + (+5 × +6) = (+2) + (-12) + (+30) = 20₁₀ = 14₁₆. Alternatively, you could monitor the contents of registers r20 and r21 as you step through the sequence of instructions to the label done. Be attentive to the two's complement arithmetic operations.
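If you do step through the program, it helps to know which two's complement values to expect in the running sum. The following C sketch (the function name trace_r20 is hypothetical, and the model is ours, not debugger output) records the value that r20 would hold after each add.

```c
#include <stdint.h>

/* Record the running sum (the role r20 plays in DOTPROD) after each
   of the three add steps, as a raw 64-bit register value. */
void trace_r20(const int16_t v[3], const int16_t w[3], uint64_t r20_after[3])
{
    uint64_t r20 = 0;
    for (int i = 0; i < 3; i++) {
        int64_t product = (int64_t)v[i] * (int64_t)w[i];
        r20 += (uint64_t)product;   /* wraps modulo 2^64, like a 64-bit add */
        r20_after[i] = r20;
    }
}
```

With the sample data, the recorded values are 0x0000000000000002, then 0xFFFFFFFFFFFFFFF6 (the two's complement form of -10), and finally 0x0000000000000014.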

Using a label such as done, where output instructions would be inserted, works just as well in the HP-UX command-line environment:

H> cc +DD64 -o bin/dotprod dotprod.s
H> adb bin/dotprod
adb> done:b
adb> :r
Process 9619 Thread 9728 Execed
Breakpoint 1 set at address 0x4000980
main + 0xc0:
>       adds             r8=0,r0
        nop.f            0
        nop.b            0;;
Hit Breakpoint 1 at address 0x4000980
adb> P/jx
P:                 0x14
adb> q
H>

where P is the symbolic address for the quad word result in memory. In later chapters, we shall usually demonstrate the sample programs using either the GNU tools (Linux) or the HP-UX tools, but not both, in the interest of keeping the book concise and readable.