12.3 Applications to Integer Multiplication | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

In the DOTPROD program (Figure 4-5), we used the Itanium pmpy2.r instruction (Section 4.2.5) to multiply word-length (16-bit) integers. The result there was valid because it was known in advance that the signed product would not exceed 32 bits in width. Then we showed an implementation of Booth's algorithm (Section 6.5) that can produce the full 128-bit signed or unsigned product of two 64-bit sources.

Later we showed the standard sequence of instructions (Section 8.7.2) required to take advantage of the Itanium xmpy pseudo-op of the xma instruction, which multiplies 64-bit integers after they have been transferred from general registers to floating-point registers. We mentioned the opportunity to capture either the high or low 64 bits of the full 128-bit product for transfer back from a floating-point register to a general register, but not both parts unless a second xmpy instruction is used.

Compilers for high-level languages have the opportunity to use xmpy for all integer multiplications or to construct sequences using the parallel multiplication instructions for cases involving source data less than 64 bits in width. We present here some sequences that attracted our attention.

12.3.1 32x32-Bit Sources Giving 32-Bit Unsigned Product

While the product of two 16-bit sources cannot exceed a width of 32 bits, the product of 32-bit sources can exceed that width i.e., up to 64 bits. Consider the special case where the product will fit in 32 bits (or where only the low-order 32 bits of a wider product would be wanted).

The following sequence, which we present without full discussion of all of the specific instructions, computes the product of two 32-bit unsigned integers. It could be converted into a callable function that would receive two values in registers r32 (in0) and r33 (in1) and produce a product in register r8 (ret0):

                            // r32:   x     x        A        a                            // r33:   x     x        B        b pmpyshr2.u   r31 = r32,r33,16  //    x     x      (AB)h    (ab)h mux2         r30 = r32,0       //    a     a        a        a shl          r29 = r33,16;;    //    x     B        b        0 pmpyshr2.u   r27 = r30,r33,0   //    x     x      (aB)l    (ab)l pmpyshr2.u   r28 = r32,r29,0   //    x     x      (Ab)l      0 shl          r26 = r31,16;;    //    x    AB)h    (ab)h      0 add          r25 = r26,r27;;   //    x     x      Part     (ab)l add          r8  = r25,r28     //    x     x      High     (ab)l

where x indicates 16-bit segments of the sources and intermediate results that do not affect the final 32-bit result, and where h and l indicate the high- and low-order 16-bit portions of each 32-bit transitory product that the processor computes from multiplying corresponding 16-bit segments of the sources. Thus, in the rightmost column, (ab)l means the product ab modulo 2¹⁶. Similarly, in the second column from the right, Part is a partial sum leading to High as the sum of three terms, (Ab)l + (aB)l + (ab)h, modulo 2¹⁶.

We can follow this algorithm using a more familiar algebraic notation where the relevant 32 bits of the value in one source register would be (2¹⁶A + a). With two sources, A and B, the full product could be calculated as:

full product = (2¹⁶A + a) x (2¹⁶B + b) = 2³²AB + 2¹⁶ (Ab + aB) + ab

where A, a, B, and b are 16-bit segments. Therefore the lowest 16 bits in the result register r8 should be (ab)l that is, ab modulo 2¹⁶. The next 16 bits should be (Ab)l + (aB)l + (ab)h, modulo 2¹⁶, as previously inferred from the behavior of the Itanium instructions.

One would have to be very careful to use only the lowest 32 bits of a result computed in this way, such as using an st4 instruction, since the upper bits would be a function of the unknown x segments.

Because of dependencies and the nonunit latency of some of the instructions used, this sequence would require some 11 cycles to execute in the initial Itanium processor implementation (Dale Morris, private communication), but that may be somewhat less than the overall cycle requirement of a logically equivalent sequence using xmpy.

12.3.2 32x32-Bit Sources Giving 64-Bit Unsigned Product

Some of the extraordinary versatility of the Itanium integer parallel operations can be illustrated with additional code segments involving multiplication of unsigned double word integers.

Walter Misar's illustration

Sverre Jarp presented a slide showing a slightly longer code sequence devised by Walter Misar of TU Darmstadt to compute the full 64-bit product of two 32-bit unsigned integers:

                             // r32:  x     x     A     a                             // r33:  x     x     B     b mux2        r34 = r32,0x50      //   A     A     a     a mux2        r35 = r33,0x14;;    //   b     B     B     b pmpyshr2.u  r36 = r34,r35,0     // (Ab)l (AB)l (aB)l (ab)l pmpyshr2.u  r37 = r34,r35,16;;  // (Ab)h (AB)h (aB)h (ab)h mix2.r      r38 = r37,r36       // (AB)h (AB)l (ab)h (ab)l mix2.l      r39 = r37,r36;;     // (Ab)h (Ab)l (aB)h (aB)l shr.u       r40 = r39,32        //   0     0   (Ab)h (Ab)l zxt4        r41 = r39;;         //   0     0   (aB)h (aB)l add         r42 = r40,r41;;     //   0    Mid3  Mid2  Mid1 shl         r43 = r42,16;;      //  Mid3  Mid2  Mid1   0 add         r8  = r43,r38       //  Sum3  Sum2  Sum1  Sum0

Misar's code and comments are adapted from those on the slide, in order to show the progress of the algorithm in the same manner as in the previous example. That is, the four fields in the comments are to be read as showing 16-bit segments of the 64-bit calculation in progress. The notation Mid indicates 16-bit segments of an intermediate sum.

Again the sources and final destination are consistent with the convention for function calls; the choice of registers r34-r43 for intermediate results implies that those have been claimed as locals with an alloc instruction.

The 11 instructions of this sequence must be placed into seven groups because of data dependencies. The total execution time would be significantly greater than seven cycles because of limitations of execution units and interdependencies and additional latency of the instructions involved.

Peter Montgomery's illustration

Walter Misar (personal communication) noted that Peter Montgomery had proposed a different code sequence to compute the full 64-bit product of two 32-bit unsigned integers:

                             // R2:   x     x     A     a                             // R3:   x     x     B     b mux2       R4 = R2,0x05         //   a     a     A     A mux2       R5 = R3,0x44;;       //   B     b     B     b pmpyshr2.u R6 = R4,R5,0         // (aB)l (ab)l (AB)l (Ab)l pmpyshr2.u R7 = R4,R5,16;;      // (aB)h (ab)h (AB)h (Ab)h shrp       R8 = R7,R6,32        // (AB)h (Ab)h (aB)l (ab)l dep.z      R9 = R6,16,32        //   0   (AB)l (Ab)l   0 extr.u     R10 = R7,32,32;;     //   0     0   (aB)h (ab)h shl        R10 = R10,16         //   0   (aB)h (ab)h   0 add        R8 = R8,R9;;         //  Int3  Int2  Int1 (ab)l add        R8 = R8,R10          //  Sum3  Sum2  Sum1  Sum0

Montgomery's code and comments are adapted in order to show the progress of the algorithm in the same manner as in the previous examples. That is, the four fields in the comments are to be read as showing 16-bit segments of the 64-bit calculation in progress. The notation Int indicates 16-bit segments of an intermediate sum.

Here arbitrary register designations are used. These would have to be changed in order to bring this particular code segment into compliance with the conventions for Itanium function calling.

The 10 instructions of this sequence must be placed into five groups because of data dependencies. The total execution time would be significantly greater than five cycles because of limitations of execution units and interdependencies and additional latency of the instructions involved.

Doubtless there are several similar code segments that could be proposed. For instance, an unpack4.l instruction could replace the second mux2 instruction in this latter illustration.