In the DOTPROD program (Figure 4-5), we used the Itanium pmpy2.r instruction (Section 4.2.5) to multiply word-length (16-bit) integers. The result there was valid because it was known in advance that the signed product would not exceed 32 bits in width. Then we showed an implementation of Booth's algorithm (Section 6.5) that can produce the full 128-bit signed or unsigned product of two 64-bit sources. Later we showed the standard sequence of instructions (Section 8.7.2) required to take advantage of the Itanium xmpy pseudo-op of the xma instruction, which multiplies 64-bit integers after they have been transferred from general registers to floating-point registers. We mentioned the opportunity to capture either the high or low 64 bits of the full 128-bit product for transfer back from a floating-point register to a general register, but not both parts unless a second xmpy instruction is used. Compilers for high-level languages have the opportunity to use xmpy for all integer multiplications or to construct sequences using the parallel multiplication instructions for cases involving source data less than 64 bits in width. We present here some sequences that attracted our attention. 12.3.1 32x32-Bit Sources Giving 32-Bit Unsigned ProductWhile the product of two 16-bit sources cannot exceed a width of 32 bits, the product of 32-bit sources can exceed that width i.e., up to 64 bits. Consider the special case where the product will fit in 32 bits (or where only the low-order 32 bits of a wider product would be wanted). The following sequence, which we present without full discussion of all of the specific instructions, computes the product of two 32-bit unsigned integers. It could be converted into a callable function that would receive two values in registers r32 (in0) and r33 (in1) and produce a product in register r8 (ret0): // r32: x x A a // r33: x x B b pmpyshr2.u r31 = r32,r33,16 // x x (AB)h (ab)h mux2 r30 = r32,0 // a a a a shl r29 = r33,16;; // x B b 0 pmpyshr2.u r27 = r30,r33,0 // x x (aB)l (ab)l pmpyshr2.u r28 = r32,r29,0 // x x (Ab)l 0 shl r26 = r31,16;; // x AB)h (ab)h 0 add r25 = r26,r27;; // x x Part (ab)l add r8 = r25,r28 // x x High (ab)l where x indicates 16-bit segments of the sources and intermediate results that do not affect the final 32-bit result, and where h and l indicate the high- and low-order 16-bit portions of each 32-bit transitory product that the processor computes from multiplying corresponding 16-bit segments of the sources. Thus, in the rightmost column, (ab)l means the product ab modulo 216. Similarly, in the second column from the right, Part is a partial sum leading to High as the sum of three terms, (Ab)l + (aB)l + (ab)h, modulo 216. We can follow this algorithm using a more familiar algebraic notation where the relevant 32 bits of the value in one source register would be (216A + a). With two sources, A and B, the full product could be calculated as:
where A, a, B, and b are 16-bit segments. Therefore the lowest 16 bits in the result register r8 should be (ab)l that is, ab modulo 216. The next 16 bits should be (Ab)l + (aB)l + (ab)h, modulo 216, as previously inferred from the behavior of the Itanium instructions. One would have to be very careful to use only the lowest 32 bits of a result computed in this way, such as using an st4 instruction, since the upper bits would be a function of the unknown x segments. Because of dependencies and the nonunit latency of some of the instructions used, this sequence would require some 11 cycles to execute in the initial Itanium processor implementation (Dale Morris, private communication), but that may be somewhat less than the overall cycle requirement of a logically equivalent sequence using xmpy. 12.3.2 32x32-Bit Sources Giving 64-Bit Unsigned ProductSome of the extraordinary versatility of the Itanium integer parallel operations can be illustrated with additional code segments involving multiplication of unsigned double word integers. Walter Misar's illustrationSverre Jarp presented a slide showing a slightly longer code sequence devised by Walter Misar of TU Darmstadt to compute the full 64-bit product of two 32-bit unsigned integers: // r32: x x A a // r33: x x B b mux2 r34 = r32,0x50 // A A a a mux2 r35 = r33,0x14;; // b B B b pmpyshr2.u r36 = r34,r35,0 // (Ab)l (AB)l (aB)l (ab)l pmpyshr2.u r37 = r34,r35,16;; // (Ab)h (AB)h (aB)h (ab)h mix2.r r38 = r37,r36 // (AB)h (AB)l (ab)h (ab)l mix2.l r39 = r37,r36;; // (Ab)h (Ab)l (aB)h (aB)l shr.u r40 = r39,32 // 0 0 (Ab)h (Ab)l zxt4 r41 = r39;; // 0 0 (aB)h (aB)l add r42 = r40,r41;; // 0 Mid3 Mid2 Mid1 shl r43 = r42,16;; // Mid3 Mid2 Mid1 0 add r8 = r43,r38 // Sum3 Sum2 Sum1 Sum0 Misar's code and comments are adapted from those on the slide, in order to show the progress of the algorithm in the same manner as in the previous example. That is, the four fields in the comments are to be read as showing 16-bit segments of the 64-bit calculation in progress. The notation Mid indicates 16-bit segments of an intermediate sum. Again the sources and final destination are consistent with the convention for function calls; the choice of registers r34-r43 for intermediate results implies that those have been claimed as locals with an alloc instruction. The 11 instructions of this sequence must be placed into seven groups because of data dependencies. The total execution time would be significantly greater than seven cycles because of limitations of execution units and interdependencies and additional latency of the instructions involved. Peter Montgomery's illustrationWalter Misar (personal communication) noted that Peter Montgomery had proposed a different code sequence to compute the full 64-bit product of two 32-bit unsigned integers: // R2: x x A a // R3: x x B b mux2 R4 = R2,0x05 // a a A A mux2 R5 = R3,0x44;; // B b B b pmpyshr2.u R6 = R4,R5,0 // (aB)l (ab)l (AB)l (Ab)l pmpyshr2.u R7 = R4,R5,16;; // (aB)h (ab)h (AB)h (Ab)h shrp R8 = R7,R6,32 // (AB)h (Ab)h (aB)l (ab)l dep.z R9 = R6,16,32 // 0 (AB)l (Ab)l 0 extr.u R10 = R7,32,32;; // 0 0 (aB)h (ab)h shl R10 = R10,16 // 0 (aB)h (ab)h 0 add R8 = R8,R9;; // Int3 Int2 Int1 (ab)l add R8 = R8,R10 // Sum3 Sum2 Sum1 Sum0 Montgomery's code and comments are adapted in order to show the progress of the algorithm in the same manner as in the previous examples. That is, the four fields in the comments are to be read as showing 16-bit segments of the 64-bit calculation in progress. The notation Int indicates 16-bit segments of an intermediate sum. Here arbitrary register designations are used. These would have to be changed in order to bring this particular code segment into compliance with the conventions for Itanium function calling. The 10 instructions of this sequence must be placed into five groups because of data dependencies. The total execution time would be significantly greater than five cycles because of limitations of execution units and interdependencies and additional latency of the instructions involved. Doubtless there are several similar code segments that could be proposed. For instance, an unpack4.l instruction could replace the second mux2 instruction in this latter illustration. |