12.5 Floating-Point Parallel Operations

When we first introduced floating-point instructions, we correlated them with their apparent integer equivalents. Similarly, we now introduce the Itanium floating-point parallel operations by analogy to operations on full-width data.

Table 12-3 lists the Itanium instructions that perform operations on two single-precision floating-point values packed in the 82-bit floating-point registers. Analogies are drawn to instructions that operate with double-precision data where appropriate.

Table 12-3. Itanium Floating-Point Parallel Instructions

Instruction Name                                               Opcode(s)             Analogy
Convert Parallel Floating-Point to Integer                     fpcvt.fx, fpcvt.fxu   fcvt.fx, fcvt.fxu
Floating-Point Mix                                             fmix
Floating-Point Pack                                            fpack
Floating-Point Parallel Absolute Maximum                       fpamax                famax
Floating-Point Parallel Absolute Minimum                       fpamin                famin
Floating-Point Parallel Absolute Value (pseudo-op)             fpabs                 fabs
Floating-Point Parallel Compare                                fpcmp                 fcmp
Floating-Point Parallel Maximum                                fpmax                 fmax
Floating-Point Parallel Merge                                  fpmerge               fmerge
Floating-Point Parallel Minimum                                fpmin                 fmin
Floating-Point Parallel Multiply (pseudo-op)                   fpmpy                 fmpy
Floating-Point Parallel Multiply Add                           fpma                  fma
Floating-Point Parallel Multiply Subtract                      fpms                  fms
Floating-Point Parallel Negate (pseudo-op)                     fpneg                 fneg
Floating-Point Parallel Negate Absolute Value (pseudo-op)      fpnegabs              fnegabs
Floating-Point Parallel Negative Multiply (pseudo-op)          fpnmpy                fnmpy
Floating-Point Parallel Negative Multiply Add                  fpnma                 fnma
Floating-Point Parallel Reciprocal Approximation               fprcpa                frcpa
Floating-Point Parallel Reciprocal Square Root Approximation   fprsqrta              frsqrta
Floating-Point Sign Extend                                     fsxt
Floating-Point Swap                                            fswap

The analogies in Table 12-3 are just that, since some significant differences distinguish these parallel instructions from their nearest counterparts for full-width data. For instance, the floating-point parallel compare (fpcmp) instruction writes its results into a floating-point register rather than a pair of predicate registers.
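A minimal sketch of this difference follows; the register and predicate numbers are chosen arbitrarily for illustration:

    fcmp.lt   p1, p2 = f8, f9    // full-width: results go to predicates p1, p2
    fpcmp.lt  f6 = f8, f9        // parallel: each 32-bit half of the significand
                                 // of f6 becomes all ones (true) or all zeros (false)

The mask that fpcmp writes can subsequently drive a parallel selection of values (for instance via fselect), whereas the predicates that fcmp writes qualify individual instructions directly.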

Contrary to what one might wish in this context, the single-precision form of the floating-point load pair (ldfps) instruction (Section 8.3.3) does not, by itself, bring two single-precision floating-point values into one processor register. Instead, it puts the values into two separate floating-point registers in 82-bit register format. The ldf8 instruction can bring two 32-bit values from adjacent information units in memory into a single floating-point register in exactly the format required by these floating-point parallel instructions. Similarly, the ldfp8 instruction can load four single-precision values into the proper format in two registers.
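The following sketch contrasts the three loads. The lines are alternatives rather than a real sequence, and the register numbers are arbitrary; r14 is assumed to point to an array of single-precision values:

    ldfps  f6, f7 = [r14]    // two singles into TWO separate 82-bit registers
    ldf8   f6 = [r14]        // eight bytes (two packed singles) into ONE
                             // register, ready for the parallel instructions
    ldfp8  f6, f7 = [r14]    // sixteen bytes (four packed singles) into two
                             // registers, in the same packed format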

The fpack instruction converts two floating-point numbers in the full 82-bit register representation into two 32-bit memory representations in bits <63:32> and <31:0> of a destination floating-point register. As with the representation of integers in floating-point registers, the sign bit <81> is set to 0 and the exponent field of bits <80:64> is set to the special value 0x1003E (2^63). All of the other floating-point parallel operations produce or maintain results in this format.
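As a sketch with arbitrary register numbers, fpack builds this packed format from two registers holding full-width values, with the first source supplying the upper half:

    fpack  f6 = f7, f8    // f6<63:32> = single(f7), f6<31:0> = single(f8);
                          // sign <81> = 0, exponent <80:64> = 0x1003E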

Like double-precision floating-point instructions, the parallel forms have a nominal latency of four cycles on an Itanium 2 processor, but the latency can be longer in special circumstances, such as cases where the computed results are stored to memory or general registers. See the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization for further information.

With completely independent data, an Itanium processor can sustain either two nonparallel double-precision or four parallel single-precision floating-point operations per cycle, whereas sequences of sufficiently independent 64-bit integer operations can reach six operations per processor cycle. Yet the long latencies, coupled with having only two F-type execution units, mean that typical scalar floating-point operations (whether nonparallel or parallel) carried out sequentially on dependent data values take some N processor cycles per operation (N > 1). Compilers should therefore interleave as many integer instructions as possible into the slots of instruction bundles among the floating-point instructions, and should use modulo-scheduled loops to increase floating-point throughput whenever possible. A packed loop body of the kind sketched below is the natural raw material for such scheduling.
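As one illustration of the packed style, here is a minimal, unpipelined inner-loop body for computing y[i] = a*x[i] + y[i] two single-precision elements at a time. It is a sketch only: r14 and r15 are assumed to point into x and y, f9 is assumed to hold the pair (a, a) (which fpack f9 = f10, f10 could create from a full-width value in f10), and bundling, loop control, and software pipelining are all omitted:

    ldf8     f6 = [r14], 8       // next two packed singles from x
    ldf8     f7 = [r15] ;;       // the matching pair from y
    fpma.s0  f8 = f6, f9, f7 ;;  // two multiply-adds at once: f8 = f6*f9 + f7
    stf8     [r15] = f8, 8       // store the updated pair back to y

Modulo scheduling would overlap several iterations of this body so that the four-cycle fpma latency is hidden by work from neighboring iterations.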


