12.5 Floating-Point Parallel Operations

When we first introduced floating-point instructions, we correlated them with their apparent integer equivalents. Similarly, we now introduce the Itanium floating-point parallel operations by analogy to operations on full-width data.

Table 12-3 lists the Itanium instructions that perform operations on two single-precision floating-point values packed in the 82-bit floating-point registers. Analogies are drawn to instructions that operate with double-precision data where appropriate.

Table 12-3. Itanium Floating-Point Parallel Instructions

Instruction Name                                               Opcode(s)             Analogy
Convert Parallel Floating-Point to Integer                     fpcvt.fx, fpcvt.fxu   fcvt.fx, fcvt.fxu
Floating-Point Mix                                             fmix
Floating-Point Pack                                            fpack
Floating-Point Parallel Absolute Maximum                       fpamax                famax
Floating-Point Parallel Absolute Minimum                       fpamin                famin
Floating-Point Parallel Absolute Value (pseudo-op)             fpabs                 fabs
Floating-Point Parallel Compare                                fpcmp                 fcmp
Floating-Point Parallel Maximum                                fpmax                 fmax
Floating-Point Parallel Merge                                  fpmerge               fmerge
Floating-Point Parallel Minimum                                fpmin                 fmin
Floating-Point Parallel Multiply (pseudo-op)                   fpmpy                 fmpy
Floating-Point Parallel Multiply Add                           fpma                  fma
Floating-Point Parallel Multiply Subtract                      fpms                  fms
Floating-Point Parallel Negate (pseudo-op)                     fpneg                 fneg
Floating-Point Parallel Negate Absolute Value (pseudo-op)      fpnegabs              fnegabs
Floating-Point Parallel Negative Multiply (pseudo-op)          fpnmpy                fnmpy
Floating-Point Parallel Negative Multiply Add                  fpnma                 fnma
Floating-Point Parallel Reciprocal Approximation               fprcpa                frcpa
Floating-Point Parallel Reciprocal Square Root Approximation   fprsqrta              frsqrta
Floating-Point Sign Extend                                     fsxt
Floating-Point Swap                                            fswap

The analogies in Table 12-3 are just that, since some significant differences distinguish these parallel instructions from their nearest counterparts for full-width data. For instance, the floating-point parallel compare (fpcmp) instruction writes its results into a floating-point register rather than a pair of predicate registers.
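A minimal sketch of this difference follows; the register and predicate numbers are chosen arbitrarily for illustration:

    fcmp.lt   p1, p2 = f8, f9    // full-width: results go to predicates p1, p2
    fpcmp.lt  f6 = f8, f9        // parallel: each 32-bit half of the significand
                                 // of f6 becomes all ones (true) or all zeros (false)

The mask that fpcmp writes can subsequently drive a parallel selection of values (for instance via fselect), whereas the predicates that fcmp writes qualify individual instructions directly.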

Contrary to what one might wish in this context, the single-precision form of the floating-point load pair (ldfps) instruction (Section 8.3.3) does not, by itself, bring two single-precision floating-point values into one processor register. Instead, it puts the values into two separate floating-point registers in 82-bit register format. The ldf8 instruction can bring two 32-bit values from adjacent information units in memory into a single floating-point register in exactly the format required by these floating-point parallel instructions. Similarly, the ldfp8 instruction can load four single-precision values into the proper format in two registers.
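The following sketch contrasts the three loads. The lines are alternatives rather than a real sequence, and the register numbers are arbitrary; r14 is assumed to point to an array of single-precision values:

    ldfps  f6, f7 = [r14]    // two singles into TWO separate 82-bit registers
    ldf8   f6 = [r14]        // eight bytes (two packed singles) into ONE
                             // register, ready for the parallel instructions
    ldfp8  f6, f7 = [r14]    // sixteen bytes (four packed singles) into two
                             // registers, in the same packed format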

The fpack instruction converts two floating-point numbers in the full 82-bit register representation into two 32-bit memory representations in bits <63:32> and <31:0> of a destination floating-point register. As with the representation of integers in floating-point registers, the sign bit <81> is set to 0 and the exponent field of bits <80:64> is set to the special value 0x1003E (2^63). All of the other floating-point parallel operations produce or maintain results in this format.
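As a sketch with arbitrary register numbers, fpack builds this packed format from two registers holding full-width values, with the first source supplying the upper half:

    fpack  f6 = f7, f8    // f6<63:32> = single(f7), f6<31:0> = single(f8);
                          // sign <81> = 0, exponent <80:64> = 0x1003E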

Like double-precision floating-point instructions, the parallel forms have a nominal latency of four cycles on an Itanium 2 processor, but the latency can be longer in special circumstances, such as cases where the computed results are stored to memory or general registers. See the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization for further information.

With completely independent data, an Itanium processor can sustain either two nonparallel double-precision or four parallel single-precision floating-point operations per cycle, whereas sequences of sufficiently independent 64-bit integer operations can reach six operations per processor cycle. Yet the long latencies, coupled with having only two F-type execution units, mean that typical scalar floating-point operations (whether nonparallel or parallel) carried out sequentially on dependent data values take some N processor cycles per operation (N > 1). Compilers should therefore interleave as many integer instructions as possible into the slots of instruction bundles among the floating-point instructions, and should use modulo-scheduled loops to increase floating-point throughput whenever possible. A packed loop body of the kind sketched below is the natural raw material for such scheduling.
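As one illustration of the packed style, here is a minimal, unpipelined inner-loop body for computing y[i] = a*x[i] + y[i] two single-precision elements at a time. It is a sketch only: r14 and r15 are assumed to point into x and y, f9 is assumed to hold the pair (a, a) (which fpack f9 = f10, f10 could create from a full-width value in f10), and bundling, loop control, and software pipelining are all omitted:

    ldf8     f6 = [r14], 8       // next two packed singles from x
    ldf8     f7 = [r15] ;;       // the matching pair from y
    fpma.s0  f8 = f6, f9, f7 ;;  // two multiply-adds at once: f8 = f6*f9 + f7
    stf8     [r15] = f8, 8       // store the updated pair back to y

Modulo scheduling would overlap several iterations of this body so that the four-cycle fpma latency is hidden by work from neighboring iterations.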


