8.4 Floating-Point Arithmetic Instructions

The powerful "fused" multiply add and multiply subtract machine instructions, which use three source operands, can also perform the simpler two-operand operations that other architectures provide as separate instructions.

A key design criterion of the Itanium architecture is avoidance of branch instructions. Single instructions can select the maximum or minimum of two operands, based on either signed or absolute values. These instructions can greatly reduce the need for short-range branching in searching and sorting algorithms.

Partial support for division and square root operations is provided through hardware primitives that compute limited-precision approximations for reciprocals and square roots, which can then be refined to full precision through optimized software routines.
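As a hedged sketch of how such software refinement can proceed, the lines below show one Newton-Raphson step that improves a reciprocal approximation y ≈ 1/b, expressed with the fused instructions described in Section 8.4.2; the register assignments (f6 holding b, f8 holding y, f9 as scratch) are hypothetical, and real library routines use several such steps with careful choices of completers.

 fnma  f9=f6,f8,f1   // f9 <- 1 - b*y          (f1 always holds +1.0)
 fma   f8=f8,f9,f8   // f8 <- y + y*(1 - b*y), a refined estimate of 1/b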

8.4.1 Addition, Subtraction, and Multiplication

The Itanium architecture implements three of the four basic floating-point arithmetic operations, as well as negated multiplication, with a syntax resembling that of the corresponding integer arithmetic instructions (Chapter 4):

 fadd.pc.sf   f1=f3,f2     // f1 <- f3 + f2
 fsub.pc.sf   f1=f3,f2     // f1 <- f3 - f2
 fmpy.pc.sf   f1=f3,f4     // f1 <- f3 * f4
 fnmpy.pc.sf  f1=f3,f4     // f1 <- -(f3 * f4)

where f1, f2, f3, and f4 may be any of the floating-point registers. In general, these instructions combine two numbers in floating-point source registers in order to compute the appropriate floating-point value for the destination.

Unlike the integer arithmetic instructions, the Itanium floating-point arithmetic instructions cannot take an immediate operand. These instructions are special cases of the "fused" instructions (see Section 8.4.2 below), with four 7-bit fields encoding the registers (f1, f2, f3, f4) in the 41-bit instruction.

There are three values for pc (the precision completer): none at all, s, and d. Omitting the completer permits handling of special circumstances, including IA-32 double extended precision. We shall concentrate on IEEE single- (s) and double- (d) precision, primarily the latter.

There are five values for sf (the status field completer): none at all, s0, s1, s2, and s3. Since none at all is equivalent to s0, these choices refer to the four status fields in a floating-point status register that is touched upon briefly later in this chapter. We use the default of none at all.

The four floating-point instructions described in this section carry out IEEE-conforming arithmetic as if to "infinite precision," negate the result in the case of fnmpy, and round the result to the specified precision.
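For instance, a sum of two squares can be formed in double precision as sketched below; the register choices (f6 and f7 holding the two inputs, f8 and f9 as scratch) are hypothetical.

 fmpy.d  f8=f6,f6   // f8 <- a*a   (f6 holds a)
 fmpy.d  f9=f7,f7   // f9 <- b*b   (f7 holds b)
 fadd.d  f8=f8,f9   // f8 <- a*a + b*b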

8.4.2 Fused Multiply Add and Multiply Subtract Instructions

Three fused (i.e., joined) floating-point arithmetic instructions in the Itanium architecture multiply two source operands before adding a third source operand to the intermediate product, or subtracting the third source from the intermediate product:

 fma.pc.sf   f1=f3,f4,f2  // f1 <- f3 * f4 + f2
 fms.pc.sf   f1=f3,f4,f2  // f1 <- f3 * f4 - f2
 fnma.pc.sf  f1=f3,f4,f2  // f1 <- -(f3 * f4) + f2

where f1, f2, f3, and f4 may be any of the floating-point registers. The intermediate product is not rounded in any way before the operand in register f2 is added to or subtracted from it, which maximizes the precision of the final value.

The values for pc (the precision completer) and sf (the status field completer) are the same as those given previously (Section 8.4.1) for nonfused arithmetic operations.

Relationship to nonfused operations

When Fr1, which always holds the constant +1.0, occupies the position denoted by register f4, the fused operations collapse to the equivalent of simple addition or subtraction. Similarly, when Fr0, which always holds the constant +0.0, occupies the position denoted by register f2, the fused operations collapse to the equivalent of simple multiplication. The equivalent operations for the arithmetic pseudo-ops are:

 fadd.pc.sf   f1=f3,f2  // fma.pc.sf   f1=f3,f1,f2
 fsub.pc.sf   f1=f3,f2  // fms.pc.sf   f1=f3,f1,f2
 fmpy.pc.sf   f1=f3,f4  // fma.pc.sf   f1=f3,f4,f0
 fnmpy.pc.sf  f1=f3,f4  // fnma.pc.sf  f1=f3,f4,f0

Use of the simple pseudo-ops enhances program legibility. Since compilers, listing files, and the debugger may instead use the direct operations, you should be familiar with the relationships between them.

Rationale for fused operations

The evaluation of mathematical functions occurs by way of various types of series expansion, i.e., by polynomial approximations that involve repeated multiply-add steps. Markstein explains the utility of this approach.
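As a hedged sketch of such a step sequence, a cubic polynomial c3*x^3 + c2*x^2 + c1*x + c0 can be evaluated by Horner's rule with one fma per coefficient; the register assignments below (f6 holding x, f7 through f10 holding c3 through c0, f11 as scratch) are hypothetical.

 fma.d  f11=f7,f6,f8     // f11 <- c3*x + c2
 fma.d  f11=f11,f6,f9    // f11 <- (c3*x + c2)*x + c1
 fma.d  f11=f11,f6,f10   // f11 <- ((c3*x + c2)*x + c1)*x + c0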

8.4.3 Normalization as Another Special Case

Modern architectures conforming to the IEEE floating-point specification may take advantage of implementation-specific features and efficiencies as long as they maintain the required accuracy.

The Itanium architectural model involves enhanced precision within the 82-bit register representation, which must ultimately be resolved to the standard format for IEEE data types. While the last floating-point arithmetic operation can often specify the final rounding, it may be more convenient to "normalize" and round a value separately at the end of a sequence of calculations. The fnorm instruction accomplishes this:

 fnorm.pc.sf  f1=f3   // fma.pc.sf  f1=f3,f1,f0

As indicated here, fnorm is implemented as a pseudo-op of the fma instruction, using a pro forma multiplication by one and addition of zero.

The values for pc (the precision completer) and sf (the status field completer) are the same as those given previously (Section 8.4.1) for nonfused arithmetic operations.
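For example, a chain of calculations might be kept in the register format by omitting the pc completer and rounded only once at the end; the register choices below are hypothetical, and the sketch assumes the default status field is configured to retain register-format precision for intermediates.

 fma      f8=f6,f7,f8    // intermediate kept in the 82-bit register format
 fma      f8=f8,f9,f10   // another intermediate step, not yet rounded to double
 fnorm.d  f8=f8          // normalize and round once to IEEE double precision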

8.4.4 Maximum and Minimum Operations

Many mathematical and scientific algorithms require the larger or smaller of two quantities for use in subsequent calculations. Traditional programming environments might require time-consuming compare-and-branch sequences for such operations.

The Itanium architecture supports these important operations directly, providing both signed and absolute-magnitude maximum or minimum selection through four instructions:

 famax.sf  f1=f2,f3   // f1 <- larger (absolutely) of f2, f3
 famin.sf  f1=f2,f3   // f1 <- smaller (absolutely) of f2, f3
 fmax.sf   f1=f2,f3   // f1 <- larger of f2, f3
 fmin.sf   f1=f2,f3   // f1 <- smaller of f2, f3

In the case of a "tie," where the values in registers f2 and f3 are equal in value (fmax, fmin) or magnitude (famax, famin), register f1 gets a copy of what is in register f3.

The values for sf (the status field completer) are the same as those given previously (Section 8.4.1) for nonfused arithmetic operations.
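As a branch-free illustration, a value x can be clamped into the range [lo, hi] with one fmax followed by one fmin; the register assignments (f6 holding x, f7 holding lo, f8 holding hi, f9 as the result) are hypothetical.

 fmax  f9=f6,f7   // f9 <- larger of x and lo
 fmin  f9=f9,f8   // f9 <- smaller of that result and hi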

8.4.5 Rounding, Exceptions, and Floating-Point Control

The IEEE standard specifies several modes for rounding floating-point values, defines numerous invalid operations, and requires itemized flagging of overflow, underflow, and other circumstances. Computer architectures usually pull those aspects of flexibility and complexity into the purview of a special register associated with the floating-point execution unit(s).

Itanium floating-point status register

The Itanium architecture includes an application register, ar.fpsr (Appendix D.6), which is read or modified by the special fchkf, fclrf, and fsetc instructions (Appendix C; Markstein; Triebel; Intel documentation). Using that register, runtime support libraries can selectively enable or disable hardware traps on various error conditions (e.g., division by zero).

Four identically structured fields of bits in the register allow patterns of rounding and exception-reporting behavior to be set up in advance and selected on a per-instruction basis using the sf instruction completer. One bit, wre (widest range exponent), indicates whether a computed result should remain in the register format; this is generally appropriate for intermediate results in order to retain the greatest possible precision. Programmers thus have extraordinarily fine-grained, yet very efficient, control over floating-point execution.
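As a hedged illustration, the same operation can be directed at different status fields simply by varying the completer; the register choices below are hypothetical.

 fma.d     f8=f6,f7,f8   // uses the default status field (s0)
 fma.d.s2  f9=f6,f7,f9   // uses the rounding and trap settings of status field 2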

Rounding

The IEEE standard for binary floating-point arithmetic specifies four modes of rounding, "nearest or even" and three directions:

- rounding to the most precise value that can be represented, and if two values are equally near, choosing the one having zero for the least significant bit of the fraction;
- rounding toward zero, also called "truncated" or "chopped" rounding;
- rounding "down" toward minus infinity; and
- rounding "up" toward plus infinity.

The "nearest or even" behavior corresponds to a longstanding practice among scientists of rounding decimal numbers of the form 0.d…x5 by choosing x to be even (sometimes called "unbiased rounding to even"). The everyday practice (as used in financial accounting) where the trailing 5 cases are rounded "up" can lead to systematic numeric bias when already-rounded quantities are later used to compute other derived quantities. Further discussion of these various rounding modes, the ar.fpsr register, and associated instructions can be found in the previously cited references.

Exceptions

The IEEE standard for binary floating-point arithmetic identifies five types of exceptions that hardware and/or software should be capable of detecting:

- invalid operations, such as 0/0 or operations with NaN values;
- division by zero;
- overflow, when the rounded result exceeds in magnitude the largest finite number of the destination format;
- underflow, when the nonzero rounded result is smaller in magnitude than the smallest normalized number of the destination format; and
- inexact result, when the rounded result differs from the infinitely precise result.

The Itanium ar.fpsr register allocates six bits in each of its four status fields for these five circumstances and the indication of denormal or un-normal operands. Since the status fields are under program control, the floating-point execution units may either ignore the event or cause an exception when one or more of these conditions are present.

Specific bits in the ar.fpsr register are set when one of these six conditions occurs. Since RISC-like systems require extra processor cycles to detect and report exceptions, it is common to give the programmer a choice between rapid calculations that ignore error conditions and slower calculations with exception tracking.


