8.2 Representations of Floating-Point Values

In Chapter 2, we introduced the IEEE representations for single- and double-precision floating-point values, which use 32 and 64 bits of storage, respectively. Since these representations are an industry standard, virtually all modern computer systems store floating-point values in memory in the same manner, differing only in big- or little-endian byte ordering.

The IEEE representations partition an information unit into fields for sign, exponent, and significand. The exponent is stored with an additive bias. The significand consists of an implicit binary 1 followed by a binary fraction (Section 2.5.2).
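The field layout just described can be sketched for the 64-bit double-precision format. This is a minimal illustration, not part of the original text: the helper names `split_double` and `rebuild_normalized` are hypothetical, and the field widths (1 sign bit, 11 exponent bits, 52 fraction bits) and the bias of 1023 are the standard IEEE double-precision values.

```python
import struct

def split_double(x):
    """Return (sign, biased_exponent, fraction) for a Python float (IEEE double)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF  # 11 bits, stored with bias 1023
    fraction = bits & ((1 << 52) - 1)       # 52 bits after the implicit 1
    return sign, biased_exponent, fraction

def rebuild_normalized(sign, biased_exponent, fraction):
    """Recompute the value of a normalized double from its three fields."""
    significand = 1 + fraction / 2**52      # implicit 1 followed by the fraction
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 1023)

print(split_double(1.0))                          # (0, 1023, 0)
print(rebuild_normalized(*split_double(-2.5)))    # -2.5
```

Packing in big-endian order (`">d"`) only fixes the order in which the sketch reads the bytes; the field values themselves do not depend on the byte ordering of the host.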

8.2.1 IEEE Special Values

Before we consider floating-point instructions, we need to provide more detail about how the IEEE representation accommodates certain special values, several of which are outlined in Table 8-2.

These categories are distinguished by whether the biased exponent field contains all 0 bits, all 1 bits, or any other bit pattern and whether the fraction field contains all 0 bits or any other bit pattern. A brief overview of these should clarify some of the side effects of certain Itanium floating-point and conversion instructions.

One special circumstance is the encoding for NaN (not a number), a deliberately invalid number stored in an information unit. A signaling NaN must produce an invalid operation exception when used as an operand, while a quiet NaN may propagate through almost any arithmetic operation without signaling any exceptions.

Special encodings exist for positive and negative infinity, as well as for the denormal numbers, i.e., those whose fractions have not been shifted far enough to the left to give the significand the hidden bit (as mentioned in Chapter 2). Such numbers have values lying between zero and the smallest of the normalized numbers given in Table 2-2, and are not considered to be finite.

The exact value zero is assigned a specific pattern of 31 or 63 zero bits with a sign bit. Only zero and normalized numbers are considered finite in a mathematical sense.

Table 8-2. Meanings of Special IEEE Floating-Point Representations
Biased Exponent	Fraction	IEEE Meaning	Finite?
All ones	Nonzero	NaN^[*]	No
All ones	Zero	± Infinity	No
Zero	Nonzero	± Denormal	No
Zero	Zero	± 0	Yes
Other	Anything	Nonzero, normalized	Yes

^[*] The IEEE standard requires at least one quiet NaN and one signaling NaN.
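The decision rules of Table 8-2 can be written out as a short classifier for double-precision values. This is only a sketch: the function name `classify_double` is hypothetical, and the 11-bit exponent and 52-bit fraction widths are those of the IEEE double format.

```python
import struct

def classify_double(x):
    """Classify a Python float by its biased exponent and fraction fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    exponent = (bits >> 52) & 0x7FF        # all ones is 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:                  # rows 1 and 2 of the table
        return "NaN" if fraction else "infinity"
    if exponent == 0:                      # rows 3 and 4 of the table
        return "denormal" if fraction else "zero"
    return "normalized"                    # any other exponent pattern

for value in (float("nan"), float("-inf"), 5e-324, -0.0, 1.5):
    print(value, classify_double(value))
```

The sign bit plays no part in the classification, which is why the table lists infinity, denormal, and zero entries with a ± sign.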

8.2.2 Values in Itanium Floating-Point Registers

The IEEE standard specifies what values must be represented, what accuracy is required, and what computational details, such as rounding, must be accommodated. The IEEE standard does not dictate how a conforming architecture is to implement these requirements. Blends of hardware and software strategies are possible; some architectures represent intermediate results in registers differently from IEEE memory representation.

The designers of the Itanium architecture decided on a width of 82 bits for floating-point registers. At first sight, the choice of 82 bits may seem peculiar and costly, but it allows for several desirable features:

The "double extended" representation of the IA-32 architecture, which requires 80 bits (10 bytes), can be handled without loss of precision.
Single-, double-, and double extended precision operands are manipulated with the same set of arithmetic instructions within the CPU.
Usually only load and store instructions need to be concerned with differences among these three precisions or conversions among them.
The significand is easily expressed in a register as 64 bits, with the hidden bit explicitly shown; 63 bits of precision are available for the fraction.
A wide exponent field accommodates all of the IEEE requirements, while making possible other capabilities using floating-point registers and execution units, such as 64-bit integer multiplication.

A 64-bit significand also improves accuracy when several intermediate floating-point calculations are combined before a result is stored.

This accuracy takes on special importance in polynomial series approximations for computing transcendental functions. Other architectures have incorporated a few extra "guard bits" into their floating-point arithmetic execution unit in order to ensure that the addition or multiplication was accurate. Nevertheless, the advantage of enhanced precision was lost every time an intermediate result was stored back into a standard 64-bit floating-point register.

The 17-bit span for the biased exponent in an Itanium floating-point register exceeds the minimum requirement of 15 bits for the IA-32 double extended representation. This excess, a factor of four, facilitates intermediate calculations at the borders of very small values close to underflow or very large values close to overflow.
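The "factor of four" follows directly from the two extra exponent bits. The arithmetic can be checked with a short sketch. The helper names are hypothetical, and the bias formula 2^(bits - 1) - 1 is the conventional IEEE-style choice, assumed here for both field widths.

```python
def exponent_codes(field_bits):
    """Number of distinct biased-exponent codes in a field of the given width."""
    return 1 << field_bits

def standard_bias(field_bits):
    """Conventional IEEE-style bias, 2**(bits - 1) - 1 (an assumption of this sketch)."""
    return (1 << (field_bits - 1)) - 1

print(exponent_codes(17) // exponent_codes(15))  # 4
print(hex(standard_bias(15)))                    # 0x3fff
print(hex(standard_bias(17)))                    # 0xffff
```

With this formula the 17-bit field yields a bias of 65535 (0xFFFF), the value used for exponents held in Itanium floating-point registers.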

When a double-precision quantity is retrieved from memory into a floating-point register, its representation for normalized numbers consists of a sign bit, a 17-bit biased exponent, and a 64-bit significand whose leading (formerly hidden) bit is an explicit 1.

The exponent is biased by 65535 (0xFFFF). Since the significand (1.fraction) is treated as a number that is at least 1 but less than 2, the value is

$(1 - 2 \times \mathit{sign}) \times \mathit{significand} \times 2^{(\mathit{exponent} - 65535)}$

Consequently, a single- or double-precision floating-point value is expanded from its 32- or 64-bit IEEE representation when it is loaded from memory into an Itanium floating-point register. The load is done without distortion of the value; however, when a double-precision computed result is stored in memory, only 53 bits of significand can be accommodated (Table 2-2).
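The expansion of a normalized double into register fields, and the value formula above, can be sketched as follows. The function names are hypothetical, and the sketch assumes that the 52 fraction bits are placed immediately below the explicit leading 1 of the 64-bit significand, with the low 11 bits zero. Exact rational arithmetic is used so that "without distortion" can be checked literally.

```python
import struct
from fractions import Fraction

BIAS_DOUBLE = 1023      # bias of the 11-bit exponent in memory
BIAS_REGISTER = 65535   # bias of the 17-bit exponent in a register (0xFFFF)

def double_to_register_fields(x):
    """Expand a normalized double into (sign, 17-bit exponent, 64-bit significand)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exp11 = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exp11 == 0 or exp11 == 0x7FF:
        raise ValueError("this sketch covers normalized values only")
    exp17 = exp11 - BIAS_DOUBLE + BIAS_REGISTER   # re-bias the true exponent
    significand = (1 << 63) | (fraction << 11)    # explicit 1, then the fraction
    return sign, exp17, significand

def register_value(sign, exp17, significand):
    """Evaluate (1 - 2*sign) * significand * 2**(exponent - 65535) exactly."""
    # The significand is read as 1.fraction, i.e. the 64-bit integer over 2**63.
    return ((1 - 2 * sign) * Fraction(significand, 1 << 63)
            * Fraction(2) ** (exp17 - BIAS_REGISTER))

fields = double_to_register_fields(0.1)
print(register_value(*fields) == Fraction(0.1))   # True: no distortion on load
```

Going the other way, a store to double precision has room for only the top 53 significand bits, so any nonzero bits among the low 11 would have to be rounded away.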

Not a Thing Value

A special value, NaTVal, consists of an otherwise unassigned 82-bit code, where the sign bit and significand are 0 and the biased exponent is 0x1FFFE. NaTVal indicates that the floating-point register does not contain a valid number, for example, when a speculative load instruction fails. Any attempt to use this value in calculations will result in propagation of the NaTVal.
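As an illustration, the NaTVal code can be assembled as an 82-bit integer. The sketch assumes the register is laid out with the sign in bit 81, the 17-bit exponent in bits 80 through 64, and the significand in bits 63 through 0. The names `pack_register`, `NATVAL`, and `is_natval` are hypothetical.

```python
def pack_register(sign, exp17, significand):
    """Pack the three fields into one 82-bit integer (assumed layout: 1 + 17 + 64)."""
    assert sign in (0, 1)
    assert 0 <= exp17 < (1 << 17)
    assert 0 <= significand < (1 << 64)
    return (sign << 81) | (exp17 << 64) | significand

# Sign 0, biased exponent 0x1FFFE, significand 0.
NATVAL = pack_register(0, 0x1FFFE, 0)

def is_natval(reg82):
    """True if the 82-bit pattern is exactly the NaTVal code."""
    return reg82 == NATVAL

print(hex(NATVAL))
print(is_natval(pack_register(0, 0x1FFFF, 0)))   # False: a different exponent code
```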

Integers in floating-point registers

Several Itanium instructions work with integers in bits <63:0> of floating-point registers. In such cases, bit <63> serves as the sign of the usual two's complement representation of the integer. The floating-point sign bit is 0. The biased exponent is 0x1003E (2⁶³), indicating that the value is integral, i.e., with the true binary point to the right of bit <0>.
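The integer convention can be sketched in the same style. The helper names are hypothetical. The sketch simply stores the 64-bit two's complement pattern in the significand field and checks that the exponent 0x1003E corresponds to an unbiased exponent of 63.

```python
INTEGER_EXPONENT = 0x1003E   # biased exponent used for integers
BIAS_REGISTER = 0xFFFF       # 65535

def int_to_register_fields(n):
    """Return (sign, exponent, significand) holding a signed 64-bit integer."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in 64 bits")
    # The floating-point sign stays 0; the integer's sign lives in bit <63>.
    return 0, INTEGER_EXPONENT, n & 0xFFFFFFFFFFFFFFFF

def register_fields_to_int(sign, exp17, significand):
    """Recover the signed integer from bits <63:0> (two's complement)."""
    return significand - (1 << 64) if significand >> 63 else significand

print(INTEGER_EXPONENT - BIAS_REGISTER)                     # 63
print(hex(int_to_register_fields(-1)[2]))                   # 0xffffffffffffffff
print(register_fields_to_int(*int_to_register_fields(-42))) # -42
```

An unbiased exponent of 63 scales the 1.fraction reading of the significand by 2⁶³, which moves the binary point from just after bit <63> to just after bit <0>.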
