Before digging very deeply, let us first examine the floating-point number and its subcomponents.
The FPU supports three sizes of floating-point numbers, as shown below.
Data Size | C reference | Assembler | Bytes |
---|---|---|---|
Single-Precision | (float) | REAL4 | 4 |
Double-Precision | (double) | REAL8 | 8 |
Double Extended-Precision | --- | REAL10 | 10 |
You are probably familiar with single- and double-precision but not with double extended-precision. Did you know that when you do a floating-point calculation, the data is actually expanded into the 10-byte (80-bit) double extended-precision form as it is pushed onto the FPU stack?
The larger the number of bits used to store the number, the higher the precision of that number.
Component | SPFP | DPFP | DEPFP |
---|---|---|---|
Sign | 1 | 1 | 1 |
Exponent | 8 | 11 | 15 |
Integer | | | 1 |
Significand | 23 | 52 | 63 |
Total | 32 | 64 | 80 |
The exponent is a base-2 power representation stored as a binary integer. The significand (mantissa) really consists of two components: a J-bit and a binary fraction.
For the single-precision value, there is a hidden integer bit (1.) leading the 23 bits of the mantissa, thus making it a 24-bit significand. The exponent is 8 bits, thus having a bias value of 127. The magnitude of the supported range of normalized numbers is approximately 1.18 × 10^-38 to 3.40 × 10^38.
For double-precision values, there is a hidden integer bit (1.) leading the 52 bits of the mantissa, thus making it a 53-bit significand. The exponent is 11 bits, thus having a bias value of 1023. The magnitude of the supported range of numbers is 2.23 × 10^-308 to 1.8 × 10^308.
For the 80-bit version, the extra bits are primarily for protection against precision loss from rounding and over/underflows. The leading integer bit (1.) is stored explicitly as the 64th bit of the significand. The exponent is 15 bits, thus having a bias value of 16383. The magnitude of the supported range of numbers is approximately 3.36 × 10^-4932 to 1.19 × 10^4932.
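The floating-point value is the significand multiplied by two raised to the power of the unbiased exponent (the stored exponent minus the bias), with the sign bit selecting a positive or negative result.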
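To make the relationship between the three fields concrete, here is a minimal sketch in C for the single-precision case. The function name `spfp_value` is hypothetical, and the sketch assumes a normalized number (biased exponent 1 through 254); it does not handle zeros, denormals, infinities, or NaNs.

```c
#include <stdint.h>

/* Hypothetical helper: rebuilds the value of a normalized single-precision
   number from its sign bit, 8-bit biased exponent, and 23-bit fraction.
   Assumes 1 <= biased_exp <= 254. */
double spfp_value(unsigned sign, unsigned biased_exp, uint32_t fraction)
{
    /* Hidden integer bit (1.) plus the 23 fraction bits; 8388608 = 2^23. */
    double significand = 1.0 + (double)fraction / 8388608.0;
    int e = (int)biased_exp - 127;      /* remove the bias */
    double scale = 1.0;

    /* Build 2^e with exact multiplications or divisions by two. */
    for (; e > 0; --e) scale *= 2.0;
    for (; e < 0; ++e) scale /= 2.0;

    return sign ? -(significand * scale) : (significand * scale);
}
```

For example, `spfp_value(0, 0x80, 0x400000)` combines a significand of 1.5 with an unbiased exponent of 1 and so evaluates to 3.0.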
A zero exists in two forms (±0): positive zero (+0) and negative zero (-0). Both of these are valid indications of zero. (The sign is ignored!)
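The two zeros can be observed from C. The sketch below assumes that `float` is the IEEE 754 single-precision format, which is the usual case on x86; the helper names `zero_bits` and `zeros_compare_equal` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw 32-bit pattern of a float. */
uint32_t zero_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copy the bytes without converting the value */
    return u;
}

/* Returns 1 when +0 and -0 compare equal even though their bits differ. */
int zeros_compare_equal(void)
{
    float pz = 0.0f;
    float nz = -0.0f;
    return (pz == nz) && (zero_bits(pz) != zero_bits(nz));
}
```

Under that assumption, `zero_bits(0.0f)` yields 00000000h and `zero_bits(-0.0f)` yields 80000000h, yet the comparison treats the two as the same value.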
For normalized double-precision and single-precision floating-point numbers, the integer bit is always one. (It is implied and is not part of the 64 or 32 bits used to encode the number.) For double extended-precision the bit is encoded explicitly as part of the number, so denormalized numbers can be represented directly with that bit clear. These are very small non-zero numbers represented with an exponent of zero; they lie very close to zero and are considered tiny. Keep in mind that the FPU expands single-precision and double-precision numbers into double extended-precision, where the integer bit is one of the 80 bits, and thus denormalized numbers exist for the calculations. Upon saving a normalized single- or double-precision floating-point number back to memory, the integer bit is stripped out and becomes an imaginary (implied) bit, which is set!
Programmers are also usually aware that floats cannot be divided by zero, and that the square root of a negative number cannot be taken, because an exception would occur.
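When those exceptions are masked, which is commonly the default state, the operation does not trap; it returns a special value instead. The following sketch assumes IEEE 754 arithmetic with masked exceptions, and its function names are hypothetical. The `volatile` qualifier is there only to stop the compiler from evaluating the divisions at compile time.

```c
#include <math.h>

/* Returns 1 if a non-zero value divided by zero produced an infinity. */
int div_by_zero_is_infinite(void)
{
    volatile double one = 1.0, zero = 0.0;
    double r = one / zero;          /* zero-divide condition */
    return isinf(r) != 0;
}

/* Returns 1 if zero divided by zero (an invalid operation) produced a NaN. */
int zero_over_zero_is_nan(void)
{
    volatile double zero = 0.0;
    double r = zero / zero;         /* invalid-operation condition */
    return isnan(r) != 0;
}
```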
Sign | Biased Exponent | Significand | Encoding | Class |
---|---|---|---|---|
0 | 255 | 1.1xxx | 0 11111111 1xxxxxxxx | QNaN |
0 | 255 | 1.0xxx | 0 11111111 0xxxxxxx1 | SNaN |
0 | 255 | 1.000 | 0 11111111 000 | +∞ |
0 | 1-254 | 1.xxx | 0 11111110 xxxxxx | + Normalized Finite |
0 | 0 | 0.xxx | (Not SPFP) | + Denormalized (Tiny) |
0 | 0 | 0 | 00000000h | + Positive Zero |
1 | 0 | 0 | 80000000h | - Negative Zero |
1 | 0 | 0.xxx | 1 00000000 xxx | - Denormalized (Tiny) |
1 | 1-254 | 1.xxx | 1 11111110 xxxxxx | - Normalized Finite |
1 | 255 | 1.000 | 1 11111111 000 | -∞ |
1 | 255 | 1.0xxx | FF800001h-FFBFFFFFh | SNaN |
1 | 255 | 1.1xxx | FFC00000h-FFFFFFFFh | QNaN |
There are two types of NaNs (non-numbers): The quiet NaNs known as QNaNs and the signalling NaNs known as SNaNs.
QNaN
The QNaN has the most significant fraction bit set and is a valid value to use in most floating-point instructions even though it is not a number. A QNaN is an unordered number due to not being a real floating-point value.
SNaN
The SNaN has the most significant fraction bit clear (with at least one other fraction bit set) and typically signals an invalid-operation exception when used with floating-point instructions. SNaN values are never generated as the result of a floating-point operation. They are only operands supplied by software algorithms. An SNaN is an unordered number due to not being a real floating-point value.
NaN
The NaN (Not a Number) is a value that is either a QNaN or an SNaN.
Unordered
An unordered number is a number that is valid or a QNaN. (It is not SNaN.)
Ordered
An ordered number is a valid number that is not NaN ( neither QNaN nor SNaN).
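The QNaN and SNaN definitions above translate directly into bit tests on the single-precision encoding. The following is a minimal sketch; the enum and function names are hypothetical and are not taken from any library.

```c
#include <stdint.h>

enum sp_nan_class { SP_NOT_NAN, SP_QNAN, SP_SNAN };

/* Hypothetical helper: classifies a raw single-precision bit pattern. */
enum sp_nan_class sp_classify_nan(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* 8-bit biased exponent */
    uint32_t fraction = bits & 0x007FFFFFu;     /* 23-bit fraction */

    /* A NaN needs an all-ones exponent and a non-zero fraction;
       an all-ones exponent with a zero fraction is an infinity. */
    if (exponent != 0xFFu || fraction == 0)
        return SP_NOT_NAN;

    /* The most significant fraction bit separates quiet from signaling. */
    return (fraction & 0x00400000u) ? SP_QNAN : SP_SNAN;
}
```

With this test, FFC00000h classifies as a QNaN, FF800001h as an SNaN, and 7F800000h (positive infinity) as not a NaN. The sign bit plays no part in the decision.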
Value | Hex (single-precision) | Sign Exp Sig. |
---|---|---|
-1.0 | 0xBF800000 | 1 7F 000000 |
0.0 | 0x00000000 | 0 00 000000 |
0.0000001 | 0x33D6BF95 | 0 67 56BF95 |
1.0 | 0x3F800000 | 0 7F 000000 |
2.0 | 0x40000000 | 0 80 000000 |
3.0 | 0x40400000 | 0 80 400000 |
4.0 | 0x40800000 | 0 81 000000 |
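The Sign, Exp, and Sig. columns of the single-precision table can be reproduced by splitting the 32-bit pattern with shifts and masks. This is a minimal sketch that assumes `float` is the IEEE 754 single-precision format; the structure and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct sp_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8-bit biased exponent */
    uint32_t fraction;   /* 23-bit fraction, hidden integer bit excluded */
};

/* Hypothetical helper: splits a float into its three encoded fields. */
struct sp_fields sp_split(float f)
{
    uint32_t u;
    struct sp_fields out;

    memcpy(&u, &f, sizeof u);            /* reinterpret the bytes safely */
    out.sign     = (unsigned)(u >> 31);
    out.exponent = (unsigned)((u >> 23) & 0xFFu);
    out.fraction = u & 0x007FFFFFu;
    return out;
}
```

Printing the three members with a format such as `"%u %02X %06X"` gives the same layout as the table's last column.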
Value | Hex (double-precision) |
---|---|
-1.0 | 0xBFF00000 00000000 |
0.0 | 0x00000000 00000000 |
1.0 | 0x3FF00000 00000000 |
Value | Hex (double extended-precision) |
---|---|
-1.0 | 0xBFFF 80000000 00000000 |
0.0 | 0x0000 00000000 00000000 |
1.0 | 0x3FFF 80000000 00000000 |
The floating-point unit has eight data registers, {ST(0), ST(1), ST(2), ST(3), ST(4), ST(5), ST(6), ST(7)}, as well as Status Word, Control Word, Tag Word, Instruction Pointer, Data Pointer, and Opcode registers.
Def. | Code | Bit | Description |
---|---|---|---|
FPU_IE | 00001h | 0 | Invalid operation (exception) |
FPU_DE | 00002h | 1 | Denormalized operand (exception) |
FPU_ZE | 00004h | 2 | Zero divide (exception) |
FPU_OE | 00008h | 3 | Overflow (exception) |
FPU_UE | 00010h | 4 | Underflow (exception) |
FPU_PE | 00020h | 5 | Precision (exception) |
FPU_SF | 00040h | 6 | Stack fault |
FPU_ES | 00080h | 7 | Error summary status |
FPU_C0 | 00100h | 8 | (C0) Condition Code Bit#0 |
FPU_C1 | 00200h | 9 | (C1) Condition Code Bit#1 |
FPU_C2 | 00400h | 10 | (C2) Condition Code Bit#2 |
| | 11-13 | Top of stack pointer |
FPU_C3 | 04000h | 14 | (C3) Condition Code Bit#3 |
FPU_B | 08000h | 15 | FPU busy bit |
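Given a copy of the status word in a 16-bit integer, the fields in the table can be picked apart with the listed masks. The sketch below restates a few of the table's definitions in C syntax; the `FPU_TOP_MASK` name, its value (derived from bits 11-13), and the helper functions are assumptions made for illustration.

```c
#include <stdint.h>

#define FPU_IE 0x0001u   /* invalid operation */
#define FPU_ZE 0x0004u   /* zero divide */
#define FPU_ES 0x0080u   /* error summary status */

/* Assumed name and value: bits 11-13 hold the top-of-stack pointer. */
#define FPU_TOP_MASK 0x3800u

/* Returns the top-of-stack index (0 through 7) from a status word. */
unsigned fpu_top(uint16_t status_word)
{
    return (unsigned)((status_word & FPU_TOP_MASK) >> 11);
}

/* Returns the six exception flags (bits 0-5) from a status word. */
unsigned fpu_exception_flags(uint16_t status_word)
{
    return status_word & 0x003Fu;
}
```

For example, a status word of 3800h reports a top-of-stack index of 7, and a status word of 0084h has both the zero divide flag and the error summary bit set.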
The FPU has condition code bits contained within the status register. Most of these bits line up with flags in the EFLAGS register of the CPU. They can be copied to the AX register using the FSTSW AX instruction, followed by a SAHF instruction to place them into EFLAGS, where C0 lands in the carry flag, C2 in the parity flag, and C3 in the zero flag.
A ? B | C3 (Zero) | C2 (Parity) | C1 (Oflow) | C0 (Carry) |
---|---|---|---|---|
Unordered | x | x | x | 1 |
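The effect of the FSTSW AX and SAHF pair can be modeled in plain C without touching the hardware. This is only a sketch of the bit movement, with hypothetical names: FSTSW AX leaves status bits 8-15 in AH, and SAHF copies AH bit 0 to the carry flag, bit 2 to the parity flag, and bit 6 to the zero flag.

```c
#include <stdint.h>

struct cpu_flags {
    unsigned cf;   /* carry flag,  receives C0 */
    unsigned pf;   /* parity flag, receives C2 */
    unsigned zf;   /* zero flag,   receives C3 */
};

/* Hypothetical model of FSTSW AX followed by SAHF. */
struct cpu_flags flags_after_fstsw_sahf(uint16_t status_word)
{
    uint8_t ah = (uint8_t)(status_word >> 8);   /* high byte of AX */
    struct cpu_flags f;

    f.cf = ah & 1u;           /* status bit 8  (C0) -> CF */
    f.pf = (ah >> 2) & 1u;    /* status bit 10 (C2) -> PF */
    f.zf = (ah >> 6) & 1u;    /* status bit 14 (C3) -> ZF */
    return f;
}
```

Once the flags are in place, ordinary conditional jumps can act on the outcome of a floating-point comparison.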
Def. | Code | Bit | Description |
---|---|---|---|
FPU_IM | 00001h | 0 | Invalid operation |
FPU_DM | 00002h | 1 | Denormalized operand |
FPU_ZM | 00004h | 2 | Zero divide |
FPU_OM | 00008h | 3 | Overflow |
FPU_UM | 00010h | 4 | Underflow |
FPU_PM | 00020h | 5 | Precision |
FPU_PC | 00300h | 8,9 | Precision control |
FPU_RC | 00c00h | 10,11 | Rounding control |
FPU_X | 01000h | 12 | Infinity control |
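The two multi-bit fields of the control word can be extracted with the masks from the table. In this sketch the function names are hypothetical, and the field encodings in the comments follow the Intel architecture manuals rather than the table above.

```c
#include <stdint.h>

#define FPU_PC 0x0300u   /* precision control, bits 8-9 */
#define FPU_RC 0x0C00u   /* rounding control, bits 10-11 */

/* 0 = single (24-bit), 2 = double (53-bit), 3 = double extended (64-bit). */
unsigned fpu_precision_control(uint16_t control_word)
{
    return (unsigned)((control_word & FPU_PC) >> 8);
}

/* 0 = nearest, 1 = down, 2 = up, 3 = toward zero (truncate). */
unsigned fpu_rounding_control(uint16_t control_word)
{
    return (unsigned)((control_word & FPU_RC) >> 10);
}

/* Returns 1 when all six exception mask bits (bits 0-5) are set. */
int fpu_all_exceptions_masked(uint16_t control_word)
{
    return (control_word & 0x003Fu) == 0x003Fu;
}
```

Applied to 037Fh, the control word value after FINIT, these report double extended-precision, round to nearest, and all six exceptions masked.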
Now would be a good time to talk about FPU exceptions. The FPU uses exceptions for invalid operations in a manner similar to how the CPU uses exceptions.
Mnemonic | C/C++ | Description |
---|---|---|
#IA | | Invalid arithmetic operation |
#IS | #IND | Stack overflow or underflow |
#D | #QNAN | Denormal/un-normal operand |
#O | _FPE_OVERFLOW | Numerical overflow in result |
#P | | Precision loss |
#U | _FPE_UNDERFLOW | Numerical underflow in result |
#Z | #INF, _FPE_ZERODIVIDE | Divide by zero |
Most of the single- and double-precision floating-point functionality is covered by the C runtime math library, which is made available by including its header file: #include <math.h>.