4.7 Floating-Point Operations

Although most modern CPUs support a floating-point unit (FPU) that does floating-point arithmetic in hardware, it's worthwhile to actually develop a set of software floating-point arithmetic routines to get a solid feel for what's involved in floating-point arithmetic. Generally, when designing a software-based floating-point package, you would use assembly language to write the math functions because speed is a primary design goal for a floating-point package. However, in this chapter we're only writing this floating-point package to get a clearer picture of what's involved in floating-point arithmetic, so we'll opt for code that is easy to write, read, andunderstand. As it turns out, floating-point addition and subtraction are easy to do in a high-level language like C/C++ or Pascal, so we'll implement these functions in these languages. Floating-point multiplication and division actually turn out to be easier to do in assembly language than in a high-level language, so this book will write the floating-point multiplication and division routines using HLA.

4.7.1 Floating-Point Representation

For the purposes of the floating-point functions we're about to develop, this section will use the IEEE 32-bit single-precision floating-point format (shown earlier in Figure 4-2), which uses a one's complement representation for signed values. This means that the sign bit (bit 31) contains a one if the number is negative and a zero if the number is positive. The exponent is an 8-bit excess-127 exponent sitting in bits 23..30, and the mantissa is a 24-bit value with an implied HO bit of one. Because of the implied HO bit, this format does not support denormalized values.

4.7.2 Floating-Point Addition and Subtraction

Because the IEEE floating-point format supports signed real values, it turns out that addition and subtraction use essentially the same code. After all, computing X ˆ’ Y is equivalent to computing X + ( ˆ’ Y) . So if we can add a negative number with some other value, then we can also perform subtraction by first negating some number and then adding them. And because the IEEE floating-point format uses the one's complement representation, negating a value is trivial - we just invert the sign bit.

Because we're using the standard IEEE 32-bit single-precision floating-point format, we could theoretically get away with using the C/C++ float data type ( assuming the underlying C/C++ compiler also uses this format, as most do on modern machines). However, you'll soon see that when doing floating-point calculations in software, we need to manipulate various fields within the floating-point format as bit strings and integer values. Therefore, it's more convenient to use a 32-bit unsigned integer type to hold the bit representation for our floating-point values. To avoid confusing our real values with actual integer values in a program, we'll define the following real data type (which assumes that unsigned longs are 32-bit values in your implementation of C/C++) and declare all our real variables using this type:

 typedef long unsigned real;

One advantage of using the same floating-point format that C/C++ uses for float values is that we can assign floating-point literal constants to our real variables, and we can do other floating-point operations such as input and output using existing library routines. However, one potential problem is that C/C++ will attempt to automatically convert between integer and floating-point formats if we use a real variable in a floating-point expression (remember, as far as C/C++ is concerned , real is just a long unsigned integer value). This means that we need to tell the compiler to treat the bit patterns found in our real variables as though they were float objects.

A simple type coercion like (float) realVariable will not work. The C/C++ compiler will emit code to convert the integer it believes realVariable to contain into the equivalent floating-point value. However, the bit pattern in realVariable is a floating-point value, so no conversion is required. We want the C/C++ compiler to treat the bit pattern it finds in realVariable as a float without doing any conversion. The following C/C++ macro is a sneaky way to do this:

 #define asreal(x) (*((float *) &x))

Note that this macro requires a single parameter that must be a real variable. The result of this macro is a variable that the compiler believes is a float variable.

Now that we have our float variable, we'll develop two C/C++ functions to compute floating-point addition and subtraction: fpadd and fpsub . These two functions will each take three parameters: the left and right operands of the operator and a pointer to a destination where these functions will store their result. The prototypes for these functions are the following:

 void fpadd(real left, real right, real *dest);  void fpsub(real left, real right, real *dest);

The fpsub function is almost trivial. All it has to do is negate the right operand and call the fpadd function to do the real work. Here's the code for the fpsub function:

 void fpsub(real left, real right, real *dest)  {      right = right ^ 0x80000000;   // Invert the sign bit of the right operand.      fpadd(left, right, dest);   // Let fpadd do the real work.  }

The fpadd function is where all the real work is done. To make fpadd a little easier to understand and maintain, we'll decompose the function into several different functions that help with various activities that take place. In an actual software floating-point library routine, you'd probably not do this decomposition because the extra subroutine calls would be a little slower; however, we're developing fpadd for educational purposes, not for actual use as part of a software floating-point library, and readability is a bit more important than performance in this particular instance. Besides, if you need high-performance floating-point addition, you'll probably use a hardware FPU rather than a software implementation.

The IEEE floating-point formats are good examples of packed data types. As you've seen in previous chapters, packed data types are great for reducing storage requirements for a data type, but they're not the best format when you need to use the packed fields in actual calculations. Therefore, one of the first things our floating-point functions will do is unpack the sign, exponent, and mantissa fields from the floating-point representation. The following C/C++ functions handle these simple tasks .

The first unpacking function is the extractSign function. This function extracts the sign bit (bit 31) from our packed floating-point representation and returns the value zero (for positive numbers) or one (for negative numbers ).

 inline int extractSign(real from)  {      return(from >> 31);  }

This code could have also extracted the sign bit using this (possibly more efficient) expression:

 (from & 0x80000000) != 0

However, shifting bit 31 down to bit 0 is, arguably, easier to understand.

The next utility function we'll look at unpacks the exponent from bits 23..30 in the packed real format. It does this by shifting the real value to the right by 23 bits, and then it masks out the sign bit. One other thing that this function will do is convert the excess-127 exponent to a two's complement format (this is easily achieved by subtracting 127 from the excess-127 exponent we extract). Here's the function that does this:

 inline int extractExponent(real from)  {      return ((from >> 23) & 0xff)   127;  }

Extracting the mantissa is easy. All we have to do is mask out the exponent and sign bits and then insert the implied HO bit of one. The only catch is that we must return zero if the entire value is zero. Here's the function that extracts the mantissa from the real value:

 inline int extractMantissa(real from)  {      if((from & 0x7fffffff) == 0) return 0;      return ((from & 0x7FFFFF)  0x800000);  }

As you learned earlier, whenever adding or subtracting two values using scientific notation (and the IEEE floating-point format uses scientific notation), you must first adjust the two values so that they have the same exponent. For example, consider the addition of the following two decimal (base-10) numbers: 1.2345e3 + 8.7654e1.

To add these two numbers together, we must first adjust one or the other so that their exponents are the same. We can reduce the exponent of the first number by shifting the decimal point to the right. For example, the following values are all equivalent to 1.2345e3:

 12.345e2 123.45e1 1234.5 12345e   1

Likewise, we can increase the value of an exponent by shifting the decimal point to the left. The following values are all equal to 8.7654e1:

 0.87654e2 0.087654e3 0.0087654e4

For floating-point addition and subtraction involving binary numbers, we canmake the binary exponents the same by shifting the mantissa one position to the left and decrementing the exponent, or by shifting the mantissa one position to the right and incrementing the exponent.

A problem with adjusting the exponent of one operand so that it matches the exponent of the other operand is that we only have so many bits to use to represent the mantissa. Shifting the mantissa bits to the right means that we reduce the precision of our number (because the bits wind up going off the LO end of the mantissa). To preserve as much accuracy as possible in our calculations, we shouldn't truncate the bits we shift out of the mantissa. As noted earlier, we should round the result to the nearest value we can represent with the remaining mantissa bits. These are the IEEE rules for rounding, in the following order:

Truncate the result if the last bit shifted out was a zero.
Bump the mantissa up by one if the last bit shifted out was a one and there was at least one bit set to one in all the other bits that were shifted out. ^[6]
If the last we shifted out was a one, and all the other bits were zeros, then round the resulting mantissa up by one if the mantissa's LO bit contains a one.

Shifting the mantissa and rounding it is a relatively complex operation, and it will occur a couple of times in the floating-point addition code. Therefore, it's another candidate for a utility function. Here's the C/C++ code that implements this functionality:

 // shiftAndRound:  //  // Shifts a mantissa to the right the number of bits specified.  // Rounds the result according to the IEEE rules for rounding,  //  which are:  //  // If the bits we shift out are a value that is greater than one-half the  //  value of the LO bit we are left with, then we need  //  to round the value up by adding one to the LO bit position.  // If the bits we shift out are a value that is less than one-half the value  //  of the LO bit we are left with (after denormalization), then we need  //  to round the value down (i.e., just leave the value alone).  // If the bits we shift out are exactly one-half the value of the LO bit  //  we are left with, then we need to round the value to the next larger  //  number that has a zero in the LO bit (round up if there's currently a one,  //  or leave the value unchanged if the LO bit contains a zero).  void shiftAndRound(int *valToShift, int bitsToShift)  {      // Masks is used to mask out bits to check for a "sticky" bit.      static unsigned masks[24] =      {          0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f,          0xff, 0x1ff, 0x3ff, 0x7ff, 0xfff, 0x1fff, 0x3fff, 0x7fff,          0xffff, 0x1ffff, 0x3ffff, 0x7ffff, 0xfffff, 0x1fffff, 0x3fffff,          0x7fffff      };      // HOmasks: Masks out the HO bit of the value masked by the masks entry.      static unsigned HOmasks[24] =      {          0,          1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,          0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000,          0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000      };      // shiftedOut: Holds the value that will be shifted out of a mantissa      // during the denormalization operation (used to round a denormalized      // value).      int shiftedOut;      assert(bitsToShift <= 23);      // Okay, first grab the bits we're going to shift out (so we can determine      // how to round this value after the shift).      shiftedOut = *valToShift & masks[ bitsToShift ];      // Shift the value to the right the specified number of bits.      // Note: bit 31 is always zero, so it doesn't matter if the C      // compiler does a logical shift right or an arithmetic shift right.      *valToShift = *valToShift >> bitsToShift;      // If necessary, round the value:      if((shiftedOut > HOmasks[ bitsToShift ])      {          // If the bits we shifted out are greater than 1/2 the LO bit, then          // round the value up by one.          *valToShift = *valToShift + 1;      }      else if(shiftedOut == HOmasks[ bitsToShift ])      {          // If the bits we shifted out are exactly 1/2 of the LO bit's value,          // then round the value to the nearest number whose LO bit is zero.          *valToShift = *valToShift + (*valToShift & 1);      }      // else      // We round the value down to the previous value. The current      // value is already truncated (rounded down), so we don't have to do      // anything.  }

The 'trick' in this code is that it uses a couple of lookup tables, masks and HOmasks , to extract those bits that the mantissa will use from the shift right operation. The masks table entries contain one bits (set bits) in the positions that will be lost during the shift. The HOmasks table entries contain a single set bit in the position specified by the index into the table; that is, the entry at index zero contains a one in bit position zero, the entry at index one contains a one in bit position one, and so on. This code selects an entry from each of these tables based on the number of mantissa bits it needs to shift to the right.

If the original mantissa value, logically ANDed with the appropriate entry in masks , is greater than the corresponding entry in HOmasks , then the shiftAndRound function rounds the shifted mantissa to the next greater value. If the ANDed mantissa value is equal to the corresponding HOmasks element, this code rounds the shifted mantissa value according to its LO bit (note that the expression (*valToShift & 1) produces one if the mantissa's LO bit is one, and it produces zero otherwise ). Finally, if the ANDed mantissa value is less than the entry from the HOmasks table, then this code doesn't have to do anything because the mantissa is already rounded down.

Once we've adjusted one of the values so that the exponents of both operands are the same, the next step in the addition algorithm is to compare the signs of the values. If the signs of the two operands are both the same, we can simply add their mantissas (using a standard integer add operation). If the signs of the operands are different, we have to subtract, rather than add, the mantissas. Because floating-point values use one's complement representation, and standard integer arithmetic uses two's complement, we cannot simply subtract the negative value from the positive value. Instead, we have to subtract the smaller value from the larger value and determine the sign of the result based on the signs and magnitudes of the original operands. Table 4-3 describes how to accomplish this.

Table 4-3: Dealing with Operands That Have Different Signs
Left Sign	Right Sign	Left Mantissa > Right Mantissa?	Compute Mantissa As	Result Sign Is
ˆ’	+	Yes	LeftMantissa ˆ’ RightMantissa	ˆ’
+	ˆ’	Yes	LeftMantissa ˆ’ RightMantissa	+
ˆ’	+	No	RightMantissa ˆ’ LeftMantissa	+
+	ˆ’	No	RightMantissa ˆ’ LeftMantissa	ˆ’

Whenever adding or subtracting two 24-bit numbers, it is possible to produce a result that requires 25 bits (this, in fact, is a common result when dealing with normalized values). Immediately after an addition or subtraction, the floating-point code has to check the result for overflow. If this has happened , it needs to shift the mantissa right by one bit, round the result, and then increment the exponent. After completing this step, all that remains is to pack the resulting sign, exponent, and mantissa fields into the packed 32-bit IEEE floating-point format and the addition (or subtraction) is complete. The following packFP function is responsible for packing the sign , exponent , and mantissa fields into the 32-bit floating-point format:

 // packFP:  //  // Packs the sign, exponent, and mantissa fields into a  // 32-bit "real" value. Works for normalized values, denormalized  // values, and zero, but does not work for NaNs and infinities.  inline real packFP(int sign, int exponent, int mantissa)  {     return          (real)          ((sign << 31)                 ((exponent + 127) << 23)                 (mantissa & 0x7fffff));  }

With the utility routines out of the way, it's time to take a look at the fpadd function, which adds two floating-point values, producting a 32-bit real result:

 // fpadd:  //  //   Computes:  //      dest = left + right  // where all three operands are "real" values (32-bit floats).   void fpadd(real left, real right, real *dest)  {      // The following variables hold the fields associated with the      // left operand:      int             Lexponent;      long unsigned   Lmantissa;      int             Lsign;      // The following variables hold the fields associated with the      // right operand:      int             Rexponent;      long unsigned   Rmantissa;      int             Rsign;      // The following variables hold the separate fields of the result:      int   Dexponent;      long  unsigned Dmantissa;      int   Dsign;      // Extract the fields so that they're easy to work with:      Lexponent = extractExponent(left);      Lmantissa = extractMantissa(left);      Lsign     = extractSign(left);      Rexponent = extractExponent(right);      Rmantissa = extractMantissa(right);      Rsign     = extractSign(right);      // Code to handle special operands (infinity and NaNs):      if(Lexponent == 127)      {          if(Lmantissa == 0)          {              // If the left operand is infinity, then the result              // depends upon the value of the right operand.              if(Rexponent == 127)              {                  // If the exponent is all one bits (127 after unbiasing)                  // then the mantissa determines if we have an infinity value                  // (zero mantissa), a QNaN (mantissa = 0x800000) or a SNaN                  // (nonzero mantissa not equal to 0x800000).                   if(Rmantissa == 0)  // Do we have infinity?                  {                      // infinity + infinity = infinity                      //   infinity   infinity =   infinity                      //   infinity + infinity = NaN                      // infinity   infinity = NaN                      if(Lsign == Rsign)                      {                          *dest = right;                      }                      else                      {                          *dest = 0x7fC00000;  // +QNaN                      }                  }                  else  // Rmantissa is nonzero, so it's a NaN                  {                      *dest = right;  // Right is a NaN, propagate it.                  }              }          }          else // Lmantissa is nonzero, Lexponent is all ones.          {              // If the left operand is some NaN, then the result will              // also be the same NaN.              *dest = left;          }          // We've already calculated the result, so just return.          return;      }     else if(Rexponent == 127)      {          // Two case: right is either a NaN (in which case we need to          // propagate the NaN regardless of left's value) or it is          // +/   infinity. Because left is a "normal" number, we'll also          // wind up propagating the infinity because any normal number          // plus infinity is infinity.          *dest = right;  // Right is a NaN, propagate it.          return;      }      // Okay, we've got two actual floating-point values. Let's add them      // together. First, we have to "denormalize" one of the operands if      // their exponents aren't the same (when adding or subtracting values,      // the exponents must be the same).      //      // Algorithm: choose the value with the smaller exponent. Shift its      // mantissa to the right the number of bits specified by the difference      // between the two exponents.      Dexponent = Rexponent;      if(Rexponent > Lexponent)      {          shiftAndRound(&Lmantissa, (Rexponent   Lexponent));      }      else if(Rexponent < Lexponent)      {          shiftAndRound(&Rmantissa, (Lexponent   Rexponent));          Dexponent = Lexponent;      }      // Okay, add the mantissas. There is one catch: if the signs are opposite      // then we've actually got to subtract one value from the other (because      // the FP format is one's complement, we'll subtract the larger mantissa      // from the smaller and set the destination sign according to a      // combination of the original sign values and the largest mantissa).      if(Rsign ^ Lsign)      {          // Signs are different, must subtract one value from the other.          if(Lmantissa > Rmantissa)          {              // The left value is greater, so the result inherits the              // sign of the left operand.              Dmantissa = Lmantissa   Rmantissa;              Dsign = Lsign;          }          else          {              // The right value is greater, so the result inherits the              // sign of the right operand.              Dmantissa = Rmantissa   Lmantissa;              Dsign = Rsign;          }      }      else      {          // Signs are the same, so add the values:          Dsign = Lsign;          Dmantissa = Lmantissa + Rmantissa;      }      // Normalize the result here.      //      // Note that during addition/subtraction, overflow of one bit is possible.      // deal with that possibility here (if overflow occurred, shift the      // mantissa to the right one position and adjust for this by incrementing      // the exponent). Note that this code returns infinity if overflow occurs      // when incrementing the exponent (infinity is a value with an exponent      // of $FF);     if(Dmantissa >= 0x1000000)      {          // Never more than one extra bit when doing addition/subtraction.          // Note that by virtue of the floating-point format we're using,          // the maximum value we can produce via addition or subtraction is          // a mantissa value of 0x1fffffe. Therefore, when we round this          // value it will not produce an overflow into the 25th bit.          shiftAndRound(&Dmantissa, 1); // Move result into 24 bits.          ++Dexponent;                    // Shift operation did a div by two,                                          // this counteracts the effect of                                          // the shift (incrementing exponent                                          // multiplies the value by two).      }      else      {          // If the HO bit is clear, normalize the result          // by shifting bits up and simultaneously decrementing          // the exponent. We will treat zero as a special case          // because it's a common enough result.          if(Dmantissa != 0)          {              // The while loop multiplies the mantissa by two (via a shift              // left) and then divides the whole number by two (by              // decrementing the exponent. This continues until the HO bit of              // Dmantissa is set or the exponent becomes   127 (zero in the              // biased-127 form). If Dexponent drops down to   128, then we've              // got a denormalized number and we can stop.              while((Dmantissa < 0x800000) && (Dexponent >   127))              {                  Dmantissa = Dmantissa << 1;                  --Dexponent;              }          }          else          {              // If the mantissa went to zero, clear everything else, too.              Dsign = 0;              Dexponent = 0;          }      }     // Reconstruct the result and store it away:      *dest = packFP(Dsign, Dexponent, Dmantissa);  }

To conclude this discussion of the software implementation of the fpadd and fsub functions, here's a C main function that demonstrates the use of these functions:

 // A simple main program that does some trivial tests on fpadd and fpsub.  int main(int argc, char **argv)  {      real l, r, d;      asreal(l) = 1.0;      asreal(r) = 2.0;      fpadd(l, r, &d);      printf("dest = %x\n", d);      printf("dest = %12E\n", asreal(d));      l = d;      asreal(r) = 4.0;      fpsub(l, r, &d);      printf("dest2 = %x\n", d);      printf("dest2 = %12E\n", asreal(d));  }

4.7.3 Floating-Point Multiplication and Division

Most software floating-point libraries are actually written in hand-optimized assembly language, not in a high-level language. As the previous section shows, it's perfectly possible to write floating-point routines in a high-level language and, particularly in the case of single-precision floating-point addition and subtraction, you could actually write the code efficiently . Given the right library routines, it's also possible to write the floating-point multiplication and division routines in a high-level language. This section presents an HLA implementation of the single-precision floating-point multiplication and division algorithms, however, because it turns out that their implementation is actually easier in assembly language than in a high-level language like C/C++.

The HLA code in this section implements two functions, fpmul and fpdiv , that have the following prototypes:

 procedure fpmul(left:real32; right:real32);  @returns("eax");  procedure fpdiv(left:real32; right:real32);  @returns("eax");

Beyond the fact that this code is written in assembly language rather than C, there are two main differences you should note between the code in this section and the code in the previous section. First, the HLA code uses the built-in real32 data type rather than creating a new data type for our real values. This code can do that because we can easily coerce any 32-bit memory object to real32 or dword in assembly language. Therefore, there is no reason to play games with the data types. The second thing you'll notice about these prototypes is that they only support two parameters; there is no destination parameter. These functions simply return the real32 result in the EAX register. ^[7]

4.7.3.1 Floating-Point Multiplication

Whenever you multiply two values in scientific notation, you compute the result sign, exponent, and mantissa as follows :

The result sign is the exclusive-OR of the operand signs. That is, the result is positive if both operand signs were the same, and the result sign is negative if the operand signs were different.
The result exponent is the sum of the operands' exponents.
The result mantissa is the integer (fixed-point) product of the two operand mantissas.

Beyond these rules, there are a few additional rules that affect the floating-point multiplication algorithm that are a direct result of the IEEE floating-point format:

If either, or both, of the operands are zero, the result is zero (this is a special case because the representation for zero is special).
If either operand is infinity, the result is infinity.
If either operand is a NaN, the result is that same NaN.

The fpmul procedure begins by checking the operands to see if either of them is zero. If so, the function immediately returns a 0.0 result to the caller. Next, the fpmul code checks for NaN or infinity values in the left and right operands. If it finds one of these values, the fpmul procedure returns that same value to the caller.

If both of the fpmul operands are reasonable floating-point values, then the fpmul code extracts the sign, exponent, and mantissa fields of the packed floating-point value. Actually, 'extract' isn't the correct term for fpmul ; isolate is probably a better description of what this code does to the sign and exponent fields. Look at the code that isolates the sign bits of the two operands and computes the result sign:

 mov((type dword left), ebx);  // Result sign is the XOR of the  xor((type dword right), ebx); //  operand signs.  and(00_0000, ebx);         // Keep only the sign bit.

This code exclusive-ORs the HO bits of the two operands (as well as all the other bits) and then masks out bits 0..30, leaving only the result sign value in bit 31 of the EBX register. This procedure doesn't bother moving the sign bit down to bit 0 (as you'd normally do when unpacking data) because it would just have to move this bit back to bit 31 when it repacks the floating-point value later.

The fpmul procedure uses the same trick when processing the exponent. It simply isolates bits 23..30 and operates on the exponent in place. When multiplying two values using scientific notation, you must add the values of the exponents together. Note, however, that the floating-point exponents use an excess-127 format; simply adding the exponents together creates a problem because the bias winds up being added twice. Therefore, the exponent-processing code must subtract 127 from the exponent's sum first. The following code isolates the exponent bits, adjusts for the extra bias, and adds the exponents together:

 mov((type dword left), ecx);    // Exponent goes into bits 23..30  and(f80_0000, ecx);           // of ECX; mask these bits.  sub(126 << 23, ecx);            // Eliminate the bias of 127.  mov((type dword right), eax);  and(f80_0000, eax);  // For multiplication, we need to add the exponents:  add(eax, ecx);                  // Exponent value is now in bits                                    // 23..30 of ECX.

First, you'll notice that this code subtracts 126 rather than 127 (the value you'd normally expect to have to subtract in order to eliminate the extra bias). The reason for this is that later on we will need to double the result of the multiplication of the mantissas. Subtracting 126 rather than 127 does this multiplication by two implicitly for us (saving an instruction later on).

If the sum of the exponents with add(eax, ecx) , above, is too large to fit into eight bits, there will be a carry out of bit 30 into bit 31 of ECX, which will set the 80x86 overflow flag. If overflow occurs on a multiplication, our code will return infinity as the result.

If overflow does not occur, then the fpmul procedure needs to set the implied HO bit of the two mantissa values. The following code handles this chore, as well as stripping out all the exponent and sign bits from the mantissas. This code also left justifies the mantissa bits up against bit position 31 in EAX and EDX.

 mov((type dword left), eax);  mov((type dword right), edx);  // If we don't have a zero value then set the implied HO bit of the mantissa:  if(eax <> 0) then      or(_0000, eax);   // Set the implied bit to one.  endif;  shl(8, eax);   // Moves mantissa to bits 8..31 and removes sign/exp.  // Repeat this for the right operand.  if(edx <> 0) then      or(_0000, edx);  endif;  shl(8, edx);

Once this code shifts the mantissas to bit 31 in EAX and EDX, it does the multiplication by using the 80x86 mul instruction:

 mul(edx);

This instruction computes the 64-bit product of EAX and EDX, leaving the product in EDX:EAX (the HO double word is in EDX, and the LO double word is in EAX). Note that the product of any two n -bit integers produces a number that could require as many as 2* n bits. That's why the mul instruction computes EDX:EAX = EAX*EDX. Left justifying the mantissas in EAX and EDX before doing the multiplication is what ensures the mantissa of the product winds up in bits 7..30 of EDX (it would have been nice to have them wind up in bit positions 8..31 of EDX, but fixed-point multiplication winds up shifting the value down one bit in this case; that's why this code only subtracted 126 when adjusting for the excess-127 value). As these numbers were normalized prior to the multiplication, bit 30 of EDX will contain a one after the multiplication unless the result is zero. Note that the 32-bit IEEE real format does not support denormalized values, so we don't have to worry about this case when using 32-bit floating-point values.

Because the mantissas were actually 24 bits each, the product of the mantissas that the mul instruction produces could have as many as 48 significant bits. However, our result mantissa can only hold 24 bits, so we need to round the value to produce a 24-bit result (using, of course, the IEEE rounding algorithm - see Section 4.4, 'Rounding'). Here's the code that rounds the value in EDX to 24 significant bits (in positions 8..31):

 test(, edx);  // Clears zero flag if bit seven of EDX = 1.  if(@nz) then      add($FFFF_FFFF, eax);   // Sets carry if EAX <> 0.      adc(f, dl);           // Sets carry if DL:EAX > _0000_0000.      if(@c) then          // If DL:EAX > _0000_0000 then round the mantissa          // up by adding one to bit position eight:          add(1 << 8, edx);      else // DL:EAX = _0000_0000          // We need to round to the value that has a zero          // in bit position zero of the mantissa (bit #8 of EDX):          test(8, edx); // Clears zero flag if bit #8 contains a one.          if(@nz) then              add(1 << 8, edx); // Adds a one starting at bit position eight.              // If there was an overflow, renormalize:              if(@c) then                  rcr(1, edx);   // Shift overflow (in carry) back into EDX.                  inc(ecx);      // Shift did a divide by two. Fix that.          endif;          endif;      endif;  endif;

An interesting thing to note about this rounding code is that it may need to renormalize the number after rounding. If the mantissa contains all one bits and needs to be rounded up, this will produce an overflow out of the HO bit of the mantissa. The rcr and inc instructions at the end of this code sequence put the overflow bit back into the mantissa if overflow occurs.

The only thing left for fpmul to do after this is pack the destination sign, exponent, and mantissa into the 32-bit EAX register. The following code does this:

 shr(8, edx);          // Move mantissa into bits 0..23.  and(f_ffff, edx);   // Clear the implied bit.  lea(eax, [edx+ecx]);  // Merge mantissa and exponent into EAX.  or(ebx, eax);         // Merge in the sign.

The only tricky thing in this code is the use of the lea (load effective address) instruction to compute the sum of EDX (the mantissa) and ECX (the exponent) and move the result to EAX all with a single instruction.

4.7.3.2 Floating-Point Division

Floating-point division is a little bit more involved than multiplication because the IEEE floating-point standard says many things about degenerate conditions that can occur during division. We're not going to discuss all the code that handles those conditions here. Instead, see the discussion of the conditions for fpmul earlier, and check out the complete code listing for fdiv later in this section.

Assuming we have reasonable numbers to divide, the division algorithm first computes the result sign using the same algorithm (and code) as for multiplying. When dividing two values using scientific notation, we have to subtract their exponents. Unlike the multiplication algorithm, it's going to be more convenient to truly unpack the exponents for the two division operands and convert them from excess-127 to two's complement form. Here's the code that does this:

 mov((type dword left), ecx);  // Exponent comes from bits 23..30.  shr(23, ecx);  and($ff, ecx);                // Mask out the sign bit (in bit 8).  mov((type dword right), eax);  shr(23, eax);  and($ff, eax);  // Eliminate the bias from the exponents:  sub(127, ecx);  sub(127, eax);  // For division, we need to subtract the exponents:  sub(eax, ecx);                // Leaves result exponent in ECX.

The 80x86 div instruction absolutely , positively requires the quotient to fit into 32 bits. If this condition is not true, the CPU may abort the operation with a divide exception. As long as the HO bit of the divisor contains a one and the HO two bits of the dividend contain %01, we will not get a division error. Here's the code the prepares the operands prior to the division operation:

 mov (type dword left), edx);  if(edx <> 0) then      or(_0000, edx);    // Set the implied bit to one in the left operand.      shl(8, edx);  endif;  mov((type dword right), edi);  if(edi <> 0) then      or(_0000);        // Set the implied bit to one in the right operand.      shl(8, edi);  else  // Division by zero error, here.  endif;

The next step is to actually do the division. As noted earlier, in order to prevent a division error, we have to shift the dividend one bit to the right (to set the HO two bits to %01). The code that does this shift and then the division is as follows:

 xor(eax, eax);    // EAX := 0;  shr(1, edx);      // Shift EDX:EAX to the right one bit to  rcr(1, eax);      // prevent a division error.  div(edi);         // Compute EAX = EDX:EAX / EDI.

Once the div instruction executes, the quotient is sitting in the HO 24 bits of EAX, and the remainder is in AL:EDX. We now need to normalize and round the result. Rounding is a little easier because AL:EDX contains the remainder after the division; it will contain a value less than $80:0000_0000 (that is, the 80x86 AL register contains $80 and EDX contains zero) if we need to round down, it will contain a value greater than $80:0000_0000 if we need to round up, and it will contain exactly $80:0000_0000 if we need to round to the nearest value.

Here's the code that does this:

 test(, al);    // See if the bit just below the LO bit of the  if(@nz) then      //  mantissa contains a zero or one.      // Okay, the bit just below the LO bit of our mantissa contains a one.      // If all other bits below the mantissa and this bit contain zeros,      // we have to round to the nearest mantissa value whose LO bit is zero.      test(f, al);            // Clears zero flag if bits 0..6 <> 0.      if(@nz  edx <> 0) then  // If bits 0..6 in AL are zero and EDX                                  // is zero.          // We need to round up:          add(0, eax);  // Mantissa starts in bit #8);          if(@c) then      // Carry set if mantissa overflows.              // If there was an overflow, renormalize.              rcr(1, eax);              inc(ecx);          endif;      else          // The bits below the mantissa are exactly 1/2 the value          // of the LO mantissa bit. So we need to round to the value          // that has a LO mantissa bit of zero:          test(0, eax);          if(@nz) then              add(0, eax);              if(@c) then                  // If there was an overflow, renormalize.                  rcr(1, eax);  // Put overflow bit back into EAX.                  inc(ecx);     // Adjust exponent accordingly.              endif;          endif;      endif;  endif;

The last step in fpdiv is to add the bias back into the exponent (and verify that overflow doesn't occur) and then pack the quotient's sign, exponent, and mantissa fields into the 32-bit floating-point format. Here's the code that does this:

 if((type int32 ecx) > 127) then      mov($ff   127, ecx);   // Set exponent value for infinity      xor(eax, eax);       // because we just had overflow.  elseif((type int32 ecx) <   128) then      mov(   127, ecx);      // Return zero for underflow (note that      xor(eax, eax);       // next we add 127 to ECX).  endif;  add(127, ecx);           // Add the bias back in.  shl(23, ecx);            // Move the exponent to bits 23..30.  // Okay, assemble the final real32 value:  shr(8, eax);             // Move mantissa into bits 0..23.  and(f_ffff, eax);      // Clear the implied bit.  or(ecx, eax);            // Merge mantissa & exponent into EAX.  or(ebx, eax);            // Merge in the sign.

Whew! This has been a lot of code. However, it's worthwhile to go through all this just to see how floating-point operations work (so you can gain an appreciation of exactly what an FPU is doing for you).

^[6] If the algorithm only shifts out a single bit, then you assume that 'all the other bits' are zeros.

^[7] Those who know a little 80x86 assembly language may wonder if it's legal to return a floating-point value in an integer register. Of course it is! EAX can hold any 32-bit value, not just integers. Presumably, if you're writing a software-based floating-point package, you don't have floating-point hardware available and, therefore, you can't pass floating-point values around in the floating-point registers.