2.13 An Introduction to Floating Point Arithmetic

Integer arithmetic does not let you represent fractional numeric values. Therefore, modern CPUs support an approximation of real arithmetic: floating point arithmetic. A big problem with floating point arithmetic is that it does not follow the standard rules of algebra. Nevertheless, many programmers apply normal algebraic rules when using floating point arithmetic. This is a source of defects in many programs. One of the primary goals of this section is to describe the limitations of floating point arithmetic so you will understand how to use it properly.

Normal algebraic rules apply only to infinite precision arithmetic. Consider the simple statement "x:=x+1", x is an integer. On any modern computer this statement follows the normal rules of algebra as long as overflow does not occur. That is, this statement is valid only for certain values of x (minint <= x < maxint). Most programmers do not have a problem with this because they are well aware of the fact that integers in a program do not follow the standard algebraic rules (e.g., 5/2 ≠ 2.5).

Integers do not follow the standard rules of algebra because the computer represents them with a finite number of bits. You cannot represent any of the (integer) values above the maximum integer or below the minimum integer. Floating point values suffer from this same problem, only worse. After all, the integers are a subset of the real numbers. Therefore, the floating point values must represent the same infinite set of integers. However, there are an infinite number of values between any two real values, so this problem is infinitely worse. Therefore, as well as having to limit your values between a maximum and minimum range, you cannot represent all the values between those two ranges, either.

To represent real numbers, most floating point formats employ scientific notation and use some number of bits to represent a mantissa and a smaller number of bits to represent an exponent. The end result is that floating point numbers can only represent numbers with a specific number of significant digits. This has a big impact on how floating point arithmetic operates. To easily see the impact of limited precision arithmetic, we will adopt a simplified decimal floating point format for our examples. Our floating point format will provide a mantissa with three significant digits and a decimal exponent with two digits. The mantissa and exponents are both signed values, as shown in Figure 2-23.

Figure 2-23: Long Packed Date Format (Four Bytes).

When adding and subtracting two numbers in scientific notation, you must adjust the two values so that their exponents are the same. For example, when adding 1.23e1 and 4.56e0, you must adjust the values so they have the same exponent. One way to do this is to convert 4.56e0 to 0.456e1 and then add. This produces 1.686e1. Unfortunately, the result does not fit into three significant digits, so we must either round or truncate the result to three significant digits. Rounding generally produces the most accurate result, so let's round the result to obtain 1.69e1. As you can see, the lack of precision (the number of digits or bits we maintain in a computation) affects the accuracy (the correctness of the computation).

In the previous example, we were able to round the result because we maintained four significant digits during the calculation. If our floating point calculation is limited to three significant digits during computation, we would have had to truncate the last digit of the smaller number, obtaining 1.68e1, a value that is even less correct. To improve the accuracy of floating point calculations, it is necessary to add extra digits for use during the calculation. Extra digits available during a computation are known as guard digits (or guard bits in the case of a binary format). They greatly enhance accuracy during a long chain of computations.

The accuracy loss during a single computation usually isn't enough to worry about unless you are greatly concerned about the accuracy of your computations. However, if you compute a value that is the result of a sequence of floating point operations, the error can accumulate and greatly affect the computation itself. For example, suppose we were to add 1.23e3 with 1.00e0. Adjusting the numbers so their exponents are the same before the addition produces 1.23e3 + 0.001e3. The sum of these two values, even after rounding, is 1.23e3. This might seem perfectly reasonable to you; after all, we can only maintain three significant digits, adding in a small value shouldn't affect the result at all. However, suppose we were to add 1.00e0 to 1.23e3 ten times. The first time we add 1.00e0 to 1.23e3 we get 1.23e3. Likewise, we get this same result the second, third, fourth …, and tenth time we add 1.00e0 to 1.23e3. On the other hand, had we added 1.00e0 to itself ten times, then added the result (1.00e1) to 1.23e3, we would have gotten a different result, 1.24e3. This is an important thing to know about limited precision arithmetic:

The order of evaluation can affect the accuracy of the result.

You will get more accurate results if the relative magnitudes (that is, the exponents) are close to one another when adding and subtracting floating point values. If you are performing a chain calculation involving addition and subtraction, you should attempt to group the values appropriately.

Another problem with addition and subtraction is that you can wind up with false precision. Consider the computation 1.23e0 - 1.22 e0. This produces 0.01e0. Although this is mathematically equivalent to 1.00e - 2, this latter form suggests that the last two digits are exactly zero. Unfortunately, we've only got a single significant digit at this time. Indeed, some FPUs or floating point software packages might actually insert random digits (or bits) into the L.O. positions. This brings up a second important rule concerning limited precision arithmetic:

Whenever subtracting two numbers with the same signs or adding two numbers with different signs, the accuracy of the result may be less than the precision available in the floating point format.

Multiplication and division do not suffer from the same problems as addition and subtraction because you do not have to adjust the exponents before the operation; all you need to do is add the exponents and multiply the mantissas (or subtract the exponents and divide the mantissas). By themselves, multiplication and division do not produce particularly poor results. However, they tend to multiply any error that already exists in a value. For example, if you multiply 1.23e0 by two, when you should be multiplying 1.24e0 by two, the result is even less accurate. This brings up a third important rule when working with limited precision arithmetic:

When performing a chain of calculations involving addition, subtraction, multiplication, and division, try to perform the multiplication and division operations first.

Often, by applying normal algebraic transformations, you can arrange a calculation so the multiply and divide operations occur first. For example, suppose you want to compute x*(y+z). Normally you would add y and z together and multiply their sum by x. However, you will get a little more accuracy if you transform x*(y+z) to get x*y+x*z and compute the result by performing the multiplications first.^[7]

Multiplication and division are not without their own problems. When multiplying two very large or very small numbers, it is quite possible for overflow or underflow to occur. The same situation occurs when dividing a small number by a large number or dividing a large number by a small number. This brings up a fourth rule you should attempt to follow when multiplying or dividing values:

When multiplying and dividing sets of numbers, try to arrange the multiplications so that they multiply large and small numbers together; likewise, try to divide numbers that have the same relative magnitudes.

Comparing floating point numbers is very dangerous. Given the inaccuracies present in any computation (including converting an input string to a floating point value), you should never compare two floating point values to see if they are equal. In a binary floating point format, different computations that produce the same (mathematical) result may differ in their least significant bits. For example, adding 1.31e0+1.69e0 should produce 3.00e0. Likewise, adding 1.50e0+1.50e0 should produce 3.00e0. However, were you to compare (1.31e0+1.69e0) against (1.50e0+1.50e0) you might find out that these sums are not equal to one another. The test for equality succeeds if and only if all bits (or digits) in the two operands are exactly the same. Because this is not necessarily true after two different floating point computations that should produce the same result, a straight test for equality may not work.

The standard way to test for equality between floating point numbers is to determine how much error (or tolerance) you will allow in a comparison and check to see if one value is within this error range of the other. The straightforward way to do this is to use a test like the following:

      if Value1 >= (Value2-error) and Value1 <= (Value2+error) then ...

Another common way to handle this same comparison is to use a statement of the form:

 if abs(Value1-Value2) <= error then ...

You must exercise care when choosing the value for error. This should be a value slightly greater than the largest amount of error that will creep into your computations. The exact value will depend upon the particular floating point format you use, but more on that a little later. The final rule we will state in this section is

When comparing two floating point numbers, always compare one value to see if it is in the range given by the second value plus or minus some small error value.

There are many other little problems that can occur when using floating point values. This text can only point out some of the major problems and make you aware of the fact that you cannot treat floating point arithmetic like real arithmetic: The inaccuracies present in limited precision arithmetic can get you into trouble if you are not careful. A good text on numerical analysis or even scientific computing can help fill in the details that are beyond the scope of this text. If you are going to be working with floating point arithmetic, in any language, you should take the time to study the effects of limited precision arithmetic on your computations.

HLA's if statement does not support boolean expressions involving floating point operands. Therefore, you cannot use statements like "if( x < 3.141) then …" in your programs. In a later chapter that discusses floating point operations on the 80x86 you'll learn how to do floating point comparisons.

2.13.1 IEEE Floating Point Formats

When Intel planned to introduce a floating point unit (FPU) for its new 8086 microprocessor, it was smart enough to realize that the electrical engineers and solid-state physicists who design chips were, perhaps, not the best people to do the necessary numerical analysis to pick the best possible binary representation for a floating point format. So Intel went out and hired the best numerical analyst it could find to design a floating point format for its 8087 FPU. That person then hired two other experts in the field and the three of them (Kahn, Coonan, and Stone) designed Intel's floating point format. They did such a good job designing the KCS Floating Point Standard that the IEEE organization adopted this format for the IEEE floating point format.^[8]

To handle a wide range of performance and accuracy requirements, Intel actually introduced three floating point formats: single precision, double precision, and extended precision. The single and double precision formats corresponded to C's float and double types or FORTRAN's real and double precision types. Intel intended to use extended precision for long chains of computations. Extended precision contains 16 extra bits that the calculations could use as guard bits before rounding down to a double precision value when storing the result.

The single precision format uses a one's complement 24-bit mantissa and an 8-bit excess-127 exponent. The mantissa usually represents a value between 1.0 to just under 2.0. The H.O. bit of the mantissa is always assumed to be one and represents a value just to the left of the binary point.^[9] The remaining 23 mantissa bits appear to the right of the binary point. Therefore, the mantissa represents the value:

 1.mmmmmmm mmmmmmmm mmmmmmmm

The "mmmm …" characters represent the 23 bits of the mantissa. Keep in mind that we are working with binary numbers here. Therefore, each position to the right of the binary point represents a value (zero or one) times a successive negative power of two. The implied one bit is always multiplied by 2⁰, which is one. This is why the mantissa is always greater than or equal to one. Even if the other mantissa bits are all zero, the implied one bit always gives us the value one.^[10] Of course, even if we had an almost infinite number of one bits after the binary point, they still would not add up to two. This is why the mantissa can represent values in the range one to just under two.

Although there are an infinite number of values between one and two, we can only reperesent eight million of them because we use a 23-bit mantissa (the 24th bit is always one). This is the reason for inaccuracy in floating point arithmetic—we are limited to 23 bits of precision in computations involving single precision floating point values.

The mantissa uses a one's complement format rather than two's complement. This means that the 24th-bit value of the mantissa is simply an unsigned binary number and the sign bit determines whether that value is positive or negative. One's complement numbers have the unusual property that there are two representations for zero (with the sign bit set or clear). Generally, this is important only to the person designing the floating point software or hardware system. We will assume that the value zero always has the sign bit clear.

To represent values outside the range 1.0 to just under 2.0, the exponent portion of the floating point format comes into play. The floating point format raises two to the power specified by the exponent and then multiplies the mantissa by this value. The exponent is 8 bits and is stored in an excess-127 format. In excess-127 format, the exponent 2⁰ is represented by the value 127 ($7F). Therefore, to convert an exponent to excess-127 fomat simply add 127 to the exponent value. The use of excess-127 format makes it easier to compare floating point values. The single precision floating point format takes the form shown in Figure 2-24.

click to expand
Figure 2-24: Single Precision (32-bit) Floating Point Format

With a 24-bit mantissa, you will get approximately 6 1/2 digits of precision (one-half digit of precision means that the first six digits can all be in the range 0..9 but the seventh digit can only be in the range 0..x where x < 9 and is generally close to five). With an 8-bit excess-127 exponent, the dynamic range of single precision floating point numbers is approximately 2¹²⁸ or about 10³⁸.

Although single precision floating point numbers are perfectly suitable for many applications, the dynamic range is somewhat limited and is unsuitable for many financial, scientific, and other applications. Furthermore, during long chains of computations, the limited accuracy of the single precision format may introduce serious error.

The double precision format helps overcome the problems of single precision floating point. Using twice the space, the double precision format has an 11-bit excess-1023 exponent and a 53-bit mantissa (with an implied H.O. bit of one) plus a sign bit. This provides a dynamic range of about 10³⁰⁸ and 14 1/2 digits of precision, sufficient for most applications. Double precision floating point values take the form shown in Figure 2-25.

click to expand
Figure 2-25: 64-Bit Double Precision Floating Point Format.

In order to help ensure accuracy during long chains of computations involving double precision floating point numbers, Intel designed the extended precision format. The extended precision format uses 80 bits. Twelve of the additional 16 bits are appended to the mantissa, four of the additional bits are appended to the end of the exponent. Unlike the single and double precision values, the extended precision format's mantissa does not have an implied H.O. bit, which is always one. Therefore, the extended precision format provides a 64-bit mantissa, a 15-bit excess-16383 exponent, and a 1-bit sign. The format for the extended precision floating point value is shown in Figure 2-26 on the following page.

Figure 2-26: 80-Bit Extended Precision Floating Point Format.

On the FPUs all computations are done using the extended precision form. Whenever you load a single or double precision value, the FPU automatically converts it to an extended precision value. Likewise, when you store a single or double precision value to memory, the FPU automatically rounds the value down to the appropriate size before storing it. By always working with the extended precision format, Intel guarantees a large number of guard bits are present to ensure the accuracy of your computations. Some texts erroneously claim that you should never use the extended precision format in your own programs, because Intel only guarantees accurate computations when using the single or double precision formats. This is foolish. By performing all computations using 80 bits, Intel helps ensure (but not guarantee) that you will get full 32- or 64-bit accuracy in your computations. Because the FPUs do not provide a large number of guard bits in 80-bit computations, some error will inevitably creep into the L.O. bits of an extended precision computation. However, if your computation is correct to 64 bits, the 80-bit computation will always provide at least 64 accurate bits. Most of the time you will get even more. While you cannot assume that you get an accurate 80-bit computation, you can usually do better than 64 bits when using the extended precision format.

To maintain maximum precision during computation, most computations use normalized values. A normalized floating point value is one whose H.O. mantissa bit contains one. Almost any non-normalized value can be normalized; shift the mantissa bits to the left and decrement the exponent until a one appears in the H.O. bit of the mantissa. Remember, the exponent is a binary exponent. Each time you increment the exponent, you multiply the floating point value by two. Likewise, whenever you decrement the exponent, you divide the floating point value by two. By the same token, shifting the mantissa to the left one bit position multiplies the floating point value by two; likewise, shifting the mantissa to the right divides the floating point value by two. Therefore, shifting the mantissa to the left one position and decrementing the exponent does not change the value of the floating point number at all.

Keeping floating point numbers normalized is beneficial because it maintains the maximum number of bits of precision for a computation. If the H.O. bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore, a floating point computation will be more accurate if it involves only normalized values.

There are two important cases where a floating point number cannot be normalized. Zero is a one of these special cases. Obviously it cannot be normalized because the floating point representation for zero has no one bits in the mantissa. This, however, is not a problem because we can exactly represent the value zero with only a single bit.

The second case is when we have some H.O. bits in the mantissa that are zero but the biased exponent is also zero (and we cannot decrement it to normalize the mantissa). Rather than disallow certain small values, whose H.O. mantissa bits and biased exponent are zero (the most negative exponent possible), the IEEE standard allows special denormalized values to represent these smaller values.^[11] Although the use of denormalized values allows IEEE floating point computations to produce better results than if underflow occurred, keep in mind that denormalized values offer fewer bits of precision.

2.13.2 HLA Support for Floating Point Values

HLA provides several data types and library routines to support the use of floating point data in your assembly language programs. These include built-in types to declare floating point variables as well as routines that provide floating point input, output, and conversion.

Perhaps the best place to start when discussing HLA's floating point facilities is with a description of floating point literal constants. HLA floating point constants allow the following syntax:

An optional "+" or "-" symbol, denoting the sign of the mantissa (if this is not present, HLA assumes that the mantissa is positive)
Followed by one or more decimal digits
Optionally followed by a decimal point and one or more decimal digits
Optionally followed by an "e" or "E", optionally followed by a sign ("+" or "-") and one or more decimal digits

Note that the decimal point or the "e"/"E" must be present in order to differentiate this value from an integer or unsigned literal constant. Here are some examples of legal literal floating point constants:

 1.234 3.75e2 -1.0 1.1e-1 1e+4 0.1 -123.456e+789 +25e0

Notice that a floating point literal constant cannot begin with a decimal point; it must begin with a decimal digit so you must use "0.1" to represent ".1" in your programs.

HLA also allows you to place an underscore character ("_") between any two consecutive decimal digits in a floating point literal constant. You may use the underscore character in place of a comma (or other language-specific separator character) to help make your large floating point numbers easier to read. Here are some examples:

 1_234_837.25 1_000.00 789_934.99 9_999.99

To declare a floating point variable you use the real32, real64, or real80 data types. Like their integer and unsigned brethren, the number at the end of these data type declarations specifies the number of bits used for each type's binary representation. Therefore, you use real32 to declare single precision real values, real64 to declare double precision floating point values, and real80 to declare extended precision floating point values. Other than the fact that you use these types to declare floating point variables rather than integers, their use is nearly identical to that for int8, int16, int32, and so on. The following examples demonstrate these declarations and their syntax:

 static          fltVar1:    real32;          fltVar1a:   real32 := 2.7;          pi:         real32 := 3.14159;          DblVar:     real64;          DblVar2:    real64 := 1.23456789e+10;          XPVar:      real80;          XPVar2:     real80 := -1.0e-104;

To output a floating point variable in ASCII form, you would use one of the stdout.putr32, stdout.putr64, or stdout.putr80 routines. These procedures display a number in decimal notation — that is, a string of digits, an optional decimal point and a closing string of digits. Other than their names, these three routines use exactly the same calling sequence. Here are the calls and parameters for each of these routines:

 stdout.putr80( r:real80; width:uns32; decpts:uns32 ); stdout.putr64( r:real64; width:uns32; decpts:uns32 ); stdout.putr32( r:real32; width:uns32; decpts:uns32 );

The first parameter to these procedures is the floating point value you wish to print. The size of this parameter must match the procedure's name (e.g., the r parameter must be an 80-bit extended precision floating point variable when calling the stdout.putr80 routine). The second parameter specifies the field width for the output text; this is the number of print positions the number will require when the procedure displays it. Note that this width must include print positions for the sign of the number and the decimal point. The third parameter specifies the number of print positions after the decimal point. For example,

 stdout.putr32( pi, 10, 4 );

displays the value

 _ _ _ _ 3.1416

(the underscores represent leading spaces in this example).

Of course, if the number is very large or very small, you will want to use scientific notation rather than decimal notation for your floating point numeric output. The HLA Standard Library stdout.pute32, stdout.pute64, and stdout.pute80 routines provide this facility. These routines use the following procedure prototypes:

 stdout.pute80( r:real80; width:uns32 ); stdout.pute64( r:real64; width:uns32 ); stdout.pute32( r:real32; width:uns32 );

Unlike the decimal output routines, these scientific notation output routines do not require a third parameter specifying the number of digits after the decimal point to display. The width parameter, indirectly, specifies this value because all but one of the mantissa digits always appear to the right of the decimal point. These routines output their values in decimal notation, similar to the following:

 1.23456789e+10 -1.0e-104 1e+2

You can also output floating point values using the HLA Standard Library stdout.put routine. If you specify the name of a floating point variable in the stdout.put parameter list, the stdout.put code will output the value using scientific notation. The actual field width varies depending on the size of the floating point variable (the stdout.put routine attempts to output as many significant digits as possible, in this case). Example:

 stdout.put( "XPVar2 = ", XPVar2 );

If you specify a field width specification, by using a colon followed by a signed integer value, then the stdout.put routine will use the appropriate stdout.puteXX routine to display the value. That is, the number will still appear in scientific notation, but you get to control the field width of the output value. Like the field width for integer and unsigned values, a positive field width right justifies the number in the specified field, a negative number left justifies the value. Here is an example that prints the XPVar2 variable using ten print positions:

 stdout.put( "XPVar2 = ", XPVar2:10 );

If you wish to use stdout.put to print a floating point value in decimal notation, you need to use the following syntax:

 Variable_Name : Width : DecPts

Note that the DecPts field must be a non-negative integer value.

When stdout.put contains a parameter of this form, it calls the corresponding stdout.putrXX routine to display the specified floating point value. As an example, consider the following call:

 stdout.put( "Pi = ", pi:5:3 );

The corresponding output is

 3.141

The HLA Standard Library provides several other useful routines you can use when outputting floating point values. Consult the HLA Standard Library Reference Manual for more information on these routines.

The HLA Standard Library provides several routines to let you display floating point values in a wide variety of formats. In contrast, the HLA Standard Library only provides two routines to support floating point input: stdin.getf() and stdin.get(). The stdin.getf() routine requires the use of the 80x86 FPU stack, a hardware component that this chapter is not going to cover. Therefore, we'll defer the discussion of the stdin.getf() routine until a later chapter. Because the stdin.get() routine provides all the capabilities of the stdin.getf() routine, this deference will not prove to be a problem.

You've already seen the syntax for the stdin.get() routine; its parameter list simply contains a list of variable names. The stdin.get() function reads appropriate values for the user for each of the variables appearing in the parameter list. If you specify the name of a floating point variable, the stdin.get() routine automatically reads a floating point value from the user and stores the result into the specified variable. The following example demonstrates the use of this routine:

 stdout.put( "Input a double precision floating point value: " ); stdin.get( DblVar );

Caution

This section has discussed how you would declare floating point variables and how you would input and output them. It did not discuss arithmetic. Floating point arithmetic is different than integer arithmetic; you cannot use the 80x86 add and sub instructions to operating on floating point values. Floating point arithmetic will be the subject of a later chapter in this text.

^[7]Of course, the drawback is that you must now perform two multiplications rather than one, so the result may be slower.

^[8]There were some minor changes to the way certain degenerate operations were handled, but the bit representation remained essentially unchanged.

^[9]The binary point is the same thing as the decimal point except it appears in binary numbers rather than decimal numbers.

^[10]Actually, this isn't necessarily true. The IEEE floating point format supports denormalized values where the H. O. bit is not zero. However, we will ignore denormalized values in our discussion.

^[11]The alternative would be to underflow the values to zero.