3.1 The Floating-Point Formats

   

 
Java Number Cruncher: The Java Programmer's Guide to Numerical Computing
By Ronald  Mak


Before the IEEE 754 standard was published in 1985, computer manufacturers implemented many different floating-point formats in hardware and software. It was difficult to port programs that did numerical computations from one machine platform to another because the computed results would vary. Most computer hardware manufacturers today adhere to the IEEE 754 standard, and software such as language compilers supports the standard's features to varying degrees.

The standard specifies number formats, operations, conversions, and exceptions. Number formats refer to how the numbers are encoded in memory.

Java's two primitive types float and double conform to the standard's 32-bit single-precision format and the standard's 64-bit double-precision format, respectively. Each format breaks a number into three parts: a sign bit, an exponent, and a fraction. The two formats differ in the number of bits allocated to the exponent and fraction parts. Figure 3-1 shows the layouts of the single-precision and double-precision formats, with the bit sizes of each part. In both formats, the most significant bits of the exponent and of the fraction are at their left ends.

Figure 3-1. The layouts of the single-precision float and double-precision double number formats. The numbers are the bit sizes of each of the parts.

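float:   sign (1 bit) | exponent (8 bits)  | fraction (23 bits)
double:  sign (1 bit) | exponent (11 bits) | fraction (52 bits)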

The sign bit represents the sign of the number value: 0 for positive and 1 for negative.

The exponent field is stored as an unsigned value, and so it is never negative. To allow it to represent negative exponent values, the standard adds a positive bias to the true exponent before storing it. We call the stored value a biased exponent. To get the unbiased (true) value of the exponent, we must subtract the bias. In the single-precision format, the exponent is 8 bits. It can store the biased values 0 through 255, but 0 and 255 are reserved. The bias is 127, and so the unbiased exponent values are -126 through +127. In the double-precision format, the exponent is 11 bits. It can store the biased values 0 through 2047, but 0 and 2047 are reserved. The bias is 1023, and so the unbiased exponent values are -1022 through +1023.
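The field boundaries and the bias subtraction can be sketched in Java. This is a minimal illustration rather than code from the book: the class name FloatFields and its method names are hypothetical, while Float.floatToIntBits, Double.doubleToLongBits, and Math.getExponent are standard library calls.

```java
// A sketch, not the book's code: split a float or a double into its three parts.
class FloatFields {
    // float: 1 sign bit, 8 exponent bits, 23 fraction bits
    static int sign(float x)           { return Float.floatToIntBits(x) >>> 31; }
    static int biasedExponent(float x) { return (Float.floatToIntBits(x) >>> 23) & 0xFF; }
    static int fraction(float x)       { return Float.floatToIntBits(x) & 0x7FFFFF; }

    // double: 1 sign bit, 11 exponent bits, 52 fraction bits
    static int sign(double x)           { return (int) (Double.doubleToLongBits(x) >>> 63); }
    static int biasedExponent(double x) { return (int) ((Double.doubleToLongBits(x) >>> 52) & 0x7FF); }
    static long fraction(double x)      { return Double.doubleToLongBits(x) & ((1L << 52) - 1); }

    public static void main(String[] args) {
        float x = -6.5f;                 // -1.101 (binary) x 2^2
        int e = biasedExponent(x);       // biased exponent as stored in the bits
        System.out.println("s = " + sign(x));
        System.out.println("e = " + e + ", E = e - 127 = " + (e - 127));
        System.out.println("f = " + Integer.toBinaryString(fraction(x)));
        // For normalized values, Math.getExponent returns the unbiased exponent.
        System.out.println("Math.getExponent agrees: " + (Math.getExponent(x) == e - 127));
    }
}
```

The unsigned shift operator >>> matters here: a signed shift would smear the sign bit into the upper bits of the result.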

We can use the fraction part to calculate the floating-point number's value v. Let s be the value of the sign bit, e be the biased exponent value, E be the unbiased exponent value, and f be the fraction value.

Normalized Number

If e is not a reserved value (0 and 255 for float, or 0 and 2047 for double), then there is an implied 1 bit followed by an implied binary point just to the left of f's first bit. Move the implied point to the right or left E bit positions (depending on whether E is positive or negative, respectively) to get the number's absolute value, and s determines whether the value is positive or negative (0 for positive, 1 for negative):

$$ v = (-1)^{s} \times 2^{E} \times (1.f) $$


This is a normalized number.

The implied bit, the implied point, and the fraction constitute a number's significand, so a single-precision number has 24 bits in its significand, and a double-precision number has 53 bits in its significand.
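As a rough check of the normalized-number rule, the sketch below recomputes a float's value from its three fields and compares it with the original. The class NormalizedDecode and its decode method are hypothetical names chosen for illustration. Math.scalb multiplies by a power of 2, which corresponds to moving the binary point E positions.

```java
// A sketch: rebuild a normalized float's value as (-1)^s x 2^E x 1.f
class NormalizedDecode {
    static double decode(int bits) {
        int s = bits >>> 31;             // sign bit
        int e = (bits >>> 23) & 0xFF;    // biased exponent
        int f = bits & 0x7FFFFF;         // 23-bit fraction
        if (e == 0 || e == 255) {
            throw new IllegalArgumentException("reserved exponent: not a normalized number");
        }
        // Implied 1 bit plus the fraction: a 24-bit significand in [1, 2).
        double significand = 1.0 + f / (double) (1 << 23);
        // Shift the binary point by the unbiased exponent E = e - 127.
        double magnitude = Math.scalb(significand, e - 127);
        return (s == 0) ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float x = 0.1f;
        double rebuilt = decode(Float.floatToIntBits(x));
        System.out.println(rebuilt == (double) x);
    }
}
```

The computation is done in double so that every intermediate result is exact; a double's 53-bit significand holds any 24-bit float significand without rounding.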

For a float example, let

$$ s = 0, \quad e = 01111101_2 = 125, \quad f = 10000000000000000000000_2 $$


Then

$$ E = e - 127 = 125 - 127 = -2 $$


and the significand is binary 1.10000000000000000000000 after we append the implied 1 bit and the implied binary point. If we move the binary point two places left, the value in binary is 0.011, which is

$$ v = 2^{-2} + 2^{-3} = 0.25 + 0.125 = 0.375 $$
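We can check this example by assembling the bit pattern by hand. The sketch below assumes the example's fields are s = 0, e = 125, and f = 1 followed by 22 zeros, which is consistent with the significand 1.1 and the two-place shift described in the text. The class AssembleFloat and its assemble method are hypothetical; Float.intBitsToFloat is the standard call that reinterprets 32 bits as a float.

```java
// A sketch: pack sign, biased exponent, and fraction into a float.
class AssembleFloat {
    static float assemble(int s, int e, int f) {
        // Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the fraction.
        return Float.intBitsToFloat((s << 31) | (e << 23) | f);
    }

    public static void main(String[] args) {
        int f = 1 << 22;                   // fraction 1000...0 (23 bits)
        float v = assemble(0, 125, f);     // E = 125 - 127 = -2
        System.out.println(v);             // prints 0.375
    }
}
```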


The maximum positive float value has

$$ s = 0, \quad e = 11111110_2 = 254, \quad f = 11111111111111111111111_2 $$


Then

$$ E = e - 127 = 254 - 127 = 127 $$


and the significand is binary 1.11111111111111111111111. The value is

$$ v = (2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38} $$


If we set s = 1, we get the most negative value, which is approximately -3.4 × 10^38.
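The largest value can be confirmed against Java's own constant. In this sketch the significand of 24 one bits is written as 2 - 2^-23, and the hex pattern 0x7F7FFFFF is our own encoding of s = 0, e = 254, and an all-ones fraction. The class name MaxFloat is hypothetical.

```java
// A sketch: compare the formula for the largest float with Float.MAX_VALUE.
class MaxFloat {
    // (2 - 2^-23) x 2^127, computed exactly in double arithmetic.
    static double maxFromFormula() {
        return (2.0 - Math.scalb(1.0, -23)) * Math.scalb(1.0, 127);
    }

    public static void main(String[] args) {
        float fromBits = Float.intBitsToFloat(0x7F7FFFFF); // s = 0, e = 254, f all ones
        System.out.println(fromBits == Float.MAX_VALUE);
        System.out.println(maxFromFormula() == (double) Float.MAX_VALUE);
        // Setting the sign bit gives the most negative finite float.
        System.out.println(Float.intBitsToFloat(0xFF7FFFFF) == -Float.MAX_VALUE);
    }
}
```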

Denormalized Number

If e = 0 (one of the reserved values) and f ≠ 0, then we have a denormalized number (also known as a subnormal number). There is an implied binary point just to the left of f's first bit and an implied 0 bit just to the left of that point. For float, move the implied point to the left 126 bit positions to get the number's value; for double, move the implied point to the left 1022 bit positions to get the number's value. The variable s determines whether the value is positive or negative:

$$ v = (-1)^{s} \times 2^{-126} \times (0.f) \quad \text{for float} $$


and

$$ v = (-1)^{s} \times 2^{-1022} \times (0.f) \quad \text{for double} $$


For a float example, let

$$ s = 0, \quad e = 0, \quad f = 00101000000000000000000_2 $$


and the significand is binary 0.00101000000000000000000 after we insert the implied 0 bit and the implied binary point. We move the binary point 126 bit positions to the left, and we get the value

$$ v = (2^{-3} + 2^{-5}) \times 2^{-126} = 2^{-129} + 2^{-131} \approx 1.84 \times 10^{-39} $$
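The denormalized rule can be sketched the same way as the normalized one, with an implied 0 bit instead of an implied 1 bit and a fixed shift of 126. The sketch assumes the example's fields are s = 0, e = 0, and f = 00101 followed by 18 zeros, as the significand in the text indicates. SubnormalDecode and decode are hypothetical names.

```java
// A sketch: rebuild a denormalized float's value as (-1)^s x 2^-126 x 0.f
class SubnormalDecode {
    static double decode(int bits) {
        int s = bits >>> 31;
        int e = (bits >>> 23) & 0xFF;
        int f = bits & 0x7FFFFF;
        if (e != 0 || f == 0) {
            throw new IllegalArgumentException("not a denormalized number");
        }
        double significand = f / (double) (1 << 23);      // implied 0 bit: value in (0, 1)
        double magnitude = Math.scalb(significand, -126); // fixed exponent of -126
        return (s == 0) ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        int bits = 0b00101 << 18;          // f = 00101 followed by 18 zeros; s = 0, e = 0
        System.out.println(decode(bits) == (double) Float.intBitsToFloat(bits));
        System.out.println(decode(bits) == Math.scalb(5.0, -131)); // 2^-129 + 2^-131
    }
}
```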


The minimum positive float value has

$$ s = 0, \quad e = 0, \quad f = 00000000000000000000001_2 $$


and the significand is binary 0.00000000000000000000001. The value is

$$ v = 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45} $$


If s = 1, we get the negative float value closest to zero, which is approximately -1.4 × 10^-45.
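In Java, the smallest positive float is available as the constant Float.MIN_VALUE. The sketch below builds it, and its negative counterpart, from bit patterns in which only the last fraction bit is set. The class and method names are hypothetical.

```java
// A sketch: the float values closest to zero, built from their bit patterns.
class MinFloat {
    static float smallestPositive()    { return Float.intBitsToFloat(0x00000001); } // s = 0, e = 0, f = 1
    static float negativeNearestZero() { return Float.intBitsToFloat(0x80000001); } // s = 1, e = 0, f = 1

    public static void main(String[] args) {
        System.out.println(smallestPositive() == Float.MIN_VALUE);
        System.out.println((double) smallestPositive() == Math.scalb(1.0, -149)); // 2^-149
        System.out.println(negativeNearestZero() == -Float.MIN_VALUE);
    }
}
```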

There are several special cases to implement some constant values:

Zero: If both e and f are 0, then the number value is -0 or +0, depending on s:

$$ v = (-1)^{s} \times 0 $$


Infinity: If e is 255 for float or 2047 for double (which are reserved values), and f is 0, then the number value is -Infinity or +Infinity, depending on s:

$$ v = (-1)^{s} \times \infty $$


NaN: If e is 255 for float or 2047 for double, and f is nonzero, then we have NaN, or Not-a-Number. NaN is neither positive nor negative, so s is ignored:

$$ v = \text{NaN} $$


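For example, dividing 0.0 by 0.0 in floating-point arithmetic results in NaN. (In Java, dividing the integer 0 by the integer 0 throws an ArithmeticException instead.)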
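The reserved exponent values make it possible to classify any float from its bits alone. The following sketch does so for the cases described above; the class SpecialValues and its classify method are hypothetical names, and the returned strings are our own labels.

```java
// A sketch: classify a float by inspecting its exponent and fraction fields.
class SpecialValues {
    static String classify(float x) {
        int bits = Float.floatToIntBits(x);
        int e = (bits >>> 23) & 0xFF;
        int f = bits & 0x7FFFFF;
        if (e == 0)   return (f == 0) ? "zero" : "denormalized";
        if (e == 255) return (f == 0) ? "infinity" : "NaN";
        return "normalized";
    }

    public static void main(String[] args) {
        System.out.println(classify(-0.0f));           // sign bit set, e and f both 0
        System.out.println(classify(1.0f / 0.0f));     // floating-point division by zero
        System.out.println(classify(0.0f / 0.0f));     // 0.0 divided by 0.0
        System.out.println(classify(Float.MIN_VALUE));
        System.out.println(classify(1.0f));
        // +0 and -0 compare as equal even though their sign bits differ.
        System.out.println(0.0f == -0.0f);
    }
}
```

Because NaN never compares equal to anything, including itself, tests for it should use Float.isNaN or Double.isNaN rather than the == operator.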

Table 3-1. Summary of Java's float and double formats.

Type    Exponent Bias  Unbiased Exponent Range  Significand Size  Minimum Nonzero Values  Maximum Values
float   127            -126 through 127         24 bits           ±1.4 × 10^-45           ±3.4028235 × 10^38
double  1023           -1022 through 1023       53 bits           ±4.9 × 10^-324          ±1.7976931348623157 × 10^308

This is all quite messy, but fortunately, the Java virtual machine takes care of all of it automatically. Table 3-1 summarizes the two formats.
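Most of the entries in Table 3-1 are available as constants in the Float and Double wrapper classes, so a program rarely needs to hard-code them. The sketch below prints them. The helper bias is a hypothetical function of our own, based on the observation that both biases in the table equal 2^(k-1) - 1 for a k-bit exponent field.

```java
// A sketch: print the Table 3-1 values from Java's built-in constants.
class FormatSummary {
    // Bias for a k-bit exponent field: 2^(k-1) - 1 (127 for k = 8, 1023 for k = 11).
    static int bias(int exponentBits) {
        return (1 << (exponentBits - 1)) - 1;
    }

    public static void main(String[] args) {
        System.out.println("float:  bias " + bias(8)
            + ", E from " + Float.MIN_EXPONENT + " through " + Float.MAX_EXPONENT
            + ", min " + Float.MIN_VALUE + ", max " + Float.MAX_VALUE);
        System.out.println("double: bias " + bias(11)
            + ", E from " + Double.MIN_EXPONENT + " through " + Double.MAX_EXPONENT
            + ", min " + Double.MIN_VALUE + ", max " + Double.MAX_VALUE);
    }
}
```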


   
ISBN: 0130460419
Year: 2001
Pages: 141
Authors: Ronald Mak
