Floating-Point Unit | Visual C++ Optimization with Assembly Code

The earliest models of Intel processors did not have hardware support for floating-point operations. All of these operations were implemented as procedures made up of ordinary arithmetic commands. For these early models, a special additional chip was developed, called the mathematical coprocessor. It included commands that enabled the computer to perform floating-point operations much faster than procedures built from ordinary arithmetic commands.

Beginning with the 486DX processors, the mathematical coprocessor no longer exists as a separate device. Instead, the processor contains an FPU, which is still programmed as a separate module.

The FPU provides the system with additional arithmetic calculation power, but does not replace any of the CPU commands. For example, commands such as ADD, SUB, MUL, and DIV are performed by the CPU, while the FPU takes over the additional, more efficient arithmetic commands.

The developer may view a system with a coprocessor as a single processor with a larger set of commands.

The FPU Program Model

The FPU program model can be described as a combination of registers, which fall into the following three groups:

FPU stack registers. There are 8 of them, and they are called ST(0), ST(1), ST(2), ..., ST(7). The floating-point numbers are stored as 80-bit numbers of the extended format. The stack of registers operates according to the Last In, First Out (LIFO) principle. The ST(0) register always refers to the top of the stack. As the numbers are received by the FPU, they are added on top of the stack. The numbers already stored in the stack move toward the bottom, leaving space for other numeric values.
Control/status registers. These include the status register reflecting the information on the processor status, the control register (for controlling the FPU operation modes), and the tag register that reflects the status of the ST(0) through ST(7) registers.
The data pointer register and the instruction pointer register. These are intended for processing exceptions.

Any of the registers listed above can be accessed by the program either directly or indirectly. In FPU programming, the most frequently used elements are the ST(0) through ST(7) registers and the C0, C1, C2, and C3 bits of the status register.

The FPU registers operate like an ordinary CPU stack, but this stack has a limited number of positions: only 8 of them. The FPU has one more register, which is difficult for the programmer to access. This is a word containing the tags of each of the stack positions. This register enables the FPU to trace which of the stack positions are currently in use and which are free. Any attempt to place an object into a stack position that is already in use causes an exception (invalid operation).
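The behavior described above can be modeled in portable C++. The sketch below is only an illustration: the FpuStackModel type and its member names are hypothetical, long double stands in for the 80-bit format, and the tag word is simplified to one "in use" bit per register (the real tag word keeps two bits per register).

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>

// Hypothetical model of the eight-register FPU stack and a simplified tag word.
// long double stands in for the 80-bit extended format (an assumption: its
// actual width depends on the compiler).
struct FpuStackModel {
    std::array<long double, 8> regs{};  // physical registers
    std::uint8_t used = 0;              // simplified tags: one "in use" bit per register
    unsigned top = 0;                   // physical index of the register acting as ST(0)

    // Physical index of ST(i): the register file wraps around modulo 8.
    unsigned phys(unsigned i) const { return (top + i) & 7u; }

    bool in_use(unsigned i) const { return ((used >> phys(i)) & 1u) != 0; }

    // Load: move the top down by one position, then fill the new ST(0).
    void push(long double v) {
        const unsigned slot = (top + 7u) & 7u;
        if ((used >> slot) & 1u)
            throw std::overflow_error("invalid operation: stack position in use");
        top = slot;
        regs[slot] = v;
        used = static_cast<std::uint8_t>(used | (1u << slot));
    }

    // Store-and-pop: read ST(0), tag it as empty, move the top up by one.
    long double pop() {
        if (!in_use(0))
            throw std::underflow_error("invalid operation: stack position empty");
        const long double v = regs[top];
        used = static_cast<std::uint8_t>(used & ~(1u << top));
        top = (top + 1u) & 7u;
        return v;
    }

    long double st(unsigned i) const { return regs[phys(i)]; }
};
```

Pushing eight values fills every position; a ninth push finds the target position still tagged as in use and is rejected, which corresponds to the invalid-operation exception mentioned above.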

To place the data into the FPU stack, the program uses the load command, which places the data on top of the stack. If a number stored in memory has a format other than the temporary float format, the FPU converts this number to the 80-bit form during loading.

Similarly, the write commands extract values from the FPU stack and place them into memory. If the data format conversion is needed, it is performed as part of the write operation. Some forms of the write operation leave the top of the stack intact for further operations.
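The conversions on load and on write can be imitated in portable C++. This is a minimal sketch under stated assumptions: the function name add_via_extended is hypothetical, and long double is assumed to play the role of the internal 80-bit format (on some compilers it is only 64 bits wide, which changes the precision but not the idea).

```cpp
// Hypothetical helper that mirrors a load / add / store-and-pop sequence
// on 32-bit floats held in memory.
float add_via_extended(float a, float b) {
    long double st0 = a;                  // load: widen the 32-bit value to the internal format
    st0 += static_cast<long double>(b);   // add a memory operand, widened the same way
    return static_cast<float>(st0);       // write: narrow the result back to 32 bits
}
```

The point of the sketch is that the narrowing happens only once, at the write step; all intermediate arithmetic is carried out in the wider format.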

After being placed into the FPU stack, the data can be accessed and used by any command. The processor instructions allow both operations between registers and operations between memory and registers. As in the CPU, at least one of any two operands must be stored in a register. For the FPU, one of the operands must always be the top element of the stack, and the other operand may be taken either from memory or from the stack of registers.

Any arithmetic operation should always have the stack of registers as the destination. The FPU, being a processor unit for numeric operations, cannot write the result into memory by using the same command that performed the calculations. To send the operand back to the memory, use either a separate write command, or a command that extracts data from the stack and then writes them into the memory.

FPU Commands and Algorithm Optimization

All the FPU commands start with the F letter to be distinguished from the CPU commands. The FPU commands can be arranged conventionally into several groups:

Data transfer (read/write) commands
Addition and subtraction commands
Multiplication and division commands
Comparison commands
Transcendental functions commands
Control flow commands

Now, we will focus on these groups of commands in more detail.

Data Transfer Commands

Write

There are two types of write commands.

One of them copies the number from the top of the stack into a memory cell without removing it from the stack. When performing such commands, the FPU converts the data from the temporary float format to the desired external form. The commands of this type are fst and fist. The fst command can also copy the value from the top of the stack into another register inside the stack.

The second type of write commands writes the data and also shifts the stack pointer. While performing the same operation of writing the data from the FPU to the memory, the fstp command (as well as the fistp and fbstp commands) removes the number from the stack. These commands support all the external data types.

Exchange

The next data transfer command is the exchange command: fxch. It exchanges the contents of the top of the stack and any other register of the stack. As an operand for this command, you can use only another element of the stack. This command cannot exchange the values between the top register of the stack and a memory location. To do this, you need to use a combination of several commands. Within a single command, the FPU can perform either reading from the memory or writing to the memory, but not both simultaneously.

Read (Load)

The read, or load, commands load the data to the top of the FPU stack. To load floating-point data, use the fld command; to load integer data, use the fild variant.

Addition and Subtraction

Now, we continue on to the next group of commands: those for addition and subtraction. Each of these commands finds the sum or difference between the ST(0) register value and another operand. The result of this operation is always placed in an FPU register. The mnemonic representation of these commands is as follows:

 fadd   ST(0), ST(1)
 fadd   ST(0), ST(2)
 fadd   ST(2), ST(0)
 fiadd  WORD_INTEGER
 fiadd  SHORT_INTEGER
 fadd   SHORT_REAL
 fadd   LONG_REAL
 faddp  ST(2), ST(0)
 fsub   ST(0), ST(2)
 fisub  WORD_INTEGER
 fsubp  ST(2), ST(0)
 fsubr  ST(2), ST(0)
 fisubr SHORT_INTEGER
 fsubrp ST(2), ST(0)

The operands for these commands may be either the FPU stack registers or one stack register and one memory cell.

The following program code fragment (Listing 2.1) demonstrates the use of the FPU commands for finding the sum of two floating-point numbers.

Listing 2.1: Adding two floating-point numbers

 // FIADD_EXM.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float f1, f2, fsum;
   while (true)
   {
     printf("\nEnter float 1: ");
     scanf("%f", &f1);
     printf("Enter float 2: ");
     scanf("%f", &f2);
     _asm {
       finit
       fld    DWORD PTR f1
       fadd   DWORD PTR f2
       fstp   DWORD PTR fsum
       fwait
     };
     printf("f1 + f2 = %6.3f\n", fsum);
   }
   return 0;
 }

In the _asm { } block, the fld command first loads the f1 floating-point number from the memory to the top register of the FPU stack. The fadd command calculates the sum of the values of the top stack register ST(0) and the f2 variable in the memory. The result of the operation is stored on top of the FPU stack. Finally, the fstp command saves the resulting sum to the fsum variable and pops it from the stack, freeing the top stack register ST(0). In Fig. 2.1, you can see the application window with the output of this program.

Fig. 2.1: Application adds two floating-point numbers by using the FPU commands

In the next example, we will consider the use of the loading, addition, and saving commands for summing up the elements of an array of six integers. Tasks of this kind are common in practice. You will see both the C++ .NET source code and the code using assembly language. This example, like the previous one, also demonstrates the technique for using the assembly commands for performing mathematical calculations. Listing 2.2 shows the source code of the C++ console application without using the assembly language commands.

Listing 2.2: Summing up the elements of an integer array by using the C++ operators only

 // FSUM.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int iarray[6] = {1, 7, 0, 5, 3, 9};
   int *piarray = iarray;
   int isum = 0;
   int sf = sizeof(iarray)/4;
   for (int cnt = 0; cnt < sf; cnt++)
   {
     isum += *piarray;
     piarray++;
   }
   printf("Sum of integers = %d\n", isum);
   getchar();
   return 0;
 }

In order to sum up the elements, we use a classical algorithm with the for loop and the piarray pointer to the iarray array:

 for (int cnt = 0; cnt < sf; cnt++)
 {
   isum += *piarray;
   piarray++;
 }

Now, we will modify the source code of the program, using the assembly commands. The new version of such a program is shown in Listing 2.3.

Listing 2.3: The assembly-language version of the program for summing up array elements

 // FSUM.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int iarray[6] = {3, 7, 0, 5, 3, 9};
   int *piarray = iarray;
   int isum;
   int sf = sizeof(iarray)/4;
   _asm {
         mov       ECX, sf
         dec       ECX
         mov       ESI, DWORD PTR piarray
         finit
         fild      DWORD PTR [ESI]
   next:
         add       ESI, 4
         fiadd     DWORD PTR [ESI]
         loop      next
         fistp     DWORD PTR isum
         fwait
   }
   printf("Sum of integers = %d\n", isum);
   getchar();
   return 0;
 }

To sum up the array elements, we use the following simple algorithm: first, we use the fild DWORD PTR [ESI] command to load the first array element on top of the stack, and then, in every subsequent iteration, we add the next array element to this value. The address of the first element is placed in the ESI register, and the number of iterations (i.e., the array size minus 1) is placed in the ECX register. After finding the sum, we save it to the isum variable by using the fistp command.
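For readers who want to check the algorithm without an inline assembler, the same steps can be written in portable C++. This is only a sketch: the function name sum_like_fpu is hypothetical, long double stands in for the FPU register, and std::lrint is assumed to be a reasonable counterpart of the final integer store because both round according to the current rounding mode.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical portable counterpart of the load / add / store loop.
int sum_like_fpu(const int* data, std::size_t count) {
    if (count == 0) return 0;                 // guard that the assembly version does not have
    long double st0 = data[0];                // fild  DWORD PTR [ESI]: load the first element
    for (std::size_t i = 1; i < count; ++i)   // count - 1 iterations, as in ECX
        st0 += data[i];                       // add ESI, 4 / fiadd DWORD PTR [ESI]
    return static_cast<int>(std::lrint(st0)); // integer store with rounding
}
```

Note that the assembly loop relies on the array holding at least two elements: with a single element, ECX would be 0 on entry, and the loop command would decrement it past zero and keep iterating.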

The assembly code shown in Listing 2.3 is more efficient in terms of application performance than the addition algorithm in Listing 2.2.

Fig. 2.2 shows the application window with this program running.

Fig. 2.2: Application that calculates the sum of elements of an integer array

Multiplication and Division

Now, we will consider the next group of commands: the multiplication and division commands for integers and floating-point numbers. They are listed as follows:

 WORD_INTEGER   LABEL WORD
 SHORT_INTEGER  LABEL DWORD
 SHORT_REAL     LABEL DWORD
 LONG_REAL      LABEL QWORD
 fmul   SHORT_REAL
 fimul  WORD_INTEGER
 fmulp  ST(2), ST(0)
 fdiv   ST(0), ST(2)
 fidiv  SHORT_INTEGER
 fdivp  ST(2), ST(0)
 fdivr  ST(0), ST(2)
 fidivr WORD_INTEGER
 fdivrp ST(2), ST(0)

As with the addition and subtraction commands, the operands for these commands may be either the FPU registers or a combination of a stack register and a memory operand. The use of these commands is best illustrated by the following example, which has more complicated program code and demonstrates the techniques for using different FPU commands. The task is to calculate the Z variable (of the floating-point type) according to the formula: (X - Y) / (X + Y). Listing 2.4 shows the source code of the C++ console application with the assembly code included.

Listing 2.4: Evaluating a formula by using the FPU assembler commands

 // FORMULA.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float X, Y, Z;
   while (true)
   {
     printf("\nEnter X: ");
     scanf("%f", &X);
     printf("Enter Y: ");
     scanf("%f", &Y);
     _asm {
           finit
           fld   DWORD PTR X
           fadd  DWORD PTR Y
           fld   DWORD PTR X
           fsub  DWORD PTR Y
           fxch  st(1)
           fdiv  st(1), st(0)
           fxch  st(1)
           fstp  DWORD PTR Z
           fwait
     };
     printf("(X - Y)/(X + Y) = %7.3f\n", Z);
   };
   return 0;
 }

Now, we will analyze this example. To calculate the (X - Y) / (X + Y) expression, we will perform three steps, the first of which is to find the denominator by using the following commands:

 fld   DWORD PTR X
 fadd  DWORD PTR Y

To the top of the stack (the ST(0) register), we load the value of the X variable. Then, we add the Y value to this register. As a result of these two commands, the top of the stack will contain the sum of X and Y .

The next two commands calculate the difference between X and Y . To do this, we load the X value to the top of the stack, and then subtract the Y value:

 fld   DWORD PTR X
 fsub  DWORD PTR Y

At this moment, the ST(0) register contains the difference between X and Y. As the FPU stack is organized as a cyclic buffer, the previously calculated X + Y value has moved to the ST(1) register of the stack. So, to divide X - Y by X + Y, we need to exchange the values between the ST(0) and ST(1) registers, and then divide the contents of the ST(1) register by the ST(0) value:

 fxch  st(1)
 fdiv  st(1), st(0)

As a result of these commands, the ST(1) register contains the required value that needs to be written to the Z variable in the memory. To do this, use the following commands:

 fxch  st(1)
 fstp  DWORD PTR Z
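The walk-through above can be traced step by step on a software stack. The following sketch is an illustration only: the function name evaluate_formula is hypothetical, a std::vector stands in for the register stack (its last element plays the role of ST(0)), and long double stands in for the 80-bit format.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical trace of the command sequence on a software stack.
// st.back() acts as ST(0); the element just below it acts as ST(1).
float evaluate_formula(float x, float y) {
    std::vector<long double> st;
    st.push_back(x);                       // fld  X            ST(0) = X
    st.back() += y;                        // fadd Y            ST(0) = X + Y
    st.push_back(x);                       // fld  X            ST(0) = X, ST(1) = X + Y
    st.back() -= y;                        // fsub Y            ST(0) = X - Y
    const std::size_t top = st.size() - 1;
    std::swap(st[top], st[top - 1]);       // fxch st(1)        ST(0) = X + Y, ST(1) = X - Y
    st[top - 1] /= st[top];                // fdiv st(1), st(0) ST(1) = (X - Y) / (X + Y)
    std::swap(st[top], st[top - 1]);       // fxch st(1)        ST(0) = quotient
    const float z = static_cast<float>(st.back());  // fstp Z: store ...
    st.pop_back();                                   // ... and pop
    return z;
}
```

For example, with X = 3 and Y = 1 the stack holds 4 after the first two steps, then 2 and 4, and the final quotient is 0.5.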

Fig. 2.3 shows the window with this application running.

Fig. 2.3: Application that evaluates the formula by using the FPU commands

Comparison

Like the CPU command set, the FPU command set includes commands for comparing two numbers. The comparison commands have the following mnemonic representation:

 WORD_INT      LABEL WORD
 SHORT_INT     LABEL DWORD
 SHORT_REAL    LABEL DWORD
 LONG_REAL     LABEL QWORD
 fcom
 fcom   ST(2)
 ficom  WORD_INT
 fcom   SHORT_REAL
 fcomp
 ficomp SHORT_INT
 fcomp  LONG_REAL
 fcompp
 ftst
 fxam

The FPU discards the comparison result itself, but sets the status flags according to this result. Before checking the status flags, the program must copy the status word out of the FPU, either to memory or to the AX register. The easiest way to do this is to store the status word in AX with the fstsw AX command, and then copy the AH register to the processor flags register with the sahf command (to facilitate checking the condition).

The comparison operation always involves the top register of the stack, so you need to specify only one operand for this command. This may be a register or a memory operand. The result of the comparison is stored in the FPU status word. After the transfer to the flags register, the C0 bit is placed in the position of the CF carry flag, C2 in the position of the PF parity flag, and C3 in the position of ZF.

Reflecting the result of comparison requires only two status bits: C3 and C0 . Table 2.1 shows the correspondence between the operands under comparison and the values of the status bits.

Table 2.1: The correspondence between the operands under comparison and the status bits
C3	C0	Result
0	0	ST > source
0	1	ST < source
1	0	ST = source
1	1	ST and the source are incomparable
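The table and the flag mapping can be checked with a small portable model. This is a sketch under stated assumptions: the names compare_status, CpuFlags, and sahf_flags are hypothetical, and double values stand in for the FPU operands. The bit positions used (C0 in bit 8, C2 in bit 10, and C3 in bit 14 of the status word) are those of the x87 status word.

```cpp
#include <cmath>
#include <cstdint>

// Positions of the condition bits inside the 16-bit FPU status word.
constexpr std::uint16_t C0 = 1u << 8;
constexpr std::uint16_t C2 = 1u << 10;
constexpr std::uint16_t C3 = 1u << 14;

// Hypothetical model of the condition bits produced by a comparison of
// the top of the stack (st0) with a source operand.
std::uint16_t compare_status(double st0, double source) {
    if (std::isnan(st0) || std::isnan(source))
        return C3 | C2 | C0;          // incomparable operands
    if (st0 > source) return 0;       // C3 = 0, C0 = 0
    if (st0 < source) return C0;      // C3 = 0, C0 = 1
    return C3;                        // C3 = 1, C0 = 0
}

// Flags as seen by the CPU after the high byte of the status word
// has been copied to the low byte of the flags register.
struct CpuFlags { bool cf; bool pf; bool zf; };

CpuFlags sahf_flags(std::uint16_t status_word) {
    const std::uint8_t ah = static_cast<std::uint8_t>(status_word >> 8);
    return { (ah & 0x01) != 0,   // CF receives C0
             (ah & 0x04) != 0,   // PF receives C2
             (ah & 0x40) != 0 }; // ZF receives C3
}
```

With this mapping, a "below" jump (CF set) corresponds to ST < source and an "equal" jump (ZF set) corresponds to ST = source, which is how the listings in this section branch on the result.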

Comparing Floating-Point Numbers

The following program (Listing 2.5) compares two floating-point numbers and displays the result on the screen.

Listing 2.5: A C++ program comparing two floating-point numbers

 // COMPARE_REAL.cpp : Defines the entry point
 // for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float X, Y;
   int   Flag = 0;
   X = 0;
   Y = 0;
   while (true)
   {
     printf("\nEnter X: ");
     scanf("%f", &X);
     printf("Enter Y: ");
     scanf("%f", &Y);
     _asm {
           finit
           fldz
           fld     DWORD PTR X
           fcomp   DWORD PTR Y
           fstsw   AX
           fwait
           sahf
           jb      xly
           je      xeqy
           mov     Flag, 2
           jmp     ex
     xly:
           mov     Flag, 0
           jmp     ex
     xeqy:
           mov     Flag, 1
     ex:
     };
     switch (Flag)
     {
     case 0:
       printf("X < Y\n");
       break;
     case 1:
       printf("X = Y\n");
       break;
     case 2:
       printf("X > Y\n");
       break;
     default:
       break;
     }
   }
   return 0;
 }

After initializing the FPU with the finit command, the X variable is placed on top of the stack. The fcomp command compares the number on top of the stack with the variable in the memory and, depending on the result, sets the bits in the FPU status word. The status bits are then written to the CPU flags register, where they are analyzed. Depending on the bits set, there is a jump to a corresponding branch of the program. The code fragment performing these actions looks like this:

 finit
 fld    DWORD PTR X
 fcomp  DWORD PTR Y
 fstsw  AX
 sahf

Fig. 2.4 shows the window with this application running.

Fig. 2.4: Application that implements the algorithm for comparing two floating-point numbers

Comparing Integers

Besides the fcomp command, there are also other modifications of the fcom comparison command. One of these is the ficomp modification intended for comparing integers. Below, you can see the program code for comparing two integers (Listing 2.6).

Listing 2.6: Comparing two integers

 // COMPARE_INTS.cpp : Defines the entry point
 // for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int X, Y;
   int Flag = 0;
   X = 0;
   Y = 0;
   while (true)
   {
     printf("\nEnter X: ");
     scanf("%d", &X);
     printf("Enter Y: ");
     scanf("%d", &Y);
     _asm {
           finit
           fild     DWORD PTR X
           ficomp   DWORD PTR Y
           fstsw    AX
           fwait
           sahf
           jb       xly
           je       xeqy
           mov      Flag, 2
           jmp      ex
     xly:
           mov      Flag, 0
           jmp      ex
     xeqy:
           mov      Flag, 1
     ex:
     };
     switch (Flag)
     {
     case 0:
       printf("X < Y\n");
       break;
     case 1:
       printf("X = Y\n");
       break;
     case 2:
       printf("X > Y\n");
       break;
     default:
       break;
     }
   }
   return 0;
 }

In general, the program code for comparing integers is almost the same as that for comparing floating-point numbers. The only difference is that the floating-point commands (fld, fcomp) are replaced with their integer counterparts (fild, ficomp).

Counting the Number of Occurrences

We will consider one more example illustrating the technique of using the FPU commands in the assembly code. Suppose you need to count the number of occurrences of the given number in an array of integers. In Listing 2.7, you can see the source code of the corresponding console application created in C++ .NET.

Listing 2.7: The application that counts the number of occurrences of the given number in the array

 // COUNT_NUMBER.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int iarray[10] = {-13, -7, 10, -5, 3, -7, -5, 4, -7, -3};
   int *piarray = iarray;
   int num;
   int cnt;     // declared here so that it stays visible after the for loop
   printf("Array: ");
   for (cnt = 0; cnt < sizeof(iarray)/4; cnt++)
   {
     printf("%d ", *piarray++);
   }
   piarray--;   // step back from one past the end to the last element
   while (true)
   {
     printf("\nEnter number to find: ");
     scanf("%d", &num);
     cnt = 0;
     int sf = sizeof(iarray)/4;
     _asm {
           mov      ESI, DWORD PTR piarray
           mov      ECX, DWORD PTR sf
           finit
           fild     DWORD PTR num
     next_cmp:
           ficom    DWORD PTR [ESI]
           fstsw    AX
           sahf
           jne      skip
           inc      cnt
     skip:
           sub      ESI, 4
           loop     next_cmp
           fwait
     }
     printf("\nThe number %d occurs %d times\n", num, cnt);
   }
   return 0;
 }

The first commands of the assembly block initialize the ESI and ECX processor registers with the address of the last array element and the array size, respectively. After that, the needed number is loaded from the num variable to the top of the FPU stack with the following command:

 fild    DWORD PTR num

The value on top of the stack is compared in turn with each element of the array:

 ficom   DWORD PTR [ESI]

This command sets the corresponding bits in the status word. To extract and analyze this data, use the following commands:

 fstsw   AX
 sahf
 jne     skip
 inc     cnt

Every time the elements appear identical, the cnt counter is incremented.

To continue on to the next element of the array, use the following command:

 sub     ESI, 4

When the loop is completed, the cnt counter contains the number of times the given number occurs in the array. If this number is not found in the array, then cnt=0 .
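The counting loop can be restated in portable C++ to make the traversal order explicit. This is a minimal sketch: the function name count_occurrences is hypothetical, and long double stands in for the FPU register that holds the number being searched for. The loop starts at the last element and moves toward the first, visiting every element exactly once.

```cpp
#include <cstddef>

// Hypothetical portable counterpart of the counting loop.
int count_occurrences(const int* data, std::size_t size, int num) {
    int cnt = 0;
    const long double st0 = num;                        // fild DWORD PTR num
    for (std::size_t i = size; i-- > 0; ) {             // from the last element down to index 0
        if (st0 == static_cast<long double>(data[i]))   // ficom, then test for equality
            ++cnt;                                      // inc cnt
    }
    return cnt;
}
```

Every 32-bit integer is represented exactly in the wider floating-point format, so the equality test is exact here.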

Fig. 2.5 shows the application window.

Fig. 2.5: Application that counts the number of occurrences of the given integer in the array

The comparison commands that pop values from the stack offer a convenient way of clearing the stack. The FPU has no dedicated command that simply discards an operand from the stack. Instead, you can use the comparison commands that pop: fcomp removes one operand, and fcompp removes two. These commands also alter the status register, so you should not use them if you need the status bits for further operations. But in most cases, these commands give you a quick way of removing one or two operands from the stack. As the FPU issues an error on stack overflow, you need to remove all the operands from the stack after completing the calculations.

Specialized Comparison Commands

There are two specialized comparison commands. One of these is ftst, which compares the contents of the top register of the stack with zero (0). It is a quick way of finding the sign of the number stored on top of the stack.

The other specialized command is fxam. It sets all four status register flags (C3 through C0), reflecting the type of the number contained in the top register of the stack. The FPU can process numbers represented in any form (not only normalized floating-point numbers). The fxam command allows you to determine what type of number is stored on top of the stack.
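Standard C++ offers a rough software counterpart of this kind of classification in std::fpclassify. The sketch below is an analogy rather than an emulation: the function name classify_like_fxam is hypothetical, and the "empty register" class that fxam can report has no equivalent for an ordinary C++ value.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper that names the class of a value, similar in spirit
// to the classification that fxam encodes in the status bits.
std::string classify_like_fxam(double v) {
    switch (std::fpclassify(v)) {
        case FP_NAN:       return "NaN";
        case FP_INFINITE:  return "infinity";
        case FP_ZERO:      return "zero";
        case FP_SUBNORMAL: return "denormal";
        case FP_NORMAL:    return "normal";
        default:           return "unsupported";
    }
}
```

The sign of the value, which fxam also reports, could be obtained separately with std::signbit.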

If the arithmetic processing does not demand anything special, and the results of operations do not reach the limits of the FPU registers, then there is no reason to use the fxam command. Here, we will not explore the FPU's reaction to the exceptions that may sometimes occur in the calculations. There are many issues related to this, and they are addressed in detail in Intel's official manual on the 387 processor.

Power Functions and Trigonometric Functions

The next group of functions we are going to consider is that containing power functions and trigonometric functions.

These commands enable the FPU to calculate mathematical expressions involving logarithms, exponents, and trigonometric functions. These are the commands:

 fsqrt
 fscale
 fprem
 frndint
 fxtract
 fabs
 fchs
 fsin
 fcos
 fsincos
 fptan
 fpatan
 f2xm1
 fyl2x
 fyl2xp1

The commands for transcendental functions are a great contribution to the calculation power of the processors. These functions calculate the results with high precision. Note here that the angle arguments for the trigonometric functions should be specified in radian measure. For example, if you need to calculate sin A, the A angle should be given in radians. To convert an angle value from degrees to radians, use the following formula:

A_RAD = A × π / 180

Here, A_RAD is the radian value, and A is the angle measured in degrees.
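The conversion is easy to express as a small C++ function. This is a minimal sketch: the function name degrees_to_radians is hypothetical, and π is obtained as acos(-1) instead of a short literal so that the precision of the result is not limited by the constant.

```cpp
#include <cmath>

// Hypothetical helper implementing A_RAD = A * pi / 180.
double degrees_to_radians(double degrees) {
    const double pi = std::acos(-1.0);  // pi to full double precision
    return degrees * pi / 180.0;
}
```

For instance, 180 degrees converts to π radians, and the sine of 30 degrees computed through this conversion is 0.5 to within rounding error.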

Now, we will consider an example that calculates the sine and the cosine of an angle. The source code of this simple program is shown in Listing 2.8.

Listing 2.8: A program for calculating the sine and the cosine of an angle

 // SinCos.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float angle, angleRad, Sine, Cosine;
   while (true)
   {
     printf("\nEnter degrees: ");
     scanf("%f", &angle);
     angleRad = angle*3.14159265/180;   // degrees to radians
     _asm {
           finit
           fld  DWORD PTR angleRad
           fld  DWORD PTR angleRad
           fsin
           fstp DWORD PTR Sine
           fcos
           fstp DWORD PTR Cosine
           fwait
     }
     printf("The angle in degrees = %7.3f\n", angle);
     printf("Sine of angle = %7.3f\n", Sine);
     printf("Cosine of angle = %7.3f\n", Cosine);
     getchar();
   }
   return 0;
 }

The fsin and fcos commands calculate the sine and the cosine of the angle value stored in the top register of the stack: ST(0) . These commands take no operands, and return the result to the ST(0) register.

This means that the previous value of this register (the angle value) is no longer stored in ST(0) after the sine has been calculated. That is why we had to use the fld command twice in our procedure! Fig. 2.6 shows the application window with this program running.

Fig. 2.6: Application calculating the sine and the cosine of an angle

Among assembly language commands for calculating trigonometric functions, there is also the fsincos command. It calculates both the sine and the cosine of the angle value stored in ST(0), the top register of the FPU stack. This command does not take any operands. The result is returned in the ST(0) and ST(1) registers: the command replaces the angle in ST(0) with its sine and then pushes the cosine onto the stack, so the cosine ends up in ST(0) and the sine in ST(1). Now, we will modify the previous example, using the fsincos command.

In Listing 2.9, note the modified version of the program code.

Listing 2.9: The modified program calculating the sine and the cosine of an angle

 // SinCos_mod.cpp : Defines the entry point for the console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float angle, angleRad, Sine, Cosine;
   while (true)
   {
     printf("\nEnter degrees: ");
     scanf("%f", &angle);
     angleRad = angle*3.14159265/180;   // degrees to radians
     _asm {
           finit
           fld  DWORD PTR angleRad
           fsincos
           fxch st(1)
           fstp DWORD PTR Sine
           fstp DWORD PTR Cosine
           fwait
     }
     printf("The angle in degrees = %7.3f\n", angle);
     printf("Sine of angle = %7.3f\n", Sine);
     printf("Cosine of angle = %7.3f\n", Cosine);
     getchar();
   }
   return 0;
 }

The examples considered here illustrate just a few of the powerful mathematical options provided by assembly language. A remarkable feature of this language is that it is fairly easy to optimize even the code written in assembly language itself!