The earliest Intel processors had no hardware support for floating-point operations. All such operations were implemented as procedures built from ordinary arithmetic commands. For these early models, a separate chip was developed, called the mathematical coprocessor. It provided commands that let the computer perform floating-point operations much faster than the software procedures could.
Beginning with the 486DX processors, the mathematical coprocessor no longer exists as a separate device. Instead, the processor contains a floating-point unit (FPU), which is still programmed as a separate module.
The FPU provides the system with additional arithmetic calculation power, but does not replace any of the CPU commands. For example, commands such as ADD, SUB, MUL, and DIV are performed by the CPU, while the FPU takes over the additional, more efficient arithmetic commands.
The developer may view a system with a coprocessor as a single processor with a larger set of commands.
The FPU program model can be described as a combination of registers, which fall into the following three groups:
FPU stack registers. There are eight of them, called ST(0), ST(1), ST(2), ..., ST(7). Floating-point numbers are stored in them as 80-bit numbers in the extended format. The register stack operates on the Last In, First Out (LIFO) principle. The name ST(0) always refers to the top of the stack. As numbers are loaded into the FPU, they are placed on top of the stack, and the numbers already stored move one position toward the bottom, leaving space for new values.
Control/status registers. These include the status register, which reflects the state of the FPU; the control register, which sets the FPU operation modes; and the tag register, which reflects the status of the ST(0) through ST(7) registers.
The data pointer register and the instruction pointer register. These are intended for processing exceptions.
Any of the registers listed above can be accessed by the program either directly or indirectly. In FPU programming, the most frequently used elements are the ST(0) through ST(7) registers and the C0, C1, C2, and C3 bits of the status register.
The FPU registers operate like an ordinary CPU stack, but this stack has a limited number of positions: only eight. The FPU has one more register that is difficult for the programmer to access: a word containing a tag for each of the stack positions. This register enables the FPU to track which stack positions are currently in use and which are free. Any attempt to place a value into a stack position that is already in use causes an exception (invalid operation).
To place data into the FPU stack, the program uses a load command, which places the data on top of the stack. If a number stored in memory has a format other than the 80-bit temporary real format, the FPU converts it to the 80-bit form while loading it.
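The behavior described above (eight slots, a movable top, and a per-slot tag that triggers an error on overflow) can be sketched in portable C++. This is a toy model rather than a description of the hardware internals: the class name `FpuStackModel` and its methods are illustrative, and C++ exceptions stand in for the FPU's invalid-operation exception.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Toy model of the FPU register stack: 8 slots, a top-of-stack index,
// and one "in use" tag per slot. All names are illustrative.
class FpuStackModel {
public:
    // Push: move the top index down by one (modulo 8) and fill that slot.
    // A slot that is still tagged as used models the invalid-operation case.
    void push(long double value) {
        std::size_t new_top = (top_ + 7) % 8;
        if (used_[new_top]) {
            throw std::overflow_error("FPU stack overflow");
        }
        top_ = new_top;
        regs_[top_] = value;
        used_[top_] = true;
    }

    // Pop: read ST(0), tag its slot as empty, move the top index up by one.
    long double pop() {
        if (!used_[top_]) {
            throw std::underflow_error("FPU stack underflow");
        }
        long double value = regs_[top_];
        used_[top_] = false;
        top_ = (top_ + 1) % 8;
        return value;
    }

    // ST(i) is always addressed relative to the current top of the stack.
    long double st(std::size_t i) const {
        assert(i < 8);
        std::size_t slot = (top_ + i) % 8;
        if (!used_[slot]) {
            throw std::underflow_error("ST(i) is empty");
        }
        return regs_[slot];
    }

private:
    std::array<long double, 8> regs_{};
    std::array<bool, 8> used_{};   // the model's stand-in for the tag word
    std::size_t top_ = 0;
};
```

Pushing two values and reading `st(0)` and `st(1)` returns them in reverse order of loading; a ninth push without an intervening pop throws, which mirrors the overflow condition.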
Similarly, the write commands extract values from the FPU stack and place them into memory. If the data format conversion is needed, it is performed as part of the write operation. Some forms of the write operation leave the top of the stack intact for further operations.
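The format conversions performed during loading and writing can be imitated in portable C++. The sketch below is only an analogy: `long double` stands in for the FPU's 80-bit format (an assumption that holds with some compilers on x86 but not everywhere), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Like "fld DWORD PTR x": widening a 32-bit float to the wider internal
// format is exact, so no information is lost on loading.
long double load_float(float x) {
    return static_cast<long double>(x);
}

// Like "fstp DWORD PTR x": narrowing back to 32 bits rounds the value.
float store_float(long double st0) {
    return static_cast<float>(st0);
}

// Like "fistp DWORD PTR n" under the default rounding mode: the value is
// rounded to the nearest integer (ties go to the even one), not truncated.
long store_int(long double st0) {
    assert(std::isfinite(st0));
    return std::lrint(st0);
}
```

Note the difference from a C++ cast: `static_cast<long>(2.7L)` truncates to 2, while `store_int(2.7L)` rounds to 3, which is closer to what an integer store from the FPU does by default.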
After being placed into the FPU stack, the data can be accessed and used by any command. The processor instructions allow operations between registers as well as operations between memory and registers. As in the CPU, at least one of any two operands must be in a register. For the FPU, one of the operands must always be the top element of the stack; the other operand may be taken either from memory or from the register stack.
Any arithmetic operation must have the register stack as its destination. The FPU cannot write the result into memory with the same command that performed the calculation. To send the result back to memory, use either a separate write command or a command that pops the value from the stack while writing it to memory.
All the FPU commands start with the letter F to distinguish them from the CPU commands. The FPU commands can be arranged conventionally into several groups:
Data transfer (read/write) commands
Addition and subtraction commands
Multiplication and division commands
Comparison commands
Transcendental functions commands
Control flow commands
Now, we will focus on these groups of commands in more detail.
There are two types of write commands.
The first type copies the number from the top of the stack into a memory cell without removing it from the stack. When performing such commands, the FPU converts the data from the 80-bit temporary real format to the desired external form. The commands of this type are fst and fist. The fst command can also copy the value from the top of the stack into another register of the stack.
The second type writes the data and also shifts the stack pointer. The fstp command (as well as the fistp and fbstp commands) performs the same transfer from the FPU to memory, but then pops the number from the stack. These commands support all the external data types.
The next data transfer command is the exchange command, fxch. It exchanges the contents of the top of the stack with any other register of the stack. Only another element of the stack can be used as the operand of this command. It cannot exchange values between the top register of the stack and a memory location; to do this, you need a combination of several commands. Within a single command, the FPU can either read from memory or write to memory, but not both.
The read, or load, commands place data on the top of the FPU stack. The fld command loads floating-point data; to load integer data, use the fild modification.
Now, we continue on to the next group of commands: those for addition and subtraction. Each of these commands finds the sum or difference of the ST(0) register value and another operand. The result of this operation is always placed in an FPU register. The mnemonic representation of these commands is as follows:
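The difference between the two kinds of write commands can be sketched with a small C++ model. It is illustrative only: a `std::vector` whose back element plays the role of ST(0) replaces the real register stack, and the names `FpuStack`, `store`, and `store_and_pop` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Illustrative model: the back of the vector plays the role of ST(0).
using FpuStack = std::vector<long double>;

// Like fst: convert the top of the stack to float and write it out.
// The stack itself is left unchanged.
void store(const FpuStack& st, float& dest) {
    assert(!st.empty());
    dest = static_cast<float>(st.back());
}

// Like fstp: perform the same write, then pop ST(0).
void store_and_pop(FpuStack& st, float& dest) {
    store(st, dest);
    st.pop_back();
}
```

After `store`, the same value is still available on top of the stack for further calculations; after `store_and_pop`, the element that was in ST(1) becomes the new top.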
    fadd   ST(0), ST(1)
    fadd   ST(0), ST(2)
    fadd   ST(2), ST(0)
    fiadd  WORD_INTEGER
    fiadd  SHORT_INTEGER
    fadd   SHORT_REAL
    fadd   LONG_REAL
    faddp  ST(2), ST(0)
    fsub   ST(0), ST(2)
    fisub  WORD_INTEGER
    fsubp  ST(2), ST(0)
    fsubr  ST(2), ST(0)
    fisubr SHORT_INTEGER
    fsubrp ST(2), ST(0)
The operands for these commands may be either the FPU stack registers or one stack register and one memory cell.
The following program code fragment (Listing 2.1) demonstrates the use of the FPU commands for finding the sum of two floating-point numbers.
    // FIADD_EXM.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        float f1, f2, fsum;
        while (true)
        {
            printf("\nEnter float 1: ");
            scanf("%f", &f1);
            printf("Enter float 2: ");
            scanf("%f", &f2);
            _asm {
                finit
                fld   DWORD PTR f1
                fadd  DWORD PTR f2
                fstp  DWORD PTR fsum
                fwait
            };
            printf("f1 + f2 = %6.3f\n", fsum);
        }
        return 0;
    }

In the _asm { } block, the fld command loads the f1 floating-point number from memory into the top register of the FPU stack. The fadd command calculates the sum of the top stack register ST(0) and the f2 variable in memory. The result of the operation is stored on top of the FPU stack. Finally, the fstp command saves the resulting sum to the fsum variable and pops it from the stack, freeing the top register ST(0). In Fig. 2.1, you can see the application window with the output of this program.
In the next example, we will consider the use of the loading, addition, and saving commands for summing up the elements of an array of six integers. Tasks of this kind are common in practice. You will see both the C++ .NET source code and the code using assembly language. This example, like the previous one, demonstrates the technique of using assembly commands for mathematical calculations. Listing 2.2 shows the source code of the C++ console application without assembly language commands.

    // FSUM.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        int iarray[6] = {1, 7, 0, 5, 3, 9};
        int *piarray = iarray;
        int isum = 0;
        int sf = sizeof(iarray) / sizeof(iarray[0]);
        for (int cnt = 0; cnt < sf; cnt++)
        {
            isum += *piarray;
            piarray++;
        }
        printf("Sum of integers = %d\n", isum);
        getchar();
        return 0;
    }
In order to sum up the elements, we use a classical algorithm with the for loop and the piarray pointer to the iarray array:
    for (int cnt = 0; cnt < sf; cnt++)
    {
        isum += *piarray;
        piarray++;
    }
Now, we will modify the source code of the program, using the assembly commands. The new version of such a program is shown in Listing 2.3.
    // FSUM.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        int iarray[6] = {3, 7, 0, 5, 3, 9};
        int *piarray = iarray;
        int isum;
        int sf = sizeof(iarray) / sizeof(iarray[0]);
        _asm {
            mov   ECX, sf
            dec   ECX
            mov   ESI, DWORD PTR piarray
            finit
            fild  DWORD PTR [ESI]
        next:
            add   ESI, 4
            fiadd DWORD PTR [ESI]
            loop  next
            fistp DWORD PTR isum
            fwait
        }
        printf("Sum of integers = %d\n", isum);
        getchar();
        return 0;
    }

To sum up the array elements, we use the following simple algorithm: first, the fild DWORD PTR [ESI] command loads the first array element on top of the stack, and then every iteration adds the next array element to this value. The address of the first element is placed in the ESI register, and the number of iterations (i.e., the array size minus 1) in the ECX register. After finding the sum, we save it to the isum variable by using the fistp command.
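Listing 2.3 illustrates the FPU load, add, and store commands on a familiar task. Whether it actually runs faster than the compiled loop in Listing 2.2 depends on the processor and the compiler, so measure both versions before relying on the assembly variant.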
Fig. 2.2 shows the application window with this program running.
Now, we will consider the next group of commands: the multiplication and division commands for integers and floating-point numbers. They are listed as follows:

    WORD_INTEGER  LABEL WORD
    SHORT_INTEGER LABEL DWORD
    SHORT_REAL    LABEL DWORD
    LONG_REAL     LABEL QWORD

    fmul   SHORT_REAL
    fimul  WORD_INTEGER
    fmulp  ST(2), ST(0)
    fdiv   ST(0), ST(2)
    fidiv  SHORT_INTEGER
    fdivp  ST(2), ST(0)
    fdivr  ST(0), ST(2)
    fidivr WORD_INTEGER
    fdivrp ST(2), ST(0)

As with the addition and subtraction commands, the operands for these commands may be either FPU registers or a combination of a stack register and a memory operand. The use of these commands is best illustrated by the following example, which has more complicated program code and demonstrates the techniques for using different FPU commands. The task is to calculate the Z variable (of the floating-point type) according to the formula (X - Y) / (X + Y). Listing 2.4 shows the source code of a C++ console application with the assembly code included.

    // FORMULA.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        float X, Y, Z;
        while (true)
        {
            printf("\nEnter X: ");
            scanf("%f", &X);
            printf("Enter Y: ");
            scanf("%f", &Y);
            _asm {
                finit
                fld   DWORD PTR X
                fadd  DWORD PTR Y
                fld   DWORD PTR X
                fsub  DWORD PTR Y
                fxch  st(1)
                fdiv  st(1), st(0)
                fxch  st(1)
                fstp  DWORD PTR Z
                fwait
            };
            printf("(X - Y)/(X + Y) = %7.3f\n", Z);
        };
        return 0;
    }

Now, we will analyze this example. To calculate the (X - Y) / (X + Y) expression, we will perform three steps, the first of which is to find the denominator by using the following commands:
    fld   DWORD PTR X
    fadd  DWORD PTR Y

First, we load the value of the X variable to the top of the stack (the ST(0) register). Then, we add the Y value to this register. As a result of these two commands, the top of the stack contains the sum of X and Y.
The next two commands calculate the difference between X and Y . To do this, we load the X value to the top of the stack, and then subtract the Y value:
    fld   DWORD PTR X
    fsub  DWORD PTR Y

At this moment, the ST(0) register contains the difference between X and Y. As the FPU stack is organized as a cyclic buffer, the previously calculated X + Y value has moved to the ST(1) register of the stack. So, to divide X - Y by X + Y, we need to exchange the values of the ST(0) and ST(1) registers, and then divide the contents of the ST(1) register by the ST(0) value:

    fxch  st(1)
    fdiv  st(1), st(0)
As a result of these commands, the ST(1) register contains the required value that needs to be written to the Z variable in the memory. To do this, use the following commands:
    fxch  st(1)
    fstp  DWORD PTR Z
Fig. 2.3 shows the window with this application running.
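The stack contents after each step of this calculation can be traced with a small portable C++ model. It is a sketch for following the reasoning, not a replacement for the assembly block: a `std::vector` stands in for the register stack, and the function name `formula_trace` and the helper `top` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Evaluates (x - y) / (x + y) with the same sequence of stack operations
// as the assembly block. The back of the vector stands in for ST(0).
float formula_trace(float x, float y) {
    std::vector<long double> st;
    // top(i) gives access to ST(i), counted from the top of the stack.
    auto top = [&st](std::size_t i) -> long double& {
        assert(i < st.size());
        return st[st.size() - 1 - i];
    };

    st.push_back(x);            // fld  X            ST(0) = X
    top(0) += y;                // fadd Y            ST(0) = X + Y
    st.push_back(x);            // fld  X            ST(0) = X, ST(1) = X + Y
    top(0) -= y;                // fsub Y            ST(0) = X - Y
    std::swap(top(0), top(1));  // fxch st(1)        ST(0) = X + Y, ST(1) = X - Y
    top(1) /= top(0);           // fdiv st(1), st(0) ST(1) = (X - Y) / (X + Y)
    std::swap(top(0), top(1));  // fxch st(1)        ST(0) = quotient

    float z = static_cast<float>(top(0));  // fstp Z: store the result...
    st.pop_back();                         // ...and pop it from the stack
    return z;
}
```

For X = 3 and Y = 1 the trace yields 2 / 4, so the function returns 0.5.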
Like the CPU command set, the FPU command set includes commands for comparing two numbers. The comparison commands have the following mnemonic representation:

    WORD_INT   LABEL WORD
    SHORT_INT  LABEL DWORD
    SHORT_REAL LABEL DWORD
    LONG_REAL  LABEL QWORD

    fcom
    fcom   ST(2)
    ficom  WORD_INT
    fcom   SHORT_REAL
    fcomp
    ficomp SHORT_INT
    fcomp  LONG_REAL
    fcompp
    ftst
    fxam

The FPU discards the comparison result itself, but sets the condition bits of its status word according to this result. Before these bits can be checked, the program must copy the status word out of the FPU, either to memory or to the AX register. The easiest way is to store the status word in AX with fstsw AX, and then copy the AH register into the processor flags register with sahf, after which ordinary conditional jumps can check the condition.
The comparison operation always involves the top register of the stack, so you need to specify only one operand for this command. This may be a register or a memory operand. The result of the comparison is stored in the FPU status word. After the transfer to the processor flags, the C0 bit is in the position of the carry flag CF, C2 in the position of the parity flag PF, and C3 in the position of the zero flag ZF.
Reflecting the result of a comparison requires only two status bits: C3 and C0. Table 2.1 shows the correspondence between the operands under comparison and the values of the status bits.
C3 | C0 | Result |
---|---|---|
0 | 0 | ST > source |
0 | 1 | ST < source |
1 | 0 | ST = source |
1 | 1 | ST and the source are incomparable |
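The mapping from a comparison to the condition bits, and from there to the processor flags, can be sketched in portable C++. The function and type names (`compare_status`, `sahf_from_status`, `CpuFlags`) are hypothetical. The bit positions used are those of the FPU status word, where C0 is bit 8, C2 is bit 10, and C3 is bit 14; for incomparable operands the FPU sets C2 along with C3 and C0.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// The three processor flags that receive the FPU condition bits.
struct CpuFlags {
    bool cf;  // receives C0
    bool pf;  // receives C2
    bool zf;  // receives C3
};

// Builds the status-word bits that comparing ST(0) with a source would set.
std::uint16_t compare_status(double st0, double src) {
    const std::uint16_t C0 = 1u << 8;
    const std::uint16_t C2 = 1u << 10;
    const std::uint16_t C3 = 1u << 14;
    if (std::isnan(st0) || std::isnan(src)) {
        return C3 | C2 | C0;   // incomparable operands
    }
    if (st0 > src) return 0;   // C3 = 0, C0 = 0
    if (st0 < src) return C0;  // C3 = 0, C0 = 1
    return C3;                 // C3 = 1, C0 = 0
}

// Mimics "fstsw AX" followed by "sahf": AH (bits 8..15 of the status word)
// is copied into the low byte of the flags register, where bit 0 is CF,
// bit 2 is PF, and bit 6 is ZF.
CpuFlags sahf_from_status(std::uint16_t status) {
    std::uint8_t ah = static_cast<std::uint8_t>(status >> 8);
    return CpuFlags{(ah & 0x01) != 0, (ah & 0x04) != 0, (ah & 0x40) != 0};
}
```

With these flags, a "below" jump (CF set) corresponds to ST < source and an "equal" jump (ZF set) to ST = source. Because incomparable operands set both flags, code that must handle them would test PF first.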
The following program (Listing 2.5) compares two floating-point numbers and displays the result on the screen.
    // COMPARE_REAL.cpp : Defines the entry point
    // for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        float X, Y;
        int Flag = 0;
        X = 0;
        Y = 0;
        while (true)
        {
            printf("\nEnter X: ");
            scanf("%f", &X);
            printf("Enter Y: ");
            scanf("%f", &Y);
            _asm {
                finit
                fld   DWORD PTR X
                fcomp DWORD PTR Y
                fstsw AX
                fwait
                sahf
                jb    xly
                je    xeqy
                mov   Flag, 2
                jmp   ex
            xly:
                mov   Flag, 0
                jmp   ex
            xeqy:
                mov   Flag, 1
            ex:
            };
            switch (Flag)
            {
            case 0:
                printf("X < Y\n");
                break;
            case 1:
                printf("X = Y\n");
                break;
            case 2:
                printf("X > Y\n");
                break;
            default:
                break;
            }
        }
        return 0;
    }

After the FPU is initialized with the finit command, the X variable is placed on top of the stack. The fcomp command compares the number on top of the stack with the variable in memory and, depending on the result, sets the bits in the FPU status word. The status bits are then copied to the CPU flags register, where they are analyzed. Depending on the bits set, the program jumps to the corresponding branch. The code fragment performing these actions looks like this:

    finit
    fld   DWORD PTR X
    fcomp DWORD PTR Y
    fstsw AX
    sahf
Fig. 2.4 shows the window with this application running.
Besides the fcomp command, there are also other modifications of the fcom comparison command. One of these is the ficomp modification intended for comparing integers. Below, you can see the program code for comparing two integers (Listing 2.6).
    // COMPARE_INTS.cpp : Defines the entry point
    // for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        int X, Y;
        int Flag = 0;
        X = 0;
        Y = 0;
        while (true)
        {
            printf("\nEnter X: ");
            scanf("%d", &X);
            printf("Enter Y: ");
            scanf("%d", &Y);
            _asm {
                finit
                fild   DWORD PTR X
                ficomp DWORD PTR Y
                fstsw  AX
                fwait
                sahf
                jb     xly
                je     xeqy
                mov    Flag, 2
                jmp    ex
            xly:
                mov    Flag, 0
                jmp    ex
            xeqy:
                mov    Flag, 1
            ex:
            };
            switch (Flag)
            {
            case 0:
                printf("X < Y\n");
                break;
            case 1:
                printf("X = Y\n");
                break;
            case 2:
                printf("X > Y\n");
                break;
            default:
                break;
            }
        }
        return 0;
    }

In general, the program code for comparing integers is almost the same as that for comparing floating-point numbers. The only difference is that the floating-point commands (fld, fcomp) are replaced with their integer counterparts (fild, ficomp).
We will consider one more example illustrating the technique of using the FPU commands in the assembly code. Suppose you need to count the number of occurrences of the given number in an array of integers. In Listing 2.7, you can see the source code of the corresponding console application created in C++ .NET.
    // COUNT_NUMBER.cpp : Defines the entry point for the console
    // application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        int iarray[10] = {-13, -7, 10, -5, 3, -7, -5, 4, -7, -3};
        int *piarray = iarray;
        int num;
        int cnt;
        int sf = sizeof(iarray) / sizeof(iarray[0]);

        printf("Array: ");
        for (cnt = 0; cnt < sf; cnt++)
        {
            printf("%d ", *piarray++);
        }
        // piarray now points one element past the end of the array
        while (true)
        {
            printf("\nEnter number to find: ");
            scanf("%d", &num);
            cnt = 0;
            _asm {
                mov   ESI, DWORD PTR piarray
                sub   ESI, 4              // step back to the last element
                mov   ECX, DWORD PTR sf
                finit
                fild  DWORD PTR num
            next_cmp:
                ficom DWORD PTR [ESI]
                fstsw AX
                sahf
                jne   skip
                inc   cnt
            skip:
                sub   ESI, 4
                loop  next_cmp
                fwait
            }
            printf("\nThe number %d occurs %d times\n", num, cnt);
        }
        return 0;
    }

The first commands of the assembly block initialize the ESI and ECX processor registers with the address of the last array element and the array size, respectively. After that, the number to find is loaded from the num variable to the top of the FPU stack with the following command:
fild DWORD PTR num
The value on top of the stack is compared in turn with each element of the array:
ficom DWORD PTR [ESI]
This command sets the corresponding bits in the status word. To extract and analyze this data, use the following commands:
    fstsw AX
    sahf
    jne   skip
    inc   cnt

Every time the two values are identical, the cnt counter is incremented.
Because the array is scanned from its last element toward the first, the following command moves on to the next element to be compared:
sub ESI, 4
When the loop is completed, the cnt counter contains the number of times the given number occurs in the array. If this number is not found in the array, then cnt=0 .
Fig. 2.5 shows the application window.
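When experimenting with an assembly block like this one, it helps to have a plain C++ reference to compare the counter against. The following sketch uses the standard `std::count` algorithm; the wrapper name `count_occurrences` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Reference implementation: counts how many of the n integers starting at
// first are equal to value. Useful for cross-checking the assembly result.
std::ptrdiff_t count_occurrences(const int* first, std::size_t n, int value) {
    assert(first != nullptr || n == 0);
    return std::count(first, first + n, value);
}
```

For the array used in Listing 2.7, the reference reports three occurrences of -7, two of -5, and none of a value that is absent from the array.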
The comparison commands that pop values from the stack (fcomp and fcompp) are a convenient way of clearing the stack, because the FPU has no dedicated command that simply discards an operand from the stack. These commands also alter the status register, so you should not use them if you need the status bits for further operations. In most cases, however, they give you a quick way of removing one or two operands from the stack. As the FPU raises an error on stack overflow, you need to remove all the operands from the stack after completing the calculations.
There are two specialized comparison commands. One of these is ftst, which compares the contents of the top register of the stack with zero. It is a quick way of finding the sign of the number stored on top of the stack.
The other specialized command is fxam. It sets all four status register flags (C3 through C0) to reflect the type of the number contained in the top register of the stack. The FPU can process numbers represented in any form (not only normalized floating-point numbers). The fxam command allows you to determine what type of number is stored on top of the stack.
If the arithmetic processing does not demand anything special, and the results of operations do not reach the limits of the FPU registers, then there is no reason to use the fxam command. Here, we will not explore the FPU's reaction to the exceptions that may sometimes occur in calculations. There are many issues related to this, and they are addressed in detail in Intel's official manual on the 387 processor.
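To give a feel for the kinds of numbers fxam distinguishes, the sketch below classifies a C++ `double` with the standard `std::fpclassify` function and reports the condition-bit pattern that fxam uses for that class (C1 carries the sign). It is an approximation under stated assumptions: the names `XamCode` and `classify_like_fxam` are hypothetical, the value is classified in its own 64-bit format rather than in the 80-bit register format, and the "empty register" class has no C++ counterpart.

```cpp
#include <cassert>
#include <cmath>

// Condition bits in the pattern fxam reports: C3, C2, C0 encode the class,
// C1 holds the sign of the value.
struct XamCode {
    bool c3;
    bool c2;
    bool c0;
    bool c1_sign;
};

XamCode classify_like_fxam(double v) {
    bool neg = std::signbit(v);
    switch (std::fpclassify(v)) {
        case FP_NAN:       return {false, false, true,  neg};  // NaN
        case FP_NORMAL:    return {false, true,  false, neg};  // normal finite
        case FP_INFINITE:  return {false, true,  true,  neg};  // infinity
        case FP_ZERO:      return {true,  false, false, neg};  // zero
        case FP_SUBNORMAL: return {true,  true,  false, neg};  // denormal
        default:           return {false, false, false, neg};  // unsupported
    }
}
```

An ordinary value such as 1.0 sets only C2, while zero sets only C3, so a program can separate these cases with the same fstsw and sahf technique used for comparisons.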
The next group of functions we are going to consider is that containing power functions and trigonometric functions.
These commands enable the FPU to calculate mathematical expressions involving logarithms, exponents, and trigonometric functions. These are the commands:
    fsqrt
    fscale
    fprem
    frndint
    fxtract
    fabs
    fchs
    fsin
    fcos
    fsincos
    fptan
    fpatan
    f2xm1
    fyl2x
    fyl2xp1

The commands for transcendental functions contribute greatly to the calculation power of the processors. These functions calculate their results with high precision. Note that the angle arguments of the trigonometric functions must be specified in radians. For example, if you need to calculate sin A, the angle A must be given in radians. To convert an angle value from degrees to radians, use the following formula:

    A_RAD = A × π / 180

where A_RAD is the radian value and A is the angle measured in degrees.
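The conversion can be written as a pair of small C++ helpers. This is a minimal sketch; the names `kPi`, `deg_to_rad`, and `rad_to_deg` are illustrative. Using a full-precision value of π matters here: approximating it by 3.14 introduces a relative error of about 0.05 percent into every converted angle.

```cpp
#include <cassert>
#include <cmath>

// Pi to double precision (illustrative constant name).
constexpr double kPi = 3.14159265358979323846;

// Degrees to radians: A_RAD = A * pi / 180.
double deg_to_rad(double degrees) {
    assert(std::isfinite(degrees));
    return degrees * kPi / 180.0;
}

// Radians to degrees: the inverse conversion.
double rad_to_deg(double radians) {
    assert(std::isfinite(radians));
    return radians * 180.0 / kPi;
}
```

For example, `std::sin(deg_to_rad(30.0))` evaluates to 0.5 within floating-point rounding error.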
Now, we will consider an example that calculates the sine and the cosine of an angle. The source code of this simple program is shown in Listing 2.8.
    // SinCos.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        float angle, angleRad, Sine, Cosine;
        while (true)
        {
            printf("\nEnter degrees: ");
            scanf("%f", &angle);
            angleRad = angle * 3.14159265f / 180.0f;
            _asm {
                finit
                fld   DWORD PTR angleRad
                fld   DWORD PTR angleRad
                fsin
                fstp  DWORD PTR Sine
                fcos
                fstp  DWORD PTR Cosine
                fwait
            }
            printf("The angle in degrees = %7.3f\n", angle);
            printf("Sine of angle = %7.3f\n", Sine);
            printf("Cosine of angle = %7.3f\n", Cosine);
            getchar();
        }
        return 0;
    }
The fsin and fcos commands calculate the sine and the cosine of the angle value stored in the top register of the stack: ST(0) . These commands take no operands, and return the result to the ST(0) register.
This means that the previous value of this register (the angle value) is no longer stored in ST(0) after the sine has been calculated. That is why we had to use the fld command twice in our procedure! Fig. 2.6 shows the application window with this program running.
Among the assembly language commands for calculating trigonometric functions, there is also the fsincos command. It calculates both the sine and the cosine of the angle value stored in ST(0), the top register of the FPU stack. This command does not take any operands. The command replaces the angle in ST(0) with its sine and then pushes the cosine onto the stack, so the cosine ends up in ST(0) and the sine in ST(1). Now, we will modify the previous example to use the fsincos command.
In Listing 2.9, note the modified version of the program code.
    // SinCos_mod.cpp : Defines the entry point for the console application
    #include "stdafx.h"

    int _tmain(int argc, _TCHAR* argv[])
    {
        float angle, angleRad, Sine, Cosine;
        while (true)
        {
            printf("\nEnter degrees: ");
            scanf("%f", &angle);
            angleRad = angle * 3.14159265f / 180.0f;
            _asm {
                finit
                fld     DWORD PTR angleRad
                fsincos
                fxch    st(1)
                fstp    DWORD PTR Sine
                fstp    DWORD PTR Cosine
                fwait
            }
            printf("The angle in degrees = %7.3f\n", angle);
            printf("Sine of angle = %7.3f\n", Sine);
            printf("Cosine of angle = %7.3f\n", Cosine);
            getchar();
        }
        return 0;
    }

The examples considered here illustrate just a few of the powerful mathematical options provided by assembly language. A remarkable feature of this language is that even code already written in assembly language is fairly easy to optimize further!
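The order of the fxch and fstp commands in Listing 2.9 follows from where fsincos leaves its two results. The sketch below models this on a `std::vector` whose back element plays the role of ST(0); the names `FpuStack` and `fsincos_model` are hypothetical, and the standard library functions stand in for the FPU's own computation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative model: the back of the vector plays the role of ST(0).
using FpuStack = std::vector<long double>;

// Model of fsincos: replace ST(0) with its sine, then push the cosine,
// so the cosine ends up in ST(0) and the sine in ST(1).
void fsincos_model(FpuStack& st) {
    assert(!st.empty());
    long double angle = st.back();
    st.back() = std::sin(angle);
    st.push_back(std::cos(angle));
}
```

Starting from a stack that holds only the angle 0, the model leaves 1 (the cosine) on top and 0 (the sine) below it, which is why the listing exchanges the two registers before storing the sine first.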