Assembly language is an effective tool for optimizing mathematical operations, data array processing, and hardware interaction. Many applications make intensive use of mathematical calculations, and assembly language is often a good way to improve their performance. Writing highly efficient applications requires a solid understanding of the hardware and programming architecture of the Floating-Point Unit (FPU), as well as of SIMD technology.
In this chapter, we focus on the principles of FPU operation and the options for application optimization. Practical examples of how to use the FPU features are considered in Chapter 2.
The earliest Intel processors had no hardware support for floating-point operations. All such operations were implemented as procedures built from ordinary integer arithmetic commands. For these early models, a separate additional chip was developed, known as the mathematical coprocessor. Its commands let the computer perform floating-point operations much faster than the software procedures could.
Starting with the 486DX processors, the mathematical coprocessor no longer exists as a separate device. Instead, the processor contains the FPU on the same chip, although it is still programmed as a separate module. The FPU programming model can be described as a combination of the following registers:
FPU stack registers. There are 8 of them, named ST(0), ST(1), ST(2), ..., ST(7). Floating-point numbers are stored in them as 80-bit numbers in the extended format. ST(0) always refers to the top of the stack. As numbers are loaded into the FPU, they are pushed onto the top of the stack.
Control/status registers. These include the status register, which reflects the current state of the FPU; the control register, which selects the FPU operation modes; and the tag register, which reflects the state of the ST(0)-ST(7) registers.
Data pointer register and instruction pointer register. These are used in exception handling.
Any of the registers listed above can be accessed by the program either directly or indirectly. In FPU programming, the most frequently used elements are the ST(0)-ST(7) registers and the C0, C1, C2, and C3 bits of the status register.
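To make the role of the C0, C2, and C3 bits more concrete, the following C++ sketch decodes a status word captured after a compare command such as FCOM. The bit positions (C0 = bit 8, C2 = bit 10, C3 = bit 14, TOP = bits 11-13) follow the x87 status word layout; the names decode_compare, top_of_stack, and FpuCompare are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Possible outcomes of an x87 compare such as FCOM, as reported in C3, C2, C0.
enum class FpuCompare { Greater, Less, Equal, Unordered, Undefined };

// Bit masks of the condition codes inside the 16-bit status word.
// (C1, bit 9, is not part of the FCOM result pattern and is ignored here.)
constexpr std::uint16_t kC0 = 1u << 8;
constexpr std::uint16_t kC2 = 1u << 10;
constexpr std::uint16_t kC3 = 1u << 14;

// Hypothetical helper: interprets a status word captured after a compare,
// for example one copied out with FSTSW AX.
FpuCompare decode_compare(std::uint16_t status_word) {
    const bool c0 = (status_word & kC0) != 0;
    const bool c2 = (status_word & kC2) != 0;
    const bool c3 = (status_word & kC3) != 0;

    if (!c3 && !c2 && !c0) return FpuCompare::Greater;    // ST(0) > source
    if (!c3 && !c2 &&  c0) return FpuCompare::Less;       // ST(0) < source
    if ( c3 && !c2 && !c0) return FpuCompare::Equal;      // ST(0) == source
    if ( c3 &&  c2 &&  c0) return FpuCompare::Unordered;  // a NaN was involved
    return FpuCompare::Undefined;                         // not an FCOM pattern
}

// Extracts the 3-bit TOP field: the physical register currently behind ST(0).
int top_of_stack(std::uint16_t status_word) {
    return (status_word >> 11) & 0x7;
}
```

In assembly code the same test is usually done by copying the status word into AX and then examining the bits; the C++ form above only shows which bit carries which meaning.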
The FPU registers operate like the ordinary CPU stack, except that this stack has a limited number of positions: only 8. The FPU has one more register that is awkward for the programmer to access directly. This is the tag word, which contains a label for each of the stack positions. It enables the FPU to track which stack positions are currently in use and which are free. Any attempt to place a value into a stack position that is already in use raises an exception.
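The behavior described above can be sketched as a toy model in C++. This is not how the hardware is programmed; it is only an illustration under simplifying assumptions: the class FpuStackModel and its members are hypothetical, the tag word is reduced to one "in use" bit per register (the real tag word keeps two bits per register), and C++ exceptions stand in for the FPU stack fault.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Toy model of the x87 register stack with a simplified tag word.
class FpuStackModel {
public:
    // Models a load: TOP moves down by one (mod 8) and the value lands there.
    void push(long double value) {
        const int new_top = (top_ + 7) % 8;  // TOP - 1, wrapping around
        if (in_use(new_top)) {
            throw std::overflow_error("stack overflow: register already in use");
        }
        regs_[new_top] = value;
        tags_ |= static_cast<std::uint8_t>(1u << new_top);
        top_ = new_top;
    }

    // Models a store-and-pop: reads ST(0), marks it empty, moves TOP up.
    long double pop() {
        if (!in_use(top_)) {
            throw std::underflow_error("stack underflow: ST(0) is empty");
        }
        const long double value = regs_[top_];
        tags_ &= static_cast<std::uint8_t>(~(1u << top_));
        top_ = (top_ + 1) % 8;
        return value;
    }

    // ST(i) is relative to TOP: physical register (TOP + i) mod 8.
    long double st(int i) const {
        assert(i >= 0 && i < 8);
        const int phys = (top_ + i) % 8;
        if (!in_use(phys)) {
            throw std::underflow_error("ST(i) is empty");
        }
        return regs_[phys];
    }

private:
    bool in_use(int phys) const { return ((tags_ >> phys) & 1u) != 0; }

    std::array<long double, 8> regs_{};
    std::uint8_t tags_ = 0;  // bit n set = physical register n holds a value
    int top_ = 0;            // as after FINIT: the first push lands in register 7
};
```

Pushing eight values fills the model; a ninth push throws, which mirrors the exception raised when a load targets a position that is still marked as in use.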
To place data onto the FPU stack, the program uses a load command, which pushes the data onto the top of the stack. If a number stored in memory has a format other than the 80-bit extended (temporary real) format, the FPU converts it to the 80-bit form during loading.
The write (store) commands take values from the FPU stack and place them into memory. If data format conversion is needed, it is performed as part of the write operation. Some forms of the write operation pop the value off the stack, while others leave the top of the stack intact for further operations.
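The conversions performed on load and store can be imitated in C++ with ordinary casts. The sketch below is only an analogy: load_single and store_single are hypothetical names, and long double matches the 80-bit extended format only with some compilers (GCC and Clang on x86, for example), while Visual C++ treats long double as a 64-bit type. The point it illustrates holds either way: widening on load is exact, and narrowing on store may round.

```cpp
#include <cassert>

// Models a load of a 32-bit value: the number is widened to the register
// format. On the x87 this is the job of FLD with a 32-bit memory operand.
long double load_single(float in_memory) {
    return static_cast<long double>(in_memory);  // widening is exact
}

// Models a store to a 32-bit location: the register value is rounded to the
// destination format. On the x87 this is done by FST or FSTP with a 32-bit
// memory operand (FSTP also pops the stack, FST leaves it intact).
float store_single(long double in_register) {
    return static_cast<float>(in_register);      // may lose low-order bits
}
```

A value that started as a 32-bit number survives the round trip unchanged, whereas a result computed in the wider format, such as 1.0L / 3.0L, generally does not come back bit for bit after being stored as a 32-bit number.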
Once on the FPU stack, the data can be accessed and used by any command. The FPU instructions allow both operations between registers and operations between memory and registers. As in the CPU, at least one of any two operands must be in a register. For the FPU, one of the operands must always be the top element of the stack; the other may be taken either from memory or from another stack register.
Any arithmetic operation must have a stack register as its destination. The FPU cannot write the result into memory with the same command that performed the calculation. To send the result back to memory, it is necessary to use a separate write command, either one that copies the top of the stack into memory or one that also pops it off the stack.
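These operand rules can be illustrated with a small C++ model that evaluates (a + b) * c the way a typical FPU command sequence would. It is a sketch under stated assumptions: MiniFpu and add_then_scale are hypothetical names, the method names only echo the FLD, FADD, FMUL, and FSTP mnemonics, and tags and exceptions are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: only the operand rules are modelled, not tags or exceptions.
struct MiniFpu {
    std::vector<long double> stack;  // back() plays the role of ST(0)

    long double& st(std::size_t i) {
        assert(i < stack.size());
        return stack[stack.size() - 1 - i];
    }

    void fld(double mem)             { stack.push_back(mem); }  // memory -> ST(0)
    void fadd_mem(double mem)        { st(0) += mem; }          // ST(0) += memory
    void fmul_st0_sti(std::size_t i) { st(0) *= st(i); }        // ST(0) *= ST(i)
    void fstp(double& mem) {                                    // ST(0) -> memory, pop
        mem = static_cast<double>(st(0));
        stack.pop_back();
    }
};

// Computes (a + b) * c; every arithmetic step leaves its result on the stack.
double add_then_scale(double a, double b, double c) {
    MiniFpu fpu;
    double result = 0.0;
    double discarded = 0.0;
    fpu.fld(c);           // ST(0) = c
    fpu.fld(a);           // ST(0) = a, ST(1) = c
    fpu.fadd_mem(b);      // ST(0) = a + b, one operand taken from memory
    fpu.fmul_st0_sti(1);  // ST(0) = (a + b) * c, both operands on the stack
    fpu.fstp(result);     // a separate store command writes the result out
    fpu.fstp(discarded);  // pop c so that the stack is left empty
    return result;
}
```

Note that neither arithmetic step touches memory as a destination; the value reaches the variable result only through the explicit store at the end.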
All FPU commands start with the letter F to distinguish them from the CPU commands. The FPU commands can be conventionally arranged into several groups:
Data transfer commands
Addition and subtraction commands
Multiplication and division commands
Transcendental function commands
Control flow commands
The FPU provides hardware-level support for the algorithms that calculate trigonometric functions, logarithms, and powers. These calculations are transparent to the software developer and do not require writing any additional algorithms.
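As an illustration of how such functions map onto the hardware, a power is typically assembled from the logarithm and exponent primitives through the identity x^y = 2^(y * log2(x)). On the FPU, FYL2X computes y * log2(x), and F2XM1 together with FSCALE rebuilds the power of two. The C++ sketch below mirrors that structure with standard library calls; pow_via_log2 is a hypothetical name, and the function is an analogy rather than the exact command sequence.

```cpp
#include <cassert>
#include <cmath>

// Computes x^y as 2^(y * log2(x)); valid for x > 0 only.
double pow_via_log2(double x, double y) {
    assert(x > 0.0);
    const double exponent = y * std::log2(x);  // the role played by FYL2X
    return std::exp2(exponent);                // the role of F2XM1 plus FSCALE
}
```

The restriction to positive x comes from the logarithm; handling zero, negative bases, and special values is exactly the extra work that a library power function has to add around this core.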
The FPU makes it possible to perform mathematical calculations with very high precision: the 80-bit extended format carries up to 18 significant decimal digits. Calculations carried out entirely in the 32-bit or 64-bit formats carry fewer significant digits, so their results are less precise.
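The figure of 18 digits follows from the 64-bit significand of the extended format: a binary significand of p bits guarantees floor((p - 1) * log10(2)) decimal digits. The C++ sketch below evaluates that formula and also reports what the current compiler provides for long double. The function names are hypothetical, and the second value depends on the compiler: it is 18 where long double is the 80-bit extended format, but only 15 with Visual C++, which treats long double as a 64-bit type.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Decimal digits guaranteed by a binary significand of the given width
// (counting the leading bit): floor((p - 1) * log10(2)).
int guaranteed_decimal_digits(int significand_bits) {
    assert(significand_bits >= 1);
    return static_cast<int>(
        std::floor((significand_bits - 1) * std::log10(2.0)));
}

// What the current compiler reports for its own long double type.
int long_double_decimal_digits() {
    return std::numeric_limits<long double>::digits10;
}
```

With 24, 53, and 64 significand bits (the 32-bit, 64-bit, and 80-bit formats) the formula yields 6, 15, and 18 digits respectively.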
Using assembly language for FPU programming can give a considerable gain in application performance. The FPU instruction set contains several groups of commands that give the developer virtually all the tools needed to implement most calculation algorithms. Even if a needed command is missing, an equivalent operation can usually be built from several assembly instructions. Note also that by programming the FPU in assembly, you can implement operations that are difficult or even impossible to express in C++.
As for the mathematical functions in the C++ standard libraries, their assembly analogs often deliver higher performance as well as a smaller program size. Assembly language also lets developers create custom functions, which are often more efficient than their counterparts in the mathematical library of Visual C++ .NET.