Using the SIMD Technologies (MMX, SSE) | Visual C++ Optimization with Assembly Code

Intel processors up to the Pentium model operate according to the following scheme: every instruction acts on one or two operands, which can be stored in registers or in memory. To perform repeated or similar operations over several operands, you must use either loops or repeated calls of certain code fragments.

MultiMedia eXtensions (MMX)

In multimedia applications, 2D/3D graphics, communications, and a number of other tasks, the need to perform large numbers of similar operations is common. To optimize the solutions to such tasks, the SIMD (single instruction, multiple data) technology was developed. Its MMX implementation builds the calculation block on the FPU registers. The traditional FPU contains eight 80-bit registers for storing and processing numbers in the floating-point format.

These registers form the FPU stack. In FPU instructions, they are addressed relative to a special stack pointer. Physically, the MMX block uses the lower 64 bits of these registers and addresses them directly (MM0 through MM7). MMX introduces new types of packed data placed in the 64-bit registers:

Packed bytes: eight bytes
Packed words: four words
Packed doublewords: two doublewords
Quadword: one 64-bit word
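The packed formats listed above are simply different views of the same 64 bits. The following is a minimal sketch in portable C++ (no MMX instructions involved); the helper name packed_lane is hypothetical, and the lane numbering assumes the little-endian byte order of x86, where lane 0 holds the least significant bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns element `index` of a 64-bit quadword viewed as
// packed elements of type T (uint8_t: 8 lanes, uint16_t: 4 lanes, uint32_t: 2 lanes).
// Assumes a little-endian machine such as x86.
template <typename T>
T packed_lane(std::uint64_t quadword, std::size_t index) {
    static_assert(sizeof(T) <= 8 && 8 % sizeof(T) == 0,
                  "T must evenly divide 64 bits");
    constexpr std::size_t lanes = 8 / sizeof(T);
    assert(index < lanes);
    T view[lanes];
    std::memcpy(view, &quadword, sizeof quadword);  // byte copy avoids aliasing issues
    return view[index];
}
```

For the quadword 0x0807060504030201, the byte view starts with 0x01, the word view with 0x0201, and the doubleword view with 0x04030201.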

The MMX technology is a considerable improvement to the architecture of the Intel microprocessors. It was developed to speed up multimedia and communication programs. The amount of data that modern computers process, and the complexity of that processing, keep growing, which demands efficient processor performance.

MMX Instructions

Every MMX instruction performs an operation over a whole set of operands (8, 4, 2, or 1) placed in the addressed registers. Another peculiarity of the MMX technology is its support for saturation arithmetic, which differs from ordinary wraparound arithmetic as follows: if the result exceeds the upper limit for the given data type, it is clamped to the maximum possible value and the carry is ignored; if the result falls below the lower limit, it is clamped to the minimum possible value. The limits are determined by the variable type (signed or unsigned) and its width. This calculation mode is convenient, for instance, when you compute color values. The new instructions (57 in total) can be subdivided into the following groups:

Commands for arithmetic, including addition and subtraction in different modes, multiplication, and a combination of multiplication and addition
Comparison of data elements: checking the equality or comparing their value
Commands for format conversion
Logical commands (AND, AND NOT, OR, and XOR) over the 64-bit operands
Commands for logical and arithmetic shifts
Commands for data transfer between the MMX registers and the integer registers or memory
The MMX state-clearing command (EMMS), which marks all registers as empty in the FPU tag word
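The difference between wraparound and saturation arithmetic can be illustrated without any MMX instructions. The following sketch emulates, lane by lane in portable C++, what the packed-byte instructions PADDB, PADDUSB, and PSUBUSB compute for unsigned bytes; the function names and the kLanes constant are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // eight packed bytes per 64-bit MMX register

// Wraparound add: the carry out of each byte is discarded (PADDB behavior).
void add_wraparound(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    assert(a && b && out);
    for (std::size_t i = 0; i < kLanes; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}

// Unsigned saturating add: sums above 255 are clamped to 255 (PADDUSB behavior).
void add_saturate(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    assert(a && b && out);
    for (std::size_t i = 0; i < kLanes; ++i) {
        const unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        out[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}

// Unsigned saturating subtract: differences below 0 are clamped to 0 (PSUBUSB behavior).
void sub_saturate(const std::uint8_t* a, const std::uint8_t* b, std::uint8_t* out) {
    assert(a && b && out);
    for (std::size_t i = 0; i < kLanes; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : 0);
}
```

Adding 100 to a color component of 200 gives 44 with wraparound (a bright pixel suddenly turns dark), while saturation gives 255, which is why this mode suits color calculations.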

FPU and MMX

Several obstacles prevent you from freely interleaving FPU and MMX instructions. Among them are the differences in the methods of register addressing and the mismatch between the MMX and FPU data formats. As a result, the FPU/MMX block can operate in either of these two modes, but not in both simultaneously. For example, suppose you have to insert some MMX instructions into a chain of FPU instructions and then continue the FPU calculations. In this case, prior to the first MMX instruction, you need to save the FPU context (the state of the registers) in memory, and after the last MMX instruction, load the context again.

These save and load operations take processor time, and as a result you may even lose the advantages of the SIMD concept. Mapping the MMX registers onto the FPU registers is usually justified by the fact that the MMX context can then be saved in the same way as the FPU context, so the operating system needs no additional changes to save the MMX context on task switching. This means that the operating system is not affected by whether the installed processor has MMX. To benefit from SIMD, applications must be written to use it explicitly (and without losing performance on task switching).
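When the FPU stack holds no live values at the moment the MMX code starts, a lighter pattern than saving and restoring the whole context is often sufficient: keep the MMX work in one block and execute EMMS before returning to floating-point code. The sketch below assumes an x86 target whose compiler provides mmintrin.h; the helper name average_of_sums is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mmintrin.h>  // MMX intrinsics (x86 targets only)

// Hypothetical helper: sums four pairs of 16-bit words with MMX, then
// averages the four sums with ordinary floating-point code.
double average_of_sums(const std::int16_t* a, const std::int16_t* b) {
    assert(a && b);
    __m64 va, vb;
    std::memcpy(&va, a, sizeof va);           // load four packed words
    std::memcpy(&vb, b, sizeof vb);
    const __m64 vsum = _mm_add_pi16(va, vb);  // PADDW: four additions at once

    std::int16_t sums[4];
    std::memcpy(sums, &vsum, sizeof sums);    // store the packed result

    _mm_empty();  // EMMS: mark the shared registers empty before any FP code

    double total = 0.0;
    for (int i = 0; i < 4; ++i)
        total += sums[i];
    return total / 4.0;
}
```

Grouping all MMX instructions together and issuing EMMS once keeps the number of mode switches, and therefore their cost, to a minimum.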

Optimizing MMX Intrinsic Functions

In the Visual C++ .NET environment, the MMX support is implemented through intrinsic functions. All declarations of intrinsics are contained in the mmintrin.h header file. The developer can make use of the intrinsics in his or her own programs. The MMX intrinsics fall into several groups:

General-purpose functions (those for packing and unpacking, MMX register clearance, data transfer)
Comparison operations
Arithmetic operations
Shift operations
Logical operations

Any intrinsic function can be represented by an equivalent in the form of assembly code. For manipulating 64-bit variables, C++ .NET offers the __m64 variable type. All intrinsics make use of such variables to a certain degree, though the mm0 through mm7 registers are never mentioned explicitly. The code the compiler generates for an intrinsic is often redundant and can be replaced by hand-written assembly commands. The C++ .NET debugger supports MMX instructions, and the corresponding assembly code is generated automatically. In Chapter 10, you will find detailed examples of using the MMX technology.
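As a small illustration of how an intrinsic maps to a single MMX instruction, the sketch below brightens eight 8-bit pixel values at once with saturation. It assumes a compiler and target that provide mmintrin.h (in Visual C++ this means a 32-bit x86 build); the function name brighten8 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mmintrin.h>  // declares __m64 and the MMX intrinsics

// Hypothetical helper: adds `delta` to eight unsigned bytes, clamping at 255.
void brighten8(std::uint8_t* pixels, std::uint8_t delta) {
    assert(pixels);
    __m64 v;
    std::memcpy(&v, pixels, sizeof v);                        // load eight packed bytes
    const __m64 d = _mm_set1_pi8(static_cast<char>(delta));   // delta in every lane
    v = _mm_adds_pu8(v, d);                                   // PADDUSB: saturating add
    std::memcpy(pixels, &v, sizeof v);                        // store the result
    _mm_empty();                                              // EMMS: leave MMX state
}
```

The registers are never named in the source; the compiler chooses which of mm0 through mm7 hold v and d.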

Streaming SIMD Extensions (SSE)

Starting with the Pentium III processor, Intel processors include the Streaming SIMD Extensions (SSE). This technology is intended to enhance the performance of multimedia and communication applications. The extension (which includes new registers, data types, and instructions) is meant to speed up applications for mobile video, graphics combined with video, image processing, sound synthesis, speech synthesis and compression, telephony, video communications, and 2D and 3D graphics.

The applications of these types usually involve algorithms with a large amount of calculations and perform repeating operations on large sets of simple data elements. These applications have the following characteristics in common:

They involve large amounts of data.
Most of the operations are performed over the integers of small length.
Graphical applications use the operations over the 8-bit pixel color values.
Audio applications use the operations over the 16-bit audio samples of sound.
They involve the use of parallel calculations.

The new processors have an additional hardware block of eight 128-bit registers called XMM (XMM0 through XMM7). Each XMM register can hold four 32-bit floating-point numbers, so a single instruction can process four 32-bit operands at once. This mode of executing processor instructions is called parallel (or packed).

Instructions with the XMM registers can also operate in the so-called scalar mode. In this case, the operation is applied only to the lowest 32-bit element. Since SSE has its own hardware registers, the system does not use the FPU/MMX block when executing the new instructions.
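The difference between the parallel and scalar modes is easy to see with the addition intrinsics. The following sketch assumes an x86 target with SSE and a compiler that provides xmmintrin.h; the function names add_packed and add_scalar are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type

// Parallel (packed) mode: all four 32-bit floats are added at once (ADDPS).
void add_packed(const float* a, const float* b, float* out) {
    assert(a && b && out);
    const __m128 r = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, r);
}

// Scalar mode: only the lowest element is added (ADDSS);
// the three upper elements are copied unchanged from the first operand.
void add_scalar(const float* a, const float* b, float* out) {
    assert(a && b && out);
    const __m128 r = _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, r);
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed produces {11, 22, 33, 44}, while add_scalar produces {11, 2, 3, 4}.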

This separate execution of the FPU/MMX commands and the SSE instructions allows you to gain more efficiency by combining the MMX instructions for integers and the SSE instructions for the float operands. In this case, the FPU registers are used for the MMX integer calculations, and the SSE block performs float calculations.

SSE Instructions

To support the SSE extension, the instruction set of the integer MMX extension adds twelve new commands:

Commands for finding the average value, the minimum, the maximum, and the special arithmetic commands
Unsigned multiplication, as well as several instructions for rearranging packed elements

The SSE extension includes the instructions of the following types:

Arithmetic operations (addition, subtraction, multiplication, division, extraction of square root, finding the minimum and the maximum).
Comparison operations.
Conversion operations (these establish the relation between the MMX integer formats and the XMM float formats).
Logical operations (including the AND, OR, AND NOT, and XOR operations over the XMM operands).
Operations for moving and rearranging data (these ensure data exchange between the XMM block and memory or the processor integer registers, and also rearrange the elements of packed operands).
Status control operations (these serve for saving and loading the additional XMM status register). This group also includes instructions for quick saving/restoring the MMX/FPU and SSE status.

SSE also adds new instructions for managing cache contents. Some of them write the contents of the MMX and XMM registers directly to memory, bypassing the cache, which avoids filling the cache with data that will not be reused soon. Others allow you to load the needed data into the cache before calling the instructions that process it.
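Both kinds of cache management instructions are available as intrinsics. The sketch below assumes an x86 target with SSE and xmmintrin.h; the helper names, and the prefetch distance of 64 elements, are hypothetical choices for illustration rather than tuned values. Note that _mm_stream_ps requires a 16-byte aligned destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: fills a 16-byte aligned buffer using non-temporal
// stores (MOVNTPS), so the written data does not displace cache contents.
void fill_streaming(float* dst, std::size_t count, float value) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128 v = _mm_set1_ps(value);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_ps(dst + i, v);   // write four floats, bypassing the cache
    for (; i < count; ++i)
        dst[i] = value;              // scalar tail for the remaining elements
    _mm_sfence();                    // make the streamed stores globally visible
}

// Hypothetical helper: sums an array while requesting data ahead of use.
float sum_with_prefetch(const float* src, std::size_t count) {
    assert(src);
    constexpr std::size_t kAhead = 64;  // prefetch distance in elements (assumed)
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        if (i % 16 == 0 && i + kAhead < count)
            _mm_prefetch(reinterpret_cast<const char*>(src + i + kAhead), _MM_HINT_T0);
        total += src[i];
    }
    return total;
}
```

Whether streaming stores and prefetching actually help depends on the data size and access pattern, so both should be measured on the target processor.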

An application that uses more than the basic 32-bit processor resources has to determine the processor type first. This is easy to do with the cpuid instruction. On the Pentium III, the cpuid instruction can also return the processor's unique serial number.
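A minimal sketch of such a check follows. It relies on the documented feature flags that CPUID function 1 returns in EDX (bit 23 for MMX, bit 25 for SSE) and assumes an x86 target; it uses the __cpuid intrinsic under Visual C++ and the __get_cpuid helper from cpuid.h under GCC or Clang. The function names are hypothetical.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (Visual C++)
#else
#include <cpuid.h>    // __get_cpuid (GCC, Clang)
#endif

// Hypothetical helper: tests one feature bit of EDX returned by CPUID function 1.
bool cpuid_edx_bit(unsigned bit) {
    assert(bit < 32);
    unsigned edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};        // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    edx = static_cast<unsigned>(regs[3]);
#else
    unsigned eax = 0, ebx = 0, ecx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                  // CPUID function 1 is not available
#endif
    return ((edx >> bit) & 1u) != 0;
}

bool cpu_has_mmx() { return cpuid_edx_bit(23); }  // EDX bit 23: MMX
bool cpu_has_sse() { return cpuid_edx_bit(25); }  // EDX bit 25: SSE
```

An application would typically call these checks once at startup and select the SIMD or the plain code path accordingly.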

SSE Intrinsic Functions

The SSE extension of the Pentium processors, like MMX, is supported by the C++ .NET intrinsics. All declarations of the SSE intrinsics are stored in the xmmintrin.h header file. To facilitate working with 128-bit variables, the __m128 type can be used. In Chapter 10, you will find practical examples of how to use the SSE extension in application programming.
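A typical way to use the __m128 type is to process an array four elements at a time and finish the remainder with scalar code. The sketch below assumes an x86 target with SSE; the function name sqrt_array is hypothetical, and unaligned loads and stores are used so that the arrays need no special alignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // __m128 and the SSE intrinsics

// Hypothetical helper: computes the square root of every element of `in`.
void sqrt_array(const float* in, float* out, std::size_t count) {
    assert(in && out);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        const __m128 v = _mm_loadu_ps(in + i);    // load four floats
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));   // SQRTPS: four square roots at once
    }
    for (; i < count; ++i)
        out[i] = std::sqrt(in[i]);                // scalar tail
}
```

For six input values, the first four are handled by one SQRTPS instruction and the last two by the scalar loop.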