SSE Extension and Programming It in the Inline Assembler | Visual C++ Optimization with Assembly Code

This section discusses how to use the inline assembler to optimize applications that perform floating-point operations with the SSE extension of Pentium processors. The SSE extension adds eight 128-bit registers, denoted xmm0 through xmm7. Data in the SSE format is a sequence of four packed 32-bit numbers. To program the SSE extension, the processor's command set is extended with a number of commands.

Here, we will concentrate only on the key aspects of programming the SSE extension. For a more detailed description of the SSE architecture and of programming these instructions in the assembler, refer to Intel's documentation.

Before you start programming SSE extension, check whether the processor and operating system support this extension. This can be easily detected with the following simple console application (Listing 10.26).

Listing 10.26: Checking the processor for the SSE extension support

 // TEST_SSE_BY_PROC.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   bool supSSE = true;
   _asm {
         mov   EAX, 1            // CPUID function 1: feature information
         cpuid
         test  EDX, 02000000h    // bit 25 of EDX is the SSE flag
         jnz   found
         mov   supSSE, 0
     found:
        };
   if (supSSE) printf("SSE supported by CPU!\n");
   else        printf("SSE not supported by CPU!\n");
   getchar();
   return 0;
 }

The source code of this program is almost identical to the code that checks for the MMX extension support, except that bit 25 of the EDX register is checked here.
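The same bit test can be sketched without inline assembly. The fragment below is only an illustration, not the book's listing: the helper names cpuid_leaf1_edx, cpu_supports_mmx, and cpu_supports_sse are hypothetical, and it assumes either Visual C++ (the __cpuid intrinsic from intrin.h) or GCC/Clang (the __get_cpuid helper from cpuid.h) on an x86 processor.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>    // __cpuid (Visual C++)
#else
#include <cpuid.h>     // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper: returns EDX of CPUID function 1,
// or 0 if that function is not available.
static unsigned cpuid_leaf1_edx()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return static_cast<unsigned>(regs[3]);
#else
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx;
#endif
}

// Bit 23 of EDX reports MMX, bit 25 reports SSE.
bool cpu_supports_mmx() { return (cpuid_leaf1_edx() & (1u << 23)) != 0; }
bool cpu_supports_sse() { return (cpuid_leaf1_edx() & (1u << 25)) != 0; }
```

The mask 1u << 25 is the same value as the constant 02000000h tested in Listing 10.26. As in the listing, this reports only what the processor offers; operating system support is a separate question.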

You can check the operating system for the SSE support by running an application whose source code is shown in Listing 10.27.

Listing 10.27: Checking the operating system for the SSE extension support

 // CHECK_SSE_SUPPORT_BY_OS.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"
 #include <excpt.h>
 #include <windows.h>

 bool _tmain(int argc, _TCHAR* argv[])
 {
   __try {
         _asm xorps xmm0, xmm0    // any SSE command will do
        }
   __except (EXCEPTION_EXECUTE_HANDLER)
     {
       if (GetExceptionCode() == STATUS_ILLEGAL_INSTRUCTION)
         {
           printf("SSE not supported by OS!\n");
           getchar();
           return (false);
         }
     }
   printf("SSE supported by OS!\n");
   getchar();
   return (true);
 }

In C++ .NET, both the SSE and MMX extensions are supported through intrinsics. As with the MMX extension, each intrinsic that works with floating-point numbers corresponds to an equivalent assembly command. For example, the function

 __m128 _mm_add_ss(__m128 a, __m128 b)

is an analog of the addss assembly command. Floating-point operands are represented in C++ .NET as __m128. For a more detailed description of SSE extension intrinsics, refer to the C++ .NET 2003 online help. Now we will discuss practical programming of the SSE extension with the inline assembler. SSE extension assembly commands can be divided into a few groups:

Store commands
Arithmetic commands
Comparison commands
Conversion commands
Logical commands

There are also a number of additional commands. Arithmetic, comparison, and conversion commands can operate either on four packed double words simultaneously (parallel operations) or on single 32-bit numbers (scalar operations). Scalar operations process only the low-order 32-bit element of each operand.
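The difference between scalar and parallel operations can be sketched with the intrinsics mentioned earlier. This is a minimal illustration rather than part of the original listings; the function names add_scalar and add_packed are hypothetical, and the unaligned load and store intrinsics are used so that no alignment is assumed.

```cpp
#include <xmmintrin.h>
#include <array>

// Scalar form (addss): only the low-order element is added;
// the three upper elements are copied from the first operand.
std::array<float, 4> add_scalar(const float* a, const float* b)
{
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    return out;
}

// Parallel form (addps): all four packed elements are added at once.
std::array<float, 4> add_packed(const float* a, const float* b)
{
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    return out;
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the scalar version yields {11, 2, 3, 4}, while the parallel version yields {11, 22, 33, 44}.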

We will look at an example of scalar addition of two floating-point numbers. The source code of the application is shown in Listing 10.28.

Listing 10.28: Scalar addition of two floating-point numbers

 // SSE_ADD_2_SCALAR.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1, b1, c1;
   printf(" EXAMPLE OF SCALAR SUMMA IN SSE-EXT.ASM.OPERATIONS\n");
   printf("\nEnter float a1: ");
   scanf("%f", &a1);
   printf("Enter float b1: ");
   scanf("%f", &b1);
   _asm {
         lea       EAX, a1
         lea       EDI, b1
         lea       EDX, c1
         movss     xmm0, DWORD PTR [EAX]
         addss     xmm0, DWORD PTR [EDI]
         movss     DWORD PTR [EDX], xmm0
        };
   printf("c1 = a1+b1 = %.3f\n", c1);
   getchar();
   return 0;
 }

After the a1 and b1 floating-point numbers are entered, the assembly block adds them as ordinary 32-bit values. The addresses of a1 and b1 and the address of their sum c1 are loaded into the registers EAX, EDI, and EDX, respectively. Then the commands

 movss xmm0, DWORD PTR [EAX]
 addss xmm0, DWORD PTR [EDI]
 movss DWORD PTR [EDX], xmm0

add up the numbers and store the result in the c1 variable. Scalar commands end with the suffix ss (for example, addss), while parallel commands end with the suffix ps (for example, addps). The window of the application is shown in Fig. 10.17.

Fig. 10.17: Window of an application that performs scalar addition of two floating-point numbers with the SSE extension assembler

Adding operands in parallel significantly improves the performance of an application. Operations of this type are very convenient when processing floating-point arrays. Now, we will complicate the previous example by taking two floating-point arrays as a1 and b1 and an array containing their sum as c1 . The source code of such an application is shown in Listing 10.29.

Listing 10.29: Adding elements of two arrays in parallel

 // SSE_ADD_2_FLOATS.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1[4] = {34.5, 12.44, 7.53, 7.4};
   float b1[4] = {3.54, 1.23, 3.56, 7.55};
   float c1[4];
   printf("        SUMMA 2  ARRAYS in parallel\n");
   printf("\n a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\n b1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", b1[cnt]);
   _asm {
         lea     EAX, a1
         lea     EDI, b1
         lea     ECX, c1
         movups  xmm0, XMMWORD PTR [EAX]
         addps   xmm0, XMMWORD PTR [EDI]
         movups  XMMWORD PTR [ECX], xmm0
        };
   printf("\n\n c1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", c1[cnt]);
   getchar();
   return 0;
 }

You might have noticed that the commands in the assembly block have the suffix ps. In addition, the XMMWORD keyword is used to denote a 128-bit variable. The window of the application is shown in Fig. 10.18.

Fig. 10.18: Window of an application that adds four floating-point numbers in parallel with the SSE extension assembler
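Real arrays are rarely exactly four elements long. The following sketch, which is not from the original text, shows one common way to extend the idea of Listing 10.29 to arrays of any length: process four elements at a time with the intrinsic equivalents of movups and addps, then finish the remaining elements one by one. The function name add_arrays is hypothetical.

```cpp
#include <xmmintrin.h>
#include <cstddef>
#include <vector>

// Adds two float arrays element by element (up to the shorter length).
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> c(n);
    std::size_t i = 0;

    // Main loop: four packed additions per iteration (movups/addps/movups).
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    // Tail: the last n % 4 elements are added as ordinary floats.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];

    return c;
}
```

With six-element inputs, the first four sums come from a single packed addition and the last two from the scalar tail.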

The subps command, which subtracts two 128-bit operands in parallel, is useful for subtracting the elements of floating-point arrays. The next example illustrates the use of the subtraction command, as well as a few other important things related to the SSE extension. Consider the source code of an application (Listing 10.30) that uses arrays of four floating-point numbers. The arrays are kept small for simplicity's sake.

Listing 10.30: Subtracting the elements of floating-point arrays with the subps command

 // SSE_COMBO_SUB_2.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"
 #include <xmmintrin.h>

 int _tmain(int argc, _TCHAR* argv[])
 {
   __declspec(align(16)) float a1[4];
   __declspec(align(16)) float b1[4];
   __declspec(align(16)) float c1[4];
   __m128 ma1 = {12.5, 32.7, 4.8, 6};
   __m128 mb1 = {3.45, 12.67, 5.88, 23.1};
   __m128 mc1;
   printf("  PARALLEL SUBSTRACTION OF 2 ALIGNED ARRAYS \n");
   _asm {
         lea    EAX, a1
         lea    EDX, b1
         lea    ECX, c1
         movaps xmm0, ma1
         movaps XMMWORD PTR [EAX], xmm0    // copy ma1 to a1
         movaps xmm1, mb1
         movaps XMMWORD PTR [EDX], xmm1    // copy mb1 to b1
         subps  xmm0, xmm1                 // xmm0 = ma1 - mb1
         movaps mc1, xmm0
         movaps XMMWORD PTR [ECX], xmm0    // store the result in c1
        };
   printf("\n a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\n b1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", b1[cnt]);
   printf("\n\n c1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", c1[cnt]);
   getchar();
   return 0;
 }

A distinct feature of this example is that it demonstrates operations on variables aligned on a 16-byte boundary. For this purpose, the floating-point arrays a1, b1, and c1 are declared with the align keyword:

 __declspec(align(16)) float a1[4];
 __declspec(align(16)) float b1[4];
 __declspec(align(16)) float c1[4];

Data alignment can increase the performance of an application significantly. Applications that use the advanced assembly commands of the latest processor generations require their data to be aligned on a 16-byte boundary. Aligning frequently used data on the cache line size is also a very effective technique for increasing application performance. For example, if a structure whose size is less than 32 bytes is declared in a program, it should be aligned on a 32-byte boundary for effective caching.

In this program, we use variables of both the float and __m128 types, so that you gain a better understanding of the relationship between classic variable types such as float and the 128-bit variables of the SSE extension. Like the SSE extension intrinsics, the __m128 type is declared in the xmmintrin.h header file, so it is included in the project.

To move aligned data, the application uses the movaps command. To work with unaligned data, you can use the movups command instead of movaps . Such a substitution will not cause errors, but the performance of the application will be lower, and it will be pointless to use the align keyword.
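The choice between the two move commands can also be made at run time. The sketch below is an illustration under stated assumptions, not the book's code: the helper names is_aligned_16 and sub4 are hypothetical, and the intrinsics _mm_load_ps/_mm_store_ps and _mm_loadu_ps/_mm_storeu_ps stand in for movaps and movups. Note that the reverse substitution is not safe: using the aligned form on an address that is not a multiple of 16 causes a fault.

```cpp
#include <xmmintrin.h>
#include <cstdint>

// True if the address is a multiple of 16 bytes.
bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// c[0..3] = a[0..3] - b[0..3], using aligned moves only when all
// three addresses allow it and unaligned moves otherwise.
void sub4(const float* a, const float* b, float* c)
{
    if (is_aligned_16(a) && is_aligned_16(b) && is_aligned_16(c)) {
        // movaps path
        _mm_store_ps(c, _mm_sub_ps(_mm_load_ps(a), _mm_load_ps(b)));
    } else {
        // movups path
        _mm_storeu_ps(c, _mm_sub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
}
```

In standard C++11 and later, alignas(16) plays the same role as __declspec(align(16)) in the listings.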

The window of the application is shown in Fig. 10.19.

Fig. 10.19: Window of an application that demonstrates subtraction of floating-point array elements aligned on 16-byte boundary with the SSE extension assembler

This example can be modified so that no variables of the __m128 type are used. In this case, the xmmintrin.h header file and a few assembly commands can be removed. The source code of the application will appear as shown in Listing 10.31.

Listing 10.31: A modified version of the application that subtracts the elements of arrays

 // SSE_COMBO_SUB_2_MOD.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   __declspec(align(16)) float a1[4] = {34.5, 12.44, 7.53, 7.4};
   __declspec(align(16)) float b1[4] = {3.54, 1.23, 3.56, 7.55};
   __declspec(align(16)) float c1[4];
   printf(" PARALLEL SUBSTRACTION OF 2 ALIGNED ARRAYS WITH ASM INSTRUCTIONS ONLY\n");
   _asm {
         movaps xmm0, XMMWORD PTR a1
         subps  xmm0, XMMWORD PTR b1
         movaps XMMWORD PTR c1, xmm0
        };
   printf("\n a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\n b1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", b1[cnt]);
   printf("\n\n c1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", c1[cnt]);
   getchar();
   return 0;
 }

The window of the application is shown in Fig. 10.20.

Fig. 10.20: Window of the modified application that demonstrates parallel subtraction of array elements

For multiplication and division operations on 128-bit data, you can use the following SSE extension assembly commands:

mulps: parallel multiplication of 128-bit operands. The result is put into one of the registers xmm0 through xmm7.
divps: parallel division of 128-bit operands. The result is put into one of the registers xmm0 through xmm7.
mulss: scalar multiplication of the low-order double words of two operands. The result (a 32-bit value) is put into one of the registers xmm0 through xmm7. One of the operands can be a 32-bit memory variable.
divss: scalar division of 32-bit operands. The syntax of this command is the same as that of mulss.

Below is the source code of an application that demonstrates parallel multiplication and division (Listing 10.32).

Listing 10.32: Parallel multiplication and division of SSE data

 // SSE_MUL_DIV_ALIGN_2_ARRAYS.cpp : Defines the entry point for the
 // console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   __declspec(align(16)) float a1[4] = {34.5, 12.44, 7.53, 7.4};
   __declspec(align(16)) float b1[4] = {3.54, 1.23, 3.56, 7.55};
   __declspec(align(16)) float c1[4] = {1.5, 2.5, 3.5, 4.5};
   __declspec(align(16)) float d1[4];
   printf(" PAR. MUL/DIV OF 2ALIGNED ARRAYS WITH ASM INSTRUCTIONS ONLY\n");
   _asm {
         movaps xmm0, XMMWORD PTR a1
         mulps  xmm0, XMMWORD PTR b1    // xmm0 = a1 * b1
         divps  xmm0, XMMWORD PTR c1    // xmm0 = (a1 * b1) / c1
         movaps XMMWORD PTR d1, xmm0
        };
   printf("\n a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\n b1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", b1[cnt]);
   printf("\n c1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", c1[cnt]);
   printf("\n\n d1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", d1[cnt]);
   getchar();
   return 0;
 }

This listing is straightforward. The window of the application is shown in Fig. 10.21.

Fig. 10.21: Window of an application that demonstrates parallel multiplication and division of the elements of floating-point arrays with the SSE extension assembler
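For comparison, the same computation d = (a * b) / c can be sketched with intrinsics in both the parallel and the scalar form. This is only an illustrative sketch; the function names mul_div4 and mul_div1 are hypothetical, and unaligned loads are used so that the caller does not have to align the arrays.

```cpp
#include <xmmintrin.h>

// Parallel form: d[i] = (a[i] * b[i]) / c[i] for i = 0..3 (mulps, divps).
void mul_div4(const float* a, const float* b, const float* c, float* d)
{
    __m128 v = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    v = _mm_div_ps(v, _mm_loadu_ps(c));
    _mm_storeu_ps(d, v);
}

// Scalar form: the same computation on one value (mulss, divss).
float mul_div1(float a, float b, float c)
{
    __m128 v = _mm_mul_ss(_mm_set_ss(a), _mm_set_ss(b));
    v = _mm_div_ss(v, _mm_set_ss(c));
    return _mm_cvtss_f32(v);    // extract the low-order element
}
```

For a = {6, 8, 9, 10}, b = {2, 3, 4, 5}, and c = {3, 4, 6, 2}, the parallel version produces {4, 6, 6, 25}.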

Multiplication and division of scalar values is demonstrated in the next example (Listing 10.33).

Listing 10.33: Scalar multiplication and division with SSE extension assembly commands

 // SCALAR_SSE_MUL_DIV_WITH_ASM.cpp : Defines the entry point for the
 // console application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1[4] = {4.98, 1.44, 3.16, 0.42};
   float b1[4] = {3.54, 1.23, 9.56, 5.09};
   float c1[4] = {1.5, 2.5, 3.5, 4.5};
   float d1[4];
   printf(" SCALAR MUL/DIV OF 2ARRAYS WITH ASM \n");
   int asize = sizeof(a1)/4;     // number of elements
   _asm {
         lea EAX, a1
         lea EDX, b1
         lea ESI, c1
         lea EDI, d1
         mov ECX, asize
         sub EAX, 4              // step back so that the first add
         sub EDX, 4              // in the loop points to element 0
         sub ESI, 4
         sub EDI, 4
   next_4:
         add EAX, 4
         add EDX, 4
         add ESI, 4
         add EDI, 4
         movss xmm0, DWORD PTR [EAX]
         mulss xmm0, DWORD PTR [EDX]
         divss xmm0, DWORD PTR [ESI]
         movss DWORD PTR [EDI], xmm0
         loop next_4
        };
   printf("\n a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\n b1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", b1[cnt]);
   printf("\n c1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", c1[cnt]);
   printf("\n\n d1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", d1[cnt]);
   getchar();
   return 0;
 }

Since we deal with 32-bit values, the registers EAX , EDX , ESI , and EDI are used to access them. The addresses of the arrays are loaded to these registers. The ECX register is used as an array element counter. To access the next elements of the arrays, the values in the registers EAX , EDX , ESI , and EDI are increased by four after each iteration.

For multiplication and division, the mulss and divss SSE commands are used.

The window of the application is shown in Fig. 10.22.

Fig. 10.22: Window of an application that demonstrates scalar multiplication and division of the elements of floating-point arrays

The next group of assembly commands we will describe includes comparison commands. It is best to illustrate their work with examples.

For the first example, we will consider one method of comparing two packed 128-bit values for equality. The result of such an operation is a 128-bit floating-point mask. If all the bits of the mask are equal to 1, two 128-bit numbers are equal.

In this example, this method is used for checking the elements of floating-point arrays for equality. For simplicity's sake, let the size of the arrays be equal to four. To make it easier to understand the algorithm, first consider a variant that uses C++ .NET 2003 SSE intrinsics. The source code of the console application is shown in Listing 10.34.

Listing 10.34: Using SSE extension intrinsics to compare array elements

 // SSE_CMPEQPS_INTR_EXAMPLE.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"
 #include <xmmintrin.h>

 int _tmain(int argc, _TCHAR* argv[])
 {
   __m128 a1 = {12.4, 19.1, 4.68, 3.12};
   __m128 a2 = {12.4, 19.1, 4.68, 3.12};
   __m128 ares;
   float res[4];
   ares = _mm_cmpeq_ps(a1, a2);
   _mm_storeu_ps(res, ares);
   printf("Result of comparison 2 packed elements \n\n");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("res [%d]=%f\n", cnt, res[cnt]);
   for (int cnt = 0; cnt < 4; cnt++)
     {
       if (res[cnt] == 0)       // a zero mask marks an unequal pair
         {
           printf("\nSSE operands are not equal!\n");
           getchar();
           return 0;
         }
     }
   printf("\nSSE operands are equal!\n");
   getchar();
   return 0;
 }

Here, we will analyze the source code. As always, if an application uses intrinsics and variables of the __m128 type, the xmmintrin.h header file should be included.

The a1 and a2 variables of the __m128 type are assigned four floating-point values each. In effect, the elements in braces make up a floating-point array. It is not declared explicitly, but it is very convenient to manipulate such a virtual array through 128-bit variables. Such manipulations are valid in C++ .NET.

A pairwise comparison of the arrays a1 and a2 is implemented with the following functions:

 ares = _mm_cmpeq_ps(a1, a2);
 _mm_storeu_ps(res, ares);

The result of comparison is written to the res array. For the given values of a1 and a2 , the comparison operation returns their equality, as is seen in Fig. 10.23.

Fig. 10.23: Result of comparing array elements with the application in Listing 10.34 that uses intrinsics

You can tell from Fig. 10.23 that all elements of the res array took the value of -1, which corresponds to one in all bits. In this case, this means the arrays a1 and a2 are equal.

If you change the source data, for example, give the last element of the a1 array (index 3) a value of 3.13 instead of 3.12, the result of comparison will change. This is shown in Fig. 10.24.

Fig. 10.24: Result of comparison of arrays when the last elements are unequal

Since the last elements of the arrays are unequal, the last element of the res array is zero, which means a1 and a2 are unequal.

Like with the MMX extension, using intrinsics for programming SSE leads to code redundancy. For example, the statements

 ares = _mm_cmpeq_ps(a1, a2);
 _mm_storeu_ps(res, ares);

from Listing 10.34 appear as shown in Listing 10.35 when disassembled.

Listing 10.35: The disassembled code of SSE extension intrinsics

  ares = _mm_cmpeq_ps(a1, a2);
 00411C78 movaps       xmm0, xmmword ptr [a2]
 00411C7C movaps       xmm1, xmmword ptr [a1]
 00411C80 cmpeqps      xmm1, xmm0
 00411C84 movaps       xmmword ptr [ebp-150h], xmm1
 00411C8B movaps       xmm0, xmmword ptr [ebp-150h]
 00411C92 movaps       xmmword ptr [ares], xmm0
  _mm_storeu_ps(res, ares);
 00411C96 movaps       xmm0, xmmword ptr [ares]
 00411C9A lea          eax, [res]
 00411C9D movups       xmmword ptr [eax], xmm0

The code is redundant because developers at Microsoft wanted to avoid manipulations with the registers xmm0 through xmm7 in C++ .NET programs and to store the result in a 128-bit memory variable. The commands

 00411C84 movaps xmmword ptr [ebp-150h], xmm1
 00411C8B movaps xmm0, xmmword ptr [ebp-150h]

can be advantageously replaced with

 movaps xmm0, xmm1

The command

 00411C80 cmpeqps xmm1, xmm0

can be replaced with a command whose one operand is in the memory.

Modify this example so that the inline assembler can be used. The source code of the application is shown in Listing 10.36.

Listing 10.36: A modified version of the application that compares array elements with SSE extension assembly commands

 // SSE_CMPEQPS_ASM_EXAMPLE.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1[4] = {12.4, 19.17, 4.68, 3.12};
   float a2[4] = {12.4, 19.1, 4.68, 3.12};
   float res[4];
   printf("Comparison 2 packed elements with SSE assembler\n\n");
   printf("a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\na2 : ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a2[cnt]);
   _asm {
         movups  xmm0, XMMWORD PTR a1
         cmpeqps xmm0, XMMWORD PTR a2
         movups  XMMWORD PTR res, xmm0
        };
   printf("\nResult of comparison: \n\n");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("res [%d] = %f\n", cnt, res[cnt]);
   for (int cnt = 0; cnt < 4; cnt++)
     {
       if (res[cnt] == 0)       // a zero mask marks an unequal pair
         {
           printf("\nSSE operands are not equal!\n");
           getchar();
           return 0;
         }
     }
   printf("\nSSE operands are equal!\n");
   getchar();
   return 0;
 }

Comparison is done in the assembly block and requires only three assembly commands! We also changed the type of the variables a1 and a2 from __m128 to float arrays and dropped the intermediate __m128 variable ares, to illustrate the use of floating-point numbers in the SSE extension. Comparison of 128-bit values is done with the cmpeqps command. It compares four packed 32-bit numbers in pairs and sets all the bits in the corresponding 32-bit field of a 128-bit result mask if the numbers are equal. Otherwise, the corresponding bits are reset to zeroes.

The window of the application is shown in Fig. 10.25.

Fig. 10.25: Window of an application that compares floating-point arrays with the cmpeqps SSE command
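Reading the mask back as four floats works, but it is not the only option. As a hedged alternative that does not appear in the original text, the _mm_movemask_ps intrinsic (the movmskps command) collects the top bit of each 32-bit mask field into an ordinary integer, so the whole comparison result can be tested at once. The function names equal_mask4 and all_equal4 are hypothetical.

```cpp
#include <xmmintrin.h>

// Returns a 4-bit value: bit i is set if a[i] == b[i].
int equal_mask4(const float* a, const float* b)
{
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));  // cmpeqps
    return _mm_movemask_ps(mask);                                  // movmskps
}

// True only if all four pairs of elements are equal.
bool all_equal4(const float* a, const float* b)
{
    return equal_mask4(a, b) == 0xF;
}
```

For the data of Listing 10.36, only the pair at index 1 differs (19.17 versus 19.1), so the mask is 1101 in binary, that is, 0xD.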

In addition to checking numbers for equality, it is often required to compare the values of two numbers. The next example demonstrates how the elements of two floating-point arrays can be compared according to the greater than / less than principle with the cmpleps SSE command. The source code of the C++ .NET console application is shown in Listing 10.37.

Listing 10.37: Comparing two floating-point arrays for greater than / less than

 // SSE_CMPLEPS_ASM.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1[4] = {3.4, 9.17, 4.39, 3.12};
   float a2[4] = {12.7, 19.1, 4.68, 3.52};
   float res[4];
   printf("LTEQ comparison 2 packed elements with SSE ext. assembler\n\n");
   printf("a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\na2: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a2[cnt]);
   _asm {
         movups  xmm0, XMMWORD PTR a1
         cmpleps xmm0, XMMWORD PTR a2
         movups  XMMWORD PTR res, xmm0
        };
   printf("\n\nResult of comparison: \n\n");
   for (int cnt = 0; cnt < 4; cnt++)
     {
       if (res[cnt] != 0)
           printf("\na1[%d] <= a2 [%d], mask = %.2f\n", cnt, cnt, res[cnt]);
       else
           printf("\na1[%d] > a2 [%d], mask = %.2f\n", cnt, cnt, res[cnt]);
     }
   getchar();
   return 0;
 }

As in the previous example, comparing array elements comes down to comparing two 128-bit operands. Each of the four 32-bit fields of the result mask is set or cleared depending on the comparison of the packed 32-bit numbers at the corresponding positions of the source and destination operands. For example, if an element of the a1 array is less than or equal to the element at the same position in the 128-bit variable that represents the a2 array, ones are written to all bits of the corresponding field of the result mask. Otherwise, i.e., if the element of a1 is greater than that of a2, zeroes are written to that field.

The window of the application is shown in Fig. 10.26.

Fig. 10.26: Window of an application that compares two floating-point arrays for greater than / less than with SSE extension assembly commands
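A comparison mask becomes especially useful when combined with the logical commands listed at the beginning of this section. The sketch below is not part of the original examples; under that caveat, it uses the mask produced by the intrinsic form of cmpleps to choose, element by element, between two operands without any branches. The function names le_mask4 and select_le4 are hypothetical.

```cpp
#include <xmmintrin.h>

// Returns a 4-bit value: bit i is set if a[i] <= b[i].
int le_mask4(const float* a, const float* b)
{
    return _mm_movemask_ps(_mm_cmple_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// out[i] = (a[i] <= b[i]) ? a[i] : b[i], computed without branches.
void select_le4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 mask   = _mm_cmple_ps(va, vb);      // cmpleps: all ones where a <= b
    __m128 from_a = _mm_and_ps(mask, va);      // andps:  keep a where mask is set
    __m128 from_b = _mm_andnot_ps(mask, vb);   // andnps: keep b where mask is clear
    _mm_storeu_ps(out, _mm_or_ps(from_a, from_b));   // orps: merge both parts
}
```

For a = {1, 5, 3, 8} and b = {2, 4, 3, 7}, the mask is 0101 in binary (elements 0 and 2 satisfy a <= b) and the selected result is {1, 4, 3, 7}.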

The examples above demonstrate parallel comparison of packed elements. There are a few more SSE extension assembly commands that perform scalar comparison of pairs of the low-order double words. The result is defined by setting the corresponding bits in the flag register. One of these commands is comiss .

The next example demonstrates the use of this command for pairwise comparison of the elements of two floating-point arrays. The source code of the application is shown in Listing 10.38.

Listing 10.38: Comparing floating-point arrays with the comiss command

 // SSE_COMISS_ASM.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   float a1[4] = {12.7, 19.17, 4.68, 3.52};
   float a2[4] = {12.7, 19.17, 4.68, 3.52};
   bool cres = true;
   printf("EQ comparison 2 arrays with COMISS \n\n");
   printf("a1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a1[cnt]);
   printf("\na2: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.2f\t", a2[cnt]);
   _asm {
         lea    EAX, a1
         lea    EDX, a2
         mov    EBX, 0           // assume "unequal" until the loop ends
         mov    ECX, 4           // number of elements to compare
   next:
         movss  xmm0, DWORD PTR [EAX]
         comiss xmm0, DWORD PTR [EDX]
         jne    no_eq
         add    EAX, 4
         add    EDX, 4
         loop   next
         mov    EBX, 1           // all pairs were equal
   no_eq:
         mov    cres, BL         // cres is one byte, so store only BL
        }
   if (cres) printf("\nEquals ! \n");
   else      printf("\nUnequals ! \n");
   getchar();
   return 0;
 }

Comparison is done in the assembly block. To perform the operations in a loop, the pointer should be moved to the next element in each iteration. Note the following fragments of code:

 lea    EAX, a1
 lea    EDX, a2

and

 add    EAX, 4
 add    EDX, 4

Comparison proper is done with the following command:

 comiss xmm0, DWORD PTR [EDX]

The low-order part of the xmm0 register contains an element of a1 that is compared to the corresponding element of a2, whose current address is in the EDX register. The ECX register is the iteration counter, initially equal to the size of the arrays. Depending on the result of the comparison, the EBX register contains either one or zero. One means the arrays are equal, while zero means the opposite.

The window of the application is shown in Figs. 10.27 and 10.28.

Fig. 10.27: Window of an application that demonstrates scalar comparison with the comiss command. The arrays are equal

Fig. 10.28: Window of an application that demonstrates scalar comparison with the comiss command. The arrays are unequal
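The same loop can be sketched with the _mm_comieq_ss intrinsic, which wraps the comiss command and returns the outcome as an integer instead of leaving it in the flag register. This is only an illustration; the function name arrays_equal is hypothetical, and the sketch assumes the arrays contain ordinary numbers (no NaN values).

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Compares two float arrays of n elements pair by pair and
// stops at the first unequal pair, as the assembly loop does.
bool arrays_equal(const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        __m128 x = _mm_load_ss(a + i);    // movss: load one element of a
        __m128 y = _mm_load_ss(b + i);    // load the matching element of b
        if (!_mm_comieq_ss(x, y))         // comiss: compare low-order elements
            return false;
    }
    return true;
}
```

Here the loop index plays the role of the pointer increments in EAX and EDX, and the early return replaces the jump to the no_eq label.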

The SSE extension includes a number of commands that make it possible to convert formats between SSE, MMX, and the common integer format. The commands of this group can perform both scalar and parallel operations. We will illustrate the use of these commands with examples.

The first example relates to parallel conversion of two 32-bit integers stored in the mm0 MMX register to two 32-bit floating-point numbers that are written to the two low-order double words of the xmm0 SSE register. The source code of the application is shown in Listing 10.39.

Listing 10.39: Converting MMX integers to SSE floating-point numbers

 // MMX_INT_INTO_SSE_FLOAT.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int i1[2];
   float f1[2];
   printf("PARALLEL CONV. 2INTS TO 2 FLOAT WITH SSE-EXT.ASM\n\n");
   while (true)
     {
       printf("\nEnter first integer i1[0]: ");
       scanf("%d", i1);
       printf("\nEnter second integer i1[1]: ");
       scanf("%d", &i1[1]);
       printf("\n");
       _asm {
             movq     mm0, MMWORD PTR i1
             cvtpi2ps xmm0, mm0
             movlps   QWORD PTR f1, xmm0    // movlps stores 64 bits
             emms
            };
       for (int cnt = 0; cnt < 2; cnt++)
            printf("f1[%d] = %.3f\n", cnt, f1[cnt]);
       printf("f1[0] / f1[1] = %.3f\n", f1[0]/f1[1]);
     };
   return 0;
 }

Parallel conversion of two 32-bit integers to two 32-bit floating-point numbers is done with the cvtpi2ps command. The assembly block that performs the conversion contains both SSE and MMX commands. The command

 movq mm0, MMWORD PTR i1

moves a 64-bit number (two integers) to the mm0 register. The command

 cvtpi2ps xmm0, mm0

converts two 32-bit packed integers stored in the mm0 register to two floating-point numbers and puts the result into the xmm0 SSE register. The two high-order double words of the xmm0 register do not change. The precision of the result of conversion depends on which bits of the SSE status/control register are set. The result of conversion is stored as a floating-point array of two 32-bit numbers with the command

 movlps QWORD PTR f1, xmm0

that moves the two low-order double words from the xmm0 register to the f1 array. To illustrate the correctness of the integer-to-floating-point conversion, the statement

 printf ("f1[0] / f1[1] = %.3f\n", f1[0]/f1[1])

displays the result of division of two floating-point numbers.

The window of the application is shown in Fig. 10.29.

Fig. 10.29: Window of an application that demonstrates parallel conversion of two 32-bit integers to floating-point numbers with SSE extension assembly commands

The next example demonstrates the reverse conversion of a 32-bit floating-point number to the integer type. The source code is shown in Listing 10.40.

Listing 10.40: Converting SSE floating-point numbers to MMX integers

 // SSE_INTO_MMX_CONV.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   int i1[2];
   float f1[2] = {0, 0};
   printf("PARALLEL CONV. OF 2 FLOATS TO 2 INTS WITH SSE-EXT.ASM\n\n");
   printf("Enter real f1: ");
   scanf("%f", &f1[0]);
   printf("Enter real f2: ");
   scanf("%f", &f1[1]);
   _asm {
         movlps    xmm0, QWORD PTR f1
         cvtps2pi  mm0, xmm0
         movq      QWORD PTR i1, mm0
         emms
        };
   printf("\nAfter conversion f1 --> i1, f2 --> i2\n\n");
   printf("i1 = %d\n", i1[0]);
   printf("i2 = %d\n", i1[1]);
   printf("i1 / i2 = %d\n", i1[0]/i1[1]);
   getchar();
   getchar();
   return 0;
 }

Parallel conversion of two floating-point numbers to two integers is done in the assembly block with the command

 cvtps2pi mm0, xmm0

The floating-point numbers stored in the xmm0 SSE register are converted to integers and stored in the mm0 MMX register. The commands movlps and movq move the data. When using MMX extension assembly commands, make sure that the last command is emms !

To illustrate correctness of the conversion, the statement

 printf ("i1 / i2 = %d\n", i1[0]/i1[1])

displays the result of division of two integers.

The window of the application is shown in Fig. 10.30.

Fig. 10.30: Window of an application that demonstrates parallel conversion of 32-bit floating-point numbers to integers with the assembler

In addition to these commands of parallel conversion, the SSE extension includes scalar conversion commands. By using them, you can convert 32-bit numbers. We will not provide examples of how to use these commands, but you can do this as an exercise.
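As a starting point for that exercise, here is a minimal sketch of the scalar conversions written with intrinsics rather than the inline assembler. The function names are hypothetical, and the rounding behavior described in the comments assumes the SSE control register is in its default state (round to nearest).

```cpp
#include <xmmintrin.h>

// cvtsi2ss: 32-bit integer to 32-bit floating-point number.
float int_to_float(int i)
{
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), i));
}

// cvtss2si: floating-point number to integer, rounded according to
// the current rounding mode (round to nearest even by default).
int float_to_int_rounded(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

// cvttss2si: floating-point number to integer, always truncated
// toward zero regardless of the rounding mode.
int float_to_int_truncated(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

With the default rounding mode, 2.7 converts to 3 with the rounding form and to 2 with the truncating form, and the halfway values 2.5 and 3.5 round to the even neighbors 2 and 4. Unlike the parallel conversions above, these scalar commands do not use MMX registers, so no emms command is needed.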

To complete the review of SSE extension commands and examples of their use in practice, we will concentrate on a few specific but very useful instructions. These are extraction of the square root and finding the maximum/minimum value of a pair of numbers. SSE extension includes both parallel and scalar commands.

We will look at an example that extracts the square root from floating-point packed numbers. For this operation, the sqrtps command is used. The source code of the application is shown in Listing 10.41.

Listing 10.41: Parallel extraction of the square root from floating-point packed numbers

 // PARALLEL_SQRT.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"

 int _tmain(int argc, _TCHAR* argv[])
 {
   __declspec(align(16)) float f1[4] = {34.78, 23.56, 876.98, 9423.678};
   float fsqrt[4];
   printf("PARALLEL SQRT CALCULATION WITH SSE-EXT. ASSEMBLER\n\n");
   printf("f1: ");
   for (int cnt = 0; cnt < 4; cnt++)
        printf("%.3f\t", f1[cnt]);
   printf("\n\n");
   _asm {
         movaps xmm0, XMMWORD PTR f1
         sqrtps xmm0, xmm0
         movups XMMWORD PTR fsqrt, xmm0    // fsqrt is not aligned
        };
   for (int cnt = 0; cnt < 4; cnt++)
        printf("SQRT (%.3f)= %.3f\n", f1[cnt], fsqrt[cnt]);
   getchar();
   return 0;
 }

The square root is extracted with the command

 sqrtps xmm0, xmm0

The source and the destination of this command are the same SSE register, xmm0. The destination of the result is always one of the registers xmm0 through xmm7. For the second operand, a 128-bit memory variable can be used. The address of the 128-bit variable should be aligned on a 16-byte boundary: aligned access is the fastest, and commands such as movaps cause an exception when their memory operand is not aligned. The alignment is specified in the line

 __declspec(align(16)) float f1[4] = {34.78, 23.56, 876.98, 9423.678};
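The alignment requirement can be checked at run time. The following is a minimal sketch, not part of the original listing: it uses the standard C++11 alignas specifier, which has the same effect as __declspec(align(16)) in Visual C++, and the names is_aligned_16 and PackedFloats are hypothetical.

```cpp
#include <cstdint>  // std::uintptr_t

// Hypothetical helper: returns true when the address is a multiple of
// 16 bytes, which is what movaps expects from its memory operand.
bool is_aligned_16(const void* address)
{
    return reinterpret_cast<std::uintptr_t>(address) % 16 == 0;
}

// Hypothetical wrapper type: alignas(16) asks the compiler to place the
// array on a 16-byte boundary, like __declspec(align(16)) does in VC++.
struct PackedFloats
{
    alignas(16) float values[4];
};
```

A variable of the PackedFloats type always passes the is_aligned_16 check, whereas the address of its second element (4 bytes further) does not. For data that cannot be aligned, such as the fsqrt array in Listing 10.41, the movups command is used instead of movaps.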

The window of the application is shown in Fig. 10.31.

Fig. 10.31: Window of an application that demonstrates parallel extraction of the square root from floating-point packed numbers with the SSE extension assembler
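For comparison, the same calculation can be sketched with the SSE intrinsics from xmmintrin.h, which also compile outside Visual C++ and on 64-bit targets where the inline assembler is not available. This is an illustration under the assumption of an x86 target with SSE enabled; the function names sqrt4 and sqrt1 are hypothetical. The second function shows the scalar form of the operation, which corresponds to the sqrtss command.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: square roots of four packed floats.
// Unaligned loads and stores are used, so the arrays need no alignment.
void sqrt4(const float* in, float* out)
{
    __m128 v = _mm_loadu_ps(in);   // corresponds to movups
    v = _mm_sqrt_ps(v);            // corresponds to sqrtps
    _mm_storeu_ps(out, v);         // corresponds to movups
}

// Hypothetical helper: scalar square root of one float.
// Only the low element of the register is processed.
float sqrt1(float value)
{
    float result;
    _mm_store_ss(&result, _mm_sqrt_ss(_mm_set_ss(value)));  // sqrtss
    return result;
}
```

Calling sqrt4 with the array {4, 9, 16, 25} fills the output array with {2, 3, 4, 5}.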

The final example demonstrating the features of the SSE extension finds the maximum and minimum elements in pairs of packed floating-point numbers. For this purpose, the maxps and minps SSE extension commands are used. Each of them compares the corresponding elements of its two operands and writes the larger (maxps) or the smaller (minps) value of every pair to the destination. Both commands operate on 128-bit packed operands, and the destination for the result can be only one of the SSE registers. The source code of the application is shown in Listing 10.42.

Listing 10.42: A search for the maximum and minimum element among pairs of floating-point packed numbers

 // SSE_PARALEL_MIN_MAX.cpp : Defines the entry point for the console
 // application
 #include "stdafx.h"
 #include <stdio.h>   // printf, scanf

 int _tmain(int argc, _TCHAR* argv[])
 {
   __declspec(align(16)) float f1[4] = {34.78, 23.56, 876.98, 9423.678};
   __declspec(align(16)) float f2[4] = {34.98, 21.37, 980.43, 1755.786};
   float fmin[4];
   float fmax[4];
   int choice;
   printf("PARALLEL MINIMAX CALCULATION WITH SSE-EXT.ASSEMBLER\n\n");
   printf("f1: ");
   for (int cnt = 0; cnt < 4; cnt++)
       printf("%.3f\t", f1[cnt]);
   printf("\n");
   printf("f2: ");
   for (int cnt = 0; cnt < 4; cnt++)
       printf("%.3f\t", f2[cnt]);
   while (true)   // the program runs until the console window is closed
    {
     printf("\n\n");
     printf("Enter 0 - get MAX elements, 1 - get MIN elements: ");
     scanf("%d", &choice);
     printf("\n");
     switch (choice)
       {
         case 0:
           _asm {
                 movaps xmm0, XMMWORD PTR f1      // load f1
                 maxps  xmm0, XMMWORD PTR f2      // larger of each pair
                 movups XMMWORD PTR fmax, xmm0    // store the result
                };
           printf("\n\nMAX: ");
           for (int cnt = 0; cnt < 4; cnt++)
               printf("%.3f\t", fmax[cnt]);
           break;
         case 1:
           _asm {
                 movaps xmm0, XMMWORD PTR f1      // load f1
                 minps  xmm0, XMMWORD PTR f2      // smaller of each pair
                 movups XMMWORD PTR fmin, xmm0    // store the result
                };
           printf("\n\nMIN: ");
           for (int cnt = 0; cnt < 4; cnt++)
               printf("%.3f\t", fmin[cnt]);
           break;
         default:
           break;
       }
    }
   return 0;
 }

The window of the application is shown in Fig. 10.32.

Fig. 10.32: Window of an application that demonstrates a parallel search for the maximum and minimum elements in two floating-point arrays with the assembler
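The element-by-element behavior of maxps and minps can also be sketched with the SSE intrinsics from xmmintrin.h. As before, this is an illustration rather than the method used in the listing; it assumes an x86 target with SSE enabled, and the function names max4 and min4 are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: out[i] = larger of a[i] and b[i] for i = 0..3.
void max4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);             // unaligned load of a
    __m128 vb = _mm_loadu_ps(b);             // unaligned load of b
    _mm_storeu_ps(out, _mm_max_ps(va, vb));  // corresponds to maxps
}

// Hypothetical helper: out[i] = smaller of a[i] and b[i] for i = 0..3.
void min4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);             // unaligned load of a
    __m128 vb = _mm_loadu_ps(b);             // unaligned load of b
    _mm_storeu_ps(out, _mm_min_ps(va, vb));  // corresponds to minps
}
```

For the arrays f1 and f2 from Listing 10.42, max4 selects 34.98, 23.56, 980.43, and 9423.678, and min4 selects 34.78, 21.37, 876.98, and 1755.786. Every result element is copied from one of the two inputs, so no rounding takes place.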

Here are a few conclusions regarding the use of the MMX and SSE extensions of SIMD technology. These extensions significantly speed up applications that must process large amounts of data in a limited time, because several data elements can be processed in parallel in each iteration of a loop. Although Visual C++ .NET 2003 has intrinsics for working with the SIMD technologies, the inline assembler can provide higher performance than they do because it has one fundamental advantage: hand-written code contains no redundant instructions. Moreover, the intrinsics themselves are translated to the same assembly commands.

Operations on packed numbers in the MMX and SSE extensions have increased precision, so they should be preferred, all things being equal.

Before you use the numerous possibilities provided by the SIMD technologies, think through the algorithm of your task thoroughly and assess whether using them is justified. Because of the limited scope of this book, not all SIMD features are discussed. However, we hope that you derive benefit from this section and will be able to use these technologies effectively in your programs.