Chapter 10: Inline Assembler and Application Optimization. MMX and SSE Technologies | Visual C++ Optimization with Assembly Code

Download CD Content

This chapter will focus on practical issues related to the use of the inline assembler in C++ .NET applications. While the previous chapter described basic principles of using the inline assembler, this one will present practical examples of using this tool to implement various tasks. The inline assembler is used most frequently in computational tasks , multimedia applications, and for text processing. Certain assembler features were demonstrated in Chapter 2 , and here we will discuss this topic in more detail.

To demonstrate specific technologies, we will use practical examples. All theoretical aspects will be considered in the context of these examples, and all necessary explanations will be given during analysis of them.

Inline Assembler and Optimizing Mathematical Operations

The first example is computing the sum of the elements of a floating-point array. In addition to demonstrating the work of the mathematical coprocessor, we will show how to use a pointer to return a result. Develop a C++ .NET dialog-based application.

On the application s main form, place two edit controls, two static text controls, and one button. The entire floating-point array will be output to the Edit1 edit control, and the Edit2 control will display the result of computation. Associate the s_Array and f_Summa variables with the Edit1 and Edit2 controls, respectively. Assign the s_Array variable the CString string type, and assign the f_Summa variable the float type.

Computing the sum of the floating-point array elements will be done with the sumReals function. This function takes two parameters: the address and size of the array. Add the function to the dialog box class and comment out the return 0 statement ( marked in bold). The source code of this function and the OnBnClicked handler is shown in Listing 10.1.

Listing 10.1: Using pointers in a program that computes the sum of the floating-point array elements

 float* CSummaofRealsDlg: rsumReals (float* f array, int If)  {    float fsum;    _asm {          mov   ESI, farray          mov   ECX, lf          dec   ECX          finit          fldz          fld   [ESI]        next:          add   ESI, 4          fadd  [ESI]          loop  next          fstp  fsum          fwait          lea   EAX, fsum    };  // Return 0;  }  void CSummaofRealsDlg::OnBnClickedButton1 ()  {   // TODO: Add your control notification handler code here    float farray[] = {9.34, 15.05,   4.32,   173.12,   88.45};    int fsize = sizeof (farray) /4;    CString stmp;    UpdateData(TRUE);    stmp. Empty;    for (int cnt = 0; cnt < fsize; cnt++)    {      stmp. Format ("%.2f ", farray [cnt]);      s_Array = s_Array + " " + stmp;    };    f_Summa = *sumReals (farray, fsize);    UpdateData(FALSE);  }

As noted earlier, it is not necessary to save the ESI register when calling a function. This is why the sumReals function loads the address of the farray array to the ESI register and the size of the array to the ESI register at the beginning.

 mov ESI, farray  mov ECX, lf

The address of the farray array is a pointer to its first element. The contents of the ESI register is used for organizing loop computation. The value of the first element of the array is pushed to the stack with the following coprocessor command:

 fld [ESI]

In each iteration of the loop, the pointer is moved to the next element of the array, and the value of the element is added to the top of the stack:

 next:    add  ESI, 4    fadd [ESI]    loop next

The sum is stored in the fsum local variable:

 fstp fsum

The last command of the assembly function is:

 lea EAX, fsum

It loads the address of the fsum variable to the EAX register. The function returns this address to the main program.

The OnBnClicked handler uses string variables for conversion of numeric values to text. We will address strings in later examples. Now look at the following statement:

 f_Summa = *sunnReals (farray, fsize);

Here, f_Summa is a variable of the float type, and the sumReals function returns an address. To get the value by the address, the dereference operator ( "*" ) is used before the address expression, which is the function name in our case.

The first parameter passed to the sumReals function is the address of the array. It can be expressed in an alternative form. As you know, the address of an array points to the first element, so the statement being discussed can be written in the other form:

 f_Summa = *sumReals(&farray[0], fsize);

where the address operator ( " & " ) is used to present the first parameter in a correct form.

The window of the application is shown in Fig. 10.1.

Fig. 10.1: Window of an application that computes the sum of the floating-point array elements

Subsequent examples will illustrate the implementation of simple algorithms for sorting and searching for the maximum element. Using these examples, we will estimate the effectiveness of the inline assembler for program optimization. We will look at an algorithm for searching for the maximum element in an integer array first. Use the Visual C++ .NET Application Wizard to create a dialog-based application frame.

On the main form, place two edit controls, two static text controls, and a button. The search for the maximum element of an integer array will be done with the findMax function. The function takes the address and size of the array as parameters. The result of the function is the value of the maximum element of the array. The code of the findMax function and the OnBnClicked handler is shown in Listing 10.2.

Listing 10.2: Looking for the maximum element in an integer array and displaying the result

 int CFIND_MAX_IN_ARRAY_OF_INTSDlg: :findMax(int* pi1, int si1)  {    int maxVal;    _asm {          mov   EDI, pi1          mov   ECX, si1          mov   EDX, [EDI]          mov   maxVal, EDX        next:          add   EDI, 4          mov   EAX, [EDI]          cmp   EAX, maxVal          ji    no_change          push  EAX          pop   maxVal   no change:          loop  next        ex:          mov   EAX, maxVal  };   // Return 0;  }   void CFIND_MAX_IN_ARRAY_OF_INTSDlg::OnBnClickedButton1()  {  // TODO: Add your control notification handler code here    CString s1;    int i1[] = {2, 33, -19, -7, 32, -90, 13};    int si1 = sizeof(i1)/4;    s1.Empty;    for (int cnt = 0; cnt < si1; cnt++)    {      s1.Format("%d", i1[cnt]);      s_Array = s_Array + " " + s1;    };    i_Max = findMax(i1, si1);    UpdateData(FALSE); }

Now, we will discuss the algorithm of the findMax function. The first element of the array is presumed maximum compared to the other elements in a loop. If one of the next elements is greater than the current maximum, it is considered maximum. The loop is repeated until the end of the array is reached. The commands

 mov EDI, pi1  mov ECX, si1

load the pointer to the array and its size to the EDI and ECX registers, respectively. The comparison between the current maximum and the array element and the choice of a new value as the maximum is done with the following group of commands:

 ... next:    add EDI, 4    mov EAX, [EDI]    cmp EAX, maxVal    jl no_change    push EAX    pop maxVal     ...

As always, the function returns the maximum value in the EAX register:

 mov EAX, maxVal

The OnBnClicked handler does not do anything out of the way. The returned maximum value is assigned to the i_Max variable, which corresponds to the Edit2 control, and is displayed.

The window of the application is shown in Fig. 10.2.

Fig. 10.2: Window of an application that searches for the maximum element in an integer array

It would be interesting to compare two maximum search applications, one using the assembly function, and the other written in "pure" C++.

We already have the assembly variant, so now we will develop a similar program in C++. The code of the OnBnClicked handler is shown in Listing 10.3.

The piece of code that searches for the maximum is in bold.

Listing 10.3: A fragment of a C++ program that searches for the maximum element

 ...  void CFindMaxIntwithCDlg: :OnBnClickedButton1 ()  {    // TODO: Add your control notification handler code here    CString s1;    int i1[] = {2, 33,   19,   7 , 32,   90, 13};    int si1 = sizeof (i1) /4;    // Display the array    s1.Empty;    for (int cnt = 0; cnt < si1; cnt++)    {      s1. Format ("%d", i1 [cnt]);      s_Array = s_Array + " " + s1;    };    // Look for the maximum element in an integer array  i_Max = i1 [0]   for (int cnt = 1; cnt < si1; cnt++)   {   if (i_Max >= i1[cnt]) continue ;   else i_Max = i1 [cnt];   }  UpdateData(FALSE);  };

The code in this fragment does not need any explanation. Look at the listing of the disassembled C++ version of the program, more precisely at its fragment that corresponds to the for loop that computes i_Max (Listing 10.4).

Listing 10.4: The code of the disassembled for loop

  i Max = i1[0];  0041380C mov        eax, dword ptr [this]  0041380F mov        ecx, dword ptr [i1]  00413812 mov        dword ptr [eax+7Ch], ecx  for (int cnt = 1;cnt < si1; cnt++)  00413815 mov        dword ptr [cnt], 1  0041381C jmp        CFindMaxIntwithCDlg::OnBnClickedButton1+167h (413827h)  0041381E mov        eax, dword ptr [cnt]  00413821 add        eax, 1  00413824 mov        dword ptr [cnt], eax  00413827 mov        eax, dword ptr [cnt]  0041382A cmp        eax, dword ptr [si1]  0041382D jge        CFindMaxIntwithCDlg::OnBnClickedButton1+18Fh (41384Fh)  {   if (i Max >= i1[cnt]) continue ;  0041382F mov        eax, dword ptr [this]  00413832 mov        ecx, dword ptr [cnt]  00413835 mov        edx, dword ptr [eax+7Ch]  00413838 cmp        edx, dword ptr i1[ecx*4]  0041383C jl         CFindMaxIntwithCDlg::OnBnClickedButton1+180h (413840h)  0041383E jmp        CFindMaxIntwithCDlg::OnBnClickedButton1+15Eh (41381Eh)  else i Max = i1[cnt];  00413840 mov        eax, dword ptr [this]  00413843 mov        ecx, dword ptr [cnt]  00413846 mov        edx, dword ptr i1[ecx*4]  0041384A mov        dword ptr [eax+7Ch], edx  }  0041384D jmp        CFindMaxIntwithCDlg::OnBnClickedButton1+15Eh (41381Eh)

Compare the disassembled code to the findMax function in the assembly version. A glance is enough to understand the redundancy of the C++ version. Before analyzing the disassembled code, we will consider another example ”sorting the integer array in descending order.

This is a dialog-based application. On its main form, place two edit controls, two static text controls, and a button. The original (unsorted) array will be output to one of the edit controls (that corresponds to the s_Src variable), and the other edit control (corresponding to the s_Dst variable) will display the same array sorted in descending order. The array sort is done with the sortMax function that takes the address and size of the array as parameters. Since code of the control notification handler is similar to the code in the previous examples, we will not focus on it. The code of the sortMax function and the handler that calls this function is shown in Listing 10.5.

Listing 10.5: Fragment of a program that sorts an integer array and displays the result

 void CSORT_ARRAY_BY_MAXIMUMDlg: :sortMax(int* pi1, int si1)   {    int isize = sil;    _asm {         push       EBX         mov        EDI, DWORD PTR pi1         mov        EBX, EDI       big_loop:         mov        ECX, DWORD PTR isize         mov        EAX, [EDI]     next:         mov        EAX, [EDI]         cmp        EAX, [EDI+4]         jl         change         jmp        cont       change:         xchg       EAX, [EDI+4]         mov        [EDI], EAX      cont:         add        EDI, 4         loop       next         dec        isize         cmp        isize, 0         je         ex         mov        EDI, EBX         jmp        big_loop       ex:         pop        EBX      };   }    void CSORT_ARRAY_BY_MAXIMUMDlg: :OnBnClickedButton1()    {    // TODO: Add your control notification handler code here     CString s1;     int i1[]={17,   9, 31,   7, 4, 76, 47,   59};     int si1=sizeof (i1)/4;     s1.Empty;     for (int cnt=0; cnt < si1; cnt++)     {      s1.Format("%d, i1[cnt]);      s_Src=s_Src+" "+s1;     };     sortMax(i1, si1);     s1.Empty;     for (int cnt=0; cnt < si1; cnt++)     {      s1.Format("%d, i1[cnt]);      s_Dst=s_Dst+" "+s1;     };   UpdateData(FALSE);  }

The window of this application is shown in Fig. 10.3.

Fig. 10.3: Window of an application that sorts an integer array in descending order

Now, we will implement the same task of sorting an array with a program written in pure C++ and examine the disassembled code, more precisely its fragment that performs sorting. The pieces of the C++ code that sort an array and display the result are shown in Listing 10.6. The fragment, in which we are interested, is in bold.

Listing 10.6: The code that sorts an integer array only with C++ statements

 void CSortbyDecreasewithCNETDlg: :OnBnClickedButton1 ()  {    // TODO: Add your control notification handler code here    CString s1;    int i1[]={17,   9, 31,   7, 4, 76, 47,   59};    int itmp;    int size_i1=sizeof (i1) /4;    s_S rc. Empty;    s_Dst. Empty;    UpdateData(FALSE);       s1. Empty;    // Display the original array in the edit control    for (int cnt=0; cnt < size_i1; cnt ++)    {      s1. Format ("%d, i1 [cnt]);      s_Src=s_Src+" "+s1;    };    // Sort the array  int tSize_i1=size_i1;   while (tSize_i1 != 0)   {   for (int cnt=0; cnt < tSize_i1; cnt++)   {   if (i1[cnt] >= i1[cnt+1]) continue;   else   {   itmp=i1 [cnt] ;   i1[cnt] =   i1[cnt+1]=itmp;   };   };   tSize_i1  ;   };  // Display the sorted array in the edit control    for (int cnt=0; cnt < size_i1;cnt ++)    {      s1. Format ("%d, i1 [cnt]) ;      s_Dst=s_Dst+" "+s1;    } ;    UpdateData(FALSE) ;  }

We will not give a detailed analysis of the code fragment in bold because it is simple for C++ programmers. The disassembled code of this fragment is shown in Listing 10.7.

Listing 10.7: The disassembled C++ code

  int tSize_i1 = size_i1;  0041382D  mov     eax, dword ptr [size_i1]  00413830  mov     dword ptr [tSize_i1], eax  while (tSize_i1 != 0)  00413833  cmp     dword ptr [tSize_i1],0  00413837  je      CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1E2h   (4138B2h)  {   for (int cnt = 0; cnt  < tSize_i1; cnt++)  00413839  mov     dword ptr [cnt],0  00413843  jmp     CSortbyDecreasewithCNETDlg::OnBnClickedButton1+184h   (413854h)  00413845  mov     eax, dword ptr [cnt]  0041384B  add     eax, 1  0041384E  mov     dword ptr [cnt], eax  00413854  mov     eax, dword ptr [cnt]  0041385A  cmp     eax, dword ptr [tSize_i1]  0041385D  jge     CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1D7h   (4138A7h)  {   if (i1[cnt] >= i1[cnt+1]) continue;  0041385F  mov     eax, dword ptr [cnt]  00413865  mov     ecx, dword ptr [cnt]  0041386B  mov     edx, dword ptr i1[eax*4]  0041386F  cmp     edx, dword ptr [ebp+ecx*444h]  00413873  jl      CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1A7h  (413877h)  00413875  jmp     CSortbyDecreasewithCNETDlg::OnBnClickedButton1+175h  (413845h)  else   {   itmp = i1 [cnt];  00413877  mov     eax, dword ptr [cnt]  0041387D  mov     ecx, dword ptr i1[eax*4]  00413881  mov     dword ptr [itmp], ecx  i1[cnt] = i1[cnt+1];  00413884  mov     eax, dword ptr [cnt]  0041388A  mov     ecx, dword ptr [cnt]  00413890  mov     edx, dword ptr [ebp+ecx*444h]  00413894  mov     dword ptr i1[eax*4], edx  i1[cnt+1] = itmp;  00413898  mov     eax, dword ptr [cnt]  0041389E  mov     ecx, dword ptr [itmp]  004138A1  mov     dword ptr [ebp+eax*444h] , ecx  };   };  004138A5  jmp     CSortbyDecreasewithCNETDlg::OnBnClickedButton1+175h   (413845h)  tSize_i1--;  004138A7  mov     eax, dword ptr [tSize_i1]  004138AA  sub     eax, 1  004138AD  mov     dword ptr [tSize_i1] , eax  };  004138B0  jmp     CSortbyDecreasewithCNETDlg::OnBnClickedButton1+163h   (413833h)

These two disassembled fragments have common features. You might have noticed that the C++ versions of the programs intensively use the main memory for variables, while the processor registers are used much more seldom. This is not bad; however, it makes the code redundant and slows down the application in general. Such a slow-down is insignificant with little computation on small amounts of data. However, it becomes noticeable with large array sizes.

For example, swapping two elements in an array is implemented in C++ as follows :

 itmp = i1 [cnt];  i1[cnt] =  i1[cnt+1];  i1[cnt+1]  = itmp;

The equivalent assembly code is:

 mov      eax, dword ptr [cnt]  mov      ecx, dword ptr i1[eax*4]  mov      dword ptr [itmp], ecx  mov      eax, dword ptr [cnt]  mov      ecx, dword ptr [cnt]  mov      edx, dword ptr [ebp+ecx*444h]  mov      dword ptr i1[eax*4], edx  mov      eax, dword ptr [cnt]  mov      ecx, dword ptr [itmp]  mov      dword ptr [ebp+eax*444h], ecx

By using these assembly commands

 mov   EAX, [EDI]  xchg  EAX, [EDI+4]  mov   [EDI], EAX

you can increase the performance of the maximum search program. Here, the values of variables are stored in registers, and computation is very fast because the data exchange through the system bus is significantly decreased.

The same is true for loop optimization. It is very difficult (and often impossible ) to make the C++ compiler generate code that would intensively use registers rather than the memory. Well- projected loops in assembler are executed much faster and consist of fewer commands.

We will look at the for loop in the sort program. The operator

 for (int cnt = 0; cnt < tSize_i1; cnt++)

is translated into several assembly commands. The disassembled code of this statement has been modified to make it more comprehensible. For this purpose, we removed references to the physical addresses of the memory from the jump statements and put labels in appropriate places. The modified code appears as follows:

 mov     DWORD PTR [cnt], 0    jmp     L2  L1:    mov     EAX, DWORD PTR [cnt]    add     EAX, 1    mov     DWORD PTR [cnt], EAX  L2:    mov     EAX, DWORD PTR [cnt]    cmp     EAX, DWORD PTR [tSize_i1]    jge     <address>    < loop statements >    jmp     L1  . . .

One could not say this code is not optimum. If you had only the EAX, EDX, and ECX processor registers at your disposal, you would most likely use the same algorithm for the implementation of the for loop. Because of this restriction, you would have to store the loop variables in the memory and (like in this piece of code) extract them for each iteration.

If you use the assembler, implementation of the for statement with the standard algorithm and the ECX register will be the following:

 mov   ECX, DWORD PTR isize    ... loop  next

This is executed faster. As you see, using the assembler can solve many optimization tasks if you fully understand what should be optimized and how. This is true not only for C++ .NET, but also for other compilers.