This chapter will focus on practical issues related to the use of the inline assembler in C++ .NET applications. While the previous chapter described basic principles of using the inline assembler, this one will present practical examples of using this tool to implement various tasks. The inline assembler is used most frequently in computational tasks , multimedia applications, and for text processing. Certain assembler features were demonstrated in Chapter 2 , and here we will discuss this topic in more detail.
To demonstrate specific technologies, we will use practical examples. All theoretical aspects will be considered in the context of these examples, and all necessary explanations will be given during analysis of them.
The first example is computing the sum of the elements of a floating-point array. In addition to demonstrating the work of the mathematical coprocessor, we will show how to use a pointer to return a result. Develop a C++ .NET dialog-based application.
On the application s main form, place two edit controls, two static text controls, and one button. The entire floating-point array will be output to the Edit1 edit control, and the Edit2 control will display the result of computation. Associate the s_Array and f_Summa variables with the Edit1 and Edit2 controls, respectively. Assign the s_Array variable the CString string type, and assign the f_Summa variable the float type.
Computing the sum of the floating-point array elements will be done with the sumReals function. This function takes two parameters: the address and size of the array. Add the function to the dialog box class and comment out the return 0 statement ( marked in bold). The source code of this function and the OnBnClicked handler is shown in Listing 10.1.
float* CSummaofRealsDlg: rsumReals (float* f array, int If) { float fsum; _asm { mov ESI, farray mov ECX, lf dec ECX finit fldz fld [ESI] next: add ESI, 4 fadd [ESI] loop next fstp fsum fwait lea EAX, fsum }; // Return 0; } void CSummaofRealsDlg::OnBnClickedButton1 () { // TODO: Add your control notification handler code here float farray[] = {9.34, 15.05, 4.32, 173.12, 88.45}; int fsize = sizeof (farray) /4; CString stmp; UpdateData(TRUE); stmp. Empty; for (int cnt = 0; cnt < fsize; cnt++) { stmp. Format ("%.2f ", farray [cnt]); s_Array = s_Array + " " + stmp; }; f_Summa = *sumReals (farray, fsize); UpdateData(FALSE); }
As noted earlier, it is not necessary to save the ESI register when calling a function. This is why the sumReals function loads the address of the farray array to the ESI register and the size of the array to the ESI register at the beginning.
mov ESI, farray mov ECX, lf
The address of the farray array is a pointer to its first element. The contents of the ESI register is used for organizing loop computation. The value of the first element of the array is pushed to the stack with the following coprocessor command:
fld [ESI]
In each iteration of the loop, the pointer is moved to the next element of the array, and the value of the element is added to the top of the stack:
next: add ESI, 4 fadd [ESI] loop next
The sum is stored in the fsum local variable:
fstp fsum
The last command of the assembly function is:
lea EAX, fsum
It loads the address of the fsum variable to the EAX register. The function returns this address to the main program.
The OnBnClicked handler uses string variables for conversion of numeric values to text. We will address strings in later examples. Now look at the following statement:
f_Summa = *sunnReals (farray, fsize);
Here, f_Summa is a variable of the float type, and the sumReals function returns an address. To get the value by the address, the dereference operator ( "*" ) is used before the address expression, which is the function name in our case.
The first parameter passed to the sumReals function is the address of the array. It can be expressed in an alternative form. As you know, the address of an array points to the first element, so the statement being discussed can be written in the other form:
f_Summa = *sumReals(&farray[0], fsize);
where the address operator ( " & " ) is used to present the first parameter in a correct form.
The window of the application is shown in Fig. 10.1.
Subsequent examples will illustrate the implementation of simple algorithms for sorting and searching for the maximum element. Using these examples, we will estimate the effectiveness of the inline assembler for program optimization. We will look at an algorithm for searching for the maximum element in an integer array first. Use the Visual C++ .NET Application Wizard to create a dialog-based application frame.
On the main form, place two edit controls, two static text controls, and a button. The search for the maximum element of an integer array will be done with the findMax function. The function takes the address and size of the array as parameters. The result of the function is the value of the maximum element of the array. The code of the findMax function and the OnBnClicked handler is shown in Listing 10.2.
int CFIND_MAX_IN_ARRAY_OF_INTSDlg: :findMax(int* pi1, int si1) { int maxVal; _asm { mov EDI, pi1 mov ECX, si1 mov EDX, [EDI] mov maxVal, EDX next: add EDI, 4 mov EAX, [EDI] cmp EAX, maxVal ji no_change push EAX pop maxVal no change: loop next ex: mov EAX, maxVal }; // Return 0; } void CFIND_MAX_IN_ARRAY_OF_INTSDlg::OnBnClickedButton1() { // TODO: Add your control notification handler code here CString s1; int i1[] = {2, 33, -19, -7, 32, -90, 13}; int si1 = sizeof(i1)/4; s1.Empty; for (int cnt = 0; cnt < si1; cnt++) { s1.Format("%d", i1[cnt]); s_Array = s_Array + " " + s1; }; i_Max = findMax(i1, si1); UpdateData(FALSE); }
Now, we will discuss the algorithm of the findMax function. The first element of the array is presumed maximum compared to the other elements in a loop. If one of the next elements is greater than the current maximum, it is considered maximum. The loop is repeated until the end of the array is reached. The commands
mov EDI, pi1 mov ECX, si1
load the pointer to the array and its size to the EDI and ECX registers, respectively. The comparison between the current maximum and the array element and the choice of a new value as the maximum is done with the following group of commands:
... next: add EDI, 4 mov EAX, [EDI] cmp EAX, maxVal jl no_change push EAX pop maxVal ...
As always, the function returns the maximum value in the EAX register:
mov EAX, maxVal
The OnBnClicked handler does not do anything out of the way. The returned maximum value is assigned to the i_Max variable, which corresponds to the Edit2 control, and is displayed.
The window of the application is shown in Fig. 10.2.
It would be interesting to compare two maximum search applications, one using the assembly function, and the other written in "pure" C++.
We already have the assembly variant, so now we will develop a similar program in C++. The code of the OnBnClicked handler is shown in Listing 10.3.
The piece of code that searches for the maximum is in bold.
... void CFindMaxIntwithCDlg: :OnBnClickedButton1 () { // TODO: Add your control notification handler code here CString s1; int i1[] = {2, 33, 19, 7 , 32, 90, 13}; int si1 = sizeof (i1) /4; // Display the array s1.Empty; for (int cnt = 0; cnt < si1; cnt++) { s1. Format ("%d", i1 [cnt]); s_Array = s_Array + " " + s1; }; // Look for the maximum element in an integer array i_Max = i1 [0] for (int cnt = 1; cnt < si1; cnt++) { if (i_Max >= i1[cnt]) continue ; else i_Max = i1 [cnt]; } UpdateData(FALSE); };
The code in this fragment does not need any explanation. Look at the listing of the disassembled C++ version of the program, more precisely at its fragment that corresponds to the for loop that computes i_Max (Listing 10.4).
i Max = i1[0]; 0041380C mov eax, dword ptr [this] 0041380F mov ecx, dword ptr [i1] 00413812 mov dword ptr [eax+7Ch], ecx for (int cnt = 1;cnt < si1; cnt++) 00413815 mov dword ptr [cnt], 1 0041381C jmp CFindMaxIntwithCDlg::OnBnClickedButton1+167h (413827h) 0041381E mov eax, dword ptr [cnt] 00413821 add eax, 1 00413824 mov dword ptr [cnt], eax 00413827 mov eax, dword ptr [cnt] 0041382A cmp eax, dword ptr [si1] 0041382D jge CFindMaxIntwithCDlg::OnBnClickedButton1+18Fh (41384Fh) { if (i Max >= i1[cnt]) continue ; 0041382F mov eax, dword ptr [this] 00413832 mov ecx, dword ptr [cnt] 00413835 mov edx, dword ptr [eax+7Ch] 00413838 cmp edx, dword ptr i1[ecx*4] 0041383C jl CFindMaxIntwithCDlg::OnBnClickedButton1+180h (413840h) 0041383E jmp CFindMaxIntwithCDlg::OnBnClickedButton1+15Eh (41381Eh) else i Max = i1[cnt]; 00413840 mov eax, dword ptr [this] 00413843 mov ecx, dword ptr [cnt] 00413846 mov edx, dword ptr i1[ecx*4] 0041384A mov dword ptr [eax+7Ch], edx } 0041384D jmp CFindMaxIntwithCDlg::OnBnClickedButton1+15Eh (41381Eh)
Compare the disassembled code to the findMax function in the assembly version. A glance is enough to understand the redundancy of the C++ version. Before analyzing the disassembled code, we will consider another example ”sorting the integer array in descending order.
This is a dialog-based application. On its main form, place two edit controls, two static text controls, and a button. The original (unsorted) array will be output to one of the edit controls (that corresponds to the s_Src variable), and the other edit control (corresponding to the s_Dst variable) will display the same array sorted in descending order. The array sort is done with the sortMax function that takes the address and size of the array as parameters. Since code of the control notification handler is similar to the code in the previous examples, we will not focus on it. The code of the sortMax function and the handler that calls this function is shown in Listing 10.5.
void CSORT_ARRAY_BY_MAXIMUMDlg: :sortMax(int* pi1, int si1) { int isize = sil; _asm { push EBX mov EDI, DWORD PTR pi1 mov EBX, EDI big_loop: mov ECX, DWORD PTR isize mov EAX, [EDI] next: mov EAX, [EDI] cmp EAX, [EDI+4] jl change jmp cont change: xchg EAX, [EDI+4] mov [EDI], EAX cont: add EDI, 4 loop next dec isize cmp isize, 0 je ex mov EDI, EBX jmp big_loop ex: pop EBX }; } void CSORT_ARRAY_BY_MAXIMUMDlg: :OnBnClickedButton1() { // TODO: Add your control notification handler code here CString s1; int i1[]={17, 9, 31, 7, 4, 76, 47, 59}; int si1=sizeof (i1)/4; s1.Empty; for (int cnt=0; cnt < si1; cnt++) { s1.Format("%d, i1[cnt]); s_Src=s_Src+" "+s1; }; sortMax(i1, si1); s1.Empty; for (int cnt=0; cnt < si1; cnt++) { s1.Format("%d, i1[cnt]); s_Dst=s_Dst+" "+s1; }; UpdateData(FALSE); }
The window of this application is shown in Fig. 10.3.
Now, we will implement the same task of sorting an array with a program written in pure C++ and examine the disassembled code, more precisely its fragment that performs sorting. The pieces of the C++ code that sort an array and display the result are shown in Listing 10.6. The fragment, in which we are interested, is in bold.
void CSortbyDecreasewithCNETDlg: :OnBnClickedButton1 () { // TODO: Add your control notification handler code here CString s1; int i1[]={17, 9, 31, 7, 4, 76, 47, 59}; int itmp; int size_i1=sizeof (i1) /4; s_S rc. Empty; s_Dst. Empty; UpdateData(FALSE); s1. Empty; // Display the original array in the edit control for (int cnt=0; cnt < size_i1; cnt ++) { s1. Format ("%d, i1 [cnt]); s_Src=s_Src+" "+s1; }; // Sort the array int tSize_i1=size_i1; while (tSize_i1 != 0) { for (int cnt=0; cnt < tSize_i1; cnt++) { if (i1[cnt] >= i1[cnt+1]) continue; else { itmp=i1 [cnt] ; i1[cnt] = i1[cnt+1]=itmp; }; }; tSize_i1 ; }; // Display the sorted array in the edit control for (int cnt=0; cnt < size_i1;cnt ++) { s1. Format ("%d, i1 [cnt]) ; s_Dst=s_Dst+" "+s1; } ; UpdateData(FALSE) ; }
We will not give a detailed analysis of the code fragment in bold because it is simple for C++ programmers. The disassembled code of this fragment is shown in Listing 10.7.
int tSize_i1 = size_i1; 0041382D mov eax, dword ptr [size_i1] 00413830 mov dword ptr [tSize_i1], eax while (tSize_i1 != 0) 00413833 cmp dword ptr [tSize_i1],0 00413837 je CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1E2h (4138B2h) { for (int cnt = 0; cnt < tSize_i1; cnt++) 00413839 mov dword ptr [cnt],0 00413843 jmp CSortbyDecreasewithCNETDlg::OnBnClickedButton1+184h (413854h) 00413845 mov eax, dword ptr [cnt] 0041384B add eax, 1 0041384E mov dword ptr [cnt], eax 00413854 mov eax, dword ptr [cnt] 0041385A cmp eax, dword ptr [tSize_i1] 0041385D jge CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1D7h (4138A7h) { if (i1[cnt] >= i1[cnt+1]) continue; 0041385F mov eax, dword ptr [cnt] 00413865 mov ecx, dword ptr [cnt] 0041386B mov edx, dword ptr i1[eax*4] 0041386F cmp edx, dword ptr [ebp+ecx*444h] 00413873 jl CSortbyDecreasewithCNETDlg::OnBnClickedButton1+1A7h (413877h) 00413875 jmp CSortbyDecreasewithCNETDlg::OnBnClickedButton1+175h (413845h) else { itmp = i1 [cnt]; 00413877 mov eax, dword ptr [cnt] 0041387D mov ecx, dword ptr i1[eax*4] 00413881 mov dword ptr [itmp], ecx i1[cnt] = i1[cnt+1]; 00413884 mov eax, dword ptr [cnt] 0041388A mov ecx, dword ptr [cnt] 00413890 mov edx, dword ptr [ebp+ecx*444h] 00413894 mov dword ptr i1[eax*4], edx i1[cnt+1] = itmp; 00413898 mov eax, dword ptr [cnt] 0041389E mov ecx, dword ptr [itmp] 004138A1 mov dword ptr [ebp+eax*444h] , ecx }; }; 004138A5 jmp CSortbyDecreasewithCNETDlg::OnBnClickedButton1+175h (413845h) tSize_i1--; 004138A7 mov eax, dword ptr [tSize_i1] 004138AA sub eax, 1 004138AD mov dword ptr [tSize_i1] , eax }; 004138B0 jmp CSortbyDecreasewithCNETDlg::OnBnClickedButton1+163h (413833h)
These two disassembled fragments have common features. You might have noticed that the C++ versions of the programs intensively use the main memory for variables, while the processor registers are used much more seldom. This is not bad; however, it makes the code redundant and slows down the application in general. Such a slow-down is insignificant with little computation on small amounts of data. However, it becomes noticeable with large array sizes.
For example, swapping two elements in an array is implemented in C++ as follows :
itmp = i1 [cnt]; i1[cnt] = i1[cnt+1]; i1[cnt+1] = itmp;
The equivalent assembly code is:
mov eax, dword ptr [cnt] mov ecx, dword ptr i1[eax*4] mov dword ptr [itmp], ecx mov eax, dword ptr [cnt] mov ecx, dword ptr [cnt] mov edx, dword ptr [ebp+ecx*444h] mov dword ptr i1[eax*4], edx mov eax, dword ptr [cnt] mov ecx, dword ptr [itmp] mov dword ptr [ebp+eax*444h], ecx
By using these assembly commands
mov EAX, [EDI] xchg EAX, [EDI+4] mov [EDI], EAX
you can increase the performance of the maximum search program. Here, the values of variables are stored in registers, and computation is very fast because the data exchange through the system bus is significantly decreased.
The same is true for loop optimization. It is very difficult (and often impossible ) to make the C++ compiler generate code that would intensively use registers rather than the memory. Well- projected loops in assembler are executed much faster and consist of fewer commands.
We will look at the for loop in the sort program. The operator
for (int cnt = 0; cnt < tSize_i1; cnt++)
is translated into several assembly commands. The disassembled code of this statement has been modified to make it more comprehensible. For this purpose, we removed references to the physical addresses of the memory from the jump statements and put labels in appropriate places. The modified code appears as follows:
mov DWORD PTR [cnt], 0 jmp L2 L1: mov EAX, DWORD PTR [cnt] add EAX, 1 mov DWORD PTR [cnt], EAX L2: mov EAX, DWORD PTR [cnt] cmp EAX, DWORD PTR [tSize_i1] jge <address> < loop statements > jmp L1 . . .
One could not say this code is not optimum. If you had only the EAX, EDX, and ECX processor registers at your disposal, you would most likely use the same algorithm for the implementation of the for loop. Because of this restriction, you would have to store the loop variables in the memory and (like in this piece of code) extract them for each iteration.
If you use the assembler, implementation of the for statement with the standard algorithm and the ECX register will be the following:
mov ECX, DWORD PTR isize ... loop next
This is executed faster. As you see, using the assembler can solve many optimization tasks if you fully understand what should be optimized and how. This is true not only for C++ .NET, but also for other compilers.