To begin, we will consider how you can use assembly language to program loop calculations. In the most general form, a loop presents a sequence of commands starting with a label, and returning to this label after this sequence is performed. In assembly language, the loop usually looks like this:
label: <commands> jmp label
For example, consider the following sequence of commands:
mov EDX, 0 Label: <assembly code > inc EDX cmp EDX, 1000000 je OutLoop jmp Label OutLoop:
This is a simple loop. It increases the value of the EDX register from 0 to 1000000, and upon reaching this value, jumps to the OutLoop label.
This code fragment cannot be considered optimal, as it uses several commands for analyzing the loop exit condition, and for exiting the loop itself. You can improve the program code further if you analyze how the loop sequence of operators works. For a loop with a fixed number of iterations, the optimization solution in assembly language is simple (see Listing 1.1).
. . . mov EDX, 1000000 label: . . . <assembly code> . . . dec EDX jnz label . . .
As you can see from this listing, the loop counter is placed into the EDX register. This fragment is more efficient than the previous one. The decrementing command sets the ZF flag when the counter value turns into 0. In this case, you exit the loop; otherwise , the loop takes a new iteration. As you can see, this fragment contains a smaller number of operations, and therefore will be performed faster.
Finally, you can develop an even more efficient version of the decremented loop. In this case, you can use the assembly command that is written as cmovcc in the pseudocode, with the last two letters standing for one of the conditions ( eq , le , ge , etc.). The decremented loop shown in Listing 1.1 is easy to modify in the following way (Listing 1.2).
. . . lea EAX, L1 lea ECX, L2 mov EDX, 1000000 L1: . . . <assembly code> . . . dec EDX cmovz EAX, ECX jmp EAX L2: . . .
In this code fragment, the cmovz command performs two functions: it analyzes the ZF flag and sends the data to the EAX register if ZF is set. The cmovz command is followed by the jmp command that uses the EAX register value for jumping to the needed branch of the program.
The previous program code fragments demonstrate that the optimization of assembly loops largely depends on the particular task, so there is no universal technique. For example, suppose the program performs some conversion of 1-byte data stored in an array, and you need to increment the byte address in every iteration of the loop. Suppose the source address is contained in the ESI register, the destination address is stored in the EDI register, and the byte counter is placed in the ECX register. In this case, the loop algorithm for data processing can be implemented as follows (see Listing 1.3). The comments in the source code are separated with a semicolon (;).
. . . ; Places the source address to ESI mov ESI, src ; Places the destination address to EDI mov EDI, dst ; Places the byte counter to ECX mov ECX, count ; Adds the ESI address to ECX ; to get the loop exit condition add ECX, ESI label: mov AL, [ESI] inc ESI ; Processes the byte <byte processing commands> ; Writes the result to the destination mov [EDI], AL inc EDI ; Checks the loop exit condition cmp ECX, ESI ; If the condition is not satisfied, repeats the loop jne label . . .
It should be noted that even a well-optimized loop might not appear as fast as the developer might expect. To further improve efficiency, you can use the method of unrolling the loop. This term actually means that you need to reduce the number of iterations by performing more operations during one loop. This method lets you achieve good results. Now, we are going to consider two code fragments, which use unrolling the loops.
For the initial (non-optimized) code fragment, we will take the code that copies the double-word data from one memory buffer to another. In Listing 1.4, you can see the source code of the fragment.
. . . ; Places the source and destination addresses ; to the ESI and EDI registers mov ESI, src mov EDI, dst ; Places the byte counter value to ECX mov ECX, count ; Counts double words, ; so the ECX value is divided by 4 shr ECX, 2 label: mov EAX, [ESI] add ESI, 4 mov [EDI], EAX add EDI, 4 dec ECX jnz label . . .
To unroll the loop, we will copy two double words simultaneously . In Listing 1.5, you can see the source code of the optimized fragment (with the changes shown in bold).
. . . mov ESI, src mov EDI, dst ; Places the counter value to the ECX register mov ECX, count ; Divides the counter value by 8 ; (as we are using double words) shr ECX, 3 label: ; Reads the first double word to the EAX register mov EAX, [ESI] ; Reads the second double word to the EBX register mov EBX, [ESI + 4] ; Writes the first double word to the EDI register mov [EDI], EAX ; Writes the second double word ; to the EDI+4 address mov [EDI + 4], EBX ; Shifts the source and destination addresses ; to point to the next double word add ESI, 8 add EDI, 8 ; Analyzes loop exit condition dec ECX jnz label . . .
This technique allows you to halve the delay brought about by the loop. You can continue unrolling the loop further if you process four double words instead of two.
Here is one more example of unrolling the loops. Suppose you have an array of 20 integers. You task is to assign 0 to the elements with even numbers (0, 2, 4, etc.), and 1 to those with odd numbers .
The straightforward solution is shown in Listing 1.6.
. . . ; Places the array size to the ECX register mov ECX, 20 ; Places the first element address to the ESI register lea ESI, il ; Places the 2 divisor to the EBX register ; for finding out if the element is even or odd mov EBX, 2 next: ; Placing the element counter to EAX register mov EAX, ECX ; Finds out if the array element number is even or odd div EBX cmp EDX, 0 ; If it is odd, sets the element to 1 jne set_1 ; If it is even, sets the element to 0 mov DWORD PTR [ESI], 0 jmp L1 set 1: mov DWORD PTR [ESI], 1 L1: ; Moves to the address of the next element in array add ESI, 4 loop next . . .
You can optimize this code fragment by processing two elements in one iteration. Listing 1.7 shows the source code of the modified version (the assembly commands that are left out are shown in italics).
. . . mov EDX, 0 ; mov ECX, 20 lea ESI, i1 ; mov EBX, 2 next: ; mov EAX, ECX ; div EBX ; cmp EDX, 0 mov DWORD PTR [ESI], 0 mov DWORD PTR [ESI+4], 1 add EDX, 2 ; jmp L1 ; set_1: ; mov DWORD PTR [ESI], 1 ; L1: cmp EDX, 19 jae ex add ESI, 8 jmp next ex: . . .
The source code of this fragment is essentially different from the code in Listing 1.6. We discarded the division commands, and at the same time, halved the number of iterations (see the add EDX, 2 command that is shown in bold). Every iteration now handles two array elements simultaneously (see the mov DWORD PTR [ESI], 0 and mov DWORD PTR [ESI+4], 1 commands that are shown in bold). In the end of each iteration, the value of the ESI register is incremented by 8 by the add ESI, 8 command, so that it points to the next pair of elements.
Now, we will use C++ .NET 2003 to develop a simple console application that implements the optimized algorithm. This application calls the initarr function (the one processing the array) from a separate assembly module.
The interface between the assembly functions and the C++ program will not be analyzed here, since these issues are covered in the following chapters. So now we will focus on our main task and consider the technique for optimizing the assembly code.
In Listing 1.8, you can see the source code of the optimized assembly module containing the initarr function. This module is adapted for compiling in the macroassembler environment MASM 6.14.
.686 .model flat public _initarr .code _initarr PROC ; Function prologue push EBP mov EBP, ESP ; Places array size to the ECX register, ; and the first element address to the ESI register mov ECX, DWORD PTR [EBP+12] mov ESI, DWORD PTR [EBP+8] mov EDX, 0 dec ECX next: mov DWORD PTR [ESI], 0 mov DWORD PTR [ESI+4], 1 add EDX, 2 cmp EDX, ECX jae ex add ESI, 8 jmp next ex: ; Function epilogue pop EBP ret _initarr ENDP end
The source code of the console application created in C++ .NET is shown in Listing 1.9.
// UNROLL_LOOP_OPT.cpp : Defines the entry point for the console // application #include "stdafx.h" extern "C" void initarr(int* pi1, int isize); int _tmain(int argc, _TCHAR* argv[]) { int i1[20]; printf ("UNROLLING LOOP DEMO\n\n"); printf("i1[20] = "); initarr(i1, 20); for (int cnt = 0; cnt < 20; cnt++) printf("%d ", i1[cnt]); getchar(); return 0; }
The application window (see Fig. 1.1) displays the contents of the array after initialization.
We have covered the most general methods for optimizing loop calculations in assembly language. Although there are many other techniques as well, they are processor-specific. A detailed analysis of loop optimizing methods for particular processor types (Pentium Pro, Pentium II, Pentium III, and Pentium IV) is beyond the scope of this book.