Optimizing Loop Calculations | Visual C++ Optimization with Assembly Code

Decrementing the Loop Counter

To begin, we will consider how you can use assembly language to program loop calculations. In the most general form, a loop presents a sequence of commands starting with a label, and returning to this label after this sequence is performed. In assembly language, the loop usually looks like this:

 label:    <commands>    jmp label

For example, consider the following sequence of commands:

 mov EDX, 0  Label:        <assembly code >        inc EDX        cmp EDX, 1000000        je OutLoop        jmp Label  OutLoop:

This is a simple loop. It increases the value of the EDX register from 0 to 1000000, and upon reaching this value, jumps to the OutLoop label.

This code fragment cannot be considered optimal, as it uses several commands for analyzing the loop exit condition, and for exiting the loop itself. You can improve the program code further if you analyze how the loop sequence of operators works. For a loop with a fixed number of iterations, the optimization solution in assembly language is simple (see Listing 1.1).

Listing 1.1: An assembly loop with a decremented counter

 . . .    mov EDX, 1000000  label:    . . .    <assembly code>    . . .    dec EDX    jnz label    . . .

As you can see from this listing, the loop counter is placed into the EDX register. This fragment is more efficient than the previous one. The decrementing command sets the ZF flag when the counter value turns into 0. In this case, you exit the loop; otherwise , the loop takes a new iteration. As you can see, this fragment contains a smaller number of operations, and therefore will be performed faster.

Finally, you can develop an even more efficient version of the decremented loop. In this case, you can use the assembly command that is written as cmovcc in the pseudocode, with the last two letters standing for one of the conditions ( eq , le , ge , etc.). The decremented loop shown in Listing 1.1 is easy to modify in the following way (Listing 1.2).

Listing 1.2: An assembly loop with the cmovz command

 . . .    lea    EAX,  L1     lea    ECX,   L2     mov    EDX, 1000000  L1:     . . .     <assembly code>     . . .     dec    EDX     cmovz  EAX, ECX     jmp    EAX  L2:     . . .

In this code fragment, the cmovz command performs two functions: it analyzes the ZF flag and sends the data to the EAX register if ZF is set. The cmovz command is followed by the jmp command that uses the EAX register value for jumping to the needed branch of the program.

Unrolling the Loop

The previous program code fragments demonstrate that the optimization of assembly loops largely depends on the particular task, so there is no universal technique. For example, suppose the program performs some conversion of 1-byte data stored in an array, and you need to increment the byte address in every iteration of the loop. Suppose the source address is contained in the ESI register, the destination address is stored in the EDI register, and the byte counter is placed in the ECX register. In this case, the loop algorithm for data processing can be implemented as follows (see Listing 1.3). The comments in the source code are separated with a semicolon (;).

Listing 1.3: Optimizing the byte processing loop

 . . . ; Places the source address to ESI            mov ESI, src  ; Places the destination address to EDI            mov EDI, dst  ; Places the byte counter to ECX            mov ECX, count  ; Adds the ESI address to ECX  ; to get the loop exit condition            add ECX, ESI  label:            mov AL, [ESI]            inc ESI  ; Processes the byte     <byte processing commands>  ; Writes the result to the destination            mov [EDI], AL            inc EDI  ; Checks the loop exit condition            cmp ECX, ESI  ; If the condition is not satisfied, repeats the loop            jne label   . . .

It should be noted that even a well-optimized loop might not appear as fast as the developer might expect. To further improve efficiency, you can use the method of unrolling the loop. This term actually means that you need to reduce the number of iterations by performing more operations during one loop. This method lets you achieve good results. Now, we are going to consider two code fragments, which use unrolling the loops.

For the initial (non-optimized) code fragment, we will take the code that copies the double-word data from one memory buffer to another. In Listing 1.4, you can see the source code of the fragment.

Listing 1.4: Copying double words (before optimization)

 . . . ; Places the source and destination addresses  ; to the ESI and EDI registers       mov ESI, src       mov EDI, dst  ; Places the byte counter value to ECX       mov ECX, count  ; Counts double words,  ; so the ECX value is divided by 4       shr ECX, 2  label:       mov EAX, [ESI]       add ESI, 4       mov [EDI], EAX       add EDI, 4       dec ECX       jnz label       . . .

To unroll the loop, we will copy two double words simultaneously . In Listing 1.5, you can see the source code of the optimized fragment (with the changes shown in bold).

Listing 1.5: Unrolling the loop by copying two double words instead of one

 . . .   mov ESI, src    mov EDI, dst  ; Places the counter value to the ECX register    mov ECX, count  ; Divides the counter value by 8  ; (as we are using double words)  shr ECX, 3  label:  ; Reads the first double word to the EAX register    mov EAX, [ESI]  ; Reads the second double word to the EBX register  mov EBX, [ESI + 4]  ; Writes the first double word to the EDI register    mov [EDI], EAX  ; Writes the second double word  ; to the EDI+4 address  mov [EDI + 4], EBX  ; Shifts the source and destination addresses  ; to point to the next double word  add ESI, 8   add EDI, 8  ; Analyzes loop exit condition    dec ECX    jnz label   . . .

This technique allows you to halve the delay brought about by the loop. You can continue unrolling the loop further if you process four double words instead of two.

Here is one more example of unrolling the loops. Suppose you have an array of 20 integers. You task is to assign 0 to the elements with even numbers (0, 2, 4, etc.), and 1 to those with odd numbers .

The straightforward solution is shown in Listing 1.6.

Listing 1.6: Initializing the array (before optimization)

 . . .  ; Places the array size to the ECX register    mov  ECX, 20  ; Places the first element address to the ESI register    lea  ESI, il  ; Places the 2 divisor to the EBX register  ; for finding out if the element is even or odd    mov  EBX, 2  next:  ; Placing the element counter to EAX register    mov  EAX, ECX  ; Finds out if the array element number is even or odd    div  EBX    cmp  EDX, 0  ; If it is odd, sets the element to 1    jne  set_1  ; If it is even, sets the element to 0    mov  DWORD PTR [ESI], 0    jmp L1  set 1:    mov DWORD PTR [ESI], 1  L1:  ; Moves to the address of the next element in array    add  ESI, 4    loop next    . . .

You can optimize this code fragment by processing two elements in one iteration. Listing 1.7 shows the source code of the modified version (the assembly commands that are left out are shown in italics).

Listing 1.7: Initializing the array (optimized)

 . . .     mov EDX, 0  ; mov ECX, 20  lea ESI, i1  ; mov EBX, 2  next:  ; mov EAX, ECX   ; div EBX   ; cmp EDX, 0   mov DWORD PTR [ESI], 0   mov DWORD PTR [ESI+4], 1   add EDX, 2   ; jmp L1   ; set_1:   ; mov DWORD PTR [ESI], 1   ; L1:  cmp EDX, 19      jae ex  add ESI, 8  jmp next  ex:      . . .

The source code of this fragment is essentially different from the code in Listing 1.6. We discarded the division commands, and at the same time, halved the number of iterations (see the add EDX, 2 command that is shown in bold). Every iteration now handles two array elements simultaneously (see the mov DWORD PTR [ESI], 0 and mov DWORD PTR [ESI+4], 1 commands that are shown in bold). In the end of each iteration, the value of the ESI register is incremented by 8 by the add ESI, 8 command, so that it points to the next pair of elements.

Now, we will use C++ .NET 2003 to develop a simple console application that implements the optimized algorithm. This application calls the initarr function (the one processing the array) from a separate assembly module.

The interface between the assembly functions and the C++ program will not be analyzed here, since these issues are covered in the following chapters. So now we will focus on our main task and consider the technique for optimizing the assembly code.

In Listing 1.8, you can see the source code of the optimized assembly module containing the initarr function. This module is adapted for compiling in the macroassembler environment MASM 6.14.

Listing 1.8: Array initialization (version for MASM 6.14)

 .686  .model flat    public _initarr  .code   _initarr PROC      ; Function prologue      push EBP      mov  EBP, ESP      ; Places array size to the ECX register,      ; and the first element address to the ESI register      mov  ECX, DWORD PTR [EBP+12]      mov  ESI, DWORD PTR [EBP+8]      mov  EDX, 0      dec  ECX  next:      mov  DWORD PTR [ESI], 0      mov  DWORD PTR [ESI+4], 1      add  EDX, 2      cmp  EDX, ECX      jae  ex      add  ESI, 8      jmp  next  ex:      ; Function epilogue      pop  EBP      ret  _initarr ENDP  end

The source code of the console application created in C++ .NET is shown in Listing 1.9.

Listing 1.9: Initializing and displaying an array of integers

 // UNROLL_LOOP_OPT.cpp : Defines the entry point for the console  // application  #include "stdafx.h"  extern "C" void initarr(int* pi1, int isize);  int _tmain(int argc, _TCHAR* argv[])  {   int i1[20];   printf ("UNROLLING LOOP DEMO\n\n");   printf("i1[20] = ");   initarr(i1, 20);   for (int cnt = 0; cnt < 20; cnt++)      printf("%d ", i1[cnt]);   getchar();   return 0;  }

The application window (see Fig. 1.1) displays the contents of the array after initialization.

Fig. 1.1: Application window displays the outcome of the assembly function with the optimized loop

We have covered the most general methods for optimizing loop calculations in assembly language. Although there are many other techniques as well, they are processor-specific. A detailed analysis of loop optimizing methods for particular processor types (Pentium Pro, Pentium II, Pentium III, and Pentium IV) is beyond the scope of this book.