A few years ago I was working on a project that was required to run on a 386 processor but typically ran on a 486, and it had a little squirrely problem. One of the in-house computer systems that I tested the application on ran the code extremely slowly. I spent quite a while on it, and while doing some benchmark testing to isolate the problem I found that the memory copy algorithm, which was used to blit graphical sprites onto the screen, was the culprit. Sprites could appear on screen with any kind of data alignment as they moved horizontally across the screen. Upon deeper investigation I found that this computer system was running DOS like all the others, but in this particular case it was running on an AMD 386SX processor. AMD usually makes pretty good processors, but I was intrigued, so I ordered and received their AM386 data book for that exact model of processor. Upon reading the book I found out, to my horror, that this processor had a little zinger. As it is a 32-bit processor with a 16-bit bus, if your source and destination pointers are not properly aligned, then a single 32-bit memory access incurs an additional eight-clock penalty for that misaligned access. And so we come to my next rule.
Hint | Write your assembly to be CPU model and manufacturer specific! |
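One way to follow this rule without scattering CPU checks throughout your code is to detect the processor once at startup and route all calls through a function pointer. The sketch below is illustrative only; the names (`InitMemCpy`, `FastMemCpy`, the `nIsAm386sx` flag standing in for real CPUID-style detection) are my own, not from the text.

```c
#include <string.h>

/* Hypothetical sketch: route memory copies through a function pointer
 * that is set once at startup, after the CPU manufacturer and model
 * have been detected.  All names here are illustrative. */

typedef void *(*MemCpyFn)(void *pDst, const void *pSrc, size_t nSize);

/* Generic path, shared by processors needing no special handling. */
static void *memcpyGeneric(void *pDst, const void *pSrc, size_t nSize)
{
    return memcpy(pDst, pSrc, nSize);
}

/* Model-specific path.  On the AM386SX this is where you would first
 * align the pointers to dodge the misaligned-access penalty; this
 * stub simply forwards so the sketch stays short. */
static void *memcpyAm386sx(void *pDst, const void *pSrc, size_t nSize)
{
    return memcpy(pDst, pSrc, nSize);
}

static MemCpyFn g_pMemCpy = memcpyGeneric;

/* Call once at startup.  nIsAm386sx stands in for real detection. */
void InitMemCpy(int nIsAm386sx)
{
    g_pMemCpy = nIsAm386sx ? memcpyAm386sx : memcpyGeneric;
}

/* All application code calls this; the dispatch cost is one
 * indirect call, paid instead of a per-call CPU check. */
void *FastMemCpy(void *pDst, const void *pSrc, size_t nSize)
{
    return g_pMemCpy(pDst, pSrc, nSize);
}
```

The payoff is that the special-case code is isolated in one place: adding support for another processor model means writing one more variant and one more line in the init routine.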
That little problem made it necessary to detect not only the exact manufacturer but also the model of processor, and to route function calls to special code for each. In most cases the code could be shared, but some isolated instances required the special code. The following is an older style of the C function memcpy().
        ; void *memcpy(void *pDst, const void *pSrc, uint nSize)
        ;
        ; Note: if nSize is set to 0 then no bytes will be copied!

                public  memcpy
        memcpy  proc    near
                push    ebp
                mov     ebp,esp
                push    esi
                push    edi
                mov     esi,[ebp+arg2]  ; pSrc
                mov     edi,[ebp+arg1]  ; pDst
                mov     ecx,[ebp+arg3]  ; nSize

                ; Insert one of the following code examples here!

                mov     eax,[ebp+arg1]  ; Return pointer (pDst)
                pop     edi
                pop     esi
                pop     ebp
                ret
        memcpy  endp
Warning | This loop is really inefficient code except when used on the old 8086 processors.
        $L0:    movsb
                loop    $L0 |
The following code is relatively small but pretty inefficient, as it uses the repeating string instruction to write a series of 8-bit bytes. On a Pentium the payoff only comes with a repeat count of 64 or more.
rep movsb
With a repeat factor of less than 64, use the following. Note that we do not need to write the DS: or the ES: prefixes, as the default segment for the ESI source register is DS, and the default for the EDI destination register is ES.
        $L1:    mov     al,[esi]        ; al,ds:[esi]
                mov     [edi],al        ; es:[edi],al
                inc     esi
                inc     edi
                dec     ecx
                jne     $L1
In the above example we actually incur a dependency penalty, as we set the AL register but have to wait before we can execute the next instruction. If we adjust the function as follows, we no longer have that problem. You will note that the "inc esi" line was moved up to separate the write of AL from the subsequent read of AL.
        $L1:    mov     al,ds:[esi]
                inc     esi             ; removes dependency penalty
                mov     es:[edi],al
                inc     edi
                dec     ecx
                jne     $L1
Another method that is a lot more efficient than those listed above uses the same techniques we learned for setting memory. We divide our total number of bytes by four to get the number of 4-byte blocks, loop on that count, and then handle the remaining bytes. We handle the dependency penalty at $L1 in the same way.
                mov     edx,ecx         ; Get # of bytes to copy
                shr     edx,2           ; n = n / 4
                jz      $L2             ; Jump if 1..3 bytes

        ; DWORDS (uint32)
        $L1:    mov     eax,[esi]       ; 1st OP  read 32 bits
                add     esi,4
                mov     [edi],eax       ; 2nd OP  write 32 bits
                add     edi,4
                dec     edx
                jne     $L1             ; Loop for DWORDS

        ; Remainders
        $L2:    and     ecx,00000011b   ; Mask remainder bits (0..3)
                jz      $L4             ; Jump if no remainders

        ; 1 to 3 bytes to copy
        $L3:    mov     al,[esi]
                inc     esi
                mov     [edi],al
                inc     edi
                dec     ecx
                jne     $L3             ; Loop for 1's
        $L4:
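For reference, the same block-plus-remainder strategy can be sketched in C. The function name is my own, and note that dereferencing a 32-bit pointer that may be misaligned is exactly the case the assembly discussion is about; on x86 it works, with the clock penalties described earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: copy 4-byte blocks, then the 0..3 remaining bytes.
 * Name is illustrative.  The 32-bit accesses may be misaligned,
 * which the x86 permits (at a clock cost on some models). */
void CopyDwords(void *pDst, const void *pSrc, size_t nSize)
{
    uint32_t *pd32 = (uint32_t *)pDst;
    const uint32_t *ps32 = (const uint32_t *)pSrc;
    size_t n = nSize >> 2;          /* number of 32-bit blocks */

    while (n--)
        *pd32++ = *ps32++;          /* 32 bits at a time */

    uint8_t *pd8 = (uint8_t *)pd32;
    const uint8_t *ps8 = (const uint8_t *)ps32;
    n = nSize & 3;                  /* 0..3 remainder bytes */
    while (n--)
        *pd8++ = *ps8++;            /* one byte at a time */
}
```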
This following method is significantly faster as it moves eight bytes at a time instead of four. There is no dependency penalty since the register being set is not being used immediately.
                mov     ecx,[ebp+arg3]  ; nSize
                shr     ecx,3           ; n = n / 8
                jz      $L2             ; Jump if 1..7 bytes

        ; QWORDS (uint64)
        $L1:    mov     eax,[esi]       ; 1st OP  read 32 bits
                mov     edx,[esi+4]     ;         read next 32 bits
                mov     [edi],eax       ; 2nd OP  write 32 bits
                mov     [edi+4],edx     ;         write next 32 bits
                add     esi,8
                add     edi,8
                dec     ecx
                jne     $L1             ; Loop for QWORDS

        ; Remainders
        $L2:    mov     ecx,[ebp+arg3]  ; nSize
                and     ecx,00000111b   ; Mask remainder bits (0..7)
                jz      $L4             ; Jump if no remainders

        ; 1 to 7 bytes to copy
        $L3:    mov     al,[esi]        ; read a byte
                inc     esi
                mov     [edi],al        ; write byte
                inc     edi
                dec     ecx
                jne     $L3             ; Loop for 1's
        $L4:
This code is just about as fast as a copy using MMX. To use MMX instead, replace the $L1 loop with the following code:
        $L1:    movq    mm7,[esi]       ; read 64 bits
                add     esi,8
                movq    [edi],mm7       ; write 64 bits
                add     edi,8
                dec     ecx
                jne     $L1             ; Loop for QWORDS
There are more sophisticated methods that you can employ, but this is a good start.
Memory alignment is important because a clock penalty occurs whenever the source and/or destination of a memory access is misaligned. A memory movement (copy) function should therefore try to reorient its source and destination pointers. You will not always be lucky enough that the source and destination are either both properly aligned or both misaligned by exactly the same amount. First test whether they share the same alignment relative to an 8-byte boundary:
        if ((pSrc AND 0000111b) == (pDst AND 0000111b))
If both masked values are 0, the pointers are already aligned and no adjustment is needed. If the masked values are equal but nonzero, copy single bytes until both pointers reach an aligned position. If the alignments differ, you can still obtain a speed increase by putting at least one of them into alignment (preferably the destination):
                mov     edx,edi         ; At least align destination!
                and     edx,0000111b    ; 0..7 misaligned bytes
                jz      $Mid            ; Jump if properly aligned

        ; Move the misaligned lead bytes
                add     edx,0fffffff8h  ; -8, so (8 - n) bytes are moved
        $lead:  mov     al,[esi]        ; read byte
                inc     esi
                mov     [edi],al        ; write byte
                inc     edi
                dec     ecx             ; reduce total to move
                inc     edx             ; increment to 0
                jne     $lead           ; loop for lead bytes
        $Mid:
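The lead-byte count computed above can be expressed in C as follows. This is a sketch; the function name is my own, and the extra clamp against the total size (which the assembly above leaves to the caller, since ECX is decremented in the loop) guards against buffers shorter than the distance to the boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: how many single bytes must be copied before the destination
 * pointer reaches an 8-byte boundary.  Mirrors the mask-and-count
 * logic of the assembly; name is illustrative. */
size_t LeadBytes(const void *pDst, size_t nSize)
{
    size_t nMis = (uintptr_t)pDst & 7;      /* 0..7 misaligned bytes */
    if (nMis == 0)
        return 0;                           /* already aligned */
    size_t nLead = 8 - nMis;                /* bytes to the boundary */
    return (nLead < nSize) ? nLead : nSize; /* never exceed the total */
}
```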
For the actual memory movement operation there are various techniques that can be used, each with its own benefit or drawback.
The best method is a preventative one. If the memory you are dealing with is for video images, then not only should (width mod 8) be zero, but the source and destination pointers should also be properly aligned. That way there are no clock penalties on each memory access and no extra, possibly futile, effort spent trying to align the pointers.
In 8-bit images, moving (blitting) sprite memory can be difficult, as sprites can land on any alignment as they move across the screen. In 32-bit images, where one pixel is 32 bits, alignment is a snap, as every pixel is properly aligned.
        #ifdef __cplusplus
        extern "C" void CopyBlit8x8Asm(byte *pDst, byte *pSrc,
                uint nStride, uint nWidth, uint nHeight);
        #endif

        // Comment this line out for 'C' code
        #define USE_ASM_COPYBLIT_8X8

        // 8-bit to 8-bit Copy Blit
        //
        // This function is pre-clipped to copy an 8-bit color
        // pixel from the buffer pointed to by the source
        // pointer to an identical sized destination buffer.

        #ifdef USE_ASM_COPYBLIT_8X8
        #define CopyBlit8x8 CopyBlit8x8Asm
        #else
        void CopyBlit8x8(byte *pDst, byte *pSrc, uint nStride,
                         uint nWidth, uint nHeight)
        {
            // If width is the stride then copy entire image
            if (nWidth == nStride) {
                memcpy(pDst, pSrc, nStride * nHeight);
            } else {
                // Copy image 1 scanline at a time.
                do {
                    memcpy(pDst, pSrc, nWidth);
                    pSrc += nStride;    // Source stride adjustment
                    pDst += nStride;    // Destination stride adj.
                } while (--nHeight);    // Loop for height
            }
        }
        #endif
As you probably noted, there is extra logic checking whether the width and the stride are the same. When they are, the entire image is contiguous in memory and can be moved with a single memcpy() instead of one call per scanline, which makes the code even more efficient.
Goal | Try to write the listed function in assembly optimized for your processor. Or multiple processors. |