A few years ago I was working on a project that was required to run on a 386 processor but typically ran on a 486, and it had a little squirrely problem. One of the in-house computer systems that I tested the application on ran the code extremely slowly. I spent quite a while on it, and while doing some benchmark testing to isolate the problem I found that the memory copy algorithm, which was used to blit graphical sprites onto the screen, was the culprit. Sprites could appear on screen with any kind of data alignment as they moved horizontally across the screen. Upon deeper investigation I found that this computer system was running DOS like all the others, but in this particular case it was running on an AMD 386SX processor. AMD usually makes pretty good processors, but I was intrigued, so I ordered and received their AM386 data book for that exact model of processor. Upon reading the book I found out, to my horror, that this processor had a little zinger. As it is a 32-bit processor with a 16-bit bus, if your source and destination pointers are not properly aligned, then a single 32-bit memory access incurs an additional eight-clock penalty for that misaligned access. And so we come to my next rule.
Hint | Write your assembly to be CPU model and manufacturer specific! |
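One way to follow this rule without scattering CPU checks throughout your code is to detect the processor once at startup and route all calls through a function pointer. The sketch below is illustrative only; the names (`InitMemCpy`, `FastMemCpy`, the `nIsAm386sx` flag standing in for real CPUID-style detection) are my own, not from the text.

```c
#include <string.h>

/* Hypothetical sketch: route memory copies through a function pointer
 * that is set once at startup, after the CPU manufacturer and model
 * have been detected.  All names here are illustrative. */

typedef void *(*MemCpyFn)(void *pDst, const void *pSrc, size_t nSize);

/* Generic path, shared by processors needing no special handling. */
static void *memcpyGeneric(void *pDst, const void *pSrc, size_t nSize)
{
    return memcpy(pDst, pSrc, nSize);
}

/* Model-specific path.  On the AM386SX this is where you would first
 * align the pointers to dodge the misaligned-access penalty; this
 * stub simply forwards so the sketch stays short. */
static void *memcpyAm386sx(void *pDst, const void *pSrc, size_t nSize)
{
    return memcpy(pDst, pSrc, nSize);
}

static MemCpyFn g_pMemCpy = memcpyGeneric;

/* Call once at startup.  nIsAm386sx stands in for real detection. */
void InitMemCpy(int nIsAm386sx)
{
    g_pMemCpy = nIsAm386sx ? memcpyAm386sx : memcpyGeneric;
}

/* All application code calls this; the dispatch cost is one
 * indirect call, paid instead of a per-call CPU check. */
void *FastMemCpy(void *pDst, const void *pSrc, size_t nSize)
{
    return g_pMemCpy(pDst, pSrc, nSize);
}
```

The payoff is that the special-case code is isolated in one place: adding support for another processor model means writing one more variant and one more line in the init routine.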
That little problem made it necessary to detect not only the exact manufacturer but also the model of processor, and to route function calls to special code for each. In most cases the code could be shared, but some isolated instances required the special code. The following is an older style of the C function memcpy().
        ; void *memcpy(void *pDst, const void *pSrc, uint nSize)
        ;
        ; Note: if nSize is set to 0 then no bytes will be copied!

                public  memcpy
        memcpy  proc    near
                push    ebp
                mov     ebp,esp
                push    esi
                push    edi
                mov     esi,[ebp+arg2]  ; pSrc
                mov     edi,[ebp+arg1]  ; pDst
                mov     ecx,[ebp+arg3]  ; nSize

                ; Insert one of the following code examples here!

                mov     eax,[ebp+arg1]  ; Return pointer (pDst)
                pop     edi
                pop     esi
                pop     ebp
                ret
        memcpy  endp
Warning | This loop is really inefficient code except when used on the old 8086 processors.
        $L0:    movsb
                loop    $L0 |
The following code is relatively small but pretty inefficient, as it uses the repeating string instruction to write a series of 8-bit bytes. On a Pentium the payoff only comes with a repeat count of 64 or more.
rep movsb
With a repeat factor of less than 64, use the following. Note that we do not need to write the DS: or the ES: prefixes, as the default segment for the ESI source register is DS, and the default for the EDI destination register is ES.
        $L1:    mov     al,[esi]        ; al,ds:[esi]
                mov     [edi],al        ; es:[edi],al
                inc     esi
                inc     edi
                dec     ecx
                jne     $L1
In the above example we actually incur a dependency penalty, as we set the AL register but have to wait before we can execute the next instruction. If we adjust the function as follows, we no longer have that problem. You will note that the "inc esi" line was moved up to separate the write of AL from the subsequent read of AL.
        $L1:    mov     al,ds:[esi]
                inc     esi             ; removes dependency penalty
                mov     es:[edi],al
                inc     edi
                dec     ecx
                jne     $L1
Another method that is a lot more efficient than those listed above uses the same techniques we learned for setting memory. We divide our total number of bytes by four to get the number of 4-byte blocks, loop on that count, and then handle the remaining bytes. We handle the dependency penalty at $L1 in the same way.
                mov     edx,ecx         ; Get # of bytes to copy
                shr     edx,2           ; n = n / 4
                jz      $L2             ; Jump if 1..3 bytes

        ; DWORDS (uint32)
        $L1:    mov     eax,[esi]       ; 1st OP  read 32 bits
                add     esi,4
                mov     [edi],eax       ; 2nd OP  write 32 bits
                add     edi,4
                dec     edx
                jne     $L1             ; Loop for DWORDS

        ; Remainders
        $L2:    and     ecx,00000011b   ; Mask remainder bits (0..3)
                jz      $L4             ; Jump if no remainders

        ; 1 to 3 bytes to copy
        $L3:    mov     al,[esi]
                inc     esi
                mov     [edi],al
                inc     edi
                dec     ecx
                jne     $L3             ; Loop for 1's
        $L4:
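For reference, the same block-plus-remainder strategy can be sketched in C. The function name is my own, and note that dereferencing a 32-bit pointer that may be misaligned is exactly the case the assembly discussion is about; on x86 it works, with the clock penalties described earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: copy 4-byte blocks, then the 0..3 remaining bytes.
 * Name is illustrative.  The 32-bit accesses may be misaligned,
 * which the x86 permits (at a clock cost on some models). */
void CopyDwords(void *pDst, const void *pSrc, size_t nSize)
{
    uint32_t *pd32 = (uint32_t *)pDst;
    const uint32_t *ps32 = (const uint32_t *)pSrc;
    size_t n = nSize >> 2;          /* number of 32-bit blocks */

    while (n--)
        *pd32++ = *ps32++;          /* 32 bits at a time */

    uint8_t *pd8 = (uint8_t *)pd32;
    const uint8_t *ps8 = (const uint8_t *)ps32;
    n = nSize & 3;                  /* 0..3 remainder bytes */
    while (n--)
        *pd8++ = *ps8++;            /* one byte at a time */
}
```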
This following method is significantly faster as it moves eight bytes at a time instead of four. There is no dependency penalty since the register being set is not being used immediately.
                mov     ecx,[ebp+arg3]  ; nSize
                shr     ecx,3           ; n = n / 8
                jz      $L2             ; Jump if 1..7 bytes

        ; QWORDS (uint64)
        $L1:    mov     eax,[esi]       ; 1st OP  read 32 bits
                mov     edx,[esi+4]     ;         read next 32 bits
                mov     [edi],eax       ; 2nd OP  write 32 bits
                mov     [edi+4],edx     ;         write next 32 bits
                add     esi,8
                add     edi,8
                dec     ecx
                jne     $L1             ; Loop for QWORDS

        ; Remainders
        $L2:    mov     ecx,[ebp+arg3]  ; nSize
                and     ecx,00000111b   ; Mask remainder bits (0..7)
                jz      $L4             ; Jump if no remainders

        ; 1 to 7 bytes to copy
        $L3:    mov     al,[esi]        ; read a byte
                inc     esi
                mov     [edi],al        ; write byte
                inc     edi
                dec     ecx
                jne     $L3             ; Loop for 1's
        $L4:
This code is just about as fast as a copy using MMX. To use MMX instead, replace the $L1 loop with the following code:
        $L1:    movq    mm7,[esi]       ; read 64 bits
                add     esi,8
                movq    [edi],mm7       ; write 64 bits
                add     edi,8
                dec     ecx
                jne     $L1             ; Loop for QWORDS
There are more sophisticated methods that you can employ, but this is a good start.
Memory alignment is important because a clock penalty occurs whenever the source and/or destination of a memory access is misaligned. A memory movement (copy) function should therefore try to reorient its source and destination pointers. You will not always be lucky enough that the source and destination are either both properly aligned or both misaligned by exactly the same amount. First test whether they share the same alignment relative to an 8-byte boundary:
        if ((pSrc AND 0000111b) == (pDst AND 0000111b))
If both masked values are 0, the pointers are already aligned and no adjustment is needed. If the masked values are equal but nonzero, copy single bytes until both pointers reach an aligned position. If the alignments differ, you can still obtain a speed increase by putting at least one of them into alignment (preferably the destination):
                mov     edx,edi         ; At least align destination!
                and     edx,0000111b    ; 0..7 misaligned bytes
                jz      $Mid            ; Jump if properly aligned

        ; Move the misaligned lead bytes
                add     edx,0fffffff8h  ; -8, so (8 - n) bytes are moved
        $lead:  mov     al,[esi]        ; read byte
                inc     esi
                mov     [edi],al        ; write byte
                inc     edi
                dec     ecx             ; reduce total to move
                inc     edx             ; increment to 0
                jne     $lead           ; loop for lead bytes
        $Mid:
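The lead-byte count computed above can be expressed in C as follows. This is a sketch; the function name is my own, and the extra clamp against the total size (which the assembly above leaves to the caller, since ECX is decremented in the loop) guards against buffers shorter than the distance to the boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: how many single bytes must be copied before the destination
 * pointer reaches an 8-byte boundary.  Mirrors the mask-and-count
 * logic of the assembly; name is illustrative. */
size_t LeadBytes(const void *pDst, size_t nSize)
{
    size_t nMis = (uintptr_t)pDst & 7;      /* 0..7 misaligned bytes */
    if (nMis == 0)
        return 0;                           /* already aligned */
    size_t nLead = 8 - nMis;                /* bytes to the boundary */
    return (nLead < nSize) ? nLead : nSize; /* never exceed the total */
}
```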
For the actual memory movement operation there are various techniques that can be used, each with its own benefit or drawback.
The best method is a preventative one. If the memory you are dealing with is for video images, then not only should (width mod 8) be zero, but the source and destination pointers should also be properly aligned. That way there are no clock penalties on each memory access and no extra, possibly futile, effort spent trying to align the pointers.
In 8-bit images, moving (blitting) sprite memory can be difficult, as sprites can land on any alignment as they move across the screen. In 32-bit images, where one pixel is 32 bits, alignment is a snap, as every pixel is properly aligned.
        #ifdef __cplusplus
        extern "C" void CopyBlit8x8Asm(byte *pDst, byte *pSrc,
                uint nStride, uint nWidth, uint nHeight);
        #endif

        // Comment this line out for 'C' code
        #define USE_ASM_COPYBLIT_8X8

        // 8-bit to 8-bit Copy Blit
        //
        // This function is pre-clipped to copy an 8-bit color
        // pixel from the buffer pointed to by the source
        // pointer to an identical sized destination buffer.

        #ifdef USE_ASM_COPYBLIT_8X8
        #define CopyBlit8x8 CopyBlit8x8Asm
        #else
        void CopyBlit8x8(byte *pDst, byte *pSrc, uint nStride,
                         uint nWidth, uint nHeight)
        {
            // If width is the stride then copy entire image
            if (nWidth == nStride) {
                memcpy(pDst, pSrc, nStride * nHeight);
            } else {
                // Copy image 1 scanline at a time.
                do {
                    memcpy(pDst, pSrc, nWidth);
                    pSrc += nStride;    // Source stride adjustment
                    pDst += nStride;    // Destination stride adj.
                } while (--nHeight);    // Loop for height
            }
        }
        #endif
As you probably noted, there is extra logic checking whether the width and the stride are the same. When they are, the entire image is contiguous in memory and can be moved with a single memcpy() instead of one call per scanline, which makes the code even more efficient.
Goal | Try to write the listed function in assembly optimized for your processor. Or multiple processors. |