Setting Memory

If you happen to be constructing 8-bit images, then the memset() function can work pretty well for you, as the transparency value can be anywhere from 0 to 255. If you are working in 16-, 24-, or 32-bit colors, this function is only useful if the transparency color that you are trying to set just happens to be 0 (or, more generally, a value whose bytes are all identical); if it is any other value, you have a serious problem. You have put together a C function to do the task, but even though this function is not called that often, its speed is not up to your needs.
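To make the problem concrete, here is a minimal C sketch of the kind of fill function described above. The name `fill_pixels32` and the assumption of one `uint32_t` per pixel are illustrative only and do not come from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `count` 32-bit pixels with `color`.
   memset() cannot do this in general because it replicates a single
   byte, so it only works when all four bytes of `color` are equal. */
void fill_pixels32(uint32_t *dst, uint32_t color, size_t count)
{
    assert(dst != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        dst[i] = color;         /* one 32-bit store per pixel */
    }
}
```

A call such as `fill_pixels32(surface, 0x00FF00FFu, width * height)` would clear a surface to a magenta-style key color, something a plain `memset()` cannot express.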

Your older 32-bit C libraries typically have the following library function for clearing a block of memory:

    ; void *memset(void *pMem, int val, uint nCnt)
    ;
    ; Note: if nCnt is set to 0 then no bytes will be set!

            public  memset
    memset  proc    near
            push    ebp
            mov     ebp,esp
            push    edi

            mov     ecx,[ebp+arg3]          ; nCnt
            mov     edi,[ebp+arg1]          ; pMem
            mov     eax,[ebp+arg2]          ; val

            ; Insert one of the following code examples here!

    $xit:   mov     eax,[ebp+arg1]          ; Return pointer
            pop     edi
            pop     ebp
            ret
    memset  endp

The following loop is really inefficient code except when used on the old 8086 processors.

            test    ecx,ecx
            jz      $xit
    $L0:    stosb
            loop    $L0

That code is relatively small but pretty inefficient as it is using the repeating string function to write a series of 8-bit bytes. The payoff on Pentium processors only comes with a repeat of 64 or more.

 rep stosb 

With a repeat factor of less than 64, use the following. Note that when addressing through ES:[EDI], ES: is the default segment for EDI in string-style destinations, so we do not really need to put it in the code.

            test    ecx,ecx
            jz      $xit              ; jump if len of 0

    $L1:    mov     es:[edi],al       ; set a byte
            inc     edi
            dec     ecx
            jne     $L1

An alternate method that is a lot more efficient than those listed above is to divide our total number of bytes into the number of 4-byte blocks, then loop on that, not forgetting to handle the remainders.

            test    ecx,ecx
            jz      $xit              ; jump if len of 0

    ; The speed of writing 1 byte is the same as writing 4 bytes
    ; properly aligned so we build a 32-bit value to write (al = val)

            mov     ah,al
            mov     edx,eax
            shl     eax,16
            mov     ax,dx             ; eax = byte replicated x4

            mov     edx,ecx           ; Get # of bytes to set
            shr     edx,2             ; n = n / 4
            jz      $L2               ; Jump if 1..3 bytes

    ; edx = # of 32-bit writes

    $L1:    mov     [edi],eax         ; set 4 bytes
            add     edi,4             ; advance pointer by 4 bytes
            dec     edx
            jne     $L1               ; Loop for DWORDS

    ; Remainders (1..3) bytes

    $L2:    and     ecx,00000011b     ; Mask remainder bits (0..3)
            jz      $L4               ; Jump if no remainders

    ; 1 to 3 bytes to set

    $L3:    mov     [edi],al          ; set 1 byte
            inc     edi               ; advance pointer by 1 byte
            dec     ecx
            jne     $L3               ; Loop for 1's

    $L4:

There are more sophisticated methods that you can employ but this is a good start.
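For readers who prefer to see the idea outside of assembly, the 4-byte-block approach can be sketched in C roughly as follows. The name `memset_dwords` is hypothetical, and `memcpy` is used for the 32-bit store only to keep the C well defined; it is an assumption of this sketch, not something the listing does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of the dword loop: write the replicated
   byte four at a time, then finish the 0..3 leftover bytes singly. */
void *memset_dwords(void *mem, int val, size_t count)
{
    assert(mem != NULL || count == 0);

    unsigned char *p = mem;
    unsigned char byte = (unsigned char)val;
    uint32_t pattern = byte * 0x01010101u;  /* byte replicated x4 */

    size_t dwords = count >> 2;             /* n / 4 full blocks */
    while (dwords--) {
        memcpy(p, &pattern, sizeof pattern);
        p += 4;
    }

    size_t rest = count & 3;                /* 0..3 remainder bytes */
    while (rest--) {
        *p++ = byte;
    }
    return mem;
}
```

The multiply by `0x01010101` plays the same role as the `mov ah,al` / `shl eax,16` sequence: it copies the low byte into all four byte lanes.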

For optimal performance all data reads and writes must be on a 32-bit boundary. In a copy situation, if the source and destination are misaligned, there is not much that can be done about it. But in the case of setting memory that is misaligned, it is a snap to fix it.

Figure 19-1: Imagine these four differently aligned memory strands as eels. We pull out our sushi knife and finely chop off their heads into little 8-bit (1-byte) chunks, chop off the tails into 8-bit (1-byte) chunks, and then coarsely chop the bodies into larger 32-bit (4-byte) chunks, and serve raw.

The first three memory strands had their heads misaligned, but the fourth was aligned properly. On the other hand, the tails of the last three were misaligned and the first one was aligned properly. Now that they're sliced and diced, their midsections are all properly aligned for best data handling.
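The slicing in Figure 19-1 can be sketched in C as a small helper that works out how many head bytes, aligned 32-bit body writes, and tail bytes a given span needs. The names `span_split`, `split_span`, and `memset_aligned` are hypothetical, and treating spans shorter than four bytes as all tail is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t head;  /* single bytes up to the next 4-byte boundary */
    size_t body;  /* aligned 32-bit writes */
    size_t tail;  /* single bytes left after the body */
} span_split;

/* Decide how to chop a span that starts at `addr` and is `count` long. */
span_split split_span(uintptr_t addr, size_t count)
{
    span_split s = {0, 0, 0};
    size_t misalign = (size_t)(addr & 3u);

    if (count < 4) {            /* too short to bother aligning */
        s.tail = count;
        return s;
    }
    s.head = misalign ? 4 - misalign : 0;
    count -= s.head;
    s.body = count >> 2;
    s.tail = count & 3;
    return s;
}

/* Set memory using the head / body / tail split above. */
void *memset_aligned(void *mem, int val, size_t count)
{
    assert(mem != NULL || count == 0);

    unsigned char *p = mem;
    unsigned char byte = (unsigned char)val;
    uint32_t pattern = byte * 0x01010101u;   /* byte replicated x4 */
    span_split s = split_span((uintptr_t)p, count);

    while (s.head--) *p++ = byte;            /* chop the head */
    while (s.body--) {                       /* aligned body */
        memcpy(p, &pattern, sizeof pattern);
        p += 4;
    }
    while (s.tail--) *p++ = byte;            /* chop the tail */
    return mem;
}
```

For example, a 10-byte span starting one byte past a 4-byte boundary splits into 3 head bytes, 1 aligned 32-bit write, and 3 tail bytes.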

The latest C runtime libraries use something a lot more elaborate such as the following function:

    ; void *memset(void *pMem, int val, uint nCnt)
    ;
    ; Note: if nCnt is set to 0 then no bytes will be set!

            public  memset
    memset  proc    near

    $BSHFT  =       2               ; Shift count
    $BCNT   =       4               ; Byte count

            push    ebp
            mov     ebp,esp

    ; Unlike the code above, flow does not have to fall through
    ; if a size of 0 was passed, and so we need to test for it.
    ; The lines are adjusted to help prevent a stall.

            mov     ecx,[ebp+arg3]  ; nCnt
            push    edi

    ; Older programmers will say hey, why didn't you 'OR ecx,ecx'
    ; but this is a read/write function that will cost you
    ; time for the write. The 'TEST ecx,ecx' is a read only!

            test    ecx,ecx
            push    ebx
            jz      $Xit            ; jump if size is 0

            mov     edi,[ebp+arg1]  ; pMem

    ; If the size is (1...3) bytes long, then handle as tail bytes

            test    ecx,NOT ($BCNT-1)
            mov     eax,[ebp+arg2]  ; val
            jz      $Tail

    ; If already aligned on a (n mod 4)==0 boundary

            mov     edx,edi
            and     edx,($BCNT-1)
            jz      $SetD

    ; The memory attempting to be set may not be properly aligned on
    ; a 4-byte boundary and thus if the block is 4 bytes in size or
    ; greater, then the 32-bit writes will have clock penalties on
    ; each write and so first adjust to be properly aligned.

            sub     edx,$BCNT
            add     ecx,edx         ; Reduce # of bytes to set

    $Lead:  mov     [edi],al        ; Set a byte
            inc     edi
            inc     edx
            jne     $Lead           ; Loop for those {1..3} bytes

    ; The speed of writing 1 byte is the same as writing 4 bytes
    ; properly aligned so build a 32-bit value to write (al = val)

    $SetD:  mov     ah,al
            mov     edx,eax
            shl     eax,16
            mov     ax,dx           ; eax = byte replicated x4

    ; Now we set the bytes four at a time

            mov     edx,ecx
            shr     edx,$BSHFT      ; (n/4) = # of 32-bit writes
            jz      $SetT           ; fewer than 4 bytes left after aligning

    $SetD1: mov     [edi],eax
            add     edi,$BCNT
            dec     edx
            jne     $SetD1

    $SetT:  and     ecx,($BCNT-1)
            jz      $Xit            ; jump if size is 0

    ; Write any trailing bytes

    $Tail:  mov     [edi],al        ; set a byte
            inc     edi
            dec     ecx
            jne     $Tail           ; loop for trailing bytes

    $Xit:   pop     ebx
            pop     edi
            mov     eax,[ebp+arg1]  ; Return destination pointer
            pop     ebp
            ret
    memset  endp

As you can see, that simple memory set function became a lot bigger, but its execution speed became a lot quicker. With very short lengths, such as fewer than four bytes, this code is actually slower, but it quickly gains in speed as the memory lengths increase in size, especially if aligned on 4-byte boundaries. For extra efficiency on a size of 256 bytes or more (a repeat count of 64 doublewords), using the STOSD instruction would be best.


You should use the string functions such as STOSD only if the repeat factor is 64 or more.

These numbers aren't exactly right as this function has not been tuned for its optimal timing yet, but I leave that to you. Besides, what would be the fun in it if I gave you all the answers?

As versatile as the MMX instruction set is, the linear setting or copying of memory with it is no more efficient than with the integer instructions. In fact, a STOSD/MOVSD string set/copy with a repeat of 64 or more is actually faster than the equivalent MMX instructions on legacy processors. This would also leave the XMM register for math-related solutions. It turns out that we are actually pumping data very close to or at the bus speed.

For experimental purposes and to have some MMX practice, one alternative would be the use of the MMX instruction MOVQ in the $SetD section of the code so eight bytes would be written at one time.

Alter the $BSHFT and $BCNT to the new values:

    $BSHFT  =       3               ; Shift count 8=(1<<3)
    $BCNT   =       8               ; Byte count

    ; Run lookup table replicates an 8-bit byte into a 64-bit qword.
    ; It saves a lot of shifting and ORing and only costs
    ; 256x8 = 2048 bytes and 1 time cycle.
    ;
    ; 00000000h,00000000h,01010101h,01010101h,02020202h,02020202h,
    ;   etc.

    Replicate64 label  DWORD
            .XLIST
    foo     =       0
            REPEAT  256
            DD      foo,foo
    foo     =       foo + 01010101h
            ENDM
            .LIST

    $SetD:  movzx   eax,al            ; index with the low byte of val only
            lea     eax,Replicate64[eax*8]
            movq    mm7,[eax]

            mov     edx,ecx
            shr     edx,$BSHFT        ; (n/8) = # of 64-bit writes

    $SetD1: movq    [edi],mm7
            add     edi,$BCNT
            dec     edx
            jne     $SetD1
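The lookup table built by the REPEAT block can be mirrored in C to check what it contains. This is only a sketch: the names `replicate64`, `build_replicate64`, and `replicate_byte` are hypothetical, and the table is built at run time here rather than at assembly time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t replicate64[256];   /* 256 x 8 = 2048 bytes */
static int replicate64_ready = 0;

/* Build the table the same way the REPEAT block does: each entry is
   two copies of a 32-bit value that grows by 01010101h per step. */
void build_replicate64(void)
{
    uint32_t half = 0;
    for (size_t i = 0; i < 256; i++) {
        replicate64[i] = ((uint64_t)half << 32) | half;
        half += 0x01010101u;
    }
    replicate64_ready = 1;
}

/* Look up the 64-bit pattern for one byte value. */
uint64_t replicate_byte(unsigned char b)
{
    assert(replicate64_ready);      /* table must be built first */
    return replicate64[b];
}
```

Each entry equals the byte multiplied by `0x0101010101010101`, so on a CPU with a fast 64-bit multiply the table could be replaced by that one operation; the table trades 2 KB of memory for avoiding the shifts and ORs.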

And call at the appropriate time only if your CPU thread has floating-point operations to handle:


I recommend the use of the ZeroMemory() function instead. It saves passing an extra argument value of 0, or the time to replicate the single byte to four bytes.

32/64-Bit 80x86 Assembly Language Architecture
ISBN: 1598220020
Year: 2003
Pages: 191
