If you are constructing 8-bit images, the memset() function works well because the transparency value can be anything from 0 to 255. With 16-, 24-, or 32-bit colors, memset() is useful only if the transparency color you are trying to set happens to be 0 (or another value whose bytes are all identical); for any other value, you have a serious problem. You have put together a C function to do the task, but even though this function is not called that often, its speed is not up to your needs.
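To make the limitation concrete, here is a minimal C sketch of the kind of hand-written routine the paragraph describes, assuming 32-bit pixels. The name fill_pixels32 and its parameters are hypothetical and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set 'count' 32-bit pixels to 'color'.
   memset() cannot do this in general because it replicates a single
   byte, so it only produces colors whose four bytes are identical. */
void fill_pixels32(uint32_t *pixels, uint32_t color, size_t count)
{
    assert(pixels != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        pixels[i] = color;      /* one full pixel per iteration */
    }
}
```

A call such as fill_pixels32(image, 0x00FF00FF, width * height) writes the full color, whereas memset(image, 0xFF, n) could only ever produce 0xFFFFFFFF in each pixel.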
Your older 32-bit C libraries typically have the following library function for clearing a block of memory:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near
        push    ebp
        mov     ebp,esp
        push    edi

        mov     ecx,[ebp+arg3]  ; nCnt
        mov     edi,[ebp+arg1]  ; pMem
        mov     eax,[ebp+arg2]  ; val

; Insert one of the following code examples here!

$xit:   mov     eax,[ebp+arg1]  ; Return pointer
        pop     edi
        pop     ebp
        ret
memset  endp
Warning: The following loop is really inefficient code except when used on the old 8086 processors.

        test    ecx,ecx
        jz      $xit
$L0:    stosb
        loop    $L0
That code is relatively small but pretty inefficient, as it uses a string instruction to write a series of 8-bit bytes one at a time. On Pentium processors, the repeating form only pays off with a repeat count of 64 or more:
rep stosb
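With a repeat factor of less than 64, use the following. Note that ES:[EDI] is the default destination only for the string instructions; for an ordinary MOV through EDI the default segment is DS. In a flat 32-bit memory model both segments address the same memory, so we do not really need to put the ES: override in the code.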
        test    ecx,ecx
        jz      $xit            ; jump if len of 0

$L1:    mov     es:[edi],al     ; set a byte
        inc     edi
        dec     ecx
        jne     $L1
An alternate method that is a lot more efficient than those listed above is to divide our total number of bytes into the number of 4-byte blocks, then loop on that, not forgetting to handle the remainders.
        test    ecx,ecx
        jz      $xit            ; jump if len of 0

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned so we build a 32-bit value to write (al = val)

        mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx           ; eax = replicated byte x4

        mov     edx,ecx         ; Get # of bytes to set
        shr     edx,2           ; n = n / 4
        jz      $L2             ; Jump if 1..3 bytes

; edx = # of 32-bit writes

$L1:    mov     [edi],eax       ; set 4 bytes
        add     edi,4           ; advance pointer by 4 bytes
        dec     edx
        jne     $L1             ; Loop for DWORDS

; Remainders (1..3) bytes

$L2:    and     ecx,00000011b   ; Mask remainder bits (0..3)
        jz      $L4             ; Jump if no remainders

; 1 to 3 bytes to set

$L3:    mov     [edi],al        ; set 1 byte
        inc     edi             ; advance pointer by 1 byte
        dec     ecx
        jne     $L3             ; Loop for 1's
$L4:
There are more sophisticated methods that you can employ but this is a good start.
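For readers who think more easily in C, the 4-byte block approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the name set_bytes_by4 is hypothetical, and memcpy() of a fixed 4-byte value stands in for the 32-bit MOV (compilers typically turn it into a single store, and it stays valid C even when the pointer is misaligned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: set n bytes at mem to (unsigned char)val,
   four bytes at a time, then finish the 0..3 remaining bytes. */
void *set_bytes_by4(void *mem, int val, size_t n)
{
    unsigned char *p = (unsigned char *)mem;
    unsigned char b = (unsigned char)val;
    uint32_t pattern = (uint32_t)b * 0x01010101u;   /* replicate byte x4 */

    assert(mem != NULL || n == 0);

    /* n / 4 writes of 32 bits each */
    for (size_t blocks = n >> 2; blocks != 0; blocks--) {
        memcpy(p, &pattern, sizeof pattern);
        p += 4;
    }

    /* remainder: n mod 4 single-byte writes */
    for (size_t rest = n & 3; rest != 0; rest--) {
        *p++ = b;
    }
    return mem;
}
```

The multiply by 0x01010101 does the same job as the MOV AH,AL / SHL EAX,16 sequence: it copies the low byte into all four byte positions.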
For optimal performance, all data reads and writes should be on a 32-bit boundary. In a copy situation, if the source and destination are misaligned relative to each other, there is not much that can be done about it. But in the case of setting memory that is misaligned, it is a snap to fix.
The first three memory strands had their heads misaligned, but the fourth was aligned properly. On the other hand, the tails of the last three were misaligned and the first one was aligned properly. Now that they're sliced and diced, their midsections are all properly aligned for best data handling.
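The slicing described above amounts to splitting a block into a misaligned head, an aligned midsection, and a leftover tail. The following C sketch computes the three sizes for a 4-byte boundary; the struct split4 and the function split_for_alignment are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: how many bytes fall in each slice. */
typedef struct {
    size_t head;  /* leading bytes before the first 4-byte boundary */
    size_t body;  /* bytes covered by aligned 32-bit writes (multiple of 4) */
    size_t tail;  /* trailing bytes after the last full 4-byte block */
} split4;

split4 split_for_alignment(uintptr_t addr, size_t n)
{
    split4 s;
    size_t misalign = (size_t)(addr & 3u);   /* 0 means already aligned */

    s.head = (misalign == 0) ? 0 : 4 - misalign;
    if (s.head > n) {
        s.head = n;                          /* block ends before the boundary */
    }
    s.body = (n - s.head) & ~(size_t)3;      /* round down to whole dwords */
    s.tail = n - s.head - s.body;

    assert(s.head + s.body + s.tail == n);
    return s;
}
```

For example, a 10-byte block starting one byte past a boundary splits into a 3-byte head, a 4-byte midsection, and a 3-byte tail.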
The latest C runtime libraries use something a lot more elaborate such as the following function:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!

        public  memset
memset  proc    near

$BSHFT  =       2               ; Shift count
$BCNT   =       4               ; Byte count

        push    ebp
        mov     ebp,esp

; Unlike the code above, flow does not have to fall through
; if a size of 0 was passed, and so we need to test for it.
; The lines are adjusted to help prevent a stall.

        mov     ecx,[ebp+arg3]  ; nCnt
        push    edi

; Older programmers will say hey, why didn't you 'OR ecx,ecx'
; but this is a read/write function that will cost you
; time for the write. The 'TEST ecx,ecx' is a read only!

        test    ecx,ecx
        push    ebx
        jz      $Xit            ; jump if size is 0

        mov     edi,[ebp+arg1]  ; pMem

; If the size is (1...3) bytes long, then handle as tail bytes

        test    ecx,NOT ($BCNT-1)
        mov     eax,[ebp+arg2]  ; val
        jz      $Tail

; If already aligned on a (n mod 4)==0 boundary

        mov     edx,edi
        and     edx,($BCNT-1)
        jz      $SetD

; The memory attempting to be set may not be properly aligned on
; a 4-byte boundary and thus if the block is 4 bytes in size or
; greater, then the 32-bit writes will have clock penalties on
; each write and so first adjust to be properly aligned.

        sub     edx,$BCNT
        add     ecx,edx         ; Reduce # of bytes to set

$Lead:  mov     [edi],al        ; Set a byte
        inc     edi
        inc     edx
        jne     $Lead           ; Loop for those {1..3} bytes

; The speed of writing 1 byte is the same as writing 4 bytes
; properly aligned so build a 32-bit value to write (al = val)

$SetD:  mov     ah,al
        mov     edx,eax
        shl     eax,16
        mov     ax,dx           ; eax = replicated byte x4

; Now we set the bytes four at a time

        mov     edx,ecx
        shr     edx,$BSHFT      ; (n/4) = # of 32-bit writes
        jz      $SetT           ; Fewer than 4 bytes left after the
                                ; lead bytes, so skip the DWORD loop

$SetD1: mov     [edi],eax
        add     edi,$BCNT
        dec     edx
        jne     $SetD1

$SetT:  and     ecx,($BCNT-1)
        jz      $Xit            ; jump if no trailing bytes

; Write any trailing bytes

$Tail:  mov     [edi],al        ; set a byte
        inc     edi
        dec     ecx
        jne     $Tail           ; loop for trailing bytes

$Xit:   pop     ebx
        pop     edi
        mov     eax,[ebp+arg1]  ; Return destination pointer
        pop     ebp
        ret
memset  endp
As you can see, that simple memory set function became a lot bigger, but it also became a lot quicker. For very short lengths, such as fewer than four bytes, this code is actually slower, but it quickly gains speed as the length increases, especially if the memory is aligned on a 4-byte boundary. For extra efficiency at sizes of 256 bytes or more, using the STOSD instruction would be best.
Note: You should use the string functions such as STOSD only if the repeat factor is 64 or more.
These numbers aren't exactly right, as this function has not yet been tuned for optimal timing, but I leave that to you. Besides, what would be the fun in it if I gave you all the answers?

As versatile as the MMX instruction set is, it sets or copies memory linearly no more efficiently than the integer instructions do. In fact, a STOSD/MOVSD string set/copy with a repeat of 64 or more is actually faster than the equivalent MMX instructions on legacy processors. This would also leave the XMM registers free for math-related solutions. It turns out that we are actually pumping data very close to or at the bus speed.

For experimental purposes, and to get some MMX practice, one alternative would be to use the MMX instruction MOVQ in the $SetD section of the code so that eight bytes are written at one time.
Alter the $BSHFT and $BCNT to the new values:
$BSHFT  =       3               ; Shift count 8=(1<<3)
$BCNT   =       8               ; Byte count

; Run lookup table replicates an 8-bit byte into a 64-bit qword.
; It saves a lot of shifting and ORing and only costs
; 256x8 = 2048 bytes and 1 time cycle.
;
; 00000000h,00000000h,01010101h,01010101h,02020202h,02020202h,
; etc.

Replicate64 label DWORD
        .XLIST
foo = 0
        REPEAT 256
        DD      foo,foo
foo = foo + 01010101h
        ENDM
        .LIST

$SetD:  lea     eax,Replicate64[eax*8]
        movq    mm7,[eax]

        mov     edx,ecx
        shr     edx,$BSHFT      ; (n/8) = # of 64-bit writes

$SetD1: movq    [edi],mm7
        add     edi,$BCNT
        dec     edx
        jne     $SetD1
And call the following at the appropriate time, only if your CPU thread has floating-point operations to handle:

        emms
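The byte-replication lookup table used in the MOVQ variant can be sketched in C as follows. This is only an illustration of the table idea, with hypothetical names (replicate64, init_replicate64, replicate_byte64); it stores each entry as one 64-bit value rather than as a pair of DWORDs and builds the table at run time instead of at assembly time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 256 entries x 8 bytes = 2048 bytes, as in the assembly table. */
static uint64_t replicate64[256];

/* Fill the table: entry i holds byte i repeated eight times. */
void init_replicate64(void)
{
    uint64_t pattern = 0;
    for (size_t i = 0; i < 256; i++) {
        replicate64[i] = pattern;
        pattern += UINT64_C(0x0101010101010101);  /* next byte value, x8 */
    }
}

/* Look up the 64-bit fill pattern for the low byte of val. */
uint64_t replicate_byte64(int val)
{
    assert(replicate64[255] != 0);   /* init_replicate64() must run first */
    return replicate64[(unsigned char)val];
}
```

Masking val down to a single byte before indexing matters here, because memset() receives its fill value as an int and only the low 8 bits are meaningful.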
Note: I recommend the use of the ZeroMemory() function instead. It saves passing an extra argument value of 0, or the time to replicate the single byte to four bytes.