| ||
The most important optimization method for the 80x86 processors is using the branch prediction algorithm. These processors use what both Intel and AMD call a BTB (branch target buffer). This is essentially a history buffer of the behavior of the last n Jcc instructions. In a need for speed processors prefetch (preload) instruction code bytes before they are needed, translate them, and arrange them for processing within their multipipelines to be processed . When a relative or absolute jump or call occurs, that code is prefetched and prepared for processing. However, a problem comes up when a branch (Jcc) is encountered . Which way to go? Take the branch or flow through to the next instruction? Different solutions have been taken by different manufacturers. They have and use different sized BTBs, different prediction methodologies, different prefetch sizes, etc. This particular book is not about optimization but we will talk about the mechanism. The particulars depend on which manufacturer and which processor.
The Intel processor does its time sampling in cycles and uses a BTB as well as a prediction history. If a branch is taken, the branch is put into the BTB; if not taken (a flow through), the branch is not put into the BTB unless it was a false prediction. In other words, if executing instructions for the first time and none of the branches are taken, only flowed through, then they were all predicted correctly and thus are not put into the BTB. There is zero to no penalty for executing a branch instruction if the prediction was correct, but if wrong, there is a cycle penalty.
The instruction prefetch has four 32-byte buffers loaded sequentially (one at a time) until a branch is encountered, and then the BTB is used to predict a branch or not. If no predicted branch, the contiguous memory is loaded, but if a branch is predicted, the alternate prefetch buffer is loaded with the memory referenced by the branch. If the prediction was wrong, all the instruction pipelines are flushed and the prefetch mechanism begins again. So you should see the need to design your code to minimize the number of mispredictions. There is one other thing to be careful of and that has to do with two back-to-back Jcc instructions. If two Jcc instructions both have their last byte in the same 4-byte block, a misprediction can occur. This would only occur if the second branch has a displacement of 8 bits. Using a larger bit displacement, rearranging the code, or inserting a NOP instruction would solve the problem.
This method is bad as 14h and 16h are in the same 4-byte block {14h17h}:
00000013 75 F8 jne $Z1 00000015 74 07 jz $Z2
The following is the best method to solve the problem, but only if your assembler lets you override an 8-bit displacement with a 32-bit one for Protected Mode, setting the last byte at 14h and 1ah. Note that in the following, the second code branch uses a 4-byte offset and not one byte. This is because it is outside the [128, 127] range.
00000013 75 F8 jne $Z1 00000015 0f 84 00 000007 jz near ptr $Z2
Here is a 16-bit displacement for Real Mode setting the last byte at 14h and 18h:
00000013 75 F8 jne $Z1 00000015 0f 84 00 07 jz near ptr $Z2
Not that I am urging you to use the NOP instructions, but this one is an alternative as 14h and 18h are in different 4-byte blocks. The NOP pushes the second conditional branch address further down.
00000013 75 F8 jne $Z1 00000015 90 nop 00000016 90 nop 00000017 74 07 jz $Z2
Branches that are not already in the BTB use the static prediction logic as follows .
The branch is predicted to be taken if a negative displacement, such as at the bottom of a loop. A flow through (branch not taken), would be a misprediction!
The branch is predicted not to be taken if a positive displacement such as a jump further down the code. The instruction pointer is expected to just flow through the branch instruction. A jump would be a misprediction.
$L1: nop nop jne $L1 ; Back-Branch-Taken jz $L2 ; Forward-Branch-Not-Taken nop $L2: nop
A prefix of 3Eh (HT) is a hint to take the branch. A prefix of 2eh (HNT) is a hint not to take a branch (flow through). Only set if contrary to a static branch prediction. Sometimes there are no elements to test, so at the top of a function one might have an if conditional ( size =0) empty test.
test ecx,ecx db 3eh ; Hint to take the branch jz $L9 ; Insert looping code here $L9:
The default static prediction is to not branch as the jz is a forward-branch and the prediction logic does a flow through, but the 3eh says to override and take the branch as the length is typically expected to be zero most of the time.
The same rules for Intel apply here but with some minor changes. The AMD K6-2 chip uses a two-level 8192 entry branch prediction table. It is more effective to use only 8-bit displacements. Code with small loops should be aligned on 16-byte boundaries and code with loops that do not fit in the prefetch should be aligned on 32-byte boundaries. Small loops should be unrolled but large loops should not, due to inefficient use of the L1 instruction cache. A mispredicted branch is from one to four clocks. The penalty for a bad prediction if the branch is not in BTB is three clock cycles.
Removing branches from your code such as unrolling loops makes it more efficient by removing the possibility of misprediction. This is discussed in more detail in the next chapter. One method is to use the SETcc instruction to set Boolean flags. Another method is to use CMOV or FCMOV instructions to copy data. These methods can sometimes be manipulated to duplicate the same effect you were trying to achieve with the Jcc instruction without any possible prediction failure that would cost cycles.
For example, the following is the signed integer absolute number function n = ABS(n), which uses a Jcc instruction.
test eax,eax ; Test if negative jns $Abs1 ; Jump if positive neg eax ; Invert number, two's complement $Abs1: ; eax is positive
As an alternative, we can do this without a Jcc instruction:
mov ecx,eax sar ecx,31 ; all 1's if neg, all 0's if pos xor eax,ecx ; At this point we are one's complement sub eax,ecx ; n(1)=n+1 = two's complement
So you see, we did an ABS() function without any Jcc instructions; just a sleight of hand using general-purpose instructions. Admittedly this technique will not work on everything, but it will help in your optimizations. In Chapter 11 we will go into more detail.
Destination addresses of a jump should be code aligned to take advantage of the instruction prefetch.
align 16
Of course the Align statement must not be in the code flow, as unknown bytes are added to align the code, So it must always occur outside a function, thus after a JMP or RET statement. Another alignment type would be for fine-tuning on a byte-by-byte basis such as:
org $+3
which aligns by moving the origin pointer from the $ current location by three bytes, effectively adding three unknown bytes. Typically you will find it paired with an alignment to create a fixed alignment point. This prevents any alignment of a previous function affecting the alignment of a following function.
align 16 org $+3
Inside the code flow you can add nondestructive instructions such as the following to help align your code. Your flags may be altered but not the registers.
nop ; 1 byte mov eax,eax ; 2 bytes 66h mov ax,ax ; 3 bytes
Let's start by examining a C type strlen() function designed to find the number of bytes in a zero- terminated ASCII string:
uint strlen(char *p) { uint cnt = 0; while (*p++) cnt++; return cnt; }
Let's try that in assembly and align the code to a 16-byte boundary:
align 16 00000000 54 push ebp 00000001 8B EC mov ebp,esp 00000003 8B 54 24 08 mov edx,[ebp+arg1] ; String 00000007 B8 FFFFFFFF mov eax,1 0000000C 40 $L1: inc eax 0000000D 8A 0A mov cl,[edx] ; Get a character 0000000F 42 inc edx 00000010 84 C9 test cl,cl 00000012 75 F8 jnz $L1 ; 0 terminator? 00000014 5D pop ebp 00000015 C3 ret ; return eax (length) 00000016
As you have probably noted, the entire function occupies 16h (22) bytes (000h..016h). It also has an efficiency problem. The 16-byte prefetch is first loaded with address 0000h000fh, executes up to and including the "inc edx" and then reloads the prefetch with the next 16 bytes, address 0010h001fh. The code then executes up to address 0013h and then the prefetch has to be reloaded with 0000h again. This continues over and over again until the zero terminator is encountered. Now if we tweak the alignment a tad we can contain this $L1 loop within one prefetch:
align 16 org $+4 00000004 54 push ebp 00000005 8B EC mov ebp,esp 00000007 8B 54 24 08 mov edx,[ebp+arg1] ; String 0000000B B8 FFFFFFFF mov eax,1 00000010 40 $L1: inc eax 00000011 8A 0A mov cl,[edx] ; Get a character 00000013 42 inc edx 00000014 84 C9 test cl,cl 00000016 75 F8 jnz $L1 ; 0 terminator? 00000018 5D pop ebp 00000019 C3 ret ; return eax (length) 0000001A
You will now note that the beginning of the function actually starts on address 0004h, but the beginning of the loop is now aligned perfectly on a 16-byte boundary, allowing the entire loop to be contained with a single instruction prefetch load. This is a very old and simple alignment trick that is still usable on the newer processors.