Branch Prediction | 32/64-Bit 80x86 Assembly Language Architecture

The most important optimization method for the 80x86 processors is using the branch prediction algorithm. These processors use what both Intel and AMD call a BTB (branch target buffer). This is essentially a history buffer of the behavior of the last n Jcc instructions. In a need for speed processors prefetch (preload) instruction code bytes before they are needed, translate them, and arrange them for processing within their multipipelines to be processed . When a relative or absolute jump or call occurs, that code is prefetched and prepared for processing. However, a problem comes up when a branch (Jcc) is encountered . Which way to go? Take the branch or flow through to the next instruction? Different solutions have been taken by different manufacturers. They have and use different sized BTBs, different prediction methodologies, different prefetch sizes, etc. This particular book is not about optimization but we will talk about the mechanism. The particulars depend on which manufacturer and which processor.

Intel Branch Prediction

The Intel processor does its time sampling in cycles and uses a BTB as well as a prediction history. If a branch is taken, the branch is put into the BTB; if not taken (a flow through), the branch is not put into the BTB unless it was a false prediction. In other words, if executing instructions for the first time and none of the branches are taken, only flowed through, then they were all predicted correctly and thus are not put into the BTB. There is zero to no penalty for executing a branch instruction if the prediction was correct, but if wrong, there is a cycle penalty.

The instruction prefetch has four 32-byte buffers loaded sequentially (one at a time) until a branch is encountered, and then the BTB is used to predict a branch or not. If no predicted branch, the contiguous memory is loaded, but if a branch is predicted, the alternate prefetch buffer is loaded with the memory referenced by the branch. If the prediction was wrong, all the instruction pipelines are flushed and the prefetch mechanism begins again. So you should see the need to design your code to minimize the number of mispredictions. There is one other thing to be careful of and that has to do with two back-to-back Jcc instructions. If two Jcc instructions both have their last byte in the same 4-byte block, a misprediction can occur. This would only occur if the second branch has a displacement of 8 bits. Using a larger bit displacement, rearranging the code, or inserting a NOP instruction would solve the problem.

This method is bad as 14h and 16h are in the same 4-byte block {14h17h}:

 00000013   75  F8  jne   $Z1 00000015   74  07  jz    $Z2

The following is the best method to solve the problem, but only if your assembler lets you override an 8-bit displacement with a 32-bit one for Protected Mode, setting the last byte at 14h and 1ah. Note that in the following, the second code branch uses a 4-byte offset and not one byte. This is because it is outside the [128, 127] range.

 00000013   75  F8  jne   $Z1 00000015   0f 84  00  000007   jz    near ptr $Z2

Here is a 16-bit displacement for Real Mode setting the last byte at 14h and 18h:

 00000013   75  F8  jne    $Z1 00000015   0f 84  00  07   jz     near ptr $Z2

Not that I am urging you to use the NOP instructions, but this one is an alternative as 14h and 18h are in different 4-byte blocks. The NOP pushes the second conditional branch address further down.

 00000013   75  F8  jne       $Z1 00000015   90      nop 00000016   90      nop 00000017   74  07  jz        $Z2

Branches that are not already in the BTB use the static prediction logic as follows .

Static Branch Prediction

Back-Branch-Taken

The branch is predicted to be taken if a negative displacement, such as at the bottom of a loop. A flow through (branch not taken), would be a misprediction!

Forward-Branch-Not-Taken

The branch is predicted not to be taken if a positive displacement such as a jump further down the code. The instruction pointer is expected to just flow through the branch instruction. A jump would be a misprediction.

 $L1:   nop        nop        jne    $L1    ;    Back-Branch-Taken            jz     $L2    ;    Forward-Branch-Not-Taken        nop $L2:   nop

Branching Hints

A prefix of 3Eh (HT) is a hint to take the branch. A prefix of 2eh (HNT) is a hint not to take a branch (flow through). Only set if contrary to a static branch prediction. Sometimes there are no elements to test, so at the top of a function one might have an if conditional ( size =0) empty test.

 test ecx,ecx   db 3eh   ; Hint to take the branch   jz $L9                ; Insert looping code here     $L9:

The default static prediction is to not branch as the jz is a forward-branch and the prediction logic does a flow through, but the 3eh says to override and take the branch as the length is typically expected to be zero most of the time.

AMD Branch Prediction

The same rules for Intel apply here but with some minor changes. The AMD K6-2 chip uses a two-level 8192 entry branch prediction table. It is more effective to use only 8-bit displacements. Code with small loops should be aligned on 16-byte boundaries and code with loops that do not fit in the prefetch should be aligned on 32-byte boundaries. Small loops should be unrolled but large loops should not, due to inefficient use of the L1 instruction cache. A mispredicted branch is from one to four clocks. The penalty for a bad prediction if the branch is not in BTB is three clock cycles.

Branch Optimization

Removing branches from your code such as unrolling loops makes it more efficient by removing the possibility of misprediction. This is discussed in more detail in the next chapter. One method is to use the SETcc instruction to set Boolean flags. Another method is to use CMOV or FCMOV instructions to copy data. These methods can sometimes be manipulated to duplicate the same effect you were trying to achieve with the Jcc instruction without any possible prediction failure that would cost cycles.

For example, the following is the signed integer absolute number function n = ABS(n), which uses a Jcc instruction.

 test  eax,eax   ; Test if negative        jns   $Abs1     ; Jump if positive            neg   eax       ; Invert number, two's complement     $Abs1:                 ; eax is positive

As an alternative, we can do this without a Jcc instruction:

 mov    ecx,eax sar    ecx,31          ; all 1's if neg, all 0's if pos xor    eax,ecx         ; At this point we are one's complement sub    eax,ecx         ; n(1)=n+1 = two's complement

So you see, we did an ABS() function without any Jcc instructions; just a sleight of hand using general-purpose instructions. Admittedly this technique will not work on everything, but it will help in your optimizations. In Chapter 11 we will go into more detail.

Destination addresses of a jump should be code aligned to take advantage of the instruction prefetch.

 align   16

Of course the Align statement must not be in the code flow, as unknown bytes are added to align the code, So it must always occur outside a function, thus after a JMP or RET statement. Another alignment type would be for fine-tuning on a byte-by-byte basis such as:

 org   $+3

which aligns by moving the origin pointer from the $ current location by three bytes, effectively adding three unknown bytes. Typically you will find it paired with an alignment to create a fixed alignment point. This prevents any alignment of a previous function affecting the alignment of a following function.

 align  16  org     $+3

Inside the code flow you can add nondestructive instructions such as the following to help align your code. Your flags may be altered but not the registers.

 nop              ; 1 byte       mov    eax,eax   ; 2 bytes  66h  mov    ax,ax     ; 3 bytes

Let's start by examining a C type strlen() function designed to find the number of bytes in a zero- terminated ASCII string:

 uint strlen(char *p) {     uint cnt = 0;         while (*p++)         cnt++;         return cnt; }

Let's try that in assembly and align the code to a 16-byte boundary:

 align 16 00000000   54               push   ebp 00000001   8B EC            mov    ebp,esp 00000003   8B 54 24 08      mov    edx,[ebp+arg1] ; String     00000007   B8 FFFFFFFF      mov    eax,1     0000000C   40         $L1:  inc    eax 0000000D   8A 0A            mov    cl,[edx]       ; Get a character 0000000F   42               inc    edx 00000010   84 C9            test   cl,cl 00000012   75 F8            jnz    $L1            ; 0 terminator?     00000014   5D               pop    ebp 00000015   C3               ret                   ; return eax (length) 00000016

As you have probably noted, the entire function occupies 16h (22) bytes (000h..016h). It also has an efficiency problem. The 16-byte prefetch is first loaded with address 0000h000fh, executes up to and including the "inc edx" and then reloads the prefetch with the next 16 bytes, address 0010h001fh. The code then executes up to address 0013h and then the prefetch has to be reloaded with 0000h again. This continues over and over again until the zero terminator is encountered. Now if we tweak the alignment a tad we can contain this $L1 loop within one prefetch:

 align 16                             org $+4     00000004   54               push   ebp 00000005   8B EC            mov    ebp,esp 00000007   8B 54 24 08      mov    edx,[ebp+arg1] ; String     0000000B   B8 FFFFFFFF      mov    eax,1     00000010   40         $L1:  inc    eax 00000011   8A 0A            mov    cl,[edx]       ; Get a character 00000013   42               inc    edx 00000014   84 C9            test   cl,cl 00000016   75 F8            jnz    $L1            ; 0 terminator?     00000018   5D               pop    ebp 00000019   C3               ret                   ; return eax (length) 0000001A

You will now note that the beginning of the function actually starts on address 0004h, but the beginning of the loop is now aligned perfectly on a 16-byte boundary, allowing the entire loop to be contained with a single instruction prefetch load. This is a very old and simple alignment trick that is still usable on the newer processors.