Flylib.com

Books Software

 
 
 

Branch Prediction

Branch Prediction

The most important optimization method for the 80x86 processors is using the branch prediction algorithm. These processors use what both Intel and AMD call a BTB (branch target buffer). This is essentially a history buffer of the behavior of the last n Jcc instructions. In a need for speed processors prefetch (preload) instruction code bytes before they are needed, translate them, and arrange them for processing within their multipipelines to be processed . When a relative or absolute jump or call occurs, that code is prefetched and prepared for processing. However, a problem comes up when a branch (Jcc) is encountered . Which way to go? Take the branch or flow through to the next instruction? Different solutions have been taken by different manufacturers. They have and use different sized BTBs, different prediction methodologies, different prefetch sizes, etc. This particular book is not about optimization but we will talk about the mechanism. The particulars depend on which manufacturer and which processor.

Intel Branch Prediction

The Intel processor does its time sampling in cycles and uses a BTB as well as a prediction history. If a branch is taken, the branch is put into the BTB; if not taken (a flow through), the branch is not put into the BTB unless it was a false prediction. In other words, if executing instructions for the first time and none of the branches are taken, only flowed through, then they were all predicted correctly and thus are not put into the BTB. There is zero to no penalty for executing a branch instruction if the prediction was correct, but if wrong, there is a cycle penalty.

The instruction prefetch has four 32-byte buffers loaded sequentially (one at a time) until a branch is encountered, and then the BTB is used to predict a branch or not. If no predicted branch, the contiguous memory is loaded, but if a branch is predicted, the alternate prefetch buffer is loaded with the memory referenced by the branch. If the prediction was wrong, all the instruction pipelines are flushed and the prefetch mechanism begins again. So you should see the need to design your code to minimize the number of mispredictions. There is one other thing to be careful of and that has to do with two back-to-back Jcc instructions. If two Jcc instructions both have their last byte in the same 4-byte block, a misprediction can occur. This would only occur if the second branch has a displacement of 8 bits. Using a larger bit displacement, rearranging the code, or inserting a NOP instruction would solve the problem.

This method is bad as 14h and 16h are in the same 4-byte block {14h17h}:

00000013   75

F8

jne   $Z1
00000015   74

07

jz    $Z2

The following is the best method to solve the problem, but only if your assembler lets you override an 8-bit displacement with a 32-bit one for Protected Mode, setting the last byte at 14h and 1ah. Note that in the following, the second code branch uses a 4-byte offset and not one byte. This is because it is outside the [128, 127] range.

00000013   75

F8

jne   $Z1
00000015   0f 84

00

000007   jz    near ptr $Z2

Here is a 16-bit displacement for Real Mode setting the last byte at 14h and 18h:

00000013   75

F8

jne    $Z1
00000015   0f 84

00

07   jz     near ptr $Z2

Not that I am urging you to use the NOP instructions, but this one is an alternative as 14h and 18h are in different 4-byte blocks. The NOP pushes the second conditional branch address further down.

00000013   75

F8

jne       $Z1
00000015   90      nop
00000016   90      nop
00000017   74

07

jz        $Z2

Branches that are not already in the BTB use the static prediction logic as follows .

Static Branch Prediction

Back-Branch-Taken

The branch is predicted to be taken if a negative displacement, such as at the bottom of a loop. A flow through (branch not taken), would be a misprediction!

Forward-Branch-Not-Taken

The branch is predicted not to be taken if a positive displacement such as a jump further down the code. The instruction pointer is expected to just flow through the branch instruction. A jump would be a misprediction.

$L1:   nop
       nop
       jne    $L1    ;

image from book

Back-Branch-Taken
   
       jz     $L2    ;

image from book

Forward-Branch-Not-Taken
       nop
$L2:   nop

Branching Hints

A prefix of 3Eh (HT) is a hint to take the branch. A prefix of 2eh (HNT) is a hint not to take a branch (flow through). Only set if contrary to a static branch prediction. Sometimes there are no elements to test, so at the top of a function one might have an if conditional ( size =0) empty test.

test ecx,ecx
  db 3eh   ; Hint to take the branch
  jz $L9
   
           ; Insert looping code here
   
$L9:

The default static prediction is to not branch as the jz is a forward-branch and the prediction logic does a flow through, but the 3eh says to override and take the branch as the length is typically expected to be zero most of the time.

AMD Branch Prediction

The same rules for Intel apply here but with some minor changes. The AMD K6-2 chip uses a two-level 8192 entry branch prediction table. It is more effective to use only 8-bit displacements. Code with small loops should be aligned on 16-byte boundaries and code with loops that do not fit in the prefetch should be aligned on 32-byte boundaries. Small loops should be unrolled but large loops should not, due to inefficient use of the L1 instruction cache. A mispredicted branch is from one to four clocks. The penalty for a bad prediction if the branch is not in BTB is three clock cycles.

Branch Optimization

Removing branches from your code such as unrolling loops makes it more efficient by removing the possibility of misprediction. This is discussed in more detail in the next chapter. One method is to use the SETcc instruction to set Boolean flags. Another method is to use CMOV or FCMOV instructions to copy data. These methods can sometimes be manipulated to duplicate the same effect you were trying to achieve with the Jcc instruction without any possible prediction failure that would cost cycles.

For example, the following is the signed integer absolute number function n = ABS(n), which uses a Jcc instruction.

test  eax,eax   ; Test if negative
       jns   $Abs1     ; Jump if positive
   
       neg   eax       ; Invert number, two's complement
   
$Abs1:                 ; eax is positive

As an alternative, we can do this without a Jcc instruction:

mov    ecx,eax
sar    ecx,31          ; all 1's if neg, all 0's if pos
xor    eax,ecx         ; At this point we are one's complement
sub    eax,ecx         ; n(1)=n+1 = two's complement

So you see, we did an ABS() function without any Jcc instructions; just a sleight of hand using general-purpose instructions. Admittedly this technique will not work on everything, but it will help in your optimizations. In Chapter 11 we will go into more detail.

Destination addresses of a jump should be code aligned to take advantage of the instruction prefetch.

align   16

Of course the Align statement must not be in the code flow, as unknown bytes are added to align the code, So it must always occur outside a function, thus after a JMP or RET statement. Another alignment type would be for fine-tuning on a byte-by-byte basis such as:

org   $+3

which aligns by moving the origin pointer from the $ current location by three bytes, effectively adding three unknown bytes. Typically you will find it paired with an alignment to create a fixed alignment point. This prevents any alignment of a previous function affecting the alignment of a following function.

align  16 
org     $+3

Inside the code flow you can add nondestructive instructions such as the following to help align your code. Your flags may be altered but not the registers.

nop              ; 1 byte
      mov    eax,eax   ; 2 bytes

66h

mov    ax,ax     ; 3 bytes

Let's start by examining a C type strlen() function designed to find the number of bytes in a zero- terminated ASCII string:

uint strlen(char *p)
{
    uint cnt = 0;
   
    while (*p++)
        cnt++;
   
    return cnt;
}

Let's try that in assembly and align the code to a 16-byte boundary:

align 16

00000000   54               push   ebp
00000001   8B EC            mov    ebp,esp
00000003   8B 54 24 08      mov    edx,[ebp+arg1] ; String
   
00000007   B8 FFFFFFFF      mov    eax,1
   
0000000C   40         $L1:  inc    eax
0000000D   8A 0A            mov    cl,[edx]       ; Get a character
0000000F   42               inc    edx
00000010   84 C9            test   cl,cl
00000012   75 F8            jnz    $L1            ; 0 terminator?
   
00000014   5D               pop    ebp
00000015   C3               ret                   ; return eax (length)
00000016

As you have probably noted, the entire function occupies 16h (22) bytes (000h..016h). It also has an efficiency problem. The 16-byte prefetch is first loaded with address 0000h000fh, executes up to and including the "inc edx" and then reloads the prefetch with the next 16 bytes, address 0010h001fh. The code then executes up to address 0013h and then the prefetch has to be reloaded with 0000h again. This continues over and over again until the zero terminator is encountered. Now if we tweak the alignment a tad we can contain this $L1 loop within one prefetch:

align 16
                            org $+4
   
00000004   54               push   ebp
00000005   8B EC            mov    ebp,esp
00000007   8B 54 24 08      mov    edx,[ebp+arg1] ; String
   
0000000B   B8 FFFFFFFF      mov    eax,1
   
00000010   40         $L1:  inc    eax
00000011   8A 0A            mov    cl,[edx]       ; Get a character
00000013   42               inc    edx
00000014   84 C9            test   cl,cl
00000016   75 F8            jnz    $L1            ; 0 terminator?
   
00000018   5D               pop    ebp
00000019   C3               ret                   ; return eax (length)
0000001A

You will now note that the beginning of the function actually starts on address 0004h, but the beginning of the loop is now aligned perfectly on a 16-byte boundary, allowing the entire loop to be contained with a single instruction prefetch load. This is a very old and simple alignment trick that is still usable on the newer processors.