5.3 Program Branching | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

The Itanium architecture gives programmers and compilers control over logical program flow through predication of a basic branch instruction. The value set into the Itanium predicate registers by compare instructions can predicate most types of instructions, determining which are executed. (Exceptions are flagged with the symbol in the listings of Itanium instructions in Appendix C.)

In the sample programs already presented, we have shown one form of an unconditional branch instruction (br.ret.sptk.many) just below the label done. This form is technically not a branch; instead, it performs a function return, something that is explained more fully in Chapter 7.

5.3.1 Ordinary Branch Instructions

Under the generic rubric of branch instructions, the Itanium ISA unifies a quite diverse collection of capabilities: unconditional branching, conditional branching, calling of procedures or functions, returning from procedures or functions, accelerating loop control, and accessing the included set of IA-32 instructions. Here we begin by introducing the conditional and unconditional branch instructions, which in their simplest form behave as their counterparts in most other architectures.

The assembler syntax for ordinary Itanium branch instructions can take several forms that involve numerous completers:

 (qp)  br.brtype.bwh.ph.dh target25    // IP-relative       br.ph.dh target25           // pseudo-op (unconditional) (qp)  brl.brtype.bwh.ph.dh target64   // IP-relative (qp)  br.btype.bwh.ph.dh  b2      // indirect       br.ph.dh b2                 // pseudo-op (unconditional)

where the target (destination) address can be handled using either relative addressing with the current value of the instruction pointer or indirect addressing with a branch register (Appendix D.4). For relative addressing, a programmer can specify the target25 or target64 destination using a symbolic label on the destination instruction bundle. Section 5.3.5 contains more information about the branch addressing range.

The comprehensive list of choices for brtype (the branch type completer) for use with Itanium branch instructions includes: none at all, cond, call, ret, ia, cloop, ctop, cexit, wtop, and wexit. Although none at all is synonymous with cond, we shall use cond explicitly for conditional branches where the branch instruction also has a qualifying predicate (qp).

Notice that there are two pseudo-op forms for unconditional branching, where the assembler will insert the permanently true Pr₀ as the qualifying predicate.

The comprehensive list of choices for bwh (the branch whether hint) for use with Itanium branch instructions includes: spnt, sptk, dpnt, and dptk. Software can attempt to predict (p) how frequently a particular branch would be taken (tk) or not taken (nt) statically (s), or can leave such prediction to be done dynamically (d) by the hardware. Utilization of these hints is implementation-dependent, and thus the hint sptk (static taken) is available to indicate that the branch should always be taken with no other considerations.

The handling of branch prediction is a very active area of research in the fields of computer architecture and processor implementation because misprediction can enormously degrade performance for pipelined architectures and instruction cache subsystems.

The choices for ph (the prefetch hint completer) include: none at all, few, and many. None at all is synonymous with few. This hint suggests how many lines might be prefetched into the I-cache, beginning with the destination bundle.

The choices for dh (the branch cache deallocation hint) include: none at all and clr. An implementation may contain a small dedicated cache of recently used branch target addresses. This hint relates to whether that cache should be left alone, or may as well be flushed since the software anticipates no longer needing those addresses.

5.3.2 Timing Considerations for Branches

With perhaps only the notable exception of Alpha architecture, which manages to calculate the branch target address within one instruction cycle, RISC branch instructions require more than one instruction cycle, even if the branch is going to fall through. The additional time required to calculate the branch target address exceeds the time for typical instructions, and the designers then choose to treat branches in special ways, so as not to impose a performance penalty on the entire instruction set.

All branches taken incur time penalties, with the best case being that the instructions at the target address are already in the I-cache. The minimum penalty then stems from the effort of having already decoded and launched instructions below the branch into the pipelines of the execution units, since such work must be undone if it affects the machine state. At worst, the future instructions at the target address will have to be retrieved from slower memory i.e., from outer levels of cache, main memory, or even virtual memory (disk).

The Itanium architecture introduces many features that improve branch performance. Special circuit designs allow a predicated branch instruction to execute in a B execution unit fully in parallel with an immediately prior compare instruction that is computing the logical predicate for the branch in an I or M execution unit. The time required to compute the branch target address has been made comparable to the time required to perform the comparison, up to the point where the B unit can peek at the determined predicates in the I or M unit. This is referred to as zero latency between the compare and branch instructions.

Both instructions finish slightly later within the same instruction cycle. The compare instruction writes its outcome into predicate registers for potential subsequent use. The branch either falls through or takes the newly dictated path by putting the target address into the IP register.

As a consequence of this design, we can and should omit the stop (;;) between a compare instruction and a related branch instruction. No such accelerated treatment occurs for other uses of predicates determined by compare operations, and a stop is usually needed, as pointed out in the next section.

5.3.3 If…Then…Else Structures

From your previous studies of programming, you will appreciate the widely useful capability of the if…then…else construct for imparting structure to programs.

Typical high-level languages

The structure can be entered from prior code in only one way, and the entire structure has only one logical exit. In its interior, however, it supports two separate and arbitrarily long blocks of code that may be entirely sequential or in turn contain embedded structures that could be similarly analyzed and described. One of those two blocks is to be executed only for one outcome of the logical test (if), while the other is to be executed only for the opposite outcome. These ideas can be expressed in pseudocode:

 <entrance from prior code> if (ra rel rb) then     <do THEN block> else     <do ELSE block> endif <exit to subsequent code>

where we may imagine ra and rb to be the quantities in two general registers and rel to be a compare relationship (crel) supported by Itanium compare instructions.

Assembly language

In something closer to Itanium assembly language, we can express the same pseudocode:

 enter:    <from prior code>           cmp.crel pt,pf = ra,rb    // Preds for a rel b           (pf) br.cond do_else;;    // Skip THEN block do_then:  <do THEN block>;;           br end_if;;               // Skip ELSE block do_else:  <do ELSE block>;; end_if:   <and continue with subsequent code>

where we intend pt and pf to be predicate registers set for the true and false outcomes of the logical relationship. Sometimes we do not actually need both predicates; the permanently true predicate register p0 can be placed into either position on the left of the equal sign.

When the logical relationship is true, the conditional first branch will fall through (since pf will be 0), the THEN block of code will execute, and the unconditional branch will skip over the ELSE block. When the logical relationship is false, the conditional first branch will be taken (since pf will be 1) to skip over the THEN block and instead execute the ELSE block.

There is the net effect (and time penalty) of taking one branch instruction regardless of the logical outcome leading to either the THEN block or the ELSE block. Important: A branch to end_if at the bottom of the ELSE block would be superfluous, time-consuming, and therefore unwarranted; normal sequential instruction fetching will be the right course of action there.

Assembly language, using predication

With the Itanium architecture, the overwhelming majority of instruction types can be predicated. This makes possible the elimination of both branch instructions:

 enter:  <from prior code>         cmp.crel pt,pf = ra,rb;; // Preds for a rel b         (pt) <do THEN block>     // These can be         (pf) <do ELSE block>     //  interleaved end_if: <and continue with subsequent code>

The predication of THEN and ELSE block instructions in opposite ways can have significant performance benefits, since the first two Itanium processors can fetch and potentially execute as many as six instructions at the same time. Predication is at the heart of EPIC principles.

Here a stop (;;) isolates the compare instruction into a different group from the predicated instructions. The compare must complete its execution before the predicates are used. Otherwise, without an explicit stop, the CPU might execute the compare simultaneously with the predicated instructions, using stale values for their qualifying predicates.

With explicitly predicated execution, the CPU "goes through the motions" of executing both streams of instructions simultaneously. The predicate register values will determine which instruction stream actually has an effect by writing results into destination registers or memory. Any CPU at the digital logic level uses enabling signals to control the writing at the proper point in time; thus, predication is simply another precondition to writing data.

Sometimes one may want a few instructions to occur identically in both a THEN block and an ELSE block i.e., to execute regardless of the logical outcome of the IF condition. With traditional architectures, these instructions have to be replicated in both blocks. Alternatively, the if…then…else structure could be broken into two or more such structures appearing above and below the common instructions, with the same IF condition replicated for the second structure. With the Itanium ISA, unpredicated "whether true or false" instructions can simply be interleaved at desired positions in a sequence along with the variously predicated instructions.

Few instructions per THEN and ELSE block

If there are only a few instructions in the THEN and ELSE blocks, the software can remove as many stops (;;) as possible, subject to any other dependencies involving source values from preceding instructions. In order to fill out bundles with useful instructions rather than no-ops, there can be some interleaving of instructions belonging to the THEN and ELSE blocks. Such interleaving renders the program harder for a person to read, though one can still see that every THEN-related instruction is predicated with one predicate register (pt in our example) and every ELSE-related instruction with another (pf).

Many instructions per THEN and ELSE block

We know that branch instructions are costly in terms of performance. Yet two lengthy blocks passing through the CPU, where only one of them will have constructive consequences, effectively cuts overall throughput in half. Somewhere, a crossover in performance exists between full predication in parallel and a traditional implementation with two simple branch instructions. Locating that crossover, in terms of the numbers of instructions in the THEN and ELSE blocks, would depend on implementation-specific factors, such as assumptions about cache-related effects for instructions and data.

Unbalanced THEN and ELSE blocks

Severely unbalanced lengths of THEN and ELSE blocks present a problem, especially if the most probable logical outcome goes with the shorter block. In that situation, we would be forcing the CPU to wallow through many instructions that do not have constructive effect (predicated false) in order to ensure that some few instructions do have full effect (predicated true). It may thus be better to utilize a traditional structure biased to execute the short ELSE block in most cases. The execution time will reflect the dominance of only those few instructions and one branch taken.

Null ELSE block

In the limit where one block is empty or nearly so, the logical comparison might still be needed for its side-effect(s) e.g., in a C program. This could be described as an if…then structure, where one would predicate the contents of a short block of code following THEN but use a conditional branch if that block were very long.

5.3.4 Loop Structures

Major types of loop structures include those where an index variable counts down (or up) to some limit, those where data are accessed until an address pointer reaches some limit, and those with an arbitrary logical stopping condition situated at the top (while cond do …) or bottom (do … until cond). Here we outline the general principles of loop design using predicated conditional branches.

Loop controlled by a counter

Although counters can run either up or down, the down-counter has advantages with most instruction sets. We can formulate a counted loop this way:

         <enter loop with rc = number of traversals> again:  <instructions of the loop body>         add     rc = -1,rc;;       // Count down         cmp.eq  p0,pf = rc,0       // Reached zero yet?         (pf) br.cond.sptk again;;  // No, more to do         <falling through>          // Yes, all done

where we need to assign only one predicate register pf. Assuming many traversals of the loop, we provide a static prediction hint (sptk) that the branch will be taken. Note that a down-counted loop can also be viewed as a do … until loop with the condition being the test of the decremented counter value against zero.

Loop controlled by an address limit

Suppose that we need to sum the values in a vector. For convenience, suppose that rh is a pointer to the value at the highest address. We can sum the vector this way:

        <enter loop with rs = 0 and rp --> first datum> again: ld8      rt = [rp],8;;     // Get datum, bump pointer        add      rs = rs,rt;;      // rs = running sum        cmp.gtu  p0,pf = rp,rh     // Past the last datum?        (pf) br.cond.sptk again;;  // No, more to do        <falling through>          // Yes, rs = sum

where we must use an unsigned comparison for addresses. Here too we actually have a special case of a do … until loop, with the condition being that the address pointer has been postincremented past storage for the elements of the vector of quad words. This type of loop tacitly assumes that the list of vector components is non-null.

Loop with test at the top

What if the stopping criterion for a loop may already be true before the first traversal? This situation calls for positioning the test at the top instead of the bottom of the loop. Suppose we want to count the number of negative values in a list, knowing that the list may be empty (null). Using the same addressing scheme as in the previous illustration, where rh points to the last value, we can proceed with:

        <enter loop with rn = 0 and rp --> first datum> again: cmp.gtu   pt,p0 = rp,rh     // Past the last datum?        (pt) br.cond.spnt leave;;   // Yes, exit from loop        ld8       rt = [rp],8;;     // Get datum, bump pointer        cmp.lt    pt,p0 = rt,r0;;   // No, test for negative        (pt) add  rn = rn,rt;;      // rn = running count        br again;;                  // Look for next value leave: <falling through>           // Yes, rn = count

Here we first determine whether rp points to valid data before we allow the loop to use that pointer, lest the ld8 instruction trigger some fault (perhaps because a user mode application attempted to access an unmapped address). We can economize on the allocation of predicate registers in the two compare instructions without ambiguity. Note that we can also provide a static prediction hint that the conditional branch will not be taken.

This loop structure encounters two branches per traversal. The test at the top of the loop incorporates a conditional branch that is not taken until it is time to exit the loop. An unconditional branch at the bottom makes the whole structure into a loop.

5.3.5 Branch Addressing Range

The Itanium IP-relative branch instructions encode a target address into a 21-bit immediate value, which the assembler or software computes as a signed offset from the current value of the instruction pointer. Recall that the instruction pointer uses byte addresses, but that instructions are always fetched as 16-byte bundles. As a consequence, the label for a branch target must refer to a bundle, specifically to the instruction in slot 0 of that bundle. Addresses sent to the I-cache always have the four lowest bits as 0. The software computes a 25-bit signed offset, target25 IP, but then shifts this offset four bit positions to the right to remove the four zeros for encoding as 21 bits.

When the branch instruction is decoded by the CPU hardware, this process is reversed. The 21-bit immediate value is sign-extended, shifted left four bit positions, and added to the IP register. The resulting branch range for the Itanium architecture can thus be expressed in several ways: ±2²⁴ bytes = ±16 MiB, or ±2²⁰ instruction bundles, or almost ±350,000 individual instructions.

With any architecture, the question of comparing the branch range to the complete span of the address space arises. Often there is a jump instruction for truly distant moves away from the current IP value. For example, a jump can be accomplished using the indirect form of Itanium branch instruction (Section 5.3.1). The processor contains eight special branch registers, Br₀ Br₇, each capable of being loaded from a general register with a full 64-bit address using a mov instruction. Recall that we could load a general register with a symbolic 64-bit address. Thus one possibility for preparing for a long-range jump would be:

 movl   r1=dest;;     // get address out of this instruction mov    b1=r1         // copy address into a branch register

While this approach was optimal for the initial Itanium processor, subsequent implementations use a specific instruction instead.

The Itanium 2 processor has hardware support for the brl (branch long) instruction, which encodes a 60-bit immediate offset derived from the target64 value in the same manner as the signed target25 value for regular branches. A brl instruction occupies two slots in an instruction bundle, and supports only none at all, cond, and call as values for btype (the branch type completer).

5.3.6 Locality and Program Performance

Many older CISC architectures had rather limited branch ranges (see Tables 1-3 and 1-4), while RISC architectures tend to combine simple branching and procedure calling into one family of related instructions with much larger branch ranges (see Tables 1-4 and 1-5).

Just because one can construct a program loop extending over many hundreds, or even thousands, of instructions does not mean that one should do so. The Itanium branch range (±16 MiB) far exceeds the 16-KiB size of the primary I-cache. A loop whose body of instructions overflows beyond the capacity of the I-cache will perform more slowly because of the necessary accesses to outer cache levels (L2, L3), main memory, or even virtual memory. The system is said to "thrash" as instructions from later parts of the loop displace instructions from earlier parts, only to be displaced in turn.

Well-written programs benefit from adherence to the principle of locality and its benefits: ease of design, runtime performance, and maintainability. Shorter loops are easier to comprehend and alter. When nonloop branching extends only to nearby addresses, the target instruction may already be in one of the levels of cache or could have been prefetched to be there when needed. When the body of a loop contains only reasonable numbers of instructions, the hardware may be able to retain all of them in I-cache.