10.4 Software-Pipelined Loops | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

The operation of loops within a program bears an analogy to the instruction cycle for a computer at the hardware level. Something changes for each traversal through a loop so that the loop advances and works with different source data, just as the instruction pointer advances to bring different instructions into the central processor for execution through the logical sequence of a program.

The existence of this analogy has instigated efforts to improve the efficiency of program loops by some kind of software pipelining, just as the instruction cycle is amenable to improvement through hardware pipelining. In both situations, resources must be sufficiently abundant and sufficiently independent to enable the construction of an effective pipeline strategy to perform tasks in parallel.

10.4.1 Traditional Loop Unrolling

The advent of certain CISC architectures and especially RISC architectures with generous numbers of processor registers enabled new compiler techniques to enable the transformation of a programmer's expression of a loop into an equivalent form that executes more quickly.

Loop unrolling is the most prevalent such technique. Think of a loop in a FORTRAN or C program that is to run a fixed number of iterations let us say twelve and suppose further that the code in the body of the loop would not exhaust all of the freely available processor registers.

What if there were twice as many registers available as needed? The compiler could unroll the loop twofold. That is, the compiler could duplicate the machine instructions of the body of the loop, using two sets of different registers. Then the double-length loop body should be traversed only half as many times (six). This saves on some overhead operations, such as incrementing or decrementing the loop counter. Moreover, if the machine instructions can be safely rearranged a little, there may also be speedup from hiding load latencies.

What if there were three times as many registers available as needed? The compiler could unroll the loop threefold, using three sets of different registers. The transformed loop body would require only one-third as many traversals (four), but would consume at least three times the storage in memory. More seriously, it would occupy three times the storage in I-cache. In fact, excessive unrolling of a very long original loop might cause very costly thrashing of I-cache contents.

Loop unrolling has other serious drawbacks. If the iteration count is not conveniently factorable in relation to available register resources, a further code expansion may be needed to handle the last few remaining traversals specially. If the iteration count is determined dynamically by the running program, instead of being a constant known at compile time, then only very sophisticated compiler techniques could be expected to devise an unrolling strategy.

10.4.2 Software Pipelining

We saw that a hardware pipeline of N stages takes N cycles of operation to fill completely, and only then does its full efficiency come into play. When a smooth sequence of instructions is broken by a serious resource limitation (e.g., a severe load latency) or by a taken branch, the pipeline may require some number of cycles to flush before it can be refilled.

Because nearly all Itanium instructions can be predicated and the architecture provides for rotating registers, a programmer or compiler can set up modulo scheduling of the instructions in loops. Wherever possible, instructions belonging to different iterations of the loop are designated to execute in parallel, just as substeps belonging to different instructions of a sequence occur in parallel in a hardware pipeline. Modulo scheduling is one form of the more general concept of software pipelining.

Nearly every instruction in a modulo-scheduled loop needs to be predicated. While the pipeline is filling, a few instructions (producers) need to be predicated on, in a phased way, while other instructions (consumers) should remain predicated off temporarily. Conversely, when the pipeline is draining, the producers need to be predicated off while consumers are successively phased off also. The rotation of the predicate registers nicely manages this filling, sustaining, and draining of the software pipeline without requiring code expansion, as demonstrated below.

Additional features of the Itanium architecture used in software pipelining include the loop count (ar.lc) register (Section 5.6), the epilog count (ar.ec) register, and special branch instructions to construct either counted or while loops using register rotation.

10.4.3 Rotating Registers

Predicate registers Pr₁₆ Pr₆₃ and floating-point registers Fr₃₂ Fr₁₂₇ are architecturally designated as rotating register sets. The alloc instruction (Section 7.3.3) can designate any multiple of 8 general registers starting with Gr₃₂ to operate as a rotating register set.

Note: The rotating general register set always starts with Gr₃₂ where the first input parameter in0 would occur in the absence of rotation or where the first local variable loc0 would occur in the absence of both rotation and inputs. This overlap can be averted by designating a larger number of locals equal to the number of rotating registers (a multiple of 8) plus the actual number of locals (plus the number of inputs if those must be copied into locals while register rotation is occurring). The trick is then to refer not to loc0, loc1, etc. but to loc[0+rot], loc[1+rot], etc. for register storage that is not overlapped by the rotating register region.

When one of the special branch instructions is encountered on every loop traversal, each set of rotating registers undergoes implicit renaming by incrementing the register number: Pr₁₆ from the first traversal is now Pr₁₇ and Pr₆₃ becomes the new Pr₁₆ for the second traversal; Fr₃₂ becomes Fr₃₃ while Fr₁₂₇ becomes the new Fr₃₂; and Gr₃₂ becomes Gr₃₃ while Gr_{32+rot 1} becomes the new Gr₃₂. In this way, data held in registers from a particular traversal remain accessible for use in the next traversal, but at incremented register locations. At the same time, the next traversal has fresh registers to work with. We explain this with examples below.

The CPU hardware manages register rotation using a three-part register rename base (rrb) register as part of the current frame marker (cfm). The predicate, floating-point, and general register parts of this rrb are each decremented by execution of special forms of branch instructions that are discussed in Section 10.4.5 below.

Since the effective register name is always the sum of the rrb value and the register number embedded in an Itanium instruction, rotation makes the data in register N appear to move to register N +1 for the next operation cycle of the modulo-scheduled loop. That is, rrb + N during one cycle refers to the same physical register as (rrb 1) + (N + 1) during the next cycle.

10.4.4 Loop Phases

Modulo-scheduled loops can be viewed as having three phases: prolog, kernel, and epilog. These phases are useful constructs in the analysis and scheduling of instructions for a loop.

In the prolog phase, the software pipeline fills. Some instructions in the loop body will still be predicated off.

In the kernel phase, the software pipeline operates at its full production capacity, continuing at each operational cycle to start the work of one new traversal of the traditional nonpipelined loop and to finish the work of some previous traversal of the traditional nonpipelined loop. Most or all instructions in the loop body will be predicated on.

In the epilog phase, the software pipeline drains, no longer starting any new work, but continuing to finish the work of in-progress traversals of the traditional nonpipelined loop. On each cycle, one additional rotating predicate register will be turned off.

For a very small loop body and only a few traversals, there may not be a true kernel phase, as is the situation with the sample program to be presented in Section 10.5 below.

10.4.5 Branch Instructions for Software Pipelines

We have previously discussed the ordinary conditional Itanium branch instructions (Section 5.3.1), the br.cloop instruction (Section 5.6), and the br.call and br.ret instructions (Section 7.5.2). Here we take up the special loop control forms of Itanium branch instructions that operate in conjunction with rotating registers. There are separate forms for counted loops and while loops.

Counted loops

One pair of instructions is designed for use in the software pipelining of counted loops:

 br.ctop.bwh.ph.dh destination     // rotate & go to top;                                   //  else exit when ar.lc=0                                   //   and ar.ec=1 br.cexit.bwh.ph.dh destination    // rotate & fall through;                                   //  else exit when ar.lc=0                                   //   and ar.ec=1

where the choices for bwh (branch whether hint), ph (branch prefetch hint), and dh (branch deallocation hint) are the same as identified previously (Section 5.3.1). The values in the ar.lc and ar.ec registers control how register rotation is performed by these instructions and whether the branch is taken or falls through.

As long as the value in the loop count register ar.lc is greater than 0 (prolog and kernel phases), the br.ctop and br.cexit instructions take these actions: decrement the ar.lc register, leave the ar.ec register alone, set the p63 register to 1, and then rotate all three register sets. Thus the value 1 goes from p63 into p16 by rotational renaming for the next cycle.

After the value in the loop count register ar.lc has reached 0, but while the epilog count register ar.ec is greater than 1 (epilog phase), the br.ctop and br.cexit instructions take these actions: leave the ar.lc register alone, decrement the ar.ec register, set the p63 register to 0, and then rotate all three register sets. Thus the value 0 goes from p63 into p16 by rotational renaming for the next cycle.

At the end of the epilog cycles (generally when ar.lc=0 and ar.ec=1), the direction of flow at the branch instruction changes. That is, a br.ctop instruction now falls through instead of branching back, or a br.cexit instruction now goes to its destination address instead of falling through.

We can summarize these register rotation operations for a counted loop in the following manner:

          ar.lc         ar.ec           p63    rrb prolog   decremented   unchanged        1      decremented kernel   decremented   unchanged        1      decremented epilog   0             decremented      0      decremented

Usually, the registers ar.lc, ar.ec, and p63 are all 0 on the exit path.

While loops

Another pair of instructions is designed for use in the software pipelining of while loops:

 (qp) br.wtop.bwh.ph.dh destination      // rotate & go to top                                         //  exit when qp=0                                         //   and ar.ec=1 (qp) br.wexit.bwh.ph.dh destination     // rotate & fall through                                         //  exit when qp=0                                         //   and ar.ec=1

where the choices for bwh (branch whether hint), ph (branch prefetch hint), and dh (branch deallocation hint) are the same as identified previously (Section 5.3.1). The values in the qualifying predicate register qp and epilog count ar.ec register control how register rotation is performed by these instructions and whether the branch is taken or falls through.

As long as the value of the qualifying predicate register qp is 1 (prolog and kernel phases), the br.wtop and br.wexit instructions take these actions: leave the ar.ec register alone, set the p63 register to 0, and then rotate all three register sets. Thus the value 0 goes from p63 into p16 by rotational renaming for the next cycle.

After the value of the qualifying predicate register qp becomes 0, but while the epilog count register ar.ec is greater than 1 (epilog phase), the br.wtop and br.wexit instructions take these actions: decrement the ar.ec register, set the p63 register to 0, and then rotate all three register sets. Thus the value 0 goes from p63 into p16 by rotational renaming for the next cycle.

We can summarize these register rotation operations for a while loop in the following manner:

          qp       ar.ec            p63     rrb prolog   1        unchanged         0      decremented kernel   1        unchanged         0      decremented epilog   0        decremented       0      decremented

Usually, the registers qp, ar.ec, and p63 are all 0 on the exit path.