10.5 Modulo Scheduling a Loop | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

In Chapter 5, we compared two programs, DOTLOOP and DOTCLOOP, the latter of which used the Itanium loop count register. Here we show how the same basic code fragment can be analyzed for modulo scheduling.

VLIW processor designs did not attain commercial success, in part, because software had to be recompiled for each new implementation. This section demonstrates two different Itanium instruction schedules that develop two different EPIC solutions to the obstacles of implementation dependency that exist in pure VLIW architectural designs.

10.5.1 DOTCTOP: Implementation-Independent Schedule

In this section, we develop a pipelined instruction schedule based upon an idealized, convenient fiction: We pretend that every Itanium instruction would exhibit unit latency. At the end of this section, we analyze the potentially adverse consequences stemming from more realistic assessments of instruction latencies.

Assignment of rotating general registers

Conveniently assuming universal unit latencies, we can lay out a software pipeline to compute the dot product of two three-component vectors in the following way:

 cycle  stage:      0         1       2        3 0              ld2 V1,W1 1              ld2 V2,W2   V1*W1 2              ld2 V3,W3   V2*W2   sxt(1) 3                          V3*W3   sxt(2)   add(1) 4                                  sxt(3)   add(2) 5                                           add(3)

where the stages trace the passage of data in the algorithm from DOTCLOOP (Figure 5-2).

In order to make specific assignments of rotating registers, we must consider the number of stages throughout which every quantity is "live." First, notice that the value of V1 is written into a destination register by an ld2 instruction during stage 0, but must be retained as a source operand for the pmpy2.r instruction in stage 1. Hence V1 must be allocated two registers say r32 and r33 so that loading V2 in the next cycle will not overwrite V1 before the latter is used in the multiplication.

Second, the product V1*W1 formed during stage 1 must be captured into a destination register and retained for use as a source operand for the sxt4 instruction in stage 2. Although this quantity is also "live" through two pipeline stages, we can reallocate r33 as one of the two registers required for it, since V1 will not be needed again, and allocate r34 as the other required register.

Third, the result from sign extension formed during stage 2 must be captured into a destination register and retained for use as a source operand for the add instruction in stage 3. Although this quantity is also "live" through two pipeline stages, we can reallocate r34 as one of the two registers required for it, since the product V1*W1 will not be needed again, and allocate r35 as the other required register.

Fourth, the addition in stage 3 should not be allocated any rotating registers, since the purpose of that instruction is to accumulate the partial products. The original assignment of register r20, which is not part of the rotating register set, can be carried over from the nonpipelined algorithm.

Finally, we must provide storage for W1 that will not conflict with any of the preceding requirements, where we traced V1 and related computed quantities through all of the pipeline stages. Since W1, like V1, is "live" through two pipeline stages, it must be allocated two rotating registers, like r36 and r37.

Everything just said about V1 and W1 should apply to V2 and W2, and then to V3 and W3, in turn. Therefore we conclude that this approach to a software pipeline for the loop in question requires a total of six rotating registers.

Assignment of rotating predicate registers

This particular loop body would attempt to attain simultaneous activity in as many as three software pipeline stages during some operational cycles, in principle. This analysis suggests that the instructions for the four pipeline stages should be predicated using the four predicate registers p16, p17, p18, and p19.

Recall that the loop count register ar.lc should be 0 on the cycle at the end of which predicate register p63 will first be set to 0, in order to make register p16 be 0 for the subsequent cycle. Generally that means that the ar.lc register should be initialized to one less than the number of traversals of the programmer's loop, which is the same rule as for a program loop controlled by a br.cloop instruction (Section 5.6).

The epilog count register ar.ec should be initialized to 4 in this illustration because this pipeline has four stages to be drained:

 cycle    ar.lc  ar.ec  p16    p17    p18    p19   0        2      4     1      0      0      0   1        1      4     1      1      0      0   2        0      4     1      1      1      0   3        0      3     0      1      1      1   4        0      2     0      0      1      1   5        0      1     0      0      0      1

That is, the epilog counter is designed to stop on the value 1, but the loop counter is designed to stop on the value 0. The total number of software pipeline cycles is ar.lc + ar.ec, here 2 + 4 = 6.

Adaptation of the instructions from DOTCLOOP

If we express the loop with Itanium instructions in the same order as in the DOTCLOOP program (Figure 5-2), using the rotating register assignments developed above, the gcc assembler produces three instruction bundles:

 top:   (p16)  ld2     r32 = [r14],2      // M   (p16)  ld2     r36 = [r15],2      // M   (p17)  pmpy2.r r33 = r33,r37      // I          nop.m                      // M   (p18)  sxt4    r34 = r34          // I   (p19)  add     r20 = r20,r35      // I          nop.m                      // M          nop.f                      // F          br.ctop.sptk.few top;;     // B;;

Since early Itanium processor implementations can work with only two instruction bundles during each machine cycle, this instruction schedule would require at least two machine cycles per pipeline cycle, again neglecting additional delays attributable to loading data from memory and actual instruction latencies for a specific processor implementation.

Improvement of instruction bundling

Since most or all of the instructions in a software-pipelined loop are predicated, some explicit stops are no longer required. Although Itanium assemblers do not change the programmer's order of machine instructions, compilers frequently do so during optimization. Here, we can play the compiler's role by interchanging the add and sxt4 instructions in the source. Then the gcc assembler can condense the loop body into two bundles instead of three. Whereas the sxt4 instruction can only be executed on an I-unit in early Itanium processors, the add instruction can be directed to an M-unit:

 top:   (p16)  ld2     r32 = [r14],2    // M   (p16)  ld2     r35 = [r15],2    // M   (p17)  pmpy2.r r33 = r33,r37    // I   (p19)  add     r20 = r20,r35    // M   (p18)  sxt4    r34 = r34        // I          br.ctop.sptk.few top;;   // B;;

This approach results in savings of program code, and thus I-cache usage. It also saves execution time, because an Itanium 2 processor has four M-units. The stop after the branch instruction suffices to enforce proper logical grouping from cycle to cycle.

Does this new loop improve upon the estimated performance of the DOTCLOOP program, where we counted seven machine cycles for each of three traversals? If we now recognize the minimum latency of 2 (not 1) for a load instruction that feeds into the pmpy2.r instruction, as well as the latency of 3 (not 1) for the latter, we must realize that the software pipeline will involve one additional machine cycle for pipeline cycle 0 and two additional cycles for each of pipeline cycles 1, 2, and 3. These considerations suggest that the overall minimum number of machine cycles on an Itanium 2 processor would be 6 + 1 + 3 x 2 = 13, not six, compared to 7 x 3 = 21 for the traditional counted loop in DOTCLOOP.

This analysis shows that Itanium hardware support for modulo scheduling can enhance performance significantly, even for a small loop body traversed only a few times. Unlike the situation with traditional loop unrolling, here the size of the loop body itself did not increase, although a moderate amount of additional overhead outside the loop is required for initializations. Figure 10-3 presents the complete DOTCTOP program, incorporating all of the features discussed in this section.

The DOTCTOP program produces the correct answer (0x14 = 20) when tested on systems based on the initial Itanium processor implementation, which would incur split issue and other penalties beyond the factors we identified for an Itanium 2 processor.

Figure 10-3 DOTCTOP: Illustrating modulo scheduling of a loop

 // DOTCTOP          Scalar Product of N-vectors // This program will compute the scalar product // of two multielement vectors V and W.         N        = 3              // N = dimensionality         .data                     // Declare storage         .align   8                // Desired alignment P:      .skip    8                // Space for product V:      data2    -1,+3,+5         // V1, V2, V3, etc. W:      data2    -2,-4,+6         // W1, W2, W3, etc.         .text                     // Section for code         .align   32               // Desired alignment         .global  main             // These three lines         .proc    main             //  mark the mandatory main:                             //   'main' program entry         .prologue                 // Leaf procedure can save         .save    ar.lc, r9        //  the caller's ar.lc         mov      r9 = ar.lc;;     //   in a scratch register         .body                     // Now we really begin... first:  alloc    r10 = ar.pfs,0,8,0,8 // 8 rots         movl     r14 = V          // Pointer for V         movl     r15 = W          // Pointer for W         movl     r16 = P          // Pointer for P         mov      r20 = 0          // r20 = running sum         mov      ar.lc = N-1      // Traversals minus one         mov      ar.ec = 4        // Rotational stages         mov      pr.rot = 0x10000;; // Initialize predicates top:   (p16) ld2      r32 = [r14],2    // Get Vi; bump pointer   (p16) ld2      r36 = [r15],2    // Get Wi; bump pointer   (p17) pmpy2.r  r33 = r33,r37    // Compute Vi times Wi   (p19) add      r20 = r20,r35    // Update sum, after   (p18) sxt4     r34 = r34        //  extension to 64 bits         br.ctop.sptk.few top;;    // More to do?         st8      [r16] = r20      // No, store the product done:   mov      ret0 = 0         // Signal all is normal         mov      ar.lc = r9       // Restore caller's ar.lc         mov      ar.pfs = r10     // Restore caller's ar.pfs         br.ret.sptk.many b0;;     // Back to command line         .endp    main             // Mark end of procedure

Comment about comments

Note that we reworded the comments for the interchanged add and sxt4 instructions for legibility. In more general circumstances, this will be very difficult, and program documentation should present as block comments the version that corresponds more closely to the programmer's algorithm.

10.5.2 DOTCTOP2: Itanium 2 Processor Schedule

We now explore an alternative instruction schedule for a software-pipelined loop that explicitly recognizes the characteristics of the Itanium 2 processor, which exhibits a minimum latency of 2 (not 1) for integer loads from L1 cache when the data are to be handled next by a multimedia instruction, such as pmpy2.r, which itself exhibits a latency of 3 (not 1).

Assignment of rotating general registers

We can lay out a software pipeline for computing the dot product of two three-element vectors in the following way:

 cycle  stage:  0     1     2     3     4     5     6   0         ld2(1)   1         ld2(2)   2         ld2(3)       pmpy(1)   3                      pmpy(2)   4                      pmpy(3)   5                                        sxt(1)   6                                        sxt(2)  add(1)   7                                        sxt(3)  add(2)   8                                                add(3)

where stage 1 represents the extra latency when a load feeds a multimedia instruction and stages 3 and 4 reflect the extra latency of the pmpy2.r instruction used in the DOTCLOOP program (Section 5.6).

With this schedule, the value of each element of V is "live" from stage 0 to stage 2, thus requiring an allocation of three rotating registers, such as r32, r33, and r34. In a similar fashion, each product of corresponding elements of V and W is "live" from stage 2 to stage 5, but we can reallocate register r34 as the destination in stage 2 and thus need only a net allocation of three additional rotating registers, such as r35, r36, and r37. The result from the sign extension operation is "live" from stage 5 to stage 6, but we can reallocate register r37 as the destination in stage 5 and thus need only the net allocation of register r38. Finally, three more registers say r39, r40, and r41 must be allocated to accommodate nonconflicting storage for W in stages 0 through 2. Hence, ten rotating registers are needed in all.

Assignment of rotating predicate registers

This particular loop body only attains simultaneous activity sometimes, in two useful software pipeline stages during any operational cycle. This analysis suggests that the instructions for the four productive pipeline stages should be predicated using registers p16, p18, p21, and p22. Nothing useful appears to be accomplished at times corresponding to predicate registers p17, p19, and p20.

Recall that the loop count register ar.lc should be 0 on the cycle at the end of which predicate register p63 will first be set to 0 in order to make register p16 be 0 for the subsequent cycle. Generally that means that the ar.lc register should be initialized to one less than the number of traversals of the programmer's loop, which is the same rule as for a program loop controlled by a br.cloop instruction (Section 5.6).

The epilog count register ar.ec should be initialized to 7 in this illustration because this pipeline has seven stages to be drained:

 cycle   ar.lc   ar.ec   p16   p17   p18   p19   p20   p21  p22   0       2       7      1     0     0     0     0     0    0   1       1       7      1     1     0     0     0     0    0   2       0       7      1     1     1     0     0     0    0   3       0       6      0     1     1     1     0     0    0   4       0       5      0     0     1     1     1     0    0   5       0       4      0     0     0     1     1     1    0   6       0       3      0     0     0     0     1     1    1   7       0       2      0     0     0     0     0     1    1   8       0       1      0     0     0     0     0     0    1

That is, the epilog counter is designed to stop on the value 1, but the loop counter is designed to stop on the value 0. The total number of software pipeline cycles is usually ar.lc + ar.ec, here 2 + 7 = 9.

Optimized instruction order

Nothing has changed with the instructions themselves. Thus the bundles of instruction and templates can be almost the same as shown before:

 top:   (p16)  ld2     r32 = [r14],2    // M   (p16)  ld2     r39 = [r15],2    // M   (p18)  pmpy2.r r34 = r34,r41    // I   (p22)  add     r20 = r20,r38    // M   (p21)  sxt4    r37 = r37        // I          br.ctop.sptk.few top;;   // B;;

where again we put add ahead of sxt4 in order to optimize the code into two bundles instead of three.

Figure 10-4 presents the complete DOTCTOP2 program, incorporating all of the elements discussed in this section.

Figure 10-4 DOTCTOP2: Illustrating a different modulo scheduling of a loop

 // DOTCTOP2          Scalar Product of N-vectors // This program will compute the scalar product // of two multielement vectors V and W.         N        = 3              // N = dimensionality         .data                     // Declare storage         .align   8                // Desired alignment P:      .skip    8                // Space for product V:      data2    -1,+3,+5         // V1, V2, V3, etc. W:      data2    -2,-4,+6         // W1, W2, W3, etc.         .text                     // Section for code         .align   32               // Desired alignment         .global  main             // These three lines         .proc    main             //  mark the mandatory main:                             //   'main' program entry         .prologue                 // Leaf procedure can save         .save    ar.lc, r9        //  the caller's ar.lc         mov      r9 = ar.lc;;     //   in a scratch register         .body                     // Now we really begin... first:  alloc    r10 = ar.pfs,0,16,0,16 // 16 rots         movl     r14 = V          // Pointer for V         movl     r15 = W          // Pointer for W         movl     r16 = P          // Pointer for P         mov      r20 = 0          // r20 = running sum         mov      ar.lc = N-1      // Traversals minus one         mov      ar.ec = 7        // Rotational stages         mov      pr.rot = 0x10000;; // Initialize predicates top:   (p16) ld2      r32 = [r14],2    // Get Vi; bump pointer   (p16) ld2      r39 = [r15],2    // Get Wi; bump pointer   (p18) pmpy2.r  r34 = r34,r41    // Compute Vi times Wi   (p22) add      r20 = r20,r38    // Update sum, after   (p21) sxt4     r37 = r37        //  extension to 64 bits         br.ctop.sptk.few top;;    // More to do?         st8      [r16] = r20      // No, store the product done:   mov      ret0 = 0         // Signal all is normal         mov      ar.lc = r9       // Restore caller's ar.lc         mov      ar.pfs = r10     // Restore caller's ar.pfs         br.ret.sptk.many b0;;     // Back to command line         .endp    main             // Mark end of procedure

The DOTCTOP2 program produces the correct answer (0x14 = 20) when tested on systems based on the initial Itanium processor implementation, which would incur split issue and other penalties beyond the factors we identified for an Itanium 2 processor.

10.5.3 Further Considerations

We have shown several approaches to a simple program loop. DOTCLOOP (Figure 5-2) used traditional programming, and would require at least 21 machine cycles on an Itanium 2 processor for the three traversals of the programmer's loop. DOTCTOP (Figure 10-3), based on an approach hypothesizing presently unrealistic unit latencies for all operations and implying an optimistic minimum of six cycles, would actually require at least 13 machine cycles on an Itanium 2 processor. DOTCTOP2 (Figure 10-4), based on known latencies for an Itanium 2 processor, would require at least nine machine cycles on that implementation. This improvement of DOTCTOP2 comes at the cost of implementation-specific analysis at compile time and the use of more rotating registers.

The analysis just presented in Sections 10.5.1 and 10.5.2 follows guidelines for the Itanium 2 processor implementation. Other implementations may vary in numbers of execution units and their associated capabilities and latencies. VLIW processor designs did not attain commercial success because software had to be recompiled for any new implementation.

We tested both DOTCTOP and DOTCTOP2 on systems containing the original Itanium processor, obtaining the correct result with both. Because it has fewer, and less versatile, type M execution units, such a processor requires more machine cycles than an Itanium 2 processor to execute the software pipelined loop. For instance, split issue in the MIB bundle would result in at least 2 x 9 = 18 machine cycles overall for the loop in DOTCTOP2. Both programs should execute, though perhaps not optimally, on any other implementation conforming strictly to the Itanium architecture. This fulfills the EPIC promise to circumvent the recompilation requirement of VLIW in order to obtain valid program behavior.

The DOTCTOP program offers some potential advantages over DOTCTOP2. It should run more quickly than the DOTCTOP program on future implementations that impose fewer latencies. It requires fewer rotating predicate registers on any implementation. Moreover, it seems easier to comprehend in relation to fundamental precepts of the Itanium architecture because its form is independent of any specific implementation. Yet it is not necessarily as fast as an implementation-specific tuning, as we showed with DOTCTOP2.

The possibilities for constructing modulo-scheduled loops are both varied and subtle, especially if advanced and/or speculative loads are also used in such loops. Further details may be found in "Application Architecture," volume 1 of the Intel Itanium Architecture Software Developer's Manual.