5.6 DOTCLOOP: Using the Loop Count Register | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

Counted loops (DO in FORTRAN, for in C) are common, frequently nested structures in application programming. Efficient handling of innermost loops is highly important for good performance.

The Itanium architecture provides specialized facilities to implement these loops efficiently using the application register ar.lc (the loop count register) and the branch instruction br.cloop. The ar.lc register is a required member of a set of as many as 128 application registers in addition to the general, predicate, branch, and floating-point registers already mentioned. Appendix D describes the many types of Itanium processor registers, while Appendix C lists the many varieties of branch instructions.

Before the top of a loop body, the ar.lc register is initialized to one less than the total number of intended traversals through the loop. Then the br.cloop instruction at the bottom of the loop body tests the value in the ar.lc register against zero after each traversal. Unless it has already reached zero, the ar.lc register is decremented and the branch back to the first instruction in the loop body is taken; otherwise, the branch falls through after the final traversal has occurred with ar.lc=0.
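The countdown just described can be modeled in C. This is only a sketch of the behavior as stated above, not Itanium code: the names br_cloop and count_traversals are invented for illustration, and the variable ar_lc merely stands in for the ar.lc register.

```c
#include <assert.h>
#include <stdint.h>

/* Model of br.cloop (illustrative name): returns 1 when the branch
   back to the top is taken, 0 when it falls through. */
static int br_cloop(uint64_t *ar_lc)
{
    if (*ar_lc != 0) {
        (*ar_lc)--;          /* not yet zero: decrement and branch back */
        return 1;
    }
    return 0;                /* already zero: fall through */
}

/* Counts how many times the loop body runs when n >= 1 traversals
   are intended and ar.lc is initialized to n - 1. */
static uint64_t count_traversals(uint64_t n)
{
    assert(n >= 1);          /* the bottom test cannot give zero traversals */
    uint64_t ar_lc = n - 1;  /* one less than the intended count */
    uint64_t traversals = 0;
    do {
        traversals++;        /* the loop body */
    } while (br_cloop(&ar_lc));
    return traversals;
}
```

With this model, count_traversals(3) runs the body three times, with ar_lc holding 2, 1, and 0 on successive tests at the bottom of the loop.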

Note that with the test at the bottom of the loop, the instructions in the loop body are always executed at least once before any testing occurs. A compiler (or programmer) must insert an additional test ahead of the loop if the body should be skipped entirely when the requested count is zero.
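One way to picture the additional test is the following C sketch. It is a model under stated assumptions, not compiler output: the function names are hypothetical, and the variable lc stands in for ar.lc, which is assumed here to be 64 bits wide so that initializing it from a count of zero would wrap to the largest possible value.

```c
#include <assert.h>
#include <stdint.h>

/* Bottom-tested counted loop; meaningful only for n >= 1. */
static uint64_t bottom_tested_loop(uint64_t n)
{
    assert(n >= 1);              /* n == 0 would wrap lc to 2^64 - 1 */
    uint64_t lc = n - 1;         /* models the initialization of ar.lc */
    uint64_t traversals = 0;
    for (;;) {
        traversals++;            /* the loop body */
        if (lc == 0)
            break;               /* br.cloop falls through */
        lc--;                    /* br.cloop decrements and branches back */
    }
    return traversals;
}

/* Adds the extra test ahead of the loop so that a count of zero
   skips the body altogether. */
uint64_t guarded_loop(uint64_t n)
{
    if (n == 0)
        return 0;                /* the additional test ahead of the loop */
    return bottom_tested_loop(n);
}
```

Here guarded_loop(0) performs no traversals, while any nonzero count behaves exactly like the unguarded loop.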

Figure 5-2 presents the program DOTCLOOP, based on the prior DOTLOOP, showing the use of the ar.lc register in conjunction with a br.cloop instruction. Note that a branch instruction with the cloop completer may never be predicated.

Figure 5-2 DOTCLOOP: Using the Itanium loop count register

 // DOTCLOOP      Scalar Product of N-vectors
 // This program will compute the scalar product
 // of two multielement vectors V and W.
          N       = 3              // N = dimensionality
          .data                    // Declare storage
          .align  8                // Desired alignment
 P:       .skip   8                // Space for product
 V:       data2   -1,+3,+5         // V1, V2, V3, etc.
 W:       data2   -2,-4,+6         // W1, W2, W3, etc.
          .text                    // Section for code
          .align  32               // Desired alignment
          .global main             // These three lines
          .proc   main             //  mark the mandatory
 main:                             //   'main' program entry
          .prologue                // Leaf procedure can save
          .save   ar.lc, r9        //  the caller's ar.lc
          mov     r9 = ar.lc;;     //   in a scratch register
          .body                    // Now we really begin...
 first:   movl    r14 = V;;        // Pointer for V
          movl    r15 = W;;        // Pointer for W
          movl    r16 = P;;        // Pointer for P
          mov     r20 = 0          // R20 = running sum
          mov     r17 = N-1;;      // Number of traversals
          mov     ar.lc = r17      //  minus one
 top:     ld2     r21 = [r14],2    // Get Vi; bump pointer
          ld2     r22 = [r15],2;;  // Get Wi; bump pointer
          pmpy2.r r21 = r21,r22;;  // Compute Vi times Wi
          sxt4    r21 = r21;;      // Extend 32 bits to 64
          add     r20 = r20,r21    // Update the sum
          br.cloop.sptk.few top;;  // More to do?
          st8     [r16] = r20;;    // No, store the product
 done:    mov     r8 = 0;;         // Signal all is normal
          mov     ar.lc = r9;;     // Restore caller's ar.lc
          br.ret.sptk.many b0      // Back to command line
          .endp   main             // Mark end of procedure
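For readers who think more readily in C, the computation in Figure 5-2 can be sketched as follows. This is a model of the arithmetic only, under stated assumptions: the function names are invented, lc stands in for ar.lc, and the pmpy2.r and sxt4 steps are represented as a signed 16-by-16-bit multiply giving a 32-bit product that is then widened to 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define N 3                                  /* dimensionality, as in the figure */

static const int16_t V[N] = { -1, +3, +5 };  /* same data as V: in the figure */
static const int16_t W[N] = { -2, -4, +6 };  /* same data as W: in the figure */

/* Scalar product of two n-element vectors, n >= 1, using a
   bottom-tested countdown in the style of br.cloop. */
int64_t dot_product(const int16_t *v, const int16_t *w, uint64_t n)
{
    assert(n >= 1);
    int64_t sum = 0;                         /* r20, the running sum */
    uint64_t lc = n - 1;                     /* ar.lc = N-1 */
    for (;;) {
        int32_t prod = (int32_t)*v++ * (int32_t)*w++;  /* ld2, ld2, pmpy2.r */
        sum += (int64_t)prod;                /* sxt4, then add */
        if (lc == 0)
            break;                           /* br.cloop falls through */
        lc--;                                /* br.cloop branches to top */
    }
    return sum;
}

/* Runs the model on the data from the figure. */
int64_t dotcloop_example(void)
{
    return dot_product(V, W, N);
}
```

For the data in the figure the three products are 2, -12, and 30, so dotcloop_example() returns 20, the value the assembly program would store at P.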

A mov instruction must be used to transfer an integer value between the ar.lc register and a general register. The Itanium architecture has only one ar.lc register, which must be protected from conflicts across function calls. Any program level that uses the ar.lc register must save and restore its contents for the higher calling level(s). In general, some sort of stack mechanism should be used for this saving and restoring, but a "leaf" procedure (i.e., an innermost routine) can optionally use any available general register that is classified as "scratch" (Appendix D). Here we chose register r9 for this purpose.
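The reason for saving and restoring can be seen in a small C model. This is an illustrative sketch, not Itanium code: a single global variable ar_lc plays the part of the one architectural register, and the routines leaf and caller are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t ar_lc;           /* stands in for the single ar.lc register */

/* A leaf routine that runs its own counted loop of n >= 1 traversals. */
uint64_t leaf(uint64_t n)
{
    assert(n >= 1);
    uint64_t saved = ar_lc;      /* like mov r9 = ar.lc */
    uint64_t traversals = 0;
    ar_lc = n - 1;
    for (;;) {
        traversals++;            /* the loop body */
        if (ar_lc == 0)
            break;
        ar_lc--;
    }
    ar_lc = saved;               /* like mov ar.lc = r9 */
    return traversals;
}

/* A caller whose own counted loop invokes leaf on every traversal. */
uint64_t caller(uint64_t outer, uint64_t inner)
{
    assert(outer >= 1);
    uint64_t saved = ar_lc;      /* the caller protects its own caller too */
    uint64_t total = 0;
    ar_lc = outer - 1;
    for (;;) {
        total += leaf(inner);    /* leaf must leave ar_lc as it found it */
        if (ar_lc == 0)
            break;
        ar_lc--;
    }
    ar_lc = saved;
    return total;
}
```

In this model caller(3, 4) yields 12 body executions. If leaf omitted the save and restore, it would return with ar_lc equal to zero and the caller's loop would end after its first traversal.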

This is our first encounter with a prologue section in a program. The prologue should occur at the beginning of a text segment, extending from the .prologue assembler directive down to the .body directive. The prologue includes both special assembler directives and actual instructions.

In order to satisfy the requirement that Itanium operating systems "unwind" calling levels when recovering from runtime errors, the contents of registers classified as "preserved" (Appendix D) must be saved. The programming tools are required to insert an encoded data structure, composed of information from the prologue section, into the executable file.

The converse of a prologue, naturally called an epilogue, may also be required at the bottom of the text section, but there is no special directive explicitly defining an epilogue segment.

Use of the ar.lc (loop count) register requires that the programmer always explicitly save and restore its previous value. Also, the programmer needs to predecrement any initialization value, in order to obtain the correct number of traversals. The architectural loop count register also participates in more complex forms of loop design and control, which we shall take up later.

Here, in the DOTCLOOP program, the benefit of using these special facilities is not appreciable in comparison to our already improved DOTLOOP program. The GNU assembler produces the following three instruction bundles for the body of the loop from top to the branch, from which hexadecimal addresses have again been removed:

 <top>:       [MMI]       ld2 r21=[r14],2
 <top+1>:                 ld2 r22=[r15],2
 <top+2>:                 nop.i 0x0;;
 <top+16>:    [MII]       nop.m 0x0
 <top+17>:                pmpy2.r r21=r21,r22;;
 <top+18>:                sxt4 r21=r21;;
 <top+32>:    [MFB]       add r20=r21,r20
 <top+33>:                nop.f 0x0
 <top+34>:                br.cloop.sptk.few <top>;;

The figure of merit still appears to be seven: There are four instruction groups, and the load and pmpy2.r instructions will require one and two more clock cycles, respectively. The performance of this loop may be the same as that in the original DOTLOOP program.

Two nop instructions have been inserted in place of the former decrement and compare instructions. A longer loop in some other application might well present new opportunities to close up the nop slots through well-considered rearrangements of instructions in the loop body, thus improving performance.