11.3 Optimizing a Simple Program

We must emphasize that some of the observations of similarities and differences in the machine language programs produced by Itanium compilers in Section 11.2 are artifacts arising because the compilers have been asked to act without optimization on a very simplistic program. Those observations should in no way be construed as flaws in any high-level language, any corresponding compiler, or the Itanium architecture.

Clearly there are things in each of these programs that a human programmer would do differently if hand coding in Itanium assembly language. These might include the following:

  • Load the constant 3.14159 into a floating-point register only once, outside the loop.

  • Use only one integer register for i, and avoid reloading this variable from memory, though perhaps rewriting it to memory if that is required by the definition of a language.

  • Streamline the addressing of items in memory.

  • Schedule (i.e., reorder) the instructions to minimize data stalls.

Remember, however, that these languages permit complicated constructs, and in some instances function calls that may have side effects, where in contrast our COM program has only a simple scalar variable or a constant. Compilers must be engineered to produce correct and reasonably efficient programs for the general case. Also note that these are unoptimized programs, but the default setting for many compilers is optimization at some particular level (see Tables 11-2 and 11-3).
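The first three items in the list above can be sketched in C. This is only an illustration under our own assumptions: the function names com_naive, com_tuned, and com_forms_agree are hypothetical, the arrays follow the 13-element layout of com_c.c with elements 1 through 12 in use, and instruction scheduling is left out because C cannot express it. An optimizing compiler is free to derive the second form from the first.

```c
#include <assert.h>

#define TRIPS 12   /* the COM loop runs for i = 1 .. 12 */

/* Literal form: the index i addresses both vectors on every pass. */
void com_naive(double a[13], const double b[13], double c)
{
    for (long i = 1; i <= TRIPS; i++)
        a[i] = c * (b[i] + 3.14159);
}

/* Hand-tuned form: constant held in one variable, pointers stepped along. */
void com_tuned(double a[13], const double b[13], double c)
{
    const double k = 3.14159;   /* loaded once, outside the loop */
    double *pa = &a[1];         /* streamlined addressing: no index arithmetic */
    const double *pb = &b[1];
    for (int count = TRIPS; count > 0; count--)
        *pa++ = c * (*pb++ + k);
}

/* Both forms must leave identical results in every element. */
void com_forms_agree(void)
{
    double b[13], a1[13] = {0}, a2[13] = {0};
    for (int i = 0; i < 13; i++)
        b[i] = 0.5 * i;
    com_naive(a1, b, 2.71828);
    com_tuned(a2, b, 2.71828);
    for (int i = 0; i < 13; i++)
        assert(a1[i] == a2[i]);
}
```

Calling com_forms_agree() from a test driver checks that the rewrite changes only how the work is done, not what is computed.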

As with nonoptimized output, comparing the optimized output from the compilers for high-level languages can usefully reveal both similarities and differences. We proceed with some illustrations of optimization, again using a side-by-side presentation with the instruction stream clustered for brief commentary.

11.3.1 Comparing Levels -O1 and -O2 for g77 (Linux)

Here we want to show similarities and differences in the machine language produced by the g77 FORTRAN compiler at successively higher levels of optimization. Table 11-7 presents a comparison for two optimization levels, -O1 and -O2. We ask that you also compare against the output at optimization level -O0 previously shown (Table 11-5).

We have added a few comments, such as // -> A(I), to help you understand the addressing on the memory stack. We now compare the twelve clusters of Itanium instructions in Table 11-7 as produced by g77 at the two different optimization levels.

  1. The compiler puts the two floating-point constants into the data section.

  2. These are standard beginnings.

  3. The compiler claims 224 bytes of stack space (enough for two 13-element double-precision floating-point vectors and 16 more bytes for the obligatory scratch area), since a function will be called upon exit. The .fframe directive (Section 7.5.5) indicates that the program will not again decrement the stack pointer.

  4. At optimization level -O2, the g77 compiler moves a few additional instructions up amidst those strictly expected in a prologue.

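    Table 11-7. Two Levels of Optimization for COM_F Program (Linux)

    //  | g77 at level -O1                     | g77 at level -O2
    ----+--------------------------------------+--------------------------------------
     1  |     .section .rodata                 |     .section .rodata
        |     .align 8                         |     .align 8
        | .LC2:                                | .LC2:
        |     .sdata                           |     .sdata
        |     .align 8                         |     .align 8
        | .LC0:                                | .LC0:
        |   data8 0x4005bf0995aaf790           |   data8 0x4005bf0995aaf790
        |     .align 8                         |     .align 8
        | .LC1:                                | .LC1:
        |   data8 0x400921f9f01b866e           |   data8 0x400921f9f01b866e
    ----+--------------------------------------+--------------------------------------
     2  |     .text                            |     .text
        |     .align 16                        |     .align 16
        |     .global MAIN__#                  |     .global MAIN__#
        |     .proc MAIN__#                    |     .proc MAIN__#
        | MAIN__:                              | MAIN__:
    ----+--------------------------------------+--------------------------------------
     3  |     .prologue 12, 32                 |     .prologue 12, 32
        |     .save ar.pfs, r33                |     .save ar.pfs, r33
        |   alloc  r33 = ar.pfs, 0, 2, 2, 0    |   alloc  r33 = ar.pfs, 0, 3, 2, 0
        |     .fframe 224                      |     .fframe 224
        |   adds   r12 = -224,r12              |   adds   r12 = -224,r12 ;;
        |     .save rp, r32                    |
        |   mov    r32 = b0                    |
    ----+--------------------------------------+--------------------------------------
     4  |     .body                            |   addl   r14 = @gprel(.LC0),gp
        |   addl   r15 = @gprel(.LC0),gp ;;    |   addl   r15 = @gprel(.LC1),gp
        |                                      |   adds   r16 = 16,r12 ;;
    ----+--------------------------------------+--------------------------------------
     5  |                                      |     .save ar.lc, r34
        |                                      |   mov    r34 = ar.lc
    ----+--------------------------------------+--------------------------------------
     6  |   ldfd   f8 = [r15]                  |   ldfd   f8 = [r14]
        |   addl   r17 = 1,r0 // i             |   ldfd   f7 = [r15]
        |   addl   r16 = 11,r0 // lc           |
        |   addl   r14 = @gprel(.LC1),gp ;;    |
        |   ldfd   f7 = [r14]                  |
    ----+--------------------------------------+--------------------------------------
     7  |                                      |   mov    r17 = r16 // -> A(1)
        |                                      |   adds   r16 = 128,r12 // -> B(1)
        |                                      |     .save rp, r32
        |                                      |   mov    r32 = b0
        |                                      |     .body
        |                                      |   mov    ar.lc = 11 ;;
    ----+--------------------------------------+--------------------------------------
     8  | .L5:                                 | .L8:
        |   adds   r15 = 16,r12 ;;             |
        |   shladd r14 = r17,3,r15 ;;          |
        |   adds   r15 = -8,r14 // -> A(I)     |
        |   adds   r14 = 104,r14 ;; // -> B(I) |
    ----+--------------------------------------+--------------------------------------
     9  |   ldfd   f6 = [r14] ;;               |   ldfd   f6 = [r16],8 ;;
        |   fadd.d f6 = f6,f7 ;;               |   fadd.d f6 = f6,f7 ;;
        |   fmpy.d f6 = f8,f6 ;;               |   fmpy.d f6 = f8,f6 ;;
        |   stfd   [r15] = f6                  |   stfd   [r17] = f6,8
    ----+--------------------------------------+--------------------------------------
    10  |   adds   r17 = 1,r17 // i            |
        |   adds   r16 = -1,r16 ;; // lc       |
    ----+--------------------------------------+--------------------------------------
    11  |   cmp4.le p6,p7 = r0,r16             |   br.cloop.sptk.few .L8
        |   (p6) br.cond.sptk .L5              |
    ----+--------------------------------------+--------------------------------------
    12  |   addl   r14 = @ltoff(.LC2),gp ;;    |   addl   r14 = @ltoff(.LC2),gp
        |   ld8    r34 = [r14]                 |   mov    r36 = r0 ;;
        |   mov    r35 = r0                    |   ld8    r35 = [r14]
        |   br.call.sptk.many b0 = s_stop# ;;  |   br.call.sptk.many b0 = s_stop# ;;
        |   break.f 0 ;;                       |   break.f 0 ;;
        |     .endp MAIN__#                    |     .endp MAIN__#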

  5. As at optimization level -O0 (see Table 11-5), so too at level -O1 the g77 compiler introduces its own internal loop counter (see our marking lc in clusters 6 and 10). At optimization level -O2, the compiler instead uses the loop count register (ar.lc).

  6. The compiler establishes the two floating-point constants in processor registers before entering the loop. At optimization level -O1, the g77 compiler still uses the programmer's index I in addressing the elements of the vectors.

  7. At optimization level -O2, the g77 compiler moves essentially all instructions whose purpose is initialization into the prologue, and also establishes a pointer to each vector prior to entering the loop.

  8. At optimization level -O1, the g77 compiler computes addresses for the two relevant vector elements corresponding to each value of I.

  9. Here the compiler loads B(I), then calculates A(I) using addition and multiplication, and finally stores A(I). At optimization level -O2, the g77 compiler uses the very efficient postincrementing capability of the Itanium load and store instructions.

  10. At optimization level -O1, the g77 compiler increments I and decrements its internal loop counter. At optimization level -O2, the g77 compiler has rewritten the program so much that I is no longer explicitly present.

  11. This is the bottom of the program loop.

  12. Here the program produced by g77 exits by calling a standard FORTRAN exit routine, which takes care of restoring the things altered or saved in the prologue.

An explicit test for loop termination no longer occurs at the top of the DO loop at these levels of optimization, because the g77 compiler considered the fixed parameter values of the DO loop during its overall analysis of the program.
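The two addressing strategies in Table 11-7 can be modeled in C by tracking byte offsets from the stack pointer rather than real addresses. This is a sketch under stated assumptions: the stack pointer is taken as 0, the function names offsets_o1, offsets_o2, and offsets_agree are hypothetical, and the comments quote the instructions being imitated. It also shows why both loops make 12 passes although each counter starts at 11.

```c
#include <assert.h>

enum { FRAME = 224, A_BASE = 16, B_BASE = 128, TRIPS = 12 };

/* -O1 style (clusters 8, 10, 11): rebuild both addresses from I each pass. */
void offsets_o1(long a_off[TRIPS], long b_off[TRIPS])
{
    long i = 1;     /* r17, the programmer's index I */
    long lc = 11;   /* r16, the compiler's own loop counter */
    int n = 0;
    do {
        long r14 = (i << 3) + A_BASE;  /* adds r15 = 16,r12 ; shladd r14 = r17,3,r15 */
        a_off[n] = r14 - 8;            /* adds r15 = -8,r14   -> A(I) */
        b_off[n] = r14 + 104;          /* adds r14 = 104,r14  -> B(I) */
        n++;
        i += 1;                        /* adds r17 = 1,r17 */
        lc -= 1;                       /* adds r16 = -1,r16 */
    } while (0 <= lc);                 /* cmp4.le p6,p7 = r0,r16 ; (p6) br.cond */
}

/* -O2 style (clusters 7, 9, 11): two post-incremented pointers, ar.lc counts. */
void offsets_o2(long a_off[TRIPS], long b_off[TRIPS])
{
    long r17 = A_BASE;   /* -> A(1) */
    long r16 = B_BASE;   /* -> B(1) */
    long ar_lc = 11;     /* mov ar.lc = 11 */
    int n = 0;
    for (;;) {
        b_off[n] = r16; r16 += 8;   /* ldfd f6 = [r16],8 */
        a_off[n] = r17; r17 += 8;   /* stfd [r17] = f6,8 */
        n++;
        if (ar_lc == 0)             /* br.cloop falls through once ar.lc is 0 */
            break;
        ar_lc--;                    /* otherwise it decrements ar.lc and branches */
    }
}

/* The two strategies must touch the same 12 element pairs inside the frame. */
void offsets_agree(void)
{
    long a1[TRIPS], b1[TRIPS], a2[TRIPS], b2[TRIPS];
    offsets_o1(a1, b1);
    offsets_o2(a2, b2);
    for (int n = 0; n < TRIPS; n++) {
        assert(a1[n] == a2[n] && b1[n] == b2[n]);
        assert(b1[n] + 8 <= FRAME);
    }
}
```

In this model A(1) sits at offset 16 and B(1) at offset 128, matching the comments added to cluster 7, and the -O2 form replaces three address computations per pass with the postincrement built into the load and store.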

This particular analysis should reinforce your confidence that you might have produced a reasonably efficient program not unlike what the g77 compiler produces when permitted to optimize the FORTRAN program.

11.3.2 Compiler Messages

Compilers for traditional languages like C and FORTRAN frequently offer a gradation of compile-time messages about conditions encountered, ranging from mild warnings to severe errors. Often a compiler will be set to print only messages about situations of such severity that no binary output could be produced.

Compilers may offer options to print the less severe messages, which sometimes provide ways to learn more about the language, the compiler, or your program. Here are the compiler warnings about our program as expressed in the C language:

    L> gcc -Wall -S com_c.c
    L> ecc -w2 -S com_c.c
    com_c.c:
    "com_c.c", line 8: remark #592: variable "b" is used before its value is set
            a[i] = c*( b[i] + 3.14159 );
                       ^
    "com_c.c", line 4: remark #593: variable "a" was set but never used
            double a[13], b[13], c;
                   ^
    L>
    H> cc_bundled +DD64 -S com_c.c
    H> cc +DD64 -S com_c.c
    H> aCC +DD64 -Ae +w -S com_c.c

and as expressed in the FORTRAN language:

    L> g77 -Wall -S com_f.f
    L> efc -w1 -S com_f.f
          program COM_F
          INTEGER*8 I
                 ^
    Warning 2 at (3:com_f.f) : Type size specifiers are an extension to standard Fortran 95
    8 Lines Compiled
    L>
    H> f90 +DD64 -S com_f.f
    com_f.f
       program COM_F
    8 Lines Compiled
    H>

Only Intel's compiler for the C language (ecc) points out an interesting "flaw" in our program. In order to keep this program as simple as possible, we neither put any initial values into the source vector nor asked to have the destination vector printed out. Thus the heart of our program is really dead code. Any code whose sole purpose is to lead up to a computed result that is never stored or printed need not be deemed important at all.

Many compilers can detect dead code and will remove it when they operate at their higher levels of optimization. In fact, ecc at level -O2 or higher and both cc and aCC at level +O2 or higher entirely eliminated the body of our com_c.c program. Similarly, f90 at level +O2 or higher eliminated the body of our com_f.f program. The open-source compilers did not do this.

Since there were no warning or informational messages about such elimination, it is only by inspecting the machine code using the -S option that we could realize this had happened. Naturally we feel this means of discovery provides another reason for you to be aware of the capability of compilers to show you their actual machine code.
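A small C sketch may make the dead-code point concrete. The function names com_unobserved and com_observed are hypothetical, and the source vector is zero-filled here (unlike com_c.c) so that the example has defined behavior. Whether a given compiler actually removes the first loop depends on the compiler and the optimization level, as described above.

```c
#include <assert.h>

/* Same shape as com_c.c: nothing computed here can be observed by a caller,
   so an optimizer is entitled to discard the loop entirely. */
void com_unobserved(void)
{
    double a[13], b[13] = {0}, c = 2.71828;
    for (int i = 1; i <= 12; i++)
        a[i] = c * (b[i] + 3.14159);
    (void)a;   /* quiets "set but never used"; still not an observable use */
}

/* Here a[12] escapes through the return value, so at least the work that
   produces it has to survive optimization. */
double com_observed(const double b[13])
{
    double a[13], c = 2.71828;
    assert(b != 0);
    for (int i = 1; i <= 12; i++)
        a[i] = c * (b[i] + 3.14159);
    return a[12];
}
```

Compiling both functions with -S at a higher optimization level and comparing the output is one way to see which loop a particular compiler keeps.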

11.3.3 Loop Length and Optimization with f90 (HP-UX)

Asking for no output in the COM programs led the optimizing compilers to remove the heart of the program as dead code, but if we ask to see something actually be printed, the dead code issue should be overcome.

In an effort to show the machine language code produced by the f90 FORTRAN compiler at a higher level of optimization, we prepared two additional variants of the previous com_f.f program differing in the number of loop traversals. Table 11-8 shows abridged output from the f90 compiler at optimization level +O2 for the com_f1.f (short loop) and com_f2.f (long loop) programs.

Although we have not shown no-op instructions in Table 11-8, you can see that both illustrations contain optimized instruction groups limited principally by availability of execution units. We have substituted comments in place of instructions related to setup and calls to FORTRAN support or I/O procedures.

We now compare the clusters of Itanium instructions in Table 11-8 as produced by f90 at optimization level +O2 for the two program variants.

Table 11-8. Loop Length Differences Using the f90 Compiler (HP-UX) at Level +O2

    //  | Short loop with f90 at level +O2   | Long loop with f90 at level +O2
    ----+------------------------------------+------------------------------------
     0  | DOUBLE PRECISION A(13), B(13), C   | DOUBLE PRECISION A(130),B(130),C
        | INTEGER*8 I                        | INTEGER*8 I
        | C=2.71828                          | C=2.71828
        | DO I=1,12                          | DO I=1,120
        | A(I) = C*( B(I) + 3.14159 )        | A(I) = C*( B(I) + 3.14159 )
        | PRINT *, A(12)                     | PRINT *, A(120)
        | END DO                             | END DO
        | END                                | END
    ----+------------------------------------+------------------------------------
     1  |   .section .text, "ax","progbits"  |   .section .text, "ax","progbits"
        |   .proc  _start                    |   .proc  _start
        | ..L0:                              | ..L0:
        | ..L2:                              | ..L2:
        | _start::                           | _start::
        | demo::                             | demo::
    ----+------------------------------------+------------------------------------
     2  |   alloc  r35 = ar.pfs, 3, 16, 6, 0 |   alloc  r35 = ar.pfs, 3, 6, 6, 0
        |   movl   r8 = 0x400921f9f01b866e   |   add    r15 = 0,sp
        |   add    r15 = 0,sp                |   brp.loop.few.imp ..L12,..LB942
        |   add    sp = -464,sp ;;           |   add    sp = -0x840,sp ;;
        |   add    r16 = -224,r15            |   //start setup instruction
        |   movl   r10 = 0x4005bf0995aaf790  |   mov    r36 = rp
        |   add    r17 = -208,r15            |   add    r40 = 0,gp ;;
        |   add    r37 = -448,r15 ;;         |   //start setup instruction
        |   add    r11 = 8,r37               |   add    r39 = -0x830,r15 ;;
    ----+------------------------------------+------------------------------------
     3  |   stf.spill [r16] = f2,32          |   //start setup instruction
        |   stf.spill [r17] = f3,32 ;;       |   mov    r38 = pr
        |   //start setup instruction        |   ;; // call to __F90_STARTUP
        |   stf.spill [r16] = f4,32          |   add    r32 = 8,r39
        |   add    r46 = 8,r11               |   mov    r37 = ar.lc
        |   add    r44 = 16,r11              |   cmp.ne.or.andcm p16,p17 = 42,r0
        |   stf.spill [r17] = f5,32 ;;       |   movl   r33 = 0x4005bf0995aaf790 ;;
        |   stf.spill [r16] = f16,32         |   setf.d f6 = r33 ;;
        |   add    r42 = 24,r11              |   mov    ar.lc = 29
        |   stf.spill [r17] = f17,32         |   cmp.eq.and p18,p19 = 42,r0
        |   add    r40 = 32,r11 ;;           |   movl   r8 = 0x400921f9f01b866e ;;
        |   add    r39 = 72,r11              |   cmp.eq.and p20,p21 = 42,r0
        |   stf.spill [r16] = f18,32         |   mov    ar.ec = 7
        |   stf.spill [r17] = f19,32         |   cmp.eq.and p22,p0 = 42,r0
        |   add    r38 = 80,r11 ;;           |   setf.d f7 = r8 ;;
        |   stf.spill [r16] = f20,32         |
        |   mov    r36 = rp                  |
        |   add    r48 = 0,gp                |
        |   stf.spill [r17] = f21,32 ;;      |
        |   stf.spill [r16] = f22,32         |
        |   //start setup instruction        |
        |   stf.spill [r17] = f23,32 ;;      |
        |   //start setup instructions       |
        |   stf.spill [r16] = f24,32         |
        |   stf.spill [r17] = f25,16 ;;      |
        |   //start setup instruction        |
    ----+------------------------------------+------------------------------------
     4  |   ldfd   f22 = [r46],32 ;;         |   add    gp = 0,r40
        |   //start setup instruction        |   add    r10 = 0,r32
        |   add    r49 = 152,r11             |   add    r9 = 8,r32
        |   ldfd   f21 = [r44],32 ;;         |   add    r8 = 0x410,r32 ;;
        |   ldfd   f20 = [r42],32            |   add    r11 = 0x418,r32 ;;
        |   add    r43 = 160,r11             |
        |   ldfd   f19 = [r40],32            |
        |   add    r41 = 168,r11 ;;          |
        |   add    r34 = 176,r11             |
        |   ldfd   f18 = [r46],48            |
        |   ldfd   f25 = [r11]               |
        |   add    r33 = 184,r11 ;;          |
        |   ldfd   f17 = [r44],56            |
        |   add    r32 = 192,r11             |
        |   //I/O setup instruction          |
        |   ldfd   f16 = [r42],56 ;;         |
        |   ldfd   f5 = [r40],56             |
        |   //pre-exit setup instruction     |
        |   setf.d f23 = r8 ;;               |
        |   ldfd   f4 = [r39],56             |
        |   //pre-exit setup instruction     |
        |   ldfd   f3 = [r38],56 ;;          |
        |   ldfd   f2 = [r46],56             |
        |   setf.d f24 = r10 ;;              |
        |   ;; // call to __F90_STARTUP      |
        |   add    gp = 0,r48                |
    ----+------------------------------------+------------------------------------
     5  |   ;; //I/O setup instructions      | ..L12:
        |   fadd.d.s0 f7 = f22,f23           |   (p16) ldfd   f32 = [r10],16
        |   fadd.d.s0 f8 = f21,f23 ;;        |   (p19) fadd.d.s0 f57 = f35,f7
        |   fadd.d.s0 f6 = f25,f23           |   (p16) ldfd   f53 = [r9],16
        |   fadd.d.s0 f9 = f20,f23 ;;        |   (p19) fadd.d.s0 f61 = f56,f7 ;;
        |   fadd.d.s0 f10 = f19,f23          |   (p16) ldfd   f39 = [r10],16
        |   fadd.d.s0 f11 = f18,f23 ;;       |   (p19) fadd.d.s0 f49 = f42,f7
        |   fadd.d.s0 f12 = f17,f23          |   (p16) ldfd   f35 = [r9],16
        |   fadd.d.s0 f13 = f16,f23 ;;       |   (p19) fadd.d.s0 f51 = f38,f7 ;;
        |   fadd.d.s0 f14 = f5,f23           |   (p22) stfd   [r8] = f45,16
        |   fadd.d.s0 f15 = f4,f23 ;;        |   (p20) fmpy.d.s0 f43 = f6,f58
        |   fmpy.d.s0 f7 = f24,f7            |   (p22) stfd   [r11] = f48,16
        |   fmpy.d.s0 f8 = f24,f8 ;;         |   (p20) fmpy.d.s0 f46 = f6,f62 ;;
        |   fmpy.d.s0 f6 = f24,f6            |   (p22) stfd   [r8] = f60,16
        |   fmpy.d.s0 f9 = f24,f9 ;;         |   (p20) fmpy.d.s0 f58 = f6,f50
        |   fmpy.d.s0 f10 = f24,f10          |   (p22) stfd   [r11] = f64,16
        |   fmpy.d.s0 f11 = f24,f11 ;;       |   (p20) fmpy.d.s0 f62 = f6,f52
        |   fmpy.d.s0 f12 = f24,f12          | [..LB942:]
        |   fmpy.d.s0 f13 = f24,f13 ;;       |   br.ctop.dptk.few ..L12 ;;
        |   fmpy.d.s0 f14 = f24,f14          |
        |   fadd.d.s0 f32 = f3,f23 ;;        |
        |   fadd.d.s0 f33 = f2,f23           |
        |   fmpy.d.s0 f15 = f24,f15 ;;       |
    ----+------------------------------------+------------------------------------
     6  |   stfd   [r44] = f6                | ..L19:
        |   stfd   [r42] = f7                |   //I/O setup instructions
        |   fmpy.d.s0 f32 = f24,f32 ;;       |   //pre-exit setup instructions
        |   stfd   [r40] = f8                |
        |   stfd   [r39] = f9                |
        |   fmpy.d.s0 f33 = f24,f33 ;;       |
        |   stfd   [r38] = f10               |
        |   stfd   [r46] = f11 ;;            |
        |   stfd   [r49] = f12               |
        |   stfd   [r43] = f13 ;;            |
        |   stfd   [r41] = f14               |
        |   stfd   [r34] = f15 ;;            |
        |   stfd   [r33] = f32               |
        |   stfd   [r32] = f33               |
    ----+------------------------------------+------------------------------------
     7  |   ;; // call to __F90_START_IO     |   ;; // call to __F90_START_IO
        |   add    gp = 0,r48                |   add    gp = 0,r40
        |   //I/O setup instructions         |   //I/O setup instructions
        |   ;; // call to __F90_DO_IO_ITEM   |   ;; // call to __F90_DO_IO_ITEM
        |   add    gp = 0,r48                |   add    gp = 0,r40
        |   ;; // call to __F90_END_IO       |   ;; // call to __F90_END_IO
        |   //pre-exit setup instructions    |   //pre-exit setup instructions
        |   ;; // call to pre-exit routine   |   ;; // call to pre-exit routine
    ----+------------------------------------+------------------------------------
     8  |   add    r18 = 464,sp              |   add    gp = 0,r40
        |   mov    rp = r36                  |   mov    rp = r36
        |   add    gp = 0,r48 ;;             |   add    sp = 0x840,sp ;;
        |   add    r19 = -224,r18            |   mov    ar.pfs = r35 ;;
        |   mov    ar.pfs = r35              |
        |   add    r20 = -208,r18 ;;         |
    ----+------------------------------------+------------------------------------
     9  |   ldf.fill f2 = [r19],32           |
        |   ldf.fill f3 = [r20],32 ;;        |
        |   ldf.fill f4 = [r19],32           |
        |   ldf.fill f5 = [r20],32 ;;        |
        |   ldf.fill f16 = [r19],32          |
        |   ldf.fill f17 = [r20],32 ;;       |
        |   ldf.fill f18 = [r19],32          |
        |   ldf.fill f19 = [r20],32 ;;       |
        |   ldf.fill f20 = [r19],32          |
        |   ldf.fill f21 = [r20],32 ;;       |
        |   ldf.fill f22 = [r19],32          |
        |   ldf.fill f23 = [r20],32 ;;       |
        |   ldf.fill f24 = [r19],32          |
        |   ldf.fill f25 = [r20],16          |
    ----+------------------------------------+------------------------------------
    10  |   add    sp = 0,r18                |   mov    pr = r38,0x1fffe
        |   br.ret.sptk.few rp ;;            |   br.ret.sptk.few rp ;;
    ----+------------------------------------+------------------------------------
    11  | ..L1:                              | ..L1:
        |   .endp  _start                    |   .endp  _start

  1. These are standard beginnings.

  2. The compiler claims stack space for the double-precision floating-point vectors and other requirements. For the short loop variant of the program, the two constants are loaded into integer registers here, but not converted into floating-point registers until much later (see 4 below). This begins a theme of long-range reordering of instructions with both variants of the program.

  3. For the short loop variant, numerous preserved floating-point registers (see Appendix D.5) are spilled (see Section 8.3.1) as 16-byte raw data onto the memory stack. At the same time, numerous general registers are initialized as pointers for subsequent load operations (see 4 below). For the long loop variant, the two constants are converted into floating-point registers. Predicate register p16 is initialized to 1, and p17 through p22 are initialized to 0.

  4. For the short loop variant, we see that full loop unrolling is in progress as the 12 relevant elements of B are loaded into floating-point registers, while instruction bundles continue to be filled out with instructions that initialize additional pointers for subsequent store operations (see 6 below). Postincrementing with the load operations also contributes to reuse of integer registers as pointers. For the long loop variant, only two load and two store pointers will be needed in its loop strategy, which combines fourfold unrolling with register rotation operating for 30 traversals (see 5 below).

  5. For the short loop variant, the unrolled floating-point addition and multiplication operations take place, although the last multiplications are interwoven with the first few store operations (see 6 below). For the long loop variant, a very compact and symmetrical coding arrangement processes four vector elements per register rotation. Processing efficiency is thus one element per machine cycle during the kernel phase of this software-pipelined loop. Refer to "Application Architecture," volume 1 of Intel Itanium Architecture Software Developer's Manual, for assistance with interpreting code of this sort.

  6. For the short loop variant, the fully unrolled store operations take place.

  7. We have suppressed details of the FORTRAN I/O operations.

  8. These are standard wrap-up instructions.

  9. For the short loop variant, the preserved floating-point registers are refilled from the memory stack (Section 8.3.2).

  10. The program exits.

  11. The end.
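The movl/setf.d pairs in Table 11-8 carry each constant as a 64-bit integer image that is then handed to a floating-point register. A minimal C sketch of that reinterpretation follows. The names setf_d and close_to are hypothetical, and the sketch assumes a host whose double type is IEEE 754 double precision with the same byte order as its 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rough model of setf.d: the 64-bit image in a general register becomes
   the double-precision value of a floating-point register. */
double setf_d(uint64_t gr_image)
{
    double fr_value;
    assert(sizeof fr_value == sizeof gr_image);
    memcpy(&fr_value, &gr_image, sizeof fr_value);
    return fr_value;
}

/* True when x and y differ by less than tol. */
int close_to(double x, double y, double tol)
{
    double d = x - y;
    return (d < 0 ? -d : d) < tol;
}
```

Under these assumptions, setf_d(0x400921f9f01b866e) yields the addend 3.14159 and setf_d(0x4005bf0995aaf790) yields the multiplier 2.71828, which matches how the two registers are used by the fadd.d.s0 and fmpy.d.s0 instructions in both variants.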

Traditional loop unrolling produced the anticipated expansion of the code, in the form of many spill/fill operations, inasmuch as the register conventions for Itanium programming (Appendix D.5) provide very few scratch floating-point registers below the rotating region. If a process does use Fr32 through Fr127 for any purpose, then the operating system must save and restore them when it switches process context.

In contrast, the Itanium register rotation mechanism in support of software pipelining made possible a very efficient loop with minimal code expansion. Using optimization, as above, a compiler may perform partial unrolling so as to create a loop body that most productively employs the available execution units for the software pipeline. This aspect of the optimization may thus be implementation-dependent, although the architectural promise to the programmer is still kept: The loop will execute on any implementation of the architecture even though it was tuned for one particular implementation. Finally, while integer spill/fill operations may also be necessary, those can alternatively be handled at the hardware level by the register stack engine (Section 7.3.4), which partially shields execution time from their effect.
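The long loop's strategy can be imitated in C as a scheduling model. This is a sketch, not a cycle-accurate simulation, and it rests on assumptions of ours: the stage numbers 0, 3, 4, and 6 are inferred from the qualifying predicates p16, p19, p20, and p22 in Table 11-8, ordinary arrays stand in for the rotating floating-point registers, and the name com_pipelined is hypothetical. The loop-control logic follows the counted-loop behavior of br.ctop with ar.lc = 29 and ar.ec = 7.

```c
#include <assert.h>

enum { N = 120, UNROLL = 4, STAGES = 7 };                     /* ar.ec = 7 */
enum { ST_LOAD = 0, ST_ADD = 3, ST_MUL = 4, ST_STORE = 6 };   /* p16, p19, p20, p22 */

/* Runs the modeled loop and returns how many times the kernel body executed. */
int com_pipelined(double a[N], const double b[N], double c)
{
    double loaded[N], summed[N], scaled[N];  /* stand-ins for rotating registers */
    int pred[STAGES] = { 1 };                /* p16 = 1, p17 .. p22 = 0 */
    int trav[STAGES] = { 0 };                /* traversal number held by each stage */
    int lc = N / UNROLL - 1;                 /* mov ar.lc = 29 */
    int ec = STAGES;                         /* mov ar.ec = 7 */
    int next = 1, passes = 0;

    for (;;) {
        passes++;
        for (int u = 0; u < UNROLL; u++) {   /* four elements per traversal */
            int k;
            if (pred[ST_LOAD])  { k = trav[ST_LOAD]  * UNROLL + u; loaded[k] = b[k]; }
            if (pred[ST_ADD])   { k = trav[ST_ADD]   * UNROLL + u; summed[k] = loaded[k] + 3.14159; }
            if (pred[ST_MUL])   { k = trav[ST_MUL]   * UNROLL + u; scaled[k] = c * summed[k]; }
            if (pred[ST_STORE]) { k = trav[ST_STORE] * UNROLL + u; a[k] = scaled[k]; }
        }
        /* br.ctop: start new traversals while lc is nonzero, then drain using ec. */
        int start_new;
        if (lc != 0)     { lc--; start_new = 1; }
        else if (ec > 1) { ec--; start_new = 0; }
        else             break;
        for (int s = STAGES - 1; s > 0; s--) {   /* register rotation */
            pred[s] = pred[s - 1];
            trav[s] = trav[s - 1];
        }
        pred[0] = start_new;
        trav[0] = next++;
    }
    return passes;
}
```

Under this model the kernel body runs 36 times: 30 passes that each start a new traversal of four elements, followed by 6 passes that drain the traversals still in flight. The same small loop body serves as prologue, kernel, and epilogue, which is why the code expansion stays minimal.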



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles
Year: 2003
Pages: 223
