11.3 Optimizing a Simple Program

We must emphasize that some of the observations of similarities and differences in the machine language programs produced by Itanium compilers in Section 11.2 are artifacts arising because the compilers have been asked to act without optimization on a very simplistic program. Those observations should in no way be construed as flaws in any high-level language, any corresponding compiler, or the Itanium architecture.

Clearly there are things in each of these programs that a human programmer would do differently if hand coding in Itanium assembly language. These might include the following:

Load the constant 3.14159 into a floating-point register only once, outside the loop.
Use only one integer register for i, and avoid reloading this variable from memory, though perhaps rewriting it to memory if that is required by the definition of a language.
Streamline the addressing of items in memory.
Schedule (i.e., reorder) the instructions to minimize data stalls.
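
The first three items can be sketched at the source level. The following is a minimal illustration in C rather than in Itanium assembly language; the function names com_plain and com_tuned and the zero-based loop bounds are assumptions introduced here, not taken from the COM programs of Section 11.2. Instruction scheduling, the fourth item, has no direct source-level counterpart and is not shown.

```c
#define N 12   /* loop traversals, assumed to match the COM programs */

/* Straightforward form: an unoptimizing compiler may refetch the
   constant and reload the index from memory on every traversal. */
void com_plain(double a[], const double b[], double c)
{
    long i;

    for (i = 0; i < N; i++)
        a[i] = c * (b[i] + 3.14159);
}

/* Hand-tuned form (hypothetical): the constant is fetched once before
   the loop, and two walking pointers replace the address arithmetic
   that would otherwise be repeated from the index each time. */
void com_tuned(double *a, const double *b, double c)
{
    const double k = 3.14159;   /* loaded once, outside the loop */
    const double *src = b;
    double *dst = a;
    long count;

    for (count = N; count > 0; count--)
        *dst++ = c * (*src++ + k);
}
```

Both functions are intended to compute the same twelve results; only the amount of work repeated inside the loop differs.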

Remember, however, that these languages permit complicated constructs, and in some instances function calls that may have side effects, where in contrast our COM program has only a simple scalar variable or a constant. Compilers must be engineered to produce correct and reasonably efficient programs for the general case. Also note that these are unoptimized programs, but the default setting for many compilers is optimization at some particular level (see Tables 11-2 and 11-3).

As with unoptimized output, comparing the optimized output from compilers for high-level languages can usefully reveal both similarities and differences. We proceed with some illustrations of optimization, again using a side-by-side presentation with the instruction stream clustered for brief commentary.

11.3.1 Comparing Levels -O1 and -O2 for g77 (Linux)

Here we want to show similarities and differences in the machine language produced by the g77 FORTRAN compiler at successively higher levels of optimization. Table 11-7 presents a comparison for two optimization levels, -O1 and -O2. We ask that you also compare against the output at optimization level -O0 previously shown (Table 11-5).

We have added a few comments, such as // -> A(I), to help you understand the addressing on the memory stack. We now compare the twelve clusters of Itanium instructions in Table 11-7 as produced by g77 at the two different optimization levels.

1. The compiler puts the two floating-point constants into the data section.
2. These are standard beginnings.
3. The compiler claims 224 bytes of stack space (enough for two 13-element double-precision floating-point vectors and 16 more bytes for the obligatory scratch area), since a function will be called upon exit. The .fframe directive (Section 7.5.5) indicates that the program will not again decrement the stack pointer.
4. At optimization level -O2, the g77 compiler moves a few additional instructions up amidst those strictly expected in a prologue.

Table 11-7. Two Levels of Optimization for COM_F Program (Linux)
g77 at level -O1	g77 at level -O2	//
.section .rodata .align 8 .LC2: .sdata .align 8 .LC0: data8 0x4005bf0995aaf790 .align 8 .LC1: data8 0x400921f9f01b866e	.section .rodata .align 8 .LC2: .sdata .align 8 .LC0: data8 0x4005bf0995aaf790 .align 8 .LC1: data8 0x400921f9f01b866e	`1`
.text .align 16 .global MAIN__# .proc MAIN__# MAIN__:	.text .align 16 .global MAIN__# .proc MAIN__# MAIN__:	`2`
.prologue 12, 32 .save ar.pfs, r33 alloc r33 = ar.pfs, 0, 2, 2, 0 .fframe 224 adds r12 = -224,r12 .save rp, r32 mov r32 = b0	.prologue 12, 32 .save ar.pfs, r33 alloc r33 = ar.pfs, 0, 3, 2,0 .fframe 224 adds r12 = -224,r12 ;;	`3`
.body addl r15 = @gprel(.LC0), gp;;	addl r14 = @gprel(.LC0),gp addl r15 = @gprel(.LC1),gp adds r16 = 16,r12 ;;	`4`
	.save ar.lc, r34 mov r34 = ar.lc	`5`
ldfd f8 = [r15] addl r17 = 1,r0 // i addl r16 = 11,r0 // lc addl r14 = @gprel(.LC1),gp ;; ldfd f7 = [r14]	ldfd f8 = [r14] ldfd f7 = [r15]	`6`
	mov r17 = r16 // -> A(1) adds r16 = 128,r12 //->B(1) .save rp, r32 mov r32 = b0 .body mov ar.lc = 11 ;;	`7`
.L5: adds r15 = 16,r12 ;; shladd r14 = r17,3,r15 ;; adds r15 = -8,r14 //-> A(I) adds r14 = 104,r14;;//->B(I)	.L8:	`8`
ldfd f6 = [r14] ;; fadd.d f6 = f6,f7 ;; fmpy.d f6 = f8,f6 ;; stfd [r15] = f6	ldfd f6 = [r16],8 ;; fadd.d f6 = f6,f7 ;; fmpy.d f6 = f8,f6 ;; stfd [r17] = f6,8	`9`
adds r17 = 1,r17 // i adds r16 = -1,r16 ;; // lc		`10`
cmp4.le p6,p7 = r0,r16 (p6) br.cond.sptk .L5	br.cloop.sptk.few .L8	`11`
addl r14 = @ltoff(.LC2), gp ;; ld8 r34 = [r14] mov r35 = r0 br.call.sptk.many b0 = s_stop# ;; break.f 0 ;; .endp MAIN__#	addl r14 = @ltoff(.LC2), gp mov r36 = r0 ;; ld8 r35 = [r14] br.call.sptk.many b0 = s_stop# ;; break.f 0 ;; .endp MAIN__#	`12`

5. As at optimization level -O0 (see Table 11-5), so too at level -O1 the g77 compiler introduces its own internal loop counter (see our marking lc in clusters 6 and 10). At optimization level -O2, the compiler instead uses the loop count register (ar.lc).
6. The compiler establishes the two floating-point constants in processor registers before entering the loop. At optimization level -O1, the g77 compiler still uses the programmer's index I in addressing the elements of the vectors.
7. At optimization level -O2, the g77 compiler moves essentially all instructions whose purpose is initialization into the prologue, and also establishes a pointer to each vector prior to entering the loop.
8. At optimization level -O1, the g77 compiler computes addresses for the two relevant vector elements corresponding to each value of I.
9. Here the compiler loads B(I), then calculates A(I) using addition and multiplication, and finally stores A(I). At optimization level -O2, the g77 compiler uses the very efficient postincrementing capability of the Itanium load and store instructions.
10. At optimization level -O1, the g77 compiler increments I and decrements its internal loop counter. At optimization level -O2, the g77 compiler has rewritten the program so much that I is no longer explicitly present.
11. This is the bottom of the program loop.
12. Here the program produced by g77 exits by calling a standard FORTRAN exit routine, which takes care of restoring the things altered or saved in the prologue.

An explicit test for loop termination no longer occurs at the top of the DO loop at these levels of optimization, because the g77 compiler considered the fixed parameter values of the DO loop during its overall analysis of the program.
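
The loop control seen in clusters 8 through 11 can be paraphrased in C. This is only a sketch of the two shapes, not compiler output: the function names are hypothetical, the pointers a and b are assumed to address A(1) and B(1), and the constants are passed in as arguments c and k.

```c
/* Shape of the -O1 loop: both addresses are rebuilt from the index on
   every traversal, and a separate down-counter decides when to stop. */
void loop_o1_shape(double *a, const double *b, double c, double k)
{
    long i = 1;     /* the programmer's index I */
    long lc = 11;   /* the compiler's internal loop counter */

    do {
        double *pa = a + (i - 1);         /* -> A(I) */
        const double *pb = b + (i - 1);   /* -> B(I) */

        *pa = c * (*pb + k);
        i = i + 1;
        lc = lc - 1;
    } while (0 <= lc);                    /* tested only at the bottom */
}

/* Shape of the -O2 loop: two pointers advance by postincrement, and a
   single count (standing in for ar.lc) controls the branch. */
void loop_o2_shape(double *a, const double *b, double c, double k)
{
    long lc = 11;   /* stands in for the loop count register */

    do {
        *a++ = c * (*b++ + k);            /* postincremented load and store */
    } while (lc-- != 0);                  /* continue while the count is nonzero */
}
```

In both shapes the counter starts at 11 rather than 12 because the test follows the loop body, so twelve traversals occur. One difference from the hardware is that the C post-decrement also runs on the final, failing test.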

This particular analysis should reinforce your confidence that you might have produced a reasonably efficient program not unlike what the g77 compiler produces when permitted to optimize the FORTRAN program.

11.3.2 Compiler Messages

Compilers for traditional languages like C and FORTRAN frequently offer a gradation of compile-time messages about conditions encountered, ranging from mild warnings to severe errors. Often a compiler will be set to print only messages about situations of such severity that no binary output could be produced.

Compilers may offer options to print the less severe messages, which sometimes provide ways to learn more about the language, the compiler, or your program. Here are the compiler warnings about our program as expressed in the C language:

L> gcc -Wall -S com_c.c
L> ecc -w2 -S com_c.c
com_c.c:
"com_c.c", line 8: remark #592: variable "b" is used before its value is set
        a[i] = c*( b[i] + 3.14159 );
                   ^
"com_c.c", line 4: remark #593: variable "a" was set but never used
        double a[13], b[13], c;
               ^
L>
H> cc_bundled +DD64 -S com_c.c
H> cc +DD64 -S com_c.c
H> aCC +DD64 -Ae +w -S com_c.c

and as expressed in the FORTRAN language:

L> g77 -Wall -S com_f.f
L> efc -w1 -S com_f.f
      program COM_F
      INTEGER*8 I
             ^
Warning 2 at (3:com_f.f) : Type size specifiers are an extension to standard Fortran 95
8 Lines Compiled
L>
H> f90 +DD64 -S com_f.f
com_f.f
   program COM_F
8 Lines Compiled
H>

Only Intel's compiler for the C language (ecc) points out an interesting "flaw" in our program. In order to keep this program as simple as possible, we neither put any initial values into the source vector nor asked to have the destination vector printed out. Thus the heart of our program is really dead code. Any code whose sole purpose is to lead up to a computed result that is never stored or printed need not be deemed important at all.

Many compilers can detect dead code and will remove it when they operate at their higher levels of optimization. In fact, ecc at level -O2 or higher and both cc and aCC at level +O2 or higher entirely eliminated the body of our com_c.c program. Similarly, f90 at level +O2 or higher eliminated the body of our com_f.f program. The open-source compilers did not do this.
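
To see why the loop qualifies as dead code, it may help to look at a variant whose result is observable. The sketch below is a C paraphrase written for this purpose, not the com_c.c source: the function name com_live, the values placed in b, and the choice of returned element are all assumptions introduced for illustration.

```c
/* Variant of the COM computation whose result can be observed.
   Returning one element of a[] gives the loop a visible effect, so a
   compiler must still deliver that value. If nothing ever read a[],
   as in the original program, the whole loop could be discarded. */
double com_live(void)
{
    double a[13], b[13], c = 2.71828;
    long i;

    for (i = 0; i < 13; i++)
        b[i] = (double)i;              /* give the source vector defined values */

    for (i = 1; i <= 12; i++)
        a[i] = c * (b[i] + 3.14159);

    return a[12];                      /* the use that keeps the loop alive */
}
```

An optimizing compiler remains free to simplify such a function heavily, for instance by computing only the one element that is returned, but it can no longer drop the computation altogether.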

Since there were no warning or informational messages about such elimination, it was only by inspecting the machine code using the -S option that we could realize this had happened. Naturally we feel this means of discovery provides another reason for you to be aware of the capability of compilers to show you their actual machine code.

11.3.3 Loop Length and Optimization with f90 (HP-UX)

Asking for no output in the COM programs led the optimizing compilers to remove the heart of the program as dead code, but if we ask to see something actually be printed, the dead code issue should be overcome.

In an effort to show the machine language code produced by the f90 FORTRAN compiler at a higher level of optimization, we prepared two additional variants of the previous com_f.f program differing in the number of loop traversals. Table 11-8 shows abridged output from the f90 compiler at optimization level +O2 for the com_f1.f (short loop) and com_f2.f (long loop) programs.

Although we have not shown no-op instructions in Table 11-8, you can see that both illustrations contain optimized instruction groups limited principally by availability of execution units. We have substituted comments in place of instructions related to setup and calls to FORTRAN support or I/O procedures.

We now compare the clusters of Itanium instructions in Table 11-8 as produced by f90 at optimization level +O2 for the two program variants.

Table 11-8. Loop Length Differences Using the f90 Compiler (HP-UX) at Level +O2
Short loop with f90 at level +O2	Long loop with f90 at level +O2	//
DOUBLE PRECISION A(13), B(13), C INTEGER*8 I C=2.71828 DO I=1,12 A(I) = C*( B(I) + 3.14159 ) PRINT *, A(12) END DO END	DOUBLE PRECISION A(130),B(130),C INTEGER*8 I C=2.71828 DO I=1,120 A(I) = C*( B(I) + 3.14159 ) PRINT *, A(120) END DO END	0
.section .text, "ax","progbits" .proc _start ..L0: ..L2: _start:: demo::	.section .text, "ax","progbits" .proc _start ..L0: ..L2: _start:: demo::	1
alloc r35 = ar.pfs, 3, 16, 6, 0 movl r8 = 0x400921f9f01b866e add r15 = 0,sp add sp = -464,sp ;; add r16 = -224, r15 movl r10 = 0x4005bf0995aaf790 add r17 = -208,r15 add r37 = -448,r15 ;; add r11 = 8,r37	alloc r35=ar.pfs, 3, 6, 6, 0 add r15 = 0,sp brp.loop.few.imp ..L12,..LB942 add sp = -0x840,sp ;; //start setup instruction mov r36 = rp add r40 = 0,gp ;; //start setup instruction add r39 = -0x830,r15 ;;	2
stf.spill [r16] = f2,32 stf.spill [r17] = f3,32 ;; //start setup instruction stf.spill [r16] = f4,32 add r46 = 8,r11 add r44 = 16,r11 stf.spill [r17] = f5,32 ;; stf.spill [r16] = f16,32 add r42 = 24,r11 stf.spill [r17] = f17,32 add r40 = 32,r11 ;; add r39 = 72,r11 stf.spill [r16] = f18,32 stf.spill [r17] = f19,32 add r38 = 80,r11 ;; stf.spill [r16] = f20,32 mov r36 = rp add r48 = 0,gp	//start setup instruction mov r38 = pr ;; // call to __F90_STARTUP add r32 = 8,r39 mov r37 = ar.lc cmp.ne.or.andcm p16,p17= 42,r0 movl r33=0x4005bf0995aaf790;; setf.d f6 = r33 ;; mov ar.lc = 29 cmp.eq.and p18,p19 = 42,r0 movl r8 =0x400921f9f01b866e;; cmp.eq.and p20,p21 = 42,r0 mov ar.ec = 7 cmp.eq.and p22,p0 = 42,r0 setf.d f7 = r8 ;;	3
stf.spill [r17] = f21,32 ;; stf.spill [r16] = f22,32 //start setup instruction stf.spill [r17] = f23,32 ;; //start setup instructions stf.spill [r16] = f24,32 stf.spill [r17] = f25,16 ;; //start setup instruction		3
ldfd f22 = [r46],32 ;; //start setup instruction add r49 = 152,r11 ldfd f21 = [r44],32 ;; ldfd f20 = [r42],32 add r43 = 160,r11 ldfd f19 = [r40],32 add r41 = 168,r11 ;; add r34 = 176,r11 ldfd f18 = [r46],48 ldfd f25 = [r11] add r33 = 184,r11 ;; ldfd f17 = [r44],56 add r32 = 192,r11 //I/O setup instruction ldfd f16 = [r42],56 ;; ldfd f5 = [r40],56 //pre-exit setup instruction setf.d f23 = r8 ;; ldfd f4 = [r39],56 //pre-exit setup instruction ldfd f3 = [r38],56 ;; ldfd f2 = [r46],56 setf.d f24 = r10 ;; ;; // call to __F90_STARTUP add gp = 0,r48	add gp = 0,r40 add r10 = 0,r32 add r9 = 8,r32 add r8 = 0x410,r32 ;; add r11 = 0x418,r32 ;;	4
;; //I/O setup instructions fadd.d.s0 f7 = f22,f23 fadd.d.s0 f8 = f21,f23 ;; fadd.d.s0 f6 = f25,f23 fadd.d.s0 f9 = f20,f23 ;; fadd.d.s0 f10 = f19,f23	..L12: (p16) ldfd f32 = [r10],16 (p19) fadd.d.s0 f57 = f35,f7 (p16) ldfd f53 = [r9],16 (p19) fadd.d.s0 f61 = f56,f7 ;;	5
fadd.d.s0 f11 = f18,f23 ;; fadd.d.s0 f12 = f17,f23 fadd.d.s0 f13 = f16,f23 ;; fadd.d.s0 f14 = f5,f23 fadd.d.s0 f15 = f4,f23 ;; fmpy.d.s0 f7 = f24,f7 fmpy.d.s0 f8 = f24,f8 ;; fmpy.d.s0 f6 = f24,f6 fmpy.d.s0 f9 = f24,f9 ;; fmpy.d.s0 f10 = f24,f10 fmpy.d.s0 f11 = f24,f11 ;; fmpy.d.s0 f12 = f24,f12 fmpy.d.s0 f13 = f24,f13 ;; fmpy.d.s0 f14 = f24,f14 fadd.d.s0 f32 = f3,f23 ;; fadd.d.s0 f33 = f2,f23 fmpy.d.s0 f15 = f24,f15 ;;	(p16) ldfd f39 = [r10],16 (p19) fadd.d.s0 f49 = f42,f7 (p16) ldfd f35 = [r9],16 (p19) fadd.d.s0 f51 = f38,f7 ;; (p22) stfd [r8] = f45,16 (p20) fmpy.d.s0 f43 = f6,f58 (p22) stfd [r11] = f48,16 (p20) fmpy.d.s0 f46 = f6,f62 ;; (p22) stfd [r8] = f60,16 (p20) fmpy.d.s0 f58 = f6,f50 (p22) stfd [r11] = f64,16 (p20) fmpy.d.s0 f62 = f6,f52 [..LB942:] br.ctop.dptk.few ..L12 ;;	5
stfd [r44] = f6 stfd [r42] = f7 fmpy.d.s0 f32 = f24,f32 ;; stfd [r40] = f8 stfd [r39] = f9 fmpy.d.s0 f33 = f24,f33 ;; stfd [r38] = f10 stfd [r46] = f11 ;; stfd [r49] = f12 stfd [r43] = f13 ;; stfd [r41] = f14 stfd [r34] = f15 ;; stfd [r33] = f32 stfd [r32] = f33	..L19: //I/O setup instructions //pre-exit setup instructions	6
;; // call to __F90_START_IO add gp = 0, r48 //I/O setup instructions ;; // call to __F90_DO_IO_ITEM add gp = 0,r48 ;; // call to __F90_END_IO //pre-exit setup instructions ;; // call to pre-exit routine	;; // call to __F90_START_IO add gp = 0,r40 //I/O setup instructions ;; // call to __F90_DO_IO_ITEM add gp = 0,r40 ;; // call to __F90_END_IO //pre-exit setup instructions ;; // call to pre-exit routine	7
add r18 = 464,sp mov rp = r36 add gp = 0,r48 ;; add r19 = -224,r18 mov ar.pfs = r35 add r20 = -208,r18 ;;	add gp = 0,r40 mov rp = r36 add sp = 0x840,sp ;; mov ar.pfs = r35 ;;	8
ldf.fill f2 = [r19],32 ldf.fill f3 = [r20],32 ;; ldf.fill f4 = [r19],32 ldf.fill f5 = [r20],32 ;; ldf.fill f16 = [r19],32 ldf.fill f17 = [r20],32 ;; ldf.fill f18 = [r19],32 ldf.fill f19 = [r20],32 ;; ldf.fill f20 = [r19],32 ldf.fill f21 = [r20],32 ;; ldf.fill f22 = [r19],32 ldf.fill f23 = [r20],32 ;; ldf.fill f24 = [r19],32 ldf.fill f25 = [r20],16		9
add sp = 0, r18 br.ret.sptk.few rp ;;	mov pr = r38, 0x1fffe br.ret.sptk.few rp ;;	10
..L1: .endp _start	..L1: .endp _start	11

1. These are standard beginnings.
2. The compiler claims stack space for the double-precision floating-point vectors and other requirements. For the short loop variant of the program, the two constants are loaded into integer registers here, but not converted into floating-point registers until much later (see 4 below). This begins a theme of long-range reordering of instructions with both variants of the program.
3. For the short loop variant, numerous preserved floating-point registers (see Appendix D.5) are spilled (see Section 8.3.1) as 16-byte raw data onto the memory stack. At the same time, numerous general registers are initialized as pointers for subsequent load operations (see 4 below). For the long loop variant, the two constants are converted into floating-point registers. Predicate register p16 is initialized to 1 and p17 to p22 to 0.
4. For the short loop variant, we see that full loop unrolling is in progress as the 12 relevant elements of B are loaded into floating-point registers, while instruction bundles continue to be filled out with instructions that initialize additional pointers for subsequent store operations (see 6 below). Postincrementing with the load operations also contributes to reuse of integer registers as pointers. For the long loop variant, only two load and two store pointers will be needed in its loop strategy, which combines fourfold unrolling with register rotation operating for 30 traversals (see 5 below).
5. For the short loop variant, the unrolled floating-point addition and multiplication operations take place, although the last multiplications are interwoven with the first few store operations (see 6 below). For the long loop variant, a very compact and symmetrical coding arrangement processes four vector elements per register rotation. Processing efficiency is thus one element per machine cycle during the kernel phase of this software-pipelined loop. Refer to "Application Architecture," volume 1 of Intel Itanium Architecture Software Developer's Manual, for assistance with interpreting code of this sort.
6. For the short loop variant, the fully unrolled store operations take place.
7. We have suppressed details of the FORTRAN I/O operations.
8. These are standard wrap-up instructions.
9. For the short loop variant, the preserved floating-point registers are refilled from the memory stack (Section 8.3.2).
10. The program exits.
11. The end.

Traditional loop unrolling produced the anticipated expansion of the code, in the form of many spill/fill operations, inasmuch as the register conventions for Itanium programming (Appendix D.5) provide very few scratch floating-point registers below the rotating region. If a process does use Fr₃₂ through Fr₁₂₇ for any purpose, then the operating system must save and restore them when it switches process context.

In contrast, the Itanium register rotation mechanism in support of software pipelining made possible a very efficient loop with minimal code expansion. Using optimization, as above, a compiler may perform partial unrolling so as to create a loop body that most productively employs the available execution units for the software pipeline. This aspect of the optimization may thus be implementation-dependent, although the architectural promise to the programmer is still kept: The loop will execute on any implementation of the architecture even though it was tuned for one particular implementation. Finally, while integer spill/fill operations may also be necessary, the register stack engine (Section 7.3.4) can handle those at the hardware level, which partially hides their cost in execution time.
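
The staged behavior of the long loop can be imitated in C as a rough behavioral model. The stage numbers below are read from the qualifying predicates visible in Table 11-8 (p16 for loads, p19 for additions, p20 for multiplications, p22 for stores); everything else, including the names com_pipelined and stage_active and the use of a small ring of slots in place of rotating registers, is an assumption made for illustration and says nothing about timing.

```c
#define N_ELEMS    120                  /* elements processed in the long loop */
#define UNROLL       4                  /* elements handled per traversal */
#define TRAVERSALS (N_ELEMS / UNROLL)   /* 30 source traversals */

/* Stage of each operation, taken from its qualifying predicate:
   p16 -> 0, p19 -> 3, p20 -> 4, p22 -> 6. */
enum { ST_LOAD = 0, ST_ADD = 3, ST_MUL = 4, ST_STORE = 6, N_STAGES = 7 };

/* A stage is active at kernel traversal t if source traversal
   j = t - stage exists, that is, if 0 <= j < TRAVERSALS. */
static int stage_active(int t, int stage)
{
    int j = t - stage;
    return j >= 0 && j < TRAVERSALS;
}

void com_pipelined(double *a, const double *b, double c, double k)
{
    double ring[N_STAGES][UNROLL];           /* stands in for rotating registers */
    int total = TRAVERSALS + N_STAGES - 1;   /* fill, steady state, and drain */
    int t, u, j;

    for (t = 0; t < total; t++) {
        for (u = 0; u < UNROLL; u++) {
            if (stage_active(t, ST_LOAD)) {
                j = t - ST_LOAD;
                ring[j % N_STAGES][u] = b[j * UNROLL + u];
            }
            if (stage_active(t, ST_ADD)) {
                j = t - ST_ADD;
                ring[j % N_STAGES][u] += k;
            }
            if (stage_active(t, ST_MUL)) {
                j = t - ST_MUL;
                ring[j % N_STAGES][u] *= c;
            }
            if (stage_active(t, ST_STORE)) {
                j = t - ST_STORE;
                a[j * UNROLL + u] = ring[j % N_STAGES][u];
            }
        }
    }
}
```

In this model each source traversal owns one slot of the ring while it moves through the stages, so loads for later traversals overlap with the arithmetic and stores of earlier ones, and a few extra passes at the end drain the values still in flight.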
