11.2 Compiling a Simple Program

Comparing the output of several high-level language compilers reveals both similarities and differences. The rudimentary programs shown in Figure 11-1 are written to be as parallel from one language to the other as the syntax rules of FORTRAN and C allow. We deliberately leave array B uninitialized, since filling its elements with dummy values would lengthen the programs without amplifying the main points we want to discuss.

Each program contains floating-point variables and constants in addition to integer quantities. If you skipped over Chapter 8, you need to know that Itanium floating-point instructions end with s or d for single- or double-precision data, respectively (for example, ldfd and fadd.d). Also recall from Chapter 10 that an Itanium implementation contains more than one execution unit in the CPU, which means that floating-point manipulations may be carried out simultaneously with certain integer manipulations.

For this section, we compiled each program with available compilers using the appropriate -Ox or +Ox option (Tables 11-1 through 11-3) to inhibit optimization and the -S option to produce an assembly language file named com_x.s. For example, we used the command lines:

    L> gcc -S -O0 com_c.c
    L> mv com_c.s com_c.gcc.O0
    L> g77 -S -O0 com_f.f
    L> mv com_f.s com_f.g77.O0

for the open-source compilers for Linux. We preserved each .s file by renaming it with the mv command (Table A-2) for later study.

Figure 11-1 COM_F and COM_C: Simple programs for compiler comparisons
        PROGRAM COM_F
        DOUBLE PRECISION A(13), B(13), C
        INTEGER*8 I
        C=2.71828
        DO I=1,12
        A(I) = C*( B(I) + 3.14159 )
        END DO
        END

    // COM_C Simple program to study compiler output
    main ()
    {
        double a[13], b[13], c;
        long long i;
        c=2.71828;
        for (i=1; i<13; i++)
            a[i] = c*( b[i] + 3.14159 );
    }

In order to locate the relevant code, search the .text segment for main and then for a br.ret instruction bracketing your code. Some compilers have more than one symbolic location resembling main (because of the way they initialize the runtime environment for a program), and you then have to locate and focus on whichever main corresponds to the true start of your own high-level language program.

You should also be alert to the fact that the space for the program's variables may be allocated on a stack, rather than explicitly in the .data segment, especially if the language supports recursive calls.

Many details in these .s files, such as unwind information, will not be of direct concern to us in the investigations illustrated in this chapter. We will show highlights only.

11.2.1 Comparing Output from gcc and ecc (Linux)

Following the procedure just suggested, we made a rough correlation between the code produced by the open-source gcc compiler and by the Intel ecc compiler. Table 11-4 presents the result; the number marked // in each row is a key to the discussion that follows the table. We have departed from the original output of the compilers by removing spaces and lines of little interest so that the code fits this tabular comparison. Since gcc shows only the stops that mark off instruction groups, we have also removed the more explicit information (template and choice of nop) shown by ecc, in the interest of simplifying the table.

Table 11-4. Compiler Output for the COM_C Program from gcc and ecc (Linux)

// 1
  gcc at level -O0:
         .sdata
         .align 8
    .LC0: data8 0x4005bf0995aaf790
         .align 8
    .LC1: data8 0x400921f9f01b866e
  ecc at level -O0:
         .section .data

// 2
  gcc at level -O0:
         .text
         .align 16
         .global main#
         .proc main#
    main:
  ecc at level -O0:
         .section .text
         .proc main#
         .align 32
         .global main#
    main:

// 3
  gcc at level -O0:
         .prologue 2, 2
         .vframe r2
         mov     r2 = r12
  ecc at level -O0:
         (no corresponding code)

// 4
  gcc at level -O0:
         adds    r12 = -320,r12
         .body
  ecc at level -O0:
         add     sp = -208,sp

// 5
  gcc at level -O0:
         addl    r14 = @gprel(.LC0),gp ;;
         ldfd    f6 = [r14]
         adds    r14 = -96,r2 ;;          // -> c
         stfd    [r14] = f6
  ecc at level -O0:
         movl    r3 = 0x4005bf0995aaf790 ;;
         setf.d  f7 = r3
         add     r2 = sp,r0 ;;            // -> c
         stfd    [r2] = f7

// 6
  gcc at level -O0:
         addl    r15 = 1,r0
         adds    r14 = -88,r2 ;;          // -> i
         st8     [r14] = r15
  ecc at level -O0:
         add     r29 = 8,sp               // -> i
         add     r28 = 1,r0 ;;
         st8     [r29] = r28 ;;

// 7
  gcc at level -O0:
    .L2:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         cmp.ge  p6,p7 = 12,r14
    (p6) br.cond.dptk .L5
         br      .L3
  ecc at level -O0:
    .b1_1:
         add     r27 = 8,sp ;;            // -> i
         ld8     r26 = [r27] ;;
         cmp.le  p6,p0 = 13,r26
    (p6) br.cond.dpnt .b1_2 ;;

// 8
  gcc at level -O0:
    .L5:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         shladd  r14 = r14,3,r0
         adds    r15 = -320,r2 ;;         // -> a[0]
         add     r16 = r15,r14            // -> a[i]
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         shladd  r15 = r14,3,r0
         adds    r14 = -208,r2 ;;         // -> b[0]
         add     r14 = r14,r15 ;;         // -> b[i]
  ecc at level -O0:
         add     r25 = sp,r0 ;;           // -> c
         ldfd    f6 = [r25]
         add     r24 = 120,sp             // -> b[0]
         add     r23 = 8,sp ;;            // -> i
         ld8     r22 = [r23] ;;
         setf.sig f15 = r22
         add     r21 = 8,r0 ;;
         setf.sig f14 = r21 ;;
         xma.l   f13 = f15,f14,f0 ;;
         getf.sig r20 = f13 ;;
         add     r19 = r24,r20 ;;         // -> b[i]

// 9
  gcc at level -O0:
         ldfd    f7 = [r14]
         addl    r14 = @gprel(.LC1),gp ;;
         ldfd    f6 = [r14] ;;
         fadd.d  f7 = f7,f6
         adds    r14 = -96,r2 ;;          // -> c
         ldfd    f6 = [r14] ;;
         fmpy.d  f6 = f6,f7 ;;
  ecc at level -O0:
         ldfd    f12 = [r19]
         movl    r18 = 0x400921f9f01b866e ;;
         setf.d  f11 = r18 ;;
         fma.d   f10 = f12,f1,f11 ;;
         fma.d   f9 = f6,f10,f0 ;;

// 10
  gcc at level -O0:
         stfd    [r16] = f6
  ecc at level -O0:
         add     r17 = 16,sp ;;           // -> a[0]
         add     r16 = 8,sp ;;            // -> i
         ld8     r15 = [r16] ;;
         setf.sig f8 = r15
         add     r14 = 8,r0 ;;
         setf.sig f7 = r14 ;;
         xma.l   f6 = f8,f7,f0 ;;
         getf.sig r11 = f6 ;;
         add     r10 = r17,r11 ;;         // -> a[i]
         stfd    [r10] = f9

// 11
  gcc at level -O0:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         adds    r15 = 1,r14
         adds    r14 = -88,r2 ;;          // -> i
         st8     [r14] = r15
         br      .L2
  ecc at level -O0:
         add     r9 = 8,sp ;;             // -> i
         ld8     r8 = [r9] ;;
         add     r3 = 1,r8
         add     r2 = 8,sp ;;
         st8     [r2] = r3
         br.cond.sptk .b1_1 ;;

// 12
  gcc at level -O0:
    .L3:
         mov     r8 = 0
         .restore sp
         mov     r12 = r2
         br.ret.sptk.many b0
         .endp main#
  ecc at level -O0:
    .b1_2:
         add     r8 = 0,r0
         add     sp = 208,sp
         br.ret.sptk.many b0 ;;
         .endp main#

We have added a few comments, such as // -> i, to help you understand the addressing on the memory stack. We now compare the twelve clusters of Itanium instructions in Table 11-4 as produced by gcc and by ecc.

  1. While gcc puts the two floating-point constants in the data section, ecc embeds them as 64-bit immediate values (see 5 and 9 below).

  2. These are standard beginnings.

  3. While gcc saves a copy of the stack pointer value (register r12) in register r2, ecc will instead restore register sp using an adjustment determined at compile time (see 12 below).

  4. While gcc claims 320 bytes of stack space, ecc claims 208. Is it a coincidence that 208 bytes correspond to 26 quad words, which would seem intended for the two 13-element arrays a[0..12] and b[0..12] in the C program? Please read on.

  5. While gcc fetches the floating-point value for c from its data segment at location .LC0, ecc transfers the raw 64-bit immediate value in the instruction stream via register r3 using movl (Section 4.5.4) and setf.d instructions (Section 8.7.1). Both compilers then store c as a local variable on the memory stack. How does ecc still have space for two 13-element vectors of double-precision data? Please read on.

  6. Both compilers establish i=1 and store it on the memory stack.

  7. Both compilers retrieve i and perform the test required by the semantics of the C language to ensure the loop body is not entered at all if the terminating condition is already satisfied.

  8. Both compilers establish pointers to c and b[i]; gcc also establishes a pointer to a[i]. While gcc uses the purpose-built shladd instruction, ecc more slowly performs a very general integer multiplication using the floating-point execution unit (Section 8.7.2).

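  9. While gcc retrieves the constant within the algebraic expression from its data segment at location .LC1, ecc again moves the raw 64-bit value from the instruction stream into a floating-point register using register-to-register operations (movl and setf.d). Both compilers perform the expected floating-point addition, then multiplication; ecc writes each as an fma.d, with f1 (always 1.0) as the multiplier for the addition and f0 (always 0.0) as the addend for the multiplication.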

  10. Both compilers store the result a[i]. It is not unusual for a compiler, when operating at its lowest optimization setting like ecc here, to recalculate the index i for the assignment, or even to recalculate it for every vector reference within an algebraic expression.

  11. Both compilers retrieve, increment, and store the index i before branching back to the test at the top of the loop. It is not unusual for a compiler, when operating at its lowest optimization setting, to store any modified value for one of the programmer's variables immediately back into memory. Holding frequently used quantities, like i here, in registers is expected at higher optimization settings.

  12. Both compilers restore the stack pointer to its original value.
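Items 1, 5, and 9 rest on the fact that the data8 values at .LC0 and .LC1 and the movl immediates used by ecc are the same double-precision bit patterns for 2.71828 and 3.14159. The following C sketch is one way to check that correspondence for yourself. The helper names double_to_bits and bits_to_double are our own and do not come from either compiler, and we assume a system on which double is a 64-bit IEEE type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern that represents a double in memory. */
uint64_t double_to_bits(double value)
{
    uint64_t bits;
    assert(sizeof bits == sizeof value);   /* assumption: 64-bit double */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Reinterpret a 64-bit pattern, such as a data8 or movl operand, as a double. */
double bits_to_double(uint64_t bits)
{
    double value;
    assert(sizeof value == sizeof bits);   /* assumption: 64-bit double */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For instance, passing 0x400921f9f01b866eULL to bits_to_double should give a value that prints as 3.14159 when shown to six significant digits.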

In order to have enough storage on the memory stack for c, i, and two 13-element vectors, ecc uses the 16-byte scratch area that the Itanium programming conventions (Section 7.1.3) require the caller to provide. Not only does gcc not use that scratch area, it appears to claim significantly more stack space than it uses for this program.
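The arithmetic behind that observation can be sketched in C. The offsets below are read from the comments in the ecc column of Table 11-4 (c at sp+0, i at sp+8, a[0] at sp+16, b[0] at sp+120); the structure and function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>

enum {
    ELEM_BYTES  = 8,    /* one double or one long long            */
    ELEMENTS    = 13,   /* a[0..12] and b[0..12]                  */
    ECC_FRAME   = 208,  /* add sp = -208,sp                       */
    SCRATCH     = 16    /* scratch area supplied by the caller    */
};

/* Byte offsets from the new sp, as suggested by the ecc code. */
struct ecc_layout {
    long c, i, a, b, end;
};

struct ecc_layout ecc_offsets(void)
{
    struct ecc_layout layout;
    layout.c   = 0;                                   /* add r2 = sp,r0   */
    layout.i   = layout.c + ELEM_BYTES;               /* add r29 = 8,sp   */
    layout.a   = layout.i + ELEM_BYTES;               /* add r17 = 16,sp  */
    layout.b   = layout.a + ELEMENTS * ELEM_BYTES;    /* add r24 = 120,sp */
    layout.end = layout.b + ELEMENTS * ELEM_BYTES;    /* first byte past b */
    return layout;
}

/* Number of bytes by which the variables extend past the claimed frame. */
long ecc_overhang(void)
{
    struct ecc_layout layout = ecc_offsets();
    assert(layout.end >= ECC_FRAME);
    return layout.end - ECC_FRAME;
}
```

Under these assumptions the variables need 224 bytes, which is the 208 bytes that ecc claims plus the 16-byte scratch area.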

This detailed analysis should give you confidence that you could have produced a shorter and more efficient program than either gcc or ecc when they were precluded from optimizing the rough output from their compiling algorithms for the C language.
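As a reminder of what item 8 contrasts, the sketch below expresses in C the two ways of scaling the index by 8 that appear in Table 11-4: gcc shifts and adds with shladd, while ecc performs a general multiplication with xma.l. The function names are hypothetical, and a C compiler is of course free to translate either form however it likes.

```c
#include <assert.h>
#include <stdint.h>

/* gcc style: shladd r14 = r14,3,r0 followed by an add of the base address. */
uint64_t address_by_shift(uint64_t base, uint64_t i)
{
    assert(i < 13);                     /* arrays in Figure 11-1 have 13 elements */
    uint64_t scaled = (i << 3) + 0;     /* shift left by 3, then add r0 (zero)    */
    return base + scaled;
}

/* ecc style: xma.l f13 = f15,f14,f0 followed by an add of the base address. */
uint64_t address_by_multiply(uint64_t base, uint64_t i)
{
    assert(i < 13);                     /* arrays in Figure 11-1 have 13 elements */
    uint64_t scaled = i * 8 + 0;        /* multiply by 8, then add f0 (zero)      */
    return base + scaled;
}
```

Both functions return the same address for every valid index; only the cost of the underlying instruction sequence differs.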

11.2.2 Comparing Output from gcc and g77 (Linux)

Next we want to draw your attention to the similarities and differences in the machine language produced by compilers for C and FORTRAN when they work through source programs that are as identical as we could make them. Table 11-5 presents the comparison for the open-source gcc and g77 compilers without optimization (-O0). The left column for gcc is copied from Table 11-4 except for clustering the instructions somewhat differently to draw the best parallels with the other language.

We have added a few comments, such as // -> i, to help you understand the addressing on the memory stack. We now compare the twelve clusters of Itanium instructions in Table 11-5 as produced by gcc and by g77.

Table 11-5. Compiler Output for the COM_C and COM_F Programs (Linux)

// 1
  gcc at level -O0:
         .sdata
         .align 8
    .LC0: data8 0x4005bf0995aaf790
         .align 8
    .LC1: data8 0x400921f9f01b866e
  g77 at level -O0:
         .section .rodata
         .align 8
    .LC2:
         .sdata
         .align 8
    .LC0: data8 0x4005bf0995aaf790
         .align 8
    .LC1: data8 0x400921f9f01b866e

// 2
  gcc at level -O0:
         .text
         .align 16
         .global main#
         .proc main#
    main:
  g77 at level -O0:
         .text
         .align 16
         .global MAIN__#
         .proc MAIN__#
    MAIN__:

// 3
  gcc at level -O0:
         .prologue 2, 2
         .vframe r2
         mov     r2 = r12
         adds    r12 = -320,r12
  g77 at level -O0:
         .prologue 14, 33
         .save ar.pfs, r34
         alloc   r34 = ar.pfs, 0, 4, 2, 0
         .vframe r35
         mov     r35 = r12
         adds    r12 = -352,r12
         .save rp, r33
         mov     r33 = b0

// 4
  gcc at level -O0:
         .body
         addl    r14 = @gprel(.LC0),gp ;;
         ldfd    f6 = [r14]
         adds    r14 = -96,r2 ;;          // -> c
         stfd    [r14] = f6
  g77 at level -O0:
         .body
         addl    r14 = @gprel(.LC0),gp ;;
         ldfd    f6 = [r14]
         adds    r14 = -112,r35 ;;        // -> C
         stfd    [r14] = f6

// 5
  gcc at level -O0:
         addl    r15 = 1,r0
         adds    r14 = -88,r2 ;;          // -> i
         st8     [r14] = r15
  g77 at level -O0:
         addl    r15 = 12,r0
         adds    r14 = -96,r35 ;;         // -> lc
         st4     [r14] = r15
         addl    r15 = 1,r0
         adds    r14 = -104,r35 ;;        // -> I
         st8     [r14] = r15

// 6
  gcc at level -O0:
    .L2:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         cmp.ge  p6,p7 = 12,r14
    (p6) br.cond.dptk .L5
         br      .L3
  g77 at level -O0:
    .L2:
         adds    r14 = -96,r35 ;;         // -> lc
         ld4     r14 = [r14] ;;
         adds    r14 = -1,r14 ;;
         mov     r15 = r14
         adds    r14 = -96,r35 ;;         // -> lc
         st4     [r14] = r15
         cmp4.le p6,p7 = r0,r15
    (p6) br.cond.dptk .L5

// 7
  gcc at level -O0:
         (no corresponding code)
  g77 at level -O0:
         addl    r14 = @ltoff(.LC2),gp ;;
         ld8     r36 = [r14]
         mov     r37 = r0
         br.call.sptk.many b0 = s_stop#

// 8
  gcc at level -O0:
    .L5:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         shladd  r14 = r14,3,r0
         adds    r15 = -320,r2 ;;         // -> a[0]
         add     r16 = r15,r14            // -> a[i]
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         shladd  r15 = r14,3,r0
         adds    r14 = -208,r2 ;;         // -> b[0]
         add     r14 = r14,r15 ;;         // -> b[i]
  g77 at level -O0:
    .L5:
         adds    r14 = -104,r35 ;;        // -> I
         ld8     r14 = [r14] ;;
         shladd  r14 = r14,3,r0
         adds    r15 = -336,r35 ;;        // -> A(1)
         add     r14 = r14,r15 ;;
         adds    r15 = -8,r14             // -> A(I)
         adds    r14 = -104,r35 ;;        // -> I
         ld8     r14 = [r14] ;;
         shladd  r14 = r14,3,r0
         adds    r16 = -336,r35 ;;        // -> B(1)
         add     r14 = r14,r16 ;;
         adds    r14 = 104,r14 ;;         // -> B(I)

// 9
  gcc at level -O0:
         ldfd    f7 = [r14]
         addl    r14 = @gprel(.LC1),gp ;;
         ldfd    f6 = [r14] ;;
         fadd.d  f7 = f7,f6
         adds    r14 = -96,r2 ;;          // -> c
         ldfd    f6 = [r14] ;;
         fmpy.d  f6 = f6,f7 ;;
  g77 at level -O0:
         ldfd    f7 = [r14]
         addl    r14 = @gprel(.LC1),gp ;;
         ldfd    f6 = [r14] ;;
         fadd.d  f7 = f7,f6
         adds    r14 = -112,r35 ;;        // -> C
         ldfd    f6 = [r14] ;;
         fmpy.d  f6 = f6,f7 ;;

// 10
  gcc at level -O0:
         stfd    [r16] = f6
  g77 at level -O0:
         stfd    [r15] = f6

// 11
  gcc at level -O0:
         adds    r14 = -88,r2 ;;          // -> i
         ld8     r14 = [r14] ;;
         adds    r15 = 1,r14
         adds    r14 = -88,r2 ;;          // -> i
         st8     [r14] = r15
         br      .L2
  g77 at level -O0:
         adds    r14 = -104,r35 ;;        // -> I
         ld8     r14 = [r14] ;;
         adds    r15 = 1,r14
         adds    r14 = -104,r35 ;;        // -> I
         st8     [r14] = r15
         br      .L2

// 12
  gcc at level -O0:
    .L3:
         mov     r8 = 0
         .restore sp
         mov     r12 = r2
         br.ret.sptk.many b0
         .endp main#
  g77 at level -O0:
         .endp main#

  1. Both compilers put the two floating-point constants into the data section.

  2. These are standard beginnings.

  3. While gcc only saves a copy of the stack pointer value (register r12) in register r2, g77 allocates register stack storage in a longer prologue in order to call a standard FORTRAN exit routine (see 7 below). While gcc claims 320 bytes of stack space, g77 claims 352.

  4. Both compilers fetch the floating-point value for c from the data segment at location .LC0 and then store c as a local variable on the memory stack.

  5. Both compilers establish i=1 and store it on the memory stack, and g77 also establishes the number of traversals for a down-counter that it will use internally for loop control (see 6 below).

  6. While gcc bases the loop termination test at the top of the loop on i, g77 decrements its internal loop counter for a logically equivalent test.

  7. Here the program produced by g77 exits by calling a standard FORTRAN exit routine.

  8. Both compilers establish pointers to c and to the elements of the two vectors corresponding to the programmer's index value. It is not unusual for a compiler, when operating at its lowest optimization setting, to recalculate the index i for the assignment, or even to recalculate it for every vector reference within an algebraic expression. Note that g77 includes an adjustment of -8 bytes (one element) in the addressing because a FORTRAN vector has no zeroth element.

  9. Both compilers obtain the floating-point sources in the algebraic expression, then add and multiply as expected.

  10. Both compilers store the resulting vector element.

  11. Both compilers retrieve, increment, and store the index i before branching back to the respective top of the loop. It is not unusual for a compiler, when operating at its lowest optimization setting, to store any modified value for one of the programmer's variables immediately back into memory. Holding frequently used quantities, like i here, in registers is expected at higher optimization settings.

  12. While gcc restores the stack pointer to its original value, g77 instead will have called a standard library routine that handles stopping the program (see 7 above).
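Items 5 and 6 describe two loop-control schemes that visit the same index values. A minimal C sketch of both follows, assuming the behavior shown in Table 11-5: gcc compares the index with its limit, while g77 decrements a 32-bit trip counter (lc) at the top of the loop and continues while the result is not negative. The function names, and the choice to sum the visited indices so that the two schemes can be compared, are ours.

```c
#include <assert.h>
#include <limits.h>

/* gcc style: test the index against its limit at the top of the loop. */
long sum_with_index_test(long first, long last)
{
    long sum = 0;
    long i = first;
    while (last >= i) {         /* cmp.ge p6,p7 = 12,r14          */
        sum += i;               /* stands in for the loop body    */
        i = i + 1;              /* adds r15 = 1,r14               */
    }
    return sum;
}

/* g77 style: decrement a precomputed trip counter at the top of the loop. */
long sum_with_down_counter(long first, long last)
{
    long span = last - first + 1;                 /* 12 for DO I=1,12       */
    assert(span > INT_MIN && span <= INT_MAX);    /* lc is stored with st4  */
    int lc = (int)span;
    long sum = 0;
    long i = first;
    for (;;) {
        lc = lc - 1;            /* adds r14 = -1,r14              */
        if (!(0 <= lc))         /* cmp4.le p6,p7 = r0,r15         */
            break;              /* falls through to the exit call */
        sum += i;               /* stands in for the loop body    */
        i = i + 1;              /* adds r15 = 1,r14               */
    }
    return sum;
}
```

Both functions make twelve traversals for the limits 1 and 12, and neither enters the loop body when the first value already exceeds the last.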

Both compilers appear to claim significantly more stack space than is actually used for this program. This could be an artifact of the extreme simplicity of the target program, which lacks many of the features of more realistic C or FORTRAN programs.

This detailed analysis should give you confidence that you could have produced a shorter and more efficient program than these compilers when they were precluded from optimizing the rough output from their compiling algorithms for C or FORTRAN.

11.2.3 Comparing Output from cc_bundled and f90 (HP-UX)

Here we compare output for each program as produced by two compilers for HP-UX. One compiler is cc_bundled, which ships with some HP-UX systems; it has no capability to operate at higher levels of optimization, but its output corresponds approximately to that from Hewlett-Packard's full-featured C compiler (cc) without optimization. The other compiler considered here is Hewlett-Packard's FORTRAN compiler (f90). We used command lines such as the following:

    H> cc_bundled +DD64 -S com_c.c
    H> mv com_c.s com_c.bundled
    H> f90 +DD64 -S +O0 com_f.f
    H> mv com_f.s com_f.O0

where the +DD64 option requests the generation of full 64-bit addressing sequences and the -S option requests assembly language output.

Table 11-6 compares the output from cc_bundled with that from f90 at its +O0 level of optimization. As with the output from the compilers for the Linux programming environment, we have removed spaces, comments, and template markings so that the code fits this tabular comparison.

Table 11-6. Compiler Output from cc_bundled and f90 (HP-UX)

// 1
  cc_bundled:
         .section .text, "axn","progbits"
         .proc  main
    ..L0:
    ..L2:
    main::
  f90 at level +O0:
         .section .text, "ax","progbits"
         .proc  _start
    ..L0:
    ..L2:
    _start::
    demo::

// 2
  cc_bundled:
         add     r11 = 0,sp ;;
         add     sp = -240,sp ;;
         add     r9 = -224,r11 ;;
         add     r8 = -48,sp ;;           // ???
  f90 at level +O0:
         alloc   r35 = ar.pfs, 3, 5, 4, 0 ;;
         add     r15 = 0,sp ;;
         add     sp = -288,sp ;;
         mov     r36 = rp ;;
         add     r38 = -16,r15 ;;
         add     r37 = -272,r15 ;;
         add     r39 = 0,gp ;;
         // suppressing here a calling
         // sequence to __F90_STARTUP
         add     gp = 0,r39 ;;

// 3
  cc_bundled:
         add     r8 = 0,r9 ;;             // -> c
         movl    r10 = 0x4005bf0995aaf790 ;;
         setf.d  f6 = r10 ;;
         stfd    [r8] = f6 ;;
  f90 at level +O0:
         add     r8 = 16,r37 ;;           // -> C
         movl    r9 = 0x4005bf0995aaf790 ;;
         setf.d  f6 = r9 ;;
         stfd    [r8] = f6 ;;

// 4
  cc_bundled:
         add     r8 = 8,r9 ;;             // -> i
         add     r10 = 1,r0 ;;
         st8     [r8] = r10 ;;
  f90 at level +O0:
         add     r8 = 32,r37 ;;           // -> I
         add     r9 = 1,r0 ;;
         st8     [r8] = r9 ;;

// 5
  cc_bundled:
         add     r8 = 8,r9 ;;             // -> i
         ld8     r8 = [r8] ;;
         cmp.le  p6,p0 = 13,r8 ;;
    (p6) br.dptk.few ..L3 ;;
  f90 at level +O0:
         add     r8 = 24,r37 ;;           // -> lc
         add     r9 = 12,r0 ;;
         st8     [r8] = r9 ;;
         add     r8 = 8,r37 ;;
         add     r9 = 32,r37 ;;           // -> I
         ld8     r9 = [r9] ;;
         st8     [r8] = r9 ;;
         add     r8 = 24,r37 ;;           // -> lc
         ld8     r8 = [r8] ;;
         add     r9 = 32,r37 ;;           // -> I
         ld8     r9 = [r9] ;;
         cmp.lt  p6,p0 = r8,r9 ;;
    (p6) br.dptk.few ..L4 ;;
    ..L5:
         br.dptk.few ..L6 ;;

// 6
  cc_bundled:
    ..L4:
    ..L5:
         add     r8 = 16,r9 ;;            // -> a[0]
         add     r10 = 8,r9 ;;            // -> i
         ld8     r10 = [r10] ;;
         shladd  r10 = r10,3,r0 ;;
         add     r8 = r10,r8              // -> a[i]
         add     r10 = 120,r9 ;;          // -> b[0]
         add     r11 = 8,r9 ;;            // -> i
         ld8     r11 = [r11] ;;
         shladd  r11 = r11,3,r0 ;;
         add     r10 = r11,r10 ;;         // -> b[i]
  f90 at level +O0:
    ..L6:
         add     r8 = 144,r37 ;;          // -> A(1)
         add     r9 = 8,r37 ;;            // -> I
         ld8     r9 = [r9] ;;
         add     r9 = -1,r9 ;;            // I-1
         shladd  r9 = r9,3,r0 ;;
         add     r8 = r9,r8 ;;            // -> A(I)
         add     r9 = 40,r37 ;;           // -> B(1)
         add     r10 = 8,r37 ;;           // -> I
         ld8     r10 = [r10] ;;
         add     r10 = -1,r10 ;;          // I-1
         shladd  r10 = r10,3,r0 ;;
         add     r9 = r10,r9 ;;           // -> B(I)

// 7
  cc_bundled:
         ldfd    f6 = [r10] ;;
         movl    r10 = 0x400921f9f01b866e ;;
         setf.d  f7 = r10 ;;
         fadd.d.s0 f6 = f6,f7 ;;
         add     r10 = 0,r9 ;;            // -> c
         ldfd    f7 = [r10] ;;
         fmpy.d.s0 f6 = f7,f6 ;;
  f90 at level +O0:
         ldfd    f6 = [r9] ;;
         movl    r9 = 0x400921f9f01b866e ;;
         setf.d  f7 = r9 ;;
         fadd.d.s0 f6 = f6,f7 ;;
         add     r9 = 16,r37 ;;           // -> C
         ldfd    f7 = [r9] ;;
         fmpy.d.s0 f6 = f7,f6 ;;

// 8
  cc_bundled:
         stfd    [r8] = f6 ;;
  f90 at level +O0:
         stfd    [r8] = f6 ;;

// 9
  cc_bundled:
         add     r8 = 8,r9 ;;             // -> i
         ld8     r8 = [r8] ;;
         add     r8 = 0,r8 ;;             // ???
         add     r8 = 1,r8 ;;
         add     r10 = 8,r9 ;;            // -> i
         st8     [r10] = r8 ;;
  f90 at level +O0:
         add     r8 = 8,r37 ;;            // -> I
         ld8     r8 = [r8] ;;
         add     r8 = 1,r8 ;;
         add     r9 = 8,r37 ;;            // -> I
         st8     [r9] = r8 ;;

// 10
  cc_bundled:
         add     r8 = 8,r9 ;;             // -> i
         ld8     r8 = [r8] ;;
         cmp.gt  p6,p0 = 13,r8 ;;
    (p6) br.dptk.few ..L5 ;;
  f90 at level +O0:
         add     r8 = 8,r37 ;;            // ???
         ld8     r8 = [r8] ;;             // ???
         add     r9 = 24,r37 ;;           // -> lc
         ld8     r9 = [r9] ;;
         cmp.le  p6,p0 = r8,r9 ;;
    (p6) br.dptk.few ..L6 ;;

// 11
  cc_bundled:
    ..L7:
         br.dptk.few ..L3 ;;
  f90 at level +O0:
    ..L7:
         br.dptk.few ..L4 ;;

// 12
  cc_bundled:
    ..L3:
         add     r8 = 0,r0 ;;
         add     r8 = 0,r8 ;;             // ???
         add     r8 = 0,r8 ;;             // ???
         add     sp = 240,sp ;;
         br.ret.dptk.few rp ;;
         .endp main
  f90 at level +O0:
    ..L4:
         // suppressing here a calling
         // sequence to a FORTRAN
         // pre-exit routine
         add     gp = 0,r39 ;;
         mov     rp = r36 ;;
         mov     ar.pfs = r35 ;;
         add     sp = 288,sp ;;
         br.ret.dptk.few rp ;;
         .endp _start

We have marked with // ??? a few machine instructions that do not advance the logical progress of the program. We now compare the twelve clusters of Itanium instructions in Table 11-6 as produced by cc_bundled and by f90 at optimization level +O0.

  1. These are standard beginnings.

  2. While cc_bundled claims 240 bytes of stack space, f90 claims 288 and also uses the register stack because it calls two FORTRAN support routines. Each compiler consistently uses a register (r9 for cc_bundled, r37 for f90) to point to the lowest memory stack address used for its variables.

  3. Both compilers transfer the raw 64-bit immediate value for c in the instruction stream using movl (Section 4.5.4) and setf.d (Section 8.7.1) instructions. Both compilers then store c as a local variable on the memory stack.

  4. Both compilers establish i=1 and store it on the memory stack.

  5. Both compilers retrieve i and perform a test to ensure the loop body is not entered at all if the terminating condition is already satisfied; cc_bundled bases the test on i, while f90 instead uses its own internal loop counter.

  6. Both compilers establish pointers to a[i] and b[i] using the purpose-built shladd instruction. Since a FORTRAN array has no zeroth element, f90 subtracts 1 from I in the calculation.

  7. Both compilers move the raw 64-bit value for the constant within the algebraic expression from the instruction stream into a floating-point register using register-to-register operations (movl and setf.d). Both compilers perform the expected floating-point addition, then multiplication.

  8. Both compilers store the result a[i].

  9. Both compilers retrieve, increment, and store the index i.

  10. While both compilers retrieve i again, only cc_bundled actually uses i in order to test whether to go back to the top of the loop for another traversal; f90 instead decrements and uses its internal loop counter for the test.

  11. This code appears to have no real function. It may be an artifact of the absence of optimization by cc_bundled and f90. It is not unusual to see apparently useless machine instructions in such circumstances.

  12. Both compilers restore the stack pointer to its original value. We suppressed numerous machine instructions that f90 uses to call a FORTRAN support routine.
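Item 6 notes that f90 subtracts 1 from I before scaling, whereas Table 11-5 showed g77 reaching the same element by scaling first and then adding -8. The C sketch below places the zero-based and one-based calculations next to each other. The function names are hypothetical, and each base argument stands for whatever address the compiler computed for a[0] or A(1).

```c
#include <assert.h>
#include <stdint.h>

/* C arrays: a[i] lives at the address of a[0] plus 8*i. */
uint64_t c_element_address(uint64_t base_of_a0, uint64_t i)
{
    return base_of_a0 + (i << 3);             /* shladd, then add      */
}

/* f90 style: subtract 1 from I, scale by 8, then add the address of A(1). */
uint64_t f90_element_address(uint64_t base_of_a1, uint64_t i)
{
    assert(i >= 1);                           /* no zeroth element     */
    return base_of_a1 + ((i - 1) << 3);       /* add -1, shladd, add   */
}

/* g77 style: scale I by 8, add the address of A(1), then add -8. */
uint64_t g77_element_address(uint64_t base_of_a1, uint64_t i)
{
    assert(i >= 1);                           /* no zeroth element     */
    return (base_of_a1 + (i << 3)) - 8;       /* shladd, add, adds -8  */
}
```

The two FORTRAN-style functions agree for every valid index; they differ only in whether the one-element correction is applied before or after the scaling.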

Note that both compilers have established the 16-byte scratch area that a caller is required by the Itanium programming conventions (Section 7.1.3) to provide, although cc_bundled does not actually call any subsidiary functions.

This detailed analysis should give you confidence that you could have produced a shorter and more efficient program than either cc_bundled or f90 without optimizations.

Note: When the -S option is specified on the command line, most HP-UX compilers (cc, aCC, and f90, but not cc_bundled) also produce, at the same time as the .s file, a .o file that can be used in subsequent linking. Neither the open-source compilers nor the Intel compilers for Linux produce dual output files.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles
Year: 2003
Pages: 223
