11.4 Inline Optimizations

In Chapters 7 and 10 you had opportunities to observe the amount of overhead involved with function and procedure calls. The Itanium calling standards are actually quite modest in their impact as compared to some earlier architectures and programming environments. Nevertheless, the system software for high-performance architectures commonly provides options for reducing call overhead.

We have previously alluded to moving functions inline, i.e., copying the body of the function or procedure directly into the instruction stream of the caller rather than setting up a call. The same function may be replicated at several call sites. Doing so increases the overall size of the machine-language program, but virtual memory paging by the operating system can readily accommodate that growth. Importantly, the total number of executed instructions decreases by the amount of calling and returning overhead that is avoided.
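As a minimal source-level sketch of this idea, the following C fragment contrasts a routine that makes three calls with the same computation after the function body has been copied into the caller. The names sum_squares_called and sum_squares_inlined are hypothetical, introduced only for illustration; an optimizing compiler performs the equivalent transformation on its own internal representation, not on the source text.

```c
/* Small function whose body is a candidate for inlining. */
long long square(long long n) { return n * n; }

/* Call form: each use of square costs argument setup, a call, and a return. */
long long sum_squares_called(long long x, long long y, long long z)
{
    return square(x) + square(y) + square(z);
}

/* Inline form (hypothetical hand-written equivalent): the body n*n has been
   replicated three times, so no call or return is executed. */
long long sum_squares_inlined(long long x, long long y, long long z)
{
    return x * x + y * y + z * z;
}
```

Both routines return the same value for the same arguments (for example, 50 for the inputs 3, 4, and 5); only the number of executed call and return instructions differs.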

Some compilers provide the ability to consider all the components of a program holistically and to move routines inline when that is advantageous. An even wider view would involve the linker in order to consider bringing certain library functions inline as well.

We have prepared the very simple C program INLINE (Figure 11-2) in order to compare the effect of placing an internal function, square, inline with the effect of leaving it as a called function. If the compiler front end can consider both the main program and the function together, its optimizer can then analyze opportunities to eliminate call overhead by moving portions of code inline.

Table 11-9 compares the assembly language output from the full-featured cc compiler for HP-UX at optimization levels +O2 and +O3. We again abridge the output where the two results were similar, in order to highlight the differences. As expected (Table 11-3), the compiler pulls the small function square inline at optimization level +O3.

Here, with such a very small function, the overall code length in the body of main is virtually the same whether the function is inline or not, but code expansion would occur for a longer function.

Figure 11-2 INLINE: Program to illustrate bringing a function inline

//  This C program shows inline optimization.
#include <stdio.h>
main ()
{
    long long r2,x,y,z;
    long long square( long long );
    printf("Enter 3 integers: ");
    scanf("%lld" "%lld" "%lld",&x,&y,&z);
    r2 = square(x)+square(y)+square(z);
    printf("%lld\n",r2);
    return 0;
}
    long long square( long long n )
{ return n*n; }

Table 11-9. Effect of Moving a Function Inline Using the cc Compiler (HP-UX)
cc at level +O2                       cc at level +O3                       //
main: // program initialization       main: // program initialization       1
// setup and call printf              // setup and call printf              2
// setup and call scanf               // setup and call scanf               3
// with pointers now being:           // with pointers now being
// r34 -> x                           // r34 -> x
// r39 -> y                           // r40 -> y
// r38 -> z                           // r39 -> z
// r36,r37 for gp&entry to printf     // r37,r38 for gp&entry to print
// r40 for printf format string
ld8 out0 = [r34] // M                 ld8 out0 = [r34] // M                 4
add gp = 0,r41 // M                   add gp = 0,r41 // I
br.call.sptk.few rp = square;; // B   add out0 = 0,r42 // I
add gp = 0,r41 // M                   ld8 r9 = [r40];; // M
add r34 = 0,r8 // I                   ld8 r10 = [r39] // M
nop.i 0 // I                          add r14 = 0,r41 // I
ld8 out0 = [r39] // M                 setf.sig f7 = r8;; // M
nop.f 0 // F                          setf.sig f8 = r9 // M
br.call.sptk.few rp = square;; // B   nop.i 0 // I
add gp = 0,r41 // M                   setf.sig f6 = r10;; // M
add r35 = 0,r8 // I                   ld8 r8 = [r38] // M
nop.i 0 // I                          nop.i 0 // I
ld8 out0 = [r38] // M                 ld8 gp = [r37] // M
nop.f 0 // F                          nop.i 0;; // I
br.call.sptk.few rp = square;; // B   mov b6 = r8;; // I
add gp = 0,r41 // M                   nop.m 0 // M
add r9 = r34,r35 // I                 nop.m 0 // M
add out0 = 0,r40 // I                 xma.l f7 = f7,f7,f0;; // F
ld8 r10 = [r37];; // M                nop.m 0 // M
ld8 gp = [r36] // M                   xma.l f8 = f8,f8,f0 // F
mov b6 = r10 // I                     nop.i 0 // I
ld8 r10 = [r37];; // M                nop.m 0 // M
ld8 gp = [r36] // M                   xma.l f6 = f6,f6,f0 // F
mov b6 = r10 // I                     nop.i 0;; // I
ld8 r10 = [r37];; // M                getf.sig r9 = f7;; // M
ld8 gp = [r36] // M                   getf.sig r10 = f8 // M
mov b6 = r10 // I                     nop.i 0 // I
                                      getf.sig r8 = f6 // M
                                      nop.i 0;; // I
                                      add r9 = r9,r10;; // I
add out1 = r9,r8 // M                 add out1 = r9,r8 // M                 5
add r14 = 0, r41 // I                 nop.m 0 // M
br.call.sptk.few rp = b6;; // B       br.call.sptk.few rp = b6;; // B
// this is final call to printf       // this is final call to printf
// restore registers and exit         // restore registers and exit         6
square:                               square:                               7
setf.sig f6 = r32;; // M              setf.sig f6 = r32;; // M
nop.m 0 // M                          nop.m 0 // M
nop.i 0 // I                          nop.i 0 // I
nop.m 0 // M                          nop.m 0 // M
xma.l f6 = f6,f6,f0 // F              xma.l f6 = f6,f6,f0 // F
nop.i 0;; // I                        nop.i 0;; // I
getf.sig r8 = f6 // M                 getf.sig r8 = f6 // M
nop.m 0 // M                          nop.m 0 // M
br.ret.sptk.few rp;; // B             br.ret.sptk.few rp;; // B

Many factors would need to be assessed in deciding whether bringing a function inline can be beneficial. These include: whether br.call or br.ret may cause pipeline bubbles, whether there could be thrashing in I-cache behavior if the function is at a distant address, whether the inline version would require significantly more registers, whether the inline function code can be interwoven with mainline code in bundles, and whether preparing arguments for the function call is time-consuming (perhaps some of them do not change from call to call).
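The last of these factors, arguments that do not change from call to call, can be sketched at the source level as follows. This is only an illustration under stated assumptions: the function quad and both summing routines are hypothetical names that do not appear in program INLINE, and the actual decision is made by the compiler's optimizer, not by the programmer.

```c
#include <stddef.h>

/* Hypothetical helper: evaluates a*n*n + b for a single value n. */
long long quad(long long a, long long b, long long n)
{
    return a * n * n + b;
}

/* Call form: a and b must be placed in the argument registers again on
   every iteration, although neither changes inside the loop. */
long long sum_quad_called(const long long *v, size_t len,
                          long long a, long long b)
{
    long long total = 0;
    size_t i;
    for (i = 0; i < len; i++)
        total += quad(a, b, v[i]);
    return total;
}

/* Inline form: the body of quad is written into the loop, so a and b can
   remain in whatever registers already hold them. */
long long sum_quad_inlined(const long long *v, size_t len,
                           long long a, long long b)
{
    long long total = 0;
    size_t i;
    for (i = 0; i < len; i++)
        total += a * v[i] * v[i] + b;
    return total;
}
```

The two routines compute the same sum; the inline form merely removes the repeated argument preparation along with the call and return themselves.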
