11.7 Recursion for Fibonacci Numbers Revisited | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

We presented recursive and nonrecursive functions for computing numbers in the Fibonacci sequence in Chapter 10, representing two extreme computational models. Now we revisit the Fibonacci sequence with another algorithm adapted from a Pascal illustration by Rohl that avoids the unfortunate computational time for the branching form of recursion, while still being recursive in nature.

For convenience in setting up this algorithm, the correspondence between the index position n and each Fibonacci number F_n in the sequence will begin with a zeroth member, as follows:

[View full width]

n 0 1 2 3 4 5 6 7 8

9 ... F_n 0 1 1 2 3 5 8 13 21 34

...

The nonbranching recursive algorithm will compute F _i ₊₁, given that it already knows F_i and F _i₁, and proceed in this fashion until it reaches the desired F_n.

Figure 11-3 presents the contents of source file fib101.c, namely, a brief C calling program (main) and a recursive function (fib). In order to minimize the number of arguments to be passed for each recursive call, the member number (n) is made global to both main and fib.

When we compiled this program using available C compilers, we usually obtained quite straightforward machine code whether or not we requested optimizations. But the gcc compiler at optimization level -O3 produced a program that contains another form of optimization that we want to make known to our readers.

Figure 11-3 Simpler recursive generation of Fibonacci numbers (fib101)

 //  Fibonacci C program with simpler recursion. #include <stdio.h> long long n; main () {     long long F;     long long fib(long long, long long, long long);     printf("Which n? ");     scanf("%lld",&n);     if (n <= 1)         F = n;     else         F = fib(1,0,1);     printf("Fibonacci number %lld is %lld\n", n, F);     return 0; } long long fib(long long i, long long prev,         long long present) {     if (i == n)         return present;     else         return fib(i+1,present,prev+present); }

The machine code for main was still altogether straightforward. Thus we only need to concentrate on the placement of the global variable n:

 .common   n#,8,8

and the sequence of instructions to call fib:

     addl r8 = @ltoff(n#), gp;;     ld8 r32 = [r8];;               // loc0 -> n     <instructions removed>     br.call.sptk.many b0=scanf#;;  // Call scanf     ld8 r2 = [r32]                 // Get n     addl r36 = 1, r0               // out0 = 1     mov r37 = r0                   // out1 = 0     addl r38 = 1, r0;;             // out2 = 1     mov r8 = r2                    // Assume default Fn = n     cmp.lt p8,p9 = 1, r2           // If n <= 1     (p9) br.cond.dptk .L18         //  then return default     br.call.sptk.many b0=fib#;;    //  else call fib     <restore global pointer> .L18:                              // Call printf

where we have removed distracting instructions having to do with preparations for the call to scanf and the call to printf that follows this segment.

The generated and optimized machine code for fib was as follows, where we have retained the indications for templates (gcc did not show no-op instructions in the file fib101.s produced with the -S command-line option).

      .align 16      .global fib#      .proc fib# fib:      .prologue      .body      .mii      // cycle 0      mov r8 = r34;;               // in2 held present .L22:      .mii      // cycle 0      addl r9 = @ltoff(n#), gp     // r9 -> [-> n]      add r15 = r33, r8            // r15 = prev + present      mov r33 = r8;;               // r33      .mmi      // cycle 2      ld8 r3 = [r9];;              // r3 -> n      // cycle 5      ld8 r2 = [r3];;              // r2 = n      .mib      // cycle 7      cmp.ne p8, p9 = r2, r32      // in0 = i      adds r32 = 1, r32            // i = i + 1      (p9) br.ret.dpnt.many rp     // conditional return      .mib      // cycle 8      mov r8 = r15                 // new value of present      br .L22      .endp fib#

The cycle indications reflect how gcc anticipates that the first-generation Itanium processor we were using would be able to execute this code.

Notice that fib repetitively branches internally to a point just below its beginning, instead of setting up a fully recursive call to itself using the register stack to pass the three arguments to a child instance. This strategy also converts a pair of br.call and br.ret instructions (and, strictly, restoring the global pointer) into one br instruction per instance of (simulated) recursion.

The Itanium calling standards apply to procedure calling that involves external entry points, like that of fib itself, rather than internal repetition points, like .L22 in this optimized version of fib. Stated another way, the internal details of the fib function may achieve significant economies as long as the outer shell conforms to the calling standards. The program main is not supposed to know anything about the internal details of fib, only its interface parameters (i, prev, present) and the assurance that its returned value is as advertised.

It would seem that gcc does not trust the stability of either the actual location or the value of n in common storage, since it keeps fetching it anew. Such skepticism would be warranted in multithreaded applications. Yet if we gave this to you as a single-threaded problem to improve, you might move the "pointer-chasing" that consumes several machine cycles up above label .L22 and just keep n in register r2 for use in a shorter loop.