10.7 Fibonacci Numbers | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

Mathematical modeling is not a recent development, as Knuth makes plain in retelling the inquiry, first written by Fibonacci in the year 1202, "How many pairs of rabbits can be produced from a single pair in a year's time?" If the rabbits never die, then the population may grow in each reproductive generation to levels proportional to the integer sequence

1, 1, 2, 3, 5, 8, 13, 21, 34, …

known as the Fibonacci numbers. Centuries later the release of European rabbits in Australia, in the absence of effective predation, indeed led to a runaway population. Modern population biologists of course consider in their mathematical models such additional factors as food supply, predation, and death.

How can the sequence of Fibonacci numbers be characterized? First, the correspondence between the index position n and each Fibonacci number F_n in the sequence goes as follows:

n	1	2	3	4	5	6	7	8	9	…
F_n	1	1	2	3	5	8	13	21	34	…

Second, if we notice that a Fibonacci number is equal to the sum of the two preceding Fibonacci numbers, we can propose a formal definition

F₁ = 1 and F₂ = 1 and F_n = F_n₁ + F_n₂ for n > 2

This definition contains a recurrence relationship, by means of which a new member of the sequence can be computed from knowledge of previous members in the sequence.

An algorithm based on such a recurrence relationship can usually be written most obviously and readily using recursion:

 if n = 1 or 2 then F(n) := 1 else F(n) := F(n-1) + F(n-2) end if

This pseudocode directly implements the recurrence relationship for the Fibonacci numbers.

We now proceed to write and analyze three different implementations of a function, FIBx (n), that will report back in register ret0 the value of the nth number in the sequence of Fibonacci numbers. We will link three variant functions (FIB1, FIB2, and FIB3) to a C main program (TESTFIB) for demonstration and testing.

10.7.1 FIB1: Function Using Recursion

Figure 10-5 presents FIB1, the first variant of a function to compute the nth member of the sequence of Fibonacci numbers. This function derives directly from the defining recurrence relation.

As mentioned previously, ins and locals comprise a single physical pool of local storage for a function; all of these will be saved by register rotation across function calls. Since there is no long-term use for n, the input argument to this function, there is no reason not to use in0 for a local variable.

We elect to have the argument n passed by value i.e., as an unsigned quad word value in register in0. Because each level of recursion calls the function FIB1 twice in succession, one local variable in0 is used to retain the value of n 1 while F(n 1) is being computed and a second local variable loc3 is used to retain the value of F(n 1) while F(n 2) is being computed.

If we call FIB1 from a C program for n = 4, a value beyond 2, this first invocation will set up two successive self-calls to compute F(4 1) = F(3) and F(4 2) = F(2) = 1. Of these two calls, the latter can directly return a value but the former must set up two more self-calls, to compute F(3 1) = F(2) = 1 and F(3 2) = F(1) =1. All told, the entire course of recursion includes those four internal self-calls in addition to the initiating outside call from the C program.

Figure 10-5 FIB1: Computing Fibonacci numbers using recursion

 // FIB1          Recursive calls for Fibonacci numbers          .text                    // Section for code          .align  32               // Desired alignment          .global fib1             // These three lines          .proc   fib1             //  mark the mandatory fib1:                             //   function entry          .prologue 12,r32         // Mask for rp, ar.pfs only          alloc   loc0 = ar.pfs,1,4,1,0  // ins, locals, outs          .save   rp,loc1          // Must save return address          mov     loc1 = b0        //  to our caller          .body                    // Now we really begin... first:   mov     ret0 = 1         // Default value          cmp.geu p6,p0 = 2,in0    // Set F(N)=1    (p6)  br.cond.spnt.few done    //  if N=1 or 2          add     in0 = -1,in0;;   // Compute N-1, save it,          mov     out0 = in0       //  and use as argument          mov     loc2 = gp        // Save gp          br.call.sptk.many b0 = fib1  // Compute F(N-1)          mov     gp = loc2        // Restore gp          mov     loc3 = ret0      // Remember F(N-1)          add     out0 = -1,in0    // Compute N-2 as argument          br.call.sptk.many b0 = fib1  // Compute F(N-2)          mov     gp = loc2        // Restore gp          add     ret0 = loc3,ret0 // Return F(n-1) + F(n-2) done:    mov     b0 = loc1        // Restore return address          mov     ar.pfs = loc0    // Restore caller's ar.pfs          br.ret.sptk.many b0;;    // Back to caller          .endp   fib1             // Mark end of procedure

FIB1 does not explicitly contribute anything to the data section. It is axiomatic that the data section of a recursive routine should usually contain just read-only data. Any label and address in the data section will refer to only one information unit of storage, which would be seen and shared by every recursive invocation of the routine. Local variables, in contrast, need to be different for each recursive invocation.

In the simple recursive function, FIB1, each separate self-call involves certain aspects of overhead, although some steps may seem unnecessary:

executing the br.call instruction;
allocating the local variables;
saving and restoring the caller's context;
saving and resetting the global pointer; and
executing the br.ret instruction.

The defined calling convention advocates strict adherence to this scheme at the source-code level. Every step is essential in real-world algorithmic solutions.

Even though one ought not to circumvent any of this call overhead, you may well wonder whether we could reduce its adverse impact on performance. This speculation is pursued as one of the exercises for this chapter.

10.7.2 FIB2: Function Without Recursion

Figure 10-6 presents FIB2, the second variant of a function to compute the nth member of the sequence of Fibonacci numbers. This function derives directly from the sequence itself, requiring only successive addition operations until the sought-for term is reached.

We elect to have the argument n passed by value i.e., as an unsigned quad word value in register in0. No local variables are required to preserve return addresses or intermediate values.

Like FIB1, the FIB2 function handles the first two Fibonacci numbers specially with an immediately returned value of 1. In general, each Fibonacci number is computed by direct addition of adjacent pairs of values, starting from the first two:

F₃ = F₁ + F₂

Figure 10-6 FIB2: Computing Fibonacci numbers without recursion

 // FIB2          Fibonacci numbers without recursion         .text                     // Section for code         .align   32               // Desired alignment         .global  fib2             // These three lines         .proc    fib2             //  mark the mandatory fib2:                             //   function entry         .regstk  1,0,0,0          // Claim in0         .body                     // Now we really begin... first:  mov      ret0 = 1         // Default value         add      in0 = -2,in0;;   // Number of additions         cmp.le   p6,p0 = in0,r0   // Set F(N)=1    (p6) br.cond.spnt.few done     //  if N=1 or 2         mov      r9 = ret0        // r9 = F(n-2) loop:   add      in0 = -1,in0     // Count down...         mov      r10 = ret0;;     // Old F(n) -> F(n-1)         add      ret0 = r9,ret0   // ret0 = newest F(n)         mov      r9 = r10         // Old F(n-1) -> F(n-2)         cmp.gt   p6,p0 = in0,r0   //  ...until finished    (p6) br.cond.sptk.few loop     // ret0 = F(N) sought done:   br.ret.sptk.many b0;;     // Back to caller         .endp    fib2             // Mark end of procedure

Each number in the sequence must be retained long enough to be used twice in such additions. That requirement is met using the instruction that hands over F (n 1) in register r10 from one cycle to become F (n 2) in register r9 for the next cycle. The loop in FIB2 that performs additions must run n 2 times to compute F (n).

In contrast to FIB1, the FIB2 function does not compute early members of the sequence over and over. This improved efficiency dramatically reduces execution time, as we demonstrate after introducing a third variant.

10.7.3 FIB3: Function Using the Register Stack

Although FIB2 offers good performance, its loop with six instructions in two bundles requires two machine cycles per traversal on any Itanium processor, on account of the RAW dependencies on registers r10, ret0, and in0. While an Itanium 2 processor would have enough execution units to accommodate the six instructions in parallel, two instruction groups, and therefore two machine cycles, are still needed because of the dependencies.

Earlier in this chapter, we showed that the Itanium register stack can "cover" various latencies. Here, we can show very simply the potential of the register stack to reduce dependencies in certain favorable circumstances. The Itanium loop count register does not incur the same kind of self-dependency as a counter implemented with addition or subtraction operations on a general register. Moreover, three rotating registers can very neatly handle the simultaneous presence of F_n, F_n₁, and F_n₂.

Figure 10-7 presents FIB3, a third variant of a function for computing the nth member of the sequence of Fibonacci numbers. The successive addition operations are carried out in a single-cycle loop using Itanium rotating registers. No rotating predicate registers are needed in this particular instance.

Notice that the RAR dependency involving in0 within a single instruction group will produce just what we want: The compare should be based on N, but the loop count register should be initialized as (N 2) 1 = N 3.

When a br.ctop instruction falls through, a last rotation of the registers has already occurred. Thus the desired result must be retrieved using the register name r34, not r33, in FIB3.

Figure 10-7 FIB3: Computing Fibonacci numbers using the register stack

 // FIB3          Fibonacci numbers using the register stack          .text                    // Section for code          .align  32               // Desired alignment          .global fib3             // These three lines          .proc   fib3             //  mark the mandatory fib3:                             //   function entry          .prologue 4,r9           // Mask for ar.pfs only          alloc   r9=ar.pfs,1,7,0,8// ins, locs, outs, rots          .body                    // Now we really begin...          mov     r34 = 1          // r34 = F(n-1)          cmp.geu p6,p0 = 2,in0    // Set F(N)=1...          add     in0 = -3,in0     // Loop count          mov     ar.ec = 1        // No epilog phase    (p6)  br.cond.spnt.few done;;  //  ...if N=1 or 2          mov     r35 = 1          // r35 = F(N-2)          mov     ar.lc = in0;;    // Set the loop count loop:    add     r33 = r34,r35    // r33 = newest F(n)          br.ctop.sptk.few loop;;  // r34 = F(N) sought done:    mov     ret0 = r34       // Return F(N)          br.ret.sptk.many b0;;    // Back to caller          .endp   fib3             // Mark end of procedure

10.7.4 TESTFIB: Showing the Cost of Recursion

In order to test the relative performance of the three functions for computing Fibonacci numbers, we can use a short C program (Figure 10-8). For dramatic effect, we arrange to call the variant functions in the order from fastest (FIB3) to slowest (FIB1).

When this program is run, the values computed by FIB3 and FIB2 are printed immediately, no matter what number along the sequence is sought. If the entered value of n is large enough (say 40+), the value computed by FIB1 is printed only after a detectable pause. In order to make these comparisons semiquantitative, the TESTFIB program calls the C function clock in much the way one uses a stopwatch to time a race.

The clock function returns the amount of processor time used by the process that calls it. There is an implementation-dependent parameter CLOCKS_PER_SEC that can be used to convert the returned value to seconds. For both the Linux and HP-UX programming environments, the unit of time is one microsecond (CLOCKS_PER_SEC=1000000), but for HP-UX the fractional microseconds will always be seen as zero.

Running this test program shows the cost of recursion very clearly in this extreme case where the call overhead (including saving and restoring things) represents a large proportion of the total routine length. If the body of a recursive routine were much bigger in some other example, the relative effect of the overhead would then seem smaller.

Before you conclude that recursion is a bad concept, however, please pause to consider that perhaps a given recursive routine could be made more efficient but still remain recursive in character. We will return to that thought near the end of Chapter 11.

Software engineering can sometimes have a bigger immediate effect than hardware engineering. That is, we may be able to speed up a calculation two-fold today by modifying the implementation of a given algorithm. If we were already using the fastest computer we could obtain or afford, we might have to wait some number of years for an equivalent hardware speed-up.

Figure 10-8 TESTFIB: Calling program for FIB1, FIB2, and FIB3

 /*  This C calling program will be linked to the     functions that generate Fibonacci numbers. */ #include <stdio.h> #include <time.h> main () {     unsigned long long fib1(), fib2(), fib3(), Fn, n;     clock_t t;     if ( sizeof(long int) != 8 ) {        printf( "TESTFIB requires compiling with +DD64\n" );        return(1);     }     printf("Which number? ");     scanf("%lld",&n);     t = clock();     Fn = fib3(n);     t = clock() - t;     printf("FIB3 gives %llu, timing %llu\n",Fn,t);     t = clock();     Fn = fib2(n);     t = clock() - t;     printf("FIB2 gives %llu, timing %llu\n",Fn,t);     t = clock();     Fn = fib1(n);     t = clock() - t;     printf("FIB1 gives %llu, timing %llu\n",Fn,t);     printf("CLOCKS_PER_SEC = %ld\n", CLOCKS_PER_SEC);     return 0; }