8.5 HORNER: Evaluating a Polynomial | ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

The evaluation of polynomials is essential to implementations of trigonometric functions and other transcendental functions. In practice, simple power series expressions for any F(x) are replaced by Chebyshev polynomial expansions in order to attain the smoothest possible distribution of discrepancies between the approximation and the exact function across the whole range of x (see Markstein).

Consider the operations required to evaluate a third-order polynomial involving real numbers:

P(x) = 1.0 + 0.5 x + 0.25 x² + 0.125 x³

that we might directly transcribe from algebra into pseudocode:

 sum = 1.0 power = x tmp = 0.5 * power sum = sum + tmp power = power * x tmp = 0.25 * power sum = sum + tmp power = power * x tmp = 0.125 * power sum = sum + tmp

requiring about ten instructions, not counting additional instructions that the Itanium ISA, which does not support immediate constants within floating-point instructions, would require to obtain successive floating-point coefficients from a stored list.

Now consider a different but mathematically equivalent expression for the same cubic polynomial:

P(x) = 1.0 + x { 0.5 + x [ 0.25 + x (0.125) ] }

P(x) = { [ (0.125) x + 0.25 ] x + 0.5 } x + 1.0

This factored form readily evidences the great advantage of a fused multiply add instruction. Transforming polynomials into this factored form (Knuth vol. 2) is known as Horner's method or rule, acknowledging the work of William George Horner (1786 1837).

We present in Figure 8-1 a simple Itanium program that evaluates a cubic polynomial. Running the HORNER program under control of a debugger quickly demonstrates that the algorithm is correct:

 L> gcc -Wall -O0 -o bin/horner horner.s L> gdb bin/horner [messages deleted here] (gdb) break done Breakpoint 1 at 0x40000000000005b0 (gdb) run Starting program: /house/user/bin/horner Breakpoint 1, 0x40000000000005b0 in done () (gdb) p/f $f7 $1 = 1.032257080078125 (gdb) x/gf &P 0x6000000000000778 <P>: 1.032257080078125 (gdb) q The program is running.  Exit anyway? (y or n) y L>

The expected value is found to be (0.125)(0.0625)³ + (0.25)(0.0625)² + (0.5)(0.0625) + 1.0 = 1.032257080078125.

Figure 8-1 HORNER: Illustrating the fused multiply add instruction

 // HORNER        Polynomial Evaluation         .data                     // Declare storage         .align   8                // Desired alignment C:      real8    0.125,0.25,0.5,1.0  // Coefficients X:      real8    0.0625           // Argument P:      .skip    8                // Function value         .text                     // Section for code         .align   32               // Desired alignment         .global  main             // These three lines         .proc    main             //  mark the mandatory main:                             //   'main' program entry                                   // Now we really begin... first:  addl     r14 = @gprel(C),gp    // r14 -> coefficients         addl     r15 = @gprel(X),gp    // r15 -> argument         addl     r16 = @gprel(P),gp;;  // r16 -> X         ldfd     f6 = [r15]       // Get argument         ldfd     f7 = [r14],8;;   // Initial coefficient         ldfd     f8 = [r14],8;;   // Next coefficient         fma      f7 = f7,f6,f8    // First partial sum         ldfd     f8 = [r14],8;;   // Next coefficient         fma      f7 = f7,f6,f8;;  // Next partial sum         ldfd     f8 = [r14];;     // Last coefficient         fma      f7 = f7,f6,f8    // Final result         stfd     [r16] = f7       // Put in memory form done:   mov      ret0 = 0         // Signal all is normal         br.ret.sptk.many b0;;     // Back to command line         .endp   main              // Mark end of procedure

The body of the HORNER program requires a mere dozen lines to initialize pointers, obtain each coefficient, and store the result. Such conciseness would not be possible in an architecture without a fused multiply add instruction.