The evaluation of polynomials is essential to implementations of trigonometric functions and other transcendental functions. In practice, simple power series expressions for any F(x) are replaced by Chebyshev polynomial expansions in order to attain the smoothest possible distribution of discrepancies between the approximation and the exact function across the whole range of x (see Markstein). Consider the operations required to evaluate a third-order polynomial involving real numbers:
that we might directly transcribe from algebra into pseudocode: sum = 1.0 power = x tmp = 0.5 * power sum = sum + tmp power = power * x tmp = 0.25 * power sum = sum + tmp power = power * x tmp = 0.125 * power sum = sum + tmp requiring about ten instructions, not counting additional instructions that the Itanium ISA, which does not support immediate constants within floating-point instructions, would require to obtain successive floating-point coefficients from a stored list. Now consider a different but mathematically equivalent expression for the same cubic polynomial:
or
This factored form readily evidences the great advantage of a fused multiply add instruction. Transforming polynomials into this factored form (Knuth vol. 2) is known as Horner's method or rule, acknowledging the work of William George Horner (1786 1837). We present in Figure 8-1 a simple Itanium program that evaluates a cubic polynomial. Running the HORNER program under control of a debugger quickly demonstrates that the algorithm is correct: L> gcc -Wall -O0 -o bin/horner horner.s L> gdb bin/horner [messages deleted here] (gdb) break done Breakpoint 1 at 0x40000000000005b0 (gdb) run Starting program: /house/user/bin/horner Breakpoint 1, 0x40000000000005b0 in done () (gdb) p/f $f7 $1 = 1.032257080078125 (gdb) x/gf &P 0x6000000000000778 <P>: 1.032257080078125 (gdb) q The program is running. Exit anyway? (y or n) y L> The expected value is found to be (0.125)(0.0625)3 + (0.25)(0.0625)2 + (0.5)(0.0625) + 1.0 = 1.032257080078125. Figure 8-1 HORNER: Illustrating the fused multiply add instruction// HORNER Polynomial Evaluation .data // Declare storage .align 8 // Desired alignment C: real8 0.125,0.25,0.5,1.0 // Coefficients X: real8 0.0625 // Argument P: .skip 8 // Function value .text // Section for code .align 32 // Desired alignment .global main // These three lines .proc main // mark the mandatory main: // 'main' program entry // Now we really begin... first: addl r14 = @gprel(C),gp // r14 -> coefficients addl r15 = @gprel(X),gp // r15 -> argument addl r16 = @gprel(P),gp;; // r16 -> X ldfd f6 = [r15] // Get argument ldfd f7 = [r14],8;; // Initial coefficient ldfd f8 = [r14],8;; // Next coefficient fma f7 = f7,f6,f8 // First partial sum ldfd f8 = [r14],8;; // Next coefficient fma f7 = f7,f6,f8;; // Next partial sum ldfd f8 = [r14];; // Last coefficient fma f7 = f7,f6,f8 // Final result stfd [r16] = f7 // Put in memory form done: mov ret0 = 0 // Signal all is normal br.ret.sptk.many b0;; // Back to command line .endp main // Mark end of procedure The body of the HORNER program requires a mere dozen lines to initialize pointers, obtain each coefficient, and store the result. Such conciseness would not be possible in an architecture without a fused multiply add instruction. |