Section 10.5. Refinement 4: Loop Unrolling



Refinement 1 began by reducing the size of the generated hardware by recoding some repeated code as a loop. Up to refinement 3, we considered performance improvements that kept the size fairly constant and in the end achieved an overall system speedup of 48X over the initial software-based prototype. In the remaining sections we will abandon our attempts to maintain the size and push instead for the maximum performance, as measured in cycle counts.

Refinement 1 introduced a loop and consequently some overhead to the core DES computation in favor of the reduced size. The performance loss was quite modest when considered in isolation; however, that was before optimizing the statements that make up the body of the new loop. Let's see what impact loop unrolling will have now, after the substantial optimizations of refinements 2 and 3. The following is how the code appears after unrolling the inner loop introduced in refinement 1:

 F(left,right,Ks0[0],Ks1[0]);   F(right,left,Ks0[1],Ks1[1]);
 F(left,right,Ks0[2],Ks1[2]);   F(right,left,Ks0[3],Ks1[3]);
 F(left,right,Ks0[4],Ks1[4]);   F(right,left,Ks0[5],Ks1[5]);
 F(left,right,Ks0[6],Ks1[6]);   F(right,left,Ks0[7],Ks1[7]);
 F(left,right,Ks0[8],Ks1[8]);   F(right,left,Ks0[9],Ks1[9]);
 F(left,right,Ks0[10],Ks1[10]); F(right,left,Ks0[11],Ks1[11]);
 F(left,right,Ks0[12],Ks1[12]); F(right,left,Ks0[13],Ks1[13]);
 F(left,right,Ks0[14],Ks1[14]); F(right,left,Ks0[15],Ks1[15]);

Regenerating hardware using the Impulse C tools on this new revision produces a hardware implementation that requires a little over 2,000 slices and performs nearly 80 times faster than the software implementation.
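To make the transformation concrete, the following is a minimal sketch that can be compiled with an ordinary desktop C compiler. It places an assumed form of the refinement 1 loop (eight iterations of two rounds each) next to the hand-unrolled sequence and checks that both compute the same result. The names toy_round, rounds_rolled, rounds_unrolled, and rolled_equals_unrolled are hypothetical, and the F macro here is only a toy stand-in with the same call shape as the real DES round; it is not the F defined earlier in the chapter.

```c
#include <stdint.h>

/* Toy stand-in for the chapter's DES round logic (an assumption, not the
   real S-box computation): mixes r with two subkey words. */
static uint32_t toy_round(uint32_t r, uint32_t k0, uint32_t k1)
{
    uint32_t x = r ^ k0;
    x = (x << 1) | (x >> 31);   /* rotate left by one bit */
    return x + k1;              /* unsigned add wraps modulo 2^32 */
}

/* Same call shape as the F used in the text: updates its first argument. */
#define F(l, r, k0, k1) ((l) ^= toy_round((r), (k0), (k1)))

/* Assumed shape of the refinement 1 loop: eight iterations, two rounds each. */
static void rounds_rolled(uint32_t *left, uint32_t *right,
                          const uint32_t *Ks0, const uint32_t *Ks1)
{
    uint32_t l = *left, r = *right;
    int i;
    for (i = 0; i < 16; i += 2) {
        F(l, r, Ks0[i], Ks1[i]);
        F(r, l, Ks0[i + 1], Ks1[i + 1]);
    }
    *left = l;
    *right = r;
}

/* Hand-unrolled form: the loop body copied eight times, constant indices. */
static void rounds_unrolled(uint32_t *left, uint32_t *right,
                            const uint32_t *Ks0, const uint32_t *Ks1)
{
    uint32_t l = *left, r = *right;
    F(l, r, Ks0[0], Ks1[0]);   F(r, l, Ks0[1], Ks1[1]);
    F(l, r, Ks0[2], Ks1[2]);   F(r, l, Ks0[3], Ks1[3]);
    F(l, r, Ks0[4], Ks1[4]);   F(r, l, Ks0[5], Ks1[5]);
    F(l, r, Ks0[6], Ks1[6]);   F(r, l, Ks0[7], Ks1[7]);
    F(l, r, Ks0[8], Ks1[8]);   F(r, l, Ks0[9], Ks1[9]);
    F(l, r, Ks0[10], Ks1[10]); F(r, l, Ks0[11], Ks1[11]);
    F(l, r, Ks0[12], Ks1[12]); F(r, l, Ks0[13], Ks1[13]);
    F(l, r, Ks0[14], Ks1[14]); F(r, l, Ks0[15], Ks1[15]);
    *left = l;
    *right = r;
}

/* Returns 1 when both forms produce the same (left, right) pair. */
int rolled_equals_unrolled(void)
{
    uint32_t Ks0[16], Ks1[16];
    uint32_t l1 = 0x01234567u, r1 = 0x89ABCDEFu;
    uint32_t l2 = l1, r2 = r1;
    int i;
    for (i = 0; i < 16; i++) {          /* arbitrary test subkeys */
        Ks0[i] = 0x9E3779B9u * (uint32_t)(i + 1);
        Ks1[i] = 0x85EBCA6Bu ^ (uint32_t)i;
    }
    rounds_rolled(&l1, &r1, Ks0, Ks1);
    rounds_unrolled(&l2, &r2, Ks0, Ks1);
    return l1 == l2 && r1 == r2;
}
```

The two functions differ only in control structure: the unrolled version has no loop counter, comparison, or branch, which is the overhead that unrolling removes. In software the difference is small, but in generated hardware that loop control typically costs extra cycles on every iteration.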

Tip

As this example shows, the overhead introduced by a loop can be significant when the loop body is small.


Note that, rather than duplicating the loop body eight times and substituting constants for array index I as we have done here, you could use the Impulse C UNROLL pragma, which does essentially the same thing for loops whose iteration counts are constant and known at compile time. The UNROLL pragma performs the unrolling for you as a preprocessing step, before other optimizations are applied. For this example, however, it is more convenient to unroll the loop by hand as a prelude to refinement 5.
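As a rough illustration of the pragma-based alternative, the sketch below marks an assumed form of the loop for unrolling. The `#pragma CO UNROLL` spelling and its placement at the top of the loop body follow common Impulse C usage but should be checked against the tool documentation for your release. The function name rounds_with_unroll_pragma and the mix_round helper are hypothetical, and F is again a toy stand-in for the real DES round. A desktop C compiler does not recognize the pragma and ignores it (possibly with a warning), so the code still runs as an ordinary loop there.

```c
#include <stdint.h>

/* Toy stand-in for the DES round logic (an assumption for illustration). */
static uint32_t mix_round(uint32_t r, uint32_t k0, uint32_t k1)
{
    uint32_t x = r ^ k0;
    x = (x << 1) | (x >> 31);   /* rotate left by one bit */
    return x + k1;              /* unsigned add wraps modulo 2^32 */
}

/* Same call shape as the F used in the text: updates its first argument. */
#define F(l, r, k0, k1) ((l) ^= mix_round((r), (k0), (k1)))

/* Loop with constant bounds, marked for unrolling by the hardware compiler.
   Pragma spelling and placement are assumed, not taken from this section. */
void rounds_with_unroll_pragma(uint32_t *left, uint32_t *right,
                               const uint32_t *Ks0, const uint32_t *Ks1)
{
    uint32_t l = *left, r = *right;
    int i;
    for (i = 0; i < 16; i += 2) {
#pragma CO UNROLL
        F(l, r, Ks0[i], Ks1[i]);
        F(r, l, Ks0[i + 1], Ks1[i + 1]);
    }
    *left = l;
    *right = r;
}
```

Because the loop bounds and stride are compile-time constants, every array index becomes a constant after unrolling, which is the precondition the text describes for using the pragma.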



    Practical FPGA Programming in C
    ISBN: 0131543180
    Year: 2005
    Pages: 208
