Section 10.6. Refinement 5: Pipelining the Main Loop

10.6. Refinement 5: Pipelining the Main Loop

Along with memory optimization, pipelining is also one of the most significant types of optimizations to consider when generating hardware. Pipelining can be applied to loops that require two or more cycles per iteration and that do not have dependencies between iterations. The DES loop processes independent blocks of data with no dependencies between iterations, so it should be possible to apply pipelining. To evaluate the potential use of pipelines, let's first look in more detail at how the process is currently being parallelized.

Using Stage Master Explorer

The Stage Master Explorer tool (available from Impulse) provides a way to see in detail how hardware is being generated from the C code. In refinement 5, we will try generating a pipeline to improve performance, but we first want to use the Stage Master Explorer to better understand how the current version is being mapped to hardware.

The screen image of Figure 10-1 shows the Stage Master Explorer display for the core DES computation. The center view in the application shows the input C code annotated with the cycle number of each operation. For example, line one in the display shows that the read of Ks[0] will be available in cycle two and that the new value of work will be assigned in cycle two. The second line shows the computation of left based on Spbox6[work & 63]. We saw that work is assigned in cycle two, so the index work & 63 is not available until cycle two, and thus the result of the read will be available in cycle three.

Figure 10-1. Stage Master Explorer.

Because all arrays have been split, we see that the loads from Spbox and Ks can be done simultaneously. As a result, each instantiation of the F macro can be done in a single cycle. There are 16 instantiations of the F macro, and each one depends on the previous one because work is computed from the value of left computed in the previous instance of F. Due to this inherent dependency, there is no way to reduce the cycle count below 16 cycles per block.

For any loop that requires more than one cycle per iteration, it is worth considering the possibility of using a pipeline. The next section takes a look at introducing a pipeline in the main DES loop.

The Goal of Pipelining

The goal of pipelining is to execute multiple iterations of a loop in parallel, much like the assembly line of a factory. The core DES computation contains 16 instantiations of the F macro, and these instantiations can be called F0, F1, ..., F16. Iteration F16 is dependent on F15, which is in turn dependent on F14, and so on so that F1 tHRough F16 must be executed sequentially. Imagine that F1 through F16 are stations in an assembly line processing a sequence of 64-bit blocks (the "raw material") and producing a sequence of 64-bit block "products" on the other end. At the start of the line, F1 receives block 1, does some processing, and passes the result to F2. While F2 is processing block 1, F1 is idle, so it can go ahead and start processing block 2. By the time block 1 makes it to station F16, station F1 is processing block 16 and all 16 stations are busy operating in parallel. The result of building such an assembly line is that we can generate one block of output every cycle, even though it still takes 16 cycles to process one block. The cost (continuing the assembly line metaphor) is that each station needs its own equipment for all of them to operate in parallel.

To generate a pipeline using Impulse C, we must insert the CO PIPELINE pragma inside the main loop:

 while (co_stream_read(blocks_in, &left, sizeof(unsigned long)) == co_err_none) {   #pragma CO PIPELINE  co_stream_read(blocks_in, &right, sizeof(unsigned long));  ...

Unfortunately, simply adding the pipeline pragma to the refinement 4 code generates a pipeline that produces only one result every 16 cycles. Why did the pipeline fail? The reason is each instance of the F macro requires access to the Spbox and Ks arrays, which prevents the macros from being executed in parallel.

In order to generate an effective pipeline in this application, each instance of F must have its own copy of the Spbox arrays. The Ks array could also be duplicated, but since the array is small we use another technique, which is to convert the Ks array into a register bank. Registers can be read by many sources in the same cycle. To implement Ks as a register bank, all indices to Ks must be constants. The resulting code appears as follows:

 #define F(l,r,key0,key1,sp0,sp1,sp2,sp3,sp4,sp5,sp6,sp7){\    work = ((r >> 4) | (r << 28)) ^ key0;\    l ^= sp6[work & 0x3f];\    l ^= sp4[(work >> 8) & 0x3f];\   l ^= sp2[(work >> 16) & 0x3f];\   l ^= sp0[(work >> 24) & 0x3f];\   work = r ^ key1;\    l ^= sp7[work & 0x3f];\    l ^= sp5[(work >> 8) & 0x3f];\    l ^= sp3[(work >> 16) & 0x3f];\    l ^= sp1[(work >> 24) & 0x3f];\  } ... F(left,right,Ks0[0],Ks1[0],   Spbox00,Spbox10,Spbox20,Spbox30,Spbox40,Spbox50,Spbox60,Spbox70); F(right,left,Ks0[1],Ks1[1],   Spbox01,Spbox11,Spbox21,Spbox31,Spbox41,Spbox51,Spbox61,Spbox71); F(left,right,Ks0[2],Ks1[2],   Spbox02,Spbox12,Spbox22,Spbox32,Spbox42,Spbox52,Spbox62,Spbox72); F(right,left,Ks0[3],Ks1[3],   Spbox03,Spbox13,Spbox23,Spbox33,Spbox43,Spbox53,Spbox63,Spbox73); F(left,right,Ks0[4],Ks1[4],   Spbox04,Spbox14,Spbox24,Spbox34,Spbox44,Spbox54,Spbox64,Spbox74); F(right,left,Ks0[5],Ks1[5],   Spbox05,Spbox15,Spbox25,Spbox35,Spbox45,Spbox55,Spbox65,Spbox75); F(left,right,Ks0[6],Ks1[6],   Spbox06,Spbox16,Spbox26,Spbox36,Spbox46,Spbox56,Spbox66,Spbox76); F(right,left,Ks0[7],Ks1[7],   Spbox07,Spbox17,Spbox27,Spbox37,Spbox47,Spbox57,Spbox67,Spbox77); F(left,right,Ks0[8],Ks1[8],   Spbox08,Spbox18,Spbox28,Spbox38,Spbox48,Spbox58,Spbox68,Spbox78); F(right,left,Ks0[9],Ks1[9],   Spbox09,Spbox19,Spbox29,Spbox39,Spbox49,Spbox59,Spbox69,Spbox79); F(left,right,Ks0[10],Ks1[10],   Spbox0a,Spbox1a,Spbox2a,Spbox3a,Spbox4a,Spbox5a,Spbox6a,Spbox7a); F(right,left,Ks0[11],Ks1[11],   Spbox0b,Spbox1b,Spbox2b,Spbox3b,Spbox4b,Spbox5b,Spbox6b,Spbox7b); F(left,right,Ks0[12],Ks1[12],   Spbox0c,Spbox1c,Spbox2c,Spbox3c,Spbox4c,Spbox5c,Spbox6c,Spbox7c); F(right,left,Ks0[13],Ks1[13],   Spbox0d,Spbox1d,Spbox2d,Spbox3d,Spbox4d,Spbox5d,Spbox6d,Spbox7d); F(left,right,Ks0[14],Ks1[14],   Spbox0e,Spbox1e,Spbox2e,Spbox3e,Spbox4e,Spbox5e,Spbox6e,Spbox7e); F(right,left,Ks0[15],Ks1[15],   Spbox0f,Spbox1f,Spbox2f,Spbox3f,Spbox4f,Spbox5f,Spbox6f,Spbox7f);

As a result of this change, each instance of F has its own copy of the Spbox data, and Ks0 and Ks1 will be converted to registers by the compiler.

Running the compiler now results in a 20-stage pipeline that produces one block every two cycles. Two cycles are required because the 32-bit communication stream requires two cycles to transfer the blocks to and from the CPU. If a 64-bit interface were available, the pipeline would generate a single block every cycle.

The overall system performance is now 425 times faster than the software implementation. The size of the hardware has also increased to over 3,000 slices due to the duplicated resources required to implement a pipeline.

Tip

Pipelining can, when combined with other optimizations, result in significantly increased process throughput. Pipelining will, however, increase the size and complexity of the generated hardware. Pipelining should therefore be deferred until the application has been verified and other source-level optimizations have been used.

Need to assert more control over optimization?

Although this chapter focuses on the use of C language coding styles to improve the outcome of C code optimization and hardware generation, there are also methods (such as the StageDelay pragma described elsewhere) that can be used to more directly specify how logic is generated for a given set of C statements. In particular, the Impulse C compiler includes various flags for specifying hardware generation options and the Impulse library itself includes additional functions, including the co_par_break and co_array_config functions that allow you to specify actual clock cycle boundaries, specify memory types, and otherwise direct the operation of the optimizer and HDL generator. Refer to the latest Impulse C documentation for more information. Also be sure to visit the Impulse C discussion forums at www.ImpulseC.com/forums.