Section 6.7. Making Efficient Use of the Optimizers

When writing Impulse C processes, it is important to have a basic understanding of how C code is parallelized by the compiler and optimizer during hardware generation so you can achieve the best possible results, both in terms of execution speed and the size of the resulting logic.

In later chapters we will explore specific techniques for writing good, optimizable C code for hardware generation. In this section we'll describe in general terms how the Impulse C optimizer performs its job and describe some of the current constraints of writing C for hardware generation.

The Stage Master Optimizer

The Impulse C optimizer (Stage Master) works at the level of individual blocks of expanded C code that are generated by the compiler, such as those found within a loop body. For each block of C code, the optimizer attempts to create the minimum number of instruction stages by scheduling together instructions that do not have conflicting data dependencies.

If pipelining is enabled (via the PIPELINE pragma, which is added within the body of a loop and applies only to that loop), such identified stages occur in parallel. If no pipelining is possible, the stages are generated sequentially. Note that all statements within a stage are implemented in combinational logic operating within a single clock cycle.

To get maximum benefit from the optimizer, you should keep in mind the types of statements that will result in a new instruction stage being created. These include:

A control statement such as an if test or a loop.
Any access (read or write) to a memory or array that is already being addressed in the current stage.
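The two rules above can be made concrete with a short sketch. The function name sum_pairs and the array size are hypothetical, and the stage comments are our reading of the rules as stated, not output from the Impulse C tools:

```c
#include <stdint.h>

/* Hypothetical example: accumulates the absolute difference between
   neighbouring elements. The "new stage" comments apply the two rules
   from the text by hand; they are an illustration, not compiler output. */
int32_t sum_pairs(const int32_t data[8])
{
    int32_t total = 0;
    int32_t a, b;
    int i;

    for (i = 0; i < 7; i++) {   /* loop control: new stage */
        a = data[i];            /* first access to data[] in this stage */
        b = data[i + 1];        /* second access to data[]: new stage */
        if (a > b)              /* control statement: new stage */
            total += a - b;
        else
            total += b - a;
    }
    return total;
}
```

Reading the loop this way suggests where stages might be saved: fewer accesses to data[] per iteration and fewer control statements should leave the optimizer with fewer forced stage boundaries.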

In the case of a loop, pipelining attempts to execute multiple iterations of the loop in parallel without duplicating any code.

For example, this loop:

 for (i=0; i<10; i++)
     sum += i << 1;

results in one stage (with one adder and one shifter) being executed ten times. When not pipelined, the computation time is at least ten multiplied by the sum of the shifter and adder delays, or 10 x (delay(shifter) + delay(adder)). If we enable pipelining, however:

 for (i=0; i<10; i++) {
 #pragma CO PIPELINE
     sum += i << 1;
 }

the result for the same loop is that two stages, representing the shifter and the adder, are executed concurrently, with the computation time being approximately 10 x max(delay(shifter), delay(adder)).
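To see what the two estimates mean in practice, the following sketch simply evaluates them. The function names and the delay values used with them are arbitrary illustration units of our own, not measured figures, and the pipelined estimate ignores the time needed to fill the pipeline, just as the approximation above does:

```c
/* Evaluates the two timing estimates from the text.
   n       : number of loop iterations
   d_shift : delay of the shifter (arbitrary units, assumed)
   d_add   : delay of the adder   (arbitrary units, assumed) */

/* Nonpipelined: each iteration pays for the shifter and then the adder. */
unsigned sequential_time(unsigned n, unsigned d_shift, unsigned d_add)
{
    return n * (d_shift + d_add);
}

/* Pipelined: the two stages overlap, so each iteration is paced by the
   slower stage. Pipeline fill time is ignored. */
unsigned pipelined_time(unsigned n, unsigned d_shift, unsigned d_add)
{
    unsigned slowest = (d_shift > d_add) ? d_shift : d_add;
    return n * slowest;
}
```

With ten iterations, a shifter delay of 2 units, and an adder delay of 3 units, the estimates are 10 x (2 + 3) = 50 units without pipelining and 10 x 3 = 30 units with it.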

Note that pipelining is not automatic and requires an explicit declaration in your C source code, as described earlier:

 #pragma CO PIPELINE

This declaration must be included within the body of the loop and prior to any statements that are to be pipelined.

Note

If used, the PIPELINE pragma should be inserted before the first statement in a loop. If you attempt to pipeline only part of a loop (for example, by inserting the PIPELINE pragma in the middle of the loop), the generated hardware will be larger but will probably not perform any better than the nonpipelined equivalent.

Instruction Scheduling and Assignments

The instruction scheduler that organizes statements within a block or stage will always produce correct results (it will not break the logic of your C source code), if necessary by introducing stalls in the pipeline. To minimize such stalling, you should consider where you are making assignments and reduce the number of dependent assignments within a section of code. In some instances you may find that adding one or more intermediate variables to read and store smaller pieces of data (such as array elements) at the start of a process results in less stage delay. You should also make assignments to any recursive (self-referencing) variable as early as possible in the body of a loop, and make other references to such variables as late as possible in the loop body, so that delays introduced in one stage of the loop do not overly impact later stages.
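The sketch below shows one way to apply this advice in ordinary C. The function running_scaled_sum and its arguments are hypothetical, and the ordering shown is our interpretation of the guideline, not a layout prescribed by the tools:

```c
#include <stdint.h>

/* Hypothetical example: writes x*3 plus the running total to out[]
   and returns the final running total. */
int32_t running_scaled_sum(const int32_t *in, int32_t *out, int n)
{
    int32_t acc = 0;      /* recursive (self-referencing) variable */
    int32_t x, scaled;
    int i;

    for (i = 0; i < n; i++) {
        x = in[i];               /* read once into an intermediate variable */
        acc = acc + x;           /* recursive assignment made early */
        scaled = x * 3;          /* independent work that does not use acc */
        out[i] = scaled + acc;   /* other reference to acc placed late */
    }
    return acc;
}
```

Here the only statement that both reads and writes acc sits near the top of the loop body, and the one other use of acc is the last statement, which leaves the independent multiplication free to be scheduled alongside either.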

Impacts of Memory Access

An important consideration when writing your inner code loops for maximum parallelism is data dependencies. In particular, the optimizer cannot parallelize stages that access the same bank of memory (whether expressed as an array or using memory block read and block write functions). For this reason you may want to move subregions of a large array into local storage (local variables or smaller arrays) before performing multiple, otherwise parallel computations on the local data. Doing so will allow the optimizer to parallelize stages more efficiently, with a small trade-off of extra assignments that may be required. This aspect of C code optimization is explored in greater detail in Chapter 10.
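As a small illustration of moving data into local storage, consider a three-element moving sum. Both functions below are hypothetical examples of our own; the first reads the input array three times per iteration, while the second keeps the two previous samples in local variables so that each iteration reads the array only once:

```c
#include <stdint.h>

/* Three accesses to in[] per iteration. By the rule described above,
   these accesses compete for the same memory and cannot share a stage. */
void moving_sum3_naive(const int32_t *in, int32_t *out, int n)
{
    int32_t a, b, c;
    int i;

    for (i = 0; i < n; i++) {
        a = in[i];
        b = (i >= 1) ? in[i - 1] : 0;
        c = (i >= 2) ? in[i - 2] : 0;
        out[i] = a + b + c;
    }
}

/* One access to in[] per iteration. The previous two samples are held
   in local variables, at the cost of two extra assignments. */
void moving_sum3_local(const int32_t *in, int32_t *out, int n)
{
    int32_t t0 = 0, t1 = 0, t2 = 0;
    int i;

    for (i = 0; i < n; i++) {
        t2 = t1;                 /* shift the local copies */
        t1 = t0;
        t0 = in[i];              /* the only read of in[] */
        out[i] = t0 + t1 + t2;   /* computed entirely from local data */
    }
}
```

The two versions produce identical results; the second simply trades extra assignments for fewer accesses to the same memory, which is the trade-off described above.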