### 7.4. Limiting Instruction Stages

To get maximum benefit from the optimizer, you should keep in mind the types of statements that will result in a new instruction stage being created. These statements include:

• A control statement such as an if test or a loop

• Any access (read or write) to a memory or array that is already being addressed in the current stage

Tip

To the greatest extent practical, you should reduce the use of unnecessary control statements and memory accesses to reduce the number of instruction stages.

#### Reduce Memory Accesses for Higher Performance

An important consideration when writing your inner code loops for maximum parallelism is data dependencies. In particular, the optimizer cannot parallelize stages that access the same "bank" of memory (whether expressed as an array or using memory block read and block write functions). For this reason you may want to move subregions of a large array into local storage (local variables or smaller arrays) before performing multiple, otherwise parallel computations on the local data. Doing so will allow the optimizer to parallelize stages more efficiently, with a small trade-off of extra assignments that may be required.
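As a sketch of this idea (the array name, sizes, and coefficients here are illustrative, not taken from a specific example in this chapter), a small window of a shared array can be copied into local variables before the parallel arithmetic is performed:

```c
/* Hypothetical sketch of moving data into local storage. The array
   "pixels" stands in for a large array held in a single RAM block. */
static int weighted_window(const int pixels[], int base)
{
    /* Three sequential reads from the RAM block (one per cycle). */
    int p0 = pixels[base];
    int p1 = pixels[base + 1];
    int p2 = pixels[base + 2];

    /* From here on, only local registers are read, so the optimizer
       is free to schedule these multiplies and adds in one stage. */
    return p0 * 2 + p1 * 3 + p2 * 5;
}
```

The extra assignments cost a few cycles up front, but they remove the bank conflict from the computation itself, which is where the parallelism matters.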

#### Array Splitting

The way that memory (including local arrays) is accessed within a process can have a dramatic impact on the ability of the optimizer to limit instruction stages and to parallelize C statements. Consider the following example:

x = A[0] + A[1] + A[2];
x = x << 2;

This example involves an array A that is stored in a local RAM block. Only one element of the array can be read from the memory in a single cycle, so the computation must be spread out over four stages:

1. Read A[0]

2. Read A[1]

3. Read A[2] , perform A[0]+A[1]

4. Perform + A[2] , perform <<2

One way to avoid this problem with memory is to use multiple arrays in multidimensional algorithms. For example, the following algorithm has the same problem as the preceding example:

int a[4][10];

for (i=0; i<10; i++) {
    a[3][i] = a[0][i] + a[1][i] + a[2][i];
}

However, suppose this algorithm is written using a separate array for each row of a, as follows:

int a0[10],a1[10],a2[10],a3[10];
for (i=0; i<10; i++) {
    a3[i] = a0[i] + a1[i] + a2[i];
}

In this example, each row is stored in a separate block of RAM, allowing each row to be read/written simultaneously. As a result, the body of this loop executes in a single stage instead of the four stages that would be required if the array were not split.

Tip

As this example demonstrates, array splitting is a useful technique to allow multiple simultaneous memory accesses and thereby increase parallelism. In Chapter 10 we'll explore this and other techniques in more detail.
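The same idea applies to one-dimensional arrays. As a further sketch (this variant is not from the chapter's example; the names and sizes are illustrative), an array can be split into even- and odd-indexed banks so that two elements are readable in the same stage:

```c
/* Hypothetical even/odd split of what would otherwise be int a[10]
   stored in one RAM block. With two banks, one read from each bank
   can be scheduled in the same instruction stage. */
static int sum_banks(const int a_even[5], const int a_odd[5])
{
    int i;
    int sum = 0;
    for (i = 0; i < 5; i++) {
        sum += a_even[i] + a_odd[i];  /* two reads, two banks */
    }
    return sum;
}
```

The trade-off is the same as before: the code is less natural to write, but each loop iteration now performs two memory reads where one bank would have allowed only one.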

### 7.5. Unrolling Loops

As described in the previous section, the optimizer automatically parallelizes across expressions and statements within a basic block of C code. The optimizer does not, however, automatically parallelize operations across iterations of a loop. Loop unrolling is one technique that can be used to parallelize a loop by turning the loop into one basic block of straight-line code.

If the number of iterations of a loop is known at compile time, it is possible to unroll the loop to create dedicated logic for each iteration of the loop. Unrolling simply duplicates the body of the loop as many times as there are iterations in the loop.

For example, consider the following loop:

for (i=0; i<10; i++) {
    sum += A[i];
}

Without unrolling, this loop will generate logic to perform each iteration in two cycles. The first cycle will read from memory A, and the second cycle will calculate the addition. One adder is generated that will be used ten times during execution of the loop. Now consider the same loop with unrolling:

int i;    // Loop index must be type int
...
for (i=0; i<10; i++) {
#pragma CO UNROLL
    sum += A[i];
}

Unrolling simply duplicates the body of the loop for all values of i. In this example the result is equivalent to the following:

sum += A[0];
sum += A[1];
sum += A[2];
sum += A[3];
sum += A[4];
sum += A[5];
sum += A[6];
sum += A[7];
sum += A[8];
sum += A[9];

In this case, ten adders are generated, and each one is used only once during the execution of the loop. Most of the time, as in this case, loop unrolling alone has no specific benefit. Only one value of A can be read in a given cycle, so this example loop still requires ten cycles to execute, and ten adders have been generated, which requires a lot of logic. However, if the "scalarize array variables" optimizer option is used together with loop unrolling, the elements of the array are replaced with scalar variables. The results would then be equivalent to the following:

sum += A_0;
sum += A_1;
sum += A_2;
sum += A_3;
sum += A_4;
sum += A_5;
sum += A_6;
sum += A_7;
sum += A_8;
sum += A_9;

Instead of generating a memory for the array A , registers are generated for each of the ten elements of the array. All ten registers can be read simultaneously and thus this entire loop can be executed in a single cycle.

Note

The use of unrolling requires some care because a large amount of logic can easily be generated and the cycle delay can be greatly increased, with a corresponding decrease in maximum clock frequency. In the preceding example, ten adders are generated, and because the result of each adder is used by the next adder, the delay of this cycle is ten times that of a single adder. You can keep this delay under control by placing a StageDelay pragma immediately preceding the loop.
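As a hedged sketch of how the cap might be applied (the pragma spelling and the delay value 32 are assumptions here; check your compiler's documentation for the exact form, and note that a standard C compiler simply ignores tool-specific pragmas):

```c
/* Sketch: cap the combinational delay of any one stage so the
   optimizer inserts registers to break up the ten-adder chain.
   The value 32 is an illustrative delay budget, not a recommendation. */
static int unrolled_sum(const int A[10])
{
    int i;
    int sum = 0;

#pragma CO SET StageDelay 32
    for (i = 0; i < 10; i++) {
#pragma CO UNROLL
        sum += A[i];
    }
    return sum;
}
```

With the cap in place, the generated hardware trades a longer pipeline (more stages) for a shorter critical path, recovering clock frequency at the cost of latency.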