7.4. Limiting Instruction Stages
To get maximum benefit from the optimizer, you should keep in mind those types of statements that will result in a new instruction stage being created. These statements include
Tip:
To the greatest extent practical, you should reduce the use of unnecessary control statements and memory
Reduce Memory Accesses for Higher Performance
An important consideration when writing your inner code
Array Splitting
The way that memory (including local arrays) is accessed within a process can have a dramatic impact on the ability of the optimizer to limit instruction stages and to parallelize C statements. Consider the following example:
x = A[0] + A[1] + A[2];
x = x << 2;
This example involves an array A that is stored in a local RAM block. Only one element of the array can be read from the memory in a single cycle, so the computation must be spread out over four stages:
One way to avoid this problem with memory is to use multiple arrays in multidimensional algorithms. For example, the following algorithm has the same problem as the
int a[4][10];
for (i=0; i<10; i++) {
a[3][i] = a[0][i] + a[1][i] + a[2][i];
}
However, suppose this algorithm is written using a separate array for each row of a, as follows:
int a0[10],a1[10],a2[10],a3[10];
for (i=0; i<10; i++) {
a3[i] = a0[i] + a1[i] + a2[i];
}
In this example, each row is stored in a separate block of RAM, allowing each row to be read/written
Tip
As this example
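The row-splitting rewrite above can be checked in ordinary C before it is committed to hardware. The following is a minimal sketch, not taken from the original text: the names sum_rows_single, sum_rows_split, and check_split_matches_single are hypothetical, and the code only demonstrates that the two forms compute the same results. Whether each array really maps to its own RAM block depends on the compiler and the target device, which this sketch does not model.

```c
#include <assert.h>

#define ROWS 4
#define COLS 10

/* Single 2-D array: all four rows share one memory, so the three
 * reads in each iteration compete for the same block. */
static void sum_rows_single(int a[ROWS][COLS])
{
    int i;
    for (i = 0; i < COLS; i++) {
        a[3][i] = a[0][i] + a[1][i] + a[2][i];
    }
}

/* One array per row: each row can be placed in a separate memory,
 * so each block sees at most one access per iteration. */
static void sum_rows_split(const int a0[COLS], const int a1[COLS],
                           const int a2[COLS], int a3[COLS])
{
    int i;
    for (i = 0; i < COLS; i++) {
        a3[i] = a0[i] + a1[i] + a2[i];
    }
}

/* Hypothetical self-check: fills both layouts with the same data,
 * runs both versions, asserts that every result matches, and
 * returns the number of elements compared. */
int check_split_matches_single(void)
{
    int a[ROWS][COLS];
    int a0[COLS], a1[COLS], a2[COLS], a3[COLS];
    int i;

    for (i = 0; i < COLS; i++) {
        a[0][i] = a0[i] = i;
        a[1][i] = a1[i] = 2 * i;
        a[2][i] = a2[i] = 3 * i + 1;
        a[3][i] = a3[i] = 0;
    }

    sum_rows_single(a);
    sum_rows_split(a0, a1, a2, a3);

    for (i = 0; i < COLS; i++) {
        assert(a[3][i] == a3[i]);
    }
    return COLS;
}
```

Running the self-check in a desktop test bench confirms that splitting the array changes only the storage layout, not the computed values.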
7.5. Unrolling Loops
As described in the previous section, the optimizer automatically parallelizes across expressions and statements within a basic block of C code. The optimizer does not, however, automatically parallelize operations across iterations of a loop. Loop unrolling is one technique that can be used to parallelize a loop by turning the loop into one basic block of straight-line code. If the number of iterations of a loop is known at compile time, it is possible to unroll the loop to create dedicated logic for each iteration of the loop. Unrolling simply duplicates the body of the loop as many times as there are iterations in the loop. For example, consider the following loop:
for (i=0; i<10; i++) {
sum += A[i];
}
Without unrolling, this loop will generate logic to perform each iteration in two cycles. The first cycle will read from memory A, and the second cycle will calculate the addition. One
int i; // Loop index must be type int
...
for (i=0; i<10; i++) {
#pragma CO UNROLL
sum += A[i];
}
Unrolling simply duplicates the body of the loop for all values of i. In this example the result is equivalent to the following:
sum += A[0];
sum += A[1];
sum += A[2];
sum += A[3];
sum += A[4];
sum += A[5];
sum += A[6];
sum += A[7];
sum += A[8];
sum += A[9];
In this case, ten adders are generated, and each one is used only once during the execution of the loop. Most of the time, as in this case, loop unrolling alone has no specific benefit. Only one value of A can be read in a given cycle, so this example loop still requires ten cycles to execute, and ten adders have been generated, which requires a lot of logic. However, if the "scalarize array variables" optimizer option is used together with loop unrolling, the elements of the array are
sum += A_0;
sum += A_1;
sum += A_2;
sum += A_3;
sum += A_4;
sum += A_5;
sum += A_6;
sum += A_7;
sum += A_8;
sum += A_9;
Instead of generating a memory for the array A, registers are generated for each of the ten elements of the array. All ten registers can be read
Note
The use of unrolling requires some care because a large amount of logic can be easily generated and the cycle delay can be greatly increased, with a corresponding decrease in maximum clock frequency. In the
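The three forms of the summation discussed in this section (the rolled loop, the unrolled statements, and the scalarized statements) can be compared side by side in plain C. The following is a minimal sketch under stated assumptions: the function names and the test data are hypothetical, the unrolling and scalarization are written out by hand rather than produced by the optimizer, and the CO UNROLL pragma is mentioned only in a comment so that the code compiles cleanly with an ordinary C compiler. It shows functional equivalence only and says nothing about cycle counts or logic size.

```c
#include <assert.h>

#define N 10

/* Rolled form. In the text's example, "#pragma CO UNROLL" placed at
 * the top of the loop body asks the optimizer to unroll this loop. */
static int sum_rolled(const int A[N])
{
    int i;      /* loop index of type int, as in the text's example */
    int sum = 0;
    for (i = 0; i < N; i++) {
        sum += A[i];
    }
    return sum;
}

/* Hand-unrolled form: one statement (and one adder) per iteration,
 * but every operand still comes from the same array. */
static int sum_unrolled(const int A[N])
{
    int sum = 0;
    sum += A[0]; sum += A[1]; sum += A[2]; sum += A[3]; sum += A[4];
    sum += A[5]; sum += A[6]; sum += A[7]; sum += A[8]; sum += A[9];
    return sum;
}

/* Hand-scalarized form: each element is a separate scalar, standing
 * in for the per-element registers described in the text. */
static int sum_scalarized(int A_0, int A_1, int A_2, int A_3, int A_4,
                          int A_5, int A_6, int A_7, int A_8, int A_9)
{
    int sum = 0;
    sum += A_0; sum += A_1; sum += A_2; sum += A_3; sum += A_4;
    sum += A_5; sum += A_6; sum += A_7; sum += A_8; sum += A_9;
    return sum;
}

/* Hypothetical self-check: asserts that all three forms agree on
 * sample data and returns the common sum. */
int check_unroll_equivalence(void)
{
    int A[N] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
    int expected = sum_rolled(A);

    assert(sum_unrolled(A) == expected);
    assert(sum_scalarized(A[0], A[1], A[2], A[3], A[4],
                          A[5], A[6], A[7], A[8], A[9]) == expected);
    return expected;
}
```

Keeping a rolled reference version like this in the test bench is one way to confirm that an unrolled or scalarized loop still produces the same results while logic size and clock frequency are being tuned.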