Section 11.4. Refinement 2: Creating a System-Level Pipeline

The partitioning of the image filter algorithm into three subprocesses (which actually represents a single process replicated three times) reduces the complexity of the filter and allows the algorithm's performance to be improved at the expense of increased hardware requirements. As we have mentioned, however, both versions of the filter are limited in performance because of how the bytebuffer array is accessed. At issue is the fact that both of the edge-detection algorithms just described must store locally (in the bytebuffer array) enough pixels to allow a look-ahead and look-behind of one scan line plus one pixel to sample the desired pixel window. If we rethink the algorithm and apply techniques of pipelined system-level parallelism, however, we can increase the filter's effective throughput dramatically.

The system-level pipeline that we will describe operates in much the same way as the instruction-level pipelines described in Chapter 7, but at a higher level. A system-level pipeline is designed by the application programmer, not automatically generated by the compiler, and consists of two or more processes connected in sequence. Each process in the sequence performs some transformation on the data and passes the results to the next process.

Figure 11-7. Parallel edge-detection configuration function.

void config_edge_detect(void *arg) {
    co_stream blue_source_pixeldata, green_source_pixeldata,
              red_source_pixeldata, blue_result_pixeldata,
              green_result_pixeldata, red_result_pixeldata;
    co_signal header_ready;
    co_process producer_process;
    co_process blue_process, green_process, red_process;
    co_process consumer_process;

    /* One source stream and one result stream per color channel */
    blue_source_pixeldata = co_stream_create("blue_source_pixeldata",
                                             UINT_TYPE(8), BUFSIZE);
    green_source_pixeldata = co_stream_create("green_source_pixeldata",
                                              UINT_TYPE(8), BUFSIZE);
    red_source_pixeldata = co_stream_create("red_source_pixeldata",
                                            UINT_TYPE(8), BUFSIZE);
    blue_result_pixeldata = co_stream_create("blue_result_pixeldata",
                                             UINT_TYPE(8), BUFSIZE);
    green_result_pixeldata = co_stream_create("green_result_pixeldata",
                                              UINT_TYPE(8), BUFSIZE);
    red_result_pixeldata = co_stream_create("red_result_pixeldata",
                                            UINT_TYPE(8), BUFSIZE);

    header_ready = co_signal_create("header_ready");

    producer_process = co_process_create("producer_process",
                           (co_function)test_producer,
                           4, blue_source_pixeldata, green_source_pixeldata,
                           red_source_pixeldata, header_ready);

    /* The same edge_detect function is instantiated once per channel */
    blue_process = co_process_create("blue_process",
                       (co_function)edge_detect,
                       2, blue_source_pixeldata, blue_result_pixeldata);
    green_process = co_process_create("green_process",
                        (co_function)edge_detect,
                        2, green_source_pixeldata, green_result_pixeldata);
    red_process = co_process_create("red_process",
                      (co_function)edge_detect,
                      2, red_source_pixeldata, red_result_pixeldata);

    consumer_process = co_process_create("consumer_process",
                           (co_function)test_consumer,
                           4, blue_result_pixeldata, green_result_pixeldata,
                           red_result_pixeldata, header_ready);

    /* Assign the three filter processes to hardware */
    co_process_config(blue_process, co_loc, "PE0");
    co_process_config(green_process, co_loc, "PE0");
    co_process_config(red_process, co_loc, "PE0");
}

As in the previous examples, this implementation of the edge-detection function requires a 3-by-3 window for processing each pixel in the source image. In this version of the algorithm, however, a separate upstream process generates the pixel window, in the form of three streams of pixel information on which the subsequent filter process can operate. This removes the need for bytebuffer and its corresponding circular buffer overhead.

For this example, we'll create a pipeline of four separate processes. Two of these processes will replace the edge-detection function, as previously described, while the other two will allow us to apply the image filter to an image buffer (an external memory), rather than rely on data streaming from an embedded processor or some other source. Each process performs one step of the filter, as described here and shown in Figure 11-8:

1.	Image data is loaded from an image buffer using the Impulse C `co_memory_readblock` function and is then converted into a stream of 24-bit pixels using `co_stream_write`.
2.	The stream of pixels is processed by a row generator to produce three output streams, each representing a row in three marching columns of pixels.
3.	The marching columns of pixel data are used to compute the new value of the center pixel. This value is streamed out of the filter using `co_stream_write`.
4.	The resulting pixel data is read (using `co_stream_read`) and stored to a new image buffer through the use of `co_memory_writeblock`.

Figure 11-8. Four pipelined processes.

Because these four processes operate in parallel, in a system-level pipeline, the filter can generate processed pixels at a rate of one pixel every two clock cycles. For a standard-size image this is fast enough for real-time video image filtering at quite leisurely FPGA clock rates.
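The real-time claim can be checked with simple arithmetic. The sketch below, written in plain C rather than Impulse C, computes the minimum clock rate needed to sustain a given frame size and frame rate at two clock cycles per pixel. The function name `required_clock_hz` and the 640-by-480, 30-frames-per-second figures used afterward are assumptions chosen for illustration; the text does not state an image size or frame rate, and the calculation ignores pipeline fill time and memory-transfer stalls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of Impulse C): minimum clock rate in Hz
 * needed to process width * height pixels, fps times per second, when
 * the pipeline accepts one pixel every cycles_per_pixel clock cycles.
 * Pipeline fill time and memory stalls are ignored. */
uint64_t required_clock_hz(uint32_t width, uint32_t height,
                           uint32_t fps, uint32_t cycles_per_pixel)
{
    assert(width > 0 && height > 0 && fps > 0 && cycles_per_pixel > 0);
    return (uint64_t)width * height * fps * cycles_per_pixel;
}
```

Under these assumptions, `required_clock_hz(640, 480, 30, 2)` evaluates to 18,432,000, or roughly 18.4 MHz.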

The following sections describe the four processes in more detail.

The DMA Input Process

One aspect of the image filter application that we ignored in discussing the first version of this edge detect algorithm was where the input and output images are actually stored. In a real-world example, pixel data most likely originates from some image buffer (external RAM) or is streamed via a direct hardware interface. The output image may similarly be stored in memory or sent via a hardware channel. In either case, there needs to be a process on either end of our image filter: one that converts the incoming data to an Impulse C stream, and another that converts the resulting filtered pixel stream for output.

Impulse C offers two mechanisms for moving high-volume data: streams and shared memory. Thus far, our examples have primarily used streams, but using shared memory can be a useful way to increase application performance for certain types of streaming applications. It can also be useful for applications that need to operate on nonstreaming data located in memory.

The factors influencing a decision to use streams versus shared memory are highly platform-specific. There are many issues to consider:

How many cycles are required for each stream transaction? This is dependent on whether a processor is involved in the transfer and on the architecture of the specific CPU and bus combination. In particular, if the bus must be polled to receive data sent on a stream, there will be significant cycle-count overhead.
How many cycles are required for a memory transfer? This number is dependent on the memory and bus architecture used, as well as the CPU/bus combination.
Does the CPU have a cache? Is the data likely to already be in the cache?
Is the memory on the same bus as the Impulse C hardware processes? If so, the hardware processes and memory could compete for access to the CPU, negatively impacting performance.

Some of these issues were discussed in Chapter 4, with specific benchmark examples for various platform configurations. For this example, we have chosen to use the shared memory approach, which proves to be more efficient than data streaming on the platform selected for this project, an Altera Stratix FPGA paired with a Nios II embedded soft processor.

In the Altera platform, the FPGA code that we will generate from this image filter and the shared memory are both accessed via Altera's Avalon bus. (The methods for accessing shared memory are similar on other platforms.) For the Altera Nios II processor, it is more efficient to move large blocks of data between the embedded processor and a hardware function such as our image filter using DMA transfer rather than making use of streams. On other platforms, such as the Xilinx MicroBlaze processor with its FSL bus, the use of software-to-hardware streams may provide faster performance than DMA transfers.

One disadvantage of using DMA with a shared memory is that the hardware process is blocked from doing any computation during the transfer. As a result, the use of shared memory often implies a separate process for handling shared memory operations, so that data transfer can be overlapped with computation. Thus, the first process in the pipeline serves only to read image data from the image memory (which is the heap0 memory defined by the selected platform) and send it out on a data stream. In the listing for this process in Figure 11-9, notice the following:

The to_stream process takes three arguments: a signal, go; a shared memory, imgmem; and an output stream, output_stream.
In the process run function itself, the co_memory_readblock and co_stream_write functions are used to read pixel data from the shared memory (one entire scan line at a time) and to write those pixels to the output stream, respectively. A signal (go) is used to synchronize with the CPU and ensure that the image memory is ready for processing.
The algorithm accepts 16-bit pixel data even though our filter algorithm was designed to operate on 24-bit data. Consequently, this process also converts the 16-bit pixel value into a 24-bit value. Also, notice that a 32-bit memory is used even though our image data values are logically stored as 16-bit unsigned integers. The DMA transfers one element at a time to the array, so an array of 16-bit values would require twice as many bus transactions.

This process could be modified (or replaced by a hand-crafted hardware block) for use with many types of input sources.
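The pixel conversion performed by this process can be tried in isolation. The following plain C sketch reuses the shifts and masks from the to_stream listing to expand one 16-bit pixel to 24 bits, and shows how one 32-bit memory word yields two pixels. The names `expand_pixel`, `unpack_word`, and `pixel_pair` are hypothetical, and the bit-field layout described in the comments (5, 5, and 6 bits) is inferred from the masks rather than stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Expand one 16-bit pixel to 24 bits using the same shifts and masks
 * as the to_stream listing. */
uint32_t expand_pixel(uint32_t d0)
{
    uint32_t data;
    assert((d0 >> 16) == 0);            /* expects a 16-bit value */
    data  = ((d0 >> 8) & 0xf8) << 16;   /* bits 15..11 -> bits 23..19 */
    data |= ((d0 >> 3) & 0xf8) << 8;    /* bits 10..6  -> bits 15..11 */
    data |= (d0 << 2) & 0xfc;           /* bits 5..0   -> bits 7..2   */
    return data;
}

/* Two 24-bit pixels recovered from one 32-bit memory word. */
typedef struct {
    uint32_t first;   /* from the low half-word  */
    uint32_t second;  /* from the high half-word */
} pixel_pair;

/* Hypothetical helper: split one 32-bit word into two expanded pixels,
 * low half-word first, matching the order of writes in the listing. */
pixel_pair unpack_word(uint32_t word)
{
    pixel_pair out;
    out.first  = expand_pixel(word & 0xffff);
    out.second = expand_pixel(word >> 16);
    return out;
}
```

Because each 32-bit element carries two pixels, a scan line of IMG_WIDTH pixels occupies IMG_WIDTH / 2 array elements, which is why the row buffer in the listing is declared with that size.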

The Column Generator Process

Image filters generally operate on a moving window of image data, as described previously. In this example, the filter operates on a 3-by-3 window. Notice that as the filter moves from one pixel to the next, the computation uses six of the same values used in the previous pixel computation plus three new values. A marching column generator generates a stream of three pixel rows that represent (at any given cycle) one three-pixel column corresponding to the three new values the filter requires for the next target pixel's filter window as it proceeds from left to right. The prep_run process buffers enough data to produce the three-pixel column data just as the original algorithm did, but in a much simpler way.
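The marching-column idea can be modeled in a few lines of plain C. The sketch below is a simplified software model, not the prep_run listing itself: it keeps the two previous scan lines in separate arrays and, for each incoming pixel, emits the three-pixel column above and including that pixel. The names `line_buffers`, `column`, `column_push`, and `feed_ramp`, and the four-pixel scan-line width, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDTH 4  /* hypothetical scan-line width for this sketch */

typedef struct {
    int32_t b[WIDTH];  /* scan line two rows above the incoming pixel */
    int32_t c[WIDTH];  /* scan line one row above the incoming pixel  */
    int j;             /* current column index */
} line_buffers;

typedef struct { int32_t top, mid, bot; } column;

/* Accept one pixel in raster order and return the column that ends at
 * that pixel. Each array is read once and written once per call. */
column column_push(line_buffers *lb, int32_t pixel)
{
    column out;
    assert(lb != NULL);
    out.top = lb->b[lb->j];
    out.mid = lb->c[lb->j];
    out.bot = pixel;
    lb->b[lb->j] = out.mid;   /* row above becomes two rows above */
    lb->c[lb->j] = pixel;     /* new pixel becomes the row above  */
    lb->j = (lb->j + 1) % WIDTH;
    return out;
}

/* Feed pixels 1..n in raster order into zeroed buffers and return the
 * column produced by the last pixel. */
column feed_ramp(int n)
{
    line_buffers lb = { {0}, {0}, 0 };
    column last = { 0, 0, 0 };
    int p;
    for (p = 1; p <= n; p++)
        last = column_push(&lb, p);
    return last;
}
```

With a four-pixel scan line, the ninth pixel is the first pixel of the third row, so `feed_ramp(9)` returns the column (1, 5, 9): the pixels directly above it in the first and second rows, followed by the new pixel.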

Figure 11-9. Shared memory to stream process for pipelined image filter.

void to_stream(co_signal go, co_memory imgmem, co_stream output_stream) {
    int16 i, j;
    uint32 offset, data, d0;
    uint32 row[IMG_WIDTH / 2];   /* one scan line, two 16-bit pixels per word */

    co_signal_wait(go, &data);   /* wait until the image memory is ready */
    co_stream_open(output_stream, O_WRONLY, INT_TYPE(32));
    offset = 0;
    for ( i = 0; i < IMG_HEIGHT; i++ ) {
        /* Read one entire scan line from shared memory */
        co_memory_readblock(imgmem, offset, row, IMG_WIDTH * sizeof(int16));
        for ( j = 0; j < (IMG_WIDTH / 2); j++ ) {
#pragma CO PIPELINE
            d0 = row[j];
            /* Low half-word: expand the first 16-bit pixel to 24 bits */
            data  = ((d0 >> 8) & 0xf8) << 16;
            data |= ((d0 >> 3) & 0xf8) << 8;
            data |= (d0 << 2) & 0xfc;
            co_stream_write(output_stream, &data, sizeof(int32));
            /* High half-word: expand the second pixel */
            d0 = d0 >> 16;
            data  = ((d0 >> 8) & 0xf8) << 16;
            data |= ((d0 >> 3) & 0xf8) << 8;
            data |= (d0 << 2) & 0xfc;
            co_stream_write(output_stream, &data, sizeof(int32));
        }
        offset += IMG_WIDTH * sizeof(int16);
    }
    /* Two trailing zero pixels pad the end of the stream */
    data = 0;
    co_stream_write(output_stream, &data, sizeof(int32));
    co_stream_write(output_stream, &data, sizeof(int32));
    co_stream_close(output_stream);
}

As described previously, a limiting factor in the performance of the original algorithm was its method of accessing pixel values from a single buffer, which contained all the pixels needed to "look ahead" and "look behind" one entire row of the image. To increase the performance of the image filter, we therefore need to reduce the number of accesses to the same memory and thereby decrease the number of cycles required to perform the desired calculations. To do this, we use the array splitting technique described in Chapters 7 and 10 to eliminate the need to access the same array twice in the body of the loop. As a result, this process can achieve very high throughput, generating a pixel column once every two clock cycles. The source listing for this process is shown in Figure 11-10. Notice that this process accepts one stream of pixels, caches two scan lines into arrays B and C, and generates three output streams using a circular buffer technique similar to that in the original example. By taking care not to make unnecessary accesses to the same array, this process has a latency of just two clock cycles for each column output.

The Image Filter Process

The most important process in this pipeline is the filter itself, which is represented by the filter_run process listed in Figure 11-11. This process accepts the three pixel streams (the marching column) generated by the prep_run process and performs a convolution similar to the original algorithm.

In this version of the convolution, a number of statement-level optimizations have been made to reduce the number of cycles required to process the inputs:

The adjacent pixels are captured in local variables p00, p01, p02, and so on rather than being accessed from an array, as was done in the original. This eliminates the need for simultaneous accesses to an array, which would prevent the optimizer from combining multiple calculations into single stages.
The calculation of a difference between points in the horizontal, vertical and diagonal directions has been replaced by a simpler and more parallelizable version that eliminates the need for an abs (absolute value) function or macro. Notice how in this version the center pixel is repeatedly compared to its neighboring pixels to obtain an average difference.

The result of these changes is an edge-detection algorithm that produces one filtered pixel every two clock cycles. This calculation can of course be modified as needed to perform other types of convolutions.
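The per-channel arithmetic in the filter can be examined on its own. The plain C sketch below mirrors the calculation in Figure 11-11 for a single color channel: eight times the center value minus the eight neighbors, followed by a branch-free clamp of negative results to zero. The function name `edge_channel` and the row-major 3-by-3 array argument are assumptions made for illustration; the listing itself works on individual variables and extracts channels with the RED, GREEN, and BLUE macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: w holds one channel of a 3-by-3 window in
 * row-major order, so w[4] is the center pixel. */
uint8_t edge_channel(const uint8_t w[9])
{
    uint16_t d0;
    int k;
    assert(w != NULL);
    d0 = (uint16_t)(w[4] << 3);            /* 8 * center */
    for (k = 0; k < 9; k++) {
        if (k != 4)
            d0 = (uint16_t)(d0 - w[k]);    /* subtract each neighbor */
    }
    /* If bit 15 is set the 16-bit result is negative: (1 - 1) gives a
     * mask of 0 and clears d0. Otherwise (0 - 1) gives an all-ones
     * mask and leaves d0 unchanged. No abs() or branch is needed. */
    d0 &= (uint16_t)((d0 >> 15) - 1);
    return (uint8_t)(d0 & 0xff);           /* keep the low 8 bits */
}
```

A uniform window yields 0, and a center darker than its neighbors is clamped to 0. As in the listing, only the low eight bits of the result are kept, so a positive difference larger than 255 is truncated rather than saturated.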

The Stream to Memory Process

This process, from_stream, provides the opposite function from the to_stream process described earlier. In this process, a single stream of input pixels is read, one scan line at a time, and written to the output memory using co_memory_writeblock. This process also swaps the order of the bytes being read from the stream and placed in memory, reflecting the requirements of the image format being used (which in this case is the TIFF format). The stream-to-memory process (from_stream) is shown in Figure 11-12.
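The packing step is the inverse of the expansion done on the input side. The plain C sketch below reuses the shifts and masks from the from_stream listing to reduce a 24-bit pixel to 16 bits and to combine two packed pixels into one 32-bit memory word. The names `pack_pixel` and `pack_pair` are hypothetical, and the check that the upper eight bits are zero is an assumption about the incoming stream data rather than something the listing enforces.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce one 24-bit pixel to 16 bits using the same shifts and masks
 * as the from_stream listing (5, 5, and 6 bits per field). */
uint32_t pack_pixel(uint32_t d0)
{
    uint32_t p;
    assert((d0 >> 24) == 0);              /* assumes a 24-bit pixel */
    p = (d0 >> 19) & 0x1f;                /* bits 23..19 */
    p = (p << 5) | ((d0 >> 11) & 0x1f);   /* bits 15..11 */
    p = (p << 6) | ((d0 >> 2) & 0x3f);    /* bits 7..2   */
    return p;
}

/* Hypothetical helper: combine two pixels into one 32-bit word, with
 * the first pixel read from the stream in the low half-word. */
uint32_t pack_pair(uint32_t first, uint32_t second)
{
    return (pack_pixel(second) << 16) | pack_pixel(first);
}
```

For example, `pack_pixel(0xf8f8fc)` yields 0xffff, and `pack_pair(0xf8f8fc, 0)` yields 0x0000ffff, which places the first pixel of each pair in the low half of the memory word.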

Figure 11-10. Pixel window row generator for the pipelined image filter.

void prep_run(co_stream input_stream, co_stream r0, co_stream r1, co_stream r2) {
    int32 i, j;
    int32 B[IMG_WIDTH], C[IMG_WIDTH];   /* two cached scan lines */
    int32 A01, A02, p02, p12, p22;

    co_stream_open(input_stream, O_RDONLY, INT_TYPE(32));
    co_stream_open(r0, O_WRONLY, INT_TYPE(32));
    co_stream_open(r1, O_WRONLY, INT_TYPE(32));
    co_stream_open(r2, O_WRONLY, INT_TYPE(32));

    /* Prime the buffers: two leading pixels, then two full scan lines */
    co_stream_read(input_stream, &A01, sizeof(int32));
    co_stream_read(input_stream, &A02, sizeof(int32));
    for ( j = 0; j < IMG_WIDTH; j++ )
        co_stream_read(input_stream, &B[j], sizeof(int32));
    for ( j = 0; j < IMG_WIDTH; j++ )
        co_stream_read(input_stream, &C[j], sizeof(int32));

    /* Emit the first two columns from the buffered data */
    co_stream_write(r0, &A01, sizeof(int32));
    co_stream_write(r1, &B[IMG_WIDTH - 2], sizeof(int32));
    co_stream_write(r2, &C[IMG_WIDTH - 2], sizeof(int32));
    co_stream_write(r0, &A02, sizeof(int32));
    co_stream_write(r1, &B[IMG_WIDTH - 1], sizeof(int32));
    co_stream_write(r2, &C[IMG_WIDTH - 1], sizeof(int32));

    for ( i = 2; i < IMG_HEIGHT; i++ ) {
        j = 0;
        do {
            /* Each array is read once and written once per iteration */
            p02 = B[j];
            p12 = C[j];
            co_stream_read(input_stream, &p22, sizeof(int32));
            co_stream_write(r0, &p02, sizeof(int32));
            co_stream_write(r1, &p12, sizeof(int32));
            co_stream_write(r2, &p22, sizeof(int32));
            B[j] = p12;
            C[j] = p22;
            j++;
        } while ( j < IMG_WIDTH );
    }

    co_stream_close(input_stream);
    co_stream_close(r0);
    co_stream_close(r1);
    co_stream_close(r2);
}

Figure 11-11. Pipelined image filter process.

void filter_run(co_stream r0, co_stream r1, co_stream r2,
                co_stream output_stream) {
    uint32 data, res, p00, p01, p02, p10, p11, p12, p20, p21, p22;
    uint16 d0;

    co_stream_open(r0, O_RDONLY, INT_TYPE(32));
    co_stream_open(r1, O_RDONLY, INT_TYPE(32));
    co_stream_open(r2, O_RDONLY, INT_TYPE(32));
    co_stream_open(output_stream, O_WRONLY, INT_TYPE(32));

    p00 = 0; p01 = 0; p02 = 0;
    p10 = 0; p11 = 0; p12 = 0;
    p20 = 0; p21 = 0; p22 = 0;

    while ( co_stream_read(r0, &data, sizeof(int32)) == co_err_none ) {
#pragma CO PIPELINE
#pragma CO set stageDelay 256
        /* Shift the 3-by-3 window one column to the left */
        p00 = p01; p01 = p02;
        p10 = p11; p11 = p12;
        p20 = p21; p21 = p22;
        /* Load the new right-hand column */
        p02 = data;
        co_stream_read(r1, &p12, sizeof(int32));
        co_stream_read(r2, &p22, sizeof(int32));

        /* Red: 8 * center minus the eight neighbors */
        d0 = RED(p11) << 3;
        d0 = d0 - RED(p00);
        d0 = d0 - RED(p01);
        d0 = d0 - RED(p02);
        d0 = d0 - RED(p10);
        d0 = d0 - RED(p12);
        d0 = d0 - RED(p20);
        d0 = d0 - RED(p21);
        d0 = d0 - RED(p22);
        d0 &= (d0 >> 15) - 1;   /* clamp negative results to zero */
        res = d0 & 0xff;

        /* Green */
        d0 = GREEN(p11) << 3;
        d0 = d0 - GREEN(p00);
        d0 = d0 - GREEN(p01);
        d0 = d0 - GREEN(p02);
        d0 = d0 - GREEN(p10);
        d0 = d0 - GREEN(p12);
        d0 = d0 - GREEN(p20);
        d0 = d0 - GREEN(p21);
        d0 = d0 - GREEN(p22);
        d0 &= (d0 >> 15) - 1;
        res = (res << 8) | (d0 & 0xff);

        /* Blue */
        d0 = BLUE(p11) << 3;
        d0 = d0 - BLUE(p00);
        d0 = d0 - BLUE(p01);
        d0 = d0 - BLUE(p02);
        d0 = d0 - BLUE(p10);
        d0 = d0 - BLUE(p12);
        d0 = d0 - BLUE(p20);
        d0 = d0 - BLUE(p21);
        d0 = d0 - BLUE(p22);
        d0 &= (d0 >> 15) - 1;
        res = (res << 8) | (d0 & 0xff);

        co_stream_write(output_stream, &res, sizeof(int32));
    }

    co_stream_close(r0);
    co_stream_close(r1);
    co_stream_close(r2);
    co_stream_close(output_stream);
}

The Configuration Function

These four processes and the corresponding stream, memory, and signal declarations are described using Impulse C and are interconnected using the configuration function shown in Figure 11-13.

This configuration function includes the following:

Signal declarations for startsig and donesig. These signals are used to coordinate the use of shared memories (the image buffers) between the software test bench and the image filter hardware. startsig indicates that the image buffer is ready for processing, while donesig indicates that the filtering of the image is complete.
Stream declarations for the input and output pixel streams and for the three streams that connect prep_run to filter_run.

A memory declaration for shrmem. This memory represents the image buffer.

Figure 11-12. Stream to shared memory process, pipelined image filter.

void from_stream(co_stream input_stream, co_memory imgmem,
                 co_signal done) {
    uint8 err;
    int16 i;
    int32 offset, low, data, d0;
    int32 rowout[IMG_WIDTH / 2];   /* one scan line, two 16-bit pixels per word */

    co_stream_open(input_stream, O_RDONLY, INT_TYPE(32));
    offset = 0;
    do {
        for ( i = 0; i < (IMG_WIDTH / 2); i++ ) {
#pragma CO PIPELINE
            /* First pixel of the pair: pack 24 bits down to 16 */
            err = co_stream_read(input_stream, &d0, sizeof(d0));
            if ( err != co_err_none ) break;
            low = (d0 >> 19) & 0x1f;
            low = (low << 5) | ((d0 >> 11) & 0x1f);
            low = (low << 6) | ((d0 >>  2) & 0x3f);
            /* Second pixel goes in the upper half-word */
            err = co_stream_read(input_stream, &d0, sizeof(d0));
            if ( err != co_err_none ) break;
            data = d0 >> 19;
            data = (data << 5) | ((d0 >> 11) & 0x1f);
            data = (data << 6) | ((d0 >>  2) & 0x3f);
            rowout[i] = (data << 16) | low;
        }
        if ( err != co_err_none ) break;
        /* Write one completed scan line back to shared memory */
        co_memory_writeblock(imgmem, offset, rowout,
                             IMG_WIDTH * sizeof(int16));
        offset += IMG_WIDTH * sizeof(int16);
    } while ( 1 );
    co_stream_close(input_stream);
    co_signal_post(done, 0);   /* signal that filtering is complete */
}

Process declarations for the four required hardware processes, plus one additional software test bench process, cpu_proc. This software test bench process is listed in Appendix E.
Calls to co_signal_create for the two signals startsig and donesig.

A call to co_memory_create for memory shrmem. The memory is created at location heap0, which is a location specific to the Altera platform being targeted for this example. heap0 represents an on-chip memory accessible to both the processor (where the software test bench will reside) and to the image filter running on the FPGA as dedicated hardware.

Figure 11-13. Pipelined edge detector configuration function.

void config_img(void *arg) {
    co_signal startsig, donesig;
    co_memory shrmem;
    co_stream istream, row0, row1, row2, ostream;
    co_process reader, writer;
    co_process cpu_proc, prep_proc, filter_proc;

    startsig = co_signal_create("start");
    donesig = co_signal_create("done");
    shrmem = co_memory_create("image", "heap0",
                              IMG_WIDTH * IMG_HEIGHT * sizeof(uint16));

    istream = co_stream_create("istream", INT_TYPE(32), IMG_HEIGHT/2);
    row0 = co_stream_create("row0", INT_TYPE(32), 4);
    row1 = co_stream_create("row1", INT_TYPE(32), 4);
    row2 = co_stream_create("row2", INT_TYPE(32), 4);
    ostream = co_stream_create("ostream", INT_TYPE(32), IMG_HEIGHT/2);

    cpu_proc = co_process_create("cpu_proc", (co_function)call_fpga,
                                 3, shrmem, startsig, donesig);
    reader = co_process_create("reader", (co_function)to_stream,
                               3, startsig, shrmem, istream);
    prep_proc = co_process_create("prep_proc", (co_function)prep_run,
                                  4, istream, row0, row1, row2);
    filter_proc = co_process_create("filter", (co_function)filter_run,
                                    4, row0, row1, row2, ostream);
    writer = co_process_create("writer", (co_function)from_stream,
                               3, ostream, shrmem, donesig);

    co_process_config(reader, co_loc, "PE0");
    co_process_config(prep_proc, co_loc, "PE0");
    co_process_config(filter_proc, co_loc, "PE0");
    co_process_config(writer, co_loc, "PE0");
}

Calls to co_stream_create for the five streams (the input stream, the output stream, and the three intermediate streams). Notice that the three intermediate streams are given stream depths of four, while the input and output streams are given stream depths of one-half the number of scan lines in the image. This is done in part to mitigate stalls that may result from longer-than-expected memory read and write times. (Bear in mind that deep stream buffers such as this can incur substantial penalties in hardware resources, however.)
Calls to co_process_create for the four hardware processes and the one software test bench process.
Calls to co_process_config for the four hardware processes, indicating that these four processes are to be compiled to the FPGA as hardware blocks.
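The hardware cost of the stream depths chosen above can be estimated with a short calculation. The plain C sketch below computes only the raw storage implied by a stream's depth and element width. The function name `stream_buffer_bits` and the 480-line image height used afterward are assumptions for illustration; the resources actually consumed on a given FPGA depend on how the tools implement the buffer and are not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of Impulse C): raw storage, in bits,
 * for a stream buffer holding depth elements of width_bits each. */
uint32_t stream_buffer_bits(uint32_t depth, uint32_t width_bits)
{
    assert(depth > 0 && width_bits > 0);
    return depth * width_bits;
}
```

Assuming IMG_HEIGHT is 480, each of the input and output streams has a depth of 240 and needs 240 x 32 = 7,680 bits of storage, while each intermediate stream with a depth of four needs only 128 bits.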