Section 13.5. Parallelizing the Algorithm | Practical FPGA Programming in C

13.5. Parallelizing the Algorithm

As we have described in previous chapters, there are two fundamental ways in which parallelism can be added to a C application that will be targeting FPGA hardware. We have seen how statement-level parallelism and system-level parallelism can together provide enormous speedups. Let's apply both statement-level and system-level parallism to this application. We'll begin by considering an effective partitioning strategy.

Partitioning the Problem

Because the Mandelbrot image generation algorithm performs iterative calculations on the same point, with no need to share data between points, it's natural to consider a partitioning strategy in which all of the computations for a given point are performed within a single process, which is then replicated some number of times to create a complete application. At its most extreme, we could imagine a system in which there is one dedicated processor for each point.

Although this approach might provide the highest possible performance, it is clearly impractical due to the huge amount of hardware required (640,000 processing elements, or PEs, for a relatively modest 800-by-800 image). In order to balance the desire for performance with the practical limits of the design size, we need to consider a few important points:

For any given processor performing calculations on a given point, it is not known in advance how many iterations will be required to obtain a result for that point.
Given the preceding point, we need to design a system in which there is minimal wasted/idle time. All the processing elements need to be kept busy, which implies some kind of master, controlling process.
If we are generating a sorted output (such as would be required when streaming the outputs to a file), we will need some kind of consumer process that can properly arrange the results or store them intelligently in a random-access memory for later use.

The fact that the different point processors have no need to interact with one another gives us a great deal of flexibility in how points are allocated to processors. We might choose a scheme, for example, in which the X-Y plane is sliced into horizontal sections. Each of these sections is streamed to a different process, which then performs its calculations on each point in the section in sequence to return an ordered result. This seems at first to be a good choice, until you consider that in many regions of the Mandelbrot set entire sections of the plane consist entirely of points requiring a greater-than-average number of iterations, while other horizontal slices (consisting of a few or perhaps a hundred or more lines) may consist of points that diverge rapidly and therefore require a minimum amount of computation. Because of this irregular characteristic of the data, the potential exists for an imbalance in the processing requirements; we will waste computing resources if we do not choose a partitioning strategy that smooths out such irregularities.

A better method is to interlace the lines such that each processor handles every nth line, where n is the number of processors. This method is most efficient, of course, when n divides evenly into the number of lines in the region. For the sake of demonstration we'll choose a relatively small value for n (the number of processors), just three processing elements, or worker processes, to perform the operations. The method we will demonstrate to synchronize the activities of these three processes is easily scalable to much larger arrays of processors.

Output Synchronization

The method we will choose, that of creating three worker processes that handle the processing of alternate lines, will allow the interlaced regions of the image to be processed and their outputs ordered through the use of a randomly accessed image buffera shared memory. Alternate methods of synchronization are also possible, including the use of tokens (implemented as either signals or streams) indicating the correct order of output.

The partitioned and load-balanced version of the Mandelbrot image generator, consisting of two processes (a master controller and a worker process that will be instantiated three times), is shown in Figures 13-5 and 13-6. The process shown in Figure 13-5 represents the worker process and is named genline. This process includes the following:

The process declaration for genline, including pointers to the image buffer shared memory (img) and two streams, req and ack. These two streams are used to receive configuration information (the pixel value range) and to communicate to the master process that a given scan line has been completed. (Note that a signal could have been used in place of a stream for the ack communication channel. As you'll see, however, the existence of a nonblocking stream read function makes the use of a stream somewhat more convenient for this particular application.)
Local variable declarations (of type co_uint32) for the pixel value calculations, loop indices, and so on.
A local array of type co_uint8 (and size XSIZE X 3) that will be used to store the pixel color values prior to writing them to the shared memory.
Assignments for the two static values two and four, as described previously.
A co_stream_open function call for the ack stream. Note that this is an output stream, as indicated by the O_WRONLY parameter.
A co_stream_open function call for the req stream. Note that this is an input stream, as indicated by the O_RDONLY parameter.
An outer while loop that reads configuration information (sent by the controller process) from stream req for the line to be processed.
A for loop that iterates XSIZE times to process the line, one pixel position at a time.
A inner do loop that iteratively performs the required calculations for each pixel position. (Note the use of the PIPELINE and StageDelay pragmas in this inner code loop. They accelerate the innermost loop by parallelizing and pipelining the critical calculations.)
After the do loop, a set of if statements are used to determine the intensities of the three colors red, green, and blue for the pixel just processed. These values are assigned to R, G, and B, respectively.

After the colors have been determined, the R, G, and B values are stored in the temporary line buffer.

Figure 13-5. Worker processing for a single line of the Mandelbrot image.

 void genline(co_memory img, co_stream req, co_stream ack) {   co_uint32 xmax,xmin,dx;   co_uint32 B,G,R,BGR;   co_uint32 i,k,t;   co_uint32 c_imag,c_real,two,four;   co_uint32 result,tmp;   co_uint32 z_real,z_imag;   co_uint32 offset,outpos;   co_uint8 line[XSIZE*3];   two = FXCONST(2);   four = FXCONST(4);   co_stream_open(ack,O_WRONLY,UINT_TYPE(8));   // read in config   co_stream_open(req, O_RDONLY, UINT_TYPE(32));   while (co_stream_read(req,&xmax,sizeof(co_uint32))==co_err_none) {     co_stream_read(req,&xmin,sizeof(co_uint32));     co_stream_read(req,&dx,sizeof(co_uint32));     // signal ready     co_stream_write(ack,&tmp,sizeof(co_uint8));     // read in parameters     while (co_stream_read(req,&c_imag,sizeof(co_uint32)) == co_err_none) {       co_stream_read(req,&offset,sizeof(co_uint32));       c_real = xmin;       outpos = 0;       for (i=0; i<XSIZE; i++) {          z_real=z_imag=0;          // Calculate point          k = 0;          do { #pragma CO PIPELINE #pragma CO set stageDelay 96           tmp = FXMUL(z_real,z_real);           tmp = FXSUB(tmp,FXMUL(z_imag,z_imag));           tmp = FXADD(tmp,c_real);           z_imag = FXMUL(two,FXMUL(z_real,z_imag));           z_imag = FXADD(z_imag,c_imag);           z_real = tmp;           tmp = FXMUL(z_real,z_real);           result = FXADD(tmp,FXMUL(z_imag,z_imag));           k++;         } while ((result < four) && (k < MAX_ITERATIONS));         // Map points to gray scale: change to suit your preferences         B = G = R = 0;         if (k != MAX_ITERATIONS) {             R = G = G = k > 255 ? 255 : k;         }         line[outpos++] = B;         line[outpos++] = G;         line[outpos++] = R;         c_real=FXADD(c_real,dx);       }       co_memory_writeblock(img,offset,line,XSIZE*3);       co_stream_write(ack,&tmp,sizeof(co_uint8));     }     co_stream_close(req);     co_stream_open(req, O_RDONLY, UINT_TYPE(32));   }   co_stream_close(ack); }

After an entire line has been processed, the result (in the line array) is written to the shared memory through the use of the co_memory_writeblock function.
After the line has been written to shared memory, a message is sent to the controller via co_stream_write.

This process is replicated (instantiated) three times and operates under the control of a second hardware called ctrl, which is shown in Figure 13-6.

Examining the ctrl function in detail, you find the following:

A declaration of the process that includes references to the config_stream input, a done signal, three req streams (one for each worker process), and three ack streams (again, one for each worker process).

An outer code loop that reads configuration data from the config_stream input stream. This configuration data defines the area of interest (the subregion of the Mandelbrot set) as a set of eight 32-bit fixed-point values.

Figure 13-6. Controller process for the Mandelbrot image generator.

 void ctrl(co_stream config_stream, co_signal done,        co_stream req0, co_stream req1, co_stream req2,        co_stream ack0, co_stream ack1, co_stream ack2) {   co_uint32 xmax,xmin,ymax,ymin,dx,dy,hdx,hdy;   co_uint32 i,j;   co_uint32 c_imag,c_real;   co_uint32 offset,outpos;   co_stream_open(config_stream,O_RDONLY,UINT_TYPE(32));   co_stream_open(ack0,O_RDONLY,UINT_TYPE(8));   co_stream_open(ack1,O_RDONLY,UINT_TYPE(8));   co_stream_open(ack2,O_RDONLY,UINT_TYPE(8));   // Read in parameters   while (co_stream_read(config_stream,&xmax,sizeof(co_uint32))                                                == co_err_none) {     co_stream_read(config_stream,&xmin,sizeof(co_uint32));     co_stream_read(config_stream,&ymax,sizeof(co_uint32));     co_stream_read(config_stream,&ymin,sizeof(co_uint32));     co_stream_read(config_stream,&dx,sizeof(co_uint32));     co_stream_read(config_stream,&dy,sizeof(co_uint32));     co_stream_read(config_stream,&hdx,sizeof(co_uint32));     co_stream_read(config_stream,&hdy,sizeof(co_uint32));     co_stream_open(req0,O_WRONLY,UINT_TYPE(32));     co_stream_write(req0,&xmax,sizeof(co_uint32));     co_stream_write(req0,&xmin,sizeof(co_uint32));     co_stream_write(req0,&dx,sizeof(co_uint32));     co_stream_open(req1,O_WRONLY,UINT_TYPE(32));     co_stream_write(req1,&xmax,sizeof(co_uint32));     co_stream_write(req1,&xmin,sizeof(co_uint32));     co_stream_write(req1,&dx,sizeof(co_uint32));     co_stream_open(req2,O_WRONLY,UINT_TYPE(32));     co_stream_write(req2,&xmax,sizeof(co_uint32));     co_stream_write(req2,&xmin,sizeof(co_uint32));     co_stream_write(req2,&dx,sizeof(co_uint32));     // Loop over region     offset=0;     c_imag=ymax; for (j=0; j<YSIZE; ) {    if (co_stream_read_nb(ack0,&tmp,sizeof(co_uint8))) {        co_stream_write(req0,&c_imag,sizeof(co_uint32));        co_stream_write(req0,&offset,sizeof(co_uint32));        offset += XSIZE*3;        c_imag = FXSUB(c_imag,dy);        j++;      } else if (co_stream_read_nb(ack1,&tmp,sizeof(co_uint8))) {         co_stream_write(req1,&c_imag,sizeof(co_uint32));         co_stream_write(req1,&offset,sizeof(co_uint32));         offset += XSIZE*3;         c_imag = FXSUB(c_imag,dy);         j++;      } else if (co_stream_read_nb(ack2,&tmp,sizeof(co_uint8))) {         co_stream_write(req2,&c_imag,sizeof(co_uint32));         co_stream_write(req2,&offset,sizeof(co_uint32));         offset += XSIZE*3;         c_imag = FXSUB(c_imag,dy);         j++;       }     }     co_signal_post(done,1);     co_stream_close(req0);     co_stream_close(req1);     co_stream_close(req2);   }   co_stream_open(req0,O_WRONLY,UINT_TYPE(32));   co_stream_close(req0);   co_stream_open(req1,O_WRONLY,UINT_TYPE(32));   co_stream_close(req1);   co_stream_open(req2,O_WRONLY,UINT_TYPE(32));   co_stream_close(req2);   co_stream_close(config_stream); }

Three sets of stream writes, each of which passes the necessary configuration data to the individual worker processes via their req stream inputs. The passing of data via the req stream begins processing for each of the workers.
An inner loop iterates through each line of the image (the Y dimension). Using nonblocking read functions (co_stream_read_nb), it polls each of the three worker processes in sequence to determine if they have a completed value waiting in their output streams. If a value is detected, it is read from the ack stream. A new value and X offset are written to the process via the req stream, starting a new calculation for the current X-Y point. Note that as each point is processed, its value is written by the worker process to a shared memory buffer, as described earlier. This is important because the design of the application and the use of co_stream_read_nb means that the lines will not necessarily be processed in order. By using a shared memory to store the image, the lines may be written (using co_memory_writeblock from within the genline process) in any order.
After all points have been processed, the loop terminates and a message is posted via the done signal. This message indicates to the test bench (the controlling software process, which might be driven from a user interface of some kind) that the image generation is complete.

The Configuration Function

Figure 13-7 shows the configuration function that connects the three instances of the worker processes to the controller process and connects these four processes to a software test bench. The software test bench (named test_proc) accepts the results of the fractal image through the use of a shared memory buffer. The results are then written to a Windows format image file for viewing.