Filtering algorithms on the ring | UNIX Systems Programming: Communication, Concurrency and Threads

Team-FLY

7.10 Parallel Ring Algorithms

Parallel processing refers to the partitioning of a problem so that pieces of the problem can be solved in parallel, thereby reducing the overall execution time. One measure of the effectiveness of the partitioning is the speedup , S ( n ), which is defined as follows .

graphics/246equ01.gif

Ideally, the execution time is inversely proportional to the number of processors, implying that the speedup S ( n ) is just n . Unfortunately, linear speedup is a rare achievement in practical settings for a number of reasons. There is always a portion of the work that cannot be done in parallel, and the parallel version of the algorithm incurs overhead when the processors synchronize or communicate to exchange information.

The problems that are most amenable to parallelization have a regular structure and involve exchange of information following well-defined patterns. This section looks at two parallel algorithms for the ring: image filtering and matrix multiplication. The image filtering belongs to a class of problems in which each processor performs its calculation independently or by exchanging information with its two neighbors. In matrix multiplication, a processor must obtain information from all the other processors to complete the calculation. However, the information can be propagated by a simple shift. Other parallel algorithms can also be adapted for efficient execution on the ring, but the communication patterns are more complicated than those of the examples done here.

7.10.1 Image filtering

A filter is a transformation applied to an image. Filtering may remove noise, enhance detail or blur image features, depending on the type of transformation. This discussion considers a greyscale digital image represented by an n x n array of bytes. Common spatial filters replace each pixel value in such an image by a function of the original pixel and its neighbors. The filter algorithm uses a mask to specify the neighborhood that contributes to the calculation. Figure 7.11 shows a 3 x 3 mask of nearest neighbors. This particular mask represents a linear filter because the function is a weighted sum of the pixels in the mask. In contrast, a nonlinear filter cannot be written as a linear combination of pixels under the mask. Taking the median of the neighboring pixels is an example of a nonlinear filter.

Figure 7.11. Mask for applying a smoothing filter to an image.

graphics/07fig11.gif

The values in the mask are the weights applied to each pixel in the sum when the mask is centered on the pixel being transformed. In Figure 7.11, all weights are 1/9. If a _i,j is the pixel at position ( i, j ) of the original image and b _i,j is the pixel at the corresponding position in the filtered image, the mask in Figure 7.11 represents the pixel transformation

This transformation blurs sharp edges and eliminates contrast in an image. In filtering terminology, the mask represents a low-pass filter because it keeps slowly varying ( low-frequency ) components and eliminates high-frequency components . The mask in Figure 7.12 is a high-pass filter that enhances edges and darkens the background.

Figure 7.12. Mask for applying a difference filter to an image.

graphics/07fig12.gif

The ring of processes is a natural architecture for parallelizing the types of filters described by masks such as those of Figures 7.11 and 7.12. Suppose a ring of n processes is to filter an n x n image. Each process can be responsible for computing the filter for one row or one column of the image. Since ISO C stores arrays in row-major format (i.e., the elements of a two-dimensional array are stored linearly in memory by first storing all elements of row zero followed by all elements of row one, and so on), it is more convenient to have each process handle one row.

To perform the filtering operation in process p , do the following.

Obtain rows p-1, p , and p+1 of the original image. Represent the pixel values of three rows of the original image by the following array.
```
 unsigned char a[3][n+2]; 
```
Put the image pixels of row p-1 in a[0][1] , . . ., a[0][n] . Set a[0][0] and a[0][n+1] to 0 to compute the result for border pixels without worrying about array bounds. Handle rows p and p+1 similarly. If p is 1, set a[0][0] , . . . , a[0][n+1] to 0 corresponding to the row above the image. If p is n , set a[2][0] , . . . , a[2][n+1] to 0 corresponding to the row of pixels below the bottom of the image.
Compute the new values for the pixels in row p and store the new values in an array.
```
 unsigned char b[n+2]; 
```
To compute the value of b[i] , use the following program segment.
```
 int sum; int i; int j; int m; sum = 0; for (j = 0; j < 3; j++)    for (m = i - 1; m < i + 2; m++)       sum += a[j][m]; b[i] = (unsigned char) (sum/9); 
```
The value of b[i] is the pixel value bp,i in the new image.
Insert b in row p of the new image.

The preceding description is purposely vague about where the original image comes from and where it goes. This I/O is the heart of the problem. The simplest approach is to have each process read the part of the input image it needs from a file and write the resulting row to another file. In this approach, the processes are completely independent of each other. Assume that the original image is stored as a binary file of bytes in row-major order. Use lseek to position the file offset at the appropriate place in the file, and use read to input the three needed rows. After computing the new image, use lseek and write to write the row in the appropriate place in the image. Be sure to open the input and output image files after the fork so that each process on the ring has its own file offsets.

A bidirectional ring

An alternative approach uses nearest-neighbor communication. Process p on the ring reads in only row p . It then writes row p to its neighbors on either side and reads rows p-1 and p+1 from its neighbors. This exchange of information requires the ring to be bidirectional, that is, a process node can read or write from the links in each direction. (Alternatively, replace each link in the ring by two unidirectional links, one in each direction.) It is probably overkill to implement the linear filter with nearest-neighbor communication, but several related problems require it.

For example, the explicit method of solving the heat equation on an n x n grid uses a nearest-neighbor update of the form

The constant D is related to the rate that heat diffuses on the grid. The array b _i,j is the new heat distribution on the grid after one unit of time has lapsed. It becomes the initial array a _i,j for the next time step. Clearly, the program should not write the grid to disk between each time step, so here a nearest-neighbor exchange is needed.

Block computation

Another important issue in parallel processing is the granularity of the problem and how it maps to the number of processes. The ring is typically under 100 processes, while the images of interest may be 1024 x 1024 pixels. In this case, each process computes the filter for a block of rows.

Suppose the ring has m processes and the image has n x n pixels, where n = qm+r . The first r processes are responsible for q+1 rows, and the remaining processes are responsible for q rows. Each process computes from q and r the range of rows that it is responsible for. Pass m and n as command-line arguments to the original process in the ring.

7.10.2 Matrix multiplication

Another problem that lends itself to parallel execution on a ring is matrix multiplication. To multiply two n x n matrices, A and B , form a third matrix C that has an entry in position ( i, j ) given by the following.

graphics/249equ01.gif

In other words, element ( i, j ) of the result is the product of row i of the first matrix with column j of the second matrix. Start by assuming that there are n processes on the ring. Each input array is stored as a binary file in row-major form. The elements of the array are of type int .

One approach to matrix multiplication is for process p to read row p of the input file for matrix A and column p of the input file for matrix B . Process p accumulates row p of matrix C . It multiplies row p of A by column p of B and sets c[p,p] to the resulting value. It then writes column p of matrix B to the ring and reads column p-1 from its neighbor. Process p then computes element c[p,p-1] , and so on.

The row-column is very efficient once the processes have read the columns of B , but since B is stored in row-major form, the file accesses are inefficient if the process is accessing a column of B , since the read must seek for each element. In addition it is likely that matrix multiplication is an intermediate step in a larger calculation that might have the A and B distributed to processes in row-major form. The following algorithm performs matrix multiplication when process p starts with row p of A and row p of B .

Process p is going to compute row p of the result. On each iteration, a row of B contributes one term to the sum needed to calculate each element of row p of the product matrix. Each process eventually needs all the entries of B , and it receives the rows of B one at a time from its neighbors. Use the following arrays.

 int a[n+1];        /* holds the pth row of A */ int b[n+1];        /* starts with the pth row of B */ int c[n+1];        /* holds the pth row of C */

Initialize the elements of a[] and b[] from their respective files. Initialize c[] , using

 for (k = 1; k < n+1; k++)    c[k] = a[p] * b[k];

In process p , this approach accounts for the contribution of row p of B to row p of the output C . In other words, c[p,k] = a[p,p]*b[p,k] . Process p does the following.

 m = p; write(STDOUT_FILENO, &b[1], n*sizeof(int)); read(STDIN_FILENO, &b[1], n*sizeof(int)); for (k = 1; k < n+1; k++) {    if (m-- == 0)       m = n;    c[k] += a[m]*b[k]; }

The read function fills the b[] array with the values of the row of B held initially by the process immediately before it on the ring. One execution of the for loop adds the contribution of row p-1 of B to row p of the result corresponding to c[p,k] = c[p,k] + a[p,p-1]* b[p-1,k] . Execute this code n-1 times to multiply the entire array. Write the resulting c[] as row p of the output file. Note: The proposed strategy may cause a deadlock if n is so large that the write exceeds the size of PIPE_BUF . A more robust strategy might use select to process the reading and writing simultaneously .

Team-FLY