9.2.2 Redundancy
Sometimes a parallel algorithm performs the same computation on many processors. Although this can often be faster than computing the result once and broadcasting it, P - 1 of the processors are nevertheless not doing useful work. This is accounted for as redundancy overhead. Such redundant calculations are negligible in sort1. A few O(1) operations are performed redundantly, e.g., calling malloc to obtain temporary space, but these should not impact performance.
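The tradeoff can be made concrete with a small MPI sketch (an illustration with assumed names and key range, not sort1's actual source): every rank builds the same small table locally, so P - 1 copies of the work are redundant, yet this is still cheaper than a broadcast.

#include <stdlib.h>
#include <mpi.h>

/* Illustrative sketch, not sort1's source: all P ranks compute the same
 * O(P) table.  P - 1 of the computations are redundant, but for work this
 * small, recomputing locally beats communicating. */
int main(int argc, char **argv)
{
    int P, range = 1 << 20;              /* assumed key range */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Redundant malloc and fill on every rank, as described in the text. */
    int *splitters = malloc(P * sizeof(int));
    for (int i = 0; i < P; i++)
        splitters[i] = (long)i * range / P;

    /* Alternative: compute once on rank 0, then call
     *   MPI_Bcast(splitters, P, MPI_INT, 0, MPI_COMM_WORLD);
     * which would pay a message latency to save O(P) arithmetic. */

    free(splitters);
    MPI_Finalize();
    return 0;
}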
9.2.3 Extra Work
Extra work is parallel computation that does not take place in a sequential implementation. For example, sort1 computes the processor destination of every input element. This computation is not required by the sequential algorithm. As implemented, an extra integer division and a dynamic-array assignment are performed for every element.
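The per-element computation described above might look like the following sketch (the names vec, outbox, and range are assumptions for illustration; this is not sort1's actual source):

#include <stdlib.h>

/* Minimal dynamic array, assumed zero-initialized by the caller. */
typedef struct { int *v; int n, cap; } vec;

static void append(vec *a, int x)
{
    if (a->n == a->cap) {
        a->cap = a->cap ? 2 * a->cap : 16;
        a->v = realloc(a->v, a->cap * sizeof(int));
    }
    a->v[a->n++] = x;
}

/* Scatter n local keys, assumed uniform in [0, range), into P outboxes.
 * Each element costs one integer division and one dynamic-array append,
 * work with no counterpart in the sequential sort. */
void bin_keys(const int *key, int n, int P, int range, vec *outbox)
{
    int bin_size = range / P;
    for (int i = 0; i < n; i++) {
        int dest = key[i] / bin_size;   /* the extra integer division */
        if (dest >= P) dest = P - 1;    /* top-of-range keys go to rank P-1 */
        append(&outbox[dest], key[i]);  /* the extra dynamic-array assignment */
    }
}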
On the other hand, the in-memory sort that every processor performs is smaller by a factor of P. Since sorting (using the library qsort, at least) takes O(N log N) time, fewer cycles are actually spent in qsort in the parallel case. These two effects tend to cancel, and it is even possible for the overhead from extra work to be negative. Such situations are fairly common in practice and are often met with skepticism because they can lead to the counterintuitive phenomenon of "superlinear speedup," in which P processors solve a problem more than P times faster than a single processor. In this case (and, in fact, generally), superlinear speedup results from using a non-optimal sequential algorithm. The sequential sort would be faster if its behavior were closer to that of the parallel algorithm, i.e., if it first categorized the input into bins and then sorted the bins, as in the sketch below.
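A sequential bin-then-sort along those lines might look like the following (a sketch under the same uniform-key assumption; the bin count B and all names here are illustrative, not the book's code). Sorting B bins of roughly N/B elements each costs about N log(N/B) comparisons instead of N log N for a single qsort, and the binning pass mirrors the "extra work" of the parallel code, which is what makes the comparison fair.

#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Bin-then-sort sketch: keys assumed uniform in [0, range). */
void bin_sort(int *key, int n, int range, int B)
{
    int *count = calloc(B + 1, sizeof(int));   /* count[b+1] = size of bin b */
    int *next  = malloc(B * sizeof(int));
    int *tmp   = malloc(n * sizeof(int));
    int bin_size = range / B;

    for (int i = 0; i < n; i++) {              /* histogram the bins */
        int b = key[i] / bin_size;
        if (b >= B) b = B - 1;
        count[b + 1]++;
    }
    for (int b = 0; b < B; b++)                /* prefix sum: bin start offsets */
        count[b + 1] += count[b];
    memcpy(next, count, B * sizeof(int));
    for (int i = 0; i < n; i++) {              /* scatter keys into their bins */
        int b = key[i] / bin_size;
        if (b >= B) b = B - 1;
        tmp[next[b]++] = key[i];
    }
    for (int b = 0; b < B; b++)                /* sort each bin independently */
        qsort(tmp + count[b], count[b + 1] - count[b], sizeof(int), cmp_int);

    memcpy(key, tmp, n * sizeof(int));
    free(count); free(next); free(tmp);
}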
9.2.4 Load Imbalance
Load imbalance measures the extra time spent by the slowest processor, in excess of the mean over all processors. If we assume that every processor starts with an equal number of unsorted elements, then there is negligible load imbalance in the initial decomposition phase of the algorithm. The final isort, however, can become imbalanced if different numbers of elements are delivered to different processors. The problem assumptions guarantee that the data are uniformly distributed, which implies that the number of elements on any given processor is a binomially distributed random variable with p = 1/P. For large N, the number of elements assigned to any given processor will be approximately Gaussian with mean N/P and standard deviation sqrt(N(1/P)(1 - 1/P)) ≈ sqrt(N/P). As such, it is very unlikely that any processor receives significantly more than N/P elements.
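To make the magnitude concrete (the following numbers are an illustrative calculation, not from the original text): the relative imbalance is roughly sqrt(N/P) / (N/P) = sqrt(P/N). With N = 10^6 elements and P = 16 processors, each processor expects 62,500 elements with a standard deviation of about 242, a relative variation of roughly 0.4 percent, and the variation shrinks as N grows.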