Chapter 7: Advanced Heterogeneous Parallel Programming in mpC | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

6.12. SUMMARY

The subset of the mpC language presented in this chapter addresses some primary challenges of heterogeneous parallel computing. Namely, it focuses on the uneven distribution of computations in heterogeneous parallel algorithms and on the heterogeneity of the physical processors of the executing network.

The programmers can explicitly specify the uneven distribution of computations across parallel processes dictated by the implemented heterogeneous parallel algorithm. The mpC compiler will use the provided information to map the parallel processes to the executing network of computers.

A simple model of the executing network is used when parallel processes of the mpC program are mapped to physical processors of the network. The relative speed of the physical processors is a key parameter in the model. In heterogeneous environments this parameter is very sensitive both to the particular code executed by the processors and to the load currently placed on them by external computations. Programmers can control the accuracy of this model at runtime, and adjust its parameters to their particular applications, by using the recon statement. This statement

provides a test code that should be used to estimate the relative speed of physical processors, and
specifies the exact point in the program where the test code should be executed to modify the parameters of the underlying model of the network of computers.
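The idea behind such a speed estimate can be illustrated outside mpC. The following is a minimal Python sketch, not the mpC recon statement or its implementation: it assumes that every processor runs the same test code and that the elapsed times are then compared. The names relative_speeds and time_test_code, and the sample timings, are hypothetical.

```python
import time


def relative_speeds(elapsed_seconds):
    """Convert per-processor test-code times into relative speeds.

    A processor that finishes the same test code in half the time is
    treated as twice as fast. Speeds are scaled so the fastest is 1.0.
    """
    if not elapsed_seconds or any(t <= 0 for t in elapsed_seconds):
        raise ValueError("every processor needs a positive elapsed time")
    fastest = min(elapsed_seconds)
    return [fastest / t for t in elapsed_seconds]


def time_test_code(test_code, *args):
    """Time one run of the test code on the local processor (hypothetical helper)."""
    start = time.perf_counter()
    test_code(*args)
    return time.perf_counter() - start


# Hypothetical timings, in seconds, gathered from three processors
# that each ran the same test code.
measured = [2.0, 4.0, 8.0]
speeds = relative_speeds(measured)
```

Because the estimate depends on the test code and on the load at the moment of measurement, repeating the measurement at a different point in the program may yield different speeds, which is why the placement of the test matters.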

Now we briefly discuss some important topics regarding the mpC language. These are topics that are not presented in detail in this chapter but that should be addressed.

The first is the kind of application for which the mpC language is suited. The language is most suitable for the parallel solution of irregular problems, such as calculating the mass of a building construction frame welded from heterogeneous metal rails, on both homogeneous and heterogeneous distributed-memory computer systems.

The mpC language is also suitable for solving regular problems, such as dense linear algebra, in heterogeneous environments. There are two approaches to solving regular problems on heterogeneous clusters using mpC:

Irregularization of the problem in accordance with the irregularity of the executing hardware.
Distribution of a relatively large number of homogeneous parallel processes over physical processors of the heterogeneous cluster in accordance with their speed.

The mpC language provides a natural and portable implementation of both approaches. The first approach was demonstrated in Section 6.10 for parallel matrix-matrix multiplication. The second approach is demonstrated in Section 9.1.3.
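Both approaches rest on the same arithmetic: dividing a number of indivisible units in proportion to processor speed. The following Python sketch illustrates that arithmetic only; it is not mpC code and not the mapping algorithm of the mpC programming system. The function proportional_shares, the largest-remainder rounding, and the sample speeds are assumptions made for illustration.

```python
def proportional_shares(total, speeds):
    """Split `total` indivisible units in proportion to `speeds`.

    Uses largest-remainder rounding so the shares always sum to `total`.
    """
    if total < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need a non-negative total and positive speeds")
    scale = total / sum(speeds)
    exact = [s * scale for s in speeds]
    shares = [int(x) for x in exact]  # round every share down first
    leftover = total - sum(shares)
    # Hand the remaining units to the largest fractional parts.
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: exact[i] - shares[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical relative speeds of three processors.
speeds = [1, 2, 4]

# Approach 1: one process per processor, unevenly sized data
# (for example, 100 matrix rows split by speed).
rows_per_processor = proportional_shares(100, speeds)

# Approach 2: 14 equally sized processes, unevenly many per processor.
processes_per_processor = proportional_shares(14, speeds)
```

In the first case the unit is a piece of data, so the problem itself becomes irregular; in the second the unit is a whole process, so the problem stays regular and only the placement is uneven.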

Another interesting issue involves deciding on the number of processes of the mpC program. How many processes should be allocated to each participating computer when the user starts up the program? Obviously, the more processes one has, the better the load balancing that can be achieved. On the other hand, more processes consume more resources and cause more interprocess communication, and so can significantly increase the total overhead.
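The load-balancing half of this trade-off can be made concrete with a small calculation. The Python sketch below is an idealized illustration under stated assumptions: all processes do equal work, each process is placed greedily on the processor where it would finish earliest, and communication and memory overhead are ignored entirely. The function best_step_time and the speeds are hypothetical and do not describe how mpC maps processes.

```python
def best_step_time(num_processes, speeds, total_work=1.0):
    """Return (step time, processes per processor) for equal-sized processes.

    Each process is placed where it would finish earliest, given the
    processes already assigned to each processor.
    """
    counts = [0] * len(speeds)
    for _ in range(num_processes):
        i = min(range(len(speeds)),
                key=lambda j: (counts[j] + 1) / speeds[j])
        counts[i] += 1
    work_per_process = total_work / num_processes
    step_time = max(c * work_per_process / s
                    for c, s in zip(counts, speeds))
    return step_time, counts


# Two hypothetical processors, the second 2.5 times faster than the first.
speeds = [1.0, 2.5]
ideal = 1.0 / sum(speeds)  # step time under perfect load balance

t2, counts2 = best_step_time(2, speeds)
t4, counts4 = best_step_time(4, speeds)
t7, counts7 = best_step_time(7, speeds)
```

With only two processes the work cannot be split in a 1 : 2.5 ratio, so the step time stays above the ideal; with seven processes a 2 : 5 placement matches the speed ratio exactly. The overhead that grows with the process count is deliberately left out of this sketch.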

Some basic rules for making this choice are the following. First, the number of processes running on a computer should not be less than the number of processors of that computer, so that all the available processor resources can be used.

At its upper bound, the number is limited by the underlying operating system and/or the communication platform. For example, LAM MPI version 5.2 installed under Solaris 2.3 does not allow more than 15 MPI processes to run on an individual workstation.
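The lower and upper bounds described above amount to clamping a requested process count into a permitted range. The following Python sketch shows only that rule; the function name clamp_process_count and its parameters are hypothetical, and the limit of 15 is taken from the LAM MPI example in the text rather than queried from any real platform.

```python
def clamp_process_count(requested, num_processors, platform_limit):
    """Keep the per-computer process count within the two bounds.

    Lower bound: at least one process per processor of the computer.
    Upper bound: whatever the operating system or communication
    platform allows on a single computer.
    """
    if num_processors < 1 or platform_limit < num_processors:
        raise ValueError("bounds are inconsistent")
    return max(num_processors, min(requested, platform_limit))


# Limit taken from the example in the text: 15 processes per workstation.
PLATFORM_LIMIT = 15

too_few = clamp_process_count(1, 4, PLATFORM_LIMIT)    # raised to 4
too_many = clamp_process_count(40, 2, PLATFORM_LIMIT)  # lowered to 15
in_range = clamp_process_count(6, 1, PLATFORM_LIMIT)   # stays 6
```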

If the mpC application does not define a significant amount of static data, then the processes not selected for the abstract processors of the mpC network stay idle and consume little in terms of processor cycles or memory. The only overhead is the additional communication with these processes, which includes the initialization of the underlying communication platform and the mpC-specific communications during execution of the application. The latter occur mainly during the creation of a network. The time taken by this operation grows much more slowly than the number of processes.

For example, the use of six processes instead of one process per workstation on a network of nine uniprocessor workstations caused only a 30% increase in the network creation time. This is because the operation includes some relatively significant calculations, and these calculations are more sensitive to the number of computers than to the number of processes running on each of the computers.

Some applications are designed to run with no more than one process per processor. The matrix multiplication in Section 6.10 is an example of such an application.

Apart from data parallelism, the mpC language also supports task parallelism, mainly via the mechanism of nodal and, especially, network functions. Different nodal and network functions can be called in parallel, each having its own control flow.

Network functions also enable modular parallel programming in mpC. One programmer can implement a parallel algorithm in the form of a network function, and other programmers can safely use that program unit in parallel with other computations in their applications without any knowledge of its code.

The next topic concerns applications where different parallel algorithms are coupled. There are many ways of programming such applications in mpC. If two algorithms are loosely coupled, two mpC networks of different types, executing the algorithms in parallel, can be defined. The mpC programming system will try to map the algorithms to the executing network of computers so as to ensure the best execution time.

Alternatively, two different mpC networks executing the algorithms can be defined serially (especially if there is a strong data dependency between the algorithms). In this case the first mpC network should be destroyed before the second one is created, so that all processes of the program are available when each of the algorithms is mapped to the underlying hardware. If two algorithms are tightly coupled, they can be described in the framework of the same network type and performed by the same mpC network.