The standard MPI specification provides communicator and group constructors that allow application programmers to create a group of processes that together execute the parallel computations making up a logical unit of a parallel algorithm. The participating processes are explicitly chosen from an ordered set of processes. This approach to group creation is quite acceptable if the MPI application runs on a homogeneous distributed-memory computer system with one process per processor. In that case the explicitly created group will typically execute the parallel algorithm in the same time as any other group with the same number of processes, because all processors have the same computing power, and the speed and the bandwidth of the communication links between different pairs of processors are the same. On heterogeneous NoCs, however, a group of processes optimally selected with respect to the speeds of the processors and the speeds and bandwidths of the communication links between them will execute the parallel algorithm faster than any other group of processes. Selecting such a group is usually very difficult: the programmers must write a great deal of complex code to detect the actual speeds of the processors and of the communication links between them, and then use this information to select the optimal set of processes running on different computers of the heterogeneous network.
The main idea of HMPI is to automate the selection of such a group of processes, that is, a group that executes the heterogeneous algorithm faster than any other group. HMPI allows the application programmers to describe a performance model of their implemented heterogeneous algorithm. This model is essentially the same as the one used in the mpC language and presented in Chapters 6 and 7. It takes into account all the main features of the underlying parallel algorithm that have an essential impact on the application's execution performance on heterogeneous NoCs:
The total number of processes executing the algorithm.
The total volume of computations to be performed by each process in the group during the execution of the algorithm.
The total volume of data to be transferred between each pair of processes in the group during the execution of the algorithm.
The order of execution of the computations and communications by the involved parallel processes in the group, that is, how exactly the processes interact during the execution of the algorithm.
HMPI provides a small and dedicated model definition language for specifying this performance model. This language is practically a subset of the mpC language used to specify mpC network types. A compiler compiles the description of this performance model to generate a set of functions. The functions make up an algorithm-specific part of the HMPI runtime system.
Having provided such a description of the performance model, application programmers can use a new operation, whose interface is shown below, to create a group that will execute the heterogeneous algorithm faster than any other group of processes,
int HMPI_Group_create(HMPI_Group* gid, const HMPI_Model* perf_model, const void* model_parameters, int param_count);
perf_model is a handle that encapsulates all the features of the performance model in the form of a set of functions generated by the compiler from the description of the performance model,
model_parameters are the parameters of the performance model (see the example below), and
param_count is the number of parameters of the performance model.
This function returns an HMPI handle to the group of MPI processes in gid.
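As a hedged sketch of how this operation might be called (the model handle grid_model, its parameter layout, and the values are illustrative assumptions, not part of the HMPI interface):

```c
/* Sketch only: "grid_model" stands for the set of functions generated by
 * the compiler from a performance model description; the two parameters
 * (number of processes, problem size) are hypothetical. */
HMPI_Group gid;
int model_params[2] = { 16, 1024 };

/* Collective call: the parent and all processes that are not members of
 * any HMPI group take part. The runtime selects the set of processes
 * predicted to execute the modeled algorithm fastest. */
int rc = HMPI_Group_create(&gid, &grid_model, model_params, 2);
```

On success, gid identifies the selected group of MPI processes.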
In HMPI the groups are not entirely independent of each other. Every newly created group has exactly one process shared with an already existing group. That process is called the parent of the newly created group, and it is the connecting link through which the computation results are passed when the group ceases to exist. HMPI_Group_create is a collective operation and must be called by the parent and by all the processes that are not members of any HMPI group.
During the creation of the process group, the HMPI runtime system solves the problem of selecting the optimal set of processes running on different computers of the heterogeneous network. The solution to the problem is based on the following:
The performance model of the parallel algorithm in the form of the set of functions generated by the compiler from the description of the performance model.
The model of the executing network of computers, which reflects the state of this network just before the execution of the parallel algorithm.
The algorithms used to solve the problem of process selection are essentially the same as those used in the mpC compiler, discussed in Section 7.6. The accuracy of the model of the executing network of computers depends on the accuracy of the estimates of the processor speeds. HMPI therefore provides an operation to dynamically update these estimates at runtime. This is especially important when the computers executing the target program are also used for other computations; in that case the actual speeds of the processors can change dynamically with the external load. This operation, whose interface is shown below, allows the application programmers to write parallel programs that are sensitive to such dynamic variations of the workload in the underlying computer system,
int HMPI_Recon(HMPI_Benchmark_function func, const void* input_p, int num_of_parameters, void* output_p)
This operation causes all of the processors to execute a benchmark function func in parallel, and the time elapsed by each processor in executing the code is used to update its speed estimate. This is a collective operation that must be called by all of the processes in the group associated with the predefined communication universe HMPI_COMM_WORLD of HMPI.
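For illustration, a benchmark function and its use might look as follows. The function body, and the assumption that the benchmark shares the (input, count, output) parameter shape of HMPI_Recon itself, are ours, not part of the interface:

```c
/* Hypothetical benchmark: a small computation representative of the
 * core of the target algorithm. */
static void bench(const void *input_p, int num_of_parameters, void *output_p)
{
    int n = *(const int *)input_p;      /* test problem size */
    double s = 0.0;
    for (int i = 0; i < n * n; i++)     /* representative floating-point work */
        s += (double)i * 0.5;
    *(double *)output_p = s;            /* keep the result so the loop is not optimized away */
}

/* Collective over all processes of HMPI_COMM_WORLD: each processor runs
 * bench, and the elapsed time updates its speed estimate. */
int n = 500;
double out;
HMPI_Recon(bench, &n, 1, &out);
```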
Another principal operation provided by HMPI allows application programmers to predict the total execution time of the algorithm on the underlying hardware. Its interface is written as
double HMPI_Timeof(const HMPI_Model* perf_model, const void* model_parameters, int param_count)
This function allows the application programmers to write such a parallel application that can follow different parallel algorithms to solve the same problem, making the choice at runtime depending on the particular executing network and its actual performance. This is a local operation that can be called by any process that is a member of the group associated with the predefined communication universe HMPI_COMM_WORLD of HMPI.
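For example, a program could choose between two alternative algorithms for the same problem by comparing their predicted times. The model handles model_1d and model_2d and the parameter layout below are hypothetical:

```c
/* Predict the execution time of two alternative performance models
 * (local calls, no interprocess communication). */
int params[2] = { p, n };               /* hypothetical model parameters */
double t_1d = HMPI_Timeof(&model_1d, params, 2);
double t_2d = HMPI_Timeof(&model_2d, params, 2);

/* Create the group for whichever algorithm is predicted to be faster. */
const HMPI_Model *best = (t_1d <= t_2d) ? &model_1d : &model_2d;
HMPI_Group gid;
HMPI_Group_create(&gid, best, params, 2);
```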
A typical HMPI application starts with the initialization of the HMPI runtime system using the operation
int HMPI_Init (int argc, char** argv)
where argc and argv are the same arguments passed into the application as the arguments to main. This routine must be called once, before any other HMPI routine, by all the processes running in the HMPI application.
After the initialization the application can call any other HMPI routine. In addition, the programmers can use normal MPI routines, with the exception of the MPI initialization and finalization routines, including the standard group management and communicator management routines to create and free groups of MPI processes. However, they must use the predefined communication universe HMPI_COMM_WORLD of HMPI instead of MPI_COMM_WORLD of MPI.
We recommend that application programmers avoid using groups created with the MPI group constructor operations to perform computations and communications in parallel with HMPI groups, as this may not result in the best execution performance of the application. The point is that the HMPI runtime system is not aware of any group of the MPI processes that is not created under its control. Therefore the HMPI runtime system cannot guarantee that an HMPI group will execute its parallel algorithm faster than any other group of MPI processes if some groups of MPI processes other than HMPI groups, are active during the algorithm execution.
The only group constructor operation provided by HMPI is the creation of the group using HMPI_Group_create, and the only group destructor operation provided by HMPI is
int HMPI_Group_free(HMPI_Group* gid)
where gid is the HMPI handle to the group of MPI processes. This is a collective operation that must be called by all the members of the group. HMPI has no analogues of the other group constructors of MPI, such as the set-like operations on groups and the range operations on groups, for two reasons:
First, HMPI cannot guarantee that a group composed with these operations will execute a logical unit of the parallel algorithm faster than any other group of processes.
Second, it is relatively straightforward for application programmers to perform such group operations themselves on the groups associated with the MPI communicators given by the HMPI_Get_comm operation (see the interface shown below).
The other additional group management operations provided by HMPI, apart from the group constructor and destructor, are the following group accessors:
HMPI_Group_rank to get the rank of the process in the HMPI group.
HMPI_Group_size to get the number of processes in this group.
The initialization of HMPI runtime system is typically followed by:
Updating of the estimation of the speeds of processors with HMPI_Recon.
Finding the optimal values of the parameters of the parallel algorithm with HMPI_Timeof.
Creation of a group of processes, which will perform the parallel algorithm, by using HMPI_Group_create.
Execution of the parallel algorithm by the members of the group. At this point control is handed over to MPI. MPI and HMPI are interconnected by the operation
const MPI_Comm* HMPI_Get_comm (const HMPI_Group* gid)
which returns the MPI communicator whose communication group consists of the MPI processes defined by gid. This is a local operation that does not require interprocess communication. Application programmers can use this communicator to call the standard MPI communication routines during the execution of the parallel algorithm; the communicator can safely be used in any other MPI routine as well.
Freeing the HMPI groups with HMPI_Group_free.
Finalizing the HMPI runtime system by using operation
int HMPI_Finalize (int exitcode)
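The typical structure described above can be put together in one sketch. Only the HMPI interfaces quoted in this section are taken as given; the header name hmpi.h, the model handle my_model, the benchmark body, and the exact accessor signature of HMPI_Group_rank are assumptions:

```c
#include <mpi.h>
#include "hmpi.h"                     /* assumed header for the HMPI interfaces */

static void bench(const void *in, int n_params, void *out)
{
    /* ... representative computation, as in the HMPI_Recon example ... */
}

int main(int argc, char **argv)
{
    HMPI_Init(argc, argv);                          /* initialize the HMPI runtime */

    int n = 500;
    double out;
    HMPI_Recon(bench, &n, 1, &out);                 /* update processor speed estimates */

    int params[1] = { n };
    double t = HMPI_Timeof(&my_model, params, 1);   /* predicted execution time */
    (void)t;                                        /* could drive an algorithm choice */

    HMPI_Group gid;
    HMPI_Group_create(&gid, &my_model, params, 1);  /* create the optimal group */

    int rank;
    HMPI_Group_rank(&gid, &rank);                   /* accessor; exact signature assumed */

    const MPI_Comm *comm = HMPI_Get_comm(&gid);     /* hand control over to MPI */
    if (comm != NULL) {
        double data[100];
        MPI_Bcast(data, 100, MPI_DOUBLE, 0, *comm); /* standard MPI communication */
    }

    HMPI_Group_free(&gid);                          /* collective over group members */
    HMPI_Finalize(0);                               /* shut down the HMPI runtime */
    return 0;
}
```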