6.12. SUMMARY | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

6.11. PRINCIPLES OF IMPLEMENTATION

This section introduces the basic implementation principles of the subset of the mpC language presented above. The discussion is addressed not only to implementers of parallel programming systems, knowledge of these principles allows mpC users to write more efficient programs. The primary components of the mpC programming system are the compiler, the runtime support system, the library, and the command-line user interface.

The mpC compiler translates a source mpC program into the ANSI C program with calls to functions of the mpC runtime system. It normally uses the SPMD model of the target code, when all processes constituting the target message-passing program run the identical code. Optionally, it can translate a source mpC file into two separate target files: one for the host-process and the other for the remaining processes.

The mpC runtime system is an internal library that is not available to the mpC programmer. Its functions are only used by the mpC compiler in the generated target code. It provides both operations on parallel processes of the mpC program and interprocess communication operations. The mpC runtime system has a precisely specified interface and encapsulates a particular communication package such as MPI. It ensures platform-independence of the mpC compiler.

The mpC library provides some definitions and operations that can be used by programmers in their mpC programs. The definition of net-work SimpleNet is an example of an mpC library definition. Operations are provided in the form of a basic, nodal, or network function.

MPC_Global_barrier is an example of an mpC basic library function. MPC_Wtime is an example of an mpC nodal library function. MPC_Barrier is an example of an mpC network library function. All mpC library definitions and prototypes of the mpC library functions are made available via the header file mpc.h.

The mpC command-line user interface consists of a number of utilities that allow the user to initialize the mpC programming system on the network of computers, and to compile, link, and run parallel mpC programs on the network.

6.11.1. Model of a Target Message-Passing Program

We have noted that the mpC compiler translates each mpC program into a message-passing program that uses the mpC runtime system for process management and interprocess communication. The parallel processes that constitute the target program are divided into two groups.

The first group consists of common working processes that perform computations and communications specified in the source mpC program and can play the role of abstract processors of one or another mpC network.

The second group consists of exactly one process that plays the role of the manager of the working processes. This specific process is called dispatcher. The dispatcher works as a server. It receives requests from the working processes and sends them commands to be performed.

In the target program each network of the source mpC program is represented by a group of working processes called a team. Definition of the mpC network causes creation of the team, whose processes play the role of abstract processors of this network.

At any time of program execution, each working process is either free (i.e., not a member of any team) or a member of one or more teams. This is the dispatcher who assigns processes into a newly created team and dismisses them on the destruction of the team. The only exception is a pre-assigned host-process representing the mpC pre-defined abstract host-processor. Thus, immediately after initialization of the target program, its working processes consist of the host-process and a set of temporarily free processes.

Two main operations on working processes performed during execution of the target program are creation and destruction of teams of processes representing mpC networks of the source program. Implementation of these operations determines both the whole structure of the target code and the functionality of the mpC runtime system.

To create a team, the process that plays the role of the parent of the corresponding mpC network computes the actual parameters of this network and sends a creation request to the dispatcher. The request contains information about the team to be created, including the number of processes and the relative volume of computation the processes should perform.

The dispatcher has information about the executing network of computers that includes

the number of physical processors,
the relative speed of the processors, and
the mapping of working processes onto the physical processors.

Based on this information, the dispatcher selects those free working processes that are the most appropriate to be assigned to the newly created team.

After that, the dispatcher sends to each free working process a message saying whether or not the process is assigned into the team. Each process assigned to the team also receives comprehensive information about the team and its identification in this team as part of the message from the dispatcher, and stores the data in its memory.

To destruct a team, its parent process sends a destruction request to the dispatcher. Note that the parent process of the team retains its membership in all those teams that have existed from the time this team was created. The other members of the destructed team become free and await commands from the dispatcher.

Any working process can detect whether or not it is free. It is not free if a call to function MPC_Is_busy of the mpC runtime system returns 1. If such a call returns 0, the process is free.

Any working process can also detect whether or not it is a member of any existing team. A team of working processes is accessed via its descriptor. If descriptor td is associated with the team, then a working process belongs to this team if and only if a function call MPC_Is_member(&td) returns 1. In this case descriptor td allows the process to access comprehensive information about the team and to identify itself in this team. Remember that these data are received from the dispatcher and stored in the memory of the process during the creation of the team. A team descriptor has the same scope and duration class (automatic or static) in the generated code as the corresponding mpC network in the source mpC code.

Creation of a team involves its parent process, all free processes and the dispatcher. The mpC compiler inserts in the target code all of the necessary synchronization operations in order to guarantee coordinated arrival of all interested working processes in the code section that performs the team creation operation.

The parent process first computes the data that specify the team to be created and then composes a creation request, which includes these data and a transferable ID of the team. The ID is used by processes to locally identify the correct team descriptor. As during creation of the team all participating working processes execute a code located in the same file, the team ID does not have to be globally unique. It is enough for the ID to be unique in the scope of the file so that there is one-to-one mapping between the team descriptors defined in this file and the IDs. The IDs can be generated at compile time.

For example, the ordinal number of the mpC network in the source mpC file may be used as such an ID.

The parent process next sends the creation request to the dispatcher and begins to wait for a message from the dispatcher confirming the creation of the team. Upon receipt of the confirmation message, the parent process enters a barrier section that completes the team creation operation.

Meantime each free process waits for a message from the dispatcher. Upon receipt of the message, the process acts according to what the message says. If this message says that the process is not selected for the team, it awaits the next message from the dispatcher.

Otherwise, the received message contains data specifying the team and identifying the process in the team. The process stores the data in its memory and associates the stored data with the team descriptor, which is identified by the received team ID. Then it leaves the set of free processes and enters the barrier section, completing the team creation operation.

A free process leaves the waiting point either after it becomes a member of a team or after the dispatcher sends to all free processes a message commanding them to leave the point. The dispatcher sends such a message after all the teams have been created by this team creation operation. Upon receipt of this command, each free process also enters the barrier section, completing the team creation operation.

The barrier section completing the creation operation finalizes the creation of the teams. For example, if the mpC runtime system is built on top of MPI, the finalization may mean the creation of a communcator for each team of processes.

6.11.2. Mapping of the Parallel Algorithm to the Processors of a Heterogeneous Network

The preceding subset of the mpC language introduced an abstraction of the heterogeneous parallel algorithm in the form of an mpC network. This abstraction can be used by the mpC programmer to advise the mpC compiler of some features of the implemented parallel algorithm that might have an impact on its performance. This model of parallel algorithm ignores communication operations and is quite simple. In fact the implemented parallel algorithm is characterized by only two main attributes:

The number of abstract processors to perform the algorithm.
The relative volume of computations to be executed by each processor.

As we have discussed, the dispatcher uses this information to map the parallel algorithm to the physical processors of the executing network of computers. In this section we explain how the dispatcher performs the mapping.

Technically the dispatcher maps abstract processors of the mpC network to physical processes of the parallel program running on the heterogeneous network of computers. To perform the mapping, the dispatcher uses

the mpC model of the parallel algorithm that should be executed,
the model of the executing network of computers that reflects the network’s state just before the execution of the algorithm,
a map of working processes of the parallel program. For each computer of the executing network, the map displays the number of working processes running on the computer, and the number of currently free ones.

This model of a network of computers matches the model of a parallel algorithm that ignores communication operations. Correspondingly the model of a heterogeneous network ignores the network communication layer and is quite simplistic. The model considers the network as a set of heterogeneous multiprocessors.

Each computer is characterized by two attributes:

The relative speed s of execution of a (serial) test code on the computer.
The number of physical processors, or scalability, of the computer.

The first attribute is a function of time, s(t), and can vary even during the execution of the same mpC application (if the application uses the recon statement). Relative speeds of computers are normalized so that the computer running the host-process is always of unit speed (s = 1). The second attribute is a constant, n. It determines the number of noninteracting processes that can run in parallel on the computer without a loss of speed.

The algorithm of mapping of abstract processors of the mpC network to working processes of the program can be summarized as follows: The dispatcher first computes the relative volume of computations to be performed by each abstract processor so that the parent abstract processor gets the same relative volume that it had in the existing mpC network(s). The dispatcher does not need to map the parent abstract processor because it has been mapped before. The remaining abstract processors are mapped successively, starting from the most loaded abstract processor. It is mapped to the fastest remaining free process. Then the second most loaded abstract processor is mapped to the fastest of the remaining free processes, and so on.

At each step of the procedure, the speed of free processes is estimated as follows: Let v be the relative volume of computations associated with abstract processor a that should be mapped at this step. Let s be the speed associated with computer C in the model of the executing network. Let n be the scalability of C. Let A be a set of the mpC program’s abstract processors that are currently mapped to working processes running on C. Let A be divided into n nonintersecting subsets A₁,..., A_n, where the abstract processors from each subset are supposed to be mapped on processes running on the same physical processor of computer C. Let v_ij be the relative volume of computations to be performed by the ith abstract processor from subset A_j.

Then the speed of free working processes running on computer C is estimated as follows:

The dispatcher maps abstract processor a to the computer, whose free processes have the highest estimated speed. Let C be this computer. Then abstract processor a will be added to subset A_k such that

The algorithm above is based on the obvious observation that the smaller things are, the easier they can be evenly distributed. Hence bigger things should be distributed under weaker constraints than smaller ones.

For example, to distribute a number of balls of different size over a number of baskets of different size, we would start from the biggest ball and put it into the biggest basket, then put the second biggest ball into the basket having the biggest free space, and so on. That algorithm keeps the balance between ball sizes and free basket space and guarantees that if at some step there is not enough space for the next ball, there is no way to put all the balls in the baskets.

Similarly, if the algorithm above cannot balance the load of physical processors, this simply means that there is no way to balance them at all.