9.1 Work Breakdown Structure for the MPI

One of the advantages of using MPI over traditional UNIX/Linux processes and sockets is the ability of an MPI environment to launch multiple executables simultaneously. An MPI implementation can launch multiple executables, establish a basic relationship between them, and identify each one. In this book we use the MPICH [1] implementation of MPI. The command:

[1] All the MPI examples in this book were implemented using MPICH 1.1.2 and MPICH 1.2.4 in the Linux environment.

  $ mpirun -np 16 /tmp/mpi_example1  

tells the MPI runtime to launch 16 processes. Each process will execute the program named mpi_example1. Each process may run on a different processor if one is available, and each process may be on a different machine if MPI is run in a cluster-type environment. The processes will execute concurrently. The mpirun command is a shell script that is responsible for starting MPI jobs on the necessary processors. This script insulates the user from the details of starting concurrent processes on different machines. Here it will launch 16 copies of the program mpi_example1. Although the MPI-2 standard does specify spawn routines that can be used to dynamically add programs to an executing MPI application, this technique is not encouraged. In general, all of the processes an MPI application needs are created at startup. This means that the code is replicated N times during startup. This scheme easily supports the SPMD (SIMD) model of concurrency because the same program is launched simultaneously on multiple processors. The data that each program needs to work on can be determined after the programs are running. This technique of starting the same program on multiple processors also has implications if the MPMD model is desired. The work that an MPI program will do is divided among the processes launched at startup. Which process does what, and which process works on which data, is coded in the executable.

The computers that can be involved in the computation are listed by host name in the machines.arch file (machines.Linux in our case). The location of this file is implementation dependent. Depending on your installation, the computers listed in the file will communicate using either ssh or the UNIX/Linux 'r' commands.
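With MPICH 1.x, the machines file is typically just a list of host names, one per line; many installations also allow a host name to be followed by a colon and the number of processes that host should receive. As an illustration only (the host names here are placeholders and the exact format depends on your installation), a machines.Linux file might look like this:

 hostname1
 hostname2
 hostname3:2
 hostname4

Given a file like this, mpirun distributes the 16 copies of mpi_example1 across the listed hosts.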

9.1.1 Differentiating Tasks by Rank

During the startup of the processes involved in an MPI application, the MPI environment assigns each process a rank and a communication group. The rank is stored as an int and serves as a kind of process id for each MPI task. The communication group determines which processes can engage in point-to-point communication. Initially, all MPI processes are assigned to a default communication group. The members of a communication group can be changed after the application has started. After each process is started, one of the first things it should do is determine its rank. This is done with the MPI_Comm_rank() routine, which returns the rank of the calling process. The calling process specifies the communicator it is associated with in the first argument, and the rank is returned in the second argument. Example 9.1 shows how the MPI_Comm_rank() routine is used.

Example 9.1 Using the MPI_Comm_rank() routine.
 //...
 int Tag = 33;
 int WorldSize;
 int TaskRank;
 MPI_Status Status;

 MPI_Init(&argc,&argv);
 MPI_Comm_rank(MPI_COMM_WORLD,&TaskRank);
 MPI_Comm_size(MPI_COMM_WORLD,&WorldSize);
 //...

The MPI_COMM_WORLD communicator is the default communicator that all MPI tasks are assigned upon startup. MPI tasks are grouped by communicators; the communicator is what identifies a communication group. In Example 9.1 the rank is returned in the variable TaskRank. Each process will have a unique rank. Once the rank is determined, the appropriate data may be given to that task, or the appropriate code for that task to execute may be selected. For instance, consider Case 1 and Case 2:

Case 1: Simple MPMD

 if(TaskRank == 1){
    // do something
 }
 if(TaskRank == 2){
    // do something else
 }

Case 2: Simple SIMD

 if(TaskRank == 1){
    // use this data
 }
 if(TaskRank == 2){
    // use that data
 }

In Case 1 the rank is used to differentiate which process does which work, and in Case 2 the rank is used to differentiate which data each process works on. Although each MPI executable starts out with the same code, MPMD (MIMD) may be achieved by using the rank to branch to different code. Likewise, once the rank is determined, data types may be assigned to the data of a process, or the specific data that a given process needs to work with may be identified. The rank is also used in message passing. MPI tasks identify each other in a communication exchange by rank and communicator. The MPI_Send() and MPI_Recv() routines use ranks for the destination and source, respectively. The call:

 MPI_Send(Buffer,Count,MPI_LONG,TaskRank,Tag,Comm); 

will send Count longs to the MPI process with rank = TaskRank. Buffer is a pointer to the data to be sent to process TaskRank. Count represents the number of items in Buffer, not the size of Buffer in bytes. Each message has a tag. The tag can be used to differentiate one message from another, to group messages into classes, to associate certain messages with certain communicators, and so on. The tag is an int and its value is user defined. The Comm parameter represents the communicator that the process is assigned to or associated with. If the rank and communicator of a task are known, then messages may be sent to that task. The call:

 MPI_Recv(Buffer,Count,MPI_INT,TaskRank,Tag,Comm,&Status); 

will receive Count ints from a process with rank = TaskRank. This routine will cause the caller to block until it receives a message from a process with the matching TaskRank and the appropriate value for Tag. MPI does support wildcards for the rank and tag parameters: MPI_ANY_SOURCE and MPI_ANY_TAG. If these values are used, the calling process will accept the next message it receives, regardless of that message's source and tag. The Status parameter is of type MPI_Status. Information about the receive operation can be retrieved from the Status object. Three fields contained in Status are MPI_SOURCE, MPI_TAG, and MPI_ERROR. Therefore, the Status object can be used to determine the tag and source of the sending process.

Once the processes know how many processes are involved, they can determine who to send messages to and who to receive messages from. Naturally, which task receives messages and which task sends messages will depend on the application. How the work is divided up between the processes will also be application dependent. Another piece of information that each process determines immediately, before the work starts, is how many other processes are involved in the application. This is done by a call to:

 MPI_Comm_size(MPI_COMM_WORLD,&WorldSize); 

This routine determines the size of the group of processes associated with a particular communicator. In this case, the communicator is the default communicator, MPI_COMM_WORLD. The number of processes involved is returned in the WorldSize parameter, which is an int. Once each process has WorldSize, it knows how many processes are associated with its communicator and what its rank is relative to the other processes.
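As an illustrative sketch (not taken from the text), the following fragment in the style of Example 9.1 shows how the wildcards and the Status object work together. Here the task with rank 0 accepts WorldSize - 1 messages from any source with any tag (assuming each of the other tasks sends one message of at most 100 longs) and then uses Status to discover who actually sent each one:

 //...
 if(TaskRank == 0){
    long Buffer[100];
    for(int N = 1; N < WorldSize; N++){
       // Accept the next message regardless of its source or tag.
       MPI_Recv(Buffer,100,MPI_LONG,MPI_ANY_SOURCE,MPI_ANY_TAG,
                MPI_COMM_WORLD,&Status);
       // The Status object records the actual source and tag.
       int Sender = Status.MPI_SOURCE;
       int MsgTag = Status.MPI_TAG;
       //... process the message based on Sender or MsgTag
    }
 }
 //...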

9.1.2 Grouping Tasks by Communicators

In addition to ranks, processes are also associated with communicators. The communicator specifies a communication domain for a set of processes. All processes with the same communicator are in the same communication group. The work that an MPI program does can be divided between communicator groups. MPI_COMM_WORLD is the default communicator group that all processes belong to initially. MPI_Comm_create() can be used to create new communicators. Table 9-1 shows short descriptions of the routines used to work with communicators.
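For example, MPI_Comm_split() from Table 9-1 divides an existing communicator into new ones. The fragment below is a sketch of our own (the even/odd split and the variable names are illustrative, not from the text) showing how MPI_COMM_WORLD might be carved into two smaller communication groups:

 //...
 MPI_Comm WorkerComm;
 int WorldRank;
 int WorkerRank;

 MPI_Comm_rank(MPI_COMM_WORLD,&WorldRank);

 // Tasks with even world ranks form one communicator, tasks with odd
 // world ranks form another. The key argument (WorldRank) preserves
 // the relative ordering of the ranks within each new communicator.
 MPI_Comm_split(MPI_COMM_WORLD,WorldRank % 2,WorldRank,&WorkerComm);

 // Each task now also has a rank within its new, smaller communicator.
 MPI_Comm_rank(WorkerComm,&WorkerRank);
 //...
 MPI_Comm_free(&WorkerComm);
 //...

Messages sent with WorkerComm as the communicator stay within that group, which is one way the work of an MPI program can be divided between communicator groups.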

Through the use of the rank and the communicator, MPI tasks are identified and differentiated. The rank and the communicator allow us to structure a program as SPMD, MPMD, or some combination of the two. We use the rank and the communicator in conjunction with parameterized programming and object-oriented techniques to simplify the code written for an MPI program. Templates can accommodate not only the different-data aspect of SIMD; different data types may also be specified through template parameters. This greatly simplifies the many computationally intensive applications that do the same work but on different data types. We recommend runtime polymorphism (supported by objects), parametric polymorphism (supported by templates), function objects, and predicates to achieve MPMD (MIMD). These techniques are used in conjunction with the rank and the communicator of an MPI process to accomplish the division and assignment of work in an MPI application. When using an object-oriented approach, the work of a program is divided between families of objects, and the families of objects are each associated with different communicators. Associating families of objects with different communicators helps with modularity in the design of an MPI application. This kind of division also helps with understanding how the parallelism can be applied. We have found that the object-oriented approach makes MPI programs more extensible, maintainable, and easier to debug and test.
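As a small sketch of the idea (our own illustration of the technique, not code from the text), a function object written as a class template does the same work for whatever data type it is instantiated with, and the rank selects which instantiation, and therefore which data type, a given task uses:

 //...
 template <class T> class Multiplier{
 public:
    // The same work, parameterized on the data type it operates on.
    T operator()(T X,T Y) const { return(X * Y); }
 };
 //...
 MPI_Comm_rank(MPI_COMM_WORLD,&TaskRank);
 if(TaskRank == 1){
    Multiplier<long> Work;       // task 1 works with longs
    long Result = Work(7L,6L);
    //...
 }
 if(TaskRank == 2){
    Multiplier<double> Work;     // task 2 works with doubles
    double Result = Work(3.5,2.0);
    //...
 }
 //...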

9.1.3 The Anatomy of an MPI Task

Figure 9-1 contains a skeleton MPI program. The tasks involved in this MPI program simply report their ranks to the MPI task whose rank == 0.

Figure 9-1. An MPI program.

[graphics/09fig01.gif]

Every MPI program should have at least MPI_Init() and MPI_Finalize(). The MPI_Init() routine initializes the MPI environment for the calling task. The MPI_Finalize() routine releases the resources that the MPI environment allocated for the task.
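Since the figure itself is not reproduced here, the following is a minimal sketch of the kind of skeleton the figure describes: every task with a nonzero rank reports its rank to the task whose rank == 0. The exact code in Figure 9-1 may differ; this version uses only the routines already discussed in this section.

 #include "mpi.h"
 #include <iostream>

 int main(int argc,char *argv[])
 {
    int Tag = 2;
    int WorldSize;
    int TaskRank;
    MPI_Status Status;

    // Initialize the MPI environment for this task.
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&TaskRank);
    MPI_Comm_size(MPI_COMM_WORLD,&WorldSize);

    if(TaskRank != 0){
       // Every task except task 0 reports its rank to task 0.
       MPI_Send(&TaskRank,1,MPI_INT,0,Tag,MPI_COMM_WORLD);
    }
    else{
       // Task 0 collects one report from each of the other tasks.
       for(int Source = 1;Source < WorldSize;Source++){
          int ReportedRank;
          MPI_Recv(&ReportedRank,1,MPI_INT,Source,Tag,
                   MPI_COMM_WORLD,&Status);
          std::cout << "task " << ReportedRank
                    << " has reported in" << std::endl;
       }
    }

    // Release the resources that MPI allocated for this task.
    MPI_Finalize();
    return(0);
 }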

Table 9-1. Routines Used to Work with Communicators (#include "mpi.h")

 int MPI_Intercomm_create(MPI_Comm LocalComm, int LocalLeader,
                          MPI_Comm PeerComm, int RemoteLeader,
                          int MessageTag, MPI_Comm *CommOut);
    Creates an intercommunicator from two intracommunicators.

 int MPI_Intercomm_merge(MPI_Comm Comm, int High, MPI_Comm *CommOut);
    Creates an intracommunicator from an intercommunicator.

 int MPI_Cartdim_get(MPI_Comm Comm, int *NDims);
    Returns the number of dimensions of the Cartesian topology associated with a communicator.

 int MPI_Cart_create(MPI_Comm CommOld, int NDims, int *Dims,
                     int *Periods, int Reorder, MPI_Comm *CommCart);
    Creates a new communicator to which Cartesian topology information has been attached.

 int MPI_Cart_sub(MPI_Comm Comm, int *RemainDims, MPI_Comm *CommNew);
    Divides a communicator into subgroups that form lower-dimensional Cartesian subgrids.

 int MPI_Cart_shift(MPI_Comm Comm, int Direction, int Displacement,
                    int *Source, int *Destination);
    Retrieves the shifted source and destination ranks, given a shift direction and amount.

 int MPI_Cart_map(MPI_Comm CommOld, int NDims, int *Dims,
                  int *Periods, int *NewRank);
    Maps processes onto Cartesian topology information.

 int MPI_Cart_get(MPI_Comm Comm, int MaxDims, int *Dims,
                  int *Periods, int *Coords);
    Returns the Cartesian topology information associated with a communicator.

 int MPI_Cart_coords(MPI_Comm Comm, int Rank, int MaxDims, int *Coords);
    Calculates the coordinates of a process in the Cartesian topology, given its rank in the group.

 int MPI_Comm_create(MPI_Comm Comm, MPI_Group Group, MPI_Comm *CommOut);
    Creates a new communicator.

 int MPI_Comm_rank(MPI_Comm Comm, int *Rank);
    Calculates and returns the rank of the calling process in the communicator.

 int MPI_Cart_rank(MPI_Comm Comm, int *Coords, int *Rank);
    Calculates and returns the rank of a process in a communicator, given its Cartesian location.

 int MPI_Comm_compare(MPI_Comm Comm1, MPI_Comm Comm2, int *Result);
    Compares the two communicators Comm1 and Comm2.

 int MPI_Comm_dup(MPI_Comm CommIn, MPI_Comm *CommOut);
    Duplicates an existing communicator along with all of its cached information.

 int MPI_Comm_free(MPI_Comm *Comm);
    Marks the communicator object for deallocation.

 int MPI_Comm_group(MPI_Comm Comm, MPI_Group *Group);
    Accesses the group associated with the given communicator.

 int MPI_Comm_size(MPI_Comm Comm, int *Size);
    Calculates and returns the size of the group associated with a communicator.

 int MPI_Comm_split(MPI_Comm Comm, int Color, int Key, MPI_Comm *CommOut);
    Creates new communicators based on colors and keys.

 int MPI_Comm_test_inter(MPI_Comm Comm, int *Flag);
    Determines whether a communicator is an intercommunicator.

 int MPI_Comm_remote_group(MPI_Comm Comm, MPI_Group *Group);
    Accesses the remote group associated with the given intercommunicator.

 int MPI_Comm_remote_size(MPI_Comm Comm, int *Size);
    Calculates and returns the size of the remote group associated with an intercommunicator.

Every MPI task should call the MPI_Finalize() routine prior to exiting. Notice the calls to MPI_Comm_rank() and MPI_Comm_size() in Figure 9-1. They are used to get the rank of the calling task and the number of processes that belong to the MPI application. Most MPI applications will call these functions. The remaining MPI functions that are used will depend on the application. The MPI environment supports over 300 functions. Consult your man pages for a complete listing and discussion of the MPI functions.


