6.9. THE RECON STATEMENT: A LANGUAGE CONSTRUCT TO CONTROL THE ACCURACY OF THE UNDERLYING MODEL OF COMPUTER NETWORK | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

6.8. A SIMPLE HETEROGENEOUS ALGORITHM SOLVING AN IRREGULAR PROBLEM

In Section 6.2 we saw that the definition of an mpC network causes mapping of abstract processors of this network to processes of the parallel program, and that this mapping is constant during the lifetime of this network. But we did not discuss how the programming system performs this mapping, and how the programmer can control it.

It has been multiply emphasized in this book that the main goal of parallel computing is to accelerate the solution of a single problem on the available computer resources. Therefore it is quite natural that the mapping of abstract processors of the mpC network to parallel processes of the mpC program should be performed in such a way that minimizes the execution time of this program.

The mapping is performed by the mpC programming system based on two types of information. On the one hand, this is information about the network of computers that executes the program. Relative speeds of processors of this network are an example of that type of information.

On the other hand, this is information about the parallel algorithm, which is performed by the mpC network. Relative volumes of computations, which must be performed by abstract processors of the network, are an example of algorithmic information that can be provided by the mpC programmer.

We have not yet specified the volumes of computations in our programs. Therefore the mpC programming system assumed that all abstract processors of the network performed the same volume of computation.

Under this assumption, the programming system performed the mapping trying to keep the number of abstract processors mapped to each physical processor approximately proportional to its performance (naturally, taking into account the maximum number of abstract processors that could be hosted by this physical processor). Such a mapping ensures that the parallel processes, which represent abstract processors of the network, will execute computations approximately at the same speed. Therefore, if the volume of computation, which is performed by different abstract processors between points of synchronization or data transfer, is approximately the same, then the parallel program will be balanced; that is, its processes will not wait for each other at these communication points of the program.

This mapping appeared acceptable in all of our programs because different abstract processors of the network performed practically the same computation, which was, in addition, a very small volume. But if volumes of computations performed by different abstract processors vary substantionally, this mapping may lead to a very low speed of program execution.

Indeed, in this case the computations are executed by all involved processes at the same speed, and this implies that processes performing smaller volume computations will wait at communication points for processes executing computations of bigger volume. A mapping making the speed of each process proportional to the volume of computation that a process peforms would lead to a more balanced and faster parallel program.

The mpC language provides means for specification of relative volumes of computations performed by different abstract processors of the same mpC network. The mpC programming system uses this information to map abstract processors of the mpC network to processes of the parallel program in such a way that ensures each abstract processor to perform computations at the speed proportional to the volume of computation it performs.

The next program introduces these means:

#include <stdio.h> #include <stdlib.h> #include <math.h> #include <mpc.h> #define DELTA (0.5) typedef struct    {       double len;   /* length of rail */       double wid;   /* width of rail */       double hei;   /* height of rail */       double mass;  /* mass of rail */    } rail; nettype HeteroNet(int n, double v[n]) {    coord I=n;    node { I>=0: v[I]; };    parent [0]; }; double Density(double x, double y, double z) {    return 6.0*sqrt(exp(sin(sqrt(x*y*z)))); } int [*]main(int [host]argc, char **[host]argv) {    repl N=3;    if(argc>1)       N=[host]atoi(argv[1]);    if(N<=0)    {       [host]printf("Wrong input (N=%d)\n", [host]N);       MPC_Exit(-1);    }    else    {       static rail [host]steel_hedgehog[[host]N];       repl double volumes[N];       double [host]start;       int [host]i;       repl j; /* Initialize geometry of "hedgehog" on host-processor */ for(i=0; i<[host]N; i++) {    steel_hedgehog[i].len = 200.0*(i+1);    steel_hedgehog[i].wid = 5.0*(i+1);    steel_hedgehog[i].hei = 10.0*(i+1); } /* Set the time on host-processor */ start = [host]MPC_Wtime(); /* Compute the volume of each j-th rail on host-processor, */ /* broadcast it to all processes of the program, where */ /* store it in j-th element of local copy of array volumes */ for(j=0; j<N; j++)    volumes[j] = steel_hedgehog[j].len *          steel_hedgehog[j].wid *          steel_hedgehog[j].hei; {    net HeteroNet(N, volumes) mynet;    rail [mynet]myrail;    double [mynet]x, [mynet]y, [mynet]z;    /* Broadcast "hedgehog" from host processor to all */    /* processors of mynet, where store it in a local copy */    /* (projection) of distributed structure myrail */ myrail = steel_hedgehog[];    /* Each abstract processor of mynet compute */    /* the mass of its rail of "hedgehog" */    for(myrail.mass=0., x=0.; x<myrail.len; x+=DELTA)       for(y=0.; y<myrail.wid; y+=DELTA)       for(z=0.; z<myrail.hei; z+=DELTA)         myrail.mass += [mynet]Density(x,y,z); myrail.mass *= DELTA*DELTA*DELTA; [mynet]MPC_Printf("Rail #%d is %gcm x %gcm x%gcm "          "and weights %g kg\n", I coordof myrail.len,           myrail.wid, myrail.hei,                   myrail.mass/1000.0);    /* Compute the sum of all projections of distributed */    /* variable myrail.mass and output a local copy of */    /* the result from host-processor */    [host]printf("The steel hedgehog weights %g kg\n",          [host]((myrail.mass)[+])/1000.0); } [host]printf("\nIt took %.1f seconds to run the program.\n",          [host]MPC_Wtime()-start); }

This program defines network type HeteroNet, which is parameterized by two parameters. The integer scalar parameter n determines the number of abstract processors in a network of this type. Vector parameter v consists of n elements of type double and is used for specification of relative volumes of computations performed by different abstract processors of the network.

The definition of network type HeteroNet contains an unusual declaration,

 node { I>=0: v[I]; },

saying the following: for any I>=0 the relative volume of computation performed by the abstract processor with coordinate I is equal to v[I].

The mpC program above calculates the mass of a metallic construction welded from N heterogeneous rails. For parallel computation of the total mass of the metallic “hedgehog,” it defines network mynet consisting of N abstract processors, each calculating the mass of one of the rails.

The calculation is performed by numerical three-dimensional integration of the density function Density with a constant integration step. Obviously the volume of computation to calculate the mass of a rail is proportional to the volume of the rail. Therefore the replicated array volumes, whose ith element just contains the volume of the ith rail, is used as the second actual parameter of network type HeteroNet in the definition of network mynet. The program thus specifies that the volume of computations performed by the ith abstract processor of network mynet is proportional to the volume of the rail, whose mass is computed by this abstract processor.

Besides the calculations, the program outputs the wall time elapsed to execute the calculations. To do this, the program uses the library nodal function MPC_Wtime. This function returns the astronomical time in seconds elapsed from some moment in the past not specified but fixed for the process calling the function.

We will not go into philosophical speculations about the relativity of time passing on different processes. We just briefly note that despite its seeming simplicity, this is the wall time that has elapsed in solving the problem, starting from data input and until output of results and measured on the host-process. This time is the most objective and interesting for the end-user temporal metrics of the program. Minimization of this characteristic is actually the main goal of parallel computing.

The next mpC program is equivalent to the previous one except that it does not specify explicitly the relative volumes of computations performed by different abstract processors of network mynet:

#include <stdio.h> #include <stdlib.h> #include <math.h> #include <mpc.h> #define DELTA (0.5) typedef struct    {       double len;       double wid;       double hei;       double mass;    }  rail; double Density(double x, double y, double z) {    return 6.0*sqrt(exp(sin(sqrt(x*y*z)))); } int [*]main(int [host]argc, char **[host]argv)

{    repl N=3;    if(argc>1)       N=[host]atoi(argv[1]);    if(N<=0)    {       [host]printf("Wrong input (N=%d)\n", [host]N);       MPC_Exit(-1);    }    else    { static rail [host]steel_hedgehog[[host]N]; repl double volumes[N]; double [host]start; int [host]i; repl j; for(i=0; i<[host]N; i++) {    steel_hedgehog[i].len = 200.0*(i+1);    steel_hedgehog[i].wid = 5.0*(i+1);    steel_hedgehog[i].hei = 10.0*(i+1); } start = [host]MPC_Wtime(); for(j=0; j<N; j++)    volumes[j] = steel_hedgehog[j].len *             steel_hedgehog[j].wid *             steel_hedgehog[j].hei; {    net SimpleNet(N) mynet;    rail [mynet]myrail;    double [mynet]x, [mynet]y, [mynet]z;    myrail = steel_hedgehog[];    for(myrail.mass=0., x=0.; x<myrail.len; x+=DELTA)       for(y=0.; y<myrail.wid; y+=DELTA)         for(z=0.; z<myrail.hei; z+=DELTA)           myrail.mass += [mynet]Density(x,y,z);    myrail.mass *= DELTA*DELTA*DELTA;    [mynet]MPC_Printf("Rail #%d is %gcm x %gcm x%gcm "             "and weights %g kg\n", I coordof mynet,              myrail.len, myrail.wid,             myrail.hei, myrail.mass/1000.0);    [host]printf("The steel hedgehog weights %g kg\n",             [host]((myrail.mass)[+])/1000.0);    }    [host]printf("\nIt took %.1f seconds to run the program.\n",             [host]MPC_Wtime()-start); }

Therefore, while mapping the abstract processors of mynet to the processes of the program, the mpC programming system regards them as performing equal volumes of computations. In this case it leads to suboptimal mapping and hence to a longer time period of solving the problem as compared to the previous program.

The deceleration is visible if the program runs on a heterogeneous network that includes processors significantly differing in performance. In that case the mass of the biggest rail will likely be calculated on the slowest processor, resulting in a multi-fold deceleration as compared to the execution of the same calculations by the previous program, where we ensured that the fastest processor would compute the mass of the biggest rail.