7.5. SUMMARY | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

7.4. UNDERLYING MODELS AND THE MAPPING ALGORITHM

Thus in mpC the programmer can specify a performance model of the implemented parallel algorithm, including

the number of parallel processes executing the algorithm,
the absolute volume of computations performed by each process,
the absolute volume of data transferred between each pair of processes, and
the scenario of interaction between the parallel processes during the algorithm execution.

This model is used to map the algorithm to the physical processors of the executing network of computers. As we saw earlier, the mapping is per-formed by a component of the programming system called a dispatcher (see Section 6.11). Apart from the performance model of the algorithm, the dispatcher also uses a model of the executing network of computers to perform the mapping.

This section introduces a model of a heterogeneous network of computers that matches the advanced performance model of the parallel algorithm and takes into account the communication layer of network. It also introduces an algorithm that maps the abstract processors of an mpC network to processes of the parallel program running on a heterogeneous network of computers.

7.4.1. Model of a Heterogeneous Network of Computers

The basic model of a heterogeneous network of computers does not allow for communication links between computers, and treats the network as a set of heterogeneous multiprocessors. Each computer is characterized by two attributes:

The time of execution of a (serial) test code on the computer;
The number of physical processors.

The first attribute is a function of time, s(t), and it can vary even during the execution of the same mpC application (if the application uses the recon statement). Relative speeds of computers are normalized so that the computer running the host-process is always of unit speed (s = 1). The second attribute is a constant, n, and it determines how many noninteracting processes can run in parallel on the computer without a loss of speed.

There is a new advanced model, however, that is more sophisticated and allows for the material nature of communication links and their heterogeneity. This model considers a heterogeneous network to be a multilevel hierarchy of interconnected sets of heterogeneous multiprocessors. The hierarchy reflects the heterogeneity of the communication links, and it may be represented in the form of an attributed tree.

Each node of the tree represents a homogeneous communication space of the heterogeneous network. The first attribute associated with an internal node is a set of computers that communicate over the space, and this set is just a union of all computers which are descendents of the node.

The second attribute is the speed of data transfer between two computers from different sets associated with its children. This attribute characterizes the point-to-point communication at this communication layer and is a function of size of the transferred data block, q(d). Note that q(0) may be nonzero and equal to the start-up time of point-to-point communication at this layer.

The third attribute indicates whether the communication layer allows parallel point-to-point communications between different pairs of computers without a loss of data transfer speed, or whether the layer serializes all of the communications. This attribute can have two values—serial and parallel. A pure Ethernet network is serial. However, the use of switches can make it parallel.

The next group of attributes is only applicable to a parallel communication layer. It characterizes collective communication operations such as broadcast, scatter, gather, and reduction. The point is that in general, a collective communication operation cannot be considered as a set of independent point-to-point communications. It normally has some special process, called a root, that is involved in more communications than other participating processes.

The level of parallelism of each collective communication operation depends on its implementation, and this is reflected in the model by a corresponding attribute. For example, the attribute f_b characterizes the level of parallelism of a broadcast. It is presumed that the execution time t of this operation can be calculated as follows:

where t_s is the time of a purely serial execution of the operation, and t_p is the time of an ideally parallel execution of this operation (0 f_b 1).

Each leaf node of this tree represents a single (homogeneous) multiprocessor computer. In addition to the attributes inherited from the basic model, each node is also characterized by the attributes of the communication layer provided by the computer.

Figure 7.9 depicts the model for a local network of five computers, labeled A, B, C, D, and E. Computer A is a distributed memory eight-processor computer, D is a shared memory two-processor server. Computers B, C, and E are uniprocessor workstations. The local network consists of two segments with A, B, and C belonging to the first segment. Computers D and E belong to the second segment.

click to expand
Figure 7.9: Hierarchical model of a heterogeneous network of five computers.

The speed of transfer of a data block of k bytes from a process running on computer C to a process running on computer D is estimated by q₀(k), while the speed of transfer of the same data block from a process running on computer C to the one running on computer A is estimated by q₁(k). The level of parallelism of a broadcast operation involving processes running on computers B, C, and E is f_b⁽⁰⁾, while that of a broadcast involving processes running on computer A is f_b^(A).

This communication model is simple but rough. It is used at runtime by the mpC programming system to predict the execution time of the implemented parallel algorithm. It uses a small number of integral attributes presenting some average characteristics rather then a detailed and fine-structured description.

The main reason of this simplicity is that the target architecture for the mpC language is the common network of computers, which normally consists of a multiple-user environment whose structure is irregular and not very stable. Therefore fine-grained communication effects can hardly be reliably predicted for that architecture.

Second, the mpC language is aimed at programming applications in which computations prevail over communications; that is, the contribution of computations in the total execution time is more sunstantial than that of communications. Where that is not the case, it normally means that the main goal of the application is not to speed up the solution of some individual problem, and the distribution of its components over different computers is its intrinsic feature; in other words, the application is actually not parallel but distributed.

Thus the mpC language needs an efficient communication model of common heterogeneous networks of computers suitable for prediction of the execution time of data transfer operations involving the transfer of relatively big volumes of data. The accuracy of the prediction does not need to be particularly high because the contribution of communications in the total execution time is supposed to be relatively small. Actually the accuracy cannot be high because of the nature of the modeled hardware.

This communication model is designed to satisfy the primary necessities of the mpC language. Its main disadvantage is that it is static. If an efficient way were found to update its parameters at runtime to reflect the current situation, that could improve its accuracy.

The model of parallel communication layer and collective communication operations also has room for improvement. Multiple experiments with different network configurations could make the model more useful for a wide range of common networks.

7.4.2. The Mapping Algorithm

Any mpC program running on a network of computers is nothing more than a number of processes interacting via message passing. The total number of processes, and the number of processes running on each computer of the network, are determined by the user during the start-up of the program. This information is available to the dipatcher responsible for mapping the abstract mpC networks to these processes.

Each definition of an mpC network creates a group of processes that will act as abstract processors of the network. The main criterion in the selection of the processes for this group is minimization of the execution time of the parallel algorithm whose performance model is described by this mpC network, on a particular network of computers.

Thus at runtime the dispatcher solves the problem of optimally mapping the abstract processors of the mpC network to a set of processes running on different computers of the heterogeneous network. In solving the problem, the dispatcher considers the following:

The performance model of the parallel algorithm to be executed.
The model of the executing network of computers, which normally reflects the state of this network right before the execution of the algorithm.
A map of processes of the parallel program. For each computer the map displays both the total number of running processes and the number of free processes, meaning those processes available to act as the abstract processor of the mpC network.

Each mapping, m:I C, where I is a set of coordinates of the abstract processors of the mpC network and C is a set of computers of the executing network, is characterized by the estimation of the execution time of the algorithm on the network of computers. The estimation is calculated based on the performance models of the parallel algorithm and the executed network.

The execution time of each computation unit in the scheme declaration of the form e%%[i] is calculated as follows:

 timeof(e%%[i]) = (e/100)*v_i*b_i->j(t₀),

where v_i is the total volume of computations to be performed by the abstract processor with coordinate i, and b_i->j(t₀) is the execution time of the test code on the computer m(i) obtained as a result of execution of the corresponding recon statement (t₀ denotes time when this execution took place). The execution time of each communication unit of the form e%%[i]->[j] is calculated as follows:

 timeof(e%%[i] = (e/100)*v_i->j*b_i->j(t₀),

where w_i->j is the total volume of data to be transferred from the abstract processor with the coordinates i to the abstract processor with the coordinates j, and s(w_i->j) is the speed of transfer of data block of w_i->j bytes between computers m(i) and m(j).

A simple rule of calculation of the execution time is associated with each sequential algorithmic pattern in the scheme declaration. For example, the execution time T of the pattern

 for(e1;e2;e3)a

is calculated as

 for(T=0, e1; e2; e3)    T += timeof(a);

The execution time T of the pattern

 if(e) a1 else a2

is calculated as

 if(e)    T = timeof(a1); else    T = timeof(a2);

These rules just reflect semantics of the corresponding serial algorithmic patterns. The rule for calculation of the execution time of the parallel algorithmic pattern

 par(e1;e2;e3)a

is more complicated. Informally, this pattern describes parallel execution of some actions (mixtures of computations and communications) on the corresponding abstract mpC network.

Let A = {a₀, a₁,..., a_N1} be a set of the actions ordered in accordance with the estimation of their execution time, namely

 timeof(a₀)>=timeof(a₁)>=...>=timeof(a_N-1).

Let B be a subset of A consisting of all actions that only perform communications, B = {b₀, b₁,..., b_Q1}. Let C = {c₀, c₁,..., c_M1}. Finally, let m_i be the number of abstract processors of the mpC network mapped on computer c_i, and let n_i be the total number of physical processors of this computer. Then the rule for calculation of the execution time T of this pattern is as follows:

 for(j=0, T=0; j<M; j++) {    for(i=0, T₀=0, k=0; k<Upper(m_j, n_j) && i<N; i++) {       if(a_i performs some computations on c_j) {          T₀ += timeof(a_i);          k++;       }    }    T = max(T, T₀); } T = max(T, timeof(B));

Here the function Upper is defined as

 Upper(x, y) = if(x/y<=1)                   then 1                   else if ((x/y)*y == x)                      then x/y                      else x/y+1

Informally, the preceding system of loops first computes for each computer the estimation time T₀ of its parallel execution activity for some computations. The estimate is obtained by the assumption that if the number of parallel actions on one computer exceeds the number of its physical processors, then

the actions are distributed evenly over the physical processors—that is, the number of actions executed by different physical processors differs by at most one; and
the most computationally intensive actions are executed on the same physical processor.

The parallel actions that do not involve computation (i.e., perform pure communications) are separated into set B. Let l(B) be the lowest communication layer covering all communication links in B, and let f_b, f_g be the parallel levels of broadcast and gather correspondingly for this layer. Then the rule for calculation of time T of the parallel execution of the communication operations from set B is as follows:

 if(l(B) is serial)    for(i=0, T=0; i<Q; i++)    T += timeof(b_i); else if(B matches broadcast/scatter) {    for(i=0, T_serial=0, T_parallel=0; i<Q; i++) {       T_serial += timeof(b_i);       T_parallel = max(T₂, timeof(b_i));    }    T = f_b*T_parallel+(1-f_b)*T_serial } else if(B matches gather) {    for(i=0, T_serial=0, T_parallel=0; i<Q; i++) {       T_serial += timeof(b_i);       T_parallel = max(T₂, timeof(b_i));    }    T = f_g*T_parallel+(1-f_g)*T_serial } else    for(i=0, T=0; i<Q; i++)       T += max(T, timeof(b_i));

This rule just sums the execution time of the parallel communication operations if the underlying communication layer serializes all data packages. Otherwise, we have a parallel communication layer. Thus, if set B of the communication operations indicates that it is about to broadcast or scatter (i.e., one abstract processor sends data to other involved abstract processors), then the time of parallel execution of the communication operations is calculated as if the broadcast were performed. Similarly, if B indicates that it is about to gather (i.e., one abstract processor receives data from other involved abstract processors), then the time of parallel execution of the communication operations is calculated as if the gather was performed.

In all other instances, it is assumed that B is a set of independent point-to-point communications. Note that it is the responsibility of the programmer not to specify different communication operations sharing the same communication link as the parallel ones.

The rule for estimating the execution time of the parallel algorithmic pattern determes the accuracy and efficiency of the entire mapping algorithm. The rule takes into account the material nature and heterogeneity of the processors and the network equipment.

The rule depends on fair allocation of the processes to the physical processors in the shared memory multiprocessors that are normally implemented by the operating systems for processes of the same priority (e.g., mpC processes). However, the rule takes a pessimistic point of view where the workloads of different processors of that multiprocessor must be estimated.

The communication cost is sensitive to the scalability of the underlying communication network technology. The rule treats differently the communication layers serializing data packages from those supporting their parallel transfer. The most typical and widely used collective communication operations also must be treated individually to provide more accurate estimates of their execution time.

An important feature of the rule nevertheless is its relative simplicity and efficiency, and its efficiency is a crucial consideration because the algorithm is intended to be multiply executed at runtime. A disadvantage of the rule is just the reverse of its simplicity and its efficiency. Except for some common collective communication operations, the rule is not sensitive to different collective communication patterns such as ring data shifting, and tree reduction; rather it treats them all as independent point-to-point communications. The problem is that recognition of different patterns is very costly. One way around this maybe is to introduce into the mpC language explicit constructs that specify all possible important collective communication patterns.

Another disadvantage of the rule in regard to estimation accuracy is that any parallel communications are treated as if they occur within the same communication layer in a hierarchy, namely at the lowest communication layer that encompasses all of the involved processors. In reality some communications will use different communication layers. The task of incorporate multi-layer parallel communications in this algorithm without significant loss of efficiency poses a very difficult research problem.

Ideally, the mpC dispatcher should find the mapping that has been proved to provide the fastest execution of the parallel algorithm. In reality, for accurate mapping as many as M^K possible mappings have to be examined in order to find the best one (here K is the power of the set I of coordinates of abstract processors). Obviously such computational complexity is not acceptable for a practical algorithm that must be performed at runtime. Therefore the dispatcher must search for an approximate solution that can be found in some reasonable time frame, namely after examination of MxK possible mappings instead of M^K.

The underlying algorithm can be summarized as follows: At a preliminary step, set I is re-ordered in accordance with the volume of computations to be performed by the abstract processors so that the most loaded abstract processor will come first. Let P={p_k}(k=0,...,K-1) be this well-ordered set.

Let Q_j be a subnetwork of the mpC network formed by the set P_j={p_i}(i=0,...,j) of abstract processors. By definition, a subnetwork is a result of projection of the mpC network onto some subset of its abstract processors. Essentially the subnetwork is equivalent to its supernetwork, which is modified as follows:

Zero volume computations are set for each abstract processor not included in the subnetwork.
A zero volume of communications is set for each pair of abstract processors where at least one is not included in the subnetwork.

Finally, we let c_j denote the jth computer from set C. Then the main loop of the algorithm can be described by the following pseudocode:

 for(k=0; k<K; k++) {    for(j=0, t_best=MAXTIME, c_best=c₀; j<M; j++) {       if(p_k is not a parent of the mpC network) {          Map p_k to c_j          Estimate execution time t for this mapping of Q_k to C          if(t<t_best) {             t_best=t;             c_best=c_j;          }          Unmap p_k       }    }    Map p_k to c_best }

The algorithm above reflects the fact that the mpC language is designed to focus on applications with computations prevailing over communications. Therefore abstract processors rather than communication links drive the algorithm. Another argument for this approach is that the maximal number of abstract communication links is equal to the total number of abstract processors squared. Therefore, in general, an algorithm driven by communication links can be quite expensive to execute.

Informally, the algorithm first maps the most loaded abstract processor, ignoring the other abstract processors and communications. After the first abstract processor is mapped, it maps the second most loaded abstract processor only regarding the communications between these two processors, and so on. At the ith step, it maps the ith most loaded abstract processor only regarding data transfer between these i abstract processors. This algorithm exploits the principle that smaller things can be the more evenly distributed than, bigger things. It may be useful to recall here the analogy of balls and baskets in Section 6.11. In distributing balls of different sizes among the baskets of different sizes, it makes sense to start with the biggest ball and put it into the biggest basket, then the second biggest ball into the basket with the biggest free space left, and so on. This algorithm keeps the balance between the ball sizes and the free basket space, and guarantees that if at some step there is no more space left for the next ball, it simply means that there is no way to put all the balls in the baskets.

Similarly, if the algorithm cannot balance the load of physical processors in the case of a low communication cost, it simply means that there is no way to balance the load at all. This algorithm could work as well if the data transfer between the more loaded abstract processors is more significant than that between the less loaded processors. Then the more loaded abstract communication links must be accounted for at earlier stages of the algorithm.

An obvious situation where this mapping algorithm may not work well is when the least loaded abstract processor is involved in a transfer of a much bigger volume of data than more loaded processors, and the contribution of communications to the total execution time is significant. But common sense shows us that this is not what occurs in most parallel algorithms.