9.4. SIMULATION OF OIL EXTRACTION | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

9.3. NUMERICAL INTEGRATION

In this section we extend the problem of numerical approximation to definite integrals on heterogeneous networks of computers. First, we briefly introduce numerical integration, following the excellent book Computer Methods for Mathematical Computations by Forsythe, Malcolm, and Moler (1977). Then we give a parallel algorithm of numerical integration on heterogeneous networks as well as its mpC implementation.

To avoid confusion with numerical integration of ordinary differential equations, the term quadrature is commonly used for numerical approximation of definite integrals. Let [a,b] be a finite interval on the x axis that is partitioned into n subintervals called panels, [x_i, x_i1], i = 0,. . ., n - 1. Assume x₀ = a, x_n = b, and x₀ < x₁ <...< x_n. Let h_i = x_i1 - x_i be the panel widths.

Let f(x) be a function defined on [a,b]. Suppose that an approximation is desired to the definite integral

numerical integration

Clearly, I(f) can be expressed as a sum of integrals over the panels

where

A quadrature rule is a simple formula for approximating the individual I_i. A composite quadrature rule is a formula giving an approximation to I(f) as a sum of the quadrature rule approximations to the individual I_i. Two of the simplest quadrature rules are the rectangle rule and the trapezoid rule.

9.3.1. Basic Quadrature Rules

The rectangle rule uses function values in the midpoints of the panels

It approximates each I_i by the area of the rectangle with base h_i and height f(y_i). This gives

and hence we have the composite rectangle rule:

The trapezoid rule uses function values at the end points of the panels. It approximates I_i by the area of the trapezoid with base h_i and height that varies linearly from f(x_i) on the left to f(x_i1) on the right. This leads to

and hence we have the composite trapezoid rule:

It is easily shown that if f(x) is continuous (or even merely Riemann integrable) on [a,b] and h = max_i h_i, then

and

That is, both rules converge as the lengths of the subintervals decrease.

The rectangle rule is based on piecewise constant interpolation, while the trapezoid rule is based on piecewise linear interpolation. One might therefore expect the trapezoid rule to be more accurate that the rectangle rule. But this natural expectation does not always prove true.

Indeed, assume that f has five continuous derivatives and that the values of these derivatives are not too large. Consider a panel [x_i, x_i1]. The Taylor series expansion of f(x) about y_i, the center of this panel, is

click to expand

These assumptions about f imply that the remainder term denoted by ellipses (...) is less significant than the terms explicitly shown.

To integrate this series over [x_i, x_i1], observe that

Note that the odd powers integrate to zero. Consequently

numerical integration

Returning to the Taylor series and substituting x = x_i and x = x_i1, we obtain

Combining this with the expansion for the integral gives

This shows that when h_i is small, the error for the trapezoid rule on one panel is

plus some less significant terms.

The total error for either rule is the sum of the errors on the individual panels. Let

Then

If the h_i are sufficiently small, then h_i⁵ << h_i³; so if f^IV is not too badly behaved, then F << E.

The first conclusion from these results is that for many functions f(x), the rectangle rule is about twice as accurate as the trapezoid rule. The second conclusion is that the difference between the values obtained by the rectangle and trapezoid rules can be used to estimate the error in either one of them. The third conclusion is that, for functions f(x) having continuous, bounded second derivatives, doubling the number of panels by bisecting them in either the rectangle or trapezoid rule can be expected to roughly quadruple the accuracy. The difference in the results of either of these rules before and after doubling the number of panels can be used to estimate the error or improve the computed result.

The technique of repeatedly doubling the number of panels and estimating the error can be programmed to produce a method that automatically determines the panels so that the approximate integral is computed to within some prescribed error tolerance. Combining the rectangle and trapezoid rules in the proper way can produce a new rule, for which the error formula no longer contains E. Indeed, the error in the combined quadrature rule

is given by the formula

click to expand

This rule is called the composite Simpson’s rule. Notice that even though S(f) is based on second-degree interpolation, the error term involves the fourth derivative, and hence Simpson’s rule is exact for cubic functions. In other words, like the rectangle rule, Simpson’s rule obtains an “extra” order of accuracy.

If the length of each panel is halved, then each h_i⁵ term in the error formula is decreased by a factor of 1/32. The total number of terms is increased by a factor of 2, so the overall error is decreased by a factor close to 1/16. This fact is important for programs that automatically choose the number and size of the panels.

9.3.2. Adaptive Quadrature Routines

An adaptive quadrature routine is a numerical quadrature algorithm that uses one or two basic quadrature rules and automatically determines the subinterval sizes so that the computed result meets some prescribed accuracy requirement. Different mesh sizes may be used in different parts of the interval. Relatively large meshes are used where the integrand is smooth, and slowly varying and relatively small meshes are used in regions where the integration becomes difficult. This way an attempt is made to provide a result with the prescribed accuracy at as small a cost in computer time as possible.

An adaptive quadrature routine normally requires that the user specify a finite interval [a,b], provide a subroutine that computes f(x) for any x in the interval, and choose an error tolerance e. The routine attempts to compute a quantity Q so that

The routine may decide that the prescribed accuracy is not attainable, do the best it can, and return an estimate of the accuracy actually achieved.

In assessing the efficiency of any quadrature routine, it is generally assumed that the major portion of the computation cost lies in evaluation of the integrand f(x). So, if two subroutines are given the same function to integrate and both produce answers of about the same accuracy, the subroutine that requires the smaller number of function evaluations is regarded as the more efficient in that particular problem.

It is always possible to invent integrands f(x) that “fool” the routine into producing a completely wrong result. However, for good adaptive routines the class of such examples has been made as small as possible without unduly complicating the logic of the routine or making it inefficient in handling reasonable problems.

During the computation the interval [a, b] is broken down into subintervals [x_i, x_i1]. Each subinterval is obtained by bisecting a subinterval obtained earlier in the computation. The actual number of subintervals, as well as their locations and lengths, depends on the integrand f(x) and the accuracy requirement e. The numbering is arranged so that x₀ = a, and the routine determines an n so that x_n = b. We let h_i = x_i1 - x_i be the subinterval width.

A typical scheme applies two different quadrature rules to each subinterval. Let us denote the results by P_i and Q_i. Both P_i and Q_i are approximations to

The basic idea of adaptive quadrature is to compare the two approximations P_i and Q_i and thereby obtain an estimate of their accuracy. If the accuracy is acceptable, one of them is taken as the value of the integral over the subinterval. If the accuracy is not acceptable, the subinterval is divided into two parts and the process repeated on the smaller subintervals.

To reduce the total number of function evaluations, it is usually arranged that the two rules producing P_i and Q_i require integrand values at some of the same points. Consequently processing a new subinterval normally requires about the same number of new function evaluations.

For simplicity we assume that Q_i is obtained simply by applying the rule for P_i twice, once to each half of the interval. Let us also assume that the rule for P_i gives the exact answer if the integrand is a polynomial of degree p - 1, or equivalently, if the p-th derivative f^(p)(x) is identically zero. By expanding the integrand in a Taylor series about the midpoint of the subinterval, we can then show that there is a constant c so that

The exponent of h_i is p + 1 rather than p because the subinterval is of length h_i. From the assumption that Q_i is the sum of two P’s from two subintervals of length h_i/2, it follows that

Since

and the two errors are related by

This indicates that bisecting the subinterval decreases the error by a factor of about 2^p. Solving for the unknown I_i and rearranging terms, we obtain

In other words, the error in the more accurate approximation Q_i is about 1/2^p - 1 times the difference between the two approximations.

The basic task of a typical routine is to bisect each subinterval until the following inequality is satisfied:

where e is the user-supplied accuracy tolerance. If the entire interval [a, b] can be covered by n subintervals, on which this is valid, then the routine would return

Combining the above and ignoring the higher-order terms, we find that

click to expand

which is the desired goal.

This analysis requires the assumptions that f^(p)(x) be continuous and that the error be proportional to h_i^p1 f^(p)(x). Even when these conditions are not exactly satisfied, the routine may perform well, and the final result may be within the desired tolerance. However, the detailed behavior will be different.

For simplicity we have also assumed that the routine is using an absolute error criterion, that is,

In many situations the preferred error tolerance is the relative kind, which is independent of the scale factors in f. The pure relative criterion

is complicated by several factors. The denominator f might be zero. Even if the denominator is merely close to zero because the positive values of f at one part of the interval nearly cancel the negative values of f at another part, the criterion may be impossible to satisfy in practice. Furthermore a good approximation to the value of the denominator is not available until the end of the computation

Some routines use a criterion involving |f|,

The denominator cannot be zero unless f is identically zero and does not suffer the cancellation difficulties associated with oscillatory integrands. However, this route does involve more computation, although no more function evaluations, and does require a more elaborate explanation.

9.3.3. The quanc8 Adaptive Quadrature Routine

One of the most popular adaptive quadrature routines is a routine called quanc8. The user of the routine must supply a function fun for the integrand f(x); the lower and upper limits of integration, a and b; and the absolute and relative error tolerances, abserr and relerr. The routine returns the approximation of the integral; an estimate of the absolute error, errest; the number of function evaluations required, nofun; and a reliability indicator, flag. If flag is zero, then the approximation likely has the desired accuracy. If flag is large, then some unacceptable integrand has been encountered.

The name quanc8 is derived from quadrature, adaptive, Newton-Cotes’s 8-panel. The Newton-Cotes formulas are a family of quadrature rules that are obtained by integrating interpolating polynomials over equally spaced evaluation points. The rectangle, trapezoid, and Simpson’s rules are obtained by integrating zeroth-, first-, and second-degree polynomials, and they constitute the first three members of the family. The eighth-panel formula is obtained by integrating an eighth-degree polynomial, but just as by the rectangle and Simpson’s rules, an “extra” degree is acquired, so the formula gives the exact result for polynomials of degree 9.

For integration over the interval [0, 1] the rule is

Rational expressions for the weights w_k (multiplied by 8) are included in the routine itself. Since the rule is exact for f(x) ∫ 1, the weights satisfy

However, some of the weights are negative, and

This factor is important in error analysis. For example, suppose that function f(x) ∫ 1 is contaminated by some error factor so that f(x) = 0.99 at x = ^2/₈, ^4/₈, ^6/₈, and f(x) = 1.01 at all other nodes. The eight-panel rule would give 1.014512 as the value of the integral. The error in the function values is slightly magnified in the result. The magnification is inconsequential in this rule, but for the Newton-Cotes formulas involving higher polynomials, this sort of error is unacceptable.

A rule with eight panels has been chosen so that each panel width is the basic interval width, b - a, divided by a power of 2. Consequently there are usually no round-off errors in computing the nodes x_i on a computer.

The eight-panel rule is used in quanc8 to obtain the quantity P_i. The quantity Q_i is obtained by applying the same rule to the two halves of the interval, thereby using 16 panels. In the program, variables qprev and qnow are used for P_i and Q_i. Variables qleft and qright are used for the results on the two halves of the interval.

Since the basic rule is exact for ninth-degree polynomials, the error in Q_i will be about 1/(2¹⁰ - 1) = 1/1023 times the error in P_i whenever the integrand is smooth and higher-order terms can be ignored. Thus the results for a particular subinterval are acceptable if

In quanc8 this test is made by comparing two quantities

esterr = fabs(qnow-qprev)/1023.

and

tollerr = fmax(abserr, relerr*fabs(area))*(step/stone).

The ratio step/stone is equal to h_i/(b - a), and area is an estimate of the integral over the entire interval, so e is the maximum of abserr and relerr*| f|. The subinterval is acceptable if esterr<=tolerr.

When the subinterval is accepted, qnow is added to result, and esterr is added to errest. Then the quantity (qnow-qprev)/1023 is added to variable cor11 because Q_i + (Q_i - P_i)/1023 is usually a more accurate estimate of I_i than simply Q_i. The final value of corr11 is added to result just before its value is returned.

Each time a subinterval is bisected, the nodes and function values for the right half are saved for later use. Limit levmax=30 is set on the level of bisection. When this maximum is reached, the subinterval is accepted even if the estimated error is too large, but a count of the number of such subintervals is kept and returned as the integer part of flag. The length of each of these subintervals is quite small, (b - a)/2³⁰, so for most integrands a few of subintervals can be included in the final result without affecting accuracy much.

However, if the integrand has discontinuities or unbounded derivatives, or if it is contaminated by round-off error, the bisection level maximum will be frequently reached. Consequently another limit, nomax, is set on the number of function evaluations. When it appears that this limit may be exceeded, levmax is reduced to levout so that the remainder of the interval can be processed with fewer than nomax function evaluations. Furthermore the point x0 where the trouble is encountered is noted and the quantity (b-x0)/(b-a) is returned as the fractional part of flag. This is the portion of the interval that remains to be processed with the lower bisection limit.

As based on piecewise polynomial approximation, quanc8 is not designed to handle certain kinds of integrals. Roughly these are integrals of functions f(x) for which some derivative f^(k)(x) with k 10 is unbounded or fails to exist.

The C code of quanc8 is as follows:

#define fmax(a,b) ((a)>(b)?(a):(b)) // Quanc8 estimates the integral of fun(x) from a to b to a user // provided tolerance (a relative error tolerance, relerr, and/or // an absolute error tolerance, abserr). Quanc8 is an automatic // adaptive routine based on the 8-panel Newton-Cotes rule. // It returns an approximation to the integral hopefully // satisfying the least stringent of the two error tolerances. // An estimate of the magnitude of the actual error is returned // in variable pointed by errest. The number of function values // used in calculation of the result is returned in variable // pointed by nofun. A reliability indicator is returned in // variable pointed by flag. If it is zero, then the result // probably satisfies the error tolerance. If it is x.y, then x // is the number of intervals that have not converged and 0.y // is the fraction of the interval left to do when the limit on // nofun was approached. double quanc8(double (*fun)(double), double a, double b,    double abserr, double relerr, double *errest,    int *nofun, double *flag) {    static double w[] = {3956.0/14175.0, 23552.0/14175.0,    -3712.0/14175.0, 41984.0/14175.0,    -18160.0/14175.0};    double area=0., result=0., stone, step, cor11=0.0,     temp;    double qprev=0.0, qnow, qdiff, qleft, esterr, tollerr;    double qright[31], f[17], x[17], fsave[30][8], xsave[30][8];    int levmin=1, levmax=30, levout=6, nomax=5000, nofin,     lev=0,nim=1, i, j;    if(a==b)       return result;       // Initialization       nofin = nomax-8*(levmax-levout+(int)pow(2,        levout+1));       *flag=0;       *errest=0.0;       *nofun=0;       x[0] = a;       x[16] = b;       f[0] = (*fun)(x[0]);       stone = (b-a)/16.0;       x[8] = (x[0]+x[16])/2.0;       x[4] = (x[0]+x[8])/2.0;       x[12] = (x[8]+x[16])/2.0;       x[2] = (x[0]+x[4])/2.0;       x[6] = (x[4]+x[8])/2.0;       x[10] = (x[8]+x[12])/2.0;       x[14] = (x[12]+x[16])/2.0;       for(j=2; j<17; j+=2)          f[j] = (*fun)(x[j]);       *nofun = 9;       // Central calculation    m30:       for (j=1; j<16; j+=2)       {          x[j] = (x[j-1]+x[j+1])/2.0;          f[j] = (*fun)(x[j]);       }       *nofun += 8;       step = (x[16]-x[0])/16.0;       qleft = (w[0]*(f[0]+f[8])+             w[1]*(f[1]+f[7])+             w[2]*(f[2]+f[6])+             w[3]*(f[3]+f[5])+             w[4]*f[4]             )*step;       qright[lev] = (w[0]*(f[8]+f[16])+             w[1]*(f[9]+f[15])+             w[2]*(f[10]+f[14])+             w[3]*(f[11]+f[13])+             w[4]*f[12]             )*step;       qnow = qleft+qright[lev];       qdiff = qnow-qprev;       area = area+qdiff;       // Interval convergence test       esterr = fabs(qdiff)/1023.0;       tollerr = fmax(abserr, relerr*fabs(area))*(step/stone);       if(lev<levmin)          goto m50;       if(lev>=levmax)          goto m62;       if(*nofun>nofin)          goto m60;       if(esterr<=tollerr)          goto m70;       // Locate next interval and save right-hand elements for future       // use    m50:       nim *= 2;       for(i=0; i<8; i++)       {          fsave[lev][i] = f[i+9];          xsave[lev][i] = x[i+9];       }       lev++;       // Assemble left-hand elements for immediate use       qprev = qleft;       for(i=-1; i>-9; i--)       {          f[2*i+18] = f[i+9];          x[2*i+18] = x[i+9];       }       goto m30;       // Trouble: number of function values is about to exceed limit.    m60:       nofin *= 2;       levmax = levout;       *flag += (b-x[0])/(b-a);       goto m70;       // Trouble: current level is levmax.    m62:       *flag += 1.0;       // Interval converged: add contributions into running sums.    m70:       result += qnow;       *errest += esterr;       cor11 += qdiff/1023.0;       // Locate next interval.    m72:       if(nim==2*(nim/2))          goto m75;       nim/=2;       lev--;       goto m72;    m75:       nim++;       if(lev<=0)          goto m80;       // Assemble elements required for the next interval.       qprev = qright[lev-1];       x[0] = x[16];       f[0] = f[16];       for(i=1; i<9; i++)       {          f[2*i] = fsave[lev-1][i-1];          x[2*i] = xsave[lev-1][i-1];       }       goto m30;       // Finalize and return: make sure that errest not less than       // round-off level.    m80:       result+=cor11;       if(!*errest)          return result;    m82:       temp = fabs(result)+*errest;       if(temp!=fabs(result))          return result;          *errest*=2.0;          goto m82; }

9.3.4. Parallel Adaptive Quadrature Routine for Heterogeneous Clusters

The best adaptive quadrature routines are designed to provide an approximation to definite integrals with the prescribed accuracy in the fastest possible computer time. The large portion of the computation cost lies in evaluation of the integrand f(x). Therefore, to achieve the best efficiency on serial computers, the programmer should minimize the total number of function evaluations.

If the evaluation of integrand f(x) is prohibitively expensive, the speedup may be achieved via parallel computing of the integrand values. To do this, we let t_p(f) be the time of evaluation of f(x) at a single point on a processor p. We let t_pq be the time of transfer of a single floating-point value from the processor p to a processor q. Say we have a cluster of heterogeneous processors, and the time of evaluation of f(x) on a single processor exceeds the time of transfer of a single floating-point value between any pair of the processors doubled,

We can write an algorithm of parallel evaluation of integrand f(x) at n different points on two processors p and q. We let n points be stored in the memory of processor p, and resulting integrand values be stored in the memory of this processor. Let s_p(f) and s_q(f) be the relative speed of processors p and q respectively, demonstrated on the evaluation of integrand f(x):

At the first step of the algorithm, processor p sends n_q points to processor q.
At the second step, processor p evaluates f(x) at n_p points, while processor q evaluates f(x) at the n_q points received from processor p. Note that the number of function evaluations performed by each processor will be proportional to its relative speed, n_p/n_q = s_p(f)/s_q(f). From the assumption that the relative speed is demonstrated on the evaluation of the integrand f(x), it follows that s_p(f)/s_q(f) = t_q(f)/t_p(f). Therefore n_p x t_p(f) = n_q x t_q(f).
Finally processor p receives n_q integrand values from processor q and stores them in its memory.

The total time of serial evaluation of f(x) at n different points on the processor p will exceed the time of the parallel evaluation on a cluster of two processors, p and q:

click to expand

The result can be easily generalized to a heterogeneous cluster consisting of an arbitrary number of processors.

There are different ways of designing parallel adaptive quadrature routines for a heterogeneous cluster of p processors. One simple way is to break up the interval [a, b] into p subintervals, with each subinterval assigned to a separate processor. The desired integral over the entire interval [a, b] is obtained as a sum of integrals over these subintervals. Each processor computes an estimate of the integral over its subinterval following the same serial adaptive quadrature routine. To balance the load of processors, one would make the length of the kth interval proportional to the relative speed of the kth processor.

This algorithm is quite simple and straightforward to implement. However, it balances the load of processors only if the same mesh size is used in all parts of the interval [a, b]. If different mesh sizes are used in different parts of this interval, the number of function evaluations performed by each processor will be no longer proportional to its relative speed, resulting in unbalanced loads processor.

This algorithm will not be efficient if integrand f(x) is smooth and slowly varying in some regions of interval [a, b] and has high peaks in other regions. If the same interval spacing required near the peaks to obtain the desired accuracy, is used throughout the entire interval [a, b], then many more function evaluations will be performed than is necessary (although the load of the processors will be balanced). If larger meshes are used in regions where the integrand is smooth and slowly varying, and smaller meshes are used in regions where the integration becomes difficult, then the balance between the loads of different processors will be broken. In both situations the actual computing time will be far longer from the optimal one. This is when we turn to a parallel adaptive quadrature routine, pquanc8, that tries to keep balance between loads of processors while minimizing the total number of function evaluations required to provide a result with the prescribed accuracy.

As it might be clear from its name, the pquonc8 routine is a parallel version of the quanc8 adaptive quadrature routine. The algorithm of parallel numerical integration on a cluster of p heterogeneous processors can be summarized as follows:

The algorithm uses a master worker parallel programming paradigm. All computations, except for function evaluations, are performed serially by a host-processor.
The host-processor keeps a list of subintervals of the entire interval [a, b], whose contributions to the integral are still to be estimated. Each subinterval [l, r] is represented by a data structure that consists of
- an array x to contain a set of points x₀, x₁, . . ., x₁₆ so that x₀ = l, x₁₆ = r, x₀ = 1 x₁₆ = r. x₀< x₁, < . . . < x₁₆and x_i1 - x_i = (l - r)/16;
- an array f to contain a set of integrand values f(x) at these points; and
- a variable qprev to contain an estimate of the integral over the interval [l, r] obtained by using the eight-panel rule, that is, by using values f(x₀), f(x₂), f(x₄), f(x₆), f(x₁₀), f(x₁₂), f(x₁₄), f(x₁₆).
Initially the list consists of interval [a, b] and the host-processor computes the integrand values at even points x₀, x₂,..., x₁₆ associated with the interval.
At each iteration of the main loop of the algorithm, an estimate of the integral over each subinterval in the current list of subintervals is calculated. The estimate is obtained from the integrand values at all 17 points associated with the subinterval. If the subinterval is acceptable, its contribution is added to the result. If the subinterval is not acceptable, the interval is bisected into two new subintervals that are added to a list of subintervals to be processed at next iteration of the main loop. In other words, one step of routine quanc8 is applied to each subinterval in the current list. Note that at the beginning of the iteration all subintervals in the current list are of the same length.
In order to estimate the integral over each subinterval in the current list, the host-processor needs integrand f(x) to be evaluated at odd points x₁, x₃,..., x₁₅ of the subinterval. The estimation involves all processors of the cluster and is performed as follows:
- First the host-processor forms a set of all odd points associated with subintervals in the current list.
- Then it scatters the set across p processors of the cluster. The number of points received by each processor is proportional to the current estimation of its relative speed.
- Each processor estimates integrand f(x) at the points received form the host-processor. Besides making the estimation, the processor measures the time taken to do the estimation. Then it sends to the host-processor both the integrand values and the time taken to compute the values.
- The host-processor uses the computation time received from each processor to calculate their relative speeds demonstrated during this integrand evaluation. This fresh estimation of the relative speeds of processors will be used at the next iteration of the main loop. Note that this approach makes pquanc8 adaptive to possible changes in the speeds of the individual processors even during the execution of the program (i.e., due to some external applications).

The mpC implementation of routine pquanc8 takes the following form:

#include <stdio.h> #include <stdlib.h> #include <mpc.h> #include <math.h> #include "quanc8.h" #include "partition.h" typedef struct Interval {    double *x, *f;    double qprev;    struct Interval *next; } *IntervalList; #include "interval.h" double [*]pquanc8(double (*fun)(double), double     [host]a, double [host]b, double     [host]abserr, double [host]relerr,    double *[host]errest, int *[host]nofun, int     *[host]flag, double *[host]badpoint) {    double [host]area=0., [*]result=0., [host]absarea,                      [host]stone, [host]step, [host]minstep,              [host]cor11=0., [host]esterr,             [host]tollerr, [host]qleft, [host]qright,                      [host]qnow, [host]qdiff, [host]prevleft=0.;    int [host]repeat=0, [host]maxrepeat=2;    double *[host]f, *[host]x;    IntervalList [host]ints, [host]nextints;    repl int p;    repl double *speeds;    if(a==b)       return result;    // Initialization for first interval    [host]:    {       *flag = 0;       *errest = 0.;       *nofun = 0;       stone = (b-a)/16.;       minstep = stone*relerr+abserr;       InitFirstInterval(fun, a, b, &x, &f);       *nofun=9;    }    MPC_Processors_static_info(&p, &speeds);    {       net SimpleNet(p) s;       int [s]myn;       repl [s]nints, *[s]ns, *[s]dsp, [s]src=0;       double *[s]nodes, *[host]times;       double *[s]mynodes, *[s]myvalues, [s]mytime;       [s]:       {          ns = malloc(p*sizeof(int));          dsp = malloc(p*sizeof(int));       }       [host]: times = malloc(p*sizeof(double));       ints = ([host]CreateInterval)(x, f, 0., NULL);       nints = 1;       // Main loop of the algorithm       for(step=stone; ints!=NULL; ints=nextints,                      step/=2.0)       {          // Calculate points, where integrand should be evaluated          // at this iteration, proceeding from list ints of nints          // subintervals of width step, whose contribution into          // the integral is still to be estimated. Store the points          // in array nodes.          [host]:          {             int j, k;             struct Interval *i;             nodes = malloc(nints*8*sizeof(double));             for(i=ints, k=0; i!=NULL; i=i->next)                for (j=1; j<16; j+=2, k++)                   nodes[k] = i->x[j] = (i->x[j-1]+i->x[j+1])/2.;          }          nints = [host]nints;          // Become alert if the left end point of left subinterval          // does not change from iteration to iteration          [host]:             if(fabs(prevleft-nodes[0])<relerr*fabs(prevleft) ||                fabs(prevleft-nodes[0])<abserr)                repeat++;             else             {                prevleft = nodes[0];                repeat = 0;             }          // Convergence test          if(repeat>maxrepeat && step<minstep)          {             *flag = 1;             *badpoint = [host]nodes[0];             [s]: break;          }          [s]:          {             int k;             // Calculate the number of points assigned to each             // processor and store it in array ns; the number is             // proportional to the processor’s relative speed             Partition(p, speeds, ns, nints*8);             // Distribute the points across processors             myn = ns[I coordof myn];             mynodes = malloc(myn*sizeof(double));             myvalues = malloc(myn*sizeof(double));             for(k=1, dsp[0]=0; k < p; k++)                dsp[k] = ns[k-1]+dsp[k-1];             ([(p)s])MPC_Scatter(&src, nodes, dsp, ns, myn, mynodes);             // Evaluate the integrand at the points in parallel and             // measure the time taken by each processor             if(myn)             {                mytime = MPC_Wtime();                for(k=0; k<myn; k++)                   myvalues[k] = (*fun)(mynodes[k]);                   mytime = MPC_Wtime()-mytime;             }             else                mytime = 0.;          }          // Calculate the actual relative speed of processors          // and store it in array speeds          times[] = mytime;          [host]:          {             int j;             double temp;             for(j=0, temp=0.; j<p; j++)                if(ns[j] && times[j] > temp*ns[j])                   temp = times[j]/ns[j];             for(j=0; j<p; j++)                if(ns[j])                   speeds[j] = temp/times[j]*1000.*ns[j];          }          [s]: ([(p)s])MPC_Bcast(&src, speeds, 1, p, speeds, 1);          // Gather the function values to host-processor storing          // them in array nodes          ([([s]p)s])MPC_Gather(&src, nodes, dsp, ns, myn, myvalues);          // Complete computation of all elements of list ints that          // represents a set of subintervals of width step,whose          // contribution into the integral is still to be estimated          [host]:          {             int j, k;             struct Interval *i;             for(i=ints, k=0; i!=NULL; i=i->next)                for (j=1; j<16; j+=2, k++)                   i->f[j] = nodes[k];          }          // Increase number of function values          *nofun += ([host]nints)*8;          [s]free(mynodes);          [s]free(myvalues);          [host]: free(nodes);          // Calculate an estimate of the integral over each          // interval in list ints. If the interval is acceptable,          // add its contribution to the result. If not, bisect it          // and add 2 new subintervals to a list of subintervals to          // be processed at next iteration of the main loop          [host]:          {             struct Interval *i;             for(i=ints, nextints=NULL, nints=0; i!=NULL;                    i=i->next)             {                CentralQuanc8Calculation( i, step, &qleft, &qright);                qnow = qleft+qright;                qdiff = qnow - i->qprev;                area = area+qdiff;                esterr = fabs(qdiff)/1023.0;                absarea = fabs(area);                tollerr =  fmax(abserr, relerr*absarea)*(step/stone);                if(esterr <= tollerr) //Interval convergence test                {   // Interval converged. Add contributions into sums.                   result += qnow;                   *errest += esterr;                   cor11 += qdiff/1023.0;                   free(i->x);                   free(i->f);                }                else                {// No convergence. Bisect interval i and add its                   // subintervals to list nextints of intervals to be                   // processed at next iteration of the main loop                   nextints = BisectInterval(i, nextints,                            qleft, qright);                   nints += 2;                }             }          }          [host]FreeList(ints);       }    }    [host]:    {       double temp;       result += cor11;       temp = fabs(result)+*errest;       while (*errest!=0. && temp == fabs(result))       {          *errest *= 2.;          temp = fabs(result)+*errest;          }       }       return result; } #include "function.h" int [*] main() {    double [host]a=0., [host]b=2., [host]abserr=0.,       [host]relerr=1.e-11, [host]errest, [host]bad point,        [host]sflag, [host]time, result;    int [host]nofun, [host]pflag;    time = ([host]MPC_Wtime)();    result = pquanc8(fun, a, b, abserr, relerr, &errest,                                  &nofun, &pflag, &badpoint);    time = ([host]MPC_Wtime)()-time;    [host]:       if(pflag)          printf( "pquanc8:\nBad singularity at %e\n", _ badpoint);       else          printf("pquanc8:\nresult=%e errest=%e nofun=%d time=%es\n",                            result, errest, nofun, time);    time = ([host]MPC_Wtime)();    [host]: result = quanc8(fun, a, b, abserr, relerr,                                  &errest, &nofun, &sflag);    time = ([host]MPC_Wtime)()-time;    [host]:       if(sflag)          printf("quanc8:\nBad singularity at %e\n",                   b-modf(sflag,NULL)*(b-a));       else          printf("quanc8:\nresult=%e errest=%e nofun=%d time=%es\n",                            result, errest, nofun, time); }

The preceding pquanc8 routine does not contain a serial code for calculating the number of points to be assigned to each processor and for processing the lists of subintervals. The code is kept in the files partition.h and interval.h respectively, and included in the mpC program with the include directives. Other two files included in the program are

quanc8.h, which contains the C code of the quanc8 routine. The file is used in the program to compare the quanc8 execution time with the execution time of pquanc8.
function.h, which contains the definition of a function fun for integrand f(x).

The full code of the mpC application can be found in Appendix C.