3.4. PARALLEL LANGUAGES

Optimizing C and Fortran 77 compilers do not solve the problem of efficiently portable parallel programming for the SMP architecture. Using those compilers, programmers can only implement simple multithreaded applications.

Thread libraries do solve this efficiency problem. In particular, Pthreads are a powerful tool by which programmers can implement a wide range of parallel algorithms on SMP computers in highly efficient, portable multithreaded applications.

However, some programmers find the power of thread libraries excessive for parallel programming purposes. The reason is that the multithreaded programming model underlying the libraries is universal: it supports parallel computing as well as diverse distributed computing technologies. There are a number of high-level design approaches to writing MT programs with thread libraries, most of which are used in distributed programming, in particular, when implementing servers in client/server applications.

The master/slave design is common to many tasks, and it is widely used in multithreaded parallel programming. Its idea is that one thread does most of the work of the program, occasionally creating other threads to help with some portion of the work.

Apart from this basic design strategy, there exist other design strategies that are used in distributed programming but not in parallel programming. One such strategy is the producer/consumer design, where some threads create work and put it on a queue, and other threads take work items off the queue and execute them. Other examples are the pipeline design, whereby each thread performs some part of an entire task and passes it, partially completed, to the next thread, and the personal servant design, whereby a thread is assigned to each client and serves only that client.

Thus the programming model underlying the thread libraries is considered to be too powerful and complicated (and hence error-prone) for use in parallel programming. For this reason a number of high-level parallel extensions of Fortran 77 and C have been designed that provide a simplified multithreaded programming model based on the master/slave strategy and are specifically directed at parallel computing on SMP architectures. Next, we look at two such parallel extensions: Fortran 95 and OpenMP.

3.4.1. Fortran 95

Fortran 95 is the latest version of the standard Fortran language, released by ISO in December 1997. Fortran 95 is seen as a relatively minor enhancement of Fortran 90, and it includes a small number of new features. The major new feature added to support parallel programming of SMP computers is the FORALL statement and construct.

The FORALL statement is introduced as an alternative to the DO loop. The idea is that the individual assignments specified by a FORALL statement can be executed in any order of the index values, independently of one another. This makes a parallel implementation of the statement possible, in particular, with parallel threads on the SMP architecture.

For example, the FORALL statement

 FORALL (I = 1:N, J = 1:N) H(I,J) = 1./REAL(I+J-1)

defines a Hilbert matrix of order N. The FORALL statement

 FORALL (I = 1:N, J = 1:N, Y(I,J).NE.0.) X(I,J) = 1./Y(I,J)

inverts the elements of a matrix, avoiding division by zero.

In the statements above, FORALL can be considered as a double loop, which can be executed in arbitrary order. The general form of the FORALL statement is

 FORALL (v1=l1:u1:s1,..., vn=ln:un:sn, mask) a(e1,..., em) = expr,

and it is evaluated according to well-specified rules. First, the lower-bound, upper-bound, and stride expressions of each triplet are evaluated, in any order. All possible combinations of the index values form the set of combinations. For example, given the statement

 FORALL (I=1:3, J=4:5) A(I,J) = A(J,I)

the set of combinations of I and J is

{ (1,4), (1,5), (2,4), (2,5), (3,4), (3,5) }

Second, the mask expression for the set of combinations is evaluated, in any order, producing a set of active combinations (those for which mask is evaluated to .TRUE.). For example, if mask (I+J.NE.6) is applied to the set above, the set of active combinations is

{ (1,4), (2,5), (3,4), (3,5) }

Then expressions e1, ..., em, and expr are evaluated in any order for all active combinations of indexes. Finally, the computed values of expr are assigned to the corresponding elements of a in any order for all active combinations of indexes.

In addition there is a FORALL construct, which has to be distinguished from the single-line FORALL statement. The construct is interpreted as if its internal statements were written as separate FORALL statements in the same order, each completed before the next one begins. This requirement is important for achieving a unique computational result. For example, when the statements in the FORALL construct below

 REAL, DIMENSION(N, N) :: A, B
 ...
 FORALL (I = 2:N-1, J = 2:N-1)
   A(I,J) = 0.25*(A(I,J-1)+A(I,J+1)+A(I-1,J)+A(I+1,J))
   B(I,J) = A(I,J)
 END FORALL

have been executed, arrays A and B have identical values in the internal points, while B has kept its previous values on the boundaries.

In general, Fortran 95 procedures may have side effects, such as a change in the value of an argument or a global variable. FORALL statements and constructs, however, by design require that all referenced functions be free of side effects. To avoid the problems associated with side effects, Fortran 95 therefore enables the programmer to declare a procedure to be side effect free. Such procedures are referred to as PURE functions. Within FORALL statements and constructs, only calls to PURE functions are permitted.

A restricted form of a PURE procedure is called ELEMENTAL. In Fortran 90, to provide a procedure that operated elementally, the programmer had to produce eight versions of the procedure: one to operate on scalars, and one for each rank of the array from one to seven. If two array arguments were used, as many as 16 versions could be required. Using ELEMENTAL, the programmer can write a single procedure with the same functionality.
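For illustration, the following minimal sketch (the function name RECIP_SHIFT and its body are an assumed example, not taken from the standard) shows how a single ELEMENTAL definition covers all these cases:

 ELEMENTAL REAL FUNCTION RECIP_SHIFT(X, S)
   ! ELEMENTAL implies PURE: no side effects are allowed
   REAL, INTENT(IN) :: X, S
   RECIP_SHIFT = 1./(X + S)
 END FUNCTION RECIP_SHIFT

With this single definition, RECIP_SHIFT can be applied to scalars or, element by element, to conforming arrays of any rank, for example A = RECIP_SHIFT(B, 1.).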

3.4.2. OpenMP

OpenMP is a recent application program interface (API) for writing multithreaded applications based on the master/slave design strategy. OpenMP supports parallel programming in C, C++, and Fortran on all SMP architectures, including Unix and Windows NT platforms. OpenMP is jointly defined by a group of major computer hardware and software vendors, including Intel, HP, SGI, IBM, SUN, Compaq, KAI, PGI, and NAG. The first OpenMP specifications appeared in 1997 (Fortran) and 1998 (C/C++).

Technically, OpenMP specifies a set of language extensions in the form of compiler directives, library procedures, and environment variables. OpenMP is semantically the same for Fortran and C/C++, differing mainly in syntax. In this book we use the C/C++ API to present OpenMP.

An OpenMP compiler directive is just a C/C++ preprocessing pragma directive of the form

#pragma omp directive-name list-of-clauses 

Such syntax allows C/C++ compilers that do not support OpenMP to compile an OpenMP program. OpenMP also specifies stubs for the runtime library functions in order to enable portability to platforms that do not support the OpenMP C/C++ API. On these platforms, OpenMP programs must be linked with a library containing the stub functions. The stub functions assume that the directives in the OpenMP program are ignored; as such, they emulate serial semantics.

As a rule an OpenMP directive applies to a structured block, that is, a statement (single or compound) that has a single entry and a single exit. No statement is a structured block if there is a jump into or out of that statement. A compound statement is a structured block if its execution always begins at the opening {, and always ends at the closing } or at a call to the exit function.

The directive and the subsequent structured block to which it applies make up an OpenMP construct. Note that some directives are not part of a construct.

A typical sequence of OpenMP constructs and stand-alone directives does the following:

  • Define a structured block, which is to be executed by multiple threads in parallel, and create the parallel threads; in OpenMP that structured block is called a parallel region.

  • Distribute the execution of the parallel region among the parallel threads.

  • Synchronize the work of the parallel threads during the execution of the parallel region.

  • Control the data environment during the execution of the parallel region via specification of shared data, private data, accumulating variables, and so on.

3.4.2.1. Parallel Regions. A parallel construct is the fundamental construct that starts parallel execution of a parallel region. It has the following form:

#pragma omp parallel list-of-clauses
  structured-block

When a thread encounters a parallel construct, a team of threads is created. This thread becomes the master thread of the team, with a thread number of 0, and all threads, including the master thread, execute the region in parallel.

By default, the number of threads that are requested is implementation-defined. To specify the number of requested threads explicitly, the num_threads clause may be used in the parallel directive. Alternatively, the omp_set_num_threads library function may be called before activation of the parallel region.

For example, the following OpenMP code

 int a[5][5];
 omp_set_num_threads(5);
 #pragma omp parallel
 {
   int thrd_num = omp_get_thread_num();
   zero_init(a[thrd_num], 5);
 }

defines a parallel region executed by five parallel threads. The omp_get_thread_num library function returns the thread number, within its team, of the thread executing the function. The thread number lies between 0 and 4. The master thread of the team is thread 0. Array a is shared among all the threads. Variable thrd_num is private to each thread in the team. Each thread calls a user-defined zero_init function to initialize with zeros the corresponding row of the array in parallel.

There is an implicit barrier at the end of a parallel region, that is, a synchronization point that must be reached by all threads in the team. Each thread waits until all threads in the team arrive at this point. Only the master thread of the team continues execution at the end of the parallel region.

If a thread in a team executing a parallel region encounters another parallel construct, it creates a new team, and it becomes the master of that new team. Nested parallel regions are serialized by default. As a result, by default, a nested parallel region is executed by a team composed of one thread. The default behavior may be changed by using either the omp_set_nested library function or the OMP_NESTED environment variable. However, the number of threads in a team that execute a nested parallel region is implementation-defined.
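As a hedged sketch (the work function is hypothetical, and the actual size of the inner teams is implementation-defined), nested parallelism can be enabled as follows:

 omp_set_nested(1);               /* allow nested parallel regions            */
 #pragma omp parallel num_threads(2)
 {
   /* each thread of the outer team becomes the master of an inner team */
   #pragma omp parallel num_threads(2)
   work(omp_get_thread_num());
 }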

3.4.2.2. Work-Sharing Constructs. A work-sharing construct distributes the execution of the associated statement among the members of the team that encounter it. The work-sharing directives do not launch new threads, and there is no implicit barrier on entry to a work-sharing construct.

The sequence of work-sharing constructs and barrier directives encountered must be the same for every thread in a team.

A for construct of the form

#pragma omp for list-of-clauses
  for-loop

specifies that the iterations of the associated loop will be executed in parallel. The iterations of the for loop are distributed across the threads that already exist in the team executing the parallel region to which the for construct binds.

The associated for loop must be of the following canonical form

 for(var = lb; var logical-op rb; incr-expr)

where var is a signed integer variable that must not be modified within the body of the for statement; incr-expr is one of ++var, --var, var++, var--, var += incr, var -= incr, var = var + incr, var = var - incr, or var = incr + var; lb, rb, and incr are loop-invariant integer expressions; and logical-op is one of <, <=, >, or >=. The canonical form allows the number of loop iterations to be computed upon entry to the loop.

The schedule clause of the form

schedule(kind [, chunk-size])

specifies how iterations of the for loop are divided among threads of the team. If kind is static, iterations are divided into chunks of a size specified by chunk-size. The chunks are statically assigned to threads in a round-robin fashion in the order of the thread number. When no chunk-size is specified, the iteration space is divided into chunks that are approximately equal in size, with one chunk assigned to each thread. For example, code

 int a[10][10], k;
 omp_set_num_threads(5);
 #pragma omp parallel
 #pragma omp for schedule(static)
 for(k=0; k<10; k++)
   zero_init(a[k], 10);

specifies that each of five threads, executing the for statement, will execute two successive iterations of the loop. Namely, thread i will execute the 2i-th and (2i+1)-th iterations of the loop, calling the zero_init function to initialize with zeros the 2i-th and (2i+1)-th rows of array a. Note that variable k will be private to each of the threads.

If kind is dynamic, the iterations are divided into a series of chunks, each containing chunk-size iterations. Each chunk is assigned to a thread that is waiting for an assignment. The thread executes the chunk of iterations and then waits for its next assignment, until no chunks remain to be assigned. When no chunk-size is specified, it defaults to 1. For example, code

 int a[10][10], k;
 omp_set_num_threads(5);
 #pragma omp parallel
 #pragma omp for schedule(dynamic, 2)
 for(k=0; k<10; k++)
   zero_init(a[k], 10);

also specifies that each of five threads, executing the for statement, will execute two successive iterations of the loop. But which thread will execute which pair of iterations is decided at run time.

If kind is guided, the iterations are assigned to threads in chunks with decreasing sizes. When a thread finishes its assigned chunk of iterations, it is dynamically assigned another chunk, until none remains. For a chunk-size of 1, the size of each chunk is approximately the number of unassigned iterations divided by the number of threads. These sizes decrease approximately exponentially to 1. For a chunk-size greater than 1, the sizes decrease approximately exponentially to chunk-size, except that the last chunk may be fewer than chunk-size iterations. When no chunk-size is specified, it defaults to 1.
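For example, the following code (analogous to the earlier examples, and again using the user-defined zero_init function) distributes the ten iterations in chunks whose sizes decrease as the loop proceeds; the exact chunk sizes are implementation-dependent:

 int a[10][10], k;
 omp_set_num_threads(5);
 #pragma omp parallel
 #pragma omp for schedule(guided)
 for(k=0; k<10; k++)
   zero_init(a[k], 10);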

Finally, if kind is runtime, the decision regarding scheduling is deferred until runtime. The schedule kind and size of the chunks can be chosen at run time by setting the environment variable OMP_SCHEDULE. If this environment variable is not set, the resulting schedule is implementation-defined. No chunk-size must be specified.

In the absence of an explicitly defined schedule clause, the default schedule is implementation-defined. An OpenMP compliant program should not rely on a particular schedule for correct execution. There is an implicit barrier at the end of a for construct unless a nowait clause is specified.
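As a minimal sketch of the nowait clause (f and g stand for independent user-defined functions, as in the later examples of this section), the clause removes the barrier between two independent loops, so threads that finish the first loop early can start the second one immediately:

 int a[100], b[100], k;
 #pragma omp parallel
 {
   #pragma omp for nowait
   for(k=0; k<100; k++)
     a[k] = f(k);
   #pragma omp for
   for(k=0; k<100; k++)
     b[k] = g(k);
 }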

A sections construct of the form

#pragma omp sections list-of-clauses
{
  [#pragma omp section]
    structured-block
  [#pragma omp section
    structured-block]
  ...
}

specifies a set of constructs that are to be divided among threads in a team. Each section is executed once by a thread in the team. Except for the first section, each section must be preceded by a section directive. For example, code

 int a[10][10];
 omp_set_num_threads(5);
 #pragma omp parallel
 #pragma omp sections
 {
   {
     zero_init(a[0], 10);
     zero_init(a[1], 10);
   }
   #pragma omp section
   {
     zero_init(a[2], 10);
     zero_init(a[3], 10);
   }
   #pragma omp section
   {
     zero_init(a[4], 10);
     zero_init(a[5], 10);
   }
   #pragma omp section
   {
     zero_init(a[6], 10);
     zero_init(a[7], 10);
   }
   #pragma omp section
   {
     zero_init(a[8], 10);
     zero_init(a[9], 10);
   }
 }

specifies that each of five threads, executing the parallel region, will execute a section containing two calls to the zero_init function initializing with zeros two corresponding rows of array a. There is an implicit barrier at the end of a sections construct unless a nowait clause is specified.

A single construct of the form

#pragma omp single list-of-clauses
  structured-block

specifies that the associated structured block is executed by only one thread in the team (not necessarily the master thread). There is an implicit barrier after a single construct unless a nowait clause is specified.

A master construct of the form

#pragma omp master
  structured-block

specifies that the associated structured block is executed by the master thread of the team. Other threads of the team do not execute this statement. There is no implicit barrier on entry to or exit from the master section.
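A small sketch combining the two constructs (read_input and iterate are hypothetical user functions) could look as follows; note that the implicit barrier after single makes the input visible to the whole team, whereas master implies no barrier:

 #pragma omp parallel
 {
   #pragma omp single
   read_input();                     /* executed by exactly one thread    */
   iterate(omp_get_thread_num());    /* executed by every thread          */
   #pragma omp master
   printf("iteration finished\n");   /* executed by the master thread only */
 }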

OpenMP provides shortcuts to specifying a parallel region that contains only one work-sharing construct. The combined parallel work-sharing constructs are

#pragma omp parallel for list-of-clauses
  for-loop

for a parallel region that contains a single for construct, and

#pragma omp parallel sections list-of-clauses
{
  [#pragma omp section]
    structured-block
  [#pragma omp section
    structured-block]
  ...
}

for a parallel region containing a single sections construct.

3.4.2.3. Synchronization Directives and Constructs. In addition to the implicit barriers implied by some OpenMP constructs, the barrier directive

#pragma omp barrier

is provided to explicitly synchronize all the threads in a team. When encountered, each thread of the team waits until all of the others have reached this point.
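For instance, in the following sketch (fill_row and use_rows are hypothetical user functions, and a is a shared array) the barrier guarantees that all rows have been filled before any thread starts reading them:

 int a[4][100];
 #pragma omp parallel num_threads(4)
 {
   int me = omp_get_thread_num();
   fill_row(a[me], 100);        /* phase 1: each thread fills its own row  */
   #pragma omp barrier          /* wait until every row has been filled    */
   use_rows(a, 100);            /* phase 2: any thread may read any row    */
 }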

An ordered construct of the form

#pragma omp ordered
  structured-block

specifies that the associated structured block is executed in the order in which iterations would be executed in a sequential loop.
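Note that the ordered construct must appear in the dynamic extent of a for construct whose directive carries the ordered clause. A minimal sketch (f is a hypothetical user function) is

 int k;
 #pragma omp parallel
 #pragma omp for ordered
 for(k=0; k<100; k++)
 {
   double v = f(k);             /* iterations may execute in any order     */
   #pragma omp ordered
   printf("%d %f\n", k, v);     /* output appears in the order k=0,1,2,... */
 }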

In order to synchronize access to shared data, a critical construct

#pragma omp critical [(name)]
  structured-block

is provided that restricts execution of the associated structured block to a single thread at a time. A thread waits at the beginning of a critical region until no other thread is executing a critical region (anywhere in the program) with the same name. All unnamed critical regions have the same unspecified name. For example, code

 int a[100], ind[100], b[100], k;
 ...
 omp_set_num_threads(5);
 #pragma omp parallel for
 for(k=0; k<100; k++)
 {
   #pragma omp critical
   {
     a[ind[k]] += f(k);
   }
   b[k] += g(k);
 }

specifies that five threads execute in parallel the iterations of the above for loop. During the execution they are updating elements of the shared array a. An element of a to be updated by an iteration is specified by an element of array ind. Therefore different iterations of the loop may update the same element of a. The critical directive is used to serialize execution of the corresponding statement by the parallel threads, and thus the threads avoid race conditions (simultaneous updates of an element of a by multiple threads).

An atomic construct of the form

#pragma omp atomic
  expression-statement

may be seen as a special case of a critical section. Unlike the critical directive, the atomic directive does not prevent simultaneous execution by multiple threads of the associated statement. Instead, it ensures that a specific memory location is updated atomically, that is, prevents simultaneous access to the memory location by multiple threads.

In the atomic construct, expression-statement must have one of the following forms: x op= expr, x++, ++x, x--, or --x, where x is an lvalue of scalar type, expr is an expression of scalar type not referencing the object designated by x, and op is one of the binary operators +, *, -, /, &, ^, |, <<, or >>. Only the load and store of the object designated by x are atomic; the evaluation of expr is not atomic. To avoid race conditions, all parallel updates of the location should be protected with the atomic directive, except those that are known to be free of race conditions.

When supported by hardware instructions, atomic directives allow more efficient implementation than via straightforward replacement of all atomic directives with critical directives that have the same unique name. For example, code

 int a[100], ind[100], b[100], k;
 ...
 omp_set_num_threads(5);
 #pragma omp parallel for
 for(k=0; k<100; k++)
 {
   #pragma omp atomic
   a[ind[k]] += f(k);
   b[k] += g(k);
 }

only differs from the previous example code in the way it avoids race conditions: the atomic directive is used instead of the critical directive. The advantage of using the atomic directive in this example is that it allows updates of two different elements of a to occur in parallel. The use of the critical directive would force all updates to elements of a to be executed serially.

A flush directive of the form

#pragma omp flush [(variable-list)] 

specifies a sequence point at which all threads in a team must have a consistent view of memory. This means that all memory operations (both reads and writes) specified before the sequence point must be complete; for example, variables held in registers or write buffers must be updated in memory. All memory operations specified after the point must not yet have begun. The directive provides an explicit mechanism to ensure consistency of shared data that does not depend on the particular cache-management algorithms implemented in the underlying SMP computer.

A flush directive without a variable-list synchronizes all shared objects except inaccessible automatic variables. It is implied by the barrier directive, at entry to and exit from critical and ordered constructs, and at exit from parallel, for, sections, and single constructs. The directive is not implied if a nowait clause is present.

A flush directive with a variable-list synchronizes only the objects designated by variables in the variable-list.
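The classical use of flush is a producer/consumer handshake through shared variables. The sketch below (compute and consume are hypothetical user functions) follows the usual pattern:

 int data, flag = 0;
 #pragma omp parallel num_threads(2)
 {
   if(omp_get_thread_num() == 0)
   {
     data = compute();                /* producer prepares the data        */
     #pragma omp flush(data, flag)    /* make data visible before the flag */
     flag = 1;
     #pragma omp flush(flag)
   }
   else
   {
     while(!flag)                     /* consumer spins until it sees flag */
     {
       #pragma omp flush(flag)
     }
     #pragma omp flush(flag, data)    /* make sure data is also up to date */
     consume(data);
   }
 }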

3.4.2.4. Data Environment. There are two kinds of variable in an OpenMP program. A shared variable designates a single region of storage. All threads in a team that access this variable will access this single region of storage. By default, if a variable is defined outside a parallel construct and is visible when the parallel construct is encountered, then the variable is shared within the parallel construct. Static variables declared within a parallel region are shared. Heap-allocated memory is shared.

A private variable designates a region of storage that is unique to the thread making the reference. Automatic variables declared within a parallel construct are private. Other ways to specify that a variable is private are

  • a threadprivate directive;

  • a private, firstprivate, lastprivate, or reduction clause;

  • use of the variable as a for loop control variable immediately following a for or parallel for directive.

The threadprivate directive

#pragma omp threadprivate (variable-list) 

makes the variables specified in the variable-list private to a thread. Each copy of a threadprivate variable is initialized once, at an unspecified point of the program before the first reference to that copy, and in the usual manner (i.e., as the master copy would be initialized in a serial execution of the program). Only global variables or static block-scope variables may appear in the variable-list. For example, the following code

 int count()
 {
   static int counter = 0;
   #pragma omp threadprivate(counter)
   counter++;
   return counter;
 }

uses the threadprivate directive to give each thread calling the count function a separate counter.

The parallel directive and most work-sharing directives accept clauses that allow the programmer to explicitly specify whether a variable is private or shared, how the variable is initialized, and so on.

A copyin clause of the form

copyin(variable-list)

provides a mechanism to assign the same value to threadprivate variables for each thread in the team executing the parallel region. For each variable specified in a copyin clause, the value of the variable in the master thread of the team is copied, as if by assignment, to the thread-private copies at the beginning of the parallel region.
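For example, in the following sketch (counter is a threadprivate variable declared at file scope, as in the previous example) every thread enters the parallel region with its private copy set to the master thread's value:

 int counter = 0;
 #pragma omp threadprivate(counter)
 ...
 counter = 10;                          /* set in the master thread        */
 #pragma omp parallel copyin(counter)
 {
   /* every thread starts with its threadprivate counter equal to 10 */
   counter += omp_get_thread_num();
 }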

A private clause of the form

private(variable-list) 

declares the variables in variable-list to be private to each thread in a team. The behavior of a variable specified in a private clause is as follows: A new object with automatic storage duration is allocated for the construct. The size and alignment of the new object are determined by the type of the variable. This allocation occurs once for each thread in the team; the initial value of the object is indeterminate. The original object designated by the variable has an indeterminate value upon entry to the construct, must not be modified within the construct, and has an indeterminate value upon exit from the construct. Within the construct the variable designates the new private object allocated by the thread. Note that unlike the threadprivate directive, which replicates a global or local static variable preserving its static nature within each thread, the private clause creates private automatic variables overriding the original (possibly, global or local static) variable.

A firstprivate clause of the form

firstprivate(variable-list)

is a special case of the private clause, which additionally initializes each new private object. The initialization happens as if it were done once per thread, before the thread’s execution of the construct. For a firstprivate clause on a parallel construct, the initial value of the new private object is the value of the original object that exists immediately before the parallel construct for the thread that encounters it. For a firstprivate clause on a work-sharing construct, the initial value of the new private object for each thread that executes the work-sharing construct is the value of the original object that exists before the point in time that the same thread encounters the work-sharing construct.

A lastprivate clause of the form

lastprivate(variable-list)

is a special case of the private clause, which additionally defines the value of the original object upon exit from the work-sharing construct. Namely, the value of each lastprivate variable from the sequentially last iteration of the associated loop, or from the lexically last section, is assigned to the variable's original object. Variables that are not assigned a value by the last iteration of the for or parallel for construct, or by the lexically last section of the sections or parallel sections construct, have indeterminate values after the construct.
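A small sketch illustrating both the firstprivate and the lastprivate clause (a is a shared array) is

 int a[10], k, last_k, offset = 100;
 #pragma omp parallel for firstprivate(offset) lastprivate(last_k)
 for(k=0; k<10; k++)
 {
   a[k] = k + offset;     /* each private copy of offset starts at 100     */
   last_k = k;
 }
 /* here last_k == 9, the value from the sequentially last iteration */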

A shared clause of the form

shared(variable-list)

shares variables that appear in the variable-list among all the threads in a team. All threads within a team access the same region of storage for shared variables.

A copyprivate clause of the form

copyprivate(variable-list)

provides a mechanism to use a private variable to broadcast a value from one member of a team to the other members. This clause may only appear on the single directive. After the execution of the structured block associated with the single directive and before any of the threads in the team have left the barrier at the end of this single construct, a copyprivate variable becomes defined (as if by assignment) in all other threads of the team with the value of the corresponding variable in the thread that executed the construct’s structured block. The mechanism is an alternative to using a shared variable for the value, when providing such a shared variable would be difficult.
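As a hedged sketch (read_parameter and process are hypothetical user functions), one thread obtains a value and broadcasts it to the private copies of all the other threads:

 float prm;
 #pragma omp parallel private(prm)
 {
   #pragma omp single copyprivate(prm)
   prm = read_parameter();       /* one thread reads the value             */
   /* after the single construct every thread's private prm holds that value */
   process(prm);
 }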

A reduction clause of the form

reduction(op:variable-list)

specifies the way a reduction operation with the operator op on a scalar variable in the variable-list is performed in the corresponding parallel or work-sharing construct. Namely inside the construct a private copy of each variable in the variable-list is created, one for each thread, as if the private clause had been used, and initialized depending on the op (0 for +, -, |, ^, and ||; 1 for * and &&; and ~0 for &). Then each thread updates its private copy of the variable. At the end of the region for which the reduction clause was specified, the original object is updated to reflect the result of combining its original value with the final value of each of the private copies using the operator op. For example, the code

 int k; double sum, a[1000]; ... sum = 0.; #pragma omp parallel for reduction(+: sum) for(k=0; k<1000; k++) sum += a[k];

uses the reduction clause to compute efficiently the sum of elements of the array a by parallel threads.

A default clause of the form

default(none)

may be used in the parallel directive. When specified, the clause disables all defaults that would implicitly make variables shared in the corresponding parallel region, and it requires the sharing attributes of such variables to be specified explicitly.
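For example, in the sketch below (x is assumed to be initialized elsewhere) every shared variable referenced in the region must be listed explicitly; referencing an unlisted variable is rejected by the compiler:

 int k, n = 100;
 double sum = 0., x[100];
 ...
 #pragma omp parallel for default(none) shared(n, x) reduction(+: sum)
 for(k=0; k<n; k++)
   sum += x[k];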

3.4.2.5. Runtime Library Functions. The OpenMP runtime library functions are divided into the following two non-intersecting classes:

  • Functions controlling and querying the parallel execution environment like the omp_set_num_threads function.

  • Functions (and data types) implementing a mutexlike explicit synchronization mechanism.

Examples of functions that affect and monitor threads, processors, and the parallel environment are omp_set_num_threads, omp_get_num_threads, omp_get_max_threads, omp_get_thread_num, and omp_get_num_procs.

The function

 void omp_set_num_threads(int num_threads)

sets the default number of threads to use for subsequent parallel regions that do not specify a num_threads clause.

The function

 int omp_get_num_threads(void)

returns the number of threads currently in the team executing the parallel region from which it is called.

The function

 int omp_get_max_threads(void)

returns the maximum value that can be returned by calls to omp_get_num_threads.

The function

 int omp_get_thread_num(void)

returns the thread number, within its team, of the thread executing the function.

The function

 int omp_get_num_procs(void)

returns the maximum number of processors that could be assigned to the program.
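A short sketch that reports the execution environment from inside a parallel region (not one of the book's examples) might look as follows:

 #pragma omp parallel
 {
   #pragma omp master
   printf("team of %d threads, %d processors, at most %d threads on request\n",
          omp_get_num_threads(), omp_get_num_procs(),
          omp_get_max_threads());
   printf("hello from thread %d\n", omp_get_thread_num());
 }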

The part of the OpenMP runtime library providing a mutexlike synchronization mechanism declares two types of lock variables (an analogue of the mutex variables of Pthreads) as well as functions for the following operations (a short sketch using these functions is given after the list):

  • initializing a lock variable;

  • destroying the lock variable;

  • setting the lock variable (an analogue of locking a mutex in Pthreads);

  • unsetting the lock variable (an analogue of unlocking a Pthreads mutex);

  • testing the lock variable (an analogue of the Pthreads trylock operation).
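A minimal sketch of the simple lock functions (do_work is a hypothetical user function) serializes the updates of a shared accumulator:

 omp_lock_t lock;
 double total = 0.;
 omp_init_lock(&lock);                  /* initialize the lock variable    */
 #pragma omp parallel
 {
   double my_part = do_work(omp_get_thread_num());
   omp_set_lock(&lock);                 /* analogue of locking a mutex     */
   total += my_part;                    /* at most one thread at a time    */
   omp_unset_lock(&lock);               /* analogue of unlocking the mutex */
 }
 omp_destroy_lock(&lock);               /* destroy the lock variable       */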

3.4.2.6. Example of OpenMP Application: Multithreaded Dot Product. To illustrate parallel programming of SMP computers with OpenMP, we consider an application computing the dot product of two real vectors x and y. The OpenMP application implements the same parallel algorithm as the Pthreads application presented in Section 3.3.4.

The source code of the application is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MAXTHRDS 124

int main()
{
  double *x, *y, dot_prod;
  int num_of_thrds, vec_len, i;

  num_of_thrds = omp_get_num_procs();
  omp_set_num_threads(num_of_thrds);
  printf("Vector length = ");
  if(scanf("%d", &vec_len)<1)
  {
    printf("Check input for vector length. Bye.\n");
    return -1;
  }
  x = malloc(vec_len*sizeof(double));
  y = malloc(vec_len*sizeof(double));
  for(i=0; i<vec_len; i++)
  {
    x[i] = 1.;
    y[i] = 1.;
  }
  dot_prod = 0.;
  #pragma omp parallel for reduction(+: dot_prod)
  for(i=0; i<vec_len; i++)
  {
    dot_prod += x[i]*y[i];
  }
  printf("Dot product = %f\n", dot_prod);
  free(x);
  free(y);
}

Obviously this OpenMP code is much simpler and more compact than the Pthreads code.



