4.4. SUMMARY | Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

4.3. PARALLEL LANGUAGES

Message-passing libraries solve the problem of efficiently portable programming MPPs. In particular, MPI is a well designed, powerful, and widespread library that allows the programmers to implement a wide range of parallel algorithms on MPPs in the form of highly efficient and portable message-passing applications.

However, many end-users that do scientific computing on MPPs find the explicit message passing provided by MPI and other libraries tedious and error-prone. They are accustomed to writing their applications mainly in Fortran, and they consider parallel primitives provided by the message-passing libraries too low level, creating unnecessary detailed description of parallel algorithms they would like to implement on MPPs. Indeed, many of the algorithms are straightforward and based on the data parallel paradigm.

In the data parallel programming model, processors perform the same work on different parts of data. It is the distribution of the data across the processors that determines distribution of computational work and interprocessor communication. This programming style is characterized by single threaded control, global name space, loosely synchronous processes, and parallelism implied by operations on data. It is mainly supported by high-level parallel languages, and the responsibility of a compiler is to generate the explicit message-passing code, which will be executed in parallel by all participating processors (the single-program multiple-data model, or SPMD for short).

The main advantage of the data parallel programming style is that it is easy to use. First, it is relatively simple to write data parallel applications. Second, due to the single thread of control, a data parallel program is easy to debug. Third, as conceptually data parallel programming is not too far removed from serial programming, it is easy to port legacy serial code to MPPs using the data parallel programming approach.

A number of data parallel languages have been designed for portable programming MPPs, among which HPF (High Performance Fortran) is the most popular and well known. In the book we only outline HPF.

High Performance Fortran consists in a set of extensions to Fortran aimed at writing data parallel programs for architectures where the distribution of data impacts performance, that is, mainly for MPPs. The extensions were defined by the High Performance Fortran Forum (HPFF) in which over 40 organizations participated. There are two main versions of HPF produced by HPFF: a version known as HPF 1.1, which is dated November 10, 1994, and a version known as HPF 2.0, which was released on January 31, 1997. While HPF is widely implemented and available, it is not accepted in an official standard form as yet.

HPF 1.1 is based on Fortran 90 and specifies a number of language constructs, which are now included in Fortran 95 (i.e., the FORALL construct and statement, PURE procedures). HPF 2.0 is defined as an extension of Fortran 95 and hence inherits all of its data parallel constructs.

Thus the language features provided by HPF to express data parallelism are the following: whole-array operations, assignments, and functions; the FORALL construct and statement; PURE and ELEMENTAL procedures. In addition to these inherited data parallel features, HPF introduces one more way to specify computation, which can be executed in parallel. It provides a directive of the form

!HPF$  INDEPENDENT

that can precede an indexed DO loop or FORALL statement. The directive asserts to the compiler that the iterations in the following DO loop or the operations in the following FORALL statement may be executed independently—that is, in any order, or interleaved, or concurrently—without changing the semantics of the program. Such information is valuable as it enables compiler optimization without complex data dependence analysis. The INDEPENDENT directive allows the programmer to directly pass on his or her knowledge of the application to the compiler. The compiler relies on the assertion in its translation process. If the assertion is true, the semantics of the program are not changed; if it is false, the program is not HPF conforming and has no defined meaning.

Apart from the INDEPENDENT directive, HPF introduces a number of directives that allow the programmer to suggest the distribution of data among available processors to the compiler. Like the INDEPENDENT directive, the data distribution directives are structured comments of the form

!HPF$  directive-body

These directives may affect the efficiency of the computation performed, but they do not change the value computed by the program. Therefore, when properly written, an HPF program may be compiled by Fortran compilers that do not support HPF and be executed serially.

Two basic data distribution directives are the PROCESSORS directive and the DISTRIBUTE directive.

HPF provides a logical view of the parallel machine as a rectilinear arrangement of abstract processors in one or more dimensions. The PROCESSOR directive allows the programmer to declare such a rectilinear processor arrangement, specifying its name, its rank (number of dimensions), and the extent in each dimension. For example, the directive

!HPF$  PROCESSORS p(4,8)

specifies a logical 4 X 8 grid of abstract processors p.

The intrinsic functions NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE may be used to inquire about the total number of physical processors actually used to execute the program. This information may then be used to calculate appropriate sizes for the declared abstract processor arrangements. The values returned by these functions remain constant for the duration of one program execution. Therefore the values may be used in data distribution directives. For example, the directive

!HPF$  PROCESSORS q(4, NUMBER_OF_PROCESSORS()/4)

may be used to specify a logical grid of abstract processors q, each abstract processor of which is supposed to be mapped to a separate physical processor.

Several processors arrangements may be declared in the same program. If two processor arrangements have the same shape, then the corresponding elements of the two arrangements are understood to refer the same abstract processor. Therefore, if function NUMBER_OF_PROCESSORS returns 32, then elements p(2,3) and q(2,3) of the arrangements above will refer the same abstract processor.

The DISTRIBUTE directive specifies a mapping of data objects (mainly arrays) to abstract processors in a processor arrangement. There are two basic distribution types: block and cyclic.

For example, the code fragment

   REAL A(10000) !HPF$ DISTRIBUTE A(BLOCK)

specifies that array A should be distributed across some set of abstract processors by partitioning it uniformly into blocks of contiguous elements. If there are 50 processors, the array would be divided into groups of 200 elements, with A(1:200) mapped to the first processor, A(201:400) mapped to the second processors, and so on. If there is only one processor, the entire array is mapped to that processor as a single block of 10,000 elements.

The block size may be specified explicitly:

   REAL A(10000) !HPF$ DISTRIBUTE A(BLOCK(256)).

This specifies that groups of exactly 256 elements should be mapped to successive abstract processors. There must be at least 40 abstract processors if the directive is to be satisfied. The fortieth processor will contain a partial block of only 16 elements.

The code fragment

   INTEGER D(52) !HPF$ DISTRIBUTE D(CYCLIC(2))

specifies a cyclic distribution of array D. Successive two-element blocks of the array are mapped to successive abstract processors in a round-robin fashion.

Thus, if there are four abstract processors, the first processor will contain D(1), D(2), D(9), D(10),…, D(45), D(46); the second processor will contain D(3), D(4), D(11), D(12),…, D(47), D(48); the third processor will contain D(5), D(6), D(13), D(14),…, D(49), D(50); and the fourth processor will contain D(7), D(8), D(19), D(16),…, D(51), D(52).

CYCLIC by definition means the same as CYCLIC(1). Therefore the code fragment

   INTEGER DECK_OF_CARDS(52) !HPF$ PROCESSORS PLAYERS(4) !HPF$ DISTRIBUTE DECK_OF_CARDS(CYCLIC) ONTO PLAYERS

specifies that the first processor of processor arrangement PLAYERS will contain DECK_OF_CARDS[1:49:4], the second processor will contain DECK_OF_CARDS[2:50:4], the third processor will contain DECK_OF_CARDS[3:51:4], and the fourth processor will contain DECK_OF_CARDS[4:52:4].

Distributions are specified independently for each dimension of a multidimensional array:

   INTEGER CHESS_BOARD(8,8), GO_BOARD(19,19) !HPF$ DISTRIBUTE CHESS_BOARD(BLOCK, BLOCK) !HPF$ DISTRIBUTE GO_BOARD(CYCLIC,*)

The CHESS_BOARD array will be partitioned into contiguous rectangular patches that will be distributed onto a two-dimensional arrangement of abstract processors. The GO_BOARD array will have its rows distributed cyclically over a one-dimensional arrangement of abstract processors. The asterisk (*) specifies that GO_BOARD is not to be distributed along the second axis; thus an entire row is to be distributed as one object.

Consider the following simple HPF program:

   PROGRAM SIMPLE    REAL, DIMENSION(1000,1000):: A, B, C !HPF$ PROCESSORS p(4,4) !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO p:: A, B, C !HPF$ INDEPENDENT    DO J=1,1000 !HPF$   INDEPENDENT    DO I=1,1000    A(I,J)=1.0    B(I,J)=2.0    END DO    END DO !HPF$ INDEPENDENT    DO J=1,1000 !HPF$   INDEPENDENT    DO I=1,1000    C(I,J)=0.0    DO K=1,1000    C(I,J)=C(I,J)+A(I,K)*B(K,J)    END DO    END DO    END DO    END

The program implements matrix operation C = A X B on a 16-processor MPP, where A, B are dense square 1000 X 1000 matrices.

The PROCESSORS directive specifies a logical 4 X 4 grid of abstract processors, p. The DISTRIBUTE directive recommends the compiler to partition each of the arrays A, B, and C into equal-sized blocks along each of its dimension. This will result in a 4 X 4 configuration of blocks, each containing 250 X 250 elements with one block per processor. The corresponding blocks of arrays A, B, and C will be mapped to the same abstract processor and, hence, to the same physical processor.

Each of the four INDEPENDENT directives in the program is applied to a DO loop. They advise the compiler that the loop does not carry any dependences, and therefore its different iterations may be executed in parallel.

Altogether the directives give the compiler enough information in order to generate a target message-passing program. Additional information is given by a general HPF rule saying that evaluation of an expression should be performed on the processor in the memory of which its result will be stored. Actually the HPF program above specifies the parallel matrix-matrix multiplication algorithm based on a two-dimensional matrix decomposition, as was presented in Section 4.1. Thus, a clever HPF compiler should be able to generate an SPMD message-passing code such as the following:

 PROGRAM SIMPLE REAL, DIMENSION(250,250):: A, B, C REAL, DIMENSION(250,1000):: Arows, Bcols INTEGER colcom, rowcom, col, row INTEGER rank, colrank, rowrank INTEGER err CALL MPI_INIT(ierr) CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank); row = rank/4 col = rank-row*4 DO J=1,250    DO I=1,250       A(I,J)=1.0       B(I,J)=2.0    END DO END DO CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, row, rank, rowcom, err) CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, col, rank, colcom, err) CALL MPI_ALLGATHER(A, 40000, MPI_REAL, Arows, 62500, &MPI_REAL, rowcom, err) CALL MPI_ALLGATHER(B, 40000, MPI_REAL, Bcols, 62500, &MPI_REAL, colcom, err) DO J=1,250    DO I=1,250       C(I,J)=0.0       ind1=1       ind2=J       DO K=1,1000          C(I,J)=C(I,J)+Arows(I,K)*Bcols           (ind1,ind2)          IF(ind1.LT.250) THEN             ind1=ind1+1          ELSE             ind1=1             ind2=ind2+250          END IF       END DO    END DO END DO CALL MPI_COMM_FREE(rowcom, err) CALL MPI_COMM_FREE(colcom, err) CALL MPI_FINALIZE(err) END

The preceding code is in Fortran 77 with calls to MPI routines. It is supposed to be executed by all 16 processes making up the parallel program. Each process locally contains one 250 X 250 block of global arrays A, B, and C of the source HPF program. A logical 4 X 4 process grid is formed from the 16 participating processes, and each process gets its coordinates row and col in the grid. In order to compute its block of the resulting matrix C, the process needs blocks of matrix A from its horizontal neighbours in the 4 X 4 process grid, and blocks of matrix B from its vertical neighbors (see Figure 4.3). The necessary communication is achieved as we show next.

First, communicators are created for horizontal communication, one for each row of the 4 X 4 process grid. The MPI_COMM_SPLIT routine performs this operation, returning the new communicators in rowcom. Then a call to MPI_COMM_SPLIT creates communicators for columns of the process grid and returns them in colcom.

click to expand
Figure 4.6: Storage schemes used in the target message-passing program to store a row of blocks of matrix A and a column of blocks of matrix B in local arrays Arows and Bcols, respectively.

Now all processes call the MPI_ALLGATHER routine supplying rowcom as a communicator argument. This call performs, in parallel, four collective gather-to-all communication operations, each on its row of the process grid. As a result each process has in its local array Arows a copy of the corresponding row of blocks of matrix A.

The next call to MPI_ALLGATHER with colcom as a communicator argument also performs, in parallel, four collective gather-to-all communication operations but on columns of the process grid. During the execution of this call, each process gathers in local array Bcols copies of blocks of the corresponding column of blocks of matrix B. Note that array Bcols stores the blocks horizontally, as a row of blocks, and not in the form of columns of blocks as they are normally stored (see Figure 4.6). Therefore the inner DO loop, which computes an element of the resulting matrix C, must recalculate indexes of Bcols elements to make successive iterations of the loop refer to successive elements of the corresponding column of matrix B.

The main optimization performed by an HPF compiler is the minimiza-_tion of the cost of the interprocessor communication. This is not a trivial problem. It requires full analysis of both the source code and the executing MPP. HPF provides no specific constructs nor directives to help the compiler solve the problem. This is why HPF is considered a difficult language to compile.

For example, many real HPF compilers (i.e., the ADAPTOR HPF compiler from GMD) will translate the HPF program above into a message-passing program, whose processes each send its blocks of matrices A and B to all the other processes. That straightforward communication scheme guarantees that each process receives all of the elements of global arrays A and B that it needs to compute its elements of global array C. However, in many cases, including ours, this universal scheme involves a good deal of redundant communications, sending and receiving data that are never used in computation. The better a compiler is, the more accurate communication patterns it generates to avoid redundant communications as much as possible. The message-passing program above is generated by an imaginary clever HPF compiler and performs no redundant communication. Each process of the program sends its blocks of matrices A and B only to 3 other processes, and not to 15 as each process of the straightforward program does.

The PROCESSORS, DISTRIBUTE, and INDEPENDENT directives make up the core of HPF. There are two more HPF directives introduced mainly to facilitate the specification of the coordinated distribution of a group of interrelated arrays and other data objects. These two are the TEMPLATE and ALIGN data distribution directives. The directives provide a two-level mapping of data objects to abstract processors. Data objects are first aligned relative to some template, which is simply an abstract space of indexed positions (an array of nothings). The template is then distributed with the DISTRIBUTE directive implying that all array elements and scalar objects aligned with the same position in the template will be mapped to the same abstract processor. For example, the code fragment

   REAL, DIMENSION(10000,10000) :: NW, NE, SW, SE !HPF$ TEMPLATE EARTH(10001,10001) !HPF$ ALIGN NW(I,J) WITH EARTH(I,J) !HPF$ ALIGN NE(I,J) WITH EARTH(I,J+1) !HPF$ ALIGN SW(I,J) WITH EARTH(I+1,J) !HPF$ ALIGN SE(I,J) WITH EARTH(I+1,J+1) !HPF$ DISTRIBUTE EARTH(BLOCK, BLOCK)

aligns four 10000 X 10000 arrays NW, NE, SW, and SE to the four corners of template EARTH of size 10001 X 10001.

The presented model of data parallel programming MPPs, which is supported by the PROCESSORS, TEMPLATE, ALIGN, DISTRIBUTE, and INDEPENDENT directives, is the fundamental model provided by HPF 1.1. HPF 2.0 extends this model in three directions.

The first set of extensions provides the programmer with greater control over the mapping of the data. These include directives for dynamic remapping of data: DYNAMIC, REDISTRIBUTE, and REALIGN. The DYNAMIC directive

!HPF$  DYNAMIC alignee-or-distributee-list

specifies that objects in alignee-or-distributee-list may be dynamically realigned or redistributed.

The REDISTRIBUTE directive is similar to the DISTRIBUTE directive but is considered executable. An array or template may be redistributed at any time if it has been declared DYNAMIC. Any other objects currently ultimately aligned with the array (or template) when it is redistributed are also remapped to preserve alignment relationships in the new distribution.

The REALIGN directive is similar to the ALIGN directive but is considered executable. An array or template may be realigned at any time if it has been declared DYNAMIC. Unlike redistribution, realigning a data object does not cause any other object to be remapped. However, realignment of even a single object, if it is large, can require a lot of computational and communication effort at runtime.

The ONTO clause used in the DISTRIBUTE directive is extended to allow direct distribution to subsets of processors. Explicit mapping of pointers and components of derived types are also introduced.

The second set of extensions allows the programmer to provide the compiler with information useful for generating efficient code. The programmer can use the RANGE directive to specify the range of distributions that a dynamically distributed array or a pointer may have. The SHADOW directive allows the programmer to specify the amount of additional space required on a processor to accommodate nonlocal elements in a nearest-neighbor computation.

The third set of extensions provides some basic support for task parallelism, that is, parallelism implied by computation partitioning rather than data partitioning. These include the ON directive, the RESIDENT directive, and the TASK_REGION construct. The ON directive partitions computations among the processors of a parallel machine (much as the DISTRIBUTE directive partitions the data among the processors). The RESIDENT directive asserts that certain data access do not require interprocessor data movement for their implementation. The TASK_REGION construct provides the means to create independent coarse-grain tasks, each of which can itself execute a data-parallel (or nested task parallel) computation. These extensions make HPF a multiparadigm parallel programming language rather than a pure data parallel programming language as it was originally designed.