12.1 Dense Linear System Solving

The problem in solving linear systems is: Given a square matrix A and a vector b, find a vector x such that Ax = b. In the most general case, the matrix is stored as a distributed array, and the system is solved by Gaussian elimination. This is the basic algorithm in ScaLAPACK and PLAPACK, the two packages we discuss for solving a linear system with a distributed dense coefficient matrix. (Sparse systems are discussed in Section 12.2.)

On a single processor, the algorithm for dense linear system solving is fairly obvious, although a good deal of optimization is needed for high performance (see Section 12.6.) In a distributed context, achieving high performance—especially performance that scales up with increasing processor numbers—requires radical rethinking about the basic data structures. Both ScaLAPACK and PLAPACK use block-cyclic data distributions. In Section 11.2.1, we focus on how to specify data in this distribution in ScaLAPACK, since it is the more widely used package; we then briefly compare PLAPACK's calling style.

12.1.1 ScaLAPACK

ScaLAPACK is a parallel version of LAPACK, both in function and in software design. Like the earlier package, ScaLAPACK targets linear system solution and eigenvalue calculation for dense and banded matrices. Note that, while sparse matrices are often of banded form, use of the band storage is usually not an efficient way of dealing with sparse systems; other software packages are better suited to that. In particular, one should use SuperLU (section 12.2.1) for sparse linear systems and Arpack (section 12.3.2) for eigenvalue computations.

In a way, ScaLAPACK is the culmination of a line of linear algebra packages that started with LINPACK and EISPACK. The coding of those packages was fairly straightforward, using at most Basic Linear Algebra Subprograms (BLAS) Level-1 operations as an abstraction level. LAPACK [4, 63] attains high efficiency on a single processor (or a small number of shared-memory processors) through the introduction of blocked algorithms and the concomitant use of BLAS Level-3 operations. ScaLAPACK uses these blocked algorithms in a parallel context to attain scalably high performance on parallel computers.

The seemingly contradictory demands of portability and efficiency are realized in ScaLAPACK through confining the relevant parts of the code to two subroutine libraries: the BLAS for the computational kernels and the BLACS (Basic Linear Algebra Communication Subprograms) for communication kernels. While the BLACS come with ScaLAPACK, the user is to supply the BLAS library; see Section 12.6.

ScaLAPACK is written in Fortran, as are the examples in this section. The distribution has no C prototypes, but interfacing to a C program is simple, observing the usual name conversion conventions.

ScaLAPACK Parallel Initialization

ScaLAPACK relies for its communications on the BLACS, (Basic Linear Algebra Communication Subprograms) (Basic Linear Algebra Communication Subprograms) which offers an abstraction layer over MPI. Its main feature is the ability to communicate submatrices, rather than arrays, and of both rectangular and trapezoidal shape. The latter is of obvious value in factorization algorithms. We will not go into the details of the BLACS here; instead, we focus on the aspects that come into play in the program initialization phase.

Suppose you have divided your cluster into an approximately square grid of nprows by npcols processors. The following two calls set up a BLACS processor grid - its handle is returned as ictxt - and return the current processor number (by row and column) in it:

       call sl_init(ictxt,nprows,npcols)       call blacs_gridinfo(ictxt,nprows,npcols,myprow,mypcol)

Correspondingly, at the end of your code you need to release the grid by

       call blacs_gridexit(ictxt)

ScaLAPACK Data Format

Creating a matrix in ScaLAPACK is, unfortunately, not simple, even though none of the indirect addressing problems of sparse storage concern us here. The difficulty lies in the fact that for scalably high performance on factorization algorithms, a storage mode called "two-dimensional block-cyclic" storage is used. The blocking is what enables the use of BLAS Level-3 routines; the cyclic storage is needed for scalable parallelism.

Specifically, the block-cyclic storage implies that a global (i, j) coordinate in the matrix gets mapped to a triplet of (p, l, x) for both the i and the j directions, where p is the processor number, l the block, and x the offset inside the block.

The block size has to be decided by the user; 64 is usually a safe bet. For generality, let us assume that block sizes bs_i and bs_j have been chosen. First we determine how much storage is needed for the local part of the matrix:

       mlocal = numroc(mglobal,bs_i,myprow,0,nprows)       nlocal = numroc(nglobal,bs_j,mypcol,0,npcols)

where numroc is a library function. (The m and n sizes of the matrix need not be equal, since ScaLAPACK also has routines for QR factorization and such.)

Filling in a matrix requires the conversion from (i, j) coordinates to (p, l, x) coordinates. It is best to use conversion functions

       p_of_i(i,bs,p) = mod(int((i-1)/bs),p)       l_of_i(i,bs,p) = int((i-1)/(p*bs))       x_of_i(i,bs,p) = mod(i-1,bs)+1

that take i or j as input, as well as the block size and the number of processors in that direction. The global matrix element (i, j) is then mapped to

        pi = p_of_i(i,bs_i,nprows)        li = l_of_i(i,bs_i,nprows)        xi = x_of_i(i,bs_i,nprows)        pj = p_of_i(j,bs_j,npcols)        lj = l_of_i(j,bs_j,npcols)        xj = x_of_i(j,bs_j,npcols)        mat(li*bs_i+xi,lj*bs_j+xj) = mat_global(i,j)

if the current processor is (p_i, p_j).

Calling ScaLAPACK Routines

ScaLAPACK routines adhere to the LAPACK naming scheme: PXYYZZZ, where P indicates parallel; X is the "precision," meaning single or double, real or complex; YY is the shape, with GE for rectangular and TR for triangular; and ZZZ denotes the function.

For most functions there is a "simple driver" (for instance, SV for system solving), which makes the routine name in our example PDGESV for double precision, as well as an "expert driver," which has X attached to the name, PDGESVX in this example. The expert driver usually has more input options and usually returns more diagnostic information.

In the call to a ScaLAPACK routine, information about the matrix has to be passed by way of a descriptor:

       integer desca(9)       call descinit(desca,      >     mglobal,nglobal, bs_i,bs_j, 0,0,ictxt,lda,ierr)       call pdgesv(nglobal,1, mat_local,1,1, desca,ipiv,      >     rhs_local,1,1, descb, ierr)

where lda>mlocal is the allocated first dimension of a.

Linear System Solution Routines

ScaLAPACK linear solver routines support dense and banded matrices. The drivers for solving a linear system are PxyySV, where yy=GE or GB for dense and band, respectively. We do not discuss here other cases such as positive definite band, nor do we discuss band matrices, which are stored by using a variant of the scheme described above. The reader is referred to the ScaLAPACK Users' Guide [15] for details. The input matrix A of the system is on output overwritten with the LU factorization, and the right-hand side B is overwritten with the solution. Temporary storage is needed only for the (integer) pivot locations.

12.1.2 PLAPACK

PLAPACK [86] is a package with functionality similar to that of ScaLACK but with a different calling style. It also relies on optimized BLAS routines and is therefore able to achieve a high performance. Whereas ScaLAPACK uses a calling style that is similar to Fortran, to stay close to its LAPACK roots PLAPACK uses a more object-oriented style. Its interface is similar in philosophy to that of the PETSc package (discussed later in this chapter).

As an illustration of this object-oriented handling of matrices and vectors, here are matrix-vector multiply and triangular system solve calls:

   PLA_Gemv( PLA_NO_TRANS, one, A, x, zero, b );   PLA_Trsv( PLA_LOWER_TRIANGULAR, PLA_NO_TRANSPOSE,             PLA_UNIT_DIAG, A, b );

The distribution of the matrix over the processors is induced as a "distribution template" declared by the user and is passed to the matrix creation call:

   PLA_Matrix_create(  datatype, size, size,                       templ, PLA_ALIGN_FIRST, PLA_ALIGN_FIRST, &A );

PLAPACK wins over ScaLAPACK in user-friendliness in filling in the matrix. As in PETSc, matrix elements can be specified anywhere; and instead of being written directly into the data structure, they are passed by a PLA_API_axpy_matrix_to_global call. On the other hand, PLAPACK lacks ScaLAPACK's sophistication of simple and expert drivers and pays less attention to the issue of numerical stability.