9 Programming with MPI: A Detailed Example
As discussed in the previous chapter, MPI presents the programmer with a distributed address space model. The programmer must therefore partition or decompose the data so that individual computational objects are assigned to individual processors. This decomposition can be either static, i.e., fixed once and for all, or dynamic, i.e., changing in response to the simulation or the system on which it is running.
In the CA example in Chapter 8 the state of the system was described by a vector of integers. We partitioned that vector over a set of MPI processes so that each process was responsible for a contiguous block of data. The important feature of that partitioning was that it minimized communication: the only data that needed to be shared between processes were the endpoints of each block.
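The following sketch illustrates that partitioning in C. It is not the book's actual code: the names (N, state, local_n) are illustrative, and it assumes the global vector length divides evenly among the processes. Each rank owns a contiguous slice plus one ghost cell at each end, and exchanges only its endpoints with its left and right neighbors.

#include <mpi.h>
#include <stdlib.h>

#define N 1024  /* global number of cells (assumed divisible by size) */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;             /* cells owned by this rank */
    /* local_n interior cells plus one ghost cell at each end */
    int *state = calloc(local_n + 2, sizeof(int));

    /* MPI_PROC_NULL turns the boundary exchanges into no-ops */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Exchange endpoints: send my first owned cell left while
       receiving my right ghost cell from the right neighbor ... */
    MPI_Sendrecv(&state[1],           1, MPI_INT, left,  0,
                 &state[local_n + 1], 1, MPI_INT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... then send my last owned cell right while receiving my
       left ghost cell from the left neighbor. */
    MPI_Sendrecv(&state[local_n],     1, MPI_INT, right, 1,
                 &state[0],           1, MPI_INT, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... update the local cells using the ghost values ... */

    free(state);
    MPI_Finalize();
    return 0;
}

Because only two integers cross each inter-process boundary per step, communication stays constant while computation grows with the block size, which is exactly the favorable ratio the design steps below call for.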
In this chapter we give a detailed example of an approach to implementing high-performance, parallel applications with MPI. The basic approach to designing a good parallel application is as follows:

1. Choose an algorithm with sufficient parallelism and a favorable ratio of communication to computation.

2. Optimize a sequential version of the algorithm. This may mean finding library implementations of key steps; for example, well-tested and extremely fast implementations of Fourier transforms[1] and linear algebra[2] are widely available. High-quality software archives[3] are an excellent starting place for locating both sequential and parallel implementations of key algorithmic components.

3. Implement the parallel version using the simplest possible MPI operations, usually blocking, standard-mode procedures.

4. Profile and analyze the program to find which operations or activities take the most time (a minimal timing sketch follows this list).

5. Attack the most time-consuming components, and be selective. Remember the corollary to Amdahl's law: if a component accounts for 10% of the run time, then no matter how much you optimize it, the total run time can shrink by at most 10%, so the overall speedup can never exceed 1/(1 - 0.1), or about 1.11.
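As a minimal illustration of the profiling step, the sketch below brackets each phase of a time-step loop with MPI_Wtime(). The phase structure (a compute phase and a communication phase over 100 steps) is an assumption for illustration, not taken from the book; in practice a profiling tool can serve the same purpose.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t_compute = 0.0, t_comm = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        t0 = MPI_Wtime();
        /* ... local computation on this rank's block ... */
        t_compute += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        /* ... blocking sends/receives with neighbors ... */
        t_comm += MPI_Wtime() - t0;
    }

    /* Per-rank totals show which phase dominates and whether
       the load is balanced across processes. */
    printf("rank %d: compute %.3f s, communicate %.3f s\n",
           rank, t_compute, t_comm);
    MPI_Finalize();
    return 0;
}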
Experience has shown that the most effective way to improve the performance of a parallel program is to improve the single-node performance. This is at odds with a
[1] The FFTW library: http://theory.lcs.mit.edu/~fftw/ ; RPM at ftp://contrib.redhat.com/libc6/i386/fftw-2.0.1-3.i386.rpm
[2] LAPACK home page: http://www.netlib.org/lapack/ ; RPM at ftp://rhcn.redhat.com/pub/rhcn/RPMS/i386/lapack-2.0-12.i386.rpm
[3] http://netlib.org/ and http://www.nhse.org