9.10 Measuring MPI Performance

Many tools have been developed for measuring performance. The best test is always your own application, but a number of tests are available that can give a more general overview of the performance of MPI on a cluster. Measuring communication performance is actually quite tricky; see [51] for a discussion of some of the issues in making reproducible measurements of performance. That paper describes the methods used in the mpptest program for measuring MPI performance.

9.10.1 mpptest

The mpptest program allows you to measure many aspects of the performance of any MPI implementation. The most common MPI performance test is the Ping-Pong test; this test measures the time it takes to send a message from one process to another and then back. The mpptest program provides Ping-Pong tests for the different MPI communication modes, as well as a variety of tests for collective operations and for more realistic variations on point-to-point communication, such as halo communication (like that in Section 8.3) and communication that does not reuse the same memory locations (and thus does not benefit from data that is already in the memory cache). The mpptest program can also test the performance of some MPI-2 functions, including MPI_Put and MPI_Get.
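To illustrate what a Ping-Pong test measures, here is a minimal timing sketch (not the mpptest code itself); the message size and repetition count are arbitrary choices:

 /* pingpong.c -- minimal Ping-Pong timing sketch; not part of mpptest */
 #include <stdio.h>
 #include "mpi.h"

 #define LEN  1024      /* message size in bytes (arbitrary) */
 #define REPS 1000      /* number of round trips (arbitrary) */

 int main( int argc, char *argv[] )
 {
     char       buf[LEN];
     int        rank, i;
     double     t;
     MPI_Status status;

     MPI_Init( &argc, &argv );
     MPI_Comm_rank( MPI_COMM_WORLD, &rank );

     MPI_Barrier( MPI_COMM_WORLD );
     t = MPI_Wtime();
     for (i = 0; i < REPS; i++) {
         if (rank == 0) {
             MPI_Send( buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD );
             MPI_Recv( buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status );
         }
         else if (rank == 1) {
             MPI_Recv( buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status );
             MPI_Send( buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD );
         }
     }
     t = ( MPI_Wtime() - t ) / REPS;    /* average round-trip time */
     if (rank == 0)
         printf( "Average round trip for %d bytes: %g seconds\n", LEN, t );

     MPI_Finalize();
     return 0;
 }

Run this with mpiexec -n 2; half of the reported round-trip time is a rough estimate of the one-way latency at this message size. mpptest refines this basic idea with careful choices of repetition counts, message sizes, and buffer placement, as discussed in [51].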

Using mpptest

The mpptest program is distributed with MPICH and MPICH2 in the directory 'examples/perftest'. You can also download it separately from www.mcs.anl.gov/mpi/perftest. Building and using mpptest is very simple:

 % tar zxf perftest.tar.gz
 % cd perftest-1.2.1
 % ./configure --with-mpich
 % make
 % mpiexec -n 2 ./mpptest -logscale
 % mpiexec -n 16 ./mpptest -bisect
 % mpiexec -n 2 ./mpptest -auto

To run with LAM/MPI, simply configure with the option --with-lammpi. The 'README' file contains instructions for building with other MPI implementations.

9.10.2 SKaMPI

The SKaMPI test suite [94] is a comprehensive test of MPI performance, covering virtually all of the MPI-1 communication functions.

One interesting feature of the SKaMPI benchmarks is the online tables showing the performance of MPI implementations on various parallel computers, ranging from Beowulf clusters to parallel vector supercomputers.

9.10.3 High Performance LINPACK

Perhaps the best-known benchmark in technical computing is the LINPACK benchmark. The version of this benchmark that is appropriate for clusters is the High Performance LINPACK (HPL). Obtaining and running this benchmark are relatively easy, though getting good performance can require a significant amount of effort. In addition, while the LINPACK benchmark is widely known, it tends to significantly overestimate the achievable performance for many applications because it involves n³ computation on n² data and is thus relatively insensitive to the performance of the node memory system.
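To make the n³-versus-n² point concrete: the LU factorization at the heart of LINPACK performs roughly 2/3 n³ floating-point operations on about n² matrix entries, so the ratio of computation to data grows linearly with n. The following sketch works through the arithmetic; the operation count is the standard approximation for LU factorization, and the timing value is invented purely for illustration:

 /* lucount.c -- rough LINPACK operation-count arithmetic (illustrative only) */
 #include <stdio.h>

 int main( void )
 {
     double n     = 10000.0;   /* example problem size */
     double time  = 600.0;     /* invented wall-clock time, in seconds */
     double flops = (2.0/3.0)*n*n*n + 2.0*n*n;  /* approximate LU operation count */
     double words = n*n;                        /* matrix entries held in memory */

     printf( "floating-point operations: %.3e\n", flops );
     printf( "matrix entries           : %.3e\n", words );
     printf( "operations per entry     : %.1f\n", flops/words );
     printf( "Gflop/s for %.0f seconds : %.2f\n", time, flops/time/1.0e9 );
     return 0;
 }

For n = 10,000 this works out to roughly 6,700 operations per matrix entry, which is why LINPACK stresses the floating-point units far more than the memory system.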

The HPL benchmark depends on another library, the basic linear algebra subroutines (BLAS), for much of the computation. Thus, to get good performance on the HPL benchmark, you must have a high-quality implementation of the BLAS. Fortunately, several sources of these routines are available. You can often get implementations of the BLAS from the CPU vendor directly, sometimes at no cost. Another possibility is to use the ATLAS implementation of the BLAS.

ATLAS

ATLAS is available from math-atlas.sourceforge.net. If prebuilt binaries fit your system, you should use those. Note that ATLAS is tuned for specific system characteristics including clock speed and cache sizes; if you have any doubts about whether your configuration matches that of a prebuilt version, you should build ATLAS yourself.

To build ATLAS, first download ATLAS from the Web site and then extract it. This will create an 'ATLAS' directory into which the libraries will be built, so extract it where you want the libraries to reside. Building in a directory on a local disk (such as '/tmp') rather than on an NFS-mounted disk can help speed up the ATLAS build.

 % cd /tmp
 % tar zxf atlas3.4.1.tgz
 % cd ATLAS

Check the 'errata.html' file at math-atlas.sourceforge.net/errata.html for updates. You may need to edit various files (no patches are supplied for ATLAS). Pay particular attention to the items that describe various possible ways that the install step may fail; you may choose to update values such as ATL_nkflop before running ATLAS. Next, have ATLAS configure itself. Select a compiler; note that you should not use the Portland Group compiler here.

 % make config CC=gcc 

Answer yes to most questions, including threaded and express setup, and accept the suggested architecture name. Next, make ATLAS. Here, we assume that the architecture name was Linux-PIIISSE2:

 % make install arch=Linux-PIIISSE2 >&make.log 

Note that this is not an "install" in the usual sense; the ATLAS libraries are not copied to '/usr/local/lib' and the like by the install. This step may take as long as several hours, unless ATLAS finds a precomputed set of parameters that fits your machine. ATLAS is also sensitive to variations in runtimes, so try to use a machine that has no other users. Make sure that it is the exact same type of machine as your nodes (e.g., if you have login nodes that are different from your compute nodes, make sure that you run ATLAS on the compute nodes).

At the end of the "make install" step, the BLAS are in 'ATLAS/lib/Linux-PIIISSE2'. You are ready for the next step.
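Before moving on, it may be worth a quick check that programs link and run against the newly built library. The following sketch multiplies two small matrices through the CBLAS interface; the include and library paths in the compile line are the example locations used above (with the 'cblas.h' header assumed to be under 'ATLAS/include') and should be adjusted to your own build:

 /* blascheck.c -- quick check that the ATLAS CBLAS library links and runs */
 #include <stdio.h>
 #include <cblas.h>

 int main( void )
 {
     double a[4] = { 1.0, 2.0, 3.0, 4.0 };   /* 2x2 matrices, row-major order */
     double b[4] = { 5.0, 6.0, 7.0, 8.0 };
     double c[4] = { 0.0, 0.0, 0.0, 0.0 };

     /* c = 1.0 * a * b + 0.0 * c */
     cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2 );

     printf( "c = %g %g %g %g (expect 19 22 43 50)\n", c[0], c[1], c[2], c[3] );
     return 0;
 }

 % gcc blascheck.c -I/tmp/ATLAS/include -L/tmp/ATLAS/lib/Linux-PIIISSE2 -lcblas -latlas -lm
 % ./a.out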

HPL

Download and unpack the HPL package from www.netlib.org/benchmark/hpl:

 % tar zxf hpl.tgz
 % cd hpl

Create a 'Make.<archname>' file in the 'hpl' directory. Consider an archname like Linux_PIII_CBLAS_gm for a Linux system on Pentium III processors, using the C version of the BLAS constructed by ATLAS, and using the gm device from the MPICH implementation of MPI. To create this file, look at the samples in the 'hpl/setup' directory, for example,

 % cp setup/Make.Linux_PII_CBLAS_gm Make.Linux_PIII_CBLAS_gm 

Edit this file, changing ARCH to the name you selected (e.g., Linux_PIII_CBLAS_gm), and set LAdir to the location of the ATLAS libraries.
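As a rough guide only, the lines of interest in 'Make.Linux_PIII_CBLAS_gm' end up looking something like the following sketch; the paths are examples and must be changed to match your own ATLAS build and MPI installation:

 ARCH   = Linux_PIII_CBLAS_gm
 # Location of the ATLAS build from the previous section (example path)
 LAdir  = /tmp/ATLAS/lib/Linux-PIIISSE2
 LAinc  =
 LAlib  = $(LAdir)/libcblas.a $(LAdir)/libatlas.a

Then do the following: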

 % make arch=<thename>
 % cd bin/<thename>
 % mpiexec -n 4 ./xhpl

Check the output to make sure that you have the right answer. The file 'HPL.dat' controls the actual test parameters. The version of 'HPL.dat' that comes with the hpl package is appropriate for testing hpl; running hpl for performance requires modifying 'HPL.dat'. The file 'hpl/TUNING' contains some hints on setting the values in this file for performance. Here are a few of the most important (a sketch of the corresponding 'HPL.dat' lines follows the list):

  1. Change the problem size to a large value. Don't make it too large, however, since the total computational work grows as the cube of the problem size (doubling the problem size increases the amount of work by a factor of eight). Problem sizes of around 5,000–10,000 are reasonable.

  2. Change the block size to a modest size. A block size of around 64 is a good place to start.

  3. Change the processor decomposition and number of nodes to match your configuration. In most cases, you should try to keep the decomposition close to square (e.g., P and Q should be about the same value), with P ≤ Q.

  4. Experiment with different values for RFACT and PFACT. On some systems, these parameters can have a significant effect on performance. For one large cluster, setting both to the right-looking variant ("Right") was preferable.
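For reference, the lines of 'HPL.dat' that correspond to these hints look roughly like the following excerpt; the file contains many more lines, and the values shown are only illustrative starting points, not tuned settings:

 5000         Ns      (problem size N)
 64           NBs     (block size)
 2            Ps      (process-grid rows P)
 2            Qs      (process-grid columns Q)
 2            PFACTs  (0=left, 1=Crout, 2=Right)
 2            RFACTs  (0=left, 1=Crout, 2=Right)

Here the value 2 for PFACTs and RFACTs selects the right-looking variant mentioned in item 4.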



