7.6 Benchmarks


In this section, we discuss industry standard benchmarks performed on pSeries running Linux.

7.6.1 System configurations

Hardware

Eight p655 nodes (lp01-08) interconnected by the Myrinet 2000 switch. Each node has eight 1.1 GHz POWER4 processors, private 32 KB L1 Data and 64 KB L1 Instruction cache per processor, 4x1.44 MB L2 cache, 32 GB memory, two Myrinet D-cards.

Two of the nodes (lp07, lp08) have direct peer-to-peer Gigabit Ethernet connection. The schematic diagram is shown in Figure 7-2 on page 318.

Figure 7-2. Node configuration for HPC case study

graphics/07fig02.gif

Software
  • SuSE SLES 8 (ppc) Linux version 2.4.19-ul1-ppc64-SMP

  • GCC 3.2

  • IBM XLF 8.1 for Linux on pSeries

  • IBM VACPP 6.0 for Linux on pSeries

  • Myrinet GM 2.06, MPICH_GM 1.2.5..10

  • ANL (Argonne National Lab) MPICH 1.2.5

  • IBM ESSL 4.1 for Linux on pSeries

7.6.2 Linpack DP n=100 Benchmark

For further details on this study, refer to the following site:

http://www.netlib.org/benchmark/linpackd

Linpack DP (Double Precision) is one of three Linpack benchmarks (also see 7.6.3, "Linpack TPP" on page 320 and 7.6.4, "Linpack HPC" on page 322) that measure the number of floating point operations per second. It solves a system of linear equations:

Ax = b


where A is a dense matrix of order 100, and x and b are vectors of length 100. The benchmark is serial source code written in double-precision Fortran. The source code is not allowed to be changed, not even the comments. However, a function subroutine second(), which returns the elapsed time in seconds, has to be provided by the user.

This is a very small problem for modern computers and does not reflect the best performance a modern computer can deliver. Nevertheless, it measures, to some extent, the performance of processors and compilers, and largely for historical reasons it still appears in some customer benchmark requests.

In our measurement, we replaced the function subroutine second() with the XLF run-time function rtc() and used the following commands to compile and run the code:

 xlf -O3 -qhot -qarch=pwr4 -qtune=pwr4 -o linpackd linpackd.f
 time ./linpackd

Here is the output file of the program:

Example 7-9. Program output
 Please send the results of this run to:

     Jack J. Dongarra
     Computer Science Department
     University of Tennessee
     Knoxville, Tennessee 37996-1300

     Fax: 865-974-8296
     Internet: dongarra@cs.utk.edu

    norm. resid       resid            machep          x(1)            x(n)
   1.23217369E+00   1.36796649E-14   2.22044605E-16   1.00000000E+00  1.00000000E+00

     times are reported for matrices of order 100
      dgefa      dgesl      total      mflops     unit       ratio
   times for array with leading dimension of 201
   1.010E-03  3.695E-05  1.047E-03  6.558E+02  3.050E-03  1.870E-02
   9.960E-04  3.695E-05  1.033E-03  6.648E+02  3.009E-03  1.845E-02
   9.960E-04  3.695E-05  1.033E-03  6.648E+02  3.009E-03  1.845E-02
   9.996E-04  3.591E-05  1.036E-03  6.631E+02  3.016E-03  1.849E-02
   times for array with leading dimension of 200
   1.012E-03  3.600E-05  1.048E-03  6.552E+02  3.052E-03  1.871E-02
   1.007E-03  3.600E-05  1.043E-03  6.584E+02  3.038E-03  1.862E-02
   1.013E-03  3.695E-05  1.050E-03  6.540E+02  3.058E-03  1.875E-02
   1.008E-03  3.619E-05  1.044E-03  6.575E+02  3.042E-03  1.865E-02
   end of tests -- this version dated 10/12/92

The resulting MFLOPS, along with the time information returned by the Linux time command, is shown in Table 7-4.

Table 7-4. Linpack DP n=100 result

Real time (s)    User time (s)    Sys time (s)    MFLOPS

0.061            0.060            0.000           664.8

The theoretical peak performance is 4400 MFLOPS, far above the value measured here, because the problem is too small.

7.6.3 Linpack TPP

For further details on this study, refer to the following site:

http://www.netlib.org/benchmark/1000d

Linpack TPP (Toward Peak Performance), the second Linpack benchmark, is for a matrix of order 1000. It allows vendors to implement their own linear equation solver in any language, as long as the implementation computes a solution and the result satisfies the prescribed accuracy.

We used the ESSL subroutines dgef and dges to replace the original dgefa and dgesl, respectively. These two ESSL subroutines have SMP versions in the ESSLSMP library, which contains code parallelized to use all available processors within a node. The following command compiles the code:

 xlf_r -O3 -qarch=pwr4 -qtune=pwr4 -qsmp=noauto -lesslsmp \
     -o linpackt linpackt.f

To run (four threads):

 nt=4
 export XLSMPOPTS="parthds=$nt:spins=0:yields=0"
 export OMP_NUM_THREADS=$nt
 ./linpackt > $nt.out
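To collect results for several thread counts (for example 1, 2, 4, and 8) in a single pass, a loop such as the following can be used; this is just a sketch that follows the output file naming convention above:

 for nt in 1 2 4 8; do
     export XLSMPOPTS="parthds=$nt:spins=0:yields=0"
     export OMP_NUM_THREADS=$nt
     ./linpackt > $nt.out
 done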

The resulting GFLOPS are shown in Figure 7-3, where we also include available AIX results, which are published at:

http://www-1.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf

Figure 7-3. Linpack TPP result of p655 running Linux and AIX

graphics/07fig03.gif

This is not an exact side-by-side comparison, since many factors, such as memory configuration, large page support, and compiler options, may differ. Nevertheless, it can be seen that Linux performs almost the same as AIX.

Theoretical peak GFLOPS for a 1.1 GHz p655 are 4.4, 8.8, and 17.6 for one, two, and four processors, respectively. By comparing the actual numbers in Figure 7-3 with these peak numbers, it can be seen that the problem size under consideration (n=1000) is still too small to show the best performance. To see how many GFLOPS can be obtained, we varied the problem size from 1100 to 30000 in increments of 100. The results for four-way runs are shown in Figure 7-4 on page 322.

Figure 7-4. Four-way Linpack NxN performance as N varies

graphics/07fig04.gif

It can be seen from Figure 7-4 that as N (the size of the matrix) increases, we get better performance overall. After around N=15000, the performance saturates at about 11 GFLOPS. Actually, the best number is 10.91 GFLOPS at N=16800. This is much better than Linpack TPP (N=1000), which is 7.549 GFLOPS in Figure 7-3 on page 321 (the four-way result).

Another interesting observation is that the performance fluctuates even in the saturated region, so it is very important to sample data points densely to find the best result.

7.6.4 Linpack HPC

For further details on this study, refer to the following site:

http://www.netlib.org/benchmark/hpl/

Linpack HPC (Highly Parallel Computing), also known as HPL (High Performance Linpack), is the benchmark used for the world's top 500 supercomputers list (TOP500). Unlike Linpack DP and Linpack TPP, which work only on shared memory systems, Linpack HPC works on distributed memory systems as well. Because of this, Linpack HPC requires an implementation of MPI (version 1.1 compliant). Another prerequisite is either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL).

About BLAS

BLAS consists of three levels.

  • Level 1 performs vector-vector operations.

  • Level 2 performs matrix-vector operations.

  • Level 3 performs matrix-matrix operations.

BLAS is available from different sources.

  • At the netlib Web site, source code and some binary builds are available for downloading:

    http://www.netlib.org/blas/

    Performance is usually lacking, compared with vendor-provided or specially optimized ones.

  • The Automatically Tuned Linear Algebra Software (ATLAS) package contains BLAS and some routines from Linear Algebra PACKage (LAPACK). ATLAS uses an empirical method to tune performance. After extensive tuning, it can achieve very high performance. ATLAS is maintained at:

    http://math-atlas.sourceforge.net/

  • The GOTO library, a high-performance BLAS library developed by Kazushige Goto, is available at:

    http://www.cs.utexas.edu/users/flame/goto/

    Only binaries for some platforms are available. This is definitely a site to check if you do not have a tuned BLAS.

  • Vendor-provided: many hardware vendors (and some compiler providers) offer highly tuned BLAS libraries for their products. For IBM, it comes with the ESSL libraries.

Important

There may be more than one copy of a BLAS library on your system and they may be from different sources. Therefore, avoid putting -lblas in your makefile without knowing which BLAS it is linking to.

For ESSL users, it is safe to replace any appearance of -lblas with -lessl. If you intend to use the multi-threaded version, link with -lesslsmp instead.
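If you are unsure which BLAS an executable actually picked up, checks like the following can help. This is a sketch; the xhpl binary is just an example, and on some systems ldconfig lives in /sbin:

 # List the BLAS libraries known to the dynamic linker
 /sbin/ldconfig -p | grep -i blas
 # Show which BLAS/ESSL shared libraries an already-built binary resolved to
 ldd ./xhpl | grep -i -e blas -e essl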


About MPI

Three implementations of MPI are frequently used on Linux clusters:

  • Local Area Multicomputer (LAM) MPI includes a full implementation of the MPI-1 standard, as well as many elements of the MPI-2 standard. It often comes with Linux installation CDs as optional packages. LAM/MPI supports the following communications.

    - Native Myrinet message passing through GM

    - Asynchronous UDP-based message passing

    - Message passing using TCP

    - Message passing using shared memory and TCP

    LAM MPI is available from:

    http://www.lam-mpi.org/

  • ANL (Argonne National Laboratory) MPICH:

    http://www-unix.mcs.anl.gov/mpi/mpich

    This is probably the most popular MPI implementation. It supports many devices and communication protocols. It can be built for:

    - Single SMP box for shared memory communication only.

    - A cluster of uniprocessors for TCP/IP communication.

    - A cluster of SMP boxes for combined TCP/IP and shared memory communication

  • Myricom MPICH_GM, ported by Myricom from MPICH to exploit the lower latency and higher data rates of Myrinet networks. It requires Myrinet hardware and device drivers.

    http://www.myri.com/scs/index.html

To build

After downloading the hpl.tgz file, untar it to a directory, for example, /bench. Examine the available makefiles in the directory /bench/hpl/setup/ and find the one that most closely matches your system. Then copy the file to /bench/hpl/ and rename it to something like Make.MyArch. Modify this file appropriately to suit your system environment, and then issue the command make arch=MyArch from /bench/hpl/. This builds an executable xhpl in /bench/hpl/bin/MyArch/, where a sample input file HPL.dat is also available.
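In command form, the build steps just described look roughly like the following sketch (Make.MyArch and the chosen setup file are placeholders):

 cd /bench/hpl
 cp setup/Make.<closest_match> Make.MyArch
 vi Make.MyArch                  # adjust compilers, paths, and libraries
 make arch=MyArch
 ls bin/MyArch                   # xhpl and a sample HPL.dat appear here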

For pSeries running Linux, the closest match of makefiles is Make.PWR3_FBLAS, so we took this file and modified the definitions shown in Example 7-10, which uses the ESSL library.

Example 7-10. Linpack HPC build options with ESSL
 ARCH      = $(arch)
 TOPdir    = /bench/hpl
 CC        = /usr/bin/mpicc
 CCFLAGS   = -qnosmp -qarch=pwr4 -qtune=pwr4 -qcache=auto -O3
 LINKER    = /usr/bin/mpif77
 LINKFLAGS = $(CCFLAGS) -lessl

Example 7-11 shows the settings for the ESSLSMP library. In both cases, we built 64-bit applications by setting the environment variable OBJECT_MODE=64.

Note that /usr/bin/mpicc and /usr/bin/mpif77 are wrappers for the "real" mpicc and mpif77.

Example 7-11. Linpack HPC build options with ESSLSMP
 ARCH      = $(arch)
 TOPdir    = /bench/hpl
 CC        = /usr/bin/mpicc
 CCFLAGS   = -qsmp=noauto -qarch=pwr4 -qtune=pwr4 -qcache=auto -O3
 LINKER    = /usr/bin/mpif77
 LINKFLAGS = $(CCFLAGS) -lesslsmp

Changing the source code is normally not necessary, since most of the time is spent in the matrix-matrix operations handled by the user-supplied performance library.

However, there is one place you can change to see how performance is affected. In line 158 of the file hpl/testing/ptest/HPL_pddriver.c, the parameter passed to HPL_grid_init is HPL_ROW_MAJOR; it can be changed to HPL_COLUMN_MAJOR. This change allows more of the MPI communication to be done within each SMP node.
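One way to apply that change from the shell is sketched below; this assumes a GNU sed recent enough to support -i and that replacing every occurrence of HPL_ROW_MAJOR in the file is what you want:

 # Keep a backup, then switch the process-grid ordering in the HPL test driver
 cp hpl/testing/ptest/HPL_pddriver.c hpl/testing/ptest/HPL_pddriver.c.orig
 sed -i 's/HPL_ROW_MAJOR/HPL_COLUMN_MAJOR/' hpl/testing/ptest/HPL_pddriver.c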

The input file HPL.dat

The HPL executable xhpl only needs one input file HPL.dat to be in the current working directory. The input file contains parameters controlling the size of the problem, the number of problems per job, the algorithm features and processor grid, and so on.

Actually, one of the major challenges in getting the best performance number is trying different combinations of the parameters. Example 7-12 lists the input file we used in our 8-way runs (note that it requires about 16 GB of total memory).

Example 7-12. Linpack HPC input file HPL.dat
 HPLinpack benchmark input file
 Innovative Computing Laboratory, University of Tennessee
 HPL.out      output file name (if any)
 6            device out (6=stdout,7=stderr,file)
 1            # of problems sizes (N)
 42000
 1            # of NBs
 400
 1            # of process grids (P x Q)
 2            Ps
 4            Qs
 16.0         threshold
 1            # of panel fact
 2            PFACTs (0=left, 1=Crout, 2=Right)
 1            # of recursive stopping criterium
 8            NBMINs (>= 1)
 1            # of panels in recursion
 2            NDIVs
 1            # of recursive panel fact.
 0            RFACTs (0=left, 1=Crout, 2=Right)
 1            # of broadcast
 1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
 1            # of lookahead depth
 1            DEPTHs (>=0)
 0            SWAP (0=bin-exch,1=long,2=mix)
 64           swapping threshold
 0            L1 in (0=transposed,1=no-transposed) form
 0            U in (0=transposed,1=no-transposed) form
 1            Equilibration (0=no,1=yes)
 8            memory alignment in double (> 0)

Parameters N, NB, P, Q in Example 7-12 are of particular interest for performance tuning and are discussed below.

  • N - The size of the matrix. The bigger the problem size, the bigger the ratio of computing time to communication time, and therefore normally the better the performance. However, the problem cannot be so big that it causes system paging.

    A rule of thumb is to choose a problem size that uses 80% of the total memory of all nodes (assuming a uniform amount of memory on each node). So you can start with the N given by the following formula (a worked example follows this list):

    N=sqrt(0.8*Total_Mem_in_Bytes/8)

    Then vary it to see how performance changes.

  • NB - The block size of the matrix used by HPL for data distribution and computational granularity. It should preferably be a small multiple of the block size used by the matrix-matrix multiply routine (dgemm). An optimal NB is big enough for good data reuse yet small enough for good load balance. Normally, NB should be bigger for bigger N.

  • P, Q - These two parameters determine the processor grid; P*Q equals the number of MPI processes. Start with a "square" grid, then decrease P (and increase Q so that P*Q stays constant) to see how performance changes.
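As a worked example of the rule of thumb for N (a sketch only; the 64 GB figure is illustrative for two 32 GB nodes, and the runs in Example 7-12 deliberately used a smaller N of 42000):

 # 80% of 64 GB, divided by 8 bytes per double-precision element, square root
 echo 'sqrt(0.8 * 64 * 1024^3 / 8)' | bc -l     # approximately 82897
 # Round down to a multiple of NB, for example N=82800 with NB=400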

Note

There is another layer of tuning if hybrid MPI/OpenMP is used: the combination of the number of MPI processes and the number of OpenMP threads.

In this case, P*Q*T equals the total number of processors, where T is the number of OpenMP threads per MPI process. For example, for a system of two nodes, each with four processors, you can also run with P=Q=T=2, in addition to P=2, Q=4, T=1. Hybrid MPI/OpenMP has the advantage of achieving a better balance between computing time and communication time.


To run

Copy the files xhpl and HPL.dat to a directory. The rest depends on your hardware configuration and MPI package. We give examples for combinations of Gigabit Ethernet, Myrinet GM, pure MPI, and hybrid MPI/OpenMP.

Note that in the following examples (Example 7-13 to Example 7-16), we assume the following:

  • We have two compute nodes whose Myrinet interfaces are lp01 and lp02, and whose Gigabit interfaces are lg01 and lg02.

  • Four processors on each are used in all examples.

We have put in large values (in bytes) for two important MPICH environment variables, P4_GLOBMEMSIZE and P4_SOCKBUFSIZE. Their default values are 4 MB and 16 KB, respectively. P4_GLOBMEMSIZE specifies the amount of memory reserved for communication through shared memory (when MPICH has been configured with the --with-comm=shared option). The job hangs if the value is too low. This is a dynamic run-time behavior, and you need to set it appropriately according to the communication size of the running job.

P4_SOCKBUFSIZE sets the socket buffer size in bytes. Increasing this value usually improves performance.

Pure MPICH_GM over GM
Example 7-13. HPL run script for pure MPICH_GM over GM
 #!/bin/bash
 cat > host.list << EOF
 lp01
 lp01
 lp01
 lp01
 lp02
 lp02
 lp02
 lp02
 EOF
 mpirun.ch_gm -np 8 -machinefile host.list xhpl > log.log
Hybrid MPICH_GM/OpenMP over GM
Example 7-14. HPL run script for hybrid MPICH_GM/OpenMP over Myrinet
 #!/bin/bash
 cat > host.list << EOF
 lp01
 lp01
 lp02
 lp02
 EOF
 mpirun.ch_gm OMP_NUM_THREADS=2 -np 4 -machinefile host.list xhpl > log.log
Pure MPICH over IP (Gigabit Ethernet)
Example 7-15. HPL run script for pure MPICH over IP (GigE)
 #!/bin/bash
 cat > host.list << EOF
 lg01:4
 lg02:4
 EOF
 cat >> ~/.bashrc << EOF
 P4_GLOBMEMSIZE=660000000
 P4_SOCKBUFSIZE=512000
 export P4_GLOBMEMSIZE P4_SOCKBUFSIZE
 EOF
 mpirun -nolocal -np 8 -machinefile host.list xhpl > log.log
Hybrid MPICH/OpenMP over IP (Gigabit Ethernet)
Example 7-16. HPL run script for hybrid MPICH/OpenMP over GigE
 #!/bin/bash
 cat > host.list << EOF
 lg01:2
 lg02:2
 EOF
 cat >> ~/.bashrc << EOF
 OMP_NUM_THREADS=2
 P4_GLOBMEMSIZE=660000000
 P4_SOCKBUFSIZE=512000
 export OMP_NUM_THREADS P4_GLOBMEMSIZE P4_SOCKBUFSIZE
 EOF
 mpirun -nolocal -np 4 -machinefile host.list xhpl > log.log

Notes on Example 7-13 to Example 7-16:

  • Myrinet MPICH_GM uses shared memory for intra-node communication by default, and it passes environment variables to the other MPI processes through command line options.

  • The -nolocal option and the hostname:count format of the host file (for example, lg01:2) for MPICH are critical to make sure that shared memory communication is used for intra-node traffic. In order to have this capability, MPICH has to have been configured with the options --with-device=ch_p4 --with-comm=shared.

  • Those two P4 environment variables are important for running large cases and for running faster. The values can, of course, be changed.

The result

Figure 7-5 on page 330 shows an AIX and Linux comparison of HPL results on p655. Note that this is for illustration purposes only. More effort should be made to get the best performance in both cases.

Figure 7-5. AIX and Linux HPL results for two p655 at N=42000

graphics/07fig05.gif

We can see from Figure 7-5 that AIX has slightly better performance, especially if AIX large pages are enabled. (In addition to the standard 4 KB pages, AIX 5.1 also supports large pages of 16 MB each. HPC applications that involve large sequential accesses to memory and files benefit from large pages.)

The hybrid MPI/OpenMP case (4x2 means 4 MPI processes, each with 2 OpenMP threads) performs better than pure MPI (the 8x1 case in Figure 7-5).

Figure 7-6 on page 331 shows comparisons of MPICH over Gigabit Ethernet and MPICH_GM over Myrinet in the smaller case (N=7000). It can clearly be seen that Myrinet performs better than Gigabit Ethernet.

Figure 7-6. Eight-way HPL results on two p655 nodes: GigE and Myrinet comparison

graphics/07fig06.gif

However, keep in mind that this is a very small problem (only takes about 15 seconds), so the communication time is relatively more important than with large problems, where computational time becomes dominant.

7.6.5 NetPIPE - point-to-point communication

For further details on this study, refer to the following site:

http://www.scl.ameslab.gov/netpipe/

Network Protocol Independent Performance Evaluator (NetPIPE) measures point-to-point communication between two processes. Message sizes are increased at regular intervals and also varied with small perturbations in order to measure communication throughput.

NetPIPE is designed to handle many different protocols and communication middleware such as MPI, MPI-2, PVM, SHMEM, TCP, GM, LAPI, and so on. Here we show the usage and performance of two modules, TCP and MPI, under different situations.

To build

Building executables for TCP and MPI modules is very straightforward. No change to source files and makefile is necessary if compilers and MPI are on the system PATH. Simply execute these commands and the executables NPtcp and NPmpi will be created:

 make tcp
 make mpi
TCP across nodes - Gigabit Ethernet

Figure 7-7 shows NetPIPE TCP communication throughput over the Gigabit Ethernet connection. Two curves are shown, one with default parameters, and the other with a buffer size of 256 KB, which is set through the command line option NPtcp -b 262144. The default case corresponds to a buffer size of 16 KB.
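For reference, a typical invocation might look like the sketch below. The -b option comes from the text above; the receiver/transmitter roles and the -h and -o options are assumptions based on common NetPIPE usage and should be verified against the NetPIPE documentation:

 # Receiver side (started first), using the 256 KB buffer
 ./NPtcp -b 262144
 # Transmitter side; -h names the receiving host, -o the output file (both assumed)
 ./NPtcp -b 262144 -h lg01 -o np_tcp_256k.out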

Figure 7-7. NetPIPE TCP on two p655 nodes connected by Gigabit Ethernet

graphics/07fig07.gif

It can be seen that the throughput increases as the message size increases, reaching its maximum at a message size of around 1 MB in both cases. The run with the 256 KB buffer achieves higher bandwidth for large messages.

TCP within a node

To see how TCP performs within an SMP box, we also ran the same jobs on a single p655 node; the results are shown in Figure 7-8. Much higher bandwidths are achieved than in the cross-node communication case (Figure 7-7 on page 332).

Figure 7-8. NetPIPE TCP within a single p655 node

graphics/07fig08.gif

Again, comparing the two curves in this figure, you can see that the default buffer size of 16 KB does not yield the best performance for large messages.

MPI across nodes - GigE and Myrinet

Figure 7-9 on page 334 shows three cases of NetPIPE MPI communications across two p655 nodes.

Figure 7-9. NetPIPE MPI across p655 nodes - GigE and Myrinet comparison

graphics/07fig09.gif

  • MPICH over Gigabit Ethernet for a default buffer size of 16 KB.

  • MPICH over Gigabit Ethernet for a buffer size of 256 KB.

  • MPICH_GM over Myrinet. In this case, the buffer size parameter does not have any effect because it runs in "user space" mode, not IP mode.
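A minimal way to launch the NetPIPE MPI module across the two nodes is sketched below, reusing the MPICH launcher conventions from the HPL run scripts; the output file name is illustrative:

 #!/bin/bash
 cat > host.list << EOF
 lg01
 lg02
 EOF
 # Two MPI processes, one per node, exchanging messages over Gigabit Ethernet
 mpirun -nolocal -np 2 -machinefile host.list ./NPmpi > np_mpi_gige.out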

In this case, the change of buffer size helped improve bandwidth from 394 Mbps to 596 Mbps for MPICH, a 51% improvement. This is much larger than in the TCP case (Figure 7-7 on page 332), where the bandwidth improvement for inter-node communication is (648-623)/623=4%.

A more impressive observation is the large gap between MPICH_GM over Myrinet and MPICH over Gigabit Ethernet. MPICH_GM/Myrinet is able to deliver 1691 Mbps, while the best that MPICH/Gigabit Ethernet can reach is 596 Mbps.

By comparing Figure 7-7 on page 332 with the MPICH part of Figure 7-9 on page 334, we can also see how much overhead MPICH adds to the communication; the maximum bandwidth is 648 Mbps for TCP and 596 Mbps for MPICH.

Shared memory MPI within an SMP node

We now take a look at shared memory communication within an SMP box, which in our case is an eight-way p655.

Note

NetPIPE has a separate module to do the shmem test for Cray systems. It is not the same as what is discussed here.


MPICH_GM (version 1.2.5..10) has shared memory support for communication within SMP nodes through the ch_gm device. For MPICH, there are two implementations of shared memory support: one with the ch_shmem device (configured with the option --with-device=ch_shmem), and the other with the ch_p4 device and shared comm (--with-device=ch_p4 --with-comm=shared). The former only works within a single SMP node; the latter also works on a cluster of SMP nodes.
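The following sketch shows how two separate MPICH installations could be configured with these devices; the install prefixes are illustrative, and the exact configure syntax should be checked against the MPICH 1.2.5 documentation:

 # One MPICH with the ch_shmem device (single SMP node only)
 ./configure --with-device=ch_shmem --prefix=/opt/mpich-ch_shmem
 make && make install

 # A second MPICH with ch_p4 plus shared memory (works across a cluster of SMP nodes)
 ./configure --with-device=ch_p4 --with-comm=shared --prefix=/opt/mpich-ch_p4_shared
 make && make install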

Figure 7-10 compares these three cases. We have chosen the buffer size to be 256 KB for the MPICH cases. It is interesting to note that MPICH/ch_shmem has the highest peak of the three cases, but MPICH_GM performs better outside that peak area.

Figure 7-10. NetPIPE MPI within an SMP - shared memory MPI

graphics/07fig10.gif

It is also worth noting that MPICH/ch_shmem outperforms MPICH/ch_p4 for almost all message sizes. This suggests building multiple MPICH installations on a system of clustered SMP nodes for performance reasons.

MPI within an SMP node - without shared memory

It is very unusual to run MPI without shared memory on a single SMP node or on a cluster of SMP nodes. However, it is still interesting to see what happens if we switch shared memory off.

Figure 7-11 illustrates such a scenario; it compares MPICH (with different buffer sizes) and MPICH_GM. Surprisingly, MPICH_GM shows very poor performance, even compared with the MPICH result with the default 16 KB buffer.

Figure 7-11. NetPIPE MPI within an SMP - MPI without shared memory

graphics/07fig11.gif

Summary

So far, we have discussed TCP and MPI performance under various situations for different message sizes. To make it easier to understand communication performance differences, and therefore make better use of a system, we now provide a simplified comparison of the maximal bandwidth achieved in each case.

TCP

Figure 7-12 shows maximal bandwidths of NetPIPE TCP for intra-node and inter-node communications. This gives an idea of how much difference there is between intra-node and inter-node communication. It also shows the effect of message buffer size.

Figure 7-12. Maximal bandwidth of NetPIPE TCP

graphics/07fig12.gif

MPI

Figure 7-13 on page 338 shows maximal bandwidths of NetPIPE MPI for intra-node and inter-node communications. You can see how communication bandwidths change from inter-node to intra-node non-shared, and from intra-node non-shared to intra-node shared memory.

Figure 7-13. Maximal bandwidth of NetPIPE MPI

graphics/07fig13.gif

MPICH with ch_shmem device gives the best peak bandwidth within an SMP node. MPICH_GM has the best performance for cross-node communications (it uses Myrinet adapters, not Gigabit adapters).

7.6.6 Pallas PMB - more on MPI

For further details on this subject, refer to the following site:

http://www.pallas.com/e/products/pmb

The Pallas benchmark suite is a collection of functions that measure MPI performance for various communication patterns, such as unidirectional point-to-point, bidirectional point-to-point, cyclic send-receive and exchange, and collective communications.

Point-to-point communication is covered in 7.6.5, "NetPIPE - point-to-point communication" on page 331, so in this section, we show results for collective communications. Whenever possible, we make comparisons among GigE, Myrinet, and the IBM Cruiser switch.

To build

The version we built for testing is PMB 2.2.1, and the package we used is MPI-1. Only minimal modification was done to the makefiles, and the command to make the binary we needed is make PMB-MPI1, matching the PMB-MPI1 binary used in the run commands below. There are other components, such as PMB-EXT and PMB-IO, but we only tested the MPI-1 part.

To run

The publication Pallas MPI Benchmarks - PMB, Part MPI-1, which comes with the tar file, describes in detail what PMB can do and how to run the code. For our purposes, we used the following command lines to run the binary.

  • For GigE:

     mpirun -nolocal -np <np> -machinefile host.list PMB-MPI1 -npmin <np> 
  • For Myrinet:

     mpirun.ch_gm -np <np> -machinefile host.list PMB-MPI1 -npmin <np> 
  • For AIX:

     poe -nprocs <np> PMB-MPI1 -npmin <np> 

Note

The host.list file is different for these runs, and extra environment variables, such as P4_GLOBMEMSIZE and P4_SOCKBUFSIZE for GigE runs and the POE environment variables for AIX runs, have to be set appropriately. <np> is the number of MPI processes to start.


Allreduce

Figure 7-14 on page 340 shows a comparison of eight-process Allreduce among GigE (Linux), Myrinet (Linux), and Cruiser (AIX) on two p655 nodes. Both X and Y axes are in logarithmic scales in order to see full range of curves.

Figure 7-14. PMB eight-way Allreduce on two p655 nodes

graphics/07fig14.gif

The first data point (corresponding to a zero-sized message) for GigE and Myrinet is probably not accurate and should be discarded. It can be seen that GigE is the worst for all message sizes. Myrinet performs better than Cruiser for small messages. For large messages (>16 KB), Cruiser outperforms Myrinet.

Figure 7-15 on page 341 shows the same picture in linear scales for both X and Y axes. It shows that the communication time increases linearly with message size in all three cases. However, it hides the details for small messages, as we see in Figure 7-14.

Figure 7-15. Same as Figure 7-14, with X and Y in linear scales

graphics/07fig15.gif

Reduce

Figure 7-16 on page 342 shows Reduce results of the same MPI job (eight-way across two p655 nodes). The performance comparison for the three cases is the same as Allreduce; even the turning point at which Cruiser outperforms Myrinet is the same.

Figure 7-16. PMB eight-way Reduce on two p655 nodes

graphics/07fig16.gif

Reduce_scatter

The MPI collective communication Reduce_scatter shows a slightly different picture (see Figure 7-17 on page 343); the Myrinet and Cruiser results are closer than for Allreduce and Reduce. The Cruiser result is better than the Myrinet result for all message sizes.

Figure 7-17. PMB eight-way Reduce_scatter on two p655 nodes

graphics/07fig17.gif

Allgather

Figure 7-18 on page 344 shows Allgather performance. Myrinet has a clear advantage for small messages. For larger messages (>4 KB), the two are very close to each other. However, you can still see that Myrinet performs slightly better.

Figure 7-18. PMB 8-way Allgather on two p655 nodes

graphics/07fig18.gif

Allgatherv

Figure 7-19 shows Allgatherv performance. The overall performance trend for different message sizes is the same as for Allreduce; that is, GigE is the worst, and Myrinet is better than Cruiser for smaller messages but worse than Cruiser for larger messages.

However, the turning point is much smaller: 512 B. Another observation about Myrinet and GigE is that their performance drops a great deal for message sizes of 8 KB, 16 KB, and 32 KB.

Figure 7-19. PMB eight-way Allgatherv on two p655 nodes

graphics/07fig19.gif

Alltoall

Figure 7-20 on page 346 shows Alltoall performance. For large messages, Cruiser has slightly better performance than Myrinet. For small messages up to 16 KB, the Myrinet and Cruiser curves cross each other. GigE performs relatively best for message sizes of 32 B to 1 KB.

Figure 7-20. PMB eight-way Alltoall on two p655 nodes

graphics/07fig20.gif

Bcast

The Bcast results are shown in Figure 7-21 on page 347. The figure shows that the performance of all three cases changes suddenly around 8 KB or 16 KB; after that, the curves become straight. Again, Cruiser is the best, and GigE is the worst.

Figure 7-21. PMB eight-way Bcast on two p655 nodes

graphics/07fig21.gif

7.6.7 NAS Parallel benchmarks

For further details on this subject, refer to the following site:

http://www.nas.nasa.gov/Software/NPB

The Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) are from NASA Ames and have been very popular for measuring parallel system performance for a number of years. These benchmarks include some of the parallel algorithms widely used in computational fluid dynamics.

NPB 2.4 consists of eight benchmark programs with several problem sizes (class A, B, C, and D):

  • Classes C and D are tailored for systems with more than 16 CPUs.

  • Class A problems are quite small for many present-day computer systems; therefore, only class B results are presented here.

The first five of these benchmarks are kernel benchmarks with simple data structures. In addition to these kernels, there are three simulated applications. All of these codes are parallelized using the Message Passing Interface (MPI), and except for the Integer Sort kernel, which is written in C, all are written in Fortran.

Kernel benchmarks
Embarrassingly Parallel (EP) problem

The Embarrassingly Parallel problem is to generate pairs of Gaussian deviates (pairs of independent normally distributed variables) according to a specific scheme, and to tabulate the number of pairs in successive square annuli. This problem is typical of many Monte Carlo applications, and it requires very little communication between processors, as the name of the benchmark suggests.

The MultiGrid (MG) problem

The MultiGrid kernel uses the V-cycle multigrid algorithm to compute an approximate solution to a 3-D Poisson equation with periodic boundary conditions. Multigrid methods are used in many CFD simulations, since they improve convergence characteristics. This kernel requires highly structured interprocessor communication.

The Conjugate Gradient (CG) problem

The Conjugate Gradient benchmark computes an iterative approximation to the smallest eigenvalue of a large sparse, symmetric positive definite matrix. Within the iteration loop, the primary procedure is to solve the linear system of equations with the conjugate gradient method. This kernel is typical in unstructured grid computations, and it tests irregular long-distance communication.

The Fast Fourier Transform (FT) problem

The Fast Fourier Transform benchmark solves a 3-D Poisson partial differential equation using 3-D forward and inverse Fourier transforms. FFTs are key parts of turbulence simulations (large eddy simulations) that use spectral methods, and this benchmark tests rigorous long-distance interprocessor communication.

The Integer Sort (IS) problem

The Integer Sort problem sorts integer keys into a number of buckets. A typical application of this method is particle-in-cell simulation in fluid mechanics. This kernel stresses both integer computing performance and interprocessor communication.

Pseudo Application Benchmarks
Block Tridiagonal (BT) problem

This solves 3-D Navier-Stokes/Euler equations (a system of five partial differential equations) numerically using the approximate factorization scheme on a structured finite-volume mesh. In each of the three sweeps (for three directions), multiple independent systems of block tridiagonal equations (each block being a 5x5 matrix) are solved.

This method is used in many of the existing NASA flow solvers such as ARC3D, CFL3D, and OVERFLOW. This parallel implementation requires a square grid of processors.

Scalar Pentadiagonal (SP) problem

This solves 3-D Navier-Stokes/Euler equations (a system of five partial differential equations) numerically using the approximate factorization scheme on a structured finite-volume mesh. In each of the three sweeps (for three directions), multiple independent systems of scalar pentadiagonal equations are solved.

This method is also used in many NASA compressible flow solvers like ARC3D, CFL3D, and OVERFLOW. Like the BT problem, this parallel implementation requires a square grid of processors.

Lower-Upper Diagonal (LU) problem

This solves 3-D Navier-Stokes/Euler equations numerically with the successive over relaxation procedure that involves the LU decomposition method. This algorithm is used in incompressible flow codes such as the NASA INS3D code.

Benchmark system

The hardware and software configurations are described in detail in the earlier sections of this chapter. All of these benchmarks were run from 1 through 8 CPUs (BT and SP require a square number of processors, and could not be run on 8 CPUs). Up to the 4-CPU case, only one node was used; for the 8-CPU case, 4 CPUs from each node were used.

To build

The NAS benchmark tar file NAS2[1].4.tar can be untarred in the benchmark directory, for example, /bench1. In the /bench1/NPB2.4/NPB2.4-MPI directory, there are several directories including the config directory.

In the config directory, there are two files that need to be edited to set up this benchmark: make.def and suite.def.

The make.def file sets the compilers to be used, the compiler flags, the linker flags, and the binary directory. For each problem and each number of processors, a separate binary needs to be built.

A typical make.def file will have the settings shown in Example 7-17.

Example 7-17. Typical makefile (make.def) settings for NAS benchmarks
 MPIF77     = mpif77
 FLINK      = mpif77
 FFLAGS     = -O3 -qarch=pwr4 -qtune=pwr4 -q64
 FLINKFLAGS = -q64
 MPICC      = mpicc
 CLINK      = mpicc
 CFLAGS     = -O3 -q64 -qarch=pwr4 -qtune=pwr4
 BINDIR     = ../bin

This file is set up to build 64-bit binaries and to put them in the ../bin directory. In addition, the config directory has a file called suite.def that specifies which binaries to build.

Example 7-18 shows a suite.def file which will build all eight applications for class B for 4 CPUs.

Example 7-18. The suite definition (suite.def) file for NAS benchmarks
 # This is a sample suite config file that
 # builds the class B benchmark for the entire suite for 4 CPUs
 bt      B    4
 sp      B    4
 lu      B    4
 cg      B    4
 ep      B    4
 is      B    4
 ft      B    4
 mg      B    4

After properly creating these two files (make.def and suite.def) in the config directory, issuing the following commands at the top-level directory (that is, NPB2.4-MPI) builds all eight executables:

 make clean
 make suite

The make clean step is important, since leftover object files from earlier build sessions with a different problem size or number of processes may result in inconsistent binaries.

To run

Set the necessary environment variables for Myrinet or Gigabit Ethernet, set OBJECT_MODE to 64, and create a host.list file with the node names appropriate for Myrinet or Gigabit Ethernet, as described in the earlier sections.

The P4_GLOBMEMSIZE environment variable had to be set to a large value to run these benchmarks. A typical mpirun command, such as the following, runs the four-process job for the LU application and creates an output file called out_lu.B.4:

  mpirun -np 4 -machinefile host.list ../bin/lu.B.4 > out_lu.B.4  
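To run the whole class B, four-process suite built earlier in one pass, a loop like the following can be used (a sketch that follows the binary and output file naming convention above):

 for app in bt sp lu cg ep is ft mg; do
     mpirun -np 4 -machinefile host.list ../bin/${app}.B.4 > out_${app}.B.4
 done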

Each of the output files includes the verification of the results (SUCCESSFUL or FAILURE), the elapsed time in seconds, the total MOP/s (millions of operations per second) and MOP/s per process, and the compiler and link options that were used to create the binary.

The results

The results are presented as bar charts comparing Myrinet and Gigabit Ethernet interface performance. In these charts, performance is measured in terms of total MOP/s; higher is better.

Figure 7-22 on page 352 shows results for the EP benchmark. As expected, there is no difference in performance between Myrinet and the Gigabit Ethernet interfaces for this case, because there is very little communication.

Figure 7-22. NAS EP class B performance

graphics/07fig22.gif

MG class B performance is shown in Figure 7-23 on page 353. For the eight-CPU case, the Myrinet interface is about 30% faster than the Gigabit Ethernet interface, since inter-node communication becomes important. The dominant message passing call is MPI_Send.

Figure 7-23. NAS MG class B performance

graphics/07fig23.gif

For CG (Figure 7-24 on page 354), the inter-node communication becomes even more important, and for the eight-CPU case, the Myrinet interface is 50% better than the Gigabit Ethernet interface. The dominant message passing call for this application is MPI_Send.

Figure 7-24. NAS CG class B performance

graphics/07fig24.gif

FT (Figure 7-25 on page 355) does a significant amount of MPI_Alltoall communication, requiring significant collective communication bandwidth. As a result, for the eight-CPU case, the Myrinet interface is nearly 100% faster than the Gigabit Ethernet interface.

Figure 7-25. NAS FT class B performance

graphics/07fig25.gif

The integer sort benchmark is quite sensitive to collective communication performance (since the dominant message passing calls in this application are MPI_Allgatherv, and MPI_Alltoall). Therefore, there is a huge (nearly 400%) difference in performance for the eight-CPU case between Myrinet and Gigabit Ethernet interfaces (Figure 7-26 on page 356).

Figure 7-26. NAS IS class B performance

graphics/07fig26.gif

The BT results are shown in Figure 7-27 on page 357. Since BT has to be run on a square number of processors, it was run only on 1 and 4 CPUs, all within the same node. There is no inter-node communication, and there are no differences between the interconnects. The primary message passing calls in this application are non-blocking sends and receives.

Figure 7-27. NAS BT class B performance

graphics/07fig27.gif

SP

As in the case of BT, SP has to be run on a square number of processors, and hence it was run within a node; there was no inter-node communication. The primary message passing calls here are non-blocking sends and receives, as in BT, except that SP is a much more communication-intensive application (see Figure 7-28 on page 358 for performance comparisons).

Figure 7-28. NAS SP class B performance

graphics/07fig28.gif

LU

For the eight-way LU, the Myrinet interface has a slight advantage over the Gigabit Ethernet interface (Figure 7-29 on page 359). The primary message passing calls are non-blocking send and receive.

Figure 7-29. NAS LU class B performance

graphics/07fig29.gif
