List of Figures


Chapter 1: Serial Scalar Processor

Figure 1.1: Serial scalar processor architecture.

Chapter 2: Vector and Superscalar Processors

Figure 2.1: Vector processor architecture.
Figure 2.2: Superscalar processor architecture.

Chapter 3: Shared Memory Multiprocessors

Figure 3.1: Shared memory multiprocessor.

Chapter 4: Distributed Memory Multiprocessors

Figure 4.1: Distributed memory multiprocessor.
Figure 4.2: Matrix-matrix multiplication with matrices A, B, and C evenly partitioned in one dimension. The slices mapped onto a single processor are shaded in black. During execution, this processor requires all of matrix B (shown shaded in gray).
Figure 4.3: Matrix-matrix multiplication with matrices A, B, and C evenly partitioned in two dimensions. The blocks mapped onto a single processor are shaded black. During execution, this processor requires corresponding rows of matrix A and columns of matrix B (shown shaded in gray).
Figure 4.4: Optimal broadcast tree for P = 8, L = 6, g = 4, and o = 2. The number shown for each node is the time at which it has received the data unit and can begin sending it.
Figure 4.5: Communication tree for optimal summing for T = 28, P = 8, L = 5, g = 4, and o = 2.
Figure 4.6: Storage schemes used in the target message-passing program to store a row of blocks of matrix A and a column of blocks of matrix B in local arrays Arows and Bcols, respectively.
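
As a rough illustration of where the node labels in Figure 4.4 come from, the following C sketch (not taken from the book) simulates the greedy broadcast under the LogP model with the parameters of that figure, P = 8, L = 6, g = 4, and o = 2. Under the standard LogP cost assumptions, a message sent at time t is received at t + 2o + L, a sender must wait the gap g before starting its next send, and every informed node keeps sending, which yields the optimal broadcast tree.

    /* Sketch: greedy broadcast under the LogP model with the parameters
     * of Figure 4.4 (illustrative only, not code from the book). */
    #include <stdio.h>

    #define P 8   /* number of processors  */
    #define L 6   /* network latency       */
    #define G 4   /* gap between sends     */
    #define O 2   /* send/receive overhead */

    int main(void)
    {
        int next_send[P]; /* earliest time each informed node can start a send */
        int recv_time[P]; /* time at which each node holds the data unit       */
        int informed = 1; /* the root (node 0) holds the data at time 0        */

        next_send[0] = 0;
        recv_time[0] = 0;

        while (informed < P) {
            /* pick the informed node that can send earliest */
            int s = 0;
            for (int i = 1; i < informed; i++)
                if (next_send[i] < next_send[s])
                    s = i;

            int arrival = next_send[s] + O + L + O; /* each message costs 2o + L */
            recv_time[informed] = arrival;          /* a new node is informed    */
            next_send[informed] = arrival;          /* ...and may send at once   */
            next_send[s] += G;                      /* the sender waits the gap  */
            informed++;
        }

        for (int i = 0; i < P; i++)
            printf("node %d receives the data unit at time %d\n", i, recv_time[i]);
        return 0;
    }

With these parameters the sketch reports receive times 10, 14, 18, 20, 22, 24, and 24, so the broadcast completes at time 24.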

Chapter 5: Networks of Computers: Architecture and Programming Challenges

Figure 5.1: Local network of computers.
Figure 5.2: Matrix-matrix multiplication with matrices A, B, and C unevenly partitioned in one dimension. The area of the slice mapped to each processor is proportional to its speed. The slices mapped onto a single processor are shaded black. During execution this processor requires all of matrix A (shown shaded gray).
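
The partitioning in Figure 5.2 depends on giving each processor a slice whose area is proportional to the processor's relative speed. The following C sketch (illustrative only, with made-up speeds; not code from the book) shows one simple way to assign the columns of an n-column matrix in proportion to processor speeds.

    /* Sketch: uneven one-dimensional partitioning as in Figure 5.2 -- the
     * number of columns assigned to each processor is (approximately)
     * proportional to its relative speed. The speeds below are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        const int n = 18;                       /* total number of columns      */
        const double speed[] = {4.0, 2.0, 1.0}; /* hypothetical relative speeds */
        const int p = sizeof speed / sizeof speed[0];

        double total = 0.0;
        for (int i = 0; i < p; i++)
            total += speed[i];

        int assigned = 0;
        for (int i = 0; i < p; i++) {
            /* round the ideal share; give the remainder to the last processor */
            int cols = (i == p - 1) ? n - assigned
                                    : (int)(n * speed[i] / total + 0.5);
            printf("processor %d: %d columns\n", i, cols);
            assigned += cols;
        }
        return 0;
    }

For n = 18 and relative speeds 4 : 2 : 1 the sketch assigns 10, 5, and 3 columns, roughly matching the ideal shares of 10.3, 5.1, and 2.6.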

Chapter 6: Introduction to mpC

Figure 6.1: Matrix operation C = A x B^T with matrices A, B, and C unevenly partitioned in one dimension. The slices mapped onto a single processor are shaded in black.
Figure 6.2: One step of parallel multiplication of matrices A and B^T. The pivot row of blocks of matrix B (shown slashed) is first broadcast to all processors. Then each processor computes, in parallel with the others, its part of the corresponding column of blocks of the resulting matrix C.

Chapter 7: Advanced Heterogeneous Parallel Programming in mpC

Figure 7.1: The system of bodies consists of large groups of bodies, with the groups well separated from one another. The bodies move under the influence of Newtonian gravitational attraction.
Figure 7.2: The point of view of each individual process. The system of bodies includes all bodies of its group while each remote group is approximated by a single equivalent body.
Figure 7.3: A star communication pattern.
Figure 7.4: A fully connected network.
Figure 7.5: A ring communication pattern.
Figure 7.6: A four-level binary tree. Node 0 is a root. Nodes 5, 6, 7, and 8 are leaves. Nodes 1, 2, 3, and 4 are interior.
Figure 7.7: A four-level complete binary tree. Level 3 is full.
Figure 7.8: A four-level perfect binary tree: all of its levels are full. An n-level perfect binary tree has 2^n - 1 nodes.
Figure 7.9: Hierarchical model of a heterogeneous network of five computers.

Chapter 9: Scientific Applications

Figure 9.1: One step of the modified algorithm of parallel matrix-matrix multiplication based on two-dimensional block distribution of matrices A, B, and C. First, the pivot column a_•k of r x r blocks of matrix A (shown shaded in gray) is broadcast horizontally, and the pivot row b_k• of r x r blocks of matrix B (shown shaded in gray) is broadcast vertically. Then each r x r block c_ij of matrix C (also shown shaded in gray) is updated, c_ij = c_ij + a_ik x b_kj.
Figure 9.2: A matrix with 18 x 18 blocks is distributed over a 3 x 3 processor grid. The numbers at the left and at the top of the matrix represent indexes of a row of blocks and a column of blocks, respectively. (a) The labeled squares represent blocks of elements, and the label indicates at which location in the processor grid the block is stored; all blocks labeled with the same name are stored in the same processor. Each shaded and unshaded area represents a different generalized block. (b) Each processor has 6 x 6 blocks.
Figure 9.3: Example of two-step distribution of a 6 x 6 generalized block over a 3 x 3 processor grid. See below for complete explanation.
Figure 9.4: A matrix with 18 x 18 blocks is distributed over a 3 x 3 processor grid. See below for complete explanation.
Figure 9.5: One step of the algorithm of parallel matrix-matrix multiplication based on heterogeneous two-dimensional block distribution of matrices A, B, and C. First, each r x r block of the pivot column a_•k of matrix A (shown shaded in dark gray) is broadcast horizontally, and each r x r block of the pivot row b_k• of matrix B (shown shaded in dark gray) is broadcast vertically. Then each r x r block c_ij of matrix C is updated, c_ij = c_ij + a_ik x b_kj.
Figure 9.6: Different combinations of rectangles in a generalized block. (a) No r x r block of rectangle R31 is a horizontal neighbor of rectangle R23; therefore h[3][1][2][3] = 0. (b) All r x r blocks of rectangle R31 are horizontal neighbors of rectangle R33; h[3][1][3][3] = 3. (c) Neighbors of rectangle R22 in rectangle R21 make up a 3 x 6 rectangle area (shaded in dark gray); h[2][1][2][2] = 3. (d) Neighbors of rectangle R33 in rectangle R21 make up the last row of this rectangle (shaded in dark gray); h[2][1][3][3] = 1.
Figure 9.7: Execution times of the heterogeneous and homogeneous 2D algorithms and the heterogeneous 1D algorithm. All algorithms are performed on the same heterogeneous network.
Figure 9.8: Speedups of the heterogeneous 1D and 2D algorithms compared with that of the homogeneous 2D block cyclic algorithm. All algorithms are performed on the same heterogeneous network.
Figure 9.9: Execution times of the 2D heterogeneous block cyclic algorithm on a heterogeneous network and of the 2D homogeneous block cyclic algorithm on a homogeneous network. The networks have approximately the same total power of processors and share the same communication network.
Figure 9.10: One step of the block Cholesky factorization algorithm. See below for complete explanation.
Figure 9.11: Speedup achieved by the heterogeneous block cyclic Cholesky factorization compared to its homogeneous counterpart on a heterogeneous network of eight workstations.
Figure 9.12: Speedups achieved by the heterogeneous Cholesky factorizations based on the data distribution and process distribution compared to their homogeneous prototype on a heterogeneous network of eight workstations.
Figure 9.13: Speedups for different permutations of the numbers of bodies in groups.
Figure 9.14: The seven-point “honeycomb” scheme of oil/water well disposition.



