2.7. MEMORY HIERARCHY AND PARALLEL PROGRAMMING TOOLS

Our consideration of vector and superscalar architectures would not be complete without a brief outline of the basic approaches to efficient management of such an important resource as memory. Although the optimal use of memory is not directly related to parallel computing technologies, there are at least two reasons to take a closer look at this issue in the context of parallel programming for vector and superscalar architectures. First, all parallel programming systems for these architectures take their modern memory structure into account, and the speedup due to better use of the memory often proves more significant than that due to better use of the parallel instruction execution units. Second, the approaches, models, and techniques aimed at efficient use of the memory turn out to be surprisingly similar to those aimed at efficient use of the parallel facilities of these architectures.

Vector and superscalar processors, as they were presented in Sections 2.1 and 2.2, have a simple two-level memory hierarchy:

  • Small and fast register memory.

  • Large and relatively slow main memory.

This memory hierarchy is reflected in the instruction sets and is directly visible to programmers. At the same time, modern uniprocessor architectures physically provide a more complex memory hierarchy, which is not directly reflected in their instruction sets. That actual memory hierarchy is organized into several levels, with higher levels being substantially faster and smaller than lower ones.

A simple actual memory hierarchy includes the following levels:

  • Register memory

  • Cache memory

  • Main memory

  • Disk memory

Cache is a buffer memory between the main memory and the registers. A cache holds copies of some data from the main memory. To read a data item from the main memory into a register, the processor first checks whether a copy of the item is already in the cache. If so, the data item is transferred into the register not from the main memory but from the cache. If not, the data item is transferred into the register from the main memory, and a copy of the item appears in the cache. Therefore, if the next data transfer instruction reads the same data item, it will be taken from the cache rather than from the main memory, resulting in a faster execution of the instruction.
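The read procedure just described can be sketched in software as follows. This is a minimal model of a direct-mapped cache; the line size, the number of lines, and the function and type names are illustrative assumptions, not part of any real processor's design:

 #include <stdint.h>
 #include <string.h>

 #define LINE_SIZE   64                /* bytes per cache line (assumed)         */
 #define NUM_LINES   256               /* number of lines in the cache (assumed) */

 typedef struct {
       int      valid;                 /* does this line hold a valid copy?      */
       uint64_t tag;                   /* which memory block the copy came from  */
       uint8_t  data[LINE_SIZE];       /* the copy of that block                 */
 } cache_line_t;

 static cache_line_t cache[NUM_LINES];

 /* Read one byte at address addr "into a register" (the return value);
    main_memory stands for the slow main memory. */
 uint8_t cache_read(const uint8_t *main_memory, uint64_t addr)
 {
       uint64_t block  = addr / LINE_SIZE;   /* memory block the item belongs to         */
       uint64_t index  = block % NUM_LINES;  /* direct mapping: the only line it can use */
       uint64_t offset = addr % LINE_SIZE;
       cache_line_t *line = &cache[index];

       if (!line->valid || line->tag != block) {
             /* Cache miss: copy the whole block from main memory into the cache. */
             memcpy(line->data, main_memory + block * LINE_SIZE, LINE_SIZE);
             line->tag   = block;
             line->valid = 1;
       }
       /* The item is now taken from the cache. */
       return line->data[offset];
 }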

Cache is partitioned into blocks called cache lines. A cache line is the minimum unit of data transfer between the cache and the main memory, so scalar data items may be transferred between the cache and the main memory only as part of a cache line. The cache memory is much smaller than the main memory, and hence it may not be able to hold copies of all the data items from the main memory that are processed by the program. Therefore, at different moments of the program execution, the same cache line may hold copies of different data blocks from the main memory: when a required data item belongs to a block that is not currently in the cache, a copy of that block will replace the copy of some other block.

A cache is said to be direct mapped if each block of the main memory has only one place it can appear in the cache. If a block can be placed anywhere in the cache, the cache is said to be fully associative. If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more cache lines. A block is first mapped onto a set, and then the block can be placed anywhere within the set. If there are n blocks in a set, the cache is called n-way associative.
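The mapping of a memory block onto a set reduces to simple address arithmetic. The sketch below assumes a hypothetical 32-Kbyte, 4-way set associative cache with 64-byte lines; all the constants and names are illustrative only:

 #include <stdint.h>

 /* Hypothetical geometry: a 32-Kbyte cache with 64-byte lines, 4-way set associative. */
 #define CACHE_SIZE  (32*1024)
 #define LINE_SIZE   64
 #define WAYS        4
 #define NUM_SETS    (CACHE_SIZE/(LINE_SIZE*WAYS))    /* = 128 sets */

 /* A block may be placed in any of the WAYS lines of the set returned here.
    With WAYS==1 the cache degenerates to a direct-mapped one; with
    NUM_SETS==1 it degenerates to a fully associative one. */
 uint64_t set_of(uint64_t addr)
 {
       uint64_t block = addr / LINE_SIZE;    /* block number in the main memory */
       return block % NUM_SETS;              /* set the block is mapped onto    */
 }

 uint64_t tag_of(uint64_t addr)
 {
       return (addr / LINE_SIZE) / NUM_SETS; /* identifies the block within its set */
 }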

The situation when a data item being referenced is not in the cache is called a cache miss. If the contribution of data transfer instructions to the total execution time of a program is substantial, a low number of cache misses will significantly accelerate the execution of the program. An obvious class of programs suitable for such optimization includes programs that make intensive use of basic operations on arrays. Note that exactly the same class of programs is perfectly suited to vector and superscalar processors, allowing a very high level of utilization of their parallel units.
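A simple illustration of the effect (assuming a C compiler, row-major array layout, and a hypothetical array size N) is the traversal order of a two-dimensional array: row-by-row traversal reuses each fetched cache line, whereas column-by-column traversal may miss on almost every reference:

 #define N 1024                       /* array size (assumed) */

 double a[N][N];

 /* Row-by-row traversal: consecutive references fall into the same cache line,
    so most of them are cache hits. */
 double sum_row_major(void)
 {
       double s = 0.0;
       for (int i = 0; i < N; i++)
             for (int j = 0; j < N; j++)
                   s += a[i][j];
       return s;
 }

 /* Column-by-column traversal of the same array: successive references are
    N*sizeof(double) bytes apart, so almost every reference touches a different
    cache line and may cause a cache miss. */
 double sum_col_major(void)
 {
       double s = 0.0;
       for (int j = 0; j < N; j++)
             for (int i = 0; i < N; i++)
                   s += a[i][j];
       return s;
 }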

The main specific optimization performed by optimizing C and Fortran 77 compilers in order to minimize the number of cache misses is loop tiling. Consider the following loop nest:

 for(i=0; i<m; i++)                  /* loop 1 */
      for(j=0; j<n; j++)             /* loop 2 */
            if(i==0)
                  b[j]=a[i][j];
            else
                  b[j]+=a[i][j];

which computes the sum of the rows of the m × n matrix a, storing the resulting n-element vector in array b. Data items accessed by reference b[j] are repeatedly used by successive iterations of loop 1. If the number n of iterations of loop 2 is large enough, the data items may be flushed from the cache by the moment of their repeated use. In order to minimize the flushing of repeatedly used data items, the number of iterations of loop 2 may be decreased. In order to keep the total number of iterations of this loop nest unchanged, an additional controlling loop is introduced. As a result, the transformed loop nest looks as follows:

 for(k=0; k<n; k+=T)                 /* additional controlling loop 0 */
      for(i=0; i<m; i++)             /* loop 1 */
            for(j=k; j<min(k+T,n); j++)   /* loop 2 */
                  if(i==0)
                        b[j]=a[i][j];
                  else
                        b[j]+=a[i][j];

This transformation is called tiling, and T is the tile size.
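The best value of T depends on the cache. As a rough rule of thumb (an assumption, not a statement from the above discussion), the T reused elements of b and the T elements of the current row of a should both stay in the cache within one tile, which bounds T by roughly half the number of elements the cache can hold:

 #include <stddef.h>

 /* The T reused elements of b plus the T elements of the current row of a
    should fit into the cache (conflict misses ignored). */
 size_t tile_size(size_t cache_bytes, size_t elem_bytes)
 {
       return cache_bytes / (2 * elem_bytes);
 }

 /* Example: a 32-Kbyte cache and 8-byte doubles give T = 2048. */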

In general, loop tiling is applied to loop nests of the form

 for(i1=...)                         /* loop 1 */
      for(i2=...)                    /* loop 2 */
            ...
                  for(in=...) {      /* loop n */
                        ...
                        e[i2][in]
                        ...
                  }

in order to minimize the number of cache misses for reference e[i2][in], which is repeatedly used by successive iterations of loop 1. Thus loop tiling improves the temporal locality of nested loops by decreasing the number of iterations between successive uses of the same array elements.
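For reference, the tiled counterpart of this general nest looks roughly as follows (schematic, in the same notation as above; the additional controlling loop strips T iterations off loop n at a time):

 for(k=...; k+=T)                    /* additional controlling loop 0 */
      for(i1=...)                    /* loop 1 */
            for(i2=...)              /* loop 2 */
                  ...
                        for(in=k; in<min(k+T,...); in++) {   /* loop n */
                              ...
                              e[i2][in]
                              ...
                        }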

The recognition of tilable loop nests is a difficult problem that must be solved by optimizing C and Fortran 77 compilers. Its solution is based on an analysis of data dependencies in the loop nests. For example, loop tiling is legitimately applicable to the loop nest given above if and only if the loops from loop 2 to loop n are fully interchangeable, that is, if any two loops from this set may be safely interchanged without any change to the functional semantics of the code. To prove the interchangeability of the loops, an analysis of the dependencies between different iterations of the loop nest is needed.
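A small illustration (hypothetical code, not taken from the above example) shows why the analysis is needed: in the first nest below the two loops may be safely interchanged, while in the second a data dependence between iterations makes the interchange, and hence tiling, illegal:

 #define N 100

 double a[N][N];

 /* Interchangeable: no iteration of the nest reads data written by another
    iteration, so the i and j loops may be safely swapped (and the nest tiled). */
 void independent(void)
 {
       for (int i = 1; i < N; i++)
             for (int j = 1; j < N; j++)
                   a[i][j] = 2.0 * a[i][j];
 }

 /* Not interchangeable: iteration (i,j) reads a[i-1][j+1], which iteration
    (i-1,j+1) has already written. After interchanging the loops, (i-1,j+1)
    would run after (i,j), so (i,j) would read a stale value and the result
    of the nest would change. */
 void dependent(void)
 {
       for (int i = 1; i < N; i++)
             for (int j = 0; j < N-1; j++)
                   a[i][j] = a[i-1][j+1] + 1.0;
 }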

Maximization of data reuse in the upper levels of the memory hierarchy can be achieved by partitioning the matrix or matrices into blocks and performing the computation with matrix-matrix operations on the blocks. The Level 3 BLAS were specified in such a way as to support the implementation of such algorithms, and they have been successfully used as the building blocks of a number of applications, including LAPACK. (LAPACK is a linear algebra package written in Fortran 77 that provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems.)
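The idea behind such block algorithms can be sketched by a blocked matrix multiplication in plain C (the matrix order N and the block size NB are illustrative assumptions); each innermost triple of loops is a small matrix-matrix operation of exactly the kind a Level 3 BLAS routine such as GEMM provides:

 #define N  1024                      /* matrix order (assumed) */
 #define NB 64                        /* block size (assumed)   */

 double A[N][N], B[N][N], C[N][N];

 /* C = C + A*B computed block by block: every (ib, jb, kb) step is a small
    matrix-matrix multiply-add on NB x NB blocks, and the blocks are small
    enough to be reused from the upper levels of the memory hierarchy.
    N is assumed to be a multiple of NB. */
 void blocked_matmul(void)
 {
       for (int ib = 0; ib < N; ib += NB)
             for (int jb = 0; jb < N; jb += NB)
                   for (int kb = 0; kb < N; kb += NB)
                         for (int i = ib; i < ib+NB; i++)
                               for (int k = kb; k < kb+NB; k++)
                                     for (int j = jb; j < jb+NB; j++)
                                           C[i][j] += A[i][k] * B[k][j];
 }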

Compilers for parallel languages such as Fortran 90 and C[] do not need to recognize loops suitable for tiling in order to improve the reuse of data held in the cache. Instead, they may translate explicit operations on arrays into loop nests with the best possible temporal locality.
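For instance, the row summation used as the example above can be written in Fortran 90 as the single array statement b = SUM(a, DIM=1). Seeing the whole operation at once, the compiler may directly emit a tiled nest such as the following sketch (shown here as C code for uniformity; the tile size is an assumed compiler choice):

 /* Possible code generated for the Fortran 90 statement  b = SUM(a, DIM=1),
    equivalent to the transformed loop nest shown above. */
 void row_sum_generated(int m, int n, double a[m][n], double b[n])
 {
       enum { T = 256 };              /* tile size chosen by the compiler (assumed) */
       for (int k = 0; k < n; k += T)
             for (int i = 0; i < m; i++)
                   for (int j = k; j < n && j < k+T; j++)
                         b[j] = (i == 0) ? a[i][j] : b[j] + a[i][j];
 }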

As a rule the address space provided by the instruction set of a modern uniprocessor computer is larger than its physical main memory. In fact, the instructions address virtual memory rather than real physical memory. The virtual memory is partitioned into pages of a fixed size. Each page is stored on a disk until it is needed. When the page is needed, the operating system copies it from disk to main memory, translating the virtual addresses into real addresses. The process of translating virtual addresses into real addresses is called mapping. The copying of virtual pages from disk to main memory is known as paging or swapping.

The data processed by programs that operate on large enough arrays do not fit into the main memory. Therefore, during the execution of such programs, swapping takes place every time the required data are not in the main memory. As copying a virtual page from disk to main memory is an expensive operation, minimizing the number of page swaps can significantly accelerate such programs. The problem is similar to that of minimizing cache misses, and it can therefore be approached similarly.
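For example (a hypothetical sketch, with an assumed page size and array length), sequential traversal of a large array touches each virtual page once, while traversal with a page-sized stride may cause a page swap on almost every reference, just as strided access caused a cache miss per reference above:

 #include <stddef.h>

 #define PAGE   4096                           /* page size in bytes (assumed)          */
 #define NELEM  (256UL*1024*1024)              /* array assumed larger than main memory */

 /* Sequential traversal touches each virtual page once, so roughly
    NELEM*sizeof(double)/PAGE pages are swapped in over the whole run. */
 double sum_sequential(const double *x)
 {
       double s = 0.0;
       for (size_t i = 0; i < NELEM; i++)
             s += x[i];
       return s;
 }

 /* Traversal with a page-sized stride touches a different page on every
    reference; if the pages cannot all stay resident, almost every reference
    may cause a swap. */
 double sum_strided(const double *x)
 {
       const size_t stride = PAGE / sizeof(double);
       double s = 0.0;
       for (size_t start = 0; start < stride; start++)
             for (size_t i = start; i < NELEM; i += stride)
                   s += x[i];
       return s;
 }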



