Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)

3.1. SHARED MEMORY MULTIPROCESSOR ARCHITECTURE AND PROGRAMMING MODELS

Parallel Computing on Heterogeneous Networks, by Alexey Lastovetsky
ISBN 0-471-22982-2 Copyright 2003 by John Wiley & Sons, Inc.

The shared memory multiprocessor (SMP) architecture, shown in Figure 3.1, consists of a number of identical processors sharing a global main memory. In general, the processors of an SMP computer are of the vector or superscalar architecture.

Figure 3.1: Shared memory multiprocessor.

The primary programming model that efficiently exploits the performance potential of the architecture is parallel threads of control, each running on a separate processor of the SMP and sharing memory with the other threads of the same process. Such a program is called a multithreaded (MT) program. All threads within the process share the same process-level structures and data, such as file descriptors and the user ID. Therefore, the threads all have access to the same functions, data, open files, and so on.

An MT program starts up with one initial main thread. The main thread may create new threads by calling a thread-creation routine, passing it the routine that the new thread is to run. The new thread then runs that routine, providing another stream of instructions that operates on data in the same address space. When several threads make use of the same data, they coordinate their usage via synchronization variables, such as a mutual exclusion lock (a mutex). Another way for threads to synchronize their work is to direct signals internally to individual threads.
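The thread model described above can be sketched in a few lines. This is a minimal illustration, not the book's own code: it assumes Python's standard `threading` module, and the names `run_counter`, `worker`, and `state` are hypothetical. In standard CPython builds the threads do not execute Python code on several processors at once, so the sketch shows only the programming model (a main thread creating threads, a shared address space, a mutex), not the performance behavior of an SMP.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Start num_threads workers that all update one shared counter."""
    state = {"count": 0}        # data shared by every thread of the process
    lock = threading.Lock()     # mutual exclusion lock (mutex)

    def worker():
        # The routine handed to each newly created thread.
        for _ in range(increments):
            with lock:          # only one thread at a time updates the counter
                state["count"] += 1

    # The main thread creates the new threads, passing the routine to run.
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait until every worker has finished
    return state["count"]


if __name__ == "__main__":
    print(run_counter())        # prints 40000
```

Without the lock, the read-modify-write on the shared counter could interleave between threads and lose updates, which is the kind of conflict that synchronization variables are meant to prevent.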

The secondary model of a program for the SMP architecture is parallel processes, each running on a separate processor, not sharing main memory with other processes and using message passing to communicate with the others in order to coordinate their work. This message-passing parallel programming model is a primary one for the distributed memory multiprocessor architecture, and is considered in the next chapter.

The SMP architecture provides additional potential for parallelism to speed up computations. It does so by adding parallel streams of instructions to the instruction-level parallelism provided by the vector and superscalar architectures. How significant is the performance potential of the SMP architecture? It might be expected that an n-processor SMP computer would perform the same volume of computations approximately n times as fast as the one-processor configuration of the computer. But the real picture is quite different. If you start from a one-processor configuration and add processors one by one, you will find that each additional processor adds only a fraction of the performance that you got from the first, and that the fraction becomes smaller and smaller. Eventually there is a point where adding one more processor just decreases performance.

This limitation on the speedup provided by the SMP architecture cannot be explained by the so-called Amdahl law, which is a formulation of a rather obvious observation: if a program has one section that is parallelizable and another section that must run serially, then the program can never run faster than the serial section:

$$T_{\text{total}} = T_{\text{serial}} + \frac{T_{\text{parallel}}}{n}$$

where n is the number of processors, and $T_{\text{serial}}$ and $T_{\text{parallel}}$ are the one-processor execution times of the serial and the parallelizable sections, respectively. This fact is of no interest for any but the most simplistic of programs. “Normal” programs encounter other limitations far before they ever reach this one.
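To make the bound concrete, the following sketch evaluates Amdahl's law in its usual form, total time = serial time + parallel time / n. The function names and the sample timings (10 time units serial, 90 parallelizable) are assumptions chosen for illustration, not figures from the text.

```python
def amdahl_time(t_serial, t_parallel, n):
    """Run time on n processors: the serial part plus the parallel part split n ways."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return t_serial + t_parallel / n


def amdahl_speedup(t_serial, t_parallel, n):
    """Speedup relative to the one-processor run time."""
    return (t_serial + t_parallel) / amdahl_time(t_serial, t_parallel, n)


if __name__ == "__main__":
    # Hypothetical program: 10 time units serial, 90 time units parallelizable.
    for n in (1, 9, 90, 900):
        print(n, amdahl_time(10, 90, n), amdahl_speedup(10, 90, n))
```

With these assumed timings the run time approaches 10 units and the speedup approaches 10 as n grows, however many processors are added.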

The real bottleneck of the SMP architecture, the one that limits speedup, is the memory bus. For example, heavy matrix multiplication programs come up against the limited memory bandwidth very quickly. If, say, each of 30 processors on an SMP computer averages one main memory reference every 90 bus cycles, which is quite possible for such programs, then on average there will be a reference every third cycle. If a request/reply occupies the bus for, say, about 6 cycles, then that is already twice what the bus can handle (and we are ignoring bus contention and possible I/O activity).
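The arithmetic of this example can be checked with a short sketch. The function and parameter names are hypothetical, and the model is deliberately crude: like the text, it ignores bus contention and I/O activity.

```python
def bus_load(processors, cycles_between_refs, cycles_per_transaction):
    """Demanded bus time as a multiple of bus capacity (1.0 means saturated)."""
    # Each processor needs cycles_per_transaction bus cycles
    # out of every cycles_between_refs cycles.
    return processors * cycles_per_transaction / cycles_between_refs


def max_processors(cycles_between_refs, cycles_per_transaction):
    """Largest processor count whose combined demand does not exceed the bus."""
    return cycles_between_refs // cycles_per_transaction


if __name__ == "__main__":
    print(bus_load(30, 90, 6))      # prints 2.0
    print(max_processors(90, 6))    # prints 15
```

Under these assumptions the bus is already saturated at 15 processors, so the remaining 15 can only add waiting time.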