3.1. Parallel Processing Models
The popularity and usability of today's most common processor architectures (whether they are simple embedded controllers or more advanced, workstation-class processors) owe a great deal to the fact that they all share a common machine model, which was first described by computer pioneer John von Neumann in the 1940s. In a classic von Neumann computer, a central processing unit (CPU) is connected to some form of memory that is itself capable of storing both machine instructions and data. The CPU executes a stored program as a sequence of instructions that perform various operations including read and write operations on the memory.
SISD: The Original Processor Machine Model
At its simplest, such a single-processor machine is termed (in the generally accepted taxonomy of computers) a Single Instruction, Single Data machine (SISD). In such a machine, only one operation on one data input can be performed at any given time, and that operation is defined by only one computer instruction (an ADD operation, for example). It is for this basic machine model that the vast majority of software programming languages (and resulting application programs) have been developed.
In a programming environment for SISD machines, a software process is expressed using a series of statements, each corresponding to one or more distinct machine operations, that are punctuated by various branches, loops, and subroutine calls. The key thing to understand is that today's most common programming methods, or programming models, have been developed to meet the needs of the basic SISD model. This has remained true even as various levels of parallelism have crept into the machines for which we program, in the form of instruction pipelines and other such processor architecture enhancements.
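The sequential model described above can be illustrated with a minimal sketch (shown here in Python for brevity; the function name `average` and the example data are purely illustrative):

```python
def average(values):
    # A classic sequential process: statements execute one at a
    # time, punctuated by a loop, a branch, and a subroutine call.
    total = 0
    for v in values:        # loop: one addition per iteration
        total += v
    if not values:          # branch: guard against empty input
        return 0
    return total / len(values)

result = average([2, 4, 6])  # subroutine call from the main flow
```

However the underlying hardware may overlap these operations internally, the programmer reasons about them as a single, linear flow of control.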
The existence of a common machine model for general-purpose processors has been extremely useful for software developers. Programming languages, and the expertise of the programmers who use them, have evolved in a gradual manner as the applications and operating systems that are implemented on these machines have grown increasingly complex and powerful. As support for multitasking and multithreaded operating systems and processor architectures has emerged, existing languages have been adapted in response, and new (but not fundamentally different) languages have been developed.
Throughout this evolutionary process, the application programmer has assumed that a given program is being executed on a single, sequential processing resource. There are exceptions, of course, but for the vast majority of software applications being developed today, including embedded software applications, programmers have been trained to think in terms of a linear flow of control and to consider software programs to be fundamentally sequential in their operation. But parallelism at the machine level has in fact been with us for some time.
Early in the development of processor architectures, parallelism was introduced to increase processor performance. An example of this is an instruction prefetch operation, which allows the overlapping of instruction fetches and instruction executions. This feature later evolved into generalized instruction pipelining, which allows time multiplexing of operations. More recent advances include vector processors, which support multiplexing in both the time and space domains, whereby a given instruction can operate on several inputs simultaneously.
The SIMD Machine Model
If we move out of the realm of traditional processors and into the domain of supercomputers, we can find examples of machines that support much greater degrees of parallelism, in which a controlling processor (or control unit) directs the operation of many supporting processing elements, each of which performs some specific operation in parallel with all the others. In such a machine, a single instruction (which might perform a matrix multiply operation, for example) triggers a potentially large number of processing elements to execute the same operation simultaneously, but each on its own data. This is an example of a Single Instruction, Multiple Data (SIMD) machine model. A number of commercial supercomputers have been constructed using this model, including machines offered by Thinking Machines, Cray, and Digital Equipment Corporation.
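The essential SIMD idea, one instruction applied in lockstep across many data elements, can be sketched as follows (a conceptual model only; the name `simd_apply` is illustrative, and a real SIMD machine would execute the lanes simultaneously in hardware rather than in a software loop):

```python
def simd_apply(op, lanes):
    # Conceptually, every processing element executes the same
    # operation `op` in the same step, each on its own data lane.
    return [op(x) for x in lanes]

# One "instruction" (double the value), four data elements:
result = simd_apply(lambda x: x * 2, [1, 2, 3, 4])
```

The defining property is that there is only one instruction stream: every processing element performs the identical operation, differing only in the data it consumes.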
MIMD Machines and the Transputer
If we take parallelism to the next logical level, we can conceive of machines that are capable of executing Multiple Instructions on Multiple Data (MIMD). In this type of machine, many different instruction processors operate in parallel, each accepting different data and executing different instructions. This sounds like the best of all possible worlds, but programming such a system necessarily involves coordinating all the machine's independent processing elements to solve some larger problem. This is trickier than it sounds, particularly given that the programmers who would make use of such machines are used to thinking in terms of a sequential flow of control.
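The distinguishing feature of MIMD, multiple independent instruction streams operating on multiple independent data sets, can be sketched with threads (a rough software analogy, not a hardware model; the worker arrangement shown is purely illustrative):

```python
import threading

results = {}

def worker(name, func, data):
    # Each thread runs its own instruction stream on its own data.
    results[name] = func(data)

# Two "processors" executing different operations on different inputs:
t1 = threading.Thread(target=worker, args=("sum", sum, [1, 2, 3]))
t2 = threading.Thread(target=worker, args=("max", max, [4, 5, 6]))
t1.start(); t2.start()
t1.join(); t2.join()
```

Note that even in this trivial sketch, the program must explicitly coordinate the independent workers (here, by joining the threads before using their results); it is exactly this coordination burden, scaled up to many processors, that makes MIMD machines hard to program.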
There has been much research into the development of such "multicomputer" systems and into methods of programming them. One such project (which has spawned many other areas of parallel processing research) was the Transputer, first described by the English company INMOS in the mid-1980s.
The Transputer was a blueprint for creating highly modular computer systems based on arrays of low-cost, single-chip computing elements. These self-supporting, independently synchronized chips were to be connected to form a complete computer of arbitrary size and complexity. The goal of this modular architecture was to allow any number of Transputers to be connected, creating a high-performance parallel computing platform with little or no need to design complex interconnections or motherboards.
Communication between Transputer processing elements was via serial links. This meant that the primary performance bottleneck in such a system might well have been data movement rather than raw processing power. For many types of applications the Transputer nonetheless demonstrated extremely high performance for relatively low cost, and the project suggested an architecture that, two decades later, makes a great deal of sense when considering high-performance computing on FPGA-based platforms.
To exploit their high degree of parallelism, Transputers were programmed using a language created for the purpose, called Occam. Occam supported parallelism at different levels, including a thread model for multiprocess programs and language features for describing parallelism at the level of individual statements and statement blocks. The language supported the explicit unrolling of loops, for example, and the explicit parallelizing of individual statements.
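Occam's notion of statement-level parallelism (its PAR construct, which runs a block of statements concurrently and waits for all of them to finish) can be loosely imitated in other languages. The following is an analogy only, not Occam syntax, and the helper name `par` is illustrative:

```python
import threading

def par(*statements):
    # Run each statement (a zero-argument callable) concurrently,
    # then wait for all to complete -- loosely analogous to a
    # PAR block in Occam, which ends when every branch has ended.
    threads = [threading.Thread(target=s) for s in statements]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

out = [0, 0]

def first():
    out[0] = 1 + 1

def second():
    out[1] = 2 * 2

# Two assignments executed "in parallel":
par(first, second)
```

In Occam this kind of parallelism is a first-class language feature rather than a library call, which is what made the language distinctive for its time.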
Shared Memory MIMD Architectures
Because serial interfaces form the communications backbone of a Transputer-based system, such a machine may be characterized as a message-passing architecture. Message passing in its purest form assumes that there is no shared memory or other shared system resources. Instead, the data in a message-passing application moves from process to process as small packets on buffered or unbuffered communication channels. This simple interconnection strategy makes it possible for processing elements to operate with a high degree of independence and with fewer resource conflicts.
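A pure message-passing arrangement, with no shared state between the communicating processes, can be sketched using a channel abstraction (here a thread-safe queue standing in for a Transputer-style serial link; the producer/consumer roles shown are illustrative):

```python
import queue
import threading

channel = queue.Queue()  # stands in for a communication link

def producer():
    # Data moves from process to process as small packets;
    # nothing is shared except the channel itself.
    for packet in [10, 20, 30]:
        channel.put(packet)
    channel.put(None)  # end-of-stream marker

received = []

def consumer():
    while True:
        packet = channel.get()
        if packet is None:
            break
        received.append(packet)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because each process touches only its own local data plus the channel, resource conflicts are confined to the channel itself, which is the property that lets message-passing processing elements operate with such a high degree of independence.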
Another category of MIMD machines includes those with shared memory resources, which may be arranged in some hierarchy to provide a combination of localized high-speed storage elements (such as local, processor-specific caches), as well as more generally accessible system memory. These memories are used in conjunction with, or as alternatives to, the message-passing data communications mechanism.