3.2. FPGAs as Parallel Computing Machines
There is no argument that FPGAs provide enormous opportunities for performing parallel computations and accelerating complex algorithms. It is common (indeed, almost trivial given the right programming experience and tools) to demonstrate speedups of a thousand times or more over software approaches for certain types of algorithms and low-level computations. This is possible because the FPGA is, in many respects, a blank slate onto which a seemingly infinite variety of computational structures may be written. An FPGA's resources are not unlimited, however, and creating structures to efficiently implement a broad set of algorithms, ranging from large array-processing routines to simpler combinatorial control functions, can be challenging. As you will see in later chapters, this suggests a two-pronged approach to application development. At the low level, compiler tools can be used to extract and generate hardware for instruction-level parallelism. At a higher level, parallelism can be expressed explicitly by modeling the application as a set of blocks operating in parallel.
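The low-level prong can be illustrated with a small, hypothetical C function. The function below is not taken from any particular tool flow; it simply shows the kind of loop body in which the operations of each iteration have no mutual dependencies, which is exactly the instruction-level parallelism a hardware compiler can extract and pipeline.

```c
/* Hypothetical example: the two multiplies in each iteration are
 * independent of each other, so a hardware compiler can schedule
 * them in parallel and pipeline successive loop iterations. */
int sum_of_squares(const int *re, const int *im, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += re[i] * re[i] + im[i] * im[i];
    return acc;
}
```

In software this loop runs one operation at a time; in hardware the multiplies can proceed concurrently, with a new iteration entering the pipeline every cycle.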
The key to success with FPGAs as computing machines is to apply automated compiler tools where they are practical, but at the same time use programming techniques that are appropriate for parallel computing. Although tools have been developed that will extract parallelism from large, monolithic software applications (applications that were not written with parallelism in mind), this technique is not likely to produce an efficient implementation; the compiler does not have the same knowledge of the application that the programmer possesses. Hence, it cannot make the system-level decisions and optimizations that are needed to make good use of the available parallel structures. In addition, it should be understood that compiler tools for FPGAs (including those described in this book and all others currently in existence) are still in their infancy. This means that for maximum performance it may be necessary for a hardware designer to step in and rewrite certain parts of the application at a low level. It is therefore important to use a programming and partitioning strategy that allows an application to be represented by a collection of any number of semi-independent modular components, such that hardware-level reengineering is practical and does not represent a wholesale redesign of the application.
The approach of partitioning an application for system-level parallelism suggests the need for a different conceptual model of program execution than is common in traditional software development. In this model, functionally independent subprograms are compiled into hardware blocks rather than into the assembly language of a processor. Within these blocks there is no CPU with its fetch-and-execute cycle. Rather, whole components of the program can execute in parallel, to whatever degree the compiler (and the software programmer) can handle.
In support of such a machine model, one in which multiple program blocks are simultaneously operating on multiple data streams, and in which each program block is itself composed of parallel structures, we need a different kind of programming model, one that is both parallel and at the same time procedural. The C language libraries used in this book support such a model of programming while retaining the general look and feel of, and compatibility with, C language programming.
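The flavor of such a model can be sketched in plain C. The sketch below is not the book's actual library API: POSIX threads stand in for the parallel blocks, and the hypothetical `stream_t` type (with `stream_read` and `stream_write`) stands in for the hardware streams that would connect them. What it preserves is the essential shape of the model, procedural processes running concurrently and communicating over fixed-depth streams.

```c
/* Generic sketch of a parallel, procedural programming model: two
 * "processes" run concurrently and communicate over a fixed-depth
 * stream. All names here (stream_t, stream_read, stream_write) are
 * illustrative, not the book's actual library. */
#include <pthread.h>

#define STREAM_DEPTH 4
#define END_OF_DATA  (-1)

typedef struct {
    int buf[STREAM_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} stream_t;

static stream_t s = { .lock = PTHREAD_MUTEX_INITIALIZER,
                      .not_full = PTHREAD_COND_INITIALIZER,
                      .not_empty = PTHREAD_COND_INITIALIZER };

/* Blocking write: stalls when the stream is full, just as a
 * hardware producer would stall on a full FIFO. */
static void stream_write(stream_t *st, int v) {
    pthread_mutex_lock(&st->lock);
    while (st->count == STREAM_DEPTH)
        pthread_cond_wait(&st->not_full, &st->lock);
    st->buf[st->tail] = v;
    st->tail = (st->tail + 1) % STREAM_DEPTH;
    st->count++;
    pthread_cond_signal(&st->not_empty);
    pthread_mutex_unlock(&st->lock);
}

/* Blocking read: stalls when the stream is empty. */
static int stream_read(stream_t *st) {
    pthread_mutex_lock(&st->lock);
    while (st->count == 0)
        pthread_cond_wait(&st->not_empty, &st->lock);
    int v = st->buf[st->head];
    st->head = (st->head + 1) % STREAM_DEPTH;
    st->count--;
    pthread_cond_signal(&st->not_full);
    pthread_mutex_unlock(&st->lock);
    return v;
}

/* Producer process: generates a stream of values. */
static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 10; i++)
        stream_write(&s, i);
    stream_write(&s, END_OF_DATA);
    return NULL;
}

/* Consumer process: accumulates values until end-of-data. */
static void *consumer(void *arg) {
    long *sum = (long *)arg;
    int v;
    while ((v = stream_read(&s)) != END_OF_DATA)
        *sum += v;
    return NULL;
}

/* Run both processes in parallel and return the consumer's total. */
long run_pipeline(void) {
    long sum = 0;
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, &sum);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return sum;
}
```

Note that each process is ordinary sequential C; the parallelism lives entirely in the connections between processes, which is what makes the model both procedural and parallel.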
As you have seen from the previous descriptions of machine models, a programming model does not need to exactly mirror the underlying machine model. In fact, the more a programming model can be abstracted away from the machine model, the more reliance the program can place on the compiler tools, and the easier the machine is to program. There is a downside to this, however, which is that the program's efficiency also depends on the compiler's capabilities.
The programming model described in this book takes a middle path by abstracting away details of the lower-level (instruction-level) parallelism while offering explicit control over higher, system-level parallelism.
Adding Soft Processors to the Mix
If we consider an FPGA-based computing machine to be a collection of parallel machines, each of which has its own unique computing function, there is no reason why one or more of these machines can't be a traditional microprocessor. Given the wide availability of FPGA-based "soft" processors, this is a reasonable, practical way to balance the need for legacy C and traditional programming with the need for application-specific hardware accelerators.
Such FPGA-based processors can be useful for a variety of reasons. They can run legacy code (including code that is planned for later acceleration in the FPGA fabric). They can be used during development as software test generators. They can also be used to replace more costly hardware structures for such things as embedded state machines, and for standardizing I/O. They can run operating systems and perform noncritical computations that would be too space-intensive when implemented in hardware. When arranged as a grid, multiple soft processors can even form a parallel computing platform in and of themselves, one that is more generally programmable than an equivalent platform constructed entirely of low-level FPGA gates.
The recent explosion in the use of soft processors has proven that FPGAs can provide a flexible, powerful hardware platform for complete "systems-on-programmable-chips." FPGA vendors now provide (at little or no cost) all the necessary processor and peripheral components needed to assemble highly capable single-chip computing platforms, and these platforms can include customized, highly parallel software/hardware accelerators.
Used in this way, FPGAs are excellent platforms for implementing coarse-grained heterogeneous parallelism. Compared to other models of machine parallelism, this approach requires less process-to-process communication overhead. If each process maintains its own local memory and has a clearly delineated task to perform, the application can easily be partitioned between different areas of the FPGA (perhaps including different clock domains) and between independent FPGA devices. Many types of calculations lend themselves quite naturally to coarse-grained parallelism, including vector array processing, pipelined image and audio processing, and other multistage signal filtering.
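The low-communication property described above can be sketched in C. In the hypothetical example below, each "block" owns a private slice of the input and a private output buffer, so the two blocks need no communication at all while computing, just as two independent FPGA regions (or two separate FPGA devices) would. Threads stand in for the hardware blocks, and the gain stage stands in for one stage of a signal-processing pipeline.

```c
/* Hedged sketch of coarse-grained partitioning: each block has its
 * own local data and a clearly delineated task, so no inter-block
 * communication occurs during computation. All names are
 * illustrative. */
#include <pthread.h>

typedef struct {
    const int *in;  /* this block's input slice          */
    int *out;       /* this block's private output area  */
    int len;
} block_task;

/* A trivial per-block computation (a gain stage); a real block
 * might be one stage of an image, audio, or filtering pipeline. */
static void *gain_block(void *arg) {
    block_task *t = (block_task *)arg;
    for (int i = 0; i < t->len; i++)
        t->out[i] = 2 * t->in[i];
    return NULL;
}

/* Partition the input between two blocks running in parallel. */
void run_blocks(const int *in, int *out, int n) {
    int half = n / 2;
    block_task a = { in, out, half };
    block_task b = { in + half, out + half, n - half };
    pthread_t ta, tb;
    pthread_create(&ta, NULL, gain_block, &a);
    pthread_create(&tb, NULL, gain_block, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
}
```

Because the blocks share nothing while running, they could just as easily be placed in different clock domains or on different devices, which is precisely what makes this style of partitioning attractive for FPGAs.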