3.2. FPGAs as Parallel Computing Machines
There is no argument that FPGAs provide enormous opportunities for performing parallel computations and accelerating complex algorithms. It is common (indeed, almost trivial given the right programming experience and tools) to demonstrate speedups of a thousand times or more over software approaches for certain types of algorithms and low-level computations. This is possible because the FPGA is, in many respects, a blank slate onto which a seemingly infinite variety of computational structures may be written. An FPGA's resources are not unlimited, however, and creating structures to efficiently implement a broad set of algorithms, ranging from large array-processing routines to simpler combinatorial control functions, can be challenging. As you will see in later chapters, this suggests a two-pronged approach to application development. At the low level, compiler tools can be used to extract and generate hardware for instruction-level parallelism. At a higher level, parallelism can be expressed explicitly by modeling the application as a set of blocks operating in parallel.
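The low-level prong can be illustrated with a small, hypothetical C function. The function below is not taken from any particular tool flow; it simply shows the kind of loop body in which the operations of each iteration have no mutual dependencies, which is exactly the instruction-level parallelism a hardware compiler can extract and pipeline.

```c
/* Hypothetical example: the two multiplies in each iteration are
 * independent of each other, so a hardware compiler can schedule
 * them in parallel and pipeline successive loop iterations. */
int sum_of_squares(const int *re, const int *im, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += re[i] * re[i] + im[i] * im[i];
    return acc;
}
```

In software this loop runs one operation at a time; in hardware the multiplies can proceed concurrently, with a new iteration entering the pipeline every cycle.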
The key to success with FPGAs as computing machines is to apply automated compiler tools where they are practical, but at the same time use programming techniques that are appropriate for parallel computing. Although tools have been developed that will extract parallelism from large, monolithic software applications (applications that were not written with parallelism in mind), this technique is not likely to produce an efficient implementation; the compiler does not have the same knowledge of the application that the programmer possesses. Hence, it cannot make the system-level decisions and optimizations that are needed to make good use of the available parallel structures. In addition, it should be understood that compiler tools for FPGAs (including those described in this book and all others currently in existence) are still in their infancy. This means that for maximum performance it may be necessary for a hardware designer to step in and rewrite certain parts of the application at a low level. It is therefore important to use a programming and partitioning strategy that allows an application to be represented by a collection of any number of semi-independent modular components, such that hardware-level reengineering is practical and does not represent a wholesale redesign of the application.
The approach of partitioning an application for system-level parallelism suggests the need for a different conceptual model of program execution than is common in traditional software development. In this model, functionally independent subprograms are compiled into hardware blocks rather than into the assembly language of a processor. Within these blocks there is no CPU with its fetch-and-execute cycle. Rather, whole components of the program can execute in parallel, to whatever degree the compiler (and the software programmer) can handle.
In support of such a machine model, one in which multiple program blocks are simultaneously operating on multiple data streams, and in which each program block is itself composed of parallel structures, we need a different kind of programming model, one that is both parallel and at the same time procedural. The C language libraries used in this book support such a model of programming while retaining the general look and feel of, and compatibility with, C language programming.
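The flavor of such a model can be sketched in plain C. The sketch below is not the book's actual library API: POSIX threads stand in for the parallel blocks, and the hypothetical `stream_t` type (with `stream_read` and `stream_write`) stands in for the hardware streams that would connect them. What it preserves is the essential shape of the model, procedural processes running concurrently and communicating over fixed-depth streams.

```c
/* Generic sketch of a parallel, procedural programming model: two
 * "processes" run concurrently and communicate over a fixed-depth
 * stream. All names here (stream_t, stream_read, stream_write) are
 * illustrative, not the book's actual library. */
#include <pthread.h>

#define STREAM_DEPTH 4
#define END_OF_DATA  (-1)

typedef struct {
    int buf[STREAM_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} stream_t;

static stream_t s = { .lock = PTHREAD_MUTEX_INITIALIZER,
                      .not_full = PTHREAD_COND_INITIALIZER,
                      .not_empty = PTHREAD_COND_INITIALIZER };

/* Blocking write: stalls when the stream is full, just as a
 * hardware producer would stall on a full FIFO. */
static void stream_write(stream_t *st, int v) {
    pthread_mutex_lock(&st->lock);
    while (st->count == STREAM_DEPTH)
        pthread_cond_wait(&st->not_full, &st->lock);
    st->buf[st->tail] = v;
    st->tail = (st->tail + 1) % STREAM_DEPTH;
    st->count++;
    pthread_cond_signal(&st->not_empty);
    pthread_mutex_unlock(&st->lock);
}

/* Blocking read: stalls when the stream is empty. */
static int stream_read(stream_t *st) {
    pthread_mutex_lock(&st->lock);
    while (st->count == 0)
        pthread_cond_wait(&st->not_empty, &st->lock);
    int v = st->buf[st->head];
    st->head = (st->head + 1) % STREAM_DEPTH;
    st->count--;
    pthread_cond_signal(&st->not_full);
    pthread_mutex_unlock(&st->lock);
    return v;
}

/* Producer process: generates a stream of values. */
static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 10; i++)
        stream_write(&s, i);
    stream_write(&s, END_OF_DATA);
    return NULL;
}

/* Consumer process: accumulates values until end-of-data. */
static void *consumer(void *arg) {
    long *sum = (long *)arg;
    int v;
    while ((v = stream_read(&s)) != END_OF_DATA)
        *sum += v;
    return NULL;
}

/* Run both processes in parallel and return the consumer's total. */
long run_pipeline(void) {
    long sum = 0;
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, &sum);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return sum;
}
```

Note that each process is ordinary sequential C; the parallelism lives entirely in the connections between processes, which is what makes the model both procedural and parallel.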
As you have seen from the previous descriptions of machine models, a programming model does not need to exactly mirror the underlying machine model. In fact, the more a programming model can be abstracted away from the machine model, the more reliance the program can place on the compiler tools, and the easier the machine is to program. There is a downside to this, however, which is that the program's efficiency also depends on the compiler's capabilities.
The programming model described in this book takes a middle path by abstracting away details of the lower-level (instruction-level) parallelism while offering explicit control over higher, system-level parallelism.
Adding Soft Processors to the Mix
If we consider an FPGA-based computing machine to be a collection of parallel machines, each of which has its own unique computing function, there is no reason why one or more of these machines can't be a traditional microprocessor. Given the wide availability of FPGA-based "soft" processors, this is a reasonable, practical way to balance the need for legacy C and traditional programming with the need for application-specific hardware accelerators.
Such FPGA-based processors can be useful for a variety of reasons. They can run legacy code (including code that is planned for later acceleration in the FPGA fabric). They can be used during development as software test generators. They can also be used to replace more costly hardware structures for such things as embedded state machines, and for standardizing I/O. They can run operating systems and perform noncritical computations that would be too space-intensive when implemented in hardware. When arranged as a grid, multiple soft processors can even form a parallel computing platform in and of themselves, one that is more generally programmable than an equivalent platform constructed entirely of low-level FPGA gates.
The recent explosion in the use of soft processors has proven that FPGAs can provide a flexible, powerful hardware platform for complete "systems-on-programmable-chips." FPGA vendors now provide (at little or no cost) all the necessary processor and peripheral components needed to assemble highly capable single-chip computing platforms, and these platforms can include customized, highly parallel software/hardware accelerators.
Used in this way, FPGAs are excellent platforms for implementing coarse-grained heterogeneous parallelism. Compared to other models of machine parallelism, this approach requires less process-to-process communication overhead. If each process maintains its own local memory and has a clearly delineated task to perform, the application can easily be partitioned between different areas of the FPGA (perhaps including different clock domains) and between independent FPGA devices. Many types of calculations lend themselves quite naturally to coarse-grained parallelism, including vector array processing, pipelined image and audio processing, and other multistage signal filtering.
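The low-communication property described above can be sketched in C. In the hypothetical example below, each "block" owns a private slice of the input and a private output buffer, so the two blocks need no communication at all while computing, just as two independent FPGA regions (or two separate FPGA devices) would. Threads stand in for the hardware blocks, and the gain stage stands in for one stage of a signal-processing pipeline.

```c
/* Hedged sketch of coarse-grained partitioning: each block has its
 * own local data and a clearly delineated task, so no inter-block
 * communication occurs during computation. All names are
 * illustrative. */
#include <pthread.h>

typedef struct {
    const int *in;  /* this block's input slice          */
    int *out;       /* this block's private output area  */
    int len;
} block_task;

/* A trivial per-block computation (a gain stage); a real block
 * might be one stage of an image, audio, or filtering pipeline. */
static void *gain_block(void *arg) {
    block_task *t = (block_task *)arg;
    for (int i = 0; i < t->len; i++)
        t->out[i] = 2 * t->in[i];
    return NULL;
}

/* Partition the input between two blocks running in parallel. */
void run_blocks(const int *in, int *out, int n) {
    int half = n / 2;
    block_task a = { in, out, half };
    block_task b = { in + half, out + half, n - half };
    pthread_t ta, tb;
    pthread_create(&ta, NULL, gain_block, &a);
    pthread_create(&tb, NULL, gain_block, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
}
```

Because the blocks share nothing while running, they could just as easily be placed in different clock domains or on different devices, which is precisely what makes this style of partitioning attractive for FPGAs.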