Section 14.1. The FPGA as a High-Performance Computer

Low-cost, high-performance computing using highly parallel platforms is not a new concept. It has long been recognized that many, perhaps most, of the problems requiring supercomputer solutions are well suited to parallel processing. By partitioning a problem into smaller parts that can be solved simultaneously, it's possible to gain massive amounts of raw processing power, at the expense of increased hardware size, system cost, and power consumption. It is quite common, for example, for supercomputing researchers to build clusters of traditional desktop PCs (often running the Linux operating system) that together form what can be thought of as a "coarse-grained" parallel processing system. To create applications for such a system, software programmers may make use of various available programming tools, including PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). Both of these libraries provide the ability to express parallelism through message passing and/or shared memory resources using the C language.

Using such an approach, traditional supercomputers can be replaced for many applications by clusters of readily available, low-cost computers. The computers used in such clusters can be of standard design and are programmed to act as one large, parallel computing platform. Higher-performance (and correspondingly higher-cost) clusters often employ high-bandwidth, low-latency interconnects to reduce the overhead of communication between nodes. Another path to reducing communication overhead is to increase the functionality of each node through the introduction of low-level custom hardware to create high-speed datapaths. More recently, supercomputer cluster designers have pursued this approach further and have exploited node-level parallelism by introducing programmable hardware into the mix. In this environment, an FPGA can be thought of as one node in a larger, highly parallel computer cluster.

Taking Parallelism to Extreme Levels

When FPGAs are introduced into a supercomputing platform strategy, opportunities exist for improving both coarse-grained (application-level) and fine-grained (instruction-level) parallelism. Because FPGAs provide a large amount of highly parallel, configurable hardware resources, it is possible to create structures (such as parallel multiply/add instructions) that can greatly accelerate individual operations or higher-level statements such as loops. Through the use of instruction scheduling, instruction pipelining, and other techniques, inner code loops can be further accelerated. And at a somewhat higher level these parallel structures can themselves be replicated to create further levels of parallelism, up to the limit of the target device capacity. You have seen some of these techniques used on a more modest scale in earlier chapters.
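To make the idea of fine-grained parallelism concrete, here is a minimal sketch in plain C of a multiply/add (dot product) loop, first written serially and then unrolled by four with independent partial sums. The function names are hypothetical, and the sketch assumes a software-to-hardware compiler that can map independent operations in a loop body onto parallel multipliers and adders; whether and how a given tool does this depends on the tool and the target device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serial form: one multiply/add per iteration, with every iteration
 * depending on the single running accumulator. */
int32_t dot_serial(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by four (hypothetical example): the four multiply/adds in the
 * loop body use separate accumulators, so they do not depend on one
 * another and could, in principle, be scheduled in the same cycle. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }

    /* Handle any leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        acc0 += (int32_t)a[i] * b[i];

    /* Combine the partial sums as a small adder tree. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Both functions return the same result for the same inputs; only the structure exposed to the compiler differs. Replicating the unrolled body again (by eight, sixteen, and so on) trades more device resources for more operations per cycle, up to the capacity of the target device.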

As you learned in Chapter 2, FPGA devices were introduced in the mid-1980s to serve as desktop-programmable containers for random ("glue") logic and as alternatives to custom ASICs. These devices were almost immediately discovered by researchers seeking low-cost, hardware-based computing solutions. Thus, the field of FPGA-based reconfigurable computing (RCC) was born, or at least came of age. The applications for which FPGA-based reconfigurable computing platforms are appropriate include computationally intensive real-time image processing and pattern recognition, data encryption and cryptography, genomics, biomedical, signal processing, scientific computing (including algorithms in physics, astronomy, and geophysics), data communications, and many others.

It should be noted that today most such uses of FPGAs are relatively static, meaning that the algorithm(s) being implemented on the FPGA are created at system startup and rarely, if ever, are changed during system operation. There has been extensive research into dynamic reconfiguration of FPGAs (changing the hardware-based algorithm "on the fly" to allow the FPGA to perform multiple tasks). However, to date there has not been widespread use of FPGAs for dynamically reconfigurable computing, due in large part to the system costs, measured in performance and power, of performing dynamic FPGA reprogramming. In most cases, then, the goal of FPGA-based computing is to increase the raw power and computational throughput for a specific algorithm.

Through a combination of design techniques that exploit spatial parallelism, key algorithms and algorithmic "hot spots" can be accelerated in FPGAs by multiple orders of magnitude. Until recently, however, it has been difficult or impossible for software algorithm developers to take advantage of these potential speedups because the software development tools (in particular, the compiler technologies) required to generate low-level structures from higher-level software descriptions have not been widely available. Instead, software algorithm developers have had to learn low-level hardware design methods or have turned to more experienced FPGA designers to take on the daunting task of manually optimizing an algorithm and expressing it as low-level hardware. With the advent of software-to-hardware compilers, however, this process is becoming easier.

Many Platforms to Choose From

The smallest FPGA-based computing platforms combine one or more widely available, reasonably priced FPGA devices with standard I/O devices (such as PCI or network interfaces) on a prototyping board. There are many examples of such boards, which typically make use of FPGA devices produced by Altera or Xilinx. These board-level solutions may or may not include adjacent microprocessors, but in most cases it is possible to make use of embedded processors to create mixed hardware/software applications, as you've seen in earlier chapters. Using such boards (or a collection of such boards) in combination with existing tools, it is possible to create hardware implementations of computationally intensive software algorithms using software-based methods.

When combined with other peripherals on a board, an FPGA (or a collection of FPGAs) can become an excellent platform for algorithm experimentation, given appropriate design expertise and/or design tools.

By making appropriate use of one or more embedded ("soft") processor cores, a savvy application developer or design team can construct a computing environment roughly analogous to the multiprocessor cluster systems described earlier. Soft processors are appropriate for this because they are generally configurable (you can select only those peripheral interfaces you need) and they generally support high-performance connectivity with the adjacent FPGA logic.

Using such a mixed processor/FPGA platform and carefully evaluating the application's processing and bandwidth requirements, it is possible to create mixed hardware/software solutions in which the FPGA logic serves as a hardware coprocessor for one or more on-chip embedded processors. And by using the multiprocess, streaming programming model presented in this book, it's possible to achieve truly astonishing levels of performance for many types of algorithms.

In terms of cost, a board-level prototyping system consisting of a mainstream FPGA device (capable of hosting one or more embedded processors along with other FPGA-based computations) and various input/output peripherals will range in price from a few hundred to a few thousand dollars, depending on the capacity of the onboard FPGA device(s). In fact, because the streaming programming model (with its network of connected processing nodes) requires relatively little node-to-node connectivity (typically a small number of streams, which may be implemented externally as high-speed serial I/O), it is becoming practical to use simple, low-cost FPGA prototyping boards arranged in a grid computing matrix.

The next evolutionary step is to combine larger, multiple-FPGA processing arrays with custom interconnect schemes and dynamic reprogramming to create large-scale, reconfigurable supercomputing applications. All of the major supercomputing vendors today are working on such platforms and have either announced, or will soon announce, related supercomputing products. Smaller FPGA-based supercomputing (or supercomputing-capable) platforms are available from Nallatech, SBS Technologies, Annapolis Microsystems, Gidel, the Dini Group, and many others.

Taking a Software Approach

Much of the research activity in FPGA-based computing has been oriented around the problem of software programming for massively parallel targets. As described in earlier chapters, parallelism in an application can be considered at two distinct levels: at the system level and at the level of individual instructions within a computational process or loop. The ideal software development tools for such targets would exploit both levels of parallelism with a high degree of automation. For now, however, the best approach seems to be to focus the efforts of automation (represented by compiler and optimizer technologies) on the lower-level aspects of the problem. At the same time, the software programmer is provided with an appropriate programming model and related tools and libraries that allow higher-level, coarse-grained parallelism to be expressed in a natural way. (This has, of course, been a major theme of this book.)

In this way the programmer, who has knowledge of the application's external requirements that are not necessarily reflected in a given set of source files, can make decisions and experiment with alternative algorithmic approaches while leaving the task of low-level optimization to the compiler tools. A number of programming models can be applied to FPGA-based programmable platforms, but the most successful of these models share a common attribute: they support modularity and parallelism through a dataflow (or dataflow-like) method of design partitioning and abstraction. The key to allocating processing power within such a system, and to using such a programming model, is to use the FPGA to implement one or more processes that handle the heavy computation, and to provide other processes running on embedded or external microprocessors to handle file I/O, memory management, system setup, and other non-performance-critical tasks.
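The partitioning described above can be sketched in plain C as three processes connected by streams. This is only an illustrative model, not the API of any particular tool: the stream_t type, its functions, and the process names are all hypothetical, and the processes are run one after another here, whereas in a real system they would run concurrently, with the compute process implemented in FPGA logic.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_DEPTH 16

/* Hypothetical fixed-depth FIFO standing in for a hardware stream. */
typedef struct {
    int data[STREAM_DEPTH];
    size_t head, tail, count;
} stream_t;

static void stream_init(stream_t *s) { s->head = s->tail = s->count = 0; }

/* Returns 1 on success, 0 if the stream is full. */
static int stream_write(stream_t *s, int v)
{
    if (s->count == STREAM_DEPTH) return 0;
    s->data[s->tail] = v;
    s->tail = (s->tail + 1) % STREAM_DEPTH;
    s->count++;
    return 1;
}

/* Returns 1 on success, 0 if the stream is empty. */
static int stream_read(stream_t *s, int *v)
{
    if (s->count == 0) return 0;
    *v = s->data[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return 1;
}

/* Producer (software side): feeds input samples into the pipeline. */
static void producer(stream_t *out, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ok = stream_write(out, src[i]);
        assert(ok);
        (void)ok;
    }
}

/* Compute process (the candidate for FPGA logic): squares each sample. */
static void compute(stream_t *in, stream_t *out)
{
    int v;
    while (stream_read(in, &v)) {
        int ok = stream_write(out, v * v);
        assert(ok);
        (void)ok;
    }
}

/* Consumer (software side): collects results, for example for file I/O. */
static size_t consumer(stream_t *in, int *dst, size_t max)
{
    size_t n = 0;
    int v;
    while (n < max && stream_read(in, &v))
        dst[n++] = v;
    return n;
}

/* Wires the three processes together and runs them in sequence.
 * The input must fit in one stream, since nothing drains it concurrently. */
size_t run_pipeline(const int *src, int *dst, size_t n)
{
    stream_t to_hw, from_hw;
    assert(n <= STREAM_DEPTH);
    stream_init(&to_hw);
    stream_init(&from_hw);
    producer(&to_hw, src, n);
    compute(&to_hw, &from_hw);
    return consumer(&from_hw, dst, n);
}
```

Calling run_pipeline with the four inputs 1, 2, 3, 4 fills the destination array with their squares and returns 4. The point of the structure is that the compute process touches nothing but its two streams, so it can be moved into hardware, or replicated, without changing the producer or the consumer.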