Section A.5. FPGA-Specific Optimization Techniques | Practical FPGA Programming in C

A.5. FPGA-Specific Optimization Techniques

Because the designer actually builds the embedded processor system hardware in the FPGA, much can be done to improve the performance of the hardware itself. In addition, because the FPGA embedded processor sits next to other FPGA hardware resources, a designer can consider custom coprocessor designs targeted specifically at a design's core algorithm.

Increasing the FPGA's Operating Frequency

Using FPGA design techniques to raise the operating frequency of the FPGA embedded processor system increases performance. Several methods are considered here.

Logic Optimization and Reduction

Connect only the peripherals and buses that will be used. Here are a few examples:

If a design does not store and run any instructions using external memory, do not connect the instruction side of the peripheral bus. Connecting both the instruction and data sides of the processor to a single bus creates a multimaster system, which requires an arbiter. Optimal bus performance is achieved when a single master resides on the bus.
Debug logic consumes FPGA resources and may be the hardware bottleneck. Once a design is completely debugged, the debug logic can be removed from the production system, potentially improving the system's performance. For example, removing a MicroBlaze Debug Module (MDM) with an FSL acceleration channel saves 950 logic cells (LCs). In MicroBlaze systems with the cache enabled, the debug logic is typically the critical path that slows down the entire design.
The Xilinx OPB External Memory Controller (EMC) used to connect SRAM and Flash memories creates a 32-bit address bus even if 32 bits are not required to address the memory. Xilinx also provides a bus-trimming peripheral that removes the unused address bits. When using this memory controller, the bus trimmer should always be used to eliminate the unused address bits. This frees up routing and pins that would otherwise have been used. The Xilinx Base System Builder (BSB) now does this automatically.
Xilinx provides several general-purpose I/O (GPIO) peripherals. The latest GPIO peripheral version (v3.01.a) has excellent capabilities, including dual-channel support, bidirectionality, and interrupt capability. However, these features also require more resources, which affects timing. If a simple GPIO is all that the design requires, the designer should use a more primitive version of the GPIO, or at least ensure that the unused features in the enhanced GPIO are turned off. In the optimized examples in this study, GPIO v1.00.a is used, which is much less sophisticated, much faster, and approximately half the size (304 LCs for seven GPIO v1.00.a peripherals as compared to 602 LCs for v3.01.a).

Area and Timing Constraints

Xilinx FPGA place-and-route tools perform much better when given guidance about what matters most to the designer. In the Xilinx tools, a designer can specify the desired clock frequency, pin locations, and logic element locations. With these details, the tools can make smarter trade-offs during hardware design implementation.

Some peripherals require additional constraints to ensure proper operation. For example, both the DDR SDRAM controller and the 10/100 Ethernet MAC require additional constraints to guarantee that the tools create correct and optimized logic. The designer must read the datasheet for each peripheral and follow the recommended design guidelines.

Hardware Acceleration

Dedicated hardware outperforms software. The embedded designer who is serious about increasing performance must consider the FPGA's ability to accelerate the processor performance with dedicated hardware. Although this technique consumes FPGA resources, the performance improvements can be extraordinary.

Turn on the Hardware Divider and Barrel-Shifter

MicroBlaze can be customized to use a hardware divider and a hardware barrel-shifter rather than performing these functions in software. Enabling these processor capabilities consumes more logic but improves performance. In one example, enabling the hardware divider and barrel-shifter adds 414 LCs but improves performance by 18.1%.
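To make the trade-off concrete, the following is a minimal sketch of the kind of work a processor must do in software when these units are absent. The function names `soft_udiv32` and `soft_shl` are hypothetical, and the routines are illustrative only; the actual library routines a MicroBlaze toolchain emits may differ. The division loop runs 32 iterations per call, and the shift loop runs one iteration per bit position, whereas dedicated hardware handles each operation in far fewer cycles.

```c
#include <stdint.h>

/* Unsigned 32-bit division by shift-and-subtract (restoring division).
 * Hypothetical illustration of a software fallback, not the actual
 * runtime routine of any particular toolchain. */
uint32_t soft_udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint32_t r = 0;

    if (d == 0) {               /* convention chosen for this sketch only */
        *rem = n;
        return UINT32_MAX;
    }

    for (int i = 31; i >= 0; i--) {
        uint32_t carry = r >> 31;           /* bit shifted out of r */
        r = (r << 1) | ((n >> i) & 1u);     /* bring down next dividend bit */
        if (carry || r >= d) {              /* partial remainder >= divisor */
            r -= d;                         /* wraps correctly when carry set */
            q |= (uint32_t)1 << i;
        }
    }

    *rem = r;
    return q;
}

/* Left shift by n positions using only single-bit shifts, as a processor
 * without a barrel shifter would have to do. */
uint32_t soft_shl(uint32_t x, unsigned n)
{
    while (n--) {
        x <<= 1;
    }
    return x;
}
```

With the hardware divider and barrel-shifter enabled, the compiler can replace calls like these with single instructions, which is where the performance gain comes from.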

Software Bottlenecks Converted to Coprocessing Hardware

Custom hardware logic can be designed to offload an FPGA embedded processor. When a software bottleneck is identified, a designer can choose to convert the bottleneck algorithm into custom hardware. Custom software instructions can then be defined to operate the hardware coprocessor.

Both MicroBlaze and Virtex-4 PowerPC include very low-latency access points into the processor, which are ideal for connecting custom coprocessing hardware. Virtex-4 introduces the Auxiliary Processing Unit (APU) for the PowerPC. The APU provides a direct connection from the PowerPC to coprocessing hardware. In MicroBlaze, the low-latency interface is called the Fast Simplex Link (FSL) bus. The FSL bus contains multiple channels of dedicated, unidirectional, 32-bit interfaces. Because the FSL channels are dedicated, no arbitration or bus mastering is required. This allows an extremely fast interface to the processor.
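The FSL arrangement described above can be pictured with a small host-side model. This is only a sketch under stated assumptions: the type `fsl_channel_t`, the functions `fsl_put`, `fsl_get`, and `coprocessor_step`, the FIFO depth, and the squaring operation are all hypothetical and chosen for illustration. On real hardware, the Xilinx-supplied FSL access macros (such as `putfsl` and `getfsl` in `mb_interface.h`) would take the place of the put and get functions, and the coprocessor would be FPGA logic rather than a C function.

```c
#include <stdbool.h>
#include <stdint.h>

#define FSL_DEPTH 16u   /* hypothetical FIFO depth for this model */

/* Host-side model of one dedicated, unidirectional, 32-bit channel. */
typedef struct {
    uint32_t data[FSL_DEPTH];
    unsigned head;      /* index of the next word to read */
    unsigned count;     /* number of words currently queued */
} fsl_channel_t;

/* Non-blocking write: returns false if the channel is full. */
bool fsl_put(fsl_channel_t *ch, uint32_t word)
{
    if (ch->count == FSL_DEPTH) {
        return false;
    }
    ch->data[(ch->head + ch->count) % FSL_DEPTH] = word;
    ch->count++;
    return true;
}

/* Non-blocking read: returns false if the channel is empty. */
bool fsl_get(fsl_channel_t *ch, uint32_t *word)
{
    if (ch->count == 0) {
        return false;
    }
    *word = ch->data[ch->head];
    ch->head = (ch->head + 1) % FSL_DEPTH;
    ch->count--;
    return true;
}

/* Stand-in for the coprocessor: drains the processor-to-hardware channel,
 * squares each word, and queues the result on the return channel. */
void coprocessor_step(fsl_channel_t *to_hw, fsl_channel_t *from_hw)
{
    uint32_t x;
    while (from_hw->count < FSL_DEPTH && fsl_get(to_hw, &x)) {
        fsl_put(from_hw, x * x);
    }
}
```

Because each direction has its own channel with a single writer and a single reader, nothing in the model needs to arbitrate for access, which mirrors the reason the text gives for the low latency of the interface.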

Converting a software bottleneck into hardware may seem like a very difficult task. Traditionally, a software designer identifies the bottleneck, after which the algorithm is handed off to an FPGA designer who writes VHDL or Verilog code to create the hardware coprocessor. Fortunately, this process has been greatly simplified by tools that can generate FPGA hardware from C code. One such tool is CoDeveloper from Impulse Accelerated Technologies. This tool allows a single designer who is familiar with C to port a software bottleneck into a custom piece of coprocessing FPGA hardware using CoDeveloper's Impulse C libraries.

Here are some examples of algorithms that could be targeted for hardware-based coprocessors:

Inverse Discrete Cosine Transform (IDCT), used in JPEG decoding
Fast Fourier Transform
MP3 decode
Triple-DES and AES encryption
Matrix manipulation

Any operation that is algorithmic, mathematical, or parallel is a good candidate for a hardware coprocessor. FPGA logic consumption is traded for performance. The advantages can be enormous, improving performance by a factor of tens or even hundreds.
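As an illustration of why such kernels suit hardware, consider matrix manipulation from the list above. The following is a minimal sketch, assuming a fixed 4x4 integer matrix; the function name `mat_mul` and the size `N` are hypothetical choices for this example. A processor executes the 64 multiply-accumulate operations one after another, but each of the 16 output elements depends only on the inputs, so a hardware implementation is free to compute many of them at the same time.

```c
#include <stdint.h>

#define N 4   /* hypothetical fixed matrix size for this sketch */

/* c = a * b for N x N integer matrices.
 * Each c[i][j] is an independent dot product of a row of a and a
 * column of b, which is what makes the kernel easy to parallelize. */
void mat_mul(int32_t a[N][N], int32_t b[N][N], int32_t c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int32_t acc = 0;
            for (int k = 0; k < N; k++) {
                acc += a[i][k] * b[k][j];   /* one multiply-accumulate */
            }
            c[i][j] = acc;
        }
    }
}
```

How much of this parallelism is actually exploited depends on the logic and multiplier resources the designer is willing to spend, which is the logic-for-performance trade described above.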