A.4. Optimization Techniques That Are Not FPGA-Specific
An experienced embedded system designer is familiar with many of the techniques discussed in this section. For that reason, significant detail is not given here. The main objective of this section is to emphasize that many standard microprocessor design optimization techniques apply to FPGA embedded processor design and can have excellent benefits.
Many optimizations are available at the application-code level. Some techniques apply to how the code is written; others affect how the compiler handles the code.
Compiler optimizations are available in Xilinx Platform Studio (XPS) based on the GNU C compiler (GCC). The current version of the MicroBlaze and PowerPC GCC-based compilers in EDK 6.3 is 2.95.3-4. These compilers offer several optimization levels: Level 0 disables optimization to keep debugging straightforward, Levels 1 through 3 apply progressively more aggressive speed optimizations, and a separate size-reduction optimization favors compact code over raw speed.
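As a rough illustration, a loop such as the following responds well to the higher optimization levels. The -O flags shown in the comments are standard GCC behavior; the driver name varies by target (e.g., mb-gcc for MicroBlaze under EDK), and the file name is only a placeholder.

```c
/* A loop that benefits from the higher optimization levels: at -O2 and
 * above, GCC can unroll it and schedule the arithmetic more aggressively.
 * Typical invocations (illustrative):
 *   gcc -O0 app.c   - no optimization, easiest to debug
 *   gcc -O2 app.c   - general speed optimization
 *   gcc -Os app.c   - the size-reduction optimization
 */
static long sum_of_squares(int n)
{
    long total = 0;
    for (int i = 1; i <= n; i++)
        total += (long)i * i;
    return total;
}
```

The function's result is identical at every optimization level; only the generated code's size and speed change, which is what makes such kernels useful for comparing compiler settings.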
Use of Manufacturer-Optimized Instructions
Xilinx provides several customized library functions that have been streamlined for Xilinx embedded processors. One example is xil_printf. This function is nearly identical to the standard printf, with the following exceptions: support for real (floating-point) types is removed, the function is not reentrant, and long long (64-bit) types are not supported. With these restrictions, the xil_printf function is 2,953 bytes, making it much smaller than printf, which is 51,788 bytes.
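A sketch of the idea behind such slimmed-down functions (hypothetical illustration code, not Xilinx's implementation): a formatter restricted to %d, %x, %c, and %s pulls in no floating-point or 64-bit conversion routines, which is where much of a full printf's footprint lives. This version formats into a caller-supplied buffer so it can be exercised on a host machine.

```c
#include <stdarg.h>

/* put_uint: write v in the given base (10 or 16); returns chars written. */
static int put_uint(char *out, unsigned v, unsigned base)
{
    char tmp[12];
    int n = 0;
    do {
        unsigned d = v % base;
        tmp[n++] = (char)(d < 10 ? '0' + d : 'a' + (d - 10));
        v /= base;
    } while (v != 0);
    for (int i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];
    return n;
}

/* mini_format: hypothetical reduced formatter in the spirit of xil_printf.
 * Only %d, %x, %c, and %s are handled, so no floating-point or long long
 * support code is linked in. Like sprintf, the caller must size the
 * buffer adequately. Returns the number of characters written. */
static int mini_format(char *out, const char *fmt, ...)
{
    char *start = out;
    va_list ap;
    va_start(ap, fmt);
    while (*fmt) {
        if (*fmt != '%') { *out++ = *fmt++; continue; }
        fmt++;                      /* skip '%' */
        if (*fmt == '\0') break;    /* lone trailing '%' */
        switch (*fmt) {
        case 'd': {
            int v = va_arg(ap, int);
            if (v < 0) { *out++ = '-'; v = -v; }
            out += put_uint(out, (unsigned)v, 10);
            break;
        }
        case 'x':
            out += put_uint(out, va_arg(ap, unsigned), 16);
            break;
        case 'c':
            *out++ = (char)va_arg(ap, int);
            break;
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s) *out++ = *s++;
            break;
        }
        default:                    /* unknown specifier: emit it verbatim */
            *out++ = *fmt;
            break;
        }
        fmt++;
    }
    *out = '\0';
    va_end(ap);
    return (int)(out - start);
}
```

Every conversion the formatter omits is code the linker never has to pull in, which is precisely how the large size gap between xil_printf and printf arises.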
Assembly, including inline assembly, is supported by GCC. As with any microprocessor, assembly becomes very useful in fully optimizing time-critical functions. Be aware, however, that some compilers do not optimize the remaining C code in a file if inline assembly is also used in that file. Also, assembly code does not enjoy the code portability advantages of C.
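GCC's extended inline assembly embeds target instructions directly in C source. Real MicroBlaze or PowerPC mnemonics will not assemble on a host machine, so this portable sketch demonstrates only the syntax, using an empty template with a "memory" clobber (a pure compiler barrier):

```c
/* GCC extended inline assembly has the form
 *   __asm__ volatile(template : outputs : inputs : clobbers);
 * Target-specific mnemonics (MicroBlaze, PowerPC) would go in the
 * template string on real hardware; the empty template below is a
 * portable compiler barrier, shown here only to illustrate the syntax. */
static int publish_value(volatile int *slot, int value)
{
    int previous = *slot;
    __asm__ volatile("" ::: "memory");  /* compiler may not reorder memory ops across this */
    *slot = value;
    return previous;
}
```

Note that __asm__ and the extended syntax are GCC extensions, so code written this way carries the same portability caveat raised above.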
Many other code-related optimizations can and should be considered when optimizing an FPGA embedded processor:
Many processors provide access to fast, local memory as well as an interface to slower, secondary memory. The same is true of FPGA embedded processors. The way this memory is used has a significant effect on performance. As with other processors, memory usage in an FPGA embedded processor can be controlled with a linker script.
Local Memory Only
The fastest possible memory option is to put everything in local memory. Xilinx local memory is made up of large FPGA memory blocks called BlockRAM (BRAM). Embedded processor accesses to BRAM happen in a single bus cycle. Since the processor and bus run at the same frequency in MicroBlaze, instructions stored in BRAM are executed at the full MicroBlaze processor frequency. In a MicroBlaze system, BRAM is essentially equivalent in performance to a Level 1 (L1) cache. The PowerPC can run at frequencies greater than the bus and has true, built-in L1 cache. Therefore, BRAM in a PowerPC system is equivalent in performance to a Level 2 (L2) cache.
Xilinx FPGA BRAM quantities differ by device. For example, the 1.5 million gate Spartan-3 device (XC3S1500) has a total capacity of 64KB, whereas the 400,000 gate Spartan-3 device (XC3S400) has half as much at 32KB. An embedded designer using FPGAs should refer to the device family datasheet to review a specific chip's BRAM capacity.
If the designer's program fits entirely within local memory, the designer achieves optimal memory performance. However, many embedded programs exceed this capacity.
External Memory Only
Xilinx provides several memory controllers that interface with a variety of external memory devices. These memory controllers are connected to the processor's peripheral bus. The three types of volatile memory supported by Xilinx are static RAM (SRAM), single-data-rate synchronous dynamic RAM (SDRAM), and double data rate (DDR) SDRAM. The SRAM controller is the smallest and simplest inside the FPGA, but SRAM is the most expensive of the three memory types. The DDR controller is the largest and most complex inside the FPGA, but fewer FPGA pins are required, and DDR is the least expensive per megabyte.
In addition to the memory access time, the peripheral bus also incurs some latency. In MicroBlaze, the memory controllers are attached to the On-chip Peripheral Bus (OPB). For example, the OPB SDRAM controller requires a four- to six-cycle latency for a write and an eight- to ten-cycle latency for a read, depending on bus clock frequency. The worst possible program performance is achieved by having the entire program reside in external memory. Since optimizing execution speed is a typical goal, an entire program should rarely, if ever, be targeted solely at external memory.
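A back-of-the-envelope model built from these latencies shows why external-only execution is so costly. The access mix in the example is an illustrative assumption, not a measured workload, and local BRAM is assumed to take a single cycle per access as described earlier:

```c
/* Average bus cycles per memory access, given the OPB SDRAM latencies
 * quoted above (reads: 8-10 cycles, writes: 4-6 cycles) versus
 * single-cycle local BRAM. read_frac is the fraction of accesses that
 * are reads -- an assumed workload parameter, not a measurement. */
static double avg_cycles_per_access(double read_frac,
                                    double read_cycles,
                                    double write_cycles)
{
    return read_frac * read_cycles + (1.0 - read_frac) * write_cycles;
}
```

With worst-case latencies and a 75%-read mix, the model gives 9 cycles per access versus 1 for BRAM, consistent with the roughly eightfold external-memory penalty discussed later in this section.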
Cache External Memory
The PowerPC in Xilinx FPGAs has instruction and data cache built into the silicon of the hard processor. Enabling the cache is almost always a performance advantage for the PowerPC.
The MicroBlaze cache architecture is different from the PowerPC because the cache memory is not dedicated silicon. The instruction and data cache controllers are selectable parameters in the MicroBlaze configuration. When these controllers are included, the cache memory is built from BRAM. Therefore, enabling the cache consumes BRAM that otherwise could have been used for local memory. Cache consumes more BRAM than local memory for the same storage size because the cache architecture requires address line tag storage. Additionally, enabling the cache consumes general-purpose logic to build the cache controllers.
For example, an experiment in Spartan-3 enables 8KB of data cache and designates 32MB of external memory to be cached. This cache requires 12 address tag bits. This configuration consumes 124 logic cells and six BRAMs. Only four BRAMs are required in Spartan-3 to achieve 8KB of local memory. In this case, cache is 50% more expensive in terms of BRAM usage than local memory. The two extra BRAMs are used to store address tag bits.
If 1MB of external memory is cached with an 8KB cache, the address tag bits can be reduced to seven. This configuration then requires only five BRAMs rather than six (four BRAMs for the cache and one BRAM for the tags). This is still 25% greater than if the BRAMs are used as local memory.
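The tag-bit counts in both configurations follow directly from the ratio of cached address range to cache size. A sketch of the arithmetic (the helper name is hypothetical, and a direct-mapped organization with power-of-two sizes is assumed):

```c
/* Address tag bits needed by a cache: enough bits to identify which of
 * the (cacheable_bytes / cache_bytes) aligned regions a cached line
 * came from, i.e. log2 of that ratio. Both sizes are assumed to be
 * powers of two. */
static unsigned tag_bits(unsigned long cacheable_bytes,
                         unsigned long cache_bytes)
{
    unsigned bits = 0;
    for (unsigned long ratio = cacheable_bytes / cache_bytes; ratio > 1; ratio >>= 1)
        bits++;
    return bits;
}
```

For an 8KB cache, 32MB of cacheable range yields 12 tag bits and 1MB yields 7, matching the two configurations above; shrinking the cached range is therefore a direct way to reduce the BRAM spent on tag storage.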
Additionally, the achievable system frequency may be reduced when the cache is enabled. In one example, the system without any cache is capable of running at 75MHz; the system with cache is capable of running at only 60MHz. Enabling the cache controller adds logic and complexity to the design, decreasing the achieved system frequency during FPGA place and route. Therefore, in addition to consuming FPGA BRAM resources that may have otherwise been used to increase local memory, the cache implementation may also cause the overall system frequency to decrease.
Considering these cautions, enabling the MicroBlaze cache, especially the instruction cache, may improve performance, even when the system must run at a lower frequency. Testing has shown that a 60MHz system with instruction cache enabled can have a 150% advantage over a 75MHz system without instruction cache (both systems store the entire program in external memory). When both instruction and data caches are enabled, the 60MHz system outperforms the 75MHz system by 308%.
Note that this particular test example is not the most practical, since the entire test program fits in the cache. A more realistic experiment would use an application larger than the cache. Another precaution concerns applications that frequently jump beyond the reach of the cache: repeated cache misses degrade performance, sometimes making cached external memory slower than uncached external memory.
Enabling the cache is always worth an experiment to determine if it improves the performance for your particular application.
Partitioning Code into Internal, External, and Cached Memory
The memory architecture that provides the best performance is one that uses only local memory. However, this architecture is not always practical, since many useful programs exceed the available capacity of the local memory. On the other hand, running exclusively from external memory can carry a performance penalty of more than eight times, due to the peripheral bus latency. Caching the external memory is an excellent choice for the PowerPC. Caching the external memory in MicroBlaze definitely improves results, but an alternative method, described next, may produce better results.
For MicroBlaze, perhaps the optimal memory configuration is to wisely partition the program code, maximizing the system frequency and local memory size. Critical data, instructions, and stack are placed in local memory. Data cache is not used, allowing for a larger local memory bank. If the local memory is not large enough to contain all instructions, the designer should consider enabling the instruction cache for the address range in external memory used for instructions.
By not spending BRAM on data cache, the local memory bank can be made larger. An instruction cache for the instructions assigned to external memory can be very effective. Experimentation or profiling shows which code items are most heavily accessed; assigning these items to local memory provides a greater performance improvement than caching them.
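One common way to express such a partition in source code is GCC's section attribute, which pins individual functions or data objects to a chosen memory region. The section names below are placeholders for illustration; in an XPS project the real section names and their mapping to BRAM or external memory come from the generated linker script. On a host build the custom sections are simply emitted alongside .text and .data, so the sketch still runs.

```c
/* Hypothetical partitioning sketch: a profiling-identified hot function
 * and its coefficient table are pinned to sections that a linker script
 * would place in local BRAM, while everything else defaults to external
 * memory. Section names here are placeholders, not XPS-generated names. */
__attribute__((section(".local_bram_data")))
static int fir_coeffs[4] = {1, 2, 3, 4};        /* hot data -> local memory */

__attribute__((section(".local_bram_text")))
int fir_step(int sample)                        /* hot code -> local memory */
{
    int acc = 0;
    for (int i = 0; i < 4; i++)
        acc += fir_coeffs[i] * sample;
    return acc;
}
```

Because the attribute operates per symbol, the split can follow profiling results exactly: only the measured hot spots consume scarce BRAM, and the partition can be retuned without touching the code's logic.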
For example, Express Logic's Thread-Metric test suite has been used to demonstrate how partitioning a small piece of code in local memory can result in a significant performance improvement. One function in the Thread-Metric Basic Processing test is identified as time-critical. The function's data section (consisting of 19% of the total code size) is allocated to local memory, and the instruction cache is enabled. The 60MHz cached and partitioned-program system achieves performance that is 560% better than running a noncached, nonpartitioned 75MHz system using only external memory.
However, the 75MHz system shows even more improvement with code partitioning. If the time-critical function's data and text sections (22% of the total code size) are assigned to local memory on the 75MHz system, a 710% improvement is realized, even with no instruction cache for the remainder of the code assigned to external memory.
In this one case, the optimal memory configuration is one that maximizes local memory and system frequency without cache. In other systems where the critical code is not so easily pinpointed, a cached system may perform better. Designers should experiment with both methods to determine what is optimal for their design.