4.14. Memory and Stream Performance Considerations
FPGA-based computing platforms may include many different types of memory, some of which are embedded within the FPGA, and others of which are external. Embedded memory is integrated within the FPGA fabric itself and may in fact be implemented using the same internal resources that are used for logic elements. External memory, on the other hand, is located in one or more separate chips that are connected to the FPGA through board-level connections. FPGA platforms most often include some external memory such as SRAM or Flash. All popular FPGAs today also include some configurable amount of on-chip embedded memory.
There are many ways to configure a platform to use the available external or embedded memories. Platforms can be configured to have program, data, and cache storage located in either embedded or external memory or a combination of the two. Additionally, memory may be connected to the CPU and/or FPGA logic in many different ways, perhaps over multiple on-chip bus interfaces. As a result, there are many considerations in deciding how to use the available memory and, more generally, how data should move through an application. Some of these issues are discussed at greater length in Appendix A.
The Impulse C programming model supports both external and embedded memories, in many different configurations, for use as shared memory available to Impulse C processes. Using the Impulse C library, you can allocate space from a specific memory and copy blocks of data to and from that memory in any process that has access to the shared memory. This means that memories can be used as convenient alternatives to streams for moving data from process to process. So how do you know when to use shared memory for communication and when to use streams?
For communication between two hardware processes, the choice is simple. If the data is sequential, meaning that it is produced by one process in the same order it will be used by the next, a stream is by far the most efficient means of data transfer. If, however, the processes require random or irregular access to the data, a shared memory may be the only option.
For communication between a software process and a hardware process, however, the answer is more complex. The best solution depends very much on the application and the types and configuration of the memory available in the selected platform. The remainder of this section looks at some micro-benchmark results for three specific platform configurations and shows how you might use similar data from your own applications to make such memory-related decisions.
For the purposes of evaluating streams versus memories in a number of possible platforms, a micro-benchmark was created to test data transfer performance using the typical styles of communication found in Impulse C applications. The first three tests measure three common uses of stream communication between a software process and a hardware process.
Another three tests (called Shared Mem-4B, Shared Mem-16B, and Shared Mem-1024B) measure the performance of direct transfers from memory to an Impulse C hardware process using the Impulse C shared memory support. In these three tests, the hardware process repeatedly reads blocks of data from the external memory into a local array. These tests emulate applications in which the data in memory is ready for processing and can be directly read by the hardware process without CPU intervention. The three tests differ only in the size of the blocks transferred to represent different types of applications. Applications requiring random access, for example, might need to read only four bytes at a time, whereas sequential processing algorithms might need to read much larger sequential blocks at a time.
Memory Test Results for the Altera Nios Platform
The table in Figure 4-10 displays the micro-benchmark results for our first configuration, which consists of an Altera Nios embedded soft processor implemented on an Altera Stratix FPGA.
In this particular configuration, the Nios CPU was connected to an external SRAM and to our test Impulse C hardware process via Altera's Avalon bus. These results seem to indicate that stream communications are quite efficient in this platform/configuration if the data requires some computation on the CPU and needs to be subsequently sent to a hardware process. If, however, the data in memory does not require any processing and can be directly accessed by the hardware process, the shared memory approach is more efficient. If the application is accessing data randomly, just four bytes at a time, the performance difference is not significant. In that case, stream-based communication might be used because it is easier to program, requiring less complicated process-to-process synchronization.
How do we explain these results? The Avalon bus architecture is unique in that there is not a single set of data and address signals shared by all components on the bus. Instead, the Avalon bus is customized to the set of components actually attached to it and is automatically generated by software. A significant result of this is that two masters can perform transfers at the same time if they are not communicating with the same slave. As an example of particular interest to us, the CPU can access Impulse C streams and signals at the same time that an Impulse C hardware process is transferring data to and from an external RAM. This means that a software process on the CPU may be polling (waiting for) a signal, or receiving data from a hardware process, while the hardware process is simultaneously transferring data to or from external memory.
Note also that, in our test, the program running on the Nios processor was stored in the external memory. This means that the CPU may have slowed down the shared memory tests by making frequent requests for instructions from the external RAM. Another approach is to use a separate embedded memory for program storage, which would increase performance: the Avalon bus architecture permits the CPU to access the embedded memory while the hardware process simultaneously accesses the external memory. This would also increase the performance of the stream tests, because the embedded memory is much faster than external memory and program execution would therefore be faster.
Memory Test Results for the Xilinx PowerPC Platform
Figure 4-11 displays the micro-benchmark results for our second sample configuration, which includes an embedded (but hard rather than soft) PowerPC processor as supplied in the Xilinx Virtex-II Pro FPGA.
In this test, the FPGA was configured with both a PLB (Processor Local Bus) and an OPB (On-chip Peripheral Bus). The test program running on the PowerPC was stored in an embedded (on-chip) memory attached to the PLB, while the external memory was attached to the OPB. Although these busses are standard shared-bus architectures, using two busses allows the PowerPC to execute programs from the embedded memory on the PLB while a hardware process is, at the same time, accessing the external memory on the OPB.
These results indicate that stream performance over the PLB is very poor. The reason for the low performance is currently unknown, but it might be due to the PLB-to-stream bridge components. The conclusion here is that if the data in memory does not require any computation on the PowerPC and can be used directly by the hardware process, shared memory is much faster. As a general rule, it is inefficient for a hardware process to access external data through the CPU; it is better to access that memory directly.
Memory Test Results for the Xilinx MicroBlaze Platform
Figure 4-12 shows the micro-benchmark results for our final sample configuration, that of a MicroBlaze soft processor implemented in a Xilinx Virtex-II FPGA.
For this test, the FPGA was configured with a single OPB, embedded memory for program storage, and an external SDRAM. The first thing that stands out from these results is the large transfer rate obtained in the Stream (one-way) test. This result comes from the fact that Impulse C implements MicroBlaze-to-hardware streams using the Fast Simplex Link (FSL) provided for MicroBlaze. FSLs are dedicated FIFO channels connected directly to the MicroBlaze processor, thus avoiding the system bus altogether and providing single-cycle instructions to read and write data to and from hardware.
Although there is only one system bus, the MicroBlaze has dedicated instruction and data lines that can be connected to embedded memory for faster performance. Our sample configuration uses these dedicated connections and disconnects the embedded memory from the OPB to avoid interference from instruction fetching.
At first glance, the memory performance results from this test look good, but there are some additional considerations for this platform related to the use of Impulse C signals. Signals are implemented by the Impulse compiler as memory-mapped I/O on the OPB. In applications making use of shared memory for process-to-process communication, a software process typically waits on a signal from the hardware process to know when the hardware process has completed and has finished writing to memory. Because this sample configuration uses one shared bus, the signal polling interferes with memory accesses. For this reason we've shown two results for each shared memory test. While at smaller block sizes the performance was not adversely affected, we find that at larger block sizes the signal polling significantly reduces performance.
The results also show that, for random access with small block sizes, using shared memory does not provide any advantage, and a streams-based approach would be preferable because it is simpler to program. However, as in the earlier tests using Altera Nios, the performance is significantly better using shared memory with large block sizes, provided that signal polling can be avoided.