8.5. Preliminary Hardware Analysis
Once the algorithm has been ported to use the Impulse C library for streaming data and has been verified using desktop simulation, we can perform some preliminary analysis of the hardware processes before going all the way through hardware synthesis. By simply compiling the hardware processes in Impulse C, we can examine some basic statistics in the optimizer's output report. This information includes stage counts for basic loops as well as pipeline rates and latencies.
Using the optimizer outputs, we can look at all the basic blocks of the main loop in the algorithm and determine that each iteration requires 191 stages, which generally correspond to cycles. Using that information, we can estimate a throughput rate for a given clock frequency. For example, assuming a 50 MHz clock, we can compute 50e6/191, which equates to approximately 261,780 blocks per second (at one block per process iteration). At eight bytes per block, that in turn equates to 2,094,240 bytes per second, or roughly 2 MB per second. This is a rough estimate and assumes ideal conditions, such as data always being available on the input stream.
While this analysis represents only a theoretical maximum throughput, it is useful for gauging the potential benefit of compiling to hardware and for flagging cases where further optimization is clearly necessary. Performing an initial analysis like this can help you avoid building actual hardware (whether that hardware is generated by the C compiler or hand-crafted using an HDL) for processes that turn out to be ill-suited for hardware implementation. This kind of analysis can also help when performing iterative language-level optimizations on an algorithm as it is being refined. Because the actual hardware synthesis process can be time-consuming, using this faster analysis loop significantly reduces the design cycle and allows hardware/software partitioning decisions to be made more accurately.
Initial Results: 10.6X Performance Increase
The results of the test (expressed as the computation times for a specified number of data blocks) were generated using timers available on the MicroBlaze processor and invoked from within the C language test application. (This embedded test bench is described in more detail in Chapter 9.)
The results demonstrated that, for this algorithm running on the Virtex II, a hardware implementation performs 10.6X faster than a software-only solution, even with the modest overhead of data communication between the processor and the FPGA-based encryption algorithm. This is due in part to the extremely low data communication overhead introduced by the Xilinx FSL bus. It is due as well to the compiler's ability to find and exploit low-level parallelism within the inner code loop of the algorithm. As you will learn in Chapter 10, however, the potential speedups for this algorithm are substantially greater.