In this chapter we have taken a legacy C algorithm and made the changes necessary to create a streams-oriented hardware accelerator, one that could be connected to other hardware or software processes (using the streaming interface) and implemented directly in an FPGA. We have made little or no attempt to optimize this algorithm. Instead, we have focused on creating and verifying the algorithm through desktop simulation in a standard C development environment. We have also generated prototype hardware using the Impulse compiler and have obtained some initial performance numbers from the optimizer. These numbers show that, at a minimum, we should be able to accelerate this particular algorithm by an order of magnitude over its software equivalent, at least when that software equivalent is running on an embedded processor.
While a 10.6X speedup is a good start, and suggests that a hardware implementation of this algorithm may be appropriate, it is actually on the low end of what is possible when implementing software algorithms in programmable hardware. For this algorithm, further performance increases, as well as reductions in gate count, can be obtained by optimizing the algorithm itself: for example, by reordering statements to better enable pipelining, or by running the three stages of the triple-DES algorithm in parallel.
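To see why running the three triple-DES stages in parallel can pay off, the following minimal sketch compares cycle counts for a back-to-back arrangement against an overlapped (pipelined) one. It is an idealized model, not a measurement: the function names and the `stage_cycles` parameter are hypothetical, each stage is assumed to take the same number of cycles, successive blocks are assumed to be independent of one another, and stream overhead and stalls are ignored. Within a single block the three stages still depend on each other, so the overlap is across successive blocks.

```c
/* Idealized cycle-count model; all names and parameters are hypothetical. */

/* One block passes through all three DES stages before the next block
   starts, so nothing overlaps. */
unsigned long sequential_cycles(unsigned long n_blocks,
                                unsigned long stage_cycles)
{
    return n_blocks * 3UL * stage_cycles;
}

/* Each stage is a separate process working on a different block at the
   same time. The first block needs three stage times to emerge; after
   that, one block completes per stage time. */
unsigned long pipelined_cycles(unsigned long n_blocks,
                               unsigned long stage_cycles)
{
    if (n_blocks == 0UL) {
        return 0UL;
    }
    return (n_blocks + 2UL) * stage_cycles;
}
```

Under these assumptions the ratio of the two counts approaches 3 as the number of blocks grows, so 3X is an upper bound on what this particular change could contribute. The real gain depends on the actual stage latencies and on how the stages are connected.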
In the next chapter we will detail how to create an embedded, in-system test for this algorithm using the Memec FPGA prototyping board previously mentioned.
In Chapter 10 we will show, step by step, how an application such as this can be iteratively improved, resulting in enormous increases in performance for relatively little cost in added hardware.