Section 10.4. Refinement 3: Improving Streaming Performance

Sometimes the most significant bottleneck in a system is the hardware/software communication interface. In the 3DES example, the data is assumed to be both produced and consumed by the connected CPU (a MicroBlaze processor connected via FSL). This results in a significant amount of data crossing over the software/hardware interface. When the code was initially ported to Impulse C hardware, no attention was given to the communication overhead, so the resulting stream implementation is very inefficient. Consider the following code, which is located at the start of the main processing loop:

    while (co_stream_read(blocks_in, &block[0], sizeof(uint8)) == co_err_none) {
        for (i = 1; i < BLOCKSIZE; i++) {
            co_stream_read(blocks_in, &block[i], sizeof(uint8));
        }

Here, each 64-bit block is transferred eight bits (one character) at a time over an 8-bit stream, even though the hardware is connected to the CPU via a 32-bit bus. As with memories, each stream read or write takes at least one cycle, so reading a single block this way takes at least eight cycles. Furthermore, consider the code immediately following the stream reads:

    left = ((unsigned long)block[0] << 24)
         | ((unsigned long)block[1] << 16)
         | ((unsigned long)block[2] << 8)
         |  (unsigned long)block[3];
    right = ((unsigned long)block[4] << 24)
          | ((unsigned long)block[5] << 16)
          | ((unsigned long)block[6] << 8)
          |  (unsigned long)block[7];

After reading the data eight bits at a time, this code rearranges the data into two 32-bit values, which requires eight loads from the block array and therefore at least eight more cycles. The same situation is also present in the output communications at the end of the main processing loop.
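The cycle counts above can be tallied with a rough lower-bound model. The sketch below is illustrative only: the function name and constants are hypothetical, and it assumes exactly one cycle per stream read and one cycle per load from the intermediate array, which is the minimum the text describes rather than a measured figure.

```c
#define BLOCK_BYTES 8u  /* one 64-bit 3DES block */
#define WORD_BYTES  4u  /* width of the left and right halves */

/* Rough lower bound on input cycles per block for a stream that is
 * width_bytes wide (1, 2, or 4). Assumes one cycle per stream read and
 * one cycle per load from the intermediate array (an assumption taken
 * from the "at least one cycle" wording, not a measurement). */
unsigned input_cycles_lower_bound(unsigned width_bytes)
{
    unsigned reads = BLOCK_BYTES / width_bytes;
    /* Elements narrower than a 32-bit half must be loaded back from the
     * array and merged; full-width reads land directly in left/right. */
    unsigned loads = (width_bytes < WORD_BYTES) ? reads : 0u;
    return reads + loads;
}
```

Under these assumptions, an 8-bit stream costs at least 8 reads plus 8 loads per block on the input side, while a stream as wide as the 32-bit bus needs only 2 reads and no repacking.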

Rewriting the streams interface to use 32-bit streams significantly improves performance. The input and output communication can be rewritten as follows:

    while (co_stream_read(blocks_in, &left, sizeof(left)) == co_err_none) {
        co_stream_read(blocks_in, &right, sizeof(unsigned long));
        ...
        co_stream_write(blocks_out, &right, sizeof(unsigned long));
        co_stream_write(blocks_out, &left, sizeof(unsigned long));
    }

Notice that the block array has been completely eliminated. Obviously this change to the streams specification requires a corresponding change to the producer and consumer processes (which in this example are represented by a single software test bench process running on the embedded MicroBlaze processor), but this is a simple change.
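On the software side, the corresponding change amounts to packing each group of four bytes into one 32-bit word before writing it to the stream, and unpacking each word that comes back. The following is a minimal sketch of such helpers, not the book's test bench code: the function names are hypothetical, and `uint32_t` is used in place of `unsigned long` to make the assumed 32-bit width explicit. The byte order follows the shifts used to build `left` and `right` in the original hardware code.

```c
#include <stdint.h>

/* Pack four bytes into one 32-bit word, most significant byte first,
 * matching the shift order used for left and right in the hardware code. */
uint32_t pack_be32(const uint8_t *b)
{
    return ((uint32_t)b[0] << 24)
         | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)
         |  (uint32_t)b[3];
}

/* Inverse of pack_be32: split a 32-bit word back into four bytes. */
void unpack_be32(uint32_t w, uint8_t *b)
{
    b[0] = (uint8_t)(w >> 24);
    b[1] = (uint8_t)(w >> 16);
    b[2] = (uint8_t)(w >> 8);
    b[3] = (uint8_t)w;
}

/* Hypothetical helper: split one 8-byte block into the two halves that
 * the producer would write to the 32-bit input stream, left first. */
void block_to_words(const uint8_t block[8], uint32_t *left, uint32_t *right)
{
    *left  = pack_be32(&block[0]);
    *right = pack_be32(&block[4]);
}
```

Each packed word would then be passed to a single stream write in place of four byte-wide writes. On the consumer side, the two result words arrive in the order the hardware writes them (right, then left) and can be split back into bytes with `unpack_be32` if a byte array is needed.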

Regenerating hardware with this new communication scheme, we obtain system performance 48 times faster than the software implementation, which was also modified to access the block data in 32-bit chunks. Thus, the new result is nearly 2.5 times faster than the previous result (refinement 2).

Tip

Consider packing multiple data values into a single stream packet to increase process throughput.
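The same idea applies to independent values that are narrower than the stream. As a hedged illustration (the function names and the choice of 16-bit samples are assumptions, not part of the 3DES example), two 16-bit values can share one 32-bit packet so that each stream operation carries two values instead of one:

```c
#include <stdint.h>

/* Combine two 16-bit values into one 32-bit packet; the first value
 * occupies the upper half. */
uint32_t pack_pair(uint16_t first, uint16_t second)
{
    return ((uint32_t)first << 16) | (uint32_t)second;
}

/* Recover both 16-bit values from a packet built by pack_pair. */
void unpack_pair(uint32_t packet, uint16_t *first, uint16_t *second)
{
    *first  = (uint16_t)(packet >> 16);
    *second = (uint16_t)(packet & 0xFFFFu);
}
```

The shifts and masks on the receiving side are cheap in hardware compared with the extra stream transfers they replace.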