Section 8.2. Converting the Algorithm to a Streaming Model

8.2. Converting the Algorithm to a Streaming Model

The inner processing loop of the original (legacy) encryption function is shown in the code excerpt in Figure 8-1. Notice that this function repetitively processes blocks of characters, performing various operations on these blocks (using the SP box and key schedule data) to generate a resulting encrypted block.

As we indicated in the introduction to this chapter, our goal is to create a version of this algorithm that is compatible with hardware compilation, while at the same time making only the minimum necessary changes to the original function. In practice, many optimizations can be introduced right up front, but we will delay those optimizations for a later chapter. Instead we will focus primarily on the movement of data, which is often the most important aspect to consider when moving processing from a traditional processor to an FPGA.

As we've described in previous chapters, the most efficient way to move data from one hardware or software process to another is often via data streams. In fact, even if some other method of data transfer such as DMA will eventually be required (perhaps for performance reasons), the use of streams can greatly simplify the design and debugging of complex hardware/software systems.

To create a streaming implementation of this function, it's first necessary to wrap the core algorithm in a stream read/write loop. This outer loop will read characters from an input data stream, package these characters into the block array, perform the preceding calculations, and then write the results, eight characters at a time, to the output stream.

Before performing these operations on the data stream, however, the encryption algorithm must be initialized with the encryption keys (the key schedule) and the SP box data. In the original legacy C version of the algorithm these values were provided in global arrays. In an Impulse C implementation these values can either be streamed in or accessed via a shared memory. For maximum portability we will stream in the key schedule and SP box data via a separate stream called config_in. This will be done prior to the start of the main code loop.

Note

We could also use a shared memory and memory block read and write functions for this initialization data, but memory types are FPGA platform-specific and can introduce unwanted complexity during hardware simulation. For this reason it's often best to make use of streams during development and defer the use of shared memories for a later design optimization.

Figure 8-1. Inner processing loop of the original encryption function.

 while ( blockCount < NumBlocks ) {   for ( i = 0; i < DES_BLOCKSIZE; i++ ) {      block[i] = inputBlocks[(blockCount * DES_BLOCKSIZE) + i];   }   // Process the block...   // Read input block and place in left/right in big-endian order   //   left = ((unsigned long)block[0] << 24)         |((unsigned long)block[1] << 16)         | ((unsigned long)block[2] << 8)         | (unsigned long)block[3];   right = ((unsigned long)block[4] << 24)         | ((unsigned long)block[5] << 16)         | ((unsigned long)block[6] << 8)         | (unsigned long)block[7];   work = ((left >> 4) ^ right) & 0x0f0f0f0f;   right ^= work;   left ^= work << 4;   work = ((left >> 16) ^ right) & 0xffff;   right ^= work;   left ^= work << 16;   work=((right >> 2) ^ left) & 0x33333333;   left^=work;   right^=(work << 2);   work=((right >> 8) ^ left) & 0xff00ff;   left^=work;   right^=(work << 8);   right=(right << 1) | (right >> 31);   work=(left ^ right) & 0xaaaaaaaa;   left^=work;   right^=work;   left=(left << 1) | (left >> 31);   // First key   F(left,right,(*ks)[0]);   F(right,left,(*ks)[1]);   F(left,right,(*ks)[2]);   F(right,left,(*ks)[3]);   F(left,right,(*ks)[4]);   F(right,left,(*ks)[5]);   F(left,right,(*ks)[6]);   F(right,left,(*ks)[7]);   F(left,right,(*ks)[8]);   F(right,left,(*ks)[9]);   F(left,right,(*ks)[10]);   F(right,left,(*ks)[11]);   F(left,right,(*ks)[12]);   F(right,left,(*ks)[13]);   F(left,right,(*ks)[14]);   F(right,left,(*ks)[15]);   // Second key (must be created in opposite mode to first key)   F(right,left,(*ks)[16]);   F(left,right,(*ks)[17]);   F(right,left,(*ks)[18]);   F(left,right,(*ks)[19]);   F(right,left,(*ks)[20]);   F(left,right,(*ks)[21]);   F(right,left,(*ks)[22]);   F(left,right,(*ks)[23]);   F(right,left,(*ks)[24]);   F(left,right,(*ks)[25]);   F(right,left,(*ks)[26]);   F(left,right,(*ks)[27]);   F(right,left,(*ks)[28]);   F(left,right,(*ks)[29]);   F(right,left,(*ks)[30]);   F(left,right,(*ks)[31]);   // Third key   F(left,right,(*ks)[32]);   F(right,left,(*ks)[33]);   F(left,right,(*ks)[34]);   F(right,left,(*ks)[35]);   F(left,right,(*ks)[36]);   F(right,left,(*ks)[37]);   F(left,right,(*ks)[38]);   F(right,left,(*ks)[39]);   F(left,right,(*ks)[40]);   F(right,left,(*ks)[41]);   F(left,right,(*ks)[42]);   F(right,left,(*ks)[43]);   F(left,right,(*ks)[44]);   F(right,left,(*ks)[45]);   F(left,right,(*ks)[46]);   F(right,left,(*ks)[47]);   // Inverse permutation, also from Hoey   // via Outerbridge and Schneier   right=(right << 31)|(right >> 1);   work=(left ^ right)&0xaaaaaaaa;   left^=work;   right^=work;   left=(left >> 1)|(left  << 31);   work=((left >> 8)^right) & 0xff00ff;   right^=work;   left^=work << 8;   work=((left >> 2)^right) & 0x33333333;   right^=work;   left^=work << 2;   work=((right >> 16)^left) & 0xffff;   left^=work;   right^=work<<16;   work=((right>>4)^left) & 0x0f0f0f0f;   left^=work;   right^=work<<4;   // Put the block into the output stream with final swap   block[0]=(co_int8) (right >> 24);   block[1]=(co_int8) (right >> 16);   block[2]=(co_int8) (right >> 8);   block[3]=(co_int8) right;   block[4]=(co_int8) (left >> 24);   block[5]=(co_int8) (left >> 16);   block[6]=(co_int8) (left >> 8);   block[7]=(co_int8) left;   for ( i=0; i < DES_BLOCKSIZE; i++ ) {          outputBlocks[(blockCount * DES_BLOCKSIZE) + i] = block[i];     }     ++blockCount; }

A summary of the required stream read/write loops appears in Figure 8-2. Note that, for brevity, only the key schedule read loop is shown.

In the complete application (which is listed in Appendix D), this encryption process is connected via its data streams to producer and consumer processes that generate and accept data for the purpose of testing or as part of the complete application that resides in the FPGA. This embedded test bench is described in Chapter 9.

Figure 8-2. Summary of the required stream read/write loops for the encryption process.

 // The key schedule and SP box data stream in via config_in, // while the text data streams in via blocks_in. // void des_ic(co_stream config_in, co_stream blocks_in,        co_stream blocks_out) {   co_int8 i, k;   co_uint8 block[8];   co_uint32 ks[KS_DEPTH][2];   /* Get the key schedule and put it in a local array */   co_stream_open(config_in, O_RDONLY, UINT_TYPE(32));   for ( k = 0; k < 2; k++ ) {      for ( i = 0; i < KS_DEPTH; i++ ) {         if (co_stream_read(config_in, &ks[i][k],                     sizeof(uint32))!=co_err_eos ) {         }      }   }   co_stream_close(config_in);   . . .   // Now read in the data and process, one block at a time   co_stream_open(blocks_in, O_RDONLY, UINT_TYPE(8));   co_stream_open(blocks_out, O_WRONLY, UINT_TYPE(8));   while (co_stream_read(blocks_in, &block[0],                sizeof(uint8)) == co_err_none ) {     for ( i = 1; i < BLOCKSIZE; i++ ) {        if ( co_stream_read(blocks_in, &block[i],                   sizeof(uint8)) == co_err_none ) {        }     }     // Now process the current block     . . .     // Done processing the current block.   }   co_stream_close(blocks_in);  // finished, close the streams   co_stream_close(blocks_out); }