Section 4.10. Avoiding Stream Deadlocks | Practical FPGA Programming in C

4.10. Avoiding Stream Deadlocks

Deadlocks can be among the most difficult problems to resolve in a streaming application, so care must be taken when designing complex, multiprocess applications. A stream deadlock occurs when a process cannot proceed until another process has completed its tasks and written data to its outputs, and that other process is itself unable to proceed. If the two processes are mutually dependent, or both depend on some other blocked process, the system can quickly come to a halt.

The problem of deadlocks is most severe in systems having irregular data (unpredictable numbers of stream outputs for a given number of stream inputs) or in systems having variable cycle delays (such as the example presented in Chapter 13). In some cases stream deadlocks can be removed by increasing the depth of stream buffers, but in most cases this only delays finding a real solution to the problem. Many such situations, in fact, are completely independent of stream buffer sizes.

Consider, for example, the two processes shown in Figure 4-5.

Figure 4-5. The producer and consumer must be designed to avoid deadlocks.

```c
#include <stdio.h>
#include <stdlib.h>
#include "co.h"   // Impulse C library: co_stream, co_parameter, uint32, ...

void Supervisor_run(co_stream S1, co_stream S2, co_parameter iparam)
{
    int iterations = (int)iparam;
    int i, j;
    uint32 local_S1, local_S2;

    co_stream_open(S1, O_WRONLY, UINT_TYPE(32));
    co_stream_open(S2, O_RDONLY, UINT_TYPE(32));

    srand(iterations); // Seed value

    // For each test iteration, send four random values to the stream.
    for ( i = 0; i < iterations; i++ ) {
        // Must send 4 values on S1 before attempting to read from S2...
        for ( j = 0; j < 4; j++ ) {
            local_S1 = rand();
            printf("S1 = %u\n", local_S1);
            co_stream_write(S1, &local_S1, sizeof(uint32));
        }
        // ...then read back the 4 reordered values from S2.
        for ( j = 0; j < 4; j++ ) {
            co_stream_read(S2, &local_S2, sizeof(uint32));
            printf("S2 = %u\n", local_S2);
        }
    }
    co_stream_close(S1);
    co_stream_close(S2);
}

// This process will reverse the order of every block of four input values.
void Worker_run(co_stream S1, co_stream S2)
{
    int i;
    uint32 data[4];

    co_stream_open(S1, O_RDONLY, UINT_TYPE(32));
    co_stream_open(S2, O_WRONLY, UINT_TYPE(32));

    while (!co_stream_eos(S1)) {
        // Cache four values locally...
        for (i = 0; i < 4; i++)
            co_stream_read(S1, &data[i], sizeof(uint32));
        // ...then write them back out in reverse order.
        for (i = 0; i < 4; i++)
            co_stream_write(S2, &data[3-i], sizeof(uint32));
    }
    co_stream_close(S1);
    co_stream_close(S2);
}
```

In this example, notice that the first process, called Supervisor, sends packets of four 32-bit unsigned values on a stream (S1), which are received by a second process called Worker. Supervisor sends each packet using a loop that makes four calls to the co_stream_write function. Worker, after receiving the four values, writes them to its output stream (S2) using the same type of loop, reversing their order.
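When checking the output of a desktop simulation, it can help to have a plain-C reference for what Worker computes. The function below is a hypothetical golden model (the name `reverse_blocks_of_4` is invented for this sketch, and it is not part of the Impulse C library). It assumes the input length is a multiple of four, as it is when Supervisor sends complete packets:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model of Worker: reverse the order of every block of four values.
 * Assumption: n is a multiple of 4, as when Supervisor sends whole packets. */
void reverse_blocks_of_4(const uint32_t *in, uint32_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t base = 0; base < n; base += 4) {
        for (size_t i = 0; i < 4; i++)
            out[base + i] = in[base + 3 - i];   /* same order as writing data[3-i] */
    }
}
```

For the input 1, 2, 3, 4, 5, 6, 7, 8 this produces 4, 3, 2, 1, 8, 7, 6, 5, which is the sequence Supervisor should print for S2.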

This is a very simple example of a process (Worker) that must cache some number of values locally before performing its operation (reversing the order of the values). In this example, the assumption being made by the programmer is that the worker process will accept these four values without blocking. In this case it is a valid assumption, but you can easily imagine situations in which the controlling process (the Supervisor) has not been so carefully designed and in which a deadlock is inevitable. The following code represents one such situation:

```c
// Send random values to the stream, one value at a time.
for ( i = 0; i < iterations * 4; i++ ) {
    local_S1 = rand();
    printf("S1 = %u\n", local_S1);
    co_stream_write(S1, &local_S1, sizeof(uint32));
    co_stream_read(S2, &local_S2, sizeof(uint32));  // This will deadlock!
    printf("S2 = %u\n", local_S2);
}
```

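In this version of the Supervisor processing loop, the programmer has incorrectly assumed that data will become available on stream S2 after only one data element has been placed on stream S1. Because Worker does not produce any output until it has received four values, Supervisor blocks on its first read of S2, while Worker blocks waiting for a second value on S1 that Supervisor will never send. Neither process can proceed, so Supervisor (and hence the rest of the system) is deadlocked and produces no output. Deeper stream buffers do not help here, because the missing values are never written at all.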
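The effect of buffer depth on these loops can be checked with a small software model. The sketch below is a hypothetical, single-threaded stand-in, not Impulse C code or generated hardware: bounded FIFOs play the role of S1 and S2, Worker and a generalized Supervisor are written as state machines that take one step at a time, and `simulate` reports a deadlock when neither process can move. The supervisor writes `k` values and then reads `k` values, so `k = 1` is the loop just shown, `k = 4` matches Figure 4-5, and `k` equal to the total is a hypothetical variant that writes everything before reading anything. All names (`fifo`, `worker_step`, `supervisor_step`, `simulate`) are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 64

/* Bounded FIFO standing in for a co_stream (hypothetical model). */
typedef struct {
    uint32_t buf[MAX_DEPTH];
    size_t head, count, cap;
    int closed;
} fifo;

static int fifo_put(fifo *f, uint32_t v) {
    if (f->count == f->cap) return 0;           /* full: a blocking write waits */
    f->buf[(f->head + f->count) % f->cap] = v;
    f->count++;
    return 1;
}

static int fifo_get(fifo *f, uint32_t *v) {
    if (f->count == 0) return 0;                /* empty: a blocking read waits */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % f->cap;
    f->count--;
    return 1;
}

/* Worker from Figure 4-5: read four values, then write them in reverse order.
 * Each step returns 1 if the process made progress, 0 if it is blocked. */
typedef struct { uint32_t data[4]; int writing, idx, done; } worker;

static int worker_step(worker *w, fifo *s1, fifo *s2) {
    if (w->done) return 0;
    if (!w->writing) {
        if (w->idx == 0 && s1->closed && s1->count == 0) {
            s2->closed = 1;                     /* end of stream: close S2 */
            w->done = 1;
            return 1;
        }
        if (!fifo_get(s1, &w->data[w->idx])) return 0;
        if (++w->idx == 4) { w->writing = 1; w->idx = 0; }
        return 1;
    }
    if (!fifo_put(s2, w->data[3 - w->idx])) return 0;
    if (++w->idx == 4) { w->writing = 0; w->idx = 0; }
    return 1;
}

/* Supervisor that writes k values, then reads k values, until total are sent. */
typedef struct { int k, total, sent, in_block, reading, done; } supervisor;

static int supervisor_step(supervisor *s, fifo *s1, fifo *s2) {
    uint32_t v;
    if (s->done) return 0;
    if (!s->reading) {
        if (s->sent == s->total) { s1->closed = 1; s->done = 1; return 1; }
        if (!fifo_put(s1, (uint32_t)s->sent)) return 0;
        s->sent++;
        if (++s->in_block == s->k) { s->reading = 1; s->in_block = 0; }
        return 1;
    }
    if (!fifo_get(s2, &v)) return 0;
    if (++s->in_block == s->k) { s->reading = 0; s->in_block = 0; }
    return 1;
}

/* Returns 1 if both processes finish, 0 if neither can move (deadlock).
 * k must divide total, and total must be a multiple of 4. */
int simulate(int k, int total, size_t depth) {
    assert(depth >= 1 && depth <= MAX_DEPTH && total % k == 0 && total % 4 == 0);
    fifo s1 = {.cap = depth}, s2 = {.cap = depth};
    worker w = {0};
    supervisor s = {.k = k, .total = total};
    for (;;) {
        int progress = supervisor_step(&s, &s1, &s2);
        progress |= worker_step(&w, &s1, &s2);
        if (s.done && w.done) return 1;
        if (!progress) return 0;
    }
}
```

In this model `simulate(1, 32, d)` returns 0 for every depth from 1 to 64, while `simulate(4, 32, 1)` returns 1. The write-everything-first variant shows the buffer-depth effect mentioned at the start of this section: `simulate(8, 8, 4)` completes, but `simulate(16, 16, 4)` deadlocks and only completes with deeper buffers (`simulate(16, 16, 64)`), so enlarging the buffers merely moves the point of failure.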

Note

Be careful with the design of your streams, and always consider issues of process and stream synchronization. While debugging tools can help you find where deadlocks are occurring, it is not always trivial to resolve them in the most efficient way, or in extreme cases to resolve them at all without a substantial redesign.

Using Nonblocking Stream Reads

What if you need to create an equivalent to the Supervisor function that does not depend on a particular data packet length? In fact, what if the Worker process produces reordered outputs of arbitrary and constantly changing lengths? This is a common situation, particularly for pattern matching and searching functions, and one solution is to use a nonblocking stream read function.

Figure 4-6 demonstrates one alternative way to write the Supervisor function. In this version, the blocking co_stream_read function has been replaced with the nonblocking co_stream_read_nb function. After each value is written to S1, the processing loop calls co_stream_read_nb repeatedly to poll Worker's output stream (S2), printing any values that are ready and moving on as soon as none remain. Polling has some cost in cycles, so this introduces some delay into the system, but it resolves the deadlock in this example.

Figure 4-6. The nonblocking stream read function co_stream_read_nb can resolve many deadlock problems.

```c
void Supervisor_run(co_stream S1, co_stream S2, co_parameter iparam)
{
    int    iterations = (int)iparam;
    int    i;
    uint32 local_S1;
    uint32 local_S2;

    co_stream_open(S1, O_WRONLY, UINT_TYPE(32));
    co_stream_open(S2, O_RDONLY, UINT_TYPE(32));

    srand(iterations); // Seed value

    // For each test iteration, send four random values to the stream.
    for ( i = 0; i < iterations * 4; i++ ) {
        local_S1 = rand();
        printf("S1 = %u\n", local_S1);
        co_stream_write(S1, &local_S1, sizeof(uint32));
        // Poll S2: print whatever results are ready, but never wait for them.
        while (co_stream_read_nb(S2, &local_S2, sizeof(uint32)))
            printf("S2 = %u\n", local_S2);
    }
    co_stream_close(S1);
    co_stream_close(S2);
}
```
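The sketch below models Figure 4-6 in plain C, using bounded FIFOs as stand-ins for the streams; it is a hypothetical single-threaded model, not Impulse C code. Worker is generalized to reverse blocks of `block` values, to show that the polling loop does not need to know the packet length. One detail is added to the listing: when the loop writes its last value, the final block of results may not be ready yet, so the model closes S1 and then drains S2 with blocking reads until Worker closes it. That drain step, and every name here (`poller_step`, `simulate_polling`, and so on), is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_MAX 64

/* Bounded FIFO standing in for a co_stream (hypothetical model). */
typedef struct {
    uint32_t buf[CAP_MAX];
    size_t head, count, cap;
    int closed;
} fifo;

static int fifo_put(fifo *f, uint32_t v) {
    if (f->count == f->cap) return 0;                 /* full: writer blocks */
    f->buf[(f->head + f->count) % f->cap] = v;
    f->count++;
    return 1;
}

static int fifo_get(fifo *f, uint32_t *v) {
    if (f->count == 0) return 0;                      /* empty: nothing to read */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % f->cap;
    f->count--;
    return 1;
}

/* Worker generalized to reverse blocks of `block` values (1 <= block <= 8). */
typedef struct { uint32_t data[8]; int block, writing, idx, done; } worker;

static int worker_step(worker *w, fifo *s1, fifo *s2) {
    if (w->done) return 0;
    if (!w->writing) {
        if (w->idx == 0 && s1->closed && s1->count == 0) {
            s2->closed = 1;                           /* propagate end of stream */
            w->done = 1;
            return 1;
        }
        if (!fifo_get(s1, &w->data[w->idx])) return 0;
        if (++w->idx == w->block) { w->writing = 1; w->idx = 0; }
        return 1;
    }
    if (!fifo_put(s2, w->data[w->block - 1 - w->idx])) return 0;
    if (++w->idx == w->block) { w->writing = 0; w->idx = 0; }
    return 1;
}

/* Supervisor of Figure 4-6: a blocking write, then a nonblocking poll of S2
 * until it is empty. DRAIN is the assumed extra step after the loop. */
enum { SEND, POLL, DRAIN };
typedef struct { int total, sent, received, state, done; uint32_t out[CAP_MAX]; } poller;

static int poller_step(poller *p, fifo *s1, fifo *s2) {
    uint32_t v;
    if (p->done) return 0;
    if (p->state == SEND) {
        if (p->sent == p->total) { s1->closed = 1; p->state = DRAIN; return 1; }
        if (!fifo_put(s1, (uint32_t)p->sent)) return 0;
        p->sent++;
        p->state = POLL;
        return 1;
    }
    if (p->state == POLL) {                           /* never waits */
        while (fifo_get(s2, &v)) p->out[p->received++] = v;
        p->state = SEND;
        return 1;
    }
    if (fifo_get(s2, &v)) { p->out[p->received++] = v; return 1; }
    if (s2->closed) { p->done = 1; return 1; }
    return 0;                                         /* DRAIN: wait for data */
}

/* Returns 1 if the run completes, 0 if both processes are blocked. */
int simulate_polling(int block, int total, size_t depth1, size_t depth2, poller *p) {
    assert(block >= 1 && block <= 8 && total % block == 0 && total <= CAP_MAX);
    assert(depth1 >= 1 && depth1 <= CAP_MAX && depth2 >= 1 && depth2 <= CAP_MAX);
    fifo s1 = {.cap = depth1}, s2 = {.cap = depth2};
    worker w = {.block = block};
    *p = (poller){.total = total};
    for (;;) {
        int progress = poller_step(p, &s1, &s2);
        progress |= worker_step(&w, &s1, &s2);
        if (p->done && w.done) return 1;
        if (!progress) return 0;
    }
}
```

Running `simulate_polling(b, 6 * b, 1, b, &p)` for b = 1, 2, 4, 8 completes and delivers every value, block-reversed, even with S1 only one value deep. In this model, though, the fix still relies on S2 holding at least one full output block: because the write to S1 is still a blocking call, `simulate_polling(4, 8, 1, 1, &p)` reports a deadlock, with Supervisor blocked writing to a full S1 while Worker is blocked writing to a full S2.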

Deadlocks and the PIPELINE Pragma

When using the PIPELINE pragma, it is possible for the generated hardware to exhibit deadlock conditions that were not present during desktop simulation. Pipelining is a parallelizing technique that allows multiple iterations of a loop to execute at the same time, each in a different stage. When a loop that reads one input value and writes one result per iteration is converted to a pipeline, the resulting hardware may not produce any output until it has received several input values, because multiple iterations are in flight at once. A process that waits for each result before sending the next input, like the deadlocking Supervisor loop shown earlier, can therefore stall. Pipelining and the PIPELINE pragma are discussed in more detail in later chapters.
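The latency effect can be sketched with a shift-register model of a pipelined loop. This is a hypothetical illustration, not what the PIPELINE pragma generates: the depth of 3 and the loop body (doubling each value) are arbitrary, and the model assumes the pipeline advances only when it accepts a new input, that is, it stalls whenever its input stream is empty. The names `pipeline`, `pipeline_step`, and `lockstep_sent_before_stall` are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define PIPE_DEPTH 3   /* hypothetical pipeline latency, in loop iterations */

/* Pipelined loop whose body doubles each input (hypothetical).
 * Assumption: stages advance only when a new input is accepted. */
typedef struct {
    uint32_t stage[PIPE_DEPTH];
    int valid[PIPE_DEPTH];
} pipeline;

/* Accept one input. Returns 1 and sets *out when a result leaves the last
 * stage; the first PIPE_DEPTH calls produce no result. */
int pipeline_step(pipeline *p, uint32_t in, uint32_t *out)
{
    int out_valid = p->valid[PIPE_DEPTH - 1];
    if (out_valid) *out = 2 * p->stage[PIPE_DEPTH - 1];
    for (int s = PIPE_DEPTH - 1; s > 0; s--) {   /* every stage moves forward */
        p->stage[s] = p->stage[s - 1];
        p->valid[s] = p->valid[s - 1];
    }
    p->stage[0] = in;
    p->valid[0] = 1;
    return out_valid;
}

/* A caller in the style of the deadlocking Supervisor loop: it sends one value
 * and insists on one result before sending the next. Returns how many values
 * were sent before it would block forever (n if it never blocks). */
int lockstep_sent_before_stall(int n)
{
    pipeline p = {{0}, {0}};
    uint32_t out;
    for (int i = 0; i < n; i++) {
        if (!pipeline_step(&p, (uint32_t)i, &out))
            return i + 1;   /* no result yet, and no new input will ever arrive */
    }
    return n;
}
```

With `PIPE_DEPTH` set to 3, the first result (twice the first input) appears only on the fourth call, and `lockstep_sent_before_stall(10)` returns 1: the lockstep caller stops after its very first value, even though the same loop without pipelining would answer every input immediately.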