Unlike MPPs, NoCs are not strongly centralized computer systems. A typical NoC consists of relatively autonomous computers, each of which may be used and administered independently by its users.

5.3.1. Unstable Performance Characteristics

The first implication of the multiple-user, decentralized nature of NoCs is that the computers executing a parallel program may also be used for other computations and involved in other communications. As a result, the real performance of processors and communication links can change dynamically, depending on those external computations and communications.

Therefore a good parallel program for a NoC must be sensitive to such dynamic variations in its workload. In such a program, computations and communications are distributed across the NoC in accordance with the actual performance of its components at the moment the program executes.
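As a minimal illustration of this idea, the sketch below distributes loop iterations over processors in proportion to their currently measured speeds. The function name, interface, and speed units are all hypothetical, not part of any real API; a real system would measure speeds at run time.

```c
#include <stddef.h>

/* Hypothetical sketch: distribute n iterations over p processors in
 * proportion to their currently measured speeds (arbitrary units). */
void distribute(int n, int p, const double speed[], int share[])
{
    double total = 0.0;
    int assigned = 0, fastest = 0;

    for (int i = 0; i < p; i++)
        total += speed[i];
    for (int i = 0; i < p; i++) {
        share[i] = (int)(n * speed[i] / total); /* rounds down */
        assigned += share[i];
    }
    /* hand the rounding remainder to the fastest processor */
    for (int i = 1; i < p; i++)
        if (speed[i] > speed[fastest])
            fastest = i;
    share[fastest] += n - assigned;
}
```

If the speeds are remeasured before each distribution step, the allocation tracks the dynamically changing workload of the NoC rather than any fixed, nominal processor ratings.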

5.3.2. High Probability of Resource Failures

Fault tolerance is not a primary concern for parallel applications running on MPPs, since the probability of unexpected resource failures in a centralized, dedicated parallel computer system is quite small. That probability is much higher for NoCs. First, any single computer in an NoC may be switched off or rebooted unexpectedly while other users in the NoC depend on it; the same may happen to any other resource in the NoC. Second, not all elements of a common NoC, nor the interactions between them, are equally reliable. These characteristics make fault tolerance a desirable feature for parallel applications that run on NoCs, and the longer the execution time of the application, the more important the feature becomes.

The basic programming tool for distributed-memory parallel architectures, MPI, does not address this problem, because a fault-tolerant parallel program assumes a dynamic process model: failure of one or more processes of the program does not necessarily lead to failure of the entire program, which may continue running even after its set of processes has changed.

The MPI 1.1 process model is fully static. MPI 2.0 does include some support for dynamic process control, although this is limited to the creation of new MPI process groups with separate communicators. These new processes cannot be merged with previously existing communicators to form the intracommunicators needed for a seamless single-application model; they are limited to a special set of extended collective communications.

To date, there is no industrial fault-tolerant implementation of MPI. However, there are a few research versions of MPI that suggest different approaches to the problem of fault-tolerant parallel programming.

The first approach to making MPI applications fault tolerant is checkpointing and rollback. Under this approach, all processes of the MPI program flush their message queues, so that no in-flight messages are lost, and then all checkpoint synchronously. If an error occurs at some later stage, the entire MPI program is rolled back to the last complete checkpoint and restarted. Because the whole application must checkpoint synchronously, this can become expensive in time, depending on the application and its size, and may scale poorly.
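The per-process half of this scheme can be sketched in a few lines: each process serializes its state to stable storage, and rollback means reading that state back. This is a minimal single-process illustration with hypothetical names; the coordinated flush of in-flight messages and the synchronization across all processes, which a real fault-tolerant MPI must perform, are omitted.

```c
#include <stdio.h>

/* Illustrative sketch: the state a process would checkpoint. */
typedef struct {
    int    iteration;
    double partial_sum;
} State;

int save_checkpoint(const char *path, const State *s)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

int restore_checkpoint(const char *path, State *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

The cost noted above comes from the barrier-like synchronization: every process must reach a consistent point and write its state before any process proceeds.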

The second approach is to use “spare” processes that are utilized when there is a failure. For example, MPI-FT supports several master/slave models where all communicators are built from grids that contain “spare” processes. To avoid loss of message data between the master and the slaves, all messages are copied to an observer process that can reproduce lost messages in the event of a failure. This system imposes a high overhead on every message and considerable memory requirements on the observer process for long-running applications. It is not a full checkpoint system in that it assumes any data (or state) can be rebuilt using only the knowledge of past messages, which might not be the case for nondeterministic unstable solvers.
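The observer idea can be sketched as a message log plus a replay function: every message is copied aside as it is sent, and on a failure the log is scanned to reproduce the messages destined for the restarted process. In MPI-FT this interposition happens inside the library; the structures and names below are hypothetical.

```c
/* Hypothetical sketch of observer-style message logging. */
#define LOG_CAP 1024

typedef struct {
    int dest;     /* rank of the receiver */
    int payload;  /* message contents (an int, for illustration) */
} LoggedMsg;

static LoggedMsg msg_log[LOG_CAP];
static int log_len = 0;

void send_with_log(int dest, int payload)
{
    if (log_len < LOG_CAP)
        msg_log[log_len++] = (LoggedMsg){ dest, payload };
    /* ...the actual send to dest would follow here... */
}

/* Reproduce every logged message destined for a restarted process;
 * returns the number of messages replayed. */
int replay_for(int dest, int out[], int max)
{
    int k = 0;
    for (int i = 0; i < log_len && k < max; i++)
        if (msg_log[i].dest == dest)
            out[k++] = msg_log[i].payload;
    return k;
}
```

The sketch also makes the drawbacks visible: every send pays for the extra copy, and the log only grows, which is exactly the memory pressure on the observer that the text describes.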

MPI-FT is an example of an implicit fault-tolerant MPI. Such implementations do not extend the MPI interface, and no specific design is needed for an application that uses them: the system takes full responsibility for the fault-tolerant features of the application. The drawback of this approach is that the programmer cannot control those features or fine-tune the balance between fault tolerance and performance as system and application conditions dictate.

Unlike MPI-FT, FT-MPI is an explicit fault-tolerant MPI that extends the standard MPI’s interface and semantics. An application using FT-MPI has to be designed to take advantage of its fault-tolerant features.

First of all, FT-MPI extends MPI communicator and process states. Standard semantics of MPI indicate that a failure of an MPI process or communication causes all communicators associated with them to become invalid. As the standard provides no method to reinstate them (and it is unclear if they can be freed), this causes MPI_COMM_WORLD itself to become invalid and thus the entire MPI application will be brought to a halt.

FT-MPI extends the MPI communicator states from {valid, invalid}, or {ok, failed}, to a range {ok, problem, failed}.

The MPI process states are also extended from typical states of {ok, failed} to {ok, unavailable, joining, failed}. The unavailable state includes unknown, unreachable, or “we have not voted to remove it yet” states. A communicator changes its state when either an MPI process changes its state or a communication within that communicator fails for some reason.

Standard MPI allows only the transition from ok to failed, which causes the application to abort. Allowing the communicator to be in an intermediate state gives the application the ability to decide how to alter the communicator and its state, as well as how communication behaves while in that intermediate state.
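The extended state model described above can be summarized as a small state machine. The enum values below mirror the text, not actual FT-MPI identifiers; the rule encoded is that a communicator leaves the ok state as soon as any member process does, and returns to ok only through an explicit rebuild.

```c
/* Illustrative sketch of FT-MPI's extended states (names hypothetical). */
typedef enum { COMM_OK, COMM_PROBLEM, COMM_FAILED } CommState;
typedef enum { PROC_OK, PROC_UNAVAILABLE, PROC_JOINING, PROC_FAILED } ProcState;

/* A communicator degrades to the intermediate "problem" state when any
 * member process is no longer ok; recovery requires a rebuild. */
CommState comm_state(const ProcState procs[], int n)
{
    for (int i = 0; i < n; i++)
        if (procs[i] != PROC_OK)
            return COMM_PROBLEM;
    return COMM_OK;
}
```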

On detecting a failure within a communicator, that communicator is marked as having a probable error. As soon as this occurs, the underlying system sends a state update to all other processes involved in that communicator. If the error was a communication error, not all communicators are forced to be updated. If it was a process exit, then all communicators that include this process are changed.

Once a communicator has an error state, it can only recover by rebuilding it, using a modified version of one of the MPI communicator build functions such as MPI_Comm_create, MPI_Comm_dup, or MPI_Comm_split. Under these functions the new communicator will follow one of the following semantics depending on its failure mode:

  • SHRINK: The communicator is reduced so that the data structure is contiguous. The ranks of the processes are changed, forcing the application to recall MPI_COMM_RANK.

  • BLANK: The same as the SHRINK mode, except that the communicator can now contain gaps to be filled later. Communicating with a gap will cause an invalid rank error. Calling MPI_COMM_SIZE will return the extent of the communicator, not the number of valid processes within it.

  • REBUILD: The most complex mode; it forces the creation of new processes to fill any gaps until the size equals the extent. The new processes can either be placed into the empty ranks, or the communicator can be shrunk and the remaining processes placed at the end.

  • ABORT: A mode that takes effect immediately upon detection of an error and forces a graceful abort. The user cannot trap this. If the application needs to avoid this behavior, it must set all its communicators to one of the communicator modes above.
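The difference between the SHRINK and BLANK rank treatments can be sketched as two small rank-mapping helpers. Here `alive[i]` is nonzero if the process at old rank `i` survived; the helper names and the `-1` gap marker are illustrative, not FT-MPI functions.

```c
/* SHRINK: survivors are packed into contiguous ranks, so ranks change
 * and the application must query its rank again. Returns the new size. */
int shrink_ranks(const int alive[], int n, int new_rank[])
{
    int next = 0;
    for (int i = 0; i < n; i++)
        new_rank[i] = alive[i] ? next++ : -1;  /* -1: process is gone */
    return next;
}

/* BLANK: survivors keep their old ranks and failed slots become gaps;
 * the extent (what MPI_COMM_SIZE reports) stays n. Returns the extent. */
int blank_ranks(const int alive[], int n, int new_rank[])
{
    for (int i = 0; i < n; i++)
        new_rank[i] = alive[i] ? i : -1;       /* -1: gap, invalid rank */
    return n;
}
```

With four processes of which rank 1 has failed, SHRINK yields a communicator of size 3 with ranks renumbered, while BLANK keeps size 4 with rank 1 as a gap that a later REBUILD could fill.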

Communications within the communicator are controlled by a message mode for the communicator, which can take one of two forms:

  • NOP: No operations on error. No user-level message operations are allowed; all simply return an error code. This allows an application to return from any point in the code to a state where it can take appropriate action as soon as possible.

  • CONT: All communication that is not to the affected/failed process can continue as normal. Attempts to communicate with a failed process will return errors until the communicator state is reset.
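The contrast between the two message modes can be captured in a few lines. The enum names and the mock error code below are illustrative stand-ins, not FT-MPI identifiers: under NOP every user-level operation is refused, while under CONT only operations aimed at a failed rank fail.

```c
/* Hypothetical sketch of the NOP and CONT message modes. */
enum { MOCK_OK = 0, MOCK_ERR = -1 };  /* stand-ins for MPI return codes */
typedef enum { MODE_NOP, MODE_CONT } MsgMode;

int try_send(MsgMode mode, int dest_failed)
{
    if (mode == MODE_NOP)
        return MOCK_ERR;                      /* all operations refused */
    return dest_failed ? MOCK_ERR : MOCK_OK;  /* CONT: only failed ranks */
}
```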

The user discovers errors from the return code of MPI calls, with a new fault indicated by MPI_ERR_OTHER. Details as to the nature and specifics of an error are available through the cached attributes interface of MPI.

FT-MPI takes extra care in the implementation of collective operations. When an error occurs during an operation, the result will either be the same as if there had been no error, or the operation is aborted. In a broadcast, even if a receiving process fails, the surviving receiving processes still receive the same data, that is, the same end result. Gather and all-gather are different in that the result depends on whether the problematic processes sent data to the gatherer/root; in the case of gather, the root may or may not have gaps in the result. For the all-to-all operation, which typically uses a ring algorithm, some processes may end up with complete information and others with incomplete information. Thus, for operations that require input from multiple processes, as in gather/reduce-type operations, any failure causes all processes to return an error code rather than possibly invalid data.

Typical usage of FT-MPI would be in the form of an error check and then some corrective action such as a communicator rebuild. For example, in the following code fragment the communicator is simply rebuilt and reused on an error:

rc = MPI_Send(..., comm);
if (rc == MPI_ERR_OTHER) {
     MPI_Comm_dup(comm, &newcomm);
     comm = newcomm;
}
// Continue...

Some types of parallel programs such as SPMD master/worker codes only need the error checking in the master code if the user is willing to accept the master as the only point of failure. The next example shows how complex a master code can become:

rc = MPI_Bcast(..., comm);  // Initial work
if (rc == MPI_ERR_OTHER)
     reclaim_lost_work(...);
while (!all_work_done) {
     if (work_allocated) {
          rc = MPI_Recv(buf, ans_size, result_dt,
                        MPI_ANY_SOURCE, MPI_ANY_TAG, comm,
                        &status);
          if (rc == MPI_SUCCESS) {
               handle_work(buf);
               free_worker(status.MPI_SOURCE);
               all_work_done--;
          }
          else {
               reclaim_lost_work(status.MPI_SOURCE);
               if (no_surviving_workers) {
                    // Do something
               }
          }
     }
     // Get a new worker as we must have received a result or a death
     rank = get_free_worker_and_allocate_work();
     if (rank) {
          rc = MPI_Send(..., rank, ...);
          if (rc == MPI_ERR_OTHER)
               reclaim_lost_work(rank);
          if (no_surviving_workers) {
               // Do something
          }
     }
}

In this example the communicator mode is BLANK and communications mode is CONT. The master keeps track of work allocated, and on an error just reallocates the work to any “free” surviving processes. Note that the code has to check if there is any surviving worker process left after each death is detected.

Parallel Computing on Heterogeneous Networks (Wiley Series on Parallel and Distributed Computing)
Year: 2005
