7.2 Failures in Software Layers and Hardware Components

Designing reliable, fault-tolerant software means designing software that continues to operate even after some of its components fail. Those components may be hardware or software. Fault-tolerant software has features that counter the effects of hardware or software faults. At the very minimum, a fault-tolerant design should provide graceful degradation of service rather than immediate interruption of service: when the software encounters one or more failed components, it should continue to function, though at a reduced level. The failures that our software must handle fall into two categories: software and hardware. Figure 7-1 shows a breakdown of some of the hardware components and the layers of software that may be involved in failure.

Figure 7-1. A breakdown of some of the hardware components and layers of software involved in failure.

graphics/07fig01.gif
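The graceful degradation described above can be sketched in a few lines of C++. This is only a minimal illustration, assuming C++11 or later; `ServiceResult` and `serveWithFallback` are hypothetical names introduced here, not part of any library. The idea is that a failure in the full-service component is caught and answered with a reduced-service component, and the caller is told that the result is degraded.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical result type: carries the answer plus a flag telling the
// caller whether it came from the full or the reduced service path.
struct ServiceResult {
    std::string value;
    bool degraded;
};

// Try the full-service component first. If it throws, log the failure and
// fall back to a reduced-service component instead of interrupting service.
// An exception thrown by the fallback itself is not caught here; it
// propagates to the caller.
ServiceResult serveWithFallback(const std::function<std::string()>& primary,
                                const std::function<std::string()>& fallback)
{
    try {
        return {primary(), false};
    } catch (const std::exception& e) {
        std::cerr << "primary component failed: " << e.what() << '\n';
        return {fallback(), true};
    }
}
```

Calling `serveWithFallback` with a primary that throws (say, a lookup against an unreachable database) and a fallback that returns a cached value yields a result whose `degraded` flag is set, so the caller can keep running while reporting the reduced level of service.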

In Figure 7-1, we distinguish between the hardware components and the software layers because the techniques for handling hardware failures often differ from the techniques used to handle software failures. Figure 7-1 also shows that several software layers are involved. Some of these layers are beyond the direct control of the developer and require special consideration during exception and error handling. The software design, development, and testing phases have to take into account the kinds of problems caused by hardware failures and the software layers where failure can occur.

Programs that require parallelism or that consist of distributed components have additional hardware failures to consider. For instance, distributed programs rely on communications hardware and software, and failure in a communication component can cause the entire system to fail. Programs designed for parallel processors may fail if the anticipated number of processors is not available. Even if communications and processors are available at startup, failure may still occur at some time after the program has begun to execute.

Exceptions may occur in any of the hardware components and in any of the software layers. In addition, each software layer may contain defects that must be handled. During the software design phase, it is useful to approach exception and error handling layer by layer: the options for recovery or repair for an application that faces failure at layer 2 are different from the options that are available at layer 3.

In addition to the software layer or hardware component in which they occur, failures may also be characterized by location. Figure 7-2 depicts how, as the distance between the tasks increases, so does the level of difficulty of error and exception handling.

Figure 7-2. Contrasting the increase of distance between location of tasks and the increase of the level of difficulty of error and exception handling.

graphics/07fig02.gif

The more distance in software or hardware components between the concurrently executing tasks, the more sophistication is required when designing exception and error handling components. So from Figures 7-1 and 7-2, we can see that in order to design and develop reliable software, we will have to make provisions for the what and where of defects and exceptions.
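One way to make provisions for the what and where is to have exceptions carry that information with them. The sketch below is an assumption-laden illustration, not a prescribed design: the `Origin`, `Location`, and `Recovery` enumerations, the `LocatedFailure` class, and the policy in `chooseRecovery` are all hypothetical, and the categories do not reproduce the labels used in Figures 7-1 and 7-2.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical classification of what failed and where it failed.
enum class Origin { Hardware, SoftwareLayer };
enum class Location { SameProcess, SameMachine, RemoteMachine };

// An exception that records the origin (hardware or software), the
// software layer number, and the location of the failing task.
class LocatedFailure : public std::runtime_error {
public:
    LocatedFailure(const std::string& message, Origin origin,
                   int layer, Location where)
        : std::runtime_error(message),
          origin_(origin), layer_(layer), where_(where) {}

    Origin origin() const { return origin_; }
    int layer() const { return layer_; }   // meaningful only for software failures
    Location where() const { return where_; }

private:
    Origin origin_;
    int layer_;
    Location where_;
};

enum class Recovery { RetryLocally, RestartTask, FailOverToReplica, DegradeService };

// A hypothetical policy: hardware failures lead to degraded service, while
// software failures are handled according to how far away the task is.
// A fuller policy would also branch on layer(), since the recovery options
// differ from one software layer to the next.
Recovery chooseRecovery(const LocatedFailure& f)
{
    if (f.origin() == Origin::Hardware) {
        return Recovery::DegradeService;
    }
    switch (f.where()) {
        case Location::SameProcess:   return Recovery::RetryLocally;
        case Location::SameMachine:   return Recovery::RestartTask;
        case Location::RemoteMachine: return Recovery::FailOverToReplica;
    }
    return Recovery::DegradeService;  // unreachable for valid enumerators
}
```

A handler that catches `LocatedFailure` can then pass it to `chooseRecovery` rather than guessing from the message text, which keeps the recovery decision in one place as more layers and locations are added.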