Chapter 12: Software Availability

12.1 Introduction

There are as many definitions of software availability as there are authors writing about the subject. In the more traditional hardware sense of the term, availability is defined as the ratio of mean time to failure (MTTF) to the sum of MTTF and mean time to repair (MTTR). In the hardware world, a failure event is very easy to observe. A light bulb burned out. A disk drive went south. Consequently, it is easy to measure the precise elapsed time between failure events, and just as easy to measure the interval between the moment a light bulb burned out and the moment it was replaced. From the first measurement we can compute the average time to failure for light bulbs; from the second, the average time to repair a failed one.
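To make the hardware definition concrete, here is a minimal sketch of the computation; the MTTF and MTTR figures are illustrative, not drawn from the text:

    # Hardware availability as defined above: MTTF / (MTTF + MTTR).
    def availability(mttf_hours: float, mttr_hours: float) -> float:
        return mttf_hours / (mttf_hours + mttr_hours)

    # Example: a drive that runs 10,000 hours between failures and
    # takes 4 hours to replace is available ~99.96% of the time.
    print(availability(10_000.0, 4.0))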

The concept of availability in the software world is not so clear. For the most part, software failure events are not directly observable; perhaps only 1 percent of them, the most catastrophic ones, are ever witnessed. In modern complex software systems, a failure event may occur hours, days, or even weeks before its consequences percolate to the surface to be observed. Thus, the MTTF statistic cannot be determined with any degree of certainty because the underlying failure event is essentially unobservable.

Similarly, the MTTR statistic has even less meaning in the software world. When an operating system fails, for example, we simply reboot the system and continue working. We do not, and usually cannot, fix the problem; we work around it until the software vendor provides an update that does. We do not wait for this fix to occur. We do our best to survive with the problem in the meantime.

The term availability must be redefined when it is applied to software systems. There are four clear components to availability: (1) reliability, (2) security, (3) survivability, and (4) maintainability. A system will not run correctly if it is hijacked or damaged by intentional misuse; thus, security is an important component of availability. Nor will a system run correctly if it is permitted to execute flawed code: software that has been corrupted, or that contains corrupt code, will cause the system to operate improperly.

A reliable software system is one that does not break when it is placed into service at a customer's site. Reliability, however, is not a static attribute of a software system. Reliability is a function of how that customer will use the software. Some operations of the software will perform flawlessly and forever; other operations of this same software system will be flawed and subject to repeated failure events. Software reliability, then, is determined by the interaction between the structure of the code and the user's operation of the system, as reflected in his or her operational profile.
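One common way to model this interaction, offered here as a sketch rather than as the author's formulation, is to weight each operation's reliability by the probability the user's operational profile assigns to it. The operation names and figures below are hypothetical:

    # Reliability under an operational profile: each operation op is
    # exercised with probability p and succeeds with reliability r.
    profile = {"open_file": 0.50, "save_file": 0.30, "print": 0.20}
    op_reliability = {"open_file": 0.9999, "save_file": 0.999, "print": 0.95}

    # Expected reliability for THIS user's pattern of use; a different
    # profile over the same code yields a different reliability.
    r = sum(p * op_reliability[op] for op, p in profile.items())
    print(f"profile-weighted reliability: {r:.4f}")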

In the case of embedded software systems, the reliability of the software system depends on both good code and good hardware. If a specific set of software modules is responsible for the operation of a failing hardware component, the first sign of this incipient hardware problem may appear in the software. That is, the software behavior will change as a direct result of an evolving hardware failure, and the problem might first be detected as a disturbance in the software system.

A secure system is one that can repel all attempts at misuse. A software system can be assailed from outside the designated and authorized user community by agents who wish to stop the normal use of our software. It might also be invaded by an outside agent who exploits weaknesses in our defenses for his or her own purposes. Such misuse might divert financial resources or goods to the agent, or the agent might misuse our software to steal our intellectual property. At the heart of a secure software system is a real-time control infrastructure that can monitor system activity, recognize invidious behavior, and control it before damage is done to the system. From the availability standpoint, the essence of the security problem is that outside agents are actively trying to subvert our system and so prevent it from performing its normal activities. In the worst case, they can cause our system to fail, destroy system resources, or consume system resources through a denial-of-service attack.

A system that has been developed for survivability can identify potential problems as they occur and seek remediation before the system can fail. Typically, a system that has been tested and certified for certain operational behaviors will run without problems when it is placed into service. When the software is driven into new and uncertified domains by new and uncertified user activity, it is likely to fail. A system built on principles of survivability will be able to identify new usage patterns by the customer and communicate these new uses to the software developer. The developer then has the ability to recertify the system for its new usage patterns and ship a new release of the software to the customer before the system has the opportunity to fail in the user's hands.
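One way such monitoring might look, as a sketch under assumed names and an assumed drift tolerance, is to compare the observed frequency of each operation against the profile certified during test and report the operations that have drifted:

    # Certified operational profile from test; names are hypothetical.
    certified_profile = {"login": 0.40, "query": 0.55, "bulk_export": 0.05}

    def uncertified_operations(observed_counts: dict, tolerance: float = 0.10):
        """Report operations whose observed frequency drifts outside
        the certified profile by more than the given tolerance."""
        total = sum(observed_counts.values())
        drifted = []
        for op, count in observed_counts.items():
            expected = certified_profile.get(op, 0.0)  # unseen op: expect 0
            if abs(count / total - expected) > tolerance:
                drifted.append(op)
        return drifted

    # A burst of bulk exports the system was never certified for:
    print(uncertified_operations({"login": 40, "query": 30, "bulk_export": 30}))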

A maintainable system is one that is built around the principle of requirements traceability. If we do not know what a piece of code is doing, it is nearly impossible to alter it without having an adverse effect on some undetermined functionality. Similarly, if it becomes necessary to change the system requirements, the change becomes an impossible task when we do not know which code modules implement which requirement. Basically, a maintainable system is one that can be fixed or modified very quickly.
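A minimal sketch of such traceability, with hypothetical requirement IDs and module names, is a two-way map between requirements and the modules that implement them:

    # Requirements traceability: which modules implement which requirement.
    trace = {
        "REQ-101": ["auth.py", "session.py"],
        "REQ-102": ["report.py"],
        "REQ-103": ["auth.py", "audit.py"],
    }

    def modules_for(requirement: str) -> list:
        return trace.get(requirement, [])

    def requirements_for(module: str) -> list:
        return [req for req, mods in trace.items() if module in mods]

    # Before touching auth.py, see every requirement it implements:
    print(requirements_for("auth.py"))   # ['REQ-101', 'REQ-103']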

At the heart of a highly available system is a control structure built into the software that monitors the system in real time for failure potential and for misuse. This control structure will measure the software to ensure that it is performing only functions that have been certified by the vendor. It will also detect and blockade noncritical functionalities that would cause the entire system to fail should they execute.
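How such a blockade might be expressed in code is sketched below; the certified set, the guard mechanism, and the function names are all assumptions for illustration, not the author's design:

    # A guard that permits only vendor-certified functionality to run
    # and blockades everything else before it can execute.
    CERTIFIED = {"read_record", "write_record"}

    class UncertifiedOperationError(RuntimeError):
        pass

    def guarded(name):
        def wrap(fn):
            def inner(*args, **kwargs):
                if name not in CERTIFIED:
                    raise UncertifiedOperationError(f"{name} is blockaded")
                return fn(*args, **kwargs)
            return inner
        return wrap

    @guarded("purge_all")      # not certified: any call raises, system survives
    def purge_all():
        ...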


