12.3 Availability

A major cause of field failures in any software system is that the software is used differently from the way its developers anticipated and its testers validated. It also frequently happens that a system functions with high reliability in the field until a substantial change in system activity causes it to fail. We recently witnessed, for example, an astonishing number of field failures in a fault-tolerant storage device built by one of the manufacturers of large fault-tolerant file servers. These failures were attributable to a new release of a database management system being run on these systems. The new release was found to use a host of features that had not been adequately tested when the software was developed. What really happened, in other words, was a major shift in the operational profile of the system after it had been placed in operation. This shift in operational profile caused an attendant shift in the reliability of the system.

To ensure the continuing reliable operation of a system, our objective should be to monitor the activity of the software in real time. In this operating mode, a baseline operational profile is used to initialize an instrumented software system with a standard for certified behavior. Once the system is deployed at a user's site and placed in service, its activity is monitored continuously, as the user exercises it, through a sequence of execution profiles. The profile summaries of these module epochs are sent to an analytical engine. The engine compares each profile against a nominal database to ensure that the profile is within the bounds of nominal behavior; that is, that the current activity of the system conforms to the behavior the vendor tested during its test and validation process.
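The comparison against a nominal database can be illustrated with a minimal sketch. It assumes, purely for illustration, that an execution profile is the proportion of calls made to each module during one epoch and that the nominal database stores the lowest and highest proportion observed for each module during certification. The function names and the `tolerance` parameter are hypothetical, not taken from the text.

```python
from collections import Counter

def execution_profile(module_calls):
    """Turn the module names observed in one epoch into call proportions."""
    counts = Counter(module_calls)
    total = sum(counts.values())
    return {module: n / total for module, n in counts.items()}

def nominal_bounds(certified_profiles):
    """Record the lowest and highest proportion seen per module during certification."""
    modules = set().union(*certified_profiles)
    return {
        m: (min(p.get(m, 0.0) for p in certified_profiles),
            max(p.get(m, 0.0) for p in certified_profiles))
        for m in modules
    }

def out_of_bounds(profile, bounds, tolerance=0.05):
    """Return the modules whose proportion falls outside the certified range.

    A module never seen during certification has the range (0, 0), so any
    appreciable use of it is reported. The tolerance is an assumed slack value.
    """
    offenders = []
    for module in set(profile) | set(bounds):
        low, high = bounds.get(module, (0.0, 0.0))
        value = profile.get(module, 0.0)
        if value < low - tolerance or value > high + tolerance:
            offenders.append(module)
    return sorted(offenders)

# Two epochs recorded while the vendor tested the system (made-up data).
certified = [
    execution_profile(["read"] * 8 + ["write"] * 2),
    execution_profile(["read"] * 7 + ["write"] * 3),
]
bounds = nominal_bounds(certified)

in_field_ok = execution_profile(["read"] * 3 + ["write"] * 1)
in_field_new = execution_profile(["read"] * 2 + ["write"] * 2 + ["compact"] * 6)
```

Here `in_field_ok` stays inside the certified ranges, while `in_field_new` both underuses `read` and exercises a module, `compact`, that was never certified. A production engine would likely use a statistically grounded bound rather than a fixed tolerance.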

Perhaps the greatest threat to the reliable operation of a modern software system is that the customer will exercise the system in an unanticipated fashion. That is, the customer will shift to an operational profile different from the one certified by the software developer. As a consequence, a system that was reliable under the certified profile may become unreliable under the new one.

Our ultimate objective in dynamic availability measurement is to provide a methodology that can easily be incorporated into a software system, one that can:

  • Monitor the activity of any piece of software

  • Identify the novel uses of this software

  • Call home (to the software developer) to notify testers of the new behavior

  • Describe, in sufficient detail, the precise nature of the new behavior
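The first capability in the list above, monitoring, might be sketched as follows. This is only an illustration under stated assumptions: instrumentation is done with a Python decorator, an epoch is a fixed number of observed calls, and the class name `ProfileMonitor` and the two sample functions are hypothetical.

```python
import functools
from collections import Counter

class ProfileMonitor:
    """Counts calls to instrumented functions; closes an epoch every `epoch_size` calls."""

    def __init__(self, epoch_size=100):
        self.epoch_size = epoch_size
        self.current = Counter()   # call counts for the epoch in progress
        self.epochs = []           # completed epochs, oldest first

    def instrument(self, func):
        """Decorator that records one observation each time `func` is entered."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self._record(func.__name__)
            return func(*args, **kwargs)
        return wrapper

    def _record(self, name):
        self.current[name] += 1
        if sum(self.current.values()) >= self.epoch_size:
            # Epoch is full: store its counts and start a fresh one.
            self.epochs.append(dict(self.current))
            self.current = Counter()

monitor = ProfileMonitor(epoch_size=4)

@monitor.instrument
def read_block():
    return "data"

@monitor.instrument
def write_block():
    return None

for _ in range(3):
    read_block()
write_block()   # the fourth observed call closes the first epoch
```

Each entry in `monitor.epochs` is the raw material for one execution profile. Identifying novel use, calling home, and describing the behavior would operate on these completed epochs.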

We accept the proposition that no software system can be thoroughly or exhaustively tested. It is simply not possible to do this because of the complexity of modern software systems. We can, however, certify a range of behaviors for the software. These behaviors, in turn, are represented by certified operational profiles. If a user induces new and uncertified behavior in a system, then the reliability of the system might decline. It is also possible that the system is sufficiently reliable under the new behavior but that our level of certainty about this reliability estimate is too low.

In the event that a user does depart from a certified operational profile, it will be important to know just what the new behavior is so that we can test the system and possibly certify it for the new behavior. If the software system is suitably instrumented, as was the case in our simple example in Section 12.3.4, then we will have sufficient information at our disposal to reconstruct the user's activity and certify the system components associated with that behavior.

The main objective of our dynamic measurement methodology for availability is to trap, in real time, any behavior that is considered uncertified. We want to observe the software modules and their behavior to determine, with a certain level of confidence, the future reliability of a system under one or more certified operational profiles.

In the event that a system is determined to be operating out of the nominal range, the analytical engine must:

  • Capture the execution profiles that represent the aberrant behavior

  • Map this behavior to specific user operations

  • Call home to report the specifics of the new behavior, including:

    • The specific modules involved

    • The operations exercised

    • The duration of the activity
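The reporting steps above can be sketched as a small data-assembly routine. Everything here is an assumption made for illustration: the shape of an aberrant epoch record, the module-to-operation table, the fixed epoch length in seconds, and the choice of JSON as the report format. No transport is shown because the text does not specify how the report reaches the vendor.

```python
import json

# Hypothetical mapping from instrumented modules to the user operations
# that exercise them. A module may serve several operations.
MODULE_TO_OPERATIONS = {
    "read_block": ["open file", "query"],
    "write_block": ["save file"],
    "compact": ["bulk reorganize"],
}

def build_report(aberrant_epochs, epoch_seconds, module_map=MODULE_TO_OPERATIONS):
    """Summarize aberrant epochs: modules involved, candidate operations, duration."""
    modules = sorted(
        {m for epoch in aberrant_epochs for m in epoch["offending_modules"]}
    )
    operations = sorted(
        {op for m in modules for op in module_map.get(m, ["unknown operation"])}
    )
    return {
        "modules": modules,
        "operations": operations,
        "duration_seconds": len(aberrant_epochs) * epoch_seconds,
        # Keep the captured profiles so the vendor can reproduce the activity.
        "profiles": [epoch["profile"] for epoch in aberrant_epochs],
    }

def serialize_report(report):
    """Encode the report as JSON, ready for whatever transport is available."""
    return json.dumps(report, sort_keys=True)

# Two consecutive off-nominal epochs (made-up data).
epochs = [
    {"profile": {"compact": 0.6, "read_block": 0.4},
     "offending_modules": ["compact"]},
    {"profile": {"compact": 0.7, "read_block": 0.3},
     "offending_modules": ["compact", "read_block"]},
]
report = build_report(epochs, epoch_seconds=30)
payload = serialize_report(report)
```

Because one module can serve several operations, the `operations` field lists candidates rather than a definitive answer. Narrowing it down would require the fuller instrumentation the text describes.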

With these data in hand, the software vendor can take the necessary steps to duplicate the activity, fix any problems that occur in this new operational context, recertify the software, and ship a new release to the customer before the software in the field breaks.

The availability measurement instrumentation can be installed at any level of granularity on a computer: the system kernel, the network layer, the file system, the shell, and the end-user application. At the kernel level, the operating system will generate and display a normal level of activity, as shown in its nominal execution profile. When this profile shifts to an off-nominal profile, something new and potentially unreliable is occurring on the system. At the file system level, each user accesses different files, in different locations, with different frequencies that describe certain patterns that can be represented in a profile. At the shell level, each user generates a standard profile representing the normal activities that are customary for that person. Finally, each application generates profiles of characteristic nominal behavior for each activity.

At any of these levels, when a user profile begins to differ from a nominal profile by a preestablished amount, an alarm is activated. Two things might be wrong: (1) a hardware component on which the system is running is beginning to fail; or (2) a current user is driving the system into operations that have not been certified. For complete availability monitoring capability, all software running on a system must be monitored in real time.
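One way to make "differ by a preestablished amount" concrete is a distance threshold. The sketch below assumes, as an illustration only, that profiles are compared by Euclidean distance to the mean certified profile and that the threshold is the largest certified distance scaled by a safety margin. The text does not prescribe either choice, and all names are hypothetical.

```python
import math

def centroid(profiles):
    """Mean proportion per module across the certified profiles."""
    modules = sorted(set().union(*profiles))
    n = len(profiles)
    return {m: sum(p.get(m, 0.0) for p in profiles) / n for m in modules}

def distance(profile, center):
    """Euclidean distance between a profile and the nominal centroid."""
    modules = set(profile) | set(center)
    return math.sqrt(
        sum((profile.get(m, 0.0) - center.get(m, 0.0)) ** 2 for m in modules)
    )

def alarm_threshold(certified_profiles, center, margin=1.5):
    """Largest distance seen during certification, scaled by an assumed margin."""
    return margin * max(distance(p, center) for p in certified_profiles)

def should_alarm(profile, center, threshold):
    """True when the profile has drifted farther than the preestablished amount."""
    return distance(profile, center) > threshold

certified = [
    {"read": 0.8, "write": 0.2},
    {"read": 0.7, "write": 0.3},
]
center = centroid(certified)
threshold = alarm_threshold(certified, center)

steady = {"read": 0.76, "write": 0.24}
shifted = {"read": 0.2, "write": 0.2, "compact": 0.6}
```

With this data `steady` stays under the threshold and `shifted` exceeds it. The alarm itself cannot tell a failing hardware component from uncertified use; it only signals that the profile has left the nominal range.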

Essentially all developers of large software systems have come to realize that the past reliability of a piece of software is not a very good predictor of its future reliability. A system may function well in the field until its users begin to use it in new and unanticipated ways. Developers of software for Internet applications, in particular, are painfully aware of this phenomenon. The Internet, with all its clients, is a rapidly changing environment, and it is relatively difficult to forecast just how it will evolve. Because we really cannot predict the future behavior of our software clients, we can never know for certain just how they will use our products. Thus, we are perennially vulnerable to failure events in this environment. These failure events are not particularly critical for Internet browsers. You can always reboot. The consequences of failure in a safety-critical software system are very different. No software vendor can or wants to assume this kind of cyber-liability.

We have sought to introduce a little science into the world of software development. We exploit dynamic measurement techniques to analyze the internal software behavior. Our research has shown that internal behavior analysis has an enormous potential as an effective means to detect abnormal activities that might constitute threats to the availability of a system. Through the real-time analysis of the internal program activities, we can detect very subtle shifts in the behavior of a system. In addition, based on the initial experiments, we can now presume that each system abnormality has a particular internal behavior profile that can be recognized.

The dynamic availability measurement technology permits us to measure, in real time, the behavioral characteristics of the software. These measurements, in turn, allow us to infer whether a software system is executing certified user operations or is being pushed into new, uncertified behavior domains. Of utmost importance is the fact that we can capture these data and submit them to a software vendor in a timely fashion with an eye toward (1) understanding the new uses of the software, (2) recertifying the software for the newly observed behaviors, and (3) replacing the software on the user's system before it has a chance to fail. With this capability in place, a software vendor can fix a software system before it breaks.



Software Engineering Measurement
ISBN: 0849315034
Year: 2003
Pages: 139
