Testing and Evaluation


Testing and evaluation usually begin after the coding is done. This phase is intended to ensure that the coded application actually does what it is supposed to do, to the users' satisfaction. Testing and the evaluation of test results are the means employed to validate and verify the final software application. Watts Humphrey, again following the pioneering G. J. Myers, defines seven types of software tests:[12]

  • Unit or module tests verify single programs or modules. These are typically conducted in isolation and in special test environments.

  • Integration tests verify the interfaces between system parts: subsystems, modules, and components.

  • External function tests verify the external system functions as stated in the external specifications.

  • Regression tests validate the system's previous functionality, ensuring that improvements and corrections have not resulted in the loss of previous functionality.

  • System tests both verify and validate the system to its initial objectives.

  • Acceptance tests validate the system to the user's requirements.

  • Installation tests validate the system's installability and operability.

Much has been written about software system testing in the three decades since Myers, and we make no attempt to summarize those writings here. We are content to describe the seven testing methods as they contribute to the verification and validation of a software system under development. In the Taguchi context, we also seek to evaluate the software's trustworthiness through testing.

Testing methods at any level are often called either black box or white box tests. The term black box originated in engineering labs when a student was given a sealed box that had two input terminals and two output terminals and that contained some combination of resistors, inductors, and capacitors. The student was told to determine the precise network in the box by external testing alone. He would usually start with input from a sine wave generator, and then, by increasing frequency and observing the output on an oscilloscope, measure the unknown circuit's response. Then he would use a square wave input, and then an impulse input, and so on. The idea was that the student had to learn the functional performance of the network in the black box without knowing what was in it, and then construct a hypothesis regarding its contents. White-box testing arose in the early days of software testing as a metaphorical chiasmus to black-box testing. Here the goal was to verify the design performance of the box's contents knowing exactly what was in it and having full access to the designer's documentation. In a well-designed and well-documented system, members of the development team can apply both types of methods in support of design verification and functional validation as the system develops, starting with units, going to components, then modules, then subsystems, and finally the system itself. Integration testing follows the assembly of components and in many ways is a bridge between black-box and white-box testing.[13] The many aspects of integration testing are covered in the next chapter.
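
To make the distinction concrete, here is a minimal sketch in Java of one small unit tested both ways; the shipping-fee rule and its test values are hypothetical, invented only for illustration. The black-box checks are chosen purely from the stated specification, whereas the white-box checks deliberately exercise each branch and the boundary the implementer knows about.

    // Hypothetical unit: one flat-fee shipping rule.
    public class ShippingFee {
        // Spec: orders of $100 or more ship free; otherwise the fee is $7.
        static int fee(int orderTotal) {
            if (orderTotal >= 100) return 0;
            return 7;
        }

        public static void main(String[] args) {
            // Black-box checks: derived only from the external specification.
            assert fee(250) == 0 : "large order should ship free";
            assert fee(30) == 7  : "small order pays the flat fee";

            // White-box checks: exercise each branch, including the >= boundary
            // that only someone who has seen the code knows to probe.
            assert fee(100) == 0 : "boundary value takes the free-shipping branch";
            assert fee(99) == 7  : "one below the boundary takes the fee branch";

            System.out.println("All checks passed (run with java -ea to enable asserts).");
        }
    }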

Software testing presents an interesting problem: Because more testing almost always yields more bugs, when should you stop? The law of diminishing returns applies at a certain point, but if testing and consequent debugging go beyond that point, the law of increasing returns often takes over (see Sidebar 18.2). While the number of remaining bugs goes down exponentially with testing, the time needed to find them goes up exponentially. Experts estimate that testing may uncover only 50 to 70% of the bugs in a program.[14] Until the use of formal methods and specification-based program development becomes widespread, testing after coding will continue to be required.

Sidebar 18.2: Testing and Debugging Anomalies

In the 1950s, one of the authors worked with a team to convert a large linear programming package, coded in assembly language, from the Univac 1105, which was a two-address architecture, to the Univac 1107, a single-address architecture. The original program contained only three comments in 25,000 lines of code, and all three of them were false. The program was manually converted line by line by a team of programmers who knew a lot about linear programming and who were familiar with both computers. The computer center had an 1105 with the code running properly, as well as a brand-new 1107, so we could do functional-level whole-system black-box tests. The first test run indicated 166 coding errors. As the testing of improved versions progressed, the number of errors fell exponentially until only three bugs were left. The team set a cigar box on the 1107 computer console and invited team members and curious onlookers to put in a dollar bill and a slip of paper inscribed with their name and the number of bugs they thought would appear on the next run. Most guessed one or two. One cynic guessed eight. The author guessed 18 and was sneered at by the others, who had less testing experience and were not prepared for the outcome. The result was 15 new bugs!

Note also that as the number of remaining bugs goes down exponentially with testing, the time to find them goes up exponentially. In 1956, Bell Labs brought out an interpretive floating-point system for the IBM 650 computer that occupied 1,200 of its 2,000 magnetic drum memory locations. The structures programming group at Boeing-Wichita set a goal of providing the same functionality in only 600 locations. It took three programmers three months to achieve all this functionality, but they used 616 memory locations. It took a month to remove eight lines of code, another month to remove five more, and yet another month to remove the last three.


Unit testing is usually white-box testing aimed at exercising as many of the internal paths as possible through a small piece (unit) of code. A path is a sequence of instruction executions, or a thread, that leads from the unit's entry to its exit. The use of functions in C and objects in C++ and Java has dramatically simplified unit testing and made it much more effective. Unit testing by the programmer who knows the code best has the fault of its virtue: because the programmer designed and/or wrote the code, he or she may be blind to unexpected actions or side effects. The person who created the faults in a program is the person least likely to recognize them, and the person most likely to recognize them faces a steep learning curve to reach the same understanding as the program's creator. It is generally not possible to test all the possible paths of even a simple program, and even if it were, that would not guarantee correctness in the face of unexpected or pathological inputs. In addition to object-oriented technology, the conscious anticipation of later white-box unit testing during design has advanced the ability to test almost every aspect of a program's behavior.[15] As products become more complex, testing becomes more expensive and time-consuming. Defects tend to mask other defects or problems, and they may interact in ways that make their discovery and correction more difficult. Thus, as noted in Sidebar 18.2, as programs become larger and more complex, unit testing finds fewer defects, and it costs more to find them.
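
As an illustration of path-oriented unit testing, the following sketch assumes JUnit 4 on the classpath; the classify() unit and its four test cases are hypothetical. The unit contains two decisions, so there are four entry-to-exit paths, and one test case is written per path.

    // Hypothetical unit under test: two decisions yield four entry-to-exit paths.
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class ClassifierTest {
        static String classify(int age, boolean member) {
            String rate = (age < 18) ? "child" : "adult";   // decision 1
            if (member) rate += "-member";                   // decision 2
            return rate;
        }

        // One test per path through the unit.
        @Test public void childNonMember() { assertEquals("child",        classify(10, false)); }
        @Test public void childMember()    { assertEquals("child-member", classify(10, true));  }
        @Test public void adultNonMember() { assertEquals("adult",        classify(30, false)); }
        @Test public void adultMember()    { assertEquals("adult-member", classify(30, true));  }
    }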

Although it's usually thought of as a late process, integration testing may be employed whenever a developer tries to get two program units or modules to work together, the most elementary case being a function call or object invocation. Testing pioneer G. J. Myers proposed two types of integration testing, bottom-up and top-down, which are described in Tables 18.1 and 18.2.[16] (A brief driver-versus-stub sketch follows Table 18.2.) We will turn to the details in the next chapter, which discusses integration testing of the entire system. It's clear that integration testing is best done at each stage of code accumulation, because errors undiscovered at integration points are much more difficult to find and correct later. Here we simply note that bottom-up testing is heuristic, intuitive, personal, and localized, whereas top-down testing is algorithmic, structured, team-based, and global. The former is often used for small development projects, and the latter is almost always required for large programming projects.

Table 18.1. Bottom-Up Integration Testing

Approach

  • Early testing to both verify and validate modules.
  • Modules can be integrated in clusters.
  • Emphasis on module performance and functionality.

Advantages

  • Test stubs are not needed.
  • Errors in critical modules are found early.
  • Manpower requirements are low.

Problems

  • Test drivers are required.
  • Modules are usually not yet working programs.
  • Interface errors are discovered late.

Remarks

  • At any point in the process, more code is available than with top-down testing.
  • Bottom-up testing is more intuitive than top-down testing.
  • The strategy is relatively ad hoc as modules become available.


Table 18.2. Top-Down Integration Testing

Approach

  • The control program must be completely tested first.
  • Modules are carefully integrated one at a time.
  • The major emphasis is on interface testing.

Advantages

  • No test drivers are needed.
  • The control program and the first layer of modules constitute an early prototype of the whole system.
  • Interface errors are discovered early.
  • Modular features aid debugging.

Problems

  • Test stubs are needed at each level in the program hierarchy.
  • A planned, structured approach slows manpower buildup.
  • Errors in critical modules at low levels are found late.

Remarks

  • A "working" program early on builds both team and management confidence.
  • It is difficult to maintain a purely top-down integration testing strategy in practice.
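
The practical difference between the two strategies in Tables 18.1 and 18.2 comes down to test drivers versus test stubs. The sketch below is hypothetical (the pricing and invoice modules are invented for illustration): a throwaway driver exercises a finished low-level module bottom-up, while a canned stub stands in for a missing low-level module so the control logic can be tested top-down.

    // Hypothetical modules illustrating a bottom-up test driver and a top-down test stub.
    public class IntegrationSketch {

        // Low-level module that already exists and works.
        static class TaxModule {
            double tax(double amount) { return amount * 0.08; }
        }

        // Bottom-up: a throwaway test driver calls the finished low-level module
        // directly, because the control program above it has not been written yet.
        static void bottomUpDriver() {
            TaxModule tax = new TaxModule();
            assert Math.abs(tax.tax(100.0) - 8.0) < 1e-9 : "low-level module result";
        }

        // Top-down: the control program is tested first, so the missing low-level
        // module is replaced by a stub that returns a canned, predictable answer.
        interface Pricing { double tax(double amount); }

        static double invoiceTotal(double amount, Pricing pricing) {
            return amount + pricing.tax(amount);
        }

        static void topDownWithStub() {
            Pricing stub = amount -> 8.0;   // stub in place of the real tax module
            assert Math.abs(invoiceTotal(100.0, stub) - 108.0) < 1e-9 : "control logic result";
        }

        public static void main(String[] args) {
            bottomUpDriver();
            topDownWithStub();
            System.out.println("Driver and stub sketches passed (run with java -ea).");
        }
    }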


External function testing is employed to verify the external system functions as stated in the external specifications. It is a significant part of any test plan and the ultimate user's major criterion of satisfaction. As noted earlier in this chapter, it is the major component in verification of any software system. The major requirements for an external function test are a file of input data and corresponding expected (and hence, specified) output data. The respective input and output data files may be generated by the program being replaced by the new application, by a competitive program, by a previous noncomputer-based system, or by a manual analysis. In any case, the external function test indicates the cases for which the program works to specification and the cases for which it does not. It is always wise to include in a test plan pathological examples or data cases intended to exercise the exception handling capability of the system under test. This latter type of testing is particularly important in numerical applications such as linear programming, integer programming, mixed integer programming, crew scheduling, or any other application in which a small numerical error in input can result in a costly or inappropriate business decision.
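
A minimal external function test amounts to a table of specified input/expected-output pairs replayed against the application, with at least one pathological case included on purpose. The harness below is a hypothetical sketch; the process() function and its expected values stand in for whatever the external specification actually defines.

    // Hypothetical external function test: replay specified inputs and compare outputs.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ExternalFunctionTest {
        // Stand-in for the application function named in the external specification.
        static String process(String input) {
            if (input.isEmpty()) return "ERROR: empty input";   // exception-handling path
            return input.toUpperCase();
        }

        public static void main(String[] args) {
            // Each input is paired with the output the external specification requires.
            Map<String, String> expected = new LinkedHashMap<>();
            expected.put("abc", "ABC");                 // normal case
            expected.put("", "ERROR: empty input");     // pathological case, included on purpose

            int failures = 0;
            for (Map.Entry<String, String> e : expected.entrySet()) {
                String actual = process(e.getKey());
                if (!actual.equals(e.getValue())) {
                    System.out.println("FAIL input=\"" + e.getKey() + "\" expected=\""
                            + e.getValue() + "\" actual=\"" + actual + "\"");
                    failures++;
                }
            }
            System.out.println(failures == 0 ? "All external function cases passed."
                                             : failures + " case(s) failed.");
        }
    }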

Regression testing validates the system's previous functionality, ensuring that improvements and corrections in the new version have not resulted in the loss of that functionality. Naturally, it is employed only when an existing system is upgraded or replaced with a later version; it may also be done after major maintenance or after installation of a vendor's bug fixes. Over the years, some complex systems have literally been maintained or upgraded out of existence. It is common practice to employ novice programmers to maintain existing program inventories for obvious reasons: it is less costly, it gives trainees experience, and it teaches them how the firm's data processing applications work. Unfortunately, such programs were often written by much more highly skilled programmers, who may have employed software technology and techniques the new programmers do not understand. Regression testing is the organization's best defense against negative maintenance, as discussed in the next chapter.
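
In practice, a regression suite is little more than the outputs recorded from the previous release replayed against the new one. The sketch below is hypothetical (the account-formatting routine and its recorded baseline are invented): if the "improved" version disagrees with any recorded answer, previous functionality has been lost.

    // Hypothetical regression check: replay a recorded baseline against the new release.
    import java.util.Map;

    public class RegressionCheck {
        // New ("improved") release of a routine whose old behavior users depend on.
        static String formatAccountId(int id) {
            return String.format("ACCT-%06d", id);
        }

        public static void main(String[] args) {
            // Baseline captured by running the previous release on the same inputs.
            Map<Integer, String> baseline = Map.of(
                    7,      "ACCT-000007",
                    123456, "ACCT-123456");

            boolean regressed = false;
            for (Map.Entry<Integer, String> e : baseline.entrySet()) {
                if (!formatAccountId(e.getKey()).equals(e.getValue())) {
                    System.out.println("Regression on input " + e.getKey());
                    regressed = true;
                }
            }
            System.out.println(regressed ? "Previous functionality lost."
                                         : "No regressions found.");
        }
    }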

System testing is done to both verify and validate that the system satisfies its initial objectives. This is the final test made by the software's designers and developers. Success means that the product is ready for the marketplace.

Acceptance testing is done to validate that the system meets the user's requirements. Such testing is almost purely functional. However, performance aspects depending on technical decisions and trade-offs may also play a role.

Installation testing is done to validate the completed system's installability and operability. This type of testing is done by and for the system's operators as opposed to the users. The payroll department may be completely happy with the functionality of a payroll system that takes 12 hours every Friday to run the payroll using one processor. However, the MIS department would surely prefer a system that employed two processors and that completed the payroll in only one shift. That would not only simplify operations and shift assignments but also would allow the payroll to be rerun on second shift in case a glitch or power failure occurred.

An excellent objective for software development process improvement would surely be the withering away of testing as a component of the overall development process. Formal methods will be a big step toward this goal but will not completely eliminate testing.[17] Testing is so large and costly a part of the development process that we should do everything possible to reduce it to a minimum. The French automotive engineers René Panhard and Émile Levassor built their first automobile in 1890. Upon seeing the first Rolls-Royce Silver Ghost in 1911, Panhard is said to have remarked, "Mon Dieu, it is a triumph of craftsmanship over good design." The goal of using Taguchi Methods upstream in software design and development is to make the software creation process a triumph of good design over downstream craftsmanship and expensive rework. Craftsmanship in the form of "build and fix" has been used for 50 years. It is too slow and too expensive and has never worked very well. Moreover, it does not support the development of the increasingly large and complex software systems that our technological future will depend on.



