Determining the Scope of Testing: How Big Is It?

Antonio Stradivari (1644-1737), better known to most of us as Stradivarius, made some of the finest and most prized violins and violas in the world today. Stradivari was a master craftsman; he had a special knack for understanding the properties, strengths, and weaknesses of a piece of wood. He measured the quality of his materials and craftsmanship by his senses. It is said that he could examine a piece of wood and know just by looking at it, touching it, and listening to the sound it made when he knocked on it what its best use would be. But for all the master's art and craftsmanship, when it came to doing a complete job and managing his efforts, he relied on lists.

We know that he did so because when he was through with a list, he reused the paper it was written on, paper being too valuable to throw away, to reinforce the inside joints of his instruments. At that time, all craftsmen were trained to keep lists. These lists might be used to satisfy a benefactor that funds had been well spent, or to satisfy the craft guild if the craftsman was called to make an accounting. Much of what we know of the masters' arts and lives comes from handwritten lists found reinforcing the insides of the fine instruments and furniture they created.

An inventory is a detailed list. An inventory of all the tasks associated with a project, such as all the tests identified for a test effort, is the basis for answering such questions as, "How long will it take to accomplish everything on the list?" In a test effort, "How big?" is best answered in terms of "How much?", "How many?", and, most importantly, "How long?" The inventory then becomes the basis of an agreement between parties, or a contract for accomplishing the project. We will postpone the discussion of the test agreement or test plan until later. For now, consider the test inventory only as a means of answering the question, "How big is it?"

The test inventory is the complete enumeration of all tests, of all types, that have been defined for the system being tested. For example, the inventory for a typical end-to-end system test effort will include path tests, data tests, module tests, both old and new user scenarios (function tests), installation tests, environment tests, configuration tests, tests designed to ensure the completeness of the system, requirements verification, and so on.

The inventory can be organized in any way that is useful, but it must be as comprehensive as possible. It is normal and healthy that the inventory grow during the test effort. The inventory is dynamic, not static. It evolves with the system. When a test inventory is in place, the test coverage metric can measure many types of testing, such as function or specification coverage.
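
As a concrete illustration, the sketch below shows one way a simple inventory might be kept and used to answer "How many?" and "How long?" The entries, field names, and times are assumed for illustration only; they are not taken from this book's worksheets.

```python
# A minimal sketch of a flat test inventory kept as a list of records.
# All identifiers, types, and durations are assumed for illustration.
inventory = [
    {"id": "PATH-001", "type": "path",     "est_minutes": 4},
    {"id": "DATA-014", "type": "data",     "est_minutes": 6},
    {"id": "FUNC-102", "type": "function", "est_minutes": 12},
    # ... installation, environment, configuration, requirements tests, etc.
]

total_tests = len(inventory)
total_minutes = sum(item["est_minutes"] for item in inventory)
print(f"How many: {total_tests} tests; how long: {total_minutes} minutes per pass")
```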

Test Units

The much-defamed LOC (lines of code) metric is crude, rather like the knotted string used by the Egyptians, which we will discuss in Chapter 11. Its biggest problem is that it is not uniform: there is no equivalence between a line of C code and a line of Ada, Pascal, or Basic. But it is better than no measure at all.

How do you measure a test? If I ask, "How many tests are in this test effort?", do you think of mouse clicks, or long scripts composed of many keystrokes and mouse clicks, or huge data files that will be processed? Clearly, the number of tests is a measure of the size of the test effort. But, like LOC, it is only useful if there is a way to normalize what we mean by a test.

Unfortunately, there is no standard definition of the word test. The IEEE defines test as a set of one or more test cases. This definition implies multiple verifications in each test. Using this definition, if we say that one test effort requires 1,000 tests and another requires 132 tests, the measure is totally ambiguous because we have no idea what a test entails. In fact, a test is performed each time a tester compares a single outcome to a standard, often called an expected response. In today's test lab, a test is an item or event that is verified, where the outcome is compared to a standard. A test case or test script is a set of tests, usually performed in some sequence and related to some larger action or software function. A test suite is a set of test scripts or test cases. Test cases in a test suite are usually related and organized to verify a more complex set of functions.
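
The sketch below models these definitions as they are used here: a test is one verification of an outcome against an expected response, a test case or script is a sequence of such verifications, and a suite is a set of cases. The class and field names are illustrative only, not a standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Test:
    """One verification: an actual outcome compared to an expected response."""
    stimulus: str
    expected_response: str

@dataclass
class TestCase:
    """A sequence of tests related to some larger action or software function."""
    name: str
    tests: List[Test] = field(default_factory=list)

@dataclass
class TestSuite:
    """A set of related test cases verifying a more complex set of functions."""
    name: str
    cases: List[TestCase] = field(default_factory=list)

    def verification_count(self) -> int:
        # Counting individual verifications is what makes "1,000 tests"
        # comparable to "132 tests" in the discussion that follows.
        return sum(len(case.tests) for case in self.cases)
```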

If a tester performs many steps before making a verification, and the comparison of the actual outcome to the expected outcome fails, the entire set of steps will have to be repeated in order to repeat the test. At Microsoft I observed that, except for simple text entry, almost every keystroke or mouse action was considered to be a test, meaning it was verified individually. When the definition of a test is this granular, the test inventory is going to be very comprehensive, at least as far as function test coverage is concerned. This degree of granularity is necessary not only for good test coverage; it is essential if the goal is to create robust test automation. I am not saying that we should plan to rigorously verify the result of every single keystroke in every single scenario. That level of verification may or may not be necessary or cost-effective. I am saying that we need to count every single keystroke or system stimulus that can be verified and make sure that each one is included in the test inventory.

One Test Script: Many Types of Tests

System testing is usually conducted once the entire system is integrated and can be tested from "end to end." Even though this part of a test effort is often called a systems test or end-to-end test, there are different types of test activities being performed. Statement execution is the focus of unit testing, but statements are also executed as a result of the user's actions, such as keystrokes or mouse clicks, during the higher-level tests. Behind every system response there are internal and possibly external module, component, and device responses. Any of these responses could be verified and validated. A single user or system input could require multiple layers of verification and validation at many levels of the system. These tests are all different tests even though they may all be generated by the same test script, and they may or may not all be verified and validated each time the test script is attempted.

The function tester is generally concerned with verifying that the system functions correctly from the user's perspective. The function tester cannot usually perform verification of internal system processes, only the outcomes or actual results visible to a normal user. This type of testing is called black box testing or behavioral testing. The task of verifying internal, often invisible, system processes generally falls to the system testers. System testing may use the same system stimuli as the function test (that is, the same test script, database, and so on), but what is verified as a result of the stimuli is different. System testing usually delves into the internal system response to the stimuli.

For the number of tests to have meaning as a sizing metric, there must be some way to normalize the way we measure a test. One important attribute that all tests have in common is the time required to conduct the test. The total time required to conduct all the tests in the inventory is an important measurement in estimating the test effort.

The function tester and the system tester may use the same keystrokes to stimulate the system, but the tests may take very different amounts of time to complete because the verification being performed is very different. The function tester uses the next screen he or she receives to verify the test. The system tester may have to evaluate the contents of systems log files and trapped events. Verification and validation at this low level takes a lot more time than verification from the user interface.
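
To make the difference concrete, here is a small sketch with assumed step counts and durations, showing how the same script can represent very different amounts of testing time depending on whether each step is verified from the user interface or against log files and trapped events.

```python
# Same stimuli, different verification depth; all numbers are assumed.
steps_in_script = 40          # keystrokes and mouse actions in one test script

ui_verify_minutes = 0.5       # function tester: check the next screen received
log_verify_minutes = 5.0      # system tester: inspect log files and trapped events

function_level_time = steps_in_script * ui_verify_minutes
system_level_time = steps_in_script * log_verify_minutes

print(f"Function-level verification of the script: {function_level_time:.0f} minutes")
print(f"System-level verification of the script:   {system_level_time:.0f} minutes")
```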

Low-level system verification requires a highly technical tester. Because this type of testing is difficult, time-consuming, and expensive, the trend has been to do less and less of it. After all, if the system appears to be sending the correct response to the user, why look any further? This argument has been used to justify the current top-down approach to system testing. The problem is that the really tough, expensive bugs often live in these low-level areas. For example, a bug may affect only 3 percent of the users. Sounds OK, right? But what if those 3 percent happen to be the users who have over $10 million being managed by the system?

Using Historical Data in Estimating Effort

Historical data from a test effort that is similar to the one being estimated can provide a factual basis for predictions. However, even if the time required to conduct two test efforts is known, it is not possible to say which was more productive unless there is some additional basis for comparison. A working basis of comparison can be established by considering similar sets of measurements about each test effort. Consider the sample application data in Table 6.1.

Table 6.1: Comparison of Two Releases of the Same Application

| ITEM | RELEASE 1M | RELEASE 1A |
| --- | --- | --- |
| 1. Number of test scripts (actual) | 1,000 | 132 |
| 2. Total user functions identified in the release (actual) | 236 | 236 |
| 3. Number of verifications/test script (average actual) | 1 | 50 |
| 4. Total verifications performed | 1,000 | 6,600 |
| 5. Average number of times a test was executed during the test cycle | 1.15 | 5 |
| 6. Number of tests attempted by the end of the test cycle (theoretical) | 1,150 | 33,000 |
| 7. Average duration of a test (known averages) | 20 min. | 4 min. |
| Total time required to run the tests (from project logs) | 383 hrs. | 100 hrs. |
| Total verifications/hr. of testing (efficiency) | (1,000/383) = 2.6 | (6,600/100) = 66 |
| Definition of a test | Verification occurs after a user function is executed | Verification occurs after each user action required to execute a function |

A function in Release 1M is roughly equivalent to a function in Release 1A. They are two consecutive releases of the same application, differing only by some bug fixes. At first glance, the test set for Release 1M appears to be more comprehensive than the test set for Release 1A. This is not the case, however. The Release 1M statistics are from the original manual test effort. Release 1A was the automated version of the tests from Release 1M. When the test estimation for Release 1A was performed, all the tests from Release 1M were included in the test inventory. The test analysis showed there were many redundancies in the tests from Release 1M. These were largely removed, and new tests were added to the inventory based on the path and data analysis. [1] The values in the table marked theoretical were calculated values.

Efficiency is the work done (output) divided by the energy required to produce the work (input). In this example, the efficiency of Release 1A is 66 verifications per hour, while the efficiency of Release 1M was 2.6 verifications per hour. Measured in verifications per hour, the tests in Release 1A were roughly 25 times (66/2.6) more efficient than those in Release 1M.

Cost is the inverse of efficiency. Cost in this case will be measured in units of time per test, that is, the time per verification performed and the time per function tested. It took 383/236 = 1.6 hours to verify a function in Release 1M, while it took 100/236 = 0.42 hours, or about 25 minutes, to verify a function in Release 1A. Verifying the 236 functions in Release 1A cost about one-quarter of the time required for Release 1M. This improvement was due to the introduction of automation tools.
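
For readers who prefer to see the arithmetic laid out, this short sketch reproduces the efficiency and cost figures above directly from the Table 6.1 values.

```python
# Efficiency and cost recomputed from the Table 6.1 values.
releases = {
    "1M": {"verifications": 1_000, "hours": 383, "functions": 236},
    "1A": {"verifications": 6_600, "hours": 100, "functions": 236},
}

for name, r in releases.items():
    efficiency = r["verifications"] / r["hours"]      # verifications per hour
    hours_per_function = r["hours"] / r["functions"]  # cost, in time per function
    print(f"Release {name}: {efficiency:.1f} verifications/hr, "
          f"{hours_per_function:.2f} hrs ({hours_per_function * 60:.0f} min) per function")

# Release 1M: 2.6 verifications/hr, 1.62 hrs (97 min) per function
# Release 1A: 66.0 verifications/hr, 0.42 hrs (25 min) per function
```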

These types of cost and efficiency comparisons are dependent on the assumption that the program functions are similar. A program function in firmware is very different from a program function in a piece of windowing software. Program functions are similar from one release of a product to the next. Comparison of tests and test efficiency is most useful when you are planning for a subsequent release when functions will be fairly similar. Such comparisons can also be made between similar applications written in the same language.

Another measure of efficiency is the number of bugs found per test or per test-hour. If it is not possible to establish equivalence, it is still possible to measure the time required to perform the tests and use this measurement to make predictions about regression testing and other similar tests. "How long will it take?" and "How much will it cost?" are two questions for which testers need to have good answers. Using this approach, we can say:

1. The overall size of the test inventory is equal to the number of tests that have been identified for the project. (This is how big it is.)

It does not matter that the true set of tests that exists for the project is unbounded, that is to say, virtually infinite. The techniques discussed in the next chapters are used to cut that number down to one that is still large but manageable. Care must be taken that the test inventory includes at least the minimum items discussed in the following chapters. If the test inventory is poor, incomplete, or simply too shallow, the resultant test effort will be unsatisfactory even if test coverage is 100 percent.

2. The size of the test set is the number of tests that will actually be executed to completion.

This is the number of tests that must pass in order for testing to be considered complete. The test coverage that this effort achieves is the fraction of test items from the test inventory that will actually be executed until they are successful. As stated previously, we want this subset of all the tests that have been identified to be the most important tests from the inventory.

3. The size of the test effort will include all test activities undertaken to successfully execute the test set.

The time required to accomplish the test effort can be estimated based on the total time required to plan, analyze, and execute tests; report, negotiate, and track bugs; and retest fixes on the tests that will be run. A worksheet presented in Chapter 11, "Path Analysis," is used to total the time required for each of these tasks; a rough sketch of this bottom-up sum appears below, following the discussion in point 4.

4. The cost of the test effort can be estimated based on the time and resources required to perform an adequate test effort.

This seems too simplistic to mention, but it has been my experience that the budget for the test effort is rarely determined in this way, by adding up predicted costs. It seems that the most popular method of estimating the cost of the test effort goes something like this:

  1. Pick a delivery date.

  2. Estimate the date when the code should be turned over to test.

  3. Subtract Step 2 from Step 1. This is the number of days available for testing. [2]

  4. Multiply the days available for testing by the cost per hour and the number of testers that can be spared for the effort.

This method totally ignores the goal of an adequate test effort. Testers are somehow expected to get it tested in some arbitrary number of days. When the users start reporting bugs in the product, management immediately asks, "Why didn't you test it?"
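
For contrast with the calendar-driven method just described, here is the bottom-up sketch mentioned in point 3. It sums assumed per-task times over the test set; every per-task figure, test count, and hourly rate is made up purely for illustration, and the worksheet in Chapter 11 is the place to total real numbers.

```python
# A bottom-up estimate: every per-task figure, count, and rate is assumed.
per_test_minutes = {
    "plan": 5,
    "analyze": 10,
    "execute": 8,
    "report_negotiate_track_bugs": 4,
    "retest_fixes": 6,
}

tests_in_test_set = 1_200     # tests that will actually be run to completion

total_hours = sum(per_test_minutes.values()) * tests_in_test_set / 60
estimated_cost = total_hours * 75   # assumed loaded cost per tester-hour
print(f"Estimated effort: {total_hours:.0f} hours, roughly ${estimated_cost:,.0f}")
```

Even rough per-task numbers such as these make it plain what a calendar-derived budget will and will not buy, which is exactly the conversation described in the next paragraph.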

When I became an engineer, my first manager told me, "You have to manage your management." She was absolutely correct. The way I manage management who are using this approach to figuring the budget is to show them exactly what they can get for their money, and what they will not get. If they eventually discover that they did not get enough, it is their problem.

In commercial software testing, this method is sufficient and I have done my job. If management chooses to ignore my professional estimate of what is required to perform an adequate test effort, so be it. I will work with whatever they leave me, being sure all the while that they understand what they are choosing not to test. In high-reliability and safety-critical software, the situation cannot be handled so casually. Whatever form the engineer's protestations or advisements may take, they must be documented. Such documentation is the tester's best defense against the charge that the efforts were inadequate.

[1]In this case, the time required to create the automated tests is factored into the total of 100 hours.

[2]I will resist commenting here on what happens if this is a negative number.


