Measure the things that help you answer the questions you have to answer. The challenge with testing metrics is that the test objects that we want to measure have multiple properties; they can be described in many ways. For example, a software bug has properties much like a real insect: height, length, weight, type or class (family, genus, spider,
I find that I can make my clearest and most convincing arguments when I stick to fundamental metrics. For example, the number of bugs found in a test effort is not meaningful as a measure until I combine it with the severity, type of bugs found, number of
Several fundamental and derived metrics taken together provide the most
Fundamental testing metrics are the ones that can be used to answer the following questions.
How big is it?
How long will it take to test it?
How much will it cost to test it?
How much will it cost to fix it?
The question "How big is it?" is usually
I have
Time
Cost
Tests
Bugs found by testing
We quantify "How big it is" with these metrics. These are probably the most fundamental metrics specific to software testing. They are listed here in order of
For example, product failures are a special class of bug: one that has
The properties and criteria used to quantify tests and bugs are normally defined by an organization; so they are local and they vary from project to project. In Chapters 11 through 13, I introduce
Units of time are used in several test metrics, for example, the time required to run a test and the time available for the best test effort. Let's look at each of these more closely.
This measurement is
The time required to conduct test setup and cleanup activities must also be
Sample Units:
This is usually the most firmly established and most published metric in the test effort. It is also usually the only measurement that is consistently decreasing.
Sample Units: Generally estimated in weeks and measured in minutes.
The cost of testing usually includes the cost of the testers' salaries, the equipment, systems, software, and other tools. It may be
Calculating the cost of testing is straightforward if you keep good project metrics. However, it does not offer much cost justification unless you can contrast it to a
Sample Units: Currency, such as dollars; can also be measured in units of time.
We do not have an invariant, precise, internationally accepted standard unit that measures the
Tests have attributes such as quantity, size, importance or priority, and type
Sample Units (listed simplest to most complex):
A keystroke or mouse action
An SQL query
A single transaction
A complete function path traversal through the system
A function-dependent data set
Many people claim that finding bugs is the main purpose of testing. Even though they are
Sample Units: Severity, quantity, type, duration, distribution, and cost to find and fix. Note: Bug distribution and the cost to find and fix are derived metrics.
Like tests, bugs also have attributes as discussed in the following sections.
Severity is a fundamental measure of a bug or a failure. Many ranking schemes exist for defining severity. Because there is no set standard for establishing bug severity, the magnitude of the severity of a bug is often
| SEVERITY RANKING | RANKING CRITERIA |
|---|---|
| Severity 1 Errors | Program ceases meaningful operation |
| Severity 2 Errors | Severe function error but application can continue |
| Severity 3 Errors | Unexpected result or inconsistent operation |
| Severity 4 Errors | Design or suggestion |
First of all, bugs are bugs; the
Like severity, bug classifications, or bug types, are usually defined by a local set of rules. These are further modified by factors like
In a connected system, some types of bugs are system "failures," as opposed to, say, a coding error. For example, the following bugs are caused by missing or broken connections:
Network outages.
Communications failures.
In mobile computing, individual units that are constantly connecting and disconnecting.
Integration errors.
Missing or malfunctioning
Timing and synchronization errors.
These bugs are actually system failures. These types of failure can, and probably will, recur in production. Therefore, the tests that found them during the test effort are very valuable in the production environment. This type of bug is important in the test effectiveness metric, discussed later in this chapter.
For this metric, there are two main genres: (1) bugs found before the product ships or goes live and (2) bugs found after; or, alternately, bugs found by testers and bugs found by customers. As I have already said, this is a very weak measure until you bring it into perspective using other measures, such as the severity of the bugs found.
This measurement is usually established by the users of the product and
This is an important metric in establishing an answer to the question "Was the test effort worth it?" But,
Sample Units: Quantity, severity, and currency.
This is a most useful derived metric both for measuring the cost of testing and for assessing the stability of the system. The bug find rate is closely
Consider Tables 5.2 and 5.3. The following statistics are taken from a case study of a
| | |
|---|---|
| Bugs found/hour | 5.33 bugs found/hr |
| Cost/bug to find | $9.38/bug to find |
| Bugs reported/hr | 3.25 bugs/hr |
| Cost to report | $15.38/bug to report |
| Cost/bug find and report | $24.76/bug to find and report |

| | |
|---|---|
| Bugs found/hour | 0.25 bugs found/hr |
| Cost/bug to find | $199.79/bug to find |
| Bugs reported/hr | 0.143 bugs/hr |
| Cost to report | $15.38/bug to report |
| Cost/bug find and report | $215.17/bug to find and report |
Notice that the cost of reporting and tracking bugs is normally higher than the cost of finding bugs in the early part of the test effort. This situation changes as the bug find rate
By week 4, the number of bugs being found per hour has dropped significantly. It should drop as the end of the test effort is approached. However, the cost to find each successive bug rises, since testers must look longer to find a bug, but they are still paid by the hour.
These tables are helpful in explaining the cost of testing and in evaluating the readiness of the system for production.
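The derived figures in these tables follow from two inputs: what an hour of testing costs and how many bugs are found or reported in that hour. The following is a minimal sketch of that arithmetic. The hourly rate of $50 is an assumption inferred from the published figures, not a number stated in the case study, and the function name `cost_per_bug` is hypothetical.

```python
def cost_per_bug(hourly_rate: float, bugs_per_hour: float) -> float:
    """Cost of one bug when testers are paid by the hour."""
    if bugs_per_hour <= 0:
        raise ValueError("bugs_per_hour must be positive")
    return hourly_rate / bugs_per_hour


HOURLY_RATE = 50.0  # assumed; the case study does not state the rate

# Early in the test effort: 5.33 bugs found/hr and 3.25 bugs reported/hr.
find_cost = round(cost_per_bug(HOURLY_RATE, 5.33), 2)
report_cost = round(cost_per_bug(HOURLY_RATE, 3.25), 2)
total_cost = round(find_cost + report_cost, 2)

print(find_cost, report_cost, total_cost)
```

Under the same assumed rate, a find rate of 0.25 bugs/hr gives about $200 per bug, which is close to the published $199.79. This is why the cost to find each successive bug rises as the find rate falls.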
Figure 5.1 shows the bug concentrations in four modules of a system during the system test phase. A graph of this type is one of the simplest and most efficient tools for determining where to concentrate development and test resources in a project.
Figure 5.1: Bug density per unit.
Bug
Even though this type of chart is one of the most useful tools testers have for measuring code worthiness, it is one of the most seldom published. There is a common fear that these metrics will be used against someone. Care should be taken that these metrics are not misused. The highest bug density usually resides in the newest modules, or the most experimental modules. High bug densities do not
As we have just discussed, there are various classes of bugs. Some of them can be eradicated, and some of them cannot. The most
If a significant percentage of the bugs being found in testing are serious, then there is a definite risk that the users will also find serious bugs in the shipped product. The following statistics are taken from a case study of a shrink-wrap RAD project. Table 5.4 shows separate categories for the bugs found and bugs reported.
| ERROR RANKING | RANKING DESCRIPTION | BUGS FOUND | BUGS REPORTED |
|---|---|---|---|
| Severity 1 Errors | GPF or program ceases meaningful operation | 18 | 9 |
| Severity 2 Errors | Severe function error but application can continue | 11 | 11 |
| Severity 3 Errors | Unexpected result or inconsistent operation | 19 | 19 |
| Severity 4 | Design or suggestion | | |
| Totals | | 48 | 39 |
Management required that only bugs that could be reproduced be reported. This is a practice that I discourage because it allows management and development to ignore the really hard bugs: the unreproducible ones. These bugs are then shipped to the users. Notice that half the Severity 1 bugs found were not reported. Inevitably, it will fall on the support group to try to isolate these problems when the users begin to report them.
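The distribution in Table 5.4 can be checked with a few lines of arithmetic. The sketch below uses the counts from the table. The function name `share_by_severity` is hypothetical, and Severity 4 is left out because the table lists no counts for it.

```python
# Counts taken from Table 5.4 (Severity 4 has no counts in the table).
bugs_found = {"Severity 1": 18, "Severity 2": 11, "Severity 3": 19}
bugs_reported = {"Severity 1": 9, "Severity 2": 11, "Severity 3": 19}


def share_by_severity(counts: dict[str, int]) -> dict[str, float]:
    """Return each severity's percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {sev: round(100 * n / total, 1) for sev, n in counts.items()}


found_share = share_by_severity(bugs_found)

# Bugs that were found but never reported, per severity.
unreported = {sev: bugs_found[sev] - bugs_reported[sev] for sev in bugs_found}

print(found_share)
print(unreported)
```

With these counts, Severity 1 accounts for 37.5 percent of the bugs found, and all nine unreported bugs are Severity 1.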
Figure 5.2 shows the graphical representation of the bugs found shown in Table 5.4.
Figure 5.2: Bug distribution by severity.
The Severity 1 bugs found (18 of 48) represent 38 percent of all bugs found. That means that over a third of all the bugs in the product are serious. Simply put, the probability is that one of every three bugs the user finds in the shipped product will be serious. In this case, it is even
There is a pervasive myth in the industry that all of the bugs found during testing are fixed before the product is shipped. Statistics gathered between 1993 and 1994
Figure 5.3: Bug fix rate from 1998 study.
Many of the shipped bugs are
The risk of not shipping on time is better understood than the risk of shipping bugs that cannot be easily reproduced or are not well
The next metrics help measure the test effort itself, answering questions about how much was tested, what was achieved, and how productive the effort was.
Given a set of things that could be tested, test coverage is the portion that was actually tested. Test coverage is generally presented as a percentage.
For example, 100 percent statement coverage means that all of the statements in a program were tested. At the unit test level, test coverage is commonly used to measure statement and branch coverage. This is an absolute measure; it is based on known countable
It is important to note that just because every statement in a group of programs that comprise a system was tested, this does not mean that the system was tested 100 percent. Test coverage can be an absolute measure at the unit level, but it quickly becomes a relative measure at the system level. It is relative because, while there is a finite number of statement tests and branch tests in a program, the number of tests that exist for a system is an unbounded set; for all practical purposes, an infinite set. Just as testing can never find all the bugs in a system,
For test coverage to be a useful measurement at the system level, a list of tests must be
The value of this test coverage metric depends on the quality and completeness of the test inventory. A test coverage of 100 percent of a system is only possible if the test inventory is very limited. Tom Gilb calls this "painting the bull's-eye around the arrow."
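As a minimal sketch of this measurement, coverage at the system level can be computed as the share of the test inventory that was actually run. The function name `coverage_percent` and the test names are hypothetical and serve only as an illustration.

```python
def coverage_percent(tests_run: set[str], test_inventory: set[str]) -> float:
    """Percentage of the test inventory that was actually exercised."""
    if not test_inventory:
        raise ValueError("test inventory is empty")
    covered = tests_run & test_inventory  # ignore tests outside the inventory
    return 100 * len(covered) / len(test_inventory)


# Hypothetical inventory and executed tests.
inventory = {"login", "logout", "search", "checkout"}
executed = {"login", "search"}

print(coverage_percent(executed, inventory))

# Shrinking the inventory to what was run yields 100 percent coverage.
print(coverage_percent(executed, {"login", "search"}))
```

The second call shows why the number means little on its own: the same two tests report 50 percent or 100 percent depending entirely on how the inventory was drawn up.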
Test effectiveness is a measure of the bug-finding ability of the test set. If a comprehensive test inventory is constructed, it will probably be too large to exercise completely. This will be demonstrated as we proceed through the next several chapters. The goal is to pick the smallest test set from the test inventory that will find the most bugs while staying within the time frame. In a test effort, adequate test coverage does not necessarily require that the test set achieve a high rate of test coverage with respect to the test inventory.
It is usually easier to
We can answer the question, "How good were the tests?" in several ways. One of the most common is to answer the question in terms of the number of bugs found by the users, and the type of bugs they were.
The bug-finding effectiveness of the test set can be measured by taking the ratio of the number of bugs found by the test set to the total bugs found in the product.
An effective test suite will maximize the number of bugs found during the test effort. We also want this test suite to be the smallest test set from the inventory that will accomplish this goal. This approach yields the highest test effectiveness (most bugs found) and highest test efficiency (least effort, expense, or waste). For example, if the test coverage of a system test suite covers only 50 percent of the test inventory but it finds 98 percent of all the bugs ever found in the system, then it probably provided adequate test coverage. The point is, the tests in the suite were the right 50 percent of the inventory: the most important tests. These tests found most of the bugs that were important to the user community. The benefits of increasing the test coverage for this system would be minimal.
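The ratio described above can be sketched with sets of bug identifiers, so that a bug found by both the testers and the users is counted only once in the total. The function name `effectiveness_percent` and the bug IDs are hypothetical.

```python
def effectiveness_percent(found_by_tests: set[str], found_by_users: set[str]) -> float:
    """Bugs found by the test set as a percentage of all bugs found."""
    all_bugs = found_by_tests | found_by_users  # union counts shared bugs once
    if not all_bugs:
        raise ValueError("no bugs have been found")
    return 100 * len(found_by_tests) / len(all_bugs)


# Hypothetical bug IDs; B4 was found by both testers and users.
tester_bugs = {"B1", "B2", "B3", "B4"}
user_bugs = {"B4", "B5"}

print(effectiveness_percent(tester_bugs, user_bugs))
```

Here five distinct bugs were found in total and the test set found four of them, so the test effectiveness is 80 percent.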
Test effectiveness only measures the percentage of bugs that the test effort found. Some bugs will be found by both the testers and the users. These are only counted once. The test effectiveness metric
Test effectiveness is valuable when you are evaluating the quality of the test set. I use it as one of the selection criteria when I am distilling a test set that is a candidate for becoming a part of a production diagnostic suite. All the tests that I run during a test effort are part of the most important tests suite. The subset of these most important tests that can discover a failure are valuable indeed.
To select this subset, I use test effectiveness in conjunction with a certain class of bugs that I call failures. Failures are bugs that can recur even after they appear to have been fixed. (See the examples listed under the section Bug Type Classification earlier in the chapter.)
This failure-finding subset of the most important tests can provide years of value in the production environment as a diagnostics suite,
When the test effort can identify this test set and instantiate it in the production environment, the testers are delivering a very good return on the test investment. Some of my diagnostics suites have run for years in production environments with few changes. Often these tests are still running long after the original coding of the application or system has been
The following set of derived metrics is used to track the test effort and judge the readiness of the product. By
The number of tests attempted by a given time
The number of tests that passed by a given time
The number of bugs that were found by a given time
The number of bugs that were fixed by a given time
The average time between failures
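These tracking metrics are running totals over time, plus one average. The sketch below shows one way they might be computed from a daily log. The log layout, the sample values, and the function names are all hypothetical.

```python
from datetime import datetime
from itertools import accumulate

# Hypothetical daily log: (day, tests attempted, tests passed, bugs found, bugs fixed)
daily_log = [
    ("day 1", 10, 6, 5, 0),
    ("day 2", 12, 9, 4, 3),
    ("day 3", 15, 13, 2, 4),
]


def cumulative(column: int) -> list[int]:
    """Running total of one log column: the value 'by a given time'."""
    return list(accumulate(row[column] for row in daily_log))


def mean_time_between_failures(failure_times: list[datetime]) -> float:
    """Average gap, in hours, between consecutive failures."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures")
    ordered = sorted(failure_times)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


tests_attempted = cumulative(1)
tests_passed = cumulative(2)
bugs_found_so_far = cumulative(3)
bugs_fixed_so_far = cumulative(4)

# Hypothetical failure timestamps.
failures = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 20, 0),
]
print(tests_attempted, tests_passed, bugs_found_so_far, bugs_fixed_so_far)
print(mean_time_between_failures(failures))
```

Plotting the cumulative lists against time gives the curves used to track progress against the plan.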
When you use the Most Important Tests method, it is possible to show management how big it is and what a given level of test coverage will require in time and resources. From this basis, it is possible to calculate cost. These techniques are very useful in the test estimation phase. S-curves can help you stay on track during the testing phase, and performance metrics can help you determine the actual cost of testing and what you got for the effort.
Calculating the cost of the effort afterward is one of the simplest of all the metrics in this chapter, even though it may have several components. It is very useful in making a
Note: Today, you have to show management that it was worth it.
Not only does the test effort have to provide proof of its performance, it also needs to show that it adds enough value to the product to justify its budget. The performance of the test effort can be most accurately measured at the end of the life of a release. This makes it difficult to justify an ongoing test effort. Constant measurement by testers is the key to demonstrating performance.
Performance is (1) the act of performing, for instance, execution, accomplishment, fulfillment, and so on; and (2) operation or functioning, usually with regard to effectiveness.
The goal of the test effort is to minimize the number of bugs that the users find in the product. We accomplish this goal by finding and removing bugs before the product is shipped to the users. The performance of the test effort is based on the ratio of the total bugs found and fixed during test to all the bugs ever found in the system.
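As a minimal sketch, the performance ratio is a single division. The function name `performance_ratio` and the counts used in the example are hypothetical; they are not taken from any case study in this chapter.

```python
def performance_ratio(fixed_during_test: int, all_bugs_ever_found: int) -> float:
    """Bugs found and fixed during test as a percentage of all bugs ever found."""
    if all_bugs_ever_found <= 0:
        raise ValueError("all_bugs_ever_found must be positive")
    if not 0 <= fixed_during_test <= all_bugs_ever_found:
        raise ValueError("fixed_during_test must be between 0 and the total")
    return 100 * fixed_during_test / all_bugs_ever_found


# Hypothetical counts: 150 bugs fixed during test, 200 bugs ever found.
print(performance_ratio(150, 200))
```

Because the denominator includes bugs found after release, the ratio can only be computed with confidence some time after the product ships.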
The performance of the last test effort, including its test coverage, bug fix rate, and the number of serious bugs that occurred or required fixes in the shipped product, is used to evaluate the adequacy of the test effort. The cost of the test effort and the cost of the bugs found in production are also considered.
We measure to
This information can be used to adjust test coverage and bug fix requirements on the next release. For example, two test efforts conducted on similarly
Table 5.5: Determining If the Test Effort Was Adequate
| TEST COVERAGE | AVG. CALLS TO CUSTOMER SERVICE PER LICENSE IN FIRST 90 DAYS | BUG FIX RATE | PERFORMANCE RATIO (6 MO. AFTER RELEASE) | SEVERITY 1 BUGS REPORTED IN PRODUCTION | SEVERITY 2 BUGS REPORTED IN PRODUCTION |
|---|---|---|---|---|---|
| CASE STUDY 1 | | | | | |
| 67% | 5 | 70% | 98% | | 6 |
| CASE STUDY 2 | | | | | |
| 100% | 30 | 50% | 75% | 7 | 19 |

Severity 1 = Most serious bugs; Severity 2 = Serious bugs
One of the problems in Case Study 2 is that the test inventory was probably insufficient. When a poor test inventory is
It is necessary to distinguish between the adequacy of the test effort as a whole and the adequacy of the test coverage, because many of the problems that occur in production environments actually occurred during the test cycle. In other words, the bugs were triggered by the test set, but the bugs were not fixed during the test effort and consequently were shipped with the product.
The challenge here is that traditional test efforts cannot remove bugs directly. I have already talked about the management tools "argument" and "persuasion." The current trend is toward find-and-fix.
As I mentioned already, this metric is very useful as a point of comparison, but very difficult to establish. The preceding case studies are good examples of how to limp around this problem. I have had the opportunity to conduct a couple of forensic studies in recent years, and I am convinced that the cost of not testing is profound, but because a product failure is such a sensitive issue, I have never been in a position to safely publish any details. It is a common
You can compare the cost of fixing a bug found in testing to the cost of fixing a bug in the shipped product. I have had some luck in recent years showing that my test effort had a positive cost-benefit ratio using this technique, but it requires intimate knowledge of the customer support process and the ability to track costs in the support area and the development area.
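One way to frame this comparison, offered only as a sketch, is to set the production fix cost that was avoided against what was actually spent during testing. The formula, the function name `benefit_cost_ratio`, and every figure below are assumptions for illustration. Real values have to come from the support and development cost tracking mentioned above.

```python
def benefit_cost_ratio(
    bugs_fixed_in_test: int,
    cost_to_fix_in_test: float,
    cost_to_fix_in_production: float,
    cost_of_test_effort: float,
) -> float:
    """Avoided production fix cost divided by what testing actually cost."""
    avoided = bugs_fixed_in_test * cost_to_fix_in_production
    spent = bugs_fixed_in_test * cost_to_fix_in_test + cost_of_test_effort
    if spent <= 0:
        raise ValueError("total spend must be positive")
    return avoided / spent


# Hypothetical figures: 40 bugs fixed in test at $100 each, versus $1,500
# each in production, with a $20,000 test effort.
ratio = benefit_cost_ratio(40, 100.0, 1500.0, 20000.0)
print(ratio)
```

A ratio above 1 suggests that the test effort paid for itself under these assumptions.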
Testing never finds and
The test effort
I already mentioned the diagnostics suites for production. Installation instructions are another normal product of the test effort; they may be
Finally, since testers are the first expert users of a product, their questions, working notes, and instructions usually form the foundation of the user guide and seed for frequently asked questions documentation. It is a sad waste when the documentation creation process does not take advantage of this resource.
[1]
There are