Test Execution

Overview

"Knowledge must come through action; you can have no test which is not fanciful, save by trial."

— Sophocles

"Take time to deliberate, but when the time for action has arrived, stop thinking and go in."

— Napoleon Bonaparte

Test execution is the process of executing all or a selected number of test cases and observing the results. Although preparation and planning for test execution occur throughout the software development lifecycle, the execution itself typically occurs at or near the end of the software development lifecycle (i.e., after coding).



Before Beginning Test Execution

When most people think of testing, test execution is the first thing that comes to mind. So, why is there an emphasis on execution? Well, not only is test execution the most visible testing activity, it also typically occurs at the end of the development lifecycle after most of the other development activities have concluded or at least slowed down. The focus then shifts to test execution, which is now on the critical path of delivery of the software. By-products of test execution are test incident reports, test logs, testing status, and results, as illustrated in Figure 7-1.

click to expand
Figure 7-1: Executing the Tests

By now, you should know that we consider testing to be an activity that spans the entire software development lifecycle. According to Martin Pol, "test execution may only consume 40% or less of the testing effort," but it typically has to be concluded as quickly as possible, which means that there is often an intense burst of effort applied in the form of long hours and borrowed resources. Frequently, it also means that the test execution effort will be anxiously scrutinized by the developers, users, and management.

Deciding Who Should Execute the Tests

Who executes the tests is dependent upon the level of test. During unit test it is normal for developers to execute the tests. Usually each developer will execute their own tests, but the tests may be executed by another programmer using techniques like buddy testing (refer to Chapter 4 - Detailed Test Planning for more information on buddy testing). Integration tests are usually executed by the developers and/or the testers (if there is a test group). System tests could be executed by the developers, testers, end-users, or some combination thereof. Some organizations also use developers to do the system testing, but in doing so, they lose the fresh perspective provided by the test group.

Ideally, acceptance tests should be executed by end-users, although the developers and/or testers may also be involved. Table 7-1 shows one way that test execution may be divided, but there is really no definitive answer as to who should execute the tests. Ideally, we are looking for people with the appropriate skill set, although sometimes we're lucky to find somebody, anybody that's available.

Table 7-1: Responsibilities for Test Execution

Responsible Group

Unit

Integration

System

Acceptance

Testers

 

ü

ü

ü

Developers

ü

ü

ü

 

End-Users

   

ü

ü

During test execution, the manager of the testing effort is often looking for additional resources. Potential testers might include: members of the test team (of course), developers, users, technical writers, trainers, or help desk staff members. Some organizations even bring in college interns to help execute tests. College interns and new-hires can be used successfully if the test cases are explicit enough for them to understand and they've received training in how to write an effective incident report. We must be cautious, though, or we might spend more time training the neophyte testers than it's worth. On the other hand, having new testers execute tests is one way to quickly make them productive and feel like part of the team.

  Key Point 

"Newbies" make good candidates for usability testing because they're not contaminated by previous knowledge of the product.

Deciding What to Execute First

  Key Point 

As a rule of thumb, we normally recommend that the regression test set (or at least the smoke test) be run in its entirety early on to flag areas that are obviously problematic.

Choosing which test cases to execute first is a strategy decision that depends on the quality of the software, resources available, existing test documentation, and the results of the risk analysis. As a rule of thumb, we normally recommend that the regression test set (or at least the smoke test) be run in its entirety early on to flag areas that are obviously problematic. This strategy may not be feasible if the regression set is exceptionally large or totally manual. Then, the focus of the test should be placed on those features that were identified as high-risk during the software risk analysis described in Chapter 2. High-risk features will almost certainly contain all features that were extensively modified, since we know that changed features are typically assigned a higher likelihood of failure.

Writing Test Cases During Execution

Really, no matter how good you and your colleagues are at designing test cases, you'll always think of new test cases to write when you begin test execution (this is one of the arguments for techniques such as exploratory testing). As you run tests, you're learning more about the system and are in a better position to write the additional test cases. Unfortunately, in the heat of battle (test execution) when time is short, these tests are often executed with no record made of them unless an incident is discovered. This is truly unfortunate, because some of these "exploratory" test cases are frequently among the most useful ones created and we would like to add them to our test case repository, since we have a long-term goal of improving the coverage of the test set. One of our clients includes notes in the test log as a shorthand way of describing these exploratory test cases. Then, after release, they use the test log to go back and document the test cases and add them to the test case repository.

  Key Point 

No matter how good you and your colleagues are at designing test cases, you'll always think of new test cases to write when you begin test execution.

Recording the Results of Each Test Case

Obviously, the results of each test case must be recorded. If the testing is automated, the tool will record both the input and the results. If the tests are manual, the results can be recorded right on the test case document. In some instances, it may be adequate to merely indicate whether the test case passed or failed. Failed test cases will also result in an incident report being generated. Often, it may be useful to capture screens, copies of reports, or some other output stream.



Test Log

The IEEE Std. 829-1998 defines the test log as a chronological record of relevant details about the execution of test cases. The purpose of the test log shown in Figure 7-2 is to share information among testers, users, developers, and others and to facilitate the replication of a situation encountered during testing.

IEEE Std. 829-1998 for Software Test Documentation

Template for Test Log

Contents

1.

Test Log Identifier

2.

Description

3.

Activity and Event Entries


Figure 7-2: Test Log Template from IEEE Std. 829-1998

In order for a test log to be successful, the people that must submit data into and eventually use the log must want to do so. Forcing participants to use a test log when they don't want to use it is seldom successful. In order to make it desirable, the test log must be easy to use and valuable to its users.

  Key Point 

Since the primary purpose of the test log is to share information rather than analyze data, we recommend making the log free form.

Since the primary purpose of the test log is to share information rather than analyze data, we recommend making the log free form, instead of using fields or buttons, which are desirable in other areas such as defect tracking. If the testing team is small and co-located, the test log might be as simple as a spiral notebook in which testers and/or developers can make log entries. Alternatively, it might be more convenient to have a word-processed document or e-mailed form. If the team members are geographically separated, the test log would probably be better served in the form of a Web page or company intranet. Wherever it is, the test log should be easy to access and update. One of our clients, for example, has a continuously open active window on their monitor where a thought can be entered at any time.

Table 7-2 shows an example of a test log sheet. Notice that even though it mentions the writing up of PR#58, it doesn't go into any detail. The purpose of the log is not to document bugs (we have defect tracking systems for that), but rather to record events that you want to share among the team members or use for later recall. The larger the team and/or the project and the more geographically separated they are, the more important the log becomes.

Table 7-2: Sample Test Log

Description: Online Trade

Date: 01/06/2002

ID

Time

Activity and Event Entries

1

08:00

Kicked off test procedure #18 (buy shares) with 64 users on test system.

2

09:30

Test system crashed.

3

10:00

Test system recovered.

4

10:05

Kicked off test procedure #19 (sell shares).

5

11:11

PR #58 written up.

6

12:00

New operating system patch installed.



Test Incident Reports

Incidents can be defined as any unusual result of executing a test (or actual operation). Incidents may, upon further analysis, be categorized as defects or enhancements, or merely retain their status as an incident if they're determined to be inconsequential or the result of a one-time anomaly. A defect (or bug) is a flaw in the software with the potential to cause a failure. The defect can be anywhere: in the requirements, design, code, test, and/or documentation. A failure occurs when a defect prevents a system from accomplishing its mission or operating within its specifications. Therefore, a failure is the manifestation of one or more defects. One defect can cause many failures or none, depending on the nature of the defect. An automated teller machine (ATM), for example, may fail to dispense the correct amount of cash, or an air defense system may fail to track an incoming missile. Testing helps you find the failure in the system, but then you still have to track the failure back to the defect.

  Key Point 

Some software testing books say, "The only important test cases are the ones that find bugs."

We believe that proving that some attribute of the system works correctly is just as important as finding a bug.

Defect tracking is an important activity and one that almost all test teams accomplish. Defect tracking is merely a way of recording software defects and their status. In most organizations, this process is usually done using a commercial or "home-grown" tool. We've seen many "home-grown" tools that were developed based on applications such as Infoman, Lotus Notes, Microsoft Access, and others.

Case Study 7-1: On September 9, 1945, a moth trapped between relays caused a problem in Harvard University's Mark II Aiken Relay Calculator.

The First Computer Bug

I'm very proud of the fact that I got to meet Rear Admiral Grace Murray Hopper, the famous computer pioneer, on two separate occasions. On one of these meetings, I even received one of Admiral Hopper's "nanoseconds," a small piece of colored "bell" wire about a foot or so long, which Grace often gave to her many admirers to show them how far electricity would travel in one nanosecond.

Among Rear Admiral Grace Murray Hopper's many accomplishments were the invention of the programming language COBOL and the attainment of the rank of Rear Admiral in the U.S. Navy (one of the first women to ever reach this rank). But, ironically, Admiral Hopper is probably most famous for an event that occurred when she wasn't even present.

Admiral Hopper loved to tell the story of the discovery of the first computer bug. In 1945, she was working on the Harvard University Mark II Aiken Relay Calculator. On September 9 of that year, while Grace was away, computer operators discovered that a moth trapped between the relays was causing a problem in the primitive computer. The operators removed the moth and taped it to the computer log next to the entry, "first actual case of a bug being found." Many people cite this event as the first instance of using the term "bug" to mean a defect. The log with the moth still attached is now located in the History of American Technology Museum.

Even though this is a great story, it's not really the first instance of using the term "bug" to describe a problem in a piece of electrical gear. Radar operators in World War II referred to electronic glitches as bugs, and the term was also used to describe problems in electrical gear as far back as the 1900s. The following is a slide that I used in a presentation that I gave at a testing conference in the early 1980s shortly after meeting Admiral Grace Hopper.

click to expand

— Rick Craig

IEEE Template for Test Incident Report

An incident report provides a formal mechanism for recording software incidents, defects, and enhancements and their status. Figure 7-3 shows the IEEE template for a Test Incident Report. The parts of the template in Figure 7-3 shown in italics are not part of the IEEE template, but we've found them to be useful to include in the Test Incident Report. Please feel free to modify this (or any other template) to meet your specific needs.

IEEE Std. 829-1998 for Software Test Documentation Template for Test Incident Report

Contents

1.

Incident Summary Report Identifier

2.

Incident Summary

3.

Incident Description

 

3.1

Inputs

 

3.2

Expected Results

 

3.3

Actual Results

 

3.4

Anomalies

 

3.5

Date and Time

 

3.6

Procedure Step

 

3.7

Environment

 

3.8

Attempts to Repeat

 

3.9

Testers

 

3.10

Observers

4.

Impact

5.

Investigation

6.

Metrics

7.

Disposition


Figure 7-3: Template for Test Incident Report from IEEE Std. 829-1998

Incident Summary Report Identifier

The Incident Summary Report Identifier uses your organization's incident tracking numbering scheme to identify this incident and its corresponding report.

Incident Summary

The Incident Summary is the information that relates the incident back to the procedure or test case that discovered it. This reference is often missing in many companies and is one of the first things that we look for when we're auditing their testing processes. Absence of the references on all incident reports usually means that the testing effort is largely ad hoc. All identified incidents should have a reference to a test case. If an incident is discovered using ad hoc testing, then a test case should be written that would have found the incident. This test case is important in helping the developer recreate the situation, and the tester will undoubtedly need to re-run the test case after any defect is fixed. Also, defects have a way of reappearing in production and this is a good opportunity to fill in a gap or two in the test coverage.

  Key Point 

All identified incidents should have a reference to a test case. If an incident is discovered using ad hoc testing, then a test case should be written that would have found the incident.

Incident Description

The author of the incident report should include enough information so that the readers of the report will be able to understand and replicate the incident. Sometimes, the test case reference alone will be sufficient, but in other instances, information about the setup, environment, and other variables is useful. Table 7-3 describes the subsections that appear under Incident Description.

Table 7-3: Subsections Under Incident Description

Section Heading

Description

4.1 Inputs

Describes the inputs actually used (e.g., files, keystrokes, etc.).

4.2 Expected Results

This comes from the test case that was running when the incident was discovered.

4.3 Actual Results

Actual results are recorded here.

4.4 Anomalies

How the actual results differ from the expected results. Also record other data (if it appears to be significant) such as unusually light or heavy volume on the system, it's the last day of the month, etc.

4.5 Date and Time

The date and time of the occurrence of the incident.

4.6 Procedure Step

The step in which the incident occurred. This is particularly important if you use long, complex test procedures.

4.7 Environment

The environment that was used (e.g., system test environment or acceptance test environment, customer 'A' test environment, beta site, etc.)

4.8 Attempts to Repeat

How many attempts were made to repeat the test?

4.9 Testers

Who ran the test?

4.10 Observers

Who else has knowledge of the situation?

Impact

The Impact section of the incident report form refers to the potential impact on the user, so the users or their representative should ultimately decide the impact of the incident. The impact will also be one of the prime determinants in the prioritization of bug fixes, although the resources required to fix each bug will also have an effect on the prioritization.

One question that always arises is, "Who should assign the impact rating?" We believe that the initial impact rating should be assigned by whoever writes the incident report. This means that if the incident is written as a result of an incorrect response to a test case, the initial assignment will be made by a tester.

Many people think that only the user should assign a value to the impact, but we feel that it's important to get an initial assessment of the impact as soon as possible. Most testers that we know can correctly determine the difference between a really critical incident and a trivial one. And, it's essential that incidents that have the potential to become critical defects be brought to light at the earliest opportunity. If the assignment of criticality is deferred until the next meeting of the Change Control Board (CCB) or whenever the user has time to review the incident reports, valuable time may be lost. Of course, when the CCB does meet, one of their most important jobs is to review and revise the impact ratings.

  Key Point 

It's essential that incidents that have the potential to become critical defects be brought to light at the earliest opportunity.

A standardized organization-wide scale should be devised for the assignment of impact. Oftentimes, we see a scale such as Minor, Major, and Critical; Low, Medium, and High; 1 through 5; or a variety of other scales. Interestingly enough, we discovered a scale of 1 through 11 at one client site. We thought that was strange and when we queried them, they told us that they had so many bugs with a severity (or impact) of 10 that they had to create a new category of 11. Who knows how far they may have expanded their impact scale by now (…35, 36, 37)? It's usually necessary to have only four or five severity categories. We're not, for example, looking for a severity rating of 1 to 100 on a sliding scale. After all, how can you really explain the difference between a severity of 78 and 79? The key here is that all of the categories are defined.

If your categories are not defined, but just assigned on a rolling scale, then the results will be subjective and very much depend on who assigns the value. The imprecision in assigning impact ratings can never be totally overcome, but it can be reduced by defining the parameters (and using examples) of what minor, major, and critical incidents look like. Case Study 7-2 shows the categories chosen by one of our clients.

Case Study 7-2: Example Impact Scale

Example of Minor, Major, and Critical Defects

Minor:

Misspelled word on the screen.

Major:

System degraded, but a workaround is available.

Critical:

System crashes.

  Key Point 

The imprecision in assigning impact ratings can never be totally overcome, but it can be reduced by defining the parameters (and using examples) of what minor, major, and critical incidents look like.

Investigation

The Investigation section of the incident report explains who found the incident and who the key players are in its resolution. Some people also collect some metrics here on the estimated amount of time required to isolate the bug.

Metrics

Initially, most testers automatically assume that every incident is a software problem. In some instances, the incident may be a hardware or environment problem, or even a testing bug! Some of our clients are very careful to record erroneous tests in the defect tracking system, because it helps them to better estimate their future testing (i.e., how many bad test cases are there?) and helps in process improvement. As an aside, one company told us that recording erroneous test cases helped give their testers a little extra credibility with the developers, since the testers were actually admitting and recording some of their own bugs. In most cases, though, testers don't normally like to record their own bugs, just as many developers don't like to record their own unit testing bugs.

  Key Point 

The Metrics section of the incident report can be used to record any number of different metrics on the type, location, and cause of the incidents.

The Metrics section of the incident report can be used to record any number of different metrics on the type, location, and cause of the incidents. While this is an ideal place to collect metrics on incidents, be cautious not to go overboard. If the incident report gets too complicated or too long, testers, users, and developers will get frustrated and look for excuses to avoid recording incidents.

Case Study 7-3: What happens when there's a defect in the testware?

Good Initiative, But Poor Judgment

I learned that every bug doesn't have to be a software bug the hard way. In the late 1980's, I was the head of an independent test team that was testing a critical command and control system for the U.S. military. During one particularly difficult test cycle, we discovered an alarming number of bugs. One of my sergeants told me that I should go inform the development manager (who was a Brigadier General, while I was only a Captain) that his software was the worst on the planet. I remember asking the sergeant, "Are you sure all of the tests are okay?" and he replied, "Yes, sir. It's the software that's bad."

In officer training school we were taught to listen to the wisdom of our subordinate leaders, so I marched up to the General's office and said, "Sir, your software is not of the quality that we've come to expect." Well, you can almost guess the ending to this story. The General called together his experts, who informed me that most of the failures were caused by a glitch in our test environment. At that point, the General said to me, "Captain, good initiative, but poor judgment," before I was dismissed and sent back to my little windowless office.

The moral of the story is this: If you want to maintain credibility with the developers, make sure your tests are all valid before raising issues with the software.

— Rick Craig

Disposition (Status)

In a good defect tracking system, there should be the capability to maintain a log or audit trail of the incident as it goes through the analysis, debugging, correction, re-testing, and implementation process. Case Study 7-4 shows an example of an incident log recorded by one of our clients.

Case Study 7-4: Example Incident Log Entry

Example Incident Log

2/01/01 Incident report opened.

2/01/01 Sent to the CCB for severity assignment and to Dominic for analysis.

2/03/01 Dominic reports that the fix is fairly simple and is in code maintained by Hunter C.

2/04/01 CCB changed the severity to Critical.

2/04/01 Bug fix assigned to Hunter.

2/06/01 Bug fix implemented and inspection set for 2/10/01.

2/10/01 Passed inspection and sent to QA.

2/12/01 Bug fix is re-tested and regression run. No problems encountered.

2/12/01 Incident report closed.

  Note 

Closed incident reports should be saved for further analysis of trends and patterns of defects.

Writing the Incident Report

  Key Point 

Most incident tracking tools are also used to track defects, and the tools are more commonly called defect tracking tools than incident tracking tools.

We are often asked, "Who should write the incident report?" The answer is, "Whoever found the incident!" If the incident is found in the production environment, the incident report should be written by the user - if it's culturally acceptable and if the users have access to the defect tracking system. If not, then the help desk is a likely candidate to complete the report for the user. If the incident is found by a tester, he or she should complete the report. If an incident is found by a developer, it's desirable to have him or her fill out the report. In practice, however, this is often difficult, since most programmers would rather just "fix" the bug than record it. Very few of our clients (even the most sophisticated ones) record unit testing bugs, which are the most common type of bugs discovered by developers. If the bug is discovered during the course of a review, it should be documented by the person who is recording the minutes of the review.

  Key Point 

Very few of our clients record unit testing bugs, which are the most common type of bugs discovered by developers.

It's a good idea to provide training or instructions on how to write an incident report. We've found that the quality of the incident report has a significant impact on how long it takes to analyze, recreate, and fix a bug. Specifically, training should teach the authors of the incident reports to:

  • focus on factual data
  • ensure the situation is re-creatable
  • not use emotional language (e.g., bold text, all caps)
  • not be judgmental

Attributes of a Defect Tracking Tool

Most organizations that we work with have some kind of defect tracking tool. Even organizations that have little else in the way of formal testing usually have some type of tracking tool. This is probably because in many organizations, management's effort is largely focused on the number, severity, and status of bugs.

  Key Point 

In a survey conducted in 1997, Ross Collard reported that defect tracking tools were the most commonly used testing tools among his respondents (83%).

- Ross Collard 1997 Rational ASQ User Conference

We also find that some companies use commercial defect tracking tools, while others create their own tools using applications that they're familiar with such as MS-Word, MS-Access, Lotus Notes, Infoman, and others. Generally, we urge companies to buy tools rather than make them themselves unless the client environment is so unique that there is no tool that will fit their requirements. Remember that if you build the tool yourself, you also have to document it, test it, and maintain it.

Ideally, a defect tracking tool should be easy to use, be flexible, allow easy data analysis, integrate with a configuration management system, and provide users with easy access. Ease of use is why so many companies choose to build their own defect tracking tool using some familiar product like Lotus Notes. If the tool is difficult to use, is time-consuming, or asks for a lot of information that the author of an incident report sees no need for, use of the tool will be limited and/or the data may not be accurate. Software engineers have a way of recording "any old data" in fields that they believe no one will use.

  Key Point 

Ideally, a defect tracking tool should be easy to use, be flexible, allow easy data analysis, integrate with a configuration management system, and provide users with easy access.

A good defect tracking tool should allow the users to modify the fields to match the terminology used within their organization. In other words, if your organization purchases a commercial tool that lists severity categories of High, Medium, and Low, users should be able to easily change the categories to Critical, Major, Minor, or anything else they require.

  Key Point 

A good defect tracking tool should allow the users to modify the fields to match the terminology used within their organization.

The tool should facilitate the analysis of data. If the test manager wants to know the distribution of defects against modules, it should be easy for him or her to get this data from the tool in the form of a table or chart. This means that most of the input into the tool must be in the form of discrete data rather than free-form responses. Of course there will always be a free-form description of the problem, but there should also be categories such as distribution, type, age, etc. that are discrete for ease of analysis. Furthermore, each record needs to have a dynamic defect log associated with it in order to record the progress of the bug from discovery to correction (refer to Case Study 7-4).

Ideally, the incident reports will be linked to the configuration management system used to control the builds and/or versions.

All users of the system must be able to easily access the system at all times. We have occasionally encountered organizations where incident reports could only be entered into the defect tracking system from one or two computers.

Using Multiple Defect Tracking Systems

Although we personally prefer to use a single defect tracking system throughout the organization, some organizations prefer to use separate defect tracking systems for production and testing bugs. Rather than being a planned event, separate defect tracking systems often evolve over time, or are implemented when two separate systems are initially created for separate purposes. For example, the production defect tracking system may have originally been designed for tracking support issues, but later modified to also track incidents. If you do use separate systems, it's beneficial to ensure that field names are identical for production and test bugs. This allows the test manager to analyze trends and patterns in bugs from test to production. Unfortunately, many of you are working in organizations that use entirely different vocabularies and/or metrics for bugs discovered in test versus those found by the user.

  Key Point 

Part of the process of analyzing an incident report is to determine if the failure was caused by a new bug, or if the failure is just another example of the same old bug.

It's useful to analyze trends and patterns of the failures, and then trends and patterns of the defects. Generally, part of the process of analyzing an incident report is to determine if the failure was caused by a new bug, or if the failure is just another example of the same old bug. Obviously, the development manager is most concerned about the number of defects that need to be fixed, whereas the user may only be concerned with the number of failures encountered (even if every failure is caused by the same bug).



Testing Status and Results

One of the first issues that a test manager must deal with during test execution is finding a way to keep track of the testing status. How the testing status will be tracked should be explained in the master test plan (refer to Chapter 3 - Master Test Planning). Basically, testing status is reported against milestones completed; number, severity, and location of defects discovered; and coverage achieved. Some test managers may also report testing status based upon the stability, reliability, or usability of the system. For our purposes, however, we'll measure these "…ilities" based on their corresponding test cases (e.g., reliability test cases, usability test cases, etc.).

Measuring Testing Status

The testing status report (refer to Table 7-4) is often the primary formal communication channel that the test manager uses to inform the rest of the organization of the progress made by the testing team.

Table 7-4: Sample Test Status Report

Project: Online Trade Date: 01/05/02

Feature Tested

Total Tests

# Complete

% Complete

# Success

% Success (to date)

Open Account

46

46

100

41

89

Sell Order

36

25

69

25

100

Buy Order

19

17

89

12

71

Total

395

320

81

290

91

Notice that Table 7-4 shows, at a glance, how much of the test execution is done and how much remains unfinished. Even so, we must be careful in how we interpret the data in this chart and understand what it is that we're trying to measure. For example, Table 7-4 shows that the testing of this system is 81% complete. But is testing really 81% complete? It really shows that 81% of the test cases have been completed, not 81% of the testing. You should remember that all test cases are definitely not created equal. If you want to measure testing status against a timeline, you must weight the test cases based on how long they take to execute. Some test cases may take only a few minutes, while others could take hours.

  Key Point 

If you want to measure testing status against a time line, you have to weight the test cases based on how long they take to execute, but if you want to measure status against functionality, then the test cases must be weighted based on how much of the functionality they cover.

On the other hand, if you want to measure status against functionality (i.e., how much of the user's functionality has been tested?), then the test cases must be weighted based on how much of the functionality they cover. Some test cases may cover several important requirements or features, while others may cover fewer or less important features.

Test Summary Report

  Key Point 

The purpose of the Test Summary Report is to summarize the results of the testing activities and to provide an evaluation based on the results.

The purpose of the Test Summary Report is to summarize the results of the testing activities and to provide an evaluation based on these results. The summary report provides advice on the release readiness of the product and should document any known anomalies or shortcomings in the product. This report allows the test manager to summarize the testing and to identify limitations of the software and the failure likelihood. There should be a test summary report that corresponds to every test plan. So, for example, if you are working on a project that had a master test plan, an acceptance test plan, a system test plan, and a combined unit/integration test plan, each of these should have its own corresponding test summary report, as illustrated in Figure 7-4.

click to expand
Figure 7-4: There Should Be a Test Summary Report for Each Test Plan

In essence, the test summary report is an extension of the test plan and serves to "close the loop" on the plan.

One complaint that we often hear about the test summary report is that, since it occurs at the end of the test execution phase, it's on the critical path of the delivery of the software. This is true, but we would also like to add that completing the test summary report doesn't take a lot of time. In fact, most of the information contained within the report is information that the test manager should be collecting and analyzing constantly throughout the software development and testing lifecycles. You could consider using most of the information in the test summary report as a test status report. Just think of the test summary report as a test status report on the last day of the project.

  Key Point 

Think of the Test Summary Report as a test status report on the last day of the project.

Here's a tip that you may find useful - it works well for us. Once we begin the execution of the tests, we seldom have time to keep the test plan up-to-date. Instead of updating the plan, we keep track of the changes in the test summary report's Variances section, and after the software under test has moved on, we go back and update the plan.

The Test Summary Report shown in Figure 7-5 conforms to IEEE Std. 829-1998 for Software Test Documentation. Sections that are not part of the IEEE template are indicated in italics.

IEEE Std. 829-1998 for Software Test Documentation

Template for Test Summary Report

Contents

1.

Test Summary Report Identifier

2.

Summary

3.

Variances

4.

Comprehensive Assessment

5.

Summary of Results

 

5.1

Resolved Incidents

 

5.2

Unresolved Incidents

6.

Evaluation

7.

Recommendations

8.

Summary of Activities

9.

Approvals


Figure 7-5: Template for Test Summary Report from IEEE-829-1998

Test Summary Report Identifier

The Report Identifier is a unique number that identifies the report and is used to place the test summary report under configuration management.

Summary

This section summarizes what testing activities took place, including the versions/releases of the software, the environment and so forth. This section will normally supply references to the test plan, test-design specifications, test procedures, and test cases.

Variances

This section describes any variances between the testing that was planned and the testing that really occurred. This section is of particular importance to the test manager because it helps him or her see what changed and provides some insights into how to improve the test planning in the future.

Comprehensive Assessment

In this section, you should evaluate the comprehensiveness of the testing process against the criteria specified in the test plan. These criteria are based upon the inventory, requirements, design, code coverage, or some combination thereof. Features or groups of features that were not adequately covered need to be addressed here, including a discussion of any new risks. Any measures of test effectiveness that were used should be reported and explained in this section.

Summary of Results

Summarize the results of testing here. Identify all resolved incidents and summarize their resolution. Identify all unresolved incidents. This section will contain metrics about defects and their distribution (refer to the section on Pareto Analysis in this chapter).

Evaluation

Provide an overall evaluation of each test item, including its limitations. This evaluation should be based upon the test results and the item pass/fail criteria. Some limitations that might result could include statements such as "The system is incapable of supporting more than 100 users simultaneously" or "Performance slows to x if the throughput exceeds a certain limit." This section could also include a discussion of failure likelihood based upon the stability exhibited during testing, reliability modeling and/or an analysis of failures observed during testing.

Recommendations

We include a section called Recommendations because we feel that part of the test manager's job is to make recommendations based on what they discover during the course of testing. Some of our clients dislike the Recommendations section because they feel that the purpose of the testing effort is only to measure the quality of the software, and it's up to the business side of the company to act upon that information. Even though we recognize that the decision of what to do with the release ultimately resides with the business experts, we feel that the authors of the test summary report should share their insights with these decision makers.

Summary of Activities

Summarize the major testing activities and events. Summarize resource consumption data; for example, total staffing level, total machine time, and elapsed time used for each of the major testing activities. This section is important to the test manager, because the data recorded here is part of the information required for estimating future testing efforts.

Approvals

Specify the names and titles of all persons who must approve this report. Provide space for the signatures and dates. Ideally, we would like the approvers of this report to be the same people who approved the corresponding test plan, since the test summary report summarizes all of the activities outlined in the plan (if it's been a really bad project, they may not all still be around). By signing this document, the approvers are certifying that they concur with the results as stated in the report, and that the report, as written, represents a consensus of all of the approvers. If some of the reviewers have minor disagreements, they may be willing to sign the document anyway and note their discrepancies.



When Are We Done Testing?

So, how do we know when we're done testing? We'd like to think that this would have been spelled out in the exit criteria for each level. Meeting the exit criteria for the acceptance testing is normally the flag you're looking for, which indicates that testing is done and the product is ready to be shipped, installed, and used. We believe that Boris Beizer has nicely summed up the whole issue of when to stop testing:

  Key Point 

"Good is not good enough when better is expected."

- Thomas Fuller

"There is no single, valid, rational criterion for stopping. Furthermore, given any set of applicable criteria, how each is weighed depends very much upon the product, the environment, the culture and the attitude to risk."

At the 1999 Application Software Measurement (ASM) Conference, Bob Grady identified the following key points associated with releasing a product too early:

  • Many defects may be left in the product, including some "show-stoppers."
  • The product might be manageable with a small number of customers with set expectations.
  • A tense, reactive environment may make it difficult for team members to switch their focus to new product needs.
  • The tense environment could result in increased employee turnover.
  • Customers' frustration with the product will continue.

Grady also identified the following key points associated with releasing the product too late:

  • Team members and users are confident in the quality of the product.
  • Customer support needs are small and predictable.
  • The organization may experience some loss of revenue, long-term market share, and project cancellations, thus increasing the overall business risk.
  • The organization may gain a good reputation for quality, which could lead to capturing a greater market share in the long term.

When you consider the implications associated with releasing too early or too late, it's clear that important decisions (such as when to stop testing) should be based on more than one metric that way, one metric can validate the other metric. Some commonly used metrics are explained in the paragraphs that follow.

Defect Discovery Rate

Many organizations use the defect discovery rate as an important measure to assist them in predicting when a product will be ready to release. When the defect discovery rate drops below the specified level, it's often assumed (sometimes correctly) that the product is ready to be released. While a declining discovery rate is typically a good sign, one must remember that other forces (less effort, no new test cases, etc.) may cause the discovery rate to drop. This is why it is normally a good idea to base important decisions on more than one supporting metric.

Notice in Figure 7-6 that the number of new defects discovered per day is dropping and, if we assume that the effort is constant, the cost of discovering each defect is also rising.

click to expand
Figure 7-6: Defect Discovery Rate

At some point, therefore, the cost of continuing to test will exceed the value derived from the additional testing. Of course, all we can do is estimate when this will occur, since the nature (severity) of the undiscovered bugs is unknown. This can be offset somewhat if risk-based techniques are used. In fact, another useful metric to determine whether the system is ready to ship is the trend in the severity of bugs found in testing. If risk-based techniques are used, then we would expect not only the defect discovery rate to drop, but also the severity of the defects discovered. If this trend is not observed, then that's a sign that the system is not ready to ship.

Remaining Defects Estimation Criteria

One method of determining "when to ship" is to base the decision on an estimate of the number of defects expected. This can be accomplished by comparing the defect discovery trends with those from other, similar projects. This normally requires that an extensive amount of data has been collected and managed in the past, and will also require normalization of the data to take into account the differences in project scope, complexity, code quality, etc.

Running Out of Resources

It's true, running out of time or budget may be a valid reason for stopping. Many of us have certainly recommended the release of a product that we felt was fairly poor in quality because it was better than what the user currently had. Remember, we're not striving for perfection, only acceptable risk. Sometimes, the risk of not shipping (due to competition, failure of an existing system, etc.) may exceed the (business) risk of shipping a flawed product.

Case Study 7-5: In the world of software, timing can be everything.

Great Software, But Too Late

One of our clients several years ago made a PC-based tax package. They felt that they had created one of the best, easiest-to-use tax packages on the market. Unfortunately, by the time their product had met all of their quality goals, most people who would normally buy and use their product had already purchased a competitor's product. Our client was making a great product that was of very little use (since it was too late to market). The next year the company relaxed their quality goals (just a little), added resources on the front-end processes, and delivered a marketable product in a timely fashion.



Measuring Test Effectiveness

When we ask our students, "How many of you think that the time, effort, and money spent trying to achieve high-quality software in your organization is too much?" we get lots of laughs and a sprinkling of raised hands. When we ask the same question, but change the end of the sentence to "…too little?" almost everyone raises a hand. Changing the end of the sentence to "…about right?" gets a few hands (and smug looks). Generally, there may be one or two people who are undecided and don't raise their hands at all.

  Question #1 

Do you think that the time, effort, and money spent trying to achieve high-quality software in your organization is:

⃞  

too much?

⃞  

too little?

⃞  

not enough?

Next, we ask the same people, "How many of you have a way to measure test effectiveness?" and we get almost no response. If you don't have a way to measure test effectiveness, it's almost impossible to answer question #1 with anything other than, "It's a mystery to me." Knowing what metrics to use to measure test effectiveness and implementing them is one of the greatest challenges that a test manager faces.

  Question #2 

Do you have a way to measure test effectiveness?

We've discovered the following key points regarding measures of test effectiveness:

  • Many organizations don't consciously attempt to measure test effectiveness.
  • All measures of test effectiveness have deficiencies.
  • In spite of the weaknesses of currently used measures, it's still necessary to develop a set to use in your organization.
  Gilb's Law 

"Anything you need to quantify can be measured in some way that is superior to not measuring it at all."

- Tom Gilb

In this section, we'll analyze some of the problems with commonly used measures of test effectiveness, and conclude with some recommendations. In working with dozens of organizations, we've found that most attempts to measure test effectiveness fall into one of the three major categories illustrated in Figure 7-7.

click to expand
Figure 7-7: Categories of Metrics for Test Effectiveness

Customer Satisfaction Measures

Many companies use customer satisfaction measures to determine if their customers are happy with the software they have received. The most common customer satisfaction measures are usually gathered by analyzing calls to the help desk or by using surveys. Both of these measures have general deficiencies and other problems specific to testing. First, let's examine surveys.

Surveys

Surveys are difficult to create effectively. It's hard for most of us to know specifically what questions to ask and how to ask them. In fact, there's a whole discipline devoted to creating, using, and understanding surveys and their results. Case Studies 7-6 and 7-7 describe some of the pitfalls associated with designing and administering surveys.

Case Study 7-6: The Science of Survey Design

What Do You Mean, "I Need an Expert"?

When I was working on a large survey effort in the early 1990s, my colleague suggested that I should have a "survey expert" review my survey. I was a little miffed, since I had personally written the survey and I was sure that I knew how to write a few simple questions. Still, I found a "survey expert" at a local university and asked him to review my software survey. I was surprised when he had the nerve to say, "Come back tomorrow and I'll tell you how your respondents will reply to your survey." Sure enough, the next day he gave me a completed survey (and a bill for his services). I thought that he was pretty presumptuous since he was not a "software expert," but after administering the survey to a few of my colleagues, I was amazed that this professor had predicted almost exactly how they would respond!

How a respondent answers a survey is dependent on all kinds of issues like the order of the questions, use of action verbs, length of the survey, and length of the questions used. So, the moral of the story is this: If you want to do a survey, we recommend that you solicit help from an "expert."

— Rick Craig

Case Study 7-7: Personal Bias in Survey Design

The Waitress Knows Best

Another experience I had with a survey occurred several years ago at a restaurant that I own in Tampa called "Mad Dogs and Englishmen." My head waitress decided to create a customer satisfaction survey (on her own initiative). You have to love employees like that! The survey had two sections: one rated the quality of food as Outstanding, Excellent, Above Average, Average, and Below Average, and the other section rated service.

click to expand

The service scale included Outstanding, Excellent, and Above Average. I asked her about the missing Average and Below Average categories and she assured me that as long as she was in charge, no one would ever get average or below-average service! I realized that the survey designer's personal bias can (and will) significantly influence the survey results!

— Rick Craig

  Key Point 

Customer satisfaction surveys don't separate the effectiveness of the test from the quality of the software.

Customer satisfaction measures are probably useful for your company, and are of interest to the test manager, but don't, in themselves, solve the problem of how to measure test effectiveness.

All issues of construction aside, surveys have more specific problems when you try to use them to measure the effectiveness of testing. The biggest issue, of course, is that it's theoretically possible that the developers could create a really fine system that would please the customer even if the testing efforts were shoddy or even non-existent. Customer satisfaction surveys do not separate the quality of the development effort from the effectiveness of the testing effort. So, even though surveys may be useful for your organization, they don't give the test manager much of a measure of the effectiveness of the testing effort. On the other hand, if the surveys are all negative, that gives the test manager cause for concern.

Help Desk Calls

Another customer satisfaction measure that is sometime used is the number of calls to the help desk. This metric suffers from the same problem as a survey - it doesn't separate the quality of the software from the effectiveness of the testing effort. Each call must be analyzed in order to determine the root cause of the problem. Was there an insufficient amount of training? Are there too many features? Although most companies are immediately concerned when the help desk is swamped right after a new release, how would they feel if no one called? If nobody called, there would be no data to analyze and no way to discover (and resolve) problems. Even worse, though, maybe the application is so bad that nobody is even using it.

  Key Point 

Customer satisfaction measures are useful for your organization and are of interest to the test manager, but don't, in themselves, solve the problem of how to measure test effectiveness.

One final problem with the customer satisfaction measures that we've just discussed (i.e., surveys and help desk calls) is that they're both after the fact. That is, the measures are not available until after the product is sold, installed, and in use. Just because a metric is after the fact doesn't make it worthless, but it does lessen its value considerably.

Customer satisfaction measures are useful for your organization, and are of interest to the test manager, but don't, in themselves, solve the problem of how to measure test effectiveness.

Defect Measures

Another group of measures commonly used for test effectiveness are built around the analysis of defects.

Number of Defects Found in Testing

Some test managers attempt to use the number of defects found in testing as a measure of test effectiveness. The first problem with this, or any other measure of defects, is that all bugs are not created equal. It's necessary to "weight" bugs and/or use impact categories such as all "critical" bugs. Since most defects are recorded with a severity rating, this problem can normally be overcome.

  Key Point 

Another problem with defect counts as a measure of test effectiveness is that the number of bugs that originally existed significantly impacts the number of bugs discovered (i.e., the quality of the software).

Another problem with defect counts as a measure of test effectiveness is that the number of bugs that originally existed significantly impacts the number of bugs discovered (i.e., the quality of the software). Just as in the customer satisfaction measures, counting the number of bugs found in testing doesn't focus on just testing, but is affected by the initial quality of the product being tested.

Some organizations successfully use the number of defects found as a useful measure of test effectiveness. Typically, they have a set of test cases with known coverage (coverage is our next topic) and a track record of how many bugs to expect using various testing techniques. Then, they observe the trends in defect discovery versus previous testing efforts. The values have to be normalized based upon the degree of change of the system and/or the quantity and complexity of any new functionality introduced.

Figure 7-8 shows the defect discovery rates of two projects. If the projects are normalized based on size and complexity, one can assume that 'Project B' will contain a number of defects similar to 'Project A'. Consequently, the curves described by each of these projects should also be similar.

click to expand
Figure 7-8: Defect Discovery Rates for Projects A and B

If the testing effort on 'Project B' finds significantly fewer bugs, this might mean that the testing is less effective than it was on 'Project A'. Of course, the weakness in this metric is that the curve may be lower because there were fewer bugs to find! This is yet another reason why decisions shouldn't be based solely on a single metric.

Another, more sophisticated, example of measuring defects is shown in Table 7-5. Here, a prediction of the number of bugs that will be found at different stages in the software development lifecycle is made using metrics from previous releases or projects. Both of these models require consistent testing practices, covering test sets, good defect recording, and analysis.

Table 7-5: Bug Budget Example
 

Total # Predicted

Predicted (P) Versus Actual (A)

P

A

P

A

P

A

P

A

P

A

Jan

Feb

Mar

Apr

May

Requirements Review

20

20

14

               

Design Review

35

5

0

15

 

15

         

Test Design

65

   

25

 

30

 

10

     

Code Inspections

120

           

60

 

60

 
                       

Unit Test

80

                   

System Test

40

                   

Regression Test

10

                   

Acceptance Test

5

                   

6 Months Production

15

                   

Totals

390

25

14

40

 

45

70

   

60

 

Production Defects

A more common measure of test effectiveness is the number of defects found in production or by the customer. This is an interesting measure, since the bugs found by the customer are obviously ones that were not found by the tester (or at least were not fixed prior to release). Unfortunately, some of our old problems such as latent and masked defects may have appeared.

Another issue in using production defects as a measure of test effectiveness is that it's another "after the fact" measure and is affected by the quality of the software. We must measure severity, distribution, and trends from release to release.

Defect Removal Efficiency (DRE)

A more powerful metric for test effectiveness (and the one that we recommend) can be created using both of the defect metrics discussed above: defects found during testing and defects found during production. What we really want to know is, "How many bugs did we find out of the set of bugs that we could have found?" This measure is called Defect Removal Efficiency (DRE) and is defined in Figure 7-9.

click to expand
Figure 7-9: Formula for Defect Removal Efficiency (DRE)

  Note 

Dorothy Graham calls DRE Defect Detection Percentage (DDP). We actually prefer her naming convention because it's more descriptive of the metric. That is, Defect Removal Efficiency does not really measure the removal of defects, only their detection.

The number of bugs not found is usually equivalent to the number of bugs found by the customers (though the customers may not find all of the bugs either). Therefore, the denominator becomes the number of bugs that could have been found. DRE is an excellent measure of test effectiveness, but there are many issues that you must be aware of in order to use it successfully:

  • The severity and distribution of the bugs must be taken into account. (Some organizations treat all defects the same, i.e., no severity is used, based on the philosophy that the ratio of severity classes is more or less constant).
  • How do you know when the customer has found all of the bugs? Normally, you would need to look at the trends of defect reporting by your customers on previous projects or releases to determine how long it takes before the customer has found "most" of the bugs. If they are still finding a bug here and there one year later, it probably won't significantly affect your metrics. In some applications, especially those with many users, most of the bugs may be reported within a few days. Other systems with fewer users may have to go a few months to have some assurance that most of the bugs have been reported.

      Key Point 

    In his book A Manager's Guide to Software Engineering, Roger Pressman calls DRE "the one metric that we can use to get a 'bottom-line' of improving quality."

  • It's after the fact (refer to the bullet item above). Metrics that are "after the fact" do not help measure the test effectiveness of the current project, but they do let test managers measure the long-term trends in the test effectiveness of their organizations.

      Key Point 

    Metrics that are "after the fact" do not help measure the test effectiveness of the current project, but they do let test managers measure the long-term trends in the test effectiveness of their organizations.

  • When do we start counting bugs (e.g., during unit, integration, system, or acceptance testing? during inspections? during ad hoc testing?), and what constitutes a bug-finding activity? It's important to be consistent. For example, if you count bugs found in code inspections, you must always count bugs found in code inspections.
  • Some bugs cannot be found in testing! Due to the limitations of the test environment, it's possible, and even likely, that there is a set of bugs that the tester could not find no matter what he or she does. The test manager must decide whether or not to factor these bugs out. If your goal is to measure the effectiveness of the testing effort without considering what you have to work with (i.e., the environment), the bugs should be factored out. If your goal is to measure the effectiveness of the testing effort including the environment (our choice), the bugs should be left in. After all, part of the job of the tester is to ensure that the most realistic test environment possible is created.

DRE Example

Figure 7-10 is an example of a DRE calculation. Horizontal and upward vertical arrows represent bugs that are passed from one phase to the next.

click to expand
Figure 7-10: DRE Example

  Key Point 

In their book Risk Management for Software Projects, Alex Down, Michael Coleman, and Peter Absolon report that the DRE for the systems they are familiar with is usually in the range of 65–70%.

Defect Removal Efficiency (DRE) is also sometimes used as a way to measure the effectiveness of a particular level of test. For example, the system test manager may want to know what the DRE is for their system testing. The number of bugs found in system testing should be placed in the numerator, while those same bugs plus the acceptance test and production bugs should be used in the denominator, as illustrated in the example below.

System Test DRE Example

Figure 7-11 is an example of a system test DRE calculation. Horizontal and upward vertical arrows represent bugs that are passed from one phase to the next.

click to expand

click to expand
Figure 7-11: System Test DRE Example

Unit Testing DRE

When measuring the DRE of unit testing, it will be necessary to factor out those bugs that could not be found due to the nature of the unit test environment. This may seem contradictory to what we previously recommended, but we don't want the developer to have to create a "system" test environment and, therefore, there will always be bugs that cannot be found in unit testing (e.g., bugs related to the passing of data from one unit to another).

Defect Age

Another useful measure of test effectiveness is defect age, often called Phase Age or PhAge. Most of us realize that the later we discover a bug, the greater harm it does and the more it costs to fix. Therefore, an effective testing effort would tend to find bugs earlier than a less effective testing effort would.

  Key Point 

The later we discover a bug, the greater harm it does and the more it costs to fix.

Table 7-6 shows a scale for measuring defect age. Notice that this scale may have to be modified to reflect the phases in your own software development lifecycle and the number and names of your test levels. For example, a requirement defect discovered during a high-level design review would be assigned a PhAge of 1. If this defect had not been found until the Pilot, it would have been assigned a PhAge of 8.

Table 7-6: Scale for Defect Age on Project X

Phase Created

Phase Discovered

 

Requirements

High-Level Design

Detailed Design

Coding

Unit Testing

Integration Testing

System Testing

Acceptance Testing

Pilot

Production

Requirements

0

1

2

3

4

5

6

7

8

9


 

High-Level Design

 

0

1

2

3

4

5

6

7

8

 

Detailed Design

   

0

1

2

3

4

5

6

7

 

Coding

     

0

1

2

3

4

5

6

 

Summary

                     

Table 7-7 shows an example of the distribution of defects on one project by phase created and phase discovered. In this example, there were 8 requirements defects found in high-level design, 4 during detailed design, 1 in coding, 5 in system testing, 6 in acceptance testing, 2 in pilot, and 1 in production. If you've never analyzed bugs to determine when they were introduced, you may be surprised how difficult a job this is.

Table 7-7: Defect Creation versus Discovery on Project X

Phase Created

Phase Discovered

Total Defects

Requirements

High-Level Design

Detailed Design

Coding

Unit Testing

Integration Testing

System Testing

Acceptance Testing

Pilot

Production

Requirements

0

8

4

1

0

0

5

6

2

1

27

High-Level Design

 

0

9

3

0

1

3

1

2

1

20

Detailed Design

   

0

15

3

4

0

0

1

8

31

Coding

     

0

62

16

6

2

3

20

109

Summary

0

8

13

19

65

21

14

9

8

30

187

Defect Spoilage

Spoilage is a metric that uses the Phase Age and distribution of defects to measure the effectiveness of defect removal activities. Other authors use slightly different definitions of spoilage. Tom DeMarco, for example, defines spoilage as "the cost of human failure in the development process," in his book Controlling Software Projects: Management, Measurement, and Estimates. In their book Software Metrics: Establishing a Company-Wide Program, Robert Grady and Deborah Caswell explain that Hitachi uses the word spoilage to mean "the cost to fix post-release problems." Regardless of which definition of spoilage you prefer, you should not confuse spoilage with Defect Removal Efficiency (DRE), which measures the number of bugs that were found out of the set of bugs that could have been found.

  Key Point 

Spoilage is a metric that uses Phase Age and distribution of defects to measure the effectiveness of defect removal activities.

Table 7-8 shows the defect spoilage values for a particular project, based on the number of defects found weighted by defect age. During acceptance testing, for example, 9 defects were discovered. Of these 9 defects, 6 were attributed to defects created during the requirements phase of this project. Since the defects that were found during acceptance testing could have been found in any of the seven previous phases, the requirements defects that remained hidden until the acceptance testing phase were given a weighting of 7. The weighted number of requirements defects found during acceptance testing is 42 (i.e., 6 x 7 = 42).

Table 7-8: Number of Defects Weighted by Defect Age on Project X
 

Phase Discovered

Spoilage = Weight/Total Defects

Phase Created

Requirements

High-Level Design

Detailed Design

Coding

Unit Testing

Integration Testing

System Testing

Acceptance Testing

Pilot

Production

Requirements

0

8

8

3

0

0

30

42

16

9

116 / 27 = 4.3

High-Level Design

 

0

9

6

0

4

15

6

14

8

62 / 20 = 2.1

Detailed Design

   

0

15

6

12

0

0

6

42

81 / 31 = 2.6

Coding

     

0

62

32

18

8

15

120

255 / 109 = 2.3

Summary

                   

514 / 187 = 2.7

The Defect Spoilage is calculated using the formula shown in Figure 7-12.

click to expand
Figure 7-12: Formula for Defect Spoilage

Generally speaking, lower values for spoilage indicate more effective defect discovery processes (the optimal value is 1). As an absolute value, the spoilage has little meaning. However, it becomes valuable when used to measure a long-term trend of test effectiveness.

Defect Density and Pareto Analysis

Defect Density is calculated using the formula shown in Figure 7-13.

click to expand
Figure 7-13: Formula for Defect Density

  Key Point 

J.M. Duran admonished us to concentrate on the vital few, not the trivial many. Later, Thomas J. McCabe extended the Pareto Principle to software quality activities.

To learn more about the history of the Pareto Principle and see some actual examples, read The Pareto Principle Applied to Software Quality Assurance by Thomas J. McCabe and G. Gordon Schulmeyer in the Handbook of Software Quality Assurance.

Unfortunately, defect density measures are far from perfect. The two main problems are in determining what is a defect and what is a line of code. By asking, "What is a defect?" we mean "What do we count as a defect?"

  • Are minor defects treated the same as critical defects, or do we need to weight them?
  • Do we count unit testing bugs or only bugs found later?
  • Do we count bugs found during inspection? During ad hoc testing?

Similarly, measuring the size (i.e., lines of code or function points) of the module is also a problem, because the number of lines of code can vary based on the skill level of the programmer and the language that was used.

Figure 7-14 shows the defect density per 1,000 lines of code in various modules of a sample project. Notice that Module D has a high concentration of bugs. Experience has shown that parts of a system where large quantities of bugs have been discovered will continue to have large numbers of bugs even after the initial cycle of testing and correcting of bugs. This information can help the tester focus on problematic (i.e., error prone) parts of the system. Similarly, instead of plotting defect density on the histogram as in Figure 7-14, the causes of the defects could be displayed in descending order of frequency (e.g., functionality, usability, etc.). This type of analysis is known as Pareto Analysis and can be used to target areas for process improvement.

click to expand
Figure 7-14: Defect Density in Various Modules

The bottom line is that defect measures can and should be used to measure the effectiveness of testing, but by themselves, they're inadequate and need to be supplemented by coverage metrics.

Coverage Measures

Coverage metrics are probably the most powerful of all measures of test effectiveness, since they are not necessarily "after the fact" and are not affected by the quality of the software under test. High-level coverage metrics such as requirements and/or inventory coverage can be done as soon as the test cases are defined. In other words, you can measure the coverage of the test cases created before the code is even written!

  Key Point 

Requirements coverage can be measured before the code is even written.

Coverage can be used to measure the completeness of the test set (i.e., the test created) or of the tests that are actually executed. We can use coverage as a measure of test effectiveness because we subscribe to the philosophy that a good test case is one that finds a bug or demonstrates that a particular function works correctly. Some authors state that the only good test case is the one that finds a bug. We contend that if you subscribe to that philosophy, coverage metrics are not useful as a measurement of test effectiveness. (We reckon that if you knew in advance where the bugs were, you could concentrate on only writing test cases that found bugs - or, better yet, just fix them and don't test at all.)

Requirements and Design Coverage

How to measure requirements, inventory, design, and code coverage was discussed in Chapter 5 - Analysis and Design. At the very least, every testing methodology that we are familiar with subscribes to requirements coverage. Unfortunately, it's possible to "test" every requirement and still not have tested every important condition. There may be design issues that are impossible to find during the course of normal requirements-based testing, which is why it's important for most testing groups to also measure design coverage. Table 7-9 shows a matrix that combines requirements and design coverage.

Table 7-9: Requirements and Design Coverage

Attribute

TC #1

TC # 2

TC #3

TC #4

TC #5

Requirement 1

ü

ü

 

ü

ü

Requirement 2

 

ü

   

ü

Requirement 3

   

ü

 

ü

Requirement 4

ü

ü

 

ü

ü

Design 1

     

ü

ü

Design 2

 

ü

 

ü

ü

Design 3

       

ü

It's quite clear, however, that if requirements coverage is not achieved, there will be parts of the system (possibly very important parts) that are not tested!

Code Coverage

Many testing experts believe that one of the most important things a test group can do is to measure code coverage. These people may be touting code coverage as a new silver bullet, but actually, code coverage tools have been in use for at least as long as Rick has been a test manager (20 years). The tools in use today, however, are much more user-friendly than earlier tools. Table 7-10 is a conceptual model of the output of a code coverage tool. These tools can measure statement, branch, or path coverage.

Table 7-10: Code Coverage

Statement

Test Run

Covered?

TR #1

TR# 2

TR #3

A

ü

ü

ü

Yes

B

ü

 

ü

Yes

C

ü

   

Yes

D

     

No

E

   

ü

Yes

Total

60%

20%

60%

80%

The reports are clearer and easier to interpret, but the basic information is almost the same. Code coverage tools tell the developer or tester which statements, paths, or branches have and have not been exercised by the test cases. This is obviously a good thing to do, since any untested code is a potential liability.

Code Coverage Weaknesses

Just because all of the code has been executed does not, in any way, assure the developer or tester that the code does what it's supposed to do. That is, ensuring that all of the code has been exercised under test does not guarantee that it does what the customers, requirements, and design need it to do.

Figure 7-15 shows a fairly typical progression in software development. The users' needs are recorded as requirements specifications, which in turn are used to create the design, and from there, the code is written. All of the artifacts that are derived from the users' needs can and should be tested. If test cases are created from the code itself, the most you can expect to prove is that the code "does what it does" (i.e., it functions, but not necessarily correctly).

click to expand
Figure 7-15: Test Verification

By testing the design, you can show that the system matches the design and the code matches the design, but you can't prove that the design meets the requirements. Test cases based upon the requirements can demonstrate that the requirements have been met and the design matches the requirements. All of these verification steps should be done and, ultimately, it's important to create test cases based upon the code, the design, and the requirements. Just because a code coverage tool is used, doesn't mean that the test cases must be derived solely from the code.

  Key Point 

Just because a code coverage tool is used, doesn't mean that the test cases must be derived solely from the code.

In some organizations, code coverage becomes the ultimate metric. Some organizations may be struggling to move from say 85% to 90% coverage regardless of the cost. While that may be good, we think it's important to ensure that we have tested the "right" 90%. That is to say, even using code coverage metrics requires that we use some kind of risk-based approach. This is true even if a goal of 100% code coverage is established, because it's advantageous to test (and fix problems) in the critical components first.

Another issue when using code coverage tools is that the test cases may have to be executed one additional time, since most code coverage tools require the source code to be instrumented. Therefore, if a defect is discovered while using a code coverage tool, the test will have to be run again without the tool in order to ensure that the defect is not in the instrumentation itself.

Generally, we have found that code coverage is most effective when used at lower levels of test (e.g., unit and integration). These levels of test are normally conducted by the developer, which is good because the analysis of the parts of the code that were not tested is best done by the people with intimate knowledge of the code. Even when using a code coverage tool, unit test cases should first be designed to cover all of the attributes of the program specification before designing test cases based on the code itself.

  Key Point 

Even when using a code coverage tool, unit test cases should first be designed to cover all of the attributes of the program specification before designing test cases based on the code itself.

As with most projects, you've got to crawl before you run. Trying to get a high level of code coverage is fine in organizations that have already established a fairly comprehensive set of test cases. Unfortunately, many of the companies that we have visited have not even established a test set that is comprehensive enough to cover all of the features of the system. In some cases, there are entire modules, programs, or subsystems that are not tested. It's fine to say that every line of code should be tested, but if there are entire subsystems that have no test cases, it's a moot point.

Code Coverage Strengths

Code coverage measures put the focus on lower levels of tests and provide a way to determine if and how well they've been done. Many companies do little unit testing, and code coverage tools can help highlight that fact. Global coverage can also be used as a measure of test effectiveness.

Global Code Coverage

The global coverage number can be used as a measure of test effectiveness, and it can highlight areas that are deficient in testing. This measure of test effectiveness, like all of the others we've discussed, has some problems. While it is useful to see that the coverage has increased from 50% to 90%, it's not as clear how valuable it is to go from 50% to 51%. In other words, if risk-based techniques that ensure that the riskiest components are tested first are used, then every increase has a more or less "known" value. If tests are not based on risk, it's harder to know what an increase in coverage means. It's also interesting to note that it costs more to go from, for example, 95% to 100% coverage than it did to go from 50% to 55% coverage. If risk-based techniques were used, the last gain might not "buy" us as much, since the last 5% tested is also (we hope) the least risky. Perhaps we're just trying to weigh the value of these added tests against the resources required to create and run them. At the end of the day, though, that's what testing is all about!

  Key Point 

Global code coverage is the percent of code executed during the testing of an entire application.







Systematic Software Testing
Systematic Software Testing (Artech House Computer Library)
ISBN: 1580535089
EAN: 2147483647
Year: 2002
Pages: 114
Simiral book on Amazon

Flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net