Generic Software Quality Measures


Metrics Methodology

In 1993 the IEEE published a standard for software quality metrics methodology that has since defined and led development in the field. Here we begin by summarizing this standard. It was intended as a more systematic approach for establishing quality requirements and identifying, implementing, analyzing, and validating software quality metrics for software system development. It spans the development cycle in five steps, as shown in Table 3.2.

Table 3.2. IEEE Software Quality Metrics Methodology


A typical "catalog" of metrics in current use will be discussed later. At this point we merely want to present a gestalt for the IEEE recommended methodology. In the first step it is important to establish direct metrics with values as numerical targets to be met in the final product. The factors to be measured may vary from product to product, but it is critical to rank the factors by priority and assign a direct metric value as a quantitative requirement for that factor. There is no mystery at this point, because Voice of the Customer (VOC) and Quality Function Deployment (QFD) are the means available not only to determine the metrics and their target values, but also to prioritize them.

The second step is to identify the software quality metrics by decomposing each factor into subfactors and those further into the metrics. For example, a direct final metric for the factor reliability could be faults per 1,000 lines of code (KLOC), with a target value of, say, one fault per 1,000 lines of code (LOC). (This level of quality is just 4.59 Sigma; Six Sigma quality would be 3.4 faults per 1,000 KLOC, or one million lines of code.) For each validated metric at the metric level, a value should be assigned that will be achieved during development. Table 3.3 gives the IEEE's suggested paradigm for a description of the metrics set.[6]
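As a quick check on the Sigma figures quoted above, the short sketch below converts a defect rate into a Sigma level using the conventional 1.5-sigma shift. It is only an illustration: the function name and the treatment of each line of code as one defect opportunity are our assumptions, not part of the IEEE standard.

```python
# Minimal sketch (ours, not from the IEEE standard): convert a defect rate
# into a Sigma level, using the conventional 1.5-sigma shift and treating
# each line of code as one defect opportunity.
from statistics import NormalDist

def sigma_level(defects: float, opportunities: float) -> float:
    dpmo = defects / opportunities * 1_000_000      # defects per million opportunities
    yield_fraction = 1.0 - dpmo / 1_000_000         # defect-free fraction
    return NormalDist().inv_cdf(yield_fraction) + 1.5

print(round(sigma_level(1, 1_000), 2))        # one fault per KLOC -> about 4.59
print(round(sigma_level(3.4, 1_000_000), 2))  # 3.4 faults per million LOC -> about 6.0
```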

Table 3.3. IEEE Metric Set Description Paradigm[7]

Name: Name of the metric
Metric: Mathematical function to compute the metric
Cost: Cost of using the metric
Benefit: Benefit of using the metric
Impact: Can the metric be used to alter or stop the project?
Target value: Numerical value to be achieved to meet the requirement
Factors: Factors related to the metric
Tools: Tools to gather data, calculate the metric, and analyze the results
Application: How the metric is to be used
Data items: Input values needed to compute the metric
Computation: Steps involved in the computation
Interpretation: How to interpret the results of the computation
Considerations: Metric assumptions and appropriateness
Training: Training required to apply the metric
Example: An example of applying the metric
History: Projects that have used this metric and its validation history
References: List of projects used, project details, and so on


To implement the metrics in the metric set chosen for the project under design, the data to be collected must be determined, and assumptions about the flow of data must be clarified. Any tools to be employed are defined, and any organizations to be involved are described, as is any necessary training. It is also wise at this point to test the metrics on some known software to refine their use, sensitivity, accuracy, and the cost of employing them.

Analyzing the metrics can help you identify any components of the developing system that appear to have unacceptable quality or that present development bottlenecks. Any components whose measured values deviate from their target values are noncompliant.
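The compliance check described here is simple to automate. The sketch below is purely illustrative: the metric names, target values, and the assumption that lower values are better are ours, not part of the IEEE methodology.

```python
# Illustrative sketch: flag components whose measured values miss their targets.
# Metric names and numbers are hypothetical; lower values are assumed better.
targets = {"faults_per_kloc": 1.0, "mean_fix_days": 5.0}

measured = {
    "billing_module":   {"faults_per_kloc": 0.8, "mean_fix_days": 4.0},
    "reporting_module": {"faults_per_kloc": 2.3, "mean_fix_days": 6.5},
}

for component, values in measured.items():
    misses = {m: v for m, v in values.items() if v > targets[m]}
    print(component, "noncompliant" if misses else "compliant", misses)
```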

Validation of the metrics is a continuous process spanning multiple projects. If the metrics employed are to be useful, they must accurately indicate whether quality requirements have been achieved or are likely to be achieved during development. Furthermore, a metric must be revalidated every time it is used. Confidence in a metric will improve over time as further usage experience is gained.

In-Process Quality Metrics for Software Testing

Until recently, most software quality metrics in many development organizations were of an in-process nature. That is, they were designed to track defect occurrences during formal machine testing. These are listed here only briefly, both because of their historical importance and because we will be replacing them with upstream quality measures and metrics.

Defect rate during formal system testing is usually highly correlated with the future defect rate in the field, because higher-than-expected testing defect rates usually indicate high software complexity or special development problems. Although it may seem counterintuitive, experience shows that higher defect rates in testing indicate higher defect rates later in use. If higher-than-expected defect rates appear in testing, the development manager has a set of "damage control" scenarios he or she may apply to correct the quality problem before it becomes a problem in the field.

Overall defect density during testing is only a gross indicator; the pattern of defect arrival or "mean time between defects" is a more sensitive metric. Naturally the development organization cannot fix all of the problems arriving today or this week, so a tertiary measure of defect backlog becomes important. If the defect backlog is large at the end of a development cycle, a lot of prerelease fixes have to be integrated into the system. This metric thus becomes not only a quality and workload statement but also a predictor of problems in the very near future.
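The sketch below shows how defect arrivals, closures, backlog, and mean time between defects fit together over a test period; the weekly figures and the single-shift assumption are invented for illustration.

```python
# Illustrative bookkeeping for in-process test metrics (all figures invented).
arrivals = [12, 18, 25, 30, 22, 15]   # defects reported per week of system test
closures = [10, 14, 20, 26, 24, 20]   # defects fixed and verified per week

backlog = 0
for week, (found, closed) in enumerate(zip(arrivals, closures), start=1):
    backlog += found - closed
    print(f"week {week}: backlog = {backlog}")

test_hours = len(arrivals) * 40       # assume one 40-hour test shift per week
mtbd = test_hours / sum(arrivals)     # mean time between defects, in hours
print(f"mean time between defects: {mtbd:.1f} hours")
```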

Phase-based defect removal uses the defect removal effectiveness (DRE) metric. This is simply the defects removed during a development phase divided by the defects latent in the product at that phase, times 100 to express the result as a percentage. Because the number of latent defects is not yet known, it is estimated as the defects removed during the phase plus the defects found later. This metric is best used before code integration (already well downstream in the development process, unfortunately) and for each succeeding phase. This simple metric has become a widely used tool for large application development. However, its ad hoc, downstream nature leads naturally to the next and most important set of in-process metrics once we include in-service defect fixes and maintenance in the development process.
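A sketch of the DRE calculation under the estimating convention just described (defects removed in a phase divided by removed-plus-escaped, times 100). The phase names and defect counts are hypothetical.

```python
# Hypothetical sketch of phase-based defect removal effectiveness (DRE):
# DRE = 100 * removed in phase / (removed in phase + found later).
removed     = {"design review": 40, "code inspection": 70, "unit test": 55, "system test": 30}
found_later = {"design review": 60, "code inspection": 35, "unit test": 25, "system test": 8}

for phase in removed:
    latent = removed[phase] + found_later[phase]   # estimate of defects present at that phase
    print(f"{phase}: DRE = {100.0 * removed[phase] / latent:.0f}%")
```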

Four new metrics can be introduced to measure quality after the software has been delivered:

  • Fix backlog and backlog management index (BMI)

  • Fix response time and fix responsiveness

  • Percentage of delinquent fixes

  • Fix quality (that is, did it really get fixed?)

These metrics are not rocket science. The monthly BMI, as a percentage, is simply 100 times the number of problems closed during the month divided by the number of problem arrivals during the month; a BMI above 100 means the backlog is shrinking. Fix responsiveness is the mean time of all problems from their arrival to their close. If for a given problem the turnaround time exceeds the required or standard response time, it is declared delinquent. The percentage of delinquent fixes is 100 times the number of fixes that missed the required response time divided by the total number of fixes delivered in the period. Fix quality traditionally is measured negatively, as lack of quality: the number of defective fixes, that is, fixes that did not work properly in all situations or, worse, that caused yet other problems. The real quality goal is, of course, zero defective fixes.
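A minimal sketch of the four fix-oriented measures as just defined; the monthly counts and turnaround times are invented.

```python
# Illustrative sketch of the four fix-oriented metrics (all figures invented).
problem_arrivals = 120
problems_closed = 108
bmi = 100.0 * problems_closed / problem_arrivals        # backlog management index

turnaround_days = [2, 5, 1, 9, 4, 12, 3]                # arrival-to-close time per fix
standard_days = 7                                       # required response time
responsiveness = sum(turnaround_days) / len(turnaround_days)
delinquent = sum(1 for d in turnaround_days if d > standard_days)
pct_delinquent = 100.0 * delinquent / len(turnaround_days)

defective_fixes = 1                                     # fixes that failed or broke something else
pct_defective = 100.0 * defective_fixes / len(turnaround_days)

print(f"BMI = {bmi:.0f}%  responsiveness = {responsiveness:.1f} days  "
      f"delinquent = {pct_delinquent:.0f}%  defective fixes = {pct_defective:.0f}%")
```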

Software Complexity Metrics

Chapter 1 pointed out that computer systems are more complex than other large-scale engineered systems and that it is the software that makes them more complex. A number of approaches have been taken to calculate, or at least estimate, the degree of complexity. The simplest basis is LOC, a count of the executable statements in a computer program. This metric began in the days of assembly language programming, and it is still used today for programs written in high-level programming languages. Most procedural third-generation memory-to-memory languages such as FORTRAN, COBOL, and ALGOL typically produce six executable machine-language (ML) statements per high-level language statement. Register-to-register languages such as C, C++, and Java produce about three.

Recent studies show a curvilinear relationship between defect rate and executable LOC. Defect density, or defects per KLOC, appears to decrease with program size and then increase again as program modules become very large (see Figure 3.1). Curiously, this result suggests that there may be an optimum program size leading to a lowest defect rate, depending, of course, on programming language, project size, product type, and computing environment.[8] Experience seems to indicate that small programs have a defect rate of about 1.5 defects per KLOC. Programs larger than 1,000 lines of code have a similar defect rate. Programs of about 500 LOC have defect rates near 0.5.

This is almost certainly an effect of complexity, because small, "tight" programs are usually intrinsically complex. Programs larger than 1,000 LOC exhibit the complexity of size because they have so many pieces. This situation will improve with Object-Oriented Programming, for which there is greater latitude, or at least greater convenience of choice, in component size than for procedural programming. As you might expect, interface coding, although the most defect-prone of all programming, has defect rates that are constant with program size.

Figure 3.1. The Relationship Between LOC and Defect Density
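If module-level fault counts are available, the curvilinear pattern of Figure 3.1 is easy to look for in your own data. The sketch below groups hypothetical modules into size bands and computes defects per KLOC for each band; the module data are invented.

```python
# Hypothetical sketch: defect density (defects per KLOC) by module size band.
modules = [  # (executable LOC, defects found) - invented figures
    (120, 1), (250, 1), (480, 0), (520, 1), (900, 1), (1500, 3), (2800, 6),
]

bands = {"under 500 LOC": [], "500-1000 LOC": [], "over 1000 LOC": []}
for loc, defects in modules:
    key = ("under 500 LOC" if loc < 500 else
           "500-1000 LOC" if loc <= 1000 else "over 1000 LOC")
    bands[key].append((loc, defects))

for band, items in bands.items():
    kloc = sum(loc for loc, _ in items) / 1000.0
    total_defects = sum(d for _, d in items)
    print(f"{band}: {total_defects / kloc:.2f} defects per KLOC")
```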


Software Science

In 1977 Professor Maurice H. Halstead distinguished software science from computer science by describing programming as a process of collecting and arranging software tokens, which are either operands or operators. His measures are as follows:

n1 = the number of distinct operators in a program

n2 = the number of distinct operands in a program

N1 = the number of operator occurrences

N2 = the number of operand occurrences

He then based a set of derivative measures on these primitive measures to express the total token vocabulary, overall program length, potential minimum volume for a programmed algorithm, actual program volume in bits, program level as a complexity metric, and program difficulty, among others. For example:

Vocabulary (n): n = n1 + n2

Length (N): N = N1 + N2 ≈ n1 log2(n1) + n2 log2(n2)

Volume (V): V = N log2(n) = N log2(n1 + n2)

Level (L): L = V*/V = (2/n1) x (n2/N2)

Difficulty (D): D = V/V* = (n1/2) x (N2/n2)

Effort (E): E = V/L

Faults (B): B = V/S*

V* is the minimum volume, represented by a built-in function that could perform the task of the entire program. S* is the mean number of mental discriminations or decisions between errors, a value estimated by Halstead as 3,000.
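The derived measures follow mechanically from the four primitive counts. The sketch below implements the formulas above; the token counts in the example are invented, and when the minimum volume V* is unknown, the program level is taken from Halstead's estimator (2/n1)(n2/N2).

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int, v_star=None, s_star: float = 3000.0) -> dict:
    """Halstead's derived measures from the four primitive token counts."""
    n = n1 + n2                                        # vocabulary
    N = N1 + N2                                        # observed length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)    # estimated length
    V = N * math.log2(n)                               # volume in bits
    # Program level: V*/V when a minimum volume V* is known, else Halstead's estimator.
    L = v_star / V if v_star is not None else (2.0 / n1) * (n2 / N2)
    D = 1.0 / L                                        # difficulty
    E = V / L                                          # effort
    B = V / s_star                                     # predicted faults
    return {"vocabulary": n, "length": N, "estimated_length": round(N_hat, 1),
            "volume": round(V, 1), "level": round(L, 3), "difficulty": round(D, 1),
            "effort": round(E, 1), "faults": round(B, 2)}

# Invented token counts for a small routine:
print(halstead(n1=12, n2=18, N1=75, N2=60))
```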

When these metrics were first announced, some software development old-timers thought that Halstead had violated Aristotle's first law of scientific inquiry: "Don't employ more rigor than the subject matter can bear." But none could gainsay the accuracy of his predictions or the quality of his results. In fact, the latter established software metrics as an issue of importance for computer scientists and established Professor Halstead as the founder of this field of inquiry. The major criticism of his approach is that his most accurate metric, program length, depends on N1 and N2, which are not known with sufficient accuracy until the program is almost done. Halstead's formulas therefore fall short as direct quantitative measures because they cannot predict program size and quality sufficiently upstream in the development process. Also, his choice of S* as a constant presumes an unknown model of human memory and cognition; because S* does not depend on program volume, the predicted fault rate depends only on program size, which later experience has not supported. The results shown in Figure 3.1 indicate that defect density is not constant with program size but rather appears to reach a minimum for programs of about 500 LOC. Perhaps Halstead's elaborate quantifications do not fully represent his incredible intuition, which was gained from long experience, after all.

Cyclomatic Complexity

About the same time Halstead founded software science, McCabe proposed cyclomatic complexity, a topological or graph-theoretic measure of the number of linearly independent paths through a computer program. To compute the cyclomatic complexity of a program that has been graphed or flowcharted, the formula is

M = V(G) = e - n + 2p

in which

V(G) = the cyclomatic number of the graph G of the program

e = the number of edges in the graph

n = the number of nodes in the graph

p = the number of unconnected parts of the graph


More simply, it turns out that M is equal to the number of binary decisions in the program plus 1. An n-way case statement would be counted as n - 1 binary decisions. The advantage of this measure is that it is additive for program components or modules. Common practice recommends that no single module have a value of M greater than 10. However, because on average every fifth or sixth program instruction executed is a branch, M correlates strongly with program size or LOC. Like the other early quality measures that focus on programs per se, or even on their modules, it masks the true source of architectural complexity: the interconnections between modules. Later researchers have proposed structure metrics to compensate for this deficiency by quantifying program module interactions. For example, fan-in and fan-out metrics, which are analogous to the number of inputs to and outputs from hardware circuit modules, are an attempt to fill this gap. Similar metrics include the number of subroutine calls and/or macro inclusions per module, and the number of design changes to a module, among others. Kan reports extensive experimental testing of these metrics and finds that, other than module length, the most important predictors of defect rates are the number of design changes and the complexity level, however it is computed.[9]
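Both forms of the calculation are easily expressed in code. The sketch below computes M from a control-flow graph and from a count of binary decisions; the example graph is made up, and the two forms agree, as the text indicates.

```python
# Illustrative sketch of McCabe's cyclomatic complexity, computed two ways.

def cyclomatic_from_graph(edges: int, nodes: int, parts: int = 1) -> int:
    """M = V(G) = e - n + 2p for a control-flow graph with p connected parts."""
    return edges - nodes + 2 * parts

def cyclomatic_from_decisions(binary_decisions: int) -> int:
    """M = number of binary decisions + 1 (an n-way case counts as n - 1)."""
    return binary_decisions + 1

# A made-up module: 9 edges, 7 nodes, one connected graph, three if-statements.
print(cyclomatic_from_graph(edges=9, nodes=7, parts=1))   # 4
print(cyclomatic_from_decisions(binary_decisions=3))      # 4
```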

Function Point Metrics

Quality metrics based either directly or indirectly on counting lines of code in a program or its modules are unsatisfactory. These metrics are merely surrogate indicators of the number of opportunities to make an error, but from the perspective of the program as coded. More recently the function point has been proposed as a meaningful cluster of measurable code from the user's rather than the programmer's perspective. Function points can also be surrogates for error opportunity, but they can be more. They represent the user's needs and anticipated or a priori application of the program rather than just the programmer's a posteriori completion of it. A very large program may have millions of LOC, but an application with 1,000 function points would be a very large application or system indeed. A function may be defined as a collection of executable statements that performs a task, together with declarations of formal parameters and local variables manipulated by those statements. A typical function point metric developed by Albrecht[10] at IBM is a weighted sum of five components that characterize an application:

4 x the number of external inputs or transaction types

5 x the number of external outputs or report types

10 x the number of logical internal files

7 x the number of external interface files

4 x the number of external (online) inquiries supported


These are the average weighting factors wij; the actual weights vary with the complexity of each component. xij is the number of components of each type in the application.

The function count FC is the double sum

FC = Σi Σj wij x xij

where i runs over the five component types and j over their complexity levels.

The second step employs a scale of 0 to 5 to assess the impact of 14 general system characteristics in terms of their likely effect on the application:

  • Data communications

  • Distributed functions

  • Performance

  • Heavily used configuration

  • Transaction rate

  • Online data entry

  • End-user efficiency

  • Online update

  • Complex processing

  • Reusability

  • Installation ease

  • Operational ease

  • Multiple sites

  • Facilitation of change

The scores ci for these 14 characteristics are then combined into a value adjustment factor (VAF):

VAF = 0.65 + 0.01 x Σ ci

Finally, the number of function points is obtained by multiplying the number of function counts by the value adjustment factor:

FP = FC x VAF

This is actually a highly simplified version of a commonly used method that is documented in the International Function Point Users Group Standard (IFPUG, 1999).[11]
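In the simplified form given here, the whole calculation fits in a few lines. The component counts and characteristic scores below are invented, and only the average weights from the list above are used; a full count under the IFPUG rules would also classify each component as low, average, or high complexity.

```python
# Simplified function point sketch using the average weights from the text
# (component counts and characteristic scores are invented).
avg_weights = {"external inputs": 4, "external outputs": 5,
               "logical internal files": 10, "external interface files": 7,
               "external inquiries": 4}
counts = {"external inputs": 25, "external outputs": 18,
          "logical internal files": 10, "external interface files": 4,
          "external inquiries": 12}

# Function count: the double sum collapses to a single weighted sum when only
# the average complexity weights are used.
fc = sum(avg_weights[k] * counts[k] for k in avg_weights)

gsc_scores = [3, 2, 4, 3, 3, 5, 4, 3, 2, 1, 2, 3, 1, 4]   # 14 characteristics, each 0-5
vaf = 0.65 + 0.01 * sum(gsc_scores)                        # value adjustment factor

print(f"FC = {fc}, VAF = {vaf:.2f}, FP = {fc * vaf:.0f}")
```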

Although function point extrinsic counting metrics and methods are considered more robust than intrinsic LOC counting methods, they appear somewhat subjective and experimental in nature. As used over time by organizations that develop very large software systems (having 1,000 or more function points), they show a remarkably high degree of repeatability and utility. This probably owes as much to the disciplined learning process they impose on a software development organization as to any scientific credibility they may possess.

Availability and Customer Satisfaction Metrics

To the end user of an application, the only measures of quality are the performance, reliability, and stability of the application or system in everyday use. This is "where the rubber meets the road," as users often say. Developer quality metrics and their assessment are often referred to as "where the rubber meets the sky." This book is dedicated to the proposition that we can arrive at a priori user-defined metrics that can be used to guide and assess development at all stages, from functional specification through installation and use. These metrics also can meet the road a posteriori to guide modification and enhancement of the software to meet the user's changing needs. Caution is advised here, because most reported software problems are not valid defects but rather stem from individual user and organizational learning curves. This class of problem calls places an enormous burden on user support during the early days of a new release. The catch is that neither alpha testing (initial testing of a new release by the developer) nor beta testing (initial testing by advanced or experienced users) with current users identifies these problems. The purpose of a new release is to add functionality and performance to attract new users, who initially are bound to be disappointed, perhaps unfairly, with the software's quality. The DFTS approach we advocate in this book is intended to handle both valid and perceived software problems.

Typically, customer satisfaction is measured on a five-point scale:[11]

  • Very satisfied

  • Satisfied

  • Neutral

  • Dissatisfied

  • Very dissatisfied

Results are obtained for a number of specific dimensions through customer surveys. For example, IBM uses the CUPRIMDA categories: capability, usability, performance, reliability, installability, maintainability, documentation, and availability. Hewlett-Packard uses the FURPS categories: functionality, usability, reliability, performance, and serviceability. In addition to calculating percentages for various satisfaction or dissatisfaction categories, some vendors use the net satisfaction index (NSI) to enable comparisons across product lines. The NSI has the following weighting factors:

  • Completely satisfied = 100%

  • Satisfied = 75%

  • Neutral = 50%

  • Dissatisfied = 25%

  • Completely dissatisfied = 0%

NSI then ranges from 0% (all customers are completely dissatisfied) to 100% (all customers are completely satisfied). Although it is widely used, the NSI tends to obscure difficulties with certain problem products. In this case the developer is better served by a histogram showing satisfaction rates for each product individually.
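A short sketch of the NSI computation using the weighting factors above; the survey response counts are invented.

```python
# Illustrative NSI computation (survey response counts are invented).
weights = {"completely satisfied": 1.00, "satisfied": 0.75, "neutral": 0.50,
           "dissatisfied": 0.25, "completely dissatisfied": 0.00}
responses = {"completely satisfied": 120, "satisfied": 230, "neutral": 80,
             "dissatisfied": 40, "completely dissatisfied": 30}

total = sum(responses.values())
nsi = 100.0 * sum(weights[k] * n for k, n in responses.items()) / total
print(f"NSI = {nsi:.1f}%")   # 0% = all completely dissatisfied; 100% = all completely satisfied
```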

Sidebar 3.1: A Software Urban Legend

Professor Maurice H. Halstead was a pioneer in the development of ALGOL 58, automatic programming technology, and ALGOL-derived languages for military systems programming at the Naval Electronics Laboratory (NEL) in San Diego. Later, as a professor at Purdue (1967-1979), he took an interest in measuring software complexity and improving software quality. The legend we report, which was circulated widely in the early 1960s, dates from his years at NEL, where he was one of the developers of NELIAC (Naval Electronics Laboratory International ALGOL Compiler). As the story goes, Halstead was offered a position at Lockheed Missile and Space Systems. He would lead a large military programming development effort for the Air Force in the JOVIAL programming language, which he also helped develop. With messianic confidence, Halstead said he could do it with 12 programmers in one year if he could pick the 12. Department of Defense contracts are awarded at cost plus 10%, and Lockheed had planned for a staff of 1,000 programmers who would complete the work in 18 months. The revenue on 10% of the burdened cost of 1,000 highly paid professionals in the U.S. aerospace industry is a lot of money. Unfortunately, 10% of the cost of 12 even very highly paid software engineers is not, so Lockheed could not accept Halstead's proposition. This story was widely told, and its message was applied to great advantage over the next 20 years by developers of compilers and operating systems, but it has never appeared in print as far as we know. Halstead did leave the NEL to join Lockheed about this time, and Lockheed benefited from his considerable software development expertise until he went to Purdue.




