7.6 Canonical Correlation

Software development is a very complex process. Source code has many measurable attributes, so describing it is a multivariate problem. The world of software quality attributes is multivariate as well. We are interested in more than just software faults; we need to worry about all of the "ilities" simultaneously. It might well be that actions taken to reduce the rate at which faults are introduced have an adverse effect on the maintainability of the software. We need a modeling process, then, that will allow us to map multiple independent variables onto a set of multiple dependent variables. We will use a procedure called canonical correlation for this modeling process.

We will use data drawn from the Primary Avionics Software System (PASS) of the Space Shuttle to study the various aspects of canonical correlation. Nineteen software metrics were collected using a measurement tool called HALMET, written specifically for the HAL/S language. In addition to the 19 software metrics, four quality metrics (Q1, Q2, Q3, and Q4) were collected. These metrics relate to the number of faults and change requests that have been made to each of the program modules.

The correlation coefficients for the four quality metrics and the 19 complexity metrics are shown in Exhibit 28. When each of the 19 metrics is correlated with each of the quality metrics, many of the correlation coefficients are reasonably large. The problem is that a table of pairwise correlations conveys little usable information: there are a lot of numbers but no global picture of the relationship among the quality metrics and the complexity metrics. We will need multivariate statistical techniques to extract these relationships.

Exhibit 28: Correlation Coefficients of Quality Metrics with Complexity Metrics

| Metric | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| η1 | 0.48 | 0.15 | 0.45 | 0.39 |
| η2 | 0.64 | 0.43 | 0.68 | 0.48 |
| N1 | 0.54 | 0.47 | 0.58 | 0.28 |
| N2 | 0.59 | 0.45 | 0.60 | 0.40 |
| Stmt | 0.57 | 0.51 | 0.63 | 0.28 |
| LOC | 0.76 | 0.41 | 0.73 | 0.60 |
| Comm | 0.74 | 0.50 | 0.73 | 0.54 |
| Nodes | 0.55 | 0.31 | 0.56 | 0.31 |
| Edges | 0.51 | 0.31 | 0.54 | 0.26 |
| Paths | 0.36 | 0.28 | 0.44 | 0.11 |
| Cycle | 0.24 | 0.10 | 0.22 | 0.13 |
| MaxP | 0.46 | 0.27 | 0.46 | 0.25 |
| AveP | 0.44 | 0.25 | 0.45 | 0.24 |
| Sets | 0.29 | 0.19 | 0.19 | 0.25 |
| Reset | 0.41 | 0.21 | 0.30 | 0.38 |
| Can | 0.36 | 0.14 | 0.25 | 0.36 |
| SetA | 0.28 | 0.18 | 0.21 | 0.21 |
| ResA | 0.35 | 0.11 | 0.25 | 0.34 |
| CanA | 0.42 | 0.09 | 0.28 | 0.41 |

Unfortunately, the software development process is not intrinsically univariate in nature. It is clear that the independent variables are highly correlated with one another. A similar pattern emerges for the quality metrics, as is shown in Exhibit 29: most pairs of quality metrics are substantially interrelated.

Exhibit 29: Correlation Coefficients of Quality Metrics

| | Q1 | Q2 | Q3 |
|---|---|---|---|
| Q2 | 0.32 | | |
| Q3 | 0.75 | 0.47 | |
| Q4 | 0.66 | 0.11 | 0.51 |

The general notion of linear regression is to select, from a set of independent variables, the subset that explains the greatest amount of variance in a dependent variable. Coefficients for the independent variables are produced by a least squares fit of these variables to sample data. The key to model development is to choose the subset so that each new independent variable admitted to the model explains more variance than the noise it introduces.

The canonical correlation model is multivariate in both the dependent variables and the independent variables. These models permit the formulation of correlations between two sets of variables. They have the general form:

(36)  $Y_1 + Y_2 + \cdots + Y_p = X_1 + X_2 + \cdots + X_m$

where p is the number of criterion variables and m is the number of independent variables (metrics). An example of two such variable sets might be drawn from the relationship between aspects of software quality and code complexity. We know that there are significant correlations among the variables in each of these sets. Several separate multiple regression models would use the information provided by the interrelationships among the quality metrics, but would still fail to use the information provided by the interrelationships among the code metrics. Canonical correlation analysis, on the other hand, provides information on the simultaneous relationship between two distinct sets of measures. This is particularly valuable when the variables within each set are highly correlated, as is the case with quality and code metrics.

Canonical correlation reveals the complex relationships between two distinct metric sets by isolating pairs of linear combinations of these metrics. [2] Let X = (X1, X2, ..., Xm) and Y = (Y1, Y2, ..., Yp) represent, respectively, the m-dimensional vector of code metrics and the p-dimensional vector of quality metrics. Let μx and μy represent the mean vectors for the code and quality metrics, respectively. Then the relationships between the two sets of metrics can be expressed in terms of the covariance within the code set:

(37)  $\Sigma_{xx} = E\left[(X - \mu_x)(X - \mu_x)^{T}\right]$

the covariance within the quality set:

(38)  $\Sigma_{yy} = E\left[(Y - \mu_y)(Y - \mu_y)^{T}\right]$

and the covariance between the sets:

(39)  $\Sigma_{xy} = E\left[(X - \mu_x)(Y - \mu_y)^{T}\right]$
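As a rough illustration, the three covariance blocks described above can be estimated from sample data by centering each metric set and forming cross-products. The sketch below assumes NumPy, uses synthetic data rather than the PASS measurements, and the name `covariance_blocks` is hypothetical.

```python
import numpy as np

def covariance_blocks(X, Y):
    """Return sample estimates of the within-code, within-quality,
    and between-set covariance matrices.

    X is an n-by-m array of code metrics and Y an n-by-p array of
    quality metrics, with one row per program module.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)      # subtract the sample mean vector of X
    Yc = Y - Y.mean(axis=0)      # subtract the sample mean vector of Y
    S_xx = Xc.T @ Xc / (n - 1)   # m-by-m, covariance within the code set
    S_yy = Yc.T @ Yc / (n - 1)   # p-by-p, covariance within the quality set
    S_xy = Xc.T @ Yc / (n - 1)   # m-by-p, covariance between the sets
    return S_xx, S_yy, S_xy

# Synthetic stand-in data: 50 modules, 3 code metrics, 2 quality metrics.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X[:, :2] + rng.normal(size=(50, 2))
S_xx, S_yy, S_xy = covariance_blocks(X, Y)
```

The divisor n - 1 gives the usual unbiased sample estimates; any other consistent divisor would leave the canonical correlations unchanged.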

The first step in canonical correlation analysis is to find a linear combination of quality metrics that maximally correlates with a linear combination of the code metrics. This will proceed in a fashion very similar to principal components analysis. The next step is to find the maximally correlated pair of linear combinations among all pairs that are uncorrelated with the first pair. This process continues until, at most, M = min(m, p) uncorrelated pairs are isolated. The pairs and their correlations are called canonical variates and canonical correlations, respectively.

In addition to the canonical correlations, a number of derived quantities are important in interpreting the results of a canonical correlation analysis. Canonical weights are comparable to regression weights. They are the coefficients in the linear combinations that define the canonical variates. The ith canonical variate representing the Xs is given by:

(40)  $U_i = a_i^{T} X = \sum_{j=1}^{m} a_{i,j} X_j$

and the ith canonical variate representing the Ys is given by:

(41)  $V_i = b_i^{T} Y = \sum_{j=1}^{p} b_{i,j} Y_j$

where the vectors ai and bi give the canonical weights for the code and quality metrics, respectively. The magnitude of a weight, ai,j (or bi,j), indicates the importance of variable Xj (respectively Yj) with regard to Y (respectively X) in obtaining the canonical correlation of the ith canonical variate.
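One common way to obtain the canonical weights and correlations is to whiten each metric set and take a singular value decomposition of the whitened cross-covariance. The sketch below takes that route; it is not necessarily the procedure used for the PASS analysis. It assumes NumPy and synthetic data, and the names `inv_sqrt`, `canonical_correlation`, `U_scores`, and `V_scores` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def canonical_correlation(X, Y):
    """Return (rho, A, B): canonical correlations in descending order,
    and weight matrices whose ith columns are the weight vectors for
    the code metrics (A) and the quality metrics (B)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (n - 1)
    S_yy = Yc.T @ Yc / (n - 1)
    S_xy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(S_xx), inv_sqrt(S_yy)
    # The singular values of the whitened cross-covariance are the
    # canonical correlations; there are min(m, p) of them.
    U, rho, Vt = np.linalg.svd(Kx @ S_xy @ Ky, full_matrices=False)
    return rho, Kx @ U, Ky @ Vt.T

# Synthetic stand-in data: five code metrics driven by a shared "size"
# factor, one quality metric tied to size and one that is pure noise.
rng = np.random.default_rng(42)
n = 200
size = rng.normal(size=n)
X = np.column_stack([size + 0.5 * rng.normal(size=n) for _ in range(5)])
Y = np.column_stack([size + rng.normal(size=n), rng.normal(size=n)])

rho, A, B = canonical_correlation(X, Y)
U_scores = (X - X.mean(axis=0)) @ A   # canonical variates for the code metrics
V_scores = (Y - Y.mean(axis=0)) @ B   # canonical variates for the quality metrics
```

With this scaling each canonical variate has unit sample variance, so the correlation between the ith columns of `U_scores` and `V_scores` equals `rho[i]`, and variates from different pairs are uncorrelated.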

Because a canonical variate is not directly observable, it is best understood in terms of those variables that are related to it. Canonical loadings are comparable to the factor loadings in principal components analysis: they give the correlations of the raw variable scores with the canonical variate scores. The canonical loadings for the ith canonical variate are given by two vectors, one for the code metrics and one for the quality metrics. These vectors are, respectively, rx,i = Rxx ai and ry,i = Ryy bi, where Rxx and Ryy are the within-set correlation matrices for the independent and dependent variables, respectively.
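The loading formula can be checked numerically. The sketch below assumes that the weights apply to standardized metrics and that the variate is scaled to unit variance, which is the setting in which the product of the correlation matrix and the weight vector equals the correlations between the metrics and the variate. The data and the weight vector `a` are hypothetical, and NumPy is assumed.

```python
import numpy as np

# Synthetic, mutually correlated stand-ins for three code metrics.
rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
raw = rng.normal(size=(100, 3)) @ mix

# Standardize each metric to mean 0 and unit sample standard deviation.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
R_xx = np.corrcoef(Z, rowvar=False)     # within-set correlation matrix

a = np.array([0.6, 0.3, 0.1])           # hypothetical canonical weights
a = a / np.sqrt(a @ R_xx @ a)           # rescale so the variate has unit variance
u = Z @ a                               # canonical variate scores

loadings = R_xx @ a                     # loadings from the formula
direct = np.array([np.corrcoef(Z[:, j], u)[0, 1] for j in range(Z.shape[1])])
```

Here `loadings` and `direct` agree to floating-point precision, which is why loadings can be read as ordinary correlations between each metric and the variate.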

Canonical correlation analysis was performed on the PASS software metrics shown above, and the results are shown in Exhibit 30. In this case there were exactly four canonical variates, since M = min(19, 4) = 4. The first two canonical variates account for most of the total variation: Canonical Variate 1 accounts for 78 percent of the variance and Canonical Variate 2 accounts for 17 percent, for a cumulative total of approximately 96 percent. Canonical Variates 3 and 4 contribute little to our understanding of the simultaneous variation of code and quality metrics. The first canonical correlation between the code and quality metrics is 0.896. The second is lower, at 0.693. In both cases, these correlations are significant (p < 0.05).

Exhibit 30: Canonical Correlation Analysis

| Canonical Variate | Canonical Correlation | Standard Error | Eigenvalue | Proportion | Cumulative |
|---|---|---|---|---|---|
| 1 | 0.896 | 0.007 | 4.10 | 0.78 | 0.78 |
| 2 | 0.693 | 0.018 | 0.93 | 0.17 | 0.96 |
| 3 | 0.342 | 0.032 | 0.13 | 0.02 | 0.98 |
| 4 | 0.264 | 0.033 | 0.07 | 0.01 | 1.00 |
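The Eigenvalue, Proportion, and Cumulative columns of Exhibit 30 appear to follow a convention common in statistical packages, in which each eigenvalue is r²/(1 − r²) for canonical correlation r and each proportion is that eigenvalue's share of the total. That this is the convention behind the exhibit is an assumption, although the sketch below reproduces the published figures to within rounding.

```python
# Canonical correlations copied from Exhibit 30.
canonical_correlations = [0.896, 0.693, 0.342, 0.264]

# Assumed convention: eigenvalue = r^2 / (1 - r^2).
eigenvalues = [r ** 2 / (1.0 - r ** 2) for r in canonical_correlations]

# Each variate's share of the summed eigenvalues.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for i, (ev, share, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"variate {i}: eigenvalue={ev:.2f} proportion={share:.2f} cumulative={cum:.2f}")
```

Because the eigenvalue grows rapidly as r approaches 1, the first variate dominates the proportions even though the second canonical correlation is itself fairly large.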

Exhibit 31 reveals the canonical structure of the code and quality metrics. The numbers in the columns reveal the relative strength of the relationship of each of the code and quality metrics with Canonical Variates 1 and 2. We can see, for example, that the metrics η2, LOC, Com, and Q1 are all strongly related to Canonical Variate 1. On the other hand, Canonical Variate 2 is more closely associated with Paths and Q2.

Exhibit 31: Canonical Structure of Code and Quality Metrics

| Metric | Canonical Variate 1 | Canonical Variate 2 |
|---|---|---|
| Code Metrics | | |
| η1 | 0.55 | 0.05 |
| η2 | 0.77 | 0.32 |
| N1 | 0.61 | 0.51 |
| N2 | 0.70 | 0.37 |
| Exec | 0.64 | 0.59 |
| LOC | 0.90 | 0.21 |
| Com | 0.87 | 0.35 |
| Nodes | 0.60 | 0.37 |
| Edges | 0.55 | 0.40 |
| Paths | 0.38 | 0.47 |
| Cycle | 0.25 | 0.11 |
| MaxP | 0.49 | 0.32 |
| AveP | 0.47 | 0.31 |
| Sets | 0.35 | -0.01 |
| Reset | 0.49 | -0.06 |
| Can | 0.44 | -0.13 |
| SetA | 0.32 | 0.06 |
| ResA | 0.42 | -0.11 |
| CanA | 0.49 | -0.18 |
| Quality Metrics | | |
| Q1 | 0.92 | 0.11 |
| Q2 | 0.44 | 0.62 |
| Q3 | 0.83 | 0.41 |
| Q4 | 0.85 | -0.50 |

The basic notion of canonical program complexity is that each raw complexity metric will have an appropriate canonical weight assigned by the analysis. These weights will be used to map the multivariate complexity metrics onto a single canonical variate of canonical complexity. Consider now a scenario where we have chosen to form canonical variates for a set of quality metrics in relation to a set of source code metrics. For each canonical variate there will be two vectors, w, of these weights. Let wc represent the set of weights for the code metrics and wq the weights for the quality metrics. The weights wc map the set of standardized raw complexity metrics onto a single canonical complexity value, γc. When the matrix z of standardized metric values is multiplied by the vector of weights for the code metrics, the result is a canonical complexity vector for the code metrics as follows:

(42)  $\gamma_c = z\, w_c$

Similarly, there is a corresponding canonical complexity metric value γq for the quality metrics.
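To make the mapping concrete, the sketch below standardizes a small matrix of raw code metrics and multiplies it by a weight vector to obtain one canonical complexity value per module. The metric values, the choice of three metrics, and the weights in `w_c` are all hypothetical; in practice the weights would come from the canonical correlation analysis itself.

```python
from statistics import mean, stdev

# Hypothetical raw metrics for four modules: (LOC, Nodes, Edges).
metrics = [
    [120, 10, 12],
    [300, 25, 31],
    [80, 6, 7],
    [560, 48, 60],
]

# Hypothetical canonical weights for the three code metrics.
w_c = [0.7, 0.2, 0.1]

def standardize(rows):
    """Convert each column to z-scores (mean 0, unit sample standard deviation)."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)] for row in rows]

z = standardize(metrics)

# Canonical complexity: one weighted sum of standardized metrics per module.
gamma_c = [sum(z_ij * w_j for z_ij, w_j in zip(row, w_c)) for row in z]

for module, value in enumerate(gamma_c, start=1):
    print(f"module {module}: canonical complexity = {value:.3f}")
```

Because the metrics are standardized first, the canonical complexity values are centered on zero: modules with positive values are more complex than the average module in the sample, and those with negative values are less so.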

In an alternate scenario, we might wish to construct canonical complexity metrics from a domain of source code metrics compared with a domain of software design metrics. This is an example of a mapping between two metric sets within the product metric domain. The canonical complexity will be that value that maximizes the relationship between the source code metric domain and the software design metric domain. The canonical complexity of a program would then represent the complexity of the associated source code in relation to measures of software design.

[2] Dillon, W. and Goldstein, M., Multivariate Analysis: Methods and Applications, John Wiley & Sons, New York, 1984.



Software Engineering Measurement
ISBN: 0849315034
Year: 2003
Pages: 139
