Statistical Methods for Software Architects


The major quantitative methodology in support of DFTS is Taguchi Methods and the application of orthogonal arrays, which are covered in Chapter 17. Here we cover some basic concepts of statistical analysis that are useful to the software architect or project manager for managing the development process from a more quantitative perspective. If you want more detail, refer to Applied Statistics for Software Managers for a very useful tutorial.[9] For more depth in statistical quality assurance methods, we recommend Statistical Quality Assurance Methods for Engineers.[10] For a standard-practices manual, we recommend the Handbook of Statistical Methods for Engineers and Scientists.[11] The latter two publications are not oriented to the software development process, but they are very good references for statistical process control in manufacturing generally.

We begin by reviewing a few basic definitions. The variance of a sample measures how widely the values are spread about the sample mean. It is equal to the sum of the squared differences between each observation and the mean of the entire sample, divided by the number of observations minus 1 (n - 1). The more commonly used measure of sample variability is the standard deviation, which is simply the square root of the variance. A data set can also be described by its frequency distribution, usually represented graphically by a bar chart or pie chart. Some frequency distributions occur so often that they have special names and equations. For example, the normal distribution is graphically represented by the familiar bell-shaped curve that is symmetrical about the average value of the data it represents. In a normal distribution, the mean, median, and mode all have the same value. The normal distribution can be described by only two parameters: the mean and the standard deviation. The standard deviation defines the width of the bell curve; the greater the standard deviation, the wider the curve. If the numerical data collected for some measurable attribute follows a normal distribution, 68% of the observations fall within plus or minus one standard deviation of the mean, 95.5% fall within two standard deviations, and 99.7% fall within three standard deviations.[12]
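To make these definitions concrete, here is a minimal sketch in Python (the defect-count sample is hypothetical, not taken from the book) that computes the sample mean, variance, and standard deviation and checks the one-standard-deviation rule of thumb:

import statistics

# Hypothetical sample of defect counts (illustration only)
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]

mean = statistics.mean(sample)
variance = statistics.variance(sample)   # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(sample)       # square root of the variance

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")

# For roughly normal data, about 68% of observations should fall within
# one standard deviation of the mean.
within_one = [x for x in sample if abs(x - mean) <= std_dev]
print(f"{len(within_one)} of {len(sample)} observations within one std dev")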

The analyst uses correlation methods to tell how well two sets of observations or distributions are related to each other. Running a correlation analysis on two sets of data produces a single number between -1 and +1, called the correlation coefficient. A value of +1.0 indicates a perfect positive correlation, -1.0 a perfect negative correlation, and 0 no correlation at all. A high correlation between two sets of data does not necessarily demonstrate a functional relationship or even causality. Further analysis may call for a Spearman rank correlation, which compares the two variables' ranks, or places in an ordered list, for the same observations. For example, if you wanted to study the relationship between a program's size and effort metrics, as described in Chapter 3, you would first rank the projects by size. Suppose there are five projects, ranked in size from smallest to largest as 1 through 5. Ranking the effort the same way, you can call the statistical package for Spearman's rank correlation and compute the correlation coefficient. If it were 1.0, size and effort would be perfectly correlated, which is the usual case.[13] If the coefficient is less than 1, you would investigate which project enjoyed more or less effort than expected and inquire why. This analysis generally is used when the data is ordinal, as in our example, or when the data is far from normally distributed. When the data is normally distributed and linear, or when it is interval or ratio data, Pearson's correlation is indicated. This method uses the actual values of the variables instead of their ranks. Calling the Pearson correlation method in Minitab 14 or another statistical package with the same data from Maxwell's example produces a correlation coefficient of 0.9745. Although this coefficient may be more precise because it was computed with the actual size and effort data, it leads to the same conclusion: size and effort are, for all practical purposes, completely correlated for these five programs.
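As a minimal sketch (the five project sizes and effort values here are hypothetical, not Maxwell's actual figures), the two correlation methods can be compared in Python with scipy:

from scipy.stats import pearsonr, spearmanr

# Hypothetical sizes (KLOC) and effort (person-months) for five projects
size = [10, 25, 40, 80, 150]
effort = [20, 55, 90, 170, 310]

rho, _ = spearmanr(size, effort)   # uses the ranks of the values
r, _ = pearsonr(size, effort)      # uses the actual values

print(f"Spearman rank correlation = {rho:.4f}")   # 1.0: the ranks agree perfectly
print(f"Pearson correlation       = {r:.4f}")     # close to 1.0 for nearly linear data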

Regression analysis is an extension of the familiar least-squares method of fitting a straight line through a set of data points in the x-y plane. In the case of Dr. Maxwell's example of effort versus size, we can call for a regression analysis to get a straight-line fit to the data, which now gives us a formula for predicted effort in terms of program size:

predicted effort = A + B x program size

in which the coefficients A and B are returned by the statistical package. Such a formula is based on the actual experience of our development team and thus is a good predictor of the manpower required to create the next program given an estimate of its size.
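A minimal sketch of such a straight-line fit, again using hypothetical size and effort data rather than the book's, with scipy's linregress:

from scipy.stats import linregress

# Hypothetical program sizes and effort values
size = [10, 25, 40, 80, 150]
effort = [20, 55, 90, 170, 310]

fit = linregress(size, effort)
A, B = fit.intercept, fit.slope

print(f"predicted effort = {A:.2f} + {B:.2f} * program size")
print(f"predicted effort for a size-100 program: {A + B * 100:.1f}")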

Multiple regression analysis is similar to simple regression analysis, except that the dependent variable is some unknown but experiential function of two or more independent variables. For example, you could extend the preceding example to develop a formula for predicted effort as a function of program size and team size. Multiple regression analysis is particularly valuable when you have a measurable process result or desideratum, such as program performance, quality, time to build, or cost to build, and a sea of historical measurements describing the various aspects of your process and its subprocesses. The question then is, which of the independent variables most influence cost, which influence quality, and which influence cycle time? Running a multiple regression for each process result as the dependent variable, with the process measures as independent variables, typically shows that each result can be explained by a few (three to five) of the total group of measures. As shown in Table 15.2, any process can have many measurable attributes, and it is unclear which of them influence the process outcomes and by how much. One author's experience has shown that the bread-baking process has more than 350 variables, but only five are statistically significant. To predict cast-iron mold life in a Bethlehem Steel mill, data was collected for more than 200 measurable attributes over four years, but 98% of the variance in the fourth year's data could be explained with only six attributes, based on a multiple regression (MR) analysis of the first three years' data. It turned out that more than 90% could be explained by one variable: the mold's temperature at the time the steel was poured into it. A simple rerouting of the molds from the cold outside storage yard through a very warm steel mill before reuse dramatically lengthened mold life and reduced process costs. Manufacturing or continuous-material processes and software development have many observable, measurable, experiential variables, all of which contribute to the result in some way. But we rarely know which ones are the most important in influencing product attribute outcomes, such as quality.
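The following sketch illustrates the idea with hypothetical data (the size, team-size, and effort values are assumptions for illustration): it fits a two-variable regression by least squares and reports the fraction of variance in effort that the model explains.

import numpy as np

# Hypothetical data: program size, team size, and observed effort
size = np.array([10, 25, 40, 80, 150], dtype=float)
team = np.array([2, 3, 4, 6, 10], dtype=float)
effort = np.array([20, 55, 90, 170, 310], dtype=float)

# Design matrix with a column of ones for the intercept term
X = np.column_stack([np.ones_like(size), size, team])
coeffs, *_ = np.linalg.lstsq(X, effort, rcond=None)

predicted = X @ coeffs
r_squared = 1 - np.sum((effort - predicted) ** 2) / np.sum((effort - effort.mean()) ** 2)

print("intercept, size coefficient, team coefficient:", np.round(coeffs, 3))
print(f"R-squared = {r_squared:.3f}  (fraction of variance in effort explained)")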

Some regression opportunities involve qualitative rather than quantitative variables and thus cannot employ MR. Instead, they must use analysis of variance (ANOVA). Examples of such variables in software development might include business sector, programming language, hardware platform, operating system, and user interface. These could have significant influence on dependent variables such as effort, productivity, time to build, and cost. Typically ANOVA begins with a null hypothesis such as "The mean percentage of measurable-variable-x utilization is the same for all seven application software products the firm sells." You do not know at the outset what the population means are, so you must use the sample data to estimate them. The larger the variance in x between the application types and the smaller its variance within application types, the more likely it is that x utilization differs between application types.
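A minimal one-way ANOVA sketch (with hypothetical utilization data for three application types rather than seven) using scipy, which compares the between-group variance to the within-group variance:

from scipy.stats import f_oneway

# Hypothetical percentage utilization of variable x by application type
app_a = [42, 45, 47, 44, 46]
app_b = [51, 49, 53, 52, 50]
app_c = [43, 44, 42, 45, 43]

f_stat, p_value = f_oneway(app_a, app_b, app_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value is evidence against the null hypothesis that all
# application types have the same mean utilization of x.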

This brief summary of statistical tools for the computer architect is not meant to be complete or exhaustive. It's a review and, hopefully, a spark to encourage your interest in using quantitative methods to measure and improve software development processes. We'll close with a descriptive example adapted from Maxwell.[14] The goal of her example was to develop a predictive model or equation to estimate the effort required of a development group to produce a new software product. The independent variables might be, for example, size, type of application, OS platform, user interface, language used, p1 through p15 for 15 different productivity factors, and time to build. The desired result is a predictive equation such as

effort = f(size, type, OS, user, lang, p1, ..., p15, time)

Some of these variables are quantitative data, and others are qualitative ordinal data. Maxwell begins by analyzing effort against each of the variables separately. Clearly effort is proportional to size. Among the productivity variables, effort increases with customer participation. It shows no relationship to staff availability, methods, or use of tools, but it does increase with logical complexity and probably with requirements volatility. Maxwell's multiple regression analysis using the numerical variables alone explains 78% of the variance in effort. After experimenting with two-, three-, and four-variable models, she ends up with a five-variable model that explains 81% of the variance in effort. Effort increases with application size and requirements volatility; it decreases over time and depends on the user interface chosen.
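One practical wrinkle in a model like this is that it mixes quantitative predictors (size, time) with qualitative ones (application type, OS, user interface, language). A common approach, sketched below with hypothetical data and column names (this is not Maxwell's actual procedure), is to dummy-code the categorical variable before running the least-squares fit:

import numpy as np
import pandas as pd

# Hypothetical data: program size, user-interface type, and effort
df = pd.DataFrame({
    "size": [10, 25, 40, 80, 150, 60],
    "ui": ["GUI", "text", "GUI", "text", "GUI", "text"],
    "effort": [22, 50, 95, 160, 320, 130],
})

# One dummy column per user-interface category, dropping one as the baseline
dummies = pd.get_dummies(df["ui"], prefix="ui", drop_first=True).astype(float)
X = np.column_stack([np.ones(len(df)), df["size"].to_numpy(dtype=float), dummies.to_numpy()])
coeffs, *_ = np.linalg.lstsq(X, df["effort"].to_numpy(dtype=float), rcond=None)

print("intercept, size, and ui_text coefficients:", np.round(coeffs, 3))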

You will find that although the Minitab 14 software is helpful, almost everything you need to do with statistics for software design and development can be done using Microsoft Excel.



