


Chapter 8: Special Topics in Quantitative Management

Overview

It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something.

Franklin Roosevelt [1]

[1]Brooks, Frederick P., The Mythical Man-Month, Addison-Wesley Longman, Inc., Reading, MA, 1995, p. 115.



Regression Analysis

Regression analysis is a term applied by mathematicians to the investigation and analysis of how one or more data variables behave in the presence of another data variable. For example, one data variable in a project could be cost and another could be time or schedule. Project managers naturally ask: How does cost behave in the presence of a longer or shorter schedule? Questions such as these are amenable to regression analysis. The primary outcome of regression analysis is a formula for a curve that "best" fits the data observations. Not only does the curve visually reinforce the relationship between the data points, but it also provides a means to forecast the next data point before it occurs or is observed, giving the project manager lead time during which risk management can be brought to bear on the forecasted outcome.

Beyond the dependency of cost on schedule, cost might depend on the training hours per project staff member, the square feet of facilities allocated to each staff member, the individual productivity of staff, and a host of other possibilities. There are also many multivariate situations in projects that call for the mathematical relationship of one variable to another, such as employee productivity as an outcome (dependency) of training hours. For each of these project situations, there is also the forecasting task of estimating the next data point given another outcome of the independent variable. In effect, how much risk is there in the next data set?

Single-Variable Regression

Probably the easiest place to start with regression analysis is with the case introduced above about cost and schedule. In regression analysis, one or more of the variables must be independent variables; the remaining data variable is the dependent variable. The simplest relationship between two variables, one independent and one dependent, is the linear equation of the form

Y = a * X + b

where X is the independent variable, and the value of Y is dependent on the value of X. When we plot the linear equation we observe that the "curve" is a straight line. Figure 8-1 provides a simple illustration of the linear "curve" that is really a straight line. In Figure 8-1, the independent variable is time and the dependent variable is cost, adhering to the time-honored expression "time is money."

Figure 8-1: Linear Equation.

Those familiar with linear equations from the study of algebra recognize that the parameter "a" is the slope of the line and has dimensions of "Y per X," as in dollars per week if Y were dimensioned in dollars and X were dimensioned in weeks. As such, project managers can always think of the slope parameter as a "density" parameter. [2] The "b" parameter is usually called the "intercept," referring to the fact that when X = 0, Y = b. Therefore, "b" is the intercept point of the curve with the Y-axis at the origin where X = 0.
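To make the slope and intercept concrete, here is a minimal sketch in Python of the linear cost model. The burn rate and fixed cost figures are made-up values for illustration only.

# Hypothetical linear cost model: cost depends on schedule.
a = 5_000.0   # slope: burn rate in dollars per week (the "density" parameter)
b = 20_000.0  # intercept: cost at T = 0 (made-up fixed cost)

def cost(weeks: float) -> float:
    """Evaluate the linear 'curve' C = a * T + b."""
    return a * weeks + b

print(cost(0))   # 20000.0 -- the Y-axis intercept b
print(cost(10))  # 70000.0 -- 10 weeks at $5,000/week plus the fixed cost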

Of course, X and Y could be deterministic variables (only one fixed value) or they could be random variables (observed value is probabilistic over a range of values). We recognize that the value of Y is completely forecasted by the value of X once the deterministic parameters "a" and "b" are known.

In Figure 8-2, we see a scatter of real observations of real cost and schedule data laid on the graph containing a linear equation of the form we have been discussing, C = a * T + b. Visually, the straight line (that is, the linear curve) seems to fit the data scatter pretty well. As project managers, we might be quite comfortable using the linear curve as the forecast of data values beyond those observed and plotted. If so, the linear equation becomes the "regression" curve for the observed data.

Figure 8-2: Random "Linear" Data.

Calculating the Regression Curve

Up to this point, we have discussed single independent variable regression, albeit with the cart in front of the horse: we discussed the linear equation before we discussed the data observations. In point of fact, the opposite is the case in real projects. The project team has or makes the data observations before there is a curve. The task then becomes to find a curve that "fits" the data. [3]

There is plenty of computer tool support for regression analysis. Most spreadsheets incorporate the capability or there is an add-in that can be loaded into the spreadsheet to provide the functionality. Beyond spreadsheets, there is a myriad of mathematics and statistics computer packages that can perform regression analysis. Suffice it to say that in most projects no one would be called on to actually calculate a regression curve. Nevertheless, it is instructive to understand what lies behind the results obtained from the computer's analysis.

In this book, we will constrain ourselves to a manual calculation of a linear regression curve. Naturally, there are higher order curves involving polynomial equations that plot as curves and not straight lines. Again, most spreadsheets offer a number of curve fits, not just the linear curve.

By now you might be wondering if there is a figure of merit or some other criterion that would help in picking the regression line. In other words, if there is more than one possibility for the regression curve, as surely there always is, then which one is best? We will answer that question in subsequent paragraphs.

Our task is to find a regression curve that fits our data observations. Our deliverable is a formula of the linear equation type Y = a * X + b. Our task really, then, is to find or estimate "a" and "b". As soon as we say "estimate" we are introducing the idea that we might not be able to exactly derive "a" and "b" from the data. "Estimate" means that the "a" and "b" we find are really probabilistic over a range of value possibilities.

Since we are working with a set of data observations, we may not have all the data in the universe (we may have only a sample or subset), but we can nevertheless find the averages of the sample we do have: the mean value of X and the mean value of Y, which we will denote Xav and Yav.

It makes some sense to think that any linear regression line we come up with that is "pretty good" should pass through, or very close to, the point on the graph represented by the coordinates of the average value of X and the average value of Y. For this case we have one equation in two unknowns, "a" and "b":

Yav = a * Xav + b

where Xav and Yav are the mean (average) values of the random variables X and Y.

This equation can be rearranged to have the form

0 = a * Xav + b - Yav

Students of algebra know that when there are two unknowns, two independent equations involving those unknowns must be found in order to solve for them. Thus we are now faced with the dilemma of finding a second equation. Deriving it is actually beyond the scope of this book, as it involves calculus (minimizing the total squared error between the line and the observations) to find the best-fit values of "a" and "b", but the result of the calculus is not hard to use and understand as our second equation involving "a" and "b" and the observations of X and Y:

0 = a * (X²)av + b * Xav - (X * Y)av

where (X²)av is the average of X² over the observations, and (X * Y)av is the average of the products X * Y.

Solving for "a" and "b" with the two independent equations we have discussed provides the answers we are looking for:

a = [(X * Y)av - Xav * Yav] / [(X²)av - (Xav)²], and b = Yav - a * Xav
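The formulas for "a" and "b" are simple enough to compute in a few lines of code. The following Python sketch applies them to a small set of made-up cost-versus-time observations; the numbers are illustrative, not from the text.

# Estimate a and b from observations using the averages formulas above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # schedule observations (weeks)
ys = [26.0, 29.0, 36.0, 41.0, 43.0]   # cost observations ($ thousands)

n = len(xs)
x_av = sum(xs) / n                               # Xav
y_av = sum(ys) / n                               # Yav
xy_av = sum(x * y for x, y in zip(xs, ys)) / n   # (X * Y)av
x2_av = sum(x * x for x in xs) / n               # (X²)av

a = (xy_av - x_av * y_av) / (x2_av - x_av ** 2)  # slope
b = y_av - a * x_av                              # intercept
print(f"a = {a:.2f}, b = {b:.2f}")               # a = 4.60, b = 21.20

Note that the fitted line passes through the point (Xav, Yav) = (3, 35), just as the first of our two equations requires.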

Goodness of Fit to the Regression Line

We are now prepared to address how well the regression line fits the data. We have already said that a good line should pass through the coordinates of Xav and Yav, but the line should also be minimally distant from all the other data. "Minimally distant" is a general objective of all statistical analysis. After all, we do not know these random variables exactly; if we did, they would be deterministic and not random. Therefore, statistical methods in general strive to minimize the distance or error between the probabilistic values found in the probability density function for the random variable and the real value of the variable.

As discussed in Chapter 2, we minimize distance by calculating the square of the distance between a data observation and its real, mean, or estimated value and then minimizing that squared error. Figure 8-3 provides an illustration. From each data observation point, we measure the Y-distance to the regression line. Summing the squares of these measures gives:

Y²distance = Σ (Yi - Yx)²

Figure 8-3: Distance Measures in Linear Regression.

where Yi is the specific observation, and Yx is the value of Y on the linear regression line at the same value of X as Yi.

Consider also that there is another distance measure that could be made. This second distance measure involves the Yav rather than the Yx:

Y²dAv = Σ (Yi - Yav)²

Figure 8-4 illustrates the measures that sum to Y²dAv. At first, this second distance measure is counterintuitive: you would think you would always want to measure distance to the nearest point on the regression line and not to an average point that might be farther away. However, the issue is whether or not the variations in Y really are dependent on the variations in X. Perhaps they are strongly or exactly dependent; then a change in Y can be forecast with almost no error based on a forecast or observation of X. If such is the case, then Y²distance is the measure to use. However, if Y is somewhat, but not strongly, dependent on X, then a movement in X will still cause a movement in Y, but not to the extent that would occur if Y were strongly dependent on X. For the loosely coupled dependency, Y²dAv is the measure to use.

Figure 8-4: Distance Measures to the Average.

The r² Figure of Merit

Mathematicians have formalized the issue about how dependent Y is on X by developing a figure of merit, r², which is formally called the "coefficient of determination":

r² = 1 - (Y²distance / Y²dAv)

0 ≤ r² ≤ 1

Most mathematical packages that run on computers and provide regression analysis also calculate r² and can report the figure in tables or on the graphical output of the package. Being able to calculate r² so conveniently relieves the project manager of having to calculate all the distances, first to the regression line and then to the average value of Y. Moreover, the regression analyst can experiment with various regression lines until the r² is maximized. After all, if r² = 1, then the regression line is "perfectly" fitted to the data observations; the advantage to the project team is that with a perfect fit, the next outcome of Y is predictable with near certainty. The corollary is also true: the closer r² is to 0, the less predictive is the regression curve and the less representative it is of the relationship between X and Y. Here are the rules:

If r² = 1, then Y²distance/Y²dAv = 0, or Y²distance = 0

where "Y²distance = 0" means all the observations lie on the regression line and the fit of the regression line to the data is perfect.

If r² = 0, then Y²distance/Y²dAv = 1

where "Y²distance/Y²dAv = 1" means that the data observations are predicted no better by the regression line than by the average value of Y. The line is pretty much useless as a forecasting tool.

Figure 8-5 shows the r² for a data set.

Figure 8-5: r² of Data Observations.

Some Statistical Properties of Regression Results

There are some interesting results that go along with the analysis we have been developing. For one thing, it can be shown [4] that the estimate of "a" is a random variable with a Normal distribution. Such a conclusion should not come as a surprise, since there is no particular reason why the distribution of values of "a" should favor the more pessimistic or the more optimistic value. All of the good attributes of the Normal distribution are working for us now: "a" is an unbiased maximum likelihood estimator of the true value of "a" in the population. In turn, the expected value of "a" is the true value of "a" itself:

Mean value of "a"

=

expected value of "a"

 

=

true value of "a" in the population

Variance (a)

=

σ2/ (Xi - Xav)2

Looking at the data observations of X for a moment, we write:

Σ(Xi - Xav)² = n * Variance(X)

where n = the number of observations of X.

Now we have a couple of interesting results. As the sample size, n, gets larger, the variance of "a" gets smaller, meaning that we can zero in on the true value of "a" that much better. Another way to look at this is that as the variance of X gets larger, meaning a larger spread of the X values, the variance of "a" is again driven smaller, making the estimate all the better. In effect, if all the X data are bunched together, then it is very difficult to find a line that predicts Y. We see that the Xs in Figure 8-6 do not provide a sufficient spread to develop a regression curve.

Figure 8-6: Spread of X.
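The variance result can be checked by simulation. The sketch below is illustrative throughout: it assumes a true line with a = 4.6 and b = 21.2 and Normal noise with σ = 2, fits the slope to many simulated samples, and reports the empirical variance of the estimates. The variance shrinks as n grows and as the spread of X grows, just as the formula predicts.

import random

def fit_slope(xs, ys):
    # Same averages formula for "a" as in the earlier sketch.
    n = len(xs)
    x_av, y_av = sum(xs) / n, sum(ys) / n
    xy_av = sum(x * y for x, y in zip(xs, ys)) / n
    x2_av = sum(x * x for x in xs) / n
    return (xy_av - x_av * y_av) / (x2_av - x_av ** 2)

def slope_variance(n, spread, true_a=4.6, true_b=21.2, sigma=2.0, trials=2000):
    # Empirical Variance(a) over many simulated samples.
    rng = random.Random(42)
    slopes = []
    for _ in range(trials):
        xs = [rng.uniform(0.0, spread) for _ in range(n)]
        ys = [true_a * x + true_b + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_slope(xs, ys))
    mean = sum(slopes) / trials
    return sum((s - mean) ** 2 for s in slopes) / trials

print(slope_variance(n=10, spread=5.0))    # baseline
print(slope_variance(n=100, spread=5.0))   # larger n: variance of "a" shrinks
print(slope_variance(n=10, spread=50.0))   # wider X spread: variance shrinks too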

Multiple Independent Variables

Having more than one independent variable complicates the calculations immediately. The dependent variable, say cost, now depends on two different and independent variables, say schedule and worker productivity. Much of the conceptual ground is the same. Indeed, the r² becomes R², but the idea remains the same: a figure of merit measuring how strongly the dependent data are driven by the independent data.

For the more complex projects there may be a need to do multivariate regression. The only practical approach is to apply a computer program to do the calculations (a short sketch follows this list). The more challenging problem for the project manager is to deduce the contributing independent variables and make observations of all of them simultaneously under the same conditions. Any of a few things may be at work if R² is not as expected:

  • The project manager has deduced incorrectly what the contributing data are to the dependent outcome. For example, does project cost really depend on schedule and productivity or schedule and something else not recognized?

  • The independent data really do not predict the dependent data. There simply is not a strong cause-and-effect relationship even though on the surface there appears to be a strong correlation of one data item with another.

  • The data observations of the multiple variables were not made at the same time under the same conditions, thereby tainting the cause-and-effect influences.
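As noted above, a computer does the multiple-variable arithmetic. The following Python sketch uses NumPy's least-squares solver on made-up observations of cost versus schedule and productivity; the data and the variable choices are assumptions for illustration, not project results.

import numpy as np

schedule = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0])   # weeks
productivity = np.array([1.2, 0.9, 1.4, 0.8, 1.1, 1.0])    # relative output
cost = np.array([52.0, 64.0, 44.0, 78.0, 57.0, 68.0])      # $ thousands

# Design matrix: one column per independent variable, plus a column
# of ones for the intercept term.
X = np.column_stack([schedule, productivity, np.ones_like(schedule)])
coeffs, _, _, _ = np.linalg.lstsq(X, cost, rcond=None)

# R² computed the same way as r²: compare squared distances to the
# fitted values against squared distances to the average of cost.
predicted = X @ coeffs
ss_res = np.sum((cost - predicted) ** 2)
ss_tot = np.sum((cost - cost.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print("coefficients:", coeffs)
print("R² =", round(float(r_squared), 4))

If R² comes back low, revisit the three possibilities above before trusting the fitted coefficients.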

[2]The word density used in a mathematical context refers to the incremental change in the dependent variable in response to an incremental change in the independent variable. We have already used the density concept when we referred to the probability distribution as the probability density curve.

[3]Of course, there is a place in project management for the "cart to come before the horse" and that is in hypothesis analysis and forecasting before the fact. In hypothesis analysis, the task is to validate that the data observed fit the forecast or hypothesis and that there is proper cause and effect and not just coincidence or other unidentified dependencies.

[4]Downing, Douglas and Clark, Jeffery, Statistics the Easy Way, Barron's, Hauppauge, NY, 1997, pp. 264–269.