Details


Introduction to Response Surface Experiments

Many industrial experiments are conducted to discover which values of given factor variables optimize a response. If each factor is measured at three or more values, a quadratic response surface can be estimated by least-squares regression. The predicted optimal value can be found from the estimated surface if the surface is shaped like a simple hill or a valley. If the estimated surface is more complicated, or if the predicted optimum is far from the region of experimentation, then the shape of the surface can be analyzed to indicate the directions in which new experiments should be performed.

Suppose that a response variable y is measured at combinations of values of two factor variables, x1 and x2. The quadratic response-surface model for this variable is written as

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon

The steps in the analysis for such data are

  1. model fitting and analysis of variance to estimate parameters

  2. canonical analysis to investigate the shape of the predicted response surface

  3. ridge analysis to search for the region of optimum response
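Step 1 amounts to ordinary least-squares regression on the linear, quadratic, and crossproduct terms. The following Python fragment is an illustrative sketch of that fit (not PROC RSREG's actual code; the data are simulated):

```python
import numpy as np

# Simulated two-factor data (illustrative only): the true surface is
# y = 5 + 2*x1 - 3*x2 - 4*x1^2 - 2*x2^2 + 1*x1*x2 + noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 40)
x2 = rng.uniform(-1, 1, 40)
y = 5 + 2*x1 - 3*x2 - 4*x1**2 - 2*x2**2 + x1*x2 + rng.normal(0, 0.1, 40)

# Design matrix with intercept, linear, pure quadratic, and crossproduct terms.
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1*x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares parameter estimates
```

The six entries of `beta` estimate the six parameters of the quadratic model above.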

Model Fitting and Analysis of Variance

The first task in analyzing the response surface is to estimate the parameters of the model by least-squares regression and to obtain information about the fit in the form of an analysis of variance. The estimated surface is typically curved: a 'hill' whose peak occurs at the unique estimated point of maximum response, a 'valley,' or a 'saddle surface' with no unique minimum or maximum. Use the results of this phase of the analysis to answer the following questions:

  • What is the contribution of each type of effect (linear, quadratic, and crossproduct) to the statistical fit? The ANOVA table with sources labeled 'Regression' addresses this question.

  • What part of the residual error is due to lack of fit? Does the quadratic response model adequately represent the true response surface? If you specify the LACKFIT option in the MODEL statement, then the ANOVA table with sources labeled 'Residual' addresses this question.

  • What is the contribution of each factor variable to the statistical fit? Can the response be predicted as well if the variable is removed? The ANOVA table with sources labeled 'Factor' addresses this question.

  • What are the predicted responses for a grid of factor values? (See the section 'Plotting the Surface' on page 4048 and the 'Searching for Multiple Response Conditions' section on page 4048.)

Lack-of-Fit Test

The test for lack of fit compares the variation around the model with 'pure' variation within replicated observations. This measures the adequacy of the quadratic response surface model. In particular, if there are $n_i$ replicated observations $Y_{i1}, \ldots, Y_{in_i}$ of the response, all at the same values $x_i$ of the factors, then the true response at $x_i$ can be predicted either by using the predicted value $\hat{Y}_i$ based on the model or by using the mean $\bar{Y}_i$ of the replicated values. The test for lack of fit decomposes the residual error into a component due to the variation of the replications around their mean value (the 'pure' error) and a component due to the variation of the mean values around the model prediction (the 'bias' error):

\sum_i \sum_{j=1}^{n_i} \left( Y_{ij} - \hat{Y}_i \right)^2 = \sum_i \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{Y}_i \right)^2 + \sum_i n_i \left( \bar{Y}_i - \hat{Y}_i \right)^2

If the model is adequate, then both components estimate the nominal level of error; however, if the bias component of error is much larger than the pure error, then this constitutes evidence that there is significant lack of fit.
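The decomposition can be verified numerically. A minimal Python sketch (the responses, predictions, and replicate groups below are made up, not PROC RSREG output):

```python
import numpy as np

# Made-up responses, model predictions, and replicate-group labels.
y      = np.array([8.5, 3.0, 9.8, 4.1, 1.8, 5.4])
yhat   = np.array([6.0, 6.0, 6.0, 6.0, 2.0, 5.0])  # constant within a group
groups = np.array([0,   0,   0,   0,   1,   2  ])  # same factor values = same group

ss_pure, ss_lack = 0.0, 0.0
for g in np.unique(groups):
    idx = groups == g
    ybar = y[idx].mean()
    ss_pure += ((y[idx] - ybar) ** 2).sum()            # replications around their mean
    ss_lack += idx.sum() * (ybar - yhat[idx][0]) ** 2  # group mean around the model

ss_resid = ((y - yhat) ** 2).sum()  # equals ss_pure + ss_lack
```

The two components always sum exactly to the residual sum of squares, which is what makes the F comparison of bias error to pure error possible.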

If some observations in your design are replicated, you can test for lack of fit by specifying the LACKFIT option in the MODEL statement. Note that, since all other tests use total error rather than pure error, you may want to hand-calculate the tests with respect to pure error if the lack of fit is significant. On the other hand, significant lack of fit indicates that the quadratic model is inadequate, so if this is a problem you can also try to refine the model, possibly using PROC GLM for general polynomial modeling; refer to Chapter 32, 'The GLM Procedure,' for more information. Example 63.1 on page 4055 illustrates the use of the LACKFIT option.

Canonical Analysis

The second task in analyzing the response surface is to examine the overall shape of the curve and determine whether the estimated stationary point is a maximum, a minimum, or a saddle point. The canonical analysis can be used to answer the following questions:

  • Is the surface shaped like a hill, a valley, a saddle surface, or a flat surface?

  • If there is a unique optimum combination of factor values, where is it?

  • To which factor or factors are the predicted responses most sensitive?

The eigenvalues and eigenvectors in the matrix of second-order parameters characterize the shape of the response surface. The eigenvectors point in the directions of principal orientation for the surface, and the signs and magnitudes of the associated eigenvalues give the shape of the surface in these directions. Positive eigenvalues indicate directions of upward curvature, and negative eigenvalues indicate directions of downward curvature. The larger an eigenvalue is in absolute value, the more pronounced is the curvature of the response surface in the associated direction. Often, all of the coefficients of an eigenvector except for one are relatively small, indicating that the vector points roughly along the axis associated with the factor corresponding to the single large coefficient. In this case, the canonical analysis can be used to determine the relative sensitivity of the predicted response surface to variations in that factor. (See the 'Getting Started' section on page 4034 for an example.)
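For a two-factor fit, forming the matrix of second-order parameters and taking its eigendecomposition can be sketched as follows (the coefficient values are illustrative, not from a real fit):

```python
import numpy as np

# Coefficients of the pure quadratic and crossproduct terms (illustrative).
b11, b22, b12 = -4.0, -2.0, 1.0

# Symmetrized matrix: diagonal = pure quadratic coefficients,
# off-diagonal = half the crossproduct coefficient.
A = np.array([[b11,     b12 / 2],
              [b12 / 2, b22    ]])

eigvals, eigvecs = np.linalg.eigh(A)  # columns of eigvecs are the principal directions
# Both eigenvalues are negative here, so the surface curves downward in
# every direction: a hill with a unique maximum.
```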

Ridge Analysis

If the estimated surface is found to have a simple optimum well within the range of experimentation, the analysis performed by the preceding two steps may be sufficient. In more complicated situations, further search for the region of optimum response is required. The method of ridge analysis computes the estimated ridge of optimum response for increasing radii from the center of the original design. The ridge analysis answers the following question:

  • If there is not a unique optimum of the response surface within the range of experimentation, in which direction should further searching be done in order to locate the optimum?

You can use the RIDGE statement to compute the ridge of maximum or minimum response.

Coding the Factor Variables

For the results of the canonical and ridge analyses to be interpretable, the values of different factor variables should be comparable. This is because the canonical and ridge analyses of the response surface are not invariant with respect to differences in scale and location of the factor variables. The analysis of variance is not affected by these changes. Although the actual predicted surface does not change, its parameterization does. The usual solution to this problem is to code each factor variable so that its minimum in the experiment is -1 and its maximum is 1 and to carry through the analysis with the coded values instead of the original ones. This practice has the added benefit of making 1 a reasonable boundary radius for the ridge analysis since 1 represents approximately the edge of the experimental region. By default, PROC RSREG computes the linear transformation to perform this coding as the data are initially read in, and the canonical and ridge analyses are performed on the model fit to the coded data. The actual form of the coding operation for each value of a variable is

\text{coded value} = \frac{\text{original value} - M}{S}

where M is the average of the highest and lowest values for the variable in the design and S is half their difference.
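In Python, the same coding operation might look like this (a sketch of the transformation, not PROC RSREG's internal code):

```python
def code_factor(values):
    """Code a factor so its design minimum maps to -1 and its maximum to +1."""
    M = (max(values) + min(values)) / 2.0  # midrange of the design values
    S = (max(values) - min(values)) / 2.0  # half their difference
    return [(v - M) / S for v in values], M, S

coded, M, S = code_factor([10.0, 20.0, 30.0])
# coded is [-1.0, 0.0, 1.0] with M = 20.0 and S = 10.0
```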

Missing Values

If an observation has missing data for any of the variables used by the procedure, then that observation is not used in the estimation process. If one or more response variables are missing, but no factor or covariate variables are missing, then predicted values and confidence limits are computed for the output data set, but the residual and Cook's D statistic are missing.

Plotting the Surface

You can generate predicted values for a grid of points with the PREDICT option (see the 'Getting Started' section on page 4034 for an example) and then use these values to create a contour plot or a three-dimensional plot of the response surface over a two-dimensional grid. Any two factor variables can be chosen to form the grid for the plot. Several plots can be generated by using different pairs of factor variables.

Searching for Multiple Response Conditions

Suppose you want to find the factor setting that produces responses in a certain region. For example, you have the following data with two factors and three responses:

  data a;
     input x1 x2 y1 y2 y3;
     datalines;
  -1      -1        1.8  1.940   3.6398
  -1       1        2.6  1.843   4.9123
   1      -1        5.4  1.063   6.0128
   1       1        0.7  1.639   2.3629
   0       0        8.5  0.134   9.0910
   0       0        3.0  0.545   3.7349
   0       0        9.8  0.453  10.4412
   0       0        4.1  1.117   5.0042
   0       0        4.8  1.690   6.6245
   0       0        5.9  1.165   6.9420
   0       0        7.3  1.013   8.7442
   0       0        9.3  1.179  10.2762
  -1.4142  0        3.9  0.945   5.0245
   1.4142  0        1.7  0.333   2.4041
   0      -1.4142   3.0  1.869   5.2695
   0       1.4142   5.7  0.099   5.4346
  ;

You want to find the values of x1 and x2 that maximize y1 subject to y2 < 2 and y3 < y2 + y1. The exact answer is not easy to obtain analytically, but you can obtain a practically feasible solution by checking conditions across a grid of values in the range of interest. First, append a grid of factor values to the observed data, with missing values for the responses.

  data b;
     set a end=eof;
     output;
     if eof then do;
        y1=.;
        y2=.;
        y3=.;
        do x1=-2 to 2 by .1;
           do x2=-2 to 2 by .1;
              output;
           end;
        end;
     end;
  run;

Next, use PROC RSREG to fit a response surface model to the data and to compute predicted values for both the observed data and the grid, putting the predicted values in a data set c.

  proc rsreg data=b out=c;
     model y1 y2 y3=x1 x2 / predict;
  run;

Finally, find the subset of predicted values that satisfy the constraints, sort by the unconstrained variable, and display the top five predictions.

  data d;
     set c;
     if y2<2;
     if y3<y2+y1;
  run;

  proc sort data=d;
     by descending y1;
  run;

  data d;
     set d;
     i = _n_;
  run;

  proc print;
     where (i <= 5);
  run;

The final results are displayed in Figure 63.5. They indicate that optimal values of the factors are around 0.3 for x1 and around -0.5 for x2 .

start figure
  Obs     x1      x2     _TYPE_        y1         y2         y3      i

   1     0.3    -0.5    PREDICT    6.92570    0.75784    7.60471    1
   2     0.3    -0.6    PREDICT    6.91424    0.74174    7.54194    2
   3     0.3    -0.4    PREDICT    6.91003    0.77870    7.64341    3
   4     0.4    -0.6    PREDICT    6.90769    0.73357    7.51836    4
   5     0.4    -0.5    PREDICT    6.90540    0.75135    7.56883    5
end figure

Figure 63.5: Top Five Predictions
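The same grid-search strategy is easy to reproduce outside SAS. The following Python sketch uses three made-up quadratic functions standing in for the fitted surfaces (they are illustrative, not the surfaces PROC RSREG estimates from the data above):

```python
import numpy as np

# Stand-in quadratic surfaces (illustrative; peak of f1 placed near (0.3, -0.5)).
def f1(x1, x2): return 6.9 - 2.0 * (x1 - 0.3)**2 - 1.5 * (x2 + 0.5)**2
def f2(x1, x2): return 0.8 + 0.5 * x1**2 + 0.3 * x2**2
def f3(x1, x2): return f1(x1, x2) + 0.75

# Grid over the region of interest, as in the DATA step above.
g1, g2 = np.meshgrid(np.arange(-2, 2.001, 0.1), np.arange(-2, 2.001, 0.1))
p1, p2, p3 = f1(g1, g2), f2(g1, g2), f3(g1, g2)

# Keep grid points satisfying the constraints, then take the best objective.
feasible = (p2 < 2) & (p3 < p2 + p1)
best = np.argmax(np.where(feasible, p1, -np.inf))
x1_best, x2_best = g1.flat[best], g2.flat[best]
```

The winning grid point lands at the feasible point closest to the objective's peak, mirroring the PROC SORT / PROC PRINT step above.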

Handling Covariates

Covariate regressors are added to a response surface model because they are believed to account for a sizable yet relatively uninteresting portion of the variation in the data. What the experimenter is really interested in is the response corrected for the effect of the covariates. A common example is the block effect in a block design. In the canonical and ridge analyses of a response surface, which estimate responses at hypothetical levels of the factor variables, the actual value of the predicted response is computed using the average values of the covariates. The estimated response values do optimize the estimated surface of the response corrected for covariates, but true prediction of the response requires actual values for the covariates. You can use the COVAR= option in the MODEL statement to include covariates in the response surface model. Example 63.2 on page 4059 illustrates the use of this option.

Computational Method

Canonical Analysis

For each response variable, the model can be written in the form

y_i = x_i' A x_i + b' x_i + c' z_i + \epsilon_i

where

  • $y_i$ is the i-th observation of the response variable.

  • $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})'$ is the vector of the k factor variables for the i-th observation.

  • $z_i = (z_{i1}, z_{i2}, \ldots, z_{iL})'$ is the vector of the L covariates, including the intercept term.

  • A is the k × k symmetrized matrix of quadratic parameters, with diagonal elements equal to the coefficients of the pure quadratic terms in the model and off-diagonal elements equal to half the coefficient of the corresponding crossproduct.

  • b is the k × 1 vector of linear parameters.

  • c is the L × 1 vector of covariate parameters, one of which is the intercept.

  • $\epsilon_i$ is the error associated with the i-th observation. Tests performed by PROC RSREG assume that errors are independently and normally distributed with mean zero and variance $\sigma^2$.

The parameters in A , b , and c are estimated by least squares. To optimize y with respect to x , take partial derivatives, set them to zero, and solve:

\frac{\partial \hat{y}}{\partial x} = 2Ax + b = 0 \quad \Longrightarrow \quad x_0 = -\frac{1}{2} A^{-1} b

You can determine if the solution is a maximum or minimum by looking at the eigenvalues of A :

  If the eigenvalues ...     then the solution is ...

  are all negative           a maximum
  are all positive           a minimum
  have mixed signs           a saddle point
  contain zeros              in a flat area
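A numerical sketch of locating and classifying the stationary point (with an illustrative A and b, not output from a real fit):

```python
import numpy as np

A = np.array([[-4.0, 0.5],
              [ 0.5, -2.0]])  # illustrative second-order parameter matrix
b = np.array([2.0, -3.0])     # illustrative linear parameters

# Stationary point: solve 2*A*x + b = 0.
x0 = np.linalg.solve(-2.0 * A, b)

eigvals = np.linalg.eigvalsh(A)
if (eigvals < 0).all():
    kind = "maximum"
elif (eigvals > 0).all():
    kind = "minimum"
elif (eigvals == 0).any():
    kind = "flat area"
else:
    kind = "saddle point"
```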

Ridge Analysis

The eigenvector for the largest eigenvalue gives the direction of steepest ascent from the stationary point, if positive, or steepest descent, if negative. The eigenvectors corresponding to small or zero eigenvalues point in directions of relative flatness.

The point on the optimum response ridge at a given radius R from the ridge origin is found by optimizing

\hat{y}(x + d) = (x + d)' A (x + d) + b' (x + d)

over d satisfying $d'd = R^2$, where x is the k × 1 vector containing the ridge origin and A and b are as previously discussed. By the method of Lagrange multipliers, the optimal d has the form

d = -\left( A - \mu I \right)^{-1} \left( Ax + \tfrac{1}{2} b \right)

where I is the k × k identity matrix and µ is chosen so that $d'd = R^2$. There may be several values of µ that satisfy this constraint; the right one depends on which sort of response ridge is of interest. If you are searching for the ridge of maximum response, then the appropriate µ is the unique one that satisfies the constraint and is greater than all the eigenvalues of A. Similarly, the appropriate µ for the ridge of minimum response satisfies the constraint and is less than all the eigenvalues of A. (Refer to Myers and Montgomery (1995) for details.)
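The Lagrange-multiplier form lends itself to a simple numerical search for µ: for the maximum-response ridge, the norm of d(µ) decreases from infinity toward zero as µ increases past the largest eigenvalue of A, so bisection finds the µ matching a given radius. A Python sketch with illustrative A and b (not PROC RSREG's actual algorithm):

```python
import numpy as np

A = np.array([[-4.0, 0.5],
              [ 0.5, -2.0]])  # illustrative second-order parameters
b = np.array([2.0, -3.0])     # illustrative linear parameters
x = np.zeros(2)               # ridge origin (design center in coded units)
R = 1.0                       # target radius

g = A @ x + b / 2.0
mu_min = np.linalg.eigvalsh(A).max()  # mu must exceed every eigenvalue of A

def step(mu):
    """d(mu) = -(A - mu*I)^(-1) (A x + b/2)"""
    return -np.linalg.solve(A - mu * np.eye(2), g)

# ||d(mu)|| falls from +inf toward 0 as mu rises past mu_min: bisect on mu.
lo, hi = mu_min + 1e-9, mu_min + 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if np.linalg.norm(step(mid)) > R:
        lo = mid
    else:
        hi = mid
d = step(0.5 * (lo + hi))  # point on the maximum-response ridge at radius R
```

Repeating this for a sequence of radii traces out the ridge that the RIDGE statement reports.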

Output Data Sets

OUT=SAS-data-set

An output data set containing statistics requested with options in the MODEL statement for each observation in the input data set is created whenever the OUT= option is specified in the PROC RSREG statement. The data set contains the following variables.

  • the BY variables

  • the ID variables

  • the WEIGHT variable

  • the independent variables in the MODEL statement

  • the variable _TYPE_ , which identifies the observation type in the output data set. _TYPE_ is a character variable with a length of eight, and it takes on the values 'ACTUAL', 'PREDICT', 'RESIDUAL', 'U95M', 'L95M', 'U95', 'L95', and 'D', corresponding to the options specified.

  • the response variables containing special output values identified by the _TYPE_ variable

All confidence limits use the two-tailed Student's t value.

OUTR=SAS-data-set

An output data set containing the optimum response ridge is created when the OUTR= option is specified in the RIDGE statement. The data set contains the following variables:

  • the current values of the BY variables

  • a character variable _DEPVAR_ containing the name of the dependent variable

  • a character variable _TYPE_ identifying the type of ridge being computed, MINIMUM or MAXIMUM. If both MAXIMUM and MINIMUM are specified, the data set contains observations for the minimum ridge followed by observations for the maximum ridge.

  • a numeric variable _RADIUS_ giving the distance from the ridge starting point

  • the values of the model factors at the estimated optimum point at distance _RADIUS_ from the ridge starting point

  • a numeric variable _PRED_ , which is the estimated expected value of the dependent variable at the optimum

  • a numeric variable _STDERR_ , which is the standard error of the estimated expected value

Displayed Output

All estimates and hypothesis tests assume that the model is correctly specified and the errors are distributed according to classical statistical assumptions.

The output displayed by PROC RSREG includes the following.

Estimation and Analysis of Variance

  • The actual form of the coding operation for each value of a variable is

    click to expand
  • where M is the average of the highest and lowest values for the variable in the design and S is half their difference. The Subtracted off column contains the M values for this formula for each factor variable, and S is found in the Divided by column.

  • The summary table for the response variable contains the following information.

    • Response Mean is the mean of the response variable in the sample. When a WEIGHT statement is used, the weighted mean is calculated as $\bar{y} = \left( \sum_i w_i y_i \right) / \sum_i w_i$.

    • Root MSE estimates the standard deviation of the response variable and is calculated as the square root of the Total Error mean square.

    • The R-Square value is R 2 , or the coefficient of determination. R 2 measures the proportion of the variation in the response that is attributed to the model rather than to random error.

    • The Coefficient of Variation is 100 times the ratio of the Root MSE to the Response Mean.

  • A table analyzing the significance of the terms of the regression is displayed. Terms are brought into the regression in four steps: (1) the Intercept and any covariates in the model, (2) Linear terms like X1 and X2, (3) pure Quadratic terms like X1*X1 or X2*X2, and (4) Crossproduct terms like X1*X2.

    • The Degrees of Freedom should be the same as the number of corresponding parameters unless one or more of the parameters are not estimable.

    • Type I Sum of Squares, also called the sequential sums of squares, measure the reduction in the error sum of squares as sets of terms (Linear, Quadratic, and so forth) are added to the model.

    • R-Square measures the portion of total R 2 contributed as each set of terms (Linear, Quadratic, and so forth) is added to the model.

    • Each F Value tests the null hypothesis that all parameters in the term are zero using the Total Error mean square as the denominator. This item is a test of a Type I hypothesis, containing the usual F test numerator, conditional on the effects of subsequent variables not being in the model.

    • Pr > F is the significance value or probability of obtaining at least as great an F ratio given that the null hypothesis is true.

  • The Total Error Sum of Squares can be partitioned into Lack of Fit and Pure Error. When Lack of Fit is significant, there is variation around the model other than random error (such as cubic effects of the factor variables).

    • The Total Error Mean Square estimates $\sigma^2$, the variance.

    • F Value tests the null hypothesis that the variation is adequately described by random error.

  • A table containing the parameter estimates from the model is displayed.

    • The Parameter Estimate column contains the parameter estimates based on the uncoded values of the factor variables. If an effect is a linear combination of previous effects, the parameter for the effect is not estimable. When this happens, the degrees of freedom are zero, the parameter estimate is set to zero, and the estimates and tests on other parameters are conditional on this parameter being zero.

    • The Standard Error column contains the estimated standard deviations of the parameter estimates based on uncoded data.

    • The t Value column contains t values of a test of the null hypothesis that the true parameter is zero when the uncoded values of the factor variables are used.

    • Pr > |t| gives the significance value or probability of a greater absolute t ratio given that the true parameter is zero.

    • The Parameter Estimate from Coded Data column contains the parameter estimates based on the coded values of the factor variables. These are the estimates used in the subsequent canonical and ridge analyses.

  • The sums of squares are partitioned by the Factors in the model, and an analysis table is displayed. The test on a factor, say X1, is a joint test on all the parameters involving that factor. For example, the test for X1 tests the null hypothesis that the true parameters for X1, X1*X1, and X1*X2 are all zero.
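The fit statistics in the summary table follow directly from the error and total sums of squares; a Python sketch with made-up values:

```python
import numpy as np

y    = np.array([8.5, 3.0, 9.8, 4.1])  # made-up responses
yhat = np.array([7.0, 4.0, 9.0, 5.0])  # made-up model predictions
n, p = len(y), 2                       # observations and model parameters (illustrative)

sse = ((y - yhat) ** 2).sum()          # total error sum of squares
sst = ((y - y.mean()) ** 2).sum()      # corrected total sum of squares

r_square      = 1.0 - sse / sst                   # proportion of variation explained
root_mse      = np.sqrt(sse / (n - p))            # sqrt of total error mean square
response_mean = y.mean()
coeff_var     = 100.0 * root_mse / response_mean  # coefficient of variation, percent
```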

Canonical Analysis

  • The Critical Value column contains the values of the factor variables that correspond to the stationary point of the fitted response surface. The critical values can be at a minimum, maximum, or saddle point.

  • The Eigenvalues and Eigenvectors are from the matrix of quadratic parameter estimates based on the coded data. They characterize the shape of the response surface.

Ridge Analysis

  • Coded Radius is the distance from the coded version of the associated point to the coded version of the origin of the ridge. The origin is given by the point at radius zero.

  • Estimated Response is the estimated value of the response variable at the associated point. The Standard Error of this estimate is also given. This quantity is useful for assessing the relative credibility of the prediction at a given radius. Typically, this standard error increases rapidly as the ridge moves up to and beyond the design perimeter, reflecting the inherent difficulty of making predictions beyond the range of experimentation.

  • Uncoded Factor Values are the values of the uncoded factor variables that give the optimum response at this radius from the ridge origin.

ODS Table Names

PROC RSREG assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 63.1: ODS Tables Produced in PROC RSREG

  ODS Table Name       Description                                   Statement

  Coding               Coding coefficients for the                   default
                       independent variables
  ErrorANOVA           Error analysis of variance                    default
  FactorANOVA          Factor analysis of variance                   default
  FitStatistics        Overall statistics for fit                    default
  ModelANOVA           Model analysis of variance                    default
  ParameterEstimates   Estimated linear parameters                   default
  Ridge                Ridge analysis for optimum response           RIDGE
  Spectral             Spectral analysis                             default
  StationaryPoint      Stationary point of response surface          default




SAS/STAT 9.1 User's Guide (Vol. 6), 2004.