The following example, from Umetrics (1995), demonstrates different ways to examine a PLS model. The data come from the field of drug discovery. New drugs are developed from chemicals that are biologically active. Testing a compound for biological activity is an expensive procedure, so it is useful to be able to predict biological activity from cheaper chemical measurements. In fact, computational chemistry makes it possible to calculate certain chemical measurements without even making the compound. These measurements include size, lipophilicity, and polarity at various sites on the molecule. The following statements create a data set named penta, which contains these data.
data penta;
   input obsnam $ S1 L1 P1 S2 L2 P2 S3 L3 P3
                  S4 L4 P4 S5 L5 P5 log_RAI @@;
   n = _n_;
   datalines;
VESSK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 1.9607 1.6324 0.5746 1.9607 1.6324 0.5746 2.8369 1.4092 3.1398 0.00
VESAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 1.9607 1.6324 0.5746 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.28
VEASK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 1.9607 1.6324 0.5746 2.8369 1.4092 3.1398 0.20
VEAAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.51
VKAAK 2.6931 2.5271 1.2871 2.8369 1.4092 3.1398 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.11
VEWAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 2.73
VEAAP 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 1.2201 0.8829 2.2253 0.18
VEHAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 2.4064 1.7438 1.1057 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 1.53
VAAAK 2.6931 2.5271 1.2871 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.10
GEAAK 2.2261 5.3648 0.3049 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.52
LEAAK 4.1921 1.0285 0.9801 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.40
FEAAK 4.9217 1.2977 0.4473 3.0777 0.3891 0.0701 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.30
VEGGK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 2.2261 5.3648 0.3049 2.2261 5.3648 0.3049 2.8369 1.4092 3.1398 1.00
VEFAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 4.9217 1.2977 0.4473 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 1.57
VELAK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 4.1921 1.0285 0.9801 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.59
AAAAA 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.10
AAYAA 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 1.3944 2.3230 0.0139 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.46
AAWAA 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 0.75
VAWAA 2.6931 2.5271 1.2871 0.0744 1.7333 0.0902 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 1.43
VAWAK 2.6931 2.5271 1.2871 0.0744 1.7333 0.0902 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 1.45
VKWAA 2.6931 2.5271 1.2871 2.8369 1.4092 3.1398 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 1.71
VWAAK 2.6931 2.5271 1.2871 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 2.8369 1.4092 3.1398 0.04
VAAWK 2.6931 2.5271 1.2871 0.0744 1.7333 0.0902 0.0744 1.7333 0.0902 4.7548 3.6521 0.8524 2.8369 1.4092 3.1398 0.23
EKWAP 3.0777 0.3891 0.0701 2.8369 1.4092 3.1398 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 1.2201 0.8829 2.2253 1.30
VKWAP 2.6931 2.5271 1.2871 2.8369 1.4092 3.1398 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 1.2201 0.8829 2.2253 2.35
RKWAP 2.8827 2.5215 3.4435 2.8369 1.4092 3.1398 4.7548 3.6521 0.8524 0.0744 1.7333 0.0902 1.2201 0.8829 2.2253 1.98
VEWVK 2.6931 2.5271 1.2871 3.0777 0.3891 0.0701 4.7548 3.6521 0.8524 2.6931 2.5271 1.2871 2.8369 1.4092 3.1398 1.71
PGFSP 1.2201 0.8829 2.2253 2.2261 5.3648 0.3049 4.9217 1.2977 0.4473 1.9607 1.6324 0.5746 1.2201 0.8829 2.2253 0.90
FSPFR 4.9217 1.2977 0.4473 1.9607 1.6324 0.5746 1.2201 0.8829 2.2253 4.9217 1.2977 0.4473 2.8827 2.5215 3.4435 0.64
RYLPT 2.8827 2.5215 3.4435 1.3944 2.3230 0.0139 4.1921 1.0285 0.9801 1.2201 0.8829 2.2253 0.9243 2.0921 1.3996 0.40
GGGGG 2.2261 5.3648 0.3049 2.2261 5.3648 0.3049 2.2261 5.3648 0.3049 2.2261 5.3648 0.3049 2.2261 5.3648 0.3049 .
;
data ptrain; set penta; if (n <= 15);
data ptest;  set penta; if (n > 15);
run;
You would like to study the relationship between these measurements and the activity of the compound, represented by the logarithm of the relative Bradykinin activating activity (log_RAI). Notice that these data consist of many predictors relative to the number of observations. Partial least squares is especially appropriate in this situation as a useful tool for finding a few underlying predictive factors that account for most of the variation in the response. Typically, the model is fit for part of the data (the training or work set), and the quality of the fit is judged by how well it predicts the other part of the data (the test or prediction set). For this example, the first 15 observations serve as the training set and the rest constitute the test set (refer to Ufkes et al. 1978, 1982).
When you fit a PLS model, you hope to find a few PLS factors that explain most of the variation in both predictors and responses. Factors that explain response variation provide good predictive models for new responses, and factors that explain predictor variation are well represented by the observed values of the predictors. The following statements fit a PLS model with two factors and save predicted values, residuals, and other information for each data point in a data set named outpls.
proc pls data=ptrain nfac=2;
   model log_RAI = S1-S5 L1-L5 P1-P5;
   output out=outpls predicted  = yhat1
                     yresidual  = yres1
                     xresidual  = xres1-xres15
                     xscore     = xscr
                     yscore     = yscr;
run;
The PLS procedure displays a table, shown in Output 56.1.1, showing how much predictor and response variation is explained by each PLS factor.
                         The PLS Procedure

              Percent Variation Accounted for by
                Partial Least Squares Factors

   Number of
   Extracted       Model Effects        Dependent Variables
   Factors       Current      Total     Current      Total
       1         16.9014    16.9014     89.6399    89.6399
       2         12.7721    29.6735      7.8368    97.4767
Partial least squares algorithms choose successive orthogonal factors that maximize the covariance between each X-score and the corresponding Y-score. For a good PLS model, the first few factors show a high correlation between the X- and Y-scores. The correlation usually decreases from one factor to the next. You can plot the X-scores versus the Y-scores for the first PLS factor using the following SAS statements.
%let ifac = 1;
data pltanno; set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(n); x=xscr&ifac; y=yscr&ifac;
run;
axis1 label=(angle=270 rotate=90 "Y-score &ifac")
      major=(number=5) minor=none;
axis2 label=("X-score &ifac") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot yscr&ifac*xscr&ifac=1 / anno=pltanno vaxis=axis1
                                haxis=axis2 frame cframe=ligr;
run;
By changing the macro variable ifac to 2 instead of 1, you can use the same statements to plot the X-scores versus the Y-scores for the second PLS factor. The resulting plots are shown in Output 56.1.2 and Output 56.1.3. The numbers on the plot represent the observation number in the penta data set.
For this example, the figures show high correlation between X- and Y-scores for the first factor but somewhat lower correlation for the second factor.
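The sense in which each PLS factor maximizes the X-score/Y-score covariance can be sketched outside of SAS in a few lines of NumPy. This is a conceptual illustration only, not PROC PLS itself; the random data and variable names are made up.

```python
import numpy as np

# Sketch: with a single centered response y and centered predictors X,
# the first PLS X-weight vector w is proportional to X'y, which (by
# Cauchy-Schwarz) maximizes cov(Xv, y) over all unit vectors v.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
y = rng.normal(size=15)

Xc = X - X.mean(axis=0)        # center predictors, as PLS does
yc = y - y.mean()              # center the response

w = Xc.T @ yc                  # unnormalized first X-weight
w /= np.linalg.norm(w)         # scale to unit length
t = Xc @ w                     # X-score for the first factor

# No other unit direction yields a larger covariance with y.
for _ in range(100):
    v = rng.normal(size=5)
    v /= np.linalg.norm(v)
    assert yc @ (Xc @ v) <= yc @ t + 1e-9
```

The plotted X- and Y-scores are exactly these projections of the (centered and scaled) data onto the successive weight directions.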
You can also plot the X-scores against each other to look for irregularities in the data. You should look for patterns or clearly grouped observations. If you see a curved pattern, for example, you may want to add a quadratic term. Two or more groupings of observations indicate that it might be better to analyze the groups separately, perhaps by including classification effects in the model. The following SAS statements produce a plot of the first and second X-scores:
data pltanno; set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(n); x=xscr1; y=xscr2;
run;
axis1 label=(angle=270 rotate=90 "X-score 2")
      major=(number=5) minor=none;
axis2 label=("X-score 1") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot xscr2*xscr1=1 / anno=pltanno vaxis=axis1
                        haxis=axis2 frame cframe=ligr;
run;
The plot is shown in Output 56.1.4.
This plot appears to show most of the observations close together, with a few being more spread out with larger positive X-scores for factor 2. There are no clear grouping patterns, but observation 13 stands out; note that this observation is the most extreme on all three plots so far. This run may be overly influential in the PLS analysis; thus, you should check to make sure it is reliable.
Plots of the weights give the directions toward which each PLS factor projects. They show which predictors are most represented in each factor. Those predictors with small weights in absolute value are less important than those with large weights.
You can use the DETAILS option in the PROC PLS statement to display various model details, including the X-weights. You can then use the ODS statement to send the weights to an output data set, as follows:
ods output XWeights=xweights;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;
Once the X-weights are in a data set, you can use the following statements to plot the weights for the first two PLS factors against one another:
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out=xweights;
data xweights; set xweights;
   rename col1=w1 col2=w2;
data wt_anno; set xweights;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(_name_); x=w1; y=w2;
run;
axis1 label=(angle=270 rotate=90 "X-weight 2")
      major=(number=5) minor=none;
axis2 label=("X-weight 1") minor=none;
symbol1 v=none i=none;
proc gplot data=xweights;
   plot w2*w1=1 / anno=wt_anno vaxis=axis1 haxis=axis2
                  frame cframe=ligr;
run; quit;
The plot of the X-weights is shown in Output 56.1.5.
The weights plot shows a cluster of X-variables that are weighted at nearly zero for both factors. These variables add little to the model fit, and removing them may improve the model's predictive capability.
To explore further which predictors can be eliminated from the analysis, you can look at the regression coefficients for the standardized data. Predictors with small coefficients (in absolute value) make a small contribution to the response prediction. Another statistic summarizing the contribution a variable makes to the model is the Variable Importance for Projection (VIP) of Wold (1994). Whereas the regression coefficients represent the importance each predictor has in the prediction of just the response, the VIP represents the value of each predictor in fitting the PLS model for both predictors and response. If a predictor has a relatively small coefficient (in absolute value) and a small value of VIP, then it is a prime candidate for deletion. Wold in Umetrics (1995) considers a value less than 0.8 to be small for the VIP. The following statements produce coefficients and the VIP.
/*
/ Put coefficients, weights, and R**2's into data sets.
/-------------------------------------------------------*/
ods listing close;
ods output PercentVariation = pctvar
           XWeights         = xweights
           CenScaleParms    = solution;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1 L1 P1
                   S2 L2 P2
                   S3 L3 P3
                   S4 L4 P4
                   S5 L5 P5 / solution;
run;
ods listing;

/*
/ Just reformat the coefficients.
/-------------------------------------------------------*/
data solution; set solution;
   format log_RAI 8.5;
   if (RowName = 'Intercept') then delete;
   rename RowName = Predictor log_RAI = B;
run;

/*
/ Transpose weights and R**2's.
/-------------------------------------------------------*/
data xweights; set xweights; _name_='W'||trim(left(_n_));
data pctvar  ; set pctvar  ; _name_='R'||trim(left(_n_));
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out=xweights;
proc transpose data=pctvar(keep=_name_ CurrentYVariation)
               out=pctvar;
run;

/*
/ Sum the normalized squared weights times the
/ normalized R**2's.  The VIP is defined as the square
/ root of this weighted average times the number of
/ predictors.
/-------------------------------------------------------*/
proc sql;
   create table vip as
      select *,
             w1/sqrt(uss(w1)) as wnorm1,
             w2/sqrt(uss(w2)) as wnorm2
         from xweights
            left join pctvar(drop=_name_) on 1;
data vip; set vip; keep _name_ vip;
   array wnorm{2};
   array r{2};
   VIP = 0;
   do i = 1 to 2;
      VIP = VIP + r{i}*(wnorm{i}**2)/sum(of r1-r2);
   end;
   VIP = sqrt(VIP * 15);
data vipbpls; merge solution vip(drop=_name_);
proc print data=vipbpls;
run;
The output appears in Output 56.1.6.
   Obs    Predictor          B        VIP
     1       S1         0.13831    0.61108
     2       L1         0.05720    0.31822
     3       P1         0.19064    0.75127
     4       S2         0.12383    0.50482
     5       L2         0.05909    0.27123
     6       P2         0.09361    0.35927
     7       S3         0.28415    1.57775
     8       L3         0.47131    2.43480
     9       P3         0.26613    1.13222
    10       S4         0.09145    1.22255
    11       L4         0.12265    1.17994
    12       P4         0.04878    0.88380
    13       S5         0.03320    0.21288
    14       L5         0.03320    0.21288
    15       P5         0.03320    0.21288
For this data set, the variables L1 , L2 , P2 , S5 , L5 , and P5 have small absolute coefficients and small VIP. Looking back at the weights plot in Output 56.1.5, you can see that these variables tend to be the ones near zero for both PLS factors. You should consider dropping these variables from the model.
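As a conceptual cross-check of the SQL and DATA step computation above, the VIP formula can be sketched in NumPy. The weight and R-square values below are made up for illustration; only the formula matches the SAS code.

```python
import numpy as np

# Sketch of the VIP of Wold (1994):
#   VIP_j = sqrt( p * sum_a R_a * (w_aj/||w_a||)**2 / sum_a R_a )
# where p is the number of predictors, w_a the X-weight vector of
# factor a, and R_a the response variation explained by factor a.
def vip(weights, r):
    """weights: (nfac, p) X-weights; r: (nfac,) variation explained."""
    wnorm = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    p = weights.shape[1]
    return np.sqrt(p * (r @ wnorm**2) / r.sum())

w = np.array([[0.6, 0.1, -0.3, 0.2],     # illustrative X-weights
              [0.1, 0.7, 0.2, -0.1]])
r = np.array([89.64, 7.84])              # percent Y-variation per factor

v = vip(w, r)
print(v)
```

A useful sanity check on any VIP computation: the squared VIPs always average to 1 across predictors, which is why a value below about 0.8 marks a below-average contribution.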
This example is a continuation of Example 56.1 on page 3388.
A PLS model effectively models both the predictors and the responses. In order to check for outliers, you should, therefore, look at the Euclidean distance from each point to the PLS model in both the standardized predictors and the standardized responses. No point should be dramatically farther from the model than the rest. If there is a group of points that are all farther from the model than the rest, they may have something in common, in which case they should be analyzed separately. The following statements compute and plot these distances to the reduced model, dropping variables L1 , L2 , P2 , P4 , S5 , L5 , and P5 :
proc pls data=ptrain nfac=2 noprint;
   model log_RAI = S1 P1 S2 S3 L3 P3 S4 L4;
   output out=stdres stdxsse=stdxsse
                     stdysse=stdysse;
data stdres; set stdres;
   xdist = sqrt(stdxsse);
   ydist = sqrt(stdysse);
run;
symbol1 i=needles v=dot c=blue;
proc gplot data=stdres;
   plot xdist*n=1 / cframe=ligr;
proc gplot data=stdres;
   plot ydist*n=1 / cframe=ligr;
run;
The plots are shown in Output 56.2.1 and Output 56.2.2.
There appear to be no profound outliers in either the predictor space or the response space.
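The distance statistics plotted above are simple functions of the STDXSSE and STDYSSE quantities from the OUTPUT statement. A NumPy sketch with made-up residual values shows the arithmetic:

```python
import numpy as np

# Sketch: STDXSSE is the per-observation sum of squared residuals in
# the standardized predictors, and the plotted distance is its square
# root (likewise for STDYSSE and ydist).  Residuals here are invented.
xres = np.array([[ 0.2, -0.1,  0.3],    # standardized X-residuals,
                 [-0.4,  0.2,  0.1]])   # one row per observation

stdxsse = np.sum(xres**2, axis=1)       # per-observation SSE
xdist = np.sqrt(stdxsse)                # Euclidean distance to the model
print(xdist)
```

Observations whose bars in these needle plots tower over the rest lie unusually far from the fitted PLS model in predictor or response space.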
The following example demonstrates issues in spectrometric calibration. The data (Umetrics 1995) consist of spectrographic readings on 33 samples containing known concentrations of two amino acids, tyrosine and tryptophan. The spectra are measured at 30 frequencies across the overall range of frequencies. For example, Output 56.3.1 shows the observed spectra for three samples, one with only tryptophan, one with only tyrosine, and one with a mixture of the two, all at a total concentration of 10^-6.
Of the 33 samples, 18 are used as a training set and 15 as a test set. The data originally appear in McAvoy et al. (1989).
These data were created in a lab, with the concentrations fixed in order to provide a wide range of applicability for the model. You want to use a linear function of the logarithms of the spectra to predict the logarithms of tyrosine and tryptophan concentration, as well as the logarithm of the total concentration. Actually, because of the possibility of zeros in both the responses and the predictors, slightly different transformations are used. The following statements create SAS data sets containing the training and test data, named ftrain and ftest, respectively:
data ftrain;
   input obsnam $ tot tyr f1-f30 @@;
   try = tot - tyr;
   if (tyr) then tyr_log = log10(tyr); else tyr_log = -8;
   if (try) then try_log = log10(try); else try_log = -8;
   tot_log = log10(tot);
   datalines;
17mix35 0.00003 0 -6.215 -5.809 -5.114 -3.963 -2.897 -2.269 -1.675 -1.235 -0.900 -0.659 -0.497 -0.395 -0.335 -0.315 -0.333 -0.377 -0.453 -0.549 -0.658 -0.797 -0.878 -0.954 -1.060 -1.266 -1.520 -1.804 -2.044 -2.269 -2.496 -2.714
19mix35 0.00003 3E-7 -5.516 -5.294 -4.823 -3.858 -2.827 -2.249 -1.683 -1.218 -0.907 -0.658 -0.501 -0.400 -0.345 -0.323 -0.342 -0.387 -0.461 -0.554 -0.665 -0.803 -0.887 -0.960 -1.072 -1.272 -1.541 -1.814 -2.058 -2.289 -2.496 -2.712
21mix35 0.00003 7.5E-7 -5.519 -5.294 -4.501 -3.863 -2.827 -2.280 -1.716 -1.262 -0.939 -0.694 -0.536 -0.444 -0.384 -0.369 -0.377 -0.421 -0.495 -0.596 -0.706 -0.824 -0.917 -0.988 -1.103 -1.294 -1.565 -1.841 -2.084 -2.320 -2.521 -2.729
23mix35 0.00003 1.5E-6 -5.294 -4.705 -4.262 -3.605 -2.726 -2.239 -1.681 -1.250 -0.925 -0.697 -0.534 -0.437 -0.381 -0.359 -0.369 -0.426 -0.499 -0.591 -0.701 -0.843 -0.925 -0.989 -1.109 -1.310 -1.579 -1.852 -2.090 -2.316 -2.521 -2.743
25mix35 0.00003 3E-6 -4.600 -4.069 -3.764 -3.262 -2.598 -2.191 -1.680 -1.273 -0.958 -0.729 -0.573 -0.470 -0.422 -0.407 -0.422 -0.468 -0.538 -0.639 -0.753 -0.887 -0.968 -1.037 -1.147 -1.357 -1.619 -1.886 -2.141 -2.359 -2.585 -2.792
27mix35 0.00003 7.5E-6 -3.812 -3.376 -3.026 -2.726 -2.249 -1.919 -1.541 -1.198 -0.951 -0.764 -0.639 -0.570 -0.528 -0.525 -0.550 -0.606 -0.689 -0.781 -0.909 -1.031 -1.126 -1.191 -1.303 -1.503 -1.784 -2.058 -2.297 -2.507 -2.727 -2.970
29mix35 0.00003 0.000015 -3.053 -2.641 -2.382 -2.194 -1.977 -1.913 -1.728 -1.516 -1.317 -1.158 -1.029 -0.963 -0.919 -0.915 -0.933 -0.981 -1.055 -1.157 -1.271 -1.409 -1.505 -1.546 -1.675 -1.880 -2.140 -2.415 -2.655 -2.879 -3.075 -3.319
28mix35 0.00003 0.0000225 -2.626 -2.248 -2.004 -1.839 -1.742 -1.791 -1.786 -1.772 -1.728 -1.666 -1.619 -1.591 -1.575 -1.580 -1.619 -1.671 -1.754 -1.857 -1.982 -2.114 -2.210 -2.258 -2.379 -2.570 -2.858 -3.117 -3.347 -3.568 -3.764 -4.012
26mix35 0.00003 0.000027 -2.370 -1.990 -1.754 -1.624 -1.560 -1.655 -1.772 -1.899 -1.982 -2.074 -2.157 -2.211 -2.267 -2.317 -2.369 -2.460 -2.545 -2.668 -2.807 -2.951 -3.030 -3.075 -3.214 -3.376 -3.685 -3.907 -4.129 -4.335 -4.501 -4.599
24mix35 0.00003 0.0000285 -2.326 -1.952 -1.702 -1.583 -1.507 -1.629 -1.771 -1.945 -2.115 -2.297 -2.448 -2.585 -2.696 -2.808 -2.913 -3.030 -3.163 -3.265 -3.376 -3.534 -3.642 -3.721 -3.858 -4.012 -4.262 -4.501 -4.704 -4.822 -4.956 -5.292
22mix35 0.00003 0.00002925 -2.277 -1.912 -1.677 -1.556 -1.487 -1.630 -1.791 -1.969 -2.203 -2.437 -2.655 -2.844 -3.032 -3.214 -3.378 -3.503 -3.646 -3.812 -3.958 -4.129 -4.193 -4.262 -4.415 -4.501 -4.823 -5.111 -5.113 -5.294 -5.290 -5.294
20mix35 0.00003 0.0000297 -2.266 -1.912 -1.688 -1.546 -1.500 -1.640 -1.801 -2.011 -2.277 -2.545 -2.823 -3.094 -3.376 -3.572 -3.812 -4.012 -4.262 -4.415 -4.501 -4.705 -4.823 -4.823 -4.956 -5.111 -5.111 -5.516 -5.524 -5.806 -5.806 -5.806
18mix35 0.00003 0.00003 -2.258 -1.900 -1.666 -1.524 -1.479 -1.621 -1.803 -2.043 -2.308 -2.626 -2.895 -3.214 -3.568 -3.907 -4.193 -4.423 -4.825 -5.111 -5.111 -5.516 -5.516 -5.516 -5.516 -5.806 -5.806 -5.806 -5.806 -5.806 -6.210 -6.215
trp2 0.0001 0 -5.922 -5.435 -4.366 -3.149 -2.124 -1.392 -0.780 -0.336 -0.002 0.233 0.391 0.490 0.540 0.563 0.541 0.488 0.414 0.313 0.203 0.063 -0.028 -0.097 -0.215 -0.411 -0.678 -0.953 -1.208 -1.418 -1.651 -1.855
mix5 0.0001 0.00001 -3.932 -3.411 -2.964 -2.462 -1.836 -1.308 -0.796 -0.390 -0.076 0.147 0.294 0.394 0.446 0.460 0.443 0.389 0.314 0.220 0.099 -0.033 -0.128 -0.197 -0.308 -0.506 -0.785 -1.050 -1.313 -1.529 -1.745 -1.970
mix4 0.0001 0.000025 -2.996 -2.479 -2.099 -1.803 -1.459 -1.126 -0.761 -0.424 -0.144 0.060 0.195 0.288 0.337 0.354 0.330 0.274 0.206 0.105 -0.009 -0.148 -0.242 -0.306 -0.424 -0.626 -0.892 -1.172 -1.425 -1.633 -1.877 -2.071
mix3 0.0001 0.00005 -2.128 -1.661 -1.344 -1.160 -0.996 -0.877 -0.696 -0.495 -0.313 -0.165 -0.042 0.032 0.069 0.079 0.050 -0.006 -0.082 -0.179 -0.295 -0.436 -0.523 -0.584 -0.706 -0.898 -1.178 -1.446 -1.696 -1.922 -2.128 -2.350
mix6 0.0001 0.00009 -1.140 -0.757 -0.497 -0.362 -0.329 -0.412 -0.513 -0.647 -0.772 -0.877 -0.958 -1.040 -1.104 -1.162 -1.233 -1.317 -1.425 -1.543 -1.661 -1.804 -1.877 -1.959 -2.034 -2.249 -2.502 -2.732 -2.964 -3.142 -3.313 -3.576
;
data ftest;
   input obsnam $ tot tyr f1-f30 @@;
   try = tot - tyr;
   if (tyr) then tyr_log = log10(tyr); else tyr_log = -8;
   if (try) then try_log = log10(try); else try_log = -8;
   tot_log = log10(tot);
   datalines;
43trp6 1E-6 0 -5.915 -5.918 -6.908 -5.428 -4.117 -5.103 -4.660 -4.351 -4.023 -3.849 -3.634 -3.634 -3.572 -3.513 -3.634 -3.572 -3.772 -3.772 -3.844 -3.932 -4.017 -4.023 -4.117 -4.227 -4.492 -4.660 -4.855 -5.428 -5.103 -5.428
59mix6 1E-6 1E-7 -5.903 -5.903 -5.903 -5.082 -4.213 -5.083 -4.838 -4.639 -4.474 -4.213 -4.001 -4.098 -4.001 -4.001 -3.907 -4.001 -4.098 -4.098 -4.206 -4.098 -4.213 -4.213 -4.335 -4.474 -4.639 -4.838 -4.837 -5.085 -5.410 -5.410
51mix6 1E-6 2.5E-7 -5.907 -5.907 -5.415 -4.843 -4.213 -4.843 -4.843 -4.483 -4.343 -4.006 -4.006 -3.912 -3.830 -3.830 -3.755 -3.912 -4.006 -4.001 -4.213 -4.213 -4.335 -4.483 -4.483 -4.642 -4.841 -5.088 -5.088 -5.415 -5.415 -5.415
49mix6 1E-6 5E-7 -5.419 -5.091 -5.091 -4.648 -4.006 -4.846 -4.648 -4.483 -4.343 -4.220 -4.220 -4.220 -4.110 -4.110 -4.110 -4.220 -4.220 -4.343 -4.483 -4.483 -4.650 -4.650 -4.846 -4.846 -5.093 -5.091 -5.419 -5.417 -5.417 -5.907
53mix6 1E-6 7.5E-7 -5.083 -4.837 -4.837 -4.474 -3.826 -4.474 -4.639 -4.838 -4.837 -4.639 -4.639 -4.641 -4.641 -4.639 -4.639 -4.837 -4.838 -4.838 -5.083 -5.082 -5.083 -5.410 -5.410 -5.408 -5.408 -5.900 -5.410 -5.903 -5.900 -6.908
57mix6 1E-6 9E-7 -5.082 -4.836 -4.639 -4.474 -3.826 -4.636 -4.638 -4.638 -4.837 -5.082 -5.082 -5.408 -5.082 -5.080 -5.408 -5.408 -5.408 -5.408 -5.408 -5.408 -5.408 -5.900 -5.900 -5.900 -5.900 -5.900 -5.900 -5.900 -6.908 -6.908
41tyro6 1E-6 1E-6 -5.104 -4.662 -4.662 -4.358 -3.705 -4.501 -4.662 -4.859 -5.104 -5.431 -5.433 -5.918 -5.918 -5.918 -5.431 -5.918 -5.918 -5.918 -5.918 -5.918 -5.918 -5.918 -5.918 -6.908 -5.918 -5.918 -6.908 -6.908 -5.918 -5.918
28trp5 0.00001 0 -5.937 -5.937 -5.937 -4.526 -3.544 -3.170 -2.573 -2.115 -1.792 -1.564 -1.400 -1.304 -1.244 -1.213 -1.240 -1.292 -1.373 -1.453 -1.571 -1.697 -1.801 -1.873 -2.008 -2.198 -2.469 -2.706 -2.990 -3.209 -3.384 -3.601
37mix5 0.00001 1E-6 -5.109 -4.865 -4.501 -4.029 -3.319 -3.070 -2.569 -2.207 -1.895 -1.684 -1.516 -1.423 -1.367 -1.348 -1.374 -1.415 -1.503 -1.596 -1.718 -1.839 -1.927 -1.997 -2.118 -2.333 -2.567 -2.874 -3.106 -3.313 -3.579 -3.781
33mix5 0.00001 2.5E-6 -4.366 -4.129 -3.781 -3.467 -3.037 -2.939 -2.593 -2.268 -1.988 -1.791 -1.649 -1.565 -1.520 -1.509 -1.524 -1.580 -1.665 -1.758 -1.882 -2.037 -2.090 -2.162 -2.284 -2.465 -2.761 -3.037 -3.270 -3.520 -3.709 -3.937
31mix5 0.00001 5E-6 -3.790 -3.373 -3.119 -2.915 -2.671 -2.718 -2.555 -2.398 -2.229 -2.085 -1.971 -1.902 -1.860 -1.837 -1.881 -1.949 -2.009 -2.127 -2.230 -2.381 -2.455 -2.513 -2.624 -2.827 -3.117 -3.373 -3.586 -3.785 -4.040 -4.366
35mix5 0.00001 7.5E-6 -3.321 -2.970 -2.765 -2.594 -2.446 -2.548 -2.616 -2.617 -2.572 -2.550 -2.508 -2.487 -2.488 -2.487 -2.529 -2.593 -2.688 -2.792 -2.908 -3.037 -3.149 -3.189 -3.273 -3.467 -3.781 -4.029 -4.241 -4.501 -4.669 -4.865
39mix5 0.00001 9E-6 -3.142 -2.812 -2.564 -2.404 -2.281 -2.502 -2.589 -2.706 -2.842 -2.964 -3.068 -3.103 -3.182 -3.268 -3.361 -3.411 -3.517 -3.576 -3.705 -3.849 -3.932 -3.932 -4.029 -4.234 -4.501 -4.664 -4.860 -5.104 -5.431 -5.433
26tyro5 0.00001 0.00001 -3.037 -2.696 -2.464 -2.321 -2.239 -2.444 -2.602 -2.823 -3.144 -3.396 -3.742 -4.063 -4.398 -4.699 -4.893 -5.138 -5.140 -5.461 -5.463 -5.945 -5.461 -5.138 -5.140 -5.138 -5.138 -5.463 -5.461 -5.461 -5.461 -5.461
tyro2 0.0001 0.0001 -1.081 -0.710 -0.470 -0.337 -0.327 -0.433 -0.602 -0.841 -1.119 -1.423 -1.750 -2.121 -2.449 -2.818 -3.110 -3.467 -3.781 -4.029 -4.241 -4.366 -4.501 -4.366 -4.501 -4.501 -4.668 -4.668 -4.865 -4.865 -5.109 -5.111
;
The following statements fit a PLS model with 10 factors.
proc pls data=ftrain nfac=10;
   model tot_log tyr_log try_log = f1-f30;
run;
The table shown in Output 56.3.2 indicates that only three or four factors are required to explain almost all of the variation in both the predictors and the responses.
                         The PLS Procedure

              Percent Variation Accounted for by
                Partial Least Squares Factors

   Number of
   Extracted       Model Effects        Dependent Variables
   Factors       Current      Total     Current      Total
       1         81.1654    81.1654     48.3385    48.3385
       2         16.8113    97.9768     32.5465    80.8851
       3          1.7639    99.7407     11.4438    92.3289
       4          0.1951    99.9357      3.8363    96.1652
       5          0.0276    99.9634      1.6880    97.8532
       6          0.0132    99.9765      0.7247    98.5779
       7          0.0052    99.9817      0.2926    98.8705
       8          0.0053    99.9870      0.1252    98.9956
       9          0.0049    99.9918      0.1067    99.1023
      10          0.0034    99.9952      0.1684    99.2707
In order to choose the optimal number of PLS factors, you can explore how well models based on the training data with different numbers of factors fit the test data. To do so, use the CV=TESTSET option, with an argument pointing to the test data set ftest, as in the following statements:
proc pls data=ftrain nfac=10 cv=testset(ftest)
         cvtest(stat=press seed=12345);
   model tot_log tyr_log try_log = f1-f30;
run;
The results of the test set validation are shown in Output 56.3.3. They indicate that, although five PLS factors give the minimum predicted residual sum of squares, the residuals for four factors are insignificantly different from those for five. Thus, the smaller model is preferred.
                         The PLS Procedure

      Test Set Validation for the Number of Extracted Factors

                 Number of        Root
                 Extracted        Mean        Prob >
                 Factors          PRESS        PRESS
                     0          3.056797      <.0001
                     1          2.630561      <.0001
                     2          1.00706       0.0070
                     3          0.664603      0.0020
                     4          0.521578      0.3800
                     5          0.500034      1.0000
                     6          0.513561      0.5100
                     7          0.501431      0.6870
                     8          1.055791      0.1530
                     9          1.435085      0.1010
                    10          1.720389      0.0320

           Minimum root mean PRESS                   0.5000
           Minimizing number of factors                   5
           Smallest number of factors with p > 0.1        4

                         The PLS Procedure

              Percent Variation Accounted for by
                Partial Least Squares Factors

   Number of
   Extracted       Model Effects        Dependent Variables
   Factors       Current      Total     Current      Total
       1         81.1654    81.1654     48.3385    48.3385
       2         16.8113    97.9768     32.5465    80.8851
       3          1.7639    99.7407     11.4438    92.3289
       4          0.1951    99.9357      3.8363    96.1652
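The PRESS computation behind this table can be sketched in NumPy, assuming root mean PRESS is the square root of the average squared test-set prediction error over all held-out observations and responses. The small arrays below are made-up stand-ins, not the ftest data.

```python
import numpy as np

# Sketch of test-set PRESS: held-out responses are predicted from the
# training-set model, squared errors are accumulated, and the root
# mean is reported.  Both arrays here are invented for illustration.
y_test = np.array([[ 0.1, -0.2],     # held-out responses
                   [ 0.5,  0.3],     # (rows = observations,
                   [-0.4,  0.2]])    #  columns = response variables)
y_pred = np.array([[ 0.0, -0.1],     # predictions from the
                   [ 0.4,  0.5],     # training-set model
                   [-0.3,  0.1]])

press = np.sum((y_test - y_pred)**2)            # predicted residual SS
root_mean_press = np.sqrt(press / y_test.size)  # averaged, then rooted
print(root_mean_press)
```

The CVTEST option then compares each model's residuals to those of the PRESS-minimizing model, which is how the table can prefer four factors even though five factors minimize root mean PRESS.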
The factor loadings show how the PLS factors are constructed from the centered and scaled predictors. For spectral calibration, it is useful to plot the loadings against the frequency. In many cases, the physical meanings that can be attached to factor loadings help to validate the scientific interpretation of the PLS model. You can use the following statements to plot the loadings for the four PLS factors against frequency.
ods listing close;
ods output XLoadings=xloadings;
proc pls data=ftrain nfac=4 details method=pls;
   model tot_log tyr_log try_log = f1-f30;
run;
ods listing;

proc transpose data=xloadings(drop=NumberOfFactors)
               out=xloadings;
data xloadings; set xloadings;
   n = _n_;
   rename col1=Factor1 col2=Factor2
          col3=Factor3 col4=Factor4;
run;

goptions border;
axis1 label=("Loading") major=(number=5) minor=none;
axis2 label=("Frequency") minor=none;
symbol1 v=none i=join c=red    l=1;
symbol2 v=none i=join c=green  l=1 /*l= 3*/;
symbol3 v=none i=join c=blue   l=1 /*l=34*/;
symbol4 v=none i=join c=yellow l=1 /*l=46*/;
legend1 label=none cborder=black;
proc gplot data=xloadings;
   plot (Factor1 Factor2 Factor3 Factor4)*n
        / overlay legend=legend1 vaxis=axis1 haxis=axis2
          vref=0 lvref=2 frame cframe=ligr;
run; quit;
The resulting plot is shown in Output 56.3.4.
Notice that all four factors handle frequencies below and above about 7 or 8 differently. For example, the first factor is very nearly a simple contrast between the averages of the two sets of frequencies, and the second factor appears to be a weighted sum of only the frequencies in the first set.