Details | SAS/STAT 9.1 User's Guide (Vol. 5)

Missing Values

PROC PRINQUAL can estimate missing values, subject to optional constraints, so that the covariance matrix is optimized. The procedure provides several approaches for handling missing data. When you specify the NOMISS option in the PROC PRINQUAL statement, observations with missing values are excluded from the analysis. Otherwise, missing data are estimated, using variable means as initial estimates. Missing values for OPSCORE character variables are treated the same as any other category during the initialization. See the section Missing Values on page 4599 in Chapter 75, The TRANSREG Procedure, for more information on missing data estimation.
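As a minimal sketch of the two approaches described above, the following steps contrast the default behavior with the NOMISS option. The data set name SURVEY and the variables Q1-Q10 are hypothetical and stand in for your own data.

```sas
/* Hypothetical input: WORK.SURVEY with numeric variables Q1-Q10 */

/* Default: missing values are estimated during the iterations, */
/* starting from the variable means                             */
proc prinqual data=survey out=estimated;
   transform monotone(Q1-Q10);
run;

/* NOMISS: observations with missing values are excluded */
/* from the analysis                                     */
proc prinqual data=survey nomiss out=excluded;
   transform monotone(Q1-Q10);
run;
```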

Controlling the Number of Iterations

Several options in the PROC PRINQUAL statement control the number of iterations performed. Iteration terminates when any one of the following conditions is satisfied:

The number of iterations equals the value of the MAXITER= option.
The average absolute change in variable scores from one iteration to the next is less than the value of the CONVERGE= option.
The criterion change is less than the value of the CCONVERGE= option.

With the MTV method, the change in the proportion of variance criterion can become negative when the data have converged so that it is numerically impossible, within machine precision, to increase the criterion. Because the MTV algorithm is convergent, a negative criterion change is the result of very small amounts of rounding error. The MGV method displays the average squared multiple correlation (which is not the criterion being optimized), so the criterion change can become negative well before convergence. The MAC method criterion (average correlation) is never computed, so the CCONVERGE= option is ignored for METHOD=MAC. You can specify a negative value for either convergence option if you want to define convergence only in terms of the other convergence option.

With the MGV method, iterations minimize the generalized variance (determinant), but the generalized variance is not reported for two reasons. First, in most data sets, the generalized variance is almost always near zero (or will be after one or two iterations), which is its minimum. This does not mean that iteration is complete; it simply means that at least one multiple correlation is at or near one. The algorithm continues minimizing the determinant in (m − 1), (m − 2) dimensions, and so on. Because the generalized variance is almost always near zero, it does not provide a good indication of how the iterations are progressing. The mean R² provides a better indication of convergence. The second reason for not reporting the generalized variance is that almost no additional time is required to compute R² values for each step. This is because the error sum of squares is a by-product of the algorithm at each step. Computing the determinant at the end of each iteration adds more computations to an already computationally intensive algorithm.

You can increase the number of iterations to ensure convergence by increasing the value of the MAXITER= option and decreasing the value of the CONVERGE= option. Because the average absolute change in standardized variable scores seldom decreases below 1E-11, you typically do not specify a value for the CONVERGE= option less than 1E-8 or 1E-10. Most of the data changes occur during the first few iterations, but the data can still change after 50 or even 100 iterations. You can try different combinations of values for the CONVERGE= and MAXITER= options to ensure convergence without extreme overiteration. If the data do not converge with the default specifications, specify the REITERATE option, or try CONVERGE=1E-8 and MAXITER=50, or CONVERGE=1E-10 and MAXITER=200.
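The suggested option combinations might be written as follows. This is only a sketch: the data set SURVEY and the variables Q1-Q10 are hypothetical, and the last step shows one way to make the CONVERGE= option the only convergence test by giving the CCONVERGE= option a negative value.

```sas
/* Moderate settings suggested in the text */
proc prinqual data=survey out=b maxiter=50 converge=1e-8;
   transform spline(Q1-Q10 / nknots=3);
run;

/* Stricter settings if the data still have not converged */
proc prinqual data=survey out=b maxiter=200 converge=1e-10;
   transform spline(Q1-Q10 / nknots=3);
run;

/* A negative CCONVERGE= value disables the criterion-change test, */
/* so only CONVERGE= and MAXITER= can stop the iterations          */
proc prinqual data=survey out=b maxiter=200 converge=1e-8 cconverge=-1;
   transform spline(Q1-Q10 / nknots=3);
run;
```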

Performing a Principal Component Analysis of Transformed Data

PROC PRINQUAL produces an iteration history table that displays (for each iteration) the iteration number, the maximum and average absolute change in standardized variable scores computed over the iteratively transformed variables, the criterion being optimized, and the criterion change. In order to examine the results of the analysis in more detail, you can analyze the information in the output data set using other SAS procedures.

Specifically, use the PRINCOMP procedure to perform a components analysis on the transformed data. PROC PRINCOMP accepts the raw data from PROC PRINQUAL but issues a warning because the PROC PRINQUAL output data set has _NAME_ and _TYPE_ variables, but it is not a TYPE=CORR data set. You can ignore this warning.

If the output data set contains both scores and correlations, you must subset it for analysis with PROC PRINCOMP. Otherwise, the correlation observations are treated as ordinary observations and the PROC PRINCOMP results are incorrect. For example, consider the following statements:

  proc prinqual data=a out=b correlations replace;
     transform spline(var1-var50 / nknots=3);
  run;

  proc princomp data=b;
     where _TYPE_='SCORE';
  run;

Also note that the proportion of variance accounted for, as reported by PROC PRINCOMP, can exceed the proportion of variance accounted for in the last PROC PRINQUAL iteration. This is because PROC PRINQUAL reports the variance accounted for by the components analysis that generated the current scaling of the data, not a components analysis of the current scaling of the data.

Using the MAC Method

You can use the MAC algorithm alone by specifying METHOD=MAC, or you can use it as an initialization algorithm for METHOD=MTV and METHOD=MGV analyses by specifying the iteration option INITITER=. If any variables are negatively correlated, do not use the MAC algorithm with monotonic transformations (MONOTONE, UNTIE, and MSPLINE) because the signs of the correlations among the variables are not used when computing variable approximations. If an approximation is negatively correlated with the original variable, monotone constraints would make the optimally scaled variable a constant, which is not allowed (see the section Avoiding Constant Transformations on page 3672). When used with other transformations, the MAC algorithm can reverse the scoring of the variables. So, for example, if variable X is designated LOG(X) with METHOD=MAC and TSTANDARD=ORIGINAL, the final transformation (for example, TX) may not be LOG(X). If TX is not LOG(X), it has the same mean as LOG(X) and the same variance as LOG(X), and it is perfectly negatively correlated with LOG(X). PROC PRINQUAL displays a note for every variable that is reversed in this manner.

You can use the METHOD=MAC algorithm to reverse the scorings of some rating variables before a factor analysis. The correlations among bipolar ratings such as like-dislike, hot-cold, and fragile-monumental are typically both positive and negative. If some items are reversed to say dislike-like, cold-hot, and monumental-fragile, some of the negative signs can be eliminated, and the factor pattern matrix would be cleaner. You can use PROC PRINQUAL with METHOD=MAC and LINEAR transformations to reverse some items, maximizing the average of the intercorrelations.
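A sketch of this use of METHOD=MAC follows. The data set RATINGS and its three rating variables are hypothetical. The REPLACE option is used so that the possibly reversed variables keep their original names, and the _TYPE_ and _NAME_ variables are dropped before the factor analysis to avoid the warning described elsewhere in this chapter.

```sas
/* Hypothetical input: WORK.RATINGS with bipolar rating items */
proc prinqual data=ratings method=mac replace out=reversed;
   /* LINEAR transformations allow MAC to reverse item scoring */
   transform linear(like_dislike hot_cold fragile_monumental);
run;

/* Factor analysis of the rescored items */
proc factor data=reversed(drop=_TYPE_ _NAME_);
   var like_dislike hot_cold fragile_monumental;
run;
```

Check the log for the notes that PROC PRINQUAL displays for each reversed variable before interpreting the factor pattern.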

Output Data Set

The PRINQUAL procedure produces an output data set by default. By specifying the OUT=, APPROXIMATIONS, SCORES, REPLACE, and CORRELATIONS options in the PROC PRINQUAL statement, you can name this data set and control, to some extent, its contents.

Structure and Content

The output data set can have 16 different forms, depending on the specified combinations of the REPLACE, SCORES, APPROXIMATIONS, and CORRELATIONS options. You can specify any combination of these options. To illustrate, assume that the data matrix consists of N observations and m variables, and n components are computed. Then, define the following:

D	the N × m matrix of original data with variable names that correspond to the names of the variables in the input data set. However, when you use the OPSCORE transformation on character variables, those variables are replaced by numeric variables that contain category numbers
T	the N × m matrix of transformed data with variable names constructed from the value of the TPREFIX= option (if you do not specify the REPLACE option) and the names of the variables in the input data set
S	the N × n matrix of component scores with variable names constructed from the value of the PREFIX= option and integers
A	the N × m matrix of data approximations with variable names constructed from the value of the APREFIX= option and the names of the variables in the input data set
R_TD	the m × m matrix of correlations between the transformed variables and the original variables with variable names that correspond to the names of the variables in the input data set. When missing values exist, casewise deletion is used to compute the correlations.
R_TT	the m × m matrix of correlations among the transformed variables with the variable names constructed from the value of the TPREFIX= option (if you do not specify the REPLACE option) and the names of the variables in the input data set
R_TS	the m × n matrix of correlations between the transformed variables and the principal component scores (component structure matrix) with variable names constructed from the value of the PREFIX= option and integers
R_TA	the m × m matrix of correlations between the transformed variables and the variable approximations with variable names constructed from the value of the APREFIX= option and the names of the variables in the input data set

To create a data set WORK.A that contains all information, specify the following options in the PROC PRINQUAL statement

  proc prinqual scores approximations correlations out=a;

and also use a TRANSFORM statement appropriate for your data. Then the WORK.A data set contains

  D      T      S      A
  R_TD   R_TT   R_TS   R_TA

To eliminate the bottom partitions that contain the correlations and component structure, do not specify the CORRELATIONS option. For example, use the following PROC PRINQUAL statement with an appropriate TRANSFORM statement.

  proc prinqual scores approximations out=a;

Then the WORK.A data set contains

  D T S A

If you use the following PROC PRINQUAL statement (with an appropriate TRANSFORM statement)

  proc prinqual out=a;

this creates a data set WORK.A of the form

D T

To output transformed data and component scores only, specify the following options in the PROC PRINQUAL statement:

  proc prinqual replace scores out=a;

Then the WORK.A data set contains

T S

_TYPE_ and _NAME_ Variables

In addition to the preceding information, the output data set contains two character variables, the variable _TYPE_ (length 8) and the variable _NAME_ (length 32).

The _TYPE_ variable has the value SCORE if the observation contains variables, transformed variables, components, or data approximations; the _TYPE_ variable has the value CORR if the observation contains correlations or component structure.

By default, the _NAME_ variable has values ROW1, ROW2, and so on, for the observations with _TYPE_='SCORE'. If you use an ID statement, the variable _NAME_ contains the formatted ID variable for SCORE observations. The values of the variable _NAME_ for observations with _TYPE_='CORR' are the names of the transformed variables.

Certain procedures, such as PROC PRINCOMP, which can use the PROC PRINQUAL output data set, issue a warning that the PROC PRINQUAL data set contains _NAME_ and _TYPE_ variables but is not a TYPE=CORR data set. You can ignore this warning.
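Because the _TYPE_ variable distinguishes the two kinds of observations, one way to work with them separately is to split the output data set in a DATA step. The following is a sketch; the data set A and the variables X1-X5 are assumed for illustration.

```sas
proc prinqual data=a out=b scores correlations;
   transform spline(X1-X5);
run;

/* Route each observation to a data set according to _TYPE_ */
data score_rows corr_rows;
   set b;
   if _TYPE_ = 'SCORE' then output score_rows;
   else if _TYPE_ = 'CORR' then output corr_rows;
run;
```

Observations with a blank _TYPE_ value are written to neither data set in this sketch.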

Variable Names

The TPREFIX=, APREFIX=, and PREFIX= options specify prefixes for the transformed and approximation variable names and for principal component score variables, respectively. PROC PRINQUAL constructs transformed and approximation variable names from a prefix and the first characters of the original variable name. The number of characters in the prefix plus the number of characters in the original variable name (including the final digits, if any) required to uniquely designate the new variables should not exceed 32. For example, if the APREFIX= parameter that you specify is one character, PROC PRINQUAL adds the first 31 characters of the original variable name; if your prefix is four characters, only the first 28 characters of the original variable name are added.
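The following sketch sets all three prefixes and lists the resulting variables. The data set A, the variables X1-X5, and the prefix values are chosen only for illustration. Following the naming rules above, the output data set should contain transformed variables such as TrX1, approximations such as ApX1, and component scores Comp1 and Comp2.

```sas
proc prinqual data=a out=b n=2 scores approximations
              tprefix=Tr aprefix=Ap prefix=Comp;
   transform monotone(X1-X5);
run;

/* List the variables in creation order to check the constructed names */
proc contents data=b varnum;
run;
```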

Effect of the TSTANDARD= and COVARIANCE Options

The values in the output data set are affected by the TSTANDARD= and COVARIANCE options. If you specify TSTANDARD=NOMISS, the NOMISS standardization is performed on the transformed data after the iterations have been completed, but before the output data set is created. The new means and variances are used in creating the output data set. Then, if you do not specify the COVARIANCE option, the data are transformed to mean zero and variance one. The principal component scores and data approximations are computed from the resulting matrix. The data are then linearly transformed to have the mean and variance specified by the TSTANDARD= option. The data approximations are transformed so that the means within each pair of a transformed variable and its approximation are the same. The ratio of the variance of a variable approximation to the variance of the corresponding transformed variable equals the proportion of the variance of the variable that is accounted for by the components model.

If you specify the COVARIANCE option and do not specify TSTANDARD=Z, you can input the transformed data to PROC PRINCOMP, again specifying the COVARIANCE option, to perform a components analysis of the results of PROC PRINQUAL. Similarly, if you do not specify the COVARIANCE option with PROC PRINQUAL and you input the transformed data to PROC PRINCOMP without the COVARIANCE option, you receive the same report. However, some combinations of PROC PRINQUAL options, such as COVARIANCE and TSTANDARD=Z, while valid, produce approximations and scores that cannot be reproduced by PROC PRINCOMP.

The component scores in the output data set are computed from the correlations among the transformed variables, or from the covariances if you specified the COVARIANCE option. The component scores are computed after the TSTANDARD=NOMISS transformation, if specified. The means of the component scores in the output data set are always zero. The variances equal the corresponding eigenvalues, unless you specify the STANDARD option; then the variances are set to one.
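As a sketch of the matching-options case described above, the following steps specify the COVARIANCE option in both procedures and leave the TSTANDARD= option at its default. The data set A and the variables X1-X5 are hypothetical. The REPLACE option is used so that the transformed variables keep the names X1-X5.

```sas
/* Components analysis of covariances among the transformed variables */
proc prinqual data=a out=b covariance replace;
   transform spline(X1-X5 / nknots=3);
run;

/* Same COVARIANCE choice in PROC PRINCOMP */
proc princomp data=b covariance;
   var X1-X5;
run;
```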

Avoiding Constant Transformations

There are times when the optimal scaling produces a constant transformed variable. This can happen with the MONOTONE, UNTIE, and MSPLINE transformations when the target is negatively correlated with the original input variable. It can happen with all transformations when the target is uncorrelated with the original input variable. When this happens, the procedure modifies the target to avoid a constant transformation. This strategy avoids certain nonoptimal solutions.

If the transformation is monotonic and a constant transformed variable results, the procedure multiplies the target by −1 and tries the optimal scaling again. If the transformation is not monotonic or if the multiplication by −1 did not help, the procedure tries using a random target. If the transformation is still constant, the previous nonconstant transformation is retained. When a constant transformation is avoided by any strategy, this message is displayed: A constant transformation was avoided for name.

Constant Variables

Constant and almost constant variables are zeroed and ignored.

Character OPSCORE Variables

Character OPSCORE variables are replaced by a numeric variable containing category numbers before the iterations, and the character values are discarded. Only the first eight characters are considered when determining category membership. If you want the original character variable in the output data set, give it a different name in the OPSCORE specification (OPSCORE(x / NAME=(x2))) and name the original variable in the ID statement (ID x;).
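Put together, the renaming and the ID statement might look like the following sketch. The data set A, the character variable x, and the numeric variables y1-y5 are hypothetical.

```sas
proc prinqual data=a out=b;
   /* x is a character variable; its category numbers are stored in x2 */
   transform opscore(x / name=(x2)) monotone(y1-y5);
   /* Carry the original character values of x into the output data set */
   id x;
run;
```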

REITERATE Option Usage

You can use the REITERATE option to perform additional iterations when PROC PRINQUAL stops before the data have adequately converged. For example, suppose that you execute the following code:

  proc prinqual data=A cor out=B;
     transform mspline(X1-X5);
  run;

If the transformations do not converge in the default 30 iterations, you can perform more iterations without repeating the first 30 iterations.

  proc prinqual data=B reiterate cor out=B;
     transform mspline(X1-X5);
  run;

Note that a WHERE statement is not necessary to exclude the correlation observations. They are automatically excluded because their _TYPE_ variable value is not SCORE.

You can also use the REITERATE option to specify starting values other than the original values for the transformations. Providing alternate starting points may avoid local optima. Here are two examples.

  proc prinqual data=A out=B;
     transform rank(X1-X5);
  run;

  proc prinqual data=B reiterate out=C;
     /* Use ranks as the starting point. */
     transform monotone(X1-X5);
  run;

  data B;
     set A;
     array TXS[5] TX1-TX5;
     do j = 1 to 5;
        TXS[j] = normal(0);
     end;
  run;

  proc prinqual data=B reiterate out=C;
     /* Use a random starting point. */
     transform monotone(X1-X5);
  run;

Note that divergence with the REITERATE option, particularly in the second iteration, is not an error since the initial transformation is not required to be a valid member of the transformation family. When you specify the REITERATE option, the iteration does not terminate when the criterion change is negative during the first ten iterations.

Passive Observations

Observations may be excluded from the analysis for several reasons, including zero weight, zero frequency, missing values in variables designated IDENTITY, or missing values with the NOMISS option specified. These observations are passive in that they do not contribute to determining transformations, R², total variance, and so on. However, some information can be computed for them, such as approximations, principal component scores, and transformed values. Passive observations in the output data set have a blank value for the variable _TYPE_.

Missing value estimates for passive observations may converge slowly with METHOD=MTV. In the following example, the missing value estimates should be 2, 5, and 8. Since the nonpassive observations do not change, the procedure converges in one iteration but the missing value estimates do not converge. The extra iterations produced by specifying CONVERGE=-1 and CCONVERGE=-1, as shown in the second PROC PRINQUAL step, generate the expected results.

  data A;
     input X Y;
     datalines;
  1 1
  2 .
  3 3
  4 4
  5 .
  6 6
  7 7
  8 .
  9 9
  ;

  proc prinqual data=A nomiss n=1 out=B method=mtv;
     transform lin(X Y);
  run;

  proc print;
  run;

  proc prinqual data=A nomiss n=1 out=B method=mtv
                converge=-1 cconverge=-1;
     transform lin(X Y);
  run;

  proc print;
  run;
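To look only at the passive observations, you can select the rows whose _TYPE_ value is blank. This is a small sketch that assumes the output data set is named B, as in the preceding example.

```sas
/* Passive observations have a blank _TYPE_ value */
proc print data=B;
   where _TYPE_ = ' ';
run;
```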

Computational Resources

This section provides information on the computational resources required to run PROC PRINQUAL.

Let

N = number of observations
m = number of variables
n = number of principal components
k = maximum spline degree
p = maximum number of knots

For the MTV algorithm, more than [expression missing from the source] bytes of array space are required.
For the MGV and MAC algorithms, more than 56m bytes plus the maximum of the data matrix size and the optimal scaling work space are required as array space. The data matrix size is 8Nm bytes. The optimal scaling work space requires less than 8(6N + (p + k + 2)(p + k + 11)) bytes.
For the MTV and MGV algorithms, more than 56m + 4m(m + 1) bytes of array space are required.
PROC PRINQUAL tries to store the original and transformed data in memory. If there is not enough memory, a utility data set is used, potentially resulting in a large increase in execution time. The preceding formulas underestimate the amount of memory needed to handle most problems; they give an absolute minimum amount of memory required. If a utility data set is used, and if memory could be used with perfect efficiency, then roughly the amount of memory stated previously would be needed. In reality, most problems require at least two or three times the minimum.
PROC PRINQUAL sorts the data once. The sort time is roughly proportional to mN^(3/2).
For the MTV algorithm, the time required to compute the variable approximations is roughly proportional to 2Nm² + 5m³ + nm².
For the MGV algorithm, one regression analysis per iteration is required to compute model parameter estimates. The time required for accumulating the crossproduct matrix is roughly proportional to Nm². The time required to compute the regression coefficients is roughly proportional to m³. For each variable for each iteration, the swept crossproduct matrix is updated with time roughly proportional to m(N + m). The swept crossproduct matrix is updated for each variable with time roughly proportional to m², until computations are refreshed, requiring all sweeps to be performed again.
The only computationally intensive part of the MAC algorithm is the optimal scaling, since variable approximations are simple averages.
Each optimal scaling is a multiple regression problem, although some transformations are handled with faster special case algorithms. The number of regressors for the optimal scaling problems depends on the original values of the variable and the type of transformation. For each monotone spline transformation, an unknown number of multiple regressions is required to find a set of coefficients that satisfies the constraints. The B-spline basis is generated twice for each SPLINE and MSPLINE transformation for each iteration. The time required to generate the B-spline basis is roughly proportional to Nk².
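To get a feel for the memory formulas in this section, the following DATA step evaluates them for an illustrative problem with N = 1000 observations, m = 20 variables, spline degree k = 3, and p = 3 knots. These sizes are assumptions chosen for the example, and the variable names are arbitrary. Only the formulas stated in the text are used.

```sas
data _null_;
   N = 1000;   /* number of observations (assumed) */
   m = 20;     /* number of variables (assumed)    */
   k = 3;      /* maximum spline degree (assumed)  */
   p = 3;      /* maximum number of knots (assumed) */

   /* Data matrix size: 8Nm bytes */
   data_bytes = 8 * N * m;

   /* Upper bound on the optimal scaling work space */
   scale_bytes = 8 * (6*N + (p + k + 2) * (p + k + 11));

   /* MGV and MAC: 56m plus the larger of the two sizes above */
   mgv_mac_bytes = 56*m + max(data_bytes, scale_bytes);

   /* MTV and MGV: 56m + 4m(m + 1) */
   mtv_mgv_bytes = 56*m + 4*m*(m + 1);

   put data_bytes= scale_bytes= mgv_mac_bytes= mtv_mgv_bytes=;
run;
```

For these values the step writes data_bytes=160000, scale_bytes=49088, mgv_mac_bytes=161120, and mtv_mgv_bytes=2800 to the log. As noted above, real problems typically need at least two or three times these minimums.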

Displayed Output

The main output from the PRINQUAL procedure is the output data set. However, the procedure does produce displayed output in the form of an iteration history table that includes the following:

iteration number
the criterion being optimized
criterion change
maximum and average absolute change in standardized variable scores computed over variables that can be iteratively transformed
notes
final convergence status

ODS Table Names

PROC PRINQUAL assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table.

For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 59.2: ODS Tables Produced in PROC PRINQUAL
ODS Table Name	Description	Statement	Option
ConvergenceStatus	Convergence Status		default
Footnotes	Iteration History Footnotes		default
MAC	MAC Iteration History	PROC	METHOD=MAC
MGV	MGV Iteration History	PROC	METHOD=MGV
MTV	MTV Iteration History	PROC	METHOD=MTV
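For example, the table names can be used in an ODS OUTPUT statement to save the iteration history and convergence status as data sets. This is a sketch: the input data set A, the variables X1-X5, and the output data set names ITERHIST and CONVSTAT are assumptions.

```sas
/* Capture the MTV iteration history and the convergence status */
ods output MTV=iterhist ConvergenceStatus=convstat;

proc prinqual data=a out=b method=mtv;
   transform spline(X1-X5 / nknots=3);
run;

proc print data=iterhist;
run;
```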

ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the PRINQUAL procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release. To request a graph you must specify the ODS GRAPHICS statement in addition to the following option. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

The following table shows the available plot option.

Option	Plot Description
MDPREF	Multidimensional preference analysis

ODS Graph Names

PROC PRINQUAL assigns a name to the graph it creates using ODS. You can use this name to reference the graph when using ODS. The name is listed in Table 59.3.

Table 59.3: ODS Graphics Produced by PROC PRINQUAL
ODS Graph Name	Plot Description	Statement	Option
PrinqualPlot	Multidimensional preference analysis	PROC	MDPREF

To request a graph you must specify the ODS GRAPHICS statement in addition to the option indicated in Table 59.3. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
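A request for the multidimensional preference plot might look like the following sketch. The data set PREFS, the judge variables, and the ID variable PRODUCT are hypothetical, and because the graphics are experimental in this release, the exact statements may differ in later releases.

```sas
ods html;
ods graphics on;

/* MDPREF requests the multidimensional preference analysis plot */
proc prinqual data=prefs out=results n=2 replace mdpref;
   transform monotone(judge1-judge25);
   id product;
run;

ods graphics off;
ods html close;
```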