The following statements are available in the STDIZE procedure.
PROC STDIZE < options > ;
BY variables ;
FREQ variable ;
LOCATION variables ;
SCALE variables ;
VAR variables ;
WEIGHT variable ;
The PROC STDIZE statement is required. The BY, FREQ, LOCATION, SCALE, VAR, and WEIGHT statements are described in alphabetical order following the PROC STDIZE statement.
PROC STDIZE < options > ;
The PROC STDIZE statement invokes the procedure. You can specify the following options in the PROC STDIZE statement.
Task | Options | Description |
---|---|---|
Specify standardization methods | METHOD= | specifies the name of the standardization method |
 | INITIAL= | specifies the method for computing initial estimates for the A estimates |
Unstandardize variables | UNSTD | unstandardizes variables when you also specify the METHOD=IN option |
Process missing values | NOMISS | omits observations with any missing values from computation |
 | MISSING= | specifies the method or a numeric value for replacing missing values |
 | REPLACE | replaces missing data by zero in the standardized data |
 | REPONLY | replaces missing data by the location measure (does not standardize the data) |
Specify data set details | DATA= | specifies the input data set |
 | OUT= | specifies the output data set |
 | OUTSTAT= | specifies the output statistic data set |
Specify computational settings | VARDEF= | specifies the variances divisor |
 | NMARKERS= | specifies the number of markers when you also specify PCTLMTD=ONEPASS |
 | MULT= | specifies the constant to multiply each value by after standardizing |
 | ADD= | specifies the constant to add to each value after standardizing and multiplying by the value specified in the MULT= option |
 | FUZZ= | specifies the relative fuzz factor for writing the output |
Specify percentiles | PCTLDEF= | specifies the definition of percentiles when you also specify the PCTLMTD=ORD_STAT option |
 | PCTLMTD= | specifies the method used to estimate percentiles |
 | PCTLPTS= | writes observations containing percentiles to the data set specified in the OUTSTAT= option |
Normalize scale estimators | NORM | normalizes the scale estimator to be consistent for the standard deviation of a normal distribution |
 | SNORM | normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution |
Specify output | PSTAT | displays the location and scale measures |
These options and their abbreviations are described, in alphabetical order, in the remainder of this section.
ADD= c
specifies a constant, c, to add to each value after standardizing and multiplying by the value you specify in the MULT= option. The default value is 0.
DATA= SAS-data-set
specifies the input data set to be standardized. If you omit the DATA= option, the most recently created data set is used.
FUZZ= c
specifies the relative fuzz factor. The default value is 1E-14. For the OUT= data set, the score is computed as follows:
For the OUTSTAT= data set and the Location and Scale table, the scale and location values are computed as follows:
Otherwise,
INITIAL= method
specifies the method for computing initial estimates for the A estimates (ABW, AWAVE, and AHUBER). The following methods are not allowed: INITIAL=ABW, INITIAL=AHUBER, INITIAL=AWAVE, and INITIAL=IN. The default is INITIAL=MAD.
METHOD= name
specifies the name of the method for computing location and scale measures. Valid values for name are as follows: MEAN, MEDIAN, SUM, EUCLEN, USTD, STD, RANGE, MIDRANGE, MAXABS, IQR, MAD, ABW, AHUBER, AWAVE, AGK, SPACING, L, and IN.
For details on these methods, see the descriptions in the 'Standardization Methods' section on page 4136. The default is METHOD=STD.
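As an illustration of the METHOD= option together with the data set options in this section, the following sketch standardizes two variables with a robust method instead of the default. It assumes that the SASHELP.CLASS sample data set is available in your installation; the WORK data set names are hypothetical.

```sas
/* Standardize Height and Weight with METHOD=MAD, which uses the median
   as the location measure and the median absolute deviation as the
   scale measure. SASHELP.CLASS and the WORK names are assumptions. */
proc stdize data=sashelp.class
            method=mad
            out=work.class_mad         /* standardized copy of the input   */
            outstat=work.class_stat    /* location, scale, other statistics */
            pstat;                     /* display location and scale        */
   var height weight;
run;
```

Omitting METHOD= in this step would apply the default, METHOD=STD.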
MISSING= method
MISSING= value
specifies the method (or a numeric value) for replacing missing values. If you omit the MISSING= option, the REPLACE option replaces missing values with the location measure given by the METHOD= option. Specify the MISSING= option when you want to replace missing values with a different value. You can specify any name that is valid in the METHOD= option except the name IN. The corresponding location measure is used to replace missing values.
If a numeric value is given, the value replaces missing values after standardizing the data. However, you can specify the REPONLY option with the MISSING= option to suppress standardization for cases in which you want only to replace missing values.
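The interaction of METHOD=, MISSING=, and REPONLY can be sketched as follows. The example is illustrative only: it assumes the SASHELP.CLASS sample data set is available and deliberately sets two values to missing; all WORK data set names are hypothetical.

```sas
/* Create a copy of SASHELP.CLASS with two Height values set to missing
   (illustrative data only). */
data work.class_miss;
   set sashelp.class;
   if name in ('Alice', 'John') then height = .;
run;

/* Replace missing values with the median of the nonmissing values.
   REPONLY suppresses standardization, so other values are unchanged. */
proc stdize data=work.class_miss method=median reponly
            out=work.class_median;
   var height;
run;

/* Replace missing values with the constant 0 instead, again without
   standardizing. */
proc stdize data=work.class_miss missing=0 reponly
            out=work.class_zero;
   var height;
run;
```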
MULT= c
specifies a constant, c, by which to multiply each value after standardizing. The default value is 1.
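A common use of MULT= together with ADD= is to rescale standardized scores to a chosen mean and spread. The following sketch, which assumes the SASHELP.CLASS sample data set and a hypothetical output name, rescales two variables to a mean of 50 and a standard deviation of 10.

```sas
/* Each value is standardized with METHOD=STD, multiplied by MULT=,
   and then shifted by ADD=. SASHELP.CLASS and WORK.CLASS_T are
   illustrative assumptions. */
proc stdize data=sashelp.class method=std mult=10 add=50
            out=work.class_t;
   var height weight;
run;

/* Check the result: means should be close to 50 and standard
   deviations close to 10. */
proc means data=work.class_t mean std;
   var height weight;
run;
```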
NMARKERS= n
specifies the number of markers used when you specify the one-pass algorithm (PCTLMTD=ONEPASS). The value n must be greater than or equal to 5. The default value is 105.
NOMISS
omits observations with missing values for any of the analyzed variables from calculation of the location and scale measures. If you omit the NOMISS option, all nonmissing values are used.
NORM
normalizes the scale estimator to be consistent for the standard deviation of a normal distribution when you specify the option METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING.
OUT= SAS-data-set
specifies the name of the SAS data set created by PROC STDIZE. The output data set is a copy of the DATA= data set except that the analyzed variables have been standardized. Note that analyzed variables are those specified in the VAR statement or, if there is no VAR statement, all numeric variables not listed in any other statement. See the section 'Output Data Sets' on page 4141 for more information.
If you want to create a permanent SAS data set, you must specify a two-level name. (Refer to 'SAS Files' in SAS Language Reference: Concepts for more information on permanent SAS data sets.)
If you omit the OUT= option, PROC STDIZE creates an output data set named according to the DATAn convention.
OUTSTAT= SAS-data-set
specifies the name of the SAS data set containing the location and scale measures and other computed statistics. See the section 'Output Data Sets' on page 4141 for more information.
PCTLDEF= percentiles
specifies which of five definitions is used to calculate percentiles when you specify the option PCTLMTD=ORD_STAT. By default, PCTLDEF=5.
Note that the option PCTLMTD=ONEPASS implies a specification of PCTLDEF=5. See the section 'Computational Methods for the PCTLDEF= Option' on page 4140 for details on the PCTLDEF= option.
PCTLMTD=ORD_STAT
PCTLMTD=ONEPASS | P2
specifies the method used to estimate percentiles. Specify the PCTLMTD=ORD_STAT option to compute the percentiles by the order statistics method. The PCTLMTD=ONEPASS option modifies an algorithm invented by Jain and Chlamtac (1985). See the 'Computing Quantiles' section on page 4139 for more details on this algorithm.
PCTLPTS= n
writes percentiles to the OUTSTAT= data set. Values of n can be any decimal number between 0 and 100, inclusive.
A requested percentile is identified by the _TYPE_ variable in the OUTSTAT= data set with a value of Pn. For example, suppose you specify the option PCTLPTS=10, 30. The corresponding observations in the OUTSTAT= data set that contain the 10th and the 30th percentiles would then have values _TYPE_=P10 and _TYPE_=P30, respectively.
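The PCTLPTS= example in the text can be sketched as a complete step. This is a minimal illustration that assumes the SASHELP.CLASS sample data set; the OUTSTAT= data set name is hypothetical, and the list syntax follows the PCTLPTS=10, 30 form quoted above.

```sas
/* Request the 10th and 30th percentiles by the order statistics method
   and write them to the OUTSTAT= data set. */
proc stdize data=sashelp.class pctlmtd=ord_stat pctlpts=10, 30
            outstat=work.class_pctl;
   var height weight;
run;

/* Show only the observations that hold the requested percentiles. */
proc print data=work.class_pctl;
   where _type_ in ('P10', 'P30');
run;
```

Because no OUT= option is given, the step also creates a standardized output data set named by the DATAn convention.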
PSTAT
displays the location and scale measures.
REPLACE
replaces missing data with the value 0 in the standardized data (this value corresponds to the location measure before standardizing). To replace missing data by other values, see the preceding description of the MISSING= option. You cannot specify both the REPLACE and REPONLY options.
REPONLY
replaces missing data only; PROC STDIZE does not standardize the data. Missing values are replaced with the location measure unless you also specify the MISSING=value option, in which case missing values are replaced with value. You cannot specify both the REPLACE and REPONLY options.
SNORM
normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution when you specify the METHOD=SPACING option.
UNSTD
UNSTDIZE
unstandardizes variables when you specify the METHOD=IN(ds) option. The location and scale measures, along with the constants for addition and multiplication on which the unstandardization is based, are identified by the _TYPE_ variable in the ds data set.
The ds data set must have a _TYPE_ variable and contain the following two observations: a _TYPE_='LOCATION' observation and a _TYPE_='SCALE' observation. The variable _TYPE_ can also contain the optional observations 'ADD' and 'MULT'; if these observations are not found in the ds data set, the constants specified in the ADD= and MULT= options (or their default values) are used for unstandardization.
See the 'OUTSTAT= Data Set' section on page 4141 for details on the statistics that each value of _TYPE_ represents. The formula used for unstandardization is as follows: If the final output value from the previous standardization is calculated as
$$\text{result} = \text{add} + \text{multiply} \times \frac{\text{original} - \text{location}}{\text{scale}}$$
then the unstandardized value is computed as
$$\text{original} = \text{scale} \times \frac{\text{result} - \text{add}}{\text{multiply}} + \text{location}$$
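A round trip through standardization and unstandardization might look like the following sketch. It assumes the SASHELP.CLASS sample data set; the WORK data set names are hypothetical. The OUTSTAT= data set from the first step supplies the _TYPE_ observations that the second step reads through METHOD=IN.

```sas
/* Step 1: standardize and save the location and scale measures. */
proc stdize data=sashelp.class method=std
            out=work.class_std outstat=work.class_stat;
   var height weight;
run;

/* Step 2: reverse the standardization by using the saved statistics. */
proc stdize data=work.class_std method=in(work.class_stat) unstd
            out=work.class_back;
   var height weight;
run;

/* Step 3: compare with the original values, allowing for small
   floating-point differences. */
proc compare base=sashelp.class compare=work.class_back
             criterion=1e-8 method=absolute;
   var height weight;
run;
```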
VARDEF= DF
VARDEF= N
VARDEF= WDF
VARDEF= WEIGHT | WGT
specifies the divisor to be used in the calculation of variances. By default, VARDEF=DF. The values and associated divisors are as follows.
Value | Divisor | Formula |
---|---|---|
DF | degrees of freedom | $n - 1$ |
N | number of observations | $n$ |
WDF | sum of weights minus 1 | $\left(\sum_i w_i\right) - 1$ |
WEIGHT or WGT | sum of weights | $\sum_i w_i$ |
BY variables ;
You can specify a BY statement with PROC STDIZE to obtain separate standardization for observations in groups defined by the BY variables.
If your DATA= input data set is not sorted in ascending order, use one of the following alternatives:
Sort the data using the SORT procedure with a similar BY statement.
Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the STDIZE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
When you specify the option METHOD=IN(ds), the following rules are applied to BY-group processing:
If the ds data set does not contain any of the BY variables, the entire DATA= data set is standardized by the location and scale measures (along with the constants for addition and multiplication) in the ds data set.
If the ds data set contains some, but not all, of the BY variables or if some BY variables do not have the same type or length in the ds data set that they have in the DATA= data set, PROC STDIZE displays an error message and stops.
If all of the BY variables appear in the ds data set with the same type and length as in the DATA= data set, each BY group in the DATA= data set is standardized using the location and scale measures (along with the constants for addition and multiplication) from the corresponding BY group in the ds data set. The BY groups in the ds data set must be in the same order as they appear in the DATA= data set. All BY groups in the DATA= data set must also appear in the ds data set. If you do not specify the NOTSORTED option, some BY groups can appear in the ds data set but not in the DATA= data set; such BY groups are not used in standardizing data.
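The basic BY-group usage described at the start of this section can be sketched as follows, assuming the SASHELP.CLASS sample data set and hypothetical WORK data set names. The input is sorted first so that the BY statement sees the groups in ascending order.

```sas
/* Sort by the BY variable so that BY-group processing is valid. */
proc sort data=sashelp.class out=work.class_sorted;
   by sex;
run;

/* Standardize Height and Weight separately within each value of Sex. */
proc stdize data=work.class_sorted method=std out=work.class_by;
   by sex;
   var height weight;
run;
```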
FREQ | FREQUENCY variable ;
If one variable in the input data set represents the frequency of occurrence for other values in the observation, specify the variable name in a FREQ statement. PROC STDIZE treats the data set as if each observation appeared n times, where n is the value of the FREQ variable for the observation. Nonintegral values of the FREQ variable are truncated to the largest integer less than the FREQ value. If the FREQ variable has a value that is less than 1 or is missing, the observation is not used in the analysis.
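The effect of the FREQ statement can be seen with a small, made-up data set in which each row stands for several identical observations. The data values and data set names below are hypothetical.

```sas
/* Each row represents COUNT observations that share the same SCORE. */
data work.counts;
   input score count;
   datalines;
10 3
20 1
30 2
;
run;

/* Center SCORE at its frequency-weighted mean. With the counts above,
   the location measure is (10*3 + 20*1 + 30*2) / 6 = 110/6. */
proc stdize data=work.counts method=mean out=work.counts_ctr pstat;
   var score;
   freq count;
run;
```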
LOCATION variables ;
The LOCATION statement specifies a list of numeric variables that contain location measures in the input data set specified by the METHOD=IN option.
SCALE variables ;
The SCALE statement specifies the list of numeric variables containing scale measures in the input data set specified by the METHOD=IN option.
VAR | VARIABLES variables ;
The VAR statement lists numeric variables to be standardized. If you omit the VAR statement, all numeric variables not listed in the BY, FREQ, and WEIGHT statements are used.
WGT | WEIGHT variable ;
The WEIGHT statement specifies a numeric variable in the input data set with values that are used to weight each observation. Only one variable can be specified.
The WEIGHT variable values can be nonintegers. An observation is used in the analysis only if the value of the WEIGHT variable is greater than zero. The WEIGHT variable applies only when you specify the option METHOD=MEAN, METHOD=SUM, METHOD=EUCLEN, METHOD=USTD, METHOD=STD, METHOD=AGK, or METHOD=L.
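PROC STDIZE uses the value of the WEIGHT variable, $w_i$, as follows.
The sample mean and (uncorrected) sample variances are computed as
$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad us_w^2 = \frac{1}{d}\sum_i w_i x_i^2, \qquad s_w^2 = \frac{1}{d}\sum_i w_i (x_i - \bar{x}_w)^2$$
where $w_i$ is the weight value of the $i$th observation, $x_i$ is the value of the $i$th observation, and $d$ is the divisor controlled by the VARDEF= option (see the VARDEF= option for details).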
PROC STDIZE uses the value of the WEIGHT variable to calculate the following statistics:
Method | Statistic |
---|---|
MEAN | the weighted mean, $\bar{x}_w = \sum_i w_i x_i / \sum_i w_i$ |
SUM | the weighted sum, $\sum_i w_i x_i$ |
USTD | the weighted uncorrected standard deviation, $\sqrt{\sum_i w_i x_i^2 / d}$ |
STD | the weighted standard deviation, $\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 / d}$ |
EUCLEN | the weighted Euclidean length, computed as the square root of the weighted uncorrected sum of squares: $\sqrt{\sum_i w_i x_i^2}$ |
AGK | the AGK estimate. This estimate is documented further in the ACECLUS procedure as the METHOD=COUNT option. See the discussion of the WEIGHT statement in Chapter 16, 'The ACECLUS Procedure,' for information on how the WEIGHT variable is applied to the AGK estimate. |
L | the $L_p$ estimate. This estimate is documented further in the FASTCLUS procedure as the LEAST= option. See the discussion of the WEIGHT statement in Chapter 28, 'The FASTCLUS Procedure,' for information on how the WEIGHT variable is used to compute weighted cluster means. Note that the number of clusters is always 1. |
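A short worked example may help make the weighted statistics concrete. The data are made up, and the formulas are the usual weighted definitions in terms of $w_i$, $x_i$, and the divisor $d$; treat this as a sketch under those assumptions. Take three observations $x = (1, 2, 4)$ with weights $w = (1, 1, 2)$ and the default VARDEF=DF, so that $d = n - 1 = 2$.

```latex
% Assumed data: x = (1, 2, 4), w = (1, 1, 2), VARDEF=DF so d = n - 1 = 2
\begin{align*}
\text{SUM}    &= \sum_i w_i x_i = 1\cdot 1 + 1\cdot 2 + 2\cdot 4 = 11 \\
\text{MEAN}   &= \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{11}{4} = 2.75 \\
\text{EUCLEN} &= \sqrt{\sum_i w_i x_i^2} = \sqrt{1 + 4 + 32} = \sqrt{37} \approx 6.083 \\
\text{USTD}   &= \sqrt{\tfrac{1}{d}\sum_i w_i x_i^2} = \sqrt{37/2} \approx 4.301 \\
\text{STD}    &= \sqrt{\tfrac{1}{d}\sum_i w_i (x_i - \bar{x}_w)^2} = \sqrt{6.75/2} \approx 1.837
\end{align*}
```

Changing VARDEF= changes only $d$, so MEAN, SUM, and EUCLEN are unaffected while USTD and STD are rescaled.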