Details


Standardization Methods

Table 66.2 lists the standardization methods available with the METHOD= option, together with their corresponding location and scale measures.

For METHOD=ABW( c ), METHOD=AHUBER( c ), or METHOD=AWAVE( c ), c is a positive numeric tuning constant.

For METHOD=AGK( p ), p is a numeric constant giving the proportion of pairs to be included in the estimation of the within-cluster variances.

For METHOD=SPACING( p ), p is a numeric constant giving the proportion of data to be contained in the spacing.

For METHOD=L( p ), p is a numeric constant greater than or equal to 1 specifying the power to which differences are to be raised in computing an L( p ) or Minkowski metric.

For METHOD=IN( ds ), ds is the name of a SAS data set that meets either one of the following two conditions:

  • contains a _TYPE_ variable. The observation that contains the location measure corresponds to the value _TYPE_ = 'LOCATION', and the observation that contains the scale measure corresponds to the value _TYPE_ = 'SCALE'. You can also use a data set created by the OUTSTAT= option from another PROC STDIZE statement as the ds data set. See the section 'Output Data Sets' for the contents of the OUTSTAT= data set.

  • contains the location and scale variables specified by the LOCATION and SCALE statements.

PROC STDIZE reads in the location and scale variables in the ds data set by first looking for the _TYPE_ variable in the ds data set. If it finds this variable, PROC STDIZE continues to search for all variables specified in the VAR statement. If it does not find the _TYPE_ variable, PROC STDIZE searches for the location variables specified in the LOCATION statement and the scale variables specified in the SCALE statement.

For robust estimators, refer to Goodall (1983) and Iglewicz (1983). The MAD method has the highest breakdown point (50%), but it is somewhat inefficient. The ABW, AHUBER, and AWAVE methods provide a good compromise between breakdown and efficiency. The L( p ) location estimates are increasingly robust as p drops from 2 (corresponding to least squares, or mean estimation) to 1 (corresponding to least absolute value, or median estimation). However, the L( p ) scale estimates are not robust.
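The effect of p on the L( p ) location estimate can be illustrated with a minimal Python sketch. It assumes the location is the value c that minimizes the sum of |x − c| raised to the power p, and it finds c by a plain ternary search; this is an illustration only, not the iterative scheme that PROC STDIZE or PROC FASTCLUS uses. The function name `lp_location` and the sample data are hypothetical.

```python
import statistics

def lp_location(values, p, tol=1e-7):
    """Approximate the c that minimizes sum(|x - c| ** p) for p >= 1.

    The objective is convex in c when p >= 1, so a ternary search on
    [min(values), max(values)] converges to a minimizer.
    This is an illustrative method, not the SAS algorithm.
    """
    if p < 1:
        raise ValueError("p must be greater than or equal to 1")

    def cost(c):
        return sum(abs(x - c) ** p for x in values)

    lo, hi = min(values), max(values)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) <= cost(m2):
            hi = m2   # a minimizer lies in [lo, m2]
        else:
            lo = m1   # a minimizer lies in [m1, hi]
    return (lo + hi) / 2

data = [1.0, 2.0, 3.0, 4.0, 100.0]      # one gross outlier
loc_p2 = lp_location(data, 2)            # approximately the mean
loc_p1 = lp_location(data, 1)            # approximately the median
mean_value = statistics.mean(data)
median_value = statistics.median(data)
```

With these data, the p = 2 estimate is pulled toward the outlier along with the mean, while the p = 1 estimate stays near the median.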

The SPACING method is robust to both outliers and clustering (Janssen et al. 1995) and is, therefore, a good choice for cluster analysis or nonparametric density estimation. The mid minimum-spacing method estimates the mode for small p. The AGK method is also robust to clustering and more efficient than the SPACING method, but it is not as robust to outliers and takes longer to compute. If you expect g clusters, the argument to METHOD=SPACING or METHOD=AGK should be 1/g or less. The AGK method is less biased than the SPACING method for small samples. As a general guide, it is reasonable to use AGK for samples of size 100 or less and SPACING for samples of size 1000 or more, with the treatment of intermediate sample sizes depending on the available computer resources.

Computation of the Statistics

Formulas for statistics of METHOD=MEAN, METHOD=MEDIAN, METHOD=SUM, METHOD=USTD, METHOD=STD, METHOD=RANGE, and METHOD=IQR are given in the chapter on elementary statistics procedures in the SAS Procedures Guide.

Note that the computations of median and upper and lower quartiles depend on the PCTLMTD= option.

The other statistics listed in Table 66.2, except for METHOD=IN, are described as follows:

Table 66.2: Available Standardization Methods

Method          Location                      Scale
--------------  ----------------------------  -------------------------------------
MEAN            mean                          1
MEDIAN          median                        1
SUM             0                             sum
EUCLEN          0                             Euclidean length
USTD            0                             standard deviation about origin
STD             mean                          standard deviation
RANGE           minimum                       range
MIDRANGE        midrange                      range/2
MAXABS          0                             maximum absolute value
IQR             median                        interquartile range
MAD             median                        median absolute deviation from median
ABW( c )        biweight 1-step M-estimate    biweight A-estimate
AHUBER( c )     Huber 1-step M-estimate       Huber A-estimate
AWAVE( c )      Wave 1-step M-estimate        Wave A-estimate
AGK( p )        mean                          AGK estimate (ACECLUS)
SPACING( p )    mid minimum-spacing           minimum spacing
L( p )          L( p )                        L( p )
IN( ds )        read from data set            read from data set
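As a rough illustration of how a location and scale pair from Table 66.2 is applied, the following Python sketch computes the pair for a few of the simpler methods and then standardizes the data. It assumes the usual PROC STDIZE form, result = add + multiply × (original − location) / scale, which this section does not restate, and it assumes the n − 1 divisor for the standard deviation. The helper names `location_scale` and `standardize` are hypothetical.

```python
import math
import statistics

def location_scale(values, method):
    """Return (location, scale) for a few of the methods in Table 66.2."""
    if method == "MEAN":
        return statistics.mean(values), 1.0
    if method == "STD":
        # Assumption: sample standard deviation with divisor n - 1.
        return statistics.mean(values), statistics.stdev(values)
    if method == "RANGE":
        return min(values), max(values) - min(values)
    if method == "MIDRANGE":
        return (max(values) + min(values)) / 2, (max(values) - min(values)) / 2
    if method == "EUCLEN":
        return 0.0, math.sqrt(sum(x * x for x in values))
    if method == "MAD":
        med = statistics.median(values)
        return med, statistics.median([abs(x - med) for x in values])
    raise ValueError(f"method not covered by this sketch: {method}")

def standardize(values, method, add=0.0, mult=1.0):
    """Apply add + mult * (x - location) / scale to every value."""
    location, scale = location_scale(values, method)
    return [add + mult * (x - location) / scale for x in values]

data = [2.0, 4.0, 4.0, 6.0, 9.0]
by_range = standardize(data, "RANGE")     # values fall in [0, 1]
by_mad = standardize(data, "MAD")         # median maps to 0
by_euclen = standardize(data, "EUCLEN")   # result has unit Euclidean length
```

The robust and data-set-driven methods (ABW, AHUBER, AWAVE, AGK, SPACING, L, IN) are deliberately left out of this sketch.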

EUCLEN

Euclidean length,

$$\sqrt{\sum_{i=1}^{n} x_i^2}$$

where $x_i$ is the $i$th observation and $n$ is the total number of observations in the sample.

L( p )

Minkowski metric. This metric is documented as the LEAST= p option in the PROC FASTCLUS statement of the FASTCLUS procedure (see Chapter 28, 'The FASTCLUS Procedure').

If you specify METHOD=L( p ) in the PROC STDIZE statement, your results are similar to those obtained from PROC FASTCLUS if you specify the LEAST= p option with MAXCLUS=1 (and use the default values of the MAXITER= option). The difference between the two types of calculations concerns the maximum number of iterations. In PROC STDIZE, it is a criterion for convergence on all variables; in PROC FASTCLUS, it is a criterion for convergence on a single variable.

The location and scale measures for L( p ) are output to the OUTSEED= data set in PROC FASTCLUS.

MIDRANGE

(maximum + minimum) / 2

ABW( c )

Tukey's biweight. Refer to Goodall (1983, pp. 376-378, p. 385) for the biweight 1-step M-estimate. Also refer to Iglewicz (1983, pp. 416-418) for the biweight A-estimate.

AHUBER( c )

Huber's. Refer to Goodall (1983, pp. 371-374) for the Huber 1-step M-estimate. Also refer to Iglewicz (1983, pp. 416-418) for the Huber A-estimate of scale.

AWAVE( c )

Andrews' Wave. Refer to Goodall (1983, p. 376) for the Wave 1-step M-estimate. Also refer to Iglewicz (1983, pp. 416-418) for the Wave A-estimate of scale.

AGK( p )

The noniterative univariate form of the estimator described by Art, Gnanadesikan, and Kettenring (1982).

The AGK estimate is documented in the section on the METHOD= option in the PROC ACECLUS statement of the ACECLUS procedure (also see the 'Background' section in Chapter 16, 'The ACECLUS Procedure'). Specifying METHOD=AGK( p ) in the PROC STDIZE statement is the same as specifying METHOD=COUNT and P= p in the PROC ACECLUS statement.

SPACING( p )

The absolute difference between two data values. The minimum spacing for a proportion p is the minimum absolute difference between two data values that contain a proportion p of the data between them. The mid minimum-spacing is the mean of these two data values.
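The SPACING definition can be sketched in Python as a sliding window over the sorted data. The sketch assumes that "a proportion p of the data" means a window of ceil(p × n) consecutive order statistics (at least two); the exact counting convention used by PROC STDIZE is not stated here and may differ. The function name `min_spacing` and the sample data are hypothetical.

```python
import math

def min_spacing(values, p):
    """Return (mid minimum-spacing, minimum spacing) for proportion p.

    Assumption: the window holds m = ceil(p * n) consecutive sorted
    values, with m forced to be at least 2.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    xs = sorted(values)
    n = len(xs)
    m = max(2, math.ceil(p * n))
    if m > n:
        raise ValueError("not enough data for the requested proportion")
    # Slide the window and keep the start index with the narrowest span.
    best = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    low, high = xs[best], xs[best + m - 1]
    return (low + high) / 2, high - low

data = [1.0, 1.2, 1.3, 1.5, 8.0, 8.1, 9.5, 20.0]
location, scale = min_spacing(data, 0.5)   # tightest half of the data
```

For these data the tightest window of four values runs from 1.0 to 1.5, so the location estimate sits inside the dense cluster and ignores the scattered larger values.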

Computing Quantiles

PROC STDIZE offers two methods for computing quantiles: the one-pass approach and the order-statistics approach (like that used in the UNIVARIATE procedure).

The one-pass approach used in PROC STDIZE modifies the P² algorithm for histograms proposed by Jain and Chlamtac (1985). The primary difference comes from the movement of markers. The one-pass method allows a marker to move to the right (or left) by more than one position (to the largest possible integer) as long as it does not result in two markers being in the same position. The modification is necessary in order to incorporate the FREQ variable.

You may obtain inaccurate results if you use the one-pass approach to estimate quantiles beyond the quartiles (that is, when you estimate quantiles < P25 or > P75). A large sample size (10,000 or more) is often required if the tail quantiles (quantiles <= P10 or >= P90) are requested. Note that, for variables with highly skewed or heavy-tailed distributions, tail quantile estimates may be inaccurate.

The order-statistics approach for estimating quantiles is faster than the one-pass method but requires that the entire data set be stored in memory. The accuracy in estimating the quantiles is comparable for both methods when the requested percentiles are between the lower and upper quartiles. The default is PCTLMTD=ORD_STAT if enough memory is available; otherwise, PCTLMTD=ONEPASS.

Computational Methods for the PCTLDEF= Option

You can specify one of five methods for computing quantile statistics when you use the order-statistics approach (PCTLMTD=ORD_STAT). When you use the one-pass approach (PCTLMTD=ONEPASS), the PCTLDEF=5 method is always used.

Let $n$ be the number of nonmissing values for a variable, and let $x_1, x_2, \ldots, x_n$ represent the ordered values of the variable. For the $t$th percentile, let $p = t/100$. In the following definitions numbered 1, 2, 3, and 5, let

$$np = j + g$$

where $j$ is the integer part and $g$ is the fractional part of $np$. For definition 4, let

$$(n+1)p = j + g$$

Given the preceding definitions, the $t$th percentile, $y$, is defined as follows:

PCTLDEF=1

weighted average at $x_{np}$

$$y = (1-g)\,x_j + g\,x_{j+1}$$

where $x_0$ is taken to be $x_1$

PCTLDEF=2

observation numbered closest to $np$

$$y = x_i$$

where $i$ is the integer part of $np + 1/2$ if $g \ne 1/2$. If $g = 1/2$, then

$y = x_j$ if $j$ is even, or

$y = x_{j+1}$ if $j$ is odd

PCTLDEF=3

empirical distribution function

$$y = \begin{cases} x_j & \text{if } g = 0 \\ x_{j+1} & \text{if } g > 0 \end{cases}$$

PCTLDEF=4

weighted average aimed at $x_{p(n+1)}$

$$y = (1-g)\,x_j + g\,x_{j+1}$$

where $x_{n+1}$ is taken to be $x_n$

PCTLDEF=5

empirical distribution function with averaging

$$y = \begin{cases} (x_j + x_{j+1})/2 & \text{if } g = 0 \\ x_{j+1} & \text{if } g > 0 \end{cases}$$
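The five PCTLDEF= definitions can be compared side by side with a short Python sketch. It follows the definitions in this section, with j and g taken from np (or from (n+1)p for definition 4). Two points are assumptions of the sketch rather than statements from the text: out-of-range indices are always clamped to the nearest valid order statistic (the text states this only for x_0 in definition 1 and x_{n+1} in definition 4), and no weights or frequencies are handled. The function name `percentile` is hypothetical.

```python
import math

def percentile(values, t, pctldef=5):
    """t-th percentile of values under PCTLDEF= definitions 1 through 5."""
    xs = sorted(values)
    n = len(xs)
    p = t / 100.0

    def x(k):
        # 1-based order statistic; clamping maps x_0 to x_1 and x_{n+1} to x_n.
        return xs[min(max(k, 1), n) - 1]

    target = (n + 1) * p if pctldef == 4 else n * p
    j = math.floor(target)      # integer part
    g = target - j              # fractional part (subject to float rounding)

    if pctldef in (1, 4):
        return (1 - g) * x(j) + g * x(j + 1)
    if pctldef == 2:
        if g == 0.5:
            return x(j) if j % 2 == 0 else x(j + 1)
        return x(math.floor(n * p + 0.5))
    if pctldef == 3:
        return x(j) if g == 0 else x(j + 1)
    if pctldef == 5:
        return (x(j) + x(j + 1)) / 2 if g == 0 else x(j + 1)
    raise ValueError("pctldef must be 1, 2, 3, 4, or 5")

data = list(range(10, 101, 10))   # 10, 20, ..., 100 (n = 10)
q1_by_def = {d: percentile(data, 25, d) for d in (1, 2, 3, 4, 5)}
median_by_def = {d: percentile(data, 50, d) for d in (1, 2, 3, 4, 5)}
```

On this small sample the first quartile ranges from 20 to 30 depending on the definition, which is why results that depend on the median or quartiles (for example, METHOD=IQR) can shift when PCTLDEF= changes.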

Missing Values

Missing values can be replaced by the location measure or by any specified constant (see the REPLACE option and the MISSING= option). You can also suppress standardization if you want only to replace missing values (see the REPONLY option).

If you specify the NOMISS option, PROC STDIZE omits observations with any missing values in the analyzed variables from computation of the location and scale measures.
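The effect of NOMISS on the location measure can be sketched as follows, using the mean as the location and `None` as the missing value. The sketch assumes that, without NOMISS, each variable's statistic is computed from all of that variable's nonmissing values; the function name `column_means` and the sample rows are hypothetical.

```python
def column_means(rows, variables, nomiss=False):
    """Mean of each variable, treating None as a missing value."""
    if nomiss:
        # NOMISS-style behavior: drop any row with a missing analyzed variable.
        rows = [r for r in rows if all(r[v] is not None for v in variables)]
    means = {}
    for v in variables:
        present = [r[v] for r in rows if r[v] is not None]
        means[v] = sum(present) / len(present)
    return means

rows = [
    {"x": 1.0, "y": 10.0},
    {"x": 6.0, "y": None},    # y is missing in this observation
    {"x": 5.0, "y": 30.0},
]
default_means = column_means(rows, ["x", "y"])               # x uses all three rows
nomiss_means = column_means(rows, ["x", "y"], nomiss=True)   # second row dropped
```

The mean of y is the same in both cases, but the mean of x changes because the second observation no longer contributes once NOMISS-style deletion is applied.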

Output Data Sets

OUT= Data Set

The output data set is a copy of the DATA= data set except that the analyzed variables have been standardized. Analyzed variables are those listed in the VAR statement or, if there is no VAR statement, all numeric variables not listed in any other statement.

OUTSTAT= Data Set

The new data set contains the following variables:

  • the BY variables, if any

  • _TYPE_ , a character variable

  • the analyzed variables

Each observation in the new data set contains a type of statistic as indicated by the _TYPE_ variable. The values of the _TYPE_ variable are as follows:

_TYPE_          Contents
--------------  ------------------------------------------------------------
LOCATION        location measure of each variable
SCALE           scale measure of each variable
ADD             constant specified in the ADD= option. This value is the same for each variable.
MULT            constant specified in the MULT= option. This value is the same for each variable.
N               total number of nonmissing positive frequencies of each variable
NORM            norm measure of each variable. This observation is produced only when you specify the NORM option with METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING or when you specify the SNORM option with METHOD=SPACING.
NObsRead        number of physical records read
NObsUsed        number of physical records used in the analysis
NObsMiss        number of physical records containing missing values
SumFreqsRead    sum of the frequency variable (or the sum of NObsUsed ones when there is no frequency variable) for all observations read
SumFreqsUsed    sum of the frequency variable (or the sum of NObsUsed ones when there is no frequency variable) for all observations used in the analysis
SumWeightsRead  sum of the weight variable (or the sum of NObsUsed ones when there is no weight variable) for all observations read
SumWeightsUsed  sum of the weight variable (or the sum of NObsUsed ones when there is no weight variable) for all observations used in the analysis
Pn              percentiles of each variable, as specified by the PCTLPTS= option. The argument n is any real number such that 0 ≤ n ≤ 100.

Displayed Output

If you specify the PSTAT option, PROC STDIZE displays the following statistics for each variable:

  • the name of the variable, Name

  • the location estimate, Location

  • the scale estimate, Scale

  • the norm estimate, Norm (when you specify the NORM option with METHOD=AGK, METHOD=IQR, METHOD=MAD, or METHOD=SPACING or when you specify the SNORM option with METHOD=SPACING)

  • the total nonmissing positive frequencies, N

ODS Table Names

PROC STDIZE assigns a name to the single table it creates. You can use this name to reference the table when using the Output Delivery System (ODS) to select output or create an output data set. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 66.3: ODS Table Produced in PROC STDIZE

ODS Table Name    Description                    Option
--------------    ---------------------------    ------
Statistics        Location and Scale Measures    PSTAT




SAS.STAT 9.1 Users Guide (Vol. 6)
Year: 2004
Pages: 127
