Details


Missing Values

PROC UNIVARIATE excludes missing values for an analysis variable before calculating statistics. Each analysis variable is treated individually; a missing value for an observation in one variable does not affect the calculations for other variables. The statements handle missing values as follows:

  • If a BY or an ID variable value is missing, PROC UNIVARIATE treats it like any other BY or ID variable value. The missing values form a separate BY group.

  • If the FREQ variable value is missing or nonpositive, PROC UNIVARIATE excludes the observation from the analysis.

  • If the WEIGHT variable value is missing, PROC UNIVARIATE excludes the observation from the analysis.

PROC UNIVARIATE tabulates the number of missing values and reports this information in the ODS table named Missing Values; see the section ODS Table Names on page 309. Before the number of missing values is tabulated, PROC UNIVARIATE excludes observations when

  • you use the FREQ statement and the frequencies are nonpositive.

  • you use the WEIGHT statement and the weights are missing or nonpositive (to exclude nonpositive weights, you must specify the EXCLNPWGT option).
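As a brief illustration, the following minimal sketch (the data set and variable names are hypothetical) automatically excludes the observation with a missing weight and, because the EXCLNPWGT option is specified, also excludes the observation with a zero weight:

   data Sample;
      input X Wt @@;
      datalines;
   10 1  12 2  14 .  16 0  18 1  20 2
   ;
   run;

   proc univariate data=Sample exclnpwgt;
      var X;
      weight Wt;
   run;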

Rounding

When you specify ROUND=u, PROC UNIVARIATE rounds a variable by using the rounding unit to divide the number line into intervals with midpoints of the form ui, where u is the nonnegative rounding unit and i is an integer. The interval width is u. Any variable value that falls in an interval is rounded to the midpoint of that interval. A variable value that is midway between two midpoints, and is therefore on the boundary of two intervals, rounds to the even midpoint. Even midpoints occur when i is an even integer (0, ±2, ±4, . . .).

When ROUND=1 and the analysis variable values are between −2.5 and 2.5, the intervals are as follows:

Table 3.57: Intervals for Rounding When ROUND=1

    i     Interval         Midpoint    Left endpt    Right endpt
                                       rounds to     rounds to
   −2     [−2.5, −1.5]       −2           −2             −2
   −1     [−1.5, −0.5]       −1           −2              0
    0     [−0.5,  0.5]        0            0              0
    1     [ 0.5,  1.5]        1            0              2
    2     [ 1.5,  2.5]        2            2              2

When ROUND=0.5 and the analysis variable values are between −1.25 and 1.25, the intervals are as follows:

Table 3.58: Intervals for Rounding When ROUND=0.5

    i     Interval           Midpoint    Left endpt    Right endpt
                                         rounds to     rounds to
   −2     [−1.25, −0.75]      −1.0          −1             −1
   −1     [−0.75, −0.25]      −0.5          −1              0
    0     [−0.25,  0.25]       0.0           0              0
    1     [ 0.25,  0.75]       0.5           0              1
    2     [ 0.75,  1.25]       1.0           1              1

As the rounding unit increases, the interval width also increases. This reduces the number of unique values and decreases the amount of memory that PROC UNIVARIATE needs.
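The following minimal sketch (hypothetical data and names) uses ROUND=0.5 so that each value of Gap is rounded to the nearest half unit, as in Table 3.58, before the statistics are computed:

   data Measures;
      input Gap @@;
      datalines;
   0.24 0.31 0.47 0.76 0.71 1.23
   ;
   run;

   proc univariate data=Measures round=0.5;
      var Gap;
   run;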

Descriptive Statistics

This section provides computational details for the descriptive statistics that are computed with the PROC UNIVARIATE statement. These statistics can also be saved in the OUT= data set by specifying the keywords listed in Table 3.30 on page 237 in the OUTPUT statement.

Standard algorithms (Fisher 1973) are used to compute the moment statistics. The computational methods used by the UNIVARIATE procedure are consistent with those used by other SAS procedures for calculating descriptive statistics.

The following sections give specific details on a number of statistics calculated by the UNIVARIATE procedure.

Mean

The sample mean is calculated as

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sum

The sum is calculated as $\sum_{i=1}^{n} w_i x_i$, where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to $\sum_{i=1}^{n} x_i$.

Sum of the Weights

The sum of the weights is calculated as $\sum_{i=1}^{n} w_i$, where n is the number of nonmissing values for a variable and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the sum of the weights is n.

Variance

The variance is calculated as

$$s^2 = \frac{1}{d}\sum_{i=1}^{n} w_i \left(x_i - \bar{x}_w\right)^2$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement:

$$d = \begin{cases} n - 1 & \text{VARDEF=DF (default)} \\ n & \text{VARDEF=N} \\ \left(\sum_{i} w_i\right) - 1 & \text{VARDEF=WDF} \\ \sum_{i} w_i & \text{VARDEF=WEIGHT | WGT} \end{cases}$$

If there is no WEIGHT variable, the formula reduces to

$$s^2 = \frac{1}{d}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$

Standard Deviation

The standard deviation is calculated as

$$s_w = \sqrt{\frac{1}{d}\sum_{i=1}^{n} w_i \left(x_i - \bar{x}_w\right)^2}$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement. If there is no WEIGHT variable, the formula reduces to

$$s = \sqrt{\frac{1}{d}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$

Skewness

The sample skewness, which measures the tendency of the deviations to be larger in one direction than in the other, is calculated as follows depending on the VARDEF= option:

Table 3.59: Formulas for Skewness

  VARDEF=DF (default):
  $$\frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} w_i^{3/2} \left(\frac{x_i - \bar{x}_w}{s}\right)^3$$

  VARDEF=N:
  $$\frac{1}{n} \sum_{i=1}^{n} w_i^{3/2} \left(\frac{x_i - \bar{x}_w}{s}\right)^3$$

  VARDEF=WDF: missing

  VARDEF=WEIGHT | WGT: missing

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 2. If there is no WEIGHT variable, then w_i = 1 for all i = 1, . . . , n.

The sample skewness can be positive or negative; it measures the asymmetry of the data distribution and estimates the theoretical skewness $\mu_3 \, \mu_2^{-3/2}$, where µ₂ and µ₃ are the second and third central moments. Observations that are normally distributed should have a skewness near zero.

Kurtosis

The sample kurtosis, which measures the heaviness of tails, is calculated as follows depending on the VARDEF= option:

Table 3.60: Formulas for Kurtosis

  VARDEF=DF (default):
  $$\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} w_i^{2} \left(\frac{x_i - \bar{x}_w}{s_w}\right)^4 \;-\; \frac{3(n-1)^2}{(n-2)(n-3)}$$

  VARDEF=N:
  $$\frac{1}{n} \sum_{i=1}^{n} w_i^{2} \left(\frac{x_i - \bar{x}_w}{s_w}\right)^4 \;-\; 3$$

  VARDEF=WDF: missing

  VARDEF=WEIGHT | WGT: missing

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s_w is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 3. If there is no WEIGHT variable, then w_i = 1 for all i = 1, . . . , n.

The sample kurtosis measures the heaviness of the tails of the data distribution. It estimates the adjusted theoretical kurtosis denoted as β₂ − 3, where $\beta_2 = \mu_4 / \mu_2^2$, and µ₄ is the fourth central moment. Observations that are normally distributed should have a kurtosis near zero.

Coefficient of Variation (CV)

The coefficient of variation is calculated as

$$\mathrm{CV} = \frac{100 \times s}{\bar{x}}$$

Calculating the Mode

The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled Basic Statistical Measures in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.
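For example, in the following hypothetical sketch both 75 and 80 occur twice; the Basic Statistical Measures table reports the lowest mode (75), and the MODES option lists both:

   data Scores;
      input Score @@;
      datalines;
   75 75 80 80 85 90
   ;
   run;

   proc univariate data=Scores modes;
      var Score;
   run;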

Calculating Percentiles

The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles (quantiles), as well as the minimum and maximum of each analysis variable. To compute percentiles other than these default percentiles, use the PCTLPTS= and PCTLPRE= options in the OUTPUT statement.
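As a minimal sketch, the following statements (using the hypothetical Scores data set created in the preceding sketch) compute the 20th, 40th, 60th, and 80th percentiles and store them in an output data set as the variables P20, P40, P60, and P80:

   proc univariate data=Scores noprint;
      var Score;
      output out=Pctls pctlpts=20 40 60 80 pctlpre=P;
   run;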

You can specify one of five definitions for computing the percentiles with the PCTLDEF= option. Let n be the number of nonmissing values for a variable, and let x_1, x_2, . . . , x_n represent the ordered values of the variable. Let the tth percentile be y, set p = t/100, and let

$$np = j + g \quad \text{when PCTLDEF=1, 2, 3, or 5} \qquad\qquad (n+1)p = j + g \quad \text{when PCTLDEF=4}$$

where j is the integer part of np, and g is the fractional part of np. Then the PCTLDEF= option defines the tth percentile, y, as described in the following table:

Table 3.61: Percentile Definitions

  PCTLDEF   Description                            Formula

  1         Weighted average at x_{np}             y = (1 − g) x_j + g x_{j+1},
                                                   where x_0 is taken to be x_1

  2         Observation numbered closest to np     y = x_j if g < 1/2
                                                   y = x_j if g = 1/2 and j is even
                                                   y = x_{j+1} if g = 1/2 and j is odd
                                                   y = x_{j+1} if g > 1/2

  3         Empirical distribution function        y = x_j if g = 0
                                                   y = x_{j+1} if g > 0

  4         Weighted average aimed at x_{(n+1)p}   y = (1 − g) x_j + g x_{j+1},
                                                   where x_{n+1} is taken to be x_n

  5         Empirical distribution function        y = (x_j + x_{j+1})/2 if g = 0
            with averaging                         y = x_{j+1} if g > 0

Weighted Percentiles

When you use a WEIGHT statement, the percentiles are computed differently. The 100pth weighted percentile y is computed from the empirical distribution function with averaging:

$$y = \begin{cases} \frac{1}{2}\left(x_i + x_{i+1}\right) & \text{if } \sum_{j=1}^{i} w_j = pW \\[1ex] x_{i+1} & \text{if } \sum_{j=1}^{i} w_j < pW < \sum_{j=1}^{i+1} w_j \end{cases}$$

where w_i is the weight associated with x_i, and where $W = \sum_{i=1}^{n} w_i$ is the sum of the weights.

Note that the PCTLDEF= option is not applicable when a WEIGHT statement is used. However, in this case, if all the weights are identical, the weighted percentiles are the same as the percentiles that would be computed without a WEIGHT statement and with PCTLDEF=5.

Confidence Limits for Percentiles

You can use the CIPCTLNORMAL option to request confidence limits for percentiles, assuming the data are normally distributed. These limits are described in Section 4.4.1 of Hahn and Meeker (1991). When 0 < p < 1/2, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left[\;\bar{x} - g'\!\left(1 - \tfrac{\alpha}{2};\; 1-p,\, n\right) s,\;\; \bar{x} - g'\!\left(\tfrac{\alpha}{2};\; 1-p,\, n\right) s\;\right]$$

where n is the sample size. When 1/2 ≤ p < 1, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left[\;\bar{x} + g'\!\left(\tfrac{\alpha}{2};\; p,\, n\right) s,\;\; \bar{x} + g'\!\left(1 - \tfrac{\alpha}{2};\; p,\, n\right) s\;\right]$$

One-sided 100(1 − α)% confidence bounds are computed by replacing α/2 by α in the appropriate preceding equation. The factor g′(γ, p, n) is related to the noncentral t distribution and is described in Owen and Hua (1977) and Odeh and Owen (1980). See Example 3.10.

You can use the CIPCTLDF option to request distribution-free confidence limits for percentiles. In particular, it is not necessary to assume that the data are normally distributed. These limits are described in Section 5.2 of Hahn and Meeker (1991). The two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left(X_{(l)},\; X_{(u)}\right)$$

where X_(j) is the jth order statistic when the data values are arranged in increasing order:

$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$$

The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around [np] + 1, where [np] is the integer part of np, and where n is the sample size. Furthermore, l and u are chosen so that X_(l) and X_(u) are as close to X_([np]+1) as possible while satisfying the coverage probability requirement

$$Q(u-1;\, n, p) - Q(l-1;\, n, p) \ge 1 - \alpha$$

where Q(k; n, p) is the cumulative binomial probability

$$Q(k;\, n, p) = \sum_{i=0}^{k} \binom{n}{i}\, p^{i} (1-p)^{n-i}$$

In some cases, the coverage requirement cannot be met, particularly when n is small and p is near 0 or 1. To relax the requirement of symmetry, you can specify CIPCTLDF(TYPE = ASYMMETRIC). This option requests symmetric limits when the coverage requirement can be met, and asymmetric limits otherwise.

If you specify CIPCTLDF(TYPE = LOWER), a one-sided 100(1 − α)% lower confidence bound is computed as X_(l), where l is the largest integer that satisfies the inequality

$$Q(l-1;\, n, p) \le \alpha$$

with 0 < l ≤ n. Likewise, if you specify CIPCTLDF(TYPE = UPPER), a one-sided 100(1 − α)% upper confidence bound is computed as X_(u), where u is the smallest integer that satisfies the inequality

$$Q(u-1;\, n, p) \ge 1 - \alpha$$

with 0 < u ≤ n.

Note that confidence limits for percentiles are not computed when a WEIGHT statement is specified. See Example 3.10.
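As a minimal sketch (the data set and variable names are hypothetical), the following statements request both types of confidence limits for percentiles; CIPCTLNORMAL assumes normality, while CIPCTLDF is distribution-free:

   data Heights;
      input Height @@;
      datalines;
   64.1 65.3 61.8 67.0 62.5 66.2 63.7 65.9 64.8 66.5
   ;
   run;

   proc univariate data=Heights cipctlnormal cipctldf;
      var Height;
   run;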

Tests for Location

PROC UNIVARIATE provides three tests for location: Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value µ₀ against the two-sided alternative that the mean or median is not equal to µ₀. By default, PROC UNIVARIATE sets the value of µ₀ to zero. You can use the MU0= option in the PROC UNIVARIATE statement to specify the value of µ₀. Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the t test is asymptotically equivalent to a z test. If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the t test. You must use the default value for the VARDEF= option in the PROC statement (VARDEF=DF). See Example 3.12.

You can also use these tests to compare means or medians of paired data. Data are said to be paired when subjects or units are matched in pairs according to one or more variables, such as pairs of subjects with the same age and gender. Paired data also occur when each subject or unit is measured at two times or under two conditions. To compare the means or medians of the two times, create an analysis variable that is the difference between the two measures, as in the sketch that follows. The test that the mean or the median difference of the variables equals zero is equivalent to the test that the means or medians of the two original variables are equal. Note that you can also carry out these tests by using the PAIRED statement in the TTEST procedure; refer to Chapter 77, The TTEST Procedure, in SAS/STAT User's Guide. Also see Example 3.13.
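The following minimal sketch of a paired analysis (hypothetical data and names) creates the difference variable and applies all three location tests to it with µ₀ = 0:

   data Paired;
      input Before After @@;
      Diff = After - Before;
      datalines;
   94 91  51 65  95 97  63 75  80 75  92 55
   ;
   run;

   proc univariate data=Paired mu0=0;
      var Diff;
   run;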

Student's t Test

PROC UNIVARIATE calculates the t statistic as

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where x̄ is the sample mean, n is the number of nonmissing values for a variable, and s is the sample standard deviation. The null hypothesis is that the population mean equals µ₀. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic that is as extreme, or more extreme, than the observed value (the p-value) is obtained from the t distribution with n − 1 degrees of freedom. For large n, the t statistic is asymptotically equivalent to a z test. When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the t statistic is calculated as

$$t_w = \frac{\bar{x}_w - \mu_0}{s_w \Big/ \sqrt{\sum_{i=1}^{n} w_i}}$$

where x̄_w is the weighted mean, s_w is the weighted standard deviation, and w_i is the weight for the ith observation. The t_w statistic is treated as having a Student's t distribution with n − 1 degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, n is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, n is the number of nonmissing observations for the WEIGHT variable.

Sign Test

PROC UNIVARIATE calculates the sign test statistic as

$$M = \frac{n^{+} - n^{-}}{2}$$

where n⁺ is the number of values that are greater than µ₀, and n⁻ is the number of values that are less than µ₀. Values equal to µ₀ are discarded. Under the null hypothesis that the population median is equal to µ₀, the p-value for the observed statistic M_obs is

$$\Pr\left(\left|M\right| \ge \left|M_{\mathrm{obs}}\right|\right) = \left(\tfrac{1}{2}\right)^{n_t - 1} \sum_{j=0}^{\min(n^{+},\, n^{-})} \binom{n_t}{j}$$

where n_t = n⁺ + n⁻ is the number of x_i values not equal to µ₀.

Note: If n⁺ and n⁻ are equal, the p-value is equal to one.

Wilcoxon Signed Rank Test

The signed rank statistic S is computed as

$$S = \sum_{i:\; x_i \ne \mu_0} r_i^{+} \;-\; \frac{n_t (n_t + 1)}{4}$$

where r_i⁺ is the rank of |x_i − µ₀| after discarding values of x_i = µ₀, and n_t is the number of x_i values not equal to µ₀. Average ranks are used for tied values.

If n_t ≤ 20, the significance of S is computed from the exact distribution of S, where the distribution is a convolution of scaled binomial distributions. When n_t > 20, the significance of S is computed by treating

$$t = S\,\sqrt{\frac{n_t - 1}{n_t V - S^2}}$$

as a Student's t variate with n_t − 1 degrees of freedom. V is computed as

$$V = \frac{n_t (n_t + 1)(2 n_t + 1)}{24} \;-\; \frac{\sum_i t_i (t_i + 1)(t_i - 1)}{48}$$

where the sum is over groups tied in absolute value and where t_i is the number of values in the ith group (Iman 1974; Conover 1999). The null hypothesis tested is that the mean (or median) is µ₀, assuming that the distribution is symmetric. Refer to Lehmann (1998).

Confidence Limits for Parameters of the Normal Distribution

The two-sided 100(1 − α)% confidence interval for the mean has upper and lower limits

$$\bar{x} \;\pm\; t_{1-\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}}$$

where $s^2 = \frac{1}{n-1}\sum_i \left(x_i - \bar{x}\right)^2$ and $t_{1-\alpha/2;\,n-1}$ is the (1 − α/2) percentile of the t distribution with n − 1 degrees of freedom. The one-sided upper 100(1 − α)% confidence limit is computed as $\bar{x} + t_{1-\alpha;\,n-1}\, s/\sqrt{n}$, and the one-sided lower 100(1 − α)% confidence limit is computed as $\bar{x} - t_{1-\alpha;\,n-1}\, s/\sqrt{n}$. See Example 3.9.

The two-sided 100(1 − α)% confidence interval for the standard deviation has lower and upper limits

$$s\,\sqrt{\frac{n-1}{\chi^2_{1-\alpha/2;\,n-1}}} \qquad\text{and}\qquad s\,\sqrt{\frac{n-1}{\chi^2_{\alpha/2;\,n-1}}}$$

respectively, where $\chi^2_{1-\alpha/2;\,n-1}$ and $\chi^2_{\alpha/2;\,n-1}$ are the (1 − α/2) and α/2 percentiles of the chi-square distribution with n − 1 degrees of freedom. A one-sided 100(1 − α)% confidence limit has lower and upper limits

$$s\,\sqrt{\frac{n-1}{\chi^2_{1-\alpha;\,n-1}}} \qquad\text{and}\qquad s\,\sqrt{\frac{n-1}{\chi^2_{\alpha;\,n-1}}}$$

respectively. The 100(1 − α)% confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation. When you use the WEIGHT statement and specify VARDEF=DF in the PROC statement, the 100(1 − α)% confidence interval for the weighted mean is

$$\bar{x}_w \;\pm\; t_{1-\alpha/2;\,n-1}\,\frac{s_w}{\sqrt{\sum_{i=1}^{n} w_i}}$$

where x̄_w is the weighted mean, s_w is the weighted standard deviation, w_i is the weight for the ith observation, and $t_{1-\alpha/2;\,n-1}$ is the (1 − α/2) percentile for the t distribution with n − 1 degrees of freedom.
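For example, the following minimal sketch (using the hypothetical Heights data set from an earlier sketch) requests the 90% confidence limits for the mean, standard deviation, and variance that correspond to these formulas:

   proc univariate data=Heights cibasic(alpha=0.1);
      var Height;
   run;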

Robust Estimators

A statistical method is robust if it is insensitive to moderate or even large departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale. See Example 3.11.
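The following minimal sketch (again using the hypothetical Heights data set) requests the robust estimators described in the sections that follow: a once-trimmed mean, a 10% trimmed mean, a 10% Winsorized mean, and the table of robust scale measures:

   proc univariate data=Heights trimmed=1 0.1 winsorized=0.1 robustscale;
      var Height;
   run;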

Winsorized Means

The Winsorized mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times Winsorized mean is calculated as

$$\bar{x}_{wk} = \frac{1}{n}\left( (k+1)\, x_{(k+1)} \;+\; \sum_{i=k+2}^{n-k-1} x_{(i)} \;+\; (k+1)\, x_{(n-k)} \right)$$

where n is the number of observations, and x_(i) is the ith order statistic when the observations are arranged in increasing order:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

The Winsorized mean is computed as the ordinary mean after the k smallest observations are replaced by the ( k + 1)st smallest observation, and the k largest observations are replaced by the ( k + 1)st largest observation.

For data from a symmetric distribution, the Winsorized mean is an unbiased estimate of the population mean. However, the Winsorized mean does not have a normal distribution even if the data are from a normal population.

The Winsorized sum of squared deviations is defined as

$$s_{wk}^2 = (k+1)\left(x_{(k+1)} - \bar{x}_{wk}\right)^2 \;+\; \sum_{i=k+2}^{n-k-1} \left(x_{(i)} - \bar{x}_{wk}\right)^2 \;+\; (k+1)\left(x_{(n-k)} - \bar{x}_{wk}\right)^2$$

The Winsorized t statistic is given by

$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{SE}\!\left(\bar{x}_{wk}\right)}$$

where µ₀ denotes the location under the null hypothesis, and the standard error of the Winsorized mean is

$$\mathrm{SE}\!\left(\bar{x}_{wk}\right) = \frac{n-1}{n-2k-1}\cdot\frac{s_{wk}}{\sqrt{n(n-1)}}$$

When the data are from a symmetric distribution, the distribution of t_wk is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).

The Winsorized 100(1 − α)% confidence interval for the location parameter has upper and lower limits

$$\bar{x}_{wk} \;\pm\; t_{1-\alpha/2;\,n-2k-1}\,\mathrm{SE}\!\left(\bar{x}_{wk}\right)$$

where $t_{1-\alpha/2;\,n-2k-1}$ is the 100(1 − α/2) percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.

Trimmed Means

Like the Winsorized mean, the trimmed mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times trimmed mean is calculated as

$$\bar{x}_{tk} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)}$$

where n is the number of observations, and x_(i) is the ith order statistic when the observations are arranged in increasing order:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

The trimmed mean is computed after the k smallest and k largest observations are deleted from the sample. In other words, the observations are trimmed at each end.

For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. However, the trimmed mean does not have a normal distribution even if the data are from a normal population.

A robust estimate of the variance of the trimmed mean x̄_tk can be based on the Winsorized sum of squared deviations s²_wk, which is defined in the section Winsorized Means on page 278; refer to Tukey and McLaughlin (1963). This can be used to compute a trimmed t test which is based on the test statistic

$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{SE}\!\left(\bar{x}_{tk}\right)}$$

where the standard error of the trimmed mean is

$$\mathrm{SE}\!\left(\bar{x}_{tk}\right) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$

When the data are from a symmetric distribution, the distribution of t_tk is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).

The trimmed 100(1 − α)% confidence interval for the location parameter has upper and lower limits

$$\bar{x}_{tk} \;\pm\; t_{1-\alpha/2;\,n-2k-1}\,\mathrm{SE}\!\left(\bar{x}_{tk}\right)$$

where $t_{1-\alpha/2;\,n-2k-1}$ is the 100(1 − α/2) percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.

Robust Estimates of Scale

The sample standard deviation, which is the most commonly used estimator of scale, is sensitive to outliers. Robust scale estimators, on the other hand, remain bounded when a single data value is replaced by an arbitrarily large or small value. The UNIVARIATE procedure computes several robust measures of scale, including the interquartile range, Gini's mean difference G, the median absolute deviation about the median (MAD), Q_n, and S_n. In addition, the procedure computes estimates of the normal standard deviation σ derived from each of these measures.

The interquartile range (IQR) is simply the difference between the upper and lower quartiles. For a normal population, σ can be estimated as IQR/1.34898.

Gini's mean difference is computed as

$$G = \frac{1}{\binom{n}{2}} \sum_{i<j} \left| x_i - x_j \right|$$

For a normal population, the expected value of G is $2\sigma/\sqrt{\pi}$. Thus $G\sqrt{\pi}/2$ is a robust estimator of σ when the data are from a normal sample. For the normal distribution, this estimator has high efficiency relative to the usual sample standard deviation, and it is also less sensitive to the presence of outliers.

A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel 1974), which is computed as

$$\mathrm{MAD} = \mathrm{med}_i\left(\left|x_i - \mathrm{med}_j\!\left(x_j\right)\right|\right)$$

where the inner median, med_j(x_j), is the median of the n observations, and the outer median (taken over i) is the median of the n absolute values of the deviations about the inner median. For a normal population, 1.4826 MAD is an estimator of σ.

The MAD has low efficiency for normal distributions, and it may not always be appropriate for symmetric distributions. Rousseeuw and Croux (1993) proposed two statistics as alternatives to the MAD. The first is

$$S_n = 1.1926\;\mathrm{med}_i\left(\mathrm{med}_j\!\left(\left|x_i - x_j\right|\right)\right)$$

where the outer median (taken over i) is the median of the n medians of |x_i − x_j|, j = 1, 2, . . . , n. To reduce small-sample bias, c_sn S_n is used to estimate σ, where c_sn is a correction factor; refer to Croux and Rousseeuw (1992).

The second statistic proposed by Rousseeuw and Croux (1993) is

$$Q_n = 2.219\;\left\{\, \left|x_i - x_j\right| :\; i < j \,\right\}_{(k)}$$

where

$$k = \binom{h}{2} \qquad\text{and}\qquad h = \left[\frac{n}{2}\right] + 1$$

In other words, Q_n is 2.219 times the kth order statistic of the distances between the data points. The bias-corrected statistic c_qn Q_n is used to estimate σ, where c_qn is a correction factor; refer to Croux and Rousseeuw (1992).

Creating Line Printer Plots

The PLOTS option in the PROC UNIVARIATE statement provides up to four diagnostic line printer plots to examine the data distribution. These plots are the stem-and-leaf plot or horizontal bar chart, the box plot, the normal probability plot, and the side-by-side box plots. If you specify the WEIGHT statement, PROC UNIVARIATE provides a weighted histogram, a weighted box plot based on the weighted quantiles, and a weighted normal probability plot.

Note that these plots are a legacy feature of the UNIVARIATE procedure in earlier versions of SAS. They predate the addition of the HISTOGRAM, PROBPLOT, and QQPLOT statements, which provide high-resolution graphics displays. Also note that line printer plots requested with the PLOTS option are mainly intended for use with the ODS LISTING destination. See Example 3.5.
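For example, the following minimal sketch (using the hypothetical Heights data set from an earlier sketch) sends the line printer plots to the LISTING destination:

   ods listing;
   proc univariate data=Heights plots;
      var Height;
   run;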

Stem-and-Leaf Plot

The first plot in the output is either a stem-and-leaf plot (Tukey 1977) or a horizontal bar chart. If any single interval contains more than 49 observations, the horizontal bar chart appears. Otherwise, the stem-and-leaf plot appears. The stem-and-leaf plot is like a horizontal bar chart in that both plots provide a method to visualize the overall distribution of the data. The stem-and-leaf plot provides more detail because each point in the plot represents an individual data value.

To change the number of stems that the plot displays, use PLOTSIZE= to increase or decrease the number of rows. Instructions that appear below the plot explain how to determine the values of the variable. If no instructions appear, you multiply Stem.Leaf by 1 to determine the values of the variable. For example, if the stem value is 10 and the leaf value is 1, then the variable value is approximately 10.1. For the stem-and-leaf plot, the procedure rounds a variable value to the nearest leaf. If the variable value is exactly halfway between two leaves, the value rounds to the nearest leaf with an even integer value. For example, a variable value of 3.15 has a stem value of 3 and a leaf value of 2.

Box Plot

The box plot, also known as a schematic box plot, appears beside the stem-and-leaf plot. Both plots use the same vertical scale. The box plot provides a visual summary of the data and identifies outliers. The bottom and top edges of the box correspond to the sample 25th (Q1) and 75th (Q3) percentiles. The box length is one interquartile range (Q3 − Q1). The center horizontal line with asterisk endpoints corresponds to the sample median. The central plus sign (+) corresponds to the sample mean. If the mean and median are equal, the plus sign falls on the line inside the box. The vertical lines that project out from the box, called whiskers, extend as far as the data extend, up to a distance of 1.5 interquartile ranges. Values farther away are potential outliers. The procedure identifies the extreme values with a zero or an asterisk (*). If a zero appears, the value is between 1.5 and 3 interquartile ranges from the top or bottom edge of the box. If an asterisk appears, the value is more extreme.

Note: To produce box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User's Guide.

Normal Probability Plot

The normal probability plot plots the empirical quantiles against the quantiles of a standard normal distribution. Asterisks (*) indicate the data values. The plus signs (+) provide a straight reference line that is drawn by using the sample mean and standard deviation. If the data are from a normal distribution, the asterisks tend to fall along the reference line. The vertical coordinate is the data value, and the horizontal coordinate is $\Phi^{-1}(v_i)$ where

$$v_i = \frac{i - \frac{3}{8}}{n + \frac{1}{4}}$$

For a weighted normal probability plot, the ith ordered observation is plotted against $\Phi^{-1}(v_i)$, where v_i is a weighted analogue of the preceding expression in which cumulative sums of the ordered weights take the place of the ranks i. When each observation has an identical weight, w_j = w, the formula for v_i reduces to the expression for v_i in the unweighted normal probability plot.

When the value of VARDEF= is WDF or WEIGHT, a reference line with intercept x̄_w and slope s_w is added to the plot. When the value of VARDEF= is DF or N, the slope is $s_w/\sqrt{\bar{w}}$, where $\bar{w}$ is the average weight.

When each observation has an identical weight and the value of VARDEF= is DF, N, or WEIGHT, the reference line reduces to the usual reference line with intercept x̄ and slope s in the unweighted normal probability plot.

If the data are normally distributed with mean µ and standard deviation σ, and each observation has an identical weight w, then the points on the plot should lie approximately on a straight line. The intercept for this line is µ. The slope is σ when the value of VARDEF= is WDF or WEIGHT, and the slope is $\sigma\sqrt{w}$ when the value of VARDEF= is DF or N.

Note: To produce probability plots using high-resolution graphics, use the PROBPLOT statement in PROC UNIVARIATE; see the section PROBPLOT Statement on page 241.

Side-by-Side Box Plots

When you use a BY statement with the PLOT option, PROC UNIVARIATE produces side-by-side box plots, one for each BY group. The box plots (also known as schematic plots) use a common scale that enables you to compare the data distribution across BY groups. This plot appears after the univariate analyses of all BY groups. Use the NOBYPLOT option to suppress this plot.

Note: To produce side-by-side box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User's Guide.

Creating High-Resolution Graphics

If your site licenses SAS/GRAPH software, you can use the HISTOGRAM, PROBPLOT, and QQPLOT statements to create high-resolution graphs. The HISTOGRAM statement creates histograms that enable you to examine the data distribution. You can optionally fit families of density curves and superimpose kernel density estimates on the histograms. For additional information about the fitted distributions and kernel density estimates, see the section Formulas for Fitted Continuous Distributions on page 288 and the section Kernel Density Estimates on page 297.

The PROBPLOT statement creates a probability plot, which compares ordered values of a variable with percentiles of a specified theoretical distribution. The QQPLOT statement creates a quantile-quantile plot, which compares ordered values of a variable with quantiles of a specified theoretical distribution. You can use these plots to determine how well a theoretical distribution models a data distribution.

Note: You can use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statements to create comparative histograms, probability plots, or Q-Q plots, respectively.

Using the CLASS Statement to Create Comparative Plots

When you use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE creates comparative histograms, comparative probability plots, or comparative quantile-quantile plots. You can use these plot statements with the CLASS statement to create one-way and two-way comparative plots. When you use one class variable, PROC UNIVARIATE displays an array of component plots (stacked or side-by-side), one for each level of the classification variable. When you use two class variables, PROC UNIVARIATE displays a matrix of component plots, one for each combination of levels of the classification variables. The observations in a given level are referred to collectively as a cell.

When you create a one-way comparative plot, the observations in the input data set are sorted by the method specified in the ORDER= option. PROC UNIVARIATE creates a separate plot for the analysis variable values in each level, and arranges these component plots in an array to form the comparative plot with uniform horizontal and vertical axes. See Example 3.15.

When you create a two-way comparative plot, the observations in the input data set are cross-classified according to the values (levels) of these variables. PROC UNIVARIATE creates a separate plot for the analysis variable values in each cell of the cross-classification and arranges these component plots in a matrix to form the comparative plot with uniform horizontal and vertical axes. The levels of the first class variable are the labels for the rows of the matrix, and the levels of the second class variable are the labels for the columns of the matrix. See Example 3.16.

PROC UNIVARIATE determines the layout of a two-way comparative plot by using the order for the first class variable to obtain the order of the rows from top to bottom. Then it applies the order for the second class variable to the observations that correspond to the first row to obtain the order of the columns from left to right. If any columns remain unordered (that is, the categories are unbalanced), PROC UNIVARIATE applies the order for the second class variable to the observations in the second row, and so on, until all the columns have been ordered.

If you associate a label with a variable, PROC UNIVARIATE displays the variable label in the comparative plot and this label is parallel to the column (or row) labels.

Use the MISSING option to treat missing values as valid levels.

To reduce the number of classification levels, use a FORMAT statement to combine variable values.
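As a minimal sketch (the data set, class variable, and analysis variable are hypothetical), the following statements produce a one-way comparative histogram with one component histogram per level of Batch:

   data Steel;
      input Batch $ Width @@;
      datalines;
   A 9.8   A 10.1  A 9.9   A 10.2  A 10.0  A 9.7
   B 10.4  B 10.6  B 10.3  B 10.5  B 10.7  B 10.2
   ;
   run;

   proc univariate data=Steel noprint;
      class Batch;
      histogram Width / nrows=2;
   run;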

Positioning the Inset

Positioning the Inset Using Compass Point Values

To position the inset by using a compass point position, specify the value N, NE, E, SE, S, SW, W, or NW with the POSITION= option. The default position of the inset is NW. The following statements produce a histogram to show the position of the inset for the eight compass points:

   data Score;
      input Student $ PreTest PostTest @@;
      label ScoreChange = 'Change in Test Scores';
      ScoreChange = PostTest - PreTest;
      datalines;
   Capalleti  94 91  Dubose     51 65
   Engles     95 97  Grant      63 75
   Krupski    80 75  Lundsford  92 55
   Mcbane     75 78  Mullen     89 82
   Nguyen     79 76  Patel      71 77
   Si         75 70  Tanaka     87 73
   ;
   run;

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset n     / cfill=blank header='Position = NW' pos=nw;
      inset mean  / cfill=blank header='Position = N ' pos=n ;
      inset sum   / cfill=blank header='Position = NE' pos=ne;
      inset max   / cfill=blank header='Position = E ' pos=e ;
      inset min   / cfill=blank header='Position = SE' pos=se;
      inset nobs  / cfill=blank header='Position = S ' pos=s ;
      inset range / cfill=blank header='Position = SW' pos=sw;
      inset mode  / cfill=blank header='Position = W ' pos=w ;
      label PreTest = 'Pretest Score';
   run;
Figure 3.7: Compass Positions for Inset

Positioning the Inset in the Margins

To position the inset in one of the four margins that surround the plot area, specify the value LM, RM, TM, or BM with the POSITION= option.

Locating the Inset in the Margins

Margin positions are recommended if you list a large number of statistics in the INSET statement. If you attempt to display a lengthy inset in the interior of the plot, it is most likely that the inset will collide with the data display.

Positioning the Inset Using Coordinates

To position the inset with coordinates, use POSITION=(x,y). You specify the coordinates in axis data units or in axis percentage units (the default).

If you specify the DATA option immediately following the coordinates, PROC UNIVARIATE positions the inset by using axis data units. For example, the following statements place the bottom left corner of the inset at 45 on the horizontal axis and 10 on the vertical axis:

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset n / header   = 'Position=(45,10)'
                position = (45,10) data;
   run;
Figure 3.8: Coordinate Position for Inset

By default, the specified coordinates determine the position of the bottom left corner of the inset. To change this reference point, use the REFPOINT= option (see the next example).

If you omit the DATA option, PROC UNIVARIATE positions the inset by using axis percentage units. The coordinates in axis percentage units must be between 0 and 100. The coordinates of the bottom left corner of the display are (0,0), while the upper right corner is (100,100). For example, the following statements create a histogram and use coordinates in axis percentage units to position the two insets:

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset min / position = (5,25)
                  header   = 'Position=(5,25)'
                  refpoint = tl;
      inset max / position = (95,95)
                  header   = 'Position=(95,95)'
                  refpoint = tr;
   run;

The REFPOINT= option determines which corner of the inset to place at the coordinates that are specified with the POSITION= option. The first inset uses REFPOINT=TL, so that the top left corner of the inset is positioned 5% of the way across the horizontal axis and 25% of the way up the vertical axis. The second inset uses REFPOINT=TR, so that the top right corner of the inset is positioned 95% of the way across the horizontal axis and 95% of the way up the vertical axis.

Figure 3.9: Reference Point for Inset

A sample program, univar3.sas , for these examples is available in the SAS Sample Library for Base SAS software.

Formulas for Fitted Continuous Distributions

The following sections provide information on the families of parametric distributions that you can fit with the HISTOGRAM statement. Properties of these distributions are discussed by Johnson, Kotz, and Balakrishnan (1994, 1995).

Beta Distribution

The fitted density function is

$$p(x) = \begin{cases} \dfrac{100h\%}{B(\alpha,\beta)\,\sigma^{\alpha+\beta-1}}\,(x-\theta)^{\alpha-1}\left(\theta+\sigma-x\right)^{\beta-1} & \text{for } \theta < x < \theta + \sigma \\[1.5ex] 0 & \text{for } x \le \theta \text{ or } x \ge \theta + \sigma \end{cases}$$

where $B(\alpha,\beta) = \dfrac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and

  • θ = lower threshold parameter (lower endpoint parameter)

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

  • β = shape parameter (β > 0)

  • h = width of histogram interval

Note: This notation is consistent with that of other distributions that you can fit with the HISTOGRAM statement. However, many texts, including Johnson, Kotz, and Balakrishnan (1995), write the beta density function as

$$f(x) = \frac{(x-a)^{p-1}(b-x)^{q-1}}{B(p,q)\,(b-a)^{p+q-1}} \qquad \text{for } a < x < b$$

The two parameterizations are related as follows:

$$\sigma = b - a \qquad \theta = a \qquad \alpha = p \qquad \beta = q$$

The range of the beta distribution is bounded below by a threshold parameter θ = a and above by θ + σ = b. If you specify a fitted beta curve using the BETA option, θ must be less than the minimum data value, and θ + σ must be greater than the maximum data value. You can specify θ and σ with the THETA= and SIGMA= beta-options in parentheses after the keyword BETA. By default, σ = 1 and θ = 0. If you specify THETA=EST and SIGMA=EST, maximum likelihood estimates are computed for θ and σ. However, three- and four-parameter maximum likelihood estimation may not always converge.

In addition, you can specify α and β with the ALPHA= and BETA= beta-options, respectively. By default, the procedure calculates maximum likelihood estimates for α and β. For example, to fit a beta density curve to a set of data bounded below by 32 and above by 212 with maximum likelihood estimates for α and β, use the following statement:

  histogram Length / beta(theta=32 sigma=180);  

The beta distributions are also referred to as Pearson Type I or II distributions. These include the power-function distribution (β = 1), the arc-sine distribution (α = β = 1/2), and the generalized arc-sine distributions (α + β = 1, β ≠ 1/2).

You can use the DATA step function BETAINV to compute beta quantiles and the DATA step function PROBBETA to compute beta probabilities.
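For example, this minimal DATA step sketch evaluates a beta quantile and a beta probability for shape parameters α = 2 and β = 3:

   data _null_;
      q = betainv(0.95, 2, 3);   /* 95th percentile of a beta(2,3) variable */
      p = probbeta(0.5, 2, 3);   /* Pr(X <= 0.5) for a beta(2,3) variable   */
      put q= p=;
   run;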

Exponential Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma}\,\exp\!\left(-\frac{x-\theta}{\sigma}\right) \qquad \text{for } x \ge \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than or equal to the minimum data value. You can specify θ with the THRESHOLD= exponential-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ with the SCALE= exponential-option. By default, the procedure calculates a maximum likelihood estimate for σ. Note that some authors define the scale parameter as 1/σ.

The exponential distribution is a special case of both the gamma distribution (with α = 1) and the Weibull distribution (with c = 1). A related distribution is the extreme value distribution. If Y = exp(−X) has an exponential distribution, then X has an extreme value distribution.

Gamma Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\,\Gamma(\alpha)}\left(\frac{x-\theta}{\sigma}\right)^{\alpha-1}\exp\!\left(-\frac{x-\theta}{\sigma}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= gamma-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ and α with the SCALE= and ALPHA= gamma-options. By default, the procedure calculates maximum likelihood estimates for σ and α.

The gamma distributions are also referred to as Pearson Type III distributions, and they include the chi-square, exponential, and Erlang distributions. The probability density function for the chi-square distribution is

$$p(x) = \frac{1}{2\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{x}{2}\right)^{\frac{\nu}{2}-1} e^{-x/2} \qquad \text{for } x > 0$$

Notice that this is a gamma distribution with α = ν/2, σ = 2, and θ = 0. The exponential distribution is a gamma distribution with α = 1, and the Erlang distribution is a gamma distribution with α being a positive integer. A related distribution is the Rayleigh distribution. If $R = \sqrt{X_1^2 + \cdots + X_\nu^2}$ where the X_i's are independent standard normal variables, then R is distributed with a χ_ν distribution having a probability density function of

$$p(r) = \frac{r^{\nu-1}\, e^{-r^2/2}}{2^{\frac{\nu}{2}-1}\,\Gamma\!\left(\frac{\nu}{2}\right)} \qquad \text{for } r > 0$$

If ν = 2, the preceding distribution is referred to as the Rayleigh distribution.

You can use the DATA step function GAMINV to compute gamma quantiles and the DATA step function PROBGAM to compute gamma probabilities.

Lognormal Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\sqrt{2\pi}\,(x-\theta)}\exp\!\left(-\frac{\left(\log(x-\theta)-\zeta\right)^2}{2\sigma^2}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • ζ = scale parameter (−∞ < ζ < ∞)

  • σ = shape parameter (σ > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= lognormal-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify ζ and σ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.

Note: The lognormal distribution is also referred to as the S_L distribution in the Johnson system of distributions.

Note: This book uses σ to denote the shape parameter of the lognormal distribution, whereas σ is used to denote the scale parameter of the beta, exponential, gamma, normal, and Weibull distributions. The use of σ to denote the lognormal shape parameter is based on the fact that $\frac{\log(X-\theta)-\zeta}{\sigma}$ has a standard normal distribution if X is lognormally distributed. Based on this relationship, you can use the DATA step function PROBIT to compute lognormal quantiles and the DATA step function PROBNORM to compute probabilities.

Normal Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad \text{for } -\infty < x < \infty$$

where

  • µ = mean

  • σ = standard deviation (σ > 0)

  • h = width of histogram interval

You can specify µ and σ with the MU= and SIGMA= normal-options, respectively. By default, the procedure estimates µ with the sample mean and σ with the sample standard deviation.

You can use the DATA step function PROBIT to compute normal quantiles and the DATA step function PROBNORM to compute probabilities.

Note: The normal distribution is also referred to as the S_N distribution in the Johnson system of distributions.

Weibull Distribution

The fitted density function is

$$p(x) = \frac{100h\%\; c}{\sigma}\left(\frac{x-\theta}{\sigma}\right)^{c-1}\exp\!\left(-\left(\frac{x-\theta}{\sigma}\right)^{c}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • c = shape parameter ( c > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= Weibull-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify σ and c with the SCALE= and SHAPE= Weibull-options, respectively. By default, the procedure calculates maximum likelihood estimates for σ and c.

The exponential distribution is a special case of the Weibull distribution where c = 1.

Goodness-of-Fit Tests

When you specify the NORMAL option in the PROC UNIVARIATE statement or you request a fitted parametric distribution in the HISTOGRAM statement, the procedure computes goodness-of-fit tests for the null hypothesis that the values of the analysis variable are a random sample from the specified theoretical distribution. See Example 3.22.

When you specify the NORMAL option, these tests, which are summarized in the output table labeled Tests for Normality, include the following:

  • Shapiro-Wilk test

  • Kolmogorov-Smirnov test

  • Anderson-Darling test

  • Cramér-von Mises test

The Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramér-von Mises statistic are based on the empirical distribution function (EDF). However, some EDF tests are not supported when certain combinations of the parameters of a specified distribution are estimated. See Table 3.62 on page 296 for a list of the EDF tests available. You determine whether to reject the null hypothesis by examining the p-value that is associated with a goodness-of-fit statistic. When the p-value is less than the predetermined critical value (α), you reject the null hypothesis and conclude that the data did not come from the specified distribution.
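As a minimal sketch (using the hypothetical Heights data set from an earlier sketch), the following statements request the four tests of normality; specifying a distribution option such as NORMAL in a HISTOGRAM statement likewise produces the EDF tests for the fitted curve:

   proc univariate data=Heights normal;
      var Height;
   run;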

Table 3.62: Availability of EDF Tests

  Distribution   Threshold    Scale        Shape               Tests Available

  Beta           θ known      σ known      α, β known          all
                 θ known      σ known      α, β < 5 unknown    all

  Exponential    θ known      σ known                          all
                 θ known      σ unknown                        all
                 θ unknown    σ known                          all
                 θ unknown    σ unknown                        all

  Gamma          θ known      σ known      α known             all
                 θ known      σ unknown    α known             all
                 θ known      σ known      α unknown           all
                 θ known      σ unknown    α unknown           all
                 θ unknown    σ known      α > 1 known         all
                 θ unknown    σ unknown    α > 1 known         all
                 θ unknown    σ known      α > 1 unknown       all
                 θ unknown    σ unknown    α > 1 unknown       all

  Lognormal      θ known      ζ known      σ known             all
                 θ known      ζ known      σ unknown           A² and W²
                 θ known      ζ unknown    σ known             A² and W²
                 θ known      ζ unknown    σ unknown           all
                 θ unknown    ζ known      σ < 3 known         all
                 θ unknown    ζ known      σ < 3 unknown       all
                 θ unknown    ζ unknown    σ < 3 known         all
                 θ unknown    ζ unknown    σ < 3 unknown       all

  Normal         µ known      σ known                          all
                 µ known      σ unknown                        A² and W²
                 µ unknown    σ known                          A² and W²
                 µ unknown    σ unknown                        all

  Weibull        θ known      σ known      c known             all
                 θ known      σ unknown    c known             A² and W²
                 θ known      σ known      c unknown           A² and W²
                 θ known      σ unknown    c unknown           A² and W²
                 θ unknown    σ known      c > 2 known         all
                 θ unknown    σ unknown    c > 2 known         all
                 θ unknown    σ known      c > 2 unknown       all
                 θ unknown    σ unknown    c > 2 unknown       all

(A² denotes the Anderson-Darling statistic, and W² denotes the Cramér-von Mises statistic.)

If you want to test the normality assumptions for analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Because small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, the PROBPLOT statement, and the QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.

Shapiro-Wilk Statistic

If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W (also denoted as W_n to emphasize its dependence on the sample size n). The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965). When n is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992). The statistic W is always greater than zero and less than or equal to one (0 < W ≤ 1).

Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. The method for computing the p-value (the probability of obtaining a W statistic less than or equal to the observed value) depends on n. For n = 3, the probability distribution of W is known and is used to determine the p-value. For n ≥ 4, a normalizing transformation is computed:

$$Z_n = \begin{cases} \dfrac{-\log\!\left(\gamma - \log\left(1 - W_n\right)\right) - \mu}{\sigma} & \text{if } 4 \le n \le 11 \\[1.5ex] \dfrac{\log\left(1 - W_n\right) - \mu}{\sigma} & \text{if } 12 \le n \le 2000 \end{cases}$$

The values of σ, γ, and µ are functions of n obtained from simulation results. Large values of Z_n indicate departure from normality, and since the statistic Z_n has an approximately standard normal distribution, this distribution is used to determine the p-values for n ≥ 4.

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). The EDF tests offer advantages over the traditional chi-square goodness-of-fit test, including improved power and invariance with respect to the histogram midpoints. For a thorough discussion, refer to D'Agostino and Stephens (1986).

The empirical distribution function is defined for a set of n independent observations X₁, . . . , X_n with a common distribution function F(x). Denote the observations ordered from smallest to largest as X_(1), . . . , X_(n). The empirical distribution function, F_n(x), is defined as

$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\[0.5ex] \dfrac{i}{n}, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \ldots, n-1 \\[0.5ex] 1, & X_{(n)} \le x \end{cases}$$

Note that F_n(x) is a step function that takes a step of height 1/n at each observation. This function estimates the distribution function F(x). At any value x, F_n(x) is the proportion of observations less than or equal to x, while F(x) is the probability of an observation less than or equal to x. EDF statistics measure the discrepancy between F_n(x) and F(x).

The computational formulas for the EDF statistics make use of the probability integral transformation U = F ( X ). If F ( X ) is the distribution function of X , the random variable U is uniformly distributed between 0 and 1.

Given n observations X (1) , . . . , X ( n ) , the values U ( i ) = F ( X ( i ) ) are computed by applying the transformation, as discussed in the next three sections.

PROC UNIVARIATE provides three EDF tests:

  • Kolmogorov-Smirnov

  • Anderson-Darling

  • Cram r-von Mises

The following sections provide formal definitions of these EDF statistics.

Kolmogorov D Statistic

The Kolmogorov-Smirnov statistic (D) is defined as

$$D = \sup_x \left| F_n(x) - F(x) \right|$$

The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between F ( x ) and F n ( x ).

The Kolmogorov-Smirnov statistic is computed as the maximum of D + and D ˆ’ , where D + is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function, and D ˆ’ is the largest vertical distance when the EDF is less than the distribution function.

PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.

Anderson-Darling Statistic

The Anderson-Darling statistic and the Cramér-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference (F_n(x) − F(x))². Quadratic statistics have the following general form:

$$Q = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \psi(x)\, dF(x)$$

The function ψ(x) weights the squared difference (F_n(x) − F(x))².

The Anderson-Darling statistic (A²) is defined as

$$A^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \left[F(x)\left(1 - F(x)\right)\right]^{-1} dF(x)$$

Here the weight function is $\psi(x) = \left[F(x)\left(1 - F(x)\right)\right]^{-1}$.

The Anderson-Darling statistic is computed as

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}\left[(2i-1)\,\log U_{(i)} + (2n+1-2i)\,\log\left(1 - U_{(i)}\right)\right]$$
Cramér-von Mises Statistic

The Cramér-von Mises statistic (W²) is defined as

$$W^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 dF(x)$$

Here the weight function is ψ(x) = 1.

The Cramér-von Mises statistic is computed as

$$W^2 = \sum_{i=1}^{n}\left(U_{(i)} - \frac{2i-1}{2n}\right)^2 + \frac{1}{12n}$$
Probability Values of EDF Tests

Once the EDF test statistics are computed, PROC UNIVARIATE computes the associated probability values (p-values). The UNIVARIATE procedure uses internal tables of probability levels similar to those given by D'Agostino and Stephens (1986). If the value is between two probability levels, then linear interpolation is used to estimate the probability value.

The probability value depends upon the parameters that are known and the parameters that are estimated for the distribution. Table 3.62 summarizes the combinations of known and estimated parameters for which the EDF tests are available.

Kernel Density Estimates

You can use the KERNEL option to superimpose kernel density estimates on histograms. Smoothing the data distribution with a kernel density estimate can be more effective than using a histogram to identify features that might be obscured by the choice of histogram bins or sampling variation. A kernel density estimate can also be more effective than a parametric curve fit when the process distribution is multi-modal. See Example 3.23.

The general form of the kernel density estimator is

$$\hat{f}_\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{\lambda}\right)$$

where K(·) is the kernel function, λ is the bandwidth, n is the sample size, and x_i is the ith observation.

The KERNEL option provides three kernel functions (K): normal, quadratic, and triangular. You can specify the function with the K= kernel-option in parentheses after the KERNEL option. Values for the K= option are NORMAL, QUADRATIC, and TRIANGULAR (with aliases of N, Q, and T, respectively). By default, a normal kernel is used. The formulas for the kernel functions are

$$K(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \quad \text{for } -\infty < t < \infty \qquad \text{(normal)}$$

$$K(t) = \frac{3}{4}\left(1 - t^2\right) \quad \text{for } |t| \le 1 \qquad \text{(quadratic)}$$

$$K(t) = 1 - |t| \quad \text{for } |t| \le 1 \qquad \text{(triangular)}$$

The value of λ, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated density function. You specify λ indirectly by specifying a standardized bandwidth c with the C= kernel-option. If Q is the interquartile range, and n is the sample size, then c is related to λ by the formula

$$\lambda = c\, Q\, n^{-1/5}$$

For a specific kernel function, the discrepancy between the density estimator $\hat{f}_\lambda(x)$ and the true density f(x) is measured by the mean integrated square error (MISE):

$$\mathrm{MISE}(\lambda) = E\!\left[\int \left(\hat{f}_\lambda(x) - f(x)\right)^2 dx\right]$$

The MISE is the sum of the integrated squared bias and the variance. An approximate mean integrated square error (AMISE) is

$$\mathrm{AMISE}(\lambda) = \frac{1}{n\lambda}\int K(t)^2\, dt \;+\; \frac{\lambda^4}{4}\left(\int t^2 K(t)\, dt\right)^2 \int \left(f''(x)\right)^2 dx$$

A bandwidth that minimizes AMISE can be derived by treating f ( x ) as the normal density having parameters µ and ƒ estimated by the sample mean and standard deviation. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. For each estimate, the bandwidth parameter c , the kernel function type, and the value of AMISE are reported in the SAS log.

The general kernel density estimates assume that the domain of the density to estimate can take on all values on a real line. However, sometimes the domain of a density is an interval bounded on one or both sides. For example, if a variable Y is a measurement of only positive values, then the kernel density curve should be bounded so that it is zero for negative Y values. You can use the LOWER= and UPPER= kernel-options to specify the bounds.

The UNIVARIATE procedure uses a reflection technique to create the bounded kernel density curve, as described in Silverman (1986, pp. 30-31). It adds the reflections of the kernel density that are outside the boundary to the bounded kernel estimates. The general form of the bounded kernel density estimator is computed by replacing $K\!\left(\frac{x - x_i}{\lambda}\right)$ in the original equation with

$$K\!\left(\frac{x - x_i}{\lambda}\right) + K\!\left(\frac{x + x_i - 2x_l}{\lambda}\right) + K\!\left(\frac{x + x_i - 2x_u}{\lambda}\right)$$

where x_l is the lower bound and x_u is the upper bound.

Without a lower bound, $x_l = -\infty$ and $K\!\left(\frac{x + x_i - 2x_l}{\lambda}\right)$ equals zero. Similarly, without an upper bound, $x_u = \infty$ and $K\!\left(\frac{x + x_i - 2x_u}{\lambda}\right)$ equals zero.

When C=MISE is used with a bounded kernel density, the UNIVARIATE procedure uses a bandwidth that minimizes the AMISE for its corresponding unbounded kernel.
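The following minimal sketch (hypothetical data and names) superimposes a bounded kernel density estimate on a histogram of a positive-valued variable, with the bandwidth chosen to minimize the AMISE:

   data Times;
      input Wait @@;
      datalines;
   0.4 1.2 0.7 2.5 1.9 0.3 3.1 1.1 0.8 2.2
   ;
   run;

   proc univariate data=Times noprint;
      histogram Wait / kernel(c=mise k=normal lower=0);
   run;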

Construction of Quantile-Quantile and Probability Plots

Figure 3.10 illustrates how a Q-Q plot is constructed for a specified theoretical distribution. First, the n nonmissing values of the variable are ordered from smallest to largest:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

Figure 3.10: Construction of a Q-Q Plot

Then the ith ordered value x_(i) is plotted as a point whose y-coordinate is x_(i) and whose x-coordinate is $F^{-1}\!\left(\frac{i - 0.375}{n + 0.25}\right)$, where F(·) is the specified distribution with zero location parameter and unit scale parameter.

You can modify the adjustment constants −0.375 and 0.25 with the RANKADJ= and NADJ= options. This default combination is recommended by Blom (1958). For additional information, refer to Chambers et al. (1983). Since x_(i) is a quantile of the empirical cumulative distribution function (ecdf), a Q-Q plot compares quantiles of the ecdf with quantiles of a theoretical distribution. Probability plots (see the section PROBPLOT Statement on page 241) are constructed the same way, except that the x-axis is scaled nonlinearly in percentiles.

Interpretation of Quantile-Quantile and Probability Plots

The following properties of Q-Q plots and probability plots make them useful diagnostics of how well a specified theoretical distribution fits a set of measurements:

  • If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x .

  • If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b . The slope a and intercept b are visual estimates of the scale and location parameters of the theoretical distribution.

Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters since the x -axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities.

There are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity, and these are summarized in Table 3.63.

Table 3.63: Quantile-Quantile Plot Diagnostics

| Description of Point Pattern | Possible Interpretation |
| --- | --- |
| All but a few points fall on a line | Outliers in the data |
| Left end of pattern is below the line; right end of pattern is above the line | Long tails at both ends of the data distribution |
| Left end of pattern is above the line; right end of pattern is below the line | Short tails at both ends of the data distribution |
| Curved pattern with slope increasing from left to right | Data distribution is skewed to the right |
| Curved pattern with slope decreasing from left to right | Data distribution is skewed to the left |
| Staircase pattern (plateaus and gaps) | Data have been rounded or are discrete |

In some applications, a nonlinear pattern may be more revealing than a linear pattern. However, Chambers et al. (1983) note that departures from linearity can also be due to chance variation.

When the pattern is linear, you can use Q-Q plots to estimate shape, location, and scale parameters and to estimate percentiles. See Example 3.26 through Example 3.34.

Distributions for Probability and Q-Q Plots

You can use the PROBPLOT and QQPLOT statements to request probability and Q-Q plots that are based on the theoretical distributions summarized in Table 3.64.

Table 3.64: Distributions and Parameters

| Distribution | Density Function p(x) | Range | Location | Scale | Shape |
| --- | --- | --- | --- | --- | --- |
| Beta | (x - θ)^{α-1} (θ + σ - x)^{β-1} / (B(α,β) σ^{α+β-1}) | θ < x < θ + σ | θ | σ | α, β |
| Exponential | (1/σ) exp(-(x - θ)/σ) | x ≥ θ | θ | σ |  |
| Gamma | (1/(Γ(α) σ)) ((x - θ)/σ)^{α-1} exp(-(x - θ)/σ) | x > θ | θ | σ | α |
| Lognormal (3-parameter) | (1/(σ √(2π) (x - θ))) exp(-(log(x - θ) - ζ)² / (2σ²)) | x > θ | θ | ζ | σ |
| Normal | (1/(σ √(2π))) exp(-(x - μ)² / (2σ²)) | all x | μ | σ |  |
| Weibull (3-parameter) | (c/σ) ((x - θ)/σ)^{c-1} exp(-((x - θ)/σ)^c) | x > θ | θ | σ | c |
| Weibull (2-parameter) | (c/σ) ((x - θ0)/σ)^{c-1} exp(-((x - θ0)/σ)^c) | x > θ0 | θ0 (known) | σ | c |

You can request these distributions with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, WEIBULL, and WEIBULL2 options, respectively. If you do not specify a distribution option, a normal probability plot or a normal Q-Q plot is created.
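
For example, the following sketch (hypothetical data set Trees with variables Diameter and Volume) requests two of these plots; GAMMA requires its ALPHA= shape option, whereas NORMAL takes no shape parameter:

   proc univariate data=trees noprint;
      qqplot diameter / normal;          /* default normal Q-Q plot */
      qqplot volume   / gamma(alpha=2);  /* gamma requires ALPHA=   */
   run;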

The following sections provide details for constructing Q-Q plots that are based on these distributions. Probability plots are constructed similarly except that the horizontal axis is scaled in percentile units.

Beta Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile B^{-1}_{α,β}((i - 0.375)/(n + 0.25)), where B^{-1}_{α,β}(·) is the inverse normalized incomplete beta function, n is the number of nonmissing observations, and α and β are the shape parameters of the beta distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for ALPHA=α and BETA=β tends to be linear with intercept θ and slope σ if the data are beta distributed with the specific density function

    p(x) = \frac{(x - \theta)^{\alpha - 1} (\theta + \sigma - x)^{\beta - 1}}{B(\alpha, \beta) \, \sigma^{\alpha + \beta - 1}} \quad \text{for } \theta < x < \theta + \sigma

where B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} and

  • θ = lower threshold parameter

  • σ = scale parameter (σ > 0)

  • α = first shape parameter (α > 0)

  • β = second shape parameter (β > 0)

Exponential Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile -log(1 - (i - 0.375)/(n + 0.25)), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot tends to be linear with intercept θ and slope σ if the data are exponentially distributed with the specific density function

    p(x) = \frac{1}{\sigma} \exp\left(-\frac{x - \theta}{\sigma}\right) \quad \text{for } x \ge \theta

where θ is a threshold parameter and σ is a positive scale parameter.

Gamma Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile G^{-1}_α((i - 0.375)/(n + 0.25)), where G^{-1}_α(·) is the inverse normalized incomplete gamma function, n is the number of nonmissing observations, and α is the shape parameter of the gamma distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for ALPHA=α tends to be linear with intercept θ and slope σ if the data are gamma distributed with the specific density function

    p(x) = \frac{1}{\Gamma(\alpha) \, \sigma} \left(\frac{x - \theta}{\sigma}\right)^{\alpha - 1} \exp\left(-\frac{x - \theta}{\sigma}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

Lognormal Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile exp(σ Φ^{-1}((i - 0.375)/(n + 0.25))), where Φ^{-1}(·) is the inverse cumulative standard normal distribution, n is the number of nonmissing observations, and σ is the shape parameter of the lognormal distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for SIGMA=σ tends to be linear with intercept θ and slope exp(ζ) if the data are lognormally distributed with the specific density function

    p(x) = \frac{1}{\sigma \sqrt{2\pi} \, (x - \theta)} \exp\left(-\frac{(\log(x - \theta) - \zeta)^2}{2\sigma^2}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • ζ = scale parameter

  • σ = shape parameter (σ > 0)

See Example 3.26 and Example 3.33.

Normal Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile Φ^{-1}((i - 0.375)/(n + 0.25)), where Φ^{-1}(·) is the inverse cumulative standard normal distribution and n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

The point pattern on the plot tends to be linear with intercept μ and slope σ if the data are normally distributed with the specific density function

    p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \text{for all } x

where μ is the mean and σ is the standard deviation (σ > 0).

Three-Parameter Weibull Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile (-log(1 - (i - 0.375)/(n + 0.25)))^{1/c}, where n is the number of nonmissing observations and c is the Weibull distribution shape parameter. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for C=c tends to be linear with intercept θ and slope σ if the data are Weibull distributed with the specific density function

    p(x) = \frac{c}{\sigma} \left(\frac{x - \theta}{\sigma}\right)^{c - 1} \exp\left(-\left(\frac{x - \theta}{\sigma}\right)^{c}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • c = shape parameter (c > 0)

See Example 3.34.

Two-Parameter Weibull Distribution

To create the plot, the observations are ordered from smallest to largest, and the log of the shifted i th ordered observation x_{(i)}, denoted by log(x_{(i)} - θ0), is plotted against the quantile log(-log(1 - (i - 0.375)/(n + 0.25))), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

Unlike the three-parameter Weibull quantile, the preceding expression is free of distribution parameters. Consequently, the C= shape parameter is not mandatory with the WEIBULL2 distribution option.

The pattern on the plot for THETA=θ0 tends to be linear with intercept log(σ) and slope 1/c if the data are Weibull distributed with the specific density function

    p(x) = \frac{c}{\sigma} \left(\frac{x - \theta_0}{\sigma}\right)^{c - 1} \exp\left(-\left(\frac{x - \theta_0}{\sigma}\right)^{c}\right) \quad \text{for } x > \theta_0

where

  • θ0 = known lower threshold

  • σ = scale parameter (σ > 0)

  • c = shape parameter (c > 0)

See Example 3.34.

Estimating Shape Parameters Using Q-Q Plots

Some of the distribution options in the PROBPLOT or QQPLOT statements require you to specify one or two shape parameters in parentheses after the distribution keyword. These are summarized in Table 3.65.

Table 3.65: Shape Parameter Options

| Distribution Keyword | Mandatory Shape Parameter Option | Range |
| --- | --- | --- |
| BETA | ALPHA=α, BETA=β | α > 0, β > 0 |
| EXPONENTIAL | None |  |
| GAMMA | ALPHA=α | α > 0 |
| LOGNORMAL | SIGMA=σ | σ > 0 |
| NORMAL | None |  |
| WEIBULL | C=c | c > 0 |
| WEIBULL2 | None |  |

You can visually estimate the value of a shape parameter by specifying a list of values for the shape parameter option. A separate plot is produced for each value, and you can then select the value of the shape parameter that produces the most nearly linear point pattern. Alternatively, you can request that the plot be created using an estimated shape parameter. See the entries for the distribution options in the section Dictionary of Options on page 245 (for the PROBPLOT statement) and in the section Dictionary of Options on page 258 (for the QQPLOT statement).

Note: For Q-Q plots created with the WEIBULL2 option, you can estimate the shape parameter c from a linear pattern by using the fact that the slope of the pattern is 1/c.
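
For instance, assuming a hypothetical data set Tests with a variable Lifetime, the following statement produces one Q-Q plot for each shape value in the list, so you can select the value that yields the straightest pattern:

   proc univariate data=tests noprint;
      /* one plot per shape value: c = 0.5, 1, 2 */
      qqplot lifetime / weibull(c=0.5 1 2);
   run;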

Estimating Location and Scale Parameters Using Q-Q Plots

If you specify location and scale parameters for a distribution in a PROBPLOT or QQPLOT statement (or if you request estimates for these parameters), a diagonal distribution reference line is displayed on the plot. (An exception is the two-parameter Weibull distribution, for which a line is displayed when you specify or estimate the scale and shape parameters.) Agreement between this line and the point pattern indicates that the distribution with these parameters is a good fit.

When the point pattern on a Q-Q plot is linear, its intercept and slope provide estimates of the location and scale parameters. (An exception to this rule is the two-parameter Weibull distribution, for which the intercept and slope are related to the scale and shape parameters.)

Table 3.66 shows how the specified parameters determine the intercept and slope of the line. The intercept and slope are based on the quantile scale for the horizontal axis, which is used in Q-Q plots.

Table 3.66: Intercept and Slope of Distribution Reference Line

| Distribution | Location | Scale | Shape | Intercept | Slope |
| --- | --- | --- | --- | --- | --- |
| Beta | θ | σ | α, β | θ | σ |
| Exponential | θ | σ |  | θ | σ |
| Gamma | θ | σ | α | θ | σ |
| Lognormal | θ | ζ | σ | θ | exp(ζ) |
| Normal | μ | σ |  | μ | σ |
| Weibull (3-parameter) | θ | σ | c | θ | σ |
| Weibull (2-parameter) | θ0 (known) | σ | c | log(σ) | 1/c |

For instance, specifying MU=3 and SIGMA=2 with the NORMAL option requests a line with intercept 3 and slope 2. Specifying SIGMA=1 and C=2 with the WEIBULL2 option requests a line with intercept log(1) = 0 and slope 1/2. On a probability plot with the LOGNORMAL and WEIBULL2 options, you can specify the slope directly with the SLOPE= option. That is, for the LOGNORMAL option, specifying THETA=θ and SLOPE=exp(ζ) displays the same line as specifying THETA=θ and ZETA=ζ. For the WEIBULL2 option, specifying SIGMA=σ and SLOPE=1/c displays the same line as specifying SIGMA=σ and C=c.
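
Written out as statements (with hypothetical data set and variable names), those two requests look as follows:

   proc univariate data=tests noprint;
      qqplot score    / normal(mu=3 sigma=2);   /* line: intercept 3, slope 2          */
      qqplot lifetime / weibull2(sigma=1 c=2);  /* line: intercept log(1)=0, slope 1/2 */
   run;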

Estimating Percentiles Using Q-Q Plots

There are two ways to estimate percentiles from a Q-Q plot:

  • Specify the PCTLAXIS option, which adds a percentile axis opposite the theoretical quantile axis. The scale for the percentile axis ranges between 0 and 100 with tick marks at percentile values such as 1, 5, 10, 25, 50, 75, 90, 95, and 99.

  • Specify the PCTLSCALE option, which relabels the horizontal axis tick marks with their percentile equivalents but does not alter their spacing. For example, on a normal Q-Q plot, the tick mark labeled 0 is relabeled as 50 since the 50th percentile corresponds to the zero quantile.

You can also estimate percentiles using probability plots created with the PROBPLOT statement. See Example 3.32.
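
The two options are sketched below (hypothetical names); each statement produces a separate Q-Q plot:

   proc univariate data=tests noprint;
      qqplot score / normal(mu=est sigma=est) pctlaxis;   /* adds a percentile axis          */
      qqplot score / normal(mu=est sigma=est) pctlscale;  /* relabels ticks with percentiles */
   run;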

Input Data Sets

DATA= Data Set

The DATA= data set provides the set of variables that are analyzed. The UNIVARIATE procedure must have a DATA= data set. If you do not specify one with the DATA= option in the PROC UNIVARIATE statement, the procedure uses the last data set created.

ANNOTATE= Data Sets

You can add features to plots by specifying ANNOTATE= data sets either in the PROC UNIVARIATE statement or in individual plot statements.

Information contained in an ANNOTATE= data set specified in the PROC UNIVARIATE statement is used for all plots produced in a given PROC step; this is a global ANNOTATE= data set. By using this global data set, you can keep information common to all high-resolution plots in one data set.

Information contained in the ANNOTATE= data set specified in a plot statement is used only for plots produced by that statement; this is a local ANNOTATE= data set. By using this data set, you can add statement-specific features to plots. For example, you can add different features to plots produced by the HISTOGRAM and QQPLOT statements by specifying an ANNOTATE= data set in each plot statement.

You can specify an ANNOTATE= data set in the PROC UNIVARIATE statement and in plot statements. This enables you to add some features to all plots and also add statement-specific features to plots. See Example 3.25.
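
Schematically (the annotate data sets AnnoAll, AnnoHist, and AnnoQQ are assumed to exist already), the global data set applies to every plot, while each local data set applies only to its own statement:

   proc univariate data=trees annotate=annoall noprint;
      histogram diameter / annotate=annohist;  /* annoall + annohist features */
      qqplot    diameter / annotate=annoqq;    /* annoall + annoqq features   */
   run;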

OUT= Output Data Set in the OUTPUT Statement

PROC UNIVARIATE creates an OUT= data set for each OUTPUT statement. This data set contains an observation for each combination of levels of the variables in the BY statement, or a single observation if you do not specify a BY statement. Thus the number of observations in the new data set corresponds to the number of groups for which statistics are calculated. Without a BY statement, the procedure computes statistics and percentiles by using all the observations in the input data set. With a BY statement, the procedure computes statistics and percentiles by using the observations within each BY group.

The variables in the OUT= data set are as follows:

  • BY statement variables. The values of these variables match the values in the corresponding BY group in the DATA= data set and indicate which BY group each observation summarizes.

  • variables created by selecting statistics in the OUTPUT statement. The statistics are computed using all the nonmissing data, or they are computed for each BY group if you use a BY statement.

  • variables created by requesting new percentiles with the PCTLPTS= option. The names of these new variables depend on the values of the PCTLPRE= and PCTLNAME= options.

If the output data set contains a percentile variable or a quartile variable, the percentile definition assigned with the PCTLDEF= option in the PROC UNIVARIATE statement is recorded in the output data set label. See Example 3.8.
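
For example, the following sketch (hypothetical names) saves the mean and standard deviation and requests two nonstandard percentiles; PCTLPRE=pct produces the variables pct20 and pct40:

   proc univariate data=trees noprint;
      var diameter;
      output out=stats mean=dmean std=dstd
             pctlpts=20 40 pctlpre=pct;   /* creates pct20 and pct40 */
   run;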

The following table lists variables available in the OUT= data set.

Table 3.67: Variables Available in the OUT= Data Set

Descriptive Statistics

| Variable Name | Description |
| --- | --- |
| CSS | Sum of squares corrected for the mean |
| CV | Percent coefficient of variation |
| KURTOSIS | Measurement of the heaviness of tails |
| MAX | Largest (maximum) value |
| MEAN | Arithmetic mean |
| MIN | Smallest (minimum) value |
| MODE | Most frequent value (if not unique, the smallest mode) |
| N | Number of observations on which calculations are based |
| NMISS | Number of missing observations |
| NOBS | Total number of observations |
| RANGE | Difference between the maximum and minimum values |
| SKEWNESS | Measurement of the tendency of the deviations to be larger in one direction than in the other |
| STD | Standard deviation |
| STDMEAN | Standard error of the mean |
| SUM | Sum |
| SUMWGT | Sum of the weights |
| USS | Uncorrected sum of squares |
| VAR | Variance |

Quantile Statistics

| Variable Name | Description |
| --- | --- |
| MEDIAN (P50) | Middle value (50th percentile) |
| P1 | 1st percentile |
| P5 | 5th percentile |
| P10 | 10th percentile |
| P90 | 90th percentile |
| P95 | 95th percentile |
| P99 | 99th percentile |
| Q1 (P25) | Lower quartile (25th percentile) |
| Q3 (P75) | Upper quartile (75th percentile) |
| QRANGE | Difference between the upper and lower quartiles (the interquartile range) |

Robust Statistics

| Variable Name | Description |
| --- | --- |
| GINI | Gini's mean difference |
| MAD | Median absolute difference |
| QN | Second variation of the median absolute difference |
| SN | First variation of the median absolute difference |
| STD_GINI | Standard deviation for Gini's mean difference |
| STD_MAD | Standard deviation for the median absolute difference |
| STD_QN | Standard deviation for the second variation of the median absolute difference |
| STD_QRANGE | Estimate of the standard deviation, based on the interquartile range |
| STD_SN | Standard deviation for the first variation of the median absolute difference |

Hypothesis Test Statistics

| Variable Name | Description |
| --- | --- |
| MSIGN | Sign statistic |
| NORMAL | Test statistic for normality. If the sample size is less than or equal to 2000, this is the Shapiro-Wilk W statistic; otherwise, it is the Kolmogorov D statistic. |
| PROBM | Probability of a greater absolute value for the sign statistic |
| PROBN | Probability that the data came from a normal distribution |
| PROBS | Probability of a greater absolute value for the signed rank statistic |
| PROBT | Two-tailed p-value for Student's t statistic with n - 1 degrees of freedom |
| SIGNRANK | Signed rank statistic |
| T | Student's t statistic to test the null hypothesis that the population mean equals μ0 (the MU0= value) |

OUTHISTOGRAM= Output Data Set

You can create an OUTHISTOGRAM= data set with the HISTOGRAM statement. This data set contains information about histogram intervals. Since you can specify multiple HISTOGRAM statements with the UNIVARIATE procedure, you can create multiple OUTHISTOGRAM= data sets.

An OUTHISTOGRAM= data set contains a group of observations for each variable in the HISTOGRAM statement. The group contains an observation for each interval of the histogram, beginning with the leftmost interval that contains a value of the variable and ending with the rightmost interval that contains a value of the variable. These intervals will not necessarily coincide with the intervals displayed in the histogram since the histogram may be padded with empty intervals at either end. If you superimpose one or more fitted curves on the histogram, the OUTHISTOGRAM= data set contains multiple groups of observations for each variable (one group for each curve). If you use a BY statement, the OUTHISTOGRAM= data set contains groups of observations for each BY group. ID variables are not saved in an OUTHISTOGRAM= data set.

By default, an OUTHISTOGRAM= data set contains the _MIDPT_ variable, whose values identify histogram intervals by their midpoints. When the ENDPOINTS= or NENDPOINTS option is specified, intervals are identified by endpoint values instead. If the RTINCLUDE option is specified, the _MAXPT_ variable contains upper endpoint values. Otherwise, the _MINPT_ variable contains lower endpoint values. See Example 3.18.
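
A minimal sketch (hypothetical names): with ENDPOINTS= and RTINCLUDE, the intervals in the output data set are identified by their upper endpoints in the _MAXPT_ variable:

   proc univariate data=trees noprint;
      histogram diameter / endpoints=0 to 24 by 4 rtinclude
                           outhistogram=histbins;
   run;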

Table 3.68: Variables in the OUTHISTOGRAM= Data Set

| Variable | Description |
| --- | --- |
| _CURVE_ | Name of fitted distribution (if requested in HISTOGRAM statement) |
| _EXPPCT_ | Estimated percent of population in histogram interval, determined from optional fitted distribution |
| _MAXPT_ | Upper endpoint of histogram interval |
| _MIDPT_ | Midpoint of histogram interval |
| _MINPT_ | Lower endpoint of histogram interval |
| _OBSPCT_ | Percent of variable values in histogram interval |
| _VAR_ | Variable name |

Tables for Summary Statistics

By default, PROC UNIVARIATE produces ODS tables of moments, basic statistical measures, tests for location, quantiles, and extreme observations. You must specify options in the PROC UNIVARIATE statement to request other statistics and tables. The CIBASIC option produces a table that displays confidence limits for the mean, standard deviation, and variance. The CIPCTLDF and CIPCTLNORMAL options request tables of confidence limits for the quantiles. The LOCCOUNT option requests a table that shows the number of values greater than, not equal to, and less than the value of MU0=. The FREQ option requests a table of frequency counts. The NEXTRVAL= option requests a table of extreme values. The NORMAL option requests a table with tests for normality.

The TRIMMED=, WINSORIZED=, and ROBUSTSCALE options request tables with robust estimators. The table of trimmed or Winsorized means includes the percentage and the number of observations that are trimmed or Winsorized at each end, the mean and standard error, confidence limits, and the Student's t test. The table with robust measures of scale includes the interquartile range, Gini's mean difference G, MAD, Qn, and Sn, with their corresponding estimates of σ.
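
For example, the following step (hypothetical data set) adds the confidence-limit, trimmed-mean, and robust-scale tables to the default output:

   proc univariate data=trees cibasic trimmed=0.1 robustscale;
      var diameter;
   run;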

See the section ODS Table Names on page 309 for the names of ODS tables created by PROC UNIVARIATE.

ODS Table Names

PROC UNIVARIATE assigns a name to each table that it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets.
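
For example, the following sketch (hypothetical data set) writes the Moments and Quantiles tables to output data sets:

   ods output Moments=mom Quantiles=qtl;
   proc univariate data=trees;
      var diameter;
   run;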

Table 3.69: ODS Tables Produced with the PROC UNIVARIATE Statement

| ODS Table Name | Description | Option |
| --- | --- | --- |
| BasicIntervals | Confidence intervals for mean, standard deviation, variance | CIBASIC |
| BasicMeasures | Measures of location and variability | Default |
| ExtremeObs | Extreme observations | Default |
| ExtremeValues | Extreme values | NEXTRVAL= |
| Frequencies | Frequencies | FREQ |
| LocationCounts | Counts used for sign test and signed rank test | LOCCOUNT |
| MissingValues | Missing values | Default, if missing values exist |
| Modes | Modes | MODES |
| Moments | Sample moments | Default |
| Plots | Line printer plots | PLOTS |
| Quantiles | Quantiles | Default |
| RobustScale | Robust measures of scale | ROBUSTSCALE |
| SSPlots | Line printer side-by-side box plots | PLOTS (with BY statement) |
| TestsForLocation | Tests for location | Default |
| TestsForNormality | Tests for normality | NORMALTEST |
| TrimmedMeans | Trimmed means | TRIMMED= |
| WinsorizedMeans | Winsorized means | WINSORIZED= |

Table 3.70: ODS Tables Produced with the HISTOGRAM Statement

| ODS Table Name | Description | Option |
| --- | --- | --- |
| Bins | Histogram bins | MIDPERCENTS secondary option |
| FitQuantiles | Quantiles of fitted distribution | Any distribution option |
| GoodnessOfFit | Goodness-of-fit tests for fitted distribution | Any distribution option |
| HistogramBins | Histogram bins | MIDPERCENTS option |
| ParameterEstimates | Parameter estimates for fitted distribution | Any distribution option |

ODS Tables for Fitted Distributions

If you request a fitted parametric distribution with a HISTOGRAM statement, PROC UNIVARIATE creates a summary that is organized into the ODS tables described in this section.

Parameters

The ParameterEstimates table lists the estimated (or specified) parameters for the fitted curve as well as the estimated mean and estimated standard deviation. See Formulas for Fitted Continuous Distributions on page 288.

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, the HISTOGRAM statement provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). See EDF Goodness-of-Fit Tests on page 294. These are displayed in the GoodnessOfFit table.

Histogram Intervals

The Bins table is included in the summary only if you specify the MIDPERCENTS option in parentheses after the distribution option. This table lists the midpoints for the histogram bins along with the observed and estimated percentages of the observations that lie in each bin. The estimated percentages are based on the fitted distribution.

If you specify the MIDPERCENTS option without requesting a fitted distribution, the HistogramBins table is included in the summary. This table lists the interval midpoints with the observed percent of observations that lie in the interval. See the entry for the MIDPERCENTS option on page 225.

Quantiles

The FitQuantiles table lists observed and estimated quantiles. You can use the PERCENTS= option to specify the list of quantiles in this table. See the entry for the PERCENTS= option on page 227. By default, the table lists observed and estimated quantiles for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles of the fitted parametric distribution.
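
A sketch combining both kinds of secondary options (hypothetical names): MIDPERCENTS adds the Bins table, and PERCENTS= controls which quantiles appear in the FitQuantiles table:

   proc univariate data=trees;
      histogram diameter / normal(midpercents percents=25 50 75);
   run;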

Computational Resources

Because the UNIVARIATE procedure computes quantile statistics, it requires additional memory to store a copy of the data. By default, the MEANS, SUMMARY, and TABULATE procedures require less memory because they do not automatically compute quantiles. These procedures also provide an option to use a fixed-memory quantile estimation method that is usually less memory intensive.

In the UNIVARIATE procedure, the only factor that limits the number of variables that you can analyze is the computer resources that are available. The amount of temporary storage and CPU time required depends on the statements and the options that you specify. To calculate the computer resources the procedure needs, let

  N     be the number of observations in the data set
  V     be the number of variables in the VAR statement
  U_i   be the number of unique values for the i th variable

Then the minimum memory requirement in bytes to process all variables is M = 24 Σ_i U_i. If M bytes are not available, PROC UNIVARIATE must process the data multiple times to compute all the statistics. This reduces the minimum memory requirement to M = 24 max_i(U_i).

Using the ROUND= option reduces the number of unique values (U_i), thereby reducing memory requirements. The ROBUSTSCALE option requires 40 U_i bytes of temporary storage.
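
For example (hypothetical names), rounding each value to the nearest 0.1 collapses near-duplicate values and shrinks each U_i:

   proc univariate data=bigdata round=0.1 noprint;
      var response;
      output out=stats mean=avg p99=p99;
   run;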

Several factors affect the CPU time:

  • The time to create V tree structures to internally store the observations is proportional to NV log( N ).

  • The time to compute moments and quantiles for the i th variable is proportional to U i .

  • The time to compute the NORMAL option test statistics is proportional to N .

  • The time to compute the ROBUSTSCALE option test statistics is proportional to U i log( U i ).

  • The time to compute the exact significance level of the signed rank statistic may increase when the number of nonzero values is less than or equal to 20.

Each of these factors has a different constant of proportionality. For additional information on optimizing CPU performance and memory usage, see the SAS documentation for your operating environment.



