Details


Missing Values

PROC UNIVARIATE excludes missing values for an analysis variable before calculating statistics. Each analysis variable is treated individually; a missing value for an observation in one variable does not affect the calculations for other variables. The statements handle missing values as follows:

  • If a BY or an ID variable value is missing, PROC UNIVARIATE treats it like any other BY or ID variable value. The missing values form a separate BY group.

  • If the FREQ variable value is missing or nonpositive, PROC UNIVARIATE excludes the observation from the analysis.

  • If the WEIGHT variable value is missing, PROC UNIVARIATE excludes the observation from the analysis.

PROC UNIVARIATE tabulates the number of missing values and reports this information in the ODS table named Missing Values; see the section ODS Table Names on page 309. Before the number of missing values is tabulated, PROC UNIVARIATE excludes observations when

  • you use the FREQ statement and the frequencies are nonpositive.

  • you use the WEIGHT statement and the weights are missing or nonpositive (to exclude nonpositive weights, you must specify the EXCLNPWGT option).
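As a brief illustration, the following minimal sketch (the data set and variable names are hypothetical) automatically excludes the observation with a missing weight and, because the EXCLNPWGT option is specified, also excludes the observation with a zero weight:

   data Sample;
      input X Wt @@;
      datalines;
   10 1  12 2  14 .  16 0  18 1  20 2
   ;
   run;

   proc univariate data=Sample exclnpwgt;
      var X;
      weight Wt;
   run;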

Rounding

When you specify ROUND=u, PROC UNIVARIATE rounds a variable by using the rounding unit to divide the number line into intervals with midpoints of the form ui, where u is the nonnegative rounding unit and i is an integer. The interval width is u. Any variable value that falls in an interval is rounded to the midpoint of that interval. A variable value that is midway between two midpoints, and is therefore on the boundary of two intervals, rounds to the even midpoint. Even midpoints occur when i is an even integer (0, ±2, ±4, . . .).

When ROUND=1 and the analysis variable values are between −2.5 and 2.5, the intervals are as follows:

Table 3.57: Intervals for Rounding When ROUND=1

    i     Interval         Midpoint    Left endpt    Right endpt
                                       rounds to     rounds to
   −2     [−2.5, −1.5]       −2           −2             −2
   −1     [−1.5, −0.5]       −1           −2              0
    0     [−0.5,  0.5]        0            0              0
    1     [ 0.5,  1.5]        1            0              2
    2     [ 1.5,  2.5]        2            2              2

When ROUND=0.5 and the analysis variable values are between −1.25 and 1.25, the intervals are as follows:

Table 3.58: Intervals for Rounding When ROUND=0.5

    i     Interval           Midpoint    Left endpt    Right endpt
                                         rounds to     rounds to
   −2     [−1.25, −0.75]      −1.0          −1             −1
   −1     [−0.75, −0.25]      −0.5          −1              0
    0     [−0.25,  0.25]       0.0           0              0
    1     [ 0.25,  0.75]       0.5           0              1
    2     [ 0.75,  1.25]       1.0           1              1

As the rounding unit increases, the interval width also increases. This reduces the number of unique values and decreases the amount of memory that PROC UNIVARIATE needs.
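The following minimal sketch (hypothetical data and names) uses ROUND=0.5 so that each value of Gap is rounded to the nearest half unit, as in Table 3.58, before the statistics are computed:

   data Measures;
      input Gap @@;
      datalines;
   0.24 0.31 0.47 0.76 0.71 1.23
   ;
   run;

   proc univariate data=Measures round=0.5;
      var Gap;
   run;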

Descriptive Statistics

This section provides computational details for the descriptive statistics that are computed with the PROC UNIVARIATE statement. These statistics can also be saved in the OUT= data set by specifying the keywords listed in Table 3.30 on page 237 in the OUTPUT statement.

Standard algorithms (Fisher 1973) are used to compute the moment statistics. The computational methods used by the UNIVARIATE procedure are consistent with those used by other SAS procedures for calculating descriptive statistics.

The following sections give specific details on a number of statistics calculated by the UNIVARIATE procedure.

Mean

The sample mean is calculated as

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sum

The sum is calculated as $\sum_{i=1}^{n} w_i x_i$, where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to $\sum_{i=1}^{n} x_i$.

Sum of the Weights

The sum of the weights is calculated as $\sum_{i=1}^{n} w_i$, where n is the number of nonmissing values for a variable and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the sum of the weights is n.

Variance

The variance is calculated as

$$s^2 = \frac{1}{d}\sum_{i=1}^{n} w_i \left(x_i - \bar{x}_w\right)^2$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement:

$$d = \begin{cases} n - 1 & \text{VARDEF=DF (default)} \\ n & \text{VARDEF=N} \\ \left(\sum_{i} w_i\right) - 1 & \text{VARDEF=WDF} \\ \sum_{i} w_i & \text{VARDEF=WEIGHT | WGT} \end{cases}$$

If there is no WEIGHT variable, the formula reduces to

$$s^2 = \frac{1}{d}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$

Standard Deviation

The standard deviation is calculated as

$$s_w = \sqrt{\frac{1}{d}\sum_{i=1}^{n} w_i \left(x_i - \bar{x}_w\right)^2}$$

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement. If there is no WEIGHT variable, the formula reduces to

$$s = \sqrt{\frac{1}{d}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$

Skewness

The sample skewness, which measures the tendency of the deviations to be larger in one direction than in the other, is calculated as follows depending on the VARDEF= option:

Table 3.59: Formulas for Skewness

  VARDEF=DF (default):
  $$\frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} w_i^{3/2} \left(\frac{x_i - \bar{x}_w}{s}\right)^3$$

  VARDEF=N:
  $$\frac{1}{n} \sum_{i=1}^{n} w_i^{3/2} \left(\frac{x_i - \bar{x}_w}{s}\right)^3$$

  VARDEF=WDF: missing

  VARDEF=WEIGHT | WGT: missing

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 2. If there is no WEIGHT variable, then w_i = 1 for all i = 1, . . . , n.

The sample skewness can be positive or negative; it measures the asymmetry of the data distribution and estimates the theoretical skewness $\mu_3 \, \mu_2^{-3/2}$, where µ₂ and µ₃ are the second and third central moments. Observations that are normally distributed should have a skewness near zero.

Kurtosis

The sample kurtosis, which measures the heaviness of tails, is calculated as follows depending on the VARDEF= option:

Table 3.60: Formulas for Kurtosis

  VARDEF=DF (default):
  $$\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} w_i^{2} \left(\frac{x_i - \bar{x}_w}{s_w}\right)^4 \;-\; \frac{3(n-1)^2}{(n-2)(n-3)}$$

  VARDEF=N:
  $$\frac{1}{n} \sum_{i=1}^{n} w_i^{2} \left(\frac{x_i - \bar{x}_w}{s_w}\right)^4 \;-\; 3$$

  VARDEF=WDF: missing

  VARDEF=WEIGHT | WGT: missing

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s_w is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 3. If there is no WEIGHT variable, then w_i = 1 for all i = 1, . . . , n.

The sample kurtosis measures the heaviness of the tails of the data distribution. It estimates the adjusted theoretical kurtosis denoted as β₂ − 3, where $\beta_2 = \mu_4 / \mu_2^2$, and µ₄ is the fourth central moment. Observations that are normally distributed should have a kurtosis near zero.

Coefficient of Variation (CV)

The coefficient of variation is calculated as

$$\mathrm{CV} = \frac{100 \times s}{\bar{x}}$$

Calculating the Mode

The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled Basic Statistical Measures in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.
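For example, in the following hypothetical sketch both 75 and 80 occur twice; the Basic Statistical Measures table reports the lowest mode (75), and the MODES option lists both:

   data Scores;
      input Score @@;
      datalines;
   75 75 80 80 85 90
   ;
   run;

   proc univariate data=Scores modes;
      var Score;
   run;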

Calculating Percentiles

The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles (quantiles), as well as the minimum and maximum of each analysis variable. To compute percentiles other than these default percentiles, use the PCTLPTS= and PCTLPRE= options in the OUTPUT statement.
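As a minimal sketch, the following statements (using the hypothetical Scores data set created in the preceding sketch) compute the 20th, 40th, 60th, and 80th percentiles and store them in an output data set as the variables P20, P40, P60, and P80:

   proc univariate data=Scores noprint;
      var Score;
      output out=Pctls pctlpts=20 40 60 80 pctlpre=P;
   run;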

You can specify one of five definitions for computing the percentiles with the PCTLDEF= option. Let n be the number of nonmissing values for a variable, and let x_1, x_2, . . . , x_n represent the ordered values of the variable. Let the tth percentile be y, set p = t/100, and let

$$np = j + g \quad \text{when PCTLDEF=1, 2, 3, or 5} \qquad\qquad (n+1)p = j + g \quad \text{when PCTLDEF=4}$$

where j is the integer part of np, and g is the fractional part of np. Then the PCTLDEF= option defines the tth percentile, y, as described in the following table:

Table 3.61: Percentile Definitions

  PCTLDEF   Description                            Formula

  1         Weighted average at x_{np}             y = (1 − g) x_j + g x_{j+1},
                                                   where x_0 is taken to be x_1

  2         Observation numbered closest to np     y = x_j if g < 1/2
                                                   y = x_j if g = 1/2 and j is even
                                                   y = x_{j+1} if g = 1/2 and j is odd
                                                   y = x_{j+1} if g > 1/2

  3         Empirical distribution function        y = x_j if g = 0
                                                   y = x_{j+1} if g > 0

  4         Weighted average aimed at x_{(n+1)p}   y = (1 − g) x_j + g x_{j+1},
                                                   where x_{n+1} is taken to be x_n

  5         Empirical distribution function        y = (x_j + x_{j+1})/2 if g = 0
            with averaging                         y = x_{j+1} if g > 0

Weighted Percentiles

When you use a WEIGHT statement, the percentiles are computed differently. The 100pth weighted percentile y is computed from the empirical distribution function with averaging:

$$y = \begin{cases} \frac{1}{2}\left(x_i + x_{i+1}\right) & \text{if } \sum_{j=1}^{i} w_j = pW \\[1ex] x_{i+1} & \text{if } \sum_{j=1}^{i} w_j < pW < \sum_{j=1}^{i+1} w_j \end{cases}$$

where w_i is the weight associated with x_i, and where $W = \sum_{i=1}^{n} w_i$ is the sum of the weights.

Note that the PCTLDEF= option is not applicable when a WEIGHT statement is used. However, in this case, if all the weights are identical, the weighted percentiles are the same as the percentiles that would be computed without a WEIGHT statement and with PCTLDEF=5.

Confidence Limits for Percentiles

You can use the CIPCTLNORMAL option to request confidence limits for percentiles, assuming the data are normally distributed. These limits are described in Section 4.4.1 of Hahn and Meeker (1991). When 0 < p < 1/2, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left[\;\bar{x} - g'\!\left(1 - \tfrac{\alpha}{2};\; 1-p,\, n\right) s,\;\; \bar{x} - g'\!\left(\tfrac{\alpha}{2};\; 1-p,\, n\right) s\;\right]$$

where n is the sample size. When 1/2 ≤ p < 1, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left[\;\bar{x} + g'\!\left(\tfrac{\alpha}{2};\; p,\, n\right) s,\;\; \bar{x} + g'\!\left(1 - \tfrac{\alpha}{2};\; p,\, n\right) s\;\right]$$

One-sided 100(1 − α)% confidence bounds are computed by replacing α/2 by α in the appropriate preceding equation. The factor g′(γ, p, n) is related to the noncentral t distribution and is described in Owen and Hua (1977) and Odeh and Owen (1980). See Example 3.10.

You can use the CIPCTLDF option to request distribution-free confidence limits for percentiles. In particular, it is not necessary to assume that the data are normally distributed. These limits are described in Section 5.2 of Hahn and Meeker (1991). The two-sided 100(1 − α)% confidence limits for the 100pth percentile are

$$\left(X_{(l)},\; X_{(u)}\right)$$

where X_(j) is the jth order statistic when the data values are arranged in increasing order:

$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$$

The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around [np] + 1, where [np] is the integer part of np, and where n is the sample size. Furthermore, l and u are chosen so that X_(l) and X_(u) are as close to X_([np]+1) as possible while satisfying the coverage probability requirement

$$Q(u-1;\, n, p) - Q(l-1;\, n, p) \ge 1 - \alpha$$

where Q(k; n, p) is the cumulative binomial probability

$$Q(k;\, n, p) = \sum_{i=0}^{k} \binom{n}{i}\, p^{i} (1-p)^{n-i}$$

In some cases, the coverage requirement cannot be met, particularly when n is small and p is near 0 or 1. To relax the requirement of symmetry, you can specify CIPCTLDF(TYPE = ASYMMETRIC). This option requests symmetric limits when the coverage requirement can be met, and asymmetric limits otherwise.

If you specify CIPCTLDF(TYPE = LOWER), a one-sided 100(1 − α)% lower confidence bound is computed as X_(l), where l is the largest integer that satisfies the inequality

$$Q(l-1;\, n, p) \le \alpha$$

with 0 < l ≤ n. Likewise, if you specify CIPCTLDF(TYPE = UPPER), a one-sided 100(1 − α)% upper confidence bound is computed as X_(u), where u is the smallest integer that satisfies the inequality

$$Q(u-1;\, n, p) \ge 1 - \alpha$$

with 0 < u ≤ n.

Note that confidence limits for percentiles are not computed when a WEIGHT statement is specified. See Example 3.10.
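As a minimal sketch (the data set and variable names are hypothetical), the following statements request both types of confidence limits for percentiles; CIPCTLNORMAL assumes normality, while CIPCTLDF is distribution-free:

   data Heights;
      input Height @@;
      datalines;
   64.1 65.3 61.8 67.0 62.5 66.2 63.7 65.9 64.8 66.5
   ;
   run;

   proc univariate data=Heights cipctlnormal cipctldf;
      var Height;
   run;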

Tests for Location

PROC UNIVARIATE provides three tests for location: Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value µ₀ against the two-sided alternative that the mean or median is not equal to µ₀. By default, PROC UNIVARIATE sets the value of µ₀ to zero. You can use the MU0= option in the PROC UNIVARIATE statement to specify the value of µ₀. Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the t test is asymptotically equivalent to a z test. If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the t test. You must use the default value for the VARDEF= option in the PROC statement (VARDEF=DF). See Example 3.12.

You can also use these tests to compare means or medians of paired data. Data are said to be paired when subjects or units are matched in pairs according to one or more variables, such as pairs of subjects with the same age and gender. Paired data also occur when each subject or unit is measured at two times or under two conditions. To compare the means or medians of the two times, create an analysis variable that is the difference between the two measures, as in the sketch that follows. The test that the mean or the median difference of the variables equals zero is equivalent to the test that the means or medians of the two original variables are equal. Note that you can also carry out these tests by using the PAIRED statement in the TTEST procedure; refer to Chapter 77, The TTEST Procedure, in SAS/STAT User's Guide. Also see Example 3.13.
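The following minimal sketch of a paired analysis (hypothetical data and names) creates the difference variable and applies all three location tests to it with µ₀ = 0:

   data Paired;
      input Before After @@;
      Diff = After - Before;
      datalines;
   94 91  51 65  95 97  63 75  80 75  92 55
   ;
   run;

   proc univariate data=Paired mu0=0;
      var Diff;
   run;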

Student's t Test

PROC UNIVARIATE calculates the t statistic as

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where x̄ is the sample mean, n is the number of nonmissing values for a variable, and s is the sample standard deviation. The null hypothesis is that the population mean equals µ₀. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic that is as extreme, or more extreme, than the observed value (the p-value) is obtained from the t distribution with n − 1 degrees of freedom. For large n, the t statistic is asymptotically equivalent to a z test. When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the t statistic is calculated as

$$t_w = \frac{\bar{x}_w - \mu_0}{s_w \Big/ \sqrt{\sum_{i=1}^{n} w_i}}$$

where x̄_w is the weighted mean, s_w is the weighted standard deviation, and w_i is the weight for the ith observation. The t_w statistic is treated as having a Student's t distribution with n − 1 degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, n is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, n is the number of nonmissing observations for the WEIGHT variable.

Sign Test

PROC UNIVARIATE calculates the sign test statistic as

$$M = \frac{n^{+} - n^{-}}{2}$$

where n⁺ is the number of values that are greater than µ₀, and n⁻ is the number of values that are less than µ₀. Values equal to µ₀ are discarded. Under the null hypothesis that the population median is equal to µ₀, the p-value for the observed statistic M_obs is

$$\Pr\left(\left|M\right| \ge \left|M_{\mathrm{obs}}\right|\right) = \left(\tfrac{1}{2}\right)^{n_t - 1} \sum_{j=0}^{\min(n^{+},\, n^{-})} \binom{n_t}{j}$$

where n_t = n⁺ + n⁻ is the number of x_i values not equal to µ₀.

Note: If n⁺ and n⁻ are equal, the p-value is equal to one.

Wilcoxon Signed Rank Test

The signed rank statistic S is computed as

$$S = \sum_{i:\; x_i \ne \mu_0} r_i^{+} \;-\; \frac{n_t (n_t + 1)}{4}$$

where r_i⁺ is the rank of |x_i − µ₀| after discarding values of x_i = µ₀, and n_t is the number of x_i values not equal to µ₀. Average ranks are used for tied values.

If n_t ≤ 20, the significance of S is computed from the exact distribution of S, where the distribution is a convolution of scaled binomial distributions. When n_t > 20, the significance of S is computed by treating

$$t = S\,\sqrt{\frac{n_t - 1}{n_t V - S^2}}$$

as a Student's t variate with n_t − 1 degrees of freedom. V is computed as

$$V = \frac{n_t (n_t + 1)(2 n_t + 1)}{24} \;-\; \frac{\sum_i t_i (t_i + 1)(t_i - 1)}{48}$$

where the sum is over groups tied in absolute value and where t_i is the number of values in the ith group (Iman 1974; Conover 1999). The null hypothesis tested is that the mean (or median) is µ₀, assuming that the distribution is symmetric. Refer to Lehmann (1998).

Confidence Limits for Parameters of the Normal Distribution

The two-sided 100(1 − α)% confidence interval for the mean has upper and lower limits

$$\bar{x} \;\pm\; t_{1-\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}}$$

where $s^2 = \frac{1}{n-1}\sum_i \left(x_i - \bar{x}\right)^2$ and $t_{1-\alpha/2;\,n-1}$ is the (1 − α/2) percentile of the t distribution with n − 1 degrees of freedom. The one-sided upper 100(1 − α)% confidence limit is computed as $\bar{x} + t_{1-\alpha;\,n-1}\, s/\sqrt{n}$, and the one-sided lower 100(1 − α)% confidence limit is computed as $\bar{x} - t_{1-\alpha;\,n-1}\, s/\sqrt{n}$. See Example 3.9.

The two-sided 100(1 − α)% confidence interval for the standard deviation has lower and upper limits

$$s\,\sqrt{\frac{n-1}{\chi^2_{1-\alpha/2;\,n-1}}} \qquad\text{and}\qquad s\,\sqrt{\frac{n-1}{\chi^2_{\alpha/2;\,n-1}}}$$

respectively, where $\chi^2_{1-\alpha/2;\,n-1}$ and $\chi^2_{\alpha/2;\,n-1}$ are the (1 − α/2) and α/2 percentiles of the chi-square distribution with n − 1 degrees of freedom. A one-sided 100(1 − α)% confidence limit has lower and upper limits

$$s\,\sqrt{\frac{n-1}{\chi^2_{1-\alpha;\,n-1}}} \qquad\text{and}\qquad s\,\sqrt{\frac{n-1}{\chi^2_{\alpha;\,n-1}}}$$

respectively. The 100(1 − α)% confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation. When you use the WEIGHT statement and specify VARDEF=DF in the PROC statement, the 100(1 − α)% confidence interval for the weighted mean is

$$\bar{x}_w \;\pm\; t_{1-\alpha/2;\,n-1}\,\frac{s_w}{\sqrt{\sum_{i=1}^{n} w_i}}$$

where x̄_w is the weighted mean, s_w is the weighted standard deviation, w_i is the weight for the ith observation, and $t_{1-\alpha/2;\,n-1}$ is the (1 − α/2) percentile for the t distribution with n − 1 degrees of freedom.
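For example, the following minimal sketch (using the hypothetical Heights data set from an earlier sketch) requests the 90% confidence limits for the mean, standard deviation, and variance that correspond to these formulas:

   proc univariate data=Heights cibasic(alpha=0.1);
      var Height;
   run;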

Robust Estimators

A statistical method is robust if it is insensitive to moderate or even large departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale. See Example 3.11.
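The following minimal sketch (again using the hypothetical Heights data set) requests the robust estimators described in the sections that follow: a once-trimmed mean, a 10% trimmed mean, a 10% Winsorized mean, and the table of robust scale measures:

   proc univariate data=Heights trimmed=1 0.1 winsorized=0.1 robustscale;
      var Height;
   run;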

Winsorized Means

The Winsorized mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times Winsorized mean is calculated as

$$\bar{x}_{wk} = \frac{1}{n}\left( (k+1)\, x_{(k+1)} \;+\; \sum_{i=k+2}^{n-k-1} x_{(i)} \;+\; (k+1)\, x_{(n-k)} \right)$$

where n is the number of observations, and x_(i) is the ith order statistic when the observations are arranged in increasing order:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

The Winsorized mean is computed as the ordinary mean after the k smallest observations are replaced by the ( k + 1)st smallest observation, and the k largest observations are replaced by the ( k + 1)st largest observation.

For data from a symmetric distribution, the Winsorized mean is an unbiased estimate of the population mean. However, the Winsorized mean does not have a normal distribution even if the data are from a normal population.

The Winsorized sum of squared deviations is defined as

$$s_{wk}^2 = (k+1)\left(x_{(k+1)} - \bar{x}_{wk}\right)^2 \;+\; \sum_{i=k+2}^{n-k-1} \left(x_{(i)} - \bar{x}_{wk}\right)^2 \;+\; (k+1)\left(x_{(n-k)} - \bar{x}_{wk}\right)^2$$

The Winsorized t statistic is given by

$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{SE}\!\left(\bar{x}_{wk}\right)}$$

where µ₀ denotes the location under the null hypothesis, and the standard error of the Winsorized mean is

$$\mathrm{SE}\!\left(\bar{x}_{wk}\right) = \frac{n-1}{n-2k-1}\cdot\frac{s_{wk}}{\sqrt{n(n-1)}}$$

When the data are from a symmetric distribution, the distribution of t_wk is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).

The Winsorized 100(1 − α)% confidence interval for the location parameter has upper and lower limits

$$\bar{x}_{wk} \;\pm\; t_{1-\alpha/2;\,n-2k-1}\,\mathrm{SE}\!\left(\bar{x}_{wk}\right)$$

where $t_{1-\alpha/2;\,n-2k-1}$ is the 100(1 − α/2) percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.

Trimmed Means

Like the Winsorized mean, the trimmed mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times trimmed mean is calculated as

$$\bar{x}_{tk} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)}$$

where n is the number of observations, and x_(i) is the ith order statistic when the observations are arranged in increasing order:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

The trimmed mean is computed after the k smallest and k largest observations are deleted from the sample. In other words, the observations are trimmed at each end.

For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. However, the trimmed mean does not have a normal distribution even if the data are from a normal population.

A robust estimate of the variance of the trimmed mean x̄_tk can be based on the Winsorized sum of squared deviations s²_wk, which is defined in the section Winsorized Means on page 278; refer to Tukey and McLaughlin (1963). This can be used to compute a trimmed t test which is based on the test statistic

$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{SE}\!\left(\bar{x}_{tk}\right)}$$

where the standard error of the trimmed mean is

$$\mathrm{SE}\!\left(\bar{x}_{tk}\right) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$

When the data are from a symmetric distribution, the distribution of t_tk is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).

The trimmed 100(1 − α)% confidence interval for the location parameter has upper and lower limits

$$\bar{x}_{tk} \;\pm\; t_{1-\alpha/2;\,n-2k-1}\,\mathrm{SE}\!\left(\bar{x}_{tk}\right)$$

where $t_{1-\alpha/2;\,n-2k-1}$ is the 100(1 − α/2) percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.

Robust Estimates of Scale

The sample standard deviation, which is the most commonly used estimator of scale, is sensitive to outliers. Robust scale estimators, on the other hand, remain bounded when a single data value is replaced by an arbitrarily large or small value. The UNIVARIATE procedure computes several robust measures of scale, including the interquartile range, Gini's mean difference G, the median absolute deviation about the median (MAD), Q_n, and S_n. In addition, the procedure computes estimates of the normal standard deviation σ derived from each of these measures.

The interquartile range (IQR) is simply the difference between the upper and lower quartiles. For a normal population, σ can be estimated as IQR/1.34898.

Gini's mean difference is computed as

$$G = \frac{1}{\binom{n}{2}} \sum_{i<j} \left| x_i - x_j \right|$$

For a normal population, the expected value of G is $2\sigma/\sqrt{\pi}$. Thus $G\sqrt{\pi}/2$ is a robust estimator of σ when the data are from a normal sample. For the normal distribution, this estimator has high efficiency relative to the usual sample standard deviation, and it is also less sensitive to the presence of outliers.

A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel 1974), which is computed as

$$\mathrm{MAD} = \mathrm{med}_i\left(\left|x_i - \mathrm{med}_j\!\left(x_j\right)\right|\right)$$

where the inner median, med_j(x_j), is the median of the n observations, and the outer median (taken over i) is the median of the n absolute values of the deviations about the inner median. For a normal population, 1.4826 MAD is an estimator of σ.

The MAD has low efficiency for normal distributions, and it may not always be appropriate for symmetric distributions. Rousseeuw and Croux (1993) proposed two statistics as alternatives to the MAD. The first is

$$S_n = 1.1926\;\mathrm{med}_i\left(\mathrm{med}_j\!\left(\left|x_i - x_j\right|\right)\right)$$

where the outer median (taken over i) is the median of the n medians of |x_i − x_j|, j = 1, 2, . . . , n. To reduce small-sample bias, c_sn S_n is used to estimate σ, where c_sn is a correction factor; refer to Croux and Rousseeuw (1992).

The second statistic proposed by Rousseeuw and Croux (1993) is

$$Q_n = 2.219\;\left\{\, \left|x_i - x_j\right| :\; i < j \,\right\}_{(k)}$$

where

$$k = \binom{h}{2} \qquad\text{and}\qquad h = \left[\frac{n}{2}\right] + 1$$

In other words, Q_n is 2.219 times the kth order statistic of the distances between the data points. The bias-corrected statistic c_qn Q_n is used to estimate σ, where c_qn is a correction factor; refer to Croux and Rousseeuw (1992).

Creating Line Printer Plots

The PLOTS option in the PROC UNIVARIATE statement provides up to four diagnostic line printer plots to examine the data distribution. These plots are the stem-and-leaf plot or horizontal bar chart, the box plot, the normal probability plot, and the side-by-side box plots. If you specify the WEIGHT statement, PROC UNIVARIATE provides a weighted histogram, a weighted box plot based on the weighted quantiles, and a weighted normal probability plot.

Note that these plots are a legacy feature of the UNIVARIATE procedure in earlier versions of SAS. They predate the addition of the HISTOGRAM, PROBPLOT, and QQPLOT statements, which provide high-resolution graphics displays. Also note that line printer plots requested with the PLOTS option are mainly intended for use with the ODS LISTING destination. See Example 3.5.
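For example, the following minimal sketch (using the hypothetical Heights data set from an earlier sketch) sends the line printer plots to the LISTING destination:

   ods listing;
   proc univariate data=Heights plots;
      var Height;
   run;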

Stem-and-Leaf Plot

The first plot in the output is either a stem-and-leaf plot (Tukey 1977) or a horizontal bar chart. If any single interval contains more than 49 observations, the horizontal bar chart appears. Otherwise, the stem-and-leaf plot appears. The stem-and-leaf plot is like a horizontal bar chart in that both plots provide a method to visualize the overall distribution of the data. The stem-and-leaf plot provides more detail because each point in the plot represents an individual data value.

To change the number of stems that the plot displays, use PLOTSIZE= to increase or decrease the number of rows. Instructions that appear below the plot explain how to determine the values of the variable. If no instructions appear, you multiply Stem.Leaf by 1 to determine the values of the variable. For example, if the stem value is 10 and the leaf value is 1, then the variable value is approximately 10.1. For the stem-and-leaf plot, the procedure rounds a variable value to the nearest leaf. If the variable value is exactly halfway between two leaves, the value rounds to the nearest leaf with an even integer value. For example, a variable value of 3.15 has a stem value of 3 and a leaf value of 2.

Box Plot

The box plot, also known as a schematic box plot, appears beside the stem-and-leaf plot. Both plots use the same vertical scale. The box plot provides a visual summary of the data and identifies outliers. The bottom and top edges of the box correspond to the sample 25th (Q1) and 75th (Q3) percentiles. The box length is one interquartile range (Q3 − Q1). The center horizontal line with asterisk endpoints corresponds to the sample median. The central plus sign (+) corresponds to the sample mean. If the mean and median are equal, the plus sign falls on the line inside the box. The vertical lines that project out from the box, called whiskers, extend as far as the data extend, up to a distance of 1.5 interquartile ranges. Values farther away are potential outliers. The procedure identifies the extreme values with a zero or an asterisk (*). If a zero appears, the value is between 1.5 and 3 interquartile ranges from the top or bottom edge of the box. If an asterisk appears, the value is more extreme.

Note: To produce box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User's Guide.

Normal Probability Plot

The normal probability plot plots the empirical quantiles against the quantiles of a standard normal distribution. Asterisks (*) indicate the data values. The plus signs (+) provide a straight reference line that is drawn by using the sample mean and standard deviation. If the data are from a normal distribution, the asterisks tend to fall along the reference line. The vertical coordinate is the data value, and the horizontal coordinate is $\Phi^{-1}(v_i)$ where

$$v_i = \frac{i - \frac{3}{8}}{n + \frac{1}{4}}$$

For a weighted normal probability plot, the ith ordered observation is plotted against $\Phi^{-1}(v_i)$, where v_i is a weighted analogue of the preceding expression in which cumulative sums of the ordered weights take the place of the ranks i. When each observation has an identical weight, w_j = w, the formula for v_i reduces to the expression for v_i in the unweighted normal probability plot.

When the value of VARDEF= is WDF or WEIGHT, a reference line with intercept x̄_w and slope s_w is added to the plot. When the value of VARDEF= is DF or N, the slope is $s_w/\sqrt{\bar{w}}$, where $\bar{w}$ is the average weight.

When each observation has an identical weight and the value of VARDEF= is DF, N, or WEIGHT, the reference line reduces to the usual reference line with intercept x̄ and slope s in the unweighted normal probability plot.

If the data are normally distributed with mean µ and standard deviation σ, and each observation has an identical weight w, then the points on the plot should lie approximately on a straight line. The intercept for this line is µ. The slope is σ when the value of VARDEF= is WDF or WEIGHT, and the slope is $\sigma\sqrt{w}$ when the value of VARDEF= is DF or N.

Note: To produce probability plots using high-resolution graphics, use the PROBPLOT statement in PROC UNIVARIATE; see the section PROBPLOT Statement on page 241.

Side-by-Side Box Plots

When you use a BY statement with the PLOT option, PROC UNIVARIATE produces side-by-side box plots, one for each BY group. The box plots (also known as schematic plots) use a common scale that enables you to compare the data distribution across BY groups. This plot appears after the univariate analyses of all BY groups. Use the NOBYPLOT option to suppress this plot.

Note: To produce side-by-side box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User's Guide.

Creating High-Resolution Graphics

If your site licenses SAS/GRAPH software, you can use the HISTOGRAM, PROBPLOT, and QQPLOT statements to create high-resolution graphs. The HISTOGRAM statement creates histograms that enable you to examine the data distribution. You can optionally fit families of density curves and superimpose kernel density estimates on the histograms. For additional information about the fitted distributions and kernel density estimates, see the section Formulas for Fitted Continuous Distributions on page 288 and the section Kernel Density Estimates on page 297.

The PROBPLOT statement creates a probability plot, which compares ordered values of a variable with percentiles of a specified theoretical distribution. The QQPLOT statement creates a quantile-quantile plot, which compares ordered values of a variable with quantiles of a specified theoretical distribution. You can use these plots to determine how well a theoretical distribution models a data distribution.

Note: You can use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statements to create comparative histograms, probability plots, or Q-Q plots, respectively.

Using the CLASS Statement to Create Comparative Plots

When you use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE creates comparative histograms, comparative probability plots, or comparative quantile-quantile plots. You can use these plot statements with the CLASS statement to create one-way and two-way comparative plots. When you use one class variable, PROC UNIVARIATE displays an array of component plots (stacked or side-by-side), one for each level of the classification variable. When you use two class variables, PROC UNIVARIATE displays a matrix of component plots, one for each combination of levels of the classification variables. The observations in a given level are referred to collectively as a cell.

When you create a one-way comparative plot, the observations in the input data set are sorted by the method specified in the ORDER= option. PROC UNIVARIATE creates a separate plot for the analysis variable values in each level, and arranges these component plots in an array to form the comparative plot with uniform horizontal and vertical axes. See Example 3.15.

When you create a two-way comparative plot, the observations in the input data set are cross-classified according to the values (levels) of these variables. PROC UNIVARIATE creates a separate plot for the analysis variable values in each cell of the cross-classification and arranges these component plots in a matrix to form the comparative plot with uniform horizontal and vertical axes. The levels of the first class variable are the labels for the rows of the matrix, and the levels of the second class variable are the labels for the columns of the matrix. See Example 3.16.

PROC UNIVARIATE determines the layout of a two-way comparative plot by using the order for the first class variable to obtain the order of the rows from top to bottom. Then it applies the order for the second class variable to the observations that correspond to the first row to obtain the order of the columns from left to right. If any columns remain unordered (that is, the categories are unbalanced), PROC UNIVARIATE applies the order for the second class variable to the observations in the second row, and so on, until all the columns have been ordered.

If you associate a label with a variable, PROC UNIVARIATE displays the variable label in the comparative plot and this label is parallel to the column (or row) labels.

Use the MISSING option to treat missing values as valid levels.

To reduce the number of classification levels, use a FORMAT statement to combine variable values.
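As a minimal sketch (the data set, class variable, and analysis variable are hypothetical), the following statements produce a one-way comparative histogram with one component histogram per level of Batch:

   data Steel;
      input Batch $ Width @@;
      datalines;
   A 9.8   A 10.1  A 9.9   A 10.2  A 10.0  A 9.7
   B 10.4  B 10.6  B 10.3  B 10.5  B 10.7  B 10.2
   ;
   run;

   proc univariate data=Steel noprint;
      class Batch;
      histogram Width / nrows=2;
   run;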

Positioning the Inset

Positioning the Inset Using Compass Point Values

To position the inset by using a compass point position, specify the value N, NE, E, SE, S, SW, W, or NW with the POSITION= option. The default position of the inset is NW. The following statements produce a histogram to show the position of the inset for the eight compass points:

   data Score;
      input Student $ PreTest PostTest @@;
      label ScoreChange = 'Change in Test Scores';
      ScoreChange = PostTest - PreTest;
      datalines;
   Capalleti  94 91  Dubose     51 65
   Engles     95 97  Grant      63 75
   Krupski    80 75  Lundsford  92 55
   Mcbane     75 78  Mullen     89 82
   Nguyen     79 76  Patel      71 77
   Si         75 70  Tanaka     87 73
   ;
   run;

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset n     / cfill=blank header='Position = NW' pos=nw;
      inset mean  / cfill=blank header='Position = N ' pos=n ;
      inset sum   / cfill=blank header='Position = NE' pos=ne;
      inset max   / cfill=blank header='Position = E ' pos=e ;
      inset min   / cfill=blank header='Position = SE' pos=se;
      inset nobs  / cfill=blank header='Position = S ' pos=s ;
      inset range / cfill=blank header='Position = SW' pos=sw;
      inset mode  / cfill=blank header='Position = W ' pos=w ;
      label PreTest = 'Pretest Score';
   run;
Figure 3.7: Compass Positions for Inset

Positioning the Inset in the Margins

To position the inset in one of the four margins that surround the plot area, specify the value LM, RM, TM, or BM with the POSITION= option.

Locating the Inset in the Margins

Margin positions are recommended if you list a large number of statistics in the INSET statement. If you attempt to display a lengthy inset in the interior of the plot, it is most likely that the inset will collide with the data display.

Positioning the Inset Using Coordinates

To position the inset with coordinates, use POSITION=(x,y). You specify the coordinates in axis data units or in axis percentage units (the default).

If you specify the DATA option immediately following the coordinates, PROC UNIVARIATE positions the inset by using axis data units. For example, the following statements place the bottom left corner of the inset at 45 on the horizontal axis and 10 on the vertical axis:

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset n / header   = 'Position=(45,10)'
                position = (45,10) data;
   run;
Figure 3.8: Coordinate Position for Inset

By default, the specified coordinates determine the position of the bottom left corner of the inset. To change this reference point, use the REFPOINT= option (see the next example).

If you omit the DATA option, PROC UNIVARIATE positions the inset by using axis percentage units. The coordinates in axis percentage units must be between 0 and 100. The coordinates of the bottom left corner of the display are (0,0), while the upper right corner is (100,100). For example, the following statements create a histogram and use coordinates in axis percentage units to position the two insets:

   title 'Test Scores for a College Course';
   proc univariate data=Score noprint;
      histogram PreTest / midpoints = 45 to 95 by 10;
      inset min / position = (5,25)
                  header   = 'Position=(5,25)'
                  refpoint = tl;
      inset max / position = (95,95)
                  header   = 'Position=(95,95)'
                  refpoint = tr;
   run;

The REFPOINT= option determines which corner of the inset to place at the coordinates that are specified with the POSITION= option. The first inset uses REFPOINT=TL, so that the top left corner of the inset is positioned 5% of the way across the horizontal axis and 25% of the way up the vertical axis. The second inset uses REFPOINT=TR, so that the top right corner of the inset is positioned 95% of the way across the horizontal axis and 95% of the way up the vertical axis.

Figure 3.9: Reference Point for Inset

A sample program, univar3.sas , for these examples is available in the SAS Sample Library for Base SAS software.

Formulas for Fitted Continuous Distributions

The following sections provide information on the families of parametric distributions that you can fit with the HISTOGRAM statement. Properties of these distributions are discussed by Johnson, Kotz, and Balakrishnan (1994, 1995).

Beta Distribution

The fitted density function is

$$p(x) = \begin{cases} \dfrac{100h\%}{B(\alpha,\beta)\,\sigma^{\alpha+\beta-1}}\,(x-\theta)^{\alpha-1}\left(\theta+\sigma-x\right)^{\beta-1} & \text{for } \theta < x < \theta + \sigma \\[1.5ex] 0 & \text{for } x \le \theta \text{ or } x \ge \theta + \sigma \end{cases}$$

where $B(\alpha,\beta) = \dfrac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and

  • θ = lower threshold parameter (lower endpoint parameter)

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

  • β = shape parameter (β > 0)

  • h = width of histogram interval

Note: This notation is consistent with that of other distributions that you can fit with the HISTOGRAM statement. However, many texts, including Johnson, Kotz, and Balakrishnan (1995), write the beta density function as

$$f(x) = \frac{(x-a)^{p-1}(b-x)^{q-1}}{B(p,q)\,(b-a)^{p+q-1}} \qquad \text{for } a < x < b$$

The two parameterizations are related as follows:

$$\sigma = b - a \qquad \theta = a \qquad \alpha = p \qquad \beta = q$$

The range of the beta distribution is bounded below by a threshold parameter θ = a and above by θ + σ = b. If you specify a fitted beta curve using the BETA option, θ must be less than the minimum data value, and θ + σ must be greater than the maximum data value. You can specify θ and σ with the THETA= and SIGMA= beta-options in parentheses after the keyword BETA. By default, σ = 1 and θ = 0. If you specify THETA=EST and SIGMA=EST, maximum likelihood estimates are computed for θ and σ. However, three- and four-parameter maximum likelihood estimation may not always converge.

In addition, you can specify α and β with the ALPHA= and BETA= beta-options, respectively. By default, the procedure calculates maximum likelihood estimates for α and β. For example, to fit a beta density curve to a set of data bounded below by 32 and above by 212 with maximum likelihood estimates for α and β, use the following statement:

  histogram Length / beta(theta=32 sigma=180);  

The beta distributions are also referred to as Pearson Type I or II distributions. These include the power-function distribution (β = 1), the arc-sine distribution (α = β = 1/2), and the generalized arc-sine distributions (α + β = 1, β ≠ 1/2).

You can use the DATA step function BETAINV to compute beta quantiles and the DATA step function PROBBETA to compute beta probabilities.
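For example, this minimal DATA step sketch evaluates a beta quantile and a beta probability for shape parameters α = 2 and β = 3:

   data _null_;
      q = betainv(0.95, 2, 3);   /* 95th percentile of a beta(2,3) variable */
      p = probbeta(0.5, 2, 3);   /* Pr(X <= 0.5) for a beta(2,3) variable   */
      put q= p=;
   run;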

Exponential Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma}\,\exp\!\left(-\frac{x-\theta}{\sigma}\right) \qquad \text{for } x \ge \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than or equal to the minimum data value. You can specify θ with the THRESHOLD= exponential-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ with the SCALE= exponential-option. By default, the procedure calculates a maximum likelihood estimate for σ. Note that some authors define the scale parameter as 1/σ.

The exponential distribution is a special case of both the gamma distribution (with α = 1) and the Weibull distribution (with c = 1). A related distribution is the extreme value distribution. If Y = exp(−X) has an exponential distribution, then X has an extreme value distribution.

Gamma Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\,\Gamma(\alpha)}\left(\frac{x-\theta}{\sigma}\right)^{\alpha-1}\exp\!\left(-\frac{x-\theta}{\sigma}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= gamma-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ and α with the SCALE= and ALPHA= gamma-options. By default, the procedure calculates maximum likelihood estimates for σ and α.

The gamma distributions are also referred to as Pearson Type III distributions, and they include the chi-square, exponential, and Erlang distributions. The probability density function for the chi-square distribution is

$$p(x) = \frac{1}{2\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{x}{2}\right)^{\frac{\nu}{2}-1} e^{-x/2} \qquad \text{for } x > 0$$

Notice that this is a gamma distribution with α = ν/2, σ = 2, and θ = 0. The exponential distribution is a gamma distribution with α = 1, and the Erlang distribution is a gamma distribution with α being a positive integer. A related distribution is the Rayleigh distribution. If $R = \sqrt{X_1^2 + \cdots + X_\nu^2}$ where the X_i's are independent standard normal variables, then R is distributed with a χ_ν distribution having a probability density function of

$$p(r) = \frac{r^{\nu-1}\, e^{-r^2/2}}{2^{\frac{\nu}{2}-1}\,\Gamma\!\left(\frac{\nu}{2}\right)} \qquad \text{for } r > 0$$

If ν = 2, the preceding distribution is referred to as the Rayleigh distribution.

You can use the DATA step function GAMINV to compute gamma quantiles and the DATA step function PROBGAM to compute gamma probabilities.

Lognormal Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\sqrt{2\pi}\,(x-\theta)}\exp\!\left(-\frac{\left(\log(x-\theta)-\zeta\right)^2}{2\sigma^2}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • ζ = scale parameter (−∞ < ζ < ∞)

  • σ = shape parameter (σ > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= lognormal-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify ζ and σ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.

Note: The lognormal distribution is also referred to as the S_L distribution in the Johnson system of distributions.

Note: This book uses σ to denote the shape parameter of the lognormal distribution, whereas σ is used to denote the scale parameter of the beta, exponential, gamma, normal, and Weibull distributions. The use of σ to denote the lognormal shape parameter is based on the fact that $\frac{\log(X-\theta)-\zeta}{\sigma}$ has a standard normal distribution if X is lognormally distributed. Based on this relationship, you can use the DATA step function PROBIT to compute lognormal quantiles and the DATA step function PROBNORM to compute probabilities.

Normal Distribution

The fitted density function is

$$p(x) = \frac{100h\%}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad \text{for } -\infty < x < \infty$$

where

  • µ = mean

  • σ = standard deviation (σ > 0)

  • h = width of histogram interval

You can specify µ and σ with the MU= and SIGMA= normal-options, respectively. By default, the procedure estimates µ with the sample mean and σ with the sample standard deviation.

You can use the DATA step function PROBIT to compute normal quantiles and the DATA step function PROBNORM to compute probabilities.

Note: The normal distribution is also referred to as the S_N distribution in the Johnson system of distributions.

Weibull Distribution

The fitted density function is

$$p(x) = \frac{100h\%\; c}{\sigma}\left(\frac{x-\theta}{\sigma}\right)^{c-1}\exp\!\left(-\left(\frac{x-\theta}{\sigma}\right)^{c}\right) \qquad \text{for } x > \theta$$

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • c = shape parameter ( c > 0)

  • h = width of histogram interval

The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= Weibull-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify σ and c with the SCALE= and SHAPE= Weibull-options, respectively. By default, the procedure calculates maximum likelihood estimates for σ and c.

The exponential distribution is a special case of the Weibull distribution where c = 1.

Goodness-of-Fit Tests

When you specify the NORMAL option in the PROC UNIVARIATE statement or you request a fitted parametric distribution in the HISTOGRAM statement, the procedure computes goodness-of-fit tests for the null hypothesis that the values of the analysis variable are a random sample from the specified theoretical distribution. See Example 3.22.

When you specify the NORMAL option, these tests, which are summarized in the output table labeled Tests for Normality, include the following:

  • Shapiro-Wilk test

  • Kolmogorov-Smirnov test

  • Anderson-Darling test

  • Cramér-von Mises test

The Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramér-von Mises statistic are based on the empirical distribution function (EDF). However, some EDF tests are not supported when certain combinations of the parameters of a specified distribution are estimated. See Table 3.62 on page 296 for a list of the EDF tests available. You determine whether to reject the null hypothesis by examining the p-value that is associated with a goodness-of-fit statistic. When the p-value is less than the predetermined critical value (α), you reject the null hypothesis and conclude that the data did not come from the specified distribution.
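As a minimal sketch (using the hypothetical Heights data set from an earlier sketch), the following statements request the four tests of normality; specifying a distribution option such as NORMAL in a HISTOGRAM statement likewise produces the EDF tests for the fitted curve:

   proc univariate data=Heights normal;
      var Height;
   run;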

Table 3.62: Availability of EDF Tests

  Distribution   Threshold    Scale        Shape               Tests Available

  Beta           θ known      σ known      α, β known          all
                 θ known      σ known      α, β < 5 unknown    all

  Exponential    θ known      σ known                          all
                 θ known      σ unknown                        all
                 θ unknown    σ known                          all
                 θ unknown    σ unknown                        all

  Gamma          θ known      σ known      α known             all
                 θ known      σ unknown    α known             all
                 θ known      σ known      α unknown           all
                 θ known      σ unknown    α unknown           all
                 θ unknown    σ known      α > 1 known         all
                 θ unknown    σ unknown    α > 1 known         all
                 θ unknown    σ known      α > 1 unknown       all
                 θ unknown    σ unknown    α > 1 unknown       all

  Lognormal      θ known      ζ known      σ known             all
                 θ known      ζ known      σ unknown           A² and W²
                 θ known      ζ unknown    σ known             A² and W²
                 θ known      ζ unknown    σ unknown           all
                 θ unknown    ζ known      σ < 3 known         all
                 θ unknown    ζ known      σ < 3 unknown       all
                 θ unknown    ζ unknown    σ < 3 known         all
                 θ unknown    ζ unknown    σ < 3 unknown       all

  Normal         µ known      σ known                          all
                 µ known      σ unknown                        A² and W²
                 µ unknown    σ known                          A² and W²
                 µ unknown    σ unknown                        all

  Weibull        θ known      σ known      c known             all
                 θ known      σ unknown    c known             A² and W²
                 θ known      σ known      c unknown           A² and W²
                 θ known      σ unknown    c unknown           A² and W²
                 θ unknown    σ known      c > 2 known         all
                 θ unknown    σ unknown    c > 2 known         all
                 θ unknown    σ known      c > 2 unknown       all
                 θ unknown    σ unknown    c > 2 unknown       all

(A² denotes the Anderson-Darling statistic, and W² denotes the Cramér-von Mises statistic.)

If you want to test the normality assumptions for analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Because small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, the PROBPLOT statement, and the QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.

Shapiro-Wilk Statistic

If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W (also denoted as W_n to emphasize its dependence on the sample size n). The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965). When n is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992). The statistic W is always greater than zero and less than or equal to one (0 < W ≤ 1).

Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. The method for computing the p-value (the probability of obtaining a W statistic less than or equal to the observed value) depends on n. For n = 3, the probability distribution of W is known and is used to determine the p-value. For n ≥ 4, a normalizing transformation is computed:

$$Z_n = \begin{cases} \dfrac{-\log\!\left(\gamma - \log\left(1 - W_n\right)\right) - \mu}{\sigma} & \text{if } 4 \le n \le 11 \\[1.5ex] \dfrac{\log\left(1 - W_n\right) - \mu}{\sigma} & \text{if } 12 \le n \le 2000 \end{cases}$$

The values of σ, γ, and µ are functions of n obtained from simulation results. Large values of Z_n indicate departure from normality, and since the statistic Z_n has an approximately standard normal distribution, this distribution is used to determine the p-values for n ≥ 4.

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). The EDF tests offer advantages over the traditional chi-square goodness-of-fit test, including improved power and invariance with respect to the histogram midpoints. For a thorough discussion, refer to D'Agostino and Stephens (1986).

The empirical distribution function is defined for a set of n independent observations X₁, . . . , X_n with a common distribution function F(x). Denote the observations ordered from smallest to largest as X_(1), . . . , X_(n). The empirical distribution function, F_n(x), is defined as

$$F_n(x) = \begin{cases} 0, & x < X_{(1)} \\[0.5ex] \dfrac{i}{n}, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \ldots, n-1 \\[0.5ex] 1, & X_{(n)} \le x \end{cases}$$

Note that F_n(x) is a step function that takes a step of height 1/n at each observation. This function estimates the distribution function F(x). At any value x, F_n(x) is the proportion of observations less than or equal to x, while F(x) is the probability of an observation less than or equal to x. EDF statistics measure the discrepancy between F_n(x) and F(x).

The computational formulas for the EDF statistics make use of the probability integral transformation U = F ( X ). If F ( X ) is the distribution function of X , the random variable U is uniformly distributed between 0 and 1.

Given n observations X (1) , . . . , X ( n ) , the values U ( i ) = F ( X ( i ) ) are computed by applying the transformation, as discussed in the next three sections.

PROC UNIVARIATE provides three EDF tests:

  • Kolmogorov-Smirnov

  • Anderson-Darling

  • Cram r-von Mises

The following sections provide formal definitions of these EDF statistics.

Kolmogorov D Statistic

The Kolmogorov-Smirnov statistic (D) is defined as

$$D = \sup_x \left| F_n(x) - F(x) \right|$$

The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between F ( x ) and F n ( x ).

The Kolmogorov-Smirnov statistic is computed as the maximum of D + and D ˆ’ , where D + is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function, and D ˆ’ is the largest vertical distance when the EDF is less than the distribution function.

PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.

Anderson-Darling Statistic

The Anderson-Darling statistic and the Cramér-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference (F_n(x) − F(x))². Quadratic statistics have the following general form:

$$Q = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \psi(x)\, dF(x)$$

The function ψ(x) weights the squared difference (F_n(x) − F(x))².

The Anderson-Darling statistic (A²) is defined as

$$A^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 \left[F(x)\left(1 - F(x)\right)\right]^{-1} dF(x)$$

Here the weight function is $\psi(x) = \left[F(x)\left(1 - F(x)\right)\right]^{-1}$.

The Anderson-Darling statistic is computed as

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}\left[(2i-1)\,\log U_{(i)} + (2n+1-2i)\,\log\left(1 - U_{(i)}\right)\right]$$
Cramér-von Mises Statistic

The Cramér-von Mises statistic (W²) is defined as

$$W^2 = n \int_{-\infty}^{+\infty} \left(F_n(x) - F(x)\right)^2 dF(x)$$

Here the weight function is ψ(x) = 1.

The Cramér-von Mises statistic is computed as

$$W^2 = \sum_{i=1}^{n}\left(U_{(i)} - \frac{2i-1}{2n}\right)^2 + \frac{1}{12n}$$
Probability Values of EDF Tests

Once the EDF test statistics are computed, PROC UNIVARIATE computes the associated probability values (p-values). The UNIVARIATE procedure uses internal tables of probability levels similar to those given by D'Agostino and Stephens (1986). If the value is between two probability levels, then linear interpolation is used to estimate the probability value.

The probability value depends upon the parameters that are known and the parameters that are estimated for the distribution. Table 3.62 summarizes the combinations of known and estimated parameters for which the EDF tests are available.

Kernel Density Estimates

You can use the KERNEL option to superimpose kernel density estimates on histograms. Smoothing the data distribution with a kernel density estimate can be more effective than using a histogram to identify features that might be obscured by the choice of histogram bins or sampling variation. A kernel density estimate can also be more effective than a parametric curve fit when the process distribution is multi-modal. See Example 3.23.

The general form of the kernel density estimator is

$$\hat{f}_\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{\lambda}\right)$$

where K(·) is the kernel function, λ is the bandwidth, n is the sample size, and x_i is the ith observation.

The KERNEL option provides three kernel functions (K): normal, quadratic, and triangular. You can specify the function with the K= kernel-option in parentheses after the KERNEL option. Values for the K= option are NORMAL, QUADRATIC, and TRIANGULAR (with aliases of N, Q, and T, respectively). By default, a normal kernel is used. The formulas for the kernel functions are

$$K(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \quad \text{for } -\infty < t < \infty \qquad \text{(normal)}$$

$$K(t) = \frac{3}{4}\left(1 - t^2\right) \quad \text{for } |t| \le 1 \qquad \text{(quadratic)}$$

$$K(t) = 1 - |t| \quad \text{for } |t| \le 1 \qquad \text{(triangular)}$$

The value of λ, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated density function. You specify λ indirectly by specifying a standardized bandwidth c with the C= kernel-option. If Q is the interquartile range, and n is the sample size, then c is related to λ by the formula

$$\lambda = c\, Q\, n^{-1/5}$$

For a specific kernel function, the discrepancy between the density estimator $\hat{f}_\lambda(x)$ and the true density f(x) is measured by the mean integrated square error (MISE):

$$\mathrm{MISE}(\lambda) = E\!\left[\int \left(\hat{f}_\lambda(x) - f(x)\right)^2 dx\right]$$

The MISE is the sum of the integrated squared bias and the variance. An approximate mean integrated square error (AMISE) is

$$\mathrm{AMISE}(\lambda) = \frac{1}{n\lambda}\int K(t)^2\, dt \;+\; \frac{\lambda^4}{4}\left(\int t^2 K(t)\, dt\right)^2 \int \left(f''(x)\right)^2 dx$$

A bandwidth that minimizes AMISE can be derived by treating f ( x ) as the normal density having parameters µ and ƒ estimated by the sample mean and standard deviation. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. For each estimate, the bandwidth parameter c , the kernel function type, and the value of AMISE are reported in the SAS log.

The general kernel density estimates assume that the domain of the density to estimate can take on all values on a real line. However, sometimes the domain of a density is an interval bounded on one or both sides. For example, if a variable Y is a measurement of only positive values, then the kernel density curve should be bounded so that it is zero for negative Y values. You can use the LOWER= and UPPER= kernel-options to specify the bounds.

The UNIVARIATE procedure uses a reflection technique to create the bounded kernel density curve, as described in Silverman (1986, pp. 30-31). It adds the reflections of the kernel density that are outside the boundary to the bounded kernel estimates. The general form of the bounded kernel density estimator is computed by replacing $K\!\left(\frac{x - x_i}{\lambda}\right)$ in the original equation with

$$K\!\left(\frac{x - x_i}{\lambda}\right) + K\!\left(\frac{x + x_i - 2x_l}{\lambda}\right) + K\!\left(\frac{x + x_i - 2x_u}{\lambda}\right)$$

where x_l is the lower bound and x_u is the upper bound.

Without a lower bound, $x_l = -\infty$ and $K\!\left(\frac{x + x_i - 2x_l}{\lambda}\right)$ equals zero. Similarly, without an upper bound, $x_u = \infty$ and $K\!\left(\frac{x + x_i - 2x_u}{\lambda}\right)$ equals zero.

When C=MISE is used with a bounded kernel density, the UNIVARIATE procedure uses a bandwidth that minimizes the AMISE for its corresponding unbounded kernel.
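The following minimal sketch (hypothetical data and names) superimposes a bounded kernel density estimate on a histogram of a positive-valued variable, with the bandwidth chosen to minimize the AMISE:

   data Times;
      input Wait @@;
      datalines;
   0.4 1.2 0.7 2.5 1.9 0.3 3.1 1.1 0.8 2.2
   ;
   run;

   proc univariate data=Times noprint;
      histogram Wait / kernel(c=mise k=normal lower=0);
   run;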

Construction of Quantile-Quantile and Probability Plots

Figure 3.10 illustrates how a Q-Q plot is constructed for a specified theoretical distribution. First, the n nonmissing values of the variable are ordered from smallest to largest:

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

Figure 3.10: Construction of a Q-Q Plot

Then the ith ordered value x_(i) is plotted as a point whose y-coordinate is x_(i) and whose x-coordinate is $F^{-1}\!\left(\frac{i - 0.375}{n + 0.25}\right)$, where F(·) is the specified distribution with zero location parameter and unit scale parameter.

You can modify the adjustment constants −0.375 and 0.25 with the RANKADJ= and NADJ= options. This default combination is recommended by Blom (1958). For additional information, refer to Chambers et al. (1983). Since x_(i) is a quantile of the empirical cumulative distribution function (ecdf), a Q-Q plot compares quantiles of the ecdf with quantiles of a theoretical distribution. Probability plots (see the section PROBPLOT Statement on page 241) are constructed the same way, except that the x-axis is scaled nonlinearly in percentiles.

Interpretation of Quantile-Quantile and Probability Plots

The following properties of Q-Q plots and probability plots make them useful diagnostics of how well a specified theoretical distribution fits a set of measurements:

  • If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x .

  • If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b . The slope a and intercept b are visual estimates of the scale and location parameters of the theoretical distribution.

Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters since the x -axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities.

There are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity, and these are summarized in Table 3.63.

Table 3.63: Quantile-Quantile Plot Diagnostics

| Description of Point Pattern | Possible Interpretation |
| --- | --- |
| All but a few points fall on a line | Outliers in the data |
| Left end of pattern is below the line; right end of pattern is above the line | Long tails at both ends of the data distribution |
| Left end of pattern is above the line; right end of pattern is below the line | Short tails at both ends of the data distribution |
| Curved pattern with slope increasing from left to right | Data distribution is skewed to the right |
| Curved pattern with slope decreasing from left to right | Data distribution is skewed to the left |
| Staircase pattern (plateaus and gaps) | Data have been rounded or are discrete |

In some applications, a nonlinear pattern may be more revealing than a linear pattern. However, Chambers et al. (1983) note that departures from linearity can also be due to chance variation.

When the pattern is linear, you can use Q-Q plots to estimate shape, location, and scale parameters and to estimate percentiles. See Example 3.26 through Example 3.34.

Distributions for Probability and Q-Q Plots

You can use the PROBPLOT and QQPLOT statements to request probability and Q-Q plots that are based on the theoretical distributions summarized in Table 3.64.

Table 3.64: Distributions and Parameters

| Distribution | Density Function p(x) | Range | Location | Scale | Shape |
| --- | --- | --- | --- | --- | --- |
| Beta | (x - θ)^{α-1} (θ + σ - x)^{β-1} / (B(α,β) σ^{α+β-1}) | θ < x < θ + σ | θ | σ | α, β |
| Exponential | (1/σ) exp(-(x - θ)/σ) | x ≥ θ | θ | σ |  |
| Gamma | (1/(Γ(α) σ)) ((x - θ)/σ)^{α-1} exp(-(x - θ)/σ) | x > θ | θ | σ | α |
| Lognormal (3-parameter) | (1/(σ √(2π) (x - θ))) exp(-(log(x - θ) - ζ)² / (2σ²)) | x > θ | θ | ζ | σ |
| Normal | (1/(σ √(2π))) exp(-(x - μ)² / (2σ²)) | all x | μ | σ |  |
| Weibull (3-parameter) | (c/σ) ((x - θ)/σ)^{c-1} exp(-((x - θ)/σ)^c) | x > θ | θ | σ | c |
| Weibull (2-parameter) | (c/σ) ((x - θ0)/σ)^{c-1} exp(-((x - θ0)/σ)^c) | x > θ0 | θ0 (known) | σ | c |

You can request these distributions with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, WEIBULL, and WEIBULL2 options, respectively. If you do not specify a distribution option, a normal probability plot or a normal Q-Q plot is created.
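
For example, the following sketch (hypothetical data set Trees with variables Diameter and Volume) requests two of these plots; GAMMA requires its ALPHA= shape option, whereas NORMAL takes no shape parameter:

   proc univariate data=trees noprint;
      qqplot diameter / normal;          /* default normal Q-Q plot */
      qqplot volume   / gamma(alpha=2);  /* gamma requires ALPHA=   */
   run;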

The following sections provide details for constructing Q-Q plots that are based on these distributions. Probability plots are constructed similarly except that the horizontal axis is scaled in percentile units.

Beta Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile B^{-1}_{α,β}((i - 0.375)/(n + 0.25)), where B^{-1}_{α,β}(·) is the inverse normalized incomplete beta function, n is the number of nonmissing observations, and α and β are the shape parameters of the beta distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for ALPHA=α and BETA=β tends to be linear with intercept θ and slope σ if the data are beta distributed with the specific density function

    p(x) = \frac{(x - \theta)^{\alpha - 1} (\theta + \sigma - x)^{\beta - 1}}{B(\alpha, \beta) \, \sigma^{\alpha + \beta - 1}} \quad \text{for } \theta < x < \theta + \sigma

where B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} and

  • θ = lower threshold parameter

  • σ = scale parameter (σ > 0)

  • α = first shape parameter (α > 0)

  • β = second shape parameter (β > 0)

Exponential Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile -log(1 - (i - 0.375)/(n + 0.25)), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot tends to be linear with intercept θ and slope σ if the data are exponentially distributed with the specific density function

    p(x) = \frac{1}{\sigma} \exp\left(-\frac{x - \theta}{\sigma}\right) \quad \text{for } x \ge \theta

where θ is a threshold parameter and σ is a positive scale parameter.

Gamma Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile G^{-1}_α((i - 0.375)/(n + 0.25)), where G^{-1}_α(·) is the inverse normalized incomplete gamma function, n is the number of nonmissing observations, and α is the shape parameter of the gamma distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for ALPHA=α tends to be linear with intercept θ and slope σ if the data are gamma distributed with the specific density function

    p(x) = \frac{1}{\Gamma(\alpha) \, \sigma} \left(\frac{x - \theta}{\sigma}\right)^{\alpha - 1} \exp\left(-\frac{x - \theta}{\sigma}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • α = shape parameter (α > 0)

Lognormal Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile exp(σ Φ^{-1}((i - 0.375)/(n + 0.25))), where Φ^{-1}(·) is the inverse cumulative standard normal distribution, n is the number of nonmissing observations, and σ is the shape parameter of the lognormal distribution. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for SIGMA=σ tends to be linear with intercept θ and slope exp(ζ) if the data are lognormally distributed with the specific density function

    p(x) = \frac{1}{\sigma \sqrt{2\pi} \, (x - \theta)} \exp\left(-\frac{(\log(x - \theta) - \zeta)^2}{2\sigma^2}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • ζ = scale parameter

  • σ = shape parameter (σ > 0)

See Example 3.26 and Example 3.33.

Normal Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile Φ^{-1}((i - 0.375)/(n + 0.25)), where Φ^{-1}(·) is the inverse cumulative standard normal distribution and n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

The point pattern on the plot tends to be linear with intercept μ and slope σ if the data are normally distributed with the specific density function

    p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \text{for all } x

where μ is the mean and σ is the standard deviation (σ > 0).

Three-Parameter Weibull Distribution

To create the plot, the observations are ordered from smallest to largest, and the i th ordered observation is plotted against the quantile (-log(1 - (i - 0.375)/(n + 0.25)))^{1/c}, where n is the number of nonmissing observations and c is the Weibull distribution shape parameter. In a probability plot, the horizontal axis is scaled in percentile units.

The pattern on the plot for C=c tends to be linear with intercept θ and slope σ if the data are Weibull distributed with the specific density function

    p(x) = \frac{c}{\sigma} \left(\frac{x - \theta}{\sigma}\right)^{c - 1} \exp\left(-\left(\frac{x - \theta}{\sigma}\right)^{c}\right) \quad \text{for } x > \theta

where

  • θ = threshold parameter

  • σ = scale parameter (σ > 0)

  • c = shape parameter (c > 0)

See Example 3.34.

Two-Parameter Weibull Distribution

To create the plot, the observations are ordered from smallest to largest, and the log of the shifted i th ordered observation x_{(i)}, denoted by log(x_{(i)} - θ0), is plotted against the quantile log(-log(1 - (i - 0.375)/(n + 0.25))), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.

Unlike the three-parameter Weibull quantile, the preceding expression is free of distribution parameters. Consequently, the C= shape parameter is not mandatory with the WEIBULL2 distribution option.

The pattern on the plot for THETA=θ0 tends to be linear with intercept log(σ) and slope 1/c if the data are Weibull distributed with the specific density function

    p(x) = \frac{c}{\sigma} \left(\frac{x - \theta_0}{\sigma}\right)^{c - 1} \exp\left(-\left(\frac{x - \theta_0}{\sigma}\right)^{c}\right) \quad \text{for } x > \theta_0

where

  • θ0 = known lower threshold

  • σ = scale parameter (σ > 0)

  • c = shape parameter (c > 0)

See Example 3.34.

Estimating Shape Parameters Using Q-Q Plots

Some of the distribution options in the PROBPLOT or QQPLOT statements require you to specify one or two shape parameters in parentheses after the distribution keyword. These are summarized in Table 3.65.

Table 3.65: Shape Parameter Options

| Distribution Keyword | Mandatory Shape Parameter Option | Range |
| --- | --- | --- |
| BETA | ALPHA=α, BETA=β | α > 0, β > 0 |
| EXPONENTIAL | None |  |
| GAMMA | ALPHA=α | α > 0 |
| LOGNORMAL | SIGMA=σ | σ > 0 |
| NORMAL | None |  |
| WEIBULL | C=c | c > 0 |
| WEIBULL2 | None |  |

You can visually estimate the value of a shape parameter by specifying a list of values for the shape parameter option. A separate plot is produced for each value, and you can then select the value of the shape parameter that produces the most nearly linear point pattern. Alternatively, you can request that the plot be created using an estimated shape parameter. See the entries for the distribution options in the section Dictionary of Options on page 245 (for the PROBPLOT statement) and in the section Dictionary of Options on page 258 (for the QQPLOT statement).

Note: For Q-Q plots created with the WEIBULL2 option, you can estimate the shape parameter c from a linear pattern by using the fact that the slope of the pattern is 1/c.
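
For instance, assuming a hypothetical data set Tests with a variable Lifetime, the following statement produces one Q-Q plot for each shape value in the list, so you can select the value that yields the straightest pattern:

   proc univariate data=tests noprint;
      /* one plot per shape value: c = 0.5, 1, 2 */
      qqplot lifetime / weibull(c=0.5 1 2);
   run;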

Estimating Location and Scale Parameters Using Q-Q Plots

If you specify location and scale parameters for a distribution in a PROBPLOT or QQPLOT statement (or if you request estimates for these parameters), a diagonal distribution reference line is displayed on the plot. (An exception is the two-parameter Weibull distribution, for which a line is displayed when you specify or estimate the scale and shape parameters.) Agreement between this line and the point pattern indicates that the distribution with these parameters is a good fit.

When the point pattern on a Q-Q plot is linear, its intercept and slope provide estimates of the location and scale parameters. (An exception to this rule is the two-parameter Weibull distribution, for which the intercept and slope are related to the scale and shape parameters.)

Table 3.66 shows how the specified parameters determine the intercept and slope of the line. The intercept and slope are based on the quantile scale for the horizontal axis, which is used in Q-Q plots.

Table 3.66: Intercept and Slope of Distribution Reference Line

| Distribution | Location | Scale | Shape | Intercept | Slope |
| --- | --- | --- | --- | --- | --- |
| Beta | θ | σ | α, β | θ | σ |
| Exponential | θ | σ |  | θ | σ |
| Gamma | θ | σ | α | θ | σ |
| Lognormal | θ | ζ | σ | θ | exp(ζ) |
| Normal | μ | σ |  | μ | σ |
| Weibull (3-parameter) | θ | σ | c | θ | σ |
| Weibull (2-parameter) | θ0 (known) | σ | c | log(σ) | 1/c |

For instance, specifying MU=3 and SIGMA=2 with the NORMAL option requests a line with intercept 3 and slope 2. Specifying SIGMA=1 and C=2 with the WEIBULL2 option requests a line with intercept log(1) = 0 and slope 1/2. On a probability plot with the LOGNORMAL and WEIBULL2 options, you can specify the slope directly with the SLOPE= option. That is, for the LOGNORMAL option, specifying THETA=θ and SLOPE=exp(ζ) displays the same line as specifying THETA=θ and ZETA=ζ. For the WEIBULL2 option, specifying SIGMA=σ and SLOPE=1/c displays the same line as specifying SIGMA=σ and C=c.
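
Written out as statements (with hypothetical data set and variable names), those two requests look as follows:

   proc univariate data=tests noprint;
      qqplot score    / normal(mu=3 sigma=2);   /* line: intercept 3, slope 2          */
      qqplot lifetime / weibull2(sigma=1 c=2);  /* line: intercept log(1)=0, slope 1/2 */
   run;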

Estimating Percentiles Using Q-Q Plots

There are two ways to estimate percentiles from a Q-Q plot:

  • Specify the PCTLAXIS option, which adds a percentile axis opposite the theoretical quantile axis. The scale for the percentile axis ranges between 0 and 100 with tick marks at percentile values such as 1, 5, 10, 25, 50, 75, 90, 95, and 99.

  • Specify the PCTLSCALE option, which relabels the horizontal axis tick marks with their percentile equivalents but does not alter their spacing. For example, on a normal Q-Q plot, the tick mark labeled 0 is relabeled as 50 since the 50th percentile corresponds to the zero quantile.

You can also estimate percentiles using probability plots created with the PROBPLOT statement. See Example 3.32.
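
The two options are sketched below (hypothetical names); each statement produces a separate Q-Q plot:

   proc univariate data=tests noprint;
      qqplot score / normal(mu=est sigma=est) pctlaxis;   /* adds a percentile axis          */
      qqplot score / normal(mu=est sigma=est) pctlscale;  /* relabels ticks with percentiles */
   run;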

Input Data Sets

DATA= Data Set

The DATA= data set provides the set of variables that are analyzed. The UNIVARIATE procedure must have a DATA= data set. If you do not specify one with the DATA= option in the PROC UNIVARIATE statement, the procedure uses the last data set created.

ANNOTATE= Data Sets

You can add features to plots by specifying ANNOTATE= data sets either in the PROC UNIVARIATE statement or in individual plot statements.

Information contained in an ANNOTATE= data set specified in the PROC UNIVARIATE statement is used for all plots produced in a given PROC step; this is a global ANNOTATE= data set. By using this global data set, you can keep information common to all high-resolution plots in one data set.

Information contained in the ANNOTATE= data set specified in a plot statement is used only for plots produced by that statement; this is a local ANNOTATE= data set. By using this data set, you can add statement-specific features to plots. For example, you can add different features to plots produced by the HISTOGRAM and QQPLOT statements by specifying an ANNOTATE= data set in each plot statement.

You can specify an ANNOTATE= data set in the PROC UNIVARIATE statement and in plot statements. This enables you to add some features to all plots and also add statement-specific features to plots. See Example 3.25.
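
Schematically (the annotate data sets AnnoAll, AnnoHist, and AnnoQQ are assumed to exist already), the global data set applies to every plot, while each local data set applies only to its own statement:

   proc univariate data=trees annotate=annoall noprint;
      histogram diameter / annotate=annohist;  /* annoall + annohist features */
      qqplot    diameter / annotate=annoqq;    /* annoall + annoqq features   */
   run;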

OUT= Output Data Set in the OUTPUT Statement

PROC UNIVARIATE creates an OUT= data set for each OUTPUT statement. This data set contains an observation for each combination of levels of the variables in the BY statement, or a single observation if you do not specify a BY statement. Thus the number of observations in the new data set corresponds to the number of groups for which statistics are calculated. Without a BY statement, the procedure computes statistics and percentiles by using all the observations in the input data set. With a BY statement, the procedure computes statistics and percentiles by using the observations within each BY group.

The variables in the OUT= data set are as follows:

  • BY statement variables. The values of these variables match the values in the corresponding BY group in the DATA= data set and indicate which BY group each observation summarizes.

  • variables created by selecting statistics in the OUTPUT statement. The statistics are computed using all the nonmissing data, or they are computed for each BY group if you use a BY statement.

  • variables created by requesting new percentiles with the PCTLPTS= option. The names of these new variables depend on the values of the PCTLPRE= and PCTLNAME= options.

If the output data set contains a percentile variable or a quartile variable, the percentile definition assigned with the PCTLDEF= option in the PROC UNIVARIATE statement is recorded in the output data set label. See Example 3.8.
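
For example, the following sketch (hypothetical names) saves the mean and standard deviation and requests two nonstandard percentiles; PCTLPRE=pct produces the variables pct20 and pct40:

   proc univariate data=trees noprint;
      var diameter;
      output out=stats mean=dmean std=dstd
             pctlpts=20 40 pctlpre=pct;   /* creates pct20 and pct40 */
   run;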

The following table lists variables available in the OUT= data set.

Table 3.67: Variables Available in the OUT= Data Set

Descriptive Statistics

| Variable Name | Description |
| --- | --- |
| CSS | Sum of squares corrected for the mean |
| CV | Percent coefficient of variation |
| KURTOSIS | Measurement of the heaviness of tails |
| MAX | Largest (maximum) value |
| MEAN | Arithmetic mean |
| MIN | Smallest (minimum) value |
| MODE | Most frequent value (if not unique, the smallest mode) |
| N | Number of observations on which calculations are based |
| NMISS | Number of missing observations |
| NOBS | Total number of observations |
| RANGE | Difference between the maximum and minimum values |
| SKEWNESS | Measurement of the tendency of the deviations to be larger in one direction than in the other |
| STD | Standard deviation |
| STDMEAN | Standard error of the mean |
| SUM | Sum |
| SUMWGT | Sum of the weights |
| USS | Uncorrected sum of squares |
| VAR | Variance |

Quantile Statistics

| Variable Name | Description |
| --- | --- |
| MEDIAN (P50) | Middle value (50th percentile) |
| P1 | 1st percentile |
| P5 | 5th percentile |
| P10 | 10th percentile |
| P90 | 90th percentile |
| P95 | 95th percentile |
| P99 | 99th percentile |
| Q1 (P25) | Lower quartile (25th percentile) |
| Q3 (P75) | Upper quartile (75th percentile) |
| QRANGE | Difference between the upper and lower quartiles (the interquartile range) |

Robust Statistics

| Variable Name | Description |
| --- | --- |
| GINI | Gini's mean difference |
| MAD | Median absolute difference |
| QN | Second variation of the median absolute difference |
| SN | First variation of the median absolute difference |
| STD_GINI | Standard deviation for Gini's mean difference |
| STD_MAD | Standard deviation for the median absolute difference |
| STD_QN | Standard deviation for the second variation of the median absolute difference |
| STD_QRANGE | Estimate of the standard deviation, based on the interquartile range |
| STD_SN | Standard deviation for the first variation of the median absolute difference |

Hypothesis Test Statistics

| Variable Name | Description |
| --- | --- |
| MSIGN | Sign statistic |
| NORMAL | Test statistic for normality. If the sample size is less than or equal to 2000, this is the Shapiro-Wilk W statistic; otherwise, it is the Kolmogorov D statistic. |
| PROBM | Probability of a greater absolute value for the sign statistic |
| PROBN | Probability that the data came from a normal distribution |
| PROBS | Probability of a greater absolute value for the signed rank statistic |
| PROBT | Two-tailed p-value for Student's t statistic with n - 1 degrees of freedom |
| SIGNRANK | Signed rank statistic |
| T | Student's t statistic to test the null hypothesis that the population mean equals μ0 (the MU0= value) |

OUTHISTOGRAM= Output Data Set

You can create an OUTHISTOGRAM= data set with the HISTOGRAM statement. This data set contains information about histogram intervals. Since you can specify multiple HISTOGRAM statements with the UNIVARIATE procedure, you can create multiple OUTHISTOGRAM= data sets.

An OUTHISTOGRAM= data set contains a group of observations for each variable in the HISTOGRAM statement. The group contains an observation for each interval of the histogram, beginning with the leftmost interval that contains a value of the variable and ending with the rightmost interval that contains a value of the variable. These intervals will not necessarily coincide with the intervals displayed in the histogram since the histogram may be padded with empty intervals at either end. If you superimpose one or more fitted curves on the histogram, the OUTHISTOGRAM= data set contains multiple groups of observations for each variable (one group for each curve). If you use a BY statement, the OUTHISTOGRAM= data set contains groups of observations for each BY group. ID variables are not saved in an OUTHISTOGRAM= data set.

By default, an OUTHISTOGRAM= data set contains the _MIDPT_ variable, whose values identify histogram intervals by their midpoints. When the ENDPOINTS= or NENDPOINTS option is specified, intervals are identified by endpoint values instead. If the RTINCLUDE option is specified, the _MAXPT_ variable contains upper endpoint values. Otherwise, the _MINPT_ variable contains lower endpoint values. See Example 3.18.
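
A minimal sketch (hypothetical names): with ENDPOINTS= and RTINCLUDE, the intervals in the output data set are identified by their upper endpoints in the _MAXPT_ variable:

   proc univariate data=trees noprint;
      histogram diameter / endpoints=0 to 24 by 4 rtinclude
                           outhistogram=histbins;
   run;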

Table 3.68: Variables in the OUTHISTOGRAM= Data Set

| Variable | Description |
| --- | --- |
| _CURVE_ | Name of fitted distribution (if requested in HISTOGRAM statement) |
| _EXPPCT_ | Estimated percent of population in histogram interval, determined from optional fitted distribution |
| _MAXPT_ | Upper endpoint of histogram interval |
| _MIDPT_ | Midpoint of histogram interval |
| _MINPT_ | Lower endpoint of histogram interval |
| _OBSPCT_ | Percent of variable values in histogram interval |
| _VAR_ | Variable name |

Tables for Summary Statistics

By default, PROC UNIVARIATE produces ODS tables of moments, basic statistical measures, tests for location, quantiles, and extreme observations. You must specify options in the PROC UNIVARIATE statement to request other statistics and tables. The CIBASIC option produces a table that displays confidence limits for the mean, standard deviation, and variance. The CIPCTLDF and CIPCTLNORMAL options request tables of confidence limits for the quantiles. The LOCCOUNT option requests a table that shows the number of values greater than, not equal to, and less than the value of MU0=. The FREQ option requests a table of frequency counts. The NEXTRVAL= option requests a table of extreme values. The NORMAL option requests a table with tests for normality.

The TRIMMED=, WINSORIZED=, and ROBUSTSCALE options request tables with robust estimators. The table of trimmed or Winsorized means includes the percentage and the number of observations that are trimmed or Winsorized at each end, the mean and standard error, confidence limits, and the Student's t test. The table with robust measures of scale includes the interquartile range, Gini's mean difference G, MAD, Qn, and Sn, with their corresponding estimates of σ.
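
For example, the following step (hypothetical data set) adds the confidence-limit, trimmed-mean, and robust-scale tables to the default output:

   proc univariate data=trees cibasic trimmed=0.1 robustscale;
      var diameter;
   run;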

See the section ODS Table Names on page 309 for the names of ODS tables created by PROC UNIVARIATE.

ODS Table Names

PROC UNIVARIATE assigns a name to each table that it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets.
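
For example, the following sketch (hypothetical data set) writes the Moments and Quantiles tables to output data sets:

   ods output Moments=mom Quantiles=qtl;
   proc univariate data=trees;
      var diameter;
   run;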

Table 3.69: ODS Tables Produced with the PROC UNIVARIATE Statement

| ODS Table Name | Description | Option |
| --- | --- | --- |
| BasicIntervals | Confidence intervals for mean, standard deviation, variance | CIBASIC |
| BasicMeasures | Measures of location and variability | Default |
| ExtremeObs | Extreme observations | Default |
| ExtremeValues | Extreme values | NEXTRVAL= |
| Frequencies | Frequencies | FREQ |
| LocationCounts | Counts used for sign test and signed rank test | LOCCOUNT |
| MissingValues | Missing values | Default, if missing values exist |
| Modes | Modes | MODES |
| Moments | Sample moments | Default |
| Plots | Line printer plots | PLOTS |
| Quantiles | Quantiles | Default |
| RobustScale | Robust measures of scale | ROBUSTSCALE |
| SSPlots | Line printer side-by-side box plots | PLOTS (with BY statement) |
| TestsForLocation | Tests for location | Default |
| TestsForNormality | Tests for normality | NORMALTEST |
| TrimmedMeans | Trimmed means | TRIMMED= |
| WinsorizedMeans | Winsorized means | WINSORIZED= |

Table 3.70: ODS Tables Produced with the HISTOGRAM Statement

| ODS Table Name | Description | Option |
| --- | --- | --- |
| Bins | Histogram bins | MIDPERCENTS secondary option |
| FitQuantiles | Quantiles of fitted distribution | Any distribution option |
| GoodnessOfFit | Goodness-of-fit tests for fitted distribution | Any distribution option |
| HistogramBins | Histogram bins | MIDPERCENTS option |
| ParameterEstimates | Parameter estimates for fitted distribution | Any distribution option |

ODS Tables for Fitted Distributions

If you request a fitted parametric distribution with a HISTOGRAM statement, PROC UNIVARIATE creates a summary that is organized into the ODS tables described in this section.

Parameters

The ParameterEstimates table lists the estimated (or specified) parameters for the fitted curve as well as the estimated mean and estimated standard deviation. See Formulas for Fitted Continuous Distributions on page 288.

EDF Goodness-of-Fit Tests

When you fit a parametric distribution, the HISTOGRAM statement provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). See EDF Goodness-of-Fit Tests on page 294. These are displayed in the GoodnessOfFit table.

Histogram Intervals

The Bins table is included in the summary only if you specify the MIDPERCENTS option in parentheses after the distribution option. This table lists the midpoints for the histogram bins along with the observed and estimated percentages of the observations that lie in each bin. The estimated percentages are based on the fitted distribution.

If you specify the MIDPERCENTS option without requesting a fitted distribution, the HistogramBins table is included in the summary. This table lists the interval midpoints with the observed percent of observations that lie in the interval. See the entry for the MIDPERCENTS option on page 225.

Quantiles

The FitQuantiles table lists observed and estimated quantiles. You can use the PERCENTS= option to specify the list of quantiles in this table. See the entry for the PERCENTS= option on page 227. By default, the table lists observed and estimated quantiles for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles of the fitted parametric distribution.
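
A sketch combining both kinds of secondary options (hypothetical names): MIDPERCENTS adds the Bins table, and PERCENTS= controls which quantiles appear in the FitQuantiles table:

   proc univariate data=trees;
      histogram diameter / normal(midpercents percents=25 50 75);
   run;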

Computational Resources

Because the UNIVARIATE procedure computes quantile statistics, it requires additional memory to store a copy of the data. By default, the MEANS, SUMMARY, and TABULATE procedures require less memory because they do not automatically compute quantiles. These procedures also provide an option to use a fixed-memory quantile estimation method that is usually less memory intensive.

In the UNIVARIATE procedure, the only factor that limits the number of variables that you can analyze is the computer resources that are available. The amount of temporary storage and CPU time required depends on the statements and the options that you specify. To calculate the computer resources the procedure needs, let

  N     be the number of observations in the data set
  V     be the number of variables in the VAR statement
  U_i   be the number of unique values for the i th variable

Then the minimum memory requirement in bytes to process all variables is M = 24 Σ_i U_i. If M bytes are not available, PROC UNIVARIATE must process the data multiple times to compute all the statistics. This reduces the minimum memory requirement to M = 24 max_i(U_i).

Using the ROUND= option reduces the number of unique values (U_i), thereby reducing memory requirements. The ROBUSTSCALE option requires 40 U_i bytes of temporary storage.
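
For example (hypothetical names), rounding each value to the nearest 0.1 collapses near-duplicate values and shrinks each U_i:

   proc univariate data=bigdata round=0.1 noprint;
      var response;
      output out=stats mean=avg p99=p99;
   run;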

Several factors affect the CPU time:

  • The time to create V tree structures to internally store the observations is proportional to NV log( N ).

  • The time to compute moments and quantiles for the i th variable is proportional to U i .

  • The time to compute the NORMAL option test statistics is proportional to N .

  • The time to compute the ROBUSTSCALE option test statistics is proportional to U i log( U i ).

  • The time to compute the exact significance level of the signed rank statistic may increase when the number of nonzero values is less than or equal to 20.

Each of these factors has a different constant of proportionality. For additional information on optimizing CPU performance and memory usage, see the SAS documentation for your operating environment.



