PROC UNIVARIATE excludes missing values for an analysis variable before calculating statistics. Each analysis variable is treated individually; a missing value for an observation in one variable does not affect the calculations for other variables. The statements handle missing values as follows:
If a BY or an ID variable value is missing, PROC UNIVARIATE treats it like any other BY or ID variable value. The missing values form a separate BY group.
If the FREQ variable value is missing or nonpositive, PROC UNIVARIATE excludes the observation from the analysis.
If the WEIGHT variable value is missing, PROC UNIVARIATE excludes the observation from the analysis.
PROC UNIVARIATE tabulates the number of missing values and reports this information in the ODS table named Missing Values; see the section ODS Table Names on page 309. Before the number of missing values is tabulated, PROC UNIVARIATE excludes observations when
you use the FREQ statement and the frequencies are nonpositive.
you use the WEIGHT statement and the weights are missing or nonpositive (you must specify the EXCLNPWGT option).
When you specify ROUND=u, PROC UNIVARIATE rounds a variable by using the rounding unit to divide the number line into intervals with midpoints of the form ui, where u is the nonnegative rounding unit and i is an integer. The interval width is u. Any variable value that falls in an interval is rounded to the midpoint of that interval. A variable value that is midway between two midpoints, and is therefore on the boundary of two intervals, rounds to the even midpoint. Even midpoints occur when i is an even integer (0, ±2, ±4, ...).
When ROUND=1 and the analysis variable values are between −2.5 and 2.5, the intervals are as follows:
i | Interval | Midpoint | Left endpt rounds to | Right endpt rounds to |
---|---|---|---|---|
−2 | [−2.5, −1.5] | −2 | −2 | −2 |
−1 | [−1.5, −0.5] | −1 | −2 | 0 |
0 | [−0.5, 0.5] | 0 | 0 | 0 |
1 | [0.5, 1.5] | 1 | 0 | 2 |
2 | [1.5, 2.5] | 2 | 2 | 2 |
When ROUND=.5 and the analysis variable values are between −1.25 and 1.25, the intervals are as follows:
i | Interval | Midpoint | Left endpt rounds to | Right endpt rounds to |
---|---|---|---|---|
−2 | [−1.25, −0.75] | −1.0 | −1 | −1 |
−1 | [−0.75, −0.25] | −0.5 | −1 | 0.0 |
0 | [−0.25, 0.25] | 0.0 | 0.0 | 0.0 |
1 | [0.25, 0.75] | 0.5 | 0.0 | 1.0 |
2 | [0.75, 1.25] | 1.0 | 1 | 1 |
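The rounding rule can be sketched in Python (an illustration of the arithmetic, not SAS code; the function name is my own). Python's built-in round() uses the same round-half-to-even convention as the boundary rule described above:

```python
def round_to_unit(x, u):
    """Round x to the midpoint u*i of the interval that contains it.

    A value exactly on an interval boundary (halfway between two
    midpoints) rounds to the even midpoint, i.e. the one with even i.
    """
    i = round(x / u)          # Python's round() is round-half-to-even
    return u * i

# ROUND=1: boundary value 0.5 lies between midpoints 0 (even) and 1 (odd)
print(round_to_unit(0.5, 1))    # boundary -> even midpoint 0
print(round_to_unit(1.5, 1))    # boundary between 1 and 2 -> 2
print(round_to_unit(-2.5, 1))   # boundary between -3 and -2 -> -2

# ROUND=0.5: 0.75 lies between midpoints 0.5 (i=1) and 1.0 (i=2) -> 1.0
print(round_to_unit(0.75, 0.5))
```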
As the rounding unit increases, the interval width also increases. This reduces the number of unique values and decreases the amount of memory that PROC UNIVARIATE needs.
This section provides computational details for the descriptive statistics that are computed with the PROC UNIVARIATE statement. These statistics can also be saved in the OUT= data set by specifying the keywords listed in Table 3.30 on page 237 in the OUTPUT statement.
Standard algorithms (Fisher 1973) are used to compute the moment statistics. The computational methods used by the UNIVARIATE procedure are consistent with those used by other SAS procedures for calculating descriptive statistics.
The following sections give specific details on a number of statistics calculated by the UNIVARIATE procedure.
The sample mean is calculated as

x̄_w = ( Σ_{i=1}^{n} w_i x_i ) / ( Σ_{i=1}^{n} w_i )

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to

x̄ = ( Σ_{i=1}^{n} x_i ) / n
The sum is calculated as Σ_{i=1}^{n} w_i x_i, where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the formula reduces to Σ_{i=1}^{n} x_i.
The sum of the weights is calculated as Σ_{i=1}^{n} w_i, where n is the number of nonmissing values for a variable and w_i is the weight associated with the ith value of the variable. If there is no WEIGHT variable, the sum of the weights is n.
The variance is calculated as

s² = (1/d) Σ_{i=1}^{n} w_i (x_i − x̄_w)²

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement: d = n − 1 if VARDEF=DF (the default), d = n if VARDEF=N, d = (Σ_{i=1}^{n} w_i) − 1 if VARDEF=WDF, and d = Σ_{i=1}^{n} w_i if VARDEF=WEIGHT or WGT.

If there is no WEIGHT variable, the formula reduces to

s² = (1/d) Σ_{i=1}^{n} (x_i − x̄)²
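A minimal Python sketch of the weighted variance computation (not SAS code; the function name and the string keys for the VARDEF= values are my own shorthand, following the standard SAS divisor definitions):

```python
def weighted_var(x, w, vardef="DF"):
    """Weighted variance with the divisor d chosen by VARDEF=."""
    sw = sum(w)                                   # sum of weights
    mean_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    css = sum(wi * (xi - mean_w) ** 2 for wi, xi in zip(w, x))
    n = len(x)
    d = {"DF": n - 1, "N": n, "WDF": sw - 1, "WGT": sw}[vardef]
    return css / d

# With unit weights, VARDEF=DF gives the ordinary sample variance
print(weighted_var([1, 2, 3, 4, 5], [1] * 5))          # 2.5

x = [2.0, 4.0, 6.0]
w = [1.0, 1.0, 2.0]   # third value counts double
print(weighted_var(x, w, "WGT"))                       # 2.75
```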
The standard deviation is calculated as

s_w = √( (1/d) Σ_{i=1}^{n} w_i (x_i − x̄_w)² )

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the weighted mean, w_i is the weight associated with the ith value of the variable, and d is the divisor controlled by the VARDEF= option in the PROC UNIVARIATE statement. If there is no WEIGHT variable, the formula reduces to

s = √( (1/d) Σ_{i=1}^{n} (x_i − x̄)² )
The sample skewness, which measures the tendency of the deviations to be larger in one direction than in the other, is calculated as follows depending on the VARDEF= option:
VARDEF | Formula |
---|---|
DF (default) | ( n / ((n − 1)(n − 2)) ) Σ_{i=1}^{n} w_i^{3/2} ((x_i − x̄_w)/s)³ |
N | (1/n) Σ_{i=1}^{n} w_i^{3/2} ((x_i − x̄_w)/s)³ |
WDF | missing |
WEIGHT or WGT | missing |

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 2. If there is no WEIGHT variable, then w_i = 1 for all i = 1, ..., n.
The sample skewness can be positive or negative; it measures the asymmetry of the data distribution and estimates the theoretical skewness √β₁ = µ₃ µ₂^{−3/2}, where µ₂ and µ₃ are the second and third central moments. Observations that are normally distributed should have a skewness near zero.
The sample kurtosis, which measures the heaviness of tails, is calculated as follows depending on the VARDEF= option:

VARDEF | Formula |
---|---|
DF (default) | ( n(n + 1) / ((n − 1)(n − 2)(n − 3)) ) Σ_{i=1}^{n} w_i² ((x_i − x̄_w)/s_w)⁴ − 3(n − 1)²/((n − 2)(n − 3)) |
N | (1/n) Σ_{i=1}^{n} w_i² ((x_i − x̄_w)/s_w)⁴ − 3 |
WDF | missing |
WEIGHT or WGT | missing |

where n is the number of nonmissing values for a variable, x_i is the ith value of the variable, x̄_w is the sample average, s_w is the sample standard deviation, and w_i is the weight associated with the ith value of the variable. If VARDEF=DF, then n must be greater than 3. If there is no WEIGHT variable, then w_i = 1 for all i = 1, ..., n.
The sample kurtosis measures the heaviness of the tails of the data distribution. It estimates the adjusted theoretical kurtosis denoted as β₂ − 3, where β₂ = µ₄/µ₂², and µ₄ is the fourth central moment. Observations that are normally distributed should have a kurtosis near zero.
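The moment calculations can be illustrated in Python for the unweighted VARDEF=DF case (a sketch using the standard SAS moment formulas; the function name is my own):

```python
import math

def skew_kurt_df(x):
    """Unweighted sample skewness and kurtosis, VARDEF=DF convention."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    z3 = sum(((v - mean) / s) ** 3 for v in x)
    z4 = sum(((v - mean) / s) ** 4 for v in x)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt

skew, kurt = skew_kurt_df([1, 2, 3, 4, 5])
print(skew)   # ~0 for symmetric data
print(kurt)   # approximately -1.2 (flatter than normal)
```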
The coefficient of variation is calculated as

CV = 100 · s / x̄
The mode is the value that occurs most often in the data. PROC UNIVARIATE counts repetitions of the values of the analysis variables or, if you specify the ROUND= option, the rounded values. If a tie occurs for the most frequent value, the procedure reports the lowest mode in the table labeled Basic Statistical Measures in the statistical output. To list all possible modes, use the MODES option in the PROC UNIVARIATE statement. When no repetitions occur in the data (as with truly continuous data), the procedure does not report the mode. The WEIGHT statement has no effect on the mode. See Example 3.2.
The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles (quantiles), as well as the minimum and maximum of each analysis variable. To compute percentiles other than these default percentiles, use the PCTLPTS= and PCTLPRE= options in the OUTPUT statement.
You can specify one of five definitions for computing the percentiles with the PCTLDEF= option. Let n be the number of nonmissing values for a variable, and let x_1, x_2, ..., x_n represent the ordered values of the variable. Let the tth percentile be y, set p = t/100, and let

np = j + g

where j is the integer part of np, and g is the fractional part of np. Then the PCTLDEF= option defines the tth percentile, y, as described in the following table:
PCTLDEF | Description | Formula |
---|---|---|
1 | Weighted average at x_{np} | y = (1 − g)x_j + g·x_{j+1}, where x_0 is taken to be x_1 |
2 | Observation numbered closest to np | y = x_j if g < 1/2; y = x_j if g = 1/2 and j is even; y = x_{j+1} if g = 1/2 and j is odd; y = x_{j+1} if g > 1/2 |
3 | Empirical distribution function | y = x_j if g = 0; y = x_{j+1} if g > 0 |
4 | Weighted average aimed at x_{(n+1)p} | y = (1 − g)x_j + g·x_{j+1}, where j + g = (n + 1)p and x_{n+1} is taken to be x_n |
5 | Empirical distribution function with averaging | y = (x_j + x_{j+1})/2 if g = 0; y = x_{j+1} if g > 0 |
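The five definitions can be sketched in Python for the unweighted case (an illustration, not SAS code; function and variable names are my own):

```python
import math

def percentile(xs, t, pctldef=5):
    """Compute the t-th percentile of xs under PCTLDEF=1..5 (unweighted)."""
    x = sorted(xs)
    n = len(x)
    p = t / 100.0

    def val(j):                      # x_j with the table's boundary rules
        return x[min(max(j, 1), n) - 1]

    if pctldef == 4:                 # aims at x_{(n+1)p}
        g, j = math.modf((n + 1) * p)
    else:
        g, j = math.modf(n * p)
    j = int(j)

    if pctldef in (1, 4):
        return (1 - g) * val(j) + g * val(j + 1)
    if pctldef == 2:
        if g < 0.5 or (g == 0.5 and j % 2 == 0):
            return val(j)
        return val(j + 1)
    if pctldef == 3:
        return val(j) if g == 0 else val(j + 1)
    if pctldef == 5:
        return 0.5 * (val(j) + val(j + 1)) if g == 0 else val(j + 1)

x = list(range(1, 11))               # 1..10
print(percentile(x, 50, 5))          # median, averaged EDF -> 5.5
print(percentile(x, 50, 1))          # weighted average at x_{np} -> 5.0
print(percentile(x, 50, 4))          # aims at x_{(n+1)p} -> 5.5
```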
When you use a WEIGHT statement, the percentiles are computed differently. The 100pth weighted percentile y is computed from the empirical distribution function with averaging:

y = (x_i + x_{i+1})/2  if Σ_{j=1}^{i} w_j = pW
y = x_{i+1}            if Σ_{j=1}^{i} w_j < pW < Σ_{j=1}^{i+1} w_j

where w_i is the weight associated with x_i, and W = Σ_{j=1}^{n} w_j is the sum of the weights.
Note that the PCTLDEF= option is not applicable when a WEIGHT statement is used. However, in this case, if all the weights are identical, the weighted percentiles are the same as the percentiles that would be computed without a WEIGHT statement and with PCTLDEF=5.
You can use the CIPCTLNORMAL option to request confidence limits for percentiles, assuming the data are normally distributed. These limits are described in Section 4.4.1 of Hahn and Meeker (1991). When 0 < p < 1/2, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

Lower limit = x̄ − g′(1 − α/2; 1 − p, n)·s
Upper limit = x̄ − g′(α/2; 1 − p, n)·s

where n is the sample size. When 1/2 ≤ p < 1, the two-sided 100(1 − α)% confidence limits for the 100pth percentile are

Lower limit = x̄ + g′(α/2; p, n)·s
Upper limit = x̄ + g′(1 − α/2; p, n)·s

One-sided 100(1 − α)% confidence bounds are computed by replacing α/2 by α in the appropriate preceding equation. The factor g′(γ; p, n) is related to the noncentral t distribution and is described in Owen and Hua (1977) and Odeh and Owen (1980). See Example 3.10.
You can use the CIPCTLDF option to request distribution-free confidence limits for percentiles. In particular, it is not necessary to assume that the data are normally distributed. These limits are described in Section 5.2 of Hahn and Meeker (1991). The two-sided 100(1 − α)% confidence limits for the 100pth percentile are

Lower limit = X_(l)
Upper limit = X_(u)
where X_(j) is the jth order statistic when the data values are arranged in increasing order: X_(1) ≤ X_(2) ≤ ... ≤ X_(n).
The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around [np] + 1, where [np] is the integer part of np and n is the sample size. Furthermore, l and u are chosen so that X_(l) and X_(u) are as close to X_([n+1]p) as possible while satisfying the coverage probability requirement

Q(u − 1; n, p) − Q(l − 1; n, p) ≥ 1 − α
where Q(k; n, p) is the cumulative binomial probability

Q(k; n, p) = Σ_{i=0}^{k} C(n, i) p^i (1 − p)^{n−i}
In some cases, the coverage requirement cannot be met, particularly when n is small and p is near 0 or 1. To relax the requirement of symmetry, you can specify CIPCTLDF(TYPE = ASYMMETRIC). This option requests symmetric limits when the coverage requirement can be met, and asymmetric limits otherwise .
If you specify CIPCTLDF(TYPE=LOWER), a one-sided 100(1 − α)% lower confidence bound is computed as X_(l), where l is the largest integer that satisfies the inequality

Q(l − 1; n, p) ≤ α

with 0 < l ≤ n. Likewise, if you specify CIPCTLDF(TYPE=UPPER), a one-sided 100(1 − α)% upper confidence bound is computed as X_(u), where u is the smallest integer that satisfies the inequality

Q(u − 1; n, p) ≥ 1 − α
Note that confidence limits for percentiles are not computed when a WEIGHT statement is specified. See Example 3.10.
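The cumulative binomial probability that underlies the coverage requirement for distribution-free percentile limits can be sketched in Python (the function name and example ranks are my own):

```python
from math import comb

def Q(k, n, p):
    """Cumulative binomial probability P(X <= k), X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Coverage of the interval (X_(l), X_(u)) for the 100p-th percentile
# is Q(u - 1; n, p) - Q(l - 1; n, p).
n, p = 10, 0.5
l, u = 2, 9
coverage = Q(u - 1, n, p) - Q(l - 1, n, p)
print(round(coverage, 4))   # 0.9785
```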
PROC UNIVARIATE provides three tests for location: Student's t test, the sign test, and the Wilcoxon signed rank test. All three tests produce a test statistic for the null hypothesis that the mean or median is equal to a given value µ against the two-sided alternative that the mean or median is not equal to µ. By default, PROC UNIVARIATE sets the value of µ to zero. You can use the MU0= option in the PROC UNIVARIATE statement to specify the value of µ. Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the t test is asymptotically equivalent to a z test. If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the t test. You must use the default value for the VARDEF= option in the PROC statement (VARDEF=DF). See Example 3.12.
You can also use these tests to compare means or medians of paired data. Data are said to be paired when subjects or units are matched in pairs according to one or more variables, such as pairs of subjects with the same age and gender. Paired data also occur when each subject or unit is measured at two times or under two conditions. To compare the means or medians of the two times, create an analysis variable that is the difference between the two measures. The test that the mean or the median difference of the variables equals zero is equivalent to the test that the means or medians of the two original variables are equal. Note that you can also carry out these tests using the PAIRED statement in the TTEST procedure; refer to Chapter 77, "The TTEST Procedure," in the SAS/STAT User's Guide. Also see Example 3.13.
PROC UNIVARIATE calculates the t statistic as

t = (x̄ − µ) / (s / √n)

where x̄ is the sample mean, n is the number of nonmissing values for a variable, and s is the sample standard deviation. The null hypothesis is that the population mean equals µ. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic that is as extreme, or more extreme, than the observed value (the p-value) is obtained from the t distribution with n − 1 degrees of freedom. For large n, the t statistic is asymptotically equivalent to a z test. When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the t statistic is calculated as

t_w = (x̄_w − µ) / ( s_w / √(Σ_{i=1}^{n} w_i) )
where x̄_w is the weighted mean, s_w is the weighted standard deviation, and w_i is the weight for the ith observation. The t_w statistic is treated as having a Student's t distribution with n − 1 degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, n is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, n is the number of nonmissing observations for the WEIGHT variable.
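The unweighted t statistic can be sketched in Python (an illustration of the formula, not SAS code; the function name is my own):

```python
import math

def t_statistic(x, mu0=0.0):
    """Student's t statistic for H0: population mean equals mu0."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

# mean 3, s = sqrt(2.5), so t = 3 / sqrt(0.5) = 3*sqrt(2)
print(t_statistic([1, 2, 3, 4, 5]))
```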
PROC UNIVARIATE calculates the sign test statistic as

M = (n⁺ − n⁻) / 2

where n⁺ is the number of values that are greater than µ, and n⁻ is the number of values that are less than µ. Values equal to µ are discarded. Under the null hypothesis that the population median is equal to µ, the p-value for the observed statistic M_obs is

Pr(|M| ≥ |M_obs|) = 0.5^{n_t − 1} Σ_{j=0}^{min(n⁺, n⁻)} C(n_t, j)

where n_t = n⁺ + n⁻ is the number of x_i values not equal to µ.
Note: If n + and n ˆ’ are equal, the p -value is equal to one.
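A Python sketch of the sign test computation (the function name is my own; the p-value sums binomial probabilities up to min(n⁺, n⁻) and is capped at one):

```python
from math import comb

def sign_test(x, mu=0.0):
    """Sign test statistic M and its two-sided p-value."""
    n_plus = sum(1 for v in x if v > mu)
    n_minus = sum(1 for v in x if v < mu)   # values equal to mu are discarded
    nt = n_plus + n_minus
    m = (n_plus - n_minus) / 2
    tail = sum(comb(nt, j) for j in range(min(n_plus, n_minus) + 1))
    p = min(1.0, 2 * tail * 0.5**nt)
    return m, p

# Three values above mu=0 and one below: M = 1, p = 2*(1+4)/16 = 0.625
print(sign_test([1.5, 2.0, 0.7, -0.3]))
```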
The signed rank statistic S is computed as

S = Σ_{i: x_i > µ} r_i⁺ − n_t(n_t + 1)/4

where r_i⁺ is the rank of |x_i − µ| after discarding values of x_i = µ, and n_t is the number of x_i values not equal to µ. Average ranks are used for tied values.
If n_t ≤ 20, the significance of S is computed from the exact distribution of S, where the distribution is a convolution of scaled binomial distributions. When n_t > 20, the significance of S is computed by treating

t = S √( (n_t − 1) / (n_t V − S²) )

as a Student's t variate with n_t − 1 degrees of freedom. V is computed as

V = (1/24) n_t (n_t + 1)(2n_t + 1) − (1/48) Σ t_i (t_i + 1)(t_i − 1)

where the sum is over groups tied in absolute value and where t_i is the number of values in the ith group (Iman 1974; Conover 1999). The null hypothesis tested is that the mean (or median) is zero, assuming that the distribution is symmetric. Refer to Lehmann (1998).
The two-sided 100(1 − α)% confidence interval for the mean has upper and lower limits

x̄ ± t_{1−α/2; n−1} · s/√n

where s is the sample standard deviation and t_{1−α/2; n−1} is the 100(1 − α/2)th percentile of the t distribution with n − 1 degrees of freedom. The one-sided upper 100(1 − α)% confidence limit is computed as x̄ + t_{1−α; n−1} · s/√n, and the one-sided lower 100(1 − α)% confidence limit is computed as x̄ − t_{1−α; n−1} · s/√n. See Example 3.9.
The two-sided 100(1 − α)% confidence interval for the standard deviation has lower and upper limits

s √( (n − 1) / χ²_{1−α/2; n−1} )  and  s √( (n − 1) / χ²_{α/2; n−1} )

respectively, where χ²_{1−α/2; n−1} and χ²_{α/2; n−1} are the 100(1 − α/2)th and 100(α/2)th percentiles of the chi-square distribution with n − 1 degrees of freedom. A one-sided 100(1 − α)% confidence bound has lower or upper limit

s √( (n − 1) / χ²_{1−α; n−1} )  or  s √( (n − 1) / χ²_{α; n−1} )

respectively. The 100(1 − α)% confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation. When you use the WEIGHT statement and specify VARDEF=DF in the PROC statement, the 100(1 − α)% confidence interval for the weighted mean is

x̄_w ± t_{1−α/2; n−1} · s_w / √(Σ_{i=1}^{n} w_i)

where x̄_w is the weighted mean, s_w is the weighted standard deviation, w_i is the weight for the ith observation, and t_{1−α/2; n−1} is the 100(1 − α/2)th percentile for the t distribution with n − 1 degrees of freedom.
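A Python sketch of the two-sided confidence interval for the mean (not SAS code; the critical value is taken from a standard t table rather than computed, and the function name is my own):

```python
import math

def mean_ci(x, t_crit):
    """Two-sided CI for the mean; t_crit = t_{1-alpha/2, n-1} from a t table."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    half = t_crit * s / math.sqrt(n)
    return mean - half, mean + half

# For n=5 and alpha=0.05, t_{0.975,4} = 2.776 (from standard t tables)
lo, hi = mean_ci([1, 2, 3, 4, 5], 2.776)
print(round(lo, 3), round(hi, 3))   # 1.037 4.963
```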
A statistical method is robust if it is insensitive to moderate or even large departures from the assumptions that justify the method. PROC UNIVARIATE provides several methods for robust estimation of location and scale. See Example 3.11.
The Winsorized mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times Winsorized mean is calculated as

x̄_{wk} = (1/n) [ (k + 1) x_{(k+1)} + Σ_{i=k+2}^{n−k−1} x_{(i)} + (k + 1) x_{(n−k)} ]

where n is the number of observations, and x_{(i)} is the ith order statistic when the observations are arranged in increasing order:

x_{(1)} ≤ x_{(2)} ≤ ... ≤ x_{(n)}

The Winsorized mean is computed as the ordinary mean after the k smallest observations are replaced by the (k + 1)st smallest observation, and the k largest observations are replaced by the (k + 1)st largest observation.
For data from a symmetric distribution, the Winsorized mean is an unbiased estimate of the population mean. However, the Winsorized mean does not have a normal distribution even if the data are from a normal population.
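The replace-then-average description can be sketched in Python (an illustration, not SAS code; the function name is my own):

```python
def winsorized_mean(xs, k):
    """k-times Winsorized mean: clamp the k smallest/largest order stats."""
    x = sorted(xs)
    n = len(x)
    # replace k smallest by the (k+1)st smallest, k largest by (k+1)st largest
    w = [x[k]] * k + x[k:n - k] + [x[n - k - 1]] * k
    return sum(w) / n

# k=1: [1, 2, 3, 4, 100] -> [2, 2, 3, 4, 4], so the outlier is tamed
print(winsorized_mean([1, 2, 3, 4, 100], 1))   # 3.0
```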
The Winsorized sum of squared deviations is defined as

s²_{wk} = (k + 1)(x_{(k+1)} − x̄_{wk})² + Σ_{i=k+2}^{n−k−1} (x_{(i)} − x̄_{wk})² + (k + 1)(x_{(n−k)} − x̄_{wk})²

where x̄_{wk} is the k-times Winsorized mean. The Winsorized t statistic is given by

t_{wk} = (x̄_{wk} − µ) / SE(x̄_{wk})

where µ denotes the location under the null hypothesis, and the standard error of the Winsorized mean is

SE(x̄_{wk}) = ( (n − 1) / (n − 2k − 1) ) · s_{wk} / √(n(n − 1))
When the data are from a symmetric distribution, the distribution of t_{wk} is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).
The Winsorized 100(1 − α)% confidence interval for the location parameter has upper and lower limits

x̄_{wk} ± t_{1−α/2} SE(x̄_{wk})

where t_{1−α/2} is the 100(1 − α/2)th percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.
Like the Winsorized mean, the trimmed mean is a robust estimator of the location that is relatively insensitive to outliers. The k-times trimmed mean is calculated as

x̄_{tk} = ( 1 / (n − 2k) ) Σ_{i=k+1}^{n−k} x_{(i)}

where n is the number of observations, and x_{(i)} is the ith order statistic when the observations are arranged in increasing order:

x_{(1)} ≤ x_{(2)} ≤ ... ≤ x_{(n)}

The trimmed mean is computed after the k smallest and k largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. However, the trimmed mean does not have a normal distribution even if the data are from a normal population.
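The delete-then-average description can be sketched in Python (an illustration, not SAS code; the function name is my own):

```python
def trimmed_mean(xs, k):
    """k-times trimmed mean: drop the k smallest and k largest values."""
    x = sorted(xs)
    kept = x[k:len(x) - k]
    return sum(kept) / len(kept)

# k=1: [1, 2, 3, 4, 100] -> mean of [2, 3, 4]
print(trimmed_mean([1, 2, 3, 4, 100], 1))   # 3.0
```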
A robust estimate of the variance of the trimmed mean x̄_{tk} can be based on the Winsorized sum of squared deviations s²_{wk}, which is defined in the section Winsorized Means on page 278; refer to Tukey and McLaughlin (1963). This can be used to compute a trimmed t test which is based on the test statistic

t_{tk} = (x̄_{tk} − µ) / SE(x̄_{tk})

where the standard error of the trimmed mean is

SE(x̄_{tk}) = s_{wk} / √( (n − 2k)(n − 2k − 1) )
When the data are from a symmetric distribution, the distribution of t_{tk} is approximated by a Student's t distribution with n − 2k − 1 degrees of freedom (Tukey and McLaughlin 1963; Dixon and Tukey 1968).
The trimmed 100(1 − α)% confidence interval for the location parameter has upper and lower limits

x̄_{tk} ± t_{1−α/2} SE(x̄_{tk})

where t_{1−α/2} is the 100(1 − α/2)th percentile of the Student's t distribution with n − 2k − 1 degrees of freedom.
The sample standard deviation, which is the most commonly used estimator of scale, is sensitive to outliers. Robust scale estimators, on the other hand, remain bounded when a single data value is replaced by an arbitrarily large or small value. The UNIVARIATE procedure computes several robust measures of scale, including the interquartile range, Gini's mean difference G, the median absolute deviation about the median (MAD), Q_n, and S_n. In addition, the procedure computes estimates of the normal standard deviation σ derived from each of these measures.
The interquartile range (IQR) is simply the difference between the upper and lower quartiles. For a normal population, σ can be estimated as IQR/1.34898.
Gini's mean difference is computed as

G = ( 1 / C(n, 2) ) Σ_{i<j} |x_i − x_j|

For a normal population, the expected value of G is 2σ/√π. Thus G·√π/2 is a robust estimator of σ when the data are from a normal sample. For the normal distribution, this estimator has high efficiency relative to the usual sample standard deviation, and it is also less sensitive to the presence of outliers.
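Gini's mean difference is just the average absolute difference over all pairs, which can be sketched in Python (the function name is my own):

```python
from math import comb, sqrt, pi

def gini_mean_difference(x):
    """Average absolute difference over all n-choose-2 pairs."""
    n = len(x)
    total = sum(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    return total / comb(n, 2)

x = [1, 2, 3, 4, 5]
g = gini_mean_difference(x)
print(g)                         # 2.0
print(g * sqrt(pi) / 2)          # robust estimate of sigma
```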
A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel 1974), which is computed as

MAD = med_i | x_i − med_j(x_j) |

where the inner median, med_j(x_j), is the median of the n observations, and the outer median (taken over i) is the median of the n absolute values of the deviations about the inner median. For a normal population, 1.4826·MAD is an estimator of σ.
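The nested-median computation can be sketched in Python (an illustration, not SAS code; the function name is my own):

```python
from statistics import median

def mad(x):
    """Median absolute deviation about the median."""
    m = median(x)                          # inner median
    return median(abs(v - m) for v in x)   # outer median of absolute deviations

data = [1, 2, 3, 4, 100]
print(mad(data))            # 1 (the outlier barely matters)
print(1.4826 * mad(data))   # estimate of sigma for normal data
```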
The MAD has low efficiency for normal distributions, and it may not always be appropriate for symmetric distributions. Rousseeuw and Croux (1993) proposed two statistics as alternatives to the MAD. The first is

S_n = 1.1926 · med_i ( med_j |x_i − x_j| )

where the outer median (taken over i) is the median of the n medians of |x_i − x_j|, j = 1, 2, ..., n. To reduce small-sample bias, c_{sn} S_n is used to estimate σ, where c_{sn} is a correction factor; refer to Croux and Rousseeuw (1992).
The second statistic proposed by Rousseeuw and Croux (1993) is

Q_n = 2.219 { |x_i − x_j| ; i < j }_{(k)}

where

k = C(h, 2)  and  h = [n/2] + 1

In other words, Q_n is 2.219 times the kth order statistic of the distances between the data points. The bias-corrected statistic c_{qn} Q_n is used to estimate σ, where c_{qn} is a correction factor; refer to Croux and Rousseeuw (1992).
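The Q_n statistic can be sketched in Python (an illustration without the small-sample correction factor; the function name is my own):

```python
from math import comb

def qn(x):
    """Q_n: 2.219 times the k-th order statistic of pairwise distances."""
    n = len(x)
    dists = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    h = n // 2 + 1
    k = comb(h, 2)
    return 2.219 * dists[k - 1]          # k-th smallest distance

# n=5: h=3, k=3, and the 3rd smallest pairwise distance of 1..5 is 1
print(qn([1, 2, 3, 4, 5]))   # 2.219
```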
The PLOTS option in the PROC UNIVARIATE statement provides up to four diagnostic line printer plots to examine the data distribution. These plots are the stem-and-leaf plot or horizontal bar chart, the box plot, the normal probability plot, and the side-by-side box plots. If you specify the WEIGHT statement, PROC UNIVARIATE provides a weighted histogram, a weighted box plot based on the weighted quantiles, and a weighted normal probability plot.
Note that these plots are a legacy feature of the UNIVARIATE procedure in earlier versions of SAS. They predate the addition of the HISTOGRAM, PROBPLOT, and QQPLOT statements, which provide high-resolution graphics displays. Also note that line printer plots requested with the PLOTS option are mainly intended for use with the ODS LISTING destination. See Example 3.5.
The first plot in the output is either a stem-and-leaf plot (Tukey 1977) or a horizontal bar chart. If any single interval contains more than 49 observations, the horizontal bar chart appears. Otherwise, the stem-and-leaf plot appears. The stem-and-leaf plot is like a horizontal bar chart in that both plots provide a method to visualize the overall distribution of the data. The stem-and-leaf plot provides more detail because each point in the plot represents an individual data value.
To change the number of stems that the plot displays, use PLOTSIZE= to increase or decrease the number of rows. Instructions that appear below the plot explain how to determine the values of the variable. If no instructions appear, you multiply Stem.Leaf by 1 to determine the values of the variable. For example, if the stem value is 10 and the leaf value is 1, then the variable value is approximately 10.1. For the stem-and-leaf plot, the procedure rounds a variable value to the nearest leaf. If the variable value is exactly halfway between two leaves, the value rounds to the nearest leaf with an even integer value. For example, a variable value of 3.15 has a stem value of 3 and a leaf value of 2.
The box plot, also known as a schematic box plot, appears beside the stem-and-leaf plot. Both plots use the same vertical scale. The box plot provides a visual summary of the data and identifies outliers. The bottom and top edges of the box correspond to the sample 25th (Q1) and 75th (Q3) percentiles. The box length is one interquartile range (Q3 − Q1). The center horizontal line with asterisk endpoints corresponds to the sample median. The central plus sign (+) corresponds to the sample mean. If the mean and median are equal, the plus sign falls on the line inside the box. The vertical lines that project out from the box, called whiskers, extend as far as the data extend, up to a distance of 1.5 interquartile ranges. Values farther away are potential outliers. The procedure identifies the extreme values with a zero or an asterisk (*). If zero appears, the value is between 1.5 and 3 interquartile ranges from the top or bottom edge of the box. If an asterisk appears, the value is more extreme.
Note: To produce box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, "The BOXPLOT Procedure," in the SAS/STAT User's Guide.
The normal probability plot plots the empirical quantiles against the quantiles of a standard normal distribution. Asterisks (*) indicate the data values. The plus signs (+) provide a straight reference line that is drawn by using the sample mean and standard deviation. If the data are from a normal distribution, the asterisks tend to fall along the reference line. The vertical coordinate is the data value, and the horizontal coordinate is Φ⁻¹(v_i), where

v_i = (i − 3/8) / (n + 1/4)

For a weighted normal probability plot, the ith ordered observation is plotted against Φ⁻¹(v_i), where

v_i = ( Σ_{j=1}^{i} w_(j) − (3/8) w_(i) ) / ( (1 + 1/(4n)) Σ_{j=1}^{n} w_j )

and w_(j) denotes the weight associated with the jth ordered observation. When each observation has an identical weight, w_j = w, the formula for v_i reduces to the expression for v_i in the unweighted normal probability plot:

v_i = (i − 3/8) / (n + 1/4)
When the value of VARDEF= is WDF or WEIGHT, a reference line with intercept x̄_w and slope s_w is added to the plot. When the value of VARDEF= is DF or N, the slope is s_w/√w̄, where w̄ is the average weight.

When each observation has an identical weight and the value of VARDEF= is DF, N, or WEIGHT, the reference line reduces to the usual reference line with intercept x̄ and slope s in the unweighted normal probability plot.

If the data are normally distributed with mean µ and standard deviation σ, and each observation has an identical weight w, then the points on the plot should lie approximately on a straight line. The intercept for this line is µ. The slope is σ when VARDEF= is WDF or WEIGHT, and the slope is σ/√w when VARDEF= is DF or N.
Note: To produce probability plots using high-resolution graphics, use the PROBPLOT statement in PROC UNIVARIATE; see the section PROBPLOT Statement on page 241.
When you use a BY statement with the PLOT option, PROC UNIVARIATE produces side-by-side box plots, one for each BY group. The box plots (also known as schematic plots) use a common scale that enables you to compare the data distribution across BY groups. This plot appears after the univariate analyses of all BY groups. Use the NOBYPLOT option to suppress this plot.
Note: To produce side-by-side box plots using high-resolution graphics, use the BOXPLOT procedure in SAS/STAT software; refer to Chapter 18, "The BOXPLOT Procedure," in the SAS/STAT User's Guide.
If your site licenses SAS/GRAPH software, you can use the HISTOGRAM, PROBPLOT, and QQPLOT statements to create high-resolution graphs. The HISTOGRAM statement creates histograms that enable you to examine the data distribution. You can optionally fit families of density curves and superimpose kernel density estimates on the histograms. For additional information about the fitted distributions and kernel density estimates, see the section Formulas for Fitted Continuous Distributions on page 288 and the section Kernel Density Estimates on page 297.
The PROBPLOT statement creates a probability plot, which compares ordered values of a variable with percentiles of a specified theoretical distribution. The QQPLOT statement creates a quantile-quantile plot, which compares ordered values of a variable with quantiles of a specified theoretical distribution. You can use these plots to determine how well a theoretical distribution models a data distribution.
Note: You can use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statements to create comparative histograms, probability plots, or Q-Q plots, respectively.
When you use the CLASS statement with the HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE creates comparative histograms, comparative probability plots, or comparative quantile-quantile plots. You can use these plot statements with the CLASS statement to create one-way and two-way comparative plots. When you use one class variable, PROC UNIVARIATE displays an array of component plots (stacked or side-by-side), one for each level of the classification variable. When you use two class variables, PROC UNIVARIATE displays a matrix of component plots, one for each combination of levels of the classification variables. The observations in a given level are referred to collectively as a cell.
When you create a one-way comparative plot, the observations in the input data set are sorted by the method specified in the ORDER= option. PROC UNIVARIATE creates a separate plot for the analysis variable values in each level, and arranges these component plots in an array to form the comparative plot with uniform horizontal and vertical axes. See Example 3.15.
When you create a two-way comparative plot, the observations in the input data set are cross- classified according to the values (levels) of these variables. PROC UNIVARIATE creates a separate plot for the analysis variable values in each cell of the cross-classification and arranges these component plots in a matrix to form the comparative plot with uniform horizontal and vertical axes. The levels of the first class variable are the labels for the rows of the matrix, and the levels of the second class variable are the labels for the columns of the matrix. See Example 3.16.
PROC UNIVARIATE determines the layout of a two-way comparative plot by using the order for the first class variable to obtain the order of the rows from top to bottom. Then it applies the order for the second class variable to the observations that correspond to the first row to obtain the order of the columns from left to right. If any columns remain unordered (that is, the categories are unbalanced), PROC UNIVARIATE applies the order for the second class variable to the observations in the second row, and so on, until all the columns have been ordered.
If you associate a label with a variable, PROC UNIVARIATE displays the variable label in the comparative plot and this label is parallel to the column (or row) labels.
Use the MISSING option to treat missing values as valid levels.
To reduce the number of classification levels, use a FORMAT statement to combine variable values.
To position the inset by using a compass point position, specify the value N, NE, E, SE, S, SW, W, or NW with the POSITION= option. The default position of the inset is NW. The following statements produce a histogram to show the position of the inset for the eight compass points:
data Score;
   input Student $ PreTest PostTest @@;
   label ScoreChange = 'Change in Test Scores';
   ScoreChange = PostTest - PreTest;
   datalines;
Capalleti 94 91   Dubose    51 65
Engles    95 97   Grant     63 75
Krupski   80 75   Lundsford 92 55
Mcbane    75 78   Mullen    89 82
Nguyen    79 76   Patel     71 77
Si        75 70   Tanaka    87 73
;
run;

title 'Test Scores for a College Course';
proc univariate data=Score noprint;
   histogram PreTest / midpoints = 45 to 95 by 10;
   inset n     / cfill=blank header='Position = NW' pos=nw;
   inset mean  / cfill=blank header='Position = N ' pos=n ;
   inset sum   / cfill=blank header='Position = NE' pos=ne;
   inset max   / cfill=blank header='Position = E ' pos=e ;
   inset min   / cfill=blank header='Position = SE' pos=se;
   inset nobs  / cfill=blank header='Position = S ' pos=s ;
   inset range / cfill=blank header='Position = SW' pos=sw;
   inset mode  / cfill=blank header='Position = W ' pos=w ;
   label PreTest = 'Pretest Score';
run;
To position the inset in one of the four margins that surround the plot area, specify the value LM, RM, TM, or BM with the POSITION= option.
Margin positions are recommended if you list a large number of statistics in the INSET statement. If you attempt to display a lengthy inset in the interior of the plot, it is most likely that the inset will collide with the data display.
To position the inset with coordinates, use POSITION=(x,y). You specify the coordinates in axis data units or in axis percentage units (the default).
If you specify the DATA option immediately following the coordinates, PROC UNIVARIATE positions the inset by using axis data units. For example, the following statements place the bottom left corner of the inset at 45 on the horizontal axis and 10 on the vertical axis:
title 'Test Scores for a College Course';
proc univariate data=Score noprint;
   histogram PreTest / midpoints = 45 to 95 by 10;
   inset n / header   = 'Position=(45,10)'
             position = (45,10) data;
run;
By default, the specified coordinates determine the position of the bottom left corner of the inset. To change this reference point, use the REFPOINT= option (see the next example).
If you omit the DATA option, PROC UNIVARIATE positions the inset by using axis percentage units. The coordinates in axis percentage units must be between 0 and 100. The coordinates of the bottom left corner of the display are (0,0), while the upper right corner is (100,100). For example, the following statements create a histogram and use coordinates in axis percentage units to position the two insets:
title 'Test Scores for a College Course';
proc univariate data=Score noprint;
   histogram PreTest / midpoints = 45 to 95 by 10;
   inset min / position = (5,25)
               header   = 'Position=(5,25)'
               refpoint = tl;
   inset max / position = (95,95)
               header   = 'Position=(95,95)'
               refpoint = tr;
run;
The REFPOINT= option determines which corner of the inset to place at the coordinates that are specified with the POSITION= option. The first inset uses REFPOINT=TL, so that the top left corner of the inset is positioned 5% of the way across the horizontal axis and 25% of the way up the vertical axis. The second inset uses REFPOINT=TR, so that the top right corner of the inset is positioned 95% of the way across the horizontal axis and 95% of the way up the vertical axis.
A sample program, univar3.sas, for these examples is available in the SAS Sample Library for Base SAS software.
The following sections provide information on the families of parametric distributions that you can fit with the HISTOGRAM statement. Properties of these distributions are discussed by Johnson, Kotz, and Balakrishnan (1994, 1995).
The fitted density function is

   p(x) = 100h% · ((x − θ)^(α−1) (θ + σ − x)^(β−1)) / (B(α,β) σ^(α+β−1))   for θ < x < θ + σ

where B(α,β) = Γ(α)Γ(β)/Γ(α+β) and

θ = lower threshold parameter (lower endpoint parameter)
σ = scale parameter (σ > 0)
α = shape parameter (α > 0)
β = shape parameter (β > 0)
h = width of histogram interval
Note: This notation is consistent with that of other distributions that you can fit with the HISTOGRAM statement. However, many texts, including Johnson, Kotz, and Balakrishnan (1995), write the beta density function as

   f(y) = (y^(p−1) (1 − y)^(q−1)) / B(p,q)   for 0 < y < 1

The two parameterizations are related as follows:

   x = σy + θ,  α = p,  β = q
The range of the beta distribution is bounded below by a threshold parameter θ = a and above by θ + σ = b. If you specify a fitted beta curve using the BETA option, θ must be less than the minimum data value, and θ + σ must be greater than the maximum data value. You can specify θ and σ with the THETA= and SIGMA= beta-options in parentheses after the keyword BETA. By default, σ = 1 and θ = 0. If you specify THETA=EST and SIGMA=EST, maximum likelihood estimates are computed for θ and σ. However, three- and four-parameter maximum likelihood estimation might not always converge.
In addition, you can specify α and β with the ALPHA= and BETA= beta-options, respectively. By default, the procedure calculates maximum likelihood estimates for α and β. For example, to fit a beta density curve to a set of data bounded below by 32 and above by 212 with maximum likelihood estimates for α and β, use the following statement:
histogram Length / beta(theta=32 sigma=180);
The beta distributions are also referred to as Pearson Type I or II distributions. These include the power-function distribution (β = 1), the arc-sine distribution (α = β = 1/2), and the generalized arc-sine distributions (α + β = 1, β ≠ 1/2).
You can use the DATA step function BETAINV to compute beta quantiles and the DATA step function PROBBETA to compute beta probabilities.
The fitted density function is

   p(x) = 100h% · (1/σ) exp(−(x − θ)/σ)   for x ≥ θ

where

θ = threshold parameter
σ = scale parameter (σ > 0)
h = width of histogram interval
The threshold parameter θ must be less than or equal to the minimum data value. You can specify θ with the THRESHOLD= exponential-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ with the SCALE= exponential-option. By default, the procedure calculates a maximum likelihood estimate for σ. Note that some authors define the scale parameter as 1/σ.
The exponential distribution is a special case of both the gamma distribution (with α = 1) and the Weibull distribution (with c = 1). A related distribution is the extreme value distribution. If Y = exp(−X) has an exponential distribution, then X has an extreme value distribution.
The fitted density function is

   p(x) = 100h% · (1/(Γ(α)σ)) ((x − θ)/σ)^(α−1) exp(−(x − θ)/σ)   for x > θ

where

θ = threshold parameter
σ = scale parameter (σ > 0)
α = shape parameter (α > 0)
h = width of histogram interval
The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= gamma-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. In addition, you can specify σ and α with the SCALE= and ALPHA= gamma-options. By default, the procedure calculates maximum likelihood estimates for σ and α.
The gamma distributions are also referred to as Pearson Type III distributions, and they include the chi-square, exponential, and Erlang distributions. The probability density function for the chi-square distribution with ν degrees of freedom is

   f(x) = (1/(2^(ν/2) Γ(ν/2))) x^(ν/2−1) exp(−x/2)   for x > 0

Notice that this is a gamma distribution with α = ν/2, σ = 2, and θ = 0. The exponential distribution is a gamma distribution with α = 1, and the Erlang distribution is a gamma distribution with α being a positive integer. A related distribution is the Rayleigh distribution. If R = √(X₁² + ⋯ + X_ν²), where the X_i's are independent standard normal variables, then R is distributed with a χ_ν distribution having a probability density function of

   f(x) = (1/(2^(ν/2−1) Γ(ν/2))) x^(ν−1) exp(−x²/2)   for x > 0

If ν = 2, the preceding distribution is referred to as the Rayleigh distribution.
You can use the DATA step function GAMINV to compute gamma quantiles and the DATA step function PROBGAM to compute gamma probabilities.
The fitted density function is

   p(x) = 100h% · (1/(σ√(2π)(x − θ))) exp(−(log(x − θ) − ζ)² / (2σ²))   for x > θ

where

θ = threshold parameter
ζ = scale parameter (−∞ < ζ < ∞)
σ = shape parameter (σ > 0)
h = width of histogram interval
The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= lognormal-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify ζ and σ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.
Note: The lognormal distribution is also referred to as the S_L distribution in the Johnson system of distributions.
Note: This book uses σ to denote the shape parameter of the lognormal distribution, whereas σ is used to denote the scale parameter of the beta, exponential, gamma, normal, and Weibull distributions. The use of σ to denote the lognormal shape parameter is based on the fact that (log(X − θ) − ζ)/σ has a standard normal distribution if X is lognormally distributed. Based on this relationship, you can use the DATA step function PROBIT to compute lognormal quantiles and the DATA step function PROBNORM to compute probabilities.
The fitted density function is

   p(x) = 100h% · (1/(σ√(2π))) exp(−(x − μ)² / (2σ²))   for all x

where

μ = mean
σ = standard deviation (σ > 0)
h = width of histogram interval
You can specify μ and σ with the MU= and SIGMA= normal-options, respectively. By default, the procedure estimates μ with the sample mean and σ with the sample standard deviation.
You can use the DATA step function PROBIT to compute normal quantiles and the DATA step function PROBNORM to compute probabilities.
Note: The normal distribution is also referred to as the S_N distribution in the Johnson system of distributions.
The fitted density function is

   p(x) = 100h% · (c/σ) ((x − θ)/σ)^(c−1) exp(−((x − θ)/σ)^c)   for x > θ

where

θ = threshold parameter
σ = scale parameter (σ > 0)
c = shape parameter (c > 0)
h = width of histogram interval
The threshold parameter θ must be less than the minimum data value. You can specify θ with the THRESHOLD= Weibull-option. By default, θ = 0. If you specify THETA=EST, a maximum likelihood estimate is computed for θ. You can specify σ and c with the SCALE= and SHAPE= Weibull-options, respectively. By default, the procedure calculates maximum likelihood estimates for σ and c.
The exponential distribution is a special case of the Weibull distribution where c = 1.
When you specify the NORMAL option in the PROC UNIVARIATE statement or you request a fitted parametric distribution in the HISTOGRAM statement, the procedure computes goodness-of-fit tests for the null hypothesis that the values of the analysis variable are a random sample from the specified theoretical distribution. See Example 3.22.
When you specify the NORMAL option, these tests, which are summarized in the output table labeled Tests for Normality, include the following:
Shapiro-Wilk test
Kolmogorov-Smirnov test
Anderson-Darling test
Cramér-von Mises test
The Kolmogorov-Smirnov D statistic, the Anderson-Darling statistic, and the Cramér-von Mises statistic are based on the empirical distribution function (EDF). However, some EDF tests are not supported when certain combinations of the parameters of a specified distribution are estimated. See Table 3.62 on page 296 for a list of the EDF tests available. You determine whether to reject the null hypothesis by examining the p-value that is associated with a goodness-of-fit statistic. When the p-value is less than the predetermined critical value (α), you reject the null hypothesis and conclude that the data did not come from the specified distribution.
Distribution | Threshold | Scale | Shape | Tests Available
---|---|---|---|---
Beta | θ known | σ known | α, β known | all
 | θ known | σ known | α, β < 5 unknown | all
Exponential | θ known | σ known | | all
 | θ known | σ unknown | | all
 | θ unknown | σ known | | all
 | θ unknown | σ unknown | | all
Gamma | θ known | σ known | α known | all
 | θ known | σ unknown | α known | all
 | θ known | σ known | α unknown | all
 | θ known | σ unknown | α unknown | all
 | θ unknown | σ known | α > 1 known | all
 | θ unknown | σ unknown | α > 1 known | all
 | θ unknown | σ known | α > 1 unknown | all
 | θ unknown | σ unknown | α > 1 unknown | all
Lognormal | θ known | ζ known | σ known | all
 | θ known | ζ known | σ unknown | A² and W²
 | θ known | ζ unknown | σ known | A² and W²
 | θ known | ζ unknown | σ unknown | all
 | θ unknown | ζ known | σ < 3 known | all
 | θ unknown | ζ known | σ < 3 unknown | all
 | θ unknown | ζ unknown | σ < 3 known | all
 | θ unknown | ζ unknown | σ < 3 unknown | all
Normal | μ known | σ known | | all
 | μ known | σ unknown | | A² and W²
 | μ unknown | σ known | | A² and W²
 | μ unknown | σ unknown | | all
Weibull | θ known | σ known | c known | all
 | θ known | σ unknown | c known | A² and W²
 | θ known | σ known | c unknown | A² and W²
 | θ known | σ unknown | c unknown | A² and W²
 | θ unknown | σ known | c > 2 known | all
 | θ unknown | σ unknown | c > 2 known | all
 | θ unknown | σ known | c > 2 unknown | all
 | θ unknown | σ unknown | c > 2 unknown | all
If you want to test the normality assumptions for analysis of variance methods, beware of using a statistical test for normality alone. A test's ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Because small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality. The skewness and kurtosis measures and the plots that are provided by the PLOTS option, the HISTOGRAM statement, the PROBPLOT statement, and the QQPLOT statement can be very helpful. For small sample sizes, power is low for detecting larger departures from normality that may be important. To increase the test's ability to detect such deviations, you may want to declare significance at higher levels, such as 0.15 or 0.20, rather than the often-used 0.05 level. Again, consulting plots and additional statistics will help you assess the severity of the deviations from normality.
If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic, W (also denoted as W_n to emphasize its dependence on the sample size n). The W statistic is the ratio of the best estimator of the variance (based on the square of a linear combination of the order statistics) to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965). When n is greater than three, the coefficients to compute the linear combination of the order statistics are approximated by the method of Royston (1992). The statistic W is always greater than zero and less than or equal to one (0 < W ≤ 1).
Small values of W lead to the rejection of the null hypothesis of normality. The distribution of W is highly skewed. Seemingly large values of W (such as 0.90) may be considered small and lead you to reject the null hypothesis. The method for computing the p-value (the probability of obtaining a W statistic less than or equal to the observed value) depends on n. For n = 3, the probability distribution of W is known and is used to determine the p-value. For n > 4, a normalizing transformation is computed:

   Z_n = (−log(γ − log(1 − W_n)) − μ) / σ   if 4 ≤ n ≤ 11
   Z_n = (log(1 − W_n) − μ) / σ             if 12 ≤ n ≤ 2000

The values of σ, γ, and μ are functions of n obtained from simulation results. Large values of Z_n indicate departure from normality, and since the statistic Z_n has an approximately standard normal distribution, this distribution is used to determine the p-values for n > 4.
When you fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). The EDF tests offer advantages over the traditional chi-square goodness-of-fit test, including improved power and invariance with respect to the histogram midpoints. For a thorough discussion, refer to D'Agostino and Stephens (1986).
The empirical distribution function is defined for a set of n independent observations X_1, . . . , X_n with a common distribution function F(x). Denote the observations ordered from smallest to largest as X_(1), . . . , X_(n). The empirical distribution function, F_n(x), is defined as

   F_n(x) = 0     for x < X_(1)
   F_n(x) = i/n   for X_(i) ≤ x < X_(i+1), i = 1, . . . , n − 1
   F_n(x) = 1     for x ≥ X_(n)

Note that F_n(x) is a step function that takes a step of height 1/n at each observation. This function estimates the distribution function F(x). At any value x, F_n(x) is the proportion of observations less than or equal to x, while F(x) is the probability of an observation less than or equal to x. EDF statistics measure the discrepancy between F_n(x) and F(x).
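The step-function definition above is easy to compute directly. The following Python sketch (illustrative only; it is not part of PROC UNIVARIATE, which computes this internally) evaluates F_n(x) as the proportion of observations less than or equal to x:

```python
# Illustrative sketch (not SAS): the empirical distribution function
# F_n(x) = (number of observations <= x) / n, a step function that takes
# a step of height 1/n at each observation.
from bisect import bisect_right

def edf(sample):
    """Return F_n as a function of x for a list of observations."""
    xs = sorted(sample)
    n = len(xs)
    def f_n(x):
        # Proportion of observations less than or equal to x.
        return bisect_right(xs, x) / n
    return f_n

# Example: four observations, one value repeated.
f_n = edf([2.0, 5.0, 5.0, 9.0])
```

Here `f_n(5.0)` is 0.75 because three of the four observations are less than or equal to 5.0; the repeated value contributes a step of height 2/n at x = 5.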
The computational formulas for the EDF statistics make use of the probability integral transformation U = F ( X ). If F ( X ) is the distribution function of X , the random variable U is uniformly distributed between 0 and 1.
Given n observations X (1) , . . . , X ( n ) , the values U ( i ) = F ( X ( i ) ) are computed by applying the transformation, as discussed in the next three sections.
PROC UNIVARIATE provides three EDF tests:
Kolmogorov-Smirnov
Anderson-Darling
Cramér-von Mises
The following sections provide formal definitions of these EDF statistics.
The Kolmogorov-Smirnov statistic (D) is defined as

   D = sup_x |F_n(x) − F(x)|
The Kolmogorov-Smirnov statistic belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between F ( x ) and F n ( x ).
The Kolmogorov-Smirnov statistic is computed as the maximum of D⁺ and D⁻, where D⁺ is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function, and D⁻ is the largest vertical distance when the EDF is less than the distribution function.
PROC UNIVARIATE uses a modified Kolmogorov D statistic to test the data against a normal distribution with mean and variance equal to the sample mean and variance.
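As a concrete illustration of D⁺, D⁻, and D, the following Python sketch (not SAS; the function name is hypothetical, and the choice of a normal F with the sample mean and standard deviation mirrors the modified test described above) applies the probability integral transform U_(i) = F(X_(i)) and takes the maximum vertical discrepancies:

```python
# Illustrative Python sketch (not SAS): the Kolmogorov-Smirnov D statistic
# computed from U_(i) = F(X_(i)), testing against a normal distribution
# with mean and standard deviation estimated from the sample.
from statistics import NormalDist, mean, stdev

def ks_statistic(sample):
    xs = sorted(sample)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    u = [dist.cdf(x) for x in xs]                        # U_(i) in (0, 1)
    d_plus = max((i + 1) / n - u[i] for i in range(n))   # EDF above F
    d_minus = max(u[i] - i / n for i in range(n))        # EDF below F
    return max(d_plus, d_minus)

d = ks_statistic([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
```

For this evenly spaced sample, which a normal distribution fits reasonably well, D is small (well under typical critical values).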
The Anderson-Darling statistic and the Cramér-von Mises statistic belong to the quadratic class of EDF statistics. This class of statistics is based on the squared difference (F_n(x) − F(x))². Quadratic statistics have the following general form:

   Q = n ∫ (F_n(x) − F(x))² ψ(x) dF(x)

The function ψ(x) weights the squared difference (F_n(x) − F(x))².
The Anderson-Darling statistic (A²) is defined as

   A² = n ∫ (F_n(x) − F(x))² [F(x)(1 − F(x))]⁻¹ dF(x)

Here the weight function is ψ(x) = [F(x)(1 − F(x))]⁻¹.

The Anderson-Darling statistic is computed as

   A² = −n − (1/n) Σ_{i=1}^{n} (2i − 1) [log U_(i) + log(1 − U_(n+1−i))]
The Cramér-von Mises statistic (W²) is defined as

   W² = n ∫ (F_n(x) − F(x))² dF(x)

Here the weight function is ψ(x) = 1.

The Cramér-von Mises statistic is computed as

   W² = Σ_{i=1}^{n} (U_(i) − (2i − 1)/(2n))² + 1/(12n)
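The two computational formulas for the quadratic statistics can be illustrated in a short Python sketch (not SAS; the function name is hypothetical, and a normal F with sample estimates is only one possible choice of fitted distribution):

```python
# Illustrative Python sketch (not SAS): computational forms of the
# Anderson-Darling (A^2) and Cramér-von Mises (W^2) statistics from
# the transformed values U_(i) = F(X_(i)).
from math import log
from statistics import NormalDist, mean, stdev

def quadratic_edf_stats(sample):
    xs = sorted(sample)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    u = [dist.cdf(x) for x in xs]
    # A^2 = -n - (1/n) * sum (2i-1) [log U_(i) + log(1 - U_(n+1-i))]
    a_sq = -n - sum((2 * i + 1) * (log(u[i]) + log(1 - u[n - 1 - i]))
                    for i in range(n)) / n
    # W^2 = sum (U_(i) - (2i-1)/(2n))^2 + 1/(12n)
    w_sq = sum((u[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n)) + 1 / (12 * n)
    return a_sq, w_sq

a_sq, w_sq = quadratic_edf_stats([float(v) for v in range(1, 11)])
```

Both statistics are small for this well-fitting sample; large values would indicate a discrepancy between F_n(x) and F(x).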
Once the EDF test statistics are computed, PROC UNIVARIATE computes the associated probability values (p-values). The UNIVARIATE procedure uses internal tables of probability levels similar to those given by D'Agostino and Stephens (1986). If the value of the test statistic falls between two probability levels, then linear interpolation is used to estimate the probability value.
The probability value depends upon the parameters that are known and the parameters that are estimated for the distribution. Table 3.62 summarizes the parameter combinations for which EDF tests are available.
You can use the KERNEL option to superimpose kernel density estimates on histograms. Smoothing the data distribution with a kernel density estimate can be more effective than using a histogram to identify features that might be obscured by the choice of histogram bins or sampling variation. A kernel density estimate can also be more effective than a parametric curve fit when the process distribution is multi-modal. See Example 3.23.
The general form of the kernel density estimator is

   f̂_λ(x) = (1/(nλ)) Σ_{i=1}^{n} K((x − x_i)/λ)

where K(·) is the kernel function, λ is the bandwidth, n is the sample size, and x_i is the ith observation.

The KERNEL option provides three kernel functions (K): normal, quadratic, and triangular. You can specify the function with the K= kernel-option in parentheses after the KERNEL option. Values for the K= option are NORMAL, QUADRATIC, and TRIANGULAR (with aliases of N, Q, and T, respectively). By default, a normal kernel is used. The formulas for the kernel functions are

   normal:      K(t) = (1/√(2π)) exp(−t²/2)   for −∞ < t < ∞
   quadratic:   K(t) = (3/4)(1 − t²)          for |t| ≤ 1
   triangular:  K(t) = 1 − |t|                for |t| ≤ 1
The value of λ, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated density function. You specify λ indirectly by specifying a standardized bandwidth c with the C= kernel-option. If Q is the interquartile range and n is the sample size, then c is related to λ by the formula

   λ = c Q n^(−1/5)
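As an illustration of the estimator and the standardized-bandwidth relation λ = cQn^(−1/5), the following Python sketch (not SAS; the default value of c here is an illustrative constant, not the C=MISE default) evaluates a normal-kernel density estimate:

```python
# Illustrative Python sketch (not SAS): a kernel density estimate with the
# normal kernel, with bandwidth lambda = c * Q * n**(-1/5) where Q is the
# interquartile range. The default c below is illustrative only.
from math import exp, pi, sqrt
from statistics import quantiles

def kde(sample, c=0.7979):
    n = len(sample)
    q1, _, q3 = quantiles(sample, n=4)       # interquartile range Q = q3 - q1
    lam = c * (q3 - q1) * n ** (-1 / 5)
    def k(t):                                # normal kernel
        return exp(-t * t / 2) / sqrt(2 * pi)
    def f_hat(x):
        return sum(k((x - xi) / lam) for xi in sample) / (n * lam)
    return f_hat

f_hat = kde([float(v) for v in range(1, 21)])
```

A larger c (hence larger λ) gives a smoother estimate; the estimate integrates to 1 regardless of the bandwidth.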
For a specific kernel function, the discrepancy between the density estimator f̂_λ(x) and the true density f(x) is measured by the mean integrated square error (MISE):

   MISE(λ) = E ∫ (f̂_λ(x) − f(x))² dx
The MISE is the sum of the integrated squared bias and the variance. An approximate mean integrated square error (AMISE) is

   AMISE(λ) = (1/(nλ)) ∫ K(t)² dt + (λ⁴/4) (∫ t² K(t) dt)² ∫ (f″(x))² dx
A bandwidth that minimizes AMISE can be derived by treating f ( x ) as the normal density having parameters µ and ƒ estimated by the sample mean and standard deviation. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. For each estimate, the bandwidth parameter c , the kernel function type, and the value of AMISE are reported in the SAS log.
The general kernel density estimates assume that the domain of the density to estimate can take on all values on a real line. However, sometimes the domain of a density is an interval bounded on one or both sides. For example, if a variable Y is a measurement of only positive values, then the kernel density curve should be bounded so that f̂_λ(y) is zero for negative Y values. You can use the LOWER= and UPPER= kernel-options to specify the bounds.
The UNIVARIATE procedure uses a reflection technique to create the bounded kernel density curve, as described in Silverman (1986, pp. 30-31). It adds the reflections of the kernel density that are outside the boundary to the bounded kernel estimates. The general form of the bounded kernel density estimator is computed by replacing K((x − x_i)/λ) in the original equation with

   K((x − x_i)/λ) + K((x + x_i − 2x_l)/λ) + K((x + x_i − 2x_u)/λ)

where x_l is the lower bound and x_u is the upper bound.

Without a lower bound, x_l = −∞ and K((x + x_i − 2x_l)/λ) equals zero. Similarly, without an upper bound, x_u = ∞ and K((x + x_i − 2x_u)/λ) equals zero.
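The reflection technique can be sketched in Python for a density bounded below only (illustrative only; the function and its arguments are hypothetical and not part of PROC UNIVARIATE — kernel mass that would fall below the bound is reflected back into the support):

```python
# Illustrative Python sketch (not SAS) of the reflection technique for a
# density bounded below at x_l: each kernel term K((x - x_i)/lam) gains a
# reflected term K((x + x_i - 2*x_l)/lam) about the lower bound.
from math import exp, pi, sqrt

def bounded_kde(sample, lam, x_l=0.0):
    n = len(sample)
    def k(t):                 # normal kernel
        return exp(-t * t / 2) / sqrt(2 * pi)
    def f_hat(x):
        if x < x_l:
            return 0.0        # density is zero outside the support
        s = sum(k((x - xi) / lam) + k((x + xi - 2 * x_l) / lam)
                for xi in sample)
        return s / (n * lam)
    return f_hat

f_b = bounded_kde([0.2, 0.5, 1.0, 1.5], 0.3)
```

Because the reflected mass is returned to the support, the bounded estimate still integrates to 1 over [x_l, ∞).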
When C=MISE is used with a bounded kernel density, the UNIVARIATE procedure uses a bandwidth that minimizes the AMISE for its corresponding unbounded kernel.
Figure 3.10 illustrates how a Q-Q plot is constructed for a specified theoretical distribution. First, the n nonmissing values of the variable are ordered from smallest to largest:
Then the ith ordered value x_(i) is plotted as a point whose y-coordinate is x_(i) and whose x-coordinate is F^(−1)((i − 0.375)/(n + 0.25)), where F(·) is the specified distribution with zero location parameter and unit scale parameter.
You can modify the adjustment constants −0.375 and 0.25 with the RANKADJ= and NADJ= options. This default combination is recommended by Blom (1958). For additional information, refer to Chambers et al. (1983). Since x_(i) is a quantile of the empirical cumulative distribution function (ecdf), a Q-Q plot compares quantiles of the ecdf with quantiles of a theoretical distribution. Probability plots (see the section PROBPLOT Statement on page 241) are constructed the same way, except that the x-axis is scaled nonlinearly in percentiles.
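The construction above can be sketched in Python (illustrative only; the function name is hypothetical) using the default Blom positions (i − 0.375)/(n + 0.25) and a standard normal F:

```python
# Illustrative Python sketch (not SAS): Q-Q plot coordinates for a normal
# distribution using the default Blom plotting positions.
from statistics import NormalDist

def normal_qq_points(sample):
    xs = sorted(sample)                  # x_(1) <= ... <= x_(n)
    n = len(xs)
    std = NormalDist()                   # zero location, unit scale
    return [(std.inv_cdf((i - 0.375) / (n + 0.25)), x)
            for i, x in enumerate(xs, start=1)]

pts = normal_qq_points([3.0, 1.0, 2.0])
```

Each pair is (theoretical quantile, ordered data value); for normal data the points scatter around a line whose intercept and slope estimate the location and scale.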
The following properties of Q-Q plots and probability plots make them useful diagnostics of how well a specified theoretical distribution fits a set of measurements:
If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x .
If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b . The slope a and intercept b are visual estimates of the scale and location parameters of the theoretical distribution.
Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters since the x -axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities.
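As an illustration of reading location and scale off a linear point pattern, the following Python sketch (not SAS; the function name is hypothetical) fits a least-squares line to normal Q-Q points, so the intercept estimates the location and the slope estimates the scale:

```python
# Illustrative Python sketch (not SAS): estimating location and scale from
# the least-squares line through normal Q-Q points (Blom plotting positions).
from statistics import NormalDist

def estimate_location_scale(sample):
    xs = sorted(sample)
    n = len(xs)
    std = NormalDist()   # zero location, unit scale
    q = [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    qbar = sum(q) / n
    xbar = sum(xs) / n
    slope = (sum((a - qbar) * (b - xbar) for a, b in zip(q, xs))
             / sum((a - qbar) ** 2 for a in q))
    intercept = xbar - slope * qbar
    return intercept, slope   # visual estimates of (location, scale)
```

Data that lie exactly on the line y = μ + σq recover (μ, σ) exactly; real data give approximate visual estimates.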
There are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity, which are summarized in Table 3.63.
Description of Point Pattern | Possible Interpretation |
---|---|
All but a few points fall on a line | Outliers in the data |
Left end of pattern is below the line; right end of pattern is above the line | Long tails at both ends of the data distribution |
Left end of pattern is above the line; right end of pattern is below the line | Short tails at both ends of the data distribution |
Curved pattern with slope increasing from left to right | Data distribution is skewed to the right |
Curved pattern with slope decreasing from left to right | Data distribution is skewed to the left |
Staircase pattern (plateaus and gaps) | Data have been rounded or are discrete |
In some applications, a nonlinear pattern may be more revealing than a linear pattern. However, Chambers et al. (1983) note that departures from linearity can also be due to chance variation.
When the pattern is linear, you can use Q-Q plots to estimate shape, location, and scale parameters and to estimate percentiles. See Example 3.26 through Example 3.34.
You can use the PROBPLOT and QQPLOT statements to request probability and Q-Q plots that are based on the theoretical distributions summarized in Table 3.64.
Distribution | Density Function p(x) | Range | Location | Scale | Shape
---|---|---|---|---|---
Beta | (x − θ)^(α−1) (θ + σ − x)^(β−1) / (B(α,β) σ^(α+β−1)) | θ < x < θ + σ | θ | σ | α, β
Exponential | (1/σ) exp(−(x − θ)/σ) | x ≥ θ | θ | σ |
Gamma | (1/(Γ(α)σ)) ((x − θ)/σ)^(α−1) exp(−(x − θ)/σ) | x > θ | θ | σ | α
Lognormal (3-parameter) | exp(−(log(x − θ) − ζ)²/(2σ²)) / (σ√(2π)(x − θ)) | x > θ | θ | ζ | σ
Normal | (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)) | all x | μ | σ |
Weibull (3-parameter) | (c/σ) ((x − θ)/σ)^(c−1) exp(−((x − θ)/σ)^c) | x > θ | θ | σ | c
Weibull (2-parameter) | (c/σ) ((x − θ0)/σ)^(c−1) exp(−((x − θ0)/σ)^c) | x > θ0 | θ0 (known) | σ | c
You can request these distributions with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, WEIBULL, and WEIBULL2 options, respectively. If you do not specify a distribution option, a normal probability plot or a normal Q-Q plot is created.
The following sections provide details for constructing Q-Q plots that are based on these distributions. Probability plots are constructed similarly except that the horizontal axis is scaled in percentile units.
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile B^(−1)_{α,β}((i − 0.375)/(n + 0.25)), where B^(−1)_{α,β}(·) is the inverse normalized incomplete beta function, n is the number of nonmissing observations, and α and β are the shape parameters of the beta distribution. In a probability plot, the horizontal axis is scaled in percentile units.
The pattern on the plot for ALPHA=α and BETA=β tends to be linear with intercept θ and slope σ if the data are beta distributed with the specific density function

   p(x) = (x − θ)^(α−1) (θ + σ − x)^(β−1) / (B(α,β) σ^(α+β−1))   for θ < x < θ + σ

where B(α,β) = Γ(α)Γ(β)/Γ(α+β) and
θ = lower threshold parameter
σ = scale parameter (σ > 0)
α = first shape parameter (α > 0)
β = second shape parameter (β > 0)
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile −log(1 − (i − 0.375)/(n + 0.25)), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.
The pattern on the plot tends to be linear with intercept θ and slope σ if the data are exponentially distributed with the specific density function

   p(x) = (1/σ) exp(−(x − θ)/σ)   for x ≥ θ

where θ is a threshold parameter and σ is a positive scale parameter.
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile G^(−1)_α((i − 0.375)/(n + 0.25)), where G^(−1)_α(·) is the inverse normalized incomplete gamma function, n is the number of nonmissing observations, and α is the shape parameter of the gamma distribution. In a probability plot, the horizontal axis is scaled in percentile units.
The pattern on the plot for ALPHA=α tends to be linear with intercept θ and slope σ if the data are gamma distributed with the specific density function

   p(x) = (1/(Γ(α)σ)) ((x − θ)/σ)^(α−1) exp(−(x − θ)/σ)   for x > θ
where
θ = threshold parameter
σ = scale parameter (σ > 0)
α = shape parameter (α > 0)
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile exp(σ Φ^(−1)((i − 0.375)/(n + 0.25))), where Φ^(−1)(·) is the inverse cumulative standard normal distribution, n is the number of nonmissing observations, and σ is the shape parameter of the lognormal distribution. In a probability plot, the horizontal axis is scaled in percentile units.
The pattern on the plot for SIGMA=σ tends to be linear with intercept θ and slope exp(ζ) if the data are lognormally distributed with the specific density function

   p(x) = exp(−(log(x − θ) − ζ)²/(2σ²)) / (σ√(2π)(x − θ))   for x > θ
where
θ = threshold parameter
ζ = scale parameter (−∞ < ζ < ∞)
σ = shape parameter (σ > 0)
See Example 3.26 and Example 3.33.
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile Φ^(−1)((i − 0.375)/(n + 0.25)), where Φ^(−1)(·) is the inverse cumulative standard normal distribution, and n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.
The point pattern on the plot tends to be linear with intercept μ and slope σ if the data are normally distributed with the specific density function

   p(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))   for all x

where μ is the mean, and σ is the standard deviation (σ > 0).
To create the plot, the observations are ordered from smallest to largest, and the ith ordered observation is plotted against the quantile (−log(1 − (i − 0.375)/(n + 0.25)))^(1/c), where n is the number of nonmissing observations, and c is the Weibull distribution shape parameter. In a probability plot, the horizontal axis is scaled in percentile units.
The pattern on the plot for C=c tends to be linear with intercept θ and slope σ if the data are Weibull distributed with the specific density function

   p(x) = (c/σ) ((x − θ)/σ)^(c−1) exp(−((x − θ)/σ)^c)   for x > θ
where
θ = threshold parameter
ƒ = scale parameter ( ƒ > 0)
c = shape parameter ( c > 0)
See Example 3.34.
To create the plot, the observations are ordered from smallest to largest, and the log of the shifted ith ordered observation x_(i), denoted by log(x_(i) − θ0), is plotted against the quantile log(−log(1 − (i − 0.375)/(n + 0.25))), where n is the number of nonmissing observations. In a probability plot, the horizontal axis is scaled in percentile units.
Unlike the three-parameter Weibull quantile, the preceding expression is free of distribution parameters. Consequently, the C= shape parameter is not mandatory with the WEIBULL2 distribution option.
The pattern on the plot for THETA=θ0 tends to be linear with intercept log(σ) and slope 1/c if the data are Weibull distributed with the specific density function

   p(x) = (c/σ) ((x − θ0)/σ)^(c−1) exp(−((x − θ0)/σ)^c)   for x > θ0
where
θ0 = known lower threshold
ƒ = scale parameter ( ƒ > 0)
c = shape parameter ( c > 0)
See Example 3.34.
Some of the distribution options in the PROBPLOT or QQPLOT statements require you to specify one or two shape parameters in parentheses after the distribution keyword. These are summarized in Table 3.65.
Distribution Keyword | Mandatory Shape Parameter Option | Range
---|---|---
BETA | ALPHA=α, BETA=β | α > 0, β > 0
EXPONENTIAL | None |
GAMMA | ALPHA=α | α > 0
LOGNORMAL | SIGMA=σ | σ > 0
NORMAL | None |
WEIBULL | C=c | c > 0
WEIBULL2 | None |
You can visually estimate the value of a shape parameter by specifying a list of values for the shape parameter option. A separate plot is produced for each value, and you can then select the value of the shape parameter that produces the most nearly linear point pattern. Alternatively, you can request that the plot be created using an estimated shape parameter. See the entries for the distribution options in the section Dictionary of Options on page 245 (for the PROBPLOT statement) and in the section Dictionary of Options on page 258 (for the QQPLOT statement).
Note: For Q-Q plots created with the WEIBULL2 option, you can estimate the shape parameter c from a linear pattern by using the fact that the slope of the pattern is 1/c.
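This note can be illustrated with a Python sketch (not SAS; the function name is hypothetical): fit a line to the two-parameter Weibull Q-Q coordinates described above and take the reciprocal of the slope as the estimate of c.

```python
# Illustrative Python sketch (not SAS): for a two-parameter Weibull Q-Q plot,
# log(x_(i) - theta0) versus log(-log(1 - (i - 0.375)/(n + 0.25))) is linear
# with slope 1/c, so c is estimated as the reciprocal of the fitted slope.
from math import log

def weibull2_shape_estimate(sample, theta0=0.0):
    xs = sorted(sample)
    n = len(xs)
    q = [log(-log(1 - (i - 0.375) / (n + 0.25))) for i in range(1, n + 1)]
    y = [log(x - theta0) for x in xs]
    qbar, ybar = sum(q) / n, sum(y) / n
    slope = (sum((a - qbar) * (b - ybar) for a, b in zip(q, y))
             / sum((a - qbar) ** 2 for a in q))
    return 1.0 / slope   # estimated shape parameter c
```

Data generated from exact Weibull quantiles recover the shape parameter exactly; for real data the estimate is a visual approximation, not a maximum likelihood estimate.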
If you specify location and scale parameters for a distribution in a PROBPLOT or QQPLOT statement (or if you request estimates for these parameters), a diagonal distribution reference line is displayed on the plot. (An exception is the two-parameter Weibull distribution, for which a line is displayed when you specify or estimate the scale and shape parameters.) Agreement between this line and the point pattern indicates that the distribution with these parameters is a good fit.
When the point pattern on a Q-Q plot is linear, its intercept and slope provide estimates of the location and scale parameters. (An exception to this rule is the two-parameter Weibull distribution, for which the intercept and slope are related to the scale and shape parameters.)
Table 3.66 shows how the specified parameters determine the intercept and slope of the line. The intercept and slope are based on the quantile scale for the horizontal axis, which is used in Q-Q plots.
Distribution | Location | Scale | Shape | Intercept | Slope
---|---|---|---|---|---
Beta | θ | σ | α, β | θ | σ
Exponential | θ | σ | | θ | σ
Gamma | θ | σ | α | θ | σ
Lognormal | θ | ζ | σ | θ | exp(ζ)
Normal | μ | σ | | μ | σ
Weibull (3-parameter) | θ | σ | c | θ | σ
Weibull (2-parameter) | θ0 (known) | σ | c | log(σ) | 1/c
For instance, specifying MU=3 and SIGMA=2 with the NORMAL option requests a line with intercept 3 and slope 2. Specifying SIGMA=1 and C=2 with the WEIBULL2 option requests a line with intercept log(1) = 0 and slope 1/2. On a probability plot with the LOGNORMAL and WEIBULL2 options, you can specify the slope directly with the SLOPE= option. That is, for the LOGNORMAL option, specifying THETA=θ and SLOPE=exp(ζ) displays the same line as specifying THETA=θ and ZETA=ζ. For the WEIBULL2 option, specifying SIGMA=σ and SLOPE=1/c displays the same line as specifying SIGMA=σ and C=c.
There are two ways to estimate percentiles from a Q-Q plot:
Specify the PCTLAXIS option, which adds a percentile axis opposite the theoretical quantile axis. The scale for the percentile axis ranges between 0 and 100 with tick marks at percentile values such as 1, 5, 10, 25, 50, 75, 90, 95, and 99.
Specify the PCTLSCALE option, which relabels the horizontal axis tick marks with their percentile equivalents but does not alter their spacing. For example, on a normal Q-Q plot, the tick mark labeled 0 is relabeled as 50 since the 50th percentile corresponds to the zero quantile.
You can also estimate percentiles using probability plots created with the PROBPLOT statement. See Example 3.32.
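To illustrate the PCTLSCALE relabeling described above, the following Python sketch (not SAS code; the helper name is hypothetical) converts a standard normal quantile to its percentile equivalent using the normal CDF:

```python
import math

def normal_quantile_to_percentile(q):
    """Percentile equivalent of a standard normal quantile,
    as used when relabeling Q-Q plot axis tick marks."""
    return 100.0 * 0.5 * (1.0 + math.erf(q / math.sqrt(2.0)))

# The tick mark at quantile 0 is relabeled 50, since the
# 50th percentile corresponds to the zero quantile.
print(round(normal_quantile_to_percentile(0.0)))   # 50
```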
The DATA= data set provides the set of variables that are analyzed. The UNIVARIATE procedure must have a DATA= data set. If you do not specify one with the DATA= option in the PROC UNIVARIATE statement, the procedure uses the last data set created.
You can add features to plots by specifying ANNOTATE= data sets either in the PROC UNIVARIATE statement or in individual plot statements.
Information contained in an ANNOTATE= data set specified in the PROC UNIVARIATE statement is used for all plots produced in a given PROC step; this is a global ANNOTATE= data set. By using this global data set, you can keep information common to all high-resolution plots in one data set.
Information contained in the ANNOTATE= data set specified in a plot statement is used only for plots produced by that statement; this is a local ANNOTATE= data set. By using this data set, you can add statement-specific features to plots. For example, you can add different features to plots produced by the HISTOGRAM and QQPLOT statements by specifying an ANNOTATE= data set in each plot statement.
You can specify an ANNOTATE= data set in the PROC UNIVARIATE statement and in plot statements. This enables you to add some features to all plots and also add statement-specific features to plots. See Example 3.25.
PROC UNIVARIATE creates an OUT= data set for each OUTPUT statement. This data set contains an observation for each combination of levels of the variables in the BY statement, or a single observation if you do not specify a BY statement. Thus the number of observations in the new data set corresponds to the number of groups for which statistics are calculated. Without a BY statement, the procedure computes statistics and percentiles by using all the observations in the input data set. With a BY statement, the procedure computes statistics and percentiles by using the observations within each BY group.
The variables in the OUT= data set are as follows:
BY statement variables. The values of these variables match the values in the corresponding BY group in the DATA= data set and indicate which BY group each observation summarizes.
variables created by selecting statistics in the OUTPUT statement. The statistics are computed using all the nonmissing data, or they are computed for each BY group if you use a BY statement.
variables created by requesting new percentiles with the PCTLPTS= option. The names of these new variables depend on the values of the PCTLPRE= and PCTLNAME= options.
If the output data set contains a percentile variable or a quartile variable, the percentile definition assigned with the PCTLDEF= option in the PROC UNIVARIATE statement is recorded in the output data set label. See Example 3.8.
The following table lists variables available in the OUT= data set.
Variable Name | Description |
---|---|
Descriptive Statistics | |
CSS | Sum of squares corrected for the mean |
CV | Percent coefficient of variation |
KURTOSIS | Measurement of the heaviness of tails |
MAX | Largest (maximum) value |
MEAN | Arithmetic mean |
MIN | Smallest (minimum) value |
MODE | Most frequent value (if not unique, the smallest mode) |
N | Number of observations on which calculations are based |
NMISS | Number of missing observations |
NOBS | Total number of observations |
RANGE | Difference between the maximum and minimum values |
SKEWNESS | Measurement of the tendency of the deviations to be larger in one direction than in the other |
STD | Standard deviation |
STDMEAN | Standard error of the mean |
SUM | Sum |
SUMWGT | Sum of the weights |
USS | Uncorrected sum of squares |
VAR | Variance |
Quantile Statistics | |
MEDIAN (P50) | Middle value (50th percentile) |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
Q1 (P25) | Lower quartile (25th percentile) |
Q3 (P75) | Upper quartile (75th percentile) |
QRANGE | Difference between the upper and lower quartiles (also known as the interquartile range) |
Robust Statistics | |
GINI | Gini's mean difference |
MAD | Median absolute difference |
QN | 2nd variation of median absolute difference |
SN | 1st variation of median absolute difference |
STD_GINI | Standard deviation for Gini's mean difference |
STD_MAD | Standard deviation for median absolute difference |
STD_QN | Standard deviation for the second variation of the median absolute difference |
STD_QRANGE | Estimate of the standard deviation, based on the interquartile range |
STD_SN | Standard deviation for the first variation of the median absolute difference |
Hypothesis Test Statistics | |
MSIGN | Sign statistic |
NORMAL | Test statistic for normality. If the sample size is less than or equal to 2000, this is the Shapiro-Wilk W statistic. Otherwise, it is the Kolmogorov D statistic. |
PROBM | Probability of a greater absolute value for the sign statistic |
PROBN | Probability that the data came from a normal distribution |
PROBS | Probability of a greater absolute value for the signed rank statistic |
PROBT | Two-tailed p-value for Student's t statistic with n − 1 degrees of freedom |
SIGNRANK | Signed rank statistic |
T | Student's t statistic to test the null hypothesis that the population mean is equal to µ0 |
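As a rough illustration of two of the robust-scale quantities listed in the table, the following Python sketch (helper names hypothetical, not SAS code) computes the unscaled Gini's mean difference and the median absolute deviation; note that PROC UNIVARIATE additionally multiplies such statistics by constants to produce the STD_GINI and STD_MAD estimates of σ:

```python
import itertools
import statistics

def gini_mean_difference(x):
    """Gini's mean difference: the average absolute difference
    over all pairs of observations."""
    pairs = list(itertools.combinations(x, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def mad(x):
    """Median absolute deviation of the values from their median."""
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

# For the values 1, 2, 3: pairwise gaps are 1, 2, 1, so the
# Gini mean difference is 4/3; the deviations from the median
# are 1, 0, 1, so the MAD is 1.
print(gini_mean_difference([1, 2, 3]), mad([1, 2, 3]))
```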
You can create an OUTHISTOGRAM= data set with the HISTOGRAM statement. This data set contains information about histogram intervals. Since you can specify multiple HISTOGRAM statements with the UNIVARIATE procedure, you can create multiple OUTHISTOGRAM= data sets.
An OUTHISTOGRAM= data set contains a group of observations for each variable in the HISTOGRAM statement. The group contains an observation for each interval of the histogram, beginning with the leftmost interval that contains a value of the variable and ending with the rightmost interval that contains a value of the variable. These intervals will not necessarily coincide with the intervals displayed in the histogram since the histogram may be padded with empty intervals at either end. If you superimpose one or more fitted curves on the histogram, the OUTHISTOGRAM= data set contains multiple groups of observations for each variable (one group for each curve). If you use a BY statement, the OUTHISTOGRAM= data set contains groups of observations for each BY group. ID variables are not saved in an OUTHISTOGRAM= data set.
By default, an OUTHISTOGRAM= data set contains the _MIDPT_ variable, whose values identify histogram intervals by their midpoints. When the ENDPOINTS= or NENDPOINTS option is specified, intervals are identified by endpoint values instead. If the RTINCLUDE option is specified, the _MAXPT_ variable contains upper endpoint values. Otherwise, the _MINPT_ variable contains lower endpoint values. See Example 3.18.
Variable | Description |
---|---|
_CURVE_ | Name of fitted distribution (if requested in HISTOGRAM statement) |
_EXPPCT_ | Estimated percent of population in histogram interval determined from optional fitted distribution |
_MAXPT_ | Upper endpoint of histogram interval |
_MIDPT_ | Midpoint of histogram interval |
_MINPT_ | Lower endpoint of histogram interval |
_OBSPCT_ | Percent of variable values in histogram interval |
_VAR_ | Variable name |
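The following Python sketch (not SAS; `histogram_summary` is a hypothetical helper) mimics the _MIDPT_ and _OBSPCT_ columns for equal-width intervals. It is only a simplified model of the layout: SAS's actual binning and endpoint rules (for example, with RTINCLUDE) are more involved.

```python
def histogram_summary(values, width, anchor=0.0):
    """Bin values into intervals of the given width and return
    (_MIDPT_, _OBSPCT_) pairs, one per nonempty interval, in order.
    Only intervals that contain data are reported, matching the
    OUTHISTOGRAM= behavior described in the text."""
    counts = {}
    for v in values:
        # index of the interval [anchor + i*width, anchor + (i+1)*width)
        i = int((v - anchor) // width)
        counts[i] = counts.get(i, 0) + 1
    n = len(values)
    return [(anchor + (i + 0.5) * width, 100.0 * c / n)
            for i, c in sorted(counts.items())]

# Three values, unit-width bins: two fall in [0, 1), one in [1, 2).
print(histogram_summary([0.1, 0.2, 1.5], 1.0))
```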
By default, PROC UNIVARIATE produces ODS tables of moments, basic statistical measures, tests for location, quantiles, and extreme observations. You must specify options in the PROC UNIVARIATE statement to request other statistics and tables. The CIBASIC option produces a table that displays confidence limits for the mean, standard deviation, and variance. The CIPCTLDF and CIPCTLNORMAL options request tables of confidence limits for the quantiles. The LOCCOUNT option requests a table that shows the number of values greater than, not equal to, and less than the value of MU0=. The FREQ option requests a table of frequency counts. The NEXTRVAL= option requests a table of extreme values. The NORMAL option requests a table with tests for normality.
The TRIMMED=, WINSORIZED=, and ROBUSTSCALE options request tables with robust estimators. The table of trimmed or Winsorized means includes the percentage and the number of observations that are trimmed or Winsorized at each end, the mean and standard error, confidence limits, and the Student's t test. The table with robust measures of scale includes the interquartile range, Gini's mean difference G, MAD, Qn, and Sn, with their corresponding estimates of σ.
See the section ODS Table Names on page 309 for the names of ODS tables created by PROC UNIVARIATE.
PROC UNIVARIATE assigns a name to each table that it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets.
ODS Table Name | Description | Option |
---|---|---|
BasicIntervals | Confidence intervals for mean, standard deviation, variance | CIBASIC |
BasicMeasures | Measures of location and variability | Default |
ExtremeObs | Extreme observations | Default |
ExtremeValues | Extreme values | NEXTRVAL= |
Frequencies | Frequencies | FREQ |
LocationCounts | Counts used for sign test and signed rank test | LOCCOUNT |
MissingValues | Missing values | Default, if missing values exist |
Modes | Modes | MODES |
Moments | Sample moments | Default |
Plots | Line printer plots | PLOTS |
Quantiles | Quantiles | Default |
RobustScale | Robust measures of scale | ROBUSTSCALE |
SSPlots | Line printer side-by-side box plots | PLOTS (with BY statement) |
TestsForLocation | Tests for location | Default |
TestsForNormality | Tests for normality | NORMALTEST |
TrimmedMeans | Trimmed means | TRIMMED= |
WinsorizedMeans | Winsorized means | WINSORIZED= |
ODS Table Name | Description | Option |
---|---|---|
Bins | Histogram bins | MIDPERCENTS secondary option |
FitQuantiles | Quantiles of fitted distribution | Any distribution option |
GoodnessOfFit | Goodness-of-fit tests for fitted distribution | Any distribution option |
HistogramBins | Histogram bins | MIDPERCENTS option |
ParameterEstimates | Parameter estimates for fitted distribution | Any distribution option |
If you request a fitted parametric distribution with a HISTOGRAM statement, PROC UNIVARIATE creates a summary that is organized into the ODS tables described in this section.
The ParameterEstimates table lists the estimated (or specified) parameters for the fitted curve as well as the estimated mean and estimated standard deviation. See Formulas for Fitted Continuous Distributions on page 288.
When you fit a parametric distribution, the HISTOGRAM statement provides a series of goodness-of-fit tests based on the empirical distribution function (EDF). See EDF Goodness-of-Fit Tests on page 294. These are displayed in the GoodnessOfFit table.
The Bins table is included in the summary only if you specify the MIDPERCENTS option in parentheses after the distribution option. This table lists the midpoints for the histogram bins along with the observed and estimated percentages of the observations that lie in each bin. The estimated percentages are based on the fitted distribution.
If you specify the MIDPERCENTS option without requesting a fitted distribution, the HistogramBins table is included in the summary. This table lists the interval midpoints with the observed percent of observations that lie in the interval. See the entry for the MIDPERCENTS option on page 225.
The FitQuantiles table lists observed and estimated quantiles. You can use the PERCENTS= option to specify the list of quantiles in this table. See the entry for the PERCENTS= option on page 227. By default, the table lists the observed and estimated quantiles for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles of the fitted parametric distribution.
Because the UNIVARIATE procedure computes quantile statistics, it must store a copy of the data in memory, which increases its memory requirements. By default, the MEANS, SUMMARY, and TABULATE procedures require less memory because they do not automatically compute quantiles. These procedures also provide an option to use a new fixed-memory quantile estimation method that is usually less memory intensive.
In the UNIVARIATE procedure, the only factor that limits the number of variables that you can analyze is the computer resources that are available. The amount of temporary storage and CPU time required depends on the statements and the options that you specify. To calculate the computer resources the procedure needs, let
N | be the number of observations in the data set |
V | be the number of variables in the VAR statement |
U i | be the number of unique values for the i th variable |
Then the minimum memory requirement in bytes to process all variables is M = 24 ∑i Ui. If M bytes are not available, PROC UNIVARIATE must process the data multiple times to compute all the statistics. This reduces the minimum memory requirement to M = 24 max(Ui).
Using the ROUND= option reduces the number of unique values ( U i ), thereby reducing memory requirements. The ROBUSTSCALE option requires 40 U i bytes of temporary storage.
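The memory formulas above can be sketched as follows (Python, with a hypothetical helper name):

```python
def univariate_memory_bytes(unique_counts):
    """Minimum memory (bytes) for PROC UNIVARIATE given the number
    of unique values U_i for each analysis variable.

    one_pass:   M = 24 * sum(U_i), to process all variables at once.
    multi_pass: M = 24 * max(U_i), the fallback when one_pass bytes
                are not available and the data must be read multiple
                times."""
    one_pass = 24 * sum(unique_counts)
    multi_pass = 24 * max(unique_counts)
    return one_pass, multi_pass

# Three variables with 1000, 500, and 250 unique values:
print(univariate_memory_bytes([1000, 500, 250]))   # (42000, 24000)
```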
Several factors affect the CPU time:
The time to create V tree structures to internally store the observations is proportional to NV log( N ).
The time to compute moments and quantiles for the i th variable is proportional to U i .
The time to compute the NORMAL option test statistics is proportional to N .
The time to compute the ROBUSTSCALE option test statistics is proportional to U i log( U i ).
The time to compute the exact significance level of the signed rank statistic can increase when the number of nonzero values is less than or equal to 20.
Each of these factors has a different constant of proportionality. For additional information on optimizing CPU performance and memory usage, see the SAS documentation for your operating environment.