PROC UNIVARIATE < options > ;
BY variables ;
CLASS variable-1 < (v-options) > < variable-2 < (v-options) > >
< / KEYLEVEL= value1 | ( value1 value2 ) > ;
FREQ variable ;
HISTOGRAM < variables > < /options > ;
ID variables ;
INSET keyword-list < /options > ;
OUTPUT < OUT= SAS-data-set >
< keyword1=names ... keywordk=names > < percentile-options > ;
PROBPLOT < variables > < /options > ;
QQPLOT < variables > < /options > ;
VAR variables ;
WEIGHT variable ;
The PROC UNIVARIATE statement invokes the procedure. The VAR statement specifies the numeric variables to be analyzed, and it is required if the OUTPUT statement is used to save summary statistics in an output data set. If you do not use the VAR statement, all numeric variables in the data set are analyzed.
The plot statements HISTOGRAM, PROBPLOT, and QQPLOT create graphical displays, and the INSET statement enhances these displays by adding a table of summary statistics directly on the graph. You can specify one or more of each of the plot statements, the INSET statement, and the OUTPUT statement. If you use a VAR statement, the variables listed in a plot statement must be a subset of the variables listed in the VAR statement.
You can use a CLASS statement to specify one or two variables that group the data into classification levels. The analysis is carried out for each combination of levels, and you can use the CLASS statement with a plot statement to create a comparative display.
You can specify a BY statement to obtain separate analyses for each BY group. The FREQ statement specifies a variable whose values provide the frequency for each observation. The WEIGHT statement specifies a variable whose values are used to weight certain statistics. The ID statement specifies one or more variables to identify the extreme observations.
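As a rough illustration of how these statements fit together, the following sketch combines several of them in one step. It assumes the sample data set SASHELP.CLASS (with variables Sex, Name, Height, and Weight) is available; the output data set name ClassStats and the statistic names are hypothetical.

```sas
/* Sketch only: SASHELP.CLASS and all output names are assumptions */
proc univariate data=sashelp.class;
   class sex;                      /* one analysis per level of Sex        */
   var height weight;              /* analysis variables                   */
   id name;                        /* labels the extreme observations      */
   histogram height / normal;      /* plot variable is a subset of VAR     */
   inset n mean std / position=ne; /* summary table drawn on the histogram */
   output out=ClassStats
          mean=MeanHeight MeanWeight
          std=StdHeight StdWeight; /* one name per VAR variable, in order  */
run;
```

Because the VAR statement is present, the variable in the HISTOGRAM statement must be one of the VAR variables, and the OUTPUT statement names are matched to the VAR variables in order.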
PROC UNIVARIATE < options > ;
The PROC UNIVARIATE statement is required to invoke the UNIVARIATE procedure. You can use the PROC UNIVARIATE statement by itself to request a variety of statistics for summarizing the data distribution of each analysis variable:
sample moments
basic measures of location and variability
confidence intervals for the mean, standard deviation, and variance
tests for location
tests for normality
trimmed and Winsorized means
robust estimates of scale
quantiles and related confidence intervals
extreme observations and extreme values
frequency counts for observations
missing values
In addition, you can use options in the PROC UNIVARIATE statement to
specify the input data set to be analyzed
specify a graphics catalog for saving graphical output
specify rounding units for variable values
specify the definition used to calculate percentiles
specify the divisor used to calculate variances and standard deviations
request that plots be produced on line printers and define special printing characters used for features
suppress tables
The following are the options that can be used with the PROC UNIVARIATE statement:
ALL
requests all statistics and tables that the FREQ, MODES, NEXTRVAL=5, PLOT, and CIBASIC options generate. If the analysis variables are not weighted, this option also requests the statistics and tables generated by the CIPCTLDF, CIPCTLNORMAL, LOCCOUNT, NORMAL, ROBUSTSCALE, TRIMMED=.25, and WINSORIZED=.25 options. PROC UNIVARIATE also uses any values that you specify for ALPHA=, MU0=, NEXTRVAL=, CIBASIC, CIPCTLDF, CIPCTLNORMAL, TRIMMED=, or WINSORIZED= to produce the output.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1; the default value is 0.05, which results in 95% confidence intervals.
Note that specialized ALPHA= options are available for a number of confidence interval options. For example, you can specify CIBASIC( ALPHA=0.10 ) to request a table of basic confidence limits at the 90% level. The default values of these options are the value of the ALPHA= option in the PROC statement.
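A minimal sketch of how the two levels of ALPHA= interact, assuming the sample data set SASHELP.CLASS is available: the PROC-level value applies to every confidence interval option that does not carry its own ALPHA=.

```sas
/* Sketch only: SASHELP.CLASS is an assumed sample data set */
proc univariate data=sashelp.class
                alpha=0.01            /* procedure-wide default: 99% limits   */
                cibasic(alpha=0.10)   /* overrides the default: 90% limits    */
                cipctldf;             /* no ALPHA= here, so it uses 0.01      */
   var height;
run;
```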
ANNOTATE= SAS-data-set
ANNO= SAS-data-set
specifies an input data set that contains annotate variables as described in SAS/GRAPH Reference. You can use this data set to add features to your high-resolution graphics. PROC UNIVARIATE adds the features in this data set to every high-resolution graph that is produced in the procedure. PROC UNIVARIATE does not use the ANNOTATE= data set unless you create a high-resolution graph with the HISTOGRAM, PROBPLOT, or QQPLOT statement. Use the ANNOTATE= option in the HISTOGRAM, PROBPLOT, or QQPLOT statement if you want to add a feature to a specific graph produced by the statement.
CIBASIC < ( < TYPE= keyword > < ALPHA= α > ) >
requests confidence limits for the mean, standard deviation, and variance based on the assumption that the data are normally distributed. If you use the CIBASIC option, you must use the default value of VARDEF=, which is DF.
TYPE= keyword
specifies the type of confidence limit, where keyword is LOWER, UPPER, or TWOSIDED. The default value is TWOSIDED.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1. The default is the value of ALPHA= given in the PROC statement, which is 0.05 (95% confidence intervals) unless you change it.
CIPCTLDF < ( < TYPE= keyword > < ALPHA= α > ) >
CIQUANTDF < ( < TYPE= keyword > < ALPHA= α > ) >
requests confidence limits for quantiles based on a method that is distribution-free. In other words, no specific parametric distribution such as the normal is assumed for the data. PROC UNIVARIATE uses order statistics (ranks) to compute the confidence limits as described by Hahn and Meeker (1991). This option does not apply if you use a WEIGHT statement.
TYPE= keyword
specifies the type of confidence limit, where keyword is LOWER, UPPER, SYMMETRIC, or ASYMMETRIC. The default value is SYMMETRIC.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1. The default is the value of ALPHA= given in the PROC statement, which is 0.05 (95% confidence intervals) unless you change it.
CIPCTLNORMAL < ( < TYPE= keyword > < ALPHA= α > ) >
CIQUANTNORMAL < ( < TYPE= keyword > < ALPHA= α > ) >
requests confidence limits for quantiles based on the assumption that the data are normally distributed. The computational method is described in Section 4.4.1 of Hahn and Meeker (1991) and uses the noncentral t distribution as given by Odeh and Owen (1980). This option does not apply if you use a WEIGHT statement.
TYPE= keyword
specifies the type of confidence limit, where keyword is LOWER, UPPER, or TWOSIDED. The default is TWOSIDED.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1. The default is the value of ALPHA= given in the PROC statement, which is 0.05 (95% confidence intervals) unless you change it.
DATA= SAS-data-set
specifies the input SAS data set to be analyzed. If the DATA= option is omitted, the procedure uses the most recently created SAS data set.
EXCLNPWGT
excludes observations with nonpositive weight values (zero or negative) from the analysis. By default, PROC UNIVARIATE treats observations with negative weights like those with zero weights and counts them in the total number of observations. This option applies only when you use a WEIGHT statement.
FREQ
requests a frequency table that consists of the variable values, frequencies, cell percentages, and cumulative percentages.
If you specify the WEIGHT statement, PROC UNIVARIATE includes the weighted count in the table and uses this value to compute the percentages.
GOUT= graphics-catalog
specifies the SAS catalog that PROC UNIVARIATE uses to save high-resolution graphics output. If you omit the libref in the name of the graphics-catalog, PROC UNIVARIATE looks for the catalog in the temporary library called WORK and creates the catalog if it does not exist.
LOCCOUNT
requests a table that shows the number of observations greater than, not equal to, and less than the value of MU0=. PROC UNIVARIATE uses these values to construct the sign test and the signed rank test. This option does not apply if you use a WEIGHT statement.
MODES | MODE
requests a table of all possible modes. By default, when the data contain multiple modes, PROC UNIVARIATE displays the lowest mode in the table of basic statistical measures. When all the values are unique, PROC UNIVARIATE does not produce a table of modes.
MU0= values
LOCATION= values
specifies the value of the mean or location parameter ( µ ) in the null hypothesis for tests of location summarized in the table labeled Tests for Location: Mu0=value . If you specify one value, PROC UNIVARIATE tests the same null hypothesis for all analysis variables. If you specify multiple values, a VAR statement is required, and PROC UNIVARIATE tests a different null hypothesis for each analysis variable in the corresponding order. The default value is 0.
The following statement tests the hypothesis µ = 0 for the first variable and the hypothesis µ = 0.5 for the second variable.
proc univariate mu0=0 0.5;
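Because multiple MU0= values require a VAR statement, a complete step might look like the following sketch. It assumes the sample data set SASHELP.CLASS is available; the hypothesized values 60 and 100 are arbitrary illustrations.

```sas
/* Sketch only: SASHELP.CLASS and the null values are assumptions */
proc univariate data=sashelp.class mu0=60 100;
   var height weight;   /* tests mu = 60 for Height and mu = 100 for Weight */
run;
```

The values are matched to the VAR variables by position, so reordering the VAR list changes which hypothesis applies to which variable.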
NEXTROBS= n
specifies the number of extreme observations that PROC UNIVARIATE lists in the table of extreme observations. The table lists the n lowest observations and the n highest observations. The default value is 5, and n can range between 0 and half the maximum number of observations. You can specify NEXTROBS=0 to suppress the table of extreme observations.
NEXTRVAL= n
specifies the number of extreme values that PROC UNIVARIATE lists in the table of extreme values. The table lists the n lowest unique values and the n highest unique values. The value of n can range between 0 and half the maximum number of observations. By default, n = 0 and no table is displayed.
NOBYPLOT
suppresses side-by-side box plots that are created by default when you use the BY statement and the ALL option or the PLOT option in the PROC statement.
NOPRINT
suppresses all the tables of descriptive statistics that the PROC UNIVARIATE statement creates. NOPRINT does not suppress the tables that the HISTOGRAM statement creates. You can use the NOPRINT option in the HISTOGRAM statement to suppress the creation of its tables. Use NOPRINT when you want to create an OUT= output data set only.
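A minimal sketch of the output-data-set-only pattern, assuming the sample data set SASHELP.CLASS is available; the data set name HeightStats and the statistic names are hypothetical.

```sas
/* Sketch only: SASHELP.CLASS and all output names are assumptions */
proc univariate data=sashelp.class noprint;
   var height;                        /* VAR is required with OUTPUT       */
   output out=HeightStats
          n=N mean=Mean std=StdDev
          pctlpts=10 90 pctlpre=P;    /* percentiles saved as P10 and P90  */
run;

proc print data=HeightStats;
run;
```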
NORMAL
NORMALTEST
requests tests for normality that include a series of goodness-of-fit tests based on the empirical distribution function. The table provides test statistics and p-values for the Shapiro-Wilk test (provided the sample size is less than or equal to 2000), the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Cramér-von Mises test. This option does not apply if you use a WEIGHT statement.
PCTLDEF= value
DEF= value
specifies the definition that PROC UNIVARIATE uses to calculate quantiles. The default value is 5. Values can be 1, 2, 3, 4, or 5. You cannot use PCTLDEF= when you compute weighted quantiles. See the section Calculating Percentiles on page 273 for details on quantile definitions.
PLOTS | PLOT
produces a stem-and-leaf plot (or a horizontal bar chart), a box plot, and a normal probability plot in line printer output. If you use a BY statement, side-by-side box plots that are labeled Schematic Plots appear after the univariate analysis for the last BY group.
PLOTSIZE= n
specifies the approximate number of rows used in line-printer plots requested with the PLOTS option. If n is larger than the value of the SAS system option PAGESIZE=, PROC UNIVARIATE uses the value of PAGESIZE=. If n is less than 8, PROC UNIVARIATE uses eight rows to draw the plots.
ROBUSTSCALE
produces a table with robust estimates of scale. The statistics include the interquartile range, Gini's mean difference, the median absolute deviation about the median (MAD), and two statistics proposed by Rousseeuw and Croux (1993), Q_n and S_n. This option does not apply if you use a WEIGHT statement.
ROUND= units
specifies the units to use to round the analysis variables prior to computing statistics. If you specify one unit, PROC UNIVARIATE uses this unit to round all analysis variables. If you specify multiple units, a VAR statement is required, and each unit rounds the values of the corresponding analysis variable. If ROUND=0, no rounding occurs. The ROUND= option reduces the number of unique variable values, thereby reducing memory requirements for the procedure. For example, to make the rounding unit 1 for the first analysis variable and 0.5 for the second analysis variable, submit the statement
proc univariate round=1 0.5;
   var yldstren tenstren;
run;
When a variable value is midway between the two nearest rounded points, the value is rounded to the nearest even multiple of the roundoff value. For example, with a roundoff value of 1, the variable values of −2.5, −2.2, and −1.5 are rounded to −2; the values of −0.5, 0.2, and 0.5 are rounded to 0; and the values of 0.6, 1.2, and 1.4 are rounded to 1.
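One way to check the rounding rule is to run the nine example values through the procedure and inspect the frequency table. The data set name RoundDemo and the variable name x are hypothetical.

```sas
/* Sketch only: RoundDemo and x are hypothetical names */
data RoundDemo;
   input x @@;          /* read several values per input line */
   datalines;
-2.5 -2.2 -1.5 -0.5 0.2 0.5 0.6 1.2 1.4
;
run;

proc univariate data=RoundDemo round=1 freq;
   var x;               /* FREQ option lists the distinct analyzed values */
run;
```

If the rule behaves as described, the frequency table should list only the rounded values −2, 0, and 1.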
TRIMMED= values < ( < TYPE= keyword > < ALPHA= α > ) >
TRIM= values < ( < TYPE= keyword > < ALPHA= α > ) >
requests a table of trimmed means, where each value specifies the number or the proportion of observations that PROC UNIVARIATE trims. If the value is the number n of trimmed observations, n must be between 0 and half the number of nonmissing observations. If the value is a proportion p between 0 and 1/2, the number of observations that PROC UNIVARIATE trims is the smallest integer that is greater than or equal to np, where n is the number of observations. To include confidence limits for the mean and the Student's t test in the table, you must use the default value of VARDEF=, which is DF. For details concerning the computation of trimmed means, see the section Trimmed Means on page 279.
TYPE= keyword
specifies the type of confidence limit for the mean, where keyword is LOWER, UPPER, or TWOSIDED. The default value is TWOSIDED.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1; the default value is 0.05, which results in 95% confidence intervals.
This option does not apply if you use a WEIGHT statement.
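The sketch below requests two trimmed means, one given as a count and one as a proportion. It assumes the sample data set SASHELP.CLASS is available; that data set has 19 observations, so the proportion 0.1 gives np = 1.9 and the trimming count is rounded up to 2.

```sas
/* Sketch only: SASHELP.CLASS is an assumed sample data set (n = 19) */
proc univariate data=sashelp.class trimmed=1 0.1;
   var height;   /* first row: count of 1; second row: ceil(0.1 * 19) = 2 */
run;
```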
VARDEF= divisor
specifies the divisor to use in the calculation of variances and standard deviations. By default, VARDEF=DF. The following table shows the possible values for divisor and the associated divisors.
Value | Divisor | Formula for Divisor |
---|---|---|
DF | Degrees of freedom | $n - 1$ |
N | Number of observations | $n$ |
WDF | Sum of weights minus one | $(\sum_i w_i) - 1$ |
WEIGHT or WGT | Sum of weights | $\sum_i w_i$ |
The procedure computes the variance as $\mathrm{CSS}/\mathrm{divisor}$, where CSS is the corrected sum of squares and equals $\sum_{i=1}^{n} (x_i - \bar{x})^2$. When you weight the analysis variables, $\mathrm{CSS} = \sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2$, where $\bar{x}_w$ is the weighted mean.
The default value is DF. To compute the standard error of the mean, confidence limits, and Student's t test, use the default value of VARDEF=.
When you use the WEIGHT statement and VARDEF=DF, the variance is an estimate of $\sigma^2$, where the variance of the $i$th observation is $\mathrm{var}(x_i) = \sigma^2 / w_i$ and $w_i$ is the weight for the $i$th observation. This yields an estimate of the variance of an observation with unit weight.
When you use the WEIGHT statement and VARDEF=WGT, the computed variance is asymptotically (for large $n$) an estimate of $\sigma^2 / \bar{w}$, where $\bar{w}$ is the average weight. This yields an asymptotic estimate of the variance of an observation with average weight.
WINSORIZED= values < ( < TYPE= keyword > < ALPHA= α > ) >
WINSOR= values < ( < TYPE= keyword > < ALPHA= α > ) >
requests a table of Winsorized means, where each value is the number or the proportion of observations that PROC UNIVARIATE uses to compute the Winsorized mean. If the value is the number n of Winsorized observations, n must be between 0 and half the number of nonmissing observations. If the value is a proportion p between 0 and 1/2, the number of observations that PROC UNIVARIATE uses is equal to the smallest integer that is greater than or equal to np, where n is the number of observations. To include confidence limits for the mean and the Student's t test in the table, you must use the default value of VARDEF=, which is DF. For details concerning the computation of Winsorized means, see the section Winsorized Means on page 278.
TYPE= keyword
specifies the type of confidence limit for the mean, where keyword is LOWER, UPPER, or TWOSIDED. The default is TWOSIDED.
ALPHA= α
specifies the level of significance α for 100(1 − α)% confidence intervals. The value α must be between 0 and 1; the default value is 0.05, which results in 95% confidence intervals.
This option does not apply if you use a WEIGHT statement.
BY variables ;
You can specify a BY statement with PROC UNIVARIATE to obtain separate analyses for each BY group. The BY statement specifies the variables that the procedure uses to form BY groups. You can specify more than one variable. If you do not use the NOTSORTED option in the BY statement, the observations in the data set must either be sorted by all the variables that you specify or be indexed appropriately.
DESCENDING
specifies that the data set is sorted in descending order by the variable that immediately follows the word DESCENDING in the BY statement.
NOTSORTED
specifies that observations are not necessarily sorted in alphabetic or numeric order. The data are grouped in another way, for example, chronological order.
The requirement for ordering or indexing observations according to the values of BY variables is suspended for BY-group processing when you use the NOTSORTED option. In fact, the procedure does not use an index if you specify NOTSORTED. The procedure defines a BY group as a set of contiguous observations that have the same values for all BY variables. If observations with the same values for the BY variables are not contiguous, the procedure treats each contiguous set as a separate BY group.
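A minimal sketch of the usual sort-then-analyze pattern, assuming the sample data set SASHELP.CLASS is available; the sorted copy ClassSorted is a hypothetical name.

```sas
/* Sketch only: SASHELP.CLASS and ClassSorted are assumptions */
proc sort data=sashelp.class out=ClassSorted;
   by sex;              /* sort by the BY variable first */
run;

proc univariate data=ClassSorted;
   by sex;              /* one complete analysis per value of Sex */
   var height;
run;
```

Sorting into a separate data set leaves the input untouched. If the data are already grouped but not in sorted order, BY with the NOTSORTED option avoids the sort, at the cost of treating each contiguous run as its own BY group.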
CLASS variable-1 < (v-options) > < variable-2 < (v-options) > >
< / KEYLEVEL= value1 | ( value1 value2 ) > ;
The CLASS statement specifies one or two variables that the procedure uses to group the data into classification levels. Variables in a CLASS statement are referred to as class variables. Class variables can be numeric or character. Class variables can have floating point values, but they typically have a few discrete values that define levels of the variable. You do not have to sort the data by class variables. PROC UNIVARIATE uses the formatted values of the class variables to determine the classification levels.
You can specify the following v-options enclosed in parentheses after the class variable:
MISSING
specifies that missing values for the CLASS variable are to be treated as valid classification levels. Special missing values that represent numeric values (the letters A through Z and the underscore (_) character) are each considered as a separate value. If you omit MISSING, PROC UNIVARIATE excludes the observations with a missing class variable value from the analysis. Enclose this option in parentheses after the class variable.
ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the display order for the class variable values. The default value is INTERNAL. You can specify the following values with the ORDER= option:
DATA
orders values according to their order in the input data set. When you use a HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE displays the rows (columns) of the comparative plot from top to bottom (left to right) in the order that the class variable values first appear in the input data set.
FORMATTED
orders values by their ascending formatted values. This order may depend on your operating environment. When you use a HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE displays the rows (columns) of the comparative plot from top to bottom (left to right) in increasing order of the formatted class variable values. For example, suppose a numeric class variable DAY (with values 1, 2, and 3) has a user-defined format that assigns Wednesday to the value 1, Thursday to the value 2, and Friday to the value 3. The rows of the comparative plot will appear in alphabetical order (Friday, Thursday, Wednesday) from top to bottom.
If there are two or more distinct internal values with the same formatted value, then PROC UNIVARIATE determines the order by the internal value that occurs first in the input data set. For numerical variables without an explicit format, the levels are ordered by their internal values.
FREQ
orders values by descending frequency count so that levels with the most observations are listed first. If two or more values have the same frequency count, PROC UNIVARIATE uses the formatted values to determine the order.
When you use a HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE displays the rows (columns) of the comparative plot from top to bottom (left to right) in order of decreasing frequency count for the class variable values.
INTERNAL
orders values by their unformatted values, which yields the same order as PROC SORT. This order may depend on your operating environment.
When you use a HISTOGRAM, PROBPLOT, or QQPLOT statement, PROC UNIVARIATE displays the rows (columns) of the comparative plot from top to bottom (left to right) in increasing order of the internal (unformatted) values of the class variable. The first class variable is used to label the rows of the comparative plots (top to bottom). The second class variable is used to label the columns of the comparative plots (left to right). For example, suppose a numeric class variable DAY (with values 1, 2, and 3) has a user-defined format that assigns Wednesday to the value 1, Thursday to the value 2, and Friday to the value 3. The rows of the comparative plot will appear in day-of-the-week order (Wednesday, Thursday, Friday) from top to bottom.
You can specify the following option after the slash (/) in the CLASS statement.
KEYLEVEL= value1 | ( value1 value2 )
specifies the key cell in a comparative plot. PROC UNIVARIATE first determines the bin size and midpoints for the key cell, and then extends the midpoint list to accommodate the data ranges for the remaining cells. Thus, the choice of the key cell determines the uniform horizontal axis that PROC UNIVARIATE uses for all cells. If you specify only one class variable and use a HISTOGRAM statement, KEYLEVEL= value identifies the key cell as the level for which variable is equal to value. By default, PROC UNIVARIATE sorts the levels in the order that is determined by the ORDER= option. Then, the key cell is the first occurrence of a level in this order. The cells display in order from top to bottom or left to right. Consequently, the key cell appears at the top (or left). When you specify a different key cell with the KEYLEVEL= option, this cell appears at the top (or left).
Likewise, with the PROBPLOT and QQPLOT statements, the key cell determines uniform axis scaling. If you specify two class variables, use KEYLEVEL=( value1 value2 ) to identify the key cell as the level for which variable-n is equal to value-n.
By default, PROC UNIVARIATE sorts the levels of the first CLASS variable in the order that is determined by its ORDER= option and, within each of these levels, it sorts the levels of the second CLASS variable in the order that is determined by its ORDER= option. Then, the default key cell is the first occurrence of a combination of levels for the two variables in this order. The cells display in the order of the first CLASS variable from top to bottom and in the order of the second CLASS variable from left to right. Consequently, the default key cell appears at the upper left corner.
When you specify a different key cell with the KEYLEVEL= option, this cell appears at the upper left corner.
The length of the KEYLEVEL= value cannot exceed 16 characters and you must specify a formatted value.
The KEYLEVEL= option does not apply unless you specify a HISTOGRAM, PROBPLOT, or QQPLOT statement.
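The following sketch shows one way the CLASS statement options might be combined for a one-way comparative histogram. It assumes the sample data set SASHELP.CLASS is available and that the key level is written as a quoted formatted value; treat both as assumptions rather than confirmed usage.

```sas
/* Sketch only: SASHELP.CLASS and the quoted key level are assumptions */
proc univariate data=sashelp.class noprint;
   class sex (order=freq) / keylevel='F';  /* F becomes the key cell       */
   histogram height / nrows=2;             /* one row per level of Sex     */
run;
```

With ORDER=FREQ alone, the level with the most observations would be the key cell. KEYLEVEL= overrides that choice, so the bins are determined from the F cell and that cell is drawn first.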
FREQ variable ;
The FREQ statement specifies a numeric variable whose value represents the frequency of the observation. If you use the FREQ statement, the procedure assumes that each observation represents n observations, where n is the value of variable. If the value of variable is not an integer, the SAS System truncates it. If the value is less than 1 or is missing, the procedure excludes that observation from the analysis. See Example 3.6.
Note: The FREQ statement affects the degrees of freedom, but the WEIGHT statement does not.
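A small sketch of the FREQ statement with pre-tabulated data. The data set Counts and the variables Score and Count are hypothetical.

```sas
/* Sketch only: Counts, Score, and Count are hypothetical names */
data Counts;
   input Score Count;
   datalines;
1 4
2 7
3 2
;
run;

proc univariate data=Counts;
   var Score;
   freq Count;   /* each row stands for Count identical observations */
run;
```

The three rows are analyzed as 4 + 7 + 2 = 13 observations, which is also the count used for the degrees of freedom.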
HISTOGRAM < variables > < /options > ;
The HISTOGRAM statement creates histograms and optionally superimposes estimated parametric and nonparametric probability density curves. You cannot use the WEIGHT statement with the HISTOGRAM statement. You can use any number of HISTOGRAM statements after a PROC UNIVARIATE statement. The components of the HISTOGRAM statement are described as follows.
variables
are the variables for which histograms are to be created. If you specify a VAR statement, the variables must also be listed in the VAR statement. Otherwise, the variables can be any numeric variables in the input data set. If you do not specify variables in a VAR statement or in the HISTOGRAM statement, then by default, a histogram is created for each numeric variable in the DATA= data set. If you use a VAR statement and do not specify any variables in the HISTOGRAM statement, then by default, a histogram is created for each variable listed in the VAR statement.
For example, suppose a data set named Steel contains exactly two numeric variables named Length and Width . The following statements create two histograms, one for Length and one for Width :
proc univariate data=Steel;
   histogram;
run;
Likewise, the following statements create histograms for Length and Width :
proc univariate data=Steel;
   var Length Width;
   histogram;
run;
The following statements create a histogram for Length only:
proc univariate data=Steel;
   var Length Width;
   histogram Length;
run;
options
add features to the histogram. Specify all options after the slash (/) in the HISTOGRAM statement. Options can be one of the following:
primary options for fitted parametric distributions and kernel density estimates
secondary options for fitted parametric distributions and kernel density estimates
general options for graphics and output data sets
For example, in the following statements, the NORMAL option displays a fitted normal curve on the histogram, the MIDPOINTS= option specifies midpoints for the histogram, and the CTEXT= option specifies the color of the text:
proc univariate data=Steel;
   histogram Length / normal
                      midpoints = 5.6 5.8 6.0 6.2 6.4
                      ctext     = blue;
run;
Table 3.2 through Table 3.12 list the HISTOGRAM options by function. For complete descriptions, see the section Dictionary of Options on page 217.
Table 3.2 lists primary options that display a parametric density estimate on the histogram.
Option | Description |
---|---|
BETA( beta-options ) | Fits beta distribution with threshold parameter θ, scale parameter σ, and shape parameters α and β |
EXPONENTIAL( exponential-options ) | Fits exponential distribution with threshold parameter θ and scale parameter σ |
GAMMA( gamma-options ) | Fits gamma distribution with threshold parameter θ, scale parameter σ, and shape parameter α |
LOGNORMAL( lognormal-options ) | Fits lognormal distribution with threshold parameter θ, scale parameter ζ, and shape parameter σ |
NORMAL( normal-options ) | Fits normal distribution with mean µ and standard deviation σ |
WEIBULL( Weibull-options ) | Fits Weibull distribution with threshold parameter θ, scale parameter σ, and shape parameter c |
Table 3.3 through Table 3.9 list secondary options that specify parameters for fitted parametric distributions and that control the display of fitted curves. Specify these secondary options in parentheses after the primary distribution option. For example, you can fit a normal curve by specifying the NORMAL option as follows:
proc univariate;
   histogram / normal(color=red mu=10 sigma=0.5);
run;
The COLOR= normal-option draws the curve in red, and the MU= and SIGMA= normal-options specify the parameters µ = 10 and σ = 0.5 for the curve. Note that the sample mean and sample standard deviation are used to estimate µ and σ, respectively, when the MU= and SIGMA= normal-options are not specified.
Secondary options used with all parametric distribution options:
Option | Description |
---|---|
COLOR= color | Specifies color of density curve |
FILL | Fills area under density curve |
L= linetype | Specifies line type of curve |
MIDPERCENTS | Prints table of midpoints of histogram intervals |
NOPRINT | Suppresses tables summarizing curve |
PERCENTS= value-list | Lists percents for which quantiles calculated from data and quantiles estimated from curve are tabulated |
W= n | Specifies width of density curve |
Secondary beta-options:
Option | Description |
---|---|
ALPHA= value | Specifies first shape parameter α for beta curve |
BETA= value | Specifies second shape parameter β for beta curve |
SIGMA= value or EST | Specifies scale parameter σ for beta curve |
THETA= value or EST | Specifies lower threshold parameter θ for beta curve |
Secondary exponential-options:
Option | Description |
---|---|
SIGMA= value | Specifies scale parameter σ for exponential curve |
THETA= value or EST | Specifies threshold parameter θ for exponential curve |
Secondary gamma-options:
Option | Description |
---|---|
ALPHA= value | Specifies shape parameter α for gamma curve |
SIGMA= value | Specifies scale parameter σ for gamma curve |
THETA= value or EST | Specifies threshold parameter θ for gamma curve |
Secondary lognormal-options:
Option | Description |
---|---|
SIGMA= value | Specifies shape parameter σ for lognormal curve |
THETA= value or EST | Specifies threshold parameter θ for lognormal curve |
ZETA= value | Specifies scale parameter ζ for lognormal curve |
Secondary normal-options:
Option | Description |
---|---|
MU= value | Specifies mean µ for normal curve |
SIGMA= value | Specifies standard deviation σ for normal curve |
Secondary Weibull-options:
Option | Description |
---|---|
C= value | Specifies shape parameter c for Weibull curve |
SIGMA= value | Specifies scale parameter σ for Weibull curve |
THETA= value or EST | Specifies threshold parameter θ for Weibull curve |
Use the option KERNEL( kernel-options ) to compute kernel density estimates. Specify the following secondary options in parentheses after the KERNEL option to control features of density estimates requested with the KERNEL option.
Option | Description |
---|---|
C= value-list or MISE | Specifies standardized bandwidth parameter c |
COLOR= color | Specifies color of the kernel density curve |
FILL | Fills area under kernel density curve |
K=NORMAL | Specifies type of kernel function |
L= linetype | Specifies line type used for kernel density curve |
LOWER= | Specifies lower bound for kernel density curve |
UPPER= | Specifies upper bound for kernel density curve |
W= n | Specifies line width for kernel density curve |
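A minimal sketch of a kernel density request that uses only the secondary options listed for KERNEL, assuming the sample data set SASHELP.CLASS is available.

```sas
/* Sketch only: SASHELP.CLASS is an assumed sample data set */
proc univariate data=sashelp.class noprint;
   histogram height / kernel(c=mise      /* bandwidth chosen by the MISE criterion */
                             k=normal    /* normal kernel function                 */
                             w=2);       /* line width of the density curve        */
run;
```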
Table 3.11 summarizes options for enhancing histograms, and Table 3.12 summarizes options for requesting output data sets.
Option | Description |
---|---|
ANNOKEY | Applies annotation requested in ANNOTATE= data set to key cell only |
ANNOTATE= | Specifies annotate data set |
BARWIDTH= | Specifies width for the bars |
CAXIS= | Specifies color for axis |
CBARLINE= | Specifies color for outlines of histogram bars |
CFILL= | Specifies color for filling under curve |
CFRAME= | Specifies color for frame |
CFRAMESIDE= | Specifies color for filling frame for row labels |
CFRAMETOP= | Specifies color for filling frame for column labels |
CGRID= | Specifies color for grid lines |
CHREF= | Specifies color for HREF= lines |
CPROP= | Specifies color for proportion of frequency bar |
CTEXT= | Specifies color for text |
CTEXTSIDE= | Specifies color for row labels of comparative histograms |
CTEXTTOP= | Specifies color for column labels of comparative histograms |
CVREF= | Specifies color for VREF= lines |
DESCRIPTION= | Specifies description for plot in graphics catalog |
ENDPOINTS= | Lists endpoints for histogram intervals |
FONT= | Specifies software font for text |
FORCEHIST | Forces creation of histogram |
GRID | Creates a grid |
FRONTREF | Draws reference lines in front of histogram bars |
HEIGHT= | Specifies height of text used outside framed areas |
HMINOR= | Specifies number of horizontal minor tick marks |
HOFFSET= | Specifies offset for horizontal axis |
HREF= | Specifies reference lines perpendicular to the horizontal axis |
HREFLABELS= | Specifies labels for HREF= lines |
HREFLABPOS= | Specifies vertical position of labels for HREF= lines |
INFONT= | Specifies software font for text inside framed areas |
INHEIGHT= | Specifies height of text inside framed areas |
INTERTILE= | Specifies distance between tiles |
LGRID= | Specifies a line type for grid lines |
LHREF= | Specifies line style for HREF= lines |
LVREF= | Specifies line style for VREF= lines |
MAXNBIN= | Specifies maximum number of bins to display |
MAXSIGMAS= | Limits the number of bins that display to within a specified number of standard deviations above and below mean of data in key cell |
MIDPOINTS= | Lists midpoints for histogram intervals |
NAME= | Specifies name for plot in graphics catalog |
NCOLS= | Specifies number of columns in comparative histogram |
NOBARS | Suppresses histogram bars |
NOFRAME | Suppresses frame around plotting area |
NOHLABEL | Suppresses label for horizontal axis |
NOPLOT | Suppresses plot |
NOVLABEL | Suppresses label for vertical axis |
NOVTICK | Suppresses tick marks and tick mark labels for vertical axis |
NROWS= | Specifies number of rows in comparative histogram |
PFILL= | Specifies pattern for filling under curve |
RTINCLUDE | Includes right endpoint in interval |
TURNVLABELS | Turns and vertically strings out characters in labels for vertical axis |
VAXIS= | Specifies AXIS statement or values for vertical axis |
VAXISLABEL= | Specifies label for vertical axis |
VMINOR= | Specifies number of vertical minor tick marks |
VOFFSET= | Specifies length of offset at upper end of vertical axis |
VREF= | Specifies reference lines perpendicular to the vertical axis |
VREFLABELS= | Specifies labels for VREF= lines |
VREFLABPOS= | Specifies horizontal position of labels for VREF= lines |
VSCALE= | Specifies scale for vertical axis |
WAXIS= | Specifies line thickness for axes and frame |
WBARLINE= | Specifies line thickness for bar outlines |
WGRID= | Specifies line thickness for grid |
Option | Description |
---|---|
MIDPERCENTS | Creates table of histogram intervals |
OUTHISTOGRAM= | Specifies information on histogram intervals |
The following entries provide detailed descriptions of options in the HISTOGRAM statement.
ALPHA= value
specifies the shape parameter α for fitted curves requested with the BETA and GAMMA options. Enclose the ALPHA= option in parentheses after the BETA or GAMMA option. By default, the procedure calculates a maximum likelihood estimate for α. You can specify A= as an alias for ALPHA= if you use it as a beta-option. You can specify SHAPE= as an alias for ALPHA= if you use it as a gamma-option.
ANNOKEY
applies the annotation requested with the ANNOTATE= option to the key cell only. By default, the procedure applies annotation to all of the cells. This option is not available unless you use the CLASS statement. You can use the KEYLEVEL= option in the CLASS statement to specify the key cell.
ANNOTATE= SAS-data-set
ANNO= SAS-data-set
specifies an input data set containing annotate variables as described in SAS/GRAPH Software: Reference . The ANNOTATE= data set you specify in the HISTOGRAM statement is used for all plots created by the statement. You can also specify an ANNOTATE= data set in the PROC UNIVARIATE statement to enhance all plots created by the procedure.
BARWIDTH= value
specifies the width of the histogram bars in screen percent units.
BETA < ( beta-options ) >
displays a fitted beta density curve on the histogram. The BETA option can occur only once in a HISTOGRAM statement. The beta distribution is bounded below by the parameter θ and above by the value θ + σ. Use the THETA= and SIGMA= beta-options to specify these parameters. By default, THETA=0 and SIGMA=1. You can specify THETA=EST and SIGMA=EST to request maximum likelihood estimates for θ and σ. See Example 3.21.
Note: Three- and four-parameter maximum likelihood estimation may not always converge. The beta distribution has two shape parameters, α and β. If these parameters are known, you can specify their values with the ALPHA= and BETA= beta-options. By default, the procedure computes maximum likelihood estimates for α and β. Table 3.3 (page 214) and Table 3.4 (page 215) list options you can specify with the BETA option.
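As a minimal sketch of the BETA option, the following statements fit a beta curve with fixed bounds and leave the two shape parameters to maximum likelihood estimation. The data set work.ratio and the variable r are hypothetical; they are simulated here only so that the example is self-contained.

```sas
/* Hypothetical data: 100 simulated proportions on (0,1) */
data work.ratio;
   call streaminit(1);
   do i = 1 to 100;
      r = rand('beta', 2, 3);
      output;
   end;
   drop i;
run;

proc univariate data=work.ratio;
   var r;
   /* THETA=0 and SIGMA=1 are the documented defaults, written out */
   /* here for clarity; ALPHA= and BETA= are omitted, so both      */
   /* shape parameters are estimated by maximum likelihood.        */
   histogram r / beta(theta=0 sigma=1);
run;
```

Replacing the beta-options with THETA=EST SIGMA=EST would request estimates of the bounds as well, subject to the convergence caveat in the preceding note.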
BETA= value
B= value
specifies the second shape parameter β for beta density curves requested with the BETA option. Enclose the BETA= option in parentheses after the BETA option. By default, the procedure calculates a maximum likelihood estimate for β.
C= value
specifies the shape parameter c for Weibull density curves requested with the WEIBULL option. Enclose the C= Weibull-option in parentheses after the WEIBULL option. If you do not specify a value for c , the procedure calculates a maximum likelihood estimate. You can specify the SHAPE= Weibull-option as an alias for the C= Weibull-option .
C= value-list | MISE
specifies the standardized bandwidth parameter c for kernel density estimates requested with the KERNEL option. Enclose the C= kernel-option in parentheses after the KERNEL option. You can specify up to five values to request multiple estimates. You can also specify the C=MISE option, which produces the estimate with a bandwidth that minimizes the approximate mean integrated square error (MISE).
You can also use the C= kernel-option with the K= kernel-option , which specifies the kernel function, to compute multiple estimates. If you specify more kernel functions than bandwidths, the last bandwidth in the list is repeated for the remaining estimates. Likewise, if you specify more bandwidths than kernel functions, the last kernel function is repeated for the remaining estimates. If you do not specify a value for c , the bandwidth that minimizes the approximate MISE is used for all the estimates.
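The repetition rule for unequal lists can be illustrated with a short sketch. The data set work.plates and the variable gap are hypothetical, and the bandwidth values are arbitrary choices for illustration.

```sas
/* Hypothetical data: 200 simulated measurements */
data work.plates;
   call streaminit(12345);
   do i = 1 to 200;
      gap = rand('normal', 10, 0.5);
      output;
   end;
   drop i;
run;

proc univariate data=work.plates noprint;
   var gap;
   /* Three bandwidths but only two kernel functions: per the rule */
   /* above, the last kernel function (QUADRATIC) is reused for    */
   /* the third estimate. L= assigns a line type to each curve.    */
   histogram gap / kernel(c=0.25 0.5 1 k=normal quadratic l=1 2 3);
run;
```

Omitting C= altogether would leave every estimate at the bandwidth that minimizes the approximate MISE.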
CAXIS= color
CAXES= color
CA= color
specifies the color for the axes and tick marks. This option overrides any COLOR= specifications in an AXIS statement. The default value is the first color in the device color list.
CBARLINE= color
specifies the color for the outline of the histogram bars. This option overrides the C= option in the SYMBOL1 statement. The default value is the first color in the device color list.
CFILL= color
specifies the color to fill the bars of the histogram (or the area under a fitted density curve if you also specify the FILL option). See the entries for the FILL and PFILL= options for additional details. Refer to SAS/GRAPH Software: Reference for a list of colors. By default, bars and curve areas are not filled.
CFRAME= color
specifies the color for the area that is enclosed by the axes and frame. The area is not filled by default.
CFRAMESIDE= color
specifies the color to fill the frame area for the row labels that display along the left side of the comparative histogram. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option is not available unless you use the CLASS statement.
CFRAMETOP= color
specifies the color to fill the frame area for the column labels that display across the top of the comparative histogram. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option is not available unless you use the CLASS statement.
CGRID= color
specifies the color for grid lines when a grid displays on the histogram. The default color is the first color in the device color list. This option also produces a grid.
CHREF= color
CH= color
specifies the color for horizontal axis reference lines requested by the HREF= option. The default is the first color in the device color list.
COLOR= color
specifies the color of the density curve. Enclose the COLOR= option in parentheses after the distribution option or the KERNEL option. If you use the COLOR= option with the KERNEL option, you can specify a list of up to five colors in parentheses for multiple kernel density estimates. If there are more estimates than colors, the last color specified is used for the remaining estimates.
CPROP= color | EMPTY
specifies the color for a horizontal bar whose length (relative to the width of the tile) indicates the proportion of the total frequency that is represented by the corresponding cell in a comparative histogram. By default, no bars are displayed. This option is not available unless you use the CLASS statement. You can specify the keyword EMPTY to display empty bars. See Example 3.20.
CTEXT= color
CT= color
specifies the color for tick mark values and axis labels. The default is the color specified for the CTEXT= option in the GOPTIONS statement. In the absence of a GOPTIONS statement, the default color is the first color in the device color list.
CTEXTSIDE= color
specifies the color for the row labels that display along the left side of the comparative histogram. By default, the color specified by the CTEXT= option is used. If you omit the CTEXT= option, the color specified in the GOPTIONS statement is used. If you omit the GOPTIONS statement, the first color in the device color list is used. This option is not available unless you use the CLASS statement. You can specify the CFRAMESIDE= option to change the background color for the row labels.
CTEXTTOP= color
specifies the color for the column labels that display across the top of the comparative histogram. By default, the color specified by the CTEXT= option is used. If you omit the CTEXT= option, the color specified in the GOPTIONS statement is used. If you omit the GOPTIONS statement, the first color in the device color list is used. This option is not available unless you use the CLASS statement. You can use the CFRAMETOP= option to change the background color for the column labels.
CVREF= color
CV= color
specifies the color for lines requested with the VREF= option. The default is the first color in the device color list.
DESCRIPTION= ' string '
DES= ' string '
specifies a description, up to 40 characters long, that appears in the PROC GREPLAY master menu. The default value is the variable name.
ENDPOINTS < = values | KEY | UNIFORM >
uses the endpoints as the tick mark values for the horizontal axis and determines how to compute the bin width of the histogram bars, where values specifies values for both the left and right endpoint of each histogram interval. The width of the histogram bars is the difference between consecutive endpoints. The procedure uses the same values for all variables.
The range of endpoints must cover the range of the data. For example, if you specify
endpoints=2 to 10 by 2
then all of the observations must fall in the intervals [2,4), [4,6), [6,8), [8,10]. You must also use evenly spaced endpoints, listed in increasing order.
KEY | determines the endpoints for the data in the key cell. The initial number of endpoints is based on the number of observations in the key cell using the method of Terrell and Scott (1985). The procedure extends the endpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells. |
UNIFORM | determines the endpoints by using all the observations as if there were no cells. In other words, the number of endpoints is based on the total sample size by using the method of Terrell and Scott (1985). |
Neither KEY nor UNIFORM applies unless you use the CLASS statement.
If you omit ENDPOINTS, the procedure uses the midpoints. If you specify ENDPOINTS without a list of values, the procedure computes the endpoints by using an algorithm (Terrell and Scott 1985) that is primarily applicable to continuous data that are approximately normally distributed.
If you specify both MIDPOINTS= and ENDPOINTS, the procedure issues a warning message and uses the endpoints.
If you specify RTINCLUDE, the procedure includes the right endpoint of each histogram interval in that interval instead of including the left endpoint.
If you use a CLASS statement and specify ENDPOINTS, the procedure uses ENDPOINTS=KEY as the default. However, if the key cell is empty, the procedure uses ENDPOINTS=UNIFORM.
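The endpoint list from the example above can be tried with a small sketch. The data set work.even and the variable x are hypothetical; the values are constructed to lie between 2 and 10 so that the endpoint range covers the data.

```sas
/* Hypothetical data: 100 values spread evenly from 2 to 10 */
data work.even;
   do i = 1 to 100;
      x = 2 + 8*(i - 1)/99;
      output;
   end;
   drop i;
run;

proc univariate data=work.even noprint;
   var x;
   /* Endpoints 2, 4, 6, 8, 10 give four bars of width 2.          */
   /* By default each interval includes its left endpoint; adding  */
   /* RTINCLUDE would include the right endpoint instead.          */
   histogram x / endpoints=2 to 10 by 2;
run;
```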
EXPONENTIAL < ( exponential-options ) >
EXP < ( exponential-options ) >
displays a fitted exponential density curve on the histogram. The EXPONENTIAL option can occur only once in a HISTOGRAM statement. The parameter θ must be less than or equal to the minimum data value. Use the THETA= exponential-option to specify θ. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for θ. Use the SIGMA= exponential-option to specify σ. By default, the procedure computes a maximum likelihood estimate for σ. Table 3.3 (page 214) and Table 3.5 (page 215) list options you can specify with the EXPONENTIAL option.
FILL
fills areas under the fitted density curve or the kernel density estimate with colors and patterns. The FILL option can occur with only one fitted curve. Enclose the FILL option in parentheses after a density curve option or the KERNEL option. The CFILL= and PFILL= options specify the color and pattern for the area under the curve. For a list of available colors and patterns, see SAS/GRAPH Reference .
FONT= font
specifies a software font for reference line and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the GOPTIONS statement. Hardware characters are used by default.
FORCEHIST
forces the creation of a histogram if there is only one unique observation. By default, a histogram is not created if the standard deviation of the data is zero.
FRONTREF
draws reference lines requested with the HREF= and VREF= options in front of the histogram bars. By default, reference lines are drawn behind the histogram bars and can be obscured by them.
GAMMA < ( gamma-options ) >
displays a fitted gamma density curve on the histogram. The GAMMA option can occur only once in a HISTOGRAM statement. The parameter θ must be less than the minimum data value. Use the THETA= gamma-option to specify θ. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for θ. Use the ALPHA= and the SIGMA= gamma-options to specify the shape parameter α and the scale parameter σ. By default, PROC UNIVARIATE computes maximum likelihood estimates for α and σ. The procedure calculates the maximum likelihood estimate of α iteratively using the Newton-Raphson approximation. Table 3.3 (page 214) and Table 3.6 (page 215) list options you can specify with the GAMMA option. See Example 3.22.
GRID
displays a grid on the histogram. Grid lines are horizontal lines that are positioned at major tick marks on the vertical axis.
HEIGHT= value
specifies the height, in percentage screen units, of text for axis labels, tick mark labels, and legends. This option takes precedence over the HTEXT= option in the GOPTIONS statement.
HMINOR= n
HM= n
specifies the number of minor tick marks between each major tick mark on the horizontal axis. Minor tick marks are not labeled. By default, HMINOR=0.
HOFFSET= value
specifies the offset, in percentage screen units, at both ends of the horizontal axis. You can use HOFFSET=0 to eliminate the default offset.
HREF= values
draws reference lines that are perpendicular to the horizontal axis at the values that you specify. If a reference line is almost completely obscured, then use the FRONTREF option to draw the reference lines in front of the histogram bars. Also see the CHREF=, HREFCHAR=, and LHREF= options.
HREFLABEL= 'label1' ... 'labeln'
HREFLABELS= 'label1' ... 'labeln'
HREFLAB= 'label1' ... 'labeln'
specifies labels for the lines requested by the HREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can have up to 16 characters.
HREFLABPOS=1 | 2 | 3
specifies the vertical position of HREFLABELS= labels. If you specify HREFLABPOS=1, the labels are positioned along the top of the histogram. If you specify HREFLABPOS=2, the labels are staggered from top to bottom of the histogram. If you specify HREFLABPOS=3, the labels are positioned along the bottom of the histogram. By default, HREFLABPOS=1.
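The reference-line options work together, as the following sketch suggests. The data set work.plates, the variable gap, and the limits 9.5 and 10.5 are hypothetical values chosen for illustration.

```sas
/* Hypothetical data: 200 simulated measurements */
data work.plates;
   call streaminit(12345);
   do i = 1 to 200;
      gap = rand('normal', 10, 0.5);
      output;
   end;
   drop i;
run;

proc univariate data=work.plates noprint;
   var gap;
   /* Two reference lines, so exactly two labels are supplied.     */
   /* HREFLABPOS=2 staggers the labels from top to bottom, and     */
   /* FRONTREF keeps the lines from being hidden behind the bars.  */
   histogram gap / href=9.5 10.5
                   hreflabels='Lower' 'Upper'
                   hreflabpos=2
                   lhref=2
                   frontref;
run;
```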
INFONT= font
specifies a software font to use for text inside the framed areas of the histogram. The INFONT= option takes precedence over the FTEXT= option in the GOPTIONS statement. For a list of fonts, see SAS/GRAPH Reference .
INHEIGHT= value
specifies the height, in percentage screen units, of text used inside the framed areas of the histogram. By default, the height specified by the HEIGHT= option is used. If you do not specify the HEIGHT= option, the height specified with the HTEXT= option in the GOPTIONS statement is used.
INTERTILE= value
specifies the distance, in horizontal percentage screen units, between the framed areas, which are called tiles . By default, INTERTILE=0.75 percentage screen units. This option is not available unless you use the CLASS statement. You can specify INTERTILE=0 to create contiguous tiles.
K=NORMAL | QUADRATIC | TRIANGULAR
specifies the kernel function (normal, quadratic, or triangular) used to compute a kernel density estimate. You can specify up to five values to request multiple estimates. You must enclose this option in parentheses after the KERNEL option. You can also use the K= kernel-option with the C= kernel-option , which specifies standardized bandwidths. If you specify more kernel functions than bandwidths, the procedure repeats the last bandwidth in the list for the remaining estimates. Likewise, if you specify more bandwidths than kernel functions, the procedure repeats the last kernel function for the remaining estimates. By default, K=NORMAL.
KERNEL < ( kernel-options ) >
superimposes up to five kernel density estimates on the histogram. By default, the procedure uses the AMISE method to compute kernel density estimates. To request multiple kernel density estimates on the same histogram, specify a list of values for either the C= kernel-option or K= kernel-option . Table 3.10 (page 215) lists options you can specify with the KERNEL option. See Example 3.23.
L= linetype
specifies the line type used for fitted density curves. Enclose the L= option in parentheses after the distribution option or the KERNEL option. If you use the L= option with the KERNEL option, you can specify a list of up to five line types for multiple kernel density estimates. See the entries for the C= and K= options for details on specifying multiple kernel density estimates. By default, L=1, which produces a solid line.
LGRID= linetype
specifies the line type for the grid when a grid displays on the histogram. By default, LGRID=1, which produces a solid line. This option also creates a grid.
LHREF= linetype
LH= linetype
specifies the line type for the reference lines that you request with the HREF= option. By default, LHREF=2, which produces a dashed line.
LOGNORMAL < ( lognormal-options ) >
displays a fitted lognormal density curve on the histogram. The LOGNORMAL option can occur only once in a HISTOGRAM statement. The parameter θ must be less than the minimum data value. Use the THETA= lognormal-option to specify θ. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for θ. Use the SIGMA= and ZETA= lognormal-options to specify σ and ζ. By default, the procedure computes maximum likelihood estimates for σ and ζ. Table 3.3 (page 214) and Table 3.7 (page 215) list options you can specify with the LOGNORMAL option. See Example 3.22 and Example 3.24.
LOWER= value-list
specifies lower bounds for kernel density estimates requested with the KERNEL option. Enclose the LOWER= option in parentheses after the KERNEL option. You can specify up to five lower bounds for multiple kernel density estimates. If you specify more kernel estimates than lower bounds, the last lower bound is repeated for the remaining estimates. The default is a missing value, indicating no lower bounds for fitted kernel density curves.
LVREF= linetype
LV= linetype
specifies the line type for lines requested with the VREF= option. By default, LVREF=2, which produces a dashed line.
MAXNBIN= n
specifies the maximum number of bins displayed in the comparative histogram. This option is useful when the scales or ranges of the data distributions differ greatly from cell to cell. By default, the bin size and midpoints are determined for the key cell, and then the midpoint list is extended to accommodate the data ranges for the remaining cells. However, if the cell scales differ considerably, the resulting number of bins may be so great that each cell histogram is scaled into a narrow region. By using MAXNBIN= to limit the number of bins, you can narrow the window about the data distribution in the key cell. This option is not available unless you specify the CLASS statement. The MAXNBIN= option is an alternative to the MAXSIGMAS= option.
MAXSIGMAS= value
limits the number of bins displayed in the comparative histogram to a range of value standard deviations (of the data in the key cell) above and below the mean of the data in the key cell. This option is useful when the scales or ranges of the data distributions differ greatly from cell to cell. By default, the bin size and midpoints are determined for the key cell, and then the midpoint list is extended to accommodate the data ranges for the remaining cells. However, if the cell scales differ considerably, the resulting number of bins may be so great that each cell histogram is scaled into a narrow region. By using MAXSIGMAS= to limit the number of bins, you can narrow the window that surrounds the data distribution in the key cell. This option is not available unless you specify the CLASS statement.
MIDPERCENTS
requests a table listing the midpoints and percentage of observations in each histogram interval. If you specify MIDPERCENTS in parentheses after a density estimate option, the procedure displays a table that lists the midpoints, the observed percentage of observations, and the estimated percentage of the population in each interval (estimated from the fitted distribution). See Example 3.18.
MIDPOINTS= values | KEY | UNIFORM
specifies how to determine the midpoints for the histogram intervals, where values determines the width of the histogram bars as the difference between consecutive midpoints. The procedure uses the same values for all variables.
The range of midpoints, extended at each end by half of the bar width, must cover the range of the data. For example, if you specify
midpoints=2 to 10 by 0.5
then all of the observations should fall between 1.75 and 10.25. You must use evenly spaced midpoints listed in increasing order.
KEY | determines the midpoints for the data in the key cell. The initial number of midpoints is based on the number of observations in the key cell using the method of Terrell and Scott (1985). The procedure extends the midpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells. |
UNIFORM | determines the midpoints by using all the observations as if there were no cells. In other words, the number of midpoints is based on the total sample size by using the method of Terrell and Scott (1985). |
Neither KEY nor UNIFORM applies unless you use the CLASS statement. By default, if you use a CLASS statement, MIDPOINTS=KEY; however, if the key cell is empty, then MIDPOINTS=UNIFORM. Otherwise, the procedure computes the midpoints by using an algorithm (Terrell and Scott 1985) that is primarily applicable to continuous data that are approximately normally distributed.
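The coverage rule for midpoints can be checked with a short sketch. The data set work.even and the variable x are hypothetical; the values lie between 2 and 10, which is inside the range 1.75 to 10.25 covered by the midpoint list.

```sas
/* Hypothetical data: 100 values spread evenly from 2 to 10 */
data work.even;
   do i = 1 to 100;
      x = 2 + 8*(i - 1)/99;
      output;
   end;
   drop i;
run;

proc univariate data=work.even noprint;
   var x;
   /* Bar width = 0.5 (difference between consecutive midpoints).  */
   /* Covered range = 2 - 0.25 to 10 + 0.25, that is 1.75 to 10.25 */
   histogram x / midpoints=2 to 10 by 0.5;
run;
```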
MU= value
specifies the parameter µ for normal density curves requested with the NORMAL option. Enclose the MU= option in parentheses after the NORMAL option. By default, the procedure uses the sample mean for µ .
NAME= ' string '
specifies a name for the plot, up to eight characters long, that appears in the PROC GREPLAY master menu. The default value is UNIVAR.
NCOLS= n
NCOL= n
specifies the number of columns in a comparative histogram. By default, NCOLS=1 if you specify only one class variable, and NCOLS=2 if you specify two class variables. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
NOBARS
suppresses drawing of histogram bars, which is useful for viewing fitted curves only.
NOFRAME
suppresses the frame around the subplot area.
NOHLABEL
suppresses the label for the horizontal axis. You can use this option to reduce clutter.
NOPLOT
NOCHART
suppresses the creation of a plot. Use this option when you only want to tabulate summary statistics for a fitted density or create an OUTHISTOGRAM= data set.
NOPRINT
suppresses tables summarizing the fitted curve. Enclose the NOPRINT option in parentheses following the distribution option.
NORMAL < ( normal-options ) >
displays a fitted normal density curve on the histogram. The NORMAL option can occur only once in a HISTOGRAM statement. Use the MU= and SIGMA= normal-options to specify µ and σ. By default, the procedure uses the sample mean and sample standard deviation for µ and σ. Table 3.3 (page 214) and Table 3.8 (page 215) list options you can specify with the NORMAL option. See Example 3.19.
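A minimal sketch of the two ways to use the NORMAL option follows. The data set work.plates, the variable gap, and the target values 10 and 0.5 are hypothetical.

```sas
/* Hypothetical data: 200 simulated measurements */
data work.plates;
   call streaminit(12345);
   do i = 1 to 200;
      gap = rand('normal', 10, 0.5);
      output;
   end;
   drop i;
run;

proc univariate data=work.plates;
   var gap;
   /* First histogram: curve based on the sample mean and the      */
   /* sample standard deviation (the defaults).                    */
   histogram gap / normal;
   /* Second histogram: curve with user-specified parameters.      */
   histogram gap / normal(mu=10 sigma=0.5);
run;
```

Because NORMAL can occur only once in a HISTOGRAM statement, the sketch uses two HISTOGRAM statements to compare the two curves.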
NOVLABEL
suppresses the label for the vertical axis. You can use this option to reduce clutter.
NOVTICK
suppresses the tick marks and tick mark labels for the vertical axis. This option also suppresses the label for the vertical axis.
NROWS= n
NROW= n
specifies the number of rows in a comparative histogram. By default, NROWS=2. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
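The layout options for comparative histograms might be combined as in the following sketch. The data set work.twomachines and the variables machine and gap are hypothetical.

```sas
/* Hypothetical data: 100 simulated measurements per machine */
data work.twomachines;
   call streaminit(7);
   do machine = 'A', 'B';
      do i = 1 to 100;
         gap = rand('normal', 10, 0.5);
         output;
      end;
   end;
   drop i;
run;

proc univariate data=work.twomachines noprint;
   class machine;
   var gap;
   /* One class variable with two levels: stack the two cells in   */
   /* two rows and one column, with a slightly wider gap between   */
   /* tiles than the default INTERTILE=0.75.                       */
   histogram gap / nrows=2 ncols=1 intertile=1;
run;
```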
OUTHISTOGRAM= SAS-data-set
OUTHIST= SAS-data-set
creates a SAS data set that contains information about histogram intervals. Specifically, the data set contains the midpoints of the histogram intervals, the observed percentage of observations in each interval, and the estimated percentage of observations in each interval (estimated from each of the specified fitted curves).
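As a sketch of saving interval information without drawing a plot, the following statements combine OUTHISTOGRAM= with NOPLOT. The data set names work.plates and work.bins and the variable gap are hypothetical.

```sas
/* Hypothetical data: 200 simulated measurements */
data work.plates;
   call streaminit(12345);
   do i = 1 to 200;
      gap = rand('normal', 10, 0.5);
      output;
   end;
   drop i;
run;

proc univariate data=work.plates noprint;
   var gap;
   /* NOPLOT suppresses the graph; the interval information,       */
   /* including percentages estimated from the normal fit, is      */
   /* written to work.bins.                                        */
   histogram gap / normal noplot outhistogram=work.bins;
run;

/* Inspect the saved intervals */
proc print data=work.bins;
run;
```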
PERCENTS= values
PERCENT= values
specifies a list of percents for which quantiles calculated from the data and quantiles estimated from the fitted curve are tabulated. The percents must be between 0 and 100. Enclose the PERCENTS= option in parentheses after the curve option. The default percents are 1, 5, 10, 25, 50, 75, 90, 95, and 99.
PFILL= pattern
specifies a pattern used to fill the bars of the histograms (or the areas under a fitted curve if you also specify the FILL option). See the entries for the CFILL= and FILL options for additional details. Refer to SAS/GRAPH Software: Reference for a list of pattern values. By default, the bars and curve areas are not filled.
RTINCLUDE
includes the right endpoint of each histogram interval in that interval. By default, the left endpoint is included in the histogram interval.
SCALE= value
is an alias for the SIGMA= option for curves requested by the BETA, EXPONENTIAL, GAMMA, and WEIBULL options and an alias for the ZETA= option for curves requested by the LOGNORMAL option.
SHAPE= value
is an alias for the ALPHA= option for curves requested with the GAMMA option, an alias for the SIGMA= option for curves requested with the LOGNORMAL option, and an alias for the C= option for curves requested with the WEIBULL option.
SIGMA= value | EST
specifies the parameter σ for the fitted density curve when you request the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, or WEIBULL option.
See Table 3.13 for a summary of how to use the SIGMA= option. You must enclose this option in parentheses after the density curve option. As a beta-option, you can specify SIGMA=EST to request a maximum likelihood estimate for σ.
Distribution Keyword | SIGMA= Specifies | Default Value | Alias |
---|---|---|---|
BETA | Scale parameter σ | 1 | SCALE= |
EXPONENTIAL | Scale parameter σ | Maximum likelihood estimate | SCALE= |
GAMMA | Scale parameter σ | Maximum likelihood estimate | SCALE= |
WEIBULL | Scale parameter σ | Maximum likelihood estimate | SCALE= |
LOGNORMAL | Shape parameter σ | Maximum likelihood estimate | SHAPE= |
NORMAL | Scale parameter σ | Standard deviation | |
THETA= value | EST
specifies the lower threshold parameter θ for curves requested with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, and WEIBULL options. Enclose the THETA= option in parentheses after the curve option. By default, THETA=0. If you specify THETA=EST, an estimate is computed for θ.
THRESHOLD= value
is an alias for the THETA= option. See the preceding entry for the THETA= option.
TURNVLABELS
TURNVLABEL
turns the characters in the vertical axis labels so that they display vertically. This happens by default when you use a hardware font.
UPPER= value-list
specifies upper bounds for kernel density estimates requested with the KERNEL option. Enclose the UPPER= option in parentheses after the KERNEL option. You can specify up to five upper bounds for multiple kernel density estimates. If you specify more kernel estimates than upper bounds, the last upper bound is repeated for the remaining estimates. The default is a missing value, indicating no upper bounds for fitted kernel density curves.
VAXIS= name | value-list
specifies the name of an AXIS statement describing the vertical axis. Alternatively, you can specify a value-list for the vertical axis.
VAXISLABEL= label
specifies a label for the vertical axis. Labels can have up to 40 characters.
VMINOR= n
VM= n
specifies the number of minor tick marks between each major tick mark on the vertical axis. Minor tick marks are not labeled. The default is zero.
VOFFSET= value
specifies the offset, in percentage screen units, at the upper end of the vertical axis.
VREF= value-list
draws reference lines perpendicular to the vertical axis at the values specified. Also see the CVREF=, LVREF=, and VREFCHAR= options. If a reference line is almost completely obscured, then use the FRONTREF option to draw the reference lines in front of the histogram bars.
VREFLABELS= 'label1' ... 'labeln'
VREFLABEL= 'label1' ... 'labeln'
VREFLAB= 'label1' ... 'labeln'
specifies labels for the lines requested by the VREF= option. The number of labels must equal the number of lines. Enclose each label in quotes. Labels can have up to 16 characters.
VREFLABPOS= n
specifies the horizontal position of VREFLABELS= labels. If you specify VREFLABPOS=1, the labels are positioned at the left of the histogram. If you specify VREFLABPOS=2, the labels are positioned at the right of the histogram. By default, VREFLABPOS=1.
VSCALE=COUNT | PERCENT | PROPORTION
specifies the scale of the vertical axis for a histogram. The value COUNT requests the data be scaled in units of the number of observations per data unit. The value PERCENT requests the data be scaled in units of percent of observations per data unit. The value PROPORTION requests the data be scaled in units of proportion of observations per data unit. The default is PERCENT.
W= n
specifies the width, in pixels, of the fitted density curve or the kernel density estimate curve. By default, W=1. You must enclose this option in parentheses after the density curve option or the KERNEL option. As a kernel-option , you can specify a list of up to five W= values.
WAXIS= n
specifies the line thickness, in pixels, for the axes and frame. By default, WAXIS=1.
WBARLINE= n
specifies the width of bar outlines. By default, WBARLINE=1.
WEIBULL < ( Weibull-options ) >
displays a fitted Weibull density curve on the histogram. The WEIBULL option can occur only once in a HISTOGRAM statement. The parameter θ must be less than the minimum data value. Use the THETA= Weibull-option to specify θ. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for θ. Use the C= and SIGMA= Weibull-options to specify the shape parameter c and the scale parameter σ. By default, the procedure computes the maximum likelihood estimates for c and σ. Table 3.3 (page 214) and Table 3.9 (page 215) list options you can specify with the WEIBULL option. See Example 3.22.
PROC UNIVARIATE calculates the maximum likelihood estimate of a iteratively by using the Newton-Raphson approximation. See also the C=, SIGMA=, and THETA= Weibull-options .
WGRID= n
specifies the line thickness for the grid.
ZETA= value
specifies a value for the scale parameter ζ for lognormal density curves requested with the LOGNORMAL option. Enclose the ZETA= lognormal-option in parentheses after the LOGNORMAL option. By default, the procedure calculates a maximum likelihood estimate for ζ. You can specify the SCALE= option as an alias for the ZETA= option.
ID variables ;
The ID statement specifies one or more variables to include in the table of extreme observations. The corresponding values of the ID variables appear beside the n largest and n smallest observations, where n is the value of the NEXTROBS= option. See Example 3.3.
INSET keyword-list < /options > ;
The INSET statement places a box or table of summary statistics, called an inset , directly in a high-resolution graph created with the HISTOGRAM, PROBPLOT, or QQPLOT statement.
The INSET statement must follow the HISTOGRAM, PROBPLOT, or QQPLOT statement that creates the plot that you want to augment. The inset appears in all the graphs that the preceding plot statement produces.
You can use multiple INSET statements after a plot statement to add multiple insets to a plot. See Example 3.17.
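As a minimal sketch of multiple insets on one plot, the following statements place sample statistics in one corner and goodness-of-fit statistics in another. The data set Score and the variable Final are borrowed from the example later in this section; the corner positions are illustrative choices, not defaults taken from the source.

```sas
/* Sketch: two insets on the same histogram.            */
/* Score and Final are assumed to exist in your session. */
proc univariate data=Score noprint;
   histogram Final / normal;
   /* first inset: sample statistics, upper left corner */
   inset n mean std / position=nw;
   /* second inset: fit statistics, upper right corner  */
   inset normal(ad adpval) / position=ne;
run;
```

Both INSET statements follow the HISTOGRAM statement, so both insets are attached to the histograms that statement produces.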
In an INSET statement, you specify one or more keywords that identify the information to display in the inset. The information is displayed in the order that you request the keywords . Keywords can be any of the following:
statistical keywords
primary keywords
secondary keywords
The available statistical keywords are:
CSS | Corrected sum of squares |
CV | Coefficient of variation |
KURTOSIS | Kurtosis |
MAX | Largest value |
MEAN | Sample mean |
MIN | Smallest value |
MODE | Most frequent value |
N | Sample size |
NMISS | Number of missing values |
NOBS | Number of observations |
RANGE | Range |
SKEWNESS | Skewness |
STD | Standard deviation |
STDMEAN | Standard error of the mean |
SUM | Sum of the observations |
SUMWGT | Sum of the weights |
USS | Uncorrected sum of squares |
VAR | Variance |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
Q1 | Lower quartile (25th percentile) |
MEDIAN | Median (50th percentile) |
Q3 | Upper quartile (75th percentile) |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
QRANGE | Interquartile range (Q3 - Q1) |
GINI | Gini's mean difference |
MAD | Median absolute difference about the median |
QN | Qn, alternative to MAD |
SN | Sn, alternative to MAD |
STD_GINI | Gini's standard deviation |
STD_MAD | MAD standard deviation |
STD_QN | Qn standard deviation |
STD_QRANGE | Interquartile range standard deviation |
STD_SN | Sn standard deviation |
MSIGN | Sign statistic |
NORMALTEST | Test statistic for normality |
PNORMAL | Probability value for the test of normality |
SIGNRANK | Signed rank statistic |
PROBM | Probability of greater absolute value for the sign statistic |
PROBN | Probability value for the test of normality |
PROBS | Probability value for the signed rank test |
PROBT | Probability value for the Student's t test |
T | Statistic for the Student's t test |
A primary keyword enables you to specify secondary keywords in parentheses immediately after the primary keyword. Primary keywords are BETA, EXPONENTIAL, GAMMA, LOGNORMAL, NORMAL, WEIBULL, WEIBULL2, KERNEL, and KERNEL n . If you specify a primary keyword but omit a secondary keyword , the inset displays a colored line and the distribution name as a key for the density curve.
By default, PROC UNIVARIATE identifies inset statistics with appropriate labels and prints numeric values using appropriate formats. To customize the label, specify the keyword followed by an equal sign (=) and the desired label in quotes. To customize the format, specify a numeric format in parentheses after the keyword . Labels can have up to 24 characters.
If you specify both a label and a format for a statistic, the label must appear before the format. For example,
inset n='Sample Size' std='Std Dev' (5.2);
requests customized labels for two statistics and displays the standard deviation with a field width of 5 and two decimal places.
The following tables list primary keywords :
Keyword | Distribution | Plot Statement Availability |
---|---|---|
BETA | Beta | All plot statements |
EXPONENTIAL | Exponential | All plot statements |
GAMMA | Gamma | All plot statements |
LOGNORMAL | Lognormal | All plot statements |
NORMAL | Normal | All plot statements |
WEIBULL | Weibull (3-parameter) | All plot statements |
WEIBULL2 | Weibull (2-parameter) | PROBPLOT and QQPLOT |
Keyword | Description |
---|---|
KERNEL | Displays statistics for all kernel estimates |
KERNEL n | Displays statistics for only the nth kernel density estimate (n = 1, 2, 3, 4, or 5) |
Table 3.20 through Table 3.28 list the secondary keywords available with primary keywords in Table 3.18 and Table 3.19.
Table 3.20 Secondary keywords available with the BETA keyword
Secondary Keyword | Alias | Description |
---|---|---|
ALPHA | SHAPE1 | First shape parameter α |
BETA | SHAPE2 | Second shape parameter β |
SIGMA | SCALE | Scale parameter σ |
THETA | THRESHOLD | Lower threshold parameter θ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.21 Secondary keywords available with the EXPONENTIAL keyword
Secondary Keyword | Alias | Description |
---|---|---|
SIGMA | SCALE | Scale parameter σ |
THETA | THRESHOLD | Threshold parameter θ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.22 Secondary keywords available with the GAMMA keyword
Secondary Keyword | Alias | Description |
---|---|---|
ALPHA | SHAPE | Shape parameter α |
SIGMA | SCALE | Scale parameter σ |
THETA | THRESHOLD | Threshold parameter θ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.23 Secondary keywords available with the LOGNORMAL keyword
Secondary Keyword | Alias | Description |
---|---|---|
SIGMA | SHAPE | Shape parameter σ |
THETA | THRESHOLD | Threshold parameter θ |
ZETA | SCALE | Scale parameter ζ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.24 Secondary keywords available with the NORMAL keyword
Secondary Keyword | Alias | Description |
---|---|---|
MU | MEAN | Mean parameter µ |
SIGMA | STD | Scale parameter σ |
Table 3.25 Secondary keywords available with the WEIBULL keyword
Secondary Keyword | Alias | Description |
---|---|---|
C | SHAPE | Shape parameter c |
SIGMA | SCALE | Scale parameter σ |
THETA | THRESHOLD | Threshold parameter θ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.26 Secondary keywords available with the WEIBULL2 keyword
Secondary Keyword | Alias | Description |
---|---|---|
C | SHAPE | Shape parameter c |
SIGMA | SCALE | Scale parameter σ |
THETA | THRESHOLD | Known lower threshold θ |
MEAN | | Mean of the fitted distribution |
STD | | Standard deviation of the fitted distribution |
Table 3.27 Secondary keywords available with the KERNEL keyword
Secondary Keyword | Description |
---|---|
TYPE | Kernel type: normal, quadratic, or triangular |
BANDWIDTH | Bandwidth λ for the density estimate |
BWIDTH | Alias for BANDWIDTH |
C | Standardized bandwidth c for the density estimate: c = (λ / Q) n^(1/5), where n = sample size, λ = bandwidth, and Q = interquartile range |
AMISE | Approximate mean integrated square error (MISE) for the kernel density |
Table 3.28 Goodness-of-fit statistics for fitted curves
Secondary Keyword | Description |
---|---|
AD | Anderson-Darling EDF test statistic |
ADPVAL | Anderson-Darling EDF test p-value |
CVM | Cramér-von Mises EDF test statistic |
CVMPVAL | Cramér-von Mises EDF test p-value |
KSD | Kolmogorov-Smirnov EDF test statistic |
KSDPVAL | Kolmogorov-Smirnov EDF test p-value |
The inset statistics listed in Table 3.18 through Table 3.28 are not available unless you request a plot statement and options that calculate these statistics. For example,
proc univariate data=score;
   histogram final / normal;
   inset mean std normal(ad adpval);
run;
The MEAN and STD keywords display the sample mean and standard deviation of FINAL. The NORMAL keyword with the secondary keywords AD and ADPVAL displays the Anderson-Darling goodness-of-fit test statistic and p-value. The statistics that are specified with the NORMAL keyword are available only because the NORMAL option is requested in the HISTOGRAM statement.
The KERNEL or KERNEL n keyword is available only if you request a kernel density estimate in a HISTOGRAM statement. The WEIBULL2 keyword is available only if you request a two-parameter Weibull distribution in the PROBPLOT or QQPLOT statement.
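The same dependency can be sketched for kernel estimates. In the following statements the KERNEL option requests two kernel density estimates, which is what makes the KERNEL1 and KERNEL2 keywords available in the inset. The data set Score, the variable Final, and the two C= values are assumptions chosen only for illustration.

```sas
/* Sketch: inset statistics for individual kernel curves. */
/* The C= values 0.5 and 1.0 are arbitrary examples.      */
proc univariate data=Score noprint;
   histogram Final / kernel(c=0.5 1.0);
   /* standardized bandwidth and AMISE for each curve */
   inset kernel1(c amise) kernel2(c amise) / position=ne;
run;
```

Specifying KERNEL(C AMISE) instead of the numbered keywords would request the same statistics for all the kernel estimates at once.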
The following table lists INSET statement options , which are specified after the slash (/) in the INSET statement. For complete descriptions, see the section Dictionary of Options on page 235.
CFILL= color BLANK | Specifies color of inset background |
CFILLH= color | Specifies color of header background |
CFRAME= color | Specifies color of frame |
CHEADER= color | Specifies color of header text |
CSHADOW= color | Specifies color of drop shadow |
CTEXT= color | Specifies color of inset text |
DATA | Specifies data units for POSITION=( x, y ) coordinates |
DATA= SAS-data-set | Specifies data set for statistics in the inset table |
FONT= font | Specifies font of text |
FORMAT= format | Specifies format of values in inset |
HEADER= ' quoted string ' | Specifies header text |
HEIGHT= value | Specifies height of inset text |
NOFRAME | Suppresses frame around inset |
POSITION= position | Specifies position of inset |
REFPOINT=BR BL TR TL | Specifies reference point of inset positioned with POSITION=( x, y ) coordinates |
The following entries provide detailed descriptions of options for the INSET statement.
To specify the same format for all the statistics in the INSET statement, use the FORMAT= option.
To create a completely customized inset, use a DATA= data set. The data set contains the label and the value that you want to display in the inset.
If you specify multiple kernel density estimates, you can request inset statistics for all the estimates with the KERNEL keyword . Alternatively, you can display inset statistics for individual curves with the KERNEL n keyword , where n is the curve number between 1 and 5.
CFILL= color | BLANK
specifies the color of the background. If you omit the CFILLH= option, the header background is included. By default, the background is empty, which causes items that overlap the inset (such as curves or histogram bars) to show through the inset.
If you specify a value for the CFILL= option, then overlapping items no longer show through the inset. Use CFILL=BLANK to leave the background uncolored and to prevent items from showing through the inset.
CFILLH= color
specifies the color of the header background. The default value is the CFILL= color.
CFRAME= color
specifies the color of the frame. The default value is the same color as the axis of the plot.
CHEADER= color
specifies the color of the header text. The default value is the CTEXT= color.
CSHADOW= color
specifies the color of the drop shadow. By default, if a CSHADOW= option is not specified, a drop shadow is not displayed.
CTEXT= color
specifies the color of the text. The default value is the same color as the other text on the plot.
DATA
specifies that data coordinates are to be used in positioning the inset with the POSITION= option. The DATA option is available only when you specify POSITION=(x,y). You must place DATA immediately after the coordinates (x,y).
DATA= SAS-data-set
requests that PROC UNIVARIATE display customized statistics from a SAS data set in the inset table. The data set must contain two variables:
_LABEL_ | a character variable whose values provide labels for inset entries. |
_VALUE_ | a variable that is either character or numeric and whose values provide values for inset entries. |
The label and value from each observation in the data set occupy one line in the inset. The position of the DATA= keyword in the keyword list determines the position of its lines in the inset.
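A minimal sketch of a customized inset follows. The data set name InsetInfo and its two observations are hypothetical and exist only to show the required _LABEL_ and _VALUE_ variables; Score and Final are assumed as in the earlier example.

```sas
/* Hypothetical data set holding two custom inset lines. */
data InsetInfo;
   length _LABEL_ $ 16 _VALUE_ $ 12;
   _LABEL_ = 'Course';   _VALUE_ = 'Stat 101'; output;
   _LABEL_ = 'Semester'; _VALUE_ = 'Fall';     output;
run;

proc univariate data=Score noprint;
   histogram Final;
   /* custom lines appear first because DATA= is listed first */
   inset data=InsetInfo n mean / position=ne;
run;
```

Moving DATA=InsetInfo after the N and MEAN keywords would move the custom lines below the computed statistics.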
FONT= font
specifies the font of the text. By default, if you locate the inset in the interior of the plot then the font is SIMPLEX. If you locate the inset in the exterior of the plot then the font is the same as the other text on the plot.
FORMAT= format
specifies a format for all the values in the inset. If you specify a format for a particular statistic, then this format overrides FORMAT= format. For more information about SAS formats, see SAS Language Reference: Dictionary.
HEADER= string
specifies the header text. The string cannot exceed 40 characters. By default, no header line appears in the inset. If all the keywords that you list in the INSET statement are secondary keywords that correspond to a fitted curve on a histogram, PROC UNIVARIATE displays a default header that indicates the distribution and identifies the curve.
HEIGHT= value
specifies the height of the text.
NOFRAME
suppresses the frame drawn around the text.
POSITION= position
POS= position
determines the position of the inset. The position is a compass point keyword, a margin keyword, or a pair of coordinates (x,y). You can specify coordinates in axis percent units or axis data units. The default value is NW, which positions the inset in the upper left (northwest) corner of the display. See the section Positioning the Inset on page 285.
REFPOINT=BR | BL | TR | TL
specifies the reference point for an inset that PROC UNIVARIATE positions by a pair of coordinates with the POSITION= option. The REFPOINT= option specifies which corner of the inset frame that you want to position at coordinates (x,y). The keywords are BL, BR, TL, and TR, which correspond to bottom left, bottom right, top left, and top right. The default value is BL. You must use REFPOINT= with POSITION=(x,y) coordinates.
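The interaction between POSITION=(x,y) and REFPOINT= can be sketched as follows. The coordinates (95,95) are an arbitrary illustration in axis percent units, and Score and Final are assumed as in the earlier example.

```sas
/* Sketch: anchor the top right corner of the inset      */
/* at 95% of the horizontal axis and 95% of the vertical. */
proc univariate data=Score noprint;
   histogram Final;
   inset n mean std / position=(95,95) refpoint=tr
                      header='Summary';
run;
```

With the default REFPOINT=BL, the same coordinates would instead locate the bottom left corner of the inset, so the frame would extend up and to the right of that point.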
OUTPUT < OUT= SAS-data-set >
< keyword1=names ... keywordk=names > < percentile-options > ;
The OUTPUT statement saves statistics and BY variables in an output data set. When you use a BY statement, each observation in the OUT= data set corresponds to one of the BY groups. Otherwise, the OUT= data set contains only one observation.
You can use any number of OUTPUT statements in the UNIVARIATE procedure. Each OUTPUT statement creates a new data set containing the statistics specified in that statement. You must use the VAR statement with the OUTPUT statement. The OUTPUT statement must contain a specification of the form keyword=names or the PCTLPTS= and PCTLPRE= specifications. See Example 3.7 and Example 3.8.
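As a sketch of how BY groups shape the output data set, the following statements save one observation per BY group. The BY variable Section and the output names are hypothetical; the data are sorted first because BY-group processing expects sorted or indexed input.

```sas
/* Section is a hypothetical grouping variable in Score. */
proc sort data=Score out=ScoreSorted;
   by Section;
run;

proc univariate data=ScoreSorted noprint;
   by Section;
   var Final;
   /* one observation per value of Section */
   output out=SectionStats n=NFinal median=MedFinal;
run;
```

The data set SectionStats then contains the variables Section, NFinal, and MedFinal.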
OUT= SAS-data-set
identifies the output data set. If SAS-data-set does not exist, PROC UNIVARIATE creates it. If you omit OUT=, the data set is named DATAn, where n is the smallest integer that makes the name unique.
keyword = name
specifies the statistics to include in the output data set and gives names to the new variables that contain the statistics. Specify a keyword for each desired statistic, an equal sign, and the names of the variables to contain the statistic. In the output data set, the first variable listed after a keyword in the OUTPUT statement contains the statistic for the first variable listed in the VAR statement; the second variable contains the statistic for the second variable in the VAR statement, and so on. If the list of names following the equal sign is shorter than the list of variables in the VAR statement, the procedure uses the names in the order in which the variables are listed in the VAR statement. The available keywords are listed in the following tables:
CSS | Corrected sum of squares |
CV | Coefficient of variation |
KURTOSIS | Kurtosis |
MAX | Largest value |
MEAN | Sample mean |
MIN | Smallest value |
MODE | Most frequent value |
N | Sample size |
NMISS | Number of missing values |
NOBS | Number of observations |
RANGE | Range |
SKEWNESS | Skewness |
STD | Standard deviation |
STDMEAN | Standard error of the mean |
SUM | Sum of the observations |
SUMWGT | Sum of the weights |
USS | Uncorrected sum of squares |
VAR | Variance |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
Q1 | Lower quartile (25th percentile) |
MEDIAN | Median (50th percentile) |
Q3 | Upper quartile (75th percentile) |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
QRANGE | Interquartile range (Q3 - Q1) |
GINI | Gini's mean difference |
MAD | Median absolute difference about the median |
QN | Qn, alternative to MAD |
SN | Sn, alternative to MAD |
STD_GINI | Gini's standard deviation |
STD_MAD | MAD standard deviation |
STD_QN | Qn standard deviation |
STD_QRANGE | Interquartile range standard deviation |
STD_SN | Sn standard deviation |
MSIGN | Sign statistic |
NORMALTEST | Test statistic for normality |
SIGNRANK | Signed rank statistic |
PROBM | Probability of a greater absolute value for the sign statistic |
PROBN | Probability value for the test of normality |
PROBS | Probability value for the signed rank test |
PROBT | Probability value for the Student's t test |
T | Statistic for the Student's t test |
To store the same statistic for several analysis variables, specify a list of names . The order of the names corresponds to the order of the analysis variables in the VAR statement. PROC UNIVARIATE uses the first name to create a variable that contains the statistic for the first analysis variable, the next name to create a variable that contains the statistic for the second analysis variable, and so on. If you do not want to output statistics for all the analysis variables, specify fewer names than the number of analysis variables.
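The name-matching rule can be sketched with two analysis variables. The output data set name Stats and the output variable names are hypothetical; PreTest and PostTest are assumed to be numeric variables in Score.

```sas
/* Sketch: names are matched to VAR variables by position. */
proc univariate data=Score noprint;
   var PreTest PostTest;
   output out=Stats mean=PreMean PostMean  /* both variables */
                    std=PreStd;            /* PreTest only   */
run;
```

Because STD= lists only one name, the standard deviation is saved for PreTest but not for PostTest.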
The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles for the data. These can be saved in an output data set using keyword=names specifications. For additional percentiles, you can use the following percentile-options :
PCTLPTS= percentiles
specifies one or more percentiles that are not automatically computed by the UNIVARIATE procedure. The PCTLPRE= and PCTLPTS= options must be used together. You can specify percentiles with the expression start TO stop BY increment where start is a starting number, stop is an ending number, and increment is a number to increment by. The PCTLPTS= option generates additional percentiles and outputs them to a data set; these additional percentiles are not printed.
To compute the 50th, 95th, 97.5th, and 100th percentiles, submit the statement
output pctlpre=P_ pctlpts=50,95 to 100 by 2.5;
You can use PCTLPTS= to output percentiles that are not in the list of quantile statistics. PROC UNIVARIATE computes the requested percentiles based on the method that you specify with the PCTLDEF= option in the PROC UNIVARIATE statement. You must use PCTLPRE=, and optionally PCTLNAME=, to specify variable names for the percentiles. For example, the following statements create an output data set that is named Pctls that contains the 20th and 40th percentiles of the analysis variables PreTest and PostTest :
proc univariate data=Score;
   var PreTest PostTest;
   output out=Pctls pctlpts=20 40 pctlpre=PreTest_ PostTest_
          pctlname=P20 P40;
run;
PROC UNIVARIATE saves the 20th and 40th percentiles for PreTest and PostTest in the variables PreTest_P20, PostTest_P20, PreTest_P40, and PostTest_P40.
PCTLPRE= prefixes
specifies one or more prefixes to create the variable names for the variables that contain the PCTLPTS= percentiles. To save the same percentiles for more than one analysis variable, specify a list of prefixes. The order of the prefixes corresponds to the order of the analysis variables in the VAR statement. The PCTLPRE= and PCTLPTS= options must be used together.
The procedure generates new variable names using the prefix and the percentile values. If the specified percentile is an integer, the variable name is simply the prefix followed by the value. If the specified value is not an integer, an underscore replaces the decimal point in the variable name, and decimal values are truncated to one decimal place. For example, the following statements create the variables PWID20, PWID33_3, PWID66_6, and PWID80 for the 20th, 33.33rd, 66.67th, and 80th percentiles of Width, respectively:
proc univariate noprint;
   var Width;
   output pctlpts=20 33.33 66.67 80 pctlpre=pwid;
run;
If you request percentiles for more than one variable, you should list prefixes in the same order in which the variables appear in the VAR statement. If combining the prefix and percentile value results in a name longer than 32 characters, the prefix is truncated so that the variable name is 32 characters.
PCTLNAME= suffixes
specifies one or more suffixes to create the names for the variables that contain the PCTLPTS= percentiles. PROC UNIVARIATE creates a variable name by combining the PCTLPRE= value and suffix-name. Because the suffix names are associated with the percentiles that are requested, list the suffix names in the same order as the PCTLPTS= percentiles. If you specify n suffixes with the PCTLNAME= option and m percentile values with the PCTLPTS= option, where m > n, the suffixes are used to name the first n percentiles, and the default names are used for the remaining m - n percentiles. For example, consider the following statements:
proc univariate;
   var Length Width Height;
   output pctlpts = 20 40 pctlpre = pl pw ph pctlname = twenty;
run;
The value TWENTY in the PCTLNAME= option is used for only the first percentile in the PCTLPTS= list. This suffix is appended to the values in the PCTLPRE= option to generate the new variable names PLTWENTY, PWTWENTY, and PHTWENTY, which contain the 20th percentiles for Length , Width , and Height , respectively. Since a second PCTLNAME= suffix is not specified, variable names for the 40th percentiles for Length , Width , and Height are generated using the prefixes and percentile values. Thus, the output data set contains the variables PLTWENTY, PL40, PWTWENTY, PW40, PHTWENTY, and PH40.
You must specify PCTLPRE= to supply prefix names for the variables that contain the PCTLPTS= percentiles.
If the number of PCTLNAME= values is fewer than the number of percentiles, or if you omit PCTLNAME=, PROC UNIVARIATE uses the percentile as the suffix to create the name of the variable that contains the percentile. For an integer percentile, PROC UNIVARIATE uses the percentile. Otherwise, PROC UNIVARIATE truncates decimal values of percentiles to two decimal places and replaces the decimal point with an underscore.
If either the prefix and suffix name combination or the prefix and percentile name combination is longer than 32 characters, PROC UNIVARIATE truncates the prefix name so that the variable name is 32 characters.
PROBPLOT < variables > < /options > ;
The PROBPLOT statement creates a probability plot, which compares ordered variable values with the percentiles of a specified theoretical distribution. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Consequently, you can use a probability plot to determine how well a theoretical distribution models a set of measurements.
Probability plots are similar to Q-Q plots, which you can create with the QQPLOT statement. Probability plots are preferable for graphical estimation of percentiles, whereas Q-Q plots are preferable for graphical estimation of distribution parameters.
You can use any number of PROBPLOT statements in the UNIVARIATE procedure. The components of the PROBPLOT statement are described as follows.
variables
are the variables for which to create probability plots. If you specify a VAR statement, the variables must also be listed in the VAR statement. Otherwise, the variables can be any numeric variables in the input data set. If you do not specify a list of variables , then by default the procedure creates a probability plot for each variable listed in the VAR statement, or for each numeric variable in the DATA= data set if you do not specify a VAR statement. For example, each of the following PROBPLOT statements produces two probability plots, one for Length and one for Width :
proc univariate data=Measures;
   var Length Width;
   probplot;
run;
proc univariate data=Measures;
   probplot Length Width;
run;
options
specify the theoretical distribution for the plot or add features to the plot. If you specify more than one variable, the options apply equally to each variable. Specify all options after the slash (/) in the PROBPLOT statement. You can specify only one option naming a distribution in each PROBPLOT statement, but you can specify any number of other options . The distributions available are the beta, exponential, gamma, lognormal, normal, two-parameter Weibull, and three-parameter Weibull. By default, the procedure produces a plot for the normal distribution.
In the following example, the NORMAL option requests a normal probability plot for each variable, while the MU= and SIGMA= normal-options request a distribution reference line corresponding to the normal distribution with µ = 10 and σ = 0.3. The SQUARE option displays the plot in a square frame, and the CTEXT= option specifies the text color.
proc univariate data=Measures;
   probplot Length1 Length2 / normal(mu=10 sigma=0.3) square ctext=blue;
run;
Table 3.34 through Table 3.43 list the PROBPLOT options by function. For complete descriptions, see the section Dictionary of Options on page 245. Options can be any of the following:
primary options
secondary options
general options
Table 3.34 Primary options for theoretical distributions
Option | Description |
---|---|
BETA( beta-options ) | Specifies beta probability plot for shape parameters α and β specified with mandatory ALPHA= and BETA= beta-options |
EXPONENTIAL( exponential-options ) | Specifies exponential probability plot |
GAMMA( gamma-options ) | Specifies gamma probability plot for shape parameter α specified with mandatory ALPHA= gamma-option |
LOGNORMAL( lognormal-options ) | Specifies lognormal probability plot for shape parameter σ specified with mandatory SIGMA= lognormal-option |
NORMAL( normal-options ) | Specifies normal probability plot |
WEIBULL( Weibull-options ) | Specifies three-parameter Weibull probability plot for shape parameter c specified with mandatory C= Weibull-option |
WEIBULL2( Weibull2-options ) | Specifies two-parameter Weibull probability plot |
Table 3.35 Secondary reference line options used with all distributions
Option | Description |
---|---|
COLOR= color | Specifies color of distribution reference line |
L= linetype | Specifies line type of distribution reference line |
W= n | Specifies width of distribution reference line |
Table 3.36 Secondary beta-options
Option | Description |
---|---|
ALPHA= value-list EST | Specifies mandatory shape parameter α |
BETA= value-list EST | Specifies mandatory shape parameter β |
SIGMA= value EST | Specifies σ for distribution reference line |
THETA= value EST | Specifies θ for distribution reference line |
Table 3.37 Secondary exponential-options
Option | Description |
---|---|
SIGMA= value EST | Specifies σ for distribution reference line |
THETA= value EST | Specifies θ for distribution reference line |
Table 3.38 Secondary gamma-options
Option | Description |
---|---|
ALPHA= value-list EST | Specifies mandatory shape parameter α |
SIGMA= value EST | Specifies σ for distribution reference line |
THETA= value EST | Specifies θ for distribution reference line |
Table 3.39 Secondary lognormal-options
Option | Description |
---|---|
SIGMA= value | Specifies mandatory shape parameter σ |
SLOPE= value EST | Specifies slope of distribution reference line |
THETA= value EST | Specifies θ for distribution reference line |
ZETA= value | Specifies ζ for distribution reference line (slope is exp(ζ)) |
Table 3.40 Secondary normal-options
Option | Description |
---|---|
MU= value EST | Specifies µ for distribution reference line |
SIGMA= value EST | Specifies σ for distribution reference line |
Table 3.41 Secondary Weibull-options
Option | Description |
---|---|
C= value-list EST | Specifies mandatory shape parameter c |
SIGMA= value EST | Specifies σ for distribution reference line |
THETA= value EST | Specifies θ for distribution reference line |
Table 3.42 Secondary Weibull2-options
Option | Description |
---|---|
C= value EST | Specifies c for distribution reference line (slope is 1/c) |
SIGMA= value EST | Specifies σ for distribution reference line (intercept is log(σ)) |
SLOPE= value EST | Specifies slope of distribution reference line |
THETA= value | Specifies known lower threshold θ |
Table 3.43 General options
Option | Description |
---|---|
ANNOKEY | Applies annotation requested in ANNOTATE= data set to key cell only |
ANNOTATE= | Specifies annotate data set |
CAXIS= | Specifies color for axis |
CFRAME= | Specifies color for frame |
CFRAMESIDE= | Specifies color for filling frame for row labels |
CFRAMETOP= | Specifies color for filling frame for column labels |
CGRID= | Specifies color for grid lines |
CHREF= | Specifies color for HREF= lines |
CTEXT= | Specifies color for text |
CVREF= | Specifies color for VREF= lines |
DESCRIPTION= | Specifies description for plot in graphics catalog |
FONT= | Specifies software font for text |
GRID | Creates a grid |
HEIGHT= | Specifies height of text used outside framed areas |
HMINOR= | Specifies number of horizontal minor tick marks |
HREF= | Specifies reference lines perpendicular to the horizontal axis |
HREFLABELS= | Specifies labels for HREF= lines |
INFONT= | Specifies software font for text inside framed areas |
INHEIGHT= | Specifies height of text inside framed areas |
INTERTILE= | Specifies distance between tiles |
LGRID= | Specifies a line type for grid lines |
LHREF= | Specifies line style for HREF= lines |
LVREF= | Specifies line style for VREF= lines |
NADJ= | Adjusts sample size when computing percentiles |
NAME= | Specifies name for plot in graphics catalog |
NCOLS= | Specifies number of columns in comparative probability plot |
NOFRAME | Suppresses frame around plotting area |
NOHLABEL | Suppresses label for horizontal axis |
NOVLABEL | Suppresses label for vertical axis |
NOVTICK | Suppresses tick marks and tick mark labels for vertical axis |
NROWS= | Specifies number of rows in comparative probability plot |
PCTLMINOR | Requests minor tick marks for percentile axis |
PCTLORDER= | Specifies tick mark labels for percentile axis |
RANKADJ= | Adjusts ranks when computing percentiles |
SQUARE | Displays plot in square format |
VAXISLABEL= | Specifies label for vertical axis |
VMINOR= | Specifies number of vertical minor tick marks |
VREF= | Specifies reference lines perpendicular to the vertical axis |
VREFLABELS= | Specifies labels for VREF= lines |
VREFLABPOS= | Specifies horizontal position of labels for VREF= lines |
WAXIS= | Specifies line thickness for axes and frame |
Table 3.34 lists options for requesting a theoretical distribution.
Table 3.35 through Table 3.42 list secondary options that specify distribution parameters and control the display of a distribution reference line. Specify these options in parentheses after the distribution keyword. For example, you can request a normal probability plot with a distribution reference line by specifying the NORMAL option as follows:
proc univariate;
   probplot Length / normal(mu=10 sigma=0.3 color=red);
run;
The MU= and SIGMA= normal-options display a distribution reference line that corresponds to the normal distribution with mean µ = 10 and standard deviation σ = 0.3, and the COLOR= normal-option specifies the color for the line.
Table 3.43 summarizes general options for enhancing probability plots.
The following entries provide detailed descriptions of options in the PROBPLOT statement.
ALPHA= value | EST
specifies the mandatory shape parameter α for probability plots requested with the BETA and GAMMA options. Enclose the ALPHA= option in parentheses after the BETA or GAMMA options. If you specify ALPHA=EST, a maximum likelihood estimate is computed for α.
ANNOKEY
applies the annotation requested with the ANNOTATE= option to the key cell only. By default, the procedure applies annotation to all of the cells. This option is not available unless you use the CLASS statement. Specify the KEYLEVEL= option in the CLASS statement to specify the key cell.
ANNOTATE= SAS-data-set
ANNO= SAS-data-set
specifies an input data set containing annotate variables as described in SAS/GRAPH Software: Reference. The ANNOTATE= data set you specify in the PROBPLOT statement is used for all plots created by the statement. You can also specify an ANNOTATE= data set in the PROC UNIVARIATE statement to enhance all plots created by the procedure.
BETA(ALPHA= value | EST BETA= value | EST < beta-options > )
creates a beta probability plot for each combination of the required shape parameters α and β specified by the required ALPHA= and BETA= beta-options. If you specify ALPHA=EST and BETA=EST, the procedure creates a plot based on maximum likelihood estimates for α and β. You can specify the SCALE= beta-option as an alias for the SIGMA= beta-option and the THRESHOLD= beta-option as an alias for the THETA= beta-option.
To obtain graphical estimates of α and β, specify lists of values in the ALPHA= and BETA= beta-options, and select the combination of α and β that most nearly linearizes the point pattern. To assess the point pattern, you can add a diagonal distribution reference line corresponding to the lower threshold parameter θ and the scale parameter σ with the THETA= and SIGMA= beta-options. Alternatively, you can add a line that corresponds to estimated values of θ and σ with the beta-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the beta distribution with parameters α, β, θ, and σ is a good fit.
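As a minimal sketch of the graphical estimation described above, the following step requests one beta probability plot for each combination of the listed shape values, each with a reference line for θ = 0 and σ = 1, and then a plot based on maximum likelihood estimates. The data set Ratios and the variable Yield are hypothetical names used only for illustration; the variable is assumed to take values strictly between 0 and 1.

```sas
/* Hypothetical data: Ratios with a variable Yield in (0, 1) */
proc univariate data=Ratios;
   /* 2 x 2 = 4 plots, one per (alpha, beta) combination */
   probplot Yield / beta(alpha=2 3 beta=2 3 theta=0 sigma=1);
   /* plot based on maximum likelihood estimates of alpha and beta */
   probplot Yield / beta(alpha=est beta=est);
run;
```

The combination whose point pattern is closest to a straight line is the graphical estimate.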
BETA= value | EST
B= value | EST
specifies the mandatory shape parameter β for probability plots requested with the BETA option. Enclose the BETA= option in parentheses after the BETA option. If you specify BETA=EST, a maximum likelihood estimate is computed for β.
C= value | EST
specifies the shape parameter c for probability plots requested with the WEIBULL and WEIBULL2 options. Enclose this option in parentheses after the WEIBULL or WEIBULL2 option. C= is a required Weibull-option with the WEIBULL option; in this situation, it accepts a list of values, or if you specify C=EST, a maximum likelihood estimate is computed for c. You can optionally specify C= value or C=EST as a Weibull2-option with the WEIBULL2 option to request a distribution reference line; in this situation, you must also specify the Weibull2-option SIGMA= value or SIGMA=EST.
CAXIS= color
CAXES= color
specifies the color for the axes. This option overrides any COLOR= specifications in an AXIS statement. The default value is the first color in the device color list.
CFRAME= color
specifies the color for the area that is enclosed by the axes and frame. The area is not filled by default.
CFRAMESIDE= color
specifies the color to fill the frame area for the row labels that display along the left side of a comparative probability plot. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option is not available unless you use the CLASS statement.
CFRAMETOP= color
specifies the color to fill the frame area for the column labels that display across the top of a comparative probability plot. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option does not apply unless you use the CLASS statement.
CGRID= color
specifies the color for grid lines when a grid displays on the plot. The default color is the first color in the device color list. This option also produces a grid.
CHREF= color
CH= color
specifies the color for horizontal axis reference lines requested by the HREF= option. The default color is the first color in the device color list.
COLOR= color
specifies the color of the diagonal distribution reference line. The default color is the first color in the device color list. Enclose the COLOR= option in parentheses after a distribution option keyword.
CTEXT= color
specifies the color for tick mark values and axis labels. The default color is the color that you specify for the CTEXT= option in the GOPTIONS statement. If you omit the GOPTIONS statement, the default is the first color in the device color list.
CVREF= color
CV= color
specifies the color for the reference lines requested by the VREF= option. The default color is the first color in the device color list.
DESCRIPTION= ' string '
DES= ' string '
specifies a description, up to 40 characters long, that appears in the PROC GREPLAY master menu. The default string is the variable name.
EXPONENTIAL < ( exponential-options ) >
EXP < ( exponential-options ) >
creates an exponential probability plot. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= exponential-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter σ with the exponential-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the exponential distribution with parameters θ and σ is a good fit. You can specify the SCALE= exponential-option as an alias for the SIGMA= exponential-option and the THRESHOLD= exponential-option as an alias for the THETA= exponential-option.
FONT= font
specifies a software font for the reference lines and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the GOPTIONS statement. Hardware characters are used by default.
GAMMA(ALPHA= value | EST < gamma-options > )
creates a gamma probability plot for each value of the shape parameter α given by the mandatory ALPHA= gamma-option. If you specify ALPHA=EST, the procedure creates a plot based on a maximum likelihood estimate for α. To obtain a graphical estimate of α, specify a list of values for the ALPHA= gamma-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= gamma-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter σ with the gamma-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the gamma distribution with parameters α, θ, and σ is a good fit. You can specify the SCALE= gamma-option as an alias for the SIGMA= gamma-option and the THRESHOLD= gamma-option as an alias for the THETA= gamma-option.
GRID
displays a grid. Grid lines are reference lines that are perpendicular to the percentile axis at major tick marks.
HEIGHT= value
specifies the height, in percentage screen units, of text for axis labels, tick mark labels, and legends. This option takes precedence over the HTEXT= option in the GOPTIONS statement.
HMINOR= n
HM= n
specifies the number of minor tick marks between each major tick mark on the horizontal axis. Minor tick marks are not labeled. By default, HMINOR=0.
HREF= values
draws reference lines that are perpendicular to the horizontal axis at the values you specify.
HREFLABELS= 'label1' ... 'labeln'
HREFLABEL= 'label1' ... 'labeln'
HREFLAB= 'label1' ... 'labeln'
specifies labels for the reference lines requested by the HREF= option. The number of labels must equal the number of reference lines. Labels can have up to 16 characters.
HREFLABPOS= n
specifies the vertical position of HREFLABELS= labels. If you specify HREFLABPOS=1, the labels are positioned along the top of the plot. If you specify HREFLABPOS=2, the labels are staggered from top to bottom of the plot. If you specify HREFLABPOS=3, the labels are positioned along the bottom of the plot. By default, HREFLABPOS=1.
INFONT= font
specifies a software font to use for text inside the framed areas of the plot. The INFONT= option takes precedence over the FTEXT= option in the GOPTIONS statement. For a list of fonts, see SAS/GRAPH Reference .
INHEIGHT= value
specifies the height, in percentage screen units, of text used inside the framed areas of the plot. By default, the height specified by the HEIGHT= option is used. If you do not specify the HEIGHT= option, the height specified with the HTEXT= option in the GOPTIONS statement is used.
INTERTILE= value
specifies the distance, in horizontal percentage screen units, between the framed areas, which are called tiles . By default, the tiles are contiguous. This option is not available unless you use the CLASS statement.
L= linetype
specifies the line type for a diagonal distribution reference line. Enclose the L= option in parentheses after a distribution option. By default, L=1, which produces a solid line.
LGRID= linetype
specifies the line type for the grid requested by the GRID option. By default, LGRID=1, which produces a solid line.
LHREF= linetype
LH= linetype
specifies the line type for the reference lines that you request with the HREF= option. By default, LHREF=2, which produces a dashed line.
LOGNORMAL(SIGMA= value | EST < lognormal-options > )
LNORM(SIGMA= value | EST < lognormal-options > )
creates a lognormal probability plot for each value of the shape parameter σ given by the mandatory SIGMA= lognormal-option. If you specify SIGMA=EST, the procedure creates a plot based on a maximum likelihood estimate for σ. To obtain a graphical estimate of σ, specify a list of values for the SIGMA= lognormal-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and ζ with the THETA= and ZETA= lognormal-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter ζ with the lognormal-options THETA=EST and ZETA=EST. Agreement between the reference line and the point pattern indicates that the lognormal distribution with parameters σ, θ, and ζ is a good fit. You can specify the THRESHOLD= lognormal-option as an alias for the THETA= lognormal-option and the SCALE= lognormal-option as an alias for the ZETA= lognormal-option. See Example 3.26.
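The two uses of the SIGMA= lognormal-option described above can be sketched as follows. The first statement requests one plot per listed shape value for graphical estimation; the second requests a single plot with all three parameters estimated. The shape values 0.5, 1, and 2 are arbitrary illustrative choices, and Measures and Width are the example names used elsewhere in this section.

```sas
proc univariate data=Measures;
   /* one lognormal probability plot for each candidate sigma */
   probplot Width / lognormal(sigma=0.5 1 2);
   /* single plot with estimated shape, threshold, and scale */
   probplot Width / lognormal(sigma=est theta=est zeta=est);
run;
```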
LVREF= linetype
specifies the line type for the reference lines requested with the VREF= option. By default, LVREF=2, which produces a dashed line.
MU= value | EST
specifies the mean µ for a normal probability plot requested with the NORMAL option. Enclose the MU= normal-option in parentheses after the NORMAL option. The MU= normal-option must be specified with the SIGMA= normal-option, and together they request a distribution reference line. You can specify MU=EST to request a distribution reference line with µ equal to the sample mean.
NADJ= value
specifies the adjustment value added to the sample size in the calculation of theoretical percentiles. By default, NADJ=1/4. Refer to Chambers et al. (1983).
NAME= ' string '
specifies a name for the plot, up to eight characters long, that appears in the PROC GREPLAY master menu. The default value is UNIVAR .
NCOLS= n
NCOL= n
specifies the number of columns in a comparative probability plot. By default, NCOLS=1 if you specify only one class variable, and NCOLS=2 if you specify two class variables. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
NOFRAME
suppresses the frame around the subplot area.
NOHLABEL
suppresses the label for the horizontal axis. You can use this option to reduce clutter.
NORMAL < ( normal-options ) >
creates a normal probability plot. This is the default if you omit a distribution option. To assess the point pattern, you can add a diagonal distribution reference line corresponding to µ and σ with the MU= and SIGMA= normal-options. Alternatively, you can add a line corresponding to estimated values of µ and σ with the normal-options MU=EST and SIGMA=EST; the estimates of the mean µ and the standard deviation σ are the sample mean and sample standard deviation. Agreement between the reference line and the point pattern indicates that the normal distribution with parameters µ and σ is a good fit.
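For instance, a normal probability plot with a reference line based on the sample mean and sample standard deviation might be requested as in the sketch below. The COLOR=, L=, and W= values are illustrative choices only.

```sas
proc univariate data=Measures;
   /* reference line uses the sample mean and standard deviation */
   probplot Length / normal(mu=est sigma=est color=red l=2 w=2);
run;
```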
NOVLABEL
suppresses the label for the vertical axis. You can use this option to reduce clutter.
NOVTICK
suppresses the tick marks and tick mark labels for the vertical axis. This option also suppresses the label for the vertical axis.
NROWS= n
NROW= n
specifies the number of rows in a comparative probability plot. By default, NROWS=2. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
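The following sketch shows how NROWS= and NCOLS= might be combined for a two-way comparative probability plot. The classification variables Supplier and Shift are hypothetical and are assumed to have two and three levels, respectively.

```sas
/* Supplier and Shift are hypothetical classification variables */
proc univariate data=Measures;
   class Supplier Shift;
   /* 2 rows by 3 columns of cells, one per combination of levels */
   probplot Length / normal(mu=est sigma=est) nrows=2 ncols=3;
run;
```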
PCTLMINOR
requests minor tick marks for the percentile axis. The HMINOR= option overrides the minor tick marks requested by the PCTLMINOR option.
PCTLORDER= values
specifies the tick marks that are labeled on the theoretical percentile axis. Since the values are percentiles, the labels must be between 0 and 100, exclusive. The values must be listed in increasing order and must cover the plotted percentile range. Otherwise, the default values of 1, 5, 10, 25, 50, 75, 90, 95, and 99 are used.
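As an illustration, the step below labels a reduced set of percentiles and adds a dashed grid. The particular percentile list is an arbitrary choice; if it does not cover the plotted percentile range, the default values are used instead, as noted above.

```sas
proc univariate data=Measures;
   /* percentiles listed in increasing order, strictly between 0 and 100 */
   probplot Length / normal
                     pctlorder=1 10 25 50 75 90 99
                     grid lgrid=2;
run;
```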
RANKADJ= value
specifies the adjustment value added to the ranks in the calculation of theoretical percentiles. By default, RANKADJ=-3/8, as recommended by Blom (1958). Refer to Chambers et al. (1983) for additional information.
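Taken together, the NADJ= and RANKADJ= descriptions suggest the following form for the theoretical percentile assigned to the i-th ordered observation out of n. This is a sketch that assumes the two adjustments enter in the usual Blom-type plotting position; it is not quoted from this section.

```latex
\[
p_i \;=\; 100 \cdot \frac{i + \mathrm{RANKADJ}}{n + \mathrm{NADJ}},
\qquad
\mathrm{defaults:}\quad
p_i \;=\; 100 \cdot \frac{i - 3/8}{n + 1/4}
\]
```

Under that assumption, with n = 10 the smallest observation (i = 1) is plotted at 100 × 0.625 / 10.25, or about the 6.1st percentile.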
SCALE= value | EST
is an alias for the SIGMA= option for plots requested by the BETA, EXPONENTIAL, GAMMA, and WEIBULL options and for the ZETA= option when you request the LOGNORMAL option. See the entries for the SIGMA= and ZETA= options.
SHAPE= value | EST
is an alias for the ALPHA= option with the GAMMA option, for the SIGMA= option with the LOGNORMAL option, and for the C= option with the WEIBULL and WEIBULL2 options. See the entries for the ALPHA=, SIGMA=, and C= options.
SIGMA= value | EST
specifies the parameter σ, where σ > 0. Alternatively, you can specify SIGMA=EST to request a maximum likelihood estimate for σ. The interpretation and use of the SIGMA= option depend on the distribution option with which it is used. See Table 3.44 for a summary of how to use the SIGMA= option. You must enclose this option in parentheses after the distribution option.
Distribution Option | Use of the SIGMA= Option |
---|---|
BETA, EXPONENTIAL, GAMMA, WEIBULL | THETA=θ and SIGMA=σ request a distribution reference line corresponding to θ and σ. |
LOGNORMAL | SIGMA=σ₁ ... σₙ requests n probability plots with shape parameters σ₁, ..., σₙ. The SIGMA= option must be specified. |
NORMAL | MU=µ and SIGMA=σ request a distribution reference line corresponding to µ and σ. SIGMA=EST requests a line with σ equal to the sample standard deviation. |
WEIBULL2 | SIGMA=σ and C=c request a distribution reference line corresponding to σ and c. |
SLOPE= value | EST
specifies the slope for a distribution reference line requested with the LOGNORMAL and WEIBULL2 options. Enclose the SLOPE= option in parentheses after the distribution option. When you use the SLOPE= lognormal-option with the LOGNORMAL option, you must also specify a threshold parameter value θ with the THETA= lognormal-option to request the line. The SLOPE= lognormal-option is an alternative to the ZETA= lognormal-option for specifying ζ, since the slope is equal to exp(ζ).
When you use the SLOPE= Weibull2-option with the WEIBULL2 option, you must also specify a scale parameter value σ with the SIGMA= Weibull2-option to request the line. The SLOPE= Weibull2-option is an alternative to the C= Weibull2-option for specifying c, since the slope is equal to 1/c.
For example, the first and second PROBPLOT statements produce the same probability plots (exp(0) = 1), and the third and fourth PROBPLOT statements produce the same probability plots (1/0.25 = 4):
proc univariate data=Measures;
   probplot Width / lognormal(sigma=2 theta=0 zeta=0);
   probplot Width / lognormal(sigma=2 theta=0 slope=1);
   probplot Width / weibull2(sigma=2 theta=0 c=.25);
   probplot Width / weibull2(sigma=2 theta=0 slope=4);
run;
SQUARE
displays the probability plot in a square frame. By default, the plot is in a rectangular frame.
THETA= value | EST
specifies the lower threshold parameter θ for plots requested with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, WEIBULL, and WEIBULL2 options. Enclose the THETA= option in parentheses after a distribution option. When used with the WEIBULL2 option, the THETA= option specifies the known lower threshold θ₀, for which the default is 0. When used with the other distribution options, the THETA= option specifies θ for a distribution reference line; alternatively in this situation, you can specify THETA=EST to request a maximum likelihood estimate for θ. To request the line, you must also specify a scale parameter.
THRESHOLD= value | EST
is an alias for the THETA= option.
VAXISLABEL= ' label '
specifies a label for the vertical axis. Labels can have up to 40 characters.
VMINOR= n
VM= n
specifies the number of minor tick marks between each major tick mark on the vertical axis. Minor tick marks are not labeled. The default is zero.
VREF= values
draws reference lines perpendicular to the vertical axis at the values specified. Also see the CVREF=, LVREF=, and VREFCHAR= options.
VREFLABELS= 'label1' ... 'labeln'
VREFLABEL= 'label1' ... 'labeln'
VREFLAB= 'label1' ... 'labeln'
specifies labels for the reference lines requested by the VREF= option. The number of labels must equal the number of reference lines. Enclose each label in quotes. Labels can have up to 16 characters.
VREFLABPOS= n
specifies the horizontal position of VREFLABELS= labels. If you specify VREFLABPOS=1, the labels are positioned at the left of the plot. If you specify VREFLABPOS=2, the labels are positioned at the right of the plot. By default, VREFLABPOS=1.
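The VREF= family of options can be combined as in the following sketch, which draws two labeled reference lines perpendicular to the vertical axis. The values 9.7 and 10.3 and the label text are made up for illustration.

```sas
proc univariate data=Measures;
   probplot Length / normal(mu=est sigma=est)
                     vref=9.7 10.3               /* two reference lines   */
                     vreflabels='Lower' 'Upper'  /* one label per line    */
                     vreflabpos=2                /* labels at the right   */
                     cvref=blue lvref=2;         /* blue, dashed lines    */
run;
```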
W= n
specifies the width, in pixels, for a diagonal distribution line. Enclose the W= option in parentheses after the distribution option. By default, W=1.
WAXIS= n
specifies the line thickness, in pixels, for the axes and frame. By default, WAXIS=1.
WEIBULL(C= value | EST < Weibull-options > )
WEIB(C= value | EST < Weibull-options > )
creates a three-parameter Weibull probability plot for each value of the required shape parameter c specified by the mandatory C= Weibull-option. To create a plot that is based on a maximum likelihood estimate for c, specify C=EST. To obtain a graphical estimate of c, specify a list of values in the C= Weibull-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= Weibull-options. Alternatively, you can add a line corresponding to estimated values of θ and σ with the Weibull-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the Weibull distribution with parameters c, θ, and σ is a good fit. You can specify the SCALE= Weibull-option as an alias for the SIGMA= Weibull-option and the THRESHOLD= Weibull-option as an alias for the THETA= Weibull-option.
WEIBULL2 < ( Weibull2-options ) >
W2 < ( Weibull2-options ) >
creates a two-parameter Weibull probability plot. You should use the WEIBULL2 option when your data have a known lower threshold θ₀, which is 0 by default. To specify the threshold value θ₀, use the THETA= Weibull2-option. An advantage of the two-parameter Weibull plot over the three-parameter Weibull plot is that the parameters c and σ can be estimated from the slope and intercept of the point pattern. A disadvantage is that the two-parameter Weibull distribution applies only in situations where the threshold parameter is known. To obtain a graphical estimate of θ₀, specify a list of values for the THETA= Weibull2-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to σ and c with the SIGMA= and C= Weibull2-options. Alternatively, you can add a distribution reference line corresponding to estimated values of σ and c with the Weibull2-options SIGMA=EST and C=EST. Agreement between the reference line and the point pattern indicates that the Weibull distribution with parameters c, θ₀, and σ is a good fit. You can specify the SCALE= Weibull2-option as an alias for the SIGMA= Weibull2-option and the SHAPE= Weibull2-option as an alias for the C= Weibull2-option.
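A two-parameter Weibull plot with an estimated reference line might be requested as in this sketch, which assumes that the known threshold for Width is 0 (the default, written out here for clarity).

```sas
proc univariate data=Measures;
   /* known threshold 0; scale and shape estimated for the reference line */
   probplot Width / weibull2(theta=0 sigma=est c=est);
run;
```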
ZETA= value | EST
specifies a value for the scale parameter ζ for the lognormal probability plots requested with the LOGNORMAL option. Enclose the ZETA= lognormal-option in parentheses after the LOGNORMAL option. To request a distribution reference line with intercept θ and slope exp(ζ), specify both the THETA= and ZETA= lognormal-options.
QQPLOT < variables > < /options > ;
The QQPLOT statement creates quantile-quantile plots (Q-Q plots) using high-resolution graphics and compares ordered variable values with quantiles of a specified theoretical distribution. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Thus, you can use a Q-Q plot to determine how well a theoretical distribution models a set of measurements.
Q-Q plots are similar to probability plots, which you can create with the PROBPLOT statement. Q-Q plots are preferable for graphical estimation of distribution parameters, whereas probability plots are preferable for graphical estimation of percentiles.
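Because the two statements share the same distribution options, both displays can be requested for the same variable in one step, as in this brief sketch:

```sas
proc univariate data=Measures;
   probplot Length / normal(mu=est sigma=est);  /* percentile axis */
   qqplot   Length / normal(mu=est sigma=est);  /* quantile axis   */
run;
```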
You can use any number of QQPLOT statements in the UNIVARIATE procedure. The components of the QQPLOT statement are described as follows.
variables
are the variables for which to create Q-Q plots. If you specify a VAR statement, the variables must also be listed in the VAR statement. Otherwise, the variables can be any numeric variables in the input data set. If you do not specify a list of variables , then by default the procedure creates a Q-Q plot for each variable listed in the VAR statement, or for each numeric variable in the DATA= data set if you do not specify a VAR statement. For example, each of the following QQPLOT statements produces two Q-Q plots, one for Length and one for Width :
proc univariate data=Measures;
   var Length Width;
   qqplot;
run;
proc univariate data=Measures;
   qqplot Length Width;
run;
options
specify the theoretical distribution for the plot or add features to the plot. If you specify more than one variable, the options apply equally to each variable. Specify all options after the slash (/) in the QQPLOT statement. You can specify only one option naming the distribution in each QQPLOT statement, but you can specify any number of other options. The distributions available are the beta, exponential, gamma, lognormal, normal, two-parameter Weibull, and three-parameter Weibull. By default, the procedure produces a plot for the normal distribution.
In the following example, the NORMAL option requests a normal Q-Q plot for each variable. The MU= and SIGMA= normal-options request a distribution reference line with intercept 10 and slope 0.3 for each plot, corresponding to a normal distribution with mean µ = 10 and standard deviation σ = 0.3. The SQUARE option displays the plot in a square frame, and the CTEXT= option specifies the text color.
proc univariate data=measures;
   qqplot length1 length2 / normal(mu=10 sigma=0.3) square ctext=blue;
run;
Table 3.45 through Table 3.54 list the QQPLOT options by function. For complete descriptions, see the section "Dictionary of Options."
Table 3.45 Primary options for theoretical distributions
Option | Description |
---|---|
BETA( beta-options ) | Specifies beta Q-Q plot for shape parameters α and β specified with mandatory ALPHA= and BETA= beta-options |
EXPONENTIAL( exponential-options ) | Specifies exponential Q-Q plot |
GAMMA( gamma-options ) | Specifies gamma Q-Q plot for shape parameter α specified with mandatory ALPHA= gamma-option |
LOGNORMAL( lognormal-options ) | Specifies lognormal Q-Q plot for shape parameter σ specified with mandatory SIGMA= lognormal-option |
NORMAL( normal-options ) | Specifies normal Q-Q plot |
WEIBULL( Weibull-options ) | Specifies three-parameter Weibull Q-Q plot for shape parameter c specified with mandatory C= Weibull-option |
WEIBULL2( Weibull2-options ) | Specifies two-parameter Weibull Q-Q plot |

Table 3.46 Secondary reference line options used with all distributions
Option | Description |
---|---|
COLOR= color | Specifies color of distribution reference line |
L= linetype | Specifies line type of distribution reference line |
W= n | Specifies width of distribution reference line |

Table 3.47 Secondary beta-options
Option | Description |
---|---|
ALPHA= value-list \| EST | Specifies mandatory shape parameter α |
BETA= value-list \| EST | Specifies mandatory shape parameter β |
SIGMA= value \| EST | Specifies σ for distribution reference line |
THETA= value \| EST | Specifies θ for distribution reference line |

Table 3.48 Secondary exponential-options
Option | Description |
---|---|
SIGMA= value \| EST | Specifies σ for distribution reference line |
THETA= value \| EST | Specifies θ for distribution reference line |

Table 3.49 Secondary gamma-options
Option | Description |
---|---|
ALPHA= value-list \| EST | Specifies mandatory shape parameter α |
SIGMA= value \| EST | Specifies σ for distribution reference line |
THETA= value \| EST | Specifies θ for distribution reference line |

Table 3.50 Secondary lognormal-options
Option | Description |
---|---|
SIGMA= value-list \| EST | Specifies mandatory shape parameter σ |
SLOPE= value \| EST | Specifies slope of distribution reference line |
THETA= value \| EST | Specifies θ for distribution reference line |
ZETA= value | Specifies ζ for distribution reference line (slope is exp(ζ)) |

Table 3.51 Secondary normal-options
Option | Description |
---|---|
MU= value \| EST | Specifies µ for distribution reference line |
SIGMA= value \| EST | Specifies σ for distribution reference line |

Table 3.52 Secondary Weibull-options
Option | Description |
---|---|
C= value-list \| EST | Specifies mandatory shape parameter c |
SIGMA= value \| EST | Specifies σ for distribution reference line |
THETA= value \| EST | Specifies θ for distribution reference line |

Table 3.53 Secondary Weibull2-options
Option | Description |
---|---|
C= value \| EST | Specifies c for distribution reference line (slope is 1/c) |
SIGMA= value \| EST | Specifies σ for distribution reference line (intercept is log(σ)) |
SLOPE= value \| EST | Specifies slope of distribution reference line |
THETA= value | Specifies known lower threshold θ₀ |

Table 3.54 General options
Option | Description |
---|---|
ANNOKEY | Applies annotation requested in ANNOTATE= data set to key cell only |
ANNOTATE= | Specifies annotate data set |
CAXIS= | Specifies color for axis |
CFRAME= | Specifies color for frame |
CFRAMESIDE= | Specifies color for filling frame for row labels |
CFRAMETOP= | Specifies color for filling frame for column labels |
CGRID= | Specifies color for grid lines |
CHREF= | Specifies color for HREF= lines |
CTEXT= | Specifies color for text |
CVREF= | Specifies color for VREF= lines |
DESCRIPTION= | Specifies description for plot in graphics catalog |
FONT= | Specifies software font for text |
GRID | Creates a grid |
HEIGHT= | Specifies height of text used outside framed areas |
HMINOR= | Specifies number of horizontal minor tick marks |
HREF= | Specifies reference lines perpendicular to the horizontal axis |
HREFLABELS= | Specifies labels for HREF= lines |
HREFLABPOS= | Specifies vertical position of labels for HREF= lines |
INFONT= | Specifies software font for text inside framed areas |
INHEIGHT= | Specifies height of text inside framed areas |
INTERTILE= | Specifies distance between tiles |
LGRID= | Specifies a line type for grid lines |
LHREF= | Specifies line style for HREF= lines |
LVREF= | Specifies line style for VREF= lines |
NADJ= | Adjusts sample size when computing percentiles |
NAME= | Specifies name for plot in graphics catalog |
NCOLS= | Specifies number of columns in comparative Q-Q plot |
NOFRAME | Suppresses frame around plotting area |
NOHLABEL | Suppresses label for horizontal axis |
NOVLABEL | Suppresses label for vertical axis |
NOVTICK | Suppresses tick marks and tick mark labels for vertical axis |
NROWS= | Specifies number of rows in comparative Q-Q plot |
PCTLAXIS | Displays a nonlinear percentile axis |
PCTLMINOR | Requests minor tick marks for percentile axis |
PCTLSCALE | Replaces theoretical quantiles with percentiles |
RANKADJ= | Adjusts ranks when computing percentiles |
SQUARE | Displays plot in square format |
VAXISLABEL= | Specifies label for vertical axis |
VMINOR= | Specifies number of vertical minor tick marks |
VREF= | Specifies reference lines perpendicular to the vertical axis |
VREFLABELS= | Specifies labels for VREF= lines |
VREFLABPOS= | Specifies horizontal position of labels for VREF= lines |
WAXIS= | Specifies line thickness for axes and frame |
Options can be any of the following:
primary options
secondary options
general options
Table 3.45 lists primary options for requesting a theoretical distribution.
Table 3.46 through Table 3.53 list secondary options that specify distribution parameters and control the display of a distribution reference line. Specify these options in parentheses after the distribution keyword. For example, you can request a normal Q-Q plot with a distribution reference line by specifying the NORMAL option as follows:
proc univariate;
   qqplot Length / normal(mu=10 sigma=0.3 color=red);
run;
The MU= and SIGMA= normal-options display a distribution reference line that corresponds to the normal distribution with mean µ = 10 and standard deviation σ = 0.3, and the COLOR= normal-option specifies the color for the line.
Table 3.54 summarizes general options for enhancing Q-Q plots.
The following entries provide detailed descriptions of options in the QQPLOT statement.
ALPHA= value | EST
specifies the mandatory shape parameter α for quantile plots requested with the BETA and GAMMA options. Enclose the ALPHA= option in parentheses after the BETA or GAMMA option. If you specify ALPHA=EST, a maximum likelihood estimate is computed for α.
ANNOKEY
applies the annotation requested with the ANNOTATE= option to the key cell only. By default, the procedure applies annotation to all of the cells. This option is not available unless you use the CLASS statement. Specify the KEYLEVEL= option in the CLASS statement to specify the key cell.
ANNOTATE= SAS-data-set
ANNO= SAS-data-set
specifies an input data set containing annotate variables as described in SAS/GRAPH Software: Reference. The ANNOTATE= data set you specify in the QQPLOT statement is used for all plots created by the statement. You can also specify an ANNOTATE= data set in the PROC UNIVARIATE statement to enhance all plots created by the procedure.
BETA(ALPHA= value | EST BETA= value | EST < beta-options > )
creates a beta quantile plot for each combination of the required shape parameters α and β specified by the required ALPHA= and BETA= beta-options. If you specify ALPHA=EST and BETA=EST, the procedure creates a plot based on maximum likelihood estimates for α and β. You can specify the SCALE= beta-option as an alias for the SIGMA= beta-option and the THRESHOLD= beta-option as an alias for the THETA= beta-option.
To obtain graphical estimates of α and β, specify lists of values in the ALPHA= and BETA= beta-options, and select the combination of α and β that most nearly linearizes the point pattern. To assess the point pattern, you can add a diagonal distribution reference line corresponding to the lower threshold parameter θ and the scale parameter σ with the THETA= and SIGMA= beta-options. Alternatively, you can add a line that corresponds to estimated values of θ and σ with the beta-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the beta distribution with parameters α, β, θ, and σ is a good fit.
BETA= value | EST
B= value | EST
specifies the mandatory shape parameter β for quantile plots requested with the BETA option. Enclose the BETA= option in parentheses after the BETA option. If you specify BETA=EST, a maximum likelihood estimate is computed for β.
C= value | EST
specifies the shape parameter c for quantile plots requested with the WEIBULL and WEIBULL2 options. Enclose this option in parentheses after the WEIBULL or WEIBULL2 option. C= is a required Weibull-option with the WEIBULL option; in this situation, it accepts a list of values, or if you specify C=EST, a maximum likelihood estimate is computed for c. You can optionally specify C= value or C=EST as a Weibull2-option with the WEIBULL2 option to request a distribution reference line; in this situation, you must also specify the Weibull2-option SIGMA= value or SIGMA=EST.
CAXIS= color
CAXES= color
specifies the color for the axes. This option overrides any COLOR= specifications in an AXIS statement. The default value is the first color in the device color list.
CFRAME= color
specifies the color for the area that is enclosed by the axes and frame. The area is not filled by default.
CFRAMESIDE= color
specifies the color to fill the frame area for the row labels that display along the left side of a comparative quantile plot. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option is not available unless you use the CLASS statement.
CFRAMETOP= color
specifies the color to fill the frame area for the column labels that display across the top of a comparative quantile plot. This color also fills the frame area for the label of the corresponding class variable (if you associate a label with the variable). By default, these areas are not filled. This option does not apply unless you use the CLASS statement.
CGRID= color
specifies the color for grid lines when a grid displays on the plot. The default color is the first color in the device color list. This option also produces a grid.
CHREF= color
CH= color
specifies the color for horizontal axis reference lines requested by the HREF= option. The default color is the first color in the device color list.
COLOR= color
specifies the color of the diagonal distribution reference line. The default color is the first color in the device color list. Enclose the COLOR= option in parentheses after a distribution option keyword.
CTEXT= color
specifies the color for tick mark values and axis labels. The default color is the color that you specify for the CTEXT= option in the GOPTIONS statement. If you omit the GOPTIONS statement, the default is the first color in the device color list.
CVREF= color
CV= color
specifies the color for the reference lines requested by the VREF= option. The default color is the first color in the device color list.
DESCRIPTION= ' string '
DES= ' string '
specifies a description, up to 40 characters long, that appears in the PROC GREPLAY master menu. The default string is the variable name.
EXPONENTIAL < ( exponential-options ) >
EXP < ( exponential-options ) >
creates an exponential quantile plot. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= exponential-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter σ with the exponential-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the exponential distribution with parameters θ and σ is a good fit. You can specify the SCALE= exponential-option as an alias for the SIGMA= exponential-option and the THRESHOLD= exponential-option as an alias for the THETA= exponential-option.
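As a brief illustration, an exponential quantile plot with an estimated reference line might be requested as follows. The data set Failures and the variable Hours are hypothetical.

```sas
/* Sketch only: Failures and Hours are hypothetical names. */
proc univariate data=Failures noprint;
   /* THETA=EST and SIGMA=EST add a reference line based on estimates. */
   qqplot Hours / exponential(theta=est sigma=est);
run;
```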
FONT= font
specifies a software font for the reference lines and axis labels. You can also specify fonts for axis labels in an AXIS statement. The FONT= font takes precedence over the FTEXT= font specified in the GOPTIONS statement. Hardware characters are used by default.
GAMMA(ALPHA= value | EST < gamma-options > )
creates a gamma quantile plot for each value of the shape parameter α given by the mandatory ALPHA= gamma-option. If you specify ALPHA=EST, the procedure creates a plot based on a maximum likelihood estimate for α. To obtain a graphical estimate of α, specify a list of values for the ALPHA= gamma-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= gamma-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter σ with the gamma-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the gamma distribution with parameters α, θ, and σ is a good fit. You can specify the SCALE= gamma-option as an alias for the SIGMA= gamma-option and the THRESHOLD= gamma-option as an alias for the THETA= gamma-option.
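A short sketch of both uses of the ALPHA= gamma-option follows. The names Measures and Width are assumptions, and the shape values in the list are arbitrary.

```sas
/* Sketch only: Measures and Width are hypothetical names. */
proc univariate data=Measures noprint;
   /* One plot per listed shape value, for a graphical estimate of alpha. */
   qqplot Width / gamma(alpha=0.5 1 2);
   /* One plot based on estimates, with an estimated reference line. */
   qqplot Width / gamma(alpha=est theta=est sigma=est);
run;
```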
GRID
displays a grid of horizontal lines positioned at major tick marks on the vertical axis.
HEIGHT= value
specifies the height, in percentage screen units, of text for axis labels, tick mark labels, and legends. This option takes precedence over the HTEXT= option in the GOPTIONS statement.
HMINOR= n
HM= n
specifies the number of minor tick marks between each major tick mark on the horizontal axis. Minor tick marks are not labeled. By default, HMINOR=0.
HREF= values
draws reference lines that are perpendicular to the horizontal axis at specified values. When you use the PCTLAXIS option, HREF= values must be in quantile units.
HREFLABELS= 'label1' ... 'labeln'
HREFLABEL= 'label1' ... 'labeln'
HREFLAB= 'label1' ... 'labeln'
specifies labels for the reference lines requested by the HREF= option. The number of labels must equal the number of reference lines. Labels can have up to 16 characters.
HREFLABPOS= n
specifies the vertical position of HREFLABELS= labels. If you specify HREFLABPOS=1, the labels are positioned along the top of the plot. If you specify HREFLABPOS=2, the labels are staggered from top to bottom of the plot. If you specify HREFLABPOS=3, the labels are positioned along the bottom of the plot. By default, HREFLABPOS=1.
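The reference line options can be combined as in the sketch below. Measures and Width are hypothetical names, the HREF= values are arbitrary positions on the theoretical quantile axis, and some of these options may apply only to traditional graphics output.

```sas
/* Sketch only: Measures and Width are hypothetical names. */
proc univariate data=Measures noprint;
   /* Three vertical reference lines, labeled along the bottom of the plot. */
   qqplot Width / normal
                  href=-1 0 1
                  hreflabels='-1' '0' '+1'
                  hreflabpos=3
                  lhref=2;
run;
```

The number of labels matches the number of HREF= values, as the HREFLABELS= entry requires.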
INFONT= font
specifies a software font to use for text inside the framed areas of the plot. The INFONT= option takes precedence over the FTEXT= option in the GOPTIONS statement. For a list of fonts, see SAS/GRAPH Reference .
INHEIGHT= value
specifies the height, in percentage screen units, of text used inside the framed areas of the plot. By default, the height specified by the HEIGHT= option is used. If you do not specify the HEIGHT= option, the height specified with the HTEXT= option in the GOPTIONS statement is used.
INTERTILE= value
specifies the distance, in horizontal percentage screen units, between the framed areas, which are called tiles . By default, INTERTILE=0.75 percentage screen units. This option is not available unless you use the CLASS statement. You can specify INTERTILE=0 to create contiguous tiles.
L= linetype
specifies the line type for a diagonal distribution reference line. Enclose the L= option in parentheses after a distribution option. By default, L=1, which produces a solid line.
LGRID= linetype
specifies the line type for the grid requested by the GRID option. By default, LGRID=1, which produces a solid line. The LGRID= option also produces a grid.
LHREF= linetype
LH= linetype
specifies the line type for the reference lines that you request with the HREF= option. By default, LHREF=2, which produces a dashed line.
LOGNORMAL(SIGMA= value | EST < lognormal-options > )
LNORM(SIGMA= value | EST < lognormal-options > )
creates a lognormal quantile plot for each value of the shape parameter σ given by the mandatory SIGMA= lognormal-option. If you specify SIGMA=EST, the procedure creates a plot based on a maximum likelihood estimate for σ. To obtain a graphical estimate of σ, specify a list of values for the SIGMA= lognormal-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and ζ with the THETA= and ZETA= lognormal-options. Alternatively, you can add a line corresponding to estimated values of the threshold parameter θ and the scale parameter ζ with the lognormal-options THETA=EST and ZETA=EST. Agreement between the reference line and the point pattern indicates that the lognormal distribution with parameters σ, θ, and ζ is a good fit. You can specify the THRESHOLD= lognormal-option as an alias for the THETA= lognormal-option and the SCALE= lognormal-option as an alias for the ZETA= lognormal-option. See Example 3.31 through Example 3.33.
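For instance, a lognormal quantile plot with all three parameters estimated could be sketched as follows, again assuming a hypothetical data set Measures with variable Width.

```sas
/* Sketch only: Measures and Width are hypothetical names. */
proc univariate data=Measures noprint;
   /* SIGMA=EST sets the shape; THETA=EST and ZETA=EST add the reference line. */
   qqplot Width / lognormal(sigma=est theta=est zeta=est);
run;
```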
LVREF= linetype
specifies the line type for the reference lines requested with the VREF= option. By default, LVREF=2, which produces a dashed line.
MU= value | EST
specifies the mean µ for a normal quantile plot requested with the NORMAL option. Enclose the MU= normal-option in parentheses after the NORMAL option. The MU= normal-option must be specified with the SIGMA= normal-option , and they request a distribution reference line. You can specify MU=EST to request a distribution reference line with µ equal to the sample mean.
NADJ= value
specifies the adjustment value added to the sample size in the calculation of theoretical percentiles. By default, NADJ=1/4. Refer to Chambers et al. (1983) for additional information.
NAME= ' string '
specifies a name for the plot, up to eight characters long, that appears in the PROC GREPLAY master menu. The default value is UNIVAR .
NCOLS= n
NCOL= n
specifies the number of columns in a comparative quantile plot. By default, NCOLS=1 if you specify only one class variable, and NCOLS=2 if you specify two class variables. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
NOFRAME
suppresses the frame around the subplot area. If you specify the PCTLAXIS option, then you cannot specify the NOFRAME option.
NOHLABEL
suppresses the label for the horizontal axis. You can use this option to reduce clutter.
NORMAL < ( normal-options ) >
creates a normal quantile plot. This is the default if you omit a distribution option. To assess the point pattern, you can add a diagonal distribution reference line corresponding to µ and σ with the MU= and SIGMA= normal-options. Alternatively, you can add a line corresponding to estimated values of µ and σ with the normal-options MU=EST and SIGMA=EST; the estimates of the mean µ and the standard deviation σ are the sample mean and sample standard deviation. Agreement between the reference line and the point pattern indicates that the normal distribution with parameters µ and σ is a good fit. See Example 3.28 and Example 3.30.
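A minimal sketch of a normal quantile plot with an estimated reference line is shown below. The names Measures and Width are assumed, and the line type is an arbitrary choice.

```sas
/* Sketch only: Measures and Width are hypothetical names. */
proc univariate data=Measures noprint;
   /* MU=EST and SIGMA=EST use the sample mean and standard deviation; */
   /* L=2 draws the reference line as a dashed line.                   */
   qqplot Width / normal(mu=est sigma=est l=2) square;
run;
```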
NOVLABEL
suppresses the label for the vertical axis. You can use this option to reduce clutter.
NOVTICK
suppresses the tick marks and tick mark labels for the vertical axis. This option also suppresses the label for the vertical axis.
NROWS= n
NROW= n
specifies the number of rows in a comparative quantile plot. By default, NROWS=2. This option is not available unless you use the CLASS statement. If you specify two class variables, you can use the NCOLS= option with the NROWS= option.
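The layout options for comparative plots might be used as in the following sketch. It assumes a hypothetical data set Measures with an analysis variable Width and a class variable Machine that has two levels.

```sas
/* Sketch only: Measures, Width, and Machine are hypothetical names. */
proc univariate data=Measures noprint;
   class Machine;
   /* Place the two class levels side by side instead of stacked. */
   qqplot Width / normal(mu=est sigma=est) nrows=1 ncols=2 intertile=1;
run;
```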
PCTLAXIS < ( axis-options ) >
adds a nonlinear percentile axis along the frame of the Q-Q plot opposite the theoretical quantile axis. The added axis is identical to the axis for probability plots produced with the PROBPLOT statement. When using the PCTLAXIS option, you must specify HREF= values in quantile units, and you cannot use the NOFRAME option. You can specify the following axis-options :
GRID | Draws vertical grid lines at major percentiles |
GRIDCHAR=' character ' | Specifies grid line plotting character on line printer |
LABEL=' string ' | Specifies label for percentile axis |
LGRID= linetype | Specifies line type for grid |
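As an illustration of the axis-options listed above, the sketch below adds a percentile axis with grid lines and a custom label. Measures and Width are hypothetical names, and the label text is arbitrary.

```sas
/* Sketch only: Measures and Width are hypothetical names. */
proc univariate data=Measures noprint;
   /* GRID draws vertical lines at major percentiles; LABEL= names the added axis. */
   qqplot Width / normal(mu=est sigma=est)
                  pctlaxis(grid label='Normal Percentiles');
run;
```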
PCTLMINOR
requests minor tick marks for the percentile axis when you specify PCTLAXIS. The HMINOR option overrides the PCTLMINOR option.
PCTLSCALE
requests scale labels for the theoretical quantile axis in percentile units, resulting in a nonlinear axis scale. Tick marks are drawn uniformly across the axis based on the quantile scale. In all other respects, the plot remains the same, and you must specify HREF= values in quantile units. For a true nonlinear axis, use the PCTLAXIS option or use the PROBPLOT statement.
RANKADJ= value
specifies the adjustment value added to the ranks in the calculation of theoretical percentiles. By default, RANKADJ=-3/8, as recommended by Blom (1958). Refer to Chambers et al. (1983) for additional information.
SCALE= value | EST
is an alias for the SIGMA= option for plots requested by the BETA, EXPONENTIAL, GAMMA, WEIBULL, and WEIBULL2 options and for the ZETA= option with the LOGNORMAL option. See the entries for the SIGMA= and ZETA= options.
SHAPE= value | EST
is an alias for the ALPHA= option with the GAMMA option, for the SIGMA= option with the LOGNORMAL option, and for the C= option with the WEIBULL and WEIBULL2 options. See the entries for the ALPHA=, SIGMA=, and C= options.
SIGMA= value | EST
specifies the parameter σ, where σ > 0. Alternatively, you can specify SIGMA=EST to request a maximum likelihood estimate for σ. The interpretation and use of the SIGMA= option depend on the distribution option with which it is used, as summarized in Table 3.56. Enclose this option in parentheses after the distribution option.
Distribution Option | Use of the SIGMA= Option |
---|---|
BETA, EXPONENTIAL, GAMMA, WEIBULL | THETA=θ and SIGMA=σ request a distribution reference line corresponding to θ and σ. |
LOGNORMAL | SIGMA=σ1 ... σn requests n quantile plots with shape parameters σ1, ..., σn. The SIGMA= option must be specified. |
NORMAL | MU=µ and SIGMA=σ request a distribution reference line corresponding to µ and σ. SIGMA=EST requests a line with σ equal to the sample standard deviation. |
WEIBULL2 | SIGMA=σ and C=c request a distribution reference line corresponding to σ and c. |
SLOPE= value | EST
specifies the slope for a distribution reference line requested with the LOGNORMAL and WEIBULL2 options. Enclose the SLOPE= option in parentheses after the distribution option. When you use the SLOPE= lognormal-option with the LOGNORMAL option, you must also specify a threshold parameter value θ with the THETA= lognormal-option to request the line. The SLOPE= lognormal-option is an alternative to the ZETA= lognormal-option for specifying ζ, since the slope is equal to exp(ζ).
When you use the SLOPE= Weibull2-option with the WEIBULL2 option, you must also specify a scale parameter value σ with the SIGMA= Weibull2-option to request the line. The SLOPE= Weibull2-option is an alternative to the C= Weibull2-option for specifying c, since the slope is equal to 1/c.
For example, the first and second QQPLOT statements produce the same quantile plots and the third and fourth QQPLOT statements produce the same quantile plots:
proc univariate data=Measures;
   qqplot Width / lognormal(sigma=2 theta=0 zeta=0);
   qqplot Width / lognormal(sigma=2 theta=0 slope=1);
   qqplot Width / weibull2(sigma=2 theta=0 c=.25);
   qqplot Width / weibull2(sigma=2 theta=0 slope=4);
run;
SQUARE
displays the quantile plot in a square frame. By default, the frame is rectangular.
THETA= value | EST
specifies the lower threshold parameter θ for plots requested with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, WEIBULL, and WEIBULL2 options. Enclose the THETA= option in parentheses after a distribution option. When used with the WEIBULL2 option, the THETA= option specifies the known lower threshold θ, for which the default is 0. When used with the other distribution options, the THETA= option specifies θ for a distribution reference line; alternatively in this situation, you can specify THETA=EST to request a maximum likelihood estimate for θ. To request the line, you must also specify a scale parameter.
THRESHOLD= value | EST
is an alias for the THETA= option.
VAXISLABEL= ' label '
specifies a label for the vertical axis. Labels can have up to 40 characters.
VMINOR= n
VM= n
specifies the number of minor tick marks between each major tick mark on the vertical axis. Minor tick marks are not labeled. The default is zero.
VREF= values
draws reference lines perpendicular to the vertical axis at the values specified. Also see the CVREF=, LVREF=, and VREFCHAR= options.
VREFLABELS= 'label1' ... 'labeln'
VREFLABEL= 'label1' ... 'labeln'
VREFLAB= 'label1' ... 'labeln'
specifies labels for the reference lines requested by the VREF= option. The number of labels must equal the number of reference lines. Enclose each label in quotes. Labels can have up to 16 characters.
VREFLABPOS= n
specifies the horizontal position of VREFLABELS= labels. If you specify VREFLABPOS=1, the labels are positioned at the left of the plot. If you specify VREFLABPOS=2, the labels are positioned at the right of the plot. By default, VREFLABPOS=1.
W= n
specifies the width, in pixels, for a diagonal distribution line. Enclose the W= option in parentheses after the distribution option. By default, W=1.
WAXIS= n
specifies the line thickness, in pixels, for the axes and frame. By default, WAXIS=1.
WEIBULL(C= value | EST < Weibull-options > )
WEIB(C= value | EST < Weibull-options > )
creates a three-parameter Weibull quantile plot for each value of the shape parameter c specified by the mandatory C= Weibull-option. To create a plot that is based on a maximum likelihood estimate for c, specify C=EST. To obtain a graphical estimate of c, specify a list of values in the C= Weibull-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to θ and σ with the THETA= and SIGMA= Weibull-options. Alternatively, you can add a line corresponding to estimated values of θ and σ with the Weibull-options THETA=EST and SIGMA=EST. Agreement between the reference line and the point pattern indicates that the Weibull distribution with parameters c, θ, and σ is a good fit. You can specify the SCALE= Weibull-option as an alias for the SIGMA= Weibull-option and the THRESHOLD= Weibull-option as an alias for the THETA= Weibull-option. See Example 3.34.
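A three-parameter Weibull plot with estimated parameters might be requested as in this sketch. Failures and Hours are hypothetical names.

```sas
/* Sketch only: Failures and Hours are hypothetical names. */
proc univariate data=Failures noprint;
   /* C=EST sets the shape; THETA=EST and SIGMA=EST add the reference line. */
   qqplot Hours / weibull(c=est theta=est sigma=est);
run;
```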
WEIBULL2 < ( Weibull2-options ) >
W2 < ( Weibull2-options ) >
creates a two-parameter Weibull quantile plot. You should use the WEIBULL2 option when your data have a known lower threshold θ, which is 0 by default. To specify the threshold value θ, use the THETA= Weibull2-option. By default, THETA=0. An advantage of the two-parameter Weibull plot over the three-parameter Weibull plot is that the parameters c and σ can be estimated from the slope and intercept of the point pattern. A disadvantage is that the two-parameter Weibull distribution applies only in situations where the threshold parameter is known. To obtain a graphical estimate of θ, specify a list of values for the THETA= Weibull2-option, and select the value that most nearly linearizes the point pattern. To assess the point pattern, add a diagonal distribution reference line corresponding to σ and c with the SIGMA= and C= Weibull2-options. Alternatively, you can add a distribution reference line corresponding to estimated values of σ and c with the Weibull2-options SIGMA=EST and C=EST. Agreement between the reference line and the point pattern indicates that the Weibull distribution with parameters c, θ, and σ is a good fit. You can specify the SCALE= Weibull2-option as an alias for the SIGMA= Weibull2-option and the SHAPE= Weibull2-option as an alias for the C= Weibull2-option. See Example 3.34.
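The two-parameter form can be sketched as follows, assuming a hypothetical data set Failures whose variable Hours has a known lower threshold of 0.

```sas
/* Sketch only: Failures and Hours are hypothetical names. */
proc univariate data=Failures noprint;
   /* THETA=0 is the default and is written out here for clarity; */
   /* SIGMA=EST and C=EST add an estimated reference line.         */
   qqplot Hours / weibull2(theta=0 sigma=est c=est);
run;
```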
ZETA= value | EST
specifies a value for the scale parameter ζ for the lognormal quantile plots requested with the LOGNORMAL option. Enclose the ZETA= lognormal-option in parentheses after the LOGNORMAL option. To request a distribution reference line with intercept θ and slope exp(ζ), specify THETA=θ and ZETA=ζ.
VAR variables ;
The VAR statement specifies the analysis variables and their order in the results. By default, if you omit the VAR statement, PROC UNIVARIATE analyzes all numeric variables that are not listed in the other statements.
You must provide a VAR statement when you use an OUTPUT statement. To store the same statistic for several analysis variables in the OUT= data set, you specify a list of names in the OUTPUT statement. PROC UNIVARIATE makes a one-to-one correspondence between the order of the analysis variables in the VAR statement and the list of names that follow a statistic keyword.
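The correspondence between the VAR statement and the names in the OUTPUT statement can be sketched as follows. The data set Measures, the variables Width and Length, and the output names are all assumptions for illustration.

```sas
/* Sketch only: all data set and variable names are hypothetical. */
proc univariate data=Measures noprint;
   var Width Length;
   /* Names are matched to VAR variables in order:              */
   /* MeanWidth and StdWidth for Width, the others for Length.  */
   output out=Stats mean=MeanWidth MeanLength
                    std=StdWidth StdLength;
run;
```

The OUT= data set Stats would then hold one column for each name listed after a statistic keyword.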
WEIGHT variable ;
The WEIGHT statement specifies numeric weights for analysis variables in the statistical calculations. The UNIVARIATE procedure uses the values w_i of the WEIGHT variable to modify the computation of a number of summary statistics by assuming that the variance of the i-th value x_i of the analysis variable is equal to σ²/w_i, where σ is an unknown parameter. The values of the WEIGHT variable do not have to be integers and are typically positive. By default, observations with nonpositive or missing values of the WEIGHT variable are handled as follows: [*]
If the value is zero, the observation is counted in the total number of observations.
If the value is negative, it is converted to zero, and the observation is counted in the total number of observations.
If the value is missing, the observation is excluded from the analysis.
To exclude observations that contain negative and zero weights from the analysis, use the EXCLNPWGT option. Note that most SAS/STAT procedures, such as PROC GLM, exclude negative and zero weights by default. The weight variable does not change how the procedure determines the range, mode, extreme values, extreme observations, or number of missing values. When you specify a WEIGHT statement, the procedure also computes a weighted standard error and a weighted version of Student's t test. The Student's t test is the only test of location that PROC UNIVARIATE computes when you weight the analysis variables.
When you specify a WEIGHT variable, the procedure uses its values, w_i, to compute weighted versions of the statistics [ ] provided in the Moments table. For example, the procedure computes a weighted mean \bar{x}_w and a weighted variance s_w^2 as
\[ \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \]
and
\[ s_w^2 = \frac{1}{d} \sum_i w_i \left( x_i - \bar{x}_w \right)^2 \]
where x_i is the i-th variable value. The divisor d is controlled by the VARDEF= option in the PROC UNIVARIATE statement.
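As a rough cross-check of the weighted mean, the ratio of the weighted sum to the sum of weights can be computed directly. This is only a sketch: the data set Measures, the analysis variable Width, and the weight variable Wt are hypothetical, and the WHERE clause is an assumption that mirrors the default handling described above, in which missing weights are excluded and nonpositive weights contribute nothing to either sum.

```sas
/* Sketch only: Measures, Width, and Wt are hypothetical names. */
proc sql;
   /* Weighted mean = sum(w * x) / sum(w) over observations with positive weight. */
   select sum(Wt * Width) / sum(Wt) as WeightedMean
      from Measures
      where Wt > 0 and Width is not missing;
quit;
```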
The WEIGHT statement does not affect the determination of the mode, extreme values, extreme observations, or the number of missing values of the analysis variables. However, the weights w_i are used to compute weighted percentiles. [*] The WEIGHT variable has no effect on graphical displays produced with the plot statements.
The CIPCTLDF, CIPCTLNORMAL, LOCCOUNT, NORMAL, ROBUSTSCALE, TRIMMED=, and WINSORIZED= options are not available with the WEIGHT statement.
To compute weighted skewness or kurtosis, use VARDEF=DF or VARDEF=N in the PROC statement.
You cannot specify the HISTOGRAM, PROBPLOT, or QQPLOT statements with the WEIGHT statement.
When you use the WEIGHT statement, consider which value of the VARDEF= option is appropriate. See the VARDEF= option and the calculation of weighted statistics for more information.
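A weighted analysis that follows the points above might look like the sketch below. Measures, Width, and Wt are hypothetical names; VARDEF=DF is chosen so that weighted skewness and kurtosis can be computed, and EXCLNPWGT drops observations with nonpositive weights.

```sas
/* Sketch only: Measures, Width, and Wt are hypothetical names. */
proc univariate data=Measures vardef=df exclnpwgt;
   var Width;
   weight Wt;   /* weights applied to the Moments table statistics */
run;
```

No plot statements appear in this step, consistent with the restriction on combining them with the WEIGHT statement.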
[*] In Release 6.12 and earlier releases, observations were used in the analysis if and only if the WEIGHT variable value was greater than zero.
[ ] In Release 6.12 and earlier releases, weighted skewness and kurtosis were not computed.
[*] In Release 6.12 and earlier releases, the weights did not affect the computation of percentiles and the procedure did not exclude the observations with missing weights from the count of observations.