This example computes univariate statistics for two variables. The following statements create the data set BPressure , which contains the systolic ( Systolic ) and diastolic ( Diastolic ) blood pressure readings for 22 patients :
data BPressure; length PatientID ; input PatientID $ Systolic Diastolic @@; datalines; CK 120 50 SS 96 60 FR 100 70 CP 120 75 BL 140 90 ES 120 70 CP 165 110 JI 110 40 MC 119 66 FC 125 76 RW 133 60 KD 108 54 DS 110 50 JW 130 80 BH 120 65 JW 134 80 SB 118 76 NS 122 78 GS 122 70 AB 122 78 EC 112 62 HH 122 82 ; run;
The following statements produce descriptive statistics and quantiles for the variables Systolic and Diastolic :
title 'Systolic and Diastolic Blood Pressure'; ods select BasicMeasures Quantiles; proc univariate data=BPressure; var Systolic Diastolic; run;
The ODS SELECT statement restricts the output, which is shown in Output 3.1.1, to the BasicMeasures and Quantiles tables; see the section ODS Table Names on page 309. You use the PROC UNIVARIATE statement to request univariate statistics for the variables listed in the VAR statement, which specifies the analysis variables and their order in the output. Formulas for computing the statistics in the BasicMeasures table are provided in the section Descriptive Statistics on page 269. The quantiles are calculated using Definition 5 , which is the default definition; see the section Calculating Percentiles on page 273.
Systolic and Diastolic Blood Pressure The UNIVARIATE Procedure Variable: Systolic Basic Statistical Measures Location Variability Mean 121.2727 Std Deviation 14.28346 Median 120.0000 Variance 204.01732 Mode 120.0000 Range 69.00000 Interquartile Range 13.00000 NOTE: The mode displayed is the smallest of 2 modes with a count of 4. Quantiles (Definition 5) Quantile Estimate 100% Max 165 99% 165 95% 140 90% 134 75% Q3 125 50% Median 120 25% Q1 112 10% 108 5% 100 1% 96 0% Min 96 Systolic and Diastolic Blood Pressure The UNIVARIATE Procedure Variable: Diastolic Basic Statistical Measures Location Variability Mean 70.09091 Std Deviation 15.16547 Median 70.00000 Variance 229.99134 Mode 70.00000 Range 70.00000 Interquartile Range 18.00000 Quantiles (Definition 5) Quantile Estimate 100% Max 110 99% 110 95% 90 90% 82 75% Q3 78 50% Median 70 25% Q1 60 10% 50 5% 50 1% 40 0% Min 40
A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.
An instructor is interested in calculating all the modes of the scores on a recent exam. The following statements create a data set named Exam , which contains the exam scores in the variable Score :
data Exam; label Score = 'Exam Score'; input Score @@; datalines; 81 97 78 99 77 81 84 86 86 97 85 86 94 76 75 42 91 90 88 86 97 97 89 69 72 82 83 81 80 81 ; run;
The following statements use the MODES option to request a table of all possible modes:
title 'Table of Modes for Exam Scores'; ods select Modes; proc univariate data=Exam modes; var Score; run;
The ODS SELECT statement restricts the output to the Modes table; see the section ODS Table Names on page 309.
Table of Modes for Exam Scores The UNIVARIATE Procedure Variable: Score (Exam Score) Modes Mode Count 81 4 86 4 97 4
By default, when the MODES option is used, and there is more than one mode, the lowest mode is displayed in the BasicMeasures table. The following statements illustrate the default behavior:
title 'Default Output'; ods select BasicMeasures; proc univariate data=Exam; var Score; run;
Default Output The UNIVARIATE Procedure Variable: Score (Exam Score) Basic Statistical Measures Location Variability Mean 83.66667 Std Deviation 11.08069 Median 84.50000 Variance 122.78161 Mode 81.00000 Range 57.00000 Interquartile Range 10.00000 NOTE: The mode displayed is the smallest of 3 modes with a count of 4.
The default output displays a mode of 81 and includes a note regarding the number of modes; the modes 86 and 97 are not displayed. The ODS SELECT statement restricts the output to the BasicMeasures table; see the section ODS Table Names on page 309.
A sample program, uniex02.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which uses the data set BPressure introduced in Example 3.1, illustrates how to produce a table of the extreme observations and a table of the extreme values in a data set. The following statements generate the Extreme Observations tables for Systolic and Diastolic , which enable you to identify the extreme observations for each variable:
title 'Extreme Blood Pressure Observations'; ods select ExtremeObs; proc univariate data=BPressure; var Systolic Diastolic; id PatientID; run;
The ODS SELECT statement restricts the output to the ExtremeObs table; see the section ODS Table Names on page 309. The ID statement requests that the extreme observations are to be identified using the value of PatientID as well as the observation number. By default, the five lowest and five highest observations are displayed. You can use the NEXTROBS= option to request a different number of extreme observations.
Output 3.3.1 shows that the patient identified as ˜CP (Observation 7) has the highest values for both Systolic and Diastolic . To visualize extreme observations, you can create histograms; see Example 3.14.
Extreme Blood Pressure Observations The UNIVARIATE Procedure Variable: Systolic Extreme Observations ---------Lowest--------- ---------Highest-------- Patient Patient Value ID Obs Value ID Obs 96 SS 2 130 JW 14 100 FR 3 133 RW 11 108 KD 12 134 JW 16 110 DS 13 140 BL 5 110 JI 8 165 CP 7 Extreme Blood Pressure Observations The UNIVARIATE Procedure Variable: Diastolic Extreme Observations ---------Lowest--------- ---------Highest-------- Patient Patient Value ID Obs Value ID Obs 40 JI 8 80 JW 14 50 DS 13 80 JW 16 50 CK 1 82 HH 22 54 KD 12 90 BL 5 60 RW 11 110 CP 7
The following statements generate the Extreme Values tables for Systolic and Diastolic , which tabulate the tails of the distributions:
title 'Extreme Blood Pressure Values'; ods select ExtremeValues; proc univariate data=BPressure nextrval=5; var Systolic Diastolic; run;
The ODS SELECT statement restricts the output to the ExtremeValues table; see the section ODS Table Names on page 309. The NEXTRVAL= option specifies the number of extreme values at each end of the distribution to be shown in the tables in Output 3.3.2.
Extreme Blood Pressure Values The UNIVARIATE Procedure Variable: Systolic Extreme Values ---------Lowest-------- --------Highest-------- Order Value Freq Order Value Freq 1 96 1 11 130 1 2 100 1 12 133 1 3 108 1 13 134 1 4 110 2 14 140 1 5 112 1 15 165 1 Extreme Blood Pressure Values The UNIVARIATE Procedure Variable: Diastolic Extreme Values ---------Lowest-------- --------Highest-------- Order Value Freq Order Value Freq 1 40 1 11 78 2 2 50 2 12 80 2 3 54 1 13 82 1 4 60 2 14 90 1 5 62 1 15 110 1
Output 3.3.2 shows that the values 78 and 80 occurred twice for Diastolic and the maximum of Diastolic is 110. Note that Output 3.3.1 displays the value of 80 twice for Diastolic because there are two observations with that value. In Output 3.3.2, the value 80 is only displayed once.
A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.
An instructor is interested in creating a frequency table of score changes between a pair of tests given in one of his college courses. The data set Score contains test scores for his students who took a pretest and a posttest on the same material. The variable ScoreChange contains the difference between the two test scores. The following statements create the data set:
data Score; input Student $ PreTest PostTest @@; label ScoreChange = Change in Test Scores; ScoreChange = PostTest - PreTest; datalines; Capalleti 94 91 Dubose 51 65 Engles 95 97 Grant 63 75 Krupski 80 75 Lundsford 92 55 Mcbane 75 78 Mullen 89 82 Nguyen 79 76 Patel 71 77 Si 75 70 Tanaka 87 73 ; run;
The following statements produce a frequency table for the variable ScoreChange :
title 'Analysis of Score Changes'; ods select Frequencies; proc univariate data=Score freq; var ScoreChange; run;
The ODS SELECT statement restricts the output to the Frequencies table; see the section ODS Table Names on page 309. The FREQ option on the PROC UNIVARIATE statement requests the table of frequencies shown in Output 3.4.1.
Analysis of Score Changes The UNIVARIATE Procedure Variable: ScoreChange (Change in Test Scores) Frequency Counts Percents Percents Percents Value Count Cell Cum Value Count Cell Cum Value Count Cell Cum -37 1 8.3 8.3 -3 2 16.7 58.3 6 1 8.3 83.3 -14 1 8.3 16.7 2 1 8.3 66.7 12 1 8.3 91.7 -7 1 8.3 25.0 3 1 8.3 75.0 14 1 8.3 100.0 -5 2 16.7 41.7
From Output 3.4.1, the instructor sees that only score changes of ˆ’ 3 and ˆ’ 5 occurred more than once.
A sample program, uniex03.sas , for this example is available in the SAS Sample Library for Base SAS software.
The PLOT option in the PROC UNIVARIATE statement requests several basic plots for display in line printer output. For more information on plots created by the PLOT option, see the section Creating Line Printer Plots on page 281. This example illustrates the use of the PLOT option as well as BY processing in PROC UNIVARIATE.
A researcher is analyzing a data set consisting of air pollution data from three different measurement sites. The data set AirPoll , created by the following statements, contains the variables Site and Ozone , which are the site number and ozone level, respectively.
data AirPoll (keep = Site Ozone); label Site = 'Site Number' Ozone = 'Ozone level (in ppb)'; do i = 1 to 3; input Site @@; do j = 1 to 15; input Ozone @@; output; end; end; datalines; 102 4 6 3 4 7 8 2 3 4 1 3 8 9 5 6 134 5 3 6 2 1 2 4 3 2 4 6 4 6 3 1 137 8 9 7 8 6 7 6 7 9 8 9 8 7 8 5 ; run;
The following statements produce stem-and-leaf plots, box plots, and normal probability plots for each site in the AirPoll data set:
ods select Plots SSPlots; proc univariate data=AirPoll plot; by Site; var Ozone; run;
The PLOT option produces a stem-and-leaf plot, a box plot, and a normal probability plot for the Ozone variable at each site. Since the BY statement is used, a side-by-side box plot is also created to compare the ozone levels across sites. Note that AirPoll is sorted by Site ; in general, the data set should be sorted by the BY variable using the SORT procedure. The ODS SELECT statement restricts the output to the Plots and SSPlots tables; see the section ODS Table Names on page 309. Optionally, you can specify the PLOTSIZE= n option to control the approximate number of rows (between 8 and the page size ) that the plots occupy.
Output 3.5.1 through Output 3.5.3 show the plots produced for each BY group . Output 3.5.4 shows the side-by-side box plot for comparing Ozone values across sites.
------------------------------- Site Number=102 -------------------------------- The UNIVARIATE Procedure Variable: Ozone (Ozone level (in ppb)) Stem Leaf # Boxplot 9 0 1 8 00 2 7 0 1 +-----+ 6 00 2 5 0 1 4 000 3 *--+--* 3 000 3 +-----+ 2 0 1 1 0 1 ----+----+----+---- + Normal Probability Plot 9.5+ *++++ * * ++++ * +++++ * *+++ 5.5+ +*++ **+* * *+*+ *++++ 1.5+ *++++ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2
------------------------------- Site Number=134 -------------------------------- The UNIVARIATE Procedure Variable: Ozone (Ozone level (in ppb)) Stem Leaf # Boxplot 6 000 3 5 0 1 +-----+ 4 000 3 3 000 3 *--+--* 2 000 3 +-----+ 1 00 2 ----+----+----+----+ Normal Probability Plot 6.5+ * * ++*+++ * ++++++ **+*+++ **+*+++ *+*+*++ 1.5+ * ++*+++ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2
------------------------------- Site Number=137 -------------------------------- The UNIVARIATE Procedure Variable: Ozone (Ozone level (in ppb)) Stem Leaf # Boxplot 9 000 3 8 00000 5 +-----+ 7 0000 4 +--+--+ 6 00 2 5 0 1 0 ----+----+----+---- + Normal Probability Plot 9.5+ * *++++*++++ * ** *+*+++++ 7.5+ * * **++++++ *++*+++++ 5.5+ +++*++++ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2
The UNIVARIATE Procedure Variable: Ozone (Ozone level (in ppb)) Schematic Plots 10 + 8 + *----- * + +----- + +----- + 6 + + +----- + 0 4 + *----- * + +----- + *----- * 2 + +----- + 0 + ------------+-----------+-----------+----------- Site 102 134 137
Note that you can use the PROBPLOT statement with the NORMAL option to produce high-resolution normal probability plots; see the section Modeling a Data Distribution on page 200.
Note that you can use the BOXPLOT procedure to produce box plots using high-resolution graphics; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User s Guide .
A sample program, uniex04.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to use PROC UNIVARIATE to analyze a data set with a variable that contains the frequency of each observation. The data set Speeding contains data on the number of cars pulled over for speeding on a stretch of highway with a 65 mile per hour speed limit. Speed is the speed at which the cars were traveling, and Number is the number of cars at each speed. The following statements create the data set:
data Speeding; label Speed = 'Speed (in miles per hour)'; do Speed = 66 to 85; input Number @@; output; end; datalines; 2 3 2 1 3 6 8 9 10 13 12 14 6 2 0 0 1 1 0 1 ; run;
The following statements create a table of moments for the variable Speed :
title 'Analysis of Speeding Data'; ods select Moments; proc univariate data=Speeding; freq Number; var Speed; run;
The ODS SELECT statement restricts the output, which is shown in Output 3.6.1, to the Moments table; see the section ODS Table Names on page 309. The FREQ statement specifies that the value of the variable Number represents the frequency of each observation.
Analysis of Speeding Data The UNIVARIATE Procedure Variable: Speed (Speed (in miles per hour)) Freq: Number Moments N 94 Sum Weights 94 Mean 74.3404255 Sum Observations 6988 Std Deviation 3.44403237 Variance 11.861359 Skewness -0.1275543 Kurtosis 0.92002287 Uncorrected SS 520594 Corrected SS 1103.10638 Coeff Variation 4.63278538 Std Error Mean 0.35522482
For the formulas used to compute these moments, see the section Descriptive Statistics on page 269. A sample program, uniex05.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to save summary statistics in an output data set. The following statements create a data set named Belts , which contains the breaking strengths ( Strength ) and widths ( Width ) of a sample of 50 automotive seat belts:
data Belts; label Strength = Breaking Strength (lb/in) Width = 'Width in Inches'; input Strength Width @@; datalines; 1243.51 3.036 1221.95 2.995 1131.67 2.983 1129.70 3.019 1198.08 3.106 1273.31 2.947 1250.24 3.018 1225.47 2.980 1126.78 2.965 1174.62 3.033 1250.79 2.941 1216.75 3.037 1285.30 2.893 1214.14 3.035 1270.24 2.957 1249.55 2.958 1166.02 3.067 1278.85 3.037 1280.74 2.984 1201.96 3.002 1101.73 2.961 1165.79 3.075 1186.19 3.058 1124.46 2.929 1213.62 2.984 1213.93 3.029 1289.59 2.956 1208.27 3.029 1247.48 3.027 1284.34 3.073 1209.09 3.004 1146.78 3.061 1224.03 2.915 1200.43 2.974 1183.42 3.033 1195.66 2.995 1258.31 2.958 1136.05 3.022 1177.44 3.090 1246.13 3.022 1183.67 3.045 1206.50 3.024 1195.69 3.005 1223.49 2.971 1147.47 2.944 1171.76 3.005 1207.28 3.065 1131.33 2.984 1215.92 3.003 1202.17 3.058 ; run;
The following statements produce two output data sets containing summary statistics:
proc univariate data=Belts noprint; var Strength Width; output out=Means mean=StrengthMean WidthMean; output out=StrengthStats mean=StrengthMean std=StrengthSD min=StrengthMin max=StrengthMax; run;
When you specify an OUTPUT statement, you must also specify a VAR statement. You can use multiple OUTPUT statements with a single procedure statement. Each OUTPUT statement creates a new data set with the name specified by the OUT= option. In this example, two data sets, Means and StrengthStats , are created. See Output 3.7.1 for a listing of Means and Output 3.7.2 for a listing of StrengthStats .
Strength Width Obs Mean Mean 1 1205.75 3.00584
Strength Strength Strength Strength Obs Mean SD Max Min 1 1205.75 48.3290 1289.59 1101.73
Summary statistics are saved in an output data set by specifying keyword=names after the OUT= option. In the preceding statements, the first OUTPUT statement specifies the keyword MEAN followed by the names StrengthMean and WidthMean . The second OUTPUT statement specifies the keywords MEAN, STD, MAX, and MIN, for which the names StrengthMean, StrengthSD, StrengthMax , and StrengthMin are given.
The keyword specifies the statistic to be saved in the output data set, and the names determine the names for the new variables. The first name listed after a keyword contains that statistic for the first variable listed in the VAR statement; the second name contains that statistic for the second variable in the VAR statement, and so on.
The data set Means contains the mean of Strength in a variable named StrengthMean and the mean of Width in a variable named WidthMean . The data set StrengthStats contains the mean, standard deviation, maximum value, and minimum value of Strength in the variables StrengthMean, StrengthSD, StrengthMax , and StrengthMin , respectively.
See the section OUT= Output Data Set in the OUTPUT Statement on page 306 for more information on OUT= output data sets.
A sample program, uniex06.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which uses the Belts data set from the previous example, illustrates how to save percentiles in an output data set. The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 75th, 90th, 95th, and 99th percentiles for each variable. You can save these percentiles in an output data set by specifying the appropriate keywords. For example, the following statements create an output data set named PctlStrength , which contains the 5th and 95th percentiles of the variable Strength :
proc univariate data=Belts noprint; var Strength Width; output out=PctlStrength p5=p5str p95=p95str; run;
The output data set PctlStrength is listed in Output 3.8.1.
Obs p95str p5str 1 1284.34 1126.78
You can use the PCTLPTS=, PCTLPRE=, and PCTLNAME= options to save percentiles not automatically computed by the UNIVARIATE procedure. For example, the following statements create an output data set named Pctls , which contains the 20th and 40th percentiles of the variables Strength and Width :
proc univariate data=Belts noprint; var Strength Width; output out=Pctls pctlpts = 20 40 pctlpre = Strength Width pctlname = pct20 pct40; run;
The PCTLPTS= option specifies the percentiles to compute (in this case, the 20th and 40th percentiles). The PCTLPRE= and PCTLNAME= options build the names for the variables containing the percentiles. The PCTLPRE= option gives prefixes for the new variables, and the PCTLNAME= option gives a suffix to add to the prefix. When you use the PCTLPTS= specification, you must also use the PCTLPRE= specification.
The OUTPUT statement saves the 20th and 40th percentiles of Strength and Width in the variables Strengthpct20, Widthpct20, Strengthpct40 , and Weightpct40 . The output data set Pctls is listed in Output 3.8.2.
Obs Strengthpct20 Strengthpct40 Widthpct20 Widthpct40 1 1165.91 1199.26 2.9595 2.995
A sample program, uniex06.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to compute confidence limits for the mean, standard deviation, and variance of a population. A researcher is studying the heights of a certain population of adult females. She has collected a random sample of heights of 75 females, which are saved in the data set Heights :
data Heights; label Height = Height (in); input Height @@; datalines; 64.1 60.9 64.1 64.7 66.7 65.0 63.7 67.4 64.9 63.7 64.0 67.5 62.8 63.9 65.9 62.3 64.1 60.6 68.6 68.6 63.7 63.0 64.7 68.2 66.7 62.8 64.0 64.1 62.1 62.9 62.7 60.9 61.6 64.6 65.7 66.6 66.7 66.0 68.5 64.4 60.5 63.0 60.0 61.6 64.3 60.2 63.5 64.7 66.0 65.1 63.6 62.0 63.6 65.8 66.0 65.4 63.5 66.3 66.2 67.5 65.8 63.1 65.8 64.4 64.0 64.9 65.7 61.0 64.1 65.5 68.6 66.6 65.7 65.1 70.0 ; run;
The following statements produce confidence limits for the mean, standard deviation, and variance of the population of heights:
title 'Analysis of Female Heights'; ods select BasicIntervals; proc univariate data=Heights cibasic; var Height; run;
The CIBASIC option requests confidence limits for the mean, standard deviation, and variance. For example, Output 3.9.1 shows that the 95% confidence interval for the population mean is (64 . 06 , 65 . 07). The ODS SELECT statement restricts the output to the BasicIntervals table; see the section ODS Table Names on page 309.
Analysis of Female Heights The UNIVARIATE Procedure Variable: Height (Height (in)) Basic Confidence Limits Assuming Normality Parameter Estimate 95% Confidence Limits Mean 64.56667 64.06302 65.07031 Std Deviation 2.18900 1.88608 2.60874 Variance 4.79171 3.55731 6.80552
The confidence limits in Output 3.9.1 assume that the heights are normally distributed, so you should check this assumption before using these confidence limits.
See the section Shapiro-Wilk Statistic on page 293 for information on the Shapiro-Wilk test for normality in PROC UNIVARIATE. See Example 3.19 for an example using the test for normality.
By default, the confidence limits produced by the CIBASIC option produce 95% confidence intervals. You can request different level confidence limits by using the ALPHA= option in parentheses after the CIBASIC option. The following statements produce 90% confidence limits:
title 'Analysis of Female Heights'; ods select BasicIntervals; proc univariate data=Heights cibasic(alpha=.1); var Height; run;
The 90% confidence limits are displayed in Output 3.9.2.
Analysis of Female Heights The UNIVARIATE Procedure Variable: Height (Height (in)) Basic Confidence Limits Assuming Normality Parameter Estimate 90% Confidence Limits Mean 64.56667 64.14564 64.98770 Std Deviation 2.18900 1.93114 2.53474 Variance 4.79171 3.72929 6.42492
For the formulas used to compute these limits, see the section Confidence Limits for Parameters of the Normal Distribution on page 277.
A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.9, illustrates how to compute confidence limits for quantiles and percentiles. A second researcher is more interested in summarizing the heights with quantiles than the mean and standard deviation. He is also interested in computing 90% confidence intervals for the quantiles. The following statements produce estimated quantiles and confidence limits for the population quantiles:
title 'Analysis of Female Heights'; ods select Quantiles; proc univariate data=Heights ciquantnormal(alpha=.1); var Height; run;
The ODS SELECT statement restricts the output to the Quantiles table; see the section ODS Table Names on page 309. The CIQUANTNORMAL option produces confidence limits for the quantiles. As noted in Output 3.10.1, these limits assume that the data are normally distributed. You should check this assumption before using these confidence limits. See the section Shapiro-Wilk Statistic on page 293 for information on the Shapiro-Wilk test for normality in PROC UNIVARIATE; see Example 3.19 for an example using the test for normality.
Analysis of Female Heights The UNIVARIATE Procedure Variable: Height (Height (in)) Quantiles (Definition 5) 90% Confidence Limits Quantile Estimate Assuming Normality 100% Max 70.0 99% 70.0 68.94553 70.58228 95% 68.6 67.59184 68.89311 90% 67.5 66.85981 68.00273 75% Q3 66.0 65.60757 66.54262 50% Median 64.4 64.14564 64.98770 25% Q1 63.1 62.59071 63.52576 10% 61.6 61.13060 62.27352 5% 60.6 60.24022 61.54149 1% 60.0 58.55106 60.18781 0% Min 60.0
It is also possible to use PROC UNIVARIATE to compute confidence limits for quantiles without assuming normality. The following statements use the CIQUANTDF option to request distribution-free confidence limits for the quantiles of the population of heights:
title 'Analysis of Female Heights'; ods select Quantiles; proc univariate data=Heights ciquantdf(alpha=.1); var Height; run;
The distribution-free confidence limits are shown in Output 3.10.2.
Analysis of Female Heights The UNIVARIATE Procedure Variable: Height (Height (in)) Quantiles (Definition 5) 90% Confidence Limits --------Order Statistics------- Quantile Estimate Distribution Free LCL Rank UCL Rank Coverage 100% Max 70.0 99% 70.0 68.6 70.0 73 75 48.97 95% 68.6 67.5 70.0 68 75 94.50 90% 67.5 66.6 68.6 63 72 91.53 75% Q3 66.0 65.7 66.6 50 63 91.77 50% Median 64.4 64.1 65.1 31 46 91.54 25% Q1 63.1 62.7 63.7 13 26 91.77 10% 61.6 60.6 62.7 4 13 91.53 5% 60.6 60.0 61.6 1 8 94.50 1% 60.0 60.0 60.5 1 3 48.97 0% Min 60.0
The table in Output 3.10.2 includes the ranks from which the confidence limits are computed. For more information on how these confidence limits are calculated, see the section Confidence Limits for Percentiles on page 274. Note that confidence limits for quantiles are not produced when the WEIGHT statement is used.
A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how you can use the UNIVARIATE procedure to compute robust estimates of location and scale. The following statements compute these estimates for the variable Systolic in the data set BPressure , which was introduced in Example 3.1:
title 'Robust Estimates for Blood Pressure Data'; ods select TrimmedMeans WinsorizedMeans RobustScale; proc univariate data=BPressure trimmed=1 .1 winsorized=.1 robustscale; var Systolic; run;
The ODS SELECT statement restricts the output to the TrimmedMeans, WinsorizedMeans, and RobustScale tables; see the section ODS Table Names on page 309. The TRIMMED= option computes two trimmed means, the first after removing one observation and the second after removing 10% of the observations. If the value of TRIMMED= is greater than or equal to one, it is interpreted as the number of observations to be trimmed. The WINSORIZED= option computes a Winsorized mean that replaces three observations from the tails with the next closest observations. (Three observations are replaced because np = (22)( . 1) = 2 . 2, and three is the smallest integer greater than 2.2.) The trimmed and Winsorized means for Systolic are displayed in Output 3.11.1.
Robust Estimates for Blood Pressure Data The UNIVARIATE Procedure Variable: Systolic Trimmed Means Percent Number Std Error Trimmed Trimmed Trimmed Trimmed 95% Confidence t for H0: in Tail in Tail Mean Mean Limits DF Mu0=0.00 Pr > t 4.55 1 120.3500 2.573536 114.9635 125.7365 19 46.76446 <.0001 13.64 3 120.3125 2.395387 115.2069 125.4181 15 50.22675 <.0001 Winsorized Means Percent Number Std Error Winsorized Winsorized Winsorized Winsorized 95% Confidence t for H0: in Tail in Tail Mean Mean Limits DF Mu0=0.00 Pr > t 13.64 3 120.6364 2.417065 115.4845 125.7882 15 49.91027 <.0001
Output 3.11.1 shows the trimmed mean for Systolic is 120.35 after one observation has been trimmed, and 120.31 after 3 observations are trimmed. The Winsorized mean for Systolic is 120.64. For details on trimmed and Winsorized means, see the section Robust Estimators on page 278. The trimmed means can be compared with the means shown in Output 3.1.1 (from Example 3.1), which displays the mean for Systolic as 121.273.
The ROBUSTSCALE option requests a table, displayed in Output 3.11.2, which includes the interquartile range, Gini s mean difference, the median absolute deviation about the median, Q n , and S n .
Robust Estimates for Blood Pressure Data Variable: Systolic Robust Measures of Scale Estimate Measure Value of Sigma Interquartile Range 13.00000 9.63691 Gini's Mean Difference 15.03030 13.32026 MAD 6.50000 9.63690 Sn 9.54080 9.54080 Qn 13.33140 11.36786
Output 3.11.2 shows the robust estimates of scale for Systolic . For instance, the interquartile range is 13. The estimates of ƒ range from 9.54 to 13.32. See the section Robust Estimators on page 278.
A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.9, illustrates how to carry out three tests for location: the Student s t test, the sign test, and the Wilcoxon signed rank test. These tests are discussed in the section Tests for Location on page 275.
The following statements demonstrate the tests for location using the Heights data set introduced in Example 3.9. Since the data consists of adult female heights, the researchers are not interested in testing if the mean of the population is equal to zero inches, which is the default µ value. Instead, they are interested in testing if the mean is equal to 66 inches. The following statements test the null hypothesis H : µ = 66:
title 'Analysis of Female Height Data'; ods select TestsForLocation LocationCounts; proc univariate data=Heights mu0=66 loccount; var Height; run;
The ODS SELECT statement restricts the output to the TestsForLocation and LocationCounts tables; see the section ODS Table Names on page 309. The MU0= option specifies the null hypothesis value of µ for the tests for location; by default, µ = 0. The LOCCOUNT option produces the table of the number of observations greater than, not equal to, and less than 66 inches.
Output 3.12.1 contains the results of the tests for location. All three tests are highly significant, causing the researchers to reject the hypothesis that the mean is 66 inches.
Analysis of Female Height Data The UNIVARIATE Procedure Variable: Height (Height (in)) Tests for Location: Mu0=66 Test -Statistic- -----p Value------ Student's t t -5.67065 Pr > t <.0001 Sign M -20 Pr >= M <.0001 Signed Rank S -849 Pr >= S <.0001 Location Counts: Mu0=66.00 Count Value Num Obs > Mu0 16 Num Obs ^= Mu0 72 Num Obs < Mu0 56
A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example demonstrates a sign test for paired data, which is a specific application of the Tests for Location discussed in Example 3.12.
The instructor from Example 3.4 is now interested in performing a sign test for the pairs of test scores in his college course. The following statements request basic statistical measures and tests for location:
title 'Test Scores for a College Course'; ods select BasicMeasures TestsForLocation; proc univariate data=Score; var ScoreChange; run;
The ODS SELECT statement restricts the output to the BasicMeasures and TestsForLocation tables; see the section ODS Table Names on page 309. The instructor is not willing to assume the ScoreChange variable is normal or even symmetric, so he decides to examine the sign test. The large p -value (0.7744) of the sign test provides insufficient evidence of a difference in test score medians.
Test Scores for a College Course The UNIVARIATE Procedure Variable: ScoreChange (Change in Test Scores) Basic Statistical Measures Location Variability Mean -3.08333 Std Deviation 13.33797 Median -3.00000 Variance 177.90152 Mode -5.00000 Range 51.00000 Interquartile Range 10.50000 NOTE: The mode displayed is the smallest of 2 modes with a count of 2. Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t -0.80079 Pr > t 0.4402 Sign M -1 Pr >= M 0.7744 Signed Rank S -8.5 Pr >= S 0.5278
A sample program, uniex03.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to create a histogram. A semiconductor manufacturer produces printed circuit boards that are sampled to determine the thickness of their copper plating . The following statements create a data set named Trans , which contains the plating thicknesses ( Thick ) of 100 boards:
data Trans; input Thick @@; label Thick = 'Plating Thickness (mils)'; datalines; 3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512 3.490 3.467 3.498 3.519 3.504 3.469 3.497 3.495 3.518 3.523 3.458 3.478 3.443 3.500 3.449 3.525 3.461 3.489 3.514 3.470 3.561 3.506 3.444 3.479 3.524 3.531 3.501 3.495 3.443 3.458 3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516 3.476 3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520 3.505 3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491 3.510 3.461 3.431 3.502 3.491 3.506 3.439 3.513 3.496 3.539 3.469 3.481 3.515 3.535 3.460 3.575 3.488 3.515 3.484 3.482 3.517 3.483 3.467 3.467 3.502 3.471 3.516 3.474 3.500 3.466 ; run;
The following statements create the histogram shown in Output 3.14.1.
title 'Analysis of Plating Thickness'; proc univariate data=Trans noprint; histogram Thick; run;
The NOPRINT option in the PROC UNIVARIATE statement suppresses tables of summary statistics for the variable Thick that would be displayed by default. A histogram is created for each variable listed in the HISTOGRAM statement.
A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to create a comparative histogram. The effective channel length (in microns) is measured for 1225 field effect transistors . The channel lengths ( Length ) are stored in a data set named Channel , which is partially listed in Output 3.15.1:
The Data Set Channel Lot Length Lot 1 0.91 . . Lot 1 1.17 Lot 2 1.47 . . Lot 2 1.39 Lot 3 2.04 . . Lot 3 1.91
The following statements request a histogram, which is shown in Output 3.15.2 of Length ignoring the lot source:
title 'Histogram of Length Ignoring Lot Source'; proc univariate data=Channel noprint; histogram Length; run;
To investigate whether the peaks (modes) in Output 3.15.2 are related to the lot source, you can create a comparative histogram using Lot as a classification variable. The following statements create the histogram shown in Output 3.15.3:
title 'Comparative Analysis of Lot Source'; proc univariate data=Channel noprint; class Lot; histogram Length / nrows = 3; run;
The CLASS statement requests comparisons for each level (distinct value) of the classification variable Lot . The HISTOGRAM statement requests a comparative histogram for the variable Length . The NROWS= option specifies the number of rows in the comparative histogram. By default, comparative histograms are displayed in two rows per panel.
Output 3.15.3 reveals that the distributions of Length are similarly distributed except for shifts in mean.
A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to create a two-way comparative histogram. Two suppliers (A and B) provide disk drives for a computer manufacturer. The manufacturer measures the disk drive opening width to determine whether there has been a change in variability from 2002 to 2003 for each supplier.
The following statements save the measurements in a data set named Disk . There are two classification variables, Supplier and Year , and a user-defined format is associated with Year .
proc format ; value mytime 1 = '2002' 2 = '2003'; data Disk; input @1 Supplier . Year Width; label Width = 'Opening Width (inches)'; format Year mytime.; datalines; Supplier A 1 1.8932 . . . Supplier B 1 1.8986 Supplier A 2 1.8978 . . . Supplier B 2 1.8997 ;
The following statements create the comparative histogram in Output 3.16.1:
title 'Results of Supplier Training Program'; proc univariate data=Disk noprint; class Supplier Year / keylevel = ('Supplier A' '2003'); histogram Width / intertile = 1.0 vaxis = 0 10 20 30 ncols = 2 nrows = 2 cfill = ligr cframetop = yellow cframeside = yellow; run;
The KEYLEVEL= option specifies the key cell as the cell for which Supplier is equal to ˜SUPPLIER A and Year is equal to ˜2003. This cell determines the binning for the other cells , and the columns are arranged so that this cell is displayed in the upper left corner. Without the KEYLEVEL= option, the default key cell would be the cell for which Supplier is equal to ˜SUPPLIER A and Year is equal to ˜2002 ; the column labeled ˜2002 would be displayed to the left of the column labeled ˜2003.
The VAXIS= option specifies the tick mark labels for the vertical axis. The NROWS=2 and NCOLS=2 options specify a 2 — 2 arrangement for the tiles. The CFRAMESIDE= and CFRAMETOP= options specify fill colors for the row and column labels, and the CFILL= option specifies a fill color for the bars. Output 3.16.1 provides evidence that both suppliers have reduced variability from 2002 to 2003.
A sample program, uniex10.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to add insets with descriptive statistics to a comparative histogram; see Output 3.17.1. Three similar machines are used to attach a part to an assembly. One hundred assemblies are sampled from the output of each machine, and a part position is measured in millimeters. The following statements create the data set Machines , which contains the measurements in a variable named Position :
data Machines; input Position @@; label Position = 'Position in Millimeters'; if (_n_ <= 100) then Machine = 'Machine 1'; else if (_n_ <= 200) then Machine = 'Machine 2'; else Machine = 'Machine 3'; datalines; -0.17 -0.19 -0.24 -0.24 -0.12 0.07 -0.61 0.22 1.91 -0.08 -0.59 0.05 -0.38 0.82 -0.14 0.32 0.12 -0.02 0.26 0.19 ... 0.48 0.41 0.78 0.58 0.43 0.07 0.27 0.49 0.79 0.92 0.79 0.66 0.22 0.71 0.53 0.57 0.90 0.48 1.17 1.03 ; run;
The following statements create the comparative histogram in Output 3.17.1:
title 'Machine Comparision Study'; proc univariate data=Machines noprint; class Machine; histogram Position / nrows = 3 intertile = 1 midpoints = -1.2 to 2.2 by 0.1 vaxis = 0 to 16 by 4 cfill = ligr; inset mean std="Std Dev" / pos = ne format = 6.3; run;
The INSET statement requests insets containing the sample mean and standard deviation for each machine in the corresponding tile. The MIDPOINTS= option specifies the midpoints of the histogram bins .
Output 3.17.1 shows that the average position for Machines 2 and 3 are similar and that the spread for Machine 1 is much larger than for Machines 2 and 3.
A sample program, uniex11.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.14, demonstrates various methods for binning a histogram. This example also illustrates how to save bin percentages in an OUTHISTOGRAM= data set.
The manufacturer from Example 3.14 now wants to enhance the histogram by changing the endpoints of the bins using the ENDPOINTS= option. The following statements create a histogram with bins that have end points 3.425 and 3.6 and width 0.025:
title 'Enhancing a Histogram'; ods select HistogramBins MyHist; proc univariate data=Trans; histogram Thick / midpercents name='MyHist' endpoints = 3.425 to 3.6 by .025; run;
The ODS SELECT statement restricts the output to the HistogramBins table and the MyHist histogram; see the section ODS Table Names on page 309. The ENDPOINTS= option specifies the endpoints for the histogram bins. By default, if the ENDPOINTS= option is not specified, the automatic binning algorithm computes values for the midpoints of the bins. The MIDPERCENTS option requests a table of the midpoints of each histogram bin and the percent of the observations that fall in each bin. This table is displayed in Output 3.18.1; the histogram is displayed in Output 3.18.2. The NAME= option specifies a name for the histogram that can be used in the ODS SELECT statement.
Enhancing a Histogram The UNIVARIATE Procedure Variable: Thick Histogram Bins for Thick Bin Minimum Observed Point Percent 3.425 8.000 3.450 21.000 3.475 25.000 3.500 29.000 3.525 11.000 3.550 5.000 3.575 1.000
The MIDPOINTS= option is an alternative to the ENDPOINTS= option for specifying histogram bins. The following statements create a similar histogram, which is shown in Output 3.18.3, to the one in Output 3.18.2:
title 'Enhancing a Histogram'; proc univariate data=Trans noprint; histogram Thick / midpoints = 3.4375 to 3.5875 by .025 rtinclude outhistogram = OutMdpts; run;
Output 3.18.3 differs from Output 3.18.2 in two ways:
The MIDPOINTS= option specifies the bins for the histogram by specifying the midpoints of the bins instead of specifying the endpoints. Note that the histogram displays midpoints instead of endpoints.
The RTINCLUDE option request that the right endpoint of each bin be included in the histogram interval instead of the default, which is to include the left endpoint in the interval. This changes the histogram slightly from Output 3.18.2. Six observations have a thickness equal to an endpoint of an interval. For instance, there is one observation with a thickness of 3.45 mils. In Output 3.18.3, this observation is included in the bin from 3.425 to 3.45.
The OUTHISTOGRAM= option produces an output data set named OutMdpts , displayed in Output 3.18.4. This data set provides information on the bins of the histogram. For more information, see the section OUTHISTOGRAM= Output Data Set on page 308.
OUTHISTOGRAM= Data Set Obs _VAR_ _MIDPT_ _OBSPCT_ 1 Thick 3.4375 9 2 Thick 3.4625 21 3 Thick 3.4875 26 4 Thick 3.5125 28 5 Thick 3.5375 11 6 Thick 3.5625 5 7 Thick 3.5875 0
A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example is a continuation of Example 3.14. The following statements fit a normal distribution to the thickness measurements in the Trans data set and superimpose the fitted density curve on the histogram:
title 'Analysis of Plating Thickness'; ods select ParameterEstimates GoodnessOfFit FitQuantiles Bins MyPlot; proc univariate data=Trans; histogram Thick / normal(percents=20 40 60 80 midpercents) name='MyPlot'; inset n normal(ksdpval) / pos = ne format = 6.3; run;
The ODS SELECT statement restricts the output to the ParameterEstimates, GoodnessOfFit, FitQuantiles, and Bins tables; see the section ODS Table Names on page 309. The NORMAL option requests specifies that the normal curve is to be displayed on the histogram shown in Output 3.19.3. It also requests a summary of the fitted distribution, which is shown in Output 3.19.1 and Output 3.19.2. This summary includes goodness-of-fit tests, parameter estimates, and quantiles of the fitted distribution. (If you specify the NORMALTEST option in the PROC UNIVARIATE statement, the Shapiro-Wilk test for normality will be included in the tables of statistical output.)
Analysis of Plating Thickness The UNIVARIATE Procedure Fitted Distribution for Thick Parameters for Normal Distribution Parameter Symbol Estimate Mean Mu 3.49533 Std Dev Sigma 0.032117 Goodness-of-Fit Tests for Normal Distribution Test ---Statistic---- -----p Value----- Kolmogorov-Smirnov D 0.05563823 Pr > D >0.150 Cramer-von Mises W-Sq 0.04307548 Pr > W-Sq >0.250 Anderson-Darling A-Sq 0.27840748 Pr > A-Sq >0.250
Analysis of Plating Thickness Fitted Distribution for Thick Histogram Bin Percents for Normal Distribution Bin -------Percent------ Midpoint Observed Estimated 3.43 3.000 3.296 3.45 9.000 9.319 3.47 23.000 18.091 3.49 19.000 24.124 3.51 24.000 22.099 3.53 15.000 13.907 3.55 3.000 6.011 3.57 4.000 1.784 Quantiles for Normal Distribution ------Quantile------ Percent Observed Estimated 20.0 3.46700 3.46830 40.0 3.48350 3.48719 60.0 3.50450 3.50347 80.0 3.52250 3.52236
Two secondary options are specified in parentheses after the NORMAL primary option. The PERCENTS= option specifies quantiles, which are to be displayed in the FitQuantiles table. The MIDPERCENTS option requests a table that lists the midpoints, the observed percentage of observations, and the estimated percentage of the population in each interval (estimated from the fitted normal distribution). See Table 3.3 on page 214 and Table 3.8 on page 215 for the secondary options that can be specified with after the NORMAL primary option.
The histogram of the variable Thick with a superimposed normal curve is shown in Output 3.19.3.
The estimated parameters for the normal curve ( = 3 . 50 and = 0 . 03) are shown in Output 3.19.1. By default, the parameters are estimated unless you specify values with the MU= and SIGMA= secondary options after the NORMAL primary option. The results of three goodness-of-fit tests based on the empirical distribution function (EDF) are displayed in Output 3.19.1. Since the p -values are all greater than 0.15, the hypothesis of normality is not rejected.
A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example is a continuation of Example 3.15, which introduced the data set Channel on page 334. In Output 3.15.3, it appears that the channel lengths in each lot are normally distributed. The following statements use the NORMAL option to fit a normal distribution for each lot:
title 'Comparative Analysis of Lot Source'; proc univariate data=Channel noprint; class Lot; histogram Length / nrows = 3 intertile = 1 cprop = orange normal(color = black noprint); inset n = "N" / pos = nw; run;
The NOPRINT option in the PROC UNIVARIATE statement suppresses the tables of statistical output produced by default; the NOPRINT option in parentheses after the NORMAL option suppresses the tables of statistical output related to the fit of the normal distribution. The normal parameters are estimated from the data for each lot, and the curves are superimposed on each component histogram. The INTERTILE= option specifies the space between the framed areas, which are referred to as tiles. The CPROP= option requests the shaded bars above each tile, which represent the relative frequencies of observations in each lot. The comparative histogram is displayed in Output 3.20.1.
A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.
You can use a beta distribution to model the distribution of a variable that is known to vary between lower and upper bounds. In this example, a manufacturing company uses a robotic arm to attach hinges on metal sheets. The attachment point should be offset 10.1 mm from the left edge of the sheet. The actual offset varies between 10.0 and 10.5 mm due to variation in the arm. The following statements save the offsets for 50 attachment points as the values of the variable Length in the data set Robots :
data Robots; input Length @@; label Length = 'Attachment Point Offset (in mm)'; datalines; 10.147 10.070 10.032 10.042 10.102 10.034 10.143 10.278 10.114 10.127 10.122 10.018 10.271 10.293 10.136 10.240 10.205 10.186 10.186 10.080 10.158 10.114 10.018 10.201 10.065 10.061 10.133 10.153 10.201 10.109 10.122 10.139 10.090 10.136 10.066 10.074 10.175 10.052 10.059 10.077 10.211 10.122 10.031 10.322 10.187 10.094 10.067 10.094 10.051 10.174 ; run;
The following statements create a histogram with a fitted beta density curve, shown in Output 3.21.1:
title 'Fitted Beta Distribution of Offsets'; ods select ParameterEstimates FitQuantiles MyHist; proc univariate data=Robots; histogram Length / beta(theta=10 scale=0.5 color=red fill) cfill = yellow cframe = ligr href = 10 hreflabel = 'Lower Bound' lhref = 2 vaxis = axis1 name = 'MyHist'; axis1 label=(a=90 r=0); inset n = 'Sample Size' beta / pos=ne cfill=blank; run;
The ODS SELECT statement restricts the output to the ParameterEstimates and FitQuantiles tables; see the section ODS Table Names on page 309. The BETA primary option requests a fitted beta distribution. The THETA= secondary option specifies the lower threshold. The SCALE= secondary option specifies the range between the lower threshold and the upper threshold. Note that the default THETA= and SCALE= values are zero and one, respectively.
The FILL secondary option specifies that the area under the curve is to be filled with the CFILL= color. (If FILL were omitted, the CFILL= color would be used to fill the histogram bars instead.)
The HREF= option draws a reference line at the lower bound, and the HREFLABEL= option adds the label Lower Bound . The LHREF= option specifies a dashed line type for the reference line. The INSET statement adds an inset with the sample size positioned in the northeast corner of the plot.
In addition to displaying the beta curve, the BETA option requests a summary of the curve fit. This summary, which includes parameters for the curve and the observed and estimated quantiles, is shown in Output 3.21.2. A sample program, uniex12.sas , for this example is available in the SAS Sample Library for Base SAS software.
Fitted Beta Distribution of Offsets Fitted Distribution for Length Parameters for Beta Distribution Parameter Symbol Estimate Threshold Theta 10 Scale Sigma 0.5 Shape Alpha 2.06832 Shape Beta 6.022479 Mean 10.12782 Std Dev 0.072339 Quantiles for Beta Distribution ------Quantile------ Percent Observed Estimated 1.0 10.0180 10.0124 5.0 10.0310 10.0285 10.0 10.0380 10.0416 25.0 10.0670 10.0718 50.0 10.1220 10.1174 75.0 10.1750 10.1735 90.0 10.2255 10.2292 95.0 10.2780 10.2630 99.0 10.3220 10.3237
To determine an appropriate model for a data distribution, you should consider curves from several distribution families. As shown in this example, you can use the HISTOGRAM statement to fit more than one distribution and display the density curves on a histogram.
The gap between two plates is measured (in cm) for each of 50 welded assemblies selected at random from the output of a welding process. The following statements save the measurements (Gap) in a data set named Plates :
data Plates; label Gap = 'Plate Gap in cm'; input Gap @@; datalines; 0.746 0.357 0.376 0.327 0.485 1.741 0.241 0.777 0.768 0.409 0.252 0.512 0.534 1.656 0.742 0.378 0.714 1.121 0.597 0.231 0.541 0.805 0.682 0.418 0.506 0.501 0.247 0.922 0.880 0.344 0.519 1.302 0.275 0.601 0.388 0.450 0.845 0.319 0.486 0.529 1.547 0.690 0.676 0.314 0.736 0.643 0.483 0.352 0.636 1.080 ; run;
The following statements fit three distributions (lognormal, Weibull, and gamma) and display their density curves on a single histogram:
title 'Distribution of Plate Gaps'; ods select ParameterEstimates GoodnessOfFit FitQuantiles MyHist; proc univariate data=Plates; var Gap; histogram / midpoints=0.2 to 1.8 by 0.2 lognormal (l=1) weibull (l=2) gamma (l=8) vaxis = axis1 name = 'MyHist'; inset n mean(5.3) std='Std Dev'(5.3) skewness(5.3) / pos = ne header = 'Summary Statistics'; axis1 label=(a=90 r=0); run;
The ODS SELECT statement restricts the output to the ParameterEstimates, GoodnessOfFit, and FitQuantiles tables; see the section ODS Table Names on page 309. The LOGNORMAL, WEIBULL, and GAMMA primary options request superimposed fitted curves on the histogram in Output 3.22.1. The L= secondary options specify distinct line types for the curves. Note that a threshold parameter = 0 is assumed for each curve. In applications where the threshold is not zero, you can specify with the THETA= secondary option.
The LOGNORMAL, WEIBULL, and GAMMA options also produce the summaries for the fitted distributions shown in Output 3.22.2 through Output 3.22.5.
Distribution of Plate Gaps Fitted Distributions for Gap Parameters for Lognormal Distribution Parameter Symbol Estimate Threshold Theta 0 Scale Zeta -0.58375 Shape Sigma 0.499546 Mean 0.631932 Std Dev 0.336436 Goodness-of-Fit Tests for Lognormal Distribution Test ---Statistic---- -----p Value----- Kolmogorov-Smirnov D 0.06441431 Pr > D >0.150 Cramer-von Mises W-Sq 0.02823022 Pr > W-Sq >0.500 Anderson-Darling A-Sq 0.24308402 Pr > A-Sq >0.500
Distribution of Plate Gaps Fitted Distributions for Gap Quantiles for Lognormal Distribution ------Quantile------ Percent Observed Estimated 1.0 0.23100 0.17449 5.0 0.24700 0.24526 10.0 0.29450 0.29407 25.0 0.37800 0.39825 50.0 0.53150 0.55780 75.0 0.74600 0.78129 90.0 1.10050 1.05807 95.0 1.54700 1.26862 99.0 1.74100 1.78313
Distribution of Plate Gaps Fitted Distributions for Gap Parameters for Weibull Distribution Parameter Symbol Estimate Threshold Theta 0 Scale Sigma 0.719208 Shape C 1.961159 Mean 0.637641 Std Dev 0.339248 Goodness-of-Fit Tests for Weibull Distribution Test ---Statistic---- -----p Value----- Cramer-von Mises W-Sq 0.15937281 Pr > W-Sq 0.016 Anderson-Darling A-Sq 1.15693542 Pr > A-Sq <0.010 Quantiles for Weibull Distribution ------Quantile------ Percent Observed Estimated 1.0 0.23100 0.06889 5.0 0.24700 0.15817 10.0 0.29450 0.22831 25.0 0.37800 0.38102 50.0 0.53150 0.59661 75.0 0.74600 0.84955 90.0 1.10050 1.10040 95.0 1.54700 1.25842 99.0 1.74100 1.56691
Distribution of Plate Gaps Fitted Distributions for Gap Parameters for Gamma Distribution Parameter Symbol Estimate Threshold Theta 0 Scale Sigma 0.155198 Shape Alpha 4.082646 Mean 0.63362 Std Dev 0.313587 Goodness-of-Fit Tests for Gamma Distribution Test ---Statistic---- -----p Value----- Kolmogorov-Smirnov D 0.09695325 Pr > D >0.250 Cramer-von Mises W-Sq 0.07398467 Pr > W-Sq >0.250 Anderson-Darling A-Sq 0.58106613 Pr > A-Sq 0.137 Quantiles for Gamma Distribution ------Quantile------ Percent Observed Estimated 1.0 0.23100 0.13326 5.0 0.24700 0.21951 10.0 0.29450 0.27938 25.0 0.37800 0.40404 50.0 0.53150 0.58271 75.0 0.74600 0.80804 90.0 1.10050 1.05392 95.0 1.54700 1.22160 99.0 1.74100 1.57939
Output 3.22.2 provides three EDF goodness-of-fit tests for the lognormal distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. At the ± = 0 . 10 significance level, all tests support the conclusion that the two-parameter lognormal distribution with scale parameter = ˆ’ . 58 and shape parameter = 0 . 50 provides a good model for the distribution of plate gaps.
Output 3.22.4 provides two EDF goodness-of-fit tests for the Weibull distribution: the Anderson-Darling and the Cram r-von Mises tests. The p -values for the EDF tests are all less than 0.10, indicating that the data do not support a Weibull model.
Output 3.22.5 provides three EDF goodness-of-fit tests for the gamma distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. At the ± = 0 . 10 significance level, all tests support the conclusion that the gamma distribution with scale parameter ƒ = 0 . 16 and shape parameter ± = 4 . 08 provides a good model for the distribution of plate gaps.
Based on this analysis, the fitted lognormal distribution and the fitted gamma distribution are both good models for the distribution of plate gaps. A sample program, uniex13.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates the use of kernel density estimates to visualize a nonnormal data distribution. This example uses the data set Channel , which is introduced in Example 3.15.
When you compute kernel density estimates, you should try several choices for the bandwidth parameter c since this determines the smoothness and closeness of the fit. You can specify a list of up to five C= values with the KERNEL option to request multiple density estimates, as shown in the following statements:
title 'FET Channel Length Analysis'; proc univariate data=Channel noprint; histogram Length / kernel(c = 0.25 0.50 0.75 1.00 l = 1 20 2 34 color=red noprint); run;
The L= secondary option specifies distinct line types for the curves (the L= values are paired with the C= values in the order listed). Output 3.23.1 demonstrates the effect of c . In general, larger values of c yield smoother density estimates, and smaller values yield estimates that more closely fit the data distribution.
Output 3.23.1 reveals strong trimodality in the data, which is displayed with comparative histograms in Example 3.15.
A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.
If you request a lognormal fit with the LOGNORMAL primary option, a two-parameter lognormal distribution is assumed. This means that the shape parameter ƒ and the scale parameter are unknown (unless specified) and that the threshold is known (it is either specified with the THETA= option or assumed to be zero).
If it is necessary to estimate in addition to and ƒ , the distribution is referred to as a three-parameter lognormal distribution. This example shows how you can request a three-parameter lognormal distribution.
A manufacturing process produces a plastic laminate whose strength must exceed a minimum of 25 psi. Samples are tested , and a lognormal distribution is observed for the strengths. It is important to estimate to determine whether the process meets the strength requirement. The following statements save the strengths for 49 samples in the data set Plastic :
data Plastic; label Strength = Strength in psi; input Strength @@; datalines; 30.26 31.23 71.96 47.39 33.93 76.15 42.21 81.37 78.48 72.65 61.63 34.90 24.83 68.93 43.27 41.76 57.24 23.80 34.03 33.38 21.87 31.29 32.48 51.54 44.06 42.66 47.98 33.73 25.80 29.95 60.89 55.33 39.44 34.50 73.51 43.41 54.67 99.43 50.76 48.81 31.86 33.88 35.57 60.41 54.92 35.66 59.30 41.96 45.32 ; run;
The following statements use the LOGNORMAL primary option in the HISTOGRAM statement to display the fitted three-parameter lognormal curve shown in Output 3.24.1:
title 'Three-Parameter Lognormal Fit'; proc univariate data=Plastic noprint; histogram Strength / lognormal(fill theta = est noprint) cfill = white; inset lognormal / format=6.2 pos=ne; run;
The NOPRINT option suppresses the tables of statistical output produced by default. Specifying THETA=EST requests a local maximum likelihood estimate (LMLE) for , as described by Cohen (1951). This estimate is then used to compute maximum likelihood estimates ƒ for and .
Note: You can also specify THETA=EST with the WEIBULL primary option to fit a three-parameter Weibull distribution.
A sample program, uniex14.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example shows how to display a fitted curve that is not supported by the HISTOGRAM statement. The offset of an attachment point is measured (in mm) for a number of manufactured assemblies, and the measurements ( Offset ) are saved in a data set named Assembly . The following statements create the data set Assembly :
data Assembly; label Offset = 'Offset (in mm)'; input Offset @@; datalines; 11.11 13.07 11.42 3.92 11.08 5.40 11.22 14.69 6.27 9.76 9.18 5.07 3.51 16.65 14.10 9.69 16.61 5.67 2.89 8.13 9.97 3.28 13.03 13.78 3.13 9.53 4.58 7.94 13.51 11.43 11.98 3.90 7.67 4.32 12.69 6.17 11.48 2.82 20.42 1.01 3.18 6.02 6.63 1.72 2.42 11.32 16.49 1.22 9.13 3.34 1.29 1.70 0.65 2.62 2.04 11.08 18.85 11.94 8.34 2.07 0.31 8.91 13.62 14.94 4.83 16.84 7.09 3.37 0.49 15.19 5.16 4.14 1.92 12.70 1.97 2.10 9.38 3.18 4.18 7.22 15.84 10.85 2.35 1.93 9.19 1.39 11.40 12.20 16.07 9.23 0.05 2.15 1.95 4.39 0.48 10.16 4.81 8.28 5.68 22.81 0.23 0.38 12.71 0.06 10.11 18.38 5.53 9.36 9.32 3.63 12.93 10.39 2.05 15.49 8.12 9.52 7.77 10.70 6.37 1.91 8.60 22.22 1.74 5.84 12.90 13.06 5.08 2.09 6.41 1.40 15.60 2.36 3.97 6.17 0.62 8.56 9.36 10.19 7.16 2.37 12.91 0.95 0.89 3.82 7.86 5.33 12.92 2.64 7.92 14.06 ; run;
It is decided to fit a folded normal distribution to the offset measurements. A variable X has a folded normal distribution if X = Y , where Y is distributed as N ( µ, ƒ ). The fitted density is
You can use SAS/IML to compute preliminary estimates of µ and ƒ based on a method of moments given by Elandt (1961). These estimates are computed by solving equation (19) of Elandt (1961), which is given by
where (·) is the standard normal distribution function, and
Then the estimates of ƒ and µ are given by
Begin by using PROC MEANS to compute the first and second moments and using the following DATA step to compute the constant A :
proc means data = Assembly noprint; var Offset; output out=stat mean=m1 var=var n=n min = min; run; * Compute constant A from equation (19) of Elandt (1961); data stat; keep m2 a min; set stat; a = (m1*m1); m2 = ((n-1)/n)*var + a; a = a/m2; run;
Next, use the SAS/IML subroutine NLPDD to solve equation (19) by minimizing ( f ( ) ˆ’ A ) 2 , and compute and :
proc iml; use stat; read all var {m2} into m2; read all var {a} into a; read all var {min} into min; * f(t) is the function in equation (19) of Elandt (1961); start f(t) global(a); y = .39894*exp(-0.5*t*t); y = (2*y-(t*(1-2*probnorm(t))))**2/(1+t*t); y = (y-a)**2; return(y); finish; * Minimize (f(t)-A)**2 and estimate mu and sigma; if (min < 0) then do; print "Warning: Observations are not all nonnegative."; print " The folded normal is inappropriate."; stop; end; if (a < 0.637) then do; print "Warning: the folded normal may be inappropriate"; end; opt = { 0 0 }; con = { 1e-6 }; x0 = { 2.0 }; tc = { . . . . . 1e-8 . . . . . . .}; call nlpdd(rc,etheta0,"f",x0,opt,con,tc); esig0 = sqrt(m2/(1+etheta0*etheta0)); emu0 = etheta0*esig0; create prelim var {emu0 esig0 etheta0}; append; close prelim;
The preliminary estimates are saved in the data set Prelim , as shown in Output 3.25.1:
The Data Set Prelim EMU0 ESIG0 ETHETA0 6.51735 6.54953 0.99509
Now, using and as initial estimates, call the NLPDD subroutine to maximize the log likelihood, l ( µ, ƒ ), of the folded normal distribution, where, up to a constant,
* Define the log likelihood of the folded normal; start g(p) global(x); y = 0.0; do i = 1 to nrow(x); z = exp((-0.5/p[2])*(x[i]-p[1])*(x[i]-p[1])); z = z + exp((-0.5/p[2])*(x[i]+p[1])*(x[i]+p[1])); y = y + log(z); end; y = y - nrow(x)*log(sqrt(p[2])); return(y); finish; * Maximize the log likelihood with subroutine NLPDD; use assembly; read all var {offset} into x; esig0sq = esig0*esig0; x0 = emu0esig0sq; opt = { 1 0 }; con = { . 0.0, . . }; call nlpdd(rc,xr,"g",x0,opt,con); emu = xr[1]; esig = sqrt(xr[2]); etheta = emu/esig; create parmest var{emu esig etheta}; append; close parmest; quit;
The data set ParmEst contains the maximum likelihood estimates and (as well as / ), as shown in Output 3.25.2:
The Data Set ParmEst EMU ESIG ETHETA 6.66761 6.39650 1.04239
To annotate the curve on a histogram, begin by computing the width and endpoints of the histogram intervals. The following statements save these values in a data set called OutCalc . Note that a plot is not produced at this point.
proc univariate data = Assembly noprint; histogram Offset / outhistogram = out normal(noprint) noplot; run; data OutCalc (drop = _MIDPT_); set out (keep = _MIDPT_) end = eof; retain _MIDPT1_ _WIDTH_; if _N_ = 1 then _MIDPT1_ = _MIDPT_; if eof then do; _MIDPTN_ = _MIDPT_; _WIDTH_ = (_MIDPTN_ - _MIDPT1_) / (_N_ - 1); output; end; run;
Output 3.25.3 provides a listing of the data set OutCalc . The width of the histogram bars is saved as the value of the variable _WIDTH_ ; the midpoints of the first and last histogram bars are saved as the values of the variables _MIDPT1_ and _MIDPTN_ .
Data Set OutCalc _MIDPT1_ _WIDTH_ _MIDPTN_ 1.5 3 22.5
The following statements create an annotate data set named Anno , which contains the coordinates of the fitted curve:
data Anno; merge ParmEst OutCalc; length function color $ 8; function = 'point'; color = 'black'; size = 2; xsys = '2'; ysys = '2'; when = 'a'; constant = 39.894*_width_;; left = _midpt1_ - .5*_width_; right = _midptn_ + .5*_width_; inc = (right-left)/100; do x = left to right by inc; z1 = (x-emu)/esig; z2 = (x+emu)/esig; y = (constant/esig)*(exp(-0.5*z1*z1)+exp(-0.5*z2*z2)); output; function = 'draw'; end; run;
The following statements read the ANNOTATE= data set and display the histogram and fitted curve:
title 'Folded Normal Distribution'; proc univariate data=assembly noprint; histogram Offset / annotate = anno cbarline = black cfill = ligr; run;
Output 3.25.4 displays the histogram and fitted curve:
A sample program, uniex15.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example is a continuation of the example explored in the section Modeling a Data Distribution on page 200.
In the normal probability plot shown in Figure 3.6, the nonlinearity of the point pattern indicates a departure from normality in the distribution of Deviation . Since the point pattern is curved with slope increasing from left to right, a theoretical distribution that is skewed to the right, such as a lognormal distribution, should provide a better fit than the normal distribution. See the section Interpretation of Quantile-Quantile and Probability Plots on page 299.
You can explore the possibility of a lognormal fit with a lognormal probability plot. When you request such a plot, you must specify the shape parameter ƒ for the lognormal distribution. This value must be positive, and typical values of ƒ range from 0.1 to 1.0. You can specify values for ƒ with the SIGMA= secondary option in the LOGNORMAL primary option, or you can specify that ƒ is to be estimated from the data.
The following statements illustrate the first approach by creating a series of three lognormal probability plots for the variable Deviation introduced in the section Modeling a Data Distribution on page 200:
symbol v=plus height=3.5pct; title Lognormal Probability Plot for Position Deviations; proc univariate data=Aircraft noprint; probplot Deviation / lognormal(theta=est zeta=est sigma=0.7 0.9 1.1) href = 95 lhref = 1 square; run;
The LOGNORMAL primary option requests plots based on the lognormal family of distributions, and the SIGMA= secondary option requests plots for ƒ equal to 0.7, 0.9, and 1.1. These plots are displayed in Output 3.26.1, Output 3.26.2, and Output 3.26.3, respectively. Alternatively, you can specify ƒ to be estimated using the sample standard deviation by using the option SIGMA=EST.
The SQUARE option displays the probability plot in a square format, the HREF= option requests a reference line at the 95th percentile, and the LHREF= option specifies the line type for the reference line.
The value ƒ = 0 . 9 in Output 3.26.2 most nearly linearizes the point pattern. The 95th percentile of the position deviation distribution seen in Output 3.26.2 is approximately 0.001, since this is the value corresponding to the intersection of the point pattern with the reference line.
Note: Once the ƒ that produces the most linear fit is found, you can then estimate the threshold parameter and the scale parameter . See Example 3.31.
The following statements illustrate how you can create a lognormal probability plot for Deviation using a local maximum likelihood estimate for ƒ .
symbol v=plus height=3.5pct; title 'Lognormal Probability Plot for Position Deviations'; proc univariate data=Aircraft noprint; probplot Deviation / lognormal(theta=est zeta=est sigma=est) href = 95 lhref = 1 square; run;
The plot is displayed in Output 3.26.4. Note that the maximum likelihood estimate of ƒ (in this case 0.882) does not necessarily produce the most linear point pattern.
A sample program, uniex16.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example uses the data set Aircraft from the previous example to illustrate how to display a lognormal fit with a histogram. To determine whether the lognormal distribution is an appropriate model for a distribution, you should consider the graphical fit as well as conduct goodness-of-fit tests.
The following statements fit a lognormal distribution and display the density curve on a histogram:
title 'Distribution of Position Deviations'; ods select Lognormal.ParameterEstimates Lognormal.GoodnessOfFit MyPlot; proc univariate data=Aircraft; var Deviation; histogram / lognormal(w=3 theta=est) vaxis = axis1 name = 'MyPlot'; inset n mean (5.3) std='Std Dev' (5.3) skewness (5.3) / pos = ne header = 'Summary Statistics'; axis1 label=(a=90 r=0); run;
The ODS SELECT statement restricts the output to the ParameterEstimates and GoodnessOfFit tables; see the section ODS Table Names on page 309. The LOGNORMAL primary option superimposes a fitted curve on the histogram in Output 3.27.1. The W= option specifies the line width for the curve. The INSET statement specifies that the mean, standard deviation, and skewness be displayed in an inset in the northeast corner of the plot. Note that the default value of the threshold parameter is zero. In applications where the threshold is not zero, you can specify with the THETA= option. The variable Deviation includes values that are less than the default threshold; therefore, the option THETA= EST is used.
Output 3.27.2 provides three EDF goodness-of-fit tests for the lognormal distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. The null hypothesis for the three tests is that a lognormal distribution holds for the sample data.
Distribution of Position Deviations Fitted Distribution for Deviation Parameters for Lognormal Distribution Parameter Symbol Estimate Threshold Theta -0.00834 Scale Zeta -6.14382 Shape Sigma 0.882225 Mean -0.00517 Std Dev 0.003438 Goodness-of-Fit Tests for Lognormal Distribution Test ---Statistic---- -----p Value----- Kolmogorov-Smirnov D 0.09419634 Pr > D >0.500 Cramer-von Mises W-Sq 0.02919815 Pr > W-Sq >0.500 Anderson-Darling A-Sq 0.21606642 Pr > A-Sq >0.500
The p -values for all three tests are greater than 0.5, so the null hypothesis is not rejected. The tests support the conclusion that the two-parameter lognormal distribution with scale parameter = ˆ’ 6 . 14, and shape parameter = 0 . 88 provides a good model for the distribution of position deviations. For further discussion of goodness-of-fit interpretation, see the section Goodness-of-Fit Tests on page 292.
A sample program, uniex16.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to create a normal quantile plot. An engineer is analyzing the distribution of distances between holes cut in steel sheets. The following statements save measurements of the distance between two holes cut into 50 steel sheets as values of the variable Distance in the data set Sheets :
data Sheets; input Distance @@; label Distance = 'Hole Distance (cm)'; datalines; 9.80 10.20 10.27 9.70 9.76 10.11 10.24 10.20 10.24 9.63 9.99 9.78 10.10 10.21 10.00 9.96 9.79 10.08 9.79 10.06 10.10 9.95 9.84 10.11 9.93 10.56 10.47 9.42 10.44 10.16 10.11 10.36 9.94 9.77 9.36 9.89 9.62 10.05 9.72 9.82 9.99 10.16 10.58 10.70 9.54 10.31 10.07 10.33 9.98 10.15 ; run;
The engineer decides to check whether the distribution of distances is normal. The following statements create a Q-Q plot for Distance , shown in Output 3.28.1:
symbol v=plus; title 'Normal Quantile-Quantile Plot for Hole Distance'; proc univariate data=Sheets noprint; qqplot Distance; run;
The plot compares the ordered values of Distance with quantiles of the normal distribution. The linearity of the point pattern indicates that the measurements are normally distributed. Note that a normal Q-Q plot is created by default.
A sample program, uniex17.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.28, illustrates how to add a reference line to a normal Q-Q plot, which represents the normal distribution with mean µ and standard deviation ƒ . The following statements reproduce the Q-Q plot in Output 3.28.1 and add the reference line:
symbol v=plus; title 'Normal Quantile-Quantile Plot for Hole Distance'; proc univariate data=Sheets noprint; qqplot Distance / normal(mu=est sigma=est color=red l=2 noprint) square; run;
The plot is displayed in Output 3.29.1.
Specifying MU=EST and SIGMA=EST with the NORMAL primary option requests the reference line for which µ and ƒ are estimated by the sample mean and standard deviation. Alternatively, you can specify numeric values for µ and ƒ with the MU= and SIGMA= secondary options. The COLOR= and L= options specify the color and type of the line, and the SQUARE option displays the plot in a square format. The NOPRINT options in the PROC UNIVARIATE statement and after the NORMAL option suppress all the tables of statistical output produced by default.
The data clearly follow the line, which indicates that the distribution of the distances is normal.
A sample program, uniex17.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example illustrates how to interpret a normal quantile plot when the data are from a non-normal distribution. The following statements create the data set Measures , which contains the measurements of the diameters of 50 steel rods in the variable Diameter :
data Measures; input Diameter @@; label Diameter = 'Diameter (mm)'; datalines; 5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177 5.332 5.399 5.661 5.512 5.252 5.404 5.739 5.525 5.160 5.410 5.823 5.376 5.202 5.470 5.410 5.394 5.146 5.244 5.309 5.480 5.388 5.399 5.360 5.368 5.394 5.248 5.409 5.304 6.239 5.781 5.247 5.907 5.208 5.143 5.304 5.603 5.164 5.209 5.475 5.223 ; run;
The following statements request the normal Q-Q plot in Output 3.30.1:
symbol v=plus; title Normal Q-Q Plot for Diameters; proc univariate data=Measures noprint; qqplot Diameter / normal(noprint) square vaxis=axis1; axis1 label=(a=90 r=0); run;
The nonlinearity of the points in Output 3.30.1 indicates a departure from normality. Since the point pattern is curved with slope increasing from left to right, a theoretical distribution that is skewed to the right, such as a lognormal distribution, should provide a better fit than the normal distribution. The mild curvature suggests that you should examine the data with a series of lognormal Q-Q plots for small values of the shape parameter ƒ , as illustrated in Example 3.31. For details on interpreting a Q-Q plot, see the section Interpretation of Quantile-Quantile and Probability Plots on page 299.
A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.30, demonstrates techniques for estimating the shape, location, and scale parameters, and the theoretical percentiles for a three-parameter lognormal distribution.
The three-parameter lognormal distribution depends on a threshold parameter , a scale parameter , and a shape parameter ƒ . You can estimate ƒ from a series of lognormal Q-Q plots which use the SIGMA= secondary option to specify different values of ƒ ; the estimate of ƒ is the value that linearizes the point pattern. You can then estimate the threshold and scale parameters from the intercept and slope of the point pattern. The following statements create the series of plots in Output 3.31.1, Output 3.31.2, and Output 3.31.3 for ƒ values of 0.2, 0.5, and 0.8, respectively:
symbol v=plus; title 'Lognormal Q-Q Plot for Diameters'; proc univariate data=Measures noprint; qqplot Diameter / lognormal(sigma=0.2 0.5 0.8 noprint) square; run;
Note: You must specify a value for the shape parameter ƒ for a lognormal Q-Q plot with the SIGMA= option or its alias, the SHAPE= option.
The plot Output 3.31.2 displays the most linear point pattern, indicating that the lognormal distribution with ƒ = 0 . 5 provides a reasonable fit for the data distribution.
Data with this particular lognormal distribution have the following density function:
The points in the plot fall on or near the line with intercept and slope exp( ). Based on Output 3.31.2, ‰ˆ 5 and , giving ‰ˆ log(0 . 4) ‰ˆ ˆ’ . 92.
You can also request a reference line using the SIGMA=, THETA=, and ZETA= options together. The following statements produce the lognormal Q-Q plot in Output 3.31.4:
symbol v=plus; title 'Lognormal Q-Q Plot for Diameters'; proc univariate data=Measures noprint; qqplot Diameter / lognormal(theta=5 zeta=est sigma=est color=black l=2 noprint) square; run;
Output 3.31.1 through Output 3.31.3 show that the threshold parameter is not equal to zero. Specifying THETA=5 overrides the default value of zero. The SIGMA=EST and ZETA=EST secondary options request estimates for ƒ and exp using the sample mean and standard deviation.
From the plot in Output 3.31.2, ƒ can be estimated as 0.51, which is consistent with the estimate of 0.5 derived from the plot in Output 3.31.2. The next example illustrates how to estimate percentiles using lognormal Q-Q plots.
A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of the previous example, shows how to use a Q-Q plot to estimate percentiles such as the 95th percentile of the lognormal distribution. A probability plot can also be used for this purpose, as illustrated in Example 3.26.
The point pattern in Output 3.31.4 has a slope of approximately 0.39 and an intercept of 5. The following statements reproduce this plot, adding a lognormal reference line with this slope and intercept:
symbol v=plus; title 'Lognormal Q-Q Plot for Diameters'; proc univariate data=Measures noprint; qqplot Diameter / lognormal(sigma=0.5 theta=5 slope=0.39 noprint) pctlaxis(grid) vref = 5.8 5.9 6.0 square; run;
The result is shown in Output 3.32.1:
The PCTLAXIS option labels the major percentiles, and the GRID option draws percentile axis reference lines. The 95th percentile is 5.9, since the intersection of the distribution reference line and the 95th reference line occurs at this value on the vertical axis.
Alternatively, you can compute this percentile from the estimated lognormal parameters. The ± th percentile of the lognormal distribution is
P ± = exp( ƒ ˆ’ 1 ( ± ) + ) +
where ˆ’ 1 (·) is the inverse cumulative standard normal distribution. Consequently,
A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example, which is a continuation of Example 3.31, demonstrates techniques for estimating the shape, location, and scale parameters, and the theoretical percentiles for a two-parameter lognormal distribution.
If the threshold parameter is known, you can construct a two-parameter lognormal Q-Q plot by subtracting the threshold from the data values and making a normal Q-Q plot of the log-transformed differences, as illustrated in the following statements:
data ModifiedMeasures; set Measures; LogDiameter = log(Diameter-5); label LogDiameter = log(Diameter-5); run; symbol v=plus; title 'Two-Parameter Lognormal Q-Q Plot for Diameters'; proc univariate data=ModifiedMeasures noprint; qqplot LogDiameter / normal(mu=est sigma=est noprint) square vaxis=axis1; inset n mean (5.3) std (5.3) / pos = nw header = 'Summary Statistics'; axis1 label=(a=90 r=0); run;
Because the point pattern in Output 3.33.1 is linear, you can estimate the lognormal parameters and ƒ as the normal plot estimates of µ and ƒ , which are ˆ’ 0.99 and 0.51. These values correspond to the previous estimates of ˆ’ 0.92 for and 0.5 for ƒ from Example 3.31. A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.
This example compares the use of three-parameter and two-parameter Weibull Q-Q plots for the failure times in months for 48 integrated circuits. The times are assumed to follow a Weibull distribution. The following statements save the failure times as the values of the variable Time in the data set Failures :
data Failures; input Time @@; label Time = 'Time in Months'; datalines; 29.42 32.14 30.58 27.50 26.08 29.06 25.10 31.34 29.14 33.96 30.64 27.32 29.86 26.28 29.68 33.76 29.32 30.82 27.26 27.92 30.92 24.64 32.90 35.46 30.28 28.36 25.86 31.36 25.26 36.32 28.58 28.88 26.72 27.42 29.02 27.54 31.60 33.46 26.78 27.82 29.18 27.94 27.66 26.42 31.00 26.64 31.44 32.52 ; run;
If no assumption is made about the parameters of this distribution, you can use the WEIBULL option to request a three-parameter Weibull plot. As in the previous example, you can visually estimate the shape parameter c by requesting plots for different values of c and choosing the value of c that linearizes the point pattern. Alternatively, you can request a maximum likelihood estimate for c , as illustrated in the following statements:
symbol v=plus; title 'Three-Parameter Weibull Q-Q Plot for Failure Times'; proc univariate data=Failures noprint; qqplot Time / weibull(c=est theta=est sigma=est noprint) square href=0.5 1 1.5 2 vref=25 27.5 30 32.5 35 lhref=4 lvref=4 chref=tan cvref=tan; run;
Note: When using the WEIBULL option, you must either specify a list of values for the Weibull shape parameter c with the C= option, or you must specify C=EST.
Output 3.34.1 displays the plot for the estimated value ‰ = 1 . 99. The reference line corresponds to the estimated values for the threshold and scale parameters of = 24 . 19 and = 5 . 83, respectively.
Now, suppose it is known that the circuit lifetime is at least 24 months. The following statements use the known threshold value = 24 to produce the two-parameter Weibull Q-Q plot shown in Output 3.31.4:
symbol v=plus; title 'Two-Parameter Weibull Q-Q Plot for Failure Times'; proc univariate data=Failures noprint; qqplot Time / weibull(theta=24 c=est sigma=est noprint) square vref= 25 to 35 by 2.5 href= 0.5 to 2.0 by 0.5 lhref=4 lvref=4 chref=tan cvref=tan; run;
The reference line is based on maximum likelihood estimates ‰ = 2 . 08 and = 6 . 05.
A sample program, uniex19.sas , for this example is available in the SAS Sample Library for Base SAS software.