Examples


Example 3.1. Computing Descriptive Statistics for Multiple Variables

This example computes univariate statistics for two variables. The following statements create the data set BPressure , which contains the systolic ( Systolic ) and diastolic ( Diastolic ) blood pressure readings for 22 patients :

  data BPressure;   length PatientID ;   input PatientID $ Systolic Diastolic @@;   datalines;   CK 120 50  SS 96  60 FR 100 70   CP 120 75  BL 140 90 ES 120 70   CP 165 110 JI 110 40 MC 119 66   FC 125 76  RW 133 60 KD 108 54   DS 110 50  JW 130 80 BH 120 65   JW 134 80  SB 118 76 NS 122 78   GS 122 70  AB 122 78 EC 112 62   HH 122 82   ;   run;  

The following statements produce descriptive statistics and quantiles for the variables Systolic and Diastolic :

  title 'Systolic and Diastolic Blood Pressure';   ods select BasicMeasures Quantiles;   proc univariate data=BPressure;   var Systolic Diastolic;   run;  

The ODS SELECT statement restricts the output, which is shown in Output 3.1.1, to the BasicMeasures and Quantiles tables; see the section ODS Table Names on page 309. You use the PROC UNIVARIATE statement to request univariate statistics for the variables listed in the VAR statement, which specifies the analysis variables and their order in the output. Formulas for computing the statistics in the BasicMeasures table are provided in the section Descriptive Statistics on page 269. The quantiles are calculated using Definition 5 , which is the default definition; see the section Calculating Percentiles on page 273.

Output 3.1.1: Display Basic Measures and Quantiles
start example
  Systolic and Diastolic Blood Pressure   The UNIVARIATE Procedure   Variable:  Systolic   Basic Statistical Measures   Location                    Variability   Mean     121.2727     Std Deviation           14.28346   Median   120.0000     Variance               204.01732   Mode     120.0000     Range                   69.00000   Interquartile Range     13.00000   NOTE: The mode displayed is the smallest of 2 modes with a count of 4.   Quantiles (Definition 5)   Quantile       Estimate   100% Max            165   99%                 165   95%                 140   90%                 134   75% Q3              125   50% Median          120   25% Q1              112   10%                 108   5%                  100   1%                   96   0% Min               96   Systolic and Diastolic Blood Pressure   The UNIVARIATE Procedure   Variable:   Diastolic   Basic Statistical Measures   Location                      Variability   Mean      70.09091     Std Deviation             15.16547   Median    70.00000     Variance                 229.99134   Mode      70.00000     Range                     70.00000   Interquartile Range       18.00000   Quantiles (Definition 5)   Quantile       Estimate   100% Max            110   99%                 110   95%                  90   90%                  82   75% Q3               78   50% Median           70   25% Q1               60   10%                  50   5%                   50   1%                   40   0% Min               40  
end example
 

A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.2. Calculating Modes

An instructor is interested in calculating all the modes of the scores on a recent exam. The following statements create a data set named Exam , which contains the exam scores in the variable Score :

  data Exam;   label Score = 'Exam Score';   input Score @@;   datalines;   81 97 78 99 77 81 84 86 86 97   85 86 94 76 75 42 91 90 88 86   97 97 89 69 72 82 83 81 80 81   ;   run;  

The following statements use the MODES option to request a table of all possible modes:

  title 'Table of Modes for Exam Scores';   ods select Modes;   proc univariate data=Exam modes;   var Score;   run;  

The ODS SELECT statement restricts the output to the Modes table; see the section ODS Table Names on page 309.

Output 3.2.1: Table of Modes Display
start example
  Table of Modes for Exam Scores   The UNIVARIATE Procedure   Variable: Score (Exam Score)   Modes   Mode   Count   81       4   86       4   97       4  
end example
 

By default, when the MODES option is used, and there is more than one mode, the lowest mode is displayed in the BasicMeasures table. The following statements illustrate the default behavior:

  title 'Default Output';   ods select BasicMeasures;   proc univariate data=Exam;   var Score;   run;  
Output 3.2.2: Default Output (Without MODES Option)
start example
  Default Output   The UNIVARIATE Procedure   Variable:  Score   (Exam Score)   Basic Statistical Measures   Location                     Variability   Mean     83.66667      Std Deviation             11.08069   Median   84.50000      Variance                 122.78161   Mode     81.00000      Range                     57.00000   Interquartile Range       10.00000   NOTE: The mode displayed is the smallest of 3 modes with a count of 4.  
end example
 

The default output displays a mode of 81 and includes a note regarding the number of modes; the modes 86 and 97 are not displayed. The ODS SELECT statement restricts the output to the BasicMeasures table; see the section ODS Table Names on page 309.

A sample program, uniex02.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.3. Identifying Extreme Observations and Extreme Values

This example, which uses the data set BPressure introduced in Example 3.1, illustrates how to produce a table of the extreme observations and a table of the extreme values in a data set. The following statements generate the Extreme Observations tables for Systolic and Diastolic , which enable you to identify the extreme observations for each variable:

  title 'Extreme Blood Pressure Observations';   ods select ExtremeObs;   proc univariate data=BPressure;   var Systolic Diastolic;   id PatientID;   run;  

The ODS SELECT statement restricts the output to the ExtremeObs table; see the section ODS Table Names on page 309. The ID statement requests that the extreme observations are to be identified using the value of PatientID as well as the observation number. By default, the five lowest and five highest observations are displayed. You can use the NEXTROBS= option to request a different number of extreme observations.

Output 3.3.1 shows that the patient identified as ˜CP (Observation 7) has the highest values for both Systolic and Diastolic . To visualize extreme observations, you can create histograms; see Example 3.14.

Output 3.3.1: Blood Pressure Extreme Observations
start example
  Extreme Blood Pressure Observations   The UNIVARIATE Procedure   Variable:  Systolic   Extreme Observations   ---------Lowest---------        ---------Highest--------   Patient                         Patient   Value   ID           Obs        Value   ID           Obs   96   SS             2          130   JW            14   100   FR             3          133   RW            11   108   KD            12          134   JW            16   110   DS            13          140   BL             5   110   JI             8          165   CP             7   Extreme Blood Pressure Observations   The UNIVARIATE Procedure   Variable: Diastolic   Extreme Observations   ---------Lowest---------        ---------Highest--------   Patient                         Patient   Value   ID           Obs        Value   ID           Obs   40   JI             8           80   JW            14   50   DS            13           80   JW            16   50   CK             1           82   HH            22   54   KD            12           90   BL             5   60   RW            11          110   CP             7  
end example
 

The following statements generate the Extreme Values tables for Systolic and Diastolic , which tabulate the tails of the distributions:

  title 'Extreme Blood Pressure Values';   ods select ExtremeValues;   proc univariate data=BPressure nextrval=5;   var Systolic Diastolic;   run;  

The ODS SELECT statement restricts the output to the ExtremeValues table; see the section ODS Table Names on page 309. The NEXTRVAL= option specifies the number of extreme values at each end of the distribution to be shown in the tables in Output 3.3.2.

Output 3.3.2: Blood Pressure Extreme Values
start example
  Extreme Blood Pressure Values   The UNIVARIATE Procedure   Variable:  Systolic   Extreme Values   ---------Lowest-------- --------Highest--------   Order    Value     Freq  Order   Value     Freq   1       96        1     11     130        1   2      100        1     12     133        1   3      108        1     13     134        1   4      110        2     14     140        1   5      112        1     15     165        1   Extreme Blood Pressure Values   The UNIVARIATE Procedure   Variable:  Diastolic   Extreme Values   ---------Lowest--------   --------Highest--------   Order    Value     Freq   Order    Value     Freq   1       40        1      11       78        2   2       50        2      12       80        2   3       54        1      13       82        1   4       60        2      14       90        1   5       62        1      15      110        1  
end example
 

Output 3.3.2 shows that the values 78 and 80 occurred twice for Diastolic and the maximum of Diastolic is 110. Note that Output 3.3.1 displays the value of 80 twice for Diastolic because there are two observations with that value. In Output 3.3.2, the value 80 is only displayed once.

A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.4. Creating a Frequency Table

An instructor is interested in creating a frequency table of score changes between a pair of tests given in one of his college courses. The data set Score contains test scores for his students who took a pretest and a posttest on the same material. The variable ScoreChange contains the difference between the two test scores. The following statements create the data set:

  data Score;   input Student $ PreTest PostTest @@;   label ScoreChange = Change in Test Scores;   ScoreChange = PostTest - PreTest;   datalines;   Capalleti  94 91  Dubose     51 65   Engles     95 97  Grant      63 75   Krupski    80 75  Lundsford  92 55   Mcbane     75 78  Mullen     89 82   Nguyen     79 76  Patel      71 77   Si         75 70  Tanaka     87 73   ;   run;  

The following statements produce a frequency table for the variable ScoreChange :

  title 'Analysis of Score Changes';   ods select Frequencies;   proc univariate data=Score freq;   var ScoreChange;   run;  

The ODS SELECT statement restricts the output to the Frequencies table; see the section ODS Table Names on page 309. The FREQ option on the PROC UNIVARIATE statement requests the table of frequencies shown in Output 3.4.1.

Output 3.4.1: Table of Frequencies
start example
  Analysis of Score Changes   The UNIVARIATE Procedure   Variable: ScoreChange  (Change in Test Scores)   Frequency Counts   Percents                   Percents                      Percents   Value Count Cell   Cum    Value Count  Cell   Cum     Value Count    Cell   Cum   -37     1  8.3   8.3       -3     2  16.7  58.3         6     1     8.3  83.3   -14     1  8.3  16.7        2     1   8.3  66.7        12     1     8.3  91.7   -7     1  8.3  25.0        3     1   8.3  75.0        14     1     8.3 100.0   -5     2 16.7  41.7  
end example
 

From Output 3.4.1, the instructor sees that only score changes of ˆ’ 3 and ˆ’ 5 occurred more than once.

A sample program, uniex03.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.5. Creating Plots for Line Printer Output

The PLOT option in the PROC UNIVARIATE statement requests several basic plots for display in line printer output. For more information on plots created by the PLOT option, see the section Creating Line Printer Plots on page 281. This example illustrates the use of the PLOT option as well as BY processing in PROC UNIVARIATE.

A researcher is analyzing a data set consisting of air pollution data from three different measurement sites. The data set AirPoll , created by the following statements, contains the variables Site and Ozone , which are the site number and ozone level, respectively.

  data AirPoll (keep = Site Ozone);   label Site  = 'Site Number'   Ozone = 'Ozone level (in ppb)';   do i = 1 to 3;   input Site @@;   do j = 1 to 15;   input Ozone @@;   output;   end;   end;   datalines;   102 4 6 3 4 7 8 2 3 4 1 3 8 9 5 6   134 5 3 6 2 1 2 4 3 2 4 6 4 6 3 1   137 8 9 7 8 6 7 6 7 9 8 9 8 7 8 5   ;   run;  

The following statements produce stem-and-leaf plots, box plots, and normal probability plots for each site in the AirPoll data set:

  ods select Plots SSPlots;   proc univariate data=AirPoll plot;   by Site;   var Ozone;   run;  

The PLOT option produces a stem-and-leaf plot, a box plot, and a normal probability plot for the Ozone variable at each site. Since the BY statement is used, a side-by-side box plot is also created to compare the ozone levels across sites. Note that AirPoll is sorted by Site ; in general, the data set should be sorted by the BY variable using the SORT procedure. The ODS SELECT statement restricts the output to the Plots and SSPlots tables; see the section ODS Table Names on page 309. Optionally, you can specify the PLOTSIZE= n option to control the approximate number of rows (between 8 and the page size ) that the plots occupy.

Output 3.5.1 through Output 3.5.3 show the plots produced for each BY group . Output 3.5.4 shows the side-by-side box plot for comparing Ozone values across sites.

Output 3.5.1: Ozone Plots for BY Group Site = 102
start example
  ------------------------------- Site Number=102 --------------------------------   The UNIVARIATE Procedure   Variable:  Ozone  (Ozone level (in ppb))   Stem Leaf                     #             Boxplot   9 0                        1   8 00                       2   7 0                        1             +-----+   6 00                       2   5 0                        1   4 000                      3             *--+--*   3 000                      3             +-----+   2 0                        1   1 0                        1   ----+----+----+---- +   Normal Probability Plot   9.5+                                          *++++   *  * ++++   * +++++   * *+++   5.5+                          +*++   **+*   * *+*+   *++++   1.5+        *++++  +----+----+----+----+----+----+----+----+----+----+  -2        -1         0        +1        +2  
end example
 
Output 3.5.2: Ozone Plots for BY Group Site = 134
start example
  ------------------------------- Site Number=134 --------------------------------   The UNIVARIATE Procedure   Variable:  Ozone  (Ozone level (in ppb))   Stem Leaf                     #             Boxplot   6 000                      3   5 0                        1             +-----+   4 000                      3   3 000                      3             *--+--*   2 000                      3             +-----+   1 00                       2   ----+----+----+----+   Normal Probability Plot   6.5+                                  *  *   ++*+++   * ++++++   **+*+++   **+*+++   *+*+*++   1.5+        *   ++*+++   +----+----+----+----+----+----+----+----+----+----+   -2        -1         0        +1        +2  
end example
 
Output 3.5.3: Ozone Plots for BY Group Site = 137
start example
  ------------------------------- Site Number=137 --------------------------------   The UNIVARIATE Procedure   Variable:  Ozone  (Ozone level (in ppb))   Stem Leaf                     #             Boxplot   9 000                      3   8 00000                    5             +-----+   7 0000                     4             +--+--+   6 00                       2   5 0                        1                0   ----+----+----+---- +   Normal Probability Plot   9.5+                                  *  *++++*++++   * ** *+*+++++   7.5+                  * * **++++++   *++*+++++   5.5+     +++*++++  +----+----+----+----+----+----+----+----+----+----+  -2        -1         0        +1        +2  
end example
 
Output 3.5.4: Ozone Side-by-Side Boxplot for All BY Group
start example
  The UNIVARIATE Procedure   Variable:  Ozone   (Ozone level (in ppb))   Schematic Plots     10 +         8 +                                     *----- *   +   +-----  +                    +----- +     6 +     +         +----- +          0     4 +         *-----  *   +   +-----  +      *----- *    2 +                       +----- +      0 +  ------------+-----------+-----------+-----------  Site                    102           134           137  
end example
 

Note that you can use the PROBPLOT statement with the NORMAL option to produce high-resolution normal probability plots; see the section Modeling a Data Distribution on page 200.

Note that you can use the BOXPLOT procedure to produce box plots using high-resolution graphics; refer to Chapter 18, The BOXPLOT Procedure, in SAS/STAT User s Guide .

A sample program, uniex04.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.6. Analyzing a Data Set With a FREQ Variable

This example illustrates how to use PROC UNIVARIATE to analyze a data set with a variable that contains the frequency of each observation. The data set Speeding contains data on the number of cars pulled over for speeding on a stretch of highway with a 65 mile per hour speed limit. Speed is the speed at which the cars were traveling, and Number is the number of cars at each speed. The following statements create the data set:

  data Speeding;   label Speed = 'Speed (in miles per hour)';   do Speed = 66 to 85;   input Number @@;   output;   end;   datalines;   2  3  2  1  3  6  8  9  10 13   12 14  6  2  0  0  1  1   0  1   ;   run;  

The following statements create a table of moments for the variable Speed :

  title 'Analysis of Speeding Data';   ods select Moments;   proc univariate data=Speeding;   freq Number;   var Speed;   run;  

The ODS SELECT statement restricts the output, which is shown in Output 3.6.1, to the Moments table; see the section ODS Table Names on page 309. The FREQ statement specifies that the value of the variable Number represents the frequency of each observation.

Output 3.6.1: Table of Moments
start example
  Analysis of Speeding Data   The UNIVARIATE Procedure   Variable: Speed  (Speed (in miles per hour))   Freq:  Number   Moments   N                        94    Sum Weights              94   Mean             74.3404255    Sum Observations       6988   Std Deviation    3.44403237    Variance          11.861359   Skewness         -0.1275543    Kurtosis         0.92002287   Uncorrected SS       520594    Corrected SS     1103.10638   Coeff Variation  4.63278538    Std Error Mean   0.35522482  
end example
 

For the formulas used to compute these moments, see the section Descriptive Statistics on page 269. A sample program, uniex05.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.7. Saving Summary Statistics in an OUT= Output Data Set

This example illustrates how to save summary statistics in an output data set. The following statements create a data set named Belts , which contains the breaking strengths ( Strength ) and widths ( Width ) of a sample of 50 automotive seat belts:

  data Belts;   label Strength = Breaking Strength (lb/in)   Width    = 'Width in Inches';   input Strength Width @@;   datalines;   1243.51  3.036  1221.95  2.995  1131.67  2.983  1129.70  3.019   1198.08  3.106  1273.31  2.947  1250.24  3.018  1225.47  2.980   1126.78  2.965  1174.62  3.033  1250.79  2.941  1216.75  3.037   1285.30  2.893  1214.14  3.035  1270.24  2.957  1249.55  2.958   1166.02  3.067  1278.85  3.037  1280.74  2.984  1201.96  3.002   1101.73  2.961  1165.79  3.075  1186.19  3.058  1124.46  2.929   1213.62  2.984  1213.93  3.029  1289.59  2.956  1208.27  3.029   1247.48  3.027  1284.34  3.073  1209.09  3.004  1146.78  3.061   1224.03  2.915  1200.43  2.974  1183.42  3.033  1195.66  2.995   1258.31  2.958  1136.05  3.022  1177.44  3.090  1246.13  3.022   1183.67  3.045  1206.50  3.024  1195.69  3.005  1223.49  2.971   1147.47  2.944  1171.76  3.005  1207.28  3.065  1131.33  2.984   1215.92  3.003  1202.17  3.058   ;   run;  

The following statements produce two output data sets containing summary statistics:

  proc univariate data=Belts noprint;   var Strength Width;   output out=Means         mean=StrengthMean WidthMean;   output out=StrengthStats mean=StrengthMean std=StrengthSD   min=StrengthMin   max=StrengthMax;   run;  

When you specify an OUTPUT statement, you must also specify a VAR statement. You can use multiple OUTPUT statements with a single procedure statement. Each OUTPUT statement creates a new data set with the name specified by the OUT= option. In this example, two data sets, Means and StrengthStats , are created. See Output 3.7.1 for a listing of Means and Output 3.7.2 for a listing of StrengthStats .

Output 3.7.1: Listing of Output Data Set Means
start example
  Strength    Width   Obs       Mean       Mean   1       1205.75  3.00584  
end example
 
Output 3.7.2: Listing of Output Data Set StrengthStats
start example
  Strength    Strength     Strength    Strength   Obs     Mean         SD           Max         Min   1     1205.75     48.3290      1289.59     1101.73  
end example
 

Summary statistics are saved in an output data set by specifying keyword=names after the OUT= option. In the preceding statements, the first OUTPUT statement specifies the keyword MEAN followed by the names StrengthMean and WidthMean . The second OUTPUT statement specifies the keywords MEAN, STD, MAX, and MIN, for which the names StrengthMean, StrengthSD, StrengthMax , and StrengthMin are given.

The keyword specifies the statistic to be saved in the output data set, and the names determine the names for the new variables. The first name listed after a keyword contains that statistic for the first variable listed in the VAR statement; the second name contains that statistic for the second variable in the VAR statement, and so on.

The data set Means contains the mean of Strength in a variable named StrengthMean and the mean of Width in a variable named WidthMean . The data set StrengthStats contains the mean, standard deviation, maximum value, and minimum value of Strength in the variables StrengthMean, StrengthSD, StrengthMax , and StrengthMin , respectively.

See the section OUT= Output Data Set in the OUTPUT Statement on page 306 for more information on OUT= output data sets.

A sample program, uniex06.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.8. Saving Percentiles in an Output Data Set

This example, which uses the Belts data set from the previous example, illustrates how to save percentiles in an output data set. The UNIVARIATE procedure automatically computes the 1st, 5th, 10th, 25th, 75th, 90th, 95th, and 99th percentiles for each variable. You can save these percentiles in an output data set by specifying the appropriate keywords. For example, the following statements create an output data set named PctlStrength , which contains the 5th and 95th percentiles of the variable Strength :

  proc univariate data=Belts noprint;   var Strength Width;   output out=PctlStrength p5=p5str p95=p95str;   run;  

The output data set PctlStrength is listed in Output 3.8.1.

Output 3.8.1: Listing of Output Data Set PctlStrength
start example
  Obs      p95str      p5str   1      1284.34     1126.78  
end example
 

You can use the PCTLPTS=, PCTLPRE=, and PCTLNAME= options to save percentiles not automatically computed by the UNIVARIATE procedure. For example, the following statements create an output data set named Pctls , which contains the 20th and 40th percentiles of the variables Strength and Width :

  proc univariate data=Belts noprint;   var Strength Width;   output out=Pctls pctlpts  = 20 40   pctlpre  = Strength Width   pctlname = pct20 pct40;   run;  

The PCTLPTS= option specifies the percentiles to compute (in this case, the 20th and 40th percentiles). The PCTLPRE= and PCTLNAME= options build the names for the variables containing the percentiles. The PCTLPRE= option gives prefixes for the new variables, and the PCTLNAME= option gives a suffix to add to the prefix. When you use the PCTLPTS= specification, you must also use the PCTLPRE= specification.

The OUTPUT statement saves the 20th and 40th percentiles of Strength and Width in the variables Strengthpct20, Widthpct20, Strengthpct40 , and Weightpct40 . The output data set Pctls is listed in Output 3.8.2.

Output 3.8.2: Listing of Output Data Set Pctls
start example
  Obs     Strengthpct20     Strengthpct40     Widthpct20     Widthpct40   1         1165.91           1199.26         2.9595          2.995  
end example
 

A sample program, uniex06.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.9. Computing Confidence Limits for the Mean, Standard Deviation, and Variance

This example illustrates how to compute confidence limits for the mean, standard deviation, and variance of a population. A researcher is studying the heights of a certain population of adult females. She has collected a random sample of heights of 75 females, which are saved in the data set Heights :

  data Heights;   label Height = Height (in);   input Height @@;   datalines;   64.1 60.9 64.1 64.7 66.7 65.0 63.7 67.4 64.9 63.7   64.0 67.5 62.8 63.9 65.9 62.3 64.1 60.6 68.6 68.6   63.7 63.0 64.7 68.2 66.7 62.8 64.0 64.1 62.1 62.9   62.7 60.9 61.6 64.6 65.7 66.6 66.7 66.0 68.5 64.4   60.5 63.0 60.0 61.6 64.3 60.2 63.5 64.7 66.0 65.1   63.6 62.0 63.6 65.8 66.0 65.4 63.5 66.3 66.2 67.5   65.8 63.1 65.8 64.4 64.0 64.9 65.7 61.0 64.1 65.5   68.6 66.6 65.7 65.1 70.0   ;   run;  

The following statements produce confidence limits for the mean, standard deviation, and variance of the population of heights:

  title 'Analysis of Female Heights';   ods select BasicIntervals;   proc univariate data=Heights cibasic;   var Height;   run;  

The CIBASIC option requests confidence limits for the mean, standard deviation, and variance. For example, Output 3.9.1 shows that the 95% confidence interval for the population mean is (64 . 06 , 65 . 07). The ODS SELECT statement restricts the output to the BasicIntervals table; see the section ODS Table Names on page 309.

Output 3.9.1: Default 95% Confidence Limits
start example
  Analysis of Female Heights   The UNIVARIATE Procedure   Variable: Height (Height (in))   Basic Confidence Limits Assuming Normality   Parameter          Estimate     95% Confidence Limits   Mean               64.56667      64.06302    65.07031   Std Deviation       2.18900       1.88608     2.60874   Variance            4.79171       3.55731     6.80552  
end example
 

The confidence limits in Output 3.9.1 assume that the heights are normally distributed, so you should check this assumption before using these confidence limits.

See the section Shapiro-Wilk Statistic on page 293 for information on the Shapiro-Wilk test for normality in PROC UNIVARIATE. See Example 3.19 for an example using the test for normality.

By default, the confidence limits produced by the CIBASIC option produce 95% confidence intervals. You can request different level confidence limits by using the ALPHA= option in parentheses after the CIBASIC option. The following statements produce 90% confidence limits:

  title 'Analysis of Female Heights';   ods select BasicIntervals;   proc univariate data=Heights cibasic(alpha=.1);   var Height;   run;  

The 90% confidence limits are displayed in Output 3.9.2.

Output 3.9.2: 90% Confidence Limits
start example
  Analysis of Female Heights   The UNIVARIATE Procedure   Variable: Height (Height (in))   Basic Confidence Limits Assuming Normality   Parameter          Estimate     90% Confidence Limits   Mean               64.56667      64.14564    64.98770   Std Deviation       2.18900       1.93114     2.53474   Variance            4.79171       3.72929     6.42492  
end example
 

For the formulas used to compute these limits, see the section Confidence Limits for Parameters of the Normal Distribution on page 277.

A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.10. Computing Confidence Limits for Quantiles and Percentiles

This example, which is a continuation of Example 3.9, illustrates how to compute confidence limits for quantiles and percentiles. A second researcher is more interested in summarizing the heights with quantiles than the mean and standard deviation. He is also interested in computing 90% confidence intervals for the quantiles. The following statements produce estimated quantiles and confidence limits for the population quantiles:

  title 'Analysis of Female Heights';   ods select Quantiles;   proc univariate data=Heights ciquantnormal(alpha=.1);   var Height;   run;  

The ODS SELECT statement restricts the output to the Quantiles table; see the section ODS Table Names on page 309. The CIQUANTNORMAL option produces confidence limits for the quantiles. As noted in Output 3.10.1, these limits assume that the data are normally distributed. You should check this assumption before using these confidence limits. See the section Shapiro-Wilk Statistic on page 293 for information on the Shapiro-Wilk test for normality in PROC UNIVARIATE; see Example 3.19 for an example using the test for normality.

Output 3.10.1: Normal-Based Quantile Confidence Limits
start example
  Analysis of Female Heights   The UNIVARIATE Procedure   Variable:  Height  (Height (in))   Quantiles (Definition 5)   90% Confidence Limits   Quantile    Estimate      Assuming Normality   100% Max        70.0   99%             70.0    68.94553     70.58228   95%             68.6    67.59184     68.89311   90%             67.5    66.85981     68.00273   75% Q3          66.0    65.60757     66.54262   50% Median      64.4    64.14564     64.98770   25% Q1          63.1    62.59071     63.52576   10%             61.6    61.13060     62.27352   5%              60.6    60.24022     61.54149   1%              60.0    58.55106     60.18781   0% Min          60.0  
end example
 

It is also possible to use PROC UNIVARIATE to compute confidence limits for quantiles without assuming normality. The following statements use the CIQUANTDF option to request distribution-free confidence limits for the quantiles of the population of heights:

  title 'Analysis of Female Heights';   ods select Quantiles;   proc univariate data=Heights ciquantdf(alpha=.1);   var Height;   run;  

The distribution-free confidence limits are shown in Output 3.10.2.

Output 3.10.2: Distribution-Free Quantile Confidence Limits
start example
  Analysis of Female Heights   The UNIVARIATE Procedure   Variable:   Height  (Height (in))   Quantiles   (Definition 5)   90% Confidence   Limits       --------Order Statistics-------   Quantile        Estimate       Distribution Free          LCL Rank    UCL Rank    Coverage   100% Max            70.0   99%                 70.0         68.6           70.0            73          75       48.97   95%                 68.6         67.5           70.0            68          75       94.50   90%                 67.5         66.6           68.6            63          72       91.53   75% Q3              66.0         65.7           66.6            50          63       91.77   50% Median          64.4         64.1           65.1            31          46       91.54   25% Q1              63.1         62.7           63.7            13          26       91.77   10%                 61.6         60.6           62.7             4          13       91.53   5%                  60.6         60.0           61.6             1           8       94.50   1%                  60.0         60.0           60.5             1           3       48.97   0% Min              60.0  
end example
 

The table in Output 3.10.2 includes the ranks from which the confidence limits are computed. For more information on how these confidence limits are calculated, see the section Confidence Limits for Percentiles on page 274. Note that confidence limits for quantiles are not produced when the WEIGHT statement is used.

A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.11. Computing Robust Estimates

This example illustrates how you can use the UNIVARIATE procedure to compute robust estimates of location and scale. The following statements compute these estimates for the variable Systolic in the data set BPressure , which was introduced in Example 3.1:

  title 'Robust Estimates for Blood Pressure Data';   ods select TrimmedMeans WinsorizedMeans RobustScale;   proc univariate data=BPressure trimmed=1 .1   winsorized=.1 robustscale;   var Systolic;   run;  

The ODS SELECT statement restricts the output to the TrimmedMeans, WinsorizedMeans, and RobustScale tables; see the section ODS Table Names on page 309. The TRIMMED= option computes two trimmed means, the first after removing one observation and the second after removing 10% of the observations. If the value of TRIMMED= is greater than or equal to one, it is interpreted as the number of observations to be trimmed. The WINSORIZED= option computes a Winsorized mean that replaces three observations from the tails with the next closest observations. (Three observations are replaced because np = (22)( . 1) = 2 . 2, and three is the smallest integer greater than 2.2.) The trimmed and Winsorized means for Systolic are displayed in Output 3.11.1.

Output 3.11.1: Computation of Trimmed and Winsorized Means
start example
  Robust Estimates for Blood Pressure Data   The UNIVARIATE Procedure   Variable: Systolic   Trimmed Means   Percent      Number               Std Error   Trimmed     Trimmed     Trimmed     Trimmed      95% Confidence         t for H0:   in Tail     in Tail        Mean        Mean          Limits         DF   Mu0=0.00    Pr > t   4.55           1    120.3500    2.573536    114.9635  125.7365   19   46.76446       <.0001   13.64           3    120.3125    2.395387    115.2069  125.4181   15   50.22675       <.0001   Winsorized Means   Percent       Number               Std Error   Winsorized   Winsorized  Winsorized  Winsorized    95% Confidence        t for H0:   in Tail      in Tail        Mean        Mean        Limits        DF   Mu0=0.00   Pr > t   13.64            3    120.6364    2.417065  115.4845 125.7882   15   49.91027     <.0001  
end example
 

Output 3.11.1 shows the trimmed mean for Systolic is 120.35 after one observation has been trimmed, and 120.31 after 3 observations are trimmed. The Winsorized mean for Systolic is 120.64. For details on trimmed and Winsorized means, see the section Robust Estimators on page 278. The trimmed means can be compared with the means shown in Output 3.1.1 (from Example 3.1), which displays the mean for Systolic as 121.273.

The ROBUSTSCALE option requests a table, displayed in Output 3.11.2, which includes the interquartile range, Gini s mean difference, the median absolute deviation about the median, Q n , and S n .

Output 3.11.2: Computation of Robust Estimates of Scale
start example
  Robust Estimates for Blood Pressure Data   Variable: Systolic   Robust Measures of Scale   Estimate   Measure                     Value   of Sigma   Interquartile Range      13.00000    9.63691   Gini's Mean Difference   15.03030   13.32026   MAD                       6.50000    9.63690   Sn                        9.54080    9.54080   Qn                       13.33140   11.36786  
end example
 

Output 3.11.2 shows the robust estimates of scale for Systolic . For instance, the interquartile range is 13. The estimates of ƒ range from 9.54 to 13.32. See the section Robust Estimators on page 278.

A sample program, uniex01.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.12. Testing for Location

This example, which is a continuation of Example 3.9, illustrates how to carry out three tests for location: the Student s t test, the sign test, and the Wilcoxon signed rank test. These tests are discussed in the section Tests for Location on page 275.

The following statements demonstrate the tests for location using the Heights data set introduced in Example 3.9. Since the data consists of adult female heights, the researchers are not interested in testing if the mean of the population is equal to zero inches, which is the default µ value. Instead, they are interested in testing if the mean is equal to 66 inches. The following statements test the null hypothesis H : µ = 66:

  title 'Analysis of Female Height Data';   ods select TestsForLocation LocationCounts;   proc univariate data=Heights mu0=66 loccount;   var Height;   run;  

The ODS SELECT statement restricts the output to the TestsForLocation and LocationCounts tables; see the section ODS Table Names on page 309. The MU0= option specifies the null hypothesis value of µ for the tests for location; by default, µ = 0. The LOCCOUNT option produces the table of the number of observations greater than, not equal to, and less than 66 inches.

Output 3.12.1 contains the results of the tests for location. All three tests are highly significant, causing the researchers to reject the hypothesis that the mean is 66 inches.

Output 3.12.1: Tests for Location with MU0=66 and LOCCOUNT
start example
  Analysis of Female Height Data   The UNIVARIATE Procedure   Variable:  Height  (Height (in))   Tests for Location: Mu0=66   Test           -Statistic-    -----p Value------   Student's t    t  -5.67065    Pr > t    <.0001   Sign           M       -20    Pr >= M   <.0001   Signed Rank    S      -849    Pr >= S   <.0001   Location Counts: Mu0=66.00   Count                Value   Num Obs > Mu0           16   Num Obs ^= Mu0          72   Num Obs < Mu0           56  
end example
 

A sample program, uniex07.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.13. Performing a Sign Test Using Paired Data

This example demonstrates a sign test for paired data, which is a specific application of the Tests for Location discussed in Example 3.12.

The instructor from Example 3.4 is now interested in performing a sign test for the pairs of test scores in his college course. The following statements request basic statistical measures and tests for location:

  title 'Test Scores for a College Course';   ods select BasicMeasures TestsForLocation;   proc univariate data=Score;   var ScoreChange;   run;  

The ODS SELECT statement restricts the output to the BasicMeasures and TestsForLocation tables; see the section ODS Table Names on page 309. The instructor is not willing to assume the ScoreChange variable is normal or even symmetric, so he decides to examine the sign test. The large p -value (0.7744) of the sign test provides insufficient evidence of a difference in test score medians.

Output 3.13.1: Sign Test for ScoreChange
start example
  Test Scores for a College Course   The UNIVARIATE Procedure   Variable: ScoreChange  (Change in Test Scores)   Basic Statistical Measures   Location                    Variability   Mean      -3.08333    Std Deviation           13.33797   Median    -3.00000    Variance               177.90152   Mode      -5.00000    Range                   51.00000   Interquartile Range     10.50000   NOTE: The mode displayed is the smallest of 2 modes with a count of 2.   Tests for Location: Mu0=0   Test            -Statistic-    -----p Value------   Student's t     t  -0.80079    Pr > t    0.4402   Sign            M        -1    Pr >= M   0.7744   Signed Rank     S      -8.5    Pr >= S   0.5278  
end example
 

A sample program, uniex03.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.14. Creating a Histogram

This example illustrates how to create a histogram. A semiconductor manufacturer produces printed circuit boards that are sampled to determine the thickness of their copper plating . The following statements create a data set named Trans , which contains the plating thicknesses ( Thick ) of 100 boards:

  data Trans;   input Thick @@;   label Thick = 'Plating Thickness (mils)';   datalines;   3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512   3.490 3.467 3.498 3.519 3.504 3.469 3.497 3.495 3.518 3.523   3.458 3.478 3.443 3.500 3.449 3.525 3.461 3.489 3.514 3.470   3.561 3.506 3.444 3.479 3.524 3.531 3.501 3.495 3.443 3.458   3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516 3.476   3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520   3.505 3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491   3.510 3.461 3.431 3.502 3.491 3.506 3.439 3.513 3.496 3.539   3.469 3.481 3.515 3.535 3.460 3.575 3.488 3.515 3.484 3.482   3.517 3.483 3.467 3.467 3.502 3.471 3.516 3.474 3.500 3.466   ;   run;  

The following statements create the histogram shown in Output 3.14.1.

  title 'Analysis of Plating Thickness';   proc univariate data=Trans noprint;   histogram Thick;   run;  
Output 3.14.1: Histogram for Plating Thickness
start example
  click to expand  
end example
 

The NOPRINT option in the PROC UNIVARIATE statement suppresses tables of summary statistics for the variable Thick that would be displayed by default. A histogram is created for each variable listed in the HISTOGRAM statement.

A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.15. Creating a One-Way Comparative Histogram

This example illustrates how to create a comparative histogram. The effective channel length (in microns) is measured for 1225 field effect transistors . The channel lengths ( Length ) are stored in a data set named Channel , which is partially listed in Output 3.15.1:

Output 3.15.1: Partial Listing of Data Set Channel
start example
  The Data Set Channel   Lot     Length   Lot 1     0.91   .        .   Lot 1     1.17   Lot 2     1.47   .        .   Lot 2     1.39   Lot 3     2.04   .        .   Lot 3     1.91  
end example
 

The following statements request a histogram, which is shown in Output 3.15.2 of Length ignoring the lot source:

  title 'Histogram of Length Ignoring Lot Source';   proc univariate data=Channel noprint;   histogram Length;   run;  
Output 3.15.2: Histogram for Length Ignoring Lot Source
start example
  click to expand  
end example
 

To investigate whether the peaks (modes) in Output 3.15.2 are related to the lot source, you can create a comparative histogram using Lot as a classification variable. The following statements create the histogram shown in Output 3.15.3:

  title 'Comparative Analysis of Lot Source';   proc univariate data=Channel noprint;   class Lot;   histogram Length / nrows = 3;   run;  
Output 3.15.3: Comparison by Lot Source
start example
  click to expand  
end example
 

The CLASS statement requests comparisons for each level (distinct value) of the classification variable Lot . The HISTOGRAM statement requests a comparative histogram for the variable Length . The NROWS= option specifies the number of rows in the comparative histogram. By default, comparative histograms are displayed in two rows per panel.

Output 3.15.3 reveals that the distributions of Length are similarly distributed except for shifts in mean.

A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.16. Creating a Two-Way Comparative Histogram

This example illustrates how to create a two-way comparative histogram. Two suppliers (A and B) provide disk drives for a computer manufacturer. The manufacturer measures the disk drive opening width to determine whether there has been a change in variability from 2002 to 2003 for each supplier.

The following statements save the measurements in a data set named Disk . There are two classification variables, Supplier and Year , and a user-defined format is associated with Year .

  proc format ;   value mytime 1 = '2002' 2 = '2003';   data Disk;   input @1 Supplier . Year Width;   label Width = 'Opening Width (inches)';   format Year mytime.;   datalines;   Supplier A   1   1.8932   .        .    .   Supplier B   1   1.8986   Supplier A   2   1.8978   .        .    .   Supplier B   2   1.8997   ;  

The following statements create the comparative histogram in Output 3.16.1:

  title 'Results of Supplier Training Program';   proc univariate data=Disk noprint;   class Supplier Year / keylevel = ('Supplier A' '2003');   histogram Width / intertile  = 1.0   vaxis      = 0 10 20 30   ncols      = 2   nrows      = 2   cfill      = ligr   cframetop  = yellow   cframeside = yellow;   run;  
Output 3.16.1: Two-Way Comparative Histogram
start example
  click to expand  
end example
 

The KEYLEVEL= option specifies the key cell as the cell for which Supplier is equal to ˜SUPPLIER A and Year is equal to ˜2003. This cell determines the binning for the other cells , and the columns are arranged so that this cell is displayed in the upper left corner. Without the KEYLEVEL= option, the default key cell would be the cell for which Supplier is equal to ˜SUPPLIER A and Year is equal to ˜2002 ; the column labeled ˜2002 would be displayed to the left of the column labeled ˜2003.

The VAXIS= option specifies the tick mark labels for the vertical axis. The NROWS=2 and NCOLS=2 options specify a 2 — 2 arrangement for the tiles. The CFRAMESIDE= and CFRAMETOP= options specify fill colors for the row and column labels, and the CFILL= option specifies a fill color for the bars. Output 3.16.1 provides evidence that both suppliers have reduced variability from 2002 to 2003.

A sample program, uniex10.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.17. Adding Insets with Descriptive Statistics

This example illustrates how to add insets with descriptive statistics to a comparative histogram; see Output 3.17.1. Three similar machines are used to attach a part to an assembly. One hundred assemblies are sampled from the output of each machine, and a part position is measured in millimeters. The following statements create the data set Machines , which contains the measurements in a variable named Position :

  data Machines;   input Position @@;   label Position = 'Position in Millimeters';   if      (_n_ <= 100) then Machine = 'Machine 1';   else if (_n_ <= 200) then Machine = 'Machine 2';   else                      Machine = 'Machine 3';   datalines;   -0.17  -0.19  -0.24  -0.24  -0.12   0.07  -0.61   0.22  1.91  -0.08   -0.59   0.05  -0.38   0.82  -0.14   0.32   0.12  -0.02  0.26   0.19   ...   0.48   0.41   0.78   0.58   0.43   0.07   0.27   0.49  0.79   0.92   0.79   0.66   0.22   0.71   0.53   0.57   0.90   0.48  1.17   1.03   ;   run;  
Output 3.17.1: Comparative Histograms
start example
  click to expand  
end example
 

The following statements create the comparative histogram in Output 3.17.1:

  title 'Machine Comparision Study';   proc univariate data=Machines noprint;   class Machine;   histogram Position / nrows      = 3   intertile  = 1   midpoints  = -1.2 to 2.2 by 0.1   vaxis      = 0 to 16 by 4   cfill      = ligr;   inset mean std="Std Dev" / pos = ne format = 6.3;   run;  

The INSET statement requests insets containing the sample mean and standard deviation for each machine in the corresponding tile. The MIDPOINTS= option specifies the midpoints of the histogram bins .

Output 3.17.1 shows that the average position for Machines 2 and 3 are similar and that the spread for Machine 1 is much larger than for Machines 2 and 3.

A sample program, uniex11.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.18. Binning a Histogram

This example, which is a continuation of Example 3.14, demonstrates various methods for binning a histogram. This example also illustrates how to save bin percentages in an OUTHISTOGRAM= data set.

The manufacturer from Example 3.14 now wants to enhance the histogram by changing the endpoints of the bins using the ENDPOINTS= option. The following statements create a histogram with bins that have end points 3.425 and 3.6 and width 0.025:

  title 'Enhancing a Histogram';   ods select HistogramBins MyHist;   proc univariate data=Trans;   histogram Thick / midpercents name='MyHist'   endpoints = 3.425 to 3.6 by .025;   run;  

The ODS SELECT statement restricts the output to the HistogramBins table and the MyHist histogram; see the section ODS Table Names on page 309. The ENDPOINTS= option specifies the endpoints for the histogram bins. By default, if the ENDPOINTS= option is not specified, the automatic binning algorithm computes values for the midpoints of the bins. The MIDPERCENTS option requests a table of the midpoints of each histogram bin and the percent of the observations that fall in each bin. This table is displayed in Output 3.18.1; the histogram is displayed in Output 3.18.2. The NAME= option specifies a name for the histogram that can be used in the ODS SELECT statement.

Output 3.18.1: Table of Bin Percentages Requested with MIDPERCENTS Option
start example
  Enhancing a Histogram   The UNIVARIATE Procedure   Variable: Thick   Histogram Bins for Thick   Bin   Minimum   Observed   Point    Percent   3.425      8.000   3.450     21.000   3.475     25.000   3.500     29.000   3.525     11.000   3.550      5.000   3.575      1.000  
end example
 
Output 3.18.2: Histogram with ENDPOINTS= Option
start example
  click to expand  
end example
 

The MIDPOINTS= option is an alternative to the ENDPOINTS= option for specifying histogram bins. The following statements create a similar histogram, which is shown in Output 3.18.3, to the one in Output 3.18.2:

  title 'Enhancing a Histogram';   proc univariate data=Trans noprint;   histogram Thick / midpoints    = 3.4375 to 3.5875 by .025   rtinclude   outhistogram = OutMdpts;   run;  
Output 3.18.3: Histogram with MIDPOINTS= and RTINCLUDE Options
start example
  click to expand  
end example
 

Output 3.18.3 differs from Output 3.18.2 in two ways:

  • The MIDPOINTS= option specifies the bins for the histogram by specifying the midpoints of the bins instead of specifying the endpoints. Note that the histogram displays midpoints instead of endpoints.

  • The RTINCLUDE option request that the right endpoint of each bin be included in the histogram interval instead of the default, which is to include the left endpoint in the interval. This changes the histogram slightly from Output 3.18.2. Six observations have a thickness equal to an endpoint of an interval. For instance, there is one observation with a thickness of 3.45 mils. In Output 3.18.3, this observation is included in the bin from 3.425 to 3.45.

The OUTHISTOGRAM= option produces an output data set named OutMdpts , displayed in Output 3.18.4. This data set provides information on the bins of the histogram. For more information, see the section OUTHISTOGRAM= Output Data Set on page 308.

Output 3.18.4: The OUTHISTOGRAM= Data Set OutMdpts
start example
  OUTHISTOGRAM= Data Set   Obs    _VAR_    _MIDPT_    _OBSPCT_   1     Thick     3.4375        9   2     Thick     3.4625       21   3     Thick     3.4875       26   4     Thick     3.5125       28   5     Thick     3.5375       11   6     Thick     3.5625        5   7     Thick     3.5875        0  
end example
 

A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.19. Adding a Normal Curve to a Histogram

This example is a continuation of Example 3.14. The following statements fit a normal distribution to the thickness measurements in the Trans data set and superimpose the fitted density curve on the histogram:

  title 'Analysis of Plating Thickness';   ods select ParameterEstimates GoodnessOfFit FitQuantiles Bins MyPlot;   proc univariate data=Trans;   histogram Thick / normal(percents=20 40 60 80 midpercents)   name='MyPlot';   inset n normal(ksdpval) / pos = ne format = 6.3;   run;  

The ODS SELECT statement restricts the output to the ParameterEstimates, GoodnessOfFit, FitQuantiles, and Bins tables; see the section ODS Table Names on page 309. The NORMAL option requests specifies that the normal curve is to be displayed on the histogram shown in Output 3.19.3. It also requests a summary of the fitted distribution, which is shown in Output 3.19.1 and Output 3.19.2. This summary includes goodness-of-fit tests, parameter estimates, and quantiles of the fitted distribution. (If you specify the NORMALTEST option in the PROC UNIVARIATE statement, the Shapiro-Wilk test for normality will be included in the tables of statistical output.)

Output 3.19.1: Summary of Fitted Normal Distribution
start example
  Analysis of Plating Thickness   The UNIVARIATE Procedure   Fitted Distribution for Thick   Parameters for Normal Distribution   Parameter   Symbol   Estimate   Mean        Mu        3.49533   Std Dev     Sigma    0.032117   Goodness-of-Fit Tests for Normal Distribution   Test                  ---Statistic----   -----p Value-----   Kolmogorov-Smirnov    D     0.05563823   Pr > D     >0.150   Cramer-von Mises      W-Sq  0.04307548   Pr > W-Sq  >0.250   Anderson-Darling      A-Sq  0.27840748   Pr > A-Sq  >0.250  
end example
 
Output 3.19.2: Summary of Fitted Normal Distribution (cont.)
start example
  Analysis of Plating Thickness   Fitted Distribution for Thick   Histogram Bin Percents for Normal Distribution   Bin   -------Percent------   Midpoint   Observed   Estimated   3.43      3.000       3.296   3.45      9.000       9.319   3.47     23.000      18.091   3.49     19.000      24.124   3.51     24.000      22.099   3.53     15.000      13.907   3.55      3.000       6.011   3.57      4.000       1.784   Quantiles for Normal Distribution   ------Quantile------   Percent   Observed   Estimated   20.0    3.46700     3.46830   40.0    3.48350     3.48719   60.0    3.50450     3.50347   80.0    3.52250     3.52236  
end example
 
Output 3.19.3: Histogram Superimposed with Normal Curve
start example
  click to expand  
end example
 

Two secondary options are specified in parentheses after the NORMAL primary option. The PERCENTS= option specifies quantiles, which are to be displayed in the FitQuantiles table. The MIDPERCENTS option requests a table that lists the midpoints, the observed percentage of observations, and the estimated percentage of the population in each interval (estimated from the fitted normal distribution). See Table 3.3 on page 214 and Table 3.8 on page 215 for the secondary options that can be specified with after the NORMAL primary option.

The histogram of the variable Thick with a superimposed normal curve is shown in Output 3.19.3.

The estimated parameters for the normal curve ( = 3 . 50 and = 0 . 03) are shown in Output 3.19.1. By default, the parameters are estimated unless you specify values with the MU= and SIGMA= secondary options after the NORMAL primary option. The results of three goodness-of-fit tests based on the empirical distribution function (EDF) are displayed in Output 3.19.1. Since the p -values are all greater than 0.15, the hypothesis of normality is not rejected.

A sample program, uniex08.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.20. Adding Fitted Normal Curves to a Comparative Histogram

This example is a continuation of Example 3.15, which introduced the data set Channel on page 334. In Output 3.15.3, it appears that the channel lengths in each lot are normally distributed. The following statements use the NORMAL option to fit a normal distribution for each lot:

  title 'Comparative Analysis of Lot Source';   proc univariate data=Channel noprint;   class Lot;   histogram Length / nrows        = 3   intertile    = 1   cprop        = orange   normal(color = black noprint);   inset n = "N" / pos = nw;   run;  

The NOPRINT option in the PROC UNIVARIATE statement suppresses the tables of statistical output produced by default; the NOPRINT option in parentheses after the NORMAL option suppresses the tables of statistical output related to the fit of the normal distribution. The normal parameters are estimated from the data for each lot, and the curves are superimposed on each component histogram. The INTERTILE= option specifies the space between the framed areas, which are referred to as tiles. The CPROP= option requests the shaded bars above each tile, which represent the relative frequencies of observations in each lot. The comparative histogram is displayed in Output 3.20.1.

Output 3.20.1: Fitting Normal Curves to a Comparative Histogram
start example
  click to expand  
end example
 

A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.21. Fitting a Beta Curve

You can use a beta distribution to model the distribution of a variable that is known to vary between lower and upper bounds. In this example, a manufacturing company uses a robotic arm to attach hinges on metal sheets. The attachment point should be offset 10.1 mm from the left edge of the sheet. The actual offset varies between 10.0 and 10.5 mm due to variation in the arm. The following statements save the offsets for 50 attachment points as the values of the variable Length in the data set Robots :

  data Robots;   input Length @@;   label Length = 'Attachment Point Offset (in mm)';   datalines;   10.147 10.070 10.032 10.042 10.102   10.034 10.143 10.278 10.114 10.127   10.122 10.018 10.271 10.293 10.136   10.240 10.205 10.186 10.186 10.080   10.158 10.114 10.018 10.201 10.065   10.061 10.133 10.153 10.201 10.109   10.122 10.139 10.090 10.136 10.066   10.074 10.175 10.052 10.059 10.077   10.211 10.122 10.031 10.322 10.187   10.094 10.067 10.094 10.051 10.174   ;   run;  

The following statements create a histogram with a fitted beta density curve, shown in Output 3.21.1:

  title 'Fitted Beta Distribution of Offsets';   ods select ParameterEstimates FitQuantiles MyHist;   proc univariate data=Robots;   histogram Length /   beta(theta=10 scale=0.5 color=red fill)   cfill     = yellow   cframe    = ligr   href      = 10   hreflabel = 'Lower Bound'   lhref     = 2   vaxis     = axis1   name      = 'MyHist';   axis1 label=(a=90 r=0);   inset n = 'Sample Size'   beta / pos=ne  cfill=blank;   run;  
Output 3.21.1: Superimposing a Histogram with a Fitted Beta Curve
start example
  click to expand  
end example
 

The ODS SELECT statement restricts the output to the ParameterEstimates and FitQuantiles tables; see the section ODS Table Names on page 309. The BETA primary option requests a fitted beta distribution. The THETA= secondary option specifies the lower threshold. The SCALE= secondary option specifies the range between the lower threshold and the upper threshold. Note that the default THETA= and SCALE= values are zero and one, respectively.

The FILL secondary option specifies that the area under the curve is to be filled with the CFILL= color. (If FILL were omitted, the CFILL= color would be used to fill the histogram bars instead.)

The HREF= option draws a reference line at the lower bound, and the HREFLABEL= option adds the label Lower Bound . The LHREF= option specifies a dashed line type for the reference line. The INSET statement adds an inset with the sample size positioned in the northeast corner of the plot.

In addition to displaying the beta curve, the BETA option requests a summary of the curve fit. This summary, which includes parameters for the curve and the observed and estimated quantiles, is shown in Output 3.21.2. A sample program, uniex12.sas , for this example is available in the SAS Sample Library for Base SAS software.

Output 3.21.2: Summary of Fitted Beta Distribution
start example
  Fitted Beta Distribution of Offsets   Fitted Distribution for Length   Parameters for Beta Distribution   Parameter   Symbol   Estimate   Threshold   Theta          10   Scale       Sigma         0.5   Shape       Alpha     2.06832   Shape       Beta     6.022479   Mean                 10.12782   Std Dev              0.072339   Quantiles for Beta Distribution   ------Quantile------   Percent   Observed   Estimated   1.0    10.0180     10.0124   5.0    10.0310     10.0285   10.0    10.0380     10.0416   25.0    10.0670     10.0718   50.0    10.1220     10.1174   75.0    10.1750     10.1735   90.0    10.2255     10.2292   95.0    10.2780     10.2630   99.0    10.3220     10.3237  
end example
 

Example 3.22. Fitting Lognormal, Weibull, and Gamma Curves

To determine an appropriate model for a data distribution, you should consider curves from several distribution families. As shown in this example, you can use the HISTOGRAM statement to fit more than one distribution and display the density curves on a histogram.

The gap between two plates is measured (in cm) for each of 50 welded assemblies selected at random from the output of a welding process. The following statements save the measurements (Gap) in a data set named Plates :

  data Plates;   label Gap = 'Plate Gap in cm';   input Gap @@;   datalines;   0.746  0.357  0.376  0.327  0.485  1.741  0.241  0.777  0.768  0.409   0.252  0.512  0.534  1.656  0.742  0.378  0.714  1.121  0.597  0.231   0.541  0.805  0.682  0.418  0.506  0.501  0.247  0.922  0.880  0.344   0.519  1.302  0.275  0.601  0.388  0.450  0.845  0.319  0.486  0.529   1.547  0.690  0.676  0.314  0.736  0.643  0.483  0.352  0.636  1.080   ;   run;  

The following statements fit three distributions (lognormal, Weibull, and gamma) and display their density curves on a single histogram:

  title 'Distribution of Plate Gaps';   ods select ParameterEstimates GoodnessOfFit FitQuantiles MyHist;   proc univariate data=Plates;   var Gap;   histogram / midpoints=0.2 to 1.8 by 0.2   lognormal (l=1)   weibull (l=2)   gamma (l=8)   vaxis = axis1   name  = 'MyHist';   inset n mean(5.3) std='Std Dev'(5.3) skewness(5.3)   / pos = ne header = 'Summary Statistics';   axis1 label=(a=90 r=0);   run;  

The ODS SELECT statement restricts the output to the ParameterEstimates, GoodnessOfFit, and FitQuantiles tables; see the section ODS Table Names on page 309. The LOGNORMAL, WEIBULL, and GAMMA primary options request superimposed fitted curves on the histogram in Output 3.22.1. The L= secondary options specify distinct line types for the curves. Note that a threshold parameter = 0 is assumed for each curve. In applications where the threshold is not zero, you can specify with the THETA= secondary option.

Output 3.22.1: Superimposing a Histogram with Fitted Curves
start example
  click to expand  
end example
 

The LOGNORMAL, WEIBULL, and GAMMA options also produce the summaries for the fitted distributions shown in Output 3.22.2 through Output 3.22.5.

Output 3.22.2: Summary of Fitted Lognormal Distribution
start example
  Distribution of Plate Gaps   Fitted Distributions for Gap   Parameters for Lognormal Distribution   Parameter   Symbol   Estimate   Threshold   Theta           0   Scale       Zeta     -0.58375   Shape       Sigma    0.499546   Mean                 0.631932   Std Dev              0.336436   Goodness-of-Fit Tests for Lognormal Distribution   Test                  ---Statistic----   -----p Value-----   Kolmogorov-Smirnov    D     0.06441431   Pr > D     >0.150   Cramer-von Mises      W-Sq  0.02823022   Pr > W-Sq  >0.500   Anderson-Darling      A-Sq  0.24308402   Pr > A-Sq  >0.500  
end example
 
Output 3.22.3: Summary of Fitted Lognormal Distribution (cont.)
start example
  Distribution of Plate Gaps   Fitted Distributions for Gap   Quantiles for Lognormal Distribution   ------Quantile------   Percent   Observed   Estimated   1.0    0.23100     0.17449   5.0    0.24700     0.24526   10.0    0.29450     0.29407   25.0    0.37800     0.39825   50.0    0.53150     0.55780   75.0    0.74600     0.78129   90.0    1.10050     1.05807   95.0    1.54700     1.26862   99.0    1.74100     1.78313  
end example
 
Output 3.22.4: Summary of Fitted Weibull Distribution
start example
  Distribution of Plate Gaps   Fitted Distributions for Gap   Parameters for Weibull Distribution   Parameter   Symbol   Estimate   Threshold   Theta           0   Scale       Sigma    0.719208   Shape       C        1.961159   Mean                 0.637641   Std Dev              0.339248   Goodness-of-Fit Tests for Weibull Distribution   Test                  ---Statistic----   -----p Value-----   Cramer-von Mises      W-Sq  0.15937281   Pr > W-Sq   0.016   Anderson-Darling      A-Sq  1.15693542   Pr > A-Sq  <0.010   Quantiles for Weibull Distribution   ------Quantile------   Percent   Observed   Estimated   1.0    0.23100     0.06889   5.0    0.24700     0.15817   10.0    0.29450     0.22831   25.0    0.37800     0.38102   50.0    0.53150     0.59661   75.0    0.74600     0.84955   90.0    1.10050     1.10040   95.0    1.54700     1.25842   99.0    1.74100     1.56691  
end example
 
Output 3.22.5: Summary of Fitted Gamma Distribution
start example
  Distribution of Plate Gaps   Fitted Distributions for Gap   Parameters for Gamma Distribution   Parameter   Symbol   Estimate   Threshold   Theta           0   Scale       Sigma    0.155198   Shape       Alpha    4.082646   Mean                  0.63362   Std Dev              0.313587   Goodness-of-Fit Tests for Gamma Distribution   Test                  ---Statistic----   -----p Value-----   Kolmogorov-Smirnov    D     0.09695325   Pr > D     >0.250   Cramer-von Mises      W-Sq  0.07398467   Pr > W-Sq  >0.250   Anderson-Darling      A-Sq  0.58106613   Pr > A-Sq   0.137   Quantiles for Gamma Distribution   ------Quantile------   Percent   Observed   Estimated   1.0    0.23100     0.13326   5.0    0.24700     0.21951   10.0    0.29450     0.27938   25.0    0.37800     0.40404   50.0    0.53150     0.58271   75.0    0.74600     0.80804   90.0    1.10050     1.05392   95.0    1.54700     1.22160   99.0    1.74100     1.57939  
end example
 

Output 3.22.2 provides three EDF goodness-of-fit tests for the lognormal distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. At the ± = 0 . 10 significance level, all tests support the conclusion that the two-parameter lognormal distribution with scale parameter = ˆ’ . 58 and shape parameter = 0 . 50 provides a good model for the distribution of plate gaps.

Output 3.22.4 provides two EDF goodness-of-fit tests for the Weibull distribution: the Anderson-Darling and the Cram r-von Mises tests. The p -values for the EDF tests are all less than 0.10, indicating that the data do not support a Weibull model.

Output 3.22.5 provides three EDF goodness-of-fit tests for the gamma distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. At the ± = 0 . 10 significance level, all tests support the conclusion that the gamma distribution with scale parameter ƒ = 0 . 16 and shape parameter ± = 4 . 08 provides a good model for the distribution of plate gaps.

Based on this analysis, the fitted lognormal distribution and the fitted gamma distribution are both good models for the distribution of plate gaps. A sample program, uniex13.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.23. Computing Kernel Density Estimates

This example illustrates the use of kernel density estimates to visualize a nonnormal data distribution. This example uses the data set Channel , which is introduced in Example 3.15.

When you compute kernel density estimates, you should try several choices for the bandwidth parameter c since this determines the smoothness and closeness of the fit. You can specify a list of up to five C= values with the KERNEL option to request multiple density estimates, as shown in the following statements:

  title 'FET Channel Length Analysis';   proc univariate data=Channel noprint;   histogram Length / kernel(c = 0.25 0.50 0.75 1.00   l = 1 20 2 34   color=red   noprint);   run;  

The L= secondary option specifies distinct line types for the curves (the L= values are paired with the C= values in the order listed). Output 3.23.1 demonstrates the effect of c . In general, larger values of c yield smoother density estimates, and smaller values yield estimates that more closely fit the data distribution.

Output 3.23.1: Multiple Kernel Density Estimates
start example
  click to expand  
end example
 

Output 3.23.1 reveals strong trimodality in the data, which is displayed with comparative histograms in Example 3.15.

A sample program, uniex09.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.24. Fitting a Three-Parameter Lognormal Curve

If you request a lognormal fit with the LOGNORMAL primary option, a two-parameter lognormal distribution is assumed. This means that the shape parameter ƒ and the scale parameter are unknown (unless specified) and that the threshold is known (it is either specified with the THETA= option or assumed to be zero).

If it is necessary to estimate in addition to and ƒ , the distribution is referred to as a three-parameter lognormal distribution. This example shows how you can request a three-parameter lognormal distribution.

A manufacturing process produces a plastic laminate whose strength must exceed a minimum of 25 psi. Samples are tested , and a lognormal distribution is observed for the strengths. It is important to estimate to determine whether the process meets the strength requirement. The following statements save the strengths for 49 samples in the data set Plastic :

  data Plastic;   label Strength = Strength in psi;   input Strength @@;   datalines;   30.26 31.23 71.96 47.39 33.93 76.15 42.21   81.37 78.48 72.65 61.63 34.90 24.83 68.93   43.27 41.76 57.24 23.80 34.03 33.38 21.87   31.29 32.48 51.54 44.06 42.66 47.98 33.73   25.80 29.95 60.89 55.33 39.44 34.50 73.51   43.41 54.67 99.43 50.76 48.81 31.86 33.88   35.57 60.41 54.92 35.66 59.30 41.96 45.32   ;   run;  

The following statements use the LOGNORMAL primary option in the HISTOGRAM statement to display the fitted three-parameter lognormal curve shown in Output 3.24.1:

  title 'Three-Parameter Lognormal Fit';   proc univariate data=Plastic noprint;   histogram Strength / lognormal(fill theta = est noprint)   cfill = white;   inset lognormal    / format=6.2 pos=ne;   run;  
Output 3.24.1: Three-Parameter Lognormal Fit
start example
  click to expand  
end example
 

The NOPRINT option suppresses the tables of statistical output produced by default. Specifying THETA=EST requests a local maximum likelihood estimate (LMLE) for , as described by Cohen (1951). This estimate is then used to compute maximum likelihood estimates ƒ for and .

Note: You can also specify THETA=EST with the WEIBULL primary option to fit a three-parameter Weibull distribution.

A sample program, uniex14.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.25. Annotating a Folded Normal Curve

This example shows how to display a fitted curve that is not supported by the HISTOGRAM statement. The offset of an attachment point is measured (in mm) for a number of manufactured assemblies, and the measurements ( Offset ) are saved in a data set named Assembly . The following statements create the data set Assembly :

  data Assembly;   label Offset = 'Offset (in mm)';   input Offset @@;   datalines;   11.11 13.07 11.42  3.92 11.08  5.40 11.22 14.69  6.27  9.76   9.18  5.07  3.51 16.65 14.10  9.69 16.61  5.67  2.89  8.13   9.97  3.28 13.03 13.78  3.13  9.53  4.58  7.94 13.51 11.43   11.98  3.90  7.67  4.32 12.69  6.17 11.48  2.82 20.42  1.01   3.18  6.02  6.63  1.72  2.42 11.32 16.49  1.22  9.13  3.34   1.29  1.70  0.65  2.62  2.04 11.08 18.85 11.94  8.34  2.07   0.31  8.91 13.62 14.94  4.83 16.84  7.09  3.37  0.49 15.19   5.16  4.14  1.92 12.70  1.97  2.10  9.38  3.18  4.18  7.22   15.84 10.85  2.35  1.93  9.19  1.39 11.40 12.20 16.07  9.23   0.05  2.15  1.95  4.39  0.48 10.16  4.81  8.28  5.68 22.81   0.23  0.38 12.71  0.06 10.11 18.38  5.53  9.36  9.32  3.63   12.93 10.39  2.05 15.49  8.12  9.52  7.77 10.70  6.37  1.91   8.60 22.22  1.74  5.84 12.90 13.06  5.08  2.09  6.41  1.40   15.60  2.36  3.97  6.17  0.62  8.56  9.36 10.19  7.16  2.37   12.91  0.95  0.89  3.82  7.86  5.33 12.92  2.64  7.92 14.06   ;   run;  

It is decided to fit a folded normal distribution to the offset measurements. A variable X has a folded normal distribution if X = Y , where Y is distributed as N ( µ, ƒ ). The fitted density is

click to expand

You can use SAS/IML to compute preliminary estimates of µ and ƒ based on a method of moments given by Elandt (1961). These estimates are computed by solving equation (19) of Elandt (1961), which is given by

click to expand

where (·) is the standard normal distribution function, and

Then the estimates of ƒ and µ are given by

click to expand

Begin by using PROC MEANS to compute the first and second moments and using the following DATA step to compute the constant A :

  proc means data = Assembly noprint;   var Offset;   output out=stat mean=m1 var=var n=n min = min;   run;   * Compute constant A from equation (19) of Elandt (1961);   data stat;   keep m2 a min;   set stat;   a  = (m1*m1);   m2 = ((n-1)/n)*var + a;   a  = a/m2;   run;  

Next, use the SAS/IML subroutine NLPDD to solve equation (19) by minimizing ( f ( ) ˆ’ A ) 2 , and compute and :

  proc iml;   use stat;   read all var {m2}  into m2;   read all var {a}   into a;   read all var {min} into min;   * f(t) is the function in equation (19) of Elandt (1961);   start f(t) global(a);   y = .39894*exp(-0.5*t*t);   y = (2*y-(t*(1-2*probnorm(t))))**2/(1+t*t);   y = (y-a)**2;   return(y);   finish;   * Minimize (f(t)-A)**2 and estimate mu and sigma;   if (min < 0) then do;   print "Warning: Observations are not all nonnegative.";   print "     The folded normal is inappropriate.";   stop;   end;   if (a < 0.637) then do;   print "Warning: the folded normal may be inappropriate";   end;   opt = { 0 0 };   con = { 1e-6 };   x0  = { 2.0 };   tc  = { . . . . . 1e-8 . . . . . . .};   call nlpdd(rc,etheta0,"f",x0,opt,con,tc);   esig0 = sqrt(m2/(1+etheta0*etheta0));   emu0  = etheta0*esig0;   create prelim var {emu0 esig0 etheta0};   append;   close prelim;  

The preliminary estimates are saved in the data set Prelim , as shown in Output 3.25.1:

Output 3.25.1: Preliminary Estimates of µ, ƒ , and
start example
  The Data Set Prelim   EMU0      ESIG0     ETHETA0   6.51735    6.54953    0.99509  
end example
 

Now, using and as initial estimates, call the NLPDD subroutine to maximize the log likelihood, l ( µ, ƒ ), of the folded normal distribution, where, up to a constant,

click to expand
  * Define the log likelihood of the folded normal;   start g(p) global(x);   y = 0.0;   do i = 1 to nrow(x);   z = exp((-0.5/p[2])*(x[i]-p[1])*(x[i]-p[1]));   z = z + exp((-0.5/p[2])*(x[i]+p[1])*(x[i]+p[1]));   y = y + log(z);   end;   y = y - nrow(x)*log(sqrt(p[2]));   return(y);   finish;   * Maximize the log likelihood with subroutine NLPDD;   use assembly;   read all var {offset} into x;   esig0sq = esig0*esig0;   x0      = emu0esig0sq;   opt     = { 1 0 };   con     = { . 0.0, . . };   call nlpdd(rc,xr,"g",x0,opt,con);   emu     = xr[1];   esig    = sqrt(xr[2]);   etheta  = emu/esig;   create parmest var{emu esig etheta};   append;   close parmest;   quit;  

The data set ParmEst contains the maximum likelihood estimates and (as well as / ), as shown in Output 3.25.2:

Output 3.25.2: Final Estimates of µ, ƒ , and
start example
  The Data Set ParmEst   EMU       ESIG      ETHETA   6.66761   6.39650    1.04239  
end example
 

To annotate the curve on a histogram, begin by computing the width and endpoints of the histogram intervals. The following statements save these values in a data set called OutCalc . Note that a plot is not produced at this point.

  proc univariate data = Assembly noprint;   histogram Offset / outhistogram = out normal(noprint) noplot;   run;   data OutCalc (drop = _MIDPT_);   set out (keep = _MIDPT_) end = eof;   retain _MIDPT1_ _WIDTH_;   if _N_ = 1 then _MIDPT1_ = _MIDPT_;   if eof then do;   _MIDPTN_ = _MIDPT_;   _WIDTH_ = (_MIDPTN_ - _MIDPT1_) / (_N_ - 1);   output;   end;   run;  

Output 3.25.3 provides a listing of the data set OutCalc . The width of the histogram bars is saved as the value of the variable _WIDTH_ ; the midpoints of the first and last histogram bars are saved as the values of the variables _MIDPT1_ and _MIDPTN_ .

Output 3.25.3: The Data Set OutCalc
start example
  Data Set OutCalc   _MIDPT1_   _WIDTH_ _MIDPTN_   1.5        3      22.5  
end example
 

The following statements create an annotate data set named Anno , which contains the coordinates of the fitted curve:

  data Anno;   merge ParmEst OutCalc;   length function color $ 8;   function = 'point';   color    = 'black';   size     = 2;   xsys     = '2';   ysys     = '2';   when     = 'a';   constant = 39.894*_width_;;   left     = _midpt1_ - .5*_width_;   right    = _midptn_ + .5*_width_;   inc      = (right-left)/100;   do x = left to right by inc;   z1 = (x-emu)/esig;   z2 = (x+emu)/esig;   y  = (constant/esig)*(exp(-0.5*z1*z1)+exp(-0.5*z2*z2));   output;   function = 'draw';   end;   run;  

The following statements read the ANNOTATE= data set and display the histogram and fitted curve:

  title 'Folded Normal Distribution';   proc univariate data=assembly noprint;   histogram Offset / annotate = anno   cbarline = black   cfill    = ligr;   run;  

Output 3.25.4 displays the histogram and fitted curve:

Output 3.25.4: Histogram with Annotated Folded Normal Curve
start example
  click to expand  
end example
 

A sample program, uniex15.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.26. Creating Lognormal Probability Plots

This example is a continuation of the example explored in the section Modeling a Data Distribution on page 200.

In the normal probability plot shown in Figure 3.6, the nonlinearity of the point pattern indicates a departure from normality in the distribution of Deviation . Since the point pattern is curved with slope increasing from left to right, a theoretical distribution that is skewed to the right, such as a lognormal distribution, should provide a better fit than the normal distribution. See the section Interpretation of Quantile-Quantile and Probability Plots on page 299.

You can explore the possibility of a lognormal fit with a lognormal probability plot. When you request such a plot, you must specify the shape parameter ƒ for the lognormal distribution. This value must be positive, and typical values of ƒ range from 0.1 to 1.0. You can specify values for ƒ with the SIGMA= secondary option in the LOGNORMAL primary option, or you can specify that ƒ is to be estimated from the data.

The following statements illustrate the first approach by creating a series of three lognormal probability plots for the variable Deviation introduced in the section Modeling a Data Distribution on page 200:

  symbol v=plus height=3.5pct;   title Lognormal Probability Plot for Position Deviations;   proc univariate data=Aircraft noprint;   probplot Deviation /   lognormal(theta=est zeta=est sigma=0.7 0.9 1.1)   href  = 95   lhref = 1   square;   run;  

The LOGNORMAL primary option requests plots based on the lognormal family of distributions, and the SIGMA= secondary option requests plots for ƒ equal to 0.7, 0.9, and 1.1. These plots are displayed in Output 3.26.1, Output 3.26.2, and Output 3.26.3, respectively. Alternatively, you can specify ƒ to be estimated using the sample standard deviation by using the option SIGMA=EST.

Output 3.26.1: Probability Plot Based on Lognormal Distribution with ƒ =0.7
start example
  click to expand  
end example
 
Output 3.26.2: Probability Plot Based on Lognormal Distribution with ƒ =0.9
start example
  click to expand  
end example
 
Output 3.26.3: Probability Plot Based on Lognormal Distribution with ƒ =1.1
start example
  click to expand  
end example
 

The SQUARE option displays the probability plot in a square format, the HREF= option requests a reference line at the 95th percentile, and the LHREF= option specifies the line type for the reference line.

The value ƒ = 0 . 9 in Output 3.26.2 most nearly linearizes the point pattern. The 95th percentile of the position deviation distribution seen in Output 3.26.2 is approximately 0.001, since this is the value corresponding to the intersection of the point pattern with the reference line.

Note: Once the ƒ that produces the most linear fit is found, you can then estimate the threshold parameter and the scale parameter . See Example 3.31.

The following statements illustrate how you can create a lognormal probability plot for Deviation using a local maximum likelihood estimate for ƒ .

  symbol v=plus height=3.5pct;   title 'Lognormal Probability Plot for Position Deviations';   proc univariate data=Aircraft noprint;   probplot Deviation / lognormal(theta=est zeta=est sigma=est)   href   = 95   lhref  = 1   square;   run;  

The plot is displayed in Output 3.26.4. Note that the maximum likelihood estimate of ƒ (in this case 0.882) does not necessarily produce the most linear point pattern.

Output 3.26.4: Probability Plot Based on Lognormal Distribution with Estimated ƒ
start example
  click to expand  
end example
 

A sample program, uniex16.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.27. Creating a Histogram to Display Lognormal Fit

This example uses the data set Aircraft from the previous example to illustrate how to display a lognormal fit with a histogram. To determine whether the lognormal distribution is an appropriate model for a distribution, you should consider the graphical fit as well as conduct goodness-of-fit tests.

The following statements fit a lognormal distribution and display the density curve on a histogram:

  title 'Distribution of Position Deviations';   ods select Lognormal.ParameterEstimates Lognormal.GoodnessOfFit MyPlot;   proc univariate data=Aircraft;   var Deviation;   histogram / lognormal(w=3 theta=est)   vaxis = axis1   name  = 'MyPlot';   inset n mean (5.3) std='Std Dev' (5.3) skewness (5.3) /   pos = ne header = 'Summary Statistics';   axis1 label=(a=90 r=0);   run;  

The ODS SELECT statement restricts the output to the ParameterEstimates and GoodnessOfFit tables; see the section ODS Table Names on page 309. The LOGNORMAL primary option superimposes a fitted curve on the histogram in Output 3.27.1. The W= option specifies the line width for the curve. The INSET statement specifies that the mean, standard deviation, and skewness be displayed in an inset in the northeast corner of the plot. Note that the default value of the threshold parameter is zero. In applications where the threshold is not zero, you can specify with the THETA= option. The variable Deviation includes values that are less than the default threshold; therefore, the option THETA= EST is used.

Output 3.27.1: Normal Probability Plot Created with Graphics Device
start example
  click to expand  
end example
 

Output 3.27.2 provides three EDF goodness-of-fit tests for the lognormal distribution: the Anderson-Darling, the Cram r-von Mises, and the Kolmogorov-Smirnov tests. The null hypothesis for the three tests is that a lognormal distribution holds for the sample data.

Output 3.27.2: Summary of Fitted Lognormal Distribution
start example
  Distribution of Position Deviations   Fitted Distribution for Deviation   Parameters for Lognormal Distribution   Parameter   Symbol   Estimate   Threshold   Theta    -0.00834   Scale       Zeta     -6.14382   Shape       Sigma    0.882225   Mean                 -0.00517   Std Dev              0.003438   Goodness-of-Fit Tests for Lognormal Distribution   Test                  ---Statistic----   -----p Value-----   Kolmogorov-Smirnov    D    0.09419634    Pr > D     >0.500   Cramer-von Mises      W-Sq 0.02919815    Pr > W-Sq  >0.500   Anderson-Darling      A-Sq 0.21606642    Pr > A-Sq  >0.500  
end example
 

The p -values for all three tests are greater than 0.5, so the null hypothesis is not rejected. The tests support the conclusion that the two-parameter lognormal distribution with scale parameter = ˆ’ 6 . 14, and shape parameter = 0 . 88 provides a good model for the distribution of position deviations. For further discussion of goodness-of-fit interpretation, see the section Goodness-of-Fit Tests on page 292.

A sample program, uniex16.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.28. Creating a Normal Quantile Plot

This example illustrates how to create a normal quantile plot. An engineer is analyzing the distribution of distances between holes cut in steel sheets. The following statements save measurements of the distance between two holes cut into 50 steel sheets as values of the variable Distance in the data set Sheets :

  data Sheets;   input Distance @@;   label Distance = 'Hole Distance (cm)';   datalines;   9.80 10.20 10.27  9.70  9.76   10.11 10.24 10.20 10.24  9.63   9.99  9.78 10.10 10.21 10.00   9.96  9.79 10.08  9.79 10.06   10.10  9.95  9.84 10.11  9.93   10.56 10.47  9.42 10.44 10.16   10.11 10.36  9.94  9.77  9.36   9.89  9.62 10.05  9.72  9.82   9.99 10.16 10.58 10.70  9.54   10.31 10.07 10.33  9.98 10.15   ;   run;  

The engineer decides to check whether the distribution of distances is normal. The following statements create a Q-Q plot for Distance , shown in Output 3.28.1:

  symbol v=plus;   title 'Normal Quantile-Quantile Plot for Hole Distance';   proc univariate data=Sheets noprint;   qqplot Distance;   run;  
Output 3.28.1: Normal Quantile-Quantile Plot for Distance
start example
  click to expand  
end example
 

The plot compares the ordered values of Distance with quantiles of the normal distribution. The linearity of the point pattern indicates that the measurements are normally distributed. Note that a normal Q-Q plot is created by default.

A sample program, uniex17.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.29. Adding a Distribution Reference Line

This example, which is a continuation of Example 3.28, illustrates how to add a reference line to a normal Q-Q plot, which represents the normal distribution with mean µ and standard deviation ƒ . The following statements reproduce the Q-Q plot in Output 3.28.1 and add the reference line:

  symbol v=plus;   title 'Normal Quantile-Quantile Plot for Hole Distance';   proc univariate data=Sheets noprint;   qqplot Distance / normal(mu=est sigma=est color=red   l=2 noprint)   square;   run;  

The plot is displayed in Output 3.29.1.

Output 3.29.1: Adding a Distribution Reference Line to a Q-Q Plot
start example
  click to expand  
end example
 

Specifying MU=EST and SIGMA=EST with the NORMAL primary option requests the reference line for which µ and ƒ are estimated by the sample mean and standard deviation. Alternatively, you can specify numeric values for µ and ƒ with the MU= and SIGMA= secondary options. The COLOR= and L= options specify the color and type of the line, and the SQUARE option displays the plot in a square format. The NOPRINT options in the PROC UNIVARIATE statement and after the NORMAL option suppress all the tables of statistical output produced by default.

The data clearly follow the line, which indicates that the distribution of the distances is normal.

A sample program, uniex17.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.30. Interpreting a Normal Quantile Plot

This example illustrates how to interpret a normal quantile plot when the data are from a non-normal distribution. The following statements create the data set Measures , which contains the measurements of the diameters of 50 steel rods in the variable Diameter :

  data Measures;   input Diameter @@;   label Diameter = 'Diameter (mm)';   datalines;   5.501  5.251  5.404  5.366  5.445  5.576  5.607   5.200  5.977  5.177  5.332  5.399  5.661  5.512   5.252  5.404  5.739  5.525  5.160  5.410  5.823   5.376  5.202  5.470  5.410  5.394  5.146  5.244   5.309  5.480  5.388  5.399  5.360  5.368  5.394   5.248  5.409  5.304  6.239  5.781  5.247  5.907   5.208  5.143  5.304  5.603  5.164  5.209  5.475   5.223   ;   run;  

The following statements request the normal Q-Q plot in Output 3.30.1:

  symbol v=plus;   title Normal Q-Q Plot for Diameters;   proc univariate data=Measures noprint;   qqplot Diameter / normal(noprint)   square   vaxis=axis1;   axis1 label=(a=90 r=0);   run;  
Output 3.30.1: Normal Quantile-Quantile Plot of Nonnormal Data
start example
  click to expand  
end example
 

The nonlinearity of the points in Output 3.30.1 indicates a departure from normality. Since the point pattern is curved with slope increasing from left to right, a theoretical distribution that is skewed to the right, such as a lognormal distribution, should provide a better fit than the normal distribution. The mild curvature suggests that you should examine the data with a series of lognormal Q-Q plots for small values of the shape parameter ƒ , as illustrated in Example 3.31. For details on interpreting a Q-Q plot, see the section Interpretation of Quantile-Quantile and Probability Plots on page 299.

A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.31. Estimating Three Parameters from Lognormal Quantile Plots

This example, which is a continuation of Example 3.30, demonstrates techniques for estimating the shape, location, and scale parameters, and the theoretical percentiles for a three-parameter lognormal distribution.

The three-parameter lognormal distribution depends on a threshold parameter , a scale parameter , and a shape parameter ƒ . You can estimate ƒ from a series of lognormal Q-Q plots which use the SIGMA= secondary option to specify different values of ƒ ; the estimate of ƒ is the value that linearizes the point pattern. You can then estimate the threshold and scale parameters from the intercept and slope of the point pattern. The following statements create the series of plots in Output 3.31.1, Output 3.31.2, and Output 3.31.3 for ƒ values of 0.2, 0.5, and 0.8, respectively:

  symbol v=plus;   title 'Lognormal Q-Q Plot for Diameters';   proc univariate data=Measures noprint;   qqplot Diameter / lognormal(sigma=0.2 0.5 0.8 noprint)   square;   run;  
Output 3.31.1: Lognormal Quantile-Quantile Plot ( ƒ =0.2)
start example
  click to expand  
end example
 
Output 3.31.2: Lognormal Quantile-Quantile Plot ( ƒ =0.5)
start example
  click to expand  
end example
 
Output 3.31.3: Lognormal Quantile-Quantile Plot ( ƒ =0.8)
start example
  click to expand  
end example
 

Note: You must specify a value for the shape parameter ƒ for a lognormal Q-Q plot with the SIGMA= option or its alias, the SHAPE= option.

The plot Output 3.31.2 displays the most linear point pattern, indicating that the lognormal distribution with ƒ = 0 . 5 provides a reasonable fit for the data distribution.

Data with this particular lognormal distribution have the following density function:

click to expand

The points in the plot fall on or near the line with intercept and slope exp( ). Based on Output 3.31.2, ‰ˆ 5 and click to expand , giving ‰ˆ log(0 . 4) ‰ˆ ˆ’ . 92.

You can also request a reference line using the SIGMA=, THETA=, and ZETA= options together. The following statements produce the lognormal Q-Q plot in Output 3.31.4:

  symbol v=plus;   title 'Lognormal Q-Q Plot for Diameters';   proc univariate data=Measures noprint;   qqplot Diameter / lognormal(theta=5 zeta=est sigma=est   color=black l=2 noprint)   square;   run;  
Output 3.31.4: Lognormal Quantile-Quantile Plot ( ƒ =est, =est, =5)
start example
  click to expand  
end example
 

Output 3.31.1 through Output 3.31.3 show that the threshold parameter is not equal to zero. Specifying THETA=5 overrides the default value of zero. The SIGMA=EST and ZETA=EST secondary options request estimates for ƒ and exp using the sample mean and standard deviation.

From the plot in Output 3.31.2, ƒ can be estimated as 0.51, which is consistent with the estimate of 0.5 derived from the plot in Output 3.31.2. The next example illustrates how to estimate percentiles using lognormal Q-Q plots.

A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.32. Estimating Percentiles from Lognormal Quantile Plots

This example, which is a continuation of the previous example, shows how to use a Q-Q plot to estimate percentiles such as the 95th percentile of the lognormal distribution. A probability plot can also be used for this purpose, as illustrated in Example 3.26.

The point pattern in Output 3.31.4 has a slope of approximately 0.39 and an intercept of 5. The following statements reproduce this plot, adding a lognormal reference line with this slope and intercept:

  symbol v=plus;   title 'Lognormal Q-Q Plot for Diameters';   proc univariate data=Measures noprint;   qqplot Diameter / lognormal(sigma=0.5 theta=5 slope=0.39 noprint)   pctlaxis(grid)   vref = 5.8 5.9 6.0   square;   run;  

The result is shown in Output 3.32.1:

Output 3.32.1: Lognormal Q-Q Plot Identifying Percentiles
start example
  click to expand  
end example
 

The PCTLAXIS option labels the major percentiles, and the GRID option draws percentile axis reference lines. The 95th percentile is 5.9, since the intersection of the distribution reference line and the 95th reference line occurs at this value on the vertical axis.

Alternatively, you can compute this percentile from the estimated lognormal parameters. The ± th percentile of the lognormal distribution is

P ± = exp( ƒ ˆ’ 1 ( ± ) + ) +

where ˆ’ 1 (·) is the inverse cumulative standard normal distribution. Consequently,

click to expand

A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.

Example 3.33. Estimating Parameters from Lognormal Quantile Plots

This example, which is a continuation of Example 3.31, demonstrates techniques for estimating the shape, location, and scale parameters, and the theoretical percentiles for a two-parameter lognormal distribution.

If the threshold parameter is known, you can construct a two-parameter lognormal Q-Q plot by subtracting the threshold from the data values and making a normal Q-Q plot of the log-transformed differences, as illustrated in the following statements:

  data ModifiedMeasures;   set Measures;   LogDiameter = log(Diameter-5);   label LogDiameter = log(Diameter-5);   run;   symbol v=plus;   title 'Two-Parameter Lognormal Q-Q Plot for Diameters';   proc univariate data=ModifiedMeasures noprint;   qqplot LogDiameter / normal(mu=est sigma=est noprint)   square   vaxis=axis1;   inset n mean (5.3) std (5.3)   / pos = nw header = 'Summary Statistics';   axis1 label=(a=90 r=0);   run;  

Because the point pattern in Output 3.33.1 is linear, you can estimate the lognormal parameters and ƒ as the normal plot estimates of µ and ƒ , which are ˆ’ 0.99 and 0.51. These values correspond to the previous estimates of ˆ’ 0.92 for and 0.5 for ƒ from Example 3.31. A sample program, uniex18.sas , for this example is available in the SAS Sample Library for Base SAS software.

Output 3.33.1: Two-Parameter Lognormal Q-Q Plot for Diameters
start example
  click to expand  
end example
 

Example 3.34. Comparing Weibull Quantile Plots

This example compares the use of three-parameter and two-parameter Weibull Q-Q plots for the failure times in months for 48 integrated circuits. The times are assumed to follow a Weibull distribution. The following statements save the failure times as the values of the variable Time in the data set Failures :

  data Failures;   input Time @@;   label Time = 'Time in Months';   datalines;   29.42 32.14 30.58 27.50 26.08 29.06 25.10 31.34   29.14 33.96 30.64 27.32 29.86 26.28 29.68 33.76   29.32 30.82 27.26 27.92 30.92 24.64 32.90 35.46   30.28 28.36 25.86 31.36 25.26 36.32 28.58 28.88   26.72 27.42 29.02 27.54 31.60 33.46 26.78 27.82   29.18 27.94 27.66 26.42 31.00 26.64 31.44 32.52   ;   run;  

If no assumption is made about the parameters of this distribution, you can use the WEIBULL option to request a three-parameter Weibull plot. As in the previous example, you can visually estimate the shape parameter c by requesting plots for different values of c and choosing the value of c that linearizes the point pattern. Alternatively, you can request a maximum likelihood estimate for c , as illustrated in the following statements:

  symbol v=plus;   title 'Three-Parameter Weibull Q-Q Plot for Failure Times';   proc univariate data=Failures noprint;   qqplot Time / weibull(c=est theta=est sigma=est noprint)   square   href=0.5 1 1.5 2   vref=25 27.5 30 32.5 35   lhref=4 lvref=4   chref=tan cvref=tan;   run;  

Note: When using the WEIBULL option, you must either specify a list of values for the Weibull shape parameter c with the C= option, or you must specify C=EST.

Output 3.34.1 displays the plot for the estimated value ‰ = 1 . 99. The reference line corresponds to the estimated values for the threshold and scale parameters of = 24 . 19 and = 5 . 83, respectively.

Output 3.34.1: Three-Parameter Weibull Q-Q Plot
start example
  click to expand  
end example
 

Now, suppose it is known that the circuit lifetime is at least 24 months. The following statements use the known threshold value = 24 to produce the two-parameter Weibull Q-Q plot shown in Output 3.31.4:

  symbol v=plus;   title 'Two-Parameter Weibull Q-Q Plot for Failure Times';   proc univariate data=Failures noprint;   qqplot Time / weibull(theta=24 c=est sigma=est noprint)   square   vref= 25 to 35 by 2.5   href= 0.5 to 2.0 by 0.5   lhref=4 lvref=4   chref=tan cvref=tan;   run;  

The reference line is based on maximum likelihood estimates = 2 . 08 and = 6 . 05.

Output 3.34.2: Two-Parameter Weibull Q-Q Plot for = 24
start example
  click to expand  
end example
 

A sample program, uniex19.sas , for this example is available in the SAS Sample Library for Base SAS software.




Base SAS 9.1.3 Procedures Guide (Vol. 3)
Base SAS 9.1 Procedures Guide, Volumes 1, 2, 3 and 4
ISBN: 1590472047
EAN: 2147483647
Year: 2004
Pages: 74

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net