Details


Summary Statistics Represented by Box Plots

Table 18.6 lists the summary statistics represented in each box-and-whisker plot.

Table 18.6: Summary Statistics Represented by Box Plots

Group Summary Statistic             Feature of Box-and-Whisker Plot
----------------------------------  -------------------------------
Maximum                             Endpoint of upper whisker
Third quartile (75th percentile)    Upper edge of box
Median (50th percentile)            Line inside box
Mean                                Symbol marker
First quartile (25th percentile)    Lower edge of box
Minimum                             Endpoint of lower whisker

Note that you can request different box plot styles, as discussed in the section 'Styles of Box Plots' on page 522, and as illustrated in Example 18.2 on page 538.

Output Data Sets

OUTBOX= Data Set

The OUTBOX= data set saves group summary statistics and outlier values. The following variables can be saved:

  • the group variable

  • the variable _VAR_, containing the analysis variable name

  • the variable _TYPE_, identifying features of box-and-whisker plots

  • the variable _VALUE_, containing values of box-and-whisker plot features

  • the variable _ID_, containing labels for outliers

  • the variable _HTML_, containing URLs associated with plot features

_ID_ is included in the OUTBOX= data set only if one of the keywords SCHEMATICID or SCHEMATICIDFAR is specified with the BOXSTYLE= option. _HTML_ is present only if one or more of the HTML=, OUTHIGHHTML=, or OUTLOWHTML= options are specified.

Each observation in an OUTBOX= data set records the value of a single feature of one group's box-and-whisker plot, such as its mean. The _TYPE_ variable identifies the feature whose value is recorded in _VALUE_. The following table lists valid _TYPE_ variable values:

Table 18.7: Valid _TYPE_ Values in an OUTBOX= Data Set

_TYPE_ Value    Description
------------    -----------------------------------------
N               group size
MIN             minimum group value
Q1              group first quartile
MEDIAN          group median
MEAN            group mean
Q3              group third quartile
MAX             group maximum value
LOW             low outlier value
HIGH            high outlier value
LOWHISKR        low whisker value, if different from MIN
HIWHISKR        high whisker value, if different from MAX
FARLOW          low far outlier value
FARHIGH         high far outlier value

Additionally, the following variables, if specified, are included:

  • block-variables

  • symbol-variable

  • BY variables

  • ID variables

OUTHISTORY= Data Set

The OUTHISTORY= data set saves group summary statistics. The following variables are saved:

  • the group variable

  • group minimum variables named by analysis-variable suffixed with L

  • group first-quartile variables named by analysis-variable suffixed with 1

  • group mean variables named by analysis-variable suffixed with X

  • group median variables named by analysis-variable suffixed with M

  • group third-quartile variables named by analysis-variable suffixed with 3

  • group maximum variables named by analysis-variable suffixed with H

  • group size variables named by analysis-variable suffixed with N

Subgroup summary variables are created for each analysis-variable specified in the PLOT statement. For example, consider the following statements:

  proc boxplot data=steel;
     plot (width diameter)*lot / outhistory=summary;
  run;

The data set SUMMARY contains variables named LOT, WIDTHL, WIDTH1, WIDTHM, WIDTHX, WIDTH3, WIDTHH, WIDTHN, DIAMTERL, DIAMTER1, DIAMTERM, DIAMTERX, DIAMTER3, DIAMTERH, and DIAMTERN.

Given an analysis variable name that contains the maximum of 32 characters, the procedure first shortens the name to its first 16 characters and its last 15 characters, and then it adds the suffix.
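The naming rule can be sketched outside SAS. The following Python fragment is only an illustration, not procedure code: the function name summary_names is hypothetical, and it applies just the two rules stated in this section (append the suffix; for a 32-character name, keep the first 16 and the last 15 characters before appending). How the procedure treats other long names is not modeled here.

```python
# Suffix characters in the order used by the variable listing in the text.
SUFFIXES = ["L", "1", "M", "X", "3", "H", "N"]

def summary_names(analysis_variable):
    """Return the seven summary variable names for one analysis variable.

    Assumption: only the 32-character shortening rule from the text is
    applied; any other name is suffixed unchanged.
    """
    name = analysis_variable.upper()
    if len(name) == 32:
        # Keep the first 16 and the last 15 characters (31 in total),
        # which leaves room for the one-character suffix.
        name = name[:16] + name[-15:]
    return [name + suffix for suffix in SUFFIXES]

print(summary_names("width"))
```

For the analysis variable WIDTH this lists WIDTHL, WIDTH1, WIDTHM, WIDTHX, WIDTH3, WIDTHH, and WIDTHN, matching the SUMMARY variables named in the text.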

Additionally, the following variables, if specified, are included:

  • BY variables

  • block-variables

  • symbol-variable

  • ID variables

Note that an OUTHISTORY= data set does not contain outlier values, and therefore cannot be used, in general, to save a schematic box plot. You can use an OUTBOX= data set to save a schematic box plot summary.

Input Data Sets

DATA= Data Set

You can read data (analysis variable measurements) from a data set specified with the DATA= option in the PROC BOXPLOT statement. Each analysis variable specified in the PLOT statement must be a SAS variable in the data set. This variable provides measurements that are organized into groups indexed by the group variable. The group variable, specified in the PLOT statement, must also be a SAS variable in the DATA= data set. Each observation in a DATA= data set must contain a value for each analysis variable and a value for the group variable. If the $i$th group contains $n_i$ measurements, there should be $n_i$ consecutive observations for which the value of the group variable is the index of the $i$th group. For example, if each group contains 20 items and there are 30 groups, the DATA= data set should contain 600 observations. Other variables that can be read from a DATA= data set include

  • block-variables

  • symbol-variable

  • BY variables

  • ID variables
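The requirement that each group occupy consecutive observations can be checked before a data set is passed to the procedure. The following Python sketch is an illustration only; the helper names group_runs and groups_are_consecutive are hypothetical, and the group values are assumed to be available as a plain list in data set order.

```python
from itertools import groupby

def group_runs(group_values):
    """Return (group value, run length) for each run of equal consecutive values."""
    return [(key, len(list(run))) for key, run in groupby(group_values)]

def groups_are_consecutive(group_values):
    """True if every group value appears in exactly one run of observations."""
    keys = [key for key, _ in group_runs(group_values)]
    return len(keys) == len(set(keys))

# 30 groups of 20 items each, as in the example in the text.
values = [group for group in range(1, 31) for _ in range(20)]
print(len(values), groups_are_consecutive(values))
```

With 30 groups of 20 items the list holds 600 observations and the check succeeds; a sequence such as 1, 1, 2, 1 fails because group 1 is split across two runs.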

BOX= Data Set

You can read group summary statistics and outlier information from a BOX= data set specified in the PROC BOXPLOT statement. This allows you to reuse OUTBOX= data sets that have been created in previous runs of the BOXPLOT procedure to reproduce schematic box plots.

A BOX= data set must contain the following variables:

  • the group variable

  • _VAR_, containing the analysis variable name

  • _TYPE_, identifying features of box-and-whisker plots

  • _VALUE_, containing values of those features

Each observation in a BOX= data set records the value of a single feature of one group's box-and-whisker plot, such as its mean. The _TYPE_ variable identifies the feature whose value is recorded in a given observation. The following table lists valid _TYPE_ variable values:

Table 18.8: Valid _TYPE_ Values in a BOX= Data Set

_TYPE_ Value    Description
------------    -----------------------------------------
N               group size
MIN             minimum group value
Q1              group first quartile
MEDIAN          group median
MEAN            group mean
Q3              group third quartile
MAX             group maximum value
LOW             low outlier value
HIGH            high outlier value
LOWHISKR        low whisker value, if different from MIN
HIWHISKR        high whisker value, if different from MAX
FARLOW          low far outlier value
FARHIGH         high far outlier value

The features identified by _TYPE_ values N, MIN, Q1, MEDIAN, MEAN, Q3, and MAX are required for each group.
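Because a BOX= data set stores one feature per observation, a quick completeness check is to collect the _TYPE_ values seen for each group and compare them with the required set. The Python sketch below is an illustration under stated assumptions: records are represented as dictionaries, the key "group" is a hypothetical stand-in for the group variable, and the function name missing_required_types is invented for this example.

```python
REQUIRED_TYPES = {"N", "MIN", "Q1", "MEDIAN", "MEAN", "Q3", "MAX"}

def missing_required_types(records):
    """Map (group, analysis variable) to the required _TYPE_ values it lacks.

    records: iterable of dicts with keys "group", "_VAR_", "_TYPE_", "_VALUE_".
    Groups that have every required feature are omitted from the result.
    """
    seen = {}
    for record in records:
        key = (record["group"], record["_VAR_"])
        seen.setdefault(key, set()).add(record["_TYPE_"])
    return {
        key: sorted(REQUIRED_TYPES - types)
        for key, types in seen.items()
        if REQUIRED_TYPES - types
    }

# Hypothetical records: group 1 is complete, group 2 lacks MEAN and N.
values_1 = {"N": 4, "MIN": 4.16, "Q1": 4.24, "MEDIAN": 4.38,
            "MEAN": 4.3675, "Q3": 4.495, "MAX": 4.55}
records = [{"group": 1, "_VAR_": "diam", "_TYPE_": t, "_VALUE_": v}
           for t, v in values_1.items()]
records += [{"group": 2, "_VAR_": "diam", "_TYPE_": t, "_VALUE_": 4.4}
            for t in ("MIN", "Q1", "MEDIAN", "Q3", "MAX", "HIGH")]
print(missing_required_types(records))
```

Optional features such as HIGH or LOWHISKR do not affect the check; only the seven required _TYPE_ values are compared.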

Other variables that can be read from a BOX= data set include:

  • the variable _ID_, containing labels for outliers

  • the variable _HTML_, containing URLs to be associated with features on box plots

  • block-variables

  • symbol-variable

  • BY variables

  • ID variables

When you specify one of the keywords SCHEMATICID or SCHEMATICIDFAR with the BOXSTYLE= option, values of _ID_ are used as outlier labels. If _ID_ does not exist in the BOX= data set, the values of the first variable listed in the ID statement are used.

HISTORY= Data Set

You can read group summary statistics from a HISTORY= data set specified in the PROC BOXPLOT statement. This allows you to reuse OUTHISTORY= data sets that have been created in previous runs of the BOXPLOT procedure or to read output data sets created with SAS summarization procedures, such as PROC UNIVARIATE.

Note that a HISTORY= data set does not contain outlier information. Therefore, in general you cannot reproduce a schematic box plot from summary statistics saved in an OUTHISTORY= data set. To save and reproduce schematic box plots, use OUTBOX= and BOX= data sets.

A HISTORY= data set must contain the following:

  • the group variable

  • a group minimum variable for each analysis variable

  • a group first-quartile variable for each analysis variable

  • a group median variable for each analysis variable

  • a group mean variable for each analysis variable

  • a group third-quartile variable for each analysis variable

  • a group maximum variable for each analysis variable

  • a group size variable for each analysis variable

The names of the group summary statistics variables must be the analysis variable name concatenated with the following special suffix characters:

Group Summary Statistic    Suffix Character
-----------------------    ----------------
group minimum              L
group first-quartile       1
group median               M
group mean                 X
group third-quartile       3
group maximum              H
group size                 N

For example, consider the following statements:

  proc boxplot history=summary;
     plot (weight yldstren) * batch;
  run;

The data set SUMMARY must include the variables BATCH, WEIGHTL, WEIGHT1, WEIGHTM, WEIGHTX, WEIGHT3, WEIGHTH, WEIGHTN, YLDSRENL, YLDSREN1, YLDSRENM, YLDSRENX, YLDSREN3, YLDSRENH, and YLDSRENN.

Note that if you specify an analysis variable name that contains 32 characters, the names of the summary variables must be formed from the first 16 characters and the last 15 characters of the analysis variable name, suffixed with the appropriate character.

Other variables that can be read from a HISTORY= data set include

  • block-variables

  • symbol-variable

  • BY variables

  • ID variables

Styles of Box Plots

A box-and-whisker plot is displayed for the measurements in each group on the box plot. The skeletal style of the box-and-whisker plot shown in Figure 18.3 is the default. Figure 18.5 illustrates a typical schematic box plot and the locations of the fences (which are not displayed in actual output). See the description of the BOXSTYLE= option on page 493 for complete details.

Figure 18.5: BOXSTYLE= SCHEMATIC

You can draw connecting lines between adjacent box-and-whisker plots using the BOXCONNECT= keyword option. For example, BOXCONNECT=MEAN connects the points representing the means of adjacent groups. Other available keywords are MIN, Q1, MEDIAN, Q3, and MAX. Specifying BOXCONNECT without a keyword is equivalent to specifying BOXCONNECT=MEAN. You can specify the color for the connecting lines with the CCONNECT= option.

Percentile Definitions

You can use the PCTLDEF= option to specify one of five definitions for computing quantile statistics (percentiles). Suppose that $n$ equals the number of nonmissing values for a variable and that $x_1, x_2, \ldots, x_n$ represents the ordered values of the analysis variable. For the $t$th percentile, set $p = t/100$.

For the following definitions numbered 1, 2, 3, and 5, express $np$ as

$np = j + g$

where $j$ is the integer part of $np$, and $g$ is the fractional part of $np$. For definition 4, let

$(n + 1)p = j + g$

The $t$th percentile (call it $y$) can be defined as follows:

PCTLDEF=1

weighted average at $x_{np}$

$y = (1 - g)x_j + g x_{j+1}$

where $x_0$ is taken to be $x_1$

PCTLDEF=2

observation numbered closest to $np$

$y = x_i$

where $i$ is the integer part of $np + 1/2$ if $g \ne 1/2$. If $g = 1/2$, then $y = x_j$ if $j$ is even, or $y = x_{j+1}$ if $j$ is odd.

PCTLDEF=3

empirical distribution function

$y = x_j$ if $g = 0$

$y = x_{j+1}$ if $g > 0$

PCTLDEF=4

weighted average aimed at $x_{p(n+1)}$

$y = (1 - g)x_j + g x_{j+1}$

where $x_{n+1}$ is taken to be $x_n$

PCTLDEF=5

empirical distribution function with averaging

$y = (x_j + x_{j+1})/2$ if $g = 0$

$y = x_{j+1}$ if $g > 0$
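To compare the five definitions on a small sample, they can be written out as a short function. The Python sketch below is an illustration, not the procedure's implementation: the function name percentile is hypothetical, the rules are restated in the comments, and exact fractions are used so that the tests on the fractional part g are not disturbed by floating-point rounding. Out-of-range indices are clamped, which covers the conventions that x_0 is taken to be x_1 and x_{n+1} is taken to be x_n.

```python
from fractions import Fraction
import math

def percentile(values, t, pctldef=5):
    """Return the t-th percentile of values under definition 1, 2, 3, 4, or 5."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one nonmissing value")
    p = Fraction(t) / 100
    # Definition 4 splits (n + 1)p into j + g; the others split np.
    target = (n + 1) * p if pctldef == 4 else n * p
    j = math.floor(target)   # integer part
    g = target - j           # fractional part, kept exact

    def x_at(k):
        # 1-based lookup; indices below 1 or above n are clamped.
        return x[min(max(k, 1), n) - 1]

    if pctldef in (1, 4):
        # Weighted average: y = (1 - g) x_j + g x_{j+1}
        y = (1 - g) * x_at(j) + g * x_at(j + 1)
    elif pctldef == 2:
        # Observation numbered closest to np, with ties resolved by parity of j.
        if g == Fraction(1, 2):
            y = x_at(j) if j % 2 == 0 else x_at(j + 1)
        else:
            y = x_at(math.floor(target + Fraction(1, 2)))
    elif pctldef == 3:
        # Empirical distribution function.
        y = x_at(j) if g == 0 else x_at(j + 1)
    elif pctldef == 5:
        # Empirical distribution function with averaging.
        y = (x_at(j) + x_at(j + 1)) / 2 if g == 0 else x_at(j + 1)
    else:
        raise ValueError("pctldef must be 1, 2, 3, 4, or 5")
    return float(y)

data = [1, 2, 3, 4, 5, 6, 7, 8]
print([percentile(data, 25, d) for d in (1, 2, 3, 4, 5)])
```

For the eight values 1 through 8 and t = 25, definitions 1 through 5 give 2, 2, 2, 2.25, and 2.5 respectively, which shows how the choice of definition can move a quartile even on a small sample.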

Missing Values

An observation read from a DATA= data set is not analyzed if the value of the group variable is missing. For a particular analysis variable, an observation read from a DATA= data set is not analyzed if the value of the analysis variable is missing.

Missing values of analysis variables generally lead to unequal group sizes.
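The effect on group sizes can be seen in a small sketch. The Python fragment below is an illustration only: None stands in for a SAS missing value, the rows are made-up values rather than data from this chapter, and the function name analyzed_group_sizes is hypothetical.

```python
from collections import Counter

def analyzed_group_sizes(rows, group_var, analysis_var):
    """Count, per group, the observations that would be analyzed.

    An observation is dropped if the group variable or the analysis
    variable is missing (represented here by None).
    """
    kept = [
        row for row in rows
        if row[group_var] is not None and row[analysis_var] is not None
    ]
    return Counter(row[group_var] for row in kept)

# Hypothetical observations: two per day recorded, some with missing values.
rows = [
    {"day": 1, "kwatts": 3196},
    {"day": 1, "kwatts": None},   # missing analysis value: not analyzed
    {"day": None, "kwatts": 3300},  # missing group value: not analyzed
    {"day": 2, "kwatts": 3250},
    {"day": 2, "kwatts": 3310},
]
sizes = analyzed_group_sizes(rows, "day", "kwatts")
print(dict(sizes))
```

Here day 1 ends up with one analyzed observation and day 2 with two, even though the same number of measurements was recorded for each day.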

Continuous Group Variables

By default, the PLOT statement treats numerical group variable values as discrete values and spaces the boxes evenly on the plot. The following statements produce the plot shown in Figure 18.6:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
  run;

The labels on the horizontal axis in Figure 18.6 do not represent 10 consecutive days, but the box-and-whisker plots are evenly spaced.

Figure 18.6: Box Plot with Discrete Group Variable

In order to treat the group variable as continuous, you can specify the CONTINUOUS or HAXIS= option. Either option produces a box plot with a horizontal axis scaled for continuous group variable values.

The following statements produce the plot shown in Figure 18.7. Note that the values on the horizontal axis represent consecutive days. (The TURNHLABEL option orients the horizontal axis labels vertically so there is room to display them all.) Box-and-whisker plots are not produced for days when no turbine data was collected.

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day / turnhlabel
                       continuous;
  run;

Figure 18.7: Box Plot with Continuous Group Variable

Positioning Insets

This section provides details on three different methods of positioning INSET boxes using the POSITION= option. With the POSITION= option, you can specify

  • compass points

  • keywords for margin positions

  • coordinates in data units or percent axis units

Positioning the Inset Using Compass Points

You can specify the eight compass points N, NE, E, SE, S, SW, W, and NW as keywords for the POSITION= option. The following statements create the display in Figure 18.8, which demonstrates all eight compass positions. The default is NW.

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nobs / height=3 cfill=blank header='NW' pos=nw;
     inset nobs / height=3 cfill=blank header='N ' pos=n ;
     inset nobs / height=3 cfill=blank header='NE' pos=ne;
     inset nobs / height=3 cfill=blank header='E ' pos=e ;
     inset nobs / height=3 cfill=blank header='SE' pos=se;
     inset nobs / height=3 cfill=blank header='S ' pos=s ;
     inset nobs / height=3 cfill=blank header='SW' pos=sw;
     inset nobs / height=3 cfill=blank header='W ' pos=w ;
  run;

Figure 18.8: Insets Positioned Using Compass Points

Positioning the Inset in the Margins

Using the INSET statement you can also position an inset in one of the four margins surrounding the plot area using the margin keywords LM, RM, TM, or BM, as illustrated in Figure 18.9.

Figure 18.9: Positioning Insets in the Margins

For an example of an inset placed in the top margin, see Figure 18.2. Margin positions are recommended if a large number of statistics are listed in the INSET statement. If you attempt to display a lengthy inset in the interior of the plot, it is likely that the inset will collide with the data display.

Positioning the Inset Using Coordinates

You can also specify the position of the inset with coordinates: POSITION= ( x, y ). The coordinates can be given in axis percent units (the default) or in axis data units.

Data Unit Coordinates

If you specify the DATA option immediately following the coordinates, the inset is positioned using axis data units. For example, the following statements place the bottom left corner of the inset at 07JUL on the horizontal axis and 3950 on the vertical axis:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nobs /
        header   = 'Position=(07JUL,3950)'
        position = ('07JUL94'd, 3950) data;
  run;

The box plot is displayed in Figure 18.10. By default, the specified coordinates determine the position of the bottom left corner of the inset. You can change this reference point with the REFPOINT= option, as in the next example.

Figure 18.10: Inset Positioned Using Data Unit Coordinates

Axis Percent Unit Coordinates

If you do not use the DATA option, the inset is positioned using axis percent units. The coordinates of the bottom left corner of the display are (0, 0), while the upper right corner is (100, 100). For example, the following statements create a box plot with two insets, both positioned using coordinates in axis percent units:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nmin / position = (5,25)
                  header   = 'Position=(5,25)'
                  height   = 3
                  cfill    = blank
                  refpoint = tl;
     inset nmax / position = (95,95)
                  header   = 'Position=(95,95)'
                  height   = 3
                  cfill    = blank
                  refpoint = tr;
  run;

The display is shown in Figure 18.11. Notice that the REFPOINT= option is used to determine which corner of the inset is to be placed at the coordinates specified with the POSITION= option. The first inset has REFPOINT=TL, so the top left corner of the inset is positioned 5% of the way across the horizontal axis and 25% of the way up the vertical axis. The second inset has REFPOINT=TR, so the top right corner of the inset is positioned 95% of the way across the horizontal axis and 95% of the way up the vertical axis. Note also that coordinates in axis percent units must be between 0 and 100.
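The interaction between POSITION= and REFPOINT= amounts to simple arithmetic on the inset rectangle. The Python sketch below illustrates that arithmetic only; the function name inset_bottom_left is hypothetical, and the inset width and height in axis percent units are made-up inputs, since the procedure determines the actual inset size from its contents.

```python
def inset_bottom_left(position, refpoint, width, height):
    """Return the bottom left corner of an inset in axis percent units.

    position: (x, y) coordinates of the reference corner, each in 0..100.
    refpoint: "BL", "BR", "TL", or "TR" (which corner sits at position).
    width, height: hypothetical inset size in axis percent units.
    """
    x, y = position
    if not (0 <= x <= 100 and 0 <= y <= 100):
        raise ValueError("axis percent coordinates must be between 0 and 100")
    corner = refpoint.upper()
    if corner not in ("BL", "BR", "TL", "TR"):
        raise ValueError("refpoint must be BL, BR, TL, or TR")
    left = x - width if corner[1] == "R" else x      # right corners shift left
    bottom = y - height if corner[0] == "T" else y   # top corners shift down
    return (left, bottom)

# The two insets from the example, with an assumed size of 20 by 10.
print(inset_bottom_left((5, 25), "TL", 20, 10))
print(inset_bottom_left((95, 95), "TR", 20, 10))
```

With the assumed 20 by 10 inset, REFPOINT=TL at (5, 25) puts the bottom left corner at (5, 15), and REFPOINT=TR at (95, 95) puts it at (75, 85), so both insets stay inside the plot area.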

click to expand
Figure 18.11: Inset Positioned Using Axis Percent Unit Coordinates

Displaying Blocks of Data

To display data organized in blocks of consecutive observations, specify one or more block-variables in parentheses after the group-variable in the PLOT statement. The block variables must be variables in the input data set. The procedure displays a legend identifying blocks of consecutive observations with identical values of the block variables. The legend displays one track of values for each block variable containing formatted values of the block variable.

The values of a block variable must be the same for all observations with the same value of the group variable. In other words, groups must be nested within blocks determined by block variables.

The following statements create a SAS data set containing diameter measurements for a part produced on three different machines:

  data Parts;
     length machine $ 4;
     input sample machine $ @;
     do i= 1 to 4;
        input diam @;
        output;
     end;
     drop i;
     datalines;
   1  A386  4.32 4.55 4.16 4.44
   2  A386  4.49 4.30 4.52 4.61
   3  A386  4.44 4.32 4.25 4.50
   4  A386  4.55 4.15 4.42 4.49
   5  A386  4.21 4.30 4.29 4.63
   6  A386  4.56 4.61 4.29 4.56
   7  A386  4.63 4.30 4.41 4.58
   8  A386  4.38 4.65 4.43 4.44
   9  A386  4.12 4.49 4.30 4.36
  10  A455  4.45 4.56 4.38 4.51
  11  A455  4.62 4.67 4.70 4.58
  12  A455  4.33 4.23 4.34 4.58
  13  A455  4.29 4.38 4.28 4.41
  14  A455  4.15 4.35 4.28 4.23
  15  A455  4.21 4.30 4.32 4.38
  16  C334  4.16 4.28 4.31 4.59
  17  C334  4.14 4.18 4.08 4.21
  18  C334  4.51 4.20 4.28 4.19
  19  C334  4.10 4.33 4.37 4.47
  20  C334  3.99 4.09 4.47 4.25
  21  C334  4.24 4.54 4.43 4.38
  22  C334  4.23 4.48 4.31 4.57
  23  C334  4.27 4.40 4.32 4.56
  24  C334  4.70 4.65 4.49 4.38
  ;

The following statements create a box plot for the data in the Parts data set grouped into blocks by the block-variable Machine . The plot is shown in Figure 18.12.

  title 'Box Plot for Diameter Grouped By Machine';
  proc boxplot data=Parts;
     plot diam*sample (machine);
     label sample  = 'Sample Number'
           machine = 'Machine'
           diam    = 'Diameter';
  run;

Figure 18.12: Box Plot Using a Block Variable

The unique consecutive values of Machine (A386, A455, and C334) are displayed in a legend above the plot. Note the LABEL statement used to provide labels for the axes and for the block legend.
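The nesting requirement and the way the legend collapses consecutive values can be sketched as follows. This Python fragment is an illustration, not the procedure's algorithm: the function name block_legend is hypothetical, and observations are represented as (group, block) pairs in data set order, mirroring the sample and machine layout of the Parts data set.

```python
from itertools import groupby

def block_legend(observations):
    """Return (block value, number of groups) for each run of consecutive groups.

    observations: list of (group, block) pairs in data set order.
    Raises ValueError if a group is paired with more than one block value,
    that is, if groups are not nested within blocks.
    """
    block_of = {}
    for group, block in observations:
        if block_of.setdefault(group, block) != block:
            raise ValueError(f"group {group!r} has more than one block value")
    # Dictionaries keep insertion order, so groups stay in data set order.
    ordered = list(block_of.items())
    return [
        (block, len(list(run)))
        for block, run in groupby(ordered, key=lambda pair: pair[1])
    ]

# Samples 1-9 on machine A386, 10-15 on A455, 16-24 on C334, four rows each.
machines = {range(1, 10): "A386", range(10, 16): "A455", range(16, 25): "C334"}
parts = [(sample, machine)
         for samples, machine in machines.items()
         for sample in samples
         for _ in range(4)]
print(block_legend(parts))
```

For this layout the legend has three entries: A386 spanning 9 samples, A455 spanning 6, and C334 spanning 9.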

By default, the block legend is placed above the plot, as in Figure 18.12. You can control the position of the legend with the BLOCKPOS= n option; see the BLOCKPOS= option on page 493.

By default, block variable values that are too long to fit into the available space in a block legend are not displayed. You can specify the BLOCKLABTYPE= option to display lengthy labels. Specify BLOCKLABTYPE=SCALED to scale down the text size of the values so they all fit. Choose BLOCKLABTYPE=TRUNCATED to truncate lengthy values. You can also use BLOCKLABTYPE= height to specify a text height in vertical percent screen units for the values.

You can control the position of legend labels with the BLOCKLABELPOS= keyword option. The valid keywords are ABOVE (the default, as shown in Figure 18.12) and LEFT.

Clipping Extreme Values

By default a box plot's vertical axis is scaled to accommodate all the values in all groups. If the variation between groups is large with respect to the variation within groups, or if some groups contain extreme outlier values, the vertical axis scale can become so large that the box-and-whisker plots are compressed. In such cases, you can clip the extreme values so that a more readable plot is displayed, as illustrated in the following example.

A company produces copper tubing. The diameter measurements (in millimeters) for 15 batches of five tubes each are provided in the data set NEWTUBES.

  data newtubes;
     label diameter='Diameter in mm';
     do batch = 1 to 15;
        do i = 1 to 5;
           input diameter @@;
           output;
        end;
     end;
     datalines;
  69.13  69.83  70.76  69.13  70.81
  85.06  82.82  84.79  84.89  86.53
  67.67  70.37  68.80  70.65  68.20
  71.71  70.46  71.43  69.53  69.28
  71.04  71.04  70.29  70.51  71.29
  69.01  68.87  69.87  70.05  69.85
  50.72  50.49  49.78  50.49  49.69
  69.28  71.80  69.80  70.99  70.50
  70.76  69.19  70.51  70.59  70.40
  70.16  70.07  71.52  70.72  70.31
  68.67  70.54  69.50  69.79  70.76
  68.78  68.55  69.72  69.62  71.53
  70.61  70.75  70.90  71.01  71.53
  74.62  56.95  72.29  82.41  57.64
  70.54  69.82  70.71  71.05  69.24
  ;
  run;

The following statements create the box plot shown in Figure 18.13 for the tube diameter:

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch;
  run;

Figure 18.13: Compressed Box Plots

Note that the diameters in batch 2 are significantly larger, and those in batch 7 significantly smaller, than those in most of the other batches. The default vertical axis scaling causes the box-and-whisker plots to be compressed.

You can request clipping by specifying the CLIPFACTOR= factor option, where factor is a value greater than one. Clipping is applied as follows:

  1. The mean of the first quartile values ($\bar{Q}_1$) and the mean of the third quartile values ($\bar{Q}_3$) are computed across all groups.

  2. Any plotted statistic greater than $y_{\max}$ or less than $y_{\min}$ is ignored during vertical axis scaling, where

    $y_{\max} = \bar{Q}_3 + (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$

    and

    $y_{\min} = \bar{Q}_1 - (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$

Notes:

  • Clipping is applied only to the plotted statistics and not to the statistics saved in an output data set.

  • A special symbol is used for clipped points (the default symbol is a square), and a legend is added to the chart indicating the number of boxes that were clipped.

The following statements create a box plot, shown in Figure 18.14, that uses a clipping factor of 1.5:

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch /
        clipfactor = 1.5;
  run;

Figure 18.14: Box Plot with Clip Factor of 1.5

In Figure 18.14 the extreme values are clipped, making the remaining boxes more readable. The box-and-whisker plots for batches 2 and 7 are clipped completely, while batch 14 is clipped at both the top and bottom. Clipped points are marked with a square, and a clipping legend is added at the lower right of the display.
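The clipping limits behind this result can be checked by hand. The Python sketch below is an illustration under two assumptions that are labeled in the code: group quartiles are computed with percentile definition 5 (empirical distribution function with averaging), and the limits are the mean third quartile plus, and the mean first quartile minus, the clip factor times the difference between the two means. The names NEWTUBES, quartiles_def5, and clip_limits are hypothetical.

```python
# Diameters from the NEWTUBES data set, one row per batch (batches 1 to 15).
NEWTUBES = [
    [69.13, 69.83, 70.76, 69.13, 70.81],
    [85.06, 82.82, 84.79, 84.89, 86.53],
    [67.67, 70.37, 68.80, 70.65, 68.20],
    [71.71, 70.46, 71.43, 69.53, 69.28],
    [71.04, 71.04, 70.29, 70.51, 71.29],
    [69.01, 68.87, 69.87, 70.05, 69.85],
    [50.72, 50.49, 49.78, 50.49, 49.69],
    [69.28, 71.80, 69.80, 70.99, 70.50],
    [70.76, 69.19, 70.51, 70.59, 70.40],
    [70.16, 70.07, 71.52, 70.72, 70.31],
    [68.67, 70.54, 69.50, 69.79, 70.76],
    [68.78, 68.55, 69.72, 69.62, 71.53],
    [70.61, 70.75, 70.90, 71.01, 71.53],
    [74.62, 56.95, 72.29, 82.41, 57.64],
    [70.54, 69.82, 70.71, 71.05, 69.24],
]

def quartiles_def5(values):
    """Return (Q1, Q3). Assumption: percentile definition 5 is in effect."""
    x = sorted(values)
    n = len(x)

    def pct(t):
        j, r = divmod(n * t, 100)       # np = j + g, and g > 0 exactly when r != 0
        if r:
            return x[j]                 # x_{j+1} in 1-based notation
        return (x[j - 1] + x[j]) / 2    # average of x_j and x_{j+1}

    return pct(25), pct(75)

def clip_limits(groups, factor):
    """Return (y_min, y_max) from the mean group quartiles and the clip factor."""
    quartiles = [quartiles_def5(group) for group in groups]
    q1_bar = sum(q1 for q1, _ in quartiles) / len(quartiles)
    q3_bar = sum(q3 for _, q3 in quartiles) / len(quartiles)
    spread = (q3_bar - q1_bar) * factor
    return q1_bar - spread, q3_bar + spread

y_min, y_max = clip_limits(NEWTUBES, 1.5)
fully_clipped = [
    batch for batch, group in enumerate(NEWTUBES, start=1)
    if min(group) > y_max or max(group) < y_min
]
print(round(y_min, 3), round(y_max, 3), fully_clipped)
```

Under these assumptions the limits are about 65.508 and 73.540. Every value in batch 2 lies above the upper limit and every value in batch 7 lies below the lower limit, while batch 14 has values beyond both limits, which agrees with the description of Figure 18.14.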

Other clipping options are available, as illustrated by the following statements:

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch /
        clipfactor  = 1.5
        clipsymbol  = dot
        cliplegpos  = top
        cliplegend  = '# Clipped Boxes'
        clipsubchar = '#';
  run;

Figure 18.15: Box Plot Using Clipping Options

Specifying CLIPSYMBOL=DOT marks the clipped points with a dot instead of the default square. Specifying CLIPLEGPOS=TOP positions the clipping legend at the top of the chart. The options CLIPLEGEND='# Clipped Boxes' and CLIPSUBCHAR='#' request the clipping legend '3 Clipped Boxes': the substitution character # in the legend text is replaced with the number of boxes that were clipped. For more information about the clipping options, see the appropriate entries in 'PLOT Statement Options.'




SAS/STAT 9.1 Users Guide, Volumes 1-7
ISBN: 1590472438
Year: 2004
Pages: 156
