Details | SAS/STAT 9.1 User's Guide, Volumes 1-7

Summary Statistics Represented by Box Plots

Table 18.6 lists the summary statistics represented in each box-and-whisker plot.

Table 18.6: Summary Statistics Represented by Box Plots
Group Summary Statistic	Feature of Box-and-Whisker Plot
Maximum	Endpoint of upper whisker
Third quartile (75th percentile)	Upper edge of box
Median (50th percentile)	Line inside box
Mean	Symbol marker
First quartile (25th percentile)	Lower edge of box
Minimum	Endpoint of lower whisker

Note that you can request different box plot styles, as discussed in the section 'Styles of Box Plots' and as illustrated in Example 18.2.

Output Data Sets

OUTBOX= Data Set

The OUTBOX= data set saves group summary statistics and outlier values. The following variables can be saved:

the group variable
the variable _VAR_, containing the analysis variable name
the variable _TYPE_, identifying features of box-and-whisker plots
the variable _VALUE_, containing values of box-and-whisker plot features
the variable _ID_, containing labels for outliers
the variable _HTML_, containing URLs associated with plot features

_ID_ is included in the OUTBOX= data set only if one of the keywords SCHEMATICID or SCHEMATICIDFAR is specified with the BOXSTYLE= option. _HTML_ is present only if one or more of the HTML=, OUTHIGHHTML=, or OUTLOWHTML= options are specified.

Each observation in an OUTBOX= data set records the value of a single feature of one group's box-and-whisker plot, such as its mean. The _TYPE_ variable identifies the feature whose value is recorded in _VALUE_. The following table lists valid _TYPE_ variable values:

Table 18.7: Valid _TYPE_ Values in an OUTBOX= Data Set
_TYPE_ Value	Description
N	group size
MIN	minimum group value
Q1	group first quartile
MEDIAN	group median
MEAN	group mean
Q3	group third quartile
MAX	group maximum value
LOW	low outlier value
HIGH	high outlier value
LOWHISKR	low whisker value, if different from MIN
HIWHISKR	high whisker value, if different from MAX
FARLOW	low far outlier value
FARHIGH	high far outlier value

Additionally, the following variables, if specified, are included:

block-variables
symbol-variable
BY variables
ID variables
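
As a minimal sketch of how an OUTBOX= data set might be created and inspected, the following statements save the box plot summary for the Turbine data used in the examples of this section and then list only the outlier records. The data set name TurbineBox is an assumption made for illustration. BOXSTYLE=SCHEMATIC is requested so that outlier observations can be recorded.

```sas
/* Save group summary statistics and outlier values */
proc boxplot data=Turbine;
   plot kwatts*day / boxstyle=schematic outbox=TurbineBox;
run;

/* List outlier records only; the _TYPE_ values are those in Table 18.7 */
proc print data=TurbineBox;
   where _TYPE_ in ('LOW', 'HIGH', 'FARLOW', 'FARHIGH');
run;
```

Because each observation holds one feature of one group, a WHERE clause on _TYPE_ is a simple way to pull out a single statistic (for example, all group medians).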

OUTHISTORY= Data Set

The OUTHISTORY= data set saves group summary statistics. The following variables are saved:

the group variable
group minimum variables named by analysis-variable suffixed with L
group first-quartile variables named by analysis-variable suffixed with 1
group mean variables named by analysis-variable suffixed with X
group median variables named by analysis-variable suffixed with M
group third-quartile variables named by analysis-variable suffixed with 3
group maximum variables named by analysis-variable suffixed with H
group size variables named by analysis-variable suffixed with N

Subgroup summary variables are created for each analysis-variable specified in the PLOT statement. For example, consider the following statements:

  proc boxplot data=steel;
     plot (width diameter)*lot / outhistory=summary;
  run;

The data set SUMMARY contains variables named LOT, WIDTHL, WIDTH1, WIDTHM, WIDTHX, WIDTH3, WIDTHH, WIDTHN, DIAMETERL, DIAMETER1, DIAMETERM, DIAMETERX, DIAMETER3, DIAMETERH, and DIAMETERN.

Given an analysis variable name that contains the maximum of 32 characters, the procedure first shortens the name to its first 16 characters and its last 15 characters, and then it adds the suffix.
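
The shortening rule can be sketched in a DATA step. This is only an illustration of the rule as stated above, not the procedure's internal code; the 32-character name measurement_of_tube_diameter_mm1 and the variable names longname and histname are hypothetical.

```sas
data _null_;
   length longname $32 histname $32;
   /* hypothetical 32-character analysis variable name */
   longname = 'measurement_of_tube_diameter_mm1';
   /* first 16 characters + last 15 characters + suffix (L = group minimum) */
   histname = cats(substr(longname, 1, 16), substr(longname, 18, 15), 'L');
   put histname=;
run;
```

Here the 17th character of the name is dropped to make room for the suffix, so the value written to the log is measurement_of_tbe_diameter_mm1L, which is again 32 characters long.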

Additionally, the following variables, if specified, are included:

BY variables
block-variables
symbol-variable
ID variables

Note that an OUTHISTORY= data set does not contain outlier values, and therefore cannot be used, in general, to save a schematic box plot. You can use an OUTBOX= data set to save a schematic box plot summary.

Input Data Sets

DATA= Data Set

You can read data (analysis variable measurements) from a data set specified with the DATA= option in the PROC BOXPLOT statement. Each analysis variable specified in the PLOT statement must be a SAS variable in the data set. This variable provides measurements that are organized into groups indexed by the group variable. The group variable, specified in the PLOT statement, must also be a SAS variable in the DATA= data set. Each observation in a DATA= data set must contain a value for each analysis variable and a value for the group variable. If the $i$th group contains $n_i$ measurements, there should be $n_i$ consecutive observations for which the value of the group variable is the index of the $i$th group. For example, if each group contains 20 items and there are 30 groups, the DATA= data set should contain 600 observations. Other variables that can be read from a DATA= data set include:

block-variables
symbol-variable
BY variables
ID variables
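
Because the observations for each group must be consecutive, input data that are not already ordered by the group variable can be sorted first. The following is a minimal sketch using the Turbine data from the examples in this section; the output data set name TurbineSorted is an assumption.

```sas
/* Make observations with the same group value consecutive */
proc sort data=Turbine out=TurbineSorted;
   by day;
run;

proc boxplot data=TurbineSorted;
   plot kwatts*day;
run;
```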

BOX= Data Set

You can read group summary statistics and outlier information from a BOX= data set specified in the PROC BOXPLOT statement. This allows you to reuse OUTBOX= data sets that have been created in previous runs of the BOXPLOT procedure to reproduce schematic box plots.

A BOX= data set must contain the following variables:

the group variable
_VAR_, containing the analysis variable name
_TYPE_, identifying features of box-and-whisker plots
_VALUE_, containing values of those features

Each observation in a BOX= data set records the value of a single feature of one group's box-and-whisker plot, such as its mean. The _TYPE_ variable identifies the feature whose value is recorded in a given observation. The following table lists valid _TYPE_ variable values:

Table 18.8: Valid _TYPE_ Values in a BOX= Data Set
_TYPE_ Value	Description
N	group size
MIN	minimum group value
Q1	group first quartile
MEDIAN	group median
MEAN	group mean
Q3	group third quartile
MAX	group maximum value
LOW	low outlier value
HIGH	high outlier value
LOWHISKR	low whisker value, if different from MIN
HIWHISKR	high whisker value, if different from MAX
FARLOW	low far outlier value
FARHIGH	high far outlier value

The features identified by _TYPE_ values N, MIN, Q1, MEDIAN, MEAN, Q3, and MAX are required for each group.

Other variables that can be read from a BOX= data set include:

the variable _ID_, containing labels for outliers
the variable _HTML_, containing URLs to be associated with features on box plots
block-variables
symbol-variable
BY variables
ID variables

When you specify one of the keywords SCHEMATICID or SCHEMATICIDFAR with the BOXSTYLE= option, values of _ID_ are used as outlier labels. If _ID_ does not exist in the BOX= data set, the values of the first variable listed in the ID statement are used.
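
The round trip from OUTBOX= to BOX= might look like the following sketch. The data set name TurbineBox is an assumption; the first step saves the summary and the second step redraws the schematic box plot from the saved summary alone, without access to the raw measurements.

```sas
/* Step 1: save the schematic box plot summary, including outliers */
proc boxplot data=Turbine;
   plot kwatts*day / boxstyle=schematic outbox=TurbineBox;
run;

/* Step 2: reproduce the plot from the saved summary */
proc boxplot box=TurbineBox;
   plot kwatts*day / boxstyle=schematic;
run;
```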

HISTORY= Data Set

You can read group summary statistics from a HISTORY= data set specified in the PROC BOXPLOT statement. This allows you to reuse OUTHISTORY= data sets that have been created in previous runs of the BOXPLOT procedure or to read output data sets created with SAS summarization procedures, such as PROC UNIVARIATE.

Note that a HISTORY= data set does not contain outlier information. Therefore, in general you cannot reproduce a schematic box plot from summary statistics saved in an OUTHISTORY= data set. To save and reproduce schematic box plots, use OUTBOX= and BOX= data sets.

A HISTORY= data set must contain the following:

the group variable
a group minimum variable for each analysis variable
a group first-quartile variable for each analysis variable
a group median variable for each analysis variable
a group mean variable for each analysis variable
a group third-quartile variable for each analysis variable
a group maximum variable for each analysis variable
a group size variable for each analysis variable

The names of the group summary statistics variables must be the analysis variable name concatenated with the following special suffix characters:

Group Summary Statistic	Suffix Character
group minimum	L
group first-quartile	1
group median	M
group mean	X
group third-quartile	3
group maximum	H
group size	N

For example, consider the following statements:

  proc boxplot history=summary;
     plot (weight yldstren) * batch;
  run;

The data set SUMMARY must include the variables BATCH, WEIGHTL, WEIGHT1, WEIGHTM, WEIGHTX, WEIGHT3, WEIGHTH, WEIGHTN, YLDSTRENL, YLDSTREN1, YLDSTRENM, YLDSTRENX, YLDSTREN3, YLDSTRENH, and YLDSTRENN.

Note that if you specify an analysis variable name that contains 32 characters, the names of the summary variables must be formed from the first 16 characters and the last 15 characters of the analysis variable name, suffixed with the appropriate character.

Other variables that can be read from a HISTORY= data set include

block-variables
symbol-variable
BY variables
ID variables
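
As a sketch of how a HISTORY= data set might be built with a summarization procedure, the following statements use PROC UNIVARIATE to compute the seven required statistics for the variable Diam in the Parts data set (created in the section 'Displaying Blocks of Data') and name them with the required suffixes. The data set name PartsHist is an assumption, and the example relies on Parts already being ordered by Sample.

```sas
/* One output observation per group, with suffix-named summary variables */
proc univariate data=Parts noprint;
   by sample;
   var diam;
   output out=PartsHist
      min=diamL q1=diam1 median=diamM mean=diamX
      q3=diam3  max=diamH n=diamN;
run;

proc boxplot history=PartsHist;
   plot diam*sample;
run;
```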

Styles of Box Plots

A box-and-whisker plot is displayed for the measurements in each group on the box plot. The skeletal style of the box-and-whisker plot shown in Figure 18.3 is the default. Figure 18.5 illustrates a typical schematic box plot and the locations of the fences (which are not displayed in actual output). See the description of the BOXSTYLE= option for complete details.

Figure 18.5: BOXSTYLE= SCHEMATIC

You can draw connecting lines between adjacent box-and-whisker plots using the BOXCONNECT= keyword option. For example, BOXCONNECT=MEAN connects the points representing the means of adjacent groups. Other available keywords are MIN, Q1, MEDIAN, Q3, and MAX. Specifying BOXCONNECT without a keyword is equivalent to specifying BOXCONNECT=MEAN. You can specify the color for the connecting lines with the CCONNECT= option.
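
For instance, the following sketch connects the medians of adjacent groups in the Turbine data. The color red is an arbitrary choice made for illustration.

```sas
proc boxplot data=Turbine;
   plot kwatts*day / boxconnect=median cconnect=red;
run;
```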

Percentile Definitions

You can use the PCTLDEF= option to specify one of five definitions for computing quantile statistics (percentiles). Suppose that $n$ equals the number of nonmissing values for a variable and that $x_1, x_2, \ldots, x_n$ represents the ordered values of the analysis variable. For the $t$th percentile, set $p = t/100$.

For the following definitions numbered 1, 2, 3, and 5, express $np$ as

$np = j + g$

where $j$ is the integer part of $np$, and $g$ is the fractional part of $np$. For definition 4, let

$(n+1)p = j + g$

The $t$th percentile (call it $y$) can be defined as follows:

PCTLDEF=1	weighted average at $x_{np}$: $y = (1-g)\,x_j + g\,x_{j+1}$, where $x_0$ is taken to be $x_1$
PCTLDEF=2	observation numbered closest to $np$: $y = x_i$, where $i$ is the integer part of $np + 1/2$ if $g \neq 1/2$. If $g = 1/2$, then $y = x_j$ if $j$ is even, or $y = x_{j+1}$ if $j$ is odd.
PCTLDEF=3	empirical distribution function: $y = x_j$ if $g = 0$; $y = x_{j+1}$ if $g > 0$
PCTLDEF=4	weighted average aimed at $x_{p(n+1)}$: $y = (1-g)\,x_j + g\,x_{j+1}$, where $x_{n+1}$ is taken to be $x_n$
PCTLDEF=5	empirical distribution function with averaging: $y = (x_j + x_{j+1})/2$ if $g = 0$; $y = x_{j+1}$ if $g > 0$
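
To make the five definitions concrete, here is a worked illustration for the 75th percentile of five values. The sample is taken from batch 1 of the NEWTUBES data in the section 'Clipping Extreme Values'. The ordered values are $x_1 = 69.13$, $x_2 = 69.13$, $x_3 = 69.83$, $x_4 = 70.76$, $x_5 = 70.81$.

```latex
\begin{aligned}
& n = 5,\; p = 0.75,\; np = 3.75 = j + g \;\Rightarrow\; j = 3,\; g = 0.75 \\
& \text{PCTLDEF=1: } y = (1-g)\,x_3 + g\,x_4 = 0.25(69.83) + 0.75(70.76) = 70.5275 \\
& \text{PCTLDEF=2: } i = \lfloor np + 1/2 \rfloor = \lfloor 4.25 \rfloor = 4,\; y = x_4 = 70.76 \\
& \text{PCTLDEF=3: } g > 0 \;\Rightarrow\; y = x_{j+1} = x_4 = 70.76 \\
& \text{PCTLDEF=4: } (n+1)p = 4.5 \;\Rightarrow\; j = 4,\; g = 0.5,\; y = 0.5(70.76) + 0.5(70.81) = 70.785 \\
& \text{PCTLDEF=5: } g > 0 \;\Rightarrow\; y = x_{j+1} = x_4 = 70.76
\end{aligned}
```

With so few values per group, the choice of definition visibly moves the upper edge of the box (70.5275, 70.76, or 70.785 here).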

Missing Values

An observation read from a DATA= data set is not analyzed if the value of the group variable is missing. For a particular analysis variable, an observation read from a DATA= data set is not analyzed if the value of the analysis variable is missing.

Missing values of analysis variables generally lead to unequal group sizes.

Continuous Group Variables

By default, the PLOT statement treats numerical group variable values as discrete values and spaces the boxes evenly on the plot. The following statements produce the plot shown in Figure 18.6:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
  run;

The labels on the horizontal axis in Figure 18.6 do not represent 10 consecutive days, but the box-and-whisker plots are evenly spaced.

Figure 18.6: Box Plot with Discrete Group Variable

To treat the group variable as continuous, you can specify the CONTINUOUS or HAXIS= option. Either option produces a box plot with a horizontal axis scaled for continuous group variable values.

The following statements produce the plot shown in Figure 18.7. Note that the values on the horizontal axis represent consecutive days. (The TURNHLABEL option orients the horizontal axis labels vertically so there is room to display them all.) Box-and-whisker plots are not produced for days when no turbine data was collected.

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day / turnhlabel
                       continuous;
  run;

Figure 18.7: Box Plot with Continuous Group Variable

Positioning Insets

This section provides details on three different methods of positioning INSET boxes using the POSITION= option. With the POSITION= option, you can specify

compass points
keywords for margin positions
coordinates in data units or percent axis units

Positioning the Inset Using Compass Points

You can specify the eight compass points N, NE, E, SE, S, SW, W, and NW as keywords for the POSITION= option. The following statements create the display in Figure 18.8, which demonstrates all eight compass positions. The default is NW.

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nobs / height=3 cfill=blank header='NW' pos=nw;
     inset nobs / height=3 cfill=blank header='N ' pos=n ;
     inset nobs / height=3 cfill=blank header='NE' pos=ne;
     inset nobs / height=3 cfill=blank header='E ' pos=e ;
     inset nobs / height=3 cfill=blank header='SE' pos=se;
     inset nobs / height=3 cfill=blank header='S ' pos=s ;
     inset nobs / height=3 cfill=blank header='SW' pos=sw;
     inset nobs / height=3 cfill=blank header='W ' pos=w ;
  run;

Figure 18.8: Insets Positioned Using Compass Points

Positioning the Inset in the Margins

Using the INSET statement you can also position an inset in one of the four margins surrounding the plot area using the margin keywords LM, RM, TM, or BM, as illustrated in Figure 18.9.

Figure 18.9: Positioning Insets in the Margins

For an example of an inset placed in the top margin, see Figure 18.2. Margin positions are recommended if a large number of statistics are listed in the INSET statement. If you attempt to display a lengthy inset in the interior of the plot, it is likely that the inset will collide with the data display.
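
A margin position is requested in the same way as a compass point. The following sketch places an inset with several statistics in the right margin of the Turbine plot; the header text is an arbitrary choice made for illustration.

```sas
title 'Box Plot for Power Output';
proc boxplot data=Turbine;
   plot kwatts*day;
   /* RM = right margin; LM, TM, and BM work the same way */
   inset nobs nmin nmax / header='Summary' position=rm;
run;
```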

Positioning the Inset Using Coordinates

You can also specify the position of the inset with coordinates: POSITION=(x, y). The coordinates can be given in axis percent units (the default) or in axis data units.

Data Unit Coordinates

If you specify the DATA option immediately following the coordinates, the inset is positioned using axis data units. For example, the following statements place the bottom left corner of the inset at 07JUL on the horizontal axis and 3950 on the vertical axis:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nobs /
        header   = 'Position=(07JUL,3950)'
        position = ('07JUL94'd, 3950) data;
  run;

The box plot is displayed in Figure 18.10. By default, the specified coordinates determine the position of the bottom left corner of the inset. You can change this reference point with the REFPOINT= option, as in the next example.

Figure 18.10: Inset Positioned Using Data Unit Coordinates

Axis Percent Unit Coordinates

If you do not use the DATA option, the inset is positioned using axis percent units. The coordinates of the bottom left corner of the display are (0, 0), while the upper right corner is (100, 100). For example, the following statements create a box plot with two insets, both positioned using coordinates in axis percent units:

  title 'Box Plot for Power Output';
  proc boxplot data=Turbine;
     plot kwatts*day;
     inset nmin / position = (5,25)
                  header   = 'Position=(5,25)'
                  height   = 3
                  cfill    = blank
                  refpoint = tl;
     inset nmax / position = (95,95)
                  header   = 'Position=(95,95)'
                  height   = 3
                  cfill    = blank
                  refpoint = tr;
  run;

The display is shown in Figure 18.11. Notice that the REFPOINT= option is used to determine which corner of the inset is to be placed at the coordinates specified with the POSITION= option. The first inset has REFPOINT=TL, so the top left corner of the inset is positioned 5% of the way across the horizontal axis and 25% of the way up the vertical axis. The second inset has REFPOINT=TR, so the top right corner of the inset is positioned 95% of the way across the horizontal axis and 95% of the way up the vertical axis. Note also that coordinates in axis percent units must be between 0 and 100.

Figure 18.11: Inset Positioned Using Axis Percent Unit Coordinates

Displaying Blocks of Data

To display data organized in blocks of consecutive observations, specify one or more block-variables in parentheses after the group-variable in the PLOT statement. The block variables must be variables in the input data set. The procedure displays a legend identifying blocks of consecutive observations with identical values of the block variables. The legend displays one track of values for each block variable containing formatted values of the block variable.

The values of a block variable must be the same for all observations with the same value of the group variable. In other words, groups must be nested within blocks determined by block variables.

The following statements create a SAS data set containing diameter measurements for a part produced on three different machines:

  data Parts;
     length machine $ 4;
     input sample machine $ @;
     do i= 1 to 4;
        input diam @;
        output;
     end;
     drop i;
     datalines;
   1  A386  4.32 4.55 4.16 4.44
   2  A386  4.49 4.30 4.52 4.61
   3  A386  4.44 4.32 4.25 4.50
   4  A386  4.55 4.15 4.42 4.49
   5  A386  4.21 4.30 4.29 4.63
   6  A386  4.56 4.61 4.29 4.56
   7  A386  4.63 4.30 4.41 4.58
   8  A386  4.38 4.65 4.43 4.44
   9  A386  4.12 4.49 4.30 4.36
  10  A455  4.45 4.56 4.38 4.51
  11  A455  4.62 4.67 4.70 4.58
  12  A455  4.33 4.23 4.34 4.58
  13  A455  4.29 4.38 4.28 4.41
  14  A455  4.15 4.35 4.28 4.23
  15  A455  4.21 4.30 4.32 4.38
  16  C334  4.16 4.28 4.31 4.59
  17  C334  4.14 4.18 4.08 4.21
  18  C334  4.51 4.20 4.28 4.19
  19  C334  4.10 4.33 4.37 4.47
  20  C334  3.99 4.09 4.47 4.25
  21  C334  4.24 4.54 4.43 4.38
  22  C334  4.23 4.48 4.31 4.57
  23  C334  4.27 4.40 4.32 4.56
  24  C334  4.70 4.65 4.49 4.38
  ;

The following statements create a box plot for the data in the Parts data set grouped into blocks by the block-variable Machine. The plot is shown in Figure 18.12.

Figure 18.12: Box Plot Using a Block Variable

  title 'Box Plot for Diameter Grouped By Machine';
  proc boxplot data=Parts;
     plot diam*sample (machine);
     label sample   = 'Sample Number'
           machine  = 'Machine'
           diam     = 'Diameter';
  run;

The unique consecutive values of Machine (A386, A455, and C334) are displayed in a legend above the plot. Note the LABEL statement used to provide labels for the axes and for the block legend.

By default, the block legend is placed above the plot, as in Figure 18.12. You can control the position of the legend with the BLOCKPOS=n option; see the description of the BLOCKPOS= option.

By default, block variable values that are too long to fit into the available space in a block legend are not displayed. You can specify the BLOCKLABTYPE= option to display lengthy labels. Specify BLOCKLABTYPE=SCALED to scale down the text size of the values so they all fit. Choose BLOCKLABTYPE=TRUNCATED to truncate lengthy values. You can also use BLOCKLABTYPE=height to specify a text height in vertical percent screen units for the values.

You can control the position of legend labels with the BLOCKLABELPOS= keyword option. The valid keywords are ABOVE (the default, as shown in Figure 18.12) and LEFT.
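
The block legend options can be combined on the PLOT statement. The following sketch reuses the Parts example, scaling the block values so they fit and placing the legend label to the left; it is an illustration of the options described above rather than a reproduction of any figure in this chapter.

```sas
title 'Box Plot for Diameter Grouped By Machine';
proc boxplot data=Parts;
   plot diam*sample (machine) / blocklabtype  = scaled
                                blocklabelpos = left;
   label sample   = 'Sample Number'
         machine  = 'Machine'
         diam     = 'Diameter';
run;
```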

Clipping Extreme Values

By default a box plot's vertical axis is scaled to accommodate all the values in all groups. If the variation between groups is large with respect to the variation within groups, or if some groups contain extreme outlier values, the vertical axis scale can become so large that the box-and-whisker plots are compressed. In such cases, you can clip the extreme values so that a more readable plot is displayed, as illustrated in the following example.

A company produces copper tubing. The diameter measurements (in millimeters) for 15 batches of five tubes each are provided in the data set NEWTUBES.

  data newtubes;
     label diameter='Diameter in mm';
     do batch = 1 to 15;
        do i = 1 to 5;
           input diameter @@;
           output;
        end;
     end;
     datalines;
  69.13  69.83  70.76  69.13  70.81
  85.06  82.82  84.79  84.89  86.53
  67.67  70.37  68.80  70.65  68.20
  71.71  70.46  71.43  69.53  69.28
  71.04  71.04  70.29  70.51  71.29
  69.01  68.87  69.87  70.05  69.85
  50.72  50.49  49.78  50.49  49.69
  69.28  71.80  69.80  70.99  70.50
  70.76  69.19  70.51  70.59  70.40
  70.16  70.07  71.52  70.72  70.31
  68.67  70.54  69.50  69.79  70.76
  68.78  68.55  69.72  69.62  71.53
  70.61  70.75  70.90  71.01  71.53
  74.62  56.95  72.29  82.41  57.64
  70.54  69.82  70.71  71.05  69.24
  ;
  run;

The following statements create the box plot shown in Figure 18.13 for the tube diameter:

Figure 18.13: Compressed Box Plots

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch;
  run;

Note that the diameters in batch 2 are significantly larger, and those in batch 7 significantly smaller, than those in most of the other batches. The default vertical axis scaling causes the box-and-whisker plots to be compressed.

You can request clipping by specifying the CLIPFACTOR=factor option, where factor is a value greater than one. Clipping is applied as follows:

The mean of the first quartile values ($\bar{Q}_1$) and the mean of the third quartile values ($\bar{Q}_3$) are computed across all groups.
Any plotted statistic greater than $y_{\max}$ or less than $y_{\min}$ is ignored during vertical axis scaling, where

$y_{\max} = \bar{Q}_1 + (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$

and

$y_{\min} = \bar{Q}_3 - (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$
Notes:

Clipping is applied only to the plotted statistics and not to the statistics saved in an output data set.
A special symbol is used for clipped points (the default symbol is a square), and a legend is added to the chart indicating the number of boxes that were clipped.
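
As a worked illustration of the clipping bounds, suppose (purely hypothetically; these are not values computed from the NEWTUBES data) that the mean first quartile is 69.5, the mean third quartile is 71.0, and CLIPFACTOR=1.5. Assuming the bounds take the form $y_{\max} = \bar{Q}_1 + (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$ and $y_{\min} = \bar{Q}_3 - (\bar{Q}_3 - \bar{Q}_1) \times \mathit{factor}$, the arithmetic is:

```latex
\begin{aligned}
\bar{Q}_3 - \bar{Q}_1 &= 71.0 - 69.5 = 1.5 \\
y_{\max} &= 69.5 + 1.5 \times 1.5 = 71.75 \\
y_{\min} &= 71.0 - 1.5 \times 1.5 = 68.75
\end{aligned}
```

Under these assumed values, any plotted statistic above 71.75 or below 68.75 would be ignored when the vertical axis is scaled. With a factor of exactly one the bounds would collapse to the mean quartiles themselves, which is consistent with the requirement that factor be greater than one.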

The following statements create a box plot, shown in Figure 18.14, that uses a clipping factor of 1.5:

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch /
        clipfactor = 1.5;
  run;

Figure 18.14: Box Plot with Clip Factor of 1.5

In Figure 18.14 the extreme values are clipped, making the remaining boxes more readable. The box-and-whisker plots for batches 2 and 7 are clipped completely, while batch 14 is clipped at both the top and bottom. Clipped points are marked with a square, and a clipping legend is added at the lower right of the display.

Other clipping options are available, as illustrated by the following statements:

  symbol value=plus;
  title 'Box Plot for New Copper Tubes';
  proc boxplot data=newtubes;
     plot diameter*batch /
        clipfactor  = 1.5
        clipsymbol  = dot
        cliplegpos  = top
        cliplegend  = '# Clipped Boxes'
        clipsubchar = '#';
  run;

Figure 18.15: Box Plot Using Clipping Options

Specifying CLIPSYMBOL=DOT marks the clipped points with a dot instead of the default square. Specifying CLIPLEGPOS=TOP positions the clipping legend at the top of the chart. Together, the options CLIPLEGEND='# Clipped Boxes' and CLIPSUBCHAR='#' request the clipping legend '3 Clipped Boxes': the substitution character # in the legend text is replaced by the number of boxes clipped. For more information about the clipping options, see the appropriate entries in 'PLOT Statement Options.'