You can use the following statements with the KDE procedure.
PROC KDE < options > ;
BIVAR < ( > variable < (v-options) > variable < (v-options) >< ) >
< ...< ( > variable < (v-options) > variable < (v-options) >< ) >
>< / options > ;
UNIVAR variable < (v-options) ><... variable < (v-options) >>
< / options > ;
BY variables ;
FREQ variable ;
WEIGHT variable ;
The PROC KDE statement invokes the procedure. The BIVAR statement requests that one or more bivariate kernel density estimates be computed. The UNIVAR statement requests one or more univariate kernel density estimates.
PROC KDE < options > ;
The PROC KDE statement invokes the procedure and specifies the input data set.
DATA = SAS-data-set
specifies the input SAS data set to be used by PROC KDE. The default is the most recently created data set.
Note: The following options, which were available in the PROC KDE statement in Version 8, are now obsolete. These options are now available in the UNIVAR and BIVAR statements.
Version 8 PROC KDE option | SAS 9 UNIVAR option | SAS 9 BIVAR option |
---|---|---|
BWM= numlist | BWM= number | BWM= number |
GRIDL= numlist | GRIDL= number | GRIDL= number |
GRIDU= numlist | GRIDU= number | GRIDU= number |
LEVELS | | LEVELS |
METHOD | METHOD | |
NGRID= numlist | NGRID= number | NGRID= number |
OUT | OUT | OUT |
PERCENTILES | PERCENTILES | PERCENTILES |
SJPIMAX | SJPIMAX | |
SJPIMIN | SJPIMIN | |
SJPINUM | SJPINUM | |
SJPITOL | SJPITOL | |
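As a sketch of the SAS 9 form, grid options that Version 8 accepted as number lists in the PROC KDE statement can be written as v-options, one value per variable. The example assumes the SASHELP.CLASS sample data set is available; the output data set name ClassDens and the grid values are illustrative only.

```sas
/* SAS 9 style: grid options are attached to each variable as v-options. */
/* ClassDens is a hypothetical output data set name.                     */
proc kde data=sashelp.class;
   bivar Height (gridl=50 gridu=75  ngrid=50)
         Weight (gridl=40 gridu=160 ngrid=50) / out=ClassDens;
run;
```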
The basic syntax for the BIVAR statement is
BIVAR variable1 variable2 ;
This statement requests a bivariate kernel density estimate for the variables variable1 and variable2.
The general form of this syntax is as follows:
BIVAR < ( > variable < (v-options) > variable < (v-options) >< ) >
< ...< ( > variable < (v-options) > variable < (v-options) >< ) >
>< / options > ;
The BIVAR statement lists variables in the input data set for which bivariate kernel density estimates are to be computed. You can specify a list of variables or a list of variable pairs, where each pair is enclosed in parentheses. If you specify a variable list, a kernel density estimate is computed for each distinct combination of two variables in the list. If you specify variable pairs, a kernel density estimate is computed for each pair.
For example, if you specify
bivar x y z;
then a bivariate kernel density estimate is computed for each of the pairs (x, y), (x, z), and (y, z). On the other hand, if you specify
bivar (x y) (z w);
then only two bivariate kernel density estimates are computed, one for (x, y) and one for (z, w).
You can specify various v-options for each variable by enclosing them in parentheses after the variable name. You can also specify global options among the BIVAR statement options following a slash (/). Global options are applied to all the variables specified in the BIVAR statement. However, individual variable v-options override the global options.
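The following sketch illustrates how v-options and global options interact in a BIVAR statement. It assumes the SASHELP.CLASS sample data set is available; the option values are arbitrary and chosen only for illustration.

```sas
/* Global options after the slash apply to both variables,        */
/* except where a v-option in parentheses overrides them.         */
/*   Height: BWM=2 (v-option),   NGRID=40 (global)                */
/*   Weight: BWM=0.75 (global),  NGRID=100 (v-option)             */
proc kde data=sashelp.class;
   bivar Height (bwm=2) Weight (ngrid=100) / ngrid=40 bwm=0.75;
run;
```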
Note: The VAR statement for PROC KDE in Version 8 is now obsolete. The VAR statement has been replaced by the UNIVAR and the BIVAR statements, which provide more control and flexibility for specifying the variables to be analyzed.
You can specify the following options in the BIVAR statement (as noted, some options can be used as v-options).
BIVSTATS
requests the covariance and correlation between the two variables.
BWM= number
specifies the bandwidth multiplier for the kernel density estimate. The default value is 1. Larger multipliers produce a smoother estimate, and smaller ones produce a rougher estimate. You can specify BWM= as a v-option.
GRIDL= number
specifies the lower grid limit for the kernel density estimate. By default, the lower limit for each variable is its minimum observed value. You can specify GRIDL= as a v-option.
GRIDU= number
specifies the upper grid limit for the kernel density estimate. By default, the upper limit for each variable is its maximum observed value. You can specify GRIDU= as a v-option.
LEVELS
LEVELS= numlist
requests a table of levels for contours of the bivariate density. The contours are defined in such a way that the density has a constant level along each contour, and the volume enclosed by each contour corresponds to a specified percent. In other words, the contours correspond to slices or levels of the density surface taken along the density axis. You can use the LEVELS= option to specify the percents. By default, the percents are 1, 5, 10, 50, 90, 95, 99, and 100. The table also provides the minimum and maximum values for each contour along the directions of the two data variables.
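A minimal sketch of requesting contour levels for a chosen set of percents follows. It assumes the SASHELP.CLASS sample data set is available, and the blank-separated form of the number list is an assumption based on common SAS list syntax rather than something stated in this section.

```sas
/* Request the table of contour levels at 25, 50, and 75 percent */
/* instead of the default percents.                              */
proc kde data=sashelp.class;
   bivar Height Weight / levels=25 50 75;
run;
```

Specifying LEVELS without a list requests the default percents.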
NGRID= number
NG= number
specifies the number of grid points associated with a variable in the BIVAR statement. The default value is 60. You can specify NGRID= as a v-option.
OUT= SAS-data-set
specifies the output SAS data set containing the kernel density estimate. This output data set contains the following variables:
var1, whose value is the name of the first variable in a bivariate kernel density estimate
var2, whose value is the name of the second variable in a bivariate kernel density estimate
value1, with values corresponding to grid coordinates for the first variable
value2, with values corresponding to grid coordinates for the second variable
density, with values equal to kernel density estimates at the associated grid point
count, containing the number of original observations contained in the bin corresponding to a grid point
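As a sketch of how the OUT= data set might be inspected, the following statements save a bivariate estimate and print its first few rows. The SASHELP.CLASS input and the data set name BivDens are assumptions made for illustration.

```sas
/* Save the bivariate density estimate to a hypothetical data set BivDens */
proc kde data=sashelp.class;
   bivar Height Weight / out=BivDens;
run;

/* Print the first rows: var1, var2, value1, value2, density, count */
proc print data=BivDens(obs=5);
run;
```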
PERCENTILES
PERCENTILES= numlist
lists percentiles to be computed for each BIVAR variable. The default percentiles are 0.5, 1, 2.5, 5, 10, 25, 50, 75, 90, 95, 97.5, 99, and 99.5.
UNISTATS
requests standard univariate statistics for each variable, as well as statistics associated with the density estimate.
UNIVAR variable < (v-options) ><... variable < (v-options) >>
< / options > ;
The UNIVAR statement lists variables in the input data set for which univariate kernel density estimates are to be computed. You can specify various v-options for each variable by enclosing them in parentheses after the variable name. You can also specify global options among the UNIVAR statement options following a slash (/). Global options are applied to all the variables specified in the UNIVAR statement. However, individual variable v-options override the global options.
Note: The VAR statement for PROC KDE in Version 8 is now obsolete. The VAR statement has been replaced by the UNIVAR and the BIVAR statements, which provide more control and flexibility for specifying the variables to be analyzed.
You can specify the following options in the UNIVAR statement (as noted, some options can be used as v-options).
BWM= number
specifies the bandwidth multiplier for the kernel density estimate. The default value is 1. Larger multipliers produce a smoother estimate, and smaller ones produce a rougher estimate. You can specify BWM= as a v-option.
GRIDL= number
specifies the lower grid limit for the kernel density estimate. By default, the lower limit for each variable is its minimum observed value. You can specify GRIDL= as a v-option.
GRIDU= number
specifies the upper grid limit for the kernel density estimate. By default, the upper limit for each variable is its maximum observed value. You can specify GRIDU= as a v-option.
METHOD=SJPI | SNR | SROT | OS
specifies the method used to compute the bandwidth. Available methods are Sheather-Jones plug-in (SJPI), simple normal reference (SNR), Silverman's rule of thumb (SROT), and oversmoothed (OS). Refer to Jones, Marron, and Sheather (1996) for a description of each of these methods. SJPI is the default method.
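To compare bandwidth methods on the same variable, one approach is to run the procedure once per method and keep each estimate in its own data set. This sketch assumes the SASHELP.CLASS sample data set is available; DensSJPI and DensOS are hypothetical data set names.

```sas
/* Default Sheather-Jones plug-in bandwidth */
proc kde data=sashelp.class;
   univar Weight / method=sjpi unistats out=DensSJPI;
run;

/* Oversmoothed bandwidth, for comparison */
proc kde data=sashelp.class;
   univar Weight / method=os unistats out=DensOS;
run;
```

The UNISTATS option is included so that each run also reports the statistics associated with its density estimate.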
NGRID= number
NG= number
specifies the number of grid points associated with a variable in the UNIVAR statement. The default value is 401. You can specify NGRID= as a v-option.
OUT= SAS-data-set
specifies the output SAS data set containing the kernel density estimate. This output data set contains the following variables:
var, whose value is the name of the variable in the kernel density estimate
value, with values corresponding to grid coordinates for the variable
density, with values equal to kernel density estimates at the associated grid point
count, containing the number of original observations contained in the bin corresponding to a grid point
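Because the OUT= data set stacks the estimates for all UNIVAR variables and identifies each one through the var variable, a single estimate can be selected and plotted from it. The sketch below assumes the SASHELP.CLASS sample data set and the SGPLOT procedure are available in your release; UniDens is a hypothetical data set name.

```sas
/* Save univariate density estimates for two variables */
proc kde data=sashelp.class;
   univar Height Weight / out=UniDens;
run;

/* Plot the estimate for one variable; UPCASE guards against */
/* differences in how the variable name is stored in var.    */
proc sgplot data=UniDens;
   where upcase(var) = 'HEIGHT';
   series x=value y=density;
run;
```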
PERCENTILES
PERCENTILES= numlist
lists percentiles to be computed for each UNIVAR variable. The default percentiles are 0.5, 1, 2.5, 5, 10, 25, 50, 75, 90, 95, 97.5, 99, and 99.5.
SJPIMAX= number
specifies the maximum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is two times the oversmoothed estimate.
SJPIMIN= number
specifies the minimum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is the maximum grid value (the SJPIMAX= value) divided by 18.
SJPINUM= number
specifies the number of grid values used in determining the Sheather-Jones plug-in bandwidth. The default is 21.
SJPITOL= number
specifies the tolerance for termination of the bisection algorithm used in computing the Sheather-Jones plug-in bandwidth. The default value is 0.001.
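The following sketch shows how the Sheather-Jones search settings might be tightened relative to their defaults. It assumes the SASHELP.CLASS sample data set is available, and the particular values are arbitrary illustrations, not recommendations.

```sas
/* Use more grid values (default 21) and a smaller bisection  */
/* tolerance (default 0.001) in the SJPI bandwidth search.    */
proc kde data=sashelp.class;
   univar Weight / method=sjpi sjpinum=41 sjpitol=0.0001;
run;
```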
UNISTATS
requests standard univariate statistics for each variable, as well as statistics associated with the density estimate.
Suppose you have the variables x1, x2, x3, and x4 in the SAS data set MyData. You can request a kernel density estimate for each of these variables with the following statements.
proc kde data=MyData;
   univar x1 x2 x3 x4;
run;
You can also specify different bandwidths and other options for each variable. For example, the following statements request kernel density estimates using Silverman's rule of thumb (SROT) method for all variables. The option NGRID=200 applies to the variables x1, x3, and x4, but the v-option NGRID=100 is applied to x2. Bandwidth multipliers of 2 and 0.5 are specified for the variables x1 and x2, respectively.
proc kde data=MyData;
   univar x1 (bwm=2) x2 (bwm=0.5 ngrid=100) x3 x4 / ngrid=200 method=srot;
run;
BY variables ;
You can specify a BY statement with PROC KDE to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables.
If your input data set is not sorted in ascending order, use one of the following alternatives:
Sort the data using the SORT procedure with a similar BY statement.
Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the KDE procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables using the DATASETS procedure.
For more information on the BY statement, refer to the discussion in the SAS Language Reference: Concepts. For more information on the DATASETS procedure, refer to the discussion in the SAS Procedures Guide.
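As a sketch of the first alternative, the data can be sorted with PROC SORT and then analyzed by group. The example assumes the SASHELP.CLASS sample data set is available; ClassBySex is a hypothetical data set name.

```sas
/* Sort a copy of the data by the BY variable */
proc sort data=sashelp.class out=ClassBySex;
   by Sex;
run;

/* One univariate density estimate of Height per value of Sex */
proc kde data=ClassBySex;
   univar Height;
   by Sex;
run;
```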
FREQ variable ;
The FREQ statement specifies a variable that provides frequencies for each observation in the DATA= data set. Specifically, if n is the value of the FREQ variable for a given observation, then that observation is used n times. If the value of the FREQ variable is missing or is less than 1, the observation is not used in the analysis. If the value is not an integer, only the integer portion is used.
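A small sketch of the FREQ statement with pre-tabulated data follows. The data set Binned, its variables x and n, and the values are all made up for illustration; METHOD=SROT is chosen here only to keep the example simple.

```sas
/* Hypothetical tabulated data: each value x occurs n times */
data Binned;
   input x n;
   datalines;
1.0 3
1.5 5
2.0 8
2.5 4
3.0 2
4.5 1
;
run;

/* Each observation is used n times in the density estimate */
proc kde data=Binned;
   univar x / method=srot;
   freq n;
run;
```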
WEIGHT variable ;
The WEIGHT statement specifies a variable that weights the observations in computing the kernel density estimate. Observations with higher weights have more influence in the computations. If an observation has a nonpositive or missing weight, then the entire observation is omitted from the analysis. You should be cautious in using data sets with extreme weights, as they can produce unreliable results.
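The sketch below shows a weighted estimate in which one observation is dropped because its weight is missing. The data set Survey, its variables income and w, and the values are hypothetical.

```sas
/* Hypothetical weighted sample; the last row has a missing weight */
data Survey;
   input income w;
   datalines;
21 1.0
25 1.5
30 2.0
34 1.2
41 0.8
55 0.5
60 .
;
run;

/* The observation with the missing weight is omitted from the analysis */
proc kde data=Survey;
   univar income / method=srot;
   weight w;
run;
```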