The two main computational tasks of PROC KDE are automatic bandwidth selection and the construction of a kernel density estimate once a bandwidth has been selected. The primary computational tools used to accomplish these tasks are binning, convolutions, and the fast Fourier transform. The following sections provide analytical details on these topics, beginning with the density estimates themselves.
A weighted univariate kernel density estimate involves a variable X and a weight variable W. Let ( X i , W i ), i = 1, 2, …, n denote a sample of X and W of size n. The weighted kernel density estimate of f ( x ), the density of X, is as follows.
where h is the bandwidth and
is the standard normal density rescaled by the bandwidth. If h → 0 and nh → ∞, then the optimal bandwidth is
This optimal value is unknown, and so approximation methods are required. For a derivation and discussion of these results, refer to Silverman (1986, Chapter 3) and Jones, Marron, and Sheather (1996).
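For concreteness, the weighted estimator can be evaluated directly (before any binning is applied) as in the following Python sketch. The function name and NumPy-based implementation are illustrative, not part of PROC KDE; weights are normalized to sum to 1.

```python
import numpy as np

def weighted_kde(x_eval, X, W, h):
    """Direct weighted Gaussian kernel density estimate.

    Evaluates f_hat(x) = sum_i (W_i / sum(W)) * phi_h(x - X_i),
    where phi_h is the normal density with standard deviation h.
    """
    W = np.asarray(W, dtype=float)
    W = W / W.sum()                       # normalize weights to sum to 1
    u = (np.asarray(x_eval)[:, None] - np.asarray(X)[None, :]) / h
    phi = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    return phi @ W                        # one kernel evaluation per (x, X_i) pair
```

Evaluating this on a grid of g points costs ng kernel evaluations, which is the expense that motivates the binning described later in this section.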
For the bivariate case, let X = ( X, Y ) be a bivariate random element taking values in R² with joint density function f ( x, y ), ( x, y ) ∈ R², and let X i = ( X i , Y i ), i = 1, 2, …, n be a sample of size n drawn from this distribution. The kernel density estimate of f ( x, y ) based on this sample is
where ( x, y ) ∈ R², h X > 0 and h Y > 0 are the bandwidths, and φ h ( x, y ) is the rescaled normal density
where φ ( x, y ) is the standard normal density function
Under mild regularity assumptions about f ( x, y ), the mean integrated squared error (MISE) of the estimate is
as h X → 0, h Y → 0, and nh X h Y → ∞.
Now set
which is the asymptotic mean integrated squared error (AMISE). For fixed n, this has a minimum at ( h AMISE,X , h AMISE,Y ) defined as
and
These are the optimal asymptotic bandwidths in the sense that they minimize MISE. However, as in the univariate case, these expressions contain the second derivatives of the unknown density f being estimated, and so approximations are required. Refer to Wand and Jones (1993) for further details.
Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate at any point would involve n kernel evaluations, as
shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require ng kernel evaluations. When you use g = 401 in the univariate case or g = 60 × 60 = 3600 in the bivariate case and n ≥ 1000, the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is nearly as accurate as direct evaluation.
To bin a set of weighted univariate data X 1 , X 2 , …, X n to a grid x 1 , x 2 , …, x g , simply assign each sample X i , together with its weight W i , to the nearest grid point x j (also called the bin center). When binning is completed, each grid point x i has an associated number c i , which is the sum total of all the weights that correspond to sample points that have been assigned to x i . These c i values are known as the bin counts.
This procedure replaces the data ( X i , W i ), i = 1, 2, …, n with the smaller set ( x i , c i ), i = 1, 2, …, g, and the estimation is carried out with these new data. This is so-called simple binning, as opposed to the finer linear binning described in Wand (1994). PROC KDE uses simple binning for the sake of faster and easier implementation. It is also assumed that the bin centers x 1 , x 2 , …, x g are equally spaced and in increasing order. In addition, assume for notational convenience that and, therefore,
If you replace the data ( X i , W i ), i = 1, 2, …, n with ( x i , c i ), i = 1, 2, …, g, the weighted estimator then becomes
with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector grid = ( x 1 , x 2 , …, x g )′ is to calculate
for i = 1, 2, …, g. This can be rewritten as
where δ = x 2 − x 1 is the increment of the grid.
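As an illustration, simple binning and the binned estimator can be sketched in Python as follows. This is an illustrative NumPy implementation, not PROC KDE's own code; nearest-grid-point assignment and an equally spaced grid are assumed.

```python
import numpy as np

def simple_bin(X, W, grid):
    """Assign each (X_i, W_i) to the nearest grid point; return bin counts c."""
    delta = grid[1] - grid[0]                     # grid increment
    j = np.clip(np.rint((X - grid[0]) / delta), 0, len(grid) - 1).astype(int)
    c = np.zeros(len(grid))
    np.add.at(c, j, W)                            # accumulate weights per bin
    return c

def binned_kde(grid, c, h):
    """Evaluate the binned estimator at every grid point:
    f_hat(x_i) = sum_j (c_j / sum(c)) * phi_h((i - j) * delta)."""
    delta = grid[1] - grid[0]
    i = np.arange(len(grid))
    u = (i[:, None] - i[None, :]) * delta / h     # (x_i - x_j)/h on the grid
    phi = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    return phi @ (c / c.sum())
```

Note that the kernel matrix depends only on the difference i − j, which is exactly what makes the convolution formulation of the next subsection possible.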
The same idea of binning works similarly with bivariate data, where you estimate over the grid matrix grid = grid X × grid Y as follows.
where x i,j = ( x i , y j ), i = 1, 2, …, g X , j = 1, 2, …, g Y , and the estimates are
where δ X = x 2 − x 1 and δ Y = y 2 − y 1 are the increments of the grid.
The formulas for the binned estimator in the previous subsection are in the form of a convolution product between two matrices, one of which contains the bin counts and the other of which contains the rescaled kernels evaluated at multiples of the grid increment. This section defines these two matrices explicitly and shows that the estimate is their convolution.
Beginning with the weighted univariate case, define the following matrices:
The first thing to note is that many terms in K are negligible. The term φ h ( iδ ) is taken to be 0 when iδ / h ≥ 5, so you can define
as the maximum integer multiple of the grid increment to get nonzero evaluations of the rescaled kernel. Here floor( x ) denotes the largest integer less than or equal to x .
Next, let p be the smallest power of 2 that is greater than g + l + 1,
where ceil( x ) denotes the smallest integer greater than or equal to x .
Modify K as follows:
Essentially, the negligible terms of K are omitted, and the rest are symmetrized (except for one term). The whole matrix is then padded to size p × 1 with zeros in the middle. The dimension p is a highly composite number, that is, one that decomposes into many factors, leading to the most efficient fast Fourier transform operation (refer to Wand 1994).
The third operation is to pad the bin count matrix C with zeros to the same size as K :
The convolution K * C is then a p × 1 matrix, and the preceding formulas show that its first g entries are exactly the estimates at the grid points x i , i = 1, 2, …, g.
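A hypothetical NumPy sketch of this univariate construction, combining the definitions of l, p, K, and C with the FFT-based convolution described in the next subsection, might look as follows (the function name and details are illustrative, not PROC KDE's internal code):

```python
import numpy as np

def kde_via_fft(grid, c, h):
    """Binned KDE via the zero-padded, mirrored kernel vector K and the FFT.

    Follows the construction sketched above: l = min(g - 1, floor(5h/delta)),
    p = smallest power of 2 exceeding g + l + 1, and K holds the values
    phi_h(i * delta) / n mirrored so that circular convolution with the
    padded bin-count vector C reproduces the estimates.
    """
    g = len(grid)
    delta = grid[1] - grid[0]
    n = c.sum()
    l = min(g - 1, int(np.floor(5 * h / delta)))      # kernel support in grid steps
    p = 1 << int(np.ceil(np.log2(g + l + 1)))          # power-of-2 FFT length
    kappa = np.exp(-0.5 * (np.arange(l + 1) * delta / h) ** 2) \
            / (np.sqrt(2 * np.pi) * h * n)
    K = np.zeros(p)
    K[:l + 1] = kappa                                  # kappa_0 .. kappa_l
    K[p - l:] = kappa[1:][::-1]                        # mirror image, zeros in the middle
    C = np.zeros(p)
    C[:g] = c                                          # bin counts padded with zeros
    S = np.fft.ifft(np.fft.fft(K) * np.fft.fft(C)).real
    return S[:g]                                       # first g entries are the estimates
```

The zero block in the middle of K and at the end of C is what prevents the circular convolution from wrapping around, as discussed in the next subsection.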
For bivariate smoothing, the matrix K is defined similarly as
where l X = min( g X − 1, floor(5 h X / δ X )), with p X and the corresponding Y quantities defined analogously, and where the kernel entries are (1/ n ) φ h ( iδ X , jδ Y ), i = 0, 1, …, l X , j = 0, 1, …, l Y .
The bin count matrix C is defined as
As with the univariate case, the g X × g Y upper-left corner of the convolution K * C is the matrix of the estimates over grid.
Most of the results in this subsection are found in Wand (1994).
As shown in the last subsection, kernel density estimates can be expressed as a submatrix of a certain convolution. The fast Fourier transform (FFT) is a computationally effective method for computing such convolutions. For a reference on this material, refer to Press et al. (1988).
The discrete Fourier transform of a complex vector z = ( z 0 , …, z N−1 ) is the vector Z = ( Z 0 , …, Z N−1 ), where
and i is the square root of −1. The vector z can be recovered from Z by applying the inverse discrete Fourier transform formula
Discrete Fourier transforms and their inverses can be computed quickly using the FFT algorithm, especially when N is highly composite; that is, when it can be decomposed into many factors, such as a power of 2. By the Discrete Convolution Theorem, the convolution of two vectors is the inverse Fourier transform of the element-by-element product of their Fourier transforms. This, however, requires certain periodicity assumptions, which explains why the vectors K and C require zero-padding: it avoids wrap-around effects (refer to Press et al. 1988, pp. 410–411). The vector K is actually mirror-imaged so that the convolution of C and K is the vector of binned estimates. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the first g elements of S are the estimates.
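The role of zero-padding can be illustrated with a small Python sketch (illustrative only, not part of PROC KDE): the linear convolution of two vectors is recovered from a circular FFT convolution once both vectors are padded to length at least the sum of their lengths minus one.

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution via the Discrete Convolution Theorem.

    Both vectors are zero-padded to length len(a) + len(b) - 1, which
    prevents the wrap-around effects inherent in circular convolution.
    """
    N = len(a) + len(b) - 1
    A = np.fft.fft(a, N)          # fft(x, N) zero-pads x to length N
    B = np.fft.fft(b, N)
    return np.fft.ifft(A * B).real
```

Without the padding, entries near the ends of the output would be contaminated by contributions that wrap around the period boundary.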
The bivariate Fourier transform of an N 1 × N 2 complex matrix having ( l 1 + 1, l 2 + 1) entry equal to z l 1 , l 2 is the N 1 × N 2 matrix with ( j 1 + 1, j 2 + 1) entry given by
and the formula of the inverse is
The same Discrete Convolution Theorem applies, and zero-padding is needed for matrices C and K. In the case of K, the matrix is mirror-imaged twice. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the upper-left g X × g Y corner of S contains the estimates.
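The bivariate counterpart uses the two-dimensional FFT in exactly the same way, as in this illustrative sketch (assuming K and C have already been padded, and K mirror-imaged, to a common shape):

```python
import numpy as np

def fft_convolve2d(K, C):
    """2-D circular convolution via the bivariate FFT.

    With K and C zero-padded to a common shape (and K mirror-imaged in
    both directions, as described above), the upper-left gX x gY corner
    of the result holds the binned density estimates.
    """
    return np.fft.ifft2(np.fft.fft2(K) * np.fft.fft2(C)).real
```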
Several different bandwidth selection methods are available in PROC KDE in the univariate case. Following the recommendations of Jones, Marron, and Sheather (1996), the default method follows a plug-in formula of Sheather and Jones.
This method solves the fixed-point equation
where R ( ψ ) = ∫ ψ ( x )² dx.
PROC KDE solves this equation by first evaluating it on a grid of values spaced equally on a log scale. The largest two values from this grid that bound a solution are then used as starting values for a bisection algorithm.
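The grid-search-plus-bisection strategy can be sketched generically in Python as follows. The root function solved in the test is a simple stand-in, since the actual Sheather-Jones fixed-point equation is not reproduced here; the function name and grid size are illustrative.

```python
import numpy as np

def solve_on_log_grid(fun, lo, hi, ngrid=21, tol=1e-8):
    """Find a root of fun(h) = 0: evaluate fun on a log-spaced grid,
    take an adjacent pair of grid values that brackets a sign change,
    then refine with bisection."""
    hs = np.logspace(np.log10(lo), np.log10(hi), ngrid)
    vals = np.array([fun(h) for h in hs])
    idx = np.where(np.sign(vals[:-1]) != np.sign(vals[1:]))[0]
    if len(idx) == 0:
        raise ValueError("no sign change on the grid")
    a, b = hs[idx[0]], hs[idx[0] + 1]     # bracketing starting values
    while b - a > tol:
        m = 0.5 * (a + b)
        if np.sign(fun(m)) == np.sign(fun(a)):
            a = m                          # root lies in the upper half
        else:
            b = m                          # root lies in the lower half
    return 0.5 * (a + b)
```

The log-spaced grid makes the initial search robust over bandwidths that can span several orders of magnitude.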
The simple normal reference rule works by assuming that the true density f is Gaussian in the preceding fixed-point equation. This results in
where σ̂ is the sample standard deviation.
Silverman's rule of thumb (Silverman 1986, Section 3.4.2) is computed as
where Q 3 and Q 1 are the third and first sample quartiles, respectively.
The oversmoothed bandwidth is computed as
When you specify a WEIGHT variable, PROC KDE uses weighted versions of Q 3 , Q 1 , and σ̂ in the preceding expressions. The weighted quartiles are computed as weighted order statistics, and the weighted variance takes the form
where x̄ W is the weighted sample mean.
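As an unweighted illustration, Silverman's rule of thumb can be computed as in the following sketch. The constants 0.9 and 1.34 follow Silverman (1986, Section 3.4.2); the function name is illustrative, not PROC KDE's.

```python
import numpy as np

def silverman_bandwidth(X):
    """Silverman's rule of thumb:
    h = 0.9 * min(sigma_hat, (Q3 - Q1) / 1.34) * n**(-1/5)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    sigma = X.std(ddof=1)                      # sample standard deviation
    q1, q3 = np.percentile(X, [25, 75])        # first and third sample quartiles
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)
```

Taking the minimum of the standard deviation and the scaled interquartile range protects the bandwidth against heavy tails and outliers.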
For the bivariate case, Wand and Jones (1993) note that automatic bandwidth selection is both difficult and computationally expensive. Their study of various ways of specifying a bandwidth matrix also shows that using two bandwidths, one in each coordinate direction, is often adequate. PROC KDE enables you to adjust the two bandwidths by specifying a multiplier for the default bandwidths recommended by Bowman and Foster (1993):
Here σ̂ X and σ̂ Y are the sample standard deviations of X and Y, respectively. These are the optimal bandwidths for two independent normal variables that have the same variances as X and Y. They are, therefore, conservative in the sense that they tend to oversmooth the surface.
You can specify the BWM= option to adjust the aforementioned bandwidths to provide the appropriate amount of smoothing for your application.
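A sketch of default bivariate bandwidths with a BWM=-style multiplier is shown below. It assumes the normal-reference form h = σ̂ n^(−1/6) in each coordinate, which is the standard choice for independent bivariate normal data; the function name, the multiplier arguments, and this specific formula are illustrative assumptions, not PROC KDE's internal code.

```python
import numpy as np

def bivariate_default_bandwidths(X, Y, bwm_x=1.0, bwm_y=1.0):
    """Normal-reference bivariate bandwidths, h = sigma_hat * n**(-1/6)
    in each coordinate, scaled by BWM=-style multipliers (assumed form)."""
    n = len(X)
    hx = np.std(X, ddof=1) * n ** (-1.0 / 6.0) * bwm_x
    hy = np.std(Y, ddof=1) * n ** (-1.0 / 6.0) * bwm_y
    return hx, hy
```

A multiplier below 1 sharpens the surface; a multiplier above 1 smooths it further.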
PROC KDE assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
ODS Table Name | Description | Statement | Option |
---|---|---|---|
BivariateStatistics | Bivariate statistics | BIVAR | BIVSTATS |
Controls | Control variables | default | |
Inputs | Input information | default | |
Levels | Levels of density estimate | BIVAR | LEVELS |
Percentiles | Percentiles of data | BIVAR / UNIVAR | PERCENTILES |
UnivariateStatistics | Basic statistics | BIVAR / UNIVAR | UNISTATS |
This section describes the use of ODS for creating graphics with the KDE procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.
To request these graphs, you must specify the ODS GRAPHICS statement in addition to the following options. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
You can specify the PLOTS= option in the BIVAR statement to request graphical displays of bivariate kernel density estimates.
PLOTS= option1 < option2 >
requests one or more plots of the bivariate kernel density estimate. The following table shows the available plot options .
Option | Plot Description |
---|---|
ALL | all available displays |
CONTOUR | contour plot of bivariate density estimate |
CONTOURSCATTER | contour plot of bivariate density estimate overlaid with scatter plot of data |
HISTOGRAM | bivariate histogram of data |
HISTSURFACE | bivariate histogram overlaid with bivariate kernel density estimate |
SCATTER | scatter plot of data |
SURFACE | surface plot of bivariate kernel density estimate |
By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the BIVAR statement creates a contour plot. If you specify the PLOTS= option, you get only the requested plots.
You can specify the PLOTS= option in the UNIVAR statement to request graphical displays of univariate kernel density estimates.
PLOTS= option1 < option2 >
requests one or more plots of the univariate kernel density estimate. The following table shows the available plot options .
Option | Plot Description |
---|---|
DENSITY | univariate kernel density estimate curve |
HISTDENSITY | univariate histogram of data overlaid with kernel density estimate curve |
HISTOGRAM | univariate histogram of data |
By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the UNIVAR statement creates a histogram overlaid with a kernel density estimate. If you specify the PLOTS= option, you get only the requested plots.
PROC KDE assigns a name to each graph it creates using the Output Delivery System (ODS). You can use these names to reference the graphs when using ODS. The names are listed in Table 36.2.
ODS Graph Name | Plot Description | Statement | PLOTS= Option |
---|---|---|---|
BivariateHistogram | Bivariate histogram of data | BIVAR | HISTOGRAM |
Contour | Contour plot of bivariate kernel density estimate | BIVAR | CONTOUR |
ContourScatter | Contour plot of bivariate kernel density estimate overlaid with scatter plot | BIVAR | CONTOURSCATTER |
Density | Univariate kernel density estimate curve | UNIVAR | DENSITY |
HistDensity | Univariate histogram overlaid with kernel density estimate curve | UNIVAR | HISTDENSITY |
Histogram | Univariate histogram of data | UNIVAR | HISTOGRAM |
HistSurface | Bivariate histogram overlaid with surface plot of bivariate kernel density estimate | BIVAR | HISTSURFACE |
ScatterPlot | Scatter plot of data | BIVAR | SCATTER |
SurfacePlot | Surface plot of bivariate kernel density estimate | BIVAR | SURFACE |
To request these graphs you must specify the ODS GRAPHICS statement in addition to the options indicated in Table 36.2. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
Let ( X i , Y i ), i = 1, 2, …, n be a sample of size n drawn from a bivariate distribution. For the marginal distribution of X i , i = 1, 2, …, n, the number of bins (Nbins X ) in the bivariate histogram is calculated according to the formula
where ceil( x ) denotes the smallest integer greater than or equal to x ,
and the optimal bin width is obtained, following Scott (1992, p. 84), as
Here, σ̂ X ² and ρ̂ are the sample variance and the sample correlation coefficient, respectively. When you specify a WEIGHT variable, PROC KDE uses weighted versions of σ̂ X ² and ρ̂ in the preceding expressions.
Similar formulas are used to compute the number of bins for the marginal distribution of Y i , i = 1, 2, …, n. Further details can be found in Scott (1992).
Notice that if ρ̂ > 0.99, then Nbins X is calculated as in the univariate case (see Terrell and Scott 1985). In this case, Nbins Y = Nbins X .