The two main computational tasks of PROC KDE are automatic bandwidth selection and the construction of a kernel density estimate once a bandwidth has been selected. The primary computational tools used to accomplish these tasks are binning, convolutions, and the fast Fourier transform. The following sections provide analytical details on these topics, beginning with the density estimates themselves.
A weighted univariate kernel density estimate involves a variable X and a weight variable W. Let ( X i , W i ), i = 1, 2, …, n denote a sample of X and W of size n. The weighted kernel density estimate of f ( x ), the density of X, is as follows.
where h is the bandwidth and
is the standard normal density rescaled by the bandwidth. If h → 0 and nh → ∞, then the optimal bandwidth is
This optimal value is unknown, and so approximation methods are required. For a derivation and discussion of these results, refer to Silverman (1986, Chapter 3) and Jones, Marron, and Sheather (1996).
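For concreteness, the weighted estimator can be evaluated directly (before any binning is applied) as in the following Python sketch. The function name and NumPy-based implementation are illustrative, not part of PROC KDE; weights are normalized to sum to 1.

```python
import numpy as np

def weighted_kde(x_eval, X, W, h):
    """Direct weighted Gaussian kernel density estimate.

    Evaluates f_hat(x) = sum_i (W_i / sum(W)) * phi_h(x - X_i),
    where phi_h is the normal density with standard deviation h.
    """
    W = np.asarray(W, dtype=float)
    W = W / W.sum()                       # normalize weights to sum to 1
    u = (np.asarray(x_eval)[:, None] - np.asarray(X)[None, :]) / h
    phi = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    return phi @ W                        # one kernel evaluation per (x, X_i) pair
```

Evaluating this on a grid of g points costs ng kernel evaluations, which is the expense that motivates the binning described later in this section.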
For the bivariate case, let X = ( X, Y ) be a bivariate random element taking values in R² with joint density function f ( x, y ), ( x, y ) ∈ R², and let X i = ( X i , Y i ), i = 1, 2, …, n be a sample of size n drawn from this distribution. The kernel density estimate of f ( x, y ) based on this sample is
where ( x, y ) ∈ R², h X > 0 and h Y > 0 are the bandwidths, and φ h ( x, y ) is the rescaled normal density
where φ ( x, y ) is the standard normal density function
Under mild regularity assumptions about f ( x, y ), the mean integrated squared error (MISE) of the estimate is
as h X → 0, h Y → 0, and nh X h Y → ∞.
Now set
which is the asymptotic mean integrated squared error (AMISE). For fixed n, this has a minimum at ( h AMISE,X , h AMISE,Y ) defined as
and
These are the optimal asymptotic bandwidths in the sense that they minimize MISE. However, as in the univariate case, these expressions contain the second derivatives of the unknown density f being estimated, and so approximations are required. Refer to Wand and Jones (1993) for further details.
Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate at any point would involve n kernel evaluations, as
shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require ng kernel evaluations. When you use g = 401 in the univariate case or g = 60 × 60 = 3600 in the bivariate case and n ≥ 1000, the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is nearly as accurate as direct evaluation.
To bin a set of weighted univariate data X 1 , X 2 , …, X n to a grid x 1 , x 2 , …, x g , simply assign each sample X i , together with its weight W i , to the nearest grid point x j (also called the bin center). When binning is completed, each grid point x i has an associated number c i , which is the sum total of all the weights that correspond to sample points that have been assigned to x i . These c i values are known as the bin counts.
This procedure replaces the data ( X i , W i ), i = 1, 2, …, n with the smaller set ( x i , c i ), i = 1, 2, …, g, and the estimation is carried out with these new data. This is so-called simple binning, as opposed to the finer linear binning described in Wand (1994). PROC KDE uses simple binning for the sake of faster and easier implementation. It is also assumed that the bin centers x 1 , x 2 , …, x g are equally spaced and in increasing order. In addition, assume for notational convenience that and, therefore,
If you replace the data ( X i , W i ), i = 1, 2, …, n with ( x i , c i ), i = 1, 2, …, g, the weighted estimator then becomes
with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector grid = ( x 1 , x 2 , …, x g )′ is to calculate
for i = 1, 2, …, g. This can be rewritten as
where δ = x 2 − x 1 is the increment of the grid.
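As an illustration, simple binning and the binned estimator can be sketched in Python as follows. This is an illustrative NumPy implementation, not PROC KDE's own code; nearest-grid-point assignment and an equally spaced grid are assumed.

```python
import numpy as np

def simple_bin(X, W, grid):
    """Assign each (X_i, W_i) to the nearest grid point; return bin counts c."""
    delta = grid[1] - grid[0]                     # grid increment
    j = np.clip(np.rint((X - grid[0]) / delta), 0, len(grid) - 1).astype(int)
    c = np.zeros(len(grid))
    np.add.at(c, j, W)                            # accumulate weights per bin
    return c

def binned_kde(grid, c, h):
    """Evaluate the binned estimator at every grid point:
    f_hat(x_i) = sum_j (c_j / sum(c)) * phi_h((i - j) * delta)."""
    delta = grid[1] - grid[0]
    i = np.arange(len(grid))
    u = (i[:, None] - i[None, :]) * delta / h     # (x_i - x_j)/h on the grid
    phi = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    return phi @ (c / c.sum())
```

Note that the kernel matrix depends only on the difference i − j, which is exactly what makes the convolution formulation of the next subsection possible.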
The same idea of binning works similarly with bivariate data, where you estimate over the grid matrix grid = grid X × grid Y as follows.
where x i,j = ( x i , y j ), i = 1, 2, …, g X , j = 1, 2, …, g Y , and the estimates are
where δ X = x 2 − x 1 and δ Y = y 2 − y 1 are the increments of the grid.
The formulas for the binned estimator in the previous subsection are in the form of a convolution product between two matrices, one of which contains the bin counts and the other of which contains the rescaled kernels evaluated at multiples of the grid increment. This section defines these two matrices explicitly and shows that the estimate is their convolution.
Beginning with the weighted univariate case, define the following matrices:
The first thing to note is that many terms in K are negligible. The term φ h ( iδ ) is taken to be 0 when iδ / h ≥ 5, so you can define
as the maximum integer multiple of the grid increment to get nonzero evaluations of the rescaled kernel. Here floor( x ) denotes the largest integer less than or equal to x .
Next, let p be the smallest power of 2 that is greater than g + l + 1,
where ceil( x ) denotes the smallest integer greater than or equal to x .
Modify K as follows:
Essentially, the negligible terms of K are omitted, and the rest are symmetrized (except for one term). The whole matrix is then padded to size p × 1 with zeros in the middle. The dimension p is a highly composite number, that is, one that decomposes into many factors, leading to the most efficient fast Fourier transform operation (refer to Wand 1994).
The third operation is to pad the bin count matrix C with zeros to the same size as K :
The convolution K * C is then a p × 1 matrix, and the preceding formulas show that its first g entries are exactly the estimates at the grid points x i , i = 1, 2, …, g.
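A hypothetical NumPy sketch of this univariate construction, combining the definitions of l, p, K, and C with the FFT-based convolution described in the next subsection, might look as follows (the function name and details are illustrative, not PROC KDE's internal code):

```python
import numpy as np

def kde_via_fft(grid, c, h):
    """Binned KDE via the zero-padded, mirrored kernel vector K and the FFT.

    Follows the construction sketched above: l = min(g - 1, floor(5h/delta)),
    p = smallest power of 2 exceeding g + l + 1, and K holds the values
    phi_h(i * delta) / n mirrored so that circular convolution with the
    padded bin-count vector C reproduces the estimates.
    """
    g = len(grid)
    delta = grid[1] - grid[0]
    n = c.sum()
    l = min(g - 1, int(np.floor(5 * h / delta)))      # kernel support in grid steps
    p = 1 << int(np.ceil(np.log2(g + l + 1)))          # power-of-2 FFT length
    kappa = np.exp(-0.5 * (np.arange(l + 1) * delta / h) ** 2) \
            / (np.sqrt(2 * np.pi) * h * n)
    K = np.zeros(p)
    K[:l + 1] = kappa                                  # kappa_0 .. kappa_l
    K[p - l:] = kappa[1:][::-1]                        # mirror image, zeros in the middle
    C = np.zeros(p)
    C[:g] = c                                          # bin counts padded with zeros
    S = np.fft.ifft(np.fft.fft(K) * np.fft.fft(C)).real
    return S[:g]                                       # first g entries are the estimates
```

The zero block in the middle of K and at the end of C is what prevents the circular convolution from wrapping around, as discussed in the next subsection.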
For bivariate smoothing, the matrix K is defined similarly as
where l X = min( g X − 1, floor(5 h X / δ X )), with p X and the corresponding Y quantities defined analogously, and where the kernel entries are (1/ n ) φ h ( iδ X , jδ Y ), i = 0, 1, …, l X , j = 0, 1, …, l Y .
The bin count matrix C is defined as
As with the univariate case, the g X × g Y upper-left corner of the convolution K * C is the matrix of the estimates over grid.
Most of the results in this subsection are found in Wand (1994).
As shown in the last subsection, kernel density estimates can be expressed as a submatrix of a certain convolution. The fast Fourier transform (FFT) is a computationally effective method for computing such convolutions. For a reference on this material, refer to Press et al. (1988).
The discrete Fourier transform of a complex vector z = ( z 0 , …, z N−1 ) is the vector Z = ( Z 0 , …, Z N−1 ), where
and i is the square root of −1. The vector z can be recovered from Z by applying the inverse discrete Fourier transform formula
Discrete Fourier transforms and their inverses can be computed quickly using the FFT algorithm, especially when N is highly composite; that is, when it can be decomposed into many factors, such as a power of 2. By the Discrete Convolution Theorem, the convolution of two vectors is the inverse Fourier transform of the element-by-element product of their Fourier transforms. This, however, requires certain periodicity assumptions, which explains why the vectors K and C require zero-padding: it avoids wrap-around effects (refer to Press et al. 1988, pp. 410–411). The vector K is actually mirror-imaged so that the convolution of C and K is the vector of binned estimates. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the first g elements of S are the estimates.
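The role of zero-padding can be illustrated with a small Python sketch (illustrative only, not part of PROC KDE): the linear convolution of two vectors is recovered from a circular FFT convolution once both vectors are padded to length at least the sum of their lengths minus one.

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution via the Discrete Convolution Theorem.

    Both vectors are zero-padded to length len(a) + len(b) - 1, which
    prevents the wrap-around effects inherent in circular convolution.
    """
    N = len(a) + len(b) - 1
    A = np.fft.fft(a, N)          # fft(x, N) zero-pads x to length N
    B = np.fft.fft(b, N)
    return np.fft.ifft(A * B).real
```

Without the padding, entries near the ends of the output would be contaminated by contributions that wrap around the period boundary.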
The bivariate Fourier transform of an N 1 × N 2 complex matrix having ( l 1 + 1, l 2 + 1) entry equal to z l 1 , l 2 is the N 1 × N 2 matrix with ( j 1 + 1, j 2 + 1) entry given by
and the formula of the inverse is
The same Discrete Convolution Theorem applies, and zero-padding is needed for matrices C and K. In the case of K, the matrix is mirror-imaged twice. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the upper-left g X × g Y corner of S contains the estimates.
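The bivariate counterpart uses the two-dimensional FFT in exactly the same way, as in this illustrative sketch (assuming K and C have already been padded, and K mirror-imaged, to a common shape):

```python
import numpy as np

def fft_convolve2d(K, C):
    """2-D circular convolution via the bivariate FFT.

    With K and C zero-padded to a common shape (and K mirror-imaged in
    both directions, as described above), the upper-left gX x gY corner
    of the result holds the binned density estimates.
    """
    return np.fft.ifft2(np.fft.fft2(K) * np.fft.fft2(C)).real
```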
Several different bandwidth selection methods are available in PROC KDE in the univariate case. Following the recommendations of Jones, Marron, and Sheather (1996), the default method follows a plug-in formula of Sheather and Jones.
This method solves the fixed-point equation
where R ( ψ ) = ∫ ψ ( x )² dx.
PROC KDE solves this equation by first evaluating it on a grid of values spaced equally on a log scale. The largest two values from this grid that bound a solution are then used as starting values for a bisection algorithm.
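The grid-search-plus-bisection strategy can be sketched generically in Python as follows. The root function solved in the test is a simple stand-in, since the actual Sheather-Jones fixed-point equation is not reproduced here; the function name and grid size are illustrative.

```python
import numpy as np

def solve_on_log_grid(fun, lo, hi, ngrid=21, tol=1e-8):
    """Find a root of fun(h) = 0: evaluate fun on a log-spaced grid,
    take an adjacent pair of grid values that brackets a sign change,
    then refine with bisection."""
    hs = np.logspace(np.log10(lo), np.log10(hi), ngrid)
    vals = np.array([fun(h) for h in hs])
    idx = np.where(np.sign(vals[:-1]) != np.sign(vals[1:]))[0]
    if len(idx) == 0:
        raise ValueError("no sign change on the grid")
    a, b = hs[idx[0]], hs[idx[0] + 1]     # bracketing starting values
    while b - a > tol:
        m = 0.5 * (a + b)
        if np.sign(fun(m)) == np.sign(fun(a)):
            a = m                          # root lies in the upper half
        else:
            b = m                          # root lies in the lower half
    return 0.5 * (a + b)
```

The log-spaced grid makes the initial search robust over bandwidths that can span several orders of magnitude.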
The simple normal reference rule works by assuming that the true density f is Gaussian in the preceding fixed-point equation. This results in
where σ̂ is the sample standard deviation.
Silverman's rule of thumb (Silverman 1986, Section 3.4.2) is computed as
where Q 3 and Q 1 are the third and first sample quartiles, respectively.
The oversmoothed bandwidth is computed as
When you specify a WEIGHT variable, PROC KDE uses weighted versions of Q 3 , Q 1 , and σ̂ in the preceding expressions. The weighted quartiles are computed as weighted order statistics, and the weighted variance takes the form
where x̄ W is the weighted sample mean.
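As an unweighted illustration, Silverman's rule of thumb can be computed as in the following sketch. The constants 0.9 and 1.34 follow Silverman (1986, Section 3.4.2); the function name is illustrative, not PROC KDE's.

```python
import numpy as np

def silverman_bandwidth(X):
    """Silverman's rule of thumb:
    h = 0.9 * min(sigma_hat, (Q3 - Q1) / 1.34) * n**(-1/5)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    sigma = X.std(ddof=1)                      # sample standard deviation
    q1, q3 = np.percentile(X, [25, 75])        # first and third sample quartiles
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)
```

Taking the minimum of the standard deviation and the scaled interquartile range protects the bandwidth against heavy tails and outliers.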
For the bivariate case, Wand and Jones (1993) note that automatic bandwidth selection is both difficult and computationally expensive. Their study of various ways of specifying a bandwidth matrix also shows that using two bandwidths, one in each coordinate direction, is often adequate. PROC KDE enables you to adjust the two bandwidths by specifying a multiplier for the default bandwidths recommended by Bowman and Foster (1993):
Here σ̂ X and σ̂ Y are the sample standard deviations of X and Y, respectively. These are the optimal bandwidths for two independent normal variables that have the same variances as X and Y. They are, therefore, conservative in the sense that they tend to oversmooth the surface.
You can specify the BWM= option to adjust the aforementioned bandwidths to provide the appropriate amount of smoothing for your application.
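A sketch of default bivariate bandwidths with a BWM=-style multiplier is shown below. It assumes the normal-reference form h = σ̂ n^(−1/6) in each coordinate, which is the standard choice for independent bivariate normal data; the function name, the multiplier arguments, and this specific formula are illustrative assumptions, not PROC KDE's internal code.

```python
import numpy as np

def bivariate_default_bandwidths(X, Y, bwm_x=1.0, bwm_y=1.0):
    """Normal-reference bivariate bandwidths, h = sigma_hat * n**(-1/6)
    in each coordinate, scaled by BWM=-style multipliers (assumed form)."""
    n = len(X)
    hx = np.std(X, ddof=1) * n ** (-1.0 / 6.0) * bwm_x
    hy = np.std(Y, ddof=1) * n ** (-1.0 / 6.0) * bwm_y
    return hx, hy
```

A multiplier below 1 sharpens the surface; a multiplier above 1 smooths it further.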
PROC KDE assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.
ODS Table Name | Description | Statement | Option |
---|---|---|---|
BivariateStatistics | Bivariate statistics | BIVAR | BIVSTATS |
Controls | Control variables | default | |
Inputs | Input information | default | |
Levels | Levels of density estimate | BIVAR | LEVELS |
Percentiles | Percentiles of data | BIVAR / UNIVAR | PERCENTILES |
UnivariateStatistics | Basic statistics | BIVAR / UNIVAR | UNISTATS |
This section describes the use of ODS for creating graphics with the KDE procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.
To request these graphs, you must specify the ODS GRAPHICS statement in addition to the following options. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
You can specify the PLOTS= option in the BIVAR statement to request graphical displays of bivariate kernel density estimates.
PLOTS= option1 < option2 >
requests one or more plots of the bivariate kernel density estimate. The following table shows the available plot options .
Option | Plot Description |
---|---|
ALL | all available displays |
CONTOUR | contour plot of bivariate density estimate |
CONTOURSCATTER | contour plot of bivariate density estimate overlaid with scatter plot of data |
HISTOGRAM | bivariate histogram of data |
HISTSURFACE | bivariate histogram overlaid with bivariate kernel density estimate |
SCATTER | scatter plot of data |
SURFACE | surface plot of bivariate kernel density estimate |
By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the BIVAR statement creates a contour plot. If you specify the PLOTS= option, you get only the requested plots.
You can specify the PLOTS= option in the UNIVAR statement to request graphical displays of univariate kernel density estimates.
PLOTS= option1 < option2 >
requests one or more plots of the univariate kernel density estimate. The following table shows the available plot options .
Option | Plot Description |
---|---|
DENSITY | univariate kernel density estimate curve |
HISTDENSITY | univariate histogram of data overlaid with kernel density estimate curve |
HISTOGRAM | univariate histogram of data |
By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the UNIVAR statement creates a histogram overlaid with a kernel density estimate. If you specify the PLOTS= option, you get only the requested plots.
PROC KDE assigns a name to each graph it creates using the Output Delivery System (ODS). You can use these names to reference the graphs when using ODS. The names are listed in Table 36.2.
ODS Graph Name | Plot Description | Statement | PLOTS= Option |
---|---|---|---|
BivariateHistogram | Bivariate histogram of data | BIVAR | HISTOGRAM |
Contour | Contour plot of bivariate kernel density estimate | BIVAR | CONTOUR |
ContourScatter | Contour plot of bivariate kernel density estimate overlaid with scatter plot | BIVAR | CONTOURSCATTER |
Density | Univariate kernel density estimate curve | UNIVAR | DENSITY |
HistDensity | Univariate histogram overlaid with kernel density estimate curve | UNIVAR | HISTDENSITY |
Histogram | Univariate histogram of data | UNIVAR | HISTOGRAM |
HistSurface | Bivariate histogram overlaid with surface plot of bivariate kernel density estimate | BIVAR | HISTSURFACE |
ScatterPlot | Scatter plot of data | BIVAR | SCATTER |
SurfacePlot | Surface plot of bivariate kernel density estimate | BIVAR | SURFACE |
To request these graphs you must specify the ODS GRAPHICS statement in addition to the options indicated in Table 36.2. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.
Let ( X i , Y i ), i = 1, 2, …, n be a sample of size n drawn from a bivariate distribution. For the marginal distribution of X i , i = 1, 2, …, n, the number of bins (Nbins X ) in the bivariate histogram is calculated according to the formula
where ceil( x ) denotes the smallest integer greater than or equal to x ,
and the optimal bin width is obtained, following Scott (1992, p. 84), as
Here, σ̂ X ² and ρ̂ are the sample variance and the sample correlation coefficient, respectively. When you specify a WEIGHT variable, PROC KDE uses weighted versions of σ̂ X ² and ρ̂ in the preceding expressions.
Similar formulas are used to compute the number of bins for the marginal distribution of Y i , i = 1, 2, …, n. Further details can be found in Scott (1992).
Notice that if ρ̂ > 0.99, then Nbins X is calculated as in the univariate case (see Terrell and Scott 1985). In this case, Nbins Y = Nbins X .