Details


Computational Overview

The two main computational tasks of PROC KDE are automatic bandwidth selection and the construction of a kernel density estimate once a bandwidth has been selected. The primary computational tools used to accomplish these tasks are binning, convolutions, and the fast Fourier transform. The following sections provide analytical details on these topics, beginning with the density estimates themselves.

Kernel Density Estimates

A weighted univariate kernel density estimate involves a variable X and a weight variable W. Let (X_i, W_i), i = 1, 2, ..., n denote a sample of X and W of size n. The weighted kernel density estimate of f(x), the density of X, is as follows.

\[
\hat{f}(x) = \frac{1}{\sum_{i=1}^{n} W_i} \sum_{i=1}^{n} W_i \, \varphi_h(x - X_i)
\]

where h is the bandwidth and

\[
\varphi_h(x) = \frac{1}{h\sqrt{2\pi}} \exp\!\left( -\frac{x^2}{2h^2} \right)
\]

is the standard normal density rescaled by the bandwidth. If h → 0 and nh → ∞, then the optimal bandwidth is

\[
h_{\mathrm{AMISE}} = \left[ \frac{1}{2\sqrt{\pi}\, n \int f''(x)^2 \, dx} \right]^{1/5}
\]

This optimal value is unknown, and so approximation methods are required. For a derivation and discussion of these results, refer to Silverman (1986, Chapter 3) and Jones, Marron, and Sheather (1996).
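As a concrete illustration, the weighted estimator above can be evaluated directly in a few lines of Python (a sketch for exposition only; PROC KDE itself uses the binned FFT algorithm described in the following sections, and the function names here are illustrative):

```python
import math

def normal_kernel(u):
    """Standard normal density phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def weighted_kde(x, sample, weights, h):
    """Weighted kernel density estimate:
    f-hat(x) = (1 / sum W_i) * sum_i W_i * phi_h(x - X_i),
    where phi_h(t) = phi(t / h) / h is the rescaled kernel."""
    s = sum(w * normal_kernel((x - xi) / h) / h
            for xi, w in zip(sample, weights))
    return s / sum(weights)

sample = [1.0, 2.0, 2.5, 3.0, 5.0]
weights = [1.0, 1.0, 2.0, 1.0, 1.0]
density_at_2_5 = weighted_kde(2.5, sample, weights, h=1.0)
```

Because each evaluation point touches all n observations, evaluating on a grid of g points costs ng kernel evaluations; this is the cost that the binning described below avoids.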

For the bivariate case, let X = (X, Y) be a bivariate random element taking values in R^2 with joint density function f(x, y), (x, y) ∈ R^2, and let X_i = (X_i, Y_i), i = 1, 2, ..., n be a sample of size n drawn from this distribution. The kernel density estimate of f(x, y) based on this sample is

\[
\hat{f}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \varphi_h(x - X_i,\; y - Y_i)
\]

where (x, y) ∈ R^2, h_X > 0 and h_Y > 0 are the bandwidths, and φ_h(x, y) is the rescaled normal density

\[
\varphi_h(x, y) = \frac{1}{h_X h_Y} \, \varphi\!\left( \frac{x}{h_X}, \frac{y}{h_Y} \right)
\]

where φ(x, y) is the standard bivariate normal density function

\[
\varphi(x, y) = \frac{1}{2\pi} \exp\!\left( -\frac{x^2 + y^2}{2} \right)
\]

Under mild regularity assumptions about f(x, y), the mean integrated squared error (MISE) of \hat{f}(x, y) is

\[
\mathrm{MISE}(h_X, h_Y) = E \int \left( \hat{f} - f \right)^2
= \frac{1}{4 \pi n h_X h_Y}
+ \frac{h_X^4}{4} \psi_{40}
+ \frac{h_X^2 h_Y^2}{2} \psi_{22}
+ \frac{h_Y^4}{4} \psi_{04}
+ o\!\left( \frac{1}{n h_X h_Y} + h_X^4 + h_Y^4 \right)
\]

where

\[
\psi_{40} = \int \left( \frac{\partial^2 f}{\partial x^2} \right)^2 dx \, dy, \qquad
\psi_{04} = \int \left( \frac{\partial^2 f}{\partial y^2} \right)^2 dx \, dy, \qquad
\psi_{22} = \int \frac{\partial^2 f}{\partial x^2} \frac{\partial^2 f}{\partial y^2} \, dx \, dy
\]

as h_X → 0, h_Y → 0, and nh_X h_Y → ∞.

Now set

\[
\mathrm{AMISE}(h_X, h_Y) = \frac{1}{4 \pi n h_X h_Y}
+ \frac{h_X^4}{4} \psi_{40}
+ \frac{h_X^2 h_Y^2}{2} \psi_{22}
+ \frac{h_Y^4}{4} \psi_{04}
\]

which is the asymptotic mean integrated squared error (AMISE). For fixed n, this has minimum at (h_AMISE_X, h_AMISE_Y) defined as

\[
h_{\mathrm{AMISE\_}X} = \left[ \frac{\psi_{04}^{3/4}}{4 \pi n \, \psi_{40}^{3/4} \left( \psi_{22} + \sqrt{\psi_{40} \psi_{04}} \right)} \right]^{1/6}
\]

and

\[
h_{\mathrm{AMISE\_}Y} = \left[ \frac{\psi_{40}^{3/4}}{4 \pi n \, \psi_{04}^{3/4} \left( \psi_{22} + \sqrt{\psi_{40} \psi_{04}} \right)} \right]^{1/6}
\]

These are the optimal asymptotic bandwidths in the sense that they minimize MISE. However, as in the univariate case, these expressions involve integrals of the second derivatives of the unknown density f being estimated, and so approximations are required. Refer to Wand and Jones (1993) for further details.

Binning

Binning, or assigning data to discrete categories, is an effective and fast method for large data sets (Fan and Marron 1994). When the sample size n is large, direct evaluation of the kernel estimate at any point would involve n kernel evaluations, as shown in the preceding formulas. To evaluate the estimate at each point of a grid of size g would thus require ng kernel evaluations. When you use g = 401 in the univariate case or g = 60 × 60 = 3600 in the bivariate case and n ≥ 1000, the amount of computation can be prohibitively large. With binning, however, the computational order is reduced to g, resulting in a much quicker algorithm that is nearly as accurate as direct evaluation.

To bin a set of weighted univariate data X_1, X_2, ..., X_n to a grid x_1, x_2, ..., x_g, simply assign each sample point X_i, together with its weight W_i, to the nearest grid point x_j (also called the bin center). When binning is completed, each grid point x_i has an associated number c_i, which is the sum of all the weights that correspond to sample points assigned to x_i. These c_i's are known as the bin counts.
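As an illustration, simple binning (nearest bin center, accumulating weights) can be sketched in Python; the helper name is illustrative, not part of PROC KDE:

```python
def simple_bin(sample, weights, grid):
    """Assign each (X_i, W_i) to the nearest of the equally spaced bin
    centers in `grid`, accumulating weights into bin counts c_1, ..., c_g."""
    lo = grid[0]
    delta = grid[1] - grid[0]             # grid increment
    counts = [0.0] * len(grid)
    for x, w in zip(sample, weights):
        j = int(round((x - lo) / delta))  # index of the nearest bin center
        j = max(0, min(len(grid) - 1, j)) # clamp points outside the grid
        counts[j] += w
    return counts

grid = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = simple_bin([0.2, 0.9, 1.1, 2.6, 3.9], [1, 1, 1, 2, 1], grid)
# counts holds the total weight assigned to each bin center
```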

This procedure replaces the data (X_i, W_i), i = 1, 2, ..., n with the smaller set (x_i, c_i), i = 1, 2, ..., g, and the estimation is carried out with this new data. This is so-called simple binning, versus the finer linear binning described in Wand (1994). PROC KDE uses simple binning for the sake of faster and easier implementation. Also, it is assumed that the bin centers x_1, x_2, ..., x_g are equally spaced and in increasing order. In addition, assume for notational convenience that \sum_{i=1}^{n} W_i = n and, therefore, that \sum_{i=1}^{g} c_i = n.

If you replace the data (X_i, W_i), i = 1, 2, ..., n with (x_i, c_i), i = 1, 2, ..., g, the weighted estimator then becomes

\[
\hat{f}(x) = \frac{1}{n} \sum_{j=1}^{g} c_j \, \varphi_h(x - x_j)
\]

with the same notation as used previously. To evaluate this estimator at the g points of the same grid vector grid = (x_1, x_2, ..., x_g)' is to calculate

\[
\hat{f}(x_i) = \frac{1}{n} \sum_{j=1}^{g} c_j \, \varphi_h(x_i - x_j)
\]

for i = 1, 2, ..., g. This can be rewritten as

\[
\hat{f}(x_i) = \frac{1}{n} \sum_{j=1}^{g} c_j \, \varphi_h\big( (i - j)\,\delta \big)
\]

where δ = x_2 − x_1 is the increment of the grid.

The same idea of binning works similarly with bivariate data, where you estimate \hat{f} over the grid matrix grid = grid_X × grid_Y as follows.

\[
\mathrm{grid} = \begin{bmatrix}
\mathbf{x}_{1,1} & \cdots & \mathbf{x}_{1,g_Y} \\
\vdots & & \vdots \\
\mathbf{x}_{g_X,1} & \cdots & \mathbf{x}_{g_X,g_Y}
\end{bmatrix}
\]

where \mathbf{x}_{i,j} = (x_i, y_j), i = 1, 2, ..., g_X, j = 1, 2, ..., g_Y, and the estimates are

\[
\hat{f}(\mathbf{x}_{i,j}) = \frac{1}{n} \sum_{k=1}^{g_X} \sum_{l=1}^{g_Y} c_{k,l} \, \varphi_h\big( (i - k)\,\delta_X,\; (j - l)\,\delta_Y \big)
\]

where δ_X = x_2 − x_1 and δ_Y = y_2 − y_1 are the increments of the grid.

Convolutions

The formulas for the binned estimator in the previous subsection are in the form of a convolution product between two matrices, one of which contains the bin counts and the other of which contains the rescaled kernels evaluated at multiples of grid increments. This section defines these two matrices explicitly and shows that \hat{f} is their convolution.

Beginning with the weighted univariate case, define the following matrices:

\[
K = \left[ \kappa_0,\ \kappa_1,\ \ldots,\ \kappa_{g-1} \right]' \qquad \text{and} \qquad
C = \left[ c_1,\ c_2,\ \ldots,\ c_g \right]'
\]

where \kappa_i = \frac{1}{n} \varphi_h(i\delta), i = 0, 1, \ldots, g - 1.

The first thing to note is that many terms in K are negligible. The term \varphi_h(i\delta) is taken to be 0 when i\delta / h \ge 5, so you can define

\[
l = \min\big( g - 1,\ \mathrm{floor}(5h/\delta) \big)
\]

as the maximum integer multiple of the grid increment to get nonzero evaluations of the rescaled kernel. Here floor( x ) denotes the largest integer less than or equal to x .

Next, let p be the smallest power of 2 that is greater than g + l + 1,

\[
p = 2^{\,\mathrm{ceil}\left( \log_2 (g + l + 1) \right)}
\]

where ceil( x ) denotes the smallest integer greater than or equal to x .

Modify K as follows:

\[
K = \left[ \kappa_0,\ \kappa_1,\ \ldots,\ \kappa_l,\ 0,\ \ldots,\ 0,\ \kappa_l,\ \kappa_{l-1},\ \ldots,\ \kappa_1 \right]'
\]

Essentially, the negligible terms of K are omitted, and the rest are symmetrized (except for one term). The whole matrix is then padded to size p × 1 with zeros in the middle. The dimension p is a highly composite number, that is, one that decomposes into many factors, leading to the most efficient fast Fourier transform operation (refer to Wand 1994).

The third operation is to pad the bin count matrix C with zeros to the same size as K:

\[
C = \left[ c_1,\ c_2,\ \ldots,\ c_g,\ 0,\ \ldots,\ 0 \right]'
\]

The convolution K * C is then a p × 1 matrix, and the preceding formulas show that its first g entries are exactly the estimates \hat{f}(x_i), i = 1, 2, ..., g.
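This construction can be checked numerically with a small pure-Python sketch (a direct circular convolution stands in for the FFT-based one; all names are illustrative):

```python
import math

def phi_h(x, h):
    """Normal density rescaled by the bandwidth h."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def circular_convolve(a, b):
    """Direct circular convolution of two length-p vectors."""
    p = len(a)
    return [sum(a[k] * b[(m - k) % p] for k in range(p)) for m in range(p)]

# toy binned data: g bin centers with increment delta, bin counts summing to n
g, delta, h, n = 6, 0.5, 0.4, 10.0
counts = [2.0, 3.0, 1.0, 2.0, 1.0, 1.0]

l = min(g - 1, math.floor(5 * h / delta))   # last non-negligible kernel lag
p = 2 ** math.ceil(math.log2(g + l + 1))    # smallest suitable power of 2

kappa = [phi_h(i * delta, h) / n for i in range(l + 1)]
K = kappa + [0.0] * (p - 2 * l - 1) + kappa[1:][::-1]  # symmetrized, zero-padded
C = counts + [0.0] * (p - g)                           # zero-padded bin counts

fhat = circular_convolve(K, C)[:g]          # first g entries: the estimates

# agreement with direct evaluation of the binned estimator
direct = [sum(counts[j] * phi_h((i - j) * delta, h) for j in range(g)) / n
          for i in range(g)]
```

The first g entries of the convolution agree with direct evaluation up to the (negligible) kernel terms truncated at lag l.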

For bivariate smoothing, the matrix K is defined similarly as the p_X × p_Y block matrix

\[
K = \begin{bmatrix}
K_{11} & \mathbf{0} & K_{12} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} \\
K_{21} & \mathbf{0} & K_{22}
\end{bmatrix}
\]

where K_{11} is the (l_X + 1) × (l_Y + 1) block with (i + 1, j + 1) entry \kappa_{i,j}; K_{12}, K_{21}, and K_{22} are its mirror images in the Y direction, the X direction, and both directions, respectively (each mirror image omits the zero-lag column or row, as in the univariate case); and the zero blocks pad K to size p_X × p_Y. Here l_X = min(g_X − 1, floor(5h_X/δ_X)), l_Y = min(g_Y − 1, floor(5h_Y/δ_Y)), p_X = 2^{ceil(log_2(g_X + l_X + 1))}, p_Y = 2^{ceil(log_2(g_Y + l_Y + 1))}, and \kappa_{i,j} = \frac{1}{n} \varphi_h(i\delta_X, j\delta_Y), i = 0, 1, ..., l_X, j = 0, 1, ..., l_Y.

The bin count matrix C is defined as

\[
C = \begin{bmatrix}
c_{1,1} & \cdots & c_{1,g_Y} & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
c_{g_X,1} & \cdots & c_{g_X,g_Y} & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}
\]

of size p_X × p_Y, with the bin counts in the upper-left g_X × g_Y corner and zeros elsewhere.

As with the univariate case, the g_X × g_Y upper-left corner of the convolution K * C is the matrix of the estimates \hat{f}(grid).

Most of the results in this subsection are found in Wand (1994).

Fast Fourier Transform

As shown in the last subsection, kernel density estimates can be expressed as a submatrix of a certain convolution. The fast Fourier transform (FFT) is a computationally effective method for computing such convolutions. For a reference on this material, refer to Press et al. (1988).

The discrete Fourier transform of a complex vector z = (z_0, ..., z_{N−1}) is the vector Z = (Z_0, ..., Z_{N−1}), where

\[
Z_j = \sum_{l=0}^{N-1} z_l \, e^{2\pi i l j / N}, \qquad j = 0, 1, \ldots, N - 1
\]

and i is the square root of −1. The vector z can be recovered from Z by applying the inverse discrete Fourier transform formula

\[
z_l = \frac{1}{N} \sum_{j=0}^{N-1} Z_j \, e^{-2\pi i l j / N}, \qquad l = 0, 1, \ldots, N - 1
\]

Discrete Fourier transforms and their inverses can be computed quickly using the FFT algorithm, especially when N is highly composite; that is, when it can be decomposed into many factors, such as a power of 2. By the Discrete Convolution Theorem, the convolution of two vectors is the inverse Fourier transform of the element-by-element product of their Fourier transforms. This, however, requires certain periodicity assumptions, which explains why the vectors K and C require zero-padding: the padding avoids wrap-around effects (refer to Press et al. 1988, pp. 410-411). The vector K is actually mirror-imaged so that the convolution of C and K will be the vector of binned estimates. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the first g elements of S are the estimates.
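The theorem can be verified on a small example with a naive O(N^2) transform (for exposition only; a real implementation would use an FFT):

```python
import cmath

def dft(z):
    """Discrete Fourier transform: Z_j = sum_l z_l * exp(2*pi*i*l*j/N)."""
    N = len(z)
    return [sum(z[l] * cmath.exp(2j * cmath.pi * l * j / N) for l in range(N))
            for j in range(N)]

def idft(Z):
    """Inverse transform: z_l = (1/N) * sum_j Z_j * exp(-2*pi*i*l*j/N)."""
    N = len(Z)
    return [sum(Z[j] * cmath.exp(-2j * cmath.pi * l * j / N) for j in range(N)) / N
            for l in range(N)]

# Discrete Convolution Theorem: circular convolution equals the inverse
# transform of the element-by-element product of the transforms.
a = [1.0, 2.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0]
b = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
conv_fft = idft([u * v for u, v in zip(dft(a), dft(b))])
conv_direct = [sum(a[k] * b[(m - k) % len(a)] for k in range(len(a)))
               for m in range(len(a))]
```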

The bivariate Fourier transform of an N_1 × N_2 complex matrix having (l_1 + 1, l_2 + 1) entry equal to z_{l_1 l_2} is the N_1 × N_2 matrix with (j_1 + 1, j_2 + 1) entry given by

\[
Z_{j_1 j_2} = \sum_{l_1=0}^{N_1 - 1} \sum_{l_2=0}^{N_2 - 1} z_{l_1 l_2} \, e^{2\pi i \left( l_1 j_1 / N_1 + l_2 j_2 / N_2 \right)}
\]

and the formula of the inverse is

\[
z_{l_1 l_2} = \frac{1}{N_1 N_2} \sum_{j_1=0}^{N_1 - 1} \sum_{j_2=0}^{N_2 - 1} Z_{j_1 j_2} \, e^{-2\pi i \left( l_1 j_1 / N_1 + l_2 j_2 / N_2 \right)}
\]

The same Discrete Convolution Theorem applies, and zero-padding is needed for the matrices C and K. In the case of K, the matrix is mirror-imaged twice. Thus, if S denotes the inverse Fourier transform of the element-by-element product of the Fourier transforms of K and C, then the upper-left g_X × g_Y corner of S contains the estimates.

Bandwidth Selection

Several different bandwidth selection methods are available in PROC KDE in the univariate case. Following the recommendations of Jones, Marron, and Sheather (1996), the default method follows a plug-in formula of Sheather and Jones.

This method solves the fixed-point equation

\[
h = \left[ \frac{R(\varphi)}{n \, R\!\left( \hat{f}''_{g(h)} \right)} \right]^{1/5}
\]

where R(\varphi) = \int \varphi(x)^2 \, dx and \hat{f}''_{g(h)} is a kernel estimate of the second derivative f'' computed with a pilot bandwidth g(h) that depends on h.

PROC KDE solves this equation by first evaluating it on a grid of values spaced equally on a log scale. The largest two values from this grid that bound a solution are then used as starting values for a bisection algorithm.
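The grid-then-bisection strategy can be sketched generically in Python (the toy objective below merely stands in for the Sheather-Jones fixed-point equation; names are illustrative):

```python
import math

def solve_on_log_grid(obj, h_lo, h_hi, n_grid=21, tol=1e-8):
    """Scan a log-spaced grid of bandwidths for a sign change of obj,
    then refine the bracketing pair by bisection."""
    grid = [h_lo * (h_hi / h_lo) ** (k / (n_grid - 1.0)) for k in range(n_grid)]
    bracket = None
    for a, b in zip(grid, grid[1:]):
        if obj(a) * obj(b) <= 0.0:       # these grid values bound a solution
            bracket = (a, b)
            break
    if bracket is None:
        raise ValueError("no sign change found on the grid")
    lo, hi = bracket
    while hi - lo > tol:                 # plain bisection
        mid = 0.5 * (lo + hi)
        if obj(lo) * obj(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# toy objective with root at h = 0.3
h = solve_on_log_grid(lambda t: math.log(t / 0.3), 0.01, 10.0)
```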

The simple normal reference rule works by assuming that f is Gaussian in the preceding fixed-point equation. This results in

\[
h = \hat{\sigma} \left[ \frac{4}{3n} \right]^{1/5}
\]

where \hat{\sigma} is the sample standard deviation.

Silverman's rule of thumb (Silverman 1986, Section 3.4.2) is computed as

\[
h = 0.9 \, \min\!\left( \hat{\sigma},\ \frac{Q_3 - Q_1}{1.34} \right) n^{-1/5}
\]

where Q 3 and Q 1 are the third and first sample quartiles, respectively.

The oversmoothed bandwidth is computed as

\[
h_{\mathrm{OS}} = 3 \, \hat{\sigma} \left[ \frac{1}{70 \sqrt{\pi} \, n} \right]^{1/5}
\]
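For reference, Silverman's rule and the oversmoothed bandwidth can be computed as follows (a sketch: the quartile interpolation used here and the constant in h_OS are assumptions of this example, and weighted data is not handled):

```python
import math

def sample_sd(x):
    """Sample standard deviation."""
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

def quartiles(x):
    """First and third sample quartiles via linear interpolation
    (exact conventions differ between implementations)."""
    s = sorted(x)
    def q(p):
        idx = p * (len(s) - 1)
        lo = int(idx)
        if lo + 1 >= len(s):
            return s[lo]
        frac = idx - lo
        return s[lo] * (1.0 - frac) + s[lo + 1] * frac
    return q(0.25), q(0.75)

def silverman(x):
    """h = 0.9 * min(sd, (Q3 - Q1) / 1.34) * n^(-1/5)."""
    q1, q3 = quartiles(x)
    return 0.9 * min(sample_sd(x), (q3 - q1) / 1.34) * len(x) ** -0.2

def oversmoothed(x):
    """h_OS = 3 * sd * (1 / (70 * sqrt(pi) * n))^(1/5) (assumed constant)."""
    return 3.0 * sample_sd(x) * (1.0 / (70.0 * math.sqrt(math.pi) * len(x))) ** 0.2
```

By construction h_OS is roughly 1.14 sd n^(-1/5), so it never falls below the Silverman value 0.9 min(sd, IQR/1.34) n^(-1/5) on the same data.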

When you specify a WEIGHT variable, PROC KDE uses weighted versions of Q_3, Q_1, and \hat{\sigma} in the preceding expressions. The weighted quartiles are computed as weighted order statistics, and the weighted variance takes the form

\[
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} W_i \left( X_i - \bar{X}_W \right)^2}{\sum_{i=1}^{n} W_i}
\]

where \bar{X}_W = \left( \sum_{i=1}^{n} W_i X_i \right) \big/ \left( \sum_{i=1}^{n} W_i \right) is the weighted sample mean.

For the bivariate case, Wand and Jones (1993) note that automatic bandwidth selection is both difficult and computationally expensive. Their study of various ways of specifying a bandwidth matrix also shows that using two bandwidths, one in each coordinate's direction, is often adequate. PROC KDE enables you to adjust the two bandwidths by specifying a multiplier for the default bandwidths recommended by Bowman and Foster (1993):

\[
h_X = \hat{\sigma}_X \, n^{-1/6}, \qquad h_Y = \hat{\sigma}_Y \, n^{-1/6}
\]

Here \hat{\sigma}_X and \hat{\sigma}_Y are the sample standard deviations of X and Y, respectively. These are the optimal bandwidths for two independent normal variables that have the same variances as X and Y. They are, therefore, conservative in the sense that they tend to oversmooth the surface.

You can specify the BWM= option to adjust the aforementioned bandwidths to provide the appropriate amount of smoothing for your application.
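Assuming the normal-reference defaults take the form h_X = sd(X) n^(-1/6) and h_Y = sd(Y) n^(-1/6), the effect of a bandwidth multiplier can be sketched as follows (illustrative names, not PROC KDE syntax):

```python
import math

def default_bivariate_bandwidths(x, y, bwm_x=1.0, bwm_y=1.0):
    """Normal-reference bandwidths h = sd * n^(-1/6) in each coordinate,
    each scaled by a BWM=-style multiplier (assumed form)."""
    n = len(x)
    def sd(v):
        m = sum(v) / n
        return math.sqrt(sum((u - m) ** 2 for u in v) / (n - 1))
    scale = n ** (-1.0 / 6.0)
    return bwm_x * sd(x) * scale, bwm_y * sd(y) * scale

h_x, h_y = default_bivariate_bandwidths([1.0, 2.0, 3.0, 4.0],
                                        [2.0, 4.0, 6.0, 8.0])
```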

ODS Table Names

PROC KDE assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 36.1: ODS Tables Produced in PROC KDE

ODS Table Name         Description                   Statement        Option
BivariateStatistics    Bivariate statistics          BIVAR            BIVSTATS
Controls               Control variables             default
Inputs                 Input information             default
Levels                 Levels of density estimate    BIVAR            LEVELS
Percentiles            Percentiles of data           BIVAR / UNIVAR   PERCENTILES
UnivariateStatistics   Basic statistics              BIVAR / UNIVAR   UNISTATS

ODS Graphics (Experimental)

This section describes the use of ODS for creating graphics with the KDE procedure. These graphics are experimental in this release, meaning that both the graphical results and the syntax for specifying them are subject to change in a future release.

To request these graphs, you must specify the ODS GRAPHICS statement in addition to the following options. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

Bivariate Plots

You can specify the PLOTS= option in the BIVAR statement to request graphical displays of bivariate kernel density estimates.

PLOTS= option1 < option2 >

  • requests one or more plots of the bivariate kernel density estimate. The following table shows the available plot options.

    Option           Plot Description
    ALL              all available displays
    CONTOUR          contour plot of bivariate density estimate
    CONTOURSCATTER   contour plot of bivariate density estimate overlaid with scatter plot of data
    HISTOGRAM        bivariate histogram of data
    HISTSURFACE      bivariate histogram overlaid with bivariate kernel density estimate
    SCATTER          scatter plot of data
    SURFACE          surface plot of bivariate kernel density estimate

    By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the BIVAR statement creates a contour plot. If you specify the PLOTS= option, you get only the requested plots.

Univariate Plots

You can specify the PLOTS= option in the UNIVAR statement to request graphical displays of univariate kernel density estimates.

PLOTS= option1 < option2 >

  • requests one or more plots of the univariate kernel density estimate. The following table shows the available plot options.

    Option        Plot Description
    DENSITY       univariate kernel density estimate curve
    HISTDENSITY   univariate histogram of data overlaid with kernel density estimate curve
    HISTOGRAM     univariate histogram of data

    By default, if you enable ODS graphics and you do not specify the PLOTS= option, then the UNIVAR statement creates a histogram overlaid with a kernel density estimate. If you specify the PLOTS= option, you get only the requested plots.

ODS Graph Names

PROC KDE assigns a name to each graph it creates using the Output Delivery System (ODS). You can use these names to reference the graphs when using ODS. The names are listed in Table 36.2.

Table 36.2: ODS Graphics Produced by PROC KDE

ODS Graph Name       Plot Description                                                                      Statement   PLOTS= Option
BivariateHistogram   Bivariate histogram of data                                                           BIVAR       HISTOGRAM
Contour              Contour plot of bivariate kernel density estimate                                     BIVAR       CONTOUR
ContourScatter       Contour plot of bivariate kernel density estimate overlaid with scatter plot          BIVAR       CONTOURSCATTER
Density              Univariate kernel density estimate curve                                              UNIVAR      DENSITY
HistDensity          Univariate histogram overlaid with kernel density estimate curve                      UNIVAR      HISTDENSITY
Histogram            Univariate histogram of data                                                          UNIVAR      HISTOGRAM
HistSurface          Bivariate histogram overlaid with surface plot of bivariate kernel density estimate   BIVAR       HISTSURFACE
ScatterPlot          Scatter plot of data                                                                  BIVAR       SCATTER
SurfacePlot          Surface plot of bivariate kernel density estimate                                     BIVAR       SURFACE

To request these graphs you must specify the ODS GRAPHICS statement in addition to the options indicated in Table 36.2. For more information on the ODS GRAPHICS statement, see Chapter 15, Statistical Graphics Using ODS.

Binning of Bivariate Histogram

Let (X_i, Y_i), i = 1, 2, ..., n be a sample of size n drawn from a bivariate distribution. For the marginal distribution of X_i, i = 1, 2, ..., n, the number of bins (Nbins_X) in the bivariate histogram is calculated according to the formula

\[
\mathrm{Nbins}_X = \mathrm{ceil}\!\left( \frac{\mathrm{range}_X}{\hat{h}_X} \right)
\]

where ceil( x ) denotes the smallest integer greater than or equal to x ,

\[
\mathrm{range}_X = \max_{1 \le i \le n} X_i - \min_{1 \le i \le n} X_i
\]

and the optimal bin width is obtained, following Scott (1992, p. 84), as

\[
\hat{h}_X = 3.504 \, \hat{\sigma}_X \left( 1 - \hat{\rho}^2 \right)^{3/8} n^{-1/4}
\]

Here, \hat{\sigma}_X^2 and \hat{\rho} are the sample variance and the sample correlation coefficient, respectively. When you specify a WEIGHT variable, PROC KDE uses weighted versions of \hat{\sigma}_X and \hat{\rho} in the preceding expressions.

Similar formulas are used to compute the number of bins for the marginal distribution of Y_i, i = 1, 2, ..., n. Further details can be found in Scott (1992).

Notice that if \hat{\rho} > 0.99, then Nbins_X is calculated as in the univariate case (see Terrell and Scott 1985). In this case, Nbins_Y = Nbins_X.
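Under the reconstruction above (Scott's normal-reference bin width h_X = 3.504 sd(X) (1 − rho^2)^(3/8) n^(-1/4); treat the constant and the range-based bin count as assumptions of this sketch), the number of X bins can be computed as:

```python
import math

def corr(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def nbins_x(x, y):
    """Number of X bins: ceil(range / h) with Scott's bivariate
    normal-reference bin width. Assumes |rho| < 1 (the document switches
    to the univariate rule when rho exceeds 0.99)."""
    n = len(x)
    mx = sum(x) / n
    sd = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    rho = corr(x, y)
    h = 3.504 * sd * (1.0 - rho ** 2) ** 0.375 * n ** -0.25
    return math.ceil((max(x) - min(x)) / h)
```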




SAS/STAT 9.1 User's Guide, Volume 3
SAS Institute, 2004
